Using LVM’s new cache feature

If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs as fast persistent caches to speed up access to the hard disks, then until recently you had three choices: bcache and dm-cache, which are both upstream, or Flashcache/EnhanceIO, which is not. dm-cache required you to first sit down with a calculator to compute block offsets. bcache was the sanest of the three choices.

But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.

The Set-up

To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:

[Photo: 20140522_101903]

There are two 2 TB WD hard disks in a mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.

In other news: wow, SSDs from brand manufacturers are getting really cheap now!

In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:

# lsblk
NAME                                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                        8:0    0   1.8T  0 disk  
└─sda1                                     8:1    0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdb                                        8:16   0   1.8T  0 disk  
└─sdb1                                     8:17   0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdc                                        8:32   0 232.9G  0 disk  
└─sdc1                                     8:33   0 232.9G  0 part  

Performance

Before starting to set up the caching layer, let’s find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (i.e. they were measured on files on a filesystem, not on the raw block devices). I also used O_DIRECT.

HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec

Note these numbers don’t show the real benefit of SSDs — namely that performance doesn’t collapse as soon as you randomly access the disk.
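
For reference, a sequential test of this kind could look roughly like the sketch below. The file path and the 1 GiB size are assumptions, not the exact commands behind the figures above. The `run` wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set. oflag=direct and iflag=direct make dd open the file with O_DIRECT.

```shell
# Sketch only: the file path and size are assumed, not taken from the post.
# Commands are printed by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

seq_bench() {
    file="$1"
    count_mb="${2:-1024}"
    # Sequential write of count_mb MiB, bypassing the page cache (O_DIRECT).
    run dd if=/dev/zero of="$file" bs=1M count="$count_mb" oflag=direct
    # Sequential read of the same file, again with O_DIRECT.
    run dd if="$file" of=/dev/null bs=1M iflag=direct
}

seq_bench /mnt/test/bench.img 1024
```

When really run, dd reports the transfer rate on its last line of output.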

Terminology

The lvmcache(7) [so new there is no copy online yet] documentation defines various terms that I will use:

origin LV           OriginLV      large slow LV
cache data LV       CacheDataLV   small fast LV for cache pool data
cache metadata LV   CacheMetaLV   small fast LV for cache pool metadata
cache pool LV       CachePoolLV   CacheDataLV + CacheMetaLV
cache LV            CacheLV       OriginLV + CachePoolLV

Creating the LVs

Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:

# lvcreate -L 100G -n testoriginlv vg_guests
  Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv

Also note that resizing cached LVs is not currently supported (coming later — for now you can work around it by removing the cache, resizing, then recreating the cache).
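
The remove, resize, recreate workaround might be sketched as follows. This is an outline under assumptions, not a tested procedure: it uses the LV names from this article, an arbitrary +50G growth, and it assumes that removing the cache pool LV writes back any dirty blocks and leaves the origin LV uncached. The `run` wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch of the remove / resize / recreate workaround (names and sizes assumed).
# Commands are printed by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

resize_cached_lv() {
    vg="$1"; origin="$2"; grow="$3"; fast_pv="$4"
    # 1. Drop the cache: the pool LV is removed, the origin LV stays.
    run lvremove "$vg/lv_cache"
    # 2. Grow the now-uncached origin LV, then the ext4 filesystem on it.
    run lvextend -L "+$grow" "$vg/$origin"
    run resize2fs "/dev/$vg/$origin"
    # 3. Recreate the cache pool on the fast PV and attach it again.
    run lvcreate -L 1G -n lv_cache_meta "$vg" "$fast_pv"
    run lvcreate -L 229G -n lv_cache "$vg" "$fast_pv"
    run lvconvert --type cache-pool --poolmetadata "$vg/lv_cache_meta" "$vg/lv_cache"
    run lvconvert --type cache --cachepool "$vg/lv_cache" "$vg/$origin"
}

resize_cached_lv vg_guests testoriginlv 50G /dev/sdc1
```

Since the documentation warns that removing the wrong LV can destroy the OriginLV, it seems wise to read the printed commands before running anything for real.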

Creating the cache layer

What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks — it simply doesn’t work otherwise.

Therefore my first step is to extend my existing VG to include the fast disk:

# vgextend vg_guests /dev/sdc1
  Volume group "vg_guests" successfully extended

I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV, which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, but a minimum of 8MB. Since my total available fast space is 232GB, and I want a 1000:1 split, I choose a generous 1GB for CacheMetaLV, 229GB for CacheDataLV, and that will leave some leftover space (my eventual split turns out to be 229:1).

# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
  Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
  Logical volume "lv_cache" created
# lvs
  LV                     VG        Attr       LSize
  lv_cache               vg_guests -wi-a----- 229.00g
  lv_cache_meta          vg_guests -wi-a-----   1.00g
  testoriginlv           vg_guests -wi-a----- 100.00g
# pvs
  PV         VG        Fmt  Attr PSize   PFree  
  /dev/md127 vg_guests lvm2 a--    1.82t 932.89g
  /dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g

(You’ll notice that my cache is bigger than my test OriginLV, but that’s fine as once I’ve worked out all the gotchas, my real OriginLV will be over 1 TB).
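
To put the sizing rule quoted above into numbers, here is a small sketch that computes the suggested CacheMetaLV size for a given CacheDataLV size. The function name cache_meta_mib is made up for illustration, and rounding up to a whole MiB is an assumption. The 1/1000 ratio and the 8MB floor are the ones from the documentation.

```shell
# Suggested CacheMetaLV size in MiB for a CacheDataLV size given in MiB:
# 1/1000 of the data size (rounded up), but never less than 8 MiB.
cache_meta_mib() {
    data_mib="$1"
    meta=$(( (data_mib + 999) / 1000 ))
    if [ "$meta" -lt 8 ]; then
        meta=8
    fi
    echo "$meta"
}

cache_meta_mib $((229 * 1024))   # 229 GiB data LV, prints 235
```

So for a 229GB CacheDataLV the rule asks for roughly 235MB of metadata, and the 1GB CacheMetaLV used here is about four times that.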

Why did I leave 2.88GB of free space in the PV? I’m not sure actually. However the first time I did this, I didn’t leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex’s comment below.

Convert the CacheDataLV and CacheMetaLV into a “cache pool”:

# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
  Logical volume "lvol0" created
  Converted vg_guests/lv_cache to cache pool.

Now attach the cache pool to the OriginLV to create the final cache object:

# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
  vg_guests/testoriginlv is now cached.
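
One way to check what the two lvconvert steps produced is to list the hidden LVs and look at the device-mapper view. This is a sketch: the `run` wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set, and the exact hidden LV names you see may differ between LVM versions.

```shell
# Sketch: inspect the cached LV after conversion.
# Commands are printed by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

inspect_cache() {
    vg="$1"
    # -a also lists hidden LVs (shown in [brackets]); +devices adds the PVs.
    run lvs -a -o +devices "$vg"
    # Kernel-level view: one status line per device-mapper device.
    run dmsetup status
}

inspect_cache vg_guests
```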

Benchmark

Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:

LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec

Which is exactly the same as the backing hard disk.

Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn’t a useful test of dm-cache.
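
A test more likely to exercise the cache would presumably use repeated random I/O over a working set that fits in the cache, rather than one sequential pass. The sketch below uses fio for that. fio was not used in this post, and the file path, the 4G working set and the other parameters are all assumptions. The `run` wrapper is a hypothetical helper that only prints the command unless DRY_RUN=0 is set.

```shell
# Sketch: random-read test with fio (tool choice and parameters are assumed).
# The command is printed by default; set DRY_RUN=0 to really run it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

rand_read_bench() {
    file="$1"
    # 4k random reads with O_DIRECT over a 4G file for 60 seconds.
    run fio --name=randread --filename="$file" --rw=randread --bs=4k \
        --size=4G --direct=1 --ioengine=libaio --iodepth=16 \
        --runtime=60 --time_based
}

rand_read_bench /mnt/test/fio.img
```

Running it more than once and comparing the runs would show whether frequently read blocks end up being served from the SSD.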

What I’m going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).


39 responses to “Using LVM’s new cache feature”

  1. Since the cache disks need to be in the same VG, how do you ensure that LVM is not using space from the SSD when you allocate space for the original LV (before you allocate space for the cache)?

    • rich

      Indeed. It turns out (I did not know this before) that you can place LVs on particular PVs when you create them:

      lvcreate -L 10G -n slow vg_guests /dev/md127
      
      • And if you’re trying to do this for an existing volume, then maybe you could convert to a raid type first and use lvconvert --replace to migrate off the undesired physical volume?

      • For moving an LV, you can use pvmove with the -n option:

        pvmove -n lv_to_move /dev/from_pv /dev/to_pv

  2. This is highly interesting, thank you for bringing it to my attention! LVM is certainly evolving.

    Could it also be done with a raid1 of 2 SSDs as cache? To guard against write errors to the cache. Without some sort of cache redundancy, I’d be reluctant to deploy this in a production environment with --cachemode writeback ….

    I’m a little worried about the failure rate of SSDs in a setup with high write ratios, and the best place to deploy SSDs in my opinion is where we have lots of random writes – such as database tables (indexes should always be contained in memory, otherwise buy more memory…).

    I see that lvcreate now also has a --mirrors option, but I can’t see how to ensure that the two copies are on different PVs.

    BR Bent

    • rich

      Yes, you can make the CacheDataLV and CacheMetaLV redundant by doing this (taken from lvmcache(7)):

      1. Create a 2-way RAID1 cache data LV
      # lvcreate --type raid1 -m 1 -L 1G -n lvx_cache vg \
                  /dev/fast1 /dev/fast2
      
      2. Create a 2-way RAID1 cache metadata LV
      # lvcreate --type raid1 -m 1 -L 8M -n lvx_cache_meta vg \
                  /dev/fast1 /dev/fast2
      

      Another way would be to use software RAID to create a mirrored pair of SSDs which would be used as the fast PV.

      I don’t know which, if either, is preferable.

      • Cool! I think the preferred way would be the one that allows me to replace broken disks online, whether they’re holding cache volumes or regular data volumes :-)

      • rich

        I imagine both must let you do hot disk replacement. However I’ve only ever used Linux md for that. I’ve never used device-mapper’s mirroring capability at all.

      • Both allow hot disk replacement. lvm.conf has the raid_fault_policy setting; if set to ‘remove’ it drops the failed mirror, and you can use lvchange --mirrors to add another one. The cool case is if it’s set to ‘allocate’ – it will then create a new mirror right when the old one fails, as long as there’s enough free space on a disk that doesn’t already contain a mirror. This requires dmeventd be running, but lvm starts that internally on its own.

        As far as “ensuring the copies are on different PVs” that’s not a concern. It’s handled by the allocation policy of the LV – there are several, but ‘cling’ is the most generally useful and (IIRC) the default. The _only_ one that allows mirrors to be on the same drive is “anywhere” which is emphatically not the default. LVs default to a value of ‘inherit’, which takes the VG default; you can set the VG default with `vgchange --alloc`.

      • Agh, I messed up. You add mirrors with lvconvert --mirrors, not lvchange.

  3. Cyberax

    Are the ‘cache’ partitions added to the total pool?

    I.e. if I have a 1 TB ‘slow’ pool and 300 GB of ‘fast’ SSD then would the total size be 1.3 TB or 1 TB?

    • rich

      It would be 1T.

      The cache layer sits on top of the OriginLV, and copies blocks from the OriginLV to make them accessible faster.

      If you remove the cache layer, then LVM writes back the changed data to the OriginLV and you’re left with just the OriginLV. Therefore it’s pretty obvious that this wouldn’t work if the size of the LV increased.

  4. From what version of LVM is that cache feature available?

    • rich

      It was added in Feb 2014, so pretty recently. Not sure about version numbers, but I did the test using

      lvm2-2.02.106-3.fc21.x86_64
      device-mapper-1.02.85-3.fc21.x86_64
      kernel-3.14.3-200.fc20.x86_64
      
  5. Dan

    Can a cache only be used for a single origin LV? Do I have to manually divide up my ssd into multiple cache LVs if I want multiple origin LVs to be cached?

    • rich

      AFAIK (and I’m not an expert) it appears that a cache can only apply to a single origin LV, and that if you wanted to cache multiple origin LVs, then you would need multiple caches. However you might want to ask on the linux-lvm mailing list for confirmation from the experts.

  6. lol

    compared to zfs this is joke

    • rich

      All Oracle have to do is relicense ZFS and it could be included in upstream Linux too. Until then it’s irrelevant.

    • Ian Macintosh

      How so? From what you’re saying it’s not entirely clear whether you think zfs is the joke or dm/lvm is the joke, or what the ‘joke’ is either.

  7. Chris Bennett

    Enjoyed your multiple articles on building the cluster and now storage tiering. Check out btier (http://www.lessfs.com/wordpress/) as well, which is less a cache and more a tiered storage design with blocks moving between tiers, so you get the sum of space available for use.

  8. Just FYI, the reason it needed the additional gigabyte (and why it created ‘lvol0’) is because of a failure-recovery feature.

    You see, both thin pools and cache pools have metadata that, in extraordinary circumstances, may break and need to be – essentially – fsck’d. Using the thin_check/thin_repair/cache_check/cache_repair tools, you can do that – but only a total chump would do that in-place; just like how all the cool kids snapshot their filesystem before they run fsck :P

    So what LVM does is, on creation of either type of pool, allocate a new hidden LV – you can see it in `lvs -a`; it’s listed as [lvol0_pmspare] (the brackets denoting that it’s hidden, although the actual name is just lvol0_pmspare). This is sized to be the same size as the _largest_ pool metadata LV in the VG, and there’s only one lvol0_pmspare per VG. So if you had a thin pool with 5GB of metadata and a cache pool with 3GB of metadata, you’d have a single 5GB lvol0_pmspare.

    When something goes wrong, and you repair the metadata, the repaired result is written to lvol0_pmspare, and you can then verify that the result makes sense and apply it.

  9. johnsimcall

    Rich, I’d like to try this out. How did you go about getting the RPMs installed? Are you using Fedora 20, or an alpha release?

    • rich

      You can use the packages in F20 or Rawhide. They are effectively at the same upstream version, so it doesn’t appear to matter which you use.

      To enable Rawhide, install the fedora-release-rawhide package and install what you need by doing:

      yum --enablerepo=rawhide install lvm2 device-mapper
      
  10. OERNii

    Thank you so much for this. I’ve been searching for docs on dm-cache for a long time.

    I’ve got it running on RHEL7rc btw.

    • OERNii

      One other thing: I created a snapshot on vg_guests/testoriginlv and the whole cache somehow collapsed.

      • rich

        That doesn’t surprise me much, but you may want to file a bug about it so the LVM developers can add some code to disallow this case.

      • It also seems to collapse on creating a snapshot on the cached lvm. That is: about 2 hours after creating and removing a snapshot on cached var the system started to generate massive amounts of funky errors and would not boot afterwards. All of this on CentOS 7, not good this! :(

        Thankfully with the help of cache_repair and friends the damage was minimal, but there was damage.

      • rich

        In case it’s not clear, I think the feature is far from ready for prime-time, especially on CentOS.

      • I know, now :)

        But on the other hand, I just switched to rsync for var. Works for me. And from what I’ve seen, backing up var (lots and lots of tiny files) went from 7+ hours the first time to just about 4 hours the second time, with noticeably less I/O wait. That is better than flashcache, which seems a bit more mature.

        FYI all of this is just on my hobby server in my spare time :) @work we will soon switch to SSD only in RAID1.

  11. Pingback: Cluster performance: baseline testing | Richard WM Jones

  12. Giorgos

    I created my cache pool with --cachemode writethrough, but according to dmsetup status, it is running in writeback mode. Is it supposed to work like that?

    • rich

      I’ve no idea. However here is my dmsetup status:

      vg_guests-lv_cache_cdata: 0 419430400 linear 
      vg_guests-lv_cache_cmeta: 0 2097152 linear 
      vg_guests-libvirt--images: 0 1677721600 linear 
      vg_guests-testorigin: 0 1677721600 cache 8 20301/262144 64 151/6553600 215 524414 66 522009 0 0 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8
      vg_guests-testorigin_corig: 0 1677721600 linear 
      

      What does writeback 2 mean?
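
Reading the status line above against the dm-cache status layout in the kernel's device-mapper documentation, “1 writeback” looks like a feature count followed by the feature name, and the “2” after it like the number of core arguments that follow (migration_threshold 2048). The awk sketch below labels some of the fields on that basis. The field order is an assumption that matches the line shown here, and other kernel versions may print a different layout. The function name parse_cache_status is made up for illustration.

```shell
# Sketch: label selected fields of a dm-cache line from `dmsetup status`.
# Assumed layout after the word "cache":
#   metadata block size, used/total metadata blocks, cache block size,
#   used/total cache blocks, read hits, read misses, write hits,
#   write misses, demotions, promotions, dirty, #features, features...,
#   #core args, core args..., policy name, #policy args, policy args...
parse_cache_status() {
    awk '$4 == "cache" {
        nfeat = $16
        feats = ""
        for (i = 17; i < 17 + nfeat; i++)
            feats = (feats == "" ? $i : feats "," $i)
        ncore = $(17 + nfeat)
        policy = $(18 + nfeat + ncore)
        printf "read_hits=%s read_misses=%s write_hits=%s write_misses=%s\n",
               $9, $10, $11, $12
        printf "dirty=%s features=%s core_args=%s policy=%s\n",
               $15, feats, ncore, policy
    }'
}

# Typical use (needs root): dmsetup status | parse_cache_status
```

For the vg_guests-testorigin line above this reports 215 read hits against 524414 read misses, the writeback feature, 2 core arguments and the mq policy.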

  13. Andre

    I am afraid that cache-pool is still not available on CentOS 6. Could you confirm that? Which distro are you using?

  14. Hi,
    I’m doing this on a CentOS 7 machine.
    The cached LV can be created, but it is always set to inactive after reboot, and thus can’t be mounted automatically via /etc/fstab.
    Did you have such problems?

  15. I wrote a small tool for making parts of this process easier: https://github.com/larsks/lvcache

    This permits you to, for example, run “lvcache create myvg/mylv”, and it will create the cache LV (as a configurable percentage of the size of the origin LV), the metadata LV, create the cache pool, and attach it to the origin device.

  16. Andre

    How would you compare this feature to BCACHE which, I believe, is also native for CentOS 7?
