Using LVM’s new cache feature

If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs as fast persistent caches to speed up access to the hard disks, then until recently you had three choices: bcache, dm-cache, or Flashcache/EnhanceIO. bcache and dm-cache are both upstream; Flashcache is not. dm-cache required you to first sit down with a calculator to compute block offsets, so bcache was the sanest of the three choices.

But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.

The Set-up

To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:

[Photo: the three new disks added to the virtualization cluster]

There are two 2 TB WD hard disks in mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.

In other news: wow, SSDs from brand manufacturers are getting really cheap now!

In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:

# lsblk
NAME                                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                        8:0    0   1.8T  0 disk  
└─sda1                                     8:1    0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdb                                        8:16   0   1.8T  0 disk  
└─sdb1                                     8:17   0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdc                                        8:32   0 232.9G  0 disk  
└─sdc1                                     8:33   0 232.9G  0 part  

Performance

Before starting to set up the caching layer, let's find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (i.e. they are measured on files in a filesystem, not on the raw block devices). I also used O_DIRECT.

HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec

Note these numbers don’t show the real benefit of SSDs — namely that performance doesn’t collapse as soon as you randomly access the disk.
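
The post doesn't show the exact benchmark commands, but a large sequential dd through the filesystem with O_DIRECT is the kind of test described; as a rough sketch (the mount point and sizes here are illustrative only):

# dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=10240 oflag=direct   # sequential write
# dd if=/mnt/test/bigfile of=/dev/null bs=1M iflag=direct               # sequential read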

Terminology

The lvmcache(7) documentation [so new that there is no copy online yet] defines various terms that I will use:

origin LV           OriginLV      large slow LV
cache data LV       CacheDataLV   small fast LV for cache pool data
cache metadata LV   CacheMetaLV   small fast LV for cache pool metadata
cache pool LV       CachePoolLV   CacheDataLV + CacheMetaLV
cache LV            CacheLV       OriginLV + CachePoolLV

Creating the LVs

Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:

# lvcreate -L 100G -n testoriginlv vg_guests
  Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv

Also note that resizing cached LVs is not currently supported (coming later — for now you can work around it by removing the cache, resizing, then recreating the cache).
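
A rough sketch of that workaround, assuming (as discussed in the comments below) that removing the cache pool LV flushes it and leaves the origin intact; the size is made up and the LV names are the ones created later in this post:

# lvremove vg_guests/lv_cache               # flush the cache and detach it from the origin
# lvresize -L +50G vg_guests/testoriginlv   # resize the now-uncached origin LV
# resize2fs /dev/vg_guests/testoriginlv     # grow the filesystem to match
# ... then recreate lv_cache / lv_cache_meta and repeat the lvconvert steps below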

Creating the cache layer

What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks — it simply doesn’t work otherwise.

Therefore my first step is to extend my existing VG to include the fast disk:

# vgextend vg_guests /dev/sdc1
  Volume group "vg_guests" successfully extended

I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV, which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, with a minimum of 8MB. Since my total available fast space is 232GB and I want a 1000:1 split, I choose a generous 1GB for the CacheMetaLV and 229G for the CacheDataLV, which leaves some space left over (my eventual split turns out to be 229:1).
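
As a quick sanity check of that 1/1000 rule (illustrative arithmetic only, not a command you need to run):

# echo "$((229 * 1024 / 1000)) MiB of metadata needed for 229 GiB of cache data"
234 MiB of metadata needed for 229 GiB of cache data

so 1GB of metadata is comfortably more than the minimum.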

# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
  Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
  Logical volume "lv_cache" created
# lvs
  LV                     VG        Attr       LSize
  lv_cache               vg_guests -wi-a----- 229.00g
  lv_cache_meta          vg_guests -wi-a-----   1.00g
  testoriginlv           vg_guests -wi-a----- 100.00g
# pvs
  PV         VG        Fmt  Attr PSize   PFree  
  /dev/md127 vg_guests lvm2 a--    1.82t 932.89g
  /dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g

(You’ll notice that my cache is bigger than my test OriginLV, but that’s fine as once I’ve worked out all the gotchas, my real OriginLV will be over 1 TB).

Why did I leave 2.88GB of free space in the PV? I’m not sure actually. However the first time I did this, I didn’t leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex’s comment below.

Convert the CacheDataLV and CacheMetaLV into a “cache pool”:

# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
  Logical volume "lvol0" created
  Converted vg_guests/lv_cache to cache pool.

Now attach the cache pool to the OriginLV to create the final cache object:

# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
  vg_guests/testoriginlv is now cached.

Benchmark

Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:

LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec

Which is exactly the same as the backing hard disk.

Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn't a useful test of dm-cache: in short, large sequential reads and writes (which is all dd generates) are deliberately not promoted to the cache, as the sequential_threshold tunable visible in the dmsetup status output later in the comments suggests.
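
One way to see this for yourself is to look at the cache statistics before and after a dd run: with purely sequential I/O the miss counters climb while promotions stay low. A sketch, assuming the device-mapper name follows the usual vg-lv pattern for the LV created above:

# dmsetup status vg_guests-testoriginlv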

What I’m going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).

Comments

  1. Since the cache disks need to be in the same VG, how do you ensure that LVM is not using space from the SSD when you allocate space for the original LV (before you allocate space for the cache)?

    • rich

      Indeed. It turns out (I did not know this before) that you can place LVs on particular PVs when you create them:

      lvcreate -L 10G -n slow vg_guests /dev/md127
      
      • And if you're trying to do this for an existing volume, then maybe you could convert it to a RAID type first and use lvconvert --replace to migrate off the undesired physical volume?

      • For moving an LV, you can use pvmove with the -n option:

        pvmove -n lv_to_move /dev/from_pv /dev/to_pv

  2. This is highly interesting, thank you for bringing it to my attention! LVM is certainly evolving.

    Could this also be done with a RAID1 of two SSDs as the cache, to guard against write errors to the cache? Without some sort of cache redundancy, I'd be reluctant to deploy this in a production environment with --cachemode writeback…

    I’m a little worried about the failure-rate of SSD’s in a setup with high write-ratios, and the best place to deploy SSD’s in my opinion is where we have lots of random writes – such as database tables (indexes should always be contained in memory, otherwise buy more memory…).

    I see that lvcreate now also has a --mirrors option, but I can't see how to ensure that the two copies are on different PVs.

    BR Bent

    • rich

      Yes, you can make the CacheDataLV and CacheMetaLV redundant by doing this (taken from lvmcache(7)):

      1. Create a 2-way RAID1 cache data LV
      # lvcreate --type raid1 -m 1 -L 1G -n lvx_cache vg \
                  /dev/fast1 /dev/fast2
      
      2. Create a 2-way RAID1 cache metadata LV
      # lvcreate --type raid1 -m 1 -L 8M -n lvx_cache_meta vg \
                  /dev/fast1 /dev/fast2
      

      Another way would be to use software RAID to create a mirrored pair of SSDs which would be used as the fast PV.

      I don’t know which, if either, is preferable.

      • Cool! I think the preferred way would be the one that allows me to replace broken disks online, whether they're holding cache volumes or regular data volumes 🙂

      • rich

        I imagine both must let you do hot disk replacement. However I’ve only ever used Linux md for that. I’ve never used device-mapper’s mirroring capability at all.

      • Both allow hot disk replacement. lvm.conf has the raid_fault_policy setting; if set to 'remove' it drops the failed mirror, and you can use lvchange --mirrors to add another one. The cool case is if it's set to 'allocate' – it will then create a new mirror right when the old one fails, as long as there's enough free space on a disk that doesn't already contain a mirror. This requires dmeventd be running, but lvm starts that internally on its own.

        As far as "ensuring the copies are on different PVs" goes, that's not a concern. It's handled by the allocation policy of the LV – there are several, but 'cling' is the most generally useful and (IIRC) the default. The _only_ one that allows mirrors to be on the same drive is "anywhere", which is emphatically not the default. LVs default to a value of 'inherit', which takes the VG default; you can set the VG default with `vgchange --alloc`.

      • Agh, I messed up. You add mirrors with lvconvert --mirrors, not lvchange.
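
        Pulling those replies together, a rough sketch of the manual replacement path (the PV and LV names below are made up for illustration):

        # in /etc/lvm/lvm.conf (activation section): raid_fault_policy = "allocate"   # or "remove"
        vgextend vg /dev/sdd1                               # add the replacement fast PV to the VG
        lvconvert --mirrors +1 vg/some_raid1_lv /dev/sdd1   # re-add a mirror image on the new PV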

  3. Cyberax

    Are the ‘cache’ partitions added to the total pool?

    I.e. if I have a 1TB 'slow' pool and 300GB of 'fast' SSD, would the total size be 1.3TB or 1TB?

    • rich

      It would be 1T.

      The cache layer sits on top of the OriginLV, and copies blocks from the OriginLV to make them accessible faster.

      If you remove the cache layer, then LVM writes back the changed data to the OriginLV and you’re left with just the OriginLV. Therefore it’s pretty obvious that this wouldn’t work if the size of the LV increased.

  4. From what version of LVM is that cache feature available?

    • rich

      It was added in Feb 2014, so pretty recently. Not sure about version numbers, but I did the test using

      lvm2-2.02.106-3.fc21.x86_64
      device-mapper-1.02.85-3.fc21.x86_64
      kernel-3.14.3-200.fc20.x86_64
      
  5. Dan

    Can a cache only be used for a single origin LV? Do I have to manually divide up my ssd into multiple cache LVs if I want multiple origin LVs to be cached?

    • rich

      AFAIK (and I’m not an expert) it appears that a cache can only apply to a single origin LV, and that if you wanted to cache multiple origin LVs, then you would need multiple caches. However you might want to ask on the linux-lvm mailing list for confirmation from the experts.

  6. lol

    Compared to ZFS this is a joke.

  7. Chris Bennett

    Enjoyed your multiple articles on building the cluster and now storage tiering. Check out btier (http://www.lessfs.com/wordpress/) as well, which is less a cache and more a tiered storage design with blocks moving between tiers, so you get the sum of space available for use.

  8. Just FYI, the reason it needed the additional gigabyte (and why it created ‘lvol0’) is because of a failure-recovery feature.

    You see, both thin pools and cache pools have metadata that, in extraordinary circumstances, may break and need to be – essentially – fsck’d. Using the thin_check/thin_repair/cache_check/cache_repair tools, you can do that – but only a total chump would do that in-place; just like how all the cool kids snapshot their filesystem before they run fsck 😛

    So what LVM does is, on creation of either type of pool, allocate a new hidden LV – you can see it in `lvs -a`; it's listed [lvol0_pmspare] (the brackets denoting that it's hidden, although the actual name is just lvol0_pmspare). This is sized to be the same size as the _largest_ pool metadata LV in the VG, and there's only one lvol0_pmspare per VG. So if you had a thin pool with 5GB metadata and a cache pool with 3GB metadata, you'd have a single 5GB lvol0_pmspare.

    When something goes wrong, and you repair the metadata, the repaired result is written to lvol0_pmspare, and you can then verify that the result makes sense and apply it.
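
    For reference, the hidden spare shows up if you ask lvs to list hidden LVs as well; a sketch using the VG from this article:

    # lvs -a vg_guests -o lv_name,lv_size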

  9. johnsimcall

    Rich, I'd like to try this out. How did you go about getting the RPMs installed? Are you using Fedora 20, or an alpha release?

    • rich

      You can use the packages in F20 or Rawhide. They are effectively at the same upstream version, so it doesn’t appear to matter which you use.

      To enable Rawhide, install the fedora-release-rawhide package and install what you need by doing:

      yum --enablerepo=rawhide install lvm2 device-mapper
      
  10. OERNii

    Thank you so much for this. I’ve been searching for docs on dm-cache for a long time.

    I’ve got it running on RHEL7rc btw.

    • OERNii

      One other thing: I created a snapshot on vg_guests/testoriginlv and the whole cache somehow collapsed.

      • rich

        That doesn’t surprise me much, but you may want to file a bug about it so the LVM developers can add some code to disallow this case.

      • It also seems to collapse on creating a snapshot of the cached LV. That is: about 2 hours after creating and removing a snapshot on my cached /var, the system started to generate massive amounts of funky errors and would not boot afterwards. All of this on CentOS 7; not good! 😦

        Thankfully with the help of cache_repair and friends the damage was minimal, but there was damage.

      • rich

        In case it’s not clear, I think the feature is far from ready for prime-time, especially on CentOS.

      • I know, now 🙂

        But on the other hand, I just switched to rsync for /var. Works for me. And from what I've seen, backing up /var (lots and lots of tiny files) went from 7+ hours the first time to just about 4 hours the second time, with noticeably less I/O wait. That is better than flashcache, which seems a bit more mature.

        FYI all of this is just on my hobby server in my spare time 🙂 @work we will soon switch to SSD only in RAID1.

      • I thought this may be of interest:


        The first graph shows the throughput on my filesystems; the interesting part is the nightly backup of important stuff on /var. Green is /var, the source, blue is /backup, the target. Notice there is continuous I/O for about 4 hours on both filesystems. FYI, /var is an lvm_cache SSD-backed software RAID1 LV; /backup is on a separate 4TB e-SATA disk.

        The second graph, I/O wait, nicely shows lvm_cache in action. Only a few spikes during the whole backup process for green, whereas blue is responsible for most of the I/O wait I see in my CPU graphs. My guess is it's even better than it looks, because one should expect /backup to outperform /var if it wasn't for the caching SSD.

        These are my dmsetup stats:

        0 2662400000
        cache
        8 5380/131072
        128 504275/1589248
        11279835 18619397
        8377215 8657042
        0 136305 4294967293 1
        writeback 2
        migration_threshold 2048
        mq 10
        random_threshold 4
        sequential_threshold 512
        discard_promote_adjustment 1
        read_promote_adjustment 4
        write_promote_adjustment 8

        In plain English: a 50%+ read hit rate and an almost 100% write hit rate. In all fairness, not really real-world material since the cache LV is still large enough to hold the entire disk space used by the slow LV, but it does still show what LVM cache could do for you.

  11. Pingback: Cluster performance: baseline testing | Richard WM Jones

  12. Giorgos

    I created my cache pool with --cachemode writethrough, but according to dmsetup status, it is running in writeback mode. Is it supposed to work like that?

    • rich

      I’ve no idea. However here is my dmsetup status:

      vg_guests-lv_cache_cdata: 0 419430400 linear 
      vg_guests-lv_cache_cmeta: 0 2097152 linear 
      vg_guests-libvirt--images: 0 1677721600 linear 
      vg_guests-testorigin: 0 1677721600 cache 8 20301/262144 64 151/6553600 215 524414 66 522009 0 0 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8
      vg_guests-testorigin_corig: 0 1677721600 linear 
      

      What does writeback 2 mean?

    • Ara

      It looks like the issue described here:
      https://utcc.utoronto.ca/~cks/space/blog/linux/DmCacheChangeWriteMode

      I could only switch my dm-cache to writethrough mode by following these steps.
      Before:

      dmsetup status vg0-root
      0 20971520 cache 8 7243/29696 128 108/3654592 5 406 613 8036 0 108 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8

      dmsetup table vg0-root
      >>> 0 20971520 cache 253:3 253:2 253:4 128 0 default 0
      #### now replace '0 default 0' with '1 writethrough default 0'
      dmsetup reload --table '0 20971520 cache 253:3 253:2 253:4 128 1 writethrough default 0' vg0-root

      #### suspend and resume it
      dmsetup suspend vg0-root
      dmsetup resume vg0-root

      ####Voila
      dmsetup status vg0-root
      0 20971520 cache 8 10868/29696 128 0/3654592 0 0 0 1 0 0 0 1 writethrough 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8

  13. Andre

    I am afraid that cache-pool is still not available on CentOS 6. Could you confirm that? Which distro are you using?

  14. Hi,
    I’m doing this on a CentOS 7 machine.
    The cached LV can be created, but it is always set to inactive after reboot, and thus can't be mounted automatically via /etc/fstab.
    Did you have such problems?

  15. I wrote a small tool for making parts of this process easier: https://github.com/larsks/lvcache

    This permits you to, for example, run "lvcache create myvg/mylv", and it will create the cache LV (as a configurable percentage of the size of the origin LV) and the metadata LV, create the cache pool, and attach it to the origin device.

  16. Andre

    How would you compare this feature to BCACHE which, I believe, is also native for CentOS 7?

  17. Nout Gemmeke

    Hi, what's going wrong here?

    [root@sf01 by-uuid]# lvscan
    ACTIVE '/dev/system_vg/root' [9,77 GiB] inherit
    ACTIVE '/dev/system_vg/var' [9,77 GiB] inherit
    ACTIVE '/dev/system_vg/tmp' [4,88 GiB] inherit
    ACTIVE '/dev/system_vg/swap' [31,25 GiB] inherit
    ACTIVE '/dev/cloud_vg/cache-meta' [1,00 GiB] inherit
    ACTIVE '/dev/cloud_vg/cache' [237,47 GiB] inherit
    [root@sf01 by-uuid]# lvconvert --type cache-pool --poolmetadata cloud_vg/cache_meta cloud_vg/cache
    Unknown metadata LV cloud_vg/cache_meta.

  18. laurens

    Hello, first thank you for all the details. But I was wondering: is it a write cache or a read cache? Or both?
    Also, can you run this with multiple LVs? Because I have two LVs that I would like to attach to a single cache.

    What is the trim size for this? I'm using an Adaptec read cache at the moment, but I'm rather disappointed because it uses 1 MB blocks 😦

  19. laurens

    Hallo,

    Is this a read cache only? or can we use a write cache also?

    Best Regards,

    Laurens

  20. Pingback: Usando discos SSD para cache no LVM | HAmc

  21. laurens

    A tip: you can use this on CentOS 6.6, at least with the updated version of LVM. Make sure to remove the cache before re-installing CentOS 6.6, because otherwise it will crash…

  22. Glad to see they added another undocumented, lightly tested and maybe helpful feature to LVM.

  23. lavamind

    Hi Richard, I read your thread on the LVM mailing list after stumbling on this article. Interesting stuff! In my tests, I seem to have found that there's an extra delay introduced by the initial promotion of blocks from the origin LV to the cache. For example, with a "read_promote_adjustment" of 0, on a clean cache device it will promote every block accessed to the cache, so the first pass is around 100% to 150% *slower* than reading from the origin LV directly. In subsequent read passes, I get an appreciable, stable 40% speed increase.

    With a higher promote adjustment value, dm-cache will promote blocks much more slowly to the cache, and thus the initial "promoting" penalty can be reduced to almost nil, at the cost of obtaining the full benefits of dm-cache only once the cache is well warmed up. I really don't know much about the internals of dm-cache or device-mapper, but I suspect this is caused by block promotion occurring synchronously with the I/O flow: when a client happens to read a block which dm-cache decides to promote, the delay of moving/writing this block to the cache is incurred. Has anyone else made similar observations?
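
    For anyone wanting to repeat that experiment: as far as I know the mq policy tunables mentioned above can be changed at runtime with dmsetup message (the device name below is just the vg-lv mapping name from this article and is illustrative):

    # dmsetup message vg_guests-testoriginlv 0 read_promote_adjustment 0
    # dmsetup status vg_guests-testoriginlv    # the new value should show up in the status line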

  24. Pingback: LVM: Lehká vánoční magie pro adminy | honzas.cz — Jan Švec převážně vážně

  25. Pingback: EnhanceIO and Check_MK plugin « BenV's notes

  26. Pingback: ​Here comes RHEL beta 6.7 | iTelNews

  27. Neduz

    Good blog post! Additional hint for Ubuntu users: install thin-provisioning-tools as well, otherwise you’ll end up with “/usr/sbin/cache_check: execvp failed: No such file or directory” during boot and be dropped in the emergency shell.
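
    For completeness, that install is a one-liner (assuming apt):

    $ sudo apt-get install thin-provisioning-tools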

  28. Hi,
    Is it possible to cache a thin pool, with all of its logical volumes going to one cache pool?

  29. Pingback: LVMCACHE — LVM caching | Jayan Venugopalan

  30. Pingback: SSD Caching – Actually doing it | R e n d r a g

  31. hello

    Hey! It’s really nice reading your blog!

    Could you please tell me, regarding this LVM caching feature: is it caching the most-used files at the FS level, or the most-used blocks on the cached device? I mean, if I have a VM on an image that is 20GB, will the whole file get cached, or just the most-used fragments of it?

  32. Pingback: Testing glusterfs on centos | Danielle and Roger Blog

  33. Lukas

    Hi,
    Was trying that on a test system to add caching to an existing volume. Works fine so far.

    What happens if the caching SSD fails? Is the data still accessible? How do you recover from such a failure (replacing the SSD)?

    Thanks

    • Bob

      I had exactly this problem with LVM cache. After a system failure, the data could no longer be reconstructed. And to date, there is no consistent way to recover an LVM cache setup if the SSD fails.

  34. It's worth pointing out that lvdisplay on your "real" logical volume will show the cache hits/misses for both reads and writes, as well as promotions/demotions. Very handy data:
    LV Size 256.00 GiB
    Cache used blocks 39.53%
    Cache metadata blocks 19.48%
    Cache dirty blocks 0.00%
    Cache read hits/misses 187925 / 92970
    Cache wrt hits/misses 5792563 / 912731
    Cache demotions 0
    Cache promotions 24574
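
    (For reference, the above comes from plain lvdisplay on the cached LV; with the names from this article, something like:)

    # lvdisplay vg_guests/testoriginlv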

  35. Andrew

    I'm a bit confused as to the value of this. You need to create a cache pool/object for each and every LV? But I have dozens of LVs on my system (1-2 for each virtual machine).

    I have a (quite slow) RAID 5 md set of 14TB, with LVM over the top of that. What I want to do is cache the most-used blocks onto a new 256G SSD. It seems that every LV needs its own cache pool, which seems like an extraordinary amount of overhead.

    Or am I missing something obvious…?

    A.

  36. I followed this guide and everything is OK, but after I reboot my machine the logical volume disappears. I can't mount it. It's as if it doesn't exist.
    If I do an lvremove vg01/lv_cache and reboot, the volume appears again and I can use it.
    Can't understand what's happening.
    Ubuntu 18.04 LTS.
