If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs as fast persistent caches to speed up access to the hard disks, then until recently you had three choices: bcache and dm-cache, which are both upstream, or Flashcache/EnhanceIO, which is not. dm-cache required you to first sit down with a calculator to compute block offsets. bcache was the sanest of the three choices.
But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.
The Set-up
To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:
There are two 2 TB WD hard disks in mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.
In other news: wow, SSDs from brand manufacturers are getting really cheap now!
In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:
```
# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda         8:0    0   1.8T  0 disk
└─sda1      8:1    0   1.8T  0 part
  └─md127   9:127  0   1.8T  0 raid1
sdb         8:16   0   1.8T  0 disk
└─sdb1      8:17   0   1.8T  0 part
  └─md127   9:127  0   1.8T  0 raid1
sdc         8:32   0 232.9G  0 disk
└─sdc1      8:33   0 232.9G  0 part
```
Performance
Before starting to set up the caching layer, let’s find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (i.e. they are done on files on a filesystem, not on the raw block devices). I also used O_DIRECT.
HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec
Note these numbers don’t show the real benefit of SSDs — namely that performance doesn’t collapse as soon as you randomly access the disk.
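For reference, the tests were of this general shape — a sketch with a hypothetical mount point and file size, not the exact commands used:

```
# write test: stream 4 GB to a file, bypassing the page cache
dd if=/dev/zero of=/mnt/test/test.img bs=1M count=4096 oflag=direct

# read test: stream it back, again with O_DIRECT
dd if=/mnt/test/test.img of=/dev/null bs=1M iflag=direct
```

dd prints the throughput itself when it finishes, which is where MBytes/sec figures like the ones above come from.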
Terminology
The lvmcache(7) [so new there is no copy online yet] documentation defines various terms that I will use:
```
origin LV          OriginLV     large slow LV
cache data LV      CacheDataLV  small fast LV for cache pool data
cache metadata LV  CacheMetaLV  small fast LV for cache pool metadata
cache pool LV      CachePoolLV  CacheDataLV + CacheMetaLV
cache LV           CacheLV      OriginLV + CachePoolLV
```
Creating the LVs
Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:
```
# lvcreate -L 100G -n testoriginlv vg_guests
  Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv
```
Also note that resizing cached LVs is not currently supported (coming later — for now you can work around it by removing the cache, resizing, then recreating the cache).
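A sketch of that workaround — assuming an LVM new enough to have lvconvert --splitcache (older versions may require removing the cache pool instead), and using the LV names from this article:

```
# detach the cache pool, flushing dirty blocks back to the origin
lvconvert --splitcache vg_guests/testoriginlv

# resize the origin LV and its filesystem as usual
lvresize -L +50G vg_guests/testoriginlv
resize2fs /dev/vg_guests/testoriginlv

# re-attach the cache pool
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
```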
Creating the cache layer
What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks — it simply doesn’t work otherwise.
Therefore my first step is to extend my existing VG to include the fast disk:
```
# vgextend vg_guests /dev/sdc1
  Volume group "vg_guests" successfully extended
```
I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV, which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, but a minimum of 8MB. Since my total available fast space is 232GB, I choose a generous 1GB for the CacheMetaLV and 229GB for the CacheDataLV, which leaves some leftover space (my eventual split turns out to be 229:1).
```
# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
  Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
  Logical volume "lv_cache" created
# lvs
  LV            VG        Attr       LSize
  lv_cache      vg_guests -wi-a----- 229.00g
  lv_cache_meta vg_guests -wi-a-----   1.00g
  testoriginlv  vg_guests -wi-a----- 100.00g
# pvs
  PV         VG        Fmt  Attr PSize   PFree
  /dev/md127 vg_guests lvm2 a--    1.82t 932.89g
  /dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g
```
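That sizing rule (1/1000th of the data LV, with an 8MB floor) can be sketched as a quick calculation — note the 4MiB extent size and the round-up are my assumptions, not from the docs:

```python
import math

def cache_meta_size_mib(cache_data_mib, extent_mib=4):
    """Suggested CacheMetaLV size in MiB: 1/1000th of the
    CacheDataLV, but at least 8 MiB, rounded up to whole extents."""
    meta = max(cache_data_mib / 1000.0, 8)
    return int(math.ceil(meta / extent_mib)) * extent_mib

# For the 229G CacheDataLV above, roughly 236 MiB of metadata
# would suffice, so the 1G chosen here really is generous.
print(cache_meta_size_mib(229 * 1024))  # → 236
```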
(You’ll notice that my cache is bigger than my test OriginLV, but that’s fine as once I’ve worked out all the gotchas, my real OriginLV will be over 1 TB).
Why did I leave 2.88GB of free space in the PV? I’m not sure actually. However the first time I did this, I didn’t leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex’s comment below.
Convert the CacheDataLV and CacheMetaLV into a “cache pool”:
```
# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
  Logical volume "lvol0" created
  Converted vg_guests/lv_cache to cache pool.
```
Now attach the cache pool to the OriginLV to create the final cache object:
```
# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
  vg_guests/testoriginlv is now cached.
```
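To recap, the whole procedure boils down to five commands (using the names from this article):

```
# vgextend vg_guests /dev/sdc1
# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
```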
Benchmark
Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:
LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec
Which is exactly the same as the backing hard disk.
Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn’t a useful test of dm-cache.
What I’m going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).
Since the cache disks need to be in the same VG, how do you ensure that LVM is not using space from the SSD when you allocate space for the original LV (before you allocate space for the cache)?
Indeed. It turns out (I did not know this before) that you can place LVs on particular PVs when you create them.
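For example — a sketch using the PV names from this setup — listing one or more PVs after the lvcreate arguments restricts allocation to those PVs:

```
# allocate the origin LV only on the slow mirrored PV
lvcreate -L 100G -n testoriginlv vg_guests /dev/md127
```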
And if you’re trying to do this for an existing volume, then maybe you could convert to a RAID type first and use lvconvert --replace to migrate off the undesired physical volume?
For moving an LV, you can use pvmove with the -n option:
```
pvmove -n lv_to_move /dev/from_pv /dev/to_pv
```
This is highly interesting, thank you for bringing it to my attention! LVM is certainly evolving.
Could it also be done with a raid1 of 2 SSDs as the cache, to guard against write errors to the cache? Without some sort of cache redundancy, I’d be reluctant to deploy this in a production environment with --cachemode writeback…
I’m a little worried about the failure rate of SSDs in a setup with high write ratios, and the best place to deploy SSDs in my opinion is where we have lots of random writes – such as database tables (indexes should always be contained in memory, otherwise buy more memory…).
I see that lvcreate now also has a --mirrors option, but I can’t see how to ensure that the two copies are on different PVs.
BR Bent
Yes, you can make the CacheDataLV and CacheMetaLV redundant by creating them as raid1 LVs (see lvmcache(7)).
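Roughly like this — a sketch based on the lvmcache(7) example, with hypothetical SSD device names:

```
# mirror the cache data and metadata LVs across two fast PVs
lvcreate --type raid1 -m 1 -L 1G -n lv_cache_meta vg_guests /dev/fast1 /dev/fast2
lvcreate --type raid1 -m 1 -L 229G -n lv_cache vg_guests /dev/fast1 /dev/fast2

# then build the cache pool from them as before
lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
```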
Another way would be to use software RAID to create a mirrored pair of SSDs which would be used as the fast PV.
I don’t know which, if either, is preferable.
Cool! I think the preferred way would be the one that allows me to replace broken disks online, whether they’re holding cache volumes or regular data volumes 🙂
I imagine both must let you do hot disk replacement. However I’ve only ever used Linux md for that. I’ve never used device-mapper’s mirroring capability at all.
Both allow hot disk replacement. lvm.conf has the raid_fault_policy setting; if set to ‘remove’ it drops the failed mirror, and you can use lvchange --mirrors to add another one. The cool case is if it’s set to ‘allocate’ – it will then create a new mirror right when the old one fails, as long as there’s enough free space on a disk that doesn’t already contain a mirror. This requires that dmeventd be running, but LVM starts it internally on its own.
As far as “ensuring the copies are on different PVs”, that’s not a concern. It’s handled by the allocation policy of the LV – there are several, but ‘cling’ is the most generally useful and (IIRC) the default. The _only_ one that allows mirrors to be on the same drive is ‘anywhere’, which is emphatically not the default. LVs default to a value of ‘inherit’, which takes the VG default; you can set the VG default with `vgchange --alloc`.
Agh, I messed up. You add mirrors with lvconvert --mirrors, not lvchange.
Are the ‘cache’ partitions added to the total pool?
I.e. if I have a 1TB ‘slow’ pool and 300GB of ‘fast’ SSD, would the total size be 1.3TB or 1TB?
It would be 1TB.
The cache layer sits on top of the OriginLV, and copies blocks from the OriginLV to make them accessible faster.
If you remove the cache layer, then LVM writes back the changed data to the OriginLV and you’re left with just the OriginLV. Therefore it’s pretty obvious that this wouldn’t work if the size of the LV increased.
From what version of LVM is this cache feature available?
It was added in Feb 2014, so pretty recently. Not sure about version numbers, but I did the test using
Can a cache only be used for a single origin LV? Do I have to manually divide up my ssd into multiple cache LVs if I want multiple origin LVs to be cached?
AFAIK (and I’m not an expert) it appears that a cache can only apply to a single origin LV, and that if you wanted to cache multiple origin LVs, then you would need multiple caches. However you might want to ask on the linux-lvm mailing list for confirmation from the experts.
Compared to ZFS this is a joke.
All Oracle have to do is relicense ZFS and it could be included in upstream Linux too. Until then it’s irrelevant.
How so? It’s not entirely clear whether you think ZFS is the joke or dm/LVM is the joke – or what the ‘joke’ is, either.
The real joke is ZFS’s terrible performance on Linux.
Enjoyed your multiple articles on building the cluster and now storage tiering. Check out btier (http://www.lessfs.com/wordpress/) as well, which is less a cache and more a tiered storage design with blocks moving between tiers, so you get the sum of space available for use.
Just FYI, the reason it needed the additional gigabyte (and why it created ‘lvol0’) is because of a failure-recovery feature.
You see, both thin pools and cache pools have metadata that, in extraordinary circumstances, may break and need to be – essentially – fsck’d. Using the thin_check/thin_repair/cache_check/cache_repair tools, you can do that – but only a total chump would do that in-place; just like how all the cool kids snapshot their filesystem before they run fsck 😛
So what LVM does is, on creation of either type of pool, allocate a new hidden LV – you can see it in `lvs -a`; it’s listed [lvol0_pmspare] (the brackets denoting that it’s hidden, although the actual name is just lvol0_pmspare). This is sized to be the same size as the _largest_ pool metadata LV in the VG, and there’s only one lvol0_pmspare per VG. So if you had a thin pool with 5GB metadata and a cache pool with 3GB metadata, you’d have a single 5GB lvol0_pmspare.
When something goes wrong, and you repair the metadata, the repaired result is written to lvol0_pmspare, and you can then verify that the result makes sense and apply it.
Thanks — I’ve added a link to your comment from the main text.
Rich, I’d like to try this out. How did you go about getting the RPMs installed. Are you using Fedora 20, or an alpha release?
You can use the packages in F20 or Rawhide. They are effectively at the same upstream version, so it doesn’t appear to matter which you use.
To enable Rawhide, install the fedora-release-rawhide package, then install what you need from that repo.
Thank you so much for this. I’ve been searching for docs on dm-cache for a long time.
I’ve got it running on RHEL7rc btw.
One other thing: I created a snapshot on vg_guests/testoriginlv and the whole cache somehow collapsed.
That doesn’t surprise me much, but you may want to file a bug about it so the LVM developers can add some code to disallow this case.
It also seems to collapse on creating a snapshot on the cached LV. That is: about 2 hours after creating and removing a snapshot on cached /var, the system started to generate massive amounts of funky errors and would not boot afterwards. All of this on CentOS 7. Not good, this! 😦
Thankfully with the help of cache_repair and friends the damage was minimal, but there was damage.
In case it’s not clear, I think the feature is far from ready for prime-time, especially on CentOS.
I know, now 🙂
But on the other hand, I just switched to rsync for /var. Works for me. And from what I’ve seen, backing up /var (lots and lots of tiny files) went from 7+ hours the first time to just about 4 hours the second time, with noticeably less I/O wait. That is better than flashcache, which seems a bit more mature.
FYI all of this is just on my hobby server in my spare time 🙂 @work we will soon switch to SSD only in RAID1.
I thought this may be of interest:
The first graph shows the throughput on my filesystems; the interesting part is the nightly backup of important stuff on /var. Green is /var, the source; blue is /backup, the target. Notice there is continuous I/O for about 4 hours on both filesystems. FYI /var is an lvm_cache SSD-backed softRAID1 LV; /backup is on a separate 4TB e-SATA disk.
The second graph, I/O wait, nicely shows lvm_cache in action. Only a few spikes during the whole backup process for green, whereas blue is responsible for most of the I/O wait I see in my CPU graphs. My guess is it’s even better than it looks, since one would expect /backup to outperform /var if it weren’t for the caching SSD.
These are my dmsetup stats:
```
0 2662400000 cache 8 5380/131072 128 504275/1589248 11279835 18619397 8377215 8657042 0 136305 4294967293 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8
```
In plain English: 50%+ read hit rate and almost 100% write hit rate. In all fairness not really real-world material, since the cache LV is still large enough to hold the entire disk space used by the slow LV, but still it does show what lvm cache could do for you.
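Assuming the usual dm-cache status field order (read hits, read misses, write hits, write misses – check the dm-cache kernel documentation for your version), hit rates can be computed from those counters with a sketch like this (the numbers below are made up for illustration):

```python
def hit_rate(hits, misses):
    """Fraction of I/Os served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

# Hypothetical counters, in dm-cache status order:
read_hits, read_misses = 90, 10
write_hits, write_misses = 99, 1
print(f"read hit rate {hit_rate(read_hits, read_misses):.0%}, "
      f"write hit rate {hit_rate(write_hits, write_misses):.0%}")
```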
Pingback: Cluster performance: baseline testing | Richard WM Jones
I created my cache pool with --cachemode writethrough, but according to dmsetup status, it is running in writeback mode. Is it supposed to work like that?
I’ve no idea. However here is my dmsetup status:

What does “writeback 2” mean?

It looks like it’s explained here:
https://utcc.utoronto.ca/~cks/space/blog/linux/DmCacheChangeWriteMode
I could only switch my dm-cache to writethrough mode by following these steps:
Before:

```
dmsetup status vg0-root
0 20971520 cache 8 7243/29696 128 108/3654592 5 406 613 8036 0 108 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8

dmsetup table vg0-root
0 20971520 cache 253:3 253:2 253:4 128 0 default 0

#### now replace '0 default 0' with '1 writethrough default 0'
dmsetup reload --table '0 20971520 cache 253:3 253:2 253:4 128 1 writethrough default 0' vg0-root

#### suspend and resume it
dmsetup suspend vg0-root
dmsetup resume vg0-root

#### voila
dmsetup status vg0-root
0 20971520 cache 8 10868/29696 128 0/3654592 0 0 0 1 0 0 0 1 writethrough 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8
```
I am afraid that cache-pool is still not available on CentOS 6. Could you confirm that? Which distro are you using?
Fedora Rawhide.
Guess I should wait and work with CentOS 7 which is already available.
Thanks for the tutorial!
Hi,
I’m doing this on a CentOS 7 machine.
The cached LV can be created, but it is always set to inactive after reboot, and thus can’t be mounted automatically via /etc/fstab.
Did you have such problems?
I solved this problem with the lvm2-lvmetad service:
systemctl enable lvm2-lvmetad
Actually this is the bug in lvm2:
https://bugzilla.redhat.com/show_bug.cgi?id=1023250
Workaround is set global/use_lvmetad=0 in /etc/lvm/lvm.conf
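That is, in /etc/lvm/lvm.conf:

```
global {
    use_lvmetad = 0
}
```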
I wrote a small tool for making parts of this process easier: https://github.com/larsks/lvcache
This permits you to, for example, run “lvcache create myvg/mylv”, and it will create the cache LV (as a configurable percentage of the size of the origin LV) and the metadata LV, create the cache pool, and attach it to the origin device.
How would you compare this feature to BCACHE which, I believe, is also native for CentOS 7?
Hi, what’s going wrong here?
```
[root@sf01 by-uuid]# lvscan
  ACTIVE   '/dev/system_vg/root' [9,77 GiB] inherit
  ACTIVE   '/dev/system_vg/var' [9,77 GiB] inherit
  ACTIVE   '/dev/system_vg/tmp' [4,88 GiB] inherit
  ACTIVE   '/dev/system_vg/swap' [31,25 GiB] inherit
  ACTIVE   '/dev/cloud_vg/cache-meta' [1,00 GiB] inherit
  ACTIVE   '/dev/cloud_vg/cache' [237,47 GiB] inherit
[root@sf01 by-uuid]# lvconvert --type cache-pool --poolmetadata cloud_vg/cache_meta cloud_vg/cache
  Unknown metadata LV cloud_vg/cache_meta.
```
This isn’t a support channel. Ask on the LVM mailing lists.
try this:

```
lvconvert --type cache-pool --poolmetadata cloud_vg/cache-meta cloud_vg/cache
```
Hallo, first of all thank you for all the details. But I was wondering: is it a write cache or a read cache? Or both?
Also, can you run this with multiple LVs? Because I’ve got two LVs that I would like to connect to a single cache.
What is the trim size for this? I’m using an Adaptec read cache at the moment, but I’m rather disappointed because it uses 1 MB blocks 😦
Hallo,
Is this a read cache only, or can we use a write cache also?
Best Regards,
Laurens
Pingback: Usando discos SSD para cache no LVM | HAmc
A tip: you can use this on CentOS 6.6, at least with the updated version of LVM. Make sure to remove the cache before re-installing CentOS 6.6, because it will crash otherwise…
Glad to see they added another undocumented, lightly tested and maybe helpful feature to LVM.
Hi Richard, I read your thread on the LVM mailing list after stumbling on this article. Interesting stuff! In my tests, I seem to have found that there’s an extra delay introduced with the initial promotion of blocks from the origin LV to the cache. For example, with a “read_promote_adjustment” of 0, on a clean cache device it will promote every block accessed to the cache, so the first pass is around 100% to 150% *slower* than reading directly from the origin LV. In subsequent read passes, I get an appreciable, stable 40% speed increase. With a higher promote adjustment value, dm-cache will promote blocks much more slowly to the cache, and thus the initial “promoting” penalty can be reduced to almost nil, at the cost of obtaining the full benefits of dm-cache only once the cache is greatly warmed up. I really don’t know much about the internals of dm-cache or device-mapper, but I suspect this is caused because block promotion occurs synchronously to the I/O flow: when a client happens to read a block which dm-cache decides to promote, the delay of moving/writing this block to the cache is incurred. Has anyone else made similar observations?
Pingback: LVM: Lehká vánoční magie pro adminy | honzas.cz — Jan Švec převážně vážně
Pingback: EnhanceIO and Check_MK plugin « BenV's notes
Pingback: Here comes RHEL beta 6.7 | iTelNews
Good blog post! Additional hint for Ubuntu users: install thin-provisioning-tools as well, otherwise you’ll end up with “/usr/sbin/cache_check: execvp failed: No such file or directory” during boot and be dropped in the emergency shell.
Hi,
Is it possible to cache a thin pool? With all its logical volume to one cache pool?
Probably best to ask on an LVM mailing list.
Pingback: LVMCACHE — LVM caching | Jayan Venugopalan
Pingback: SSD Caching – Actually doing it | R e n d r a g
Hey! It’s really nice reading your blog!
Could you please tell me, regarding this LVM caching feature: is it caching the most-used files at the FS level, or the most-used blocks of the cached device? I mean, if I have a VM on an image that is 20GB, will the whole file get cached, or just the most-used fragments of it?
It operates entirely at the block layer, and knows nothing about files.
I know this is not a support forum, but… it works with LVM, not the filesystem, so the caching is block-level.
Pingback: Testing glusterfs on centos | Danielle and Roger Blog
Hi,
Was trying that on a test system to add caching to an existing volume. Works fine so far.
What happens if the caching SSD fails? Is the data still accessible? How do you recover from such a failure (replacing the SSD)?
Thanks
I had exactly this problem with LVM cache. After a system failure, the data could no longer be reconstructed. And to date, there is no consistent solution for LVM cache if the SSD fails.
It’s worth pointing out that lvdisplay on your “real” logical volume will show the cache hits/misses for both reads and writes, and promotions/demotions. Very handy data:

```
LV Size                 256.00 GiB
Cache used blocks       39.53%
Cache metadata blocks   19.48%
Cache dirty blocks      0.00%
Cache read hits/misses  187925 / 92970
Cache wrt hits/misses   5792563 / 912731
Cache demotions         0
Cache promotions        24574
```
I’m a bit confused as to the value of this. You need to create a cache pool/object for each and every LV? But I have dozens of LVs on my system (1-2 for each virtual machine).
I have a (quite slow) RAID5 md set of 14TB, with LVM over the top of that. What I want to do is cache the most-used blocks onto a new 256GB SSD. It seems that every LV needs its own cache pool, which seems like an extraordinary amount of overhead.
Or am I missing something obvious…?
A.
A.
I followed this guide and everything is OK, but after I reboot my machine the logical volume disappears. I can’t mount it. It’s like it doesn’t exist.
If I do lvremove vg01/lv_cache and reboot, the volume appears again and I can use it.
Can’t understand what’s happening.
Ubuntu 18.04 LTS.