If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs as fast persistent caches to speed up access to the hard disks, then until recently you had three choices: bcache and dm-cache, which are both upstream, or Flashcache/EnhanceIO, which is not. dm-cache required you to first sit down with a calculator to compute block offsets. bcache was the sanest of the three choices.
But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.
To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:
There are two 2 TB WD hard disks in mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.
In other news: wow, SSDs from brand manufacturers are getting really cheap now!
In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   1.8T  0 disk
└─sda1        8:1    0   1.8T  0 part
  └─md127     9:127  0   1.8T  0 raid1
sdb           8:16   0   1.8T  0 disk
└─sdb1        8:17   0   1.8T  0 part
  └─md127     9:127  0   1.8T  0 raid1
sdc           8:32   0 232.9G  0 disk
└─sdc1        8:33   0 232.9G  0 part
Before starting to set up the caching layer, let's find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (i.e. they were measured on files on a filesystem, not on the raw block devices). I also used O_DIRECT.
HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec
Note these numbers don't show the real benefit of SSDs -- namely that performance doesn't collapse as soon as you randomly access the disk.
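For anyone wanting to repeat this kind of measurement, here is a minimal sketch of a sequential O_DIRECT test with dd. It is not necessarily the exact invocation behind the figures above: the test file path and block count are placeholders, and the run helper and DRY_RUN variable are illustrative additions so that the script only prints the commands by default.

```shell
#!/bin/sh
# Sketch of a sequential O_DIRECT benchmark with dd.
# TESTFILE and COUNT are placeholders (assumptions), not values from this post.
TESTFILE=${TESTFILE:-/mnt/test/ddtest.img}
COUNT=${COUNT:-4096}      # number of 1 MiB blocks to transfer
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

seq_benchmark() {
    # Write test: oflag=direct opens the output file with O_DIRECT.
    run dd if=/dev/zero of="$TESTFILE" bs=1M count="$COUNT" oflag=direct || return 1
    # Read test: iflag=direct opens the input file with O_DIRECT.
    run dd if="$TESTFILE" of=/dev/null bs=1M count="$COUNT" iflag=direct
}

seq_benchmark
```

Once the printed commands look right, rerun with DRY_RUN=0 and TESTFILE pointing at a file on the filesystem you want to measure. dd reports the throughput on its last line of output.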
The lvmcache(7) [so new there is no copy online yet] documentation defines various terms that I will use:
origin LV           OriginLV      large slow LV
cache data LV       CacheDataLV   small fast LV for cache pool data
cache metadata LV   CacheMetaLV   small fast LV for cache pool metadata
cache pool LV       CachePoolLV   CacheDataLV + CacheMetaLV
cache LV            CacheLV       OriginLV + CachePoolLV
Creating the LVs
Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:
# lvcreate -L 100G -n testoriginlv vg_guests
Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv
Also note that resizing cached LVs is not currently supported (coming later -- for now you can work around it by removing the cache, resizing, then recreating the cache).
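The remove, resize, recreate workaround could look roughly like the sketch below. It assumes an LVM version whose lvconvert has the --splitcache option (check lvmcache(7) and lvconvert(8) for what your version supports), an ext4 filesystem on the LV, and the LV names used in this post. GROW_BY, DRY_RUN and the run helper are hypothetical additions for illustration; by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: resize a cached LV by detaching the cache, resizing, and re-attaching.
# All values below are assumptions for illustration; adjust to your setup.
VG=${VG:-vg_guests}
LV=${LV:-testoriginlv}
POOL=${POOL:-lv_cache}
GROW_BY=${GROW_BY:-50G}   # hypothetical amount to grow by
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

resize_cached_lv() {
    # 1. Detach the cache pool from the LV but keep the pool for reuse.
    run lvconvert --splitcache "$VG/$LV" || return 1
    # 2. Grow the now-uncached LV, then the ext4 filesystem on it.
    run lvextend -L "+$GROW_BY" "$VG/$LV" || return 1
    run resize2fs "/dev/$VG/$LV" || return 1
    # 3. Attach the cache pool again.
    run lvconvert --type cache --cachepool "$VG/$POOL" "$VG/$LV"
}

resize_cached_lv
```

Each step stops the sequence if it fails, so a failed detach never leads to a resize of a still-cached LV.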
Creating the cache layer
What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks -- it simply doesn't work otherwise.
Therefore my first step is to extend my existing VG to include the fast disk:
# vgextend vg_guests /dev/sdc1
Volume group "vg_guests" successfully extended
I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV, which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, but a minimum of 8MB. My total available fast space is 232GB, so a 1000:1 split would need only about 230MB of metadata. Instead I choose a generous 1GB for the CacheMetaLV and 229GB for the CacheDataLV, which leaves some space left over (my eventual split turns out to be 229:1).
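The sizing rule from the documentation (1/1000th of the CacheDataLV, with a minimum of 8MB) is easy to express as arithmetic. The function below is a small sketch of that rule; the name cache_meta_mib is made up for this example, and it assumes sizes are given in MiB and rounds up.

```shell
#!/bin/sh
# Suggested CacheMetaLV size in MiB for a given CacheDataLV size in MiB:
# 1/1000th of the data size (rounded up), but never less than 8 MiB.
# cache_meta_mib is a hypothetical helper, not part of LVM.
cache_meta_mib() {
    data_mib=$1
    meta_mib=$(( (data_mib + 999) / 1000 ))
    if [ "$meta_mib" -lt 8 ]; then
        meta_mib=8
    fi
    echo "$meta_mib"
}

cache_meta_mib $((229 * 1024))   # the 229G CacheDataLV: prints 235
cache_meta_mib $((4 * 1024))     # a small 4G cache hits the floor: prints 8
```

So the 1G CacheMetaLV chosen here is a bit more than four times what the rule asks for.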
# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
Logical volume "lv_cache" created
# lvs
LV            VG        Attr       LSize
lv_cache      vg_guests -wi-a----- 229.00g
lv_cache_meta vg_guests -wi-a-----   1.00g
testoriginlv  vg_guests -wi-a----- 100.00g
# pvs
PV         VG        Fmt  Attr PSize   PFree
/dev/md127 vg_guests lvm2 a--    1.82t 932.89g
/dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g
(You'll notice that my cache is bigger than my test OriginLV, but that's fine as once I've worked out all the gotchas, my real OriginLV will be over 1 TB).
Why did I leave 2.88GB of free space in the PV?
I'm not sure actually. However the first time I did this, I didn't leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex's comment below.
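The "256 extents (1GB)" figure is consistent with LVM's default extent size of 4 MiB. The sketch below just does that conversion; EXTENT_MIB=4 is an assumption about vg_guests (it is the LVM default), and extents_to_mib is a made-up helper name.

```shell
#!/bin/sh
# Convert a number of LVM extents to MiB.
# EXTENT_MIB=4 is LVM's default extent size, assumed here for vg_guests.
EXTENT_MIB=${EXTENT_MIB:-4}

# extents_to_mib is a hypothetical helper, not an LVM command.
extents_to_mib() {
    echo $(( $1 * EXTENT_MIB ))
}

extents_to_mib 256   # prints 1024, i.e. the 1GB of workspace lvconvert asked for

# On a real system the extent size and the number of free extents can be read with:
#   vgs -o vg_name,vg_extent_size,vg_free_count vg_guests
```

With 2.88GB free on /dev/sdc1 there are roughly 737 free extents of that size, comfortably more than the 256 that were requested.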
Convert the CacheDataLV and CacheMetaLV into a "cache pool":
# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
Logical volume "lvol0" created
Converted vg_guests/lv_cache to cache pool.
Now attach the cache pool to the OriginLV to create the final cache object:
# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
vg_guests/testoriginlv is now cached.
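For reference, the steps above can be collected into one script. This is only a sketch: the defaults are the names and sizes used in this post, while the variable names, DRY_RUN and the run helper are additions for illustration. By default it only prints the commands, which seems prudent given how easy it is to destroy an OriginLV.

```shell
#!/bin/sh
# Sketch: the cache setup steps from this post in one place.
# Defaults match the names and sizes used above; DRY_RUN and run are
# illustrative helpers, not LVM features.
VG=${VG:-vg_guests}
FAST_PV=${FAST_PV:-/dev/sdc1}
ORIGIN=${ORIGIN:-testoriginlv}
META_SIZE=${META_SIZE:-1G}
DATA_SIZE=${DATA_SIZE:-229G}
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_cache() {
    # Put the fast disk in the same VG as the slow origin LV.
    run vgextend "$VG" "$FAST_PV" || return 1
    # Create the CacheMetaLV and CacheDataLV on the fast PV only.
    run lvcreate -L "$META_SIZE" -n lv_cache_meta "$VG" "$FAST_PV" || return 1
    run lvcreate -L "$DATA_SIZE" -n lv_cache "$VG" "$FAST_PV" || return 1
    # Combine them into a cache pool, then attach the pool to the origin.
    run lvconvert --type cache-pool --poolmetadata "$VG/lv_cache_meta" "$VG/lv_cache" || return 1
    run lvconvert --type cache --cachepool "$VG/lv_cache" "$VG/$ORIGIN"
}

setup_cache
```

Leave DATA_SIZE small enough that some extents remain free on the fast PV, since lvconvert wanted spare workspace when I first tried this.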
Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:
LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec
Which is exactly the same as the backing hard disk.
Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn't a useful test of dm-cache.
What I'm going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).