Tag Archives: lvm

LVM cache contd: Tip: Using tags

(Thanks Alasdair Kergon)

# pvchange --addtag ssd /dev/sdc1
  Physical volume "/dev/sdc1" changed
  1 physical volume changed / 0 physical volumes not changed
# pvchange --addtag slow /dev/md127
  Physical volume "/dev/md127" changed
  1 physical volume changed / 0 physical volumes not changed
# pvs -o+tags
  PV         VG        Fmt  Attr PSize   PFree   PV Tags
  /dev/md127 vg_guests lvm2 a--    1.82t 962.89g slow
  /dev/sdc1  vg_guests lvm2 a--  232.88g 232.88g ssd

These tags can be used when placing logical volumes onto specific physical volumes:

# lvcreate -L 800G -n testorigin vg_guests @slow
  1. https://www.redhat.com/archives/linux-lvm/2014-May/msg00047.html
  2. https://www.redhat.com/archives/linux-lvm/2014-May/msg00048.html


Filed under Uncategorized

Removing the cache from an LV

If you’ve cached an LV (see yesterday’s post), how do you remove the cache and go back to just a simple origin LV?

It turns out to be simple, but you must make sure you are removing the cache pool (not the origin LV, not the CacheMetaLV):

# lvremove vg_guests/lv_cache
  Flushing cache for testoriginlv.
  0 blocks must still be flushed.
  Logical volume "lv_cache" successfully removed

This command deletes the CacheDataLV and CacheMetaLV. To reenable the cache you have to go through the process here again.

This is also what you must (currently) do if you want to resize the LV: remove the cache pool, resize the origin LV, then recreate the cache on top. Apparently they’re going to get lvresize to do the right thing in a future version of LVM.
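As a rough sketch, the remove / resize / recreate sequence might look like the script below. It only prints the commands by default (dry run). The LV names, the 1G and 229G sizes and /dev/sdc1 are taken from the set-up post, while the `run` helper, the `resize_cached_lv` function and the 200G target size are hypothetical and only there for illustration.

```shell
#!/bin/bash
# Dry-run sketch: prints the commands instead of executing them.
# Set DRY_RUN=0 only after checking every name against your own system.
DRY_RUN=${DRY_RUN:-1}

vg=vg_guests
origin=testoriginlv
pool=lv_cache
meta=lv_cache_meta
fast_pv=/dev/sdc1

# run CMD...: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# resize_cached_lv NEW_SIZE (hypothetical helper, not part of LVM)
resize_cached_lv() {
    local new_size=$1
    # 1. Remove the cache pool (NOT the origin LV, NOT the CacheMetaLV).
    run lvremove "$vg/$pool"
    # 2. Resize the now-uncached origin LV. The filesystem inside it
    #    is not resized by this command.
    run lvresize -L "$new_size" "$vg/$origin"
    # 3. Recreate the cache pool on the fast PV and attach it again.
    run lvcreate -L 1G -n "$meta" "$vg" "$fast_pv"
    run lvcreate -L 229G -n "$pool" "$vg" "$fast_pv"
    run lvconvert --type cache-pool --poolmetadata "$vg/$meta" "$vg/$pool"
    run lvconvert --type cache --cachepool "$vg/$pool" "$vg/$origin"
}

resize_cached_lv 200G
```

Printing the commands first makes it easy to check that the lvremove line really names the cache pool before anything destructive runs.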


Using LVM’s new cache feature

If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs as fast persistent caches to speed up access to the hard disks, then until recently you had three choices: bcache and dm-cache, which are both upstream, or Flashcache/EnhanceIO (Flashcache is not upstream). dm-cache required you to first sit down with a calculator to compute block offsets. bcache was the sanest of the three choices.

But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.

The Set-up

To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:


There are two 2 TB WD hard disks in mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.

In other news: wow, SSDs from brand manufacturers are getting really cheap now!

In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:

# lsblk
NAME                                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                        8:0    0   1.8T  0 disk  
└─sda1                                     8:1    0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdb                                        8:16   0   1.8T  0 disk  
└─sdb1                                     8:17   0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdc                                        8:32   0 232.9G  0 disk  
└─sdc1                                     8:33   0 232.9G  0 part  


Before starting to set up the caching layer, let’s find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (ie. they are done on files on a filesystem, not on the raw block devices). I also used O_DIRECT.

HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec

Note these numbers don’t show the real benefit of SSDs — namely that performance doesn’t collapse as soon as you randomly access the disk.


The lvmcache(7) [so new there is no copy online yet] documentation defines various terms that I will use:

origin LV           OriginLV      large slow LV
cache data LV       CacheDataLV   small fast LV for cache pool data
cache metadata LV   CacheMetaLV   small fast LV for cache pool metadata
cache pool LV       CachePoolLV   CacheDataLV + CacheMetaLV
cache LV            CacheLV       OriginLV + CachePoolLV

Creating the LVs

Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:

# lvcreate -L 100G -n testoriginlv vg_guests
  Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv

Also note that resizing cached LVs is not currently supported (coming later — for now you can work around it by removing the cache, resizing, then recreating the cache).

Creating the cache layer

What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks — it simply doesn’t work otherwise.

Therefore my first step is to extend my existing VG to include the fast disk:

# vgextend vg_guests /dev/sdc1
  Volume group "vg_guests" successfully extended

I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, but a minimum of 8MB. Since my total available fast space is 232GB, and I want a 1000:1 split, I choose a generous 1GB for CacheMetaLV, 229G for CacheDataLV, and that will leave some left over space (my eventual split turns out to be 229:1).
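The sizing rule quoted from the documentation (1/1000th of the CacheDataLV, with an 8MB floor) can be sketched in shell arithmetic. The `cache_meta_mib` function is a hypothetical helper for illustration, not an LVM command, and it works in MiB throughout.

```shell
#!/bin/bash
# cache_meta_mib DATA_MIB
# Suggest a CacheMetaLV size in MiB for a CacheDataLV of DATA_MIB MiB:
# 1/1000th of the data LV (rounded up), but never less than 8 MiB.
cache_meta_mib() {
    local data_mib=$1
    local meta_mib=$(( (data_mib + 999) / 1000 ))
    if [ "$meta_mib" -lt 8 ]; then
        meta_mib=8
    fi
    echo "$meta_mib"
}

# The 229G CacheDataLV used in this post.
echo "229G data LV -> $(cache_meta_mib $(( 229 * 1024 ))) MiB metadata"
# A small 2G cache falls back to the 8 MiB floor.
echo "2G data LV   -> $(cache_meta_mib $(( 2 * 1024 ))) MiB metadata"
```

By this rule a 229G data LV needs about 235 MiB of metadata, so the 1GB chosen here is several times the suggested minimum.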

# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
  Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
  Logical volume "lv_cache" created
# lvs
  LV                     VG        Attr       LSize
  lv_cache               vg_guests -wi-a----- 229.00g
  lv_cache_meta          vg_guests -wi-a-----   1.00g
  testoriginlv           vg_guests -wi-a----- 100.00g
# pvs
  PV         VG        Fmt  Attr PSize   PFree  
  /dev/md127 vg_guests lvm2 a--    1.82t 932.89g
  /dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g

(You’ll notice that my cache is bigger than my test OriginLV, but that’s fine as once I’ve worked out all the gotchas, my real OriginLV will be over 1 TB).

Why did I leave 2.88GB of free space in the PV? I’m not sure actually. However the first time I did this, I didn’t leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex’s comment below.

Convert the CacheDataLV and CacheMetaLV into a “cache pool”:

# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
  Logical volume "lvol0" created
  Converted vg_guests/lv_cache to cache pool.

Now attach the cache pool to the OriginLV to create the final cache object:

# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
  vg_guests/testoriginlv is now cached.


Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:

LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec

Which is exactly the same as the backing hard disk.

Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn’t a useful test of dm-cache.

What I’m going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).



Use guestfish and nbdkit to examine physical disk locations

This question came up: LVM makes sure that the logical volumes it creates are properly aligned to a generous block size, to ensure most efficient access. However what happens if the partition underneath an LVM physical volume isn’t aligned? Is LVM smart enough to move the LV around so it is still aligned, relative to the physical disk underneath?

This is easily answered using guestfish and captive nbdkit. (Since captive nbdkit is a new feature, you’ll need Rawhide, this Fedora 20 update, or to compile nbdkit from source if you want to run the commands below.)

First we create a badly aligned partition, and put an LVM physical volume inside it, and an LVM logical volume in the physical volume. This kind of disk construction is trivial with guestfish:

$ guestfish -N disk <<EOF
  part-init /dev/sda mbr
  # Unaligned partition, starts on sector 63
  part-add /dev/sda p 63 -100

  pvcreate /dev/sda1
  vgcreate VG /dev/sda1
  lvcreate LV VG 32

  # You could also create a filesystem here:
  #mkfs ext2 /dev/VG/LV
EOF

Use virt-filesystems to verify the layout of the disk:

$ virt-filesystems -a test1.img --all --long -h
Name        Type        VFS      Label  MBR  Size  Parent
/dev/VG/LV  filesystem  unknown  -      -    32M   -
/dev/VG/LV  lv          -        -      -    32M   /dev/VG
/dev/VG     vg          -        -      -    96M   /dev/sda1
/dev/sda1   pv          -        -      -    96M   -
/dev/sda1   partition   -        -      83   100M  /dev/sda
/dev/sda    device      -        -      -    100M  -

Now using nbdkit we can try writing to the logical volume, while observing the actual read and write operations that happen on the underlying file:

$ nbdkit -f -v file file=test1.img \
  --run '
    guestfish -x --format=raw -a $nbd \
      run : \
      pwrite-device /dev/VG/LV " " 0
  '

This command prints out a lot of debugging. At the end we see:

libguestfs: trace: pwrite_device "/dev/VG/LV" " " 0
nbdkit: file[1]: debug: acquire per-connection request lock
nbdkit: file[1]: debug: pread count=4096 offset=1080832
nbdkit: file[1]: debug: release per-connection request lock
nbdkit: file[1]: debug: acquire per-connection request lock
nbdkit: file[1]: debug: pwrite count=4096 offset=1080832
nbdkit: file[1]: debug: release per-connection request lock
libguestfs: trace: pwrite_device = 1

It looks like writing to the first few bytes of /dev/VG/LV resulted in a read and a write of a 4K block starting at byte offset 1080832 (relative to the start of the physical disk).

1080832 = 0x107E00, which is only 512-byte aligned. That is not aligned for performance, showing that LVM does not do any magic to adjust LVs relative to the underlying disk when the partition it is on is not aligned properly.

Notice that I created the partition starting on sector 63. If you add 1 sector (512 bytes, 0x200) to 0x107E00 you get 0x108000 which is aligned to 32K.
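The arithmetic above can be double-checked with a few lines of shell. The offset and the partition start come from the nbdkit output and the part-add command in this post. The `is_aligned` helper is hypothetical and just wraps a modulo test.

```shell
#!/bin/bash
# Numbers taken from the nbdkit debug output and the partition layout above.
offset=1080832               # byte offset of the write on the physical disk
part_start=$(( 63 * 512 ))   # the partition starts on sector 63

# is_aligned OFFSET ALIGNMENT: succeeds if OFFSET is a multiple of ALIGNMENT.
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

printf 'offset on disk:       0x%X\n' "$offset"
printf 'offset inside the PV: 0x%X\n' $(( offset - part_start ))

for align in 512 4096 32768; do
    if is_aligned "$offset" "$align"; then
        echo "disk offset is aligned to $align bytes"
    else
        echo "disk offset is NOT aligned to $align bytes"
    fi
done
```

The offset inside the PV comes out as 0x100000 (exactly 1 MiB), so the LV is well aligned relative to the partition; it is the 63-sector partition start that pushes it off alignment on the disk.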


virt-resize --shrink now works

(Shrinking is still tedious, but at least it now works. Expanding is much simpler).

Look at the original disk image and decide what scope there is for shrinking it. In this case we could reduce the size by as much as 5.5GB, but for the purpose of this test I will only shrink the image from 10GB to 7GB.

$ virt-df /tmp/disk.img
Filesystem                           1K-blocks       Used  Available  Use%
disk.img:/dev/sda1                      495844      29565     440679    6%
disk.img:/dev/vg_f13x64/lv_root        7804368    2077504    5647596   27%

We can expand guests automatically, but virt-resize is much more conservative about the complex business of shrinking guests. You have to first use guestfish on (a copy of) the original disk to manually shrink the content of the partitions you want to shrink. In this case I manually shrink the root filesystem, its LV, and the PV.

$ guestfish -a /tmp/disk.img

Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.

Type: 'help' for a list of commands
      'man' to read the manual
      'quit' to quit the shell

><fs> run
><fs> resize2fs-size /dev/vg_f13x64/lv_root 4G
libguestfs: error: resize2fs_size: resize2fs 1.41.12 (17-May-2010)
Please run 'e2fsck -f /dev/vg_f13x64/lv_root' first.

This known problem with resize2fs is documented.

><fs> e2fsck-f /dev/vg_f13x64/lv_root
><fs> resize2fs-size /dev/vg_f13x64/lv_root 4G
><fs> lvresize /dev/vg_f13x64/lv_root 4096
><fs> pvresize-size /dev/sda2 6G
libguestfs: error: pvresize_size: /dev/vda2:
/dev/vda2: cannot resize to 191 extents as later ones are allocated.

This is unexpected: pvresize does not “defrag”. Perhaps in a future version of pvresize, pvmove or libguestfs we will add this, but for now I have to work around it by deleting the troublesome swap LV, doing the resize, then recreating the swap.

><fs> lvremove /dev/vg_f13x64/lv_swap
><fs> pvresize-size /dev/sda2 6G
><fs> lvcreate lv_swap vg_f13x64 512
><fs> mkswap /dev/vg_f13x64/lv_swap
><fs> exit

Now I can perform the resize operation itself using virt-resize. Note this is also a copy operation, since virt-resize never changes the source disk.

$ truncate -s 7G /tmp/disk2.img
$ virt-resize --shrink /dev/sda2 /tmp/disk.img /tmp/disk2.img
Summary of changes:
/dev/sda1: partition will be left alone
/dev/sda2: partition will be resized from 9.5G to 6.5G
Copying /dev/sda1 ...
Copying /dev/sda2 ...

After a test boot, the final guest worked(!)



New tool: virt-resize

Virt-resize lets you resize existing virtual machines (not live however).

$ rm -f /tmp/centos.img
$ truncate -s 12G /tmp/centos.img
$ virt-resize --resize sda1=200% --resize sda2=11.2G \
    /dev/vg_trick/CentOS5x32 /tmp/centos.img
Summary of changes:
/dev/sda1: partition will be resized from 101.9M to 203.9M
/dev/sda2: partition will be resized from 9.9G to 11.2G
There is a surplus of 644971316 bytes (615.1M).
An extra partition will be created for the surplus.
Copying /dev/sda1 ... done
Copying /dev/sda2 ... done
$ rm -f /tmp/centos.img
$ truncate -s 12G /tmp/centos.img
$ virt-resize --resize sda1=200% --expand sda2 \
    /dev/vg_trick/CentOS5x32 /tmp/centos.img 
Summary of changes:
/dev/sda1: partition will be resized from 101.9M to 203.9M
/dev/sda2: partition will be resized from 9.9G to 11.8G
Copying /dev/sda1 ... done
Copying /dev/sda2 ... done

After some discussion on the list we decided to start with a simple / primitive tool and work upwards. So virt-resize as implemented now will not resize filesystems and PVs. You have to do that as a separate step after running the tool (either running pvresize/resize2fs in the guest or using guestfish for offline changes).

The new tool’s man page after the cut.


Is ext2/3/4 faster? On LVM?

This question arose at work — is LVM a performance penalty compared to using straight partitions? To save you the trouble, the answer is “not really”. There is a very small penalty, but as with all benchmarks it does depend on what the benchmark measures versus what your real workload does. In any case, here is a small guestfish script you can use to compare the performance of various filesystems with or without LVM, with various operations. Whether you trust the results is up to you, but I would advise caution.

#!/bin/bash -

# Scratch disk image; this path is an assumption, adjust it to taste.
tmpfile=/var/tmp/fs-lvm-test.img

for fs in ext2 ext3 ext4; do
    for lvm in off on; do
        rm -f $tmpfile
        if [ $lvm = "on" ]; then
            guestfish <<EOF
              sparse $tmpfile 1G
              run
              part-disk /dev/sda efi
              pvcreate /dev/sda1
              vgcreate VG /dev/sda1
              lvcreate LV VG 800
              mkfs $fs /dev/VG/LV
EOF
            dev=/dev/VG/LV
        else # no LVM
            guestfish <<EOF
              sparse $tmpfile 1G
              run
              part-disk /dev/sda efi
              mkfs $fs /dev/sda1
EOF
            dev=/dev/sda1
        fi
        echo "fs=$fs lvm=$lvm"
        guestfish -a $tmpfile -m $dev <<EOF
          time fallocate /file1 200000000
          time cp /file1 /file2
EOF
    done
done

fs=ext2 lvm=off
elapsed time: 2.74 seconds
elapsed time: 4.52 seconds
fs=ext2 lvm=on
elapsed time: 2.60 seconds
elapsed time: 4.24 seconds
fs=ext3 lvm=off
elapsed time: 2.62 seconds
elapsed time: 4.31 seconds
fs=ext3 lvm=on
elapsed time: 3.07 seconds
elapsed time: 4.79 seconds

# notice how ext4 is much faster at fallocate, because it
# uses extents

fs=ext4 lvm=off
elapsed time: 0.05 seconds
elapsed time: 3.54 seconds
fs=ext4 lvm=on
elapsed time: 0.05 seconds
elapsed time: 4.16 seconds
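To put the timings in proportion, here is a small sketch that turns a pair of "lvm=off" / "lvm=on" times into a percentage change. The `overhead_pct` helper is hypothetical and the inputs are the cp timings printed above. They are single runs, so treat the percentages as noisy.

```shell
#!/bin/bash
# overhead_pct OFF ON
# Percentage change in elapsed time when going from lvm=off to lvm=on.
# Positive means slower with LVM, negative means faster.
overhead_pct() {
    awk -v off="$1" -v on="$2" \
        'BEGIN { printf "%+.1f%%\n", (on - off) / off * 100 }'
}

# cp timings (second line of each pair) from the results above.
echo "ext2 cp: $(overhead_pct 4.52 4.24)"
echo "ext3 cp: $(overhead_pct 4.31 4.79)"
echo "ext4 cp: $(overhead_pct 3.54 4.16)"
```

The differences go in both directions (ext2 is actually faster with LVM in this run), which is another reason to be cautious about reading too much into them.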

