Tag Archives: virtualization

Super-nested KVM

Regular readers of this blog will of course be familiar with the joys of virtualization. One of those joys is nested virtualization — running a virtual machine in a virtual machine. Nested KVM is a thing too — that is, emulating the virtualization extensions in the CPU so that the second level guest gets at least some of the acceleration benefits that a normal first level guest would get.

My question is: How deeply can you nest KVM?

This is not so easy to test at the moment, so I’ve created a small project / disk image which when booted on KVM will launch a nested guest, which launches a nested guest, and so on until (usually) the host crashes, or you run out of memory, or your patience is exhausted by the poor performance of nested KVM.

The answer, by the way, is just 3 levels [on AMD hardware], which is rather disappointing. Hopefully this will encourage the developers to take a closer look at the bugs in nested virt.

Git repo: http://git.annexia.org/?p=supernested.git;a=summary
Binary images: http://oirase.annexia.org/supernested/

How does this work?

Building a simple appliance is easy. I’m using supermin to do that.

The problem is how does the appliance run another appliance? How do you put the same appliance inside the appliance? Obviously that’s impossible (right?)

The way it works is inside the Lx hypervisor it runs the L(x+1) qemu on /dev/sda, with a protective overlay stored in memory so we don’t disrupt the Lx hypervisor. Since /dev/sda literally is the appliance disk image, this all kinda works.


Filed under Uncategorized

virt-log now supports the Windows Event Log

New virt tool virt-log now supports the Windows Event Log. If you have a recent Windows guest you can display the System event log by doing:

$ virt-log -d Win8 | less

What you will see is a very long XML file.

This requires an Evtx parser. I have now chosen this library for Fedora (it needs a reviewer, as you can see). The code is sensible and maintained.

It also only works for Windows ≥ Vista, because Microsoft completely rewrote the way that log files are stored, from one strange binary format to another strange binary format [so a little different from the systemd journal …].

As usual, patches to virt-log to support other guest operating systems are welcome.

Leave a comment

Filed under Uncategorized

New in libguestfs: virt-log

In libguestfs ≥ 1.27.17, there’s a new tool called virt-log for displaying the log files from a disk image or virtual machine:

$ virt-log -a disk.img | less

Previously you could write:

$ virt-cat -a disk.img /var/log/messages

That worked for some Linux guests, but several things happened:

Virt-log is designed to do the right thing automatically (although at the moment Windows support is not finished). In particular it will automatically decode and display the systemd journal, and it knows the different locations that some Linux distros store their plain text log files.


Filed under Uncategorized

Cluster performance: baseline testing

I’m using fio (as recommended by Linus!) to baseline test my virtualization cluster. My fio script is supposed to look a bit like a qemu process:


It has 4 large “disks” (size=1g) and 4 large “qemu processes” (numjobs=4) running in parallel. Each test thread can have up to 4 IOs in flight (iodepth=4) and the size of IOs is 64K which matches qcow2 default cluster size. I enabled O_DIRECT (direct=1) because we normally use qemu cache=none so that live migration works.

The first node now has a RAID 1 array of spinning rust (hard disks) and a smaller SSD, and the plan is to use LVM-cache so the SSD can sit on top of the RAID array.

Performance of the RAID 1 array of hard disks

The raw performance of the RAID 1 array (this includes the filesystem) is fairly dismal:


Performance of the SSD

The SSD in contrast does a lot better:


However you want to look at the details, the fact is that the test runs 11 times faster on the SSD.

The effect of NFS

What about when we NFS-mount the RAID array or the SSD on another node? This should tell us the effect of NFS.


NFS makes this test run 3 times slower.

For the NFS-mounted SSD:


NFS makes this test run 4.5 times slower.

The effect of virtualization

By running the virtual machine on the first node (with the disks) it should be possible to see just the effect of virtualization. Since this is backed by the RAID 1 hard disk array, not SSDs or LVM cache, it should be compared only to the RAID 1 performance.


The effect of virtualization (virtio-scsi in this case) is about an 8% drop in performance, which is not something I’m going to worry about.


  • The gains from the SSD (ie. using LVM cache) could outweigh the losses from having to use NFS to share the disk images.
  • It’s worth looking at alternate high bandwidth, low-latency interconnects (instead of 1 gigE) to make NFS perform better. I’m going to investigate using Infiniband soon.

These are just the baseline measurements without LVM cache.

I’ve included links to the full test results. fio gives a huge amount of detail, and it’s helpful to keep the HOWTO open so you can understand all the figures it is producing.

1 Comment

Filed under Uncategorized

Using LVM’s new cache feature

If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs to act as fast persistent caches to speed up access to the hard disk, then until recently you had three choices: bcache and dm-cache are both upstream, or Flashcache/EnhanceIO. Flashcache is not upstream. dm-cache required you to first sit down with a calculator to compute block offsets. bcache was the sanest of the three choices.

But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.

The Set-up

To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:


There are two 2 TB WD hard disks in mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.

In other news: wow, SSDs from brand manufacturers are getting really cheap now!

In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:

# lsblk
NAME                                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                        8:0    0   1.8T  0 disk  
└─sda1                                     8:1    0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdb                                        8:16   0   1.8T  0 disk  
└─sdb1                                     8:17   0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdc                                        8:32   0 232.9G  0 disk  
└─sdc1                                     8:33   0 232.9G  0 part  


Before starting to set up the caching layer, let’s find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (ie. they are done on files on a filesystem, not on the raw block devices). I also used O_DIRECT.

HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec

Note these numbers don’t show the real benefit of SSDs — namely that performance doesn’t collapse as soon as you randomly access the disk.


The lvmcache(7) [so new there is no copy online yet] documentation defines various terms that I will use:

origin LV           OriginLV      large slow LV
cache data LV       CacheDataLV   small fast LV for cache pool data
cache metadata LV   CacheMetaLV   small fast LV for cache pool metadata
cache pool LV       CachePoolLV   CacheDataLV + CacheMetaLV
cache LV            CacheLV       OriginLV + CachePoolLV

Creating the LVs

Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:

# lvcreate -L 100G -n testoriginlv vg_guests
  Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv

Also note that resizing cached LVs is not currently supported (coming later — for now you can work around it by removing the cache, resizing, then recreating the cache).

Creating the cache layer

What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks — it simply doesn’t work otherwise.

Therefore my first step is to extend my existing VG to include the fast disk:

# vgextend vg_guests /dev/sdc1
  Volume group "vg_guests" successfully extended

I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, but a minimum of 8MB. Since my total available fast space is 232GB, and I want a 1000:1 split, I choose a generous 1GB for CacheMetaLV, 229G for CacheDataLV, and that will leave some left over space (my eventual split turns out to be 229:1).

# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
  Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
  Logical volume "lv_cache" created
# lvs
  LV                     VG        Attr       LSize
  lv_cache               vg_guests -wi-a----- 229.00g
  lv_cache_meta          vg_guests -wi-a-----   1.00g
  testoriginlv           vg_guests -wi-a----- 100.00g
# pvs
  PV         VG        Fmt  Attr PSize   PFree  
  /dev/md127 vg_guests lvm2 a--    1.82t 932.89g
  /dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g

(You’ll notice that my cache is bigger than my test OriginLV, but that’s fine as once I’ve worked out all the gotchas, my real OriginLV will be over 1 TB).

Why did I leave 2.88GB of free space in the PV? I’m not sure actually. However the first time I did this, I didn’t leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex’s comment below.

Convert the CacheDataLV and CacheMetaLV into a “cache pool”:

# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
  Logical volume "lvol0" created
  Converted vg_guests/lv_cache to cache pool.

Now attach the cache pool to the OriginLV to create the final cache object:

# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
  vg_guests/testoriginlv is now cached.


Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:

LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec

Which is exactly the same as the backing hard disk.

Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn’t a useful test of dm-cache.

What I’m going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).


Filed under Uncategorized

libguestfs RHEL 7.1 preview packages (yes, really)

RHEL 7 isn’t out yet, but if you’re using the the RHEL 7 RC, you’re on one of our beta programs, or you can wait for RHEL or CentOS 7.0 to be released, then you can upgrade libguestfs with these RHEL 7.1 libguestfs preview packages.

Amongst the new features:

Leave a comment

Filed under Uncategorized

Notes on getting VMware ESXi to run under KVM

This is mostly adapted from this long thread on the VMware community site.

I got VMware ESXi 5.5.0 running on upstream KVM today.

First I had to disable the “VMware backdoor”. When VMware runs, it detects that qemu underneath is emulating this port and tries to use it to query the machine (instead of using CPUID and so on). Unfortunately qemu’s emulation of the VMware backdoor is very half-assed. There’s no way to disable it except to patch qemu:

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index eaf3e61..ca1c422 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -204,7 +204,7 @@ static void pc_init1(QEMUMachineInitArgs *args,
     pc_vga_init(isa_bus, pci_enabled ? pci_bus : NULL);
     /* init basic PC hardware */
-    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, xen_enabled(),
+    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, 1,
     pc_nic_init(isa_bus, pci_bus);

It would be nice if this was configurable in qemu. This is now being fixed upstream.

Secondly I had to turn off MSR emulation. This is, unfortunately, a machine-wide setting:

# echo 1 > /sys/module/kvm/parameters/ignore_msrs
# cat /sys/module/kvm/parameters/ignore_msrs

Thirdly I had to give the ESXi virtual machine an IDE disk and an e1000 vmxnet3 network card. Note also that ESXi requires ≥ 2 vCPUs and at least 2 GB of RAM.

Screenshot - 190514 - 16:04:38

Screenshot - 190514 - 16:21:09


Filed under Uncategorized