Tag Archives: kvm
Regular readers of this blog will of course be familiar with the joys of virtualization. One of those joys is nested virtualization — running a virtual machine in a virtual machine. Nested KVM is a thing too — that is, emulating the virtualization extensions in the CPU so that the second level guest gets at least some of the acceleration benefits that a normal first level guest would get.
My question is: How deeply can you nest KVM?
This is not so easy to test at the moment, so I’ve created a small project / disk image which when booted on KVM will launch a nested guest, which launches a nested guest, and so on until (usually) the host crashes, or you run out of memory, or your patience is exhausted by the poor performance of nested KVM.
The answer, by the way, is just 3 levels [on AMD hardware], which is rather disappointing. Hopefully this will encourage the developers to take a closer look at the bugs in nested virt.
Git repo: http://git.annexia.org/?p=supernested.git;a=summary
Binary images: http://oirase.annexia.org/supernested/
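Before booting the image it may help to check whether nested virtualization is enabled on the host at all. The following is a minimal sketch, assuming the usual sysfs locations for the kvm_intel and kvm_amd module parameters; the helper name nested_status and its optional root-directory argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether nested virtualization is enabled for a KVM module.
# nested_status and its second argument are hypothetical, for illustration.

nested_status() {
    # $1: module name (kvm_intel or kvm_amd)
    # $2: optional alternative root directory, handy for testing
    param="${2:-}/sys/module/$1/parameters/nested"
    if [ ! -r "$param" ]; then
        # Module not loaded, or parameter not exposed
        echo "unknown"
        return 0
    fi
    case "$(cat "$param")" in
        Y|1) echo "enabled" ;;
        *)   echo "disabled" ;;
    esac
}

for mod in kvm_intel kvm_amd; do
    echo "$mod: $(nested_status "$mod")"
done
```

A result of "unknown" simply means the module is not loaded on this host.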
How does this work?
Building a simple appliance is easy. I’m using supermin to do that.
The problem is how does the appliance run another appliance? How do you put the same appliance inside the appliance? Obviously that’s impossible (right?)
The way it works is inside the Lx hypervisor it runs the L(x+1) qemu on /dev/sda, with a protective overlay stored in memory so we don’t disrupt the Lx hypervisor. Since /dev/sda literally is the appliance disk image, this all kinda works.
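As a rough illustration of the idea (not the project’s actual script), the helper below prints a qemu command line that boots the next level from /dev/sda with snapshot=on, so that writes go to a temporary overlay rather than to the disk itself. The helper name, the memory argument and the exact option set are assumptions; whether the overlay really lives in memory depends on where qemu’s temporary directory is inside the appliance.

```shell
#!/bin/sh
# Sketch only: print (do not run) a qemu command line for the next nesting
# level. next_level_cmd and its argument are hypothetical names.

next_level_cmd() {
    # $1: memory in MB to give the L(x+1) guest
    # snapshot=on makes qemu write changes to a temporary overlay
    # instead of to /dev/sda itself.
    printf '%s\n' "qemu-system-x86_64 -machine accel=kvm -cpu host -m $1 -nographic -drive file=/dev/sda,format=raw,snapshot=on"
}

next_level_cmd 512
```

Printing the command instead of running it keeps the sketch safe to try on any machine.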
This is mostly adapted from this long thread on the VMware community site.
I got VMware ESXi 5.5.0 running on upstream KVM today.
First I had to disable the “VMware backdoor”. When VMware runs, it detects that qemu underneath is emulating this port and tries to use it to query the machine (instead of using CPUID and so on). Unfortunately qemu’s emulation of the VMware backdoor is very half-assed. There’s no way to disable it except to patch qemu:
    diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
    index eaf3e61..ca1c422 100644
    --- a/hw/i386/pc_piix.c
    +++ b/hw/i386/pc_piix.c
    @@ -204,7 +204,7 @@ static void pc_init1(QEMUMachineInitArgs *args,
         pc_vga_init(isa_bus, pci_enabled ? pci_bus : NULL);
     
         /* init basic PC hardware */
    -    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, xen_enabled(),
    +    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, 1,
             0x4);
     
         pc_nic_init(isa_bus, pci_bus);
It would be nice if this was configurable in qemu. This is now being fixed upstream.
Secondly I had to tell KVM to ignore accesses to MSRs that it doesn’t emulate (the ignore_msrs setting). This is, unfortunately, a machine-wide setting:
    # echo 1 > /sys/module/kvm/parameters/ignore_msrs
    # cat /sys/module/kvm/parameters/ignore_msrs
    Y
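To check the current state from a script, something like the following sketch could be used. The function name msrs_ignored and its optional path argument are illustrative; only the sysfs path comes from the commands above.

```shell
#!/bin/sh
# Sketch: report whether KVM is set to ignore unhandled MSR accesses.
# msrs_ignored is a hypothetical helper name.

msrs_ignored() {
    # $1: optional path to the parameter file (defaults to the sysfs path)
    f="${1:-/sys/module/kvm/parameters/ignore_msrs}"
    [ -r "$f" ] || return 2          # kvm module not loaded
    case "$(cat "$f")" in
        Y|1) return 0 ;;
        *)   return 1 ;;
    esac
}

if msrs_ignored; then
    echo "ignore_msrs is on"
else
    echo "ignore_msrs is off, or the kvm module is not loaded"
fi
```

The echo into sysfs does not survive a reboot. A line such as "options kvm ignore_msrs=1" in a file under /etc/modprobe.d/ is the usual way to make a module parameter persistent.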
Thirdly I had to give the ESXi virtual machine an IDE disk and an
network card. Note also that ESXi requires ≥ 2 vCPUs and at least 2 GB of RAM.
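Put together, those constraints might translate into a qemu command line along the lines of this sketch. It only prints the command; the disk image path and the NIC device model are placeholders supplied by the caller, and the helper name esxi_cmd is made up.

```shell
#!/bin/sh
# Sketch only: print a qemu command line with an IDE disk, 2 vCPUs and
# 2 GB of RAM. esxi_cmd is a hypothetical helper; both arguments are
# placeholders, not values taken from the post.

esxi_cmd() {
    # $1: disk image path, $2: qemu NIC device model
    printf '%s\n' "qemu-system-x86_64 -machine accel=kvm -cpu host -smp 2 -m 2048 -drive file=$1,if=ide -netdev user,id=net0 -device $2,netdev=net0"
}

esxi_cmd /path/to/esxi.img NIC_MODEL
```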
I managed to get KVM working on the Cubietruck last week. It’s not exactly simple, but this post describes in overview how to do it.
(1) You will need a Cubietruck, a CP2102 serial cable, a micro SDHC card, a card reader for your host computer, and a network patch cable (the board supports wifi but it doesn’t work with the newer kernel we’ll be using). Optional: 2.5″ SATA HDD or SSD.
(2) Start with Hans De Goede’s AllWinner remix of Fedora 19, and get that working. It’s important to read his README file carefully.
    make oldconfig
    make menuconfig
In menuconfig, enable Large Physical Address Extension (LPAE), and then enable KVM in the Virtualization menu.
    LOADADDR=0x40008000 make uImage dtbs
    make modules
    sudo cp arch/arm/boot/uImage /boot/uImage.sunxi-test
    sudo cp arch/arm/boot/dts/sun7i-a20-cubietruck.dtb /boot/sun7i-a20-cubietruck.dtb.sunxi-test
    sudo make modules_install
Reboot, interrupt u-boot (using the serial console), and type the following commands to load the new kernel:
    setenv bootargs console=ttyS0,115200 loglevel=9 earlyprintk ro rootwait root=/dev/mmcblk0p3
    ext2load mmc 0 0x46000000 uImage.sunxi-test
    ext2load mmc 0 0x4b000000 sun7i-a20-cubietruck.dtb.sunxi-test
    env set fdt_high ffffffff
    bootm 0x46000000 - 0x4b000000
(4) Build this modified u-boot which supports Hyp mode.
    make cubietruck_config
    make
sudo dd if=u-boot-sunxi-with-spl.bin of=/dev/YOURSDCARD bs=1024 seek=8
Reboot again, use the commands above to boot into the upstream kernel, and if everything worked you should see:
    Brought up 2 CPUs
    SMP: Total of 2 processors activated.
    CPU: All CPU(s) started in HYP mode.
    CPU: Virtualization extensions available.
/dev/kvm should exist.
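Both checks can be scripted. This is a small sketch, assuming the kernel log is readable through dmesg; the function name started_in_hyp is illustrative.

```shell
#!/bin/sh
# Sketch: look for the HYP-mode boot message and for the /dev/kvm device.
# started_in_hyp is a hypothetical helper name.

started_in_hyp() {
    # Reads a kernel log on stdin; succeeds if the HYP-mode line is present.
    grep -q 'All CPU(s) started in HYP mode'
}

if dmesg 2>/dev/null | started_in_hyp; then
    echo "CPUs started in HYP mode"
else
    echo "HYP mode message not found in dmesg"
fi

if [ -c /dev/kvm ]; then
    echo "/dev/kvm exists"
else
    echo "/dev/kvm is missing"
fi
```

If the message is missing, the likely cause is that the board was booted with a u-boot that does not start the CPUs in Hyp mode.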
Hack QEMU to create Cortex-A7 CPUs using this one-line patch.
Edit: dgilmore tells me this is no longer necessary. Instead make sure you use the qemu -cpu host option.
Then you should be able to create VMs using libvirt. Note if using libguestfs you will need to use the direct backend (LIBGUESTFS_BACKEND=direct) because of this libvirt bug.
There are lots of cloud disk images floating around. They are designed to run in clouds where a boot-time network service provides the initial configuration, which cloud-init inside the guest fetches and applies. If that service is not present, or you’re just trying to boot these images in KVM/libvirt directly without any cloud, then things can go wrong.
Luckily it’s fairly easy to create a config disk (aka “seed disk”) which you attach to the guest and then let cloud-init in the guest get its configuration from there. No cloud, or even network, required.
I’m going to use a tool called virt-make-fs to make the config disk, as it’s easy to use and doesn’t require root. There are other tools around, eg. make-seed-disk which do a similar job. (NB: You might hit this bug in virt-make-fs, which should be fixed in the latest version).
I’m also using a cloud image downloaded from the Fedora project, but any cloud image should work.
First I create my cloud-init metadata. This consists of two files.
meta-data contains host and network configuration:
    instance-id: iid-123456
    local-hostname: cloudy
user-data contains other custom configuration (note #cloud-config is not a comment, it’s a directive to tell cloud-init the format of the file):
    #cloud-config
    password: 123456
    runcmd:
     - [ useradd, -m, -p, "", rjones ]
     - [ chage, -d, 0, rjones ]
(The idea behind this split is probably not obvious, but apparently it’s because the meta-data is meant to be supplied by the Cloud, and the user-data is meant to be supplied by the Cloud’s customer. In this case, no cloud, so we’re going to supply both!)
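If you create seed disks often, writing the two files can be scripted. The sketch below reproduces the contents shown above; the function name write_seed_files and its arguments are made up for this example.

```shell
#!/bin/sh
# Sketch: write the two cloud-init files into a directory.
# write_seed_files and its arguments are hypothetical names.

write_seed_files() {
    # $1: target directory, $2: instance id, $3: hostname
    mkdir -p "$1" || return 1
    cat > "$1/meta-data" <<EOF
instance-id: $2
local-hostname: $3
EOF
    # The quoted 'EOF' stops the shell from expanding anything in user-data.
    cat > "$1/user-data" <<'EOF'
#cloud-config
password: 123456
runcmd:
 - [ useradd, -m, -p, "", rjones ]
 - [ chage, -d, 0, rjones ]
EOF
}

seed_dir=$(mktemp -d)
write_seed_files "$seed_dir" iid-123456 cloudy
ls "$seed_dir"
```

The #cloud-config line has to stay as the first line of user-data, which is why the here-document starts with it.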
I put these two files into a directory, and run virt-make-fs to create the config disk:
    $ ls
    meta-data  user-data
    $ virt-make-fs --type=msdos --label=cidata . /tmp/seed.img
    $ virt-filesystems -a /tmp/seed.img --all --long -h
    Name      Type        VFS   Label   MBR  Size  Parent
    /dev/sda  filesystem  vfat  cidata  -    286K  -
    /dev/sda  device      -     -       -    286K  -
Now I need to pass some kernel options when booting the Fedora cloud image, and the only way to do that is if I boot from an external kernel & initrd. This is not as complicated as it sounds, and virt-builder has an option to get the kernel and initrd that I’m going to need:
    $ virt-builder --get-kernel Fedora-cloud.raw
    download: /boot/vmlinuz-3.9.5-301.fc19.x86_64 -> ./vmlinuz-3.9.5-301.fc19.x86_64
    download: /boot/initramfs-3.9.5-301.fc19.x86_64.img -> ./initramfs-3.9.5-301.fc19.x86_64.img
Finally I’m going to boot the guest using KVM (you could also use libvirt with a little extra effort):
    $ qemu-kvm -m 1024 \
        -drive file=Fedora-cloud.raw,if=virtio \
        -drive file=seed.img,if=virtio \
        -kernel ./vmlinuz-3.9.5-301.fc19.x86_64 \
        -initrd ./initramfs-3.9.5-301.fc19.x86_64.img \
        -append 'root=/dev/vda1 ro ds=nocloud-net'
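The kernel and initrd file names change with every kernel update, so a wrapper script may want to derive one from the other. This sketch assumes the vmlinuz-VERSION and initramfs-VERSION.img naming seen in the virt-builder output above; initrd_for is a made-up helper name.

```shell
#!/bin/sh
# Sketch: derive the matching initramfs file name from a vmlinuz file name.
# initrd_for is a hypothetical helper; the naming scheme is an assumption
# based on the Fedora file names shown above.

initrd_for() {
    # $1: kernel file name or path, e.g. ./vmlinuz-3.9.5-301.fc19.x86_64
    dir=$(dirname "$1")
    version=${1##*/vmlinuz-}      # strip any leading path and the prefix
    version=${version#vmlinuz-}   # also handle a bare file name
    printf '%s/initramfs-%s.img\n' "$dir" "$version"
}

initrd_for ./vmlinuz-3.9.5-301.fc19.x86_64
```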
You’ll be able to log in either as fedora/123456 or rjones (no password), and you should see that the hostname has been set to cloudy.
I did some performance tests on the User-Mode Linux backend compared to the ordinary KVM-based appliance and the results are quite interesting.
The first test is to run the C API test suite using UML and KVM on baremetal. All times are in seconds, averaged over a few runs:
tests/c-api (baremetal) — UML: 630 — KVM: 332
UML is roughly half the speed, but do remember that the test is very system-call intensive, which is one of the worst cases for UML.
The same test again, but performed inside a KVM virtual machine (on the same hardware):
tests/c-api (virtualized) — UML: 334 — KVM: 961
The results of this are so surprising I went back and retested everything several times, but this is completely reproducible. UML runs the C API test suite about twice as fast virtualized as on baremetal.
KVM (no surprise) runs about 3 times slower. Inside the VM there is no hardware virtualization, and so qemu-kvm has to fall back on TCG software emulation of everything.
One conclusion you might draw from this is that UML could be a better choice of backend if you want to use libguestfs inside a VM (eg. in the cloud). As always, you should measure your own workload.
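For reference, the ratios behind those statements can be worked out from the timings above. The ratio helper is just an illustrative wrapper around awk.

```shell
#!/bin/sh
# Worked arithmetic for the C API test-suite timings quoted above.
# ratio is a hypothetical helper: it prints a/b to two decimals.

ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "baremetal, UML vs KVM:         $(ratio 630 332)"
echo "KVM, virtualized vs baremetal: $(ratio 961 332)"
echo "UML, baremetal vs virtualized: $(ratio 630 334)"
```

This gives roughly 1.9, 2.9 and 1.9 respectively.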
The second test is of start-up times. If you want to use libguestfs to process a lot of disk images, this matters.
start-up (baremetal) — UML: 3.9 — KVM: 3.7
start-up (virtualized) — UML: 3.0 — KVM: 8-11
The start-up time of KVM virtualized was unstable, but appeared to be around 3 times slower than on baremetal. UML performs about the same in both cases.
A couple of conclusions that I take from this:
(1) Most of the time is now spent initializing the appliance, searching for LVM and RAID and so on. The choice of hypervisor makes no difference. This is never going to go away, even if libguestfs were rewritten to use (eg) containers, or if libguestfs linked directly to kernel code. It just takes this time for this kernel & userspace LVM/MD/filesystem code to initialize.
(2) The overhead of starting a KVM VM is not any different from starting a big Linux application. This is no surprise for people who have used KVM for a long time, but it’s counter-intuitive for most people who think that VMs “must” be heavyweight compared to ordinary processes.
The third test is of uploading data from the host into a disk image. I created a 1 GB disk image containing an ext2 filesystem, and I timed how long it took to upload 500 MB of data to a file on this filesystem.
upload (baremetal) — UML: 147 — KVM: 16
upload (virtualized) — UML: 149 — KVM: 73
KVM is predictably much slower when no hardware virtualization is available, by a factor of about 4.5 times.
UML is overall far slower than KVM, but it is at least consistent.
In order to work out why UML is so much slower, I wanted to find out if it was because of the emulated serial port that we push the data through, or because writes to the disk are slow, so I carried out some extra tests:
upload-no-write (baremetal) — UML: 141 — KVM: 11
upload-no-write (virtualized) — UML: 140 — KVM: 20
write-no-upload (baremetal) — UML: 7 — KVM: 13
write-no-upload (virtualized) — UML: 9 — KVM: 25
My conclusion is that the UML emulated serial device is over 10 times slower than KVM’s virtio-serial. This is a problem, but at least it’s a well-defined problem that the UML team can fix, and virtio-serial is an existing example showing that it’s possible to do much better.
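To put that in more familiar units, here is the rough transfer rate implied by the baremetal upload-no-write timings. This assumes those runs pushed the same 500 MB as the main upload test; mb_per_sec is an illustrative helper.

```shell
#!/bin/sh
# Worked arithmetic: transfer rate implied by the upload-no-write
# baremetal timings, assuming a 500 MB upload.
# mb_per_sec is a hypothetical helper name.

mb_per_sec() {
    # $1: megabytes transferred, $2: seconds taken
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", mb / s }'
}

echo "UML serial:        $(mb_per_sec 500 141) MB/s"
echo "KVM virtio-serial: $(mb_per_sec 500 11) MB/s"
```

Under that assumption UML manages about 3.5 MB/s against about 45 MB/s for KVM.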
Finally, notice that UML appears faster than KVM at writes.
In fact what’s happening is a difference in caching modes: For safety, libguestfs forces KVM to bypass the host disk cache. This ensures that modifications made to disk images remain consistent even if there is a sudden power failure.
The UML backend currently uses the host cache, so the writes weren’t hitting the disk before the test finished (this is in fact a bug in UML since libguestfs performs an fsync inside the appliance, which UML does not honour).
As always with benchmarks, the moral is to take everything with a pinch of salt and measure your workloads!