Tag Archives: performance

Libguestfs appliance boot in under 600ms

$ ./run ./utils/boot-benchmark/boot-benchmark
Warming up the libguestfs cache ...
Running the tests ...

test version: libguestfs 1.33.28
 test passes: 10
host version: Linux moo.home.annexia.org 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    host CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
     backend: direct               [to change set $LIBGUESTFS_BACKEND]
        qemu: /home/rjones/d/qemu/x86_64-softmmu/qemu-system-x86_64 [to change set $LIBGUESTFS_HV]
qemu version: QEMU emulator version 2.5.94, Copyright (c) 2003-2008 Fabrice Bellard
         smp: 1                    [to change use --smp option]
     memsize: 500                  [to change use --memsize option]
      append:                      [to change use --append option]

Result: 575.9ms ±5.3ms

There are various tricks here:

  1. I’m using the (still!) non-upstream qemu DMA patches.
  2. I’ve compiled my own very minimal guest Linux kernel.
  3. I’m using my nearly upstream "crypto: Add a flag allowing the self-tests to be disabled at runtime." patch.
  4. I’ve got two sets of non-upstream libguestfs patches (1, 2).
  5. I am not using libvirt, but if you do want to use libvirt, make sure you use the very latest version since it contains an important performance patch.
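The bracketed hints in the benchmark output above can be combined. For example (the qemu path here is a placeholder; `LIBGUESTFS_BACKEND`, `LIBGUESTFS_HV` and the `--smp`/`--memsize` options are as shown in the output above):

```shell
# Re-run the benchmark with a custom hypervisor and different resources.
export LIBGUESTFS_BACKEND=direct
export LIBGUESTFS_HV=/path/to/qemu-system-x86_64
./run ./utils/boot-benchmark/boot-benchmark --smp 2 --memsize 1024
```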



Filed under Uncategorized

Super-nested KVM

Regular readers of this blog will of course be familiar with the joys of virtualization. One of those joys is nested virtualization — running a virtual machine in a virtual machine. Nested KVM is a thing too — that is, emulating the virtualization extensions in the CPU so that the second level guest gets at least some of the acceleration benefits that a normal first level guest would get.

My question is: How deeply can you nest KVM?

This is not so easy to test at the moment, so I’ve created a small project / disk image which when booted on KVM will launch a nested guest, which launches a nested guest, and so on until (usually) the host crashes, or you run out of memory, or your patience is exhausted by the poor performance of nested KVM.

The answer, by the way, is just 3 levels [on AMD hardware], which is rather disappointing. Hopefully this will encourage the developers to take a closer look at the bugs in nested virt.
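If you want to try this yourself, nested support first has to be enabled in the host’s KVM module. A typical host setup looks like this (on AMD hardware, kvm_amd has nested=1 by default on recent kernels):

```shell
# Enable nested virtualization in KVM on Intel hardware.
sudo modprobe -r kvm_intel
sudo modprobe kvm_intel nested=1
# Check that it took effect; prints Y (or 1) when enabled.
cat /sys/module/kvm_intel/parameters/nested
```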

Git repo: http://git.annexia.org/?p=supernested.git;a=summary
Binary images: http://oirase.annexia.org/supernested/

How does this work?

Building a simple appliance is easy. I’m using supermin to do that.

The problem is: how does the appliance run another appliance? How do you put the same appliance inside the appliance? Obviously that’s impossible (right?)

The way it works is that inside the Lx hypervisor it runs the L(x+1) qemu on /dev/sda, with a protective overlay stored in memory so we don’t disrupt the Lx hypervisor. Since /dev/sda literally is the appliance disk image, this all kinda works.
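A sketch of what that looks like (the real supernested scripts may differ in detail): qemu’s snapshot=on keeps all writes in a temporary overlay rather than touching the disk, and since the appliance’s temporary space is a tmpfs, that overlay lives in RAM.

```shell
# Hypothetical launch of the L(x+1) guest from inside the Lx appliance.
# -cpu host exposes the virtualization extensions to the next level;
# snapshot=on protects /dev/sda (this appliance's own disk image).
qemu-system-x86_64 \
    -machine accel=kvm -cpu host -m 512 \
    -drive file=/dev/sda,snapshot=on \
    -serial stdio -display none
```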



Cluster performance: baseline testing

I’m using fio (as recommended by Linus!) to baseline test my virtualization cluster. My fio script is supposed to look a bit like a qemu process:


It has 4 large “disks” (size=1g) and 4 large “qemu processes” (numjobs=4) running in parallel. Each test thread can have up to 4 IOs in flight (iodepth=4), and the size of IOs is 64K, which matches the qcow2 default cluster size. I enabled O_DIRECT (direct=1) because we normally use qemu cache=none so that live migration works.
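From that description, the job file would look something like this (the ioengine, the rw pattern and the directory are my guesses; the other parameters are the ones named above):

```ini
; Hypothetical reconstruction of the fio job file described in the text.
[global]
ioengine=libaio
; O_DIRECT, as qemu cache=none would use
direct=1
; 64K matches the qcow2 default cluster size
bs=64k
; up to 4 IOs in flight per test thread
iodepth=4
; each "disk" is 1 GB
size=1g
; mixed random reads and writes (guess)
rw=randrw

[vm]
; four "qemu processes" running in parallel
numjobs=4
; filesystem under test (guess)
directory=/mnt/test
```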

The first node now has a RAID 1 array of spinning rust (hard disks) and a smaller SSD, and the plan is to use LVM-cache so the SSD can sit on top of the RAID array.

Performance of the RAID 1 array of hard disks

The raw performance of the RAID 1 array (this includes the filesystem) is fairly dismal:


Performance of the SSD

The SSD in contrast does a lot better:


However you want to look at the details, the fact is that the test runs 11 times faster on the SSD.

The effect of NFS

What about when we NFS-mount the RAID array or the SSD on another node? This should tell us the effect of NFS.


NFS makes this test run 3 times slower.

For the NFS-mounted SSD:


NFS makes this test run 4.5 times slower.

The effect of virtualization

By running the virtual machine on the first node (with the disks) it should be possible to see just the effect of virtualization. Since this is backed by the RAID 1 hard disk array, not SSDs or LVM cache, it should be compared only to the RAID 1 performance.


The effect of virtualization (virtio-scsi in this case) is about an 8% drop in performance, which is not something I’m going to worry about.

Conclusions:
  • The gains from the SSD (ie. using LVM cache) could outweigh the losses from having to use NFS to share the disk images.
  • It’s worth looking at alternate high bandwidth, low-latency interconnects (instead of 1 gigE) to make NFS perform better. I’m going to investigate using Infiniband soon.

These are just the baseline measurements without LVM cache.

I’ve included links to the full test results. fio gives a huge amount of detail, and it’s helpful to keep the HOWTO open so you can understand all the figures it is producing.


Performance of User-Mode Linux as a libguestfs backend

As of libguestfs 1.23.16, the User-Mode Linux backend is now a supported feature upstream, meaning that at least it gets tested fully for each release.
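The backend is selected through environment variables, something like the following (check guestfs(3) for the exact names in your version, since the variables were renamed around this era; the path is a placeholder):

```shell
# Point libguestfs at the User-Mode Linux backend.
# LIBGUESTFS_HV should be the UML "linux" binary.
export LIBGUESTFS_BACKEND=uml
export LIBGUESTFS_HV=/path/to/uml/linux
```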

I did some performance tests on the User-Mode Linux backend compared to the ordinary KVM-based appliance and the results are quite interesting.

The first test is to run the C API test suite using UML and KVM on baremetal. All times are in seconds, averaged over a few runs:

tests/c-api (baremetal) — UML: 630 — KVM: 332

UML is roughly half the speed, but do remember that the test is very system-call intensive, which is one of the worst cases for UML.

The same test again, but performed inside a KVM virtual machine (on the same hardware):

tests/c-api (virtualized) — UML: 334 — KVM: 961

The results of this are so surprising I went back and retested everything several times, but this is completely reproducible. UML runs the C API test suite about twice as fast virtualized as on baremetal.

KVM (no surprise) runs several times slower. Inside the VM there is no hardware virtualization, and so qemu-kvm has to fall back on TCG software emulation of everything.

One conclusion you might draw from this is that UML could be a better choice of backend if you want to use libguestfs inside a VM (eg. in the cloud). As always, you should measure your own workload.

The second test is of start-up times. If you want to use libguestfs to process a lot of disk images, this matters.

start-up (baremetal) — UML: 3.9 — KVM: 3.7
start-up (virtualized) — UML: 3.0 — KVM: 8-11

The start-up time of KVM virtualized was unstable, but appeared to be around 3 times slower than on baremetal. UML performs about the same in both cases.
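The start-up measurement is easy to reproduce; the usual trick (described in guestfs-performance(1)) is to launch the appliance against a null disk and exit immediately:

```shell
# Time a bare appliance boot: add a dummy drive, launch, do nothing, exit.
time guestfish -a /dev/null run
```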

A couple of conclusions that I take from this:

(1) Most of the time is now spent initializing the appliance (searching for LVM and RAID and so on), and the choice of hypervisor makes no difference. This would never go away even if libguestfs were rewritten to use (eg) containers, or linked directly to kernel code: it simply takes this long for the kernel and userspace LVM/MD/filesystem code to initialize.

(2) The overhead of starting a KVM VM is not any different from starting a big Linux application. This is no surprise for people who have used KVM for a long time, but it’s counter-intuitive for most people who think that VMs “must” be heavyweight compared to ordinary processes.

The third test is of uploading data from the host into a disk image. I created a 1 GB disk image containing an ext2 filesystem, and I timed how long it took to upload 500 MB of data to a file on this filesystem.
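Something equivalent can be reproduced from the shell with guestfish (file names here are made up):

```shell
# Create 500 MB of test data, then a 1 GB disk with an ext2 filesystem,
# and time uploading the data into it.
dd if=/dev/urandom of=/tmp/data bs=1M count=500
time guestfish -N fs:ext2:1G -m /dev/sda1 upload /tmp/data /data
```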

upload (baremetal) — UML: 147 — KVM: 16
upload (virtualized) — UML: 149 — KVM: 73

KVM is predictably much slower when no hardware virtualization is available, by a factor of about 4.5 times.

UML is overall far slower than KVM, but it is at least consistent.

In order to work out why UML is so much slower, I wanted to find out whether it was the emulated serial port that we push the data through, or slow writes to the disk. So I carried out some extra tests:

upload-no-write (baremetal) — UML: 141 — KVM: 11
upload-no-write (virtualized) — UML: 140 — KVM: 20
write-no-upload (baremetal) — UML: 7 — KVM: 13
write-no-upload (virtualized) — UML: 9 — KVM: 25

My conclusion is that the UML emulated serial device is over 10 times slower than KVM’s virtio-serial. This is a problem, but at least it’s a well-defined problem that the UML team can fix, and virtio-serial shows that it’s possible to do much better.

Finally, notice that UML appears faster than KVM at writes.

In fact what’s happening is a difference in caching modes: For safety, libguestfs forces KVM to bypass the host disk cache. This ensures that modifications made to disk images remain consistent even if there is a sudden power failure.

The UML backend currently uses the host cache, so the writes weren’t hitting the disk before the test finished (this is in fact a bug in UML since libguestfs performs an fsync inside the appliance, which UML does not honour).

As always with benchmarks, the moral is to take everything with a pinch of salt and measure your workloads!



Multiple libguestfs appliances in parallel, part 4

[Part 1, part 2, part 3.]

Finally I modified the test to do some representative work: We now load a real Windows XP guest, inspect it (a heavyweight operation), and mount and stat each filesystem. I won’t reproduce the entire test program again because only the test subroutine has changed:

sub test {
    my $g = Sys::Guestfs->new;
    $g->add_drive_ro ("/tmp/winxp.img");
    $g->launch ();

    # Inspect the guest (ignore the result).
    $g->inspect_os ();

    # Approximate what virt-df does.
    my %fses = $g->list_filesystems ();
    foreach (keys %fses) {
        my $mounted = 0;
        eval { $g->mount_ro ($_, "/"); $mounted = 1; };
        if ($mounted) {
            $g->statvfs ("/");
            $g->umount_all ();
        }
    }

    return $g;
}

Even with all that work going on, I was able to inspect more than 1 disk per second on my laptop, and run 60 threads in parallel with good performance and scalability:



Multiple libguestfs appliances in parallel, part 3

A problem encountered in part 2 was that I couldn’t measure the maximum number of parallel libguestfs appliances that can be run at the same time. There are two reasons for that. The simpler one is that libvirt has a limit of 20 connections, which is easily overcome by setting LIBGUESTFS_ATTACH_METHOD=appliance to eliminate libvirt and run qemu directly. The harder one is that by the time the last appliances in the test are starting to launch, earlier ones have already shut down and their threads have exited.

What is needed is for the test to work in two phases: In the first phase we start up all the threads and launch all the appliances. Only when this is complete do we enter the second phase where we shut down all the appliances.

The easiest way to do this is by modifying the test to use a barrier (or in fact to implement a barrier using the condition primitives). See the modified test script below.

With the modified test script I was able to run ≥ 110 and < 120 parallel appliances in ~ 13 GB of free RAM, or around 120 MB / appliance, still with excellent performance and nearly linear scalability:


#!/usr/bin/perl -w

use strict;
use threads qw(yield);
use threads::shared qw(cond_broadcast cond_wait lock);
use Sys::Guestfs;
use Time::HiRes qw(time);

my $nr_threads_launching :shared;

sub test {
    my $g = Sys::Guestfs->new;
    $g->add_drive_ro ("/dev/null");
    $g->launch ();
    return $g;
}

# Get everything into cache.
test (); test (); test ();

sub thread {
    my $g = test ();

    {
        lock ($nr_threads_launching);
        $nr_threads_launching--;
        cond_broadcast ($nr_threads_launching);
        cond_wait ($nr_threads_launching) until $nr_threads_launching == 0;
    }

    $g->close ();
}

# Test increasing numbers of threads until it fails.
for (my $nr_threads = 10; $nr_threads < 1000; $nr_threads += 10) {
    my $start_t = time ();

    $nr_threads_launching = $nr_threads;

    my @threads;
    foreach (1..$nr_threads) {
        push @threads, threads->create (\&thread);
    }
    foreach (@threads) {
        $_->join ();
        if (my $err = $_->error ()) {
            die "launch failed with $nr_threads threads: $err";
        }
    }

    my $end_t = time ();
    printf ("%d %.2f\n", $nr_threads, $end_t - $start_t);
}


Multiple libguestfs appliances in parallel, part 2

One problem with the previous test is that I hit a limit of 20 parallel appliances and mistakenly thought that I’d hit a memory limit. In fact libvirt out of the box limits the number of client connections to 20. You can adjust libvirt’s limit by editing /etc/libvirt/libvirtd.conf, but easier for us is to simply eliminate libvirt from the equation by doing:


$ export LIBGUESTFS_ATTACH_METHOD=appliance

which causes libguestfs to run qemu directly. In my first test I reached 48 parallel launches before I killed the program (because that’s a lot of parallelism and there seemed no end in sight). Scalability of the libguestfs / qemu combination was excellent again:


But there’s more! (In the next part …)

