
Performance of User-Mode Linux as a libguestfs backend

As of libguestfs 1.23.16, the User-Mode Linux backend is now a supported feature upstream, meaning that, at a minimum, it gets fully tested for each release.

I did some performance tests on the User-Mode Linux backend compared to the ordinary KVM-based appliance and the results are quite interesting.
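For reference, switching between the two backends comes down to environment variables. A minimal sketch (the vmlinux path is a placeholder, and using LIBGUESTFS_HV to locate the UML kernel binary is my assumption):

# Default: the ordinary KVM-based appliance.
unset LIBGUESTFS_BACKEND

# User-Mode Linux backend: select it and point libguestfs at a UML
# kernel binary (placeholder path; LIBGUESTFS_HV is my assumption).
export LIBGUESTFS_BACKEND=uml
export LIBGUESTFS_HV=/path/to/vmlinux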

The first test is to run the C API test suite using UML and KVM on baremetal. All times are in seconds, averaged over a few runs:

tests/c-api (baremetal) — UML: 630 — KVM: 332

UML is roughly half the speed, but do remember that the test is very system-call intensive, which is one of the worst cases for UML.

The same test again, but performed inside a KVM virtual machine (on the same hardware):

tests/c-api (virtualized) — UML: 334 — KVM: 961

The results of this are so surprising I went back and retested everything several times, but this is completely reproducible. UML runs the C API test suite about twice as fast virtualized as on baremetal.

KVM (no surprise) runs several times slower. Inside the VM there is no hardware virtualization, and so qemu-kvm has to fall back on TCG software emulation of everything.

One conclusion you might draw from this is that UML could be a better choice of backend if you want to use libguestfs inside a VM (eg. in the cloud). As always, you should measure your own workload.


The second test is of start-up times, again in seconds. If you want to use libguestfs to process a lot of disk images, start-up time matters.

start-up (baremetal) — UML: 3.9 — KVM: 3.7
start-up (virtualized) — UML: 3.0 — KVM: 8-11

The start-up time of KVM virtualized was unstable, but appeared to be around 3 times slower than on baremetal. UML performs about the same in both cases.
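One simple way to approximate the start-up measurement is to time a do-nothing appliance launch, with the backend selected as in the sketch above:

# Add a null disk, boot the appliance, do nothing, and exit.
time guestfish -a /dev/null run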

A couple of conclusions that I take from this:

(1) Most of the time is now spent initializing the appliance: searching for LVM, RAID and so on. The choice of hypervisor makes no difference. This is never going to go away, even if libguestfs were rewritten to use (eg) containers, or if libguestfs linked directly to kernel code. It simply takes this long for the kernel and userspace LVM/MD/filesystem code to initialize.

(2) The overhead of starting a KVM VM is not any different from starting a big Linux application. This is no surprise for people who have used KVM for a long time, but it’s counter-intuitive for most people who think that VMs “must” be heavyweight compared to ordinary processes.


The third test is of uploading data from the host into a disk image. I created a 1 GB disk image containing an ext2 filesystem, and I timed how long it took to upload 500 MB of data to a file on this filesystem.
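In guestfish terms the upload test looks roughly like this (a sketch, not the exact test code; data.file stands in for the 500 MB of test data):

# Create a 1 GB raw disk image and put an ext2 filesystem on it.
truncate -s 1G disk.img

guestfish -a disk.img <<EOF
run
part-disk /dev/sda mbr
mkfs ext2 /dev/sda1
mount /dev/sda1 /
# This is the step being timed: push 500 MB from the host into the guest.
upload data.file /data.file
EOF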

upload (baremetal) — UML: 147 — KVM: 16
upload (virtualized) — UML: 149 — KVM: 73

KVM is predictably much slower when no hardware virtualization is available, by a factor of about 4.5 times.

UML is overall far slower than KVM, but it is at least consistent.

To work out why UML is so much slower, I wanted to find out whether the bottleneck was the emulated serial port that we push the data through, or slow writes to the disk, so I carried out some extra tests:

upload-no-write (baremetal) — UML: 141 — KVM: 11
upload-no-write (virtualized) — UML: 140 — KVM: 20
write-no-upload (baremetal) — UML: 7 — KVM: 13
write-no-upload (virtualized) — UML: 9 — KVM: 25
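Roughly speaking, upload-no-write pushes the data over the serial connection but discards it, while write-no-upload generates the data inside the appliance so that nothing crosses the serial link. A hypothetical reconstruction in plain guestfish terms, reusing the disk image from the sketch above (my guess at the commands, not necessarily how the tests are implemented):

guestfish -a disk.img -m /dev/sda1 <<EOF
# upload-no-write: data crosses the serial link but is thrown away
# (assumes uploading to the appliance's /dev/null is permitted).
upload data.file /dev/null
# write-no-upload: data is generated inside the appliance, so nothing
# is pushed over the serial link.
fill 0 500M /data.file
EOF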

My conclusion is that the UML emulated serial device is over 10 times slower than KVM’s virtio-serial. This is a problem, but at least it’s a well-defined problem that the UML team can fix, and virtio-serial is an existence proof that it’s possible to do much better.

Finally, notice that UML appears faster than KVM at writes.

In fact what’s happening is a difference in caching modes: For safety, libguestfs forces KVM to bypass the host disk cache. This ensures that modifications made to disk images remain consistent even if there is a sudden power failure.

The UML backend currently uses the host cache, so the writes weren’t hitting the disk before the test finished (this is in fact a bug in UML since libguestfs performs an fsync inside the appliance, which UML does not honour).

As always with benchmarks, the moral is to take everything with a pinch of salt and measure your workloads!


Benchmarks: uploading files

I spent a few hours working out the fastest way to upload a large file into a disk image (using libguestfs, but the results are applicable to ordinary virtual machines too).

The timings were done on an idle physical machine with plenty of free RAM. The disk image was located in /dev/shm (ie. shared memory) in order to remove the physical effects of spinning magnetic media, so what I hope we are measuring here is pure software overhead. The test script is down at the bottom of this posting in case you want to try to reproduce the results.

In all cases, a 1500M file of zeroes was being uploaded into a 2G disk image. All tests were repeated 5 times, with the last 3 times averaged and shown to the nearest ½ second.

As a baseline I used the libguestfs upload method. This works by copying from a host file, over a fast virtio-serial connection, into the virtual machine. The disk image was raw-format and sparse. The host was synched before and after the upload to ensure that all writes go through to the backing disk (in shared memory).

raw sparse, upload: 8 seconds, 188 MBps

Preallocating the disk made things slower:

raw prealloc, upload: 8.5 s, 176 MBps, -6%

(Preallocation also made no difference when I repeated the test with a real hard disk as backing).
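For reference, the sparse and preallocated raw images are created along these lines (see also the commented-out alternatives in the script at the end of this post):

# Raw sparse image: no blocks allocated up front.
truncate -s 2G /dev/shm/test1.img

# Raw preallocated image: blocks allocated up front.
fallocate -l 2G /dev/shm/test1.img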

Instead of using the upload command, we can attach the source file as a disk and copy it directly using dd if=/dev/vdb of=/big.file bs=1024k. This was very slightly faster than the baseline:

raw prealloc, dd: 7.5, 200, +6%

However when I tested this again using a real disk for backing, dd made no measurable difference over upload.
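The dd variant attaches the source file as a second, read-only drive and copies it device-to-file inside the guest. In guestfish it looks roughly like this (a sketch; the full script is at the end of the post):

guestfish -a /dev/shm/test1.img <<EOF
# The attached file appears as the second drive, /dev/sdb.
add-ro big.file
run
part-disk /dev/sda mbr
mkfs ext2 /dev/sda1
mount /dev/sda1 /
# Copy the attached file straight onto the guest filesystem.
dd /dev/sdb /big.file
EOF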

Using qcow2 as a backing disk instead of raw made little difference:

qcow2 no prealloc, upload: 7.5, 200, +6%
qcow2 preallocation=metadata, upload: 8, 188, -
qcow2 preallocation=metadata, dd: 8, 188, -

Until very recently, libguestfs defaulted to mounting disks with -o sync, a historical mistake in the API. Although I am doing these tests without this option, adding it shows how much of a penalty this causes:

raw prealloc, dd, sync: 12, 125, -50%
raw prealloc, upload, sync: 138, 11, -1625%

The guest filesystem was ext2, which doesn’t have a journal. What is the penalty for using ext3 (using the default journaling options)?

raw prealloc, dd, ext3: 10.5, 143, -31%

This is surprising because I wouldn’t expect that journal writes while creating a file would be that significant.

Finally, if we compress the input file (down to 1.5MB), surely there will be less data pushed over the virtio-serial channel and everything will be quicker? In fact, no: it’s slower, even if we enable all the virtual CPUs in the guest:

raw prealloc, upload tar.gz: 10.5, 143, -31%
raw prealloc, upload tar.gz, -smp 4: 10.5, 143, -31%

Does it matter that the file I was uploading was all zero bytes? Does qemu optimize this case? I tested this using the fill function to generate files containing other bytes, and as far as I can tell, neither qcow2 nor the kernel (for raw) optimizes the all-zero-byte case.

My conclusion is that intuition is not a very good guide. Measure it!


#!/bin/bash -

set -e

# Configuration.
guest_format=ext2
#disk_image=test1.img
disk_image=/dev/shm/test1.img
disk_image_size=2G
mount_options=
#mount_options=sync

# Create big file to upload.
# Only do this once so the file is cached.

#rm -f big.file big.file.tar.gz
#truncate -s 1500M big.file
#tar zcf big.file.tar.gz big.file

test() {
    # Choose a method to allocate the image.
    rm -f times $disk_image

    qemu-img create -f qcow2 $disk_image $disk_image_size >/dev/null
    #qemu-img create -f qcow2 -o preallocation=metadata $disk_image $disk_image_size >/dev/null
    #truncate -s $disk_image_size $disk_image
    #fallocate -l $disk_image_size $disk_image
    # or: dd if=/dev/zero of=$disk_image bs=1024k count=2048 > /dev/null 2>&1

    # Perform the test.

    guestfish -a $disk_image <<EOF

    #smp 4

    # For dd test, add file as /dev/sdb
    #add-ro big.file

    run

    part-disk /dev/sda mbr
    mkfs $guest_format /dev/sda1
    mount-options "$mount_options" /dev/sda1 /

    # Sync before test.
    sync
    !sync

    !date +%s.%N >> times

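    # Uncomment exactly one of the following operations to choose what
    # is measured: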
    #upload big.file /big.file
    #dd /dev/sdb /big.file
    #tgz-in big.file.tar.gz /
    #fill 1 1500M /big.file
    #fill 0 1500M /big.file

EOF

    # Ensure all data is written to disk.
    sync

    date +%s.%N >> times

    # Display time taken in seconds.
    awk '{ if (!start) { start = $1 } else { print $1-start } }' < times
}

test
test
test
test
test

rm -f $disk_image
