Tag Archives: performance

Frame pointers vs DWARF – my verdict

A couple of weeks ago I wrote a blog posting here about Fedora having frame pointers (LWN backgrounder, HN thread). I made some mistakes in that blog posting and retracted it, but I wasn’t wrong about the conclusions, just wrong about how I reached them. Frame pointers are much better than DWARF. DWARF unwinding might have some theoretical advantages but it’s worse in every practical respect.

In particular:

  1. Frame pointers give you much faster profiling with much less overhead. In practical terms this means you can do continuous performance collection and analysis, which would be impossible with DWARF.
  2. DWARF unwinding has foot-guns which make it easy to screw up and collect insufficient data for analysis. You cannot know in advance how much data to collect. The defaults are much too small, and even increasing the collection size to unreasonably large sizes isn’t enough.
  3. The overhead of collecting DWARF callgraph data adversely affects what you’re trying to analyze.
  4. Frame pointers have some corner cases which they don’t handle well (certain leaf and most inlined functions aren’t collected), but these don’t matter a great deal in reality.
  5. DWARF unwinding can show inlined functions as if they are separate stack frames. (Opinions differ as to whether or not this is an advantage.)

Below I’ll try to demonstrate some of the issues, but first a little bit of background is necessary about how all this works.

When you run perf record -a on a workload, the kernel fires a timer interrupt on every CPU hundreds or thousands of times a second. Each interrupt must collect a stack trace for that CPU at that moment, which is then sent up to the userspace perf process that writes it to a perf.data file in the current directory. Obviously collecting this stack trace and writing it to the file must be done as quickly as possible with the least overhead.

Also the stack trace may start inside the kernel and go all the way out to userspace (unless the CPU was running userspace code at the moment it was interrupted, in which case only the userspace stack is collected). That involves unwinding two different stacks.

For the kernel stack, the kernel has its own unwinding information called ORC. For the userspace stack you choose (with the perf --call-graph option) whether to use frame pointers or DWARF. With frame pointers the kernel is able to immediately walk up the userspace stack all the way to the top (assuming everything was compiled with frame pointers, which is now true for Fedora 38). With DWARF, however, the format is complicated and the kernel cannot unwind it immediately. Instead the kernel just collects a copy of the user stack. But collecting the whole stack would consume far too much storage, so by default it only collects the first 8K. Many userspace stacks will be larger than this, in which case the data collection will simply be incomplete – it will never be possible to recover the full stack trace. You can adjust the size of stack collected, but that massively bloats the perf.data file, as we’ll see below.

To demonstrate what I mean, I collected a set of traces using fio and nbdkit on Fedora 38, using both frame pointers and DWARF. The command is:

sudo perf record -a -g [--call-graph=...] -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio'

with the nbd.fio file from fio’s examples.

I used no --call-graph option for collecting frame pointers (as it is the default), and --call-graph=dwarf,{4096,8192,16384,32768} to collect the DWARF examples with 4 different stack sizes.
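
In other words the collection runs looked roughly like this (the -o output filenames are my own addition here; everything else is as described above):

sudo perf record -a -g -o fp.data \
    -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio'
for size in 4096 8192 16384 32768; do
    sudo perf record -a -g --call-graph=dwarf,$size -o dwarf-$size.data \
        -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio'
done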

I converted the resulting data into flame graphs using Brendan Gregg’s tools.
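
For reference, the conversion with the FlameGraph scripts looks roughly like this (the repository path and the fp.data filename from the sketch above are assumptions; by default perf reads perf.data):

sudo perf script -i fp.data \
    | ./FlameGraph/stackcollapse-perf.pl \
    | ./FlameGraph/flamegraph.pl > fp.svg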

Everything was run on my idle 12 core / 24 thread AMD development machine.

Type             Size of perf.data   Lost chunks   Flame graph
Frame pointers   934 MB              0             Link
DWARF (4K)       10,104 MB           425           Link
DWARF (8K)       18,733 MB           1,643         Link
DWARF (16K)      35,149 MB           5,333         Link
DWARF (32K)      57,590 MB           545,024       Link

The first and most obvious thing is that even with the smallest stack collection size, DWARF’s perf.data is over 10 times larger, and it balloons even further once you start to collect more reasonable stack sizes. For a single minute of data collection, collecting tens of gigabytes of data is not very practical even on high-end machines, and continuous performance analysis would be impossible at these data rates.

Related to this, the overhead of perf increases. It is ~ 0.1% for frame pointers. For DWARF the overhead goes: 0.8% (4K), 1.5% (8K), 2.8% (16K), 2.7% (32K). But this disguises the true overhead because it doesn’t count the cost of writing to disk. Unfortunately on this machine I have full disk encryption enabled (which does add a lot to the overhead of writing nearly 60 GB of perf data), but you can see the overhead of all that encryption separate from perf in the flame graph. The total overhead of perf + writing + encryption is about 20%.

This may also be the reason for seeing so many “lost chunks” even on this very fast machine. All of the DWARF tests, even at the smallest size, printed:

Check IO/CPU overload!

But is the DWARF data accurate? Clearly not. This is to be expected: collecting a partial user stack is not going to be enough to reconstruct a stack trace. But remember that even with 4K of stack, the perf.data file is already more than 10 times larger than for frame pointers. Zooming in to the nbdkit process only and comparing the flame graphs shows significant numbers of incomplete stack traces, even when collecting 32K of stack.

On the left, nbdkit with frame pointers (correct). On the right, nbdkit with DWARF and 32K collection size. Notice on the right the large number of unattached frames. nbdkit main() does not directly call Unix domain socket send and receive functions!

If 8K (the default) is insufficient, and even 32K is not enough, how large do we need to make the DWARF stack collection? I couldn’t find out because I don’t have enough space for the expected 120 GB perf.data file at the next size up.

Let’s have a look at one thing which DWARF can do — show inlined and leaf functions. The stack trace for these is more accurate, as you can see below. (To reproduce, zoom in on the nbd_poll function.) On the left, frame pointers. On the right, DWARF with 32K stacks, showing the extra enter_* frames which are inlined.

My final summary here is that for most purposes you would be better off using frame pointers, and it’s a good thing that Fedora 38 now compiles everything with frame pointers. It should result in easier performance analysis, and even makes continuous performance analysis more plausible.

nbdkit for loopback pt 7: a slow disk

What happens to filesystems and programs when the disk is slow? You can test this using nbdkit and the delay filter. This command creates a 4G virtual disk in memory and injects a 2 second delay into every read operation:

$ nbdkit --filter=delay memory size=4G rdelay=2

You can loopback mount this as a device on the host:

# modprobe nbd
# nbd-client -b 512 localhost /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 4096MB
Connected /dev/nbd0

Partitioning and formatting are really slow!

# sgdisk -n 1 /dev/nbd0
Creating new GPT entries in memory.
... sits here for about 10 seconds ...
The operation has completed successfully.
# mkfs.ext4 /dev/nbd0p1
mke2fs 1.44.3 (10-July-2018)
waiting ...

Actually I killed it and decided to restart the test with a smaller delay. Since the memory plugin was rewritten to use a sparse array, we serialize all requests as an easy way to lock the sparse array data structure. This doesn’t matter normally because requests to the memory plugin are extremely fast, but once you inject delays it means that every request into nbdkit is serialized. Thus, for example, two reads issued in parallel by the kernel are delayed by 2+2 = 4 seconds instead of 2 seconds in total.
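
You can see the serialization from the host with two concurrent direct reads at different offsets (a rough sketch against the 2 second rdelay memory plugin from above; exact timings will vary):

time ( dd if=/dev/nbd0 of=/dev/null bs=4k count=1 iflag=direct skip=0 &
       dd if=/dev/nbd0 of=/dev/null bs=4k count=1 iflag=direct skip=1024 &
       wait )
# With the serialized memory plugin the second read queues behind the first,
# so the pair takes about 4 seconds end to end; a fully parallel plugin
# would finish both after about 2 seconds.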

However, shutting down the NBD connection reveals likely kernel bugs in the NBD driver:

[74176.112087] block nbd0: NBD_DISCONNECT
[74176.112148] block nbd0: Disconnected due to user request.
[74176.112151] block nbd0: shutting down sockets
[74176.112183] print_req_error: I/O error, dev nbd0, sector 6144
[74176.112252] print_req_error: I/O error, dev nbd0, sector 6144
[74176.112257] Buffer I/O error on dev nbd0p1, logical block 4096, async page read
[74176.112260] Buffer I/O error on dev nbd0p1, logical block 4097, async page read
[74176.112263] Buffer I/O error on dev nbd0p1, logical block 4098, async page read
[74176.112265] Buffer I/O error on dev nbd0p1, logical block 4099, async page read
[74176.112267] Buffer I/O error on dev nbd0p1, logical block 4100, async page read
[74176.112269] Buffer I/O error on dev nbd0p1, logical block 4101, async page read
[74176.112271] Buffer I/O error on dev nbd0p1, logical block 4102, async page read
[74176.112274] Buffer I/O error on dev nbd0p1, logical block 4103, async page read

Note that nbdkit did not return any I/O errors, but the connection was closed with in-flight delayed requests. Well, at least our testing is finding bugs!

I tried again with a 500ms delay and using the file plugin which is fully parallel:

$ rm -f temp
$ truncate -s 4G temp
$ nbdkit --filter=delay file file=temp rdelay=500ms

I was able to partition and create a filesystem more easily on this because of the shorter delay and the fact that parallel kernel requests are delayed “in parallel” [same steps as above], and then mount it on a temp directory:

# mount /dev/nbd0p1 /tmp/mnt

The effect is rather strange, like using an NFS mount from a remote server. Initial file reads are slow, and then they are fast (as they are cached in memory). If you drop Linux caches:

# echo 3 > /proc/sys/vm/drop_caches

then everything becomes slow again.

Confident that parallel requests were being delayed in parallel, I also increased the delay back up to 2 seconds (still using the file plugin). This is like swimming in treacle or what I imagine it would be like to mount an NFS filesystem from the other side of the world over a 56K modem.

I wasn’t able to find any further bugs, but this should be useful for someone who wants to test this kind of thing.

Libguestfs appliance boot in under 600ms

$ ./run ./utils/boot-benchmark/boot-benchmark
Warming up the libguestfs cache ...
Running the tests ...

test version: libguestfs 1.33.28
 test passes: 10
host version: Linux moo.home.annexia.org 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    host CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
     backend: direct               [to change set $LIBGUESTFS_BACKEND]
        qemu: /home/rjones/d/qemu/x86_64-softmmu/qemu-system-x86_64 [to change set $LIBGUESTFS_HV]
qemu version: QEMU emulator version 2.5.94, Copyright (c) 2003-2008 Fabrice Bellard
         smp: 1                    [to change use --smp option]
     memsize: 500                  [to change use --memsize option]
      append:                      [to change use --append option]

Result: 575.9ms ±5.3ms

There are various tricks here:

  1. I’m using the (still!) not upstream qemu DMA patches.
  2. I’ve compiled my own very minimal guest Linux kernel.
  3. I’m using my nearly upstream "crypto: Add a flag allowing the self-tests to be disabled at runtime." patch.
  4. I’ve got two sets of non-upstream libguestfs patches (1, 2).
  5. I am not using libvirt, but if you do want to use libvirt, make sure you use the very latest version since it contains an important performance patch.

Previously

Super-nested KVM

Regular readers of this blog will of course be familiar with the joys of virtualization. One of those joys is nested virtualization — running a virtual machine in a virtual machine. Nested KVM is a thing too — that is, emulating the virtualization extensions in the CPU so that the second level guest gets at least some of the acceleration benefits that a normal first level guest would get.

My question is: How deeply can you nest KVM?

This is not so easy to test at the moment, so I’ve created a small project / disk image which, when booted on KVM, will launch a nested guest, which launches a nested guest, and so on until (usually) the host crashes, or you run out of memory, or your patience is exhausted by the poor performance of nested KVM.

The answer, by the way, is just 3 levels [on AMD hardware], which is rather disappointing. Hopefully this will encourage the developers to take a closer look at the bugs in nested virt.

Git repo: http://git.annexia.org/?p=supernested.git;a=summary
Binary images: http://oirase.annexia.org/supernested/

How does this work?

Building a simple appliance is easy. I’m using supermin to do that.

The problem is how does the appliance run another appliance? How do you put the same appliance inside the appliance? Obviously that’s impossible (right?)

The way it works is that inside the Lx hypervisor we run the L(x+1) qemu on /dev/sda, with a protective overlay stored in memory so we don’t disrupt the Lx hypervisor. Since /dev/sda literally is the appliance disk image, this all kinda works.
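
Schematically it looks something like this (a simplified sketch, not the actual supernested scripts):

# Inside the Lx guest: wrap /dev/sda (the appliance image we booted from)
# in a qcow2 overlay kept on a tmpfs, then boot L(x+1) from the overlay.
qemu-img create -f qcow2 -b /dev/sda -F raw /dev/shm/overlay.qcow2
qemu-system-x86_64 -enable-kvm -cpu host -m 512 -nographic \
    -drive file=/dev/shm/overlay.qcow2,format=qcow2,if=virtio
# -cpu host passes the virtualization extensions down so the next level
# can also use KVM; kernel/initrd and console options are omitted here.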

Cluster performance: baseline testing

I’m using fio (as recommended by Linus!) to baseline test my virtualization cluster. My fio script is supposed to look a bit like a qemu process:

[virt]
ioengine=libaio
iodepth=4
rw=randrw
bs=64k
direct=1
size=1g
numjobs=4

It has 4 large “disks” (size=1g) and 4 “qemu processes” (numjobs=4) running in parallel. Each test thread can have up to 4 IOs in flight (iodepth=4), and each IO is 64K, which matches the qcow2 default cluster size. I enabled O_DIRECT (direct=1) because we normally use qemu cache=none so that live migration works.
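
To reproduce a run, save the job file as (say) virt.fio and capture fio’s output to a file; the filenames here are my own, chosen to match the result links below:

fio --output=virt-ham0-raid1.txt virt.fio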

The first node now has a RAID 1 array of spinning rust (hard disks) and a smaller SSD, and the plan is to use LVM cache so the SSD can sit on top of the RAID array.

Performance of the RAID 1 array of hard disks

The raw performance of the RAID 1 array (this includes the filesystem) is fairly dismal:

virt-ham0-raid1.txt

Performance of the SSD

The SSD in contrast does a lot better:

virt-ham0-ssd.txt

However you want to look at the details, the fact is that the test runs 11 times faster on the SSD.

The effect of NFS

What about when we NFS-mount the RAID array or the SSD on another node? This should tell us the effect of NFS.

virt-ham1-raid1-nfs.txt

NFS makes this test run 3 times slower.

For the NFS-mounted SSD:

virt-ham1-ssd-nfs.txt

NFS makes this test run 4.5 times slower.

The effect of virtualization

By running the virtual machine on the first node (with the disks) it should be possible to see just the effect of virtualization. Since this is backed by the RAID 1 hard disk array, not SSDs or LVM cache, it should be compared only to the RAID 1 performance.

virt-vm-on-ham0.txt

The effect of virtualization (virtio-scsi in this case) is about an 8% drop in performance, which is not something I’m going to worry about.

Conclusions

  • The gains from the SSD (ie. using LVM cache) could outweigh the losses from having to use NFS to share the disk images.
  • It’s worth looking at alternative high-bandwidth, low-latency interconnects (instead of 1 GigE) to make NFS perform better. I’m going to investigate using Infiniband soon.

These are just the baseline measurements without LVM cache.

I’ve included links to the full test results. fio gives a huge amount of detail, and it’s helpful to keep the HOWTO open so you can understand all the figures it is producing.

Performance of User-Mode Linux as a libguestfs backend

As of libguestfs 1.23.16, the User-Mode Linux backend is now a supported feature upstream, meaning that at least it gets tested fully for each release.
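
For anyone who wants to try it, the backend is selected with an environment variable (LIBGUESTFS_BACKEND is the current name; releases contemporary with this post used LIBGUESTFS_ATTACH_METHOD, which appears in some of the older posts further down):

export LIBGUESTFS_BACKEND=uml            # current variable name
# export LIBGUESTFS_ATTACH_METHOD=uml    # older name from the 1.23 era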

I did some performance tests on the User-Mode Linux backend compared to the ordinary KVM-based appliance and the results are quite interesting.

The first test is to run the C API test suite using UML and KVM on baremetal. All times are in seconds, averaged over a few runs:

tests/c-api (baremetal) — UML: 630 — KVM: 332

UML is roughly half the speed, but do remember that the test is very system-call intensive, which is one of the worst cases for UML.

The same test again, but performed inside a KVM virtual machine (on the same hardware):

tests/c-api (virtualized) — UML: 334 — KVM: 961

The results of this are so surprising I went back and retested everything several times, but this is completely reproducible. UML runs the C API test suite about twice as fast virtualized as on baremetal.

KVM (no surprise) runs several times slower. Inside the VM there is no hardware virtualization, and so qemu-kvm has to fall back on TCG software emulation of everything.

One conclusion you might draw from this is that UML could be a better choice of backend if you want to use libguestfs inside a VM (eg. in the cloud). As always, you should measure your own workload.


The second test is of start-up times. If you want to use libguestfs to process a lot of disk images, this matters.

start-up (baremetal) — UML: 3.9 — KVM: 3.7
start-up (virtualized) — UML: 3.0 — KVM: 8-11

The start-up time of KVM virtualized was unstable, but appeared to be around 3 times slower than on baremetal. UML performs about the same in both cases.

A couple of conclusions that I take from this:

(1) Most of the time is now spent initializing the appliance, searching for LVM and RAID and so on. The choice of hypervisor makes no difference. This is never going to go away, even if libguestfs were rewritten to use (eg) containers, or if libguestfs linked directly to kernel code. It simply takes this long for the kernel and userspace LVM/MD/filesystem code to initialize.

(2) The overhead of starting a KVM VM is not any different from starting a big Linux application. This is no surprise for people who have used KVM for a long time, but it’s counter-intuitive for most people who think that VMs “must” be heavyweight compared to ordinary processes.


The third test is of uploading data from the host into a disk image. I created a 1 GB disk image containing an ext2 filesystem, and I timed how long it took to upload 500 MB of data to a file on this filesystem.
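
To give a feel for what this test involves, here is a rough guestfish equivalent (a sketch, not the exact test harness I used; -N fs:ext2:1G creates a prepared 1G ext2 disk called test1.img in the current directory):

truncate -s 500M data.bin       # sparse, but still transferred in full
time guestfish -N fs:ext2:1G -m /dev/sda1 upload data.bin /data.bin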

upload (baremetal) — UML: 147 — KVM: 16
upload (virtualized) — UML: 149 — KVM: 73

KVM is predictably much slower when no hardware virtualization is available, by a factor of about 4.5 times.

UML is overall far slower than KVM, but it is at least consistent.

In order to work out why UML is so much slower, I wanted to find out if it was because of the emulated serial port that we push the data through, or because writes to the disk are slow, so I carried out some extra tests:

upload-no-write (baremetal) — UML: 141 — KVM: 11
upload-no-write (virtualized) — UML: 140 — KVM: 20
write-no-upload (baremetal) — UML: 7 — KVM: 13
write-no-upload (virtualized) — UML: 9 — KVM: 25

My conclusion is that the UML emulated serial device is over 10 times slower than KVM’s virtio-serial. This is a problem, but at least it’s a well-defined problem that the UML team can fix, and virtio-serial is an existence proof that it’s possible to do much better.

Finally, notice that UML appears faster than KVM at writes.

In fact what’s happening is a difference in caching modes: For safety, libguestfs forces KVM to bypass the host disk cache. This ensures that modifications made to disk images remain consistent even if there is a sudden power failure.

The UML backend currently uses the host cache, so the writes weren’t hitting the disk before the test finished (this is in fact a bug in UML since libguestfs performs an fsync inside the appliance, which UML does not honour).
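
In qemu terms the distinction is the cache= option on the drive. This is only an illustration of that option, not the exact command line libguestfs generates (and UML is not qemu, so the second line is just an analogy):

-drive file=disk.img,format=raw,cache=none        # KVM appliance: host cache bypassed
-drive file=disk.img,format=raw,cache=writeback   # roughly how the UML backend behaves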

As always with benchmarks, the moral is to take everything with a pinch of salt and measure your workloads!

Multiple libguestfs appliances in parallel, part 4

[Part 1, part 2, part 3.]

Finally I modified the test to do some representative work: We now load a real Windows XP guest, inspect it (a heavyweight operation), and mount and stat each filesystem. I won’t reproduce the entire test program again because only the test subroutine has changed:

sub test {
    my $g = Sys::Guestfs->new;
    $g->add_drive_ro ("/tmp/winxp.img");
    $g->launch ();

    # Inspect the guest (ignore the result).
    $g->inspect_os ();

    # Approximate what virt-df does.
    my %fses = $g->list_filesystems ();
    foreach (keys %fses) {
        my $mounted = 0;
        eval { $g->mount_ro ($_, "/"); $mounted = 1; };
        if ($mounted) {
            $g->statvfs ("/");
            $g->umount_all ();
        }
    }

    return $g;
}

Even with all that work going on, I was able to inspect more than 1 disk per second on my laptop, and run 60 threads in parallel with good performance and scalability:

data

Multiple libguestfs appliances in parallel, part 3

A problem encountered in part 2 was that I couldn’t measure the maximum number of parallel libguestfs appliances that can be run at the same time. There are two reasons for that. The simpler one is that libvirt has a limit of 20 connections, which is easily overcome by setting LIBGUESTFS_ATTACH_METHOD=appliance to eliminate libvirt and run qemu directly. The harder one is that by the time the last appliances in the test are starting to launch, earlier ones have already shut down and their threads have exited.

What is needed is for the test to work in two phases: In the first phase we start up all the threads and launch all the appliances. Only when this is complete do we enter the second phase where we shut down all the appliances.

The easiest way to do this is by modifying the test to use a barrier (or in fact to implement a barrier using the condition primitives). See the modified test script below.

With the modified test script I was able to run ≥ 110 and < 120 parallel appliances in ~ 13 GB of free RAM, or around 120 MB / appliance, still with excellent performance and nearly linear scalability:

data


#!/usr/bin/perl -w

use strict;
use threads qw(yield);
use threads::shared qw(cond_broadcast cond_wait lock);
use Sys::Guestfs;
use Time::HiRes qw(time);

my $nr_threads_launching :shared;

sub test {
    my $g = Sys::Guestfs->new;
    $g->add_drive_ro ("/dev/null");
    $g->launch ();
    return $g;
}

# Get everything into cache.
test (); test (); test ();

sub thread {
    my $g = test ();

    {
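        # Barrier: each thread decrements the count of launching threads,
        # then waits here until every thread has launched its appliance.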
        lock ($nr_threads_launching);
        $nr_threads_launching--;
        cond_broadcast ($nr_threads_launching);
        cond_wait ($nr_threads_launching) until $nr_threads_launching == 0;
    }

    $g->close ();
}

# Test increasing numbers of threads until it fails.
for (my $nr_threads = 10; $nr_threads < 1000; $nr_threads += 10) {
    my $start_t = time ();

    $nr_threads_launching = $nr_threads;

    my @threads;
    foreach (1..$nr_threads) {
        push @threads, threads->create (\&thread)
    }
    foreach (@threads) {
        $_->join ();
        if (my $err = $_->error ()) {
            die "launch failed with $nr_threads threads: $err"
        }
    }

    my $end_t = time ();
    printf ("%d %.2f\n", $nr_threads, $end_t - $start_t);
}

Multiple libguestfs appliances in parallel, part 2

One problem with the previous test is that I hit a limit of 20 parallel appliances and mistakenly thought that I’d hit a memory limit. In fact libvirt out of the box limits the number of client connections to 20. You can adjust libvirt’s limit by editing /etc/libvirt/libvirtd.conf, but easier for us is to simply eliminate libvirt from the equation by doing:

export LIBGUESTFS_ATTACH_METHOD=appliance

which causes libguestfs to run qemu directly. In my first test I reached 48 parallel launches before I killed the program (because that’s a lot of parallelism and there seemed no end in sight). Scalability of the libguestfs / qemu combination was excellent again:

data
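
If you do want to keep libvirt in the loop, the connection limit mentioned above is the max_clients setting in /etc/libvirt/libvirtd.conf. Raising it would look something like this sketch (the value 100 is arbitrary):

# bump libvirt's client connection limit, then restart the daemon
sudo sed -i 's/^#\?max_clients *=.*/max_clients = 100/' /etc/libvirt/libvirtd.conf
sudo systemctl restart libvirtd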

But there’s more! (In the next part …)

Multiple libguestfs appliances in parallel, part 1

I wrote the Perl script below to find out how many libguestfs appliances we can start in parallel. The results are surprising (-ly good):

data

What’s happening here is that we’re booting up a KVM guest with 500 MB of memory, booting the Linux kernel, booting a minimal userspace, then shutting the whole lot down. And then doing that in parallel with 1, 2, .. 20 threads.

[Note: Hardware is my Lenovo x230 laptop with an Intel Core(TM) i7-3520M CPU @ 2.90GHz, 2 cores with 4 threads, 16 GB of RAM with approx. 13 GB free. Software is: Fedora 18 with libguestfs 1.20.2, libvirt 1.0.2 (from Rawhide), qemu 1.4.0 (from Rawhide)]

The test fails at 21 threads because there isn’t enough free memory, so each qemu instance is allocating around 660 MB of RAM. This is wrong: It failed because libvirt out of the box limits the maximum number of clients to 20. See next part in this series.

Up to 4 parallel launches, you can clearly see the effect of better utilization of the parallelism of the CPU — the total elapsed time hardly moves, even though we’re doing up to 4 times more work.


#!/usr/bin/perl -w

use strict;
use threads;
use Sys::Guestfs;
use Time::HiRes qw(time);

sub test {
    my $g = Sys::Guestfs->new;
    $g->add_drive_ro ("/dev/null");
    $g->launch ();
}

# Get everything into cache.
test (); test (); test ();

# Test increasing numbers of threads until it fails.
for my $nr_threads (1..100) {
    my $start_t = time ();
    my @threads;
    foreach (1..$nr_threads) {
        push @threads, threads->create (\&test)
    }
    foreach (@threads) {
        $_->join ();
        if (my $err = $_->error ()) {
            die "launch failed with nr_threads = $nr_threads: $err"
        }
    }
    my $end_t = time ();
    printf ("%d %.2f\n", $nr_threads, $end_t - $start_t);
}
