Tag Archives: qemu

Tracing QEMU guest execution

When QEMU executes a guest using software emulation (“TCG”), it translates blocks of guest code to native code and then executes them (the TCG translation process is described in the talk “Towards multi-threaded TCG” by Alex Bennée and Frederic Konrad). If you’re interested in tracing guest code — perhaps in order to look at what code is being run or to benchmark it — it should be possible to instrument the translated blocks. And in fact it is. However I thought I’d document this since it took me ages to work out and it’s not exactly obvious how to do it.

Firstly you have to compile QEMU from source. Before compiling it, read docs/tracing.txt carefully. Also edit trace-events and remove the disable keyword from the following lines in that file:

 # TCG related tracing (mostly disabled by default)
 # cpu-exec.c
-disable exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"
+exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"

Add those trace events to your /tmp/events file. Also it’s useful to put your full qemu command line into a script, with the additional -trace events=/tmp/events parameter, so you have a way to rerun the trace.
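
For reference, this is roughly what I mean. The qemu command line below is only a placeholder (use whatever binary, machine options and disk you normally boot with); the only addition needed for tracing is the -trace option. You also need to have configured qemu with the "simple" trace backend, which docs/tracing.txt explains.

$ cat /tmp/events
exec_tb
exec_tb_nocache
exec_tb_exit
$ cat run-trace.sh
#!/bin/sh
# Placeholder command line: substitute your own binary, machine
# options and disk image.  accel=tcg forces software emulation.
./x86_64-softmmu/qemu-system-x86_64 \
    -machine accel=tcg \
    -m 1024 \
    -drive file=guest.img,format=raw \
    -trace events=/tmp/events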

What you end up with after you’ve done this and run your guest under the tracing test conditions is an enormous trace file. My trace file, simply from a kernel boot, was 3.9 GB.

You can now analyze the log using the scripts/simpletrace.py script, as described in the QEMU tracing documentation. Again, the output will be enormous. Mine begins like this (I've aligned the output to make it a bit easier to read):

$ ./scripts/simpletrace.py trace-events trace-4491 | head
exec_tb       0.000 pid=4491 tb=0x7fa869afe010 pc=0xfffffff0
exec_tb_exit  1.953 pid=4491 next_tb=0x0 flags=0x0
exec_tb      15.525 pid=4491 tb=0x7fa869afe090 pc=0xfe05b
exec_tb_exit  0.800 pid=4491 next_tb=0x7fa869afe090 flags=0x0
exec_tb       7.215 pid=4491 tb=0x7fa869afe110 pc=0xfe066
exec_tb_exit  0.234 pid=4491 next_tb=0x0 flags=0x0
exec_tb       5.940 pid=4491 tb=0x7fa869afe190 pc=0xfe06a
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0
exec_tb       2.945 pid=4491 tb=0x7fa869afe210 pc=0xfe070
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0

The pid and *tb fields are not very interesting, being the QEMU PID and the internal address of the translated blocks.

However the pc (program counter) field and the timestamp (µs delta from previous trace event) are useful: Notice that the first translated block of guest code is located at guest address 0xffff_fff0, which is the linear address where x86 CPUs boot from, and the second at 0xf_e05b (segmented address F000:E05B) which is the start of the BIOS ROM, so that’s encouraging.

Assuming you’ve now decoded the whole file (the decoded trace takes 5.5GB for me!), how can we turn these raw timestamp deltas and raw linear addresses into useful information? At this point we’re on our own, and I ended up writing Perl scripts to process and analyze the data.

The first Perl script is simple enough, and is just used to associate absolute timestamps with the program counter:

#!/usr/bin/perl -w

use warnings;
use strict;

my $ts = 0;

while (<>) {
    my $ts_delta;
    my $pc;

    if (m{^exec_tb(_nocache)? ([-\d.]+).*pc=0x([a-fA-F0-9]+)}) {
        $ts_delta = $2;
        $pc = "$3";
    }
    elsif (m{^exec_tb_exit ([-\d.]+)}) {
        $ts_delta = $1;
    }
    elsif (m{^Dropped_Event ([-\d.]+)}) {
        $ts_delta = $1;
    }
    else {
        die "could not parse output: $_"
    }
    $ts += $ts_delta;

    if (defined $pc) {
        print "$ts $pc\n";
    }
}
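
To tie the steps together: save the script as (say) times.pl, decode the binary trace with simpletrace.py and pipe the result through it. Each output line is then an absolute timestamp followed by the guest PC in hex (without the 0x prefix); that is the times.log file used further below.

$ ./scripts/simpletrace.py trace-events trace-4491 > decoded.log
$ ./times.pl < decoded.log > times.log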

How do we know what program counter corresponds to what code? For this it’s helpful to know some magic numbers associated with booting PCs:

Address      Meaning
0xfffffff0   Initial PC
0xfe05b      BIOS ROM entry point
0x7c00       Bootloader entry point (not used if you load the kernel
             using the -kernel option)
0x1000000    64 bit kernel entry point (this may be different for your
             kernel; use readelf -e vmlinux to find it)
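
The kernel entry point in particular differs between builds, so check your own vmlinux; the ELF header contains it:

$ readelf -e vmlinux | grep 'Entry point'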

With that knowledge I can put together a timeline of my kernel boot by hand:

Time     Stage
0        BIOS
1.16s    Enter kernel
38.4s    Power off
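
Each of those times is simply the timestamp of the first line in times.log whose PC matches the relevant entry-point address. A one-liner along these lines finds it (a sketch, assuming the two-column output of the script above):

$ awk '$2 == "1000000" { print $1; exit }' times.log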

The timeline itself is not too interesting. The next step is to reverse the PC addresses into kernel symbols. There used to be a useful script called ksymoops to do this, but it seems to have disappeared, so I wrote my own:

#!/usr/bin/perl -w
#
# Find everything that looks like a kernel address in the input
# and turn it into a symbol using gdb.
#
# Usage:
#   ksyms.pl vmlinux < input > output
# where 'vmlinux' is the kernel image which must contain debug
# symbols (on Fedora, find this in kernel-debuginfo).

use warnings;
use strict;

my $vmlinux = shift;
my %cache = ();

while (<>) {
    s{(^|\s)([0-9a-f]{6,16})(\s|$)}{ $1 . lookup ($2) . $3 }gei;
    print
}

sub lookup
{
    local $_;
    my $addr = $_[0];

    return $cache{$addr} if exists $cache{$addr};

    # Run gdb to lookup this symbol.
    my $cmd = "gdb -batch -s '$vmlinux' -ex 'info symbol 0x$addr'";
    open PIPE, "$cmd 2>&1 |" or die "$cmd: $!";
    my $r = <PIPE>;
    close PIPE;
    chomp $r;
    if ($r =~ m/^No symbol/) {
        # No match, just return the original string, but add the original
        # string to the cache so we don't do the lookup again.
        $cache{$addr} = $addr;
        return $addr;
    }

    # Memoize the match and return it.
    $cache{$addr} = $r;
    return $r;
}

You can run it like this:

$ ksyms.pl /usr/lib/debug/lib/modules/4.4.4-301.fc23.x86_64/vmlinux times.log > symbols.log

Come back tomorrow for further analysis …


virt-v2v, libguestfs and qemu remote drivers in RHEL 7

Upstream qemu can access a variety of remote disks, like NBD and Ceph. This feature is exposed in libguestfs so you can easily mount remote storage.

However in RHEL 7 many of these drivers are disabled, because they’re not stable enough to support. I was asked exactly how this works, and this post is my answer — as it’s not as simple as it sounds.

There are (at least) five separate layers involved:

  1. qemu code: which block drivers are compiled into qemu, and which ones are compiled out completely.
  2. qemu block driver r/o whitelist: a whitelist of drivers that qemu allows you to use read-only.
  3. qemu block driver r/w whitelist: a whitelist of drivers that qemu allows you to use for read and write.
  4. libvirt: what libvirt enables (not covered in this discussion).
  5. libguestfs: in RHEL we patch out some qemu remote storage types using a custom patch.

Starting at the bottom of the stack, in RHEL we use ./configure --disable-* flags to disable a few features: Ceph is disabled on !x86_64 and 9pfs is disabled everywhere. This means the qemu binary won’t even contain code for those features.
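
The exact invocation lives in the spec file, but mechanically it amounts to something like this (an illustration only, not the literal RHEL build recipe; --disable-virtfs is the 9pfs switch and --disable-rbd the Ceph one):

# on every architecture:
./configure --disable-virtfs [...]
# and additionally on everything except x86-64:
./configure --disable-virtfs --disable-rbd [...]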

If you run qemu-img --help in RHEL 7, you’ll see the drivers which are compiled into the binary:

$ rpm -qf /usr/bin/qemu-img
qemu-img-1.5.3-92.el7.x86_64
$ qemu-img --help
[...]
Supported formats: vvfat vpc vmdk vhdx vdi ssh
sheepdog rbd raw host_cdrom host_floppy host_device
file qed qcow2 qcow parallels nbd iscsi gluster dmg
tftp ftps ftp https http cloop bochs blkverify blkdebug

Although you can use all of those in qemu-img, not all of those drivers work in qemu (the hypervisor). qemu implements two whitelists. The RHEL 7 qemu-kvm.spec file looks like this:

./configure [...]
    --block-drv-rw-whitelist=qcow2,raw,file,host_device,blkdebug,nbd,iscsi,gluster,rbd \
    --block-drv-ro-whitelist=vmdk,vhdx,vpc,ssh,https

The --block-drv-rw-whitelist parameter configures the drivers for which full read and write access is permitted and supported in RHEL 7. It’s quite a short list!

Even shorter is the --block-drv-ro-whitelist parameter — drivers for which only read-only access is allowed. You can’t use qemu to open these files for write. You can use these as (read-only) backing files, but you can’t commit to those backing files.

In practice what happens is you get an error if you try to use non-whitelisted block drivers:

$ /usr/libexec/qemu-kvm -drive file=test.vpc
qemu-kvm: -drive file=test.vpc: could not open disk image
test.vpc: Driver 'vpc' can only be used for read-only devices
$ /usr/libexec/qemu-kvm -drive file=test.qcow1
qemu-kvm: -drive file=test.qcow1: could not open disk
image test.qcow1: Driver 'qcow' is not whitelisted

Note that’s a qcow v1 (ancient format) file, not modern qcow2.

Side note: Only qemu (the hypervisor) enforces the whitelist. Tools like qemu-img ignore it.

At the top of the stack, libguestfs has a patch which removes support for many remote protocols. Currently (RHEL 7.2/7.3) we disable: ftp, ftps, http, https, tftp, gluster, iscsi, sheepdog, ssh. That leaves only: local file, rbd (Ceph) and NBD enabled.

virt-v2v uses a mixture of libguestfs and qemu-img to convert VMware and Xen guests to run on KVM. To access VMware we need to use https and to access Xen we use ssh. Both of those drivers are disabled in libguestfs, and only available read-only in the qemu whitelist. However that’s sufficient for virt-v2v, since all it does is add the https or ssh driver as a read-only backing file. (If you are interested in finding out more about how virt-v2v works, then I gave a talk about it at the KVM Forum which is available online).
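
In concrete terms the trick is just a qcow2 overlay whose backing file is the remote disk, something like this (a sketch with made-up host and paths; virt-v2v builds the overlay for you):

# the local overlay absorbs all writes; the remote Xen disk behind it is
# only ever opened read-only, which the r/o whitelist permits
qemu-img create -f qcow2 \
    -b 'ssh://root@xen.example.com/var/lib/xen/images/guest.img' \
    overlay.qcow2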

In summary — it’s complicated.


Run Linux on RISC-V in your browser

http://riscv.org/angel/

Previously, running Linux/RISC-V on qemu.


Booting RISC-V Linux with qemu

There are various open source ISAs and chip designs. I’ve previously run OpenRISC 1200 on an FPGA. Another effort is the RISC-V (“RISC Five”) project, which is developing an open, patent-free 64 bit ISA. It has a sister project lowRISC which aims to produce a synthesizable RISC-V FPGA design “in 6 months”, and tape out by the end of this year (I’m a little skeptical of the timeframes).

RISC-V support has been added to a fork of qemu:

$ git remote add riscv https://github.com/riscv/riscv-qemu
$ git fetch riscv
$ git checkout -b riscv-master --track riscv/master
$ ./configure --target-list="riscv-softmmu"
$ make
$ ./riscv-softmmu/qemu-system-riscv -cpu \?
RISCV 'riscv-generic'
$ ./riscv-softmmu/qemu-system-riscv -machine \?
Supported machines are:
board                RISCV Board (default)
none                 empty machine

To save yourself a world of pain, download a RISC-V Linux kernel binary and root image from here.

$ file ~/vmlinux
/home/rjones/vmlinux: ELF 64-bit LSB executable, UCB RISC-V, version 1 (SYSV), statically linked, BuildID[sha1]=d0a6d680362018e0f3b9208a7ea7f79b2b403f7c, not stripped

Then you can boot the image in the usual way:

$ ./riscv-softmmu/qemu-system-riscv \
    -display none \
    -kernel ~/vmlinux \
    -hda ~/root.bin \
    -serial stdio

The root filesystem is very sparse:

# uname -a
Linux ucbvax 3.14.15-g4073e84-dirty #4 Sun Jan 11 07:17:06 PST 2015 riscv GNU/Linux
# ls /bin
ash       chgrp     dd        ln        mv        rmdir     touch
base64    chmod     df        ls        nice      sleep     true
busybox   chown     echo      mkdir     printenv  stat      uname
cat       cp        false     mknod     pwd       stty      usleep
catv      date      fsync     mount     rm        sync
# ls /sbin
init
# ls /usr/bin
[          dirname    groups     mkfifo     sha1sum    tac        uniq
[[         dos2unix   head       nohup      sha256sum  tail       unix2dos
basename   du         hostid     od         sha3sum    tee        uudecode
cal        env        id         printf     sha512sum  test       uuencode
cksum      expand     install    readlink   sort       tr         wc
comm       expr       logname    realpath   split      tty        whoami
cut        fold       md5sum     seq        sum        unexpand   yes

Obligatory comic strip


Edit UEFI varstores

See end of post for an important update

UEFI firmware has a concept of persistent variables. They are used to control the boot order amongst other things. They are stored in non-volatile RAM on the system board, or for virtual machines in a host file.

When a UEFI machine is running you can edit these variables using various tools, such as Peter Jones’s efivar library, or the efibootmgr program.

These programs don’t actually edit the varstore directly. They access the kernel /sys/firmware/efi interface, but even the kernel doesn’t edit the varstore. It just redirects to the UEFI runtime “Variable Services”, so what is really running is UEFI code (possibly proprietary, but more usually from the open source TianoCore project).

So how can you edit varstores offline? The NVRAM file format is peculiar to say the least, and the only real specification is the TianoCore code that writes it. So somehow you must reuse that code. To make it more complicated, the varstore NVRAM format is tied to the specific firmware that uses it, so varstores used on aarch64 aren't compatible with those on x86-64, nor are SecureBoot varstores compatible with normal ones.

virt-efivars is an attempt to do that. It's rather "meta". You write a small editor program (an example is included), and virt-efivars compiles it into a tiny appliance. You then boot the appliance using a qemu + UEFI firmware + varstore combination; the editor program runs and edits the varstore using the UEFI code.

It works … at least on aarch64, which is the only convenient machine I have that has virtualized UEFI.

Git repo: http://git.annexia.org/?p=virt-efivars.git;a=summary

Update:

After studying this problem some more, Laszlo Ersek came up with a different and better plan:

  1. Boot qemu with only the OVMF code & varstore attached. No OS or appliance.
  2. This should drop you into a UEFI shell which is accessible over qemu’s serial port.
  3. Send appropriate setvar commands to update the variables. Using expect this should be automatable.
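
For steps 1 and 2 the qemu invocation would look something like this (a sketch: firmware paths vary by distro, and on aarch64 you would use qemu-system-aarch64 -M virt with the AAVMF builds instead):

# boot only the firmware: OVMF code (read-only) plus the varstore to edit
qemu-system-x86_64 \
    -nographic \
    -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=varstore.fd
# with nothing bootable attached, the firmware drops to the UEFI shell on
# the serial console, where variables can be listed and changed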


Half-baked ideas: qemu -M container

For more half-baked ideas, see the ideas tag.

Containers offer a way to do limited virtualization with fewer resources. But a lot of people have belatedly realized that containers aren’t secure, and so there’s a trend for putting containers into real virtual machines.

Unfortunately qemu is not very well suited to just running a single instance of the Linux kernel, as we in the libguestfs community have long known. There are at least a couple of problems:

  1. You have to allocate a fixed amount of RAM to the VM. This is basically a guess. Do you guess too large and have memory wasted in guest kernel structures, or do you guess too small and have the VM fail at random?
  2. There’s a large amount of overhead — firmware, legacy device emulation and other nonsense — which is essentially irrelevant to the special case of running a Linux appliance in a VM.

Here’s the half-baked idea: Let’s make a qemu “container mode/machine” which is better for this one task.

Unlike other proposals in this area, I’m not suggesting that we throw away or rewrite qemu. That’s stupid, as qemu gives us lots of useful abilities.

Instead the right way to do this is to implement a special virtio-ram device where the guest kernel can start off with a very tiny amount of RAM and request more memory on demand. And an empty machine type which is just for running appliances (qemu on ARM already has this: mach-virt).

Libguestfs people and container people, all happy. What’s not to like?


Streaming NBD server

The command:

qemu-img convert input output

does not work if the output is a pipe.

It’d sure be nice if it did though! For one thing, we could use this in virt-v2v to stream images into OpenStack Glance (instead of having to spool them into a temporary file).

I mentioned this to Paolo Bonzini yesterday and he suggested a simple workaround. Just replace the output with:

qemu-img convert -n input nbd:...

and write an NBD server that turns the sequence of writes from qemu-img into a stream that gets written to a pipe. Assuming the output is raw, then qemu-img convert will write, starting at disk offset 0, linearly through to the end of the disk image.

How to write such an NBD server easily? nbdkit is a project I started to make it easy to write NBD servers.

So I wrote a streaming plugin which does exactly that, in 243 lines of code.

Using a feature called captive nbdkit, you can rewrite the above command as:

nbdkit -U - streaming pipe=/tmp/output --run '
  qemu-img convert -n input -O raw $nbd
'

(This command will “hang” when you run it — you have to attach some process to read from the pipe, eg: md5sum < /tmp/output)
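
Putting it all together, a complete run looks something like this (a sketch: disk.qcow2 is a made-up input, and I create the FIFO and attach the reader first so nothing blocks):

# attach a consumer to the pipe before nbdkit starts writing to it
mkfifo /tmp/output
md5sum < /tmp/output &

# captive nbdkit: the streaming plugin turns qemu-img's sequential writes
# into the byte stream that appears on the pipe
nbdkit -U - streaming pipe=/tmp/output --run '
  qemu-img convert -n disk.qcow2 -O raw $nbd
'
wait    # the backgrounded md5sum prints its checksum when the stream ends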

Further work

The streaming plugin would be a lot more generally useful if it supported a sliding window, allowing limited reverse seeking and reading. So there's a nice little project for a motivated person. See here.
