Tracing QEMU guest execution

When QEMU executes a guest using software emulation (“TCG”), it translates blocks of guest code to native code and then executes them (the TCG translation process is described in the talk “Towards multi-threaded TCG” by Alex Bennée and Frederic Konrad). If you’re interested in tracing guest code — perhaps in order to look at what code is being run or to benchmark it — it should be possible to instrument the translated blocks. And in fact it is. However I thought I’d document this since it took me ages to work out and it’s not exactly obvious how to do it.

Firstly you have to compile QEMU from source. Before compiling it, read docs/tracing.txt carefully. Also edit trace-events and remove the disable keyword from the following lines in that file:

 # TCG related tracing (mostly disabled by default)
 # cpu-exec.c
-disable exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"
+exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"

Add those trace events to your /tmp/events file. Also it’s useful to put your full qemu command line into a script, with the additional -trace events=/tmp/events parameter, so you have a way to rerun the trace.

What you end up with after you’ve done this and run your guest under the tracing test conditions is an enormous trace file. My trace file, simply from a kernel boot, was 3.9 GB.

You can now analyze the log using the scripts/simpletrace.py script, as described in the QEMU tracing documentation. Again, the output will be enormous. Mine begins like this, I’ve aligned the output to make it a bit easier to read:

$ ./scripts/simpletrace.py trace-events trace-4491 | head
exec_tb       0.000 pid=4491 tb=0x7fa869afe010 pc=0xfffffff0
exec_tb_exit  1.953 pid=4491 next_tb=0x0 flags=0x0
exec_tb      15.525 pid=4491 tb=0x7fa869afe090 pc=0xfe05b
exec_tb_exit  0.800 pid=4491 next_tb=0x7fa869afe090 flags=0x0
exec_tb       7.215 pid=4491 tb=0x7fa869afe110 pc=0xfe066
exec_tb_exit  0.234 pid=4491 next_tb=0x0 flags=0x0
exec_tb       5.940 pid=4491 tb=0x7fa869afe190 pc=0xfe06a
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0
exec_tb       2.945 pid=4491 tb=0x7fa869afe210 pc=0xfe070
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0

The pid and *tb fields are not very interesting, being the QEMU PID and the internal address of the translated blocks.

However the pc (program counter) field and the timestamp (µs delta from previous trace event) are useful: Notice that the first translated block of guest code is located at guest address 0xffff_fff0, which is the linear address where x86 CPUs boot from, and the second at 0xf_e05b (segmented address F000:E05B) which is the start of the BIOS ROM, so that’s encouraging.

Assuming you’ve now decoded the whole file (the decoded trace takes 5.5GB for me!), how can we turn these raw timestamp deltas and raw linear addresses into useful information? At this point we’re on our own, and I ended up writing Perl scripts to process and analyze the data.

The first Perl script is simple enough, and is just used to associate absolute timestamps with the program counter:

#!/usr/bin/perl -w

use warnings;
use strict;

my $ts = 0;

while (<>) {
    my $ts_delta;
    my $pc;

    if (m{^exec_tb(_nocache)? ([-\d.]+).*pc=0x([a-fA-F0-9]+)}) {
        $ts_delta = $2;
        $pc = "$3";
    }
    elsif (m{^exec_tb_exit ([-\d.]+)}) {
        $ts_delta = $1;
    }
    elsif (m{^Dropped_Event ([-\d.]+)}) {
        $ts_delta = $1;
    }
    else {
        die "could not parse output: $_"
    }
    $ts += $ts_delta;

    if (defined $pc) {
        print "$ts $pc\n";
    }
}

How do we know what program counter corresponds to what code? For this it’s helpful to know some magic numbers associated with booting PCs:

Address Meaning
0xfffffff0 Initial PC
0xfe05b BIOS ROM entry point
0x7c00 Bootloader entry point
Note: not used if you load the kernel using the -kernel option.
0x1000000 64 bit kernel entry point
This may be different for your kernel. Use readelf -e vmlinux

With that knowledge I can put together a timeline of my kernel boot by hand:

Time Stage
0 BIOS
1.16s Enter kernel
38.4s Power off

Not too interesting. The next step is to reverse the PC addresses into kernel symbols. There used to be a useful script called ksymoops to do this, but it seems to have disappeared, so I wrote my own:

#!/usr/bin/perl -w
#
# Find everything that looks like a kernel address in the input
# and turn it into a symbol using gdb.
#
# Usage:
#   ksyms.pl vmlinux < input > output
# where 'vmlinux' is the kernel image which must contain debug
# symbols (on Fedora, find this in kernel-debuginfo).

use warnings;
use strict;

my $vmlinux = shift;
my %cache = ();

while (<>) {
    s{(^|\s)([0-9a-f]{6,16})(\s|$)}{ $1 . lookup ($2) . $3 }gei;
    print
}

sub lookup
{
    local $_;
    my $addr = $_[0];

    return $cache{$addr} if exists $cache{$addr};

    # Run gdb to lookup this symbol.
    my $cmd = "gdb -batch -s '$vmlinux' -ex 'info symbol 0x$addr'";
    open PIPE, "$cmd 2>&1 |" or die "$cmd: $!";
    my $r = <PIPE>;
    close PIPE;
    chomp $r;
    if ($r =~ m/^No symbol/) {
        # No match, just return the original string, but add the original
        # string to the cache so we don't do the lookup again.
        $cache{$addr} = $addr;
        return $addr;
    }

    # Memoize the match and return it.
    $cache{$addr} = $r;
    return $r;
}

You can run it like this:

$ ksyms.pl /usr/lib/debug/lib/modules/4.4.4-301.fc23.x86_64/vmlinux times.log > symbols.log

Come back tomorrow for further analysis …

3 Comments

Filed under Uncategorized

3 responses to “Tracing QEMU guest execution

  1. Pingback: Tracing QEMU guest execution part 2 | Richard WM Jones

  2. rothschilds@naver.com

    hello, i followed your introduction “tracing QEMU guest excution” and i noticed nothing printed in my trace file. did i miss something ? i used this instruction “qemu-system-x86_64 -m 2G -trace events=/tmp/events ../qemu/test.img” to get trace. i checked my /tmp/events file and configured with trace-backends=simple option. what is wrong my system ? plz kindly help me. Thanks. !

  3. Arjun

    Hey. Nice post. But how do I switch off the tb caching so that the exec_tb_nocache logs the traces? Any idea?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s