Tracing QEMU guest execution

When QEMU executes a guest using software emulation (“TCG”), it translates blocks of guest code to native code and then executes them (the TCG translation process is described in the talk “Towards multi-threaded TCG” by Alex Bennée and Frederic Konrad). If you’re interested in tracing guest code — perhaps in order to look at what code is being run or to benchmark it — it should be possible to instrument the translated blocks. And in fact it is. However I thought I’d document this since it took me ages to work out and it’s not exactly obvious how to do it.

Firstly you have to compile QEMU from source. Before compiling it, read docs/tracing.txt carefully. Also edit trace-events and remove the disable keyword from the following lines in that file:

 # TCG related tracing (mostly disabled by default)
 # cpu-exec.c
-disable exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"
+exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"

Add those trace events to your /tmp/events file. Also it’s useful to put your full qemu command line into a script, with the additional -trace events=/tmp/events parameter, so you have a way to rerun the trace.

What you end up with after you’ve done this and run your guest under the tracing test conditions is an enormous trace file. My trace file, simply from a kernel boot, was 3.9 GB.

You can now analyze the log using the scripts/simpletrace.py script, as described in the QEMU tracing documentation. Again, the output will be enormous. Mine begins like this, I’ve aligned the output to make it a bit easier to read:

$ ./scripts/simpletrace.py trace-events trace-4491 | head
exec_tb       0.000 pid=4491 tb=0x7fa869afe010 pc=0xfffffff0
exec_tb_exit  1.953 pid=4491 next_tb=0x0 flags=0x0
exec_tb      15.525 pid=4491 tb=0x7fa869afe090 pc=0xfe05b
exec_tb_exit  0.800 pid=4491 next_tb=0x7fa869afe090 flags=0x0
exec_tb       7.215 pid=4491 tb=0x7fa869afe110 pc=0xfe066
exec_tb_exit  0.234 pid=4491 next_tb=0x0 flags=0x0
exec_tb       5.940 pid=4491 tb=0x7fa869afe190 pc=0xfe06a
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0
exec_tb       2.945 pid=4491 tb=0x7fa869afe210 pc=0xfe070
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0

The pid and *tb fields are not very interesting, being the QEMU PID and the internal address of the translated blocks.

However the pc (program counter) field and the timestamp (µs delta from previous trace event) are useful: Notice that the first translated block of guest code is located at guest address 0xffff_fff0, which is the linear address where x86 CPUs boot from, and the second at 0xf_e05b (segmented address F000:E05B) which is the start of the BIOS ROM, so that’s encouraging.

Assuming you’ve now decoded the whole file (the decoded trace takes 5.5GB for me!), how can we turn these raw timestamp deltas and raw linear addresses into useful information? At this point we’re on our own, and I ended up writing Perl scripts to process and analyze the data.

The first Perl script is simple enough, and is just used to associate absolute timestamps with the program counter:

#!/usr/bin/perl -w

use warnings;
use strict;

my $ts = 0;

while (<>) {
    my $ts_delta;
    my $pc;

    if (m{^exec_tb(_nocache)? ([-\d.]+).*pc=0x([a-fA-F0-9]+)}) {
        $ts_delta = $2;
        $pc = "$3";
    }
    elsif (m{^exec_tb_exit ([-\d.]+)}) {
        $ts_delta = $1;
    }
    elsif (m{^Dropped_Event ([-\d.]+)}) {
        $ts_delta = $1;
    }
    else {
        die "could not parse output: $_"
    }
    $ts += $ts_delta;

    if (defined $pc) {
        print "$ts $pc\n";
    }
}

How do we know what program counter corresponds to what code? For this it’s helpful to know some magic numbers associated with booting PCs:

Address Meaning
0xfffffff0 Initial PC
0xfe05b BIOS ROM entry point
0x7c00 Bootloader entry point
Note: not used if you load the kernel using the -kernel option.
0x1000000 64 bit kernel entry point
This may be different for your kernel. Use readelf -e vmlinux

With that knowledge I can put together a timeline of my kernel boot by hand:

Time Stage
0 BIOS
1.16s Enter kernel
38.4s Power off

Not too interesting. The next step is to reverse the PC addresses into kernel symbols. There used to be a useful script called ksymoops to do this, but it seems to have disappeared, so I wrote my own:

#!/usr/bin/perl -w
#
# Find everything that looks like a kernel address in the input
# and turn it into a symbol using gdb.
#
# Usage:
#   ksyms.pl vmlinux < input > output
# where 'vmlinux' is the kernel image which must contain debug
# symbols (on Fedora, find this in kernel-debuginfo).

use warnings;
use strict;

my $vmlinux = shift;
my %cache = ();

while (<>) {
    s{(^|\s)([0-9a-f]{6,16})(\s|$)}{ $1 . lookup ($2) . $3 }gei;
    print
}

sub lookup
{
    local $_;
    my $addr = $_[0];

    return $cache{$addr} if exists $cache{$addr};

    # Run gdb to lookup this symbol.
    my $cmd = "gdb -batch -s '$vmlinux' -ex 'info symbol 0x$addr'";
    open PIPE, "$cmd 2>&1 |" or die "$cmd: $!";
    my $r = <PIPE>;
    close PIPE;
    chomp $r;
    if ($r =~ m/^No symbol/) {
        # No match, just return the original string, but add the original
        # string to the cache so we don't do the lookup again.
        $cache{$addr} = $addr;
        return $addr;
    }

    # Memoize the match and return it.
    $cache{$addr} = $r;
    return $r;
}

You can run it like this:

$ ksyms.pl /usr/lib/debug/lib/modules/4.4.4-301.fc23.x86_64/vmlinux times.log > symbols.log

Come back tomorrow for further analysis …

14 Comments

Filed under Uncategorized

14 responses to “Tracing QEMU guest execution

  1. Pingback: Tracing QEMU guest execution part 2 | Richard WM Jones

  2. rothschilds@naver.com

    hello, i followed your introduction “tracing QEMU guest excution” and i noticed nothing printed in my trace file. did i miss something ? i used this instruction “qemu-system-x86_64 -m 2G -trace events=/tmp/events ../qemu/test.img” to get trace. i checked my /tmp/events file and configured with trace-backends=simple option. what is wrong my system ? plz kindly help me. Thanks. !

  3. Arjun

    Hey. Nice post. But how do I switch off the tb caching so that the exec_tb_nocache logs the traces? Any idea?

  4. ouss

    what i understand that pc returns the address of tb so the address of *tb did match the adress of what .

    other question please tc-ptr adress is the adress of host code in all cases (i want to say before translating or after it match to the code of the host architecture.thanks

  5. Nesrine

    Could any one help me to understand how to add trace events to the /tmp/events file ??
    I used “echo memory_region_ops_read > /tmp/events ” but I didn’t find any trace file after launching qemu !

  6. Patching trace-events is not needed anymore since 5caa262fdadae08f7c5af7f1328d424823811aaa (Jul 17) I think.

  7. hao

    Awesome articles. Do you know how to print instruction flow into a file?

  8. Improved version of ksyms.pl script:


    #!/usr/bin/perl
    # (Improved version of ksyms.pl taken from https://rwmj.wordpress.com/2016/03/17/tracing-qemu-guest-execution/ )
    #
    # Find everything that looks like a kernel address in the input
    # and append it with source code descriptions using addr2line.
    #
    # Usage:
    # ksyms_improved.pl source_root_dir vmlinux_path addr2line_path < input > output
    # where
    # 'input' is a trace file with lines containing PC (Program Counter) values
    # 'output' is a processed 'input' appended with descriptive source lines for debugging
    # 'source_root_dir' is a root of path with linux kernel sources, from which
    # vmlinux binary was build (actually, any path may be given, it's just to shorten file paths in 'output')
    # 'vmlinux_path' is a path to vmlinux binary
    # 'addr2line_path' is a path to addr2line utility from toolchain vmlinux was built
    # (system/distro default one may fit in case if host is of same architecture)
    #
    # Example: ksyms_improved.pl ./kernel-src ./kernel-build/vmlinux addr2line < trace_dump.txt > trace_dump_vmlinux.txt
    use File::Spec qw(abs2rel);
    use FileHandle;
    use IPC::Open3;
    my %cache = ();
    my $source_root_dir = shift;
    $source_root_dir = `readlink -n $source_root_dir`;
    my $vmlinux_path = shift;
    my $addr2line_path = shift;
    # Start addr2line utility in parallel to use its stdio streams to perform conversions
    # (makes huge speedup to processing of large data)
    my $addr2line_cmd = "$addr2line_path -f -i -p -e $vmlinux_path";
    my $addr2line_in = new FileHandle;
    my $addr2line_out = new FileHandle;
    my $addr2line_err = new FileHandle;
    open3($addr2line_in, $addr2line_out, $addr2line_err, "$addr2line_cmd") or die "$addr2line_cmd: $!";
    vec($addrline_s_bits, $addr2line_out->fileno(), 1) = 1;
    vec($addrline_s_bits, $addr2line_err->fileno(), 1) = 1;
    while (<>) {
    s{^(.*\s)([0-9a-fA-F]{6,16})(.*)$}{ $1 . $2 . $3 . addr_to_src_code_desc ($2) }e;
    print
    }
    sub addr_to_src_code_desc
    {
    local $_;
    my $addr = $_[0];
    my $src_code_desc;
    # Check if data is already memorized
    return $cache{$addr} if exists $cache{$addr};
    # Send address to addr2line utility…
    print $addr2line_in "$addr\n";
    # … and wait&receive result from it
    # (we cannot simply use buffered "$src_code_desc = <$addr2line_out>",
    # because it may consist of multiple lines)
    # (also we assume that addr2line flushes stdout explicitly and once it writes complete answer)
    my $nfound = select($addrline_s_bits,undef,undef,undef);
    die "addr2line I/O error\n" if ($nfound == -1);
    sysread $addr2line_out,$src_code_desc,100000; #TODO: replace magic LENGTH with appropriate value
    # Format result for better readability
    $src_code_desc =~ s{(at\s+)(.+)(\:\d+)}{ $1 . path_relative_to_source_root ($2) . $3 }ge;
    $src_code_desc = "\n$src_code_desc";
    # Memorize data for later reuse.
    $cache{$addr} = $src_code_desc;
    return $src_code_desc;
    }
    sub path_relative_to_source_root
    {
    local $_;
    my $abs_path = $_[0];
    my $rel_path = File::Spec->abs2rel( $abs_path, $source_root_dir );
    return $rel_path;
    }

    It have super-fast performance and more descriptive source code positions output 🙂

  9. hck

    Thank you for making this post. It’s super interesting! I tried but it’s slow so I wonder if you have any ways to speed it up?

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.