When QEMU executes a guest using software emulation (“TCG”), it translates blocks of guest code to native code and then executes them (the TCG translation process is described in the talk “Towards multi-threaded TCG” by Alex Bennée and Frederic Konrad). If you’re interested in tracing guest code — perhaps in order to look at what code is being run or to benchmark it — it should be possible to instrument the translated blocks. And in fact it is. However I thought I’d document this since it took me ages to work out and it’s not exactly obvious how to do it.
Firstly you have to compile QEMU from source. Before compiling it, read docs/tracing.txt
carefully. Also edit trace-events
and remove the disable
keyword from the following lines in that file:
# TCG related tracing (mostly disabled by default)
# cpu-exec.c
-disable exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"
+exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"
Add those trace events to your /tmp/events
file. Also it’s useful to put your full qemu command line into a script, with the additional -trace events=/tmp/events
parameter, so you have a way to rerun the trace.
What you end up with after you’ve done this and run your guest under the tracing test conditions is an enormous trace file. My trace file, simply from a kernel boot, was 3.9 GB.
You can now analyze the log using the scripts/simpletrace.py
script, as described in the QEMU tracing documentation. Again, the output will be enormous. Mine begins like this, I’ve aligned the output to make it a bit easier to read:
$ ./scripts/simpletrace.py trace-events trace-4491 | head
exec_tb 0.000 pid=4491 tb=0x7fa869afe010 pc=0xfffffff0
exec_tb_exit 1.953 pid=4491 next_tb=0x0 flags=0x0
exec_tb 15.525 pid=4491 tb=0x7fa869afe090 pc=0xfe05b
exec_tb_exit 0.800 pid=4491 next_tb=0x7fa869afe090 flags=0x0
exec_tb 7.215 pid=4491 tb=0x7fa869afe110 pc=0xfe066
exec_tb_exit 0.234 pid=4491 next_tb=0x0 flags=0x0
exec_tb 5.940 pid=4491 tb=0x7fa869afe190 pc=0xfe06a
exec_tb_exit 0.222 pid=4491 next_tb=0x0 flags=0x0
exec_tb 2.945 pid=4491 tb=0x7fa869afe210 pc=0xfe070
exec_tb_exit 0.222 pid=4491 next_tb=0x0 flags=0x0
The pid
and *tb
fields are not very interesting, being the QEMU PID and the internal address of the translated blocks.
However the pc
(program counter) field and the timestamp (µs delta from previous trace event) are useful: Notice that the first translated block of guest code is located at guest address 0xffff_fff0
, which is the linear address where x86 CPUs boot from, and the second at 0xf_e05b
(segmented address F000:E05B
) which is the start of the BIOS ROM, so that’s encouraging.
Assuming you’ve now decoded the whole file (the decoded trace takes 5.5GB for me!), how can we turn these raw timestamp deltas and raw linear addresses into useful information? At this point we’re on our own, and I ended up writing Perl scripts to process and analyze the data.
The first Perl script is simple enough, and is just used to associate absolute timestamps with the program counter:
#!/usr/bin/perl -w
use warnings;
use strict;
my $ts = 0;
while (<>) {
my $ts_delta;
my $pc;
if (m{^exec_tb(_nocache)? ([-\d.]+).*pc=0x([a-fA-F0-9]+)}) {
$ts_delta = $2;
$pc = "$3";
}
elsif (m{^exec_tb_exit ([-\d.]+)}) {
$ts_delta = $1;
}
elsif (m{^Dropped_Event ([-\d.]+)}) {
$ts_delta = $1;
}
else {
die "could not parse output: $_"
}
$ts += $ts_delta;
if (defined $pc) {
print "$ts $pc\n";
}
}
How do we know what program counter corresponds to what code? For this it’s helpful to know some magic numbers associated with booting PCs:
Address |
Meaning |
0xfffffff0 |
Initial PC |
0xfe05b |
BIOS ROM entry point |
0x7c00 |
Bootloader entry point Note: not used if you load the kernel using the -kernel option. |
0x1000000 |
64 bit kernel entry point This may be different for your kernel. Use readelf -e vmlinux |
With that knowledge I can put together a timeline of my kernel boot by hand:
Time |
Stage |
0 |
BIOS |
1.16s |
Enter kernel |
38.4s |
Power off |
Not too interesting. The next step is to reverse the PC addresses into kernel symbols. There used to be a useful script called ksymoops to do this, but it seems to have disappeared, so I wrote my own:
#!/usr/bin/perl -w
#
# Find everything that looks like a kernel address in the input
# and turn it into a symbol using gdb.
#
# Usage:
# ksyms.pl vmlinux < input > output
# where 'vmlinux' is the kernel image which must contain debug
# symbols (on Fedora, find this in kernel-debuginfo).
use warnings;
use strict;
my $vmlinux = shift;
my %cache = ();
while (<>) {
s{(^|\s)([0-9a-f]{6,16})(\s|$)}{ $1 . lookup ($2) . $3 }gei;
print
}
sub lookup
{
local $_;
my $addr = $_[0];
return $cache{$addr} if exists $cache{$addr};
# Run gdb to lookup this symbol.
my $cmd = "gdb -batch -s '$vmlinux' -ex 'info symbol 0x$addr'";
open PIPE, "$cmd 2>&1 |" or die "$cmd: $!";
my $r = <PIPE>;
close PIPE;
chomp $r;
if ($r =~ m/^No symbol/) {
# No match, just return the original string, but add the original
# string to the cache so we don't do the lookup again.
$cache{$addr} = $addr;
return $addr;
}
# Memoize the match and return it.
$cache{$addr} = $r;
return $r;
}
You can run it like this:
$ ksyms.pl /usr/lib/debug/lib/modules/4.4.4-301.fc23.x86_64/vmlinux times.log > symbols.log
Come back tomorrow for further analysis …