
Tracing QEMU guest execution part 2

After an overnight 12+ hour run of my Perl scripts I now have a 52 million line file that consists of timestamps, kernel symbols, and other untranslated linear addresses. The only possible way to analyze this will be with yet more scripts, but already there are lots of interesting things.

Here is the kernel code entering userspace:

11409434.8589973 prepare_exit_to_usermode in section .text
11409435.5189973 retint_user + 8 in section .text
11409436.4899973 7fb95ab3cd20
11409447.6899973 7fb95ab4c930
11409469.2169973 7fb95ab577f0

Userspace symbols cannot be decoded because we don’t know which process is being run. More importantly, code in kernel modules cannot be decoded, so we only see core kernel functions.

Handling a timer interrupt:

18000723.5261105 apic_timer_interrupt in section .text
18000725.2241105 smp_apic_timer_interrupt in section .text
18000726.7681105 native_apic_mem_write in section .text
18000729.2691105 smp_apic_timer_interrupt + 46 in section .text
18000729.8361105 irq_enter in section .text
18000730.3941105 rcu_irq_enter in section .text
18000731.1401105 rcu_irq_enter + 92 in section .text
18000731.7111105 irq_enter + 14 in section .text
18000732.3171105 smp_apic_timer_interrupt + 51 in section .text
18000740.9941105 exit_idle in section .text
18000741.5481105 smp_apic_timer_interrupt + 56 in section .text
18000742.0881105 local_apic_timer_interrupt in section .text
18000744.0341105 tick_handle_periodic in section .text
18000744.6341105 _raw_spin_lock in section .text
18000745.9291105 tick_periodic + 67 in section .text
18000747.3941105 do_timer in section .text

Userspace loading a kernel module:

7806760.57896065 40087d
7806765.09696065 4442b0
7806765.65396065 entry_SYSCALL_64 in section .text
7806766.14496065 entry_SYSCALL_64 + 32 in section .text
7806766.46296065 entry_SYSCALL_64 + 36 in section .text
7806788.75296065 sys_init_module in section .text
7806796.76296065 sys_init_module + 62 in section .text
7806797.28296065 sys_init_module + 62 in section .text
7806800.64896065 sys_init_module + 65 in section .text
7806801.94496065 capable in section .text
7806802.91196065 security_capable in section .text
7806804.30796065 cap_capable in section .text
7806804.87796065 security_capable + 72 in section .text
7806805.43596065 ns_capable + 41 in section .text
7806805.92096065 capable + 23 in section .text
7806810.46796065 sys_init_module + 75 in section .text
7806815.59796065 sys_init_module + 86 in section .text
7806821.10196065 sys_init_module + 96 in section .text
7806827.28496065 sys_init_module + 109 in section .text
7806831.23396065 sys_init_module + 129 in section .text
7806839.75396065 security_kernel_module_from_file in section .text
[etc]

What am I interested in knowing? My overall goal is to find areas in the kernel and userspace that we can optimize to make boot faster. Specifically it seems interesting at first to look at three questions:

  1. Which kernel functions consume the most time? This is “profiling 101”. It may indicate kernel functions which are particularly slow (eg. they contain sleeps).
  2. The time elapsed before each kernel function is called for the first time. Because studying the whole trace by hand is impossible, finding the first time each kernel function is called may give me an insight into major timeline events — eg. the first time one of the ext4 mount functions is called, we know that userspace has got to the point where it mounts the root filesystem (see the sketch after this list).
  3. When does userspace make system calls? This is related to the previous point. It’s also the easiest type of analysis to do because system call entry is easy to detect from the trace.
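
As a first pass at question 2, here is a minimal sketch (the script name and details are mine, not part of the original post) that reports the first time each decoded kernel symbol appears in a log like the one above:

#!/usr/bin/perl -w
# first-call.pl (hypothetical name): report the first time each decoded
# kernel symbol appears in the decoded log.  Input lines look like
# "11409434.8589973 prepare_exit_to_usermode in section .text".

use warnings;
use strict;

my %first;                      # symbol -> first timestamp (µs)

while (<>) {
    # Skip raw hex addresses; only keep lines that were decoded to a symbol.
    next unless m{^([\d.]+)\s+(\S+)(?:\s+\+\s+\d+)?\s+in section};
    my ($ts, $sym) = ($1, $2);
    $first{$sym} = $ts unless exists $first{$sym};
}

# Print symbols in the order they were first seen.
for my $sym (sort { $first{$a} <=> $first{$b} } keys %first) {
    printf "%16.4f %s\n", $first{$sym}, $sym;
}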

More to follow …


Tracing QEMU guest execution

When QEMU executes a guest using software emulation (“TCG”), it translates blocks of guest code to native code and then executes them (the TCG translation process is described in the talk “Towards multi-threaded TCG” by Alex Bennée and Frederic Konrad). If you’re interested in tracing guest code — perhaps in order to look at what code is being run or to benchmark it — it should be possible to instrument the translated blocks. And in fact it is. However I thought I’d document this since it took me ages to work out and it’s not exactly obvious how to do it.

Firstly you have to compile QEMU from source. Before compiling it, read docs/tracing.txt carefully. Also edit trace-events and remove the disable keyword from the following lines in that file:

 # TCG related tracing (mostly disabled by default)
 # cpu-exec.c
-disable exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
-disable exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"
+exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_nocache(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
+exec_tb_exit(void *next_tb, unsigned int flags) "tb:%p flags=%x"

Add those trace events to your /tmp/events file. It’s also useful to put your full qemu command line into a script, with the additional -trace events=/tmp/events parameter, so you have a way to rerun the trace.
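
For reference, the events file is just a list of trace event names, one per line, and the wrapper script only needs the extra -trace parameter appended to whatever you normally run (the qemu binary and options below are placeholders, not the exact command line from this post):

$ cat /tmp/events
exec_tb
exec_tb_nocache
exec_tb_exit
$ qemu-system-x86_64 [your usual options] -trace events=/tmp/events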

What you end up with after you’ve done this and run your guest under the tracing test conditions is an enormous trace file. My trace file, simply from a kernel boot, was 3.9 GB.

You can now analyze the log using the scripts/simpletrace.py script, as described in the QEMU tracing documentation. Again, the output will be enormous. Mine begins like this (I’ve aligned the columns to make it a bit easier to read):

$ ./scripts/simpletrace.py trace-events trace-4491 | head
exec_tb       0.000 pid=4491 tb=0x7fa869afe010 pc=0xfffffff0
exec_tb_exit  1.953 pid=4491 next_tb=0x0 flags=0x0
exec_tb      15.525 pid=4491 tb=0x7fa869afe090 pc=0xfe05b
exec_tb_exit  0.800 pid=4491 next_tb=0x7fa869afe090 flags=0x0
exec_tb       7.215 pid=4491 tb=0x7fa869afe110 pc=0xfe066
exec_tb_exit  0.234 pid=4491 next_tb=0x0 flags=0x0
exec_tb       5.940 pid=4491 tb=0x7fa869afe190 pc=0xfe06a
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0
exec_tb       2.945 pid=4491 tb=0x7fa869afe210 pc=0xfe070
exec_tb_exit  0.222 pid=4491 next_tb=0x0 flags=0x0

The pid and *tb fields are not very interesting, being the QEMU PID and the internal addresses of the translated blocks.

However the pc (program counter) field and the timestamp (µs delta from previous trace event) are useful: Notice that the first translated block of guest code is located at guest address 0xffff_fff0, which is the linear address where x86 CPUs boot from, and the second at 0xf_e05b (segmented address F000:E05B) which is the start of the BIOS ROM, so that’s encouraging.

Assuming you’ve now decoded the whole file (the decoded trace takes 5.5GB for me!), how can we turn these raw timestamp deltas and raw linear addresses into useful information? At this point we’re on our own, and I ended up writing Perl scripts to process and analyze the data.

The first Perl script is simple enough, and is just used to associate absolute timestamps with the program counter:

#!/usr/bin/perl -w
#
# Accumulate the per-event timestamp deltas from the simpletrace output
# into an absolute timestamp, and print "<timestamp> <pc>" for each
# translated block that was executed.

use warnings;
use strict;

my $ts = 0;

while (<>) {
    my $ts_delta;
    my $pc;

    if (m{^exec_tb(_nocache)? ([-\d.]+).*pc=0x([a-fA-F0-9]+)}) {
        $ts_delta = $2;
        $pc = "$3";
    }
    elsif (m{^exec_tb_exit ([-\d.]+)}) {
        $ts_delta = $1;
    }
    elsif (m{^Dropped_Event ([-\d.]+)}) {
        $ts_delta = $1;
    }
    else {
        die "could not parse output: $_"
    }
    # Timestamps in the trace are deltas (µs) from the previous event.
    $ts += $ts_delta;

    # Only the exec_tb/exec_tb_nocache events carry a program counter.
    if (defined $pc) {
        print "$ts $pc\n";
    }
}
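
To produce the times.log file used below, pipe the decoded trace through this script (I’ll call it times.pl here; the post doesn’t give it a name):

$ ./scripts/simpletrace.py trace-events trace-4491 | ./times.pl > times.log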

How do we know what program counter corresponds to what code? For this it’s helpful to know some magic numbers associated with booting PCs:

Address      Meaning
0xfffffff0   Initial PC
0xfe05b      BIOS ROM entry point
0x7c00       Bootloader entry point (not used if you load the kernel using the -kernel option)
0x1000000    64 bit kernel entry point (may be different for your kernel; use readelf -e vmlinux to check)

With that knowledge I can put together a timeline of my kernel boot by hand:

Time     Stage
0        BIOS
1.16s    Enter kernel
38.4s    Power off
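
The “Enter kernel” time can also be pulled out of times.log automatically, using the 0x1000000 entry point from the table above. A rough sketch (the script name and details are mine):

#!/usr/bin/perl -w
# boot-stages.pl (hypothetical name): report when the trace first jumps
# to the 64 bit kernel entry point.  Adjust $kernel_entry to match your
# kernel (readelf -e vmlinux).

use warnings;
use strict;

my $kernel_entry = 0x1000000;

while (<>) {
    # times.log lines are "<absolute timestamp in µs> <pc in hex>".
    my ($ts, $pc) = split;
    next unless defined $pc;
    if (hex ($pc) == $kernel_entry) {
        printf "enter kernel at %.2f s\n", $ts / 1e6;
        last;
    }
}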

That timeline is not too interesting by itself. The next step is to reverse the PC addresses into kernel symbols. There used to be a useful script called ksymoops to do this, but it seems to have disappeared, so I wrote my own:

#!/usr/bin/perl -w
#
# Find everything that looks like a kernel address in the input
# and turn it into a symbol using gdb.
#
# Usage:
#   ksyms.pl vmlinux < input > output
# where 'vmlinux' is the kernel image which must contain debug
# symbols (on Fedora, find this in kernel-debuginfo).

use warnings;
use strict;

my $vmlinux = shift;
my %cache = ();

while (<>) {
    s{(^|\s)([0-9a-f]{6,16})(\s|$)}{ $1 . lookup ($2) . $3 }gei;
    print
}

sub lookup
{
    local $_;
    my $addr = $_[0];

    return $cache{$addr} if exists $cache{$addr};

    # Run gdb to lookup this symbol.
    my $cmd = "gdb -batch -s '$vmlinux' -ex 'info symbol 0x$addr'";
    open PIPE, "$cmd 2>&1 |" or die "$cmd: $!";
    my $r = <PIPE>;
    close PIPE;
    chomp $r;
    if ($r =~ m/^No symbol/) {
        # No match, just return the original string, but add the original
        # string to the cache so we don't do the lookup again.
        $cache{$addr} = $addr;
        return $addr;
    }

    # Memoize the match and return it.
    $cache{$addr} = $r;
    return $r;
}

You can run it like this:

$ ksyms.pl /usr/lib/debug/lib/modules/4.4.4-301.fc23.x86_64/vmlinux times.log > symbols.log

Come back tomorrow for further analysis …


Caseless virtualization cluster, part 4

AMD supports nested virtualization a bit more reliably than Intel, which was one of the reasons to go for AMD processors in my virtualization cluster. (The other reason is that they are much cheaper.)

But how well does it perform? Not too badly as it happens.

I tested this by creating a Fedora 20 guest (the L1 guest). I could create a nested (L2) guest inside that, but a simpler way is to use guestfish to carry out some baseline performance measurements. Since libguestfs is creating a short-lived KVM appliance, it benefits from hardware virt acceleration when available. And since libguestfs ≥ 1.26, there is a new option that lets you force software emulation so you can easily test the effect with & without hardware acceleration.

L1 performance

Let’s start on the host (L0), measuring L1 performance. Note that you have to run the commands shown at least twice, both because supermin will build and cache the appliance first time and because it’s a fairer test of hardware acceleration if everything is cached in memory.

This AMD hardware turns out to be pretty good:

$ time guestfish -a /dev/null run
real	0m2.585s

(2.6 seconds is the time taken to launch a virtual machine, all its userspace and a daemon, then shut it down. I’m using libvirt to manage the appliance.)

Forcing software emulation (disabling hardware acceleration):

$ time LIBGUESTFS_BACKEND_SETTINGS=force_tcg guestfish -a /dev/null run
real	0m9.995s

L2 performance

Inside the L1 Fedora guest, we run the same tests. Note this is testing L2 performance (the libguestfs appliance running on top of an L1 guest), ie. nested virt:

$ time guestfish -a /dev/null run
real	0m5.750s

Forcing software emulation:

$ time LIBGUESTFS_BACKEND_SETTINGS=force_tcg guestfish -a /dev/null run
real	0m9.949s

Conclusions

These are just some simple tests. I’ll be doing something more comprehensive later. However:

  1. First level hardware virtualization performance on these AMD chips is excellent.
  2. Nested virt runs at roughly 45% of non-nested speed (2.6s vs 5.8s for the appliance launch above).
  3. TCG performance is slower as expected, but shows that hardware virt is being used and is beneficial even in the nested case.

Other data

The host has 8 cores and 16 GB of RAM. /proc/cpuinfo for one of the host cores is:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-8320 Eight-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
bogomips	: 7031.39
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

The L1 guest has 1 vCPU and 4 GB of RAM. /proc/cpuinfo in the guest:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD Opteron 63xx class CPU
stepping	: 0
microcode	: 0x1000065
cpu MHz		: 3515.548
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb lm rep_good nopl extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm svm abm sse4a misalignsse 3dnowprefetch xop fma4 tbm arat
bogomips	: 7031.09
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

Update

As part of the discussion in the comments about whether this has 4 or 8 physical cores, here is the lstopo output:

[lstopo topology diagram of the host]


Nested virtualization (not) enabled

Interesting thing I learned a few days ago:

kvm: Nested Virtualization enabled

does not always mean that nested virtualization is being used.

If you use qemu’s software emulation (more often known as TCG) then it emulates a generic-looking AMD CPU with SVM (AMD’s virtualization feature).

AMD virtualization easily supports nesting (unlike Intel’s VT, which is a massive PITA to nest), and when the KVM module is loaded, it notices the “AMD” host CPU with SVM and willingly enables nested virt. There’s actually a little bit of benefit to this, because it avoids a second layer of TCG being needed if you did run an L2 guest in there (although it’s still going to be slow).
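
You can see why the message appears by looking inside such a guest: the emulated CPU advertises svm, and the kvm_amd module’s nested parameter is typically on by default. A couple of illustrative checks (shown as examples; output will vary):

$ grep -wo svm /proc/cpuinfo                    # the emulated CPU advertises SVM
$ cat /sys/module/kvm_amd/parameters/nested     # 1 means nested virt is enabled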
