Tracing QEMU guest execution part 3

In the previous post I posed three questions about my detailed function-level trace of the kernel booting under QEMU. The first one was “Which kernel functions consume the most time?”

We don’t have stack traces, although that would be a good direction for future work. So if a function “A” calls another function “B” like this:

A
  --> calls B
  <-- B returns
A

then we’re going to assign just the time in the two parts of “A” to “A”. In other words, “A” doesn’t inherit the time taken running “B”. (Except if “B” is inlined, in which case the trace cannot distinguish which code is in “A” and which is in “B”).
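
For example, with made-up timestamps (in arbitrary units):

0.0  A
1.0    --> calls B
3.0    <-- B returns
4.0  A

then “A” is charged (1.0 - 0.0) + (4.0 - 3.0) = 2.0 units for its two parts, and “B” is charged 3.0 - 1.0 = 2.0 units.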

The other limitation is the lack of insight into what kernel modules are doing. This is very difficult to solve: compiling the modules into the kernel proper would considerably change what I’m trying to measure.

Given those limitations, I wrote a Perl script to total up the time spent in each function from my previous trace. I’m not going to reproduce the whole script here because it’s quite simple, but also very specific to the trace format I’m using. If you got this far, you’ll have no trouble writing your own analysis tools.
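
Just to show the shape of the thing, here is a rough sketch (not the actual script; it assumes a simplified trace where each line is just a timestamp followed by the name of the function being executed at that instant, which is not exactly the format of my trace):

#!/usr/bin/perl
# Sketch: charge each interval between consecutive trace lines to the
# function named on the earlier line.  While "B" runs the trace names
# "B", so "A" is never charged for it (unless "B" was inlined).
use strict;
use warnings;

my %total;                      # time attributed to each function
my ($prev_ts, $prev_fn);
my $grand = 0;

while (<>) {
    my ($ts, $fn) = split;
    next unless defined $fn;    # skip malformed lines
    if (defined $prev_ts) {
        my $delta = $ts - $prev_ts;
        $total{$prev_fn} += $delta;
        $grand += $delta;
    }
    ($prev_ts, $prev_fn) = ($ts, $fn);
}

die "no samples\n" unless $grand > 0;

# Print functions taking >= 0.2% of the total run time, largest first.
for my $fn (sort { $total{$b} <=> $total{$a} } keys %total) {
    my $pc = 100 * $total{$fn} / $grand;
    last if $pc < 0.2;
    printf "%-30s %s (%.1f%%)\n", $fn, $total{$fn}, $pc;
}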

The results are at the end of the post. For the functions at the top of the list, I had a look into the kernel code to find out what they seem to be doing, and those notes are below.

  • sha256_transform is the SHA256 function. While I’m not sure what it is being used for (some kind of module signature check seems to be the most likely explanation), the more interesting thing is that we’re not using any specialized version of the function (eg. with AVX support). That’s easily explained: we’re using TCG, not KVM, so no host processor features are available. However on a real boot we would be using AVX, so the function should take a lot less time, and I think we can discount it.
  • native_safe_halt is the function which halts the processor when it is idle (eg. waiting for an interrupt). Is it worrying that we spend 1/50th of the time not working? Would it help to have more virtual CPUs or is there an inherent lack of parallelism in the boot process?
  • It’s interesting that we spend such a large amount of time in the sort function. It’s used all over the place, eg. for sorting the BIOS E820 map, sorting memory mappings, checking boot parameters, sorting lists of wake-up events …
  • task_tick_fair is part of the Completely Fair Scheduler.
  • ftrace_cmp_ips is part of ftrace. If there’s a boot option to disable ftrace, I cannot find it.

My main conclusion is there is no “smoking gun” here. Everything generally points to things that a kernel (or at least, the core, non-module parts of a kernel) should be doing. So this analysis has not been very helpful.

Raw results for functions taking ≥ 0.2% of the total run time. The second column is the accumulated time in the trace’s timestamp units; the third is each function’s share of the total.

sha256_transform               1072864.28001441 (2.8%)
native_safe_halt               753975.64900082 (2.0%)
sort                           734425.954994341 (1.9%)
task_tick_fair                 511745.507986832 (1.3%)
ftrace_cmp_ips                 467639.446000438 (1.2%)
_raw_spin_unlock_irqrestore    335345.294988811 (0.9%)
__schedule                     316954.467999394 (0.8%)
__fentry__                     309859.036999101 (0.8%)
u64_swap                       249592.363002313 (0.7%)
update_curr                    241778.167999783 (0.6%)
_raw_spin_lock                 238322.789997374 (0.6%)
memmap_init_zone               230832.76299999 (0.6%)
update_wall_time               229990.71499946 (0.6%)
_raw_spin_lock_irqsave         218091.154004406 (0.6%)
_cond_resched                  211584.050989096 (0.6%)
scheduler_tick                 208170.097999859 (0.5%)
rcu_check_callbacks            206013.37700165 (0.5%)
put_prev_entity                193928.260001965 (0.5%)
setup_boot_APIC_clock          191295.978000884 (0.5%)
__switch_to                    189243.42600077 (0.5%)
update_process_times           185443.74699448 (0.5%)
enqueue_entity                 179565.212001403 (0.5%)
sched_clock                    178514.772999486 (0.5%)
get_page_from_freelist         170826.371996992 (0.4%)
kmem_cache_alloc               169795.631002006 (0.4%)
pick_next_task_fair            169043.417992091 (0.4%)
vsnprintf                      163273.791001736 (0.4%)
_raw_write_unlock_irqrestore   154829.26900091 (0.4%)
__do_softirq                   152027.232993216 (0.4%)
__list_add                     151362.769998061 (0.4%)
update_cfs_shares              150060.764994967 (0.4%)
strcmp                         149939.629001472 (0.4%)
smp_apic_timer_interrupt       149457.189996168 (0.4%)
set_next_entity                143404.835003117 (0.4%)
hrtimer_run_queues             138194.965994698 (0.4%)
timekeeping_update             135443.124995073 (0.4%)
try_to_wake_up                 135295.856005485 (0.4%)
tick_periodic                  126476.179998957 (0.3%)
rcu_gp_kthread                 126272.871002012 (0.3%)
mpihelp_addmul_1               125503.634000804 (0.3%)
mutex_lock                     125317.344006523 (0.3%)
unmap_single_vma               123475.184995107 (0.3%)
kernfs_link_sibling            122028.707998739 (0.3%)
sched_clock_cpu                120505.648994287 (0.3%)
bsearch                        118000.180000488 (0.3%)
native_flush_tlb_single        116030.164000188 (0.3%)
irq_exit                       115968.542998462 (0.3%)
update_rq_clock.part           109138.79299999 (0.3%)
cmp_name                       108721.06799904 (0.3%)
sha_transform                  108014.420998494 (0.3%)
list_del                       107140.00499763 (0.3%)
memset                         103831.057999056 (0.3%)
find_symbol_in_section         103420.847998359 (0.3%)
apic_timer_interrupt           103368.188003249 (0.3%)
lookup_fast                    102929.846999108 (0.3%)
run_timer_softirq              100605.568003849 (0.3%)
_paravirt_ident_64             97908.6249995169 (0.3%)
dput                           97668.9629971841 (0.3%)
page_remove_rmap               92813.1569912173 (0.2%)
_raw_spin_lock_irq             92344.2499994165 (0.2%)
dequeue_entity                 89615.0369963748 (0.2%)
sched_clock_tick               89098.4689948051 (0.2%)
update_min_vruntime            87686.293001336 (0.2%)
prepare_to_wait_event          87595.5289991312 (0.2%)
set_pte_vaddr                  87066.420998778 (0.2%)
__wake_up                      85818.1300026891 (0.2%)
walk_component                 85295.2989990991 (0.2%)
finish_task_switch             84811.1139978375 (0.2%)
cpuacct_charge                 84761.027003726 (0.2%)
page_add_file_rmap             83768.9749897327 (0.2%)
pick_next_entity               83307.4529978009 (0.2%)
entry_SYSCALL_64               82818.1540005887 (0.2%)
inode_permission               82560.7290010359 (0.2%)
kmem_cache_free                82266.4810035541 (0.2%)
native_apic_mem_write          80268.7489996338 (0.2%)
account_process_tick           79699.9440020286 (0.2%)
native_set_fixmap              77178.2519990653 (0.2%)
mpihelp_submul_1               76799.3109990237 (0.2%)
check_preempt_wakeup           75775.3149971971 (0.2%)
retint_user                    75658.6009985786 (0.2%)
__enqueue_entity               75257.4749966217 (0.2%)
sched_clock_local              74627.073001001 (0.2%)
handle_mm_fault                74344.1189997355 (0.2%)
mpih_sqr_n_basecase            74242.2409967501 (0.2%)
set_pte_vaddr_pud              71680.7560005574 (0.2%)
link_path_walk                 70946.7810055222 (0.2%)
__change_page_attr_set_clr     68968.3919992647 (0.2%)
iowrite16                      68629.3319995683 (0.2%)
find_next_zero_bit             68093.1289994775 (0.2%)
__free_pages_bootmem           68008.2799999039 (0.2%)
probe_kernel_read              67817.4460009944 (0.2%)
schedule                       67664.0859990818 (0.2%)
__alloc_pages_nodemask         67535.849997178 (0.2%)
ftrace_modify_code_direct      67373.704996652 (0.2%)
mutex_unlock                   66822.493999619 (0.2%)
copy_user_generic_unrolled     65886.3849978868 (0.2%)
text_poke                      65293.4880003785 (0.2%)
prefetchw                      64868.9050002131 (0.2%)
rcu_irq_enter                  64733.4789974079 (0.2%)
strlen                         64342.5080017841 (0.2%)
kernfs_dop_revalidate          64132.6360015515 (0.2%)
__inode_permission             63989.6499996874 (0.2%)
kernfs_iop_permission          63648.077004876 (0.2%)
rcu_nocb_kthread               63553.7460004445 (0.2%)
pcpu_alloc                     63374.1510004376 (0.2%)
__list_del_entry               62550.6220011206 (0.2%)
__add_to_page_cache_locked     61942.2679978972 (0.2%)
__acct_update_integrals        60902.0239998158 (0.2%)
rcu_irq_exit                   59696.2780016055 (0.2%)
do_set_pte                     59611.520994694 (0.2%)
memcpy                         59302.9080009956 (0.2%)
irq_enter                      59102.7570020873 (0.2%)
__d_lookup                     58982.2149961805 (0.2%)
dequeue_task_fair              58575.8359895833 (0.2%)
rcu_process_callbacks          58379.6520016473 (0.2%)
local_apic_timer_interrupt     58290.110000887 (0.2%)
tick_handle_periodic           57626.9209991428 (0.2%)

One response to “Tracing QEMU guest execution part 3”

  1. I’d be willing to bet that in many of the places where ‘sort’ is used, it is not really the best tool for the job.

    In particular, wake-up events, instead of being kept in a fully sorted list, should probably be kept in some form of min-heap, keyed by wake up time.
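
    Something along these lines, perhaps (just a sketch of the data structure, not what the kernel actually does):

      use strict;
      use warnings;

      my @heap;                 # each element is [ wake_up_time, event ]

      # Insert an event, sifting it up so the earliest wake-up time
      # stays at the root.
      sub timer_add {
          my ($time, $event) = @_;
          push @heap, [ $time, $event ];
          my $i = $#heap;
          while ($i > 0) {
              my $parent = int(($i - 1) / 2);
              last if $heap[$parent][0] <= $heap[$i][0];
              @heap[$parent, $i] = @heap[$i, $parent];
              $i = $parent;
          }
      }

      # Remove and return the event with the earliest wake-up time,
      # sifting the displaced last element back down.
      sub timer_next {
          return undef unless @heap;
          my $min  = $heap[0];
          my $last = pop @heap;
          if (@heap) {
              $heap[0] = $last;
              my $i = 0;
              while (1) {
                  my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
                  my $s = $i;
                  $s = $l if $l <= $#heap && $heap[$l][0] < $heap[$s][0];
                  $s = $r if $r <= $#heap && $heap[$r][0] < $heap[$s][0];
                  last if $s == $i;
                  @heap[$s, $i] = @heap[$i, $s];
                  $i = $s;
              }
          }
          return $min;
      }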
