In the previous post I posed three questions about my detailed function-level trace of the kernel booting under QEMU. The first one was: "Which kernel functions consume the most time?"
We don’t have stack traces, although that would be a good direction for future work. So if a function “A” calls another function “B” like this:
A (first part)
   ---> calls B
   <--- B returns
A (second part)
then we assign only the time spent in the two parts of "A" to "A". In other words, "A" doesn't inherit the time taken running "B". (The exception is when "B" is inlined, in which case the trace cannot distinguish which code is in "A" and which is in "B".)
The other limitation is lack of insight into what kernel modules are doing. This is very difficult to solve: compiling kernel modules into the kernel proper would change what I’m trying to measure considerably.
Given those limitations, I wrote a Perl script that totals up, for each function, the time spent in the function itself (as described above) across my previous trace. I'm not going to bother reproducing the script here because it's quite simple, but also very specific to the trace format I'm using. If you got this far, you'll have no trouble writing your own analysis tools.
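For illustration only, here is a minimal sketch of that kind of self-time accounting, written in Python rather than Perl and assuming a made-up trace format with one `timestamp enter|exit function` event per line (the real trace format, and therefore the real script, look different):

```python
#!/usr/bin/env python3
# Minimal sketch of per-function "self-time" accounting from a call trace.
# Assumes a made-up trace format, one event per line:
#     123.456 enter sha256_transform
#     124.000 exit sha256_transform
# Time between consecutive events is charged to whichever function was
# running, so a callee's time is *not* inherited by its caller.
import sys
from collections import defaultdict

self_time = defaultdict(float)   # function name -> accumulated self-time
stack = []                       # call stack: [function, time of last event]

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 3:
        continue
    t, event, func = float(fields[0]), fields[1], fields[2]
    if stack:
        running, since = stack[-1]
        self_time[running] += t - since   # charge the interval to the running function
    if event == "enter":
        stack.append([func, t])
    elif event == "exit" and stack:
        stack.pop()
        if stack:
            stack[-1][1] = t              # the caller resumes accumulating from here

total = sum(self_time.values())
for func, spent in sorted(self_time.items(), key=lambda kv: kv[1], reverse=True):
    pct = 100.0 * spent / total
    if pct >= 0.2:
        print(f"{func} {spent} ({pct:.1f}%)")
```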
The results are at the end of the post. For the functions near the top of the list, I had a look into the kernel code to find out what each seems to be doing; those notes are below.
- sha256_transform is the SHA256 function. While I’m not sure what it is being used for (some kind of module signature check seems to be the most likely explanation), the more interesting thing is that we’re not using any specialized version of the function (eg. with AVX support). That’s easily explained: we’re using TCG, not KVM, so no host processor features are available. However on a real boot we would be using AVX, so the function should take a lot less time, and I think we can discount it.
- native_safe_halt is the function which halts the processor when it is idle (eg. waiting for an interrupt). Is it worrying that we spend 1/50th of the time not working? Would it help to have more virtual CPUs or is there an inherent lack of parallelism in the boot process?
- It’s interesting that we spend such a large amount of time in the sort function. It’s used all over the place, eg. for sorting the BIOS E820 map, sorting memory mappings, checking boot parameters, sorting lists of wake-up events …
- task_tick_fair is part of the Completely Fair Scheduler.
- If there’s a boot option to disable ftrace, I cannot find it.
My main conclusion is that there is no “smoking gun” here. Everything generally points to things that a kernel (or at least, the core, non-module parts of a kernel) should be doing. So this analysis has not been very helpful.
Raw results for functions taking ≥ 0.2% of the total run time. Each line below gives the function name, the total time attributed to it, and its percentage of the total run time.
sha256_transform 1072864.28001441 (2.8%)
native_safe_halt 753975.64900082 (2.0%)
sort 734425.954994341 (1.9%)
task_tick_fair 511745.507986832 (1.3%)
ftrace_cmp_ips 467639.446000438 (1.2%)
_raw_spin_unlock_irqrestore 335345.294988811 (0.9%)
__schedule 316954.467999394 (0.8%)
__fentry__ 309859.036999101 (0.8%)
u64_swap 249592.363002313 (0.7%)
update_curr 241778.167999783 (0.6%)
_raw_spin_lock 238322.789997374 (0.6%)
memmap_init_zone 230832.76299999 (0.6%)
update_wall_time 229990.71499946 (0.6%)
_raw_spin_lock_irqsave 218091.154004406 (0.6%)
_cond_resched 211584.050989096 (0.6%)
scheduler_tick 208170.097999859 (0.5%)
rcu_check_callbacks 206013.37700165 (0.5%)
put_prev_entity 193928.260001965 (0.5%)
setup_boot_APIC_clock 191295.978000884 (0.5%)
__switch_to 189243.42600077 (0.5%)
update_process_times 185443.74699448 (0.5%)
enqueue_entity 179565.212001403 (0.5%)
sched_clock 178514.772999486 (0.5%)
get_page_from_freelist 170826.371996992 (0.4%)
kmem_cache_alloc 169795.631002006 (0.4%)
pick_next_task_fair 169043.417992091 (0.4%)
vsnprintf 163273.791001736 (0.4%)
_raw_write_unlock_irqrestore 154829.26900091 (0.4%)
__do_softirq 152027.232993216 (0.4%)
__list_add 151362.769998061 (0.4%)
update_cfs_shares 150060.764994967 (0.4%)
strcmp 149939.629001472 (0.4%)
smp_apic_timer_interrupt 149457.189996168 (0.4%)
set_next_entity 143404.835003117 (0.4%)
hrtimer_run_queues 138194.965994698 (0.4%)
timekeeping_update 135443.124995073 (0.4%)
try_to_wake_up 135295.856005485 (0.4%)
tick_periodic 126476.179998957 (0.3%)
rcu_gp_kthread 126272.871002012 (0.3%)
mpihelp_addmul_1 125503.634000804 (0.3%)
mutex_lock 125317.344006523 (0.3%)
unmap_single_vma 123475.184995107 (0.3%)
kernfs_link_sibling 122028.707998739 (0.3%)
sched_clock_cpu 120505.648994287 (0.3%)
bsearch 118000.180000488 (0.3%)
native_flush_tlb_single 116030.164000188 (0.3%)
irq_exit 115968.542998462 (0.3%)
update_rq_clock.part 109138.79299999 (0.3%)
cmp_name 108721.06799904 (0.3%)
sha_transform 108014.420998494 (0.3%)
list_del 107140.00499763 (0.3%)
memset 103831.057999056 (0.3%)
find_symbol_in_section 103420.847998359 (0.3%)
apic_timer_interrupt 103368.188003249 (0.3%)
lookup_fast 102929.846999108 (0.3%)
run_timer_softirq 100605.568003849 (0.3%)
_paravirt_ident_64 97908.6249995169 (0.3%)
dput 97668.9629971841 (0.3%)
page_remove_rmap 92813.1569912173 (0.2%)
_raw_spin_lock_irq 92344.2499994165 (0.2%)
dequeue_entity 89615.0369963748 (0.2%)
sched_clock_tick 89098.4689948051 (0.2%)
update_min_vruntime 87686.293001336 (0.2%)
prepare_to_wait_event 87595.5289991312 (0.2%)
set_pte_vaddr 87066.420998778 (0.2%)
__wake_up 85818.1300026891 (0.2%)
walk_component 85295.2989990991 (0.2%)
finish_task_switch 84811.1139978375 (0.2%)
cpuacct_charge 84761.027003726 (0.2%)
page_add_file_rmap 83768.9749897327 (0.2%)
pick_next_entity 83307.4529978009 (0.2%)
entry_SYSCALL_64 82818.1540005887 (0.2%)
inode_permission 82560.7290010359 (0.2%)
kmem_cache_free 82266.4810035541 (0.2%)
native_apic_mem_write 80268.7489996338 (0.2%)
account_process_tick 79699.9440020286 (0.2%)
native_set_fixmap 77178.2519990653 (0.2%)
mpihelp_submul_1 76799.3109990237 (0.2%)
check_preempt_wakeup 75775.3149971971 (0.2%)
retint_user 75658.6009985786 (0.2%)
__enqueue_entity 75257.4749966217 (0.2%)
sched_clock_local 74627.073001001 (0.2%)
handle_mm_fault 74344.1189997355 (0.2%)
mpih_sqr_n_basecase 74242.2409967501 (0.2%)
set_pte_vaddr_pud 71680.7560005574 (0.2%)
link_path_walk 70946.7810055222 (0.2%)
__change_page_attr_set_clr 68968.3919992647 (0.2%)
iowrite16 68629.3319995683 (0.2%)
find_next_zero_bit 68093.1289994775 (0.2%)
__free_pages_bootmem 68008.2799999039 (0.2%)
probe_kernel_read 67817.4460009944 (0.2%)
schedule 67664.0859990818 (0.2%)
__alloc_pages_nodemask 67535.849997178 (0.2%)
ftrace_modify_code_direct 67373.704996652 (0.2%)
mutex_unlock 66822.493999619 (0.2%)
copy_user_generic_unrolled 65886.3849978868 (0.2%)
text_poke 65293.4880003785 (0.2%)
prefetchw 64868.9050002131 (0.2%)
rcu_irq_enter 64733.4789974079 (0.2%)
strlen 64342.5080017841 (0.2%)
kernfs_dop_revalidate 64132.6360015515 (0.2%)
__inode_permission 63989.6499996874 (0.2%)
kernfs_iop_permission 63648.077004876 (0.2%)
rcu_nocb_kthread 63553.7460004445 (0.2%)
pcpu_alloc 63374.1510004376 (0.2%)
__list_del_entry 62550.6220011206 (0.2%)
__add_to_page_cache_locked 61942.2679978972 (0.2%)
__acct_update_integrals 60902.0239998158 (0.2%)
rcu_irq_exit 59696.2780016055 (0.2%)
do_set_pte 59611.520994694 (0.2%)
memcpy 59302.9080009956 (0.2%)
irq_enter 59102.7570020873 (0.2%)
__d_lookup 58982.2149961805 (0.2%)
dequeue_task_fair 58575.8359895833 (0.2%)
rcu_process_callbacks 58379.6520016473 (0.2%)
local_apic_timer_interrupt 58290.110000887 (0.2%)
tick_handle_periodic 57626.9209991428 (0.2%)
I’d be willing to bet that in many of the places where ‘sort’ is used, it is not really the best tool for the job.
In particular, wake-up events should probably be kept in some form of min-heap keyed by wake-up time, rather than in a fully sorted list.
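To make that concrete, here is a toy sketch (in Python, using heapq; it has nothing to do with the kernel's actual data structures) of keeping wake-up events in a min-heap: inserting is O(log n) and the earliest event is always at the top, so the list never needs a full sort.

```python
import heapq

# Toy illustration of a min-heap of wake-up events, keyed by wake-up time.
# The kernel's real timer code is C and far more involved; this only shows
# why a heap beats re-sorting a fully ordered list.
events = []                      # heap of (wake_up_time, name) tuples

def add_event(wake_up_time, name):
    heapq.heappush(events, (wake_up_time, name))    # O(log n) insert

def next_event():
    return events[0] if events else None            # earliest event, O(1)

def pop_due_events(now):
    """Remove and return every event whose wake-up time has passed."""
    due = []
    while events and events[0][0] <= now:
        due.append(heapq.heappop(events))           # O(log n) per pop
    return due

add_event(15.0, "timer A")
add_event(3.5, "timer B")
add_event(9.2, "timer C")
print(next_event())              # (3.5, 'timer B')
print(pop_due_events(10.0))      # [(3.5, 'timer B'), (9.2, 'timer C')]
```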