I booted Linux 292,612 times

And it only took 21 hours.

Linux 6.4 has a bug where it hangs on boot, but probably only 1 in 1000 boots (and rarer if using Intel hardware for some reason). It’s surprising to me that no one has noticed this, but I certainly did because our nbdkit tests which use libguestfs were randomly hanging, always at the same place early in booting the libguestfs qemu appliance:

[    0.070120] Freeing SMP alternatives memory: 48K

So to bisect this I had to run guestfish in a loop until it either hangs or doesn’t. How many times? I chose 10,000 boots as a good threshold. To make this easier I wrote a test harness which uses up to 8 threads and parses the output to detect the hang.
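The loop described above can be sketched roughly as follows. This is not the actual harness, just a minimal Python illustration of the idea. It swaps in a simpler technique: instead of parsing the boot output, it treats any boot that exceeds a timeout as a hang. The names `boot_once` and `count_hangs`, the 60-second timeout, and the command you pass in are all assumptions made for the example.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def boot_once(cmd, timeout_s):
    """Run one boot command. Return True if it finished, False if it hung.

    Assumption: a "hang" is any run that exceeds timeout_s seconds.
    The exit status of the command is deliberately ignored here.
    """
    try:
        subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
        return True
    except subprocess.TimeoutExpired:
        return False


def count_hangs(cmd, iterations=10_000, threads=8, timeout_s=60.0):
    """Boot `iterations` times using up to `threads` workers; count the hangs."""
    hangs = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(boot_once, cmd, timeout_s) for _ in range(iterations)]
        for future in as_completed(futures):
            if not future.result():
                hangs += 1
                print(f"hang detected ({hangs} so far)", file=sys.stderr)
    return hangs
```

Here `cmd` would be whatever command boots the appliance once, given as an argument list. A wrapper script could exit with 0 when `count_hangs` returns zero and 1 otherwise, which are the statuses `git bisect run` reads as good and bad.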

After a painful bisection between v6.0 and v6.4-rc6 which took many days I found the culprit, a regression in the printk time feature: https://lkml.org/lkml/2023/6/13/733

To prove it I booted Linux 292,612 times before the faulty commit (all successful), and then again after it (it hung in under 1,000 boots).
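As a rough check on these thresholds, here is a small sketch of the arithmetic. It assumes each boot hangs independently with a fixed probability, and it takes the "1 in 1000" figure above at face value even though that is only an estimate. The function names are made up for the example.

```python
import math


def prob_all_clean(hang_rate, boots):
    """Chance that a buggy kernel survives `boots` boots without hanging.

    Assumes every boot hangs independently with probability hang_rate.
    """
    return (1.0 - hang_rate) ** boots


def boots_needed(hang_rate, max_false_good):
    """Smallest number of clean boots that keeps the chance of wrongly
    calling a buggy kernel "good" at or below max_false_good."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    rate = 1 / 1000  # the rough estimate from the post, not a measured value
    print(prob_all_clean(rate, 10_000))   # the bisection threshold
    print(prob_all_clean(rate, 292_612))  # the final proof run
    print(boots_needed(rate, 0.001))
```

Under these assumptions, a buggy kernel would pass 10,000 boots only about once in 22,000 tries, so the threshold is comfortable at a 1 in 1000 hang rate. It is much less so if the bug is rarer: at 1 in 10,000, a buggy kernel would pass about 37% of the time.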

19 responses to “I booted Linux 292,612 times”

  1. problemchild68

    Impressed with the tenacity of the search. I’d ASS-u-ME that at 1 in 1000 failure they attributed to H/W failure or glitching. Some more detail on your method of homing down the code segment would be useful …Cheers

    • rich

      I guess people would think that yes.

      Finding the commit was actually simple (albeit taking days). I just used git bisect with the test linked above. The problem was the amount of time it took to run 10,000 boot iterations to prove that the kernel was good (vs bad if it hung).

      For unclear reasons the bisect only got me down to a merge commit. I then had to manually test each commit within that merge, which took about another day.

    • Ehud Gavron

      The highest level of compliment I can give:
      I WOULD HIRE YOU.

      And then of course if you don’t provide the “why” people don’t get it.

      You worked on an issue until it was resolved.
      Your documentation of process, bug, resolution, was great.
      You upstreamed the fixes.
      Others made it a done deal.

      Seriously, great job, and yeah, worthy hire.

      E

  2. Jennifer Thompson

    This is yet more evidence, not that we really need it at this point, that it’s time to rewrite the rest of the Linux kernel in Rust.

    It can’t happen in an instant, of course, but if the kernel devs set a goal of reducing C’s usage by, say, 0.5% with each release, we’d soon see real progress being made toward a safer and more reliable Linux kernel.

    C has served us well, but it’s time to move beyond it. The future is Rust, and only Rust.

    • I thought by now people had learned better than to write this sort of thing with their name on it.

      Yes, C is the pits, but Linux will not be rewritten. Bugs would take longer to fix, and features adapting to new hardware would be delayed, to exactly the degree that attention is diverted to rewriting. Rewriting just 0.5% would take many times the amount of work that is currently devoted to kernel improvements, even from one major (second-digit) release to the next.

      Compiling as C++ and then incrementally hardening, e.g. with better pointers, would yield fantastically larger benefits throughout the kernel, very quickly and with very little effort. (This was done in GCC and GDB, with such a result, years ago.) But that won’t happen either.

    • Tim

      Gee, if I had a dollar for every time I heard someone say “it’s time to rewrite the kernel in ${my_favorite_language}” after they saw a kernel bug — !

      Amusingly, the proposed language (I guess this year Rust is in fashion) never seems to offer any specific feature for the bug in question. But for some unstated reason, we always need to use that specific language.

      But sure, let’s rewrite tens of millions of lines of code to solve a rare, minor regression that’s already been fixed. (Not all at once: a mere few hundred thousand lines at a time.) There are certainly no possible downsides to that.

    • Alaafia

      There is a stronger argument for writing such low-level gizzards in Ada. The arguments against Ada are subjective and based on negative sentiments. Ada can be formally checked, and treats many data structures as first-class types. Also, Ada has robust support in open-source and commercial tools.

    • Ehud Gavron

      Imagine if anything you said made sense.

      Rust will not be the savior of the Linux kernel or anything else. Please keep your religion to yourself.

      Good on those who actually contribute, not Captain Hindsights like you who only criticize.

      IF ONLY.

      Yeah. If only.

  3. Software Test Engineer walks into a bar, orders a beer, orders 292,612 beers! This is outstanding detective work.

  4. Richard Purdie

    We (Yocto Project) have noticed hangs in the same place when booting images under qemu for automated tests in 6.1 kernels from 6.1.28 onwards. Our theory is that it is related to https://lkml.org/lkml/2023/6/13/1460 i.e. “tick/common: Align tick period with the HZ tick”. As you say, deciding if a kernel is “good” is hard though!

  5. valdikss

    Is this comment section pre-moderated? Posted a long comment here hours ago, did not appear yet.

    • rich

      I think it’s just slow. I moderated it now.

      • valdikss

        It did not appear as a message until I removed http from the link. I tried to post it at least 5 times, and only after removing http was it listed as awaiting moderation. Interesting, as there are links with http in other comments.

  6. valdikss

    I have an old VIA-based 32-bit x86 machine, and it hangs at different times, but I managed to create a reproducer which hangs the system not long after boot. About 1 in 20 boots is unsuccessful.

    I noticed that verbose booting reduces the chance of hanging compared to quiet boot, but does not eliminate it completely.

    A similar issue was present even on Dell servers based on more recent x86_64 VIA CPUs; here’s an attempt to bisect the issue: bugs.debian.org/cgi-bin/bugreport.cgi?bug=507845#84
    The CPU seems to enter an endless loop, as the machine becomes quite hot, as if it’s running at full speed.

    All these years I believed this was a hardware implementation issue, related either to context switching or to the SSE/SSE2 blocks, as running a pentium-mmx-compiled OS seems to work fine, and given that no other x86 system hangs the way VIA does.

    However, after your post I’m not so sure: the issue you’ve mentioned is related to time and printk, I also associate my problem with how chatty the kernel log is (at least partially), and the person in the Debian bug tracker above also bisected it to code related to printf, although in libc. It could be another software bug in the kernel. If that’s the case, it has been present since at least the 2.6 days.

    I would appreciate any suggestions to try, any workarounds to apply, any advice on debugging. If you have spare time and interest, I can set up a dedicated machine for you to access over SSH.

  7. Oscar Henríquez

    Cheers to you dude, thanks for making linux more reliable!

  8. robi

    Let everyone use any language they want to, then cross compile to any other language for debugging, then finally to what the kernel devs expect, see (C)?

    Now automate this for big benefits.
