Tag Archives: kernel

Follow up to “I booted Linux 292,612 times”

Well that blew up. It was supposed to be just a silly off-the-cuff comment about how some bugs are very tedious to bisect.

To answer a few questions people had, here’s what actually happened. As they say, don’t believe everything you read in the press.

A few weeks ago I noticed that some nbdkit tests which work by booting a Linux appliance under qemu were randomly hanging. I ignored it to start off with, but it got annoying so I decided to try to track down what was going on. Initially we thought it might be a qemu bug so I started by filing a bug there and writing my thoughts as I went to investigate. After swapping qemu, Linux guest and Linux host versions around it became clear that the problem was probably in the Linux guest kernel (although I didn’t rule out an issue with KVM emulation which might have implicated either qemu or the host kvm.ko module).

Initially I just had a hang, and because getting to that hang involved booting Linux hundreds or thousands of times it wasn’t feasible to attach gdb at the start to trace through the hang. Instead I had to connect gdb after observing the hang. It turns out that when the Linux guest was “hanging” it really was just missing a timer event so the kernel was still running albeit making no progress. But the upshot is that the stack trace you see is not of the hang itself, but of an idle, slightly confused kernel. gdb was out of the picture.

But since guest kernel 6.0 seemed to work and 6.4rc seemed to hang, I had a path to bisecting the bug.

Well, a very slow path. You see there are 52,363 commits between those two kernels, which means at least 15 or 16 bisect steps. Each step was going to involve booting the kernel at least thousands times to prove it was working (if it hung before then I’d observe that).

I made the mistake here of not first working on a good test, instead just running “while guestfish … ; echo -n . ; done” and watching until I’d seen a page of dots to judge the kernel “good”. Yeah, that didn’t work. It turns out the hang was made more likely by slightly loading the test machine (or running the tests in parallel which is the same thing). As a result my first bisection that took several days got the wrong commit.

Back to the drawing board. This time I wrote a proper test. It booted the kernel 10,000 times using 8 threads, and checked the qemu output to see if the boot had hung, stop the test and print a diagnostic, or print “test ok” if it got through all iterations. This time my bisection was better but that still took a couple of days.

At that point I thought I had the right commit, but Paolo Bonzini suggested to me that I boot the kernel in parallel, in a loop, for 24 hours at the point immediately before the commit, to try to show that there was no latent issue in the kernel before. (As it turns out while this is a good idea, this analysis is subtly flawed as we’ll see).

So I did just that. After 21 hours I got bored (plus this is using a lot of electricity and generating huge amounts of heat, and we’re in the middle of a heatwave here in the UK). I killed the test after 292,612 successful boots.

I had a commit that looked suspicious, but what to do now? I posted my findings on LKML.

We still didn’t fully understand how to trigger the hang, except it was annoying and rare, seemed to happen with different frequencies on AMD and Intel, could be reproduced by several independent people, but crucially kernel developer Peter Zijlstra could not reproduce it.

[For the record, the bug is a load and hardware-speed dependent race condition. It will particularly affect qemu virtual machines, but at least in theory it could happen on baremetal. It’s not AMD or Intel specific, that’s just a timing issue.]

By this point several other people had observed the hang including CoreOS developers and Rishabh Bhatnagar at Amazon.

A commenter on Hacker News pointed out that simply inserting a sleep into the problematic code path caused the same hang (and I verified that). So the commit I had bisected to was the wrong one again – it exposed a latent bug simply because it ran the same code as a sleep. It was introducing the sleep which exposed the bug, not the commit I’d spent a week bisecting. And the 262K boots didn’t in fact prove there was no latent bug. You live and learn …

Eventually the Amazon thread led to Thomas Gleixner suggesting a fix.

I tested the fix and … it worked!

Unfortunately the patch that introduced the bug has already gone into several stable trees meaning that many more people will likely be hitting the problem in future, but thanks to a heroic effort of many people (and not me, really) the bug has been fixed now.

Leave a comment

Filed under Uncategorized

nbdkit for loopback pt 3: loopback mounting VMware disks

nbdkit is a pluggable NBD server and it comes with a very wide range of plugins (of course you can also write your own). One of them is the VMware VDDK plugin, an interface between nbdkit and the very proprietary VMware VDDK library. The library allows you to read local VMware disks or access remote VMware servers. In this example I’m going to use it to loopback mount a VMDK file:

$ export LD_LIBRARY_PATH=~/tmp/vddk/vmware-vix-disklib-distrib/lib64
$ nbdkit -fv vddk \
    libdir=~/tmp/vddk/vmware-vix-disklib-distrib \
    file=/var/tmp/fedora.17.x86-64.20120529.vmdk

When loopback-mounting you must use a 512 byte sector size (see the mailing list for discussion):

# nbd-client -b 512 localhost /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 10240MB
Connected /dev/nbd0

Standard health warning: Loopback mounting any unknown disk is dangerous! You should use libguestfs instead as it protects you from harmful disks and also doesn’t require root.

It turns out this VMDK file contains a partitioned disk with one partition:

# fdisk -l /dev/nbd0
Disk /dev/nbd0: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 1024 bytes / 1024 bytes
Disklabel type: dos
Disk identifier: 0x000127ae

Device      Boot Start      End  Sectors Size Id Type
/dev/nbd0p1       2048 20971519 20969472  10G 83 Linux

and it can be mounted directly and it’s fully writable:

# mount /dev/nbd0p1 /mnt
# ls -l /mnt
total 76
lrwxrwxrwx.  1 root root     7 May 29  2012 bin -> usr/bin
dr-xr-xr-x.  4 root root  4096 May 29  2012 boot
drwxr-xr-x.  2 root root  4096 Feb  3  2012 dev
drwxr-xr-x. 59 root root  4096 May 29  2012 etc
drwxr-xr-x.  2 root root  4096 Feb  3  2012 home
lrwxrwxrwx.  1 root root     7 May 29  2012 lib -> usr/lib
lrwxrwxrwx.  1 root root     9 May 29  2012 lib64 -> usr/lib64
[etc]
# touch /mnt/hello

Leave a comment

Filed under Uncategorized

Tracing QEMU guest execution part 4

The final two questions that I posed last time were to do with constructing a timeline of what this guest is spending time on.

We can easily see system calls in the trace log, and we can also see when a kernel function is entered the first time (indicating that a new bit of the kernel is now running), and I wrote a Perl script to analyze that. That gave me a 115K line log file from which I did the rest of the analysis by hand to generate a timeline.

I would reproduce it here, but the results aren’t very enlightening. In particular I doubt it’s more interesting that what you can get by reading the kernel printk’s from a boot log.

What is my conclusion after using these techniques to analyze a booting guest? Here I go:

  • It’s clunky and undocumented. Hopefully this series should help a little.
  • It would be much more powerful with stack traces. It should be possible to get them from QEMU, at least in theory, but it’s a lot of work to do so.
  • It would be much more powerful if we could analyze into kernel modules and userspace.
  • More tooling around this might make it more bearable.

4 Comments

Filed under Uncategorized

Tracing QEMU guest execution part 3

In the previous post I posed three questions about my detailed function-level trace of the kernel booting under QEMU. The first one was Which kernel functions consume the most time?

We don’t have stack traces, although that would be a good direction for future work. So if a function “A” calls another function “B” like this:

A
  --> calls B
  <-- B returns
A

then we’re going to assign just the time in the two parts of “A” to “A”. In other words, “A” doesn’t inherit the time taken running “B”. (Except if “B” is inlined, in which case the trace cannot distinguish which code is in “A” and which is in “B”).

The other limitation is lack of insight into what kernel modules are doing. This is very difficult to solve: compiling kernel modules into the kernel proper would change what I’m trying to measure considerably.

Given those limitations, I wrote a Perl script to do all that from my previous trace. I’m not going to bother reproducing the script here because it’s quite simple, but also very specific to the trace format I’m using. If you got this far, you’ll have no trouble writing your own analysis tools.

The results are at the end of the post. For each function that was called, I had a look into the kernel code to find out what it seems to be doing, and those notes are below.

  • sha256_transform is the SHA256 function. While I’m not sure what it is being used for (some kind of module signature check seems to be the most likely explanation), the more interesting thing is that we’re not using any specialized version of the function (eg. with AVX suppport). That’s easily explained: we’re using TCG, not KVM, so no host processor features are available. However on a real boot we would be using AVX, so the function should take a lot less time, and I think we can discount it.
  • native_safe_halt is the function which halts the processor when it is idle (eg. waiting for an interrupt). Is it worrying that we spend 1/50th of the time not working? Would it help to have more virtual CPUs or is there an inherent lack of parallelism in the boot process?
  • It’s interesting that we spend such a large amount of time in the sort function. It’s used all over the place, eg. for sorting the BIOS E820 map, sorting memory mappings, checking boot parameters, sorting lists of wake-up events …
  • task_tick_fair is part of the Completely Fair Scheduler.
  • If there’s a boot option to disable ftrace, I cannot find it.

My main conclusion is there is no “smoking gun” here. Everything generally points to things that a kernel (or at least, the core, non-module parts of a kernel) should be doing. So this analysis has not been very helpful.

Continue reading

1 Comment

Filed under Uncategorized

Tracing QEMU guest execution part 2

After an overnight 12+ hour run of my Perl scripts I now have a 52 million line file that consists of timestamps, kernel symbols, and other untranslated linear addresses. The only possible way to analyze this will be with yet more scripts, but already there are lots of interesting things.

Here is the kernel code entering userspace:

11409434.8589973 prepare_exit_to_usermode in section .text
11409435.5189973 retint_user + 8 in section .text
11409436.4899973 7fb95ab3cd20
11409447.6899973 7fb95ab4c930
11409469.2169973 7fb95ab577f0

Userspace symbols cannot be decoded because we don’t know which process is being run. More importantly, code in kernel modules cannot be decoded, so we only see core kernel functions.

Handling a timer interrupt:

18000723.5261105 apic_timer_interrupt in section .text
18000725.2241105 smp_apic_timer_interrupt in section .text
18000726.7681105 native_apic_mem_write in section .text
18000729.2691105 smp_apic_timer_interrupt + 46 in section .text
18000729.8361105 irq_enter in section .text
18000730.3941105 rcu_irq_enter in section .text
18000731.1401105 rcu_irq_enter + 92 in section .text
18000731.7111105 irq_enter + 14 in section .text
18000732.3171105 smp_apic_timer_interrupt + 51 in section .text
18000740.9941105 exit_idle in section .text
18000741.5481105 smp_apic_timer_interrupt + 56 in section .text
18000742.0881105 local_apic_timer_interrupt in section .text
18000744.0341105 tick_handle_periodic in section .text
18000744.6341105 _raw_spin_lock in section .text
18000745.9291105 tick_periodic + 67 in section .text
18000747.3941105 do_timer in section .text

Userspace loading a kernel module:

7806760.57896065 40087d
7806765.09696065 4442b0
7806765.65396065 entry_SYSCALL_64 in section .text
7806766.14496065 entry_SYSCALL_64 + 32 in section .text
7806766.46296065 entry_SYSCALL_64 + 36 in section .text
7806788.75296065 sys_init_module in section .text
7806796.76296065 sys_init_module + 62 in section .text
7806797.28296065 sys_init_module + 62 in section .text
7806800.64896065 sys_init_module + 65 in section .text
7806801.94496065 capable in section .text
7806802.91196065 security_capable in section .text
7806804.30796065 cap_capable in section .text
7806804.87796065 security_capable + 72 in section .text
7806805.43596065 ns_capable + 41 in section .text
7806805.92096065 capable + 23 in section .text
7806810.46796065 sys_init_module + 75 in section .text
7806815.59796065 sys_init_module + 86 in section .text
7806821.10196065 sys_init_module + 96 in section .text
7806827.28496065 sys_init_module + 109 in section .text
7806831.23396065 sys_init_module + 129 in section .text
7806839.75396065 security_kernel_module_from_file in section .text
[etc]

What am I interested in knowing? My overall goal is to find areas in the kernel and userspace that we can optimize to make boot faster. Specifically it seems interesting at first to look at two questions:

  1. Which kernel functions consume the most time? This is “profiling 101”. It may indicate kernel functions which are particularly slow (eg. they contain sleeps).
  2. The time elapsed before each kernel function is called the first time. Because studying the whole trace is impossible (by hand), finding the first time each kernel function is called may give me an insight into major timeline events — eg. the first time one of the ext4 mount functions is called, we know that userspace has got to the point where it mounts the root filesystem.
  3. When does userspace make system calls? This is related to the previous point. It’s also the easiest type of analysis to do because system call entry is easy to detect from the trace.

More to follow …

2 Comments

Filed under Uncategorized

BLKDISCARD, BLKZEROOUT, BLKDISCARDZEROES, BLKSECDISCARD

Recent Linux has four ioctls related to discarding blocks on block devices: BLKDISCARD, BLKZEROOUT,
BLKDISCARDZEROES, BLKSECDISCARD
. As far as I’m aware these are not documented anywhere, but this posting describes what they do and how to use them. For a good all round introduction to thin provisioning, see Paolo Bonzini’s talk from DevConf (video here).

BLKDISCARD

This is the simplest ioctl. Given a range described as offset and length (both expressed in bytes), this code:

uint64_t range[2] = { offset, length };
ioctl (fd, BLKDISCARD, range);

will tell the underlying block device (fd) that it may discard the blocks which are contained in the given byte range.

The kernel code wants you to pass a range which is aligned to 512 bytes, and there may be further restrictions on the range you can pass which you can find out about by reading /sys/block/disk/queue/discard_alignment, /sys/block/disk/queue/discard_granularity, and /sys/block/disk/queue/discard_max_bytes.

If discard_max_bytes == 0 then discard isn’t supported at all on this device.

Discard is voluntary. The device might ignore it silently. Also what you read back from the discarded blocks might not be zeroes — you might read back stale data or random data (but see below).

BLKZEROOUT

BLKZEROOUT is a bit like BLKDISCARD but it writes zeroes. The code is similar:

uint64_t range[2] = { offset, length };
ioctl (fd, BLKZEROOUT, range);

Again note that offset and length are in bytes, but the kernel wants you to pass a 512-byte aligned range.

As far as I can tell from the implementation, the kernel implements this call itself. There is no help needed from devices, nor any device-specific optimization available.

BLKDISCARDZEROES

I mentioned above that discarded blocks might read back as stale data. However some devices guarantee that discarded blocks read back as zeroes (which means, I assume, that BLKZEROOUT would not be needed on such block devices).

You can find out if the device you are currently using has this guarantee, either by reading the sysfs file /sys/block/disk/queue/discard_zeroes_data, or by using this code:

unsigned int arg;
discard_zeroes =
    ioctl (fd, BLKDISCARDZEROES, &arg) == 0 && arg;

BLKSECDISCARD

Finally secure discard tells the device that you want to do a secure erase operation on the blocks. Again, pass a byte range (which has the same alignment requirements as BLKDISCARD):

uint64_t range[2] = { offset, length };
ioctl (fd, BLKSECDISCARD, range);

The ioctl will return an error (EOPNOTSUPP) for devices which cannot do secure erase.

8 Comments

Filed under Uncategorized

Half-baked idea: Kill SIGOK

For more half-baked ideas, see the ideas tag

I’m sure you’ve sat there waiting for some long compile to finish. It’s just started the tests, but you don’t want it to do the tests, just finish compiling dammit!

My half-baked idea is this: sending a special signal to a process (SIGOK) should kill the process, but the process should return 0 (ie. non-error exit).

3 Comments

Filed under Uncategorized

Compiling ODROID 3.12.3 kernel, attempt 3

Previously and previously. Will it be third time lucky?

I’m going to document the steps from the start. Note I’m building this on the hardware, not cross-compiling, since the ODROID-XU (Exynos 5410) with an external hard drive is fast (for ARM). Also note you’ll need to make your own serial port or buy one.

Clone this git repository which is based on upstream 3.12.3. Build it:

make odroidxu_defconfig
LOADADDR=40008000 make uImage dtbs
make modules

Install the kernel, device tree, and modules:

sudo cp arch/arm/boot/uImage /boot/uboot/uImage-3.12.3
sudo cp arch/arm/boot/dts/exynos5410-odroidxu.dtb /boot/uboot/
sudo make modules_install

Build an initramfs:

# First print the kernel version: make kernelrelease
sudo mkinitrd ./initrd-3.12.3 3.12.3-78393-g3886709
mkimage -A arm -O linux -T ramdisk -C none -a 0 -e 0 -n uInitrd -d initrd-3.12.3 uInitrd-3.12.3
sudo cp uInitrd-3.12.3 /boot/uboot/

Reboot, and be ready to interrupt u-boot by hitting a key on the serial console at the right moment.

sync
sudo reboot

At the u-boot prompt type:

set bootargs console=ttySAC2,115200n8 loglevel=9 earlyprintk root=UUID=ad455a75-609c-4b3c-956b-bcef3ebc4603 ro
fatload mmc 0 43000000 uimage-3.12.3
fatload mmc 0 46000000 uinitrd-3.12.3
fatload mmc 0 49000000 exynos5410-odroidxu.dtb
bootm 43000000 46000000 49000000

Unfortunately (1):

[    0.390000] CPU: All CPU(s) started in SVC mode.

and unfortunately (2):

[    5.780000] systemd[1]: Running in initial RAM disk.

Welcome to Fedora 20 (Heisenbug) dracut-034-19.git20131021.fc20 (Initramfs)!

[    5.810000] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
[    5.810000] systemd[1]: Caught , dumped core as pid 1220.
[    5.820000] systemd[1]: Freezing execution.
[    5.890000] usb 1-2: new high-speed USB device number 2 using s5p-ehci
[    6.040000] usb 1-2: New USB device found, idVendor=0424, idProduct=9730
[    6.040000] usb 1-2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    6.050000] usb 1-2: no of_node; not parsing pinctrl DT
[    6.180000] usb 1-3: new high-speed USB device number 3 using s5p-ehci
[    6.420000] usb 1-3: New USB device found, idVendor=0424, idProduct=3503
[    6.420000] usb 1-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    6.430000] usb 1-3: no of_node; not parsing pinctrl DT
[    6.430000] hub 1-3:1.0: no of_node; not parsing pinctrl DT
[    6.440000] hub 1-3:1.0: USB hub found
[    6.440000] hub 1-3:1.0: 3 ports detected
[    6.730000] usb 1-3.2: new high-speed USB device number 4 using s5p-ehci
[    6.860000] usb 1-3.2: New USB device found, idVendor=04e8, idProduct=61b6
[    6.860000] usb 1-3.2: New USB device strings: Mfr=1, Product=11, SerialNumber=3
[    6.870000] usb 1-3.2: Product: Samsung M3 Portable
[    6.870000] usb 1-3.2: Manufacturer: JMicron
[    6.880000] usb 1-3.2: SerialNumber: 00000000011E4C02
[    6.880000] usb 1-3.2: no of_node; not parsing pinctrl DT

8 Comments

Filed under Uncategorized

Compiling ODROID-XU 3.11-rc7 kernel, attempt 2

Previously …

Now that I have made my own serial port for the ODROID-XU, creating something omitted from this development board, I’m ready to try to compile my own kernel again. At least I should be able to debug any error messages this time.

I did everything in the previous posting, but this time built a uImage properly:

LOADADDR=40008000 make uImage dtbs
sudo cp arch/arm/boot/uImage /boot/uboot/uImage-3.11-rc7
sudo cp uInitrd /boot/uboot/uInitrd-3.11-rc7
sudo cp arch/arm/boot/dts/exynos5410-odroidxu.dtb /boot/uboot/

We end up with (where the bold files are the ones I have built):

$ ll /boot/uboot/
total 33047
-rwxr-xr-x 1 root root    5510 Sep 27 08:14 boot.ini
-rwxr-xr-x 1 root root   28603 Dec  7 16:34 exynos5410-odroidxu.dtb
-rwxr-xr-x 1 root root  293168 Sep 27 08:14 u-boot.bin
-rwxr-xr-x 1 root root 2113488 Dec  7 16:33 uImage-3.11-rc7
-rwxr-xr-x 1 root root 6518598 Sep 27 08:14 uInitrd
-rwxr-xr-x 1 root root 8104359 Dec  7 16:36 uInitrd-3.11-rc7
-rwxr-xr-x 1 root root 6518598 Nov  8 17:36 uInitrd-official3.4.5
-rwxr-xr-x 1 root root 5127832 Sep 27 08:13 zImage
-rwxr-xr-x 1 root root 5127832 Nov  8 17:36 zImage-official3.4.5

I boot this by interrupting u-boot and typing the load commands by hand. (Note that /boot/uboot is a FAT partition).

set bootargs console=ttySAC2,115200n8 loglevel=9 earlyprintk root=UUID=ad455a75-609c-4b3c-956b-bcef3ebc4603 ro
fatload mmc 0 43000000 uimage-3.11-rc7
fatload mmc 0 46000000 uinitrd-3.11-rc7
fatload mmc 0 49000000 exynos5410-odroidxu.dtb
bootm 43000000 46000000 49000000

This boots the kernel, but fails trying to find the root filesystem:

[    9.680000] mmc1: error -110 whilst initialising SD card
[    9.700000] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 300000Hz, actual 299401HZ div = 167)
[   11.730000] mmc1: error -110 whilst initialising SD card
[   11.750000] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 200000Hz, actual 200000HZ div = 250)
[   13.790000] mmc1: error -110 whilst initialising SD card
[   13.810000] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 196079Hz, actual 196078HZ div = 255)
[   15.850000] mmc1: error -110 whilst initialising SD card
[   17.680000] systemd[1]: No control group support available, not creating root group.
[   17.710000] systemd[1]: Expecting device dev-disk-by\x2duuid-ad455a75\x2d609c\x2d4b3c\x2d956b\x2dbcef3ebc4603.device...
[   17.720000] systemd[1]: Starting -.slice.
         Expecting device dev-disk-by\x2du[   17.730000] systemd[1]: Caught , dumped core as pid 1144.
[   17.730000] systemd[1]: Freezing execution.
uid-ad455a75\x2d609...c4603.device...

Update #2: See the comments section.

Update: It’s because the 3.11 kernel doesn’t support the MMC card. I’ll try later using an external USB drive.

4 Comments

Filed under Uncategorized

Half-baked ideas: Kernel modules-in-a-file

For more half-baked ideas, see the ideas tag.

The Linux kernel:

/boot/vmlinuz-3.11.9-200.fc19.x86_64

5 MB of highly optimized code that runs the world’s computers. You can copy it around and boot it on different hardware.

No wait, that’s all wrong. I’m forgetting about the modules. There are another 2832 files in /lib/modules/3.11.9-200.fc19.x86_64 and without them this kernel will hardly function.

It’s nice to be able to copy the kernel around — to compile it on fast x86-64 and run it on slow ARM, or to run the host kernel in a VM, but all those module files make everyone’s lives a lot more complex.

So my half-baked idea is this:

Combine all the modules into a single file, let’s call it /boot/modules-3.11.9-200.fc19.x86_64. This file might start off as an ext4 filesystem which can be loopback-mounted on /lib/modules. Eventually I’d hope that the module tools could load modules from the file directly.

Now you can happily cross-compile and copy just two kernel files from x86 to ARM or into your VMs, and that’s a lot less hassle.

10 Comments

Filed under Uncategorized