Tag Archives: kernel

nbdkit for loopback pt 3: loopback mounting VMware disks

nbdkit is a pluggable NBD server and it comes with a very wide range of plugins (of course you can also write your own). One of them is the VMware VDDK plugin, an interface between nbdkit and the very proprietary VMware VDDK library. The library allows you to read local VMware disks or access remote VMware servers. In this example I’m going to use it to loopback mount a VMDK file:

$ export LD_LIBRARY_PATH=~/tmp/vddk/vmware-vix-disklib-distrib/lib64
$ nbdkit -fv vddk \
    libdir=~/tmp/vddk/vmware-vix-disklib-distrib \

When loopback-mounting you must use a 512 byte sector size (see the mailing list for discussion):

# nbd-client -b 512 localhost /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 10240MB
Connected /dev/nbd0

Standard health warning: Loopback mounting any unknown disk is dangerous! You should use libguestfs instead as it protects you from harmful disks and also doesn’t require root.

It turns out this VMDK file contains a partitioned disk with one partition:

# fdisk -l /dev/nbd0
Disk /dev/nbd0: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 1024 bytes / 1024 bytes
Disklabel type: dos
Disk identifier: 0x000127ae

Device      Boot Start      End  Sectors Size Id Type
/dev/nbd0p1       2048 20971519 20969472  10G 83 Linux

and it can be mounted directly and it’s fully writable:

# mount /dev/nbd0p1 /mnt
# ls -l /mnt
total 76
lrwxrwxrwx.  1 root root     7 May 29  2012 bin -> usr/bin
dr-xr-xr-x.  4 root root  4096 May 29  2012 boot
drwxr-xr-x.  2 root root  4096 Feb  3  2012 dev
drwxr-xr-x. 59 root root  4096 May 29  2012 etc
drwxr-xr-x.  2 root root  4096 Feb  3  2012 home
lrwxrwxrwx.  1 root root     7 May 29  2012 lib -> usr/lib
lrwxrwxrwx.  1 root root     9 May 29  2012 lib64 -> usr/lib64
# touch /mnt/hello

Leave a comment

Filed under Uncategorized

Tracing QEMU guest execution part 4

The final two questions that I posed last time were to do with constructing a timeline of what this guest is spending time on.

We can easily see system calls in the trace log, and we can also see when a kernel function is entered the first time (indicating that a new bit of the kernel is now running), and I wrote a Perl script to analyze that. That gave me a 115K line log file from which I did the rest of the analysis by hand to generate a timeline.

I would reproduce it here, but the results aren’t very enlightening. In particular I doubt it’s more interesting that what you can get by reading the kernel printk’s from a boot log.

What is my conclusion after using these techniques to analyze a booting guest? Here I go:

  • It’s clunky and undocumented. Hopefully this series should help a little.
  • It would be much more powerful with stack traces. It should be possible to get them from QEMU, at least in theory, but it’s a lot of work to do so.
  • It would be much more powerful if we could analyze into kernel modules and userspace.
  • More tooling around this might make it more bearable.


Filed under Uncategorized

Tracing QEMU guest execution part 3

In the previous post I posed three questions about my detailed function-level trace of the kernel booting under QEMU. The first one was Which kernel functions consume the most time?

We don’t have stack traces, although that would be a good direction for future work. So if a function “A” calls another function “B” like this:

  --> calls B
  <-- B returns

then we’re going to assign just the time in the two parts of “A” to “A”. In other words, “A” doesn’t inherit the time taken running “B”. (Except if “B” is inlined, in which case the trace cannot distinguish which code is in “A” and which is in “B”).

The other limitation is lack of insight into what kernel modules are doing. This is very difficult to solve: compiling kernel modules into the kernel proper would change what I’m trying to measure considerably.

Given those limitations, I wrote a Perl script to do all that from my previous trace. I’m not going to bother reproducing the script here because it’s quite simple, but also very specific to the trace format I’m using. If you got this far, you’ll have no trouble writing your own analysis tools.

The results are at the end of the post. For each function that was called, I had a look into the kernel code to find out what it seems to be doing, and those notes are below.

  • sha256_transform is the SHA256 function. While I’m not sure what it is being used for (some kind of module signature check seems to be the most likely explanation), the more interesting thing is that we’re not using any specialized version of the function (eg. with AVX suppport). That’s easily explained: we’re using TCG, not KVM, so no host processor features are available. However on a real boot we would be using AVX, so the function should take a lot less time, and I think we can discount it.
  • native_safe_halt is the function which halts the processor when it is idle (eg. waiting for an interrupt). Is it worrying that we spend 1/50th of the time not working? Would it help to have more virtual CPUs or is there an inherent lack of parallelism in the boot process?
  • It’s interesting that we spend such a large amount of time in the sort function. It’s used all over the place, eg. for sorting the BIOS E820 map, sorting memory mappings, checking boot parameters, sorting lists of wake-up events …
  • task_tick_fair is part of the Completely Fair Scheduler.
  • If there’s a boot option to disable ftrace, I cannot find it.

My main conclusion is there is no “smoking gun” here. Everything generally points to things that a kernel (or at least, the core, non-module parts of a kernel) should be doing. So this analysis has not been very helpful.

Continue reading

1 Comment

Filed under Uncategorized

Tracing QEMU guest execution part 2

After an overnight 12+ hour run of my Perl scripts I now have a 52 million line file that consists of timestamps, kernel symbols, and other untranslated linear addresses. The only possible way to analyze this will be with yet more scripts, but already there are lots of interesting things.

Here is the kernel code entering userspace:

11409434.8589973 prepare_exit_to_usermode in section .text
11409435.5189973 retint_user + 8 in section .text
11409436.4899973 7fb95ab3cd20
11409447.6899973 7fb95ab4c930
11409469.2169973 7fb95ab577f0

Userspace symbols cannot be decoded because we don’t know which process is being run. More importantly, code in kernel modules cannot be decoded, so we only see core kernel functions.

Handling a timer interrupt:

18000723.5261105 apic_timer_interrupt in section .text
18000725.2241105 smp_apic_timer_interrupt in section .text
18000726.7681105 native_apic_mem_write in section .text
18000729.2691105 smp_apic_timer_interrupt + 46 in section .text
18000729.8361105 irq_enter in section .text
18000730.3941105 rcu_irq_enter in section .text
18000731.1401105 rcu_irq_enter + 92 in section .text
18000731.7111105 irq_enter + 14 in section .text
18000732.3171105 smp_apic_timer_interrupt + 51 in section .text
18000740.9941105 exit_idle in section .text
18000741.5481105 smp_apic_timer_interrupt + 56 in section .text
18000742.0881105 local_apic_timer_interrupt in section .text
18000744.0341105 tick_handle_periodic in section .text
18000744.6341105 _raw_spin_lock in section .text
18000745.9291105 tick_periodic + 67 in section .text
18000747.3941105 do_timer in section .text

Userspace loading a kernel module:

7806760.57896065 40087d
7806765.09696065 4442b0
7806765.65396065 entry_SYSCALL_64 in section .text
7806766.14496065 entry_SYSCALL_64 + 32 in section .text
7806766.46296065 entry_SYSCALL_64 + 36 in section .text
7806788.75296065 sys_init_module in section .text
7806796.76296065 sys_init_module + 62 in section .text
7806797.28296065 sys_init_module + 62 in section .text
7806800.64896065 sys_init_module + 65 in section .text
7806801.94496065 capable in section .text
7806802.91196065 security_capable in section .text
7806804.30796065 cap_capable in section .text
7806804.87796065 security_capable + 72 in section .text
7806805.43596065 ns_capable + 41 in section .text
7806805.92096065 capable + 23 in section .text
7806810.46796065 sys_init_module + 75 in section .text
7806815.59796065 sys_init_module + 86 in section .text
7806821.10196065 sys_init_module + 96 in section .text
7806827.28496065 sys_init_module + 109 in section .text
7806831.23396065 sys_init_module + 129 in section .text
7806839.75396065 security_kernel_module_from_file in section .text

What am I interested in knowing? My overall goal is to find areas in the kernel and userspace that we can optimize to make boot faster. Specifically it seems interesting at first to look at two questions:

  1. Which kernel functions consume the most time? This is “profiling 101”. It may indicate kernel functions which are particularly slow (eg. they contain sleeps).
  2. The time elapsed before each kernel function is called the first time. Because studying the whole trace is impossible (by hand), finding the first time each kernel function is called may give me an insight into major timeline events — eg. the first time one of the ext4 mount functions is called, we know that userspace has got to the point where it mounts the root filesystem.
  3. When does userspace make system calls? This is related to the previous point. It’s also the easiest type of analysis to do because system call entry is easy to detect from the trace.

More to follow …


Filed under Uncategorized


Recent Linux has four ioctls related to discarding blocks on block devices: BLKDISCARD, BLKZEROOUT,
. As far as I’m aware these are not documented anywhere, but this posting describes what they do and how to use them. For a good all round introduction to thin provisioning, see Paolo Bonzini’s talk from DevConf (video here).


This is the simplest ioctl. Given a range described as offset and length (both expressed in bytes), this code:

uint64_t range[2] = { offset, length };
ioctl (fd, BLKDISCARD, range);

will tell the underlying block device (fd) that it may discard the blocks which are contained in the given byte range.

The kernel code wants you to pass a range which is aligned to 512 bytes, and there may be further restrictions on the range you can pass which you can find out about by reading /sys/block/disk/queue/discard_alignment, /sys/block/disk/queue/discard_granularity, and /sys/block/disk/queue/discard_max_bytes.

If discard_max_bytes == 0 then discard isn’t supported at all on this device.

Discard is voluntary. The device might ignore it silently. Also what you read back from the discarded blocks might not be zeroes — you might read back stale data or random data (but see below).


BLKZEROOUT is a bit like BLKDISCARD but it writes zeroes. The code is similar:

uint64_t range[2] = { offset, length };
ioctl (fd, BLKZEROOUT, range);

Again note that offset and length are in bytes, but the kernel wants you to pass a 512-byte aligned range.

As far as I can tell from the implementation, the kernel implements this call itself. There is no help needed from devices, nor any device-specific optimization available.


I mentioned above that discarded blocks might read back as stale data. However some devices guarantee that discarded blocks read back as zeroes (which means, I assume, that BLKZEROOUT would not be needed on such block devices).

You can find out if the device you are currently using has this guarantee, either by reading the sysfs file /sys/block/disk/queue/discard_zeroes_data, or by using this code:

unsigned int arg;
discard_zeroes =
    ioctl (fd, BLKDISCARDZEROES, &arg) == 0 && arg;


Finally secure discard tells the device that you want to do a secure erase operation on the blocks. Again, pass a byte range (which has the same alignment requirements as BLKDISCARD):

uint64_t range[2] = { offset, length };
ioctl (fd, BLKSECDISCARD, range);

The ioctl will return an error (EOPNOTSUPP) for devices which cannot do secure erase.


Filed under Uncategorized

Half-baked idea: Kill SIGOK

For more half-baked ideas, see the ideas tag

I’m sure you’ve sat there waiting for some long compile to finish. It’s just started the tests, but you don’t want it to do the tests, just finish compiling dammit!

My half-baked idea is this: sending a special signal to a process (SIGOK) should kill the process, but the process should return 0 (ie. non-error exit).


Filed under Uncategorized

Compiling ODROID 3.12.3 kernel, attempt 3

Previously and previously. Will it be third time lucky?

I’m going to document the steps from the start. Note I’m building this on the hardware, not cross-compiling, since the ODROID-XU (Exynos 5410) with an external hard drive is fast (for ARM). Also note you’ll need to make your own serial port or buy one.

Clone this git repository which is based on upstream 3.12.3. Build it:

make odroidxu_defconfig
LOADADDR=40008000 make uImage dtbs
make modules

Install the kernel, device tree, and modules:

sudo cp arch/arm/boot/uImage /boot/uboot/uImage-3.12.3
sudo cp arch/arm/boot/dts/exynos5410-odroidxu.dtb /boot/uboot/
sudo make modules_install

Build an initramfs:

# First print the kernel version: make kernelrelease
sudo mkinitrd ./initrd-3.12.3 3.12.3-78393-g3886709
mkimage -A arm -O linux -T ramdisk -C none -a 0 -e 0 -n uInitrd -d initrd-3.12.3 uInitrd-3.12.3
sudo cp uInitrd-3.12.3 /boot/uboot/

Reboot, and be ready to interrupt u-boot by hitting a key on the serial console at the right moment.

sudo reboot

At the u-boot prompt type:

set bootargs console=ttySAC2,115200n8 loglevel=9 earlyprintk root=UUID=ad455a75-609c-4b3c-956b-bcef3ebc4603 ro
fatload mmc 0 43000000 uimage-3.12.3
fatload mmc 0 46000000 uinitrd-3.12.3
fatload mmc 0 49000000 exynos5410-odroidxu.dtb
bootm 43000000 46000000 49000000

Unfortunately (1):

[    0.390000] CPU: All CPU(s) started in SVC mode.

and unfortunately (2):

[    5.780000] systemd[1]: Running in initial RAM disk.

Welcome to Fedora 20 (Heisenbug) dracut-034-19.git20131021.fc20 (Initramfs)!

[    5.810000] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
[    5.810000] systemd[1]: Caught , dumped core as pid 1220.
[    5.820000] systemd[1]: Freezing execution.
[    5.890000] usb 1-2: new high-speed USB device number 2 using s5p-ehci
[    6.040000] usb 1-2: New USB device found, idVendor=0424, idProduct=9730
[    6.040000] usb 1-2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    6.050000] usb 1-2: no of_node; not parsing pinctrl DT
[    6.180000] usb 1-3: new high-speed USB device number 3 using s5p-ehci
[    6.420000] usb 1-3: New USB device found, idVendor=0424, idProduct=3503
[    6.420000] usb 1-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    6.430000] usb 1-3: no of_node; not parsing pinctrl DT
[    6.430000] hub 1-3:1.0: no of_node; not parsing pinctrl DT
[    6.440000] hub 1-3:1.0: USB hub found
[    6.440000] hub 1-3:1.0: 3 ports detected
[    6.730000] usb 1-3.2: new high-speed USB device number 4 using s5p-ehci
[    6.860000] usb 1-3.2: New USB device found, idVendor=04e8, idProduct=61b6
[    6.860000] usb 1-3.2: New USB device strings: Mfr=1, Product=11, SerialNumber=3
[    6.870000] usb 1-3.2: Product: Samsung M3 Portable
[    6.870000] usb 1-3.2: Manufacturer: JMicron
[    6.880000] usb 1-3.2: SerialNumber: 00000000011E4C02
[    6.880000] usb 1-3.2: no of_node; not parsing pinctrl DT


Filed under Uncategorized

Compiling ODROID-XU 3.11-rc7 kernel, attempt 2

Previously …

Now that I have made my own serial port for the ODROID-XU, creating something omitted from this development board, I’m ready to try to compile my own kernel again. At least I should be able to debug any error messages this time.

I did everything in the previous posting, but this time built a uImage properly:

LOADADDR=40008000 make uImage dtbs
sudo cp arch/arm/boot/uImage /boot/uboot/uImage-3.11-rc7
sudo cp uInitrd /boot/uboot/uInitrd-3.11-rc7
sudo cp arch/arm/boot/dts/exynos5410-odroidxu.dtb /boot/uboot/

We end up with (where the bold files are the ones I have built):

$ ll /boot/uboot/
total 33047
-rwxr-xr-x 1 root root    5510 Sep 27 08:14 boot.ini
-rwxr-xr-x 1 root root   28603 Dec  7 16:34 exynos5410-odroidxu.dtb
-rwxr-xr-x 1 root root  293168 Sep 27 08:14 u-boot.bin
-rwxr-xr-x 1 root root 2113488 Dec  7 16:33 uImage-3.11-rc7
-rwxr-xr-x 1 root root 6518598 Sep 27 08:14 uInitrd
-rwxr-xr-x 1 root root 8104359 Dec  7 16:36 uInitrd-3.11-rc7
-rwxr-xr-x 1 root root 6518598 Nov  8 17:36 uInitrd-official3.4.5
-rwxr-xr-x 1 root root 5127832 Sep 27 08:13 zImage
-rwxr-xr-x 1 root root 5127832 Nov  8 17:36 zImage-official3.4.5

I boot this by interrupting u-boot and typing the load commands by hand. (Note that /boot/uboot is a FAT partition).

set bootargs console=ttySAC2,115200n8 loglevel=9 earlyprintk root=UUID=ad455a75-609c-4b3c-956b-bcef3ebc4603 ro
fatload mmc 0 43000000 uimage-3.11-rc7
fatload mmc 0 46000000 uinitrd-3.11-rc7
fatload mmc 0 49000000 exynos5410-odroidxu.dtb
bootm 43000000 46000000 49000000

This boots the kernel, but fails trying to find the root filesystem:

[    9.680000] mmc1: error -110 whilst initialising SD card
[    9.700000] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 300000Hz, actual 299401HZ div = 167)
[   11.730000] mmc1: error -110 whilst initialising SD card
[   11.750000] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 200000Hz, actual 200000HZ div = 250)
[   13.790000] mmc1: error -110 whilst initialising SD card
[   13.810000] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 196079Hz, actual 196078HZ div = 255)
[   15.850000] mmc1: error -110 whilst initialising SD card
[   17.680000] systemd[1]: No control group support available, not creating root group.
[   17.710000] systemd[1]: Expecting device dev-disk-by\x2duuid-ad455a75\x2d609c\x2d4b3c\x2d956b\x2dbcef3ebc4603.device...
[   17.720000] systemd[1]: Starting -.slice.
         Expecting device dev-disk-by\x2du[   17.730000] systemd[1]: Caught , dumped core as pid 1144.
[   17.730000] systemd[1]: Freezing execution.

Update #2: See the comments section.

Update: It’s because the 3.11 kernel doesn’t support the MMC card. I’ll try later using an external USB drive.


Filed under Uncategorized

Half-baked ideas: Kernel modules-in-a-file

For more half-baked ideas, see the ideas tag.

The Linux kernel:


5 MB of highly optimized code that runs the world’s computers. You can copy it around and boot it on different hardware.

No wait, that’s all wrong. I’m forgetting about the modules. There are another 2832 files in /lib/modules/3.11.9-200.fc19.x86_64 and without them this kernel will hardly function.

It’s nice to be able to copy the kernel around — to compile it on fast x86-64 and run it on slow ARM, or to run the host kernel in a VM, but all those module files make everyone’s lives a lot more complex.

So my half-baked idea is this:

Combine all the modules into a single file, let’s call it /boot/modules-3.11.9-200.fc19.x86_64. This file might start off as an ext4 filesystem which can be loopback-mounted on /lib/modules. Eventually I’d hope that the module tools could load modules from the file directly.

Now you can happily cross-compile and copy just two kernel files from x86 to ARM or into your VMs, and that’s a lot less hassle.


Filed under Uncategorized

How are Linux drives named beyond drive 26 (/dev/sdz, ..)?

[Edit: Thanks to adrianmonk for correcting my math]

It’s surprisingly hard to find a definitive answer to the question of what happens with Linux block device names when you get past drive 26 (ie. counting from one, the first disk is /dev/sda and the 26th disk is /dev/sdz, what comes next?) I need to find out because libguestfs is currently limited to 25 disks, and this really needs to be fixed.

Anyhow, looking at the code we can see that it depends on which driver is in use.

For virtio-blk (/dev/vd*) the answer is:

Drive # — Name
1 vda
26 vdz
27 vdaa
28 vdab
52 vdaz
53 vdba
54 vdbb
702 vdzz
703 vdaaa
704 vdaab
18278 vdzzz

Beyond 18278 drives the virtio-blk code would fail, but that’s not currently an issue.

For SATA and SCSI drives under a modern Linux kernel, the same as above applies except that the code to derive names works properly beyond sdzzz up to (in theory) sd followed by 29 z‘s! [Edit: or maybe not?]

As you can see virtio and SCSI/SATA don’t use common code to name disks. In fact there are also many other block devices in the kernel, all using their own naming scheme. Most of these use numbers instead of letters: eg: /dev/loop0, /dev/ram0, /dev/mmcblk0 and so on.

If disks are partitioned, then the partitions are named by adding the partition number on the end (counting from 1). But if the drive name already ends with a number then a letter p is added between the drive name and the partition number, thus: /dev/mmcblk0p1.


Filed under Uncategorized