Tag Archives: ext2

nbdkit linuxdisk plugin

I’m writing a new nbdkit plugin called linuxdisk. nbdkit is our flexible, plugin-based NBD server, and this new plugin lets you create a complete Linux-compatible virtual disk from a host directory on the fly.

One of the many uses for this is booting minimal VMs very quickly. Here’s an example you can set up in a few seconds. It boots to an interactive busybox shell:

$ mkdir /tmp/root /tmp/root/sbin /tmp/root/bin /tmp/root/dev
$ sudo mknod /tmp/root/dev/console c 5 1
$ cp /sbin/busybox /tmp/root/sbin/
$ ln /tmp/root/sbin/busybox /tmp/root/bin/sh
$ ln /tmp/root/sbin/busybox /tmp/root/bin/ls
$ ln /tmp/root/sbin/busybox /tmp/root/sbin/init
$ nbdkit -U - linuxdisk /tmp/root \
    --run 'qemu-kvm -display none -kernel /boot/vmlinuz-4.20.8-200.fc29.x86_64 -drive file=nbd:unix:$unixsocket,snapshot=on -append "console=ttyS0 root=/dev/sda1 rw" -serial stdio'

If you need any extra files in the VM just drop them straight into /tmp/root before booting it.

Edit: How the heck does /dev get populated in this VM?



Filed under Uncategorized

Visualizing reads, writes and alignment

I wrote an out of tree patch to qemu that lets you gather read and write traces when a virtual machine accesses its virtual hard disk.

Using the patched qemu, a qemu wrapper, and some simple visualization tools we can look at how Linux executes individual filesystem operations.

Firstly we use the guestfish prepared disk feature to partition a disk and create an ext2 filesystem on the disk. The command under test is:

$ guestfish -N fs:ext2:10M

and this is what the disk access looks like:

The whole box represents the entire disk (10 MB), and each cell represents one sector (512 bytes in this case).

The large number of unaligned writes there in fact points to a mistake in the guestfs part-disk operation, which creates an unaligned whole-disk partition. With a (non-upstream) patch that makes it create an aligned partition, the results look a little better.
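
To make the alignment issue concrete, here is a minimal sketch of how one might check whether a partition's start sector falls on a block boundary. The `is_aligned` helper is hypothetical, the 4096-byte boundary is an assumed example, and sector 63 is the traditional DOS start sector, not a value read from the trace. It also assumes that guestfish's 10M means 10 × 1024 × 1024 bytes.

```shell
# Hypothetical helper: report whether a partition start is aligned.
# $1 = start sector, $2 = alignment in bytes, $3 = sector size in bytes (default 512)
is_aligned() {
    if [ $(( $1 * ${3:-512} % $2 )) -eq 0 ]; then
        echo aligned
    else
        echo unaligned
    fi
}

# Number of cells in the diagram: one per 512-byte sector of a 10M disk.
echo "sectors: $(( 10 * 1024 * 1024 / 512 ))"

is_aligned 63 4096    # 63 * 512 = 32256, not a multiple of 4096
is_aligned 64 4096    # 64 * 512 = 32768, a multiple of 4096
```

A partition that starts off a boundary shifts every filesystem block with it, which is one plausible reason the first diagram shows so many unaligned writes.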

The second test is to take this disk image and simply mount it up. The command under test is:

$ guestfish -a test1.img -m /dev/sda1

and the access pattern looks like this:

As expected this is a mostly-read operation, but the act of mounting performs some writes to the ext2 superblock (shown in red).

Finally let’s see what it looks like to create a file on this empty filesystem:

$ guestfish -a test1.img -m /dev/sda1 \
  write /hello "hello, world."

That diagram includes the mount operation, so you have to mentally subtract that to see just the various file and metadata writes.

An obvious area of improvement here is to have libguestfs signal down to qemu to start and stop the trace, so that we can trace single operations (like just creating the file, not mounting the filesystem).

Another area of investigation is to add an LVM layer, and to experiment with the filesystem blocksize and other tunables.

It would also be good to start identifying various filesystem metadata areas by name, such as the inode table, block free bitmap and superblock.



Creating ext2 filesystems from scratch

We’ve had to change the way libguestfs boots its appliance, so the appliance will now be an ext2 filesystem on an ordinary drive instead of an initrd.

The problem is that we need to create the ext2 filesystem on the fly, and we can’t use libguestfs to do this (which would be way easier than the method I’m about to describe).

However e2fsprogs comes with a low-level library for manipulating ext2 filesystem images, and it is just about possible to use this to make an ext2 filesystem and put files and directories on to it.

The documentation on libext2fs is very light on detail, but I have written some example code:


Compile and run this with:

$ gcc -Wall test.c -o test -lext2fs -lcom_err
$ ./test
$ guestfish -a /tmp/test.img -m /dev/sda
><fs> ll /
drwxr-xr-x  4  500  500   1024 Aug 19 12:08 .
dr-xr-xr-x 20 root root      0 Aug 19 12:09 ..
-rw-r--r--  1 root root 100000 Aug 19 12:08 hello
drwx------  2 root root  12288 Aug 19 12:08 lost+found
drwxr-xr-x  2 root root   1024 Aug 19 12:08 mydir
><fs> ll /mydir/
total 3
drwxr-xr-x 2 root root 1024 Aug 19 12:08 .
drwxr-xr-x 4  500  500 1024 Aug 19 12:08 ..
-rw-r--r-- 1 root root   99 Aug 19 12:08 file_in_mydir
><fs> cat /mydir/file_in_mydir

It’s also advisable to read about how ext2 works internally.


Is ext2/3/4 faster? On LVM?

This question arose at work: does LVM impose a performance penalty compared to using straight partitions? To save you the trouble, the answer is “not really”. There is a very small penalty, but as with all benchmarks it depends on what the benchmark measures versus what your real workload does. In any case, here is a small guestfish script you can use to compare the performance of various filesystems, with or without LVM, across various operations. Whether you trust the results is up to you, but I would advise caution.

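#!/bin/bash -

# Scratch image path (assumed; any writable location works).
tmpfile=/tmp/test.img

for fs in ext2 ext3 ext4; do
    for lvm in off on; do
        rm -f $tmpfile
        if [ $lvm = "on" ]; then
            # Filesystem on a logical volume.
            guestfish <<EOF
              sparse $tmpfile 1G
              run
              part-disk /dev/sda efi
              pvcreate /dev/sda1
              vgcreate VG /dev/sda1
              lvcreate LV VG 800
              mkfs $fs /dev/VG/LV
EOF
            dev=/dev/VG/LV
        else # no LVM
            # Filesystem directly on the partition.
            guestfish <<EOF
              sparse $tmpfile 1G
              run
              part-disk /dev/sda efi
              mkfs $fs /dev/sda1
EOF
            dev=/dev/sda1
        fi
        echo "fs=$fs lvm=$lvm"
        # Time the two operations on the freshly created filesystem.
        guestfish -a $tmpfile -m $dev <<EOF
          time fallocate /file1 200000000
          time cp /file1 /file2
EOF
    done
done

Output:
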
fs=ext2 lvm=off
elapsed time: 2.74 seconds
elapsed time: 4.52 seconds
fs=ext2 lvm=on
elapsed time: 2.60 seconds
elapsed time: 4.24 seconds
fs=ext3 lvm=off
elapsed time: 2.62 seconds
elapsed time: 4.31 seconds
fs=ext3 lvm=on
elapsed time: 3.07 seconds
elapsed time: 4.79 seconds

# notice how ext4 is much faster at fallocate, because it
# uses extents

fs=ext4 lvm=off
elapsed time: 0.05 seconds
elapsed time: 3.54 seconds
fs=ext4 lvm=on
elapsed time: 0.05 seconds
elapsed time: 4.16 seconds
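
To read the numbers side by side, the following sketch computes the relative change from "lvm=off" to "lvm=on" for each filesystem and operation. The timings are copied from the output above. The `lvm_overhead` helper is hypothetical, and because each figure comes from a single run, small differences may well be noise.

```shell
# Columns: filesystem, operation, seconds without LVM, seconds with LVM.
# Prints the change with LVM as a signed percentage (positive = slower with LVM).
lvm_overhead() {
    awk '{ printf "%-4s %-9s %+.1f%%\n", $1, $2, ($4 - $3) / $3 * 100 }' <<'EOF'
ext2 fallocate 2.74 2.60
ext2 cp        4.52 4.24
ext3 fallocate 2.62 3.07
ext3 cp        4.31 4.79
ext4 fallocate 0.05 0.05
ext4 cp        3.54 4.16
EOF
}

lvm_overhead
```

The signs are mixed (ext2 comes out slightly faster with LVM in this run), which fits the advice to treat the results with caution.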



Terabyte virtual disks

This is fun. I added a new command to guestfish which lets you create sparse disk files. This makes it really easy to test out the limits of partitions and Linux filesystems.

Starting modestly, I tried a 1 terabyte disk:

$ guestfish

Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.

Type: 'help' for help with commands
      'quit' to quit the shell

><fs> sparse /tmp/test.img 1T
><fs> run

The real disk image so far isn’t so big, just 4K according to “du”:

$ ll -h /tmp/test.img 
-rw-rw-r-- 1 rjones rjones 1T 2009-11-04 17:52 /tmp/test.img
$ du -h /tmp/test.img
4.0K	/tmp/test.img
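
The same gap between apparent size and allocated space can be reproduced outside guestfish. This is a minimal sketch, assuming GNU coreutils (`truncate`, `stat -c`) and a host filesystem that supports sparse files. The `sparse_demo` function name is made up for illustration, and it uses a 1G file, not a 1T one.

```shell
# Create a sparse file and compare its apparent size with the space allocated.
sparse_demo() {
    img=$(mktemp)
    truncate -s 1G "$img"            # set the length without writing any data
    apparent=$(stat -c %s "$img")    # size that "ls -l" reports, in bytes
    # Allocated blocks times block unit: the space that "du" counts, in bytes.
    allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))
    echo "apparent=$apparent allocated=$allocated"
    rm -f "$img"
}

sparse_demo
```

The allocated figure should be a tiny fraction of the apparent size, since no data blocks have been written yet.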

Let’s partition it:

><fs> sfdiskM /dev/vda ,

The partition table only uses 1 sector, so the disk image has increased to just 8K. Let’s make an ext2 filesystem on the first partition:

><fs> mkfs ext2 /dev/vda1

This command takes some time, and the sparse disk file grows to 17 GB, so ext2 has an overhead of approximately 1.7%.

We can mount the filesystem and look at it:

><fs> mount /dev/vda1 /
><fs> df-h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda1            1008G   72M  957G   1% /sysroot

Can we try this with larger and larger virtual disks? In theory yes, but in practice the 1.7% overhead proves to be a problem. A 10T experiment would require a very real 170GB of local disk space, and where I was hoping to go, 100T and beyond, would be too large for my test machines.
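
The arithmetic behind those estimates can be sketched as follows. It assumes the overhead scales linearly from the single 1T measurement (17 GB), which the experiment suggests but does not verify; the `overhead_gb` helper is hypothetical.

```shell
# Estimated real disk space consumed by mkfs.ext2 on a sparse disk.
# $1 = virtual disk size in terabytes; prints the estimate in GB.
overhead_gb() {
    echo $(( $1 * 17 ))
}

for t in 1 10 100; do
    echo "${t}T disk: about $(overhead_gb "$t") GB of real space"
done

# The percentage itself: 17 GB out of the 1024 GB in a 1T disk.
awk 'BEGIN { printf "overhead: %.2f%%\n", 17 / 1024 * 100 }'
```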

In fact there is another limitation before we reach there. Local sparse files on my host ext4 filesystem are themselves limited to under 16T:

><fs> sparse /tmp/test.img 16T
write: File too large
><fs> sparse /tmp/test.img 15T

Although the appliance does boot with that 15T virtual disk:

><fs> blockdev-getsize64 /dev/vda 


I noticed from Wikipedia that XFS has a maximum file size of 8 exabytes minus 1 byte. By creating a temporary XFS filesystem on the host, I was able to create a 256TB virtual disk:

><fs> sparse /mnt/tmp/test/test.img 256T
><fs> run
><fs> blockdev-getsize64 /dev/vda 

Unfortunately at this point things break down. MBR partitions won’t work on such a huge disk, or at least sfdisk can’t partition it correctly.
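
A likely reason is the MBR format itself. Each partition entry stores its start sector and sector count as 32-bit values, so with 512-byte sectors a partition can cover roughly 2T at most. The sketch below works through that arithmetic. The helper names are hypothetical, and it assumes 512-byte sectors and guestfish's convention that 1T is 2^40 bytes.

```shell
# Largest span addressable by a 32-bit sector count.
# $1 = sector size in bytes (default 512)
mbr_limit_bytes() {
    echo $(( 4294967296 * ${1:-512} ))
}

# $1 = disk size in bytes; prints yes if it fits under the MBR limit.
fits_in_mbr() {
    if [ "$1" -le "$(mbr_limit_bytes)" ]; then echo yes; else echo no; fi
}

T=$(( 1024 * 1024 * 1024 * 1024 ))   # one terabyte as 2^40 bytes

echo "MBR limit: $(( $(mbr_limit_bytes) / T ))T"
echo "1T disk fits:   $(fits_in_mbr $(( 1 * T )))"
echo "256T disk fits: $(fits_in_mbr $(( 256 * T )))"
```

This is consistent with the 1T disk partitioning without trouble earlier in the experiment while the 256T disk does not.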

I’m not sure what my options are at this point, but at least this is an interesting experiment in hitting limitations.

