Category Archives: Uncategorized

NBD graphical viewer – RAID 5 edition

If you saw my posting from two days ago you’ll know I’m working on visualizing what happens on block devices when you perform various operations. Last time we covered basics like partitioning a disk, creating a filesystem, creating files, and fstrim.

This time I’ve tied together 5 of the nbdcanvas widgets into a bigger Tcl application that can show what’s happening on a RAID 5 disk set. As with the last posting there’s a video followed by commentary on what happens at each step.


  • 00:00: I start guestfish connected to all 5 nbdkit servers. Also of note I’ve added raid456.devices_handle_discard_safely=1 to the appliance kernel command line, which is required for discards to work through MD RAID devices (I didn’t know that before yesterday).
  • 00:02: When the appliance starts up, the black flashes show the kernel probing for possible partitions or filesystems. The disks are blank so nothing is found.
  • 00:16: As in the previous post I’m partitioning the disks using GPT. Each ends up with a partition table at the start and end of the disk (two red blocks of pixels).
  • 00:51: Now I use mdadm --create (via the guestfish md-create command) to make a RAID 5 array across the first 4 disks. The 4th disk is the parity disk — you can see disks 1 through 3 being scanned and the parity information being written to the 4th disk. The 5th disk is a hot spare. Notice how the scanning continues after the mdadm command has returned. In real arrays this can go on for hours or days.
  • 01:11: I create a filesystem. The first action that mkfs performs is discarding previous data (indicated by light purple). Notice that the parity data is also discarded, which surprised me, but does make sense.
  • 01:27: The RAID array is mounted and I unpack a tarball into it.
  • 01:40: I delete the files and fstrim, which discards the underlying blocks again.
  • 01:48: Now I’m going to inject errors at the block layer into the 3rd disk. The Error checkbox in the Tcl widget simply creates a file. We’re using the nbdkit error filter which monitors for the named file and when it is created starts injecting errors into any read or write operation. Almost immediately the RAID array notices the damage and starts rebuilding on to the hot spare. Notice the black flashes where it reads the working disks (including old parity disk) to construct the redundant information on the spare.
  • 01:55: While reconstruction is under way, the RAID array can be used normally.
  • 02:14: Examining /proc/mdstat shows that the third disk has been marked failed.
  • 02:24: Now I’m going to inject errors into the 4th disk as well. This RAID array can survive this, operating in a “degraded state”, but there is no more redundancy.
  • 02:46: Finally we can examine the kernel messages which show that the RAID array is continuing on 3 devices.

In case you want to reproduce the results yourself, the full command to run nbdkit (repeated 5 times) is:

$ rm /tmp/sock1 /tmp/error1
$ ./nbdkit -fv -U /tmp/sock1 \
    --filter=error --filter=log --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log1 \
    rdelay=40ms wdelay=40ms \
    error-rate=100% error-file=/tmp/error1

And the nbdraid viewing program:

$ ./nbdraid.tcl 5 $((64*1024*1024)) /tmp/log%d /tmp/error%d

Leave a comment

Filed under Uncategorized

NBD graphical viewer

Ever wondered what is really happening when you write to a disk? What blocks the filesystem writes to and so on? With our flexible, plug-in based NBD server called nbdkit and a little Tcl/Tk program I wrote you can now visualise this.

As in this video (MP4)


… which shows me opening a blank disk, partitioning it, creating an ext4 filesystem and writing some files.

There’s a lot going on in this video, which I’ll explain below. But first to say that each pixel corresponds to a 4K block on disk — the total disk size is 64M which is 128×128 pixels, and each row is therefore half a megabyte. Red pixels are writes. Black flashing pixels show reads. Light purple is for trim requests, and white pixels are zero requests.

nbdkit was run with the following command line:

$ nbdkit -fv \
    --filter=log \
    --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log \
    rdelay=40ms wdelay=40ms

This means that we’re using the memory plugin to create a throwaway blank disk of 64M. In front of this plugin we place two filters: The delay filter delays all reads and writes by 40ms. This makes it easier to see what’s going on. The second filter is the log filter which records all requests in a log file (/tmp/log).

The log file is what the second command reads asynchronously to generate the graphical image:

$ ./nbdview.tcl /tmp/log $((64*1024*1024))

So to the video (MP4):

  • 00:07: I start guestfish connected to the NBD server. This boots a Linux appliance, and you can see from the flashes of black how the Linux kernel probes the disk every which way to try to detect any kind of partition or filesystem signature. (Recall that I’m intentionally delaying all read requests which is why the appliance boot and probing seems to take so long. In reality these probes happen near instantaneously.) Of course the disk is all zeroes at this point, so nothing is found.
  • 00:23: I partition the disk using GPT. The partitioning is done under the hood by GNU parted and as you can see there is a considerable amount of probing going on by both parted and then the kernel still looking for filesystem signatures. Eventually we end up with two blocks of red (written) data, because GPT creates both a primary and secondary partition table at the beginning and end of the disk.
  • 00:36: I create an ext4 filesystem inside the partition. After even more probing by mkfs the first major operation is to trim/discard all data on the disk (shown by the disk filling up with light purple). Then mkfs writes a large block of data in the middle of the disk which I believe is the journal, followed by four dots which I believe could be backup superblocks.
  • 00:48: Interestingly filesystem creation has not finished. ext4 (as well as other modern filesystems) defer a lot of work to the kernel, and this is obvious when I mount the disk. Notice that a few seconds after the mount (around 00:59) the kernel starts zeroing parts of the disk. I believe this is the inode table and block free bitmap for the first block group. For larger disks this lazy initialization could go on for a long time.
  • 01:05: I unpack a tarball into the filesystem. As expected the operation finishes almost instantaneously, and nothing is actually written to disk. However issuing an explicit sync at 01:11 causes the files and directories to be written, filling first the data blocks and then the inodes and block free bitmap (is there a reason these are written last, or is it just coincidence? Also does the Linux page cache retain the order that the filesystem wrote the blocks?)
  • 01:18: I delete the directory tree I just created. As you’d expect nothing is written to disk, and even after a sync nothing much changes.
  • 01:26: In contrast when I fstrim the filesystem, all the now-deleted data blocks are discarded (light purple). This is the same principle which virt-sparsify --in-place uses to make a disk image sparse.
  • 01:32: Finally after unmounting the filesystem I issue a blkdiscard command which throws the whole thing away. Even after this Linux is probing the partition to see if somehow a filesystem signature could be present.

You can easily reproduce this and similar results yourself using nbdkit and nbdview, and I’ve submitted a talk to FOSDEM about this and much more fun you can have with nbdkit.

1 Comment

Filed under Uncategorized

Split block drivers from qemu with nbdkit

One interesting talk at KVM Forum last week was Stefan Hajnoczi‘s talk about QEMU security (sorry, it’s not online — it should eventually be available alongside all the other talks on this youtube channel).

One thing Stefan mentioned was whether QEMU might be split into multiple processes. This has advantages for security:

  1. Crashing or corrupting a single process doesn’t automatically expose the whole hypervisor.
  2. You can separately label each process using SELinux and independently control those policies, providing finer-grained security.

For block drivers you can do this today, and in fact we do this already when we run qemu from virt-v2v. Consider the case where we are using a remote HTTPS disk image:

$ qemu -drive https://remote/disk.img


The curl driver linked to and running inside QEMU needs to make a remote TCP/IP connection, has to encode and decode TLS, is linked to libcurl and so on, and all those things also apply to the QEMU process. If the curl block driver has problems for any reason, these also affect QEMU. SELinux labels and transitions needed to access the socket are labels and transitions needed by the QEMU process. An exploit in the driver is a QEMU exploit.

With nbdkit we can split this out:

$ nbdkit -U - curl url=https://remote/disk.img \
  --run 'qemu -drive $nbd'


From a security point of view this has immediate advantages: If the curl driver crashes or is exploited, only nbdkit is affected. QEMU only needs access to a private Unix domain socket, and conversely nbdkit doesn’t need access to anything else that QEMU uses. You can add resource limits, separate SELinux policy, seccomp, namespaces and anything else you can think of to nbdkit to contain it tightly.

It’s worth pointing out the obvious disadvantages too: It’s likely that there will be a performance impact — although don’t discount how efficient NBD is and how this architecture also lets you scale more effectively over NUMA nodes. And this puts all our eggs into the qemu NBD client which must be very robust.

I should say also that this is more laborious to set up, and it would only really work if some other component (libvirt ideally) handled the creation of the separate nbdkit process. In the example above I used captive nbdkit, but that only works if you have a single drive, and one of the other mechanisms would be more scalable.

Leave a comment

Filed under Uncategorized

New in nbdkit: Create a virtual floppy disk

nbdkit is our flexible, plug-in based Network Block Device server.

While I was visiting the KVM Forum last week, one of the most respected members of the QEMU development team mentioned to me that he wanted to think about deprecating QEMU’s VVFAT driver. This QEMU driver is a bit of an oddity — it lets you point QEMU to a directory of files, and inside the guest it will see a virtual floppy containing those files:

$ qemu -drive file=fat:/some/directory

That’s not the odd thing. The odd thing is that it also lets you make the drive writable, and the VVFAT driver then turns those writes back into modifications to the host filesystem (remember that these are writes happening to raw FAT32 data structures, the driver has to infer from just seeing the writes what is happening at the filesystem level). Which is both amazing and crazy (and also buggy).

Anyway I have implemented the read-only part of this in nbdkit. I didn’t implement the write stuff because that’s very ambitious, although if you were going to implement that, doing it in nbdkit would be better than qemu since the only thing that can crash is nbdkit, not the whole hypervisor.

Usage is very simple:

$ nbdkit floppy /some/directory

This gives you an NBD source which you can connect straight to a qemu virtual machine:

$ qemu -drive nbd:localhost:10809

or examine with guestfish:

$ guestfish --ro --format=raw -a nbd://localhost -m /dev/sda1
Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

> ll /
total 2420
drwxr-xr-x 14 root root  16384 Jan  1  1970 .
drwxr-xr-x 19 root root   4096 Oct 28 10:07 ..
-rwxr-xr-x  1 root root     40 Sep 17 21:23 .dir-locals.el
-rwxr-xr-x  1 root root    879 Oct 27 21:10 .gdb_history
drwxr-xr-x  8 root root  16384 Oct 28 10:05 .git
-rwxr-xr-x  1 root root   1383 Sep 17 21:23 .gitignore
-rwxr-xr-x  1 root root   1453 Sep 17 21:23 LICENSE
-rwxr-xr-x  1 root root  34182 Oct 28 10:04 Makefile
-rwxr-xr-x  1 root root   2568 Oct 27 22:17
-rwxr-xr-x  1 root root  32085 Oct 27 22:18
-rwxr-xr-x  1 root root    620 Sep 17 21:23 OTHER_PLUGINS
-rwxr-xr-x  1 root root   4628 Oct 16 22:36 README
-rwxr-xr-x  1 root root   4007 Sep 17 21:23 TODO
-rwxr-xr-x  1 root root  54733 Oct 27 22:18 aclocal.m4
drwxr-xr-x  2 root root  16384 Oct 27 22:18 autom4te.cache
drwxr-xr-x  2 root root  16384 Oct 28 10:04 bash
drwxr-xr-x  5 root root  16384 Oct 27 18:07 common

Previously … create ISO images on the fly in nbdkit

Leave a comment

Filed under Uncategorized

Does virt-v2v preserve sparseness?

A common question is does virt-v2v preserve sparseness (aka thin provisioning) when you convert the guest from VMware to a KVM-based hypervisor like oVirt/RHV or OpenStack? The very short answer is: no. The medium answer is: The question doesn’t really make sense. For the long answer, read on …

First of all we need to ask what is thin provisioning? For complex and essentially historical reasons when you provision a new guest you have to decide how much maximum disk space you want it to have. Let’s say you choose 10 GB. However because a new guest install might only take, say, 1 GB, you can also decide if you want the whole 10 GB to be preallocated up front or if you want the disk to be thin provisioned or sparse, meaning it’ll only take up 1 GB of host space, but that will increase as the guest creates new files. There are pros and cons to preallocation. Preallocating the full 10 GB obviously takes a lot of extra space (and what if the guest never uses it?), but it can improve guest performance in some circumstances.

That is what happens initially. By the time we come to do a virt-v2v conversion that guest may have been running for years and years. Now what does the disk look like? It doesn’t matter if the disk was initially thin provisioned or fully allocated, what matters is what the guest did during those years.

Did it repeatedly fill up the disk and/or delete those files? – In which case your initially thin provisioned guest could now be fully allocated.

Did it have trimming enabled? Your initially preallocated guest might now have become sparsely allocated.

In any case VMware doesn’t store this initial state, nor does it make it very easy to find out which bits of the disk are actually backed by host storage and which bits are sparse (well, maybe this is possible, but not using the APIs we use when liberating your guests from VMware).

In any case, as I explained in this talk (slides) from a few years ago, virt-v2v tries to avoid copying any unused, zeroed or deleted parts of the disk for efficiency reasons, and so it will always make the disk maximally sparse when copying it (subject to what the target hypervisor does, read on).

When virt-v2v comes to creating the target guest, the default is to create a maximally sparse guest, but there are two ways to change this:

  1. You can specify the -oa preallocated option, where virt-v2v will try to ask the target hypervisor to fully preallocate the target disks of the guest.
  2. For some hypervisors, especially RHV, your choice of backend storage may mean you have no option but to use preallocated disks (unfortunately I cannot give clear advice here, best to ask a RHV expert).

The basic rule is that when converting guests you need to think about whether you want the guest to be sparse or preallocated after conversion, based on your own performance vs storage criteria. Whether it happened to be thin provisioned when you set it up on VMware years earlier isn’t a relevant issue.


Filed under Uncategorized

New in nbdkit: Create an ISO image on the fly

nbdkit is the pluggable Network Block Device server that Eric and I wrote. I have submitted a talk to FOSDEM next February about the many weird and wonderful ways you can use nbdkit as a flexible replacement for loopback mounting.

Anyway, new in nbdkit 1.7.6 you can now create ISO 9660 (CD-ROM) disk images on the fly from a directory:

# nbdkit iso /boot params="-JrT"
# nbd-client -b 512 localhost /dev/nbd0
# file -bsL /dev/nbd0
ISO 9660 CD-ROM filesystem data 'CDROM'
# mount /dev/nbd0 /tmp/mnt
# ls /tmp/mnt
# umount /tmp/mnt
# nbd-client -d /dev/nbd0
# killall nbdkit

That ISO wouldn’t actually be bootable, but you could create one (eg. an El Torito ISO) by adding the appropriate extra parameters.

To head off the first question: If you copy files into the directory while nbdkit is running, do they appear in the ISO? Answer: No! This is largely impossible with the way Linux block devices work.


Filed under Uncategorized

Fedora/RISC-V now mirrored as a Fedora “alternative” architecture These packages now get mirrored further by the Fedora mirror system, eg. to

If you grab the latest nightly Fedora builds you can get the mirrors by editing the /etc/yum.repos.d/*.repo file.

Also we got some additional help so we now have loads more build hosts! These were provided by Facebook with hosting by Oregon State University Open Source Lab (see cfarm), so thanks to them.

Thanks to David Abdurachmanov and Laurent Guerby for doing all the work (I did nothing).


Filed under Uncategorized