Looks great, and nbdkit compiles out of the box.
Here’s a simple block device with virtual size 1M that reads as zeroes:
nbdkit sh - <<'EOF'
case "$1" in
  get_size) echo 1M ;;
  pread) dd if=/dev/zero count=$3 iflag=count_bytes ;;
  *) exit 2 ;;
esac
EOF
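To see what each method returns without starting nbdkit, the same case statement can be wrapped in a function and called by hand. This is only a sketch: the name `zero_plugin` is made up for illustration, and it relies on what the script above already assumes, namely that the method name arrives as `$1` and the byte count of a `pread` as `$3`. Note that `iflag=count_bytes` is a GNU dd extension.

```shell
# Hypothetical wrapper around the script body above, so that each
# method can be exercised directly from a shell.
zero_plugin() {
    case "$1" in
        get_size)
            # Virtual size of the device.
            echo 1M
            ;;
        pread)
            # $3 is the number of bytes requested; emit that many zeroes.
            # dd writes its transfer statistics to stderr, so discard them.
            dd if=/dev/zero count="$3" iflag=count_bytes 2>/dev/null
            ;;
        *)
            # The original script exits with status 2 for every other
            # method; inside a function we return that status instead.
            return 2
            ;;
    esac
}

zero_plugin get_size                    # the size the script reports
zero_plugin pread "" 512 0 | wc -c      # bytes produced for a 512 byte read
```

The second argument (an empty string here) stands in for whatever nbdkit passes between the method name and the count; the script never looks at it.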
I’ve submitted a talk about nbdkit, our flexible pluggable NBD server, to FOSDEM next February. This is going to be about using NBD as a better way to do loop mounts in Linux.
In preparation I gave a very early version of the talk to a small Red Hat audience.
Sorry about the slow start. You may want to skip to 2 mins to get past the intro.
Summary of what’s in the talk:
I’ve submitted a talk about nbdkit, our flexible, pluggable NBD server, to FOSDEM next year. It is about how you can use nbdkit as a replacement for loopback mounts (or “loop mounts”, as I was told off for not calling them last week). In preparation for that talk I ran through it in private with a small Red Hat audience on Monday. If I can, I will release that video some time, but I may have to edit out Red Hat “super-secret” stuff first (most likely not, because there aren’t any secrets in it, but I’m still waiting for the internal video to be released).
Anyway, this attracted a lot of interest, and one question that was asked was why the xz plugin, which lets you transparently open and uncompress XZ files on the fly, was a plugin at all. Surely it would make more sense for it to be a filter, so that it could be used not just to uncompress local files but also xz-compressed cloud images over HTTPS?
The answer is yes it would! So I fixed it. XZ is now a filter (the plugin is left around but we’ll deprecate it eventually).
You can use it on top of the file plugin, curl plugin or other plugins:
$ nbdkit --filter=xz file file.xz
$ nbdkit --filter=xz curl https://download.fedoraproject.org/pub/fedora/linux/releases/29/Cloud/x86_64/images/Fedora-Cloud-Base-29-1.2.x86_64.raw.xz
This is fun and you can use this to boot the cloud image entirely remotely:
$ qemu-system-x86_64 -machine accel=kvm:tcg \
    -cpu host -m 2048 \
    -drive file=nbd:localhost:10809,if=virtio
However it’s incredibly slow. One problem is that the Fedora mirror sites aren’t very happy about you issuing lots of small HTTP Range requests and I observed that they throttle the connection quite aggressively. The second problem is that the xz block size for these cloud images is too large.
The XZ format (or rather, LZMA format) is divided into streams and blocks. We don’t normally use streams, and many XZ files use a single block. But it’s possible to tell the xz program to use a smaller than default block size, and in that case the output is divided into indexed blocks. Note that the block size applies to the uncompressed input; the compressed blocks will have varying sizes, but the index that is created lets us find the block boundaries easily. When a byte is requested we can use a binary search to take us quickly to the compressed block, uncompress it (and cache it), and answer the request. We will uncompress at most one block instead of the whole file.
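The lookup step can be sketched in plain shell. This is not the filter’s own code: `find_block` is a made-up helper, and the index is assumed to have been parsed already into a list of uncompressed start offsets in ascending order, beginning at 0.

```shell
# find_block OFFSET START...
# START... are the uncompressed start offsets of the blocks, ascending,
# the first one being 0. Prints the zero-based index of the block that
# contains byte OFFSET. (Hypothetical helper, not nbdkit code.)
find_block() {
    offset=$1
    shift
    lo=1        # positional parameters are numbered from 1
    hi=$#
    while [ "$lo" -lt "$hi" ]; do
        # Round the midpoint up so that lo always advances.
        mid=$(( (lo + hi + 1) / 2 ))
        eval "start=\${$mid}"
        if [ "$start" -le "$offset" ]; then
            lo=$mid             # block mid starts at or before OFFSET
        else
            hi=$(( mid - 1 ))   # block mid starts after OFFSET
        fi
    done
    echo $(( lo - 1 ))
}

# Example index: four blocks of 16M each.
M=$(( 1024 * 1024 ))
find_block $(( 20 * M )) 0 $(( 16 * M )) $(( 32 * M )) $(( 48 * M ))
```

With this example index, byte 20M falls in the second block, so the call prints 1. Only that one block would then need to be uncompressed and cached.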
For disk images I normally advocate a 16M block size. The current cloud images use (I think) a 192M block size, so a huge amount of data has to be read over HTTPS to read one uncompressed byte, and we also have to cache very large blocks in RAM.
As an experiment I recompressed the cloud image using xz --block-size=$((16 * 1024 * 1024)) and hosted it locally, and booting is much quicker (albeit still slow because the cloud image contains
$ nbdkit -U - --filter=xz curl \
    http://builder.libguestfs.org/fedora-29.xz \
    --run \
    'qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -drive file=$nbd,if=virtio'
… although you can’t log in because they all have locked root accounts (virt-builder normally customizes them after download).
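The recompression step could be done along these lines. It is a sketch, not a record of the exact commands used: both helper names are made up, and the block count is taken from the third column of the `file` line printed by `xz --robot --list`, which is where the xz man page puts the total number of blocks. Check that against your own xz version.

```shell
# Recompress IN.xz into OUT.xz using 16M blocks, leaving IN.xz alone.
# (recompress_16m is a made-up helper name.)
recompress_16m() {
    xz --decompress --stdout "$1" |
        xz --block-size=$(( 16 * 1024 * 1024 )) > "$2"
}

# Print the number of blocks in an xz file, read from the
# machine-readable listing (third field of the "file" line).
xz_blocks() {
    xz --robot --list "$1" | awk '$1 == "file" { print $3 }'
}

# Example use, with placeholder file names:
#   recompress_16m cloud-image.raw.xz cloud-image-16m.raw.xz
#   xz_blocks cloud-image-16m.raw.xz
```

A file that reports a single block has to be uncompressed from the start to reach any byte, whereas one with many 16M blocks can be read a block at a time.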
If you saw my posting from two days ago you’ll know I’m working on visualizing what happens on block devices when you perform various operations. Last time we covered basics like partitioning a disk, creating a filesystem, creating files, and fstrim.
This time I’ve tied together 5 of the nbdcanvas widgets into a bigger Tcl application that can show what’s happening on a RAID 5 disk set. As with the last posting there’s a video followed by commentary on what happens at each step.
raid456.devices_handle_discard_safely=1 to the appliance kernel command line, which is required for discards to work through MD RAID devices (I didn’t know that before yesterday).
mdadm --create (via the guestfish md-create command) to make a RAID 5 array across the first 4 disks. The 4th disk is the parity disk: you can see disks 1 through 3 being scanned and the parity information being written to the 4th disk. The 5th disk is a hot spare. Notice how the scanning continues after the mdadm command has returned. In real arrays this can go on for hours or days.
mkfs performs is discarding previous data (indicated by light purple). Notice that the parity data is also discarded, which surprised me, but does make sense.
/proc/mdstat shows that the third disk has been marked failed.
In case you want to reproduce the results yourself, the full command to run nbdkit (repeated 5 times) is:
$ rm /tmp/sock1 /tmp/error1
$ ./nbdkit -fv -U /tmp/sock1 \
    --filter=error --filter=log --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log1 \
    rdelay=40ms wdelay=40ms \
    error-rate=100% error-file=/tmp/error1
And the nbdraid viewing program:
$ ./nbdraid.tcl 5 $((64*1024*1024)) /tmp/log%d /tmp/error%d
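Since the nbdkit command has to be repeated for each of the 5 disks, a small loop can generate the command lines. This sketch only prints them rather than running them, because `-f` keeps each nbdkit in the foreground and you would normally give each one its own terminal. It assumes the per-disk files simply change the trailing number (`/tmp/sock2`, `/tmp/log2`, `/tmp/error2` and so on), matching the `%d` patterns given to nbdraid; `nbdkit_cmd` is a made-up helper name.

```shell
# Print (not run) the nbdkit command line for disk number $1.
nbdkit_cmd() {
    n=$1
    echo ./nbdkit -fv -U "/tmp/sock$n" \
        --filter=error --filter=log --filter=delay \
        memory size=$(( 64 * 1024 * 1024 )) \
        "logfile=/tmp/log$n" \
        rdelay=40ms wdelay=40ms \
        error-rate=100% "error-file=/tmp/error$n"
}

# One command line per disk in the RAID set.
for i in 1 2 3 4 5; do
    nbdkit_cmd "$i"
done
```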
Ever wondered what is really happening when you write to a disk? What blocks the filesystem writes to and so on? With our flexible, plug-in based NBD server called nbdkit and a little Tcl/Tk program I wrote you can now visualise this.
… which shows me opening a blank disk, partitioning it, creating an ext4 filesystem and writing some files.
There’s a lot going on in this video, which I’ll explain below. But first to say that each pixel corresponds to a 4K block on disk — the total disk size is 64M which is 128×128 pixels, and each row is therefore half a megabyte. Red pixels are writes. Black flashing pixels show reads. Light purple is for trim requests, and white pixels are zero requests.
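The mapping from a byte offset to a pixel is simple arithmetic. The sketch below assumes pixels are laid out row by row, left to right and top to bottom, which fits the “each row is half a megabyte” description but is otherwise a guess about the viewer; `offset_to_pixel` is a made-up name.

```shell
BLOCK=4096      # bytes represented by one pixel
WIDTH=128       # pixels per row

# Print "X Y" for the pixel that covers byte offset $1.
offset_to_pixel() {
    block=$(( $1 / BLOCK ))
    echo "$(( block % WIDTH )) $(( block / WIDTH ))"
}

offset_to_pixel $(( 1024 * 1024 ))      # first byte of the second megabyte
```

Offset 1M is block 256, which is the first pixel of row 2, so the call prints `0 2`. Two rows per megabyte is exactly the half megabyte per row mentioned above.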
nbdkit was run with the following command line:
$ nbdkit -fv \
    --filter=log \
    --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log \
    rdelay=40ms wdelay=40ms
This means that we’re using the memory plugin to create a throwaway blank disk of 64M. In front of this plugin we place two filters: The delay filter delays all reads and writes by 40ms. This makes it easier to see what’s going on. The second filter is the log filter which records all requests in a log file (
The log file is what the second command reads asynchronously to generate the graphical image:
$ ./nbdview.tcl /tmp/log $((64*1024*1024))
virt-sparsify --in-place uses to make a disk image sparse.
One interesting talk at KVM Forum last week was Stefan Hajnoczi’s talk about QEMU security (sorry, it’s not online; it should eventually be available alongside all the other talks on this YouTube channel).
One thing Stefan mentioned was whether QEMU might be split into multiple processes. This has advantages for security:
For block drivers you can do this today, and in fact we do this already when we run qemu from virt-v2v. Consider the case where we are using a remote HTTPS disk image:
$ qemu -drive https://remote/disk.img
The curl driver linked to and running inside QEMU needs to make a remote TCP/IP connection, has to encode and decode TLS, is linked to libcurl and so on, and all those things also apply to the QEMU process. If the curl block driver has problems for any reason, these also affect QEMU. SELinux labels and transitions needed to access the socket are labels and transitions needed by the QEMU process. An exploit in the driver is a QEMU exploit.
With nbdkit we can split this out:
$ nbdkit -U - curl url=https://remote/disk.img \
    --run 'qemu -drive $nbd'
From a security point of view this has immediate advantages: If the curl driver crashes or is exploited, only nbdkit is affected. QEMU only needs access to a private Unix domain socket, and conversely nbdkit doesn’t need access to anything else that QEMU uses. You can add resource limits, separate SELinux policy, seccomp, namespaces and anything else you can think of to nbdkit to contain it tightly.
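As one small illustration of the resource limits idea, a limit can be lowered in a subshell just before the server is started, so that it applies to that process only. This is a generic shell sketch rather than anything nbdkit-specific: `run_limited` is a made-up helper and 64 open files is an arbitrary example value.

```shell
# Run a command with a lower open-file limit. The limit is set inside
# a subshell, so the calling shell keeps its own limits.
# (run_limited is a made-up helper; 64 is an arbitrary example value.)
run_limited() {
    max_files=$1
    shift
    ( ulimit -n "$max_files" && exec "$@" )
}

# Show the limit as seen by the child process.
run_limited 64 sh -c 'ulimit -n'
```

In practice the command after the limit would be the nbdkit invocation itself, and the same pattern extends to other `ulimit` settings.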
It’s worth pointing out the obvious disadvantages too: It’s likely that there will be a performance impact — although don’t discount how efficient NBD is and how this architecture also lets you scale more effectively over NUMA nodes. And this puts all our eggs into the qemu NBD client which must be very robust.
I should say also that this is more laborious to set up, and it would only really work if some other component (libvirt ideally) handled the creation of the separate nbdkit process. In the example above I used captive nbdkit, but that only works if you have a single drive, and one of the other mechanisms would be more scalable.