Tag Archives: nbdkit
My talk was accepted: https://fosdem.org/2019/schedule/event/nbdkit/
If you’re coming to FOSDEM, please come and say hello. In the meantime if you want to watch a rough early run-through of the talk, see: https://rwmj.wordpress.com/2018/11/26/nbdkit-fosdem-test-presentation/
Here’s a simple block device with virtual size 1M that reads as zeroes:
nbdkit sh - <<'EOF'
case "$1" in
  get_size) echo 1M ;;
  pread) dd if=/dev/zero count=$3 iflag=count_bytes ;;
  *) exit 2 ;;
esac
EOF
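To get a feel for what nbdkit asks of a script like this, you can run the same dispatch logic by hand. This is only a rough sketch and not part of nbdkit: zero_plugin is a made-up wrapper name, and return stands in for exit so the function can be called from an interactive shell. It assumes GNU dd (for iflag=count_bytes) and that the sh plugin passes the method name as the first argument, followed for pread by a handle, a byte count and an offset.

```shell
#!/usr/bin/env bash
# Hypothetical stand-alone copy of the plugin body, for experimenting
# without nbdkit. "zero_plugin" is an illustrative name only.
zero_plugin() {
    case "$1" in
        # Virtual size of the device.
        get_size) echo 1M ;;
        # $2 = handle, $3 = byte count, $4 = offset (ignored: every byte is zero).
        pread)    dd if=/dev/zero count="$3" iflag=count_bytes 2>/dev/null ;;
        # Status 2 is how the script says "I don't implement this method".
        *)        return 2 ;;
    esac
}
```

For example, zero_plugin get_size prints the size string, and piping zero_plugin pread h 4096 0 into wc -c lets you check that exactly the requested number of bytes comes back.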
I’ve submitted a talk about nbdkit, our flexible pluggable NBD server, to FOSDEM next February. This is going to be about using NBD as a better way to do loop mounts in Linux.
In preparation I gave a very early version of the talk to a small Red Hat audience.
Sorry about the slow start. You may want to skip to 2 mins to get past the intro.
Summary of what’s in the talk:
- Demo of regular, plain loop mounting.
- Demo of loop mounting an XZ-compressed disk image using NBD + nbdkit.
- Slides about how loop device compares to NBD.
- Slides about nbdkit plugins and filters.
- Using VMware VDDK to access a VMDK file.
- Creating a giant disk costing EUR 300 million(!)
- Visualizing a single filesystem.
- Visualizing RAID 5.
- Writing a plugin in shell script (live demo).
I’ve submitted a talk about nbdkit, our flexible, pluggable NBD server, to FOSDEM next year about how you can use nbdkit as a replacement for loopback mounts (or “loop mounts” as I was told off for not calling them last week). In preparation for that talk I ran through it in private to a small Red Hat audience on Monday. If I can I will release that video some time, but I may have to edit out Red Hat “super-secret” stuff first (or most likely not because there aren’t any secrets in it, but I’m still waiting for the internal video to be released).
Anyway, this attracted a lot of interest, and one question that was asked was why the xz plugin, which lets you transparently open and uncompress XZ files on the fly, was a plugin at all. Surely it would make more sense for it to be a filter? Then it could be used not just to uncompress local files, but also xz-compressed cloud images over HTTPS.
The answer is yes it would! So I fixed it. XZ is now a filter (the plugin is left around but we’ll deprecate it eventually).
You can use it on top of the file plugin, curl plugin or other plugins:
$ nbdkit --filter=xz file file.xz
$ nbdkit --filter=xz curl https://download.fedoraproject.org/pub/fedora/linux/releases/29/Cloud/x86_64/images/Fedora-Cloud-Base-29-1.2.x86_64.raw.xz
This is fun and you can use this to boot the cloud image entirely remotely:
$ qemu-system-x86_64 -machine accel=kvm:tcg \
    -cpu host -m 2048 \
    -drive file=nbd:localhost:10809,if=virtio
However it’s incredibly slow. One problem is that the Fedora mirror sites aren’t very happy about you issuing lots of small HTTP Range requests and I observed that they throttle the connection quite aggressively. The second problem is that the xz block size for these cloud images is too large.
The XZ format (or rather, LZMA format) is divided into streams and blocks. We don’t normally use streams, and many XZ files use a single block. But it’s possible to tell the xz program to use a smaller than default block size, and in that case the output is divided into indexed blocks. Note that the block size applies to the uncompressed input. The compressed blocks will have varying sizes, but the index that is created lets us find the block boundaries easily. When a byte is requested we can use a binary search to take us quickly to the compressed block, uncompress it (and cache it), and answer the request. We uncompress at most one block instead of the whole file.
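The lookup step can be sketched in a few lines of bash. This is not the xz filter’s actual code, just an illustration of the idea: the index is modelled as a plain list of uncompressed start offsets, one per block, and find_block is a made-up name.

```shell
#!/usr/bin/env bash
# find_block OFFSET START...
# Hypothetical helper: given a requested byte OFFSET and the sorted list of
# uncompressed START offsets of each block, print the index of the block
# that contains OFFSET (the last block whose start is <= OFFSET).
find_block() {
    local offset=$1; shift
    local -a starts=("$@")
    local lo=0 hi=$(( ${#starts[@]} - 1 )) mid
    while (( lo < hi )); do
        # Round up so the loop always makes progress.
        mid=$(( (lo + hi + 1) / 2 ))
        if (( starts[mid] <= offset )); then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}
```

With four 16M blocks the starts would be 0, 16777216, 33554432 and 50331648, and a request for byte 40000000 lands in block index 2. Only that block then needs to be fetched and uncompressed.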
For disk images I normally advocate a 16M block size. The current cloud images use (I think) a 192M block size, so a huge amount of data has to be read over HTTPS to read one uncompressed byte, and we also have to cache very large blocks in RAM.
As an experiment I recompressed the cloud image using xz --block-size=$((16 * 1024 * 1024)) and hosted it locally, and booting is much quicker (albeit still slow because the cloud image contains
$ nbdkit -U - --filter=xz curl \
    http://builder.libguestfs.org/fedora-29.xz \
    --run \
    'qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -drive file=$nbd,if=virtio'
… although you can’t log in because they all have locked root accounts (virt-builder normally customizes them after download).
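If you want to try the recompression experiment yourself, something like the following would do it. It’s a minimal sketch rather than the exact commands I ran: recompress and block_size_bytes are names I’ve made up here, and the 16M default simply follows the block size suggested above.

```shell
#!/usr/bin/env bash
# block_size_bytes MIB
# Convert a block size in MiB to the byte count passed to xz --block-size.
block_size_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# recompress IN.xz OUT.xz [MIB]
# Hypothetical helper: uncompress IN.xz and recompress it into OUT.xz
# using indexed blocks of MIB mebibytes (default 16).
recompress() {
    local in=$1 out=$2 mib=${3:-16}
    xz -dc -- "$in" | xz --block-size="$(block_size_bytes "$mib")" > "$out"
}
```

Afterwards xz --list on the output file reports the number of blocks, which is a quick way to check that the file really was split up.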
If you saw my posting from two days ago you’ll know I’m working on visualizing what happens on block devices when you perform various operations. Last time we covered basics like partitioning a disk, creating a filesystem, creating files, and fstrim.
This time I’ve tied together 5 of the nbdcanvas widgets into a bigger Tcl application that can show what’s happening on a RAID 5 disk set. As with the last posting there’s a video followed by commentary on what happens at each step.
- 00:00: I start guestfish connected to all 5 nbdkit servers. Also of note I’ve added raid456.devices_handle_discard_safely=1 to the appliance kernel command line, which is required for discards to work through MD RAID devices (I didn’t know that before yesterday).
- 00:02: When the appliance starts up, the black flashes show the kernel probing for possible partitions or filesystems. The disks are blank so nothing is found.
- 00:16: As in the previous post I’m partitioning the disks using GPT. Each ends up with a partition table at the start and end of the disk (two red blocks of pixels).
- 00:51: Now I use mdadm --create (via the guestfish md-create command) to make a RAID 5 array across the first 4 disks. The 4th disk is the parity disk: you can see disks 1 through 3 being scanned and the parity information being written to the 4th disk. The 5th disk is a hot spare. Notice how the scanning continues after the mdadm command has returned. In real arrays this can go on for hours or days.
- 01:11: I create a filesystem. The first action that mkfs performs is discarding previous data (indicated by light purple). Notice that the parity data is also discarded, which surprised me, but does make sense.
- 01:27: The RAID array is mounted and I unpack a tarball into it.
- 01:40: I delete the files and fstrim, which discards the underlying blocks again.
- 01:48: Now I’m going to inject errors at the block layer into the 3rd disk. The Error checkbox in the Tcl widget simply creates a file. We’re using the nbdkit error filter which monitors for the named file and when it is created starts injecting errors into any read or write operation. Almost immediately the RAID array notices the damage and starts rebuilding on to the hot spare. Notice the black flashes where it reads the working disks (including old parity disk) to construct the redundant information on the spare.
- 01:55: While reconstruction is under way, the RAID array can be used normally.
- 02:14: Examining /proc/mdstat shows that the third disk has been marked failed.
- 02:24: Now I’m going to inject errors into the 4th disk as well. This RAID array can survive this, operating in a “degraded state”, but there is no more redundancy.
- 02:46: Finally we can examine the kernel messages which show that the RAID array is continuing on 3 devices.
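The reason the array can rebuild the failed disk on to the spare from the remaining disks plus parity comes down to XOR. Here is a toy sketch of that general RAID 5 principle on single byte values. It makes no attempt to model what the MD driver really does (which works on whole stripes with its own on-disk layout), and xor_parity is just an illustrative name.

```shell
#!/usr/bin/env bash
# xor_parity BYTE...
# Toy helper: print the XOR of all the given byte values. Applied to the
# data bytes it gives the parity byte; applied to the surviving data bytes
# plus the parity byte it gives back the missing data byte.
xor_parity() {
    local p=0 b
    for b in "$@"; do
        p=$(( p ^ b ))
    done
    echo "$p"
}
```

For example, if three data disks hold 0x5a, 0x3c and 0xf0 at some position, the parity byte is 150 (0x96). Lose the third disk and XORing 0x5a, 0x3c and 0x96 gives 240 (0xf0) again, which is what gets written to the spare.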
In case you want to reproduce the results yourself, the full command to run nbdkit (repeated 5 times) is:
$ rm /tmp/sock1 /tmp/error1
$ ./nbdkit -fv -U /tmp/sock1 \
    --filter=error --filter=log --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log1 \
    rdelay=40ms wdelay=40ms \
    error-rate=100% error-file=/tmp/error1
And the nbdraid viewing program:
$ ./nbdraid.tcl 5 $((64*1024*1024)) /tmp/log%d /tmp/error%d
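Since the nbdkit command has to be repeated for each of the 5 disks with only the number changing, a small helper that prints the command lines can save some typing. This is a sketch under the assumption that the same /tmp/sockN, /tmp/logN and /tmp/errorN naming is used for every disk; nbdkit_cmd and print_all are made-up names, and the script only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# nbdkit_cmd I
# Print the nbdkit command line for disk number I, using the same
# plugin, filters and parameters as the command shown above.
nbdkit_cmd() {
    local i=$1
    echo "./nbdkit -fv -U /tmp/sock$i --filter=error --filter=log --filter=delay memory size=$((64*1024*1024)) logfile=/tmp/log$i rdelay=40ms wdelay=40ms error-rate=100% error-file=/tmp/error$i"
}

# print_all
# Print the command lines for all 5 disks, one per line.
print_all() {
    local i
    for i in 1 2 3 4 5; do
        nbdkit_cmd "$i"
    done
}
```

Because -f keeps nbdkit in the foreground, each printed command needs its own terminal (or to be backgrounded). Remember to remove any stale socket and error files first.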