Finally getting down to assembling and installing this 7 node cluster.
As well as nbdkit 1.24 being released on Thursday, its sister project libnbd 1.6 was released at the same time. This comes with an enhanced copying tool called nbdcopy designed to replace some uses of qemu-img convert (note: it’s not a general replacement).
nbdcopy lets you copy from and to NBD servers (nbdkit, qemu-nbd, qemu-storage-daemon, nbd-server), local files, local block devices, pipes/sockets, and stdin/stdout. For example to stream the content of an NBD server:
$ nbdcopy nbd://localhost - | hexdump -C
The “-” character streams to stdout.
nbd://localhost is an NBD URI referring to an NBD server that is already running. What if you don’t have an already running server? nbdcopy lets you run one from the command line (and cleans up after). For example this is one way to convert a qcow2 file to raw:
$ nbdcopy -- [ qemu-nbd -f qcow2 disk.qcow ] disk.raw
The [ ... ] section starts qemu-nbd as a captive NBD server, privately exposing an NBD endpoint, and nbdcopy copies this to the local file disk.raw. (The “--” is needed to stop nbdcopy trying to interpret qemu-nbd’s own command line arguments.)
However this post is really about the nbdkit release. How did I test and benchmark nbdcopy? Of course I wrote an nbdkit plugin called nbdkit-sparse-random-plugin. This plugin has two clever features for testing copying tools. Firstly it creates random disks which have the same “shape” as virtual machine disk images (but without the overhead of needing to bother with an actual VM). Secondly it can act as both a source and target for testing copies.
Let’s unpack those two things a bit further.
Virtual machine disk images (especially mostly empty ones) are mostly sparse. Here’s part of the sparse map from a Fedora 32 disk image:
$ virt-builder fedora-32
$ filefrag -e fedora-32.img
Filesystem type is: 58465342
File size of fedora-32.img is 6442450944 (1572864 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    2038672..   2038672:      1:
   1:        1..      15:    2176040..   2176054:     15:    2038673:
   2:      256..     271:    2188819..   2188834:     16:    2176295:
   3:      512..    3135:    3650850..   3653473:   2624:    2189075:
   4:     3168..    4463:    3781763..   3783058:   1296:    3653506:
[...]
The new sparse-random plugin generates a disk image which has a similar shape — islands of random data in a sea of sparseness. The algorithm for doing this is quite neat. Because the plugin doesn’t need to store the data, unlike a real disk image, it can generate huge disk images (eg. a terabyte) while using almost no memory. We use a low-overhead, high-quality random number generator and are smart about seeds so that every run of sparse-random with the same seed produces identical output.
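To make the "no storage" idea concrete, here is a rough Python sketch. It is not the plugin's actual algorithm: the block size, the per-block hashing scheme and the `SparseRandomDisk` class are all made up for illustration, and it decides block by block whether data is present rather than building islands. What it does show is that if every block is a pure function of the seed and the block number, reads can be regenerated on demand and two runs with the same seed agree byte for byte.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical granularity chosen for this sketch


class SparseRandomDisk:
    """Toy disk whose content is derived from (seed, block number) only."""

    def __init__(self, size, seed, percent=10):
        self.size = size        # virtual size in bytes
        self.seed = seed        # same seed => same disk
        self.percent = percent  # rough percentage of blocks holding data

    def _hash(self, block, purpose):
        key = f"{self.seed}:{block}:{purpose}".encode()
        return hashlib.sha256(key).digest()

    def is_data(self, block):
        # The first hash byte (0..255) decides whether the block is data.
        return self._hash(block, "alloc")[0] * 100 < self.percent * 256

    def _block(self, block):
        if not self.is_data(block):
            return bytes(BLOCK_SIZE)  # a hole reads as zeroes
        # Expand a per-block hash into BLOCK_SIZE pseudo-random bytes.
        return hashlib.shake_256(self._hash(block, "data")).digest(BLOCK_SIZE)

    def pread(self, count, offset):
        if offset < 0 or offset + count > self.size:
            raise ValueError("read outside the disk")
        out = bytearray()
        while count > 0:
            block, skip = divmod(offset, BLOCK_SIZE)
            chunk = self._block(block)[skip:skip + count]
            out += chunk
            offset += len(chunk)
            count -= len(chunk)
        return bytes(out)
```

Creating `SparseRandomDisk(1 << 40, seed=42)` costs a few bytes of memory however large the virtual size is, because nothing is generated until it is read.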
The other part of this plugin is how we can use it to test copying tools like nbdcopy and qemu-img convert. My idea was that the plugin could be used both as the source and the target of the copy:
$ nbdkit -U - sparse-random 1T --run ' nbdcopy "$uri" "$uri" '
Here we create a terabyte-sized sparse-random disk, and get nbdcopy to copy from the plugin to the plugin. On reads sparse-random supplies the sparseness and random data. On writes it checks if what is being written matches the content of the plugin, throwing -EIO errors if not. Assuming the copying tool is correctly handling errors, we can both validate the copying tool and benchmark it. And it works with qemu-img convert too:
$ nbdkit -U - sparse-random 1T --run ' qemu-img convert "$uri" "$uri" '
And now we can see which one is faster.
Try it, you may be surprised.
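The write-checking trick is easy to model outside nbdkit. The sketch below is only an illustration under my own assumptions: `expected`, `VerifyingDisk` and `copy` are hypothetical names, and the content generator is a simple hash rather than anything the plugin uses. The point is that a target which can recompute what it should contain needs no storage to catch a copying tool that writes the wrong bytes or the wrong offset.

```python
import errno
import hashlib


def expected(seed, count, offset):
    """Stand-in generator: content is a pure function of seed and offset."""
    if count <= 0:
        return b""
    first = offset // 32
    last = (offset + count - 1) // 32
    out = bytearray()
    for chunk in range(first, last + 1):
        out += hashlib.sha256(b"%d:%d" % (seed, chunk)).digest()
    start = offset - first * 32
    return bytes(out[start:start + count])


class VerifyingDisk:
    """Acts as both copy source and copy target, storing nothing."""

    def __init__(self, seed):
        self.seed = seed

    def pread(self, count, offset):
        return expected(self.seed, count, offset)

    def pwrite(self, buf, offset):
        # Reject any write that differs from what a read would return.
        if bytes(buf) != expected(self.seed, len(buf), offset):
            raise OSError(errno.EIO, "written data does not match")


def copy(src, dst, size, chunk=4096):
    """A deliberately naive copying tool to put under test."""
    for offset in range(0, size, chunk):
        n = min(chunk, size - offset)
        dst.pwrite(src.pread(n, offset), offset)
```

A correct `copy(disk, disk, size)` completes silently, while a tool that shifts an offset or drops a block hits `EIO` on the first bad write.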
nbdkit 1.24 was released on Thursday. It’s our flexible, fast network block device server with loads of features. nbdkit-data-plugin, a plugin that lets you create test patterns from the command line, gained some interesting new functionality:
$ nbdkit data ' ( 0x55 0xAA )*2048 '
This command worked before as a way to create a repeating test pattern in a disk image. A new feature is you can write a shell script snippet to generate the pattern instead:
$ nbdkit data ' <( while :; do printf "%04x" $((i++)); done ) [:2048] '
This command will create a pattern of characters “0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 3 …” (truncated to 2048 bytes). We could turn that into a block device and display the contents:
# nbd-client localhost /dev/nbd0
# blockdev --getsize64 /dev/nbd0
2048
# dd if=/dev/nbd0 | hexdump -C | head
4+0 records in
4+0 records out
2048 bytes (2.0 kB, 2.0 KiB) copied, 0.000167082 s, 12.3 MB/s
00000000  30 30 30 30 30 30 30 31  30 30 30 32 30 30 30 33  |0000000100020003|
00000010  30 30 30 34 30 30 30 35  30 30 30 36 30 30 30 37  |0004000500060007|
00000020  30 30 30 38 30 30 30 39  30 30 30 61 30 30 30 62  |00080009000a000b|
00000030  30 30 30 63 30 30 30 64  30 30 30 65 30 30 30 66  |000c000d000e000f|
00000040  30 30 31 30 30 30 31 31  30 30 31 32 30 30 31 33  |0010001100120013|
00000050  30 30 31 34 30 30 31 35  30 30 31 36 30 30 31 37  |0014001500160017|
00000060  30 30 31 38 30 30 31 39  30 30 31 61 30 30 31 62  |00180019001a001b|
00000070  30 30 31 63 30 30 31 64  30 30 31 65 30 30 31 66  |001c001d001e001f|
00000080  30 30 32 30 30 30 32 31  30 30 32 32 30 30 32 33  |0020002100220023|
00000090  30 30 32 34 30 30 32 35  30 30 32 36 30 30 32 37  |0024002500260027|
# nbd-client -d /dev/nbd0
# killall nbdkit
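If you want to check a dump like this without a block device, the same pattern is easy to rebuild in Python. This is only a stand-in for the shell snippet (the function name `hex_counter_pattern` is mine): it concatenates four-digit hex counters starting at zero and truncates to the requested length, just as the `[:2048]` slice does.

```python
from itertools import count


def hex_counter_pattern(length):
    """Concatenate "%04x" counters starting at 0, truncated to length bytes."""
    out = bytearray()
    for i in count():
        if len(out) >= length:
            break
        out += ("%04x" % i).encode("ascii")
    return bytes(out[:length])


pattern = hex_counter_pattern(2048)
print(pattern[:16].decode())
```

Comparing `pattern` against the bytes read back from the device is a quick way to confirm that nothing was mangled along the way.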
The data plugin also lets you read from files, which is useful for making disks with random initial data. For example here’s how to create a disk with 16 identical sectors of random data (notice how /dev/urandom is read in, truncated to 512 bytes, and then 16 copies are made):
$ nbdkit data ' </dev/urandom[:512]*16 '
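For comparison, the equivalent content built by hand in Python looks like this. It is a sketch of the same read, truncate, repeat order of operations, using `os.urandom` as a stand-in for reading /dev/urandom; the function name is hypothetical.

```python
import os


def repeated_random_sectors(sector_size=512, copies=16):
    """Read sector_size random bytes once, then repeat them copies times."""
    sector = os.urandom(sector_size)  # read and truncate, like [:512]
    return sector * copies            # repeat, like *16


disk = repeated_random_sectors()
print(len(disk))
```

The random data is read once and then repeated, which is why all 16 sectors come out identical rather than independently random.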
The plugin can also create sparse disks. You can do this just by moving the current offset using “@”:
$ nbdkit data ' @32768 1 ' --run 'nbdinfo --map "$uri"'
0 32768 3 hole,zero
32768 1 0 allocated
We use this plugin quite extensively when testing libnbd.
(This is in answer to an IRC question, but the answer is a bit longer than I can cover in IRC)
Can you read and write at the block level in a .vmdk file? I think the questioner was asking about writing a backup/restore type tool. Using only free software, qemu can do reads. You can attach qemu-nbd to a vmdk file and that will expose the logical blocks as NBD, and you can then read at the block level using libnbd:
#!/usr/bin/python3

import nbd

h = nbd.NBD()
h.connect_systemd_socket_activation(
    ["qemu-nbd", "-t", "/var/tmp/disk.vmdk"])
print("size = %d" % h.get_size())
buf = h.pread(512, 0)

$ ./qemu-test.py
size = 1073741824
The example is in Python, but libnbd would let you do this from C or other languages just as easily.
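While this works fine for reading, I wouldn’t necessarily be sure that writing is safe. The vmdk format is complex, baroque and only lightly documented, and the only implementation I’d trust is the one from VMware. For writes, a safer route is VMware’s own (proprietary) VDDK library, driven through nbdkit’s vddk plugin: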
#!/usr/bin/python3

import nbd

h = nbd.NBD()
h.connect_command(
    ["nbdkit", "-s", "--exit-with-parent",
     "vddk", "libdir=/var/tmp/vmware-vix-disklib-distrib",
     "file=/var/tmp/disk.vmdk"])
print("size = %d" % h.get_size())
buf = h.pread(512, 0)
h.pwrite(buf, 512)
I quite like how we’re using small tools and assembling them together into a pipeline in just a few lines of code:
┌─────────┬────────┐          ┌─────────┬────────┐
│ your    │ libnbd │   NBD    │ nbdkit  │ VDDK   │
│ program │      ●──────────────➤       │        │
└─────────┴────────┘          └─────────┴────────┘
                                          disk.vmdk
One advantage of this approach is that it exposes the extents in the disk which you can iterate over using libnbd APIs. For a backup tool this would let you save the disk efficiently, or do change-block tracking.
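As a sketch of what a backup tool might do with those extents, the helper below takes block status entries as a flat `[length, flags, length, flags, ...]` list (I am assuming that form here) and returns only the ranges that need saving. The flag values are the NBD protocol's `base:allocation` bits, hole = 1 and zero = 2. The function name and the merging policy are my own choices, not part of libnbd.

```python
STATE_HOLE = 1  # NBD base:allocation bit 0: extent is unallocated
STATE_ZERO = 2  # NBD base:allocation bit 1: extent reads as zeroes


def extents_to_save(entries, start=0):
    """Return [(offset, length), ...] for extents that do not read as zero.

    entries is a flat list [length, flags, length, flags, ...] describing
    consecutive extents beginning at byte offset start.
    """
    result = []
    offset = start
    for i in range(0, len(entries), 2):
        length, flags = entries[i], entries[i + 1]
        if not flags & STATE_ZERO:
            if result and result[-1][0] + result[-1][1] == offset:
                # Merge with the previous range when they are adjacent.
                result[-1] = (result[-1][0], result[-1][1] + length)
            else:
                result.append((offset, length))
        offset += length
    return result


print(extents_to_save([32768, 3, 1, 0]))
```

Only the returned ranges need to be read with `pread` and stored. Everything else can be recorded as zeroes in the backup.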
#!/usr/sbin/nbdkit python

import nbdkit
import boto3
from contextlib import closing

API_VERSION = 2

# Defaults, so that parameters not given on the command line
# (eg. endpoint_url) are passed to boto3 as None.
access_key = None
secret_key = None
endpoint_url = None
bucket_name = None
key_name = None

def thread_model():
    return nbdkit.THREAD_MODEL_PARALLEL

def config(key, value):
    global access_key, secret_key, endpoint_url, bucket_name, key_name

    if key == "access-key" or key == "access_key":
        access_key = value
    elif key == "secret-key" or key == "secret_key":
        secret_key = value
    elif key == "endpoint-url" or key == "endpoint_url":
        endpoint_url = value
    elif key == "bucket":
        bucket_name = value
    elif key == "key":
        key_name = value
    else:
        raise Exception("unknown parameter %s" % key)

def open(readonly):
    global access_key, secret_key, endpoint_url

    s3 = boto3.client("s3",
                      aws_access_key_id = access_key,
                      aws_secret_access_key = secret_key,
                      endpoint_url = endpoint_url)
    if s3 is None:
        raise Exception("could not connect to S3")
    return s3

def get_size(s3):
    global bucket_name, key_name

    resp = s3.get_object(Bucket = bucket_name, Key = key_name)
    size = resp['ResponseMetadata']['HTTPHeaders']['content-length']
    return int(size)

def pread(s3, buf, offset, flags):
    global bucket_name, key_name

    # HTTP byte ranges are inclusive at both ends.
    size = len(buf)
    rnge = 'bytes=%d-%d' % (offset, offset+size-1)
    resp = s3.get_object(Bucket = bucket_name, Key = key_name, Range = rnge)
    body = resp['Body']
    with closing(body):
        buf[:] = body.read(size)
This lets you loop mount a single object (file):
$ ./nbdkit-S3-plugin -f -v -U /tmp/sock \
      access_key="XYZ" secret_key="XYZ" \
      bucket="my_files" key="fedora-28.iso"

$ sudo nbd-client -b 2048 -unix /tmp/sock /dev/nbd0
Negotiation: ..size = 583MB
$ ls /dev/nbd0
nbd0    nbd0p1  nbd0p2
$ sudo mount -o ro /dev/nbd0p1 /tmp/mnt
$ ls -l /tmp/mnt
total 11
dr-xr-xr-x. 3 root root 2048 Apr 25  2018 EFI
-rw-r--r--. 1 root root 2532 Apr 23  2018 Fedora-Legal-README.txt
dr-xr-xr-x. 3 root root 2048 Apr 25  2018 images
drwxrwxr-x. 2 root root 2048 Apr 25  2018 isolinux
-rw-r--r--. 1 root root 1063 Apr 21  2018 LICENSE
-r--r--r--. 1 root root  454 Apr 25  2018 TRANS.TBL
I should note this is a bit different from s3fs which is a FUSE driver that mounts all the files in a bucket.
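One detail in the plugin's pread that is easy to get wrong is the Range header: HTTP byte ranges are inclusive at both ends, hence the `offset+size-1`. Here is a small standalone sketch of that arithmetic. The `byte_range` helper and its clamping to the object size are my own additions for illustration, not something the plugin does.

```python
def byte_range(offset, count, object_size):
    """Build an HTTP Range header value for count bytes starting at offset.

    Both ends of an HTTP byte range are inclusive, so the last byte is
    offset + count - 1, clamped here to the end of the object.
    """
    if count <= 0 or offset < 0 or offset >= object_size:
        raise ValueError("range outside the object")
    last = min(offset + count, object_size) - 1
    return "bytes=%d-%d" % (offset, last)


print(byte_range(0, 512, 1000))
print(byte_range(512, 512, 1000))
```

Getting this off by one returns one byte too many or too few, which an NBD client would see as a short or corrupted read.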
In the last post I showed how you can combine nbdfuse with nbdkit’s RAM disk to mount a RAM disk as a local file. In a talk I gave at FOSDEM last year I described creating these absurdly large RAM-backed filesystems and you can do the same thing now to create ridiculously big “files”. Here’s a 7 exabyte file:
$ touch /var/tmp/disk.img
$ nbdfuse /var/tmp/disk.img --command nbdkit -s memory 7E &
$ ll /var/tmp/disk.img
-rw-rw-rw-. 1 rjones rjones 8070450532247928832 Nov  4 13:37 /var/tmp/disk.img
$ ls -lh /var/tmp/disk.img
-rw-rw-rw-. 1 rjones rjones 7.0E Nov  4 13:37 /var/tmp/disk.img
What can you actually do with this file, and more importantly does anything break? As in the talk, creating a Btrfs filesystem boringly just works. mkfs.ext4 spins using 100% of CPU. I let it go for 15 minutes but it seemed no closer to either succeeding or crashing. Emacs said:
File disk.img is large (7 EiB), really open? (y)es or (n)o or (l)iterally
and I was too chicken to find out what it would do if I really opened it.
I do wonder if there’s a DoS attack here if I leave this seemingly massive regular file lying around in a public directory.
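The size that ls prints is simply 7 × 2^60. A tiny size parser makes the arithmetic explicit. This is a sketch assuming power-of-two suffixes and is not nbdkit's own parser; `parse_size` and the `SUFFIXES` table are hypothetical names.

```python
# Suffix -> power of two, assuming binary (1024-based) multipliers.
SUFFIXES = {"K": 10, "M": 20, "G": 30, "T": 40, "P": 50, "E": 60}


def parse_size(text):
    """Convert strings such as "1G" or "7E" into a byte count."""
    text = text.strip()
    shift = SUFFIXES.get(text[-1:].upper())
    if shift is None:
        return int(text)  # plain number of bytes
    return int(text[:-1]) << shift


print(parse_size("7E"))
```

The result, 8070450532247928832, matches the ls output above and still fits below 2^63, the limit of a signed 64-bit file offset.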
Our tool nbdfuse lets you mount an NBD block device as a file, using Linux FUSE. For example you could create a directory with a single file in it (called nbd) which contains the contents of the NBD export:
$ mkdir /var/tmp/test
$ nbdfuse /var/tmp/test --command nbdkit -s memory 1G &
$ ls -l /var/tmp/test/
total 0
-rw-rw-rw-. 1 rjones rjones 1073741824 Nov  4 13:25 nbd
$ fusermount -u /var/tmp/test
This is cool, but wouldn’t it be nice to get rid of the directory and create the file anywhere? Recently Max Reitz found out you can mount a FUSE filesystem over a regular file.
It works! (After a few adjustments to the nbdfuse code)
$ touch /var/tmp/disk.img
$ nbdfuse /var/tmp/disk.img --command nbdkit -s memory 1G &
$ ls -l /var/tmp/disk.img
-rw-rw-rw-. 1 rjones rjones 1073741824 Nov  4 13:29 /var/tmp/disk.img
$ fusermount -u /var/tmp/disk.img
Frama-C is a giant modular system for writing formal proofs of C code. For months I’ve been on-and-off trying to see if we could use it to do useful proofs for any parts of the projects we write, like qemu, libvirt, libguestfs, nbdkit etc. I got side-tracked at first with this frama-c tutorial which is fine, but I got stuck trying to make the GUI work.
Yesterday I discovered this set of 3 short command-line based tutorials:
https://maniagnosis.crsr.net/2017/06/AFL-brute-force-search.html
https://maniagnosis.crsr.net/2017/06/AFL-bug-in-quicksearch.html
https://maniagnosis.crsr.net/2017/07/AFL-correctness-of-quicksearch.html
I thought I’d start by trying to apply this to a small section of qemu code, the fairly self-contained range functions.
The first problem is how to invoke frama-c:
frama-c -wp -wp-rte -wp-print util/range.c -cpp-extra-args=" -I include -I build -I /usr/include -DQEMU_WARN_UNUSED_RESULT= "
You have to give all the include directories and define out some qemu-isms.
The first time you run it, this won’t work for “reasons”. You have to initialize the why3 verifier using:
why3 config --full-config
Really frama-c should just do this for you, or at least tell you what you need to do in the obscure error message it prints.
This still won’t work because util/range.c includes glib headers which use GCC attributes and builtins and frama-c simply cannot parse any of that. So I ended up hacking on the source to replace the headers with standard C headers and remove the one glib-based function in the file.
At this point it does compile and the Frama-C WP plugin runs. Of course, without any annotations it simply produces a long list of problems. It also takes a fair bit of time to run, which is interesting. I wonder if it will get faster with annotations?
That’s as far as I’ve got for the moment. I’ll come back later and try to add annotations.
I got Fedora 32 installed on an RPi 4 8GB, booting off USB, with UEFI and ACPI. I followed Robert Grimm’s instructions here, and had an additional set of complications summarised here. There’s not much to say except that it was fiendishly complicated. But it works beautifully now, and is reasonably quick too especially when you consider how little it cost.
So let’s talk about costs (all include tax and delivery):
|Raspberry Pi 4 8GB|£77.33|
|SanDisk 500GB SSD x 2|£149.98|
|small SD card needed for booting|free|
Only one of the SSDs is actually used, but if you follow Robert’s instructions you will need two. I didn’t have any external USB SSDs that were both USB 3 and not spinning hard disks, so I had to buy these, but I’ll be able to reuse one in a future project. The SD card is required to work around a bug in the UEFI firmware, but I happened to have one lying around.