
libnbd – A new NBD client library

NBD is a high-performance protocol for exporting disks between processes and machines. We use it as a kind of “universal connector” for connecting hypervisors with data sources, and Eric Blake and I previously wrote a general-purpose NBD server called nbdkit. (If you’re interested in the topic of nbdkit as a universal connector, watch my FOSDEM talk.)

Up until now our NBD client has been qemu or one of the qemu tools like qemu-img. That was fine if you wanted to expose a disk source as a running virtual machine (i.e. running it with qemu), or if you wanted to perform one of the limited copying operations that qemu-img convert can do, but there were many cases where it would have been nice to have a general client library.

For example I started to add NBD support to Jens Axboe’s fio. Lacking a client library I synthesized NBD request packets as C structs and sent them on the wire using low-level socket calls. The performance was, to put it bluntly, crap.

Although NBD is a very simple protocol and you can implement it by hand, it is much nicer to have a library wrap the low-level details, and that’s why we have written libnbd (downloads).

Getting reasonable performance from NBD requires a few tricks:

  • You must issue as many commands as possible “in flight” (the server will reply to them out of order, but requests and replies are tied together by a unique ID).
  • You may need to open multiple connections to the server, but doing that requires attention to the special MULTI_CONN flag which the server will use to indicate that this is safe.
  • Most crucially you must disable Nagle’s algorithm.

This isn’t an exhaustive list. In fact, while writing libnbd over about 3 weeks, we improved performance by a factor of more than 15, just by paying attention to system calls, maximizing parallelism and minimizing latency. One advantage of libnbd is that it encodes all this knowledge in an easy-to-use library so NBD clients won’t have to reinvent it in future.
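As an illustration of the last point in the list, disabling Nagle’s algorithm is just a TCP_NODELAY socket option on the connected socket. libnbd does this for you, but a minimal hand-rolled version (with a hypothetical fd) would look something like:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm so small NBD requests are sent immediately
 * instead of being batched up.  Returns 0 on success, -1 on error.
 */
static int
disable_nagle (int fd)
{
  int flag = 1;
  return setsockopt (fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag);
}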

The library has a simple high-level synchronous API which works how you would expect (but doesn’t get the best performance). A typical program might look like:

#include <stdio.h>
#include <stdint.h>
#include <libnbd.h>

struct nbd_handle *nbd;
int64_t exportsize;
char buf[512];

/* Create the handle and connect to an NBD server over TCP. */
nbd = nbd_create ();
if (!nbd) goto error;
if (nbd_connect_tcp (nbd, "localhost", "nbd") == -1)
  goto error;

/* Read the size of the export, then read the first sector. */
exportsize = nbd_get_size (nbd);
if (exportsize == -1) goto error;
if (nbd_pread (nbd, buf, sizeof buf, 0, 0) == -1) {
 error:
  fprintf (stderr, "%s\n", nbd_get_error ());
}

To get the best performance you have to use the lower-level asynchronous API, which allows you to queue up commands and bring your own main loop.
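As a rough sketch (using the nbd_aio_pread, nbd_aio_command_completed and nbd_poll calls as documented in current libnbd; exact signatures may differ between versions), issuing one read asynchronously and waiting for it looks something like:

int64_t cookie;

/* Queue the read; it has not been executed yet. */
cookie = nbd_aio_pread (nbd, buf, sizeof buf, 0, NBD_NULL_COMPLETION, 0);
if (cookie == -1) goto error;

/* Run the event loop until this command completes.  A real program
 * would queue many commands in flight and poll them together.
 */
while (!nbd_aio_command_completed (nbd, cookie)) {
  if (nbd_poll (nbd, -1) == -1) goto error;
}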

There are bindings for OCaml and Python (and Rust is coming soon). There’s also a nice little shell written in Python so you can access NBD servers interactively:

$ nbdsh
nbd> h.connect_command (["nbdkit", "-s", "memory", "1M"])
nbd> print ("%r" % h.get_size ())
1048576
nbd> h.pwrite (b"12345", 0)
nbd> h.pread (5, 0)
b'12345'

libnbd and the shell, nbdsh, are available now in Fedora 29 and above.


virt-install + nbdkit live install

This seems to be completely undocumented, which is why I’m writing this … It is possible to boot a Linux guest (Fedora in this case) from a live CD on a website without downloading it first. I’m using our favourite flexible NBD server, nbdkit, together with virt-install.

First of all we’ll run nbdkit and attach it to the Fedora 29 live workstation ISO. To make this work more efficiently I’m going to place a couple of filters on top — one is the readahead (prefetch) filter recently added to nbdkit 1.12, and the other is the cache filter. In combination these filters should reduce the load on the website and improve local performance.

$ rm /tmp/socket
$ nbdkit -f -U /tmp/socket --filter=readahead --filter=cache \
    curl https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso

I actually replaced that URL with a UK-based mirror to make the process a little faster.

Now comes the undocumented virt-install command:

$ virt-install --name test --ram 2048 \
    --disk /var/tmp/disk.img,size=10 \
    --disk device=cdrom,source_protocol=nbd,source_host_transport=unix,source_host_socket=/tmp/socket \
    --os-variant fedora29

After a bit of grinding that should boot into Fedora 29, and you never had to (explicitly, at least) download the ISO.


To be fair, qemu also has a curl driver which virt-install could use, but nbdkit is better: the filter and plugin system gives you ultimate flexibility — check out my video about it.
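For comparison, if the qemu curl block driver is installed you can point the qemu tools directly at a URL, for example:

$ qemu-img info https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso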


nbdkit 1.12

The new stable release of nbdkit, our flexible Network Block Device server, is out. You can read the announcement and release notes here.

The big new features are SSH support, the linuxdisk plugin, writing plugins in Rust, and extents. Extent support allows NBD clients to work out which parts of a disk are sparse or zero and skip reading them. It was hellishly difficult to write because of the number of obscure corner cases.

Also in this release are a couple of interesting filters. The rate filter lets you add a bandwidth limit to connections. We will use this in virt-v2v to allow v2v instances to be rate limited (even dynamically). The readahead filter makes sequential copying and scanning of plugins more efficient by prefetching data ahead of time. It is self-configuring, and in most cases simply adding the filter to your filter stack is enough to get a nice performance boost, assuming your client’s access patterns are mostly sequential.
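For example (a minimal sketch; see nbdkit-rate-filter(1) for the exact parameters), serving a 1G RAM disk throttled to about 1 megabit per second would look something like:

$ nbdkit --filter=rate memory size=1G rate=1M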


Tip: Edit grub kernel command line in RHEL 7 or CentOS 7

Easy with virt-customize. In this example I’m adding the nosmt option to the command line:

$ virt-customize -a rhel7.img \
    --edit '/etc/default/grub:
      s/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="nosmt /' \
    --run-command 'grub2-mkconfig -o /boot/grub2/grub.cfg'


nbdkit / FOSDEM test presentation about better loop mounts for Linux

I’ve submitted a talk about nbdkit, our flexible pluggable NBD server, to FOSDEM next February. This is going to be about using NBD as a better way to do loop mounts in Linux.

In preparation I gave a very early version of the talk to a small Red Hat audience.

Video link: http://oirase.annexia.org/rwmj.wp.com/rjones-nbdkit-tech-talk-2018-11-19.mp4

Sorry about the slow start. You may want to skip to 2 mins to get past the intro.

Summary of what’s in the talk:

  1. Demo of regular, plain loop mounting.
  2. Demo of loop mounting an XZ-compressed disk image using NBD + nbdkit.
  3. Slides about how loop device compares to NBD.
  4. Slides about nbdkit plugins and filters.
  5. Using VMware VDDK to access a VMDK file.
  6. Creating a giant disk costing EUR 300 million(!)
  7. Visualizing a single filesystem.
  8. Visualizing RAID 5.
  9. Writing a plugin in shell script (live demo; a rough sketch follows this list).
  10. Summary.
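For item 9, a shell script plugin only needs to answer the methods nbdkit asks it about. Here is a rough read-only sketch along the lines of the nbdkit-sh-plugin documentation (the disk path is hypothetical):

#!/usr/bin/env bash
# nbdkit calls the script with the method name as $1.
# For pread, $3 is the count in bytes and $4 is the offset.
# Unimplemented methods must exit with status 2.
case "$1" in
  get_size) stat -L -c '%s' /var/tmp/disk.img ;;
  pread)    dd if=/var/tmp/disk.img skip=$4 count=$3 \
               iflag=count_bytes,skip_bytes status=none ;;
  *)        exit 2 ;;
esac

You would then serve it with something like nbdkit sh ./plugin.sh.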



NBD graphical viewer – RAID 5 edition

If you saw my posting from two days ago you’ll know I’m working on visualizing what happens on block devices when you perform various operations. Last time we covered basics like partitioning a disk, creating a filesystem, creating files, and fstrim.

This time I’ve tied together 5 of the nbdcanvas widgets into a bigger Tcl application that can show what’s happening on a RAID 5 disk set. As with the last posting there’s a video followed by commentary on what happens at each step.


  • 00:00: I start guestfish connected to all 5 nbdkit servers. Also of note I’ve added raid456.devices_handle_discard_safely=1 to the appliance kernel command line, which is required for discards to work through MD RAID devices (I didn’t know that before yesterday).
  • 00:02: When the appliance starts up, the black flashes show the kernel probing for possible partitions or filesystems. The disks are blank so nothing is found.
  • 00:16: As in the previous post I’m partitioning the disks using GPT. Each ends up with a partition table at the start and end of the disk (two red blocks of pixels).
  • 00:51: Now I use mdadm --create (via the guestfish md-create command; the equivalent plain mdadm invocation is shown after this list) to make a RAID 5 array across the first 4 disks. The 4th disk is the parity disk — you can see disks 1 through 3 being scanned and the parity information being written to the 4th disk. The 5th disk is a hot spare. Notice how the scanning continues after the mdadm command has returned. In real arrays this can go on for hours or days.
  • 01:11: I create a filesystem. The first action that mkfs performs is discarding previous data (indicated by light purple). Notice that the parity data is also discarded, which surprised me, but does make sense.
  • 01:27: The RAID array is mounted and I unpack a tarball into it.
  • 01:40: I delete the files and fstrim, which discards the underlying blocks again.
  • 01:48: Now I’m going to inject errors at the block layer into the 3rd disk. The Error checkbox in the Tcl widget simply creates a file. We’re using the nbdkit error filter which monitors for the named file and when it is created starts injecting errors into any read or write operation. Almost immediately the RAID array notices the damage and starts rebuilding on to the hot spare. Notice the black flashes where it reads the working disks (including old parity disk) to construct the redundant information on the spare.
  • 01:55: While reconstruction is under way, the RAID array can be used normally.
  • 02:14: Examining /proc/mdstat shows that the third disk has been marked failed.
  • 02:24: Now I’m going to inject errors into the 4th disk as well. This RAID array can survive this, operating in a “degraded state”, but there is no more redundancy.
  • 02:46: Finally we can examine the kernel messages which show that the RAID array is continuing on 3 devices.
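For reference, outside the libguestfs appliance the equivalent array could be created with plain mdadm along these lines (device names here are hypothetical):

$ mdadm --create /dev/md0 --level=5 --raid-devices=4 --spare-devices=1 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1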

In case you want to reproduce the results yourself, the full command to run nbdkit (repeated 5 times) is:

$ rm /tmp/sock1 /tmp/error1
$ ./nbdkit -fv -U /tmp/sock1 \
    --filter=error --filter=log --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log1 \
    rdelay=40ms wdelay=40ms \
    error-rate=100% error-file=/tmp/error1

And the nbdraid viewing program:

$ ./nbdraid.tcl 5 $((64*1024*1024)) /tmp/log%d /tmp/error%d


NBD graphical viewer

Ever wondered what is really happening when you write to a disk? Which blocks the filesystem writes to, and so on? With our flexible, plugin-based NBD server nbdkit and a little Tcl/Tk program I wrote, you can now visualise this.

As in this video (MP4), which shows me opening a blank disk, partitioning it, creating an ext4 filesystem and writing some files.

There’s a lot going on in this video, which I’ll explain below. But first, note that each pixel corresponds to a 4K block on disk — the total disk size is 64M, which is 128×128 pixels, and each row is therefore half a megabyte. Red pixels are writes. Black flashing pixels show reads. Light purple is for trim requests, and white pixels are zero requests.

nbdkit was run with the following command line:

$ nbdkit -fv \
    --filter=log \
    --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log \
    rdelay=40ms wdelay=40ms

This means that we’re using the memory plugin to create a throwaway blank disk of 64M. In front of this plugin we place two filters: The delay filter delays all reads and writes by 40ms. This makes it easier to see what’s going on. The second filter is the log filter which records all requests in a log file (/tmp/log).

The log file is what the second command reads asynchronously to generate the graphical image:

$ ./nbdview.tcl /tmp/log $((64*1024*1024))

So to the video (MP4):

  • 00:07: I start guestfish connected to the NBD server. This boots a Linux appliance, and you can see from the flashes of black how the Linux kernel probes the disk every which way to try to detect any kind of partition or filesystem signature. (Recall that I’m intentionally delaying all read requests which is why the appliance boot and probing seems to take so long. In reality these probes happen near instantaneously.) Of course the disk is all zeroes at this point, so nothing is found.
  • 00:23: I partition the disk using GPT. The partitioning is done under the hood by GNU parted and as you can see there is a considerable amount of probing going on by both parted and then the kernel still looking for filesystem signatures. Eventually we end up with two blocks of red (written) data, because GPT creates both a primary and secondary partition table at the beginning and end of the disk.
  • 00:36: I create an ext4 filesystem inside the partition. After even more probing by mkfs the first major operation is to trim/discard all data on the disk (shown by the disk filling up with light purple). Then mkfs writes a large block of data in the middle of the disk which I believe is the journal, followed by four dots which I believe could be backup superblocks.
  • 00:48: Interestingly, filesystem creation has not finished. ext4 (like other modern filesystems) defers a lot of work to the kernel, and this is obvious when I mount the disk. Notice that a few seconds after the mount (around 00:59) the kernel starts zeroing parts of the disk. I believe this is the inode table and block free bitmap for the first block group. For larger disks this lazy initialization could go on for a long time.
  • 01:05: I unpack a tarball into the filesystem. As expected the operation finishes almost instantaneously, and nothing is actually written to disk. However issuing an explicit sync at 01:11 causes the files and directories to be written, filling first the data blocks and then the inodes and block free bitmap (is there a reason these are written last, or is it just coincidence? Also does the Linux page cache retain the order that the filesystem wrote the blocks?)
  • 01:18: I delete the directory tree I just created. As you’d expect nothing is written to disk, and even after a sync nothing much changes.
  • 01:26: In contrast, when I fstrim the filesystem, all the now-deleted data blocks are discarded (light purple). This is the same principle which virt-sparsify --in-place uses to make a disk image sparse (see the example after this list).
  • 01:32: Finally after unmounting the filesystem I issue a blkdiscard command which throws the whole thing away. Even after this Linux is probing the partition to see if somehow a filesystem signature could be present.
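The virt-sparsify step mentioned above is a one-liner on the disk image (filename hypothetical):

$ virt-sparsify --in-place disk.img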

You can easily reproduce this and similar results yourself using nbdkit and nbdview, and I’ve submitted a talk to FOSDEM about this and the many other fun things you can do with nbdkit.
