NBD-backed qemu guest RAM

This seems too crazy to work, but it does:

$ nbdkit memory 1G
$ nbdfuse mem nbd://localhost &
[1] 1053075
$ ll mem
-rw-rw-rw-. 1 rjones rjones 1073741824 May 17 18:31 mem

Now boot qemu with that memory as the backing RAM:

$ qemu-system-x86_64 -m 1024 \
-object memory-backend-file,id=pc.ram,size=1024M,mem-path=/var/tmp/mem,share=on \
-machine memory-backend=pc.ram \
-drive file=fedora-36.img,if=virtio,format=raw

It works! You can even dump the RAM over a second NBD connection and grep for strings which appear on the screen (or passwords etc):

$ nbdcopy nbd://localhost - | strings | grep 'There was 1 failed'
There was 1 failed login attempt since the last successful login.

Of course this isn’t very useful on its own, it’s just an awkward way to use a sparse RAM disk as guest RAM, but nbdkit has plenty of other plugins that might be useful here. How about remote RAM? You’ll need a very fast network.

Filed under Uncategorized

nbdkit’s evil filter

If you want to simulate how your filesystem behaves with a bad drive underneath you have a few options like the kernel dm-flakey device, writing a bash nbdkit plugin, kernel fault injection or a few others. We didn’t have that facility in nbdkit however so last week I started the “evil filter”.

The evil filter can add data corruption to an existing nbdkit plugin. Types of corruption include “cosmic rays” (i.e. random bit flips), but more realistically it can simulate stuck bits. Stuck bits are the only failure mode I can remember seeing in real disks and RAM.

One challenge with writing a filter like this is to make the stuck bits persistent across accesses, without requiring us to maintain a large bitmap in the filter keeping track of their location. The solution is fairly elegant: split the underlying disk into blocks. When we read from a block, reconstruct the stuck bits within that block from a fixed seed (calculated from a global PRNG seed + the block’s offset), and iterate across the block incrementing by random intervals. The intervals are derived from the block’s seed so they are the same each time they are calculated. We size the blocks so that each one will have about 100 corrupted bits so this reconstruction doesn’t take very long. Nothing is stored except one global PRNG seed.
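To make the idea concrete, here is a minimal Python sketch of that reconstruction. It is not the filter's actual code: the function names, the way the per-block seed is formed (a plain sum), and the use of an exponential distribution for the intervals are all my assumptions for illustration.

```python
import random

def stuck_bits(global_seed, block_offset, block_size, probability):
    """Yield (bit_index, stuck_value) for every stuck bit in one block."""
    # Per-block seed derived from the global seed and the block's offset,
    # so the same block always regenerates the same stuck bits.
    rng = random.Random(global_seed + block_offset)
    bit = -1
    while True:
        # Random gap to the next stuck bit; the mean gap is roughly
        # 1/probability bits (assumed distribution, for illustration).
        bit += int(rng.expovariate(probability)) + 1
        if bit >= block_size * 8:
            return
        yield bit, rng.getrandbits(1)

def corrupt_read(data, offset, global_seed, block_size, probability):
    """Apply the stuck bits to `data`, which was read from byte `offset`."""
    buf = bytearray(data)
    end = offset + len(buf)
    first_block = offset - offset % block_size
    for block_offset in range(first_block, end, block_size):
        for bit, value in stuck_bits(global_seed, block_offset,
                                     block_size, probability):
            byte = block_offset + bit // 8
            if offset <= byte < end:
                mask = 1 << (bit % 8)
                if value:
                    buf[byte - offset] |= mask          # stuck at 1
                else:
                    buf[byte - offset] &= ~mask & 0xFF  # stuck at 0
    return bytes(buf)

# Size the blocks so each holds about 100 stuck bits.
probability = 1e-3
block_size = int(100 / probability / 8)
```

Because nothing is stored, reading any sub-range gives the same corruption as slicing a read of the whole disk, which is the persistence property described above.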

The filter isn’t upstream yet but hopefully it can be another way to test filesystems and distributed storage in future.

Frame pointers vs DWARF – my verdict

A couple of weeks ago I wrote a blog posting here about Fedora having frame pointers (LWN backgrounder, HN thread). I made some mistakes in that blog posting and retracted it, but I wasn’t wrong about the conclusions, just wrong about how I reached them. Frame pointers are much better than DWARF. DWARF unwinding might have some theoretical advantages but it’s worse in every practical respect.

In particular:

  1. Frame pointers give you much faster profiling with much less overhead. This practically means you can do continuous performance collection and analysis which would be impossible with DWARF.
  2. DWARF unwinding has foot-guns which make it easy to screw up and collect insufficient data for analysis. You cannot know in advance how much data to collect. The defaults are much too small, and even increasing the collection size to unreasonably large sizes isn’t enough.
  3. The overhead of collecting DWARF callgraph data adversely affects what you’re trying to analyze.
  4. Frame pointers have some corner cases which they don’t handle well (certain leaf and most inlined functions aren’t collected), but these don’t matter a great deal in reality.
  5. DWARF unwinding can show inlined functions as if they are separate stack frames. (Opinions differ as to whether or not this is an advantage.)

Below I’ll try to demonstrate some of the issues, but first a little bit of background is necessary about how all this works.

When you run perf record -a on a workload, the kernel fires a timer interrupt on every CPU 100s or 1000s of times a second. Each interrupt must collect a stack trace for that CPU at that moment which is then sent up to the userspace perf process that writes it to a perf.data file in the current directory. Obviously collecting this stack trace and writing it to the file must be done as quickly as possible with the least overhead.

Also the stack trace may start inside the kernel and go all the way out to userspace (unless the CPU was running userspace code at the moment it was interrupted in which case it just collects userspace). That involves unwinding the two different stacks.

For the kernel stack, the kernel has its own unwinding information called ORC. For the userspace stack you choose (with the perf record --call-graph option) whether to use frame pointers or DWARF. For frame pointers the kernel is able to immediately walk up the userspace stack all the way to the top (assuming everything was compiled with frame pointers, but that is now true for Fedora 38). For DWARF however the format is complicated and the kernel cannot unwind it immediately. Instead the kernel just collects the user stack. But collecting the whole stack would consume far too much storage, so by default it only collects the first 8K. Many userspace stacks will be larger than this, in which case the data collection will simply be incomplete – it will never be possible to recover the full stack trace. You can adjust the size of stack collected, but that massively bloats the perf.data file as we’ll see below.
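The reason frame pointers are so cheap to walk is that each frame stores the caller's frame pointer next to the return address, so the unwinder just follows a linked list. Here is a toy Python sketch of that walk, assuming the usual x86-64 layout (saved frame pointer at [fp], return address at [fp+8]). The dictionary standing in for stack memory and the addresses in it are made up for illustration; this is not how the kernel implements it.

```python
def walk_frame_pointers(memory, fp, max_depth=128):
    """Return the list of return addresses, innermost frame first."""
    trace = []
    while fp != 0 and len(trace) < max_depth:
        trace.append(memory[fp + 8])  # return address of this frame
        fp = memory[fp]               # saved frame pointer of the caller
    return trace

# Three nested frames; the outermost has a saved frame pointer of 0.
memory = {
    0x7000: 0x7100, 0x7008: 0x401000,
    0x7100: 0x7200, 0x7108: 0x402000,
    0x7200: 0x0,    0x7208: 0x403000,
}
trace = walk_frame_pointers(memory, 0x7000)
```

Each step is two memory reads, whereas with DWARF the kernel has to copy kilobytes of raw stack per sample and leave the unwinding to userspace.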

To demonstrate what I mean, I collected a set of traces using fio and nbdkit on Fedora 38, using both frame pointers and DWARF. The command is:

sudo perf record -a -g [--call-graph=...] -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio'

with the nbd.fio file from fio’s examples.

I used no --call-graph option for collecting frame pointers (as it is the default), and --call-graph=dwarf,{4096,8192,16384,32768} to collect the DWARF examples with 4 different stack sizes.

I converted the resulting data into flame graphs using Brendan Gregg’s tools.

Everything was run on my idle 12 core / 24 thread AMD development machine.

Type           | Size of perf.data | Lost chunks | Flame graph
Frame pointers | 934 MB            | 0           | Link
DWARF (4K)     | 10,104 MB         | 425         | Link
DWARF (8K)     | 18,733 MB         | 1,643       | Link
DWARF (16K)    | 35,149 MB         | 5,333       | Link
DWARF (32K)    | 57,590 MB         | 545,024     | Link

The first most obvious thing is that even with the smallest stack data collection, DWARF’s perf.data is over 10 times larger, and it balloons even larger once you start to collect more reasonable stack sizes. For a single minute of data collection, collecting 10s of gigabytes of data is not very practical even on high end machines, and continuous performance analysis would be impossible at these data rates.
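For reference, the size ratios can be worked out from the perf.data sizes measured above with a few lines of Python (the variable names are mine):

```python
# perf.data sizes in MB, from the measurements above.
sizes_mb = {
    "Frame pointers": 934,
    "DWARF (4K)": 10104,
    "DWARF (8K)": 18733,
    "DWARF (16K)": 35149,
    "DWARF (32K)": 57590,
}
base = sizes_mb["Frame pointers"]
ratios = {name: round(size / base, 1) for name, size in sizes_mb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

That gives roughly 10.8x, 20.1x, 37.6x and 61.7x the frame pointer size for the four DWARF runs.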

Related to this, the overhead of perf increases. It is ~ 0.1% for frame pointers. For DWARF the overhead goes: 0.8% (4K), 1.5% (8K), 2.8% (16K), 2.7% (32K). But this disguises the true overhead because it doesn’t count the cost of writing to disk. Unfortunately on this machine I have full disk encryption enabled (which does add a lot to the overhead of writing nearly 60 GB of perf data), but you can see the overhead of all that encryption separate from perf in the flame graph. The total overhead of perf + writing + encryption is about 20%.

This may also be the reason for seeing so many “lost chunks” even on this very fast machine. All of the DWARF tests even at the smallest size printed:

Check IO/CPU overload!

But is the DWARF data accurate? Clearly not. This is to be expected: collecting a partial user stack is not going to be enough to reconstruct a stack trace. But remember that even with 4K of stack, the perf.data is already > 10 times larger than for frame pointers. Zooming in to the nbdkit process only and comparing the flamegraphs shows significant amounts of incomplete stack traces, even when collecting 32K of stack.

On the left, nbdkit with frame pointers (correct). On the right, nbdkit with DWARF and 32K collection size. Notice on the right the large number of unattached frames. nbdkit main() does not directly call Unix domain socket send and receive functions!

If 8K (the default) is insufficient, and even 32K is not enough, how large do we need to make the DWARF stack collection? I couldn’t find out because I don’t have enough space for the expected 120 GB perf.data file at the next size up.

Let’s have a look at one thing which DWARF can do — show inlined and leaf functions. The stack trace for these is more accurate as you can see below. (To reproduce, zoom in on the nbd_poll function). On the left, frame pointers. On the right DWARF with 32K stacks, showing the extra enter_* frames which are inlined.

My final summary here is that for most purposes you would be better off using frame pointers, and it’s a good thing that Fedora 38 now compiles everything with frame pointers. It should result in easier performance analysis, and even makes continuous performance analysis more plausible.

SmithForth

I’m just going to leave a link to it

Also watch David Smith’s youtube vid:

More Jonesforth links

Frame pointers – an important update

A few days ago I posted about Fedora 38’s frame pointer change. I stated there that perf using only DWARF information produced inaccurate stack traces. But this isn’t true at all. After working on trying to reproduce those results (with the binutils developers who offered to help) I realised that I made a mistake on the DWARF (Fedora 37, pre-frame-pointer change) results, using:

perf record -a -g -- <cmd>

instead of:

perf record -a -g --call-graph=dwarf -- <cmd>

The difference is the first command still tries to use frame pointers, which entirely explains the inaccurate stack traces. With the correct parameters, perf record does do the right thing, collecting accurate stack traces even on Fedora 37, and the flame graphs generated look fine.

That caused a lot of noise, and it was wrong.

There are some issues around DWARF vs frame pointers. They’re not about the accuracy of the stack traces, but about the size of the collected data and the speed of processing it. Using DWARF CFI means that much more data has to be collected (typical perf.data files grow by more than 10 times; on my laptop perf.data grows to 8 GB for a single trace).

nbdkit + libblkio

Our plugin-based Network Block Device server, nbdkit, now has support for libblkio.

libblkio is a library written by Stefan Hajnoczi, Alberto Faria, Stefano Garzarella and others for accessing some fairly unusual disk protocols including vhost-user, NVMe, vDPA, VFIO and io_uring, which I’ll talk about below. It’s important to know that these are not disk formats (like raw or qcow2), but accelerated protocols for talking to virtual or real hardware.

The library is written in Rust (but offers a C API) and I believe it’s intended to replace various bottom-end parts of the qemu block layer at some point in the future.

The library uses a set of property strings to describe how to connect to a device. The nbdkit plugin maps those almost exactly into command line parameters, so you can usually follow the libblkio docs and translate that into an nbdkit command line, eg:

$ nbdkit blkio io_uring path=fedora.img

This sets the libblkio driver to “io_uring” and the path to the path of a local file. This libblkio driver uses Linux’s relatively new io_uring facility to access a local file or block device, the simplest way to use libblkio.

The other most frequently used protocol or libblkio driver is vhost-user. This is a protocol that allows a server to share a disk image to client(s) on the same machine. It uses a Unix domain socket for communication, but unlike Network Block Device (NBD) it’s not possible to use this over the network. For greater performance vhost uses shared memory between the client and server for data transfer.

qemu-storage-daemon is the most common server:

$ qemu-storage-daemon \
--blockdev driver=file,node-name=file,filename=fedora.qcow2 \
--blockdev driver=qcow2,node-name=qcow2,file=file \
--export type=vhost-user-blk,id=export,addr.type=unix,addr.path=sock,node-name=qcow2

To connect from nbdkit, just use the socket:

$ nbdkit blkio virtio-blk-vhost-user path=sock

You might wonder why we want to add libblkio support to nbdkit (apart from it being fun). There’s a practical reason, which is that this brings along all of the scripting support we’ve created around NBD to these somewhat obscure (albeit quite widely used) protocols. I don’t think it was possible before to use Python to script against, e.g., vhost-user, but now it is:

$ nbdsh -u nbd://localhost -c 'print("%r" % h.pread(512,0))'

Creating a modifiable gzipped disk image

Dusty Mabe set me a challenge yesterday. He wants to create several compressed disk images that have slightly different content, but are otherwise mostly the same. The disk images are large and compressing them takes a long time (30 minutes each, apparently), so ideally what we’d want to do is compress the disk image just the once and then do the updates on the gzipped image.

Modifying a file which has already been compressed is not usually possible.

However if we make some relatively uncontroversial assumptions and accept a few limitations then we can create a compressed disk image which is modifiable in this way, certainly for gzip and xz (I need to investigate zstd).

Firstly let’s assume we’re using some kind of block-based compression with fixed size blocks. This is true for gzip and xz already. Secondly let’s assume that we want to only modify a single, small partition of the image (Dusty only wants to modify the /boot partition). Lastly we’ll assume that the partition boundaries are aligned to the compression blocks. Since partitions can be placed under control of whoever creates the disk image this last one is pretty easy to arrange.

The trick is to use uncompressed blocks for the part of the input covering the partition you want to modify, and compress the rest of the file normally. Both gzip and xz have an uncompressed block type. (In fact, just about any reasonable compression algorithm works like this – if the input data doesn’t become smaller after trying to compress it, the algorithm will emit the data uncompressed since that takes less space.)

Normal tools won’t let you create files like this, but I wrote one for gzip here, and I’m confident that one could be written for xz (exercise for readers! … or me if we decide to productise this).

I created a normal Fedora 36 image using virt-builder and used guestfish to find the partition boundaries:

$ guestfish --ro -a /var/tmp/fedora-36.img -i part-list /dev/sda
[0] = {
  part_num: 1
  part_start: 1048576
  part_end: 2097151
  part_size: 1048576
}
[1] = {
  part_num: 2
  part_start: 2097152
  part_end: 1075838975
  part_size: 1073741824
}
[2] = {
  part_num: 3
  part_start: 1075838976
  part_end: 6441402367
  part_size: 5365563392
}

We will compress partition #3 while leaving partitions #1 and #2 uncompressed so they can be modified in place later. The tool is a bit crude, but:

$ ./partial-deflate /var/tmp/fedora-36.img /var/tmp/fedora-36.img.gz 0 1075838976 6441402368 

The output is a regular gzip file (albeit rather large because the first 1GB is uncompressed – if I was doing this for real I’d make that boot partition smaller):

$ ll /var/tmp/fedora-36.img*
-rw-r--r--. 1 rjones rjones 6442450944 Nov 30 17:03 /var/tmp/fedora-36.img
-rw-r--r--. 1 rjones rjones 1698849910 Dec  1 16:23 /var/tmp/fedora-36.img.gz
$ zcat /var/tmp/fedora-36.img.gz | md5sum 
57cbd5ebcbe59613378c8aee7ad9e40d  -
$ md5sum /var/tmp/fedora-36.img
57cbd5ebcbe59613378c8aee7ad9e40d  /var/tmp/fedora-36.img

Then I went ahead and modified some known content in the uncompressed part of the gzip file. I used a hex editor for this, but you could play around with guestfish + nbdkit-offset-filter for something more supportable.

And the result is a gzipped file with the modifications:

$ zcat /var/tmp/fedora-36.img.gz > /var/tmp/fedora-36.img-modified
gzip: /var/tmp/fedora-36.img.gz: invalid compressed data--crc error
$ guestfish --ro -a /var/tmp/fedora-36.img-modified -i cat /boot/mydata.txt 
helloworld------

…and a CRC error. That’s to be expected, as I haven’t yet worked out how to update CRC-32 after making changes, but it should be easy to solve (with brute force if necessary).
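The whole trick, including one way to repair the CRC, can be sketched in a few lines of Python. To be clear, this is not the partial-deflate tool used above: the function names are made up, it only handles an uncompressed region at the start of the file, and the CRC fix simply recomputes CRC-32 over the whole decompressed stream (the brute force option), which may not be how you would want to do it for a multi-gigabyte image.

```python
import gzip
import struct
import zlib

BLOCK = 65535  # largest payload of one deflate "stored" block
HEADER = 10    # size of the minimal gzip header written below

def gzip_with_stored_prefix(data, split):
    """gzip `data`, storing data[:split] uncompressed."""
    # Minimal gzip header: magic, CM=deflate, no flags, mtime=0, OS=unknown.
    out = bytearray(b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff")
    for off in range(0, split, BLOCK):
        chunk = data[off:min(off + BLOCK, split)]
        # Non-final stored block: header byte, LEN, NLEN, then raw bytes.
        out += b"\x00" + struct.pack("<HH", len(chunk), len(chunk) ^ 0xFFFF)
        out += chunk
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate
    out += comp.compress(data[split:]) + comp.flush()
    out += struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return out

def file_offset(pos):
    """Offset in the .gz file of uncompressed byte `pos` (pos < split)."""
    return HEADER + 5 * (pos // BLOCK + 1) + pos

def patch(gz, pos, new):
    """Overwrite bytes of the stored region in place."""
    for i, b in enumerate(new):
        gz[file_offset(pos + i)] = b

def fix_crc(gz):
    """Recompute the gzip trailer from the (modified) contents."""
    plain = zlib.decompressobj(-15).decompress(bytes(gz[HEADER:-8]))
    gz[-8:] = struct.pack("<II", zlib.crc32(plain), len(plain) & 0xFFFFFFFF)
```

Usage is: build the file once with gzip_with_stored_prefix, then call patch and fix_crc for each variant. The expensive compression of the tail is never repeated.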

An NBD block device written using Linux ublk (user block device)

Commits [1] and [2] and more here.

ublk is a Linux-only io_uring-based user block device. It lets you write block devices in userspace. nbdublk is an NBD client written using ublk.

# modprobe ublk_drv
# nbdublk /dev/ublkb0 nbd://remote
# ublk list

# blockdev --getsize64 /dev/ublkb0
# mke2fs /dev/ublkb0
# (etc)

# ublk del -n 0

nbdkit for macOS

nbdkit, our high performance, portable Network Block Device server has now been ported to macOS. It’s a command line tool and macOS is sufficiently FreeBSD-like that the port wasn’t very hard. It’s relatively full featured, including a large portion of the plugins and filters, a brand new exit-with-parent implementation, and almost all tests passing.

However one larger problem remains (for performance) which is the lack of atomic CLOEXEC when opening pipes or sockets. Linux has pipe2 and accept4. I wasn’t able to find any good equivalent on macOS, and hence most of the time we are limited to serializing some requests that could otherwise run in parallel.
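To illustrate the problem, here is a small Python sketch (nbdkit itself is written in C, and this is not its code). Where pipe2 exists the close-on-exec flag is set atomically. Without it there is a window between creating the pipe and setting the flag, so the fallback has to hold a lock that anything forking a child must also take. The lock and function names are hypothetical.

```python
import fcntl
import os
import threading

# Hypothetical lock: code that forks would need to hold it too, so that
# no child is created between pipe() and setting FD_CLOEXEC.
fork_lock = threading.Lock()

def pipe_cloexec():
    if hasattr(os, "pipe2"):
        # Linux and some BSDs: flag is set atomically at creation.
        return os.pipe2(os.O_CLOEXEC)
    with fork_lock:
        r, w = os.pipe()
        # Python's os.pipe() already makes the fds non-inheritable; the
        # explicit fcntl calls mirror the two-step sequence C code needs.
        for fd in (r, w):
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        return r, w
```

The lock is what costs parallelism: every path that opens a descriptor or forks ends up serialized behind it.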

nbdkit already supported Linux, FreeBSD, OpenBSD, Haiku and Windows!

SSH from RHEL 9 to RHEL 5 or RHEL 6

RHEL 9 no longer lets you ssh to RHEL ≤ 6 hosts out of the box. You can weaken security of the whole system but there’s no easy way to set security policy per remote host. Here’s how to set up ssh so it works for a RHEL 5 or RHEL 6 host:

First edit your .ssh/config file, adding an entry for the host:

Host rhel5or6-host
    KexAlgorithms +diffie-hellman-group14-sha1
    MACs +hmac-sha1
    HostKeyAlgorithms +ssh-rsa
    PubkeyAcceptedKeyTypes +ssh-rsa
    PubkeyAcceptedAlgorithms +ssh-rsa

That’s not enough on its own, because RHEL 9 also maims the openssl library by disabling SHA1 support by default. To fix that, create /var/tmp/openssl.cnf with:

.include /etc/ssl/openssl.cnf
[openssl_init]
alg_section = evp_properties
[evp_properties]
rh-allow-sha1-signatures = yes

Now you can ssh to RHEL 5 or RHEL 6 hosts like this:

OPENSSL_CONF=/var/tmp/openssl.cnf ssh rhel5or6-host

Thanks Laszlo Ersek for working out most of this. Related bugs:

2064740 – RFE: Make it easier to configure LEGACY policy per service or per host

2062360 – RFE: Virt-v2v should replace hairy “enable LEGACY crypto” advice which a more targeted mechanism
