An NBD block device written using Linux ublk (user block device)

Commits [1] and [2] and more here.

ublk is a Linux-only io_uring-based user block device. It lets you write block devices in userspace. nbdublk is an NBD client written using ublk.

# modprobe ublk_drv
# nbdublk /dev/ublkb0 nbd://remote
# ublk list

# blockdev --getsize64 /dev/ublkb0
# mke2fs /dev/ublkb0
# (etc)

# ublk del -n 0

Leave a comment

Filed under Uncategorized

nbdkit for macOS

nbdkit, our high performance, portable Network Block Device server has now been ported to macOS. It’s a command line tool and macOS is sufficiently FreeBSD-like that the port wasn’t very hard. It’s relatively full featured, including a large portion of the plugins and filters, a brand new exit-with-parent implementation, and almost all tests passing.

However one larger problem remains (for performance) which is the lack of atomic CLOEXEC when opening pipes or sockets. Linux has pipe2 and accept4. I wasn’t able to find any good equivalent on macOS, and hence most of the time we are limited to serializing some requests that could otherwise run in parallel.

nbdkit already supported Linux, FreeBSD, OpenBSD, Haiku and Windows!

Leave a comment

Filed under Uncategorized

SSH from RHEL 9 to RHEL 5 or RHEL 6

RHEL 9 no longer lets you ssh to RHEL ≤ 6 hosts out of the box. You can weaken security of the whole system but there’s no easy way to set security policy per remote host. Here’s how to set up ssh so it works for a RHEL 5 or RHEL 6 host:

First edit your .ssh/config file, adding an entry for the host:

Host rhel5or6-host
KexAlgorithms +diffie-hellman-group14-sha1
MACs +hmac-sha1
HostKeyAlgorithms +ssh-rsa
PubkeyAcceptedKeyTypes +ssh-rsa
PubkeyAcceptedAlgorithms +ssh-rsa

(The lines except the first “Host” line should be indented. WordPress screws up the formatting …)

That’s not enough on its own, because RHEL 9 also maims the openssl library by disabling SHA1 support by default. To fix that, create /var/tmp/openssl.cnf with:

.include /etc/ssl/openssl.cnf
[openssl_init]
alg_section = evp_properties
[evp_properties]
rh-allow-sha1-signatures = yes

Now you can ssh to RHEL 5 or RHEL 6 hosts like this:

OPENSSL_CONF=/var/tmp/openssl.cnf ssh rhel5or6-host

Thanks Laszlo Ersek for working out most of this. Related bugs:

2064740 – RFE: Make it easier to configure LEGACY policy per service or per host

2062360 – RFE: Virt-v2v should replace hairy “enable LEGACY crypto” advice which a more targeted mechanism

Leave a comment

Filed under Uncategorized

Composable tools for disk images

Over the past 3 or 4 years, my colleagues and I at Red Hat have been making a set of composable command line tools for handling virtual machine disk images. These let you copy, create, manipulate, display and modify disk images using simple tools that can be connected together in pipelines, while at the same time working very efficiently. It’s all based around the very efficient Network Block Device (NBD) protocol and NBD URI specification.

A basic and very old tool is qemu-img:

$ qemu-img create -f qcow2 disk.qcow2 1G

which creates an empty disk image in qcow2 format. Suppose you want to write into this image? We can compose a few programs:

$ touch disk.raw
$ nbdfuse disk.raw [ qemu-nbd -f qcow2 disk.qcow2 ] &

This serves the qcow2 file up over NBD (qemu-nbd) and then exposes that as a local file using FUSE (nbdfuse). Of interest here, nbdfuse runs and manages qemu-nbd as a subprocess, cleaning it up when the FUSE file is unmounted. We can partition the file using regular tools:

$ gdisk disk.raw
Command (? for help): n
Partition number (1-128, default 1): 
First sector (34-2097118, default = 2048) or {+-}size{KMGTP}: 
Last sector (2048-2097118, default = 2097118) or {+-}size{KMGTP}: 
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 
Changed type of partition to 'Linux filesystem'
Command (? for help): p
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2097118   1023.0 MiB  8300  Linux filesystem
Command (? for help): w

Let’s fill that partition with some files using guestfish and unmount it:

$ guestfish -a disk.raw run : \
  mkfs ext2 /dev/sda1 : mount /dev/sda1 / : \
  copy-in ~/libnbd /
$ fusermount3 -u disk.raw
[1]+  Done    nbdfuse disk.raw [ qemu-nbd -f qcow2 disk.qcow2 ]

Now the original qcow2 file is no longer empty but populated with a partition, a filesystem and some files. We can see the space used by examining it with virt-df:

$ virt-df -a disk.qcow2 -h
Filesystem                Size   Used  Available  Use%
disk.qcow2:/dev/sda1     1006M    52M       903M    6%

Now let’s see the first sector. You can’t just “cat” a qcow2 file because it’s a complicated format understood only by qemu. I can assemble qemu-nbd, nbdcopy and hexdump into a pipeline, where qemu-nbd converts the qcow2 format to raw blocks, and nbdcopy copies those out to a pipe:

$ nbdcopy -- [ qemu-nbd -r -f qcow2 disk.qcow2 ] - | \
  hexdump -C -n 512
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  02 00 ee 8a 08 82 01 00  00 00 ff ff 1f 00 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200

How about instead of a local file, we start with a disk image hosted on a web server, and compressed? We can do that too. Let’s start by querying the size by composing nbdkit’s curl plugin, xz filter and nbdinfo. nbdkit’s --run option composes nbdkit with an external program, connecting them together over an NBD URI ($uri).

$ web=http://mirror.bytemark.co.uk/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20220127.n.0.x86_64.raw.xz
$ nbdkit curl --filter=xz $web --run 'nbdinfo $uri'
protocol: newstyle-fixed without TLS
export="":
	export-size: 5368709120 (5G)
	content: DOS/MBR boot sector, extended partition table (last)
	uri: nbd://localhost:10809/
...

Notice it prints the uncompressed (raw) size. Fedora already provides a qcow2 equivalent, but we can also make our own by composing nbdkit, curl, xz, nbdcopy and qemu-nbd:

$ qemu-img create -f qcow2 cloud.qcow2 5368709120 -o preallocation=metadata
$ nbdkit curl --filter=xz $web \
    --run 'nbdcopy -p -- $uri [ qemu-nbd -f qcow2 cloud.qcow2 ]'

Why would you do that instead of downloading and uncompressing? In this case it wouldn’t matter much, but in the general case the disk image might be enormous (terabytes) and you don’t have enough local disk space to do it. Assembling tools into pipelines means you don’t need to keep an intermediate local copy at any point.

We can find out what we’ve got in our new image using various tools:

$ qemu-img info cloud.qcow2 
image: cloud.qcow2
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 951 MiB
$ virt-df -a cloud.qcow2  -h
Filesystem              Size       Used  Available  Use%
cloud.qcow2:/dev/sda2   458M        50M       379M   12%
cloud.qcow2:/dev/sda3   100M       9.8M        90M   10%
cloud.qcow2:/dev/sda5   4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/root
                        4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/home
                        4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/root/var/lib/portables
                        4.4G       311M       3.6G    7%
$ virt-cat -a cloud.qcow2 /etc/redhat-release
Fedora release 36 (Rawhide)

If we wanted to play with the guest in a sandbox, we could stand up an in-memory NBD server populated with the cloud image and connect it to qemu using standard NBD URIs:

$ nbdkit memory 10G
$ qemu-img convert cloud.qcow2 nbd://localhost 
$ virt-customize --format=raw -a nbd://localhost \
    --root-password password:123456 
$ qemu-system-x86_64 -machine accel=kvm \
    -cpu host -m 2048 -serial stdio \
    -drive file=nbd://localhost,if=virtio 
...
fedora login: root
Password: 123456

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0     11:0    1 1024M  0 rom  
zram0  251:0    0  1.9G  0 disk [SWAP]
vda    252:0    0   10G  0 disk 
├─vda1 252:1    0    1M  0 part 
├─vda2 252:2    0  500M  0 part /boot
├─vda3 252:3    0  100M  0 part /boot/efi
├─vda4 252:4    0    4M  0 part 
└─vda5 252:5    0  4.4G  0 part /home
                                /

We can even find out what changed between the in-memory copy and the pristine qcow2 version (quite a lot as it happens):

$ virt-diff --format=raw -a nbd://localhost --format=qcow2 -A cloud.qcow2 
- d 0755       2518 /etc
+ d 0755       2502 /etc
# changed: st_size
- - 0644        208 /etc/.updated
- d 0750        108 /etc/audit
+ d 0750         86 /etc/audit
# changed: st_size
- - 0640         84 /etc/audit/audit.rules
- d 0755         36 /etc/issue.d
+ d 0755          0 /etc/issue.d
# changed: st_size
... for several pages ...

In conclusion, we’ve got a couple of ways to serve disk content over NBD, a set of composable tools for copying, creating, displaying and modifying disk content either from local files or over NBD, and a way to pipe disk data between processes and systems.

We use this in virt-v2v which can suck VMs out of VMware to KVM systems, efficiently, in parallel, and without using local disk space for even the largest guest.

Leave a comment

Filed under Uncategorized

nbdkit now supports LUKS encryption

nbdkit, our permissively licensed plugin-based Network Block Device server can now transparently decode encrypted disks, for both reading and writing:

qemu-img create -f luks --object secret,data=SECRET,id=sec0 -o key-secret=sec0 encrypted-disk.img 1G

nbdkit file encrypted-disk.img --filter=luks passphrase=+/tmp/secret

We use LUKSv1 as the encryption format. That’s an older version [more on that in a moment] of the format used for Full Disk Encryption on Linux. It’s much preferable to use LUKS rather than using qemu’s built-in qcow2 encryption, and our implementation is compatible with qemu’s.

You can place the filter on top of other nbdkit plugins, like Curl:

nbdkit curl https://example.com/encrypted-disk.img --filter=luks passphrase=+/tmp/secret

The threat model here is that you can store the encrypted data on a remote server, and the admin of the server cannot decrypt the disk (assuming you don’t give them the passphrase).

If you try this filter (or qemu’s device) with a modern Linux LUKS disk you’ll find that it doesn’t work. This is because modern Linux uses LUKSv2, although they are able to create, read and write LUKSv1 if you use set them up that way in advance. Unfortunately LUKSv2 is significantly more complicated than LUKSv1. It requires parsing JSON data(!) stored in the header, and supports a wider range of password derivation functions, typically the very slow and memory-intensive argon2. LUKSv1 by contrast only requires support for PBKDF2 and is generally far more straightforward to implement.

The new filter will be available in nbdkit 1.32, or you can grab the development version now.

2 Comments

Filed under Uncategorized

Installing Fedora 34 on my Turing Pi 7 node cluster

I now have Fedora 34 running on all 7 nodes of my Turing Pi 1 cluster. Tedious to install, so these are just my notes.

Compile https://github.com/raspberrypi/usbboot

Insert the compute module to program, set the jumper to flash mode, connect the cable, power it on and run ./rpiboot

/dev/sdX will appear once the CM is in the right mode. Following the instructions here:

arm-image-installer --image=$HOME/Fedora-Server-34-1.2.aarch64.raw.xz --target=rpi3 --media=/dev/sdb --resizefs --relabel --addkey $HOME/.ssh/id_rsa.pub

This takes quite a long time to run. You also need to kill initial-setup:

virt-customize -a /dev/sdb --link /dev/null:/etc/systemd/system/initial-setup.service --selinux-relabel

If it’s successful, reboot the Turing Pi with the jumper in the boot setting and check the first CM comes up (I used an HDMI monitor)

Repeat with the other 6 boards(!)

Edit: I couldn’t work out how to get stable MAC addresses. To get a stable MAC address:

nmcli c modify "Wired connection 1" 802-3-ethernet.cloned-mac-address aa:bb:cc:dd:ee:ff

Leave a comment

Filed under Uncategorized

Interview for Red Hat Blog

I was interviewed for Red Hat’s Blog: https://www.redhat.com/en/blog/looking-back-30-years-linux-history-red-hats-richard-jones?channel=blog/channel/red-hat-news

1 Comment

Filed under Uncategorized

HiFive Unmatched

5 Comments

Filed under Uncategorized

BeagleV

It’s got a heatsink!

2 Comments

Filed under Uncategorized

Turing Pi 1

Finally getting down to assembling and installing this 7 node cluster.

Leave a comment

Filed under Uncategorized