Tag Archives: libguestfs

Composable tools for disk images

Over the past 3 or 4 years, my colleagues and I at Red Hat have been making a set of composable command line tools for handling virtual machine disk images. These let you copy, create, manipulate, display and modify disk images using simple tools that can be connected together in pipelines, while still working very efficiently. It’s all based around the Network Block Device (NBD) protocol and the NBD URI specification.
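
For reference, NBD URIs (which you’ll see throughout these examples) look like this. These are illustrative forms; the full grammar is defined by the NBD URI specification:

nbd://example.com/export-name                  # NBD over TCP (default port 10809)
nbds://example.com/export-name                 # the same, over TLS
nbd+unix:///export-name?socket=/tmp/nbd.sock   # NBD over a Unix domain socket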

A basic and very old tool is qemu-img:

$ qemu-img create -f qcow2 disk.qcow2 1G

which creates an empty disk image in qcow2 format. Suppose you want to write into this image. We can compose a few programs:

$ touch disk.raw
$ nbdfuse disk.raw [ qemu-nbd -f qcow2 disk.qcow2 ] &

This serves the qcow2 file up over NBD (qemu-nbd) and then exposes that as a local file using FUSE (nbdfuse). Of interest here, nbdfuse runs and manages qemu-nbd as a subprocess, cleaning it up when the FUSE file is unmounted. We can partition the file using regular tools:
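
If you prefer to run the two programs separately, nbdfuse can also connect to an already-running server over an NBD URI. A rough equivalent (assuming the default NBD port 10809 is free on localhost):

$ qemu-nbd -t -f qcow2 disk.qcow2 &
$ nbdfuse disk.raw nbd://localhost &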

$ gdisk disk.raw
Command (? for help): n
Partition number (1-128, default 1): 
First sector (34-2097118, default = 2048) or {+-}size{KMGTP}: 
Last sector (2048-2097118, default = 2097118) or {+-}size{KMGTP}: 
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 
Changed type of partition to 'Linux filesystem'
Command (? for help): p
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2097118   1023.0 MiB  8300  Linux filesystem
Command (? for help): w

Let’s make a filesystem in that partition and fill it with some files using guestfish, then unmount the FUSE file:

$ guestfish -a disk.raw run : \
  mkfs ext2 /dev/sda1 : mount /dev/sda1 / : \
  copy-in ~/libnbd /
$ fusermount3 -u disk.raw
[1]+  Done    nbdfuse disk.raw [ qemu-nbd -f qcow2 disk.qcow2 ]

Now the original qcow2 file is no longer empty but populated with a partition, a filesystem and some files. We can see the space used by examining it with virt-df:

$ virt-df -a disk.qcow2 -h
Filesystem                Size   Used  Available  Use%
disk.qcow2:/dev/sda1     1006M    52M       903M    6%

Now let’s see the first sector. You can’t just “cat” a qcow2 file because it’s a complicated format understood only by qemu. I can assemble qemu-nbd, nbdcopy and hexdump into a pipeline, where qemu-nbd converts the qcow2 format to raw blocks, and nbdcopy copies those out to a pipe:

$ nbdcopy -- [ qemu-nbd -r -f qcow2 disk.qcow2 ] - | \
  hexdump -C -n 512
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  02 00 ee 8a 08 82 01 00  00 00 ff ff 1f 00 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200
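
The same pipeline can look anywhere in the image simply by changing the hexdump options. For example, the GPT header we wrote earlier lives in the second sector, so skipping 512 bytes shows it (output omitted):

$ nbdcopy -- [ qemu-nbd -r -f qcow2 disk.qcow2 ] - | \
  hexdump -C -s 512 -n 512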

How about starting, instead of with a local file, with a compressed disk image hosted on a web server? We can do that too. Let’s start by querying the size, composing nbdkit’s curl plugin, the xz filter and nbdinfo. nbdkit’s --run option composes nbdkit with an external program, connecting them together over an NBD URI ($uri).

$ web=http://mirror.bytemark.co.uk/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20220127.n.0.x86_64.raw.xz
$ nbdkit curl --filter=xz $web --run 'nbdinfo $uri'
protocol: newstyle-fixed without TLS
export="":
	export-size: 5368709120 (5G)
	content: DOS/MBR boot sector, extended partition table (last)
	uri: nbd://localhost:10809/
...
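
If all you want is the size, nbdinfo can print just that one number, which is handy in scripts:

$ nbdkit curl --filter=xz $web --run 'nbdinfo --size $uri'
5368709120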

Notice it prints the uncompressed (raw) size. Fedora already provides a qcow2 equivalent, but we can also make our own by composing nbdkit, curl, xz, nbdcopy and qemu-nbd:

$ qemu-img create -f qcow2 cloud.qcow2 5368709120 -o preallocation=metadata
$ nbdkit curl --filter=xz $web \
    --run 'nbdcopy -p -- $uri [ qemu-nbd -f qcow2 cloud.qcow2 ]'

Why would you do that instead of downloading and uncompressing? In this case it wouldn’t matter much, but in the general case the disk image might be enormous (terabytes) and you might not have enough local disk space for it. Assembling tools into pipelines means you don’t need to keep an intermediate local copy at any point.

We can find out what we’ve got in our new image using various tools:

$ qemu-img info cloud.qcow2 
image: cloud.qcow2
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 951 MiB
$ virt-df -a cloud.qcow2  -h
Filesystem              Size       Used  Available  Use%
cloud.qcow2:/dev/sda2   458M        50M       379M   12%
cloud.qcow2:/dev/sda3   100M       9.8M        90M   10%
cloud.qcow2:/dev/sda5   4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/root
                        4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/home
                        4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/root/var/lib/portables
                        4.4G       311M       3.6G    7%
$ virt-cat -a cloud.qcow2 /etc/redhat-release
Fedora release 36 (Rawhide)
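
Other libguestfs inspection tools work on the same image. For example, virt-filesystems lists the partitions, filesystems and volumes in one go (output omitted here):

$ virt-filesystems -a cloud.qcow2 --all --long -h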

If we wanted to play with the guest in a sandbox, we could stand up an in-memory NBD server populated with the cloud image and connect it to qemu using standard NBD URIs:

$ nbdkit memory 10G
$ qemu-img convert cloud.qcow2 nbd://localhost 
$ virt-customize --format=raw -a nbd://localhost \
    --root-password password:123456 
$ qemu-system-x86_64 -machine accel=kvm \
    -cpu host -m 2048 -serial stdio \
    -drive file=nbd://localhost,if=virtio 
...
fedora login: root
Password: 123456

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0     11:0    1 1024M  0 rom  
zram0  251:0    0  1.9G  0 disk [SWAP]
vda    252:0    0   10G  0 disk 
├─vda1 252:1    0    1M  0 part 
├─vda2 252:2    0  500M  0 part /boot
├─vda3 252:3    0  100M  0 part /boot/efi
├─vda4 252:4    0    4M  0 part 
└─vda5 252:5    0  4.4G  0 part /home
                                /
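
The sandbox lives only in nbdkit’s memory, so it vanishes when nbdkit exits. If you wanted to keep the modified guest you could copy it back out over the same NBD connection, for example (the output filename is up to you):

$ qemu-img convert -f raw -O qcow2 nbd://localhost modified.qcow2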

We can even find out what changed between the in-memory copy and the pristine qcow2 version (quite a lot as it happens):

$ virt-diff --format=raw -a nbd://localhost --format=qcow2 -A cloud.qcow2 
- d 0755       2518 /etc
+ d 0755       2502 /etc
# changed: st_size
- - 0644        208 /etc/.updated
- d 0750        108 /etc/audit
+ d 0750         86 /etc/audit
# changed: st_size
- - 0640         84 /etc/audit/audit.rules
- d 0755         36 /etc/issue.d
+ d 0755          0 /etc/issue.d
# changed: st_size
... for several pages ...

In conclusion, we’ve got a couple of ways to serve disk content over NBD, a set of composable tools for copying, creating, displaying and modifying disk content either from local files or over NBD, and a way to pipe disk data between processes and systems.

We use this in virt-v2v, which can suck VMs out of VMware and onto KVM systems, efficiently, in parallel, and without using local disk space even for the largest guests.

nbdkit 1.12

The new stable release of nbdkit, our flexible Network Block Device server, is out. You can read the announcement and release notes here.

The big new features are SSH support, the linuxdisk plugin, writing plugins in Rust, and extents. Extents let NBD clients work out which parts of a disk are sparse or zeroes and skip reading them. It was hellishly difficult to implement because of the number of obscure corner cases.
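
With current nbdkit and libnbd you can see what extents look like from the client side by asking a server for its extent map; nbdinfo (a newer tool, from libnbd) has a --map option for that, shown here against the memory plugin which starts out fully sparse:

$ nbdkit memory 1G --run 'nbdinfo --map $uri'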

Also in this release are a couple of interesting filters. The rate filter lets you add a bandwidth limit to connections. We will use this in virt-v2v to allow v2v instances to be rate limited (even dynamically). The readahead filter makes sequential copying and scanning of plugins more efficient by prefetching data ahead of time. It is self-configuring, and in most cases simply adding it to your filter stack is enough to get a nice performance boost, assuming your client’s access patterns are mostly sequential.
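
Both filters are just extra layers on the nbdkit command line. A couple of sketches (the URL is a placeholder and the rate is illustrative):

$ nbdkit --filter=rate memory 64M rate=1M
$ nbdkit --filter=readahead curl https://example.com/disk.img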

NBD graphical viewer – RAID 5 edition

If you saw my posting from two days ago you’ll know I’m working on visualizing what happens on block devices when you perform various operations. Last time we covered basics like partitioning a disk, creating a filesystem, creating files, and fstrim.

This time I’ve tied together 5 of the nbdcanvas widgets into a bigger Tcl application that can show what’s happening on a RAID 5 disk set. As with the last posting there’s a video followed by commentary on what happens at each step.

[Video: raid0066, an animation of the five nbdkit-backed disks during the steps below]

  • 00:00: I start guestfish connected to all 5 nbdkit servers. Also of note I’ve added raid456.devices_handle_discard_safely=1 to the appliance kernel command line, which is required for discards to work through MD RAID devices (I didn’t know that before yesterday).
  • 00:02: When the appliance starts up, the black flashes show the kernel probing for possible partitions or filesystems. The disks are blank so nothing is found.
  • 00:16: As in the previous post I’m partitioning the disks using GPT. Each ends up with a partition table at the start and end of the disk (two red blocks of pixels).
  • 00:51: Now I use mdadm --create (via the guestfish md-create command) to make a RAID 5 array across the first 4 disks. The 4th disk is the parity disk — you can see disks 1 through 3 being scanned and the parity information being written to the 4th disk. The 5th disk is a hot spare. Notice how the scanning continues after the mdadm command has returned. In real arrays this can go on for hours or days.
  • 01:11: I create a filesystem. The first action that mkfs performs is discarding previous data (indicated by light purple). Notice that the parity data is also discarded, which surprised me, but does make sense.
  • 01:27: The RAID array is mounted and I unpack a tarball into it.
  • 01:40: I delete the files and fstrim, which discards the underlying blocks again.
  • 01:48: Now I’m going to inject errors at the block layer into the 3rd disk. The Error checkbox in the Tcl widget simply creates a file. We’re using the nbdkit error filter which monitors for the named file and when it is created starts injecting errors into any read or write operation. Almost immediately the RAID array notices the damage and starts rebuilding on to the hot spare. Notice the black flashes where it reads the working disks (including old parity disk) to construct the redundant information on the spare.
  • 01:55: While reconstruction is under way, the RAID array can be used normally.
  • 02:14: Examining /proc/mdstat shows that the third disk has been marked failed.
  • 02:24: Now I’m going to inject errors into the 4th disk as well. This RAID array can survive this, operating in a “degraded state”, but there is no more redundancy.
  • 02:46: Finally we can examine the kernel messages which show that the RAID array is continuing on 3 devices.

In case you want to reproduce the results yourself, the full command to run nbdkit (repeated 5 times) is:

$ rm /tmp/sock1 /tmp/error1
$ ./nbdkit -fv -U /tmp/sock1 \
    --filter=error --filter=log --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log1 \
    rdelay=40ms wdelay=40ms \
    error-rate=100% error-file=/tmp/error1

And the nbdraid viewing program:

$ ./nbdraid.tcl 5 $((64*1024*1024)) /tmp/log%d /tmp/error%d
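
The guest-side steps in the timeline above can be driven from guestfish too. Once guestfish is connected to the five nbdkit disks and launched, the commands look roughly like this (a sketch: the MD device name, filesystem type and file names are illustrative):

><fs> md-create r5 "/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1" spare:1 level:raid5
><fs> list-md-devices
><fs> mkfs ext4 /dev/md127
><fs> mount /dev/md127 /
><fs> tar-in files.tar /
><fs> rm-rf /files
><fs> fstrim /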

Partitioning a 7 exabyte disk

In the latest nbdkit (and at the time of writing you will need nbdkit from git) you can type this magical incantation:

nbdkit data data="
       @0x1c0 2 0 0xee 0xfe 0xff 0xff 0x01 0  0 0 0xff 0xff 0xff 0xff
       @0x1fe 0x55 0xaa
       @0x200 0x45 0x46 0x49 0x20 0x50 0x41 0x52 0x54
                     0 0 1 0 0x5c 0 0 0
              0x9b 0xe5 0x6a 0xc5 0 0 0 0  1 0 0 0 0 0 0 0
              0xff 0xff 0xff 0xff 0xff 0xff 0x37 0  0x22 0 0 0 0 0 0 0
              0xde 0xff 0xff 0xff 0xff 0xff 0x37 0
                     0x72 0xb6 0x9e 0x0c 0x6b 0x76 0xb0 0x4f
              0xb3 0x94 0xb2 0xf1 0x61 0xec 0xdd 0x3c  2 0 0 0 0 0 0 0
              0x80 0 0 0 0x80 0 0 0  0x79 0x8a 0xd0 0x7e 0 0 0 0
       @0x400 0xaf 0x3d 0xc6 0x0f 0x83 0x84 0x72 0x47
                     0x8e 0x79 0x3d 0x69 0xd8 0x47 0x7d 0xe4
              0xd5 0x19 0x46 0x95 0xe3 0x82 0xa8 0x4c
                     0x95 0x82 0x7a 0xbe 0x1c 0xfc 0x62 0x90
              0x80 0 0 0 0 0 0 0  0x80 0xff 0xff 0xff 0xff 0xff 0x37 0
              0 0 0 0 0 0 0 0  0x70 0 0x31 0 0 0 0 0
       @0x6fffffffffffbe00
              0xaf 0x3d 0xc6 0x0f 0x83 0x84 0x72 0x47
                     0x8e 0x79 0x3d 0x69 0xd8 0x47 0x7d 0xe4
              0xd5 0x19 0x46 0x95 0xe3 0x82 0xa8 0x4c
                     0x95 0x82 0x7a 0xbe 0x1c 0xfc 0x62 0x90
              0x80 0 0 0 0 0 0 0  0x80 0xff 0xff 0xff 0xff 0xff 0x37 0
              0 0 0 0 0 0 0 0  0x70 0 0x31 0 0 0 0 0
       @0x6ffffffffffffe00
              0x45 0x46 0x49 0x20 0x50 0x41 0x52 0x54
                     0 0 1 0 0x5c 0 0 0
              0x6c 0x76 0xa1 0xa0 0 0 0 0
                     0xff 0xff 0xff 0xff 0xff 0xff 0x37 0
              1 0 0 0 0 0 0 0  0x22 0 0 0 0 0 0 0
              0xde 0xff 0xff 0xff 0xff 0xff 0x37 0
                     0x72 0xb6 0x9e 0x0c 0x6b 0x76 0xb0 0x4f
              0xb3 0x94 0xb2 0xf1 0x61 0xec 0xdd 0x3c
                     0xdf 0xff 0xff 0xff 0xff 0xff 0x37 0
              0x80 0 0 0 0x80 0 0 0  0x79 0x8a 0xd0 0x7e 0 0 0 0
" size=7E

When nbdkit starts up you can connect to it in a few ways. If you have a qemu virtual machine running an installed operating system, attach a second NBD drive. On the command line that would look like this:

$ qemu-system-x86_64 ... -drive file=nbd:localhost:10809,if=virtio

Or you could use guestfish:

$ guestfish --format=raw -a nbd://localhost
><fs> run

What this creates is a 7 exabyte disk with a single, empty GPT partition.

7 exabytes is a lot. It’s 8,070,450,532,247,928,832 bytes, or about 7 billion gigabytes. In fact, even with ever-increasing hard disk capacities, it’ll be a very long time before we get exabyte drives.

Peculiar things happen when you try to use this disk in Linux. For sure the kernel has no problem finding the partition, creating a /dev/sda1 device, and returning the right size. Ext4 has a maximum filesystem size of merely 1 exabyte, so mkfs won’t even try to make a filesystem; and on my laptop, trying to write an XFS filesystem on the partition just caused qemu to grind away at 200% CPU, making no apparent progress even after many minutes.

Why not throw your own favourite disk analysis tools at this image and see what they make of it.
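
For instance, with the nbdkit command above still running, guestfish can list the partition (read-only, so nothing is written back to the server):

$ guestfish --ro --format=raw -a nbd://localhost run : part-list /dev/sda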

Finally, how did I create the magic command line above?

I used the nbdkit memory plugin to make an empty 7 EB disk. Note this requires a recent version of the plugin which was rewritten with support for sparse arrays.

$ nbdkit memory size=7E

Then I could connect to it with guestfish to create the GPT partition:

$ guestfish --format=raw -a nbd://localhost
><fs> run
><fs> part-disk /dev/sda gpt

GPT uses a partition table at the beginning and end of the disk. So – still in guestfish – I could sample what the partitioning tool had written to both ends of the disk:

><fs> pread-device /dev/sda 1M 0 | cat > start
><fs> pread-device /dev/sda 1M 8070450532246880256 | cat > end

I then used hexdump plus manual inspection of the output to write the long data string:

$ hexdump -C start
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  02 00 ee fe ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
...

translates to …

@0x1c0 2 0 0xee 0xfe 0xff 0xff 0x01 0  0 0 0xff 0xff 0xff 0xff
@0x1fe 0x55 0xaa

Dockerfile for running libguestfs, virt-tools and virt-v2v

FROM fedora
RUN dnf install -y libguestfs libguestfs-tools-c virt-v2v \
                   libvirt-daemon libvirt-daemon-config-network

# https://bugzilla.redhat.com/show_bug.cgi?id=1045069
RUN useradd -ms /bin/bash v2v
USER v2v
WORKDIR /home/v2v

# This is required for virt-v2v because neither systemd nor
# root libvirtd runs, and therefore there is no virbr0, and
# therefore virt-v2v cannot set up the network through libvirt.
ENV LIBGUESTFS_BACKEND direct
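
To build and try it, something like this works (the tag name is arbitrary; pass /dev/kvm through if you want the libguestfs appliance to use KVM instead of falling back to TCG):

$ docker build -t virt-v2v .
$ docker run --rm -it --device /dev/kvm virt-v2v guestfish --version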

libguestfs for RHEL 7.5 preview

As usual I’ve placed the proposed RHEL 7.5 libguestfs packages in a public repository so you can try them out.

Thanks to Pino Toscano for doing the packaging work.

Great new changes coming to nbdkit

Eric Blake has been doing some great stuff for nbdkit, the flexible plugin-based NBD server.

  • Full parallel request handling.
    You’ve always been able to tell nbdkit that your plugin can handle multiple requests in parallel from a single client, but until now that didn’t actually do anything (only parallel requests from multiple clients worked).
  • An NBD forwarding plugin, so if you have another NBD server which doesn’t support a feature like encryption or the new-style protocol, then you can front that server with nbdkit, which does (see the sketch below).
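
For example, to put TLS in front of an existing server listening on a Unix socket, something along these lines should work (a sketch; the socket and certificate paths are placeholders):

$ nbdkit --tls=require --tls-certificates=/etc/pki/nbdkit \
    nbd socket=/tmp/old-server.sock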

As well as that he’s fixed lots of small bugs with NBD compliance, so hopefully we’re now much closer to the protocol spec (we always check that we interoperate with qemu’s nbd client, but it’s nice to know that we’re also complying with the spec). He also fixed a potential DoS where nbdkit would try to handle very large writes, which would delay a thread in the server indefinitely.

Also this week, I wrote an nbdkit plugin for handling the weird Xen XVA file format. The whole thread is worth reading because 3 people came up with 3 unique solutions to this problem.

Fedora 27 virt-builder images

Fedora 27 has just been released, and I’ve just uploaded virt-builder images so you can try it right away:

$ virt-builder -l | grep fedora-27
fedora-27                aarch64    Fedora® 27 Server (aarch64)
fedora-27                armv7l     Fedora® 27 Server (armv7l)
fedora-27                i686       Fedora® 27 Server (i686)
fedora-27                ppc64      Fedora® 27 Server (ppc64)
fedora-27                ppc64le    Fedora® 27 Server (ppc64le)
fedora-27                x86_64     Fedora® 27 Server
$ virt-builder fedora-27 \
      --root-password password:123456 \
      --install emacs \
      --selinux-relabel \
      --size 30G
$ qemu-system-x86_64 \
      -machine accel=kvm:tcg \
      -cpu host -m 2048 \
      -drive file=fedora-27.img,format=raw,if=virtio &

Tip: Changing the qemu product name in libguestfs

20:30 < koike> Hi. Is it possible to configure the dmi codes for libguestfs? I mean, I am running cloud-init inside a libguestfs session (through python-guestfs) in GCE, the problem is that cloud-init reads /sys/class/dmi/id/product_name to determine if the machine is a GCE machine, but the value it read is Standard PC (i440FX + PIIX, 1996) instead of the expected Google Compute Engine so cloud-init fails.

The answer is yes, using the guestfs_config API that lets you set arbitrary qemu parameters:

g.config('-smbios',
         'type=1,product=Google Compute Engine')
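
The same API is also exposed as the config command in guestfish, so a rough shell equivalent (an untested sketch; the disk image name is illustrative) would be:

$ guestfish -a disk.img
><fs> config -smbios "type=1,product=Google Compute Engine"
><fs> run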

Fedora 26 is out, virt-builder images available

Fedora 26 is released today. virt-builder images are already available for almost all architectures:

$ virt-builder -l | grep fedora-26
fedora-26                aarch64    Fedora® 26 Server (aarch64)
fedora-26                armv7l     Fedora® 26 Server (armv7l)
fedora-26                i686       Fedora® 26 Server (i686)
fedora-26                ppc64      Fedora® 26 Server (ppc64)
fedora-26                ppc64le    Fedora® 26 Server (ppc64le)
fedora-26                x86_64     Fedora® 26 Server

For example:

$ virt-builder fedora-26
$ qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=fedora-26.img,format=raw,if=virtio

Why not s390x? That’s because qemu doesn’t yet emulate enough of the s390x instruction set and architecture for us to run Fedora under TCG emulation.
