Composable tools for disk images

Over the past 3 or 4 years, my colleagues and I at Red Hat have been building a set of composable command line tools for handling virtual machine disk images. These let you copy, create, manipulate, display and modify disk images using simple tools that can be connected together in pipelines, while still working very efficiently. It’s all based around the Network Block Device (NBD) protocol and the NBD URI specification.

A basic and very old tool is qemu-img:

$ qemu-img create -f qcow2 disk.qcow2 1G

which creates an empty disk image in qcow2 format. Suppose you want to write into this image. We can compose a few programs to do it:

$ touch disk.raw
$ nbdfuse disk.raw [ qemu-nbd -f qcow2 disk.qcow2 ] &

This serves the qcow2 file up over NBD (qemu-nbd) and then exposes that as a local file using FUSE (nbdfuse). Of interest here, nbdfuse runs and manages qemu-nbd as a subprocess, cleaning it up when the FUSE file is unmounted. We can partition the file using regular tools:

$ gdisk disk.raw
Command (? for help): n
Partition number (1-128, default 1): 
First sector (34-2097118, default = 2048) or {+-}size{KMGTP}: 
Last sector (2048-2097118, default = 2097118) or {+-}size{KMGTP}: 
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 
Changed type of partition to 'Linux filesystem'
Command (? for help): p
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2097118   1023.0 MiB  8300  Linux filesystem
Command (? for help): w

Let’s fill that partition with some files using guestfish and unmount it:

$ guestfish -a disk.raw run : \
  mkfs ext2 /dev/sda1 : mount /dev/sda1 / : \
  copy-in ~/libnbd /
$ fusermount3 -u disk.raw
[1]+  Done    nbdfuse disk.raw [ qemu-nbd -f qcow2 disk.qcow2 ]

Now the original qcow2 file is no longer empty but populated with a partition, a filesystem and some files. We can see the space used by examining it with virt-df:

$ virt-df -a disk.qcow2 -h
Filesystem                Size   Used  Available  Use%
disk.qcow2:/dev/sda1     1006M    52M       903M    6%

Now let’s see the first sector. You can’t just “cat” a qcow2 file because it’s a complicated format understood only by qemu. I can assemble qemu-nbd, nbdcopy and hexdump into a pipeline, where qemu-nbd converts the qcow2 format to raw blocks, and nbdcopy copies those out to a pipe:

$ nbdcopy -- [ qemu-nbd -r -f qcow2 disk.qcow2 ] - | \
  hexdump -C -n 512
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  02 00 ee 8a 08 82 01 00  00 00 ff ff 1f 00 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200

What if, instead of a local file, we start with a disk image that is hosted on a web server and compressed with xz? We can do that too. Let’s start by querying the size, composing nbdkit’s curl plugin, xz filter and nbdinfo. nbdkit’s --run option composes nbdkit with an external program, connecting them together over an NBD URI ($uri).

$ web=http://mirror.bytemark.co.uk/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20220127.n.0.x86_64.raw.xz
$ nbdkit curl --filter=xz $web --run 'nbdinfo $uri'
protocol: newstyle-fixed without TLS
export="":
	export-size: 5368709120 (5G)
	content: DOS/MBR boot sector, extended partition table (last)
	uri: nbd://localhost:10809/
...

Notice it prints the uncompressed (raw) size. Fedora already provides a qcow2 equivalent, but we can also make our own by composing nbdkit, curl, xz, nbdcopy and qemu-nbd:

$ qemu-img create -f qcow2 cloud.qcow2 5368709120 -o preallocation=metadata
$ nbdkit curl --filter=xz $web \
    --run 'nbdcopy -p -- $uri [ qemu-nbd -f qcow2 cloud.qcow2 ]'

Why would you do that instead of downloading and uncompressing? In this case it wouldn’t matter much, but in the general case the disk image might be enormous (terabytes) and you might not have enough local disk space to hold it. Assembling tools into pipelines means you never need to keep an intermediate local copy.

We can find out what we’ve got in our new image using various tools:

$ qemu-img info cloud.qcow2 
image: cloud.qcow2
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 951 MiB
$ virt-df -a cloud.qcow2  -h
Filesystem              Size       Used  Available  Use%
cloud.qcow2:/dev/sda2   458M        50M       379M   12%
cloud.qcow2:/dev/sda3   100M       9.8M        90M   10%
cloud.qcow2:/dev/sda5   4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/root
                        4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/home
                        4.4G       311M       3.6G    7%
cloud.qcow2:btrfsvol:/dev/sda5/root/var/lib/portables
                        4.4G       311M       3.6G    7%
$ virt-cat -a cloud.qcow2 /etc/redhat-release
Fedora release 36 (Rawhide)

If we wanted to play with the guest in a sandbox, we could stand up an in-memory NBD server populated with the cloud image and connect it to qemu using standard NBD URIs:

$ nbdkit memory 10G
$ qemu-img convert cloud.qcow2 nbd://localhost 
$ virt-customize --format=raw -a nbd://localhost \
    --root-password password:123456 
$ qemu-system-x86_64 -machine accel=kvm \
    -cpu host -m 2048 -serial stdio \
    -drive file=nbd://localhost,if=virtio 
...
fedora login: root
Password: 123456

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0     11:0    1 1024M  0 rom  
zram0  251:0    0  1.9G  0 disk [SWAP]
vda    252:0    0   10G  0 disk 
├─vda1 252:1    0    1M  0 part 
├─vda2 252:2    0  500M  0 part /boot
├─vda3 252:3    0  100M  0 part /boot/efi
├─vda4 252:4    0    4M  0 part 
└─vda5 252:5    0  4.4G  0 part /home
                                /

We can even find out what changed between the in-memory copy and the pristine qcow2 version (quite a lot as it happens):

$ virt-diff --format=raw -a nbd://localhost --format=qcow2 -A cloud.qcow2 
- d 0755       2518 /etc
+ d 0755       2502 /etc
# changed: st_size
- - 0644        208 /etc/.updated
- d 0750        108 /etc/audit
+ d 0750         86 /etc/audit
# changed: st_size
- - 0640         84 /etc/audit/audit.rules
- d 0755         36 /etc/issue.d
+ d 0755          0 /etc/issue.d
# changed: st_size
... for several pages ...

In conclusion, we’ve got a couple of ways to serve disk content over NBD, a set of composable tools for copying, creating, displaying and modifying disk content either from local files or over NBD, and a way to pipe disk data between processes and systems.

We use this in virt-v2v, which can suck VMs out of VMware and into KVM, efficiently, in parallel, and without using local disk space even for the largest guests.
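
For a flavour of what that looks like, here is a sketch of a typical invocation (the vCenter hostname, datacenter path, guest name and output pool are all illustrative):

$ virt-v2v -ic 'vpx://vcenter.example.com/Datacenter/esxi?no_verify=1' \
    -ip passwordfile guestname \
    -o libvirt -os default

The guest’s disks are streamed from VMware and written directly into the target storage, so no intermediate local copy is needed.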


nbdkit 1.24, new data plugin features

nbdkit 1.24 was released on Thursday. It’s our flexible, fast Network Block Device server with loads of features. nbdkit-data-plugin, a plugin that lets you create test patterns from the command line, gained some interesting new functionality:

$ nbdkit data ' ( 0x55 0xAA )*2048 '

This command worked before as a way to create a repeating test pattern in a disk image. A new feature is that you can write a shell script snippet to generate the pattern instead:

$ nbdkit data ' <( while :; do printf "%04x" $((i++)); done ) [:2048] '

This command will create a pattern of characters “0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 3 …” (truncated to 2048 bytes). We could turn that into a block device and display the contents:

# nbd-client localhost /dev/nbd0
# blockdev --getsize64 /dev/nbd0
2048
# dd if=/dev/nbd0 | hexdump -C | head
4+0 records in
4+0 records out
2048 bytes (2.0 kB, 2.0 KiB) copied, 0.000167082 s, 12.3 MB/s
00000000  30 30 30 30 30 30 30 31  30 30 30 32 30 30 30 33  |0000000100020003|
00000010  30 30 30 34 30 30 30 35  30 30 30 36 30 30 30 37  |0004000500060007|
00000020  30 30 30 38 30 30 30 39  30 30 30 61 30 30 30 62  |00080009000a000b|
00000030  30 30 30 63 30 30 30 64  30 30 30 65 30 30 30 66  |000c000d000e000f|
00000040  30 30 31 30 30 30 31 31  30 30 31 32 30 30 31 33  |0010001100120013|
00000050  30 30 31 34 30 30 31 35  30 30 31 36 30 30 31 37  |0014001500160017|
00000060  30 30 31 38 30 30 31 39  30 30 31 61 30 30 31 62  |00180019001a001b|
00000070  30 30 31 63 30 30 31 64  30 30 31 65 30 30 31 66  |001c001d001e001f|
00000080  30 30 32 30 30 30 32 31  30 30 32 32 30 30 32 33  |0020002100220023|
00000090  30 30 32 34 30 30 32 35  30 30 32 36 30 30 32 37  |0024002500260027|
# nbd-client -d /dev/nbd0
# killall nbdkit

The data plugin also lets you read from files, which is useful for making disks with random initial data. For example, here’s how to create a disk with 16 identical sectors of random data (notice how /dev/urandom is read in, truncated to 512 bytes, and then 16 copies are made):

$ nbdkit data ' </dev/urandom[:512]*16 '

The plugin can also create sparse disks. You can do this just by moving the current offset using “@”:

$ nbdkit data ' @32768 1 ' --run 'nbdinfo --map "$uri"'
     0       32768    3  hole,zero
 32768           1    0  allocated

We use this plugin quite extensively when testing libnbd.
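
For instance, a typical test spins up a throwaway server and pokes at it through nbdsh, libnbd’s interactive shell (this is a sketch of the idea rather than an actual test from the libnbd suite; the pattern and the checks are made up):

$ nbdkit -U - data ' 1 2 3 @4096 4 5 6 ' \
    --run 'nbdsh -u "$uri" -c "print(h.get_size())" -c "print(h.pread(3, 0))"'

Here -U - gives nbdkit a private Unix socket, and --run tears the server down again as soon as nbdsh exits.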


nbdkit tar filter

nbdkit is our high-performance, liberally licensed Network Block Device server, and OVA files are a common pseudo-standard for exporting virtual machines, including their disk images.

A .ova file is really an uncompressed tar file:

$ tar tf rhel.ova
rhel.ovf
rhel-disk1.vmdk
rhel.mf

Since tar files usually store their content unmangled, this opens an interesting possibility for reading (or even writing) the embedded disk image without needing to unpack the tar. You just have to work out the offset of the disk image within the tar file. virt-v2v has used this trick for years to avoid making a copy when importing OVAs.
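
GNU tar’s -R (--block-number) option prints the archive block number of each member as it is listed, which is enough to compute the offset. A sketch (each block is 512 bytes; for a simple member without extended headers the file data starts in the block immediately after the listed header block):

$ tar tRvf rhel.ova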

nbdkit has also included a tar plugin which can access a file inside a local tar file, but what if the tar file doesn’t happen to be a local file (e.g. it’s on a web server)? Or what if it’s compressed?

To fix this I’ve turned the plugin into a filter. Using nbdkit-tar-filter you can unpack even non-local compressed tar files:

$ nbdkit curl http://example.com/qcow2.tar.xz \
         --filter=tar --filter=xz tar-entry=disk.qcow2

(To understand how filters are stacked, see my FOSDEM talk from last year.) Because in this example the disk inside the tarball is a qcow2 file, it appears as qcow2 on the wire, so:

$ guestfish --ro --format=qcow2 -a nbd://localhost

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

><fs> run
><fs> list-filesystems 
/dev/sda1: ext2
><fs> mount /dev/sda1 /
><fs> ll /
total 19
drwxr-xr-x   3 root root  1024 Jul  6 20:03 .
drwxr-xr-x  19 root root  4096 Jul  9 11:01 ..
-rw-rw-r--.  1 1000 1000    11 Jul  6 20:03 hello.txt
drwx------   2 root root 12288 Jul  6 20:03 lost+found


Compressed RAM disks

There was a big discussion last week about whether zram swap should be the default in a future version of Fedora.

This led me to think about the RAM disk implementation in nbdkit. Up to nbdkit 1.20 it supported giant virtual disks, up to 8 exabytes, using a sparse array implemented with a 2-level page table. However it’s still a RAM disk, so you can’t actually store more real data in these disks than you have available RAM (plus swap).

But what if we compressed the data? There are some fine, very fast compression libraries around nowadays — I’m using Facebook’s Zstandard — so the overhead of compression can be quite small, and this lets you make limited RAM go further.

So I implemented a choice of allocators for nbdkit ≥ 1.22, including a zstd-compressed one:

$ nbdkit memory 1T allocator=zstd

Compression ratios can be really good. I tested this by creating a RAM disk and filling it with a filesystem containing text and source files, and was getting 10:1 compression. (Note that filesystems start with very regular, easily compressible metadata, so you’d expect this ratio to quickly drop if you filled the filesystem up with a lot of files).

The compression overhead is small, although the current nbdkit-memory-plugin isn’t very smart about locking, so it has rather poor performance under multi-threaded loads anyway. (Fixing that would be a fun little project for someone who loves pthread and C.)

I also implemented allocator=malloc which is a non-sparse direct-mapped RAM disk. This is simpler and a bit faster, but has rather obvious limitations compared to using the sparse allocator.
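
For example (the size here is arbitrary; with this allocator the whole disk is backed by one flat malloc’d buffer, so keep it modest):

$ nbdkit memory 1G allocator=malloc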


libnbd + FUSE = nbdfuse

I’ve talked before about libnbd, our NBD client library. New in libnbd 1.2 is a tool called nbdfuse which lets you turn NBD servers into virtual files.

A couple of weeks ago I mentioned you can use libnbd as a C library to edit qcow2 files. Now you can turn qcow2 files into virtual raw files:

$ mkdir dir
$ nbdfuse dir/file.raw \
      --socket-activation qemu-nbd -f qcow2 file.qcow2
$ ls -l dir/
total 0
-rw-rw-rw-. 1 nbd nbd 1073741824 Jan  1 10:10 file.raw

Reads and writes to file.raw are backed by the original qcow2 file which is updated in real time.
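
For example (a destructive sketch that overwrites the start of the virtual disk; the byte counts are arbitrary), you can watch the qcow2 grow as data is written through the raw view:

$ dd if=/dev/urandom of=dir/file.raw bs=1M count=100 conv=notrunc
$ ls -l file.qcow2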

Another fun thing to do is to use nbdkit, xz filter and curl to turn xz-compressed remote disk images into uncompressed local files:

$ mkdir dir
$ nbdfuse dir/disk.img \
      --command nbdkit -s curl --filter=xz \
                       http://builder.libguestfs.org/fedora-30.xz
$ ls -l dir/
total 0
-rw-rw-rw-. 1 nbd nbd 6442450944 Jan  1 10:10 disk.img
$ file dir/disk.img
dir/disk.img: DOS/MBR boot sector
$ qemu-system-x86_64 -m 4G \
      -drive file=dir/disk.img,format=raw,if=virtio,snapshot=on


nbdkit for loopback pt 7: a slow disk

What happens to filesystems and programs when the disk is slow? You can test this using nbdkit and the delay filter. This command creates a 4G virtual disk in memory and injects a 2 second delay into every read operation:

$ nbdkit --filter=delay memory size=4G rdelay=2

You can loopback mount this as a device on the host:

# modprobe nbd
# nbd-client -b 512 localhost /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 4096MB
Connected /dev/nbd0

Partitioning and formatting is really slow!

# sgdisk -n 1 /dev/nbd0
Creating new GPT entries in memory.
... sits here for about 10 seconds ...
The operation has completed successfully.
# mkfs.ext4 /dev/nbd0p1
mke2fs 1.44.3 (10-July-2018)
waiting ...

Actually I killed it and decided to restart the test with a smaller delay. Since the memory plugin was rewritten to use a sparse array, we’re serializing all requests as an easy way to lock the sparse array data structure. This doesn’t matter normally because requests to the memory plugin are extremely fast, but once you inject delays it means that every request into nbdkit is serialized. Thus, for example, two reads issued in parallel by the kernel are delayed by 2+2 = 4 seconds instead of 2 seconds in total.

However, shutting down the NBD connection reveals likely kernel bugs in the NBD driver:

[74176.112087] block nbd0: NBD_DISCONNECT
[74176.112148] block nbd0: Disconnected due to user request.
[74176.112151] block nbd0: shutting down sockets
[74176.112183] print_req_error: I/O error, dev nbd0, sector 6144
[74176.112252] print_req_error: I/O error, dev nbd0, sector 6144
[74176.112257] Buffer I/O error on dev nbd0p1, logical block 4096, async page read
[74176.112260] Buffer I/O error on dev nbd0p1, logical block 4097, async page read
[74176.112263] Buffer I/O error on dev nbd0p1, logical block 4098, async page read
[74176.112265] Buffer I/O error on dev nbd0p1, logical block 4099, async page read
[74176.112267] Buffer I/O error on dev nbd0p1, logical block 4100, async page read
[74176.112269] Buffer I/O error on dev nbd0p1, logical block 4101, async page read
[74176.112271] Buffer I/O error on dev nbd0p1, logical block 4102, async page read
[74176.112274] Buffer I/O error on dev nbd0p1, logical block 4103, async page read

Note that nbdkit did not return any I/O errors; the connection was simply closed while delayed requests were still in flight. Well, at least our testing is finding bugs!

I tried again with a 500ms delay and using the file plugin which is fully parallel:

$ rm -f temp
$ truncate -s 4G temp
$ nbdkit --filter=delay file file=temp rdelay=500ms

Because of the shorter delay, and because parallel kernel requests are now delayed “in parallel”, I was able to partition and create a filesystem much more easily [same steps as above], and then mount it on a temporary directory:

# mount /dev/nbd0p1 /tmp/mnt

The effect is rather strange, like using an NFS mount from a remote server. Initial file reads are slow, and then they are fast (as they are cached in memory). If you drop Linux caches:

# echo 3 > /proc/sys/vm/drop_caches

then everything becomes slow again.

Confident that parallel requests were being delayed in parallel, I also increased the delay back up to 2 seconds (still using the file plugin). This is like swimming in treacle or what I imagine it would be like to mount an NFS filesystem from the other side of the world over a 56K modem.

I wasn’t able to find any further bugs, but this should be useful for someone who wants to test this kind of thing.


Mapping files to disk

Wouldn’t it be cool if you could watch a virtual machine booting, and at the same time see what files it is accessing on disk:

reading /dev/sda1 master boot record
reading /dev/sda1 /grub2/i386-pc/boot.img
reading /dev/sda1 /grub2/i386-pc/ext2.mod
reading /dev/sda1 /vmlinuz
...

You can already observe what disk blocks it is accessing pretty easily. There are several methods, but a quick one would be to use nbdkit’s file plugin with the -f -v flags (foreground and verbose). The problem is how to map disk blocks to the files and other interesting objects that exist in the disk image.
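
For example, something along these lines (a sketch; -f keeps nbdkit in the foreground and -v makes it print every request, and the exact debug output format varies between nbdkit versions):

$ nbdkit -f -v file file=fedora-20.img

and in another terminal, boot a guest from it over NBD and watch the pread lines scroll past:

$ qemu-system-x86_64 -m 2048 \
    -drive file=nbd://localhost,format=raw,if=virtio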

How do you map between files and disk blocks? For simple filesystems like ext4 you can use the FIBMAP ioctl, and perhaps adjust the answer by adding the offset of the start of the partition. However as you get further into the boot process you’ll probably encounter complexities like LVM. There may not even be a 1-1 mapping, since RAID means that multiple blocks can store a single file block, and tail packing and deduplication mean that a block can belong to multiple files. And of course there are things other than plain files: directories, swap partitions, master boot records, and boot loaders that live in and between filesystems.
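
For a flavour of the simple case, filefrag will print the physical extents of a file relative to the start of its filesystem, and the partition’s start offset is available from sysfs. A sketch (paths are illustrative; note that filefrag reports offsets in filesystem blocks, typically 4096 bytes, while the partition start is given in 512-byte sectors):

$ filefrag -v /boot/vmlinuz-$(uname -r)
$ cat /sys/class/block/sda1/start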

To solve this I have written a tool called virt-bmap. It takes a disk image and outputs a block map. To do this it uses libguestfs (patched) to control an nbdkit instance, reading each file and recording what blocks in the disk image are accessed. (It sounds complicated, but virt-bmap wraps it up in a simple command line tool.) The beauty of this is that the kernel takes care of the mapping for us, and it works no matter how many layers of filesystem/LVM/RAID are between the file and the underlying device. This doesn’t quite solve the “RAID problem” since the RAID layers in Linux are free to only read a single copy of the file, but is generally accurate for everything else.

$ virt-bmap fedora-20.img
virt-bmap: examining /dev/sda1 ...
virt-bmap: examining /dev/sda2 ...
virt-bmap: examining /dev/sda3 ...
virt-bmap: examining filesystem on /dev/sda1 (ext4) ...
virt-bmap: examining filesystem on /dev/sda3 (ext4) ...
virt-bmap: writing /home/rjones/d/virt-bmap/bmap
virt-bmap: successfully examined 3 partitions, 0 logical volumes,
           2 filesystems, 3346 directories, 20585 files
virt-bmap: output written to /home/rjones/d/virt-bmap/bmap

The output bmap file is a straightforward map from disk byte offset to file / files / object occupying that space:

1 541000 541400 d /dev/sda1 /
1 541400 544400 d /dev/sda1 /lost+found
1 941000 941400 f /dev/sda1 /.vmlinuz-3.11.10-301.fc20.x86_64.hmac
1 941400 961800 f /dev/sda1 /config-3.11.10-301.fc20.x86_64
1 961800 995400 f /dev/sda1 /initrd-plymouth.img
1 b00400 ef1c00 f /dev/sda1 /grub2/themes/system/background.png
1 f00400 12f1c00 f /dev/sda1 /grub2/themes/system/fireworks.png
1 1300400 1590400 f /dev/sda1 /System.map-3.11.10-301.fc20.x86_64

[The 1 that appears in the first column means “first disk”. Unfortunately virt-bmap can only map single disk virtual machines at present.]

The second part of this, which I’m still writing, will be another nbdkit plugin which takes these maps and produces a nice log of accesses as the machine boots.


New tool: virt-builder

New in libguestfs 1.24 will be a simple tool called virt-builder. This builds virtual machines of various free operating systems quickly and securely:

$ virt-builder fedora-19 --size 20G --install nmap
[     0.0] Downloading: http://libguestfs.org/download/builder/fedora-19.xz
[     2.0] Uncompressing: http://libguestfs.org/download/builder/fedora-19.xz
[    25.0] Running virt-resize to expand the disk to 20.0G
[    74.0] Opening the new disk
[    78.0] Random root password: RCuMKJ4NPak0ptJQ [did you mean to use --root-password?]
[    78.0] Installing packages: nmap
[    93.0] Finishing off

Some notable features (a combined example follows the list):

  • Fast: As you can see above, once it has downloaded and cached the template the first time, it can churn out new guests in around 90 seconds.
  • Install packages.
  • Set the hostname.
  • Generate a random seed for the guest.
  • Upload files.
  • Set passwords, create user accounts.
  • Run custom scripts.
  • Install firstboot scripts.
  • Fetch packages from private repos and ISOs.
  • Secure: Everything is assembled in a container (using SELinux if available).
  • Guest templates are PGP-signed.
  • No root or privileged access needed at all (no setuid, no sudo).
  • Fully scriptable.
  • Can be used in locked-down no-network scenarios.
  • Can use UML as a backend (good for use in a cloud).
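
Several of these can be combined in a single invocation. A sketch (the package names, uploaded file and firstboot command are made up for illustration; option spellings are as in the virt-builder documentation):

$ virt-builder fedora-19 --size 20G \
    --hostname test1.example.com \
    --root-password password:123456 \
    --install nmap,tcpdump \
    --upload motd:/etc/motd \
    --firstboot-command 'systemctl enable sshd'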


Tip: Pack files into a new disk image

Note: This requires libguestfs ≥ 1.4

If files/ is a directory containing some files, you can create a new disk image populated with those files using the command below (virt-make-fs can also be used for this):

$ guestfish -N fs:ext2:400M -m /dev/sda1 tar-in <(tar -C files/ -cf - .) /

Explanation:

  1. guestfish -N fs:ext2:400M creates a prepared disk image which is 400MB in size with a single partition formatted as ext2. The new disk image is called test1.img.
  2. -m /dev/sda1 mounts the new filesystem when guestfish starts up.
  3. tar-in [...] / uploads and unpacks the tar file into the root directory of the mounted disk image.
  4. <(tar -C files/ -cf - .) is a bash process substitution which runs the tar command and supplies the output (i.e. the tar file) as the input on the command line.
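
As mentioned above, virt-make-fs can do the same job in a single command. A sketch (options as documented in libguestfs; --partition adds a partition table so the result has a /dev/sda1 like the guestfish version, and the size and filesystem type can be adjusted):

$ virt-make-fs --partition --type=ext2 --size=400M files/ test1.img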

This is a quick way to create “appliances” using febootstrap and libguestfs, although note that I don’t think these appliances would really work; I just use them for testing our virtualization management tools, like the ability to detect and manage disk images:

$ febootstrap -i fedora-release -i bash -i setup -i coreutils fedora-13 f13
$ echo '/dev/sda1 / ext2 defaults 1 1' > fstab
$ febootstrap-install f13 fstab /etc 0644 root.root
$ febootstrap-minimize f13
$ guestfish -N fs:ext2:40M -m /dev/sda1 tar-in <(tar -C f13 -cf - .) /
$ guestfish --ro -i -a test1.img

Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.

Type: 'help' for a list of commands
      'man' to read the manual
      'quit' to quit the shell

Operating system: Fedora release 13 (Goddard)
/dev/vda1 mounted on /
><fs> cat /etc/fstab
/dev/sda1 / ext2 defaults 1 1
