Tag Archives: guestfish

Creating a modifiable gzipped disk image

Dusty Mabe set me a challenge yesterday. He wants to create several compressed disk images that have slightly different content, but are otherwise mostly the same. The disk images are large and compressing them takes a long time (30 minutes each, apparently), so ideally what we’d want to do is compress the disk image just the once and then do the updates on the gzipped image.

Modifying a file which has already been compressed is not usually possible.

However, if we make some relatively uncontroversial assumptions and accept a few limitations, then we can create a compressed disk image which is modifiable in this way, certainly for gzip and xz (I still need to investigate zstd).

Firstly, let’s assume we’re using some kind of block-based compression with fixed-size blocks. This is already true for gzip and xz. Secondly, let’s assume that we only want to modify a single, small partition of the image (Dusty only wants to modify the /boot partition). Lastly, we’ll assume that the partition boundaries are aligned to the compression blocks. Since partition placement is under the control of whoever creates the disk image, this last one is pretty easy to arrange.

The trick is to use uncompressed blocks for the part of the input covering the partition you want to modify, and compress the rest of the file normally. Both gzip and xz have an uncompressed block type. (In fact, just about any reasonable compression algorithm works like this – if the input data doesn’t become smaller after trying to compress it, the algorithm will emit the data uncompressed since that takes less space.)
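You can see this fallback for yourself by compressing data that won’t compress. The output is only fractionally larger than the input because almost everything ends up in stored (uncompressed) blocks:

$ head -c 1M /dev/urandom > random.bin
$ gzip -9 -k random.bin
$ ls -l random.bin random.bin.gz    # the .gz is only a few hundred bytes larger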

Normal tools won’t let you create files like this, but I wrote one for gzip here, and I’m confident that one could be written for xz (exercise for readers! … or me if we decide to productise this).

I created a normal Fedora 36 image using virt-builder and used guestfish to find the partition boundaries:

$ guestfish --ro -a /var/tmp/fedora-36.img -i part-list /dev/sda
[0] = {
  part_num: 1
  part_start: 1048576
  part_end: 2097151
  part_size: 1048576
}
[1] = {
  part_num: 2
  part_start: 2097152
  part_end: 1075838975
  part_size: 1073741824
}
[2] = {
  part_num: 3
  part_start: 1075838976
  part_end: 6441402367
  part_size: 5365563392
}

We will compress partition #3 while leaving partitions #1 and #2 uncompressed so they can be modified in place later. The tool is a bit crude, but:

$ ./partial-deflate /var/tmp/fedora-36.img /var/tmp/fedora-36.img.gz 0 1075838976 6441402368 

The output is a regular gzip file (albeit rather large because the first 1GB is uncompressed – if I was doing this for real I’d make that boot partition smaller):

$ ll /var/tmp/fedora-36.img*
-rw-r--r--. 1 rjones rjones 6442450944 Nov 30 17:03 /var/tmp/fedora-36.img
-rw-r--r--. 1 rjones rjones 1698849910 Dec  1 16:23 /var/tmp/fedora-36.img.gz
$ zcat /var/tmp/fedora-36.img.gz | md5sum 
57cbd5ebcbe59613378c8aee7ad9e40d  -
$ md5sum /var/tmp/fedora-36.img
57cbd5ebcbe59613378c8aee7ad9e40d  /var/tmp/fedora-36.img

Then I went ahead and modified some known content inside the gzip file (in the uncompressed region). I used a hex editor for this, but you could play around with guestfish + nbdkit-offset-filter for something more supportable.
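If a hex editor is too raw, here is a rough dd-based equivalent (a sketch only: the offset shown is hypothetical, and remember that the partition data inside the .gz is shifted by the gzip header and by the 5-byte header in front of each stored block, so you have to locate the bytes first, for example by searching for a known string):

$ grep -obaF 'some known string' /var/tmp/fedora-36.img.gz
1084794880:some known string    # hypothetical offset within the .gz
$ printf 'replacement text' | \
    dd of=/var/tmp/fedora-36.img.gz bs=1 seek=1084794880 conv=notrunc
    # the replacement must be exactly the same length as the original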

And the result is a gzipped file with the modifications:

$ zcat /var/tmp/fedora-36.img.gz > /var/tmp/fedora-36.img-modified
gzip: /var/tmp/fedora-36.img.gz: invalid compressed data--crc error
$ guestfish --ro -a /var/tmp/fedora-36.img-modified -i cat /boot/mydata.txt 
helloworld------

…and a CRC error. That’s to be expected, as I haven’t yet worked out how to update CRC-32 after making changes, but it should be easy to solve (with brute force if necessary).


NBD graphical viewer – RAID 5 edition

If you saw my posting from two days ago you’ll know I’m working on visualizing what happens on block devices when you perform various operations. Last time we covered basics like partitioning a disk, creating a filesystem, creating files, and fstrim.

This time I’ve tied together 5 of the nbdcanvas widgets into a bigger Tcl application that can show what’s happening on a RAID 5 disk set. As with the last posting there’s a video followed by commentary on what happens at each step.

  • 00:00: I start guestfish connected to all 5 nbdkit servers. Also of note I’ve added raid456.devices_handle_discard_safely=1 to the appliance kernel command line, which is required for discards to work through MD RAID devices (I didn’t know that before yesterday).
  • 00:02: When the appliance starts up, the black flashes show the kernel probing for possible partitions or filesystems. The disks are blank so nothing is found.
  • 00:16: As in the previous post I’m partitioning the disks using GPT. Each ends up with a partition table at the start and end of the disk (two red blocks of pixels).
  • 00:51: Now I use mdadm --create (via the guestfish md-create command) to make a RAID 5 array across the first 4 disks. The 4th disk is the parity disk — you can see disks 1 through 3 being scanned and the parity information being written to the 4th disk. The 5th disk is a hot spare. Notice how the scanning continues after the mdadm command has returned. In real arrays this can go on for hours or days.
  • 01:11: I create a filesystem. The first action that mkfs performs is discarding previous data (indicated by light purple). Notice that the parity data is also discarded, which surprised me, but does make sense.
  • 01:27: The RAID array is mounted and I unpack a tarball into it.
  • 01:40: I delete the files and fstrim, which discards the underlying blocks again.
  • 01:48: Now I’m going to inject errors at the block layer into the 3rd disk. The Error checkbox in the Tcl widget simply creates a file. We’re using the nbdkit error filter which monitors for the named file and when it is created starts injecting errors into any read or write operation. Almost immediately the RAID array notices the damage and starts rebuilding on to the hot spare. Notice the black flashes where it reads the working disks (including old parity disk) to construct the redundant information on the spare.
  • 01:55: While reconstruction is under way, the RAID array can be used normally.
  • 02:14: Examining /proc/mdstat shows that the third disk has been marked failed.
  • 02:24: Now I’m going to inject errors into the 4th disk as well. This RAID array can survive this, operating in a “degraded state”, but there is no more redundancy.
  • 02:46: Finally we can examine the kernel messages which show that the RAID array is continuing on 3 devices.

In case you want to reproduce the results yourself, the full command to run nbdkit (repeated 5 times) is:

$ rm /tmp/sock1 /tmp/error1
$ ./nbdkit -fv -U /tmp/sock1 \
    --filter=error --filter=log --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log1 \
    rdelay=40ms wdelay=40ms \
    error-rate=100% error-file=/tmp/error1

And the nbdraid viewing program:

$ ./nbdraid.tcl 5 $((64*1024*1024)) /tmp/log%d /tmp/error%d
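To trigger the disk failures by hand rather than from the GUI, note that the Error checkbox simply creates the error-file named above, so from a shell you can do the same thing:

$ touch /tmp/error3    # the error filter starts failing every I/O on the 3rd disk
$ rm /tmp/error3       # remove the file to stop injecting errors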


NBD graphical viewer

Ever wondered what is really happening when you write to a disk? What blocks the filesystem writes to and so on? With our flexible, plug-in based NBD server called nbdkit and a little Tcl/Tk program I wrote, you can now visualise this.

As in this video (MP4)

… which shows me opening a blank disk, partitioning it, creating an ext4 filesystem and writing some files.

There’s a lot going on in this video, which I’ll explain below. But first, note that each pixel corresponds to a 4K block on disk — the total disk size is 64M, which is 128×128 pixels, and each row is therefore half a megabyte. Red pixels are writes. Black flashing pixels show reads. Light purple is for trim requests, and white pixels are zero requests.
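A quick sanity check of that mapping:

$ echo $(( 64*1024*1024 / 4096 )) $(( 128*128 ))
16384 16384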

nbdkit was run with the following command line:

$ nbdkit -fv \
    --filter=log \
    --filter=delay \
    memory size=$((64*1024*1024)) \
    logfile=/tmp/log \
    rdelay=40ms wdelay=40ms

This means that we’re using the memory plugin to create a throwaway blank disk of 64M. In front of this plugin we place two filters: the delay filter delays all reads and writes by 40ms, which makes it easier to see what’s going on, and the log filter records all requests in a log file (/tmp/log).

The log file is what the second command reads asynchronously to generate the graphical image:

$ ./nbdview.tcl /tmp/log $((64*1024*1024))

So to the video (MP4):

  • 00:07: I start guestfish connected to the NBD server. This boots a Linux appliance, and you can see from the flashes of black how the Linux kernel probes the disk every which way to try to detect any kind of partition or filesystem signature. (Recall that I’m intentionally delaying all read requests which is why the appliance boot and probing seems to take so long. In reality these probes happen near instantaneously.) Of course the disk is all zeroes at this point, so nothing is found.
  • 00:23: I partition the disk using GPT. The partitioning is done under the hood by GNU parted and as you can see there is a considerable amount of probing going on by both parted and then the kernel still looking for filesystem signatures. Eventually we end up with two blocks of red (written) data, because GPT creates both a primary and secondary partition table at the beginning and end of the disk.
  • 00:36: I create an ext4 filesystem inside the partition. After even more probing by mkfs the first major operation is to trim/discard all data on the disk (shown by the disk filling up with light purple). Then mkfs writes a large block of data in the middle of the disk which I believe is the journal, followed by four dots which I believe could be backup superblocks.
  • 00:48: Interestingly filesystem creation has not finished. ext4 (as well as other modern filesystems) defer a lot of work to the kernel, and this is obvious when I mount the disk. Notice that a few seconds after the mount (around 00:59) the kernel starts zeroing parts of the disk. I believe this is the inode table and block free bitmap for the first block group. For larger disks this lazy initialization could go on for a long time.
  • 01:05: I unpack a tarball into the filesystem. As expected the operation finishes almost instantaneously, and nothing is actually written to disk. However issuing an explicit sync at 01:11 causes the files and directories to be written, filling first the data blocks and then the inodes and block free bitmap (is there a reason these are written last, or is it just coincidence? Also does the Linux page cache retain the order that the filesystem wrote the blocks?)
  • 01:18: I delete the directory tree I just created. As you’d expect nothing is written to disk, and even after a sync nothing much changes.
  • 01:26: In contrast when I fstrim the filesystem, all the now-deleted data blocks are discarded (light purple). This is the same principle which virt-sparsify --in-place uses to make a disk image sparse.
  • 01:32: Finally after unmounting the filesystem I issue a blkdiscard command which throws the whole thing away. Even after this Linux is probing the partition to see if somehow a filesystem signature could be present.

You can easily reproduce this and similar results yourself using nbdkit and nbdview, and I’ve submitted a talk to FOSDEM about this and much more fun you can have with nbdkit.
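If you want to drive the same steps yourself, the guestfish session in the video corresponds roughly to the following (a sketch: the tarball path and the directory name are stand-ins for whatever you unpack):

$ guestfish --format=raw -a nbd://localhost
><fs> run
><fs> part-disk /dev/sda gpt
><fs> mkfs ext4 /dev/sda1
><fs> mount /dev/sda1 /
><fs> tar-in /tmp/files.tar /
><fs> sync
><fs> rm-rf /files
><fs> fstrim /
><fs> umount /
><fs> blkdiscard /dev/sda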


New in nbdkit: Create a virtual floppy disk

nbdkit is our flexible, plug-in based Network Block Device server.

While I was visiting the KVM Forum last week, one of the most respected members of the QEMU development team mentioned to me that he wanted to think about deprecating QEMU’s VVFAT driver. This QEMU driver is a bit of an oddity — it lets you point QEMU to a directory of files, and inside the guest it will see a virtual floppy containing those files:

$ qemu -drive file=fat:/some/directory

That’s not the odd thing. The odd thing is that it also lets you make the drive writable, and the VVFAT driver then turns those writes back into modifications to the host filesystem (remember that these are writes to raw FAT32 data structures, so the driver has to infer, just from seeing the writes, what is happening at the filesystem level). Which is both amazing and crazy (and also buggy).

Anyway I have implemented the read-only part of this in nbdkit. I didn’t implement the write stuff because that’s very ambitious, although if you were going to implement that, doing it in nbdkit would be better than qemu since the only thing that can crash is nbdkit, not the whole hypervisor.

Usage is very simple:

$ nbdkit floppy /some/directory

This gives you an NBD source which you can connect straight to a qemu virtual machine:

$ qemu -drive nbd:localhost:10809

or examine with guestfish:

$ guestfish --ro --format=raw -a nbd://localhost -m /dev/sda1
Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

> ll /
total 2420
drwxr-xr-x 14 root root  16384 Jan  1  1970 .
drwxr-xr-x 19 root root   4096 Oct 28 10:07 ..
-rwxr-xr-x  1 root root     40 Sep 17 21:23 .dir-locals.el
-rwxr-xr-x  1 root root    879 Oct 27 21:10 .gdb_history
drwxr-xr-x  8 root root  16384 Oct 28 10:05 .git
-rwxr-xr-x  1 root root   1383 Sep 17 21:23 .gitignore
-rwxr-xr-x  1 root root   1453 Sep 17 21:23 LICENSE
-rwxr-xr-x  1 root root  34182 Oct 28 10:04 Makefile
-rwxr-xr-x  1 root root   2568 Oct 27 22:17 Makefile.am
-rwxr-xr-x  1 root root  32085 Oct 27 22:18 Makefile.in
-rwxr-xr-x  1 root root    620 Sep 17 21:23 OTHER_PLUGINS
-rwxr-xr-x  1 root root   4628 Oct 16 22:36 README
-rwxr-xr-x  1 root root   4007 Sep 17 21:23 TODO
-rwxr-xr-x  1 root root  54733 Oct 27 22:18 aclocal.m4
drwxr-xr-x  2 root root  16384 Oct 27 22:18 autom4te.cache
drwxr-xr-x  2 root root  16384 Oct 28 10:04 bash
drwxr-xr-x  5 root root  16384 Oct 27 18:07 common
[etc]

Previously … create ISO images on the fly in nbdkit


Partitioning a 7 exabyte disk

In the latest nbdkit (and at the time of writing you will need nbdkit from git) you can type this magical incantation:

nbdkit data data="
       @0x1c0 2 0 0xee 0xfe 0xff 0xff 0x01 0  0 0 0xff 0xff 0xff 0xff
       @0x1fe 0x55 0xaa
       @0x200 0x45 0x46 0x49 0x20 0x50 0x41 0x52 0x54
                     0 0 1 0 0x5c 0 0 0
              0x9b 0xe5 0x6a 0xc5 0 0 0 0  1 0 0 0 0 0 0 0
              0xff 0xff 0xff 0xff 0xff 0xff 0x37 0  0x22 0 0 0 0 0 0 0
              0xde 0xff 0xff 0xff 0xff 0xff 0x37 0
                     0x72 0xb6 0x9e 0x0c 0x6b 0x76 0xb0 0x4f
              0xb3 0x94 0xb2 0xf1 0x61 0xec 0xdd 0x3c  2 0 0 0 0 0 0 0
              0x80 0 0 0 0x80 0 0 0  0x79 0x8a 0xd0 0x7e 0 0 0 0
       @0x400 0xaf 0x3d 0xc6 0x0f 0x83 0x84 0x72 0x47
                     0x8e 0x79 0x3d 0x69 0xd8 0x47 0x7d 0xe4
              0xd5 0x19 0x46 0x95 0xe3 0x82 0xa8 0x4c
                     0x95 0x82 0x7a 0xbe 0x1c 0xfc 0x62 0x90
              0x80 0 0 0 0 0 0 0  0x80 0xff 0xff 0xff 0xff 0xff 0x37 0
              0 0 0 0 0 0 0 0  0x70 0 0x31 0 0 0 0 0
       @0x6fffffffffffbe00
              0xaf 0x3d 0xc6 0x0f 0x83 0x84 0x72 0x47
                     0x8e 0x79 0x3d 0x69 0xd8 0x47 0x7d 0xe4
              0xd5 0x19 0x46 0x95 0xe3 0x82 0xa8 0x4c
                     0x95 0x82 0x7a 0xbe 0x1c 0xfc 0x62 0x90
              0x80 0 0 0 0 0 0 0  0x80 0xff 0xff 0xff 0xff 0xff 0x37 0
              0 0 0 0 0 0 0 0  0x70 0 0x31 0 0 0 0 0
       @0x6ffffffffffffe00
              0x45 0x46 0x49 0x20 0x50 0x41 0x52 0x54
                     0 0 1 0 0x5c 0 0 0
              0x6c 0x76 0xa1 0xa0 0 0 0 0
                     0xff 0xff 0xff 0xff 0xff 0xff 0x37 0
              1 0 0 0 0 0 0 0  0x22 0 0 0 0 0 0 0
              0xde 0xff 0xff 0xff 0xff 0xff 0x37 0
                     0x72 0xb6 0x9e 0x0c 0x6b 0x76 0xb0 0x4f
              0xb3 0x94 0xb2 0xf1 0x61 0xec 0xdd 0x3c
                     0xdf 0xff 0xff 0xff 0xff 0xff 0x37 0
              0x80 0 0 0 0x80 0 0 0  0x79 0x8a 0xd0 0x7e 0 0 0 0
" size=7E

When nbdkit starts up you can connect to it in a few ways. If you have a qemu virtual machine running an installed operating system, attach a second NBD drive. On the command line that would look like this:

$ qemu-system-x86_64 ... -drive file=nbd:localhost:10809,if=virtio

Or you could use guestfish:

$ guestfish --format=raw -a nbd://localhost
><fs> run

What this creates is a 7 exabyte disk with a single, empty GPT partition.

7 exabytes is a lot. It’s 8,070,450,532,247,928,832 bytes, or about 7 billion gigabytes. In fact even with ever increasing storage capacities in hard disk drives it’ll be a very long time before we get exabyte drives.

Peculiar things happen when you try to use this disk in Linux. For sure the kernel has no problem finding the partition, creating a /dev/sda1 device, and returning the right size. Ext4 has a maximum filesystem size of merely 1 exabyte so it won’t even try to make a filesystem, and on my laptop trying to write an XFS filesystem on the partition just caused qemu to grind away at 200% CPU making no apparent progress even after many minutes.

Why not throw your own favourite disk analysis tools at this image and see what they make of it?

Finally how did I create the magic command line above?

I used the nbdkit memory plugin to make an empty 7 EB disk. Note this requires a recent version of the plugin which was rewritten with support for sparse arrays.

$ nbdkit memory size=7E

Then I could connect to it with guestfish to create the GPT partition:

$ guestfish --format=raw -a nbd://localhost
><fs> run
><fs> part-disk /dev/sda gpt

GPT uses a partition table at the beginning and end of the disk. So – still in guestfish – I could sample what the partitioning tool had written to both ends of the disk:

><fs> pread-device /dev/sda 1M 0 | cat > start
><fs> pread-device /dev/sda 1M 8070450532246880256 | cat > end

I then used hexdump + manual inspection of the hexdump output to write the long data string:

$ hexdump -C start
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001c0  02 00 ee fe ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
...

translates to …

@0x1c0 2 0 0xee 0xfe 0xff 0xff 0x01 0  0 0 0xff 0xff 0xff 0xff
@0x1fe 0x55 0xaa
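Those two very large offsets near the end of the data string are just the standard backup GPT locations: the backup header lives in the last sector of the disk, and the backup partition entries start 33 sectors from the end. A little arithmetic confirms this for a 7 EiB disk:

$ printf '0x%x\n' $(( 7 * 2**60 - 512 ))        # last sector: backup GPT header
0x6ffffffffffffe00
$ printf '0x%x\n' $(( 7 * 2**60 - 33 * 512 ))   # backup partition entries
0x6fffffffffffbe00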


Tip: Read guest disks from VMware vCenter using libguestfs

virt-v2v can import guests directly from vCenter. It uses all sorts of tricks to make this fast and efficient, but the basic technique uses plain https range requests.

Making it all work was not so easy and involved a lot of experimentation and bug fixing, and I don’t think it has been documented up to now. So this post describes how we do it. As usual the code is the ultimate repository of our knowledge so you may want to consult that after reading this introduction.

Note this is read-only access. Write access is possible, but you’ll have to use ssh instead.

VMware ESXi hypervisor has a web server but doesn’t support range requests, so although you can download an entire disk image in one go from the ESXi hypervisor, to random-access the image using libguestfs you will need VMware vCenter. You should check that virsh dumpxml works against your vCenter instance by following these instructions. If that doesn’t work, it’s unlikely the rest of the instructions will work.

You will need to know:

  1. The hostname or IP address of your vCenter server,
  2. the username and password for vCenter,
  3. the name of your datacenter (probably Datacenter),
  4. the name of the datastore containing your guest (could be datastore1),
  5. … and of course the name of your guest.

Tricky step 1 is to construct the vCenter https URL of your guest.

This looks like:

https://root:password@vcenter/folder/guest/guest-flat.vmdk?dcPath=Datacenter&dsName=datastore1

where:

  • root:password: the username and password
  • vcenter: the vCenter hostname or IP address
  • guest: the guest name (repeated twice)
  • Datacenter: the datacenter name
  • datastore1: the datastore name

Once you’ve got a URL that looks right, try to fetch the headers using curl. This step is important, not just because it checks that the URL is good, but because it gets us a cookie; without the cookie, vCenter will break under the load when we start to access it for real.

$ curl --insecure -I https://....
HTTP/1.1 200 OK
Date: Wed, 5 Nov 2014 19:38:32 GMT
Set-Cookie: vmware_soap_session="52a3a513-7fba-ef0e-5b36-c18d88d71b14"; Path=/; HttpOnly; Secure; 
Accept-Ranges: bytes
Connection: Keep-Alive
Content-Type: application/octet-stream
Content-Length: 8589934592

The cookie is the vmware_soap_session=... part including the quotes.
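If you prefer not to copy the cookie by hand, something like this captures it into a shell variable (a sketch; $url stands for the vCenter URL constructed above):

$ cookie=$(curl --insecure -sI "$url" | tr -d '\r' |
           sed -n 's/^Set-Cookie: \(vmware_soap_session="[^"]*"\).*/\1/p')
$ echo "$cookie"
vmware_soap_session="52a3a513-7fba-ef0e-5b36-c18d88d71b14"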

Now let’s make a qcow2 overlay which encodes our https URL and the cookie as the backing file. This requires a reasonably recent qemu, probably 2.1 or above.

$ qemu-img create -f qcow2 /tmp/overlay.qcow2 \
    -b 'json: { "file.driver":"https",
                "file.url":"https://..",
                "file.cookie":"vmware_soap_session=\"...\"",
                "file.sslverify":"off",
                "file.timeout":1000 }'

You don’t need to include the password in the URL here, since the cookie acts as your authentication. You might also want to play with the "file.readahead" parameter. We found it makes a big difference to throughput.

Now you can open the overlay file in guestfish as usual:

$ export LIBGUESTFS_BACKEND=direct
$ guestfish
><fs> add /tmp/overlay.qcow2 copyonread:true
><fs> run
><fs> list-filesystems
/dev/sda1: ext4
><fs> mount /dev/sda1 /

and so on.


Making a bootable CD-ROM/ISO from virt-builder

virt-builder can throw out new virtual machines with existing operating systems in a few seconds, and you can also write these directly to a USB key or hard disk:

# virt-builder fedora-20 -o /dev/sdX

What you’ve not been able to do is create a bootable CD-ROM or ISO image.

For that I was using the awful livecd-creator program. This needs root and is incredibly fragile. You can have a kickstart that works one day but not the next, and it requires massive hacks to get working … which is the exact reason why I set off to find out how to make virt-builder create ISOs.

Read-only

The background as to why this is difficult: CDs are not writable.

You can take all the files from a Fedora guest built by virt-builder and turn them into an ISO, and put ISOLINUX on it, but such a guest would not be able to boot, or at least, it would fail the first time it tried to write to the disk. One day overlayfs (which just went upstream a few days ago) will solve this, but until that is widely available in upstream kernels, we’re going to need something that creates a writable overlay at boot time.

Boot Time

I have chosen dracut (another tool I have a love/hate, mainly hate, relationship with), which has a useful module called dmsquash-live. This implements the boot side of making a live CD writable, for Fedora and RHEL. It’s what livecd-creator uses.

dmsquash-live demands a very particular ISO layout, but it wasn’t hard to reverse engineer it by reading the code carefully and a lot of trial and error.

It requires that we have a filesystem containing a squashfs in a particular location on the CD:

/LiveOS/squashfs.img

That squashfs has to contain inside it a disk image with this precise name:

/LiveOS/rootfs.img

and the disk image is the root filesystem.
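Putting those requirements together, the layout dmsquash-live expects looks like this:

CD (ISO 9660)
└── LiveOS/
    └── squashfs.img          squashfs image, which itself contains:
        └── LiveOS/
            └── rootfs.img    the root filesystem (an ext3/4 disk image)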

The Script

The script below creates all of this, and effectively replaces livecd-creator with something manageable that doesn’t require root, and is only 100 lines of shell (take that OO/Python!)

Update: Kashyap notes that the script will fail if you’re using tmp-on-tmpfs, so you might need to disable that or modify the script to use /var/tmp instead.

Once you’ve run the script you can try booting the image using:

$ qemu-kvm -m 2048 -cdrom boot.iso -boot d

The Future

One improvement to this script would be to remove the dependency on dmsquash-live. We don’t need the baroque complexity of dmsquash-live, and could write a custom dracut module (perhaps even a tiny self-contained initramfs) which would do what we need. It could even use overlayfs to simplify things greatly.

#!/bin/bash -

set -e

# Make bootable ISO from virt-builder
# image.
#
# This requires the Fedora
# squashfs/rootfs machinery.  See:
# /lib/dracut/modules.d/90dmsquash-live/dmsquash-live-root.sh

cd /tmp

# Build the regular disk image, but also
# build a special initramfs which has
# the dmsquash-live & pollcdrom modules
# enabled.  We also need to kill SELinux
# relabelling, and hence SELinux.
cat > postinstall <<'EOF'
#!/bin/bash -
version=` rpm -q kernel | sort -rV | head -1 | sed 's/kernel-//' `
echo installed kernel version: $version
dracut --no-hostonly --add "dmsquash-live pollcdrom" /boot/initrd0 $version
EOF

virt-builder fedora-20 \
    --install kernel \
    --root-password password:123456 \
    --edit '/etc/selinux/config:
        s/SELINUX=enforcing/SELINUX=disabled/' \
    --delete /.autorelabel \
    --run postinstall

# Extract the root filesystem (as an ext3/4 disk image).
guestfish --progress-bars --ro -a fedora-20.img -i \
    download /dev/sda3 rootfs.img

# Update /etc/fstab in the rootfs (but NOT in the original guest)
# so it works for the CD
virt-customize -a rootfs.img \
  --write '/etc/fstab:/dev/root / ext4 defaults 1 1'

# Turn the rootfs.img into a squashfs
# which must contain the layout
# /LiveOS/rootfs.img
rm -rf CDroot
rm -f squashfs.img
mkdir -p CDroot/LiveOS
mv rootfs.img CDroot/LiveOS
mksquashfs CDroot squashfs.img

# Create the CD layout.
rm -rf CDroot
mkdir -p CDroot/LiveOS

cp squashfs.img CDroot/LiveOS/

mkdir CDroot/isolinux

# Get the kernel (only) from the disk
# image.
pushd CDroot/isolinux
virt-builder --get-kernel ../../fedora-20.img
mv vmlinuz* vmlinuz0
rm init*
popd

# Get the special initrd that we built
# above.
guestfish --ro -a fedora-20.img -i \
    download /boot/initrd0 CDroot/isolinux/initrd0

# ISOLINUX configuration.
cat > CDroot/isolinux/isolinux.cfg <<EOF
prompt 1
default 1
label 1
    kernel vmlinuz0
    append initrd=initrd0 rd.live.image root=CDLABEL=boot rootfstype=auto rd.live.debug console=tty0 rd_NO_PLYMOUTH
EOF

# Rest of ISOLINUX installation.
cp /usr/share/syslinux/isolinux.bin CDroot/isolinux/
cp /usr/share/syslinux/ldlinux.c32 CDroot/isolinux/
cp /usr/share/syslinux/libcom32.c32 CDroot/isolinux/
cp /usr/share/syslinux/libutil.c32 CDroot/isolinux/
cp /usr/share/syslinux/vesamenu.c32 CDroot/isolinux/

# Create the ISO.
rm -f boot.iso
mkisofs -o boot.iso \
    -J -r \
    -V boot \
   -b isolinux/isolinux.bin -c isolinux/boot.cat \
   -no-emul-boot -boot-load-size 4 -boot-info-table \
   CDroot


nbdkit now supports cURL — HTTP, FTP, and SSH connections

nbdkit is a liberally licensed NBD (Network Block Device) server designed to let you connect all sorts of crazy disk image sources (like Amazon, Glance, VMware VDDK) to the universal network protocol for sharing disk images: NBD.

New in nbdkit 1.1.8: cURL support. This lets you turn any HTTP, FTP, TFTP or SSH server that hosts a disk image into an NBD server.

For example:

$ nbdkit -r curl url=http://onuma/scratch/boot.iso

and then you can read the disk image using guestfish, qemu or any other nbd client:

$ guestfish --ro -a nbd://localhost -i

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell

/dev/sda mounted on /

><fs> _

If you are using a normal SSH server like OpenSSH which supports the SSH File Transfer Protocol (aka SFTP), then you can use SFTP to access images:

$ nbdkit -r curl url=sftp://rjones@localhost/~/fedora-20.img
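One thing to bear in mind for the HTTP case: the web server has to support byte-range requests, since nbdkit fetches only the parts of the image that the client actually reads. A quick way to check is to look for the Accept-Ranges header (you should see bytes if ranges are supported):

$ curl -sI http://onuma/scratch/boot.iso | grep -i '^Accept-Ranges'
Accept-Ranges: bytes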

I’m hoping to enable write support in a future version.

It doesn’t work at the moment because I haven’t worked out how to switch between read (GET) and write (POST) requests in a single cURL handle. Perhaps I need to use two handles? The documentation is confusing.


New in libguestfs: virt-log

In libguestfs ≥ 1.27.17, there’s a new tool called virt-log for displaying the log files from a disk image or virtual machine:

$ virt-log -a disk.img | less

Previously you could write:

$ virt-cat -a disk.img /var/log/messages

That worked for some Linux guests, but several things happened: modern guests increasingly log to the systemd journal rather than plain text files, and different Linux distros put their plain text log files in different places.

Virt-log is designed to do the right thing automatically (although at the moment Windows support is not finished). In particular it will automatically decode and display the systemd journal, and it knows the different locations that some Linux distros store their plain text log files.


Caseless virtualization cluster, part 4

AMD supports nested virtualization a bit more reliably than Intel, which was one of the reasons to go for AMD processors in my virtualization cluster. (The other reason is that they are much cheaper.)

But how well does it perform? Not too badly as it happens.

I tested this by creating a Fedora 20 guest (the L1 guest). I could create a nested (L2) guest inside that, but a simpler way is to use guestfish to carry out some baseline performance measurements. Since libguestfs is creating a short-lived KVM appliance, it benefits from hardware virt acceleration when available. And since libguestfs ≥ 1.26, there is a new option that lets you force software emulation so you can easily test the effect with & without hardware acceleration.

L1 performance

Let’s start on the host (L0), measuring L1 performance. Note that you have to run the commands shown at least twice, both because supermin will build and cache the appliance first time and because it’s a fairer test of hardware acceleration if everything is cached in memory.

This AMD hardware turns out to be pretty good:

$ time guestfish -a /dev/null run
real	0m2.585s

(2.6 seconds is the time taken to launch a virtual machine, all its userspace and a daemon, then shut it down. I’m using libvirt to manage the appliance).

Forcing software emulation (disabling hardware acceleration):

$ time LIBGUESTFS_BACKEND_SETTINGS=force_tcg guestfish -a /dev/null run
real	0m9.995s

L2 performance

Inside the L1 Fedora guest, we run the same tests. Note this is testing L2 performance (the libguestfs appliance running on top of an L1 guest), ie. nested virt:

$ time guestfish -a /dev/null run
real	0m5.750s

Forcing software emulation:

$ time LIBGUESTFS_BACKEND_SETTINGS=force_tcg guestfish -a /dev/null run
real	0m9.949s
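Putting the four timings side by side:

                              hardware virt    force_tcg
appliance launched from L0    2.585 s          9.995 s
appliance launched inside L1  5.750 s          9.949 s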

Conclusions

These are just some simple tests. I’ll be doing something more comprehensive later. However:

  1. First level hardware virtualization performance on these AMD chips is excellent.
  2. Nested virt is about 40% of non-nested speed.
  3. TCG performance is slower as expected, but shows that hardware virt is being used and is beneficial even in the nested case.

Other data

The host has 8 cores and 16 GB of RAM. /proc/cpuinfo for one of the host cores is:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-8320 Eight-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
bogomips	: 7031.39
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

The L1 guest has 1 vCPU and 4 GB of RAM. /proc/cpuinfo in the guest:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD Opteron 63xx class CPU
stepping	: 0
microcode	: 0x1000065
cpu MHz		: 3515.548
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb lm rep_good nopl extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm svm abm sse4a misalignsse 3dnowprefetch xop fma4 tbm arat
bogomips	: 7031.09
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

Update

As part of the discussion in the comments about whether this has 4 or 8 physical cores, here is the lstopo output:

[lstopo diagram]
