Tag Archives: filesystems

Ridiculously big “files”

In the last post I showed how you can combine nbdfuse with nbdkit’s RAM disk to mount a RAM disk as a local file. In a talk I gave at FOSDEM last year I described creating absurdly large RAM-backed filesystems, and you can now use the same trick to create ridiculously big “files”. Here’s a 7 exabyte file:

$ touch /var/tmp/disk.img
$ nbdfuse /var/tmp/disk.img --command nbdkit -s memory 7E &
$ ll /var/tmp/disk.img 
 -rw-rw-rw-. 1 rjones rjones 8070450532247928832 Nov  4 13:37 /var/tmp/disk.img
$ ls -lh /var/tmp/disk.img 
 -rw-rw-rw-. 1 rjones rjones 7.0E Nov  4 13:37 /var/tmp/disk.img
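
As a quick sanity check on the size reported by ll above, here is a small sketch that assumes the 7E suffix means 7 × 2^60 bytes (binary exabytes). The variable name is mine, purely for illustration:

```shell
#!/bin/bash
# Assumption: the "E" suffix is a binary exabyte, ie. 2^60 bytes.
size=$(( 7 * 2**60 ))
echo "$size"
```

The result fits comfortably in a signed 64 bit integer, which is why ordinary tools can still stat the file.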

What can you actually do with this file, and more importantly does anything break? As in the talk, creating a Btrfs filesystem boringly just works. mkfs.ext4 spins using 100% of CPU. I let it go for 15 minutes but it seemed no closer to either succeeding or crashing. Update: As Ted pointed out in the comments, it’s likely I didn’t mean mkfs.ext4 here, which gives an appropriate error, but mkfs.xfs which consumes more and more space, appearing to spin.

Emacs said:

File disk.img is large (7 EiB), really open? (y)es or (n)o or (l)iterally

and I was too chicken to find out what it would do if I really opened it.

I do wonder if there’s a DoS attack here if I leave this seemingly massive regular file lying around in a public directory.


Filed under Uncategorized

nbdkit for loopback pt 6: giant file-backed disks for testing

In part 1 and part 5 of this series I created some giant disks with a virtual size of 2^63-1 bytes (8 exabytes). However these were stored in memory using nbdkit-memory-plugin so you could never allocate more space in these disks than available RAM plus swap.

This is a problem when testing some filesystems because the filesystem overhead (the space used to store superblocks, inode tables, block free maps and so on) can be 1% or more.
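
To get a feel for the numbers, here is a rough sketch of what 1% overhead means on a disk of this size. The 1% figure comes from the paragraph above; the variable names are only illustrative:

```shell
#!/bin/bash
total=9223372036854775807           # 2^63 - 1 bytes
overhead=$(( total / 100 ))         # 1% filesystem overhead, in bytes
overhead_pib=$(( overhead / 2**50 ))  # the same, in whole pebibytes
echo "$overhead bytes (about $overhead_pib PiB)"
```

That is far more than the RAM plus swap of any machine I have access to.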

The solution to this is to back the virtual disks using a sparse file instead. XFS lets you create sparse files up to 2^63-1 bytes and you can serve them using nbdkit-file-plugin instead:

$ rm -f temp
$ truncate -s $(( 2**63 - 1 )) temp
$ stat -c %s temp
9223372036854775807
$ nbdkit file file=temp

nbdkit-file-plugin recently got a lot of updates to ensure it always maintains sparseness where possible and supports efficient zeroing, so make sure you’re using nbdkit ≥ 1.6.
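
If you want to see how sparse a file really is, you can compare its apparent size with the space actually allocated. This is only a sketch: it uses a 1G file so it works on filesystems with smaller size limits than XFS, and the variable names are my own:

```shell
#!/bin/bash
tmp=$(mktemp)
truncate -s 1G "$tmp"

# %s is the apparent size in bytes; %b is the number of allocated
# blocks and %B the size in bytes of each of those blocks.
apparent=$(stat -c %s "$tmp")
allocated=$(( $(stat -c %b "$tmp") * $(stat -c %B "$tmp") ))

echo "apparent=$apparent allocated=$allocated"
rm -f "$tmp"
```

On most filesystems the allocated figure should be zero or very close to it straight after truncate.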

Now you can serve this in the ordinary way and you should be able to allocate as much space as is available on the host filesystem:

# nbd-client -b 512 localhost /dev/nbd0
Negotiation: ..size = 8796093022207MB
Connected /dev/nbd0
# blockdev --getsize64 /dev/nbd0
9223372036854774784
# sgdisk -n 1 /dev/nbd0
# gdisk -l /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048  18014398509481948   8.0 EiB     8300
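
The end sector that gdisk prints can be checked by hand. This sketch assumes 512 byte sectors and a standard GPT layout where the backup partition table occupies the last 33 sectors of the disk (32 sectors of entries plus one header sector). That layout is my assumption, it is not shown in the output above:

```shell
#!/bin/bash
size=9223372036854774784            # from blockdev --getsize64
sectors=$(( size / 512 ))           # total number of 512 byte sectors
# The last sector is number sectors-1; the backup GPT takes up the
# final 33 sectors, so the last usable sector sits just before it.
last_usable=$(( sectors - 1 - 33 ))
echo "$sectors $last_usable"
```

The second number agrees with the End (sector) column above.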

This command will still probably fail unless you have a lot of patience and a huge amount of space on your host:

# mkfs.xfs -K /dev/nbd0p1


nbdkit for loopback pt 5: 8 exabyte btrfs filesystem

Thanks Chris Murphy for noting that btrfs can create and mount 8 EB (approx 2^63 byte) filesystems effortlessly:

$ nbdkit -fv memory size=$(( 2**63-1 ))
# modprobe nbd
# nbd-client -b 512 localhost /dev/nbd0
# blockdev --getss /dev/nbd0
512
# gdisk /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048  18014398509481948   8.0 EiB     8300  Linux filesystem
# mkfs.btrfs -K /dev/nbd0p1
btrfs-progs v4.16
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               770e5592-9055-4551-8416-b6376802a2ad
Node size:          16384
Sector size:        4096
Filesystem size:    8.00EiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     8.00EiB  /dev/nbd0p1

# mount /dev/nbd0p1 /tmp/mnt
# df -h /tmp/mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/nbd0p1     8.0E   17M  8.0E   1% /tmp/mnt

I created a few files in there and it all seemed to work although I didn’t do any extensive testing. Good job btrfs!


nbdkit for loopback pt 2: injecting errors

nbdkit is a pluggable NBD server with a filter system that you can layer over plugins to transform block devices. One of the filters is the error filter which lets you inject errors. We can use this to find out how well filesystems cope with errors and recover from them.

$ rm -f /tmp/inject
$ nbdkit -fv --filter=error memory size=$(( 2**32 )) \
    error-rate=100% error-file=/tmp/inject
# nbd-client localhost /dev/nbd0

We can create a filesystem normally:

# sgdisk -n 1 /dev/nbd0
# gdisk -l /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024         4194286   4.0 GiB     8300  
# mkfs.ext4 /dev/nbd0p1
# mount /dev/nbd0p1 /mnt

It’s very interesting watching the verbose output of nbdkit -fv because you can see the lazy metadata creation which the Linux ext4 kernel driver carries out in the background after you mount the filesystem the first time.

So far we have not injected any errors. To do that we create the error-file (/tmp/inject). The error filter notices the file and responds by injecting EIO errors until we remove it:

# touch /tmp/inject
# ls /mnt
ls: reading directory '/mnt': Input/output error
# rm /tmp/inject
# ls /mnt
lost+found

Ext4 recovered once we stopped injecting errors, but …

# touch /mnt/hello
touch: cannot touch '/mnt/hello': Read-only file system

… it responded to the error by remounting the filesystem read-only. Interestingly I was not able to simply remount the filesystem read-write. Ext4 forced me to unmount the filesystem and run e2fsck before I could mount it again.

e2fsck also said:

e2fsck: Unknown code ____ 251 while recovering journal of /dev/nbd0p1

which I guess is a bug (already found upstream).
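
If you want to repeat the experiment with a controlled error window, the touch / rm sequence used earlier can be wrapped in a small helper. This is just a sketch: the function name is made up, and the default path assumes the same error-file=/tmp/inject setting as the nbdkit command line above:

```shell
#!/bin/bash
# inject_errors_for SECONDS [ERROR_FILE]
# Creates the error-file, waits, then removes it again so that the
# error filter stops injecting errors.
inject_errors_for() {
    local seconds=$1 file=${2:-/tmp/inject}
    touch "$file"
    sleep "$seconds"
    rm -f "$file"
}
```

For example inject_errors_for 5 would inject errors for roughly five seconds while you poke at the mounted filesystem from another window.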


Mapping files to disk, part 2

Part 1

Now I’ve written the second tool of virt-bmap which lets you boot a guest and observe what files it is reading from disk. (NB if you want to try this out you will need a patched libguestfs)

The second tool is an nbdkit plugin, so to use the tool you just do:

$ nbdkit -r bmaplogger file=/tmp/win7.img bmap=/tmp/win7.bmap \
  --run ' qemu-kvm -cpu host -m 2048 -hda $nbd '

and watch the output as the guest boots. Note that the bmap file must have been prepared previously by the virt-bmap tool (see part 1).

The results are interesting. Here is Windows 7 booting (edited down for brevity):

read v /dev/sda
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/tr-TR/bootmgr.exe.mui
read f /dev/sda1 /Boot/zh-HK/bootmgr.exe.mui
read f /dev/sda1 /Boot/zh-TW/bootmgr.exe.mui
read f /dev/sda1 /bootmgr
read v /dev/sda
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/Fonts/kor_boot.ttf
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/de-DE/bootmgr.exe.mui
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read f /dev/sda1 /bootmgr
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read v /dev/sda
read p /dev/sda2
read d /dev/sda2 /
read f /dev/sda2 /Windows/System32/Msdtc/MSDTC.LOG
read d /dev/sda2 /
read f /dev/sda2 /ProgramData/Microsoft/Search/Data/Applications/Windows/MSSres00001.jrs
read d /dev/sda2 /
read d /dev/sda2 /Users
read p /dev/sda2
read d /dev/sda2 /Windows/assembly/NativeImages_v2.0.50727_64
read d /dev/sda2 /Windows
read p /dev/sda2
read d /dev/sda2 /Windows/servicing
read d /dev/sda2 /Windows
read f /dev/sda2 /Windows/System32/config/SAM.LOG1
read p /dev/sda2
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read d /dev/sda2 /Windows/System32/en-US/Licenses/_Default
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read d /dev/sda2 /Windows/System32
read d /dev/sda2 /Windows/System32/Tasks/Microsoft/Windows
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read f /dev/sda2 /Windows/System32/CIRCoInst.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/clb.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/cmmon32.exe
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/cryptnet.dll
read d /dev/sda2 /Windows/System32
[...]
read f /dev/sda2 /Windows/System32/iscsilog.dll
read f /dev/sda2 /Windows/System32/ksetup.exe
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/ksproxy.ax
read f /dev/sda2 /Windows/System32/NcdProp.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/nci.dll
read f /dev/sda2 /Windows/System32/profsvc.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/propsys.dll
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read f /dev/sda2 /Windows/System32/winload.exe
[...]

Here is a Windows server that had McAfee (a “virus scanner”) installed:

read v /dev/sda
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /bootmgr
read v /dev/sda
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log0.txt
read v /dev/sda
read p /dev/sda1
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/nl-NL/bootmgr.exe.mui
read f /dev/sda1 /Boot/pl-PL/bootmgr.exe.mui
read f /dev/sda1 /Boot/ru-RU/bootmgr.exe.mui
read f /dev/sda1 /Boot/zh-TW/bootmgr.exe.mui
read f /dev/sda1 /bootmgr
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/Fonts/kor_boot.ttf
read f /dev/sda1 /BOOTSECT.BAK
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /BOOTSECT.BAK
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read f /dev/sda1 /Boot/BCD
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log4.txt
read f /dev/sda1 /Boot/BCD
read p /dev/sda2
read f /dev/sda2 /Program Files (x86)/Common Files/microsoft shared/DAO/dao360.dll
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda2 /Program Files (x86)/Common Files/System/msadc/adcjavas.inc
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/auditmanager.log
read f /dev/sda2 /Program Files (x86)/Common Files/microsoft shared/DAO/dao360.dll
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log7.txt
read f /dev/sda2 /Program Files (x86)/MSBuild/Microsoft/Windows Workflow Foundation/v3.0/Workflow.Targets
read f /dev/sda2 /Windows/ServerEnterprise.xml
read f /dev/sda2 /Windows/inf/setupapi.dev.log
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log7.txt
read f /dev/sda2 /Program Files (x86)/Internet Explorer/en-US/jsprofilerui.dll.mui
read f /dev/sda2 /Users/tempadmin/AppData/Local/Microsoft/Internet Explorer/Recovery/High/Last Active/{7101D2F0-982F-11E0-A584-005056A7000F}.dat
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/Plugins/AuEngineUpdater.dll
read f /dev/sda2 /Windows/System32/clusapi.dll
read f /dev/sda2 /Windows/System32/cmcfg32.dll
read f /dev/sda2 /Windows/winsxs/Backup/amd64_microsoft-windows-com-base_31bf3856ad364e35_6.1.7600.16385_none_69e3281e403684ea_comcat.dll_8571d1d1
read f /dev/sda2 /Windows/System32/comdlg32.dll
read f /dev/sda2 /Windows/SysWOW64/comexp.msc
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/Schema/linux-definitions-schema.xsd
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/SysWOW64/C_10003.NLS
read f /dev/sda2 /Windows/SysWOW64/C_10004.NLS
read f /dev/sda2 /Windows/SysWOW64/C_20005.NLS
read f /dev/sda2 /Windows/SysWOW64/C_21025.NLS
read f /dev/sda2 /Windows/CMAgent/Installer/Providers/ExecutionEngine/providers.catalog
read f /dev/sda2 /Windows/SysWOW64/dfsrHealthReport.xsl
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/SysWOW64/C_10003.NLS
read f /dev/sda2 /Windows/SysWOW64/C_10004.NLS
read f /dev/sda2 /Windows/SysWOW64/C_20005.NLS
read f /dev/sda2 /Windows/SysWOW64/C_21025.NLS
read f /dev/sda2 /Windows/CMAgent/Installer/Providers/ExecutionEngine/providers.catalog
read f /dev/sda2 /Windows/SysWOW64/dfsrHealthReport.xsl
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/System32/hhctrl.ocx
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log2.txt
read f /dev/sda2 /Windows/System32/KBDA1.DLL
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/System32/Kswdmcap.ax
read f /dev/sda2 /Windows/SysWOW64/NOISE.CHS
read f /dev/sda2 /Windows/System32/NlsData0003.dll
read f /dev/sda2 /Windows/SysWOW64/RacRules.xml
read f /dev/sda2 /Windows/System32/ROUTE.EXE
read f /dev/sda2 /Windows/SysWOW64/en-US/tapimgmt.msc
read f /dev/sda2 /Windows/SysWOW64/en-US/tpm.msc
read f /dev/sda2 /Windows/System32/TpmInit.exe
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/oval.db
read f /dev/sda2 /Windows/Microsoft.NET/Framework64/v4.0.30319/ngen.log
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/Audit.db
read f /dev/sda2 /Windows/System32/winload.exe

I wouldn’t take any of these traces very literally right now. Our method of mapping files to disk blocks is a bit shaky, especially for ntfs-3g. However I did check the major points of the McAfee trace against the raw log and block map and it seems plausible.
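
Long traces like these are easier to digest if you count how often each file is read. The following is only a sketch: it assumes the trace was saved to a file and that every line has the "read TYPE DEVICE PATH" layout shown above. The function name and the sample data file are mine:

```shell
#!/bin/bash
# summarise_trace FILE
# Keep only file reads (type "f"), strip the first three fields so
# that paths containing spaces survive, then count duplicates.
summarise_trace() {
    awk '$2 == "f"' "$1" | cut -d' ' -f4- | sort | uniq -c | sort -rn
}

# A few lines in the same format as the McAfee trace.
trace=$(mktemp)
cat > "$trace" <<'EOF'
read v /dev/sda
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /bootmgr
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log7.txt
read f /dev/sda1 /Boot/BCD
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log7.txt
read f /dev/sda1 /Boot/BCD
EOF

summary=$(summarise_trace "$trace")
echo "$summary"
rm -f "$trace"
```

The most frequently read files come out at the top of the list.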


Mapping files to disk

Wouldn’t it be cool if you could watch a virtual machine booting, and at the same time see what files it is accessing on disk:

reading /dev/sda1 master boot record
reading /dev/sda1 /grub2/i386-pc/boot.img
reading /dev/sda1 /grub2/i386-pc/ext2.mod
reading /dev/sda1 /vmlinuz
...

You can already observe what disk blocks it is accessing pretty easily. There are several methods, but a quick one would be to use nbdkit’s file plugin with the -f -v flags (foreground and verbose). The problem is how to map disk blocks to the files and other interesting objects that exist in the disk image.

How do you map between files and disk blocks? For simple filesystems like ext4 you can use the FIBMAP ioctl, and perhaps adjust the answer by adding the offset of the start of the partition. However as you get further into the boot process you’ll probably encounter complexities like LVM. There may not even be a 1-1 mapping since RAID means that multiple blocks can store a single file block, and tail packing and deduplication mean that a block can belong to multiple files. And of course there are things other than plain files: directories, swap partitions, master boot records, and boot loaders, that live in and between filesystems.
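
For the simple case, the adjustment mentioned above is just arithmetic. Here is a sketch with made-up numbers: the partition start, block size and block number are all hypothetical, and FIBMAP itself is not called here:

```shell
#!/bin/bash
part_start_sector=2048   # hypothetical: first sector of the partition
sector_size=512          # bytes per sector
fs_block_size=4096       # hypothetical filesystem block size
fs_block=1000            # hypothetical block number returned by FIBMAP

# The byte offset in the whole disk image is the start of the
# partition plus the offset of the block within the filesystem.
disk_offset=$(( part_start_sector * sector_size + fs_block * fs_block_size ))
echo "$disk_offset"
```

As soon as LVM or RAID is involved this simple sum no longer holds.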

To solve this I have written a tool called virt-bmap. It takes a disk image and outputs a block map. To do this it uses libguestfs (patched) to control an nbdkit instance, reading each file and recording what blocks in the disk image are accessed. (It sounds complicated, but virt-bmap wraps it up in a simple command line tool.) The beauty of this is that the kernel takes care of the mapping for us, and it works no matter how many layers of filesystem/LVM/RAID are between the file and the underlying device. This doesn’t quite solve the “RAID problem” since the RAID layers in Linux are free to only read a single copy of the file, but is generally accurate for everything else.

$ virt-bmap fedora-20.img
virt-bmap: examining /dev/sda1 ...
virt-bmap: examining /dev/sda2 ...
virt-bmap: examining /dev/sda3 ...
virt-bmap: examining filesystem on /dev/sda1 (ext4) ...
virt-bmap: examining filesystem on /dev/sda3 (ext4) ...
virt-bmap: writing /home/rjones/d/virt-bmap/bmap
virt-bmap: successfully examined 3 partitions, 0 logical volumes,
           2 filesystems, 3346 directories, 20585 files
virt-bmap: output written to /home/rjones/d/virt-bmap/bmap

The output bmap file is a straightforward map from disk byte offset to file / files / object occupying that space:

1 541000 541400 d /dev/sda1 /
1 541400 544400 d /dev/sda1 /lost+found
1 941000 941400 f /dev/sda1 /.vmlinuz-3.11.10-301.fc20.x86_64.hmac
1 941400 961800 f /dev/sda1 /config-3.11.10-301.fc20.x86_64
1 961800 995400 f /dev/sda1 /initrd-plymouth.img
1 b00400 ef1c00 f /dev/sda1 /grub2/themes/system/background.png
1 f00400 12f1c00 f /dev/sda1 /grub2/themes/system/fireworks.png
1 1300400 1590400 f /dev/sda1 /System.map-3.11.10-301.fc20.x86_64

[The 1 that appears in the first column means “first disk”. Unfortunately virt-bmap can only map single disk virtual machines at present.]
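
To show how such a map might be consumed, here is a small lookup sketch. It assumes the columns are disk number, start offset, end offset, type, device and path, that the offsets are hexadecimal, and that the end offset is exclusive. Those last two points are my reading of the sample above. The function name is made up:

```shell
#!/bin/bash
# bmap_lookup HEX_OFFSET BMAP_FILE
# Prints every object whose range covers the given disk offset.
bmap_lookup() {
    local want=$(( 16#$1 )) disk start end type dev path
    while read -r disk start end type dev path; do
        if (( want >= 16#$start && want < 16#$end )); then
            printf '%s %s %s\n' "$type" "$dev" "$path"
        fi
    done < "$2"
}

# Sample entries copied from the output above.
bmap=$(mktemp)
cat > "$bmap" <<'EOF'
1 541000 541400 d /dev/sda1 /
1 941400 961800 f /dev/sda1 /config-3.11.10-301.fc20.x86_64
1 b00400 ef1c00 f /dev/sda1 /grub2/themes/system/background.png
EOF

hit=$(bmap_lookup 941500 "$bmap")
echo "$hit"
rm -f "$bmap"
```

Because a block can belong to more than one file, the function prints all matching entries and does not stop at the first.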

The second part of this, which I’m still writing, will be another nbdkit plugin which takes these maps and produces a nice log of accesses as the machine boots.


Paper classifying bugs in Linux filesystems

This is an excellent paper classifying bugs in Linux filesystems. The results seem to be generally applicable to bugs in open source kernel code.


Tip: copy out / dump filesystems from a guest

The scenario is that you’ve installed a guest in some host logical volume (or partition or iSCSI or whatever). Now you want to dump out the raw filesystem data for any filesystem from that guest.

It’s easy with guestfish:

$ sudo guestfish --ro -a /dev/vg/f14 \
    run : download /dev/sda1 /tmp/diskimage

Explanation:

  1. I’m using “sudo” because guestfish needs root in order to read the guest’s disk (/dev/vg/f14). I could run the same command as non-root if I was accessing a guest disk that didn’t need root permissions, e.g. from a file.
  2. Use the --ro option because I’m just reading out the contents.
  3. The run command launches the libguestfs back end. This is followed by the download command.
  4. /dev/sda1 is a filesystem inside the guest disk image, in this case, the /boot filesystem of a Fedora 14 guest. Use virt-filesystems or the guestfish list-filesystems command to list out filesystems in an arbitrary disk image.
  5. /tmp/diskimage is the target where I want to download the filesystem to.

Because /tmp/diskimage is a filesystem, I can also open it with guestfish …

$ guestfish --ro -a /tmp/diskimage -m /dev/sda
><fs> ll /
total 56831
dr-xr-xr-x.  5 root root     1024 Oct 15 11:19 .
drwxr-xr-x  23  500  500     4096 Jan  4 19:49 ..
-rw-r--r--.  1 root root  2022930 May  6  2010 System.map-2.6.33.3-85.fc13.x86_64
-rw-r--r--.  1 root root  2228083 Sep 15 03:02 System.map-2.6.35.4-28.fc14.x86_64
-rw-r--r--.  1 root root  2154391 Oct 13 22:28 System.map-2.6.35.6-43.fc14.x86_64
[...]

In this case because I used the -m option to mount up the whole disk image, I don’t need the run command. (See this explanation for exactly why).


Tip: Making a disk image sparse

Update: libguestfs ≥ 1.14 includes a new tool called virt-sparsify which can make guests sparse (thin-provisioned).

A sparse file is one where file blocks that would contain all zeroes are omitted from the file (and don’t take up any space in the filesystem). A sparse virtual disk image is the same sort of thing: blocks that the guest hasn’t written to yet are not stored by the host, and read as all zeroes. Sparse disk images can be implemented using sparse files on the host, or you can use a format like qcow2 which inherently supports sparseness.

The problem with sparse files is that they gradually grow. When a guest writes a block it is allocated, and potentially this is never freed, even if the guest deletes the file or writes all zeroes to the block. [Eventually this problem will be solved by implementing the TRIM command which lets the host know that the guest no longer requires a block, but we’re not quite there yet.]

This is of course a problem if you fill up the guest disk and then delete the files. The host file does not regain its sparseness.

How do you therefore sparsify a disk image?

There is a technique that you can use, which is simple to understand and implement, but it does require taking the guest offline.

First, fill the empty space in the guest with zeroes. A simple way to do this for a Linux guest is to run this command (run it within each guest filesystem):

dd if=/dev/zero of=zerofile bs=1M
# note that the 'dd' command fills up all free space and eventually fails
sync
rm zerofile

Now shut down the guest.

Copy the guest disk image using either qemu-img convert or cp --sparse=always. “cp” is the fastest but only works to sparsify a raw-format disk image:

cp --sparse=always guest-disk.img guest-disk-copy.img
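
You can try the cp technique on a small scale without a guest. This sketch writes 4 MB of real zeroes into a scratch file (standing in for a zero-filled guest disk), copies it, and compares the allocated blocks. The temporary directory and variable names are my own:

```shell
#!/bin/bash
dir=$(mktemp -d)

# A file full of explicitly written zeroes, so blocks get allocated.
dd if=/dev/zero of="$dir/guest-disk.img" bs=1M count=4 status=none
cp --sparse=always "$dir/guest-disk.img" "$dir/guest-disk-copy.img"

orig_blocks=$(stat -c %b "$dir/guest-disk.img")
copy_blocks=$(stat -c %b "$dir/guest-disk-copy.img")

# The contents must be identical even though the allocation differs.
same=no
cmp -s "$dir/guest-disk.img" "$dir/guest-disk-copy.img" && same=yes

echo "original=$orig_blocks copy=$copy_blocks identical=$same"
rm -rf "$dir"
```

On a filesystem that supports holes the copy should use far fewer blocks than the original.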

A little-known feature of the qemu-img convert subcommand is that it automatically sparsifies any disk sector which contains all zeroes, and of course it can convert the format at the same time:

qemu-img convert -f raw -O qcow2 guest-disk.img guest-disk-copy.qcow2

Now the copy in both cases is sparsified, and hopefully a lot smaller than before.

Addendum: Instead of running “dd” by hand inside each guest, you can use the following libguestfs script to achieve the same (but note the guest must be shut down otherwise you will get disk corruption):

#!/usr/bin/perl -w
# ./phil-space.pl (disk.img|GuestName)
# Requires libguestfs >= 1.5.

use strict;
use Sys::Virt;
use Sys::Guestfs;
use Sys::Guestfs::Lib qw(open_guest);

die "$0: recent version of libguestfs >= 1.5 is required\n"
    unless defined (Sys::Guestfs->can ("list_filesystems"));

die "$0 (disk.img|GuestName)\n" unless @ARGV >= 1;

my $g = open_guest (\@ARGV, rw => 1);
$g->launch ();

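# list_filesystems returns a hash of device => filesystem type.
my %filesystems = $g->list_filesystems ();

foreach (keys %filesystems) {
    # Errors are deliberately ignored: devices that cannot be mounted
    # (eg. swap) are simply skipped.
    eval {
        $g->mount_options ("", $_, "/");

        print "filling empty space in $_ with zeroes ...\n";

        my $filename = "/deleteme.tmp";
        # dd is expected to fail once the filesystem is full.
        eval { $g->dd ("/dev/zero", $filename) };
        $g->sync (); # make sure the last part of the file is written
        $g->rm ($filename);
    };
    $g->umount_all ();
}

$g->sync ();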


Freezing filesystems

Ric Wheeler and Christoph Hellwig were quick to point out I was wrong about something: Linux now has a standard API for freezing or “quiescing” filesystems.

Quiescing a filesystem lets you take a consistent snapshot or backup at the block device level. If your server uses SAN storage, then probably your SAN lets you take snapshots of the SCSI LUNs at any time. But if you try doing this while the server is under load you’ll (at best) get a “crash consistent” snapshot, where the journal has to be replayed when the copy of the filesystem is mounted, and at worst you’ll get data corruption, particularly with ext3 defaults.

Quiescing tells the filesystem to make things consistent at the disk / block device level. The journal won’t need to be replayed, and the superblock is marked as if you’d unmounted the device. A snapshot taken at this stage will be consistent, at least at the filesystem level (applications don’t know what is happening, so you could still see things like half-written transactions in databases).

The downside to quiescing a filesystem is that it generally causes writes to be blocked, eventually bringing the whole system to a grinding halt. SAN snapshots can be done very quickly though, so the time between a “freeze” and “thaw” operation is usually brief.

Very recent versions of util-linux-ng have an fsfreeze command that lets you freeze or thaw filesystems at the command line. Use with care!
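
Because a frozen filesystem blocks writers, it is worth making sure the thaw always happens. Here is a minimal sketch of a wrapper that does this. The function name is mine; it relies only on the -f (freeze) and -u (unfreeze) options of fsfreeze and must be run as root on a real mount point:

```shell
#!/bin/bash
# with_frozen_fs MOUNTPOINT COMMAND [ARGS...]
# Freezes the filesystem, runs the command, then thaws the filesystem
# whether or not the command succeeded.
with_frozen_fs() {
    local mnt=$1 status
    shift
    fsfreeze -f "$mnt" || return
    "$@"
    status=$?
    fsfreeze -u "$mnt"
    return "$status"
}
```

You would put your snapshot command where COMMAND goes, and the function returns whatever exit status that command returned.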

Freezing filesystems also has an application for virtual machines. Our new guest agent will support freezing filesystems so that you can coordinate a consistent backup or snapshot from outside the guest.

If you have Rawhide and the most recent virt-rescue you can play with freezing filesystems without breaking anything:

$ rm -f test.img
$ truncate -s 1G test.img
$ virt-rescue test.img
><rescue> mkfs.ext4 /dev/vda
><rescue> mount /dev/vda /sysroot

From another window you can see that the image is not consistent. If you were to snapshot the image now the filesystem would at least require journal recovery when mounted:

$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (needs journal recovery) (extents) (large files) (huge files)

But by issuing fsfreeze in the guest we can make it consistent:

><rescue> fsfreeze -f /sysroot
$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge files)

… allowing us to take a snapshot or copy of the block device (test.img) in a consistent state.
