Tag Archives: filesystems

Mapping files to disk, part 2

Part 1

Now I’ve written the second tool of virt-bmap which lets you boot a guest and observe what files it is reading from disk. (NB if you want to try this out you will need a patched libguestfs)

The second tool is an nbdkit plugin, so to use the tool you just do:

$ nbdkit -r bmaplogger file=/tmp/win7.img bmap=/tmp/win7.bmap \
  --run ' qemu-kvm -cpu host -m 2048 -hda $nbd '

and watch the output as the guest boots. Note that the bmap file must have been prepared previously by the virt-bmap tool (see part 1).

The results are interesting. Here is Windows 7 booting (edited down for brevity):

read v /dev/sda
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/tr-TR/bootmgr.exe.mui
read f /dev/sda1 /Boot/zh-HK/bootmgr.exe.mui
read f /dev/sda1 /Boot/zh-TW/bootmgr.exe.mui
read f /dev/sda1 /bootmgr
read v /dev/sda
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/Fonts/kor_boot.ttf
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/de-DE/bootmgr.exe.mui
read p /dev/sda1
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda1 /Boot/da-DK/bootmgr.exe.mui
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read f /dev/sda1 /bootmgr
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read v /dev/sda
read p /dev/sda2
read d /dev/sda2 /
read f /dev/sda2 /Windows/System32/Msdtc/MSDTC.LOG
read d /dev/sda2 /
read f /dev/sda2 /ProgramData/Microsoft/Search/Data/Applications/Windows/MSSres00001.jrs
read d /dev/sda2 /
read d /dev/sda2 /Users
read p /dev/sda2
read d /dev/sda2 /Windows/assembly/NativeImages_v2.0.50727_64
read d /dev/sda2 /Windows
read p /dev/sda2
read d /dev/sda2 /Windows/servicing
read d /dev/sda2 /Windows
read f /dev/sda2 /Windows/System32/config/SAM.LOG1
read p /dev/sda2
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read d /dev/sda2 /Windows/System32/en-US/Licenses/_Default
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read d /dev/sda2 /Windows/System32
read d /dev/sda2 /Windows/System32/Tasks/Microsoft/Windows
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read f /dev/sda2 /Windows/System32/CIRCoInst.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/clb.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/cmmon32.exe
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/cryptnet.dll
read d /dev/sda2 /Windows/System32
[...]
read f /dev/sda2 /Windows/System32/iscsilog.dll
read f /dev/sda2 /Windows/System32/ksetup.exe
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/ksproxy.ax
read f /dev/sda2 /Windows/System32/NcdProp.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/nci.dll
read f /dev/sda2 /Windows/System32/profsvc.dll
read d /dev/sda2 /Windows/System32
read f /dev/sda2 /Windows/System32/propsys.dll
read d /dev/sda2 /Windows/System32
read p /dev/sda2
read f /dev/sda2 /Windows/System32/winload.exe
[...]

Here is a Windows server that had McAfee (a “virus scanner”) installed:

read v /dev/sda
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /bootmgr
read v /dev/sda
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log0.txt
read v /dev/sda
read p /dev/sda1
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/nl-NL/bootmgr.exe.mui
read f /dev/sda1 /Boot/pl-PL/bootmgr.exe.mui
read f /dev/sda1 /Boot/ru-RU/bootmgr.exe.mui
read f /dev/sda1 /Boot/zh-TW/bootmgr.exe.mui
read f /dev/sda1 /bootmgr
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/Fonts/kor_boot.ttf
read f /dev/sda1 /BOOTSECT.BAK
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /BOOTSECT.BAK
read f /dev/sda1 /Boot/BCD
read f /dev/sda1 /Boot/BOOTSTAT.DAT
read f /dev/sda1 /Boot/BCD
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log4.txt
read f /dev/sda1 /Boot/BCD
read p /dev/sda2
read f /dev/sda2 /Program Files (x86)/Common Files/microsoft shared/DAO/dao360.dll
read f /dev/sda1 /Boot/cs-CZ/bootmgr.exe.mui
read f /dev/sda2 /Program Files (x86)/Common Files/System/msadc/adcjavas.inc
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/auditmanager.log
read f /dev/sda2 /Program Files (x86)/Common Files/microsoft shared/DAO/dao360.dll
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log7.txt
read f /dev/sda2 /Program Files (x86)/MSBuild/Microsoft/Windows Workflow Foundation/v3.0/Workflow.Targets
read f /dev/sda2 /Windows/ServerEnterprise.xml
read f /dev/sda2 /Windows/inf/setupapi.dev.log
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log7.txt
read f /dev/sda2 /Program Files (x86)/Internet Explorer/en-US/jsprofilerui.dll.mui
read f /dev/sda2 /Users/tempadmin/AppData/Local/Microsoft/Internet Explorer/Recovery/High/Last Active/{7101D2F0-982F-11E0-A584-005056A7000F}.dat
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/Plugins/AuEngineUpdater.dll
read f /dev/sda2 /Windows/System32/clusapi.dll
read f /dev/sda2 /Windows/System32/cmcfg32.dll
read f /dev/sda2 /Windows/winsxs/Backup/amd64_microsoft-windows-com-base_31bf3856ad364e35_6.1.7600.16385_none_69e3281e403684ea_comcat.dll_8571d1d1
read f /dev/sda2 /Windows/System32/comdlg32.dll
read f /dev/sda2 /Windows/SysWOW64/comexp.msc
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/Schema/linux-definitions-schema.xsd
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/SysWOW64/C_10003.NLS
read f /dev/sda2 /Windows/SysWOW64/C_10004.NLS
read f /dev/sda2 /Windows/SysWOW64/C_20005.NLS
read f /dev/sda2 /Windows/SysWOW64/C_21025.NLS
read f /dev/sda2 /Windows/CMAgent/Installer/Providers/ExecutionEngine/providers.catalog
read f /dev/sda2 /Windows/SysWOW64/dfsrHealthReport.xsl
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/SysWOW64/C_10003.NLS
read f /dev/sda2 /Windows/SysWOW64/C_10004.NLS
read f /dev/sda2 /Windows/SysWOW64/C_20005.NLS
read f /dev/sda2 /Windows/SysWOW64/C_21025.NLS
read f /dev/sda2 /Windows/CMAgent/Installer/Providers/ExecutionEngine/providers.catalog
read f /dev/sda2 /Windows/SysWOW64/dfsrHealthReport.xsl
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/System32/hhctrl.ocx
read f /dev/sda2 /Program Files (x86)/McAfee/Real Time/log2.txt
read f /dev/sda2 /Windows/System32/KBDA1.DLL
read f /dev/sda2 /ProgramData/McAfee/Common Framework/Mesh/SvcMgr_WPLCLDWA170.log
read f /dev/sda2 /Windows/System32/Kswdmcap.ax
read f /dev/sda2 /Windows/SysWOW64/NOISE.CHS
read f /dev/sda2 /Windows/System32/NlsData0003.dll
read f /dev/sda2 /Windows/SysWOW64/RacRules.xml
read f /dev/sda2 /Windows/System32/ROUTE.EXE
read f /dev/sda2 /Windows/SysWOW64/en-US/tapimgmt.msc
read f /dev/sda2 /Windows/SysWOW64/en-US/tpm.msc
read f /dev/sda2 /Windows/System32/TpmInit.exe
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/oval.db
read f /dev/sda2 /Windows/Microsoft.NET/Framework64/v4.0.30319/ngen.log
read f /dev/sda2 /Program Files (x86)/McAfee/Policy Auditor Agent/Audit.db
read f /dev/sda2 /Windows/System32/winload.exe

I wouldn’t take any of these traces very literally right now. Our method of mapping files to disk blocks is a bit shaky, especially for ntfs-3g. However I did check the major points of the McAfee trace against the raw log and block map and it seems plausible.

Advertisements

Leave a comment

Filed under Uncategorized

Mapping files to disk

Wouldn’t it be cool if you could watch a virtual machine booting, and at the same time see what files it is accessing on disk:

reading /dev/sda1 master boot record
reading /dev/sda1 /grub2/i386-pc/boot.img
reading /dev/sda1 /grub2/i386-pc/ext2.mod
reading /dev/sda1 /vmlinuz
...

You can already observe what disk blocks it is accessing pretty easily. There are several methods, but a quick one would be to use nbdkit’s file plugin with the -f -v flags (foreground and verbose). The problem is how to map disk blocks to the files and other interesting objects that exist in the disk image.

How do you map between files and disk blocks? For simple filesystems like ext4 you can use the FIBMAP ioctl, and perhaps adjust the answer by adding the offset of the start of the partition. However as you get further into the boot process you’ll probably encounter complexities like LVM. There may not even be a 1-1 mapping since RAID means that multiple blocks can store a single file block, and tail packing and deduplication mean that a block can belong to multiple files. And of course there are things other than plain files: directories, swap partitions, master boot records, and boot loaders, that live in and between filesystems.

To solve this I have written a tool called virt-bmap. It takes a disk image and outputs a block map. To do this it uses libguestfs (patched) to control an nbdkit instance, reading each file and recording what blocks in the disk image are accessed. (It sounds complicated, but virt-bmap wraps it up in a simple command line tool.) The beauty of this is that the kernel takes care of the mapping for us, and it works no matter how many layers of filesystem/LVM/RAID are between the file and the underlying device. This doesn’t quite solve the “RAID problem” since the RAID layers in Linux are free to only read a single copy of the file, but is generally accurate for everything else.

$ virt-bmap fedora-20.img
virt-bmap: examining /dev/sda1 ...
virt-bmap: examining /dev/sda2 ...
virt-bmap: examining /dev/sda3 ...
virt-bmap: examining filesystem on /dev/sda1 (ext4) ...
virt-bmap: examining filesystem on /dev/sda3 (ext4) ...
virt-bmap: writing /home/rjones/d/virt-bmap/bmap
virt-bmap: successfully examined 3 partitions, 0 logical volumes,
           2 filesystems, 3346 directories, 20585 files
virt-bmap: output written to /home/rjones/d/virt-bmap/bmap

The output bmap file is a straightforward map from disk byte offset to file / files / object occupying that space:

1 541000 541400 d /dev/sda1 /
1 541400 544400 d /dev/sda1 /lost+found
1 941000 941400 f /dev/sda1 /.vmlinuz-3.11.10-301.fc20.x86_64.hmac
1 941400 961800 f /dev/sda1 /config-3.11.10-301.fc20.x86_64
1 961800 995400 f /dev/sda1 /initrd-plymouth.img
1 b00400 ef1c00 f /dev/sda1 /grub2/themes/system/background.png
1 f00400 12f1c00 f /dev/sda1 /grub2/themes/system/fireworks.png
1 1300400 1590400 f /dev/sda1 /System.map-3.11.10-301.fc20.x86_64

[The 1 that appears in the first column means “first disk”. Unfortunately virt-bmap can only map single disk virtual machines at present.]

The second part of this, which I’m still writing, will be another nbdkit plugin which takes these maps and produces a nice log of accesses as the machine boots.

3 Comments

Filed under Uncategorized

Paper classifying bugs in Linux filesystems

This is an excellent paper classifying bugs in Linux filesystems. The results seem to be generally applicable to bugs in open source kernel code.

Leave a comment

Filed under Uncategorized

Tip: copy out / dump filesystems from a guest

The scenario is that you’ve installed a guest in some host logical volume (or partition or iSCSI or whatever). Now you want to dump out the raw filesystem data for any filesystem from that guest.

It’s easy with guestfish

$ sudo guestfish --ro -a /dev/vg/f14 \
    run : download /dev/sda1 /tmp/diskimage

Explanation:

  1. I’m using “sudo” because guestfish needs root in order to read the guest’s disk (/dev/vg/f14). I could run the same command as non-root if I was accessing a guest disk that didn’t need root permissions, eg. from a file.
  2. Use the --ro option because I’m just reading out the contents.
  3. The run command launches the libguestfs back end. This is followed by the download command.
  4. /dev/sda1 is a filesystem inside the guest disk image, in this case, the /boot filesystem of a Fedora 14 guest. Use virt-filesystems or the guestfish list-filesystems command to list out filesystems in a random disk image.
  5. /tmp/diskimage is the target where I want to download the filesystem to

Because /tmp/diskimage is a filesystem, I can also open it with guestfish …

$ guestfish --ro -a /tmp/diskimage -m /dev/sda
><fs> ll /
total 56831
dr-xr-xr-x.  5 root root     1024 Oct 15 11:19 .
drwxr-xr-x  23  500  500     4096 Jan  4 19:49 ..
-rw-r--r--.  1 root root  2022930 May  6  2010 System.map-2.6.33.3-85.fc13.x86_64
-rw-r--r--.  1 root root  2228083 Sep 15 03:02 System.map-2.6.35.4-28.fc14.x86_64
-rw-r--r--.  1 root root  2154391 Oct 13 22:28 System.map-2.6.35.6-43.fc14.x86_64
[...]

In this case because I used the -m option to mount up the whole disk image, I don’t need the run command. (See this explanation for exactly why).

Leave a comment

Filed under Uncategorized

Tip: Making a disk image sparse

Update: libguestfs ≥ 1.14 includes a new tool called virt-sparsify which can make guests sparse (thin-provisioned).

A sparse file is one where file blocks that would contain all zeroes are omitted from the file (and don’t take up any space in the filesystem). A sparse virtual disk image is the same sort of thing: blocks that the guest hasn’t written to yet are not stored by the host, and read as all zeroes. Sparse disk images can be implemented using sparse files on the host, or you can use a format like qcow2 which inherently supports sparse files.

The problem with sparse files is that they gradually grow. When a guest writes a block it is allocated, and potentially this is never freed, even if the guest deletes the file or writes all zeroes to the block. [Eventually this problem will be solved by implementing the TRIM command which lets the host know that the guest no longer requires a block, but we’re not quite there yet.]

This is of course a problem if you fill up the guest disk and then delete the files. The host file does not regain its sparseness.

How do you therefore sparsify a disk image?

There is a technique that you can use, which is simple to understand and implement, but it does require taking the guest offline.

First, fill the empty space in the guest with zeroes. A simple way to do this for a Linux guest is to run this command (run it within each guest filesystem):

dd if=/dev/zero of=zerofile bs=1M
# note that the 'dd' command fills up all free space and eventually fails
sync
rm zerofile

Now shut down the guest.

Copy the guest disk image using either qemu-img convert or cp --sparse=always. “cp” is the fastest but only works to sparsify a raw-format disk image:

cp --sparse=always guest-disk.img guest-disk-copy.img

A little-known feature of the qemu-img convert subcommand is that it automatically sparsifies any disk sector which contains all zeroes, and of course it can convert the format at the same time:

qemu-img convert -f raw -O qcow2 guest-disk.img guest-disk-copy.qcow2

Now the copy in both cases is sparsified, and hopefully a lot smaller than before.

Addendum: Instead of running “dd” by hand inside each guest, you can use the following libguestfs script to achieve the same (but note the guest must be shut down otherwise you will get disk corruption):

#!/usr/bin/perl -w
# ./phil-space.pl (disk.img|GuestName)
# Requires libguestfs >= 1.5.

use strict;
use Sys::Virt;
use Sys::Guestfs;
use Sys::Guestfs::Lib qw(open_guest);

die "$0: recent version of libguestfs >= 1.5 is required\n"
    unless defined (Sys::Guestfs->can ("list_filesystems"));

die "$0 (disk.img|GuestName)\n" unless @ARGV >= 1;

my $g = open_guest (\@ARGV, rw => 1);
$g->launch ();

my %filesystems = $g->list_filesystems ();

foreach (keys %filesystems) {
    eval {
        $g->mount_options ("", $_, "/");

        print "filling empty space in $_ with zeroes ...\n";

        my $filename = "/deleteme.tmp";
        eval { $g->dd ("/dev/zero", $filename) };
        $g->sync (); # make sure the last part of the file is written
        $g->rm ($filename);
    };
    $g->umount_all ();
}

$g->sync ()

2 Comments

Filed under Uncategorized

Freezing filesystems

Ric Wheeler and Christoph Hellwig were quick to point out I was wrong about something: Linux now has a standard API for freezing or “quiescing” filesystems.

Quiescing a filesystem lets you take a consistent snapshot or backup at the block device level. If your server uses SAN storage, then probably your SAN lets you take snapshots of the SCSI LUNs at any time. But if you try doing this while the server is under load you’ll (at best) get a “crash consistent” snapshot, where the journal has to be replayed when the copy of the filesystem is mounted, and at worst you’ll get data corruption, particularly with ext3 defaults.

Quiescing tells the filesystem to make things consistent at the disk / block device level. The journal won’t need to be replayed, and the superblock is marked as if you’d unmounted the device. A snapshot taken at this stage will be consistent, at least at the filesystem level (applications don’t know what is happening, so you could still see things like half-written transactions in databases).

The downside to quiescing a filesystem is that it generally causes writes to be blocked, eventually bringing the whole system to a grinding halt. SAN snapshots can be done very quickly though, so the time between a “freeze” and “thaw” operation is usually brief.

Very recent versions of util-linux-ng have an fsfreeze command that lets you freeze or thaw filesystems at the command line. Use with care!

Freezing filesystems also has an application for virtual machines. Our new guest agent will support freezing filesystems so that you can coordinate a consistent backup or snapshot from outside the guest.

If you have Rawhide and the most recent virt-rescue you can play with freezing filesystems without breaking anything:

$ rm -f test.img
$ truncate -s 1G test.img
$ virt-rescue test.img
><rescue> mkfs.ext4 /dev/vda
><rescue> mount /dev/vda /sysroot

From another window you can see that the image is not consistent. If you were to snapshot the image now the filesystem would at least require journal recovery when mounted:

$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (needs journal recovery) (extents) (large files) (huge files)

But by issuing fsfreeze in the guest we can make it consistent:

><rescue> fsfreeze -f /sysroot
$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge files)

.. allowing us to take a snapshot or copy of the block device (test.img) in a consistent state.

Leave a comment

Filed under Uncategorized

Tip: use prepared images in guestfish

In guestfish ≥ 1.3.6 you can save yourself a few keystrokes when testing by using a preformatted disk image on the command line:

$ guestfish -N fs:ext4

starts guestfish with an ext4 formatted filesystem in a 100 MB partitioned block device. Specify a different size for the block device by doing:

$ guestfish -N fs:ext4:1G

To start with just an empty block device, you can now do:

$ guestfish -N disk

There are various other prepared disk images available for you to use. For more information, see the git commit here.

3 Comments

Filed under Uncategorized