Tag Archives: filesystems

Paper classifying bugs in Linux filesystems

This is an excellent paper classifying bugs in Linux filesystems. The results seem to be generally applicable to bugs in open source kernel code.

Leave a Comment

Filed under Uncategorized

Tip: copy out / dump filesystems from a guest

The scenario is that you’ve installed a guest in some host logical volume (or partition or iSCSI or whatever). Now you want to dump out the raw filesystem data for any filesystem from that guest.

It’s easy with guestfish:

$ sudo guestfish --ro -a /dev/vg/f14 \
    run : download /dev/sda1 /tmp/diskimage

Explanation:

  1. I’m using “sudo” because guestfish needs root in order to read the guest’s disk (/dev/vg/f14). I could run the same command as non-root if I were accessing a guest disk that didn’t need root permissions, e.g. a disk image stored in a file.
  2. Use the --ro option because I’m just reading out the contents.
  3. The run command launches the libguestfs back end. This is followed by the download command.
  4. /dev/sda1 is a filesystem inside the guest disk image, in this case the /boot filesystem of a Fedora 14 guest. Use virt-filesystems or the guestfish list-filesystems command to list the filesystems in an arbitrary disk image.
  5. /tmp/diskimage is the target file on the host that I want to download the filesystem to.
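
If you don’t know which filesystems a guest contains (point 4), you can list them first. The following is only a sketch: the function name list_guest_filesystems and its return codes are made up for illustration, and it uses nothing beyond the guestfish options and commands shown above.

```shell
# list_guest_filesystems DISK
# Print the filesystems found in DISK (a device or image file), read-only.
# Returns 2 if DISK is not readable, 3 if guestfish is not installed.
list_guest_filesystems() {
    disk=$1
    if [ ! -r "$disk" ]; then
        echo "cannot read $disk (do you need sudo?)" >&2
        return 2
    fi
    if ! command -v guestfish >/dev/null 2>&1; then
        echo "guestfish not found in PATH" >&2
        return 3
    fi
    # Same invocation style as the download example: launch, then list.
    guestfish --ro -a "$disk" run : list-filesystems
}
```

For the guest in this post you would call it as list_guest_filesystems /dev/vg/f14 (under sudo, for the reason given in point 1).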

Because /tmp/diskimage is a filesystem, I can also open it with guestfish …

$ guestfish --ro -a /tmp/diskimage -m /dev/sda
><fs> ll /
total 56831
dr-xr-xr-x.  5 root root     1024 Oct 15 11:19 .
drwxr-xr-x  23  500  500     4096 Jan  4 19:49 ..
-rw-r--r--.  1 root root  2022930 May  6  2010 System.map-2.6.33.3-85.fc13.x86_64
-rw-r--r--.  1 root root  2228083 Sep 15 03:02 System.map-2.6.35.4-28.fc14.x86_64
-rw-r--r--.  1 root root  2154391 Oct 13 22:28 System.map-2.6.35.6-43.fc14.x86_64
[...]

In this case because I used the -m option to mount up the whole disk image, I don’t need the run command. (See this explanation for exactly why).


Tip: Making a disk image sparse

Update: libguestfs ≥ 1.14 includes a new tool called virt-sparsify which can make guests sparse (thin-provisioned).

A sparse file is one where file blocks that would contain all zeroes are omitted from the file (and don’t take up any space in the filesystem). A sparse virtual disk image is the same sort of thing: blocks that the guest hasn’t written to yet are not stored by the host, and read as all zeroes. Sparse disk images can be implemented using sparse files on the host, or you can use a format like qcow2 which inherently supports sparse images.
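
To see the difference between the apparent size and the space really allocated on the host, here is a minimal sketch. It assumes GNU coreutils (truncate and stat -c) and a host filesystem that supports sparse files; the 100 MB size is arbitrary.

```shell
# Create a 100 MB file that is one big hole: no data blocks are written.
img=$(mktemp)
truncate -s 100M "$img"

# Apparent size in bytes: what ls -l (or a guest using this as a disk) sees.
apparent=$(stat -c %s "$img")

# Allocated size in bytes: %b blocks of %B bytes each, as counted by the host.
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))

echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"
rm -f "$img"
```

The allocated figure should be zero or very close to it, while the apparent size is the full 100 MB.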

The problem with sparse files is that they gradually grow. When a guest writes a block it is allocated, and potentially this is never freed, even if the guest deletes the file or writes all zeroes to the block. [Eventually this problem will be solved by implementing the TRIM command which lets the host know that the guest no longer requires a block, but we're not quite there yet.]

This is of course a problem if you fill up the guest disk and then delete the files. The host file does not regain its sparseness.

How do you therefore sparsify a disk image?

There is a technique that you can use, which is simple to understand and implement, but it does require taking the guest offline.

First, fill the empty space in the guest with zeroes. A simple way to do this for a Linux guest is to run this command (run it within each guest filesystem):

dd if=/dev/zero of=zerofile bs=1M
# note that the 'dd' command fills up all free space and eventually fails
sync
rm zerofile

Now shut down the guest.

Copy the guest disk image using either qemu-img convert or cp --sparse=always. “cp” is the fastest but only works to sparsify a raw-format disk image:

cp --sparse=always guest-disk.img guest-disk-copy.img

A little-known feature of the qemu-img convert subcommand is that it automatically sparsifies any disk sector which contains all zeroes, and of course it can convert the format at the same time:

qemu-img convert -f raw -O qcow2 guest-disk.img guest-disk-copy.qcow2

Now the copy in both cases is sparsified, and hopefully a lot smaller than before.
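
You can try the cp technique on a throwaway file before using it on a real guest. This sketch assumes GNU coreutils; the file names and sizes are invented for the test, and how much space is saved depends on the host filesystem.

```shell
# Build a small raw "disk image": 1 MB of random data followed by 8 MB of
# zeroes that are really written out, so the file is not sparse.
workdir=$(mktemp -d)
orig="$workdir/guest-disk.img"
copy="$workdir/guest-disk-copy.img"
dd if=/dev/urandom of="$orig" bs=1M count=1 2>/dev/null
dd if=/dev/zero bs=1M count=8 2>/dev/null >> "$orig"
sync "$orig"

# Copy it, turning runs of zero bytes into holes.
cp --sparse=always "$orig" "$copy"
sync "$copy"

# Allocated bytes before and after (GNU stat: %b blocks of %B bytes each).
allocated() { echo $(( $(stat -c %b "$1") * $(stat -c %B "$1") )); }
orig_alloc=$(allocated "$orig")
copy_alloc=$(allocated "$copy")
echo "original: $orig_alloc bytes allocated"
echo "copy:     $copy_alloc bytes allocated"

# The contents must still be byte-for-byte identical.
if cmp -s "$orig" "$copy"; then same=yes; else same=no; fi
echo "identical contents: $same"
rm -rf "$workdir"
```

On a typical host filesystem the copy should occupy roughly 1 MB instead of 9 MB, with identical contents.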

Addendum: Instead of running “dd” by hand inside each guest, you can use the following libguestfs script to achieve the same (but note the guest must be shut down otherwise you will get disk corruption):

#!/usr/bin/perl -w
# ./phil-space.pl (disk.img|GuestName)
# Requires libguestfs >= 1.5.

use strict;
use Sys::Virt;
use Sys::Guestfs;
use Sys::Guestfs::Lib qw(open_guest);

die "$0: recent version of libguestfs >= 1.5 is required\n"
    unless defined (Sys::Guestfs->can ("list_filesystems"));

die "$0 (disk.img|GuestName)\n" unless @ARGV >= 1;

my $g = open_guest (\@ARGV, rw => 1);
$g->launch ();

my %filesystems = $g->list_filesystems ();

foreach (keys %filesystems) {
    eval {
        $g->mount_options ("", $_, "/");

        print "filling empty space in $_ with zeroes ...\n";

        my $filename = "/deleteme.tmp";
        eval { $g->dd ("/dev/zero", $filename) };
        $g->sync (); # make sure the last part of the file is written
        $g->rm ($filename);
    };
    $g->umount_all ();
}

$g->sync ();


Freezing filesystems

Ric Wheeler and Christoph Hellwig were quick to point out I was wrong about something: Linux now has a standard API for freezing or “quiescing” filesystems.

Quiescing a filesystem lets you take a consistent snapshot or backup at the block device level. If your server uses SAN storage, then probably your SAN lets you take snapshots of the SCSI LUNs at any time. But if you try doing this while the server is under load you’ll (at best) get a “crash consistent” snapshot, where the journal has to be replayed when the copy of the filesystem is mounted, and at worst you’ll get data corruption, particularly with ext3 defaults.

Quiescing tells the filesystem to make things consistent at the disk / block device level. The journal won’t need to be replayed, and the superblock is marked as if you’d unmounted the device. A snapshot taken at this stage will be consistent, at least at the filesystem level (applications don’t know what is happening, so you could still see things like half-written transactions in databases).

The downside to quiescing a filesystem is that it generally causes writes to be blocked, eventually bringing the whole system to a grinding halt. SAN snapshots can be done very quickly though, so the time between a “freeze” and “thaw” operation is usually brief.

Very recent versions of util-linux-ng have an fsfreeze command that lets you freeze or thaw filesystems at the command line. Use with care!

Freezing filesystems also has an application for virtual machines. Our new guest agent will support freezing filesystems so that you can coordinate a consistent backup or snapshot from outside the guest.

If you have Rawhide and the most recent virt-rescue you can play with freezing filesystems without breaking anything:

$ rm -f test.img
$ truncate -s 1G test.img
$ virt-rescue test.img
><rescue> mkfs.ext4 /dev/vda
><rescue> mount /dev/vda /sysroot

From another window you can see that the image is not consistent. If you were to snapshot the image now the filesystem would at least require journal recovery when mounted:

$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (needs journal recovery) (extents) (large files) (huge files)

But by issuing fsfreeze in the guest we can make it consistent:

><rescue> fsfreeze -f /sysroot
$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge files)

... allowing us to take a snapshot or copy of the block device (test.img) in a consistent state.
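
Because writes stay blocked until the filesystem is thawed, a freeze should always be paired with a thaw, even if the snapshot step fails. Here is one way to sketch that as a shell wrapper. The function name with_frozen_fs and the RUN dry-run variable are invented for this example; fsfreeze -f and fsfreeze -u are the freeze and unfreeze options of the real command, which needs root.

```shell
# with_frozen_fs MOUNTPOINT COMMAND [ARGS...]
# Freeze MOUNTPOINT, run COMMAND (whatever takes the snapshot), then thaw
# again even if COMMAND failed. Returns COMMAND's exit status.
# RUN is a hypothetical dry-run hook: with RUN=echo the fsfreeze commands
# are printed instead of executed.
with_frozen_fs() {
    mountpoint=$1
    shift
    ${RUN:-} fsfreeze -f "$mountpoint" || return 1
    status=0
    "$@" || status=$?
    ${RUN:-} fsfreeze -u "$mountpoint"
    return "$status"
}

# Dry run: shows the order of operations without touching any filesystem.
RUN=echo
with_frozen_fs /sysroot echo "take snapshot of test.img here"
```

Unset RUN and replace the echo with your real snapshot command to use it in earnest.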


Tip: use prepared images in guestfish

In guestfish ≥ 1.3.6 you can save yourself a few keystrokes when testing by using a preformatted disk image on the command line:

$ guestfish -N fs:ext4

starts guestfish with an ext4 formatted filesystem in a 100 MB partitioned block device. Specify a different size for the block device by doing:

$ guestfish -N fs:ext4:1G

To start with just an empty block device, you can now do:

$ guestfish -N disk

There are various other prepared disk images available for you to use. For more information, see the git commit here.


New libguestfs tool: virt-make-fs

We posted a new tool today which you can use to make filesystems containing some prepopulated data.

Previously you’ve been able to create ISO and other read-only filesystems using programs like mkisofs and mksquashfs.

This program lets you do the same thing, but for any filesystem, for example an ext3 filesystem or NTFS.

Unlike ISO filesystems, ordinary filesystems don’t “just fit” around the files inside them. So the main difficulty has been to estimate the amount of space you have to allocate in order to fit your files with the least wasted space. This program automates the job with a simple estimator function, which should work for most cases and which we hope to improve over time.

Usage is simple (virt-make-fs man page). Start with either a tarball or a directory full of files to add, and just do:

virt-make-fs [--type=fstype] /inputdir output.img


Tip: extract a filesystem from a disk image

You’ve got a partitioned disk image. How do you pull just the filesystem(s) out of it? It’s easy with libguestfs tools:

$ virt-list-filesystems -al disk.img
/dev/sda1 ext4
/dev/vg_f12x32/lv_root ext4
/dev/vg_f12x32/lv_swap swap
$ virt-cat disk.img /dev/sda1 > boot.fs
$ file boot.fs
boot.fs: Linux rev 1.0 ext4 filesystem data (extents) (huge files)
$ virt-cat disk.img /dev/vg_f12x32/lv_root > root.fs

You can also use guestfish to examine the filesystem image:

$ guestfish -a boot.fs -m /dev/sda

Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.

Type: 'help' for help with commands
      'quit' to quit the shell

><fs> ll /
total 15941
dr-xr-xr-x.  5 root root     1024 Mar  8 19:37 .
dr-xr-xr-x  19 root root        0 Mar  8 13:40 ..
-rw-r--r--.  1 root root  1486532 Nov  7 21:38 System.map-2.6.31.5-127.fc12.i686.PAE
-rw-r--r--.  1 root root   103788 Nov  7 21:38 config-2.6.31.5-127.fc12.i686.PAE
drwxr-xr-x.  3 root root     1024 Mar  8 19:12 efi
drwxr-xr-x.  2 root root     1024 Mar  8 19:49 grub
-rw-r--r--.  1 root root 11253019 Mar  8 19:39 initramfs-2.6.31.5-127.fc12.i686.PAE.img
drwx------.  2 root root    12288 Mar  8 18:45 lost+found
-rwxr-xr-x.  1 root root  3454368 Nov  7 21:38 vmlinuz-2.6.31.5-127.fc12.i686.PAE

><fs> cat /grub/grub.conf 
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/mapper/vg_f12x32-lv_root
#          initrd /initrd-[generic-]version.img
#boot=/dev/sda
default=0
timeout=0
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Fedora (2.6.31.5-127.fc12.i686.PAE)
	root (hd0,0)
	kernel /vmlinuz-2.6.31.5-127.fc12.i686.PAE ro root=/dev/mapper/vg_f12x32-lv_root  LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=uk rhgb quiet
	initrd /initramfs-2.6.31.5-127.fc12.i686.PAE.img

><fs>


New tool: virt-list-filesystems

$ virt-list-filesystems Debian5x64.img
/dev/debian5x64/home
/dev/debian5x64/root
/dev/debian5x64/tmp
/dev/debian5x64/usr
/dev/debian5x64/var
/dev/sda1

You can also augment this tool with the -a and -l options. The -a option tells it to list swap partitions too. The -l option tells it to show the filesystem type on each partition that was found:

$ virt-list-filesystems -a -l Fedora12.img
/dev/sda1 ext4
/dev/vg_f12x64/lv_root ext4
/dev/vg_f12x64/lv_swap swap

While this is a fairly simple tool, the use case comes from a user who asked me how I knew what filesystems could be mounted using the guestmount command. The answer is that you don’t know, unless you know something about the guest, or you interactively examine the guest using guestfish, or just use this new tool.


Terabyte virtual disks

This is fun. I added a new command to guestfish which lets you create sparse disk files. This makes it really easy to test out the limits of partitions and Linux filesystems.

Starting modestly, I tried a 1 terabyte disk:

$ guestfish

Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.

Type: 'help' for help with commands
      'quit' to quit the shell

><fs> sparse /tmp/test.img 1T
><fs> run

The real disk image so far isn’t so big, just 4K according to “du”:

$ ll -h /tmp/test.img 
-rw-rw-r-- 1 rjones rjones 1T 2009-11-04 17:52 /tmp/test.img
$ du -h /tmp/test.img
4.0K	/tmp/test.img

Let’s partition it:

><fs> sfdiskM /dev/vda ,

The partition table only uses 1 sector, so the disk image has increased to just 8K. Let’s make an ext2 filesystem on the first partition:

><fs> mkfs ext2 /dev/vda1

This command takes some time, and the sparse disk file has grown to 17 GB, so ext2 has an overhead of approximately 1.7%.

We can mount the filesystem and look at it:

><fs> mount /dev/vda1 /
><fs> df-h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda1            1008G   72M  957G   1% /sysroot

Can we try this with larger and larger virtual disks? In theory yes, but in practice the 1.7% overhead proves to be a problem. A 10T experiment would require a very real 170GB of local disk space, and where I was hoping to go, 100T and beyond, would be too large for my test machines.
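
The arithmetic behind those numbers, as a small sketch. The only inputs are the figures measured above (17 GB written for a 1 TB disk); that the overhead scales linearly with disk size is an assumption.

```shell
# Figures from the experiment: 17 GB of metadata for a 1 TB (1024 GB) disk.
overhead_gb=17
disk_gb=1024

# Overhead as a percentage, to one decimal place.
pct=$(LC_ALL=C awk -v o="$overhead_gb" -v d="$disk_gb" \
    'BEGIN { printf "%.1f", o / d * 100 }')
echo "ext2 overhead: about $pct%"

# Assuming the overhead scales linearly, estimate the real host space
# needed just to run mkfs on larger virtual disks.
for tb in 10 100; do
    echo "${tb}T disk: about $(( tb * overhead_gb )) GB of host space"
done
```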

In fact there is another limitation before we reach there. Local sparse files on my host ext4 filesystem are themselves limited to under 16T:

><fs> sparse /tmp/test.img 16T
write: File too large
><fs> sparse /tmp/test.img 15T

Although the appliance does boot with that 15T virtual disk:

><fs> blockdev-getsize64 /dev/vda 
16492674416640

Update

I noticed from Wikipedia that XFS has a maximum file size of 8 exabytes – 1 byte. By creating a temporary XFS filesystem on the host, I was able to create a 256TB virtual disk:

><fs> sparse /mnt/tmp/test/test.img 256T
><fs> run
><fs> blockdev-getsize64 /dev/vda 
281474976710656

Unfortunately at this point things break down. MBR partitions won’t work on such a huge disk, or at least sfdisk can’t partition it correctly.
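
A quick calculation suggests why. This sketch checks the sizes reported above and compares them with the largest disk an MBR partition table can describe. It assumes 512-byte sectors; the 32-bit sector fields are part of the MBR format. This is a likely explanation rather than something I verified inside sfdisk.

```shell
# "T" in the sparse command means 2^40 bytes, as the blockdev-getsize64
# output above implies.
tib=$(( 1 << 40 ))
size_15t=$(( 15 * tib ))
size_256t=$(( 256 * tib ))
echo "15T  = $size_15t bytes"
echo "256T = $size_256t bytes"

# MBR stores partition start and length as 32-bit sector counts.
# Assuming 512-byte sectors, that caps the addressable space at 2 TiB.
mbr_limit=$(( (1 << 32) * 512 ))
echo "MBR limit = $mbr_limit bytes"
echo "256T is $(( size_256t / mbr_limit )) times the MBR limit"
```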

I’m not sure what my options are at this point, but at least this is an interesting experiment in hitting limitations.
