This is an excellent paper classifying bugs in Linux filesystems. The results seem to be generally applicable to bugs in open source kernel code.
Tag Archives: filesystems
Tip: copy out / dump filesystems from a guest
The scenario is that you’ve installed a guest in some host logical volume (or partition or iSCSI or whatever). Now you want to dump out the raw filesystem data for any filesystem from that guest.
It’s easy with guestfish …
$ sudo guestfish --ro -a /dev/vg/f14 \
run : download /dev/sda1 /tmp/diskimage
Explanation:
- I’m using “sudo” because guestfish needs root in order to read the guest’s disk (/dev/vg/f14). I could run the same command as non-root if I was accessing a guest disk that didn’t need root permissions, eg. from a file.
- Use the --ro option because I’m just reading out the contents.
- The run command launches the libguestfs back end. This is followed by the download command.
- /dev/sda1 is a filesystem inside the guest disk image, in this case, the /boot filesystem of a Fedora 14 guest. Use virt-filesystems or the guestfish list-filesystems command to list out filesystems in a random disk image.
- /tmp/diskimage is the target where I want to download the filesystem to
Because /tmp/diskimage is a filesystem, I can also open it with guestfish …
$ guestfish --ro -a /tmp/diskimage -m /dev/sda
><fs> ll /
total 56831
dr-xr-xr-x.  5 root root    1024 Oct 15 11:19 .
drwxr-xr-x  23 500  500     4096 Jan  4 19:49 ..
-rw-r--r--.  1 root root 2022930 May  6  2010 System.map-2.6.33.3-85.fc13.x86_64
-rw-r--r--.  1 root root 2228083 Sep 15 03:02 System.map-2.6.35.4-28.fc14.x86_64
-rw-r--r--.  1 root root 2154391 Oct 13 22:28 System.map-2.6.35.6-43.fc14.x86_64
[...]
In this case because I used the -m option to mount up the whole disk image, I don’t need the run command. (See this explanation for exactly why).
Filed under Uncategorized
Tip: Making a disk image sparse
Update: libguestfs ≥ 1.14 includes a new tool called virt-sparsify which can make guests sparse (thin-provisioned).
A sparse file is one where file blocks that would contain all zeroes are omitted from the file (and don’t take up any space in the filesystem). A sparse virtual disk image is the same sort of thing: blocks that the guest hasn’t written to yet are not stored by the host, and read as all zeroes. Sparse disk images can be implemented using sparse files on the host, or you can use a format like qcow2 which inherently supports sparse files.
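To make the idea concrete, here is a minimal Python sketch (not part of the original tip) that creates a sparse file and compares its apparent size with the space actually allocated on the host. The function name and the file path are made up for illustration, and it assumes a Linux host whose filesystem supports sparse files.

```python
import os
import tempfile


def make_sparse_file(path, size):
    """Create a file of `size` bytes without writing any data blocks.

    Returns (apparent_size, allocated_bytes).
    """
    with open(path, "wb") as f:
        f.truncate(size)  # extend the file; no data is written
    st = os.stat(path)
    apparent = st.st_size            # the size that "ls -l" reports
    allocated = st.st_blocks * 512   # the size that "du" reports (512-byte units)
    return apparent, allocated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        apparent, allocated = make_sparse_file(os.path.join(d, "test.img"), 64 << 20)
        print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

On a filesystem with sparse file support the allocated figure should be far smaller than the apparent size, which is the same gap that "ls" and "du" show for a sparse disk image.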
The problem with sparse files is that they gradually grow. When a guest writes a block it is allocated, and potentially this is never freed, even if the guest deletes the file or writes all zeroes to the block. [Eventually this problem will be solved by implementing the TRIM command which lets the host know that the guest no longer requires a block, but we're not quite there yet.]
This is of course a problem if you fill up the guest disk and then delete the files. The host file does not regain its sparseness.
How do you therefore sparsify a disk image?
There is a technique that you can use, which is simple to understand and implement, but it does require taking the guest offline.
First, fill the empty space in the guest with zeroes. A simple way to do this for a Linux guest is to run this command (run it within each guest filesystem):
dd if=/dev/zero of=zerofile bs=1M
# note that the 'dd' command fills up all free space and eventually fails
sync
rm zerofile
Now shut down the guest.
Copy the guest disk image using either qemu-img convert or cp --sparse=always. “cp” is the fastest but only works to sparsify a raw-format disk image:
cp --sparse=always guest-disk.img guest-disk-copy.img
A little-known feature of the qemu-img convert subcommand is that it automatically sparsifies any disk sector which contains all zeroes, and of course it can convert the format at the same time:
qemu-img convert -f raw -O qcow2 guest-disk.img guest-disk-copy.qcow2
Now the copy in both cases is sparsified, and hopefully a lot smaller than before.
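The effect of cp --sparse=always can be approximated in a few lines of Python. This is only a sketch of the idea (skip over blocks that are all zeroes instead of writing them), not how cp or qemu-img is actually implemented; the function name and the 4K block size are arbitrary choices for illustration.

```python
import os


def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst, leaving holes wherever src contains all-zero blocks."""
    zero = bytes(block_size)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            if block == zero[:len(block)]:
                # All zeroes: move the file position forward without writing,
                # so the host filesystem can leave a hole here.
                fout.seek(len(block), os.SEEK_CUR)
            else:
                fout.write(block)
        # If the file ends in a hole, nothing was written at the end, so set
        # the final length explicitly to the current position.
        fout.truncate()
```

The copy reads back byte-for-byte identical to the original; only the amount of space allocated on the host differs.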
Addendum: Instead of running “dd” by hand inside each guest, you can use the following libguestfs script to achieve the same (but note the guest must be shut down otherwise you will get disk corruption):
#!/usr/bin/perl -w
# ./phil-space.pl (disk.img|GuestName)
# Requires libguestfs >= 1.5.
use strict;
use Sys::Virt;
use Sys::Guestfs;
use Sys::Guestfs::Lib qw(open_guest);
die "$0: recent version of libguestfs >= 1.5 is required\n"
    unless defined (Sys::Guestfs->can ("list_filesystems"));
die "$0 (disk.img|GuestName)\n" unless @ARGV >= 1;
my $g = open_guest (\@ARGV, rw => 1);
$g->launch ();
my %filesystems = $g->list_filesystems ();
foreach (keys %filesystems) {
    eval {
        $g->mount_options ("", $_, "/");
        print "filling empty space in $_ with zeroes ...\n";
        my $filename = "/deleteme.tmp";
        eval { $g->dd ("/dev/zero", $filename) };
        $g->sync (); # make sure the last part of the file is written
        $g->rm ($filename);
    };
    $g->umount_all ();
}
$g->sync ();
Freezing filesystems
Ric Wheeler and Christoph Hellwig were quick to point out I was wrong about something: Linux now has a standard API for freezing or “quiescing” filesystems.
Quiescing a filesystem lets you take a consistent snapshot or backup at the block device level. If your server uses SAN storage, then probably your SAN lets you take snapshots of the SCSI LUNs at any time. But if you try doing this while the server is under load you’ll (at best) get a “crash consistent” snapshot, where the journal has to be replayed when the copy of the filesystem is mounted, and at worst you’ll get data corruption, particularly with ext3 defaults.
Quiescing tells the filesystem to make things consistent at the disk / block device level. The journal won’t need to be replayed, and the superblock is marked as if you’d unmounted the device. A snapshot taken at this stage will be consistent, at least at the filesystem level (applications don’t know what is happening, so you could still see things like half-written transactions in databases).
The downside to quiescing a filesystem is that it generally causes writes to be blocked, eventually bringing the whole system to a grinding halt. SAN snapshots can be done very quickly though, so the time between a “freeze” and “thaw” operation is usually brief.
Very recent versions of util-linux-ng have an fsfreeze command that lets you freeze or thaw filesystems at the command line. Use with care!
Freezing filesystems also has an application for virtual machines. Our new guest agent will support freezing filesystems so that you can coordinate a consistent backup or snapshot from outside the guest.
If you have Rawhide and the most recent virt-rescue you can play with freezing filesystems without breaking anything:
$ rm -f test.img
$ truncate -s 1G test.img
$ virt-rescue test.img
><rescue> mkfs.ext4 /dev/vda
><rescue> mount /dev/vda /sysroot
From another window you can see that the image is not consistent. If you were to snapshot the image now the filesystem would at least require journal recovery when mounted:
$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (needs journal recovery) (extents) (large files) (huge files)
But by issuing fsfreeze in the guest we can make it consistent:
><rescue> fsfreeze -f /sysroot
$ file test.img
test.img: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge files)
.. allowing us to take a snapshot or copy of the block device (test.img) in a consistent state.
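If you script this, the important detail is that the thaw must always happen, even when the snapshot step fails, otherwise writes stay blocked. Here is a minimal Python sketch of that pattern, assuming the fsfreeze command from util-linux is installed and the script runs as root. The helper name frozen and the take_snapshot placeholder are hypothetical and only for illustration.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen(mountpoint, run=subprocess.check_call):
    """Freeze `mountpoint` for the duration of the with-block, then thaw it.

    `run` is the function used to execute commands; it is a parameter only
    so that the logic can be exercised without actually freezing anything.
    """
    run(["fsfreeze", "-f", mountpoint])      # freeze: flush and block writes
    try:
        yield
    finally:
        run(["fsfreeze", "-u", mountpoint])  # thaw, even if the body failed


# Usage (as root); take_snapshot() stands in for your own snapshot step:
#
#     with frozen("/sysroot"):
#         take_snapshot()
```

Keeping the work inside the with-block short matters for the reason given above: everything that writes to the filesystem stalls until the thaw.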
Tip: use prepared images in guestfish
In guestfish ≥ 1.3.6 you can save yourself a few keystrokes when testing by using a preformatted disk image on the command line:
$ guestfish -N fs:ext4
starts guestfish with an ext4 formatted filesystem in a 100 MB partitioned block device. Specify a different size for the block device by doing:
$ guestfish -N fs:ext4:1G
To start with just an empty block device, you can now do:
$ guestfish -N disk
There are various other prepared disk images available for you to use. For more information, see the git commit here.
New libguestfs tool: virt-make-fs
We posted a new tool today which you can use to make filesystems containing some prepopulated data.
Previously you’ve been able to create ISO and other read-only filesystems using programs like mkisofs and mksquashfs.
This program lets you do the same thing, but for any filesystem, for example an ext3 filesystem or NTFS.
Unlike ISO filesystems, ordinary filesystems don’t “just fit” around the files inside them. And so the main difficulty has been to estimate the amount of space you have to allocate in order to fit your files, with the least wasted space. This program automates this with a simple estimator function which should work for most cases and we hope to improve over time.
Usage is simple (virt-make-fs man page). Start with either a tarball or a directory full of files to add, and just do:
virt-make-fs [--type=fstype] /inputdir output.img
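To give a feel for the estimation problem described above, here is a rough Python sketch of one possible estimator. This is not the estimator that virt-make-fs uses; the function name, the 4K block size and the 10% metadata allowance are all assumptions chosen purely for illustration.

```python
import os


def estimate_image_size(input_dir, block_size=4096, overhead=0.10):
    """Rough size in bytes of a filesystem image able to hold `input_dir`.

    Illustrative assumptions: every file and directory occupies a whole
    number of blocks (at least one), and filesystem metadata adds a fixed
    fraction on top.
    """
    blocks = 0
    for root, dirs, files in os.walk(input_dir):
        blocks += 1  # one block for the directory itself
        for name in files:
            size = os.lstat(os.path.join(root, name)).st_size
            # Round up to whole blocks; an empty file still costs one block.
            blocks += max(1, -(-size // block_size))
    return int(blocks * block_size * (1 + overhead))
```

A real estimator also has to account for things such as journals, inode tables and reserved blocks, which vary by filesystem type. That is why a single fixed overhead like this can only be a starting point.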
Tip: extract a filesystem from a disk image
You’ve got a partitioned disk image. How do you pull out just the filesystem(s) from it? It’s easy with libguestfs tools:
$ virt-list-filesystems -al disk.img
/dev/sda1 ext4
/dev/vg_f12x32/lv_root ext4
/dev/vg_f12x32/lv_swap swap
$ virt-cat disk.img /dev/sda1 > boot.fs
$ file boot.fs
boot.fs: Linux rev 1.0 ext4 filesystem data (extents) (huge files)
$ virt-cat disk.img /dev/vg_f12x32/lv_root > root.fs
You can also use guestfish to examine the filesystem image:
$ guestfish -a boot.fs -m /dev/sda
Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.
Type: 'help' for help with commands
'quit' to quit the shell
><fs> ll /
total 15941
dr-xr-xr-x. 5 root root 1024 Mar 8 19:37 .
dr-xr-xr-x 19 root root 0 Mar 8 13:40 ..
-rw-r--r--. 1 root root 1486532 Nov 7 21:38 System.map-2.6.31.5-127.fc12.i686.PAE
-rw-r--r--. 1 root root 103788 Nov 7 21:38 config-2.6.31.5-127.fc12.i686.PAE
drwxr-xr-x. 3 root root 1024 Mar 8 19:12 efi
drwxr-xr-x. 2 root root 1024 Mar 8 19:49 grub
-rw-r--r--. 1 root root 11253019 Mar 8 19:39 initramfs-2.6.31.5-127.fc12.i686.PAE.img
drwx------. 2 root root 12288 Mar 8 18:45 lost+found
-rwxr-xr-x. 1 root root 3454368 Nov 7 21:38 vmlinuz-2.6.31.5-127.fc12.i686.PAE
><fs> cat /grub/grub.conf
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE: You have a /boot partition. This means that
# all kernel and initrd paths are relative to /boot/, eg.
# root (hd0,0)
# kernel /vmlinuz-version ro root=/dev/mapper/vg_f12x32-lv_root
# initrd /initrd-[generic-]version.img
#boot=/dev/sda
default=0
timeout=0
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Fedora (2.6.31.5-127.fc12.i686.PAE)
root (hd0,0)
kernel /vmlinuz-2.6.31.5-127.fc12.i686.PAE ro root=/dev/mapper/vg_f12x32-lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=uk rhgb quiet
initrd /initramfs-2.6.31.5-127.fc12.i686.PAE.img
><fs>
New tool: virt-list-filesystems
$ virt-list-filesystems Debian5x64.img
/dev/debian5x64/home
/dev/debian5x64/root
/dev/debian5x64/tmp
/dev/debian5x64/usr
/dev/debian5x64/var
/dev/sda1
You can also augment this tool with the -a and -l options. The -a option tells it to list swap partitions too. The -l option tells it to show the filesystem type on each partition that was found:
$ virt-list-filesystems -a -l Fedora12.img
/dev/sda1 ext4
/dev/vg_f12x64/lv_root ext4
/dev/vg_f12x64/lv_swap swap
While this is a fairly simple tool, the use case comes from a user who asked me how I knew what filesystems could be mounted using the guestmount command. The answer is that you don’t know, unless you know something about the guest, or you interactively examine the guest using guestfish, or just use this new tool.
Terabyte virtual disks
This is fun. I added a new command to guestfish which lets you create sparse disk files. This makes it really easy to test out the limits of partitions and Linux filesystems.
Starting modestly, I tried a 1 terabyte disk:
$ guestfish
Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.
Type: 'help' for help with commands
'quit' to quit the shell
><fs> sparse /tmp/test.img 1T
><fs> run
The real disk image so far isn’t so big, just 4K according to “du”:
$ ll -h /tmp/test.img
-rw-rw-r-- 1 rjones rjones 1T 2009-11-04 17:52 /tmp/test.img
$ du -h /tmp/test.img
4.0K /tmp/test.img
Let’s partition it:
><fs> sfdiskM /dev/vda ,
The partition table only uses 1 sector, so the disk image has increased to just 8K. Let’s make an ext2 filesystem on the first partition:
><fs> mkfs ext2 /dev/vda1
This command takes some time, and the sparse disk file has grown to 17 GB, so ext2 has approximately 1.7% overhead.
We can mount the filesystem and look at it:
><fs> mount /dev/vda1 /
><fs> df-h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 1008G 72M 957G 1% /sysroot
Can we try this with larger and larger virtual disks? In theory yes, in practice the 1.7% overhead proves to be a problem. A 10T experiment would require a very real 170GB of local disk space, and where I was hoping to go, 100T and beyond, would be too large for my test machines.
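The arithmetic behind those figures is a simple linear extrapolation from the one measurement above (about 17 GB consumed for a 1T disk). The sketch below assumes that the overhead scales linearly with disk size, which was not verified for larger disks; the names are made up for illustration.

```python
# Observed above: after mkfs ext2, a 1T sparse disk had grown to about 17 GB.
OVERHEAD_GB_PER_TB = 17


def ext2_overhead_gb(disk_tb):
    """Estimated real disk space (in GB) consumed by mkfs on a disk_tb-terabyte disk."""
    return disk_tb * OVERHEAD_GB_PER_TB


if __name__ == "__main__":
    print(f"overhead: about {OVERHEAD_GB_PER_TB / 1024:.1%} of the disk")
    for tb in (1, 10, 100):
        print(f"{tb}T virtual disk: about {ext2_overhead_gb(tb)} GB of real disk space")
```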
In fact there is another limitation before we reach there. Local sparse files on my host ext4 filesystem are themselves limited to under 16T:
><fs> sparse /tmp/test.img 16T
write: File too large
><fs> sparse /tmp/test.img 15T
Although the appliance does boot with that 15T virtual disk:
><fs> blockdev-getsize64 /dev/vda
16492674416640
Update
I noticed from Wikipedia that XFS has a maximum file size of 8 exabytes minus 1 byte. By creating a temporary XFS filesystem on the host, I was able to create a 256TB virtual disk:
><fs> sparse /mnt/tmp/test/test.img 256T
><fs> run
><fs> blockdev-getsize64 /dev/vda
281474976710656
Unfortunately at this point things break down. MBR partitions won’t work on such a huge disk, or at least sfdisk can’t partition it correctly.
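A likely explanation is the MBR format itself: each partition entry stores its start and length as 32-bit sector counts, so with the traditional 512-byte sectors a partition cannot describe more than 2T. The following Python sketch works through the numbers. The 512-byte sector size is an assumption, and the sizes are the ones reported by blockdev-getsize64 above.

```python
SECTOR_SIZE = 512   # bytes; the traditional sector size (assumed here)
TIB = 2 ** 40       # the "T" suffix used with the sparse command above

# MBR partition entries hold 32-bit sector numbers.
mbr_max_bytes = (2 ** 32) * SECTOR_SIZE

# The sizes reported by blockdev-getsize64 in the experiments above.
assert 15 * TIB == 16492674416640
assert 256 * TIB == 281474976710656

if __name__ == "__main__":
    print(f"largest MBR-addressable size: {mbr_max_bytes // TIB}T")
    print(f"a 256T disk is {256 * TIB // mbr_max_bytes} times larger than that")
```

A partition table format with 64-bit sector numbers, such as GPT, does not have this particular limit, although that was not tried in this experiment.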
I’m not sure what my options are at this point, but at least this is an interesting experiment in hitting limitations.
