Tag Archives: ext4

Ridiculously big “files”

In the last post I showed how you can combine nbdfuse with nbdkit’s RAM disk to mount a RAM disk as a local file. In a talk I gave at FOSDEM last year I described creating absurdly large RAM-backed filesystems, and you can now do the same thing to create ridiculously big “files”. Here’s a 7 exabyte file:

$ touch /var/tmp/disk.img
$ nbdfuse /var/tmp/disk.img --command nbdkit -s memory 7E &
$ ll /var/tmp/disk.img 
 -rw-rw-rw-. 1 rjones rjones 8070450532247928832 Nov  4 13:37 /var/tmp/disk.img
$ ls -lh /var/tmp/disk.img 
 -rw-rw-rw-. 1 rjones rjones 7.0E Nov  4 13:37 /var/tmp/disk.img
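
As a quick sanity check on that number, here is a minimal sketch of the arithmetic. It assumes nbdkit's "E" suffix means 2^60 bytes, which is consistent with the byte count that ls prints above. The variable names are only for illustration.

```shell
#!/bin/sh
# 7E = 7 * 2^60 bytes (assumption: the suffix is a power of 1024).
size_bytes=$(( 7 * (1 << 60) ))
echo "$size_bytes"

# Largest offset a signed 64-bit off_t can represent (2^63 - 1).
max_off_t=9223372036854775807

# Headroom left before the file size would overflow off_t.
headroom=$(( max_off_t - size_bytes ))
echo "$headroom"
```

The second number shows that the file is within 1 EiB of the largest size a signed 64-bit file offset can describe.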

What can you actually do with this file, and more importantly does anything break? As in the talk, creating a Btrfs filesystem boringly just works. mkfs.ext4 spins using 100% of CPU. I let it go for 15 minutes but it seemed no closer to either succeeding or crashing. Emacs said:

File disk.img is large (7 EiB), really open? (y)es or (n)o or (l)iterally

and I was too chicken to find out what it would do if I really opened it.

I do wonder if there’s a DoS attack here if I leave this seemingly massive regular file lying around in a public directory.


Filed under Uncategorized

nbdkit for loopback pt 2: injecting errors

nbdkit is a pluggable NBD server with a filter system that you can layer over plugins to transform block devices. One of the filters is the error filter which lets you inject errors. We can use this to find out how well filesystems cope with errors and recovering from errors.

$ rm -f /tmp/inject
$ nbdkit -fv --filter=error memory size=$(( 2**32 )) \
    error-rate=100% error-file=/tmp/inject
# nbd-client localhost /dev/nbd0

We can create a filesystem normally:

# sgdisk -n 1 /dev/nbd0
# gdisk -l /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024         4194286   4.0 GiB     8300  
# mkfs.ext4 /dev/nbd0p1
# mount /dev/nbd0p1 /mnt

It’s very interesting watching the verbose output of nbdkit -fv because you can see the lazy metadata creation which the Linux ext4 kernel driver carries out in the background after you mount the filesystem the first time.

So far we have not injected any errors. To do that we create the error-file (/tmp/inject). The error filter notices it and responds by injecting EIO errors until we remove the file:

# touch /tmp/inject
# ls /mnt
ls: reading directory '/mnt': Input/output error
# rm /tmp/inject
# ls /mnt

Ext4 recovered once we stopped injecting errors, but …

# touch /mnt/hello
touch: cannot touch '/mnt/hello': Read-only file system

… it responded to the error by remounting the filesystem read-only. Interestingly I was not able to simply remount the filesystem read-write. Ext4 forced me to unmount the filesystem and run e2fsck before I could mount it again.

e2fsck also said:

e2fsck: Unknown code ____ 251 while recovering journal of /dev/nbd0p1

which I guess is a bug (already found upstream).
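
If you want to repeat this kind of test, the touch / run / rm sequence above can be wrapped up so the trigger file is always removed afterwards. This is only a sketch: with_injected_errors is a hypothetical helper name, not part of nbdkit, and it assumes the error filter is watching the file you pass in.

```shell
#!/bin/sh
# Hypothetical helper: create the trigger file, run a command while
# errors are being injected, then remove the trigger again and return
# the command's exit status.
with_injected_errors() {
    trigger=$1
    shift
    touch "$trigger"
    status=0
    "$@" || status=$?
    rm -f "$trigger"
    return "$status"
}

# Example use (needs the nbdkit and mount setup described above):
#   with_injected_errors /tmp/inject ls /mnt
```

The helper returns whatever the wrapped command returned, so a failing ls shows up as a non-zero exit status.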


Notes on producing a minimal, compressed filesystem

You can save the minimal, compressed filesystem from a guest in a way that lets you reconstruct that filesystem (or guest) later very quickly. I’m writing an experimental tool called virt-builder which will make this automatic, but here’s how to make the minimal, compressed filesystem manually.

I’m starting here with a freshly installed Fedora 13 i686 guest. Firstly we see what filesystems it contains:

$ virt-list-filesystems -al /dev/vg_pin/TmpF13
/dev/sda1 ext4
/dev/vg_virtbuilder/lv_root ext4
/dev/vg_virtbuilder/lv_swap swap

We have a boot partition (/dev/sda1), a root filesystem (/dev/vg_virtbuilder/lv_root), and a swap partition. We can just ignore swap since it doesn’t contain any information and we can reconstruct it at will.

There is also some extra information “hidden” in the disk and not in a partition, namely the boot sector, partition table, and (maybe) GRUB stages. For this guest only the boot sector is interesting, but other guests may have a boot loader located between the boot sector and the first partition. I’m going to ignore these for now, although virt-builder will need to restore them in order to make a bootable guest.

Let’s grab the boot and root partitions using guestfish:

$ guestfish --ro -i -a /dev/vg_pin/TmpF13 \
    download /dev/sda1 fedora-13-i686-boot.img
$ guestfish --progress --ro -i -a /dev/vg_pin/TmpF13 \
    download /dev/vg_virtbuilder/lv_root fedora-13-i686-root.img
 100% ⟦▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉⟧ 00:00

The boot partition is small enough that I’m just going to leave it as it is for now. But I need to shrink the root filesystem. The reason is that I might want to restore this to a different-sized, possibly smaller guest than the one I just created. There is only about 2GB of data in the root filesystem, but as an ext4 filesystem it is taking up about 6GB of space. The resize2fs command has a handy -M option which makes this very simple:

$ e2fsck -f fedora-13-i686-root.img
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Fedora-13-i686-L: 91151/368640 files (0.1% non-contiguous), 531701/1458176 blocks
$ resize2fs -M fedora-13-i686-root.img
resize2fs 1.41.12 (17-May-2010)
Resizing the filesystem on fedora-13-i686-root.img to 520344 (4k) blocks.
The filesystem on fedora-13-i686-root.img is now 520344 blocks long.

Note the new size: 520344 x 4k blocks = 2081376k. We can therefore truncate the file down to the new size without any loss of information:

$ truncate -s 2081376k fedora-13-i686-root.img
$ ls -lh
total 2.5G
-rw-rw-r--. 1 rjones rjones 500M Oct 30 10:21 fedora-13-i686-boot.img
-rw-rw-r--. 1 rjones rjones 2.0G Oct 30 10:30 fedora-13-i686-root.img
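
The size passed to truncate can be derived from the resize2fs output rather than worked out by hand. A small sketch, with the block count and block size copied from the output above (the variable names are mine):

```shell
#!/bin/sh
blocks=520344    # block count reported by resize2fs -M
block_kib=4      # filesystem block size in KiB ("4k")

size_kib=$(( blocks * block_kib ))

# prints: truncate -s 2081376k fedora-13-i686-root.img
echo "truncate -s ${size_kib}k fedora-13-i686-root.img"

# The same size in GiB, for comparison with ls -lh.
size_gib=$(awk -v k="$size_kib" 'BEGIN { printf "%.1f", k / 1048576 }')
echo "${size_gib} GiB"
```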

I haven’t finished yet. I now need to compress both filesystems so I have something small(ish) and portable. I performed some tests, and xz was the clear winner in terms of final compressed size, although it does take a very long time to run with the -9 option.

$ xz -9 *.img
$ ls -lh
total 652M
-rw-rw-r--. 1 rjones rjones 250M Oct 30 10:21 fedora-13-i686-boot.img.xz
-rw-rw-r--. 1 rjones rjones 403M Oct 30 10:30 fedora-13-i686-root.img.xz

From the two files above, plus the boot sector stuff that I glossed over, it is possible to reconstruct a Fedora 13 VM of any dimensions quickly (in about 1 minute 15 seconds on my machine). Note the comments are not part of the command, and the guest is fully bootable:

guestfish -x -- \
# create a disk image of any size
  alloc test.img 10G : \
  run : \
# create the partitions
  part-init /dev/sda mbr : \
  part-add /dev/sda primary 2048 1026047 : \
  part-add /dev/sda primary 1026048 -64 : \
# upload the boot loader
  upload fedora-13-i686-grub.img /dev/sda : \
# upload the boot partition
  upload <( xzcat fedora-13-i686-boot.img.xz ) /dev/sda1 : \
# create the logical volumes for root and swap
  pvcreate /dev/sda2 : \
  vgcreate vg_virtbuilder /dev/sda2 : \
  lvcreate lv_swap vg_virtbuilder 512 : \
  lvcreate lv_root vg_virtbuilder 256 : \
  lvresize-free /dev/vg_virtbuilder/lv_root 100 : \
# make fresh swap
  mkswap /dev/vg_virtbuilder/lv_swap : \
# upload and resize the root filesystem
  upload <( xzcat fedora-13-i686-root.img.xz ) /dev/vg_virtbuilder/lv_root : \
  resize2fs /dev/vg_virtbuilder/lv_root



Is ext2/3/4 faster? On LVM?

This question arose at work — is LVM a performance penalty compared to using straight partitions? To save you the trouble, the answer is “not really”. There is a very small penalty, but as with all benchmarks it does depend on what the benchmark measures versus what your real workload does. In any case, here is a small guestfish script you can use to compare the performance of various filesystems with or without LVM, with various operations. Whether you trust the results is up to you, but I would advise caution.

#!/bin/bash -

# Scratch disk image (assumed path; any writable location will do).
tmpfile=/tmp/test.img

for fs in ext2 ext3 ext4; do
    for lvm in off on; do
        rm -f $tmpfile
        if [ $lvm = "on" ]; then
            guestfish <<EOF
              sparse $tmpfile 1G
              run
              part-disk /dev/sda efi
              pvcreate /dev/sda1
              vgcreate VG /dev/sda1
              lvcreate LV VG 800
              mkfs $fs /dev/VG/LV
EOF
            dev=/dev/VG/LV
        else # no LVM
            guestfish <<EOF
              sparse $tmpfile 1G
              run
              part-disk /dev/sda efi
              mkfs $fs /dev/sda1
EOF
            dev=/dev/sda1
        fi
        echo "fs=$fs lvm=$lvm"
        # Mount the filesystem just created and time two operations on it.
        guestfish -a $tmpfile -m $dev <<EOF
          time fallocate /file1 200000000
          time cp /file1 /file2
EOF
    done
done

fs=ext2 lvm=off
elapsed time: 2.74 seconds
elapsed time: 4.52 seconds
fs=ext2 lvm=on
elapsed time: 2.60 seconds
elapsed time: 4.24 seconds
fs=ext3 lvm=off
elapsed time: 2.62 seconds
elapsed time: 4.31 seconds
fs=ext3 lvm=on
elapsed time: 3.07 seconds
elapsed time: 4.79 seconds

# notice how ext4 is much faster at fallocate, because it
# uses extents

fs=ext4 lvm=off
elapsed time: 0.05 seconds
elapsed time: 3.54 seconds
fs=ext4 lvm=on
elapsed time: 0.05 seconds
elapsed time: 4.16 seconds
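
The raw output is easier to compare as one row per configuration. Here is a rough sketch that reshapes it with awk. It assumes each "fs=... lvm=..." header is followed by exactly two "elapsed time" lines, fallocate first and cp second, as in the output above. The function name summarise is just for illustration.

```shell
#!/bin/sh
# Collapse each header plus its two timings into a single row.
summarise() {
    awk '
        /^fs=/ { cfg = $0; n = 0; next }
        /^elapsed time:/ {
            t[++n] = $3
            if (n == 2)
                printf "%-16s fallocate=%ss cp=%ss\n", cfg, t[1], t[2]
        }
    '
}

summary=$(summarise <<'EOF'
fs=ext2 lvm=off
elapsed time: 2.74 seconds
elapsed time: 4.52 seconds
fs=ext2 lvm=on
elapsed time: 2.60 seconds
elapsed time: 4.24 seconds
fs=ext3 lvm=off
elapsed time: 2.62 seconds
elapsed time: 4.31 seconds
fs=ext3 lvm=on
elapsed time: 3.07 seconds
elapsed time: 4.79 seconds
fs=ext4 lvm=off
elapsed time: 0.05 seconds
elapsed time: 3.54 seconds
fs=ext4 lvm=on
elapsed time: 0.05 seconds
elapsed time: 4.16 seconds
EOF
)
echo "$summary"
```

You could pipe the benchmark script straight into summarise instead of pasting the output into a here-document.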



mkfs compared on different filesystems

How long does it take to mkfs a 10GB disk with all the different filesystems out there?

See my test results here using the new guestfish sparse / filesystem support. btrfs is “best” and ext3 comes off “worst”.

As a test this is interesting, but it’s not that relevant for most users — they will be most interested in how well the filesystem performs for their workload, which is not affected by mkfs time and hard to measure in general benchmarks anyway.


In response to Stephen’s comment, I retested this using a memory-backed block device so there is no question about whether the host backing store affects the test:

$ for fs in ext2 ext3 ext4 xfs jfs reiserfs nilfs2 ntfs msdos btrfs hfs hfsplus gfs gfs2
    do guestfish sparse /dev/shm/test.img 10G : run : echo $fs : sfdiskM /dev/sda , : \
        time mkfs $fs /dev/sda1
    done
ext2
elapsed time: 1.45 seconds
ext3
elapsed time: 2.71 seconds
ext4
elapsed time: 2.58 seconds
xfs
elapsed time: 0.13 seconds
jfs
elapsed time: 0.27 seconds
reiserfs
elapsed time: 0.33 seconds
nilfs2
elapsed time: 0.08 seconds
ntfs
elapsed time: 2.07 seconds
msdos
elapsed time: 0.14 seconds
btrfs
elapsed time: 0.07 seconds
hfs
elapsed time: 0.17 seconds
hfsplus
elapsed time: 0.17 seconds
gfs
elapsed time: 0.84 seconds
gfs2
elapsed time: 2.76 seconds
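
To rank the filesystems, the timings can be paired with the names and sorted. This sketch assumes the timings above were printed in the same order as the names in the for loop, which is what the loop should produce. The variable names are mine.

```shell
#!/bin/sh
names="ext2 ext3 ext4 xfs jfs reiserfs nilfs2 ntfs msdos btrfs hfs hfsplus gfs gfs2"
times="1.45 2.71 2.58 0.13 0.27 0.33 0.08 2.07 0.14 0.07 0.17 0.17 0.84 2.76"

# Walk both lists in step, emit "seconds name", then sort fastest first.
ranked=$(
    set -- $times
    for fs in $names; do
        printf '%s %s\n' "$1" "$fs"
        shift
    done | LC_ALL=C sort -n
)
echo "$ranked"
```

In this retest btrfs is quickest at 0.07 seconds and gfs2 is slowest at 2.76 seconds, just behind ext3.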



Terabyte virtual disks

This is fun. I added a new command to guestfish which lets you create sparse disk files. This makes it really easy to test out the limits of partitions and Linux filesystems.

Starting modestly, I tried a 1 terabyte disk:

$ guestfish

Welcome to guestfish, the libguestfs filesystem interactive shell for
editing virtual machine filesystems.

Type: 'help' for help with commands
      'quit' to quit the shell

><fs> sparse /tmp/test.img 1T
><fs> run

The real disk image so far isn’t so big, just 4K according to “du”:

$ ll -h /tmp/test.img 
-rw-rw-r-- 1 rjones rjones 1T 2009-11-04 17:52 /tmp/test.img
$ du -h /tmp/test.img
4.0K	/tmp/test.img
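
You can see the same apparent-size versus allocated-size difference without guestfish. This sketch uses coreutils truncate as a stand-in for the guestfish sparse command, with a smaller 1G file. It assumes GNU stat and a host filesystem that supports sparse files.

```shell
#!/bin/sh
img=$(mktemp)

# Extend the empty file to 1G without writing any data.
truncate -s 1G "$img"

# Apparent size in bytes, as reported by ls -l.
apparent=$(stat -c %s "$img")

# Space actually allocated: block count times the size of each block.
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))

echo "apparent=$apparent allocated=$allocated"
rm -f "$img"
```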

Let’s partition it:

><fs> sfdiskM /dev/vda ,

The partition table only uses 1 sector, so the disk image has increased to just 8K. Let’s make an ext2 filesystem on the first partition:

><fs> mkfs ext2 /dev/vda1

This command takes some time, and the sparse disk file has grown to 17 GB, so ext2 has approximately 1.7% overhead.

We can mount the filesystem and look at it:

><fs> mount /dev/vda1 /
><fs> df-h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda1            1008G   72M  957G   1% /sysroot

Can we try this with larger and larger virtual disks? In theory yes, in practice the 1.7% overhead proves to be a problem. A 10T experiment would require a very real 170GB of local disk space, and where I was hoping to go, 100T and beyond, would be too large for my test machines.
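
The extrapolation is just a linear scaling of the 17 GB measured for the 1T disk. Here is a sketch of it, assuming the overhead really does grow linearly with disk size and that 1T means 1024 GB. The function name overhead_gb is mine.

```shell
#!/bin/sh
# Real disk space (in GB) that mkfs.ext2 would consume for a virtual
# disk of the given size in TB, scaled from 17 GB per 1T.
overhead_gb() {
    echo $(( $1 * 17 ))
}

# Overhead as a percentage of the virtual disk size.
overhead_pct=$(awk 'BEGIN { printf "%.1f", 17 / 1024 * 100 }')
echo "overhead: ${overhead_pct}%"

for tb in 1 10 100; do
    echo "${tb}T virtual disk -> about $(overhead_gb "$tb")GB of real space"
done
```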

In fact there is another limitation before we reach there. Local sparse files on my host ext4 filesystem are themselves limited to under 16T:

><fs> sparse /tmp/test.img 16T
write: File too large
><fs> sparse /tmp/test.img 15T

Although the appliance does boot with that 15T virtual disk:

><fs> blockdev-getsize64 /dev/vda 


I noticed from Wikipedia that XFS has a maximum file size of 8 exabytes – 1 byte. By creating a temporary XFS filesystem on the host, I was able to create a 256TB virtual disk:

><fs> sparse /mnt/tmp/test/test.img 256T
><fs> run
><fs> blockdev-getsize64 /dev/vda 

Unfortunately at this point things break down. MBR partitions won’t work on such a huge disk, or at least sfdisk can’t partition it correctly.

I’m not sure what my options are at this point, but at least this is an interesting experiment in hitting limitations.
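
For reference, both limits hit above fall out of 32-bit counters. This is a sketch of the arithmetic, not something measured in the experiment. It assumes 512-byte sectors for MBR and 4 KiB blocks on the host ext4 filesystem.

```shell
#!/bin/sh
tib=$(( 1 << 40 ))

# MBR partition entries hold 32-bit sector numbers.
mbr_limit=$(( (1 << 32) * 512 ))

# ext4 addresses file contents with 32-bit logical block numbers.
ext4_file_limit=$(( (1 << 32) * 4096 ))

disk=$(( 256 * tib ))

echo "MBR addressable size: $(( mbr_limit / tib )) TiB"
echo "ext4 max file size:   $(( ext4_file_limit / tib )) TiB"
echo "256T disk is $(( disk / mbr_limit ))x the MBR limit"
```

Under these assumptions a file of exactly 16T is one byte too large for ext4, while 15T fits, and a 256T disk is far beyond what an MBR partition table can describe.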



Half-baked ideas: Cluster ext4

Half-bakery is a great website for amusing ideas.

I’m going to put my half-baked ideas under a special Ideas tag on this site.

My first idea is a “cluster” version of ext4 for the special case where there is one writer and multiple readers.

Virt newbies commonly think this should “just work”: create an ext4 filesystem in the host and export it read-only to all the guests. What could possibly go wrong?

In fact this doesn’t work, and guests will see corrupt data depending on how often the filesystem gets updated. The reason is that the guest kernel caches old parts of the filesystem and/or can be reading metadata which the host is simultaneously updating.

Another problem is that a guest process could open a file which the host then deletes (reusing the blocks). Really the host should be aware of which files and directories guests have open and keep those around.

So it doesn’t work, but can it be made to work with some small changes to ext4 in the kernel?

You obviously need a communications path from the guests back to the host. Guests could use this to “reserve” or mark their interest in files, which the host would treat as if a local process had opened them. (In fact if the hypervisor is qemu, it could open(2) these files on the host side.)

This is a really useful feature for virtualization. Uses include: exporting any read-only data to guests (documents, data). Updateable shared /usr for guests. Mirrors of yum/apt repositories.

Related: virtio filesystem (what happened to that plan9 fs for KVM?), paravirt filesystems.


If qemu is going to open files on the host side, why not go the whole way and implement a paravirtualized filesystem? It wouldn’t need to be limited to just ext4 on the host side.

But how would we present it on the guest side? Presenting a read-only ext2 filesystem on the guest side is tempting, but not feasible. The problem again is what to do when files disappear on the host side — there is no way to communicate this to the guest except to give fake EIO errors which is hardly ideal. In any case qemu can already export directories as “virtual FAT filesystems”. I don’t know anyone who has a good word to say about this (mis-)feature.

So it looks like however it is done, there is a requirement for the guest to communicate its intentions to the host, even though the guest still would not be able to write.

