xz plugin for nbdkit

I’ve now written an xz plugin for nbdkit (previous discussion on this blog).

This is useful if you’re building up a library of xz-compressed disk images using virt-sparsify and xz, and you want to access them without having to uncompress them.

I certainly learned a lot about the xz file format and liblzma this weekend …

The xz file format consists of multiple streams (but usually one). Each stream contains zero or more blocks of compressed data, followed by an “index”. Like zip, an xz file is read from the end: the block index sits at the end of the stream (which allows xz files to be written as a stream, without any reverse seeks).

Crucially, the index records each block’s offset both in the actual xz file and in the uncompressed data, so once you’ve read the index from a file you can find the block containing any uncompressed byte, seek to the beginning of that block, and read the data. Random access!
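To make the “everything from the end” layout concrete, here is a short sketch (in Python rather than the plugin’s C, and simplified to a single in-memory stream with no stream padding) that locates the index by decoding the fixed 12-byte stream footer:

```python
import lzma
import struct

def locate_index(xz_bytes):
    """Return (offset, size) of the block index in a single-stream
    .xz file held in memory.  The stream footer is the last 12 bytes:
    CRC32 (4) | Backward Size (4) | Stream Flags (2) | magic "YZ" (2).
    Backward Size is stored as (index size / 4) - 1."""
    footer = xz_bytes[-12:]
    crc32, backward_size, flags, magic = struct.unpack("<II2s2s", footer)
    assert magic == b"YZ", "not an xz stream footer"
    index_size = (backward_size + 1) * 4
    index_offset = len(xz_bytes) - 12 - index_size
    # The index itself starts with a 0x00 "index indicator" byte.
    assert xz_bytes[index_offset] == 0
    return index_offset, index_size

data = lzma.compress(b"hello, xz" * 1000, format=lzma.FORMAT_XZ)
off, size = locate_index(data)
```

The real plugin of course handles multi-stream files and decodes the index records themselves via liblzma, but the principle is the same: one backward read of the footer tells you where the index is.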

Preparing xz files correctly is important for getting good random-access performance with low memory overhead:

$ xz --list /tmp/winxp.img.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1     384  2,120.1 MiB  6,144.0 MiB  0.345  CRC64   /tmp/winxp.img.xz

A file with lots of small blocks like the one above (16 MB block size) is relatively cheap to seek inside: at most 16 MB of data has to be uncompressed to reach any byte.
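What a reader does with the decoded index amounts to a sorted-offset lookup. This is a hypothetical, much-simplified Python sketch (the real index also records each block’s compressed offset, sizes and a checksum):

```python
import bisect

# Simplified view of a decoded index: each block's starting offset in
# the uncompressed data, for a file compressed with 16 MiB blocks.
block_starts = [0, 16 << 20, 32 << 20, 48 << 20]

def block_containing(uncompressed_offset):
    """Find which block holds a given uncompressed byte.  The reader
    then seeks to that block's start in the compressed file and
    decompresses at most one block (here, at most 16 MiB) to reach
    the requested byte."""
    return bisect.bisect_right(block_starts, uncompressed_offset) - 1

block_containing(20 << 20)  # byte at 20 MiB falls in block 1
```

With a one-block file the same lookup degenerates to “block 0, decompress everything”, which is why block size matters so much here.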

Perhaps ironically, if your machine has lots of free memory then xz appears to choose a large block size, resulting in some one-block files. Here’s the same file when I originally compressed it for my guest library:

$ xz --list guest-library/winxp.img.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1  2,100.0 MiB  6,144.0 MiB  0.342  CRC64   guest-library/winxp.img.xz

So unfortunately you may need to recompress some of your xz files using the new xz --block-size option:

$ xz --best --block-size=$((16*1024*1024)) winxp.img
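The effect of --block-size is easy to sanity-check on a scaled-down example. This Python sketch (file names made up; assumes the xz tool, ≥ 5.1 for --block-size and --robot, is on $PATH) compresses a 4 MiB file with 1 MiB blocks and reads the block count back out of the machine-readable listing:

```python
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as d:
    img = os.path.join(d, "test.img")
    with open(img, "wb") as f:
        f.write(b"\0" * (4 * 1024 * 1024))   # 4 MiB test file

    # Scaled-down version of the command above: 1 MiB blocks.
    subprocess.run(["xz", "--block-size=%d" % (1024 * 1024), img],
                   check=True)

    # "xz --robot --list" prints a tab-separated "file" row whose
    # third column is the number of blocks.
    out = subprocess.run(["xz", "--robot", "--list", img + ".xz"],
                         capture_output=True, text=True, check=True)
    file_row = [l for l in out.stdout.splitlines()
                if l.startswith("file")][0]
    blocks = int(file_row.split("\t")[2])
```

Since --block-size is measured against the uncompressed input, a 4 MiB file should come out as 4 blocks regardless of how well each block compresses.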

Here’s how you use the new nbdkit xz plugin:

$ nbdkit plugins/nbdkit-xz-plugin.so file=winxp.img.xz
$ guestfish --ro -a nbd://localhost -i

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell

Operating system: Microsoft Windows XP
/dev/sda1 mounted on /

><fs> ll /
total 1573209
drwxrwxrwx  1 root root       4096 Apr 16  2012 .
drwxr-xr-x 23 1000 1000       4096 Jun 24 13:57 ..
-rwxrwxrwx  1 root root          0 Oct 11  2011 AUTOEXEC.BAT
-rwxrwxrwx  1 root root          0 Oct 11  2011 CONFIG.SYS
drwxrwxrwx  1 root root       4096 Oct 11  2011 Documents and Settings
-rwxrwxrwx  1 root root          0 Oct 11  2011 IO.SYS
-rwxrwxrwx  1 root root          0 Oct 11  2011 MSDOS.SYS



5 responses to “xz plugin for nbdkit”

  1. Dave Vasilevsky

    Very cool stuff. I’ve written some related projects:

    * pixz, my parallel xz compressor, produces blocked xz-compatible archives by default. It’s a great way to create .xz archives suitable for random access. https://github.com/vasi/pixz

    * lzopfs is a FUSE filesystem that provides random access to several compressed file formats: xz, bzip2, gzip, and lzo. (It also supports compatible formats like pixz, pigz, etc.) Most formats require a preliminary scan to index the compressed file, but the index is small enough to keep around so you never have to scan again. https://github.com/vasi/lzopfs

    You may want to consider including indexing in nbdkit, so that gzip and bzip2 archives are usable.

  2. Gary Scarborough

    I am wondering why there seems to be a push to use xz with kvm instead of gzip. In my experiences, xz is very slow for such a small difference in compression. gzip works well and creates a file that won’t lose its sparseness. qemu-img convert -c also does a very good job of compacting images if you aren’t worried about moving the file about a lot. Is there something I am missing?

    • rich

      Probably just smaller is better. I think what would really help is to have standardized containers separate from standardized codecs (similar to the situation with video codecs, but hopefully with better implementations). That way we could all get the benefits of stuff like transparent uncompress plugins for qemu/nbd/etc without needing to sweat over the exact codec we are using.

  3. Pingback: virt-resize from an NBD source | Richard WM Jones

  4. none

    xz is great for files with lots of long range redundancy. An example is Wikipedia data dumps: every revision of every wiki page appears in the dump, so if the article about “cheese” is 1MB long and has been edited 10000 times since Wikipedia started, that means there’s 10GB of cheese article versions in the dump, each only slightly different from the previous one. That is WAY larger than the window size of bzip2 or gzip, but xz eats it right up. So the dump is around 10TB (terabytes) uncompressed, several hundred GB as .bz2, but only around 70GB as .xz’s. xz decompression is also much faster than bz2 decompression, though xz is slower on the compression side.
