Half-baked ideas: Cluster ext4

Half-bakery is a great website for amusing ideas.

I’m going to put my half-baked ideas under a special Ideas tag on this site.

My first idea is a “cluster” version of ext4 for the special case where there is one writer and multiple readers.

Virt newbies commonly think this should “just work”: create an ext4 filesystem in the host and export it read-only to all the guests. What could possibly go wrong?

In fact this doesn’t work, and guests will see corrupt data, more or less often depending on how frequently the filesystem gets updated. The reason is that the guest kernel caches stale parts of the filesystem and/or can be reading metadata which the host is simultaneously updating.

Another problem is that a guest process could open a file which the host then deletes (reusing its blocks). Really the host should be aware of which files and directories guests have open and keep those around.

So it doesn’t work, but can it be made to work with some small changes to ext4 in the kernel?

You obviously need a communications path from the guests back to the host. Guests could use this to “reserve” or mark their interest in files, which the host would treat as if a local process had opened them. (In fact if the hypervisor is qemu, it could open(2) these files on the host side.)
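
The reservation idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `ReservationTable`, `reserve`, `read` and `release` are hypothetical names, not part of qemu or ext4, and the guest-to-host channel is left out entirely. It relies on the POSIX behaviour that an unlinked file stays readable for as long as some descriptor holds it open.

```python
import os
import tempfile


class ReservationTable:
    """Hypothetical host-side table of files that guests have reserved."""

    def __init__(self):
        # Maps (guest_id, path) -> host file descriptor held for that guest.
        self._fds = {}

    def reserve(self, guest_id, path):
        """Open the file on the host so its blocks cannot be reclaimed."""
        key = (guest_id, path)
        if key not in self._fds:
            self._fds[key] = os.open(path, os.O_RDONLY)
        return key

    def read(self, key, length, offset=0):
        """Serve a guest read from the descriptor the host is holding."""
        return os.pread(self._fds[key], length, offset)

    def release(self, key):
        """Drop the guest's interest; the host may now free the file."""
        os.close(self._fds.pop(key))


def demo():
    table = ReservationTable()
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"shared data")

    key = table.reserve("guest-1", path)
    os.unlink(path)                    # the host "deletes" the file
    data = table.read(key, 11)         # the reserved file is still readable
    table.release(key)                 # only now can the blocks be reused
    return data, os.path.exists(path)
```

Calling `demo()` shows the guest's read succeeding even though the name has already gone from the host directory. A real implementation would also have to cope with guests that crash without releasing their reservations.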

This is a really useful feature for virtualization. Uses include: exporting any read-only data to guests (documents, data). Updateable shared /usr for guests. Mirrors of yum/apt repositories.

Related: virtio filesystem (what happened to that plan9 fs for KVM?), paravirt filesystems.


If qemu is going to open files on the host side, why not go the whole way and implement a paravirtualized filesystem? It wouldn’t need to be limited to just ext4 on the host side.

But how would we present it on the guest side? Presenting a read-only ext2 filesystem on the guest side is tempting, but not feasible. The problem again is what to do when files disappear on the host side — there is no way to communicate this to the guest except to give fake EIO errors which is hardly ideal. In any case qemu can already export directories as “virtual FAT filesystems”. I don’t know anyone who has a good word to say about this (mis-)feature.

So it looks like however it is done, there is a requirement for the guest to communicate its intentions to the host, even though the guest still would not be able to write.



9 responses to “Half-baked ideas: Cluster ext4”

  1. “The problem again is what to do when files disappear on the host side — there is no way to communicate this to the guest except to give fake EIO errors which is hardly ideal.”

    As you know, a file is only truly deleted when nothing references it any more (open file descriptors in userland, a refcount in the VFS). So you just need to make sure to keep the files open on the host for as long as they’re referenced in the guest.

    That means that virtio_fs needs an API/protocol that allows the guest to basically open() files in the host. The API could be similar to POSIX, but perhaps something closer to the VFS is better. See page 18 of http://www.slideshare.net/ericvh/9p-on-kvm

    I think you want something that mmap()s the host’s files into the emulated physical memory space in the guest on demand. The guest would then use a special virtiofs which doesn’t use the guest kernel’s buffer cache, instead sharing the host’s buffer cache with all other guests.
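
A rough illustration of the shared-cache point, assuming a Linux host: two read-only mappings of the same file stand in for two guests, and an ordinary write by the host shows up in both without either of them re-reading anything. The function name `shared_mapping_demo` is made up for this sketch, and mapping pages into a guest's emulated physical memory is of course far more involved than calling `mmap` in one process.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"version 1")

        # Two read-only shared mappings stand in for two guests.
        reader_fd = os.open(path, os.O_RDONLY)
        guest_a = mmap.mmap(reader_fd, 0, access=mmap.ACCESS_READ)
        guest_b = mmap.mmap(reader_fd, 0, access=mmap.ACCESS_READ)
        before = (guest_a[:], guest_b[:])

        # The host (the single writer) updates the file in place.
        os.pwrite(fd, b"version 2", 0)
        after = (guest_a[:], guest_b[:])

        guest_a.close()
        guest_b.close()
        os.close(reader_fd)
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)
```

Both mappings are backed by the same pages in the host's cache, so there is one copy of the data however many readers attach. Note that this also shows the coherency problem from the post: readers see updates mid-flight unless something coordinates them.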

    • rich

      Thanks for this comment — those slides on 9p-on-kvm were the ones I was looking for earlier.

      The more I think about this, the more I see it needs a PV filesystem of some sort. Exporting extX to a guest sounds simple, but in practice is nothing like that.

  2. Jitesh

    (Warning: newbie question)
    NFS has already solved a similar problem, right? So, how would a read-only NFS mount compare to a PV-ext4 fs in the guest?

    Are there any particular reasons why NFS is not desirable in this case?

    • rich

      There are big advantages over NFS:

      You can share memory between the host system and the VM(s).
      A block device which uses shared memory would be zero-copy.
      NFS performance over virtual networks is very poor — try it some time.

  3. I wonder if you could leverage btrfs r/o snapshotting. If it can handle absurdly frequent snapshot creation and destruction efficiently, then you’d ‘just’ need to plumb the VM to recognize each subsequent snapshot as the same filesystem. And handle all the shared memory (1.8yr project in those scare quotes).

    • rich

      Yes, I guess the problem is similar to what I outlined, which is that you can’t garbage collect the older versions if you don’t know which versions your “clients” (i.e. guests) might be reading.
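
The garbage-collection constraint can be made concrete with a small sketch. Everything here is hypothetical (`SnapshotStore` and its methods are invented for illustration and have nothing to do with the real btrfs interfaces): the host publishes numbered snapshots, guests pin the one they are reading, and an old snapshot can be collected only once no guest has it pinned.

```python
class SnapshotStore:
    """Hypothetical host-side bookkeeping for read-only snapshots."""

    def __init__(self):
        self._next_id = 0
        # Maps snapshot id -> set of guest ids currently reading it.
        self._readers = {}

    def publish(self):
        """Create a new snapshot and return its id."""
        snap = self._next_id
        self._next_id += 1
        self._readers[snap] = set()
        return snap

    def pin(self, guest_id, snap):
        """Record that a guest is reading this snapshot."""
        self._readers[snap].add(guest_id)

    def unpin(self, guest_id, snap):
        """Record that a guest has finished with this snapshot."""
        self._readers[snap].discard(guest_id)

    def live(self):
        """Return the ids of snapshots that still exist."""
        return sorted(self._readers)

    def collect(self):
        """Delete every snapshot that is neither the latest nor pinned."""
        latest = self._next_id - 1
        dead = [s for s, readers in self._readers.items()
                if s != latest and not readers]
        for s in dead:
            del self._readers[s]
        return sorted(dead)
```

For example, if a guest pins snapshot 0 and the host then publishes snapshots 1 and 2, `collect()` can remove only snapshot 1; snapshot 0 goes away on a later `collect()` after the guest unpins it. Without the pin and unpin messages from the guests, the host has no safe way to make that decision.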

  4. I just noticed this:

    It appears stalled but might have some of the design done.
