Cladograms of virtual machines

Cladograms are diagrams showing the relationships between species and their ancestors. For virtual machines which get cloned and copied, you can also construct simple cladograms, and then use these to construct a tree of backing files to minimize the space used.

The Java program attached was written as a proof-of-concept for an oVirt feature which would automatically infer cladograms (and hence efficient backing files) from a set of virtual machines.

I ran it on a test set consisting of:

  1. An original Windows XP disk image.
  2. The same WinXP disk image as 1, copied and booted once.
  3. The same WinXP disk image as 1, copied, booted and Firefox installed.
  4. A 32 bit Fedora 19 disk image.
  5. A 64 bit Fedora 19 disk image.
  6. An ArchLinux disk image.

The program inferred the following cladogram:

--------------------------- winxp-original
    |    |         |    |
    |    |         |    `---- winxp-copy-1
    |    |         |
    |    |         `----- winxp-copy-2
    |    |
    |    |  ,------ f19rawhidex32
    |     --|
    |       `------ f19rawhidex64
    `-------------- archlinux

The “distance” measured between the Fedora and ArchLinux guests was so large that those are essentially random. But it did very well with the Windows XP guests which were truly derivatives. In fact you could easily modify the program to ignore guests which are unrelated (eg. have a distance measure over, say, 90%).

From the diagram you should be able to see how to build an efficient tree of backing files to minimize disk space requirements.

/* Compare multiple disk images for similarity.
 * By Richard W.M. Jones <>
 * See:
 * Usage:
 *   java Similarity disk.img disk.img [disk.img ...]
 * You need to supply two disk images at a minimum, and usually it's
 * best to supply a lot more.  Any format can be used.
 * Notes on the technique used:
 * (1) We use libguestfs to open each disk image.  This allows us to
 * get at the raw data, in case the disk image is stored in some
 * format like qcow2 or vmdk.  Also you could extend this program so
 * it could understand encrypted disks.
 * (2) For each disk, we split it into 64K blocks and hash each block
 * using SHA-256.  The reason for choosing 64K blocks is that it's the
 * normal cluster size for qcow2, and the block size used by qemu-img
 * etc.  The reason for doing the hashing is so that we can compare
 * the disk images for similarity by holding the complete set of
 * hashes in memory.  The hashing reduces each disk by a factor of
 * 2048, so for example a 10 GB disk image is reduced to a more
 * manageable 5 MB.
 * NB: For speed the hashes are saved in a cache file called
 * 'similarity-cache' in the local directory.  You can just delete
 * this file when done.
 * (3) We then compare each disk image, block by block, and record the
 * difference between each pair of images.
 * Note that we DON'T do advanced Cluster Analysis on the disk images.
 * There's not any point since the rebasing operation used by qemu-img
 * can only handle simple differences at the block level; it cannot,
 * for example, move blocks around or do fuzzy matches.
 * (4) We then output a tree (technically a 'Cladogram') showing a
 * hierarchy of guests, using a simple hierarchical clustering
 * algorithm, where we group the two closest guests, then that group
 * with the next closest guest, and so forth.

import java.util.*;

public class Similarity
    static String cache_filename = "similarity-cache";

    static int blocksize = 65536;
    static String hash_algorithm = "SHA-256";
    static int hash_bytes_per_block = 256/8;

    static String[] filenames;

    public static Map<String, ByteArray> read_cache_file ()
        throws Exception
        FileInputStream file = null;

        try {
            file = new FileInputStream (cache_filename);
        catch (Exception e) { }
        if (file == null)
            return new HashMap<String, ByteArray> ();

        ObjectInputStream ois = new ObjectInputStream (file);
        return (HashMap<String, ByteArray>) ois.readObject ();

    public static void write_cache_file (Map<String, ByteArray> cache)
        throws Exception
        FileOutputStream file = new FileOutputStream (cache_filename);
        ObjectOutputStream oos = new ObjectOutputStream (file);
        oos.writeObject (cache);
        oos.close ();

    /* Read the disk image and return all the block hashes
     * concatenated together.
    public static byte[] read_disk_image (String filename)
        throws Exception
        GuestFS g = new GuestFS ();
        g.add_drive_ro (filename);
        g.launch ();

        String dev = "/dev/sda";
        long size = g.blockdev_getsize64 (dev);
        ByteArrayOutputStream os = new ByteArrayOutputStream ();
        long percent, last_percent = -1;

        for (long offset = 0; offset < size-blocksize+1; offset += blocksize) {
            percent = 100 * offset / size;
            if (percent != last_percent) {
                last_percent = percent;
                System.out.print (percent + "%\r");
                System.out.flush ();

            String block = g.pread_device (dev, blocksize, offset);
            MessageDigest md = MessageDigest.getInstance (hash_algorithm);
            os.write (md.digest (block.getBytes ("UTF-8")));

        g.close ();

        return os.toByteArray ();

    // Return number of blocks which are different.
    public static long calculate_distance (byte[] bi, byte[] bj)
        long d = 0;
        ByteArrayInputStream bis = new ByteArrayInputStream (bi);
        ByteArrayInputStream bjs = new ByteArrayInputStream (bj);

        while (bis.available () > 0 || bjs.available () > 0) {
            byte[] bis_block = new byte[hash_bytes_per_block];
            if (bis.available () >= hash_bytes_per_block)
       (bis_block, 0, hash_bytes_per_block);
            byte[] bjs_block = new byte[hash_bytes_per_block];
            if (bjs.available () >= hash_bytes_per_block)
       (bjs_block, 0, hash_bytes_per_block);

            if (! Arrays.equals (bis_block, bjs_block))

        return d;

    /* The matrix of distance between disk image i and disk image j. */
    static long[][] matrix;

    public static long min_distance_between_sets (Set<Integer> set1,
                                                  Set<Integer> set2)
        long min_distance = -1;

        Iterator<Integer> i = set1.iterator ();
        while (i.hasNext ()) {
            Integer ii = ();
            Iterator<Integer> j = set2.iterator ();
            while (j.hasNext ()) {
                Integer ji = ();
                long distance = matrix[ii][ji];
                if (min_distance == -1 || distance < min_distance)
                    min_distance = distance;

        return min_distance;

    // Return true if any member of 's' is in 'set'.
    public static boolean is_member_of_set (Set s, Set set)
        Iterator i = s.iterator ();
        while (i.hasNext ()) {
            Object member = ();
            if (set.contains (member))
                return true;
        return false;

    public static Set[] make_cladogram (int level, Set[] sets)
        // Find the minimum distance between any sets.
        long min_distance = min_distance_between_sets (sets[0], sets[1]);

        for (int i = 0; i < sets.length; ++i) {
            for (int j = i+1; j < sets.length; ++j) {
                long distance = min_distance_between_sets (sets[i], sets[j]);
                if (distance < min_distance)
                    min_distance = distance;

        //System.out.println ("min_distance = " + min_distance);

        // For each set that differs from another by exactly the
        // minimum distance, group them together into a single set.
        Set grouped = new HashSet<Integer> ();

        for (int i = 0; i < sets.length; ++i) {
            for (int j = i+1; j < sets.length; ++j) {
                long distance = min_distance_between_sets (sets[i], sets[j]);
                if (distance == min_distance) {
                    grouped.addAll (sets[i]);
                    grouped.addAll (sets[j]);

        List<Set> ret = new ArrayList<Set> ();
        ret.add (grouped);

        // Add the other sets which are not included in 'grouped'.
        for (int i = 0; i < sets.length; ++i) {
            if (! is_member_of_set (sets[i], grouped))
                ret.add (sets[i]);

        System.out.println ("cladogram level " + level + ":");
        Iterator<Set> i = ret.iterator ();
        while (i.hasNext ()) {
            Set<Integer> s = ();
            System.out.print ("[ ");
            Iterator<Integer> j = s.iterator ();
            while (j.hasNext ()) {
                int ii =;
                String filename = filenames[ii];
                System.out.print (filename + " ");
            System.out.print ("] ");
        System.out.print ("\n");

        return ret.toArray (new HashSet[ret.size()]);

    public static void main (String[] argv)
        throws Exception
        filenames = argv;
        int n = filenames.length;

        if (n < 2)
            throw new Error ("usage: Similarity disk.img disk.img [disk.img ...]");

        // Read the cache file.
        Map<String, ByteArray> cache = read_cache_file ();

        // Read in the disk images and hash them.
        for (int i = 0; i < n; ++i) {
            if (cache.containsKey (filenames[i])) {
                System.out.println ("disk image " + filenames[i] +
                                    " is already in the cache");
            } else {
                System.out.println ("reading disk image: " + filenames[i]);
                byte[] bytes = read_disk_image (filenames[i]);
                ByteArray ba = new ByteArray (bytes);
                cache.put (filenames[i], ba);

            byte[] b = cache.get(filenames[i]).getArray();
            System.out.println (filenames[i] +
                                ": bytes.length = " + b.length +
                                ", blocks = " + b.length / hash_bytes_per_block);

        // Write the cache file.
        write_cache_file (cache);

        // Work out the similarity for each pair of guests.
        // Conveniently we can store this in a matrix where each (i,j)
        // element is the "distance" from filenames[i] to
        // filenames[j].  Distance means the number of blocks which
        // are different.
        matrix = new long[n][n];
        for (int i = 0; i < n; ++i) {
            matrix[i][i] = 0;
            for (int j = 0; j < i; ++j) {
                byte[] bi = cache.get(filenames[i]).getArray();
                byte[] bj = cache.get(filenames[j]).getArray();
                long d = calculate_distance (bi, bj);
                System.out.println ("distance from " +
                                    filenames[i] + " to " +
                                    filenames[j] + " = " + d);
                matrix[i][j] = d;
                matrix[j][i] = d;

        // Construct the cladogram.  The bottom level of this is an
        // array of Sets, with each Set containing just a single
        // guest.  Then we group the Sets together into larger and
        // larger Sets until we have one Set covering everything.
        Set h[] = new Set[n];
        for (int i = 0; i < n; ++i) {
            h[i] = new HashSet<Integer> ();
            h[i].add (new Integer (i));

        int level = 0;
        while (h.length > 1) {
            h = make_cladogram (level, h);

        System.out.println ("finished!");

   Derby - Class org.apache.derby.iapi.util.ByteArray

   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to you under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   See the License for the specific language governing permissions and
   limitations under the License.



  ByteArray wraps java byte arrays (byte[]) to allow
  byte arrays to be used as keys in hashtables.

  This is required because the equals function on
  byte[] directly uses reference equality.
  This class also allows the trio of array, offset and length
  to be carried around as a single object.
public final class ByteArray implements Serializable
    static final long serialVersionUID = 7541513629082952232L;

  private byte[] array;
  private int    offset;
  private int    length;

    Create an instance of this class that wraps ths given array.
    This class does not make a copy of the array, it just saves
    the reference.
  public ByteArray(byte[] array, int offset, int length) {
    this.array = array;
    this.offset = offset;
    this.length = length;

  public ByteArray(byte[] array) {
    this(array, 0, array.length);

  public ByteArray()

  public void setBytes(byte[] array)
    this.array = array;
    offset = 0;
    length = array.length;

  public void setBytes(byte[] array, int length)
    this.array = array;
    this.offset = 0;
    this.length = length;

  public void setBytes(byte[] array, int offset, int length)
    this.array = array;
    this.offset = offset;
    this.length = length;

    Value equality for byte arrays.
  public boolean equals(Object other) {
    if (other instanceof ByteArray) {
      ByteArray ob = (ByteArray) other;
      return ByteArray.equals(array, offset, length, ob.array, ob.offset, ob.length);
    return false;

  public int hashCode() {

    byte[] larray = array;

    int hash = length;
    for (int i = 0; i < length; i++) {
      hash += larray[i + offset];
    return hash;

  public final byte[] getArray() {
    return array;
  public final int getOffset() {
    return offset;

  public final int getLength() {
    return length;
  public final void setLength(int newLength) {
    length = newLength;
   * Read this object from a stream of stored objects.
   * @param in read this.
   * @exception IOException         thrown on error
  public void readExternal( ObjectInput in ) throws IOException
    int len = length = in.readInt();
    offset = 0; 
    array = new byte[len];

    in.readFully(array, 0, len);

   * Write the byte array out w/o compression
   * @param out write bytes here.
   * @exception IOException   thrown on error
  public void writeExternal(ObjectOutput out) throws IOException
    out.write(array, offset, length);

    Compare two byte arrays using value equality.
    Two byte arrays are equal if their length is
    identical and their contents are identical.
  private static boolean equals(byte[] a, int aOffset, int aLength, byte[] b, int bOffset, int bLength) {

    if (aLength != bLength)
      return false;

    for (int i = 0; i < aLength; i++) {
      if (a[i + aOffset] != b[i + bOffset])
        return false;
    return true;


Filed under Uncategorized

10 responses to “Cladograms of virtual machines

  1. C. M.

    Neat project! One question: you mention that you chose a 64K block size “because it’s the normal cluster size for qcow2 and the block size used by qemu-img;” however, typical filesystem block sizes are usually smaller — 4k or 8k. Ignoring QEMU’s native sizes, why the choice of 64k?

    It seems like there’s a trade-off here — storage-overhead per disk image versus accuracy of difference. So if you want to come up with a very accurate cladogram of a bunch of very closely-related VMs (and snapshots), maybe a smaller block size (4k, 8k) would produce more accurate results. But it would incur a higher storage overhead penalty.

    Likely, 64k is good enough for distinguishing between very different images (Windows vs Fedora vs Arch) and finding similarities in broadly similar images (Fedora 32-bit vs 64-bit base installs). But I think a smaller block size would be better once you start getting into less trivial cladograms…

    • rich

      You are right there is a trade-off between the cluster size vs the amount of memory required to do the analysis. Note that choosing anything smaller than whatever qemu-img rebase uses when it’s rebasing images would be a waste of time.

      The other thing is to change the distance measure: with libguestfs we can easily look inside the filesystem, so we could construct a metric based on (eg) recursive list of filenames.

      • C. M.

        Re: “Note that choosing anything smaller than whatever qemu-img rebase uses when it’s rebasing images would be a waste of time.” >> As someone unfamiliar with qemi-img rebase, why?

        “The other thing is to change the distance measure” >> Right, we’re going into another layer here ;-). You’re correct, there’s only so much information block-based diff will tell us.

      • rich

        Because qemu-img won’t look at anything smaller than that blocksize, and qemu-img is the tool we would normally use to “diff” two images.

  2. Yaniv

    what about something like:
    If (FS1 == NTFS && FS2 == LVM)
    similarity = 0%

    meaning, first check the file systems. If they are very different, than there’s no point in doing any similarity checks.

    • rich

      Absolutely you could do that. However I doubt it’s going to be necessary, because the “distance” measure I’m using is so huge for the unrelated guests that it would be easier simply to cut off at distance > 99% (or whatever).

      When I’m back at my dev computer I’ll post the actual numbers.

  3. rich

    This is the full output of the program, except that I have added some blank lines to make the output (a little) easier to understand.

    $ java -classpath .:/usr/share/java/libguestfs-1.20.1.jar \
     Similarity \
     /dev/fedora/f19rawhidex32 \
     /dev/fedora/f19rawhidex64 \
     /dev/fedora/archlinux20121201x64 \
     /tmp/winxp.img \
     /tmp/winxp-copy.img \
    disk image /dev/fedora/f19rawhidex32 is already in the cache
    /dev/fedora/f19rawhidex32: bytes.length = 10485760, blocks = 327680
    disk image /dev/fedora/f19rawhidex64 is already in the cache
    /dev/fedora/f19rawhidex64: bytes.length = 10485760, blocks = 327680
    disk image /dev/fedora/archlinux20121201x64 is already in the cache
    /dev/fedora/archlinux20121201x64: bytes.length = 10485760, blocks = 327680
    disk image /tmp/winxp.img is already in the cache
    /tmp/winxp.img: bytes.length = 3145728, blocks = 98304
    disk image /tmp/winxp-copy.img is already in the cache
    /tmp/winxp-copy.img: bytes.length = 3145728, blocks = 98304
    reading disk image: /tmp/winxp-copy2.img
    /tmp/winxp-copy2.img: bytes.length = 3145728, blocks = 98304
    distance from /dev/fedora/f19rawhidex64 to /dev/fedora/f19rawhidex32 = 316812
    distance from /dev/fedora/archlinux20121201x64 to /dev/fedora/f19rawhidex32 = 323022
    distance from /dev/fedora/archlinux20121201x64 to /dev/fedora/f19rawhidex64 = 322026
    distance from /tmp/winxp.img to /dev/fedora/f19rawhidex32 = 321543
    distance from /tmp/winxp.img to /dev/fedora/f19rawhidex64 = 322498
    distance from /tmp/winxp.img to /dev/fedora/archlinux20121201x64 = 325575
    distance from /tmp/winxp-copy.img to /dev/fedora/f19rawhidex32 = 321542
    distance from /tmp/winxp-copy.img to /dev/fedora/f19rawhidex64 = 322497
    distance from /tmp/winxp-copy.img to /dev/fedora/archlinux20121201x64 = 325576
    distance from /tmp/winxp-copy.img to /tmp/winxp.img = 312
    distance from /tmp/winxp-copy2.img to /dev/fedora/f19rawhidex32 = 321701
    distance from /tmp/winxp-copy2.img to /dev/fedora/f19rawhidex64 = 322659
    distance from /tmp/winxp-copy2.img to /dev/fedora/archlinux20121201x64 = 325723
    distance from /tmp/winxp-copy2.img to /tmp/winxp.img = 3607
    distance from /tmp/winxp-copy2.img to /tmp/winxp-copy.img = 3593
    cladogram level 0:
    [ /tmp/winxp.img /tmp/winxp-copy.img ] [ /dev/fedora/f19rawhidex32 ] [ /dev/fedora/f19rawhidex64 ] [ /dev/fedora/archlinux20121201x64 ] [ /tmp/winxp-copy2.img ] 
    cladogram level 1:
    [ /tmp/winxp.img /tmp/winxp-copy.img /tmp/winxp-copy2.img ] [ /dev/fedora/f19rawhidex32 ] [ /dev/fedora/f19rawhidex64 ] [ /dev/fedora/archlinux20121201x64 ] 
    cladogram level 2:
    [ /dev/fedora/f19rawhidex32 /dev/fedora/f19rawhidex64 ] [ /tmp/winxp.img /tmp/winxp-copy.img /tmp/winxp-copy2.img ] [ /dev/fedora/archlinux20121201x64 ] 
    cladogram level 3:
    [ /dev/fedora/f19rawhidex32 /dev/fedora/f19rawhidex64 /tmp/winxp.img /tmp/winxp-copy.img /tmp/winxp-copy2.img ] [ /dev/fedora/archlinux20121201x64 ] 
    cladogram level 4:
    [ /dev/fedora/f19rawhidex32 /dev/fedora/f19rawhidex64 /dev/fedora/archlinux20121201x64 /tmp/winxp.img /tmp/winxp-copy.img /tmp/winxp-copy2.img ] 
  4. C. M.


    “Because qemu-img won’t look at anything smaller than that blocksize, and qemu-img is the tool we would normally use to “diff” two images.”

    Ah, I think I misunderstood. The end goal isn’t the cladogram — that’s just a step along the journey. The goal is to, using the information from the cladogram, use qemu-img rebase to essentially de-dupe disk images that were forked at some point in the past, at the block level (and qemu-img’s native block size is 64kB). Is that correct?

    If de-dupe of shared blocks is the goal, it seems like CoW at fork time might incur less initial overhead than a de-dupe pass, but you might also end up with block fragmentation issues… which is overhead.

  5. Pingback: New tool: virt-similarity | Richard WM Jones

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.