Tag Archives: python

Loop mount an S3 or Ceph object

This is a fun, small nbdkit Python plugin using the Boto3 AWS SDK:

#!/usr/sbin/nbdkit python

import nbdkit
import boto3
from contextlib import closing


def thread_model():
    return nbdkit.THREAD_MODEL_PARALLEL

def config(key, value):
    global access_key, secret_key, endpoint_url, bucket_name, key_name

    if key == "access-key" or key == "access_key":
        access_key = value
    elif key == "secret-key" or key == "secret_key":
        secret_key = value
    elif key == "endpoint-url" or key == "endpoint_url":
        endpoint_url = value
    elif key == "bucket":
        bucket_name = value
    elif key == "key":
        key_name = value
        raise Exception("unknown parameter %s" % key)

def open(readonly):
    global access_key, secret_key, endpoint_url

    s3 = boto3.client("s3",
                      aws_access_key_id = access_key,
                      aws_secret_access_key = secret_key,
                      endpoint_url = endpoint_url)
    if s3 is None:
        raise Exception("could not connect to S3")
    return s3

def get_size(s3):
    global bucket_name, key_name

    resp = s3.get_object(Bucket = bucket_name, Key = key_name)
    size = resp['ResponseMetadata']['HTTPHeaders']['content-length']
    return int(size)

def pread(s3, buf, offset, flags):
    global bucket_name, key_name

    size = len(buf)
    rnge = 'bytes=%d-%d' % (offset, offset+size-1)
    resp = s3.get_object(Bucket = bucket_name, Key = key_name, Range = rnge)
    body = resp['Body']
    with closing(body):
        buf[:] = body.read(size)

This lets you loop mount a single object (file):

$ ./nbdkit-S3-plugin -f -v -U /tmp/sock \
  access_key="XYZ" secret_key="XYZ" \
  bucket="my_files" key="fedora-28.iso"
$ sudo nbd-client -b 2048 -unix /tmp/sock /dev/nbd0
Negotiation: ..size = 583MB
$ ls /dev/nbd0
 nbd0    nbd0p1  nbd0p2  
$ sudo mount -o ro /dev/nbd0p1 /tmp/mnt
$ ls -l /tmp/mnt
 total 11
 dr-xr-xr-x. 3 root root 2048 Apr 25  2018 EFI
 -rw-r--r--. 1 root root 2532 Apr 23  2018 Fedora-Legal-README.txt
 dr-xr-xr-x. 3 root root 2048 Apr 25  2018 images
 drwxrwxr-x. 2 root root 2048 Apr 25  2018 isolinux
 -rw-r--r--. 1 root root 1063 Apr 21  2018 LICENSE
 -r--r--r--. 1 root root  454 Apr 25  2018 TRANS.TBL

I should note this is a bit different from s3fs which is a FUSE driver that mounts all the files in a bucket.

Leave a comment

Filed under Uncategorized

Tip: Changing the qemu product name in libguestfs

20:30 < koike> Hi. Is it possible to configure the dmi codes for libguestfs? I mean, I am running cloud-init inside a libguestfs session (through python-guestfs) in GCE, the problem is that cloud-init reads /sys/class/dmi/id/product_name to determine if the machine is a GCE machine, but the value it read is Standard PC (i440FX + PIIX, 1996) instead of the expected Google Compute Engine so cloud-init fails.

The answer is yes, using the guestfs_config API that lets you set arbitrary qemu parameters:

         'type=1,product=Google Compute Engine')

Leave a comment

Filed under Uncategorized

Downloading all the 78rpm rips at the Internet Archive

I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.

[Edit: If you want a light introduction to this, I recommend this double CD]

I wanted to download it … all!

But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …

Firstly you do need to start with the Advanced Search form. Using the second form on that page, in the query box put collection:georgeblood, select the identifier field (only), set the format to CSV. Set the limit to 30000 (there are about 25000+ records), and download the huge CSV:

$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv

A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings. For example:

78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b → https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b

What I want to do is pick the first MP3 file in the directory and download it. I’m not fussy about how to do that, and Python has both a CSV library and an HTML fetching library. This turns the CSV file of links into a list of MP3 URLs. You could easily adapt this to download FLAC files instead.


import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        if row[0] == "identifier":
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile("\.mp3$")})
        # Only want the first link in the page.
        link = links[0]
        link = link.get('href', None)
        link = urlparse.urljoin(url, link)
        print link

When you run this it converts each identifier into a download URL:

Edit: Amusingly WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!

$ ./download.py | head -10

And after that you can download as many 78s as you can handle 🙂 by doing:

$ ./download.py > downloads
$ wget -nc -i downloads


I only downloaded about 5% of the tracks, but it looks as if downloading it all would be ~ 100 GB. Also most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.

Update #2

Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.


Filed under Uncategorized

New in nbdkit: Run nbdkit as a captive process

New in nbdkit ≥ 1.1.6, you can run nbdkit as a “captive process” under external programs like qemu or guestfish. This means that nbdkit runs for as long as qemu/guestfish is running, and when they exit it cleans up and exits too.

Here is a rather involved way to boot a Fedora 20 guest:

$ virt-builder fedora-20
$ nbdkit file file=fedora-20.img \
    --run 'qemu-kvm -m 1024 -drive file=$nbd,if=virtio'

The --run parameter is what tells nbdkit to run as a captive under qemu-kvm. $nbd on the qemu command line is substituted automatically with the right nbd: URL for the port or socket that nbdkit listens on. As soon as qemu-kvm exits, nbdkit is killed and cleaned up.

Here is another example using guestfish:

$ nbdkit file file=fedora-20.img \
    --run 'guestfish --format=raw -a $nbd -i'

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell

Operating system: Fedora release 20 (Heisenbug)
/dev/sda3 mounted on /
/dev/sda1 mounted on /boot


The main use for this is not to run the nbdkit file plugin like this, but in conjunction with perl and python plugins, to let people easily open and edit OpenStack Glance/Cinder and other unconventional disk images.


Filed under Uncategorized

New in nbdkit: Write plugins in Python

nbdkit is a permissively licensed Network Block Device server that lets you export “unusual” disk sources to qemu and libguestfs.

New in nbdkit 1.1.5, you can write plugins using Python. Here is an example.

1 Comment

Filed under Uncategorized

Get kernel and initramfs from a disk image

This script, a response to the insecure and over-complex disk-image-get-kernel in OpenStack, shows how to use libguestfs to safely and easily get the kernel and initramfs from a disk image so you can boot it using an external kernel.

# Get latest kernel & initramfs safely from a disk image.

# Note this will overwrite /tmp/kernel & /tmp/initramfs which you
# wouldn't want to do in production.

import sys
import guestfs

assert (len (sys.argv) == 2)
disk = sys.argv[1]

g = guestfs.GuestFS (python_return_dict=True)

# To enable tracing, uncomment the next line.
#g.trace (1)

# Attach the disk image read-only to libguestfs.
g.add_drive_opts (disk, readonly=1)

# Run the libguestfs back-end.
g.launch ()

# Ask libguestfs to inspect for operating systems.
roots = g.inspect_os ()
if len (roots) == 0:
    raise (Error ("no operating systems found"))
if len (roots) > 1:
    raise (Error ("dual/multi-boot images are not supported"))

root = roots[0]

# Mount up the disks, like guestfish -i.
# Sort keys by length, shortest first, so that we end up
# mounting the filesystems in the correct order.
mps = g.inspect_get_mountpoints (root)
def compare (a, b): return len(a) - len(b)
for device in sorted (mps.keys(), compare):
        g.mount_ro (mps[device], device)
    except RuntimeError as msg:
        print "%s (ignored)" % msg

# For debugging:
print "/boot directory of this guest:"
print (g.ll ("/boot"))

# Get all kernels & initramfses.
kernels = g.glob_expand ("/boot/vmlinuz-*")
initramfses = g.glob_expand ("/boot/initramfs-*")

# Old RHEL:
if len (initramfses) == 0:
    initramfses = g.glob_expand ("/boot/initrd-*")

# Debian/Ubuntu:
if len (initramfses) == 0:
    initramfses = g.glob_expand ("/boot/initrd.img-*")

if len (kernels) == 0:
    raise (Error ("no kernel found in this disk image"))
if len (initramfses) == 0:
    raise (Error ("no initramfs found in this disk image"))

# Sort by version so we get the latest.
from distutils.version import LooseVersion
kernels.sort (key=LooseVersion)
initramfses.sort (key=LooseVersion)

# Download the latest.
print ("downloading %s -> /tmp/kernel" % kernels[len (kernels)-1])
g.download (kernels[len (kernels)-1], "/tmp/kernel")
print ("downloading %s -> /tmp/initramfs" % initramfses[len (initramfses)-1])
g.download (initramfses[len (initramfses)-1], "/tmp/initramfs")

# Shutdown.
g.shutdown ()
g.close ()

Leave a comment

Filed under Uncategorized

Using libguestfs remotely with Python and rpyc

libguestfs has high quality Python bindings. Using rpyc you can make a remote libguestfs server with almost no effort at all.

Firstly start an rpyc server:

$ /usr/lib/python2.7/site-packages/rpyc/servers/classic_server.py
[SLAVE      INFO       13:21:17 tid=140019939981120] server started on
[SLAVE      INFO       13:21:17 tid=140019784894208] started background auto-register thread (interval = 60)
[REGCLNT    INFO       13:21:17] registering on
[REGCLNT    WARNING    13:21:19] no registry acknowledged

Now, possibly from the same machine or some other machine, you can connect to this server and use Python objects remotely as if they were local:

$ python
Python 2.7.3 (default, Aug  9 2012, 17:23:57) 
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rpyc
>>> c = rpyc.classic.connect('localhost')

You can now create a libguestfs handle, following the example here.

>>> g = c.modules.guestfs.GuestFS()
>>> g.version()
{'release': 36L, 'major': 1L, 'minor': 21L, 'extra': 'fedora=20,release=1.fc20,libvirt'}
>>> g.add_drive('/dev/fedora/f18x64', readonly=True)
>>> g.launch()
>>> roots = g.inspect_os()
>>> g.inspect_get_product_name(roots[0])
'Fedora release 18 (Spherical Cow)'
>>> g.inspect_get_mountpoints(roots[0])
[('/', '/dev/mapper/fedora-root'), ('/boot', '/dev/sda1')]

As you can see, the g object is transparently remoted without you needing to do anything.

Leave a comment

Filed under Uncategorized

Which foreign function interface is the best?

I’ve written libguestfs language bindings for Perl, Python, Ruby, Java, OCaml, PHP, Haskell, Erlang and C#. But which of these is the best? Which is the easiest? What makes this hard? Grubbing around in the internals of a language reveals mistakes made by the language designers, but what are the worst mistakes?

Note: There is source that goes with this. Download libguestfs-1.13.13.tar.gz and look in the respective directories.

The best

It’s going to be a controversial choice, but in my opinion: C#. You just add some simple annotations to your functions and structs, and you can call into shared libraries (or “DllImport”s as Microsoft insisted on calling them) directly. It’s just about as easy as directly calling C and that is no simple achievement considering how the underlying runtime of C# is very different from C.

Example: a C struct:

[StructLayout (LayoutKind.Sequential)]
public class _int_bool {
  int i;
  int b;

The worst

There are two languages in the doghouse: Haskell and PHP. PHP first because their method of binding is just very broken. For example, 64 bit types aren’t possible on a 32 bit platform. It requires a very complex autoconf setup. And the quality of their implementation is very poor verging on broken — it makes me wonder if the rest of PHP can be this bad.

Haskell: even though I’m an experienced functional programmer and have done a fair bit of Haskell programming in the past, the FFI is deeply strange and very poorly documented. I simply could not work out how to return anything other than integers from my functions. You end up with bindings that look like this:

write_file h path content size = do
  r <- withCString path $ \path -> withCString content $ \content -> withForeignPtr h (\p -> c_write_file p path content (fromIntegral size))
  if (r == -1)
    then do
      err <- last_error h
      fail err
    else return ()

The middle tier

There’s not a lot to choose between OCaml, Ruby, Java and Erlang. For all of them: you write bindings in C, there’s good documentation, it’s a bit tedious but basically mechanical, and in 3 out of 4 you’re dealing with a reasonable garbage collector so you have to be aware of GC issues.

Erlang is slightly peculiar because the method I chose (out of many possible) is to write an external process that talks to the Erlang over stdin/stdout. But I can’t fault their documentation, and the rest of it is sensible.

Example: Here is a function binding in OCaml, but with mechanical changes this could be Ruby, Java or Erlang too:

CAMLprim value
ocaml_guestfs_add_drive_ro (value gv, value filenamev)
  CAMLparam2 (gv, filenamev);
  CAMLlocal1 (rv);

  guestfs_h *g = Guestfs_val (gv);
  if (g == NULL)
    ocaml_guestfs_raise_closed ("add_drive_ro");

  char *filename = guestfs_safe_strdup (g, String_val (filenamev));
  int r;

  caml_enter_blocking_section ();
  r = guestfs_add_drive_ro (g, filename);
  caml_leave_blocking_section ();
  free (filename);
  if (r == -1)
    ocaml_guestfs_raise_error (g, "add_drive_ro");

  rv = Val_unit;
  CAMLreturn (rv);

The ugly

Perl: Get reading. You’d better start with perlxs because Perl uses its own language — C with bizarre macros on top so your code looks like this:

SV *
is_config (g)
      guestfs_h *g;
      int r;
      r = guestfs_is_config (g);
      if (r == -1)
        croak ("%s", guestfs_last_error (g));
      RETVAL = newSViv (r);

After that, get familiar with perlguts. Perl has only 3 structures and you’ll be using them a lot. There are some brilliant things about Perl which shouldn’t be overlooked, including POD which libguestfs uses to make effortless manual pages.

Python: Best described as half arsed. Rather like the language itself.

Python, Ruby, Erlang: If your language depends on “int”, “long”, “long long” without defining what those mean, and differing based on your C compiler and platform, then you’ve made a big mistake that will unfortunately dog you throughout the runtime, FFIs and the language itself. It’s better either to define them precisely (like Java) or to just use int32 and int64 (like OCaml).

And finally, reference counting (Perl, Python). It’s tremendously easy to make mistakes that are fiendishly difficult to track down. It’s a poor way to do GC and it indicates to me that the language designer didn’t know any better.


Filed under Uncategorized

What I learned about AMQP

I’m playing with AMQP at the moment. I thought I’d start off with RabbitMQ first.

The good:

  • It works.
  • OCaml and Python programs can talk to each other.
  • It works across remote hosts. You need to open port 5672/tcp on the firewall.

The bad:

  • RabbitMQ and Apache Qpid use different versions of AMQP and are not interoperable! Good summary of the mess here. This might be resolved when everyone gets around to supporting AMQP 1-0, but even though that standard has been published, no one is expecting interop to happen for at least a year.
  • You can’t cluster different versions of the RabbitMQ broker together.
  • Even if all your hosts are at the same RabbitMQ version, you have to open more firewall ports and make changes to the start-up scripts. (Dynamic ports? Really? Did we learn nothing from NFS?)
  • Long, obscure Erlang error messages which don’t point to the problem. eg. You’ll get a good 25 lines of error message if another process is already bound to a port.
  • Possibly just a Fedora packaging problem: I managed to get my host into some state where it’s impossible to stop the RabbitMQ server except by kill -9, and after that I can’t start or stop it.


Filed under Uncategorized

Tip: Using libguestfs from Perl

I translated the standard libguestfs examples (already available in C/C++, OCaml, Python, Ruby) into Perl.

If you want to call libguestfs from Perl, you have to use Sys::Guestfs.

All 300+ libguestfs API calls are available to all language bindings equally because we generate the bindings.

Leave a comment

Filed under Uncategorized