Tag Archives: ideas

Split block drivers from qemu with nbdkit

One interesting talk at KVM Forum last week was Stefan Hajnoczi’s talk about QEMU security (sorry, it’s not online yet; it should eventually be available alongside all the other talks on this YouTube channel).

One thing Stefan mentioned was whether QEMU might be split into multiple processes. This has advantages for security:

  1. Crashing or corrupting a single process doesn’t automatically expose the whole hypervisor.
  2. You can separately label each process using SELinux and independently control those policies, providing finer-grained security.

For block drivers you can do this today, and in fact we do this already when we run qemu from virt-v2v. Consider the case where we are using a remote HTTPS disk image:

$ qemu -drive https://remote/disk.img

[Diagram: the curl block driver linked into and running inside the QEMU process]

The curl driver, linked into and running inside QEMU, has to make a remote TCP/IP connection, encode and decode TLS, link against libcurl and so on, and all of those things apply to the whole QEMU process. If the curl block driver has problems for any reason, they also affect QEMU. The SELinux labels and transitions needed to access the socket are labels and transitions needed by the QEMU process. An exploit in the driver is a QEMU exploit.

With nbdkit we can split this out:

$ nbdkit -U - curl url=https://remote/disk.img \
  --run 'qemu -drive $nbd'

[Diagram: the curl driver running in a separate nbdkit process, serving the disk to QEMU over an NBD socket]

From a security point of view this has immediate advantages: If the curl driver crashes or is exploited, only nbdkit is affected. QEMU only needs access to a private Unix domain socket, and conversely nbdkit doesn’t need access to anything else that QEMU uses. You can add resource limits, separate SELinux policy, seccomp, namespaces and anything else you can think of to nbdkit to contain it tightly.

It’s worth pointing out the obvious disadvantages too: it’s likely that there will be a performance impact (although don’t discount how efficient NBD is, and how this architecture also lets you scale more effectively over NUMA nodes). And it puts all our eggs in one basket, the qemu NBD client, which must be very robust.

I should say also that this is more laborious to set up, and it would only really work if some other component (libvirt ideally) handled the creation of the separate nbdkit process. In the example above I used captive nbdkit, but that only works if you have a single drive, and one of the other mechanisms would be more scalable.

Half-baked ideas: Server remote management with an RPi Zero

For more half-baked ideas, see the ideas tag

I guess, like most people who work a lot with computers, I have a large number of computers in a “server room” (my loft). I’m too cheap to buy actual servers though, so most of these computers lack any sort of remote management / IPMI / BMC etc. I also just bought 6 Intel NUCs, which are ideal as development servers, but unless you buy very specific (and unobtainium) versions they don’t come with remote management either.

Could we design a small USB device which would turn an ordinary PC into a remotely managed server?

What would we want it to do?

  1. Reboot remotely (like hitting the power switch, but remote).
  2. Unattended operating system install and reinstall.
  3. Modify BIOS settings.

At first the answer is obviously no, since such a device cannot capture the server’s display output, but let’s ignore the display for a moment.

My initial idea was that you could simply have a USB flash drive which would have some kind of minimal remote operating environment. You would have to reconfigure the server BIOS so it always boots off the USB drive, and the USB drive then either chain boots to the operating system or can interrupt the boot process to download a new image. This lets you reinstall an OS, but not much else.

However USB can do a whole lot more, including acting as a keyboard, and a single USB port can provide multiple functions at once, such as drive + keyboard. Could we design a USB peripheral which injects keypresses (like Ctrl+Alt+Del to soft reboot)? The same trick would let it drive the BIOS menus, albeit blindly.

Designing a USB peripheral is a bunch of work, but the hardware to do it already exists. The Raspberry Pi Zero can act as a USB device (known as a Linux USB gadget).

So the idea is you’d connect the RPi Zero to the server using a USB OTG cable. It would appear as a keyboard and mass storage device (but unlike above you would not change the server to boot off USB; it boots normally). The RPi Zero could be accessed remotely over wifi, and using that you could inject keypresses (like Ctrl+Alt+Del), present a boot disk to the server, and negotiate the BIOS menus with blind keypresses to select an alternate boot disk when you want to reinstall. It could also act as a virtual serial console gadget, allowing simpler interactive use with some operating systems.
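
To make the keypress injection concrete, here is a minimal sketch in C. It assumes the Pi Zero has already been configured as a USB HID keyboard gadget (for example through configfs), so that a /dev/hidg0 device exists on the Pi; the device path and set-up are assumptions, and only the standard 8-byte boot-keyboard report format is taken as given:

/* Sketch: inject Ctrl+Alt+Del from a Linux USB HID keyboard gadget.
 * Assumes the gadget has been configured so that /dev/hidg0 is an
 * 8-byte boot-protocol keyboard endpoint.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* Boot keyboard report: [modifiers, reserved, key1 ... key6].
   * 0x01 = LeftCtrl, 0x04 = LeftAlt, 0x4C = Delete (HID usage IDs). */
  unsigned char press[8]   = { 0x01 | 0x04, 0, 0x4C, 0, 0, 0, 0, 0 };
  unsigned char release[8] = { 0 };

  int fd = open ("/dev/hidg0", O_WRONLY);
  if (fd == -1) {
    perror ("/dev/hidg0");
    return 1;
  }

  if (write (fd, press, sizeof press) == -1 ||      /* press the keys */
      write (fd, release, sizeof release) == -1)    /* release them */
    perror ("write");

  close (fd);
  return 0;
}

Run on the Pi, this looks to the server exactly like someone typing Ctrl+Alt+Del on a real keyboard; the blind BIOS navigation would just be longer sequences of the same kind of report.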

One problem here is that the RPi is powered from the server, so when the server reboots it’s possible the RPi will shut down (but perhaps not if it’s plugged into the yellow USB ports, since I guess those should have constant power).

Another problem is you can’t see the display, so you’re driving the BIOS and other menus blind. But it’ll probably work in most automated cases. USB has a video class but I’m not clear if it would help here (it also wouldn’t be any good unless the server BIOS could use it).

A third problem is you cannot hard reboot the server in case it hangs.

Half-baked ideas: C strings with implicit length field

For more half-baked ideas, see the ideas tag.

If you prefer just to see the code, it’s in the git repo linked at the end of this post.

Chris Siebenmann wrote a couple of interesting articles about C’s null terminated strings and how they pre-date C.

Chris notes an alternative is a length + string representation, as used in Pascal. Although there are libraries for this in C, there are several drawbacks and approximately no one uses them.

However it’s possible to have the best of both worlds: Strings using an implicit length field that takes up no extra storage. These strings are backwards compatible with ordinary C strings — you can literally pass them to legacy functions or cast them to char * — yet the equivalent of a strlen operation is O(1).

There are two ideas here: Firstly, when you use the C malloc function, malloc stashes some extra metadata about your allocation, and with most malloc implementations there is a function to obtain the size of the allocation from a pointer. In glibc, the function is called malloc_usable_size. Note that because of alignment concerns, the amount allocated is usually larger than the amount you originally requested.

The second idea comes from OCaml. OCaml stores strings in a clever internal representation which is both backwards compatible with C (a fancy way to say they are null terminated), and it allows you to get the real length of the string even though OCaml — like C — allocates more than requested for alignment reasons.

So here’s how we do it: When allocating an “implicit length string” (ilenstr) we store extra data in the final byte of the “full” malloced space, in the byte marked B in the diagram below:

+-------------------------+----+------------+----+
| the string              | \0 |   ....     | B  |
+-------------------------+----+------------+----+
<------ size we requested ----->
<-------- size malloc actually allocated -------->

If malloc allocated exactly the same amount of space as is used by our string + terminating null, then B is simply the terminating \0:

+-------------------------+----+
| the string              | \0 |
+-------------------------+----+

If malloc allocated 1 spare byte, we store B = 1:

+-------------------------+----+----+
| the string              | \0 | 1  |
+-------------------------+----+----+

If malloc allocated 4 spare bytes, we store B = 4:

+-------------------------+----+----+----+----+----+
| the string              | \0 |   ....       | 4  |
+-------------------------+----+----+----+----+----+

Getting the true length of the string is simply a matter of asking malloc for the allocated length (ie. calling malloc_usable_size), reading the spare-byte count B from the last byte, and subtracting B plus one (for the terminating \0). So we can get the true string length in an O(1) operation (usually, although this may depend on your malloc implementation).

ilenstr strings can contain \0 characters within the string.

ilenstr strings are also backwards compatible, in that we can pass one to any “legacy” C function, and assuming the string itself doesn’t contain any \0 inside it, everything just works.
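
To make this concrete, here is a rough sketch of the two operations, assuming glibc’s malloc_usable_size. The function names here are mine, purely for illustration; the real code is in the git repo below:

#include <malloc.h>     /* malloc_usable_size (glibc) */
#include <stdlib.h>
#include <string.h>

/* Allocate an implicit length string holding len bytes of data.
 * B (the number of spare bytes after the terminating \0) is stored in
 * the last byte of the usable allocation; if there are no spare bytes
 * then the last byte is simply the \0 itself.
 */
char *
ilenstr_new (const char *data, size_t len)
{
  char *s = malloc (len + 1);
  if (s == NULL)
    return NULL;
  memcpy (s, data, len);
  s[len] = '\0';

  size_t usable = malloc_usable_size (s);
  size_t spare = usable - (len + 1);
  if (spare > 0)
    s[usable-1] = (char) spare;   /* assumes spare fits in one byte */
  return s;
}

/* O(1) length: allocated size, minus B, minus the terminating \0. */
size_t
ilenstr_len (const char *s)
{
  size_t usable = malloc_usable_size ((void *) s);
  unsigned char b = s[usable-1];
  return usable - b - 1;
}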

Alright. This is terrible. DO NOT USE IT IN PRODUCTION CODE! It breaks all kinds of standards, is unportable etc. There are security issues with allowing \0-containing strings to be passed to legacy functions. Still, it’s a nice idea. With proper cooperation from libc, standards authorities and so on, it could be made to work.

Here is my git repo:

http://git.annexia.org/?p=ilenstr.git;a=summary

Half-baked ideas: qemu -M container

For more half-baked ideas, see the ideas tag.

Containers offer a way to do limited virtualization with fewer resources. But a lot of people have belatedly realized that containers aren’t secure, and so there’s a trend for putting containers into real virtual machines.

Unfortunately qemu is not very well suited to just running a single instance of the Linux kernel, as we in the libguestfs community have long known. There are at least a couple of problems:

  1. You have to allocate a fixed amount of RAM to the VM. This is basically a guess. Do you guess too large and have memory wasted in guest kernel structures, or do you guess too small and have the VM fail at random?
  2. There’s a large amount of overhead — firmware, legacy device emulation and other nonsense — which is essentially irrelevant to the special case of running a Linux appliance in a VM.

Here’s the half-baked idea: Let’s make a qemu “container mode/machine” which is better for this one task.

Unlike other proposals in this area, I’m not suggesting that we throw away or rewrite qemu. That’s stupid, as qemu gives us lots of useful abilities.

Instead the right way to do this is to implement a special virtio-ram device where the guest kernel can start off with a very tiny amount of RAM and request more memory on demand. And an empty machine type which is just for running appliances (qemu on ARM already has this: mach-virt).

Libguestfs people and container people, all happy. What’s not to like?

Half-baked ideas: Demand-revealing referenda applied to Fedora features

For more half-baked ideas, see the ideas tag

Recently Lennart proposed dropping — and then unilaterally dropped a few days later — support for TCP wrappers in systemd. I haven’t used TCP wrappers for a long time, but there are some who do, and for those people dropping features like this provokes strong feelings. How should we conduct a survey or vote to decide what features to add or drop in software projects?

One way would be for all users [however that is defined] to have a vote. The problem with that is that a feature which few people use, but which really matters to those who do use it, would probably get dropped by a simple majority vote.

A better idea would be to use an economic mechanism based on revealed preference. The idea is that by asking people to risk their own money on the outcome of a vote, you hope to get a truer picture of their feelings. This technique also excludes moaners with lots of time on their hands to argue on mailing lists.

Chris Dillow (who incidentally writes a really great blog) has a worked example of a demand-revealing referendum which you should go and read.

Let’s try this with the systemd / TCP wrappers example. I’m going to have six voters. Four are mostly apathetic about the feature. But two of them use it, and one of those is going to have to change his whole infrastructure around if TCP wrappers goes away.

But first I have to assign a cost to this feature [1]. Unlike Chris’s Trident example, removing TCP wrappers from systemd is cheap. But it’s not completely free, assuming that Lennart is going to have to write some code, communicate the change, update documentation and so on. I’ll say it costs £12, which is £2/voter.

Let’s see how our six people might vote:

        Cost  Benefit
Alice   £ 2   £    5
Bob     £ 2   £   10
Charlie £ 2   £    2
Diane   £ 2   £    2
Eleanor £ 2   £  -50
Fred    £ 2   £-1000
TOTAL   £12   £-1031

Alice and Bob perceive some small benefit to the change because they think it’ll make systemd cleaner. Fred is the one who is going to have to make significant changes to his company network, and he’s not happy. Eleanor is the other user of the feature, which is why she also shows a loss. Charlie and Diane are completely neutral.

The net benefits are calculated by subtracting the cost from the benefit:

        Cost  Benefit  Net benefit
                     (Benefit - Cost)
Alice   £ 2   £    5   £    3
Bob     £ 2   £   10   £    8
Charlie £ 2   £    2   £    0
Diane   £ 2   £    2   £    0
Eleanor £ 2   £  -50   £  -52
Fred    £ 2   £-1000   £-1002
TOTAL   £12   £-1031   £-1043

One thing you should notice from the “TOTAL” row is that there is no (expressed) net benefit to the change. Fred’s large negative vote has soured the whole thing. It sounds unfair that Fred is able to block this, but read on …

All we’ve done so far is ask people to guess numbers. To make it a revealed preference, we have to get people to pay real money. In this case, we’re going to ask some people to pay what is called a Clarke tax.

The tax is paid only by those who “win” (or get their way). In this example Eleanor and Fred get their way and we keep TCP wrappers in systemd. They pay the social cost that their win imposes on the rest of the voters. To calculate the tax you remove Eleanor and Fred from the table and find the net benefit of everyone else:

        Net benefit
Alice   £    3
Bob     £    8
Charlie £    0
Diane   £    0
TAX     £   11

Eleanor and Fred have to pay £11 in tax. (I’m unclear if this is split equally or pro-rata according to their vote.) They pay this real money to Alice and Bob. Even after paying the tax, Eleanor and Fred are still better off (according to their claim). Alice and Bob have been compensated for their lost benefit.
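
Since this is only arithmetic, the whole calculation can be written down in a few lines of code. This is just a sketch using the figures from the tables above (the voters, the £2 per-voter cost, and the fact that Eleanor and Fred are the winners are all taken from the worked example):

#include <stdio.h>

static const char *names[]  = { "Alice", "Bob", "Charlie", "Diane",
                                "Eleanor", "Fred" };
static const int benefit[]  = { 5, 10, 2, 2, -50, -1000 };
static const int winner[]   = { 0, 0, 0, 0, 1, 1 };   /* Eleanor, Fred */
static const int cost = 2;                            /* per voter */

int
main (void)
{
  int total = 0, tax = 0;

  for (int i = 0; i < 6; i++) {
    int net = benefit[i] - cost;
    printf ("%-8s net benefit %5d\n", names[i], net);
    total += net;
    /* The Clarke tax is the net benefit of everyone who is NOT on
     * the winning side. */
    if (!winner[i])
      tax += net;
  }

  printf ("total net benefit: %d\n", total);      /* -1043, so no change */
  printf ("tax paid by the winners: %d\n", tax);  /* 11 */
  return 0;
}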

The bids in the auction are sealed — ie. people shouldn’t be able to collude. Let’s imagine however that Alice estimated Fred’s £1000 cost and tried to neutralize it by claiming a £2000 benefit to the change. Alice would win (total net benefit becomes positive £952), but she and the other winners would have to pay a tax of £1046. This is a costly victory, but the money goes some way to compensating Fred for the changes he has to make to his company network.

[1] One issue with this is the estimate of the cost of the feature. I’m sure systemd developers will claim that although dropping TCP wrappers costs a bit of money in the short term, it pays dividends in the long term because of reduced code maintenance and bug reports. In other words that the cost is negative. You have to be able to provide credible costs for this to work, but I think you can express that by having the feature developers join in the voting process, in other words, revealing their true preferences as well.


Half-baked idea: Kill SIGOK

For more half-baked ideas, see the ideas tag

I’m sure you’ve sat there waiting for some long compile to finish. It’s just started the tests, but you don’t want it to do the tests, just finish compiling dammit!

My half-baked idea is this: sending a special signal to a process (SIGOK) should kill the process, but the process should return 0 (ie. non-error exit).
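
SIGOK doesn’t exist, of course. The closest you can get today, which (unlike the proposal) needs the target program’s cooperation, is to install a handler for an existing signal that exits with status 0. A minimal sketch, using SIGUSR1 to stand in for SIGOK:

#include <signal.h>
#include <unistd.h>

/* Pretend SIGUSR1 is "SIGOK": on receipt, stop immediately but
 * report success to whoever is waiting for us. */
static void
on_sigok (int sig)
{
  (void) sig;
  _exit (0);
}

int
main (void)
{
  signal (SIGUSR1, on_sigok);

  /* ... imagine the long build and test run here ... */
  for (;;)
    pause ();
}

Then kill -USR1 <pid> ends the run with a successful exit status. The point of the half-baked idea is that a real SIGOK would work on any process, without it having to opt in.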

Half-baked ideas: Kernel modules-in-a-file

For more half-baked ideas, see the ideas tag.

The Linux kernel:

/boot/vmlinuz-3.11.9-200.fc19.x86_64

5 MB of highly optimized code that runs the world’s computers. You can copy it around and boot it on different hardware.

No wait, that’s all wrong. I’m forgetting about the modules. There are another 2832 files in /lib/modules/3.11.9-200.fc19.x86_64 and without them this kernel will hardly function.

It’s nice to be able to copy the kernel around (to compile it on fast x86-64 and run it on slow ARM, or to run the host kernel in a VM), but all those module files make everyone’s lives a lot more complex.

So my half-baked idea is this:

Combine all the modules into a single file, let’s call it /boot/modules-3.11.9-200.fc19.x86_64. This file might start off as an ext4 filesystem which can be loopback-mounted on /lib/modules. Eventually I’d hope that the module tools could load modules from the file directly.

Now you can happily cross-compile and copy just two kernel files from x86 to ARM or into your VMs, and that’s a lot less hassle.

Half-baked ideas: Log level viewer

For more half-baked ideas, see the ideas tag

Back when I used to work on this sort of thing we used to do a lot of testing involving loops. For 1000 passes, for 1000 voltage settings, for 1000 pulse widths, do the test. That sort of thing. Because logging all 1000x1000x1000 tests would produce far too much output, I wrote a little hierarchical logging library that saved the logs at each level:

for (pass = 0; pass < 1000; pass++) {
  log (1, "pass %d", pass);
  for (voltage = 0; voltage < 1000; voltage++) {
    log (2, "voltage %d", voltage);
    for (pulse = 0; pulse < 1000; pulse++) {
      log (3, "pulse %d", pulse);
      log (3, "performing test");
      log (3, "setting pulse");
      if (set_pulse () == -1)
         error ("setting pulse failed");
      /* etc */
    }
  }
}

The log function didn’t immediately print the message. Instead it saved it in a buffer. Only when a test failed would the log buffer be printed so you’d see something like:

pass 997
voltage 123
pulse 0
performing test
setting pulse
error: setting pulse failed
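
A much-simplified sketch of how that buffering might work (this is my reconstruction, not the original library; it keeps just one saved message per level, which is enough to replay the nested context when something fails):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LEVELS 8
#define MSG_SIZE 256

/* One saved message per nesting level.  Nothing is printed until an
 * error happens, at which point the saved context is replayed. */
static char saved[MAX_LEVELS][MSG_SIZE];

/* Corresponds to log () in the example above. */
static void
logmsg (int level, const char *fmt, ...)
{
  va_list args;

  if (level < 0 || level >= MAX_LEVELS)
    return;

  va_start (args, fmt);
  vsnprintf (saved[level], MSG_SIZE, fmt, args);
  va_end (args);

  /* A new message at this level invalidates anything saved deeper. */
  for (int i = level + 1; i < MAX_LEVELS; i++)
    saved[i][0] = '\0';
}

static void
error (const char *fmt, ...)
{
  va_list args;

  /* Replay the saved context, outermost level first. */
  for (int i = 0; i < MAX_LEVELS; i++)
    if (saved[i][0])
      fprintf (stderr, "%s\n", saved[i]);

  va_start (args, fmt);
  fprintf (stderr, "error: ");
  vfprintf (stderr, fmt, args);
  fprintf (stderr, "\n");
  va_end (args);
  exit (EXIT_FAILURE);
}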

Well, I don’t now recall the details of the original library, but it strikes me that there are many situations where logging is hierarchical, and I’m not aware of any logging libraries that express this easily.

One example would be make. You have top-level rules like all and check. At a high level, a user might only be interested in whether make all check is currently running the all rule or the check rule. At the next level down, there are the directories being recursed into. Below that, individual Makefile rules. Below that, individual commands.

How nice would it be to have a log viewing program that could display this? On screen you’d see several subwindows, each individually scrolling:

▃▃ top level ▃▃▃▃▃▃▃▃▃▃▃▃▃▃
all: OK
check: running ...

▃▃ level 1 ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
make all in src: OK
make all in tools: OK
make check in src: running ...

▃▃ level 2 ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
  CC       config.o
  CC       domain.o
  CC       options.o
  TEST     API: running ...

That seems to me to be a much clearer way of presenting things than the usual make “splurge”.

Half-baked ideas: Wikipedia for facts

Want more half-baked ideas? See my ideas tag

Would you like to find out about Boston, USA? There’s Wikipedia for that. How about travelling there? There’s Wikivoyage Boston.

How about the population of Boston in the years 1625-2013? Or the wages of bartenders in that fine city over the years? Or the peak summer temperature each year? Not so good.

My half-baked idea is a “wikipedia for curated facts”. The facts could be derived from many sources, but would be presented in a uniform way (by place, time, variable, etc), with references to back them up.

This would be a great way to inject factual content into the air-headed anecdote-based nonsense that passes for opinion on the internet.

Half-baked idea: Content-addressable web proxy

For more half-baked ideas, see my ideas tag

There are several situations where you want to fetch some content and don’t particularly care which precise source it comes from:

  1. Downloading packages from Linux distro mirrors.
  2. Downloading git commits.
  3. Grabbing a bittorrent data block.

My proposal (which surely has been done??) is that clients can supply the hash of the file they want when connecting to a proxy; something like:

GET http://example.com/foo HTTP/1.1
Content-Hash: sha256 b32683017c9530[etc]

The proxy is entitled to return any object in its cache that has the desired hash. If it doesn’t have any such object then it fetches it from the URI in the usual way. We’d have to insist that only cryptographically strong hashes are allowed, both to prevent the client getting the wrong data and to stop clients fishing for unauthorized files from the cache.
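
For illustration, here is roughly what a client request might look like using libcurl. The Content-Hash header is the hypothetical one proposed above, and the proxy address is made up; only the standard libcurl calls are real:

#include <curl/curl.h>

int
main (void)
{
  CURL *curl;
  struct curl_slist *headers = NULL;
  CURLcode r;

  curl_global_init (CURL_GLOBAL_DEFAULT);
  curl = curl_easy_init ();
  if (!curl)
    return 1;

  /* Advertise the expected hash so the proxy may satisfy the request
   * from any cached copy of the same object (hash truncated as above). */
  headers = curl_slist_append (headers,
                               "Content-Hash: sha256 b32683017c9530...");

  curl_easy_setopt (curl, CURLOPT_URL, "http://example.com/foo");
  curl_easy_setopt (curl, CURLOPT_PROXY, "http://proxy.example.com:3128");
  curl_easy_setopt (curl, CURLOPT_HTTPHEADER, headers);

  r = curl_easy_perform (curl);

  curl_slist_free_all (headers);
  curl_easy_cleanup (curl);
  curl_global_cleanup ();
  return r == CURLE_OK ? 0 : 1;
}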

In the distro mirroring case, the metadata would contain the hashes of the packages (which it probably already does). The client would supply these to the proxy. The proxy would be able to satisfy the request no matter what mirror was selected — you wouldn’t get the situation where the proxy is downloading several copies of the same data from different mirrors.

In the git case, commits are already identified by their hashes. This would finally let us have an intelligent git mirror, something I’ve been wanting for a while, given that I’m on slow DSL and downloading gnulib multiple times per day is no fun for anyone.
