There are various open source ISAs and chip designs. I’ve previously run OpenRISC 1200 on an FPGA. Another effort is the RISC-V (“RISC Five”) project, which is developing an open, patent-free 64-bit ISA. It has a sister project, lowRISC, which aims to produce a synthesizable RISC-V FPGA design “in 6 months” and to tape out by the end of this year (I’m a little skeptical of the timeframes).
RISC-V support has been added to a fork of qemu:
$ git remote add riscv https://github.com/riscv/riscv-qemu
$ git fetch riscv
$ git checkout -b riscv-master --track riscv/master
$ ./configure --target-list="riscv-softmmu"
$ make
$ ./riscv-softmmu/qemu-system-riscv -cpu \?
$ ./riscv-softmmu/qemu-system-riscv -machine \?
Supported machines are:
board RISCV Board (default)
none empty machine
To save yourself a world of pain, download a RISC-V Linux kernel binary and root image from here.
$ file ~/vmlinux
/home/rjones/vmlinux: ELF 64-bit LSB executable, UCB RISC-V, version 1 (SYSV), statically linked, BuildID[sha1]=d0a6d680362018e0f3b9208a7ea7f79b2b403f7c, not stripped
Then you can boot the image in the usual way:
$ ./riscv-softmmu/qemu-system-riscv \
-display none \
-kernel ~/vmlinux \
-hda ~/root.bin
The root filesystem is very sparse:
# uname -a
Linux ucbvax 3.14.15-g4073e84-dirty #4 Sun Jan 11 07:17:06 PST 2015 riscv GNU/Linux
# ls /bin
ash chgrp dd ln mv rmdir touch
base64 chmod df ls nice sleep true
busybox chown echo mkdir printenv stat uname
cat cp false mknod pwd stty usleep
catv date fsync mount rm sync
# ls /sbin
# ls /usr/bin
[ dirname groups mkfifo sha1sum tac uniq
[[ dos2unix head nohup sha256sum tail unix2dos
basename du hostid od sha3sum tee uudecode
cal env id printf sha512sum test uuencode
cksum expand install readlink sort tr wc
comm expr logname realpath split tty whoami
cut fold md5sum seq sum unexpand yes
Obligatory comic strip
See end of post for an important update
UEFI firmware has a concept of persistent variables. They are used to control the boot order amongst other things. They are stored in non-volatile RAM on the system board, or for virtual machines in a host file.
When a UEFI machine is running you can edit these variables using various tools, such as Peter Jones’s efivar library, or the efibootmgr program.
These programs don’t actually edit the varstore directly. They access the kernel’s /sys/firmware/efi interface, but even the kernel doesn’t edit the varstore: it just redirects to the UEFI runtime “Variable Services”, so what is really running is UEFI code (possibly proprietary, but more usually from the open source TianoCore project).
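For example, on a live UEFI machine (recent kernels mount efivarfs under /sys/firmware/efi/efivars) you can list the variables directly, or print the boot entries with efibootmgr:

# ls /sys/firmware/efi/efivars
# efibootmgr -v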
So how can you edit varstores offline? The NVRAM file format is peculiar to say the least, and the only real specification is the code that writes it from Tianocore. So somehow you must reuse that code. To make it more complicated, the varstore NVRAM format is tied to the specific firmware that uses it, so varstores used on aarch64 aren’t compatible with those on x86-64, nor are SecureBoot varstores compatible with normal ones.
virt-efivars is an attempt to do that. It’s rather “meta”. You write a small editor program (an example is included), and virt-efivars compiles it into a tiny appliance. You then boot the appliance using qemu + UEFI firmware + varstore combination, the editor program runs and edits the varstore, using the UEFI code.
It works … at least on aarch64, which is the only convenient machine I have that has virtualized UEFI.
Git repo: http://git.annexia.org/?p=virt-efivars.git;a=summary
After studying this problem some more, Laszlo Ersek came up with a different and better plan:
- Boot qemu with only the OVMF code & varstore attached. No OS or appliance.
- This should drop you into a UEFI shell which is accessible over qemu’s serial port.
- Send appropriate setvar commands to update the variables. Using expect this should be automatable (see the sketch below).
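Here is a rough sketch of that automation, assuming an x86-64 host with split OVMF images. The filenames and the exact setvar arguments are assumptions, not tested — check help setvar in the UEFI shell:

#!/usr/bin/expect -f
# Boot the firmware alone; with no OS to boot, OVMF falls through
# to the UEFI shell, which appears on the serial console (stdio here).
spawn qemu-system-x86_64 -nographic \
    -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=varstore.fd
expect "Shell>"
# EFI_GLOBAL_VARIABLE GUID; value and syntax are illustrative.
send "setvar BootNext -guid 8be4df61-93ca-11d2-aa0d-00e098032b8c -bs -rt -nv =0x0001\r"
expect "Shell>"
send "reset -s\r"
expect eof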
For more half-baked ideas, see the ideas tag.
Containers offer a way to do limited virtualization with fewer resources. But a lot of people have belatedly realized that containers aren’t secure, and so there’s a trend for putting containers into real virtual machines.
Unfortunately qemu is not very well suited to just running a single instance of the Linux kernel, as we in the libguestfs community have long known. There are at least a couple of problems:
- You have to allocate a fixed amount of RAM to the VM. This is basically a guess: guess too large and memory is wasted in guest kernel structures; guess too small and the VM fails at random.
- There’s a large amount of overhead — firmware, legacy device emulation and other nonsense — which is essentially irrelevant to the special case of running a Linux appliance in a VM.
Here’s the half-baked idea: Let’s make a qemu “container mode/machine” which is better for this one task.
Unlike other proposals in this area, I’m not suggesting that we throw away or rewrite qemu. That’s stupid, as qemu gives us lots of useful abilities.
Instead, the right way to do this is to implement a special virtio-ram device, where the guest kernel can start off with a very tiny amount of RAM and request more memory on demand, plus an empty machine type which is just for running appliances (qemu on ARM already has this: mach-virt).
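For a flavour of the appliance-only end of this, a minimal mach-virt boot on ARM already needs little more than a kernel and an initramfs (filenames hypothetical):

$ qemu-system-aarch64 \
    -M virt -cpu cortex-a57 -m 512 -nographic \
    -kernel vmlinuz -initrd appliance.cpio.gz \
    -append 'console=ttyAMA0'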
Libguestfs people and container people, all happy. What’s not to like?
qemu-img convert input output does not work if the output is a pipe.
It’d sure be nice if it did though! For one thing, we could use this in virt-v2v to stream images into OpenStack Glance (instead of having to spool them into a temporary file).
I mentioned this to Paolo Bonzini yesterday and he suggested a simple workaround. Just replace the output with:
qemu-img convert -n input nbd:...
and write an NBD server that turns the sequence of writes from qemu-img into a stream that gets written to a pipe. Assuming the output is raw, qemu-img convert will write, starting at disk offset 0, linearly through to the end of the disk image.
How to write such an NBD server easily? nbdkit is a project I started to make it easy to write NBD servers.
So I wrote a streaming plugin which does exactly that, in 243 lines of code.
Using a feature called captive nbdkit, you can rewrite the above command as:
nbdkit -U - streaming pipe=/tmp/output --run '
  qemu-img convert -n -O raw input $nbd
'
(This command will “hang” when you run it — you have to attach some process to read from the pipe, eg: md5sum < /tmp/output)
The streaming plugin would be a lot more generally useful if it supported a sliding window, allowing limited reverse seeking and reading. So there’s a nice little project for a motivated person. See here.
If you ever used the old version of virt-v2v, our software that converts guests to run on KVM, then you probably found it slow, and worse, it could fail right at the end of the conversion (after possibly an hour or more). No one liked that, least of all the developers and support people who had to help people use it.
A V2V conversion is intrinsically going to take a long time, because it always involves copying huge disk images around. These can be gigabytes or even terabytes in size.
My main aim with the rewrite was to do all the work up front (and if the conversion is going to fail, then fail early), and leave the huge copy to the last step. The second aim was to work much harder to minimize the amount of data that we need to copy, so the copy is quicker. I achieved both of these aims using a lot of new technology that we developed for qemu in RHEL 7.
Virt-v2v works (now) by putting an overlay on top of the source disk. This overlay protects the source disk from being modified. All the writes done to the source disk during conversion (eg. modifying config files and adding device drivers) are saved into the overlay. Then we qemu-img convert the overlay to the final target. Although this sounds simple and possibly obvious, none of this could have been done when we wrote old virt-v2v. It is possible now because:
- qcow2 overlays can now have virtual backing files that come from HTTPS or SSH sources. This allows us to place the overlay on top of (eg) a VMware vCenter Server source without having to copy the whole disk from the source first.
- qcow2 overlays can perform copy-on-read. This means you only need to read each block of data from the source once, and then it is cached in the overlay, making things much faster.
- qemu now has excellent discard and trim support. To minimize the amount of data that we copy, we first fstrim the filesystems. This causes the overlay to remember which bits of the filesystem are used and only copy those bits.
- I added support for fstrim to ntfs-3g so this works for Windows guests too.
- libguestfs has support for remote storage, cachemode, discard, copy-on-read and more, meaning we can use all these features in virt-v2v.
- We use OCaml — not C, and not type-unsafe languages — to ensure that the compiler is helping us to find bugs in the code that we write, and also to ensure that we end up with an optimized, standalone binary that requires no runtime support/interpreters and can be shipped everywhere.
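In outline, the overlay technique above looks something like this (the remote URL and filenames are illustrative, and real virt-v2v drives all of this through libguestfs rather than by hand):

$ qemu-img create -f qcow2 \
    -b ssh://root@esxi/vmfs/volumes/datastore1/guest/guest-flat.vmdk \
    overlay.qcow2
# ... conversion runs; all writes (drivers, config edits) land in overlay.qcow2 ...
$ qemu-img convert -O qcow2 overlay.qcow2 final.qcow2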
If qemu crashes or fails when run under libguestfs, it can be a bit hard to debug things. However a small qemu wrapper and gdbserver can help.
Create a file called qemu-wrapper, make it executable with chmod +x, containing:

#!/bin/bash -
# Run qemu under gdbserver, except when libguestfs is merely probing
# qemu's features with -help, -version or -device ?
if ! echo "$@" | grep -sqE -- '-help|-version|-device \?' ; then
    exec gdbserver localhost:1234 /usr/bin/qemu-system-x86_64 "$@"
fi
exec /usr/bin/qemu-system-x86_64 "$@"
Set your environment variables so libguestfs will use the qemu wrapper instead of running qemu directly:
$ export LIBGUESTFS_BACKEND=direct
$ export LIBGUESTFS_HV=/path/to/qemu-wrapper
Now we run guestfish or another virt tool as normal:
$ guestfish -a /dev/null -v -x run
When qemu starts up, gdbserver will run and halt the process, printing:
Listening on port 1234
At this point you can connect gdb:
(gdb) file /usr/bin/qemu-system-x86_64
(gdb) target remote tcp::1234
set breakpoints etc here
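From there it’s an ordinary gdb session; for instance (the breakpoint is just an illustration):

(gdb) break main
(gdb) continue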
This is mostly adapted from this long thread on the VMware community site.
I got VMware ESXi 5.5.0 running on upstream KVM today.
First I had to disable the “VMware backdoor”. When VMware runs, it detects that qemu underneath is emulating this port and tries to use it to query the machine (instead of using CPUID and so on). Unfortunately qemu’s emulation of the VMware backdoor is very half-assed. There’s no way to disable it except to patch qemu:
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index eaf3e61..ca1c422 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -204,7 +204,7 @@ static void pc_init1(QEMUMachineInitArgs *args,
     pc_vga_init(isa_bus, pci_enabled ? pci_bus : NULL);
 
     /* init basic PC hardware */
-    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, xen_enabled(),
+    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, 1,
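To try it, apply the patch and rebuild qemu (the patch filename is hypothetical):

$ git apply disable-vmport.patch
$ ./configure --target-list=x86_64-softmmu
$ make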
It would be nice if this were configurable in qemu. This is now being fixed upstream.
Secondly I had to turn off MSR emulation. This is, unfortunately, a machine-wide setting:
# echo 1 > /sys/module/kvm/parameters/ignore_msrs
# cat /sys/module/kvm/parameters/ignore_msrs
Y
Thirdly I had to give the ESXi virtual machine an IDE disk and an e1000 network card. Note also that ESXi requires ≥ 2 vCPUs and at least 2 GB of RAM.
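Putting that together, a qemu command line along these lines should boot the installer (filenames are illustrative):

$ qemu-system-x86_64 -enable-kvm -cpu host \
    -smp 2 -m 2048 \
    -drive file=esxi.img,if=ide \
    -netdev user,id=net0 -device e1000,netdev=net0 \
    -cdrom VMware-VMvisor-Installer-5.5.0.x86_64.iso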