Tag Archives: qemu

virt-builder Debian 9 image available

Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:

$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio



Supernested on the QEMU Advent Calendar


I wrote supernested a few years ago to see if I could break nested KVM. It works by repeatedly nesting KVM guests until either something breaks or the whole thing grinds to a halt. Even on my very fastest machine I can only get to an L4 guest (L0 = host, L1 = normal guest).

Kashyap and Thomas Huth resurrected the QEMU Advent Calendar this year, and today (day 13) supernested is featured.

Please note that supernested should only be run on idle machines which aren’t doing anything else, as it can crash the machine.


Fedora 25 is out, virt-builder images available

$ virt-builder -l | grep fedora-25
fedora-25                x86_64     Fedora® 25 Server
fedora-25                i686       Fedora® 25 Server (i686)
fedora-25                aarch64    Fedora® 25 Server (aarch64)
fedora-25                armv7l     Fedora® 25 Server (armv7l)
fedora-25                ppc64      Fedora® 25 Server (ppc64)
fedora-25                ppc64le    Fedora® 25 Server (ppc64le)
$ virt-builder fedora-25
$ qemu-system-x86_64 -machine accel=kvm:tcg \
      -cpu host -m 2048 \
      -drive file=fedora-25.img,format=raw,if=virtio

Or to try out Fedora on a different architecture:

$ virt-builder fedora-25 --arch ppc64le -o fedora-25-ppc64le.img
$ qemu-system-ppc64 -cpu POWER8 -m 2048 \
      -drive file=fedora-25-ppc64le.img,format=raw,if=virtio



Now building Fedora/RISC-V “stage4” disk images

I’m happy to announce that Fedora/RISC-V, the project to bootstrap Fedora on the new RISC-V architecture, has reached a key milestone: We are now releasing clean “stage4” disk images which are built entirely from RPMs (ie. all files except two[1] are managed by RPM).

You can get the latest image from http://oirase.annexia.org/riscv/

To use it, you must enable my RISC-V tools copr:

# dnf copr enable rjones/riscv
# dnf install riscv-qemu riscv-pk

and you can then boot the stage4 directly using this qemu command[2]:

$ qemu-system-riscv -m 4G -kernel /usr/bin/bbl \
    -append vmlinux \
    -drive file=stage4-disk.img,format=raw -nographic

This is an early release and there are a few problems. The main one is that we lack a util-linux package, and thus there is no mount command so the disk image stays read-only after boot. You’ll see lots of errors like this at boot:

/init: line 16: mount: command not found
/init: line 19: mount: command not found
/init: line 20: mount: command not found

I hope to get that fixed soon.

There’s also no actual rpm command in the stage4, again because a required dependency is missing, and again that’s something that will be fixed real soon.

Many thanks go to David Abdurachmanov and Stefan O’Rear for their huge efforts building packages.


[1] Because there is no systemd package yet, currently two extra files are added to the disk image which are not under the control of RPM: /init and /usr/bin/poweroff

[2] For real hardware, read this page



Tip: Poor man’s qemu breakpoint

I’ve written before about how you can use qemu + gdb to debug a guest. Today I was wondering how I was going to debug a problem in a BIOS option ROM, when Stefan Hajnoczi mentioned this tip: Insert

1: jmp 1b

into the code as a “poor man’s breakpoint”. In case you don’t know what that assembly code does, it jumps backwards (b) to the nearest preceding 1: label. In other words, an infinite loop.
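Part of what makes the loop so convenient to patch in is that it assembles to just two bytes: EB (short jmp) followed by FE (displacement −2, landing back on the jmp’s own first byte). You can check the encoding without even running an assembler:

```shell
# "1: jmp 1b" encodes as the two-byte sequence EB FE.
# \353\376 is octal for 0xEB 0xFE; od dumps the bytes back as hex.
printf '\353\376' | od -An -tx1
```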

After inserting that into the option ROM, recompiling and rebooting the virtual machine, the boot hangs, and hitting ^C in gdb gets me straight to the place where I inserted the loop.

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000fff0 in ?? ()
(gdb) set architecture i8086
The target architecture is assumed to be i8086
(gdb) cont
Program received signal SIGINT, Interrupt.
0x00000045 in ?? ()
(gdb) info registers
eax            0xc100	49408
ecx            0x0	0
edx            0x0	0
ebx            0x0	0
esp            0x6f30	0x6f30
ebp            0x6f30	0x6f30
esi            0x0	0
edi            0x0	0
eip            0x45	0x45
eflags         0x2	[ ]
cs             0xc100	49408
ss             0x0	0
ds             0xc100	49408
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) disassemble 0xc1000,0xc1050
Dump of assembler code from 0xc1000 to 0xc1050:
   0x000c103c:	mov    %cs,%ax
   0x000c103e:	mov    %ax,%ds
   0x000c1040:	mov    %esp,%ebp
   0x000c1043:	cli    
   0x000c1044:	cld    
   0x000c1045:	jmp    0xc1045
   0x000c1047:	jmp    0xc162c
   0x000c104a:	sub    $0x4,%esp
   0x000c104e:	mov    0xc(%esp),%eax
End of assembler dump.

Look, my infinite loop!
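The addresses line up because of real-mode segmentation: the linear address is the segment shifted left four bits plus the offset, so cs:eip = 0xc100:0x0045 is linear 0xc1045, exactly where the loop appears in the dump. A quick sanity check:

```shell
# Real-mode address translation: linear = (cs << 4) + eip
printf '0x%x\n' $(( (0xc100 << 4) + 0x45 ))
```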

I can then jump over the loop and keep single stepping*:

(gdb) set $eip=0x47
(gdb) si
0x0000062c in ?? ()
(gdb) si
0x0000062e in ?? ()
(gdb) si
0x00000632 in ?? ()

I did wonder if I could take Stefan’s idea further and insert an actual breakpoint (int $3) into the code, but that didn’t work for me.

Note that to set breakpoints, the regular gdb break command doesn’t work. You have to use hardware-assisted breakpoints instead:

(gdb) hbreak *0xc164a
Hardware assisted breakpoint 1 at 0xc164a
(gdb) cont

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000064a in ?? ()

* If you find that single stepping doesn’t work, make sure you are using qemu in TCG mode (-M accel=tcg), as KVM code apparently cannot be single-stepped.



Tracing QEMU guest execution part 4

The final two questions that I posed last time were to do with constructing a timeline of what this guest is spending time on.

We can easily see system calls in the trace log, and we can also see when a kernel function is entered the first time (indicating that a new bit of the kernel is now running), and I wrote a Perl script to analyze that. That gave me a 115K line log file from which I did the rest of the analysis by hand to generate a timeline.
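The script itself isn’t reproduced, but the first-entry detection at its heart fits in one line of awk. The trace format below (one “timestamp event function” line per event) is a hypothetical stand-in for the real one, and a toy trace takes the place of the 115K-line log:

```shell
# Print the time at which each function is first entered, giving a
# coarse timeline of what the guest moves on to next.
# Trace format (hypothetical): "<timestamp> <enter|exit> <function>"
awk '$2 == "enter" && !seen[$3]++ { print $1, $3 }' <<'EOF'
0.000 enter start_kernel
0.010 enter sort
0.150 exit sort
0.160 enter sort
0.200 exit sort
0.210 enter rest_init
EOF
```

Repeat entries are suppressed, so feeding it the real trace yields one timeline line per function, in the order each part of the kernel first runs.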

I would reproduce it here, but the results aren’t very enlightening. In particular I doubt it’s more interesting than what you can get by reading the kernel printks from a boot log.

What is my conclusion after using these techniques to analyze a booting guest? Here I go:

  • It’s clunky and undocumented. Hopefully this series helps a little.
  • It would be much more powerful with stack traces. It should be possible to get them from QEMU, at least in theory, but it’s a lot of work to do so.
  • It would be much more powerful if we could extend the analysis into kernel modules and userspace.
  • More tooling around this might make it more bearable.



Tracing QEMU guest execution part 3

In the previous post I posed three questions about my detailed function-level trace of the kernel booting under QEMU. The first one was Which kernel functions consume the most time?

We don’t have stack traces, although that would be a good direction for future work. So if a function “A” calls another function “B” like this:

  A is entered
    --> calls B
    <-- B returns
  A returns

then we’re going to assign just the time in the two parts of “A” to “A”. In other words, “A” doesn’t inherit the time taken running “B”. (Except if “B” is inlined, in which case the trace cannot distinguish which code is in “A” and which is in “B”).

The other limitation is lack of insight into what kernel modules are doing. This is very difficult to solve: compiling kernel modules into the kernel proper would change what I’m trying to measure considerably.

Given those limitations, I wrote a Perl script to do all that from my previous trace. I’m not going to bother reproducing the script here because it’s quite simple, but also very specific to the trace format I’m using. If you got this far, you’ll have no trouble writing your own analysis tools.
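For anyone who wants a starting point anyway, here is a minimal sketch of the self-time accounting described above, in awk rather than Perl: keep a call stack and charge each slice of elapsed time to whichever function is on top, so a caller never inherits its callee’s time. The trace format and the inline toy trace are hypothetical stand-ins for the real thing:

```shell
# Self-time per function: maintain a call stack and charge elapsed
# time to the function currently on top of it.
# Trace format (hypothetical): "<timestamp> <enter|exit> <function>"
awk '
{
    if (sp > 0) self[stack[sp]] += $1 - last  # charge slice to current function
    last = $1
    if ($2 == "enter") stack[++sp] = $3       # push callee
    else               sp--                   # pop on return
}
END { for (f in self) printf "%s %.3f\n", f, self[f] }
' <<'EOF' | sort -k2 -rn
0.000 enter start_kernel
0.010 enter sort
0.150 exit sort
0.200 exit start_kernel
EOF
```

In this toy trace, sort is charged 0.140 and start_kernel only its own 0.060, even though start_kernel was on the stack the whole time. Replace the here-document with the real trace file (and adjust the field numbers) to run it for real.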

The results are at the end of the post. For each function that was called, I had a look into the kernel code to find out what it seems to be doing, and those notes are below.

  • sha256_transform is the SHA256 function. While I’m not sure what it is being used for (some kind of module signature check seems to be the most likely explanation), the more interesting thing is that we’re not using any specialized version of the function (eg. with AVX support). That’s easily explained: we’re using TCG, not KVM, so no host processor features are available. However on a real boot we would be using AVX, so the function should take a lot less time, and I think we can discount it.
  • native_safe_halt is the function which halts the processor when it is idle (eg. waiting for an interrupt). Is it worrying that we spend 1/50th of the time not working? Would it help to have more virtual CPUs or is there an inherent lack of parallelism in the boot process?
  • It’s interesting that we spend such a large amount of time in the sort function. It’s used all over the place, eg. for sorting the BIOS E820 map, sorting memory mappings, checking boot parameters, sorting lists of wake-up events …
  • task_tick_fair is part of the Completely Fair Scheduler.
  • If there’s a boot option to disable ftrace, I cannot find it.

My main conclusion is there is no “smoking gun” here. Everything generally points to things that a kernel (or at least, the core, non-module parts of a kernel) should be doing. So this analysis has not been very helpful.

Continue reading
