How many disks can you add to a (virtual) Linux machine?

><rescue> ls -l /dev/sd[tab]
Display all 4001 possibilities? (y or n)

Just how many virtual hard drives is it practical to add to a Linux VM using qemu/KVM? I tried to find out. I started by modifying virt-rescue to raise the limit on the number of scratch disks that can be added¹:

virt-rescue --scratch=4000

I hit some interesting limits in our toolchain along the way.

256

256 is the maximum number of virtio-scsi disks in unpatched virt-rescue / libguestfs. A single virtio-scsi controller supports 256 targets, with up to 16384 SCSI logical units (LUNs) per target. We were assigning one disk per target, and giving them all unit number 0, so of course we couldn’t add more than 256 drives, but virtio-scsi supports very many more. In theory each virtio-scsi controller could support 256 x 16,384 = 4,194,304 drives. You can even add more than one controller to a guest.
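
As a sketch of what that headroom looks like at the qemu level (the image names here are made up, and this is not the exact command line virt-rescue generates), several disks can share one target simply by being given different unit (LUN) numbers:

qemu-system-x86_64 ... \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=scratch1.img,format=raw,if=none,id=d1 \
    -device scsi-hd,drive=d1,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \
    -drive file=scratch2.img,format=raw,if=none,id=d2 \
    -device scsi-hd,drive=d2,bus=scsi0.0,channel=0,scsi-id=0,lun=1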

About 490-500

At around 490-500 disks, any monitoring tools which are using libvirt to collect disk statistics from your VMs will crash (https://bugzilla.redhat.com/show_bug.cgi?id=1440683).

About 1000

qemu uses one file descriptor per disk (maybe two per disk if you are using ioeventfd). qemu quickly hits the default open file limit of 1024 (ulimit -n). You can raise this to something much larger by creating this file:

$ cat /etc/security/limits.d/99-local.conf
# So we can run qemu with many disks.
rjones - nofile 65536

It’s called /etc/security for a reason, so you should be careful adjusting settings here except on test machines.
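
The new limit only applies to sessions started after the change, so log in again before retrying. To see how close qemu is getting, you can count its open descriptors directly (the pgrep pattern here is just an example):

$ ulimit -n                                                      # the limit for your session
$ ls /proc/$(pgrep -f qemu-system-x86_64 | head -1)/fd | wc -l   # descriptors a running qemu has open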

About 4000

The Linux guest kernel uses quite a lot of memory simply enumerating each SCSI drive. My default guest had 512 MB of RAM (no swap), and ran out of memory and panicked when I tried to add 4000 disks. The solution was to increase guest RAM to 8 GB for the remainder of the test.

Booting with 4000 disks took 10 minutes² and free shows about a gigabyte of memory disappears:

><rescue> free -m
              total        used        free      shared  buff/cache   available
Mem:           7964         104        6945          15         914        7038
Swap:             0           0           0
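
A gigabyte spread across 4000 disks works out at roughly 250 KB of kernel memory per disk, which is in the same ballpark as the per-drive estimate further down.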

What was also surprising is that increasing the number of virtual CPUs from 1 to 16 made no difference to the boot time (in fact it was a bit slower). So even though SCSI LUN probing is not deterministic, it appears that it is not running in parallel either.

About 8000

If you’re using libvirt to manage the guest, it will fail at around 8000 disks because the XML document describing the guest is too large to transfer over libvirt’s internal client-to-daemon connection (https://bugzilla.redhat.com/show_bug.cgi?id=1443066). For the remainder of the test I instructed virt-rescue to run qemu directly.

My guest with 8000 disks took 77 minutes to boot. About 1.9 GB of RAM was missing, and my ballpark estimate is that each extra drive takes about 200 KB of kernel memory.

Between 10,000 and 11,000

We pass the list of drives to qemu on the command line, with each disk taking perhaps 180 bytes to express. Somewhere between 10,000 and 11,000 disks, this long command line fails with:

qemu-system-x86_64: Argument list too long
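
That error string is strerror(E2BIG), the kernel’s limit on the combined size of the argument list and environment passed to execve(2). You can check the limit and do the arithmetic yourself (the figure in the comment is the usual Linux default, not a measured value):

$ getconf ARG_MAX          # typically 2097152, i.e. 2 MiB (a quarter of the default stack limit)
$ echo $(( 180 * 10500 ))  # 1890000, about 1.9 MB of command line, right at that limit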

To be continued …

So that’s the end of my testing, for now. I managed to create a guest with 10,000 drives, but I was hoping to explore what happens when you add more than 18278 drives (18278 = 26 + 26² + 26³, the number of disks that can be named /dev/sda through /dev/sdzzz), since some parts of the kernel or userspace stack may not be quite ready for that.

Notes

¹That command will not work with the virt-rescue program found in most Linux distros. I have had to patch it extensively and those patches aren’t yet upstream.

²Note that the uptime command within the guest is not an accurate way to measure the boot time when dealing with large numbers of disks, because it doesn’t include the time taken by the BIOS which has to scan the disks too. To measure boot times, use the wallclock time from launching qemu.

Thanks: Paolo Bonzini

Edit: 2015 KVM Forum talk about KVM’s limits.

New in virt-v2v: Import from .vmx files

Virt-v2v converts guests from VMware to KVM, installing any necessary virtio drivers to make the guest work optimally on KVM.

A new feature in virt-v2v 1.37.10 is the ability to import VMware guests directly from disk storage. You do this by pointing virt-v2v directly to the guest’s .vmx metadata file (guest disks are found from references within the VMX file):

$ virt-v2v -i vmx /folder/Fedora_20/Fedora_20.vmx -o local -os /var/tmp
[   0.0] Opening the source -i vmx /folder/Fedora_20/Fedora_20.vmx
[   0.0] Creating an overlay to protect the source from being modified
[   0.1] Initializing the target -o local -os /var/tmp
[   0.1] Opening the overlay
[   6.5] Inspecting the overlay
[  14.0] Checking for sufficient free disk space in the guest
[  14.0] Estimating space required on target for each disk
         ...
[  70.8] Creating output metadata
[  70.8] Finishing off

The problem was how to deal with the case where VMware is storing guests on a NAS (Network Attached Storage). Previously we had to go through VMware to read the guests, either over https or using the proprietary ovftool, but both methods are really slow. If we can directly mount the NAS on the conversion server and read the storage, VMware is no longer involved and things go (much) faster. The result is we’re liberating people from proprietary software much more efficiently.

You’re supposed to use VMware’s proprietary libraries to read and write VMX files, which is of course out of the question for virt-v2v, so this was an interesting voyage into VMware’s undocumented and unspecified file format. Superficially it seems like a simple list of key/value pairs in a text file:

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "10"
nvram = "Fedora_20.nvram"
pciBridge0.present = "TRUE"
svga.present = "TRUE"
...
scsi0.virtualDev = "pvscsi"
scsi0.present = "TRUE"
sata0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "Fedora_20.vmdk"
...
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "0"
sched.mem.minSize = "0"
sched.mem.shares = "normal"

But there are some things to catch you out:

  • All values are quoted, even booleans, integers and lists. This means the parser cannot distinguish between strings, booleans and other types. The code which interprets the value must know the expected type, do the conversion and deal with failures.
  • It’s case insensitive, except when it’s case sensitive. In the example above, all of the keys, and the values "TRUE", "normal" and "scsi-hardDisk" are case insensitive. But the filenames ("Fedora_20.vmdk") are case sensitive. The virt-v2v parser tries to be as case insensitive as possible.
  • Keys are arranged into a tree. Adding key.present = "FALSE" causes VMware to ignore all keys at that level and below in the tree.
  • Quoting of values is plain strange. I’ve never seen the | (pipe) character used as a quoting symbol before.
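
To give a flavour of the first two points, here is a rough shell sketch of a case-insensitive lookup that strips the surrounding quotes (purely an illustration; the real parser in virt-v2v is written in OCaml and handles far more than this):

# Look up a key case-insensitively in a VMX file and print the unquoted value.
# Note the result is always a string; the caller still has to interpret
# "TRUE", numbers, lists and so on, and cope with bad values.
vmx_get () {
    grep -i "^$2[[:space:]]*=" "$1" | sed 's/^[^=]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/'
}

vmx_get Fedora_20.vmx 'scsi0:0.fileName'   # prints: Fedora_20.vmdk
vmx_get Fedora_20.vmx 'SCSI0:0.FILENAME'   # same result; keys are case insensitive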

Luckily libvirt has a large selection of VMX files from the wild we could use to test against.

Pine64 + USB drive

It looks like a crazy ball of string and rubber bands now. I added an external SSD in an enclosure based on the JMS578 chipset. But the board itself cannot supply enough power through USB to external drives, so there’s also a powered USB hub (thus the whole thing needs two power supplies).

“It works” is the best I can say about it at this point.

Important edit: I discovered that the powered USB hub is not necessary (presumably because this is an SSD, not a spinning disk). That eliminates the power supply problem.

Fedora on the Pine64

Well, getting Fedora running on the Pine64 has been an adventure. Fedora itself doesn’t work out of the box, but that’s to be expected because we’re waiting for some things to go upstream. But thanks to the tireless efforts of the Linux SunXi project I was able to boot the board with a (mostly) open source firmware, self-compiled near-upstream kernel, and a Fedora filesystem.

rjones@pine:~$ uname -a
Linux pine 4.9.0-00036-ge6af24d #14 SMP PREEMPT Sat Mar 18 13:56:36 GMT 2017 aarch64 aarch64 aarch64 GNU/Linux
rjones@pine:~$ cat /etc/fedora-release
Fedora release 25 (Twenty Five)

Below I will describe how to do this, but note that by the time Fedora 26 comes out you should not need to do any of this stuff.


Cross-compile your own kernel as described here. As well as the standard defconfig you will also need to enable CONFIG_XFS_FS=y.

Run make dtbs to create arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dtb which you will need later.
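
For reference, the cross-compile steps look roughly like this (the aarch64-linux-gnu- toolchain prefix is an assumption; use whatever cross-compiler your distro ships):

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- menuconfig    # enable CONFIG_XFS_FS=y here
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 Image dtbs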

Read about the AllWinner A64/Pine64 boot process. It’s not necessary to replicate those steps exactly, but it helps to explain why we’re doing the next steps.

Grab one of the firmware images from here (it doesn’t matter which) and write it to your micro SD card. But note that this firmware and dtb are out of date, so you must then get the latest firmware from here and overwrite it:

# dd if=pine64_firmware-20170314.img of=/dev/mmcblk0 bs=8k seek=1

The firmware image above will create a single 100 MB FAT partition. Add further partitions to the partition table on the micro SD card so it looks approximately like this. The root filesystem must be on partition 5 (the first logical partition).

Device         Boot    Start      End  Sectors  Size Id Type
/dev/mmcblk0p1          2048   204799   202752   99M  6 FAT16
/dev/mmcblk0p2        204800 31116287 30911488 14.8G  5 Extended
/dev/mmcblk0p5        206848 21178367 20971520   10G 83 Linux
/dev/mmcblk0p6      21180416 25374719  4194304    2G 82 Linux swap / Solaris
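
If you prefer to script the partitioning, an sfdisk description like the following reproduces that layout. This is only a sketch: it assumes a 16 GB card and reuses the exact sector numbers from the listing above, so check them against your own card before writing anything:

# sfdisk /dev/mmcblk0 <<'EOF'
label: dos
/dev/mmcblk0p1 : start=2048,     size=202752,   type=6
/dev/mmcblk0p2 : start=204800,   size=30911488, type=5
/dev/mmcblk0p5 : start=206848,   size=20971520, type=83
/dev/mmcblk0p6 : start=21180416, size=4194304,  type=82
EOF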

Make swap on /dev/mmcblk0p6.
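
That is (assuming the same device naming):

# mkswap /dev/mmcblk0p6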

From your kernel build, copy arch/arm64/boot/Image and arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dtb to the first (FAT) partition. (This will overwrite the existing out of date dtb file.)
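
Something along these lines works; adjust the destination paths if the existing files on the FAT partition live somewhere else or under different names:

# mount /dev/mmcblk0p1 /mnt
# cp arch/arm64/boot/Image /mnt/
# cp arch/arm64/boot/dts/allwinner/sun50i-a64-pine64-plus.dtb /mnt/
# umount /mnt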

Extract the filesystem from a virt-builder Fedora 25 aarch64 image:

$ virt-builder --arch aarch64 fedora-25
$ virt-filesystems -a fedora-25.img --all --long -h
$ guestfish --ro -a fedora-25.img run : download /dev/sda4 aarch64-root.fs

This is an XFS filesystem image, which is why you have to enable the XFS driver in the custom kernel above.

Now write this to the fifth (first logical) partition:

# dd if=aarch64-root.fs of=/dev/mmcblk0p5 bs=16M
# xfs_growfs /dev/mmcblk0p5

(Note that xfs_growfs normally requires the filesystem to be mounted, so if the second command complains, grow the filesystem after mounting it instead, pointing xfs_growfs at the mount point.)

You will now need to mount up the root filesystem and make a few changes. At the very least:

  1. Edit /etc/fstab to reflect reality (a minimal example follows this list).
  2. Disable the root password in /etc/passwd.
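
With the partition layout used above, the fstab can be as simple as the following. The /boot line for the FAT partition is optional and an assumption on my part; you can also use UUIDs instead of device names:

/dev/mmcblk0p5  /      xfs   defaults  0 0
/dev/mmcblk0p6  swap   swap  defaults  0 0
/dev/mmcblk0p1  /boot  vfat  defaults  0 0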

With any luck booting the micro SD card in the Pine64 should now work.

Pine64 — extra things

As with other low-end ARM hardware, the $50 I paid for the Pine64 isn’t enough for a fully working system. You will also need a serial port adapter; I recommend the CP2102, of which you’ll find millions on Amazon for under £10. Also, a micro SD card. And a USB to micro USB cable to power the board.

These extras shouldn’t come to more than another $40, taking the total cost of the hardware to about $90.

Pine64 — delivered

You will remember that a few weeks ago I ordered a Pine64 aarch64 developer kit with the wifi daughter-card, in order to test how well it works with upstream Fedora. It arrived today. The ordering process was very efficient, with Pine64 keeping me up to date at every step along the way, and there were no customs delays or charges.

As I’m rather busy in the next few days, I may not have time to look at it right away.

Tip: Run virt-inspector on a compressed disk (with nbdkit)

virt-inspector is a very convenient tool to examine a disk image and find out if it contains an operating system, what applications are installed and so on.

If you have an xz-compressed disk image, you can run virt-inspector on it without uncompressing it, using the magic of captive nbdkit. Here’s how:

nbdkit xz file=win7.img.xz \
    -U - \
    --run 'virt-inspector --format=raw -a nbd://?socket=$unixsocket'

What’s happening here is we run nbdkit with the xz plugin, and tell it to serve NBD over a randomly named Unix domain socket (-U -).

We then run virt-inspector as a sub-process. This is called “captive nbdkit”. (Nbdkit is “captive” here, because it will exit as soon as virt-inspector exits, so there’s no need to clean anything up.)

The $unixsocket variable expands to the name of the randomly generated Unix domain socket, forming a libguestfs NBD URL which allows virt-inspector to examine the raw uncompressed data exported by nbdkit.

The nbdkit xz plugin only uncompresses those blocks of the data which are actually accessed, so this is quite efficient.
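
Random access is only cheap if the xz file contains many blocks; by default xz compresses the whole file as one big block, which forces a lot of decompression on every read. If you are preparing the image yourself, compress it with a moderate block size, for example (the 16 MB figure is only a suggestion):

$ xz --best --block-size=16777216 win7.img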
