In my last post I tried to see what happens when you add thousands of virtio-scsi disks to a Linux virtual machine. Above 10,000 disks the qemu command line grew too long for the host to handle. Several people pointed out that I could use the qemu -readconfig parameter to read the disks from a file. So I modified libguestfs to allow that. What will be the next limit?
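To give a rough idea of what moving the disks into a file looks like, here is a minimal Python sketch that writes a config file in qemu's INI-like -readconfig syntax. It is not what libguestfs generates: the function name qemu_config, the ids hd0/disk0/scsi0 and the image paths are made up for illustration, and a real setup with thousands of disks also has to spread them across SCSI targets and LUNs, which this sketch does not attempt.

```python
def qemu_config(disk_paths):
    """Build text for a qemu -readconfig file: one virtio-scsi controller,
    then one [drive] and one scsi-hd [device] section per disk image.

    Hypothetical helper; ids and layout are illustrative assumptions.
    """
    sections = ['[device "scsi0"]\n  driver = "virtio-scsi-pci"\n']
    for i, path in enumerate(disk_paths):
        # Backend: the image file, not attached to any guest device yet.
        sections.append(
            f'[drive "hd{i}"]\n'
            f'  file = "{path}"\n'
            f'  format = "raw"\n'
            f'  if = "none"\n'
        )
        # Frontend: a SCSI disk that refers to the backend by its id.
        sections.append(
            f'[device "disk{i}"]\n'
            f'  driver = "scsi-hd"\n'
            f'  drive = "hd{i}"\n'
        )
    return "\n".join(sections)


if __name__ == "__main__":
    # Assumed image paths, for illustration only.
    print(qemu_config([f"/tmp/disk{i}.img" for i in range(3)]))
```

The output would be saved to a file and passed to qemu with -readconfig, so the command line stays the same length no matter how many disks are listed.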
Linux uses a strange scheme for naming disks which I’ve covered before on this blog. In brief, disks are named /dev/sda through /dev/sdz, then /dev/sdaa through /dev/sdzz, and after 18,278 drives we reach /dev/sdzzz. What’s special about zzz? Nothing really, but historically Linux device drivers would fail after this, although that is not a problem for modern Linux.
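The naming scheme is a bijective base-26 numbering ("a" to "z" with no zero digit), which can be sketched in a few lines of Python. The function names drive_name and drive_index are mine, not the kernel's, and this only mirrors the naming rule rather than reproducing the kernel code.

```python
import string


def drive_name(index: int) -> str:
    """Return the sd name for a 0-based disk index (0 -> 'sda')."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    n = index + 1  # bijective base 26 is easiest on 1-based numbers
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(string.ascii_lowercase[rem])
    return "sd" + "".join(reversed(letters))


def drive_index(name: str) -> int:
    """Inverse of drive_name: 'sda' or '/dev/sda' -> 0."""
    suffix = name.split("/")[-1][2:]  # drop optional /dev/ and the 'sd' prefix
    n = 0
    for ch in suffix:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1


if __name__ == "__main__":
    for i in (0, 25, 26, 701, 702, 18277, 18278):
        print(i + 1, drive_name(i))
```

The one-, two- and three-letter names account for 26 + 676 + 17,576 = 18,278 drives, so drive_name(18277) is sdzzz and the next disk is the first four-letter name, sdaaaa.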
In any case I created a Linux guest with 20,000 drives with no problem except for the enormous boot time: it had been running for over 12 hours when I killed it. Most of the time was being spent in:
- 72.62% 71.30% qemu-system-x86 qemu-system-x86_64 [.] drive_get
- 72.62% drive_get
- 1.26% __irqentry_text_start
- 1.23% smp_apic_timer_interrupt
- 1.00% local_apic_timer_interrupt
- 1.00% hrtimer_interrupt
- 0.82% __hrtimer_run_queues
Drives are stored inside qemu on a linked list, and the drive_get function iterates over this linked list, so of course everything is extremely slow when this list grows long.
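To see why a linear scan hurts so much at this scale, here is a small Python model of the pattern. It is not qemu's code: the Node class and the scan_cost function are invented for illustration, and it assumes just one lookup per drive as the drives are added, which may well undercount what qemu really does.

```python
class Node:
    """One entry in a singly linked list of drives (illustrative only)."""

    def __init__(self, unit):
        self.unit = unit
        self.next = None


def scan_cost(n_drives: int) -> int:
    """Append n drives to a linked list, looking each one up by walking
    from the head, and return the total number of nodes visited."""
    head = tail = None
    visited = 0
    for unit in range(n_drives):
        node = Node(unit)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
        # Linear search from the head, as a list-based lookup has to do.
        cur = head
        while cur is not None:
            visited += 1
            if cur.unit == unit:
                break
            cur = cur.next
    return visited


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, scan_cost(n))  # grows as n * (n + 1) / 2
```

Under that one-lookup-per-drive assumption the total is n(n+1)/2 node visits, which for 20,000 drives is 200,010,000, compared with 20,000 if lookups went through a hash table. Any extra lookups per drive multiply the quadratic term further.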
QEMU bug filed: https://bugs.launchpad.net/qemu/+bug/1686980
Edit: Dan Berrange posted a hack which gets me past this problem, so now I can add 20,000 disks.
The guest boots fine, albeit taking about 30 minutes (and udev hasn’t finished creating device nodes in that time; it’s still going on in the background).
><rescue> ls -l /dev/sd[Tab]
Display all 20001 possibilities? (y or n)
/dev/sdacog on / type ext2 (rw,noatime,block_validity,barrier,user_xattr,acl)
As you can see, the modern Linux kernel and userspace handle “four letter” drive names like a champ.
I managed to create a guest with 30,000 drives. I had to give the guest 50 GB (yes, not a mistake) of RAM to get this far. With less RAM, disk probing fails with:
scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
I’d seen SCSI probing run out of memory before, and I made a back-of-the-envelope calculation that each disk consumed 200 KB of RAM. However that cannot be correct: at 200 KB per disk, 30,000 disks would need only about 6 GB, not 50 GB. There must be a non-linear relationship between number of disks and RAM used by the kernel.
Because my development machine simply doesn’t have enough RAM to go further, I wasn’t able to add more than 30,000 drives, so that’s where we have to end this little experiment, at least for the time being.
><rescue> ls -l /dev/sd???? | tail
brw------- 1 root root 66, 30064 Apr 28 19:35 /dev/sdarin
brw------- 1 root root 66, 30080 Apr 28 19:35 /dev/sdario
brw------- 1 root root 66, 30096 Apr 28 19:35 /dev/sdarip
brw------- 1 root root 66, 30112 Apr 28 19:35 /dev/sdariq
brw------- 1 root root 66, 30128 Apr 28 19:35 /dev/sdarir
brw------- 1 root root 66, 30144 Apr 28 19:35 /dev/sdaris
brw------- 1 root root 66, 30160 Apr 28 19:35 /dev/sdarit
brw------- 1 root root 66, 30176 Apr 28 19:24 /dev/sdariu
brw------- 1 root root 66, 30192 Apr 28 19:22 /dev/sdariv
brw------- 1 root root 67, 29952 Apr 28 19:35 /dev/sdariw