Really nice doing make -j46 kernel builds on Qualcomm’s insanely fast ARM-based Amberwing server.
In my last post I tried to see what happens when you add thousands of virtio-scsi disks to a Linux virtual machine. Above 10,000 disks the qemu command line grew too long for the host to handle. Several people pointed out that I could use the qemu -readconfig parameter to read the disks from a file, so I modified libguestfs to allow that. What will be the next limit?
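For reference, each disk in a -readconfig file becomes a short INI-style stanza instead of a pair of command line arguments. A minimal sketch of one disk (the file name and IDs here are invented; the keys mirror the equivalent -drive and -device options):

[drive "scratch0"]
  file = "/var/tmp/scratch0.img"
  format = "raw"
  if = "none"

[device]
  driver = "scsi-hd"
  drive = "scratch0"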
Linux uses a strange scheme for naming disks which I’ve covered before on this blog. In brief, disks are named /dev/sda through /dev/sdz, then /dev/sdaa through /dev/sdzz, and after 18,278 drives we reach /dev/sdzzz. What’s special about zzz? Nothing really, but historically Linux device drivers would fail after this, although that is not a problem for modern Linux.
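Why 18,278? Because 26 + 26² + 26³ = 18,278: the names are effectively bijective base-26 numerals, like spreadsheet column names. Here is a small standalone C sketch of my own (not the kernel’s code) that reproduces the mapping; note it maps drive 20,001 to sdacog, which matches the root disk in the guest session further down:

#include <stdio.h>

/* Bijective base-26, like spreadsheet columns:
 * 1 -> sda, 26 -> sdz, 27 -> sdaa, 702 -> sdzz, 703 -> sdaaa,
 * 18278 -> sdzzz, 18279 -> sdaaaa. */
static void disk_name(unsigned idx, char *buf, size_t len)
{
    char tmp[8];
    int i = 0;
    size_t p;

    while (idx > 0 && i < (int) sizeof tmp) {
        idx--;                     /* bijective numbering has no zero digit */
        tmp[i++] = 'a' + idx % 26;
        idx /= 26;
    }
    p = (size_t) snprintf(buf, len, "sd");
    while (i > 0 && p + 1 < len)
        buf[p++] = tmp[--i];       /* digits come out least significant first */
    buf[p] = '\0';
}

int main(void)
{
    static const unsigned examples[] =
        { 1, 26, 27, 702, 703, 18278, 18279, 20001 };
    char name[16];
    size_t j;

    for (j = 0; j < sizeof examples / sizeof examples[0]; j++) {
        disk_name(examples[j], name, sizeof name);
        printf("%6u -> /dev/%s\n", examples[j], name);
    }
    return 0;
}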
In any case I created a Linux guest with 20,000 drives with no problem, except for the enormous boot time: it was over 12 hours, at which point I killed it. Most of the time was being spent in:
-   72.62%    71.30%  qemu-system-x86  qemu-system-x86_64  [.] drive_get
   - 72.62% drive_get
      - 1.26% __irqentry_text_start
         - 1.23% smp_apic_timer_interrupt
            - 1.00% local_apic_timer_interrupt
               - 1.00% hrtimer_interrupt
                  - 0.82% __hrtimer_run_queues
                       0.53% tick_sched_timer
Drives are stored inside qemu on a linked list, and the drive_get function iterates over this linked list, so of course everything is extremely slow when this list grows long.
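To make the quadratic behaviour concrete, here is a minimal sketch of the pattern; this is my own simplification, not QEMU’s actual code or function signature:

#include <stddef.h>
#include <string.h>

/* Simplified stand-in for QEMU's drive list -- illustrative only. */
struct drive {
    char id[32];
    struct drive *next;
};

static struct drive *drives;   /* head of singly linked list */

/* Linear scan: O(N) per lookup. */
static struct drive *drive_get(const char *id)
{
    struct drive *d;

    for (d = drives; d != NULL; d = d->next)
        if (strcmp(d->id, id) == 0)
            return d;
    return NULL;
}

With N drives, each lookup walks up to N nodes, so creating all N drives costs O(N²). Presumably something like a hash table keyed on the drive ID would make each lookup O(1) and remove the quadratic startup cost.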
QEMU bug filed: https://bugs.launchpad.net/qemu/+bug/1686980
Edit: Dan Berrangé posted a hack which gets me past this problem, so now I can add 20,000 disks.
The guest boots fine, albeit taking about 30 minutes (and udev hasn’t completed device node creation in that time, it’s still going on in the background).
><rescue> ls -l /dev/sd[Tab]
Display all 20001 possibilities? (y or n)
><rescue> mount
/dev/sdacog on / type ext2 (rw,noatime,block_validity,barrier,user_xattr,acl)
As you can see, the modern Linux kernel and userspace handle “four letter” drive names like a champ.
I managed to create a guest with 30,000 drives. I had to give the guest 50 GB (yes, not a mistake) of RAM to get this far. With less RAM, disk probing fails with:
scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
I’d seen SCSI probing run out of memory before, and I made a back-of-the-envelope calculation that each disk consumed 200 KB of RAM. However that cannot be correct: 30,000 disks at 200 KB each is only about 6 GB, nowhere near 50 GB, so there must be a non-linear relationship between the number of disks and the RAM used by the kernel.
Because my development machine simply doesn’t have enough RAM to go further, I wasn’t able to add more than 30,000 drives, so that’s where we have to end this little experiment, at least for the time being.
><rescue> ls -l /dev/sd???? | tail
brw------- 1 root root 66, 30064 Apr 28 19:35 /dev/sdarin
brw------- 1 root root 66, 30080 Apr 28 19:35 /dev/sdario
brw------- 1 root root 66, 30096 Apr 28 19:35 /dev/sdarip
brw------- 1 root root 66, 30112 Apr 28 19:35 /dev/sdariq
brw------- 1 root root 66, 30128 Apr 28 19:35 /dev/sdarir
brw------- 1 root root 66, 30144 Apr 28 19:35 /dev/sdaris
brw------- 1 root root 66, 30160 Apr 28 19:35 /dev/sdarit
brw------- 1 root root 66, 30176 Apr 28 19:24 /dev/sdariu
brw------- 1 root root 66, 30192 Apr 28 19:22 /dev/sdariv
brw------- 1 root root 67, 29952 Apr 28 19:35 /dev/sdariw
><rescue> ls -l /dev/sd[Tab]
Display all 4001 possibilities? (y or n)
Just how many virtual hard drives is it practical to add to a Linux VM using qemu/KVM? I tried to find out. I started by modifying virt-rescue to raise the limit on the number of scratch disks that can be added.¹
I hit some interesting limits in our toolchain along the way.
256 is the maximum number of virtio-scsi disks in unpatched virt-rescue / libguestfs. A single virtio-scsi controller supports 256 targets, with up to 16,384 SCSI logical units (LUNs) per target. We were assigning one disk per target and giving them all unit number 0, so of course we couldn’t add more than 256 drives, but virtio-scsi supports very many more. In theory each virtio-scsi controller could support 256 × 16,384 = 4,194,304 drives. You can even add more than one controller to a guest.
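For illustration, spreading disks across LUNs instead of one-per-target looks something like this on the qemu command line (the paths and IDs are invented; channel, scsi-id and lun are standard scsi-hd device properties):

qemu-system-x86_64 \
  ... \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=/var/tmp/d0.img,format=raw,if=none,id=d0 \
  -device scsi-hd,drive=d0,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \
  -drive file=/var/tmp/d1.img,format=raw,if=none,id=d1 \
  -device scsi-hd,drive=d1,bus=scsi0.0,channel=0,scsi-id=0,lun=1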
At around 490-500 disks, any monitoring tools that use libvirt to collect disk statistics from your VMs will crash (https://bugzilla.redhat.com/show_bug.cgi?id=1440683).
qemu uses one file descriptor per disk (maybe two per disk if you are using ioeventfd). qemu quickly hits the default open file limit of 1024 (ulimit -n). You can raise this to something much larger by creating this file:
$ cat /etc/security/limits.d/99-local.conf
# So we can run qemu with many disks.
rjones - nofile 65536
This file lives under /etc/security for a reason, so you should be careful adjusting settings here except on test machines. (The new limit takes effect at your next login.)
The Linux guest kernel uses quite a lot of memory simply enumerating each SCSI drive. My default guest had 512 MB of RAM (no swap), and ran out of memory and panicked when I tried to add 4000 disks. The solution was to increase guest RAM to 8 GB for the remainder of the test.
Booting with 4000 disks took 10 minutes², and free showed that about a gigabyte of memory had disappeared:
><rescue> free -m
              total        used        free      shared  buff/cache   available
Mem:           7964         104        6945          15         914        7038
Swap:             0           0           0
What was also surprising is that increasing the number of virtual CPUs from 1 to 16 made no difference to the boot time (in fact it was a bit slower). So even though SCSI LUN probing is not deterministic, it appears that it is not running in parallel either.
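(One knob worth knowing about here: the kernel’s scsi_mod.scan parameter, e.g. scsi_mod.scan=async on the guest kernel command line. Note that “async” means the scan doesn’t block the rest of boot, not that LUNs are probed in parallel, which is consistent with what I observed.)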
If you’re using libvirt to manage the guest, it will fail at around 8000 disks because the XML document describing the guest is too large to transfer over libvirt’s internal client-to-daemon connection (https://bugzilla.redhat.com/show_bug.cgi?id=1443066). For the remainder of the test I instructed virt-rescue to run qemu directly.
My guest with 8000 disks took 77 minutes to boot. About 1.9 GB of RAM was missing, and my ballpark estimate is that each extra drive takes about 200 KB of kernel memory.
We pass the list of drives to qemu on the command line, with each disk taking perhaps 180 bytes to express. Somewhere between 10,000 and 11,000 disks, this long command line fails with:
qemu-system-x86_64: Argument list too long
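This is the kernel’s ARG_MAX limit on the combined size of a process’s arguments and environment. A rough back-of-the-envelope check of my own (the 2 MiB figure assumes the common 8 MiB stack limit, of which the kernel allows a quarter for arguments):

$ getconf ARG_MAX
2097152
$ echo $((10500 * 180))    # ~10,500 disks at ~180 bytes each
1890000

Add the environment and the fixed part of the command line, and somewhere between 10,000 and 11,000 disks the total crosses 2 MiB.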
To be continued …
So that’s the end of my testing, for now. I managed to create a guest with 10,000 drives, but I was hoping to explore what happens when you add more than 18,278 drives, since some parts of the kernel or userspace stack may not be quite ready for that.
¹The modified virt-rescue will not work with the version found in most Linux distros. I have had to patch it extensively and those patches aren’t yet upstream.
²Note that the uptime command within the guest is not an accurate way to measure the boot time when dealing with large numbers of disks, because it doesn’t include the time taken by the BIOS, which has to scan the disks too. To measure boot times, use the wallclock time from launching qemu.
Thanks: Paolo Bonzini
The final two questions that I posed last time were to do with constructing a timeline of what this guest is spending time on.
We can easily see system calls in the trace log, and we can also see when a kernel function is entered for the first time (indicating that a new bit of the kernel is now running). I wrote a Perl script to analyze that, which gave me a 115K-line log file; from there I did the rest of the analysis by hand to generate a timeline.
I would reproduce it here, but the results aren’t very enlightening. In particular I doubt it’s more interesting than what you can get by reading the kernel printks from a boot log.
What is my conclusion after using these techniques to analyze a booting guest? Much the same as with the function-level analysis that follows: nothing stands out.
In the previous post I posed three questions about my detailed function-level trace of the kernel booting under QEMU. The first one was: which kernel functions consume the most time?
We don’t have stack traces, although that would be a good direction for future work. So if a function “A” calls another function “B” like this:
A
  --> calls B
  <-- B returns
A
then we’re going to assign just the time in the two parts of “A” to “A”. In other words, “A” doesn’t inherit the time taken running “B”. (Except if “B” is inlined, in which case the trace cannot distinguish which code is in “A” and which is in “B”).
The other limitation is lack of insight into what kernel modules are doing. This is very difficult to solve: compiling kernel modules into the kernel proper would change what I’m trying to measure considerably.
Given those limitations, I wrote a Perl script to compute these per-function times from my previous trace. I’m not going to bother reproducing the script here because it’s quite simple, but also very specific to the trace format I’m using. If you got this far, you’ll have no trouble writing your own analysis tools.
The results are at the end of the post. For each function that was called, I had a look into the kernel code to find out what it seems to be doing, and those notes are below.
My main conclusion is that there is no “smoking gun” here. Everything generally points to things that a kernel (or at least, the core, non-module parts of a kernel) should be doing. So this analysis has not been very helpful.
After an overnight 12+ hour run of my Perl scripts I now have a 52 million line file that consists of timestamps, kernel symbols, and other untranslated linear addresses. The only possible way to analyze this will be with yet more scripts, but already there are lots of interesting things.
Here is the kernel code entering userspace:
11409434.8589973 prepare_exit_to_usermode in section .text
11409435.5189973 retint_user + 8 in section .text
11409436.4899973 7fb95ab3cd20
11409447.6899973 7fb95ab4c930
11409469.2169973 7fb95ab577f0
Userspace symbols cannot be decoded because we don’t know which process is being run. More importantly, code in kernel modules cannot be decoded, so we only see core kernel functions.
Handling a timer interrupt:
18000723.5261105 apic_timer_interrupt in section .text
18000725.2241105 smp_apic_timer_interrupt in section .text
18000726.7681105 native_apic_mem_write in section .text
18000729.2691105 smp_apic_timer_interrupt + 46 in section .text
18000729.8361105 irq_enter in section .text
18000730.3941105 rcu_irq_enter in section .text
18000731.1401105 rcu_irq_enter + 92 in section .text
18000731.7111105 irq_enter + 14 in section .text
18000732.3171105 smp_apic_timer_interrupt + 51 in section .text
18000740.9941105 exit_idle in section .text
18000741.5481105 smp_apic_timer_interrupt + 56 in section .text
18000742.0881105 local_apic_timer_interrupt in section .text
18000744.0341105 tick_handle_periodic in section .text
18000744.6341105 _raw_spin_lock in section .text
18000745.9291105 tick_periodic + 67 in section .text
18000747.3941105 do_timer in section .text
Userspace loading a kernel module:
7806760.57896065 40087d
7806765.09696065 4442b0
7806765.65396065 entry_SYSCALL_64 in section .text
7806766.14496065 entry_SYSCALL_64 + 32 in section .text
7806766.46296065 entry_SYSCALL_64 + 36 in section .text
7806788.75296065 sys_init_module in section .text
7806796.76296065 sys_init_module + 62 in section .text
7806797.28296065 sys_init_module + 62 in section .text
7806800.64896065 sys_init_module + 65 in section .text
7806801.94496065 capable in section .text
7806802.91196065 security_capable in section .text
7806804.30796065 cap_capable in section .text
7806804.87796065 security_capable + 72 in section .text
7806805.43596065 ns_capable + 41 in section .text
7806805.92096065 capable + 23 in section .text
7806810.46796065 sys_init_module + 75 in section .text
7806815.59796065 sys_init_module + 86 in section .text
7806821.10196065 sys_init_module + 96 in section .text
7806827.28496065 sys_init_module + 109 in section .text
7806831.23396065 sys_init_module + 129 in section .text
7806839.75396065 security_kernel_module_from_file in section .text
[etc]
What am I interested in knowing? My overall goal is to find areas in the kernel and userspace that we can optimize to make boot faster. Specifically it seems interesting at first to look at two questions: which kernel functions consume the most time, and what the guest is spending time on at each stage of boot (a timeline).
More to follow …