Tag Archives: hard drives

How many disks can you add to a (virtual) Linux machine? (contd)

In my last post I tried to see what happens when you add thousands of virtio-scsi disks to a Linux virtual machine. Above 10,000 disks the qemu command line grew too long for the host to handle. Several people pointed out that I could use the qemu -readconfig parameter to read the disks from a file. So I modified libguestfs to allow that. What will be the next limit?

18,278

Linux uses a strange scheme for naming disks which I’ve covered before on this blog. In brief, disks are named /dev/sda through /dev/sdz, then /dev/sdaa through /dev/sdzz, and after 18,278 drives we reach /dev/sdzzz. What’s special about zzz? Nothing really, but historically Linux device drivers would fail after this, although that is not a problem for modern Linux.

20,000

In any case I created a Linux guest with 20,000 drives with no problem except for the enormous boot time: It was over 12 hours at which point I killed it. Most of the time was being spent in:

-   72.62%    71.30%  qemu-system-x86  qemu-system-x86_64  [.] drive_get
   - 72.62% drive_get
      - 1.26% __irqentry_text_start
         - 1.23% smp_apic_timer_interrupt
            - 1.00% local_apic_timer_interrupt
               - 1.00% hrtimer_interrupt
                  - 0.82% __hrtimer_run_queues
                       0.53% tick_sched_timer

Drives are stored inside qemu on a linked list, and the drive_get function iterates over this linked list, so of course everything is extremely slow when this list grows long.

QEMU bug filed: https://bugs.launchpad.net/qemu/+bug/1686980

Edit: Dan Berrange posted a hack which gets me past this problem, so now I can add 20,000 disks.

The guest boots fine, albeit taking about 30 minutes (and udev hasn’t completed device node creation in that time, it’s still going on in the background).

><rescue> ls -l /dev/sd[Tab]
Display all 20001 possibilities? (y or n)
><rescue> mount
/dev/sdacog on / type ext2 (rw,noatime,block_validity,barrier,user_xattr,acl)

As you can see the modern Linux kernel and userspace handles “four letter” drive names like a champ.

Over 30,000

I managed to create a guest with 30,000 drives. I had to give the guest 50 GB (yes, not a mistake) of RAM to get this far. With less RAM, disk probing fails with:

scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

I’d seen SCSI probing run out of memory before, and I made a back-of-the-envelope calculation that each disk consumed 200 KB of RAM. However that cannot be correct — there must be a non-linear relationship between number of disks and RAM used by the kernel.

Because my development machine simply doesn’t have enough RAM to go further, I wasn’t able to add more than 30,000 drives, so that’s where we have to end this little experiment, at least for the time being.

><rescue> ls -l /dev/sd???? | tail
brw------- 1 root root  66, 30064 Apr 28 19:35 /dev/sdarin
brw------- 1 root root  66, 30080 Apr 28 19:35 /dev/sdario
brw------- 1 root root  66, 30096 Apr 28 19:35 /dev/sdarip
brw------- 1 root root  66, 30112 Apr 28 19:35 /dev/sdariq
brw------- 1 root root  66, 30128 Apr 28 19:35 /dev/sdarir
brw------- 1 root root  66, 30144 Apr 28 19:35 /dev/sdaris
brw------- 1 root root  66, 30160 Apr 28 19:35 /dev/sdarit
brw------- 1 root root  66, 30176 Apr 28 19:24 /dev/sdariu
brw------- 1 root root  66, 30192 Apr 28 19:22 /dev/sdariv
brw------- 1 root root  67, 29952 Apr 28 19:35 /dev/sdariw

3 Comments

Filed under Uncategorized

How many disks can you add to a (virtual) Linux machine?

><rescue> ls -l /dev/sd[tab]
Display all 4001 possibilities? (y or n)

Just how many virtual hard drives is it practical to add to a Linux VM using qemu/KVM? I tried to find out. I started by modifying virt-rescue to raise the limit on the number of scratch disks that can be added¹: virt-rescue --scratch=4000

I hit some interesting limits in our toolchain along the way.

256

256 is the maximum number of virtio-scsi disks in unpatched virt-rescue / libguestfs. A single virtio-scsi controller supports 256 targets, with up to 16384 SCSI logical units (LUNs) per target. We were assigning one disk per target, and giving them all unit number 0, so of course we couldn’t add more than 256 drives, but virtio-scsi supports very many more. In theory each virtio-scsi controller could support 256 x 16,384 = 4,194,304 drives. You can even add more than one controller to a guest.

About 490-500

At around 490-500 disks, any monitoring tools which are using libvirt to collect disk statistics from your VMs will crash (https://bugzilla.redhat.com/show_bug.cgi?id=1440683).

About 1000

qemu uses one file descriptor per disk (maybe two per disk if you are using ioeventfd). qemu quickly hits the default open file limit of 1024 (ulimit -n). You can raise this to something much larger by creating this file:

$ cat /etc/security/limits.d/99-local.conf
# So we can run qemu with many disks.
rjones - nofile 65536

It’s called /etc/security for a reason, so you should be careful adjusting settings here except on test machines.

About 4000

The Linux guest kernel uses quite a lot of memory simply enumerating each SCSI drive. My default guest had 512 MB of RAM (no swap), and ran out of memory and panicked when I tried to add 4000 disks. The solution was to increase guest RAM to 8 GB for the remainder of the test.

Booting with 4000 disks took 10 minutes² and free shows about a gigabyte of memory disappears:

><rescue> free -m
              total        used        free      shared  buff/cache   available
Mem:           7964         104        6945          15         914        7038
Swap:             0           0           0

What was also surprising is that increasing the number of virtual CPUs from 1 to 16 made no difference to the boot time (in fact it was a bit slower). So even though SCSI LUN probing is not deterministic, it appears that it is not running in parallel either.

About 8000

If you’re using libvirt to manage the guest, it will fail at around 8000 disks because the XML document describing the guest is too large to transfer over libvirt’s internal client to daemon connection (https://bugzilla.redhat.com/show_bug.cgi?id=1443066). For the remainder of the test I instructed virt-rescue to run qemu directly.

My guest with 8000 disks took 77 minutes to boot. About 1.9 GB of RAM was missing, and my ballpark estimate is that each extra drive takes about 200KB of kernel memory.

Between 10,000 and 11,000

We pass the list of drives to qemu on the command line, with each disk taking perhaps 180 bytes to express. Somewhere between 10,000 and 11,000 disks, this long command line fails with:

qemu-system-x86_64: Argument list too long

To be continued …

So that’s the end of my testing, for now. I managed to create a guest with 10,000 drives, but I was hoping to explore what happens when you add more than 18278 drives since some parts of the kernel or userspace stack may not be quite ready for that.

Continue to part 2 …

Notes

¹That command will not work with the virt-rescue program found in most Linux distros. I have had to patch it extensively and those patches aren’t yet upstream.

²Note that the uptime command within the guest is not an accurate way to measure the boot time when dealing with large numbers of disks, because it doesn’t include the time taken by the BIOS which has to scan the disks too. To measure boot times, use the wallclock time from launching qemu.

Thanks: Paolo Bonzini

Edit: 2015 KVM Forum talk about KVM’s limits.

9 Comments

Filed under Uncategorized

There I fixed it

20130218_112227

It works too …

sd 4:0:0:0: [sdc] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 4:0:0:0: [sdc] 4096-byte physical blocks
sd 4:0:0:0: [sdc] Write Protect is off
sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 5:0:0:0: [sdd] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 5:0:0:0: [sdd] 4096-byte physical blocks
sd 5:0:0:0: [sdd] Write Protect is off
sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA

2 Comments

Filed under Uncategorized

PC hours powered on

I just bought a second hand laptop for a friend. It’s a Lenovo Thinkpad X60S from around 2006/2007. I have one nearly the same and they were built from very solid stuff.

Poking around in the hard drive SMART information I found an interesting thing:

sudo smartctl -a /dev/sda

The Power_On_Hours for the laptop was 4847 (hours). Assuming the laptop is 6 years old and there are about 8700 hours in a year, the laptop has only been switched on for about 10% of its life.

(Or possibly it had a replacement hard drive, but that seems unlikely given the spec of the hard drive matches what I’d expect for the machine).

I ran the same test on my current laptop: Power_On_Hours = 16361 (hours) and the laptop is around 2 years old, and stays on just about 100% of the time, so that number makes sense.

6 Comments

Filed under Uncategorized

When will disk sizes go beyond 64 bits?

264 bytes = 16,777,216 terabytes

It’s actually not that much. It’s just 8½ million 2 TB disks.

Purchased at retail from Amazon.com @ $80 each, that’s $671 million, easily within the reach of governments and large corporations (a non-retail bulk purchase would be cheaper, even if you account for failed drives, adding some redundancy, and the disk manufacturers’ power-of-10 tax).

Do we need to think about 128 bit byte addresses for disks yet?

5 Comments

Filed under Uncategorized

How are Linux drives named beyond drive 26 (/dev/sdz, ..)?

[Edit: Thanks to adrianmonk for correcting my math]

It’s surprisingly hard to find a definitive answer to the question of what happens with Linux block device names when you get past drive 26 (ie. counting from one, the first disk is /dev/sda and the 26th disk is /dev/sdz, what comes next?) I need to find out because libguestfs is currently limited to 25 disks, and this really needs to be fixed.

Anyhow, looking at the code we can see that it depends on which driver is in use.

For virtio-blk (/dev/vd*) the answer is:

Drive # — Name
1 vda
26 vdz
27 vdaa
28 vdab
52 vdaz
53 vdba
54 vdbb
702 vdzz
703 vdaaa
704 vdaab
18278 vdzzz

Beyond 18278 drives the virtio-blk code would fail, but that’s not currently an issue.

For SATA and SCSI drives under a modern Linux kernel, the same as above applies except that the code to derive names works properly beyond sdzzz up to (in theory) sd followed by 29 z‘s! [Edit: or maybe not?]

As you can see virtio and SCSI/SATA don’t use common code to name disks. In fact there are also many other block devices in the kernel, all using their own naming scheme. Most of these use numbers instead of letters: eg: /dev/loop0, /dev/ram0, /dev/mmcblk0 and so on.

If disks are partitioned, then the partitions are named by adding the partition number on the end (counting from 1). But if the drive name already ends with a number then a letter p is added between the drive name and the partition number, thus: /dev/mmcblk0p1.

7 Comments

Filed under Uncategorized