Tag Archives: kernel

How are Linux drives named beyond drive 26 (/dev/sdz, ..)?

[Edit: Thanks to adrianmonk for correcting my math]

It’s surprisingly hard to find a definitive answer to the question of what happens with Linux block device names when you get past drive 26. Counting from one, the first disk is /dev/sda and the 26th disk is /dev/sdz, so what comes next? I need to find out because libguestfs is currently limited to 25 disks, and this really needs to be fixed.

Anyhow, looking at the code we can see that it depends on which driver is in use.

For virtio-blk (/dev/vd*) the answer is:

Drive # — Name
1 vda
26 vdz
27 vdaa
28 vdab
52 vdaz
53 vdba
54 vdbb
702 vdzz
703 vdaaa
704 vdaab
18278 vdzzz

Beyond 18278 drives the virtio-blk code would fail, but that’s not currently an issue.
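
The table above follows a “bijective base-26” pattern: there is no zero digit, so after z comes aa rather than ba. Here is a minimal Python sketch of that pattern. It is not the kernel code, and the function name drive_name is made up for illustration.

```python
import string

def drive_name(index: int, prefix: str = "vd") -> str:
    """Return the name of the index-th drive, counting from 1 (1 -> vda)."""
    if index < 1:
        raise ValueError("drive numbers count from 1")
    letters = []
    n = index
    while n > 0:
        # Subtracting 1 before dividing is what makes 26 -> 'z' and 27 -> 'aa'.
        n, remainder = divmod(n - 1, 26)
        letters.append(string.ascii_lowercase[remainder])
    return prefix + "".join(reversed(letters))

for i in (1, 26, 27, 52, 53, 702, 703, 18278):
    print(i, drive_name(i))
```

Passing prefix="sd" gives the corresponding SCSI/SATA-style names.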

For SATA and SCSI drives under a modern Linux kernel, the same as above applies except that the code to derive names works properly beyond sdzzz up to (in theory) sd followed by 29 z’s! [Edit: or maybe not?]

As you can see, virtio and SCSI/SATA don’t use common code to name disks. In fact there are also many other block devices in the kernel, all using their own naming scheme. Most of these use numbers instead of letters, e.g. /dev/loop0, /dev/ram0, /dev/mmcblk0 and so on.

If disks are partitioned, then the partitions are named by adding the partition number on the end (counting from 1). But if the drive name already ends with a number then a letter p is added between the drive name and the partition number, thus: /dev/mmcblk0p1.
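
The partition rule is simple enough to write down as a few lines of Python. This is only an illustration of the rule as described here, and partition_name is a hypothetical helper, not anything from the kernel.

```python
def partition_name(disk: str, partition: int) -> str:
    """Build a partition device name from a disk name and a partition number."""
    if partition < 1:
        raise ValueError("partition numbers count from 1")
    # A 'p' separator is needed only when the disk name itself ends in a digit.
    separator = "p" if disk[-1:].isdigit() else ""
    return f"{disk}{separator}{partition}"

print(partition_name("/dev/sda", 1))      # /dev/sda1
print(partition_name("/dev/mmcblk0", 1))  # /dev/mmcblk0p1
```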

Filed under Uncategorized

How LVM does snapshots

LVM2 snapshots are fully read/write. You can write to either the snapshot or the original volume and the write won’t be seen by the other. In the snapshot volume what is stored is an “exception list”, basically a big hash table(?) recording which blocks are different from the original volume.

Now I was curious how this works, because obviously writes to the original volume must also cause an exception to be added to the snapshot volume. How does the snapshot module see these writes? Does it hook into the original device in some way? It turns out, no.

It’s all handled at higher levels (by the lvcreate command in fact) which inserts a device mapper layer above the original device. This special layer (called a “snapshot-origin”) grabs the write request and passes it to the snapshot code where it causes an exception to be added to the snapshot (or snapshots because several snapshots might have been created against a single origin). [Refer to origin_map and __origin_write in dm-snap.c].

You can see the extra layer added by lvcreate -s by examining the device mapper tables directly. For example, before creating a snapshot:

# dmsetup table | grep F13
vg_pin-F13x64: 0 20971520 linear 253:0 257687936

and after creating a snapshot of that LV (notice the “real” LV has been renamed):

# lvcreate -s -n F13x64snap -L 1G /dev/vg_pin/F13x64
# dmsetup table | grep F13
vg_pin-F13x64snap: 0 20971520 snapshot 253:38 253:39 P 8
vg_pin-F13x64snap-cow: 0 2097152 linear 253:0 605815168
vg_pin-F13x64: 0 20971520 snapshot-origin 253:38
vg_pin-F13x64-real: 0 20971520 linear 253:0 257687936

Resolving the device references into a simpler diagram:

  F13x64 (253:4)             F13x64snap (253:37)
          |                        |         |
          |        +---------------+         |
          v        v                         v
  F13x64-real (253:38)    F13x64snap-cow (253:39)

F13x64snap-cow at the bottom right is the actual storage used for the snapshot exception list. It is just a plain linear mapping of some blocks from the underlying block device. F13x64snap is the virtual device. A read of F13x64snap consults the exception list in the snapshot cow and, if the block is not there, falls through to the real device. Writes to F13x64snap go to the exception list. Finally, writes to the virtual origin device F13x64 also touch the snapshot cow (or snapshot cows plural): before the write reaches the real device, the old contents of the affected blocks are saved as exceptions in each snapshot. There is no explicit connection for this in the tables; in fact it goes via a hash table stored in kernel memory.
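
To make the read and write paths concrete, here is a toy model in Python. It is a sketch only: the class names Origin and Snapshot are invented, a dict stands in for the exception list, and the real dm-snap.c works on chunks, stores the exceptions on the cow device and handles locking, none of which is modelled here.

```python
class Origin:
    """Toy stand-in for the snapshot-origin layer above the real device."""

    def __init__(self, blocks):
        self.real = list(blocks)   # plays the role of F13x64-real
        self.snapshots = []        # every snapshot taken against this origin

    def read(self, n):
        return self.real[n]

    def write(self, n, data):
        # Before the block is overwritten, save the old contents in each
        # snapshot that does not already hold an exception for that block.
        for snap in self.snapshots:
            snap.exceptions.setdefault(n, self.real[n])
        self.real[n] = data


class Snapshot:
    """Toy stand-in for the snapshot device and its cow storage."""

    def __init__(self, origin):
        self.origin = origin
        self.exceptions = {}       # plays the role of F13x64snap-cow
        origin.snapshots.append(self)

    def read(self, n):
        # Exception list first, then fall through to the real device.
        if n in self.exceptions:
            return self.exceptions[n]
        return self.origin.real[n]

    def write(self, n, data):
        self.exceptions[n] = data


origin = Origin(["a", "b", "c"])
snap = Snapshot(origin)
origin.write(0, "A")   # the snapshot must still see "a"
snap.write(1, "B")     # the origin must still see "b"
print(origin.read(0), snap.read(0), origin.read(1), snap.read(1))
```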

libtool/kernel explosion

The following pair of commands will hang your Fedora 13/14/15 system solid. Be warned.

$ truncate -s 16G bigfile
$ libtool --mode=execute /bin/ls bigfile 

Bug: https://bugzilla.redhat.com/show_bug.cgi?id=636045

Linux core_pattern fail

  1. core_pattern obeys the current chroot, which effectively makes it useless as a way to collect cores centrally if your system has any chrooted processes.
  2. core_pattern explicitly prevents you from dumping to non-regular files. In a virtual environment it’d be very useful to coredump to a block device.

Tip: mock-build Rawhide packages on RHEL 5

The Fedora build system Koji runs on RHEL 5 Xen and builds everything on top of that using mock. This can lead to some rather hard-to-debug problems where your package builds and tests OK for you on your local Rawhide machine, but fails in Koji. It can fail in Koji because the build runs on the RHEL 5 Linux kernel (2.6.18). Your program, or any program you depend on during the build, might make assumptions about system calls that work for a Rawhide kernel, but fail for a RHEL 5 kernel.

Reproducing these bugs is difficult. Hopefully this posting should be a good start.

Koji is doing roughly the equivalent of this command (on a RHEL 5 host):

mock -r fedora-rawhide-x86_64 --rebuild your.src.rpm

That command doesn’t work straightaway. There are some things you have to install and upgrade first before that works:

  1. Install RHEL 5 (or use CentOS or another no-cost alternative).
  2. Install EPEL.
  3. Install or update yum, python-hashlib, mock.
  4. Use /usr/sbin/vigr to add yourself to the “mock” group.
  5. The version of RPM from RHEL 5 is too old to understand the new xz-based compression format used by Rawhide RPMs. You have to build the Fedora 12 RPM (NB: Fedora 13 RPM definitely doesn’t work because it requires Python 2.6). The Fedora 12 specfile is a starting point, but it won’t work directly. There are some small changes you have to make, and a single patch to the source code, but hopefully those will be obvious. Update: Here for a short time is a scratch build of the Fedora 12 RPM made to work on RHEL 5.4. Once you’ve built the new rpm RPM (!), install it.

At this point you can use the mock command above to test-build SRPMs using the unusual RHEL 5 kernel / Rawhide userspace combination.

libguestfs launch times

Indulge me while I make a “note to self” about efforts to reduce the time taken by guestfs_launch which boots up the libguestfs appliance.

Time (s) Operation
2s Create supermin appliance: This has crept up over time from originally taking about 1/5th of a second to around 2s. Needs attention. Fixed: see this note about cpio blocksize and the update below.
2-3s qemu startup: The time is mainly spent reading in the large -kernel and particularly -initrd files specified on the command line. The released qemu code is quite rubbish, but luckily kraxel beat me to fixing the problem with this patch.
3s BIOS waits for keypress: As discussed yesterday, I’ve posted a patch. In the meantime the qemu devs have abandoned the old bochs BIOS for SeaBIOS, which I haven’t tested yet.
3s Kernel boot time: Not very many easy wins here, since the kernel is already pretty efficient. We could try to remove some busy waits and sleeps — for example the kernel waits ¼s for the serial ports, which we don’t use.
1-2s Userspace boot time: This is mainly time spent on udev and partition detection. I have never really understood why the kernel needs to pause so long on partition detection. Not much easy meat here, but improving the speed of udev in itself would be worthwhile.

I spotted a nice tool for turning strace output into timelines.

Update — bash globbing

Having solved the cpio problem, the largest bottleneck in creating the supermin appliance becomes globbing.

It’s probably not a well-known fact, but if you do:

ls *.c *.h

then bash reads the directory twice. It treats the two globs on the command line as completely separate entities. This is not so bad for a few globs, but when we make the supermin appliance we need to do over 120 globs, which takes bash about 0.8s to complete, contributing about 10% to the overall launch time. Bash is literally reading the same directory over and over again, 50 or more times.

At the moment it’s not obvious to me how to solve this.
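
For what it’s worth, the general shape of a fix is to read the directory once and match all the patterns against that single listing. The Python sketch below shows the idea only. It is not how the supermin appliance is built, glob_many is a made-up name, and fnmatch is only an approximation of bash’s globbing rules.

```python
import fnmatch
import os
import tempfile

def glob_many(directory, patterns):
    """Match many shell-style patterns against one directory listing."""
    names = sorted(os.listdir(directory))  # the directory is read exactly once
    matches = {}
    for pattern in patterns:
        matches[pattern] = [
            name for name in names
            if fnmatch.fnmatchcase(name, pattern)
            # Like the shell, do not let a wildcard match a leading dot.
            and (pattern.startswith(".") or not name.startswith("."))
        ]
    return matches

# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("main.c", "util.c", "util.h", "README", ".hidden.c"):
        open(os.path.join(tmp, name), "w").close()
    result = glob_many(tmp, ["*.c", "*.h"])

print(result)
```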

Half-baked ideas: inject syscalls into virtual machines

For more half-baked ideas, see my “ideas” tag.

After we wrote virt-df and later libguestfs, what customers were asking me about was to be able to read out of /proc and /sys in a running virtual machine.

Of course that’s not possible with libguestfs. libguestfs reads the filesystem. /proc is a synthetic “filesystem” that only exists in the living C structs in the Linux kernel. What’s worse, those structs change with every release and every vendor specific patch. Following C structs is not easy, although we did it (with the help of a giant database) for virt-mem.

The prize of being able to “read” /proc is great — reading out statistics, process tables, network configuration, and much more information besides.

To do this tractably, what we need is to be able to inject syscalls into the virtual machine. If we could inject the following sequence of syscalls, we’d be able to read /proc in a completely portable manner without needing to chase kernel structs:

open ("/proc", O_RDONLY);
getdents (fd, ...);
read (fd, ...);
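
For comparison, this is roughly what that sequence looks like when a process inside the guest issues it natively, written in Python. It says nothing about the injection mechanism itself. The helper name list_and_read is invented, and a temporary directory stands in for /proc so the example runs anywhere; inside a real guest you would pass "/proc" and a file name such as "version". Note that reading a file needs a second open relative to the directory, since read on the directory descriptor itself fails.

```python
import os
import tempfile

def list_and_read(root, name, limit=4096):
    """List a directory and read one file in it using raw file descriptors."""
    dir_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)  # open("/proc", ...)
    try:
        # Listing by descriptor ends up in getdents on Linux.
        entries = sorted(os.listdir(dir_fd))
        # Open the file relative to the directory descriptor (openat).
        fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
        try:
            data = os.read(fd, limit)                      # read(fd, ...)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)
    return entries, data

# Demonstration with a stand-in directory instead of the real /proc.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "version"), "w") as f:
        f.write("example contents\n")
    entries, data = list_and_read(tmp, "version")

print(entries, data)
```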

Here is my half-baked idea for how to do it.

  1. Wait for a userspace program to be running. Then pause the VM.
  2. Fork qemu, so we have a complete copy of the VM, its state, memory and so on. The original (parent) process can now be resumed, and hence the VM resumes. The rest of this discussion concerns only the child process.
  3. Disconnect qemu from any outside influence. This means disconnecting any block devices, network devices, and perhaps other devices. This ensures our private copy of the VM can’t accidentally overwrite any state from the real, running VM.
  4. At this point we have a “captured” userspace process in the VM. It doesn’t particularly matter which process we happened to capture. We now set up the stack frame and registers for the system call we want to execute. Any previous contents of the memory and registers can be discarded.
  5. Set the emulation running. (The captured userspace process now runs and performs the syscall).
  6. Trap back into qemu when the syscall exits.
  7. Capture the return value from the syscall, which might be a status code, error or read buffer. In any case, we’ve successfully injected a syscall into the VM and this has allowed us to read something out of /proc.
  8. Discard the qemu child process.

We make the modest assumption that the syscall we chose will run without scheduling. Even if it does schedule, the fact that we have disconnected qemu from any block devices (writes effectively go to /dev/null) should mean at least it won’t damage anything.

Notice that we’re using the public syscall interface to the Linux kernel, not depending on the details of changing internal structures.

As ideas go this seems tractable, although the implementation is both technically difficult and probably hard to get upstream. We need a way to trap-and-pause when a VM switches to userspace. We need to be able to fork the VM and do all sorts of modifications on our copy. Then we would need some nice wrappers around this so the user just has to type “virt-ifconfig myvm” (note: previously we implemented virt-ifconfig as part of the virt-mem project by chasing kernel structs).

How does mount load the right kernel module?

On any recent Linux distro, you can mount any filesystem type directly. For example:

# dd if=/dev/zero of=/tmp/test.img bs=4k count=4096
# mkfs.xfs /tmp/test.img
# mount -v -o loop /tmp/test.img /mnt/tmp

The mount command works even though I didn’t have the xfs.ko kernel module loaded, and I didn’t tell mount that it’s xfs.

How does it do that? I asked around several people at work and no one could give me the correct answer. So in this article I’ll describe exactly how it works.

First of all, I’ll mention two wrong answers to this: (a) the kernel doesn’t have a magic “mount any filesystem” syscall, and (b) it’s nothing to do with either /proc/filesystems or /etc/filesystems.

For (a), the mount(2) syscall clearly takes a filesystem type (string). As for (b), /proc/filesystems only lists filesystems which are known to the kernel already, i.e. ones which are built in or for which we’ve already loaded the right module. Since I didn’t have the xfs module loaded, xfs wasn’t listed in /proc/filesystems before I ran the mount command.
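
If you want to check point (b) yourself, /proc/filesystems is easy to parse: each line is a filesystem name, optionally preceded by “nodev” for filesystems that don’t need a block device. Here is a small Python sketch. The sample text is made up for illustration; on a real machine you would read the file itself.

```python
def parse_filesystems(text):
    """Map each filesystem name to True if it needs a block device."""
    known = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[-1]
        known[name] = not (len(fields) == 2 and fields[0] == "nodev")
    return known

# Illustrative sample in the same layout as /proc/filesystems.
SAMPLE = "nodev\tsysfs\nnodev\tproc\n\text3\n\text4\n"
known = parse_filesystems(SAMPLE)
print("xfs" in known)   # False until the xfs module has been loaded
```

On a live system, parse_filesystems(open("/proc/filesystems").read()) gives the current list.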

This should be enough of a clue that there must be some utility in userspace which knows how to probe the type by just looking at the header of any arbitrary filesystem. This utility is blkid, which used to be part of e2fsprogs but has now been combined with util-linux-ng.

blkid can probe a filesystem that it has not seen before and tell what type it is:

# blkid /tmp/test.img 
/tmp/test.img: UUID="c80ebc11-3b26-4b93-acbb-f52bdfaa9ac5" TYPE="xfs" 

Looking at the source for blkid confirms there is a directory full of probe tools for every conceivable filesystem.
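
To give a flavour of what such a probe does, here is a deliberately tiny Python sketch that recognises two filesystems by their magic numbers: XFS puts the bytes “XFSB” at the very start of the device, and ext2/3/4 store the little-endian value 0xEF53 at byte offset 1080. This is not libblkid’s code, the function name probe is made up, and the real probes check many more fields (and many more filesystems) before deciding.

```python
import struct

def probe(image: bytes):
    """Guess the filesystem type from magic numbers, or return None."""
    # XFS: superblock magic "XFSB" at offset 0.
    if image[0:4] == b"XFSB":
        return "xfs"
    # ext2/3/4: 16-bit magic 0xEF53 at offset 56 of the superblock,
    # which itself starts 1024 bytes into the device.
    if len(image) >= 1082 and struct.unpack_from("<H", image, 1080)[0] == 0xEF53:
        return "ext"
    return None

# Fake "images" containing only the magic numbers, for demonstration.
xfs_image = b"XFSB" + bytes(4092)
ext_image = bytearray(4096)
struct.pack_into("<H", ext_image, 1080, 0xEF53)
blank_image = bytes(4096)

print(probe(xfs_image), probe(bytes(ext_image)), probe(blank_image))
```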

The mount utility calls out to blkid — actually to the libblkid library, not to the command line tool, but it comes to the same thing.

So /bin/mount knows what it’s mounting, and requests the “xfs” filesystem type when it issues the system call into the kernel.

That still leaves the question of how the xfs module gets loaded. The answer is that the mount syscall eventually calls the kernel function __request_module. This strange function actually calls out to the userspace /sbin/modprobe binary, causing the module to get loaded. Meanwhile the mount syscall itself is paused. And yes, it even deals with the recursive situation where modprobe might need to mount filesystems or load other modules in order to succeed.

So there you have it, mounting a filesystem can magically load the right kernel module for that filesystem. All done using some userspace probing and some kernel trickery.

Size of RPM dependencies

Continuing the theme of minimal Fedora installs, I had a go at visualizing RPM dependencies and the size of those dependencies.

This problem is harder than I thought. The pretty filelight diagrams in my last post are possible because filesystems are simple trees (if you discount hard links).

However package dependencies are directed graphs, often containing loops and diamonds. If pam pulls in cracklib-dicts, that’s not good because cracklib-dicts is big. But if it was pulled in by another package already, then pam got it “for free” so do we need to worry? You can massively change your views on whether a particular package is excessively large just by changing the way you divide up these shared dependencies between packages.
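
Here is a small Python sketch of why the accounting is slippery. It is not the rpmdepsize code. The dependency graph and the sizes are invented for illustration (including the package called other-pkg), and the helper names are hypothetical. The point is only that the “cost” of pam changes completely depending on what is assumed to be installed already.

```python
def closure(package, deps):
    """All packages pulled in by `package`, including itself (loops are fine)."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, ()))
    return seen

def incremental_cost(package, already_installed, deps, sizes):
    """Extra size pulled in by `package` on top of what is already installed."""
    present = set()
    for root in already_installed:
        present |= closure(root, deps)
    return sum(sizes[p] for p in closure(package, deps) - present)

# Hypothetical graph and sizes (in MB), with a loop and a shared dependency.
deps = {
    "pam": ["cracklib-dicts", "glibc"],
    "other-pkg": ["cracklib-dicts", "glibc"],
    "cracklib-dicts": ["glibc"],
    "glibc": ["glibc-common"],
    "glibc-common": ["glibc"],
}
sizes = {"pam": 1, "other-pkg": 2, "cracklib-dicts": 9, "glibc": 4, "glibc-common": 3}

alone = incremental_cost("pam", [], deps, sizes)
after_other = incremental_cost("pam", ["other-pkg"], deps, sizes)
print(alone, after_other)
```

With these made-up numbers pam costs 17 MB on an empty system but only 1 MB if other-pkg has already dragged in cracklib-dicts.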

What I’m working on is a more interactive tool that will let you explore these possibilities — so you will be able to “remove” a dependency and see how that changes the situation. Also you should be able to explore different ways of dividing up shared deps.

That part is not written yet, but I do have a visualisation of the dependencies of some packages already. Please keep in mind that the width of each bar is the incremental cost of the dependency (in terms of all extra data that it pulls in) and it does not mean that a particular package is bloated or excessive.

[Images: rpmdepsize output for openssh, coreutils, kernel and gnome-desktop]

The rpmdepsize program is an ad-hoc mixture of Python and OCaml. Caveat emptor.
