Without the GUI is a bit faster: https://bellard.org/jslinux/vm.html?cpu=riscv64&url=https://bellard.org/jslinux/fedora29-riscv-2.cfg&mem=256
What happens to filesystems and programs when the disk is slow? You can test this using nbdkit and the delay filter. This command creates a 4G virtual disk in memory and injects a 2 second delay into every read operation:
$ nbdkit --filter=delay memory size=4G rdelay=2
You can loopback mount this as a device on the host:
# modprobe nbd
# nbd-client -b 512 localhost /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 4096MB
Connected /dev/nbd0
Partitioning and formatting is really slow!
# sgdisk -n 1 /dev/nbd0
Creating new GPT entries in memory.
... sits here for about 10 seconds ...
The operation has completed successfully.
# mkfs.ext4 /dev/nbd0p1
mke2fs 1.44.3 (10-July-2018)
waiting ...
Actually I killed it and decided to restart the test with a smaller delay. Since the memory plugin was rewritten to use a sparse array, we’re serializing all requests as an easy way to lock the sparse array data structure. This doesn’t matter normally because requests to the memory plugin are extremely fast, but once you inject delays this means that every request into nbdkit is serialized. Thus for example two reads issued in parallel at the same time by the kernel are delayed by 2+2 = 4 seconds instead of 2 seconds in total.
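The 2+2 arithmetic generalizes to any number of requests. Here is a minimal sketch of the two cases, assuming every request is issued at the same moment and incurs the same delay; the function names are made up for illustration and are not part of nbdkit:

```shell
# Hypothetical helpers, not part of nbdkit.
# $1 = number of simultaneous requests, $2 = injected delay in milliseconds.

# Server handles one request at a time: the delays add up.
serialized_ms() {
    echo $(( $1 * $2 ))
}

# Server delays requests concurrently: the total is one delay
# (assumes at least one request).
parallel_ms() {
    echo "$2"
}

serialized_ms 2 2000   # prints 4000
parallel_ms 2 2000     # prints 2000
```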
However shutting down the NBD connection reveals likely kernel bugs in the NBD driver:
[74176.112087] block nbd0: NBD_DISCONNECT
[74176.112148] block nbd0: Disconnected due to user request.
[74176.112151] block nbd0: shutting down sockets
[74176.112183] print_req_error: I/O error, dev nbd0, sector 6144
[74176.112252] print_req_error: I/O error, dev nbd0, sector 6144
[74176.112257] Buffer I/O error on dev nbd0p1, logical block 4096, async page read
[74176.112260] Buffer I/O error on dev nbd0p1, logical block 4097, async page read
[74176.112263] Buffer I/O error on dev nbd0p1, logical block 4098, async page read
[74176.112265] Buffer I/O error on dev nbd0p1, logical block 4099, async page read
[74176.112267] Buffer I/O error on dev nbd0p1, logical block 4100, async page read
[74176.112269] Buffer I/O error on dev nbd0p1, logical block 4101, async page read
[74176.112271] Buffer I/O error on dev nbd0p1, logical block 4102, async page read
[74176.112274] Buffer I/O error on dev nbd0p1, logical block 4103, async page read
Note nbdkit did not return any I/O errors, but the connection was closed with in-flight delayed requests. Well at least our testing is finding bugs!
I tried again with a 500ms delay and using the file plugin which is fully parallel:
$ rm -f temp
$ truncate -s 4G temp
$ nbdkit --filter=delay file file=temp rdelay=500ms
Partitioning and creating a filesystem (same steps as above) went much more smoothly this time, because of the shorter delay and because parallel kernel requests are delayed “in parallel”. I then mounted it on a temp directory:
# mount /dev/nbd0p1 /tmp/mnt
The effect is rather strange, like using an NFS mount from a remote server. Initial file reads are slow, and then they are fast (as they are cached in memory). If you drop Linux caches:
# echo 3 > /proc/sys/vm/drop_caches
then everything becomes slow again.
Confident that parallel requests were being delayed in parallel, I also increased the delay back up to 2 seconds (still using the file plugin). This is like swimming in treacle or what I imagine it would be like to mount an NFS filesystem from the other side of the world over a 56K modem.
I wasn’t able to find any further bugs, but this should be useful for someone who wants to test this kind of thing.
In part 1 and part 5 of this series I created some giant disks with a virtual size of 2^63-1 bytes (8 exabytes). However these were stored in memory using nbdkit-memory-plugin so you could never allocate more space in these disks than available RAM plus swap.
This is a problem when testing some filesystems because the filesystem overhead (the space used to store superblocks, inode tables, block free maps and so on) can be 1% or more.
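To put a rough number on that, here is a small sketch using plain shell integer arithmetic. The helper name is made up, and the 1% figure is just the example overhead mentioned above, not a measurement of any particular filesystem:

```shell
# Hypothetical helper: metadata overhead in whole TiB for a disk of
# $1 bytes when the filesystem needs $2 percent of it.
overhead_tib() {
    # Divide before multiplying so the intermediate value stays within
    # the shell's signed 64-bit arithmetic. 1099511627776 is 2^40.
    echo $(( $1 / 100 * $2 / 1099511627776 ))
}

overhead_tib 9223372036854775807 1   # prints 83886
```

So 1% of a 2^63-1 byte disk is on the order of 80,000 TiB, far beyond the RAM plus swap of any ordinary machine.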
The solution to this is to back the virtual disks using a sparse file instead. XFS lets you create sparse files up to 2^63-1 bytes and you can serve them using nbdkit-file-plugin instead:
$ rm -f temp
$ truncate -s $(( 2**63 - 1 )) temp
$ stat -c %s temp
9223372036854775807
$ nbdkit file file=temp
nbdkit-file-plugin recently got a lot of updates to ensure it always maintains sparseness where possible and supports efficient zeroing, so make sure you’re using nbdkit ≥ 1.6.
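If you want to check that a backing file really is still sparse, one way is to compare its apparent size with the space actually allocated to it. This is only a sketch: it assumes GNU coreutils stat (the same tool used above), and the helper name is made up:

```shell
# Hypothetical helper: print "<apparent bytes> <allocated bytes>" for file $1.
# Assumes GNU stat: %s = apparent size, %b = blocks allocated,
# %B = size in bytes of each block reported by %b.
sparse_report() {
    apparent=$(stat -c %s "$1")
    allocated=$(( $(stat -c %b "$1") * $(stat -c %B "$1") ))
    echo "$apparent $allocated"
}

# Example (not run here):
#   truncate -s 1G temp
#   sparse_report temp
# The second number should stay far below the first for a sparse file.
```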
Now you can serve this in the ordinary way and you should be able to allocate as much space as is available on the host filesystem:
# nbd-client -b 512 localhost /dev/nbd0
Negotiation: ..size = 8796093022207MB
Connected /dev/nbd0
# blockdev --getsize64 /dev/nbd0
9223372036854774784
# sgdisk -n 1 /dev/nbd0
# gdisk -l /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048  18014398509481948   8.0 EiB     8300
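The end sector that gdisk reports can be cross-checked from the device size. The sketch below assumes the standard GPT layout with 128 partition entries of 128 bytes each, where the backup partition array and backup header occupy the final sectors of the disk; the function name is made up:

```shell
# Hypothetical helper: last usable sector (LBA) of a GPT-partitioned disk.
# $1 = device size in bytes, $2 = sector size in bytes.
# Assumes 128 partition entries of 128 bytes each.
gpt_last_usable_lba() {
    total_sectors=$(( $1 / $2 ))
    entry_sectors=$(( 128 * 128 / $2 ))
    # The last sector holds the backup header, preceded by the backup
    # entry array; sectors are numbered from 0, hence the extra -1.
    echo $(( total_sectors - 2 - entry_sectors ))
}

gpt_last_usable_lba 9223372036854774784 512   # prints 18014398509481948
```

Feeding in the blockdev size and the 512 byte sector size from the transcript gives exactly the end sector shown by gdisk.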
This command will still probably fail unless you have a lot of patience and a huge amount of space on your host:
# mkfs.xfs -K /dev/nbd0p1
Thanks Chris Murphy for noting that btrfs can create and mount 8 EB (approx 2^63 byte) filesystems effortlessly:
$ nbdkit -fv memory size=$(( 2**63-1 ))
# modprobe nbd
# nbd-client -b 512 localhost /dev/nbd0
# blockdev --getss /dev/nbd0
512
# gdisk /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048  18014398509481948   8.0 EiB     8300  Linux filesystem
# mkfs.btrfs -K /dev/nbd0p1
btrfs-progs v4.16
See http://btrfs.wiki.kernel.org for more information.
Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               770e5592-9055-4551-8416-b6376802a2ad
Node size:          16384
Sector size:        4096
Filesystem size:    8.00EiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     8.00EiB  /dev/nbd0p1
# mount /dev/nbd0p1 /tmp/mnt
# df -h /tmp/mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/nbd0p1     8.0E   17M  8.0E   1% /tmp/mnt
I created a few files in there and it all seemed to work although I didn’t do any extensive testing. Good job btrfs!
nbdkit is a pluggable NBD server with a filter system that you can layer over plugins to transform block devices. One of the filters is the error filter which lets you inject errors. We can use this to find out how well filesystems cope with errors and recovering from errors.
$ rm -f /tmp/inject
$ nbdkit -fv --filter=error memory size=$(( 2**32 )) \
    error-rate=100% error-file=/tmp/inject
# nbd-client localhost /dev/nbd0
We can create a filesystem normally:
# sgdisk -n 1 /dev/nbd0
# gdisk -l /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024         4194286   4.0 GiB     8300
# mkfs.ext4 /dev/nbd0p1
# mount /dev/nbd0p1 /mnt
It’s very interesting watching the verbose output of nbdkit -fv because you can see the lazy metadata creation which the Linux ext4 kernel driver carries out in the background after you mount the filesystem the first time.
So far we have not injected any errors. To do that we create the error file (/tmp/inject), which the error filter will notice and respond to by injecting EIO errors until we remove the file:
# touch /tmp/inject
# ls /mnt
ls: reading directory '/mnt': Input/output error
# rm /tmp/inject
# ls /mnt
lost+found
Ext4 recovered once we stopped injecting errors, but …
# touch /mnt/hello
touch: cannot touch '/mnt/hello': Read-only file system
… it responded to the error by remounting the filesystem read-only. Interestingly I was not able to simply remount the filesystem read-write. Ext4 forced me to unmount the filesystem and run e2fsck before I could mount it again.
e2fsck also said:
e2fsck: Unknown code ____ 251 while recovering journal of /dev/nbd0p1
which I guess is a bug (already found upstream).
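If you script this kind of test, it is easy to forget to remove the trigger file and leave the filter injecting errors. A small wrapper can make the touch/rm pair harder to get wrong. This is only a sketch; the function name is made up, and the trigger path is whatever you passed to error-file:

```shell
# Hypothetical helper: run a command while the error filter's trigger
# file exists, then remove the file again whatever the command returned.
# $1 = trigger file (the error-file= path), remaining args = command.
with_injection() {
    trigger=$1
    shift
    touch "$trigger"
    "$@"
    status=$?
    rm -f "$trigger"
    return "$status"
}

# Example (not run here), matching the transcript above:
#   with_injection /tmp/inject ls /mnt
```

The wrapper returns the exit status of the command it ran, so a test script can check whether the operation failed under injected errors.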
$ nbdkit -f -v memory size=$(( 2**63-1 ))
On the same machine:
# modprobe nbd
# nbd-client localhost /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 8796093022207MB
Connected /dev/nbd0
# sgdisk -n 1 /dev/nbd0
Creating new GPT entries in memory.
The operation has completed successfully.
# gdisk -l /dev/nbd0
Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024  9007199254740973   8.0 EiB     8300
You can then try fun things like creating massive XFS filesystems.
A few caveats:
- You must be using kernel ≥ 4.18. Earlier kernels had a bunch of bugs in the Linux NBD driver.
- You must apply this fix for this bug in the nbd-client program. I have updated the Fedora packages so make sure you get the latest update from Fedora updates-testing.
- You will need nbdkit ≥ 1.6 which has support for sparse memory-backed huge disks.
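The version requirements above can be checked from a script before starting a test. Here is a minimal sketch that compares only the major and minor numbers; the function name is made up, and it does not check for the nbd-client fix:

```shell
# Hypothetical helper: succeed if version $1 (e.g. "4.18.5-200.fc28")
# is at least version $2 (e.g. "4.18"). Only major.minor are compared.
version_ge() {
    have_major=${1%%.*}
    rest=${1#*.}
    have_minor=${rest%%[!0-9]*}
    want_major=${2%%.*}
    rest=${2#*.}
    want_minor=${rest%%[!0-9]*}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] &&
          [ "$have_minor" -ge "$want_minor" ]; }
}

# Example (not run here):
#   version_ge "$(uname -r)" 4.18 || echo "kernel too old for these tests"
```

The same helper could be pointed at the nbdkit version, assuming you extract the number from the output of nbdkit --version yourself.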
I think it’d be interesting to integrate this into filesystem test suites. Unfortunately use of the Linux NBD kernel driver needs root 😦
There are bootable (but very minimal) disk images built cleanly from RPMs: https://fedorapeople.org/groups/risc-v/disk-images/
More soon …