Red Hat Research article on RISC-V extensions

Finally been published …

https://research.redhat.com/blog/article/risc-v-extensions-whats-available-and-how-to-find-it/


nbdkit binaries for Windows

Much requested, I’m now building nbdkit binaries for Windows. You can get them from the Fedora Koji build system by following this link. Choose the latest build by me (not one of the automatic builds), then under the noarch heading look for a package called mingw64-nbdkit-<version>. Download this and unpack it with your favourite tool that can handle RPM files.
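
For example, on a Linux machine the package can be unpacked with rpm2cpio and cpio, and on Windows itself 7-Zip should be able to open the file directly. A minimal sketch (the exact package filename will vary):

$ rpm2cpio mingw64-nbdkit-*.noarch.rpm | cpio -idmv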

Some notes: this contains a 64 bit Windows binary of nbdkit and a selection of plugins and filters. There is a mingw32-nbdkit package too if you really want a 32 bit binary, but I wouldn’t recommend it. For more information about running nbdkit on Windows, see the instructions here. Source for the binaries is available from either Koji or the upstream git repository. The binaries are cross-compiled and not tested, so if something is broken please let us know on the mailing list.


Heads up! Lichee Pi 4A vs VisionFive 2 vs HiFive Unmatched vs Raspberry Pi 4B

I have a lot of RISC-V and Arm hardware. How do my latest 3 RISC-V purchases stand up against each other and the stalwart Raspberry Pi 4B? Let’s find out!

The similarities between these boards are striking. All have 4 cores, and all except the HiFive Unmatched have 8GB of RAM (the Unmatched has 16GB). All have some kind of flash-based storage: the Raspberry Pi and Sipeed Lichee use external SanDisk SSDs connected over USB 3, while the HiFive Unmatched and VisionFive 2 have NVMe drives (I hope all SBCs provide an NVMe slot going forward).

Since I mainly use these for compiling Fedora packages, I tested compiling qemu using identical configurations. I built it a few times to warm up and then timed the last build, on otherwise unloaded machines. Here are the results:

Board                          Release date   Cost (see note)   qemu build (secs)
HiFive Unmatched (RISC-V)      2020           £1000+            3642
VisionFive 2 (RISC-V)          2022/3         £150              582
Sipeed Lichee Pi 4A (RISC-V)   2023           £200              1376*
Raspberry Pi 4B (Arm)          2019           £238              1154

Note that in the cost column I have included tax, delivery, and all extras that I had to purchase (such as disks) to bring the device up to the tested configuration. This is why the prices are much higher than the sticker price you will see online. Also the Raspberry Pi price is what I paid back in the halcyon days of 2020 before Raspberry Pi shortages.

* The speed test for the Sipeed Lichee was done using the Fedora distribution. Something seems very wrong with the measured speed of this board; given the TH1520 chip, we think it ought to do much better. However, restoring the original Debian distro to it will require a lot more work, because the boot path for this board is insane.

If you would like to try to reproduce these numbers, first download this config file (benchconfig.sh). Then check out the qemu sources at commit 885fc169f09f591 (don’t forget the submodules), and copy benchconfig.sh into the top of the source tree. Then do:

mkdir build
cd build
../benchconfig.sh
make clean
time make -j4

This should compile about 2576 targets. The exact number can vary depending on precisely what you have installed, since it’s hard to make qemu configurations completely identical, but it shouldn’t be much larger or smaller than this.
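
For reference, the checkout steps might look something like this (the repository URL and submodule flags are my assumptions; the commit hash is the one given above):

$ git clone https://gitlab.com/qemu-project/qemu.git
$ cd qemu
$ git checkout 885fc169f09f591
$ git submodule update --init --recursive

After that, drop benchconfig.sh into this directory and run the build steps above.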


LicheePi 4A cpuinfo etc

# cat /proc/cpuinfo
processor	: 0
hart		: 0
isa		: rv64imafdcvsu
mmu		: sv39
cpu-freq	: 1.848Ghz
cpu-icache	: 64KB
cpu-dcache	: 64KB
cpu-l2cache	: 1MB
cpu-tlb		: 1024 4-ways
cpu-cacheline	: 64Bytes
cpu-vector	: 0.7.1

processor	: 1
hart		: 1
isa		: rv64imafdcvsu
mmu		: sv39
cpu-freq	: 1.848Ghz
cpu-icache	: 64KB
cpu-dcache	: 64KB
cpu-l2cache	: 1MB
cpu-tlb		: 1024 4-ways
cpu-cacheline	: 64Bytes
cpu-vector	: 0.7.1

processor	: 2
hart		: 2
isa		: rv64imafdcvsu
mmu		: sv39
cpu-freq	: 1.848Ghz
cpu-icache	: 64KB
cpu-dcache	: 64KB
cpu-l2cache	: 1MB
cpu-tlb		: 1024 4-ways
cpu-cacheline	: 64Bytes
cpu-vector	: 0.7.1

processor	: 3
hart		: 3
isa		: rv64imafdcvsu
mmu		: sv39
cpu-freq	: 1.848Ghz
cpu-icache	: 64KB
cpu-dcache	: 64KB
cpu-l2cache	: 1MB
cpu-tlb		: 1024 4-ways
cpu-cacheline	: 64Bytes
cpu-vector	: 0.7.1

# free -m
               total        used        free      shared  buff/cache   available
Mem:            7803         432        6816          11         645        7371
Swap:              0           0           0

# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
mmcblk0      179:0    0  7.3G  0 disk 
|-mmcblk0p1  179:1    0    2M  0 part 
|-mmcblk0p2  179:2    0  500M  0 part /boot
`-mmcblk0p3  179:3    0  6.8G  0 part /
mmcblk0boot0 179:8    0    4M  1 disk 
mmcblk0boot1 179:16   0    4M  1 disk 


Sipeed Lichee Pi 4A

At some point I will do a head to head comparison of the HiFive Unmatched, VisionFive 2, Lichee Pi 4A, and Raspberry Pi 4B. I believe this little Lichee board might win!


Follow up to “I booted Linux 292,612 times”

Well that blew up. It was supposed to be just a silly off-the-cuff comment about how some bugs are very tedious to bisect.

To answer a few questions people had, here’s what actually happened. As they say, don’t believe everything you read in the press.

A few weeks ago I noticed that some nbdkit tests which work by booting a Linux appliance under qemu were randomly hanging. I ignored it to start with, but it got annoying, so I decided to try to track down what was going on. Initially we thought it might be a qemu bug, so I started by filing a bug there and writing up my thoughts as I investigated. After swapping qemu, Linux guest and Linux host versions around, it became clear that the problem was probably in the Linux guest kernel (although I didn’t rule out an issue with KVM emulation, which might have implicated either qemu or the host kvm.ko module).

Initially I just had a hang, and because getting to that hang involved booting Linux hundreds or thousands of times, it wasn’t feasible to attach gdb at the start and trace through to the hang. Instead I had to connect gdb after observing the hang. It turns out that when the Linux guest was “hanging” it had really just missed a timer event, so the kernel was still running, albeit making no progress. The upshot is that the stack trace you see is not of the hang itself, but of an idle, slightly confused kernel, so gdb was out of the picture.

But since guest kernel 6.0 seemed to work and 6.4rc seemed to hang, I had a path to bisecting the bug.

Well, a very slow path. You see, there are 52,363 commits between those two kernels, which means at least 15 or 16 bisect steps. Each step was going to involve booting the kernel thousands of times to prove it was working (if it hung before that, I’d observe the hang).

I made the mistake here of not first writing a good test, instead just running “while guestfish … ; echo -n . ; done” and watching until I’d seen a page of dots before judging the kernel “good”. Yeah, that didn’t work. It turns out the hang was made more likely by lightly loading the test machine (or running the tests in parallel, which is the same thing). As a result my first bisection, which took several days, landed on the wrong commit.

Back to the drawing board. This time I wrote a proper test. It booted the kernel 10,000 times using 8 threads, checking the qemu output to see whether each boot had hung; if so it stopped the test and printed a diagnostic, and if it got through all iterations it printed “test ok”. This time my bisection was better, but it still took a couple of days.

At that point I thought I had the right commit, but Paolo Bonzini suggested that I boot the kernel in parallel, in a loop, for 24 hours at the commit immediately before the suspect one, to try to show that there was no latent issue in the kernel beforehand. (As it turns out, while this is a good idea, the analysis is subtly flawed, as we’ll see.)

So I did just that. After 21 hours I got bored (plus this is using a lot of electricity and generating huge amounts of heat, and we’re in the middle of a heatwave here in the UK). I killed the test after 292,612 successful boots.

I had a commit that looked suspicious, but what to do now? I posted my findings on LKML.

We still didn’t fully understand how to trigger the hang, except that it was annoying and rare, seemed to happen with different frequencies on AMD and Intel hardware, and could be reproduced by several independent people. Crucially, though, kernel developer Peter Zijlstra could not reproduce it.

[For the record, the bug is a load- and hardware-speed-dependent race condition. It will particularly affect qemu virtual machines, but at least in theory it could happen on bare metal. It’s not AMD or Intel specific; the difference is just timing.]

By this point several other people had observed the hang including CoreOS developers and Rishabh Bhatnagar at Amazon.

A commenter on Hacker News pointed out that simply inserting a sleep into the problematic code path caused the same hang (and I verified that). So the commit I had bisected to was the wrong one again: it merely exposed a latent bug, because the code it added had the same effect as inserting a sleep. It was the delay which exposed the bug, not the commit I’d spent a week bisecting. And the 292K boots didn’t in fact prove there was no latent bug. You live and learn …

Eventually the Amazon thread led to Thomas Gleixner suggesting a fix.

I tested the fix and … it worked!

Unfortunately the patch that introduced the bug has already gone into several stable trees, meaning that many more people will likely hit the problem in future, but thanks to the heroic efforts of many people (not me, really) the bug has now been fixed.


I booted Linux 292,612 times

And it only took 21 hours.

Linux 6.4 has a bug where it hangs on boot, but probably only 1 in 1000 boots (and rarer if using Intel hardware for some reason). It’s surprising to me that no one has noticed this, but I certainly did because our nbdkit tests which use libguestfs were randomly hanging, always at the same place early in booting the libguestfs qemu appliance:

[    0.070120] Freeing SMP alternatives memory: 48K

So to bisect this I had to run guestfish in a loop until it either hung or didn’t. How many times? I chose 10,000 boots as a good threshold. To make this easier I wrote a test harness which uses up to 8 threads and parses the qemu output to detect the hang.
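
The real harness parses the qemu output to spot the hang; the sketch below only captures the general shape of it, using a timeout as a crude hang detector (the guestfish invocation, the 5 minute timeout and the per-worker iteration count are all illustrative assumptions, not the actual harness):

#!/bin/bash
# Boot the libguestfs appliance repeatedly in 8 parallel workers and
# flag any boot that appears to hang (i.e. takes more than 5 minutes).
boot_loop () {
    local worker=$1
    for i in $(seq 1 1250); do            # 8 workers x 1250 = 10,000 boots
        if ! timeout 300 guestfish -a /dev/null run; then
            echo "worker $worker, iteration $i: boot hung or failed"
            exit 1
        fi
    done
}
pids=()
for w in $(seq 1 8); do boot_loop "$w" & pids+=("$!"); done
status=0
for p in "${pids[@]}"; do wait "$p" || status=1; done
[ "$status" -eq 0 ] && echo "test ok"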

After a painful bisection between v6.0 and v6.4-rc6 which took many days I found the culprit, a regression in the printk time feature: https://lkml.org/lkml/2023/6/13/733

To prove it, I booted Linux 292,612 times at the commit just before the faulty one (all successful), and then at the faulty commit, where it failed in under 1,000 boots.


NBD-backed qemu guest RAM

This seems too crazy to work, but it does:

$ nbdkit memory 1G
$ nbdfuse mem nbd://localhost &
[1] 1053075
$ ll mem
-rw-rw-rw-. 1 rjones rjones 1073741824 May 17 18:31 mem

Now boot qemu with that file as the backing RAM (the mem file above was created in /var/tmp):

$ qemu-system-x86_64 -m 1024 \
-object memory-backend-file,id=pc.ram,size=1024M,mem-path=/var/tmp/mem,share=on \
-machine memory-backend=pc.ram \
-drive file=fedora-36.img,if=virtio,format=raw

It works! You can even dump the RAM over a second NBD connection and grep for strings which appear on the screen (or passwords etc):

$ nbdcopy nbd://localhost - | strings | grep 'There was 1 failed'
There was 1 failed login attempt since the last successful login.

Of course this isn’t very useful on its own; it’s just an awkward way to use a sparse RAM disk as guest RAM. But nbdkit has plenty of other plugins that might be useful here. How about remote RAM? You’ll need a very fast network.
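
As a rough sketch of the remote RAM idea (the hostname and sizes here are made up, and in practice you would also want TLS on the NBD connection and a very low-latency link):

remote$ nbdkit memory 4G
local$ nbdfuse /var/tmp/mem nbd://remote.example.com &
local$ qemu-system-x86_64 -m 4096 \
-object memory-backend-file,id=pc.ram,size=4096M,mem-path=/var/tmp/mem,share=on \
-machine memory-backend=pc.ram \
-drive file=fedora-36.img,if=virtio,format=raw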


nbdkit’s evil filter

If you want to simulate how your filesystem behaves with a bad drive underneath, you have a few options, like the kernel dm-flakey device, writing a bash nbdkit plugin, kernel fault injection, or a few others. We didn’t have that facility in nbdkit itself, however, so last week I started the “evil filter”.

The evil filter can add data corruption to an existing nbdkit plugin. Types of corruption include “cosmic rays” (ie. random bit flips), but more realistically it can simulate stuck bits. Stuck bits are the only failure mode I can remember seeing in real disks and RAM.

One challenge with writing a filter like this is making the stuck bits persistent across accesses, without requiring us to maintain a large bitmap in the filter to track their locations. The solution is fairly elegant: split the underlying disk into blocks. When we read from a block, we reconstruct the stuck bits within that block from a fixed seed (calculated from a global PRNG seed plus the block’s offset), iterating across the block and advancing by random intervals. The intervals are derived from the block’s seed, so they come out the same each time they are calculated. We size the blocks so that each one has about 100 corrupted bits, so this reconstruction doesn’t take very long. Nothing is stored except the one global PRNG seed.

The filter isn’t upstream yet but hopefully it can be another way to test filesystems and distributed storage in future.


Frame pointers vs DWARF – my verdict

A couple of weeks ago I wrote a blog posting here about Fedora having frame pointers (LWN backgrounder, HN thread). I made some mistakes in that blog posting and retracted it, but I wasn’t wrong about the conclusions, just wrong about how I reached them. Frame pointers are much better than DWARF. DWARF unwinding might have some theoretical advantages but it’s worse in every practical respect.

In particular:

  1. Frame pointers give you much faster profiling with much less overhead. This practically means you can do continuous performance collection and analysis which would be impossible with DWARF.
  2. DWARF unwinding has foot-guns which make it easy to screw up and collect insufficient data for analysis. You cannot know in advance how much data to collect. The defaults are much too small, and even increasing the collection size to unreasonably large sizes isn’t enough.
  3. The overhead of collecting DWARF callgraph data adversely affects what you’re trying to analyze.
  4. Frame pointers have some corner cases which they don’t handle well (certain leaf and most inlined functions aren’t collected), but these don’t matter a great deal in reality.
  5. DWARF unwinding can show inlined functions as if they are separate stack frames. (Opinions differ as to whether or not this is an advantage.)

Below I’ll try to demonstrate some of the issues, but first a little bit of background is necessary about how all this works.

When you run perf record -a on a workload, the kernel fires a timer interrupt on every CPU 100s or 1000s of times a second. Each interrupt must collect a stack trace for that CPU at that moment which is then sent up to the userspace perf process that writes it to a perf.data file in the current directory. Obviously collecting this stack trace and writing it to the file must be done as quickly as possible with the least overhead.

Also, the stack trace may start inside the kernel and go all the way out to userspace (unless the CPU was running userspace code at the moment it was interrupted, in which case only the userspace stack is collected). That involves unwinding two different stacks.

For the kernel stack, the kernel has its own unwinding information called ORC. For the userspace stack you choose (with the perf --call-graph option) whether to use frame pointers or DWARF. With frame pointers the kernel can immediately walk up the userspace stack all the way to the top (assuming everything was compiled with frame pointers, which is now true for Fedora 38). With DWARF however the format is complicated and the kernel cannot unwind it immediately. Instead the kernel just copies out the raw user stack, leaving the unwinding to be done later in userspace. But copying the whole stack would consume far too much storage, so by default it only collects the first 8K. Many userspace stacks will be larger than this, in which case the data collection will simply be incomplete and it will never be possible to recover the full stack trace. You can adjust the size of stack collected, but that massively bloats the perf.data file, as we’ll see below.

To demonstrate what I mean, I collected a set of traces using fio and nbdkit on Fedora 38, using both frame pointers and DWARF. The command is:

sudo perf record -a -g [--call-graph=...] -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio'

with the nbd.fio file from fio’s examples.

I used no --call-graph option for collecting frame pointers (as it is the default), and --call-graph=dwarf,{4096,8192,16384,32768} to collect the DWARF examples with 4 different stack sizes.

I converted the resulting data into flame graphs using Brendan Gregg’s tools.
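
A typical invocation of those tools looks something like this (the exact commands and paths are my assumption, not necessarily what was used here):

$ perf script > out.perf
$ ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
$ ./FlameGraph/flamegraph.pl out.folded > flamegraph.svg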

Everything was run on my idle 12 core / 24 thread AMD development machine.

Type             Size of perf.data   Lost chunks   Flame graph
Frame pointers   934 MB              0             Link
DWARF (4K)       10,104 MB           425           Link
DWARF (8K)       18,733 MB           1,643         Link
DWARF (16K)      35,149 MB           5,333         Link
DWARF (32K)      57,590 MB           545,024       Link

The first and most obvious thing is that even with the smallest stack collection size, DWARF’s perf.data is over 10 times larger, and it balloons further once you start to collect more reasonable stack sizes. For a single minute of data collection, collecting tens of gigabytes of data is not very practical even on high end machines, and continuous performance analysis would be impossible at these data rates.

Related to this, the overhead of perf increases. It is about 0.1% for frame pointers. For DWARF the overhead is 0.8% (4K), 1.5% (8K), 2.8% (16K) and 2.7% (32K). But this disguises the true overhead, because it doesn’t count the cost of writing to disk. Unfortunately on this machine I have full disk encryption enabled (which adds a lot to the overhead of writing nearly 60 GB of perf data), but you can see the overhead of all that encryption, separate from perf, in the flame graph. The total overhead of perf + writing + encryption is about 20%.

This may also be the reason for seeing so many “lost chunks” even on this very fast machine. All of the DWARF tests even at the smallest size printed:

Check IO/CPU overload!

But is the DWARF data accurate? Clearly not. This is to be expected: collecting a partial user stack is not going to be enough to reconstruct a stack trace. But remember that even with 4K of stack, the perf.data is already more than 10 times larger than for frame pointers. Zooming in on the nbdkit process only and comparing the flame graphs shows significant numbers of incomplete stack traces, even when collecting 32K of stack.

On the left, nbdkit with frame pointers (correct). On the right, nbdkit with DWARF and 32K collection size. Notice on the right the large number of unattached frames. nbdkit main() does not directly call Unix domain socket send and receive functions!

If 8K (the default) is insufficient, and even 32K is not enough, how large do we need to make the DWARF stack collection? I couldn’t find out because I don’t have enough space for the expected 120 GB perf.data file at the next size up.

Let’s have a look at one thing which DWARF can do — show inlined and leaf functions. The stack trace for these is more accurate as you can see below. (To reproduce, zoom in on the nbd_poll function). On the left, frame pointers. On the right DWARF with 32K stacks, showing the extra enter_* frames which are inlined.

My final summary here is that for most purposes you would be better off using frame pointers, and it’s a good thing that Fedora 38 now compiles everything with frame pointers. It should result in easier performance analysis, and even makes continuous performance analysis more plausible.
