Tag Archives: fedora

Frame pointers vs DWARF – my verdict

A couple of weeks ago I wrote a blog posting here about Fedora having frame pointers (LWN backgrounder, HN thread). I made some mistakes in that blog posting and retracted it, but I wasn’t wrong about the conclusions, just wrong about how I reached them. Frame pointers are much better than DWARF. DWARF unwinding might have some theoretical advantages but it’s worse in every practical respect.

In particular:

  1. Frame pointers give you much faster profiling with much less overhead. This practically means you can do continuous performance collection and analysis which would be impossible with DWARF.
  2. DWARF unwinding has foot-guns which make it easy to screw up and collect insufficient data for analysis. You cannot know in advance how much data to collect. The defaults are much too small, and even increasing the collection size to unreasonably large sizes isn’t enough.
  3. The overhead of collecting DWARF callgraph data adversely affects what you’re trying to analyze.
  4. Frame pointers have some corner cases which they don’t handle well (certain leaf and most inlined functions aren’t collected), but these don’t matter a great deal in reality.
  5. DWARF unwinding can show inlined functions as if they are separate stack frames. (Opinions differ as to whether or not this is an advantage.)

Below I’ll try to demonstrate some of the issues, but first a little bit of background is necessary about how all this works.

When you run perf record -a on a workload, the kernel fires a timer interrupt on every CPU 100s or 1000s of times a second. Each interrupt must collect a stack trace for that CPU at that moment which is then sent up to the userspace perf process that writes it to a perf.data file in the current directory. Obviously collecting this stack trace and writing it to the file must be done as quickly as possible with the least overhead.

Also the stack trace may start inside the kernel and go all the way out to userspace (unless the CPU was running userspace code at the moment it was interrupted in which case it just collects userspace). That involves unwinding the two different stacks.

For the kernel stack, the kernel has its own unwinding information called ORC. For the userspace stack you choose (with the perf record --call-graph option) whether to use frame pointers or DWARF. With frame pointers the kernel is able to immediately walk up the userspace stack all the way to the top (assuming everything was compiled with frame pointers, which is now true for Fedora 38). The DWARF format, however, is complicated and the kernel cannot unwind it immediately. Instead the kernel just copies a chunk of the raw user stack into the sample, to be unwound later in userspace. But collecting the whole stack would consume far too much storage, so by default it only collects the first 8K. Many userspace stacks will be larger than this, in which case the data collection will simply be incomplete: it will never be possible to recover the full stack trace. You can adjust the size of stack collected, but that massively bloats the perf.data file as we’ll see below.
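To make the frame pointer case concrete, here is a rough sketch of the walk, simulated in Python over a made-up memory image. It is not the kernel's code. It assumes the usual x86-64 convention where the frame pointer register points at the saved caller frame pointer and the return address sits 8 bytes above it. The addresses and the walk_frame_pointers name are invented for illustration.

```python
# Simulated 64-bit memory: address -> 8-byte value. All addresses are made up.
MEMORY = {
    0x7000: 0x7040, 0x7008: 0x401234,  # innermost frame: saved fp, return address
    0x7040: 0x7080, 0x7048: 0x401111,  # its caller
    0x7080: 0x0,    0x7088: 0x401000,  # outermost frame: saved fp of 0 ends the chain
}


def walk_frame_pointers(memory, fp, max_depth=128):
    """Follow the chain of saved frame pointers and return the return addresses.

    Assumes [fp] holds the caller's frame pointer and [fp + 8] holds the
    return address into the caller (x86-64 with frame pointers enabled).
    """
    trace = []
    while fp != 0 and len(trace) < max_depth:
        return_address = memory.get(fp + 8)
        if return_address is None:
            break  # unreadable memory: give up with what we have
        trace.append(return_address)
        next_fp = memory.get(fp, 0)
        if next_fp <= fp:
            break  # the stack grows down, so a caller's frame must be higher
        fp = next_fp
    return trace


print([hex(addr) for addr in walk_frame_pointers(MEMORY, 0x7000)])
```

Each step reads only two words per frame, so the cost depends on the call depth rather than on how many bytes the stack occupies. That is the contrast with the DWARF case, where a fixed-size chunk of raw stack has to be copied for every sample.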

To demonstrate what I mean, I collected a set of traces using fio and nbdkit on Fedora 38, using both frame pointers and DWARF. The command is:

sudo perf record -a -g [--call-graph=...] -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio'

with the nbd.fio file from fio’s examples.

I used no --call-graph option for collecting frame pointers (as it is the default), and --call-graph=dwarf,{4096,8192,16384,32768} to collect the DWARF examples with 4 different stack sizes.
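For anyone wanting to repeat the five runs, the command lines can be generated as in the sketch below. It only prints the commands and does not run them. The labels and the perf_commands helper are invented for illustration; the options are the ones quoted above.

```python
import shlex

# Stack sizes (in bytes) used for the DWARF runs described above.
DWARF_STACK_SIZES = [4096, 8192, 16384, 32768]

# The workload from the command above; the --run script is a single argument.
WORKLOAD = ["nbdkit", "-U", "-", "null", "1G", "--run", "export uri; fio nbd.fio"]


def perf_commands():
    """Yield (label, argv) for the frame pointer run and each DWARF run."""
    base = ["sudo", "perf", "record", "-a", "-g"]
    # No --call-graph option: frame pointers are used.
    yield "fp", base + ["--"] + WORKLOAD
    for size in DWARF_STACK_SIZES:
        yield f"dwarf-{size}", base + [f"--call-graph=dwarf,{size}", "--"] + WORKLOAD


for label, argv in perf_commands():
    print(f"{label:12} {shlex.join(argv)}")
```

Every run writes perf.data into the current directory, so the file needs to be moved aside between runs (and there needs to be room for it).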

I converted the resulting data into flame graphs using Brendan Gregg’s tools.

Everything was run on my idle 12 core / 24 thread AMD development machine.

Type              Size of perf.data   Lost chunks   Flame graph
Frame pointers    934 MB              0             Link
DWARF (4K)        10,104 MB           425           Link
DWARF (8K)        18,733 MB           1,643         Link
DWARF (16K)       35,149 MB           5,333         Link
DWARF (32K)       57,590 MB           545,024       Link

The first and most obvious thing is that even with the smallest stack data collection, DWARF’s perf.data is over 10 times larger, and it balloons even larger once you start to collect more reasonable stack sizes. For a single minute of data collection, collecting tens of gigabytes of data is not very practical even on high end machines, and continuous performance analysis would be impossible at these data rates.
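The arithmetic behind those statements is easy to check from the table. The sketch below divides each file size by the frame pointer size and by the capture time. The 60 second figure is an assumption, taking "a single minute" literally; the exact duration of the runs is not recorded here.

```python
# perf.data sizes in MB, copied from the table above.
SIZES_MB = {
    "Frame pointers": 934,
    "DWARF (4K)": 10_104,
    "DWARF (8K)": 18_733,
    "DWARF (16K)": 35_149,
    "DWARF (32K)": 57_590,
}

CAPTURE_SECONDS = 60  # assumption: "a single minute" taken literally

baseline = SIZES_MB["Frame pointers"]

# Size of each file relative to the frame pointer run.
ratios = {name: mb / baseline for name, mb in SIZES_MB.items()}

# Average write rate needed to keep up with each run.
rates_mb_per_s = {name: mb / CAPTURE_SECONDS for name, mb in SIZES_MB.items()}

for name in SIZES_MB:
    print(f"{name:15} {ratios[name]:5.1f}x  {rates_mb_per_s[name]:7.1f} MB/s")
```

Under that assumption the DWARF runs come out at roughly 11 to 62 times the size of the frame pointer run, and the 32K run needs something close to 1 GB/s of sustained writes.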

Related to this, the overhead of perf increases. It is ~ 0.1% for frame pointers. For DWARF the overhead goes: 0.8% (4K), 1.5% (8K), 2.8% (16K), 2.7% (32K). But this disguises the true overhead because it doesn’t count the cost of writing to disk. Unfortunately on this machine I have full disk encryption enabled (which does add a lot to the overhead of writing nearly 60 GB of perf data), but you can see the overhead of all that encryption separate from perf in the flame graph. The total overhead of perf + writing + encryption is about 20%.

This may also be the reason for seeing so many “lost chunks” even on this very fast machine. All of the DWARF tests even at the smallest size printed:

Check IO/CPU overload!

But is the DWARF data accurate? Clearly not. This is to be expected: a partial user stack is not going to be enough to reconstruct a full stack trace. But remember that even with 4K of stack, the perf.data is already more than 10 times larger than for frame pointers. Zooming in to the nbdkit process only and comparing the flame graphs shows a significant number of incomplete stack traces, even when collecting 32K of stack.

On the left, nbdkit with frame pointers (correct). On the right, nbdkit with DWARF and 32K collection size. Notice on the right the large number of unattached frames. nbdkit main() does not directly call Unix domain socket send and receive functions!

If 8K (the default) is insufficient, and even 32K is not enough, how large do we need to make the DWARF stack collection? I couldn’t find out because I don’t have enough space for the expected 120 GB perf.data file at the next size up.
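As a rough sanity check on that estimate, the sketch below looks at how much the file grew at each doubling of the stack size and extrapolates one step further. The extrapolation is only a guess under the assumption that growth continues at somewhere between the smallest observed factor and a straight doubling; it is not a measurement.

```python
# (stack size in bytes, perf.data size in MB), copied from the table above.
DWARF_RUNS = [(4096, 10_104), (8192, 18_733), (16384, 35_149), (32768, 57_590)]


def growth_factors(runs):
    """Return the size ratio between each run and the one before it."""
    return [later / earlier for (_, earlier), (_, later) in zip(runs, runs[1:])]


factors = growth_factors(DWARF_RUNS)
last_mb = DWARF_RUNS[-1][1]

# Hypothetical range for a 64K run: slowest observed growth up to a full doubling.
low_estimate_mb = last_mb * min(factors)
high_estimate_mb = last_mb * 2

print("growth per doubling:", [f"{f:.2f}x" for f in factors])
print(f"64K estimate: {low_estimate_mb / 1000:.0f} to {high_estimate_mb / 1000:.0f} GB")
```

The file grows by a bit less than 2x per doubling, so a 64K run would plausibly land somewhere around 100 GB, in the same region as the figure quoted above.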

Let’s have a look at one thing which DWARF can do: show inlined and leaf functions. The stack trace for these is more accurate, as you can see below. (To reproduce, zoom in on the nbd_poll function.) On the left, frame pointers. On the right, DWARF with 32K stacks, showing the extra enter_* frames which are inlined.

My final summary here is that for most purposes you would be better off using frame pointers, and it’s a good thing that Fedora 38 now compiles everything with frame pointers. It should result in easier performance analysis, and even makes continuous performance analysis more plausible.

Filed under Uncategorized

Raspberry Pi 4 running Fedora 32

I got Fedora 32 installed on an RPi 4 8GB, booting off USB, with UEFI and ACPI. I followed Robert Grimm’s instructions here, and had an additional set of complications summarised here. There’s not much to say except that it was fiendishly complicated. But it works beautifully now, and is reasonably quick too, especially when you consider how little it cost.

So let’s talk about costs (all include tax and delivery):

Raspberry Pi 4 8GB                  £77.33
SanDisk 500GB SSD x 2               £149.98
small SD card needed for booting    free

Only one of the SSDs is actually used, but if you follow Robert’s instructions you will need two. I didn’t have any external USB SSDs that were both USB 3 and not spinning hard disks, so I had to buy these, but I’ll be able to reuse one in a future project. The SD card is required to work around a bug in the UEFI firmware, but I happened to have one lying around.



OCaml RISC-V port is now upstream!


We’ve been using this patch in Fedora since Nov 2016.


Fedora/RISC-V now mirrored as a Fedora “alternative” architecture

https://dl.fedoraproject.org/pub/alt/risc-v/repo/fedora/29/latest/. These packages now get mirrored further by the Fedora mirror system, e.g. to https://mirror.math.princeton.edu/pub/alt/risc-v/repo/fedora/29/latest/

If you grab the latest nightly Fedora builds you can get the mirrors by editing the /etc/yum.repos.d/*.repo file.

Also we got some additional help so we now have loads more build hosts! These were provided by Facebook with hosting by Oregon State University Open Source Lab (see cfarm), so thanks to them.

Thanks to David Abdurachmanov and Laurent Guerby for doing all the work (I did nothing).



Run Fedora RISC-V with X11 GUI in your browser


Running without the GUI is a bit faster: https://bellard.org/jslinux/vm.html?cpu=riscv64&url=https://bellard.org/jslinux/fedora29-riscv-2.cfg&mem=256


Fedora/RISC-V nightly builds

Thanks to David Abdurachmanov for doing the hard work of making Fedora/RISC-V nightly Fedora 29 builds available. To learn how you can boot and play with these in qemu on x86, see this page.


My talk from the RISC-V workshop in Barcelona


“RISCY BUSINESS” runs Fedora in a chroot on HiFive Unleashed

Note you can now run Fedora directly, see the instructions here:





RISC-V 8th Workshop Agenda

The RISC-V 8th Workshop is happening in Barcelona next month and the agenda and speakers have been announced:


David Abdurachmanov and I are giving a short talk about Fedora on RISC-V at 4pm on Tuesday 8th May.


HiFive Unleashed cpuinfo and dmesg

hart	: 1
isa	: rv64imafdc
mmu	: sv39
uarch	: sifive,rocket0

hart	: 2
isa	: rv64imafdc
mmu	: sv39
uarch	: sifive,rocket0

hart	: 3
isa	: rv64imafdc
mmu	: sv39
uarch	: sifive,rocket0

hart	: 4
isa	: rv64imafdc
mmu	: sv39
uarch	: sifive,rocket0

Kernel boot messages after the fold.