Extending BPF to allow memory access into BPF map entries

#include <uapi/linux/ptrace.h>

#define MAX_STACKS 16

struct val_t {
    unsigned long refs;
    unsigned get_iter;
    unsigned put_iter;
    int get_stackids[MAX_STACKS];
    int put_stackids[MAX_STACKS];
};

BPF_HASH(leakinfo, unsigned long, struct val_t);
BPF_STACK_TRACE(stack_traces, 1024);

static int
__trace_eb_get(struct pt_regs *ctx, unsigned long addr)
{
    struct val_t *val;
    int old[MAX_STACKS];
    unsigned index = 0;

    val = leakinfo.lookup(&addr);
    if (!val) {
        return 0;
    }

    if (val->get_iter >= MAX_STACKS)
        val->get_iter = index;
    index = val->get_iter++;
    val->get_stackids[index] =
        stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
    return 0;
}

This is part of the BPF script I wanted to use to find an extent_buffer leak in btrfs. I wrote this script a week ago. Since then I’ve been hacking on BPF to make this script actually run.

BPF is special: the kernel has to verify that the program you are telling it to run is definitely going to terminate, and that it isn't going to do anything stupid that will crash the box. All of this is contained within kernel/bpf/verifier.c. With BPF you aren't allowed to use loops, because the verifier can't prove that you will ever break out of the loop. The other thing you are not allowed to do, as I discovered when I tried to run my script, is access arbitrary memory. Something like

val->refs++;

is completely fine: we know where that offset is in struct val_t, and we can verify that the access to ->refs falls within the known size of struct val_t. The BPF ASM code for this access is very simple:

7: (85) call 1
8: (bf) r7 = r0
9: (15) if r7 == 0x0 goto pc+23
 R0=map_value_or_null(ks=8,vs=144) R6=ctx R7=map_value(ks=8,vs=144) R10=fp
10: (79) r1 = *(u64 *)(r7 +0)
11: (07) r1 += 1
12: (7b) *(u64 *)(r7 +0) = r1
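
Conceptually, the check the verifier performs for this kind of fixed-offset access boils down to something like the following (a simplified model of the idea, not the actual code in kernel/bpf/verifier.c):

/* Simplified model: the offset is a compile-time constant, so the only
 * question is whether [off, off + size) fits inside the map value. */
static int map_value_access_ok(int off, int size, int value_size)
{
    return off >= 0 && off + size <= value_size;
}

/* For the access above: off = 0, size = 8, value_size = 144 -> ok. */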

This access is allowed, because we know that r7 hasn’t been messed with, and we know that +0 (the offset of refs) falls within the object size, which is 144. But say down here where we do

val->get_stackids[index] =
    stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);

Well, this is a little trickier; let's look at the BPF assembly for this part:

29: (85) call 27
30: (67) r8 <<= 2
31: (0f) r7 += r8
32: (63) *(u32 *)(r7 +16) = r0

And this is where the problem lies. We are taking r8 and adding it to r7. Suddenly the verifier can't know for sure what the value of r7 is, so any access through r7 will be rejected as unsafe. This is completely reasonable; we all know what kind of havoc out-of-bounds array accesses can wreak. But being able to index into a map value safely is still a feature that would be extremely useful.

You can find the preliminary fix here:

[PATCH] Bpf: allow access into map value arrays

Now this patch will require some changes (I always forget to use div64_u64), but the overall approach is sound. The solution comes down to figuring out the range of values a particular register could have. After all, we do the correct thing in our code:

if (val->get_iter >= MAX_STACKS)
    val->get_iter = index;
index = val->get_iter++;

There is no way for index to go past our array. Here is the full BPF ASM that got generated for this code (with my patch applied)

16: (61) r2 = *(u32 *)(r0 +8)
17: (b7) r1 = 0
18: (b7) r4 = 15
19: (b7) r3 = 0
20: (2d) if r2 > r4 goto pc+2
 R0=map_value(ks=8,vs=144),min_value=0,max_value=0 R1=imm0,min_value=0,max_value=0 R2=inv,max_value=15 R3=imm0,min_value=0,max_value=0 R4=imm15,min_value=15,max_value=15 R6=inv R7=inv R10=fp
21: (bf) r3 = r2
22: (07) r3 += 1
23: (2d) if r2 > r4 goto pc+1
 R0=map_value(ks=8,vs=144),min_value=0,max_value=0 R1=imm0,min_value=0,max_value=0 R2=inv,max_value=15 R3=inv,max_value=16 R4=imm15,min_value=15,max_value=15 R6=inv R7=inv R10=fp
24: (bf) r1 = r2
25: (63) *(u32 *)(r0 +8) = r3
26: (67) r1 <<= 2
27: (0f) r0 += r1
28: (63) *(u32 *)(r0 +16) = r6
 R0=map_value_adj(ks=8,vs=144),min_value=0,max_value=60 R1=inv,max_value=60 R2=inv,max_value=15 R3=inv,max_value=16 R4=imm15,min_value=15,max_value=15 R6=inv R7=inv R10=fp
29: (b7) r0 = 0

The way the verifier works is that it first goes through every instruction, creating a new state for every JMP (indicated by the if statements here), and then verifies those states individually. The above is from the first pass through, which means every if statement evaluates to false. Instruction 28 is where we do our val->get_stackids[index] bit, and as you can see it has min_value=0 and max_value=60, so we can verify that our access into that array is safe and allow the program to load.

Looking at instruction 27 we see that r1 is the register we use to get the offset into the array. At instruction 17 r1 is initialized to 0, so we have min_value = max_value = 0. But this is the all-false case, which means at instruction 24 r1 is assigned from r2, so we need to look at how r2 is treated. r2 is loaded from memory, so off the bat we know nothing about this register or its values. We get to instruction 20 and see the comparison if r2 > r4 goto pc+2, and since this is false we know that r2 <= r4. Since r4 simply holds a constant value, and we know that constant is 15, we know the maximum value of r2 is 15. This follows all the way down, and once we get to the memory access we know everything is good to go.
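
To make that concrete, here is a toy model of the bookkeeping my patch does (a sketch of the idea, not the actual verifier code): every register carries a min/max estimate, a conditional jump narrows that estimate on the branch where the outcome is known, and the final access check uses the scaled bounds.

/* Toy model of the range tracking, not the real verifier structures. */
struct reg_bounds {
    long long min_value;
    long long max_value;
};

/* Fall-through (false) branch of "if reg > K goto ...": reg <= K. */
static void narrow_on_jgt_false(struct reg_bounds *reg, long long k)
{
    if (reg->max_value > k)
        reg->max_value = k;   /* e.g. r2 > 15 was false, so r2 <= 15 */
}

/* The access at instruction 28: the index is shifted left by 2, added to
 * the map value pointer, and then written at offset +16 with size 4. */
static int scaled_access_ok(struct reg_bounds *idx, int shift, int off,
                            int size, int value_size)
{
    return idx->min_value >= 0 &&
           (idx->max_value << shift) + off + size <= value_size;
}

/* With min=0, max=15, shift=2, off=16, size=4: 60 + 16 + 4 <= 144 -> ok. */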


This is all fine and good, but what do most people (including myself, originally) do? Why, they do this:

int index = 0;
if (val->get_iter >= MAX_STACKS)
    val->get_iter = index;
index = val->get_iter++;

“So?” you ask. Well, now index is signed. No longer can we assume the floor is 0; now we have to assume the floor could be some negative number. The code as it stands is unverifiable, since we have no idea what the floor would be. BPF with my current patch would kick this code out, and the developer would be very confused. That is unfortunate, and I can see no way around it. What would work is something like the following:

int index = 0;
index = val->get_iter % MAX_STACKS;

From here we know that the value will be between 0 and MAX_STACKS-1, and everything is ok. But not being able to do it the other way may be confusing to developers. We are still discussing this at the moment, but for right now it seems like the best way to do array accesses in BPF with my patch will be to always use unsigned variables.
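
Putting the two recommendations together, a version of the original snippet that gives the verifier both a floor and a ceiling would look something like this (a sketch assuming my patch is applied):

unsigned index;

/* Unsigned index: the floor is provably 0.  The modulo (a power of two,
 * so really just a mask) gives a ceiling of MAX_STACKS - 1. */
index = val->get_iter % MAX_STACKS;
val->get_iter++;
val->get_stackids[index] =
    stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);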

Removing the btrfs btree_inode

Wait what?

Btrfs is unique in that we have unbounded dirty metadata.  Xfs and Ext3/4 are bounded by their journal size, which means they can only ever have as much dirty metadata in cache as they have room in the journal to write it out.  That sounds like a drawback, but it's actually a pretty great advantage: it's a huge pain trying to make sure you balance how often you write out dirty metadata versus how much you keep in cache.

Think of normal writeback.  You have some process downloading a huge file, dirtying lots of memory.  We have balance_dirty_pages(), which makes sure this process never exceeds the global dirty limits.  We have this in place because you can't just evict dirty memory; it sits there until it has been written to disk, so it's a serious system liability.  With too much dirty memory, normal processes that just want to malloc() crawl to a halt while we go write back that dirty memory so we can satisfy memory allocations.

So traditionally Btrfs has had a dummy inode, the btree_inode, for every mounted file system that we allocate our metadata pages from.  This lets us be limited by the ye-olde system dirty limits and takes the hard work of making sure we don't OOM the box out of our hands.  But this thing has got to go.

Sub-pagesize blocksizes

We want to support sub-pagesize blocksizes.  Why, you ask?  Because different systems have different pagesizes.  Most of the computers we use have a 4k pagesize, but that's not universal; PPC has a 64k pagesize.  Today, if we format a file system on a 4k pagesize machine, we can't migrate it to a machine with a 64k pagesize.  Of course this doesn't happen often, which is why we've made it this far without that support, but it would still be nice to have.

The patches for this work have been in circulation for a few years now, and every time I look at them I'm not happy with the hoops we have to jump through for our metadata.  Our metadata pages have a lot of management built around them; using the pagecache is actually kind of a pain because we have to tie the page eviction machinery into our extent_buffer framework, and that can get tricky.  All we really want are pages and a way to keep the dirty limits enforced.

So what do we need?

Not much, actually.  The extent_buffer abstraction we use everywhere means we don't really use the btree_inode for much.  So we just need a way to tell the system how much dirty metadata we have, and give the system a way to tell us to write stuff out.

Enter these initial patches



Writeback is controlled on a per-bdi (backing device info) basis, which in the case of btrfs is fs-wide.  We also have global page counters to enforce global limits.  All we need to do is provide an API for the file system to notify the system of its dirty metadata pages.

Then we provide a mechanism for writeback to call into the file system to tell us it's time to write stuff out, along with a count of how much to write.  Btrfs can do all of this with relative ease.
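
To give a feel for the shape of it, here is a rough sketch of what such an interface could look like (the names below are made up for illustration; they are not the API from the actual patches):

/* Hypothetical sketch only; not the real API from the patch series. */
struct backing_dev_info;
struct super_block;

/* The file system tells the system how much dirty metadata it has, so
 * the per-bdi and global dirty limits can account for it like any other
 * dirty pages. */
void metadata_account_dirtied(struct backing_dev_info *bdi, long nr_pages);
void metadata_account_cleaned(struct backing_dev_info *bdi, long nr_pages);

/* Writeback calls back into the file system when it is time to write
 * stuff out, telling it how much to write. */
struct metadata_writeback_ops {
    long (*write_metadata)(struct super_block *sb, long nr_to_write);
};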

Next steps

Next I plan to actually implement the btrfs side of this.  There is a lot of weird prep work that has to happen in order to actually remove the btree_inode, and unfortunately it's almost impossible to break a lot of this up.  There may be one or two prep patches that change certain btrfs-specific APIs to allow this to happen, then one giant patch where I remove all the old stuff and add the new stuff.  I wish I could have one patch to remove and one to add, but if people bisected across those patches they'd end up with things just not working at all.  Once this work is in place, implementing sub-pagesize blocksizes for our metadata will be a lot simpler.