Removing the btrfs btree_inode

Wait what?

Btrfs is unique in that we have unbounded dirty metadata.  XFS and ext3/4 are bounded by their journal size, which means they can only ever have as much dirty metadata in cache as they have room in the journal to write it out.  This sounds like a drawback, but it’s actually a pretty great advantage: it’s a huge pain trying to balance how often you write out dirty metadata versus how long you keep it in cache.

Think of normal writeback.  You have some process downloading a huge file, and so dirtying lots of memory.  We have balance_dirty_pages(), which makes sure this process never exceeds the global dirty limits.  We have this in place because you can’t just evict dirty memory; it sits there until it has been written to disk, so it’s a serious system liability.  With too much dirty memory, normal processes that just want to malloc() crawl to a halt while we go write back that dirty memory so we can satisfy their memory allocations.
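To make the throttling idea concrete, here is a toy model of a dirty limit.  It is only a sketch: the class and method names are made up for illustration, and the real balance_dirty_pages() is far more involved (it paces writers and relies on background flusher threads rather than making the writer flush one page at a time).  All it shows is the invariant that matters here: a writer can never push the dirty count past the limit.

```python
class DirtyThrottle:
    """Toy model of a global dirty limit; NOT the kernel's actual algorithm."""

    def __init__(self, dirty_limit_pages):
        self.dirty_limit_pages = dirty_limit_pages
        self.dirty_pages = 0      # pages dirtied but not yet on disk
        self.written_back = 0     # pages the writer was forced to wait for

    def dirty(self, nr_pages):
        """Dirty nr_pages, flushing first whenever we are at the limit."""
        for _ in range(nr_pages):
            if self.dirty_pages >= self.dirty_limit_pages:
                # The writer is throttled: it cannot dirty another page
                # until one has been cleaned.
                self._writeback(1)
            self.dirty_pages += 1

    def _writeback(self, nr_pages):
        flushed = min(nr_pages, self.dirty_pages)
        self.dirty_pages -= flushed
        self.written_back += flushed


# A "download" that dirties 250 pages against a 100 page limit.
throttle = DirtyThrottle(dirty_limit_pages=100)
throttle.dirty(250)
print(throttle.dirty_pages, throttle.written_back)
```

The first 100 pages are dirtied for free; every page after that costs the writer one page of writeback, which is exactly the back-pressure that keeps a single process from filling memory with pages that cannot be evicted.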

So traditionally Btrfs has had a dummy inode, the btree_inode, for every mounted file system that we allocate our metadata pages from.  This allows us to be limited by the ye-olde system dirty limits and takes the hard work of making sure we don’t OOM the box out of our hands.  But this thing has got to go.

Sub-pagesize blocksizes

We want to support sub-pagesize blocksizes.  Why, you ask?  Well, because different systems have different pagesizes.  Most of the computers we use have a 4k pagesize, but that’s not universal.  PPC has 64k pagesizes.  If we want to take a file system that was formatted on a 4k pagesize machine and migrate it to a machine with 64k pagesizes, we can’t.  Now of course this doesn’t happen often, which is why we’ve made it this far without that support, but it would still be nice to have.
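The arithmetic behind the problem is simple enough to spell out.  The helper names below are hypothetical and the sizes are the ones mentioned above; this is just a sketch of why a 4k block is a non-issue on a 4k page machine and awkward on a 64k page machine.

```python
def blocks_per_page(page_size, block_size):
    """How many file system blocks share one page of memory."""
    if block_size <= 0 or page_size % block_size != 0:
        raise ValueError("block size must evenly divide the page size")
    return page_size // block_size


def is_sub_pagesize(page_size, block_size):
    """True when a block is smaller than the page that caches it."""
    return block_size < page_size


FOUR_K = 4 * 1024
SIXTY_FOUR_K = 64 * 1024

# Formatted and mounted on a 4k pagesize machine: one block per page.
print(blocks_per_page(FOUR_K, FOUR_K), is_sub_pagesize(FOUR_K, FOUR_K))

# Same file system moved to a 64k pagesize machine: many blocks per page.
print(blocks_per_page(SIXTY_FOUR_K, FOUR_K), is_sub_pagesize(SIXTY_FOUR_K, FOUR_K))
```

Once several blocks share a page, anything tracked per page (dirty, uptodate, under writeback) has to be tracked per block instead, and that is where the hoops come from.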

The patches for this work have been in circulation for a few years now, and every time I look at them I’m not happy with the hoops we have to jump through for our metadata.  Our metadata pages have a lot of management built around them.  Using the pagecache is actually kind of a pain, because we have to tie the page eviction stuff into our extent_buffer framework, and that can get tricky.  All we really want are pages and a way to keep the dirty limits enforced.

So what do we need?

Not much actually.  The extent_buffer abstraction we use everywhere actually means we don’t really use the btree_inode for much.  So we just need a way to tell the system how much dirty metadata we have, and give the system a way to tell us to write stuff out.

Enter these initial patches

https://patchwork.kernel.org/patch/9272081/

https://patchwork.kernel.org/patch/9272085/

Writeback is controlled on a per-bdi (backing device info) basis, which in the case of btrfs is fs wide.  We also have global page counters to enforce global limits.  All we need to do is provide an API for the file system to notify the system of its dirty metadata pages.

Then we provide a mechanism for writeback to call into the file system to tell us it’s time to write stuff out, and give us a counter of how much stuff to write out.  Btrfs can do all of this with relative ease.
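The two halves of that contract can be sketched as a toy model.  To be clear, none of the names below come from the patches: BackingDevice, account_metadata_dirtied(), account_metadata_written() and write_dirty_metadata() are all hypothetical, as is counting in bytes and the 16k buffer size.  The sketch only shows the shape of the interaction: the file system reports what it dirties and cleans, and writeback calls back with an amount it wants written.

```python
class BackingDevice:
    """Toy stand-in for a bdi: tracks dirty metadata, asks the fs to flush."""

    def __init__(self, write_metadata):
        self.dirty_metadata_bytes = 0
        self.write_metadata = write_metadata  # callback supplied by the fs

    def account_metadata_dirtied(self, nr_bytes):
        self.dirty_metadata_bytes += nr_bytes

    def account_metadata_written(self, nr_bytes):
        self.dirty_metadata_bytes -= nr_bytes

    def writeback(self, nr_bytes_wanted):
        """Tell the fs it is time to write, and roughly how much."""
        return self.write_metadata(nr_bytes_wanted)


class ToyFilesystem:
    """Keeps its own list of dirty buffers instead of a dummy inode."""

    def __init__(self, buffer_size=16 * 1024):
        self.buffer_size = buffer_size
        self.dirty_buffers = []
        self.bdi = BackingDevice(self.write_dirty_metadata)

    def dirty_buffer(self, start):
        self.dirty_buffers.append(start)
        self.bdi.account_metadata_dirtied(self.buffer_size)

    def write_dirty_metadata(self, nr_bytes_wanted):
        written = 0
        # Write whole buffers, oldest first, until the request is met.
        while self.dirty_buffers and written < nr_bytes_wanted:
            self.dirty_buffers.pop(0)
            written += self.buffer_size
            self.bdi.account_metadata_written(self.buffer_size)
        return written


fs = ToyFilesystem()
for start in range(4):
    fs.dirty_buffer(start)
dirty_before = fs.bdi.dirty_metadata_bytes
written = fs.bdi.writeback(20000)
print(dirty_before, written, fs.bdi.dirty_metadata_bytes)
```

Note that nothing in the model needs an inode or the pagecache: the file system owns its buffers and the system only sees counters and a callback.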

Next steps

Next I plan to actually implement the btrfs side of this.  There is a lot of weird prep work that has to happen in order to actually remove the btree_inode, and unfortunately it’s almost impossible to break a lot of this up.  There may be one or two prep patches that change certain btrfs-specific APIs to allow this to happen, then one giant patch where I remove all the old stuff and add the new stuff.  I wish I could have one patch to remove and one to add, but if people bisected across those patches they’d end up with things just not working at all.  Once this work is in place, it will make implementing sub-pagesize blocksizes a lot simpler for our metadata.