How to set up a BcacheFS Filesystem, and what is it?

You might’ve heard of BcacheFS before without being entirely sure what it is, or why it is suddenly showing up everywhere. Or you might already know all about it, but aren’t sure exactly how to set it up. If either of those applies to you, then you’ve come to the right place!

What the hecc is BcacheFS?

The basics

BcacheFS is a copy-on-write filesystem built on B-trees, somewhat like BTRFS or ZFS, but with a bunch of extra features not seen in these filesystems, such as storage tiering.

Let’s first start with the parts where they are similar:

B-tree filesystems

If you’re not aware what makes BTRFS and BcacheFS stand out, a big part of it is their B-tree layout. (ZFS organizes its data in a different kind of block tree, but it shares the copy-on-write design covered in the next section.) A B-tree filesystem is a type of filesystem that uses a B-tree data structure to organize and manage data on storage devices like hard drives or solid-state drives (SSDs). The B-tree is a self-balancing, hierarchical data structure that keeps its keys sorted and allows for efficient insertion, deletion, and retrieval of data. In the context of filesystems, B-trees are used to index and manage the allocation of data blocks and metadata. B-tree filesystems are known for their scalability and ability to handle large amounts of data efficiently. They are designed to provide features like snapshotting, data integrity checks, and support for advanced storage management operations.
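To make the “efficient retrieval” part a bit more concrete, here is a minimal sketch of a B-tree lookup. It is a toy, not how BcacheFS lays out its nodes: the Node class, the lookup function and the “block NNN” values are all made up for illustration, and the tree is built by hand rather than by a real insert routine.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    """One B-tree node: sorted keys, plus children when it is not a leaf."""
    keys: list
    values: list = field(default_factory=list)    # used by leaf nodes only
    children: list = field(default_factory=list)  # used by interior nodes only


def lookup(node, key):
    """Walk from the root down to a leaf, picking one child per level."""
    while node.children:
        # Interior node: keys[i] is the smallest key stored under children[i + 1].
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


# Hypothetical index: key -> where the data lives on disk.
root = Node(
    keys=[7, 12],
    children=[
        Node(keys=[1, 4], values=["block 100", "block 104"]),
        Node(keys=[7, 9], values=["block 220", "block 231"]),
        Node(keys=[12, 20], values=["block 305", "block 317"]),
    ],
)

print(lookup(root, 9))  # block 231
print(lookup(root, 8))  # None
```

Each level needs only one binary search, so even a very large index is resolved in a handful of node reads.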

Copy-On-Write (COW) filesystems

A Copy-On-Write (COW) filesystem is a type of filesystem that implements a data storage strategy where data is not overwritten directly when it is modified but is instead copied to a new location. This strategy ensures data integrity and allows for efficient snapshots and versioning.

Here’s how it works and its connection to B-tree filesystems in these cases:

Initial Data: In a COW filesystem, when you create a file or modify an existing one, the original data is not immediately overwritten.
Copy Instead of Overwrite: Instead of overwriting the original data in place, the COW filesystem creates a new copy of the modified data in a different location on the storage device.
Metadata Updates: The filesystem updates metadata structures (like B-trees in the case of B-tree filesystems) to point to the new location of the modified data.
Atomic Operations: This process is designed to be atomic, meaning that a change either takes effect completely or not at all. If something goes wrong during the write operation, the original data remains intact, ensuring data consistency.
Snapshots and Versioning: Because the old data is preserved, COW filesystems make it easy to create snapshots or versions of the filesystem at different points in time. These snapshots are essentially references to the filesystem state at specific moments, achieved by maintaining pointers to the data as it existed when each snapshot was created.

The connection to B-tree filesystems lies in how they manage the metadata and pointers to data blocks. B-trees are well-suited for COW filesystems because they provide an efficient way to manage and update these pointers. When a change is made to a file or directory in a COW filesystem, the B-tree is updated to reflect the new location of the modified data, ensuring that the filesystem remains consistent and that snapshots can be efficiently created.
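The steps above can be sketched in a few lines. This is a toy model under loud assumptions: the “disk” is an append-only Python list, the metadata is a plain dict instead of a B-tree, and the CowStore name and its methods are invented for this example.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = []     # append-only "disk"
        self.live = {}       # file name -> index of its current block
        self.snapshots = {}  # snapshot label -> frozen copy of the pointer map

    def write(self, name, data):
        # 1. Write the new copy to a fresh location.
        self.blocks.append(data)
        # 2. Build the updated pointer map and swap it in with one assignment.
        updated = dict(self.live)
        updated[name] = len(self.blocks) - 1
        self.live = updated

    def snapshot(self, label):
        # A snapshot copies pointers only; no data blocks are duplicated.
        self.snapshots[label] = dict(self.live)

    def read(self, name, snapshot=None):
        pointers = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[pointers[name]]


store = CowStore()
store.write("notes.txt", b"version 1")
store.snapshot("before-edit")
store.write("notes.txt", b"version 2")

print(store.read("notes.txt"))                          # b'version 2'
print(store.read("notes.txt", snapshot="before-edit"))  # b'version 1'
```

Because the old block is still on the “disk”, the snapshot costs one small pointer map rather than a full copy of the data.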

But what makes BcacheFS different?

I’m glad you asked! The primary thing that comes to mind is storage tiering, but it also implements native encryption and (tiered) compression, and it can work with devices of differing sizes and performance, which makes it much more flexible than RAID arrays or even BTRFS/ZFS.

Storage Tiering

BcacheFS is a filesystem that incorporates storage tiering as one of its features. Storage tiering is a technique used to manage data across different storage devices based on their performance characteristics and usage patterns. Here’s how BcacheFS’s storage tiering mechanism works:

Caching and Extents: BcacheFS uses a concept called “extents” to represent data. An extent can have multiple copies stored on different storage devices.
Cached Copies: Some copies of an extent can be marked as “cached.” This means that these copies are stored on a faster and potentially more expensive storage device, often referred to as a cache device. Cached data is typically used to accelerate read operations since accessing data from a faster storage tier is quicker.
Bucket Management: BcacheFS organizes these extents into “buckets.” Buckets containing only cached data are managed in a way that allows them to be discarded as needed in Least Recently Used (LRU) order. This means that when space is needed on the cache device, the least recently used cached data will be removed to make room for new data.
Background Target Option: BcacheFS provides options to move data between storage devices in the background. When you use the “background target” option, the original copy of the data remains in place but is marked as cached. This effectively allows BcacheFS to keep a cached copy of the data on a faster device while also having a copy on a slower device.
Promote Target Option: In contrast, when you use the “promote target” option, data is copied to the target device when it is read: the original copy remains unchanged, and the new copy on the target device is marked as cached. This keeps frequently read data available on the faster tier.
Writeback and Writearound Caching: BcacheFS allows you to configure different caching strategies:
- Writeback Caching: You can set the “foreground target” and “promote target” to the cache device, and the “background target” to the backing (slower) device. This configuration accelerates write operations by writing data to the cache device first, which is faster, and later moving it to the backing device in the background.
- Writearound Caching: Alternatively, you can set the “foreground target” to the backing device and the “promote target” to the cache device. In this configuration writes go straight to the backing device, bypassing the cache, and the cache device only holds copies of data that were promoted when they were read.
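As a quick reference, the two configurations above boil down to which group each target option points at. The sketch below just encodes that mapping; the cache_targets and as_flags helpers are hypothetical, and the group names ssd and hdd are placeholders for whatever labels you give your devices.

```python
def cache_targets(mode, cache="ssd", backing="hdd"):
    """Return the target options for a caching mode as an ordered dict."""
    if mode == "writeback":
        # Writes land on the cache first and are moved to the backing tier later.
        return {
            "foreground_target": cache,
            "promote_target": cache,
            "background_target": backing,
        }
    if mode == "writearound":
        # Writes go to the backing tier; only reads populate the cache.
        return {
            "foreground_target": backing,
            "promote_target": cache,
        }
    raise ValueError(f"unknown caching mode: {mode!r}")


def as_flags(targets):
    """Render the options in the --name=value form used on the command line."""
    return [f"--{name}={group}" for name, group in targets.items()]


print(as_flags(cache_targets("writeback")))
print(as_flags(cache_targets("writearound")))
```

Nothing here talks to a real filesystem; it only produces the option strings so the difference between the two modes is easy to see side by side.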

BcacheFS’s storage tiering mechanism lets you manage data across different storage devices by caching copies of data on faster devices and moving data between tiers according to your chosen caching strategy, whether that’s read acceleration or faster writes. This flexibility is great to have when you want more knobs for tuning performance and efficiency. ZFS has some caching methods of its own, like a separate intent log (SLOG) and an L2 Adaptive Replacement Cache (L2ARC), but BcacheFS’s method allows you to define more tiers and also assign e.g. compression to the background tier, giving you both more efficient storage and more performant caching.
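The LRU discarding of cached buckets mentioned under bucket management is easy to picture with a small model. This is only a sketch: the CacheDevice class is invented, capacity is counted in whole buckets, and real bucket bookkeeping is certainly more involved.

```python
from collections import OrderedDict


class CacheDevice:
    """Toy cache tier that discards cached buckets in LRU order."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of buckets that fit on the device
        self.buckets = OrderedDict()  # bucket id -> data, least recently used first

    def read(self, bucket_id):
        data = self.buckets.get(bucket_id)
        if data is not None:
            self.buckets.move_to_end(bucket_id)  # now the most recently used
        return data

    def promote(self, bucket_id, data):
        """Add a cached copy and return the ids of buckets that were discarded."""
        self.buckets[bucket_id] = data
        self.buckets.move_to_end(bucket_id)
        discarded = []
        while len(self.buckets) > self.capacity:
            old_id, _ = self.buckets.popitem(last=False)  # drop the LRU bucket
            discarded.append(old_id)
        return discarded


cache = CacheDevice(capacity=2)
cache.promote(1, b"one")
cache.promote(2, b"two")
cache.read(1)                      # bucket 1 is now more recent than bucket 2
print(cache.promote(3, b"three"))  # [2]
```

Discarding is cheap here for the same reason it is in the description above: the bucket only held cached copies, so the data still exists on the slower tier.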

Other differences with BTRFS/ZFS

Besides storage tiering, native encryption, and tiered compression, BcacheFS offers several unique features and advantages that make it stand out from other filesystems like BTRFS and ZFS:

Multidevice Support: BcacheFS is designed to work with storage devices of differing sizes and performance. This flexibility allows you to create storage setups that are more tailored to your specific needs, making it a versatile choice for managing data across a range of hardware configurations.
Writeback and Writearound Caching: As mentioned earlier, BcacheFS provides the ability to configure different caching strategies for read and write operations, enabling you to optimize performance based on your workload requirements.
Tiered Compression: In addition to storage tiering, BcacheFS supports tiered compression. This means you can choose to compress data at different levels based on your storage tiers, optimizing both space efficiency and performance.
Flexible Scaling: BcacheFS is designed to be scalable, allowing you to add or remove devices from your storage configuration as needed, making it suitable for both small and large-scale storage solutions.
Advanced Allocator: The allocator in BcacheFS is designed to be efficient and optimized for modern hardware, ensuring that data is allocated and managed effectively on your storage devices.

Erasure Coding

Like BTRFS/ZFS and RAID5/6, BcacheFS supports erasure coding. However, it implements it a little differently from the aforementioned ones, avoiding the ‘write hole’ entirely. It currently carries a slight performance penalty, because the allocator has not yet been tweaked to make bucket reuse possible in these scenarios, but it seems to be functional. Here’s the manual’s take on it:

2.2.2 Erasure coding
bcachefs also supports Reed-Solomon erasure coding (the same algorithm used
by most RAID5/6 implementations). When enabled with the ec option, the
desired redundancy is taken from the data replicas option - erasure coding of
metadata is not supported.

Erasure coding works significantly differently from both conventional RAID
implementations and other filesystems with similar features. In conventional
RAID, the “write hole” is a significant problem - doing a small write within a
stripe requires the P and Q (recovery) blocks to be updated as well, and since
those writes cannot be done atomically there is a window where the P and Q
blocks are inconsistent - meaning that if the system crashes and recovers with
a drive missing, reconstruct reads for unrelated data within that stripe will be
corrupted.

ZFS avoids this by fragmenting individual writes so that every write becomes
a new stripe - this works, but the fragmentation has a negative effect on
performance: metadata becomes bigger, and both read and write requests are
excessively fragmented. Btrfs’s erasure coding implementation is more
conventional, and still subject to the write hole problem.

bcachefs’s erasure coding takes advantage of our copy on write nature -
since updating stripes in place is a problem, we simply don’t do that. And since
excessively small stripes is a problem for fragmentation, we don’t erasure code
individual extents, we erasure code entire buckets - taking advantage of bucket
based allocation and copying garbage collection.

When erasure coding is enabled, writes are initially replicated, but one of
the replicas is allocated from a bucket that is queued up to be part of a new
stripe. When we finish filling up the new stripe, we write out the P and Q
buckets and then drop the extra replicas for all the data within that stripe - the
effect is similar to full data journalling, and it means that after erasure coding
is done the layout of our data on disk is ideal.

Since disks have write caches that are only flushed when we issue a cache
flush command - which we only do on journal commit - if we can tweak the
allocator so that the buckets used for the extra replicas are reused (and then
overwritten again) immediately, this full data journalling should have negligible
overhead - this optimization is not implemented yet, however.
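The “replicate first, stripe later” flow from the manual can be modelled in a few lines. Big caveats: this toy uses plain XOR parity (a single P block) instead of Reed-Solomon with P and Q, and the StripeBuilder class, xor_blocks and rebuild are names made up for this sketch.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte strings together (stand-in for the P block)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class StripeBuilder:
    def __init__(self, width):
        self.width = width        # data buckets per stripe
        self.pending = []         # buckets queued for the stripe being filled
        self.extra_replicas = []  # temporary second copies of pending buckets
        self.stripes = []         # finished (data buckets, parity) pairs

    def write(self, bucket):
        # Until the stripe is complete, the data is protected by replication.
        self.pending.append(bucket)
        self.extra_replicas.append(bucket)
        if len(self.pending) == self.width:
            parity = xor_blocks(self.pending)
            self.stripes.append((list(self.pending), parity))
            self.pending.clear()
            self.extra_replicas.clear()  # parity now protects the data instead


def rebuild(stripe, lost_index):
    """Reconstruct one lost data bucket from the survivors plus parity."""
    data, parity = stripe
    survivors = [b for i, b in enumerate(data) if i != lost_index]
    return xor_blocks(survivors + [parity])


builder = StripeBuilder(width=3)
for bucket in (b"aaaa", b"bbbb", b"cccc"):
    builder.write(bucket)

print(builder.extra_replicas)          # []
print(rebuild(builder.stripes[0], 1))  # b'bbbb'
```

Note that no existing stripe is ever updated in place, which is exactly how the write hole is sidestepped.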

Things in common with BTRFS/ZFS

Of course it also brings a lot of the good stuff that the aforementioned filesystems also have, such as:

Block Deduplication: BcacheFS includes support for block-level deduplication, which means it can identify and eliminate duplicate data blocks on your storage devices. This can save storage space and reduce redundancy in your filesystem.
Native Encryption: BcacheFS offers native encryption support, so your data is protected at rest on the disks.
Snapshot Management: BcacheFS provides robust snapshot functionality, allowing you to create point-in-time copies of your filesystem.
Data Integrity: BcacheFS places a strong emphasis on data integrity. It uses checksums to detect data corruption, and when another replica of the data exists it can fall back to an intact copy, helping to ensure the reliability of your stored data.
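To illustrate the data integrity point, here is a minimal sketch of checksummed reads with a fallback to another replica. It assumes CRC32 from Python’s zlib module as a stand-in checksum (BcacheFS lets you choose the checksum type with the data checksum option), and the store, is_intact and read_verified helpers are hypothetical.

```python
import zlib


def store(data):
    """Keep a checksum next to the data, computed at write time."""
    return {"data": bytearray(data), "checksum": zlib.crc32(data)}


def is_intact(replica):
    return zlib.crc32(bytes(replica["data"])) == replica["checksum"]


def read_verified(replicas):
    """Return (data, bad_indices): the first intact copy and the corrupt ones."""
    bad = [i for i, replica in enumerate(replicas) if not is_intact(replica)]
    for replica in replicas:
        if is_intact(replica):
            return bytes(replica["data"]), bad
    raise IOError("every replica fails its checksum")


replicas = [store(b"hello"), store(b"hello")]
replicas[0]["data"][0] ^= 0xFF  # simulate a flipped byte on the first device

print(read_verified(replicas))  # (b'hello', [0])
```

With a single replica the checksum can still tell you the data is bad, but there is nothing to fall back to, which is why replication and checksums work best together.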

Okay so how do I use it?

Of course that’s a lot of nice marketing for the thing, but if you’re like me, you’d want to immediately dive into how to use it and set it up at home. I’ve done exactly that with my secondary server that was previously running a ZFS array; it’s now a fresh Fedora Rawhide install (I’ll explain why after this) that runs BcacheFS, which seems to be handling the workload flawlessly thus far.

How do I install BcacheFS?

First things first, installing it! At the time of writing, getting it is a little complicated: you either need a kernel that already includes it, need to compile your own, or need to use the userspace FUSE-based tools (probably slow, I haven’t tested them myself).

The good news is that from kernel 6.7 it will be in the mainline kernel, so if you’re in the future and already have Linux kernel 6.7 installed, you should be able to just install bcachefs-tools (NOT bcache-tools!), and get it to run.
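If you want to check a machine before going further, a small sketch like this can help. It only looks at the version number and at the text of /proc/filesystems; the function names are my own, and keep in mind that a distribution could ship a 6.7 kernel without BcacheFS enabled, and that a filesystem built as a module may not be listed until the module is loaded.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string like '6.7.0-0.rc1.fc40.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_mainline_bcachefs(release=None):
    """True if the kernel version is at least 6.7."""
    if release is None:
        release = platform.release()
    return kernel_version(release) >= (6, 7)


def filesystem_registered(proc_filesystems_text, name="bcachefs"):
    """Check the contents of /proc/filesystems for a filesystem name."""
    return any(
        line.split()[-1] == name
        for line in proc_filesystems_text.splitlines()
        if line.strip()
    )


print(has_mainline_bcachefs("6.6.8-200.fc39.x86_64"))    # False
print(has_mainline_bcachefs("6.7.0-0.rc1.fc40.x86_64"))  # True
```

Comparing tuples rather than strings matters here: as plain text, "6.10" would sort before "6.7".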

Personally I found that the easiest and most stable way to get a 6.7 kernel is by installing Fedora Rawhide, the rolling development branch of Fedora, which continuously pulls in bleeding-edge versions of software, including the 6.7 kernel.

Note that bcache and bcachefs are two entirely separate products that should not be confused. You may sometimes see both of them show up in package listings; make sure you select the bcachefs options and not the bcache ones.

How to format your disks with BcacheFS

Straight from the manual:

To format a new bcachefs filesystem use the subcommand bcachefs format,
or mkfs.bcachefs. All persistent filesystem-wide options can be specified at
format time. For an example of a multi device filesystem with compression,
encryption, replication and writeback caching:
bcachefs format --compression=lz4 \
--encrypted \
--replicas=2 \
--label=ssd.ssd1 /dev/sda \
--label=ssd.ssd2 /dev/sdb \
--label=hdd.hdd1 /dev/sdc \
--label=hdd.hdd2 /dev/sdd \
--label=hdd.hdd3 /dev/sde \
--label=hdd.hdd4 /dev/sdf \
--foreground_target=ssd \
--promote_target=ssd \
--background_target=hdd

The above will give you a filesystem with lz4 compression, encryption, 2 replicas of every file, and an SSD and an HDD storage tier, with the SSD tier being used for caching purposes.

If you want something more specific, say for example you only want your background tier to be compressed, you can use one of the other options it gives you: --background_compression zstd will make sure the background tier is compressed with zstd compression.
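If you have more than a couple of disks, typing the command by hand gets error-prone, so here is a minimal sketch that assembles it from a label-to-device mapping. It only uses flags that appear in this article, it prints the command instead of running it, and the build_format_command helper is hypothetical.

```python
import shlex


def build_format_command(devices, replicas=1, compression=None,
                         background_compression=None, encrypted=False,
                         targets=None):
    """Assemble a bcachefs format command line as an argument list.

    devices maps a label such as 'ssd.ssd1' to a device path.
    targets maps option names such as 'foreground_target' to a group name.
    """
    argv = ["bcachefs", "format"]
    if compression:
        argv.append(f"--compression={compression}")
    if background_compression:
        argv.append(f"--background_compression={background_compression}")
    if encrypted:
        argv.append("--encrypted")
    argv.append(f"--replicas={replicas}")
    for label, path in devices.items():
        # Each --label applies to the device path that follows it.
        argv += [f"--label={label}", path]
    for name, group in (targets or {}).items():
        argv.append(f"--{name}={group}")
    return argv


argv = build_format_command(
    {"ssd.ssd1": "/dev/sda", "hdd.hdd1": "/dev/sdc"},
    replicas=2,
    background_compression="zstd",
    targets={
        "foreground_target": "ssd",
        "promote_target": "ssd",
        "background_target": "hdd",
    },
)
print(shlex.join(argv))  # review this before running anything: formatting wipes the disks
```

Keeping the result as a list makes it safe to hand to subprocess.run later, once you have double-checked the device paths.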

Full list of options:

This is the full list of options available, pulled straight from the manual. Substitute spaces in the option names with _ to use them on the command line (e.g. background compression becomes background_compression).
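The substitution rule is simple enough to express as a tiny helper. This is a sketch only: option_flag is a made-up name, and it blindly applies the rule above without checking whether the option exists or is valid at format time.

```python
def option_flag(manual_name, value=None):
    """Turn an option name as printed in the manual into a command line flag."""
    name = "_".join(manual_name.split())
    if value is None:
        return f"--{name}"
    return f"--{name}={value}"


print(option_flag("background compression", "zstd"))  # --background_compression=zstd
print(option_flag("journal flush delay", 500))        # --journal_flush_delay=500
print(option_flag("discard"))                         # --discard
```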

block size (format): Filesystem block size (default 4k)
btree node size (format): Btree node size, default 256k
errors (format,mount,runtime): Action to take on filesystem error
metadata replicas (format,mount,runtime): Number of replicas for metadata (journal and btree)
data replicas (format,mount,runtime,inode): Number of replicas for user data
replicas (format): Alias for both metadata replicas and data replicas
metadata checksum (format,mount,runtime): Checksum type for metadata writes
data checksum (format,mount,runtime,inode): Checksum type for data writes
compression (format,mount,runtime,inode): Compression type
background compression (format,mount,runtime,inode): Background compression type
str hash (format,mount,runtime,inode): Hash function for string hash tables (directories and xattrs)
metadata target (format,mount,runtime,inode): Preferred target for metadata writes
foreground target (format,mount,runtime,inode): Preferred target for foreground writes
background target (format,mount,runtime,inode): Target for data to be moved to in the background
promote target (format,mount,runtime,inode): Target for data to be copied to on read
erasure code (format,mount,runtime,inode): Enable erasure coding
inodes 32bit (format,mount,runtime): Restrict new inode numbers to 32 bits
shard inode numbers (format,mount,runtime): Use CPU id for high bits of new inode numbers.
wide macs (format,mount,runtime): Store full 128 bit cryptographic MACs (default 80)
inline data (format,mount,runtime): Enable inline data extents (default on)
journal flush delay (format,mount,runtime): Delay in milliseconds before automatic journal commit (default 1000)
journal flush disabled (format,mount,runtime): Disables journal flush on sync/fsync. journal flush delay remains in effect, thus with the default setting not more than 1 second of work will be lost.
journal reclaim delay (format,mount,runtime): Delay in milliseconds before automatic journal reclaim
acl (format,mount): Enable POSIX ACLs
usrquota (format,mount): Enable user quotas
grpquota (format,mount): Enable group quotas
prjquota (format,mount): Enable project quotas
degraded (mount): Allow mounting with data degraded
very degraded (mount): Allow mounting with data missing
verbose (mount): Extra debugging info during mount/recovery
fsck (mount): Run fsck during mount
fix errors (mount): Fix errors without asking during fsck
ratelimit errors (mount): Ratelimit error messages during fsck
read only (mount): Mount in read only mode
nochanges (mount): Issue no writes, even for journal replay
norecovery (mount): Don’t replay the journal (not recommended)
noexcl (mount): Don’t open devices in exclusive mode
version upgrade (mount): Upgrade on disk format to latest version
discard (device): Enable discard/TRIM support

A script that’ll help you set up a multi-device filesystem

See https://git.dragonhive.net/DragonHive/WyvernWorks/small-management-tools/-/blob/main/filesystems/bcachefs-multidevice-builder.py?ref_type=heads. If you run this script on the server, it’ll automatically find all unmounted HDDs and SSDs and help you assign them to the foreground/background tiers or exclude them.
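For a rough idea of how such device discovery can work, here is a minimal sketch. It is not taken from the linked script: it assumes you feed it the JSON printed by lsblk --json -o NAME,TYPE,ROTA,MOUNTPOINT, and the classify_devices and in_use helpers are hypothetical.

```python
import json


def in_use(dev):
    """True if the device or any of its partitions is mounted."""
    if dev.get("mountpoint"):
        return True
    return any(in_use(child) for child in dev.get("children", []))


def classify_devices(lsblk_json):
    """Split unmounted whole disks into 'ssd' and 'hdd' lists."""
    groups = {"ssd": [], "hdd": []}
    for dev in json.loads(lsblk_json)["blockdevices"]:
        if dev.get("type") != "disk" or in_use(dev):
            continue
        # Older lsblk versions print rota as "1"/"0", newer ones as true/false.
        rotational = dev.get("rota") in (True, 1, "1")
        groups["hdd" if rotational else "ssd"].append("/dev/" + dev["name"])
    return groups


sample = json.dumps({"blockdevices": [
    {"name": "sda", "type": "disk", "rota": False, "mountpoint": None,
     "children": [{"name": "sda1", "type": "part", "rota": False, "mountpoint": "/"}]},
    {"name": "sdb", "type": "disk", "rota": False, "mountpoint": None},
    {"name": "sdc", "type": "disk", "rota": True, "mountpoint": None},
    {"name": "sr0", "type": "rom", "rota": True, "mountpoint": None},
]})

print(classify_devices(sample))  # {'ssd': ['/dev/sdb'], 'hdd': ['/dev/sdc']}
```

The system disk is skipped because one of its partitions is mounted, which is the main safety check you want before handing devices to a format command.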