[Paper of the Day] Bitcask - A Log Structured Hash Table For Fast Key/Value Data
The origin of Bitcask is tied to the history of the Riak distributed database. In a Riak key/value cluster, each node uses pluggable local storage; nearly anything k/v-shaped can be used as the per-host storage engine. This pluggability allowed progress on Riak to be parallelized such that storage engines could be improved and tested without impact on the rest of the codebase.
- Riak has a pluggable storage engine, so the authors were looking for a storage engine that satisfied their goals:
- low latency for read and write operations
- high throughput with a random stream of items
- ability to handle datasets that are larger than RAM
- ease of backup and restore
- crash friendly (fast recovery plus not losing data)
- relatively simple and understandable code structure and data format.
- predictable behavior under heavy load or large volume of data.
- a license that allowed for easy default use in Riak.
The original purpose of Bitcask was to serve Riak, but the outcome was a generic key/value store that other applications can use as well.
- Bitcask is just a directory as far as the operating system is concerned.
- Only a single process has write access to that directory (this process acts like a “database server”).
- The directory contains a bunch of data files; only one of them is the active data file, while the rest are immutable data files.
- Once the active data file reaches a certain size threshold, or the server exits for some reason, it becomes an immutable data file.
- Immutable data files are never opened for writing again.
- The active file is written by appending, meaning every write is sequential and requires no disk seek.
Each entry in a data file starts with a cyclic redundancy check (CRC), which detects accidental corruption of the data. Each entry also stores the key size and value size so that the reader thread knows how much data to read for the key/value pair.
So a data file is nothing but a linear sequence of entries.
- Bitcask uses an in-memory hash table to index keys to their respective files and offsets.
- Every key that has ever been written lives in that hash table, so main memory is the limiting factor here.
- Writes are already fast, but indexing is a definite need for read operations, since we don’t want excessive disk seeks every time we look up a random value. (An SSD won’t save us here either.)
- This hash table is called the keydir in the Bitcask world.
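Per the paper, each keydir entry records which file holds the latest value for a key, plus the value's size, position, and timestamp. A sketch of that structure (the field names mirror the paper; the example key and values are made up):

```python
from typing import NamedTuple

class KeydirEntry(NamedTuple):
    """What the keydir stores per key: which file holds the latest value,
    how big it is, where it starts, and when it was written."""
    file_id: int
    value_sz: int
    value_pos: int
    tstamp: int

# The keydir itself is just an in-memory hash table: key -> entry.
keydir: dict[bytes, KeydirEntry] = {}
keydir[b"user:42"] = KeydirEntry(file_id=3, value_sz=11, value_pos=2048, tstamp=1700000000)
```

Note that values never live in the keydir, only their locations, which is why memory usage grows with the number of keys rather than with total data size.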
Reading and Writing
- When a value is asked for by key, Bitcask first consults the keydir (the in-memory hash table).
- The keydir points to the file that holds the given key (or reports that the key does not exist).
- The reader thread jumps directly to the right place inside the data file with a single disk seek (since we already know the offset).
- Opening and reading files won’t be a problem, as most of the files will already be in the file system’s read-ahead cache, so in practice this operation is quick.
- Given a key and a corresponding value, the writer thread appends the key/value pair to the active data file.
- It also updates the keydir (hash table) with the key and the offset of the entry in the active data file.
Issues with Design
Any crash of the system takes the keydir down with it. This means that before the system can be back up and running, it needs to rebuild the keydir from scratch. Depending on the data you are dealing with, this might be a deal-breaker. Bitcask therefore leaves a hint file for every data file within the Bitcask directory. Hint files do not contain full-fledged data, just enough metadata to rebuild the keydir after a crash (a hint file contains the position and size of each value, but not the values themselves). This speeds up crash recovery.
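A sketch of recovery from a hint file. The paper lists the hint-entry fields as timestamp, key size, value size, value position, and key; the fixed 4-byte widths below are an assumption for illustration:

```python
import struct

# Assumed hint-entry layout: tstamp | ksz | value_sz | value_pos | key.
# The field list follows the paper; the widths are illustrative.
HINT = struct.Struct(">IIII")

def rebuild_keydir_from_hint(hint_bytes: bytes, file_id: int, keydir: dict) -> None:
    """Scan one hint file's contents and repopulate the keydir.
    No values are read, which is why recovery via hint files is fast."""
    off = 0
    while off < len(hint_bytes):
        tstamp, ksz, value_sz, value_pos = HINT.unpack_from(hint_bytes, off)
        off += HINT.size
        key = hint_bytes[off:off + ksz]
        off += ksz
        keydir[key] = (file_id, value_sz, value_pos, tstamp)
```

Recovery iterates over every data file's hint file this way; only data files without a hint file need a full scan.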
Considering we effectively never get rid of stale values for a particular key, this quickly becomes an issue. Bitcask addresses it with a background process called compaction/merging. This process goes through the inactive data files, gets rid of stale data for each key, and, where it sees a need, merges multiple inactive data files into a more compact one. It also updates the keydir, since some keys now point to new file ids.
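The core of the merge step is "keep only the newest entry per key". A toy version over in-memory entries (the real process streams entries from immutable files and also writes new data and hint files):

```python
def merge(files: list[list[tuple[int, bytes, bytes]]]) -> list[tuple[int, bytes, bytes]]:
    """Toy merge: given the entries of several immutable files as
    (tstamp, key, value) tuples, keep only the newest value per key."""
    latest: dict[bytes, tuple[int, bytes, bytes]] = {}
    for entries in files:
        for tstamp, key, value in entries:
            # A later timestamp wins; stale versions of the key are dropped.
            if key not in latest or tstamp >= latest[key][0]:
                latest[key] = (tstamp, key, value)
    return sorted(latest.values())
```

After writing the surviving entries to a new compact file, the merge must point the keydir at the new file's offsets, which is the "propagating new file ids" step mentioned above.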
Goals & Outcome
- Low latency per item read and write: Bitcask is fast. Typical median latency (p50) is sub-millisecond.
- High throughput, especially when writing an incoming stream of random items: in early tests on a laptop with a slow disk, they saw a throughput of 5,000-6,000 writes per second.
- Ability to handle datasets larger than the RAM of the machine without degradation: tests used datasets 10x the size of RAM and showed no sign of changed behavior at that point.
- Crash friendliness (fast recovery and not losing data): there is no distinction between the data file and the commit log (they are the same), so there is no penalty to pay for rebuilding an index. Hint files also make recovery faster.
- Relatively simple, understandable code structure and data format: this is a pretty clear win.
- Predictable behavior under heavy load or large volume: Bitcask already shows promising results with double-digit-gigabyte volumes. The only caveat is that the keydir must fit in main memory. A key set bigger than RAM would be a problem, but tests show that even with millions of keys, RAM usage stays under a gigabyte.
- They rely on the file system’s read-ahead cache for read performance. They haven’t added an internal cache to speed things up, as they were unsure how big the performance gain would be.
- The paper does not describe the locking technique used for the keydir (hash table).
- They do not perform any compression of data, as they believe the cost/benefit is very application-dependent.
- Research alternative key/value storage engines, e.g., Berkeley DB, Tokyo Cabinet, Innostore.
- Appending to the end of the file does not require any disk seek.
- Check file system read-ahead cache.
Link to paper: https://riak.com/assets/bitcask-intro.pdf