A common requirement is sustained throughput under a workload that consists of random inserts, where either the key range is chosen so that inserts are very unlikely to conflict (e.g., 128-bit hashes), or where inserts are expected to overwrite existing values.
With traditional btree variants, inserts are very fast while the data set remains in cache, but once the tree overflows the cache, performance drops significantly. There are two factors involved: first, the leaf page where each new record belongs must be read into cache before it can be modified; and second, once the cache is full, each insert forces a dirty page to be evicted and written back. A random insert can therefore incur both a random read and a random write.
Log-Structured Merge Trees are described in this paper by Patrick O'Neil, Edward Cheng, Dieter Gawlick and Elizabeth O'Neil: http://www.cs.umb.edu/~poneil/lsmtree.pdf
A logical tree is split into several physical pieces so that the most-recently-updated portion of data is in a tree that fits entirely in memory. The size of the in-memory chunk can be configured with the "lsm_chunk_size"
configuration key to WT_SESSION::create.
Once the in-memory tree reaches a threshold size, a new in-memory tree is created and the old tree synced to disk. Once written to disk, trees are read-only, though they are merged in the background with other on-disk trees to reduce the cost of reads.
With this structure, "blind writes" can be performed entirely on the in-memory tree. Deletes are implemented by inserting a special "tombstone" record into the in-memory tree.
An LSM tree can be created as follows, in much the same way as a WiredTiger btree file:
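A minimal sketch of the create call, assuming an open WT_SESSION; the object name "bucket" and the chunk size value are illustrative:

```c
/* Create an LSM tree: the "lsm:" URI prefix selects LSM storage.
 * The chunk size here is an example value for "lsm_chunk_size". */
ret = session->create(session, "lsm:bucket",
    "key_format=S,value_format=S,lsm_chunk_size=100MB");
```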
Once created, the LSM tree can be accessed using the same cursor interface as other data sources in WiredTiger:
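For example, a sketch assuming the "bucket" object above and the usual WiredTiger error handling:

```c
WT_CURSOR *cursor;

/* Open a cursor on the LSM tree; the URI is the same one
 * that was passed to WT_SESSION::create. */
ret = session->open_cursor(session, "lsm:bucket", NULL, NULL, &cursor);

/* Insert a record through the standard cursor interface. */
cursor->set_key(cursor, "key");
cursor->set_value(cursor, "value");
ret = cursor->insert(cursor);
```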
Unlike ordinary file cursors, LSM cursors default to overwrite mode, where the cursor's modifying methods ignore the current state of the record: WT_CURSOR::insert succeeds even if the key already exists, and WT_CURSOR::update and WT_CURSOR::remove succeed even if it does not. This behavior can be disabled by passing "overwrite=false" to WT_SESSION::open_cursor, but then every modification must first search through the levels of the LSM tree to check for an existing record.
A background thread is started for each active LSM tree. This thread is responsible both for writing old chunks to stable storage and for merging multiple chunks together so that reads can be satisfied from a small number of files. There is currently no way to configure merges: they are performed automatically by the background thread.
WiredTiger creates a Bloom filter when merging. This is an additional file that contains a configurable number of bits per key (default 8). Keys are hashed a configurable number of times (default 4), and the corresponding bits set. The Bloom filter is used to avoid reading from a chunk if the key cannot be present.
With the defaults, the Bloom filter only requires one byte per key, so it usually fits in cache. The Bloom filter parameters can be configured with the "lsm_bloom_bit_count" and "lsm_bloom_hash_count" configuration keys to WT_SESSION::create.
Reads from an LSM cursor may need to position a cursor in each active chunk. The number of chunks depends on the chunk size, and how many chunks have been merged. There must be at least as many hazard references available as there are chunks in the tree for each cursor that is open on the LSM tree. The number of hazard references is configured with the "hazard_max"
configuration key to wiredtiger_open.
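A sketch of the corresponding open call; the value 1000 is an arbitrary illustration, and the right number depends on the chunk count and how many LSM cursors are open, as described above:

```c
/* Raise the hazard reference limit for a workload with many LSM chunks. */
ret = wiredtiger_open(home, NULL, "create,hazard_max=1000", &conn);
```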
It is not yet possible to create tables or indices using LSM trees for storage. This will be addressed in a future release of WiredTiger.
Schema support will be provided for LSM through an extension to the WT_SESSION::create method:
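A sketch of the proposed syntax, using a "type=lsm" configuration key; the table name is illustrative:

```c
/* Create a table whose underlying storage is an LSM tree. */
ret = session->create(session, "table:mytable",
    "type=lsm,key_format=S,value_format=S");
```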
The default type for all schema objects will continue to be btree.
Internally, WiredTiger's LSM trees use an empty value to represent a record that has been removed (also known as a "tombstone"). For this reason, applications cannot store records in LSM trees with empty values.
There are currently some significant limitations in transactional access to data stored in LSM trees:
We intend to address these limitations in future releases.