A common requirement is sustained throughput under a workload that consists of random inserts, where either the key range is chosen so that inserts are very unlikely to conflict (e.g., 128-bit hashes), or where inserts are expected to overwrite existing values.
With traditional btree variants, inserts are very fast while the data set remains in cache, but once the tree overflows the cache, performance drops significantly. There are two factors involved: first, the leaf page where each new record belongs must be read into cache before it can be modified; and second, once the cache is full, each insert forces a dirty page to be evicted and written back. A random insert can therefore incur both a random read and a random write.
Log-Structured Merge Trees are described in this paper by Patrick O'Neil, Edward Cheng, Dieter Gawlick and Elizabeth O'Neil: http://www.cs.umb.edu/~poneil/lsmtree.pdf
A logical tree is split into several physical pieces so that the most-recently-updated portion of data is in a tree that fits entirely in memory. The size of the in-memory chunk can be configured with the "lsm_chunk_size"
configuration key to WT_SESSION::create.
Once the in-memory tree reaches a threshold size, a new in-memory tree is created and the old tree synced to disk. Once written to disk, trees are read-only, though they are merged in the background with other on-disk trees to reduce the cost of reads.
With this structure, "blind writes" can be performed entirely on the in-memory tree. Deletes are implemented by inserting a special "tombstone" record into the in-memory tree.
An LSM tree can be created as follows, in much the same way as a WiredTiger btree file:
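A minimal sketch of the create call, assuming an open WT_SESSION; the object name "bucket" and the chunk size value are illustrative:

```c
/* Create an LSM tree: the "lsm:" URI prefix selects LSM storage.
 * The chunk size here is an example value for "lsm_chunk_size". */
ret = session->create(session, "lsm:bucket",
    "key_format=S,value_format=S,lsm_chunk_size=100MB");
```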
Once created, the LSM tree can be accessed using the same cursor interface as other data sources in WiredTiger:
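For example, a sketch assuming the "bucket" object above and the usual WiredTiger error handling:

```c
WT_CURSOR *cursor;

/* Open a cursor on the LSM tree; the URI is the same one
 * that was passed to WT_SESSION::create. */
ret = session->open_cursor(session, "lsm:bucket", NULL, NULL, &cursor);

/* Insert a record through the standard cursor interface. */
cursor->set_key(cursor, "key");
cursor->set_value(cursor, "value");
ret = cursor->insert(cursor);
```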
Unlike ordinary file cursors, LSM cursors default to overwrite mode, where the cursor's modifying methods ignore the current state of the record: WT_CURSOR::insert succeeds even if the key already exists, and WT_CURSOR::update and WT_CURSOR::remove succeed even if it does not. This behavior can be disabled by passing "overwrite=false" to WT_SESSION::open_cursor, but then every modification must first search through the levels of the LSM tree to check for an existing record.
A background thread is started for each active LSM tree. This thread is responsible both for writing old chunks to stable storage and for merging multiple chunks together so that reads can be satisfied from a small number of files. There is currently no way to configure merges: they are performed automatically by the background thread.
WiredTiger creates a Bloom filter when merging. This is an additional file that contains a configurable number of bits per key (default 8). Keys are hashed a configurable number of times (default 4), and the corresponding bits set. The Bloom filter is used to avoid reading from a chunk if the key cannot be present.
With the defaults, the Bloom filter only requires one byte per key, so it usually fits in cache. The Bloom filter parameters can be configured with the "lsm_bloom_bit_count" and "lsm_bloom_hash_count" configuration keys to WT_SESSION::create.
Reads from an LSM cursor may need to position a cursor in each active chunk. The number of chunks depends on the chunk size, and how many chunks have been merged. There must be at least as many hazard references available as there are chunks in the tree for each cursor that is open on the LSM tree. The number of hazard references is configured with the "hazard_max"
configuration key to wiredtiger_open.
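A sketch of the corresponding open call; the value 1000 is an arbitrary illustration, and the right number depends on the chunk count and how many LSM cursors are open, as described above:

```c
/* Raise the hazard reference limit for a workload with many LSM chunks. */
ret = wiredtiger_open(home, NULL, "create,hazard_max=1000", &conn);
```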
It is not yet possible to create tables or indices using LSM trees for storage. This will be addressed in a future release of WiredTiger.
Schema support will be provided for LSM through an extension to the WT_SESSION::create method:
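A sketch of the proposed syntax, using a "type=lsm" configuration key; the table name is illustrative:

```c
/* Create a table whose underlying storage is an LSM tree. */
ret = session->create(session, "table:mytable",
    "type=lsm,key_format=S,value_format=S");
```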
The default type for all schema objects will continue to be btree.
Internally, WiredTiger's LSM trees use an empty value to represent a record that has been removed (also known as a "tombstone"). For this reason, applications cannot store records in LSM trees with empty values.
There are currently some significant limitations in transactional access to data stored in LSM trees:
We intend to address these limitations in future releases.