Version 11.0.1

Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.

The WiredTiger cache is memory used to hold copies of recently accessed or modified data. WiredTiger reads Btree pages into the cache on demand. When the cache runs low on space, Eviction removes unneeded pages. Updates modify data in the cache and are flushed to storage asynchronously, either by Checkpoint or Eviction.

The page layout in the WiredTiger cache is optimized for fast, concurrent access by multiple application threads. In contrast, WiredTiger organizes pages in storage to minimize storage space. As a result, WiredTiger has to convert between the in-memory and on-storage representations of a page whenever it reads or writes the page.

Basic operation

Cached Btree pages point to each other, mirroring the structure of the on-disk Btree. When WiredTiger opens a file, it loads the root page of the Btree into memory along with the first level of internal pages. To look up an entry in a Btree, WiredTiger starts from the root page and searches down the Btree until it finds the entry. If WiredTiger encounters a page that is not in memory, it loads that page from storage and continues the search.

To load a page into the cache, WiredTiger passes the page's address cookie to the Block Manager and gets back a buffer containing the corresponding block from the underlying file. If necessary, WiredTiger decrypts and decompresses the block. Then it allocates indexing structures to facilitate quick binary search of the keys in the page. The first time WiredTiger needs to modify or insert an entry on a page, it allocates additional structures to track these changes.

WiredTiger tracks the total amount of data in the cache. It also tracks the space used by clean (unmodified) pages and by dirty (modified) pages. When the cache becomes too full or contains too much dirty data, WiredTiger invokes Eviction to remove data from the cache. To remove a clean page from the cache, WiredTiger simply frees the page's memory. To remove a dirty page, WiredTiger must first reconcile the page (converting it from in-memory format to on-disk format) and then write it to storage.

Cache structure

Internally, WiredTiger's cache state is represented by the WT_CACHE structure, which contains counters and parameter settings for tracking cache usage and controlling eviction policy. The WT_CACHE also includes state WiredTiger uses to track the progress of eviction. There is a single WT_CACHE for each connection, accessed via the WT_CONNECTION_IMPL structure.

Each page in the cache is accessed via a WT_REF structure. When WiredTiger opens a Btree, it places a WT_REF for the cached root page in the corresponding WT_BTREE structure. A WT_REF can represent either a page in the cache or one that has not been loaded yet. The page itself is represented by a WT_PAGE structure. This includes a pointer to a buffer that contains the on-disk page image (decrypted and uncompressed). It also holds the supplemental structures that WiredTiger uses to access and update the page while it is cached.

When WiredTiger loads a page into the cache, it allocates an internal table with one entry for each entry on the page. The type and content of these entries depend on the page type. An internal Btree page will have an array of WT_REF structures. A row-store leaf page will have an array of WT_ROW structures representing the KV pairs stored on the page. A variable-length column-store leaf page will have an array of WT_COL structures along with a parallel array of WT_COL_RLE structures indicating run lengths for items that are repeated more than once on the page. Both of these leaf page formats support binary search to quickly find an entry. In a fixed-length column-store leaf page, values will be packed into a simple byte array, allowing WiredTiger to access entries using bit operations based on the value length; the time window information is indexed in a separate array.

The first time an entry on a leaf page is inserted or modified, WiredTiger adds a WT_PAGE_MODIFY structure to the corresponding WT_PAGE in the cache. For a row-store leaf page the WT_PAGE_MODIFY tracks changes using an array of WT_UPDATE pointers with one element for each KV pair on the leaf page. When WiredTiger updates an entry, it inserts a WT_UPDATE in this array. If there are multiple updates to the same item, WiredTiger chains them together in a linked list. When a record is deleted, WiredTiger adds an update with a special tombstone value. WiredTiger stores newly inserted elements in a similar array of skip lists, represented by WT_INSERT structures. There is a separate skip list for the gap between each pair of adjacent keys on the page, plus one for the gap before the first key and one for the gap after the last key.

For a column-store leaf page the WT_PAGE_MODIFY structure tracks changes using a pair of skip lists, one for appended items and one for updated items.

Almost all operations on these data structures are lock-free, allowing a high level of concurrency in the cache.

Cache size and content

The amount of memory used by the WiredTiger cache is controlled by the cache_size configuration parameter, which defaults to 100 MB. (Note that MongoDB sets the cache size, by default, to be half the size of RAM.) WiredTiger does not explicitly manage this memory, relying instead on the C memory allocator to acquire and free memory as needed. Since the cache is allocated from the heap, evicting data from the cache simply returns the memory to the allocator; it does not reduce the application's memory footprint.
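For example, the cache size is set in the configuration string passed to wiredtiger_open (the database home path below is a placeholder):

```c
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn;
    int ret;

    /* Open a connection with a 500MB cache instead of the 100MB default. */
    ret = wiredtiger_open("/path/to/db", NULL,
        "create,cache_size=500MB", &conn);
    if (ret != 0)
        return ret;
    return conn->close(conn, NULL);
}
```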

The WiredTiger cache is only used for Btree data, including associated in-memory structures such as indexes, insert lists, and update chains. Other WiredTiger data structures, such as dhandles, cursors, and sessions, are not considered part of the cache and do not count against the cache size. Similarly, memory used to read in and write out the on-disk representations of Btree pages is not cached; it is only allocated temporarily during the I/O operation and while the data is converted to or from the on-disk format.

Shared caches

WiredTiger supports sharing a single cache among multiple databases within a process. Normally if a process opens connections to multiple different databases, each connection would use a separate fixed-size cache. With a shared cache, WiredTiger dynamically partitions a fixed amount of cache space between participating connections.

When shared caching is enabled, WiredTiger creates a cache pool server thread to manage the shared cache. It also allocates a global WT_CACHE_POOL structure, which stores settings and statistics for the shared cache. These settings include a minimum and maximum cache size for connections participating in the shared cache.
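Participation in a cache pool is also configured through wiredtiger_open, via the shared_cache group of settings. The fragment below is a sketch with placeholder paths and a hypothetical pool name; consult the wiredtiger_open configuration reference for the full set of shared_cache keys and their semantics.

```c
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn1, *conn2;
    /* Both connections name the same pool, so they share its space. */
    const char *cfg =
        "create,shared_cache=(name=pool,size=1GB,chunk=10MB)";

    if (wiredtiger_open("/db/first", NULL, cfg, &conn1) != 0)
        return 1;
    if (wiredtiger_open("/db/second", NULL, cfg, &conn2) != 0)
        return 1;

    conn2->close(conn2, NULL);
    return conn1->close(conn1, NULL);
}
```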

The cache pool server thread wakes up periodically and adjusts the sizes of the individual per-connection caches. Adjustments are based on a pressure metric for each cache computed using a weighted average of the amount of data read into the cache (i.e., cache misses) and how often application threads have evicted data from the cache or waited while performing eviction. If a cache has higher pressure than average and is not yet at the maximum size, WiredTiger grows that cache. Conversely, if a cache has low pressure, WiredTiger shrinks it, subject to the minimum cache size. To change the size of a cache, the cache pool server simply changes the cache size parameters in the corresponding WT_CACHE structure. WiredTiger's eviction code will adjust the amount of data in the cache accordingly.