Version 1.6.5
Performance Tuning

WiredTiger's cache

Cache size

The cache size for the database is configurable by setting the cache_size configuration string when calling the wiredtiger_open function.

The effectiveness of the cache can be measured by reviewing the page eviction statistics for the database.

An example of setting a cache size to 500MB:

if ((ret = wiredtiger_open(home, NULL,
    "create,cache_size=500M", &conn)) != 0)
        fprintf(stderr, "Error connecting to %s: %s\n",
            home, wiredtiger_strerror(ret));

Read-only objects

Cursors opened on checkpoints (either named, or using the special "last checkpoint" name "WiredTigerCheckpoint") are read-only objects. Unless memory mapping is configured off (using the "mmap" configuration string to wiredtiger_open), read-only objects are mapped into process memory instead of being read through the WiredTiger cache. Using read-only objects where possible minimizes the amount of buffer cache memory required by WiredTiger applications and the work required for buffer cache management, as well as reducing the number of memory copies from the operating system buffer cache into application memory.
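For example, memory mapping of read-only objects can be turned off when opening the database (a minimal sketch, assuming the boolean form of the "mmap" setting):

ret = wiredtiger_open(home, NULL, "create,mmap=false", &conn);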

To open a named checkpoint, use the configuration string "checkpoint" to the WT_SESSION::open_cursor method:

ret = session->open_cursor(session,
    "table:mytable", NULL, "checkpoint=midnight", &cursor);

To open the last checkpoint taken in the object, use the configuration string "checkpoint" with the name "WiredTigerCheckpoint" to the WT_SESSION::open_cursor method:

ret = session->open_cursor(session,
    "table:mytable", NULL, "checkpoint=WiredTigerCheckpoint", &cursor);

Bulk load

When loading a large amount of data into a new object, using a cursor with the bulk configuration string enabled and loading the data in sorted order will be much faster than doing out-of-order inserts. See Bulk-load for more information.
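A minimal sketch of a bulk load (the table name and records are illustrative):

WT_CURSOR *cursor;

/* Bulk cursors may only be opened on newly created, empty objects. */
ret = session->create(session,
    "table:mytable", "key_format=S,value_format=S");
ret = session->open_cursor(session, "table:mytable", NULL, "bulk", &cursor);

cursor->set_key(cursor, "key1");        /* keys must arrive in sorted order */
cursor->set_value(cursor, "value1");
ret = cursor->insert(cursor);

ret = cursor->close(cursor);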

Cache resident objects

Cache resident objects (objects never considered for the purposes of cache eviction) can be configured with the WT_SESSION::create "cache_resident" configuration string.

Configuring a cache resident object has two effects: first, once the object's pages have been instantiated in memory, no further I/O cost is ever paid for object access, minimizing potential latency. Second, in-memory objects can be accessed faster than objects tracked for potential eviction, and applications able to guarantee sufficient memory that an object need never be evicted can significantly increase their performance.

An example of configuring a cache-resident object:

ret = session->create(session,
    "table:mytable", "key_format=r,value_format=S,cache_resident=true");

Memory allocator

The performance of heavily-threaded WiredTiger applications can be dominated by memory allocation because the WiredTiger engine has to free and re-allocate memory as part of many queries. Replacing the system's malloc implementation with one that has better threaded performance (for example, Google's tcmalloc, or jemalloc), can dramatically improve throughput.

Cursor persistence

Opening a new cursor is a relatively expensive operation in WiredTiger (especially for table objects and Log-Structured Merge (LSM) trees, where a logical cursor may require multiple underlying object cursors), and caching cursors can improve performance. On the other hand, cursors hold positions in objects, and long-lived cursor positions can decrease performance. The best combination is to cache cursors, but use the WT_CURSOR::reset method to discard the cursor's position in the object when the position is no longer needed.
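A minimal sketch of this pattern (the key is illustrative):

cursor->set_key(cursor, "some-key");
if ((ret = cursor->search(cursor)) == 0) {
        /* ... work with the positioned cursor ... */
}

/* Discard the position, but keep the cursor open for later reuse. */
ret = cursor->reset(cursor);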

Additionally, cursors are automatically reset whenever a transaction boundary is crossed: when a transaction is started with WT_SESSION::begin_transaction or ended with either WT_SESSION::commit_transaction or WT_SESSION::rollback_transaction, all open cursors are automatically reset. There is no need to call the WT_CURSOR::reset method explicitly, and the cursor can be immediately reused.

Page and overflow sizes

There are four page and item size configuration values: internal_page_max, internal_item_max, leaf_page_max and leaf_item_max. All four are specified to the WT_SESSION::create method; that is, they are configurable on a per-file basis.

The internal_page_max and leaf_page_max configuration values specify the maximum size for Btree internal and leaf pages. That is, when an internal or leaf page reaches the specified size, it splits into two pages. Generally, internal pages should be sized to fit into the system's L1 or L2 caches in order to minimize cache misses when searching the tree, while leaf pages should be sized to maximize I/O performance (if reading from disk is necessary, it is usually desirable to read a large amount of data, assuming some locality of reference in the application's access pattern).

The internal_item_max and leaf_item_max configuration values specify the maximum size at which an object will be stored on-page. Larger items will be stored separately in the file from the page where the item logically appears. Referencing overflow items is more expensive than referencing on-page items, requiring additional I/O if the object is not already cached. For this reason, it is important to avoid creating large numbers of overflow items that are repeatedly referenced, and the maximum item size should probably be increased if many overflow items are being created. Because pages must be large enough to store any item that is not an overflow item, increasing the size of the overflow items may also require increasing the page sizes.

With respect to compression, page and item sizes do not necessarily reflect the actual size of the page or item on disk if block compression has been configured. Block compression in WiredTiger happens within the disk I/O subsystem, so a page might split even if subsequent compression would have reduced it to a size small enough to leave as a single page. In other words, page and overflow sizes are based on in-memory sizes, not on-disk sizes.

There are two other related configuration values, also settable by the WT_SESSION::create method: allocation_size and split_pct.

The allocation_size configuration value is the underlying unit of allocation for the file. As the unit of file allocation, it has two effects: first, it limits the ultimate size of the file, and second, it determines how much space is wasted when storing overflow items.

By limiting the size of the file, the allocation size limits the amount of data that can be stored in a file. For example, if the allocation size is set to the minimum possible (512B), the maximum file size is 2TB, that is, attempts to allocate new file blocks will fail when the file reaches 2TB in size. If the allocation size is set to the maximum possible (512MB), the maximum file size is 2EB.

The unit of allocation also determines how much space is wasted when storing overflow items. For example, if the allocation size were set to the minimum value of 512B, an overflow item of 1100 bytes would require three allocation-size file units, or 1536 bytes, wasting over 400 bytes. For this reason, as the allocation size increases, page sizes and overflow item sizes will likely increase as well, to ensure that significant space isn't wasted by overflow items.

The last configuration value is split_pct, which configures the size of a split page. When a page grows sufficiently large that it must be split, the newly split page's size is split_pct percent of the maximum page size. This value should be selected to avoid creating a large number of tiny pages or repeatedly splitting whenever new entries are inserted. For example, if the maximum page size is 1MB, a split_pct value of 10% would potentially result in creating a large number of 100KB pages, which may not be optimal for future I/O. Or, if the maximum page size is 1MB, a split_pct value of 90% would potentially result in repeatedly splitting pages as the split pages grow to 1MB over and over. The default value for split_pct is 75%, intended to keep large pages relatively large, while still giving split pages room to grow.
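A sketch of setting these two values explicitly (the values shown are illustrative; 75 is the documented default for split_pct):

ret = session->create(session, "file:example",
    "allocation_size=4KB,split_pct=75");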

An example of configuring page sizes:

ret = session->create(session, "file:example",
"key_format=u,"
"internal_page_max=32KB,internal_item_max=1KB,"
"leaf_page_max=1MB,leaf_item_max=32KB");

File growth

It's faster on some filesystems to grow a file in chunks rather than to extend it a block at a time as new blocks are written. By configuring the wiredtiger_open function's file_extend value, applications can grow files ahead of the blocks being written.

home, NULL, "create,file_extend=(data=16MB)", &conn);

System buffer cache

Direct I/O

WiredTiger optionally supports direct I/O. Configuring direct I/O may be useful for applications wanting to:

  • minimize the operating system cache effects of I/O to and from WiredTiger's buffer cache,
  • avoid double-buffering of blocks in WiredTiger's cache and the operating system buffer cache, and
  • avoid stalling underlying solid-state drives by writing a large number of dirty blocks.

Direct I/O is configured using the "direct_io" configuration string to the wiredtiger_open function. An example of configuring direct I/O for WiredTiger's data files:

ret = wiredtiger_open(home, NULL, "create,direct_io=[data]", &conn);

Direct I/O implies that a writing thread waits for the write to complete (a slower operation than writing into the system buffer cache), so configuring direct I/O is likely to decrease overall application performance.

Many Linux systems do not support mixing O_DIRECT and memory mapping or normal I/O to the same file, and attempting to do so can result in data loss or corruption. For this reason:

  • WiredTiger silently ignores the setting of the mmap configuration to the wiredtiger_open function in those cases, and will never memory map a file which is configured for direct I/O.
  • If O_DIRECT is configured for data files on Linux systems, any system utilities used to copy database files for the purposes of hot backup should also specify O_DIRECT when configuring their file access. A standard Linux system utility that supports O_DIRECT is the dd utility, when using the iflag=direct command-line option.

Additionally, many Linux systems require specific alignment for buffers used for I/O when direct I/O is configured, and using the wrong alignment can cause data loss or corruption. When direct I/O is configured on Linux systems, WiredTiger aligns I/O buffers to 4KB; if different alignment is required by your system, the buffer_alignment configuration to the wiredtiger_open call should be configured to the correct value.
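A sketch of overriding the alignment (the 512B value is illustrative, not a recommendation):

ret = wiredtiger_open(home, NULL,
    "create,direct_io=[data],buffer_alignment=512B", &conn);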

Finally, if direct I/O is configured on any system, WiredTiger will silently change the file unit allocation size and the maximum leaf and internal page sizes to be at least as large as the buffer_alignment value as well as a multiple of that value.

Direct I/O is based on the non-standard O_DIRECT flag to the POSIX 1003.1 open system call and may not be available on all platforms.

os_cache_dirty_max

As well as direct I/O, WiredTiger supports two additional configuration options related to the system buffer cache:

The first is os_cache_dirty_max, the maximum dirty bytes an object is allowed to have in the system buffer cache. Once this many bytes from an object are written into the system buffer cache, WiredTiger will attempt to schedule writes for all of the dirty blocks the object has in the system buffer cache. This configuration option allows applications to flush dirty blocks from the object, avoiding stalling any underlying drives when the object is subsequently flushed to disk as part of a durability operation.

An example of configuring os_cache_dirty_max:

ret = session->create(
    session, "table:mytable", "os_cache_dirty_max=500MB");

The os_cache_dirty_max configuration may not be used in combination with direct I/O.

The os_cache_dirty_max configuration is based on the non-standard Linux sync_file_range system call and may not be available on all platforms.

os_cache_max

The second configuration option related to the system buffer cache is os_cache_max, the maximum bytes an object is allowed to have in the system buffer cache. Once this many bytes from an object are either read into or written from the system buffer cache, WiredTiger will attempt to evict all of the object's blocks from the buffer cache. This configuration option allows applications to evict blocks from the system buffer cache to limit double-buffering and system buffer cache overhead.

An example of configuring os_cache_max:

ret = session->create(session, "table:mytable", "os_cache_max=1GB");

The os_cache_max configuration may not be used in combination with direct I/O.

The os_cache_max configuration is based on the POSIX 1003.1 standard posix_fadvise system call and may not be available on all platforms.

Configuring direct I/O, os_cache_dirty_max or os_cache_max has the side effect of turning off memory mapping of objects in WiredTiger.

Checksums

WiredTiger checksums file reads and writes by default. In read-only applications, or when file compression provides any necessary checksum functionality, or when using backing storage systems where blocks require no validation, performance can be increased by turning off checksum support when calling the WT_SESSION::create method.
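For example, turning off checksums entirely:

ret = session->create(session, "table:mytable",
    "key_format=S,value_format=S,checksum=off");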

Checksums can be configured to be "off" or "on", as well as "uncompressed". The "uncompressed" configuration checksums blocks not otherwise protected by compression. For example, in a system where the compression algorithms provide sufficient protection against corruption, and when writing a block which is too small to be usefully compressed, setting the checksum configuration value to "uncompressed" causes WiredTiger to checksum the blocks which are not compressed:

ret = session->create(session, "table:mytable",
"key_format=S,value_format=S,checksum=uncompressed");

Compression

WiredTiger configures key prefix compression for row-store objects by default. Additional forms of compression for both row- and column-store objects, including dictionary and block compression, and Huffman encoding, are optional. Compression minimizes in-memory and on-disk resource requirements and decreases the amount of I/O, at some CPU cost when rows are read and written.

Configuring compression on or off may change application throughput. For example, in applications using solid-state drives (where I/O is less expensive), turning off compression may increase application performance by reducing CPU costs; in applications where I/O costs are more expensive, turning on compression may increase application performance by reducing the overall number of I/O operations.

For example, turning off row-store key prefix compression:

ret = session->create(session, "table:mytable",
"key_format=S,value_format=S,prefix_compression=false");

For example, turning on row-store or column-store dictionary compression:

ret = session->create(session, "table:mytable",
"key_format=S,value_format=S,dictionary=1000");

See File formats and compression for more information.

Performance monitoring with statistics

WiredTiger optionally maintains a variety of statistics, when the statistics configuration string is specified to wiredtiger_open; see Statistics for general information about statistics, and Statistics Data for information about accessing the statistics.
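An example of enabling statistics when opening the database (a minimal sketch, assuming the boolean form of the statistics configuration):

ret = wiredtiger_open(home, NULL, "create,statistics=true", &conn);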

Note that maintaining run-time statistics involves updating shared-memory data structures and may decrease application performance.

The statistics gathered by WiredTiger can be combined to derive information about the system's behavior. For example, a cursor can be opened on the statistics for a table:

if ((ret = session->open_cursor(session,
    "statistics:table:access", NULL, NULL, &cursor)) != 0)
        return (ret);

Then this code calculates the "fragmentation" of a table, defined here as the percentage of the table that is not part of the current checkpoint:

uint64_t ckpt_size, file_size;

ret = get_stat(cursor, WT_STAT_DSRC_BLOCK_CHECKPOINT_SIZE, &ckpt_size);
ret = get_stat(cursor, WT_STAT_DSRC_BLOCK_SIZE, &file_size);

printf("File is %d%% fragmented\n",
    (int)(100 * (file_size - ckpt_size) / file_size));

The following example calculates the "write amplification", defined here as the ratio of bytes written to the filesystem versus the total bytes inserted, updated and removed by the application.

uint64_t app_insert, app_remove, app_update, fs_writes;

ret = get_stat(cursor, WT_STAT_DSRC_CURSOR_INSERT_BYTES, &app_insert);
ret = get_stat(cursor, WT_STAT_DSRC_CURSOR_REMOVE_BYTES, &app_remove);
ret = get_stat(cursor, WT_STAT_DSRC_CURSOR_UPDATE_BYTES, &app_update);
ret = get_stat(cursor, WT_STAT_DSRC_CACHE_BYTES_WRITE, &fs_writes);

if (app_insert + app_remove + app_update != 0)
        printf("Write amplification is %.2lf\n",
            (double)fs_writes / (app_insert + app_remove + app_update));

Both examples use this helper function to retrieve statistics values from a cursor:

int
get_stat(WT_CURSOR *cursor, int stat_field, uint64_t *valuep)
{
        const char *desc, *pvalue;
        int ret;

        cursor->set_key(cursor, stat_field);
        if ((ret = cursor->search(cursor)) != 0)
                return (ret);

        return (cursor->get_value(cursor, &desc, &pvalue, valuep));
}