This document explains the role played by different page sizes in WiredTiger. It also details the motivation behind an application wanting to change these page sizes from their default values, and the procedure for doing so.
Applications commonly configure page sizes based on their workload's typical key and value size. Once a page size has been chosen, WiredTiger derives appropriate defaults for the other configuration values from it, and relatively few applications will need to modify the other page and key/value size configuration options.
WiredTiger also offers several compression options that affect the size of the data both in memory and on disk. Hence, while selecting page sizes, an application must also consider its compression needs. Since the data and workload differ from one table to another in the database, an application can choose to set page sizes and compression options on a per-table basis.
Data life cycle
Before detailing each page size, here is a review of how data gets stored inside WiredTiger:
- WiredTiger uses the physical disks to store data durably, creating on-disk files for the tables in the database directory. It also caches the portion of the table being currently accessed by the application for reading or writing in main memory.
- WiredTiger maintains a table's data in memory using a data structure called a B-Tree (a B+ Tree, to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
- The format of the in-memory pages is not the same as the format of the on-disk pages. Therefore, the in-memory pages regularly go through a process called reconciliation to create data structures appropriate for storage on the disk. These data structures are referred to as on-disk pages. An application can set a maximum size separately for the internal and leaf on-disk pages; otherwise, WiredTiger uses a default value. If reconciling an in-memory page would produce an on-disk page larger than this maximum, WiredTiger creates multiple smaller on-disk pages.
- A component of WiredTiger called the Block Manager divides the on-disk pages into smaller chunks called blocks, which then get written to the disk. The size of these blocks is defined by a parameter called allocation_size, which is the underlying unit of allocation for the file the data gets stored in. An application might choose to have data compressed before it gets stored to disk by enabling block compression.
- A database's tables are usually much larger than the main memory available. Not all of the data can be kept in memory at any given time. A process called eviction takes care of making space for new data by freeing the memory of data infrequently accessed. An eviction server regularly finds in-memory pages that have not been accessed in a while (following an LRU algorithm). Several background eviction threads continuously process these pages, reconcile them to disk and remove them from the main memory.
- When an application inserts or updates a key/value pair, the associated key is used to locate an in-memory page. If this page is not in memory, the appropriate on-disk page(s) are read and an in-memory page is constructed (the opposite of reconciliation). A data structure is maintained on every in-memory page to store any insertions or modifications to the data on that page. As more and more data gets written to this page, the page's memory footprint keeps growing.
- An application can choose to set the maximum size a page is allowed to grow to in memory. A default size is set by WiredTiger if the application doesn't specify one. To keep page management efficient, as a page grows larger in memory and approaches this maximum size, it is split into smaller in-memory pages if possible.
- When doing an insert or an update, if a page grows larger than the maximum, the application thread is used to forcefully evict this page. This is done to split the growing page into smaller in-memory pages and reconcile them into on-disk pages. Once written to the disk they are removed from the main memory, making space for more data to be written. When an application gets involved in forced eviction, it might take longer than usual to do these inserts and updates. It is not always possible to (force) evict a page from memory and this page can temporarily grow larger in size than the configured maximum. This page then remains marked to be evicted and reattempts are made as the application puts more data in it.
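The reconciliation step described above, where one in-memory page becomes several on-disk pages when it would exceed the configured maximum, can be pictured with a small toy model. This is only a sketch under simplifying assumptions: a page's size is taken to be the sum of its key and value lengths, and page headers, split_pct and overflow items are ignored. The function name reconcile and the sample data are hypothetical and are not part of WiredTiger.

```python
def reconcile(items, page_max):
    """Greedily pack sorted (key, value) byte pairs into on-disk pages.

    Toy model: a page's size is the sum of its key and value lengths.
    Page headers, split_pct and overflow items are deliberately ignored.
    """
    pages, current, used = [], [], 0
    for key, value in items:
        size = len(key) + len(value)
        # Start a new page when adding this pair would exceed the maximum.
        if current and used + size > page_max:
            pages.append(current)
            current, used = [], 0
        current.append((key, value))
        used += size
    if current:
        pages.append(current)
    return pages


# Ten pairs of 4-byte keys and 96-byte values: 100 bytes per pair.
in_memory_page = [(b"k%03d" % i, b"v" * 96) for i in range(10)]
on_disk_pages = reconcile(in_memory_page, page_max=512)
print([len(page) for page in on_disk_pages])  # [5, 5]
```

With a 512-byte maximum, five 100-byte pairs fit on a page, so the 1000-byte in-memory page is written as two on-disk pages.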
Configurable page structures in WiredTiger
There are three page sizes that the user can configure:
- The maximum page size of any type of in-memory page in the WiredTiger cache, memory_page_max.
- The maximum size of the on-disk page for an internal page, internal_page_max.
- The maximum size of the on-disk leaf page, leaf_page_max.
There are additional configuration settings that tune more esoteric and specialized data. Those are included for completeness but are rarely changed.
memory_page_max
The maximum size a table's page is allowed to grow to in memory before being reconciled to disk.
- An integer, with acceptable values between 512B and 10TB
- Default size: 5 MB
- Additionally constrained by the condition: leaf_page_max <= memory_page_max <= cache_size/10
- Motivation to tune the value:
memory_page_max is significant for applications wanting to tune for consistency in write intensive workloads.
- This is the parameter to start with for tuning and trying different values to find the correct balance between overall throughput and individual operation latency for each table.
- Splitting a growing in-memory page into smaller pages and reconciliation both require exclusive access to the page which makes an application's write operations wait. Having a large memory_page_max means that the pages will need to be split and reconciled less often. But when that happens, the duration that an exclusive access to the page is required is longer, increasing the latency of an application's insert or update operations. Conversely, having a smaller memory_page_max reduces the time taken for splitting and reconciling the pages, but causes it to happen more frequently, forcing more frequent but shorter exclusive accesses to the pages.
- Applications should choose the memory_page_max value considering the trade-off between frequency of exclusive access to the pages (for reconciliation or splitting pages into smaller pages) versus the duration that the exclusive access is required.
- Configuration:
Specified as memory_page_max configuration option to WT_SESSION::create(). An example of such a configuration string is as follows:
"key_format=S,value_format=S,memory_page_max=10MB"
internal_page_max
The maximum page size for the reconciled on-disk internal pages of the B-Tree, in bytes. When an internal page grows past this size, it splits into multiple pages.
- An integer, with acceptable values between 512B and 512MB
- Default size: 4 KB (appropriate for applications with relatively small keys)
- Additionally constrained by the condition: the size must be a multiple of the allocation size
- Motivation to tune the value:
internal_page_max is significant for applications wanting to avoid excessive L2 cache misses while searching the tree.
- Recall that only keys are stored on internal pages, so the type and size of the key values for a table help drive the setting for this parameter.
- Should be sized to fit into on-chip caches.
- Applications doing full-table scans with out-of-memory workloads might increase internal_page_max to transfer more data per I/O.
- Influences the shape of the B-Tree, i.e. its depth and the number of children each page in the B-Tree has. To reach the desired key/value pair in the B-Tree, WiredTiger has to binary search the key range in a page to determine the child page to proceed to, and continue down the tree until it reaches the correct leaf page. Having an unusually deep B-Tree, or having too many children per page, can increase the time taken to traverse the B-Tree, slowing down the application. The number of children per page and, hence, the tree depth depend upon the number of keys that can be stored in an internal page, which is internal_page_max divided by key size. Applications should choose an internal_page_max size that keeps the B-Tree from getting too deep.
- Configuration:
Specified as internal_page_max configuration option to WT_SESSION::create(). An example of such a configuration string is as follows:
"key_format=S,value_format=S,internal_page_max=16KB,leaf_page_max=1MB"
leaf_page_max
The maximum page size for the reconciled on-disk leaf pages of the B-Tree, in bytes. When a leaf page grows past this size, it splits into multiple pages.
- An integer, with acceptable values between 512B and 512MB
- Default size: 32 KB (appropriate for applications with relatively small keys and values)
- Additionally constrained by the condition: must be a multiple of the allocation size
- Motivation to tune the value:
leaf_page_max is significant for applications wanting to maximize sequential data transfer from a storage device.
- Should be sized to maximize I/O performance (when reading from disk, it is usually desirable to read a large amount of data, assuming some locality of reference in the application's access pattern).
- Applications doing full-table scans through out-of-cache workloads might increase leaf_page_max to transfer more data per I/O.
- Applications focused on read/write amplification might decrease the page size to better match the underlying storage block size.
- Configuration:
Specified as leaf_page_max configuration option to WT_SESSION::create(). An example of such a configuration string is as follows:
"key_format=S,value_format=S,internal_page_max=16KB,leaf_page_max=1MB"
The following configuration items are rarely used. They are described for completeness:
allocation_size
This is the underlying unit of allocation for the file. As the unit of file allocation, it sets the minimum page size and how much space is wasted when storing small amounts of data and overflow items.
- an integer between 512B and 128 MB
- must be a power-of-two
- default : 4 KB
- Motivation to tune the value:
Most applications should not need to tune the allocation size.
- To be compatible with virtual memory page sizes and direct I/O requirements on the platform (4KB for most common server platforms)
- Smaller values decrease the file space required by overflow items.
- For example, if the allocation size is set to 4KB, an overflow item of 18,000 bytes requires 5 allocation units and wastes about 2KB of space. If the allocation size is 16KB, the same overflow item would waste more than 10KB.
- Configuration:
Specified as allocation_size configuration option to WT_SESSION::create(). An example of such a configuration string is as follows:
"key_format=S,value_format=S,allocation_size=4KB"
internal/leaf key/value max
- Overflow items
Overflow items are keys and values too large to easily store on a page. Overflow items are stored separately in the file from the page where the item logically appears, and so reading or writing an overflow item is more expensive than an on-page item, normally requiring additional I/O. Additionally, overflow values are not cached in memory. This means overflow items won't affect the caching behavior of the application. It also means that each time an overflow value is read, it is re-read from disk.
- internal_key_max
The largest key stored in an internal page, in bytes. If set, keys larger than the specified size are stored as overflow items.
- The default and the maximum allowed value are both one-tenth the size of a newly split internal page.
- leaf_key_max
The largest key stored in a leaf page, in bytes. If set, keys larger than the specified size are stored as overflow items.
- The default value is one-tenth the size of a newly split leaf page.
- leaf_value_max
The largest value stored in a leaf page, in bytes. If set, values larger than the specified size are stored as overflow items.
- The default is one-half the size of a newly split leaf page.
- If the size is larger than the maximum leaf page size, the page size is temporarily ignored when large values are written.
- Motivation to tune the values:
Most applications should not need to tune the maximum key and value sizes. Applications that require a small page size, but are also sensitive to the latency of the additional work needed to retrieve an overflow item, may find modifying these values useful.
Since overflow items are separately stored in the on-disk file, aren't cached and require additional I/O to access (read or write), applications should avoid creating overflow items.
- Since page sizes also determine the default size of overflow items, i.e., keys and values too large to easily store on a page, they can be configured to avoid performance penalties working with overflow items:
- Applications with large keys and values, and concerned with latency, might increase the page size to avoid creating overflow items, in order to avoid the additional cost of retrieving them.
- Applications with large keys and values, doing random searches, might decrease the page size to avoid wasting cache space on overflow items that aren't likely to be needed.
- Applications with large keys and values, doing table scans, might increase the page size to avoid creating overflow items, as the overflow items must be read into memory in all cases, anyway.
- internal_key_max, leaf_key_max and leaf_value_max configuration values allow applications to change the size at which a key or value will be treated as an overflow item.
- Most applications should not need to tune the maximum key and value sizes.
- The value of internal_key_max is relative to the maximum internal page size. Because the number of keys on an internal page determines the depth of the tree, the internal_key_max value can only be adjusted within a certain range, and the configured value will be automatically adjusted by WiredTiger, if necessary, to ensure a reasonable number of keys fit on an internal page.
- The values of leaf_key_max and leaf_value_max are not relative to the maximum leaf page size. If either is larger than the maximum page size, the page size will be ignored when the larger keys and values are being written, and a larger page will be created as necessary.
- Configuration:
Specified as the internal_key_max, leaf_key_max and leaf_value_max configuration options to WT_SESSION::create(). An example of a configuration string for a large leaf overflow value:
"key_format=S,value_format=S,leaf_page_max=16KB,leaf_value_max=256KB"
split_pct (split percentage)
The size (specified as a percentage of the internal/leaf page_max) at which the reconciled page must be split into multiple smaller pages before being sent for compression and then written to the disk. If the reconciled page can fit into a single on-disk page without the page growing beyond its set max size, split_pct is ignored and the page isn't split.
- an integer between 25 and 100
- default : 75
- Motivation to tune the value:
Most applications should not need to tune the split percentage size.
- This value should be selected to avoid creating a large number of tiny pages or repeatedly splitting whenever new entries are inserted.
For example, if the maximum page size is 1MB, a split_pct value at the bottom of the allowed range (25%) would potentially result in creating a large number of 256KB pages, which may not be optimal for future I/O. Or, if the maximum page size is 1MB, a split_pct value of 90% would potentially result in repeatedly splitting pages as the split pages grow to 1MB over and over. The default value for split_pct is 75%, intended to keep large pages relatively large, while still giving split pages room to grow.
- Configuration:
Specified as split_pct configuration option to WT_SESSION::create(). An example of such a configuration string is as follows:
"key_format=S,value_format=S,split_pct=60"
Compression considerations
WiredTiger compresses data at several stages to conserve memory and disk space. Applications can configure these different compression algorithms to balance their requirements between memory, disk and CPU consumption. Compression algorithms other than block compression work by modifying how the keys and values are represented, and hence reduce data size both in memory and on disk. Block compression, on the other hand, compresses the data in its binary representation while saving it on the disk.
Configuring compression may change application throughput. For example, in applications using solid-state drives (where I/O is less expensive), turning off compression may increase application performance by reducing CPU costs; in applications where I/O costs are more expensive, turning on compression may increase application performance by reducing the overall number of I/O operations.
WiredTiger also uses some internal algorithms to reduce the amount of data stored; these are not configurable and are always on. For example, run-length encoding reduces the size requirement by storing sequential, duplicate values only a single time (with an associated count).
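Storing a run of duplicate values once, together with a count, can be illustrated with a few lines of Python. This is a generic sketch of run-length encoding and does not represent WiredTiger's internal cell format. The function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into (value, count)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


column = ["a", "a", "a", "b", "c", "c"]
encoded = run_length_encode(column)
print(encoded)  # [('a', 3), ('b', 1), ('c', 2)]
```

Six stored values shrink to three (value, count) pairs, and decoding restores the original sequence exactly.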
Different compression options available with WiredTiger:
- Key-prefix
- Reduces the size requirement by storing any identical key prefix only once per page. The cost is additional CPU and memory when operating on the in-memory tree. Specifically, reverse sequential cursor movement (but not forward) through a prefix-compressed page or the random lookup of a key/value pair will allocate sufficient memory to hold some number of uncompressed keys. So, for example, if key prefix compression only saves a small number of bytes per key, the additional memory cost of instantiating the uncompressed key may mean prefix compression is not worthwhile. Further, in cases where the on-disk cost is the primary concern, block compression may mean prefix compression is less useful.
- Configuration:
Specified as prefix_compression configuration option to WT_SESSION::create(). Applications may limit the use of prefix compression by configuring the minimum number of bytes that must be gained before prefix compression is used with prefix_compression_min configuration option. An example of such a configuration string is as follows:
"key_format=S,value_format=S,prefix_compression=true,prefix_compression_min=7"
- Dictionary
- Reduces the size requirement by storing any identical value only once per page.
- Configuration:
Specified as the dictionary configuration option to WT_SESSION::create(), which specifies the maximum number of unique values remembered in the B-Tree row-store leaf page value dictionary. An example of such a configuration string is as follows:
"key_format=S,value_format=S,dictionary=1000"
- Huffman
- Reduces the size requirement by compressing individual key/value items, and can be separately configured for either or both keys and values. The additional CPU cost of Huffman encoding can be high, and should be considered. (See Huffman Encoding for details.)
- Configuration:
Specified as huffman_key and/or huffman_value configuration option to WT_SESSION::create(). These options can take values of "english" (to use a built-in English language frequency table), "utf8<file>" or "utf16<file>" (to use a custom utf8 or utf16 symbol frequency table file). An example of such a configuration string is as follows:
"key_format=S,value_format=S,huffman_key=english,huffman_value=english"
- Block Compression
- Reduces the size requirement of on-disk objects by compressing blocks of the backing object's file. The additional CPU cost of block compression can be high, and should be considered. When block compression has been configured, configured page sizes will not match the actual size of the page on disk.
- WiredTiger provides two methods of compressing your data when using block compression: the raw and noraw methods. These methods change how WiredTiger works to fit data into the blocks that are stored on disk. Applications needing to write specific sized blocks may want to consider implementing a WT_COMPRESSOR::compress_raw function.
- Noraw compression:
A fixed amount of data is given to the compression system, then turned into a compressed block of data. The amount of data chosen to compress is the data needed to fill the uncompressed block. Thus when compressed, the block will be smaller than the normal data size and the sizes written to disk will often vary depending on how compressible the data being stored is. Algorithms using noraw compression include zlib-noraw, lz4-noraw and snappy. Noraw compression is better suited for workloads with random access patterns because each block will tend to be smaller and require less work to read and decompress.
- Raw compression:
WiredTiger's raw compression takes advantage of compressors that provide a streaming compression API. Using the streaming API, WiredTiger will try to fit as much data as possible into one block. This means that blocks created with raw compression should be of similar size. Using a streaming compression method should also make for less overhead in compression, as the setup and initial work for compressing is done fewer times compared to the amount of data stored. Algorithms using raw compression include zlib and lz4. Compared to noraw, raw compression provides more compression while using more CPU. Raw compression may provide a performance advantage in workloads where data is accessed sequentially, because more data is generally packed into each block on disk.
- Configuration:
Specified as the block_compressor configuration option to WT_SESSION::create(). If WiredTiger has builtin support for "lz4", "snappy", "zlib" or "zstd" compression, these names are available as the value to the option. An example of such a configuration string is as follows:
"key_format=S,value_format=S,block_compressor=snappy"
See Compressors for further information on how to configure and enable different compression options.
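To make the key-prefix option described above more concrete, the following toy model stores each key as the number of bytes it shares with the previous key plus the remaining suffix. It is a sketch of the general technique only and is not WiredTiger's page format. The function names are hypothetical, and the prefix_compression_min parameter merely imitates the meaning of the option of the same name.

```python
def prefix_compress(sorted_keys, prefix_compression_min=4):
    """Encode each key as (shared_prefix_len, suffix) against the previous key."""
    encoded, previous = [], b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(key), len(previous))
        while shared < limit and key[shared] == previous[shared]:
            shared += 1
        if shared < prefix_compression_min:
            shared = 0  # gain too small: store the key in full
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decompress(encoded):
    """Rebuild full keys; each key depends on the one before it."""
    keys, previous = [], b""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        keys.append(previous)
    return keys


keys = [b"user:1000:email", b"user:1000:name", b"user:1001:email", b"zebra"]
packed = prefix_compress(keys)
print(packed)
```

Because each key is rebuilt from its predecessor, reading a key in the middle of the page means materializing earlier keys first. This is the kind of extra CPU and memory cost that the key-prefix description above attributes to reverse cursor movement and random lookups.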
Table summarizing compression in WiredTiger
Compression Type | Supported by row-store | Supported by variable col-store | Supported by fixed col-store | Default config | Reduces in-mem size | Reduces on-disk size | CPU and Memory cost |
Key-prefix | yes | no | no | disabled | yes | yes | minor |
Dictionary | yes | yes | no | disabled | yes | yes | minor |
Huffman | yes | yes | no | disabled | yes | yes | can be high |
Block | yes | yes | yes | disabled | no | yes | can be high |
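The Block row of the table, together with the earlier note that configured page sizes will not match the actual size of the page on disk once block compression is enabled, can be illustrated with Python's standard zlib module standing in for a WiredTiger block compressor. This is a sketch under stated assumptions: the page image is compressed as a whole, the result is rounded up to whole allocation units, and page headers are ignored. The function name and the sample data are hypothetical.

```python
import os
import zlib


def on_disk_size(page, allocation_size=4096):
    """Compress a page image and round up to whole allocation units."""
    compressed = zlib.compress(page)
    units = -(-len(compressed) // allocation_size)  # ceiling division
    return units * allocation_size


page_max = 32 * 1024
repetitive = b"status=ok;" * (page_max // 10)  # highly compressible
random_like = os.urandom(page_max)             # effectively incompressible

print(len(repetitive), on_disk_size(repetitive))
print(len(random_like), on_disk_size(random_like))
```

Two pages of nearly the same in-memory size end up with very different on-disk sizes depending on how compressible their contents are.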