This document aims to explain the role played by different page sizes in WiredTiger. It also details motivation behind an application wanting to modify these page sizes from their default values and the procedure to do so. Applications commonly configure page sizes based on their workload's typical key and value size. Once a page size has been chosen, appropriate defaults for the other configuration values are derived by WiredTiger from the page sizes, and relatively few applications will need to modify the other page and key/value size configuration options. WiredTiger also offers several compression options that have an impact on the size of the data both in-memory and on-disk. Hence while selecting page sizes, an application must also look at its desired compression needs. Since the data and workload for a table differs from one table to another in the database, an application can choose to set page sizes and compression options on a per-table basis.
Data life cycle
Before detailing each page size, here is a review of how data gets stored inside WiredTiger:
- WiredTiger uses the physical disks to store data durably, creating on-disk files for the tables in the database directory. It also caches the portion of the table being currently accessed by the application for reading or writing in main memory.
- WiredTiger maintains a table's data in memory using a data structure called a B-Tree ( B+ Tree to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
- The format of the in-memory pages is not the same as the format of the on-disk pages. Therefore, the in-memory pages regularly go through a process called reconciliation to create data structures appropriate for storage on the disk. These data structures are referred to as on-disk pages. An application can set a maximum size separately for the internal and leaf on-disk pages otherwise WiredTiger uses a default value. If reconciliation of an in-memory page is leading to an on-disk page size greater than this maximum, WiredTiger creates multiple smaller on-disk pages.
- A component of WiredTiger called the Block Manager divides the on-disk pages into smaller chunks called blocks, which then get written to the disk. The size of these blocks is defined by a parameter called allocation_size, which is the underlying unit of allocation for the file the data gets stored in. An application might choose to have data compressed before it gets stored to disk by enabling block compression.
- A database's tables are usually much larger than the main memory available. Not all of the data can be kept in memory at any given time. A process called eviction takes care of making space for new data by freeing the memory of data infrequently accessed. An eviction server regularly finds in-memory pages that have not been accessed in a while (following an LRU algorithm). Several background eviction threads continuously process these pages, reconcile them to disk and remove them from the main memory.
- When an application does an insert or an update of a key/value pair, the associated key is used to refer to an in-memory page. In the case of this page not being in memory, appropriate on-disk page(s) are read and an in-memory page constructed (the opposite of reconciliation). A data structure is maintained on every in-memory page to store any insertions or modifications to the data done on that page. As more and more data gets written to this page, the page's memory footprint keeps growing.
- An application can choose to set the maximum size a page is allowed to grow in-memory. A default size is set by WiredTiger if the application doesn't specify one. To keep page management efficient, as a page grows larger in-memory and approaches this maximum size, if possible, it is split into smaller in-memory pages.
- When doing an insert or an update, if a page grows larger than the maximum, the application thread is used to forcefully evict this page. This is done to split the growing page into smaller in-memory pages and reconcile them into on-disk pages. Once written to the disk they are removed from the main memory, making space for more data to be written. When an application gets involved in forced eviction, it might take longer than usual to do these inserts and updates. It is not always possible to (force) evict a page from memory and this page can temporarily grow larger in size than the configured maximum. This page then remains marked to be evicted and reattempts are made as the application puts more data in it.
Configurable page structures in WiredTiger
There are three page sizes that the user can configure:
- The maximum page size of any type of in-memory page in the WiredTiger cache, memory_page_max.
- The maximum size of the on-disk page for an internal page, internal_page_max.
- The maximum size of the on-disk leaf page, leaf_page_max.
There are additional configuration settings that tune more esoteric and specialized data. Those are included for completeness but are rarely changed.
memory_page_max
The maximum size a table's page is allowed to grow to in memory before being reconciled to disk.
- An integer, with acceptable values between 512B and 10TB
- Default size: 5 MB
- Additionally constrained by the condition: leaf_page_max <= memory_page_max <= cache_size/10
- Motivation to tune the value:
memory_page_max is significant for applications wanting to tune for consistency in write intensive workloads.
- This is the parameter to start with for tuning and trying different values to find the correct balance between overall throughput and individual operation latency for each table.
- Splitting a growing in-memory page into smaller pages and reconciliation both require exclusive access to the page which makes an application's write operations wait. Having a large memory_page_max means that the pages will need to be split and reconciled less often. But when that happens, the duration that an exclusive access to the page is required is longer, increasing the latency of an application's insert or update operations. Conversely, having a smaller memory_page_max reduces the time taken for splitting and reconciling the pages, but causes it to happen more frequently, forcing more frequent but shorter exclusive accesses to the pages.
- Applications should choose the memory_page_max value considering the trade-off between frequency of exclusive access to the pages (for reconciliation or splitting pages into smaller pages) versus the duration that the exclusive access is required.
- Configuration:
Specified as memory_page_max configuration option to WT_SESSION::create. An example of such a configuration string is as follows:
"key_format=S,value_format=S,memory_page_max=10MB"
internal_page_max
The maximum page size for the reconciled on-disk internal pages of the B-Tree, in bytes. When an internal page grows past this size, it splits into multiple pages.
- An integer, with acceptable values between 512B and 512MB
- Default size: 4 KB (*appropriate for applications with relatively small keys)
- Additionally constrained by the condition: the size must be a multiple of the allocation size
- Motivation to tune the value:
internal_page_max is significant for applications wanting to avoid excessive L2 cache misses while searching the tree.
- Recall that only keys are stored on internal pages, so the type and size of the key values for a table help drive the setting for this parameter.
- Should be sized to fit into on-chip caches.
- Applications doing full-table scans with out-of-memory workloads might increase internal_page_max to transfer more data per I/O.
- Influences the shape of the B-Tree, i.e. depth and the number of children each page in B-Tree has. To iterate to the desired key/value pair in the B-Tree, WiredTiger has to binary search the key-range in a page to determine the child page to proceed to and continue down the depth until it reaches the correct leaf page. Having an unusually deep B-Tree, or having too many children per page can negatively impact time taken to iterate the B-Tree, slowing down the application. The number of children per page and, hence, the tree depth depends upon the number of keys that can be stored in an internal page, which is internal_page_max divided by key size. Applications should choose an appropriate internal_page_max size that avoids the B-Tree from getting too deep.
- Configuration:
Specified as internal_page_max configuration option to WT_SESSION::create. An example of such a configuration string is as follows:
"key_format=S,value_format=S,internal_page_max=16KB,leaf_page_max=1MB"
leaf_page_max
The maximum page size for the reconciled on-disk leaf pages of the B-Tree, in bytes. When a leaf page grows past this size, it splits into multiple pages.
- An integer, with acceptable values between 512B and 512MB (fixed-length column store pages are limited to 128KB; the configured size does not include the additional size of timestamp information when timestamps are in use)
- Default size: 32 KB (*appropriate for applications with relatively small keys and values)
- Additionally constrained by the condition: must be a multiple of the allocation size
- Motivation to tune the value:
leaf_page_max is significant for applications wanting to maximize sequential data transfer from a storage device.
- Should be sized to maximize I/O performance (when reading from disk, it is usually desirable to read a large amount of data, assuming some locality of reference in the application's access pattern).
- Applications doing full-table scans through out-of-cache workloads might increase leaf_page_max to transfer more data per I/O.
- Applications focused on read/write amplification might decrease the page size to better match the underlying storage block size.
- In fixed-length column stores, a relatively large number of items fit on a single page; this can occasionally lead to contention issues that can be mitigated by making the pages smaller.
- For fixed-length column stores, the much-larger size of in-memory pending changes relative to on-disk values can lead to adverse eviction behavior if a large number of uncommitted changes are present on the same page. Applications that expect to update large numbers of adjacent values at once, especially in long-running transactions, may want to decrease
leaf_page_max
or, alternatively, increase memory_page_max
.
- Configuration:
Specified as leaf_page_max configuration option to WT_SESSION::create. An example of such a configuration string is as follows:
"key_format=S,value_format=S,internal_page_max=16KB,leaf_page_max=1MB"
The following configuration items following are rarely used. They are described for completeness:
allocation_size
This is the underlying unit of allocation for the file. As the unit of file allocation, it sets the minimum page size and how much space is wasted when storing small amounts of data and overflow items.
- an integer between 512B and 128 MB
- must a power-of-two
- default : 4 KB
- Motivation to tune the value:
Most applications should not need to tune the allocation size.
- To be compatible with virtual memory page sizes and direct I/O requirements on the platform (4KB for most common server platforms)
- Smaller values decrease the file space required by overflow items.
- For example, if the allocation size is set to 4KB, an overflow item of 18,000 bytes requires 5 allocation units and wastes about 2KB of space. If the allocation size is 16KB, the same overflow item would waste more than 10KB.
- Configuration:
Specified as allocation_size configuration option to WT_SESSION::create. An example of such a configuration string is as follows:
"key_format=S,value_format=S,allocation_size=4KB"
internal/leaf key/value max
- Overflow items
Overflow items are keys and values too large to easily store on a leaf page. Overflow items are stored separately in the file from the page where the item logically appears, and so reading an overflow item is more expensive than an on-page item, normally requiring additional I/O. Additionally, overflow values are not cached in memory. This means overflow items won't affect the caching behavior of the application. It also means that each time an overflow value is read, it is re-read from disk.
- leaf_key_max
The largest key stored in a leaf page, in bytes. If set, keys larger than the specified size are stored as overflow items.
- The default value is one-tenth the size of a newly split leaf page.
- leaf_value_max
The largest value stored in a leaf page, in bytes. If set, values larger than the specified size are stored as overflow items
- The default is one-half the size of a newly split leaf page.
- If the size is larger than the maximum leaf page size, the page size is temporarily ignored when large values are written.
- Motivation to tune the values:
Most applications should not need to tune the maximum key and value sizes. Applications requiring a small page size, but also having latency concerns such that the additional work to retrieve an overflow item may find modifying these values useful.
Since overflow items are separately stored in the on-disk file, aren't cached and require additional I/O to access (read or write), applications should avoid creating overflow items.
- Since page sizes also determine the default size of overflow items, i.e., keys and values too large to easily store on a page, they can be configured to avoid performance penalties working with overflow items:
- Applications with large keys and values, and concerned with latency, might increase the page size to avoid creating overflow items, in order to avoid the additional cost of retrieving them.
- Applications with large keys and values, doing random searches, might decrease the page size to avoid wasting cache space on overflow items that aren't likely to be needed.
- Applications with large keys and values, doing table scans, might increase the page size to avoid creating overflow items, as the overflow items must be read into memory in all cases, anyway.
- leaf_key_max and leaf_value_max configuration values allow applications to change the size at which a key or value will be treated as an overflow item.
- Most applications should not need to tune the maximum key and value sizes.
- The values of leaf_key_max and leaf_value_max are not relative to the maximum leaf page size. If either is larger than the maximum page size, the page size will be ignored when the larger keys and values are being written, and a larger page will be created as necessary.
- Configuration:
Specified as leaf_key_max and leaf_value_max configuration options to WT_SESSION::create. An example of configuration string for a large leaf overflow value:
"key_format=S,value_format=S,leaf_page_max=16KB,leaf_value_max=256KB"
split_pct (split percentage)
The size (specified as percentage of internal/leaf page_max) at which the reconciled page must be split into multiple smaller pages before being sent for compression and then be written to the disk. If the reconciled page can fit into a single on-disk page without the page growing beyond it's set max size, split_pct is ignored and the page isn't split.
- an integer between 50 and 100
- default : 90
- Motivation to tune the value:
Most applications should not need to tune the split percentage size.
- This value should be selected based on the expected workload. Workloads that perform many random updates benefit from values around 66%, leaving room on the newly split pages for future updates. Append-only workloads benefit from a value of 100%, as they will not add updates to the newly split pages. The default value for split_pct is 90%, intended as a balance between these two workload types.
- Configuration:
Specified as split_pct configuration option to WT_SESSION::create. An example of such a configuration string is as follows:
"key_format=S,value_format=S,split_pct=80"
Compression considerations
WiredTiger compresses data at several stages to preserve memory and disk space. Applications can configure these different compression algorithms to tailor their requirements between memory, disk and CPU consumption. Compression algorithms other than block compression work by modifying how the keys and values are represented, and will reduce data size both in-memory and on-disk. Block compression compresses the data in its binary representation while saving it on the disk, and so only reduces the data size on-disk.
Configuring compression may change application throughput. For example, in applications using solid-state drives (where I/O is less expensive), turning off compression may increase application performance by reducing CPU costs; in applications where I/O costs are more expensive, turning on compression may increase application performance by reducing the overall number of I/O operations.
WiredTiger uses some internal algorithms to compress the amount of data stored that are not configurable, but always on. For example, run-length encoding reduces in-memory and on-disk size requirements by storing sequential, duplicate values in a column-store object only a single time.
Different compression options available with WiredTiger:
- Key-prefix
Reduces the size requirement by storing any identical key prefix only once per page. The cost is additional CPU and memory when operating on the in-memory tree. Specifically, reverse sequential cursor movement (but not forward) through a prefix-compressed page or the random lookup of a key/value pair will allocate sufficient memory to hold some number of uncompressed keys. So, for example, if key prefix compression only saves a small number of bytes per key, the additional memory cost of instantiating the uncompressed key may mean prefix compression is not worthwhile. Further, in cases where the on-disk cost is the primary concern, block compression may mean prefix compression is less useful.
Key-prefix configuration:
Specified using the
prefix_compression configuration option to WT_SESSION::create. Applications may limit the use of prefix compression by configuring the minimum number of bytes that must be gained before prefix compression is used with the prefix_compression_min
configuration option. An example of such a configuration string is as follows:
"key_format=S,value_format=S,prefix_compression=true,prefix_compression_min=7"
- Dictionary
Reduces the size requirement by storing any identical value only once per page.
Dictionary configuration:
Specified using the dictionary
configuration option to WT_SESSION::create, which specifies the maximum number of unique values remembered in the B-Tree row-store leaf page value dictionary. An example of such a configuration string is as follows:
"key_format=S,value_format=S,dictionary=1000"
- Huffman
Reduces the size requirement by compressing individual value items, and can be configured for values. The additional CPU cost of Huffman encoding can be high, and should be considered. (See Huffman Encoding for details.)
Huffman configuration:
Specified using the huffman_value
configuration options to WT_SESSION::create. These options can take values of "english" (to use a built-in English language frequency table), "utf8<file>" or "utf16<file>" (to use a custom UTF-8 or UTF-16 symbol frequency table file). An example of such a configuration string is as follows:
"key_format=S,value_format=S,huffman_value=english"
- Block Compression
Reduces the size requirement of on-disk objects by compressing blocks of the backing object's file. The additional CPU cost of block compression can be high, and should be considered. When block compression has been configured, configured page sizes will not match the actual size of the page on disk.
When block compression is configured, chunks of data are given to the compression system, then returned as a compressed block of data. The amount of data chosen to compress is based on previous compression results. When compressed, the size written to disk will vary depending on the compressibility of the stored data.
Block compression configuration:
Specified using the block_compressor
configuration option to WT_SESSION::create. If WiredTiger was built with support for "lz4", "snappy", "zlib" or "zstd" compression, these names are available as the value to the configuration option. An example of such a configuration string is as follows:
"key_format=S,value_format=S,block_compressor=snappy"
See Compressors for further information on how to configure and enable different compression options.
Table summarizing compression in WiredTiger
Compression Type | Supported by row-store | Supported by variable col-store | Supported by fixed col-store | Default config | Reduces in-mem size | Reduces on-disk size | CPU and Memory cost |
Key-prefix | yes | no | no | disabled | yes | yes | minor |
Dictionary | yes | yes | no | disabled | yes | yes | minor |
Huffman | yes | yes | no | disabled | yes | yes | can be high |
Block | yes | yes | yes | disabled | no | yes | can be high |