There are seven page and key/value size configuration strings:
allocation_size
),internal_page_max
and leaf_page_max
),internal_key_max
, leaf_key_max
and leaf_value_max
), and thesplit_pct
).All seven are specified to the WT_SESSION::create method, in other words, they are configurable on a per-file basis.
Applications commonly configure page sizes, based on their workload's typical key and value size. Once the correct page size has been chosen, appropriate defaults for the other configuration values are derived from the page sizes, and relatively few applications will need to modify the other page and key/value size configuration options.
An example of configuring page and key/value sizes:
The internal_page_max
and leaf_page_max
configuration values specify a maximum size for Btree internal and leaf pages. That is, when an internal or leaf page grows past that size, it splits into multiple pages. Generally, internal pages should be sized to fit into on-chip caches in order to minimize cache misses when searching the tree, while leaf pages should be sized to maximize I/O performance (if reading from disk is necessary, it is usually desirable to read a large amount of data, assuming some locality of reference in the application's access pattern).
The default page size configurations (2KB for internal_page_max
, 32KB for leaf_page_max
), are appropriate for applications with relatively small keys and values.
When block compression has been configured, configured page sizes will not match the actual size of the page on disk. Block compression in WiredTiger happens within the I/O subsystem, and so a page might split even if subsequent compression would result in a resulting page size small enough to leave as a single page. In other words, page sizes are based on in-memory sizes, not on-disk sizes. Applications needing to write specific sized blocks may want to consider implementing a WT_COMPRESSOR::compress_raw function.
The page sizes also determine the default size of overflow items, that is, keys and values too large to easily store on a page. Overflow items are stored separately in the file from the page where the item logically appears, and so reading or writing an overflow item is more expensive than an on-page item, normally requiring additional I/O. Additionally, overflow values are not cached in memory. This means overflow items won't affect the caching behavior of the application, but it also means that each time an overflow value is read, it is re-read from disk.
For both of these reasons, applications should avoid creating large numbers of commonly referenced overflow items. This is especially important for keys, as keys on internal pages are referenced during random searches, not just during data retrieval. Generally, applications should make every attempt to avoid creating overflow keys.
The internal_key_max
, leaf_key_max
and leaf_value_max
configuration values allow applications to change the size at which a key or value will be treated as an overflow item.
The value of internal_key_max
is relative to the maximum internal page size. Because the number of keys on an internal page determines the depth of the tree, the internal_key_max
value can only be adjusted within a certain range, and the configured value will be automatically adjusted by WiredTiger, if necessary to ensure a reasonable number of keys fit on an internal page.
The values of leaf_key_max
and leaf_value_max
are not relative to the maximum leaf page size. If either is larger than the maximum page size, the page size will be ignored when the larger keys and values are being written, and a larger page will be created as necessary.
Most applications should not need to tune the maximum key and value sizes. Applications requiring a small page size, but also having latency concerns such that the additional work to retrieve an overflow item is an issue, may find them useful.
An example of configuring a large leaf overflow value:
The split_pct
configuration string configures the size of a split page. When a page grows sufficiently large that it must be written as multiple disk blocks, the newly written block size is split_pct
percent of the maximum page size. This value should be selected to avoid creating a large number of tiny pages or repeatedly splitting whenever new entries are inserted. For example, if the maximum page size is 1MB, a split_pct
value of 10% would potentially result in creating a large number of 100KB pages, which may not be optimal for future I/O. Or, if the maximum page size is 1MB, a split_pct
value of 90% would potentially result in repeatedly splitting pages as the split pages grow to 1MB over and over. The default value for split_pct
is 75%, intended to keep large pages relatively large, while still giving split pages room to grow.
Most applications should not need to tune the split percentage size.
The allocation_size
configuration value is the underlying unit of allocation for the file. As the unit of file allocation, it sets the minimum page size and how much space is wasted when storing small amounts of data and overflow items. For example, if the allocation size is set to 4KB, an overflow item of 18,000 bytes requires 5 allocation units and wastes about 2KB of space. If the allocation size is 16KB, the same overflow item would waste more than 10KB.
The default allocation size is 4KB, chosen for compatibility with virtual memory page sizes and direct I/O requirements on common server platforms.
Most applications should not need to tune the allocation size; it is primarily intended for applications coping with the specific requirements some file systems make to support features like direct I/O.