Version 2.4.1
Page and overflow item sizes

There are four page and item size configuration values: internal_page_max, internal_item_max, leaf_page_max and leaf_item_max. All four should be specified to the WT_SESSION::create method, that is, they are configurable on a per-file basis.

The internal_page_max and leaf_page_max configuration values specify the maximum size for Btree internal and leaf pages. That is, when an internal or leaf page reaches the specified size, it splits into two pages. Generally, internal pages should be sized to fit into the system's L1 or L2 caches in order to minimize cache misses when searching the tree, while leaf pages should be sized to maximize I/O performance (if reading from disk is necessary, it is usually desirable to read a large amount of data, assuming some locality of reference in the application's access pattern).

The internal_item_max and leaf_item_max configuration values specify the maximum size at which an object will be stored on-page. Larger items will be stored separately in the file from the page where the item logically appears. Referencing overflow items is more expensive than referencing on-page items, requiring additional I/O if the object is not already cached. For this reason, it is important to avoid creating large numbers of overflow items that are repeatedly referenced, and the maximum item size should probably be increased if many overflow items are being created. Because pages must be large enough to store any item that is not an overflow item, increasing the size of the overflow items may also require increasing the page sizes.

With respect to compression, page and item sizes do not necessarily reflect the actual size of the page or item on disk, if block compression has been configured. Block compression in WiredTiger happens within the disk I/O subsystem, and so a page might split even if subsequent compression would result in a resulting page size that would be small enough to leave as a single page. In other words, page and overflow sizes are based on in-memory sizes, not disk sizes.

There are two other, related configuration values, also settable by the WT_SESSION::create method. They are allocation_size, and split_pct.

The allocation_size configuration value is the underlying unit of allocation for the file. As the unit of file allocation, it has two effects: first, it limits the ultimate size of the file, and second, it determines how much space is wasted when storing overflow items.

By limiting the size of the file, the allocation size limits the amount of data that can be stored in a file. For example, if the allocation size is set to the minimum possible (512B), the maximum file size is 2TB, that is, attempts to allocate new file blocks will fail when the file reaches 2TB in size. If the allocation size is set to the maximum possible (512MB), the maximum file size is 2EB.

The unit of allocation also determines how much space is wasted when storing overflow items. For example, if the allocation size were set to the minimum value of 512B, an overflow item of 1100 bytes would require 3 allocation sized file units, or 1536 bytes, wasting almost 500 bytes. For this reason, as the allocation size increases, page sizes and overflow item sizes will likely increase as well, to ensure that significant space isn't wasted by overflow items.

The last configuration value is split_pct, which configures the size of a split page. When a page grows sufficiently large that it must be written as multiple disk blocks, the newly written block size is split_pct percent of the maximum page size. This value should be selected to avoid creating a large number of tiny pages or repeatedly splitting whenever new entries are inserted. For example, if the maximum page size is 1MB, a split_pct value of 10% would potentially result in creating a large number of 100KB pages, which may not be optimal for future I/O. Or, if the maximum page size is 1MB, a split_pct value of 90% would potentially result in repeatedly splitting pages as the split pages grow to 1MB over and over. The default value for split_pct is 75%, intended to keep large pages relatively large, while still giving split pages room to grow.

An example of configuring page sizes:

ret = session->create(session, "file:example",
"key_format=u,"
"internal_page_max=32KB,internal_item_max=1KB,"
"leaf_page_max=1MB,leaf_item_max=32KB");