Version 12.0.0
Page Deltas
Data Structures: WTI_DELTA_LEAF_MERGE_STATE, WT_PAGE_BLOCK_META, WT_PAGE_DELTA_CONFIG
Source Location: src/include/btmem.h, src/include/btree.h, src/include/connection.h, src/btree/bt_page.c, src/btree/bt_read.c, src/reconcile/rec_row.c, src/reconcile/rec_write.c

Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.

WiredTiger's page delta feature reduces write amplification and I/O by writing only the changed entries for a page (a delta) rather than always rewriting the full page image during reconciliation. This feature is used exclusively with disaggregated (layered) row-store btrees.

Goals and high-level model

Instead of always writing a full page image on reconciliation, WiredTiger can write:

  • A base page image (full page), and
  • A bounded chain of deltas on top of that base.

On disk the block for a page therefore contains either:

  • Only a full page image, or
  • A full page image plus one or more delta images.

A delta chain is the sequence of deltas that must be applied to the most recent full page image (the "base image") to reconstruct the latest page state. Its maximum length is bounded by the page_delta.max_consecutive_delta configuration option.

Page deltas only apply to row-store pages (WT_PAGE_ROW_LEAF and WT_PAGE_ROW_INT). Column-store pages do not participate in this framework. Deltas are also never written for non-disaggregated tables: the btree must have the WT_BTREE_DISAGGREGATED flag set (it is set automatically for layered btrees).

Configuration

Page delta behavior is controlled via the page_delta sub-config at wiredtiger_open or reconfigure time:

  • page_delta.leaf_page_delta (bool): enable leaf page deltas.
  • page_delta.internal_page_delta (bool): enable internal page deltas.
  • page_delta.delta_pct (int): maximum allowed size of a delta as a percentage of the full page size. Candidates that exceed this threshold fall back to a full page write.
  • page_delta.max_consecutive_delta (int): maximum number of deltas in a chain for a single page. When this limit would be exceeded, reconciliation writes a full page image and resets the chain.
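
As an illustration of how these four options fit together, the following minimal sketch formats them into a single page_delta configuration fragment using WiredTiger's usual nested key=(...) string syntax. The helper name sketch_page_delta_config is hypothetical and not part of WiredTiger. The resulting string would be only one part of a complete wiredtiger_open or reconfigure configuration, and any values passed in are illustrative rather than recommended.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a page_delta configuration fragment into buf. The key names come
 * from the option list above; the helper itself is hypothetical.
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 */
static int
sketch_page_delta_config(char *buf, size_t len, bool leaf, bool internal,
  int delta_pct, int max_consecutive_delta)
{
    int n = snprintf(buf, len,
      "page_delta=(leaf_page_delta=%s,internal_page_delta=%s,"
      "delta_pct=%d,max_consecutive_delta=%d)",
      leaf ? "true" : "false", internal ? "true" : "false",
      delta_pct, max_consecutive_delta);

    /* snprintf returns the length it wanted to write; detect truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The fragment can then be appended to the rest of the connection configuration before it is passed to wiredtiger_open.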

These settings are stored at the connection level in WT_PAGE_DELTA_CONFIG (see src/include/connection.h). The flags WT_LEAF_PAGE_DELTA and WT_INTERNAL_PAGE_DELTA are derived from them at startup or reconfig time.

Whether a given btree and page type actually use deltas is controlled by the macros WT_DELTA_LEAF_ENABLED and WT_DELTA_INT_ENABLED, both of which additionally require the btree to have the WT_BTREE_DISAGGREGATED flag set.
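
The two-level check described above (connection-level flag plus btree-level flag) can be sketched as follows. The SKETCH_ flag values and function names are assumptions made for illustration only. The real definitions of WT_LEAF_PAGE_DELTA, WT_INTERNAL_PAGE_DELTA, WT_BTREE_DISAGGREGATED and the enable macros live in the WiredTiger headers and may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real ones are defined in WiredTiger headers. */
#define SKETCH_LEAF_PAGE_DELTA 0x1u     /* connection-level: leaf deltas on */
#define SKETCH_INTERNAL_PAGE_DELTA 0x2u /* connection-level: internal deltas on */
#define SKETCH_BTREE_DISAGGREGATED 0x1u /* btree-level: disaggregated btree */

/* Leaf deltas need both the connection flag and a disaggregated btree. */
static bool
sketch_delta_leaf_enabled(uint32_t conn_delta_flags, uint32_t btree_flags)
{
    return (btree_flags & SKETCH_BTREE_DISAGGREGATED) != 0 &&
      (conn_delta_flags & SKETCH_LEAF_PAGE_DELTA) != 0;
}

/* Internal deltas follow the same pattern with their own connection flag. */
static bool
sketch_delta_int_enabled(uint32_t conn_delta_flags, uint32_t btree_flags)
{
    return (btree_flags & SKETCH_BTREE_DISAGGREGATED) != 0 &&
      (conn_delta_flags & SKETCH_INTERNAL_PAGE_DELTA) != 0;
}
```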

Generating deltas (reconciliation)

Deltas are generated during reconciliation (__wt_reconcile). The two page types follow different paths:

  • For a leaf page: after the full disk image is built, the delta build function is called if the page has not undergone an in-memory split and the reconciliation produced a single block result.
  • For an internal page: the delta decision and entry packing happen while the full page image is being built. Modified children are packed into the delta buffer as the child-walk proceeds. Once the full page image is complete, the write generation field is set in the already-populated delta buffer.

After a candidate delta is built, the write path checks whether the delta can be written or whether a full page image is required:

  1. If block_meta->delta_count has reached max_consecutive_delta, the delta is rejected and a full page image is written instead, resetting the chain.
  2. If the built delta's size exceeds delta_pct percent of the full page image, it is also rejected.
  3. Several structural reasons can also prevent a delta from being used: the reconcile produced multiple blocks, the page has no stable LSN yet, the result is not a single page, the leaf delta is empty, or the delta build returned no entries.

When a delta is accepted, it is written by __rec_write_delta and block_meta->delta_count is incremented.
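
The acceptance rules above can be sketched as a single decision function. This is a simplified model under stated assumptions, not the logic of __rec_write_delta or its callers: the type and function names are hypothetical, the structural checks are collapsed into one boolean, and the exact comparison operators used by the real code may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SKETCH_WRITE_FULL_IMAGE, SKETCH_WRITE_DELTA } sketch_write_kind;

/* Hypothetical stand-in for the delta_count field of the block metadata. */
struct sketch_block_meta {
    uint8_t delta_count;
};

/*
 * Decide between a delta and a full page image, then update the chain length.
 * structurally_eligible summarizes the structural checks (single block
 * result, stable LSN available, and so on).
 */
static sketch_write_kind
sketch_choose_write(struct sketch_block_meta *meta, bool structurally_eligible,
  size_t delta_size, size_t full_size, unsigned delta_pct,
  unsigned max_consecutive_delta)
{
    /* An empty delta is never written. */
    bool use_delta = structurally_eligible && delta_size > 0;

    /* Rule 1: the chain is already at its configured (or physical) limit. */
    if (meta->delta_count >= max_consecutive_delta ||
      meta->delta_count == UINT8_MAX)
        use_delta = false;

    /* Rule 2: delta larger than delta_pct percent of the full image. */
    if (delta_size * 100 > full_size * delta_pct)
        use_delta = false;

    if (use_delta) {
        meta->delta_count++;
        return SKETCH_WRITE_DELTA;
    }

    /* A full page image starts a new chain. */
    meta->delta_count = 0;
    return SKETCH_WRITE_FULL_IMAGE;
}
```

The percentage test multiplies instead of dividing so that small pages do not lose precision to integer division.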

Leaf page deltas

__rec_build_delta_leaf iterates over the saved-updates list (supd) for the page. For each key whose selected on-page value has changed since the last reconciliation (__rec_selected_key_changed), it packs a key/value entry into the delta buffer via __wti_rec_pack_delta_row_leaf. Only keys that actually changed are included, making the delta compact for workloads with sparse updates. Each packed entry carries full MVCC time-window metadata so reconstruction can respect visibility rules identically to a full page image.
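
The filtering step can be illustrated with a minimal sketch that copies only changed entries into a delta array. Everything here is a simplification for illustration: the struct and function names are hypothetical, the changed flag stands in for the __rec_selected_key_changed check, and time-window metadata and the real packed cell format are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one key on a leaf page. */
struct sketch_leaf_entry {
    const char *key;
    const char *value;
    bool changed; /* stand-in for "selected value changed since last reconcile" */
};

/*
 * Copy the changed entries, in their original order, into delta.
 * Returns the number of entries packed; stops early if delta is full.
 */
static size_t
sketch_build_leaf_delta(const struct sketch_leaf_entry *entries, size_t n,
  struct sketch_leaf_entry *delta, size_t delta_cap)
{
    size_t out = 0;

    for (size_t i = 0; i < n && out < delta_cap; i++)
        if (entries[i].changed)
            delta[out++] = entries[i];

    return out;
}
```

A return value of zero corresponds to the empty-delta case, in which no delta is written.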

Internal page deltas

Internal page deltas are built during the row-store internal reconcile pass in src/reconcile/rec_row.c. Each child-address change (insert, update, or delete of a sub-tree pointer) is packed into r->delta.

Key files

  • src/reconcile/rec_write.c: Delta creation decision and writing (__rec_build_delta, __rec_split_write, __rec_write_delta)
  • src/reconcile/rec_row.c: Delta packing per row for internal pages (__rec_pack_delta_row_int, __rec_build_delta_int) and leaf pages (__rec_build_delta_leaf entry helpers)
  • src/reconcile/rec_visibility.c: Update selection and time-window handling used by leaf delta packing
  • src/btree/bt_page.c: Merge helpers on read (__wti_page_merge_deltas_with_base_image_leaf, __wti_page_merge_deltas_with_base_image_int)
  • src/btree/bt_read.c: Page fault-in and delta reconstruction (__page_read_build_full_disk_image)
  • src/include/btmem.h: Chain-tracking structures (WT_PAGE_BLOCK_META, WT_PAGE_DISAGG_INFO)
  • src/include/connection.h: Connection-level delta config (WT_PAGE_DELTA_CONFIG)
  • src/include/btree.h: Enable macros (WT_DELTA_LEAF_ENABLED, WT_DELTA_INT_ENABLED)
  • src/reconcile/reconcile_private.h: Reconcile-time macros (WT_BUILD_DELTA_LEAF, WT_BUILD_DELTA_INT, WT_REC_RESULT_SINGLE_PAGE)

On-disk layout and chain tracking

The per-page delta chain state is kept in WT_PAGE_BLOCK_META (see src/include/btmem.h):

  • delta_count: number of deltas written on top of the current base image (a uint8_t, so the theoretical maximum is 255; keep max_consecutive_delta well below this in practice).
  • base_lsn: LSN of the base full-image block in the page log.
  • backlink_lsn: LSN of the previous delta (or base) in the chain.
  • cumulative_size: total byte size of the base image plus all deltas.

WT_PAGE_BLOCK_META lives inside WT_PAGE_DISAGG_INFO, which is attached to the in-memory page via page->disagg_info.
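
To make the roles of these fields concrete, the following sketch models how they might evolve as a base image and subsequent deltas are written. It is an assumption-laden model, not the WiredTiger structure. Only delta_count's uint8_t type comes from the description above. The other field widths, the tail_lsn helper field, the zero backlink for a base image, and both function names are hypothetical.

```c
#include <stdint.h>

/* Simplified model of the chain-tracking fields described above. */
struct sketch_chain_meta {
    uint8_t delta_count;      /* deltas on top of the current base image */
    uint64_t base_lsn;        /* LSN of the base full-image block (width assumed) */
    uint64_t backlink_lsn;    /* LSN of the previous block in the chain */
    uint64_t cumulative_size; /* base image plus all deltas, in bytes */
    uint64_t tail_lsn;        /* hypothetical: LSN of the latest written block */
};

/* Writing a full page image starts a new chain. */
static void
sketch_on_full_write(struct sketch_chain_meta *m, uint64_t lsn, uint64_t size)
{
    m->delta_count = 0;
    m->base_lsn = lsn;
    m->backlink_lsn = 0; /* assumption: a base image has no predecessor */
    m->cumulative_size = size;
    m->tail_lsn = lsn;
}

/* Writing a delta extends the chain and links back to the previous block. */
static void
sketch_on_delta_write(struct sketch_chain_meta *m, uint64_t lsn, uint64_t size)
{
    m->delta_count++;
    m->backlink_lsn = m->tail_lsn;
    m->cumulative_size += size;
    m->tail_lsn = lsn;
}
```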

Reconstruction on read

When a page is read from storage and has delta_count > 0, the read path in __page_read_build_full_disk_image (see src/btree/bt_read.c) merges the base image with all delta images into a single complete disk image before building the in-memory page. The rest of the read path then proceeds identically to the non-delta case.

The merge helpers live in src/btree/bt_page.c:

  • __wti_page_merge_deltas_with_base_image_leaf: merges base leaf image and all leaf deltas in key order. For each key the newest delta entry wins; delta deletes remove the key. The result is a valid leaf page disk image reflecting the latest logical state.
  • __wti_page_merge_deltas_with_base_image_int: merges base internal image and all internal deltas. Each child address is either kept from the base or replaced/removed according to the newest delta entry.
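
The leaf merge rule (newest entry wins, deletes remove the key) can be sketched as a two-way merge of sorted arrays. This is a minimal model under stated assumptions, not the code in src/btree/bt_page.c: the names are hypothetical, keys are plain C strings compared with strcmp, and deltas are applied one at a time from oldest to newest, whereas the real helper may merge the whole chain in a single pass.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified key/value entry. */
struct sketch_kv {
    const char *key;
    const char *value;
    bool deleted; /* only meaningful for delta entries */
};

/*
 * Apply one key-sorted delta to a key-sorted image. out must have room for
 * base_n + delta_n entries. Returns the number of entries written to out.
 */
static size_t
sketch_apply_delta(const struct sketch_kv *base, size_t base_n,
  const struct sketch_kv *delta, size_t delta_n, struct sketch_kv *out)
{
    size_t b = 0, d = 0, o = 0;

    while (b < base_n || d < delta_n) {
        int cmp;

        if (b == base_n)
            cmp = 1; /* only delta entries remain */
        else if (d == delta_n)
            cmp = -1; /* only base entries remain */
        else
            cmp = strcmp(base[b].key, delta[d].key);

        if (cmp < 0) {
            out[o++] = base[b++]; /* key untouched by this delta */
        } else {
            if (!delta[d].deleted)
                out[o++] = delta[d]; /* insert or replace */
            d++;
            if (cmp == 0)
                b++; /* the delta entry supersedes the base entry */
        }
    }
    return o;
}
```

Applying the deltas of a chain in write order, each to the output of the previous step, gives the newest-wins behavior: a later delta always overrides what an earlier one produced for the same key.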

Tuning

The primary tuning levers are:

  • page_delta.delta_pct: raise to allow larger deltas, lower to keep them compact.
  • page_delta.max_consecutive_delta: raise to extend chains and maximize write savings, lower to bound read reconstruction cost.

The right balance depends on the read/write ratio and how many keys are updated per page per checkpoint cycle. Write-heavy workloads with sparse updates benefit most. Read-heavy workloads, or workloads that touch every key on a page, should use short chains (a max_consecutive_delta of 4 to 8).