Version 12.0.0
All Classes Functions Variables Typedefs Enumerations Enumerator Modules Pages
Checkpoint
Data StructuresSource Location
WT_CONNECTIONsrc/block/block_ckpt.c
src/block/block_ckpt_scan.c
src/checkpoint/checkpoint_ckptlist.c
src/checkpoint/checkpoint_conn.c
src/checkpoint/checkpoint_stats.c
src/checkpoint/checkpoint_txn.c
src/meta/meta_ckpt.c

Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.

Overview

A checkpoint is a known point in time from which WiredTiger can recover in the event of a crash or unexpected shutdown. Both checkpoints and logging provide durability guarantees. During recovery WiredTiger can start up from the last known checkpoint and replay the log, recovering the lost operations. WiredTiger checkpoints are created either via the API WT_SESSION::checkpoint, or internally. See Internal Callers for when checkpoint can be called internally by WiredTiger.

A checkpoint is performed within the context of a snapshot isolation transaction, as such the checkpoint has a consistent view of the database from beginning to end. Typically when running a checkpoint the configuration "use_timestamp=true" is specified. This instructs WiredTiger to set the checkpoint_timestamp to be the current stable_timestamp. As of the latest version of WiredTiger the checkpoint_timestamp timestamp is not used as a read_timestamp for the checkpoint transaction. The checkpoint_timestamp is written out with the metadata information for the checkpoint. On startup WiredTiger will set the stable_timestamp internally to the timestamp contained in the metadata, and rollback updates which are newer to the stable_timestamp see: WT_CONNECTION::rollback_to_stable.

The checkpoint algorithm

The checkpoint algorithm can be considered as a coordinator of multiple aspects of the system. It relies on the btree subsystem for traversing in memory pages, the block manager subsystem for handling on-disk blocks and the metadata subsystem for constructing and interpreting the metadata information.

A checkpoint can be broken up into the following stages:

Locking

The first stage of checkpoint is to gather the necessary checkpoint locks. The checkpoint_lock has the highest precedent ensuring only one checkpoint takes place at time. Other locks such as, dhandle, schema and flush locks are taken throughout the checkpoint process.

Eviction

Before the checkpoint transaction begins (pinning content), the eviction subsystem is leveraged to reduce the dirty content in the cache. Using eviction has the benefit of taking advantage of the multi-threaded functionality of eviction and reduces the work required during the critical section of the checkpoint. The thread performing the checkpoint operation will wait for eviction to reach the target dirty content ratio unless no progress is made.

Prepare

Checkpoint prepare sets up the checkpoint, it begins the checkpoint transaction, updates the global checkpoint state and gathers a list of handles to be checkpointed. A global schema lock wraps checkpoint prepare to avoid any tables being created or dropped during this phase, additionally the global transaction lock is taken during this process as it must modify the global transaction state, and to ensure the stable_timestamp doesn't move ahead of the snapshot taken by the checkpoint transaction. The connection dhandle list is walked one handle at a time, where each handle gathered refers to a specific btree. Additionally clean btrees, i.e. btrees without any modifications, are excluded from the list, with an exception for specific checkpoint configuration scenarios. See Skipping checkpoints for scenarios where btrees are skipped during checkpoint. Btrees included in the checkpoint list then delete earlier checkpoints when possible. Previous checkpoints can only be discarded if the associated blocks are no longer used.

Data files checkpoint

Data files in this instance refer to all the user created files. The main work of checkpoint is done here, the array of btree's collected in the prepare stage are iterated over. For each btree, the tree is walked and all the dirty pages are reconciled. Clean pages are skipped to avoid unnecessary work. Pages made clean ahead of the checkpoint by eviction are still skipped regardless of whether the update written by eviction is visible to the checkpoint transaction. The checkpoint guarantees that a clean version of every page in the tree exists and can be written to disk. The block manager subsystem is then responsible for managing the extent lists modified by the checkpoint (see block-checkpoint) and deleting old checkpoints.

History store checkpoint

The history store is checkpointed after the data files intentionally as during the reconciliation of the data files additional writes may be created in the history store and its important to include them in the checkpoint. The history store will follow the same checkpoint process as data files.

Flushing the files to disk

All the btrees checkpointed and the history are flushed to disk at this stage, WiredTiger will wait until that process has completed to continue with the checkpoint.

Metadata checkpoint

A new entry into the metadata file is created for every data file checkpointed, including the history store. As such the metadata file is the last file to be checkpointed. As WiredTiger maintains two checkpoints, the location of the most recent checkpoint is written to the turtle file.

Skipping checkpoints

It is possible that a checkpoint will be skipped. A checkpoint will be skipped when:

  • No modifications to the database have been made since the last checkpoint.
  • The last checkpoint timestamp is equal to the current stable timestamp.
  • There is no available space at the end of the file. This logic can be overridden by forcing a checkpoint via configuration.

Checkpoint generations

The checkpoint generation indicates which iteration of checkpoint a file has undergone, at the start of a checkpoint the generation is incremented. Then after processing any btree its checkpoint_gen is set to the latest checkpoint generation. Checkpoint generations impact visibility checks within WiredTiger, essentially if a btree is behind a checkpoint, i.e. its checkpoint generation is less than the current checkpoint generation, then the checkpoint transaction id and checkpoint timestamp are included in certain visibility checks. This prevents eviction from evicting updates from a given btree ahead of the checkpoint.

Garbage collection

While processing a btree, checkpoint can mark pages as obsolete. Any page that has an aggregated stop time pair which is globally visible will no longer be required by any reader and can be marked as deleted. This occurs prior to the page being reconciled, allowing for the page to be removed during the reconciliation. However this does not mean that the deleted page is available for re-use as it may be referenced by older checkpoints, once the older checkpoint is deleted the page is free to be used. If the freed pages exist at the end of the file the file can be truncated. Otherwise compaction will need to be initiated to shrink the file, see: WT_SESSION::compact.

Internal callers

WiredTiger may call checkpoint internally at any time. These can occur during: