Version 11.1.0
Commit-level durability

The next level of WiredTiger transactional application involves adding commit-level durability for data modifications. As described in Checkpoint-level durability, WiredTiger supports checkpoint durability by default. Commit-level durability requires additional configuration.

Enabling commit-level durability

To enable commit-level durability, pass the log=(enabled) configuration string to wiredtiger_open. This causes WiredTiger to write records into the log for each transaction, giving all objects opened in the database commit-level durability. The operational transactional API does not otherwise change.
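A minimal open call might look like the following sketch (the home directory name and error handling are illustrative):

```c
#include <stdio.h>
#include <wiredtiger.h>

WT_CONNECTION *conn;
int ret;

/* Enable the write-ahead log for every object in the database. */
if ((ret = wiredtiger_open(
         "WT_HOME", NULL, "create,log=(enabled)", &conn)) != 0)
    fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
```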

Warning
By default, log records are written to an in-memory buffer before WT_SESSION::commit_transaction returns, giving the highest performance but not ensuring immediate durability. The database can be configured to flush log records to the operating system buffer cache (ensuring durability over application failure), or to stable storage (ensuring durability over system failure), but that will impact performance. See Flush call configuration for more information.

It is possible to enable commit-level durability for some database objects and not others. To do this, one must pass "log=(enabled)" to wiredtiger_open and then pass "log=(enabled=false)" to WT_SESSION::create for the objects that should continue to use checkpoint durability. (Doing the converse is not supported, that is, enabling logging on some tables while leaving the global switch turned off.)
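For example, with logging enabled globally, a table can be opted out at creation time; the table name and formats below are illustrative:

```c
WT_SESSION *session;
int ret;

ret = conn->open_session(conn, NULL, NULL, &session);

/* This table remains checkpoint-durable; updates to it are not logged. */
ret = session->create(session, "table:metrics",
    "key_format=S,value_format=S,log=(enabled=false)");
```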

Warning
Making commits to both checkpoint-durable and commit-durable objects in the same transaction requires caution. If the system crashes before another checkpoint is taken, any such transaction will be torn: only the commit-durable part of it will remain, and the rest will be lost. In most cases this is undesirable; however, this combination can be usefully leveraged to create application-level write-ahead logs and is therefore explicitly supported.

Though rarely useful, object logging can be toggled across application restarts (in other words, logging can be enabled for a table previously created or used with logging disabled, and vice versa). Log records found during recovery are applied to the table regardless of whether logging is currently configured for it.

Commit-level durability and logs

Commit-level durability is implemented using a write-ahead log. When logging is enabled for an object, WiredTiger writes a record to the log for each update operation on the object. Transactions group updates and, in keeping with the principle of atomicity, each transaction's updates become durable when the transaction is committed.

By default, log records are buffered in memory and not flushed to disk immediately, even when committed; groups of transactions are flushed together. (See Group commit.) It is possible to flush transactions to disk more aggressively if desired. See Flush call configuration.

Recovery

When the transactional log is enabled, calling wiredtiger_open automatically performs a recovery step when opening the database. This rolls the log forward; that is, it reapplies whatever changes from the log are required to bring the database up to date with the most recent transactional state.

This recovery step may require extensions be available when it runs (for example, collators and compression). Therefore, applications using commit-level durability must configure extensions with the extensions keyword to wiredtiger_open consistently whenever re-opening the database.
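For example, if a logged table uses a loadable compressor, the same extension must be configured at every open so recovery can apply its log records; the library path below is an assumption and must match the local installation:

```c
/* Configure the same extensions at every open so recovery can read
 * logged updates to objects that use them. */
ret = wiredtiger_open("WT_HOME", NULL,
    "create,log=(enabled),"
    "extensions=[/usr/local/lib/libwiredtiger_snappy.so]",
    &conn);
```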

Checkpoints

When using commit-level durability one should still perform checkpoints of the database. Database checkpoints are necessary for two reasons: First, log files can only be removed after a checkpoint completes, and so the frequency of checkpoints determines the disk space required by log files. Second, checkpoints bound the time required for recovery to complete after application or system failure by limiting the log records that need to be processed.

Checkpoints can be done either explicitly by the application, or periodically based on elapsed time or data size using the checkpoint configuration to wiredtiger_open. The period between the completion of one checkpoint and the start of the next can be set in seconds via wait, as the number of bytes written to the log since the last checkpoint via log_size, or both. If both are set, the checkpoint occurs as soon as either threshold is reached, and both are reset once the checkpoint completes. If using log_size to schedule automatic checkpoints, we recommend selecting a size that is a multiple of the physical size of the underlying log file, to more easily support automatic log file removal.
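A sketch of configuring automatic checkpoints triggered by elapsed time or log growth (the thresholds are illustrative):

```c
/* Checkpoint when 2GB of log has been written since the last checkpoint,
 * or 120 seconds after the previous checkpoint completed, whichever
 * happens first. */
ret = wiredtiger_open("WT_HOME", NULL,
    "create,log=(enabled),checkpoint=(wait=120,log_size=2GB)", &conn);
```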

Backups

Backups are done using backup cursors (see Backups for more information).

With logging enabled, partial backups (backups where not all of the database objects are copied) may result in error messages during recovery, because data files referenced in the logs might not be found. Applications should either copy all objects and log files if commit-level durability of the copied database is required, or copy only selected objects without any log files and fall back to checkpoint durability when activating the backup.

Log file archival and removal

WiredTiger log files are named "WiredTigerLog.[number]" where "[number]" is a 10-digit value, for example WiredTigerLog.0000000001. The log file with the largest number in its name is the most recent log file written. The log file size can be set using the log configuration to wiredtiger_open.

By default, WiredTiger automatically removes log files no longer required for recovery. Applications wanting to archive log files instead (for example, to support catastrophic recovery), must disable log file removal using the wiredtiger_open "log=(remove=false)" configuration.
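A sketch of an open call that retains log files for archiving:

```c
/* Log files are preserved after checkpoints complete; the application
 * is responsible for copying and eventually deleting them. */
ret = wiredtiger_open("WT_HOME", NULL,
    "create,log=(enabled,remove=false)", &conn);
```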

Log files may be removed or archived after a checkpoint has completed, as long as there is no backup in progress. When performing Log-based Incremental backup, WT_SESSION::truncate can be used to remove log files after completing each incremental backup.

Immediately after the checkpoint has completed, only the most recent log file is needed for recovery, and all other log files can be removed or archived. Note that there must always be at least one log file for the database.

Log cursors

Applications can independently read and write WiredTiger log files for their own purposes (for example, inserting debugging records), using the standard WiredTiger cursor interfaces. See Log cursors for more information.

Open log cursors prevent WiredTiger from automatically removing log files, so we recommend proactively closing log cursors when done with them. Applications manually removing log files should take care that no log cursors are open when files are removed, or errors may occur when a cursor tries to read a log record in a file that has been removed.
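A sketch of scanning the log and promptly closing the cursor so it does not pin log files:

```c
WT_CURSOR *logc;
int ret;

ret = session->open_cursor(session, "log:", NULL, NULL, &logc);
while ((ret = logc->next(logc)) == 0) {
    /* Process the record here; the cursor's key identifies the log
     * position and the value carries the record contents. */
}
/* Close promptly: open log cursors block automatic log file removal. */
ret = logc->close(logc);
```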

Bulk loads

Bulk loads are not commit-level durable; that is, the creation and bulk load of an object will not appear in the database log files. For this reason, applications doing incremental backups after a full backup should repeat the full backup step after a bulk load so that the bulk load appears in the backup. In addition, incremental backups taken after a bulk load (without an intervening full backup) can cause recovery to report errors, because there are log records that apply to data files that do not appear in the backup.

Tuning commit-level durability

Group commit

WiredTiger automatically groups the flush operations of threads that commit concurrently into a single call. This usually means multi-threaded workloads achieve higher throughput than single-threaded workloads because the operating system can flush data to disk more efficiently. No application-level configuration is required for this feature.

Flush call configuration

By default, log records are written to an in-memory buffer before WT_SESSION::commit_transaction returns, giving highest performance but not ensuring durability. The durability guarantees can be stricter but this will impact performance.

If "transaction_sync=(enabled=false)" is configured in wiredtiger_open, log records may be buffered in memory and only flushed to disk by checkpoints, when log files switch, or by calls to WT_SESSION::commit_transaction with sync=on. (Note that any call to WT_SESSION::commit_transaction with sync=on will flush the log records for all committed transactions, not just the transaction where the configuration is set.) This provides minimal durability guarantees, but is significantly faster than the other configurations.

If "transaction_sync=(enabled=true)", "transaction_sync=(method)" further configures the method used to flush log records to disk. By default, the configured value is fsync, which calls the operating system's fsync call (or fdatasync if available) as each commit completes.

If the value is set to dsync, the O_DSYNC or O_SYNC flag to the operating system's open call will be specified when the file is opened. (The durability guarantees of the fsync and dsync configurations are the same, and in our experience the open flags are slower; this configuration is only included for systems where that may not be the case.)

If the value is set to none, the operating system's write call will be called as each commit completes but no explicit disk flush is made. This setting gives durability across application failure, but likely not across system failure (depending on operating system guarantees).
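For example, a connection-level configuration selecting the fsync method might look like:

```c
/* Flush each commit to stable storage before it returns, using
 * fsync (or fdatasync if available). */
ret = wiredtiger_open("WT_HOME", NULL,
    "create,log=(enabled),transaction_sync=(enabled=true,method=fsync)",
    &conn);
```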

When a log file fills and the system moves to the next log file, the previous log file will always be flushed to disk prior to close. So when running in a durability mode that does not flush to disk, the risk is bounded by the most recent log file change.

Here is the expected performance of durability modes, in order from the fastest to the slowest (and from the fewest durability guarantees to the most durability guarantees).

Durability Mode                                             | Notes
log=(enabled=false)                                         | checkpoint-level durability
log=(enabled),transaction_sync=(enabled=false)              | in-memory buffered logging configured; updates durable after checkpoint or after sync is set in WT_SESSION::begin_transaction
log=(enabled),transaction_sync=(enabled=true,method=none)   | logging configured; updates durable after application failure, but not after system failure
log=(enabled),transaction_sync=(enabled=true,method=fsync)  | logging configured; updates durable on application or system failure
log=(enabled),transaction_sync=(enabled=true,method=dsync)  | logging configured; updates durable on application or system failure

The durability setting can also be controlled directly on a per-transaction basis via the WT_SESSION::commit_transaction method, which supports several durability modes with the sync configuration that override the connection-level settings.

If sync=on is configured then this commit operation will wait for its log records, and all earlier ones, to be durable to the extent specified by the "transaction_sync=(method)" setting before returning.

If sync=off is configured then this commit operation will write its records into the in-memory buffer and return immediately.
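For example, a per-transaction override (error handling elided):

```c
ret = session->begin_transaction(session, NULL);
/* ... make updates ... */

/* Wait for this transaction's log records, and all earlier ones, to be
 * durable per the transaction_sync=(method) setting before returning. */
ret = session->commit_transaction(session, "sync=on");
```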

The durability of the write-ahead log can also be controlled independently via the WT_SESSION::log_flush method, which supports several durability modes with the sync configuration that act on the log immediately.

If sync=on is configured then this flush will force the current log and all earlier records to be durable on disk before returning. This method call overrides the transaction_sync setting and forces the data out via fsync.

If sync=off is configured then this flush operation will force the logging subsystem to write any outstanding in-memory buffers to the operating system before returning, but not explicitly ask the OS to flush them.
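For example:

```c
/* Force the current log and all earlier records to stable storage,
 * overriding the connection's transaction_sync setting. */
ret = session->log_flush(session, "sync=on");

/* Push buffered records to the operating system without an explicit
 * disk flush. */
ret = session->log_flush(session, "sync=off");
```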