Version 11.3.0
Automatic prepare timestamp rounding

Replaying prepared transactions by rounding up the prepare timestamp

Prepared transactions have a configuration keyword for rounding timestamps. Applications can configure roundup_timestamps=(prepare=true) with the WT_SESSION::begin_transaction method.

It is possible for a system crash to cause a prepared transaction to be rolled back. Because the durable timestamp of a transaction is permitted to be later than the prepared transaction's commit timestamp, it is even possible for a system crash to cause a prepared and committed transaction to be rolled back. Part of the purpose of the timestamp interface is to allow such transactions to be replayed at their original timestamps during an application-level recovery phase.

Under ordinary circumstances this is purely an application concern. However, because it is also allowed for the stable timestamp to move forward after a transaction prepares, strict enforcement of the timestamping rules can make replaying prepared transactions at the same time impossible.

The setting roundup_timestamps=(prepared=true) is provided to handle this problem. It disables the normal restriction that the prepare timestamp must be greater than the stable timestamp. In addition, the prepare timestamp is rounded up to the oldest timestamp (not the stable timestamp) if necessary and then the commit timestamp is rounded up to the prepare timestamp. The rounding provides some measure of safety by disallowing operations before oldest.

This setting is dangerous. It is safe to replay a prepared transaction at its original timestamps, regardless of the current stable timestamp, as long as it is done during an application recovery phase after a crash and before any ordinary operations are allowed. Using this setting to prepare and/or commit before the current stable timestamp for any other purpose can lead to data inconsistency. Likewise, replaying anything other than the exact transaction that successfully prepared before the crash can lead to subtle inconsistencies. If in any doubt, it is safer to globally abort the transaction (this requires no further action in WiredTiger) or not allow the stable timestamp to advance past the commit timestamp of a transaction that has been prepared.

Safety rationale and details

When a transaction is prepared and rolled back by a crash, then replayed, this creates a period of execution time where the transaction's updates will not appear. Reads or writes made during this period that intersect with the transaction will not see it and can therefore produce incorrect results.

An application recovery phase is a startup phase in application code that is responsible for returning the application to a running state after a crash. It executes after WiredTiger's own recovery completes and before the application resumes normal operation. (For a distributed application this may have nontrivial aspects.) The important property is that only application-level recovery code executes, and that code is expected to be able to take account of special circumstances related to recovery.

It is safe to replay a prepared transaction during an application recovery phase if nothing makes intersecting reads or writes during the period the prepared transaction is missing and the replay makes the exact same updates as before the crash, so any subsequent intersecting reads or writes will behave the same as if they had been performed before the crash. (If the application recovery code itself makes intersecting reads before replaying a prepared transaction, it is responsible for compensating.)

Because a transaction's durable timestamp is allowed to be later than its commit timestamp, it is possible for a transaction to prepare and commit and still be rolled back by a crash. It is thus possible to perform intersecting reads that succeed (rather than failing with WT_PREPARE_CONFLICT), either before or after the crash, and these would become inconsistent if the replayed transaction is not replayed exactly.

Even for transactions that prepared successfully but did not commit before the crash, it is important to replay exactly the same write set; otherwise reads before and after the crash might produce WT_PREPARE_CONFLICT inconsistently.

It is expected the oldest timestamp will not advance during application recovery. The rounding behavior does not check for this possibility; if for some reason applications wish to advance oldest while replaying transactions during recovery, they must check their commit timestamps explicitly to avoid committing before oldest.