Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.
WiredTiger is a highly multi-threaded application, which means that a large number of operations can happen in parallel. To remain performant, WiredTiger uses a number of lock-free programming mechanisms. Commonly used patterns are documented here; a pattern in this context is a combination of one or more lock-free programming techniques.
This architecture guide page aims to aid developers in understanding and using these patterns in the future.
When a writer thread needs to share data with one or more reader threads concurrently, the message passing pattern can be used. For this pattern to work two variables are required: a "payload", which is the data being shared, and a "signal" variable. The payload could be a number of things, e.g.:
For the writer to share this payload with readers it needs to perform two steps in the correct order. First it writes the payload, then it writes the signal variable with a release write. The release write must use the WT_RELEASE_WRITE macro described in Release. The reader threads read the signal variable to know whether the payload is safe to read.
The reader threads must read in the opposite order: they read the signal variable using the acquire read macro WT_ACQUIRE_READ, then read the payload. The WT_ACQUIRE_READ macro, described in Acquire, provides the required ordering semantics. If the signal variable indicates the payload is safe to read, the reader thread can read the payload.
This pattern is used in many places in WiredTiger; they are not listed here as any such list would quickly become stale.
Example pseudocode:
Given an array of size Z, where objects within the array may be unused, the slot counter pattern provides a variable - whose name varies depending on the implementation, but which we will refer to as slot_count - that lets readers determine how many slots in the array need reviewing. The reader then only has to walk slot_count slots instead of traversing the whole array. This allows for dynamic array growth and is an optimization, as readers would otherwise have to walk the full array.
If the array size is fixed then no ordering is defined for slot_count. If the array is allowed to grow then additional ordering is defined for slot_count; see Growable arrays for further details.
This pattern should only be used if performance is critical in the read path. Prior to using this pattern the following determinations should be made:
When an algorithm initializes an object in the array there are two outcomes with respect to slot_count:
- If the slot's index is less than slot_count, the slot will be used and the object in the slot will be initialized.
- If the slot's index is equal to slot_count, then the slot at index slot_count is used and slot_count is incremented.
Additionally the writers must follow the Message passing with acquire and release pattern when initializing the slot. First the object / slot must be initialized with its payload, then a flag or pointer can be set for readers to safely read the object held in the slot itself.
Reading from this array requires readers to loop through it, up to slot_count. Readers must read slot_count only once, prior to entering the body of the loop, and use a local copy of the value for the remainder of the loop.
When reading a specific slot, readers must follow the Message passing with acquire and release pattern by reading a flag or pointer indicating the slot is initialized prior to reading the contents of the object held in the slot.
Additional writes to the array can occur concurrently, and as readers take a copy of slot_count prior to their traversal they may miss objects being initialized concurrently. The reader algorithm must ensure that once it has entered the loop, new slots being used in the array will not impact the outcome of the loop, as it is not guaranteed to see them.
A growable array is one where the value of Z is not fixed and the user can increase the array size while the array is in use. Here we discuss the additional semantics required to support a growable array.
Note: Generations handle the life cycle, creation and freeing of the growable array; that is not covered here.
If a new slot is required in the array, but all slots are in use and slot_count is equal to the size Z of the array, then the array must grow. The variable slot_count has additional ordering defined when swapping in the new array: the array pointer swap must happen before slot_count is incremented. This requires a Release.
Readers must respect the writer-side ordering: slot_count must be read prior to reading the array pointer. This requires an Acquire. If this ordering is not respected the reader could read an old array pointer but a new slot_count, and perform an out-of-bounds array access.
The fixed size version of this pattern is used to maintain the Connection's sessions array. The dynamically growing version of this pattern is used to maintain the hazard pointer array for a Session.
The Write while Stable (WwS) pattern performs a write to a memory location W, but only considers that write to be successful if the value of a separate memory location S has not changed by the time the write to W has become visible. As a result, when the WwS pattern succeeds, the next time the value of S changes all threads can already see the update to W.
In this pattern the write to W always takes place; however, the writing thread will not consider the update to be valid until it has verified that S is unchanged. The writing thread is responsible for either retrying or rolling back the write to W on a failure. Threads reading from W must be able to handle or ignore these interim "failed" writes to W, and must perform a read-acquire on S before reading W.
This pattern should only be used in performance critical paths. In all other cases a lock should be used as it provides a much simpler interface for developers when reading and debugging code.
Note: The following example assumes familiarity with WiredTiger's Generations.
In the generations code, the __wt_session_gen_enter function reads the current value of connection->generation (S) and writes it to session->generation (W). Independently, a second thread executing __gen_oldest will first read connection->generation and then walk all session->generation values, reporting the smallest generation seen.
Without this pattern __wt_session_gen_enter could enter a generation G, connection->generation could be advanced to G+1, and the __gen_oldest thread could see the update to connection->generation (G+1) but not the write to session->generation (G). This would incorrectly report the oldest generation as G+1 while the session->generation G is still active.
On a successful WwS, the write to session->generation must be visible before connection->generation is advanced. If connection->generation has not been advanced, the new session->generation is visible and the oldest generation will be equal to it.
On a failed WwS, __wt_session_gen_enter must retry and enter the new generation G+1. If __gen_oldest sees the "invalid" session->generation of G it will report an older generation than the actual oldest generation. This is considered acceptable, as it temporarily delays freeing a resource, which has no impact on correctness.
This is the trimmed down version of the writing thread code in the generations logic:
Note: This usage pattern behaves like a lightweight version of a trylock. In non-critical code paths a trylock should be preferred, as it also provides acquire/release semantics around the critical section, guaranteeing that instructions inside the critical section execute while the lock is held.
Multiple threads will often compete to enter a critical section or execute functions that should be limited to a single thread. For example, in __wt_cache_pool_server all threads will attempt to become the pool manager but only one manager is allowed; whichever thread comes first becomes the manager.
To implement this usage pattern all threads attempt to set some shared flag F via a compare_and_swap. On success, the winning thread is allowed to continue into the critical section. When the thread leaves the critical section it unsets the flag for the other threads to claim. Threads that fail to claim the flag F must decide whether to continue, abort or retry.
In __wt_cache_pool_server all threads attempt to set the pool_managed flag with a call to __wt_atomic_cas8(&cp->pool_managed, 0, 1), and only the successful thread is then allowed to set the WT_CACHE_POOL_MANAGER flag and perform the duties of the pool manager.