Version 10.0.0
Cursor (Architecture Guide)
Data StructuresSource Location

Cursors are used in WiredTiger to get and modify data. A caller of WiredTiger uses WT_SESSION::open_cursor to create a WT_CURSOR. Methods on the WT_CURSOR can then be used to position, iterate, get, and set data. In the typical case, a cursor will be used to access keys and values in a Btree. However, cursors can also used to access indexed data, WiredTiger statistics, log files, and metadata. Additionally, cursors are used for managing backups.

The various kinds of cursors are created by the WT_SESSION::open_cursor call. Depending on the uri used when opening a cursor, the cursor will be implemented internally as one of the many cursor structures that include WT_CURSOR_BTREE, WT_CURSOR_BACKUP, WT_CURSOR_INDEX, WT_CURSOR_LOG, WT_CURSOR_METADATA, WT_CURSOR_STAT, WT_CURSOR_TABLE. Each of these structures starts with the common WT_CURSOR structure, which contain all of the data and method pointers that make up the public part of the API. Thus, any one of these "extended cursor structs" can be allocated and returned as a WT_CURSOR pointer. Since the method pointers are filled with a specific implementations, for example __curtable_reset, a call to WT_CURSOR::reset will call the specific implementation function, and the first argument can be cast to be a pointer to the cursor struct used by the implementation. Thus, in our C code, we have something similar to a class hierarchy, having an abstract base class (WT_CURSOR) with virtual methods, and a number of implementation classes.

The code for each cursor type's methods are generally organized in a single file. For example, a backup cursor is implemented in src/cursor/cur_backup.c . Similarly, shared or utility cursor methods are defined in src/cursor/cur_std.c .

Several cursor types have the concept of subordinate cursors, or child cursors. For example, a table cursor is composed of subordinate file cursors, each representing a column group. A metadata cursor has a subordinate cursor that is a file cursor on the metadata file.

Every open cursor appears on a list in WT_SESSION->cursors. This list is ordered in a way that cursors are closed when the session is closed. Thus, if a subordinate cursor needs to be closed before its parent, it must be listed before the parent.

Data translation

Cursors that expose Btree data, like file, table and index cursors, return a set of keys and values, translating encoded data to types that match the schema. For example, if the value_format specified for a table is "iSi" (an integer, a null-terminated string, an integer), then data {0, "abc", 5} might be stored in the Btree in a packed format as:

0x80 0x61 0x62 0x63 0x00 0x85

When retrieved, the values are decoded and stored into typed variables whose addresses are passed to WT_CURSOR::get_value.

As an aside, 0x80 represents 0 using a variable length encoding. Using 0x80 as zero allows negative integers to be stored (-1 is 0x7f) in a way that they will sort before zero and positive integers. Small integer values can be stored in a single byte. See comments in src/include/intpack_inline.h for more information.

When creating a cursor via WT_SESSION::open_cursor, the raw flag can be used. This has the effect of disabling the translation provided by the schema, transferring a single block of unencoded data for the key or value.

File cursors

File cursors (also known as Btree cursors) are one of fundamental kind of cursors, allowing direct accesses to WiredTiger Btrees. The implementation structure for a file cursor is WT_CURSOR_BTREE. The file cursor methods are generally small wrappers around calls into the Btree layer, where the WT_CURSOR_BTREE structure is used. The Btree layer handles all aspects of cursor positioning, and transfers of raw key and value data.

Table and index cursors

A table cursor is a higher level concept built on file cursors. A WiredTiger table allows data to be physically split into separate Btrees, this is done via the concept of column groups. Column groups may be defined that contain a set of named columns (columns are synonymous with fields). Each column group's columns are the stored in a single btree and may be looked up by the table's key. See Column groups for API details.

Tables may also have indices. These are implemented as Btrees mapping the index key(s) to the main key(s) of the table. Indices may be added after the data is populated in the main table. This requires the index to be filled at the time it is added. Index cursors are implemented using WT_CURSOR_INDEX. Methods on this cursor that set and get values know to use the value that appears in the index as a key into the main table. See Indices for API details.

Internally, we define the concept of a "simple" table as one that has no named columns. Columns, that is the set of keys and values, may be optionally named when creating a table. With no named columns, column groups are not possible (as they must reference names), and all of a table's keys and values reside in a single file. Also without named columns, there is no possibility that a cursor can use projections. A "simple" table, by its nature, can be implemented by a single Btree file and needs no special translation. Thus, if a cursor is opened on a "simple" table, we can return a file cursor on the single file used to store its data, instead of a table cursor. This optimization means that every cursor method goes directly to the file cursor implementation, saving CPU time throughout the lifetime of the cursor.

Projections and plans

Projections are an indication, when a cursor is opened, that the indicated values, and possibly some keys, should be returned by WT_CURSOR::get_value calls. This is only available for table cursors. If the table was configured with column groups, a projection has a bearing on which column group files must be opened in a cursor. When a subset of values is returned, it's possible that some column groups will not be needed. To implement projections and column groups, cursors use a plan.

A plan is a string that indicates a series of actions that must be taken to retrieve the needed values from the subordinate cursors in the table cursor. Remember that each column group gets its own cursor. When a table is created, a default plan is created that asks to copy all columns from each column group cursor in order. When a projection is used for a cursor, a more complex plan may be created. A plan contains numbers and action letters. The numbers are arguments, the action letters are commands. The actions 'k' and 'v' indicate that the subordinate cursor indicated by the numeric argument will be used to get keys or values that follow. The 'n' action gets one or more next key or value columns from the cursor. The 's' skips one or more columns. Switching to another cursor resets the cursor to point at the first column in the indicated cursor. This allows a way to find keys and values in arbitrary order.

As a contrived example, suppose we have a table with two keys "k1,k2" and four values "x1,x2,x3,x4". The column groups are created as follows:

session->create(session, "colgroup:main:c1", "columns=(x1,x2)");
session->create(session, "colgroup:main:c2", "columns=(x3,x4)");

The column group "main:c1" has keys "k1,k2" and values "x1,x2"; the column group "main:c2" has keys "k1,k2" and values "x3,x4". A table cursor on the main table will have two sub-cursors, one for each column group. Now consider the ill-considered projection that is opened as:

session->open_cursor(session, "table:main(x1,x4,x3,k2,k1,x2,x4)", NULL, NULL, &cursor);

In this case, the plan is this string (with spacing added):

0v n 1v s n 1v n 0k s n 0k n 0v s n 1v s n

To break it down,

  • "0v" means use sub-cursor 0 and prepare to read values.
  • "n" means get the next value, that is "x1".
  • "1v" means use sub-cursor 1 and prepare to read values.
  • "s" means skip a value, that is "x3".
  • "n" means get the next value, that is "x4".
  • ...

Notice that this plan requires many switches of cursors and several "s" (skip) operations. Each skip involves enough decoding of the data item to determine its length so its data can be skipped over.

With the default (complete) projection, getting values is fast. Using the default plan, the needed columns are pulled out of each subordinate cursor one by one, and get copied to the caller's arguments. With projections, the simple algorithm following the plan works well if the columns in the projection are grouped by column group and requested in order. Without that discipline, as in this example, the performance will not be optimal.

The implementation of plan creation and execution resides in the Schema (Architecture Guide).

Dump cursors

Dump cursors are used in two rather different ways. Regular dump cursors retrieve the raw keys and values and translate bytes either as raw characters or as hex values. This flavor of dump cursor is used by the wt dump or wt dump -x utility. JSON dump cursors do more sophisticated translation, returning a string that is a JSON formatted record, with the name for each key and its data and the name for each value and its corresponding data. Data is translated to either integral, floating point, or string depending on the format of the column in the schema. This flavor of dump cursor is used by the wt dump -j command as part of creating a JSON dump of a WiredTiger table or file.

There is a single WT_CURSOR_DUMP struct that is used to implement both flavors. The dump code checks for the WT_CURSTD_DUMP_JSON flag and as needed, calls into functions in src/cursor/cur_json.c . The code in that file also implements several external functions that are used by the wt utility when loading JSON-dumped files. In particular, the __wt_json_token function returns individual JSON tokens from an input string.

The JSON code used by the dump cursor uses some storage that hangs off of cursor->json_private which is typed as WT_CURSOR_JSON. When a JSON flavored cursor is created, the list of key column names and value column names is populated in WT_CURSOR_JSON. These names, obtained from the configuration string that created the table or file, are useful to have in advance, as they are used once per row to help fill out the JSON output. The functions that get rows iterate these names and unpack the corresponding column data, converting them into the appropriate JSON format for the data.

Backup cursors

A backup cursor is used to manage backups. It is implemented using a WT_CURSOR_BACKUP structure. A backup cursor can be configured to do a full backup or an incremental backup. First, we'll look at full backups.

A backup cursor for a full backup returns the set of files that need to be copied to achieve the backup. The backup cursor, when opened, ensures that it is the only backup cursor running in the system and returns an error if not. This is managed using the WT_CONNECTION_IMPL->hot_backup_start variable, which can only be accessed when the connection schema lock is held. A non-zero value means a hot backup is in progress. Closing the backup cursor sets it back to zero.

Having an open backup influences actions elsewhere in WiredTiger, since part of the backup protocol involves the application copying whole data files. Thus, having an open backup may cause the block manager and log file server thread to avoid truncating data and log files. A truncation of a file being copied by the application would be unexpected. Also, open checkpoints are not deleted during the course of a backup.

When the backup cursor is initialized, the complete set of files needed to back up is generated and stored in the cursor. This makes the backup's next function easy as it just returns the next file in the list.

Incremental backups work much the same way, except that the file list is reduced to files that have changed since a previous backup referenced in the configuration when the cursor is opened. The other twist is that for each file returned, the caller does a duplicate operation on the backup cursor, and the duplicate code actually returns a specialized incremental backup cursor. This kind of cursor has its own next method that causes it to return information about individual pieces of this file that need to be copied. The code to implement incremental cursors is in src/cursor/cur_backup_incr.c .

Join cursors

Join cursors implement a join mechanism for WiredTiger. The idea is that joins can be configured by opening a special join cursor on a table, and attaching the table's index cursors to it, to return rows that match a filter, like:

(row.price > 100 and row.price <= 200) and row.in_stock < 10

Building up the conditions essentially creates a tree that is used for evaluation. This is stored in WT_CURSOR_JOIN, and is returned as the cursor object. Entries in the tree representing an index's participation in one clause are stored in WT_CURSOR_JOIN_ENTRY objects. WT_CURSOR_JOIN and WT_CURSOR_JOIN_ENTRY objects can be composed in a hierarchical manner, representing the shape of the tree representing the query. These two structures are somewhat static, being created when the join is created.

The join cursor contains a pointer to a WT_CURSOR_JOIN_ITER, which in some sense encodes the "position" of the cursor. To get the next row that satisfies a join requires that multiple cursors be iterated. Generally, the "left-most" index cursor is iterated first. Using the example above, the price index would be iterated, skipping over any entries that did not satisfy (row.price > 100 and row.price <= 200). Then other cursors are checked to see that any other conditions are satisfied. The join may be configured to use Bloom filters, and when that occurs, WT_CURSOR_JOIN_ENTRIES contain the bloom filter for individual index checks. This allows checks to occur quickly, at the expense of an initialization that occurs when the cursor begins iteration.

During a join cursor iteration, multiple WT_CURSOR_JOIN_ITER objects may be created. There is one WT_CURSOR_JOIN_ITER object corresponding to each level of nesting (queries can have arbitrary nesting of AND/OR). Each WT_CURSOR_JOIN_ITER allows the question to be asked: does the current position of the main table cursor satisfy the join conditions for this part of the join tree?

Part of the reason for this dynamic structure is that disjunctions may require some dynamic action. Consider

(row.price > 100 and row.in_stock < 10) .... or (row.aisle == 12 and row.on_sale == 1)

The last part of the query (being part of an OR clause) will be first executed after a number of items are returned that satisfy the first part of the query. If the join cursor is closed before then, it would be a waste to have opened subordinate cursors on the aisle and on_sale indices, and potentially computed bloom filters, etc.

When positioned on an entry that represents a nested join, a new child WT_CURSOR_JOIN_ITER is created that will be bound to the nested WT_CURSOR_JOIN. That iterator is then used to generate candidate primary keys. When its iteration is completed, that iterator is destroyed and the parent iterator advances to the next entry. Thus, depending on how deeply joins are nested, a similarly deep stack of iterators is created.

Duplicating cursors

Cursors may be duplicated, this occurs by passing a cursor to be duplicated as part of the WT_SESSION::open_cursor call. Cursor duplication does not occur in the cursor type code. Rather, a new cursor of the requested type is created, and the cursor's position is duplicated via a call to __wt_cursor_dup_position. This function gets the key from the original cursor in raw form (not converting it using key_format), sets the key in the new cursor, and does a search to set the position properly.

File cursors, Btrees and data handles

File cursors, Btrees and data handles exist in a WiredTiger system as different ways to reference the data in a Btree. It is useful to understand the differences between these structures and how they are used.

At the bottom is the data handle (also known as dhandle). This is an abstraction of an operating system file handle, with a set of flags and some reference counts. A Btree is a much larger abstraction, with a memory cache of key value pairs along with functions to read and write data as needed to and from the data file. A Btree is paired with a data handle to allow the transfer of data. Both the data handle and the Btree are owned by the connection. That is, they are shared among all sessions.

File cursors, on the other hand, are owned by a session. When a session opens a cursor on a file for the first time in that session, a file cursor is created. This occurs even if the file may be opened by cursors in other sessions already. The session owns the cursor, and the cursor may only be used by that session. Open cursors do increment reference counts in the data handle, so that the data handle "knows" it is being used, so that the file may not be dropped, renamed, verified or salvaged. So when WT_CURSOR::close is called for a file cursor, the cursor's memory may be released (or retained if cached), and reference counts decremented. Other sessions may retain open cursors on that file, they are independent.

Cursor caching

Cursors, upon closing, may be cached in the session. An open of the same URI will return a cached cursor if one is found matching the URI. Cursor caching is currently only done on file cursors. Because of the optimization for simple tables described above, cursors on simple tables are also cached.

To help implement caching, two methods, cache and reopen have been added to the cursor API. These are not public. Their function is to perform cursor-type specific operations to change a cursor from an open state to a cached state (cache) and change from a cached state to the open state (reopen).

When a cursor is opened the first time, it is marked as cacheable or not. Cursors that specify certain options, like bulk loading, random, or readonly, are not cacheable. When a cursor is closed, the cursor is checked if it is cacheable. If so and if cursor caching is enabled in the session, then it will be cached. Cached cursors live in a hash table that is owned by the session. A hash function on the URI is used to determine which hash bucket to use. We compute the hash of the URI once, its value is stored in in the cursor for future use. Thus, caching a cursor (what happens within the type-specific cache function (e.g. __curfile_cache) is relatively quick:

  • Free storage that we don't want held (for example, storage used by the cursor's key and value).
  • Get a weak reference to the data handle (increment dhandle->session_ref).
  • Release the strong reference to the data handle (decrement dhandle->session_inuse).
  • Determine the hash bucket needed (using the hash value in the cursor).
  • Move the cursor from the session's open list to the list in the hash bucket.
  • Increment statistics and decrement the connection's open cursor count.
  • Set the cache cursor flag.
  • Unlock the dhandle.

The change of reference from a strong to a weak is significant. When a dhandle's session_inuse (strong reference) drops to zero, it means that no cursor is open on the dhandle, and the only references are from cached cursors. In this state, the dhandle may be marked dead by the dhandle sweep. When the dhandle is dead, the dhandle's memory will still persist, but each session will eventually notice, during its cursor cache sweep, and "fully close" the cursor, removing it from the cache list, releasing its weak reference before freeing the cursor. Each session holding cached cursors must have some periodic activity that causes it to run its sweep, an occasional call to WT_SESSION::reset will suffice. After a dhandle has been dead for enough time, it is expected that all of its weak references will drop to zero, and the dhandle itself can be freed by the dhandle sweep.

If there is a failure during the cache function, then we would want to fully close the cursor. Rather than having special case code to handle this rare condition, we instead call reopen to temporarily bring the cursor back to an open state, and turn cursor caching off temporarily in the session while we close the cursor, releasing all its resources and references.

During a cursor open, if the cursor configuration options allow caching, we hash the uri, and look at the corresponding hash bucket in the session cursor cache. If we find a matching cursor, we call reopen on the cursor. This is what happens within the call to the type-specific reopen function (e.g. __curfile_reopen):

  • Lock the dhandle.
  • If the dhandle is no longer open, release it and mark the reopen to fail (but continue).
  • Get a "strong" reference to the data handle (increment dhandle->session_inuse).
  • Release the "weak" reference to the data handle (decrement dhandle->session_ref).
  • Increment statistics and the connection's open cursor count.
  • Move the handle from the hash bucket to the session's open list (the hash value was previously saved in the cursor)
  • Clear the cache cursor flag.
  • Update convenient pointers within the cursor to parts of the WT_BTREE that may have changed which the cursor was cached.

If the reopen fails (probably due to the dhandle no longer being open or being marked dead), we have ensured that enough of the cursor is opened so that it can be legally closed. We have studiously avoided having a cursor that is in some state that is half-open and half-closed, as it is hard to know how to dispose of it.

Session cursor sweep

Consider a large system that has many sessions using the same set of tables. When all sessions have closed a particular table, there will be no need to keep the underlying data handle open. In fact, if a WT_SESSION::drop call is called, we want to ensure that the data handle has been closed and the corresponding file is removed. Some systems, like Windows, require that all open file handles be closed before a file can be successfully removed from the file system. So there is motivation to periodically mark data handles that have no active references, and have sessions free cached cursors that have weak references to such marked data handles. The former job occurs in the connection sweep code. The latter job occurs in the session cursor cache sweep.

The session cursor cache sweep currently happens in the WT_CURSOR::close call, and also on calls to WT_SESSION::reset. On one hand, we don't want the overhead of a sweep to occur too frequently. It is quite possible that both close and reset can be called a lot and there may be many of thousands of cached cursors in the session. For that reason, we'd like to do the sweep in small increments, and not on each call. On the other hand, in a larger system, a session may be part of a pool servicing higher level requests. When a session completes its work, it may be left idle by the caller of WiredTiger, and such a session may then be idle for long periods of time. When cursor caching is enabled and sessions are not active, we want even occasional calls to WT_SESSION::reset to have a strong effect. We want occasional sweeps to keep up with freeing up references to data handles, so that otherwise unused data handles may in turn be freed eventually.

Our solution to this is three-fold. First, every time we want to call the sweep, a countdown counter is used, so we only consider a sweep every WT_SESSION_CURSOR_SWEEP_COUNTDOWN times (currently WT_SESSION_CURSOR_SWEEP_COUNTDOWN == 40). Secondly, we won't sweep if it's already been done this second in time. Finally, we sweep by walking a small set of buckets, initially 5 out of typically 512 configured buckets. However, depending on how productive our sweep is, that is, how many references to closed data handles are freed, we may continue our walk. This should usually strike a good balance between not having a lot of overhead for sweeps, and keeping up with the need to free up shared resources.

Debug copy

When cursors are positioned, their data may point to data in the btree or data allocated in the cursor. A caller may use a pointer to that data until the next cursor call. After that, the pointer should not be considered valid. By default, WiredTiger does not enforce this. When opening a cursor, a "debug=copy" configuration flag can be used. This forces any data that is returned by WT_CURSOR::get_key or WT_CURSOR::get_value to be in malloc'd memory, and explicitly freed on the next API call. Systems that are instrumented to track memory references can detect the references to freed memory, thus latent bugs can be detected.

The implementation is straightforward. The key and value in each cursor is represented as a WT_ITEM. A WT_ITEM includes a pointer and size, and it can point to arbitrary memory. However, the WT_ITEM also includes a memory buffer that may or may not be allocated. When the WT_ITEM pointer points to the item's own memory buffer, then it is already in malloc'd memory. When "debug=copy" is configured, it is a simple matter to check if a key and value being returned are already in the item's malloc'd memory. If not, memory is allocated, the copy is made and the item's pointer is updated. On the beginning of the next API call using that cursor, the item's malloc'd memory is overwritten and freed. Thus, in the presence of a memory tracker, uses of "stray" pointers will be detected. Even without a memory tracker, uses of "stray" pointers into the freed storage will likely yield the overwritten bytes, and not the previously seen key or value.