Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.
WiredTiger includes a scheme for discarding whole pages at a time. This is known as fast-truncate or fast-delete (the terms are used interchangeably) and the pages it is done to are called deleted pages.
There are also four other ways deleted pages can appear in a database. The checkpoint cleanup code (found in bt_sync.c) discards pages it finds to contain only obsolete values. Pages that reconcile completely empty turn into deleted pages. In VLCS, an empty deleted page is inserted when loading an internal page whose start recno is less than the start recno of its first child. Finally, new trees are created with a single empty deleted leaf page. These circumstances are all discussed further below.
The fast-truncate and deleted-page arrangements to some extent violate the system architecture and the system's layering and modularity. Consequently they have tentacles in a number of places; furthermore, a lot of the functioning is implicit or hidden and appears magic to the uninitiated.
Ideally, this page documents all the tentacles (see Pointers to pieces of the implementation) and explains all the implicit magic.
Most of the code related to fast-truncate and deleted pages lives in bt_delete.c.
One of the states a WT_REF can be in is WT_REF_DELETED. This means that (as of some point) all data on the page has been removed. However, this is not necessarily true for all current or possible readers. The page_del field of a WT_REF, if not NULL, contains information about the transaction that deleted the page. Its type is WT_PAGE_DELETED. The page_del can be thought of as a special kind of update: it contains roughly the same information as an update and is treated the same way, but on a per-page basis.
The page_del field can be NULL; this means that the prior data on the page is all obsolete and nobody can see it. (Or equivalently, the deletion has become globally visible.) If non-null, the transaction and timestamp information describes the visibility of the deletion. When the deletion is not visible, the prior data on the page is (or may be) visible and may still need to be read. Reading a deleted page into memory requires deleting every item on it with its own WT_UPDATE; the process of doing this is called instantiation and is described below under Instantiation.
When a page is deleted by fast-truncate, a WT_PAGE_DELETED structure is allocated, populated with information about the transaction containing the truncate, and inserted in the page_del field. When a page is discarded by checkpoint cleanup or other mechanisms that have already determined that all the prior data is obsolete, the page_del field is left NULL.
Accessing the deletion information requires locking the WT_REF. This both prevents the structure from being discarded while under examination and prevents the page from being simultaneously instantiated by another thread. (While the page_del field can be tested for NULL atomically, this has limited utility.)
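The locking pattern described above can be sketched roughly as follows. This is a simplified model, not WiredTiger code: the `struct ref` and `struct page_deleted` types, the `REF_*` states, and `ref_peek_page_del` are hypothetical stand-ins, and the lock is modeled as a compare-and-swap on the ref state using C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real WT_REF and WT_PAGE_DELETED carry more fields. */
enum ref_state { REF_DISK, REF_DELETED, REF_LOCKED, REF_MEM };

struct page_deleted {
    uint64_t txnid;     /* transaction that truncated the page */
    uint64_t timestamp; /* its commit timestamp */
};

struct ref {
    _Atomic int state;
    struct page_deleted *page_del; /* NULL: the deletion is globally visible */
};

/*
 * Copy out the deletion information of a deleted ref. Returns false if the
 * ref was not in the deleted state (or is locked by another thread); in that
 * case the outputs are untouched. On success, *has_info reports whether
 * page_del was non-NULL and, if so, *out holds a copy of it.
 */
static bool
ref_peek_page_del(struct ref *ref, struct page_deleted *out, bool *has_info)
{
    int expected = REF_DELETED;

    assert(ref != NULL && out != NULL && has_info != NULL);

    /* Lock: succeeds only if the state prior to locking was REF_DELETED. */
    if (!atomic_compare_exchange_strong(&ref->state, &expected, REF_LOCKED))
        return false;

    /* While locked, page_del cannot be discarded or the page instantiated. */
    *has_info = ref->page_del != NULL;
    if (*has_info)
        *out = *ref->page_del;

    /* Unlock by restoring the previous state. */
    atomic_store(&ref->state, REF_DELETED);
    return true;
}
```

Copying the structure out while the lock is held, rather than returning a pointer, is one way to avoid using the information after another thread has discarded it.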
In general the deletion information should only be consulted if the ref's state prior to locking was WT_REF_DELETED. In any other state the page_del field will be NULL, with the exception of instantiated pages, whose state is WT_REF_MEM (see Instantiated Pages below).
Locking the ref and performing visibility checks are both expensive, so we want to avoid doing either unnecessarily. Consequently many of the places in the system that lock the ref to examine the deletion information will discard it right away if they find the deletion has become globally visible. Therefore it is possible that a truncated page with null page_del may still have an on-disk image and an address.
The transaction referenced in the deletion information may be committed or uncommitted, but it is never aborted. Upon transaction rollback the page_del field is cleared immediately.
Note: It is not possible to delete a page twice; if the same region of a tree is truncated twice, the already-deleted pages will be skipped over the second time. See Truncation.
Deleted pages still "exist" in the sense that the WT_REF remains, and any searches within the btree that lead to its portion of the namespace will still land on it. In this case we trigger page instantiation. See Instantiation.
An instantiated page is a page in WT_REF_MEM state that has been produced by the instantiation process (see Instantiation). For most purposes instantiated pages are ordinary in-memory pages.
Instantiated pages always have a modify structure, and differ from ordinary in-memory pages in two ways. First, the field page->modify->instantiated is set to true, and the page_del field is retained from the prior deleted state. Second, if the transaction that deleted the page had not resolved yet when the instantiation happened, the field page->modify->inst_updates contains an array of the WT_UPDATE structures used to do the instantiation.
These two conditions are orthogonal and resolve independently; when both are resolved the page is thenceforth a completely normal in-memory page.
Note that instantiated pages are not automatically marked dirty. (See the notes on this point in the Instantiation section.)
Also note that it is not possible to fast-truncate an instantiated page. Only on-disk pages can be fast-truncated. See Truncation.
The instantiated flag and the page_del structure are retained for the benefit of parent internal page reconciliation. See Internal page reconciliation below.
When the page is itself reconciled, or if the transaction that deleted it rolls back, the instantiated flag is cleared and the page_del structure is discarded.
Note that the page_del field of an instantiated page should not be used to make operational decisions. Additional updates might have been applied to the page since the instantiation happened; these may contradict or obsolete the deletion information.
Meanwhile, the inst_updates field is kept independently until the transaction that created it is resolved. It in effect belongs to that transaction and is neither needed nor used by anything else, with one exception: eviction checks whether it is non-NULL before evicting the page. (Pages with an instantiated but uncommitted truncate cannot be evicted.)
Because it is possible for the page to split between instantiation and transaction resolution, finding the updates created during instantiation to resolve them is problematic. The inst_updates field makes this possible.
Upon resolution (either commit or rollback) the inst_updates field is discarded.
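As a rough illustration of how the two pieces of instantiated-page state resolve independently, the following sketch models them with a small struct. All names here (`struct modify_model`, `model_reconciled`, `model_txn_resolved`, and so on) are hypothetical; the real fields live in the page's modify structure and in the ref, and the real code paths do considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct update; /* opaque stand-in for WT_UPDATE */

/* Hypothetical, reduced view of the state carried by an instantiated page. */
struct modify_model {
    bool instantiated;            /* set by instantiation */
    bool has_page_del;            /* deletion info retained from the deleted state */
    struct update **inst_updates; /* non-NULL only while the truncate is unresolved */
};

/* The page itself was reconciled: the flag and the deletion info go away. */
static void
model_reconciled(struct modify_model *m)
{
    assert(m != NULL);
    m->instantiated = false;
    m->has_page_del = false;
}

/* The truncating transaction resolved: the update array is always discarded. */
static void
model_txn_resolved(struct modify_model *m, bool committed)
{
    assert(m != NULL);
    free(m->inst_updates);
    m->inst_updates = NULL;
    if (!committed) {
        /* Rollback also clears the flag and the deletion info. */
        m->instantiated = false;
        m->has_page_del = false;
    }
}

/* Eviction is blocked while the truncate is unresolved. */
static bool
model_can_evict(const struct modify_model *m)
{
    return m->inst_updates == NULL;
}

/* Once both conditions are resolved the page is a normal in-memory page. */
static bool
model_is_ordinary(const struct modify_model *m)
{
    return !m->instantiated && m->inst_updates == NULL;
}
```

In this model a commit followed by a reconciliation, or a reconciliation followed by a commit, both end in the same ordinary state, which is the sense in which the two conditions are orthogonal.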
Instantiation is the process by which an on-disk deleted page (ref in state WT_REF_DELETED) is converted into an in-memory page (ref in state WT_REF_MEM) with all its items explicitly deleted with tombstone updates. Semantically, this is an identity transformation.
This occurs under two sets of circumstances: first, if a thread that cannot yet see the deletion tries to read from the deleted page; and second, if a search lands on the deleted page's portion of the namespace. It can happen either before or after the transaction that deleted the page is resolved. (But not at the same time; locking the ref prevents that.)
Note that for searches that are positioning to do updates, the instantiation is unavoidable; however, for searches that are only reading, it would be better to return WT_NOTFOUND without pointlessly instantiating the page. This is not currently implemented and as of this writing it is not entirely clear how involved or feasible such changes would be. Also note that this only applies to explicit searches and searches that are part of other cursor operations; the cursor next and previous operations do skip over deleted pages.
In any case, we notice the page is deleted when we read it. If the page has no address, or it has an address but the deletion is globally visible, we create a new in-memory page instead of reading the on-disk page. Otherwise, we read the on-disk page and call the instantiation code in bt_delete.c.
The instantiation code iterates the page and adds a tombstone for every item. (To be precise, it adds a tombstone for every item that isn't already deleted; items that have a stop time do not need to be deleted again.) The row-store version of this iterates the entries on the page and directly adds a tombstone to each update list. The column-store version uses a cursor and calls into col_modify.c to do it; this is because the data structures are considerably more complicated and updating them directly would require a lot of cut-and-paste code.
The tombstones are tagged with WT_UPDATE_RESTORED_FAST_TRUNCATE. This is used by __wt_txn_prepare to avoid trying to coalesce these updates with others to the same key. (That wouldn't work because the instantiation updates don't appear directly in the transaction modify list; they are referenced indirectly through the truncation.)
If the deletion is not resolved (according to the page_del information in the ref) an array is allocated to hold the updates and this is placed in page->modify->inst_updates. As noted above this is used to find the updates during transaction resolution.
Finally, page->modify->instantiated is set to true.
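The row-store style of instantiation described above can be sketched as a loop over the items on a page. This is only an illustrative model: `struct item`, `struct tombstone`, and `instantiate_items` are hypothetical, the `restored_fast_truncate` field stands in for the WT_UPDATE_RESTORED_FAST_TRUNCATE flag, and the optional `track` array plays the role of inst_updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tombstone WT_UPDATE. */
struct tombstone {
    unsigned long txnid;         /* the truncating transaction */
    bool restored_fast_truncate; /* models WT_UPDATE_RESTORED_FAST_TRUNCATE */
};

/* Hypothetical stand-in for one key/value entry on a leaf page. */
struct item {
    bool has_stop_time;    /* already deleted on disk */
    struct tombstone *upd; /* head of the (one-element) update list */
};

/*
 * Add a tombstone to every item that is not already deleted. If track is
 * non-NULL (the truncation is unresolved), also record each tombstone there;
 * track must have room for n pointers. Returns the number of tombstones
 * added, or -1 on allocation failure (tombstones added so far stay attached).
 */
static int
instantiate_items(struct item *items, size_t n, unsigned long txnid, struct tombstone **track)
{
    int added = 0;

    assert(n == 0 || items != NULL);

    for (size_t i = 0; i < n; i++) {
        if (items[i].has_stop_time)
            continue; /* nothing to delete again */

        struct tombstone *t = malloc(sizeof(*t));
        if (t == NULL)
            return -1;
        t->txnid = txnid;
        t->restored_fast_truncate = true;
        items[i].upd = t;

        if (track != NULL)
            track[added] = t;
        added++;
    }
    return added;
}
```

Keeping the tombstones in a flat array, separate from the page's own structures, is what would let transaction resolution find them even if the page has split in the meantime.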
The instantiated page is not automatically marked dirty. Instantiation is logically an identity transformation; if the page is not otherwise modified, discarding it after instantiation returns it to the WT_REF_DELETED state, and the deletion information remains in the ref. (If the parent page is then evicted, the deletion information is written into the address cell. If it is discarded because it is also unmodified, the deletion information must have already been written to disk and can be read in again later.)
However, VLCS pages end up marked dirty anyway because the instantiation logic uses col_modify to post the tombstones and that always marks its page dirty.
Note that the instantiation code is skipped in two cases: verify and salvage, which are concerned only with the on-disk state. These cases also skip the optimization that generates blank pages instead of reading pages that contain only obsolete values (those with a globally visible truncation); verify is, at least in part, concerned with the physical structure of the tree and this substitution can confuse it.
Salvage is included for consistency because the internal page with the deletion information is not necessarily available (or correct) and it is probably better to not try to use it.
Reconciliation of internal pages that have deleted (or instantiated) children requires special handling. It is necessary to check whether the page can be dropped entirely, and if not, to write the deletion information into the child page's address cell. (Like the address of a ref, the deletion information more or less belongs to the parent; it is written to disk as part of the parent page, not the child page.) See On-disk format.
The code that supports this lives in rec_child.c and is used by both VLCS and row-store internal page reconciliation.
If the child page is deleted (that is, the state is WT_REF_DELETED) we must lock the ref to examine the deletion information. Then there are several possible cases.
If the deletion is or has become globally visible, we can delete any on-disk block, and drop the child from the on-disk representation of the parent. This is accomplished by sending back WT_CHILD_IGNORE.
If the deletion is visible to the thread doing the reconciliation but not globally visible, we need to write the deletion information to disk. This is accomplished by sending back WT_CHILD_PROXY and copying the deletion information for the caller. (The "proxy" refers to a "proxy cell", which is another name for the deleted-address cells used to refer to deleted pages. This terminology is probably outdated and should perhaps be removed sometime.)
If the deletion is not visible to the thread doing the reconciliation, then we need to refer to the original on-disk page without the deletion information. This is accomplished by sending back WT_CHILD_ORIGINAL. This also requires leaving the parent page dirty. For checkpoints, r->leave_dirty is set; for eviction, that doesn't work, but there's also not much to be gained by evicting the page under these circumstances so instead we just fail.
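The three cases for a deleted child can be summarized as a small decision function. The sketch below is a model only: the `CHILD_*` enumerators mirror the WT_CHILD_IGNORE, WT_CHILD_PROXY, and WT_CHILD_ORIGINAL results named above, while `struct del_view` and `deleted_child_action` are hypothetical, and the visibility answers are taken as inputs rather than computed. Prepared truncations are treated as invisible, as discussed elsewhere on this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for WT_CHILD_IGNORE, WT_CHILD_PROXY, WT_CHILD_ORIGINAL. */
enum child_action { CHILD_IGNORE, CHILD_PROXY, CHILD_ORIGINAL };

/* Hypothetical summary of what reconciliation learns after locking the ref. */
struct del_view {
    bool has_page_del; /* false: page_del is NULL, i.e. globally visible */
    bool prepared;     /* truncation is prepared but not yet committed */
    bool visible;      /* visible to the reconciling thread */
    bool visible_all;  /* globally visible */
};

static enum child_action
deleted_child_action(const struct del_view *d)
{
    assert(d != NULL);

    if (!d->has_page_del)
        return CHILD_IGNORE; /* nothing left to see: drop the child */
    if (d->prepared)
        return CHILD_ORIGINAL; /* cannot be represented on disk */
    if (d->visible_all)
        return CHILD_IGNORE; /* drop the child and its on-disk block */
    if (d->visible)
        return CHILD_PROXY; /* write a deleted-address cell */
    return CHILD_ORIGINAL; /* keep the original address; parent stays dirty */
}
```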
For instantiated pages, under normal circumstances the instantiated child will be reconciled before its parent. Eviction skips parents that have in-memory children, and when checkpointing ordinarily all children are written out before their parents. In this case the instantiated flag and deletion information will have been cleared and no special steps are required to reconcile the parent. (It is possible for the truncation to be uncommitted, so the update list may be non-null; however, this does not matter when checkpointing either the child or the parent, as the updates created by instantiation will be left in memory in the usual way. Eviction of the child is blocked in these circumstances.)
There is an edge case where special steps are needed when reconciling an internal page. It arises when the parent page is being checkpointed and a truncated child page is instantiated after the child has already been reconciled. The checkpoint will refer to the original on-disk page as it was before instantiation; any updates to the instantiated page will not be visible to the checkpoint. The checkpoint will write out the deletion information along with the address.
Note that in all these cases "visible" and "globally visible" do not include prepared transactions. Truncations that have been prepared but not yet committed cannot be written to disk (the on-disk format doesn't have space to represent the state) so must be treated as invisible during reconciliation. (And it turns out they must always be considered invisible; see Notes on visibility.)
Deleted pages are already on disk and inherently never themselves need to be reconciled. Instantiated pages that have been read into memory, however, do.
Eviction of instantiated pages where the truncation is unresolved is blocked. In principle these pages could successfully go through update-restore eviction, but there are complications involved in doing so (e.g. handling the update list, and if the page were to split the page_del information would have to be cloned, and that requires locking) and it was judged not worthwhile.
Eviction of instantiated pages where the truncation is committed is permitted, however, and checkpoint of instantiated pages is always allowed. The first reconciliation after instantiation clears the instantiated flag (page->modify->instantiated) and discards the page_del structure, as once a reconciliation result exists it can be used to reconcile the parent. (This is true whether or not it includes the tombstones from instantiation, i.e., whether the truncation was committed.)
Note that "unresolved" includes "prepared". While we cannot write out a prepared truncation in the parent page's address cells, in principle after instantiation we could write out the prepared tombstone updates like any other prepared updates, and after doing so nothing special is needed in the parent. This is not currently done, chiefly because safely determining whether the truncation is prepared at the point where eviction needs to check is problematic. (We can't check page_del because it might have been discarded by then; looking in inst_updates is at best messy and it isn't entirely clear what locking or synchronization might be needed. Some such check, or an extra flag in the modify structure, is probably possible if it eventually becomes important to allow these evictions.)
Deleted pages are referred to on disk by a special address cell type, WT_CELL_ADDR_DEL. These contain three additional packed integers between the time aggregate and the cell data: the transaction ID, commit timestamp, and durable timestamp of the transaction that deleted the page. (Only committed truncations are written out. Prepared truncations cannot be represented on disk. Truncations that are globally visible do not result in a cell in the parent page at all.)
These fields are, however, only present if the page header includes the WT_PAGE_FT_UPDATE flag, whose value is 0x20. Proper support for timestamped fast-truncate only appeared in WT 11.0; earlier versions neither write these fields nor expect to find them. The explicit header flag is required to make compatibility guarantees function as needed. (MongoDB 6.x knows how to read pages with WT_PAGE_FT_UPDATE set, but does not write them. This is controlled in the WT library by the __wt_process::fast_truncate_2022 flag, whose default setting is controlled by the build config.)
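To make the conditional layout concrete, here is a minimal sketch of writing and reading the three integers only when the header flag is present. It is not WiredTiger's cell code: the varint encoding used here is plain LEB128, chosen as a stand-in for WiredTiger's own packed-integer format, and `struct page_del_disk`, `pack_page_del`, and `unpack_page_del` are hypothetical names. Only the flag value 0x20 comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_FT_UPDATE 0x20u /* value quoted above for WT_PAGE_FT_UPDATE */

/* Hypothetical container for the three on-disk fields. */
struct page_del_disk {
    uint64_t txnid, commit_ts, durable_ts;
};

/* LEB128 encode (a stand-in encoding); returns bytes written, at most 10. */
static size_t
put_varint(uint8_t *p, uint64_t v)
{
    size_t n = 0;
    do {
        uint8_t b = (uint8_t)(v & 0x7f);
        v >>= 7;
        p[n++] = v != 0 ? (uint8_t)(b | 0x80) : b;
    } while (v != 0);
    return n;
}

/* LEB128 decode; returns bytes consumed. Assumes well-formed input. */
static size_t
get_varint(const uint8_t *p, uint64_t *v)
{
    size_t n = 0;
    unsigned shift = 0;
    uint8_t b;

    *v = 0;
    do {
        b = p[n++];
        *v |= (uint64_t)(b & 0x7f) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);
    return n;
}

/* Write the fields only if the page header carries the flag; buf needs 30 bytes. */
static size_t
pack_page_del(uint8_t *buf, uint8_t header_flags, const struct page_del_disk *d)
{
    size_t n = 0;

    assert(buf != NULL && d != NULL);
    if ((header_flags & PAGE_FT_UPDATE) == 0)
        return 0; /* older format: fields are neither written nor expected */
    n += put_varint(buf + n, d->txnid);
    n += put_varint(buf + n, d->commit_ts);
    n += put_varint(buf + n, d->durable_ts);
    return n;
}

/* Read the fields back under the same condition. */
static size_t
unpack_page_del(const uint8_t *buf, uint8_t header_flags, struct page_del_disk *d)
{
    size_t n = 0;

    assert(buf != NULL && d != NULL);
    if ((header_flags & PAGE_FT_UPDATE) == 0)
        return 0;
    n += get_varint(buf + n, &d->txnid);
    n += get_varint(buf + n, &d->commit_ts);
    n += get_varint(buf + n, &d->durable_ts);
    return n;
}
```

The point of the sketch is the gating on the header flag: a reader that does not see the flag must not try to consume the three fields, which is what makes the format change compatible in both directions.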
The B-tree cursor truncate code, found in bt_cursor.c, iterates through the specified truncation range with cursor_next, individually removing all values it finds. This is the slow-truncate path.
The fast-truncate functionality is implicit. Passing true for the truncating argument of __wt_btcur_next causes the flag WT_READ_TRUNCATE to be passed to the tree-walk inside the next code. Fast-truncate then happens inside the tree-walk code (in bt_walk.c) which calls __wt_delete_page on leaf pages it visits.
This function, in bt_delete.c, checks the page for eligibility. First, if the page is unmodified and in memory we attempt to evict it. Then we check if the page is on disk, that is, the state is WT_REF_DISK, and if so lock the ref. (And then check if it is still on disk.)
Then the following categories of pages are ineligible for fast-truncate:
WT_ADDR_LEAF_NO, rather than WT_ADDR_LEAF, where overflow items may exist.
If the page passes these checks, we mark its parent dirty, initialize the page_del structure, add a WT_TXN_OP_REF_DELETE operation to the current transaction (except if we're truncating the history store, which is non-transactional), and set the ref state to WT_REF_DELETED.
Otherwise we report back to the tree-walk code that we couldn't delete the page and it needs to visit it. (This happens by a return flag and not by returning an error.)
Note that it is not possible to fast-truncate an already deleted page; ordinarily the tree-walk code will have already skipped over it (see Skipping deleted pages) and also, __wt_delete_page won't accept one. If one truncate reaches a page truncated by another transaction that is not yet visible, such that the skip code doesn't skip it, we need to load the page, instantiate it, and attempt to slow-truncate it; this will discover the transaction conflict.
Also note that the first and last pages of a truncate operation are always slow-truncated regardless of eligibility; this is a result of the initial positioning of the start and end cursors, which requires the pages under them to be present in memory. This point is not particularly important operationally but can create complications for writing tests.
Finally note that currently the initial eviction attempt is done unconditionally even in cases where we could determine beforehand that the page will be ineligible. This causes it to be evicted and read in again, which is suboptimal, and should perhaps be improved at some point.
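The overall shape of a truncate over a range of leaf pages, as described in this section, might be modeled like this. Everything in the sketch is hypothetical (`struct leaf`, `truncate_range`, the `eligible` flag standing in for the eligibility checks, and the assumption that the eviction attempt always succeeds); it only illustrates which pages end up on the fast path and which on the slow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum page_state { PAGE_DISK, PAGE_MEM, PAGE_DELETED };
enum outcome { FAST_TRUNCATED, SLOW_TRUNCATED, SKIPPED };

/* Hypothetical leaf-page summary. */
struct leaf {
    enum page_state state;
    bool modified; /* only meaningful for in-memory pages */
    bool eligible; /* stands in for the eligibility checks */
};

/*
 * Truncate leaves[0..n-1]. The first and last pages are always
 * slow-truncated because the start and end cursors pin them in memory.
 */
static void
truncate_range(struct leaf *leaves, size_t n, enum outcome *out)
{
    assert(n == 0 || (leaves != NULL && out != NULL));

    for (size_t i = 0; i < n; i++) {
        struct leaf *l = &leaves[i];

        if (i == 0 || i == n - 1) {
            out[i] = SLOW_TRUNCATED;
            continue;
        }
        if (l->state == PAGE_DELETED) {
            /* Already deleted (and assumed visible): skipped by the walk. */
            out[i] = SKIPPED;
            continue;
        }
        /* Unmodified in-memory pages are evicted first (assumed to succeed). */
        if (l->state == PAGE_MEM && !l->modified)
            l->state = PAGE_DISK;

        if (l->state == PAGE_DISK && l->eligible) {
            l->state = PAGE_DELETED; /* fast-truncate */
            out[i] = FAST_TRUNCATED;
        } else
            out[i] = SLOW_TRUNCATED; /* the walk must visit the page */
    }
}
```

A model like this also shows why tests need ranges spanning at least three leaf pages before any fast-truncate can be observed.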
As mentioned earlier, there are four ways besides fast-truncate that deleted pages can appear.
First, the checkpoint cleanup code in bt_sync.c discards pages that it finds contain only obsolete values, instead of writing them out. The ref state becomes WT_REF_DELETED, no page_del is generated, and when the checkpoint reaches the parent internal page any prior on-disk page image will be dropped and no cell will be produced.
Second, pages that reconcile empty end up deleted. During a checkpoint this happens in the checkpoint cleanup; in eviction, it happens in the WT_PM_REC_EMPTY case of __evict_page_dirty_update. The ref state becomes WT_REF_DELETED and no page_del is generated. (Also note that WT_REF_MEM pages with WT_PM_REC_EMPTY reconciliation results are explicitly skipped during internal page reconciliation.)
In VLCS, empty deleted pages are inserted during certain internal page reads; see VLCS considerations.
Finally, new trees are created with a single empty deleted leaf page, because creating the tree with no leaves at all causes problems.
There are two separate skip functions for skipping deleted pages during tree walks. The basic skip function is __wt_delete_page_skip in bt_delete.c; it is always called in tree walks when WT_READ_SEE_DELETED is not set, which is all tree walks except Rollback to Stable (RTS) and column-store appends. It checks whether the ref state is WT_REF_DELETED and the truncation is visible to the caller.
As in other cases (see Notes on visibility) we must treat prepared truncations as not visible; in order to generate prepare conflicts we cannot skip over truncations that are prepared but not yet committed.
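The basic skip test, including the rule that prepared truncations are never skipped, can be sketched as below. This is a deliberately reduced model: the types and `delete_page_skip_model` are hypothetical, and visibility is collapsed to a single transaction-ID comparison, whereas the real checks involve snapshots and timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ref_state { REF_DISK, REF_DELETED, REF_MEM };
enum txn_status { TXN_UNCOMMITTED, TXN_PREPARED, TXN_COMMITTED };

/* Hypothetical stand-ins for WT_PAGE_DELETED, WT_REF, and a reader. */
struct page_deleted {
    uint64_t txnid;
    enum txn_status status;
};

struct ref {
    enum ref_state state;
    const struct page_deleted *page_del; /* NULL: globally visible */
};

struct reader {
    uint64_t max_visible_txnid; /* simplification of a snapshot */
};

/* Return true if a tree walk on behalf of this reader may skip the page. */
static bool
delete_page_skip_model(const struct ref *ref, const struct reader *rd)
{
    assert(ref != NULL && rd != NULL);

    if (ref->state != REF_DELETED)
        return false;
    if (ref->page_del == NULL)
        return true; /* nobody can see the old data */

    /*
     * Prepared (and other uncommitted) truncations are treated as not
     * visible, so the page is visited and a prepare conflict can be raised.
     */
    if (ref->page_del->status != TXN_COMMITTED)
        return false;

    return ref->page_del->txnid <= rd->max_visible_txnid;
}
```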
The other skip function is __wt_btcur_skip_page in btree_inline.h. This is used by cursor next and previous to skip over deleted pages in those traversals specifically, and in addition to checking for WT_REF_DELETED with visible page_del, it also inspects the address cell. If the page is in memory and unmodified and the address cell contains page-delete information, it checks for the visibility of that deletion. Failing that, it checks the time aggregate in case all the values on the page are no longer visible, e.g. because the page was slow-truncated and reconciled.
For an unmodified in-memory page the deletion information in the address should be the same as present in the ref's page_del structure. However, inspecting the unpacked copy instead, which is free since we are unpacking the cell to look at the time aggregate anyway, makes this check independent of the lifecycle of the ref's page_del. This was more important in previous iterations of the code than it is now; however, in principle it's possible to drop the ref's page_del immediately upon instantiation in a read-only tree, since it is kept for internal page reconciliation and pages in read-only trees are never reconciled. Checking the unpacked information here avoids interfering with that option.
Note that __wt_btcur_skip_page is passed to the tree-walk code as a custom skip function, which means that when it's used both it and __wt_delete_page_skip are checked. This is not optimal and probably ought to be tidied. Furthermore, it is unclear whether these functions really need to be different; that is, it may be that the additional time aggregate check in __wt_btcur_skip_page should be deployed for all tree walks. (If not, the reason should be discovered and documented.)
In row-store, discarding a chunk of the namespace has no particular effect. Leaf pages support insertion at the beginning and at the end (as well as in the middle), so if the internal page structure directs a search to a particular leaf page there's always a place to put any updates that might be generated.
In VLCS this is different; insertions are supported only at the ends of pages, in the append list. Historically this was furthermore only allowed on the last page of the tree; the namespace begins at 1 and all keys between 1 and the last key in the tree existed on some particular leaf page, possibly in a deleted-value cell.
Extending fast-truncate and deleted support to VLCS requires allowing chunks of the namespace to be discarded. In most cases this is harmless; searches into that portion of the namespace will be directed by the internal page structure to the next page to the left and any updates will appear on its append list.
However, if the leftmost child of an internal page is discarded, a problem arises: searches for that portion of the namespace still go to that internal page, because its start key hasn't changed (and can't change), but now the leftmost child begins at some later key. There's now nowhere for the search to go, and things go downhill from there.
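The difference between discarding a middle child and discarding the leftmost child can be seen in a toy model of an internal page as a sorted array of child start recnos. The function `vlcs_child_for` and the array representation are hypothetical; the sketch just routes a record number to the child with the largest start recno not exceeding it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the child whose range would receive recno, i.e. the
 * last child whose start recno is <= recno, or -1 if recno precedes the
 * first child (there is nowhere for the search to go).
 */
static int
vlcs_child_for(const uint64_t *child_start, size_t nchild, uint64_t recno)
{
    int found = -1;

    assert(nchild == 0 || child_start != NULL);

    for (size_t i = 0; i < nchild && child_start[i] <= recno; i++)
        found = (int)i;
    return found;
}
```

For an internal page starting at recno 100 with children starting at 100, 200, and 300, a search for recno 250 lands on the second child. If that middle child is discarded, the same search lands on the child starting at 100, and any update goes on its append list. If instead the leftmost child is discarded, a search for recno 150 still reaches this internal page but finds no child to the left of 200, which is the failure the steps in this section are designed to avoid.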
One possible solution to this problem is to use the split code to reinsert an empty page in the leftmost slot on demand. This was rejected as dangerous (violates previous assumptions, extremely difficult to test or verify) so instead steps were put in place to avoid ever discarding the leftmost child.
These steps are:
Note that this extra page never appears on disk.
There is one other consideration, which is that normally VLCS leaf pages never reconcile empty; instead they reconcile with a single deleted-value cell, possibly with a large RLE count. These pages are detected at the end of leaf reconciliation and converted to empty pages.
Fast-truncate and deleted pages in general are not supported in FLCS. Most of the code is in place (since deleted and instantiated page handling mostly occurs in the internal page code and this is shared with VLCS) but there's a showstopper problem: because there are no deleted values (deleted values read back as zero) there are also no gaps in the namespace. In particular, if we truncate a range, discard the pages, and then read through the gap we need to read back the entire truncated namespace as zero one entry at a time. That requires knowing how big the gaps are; and while that information is encoded in the internal page structure, it is not readily available from that structure. Currently this looks infeasible to support, though it's possible that there's some clever solution nobody's thought of yet.
This is perhaps unfortunate because slow-truncating FLCS pages (which contain large numbers of rows with very small values) is particularly expensive.
For FLCS, fast-truncate is inhibited by checking for it and clearing WT_READ_TRUNCATE in the page-walk code; similarly, checkpoint cleanup avoids discarding FLCS pages. Because values are never deleted, pages never become empty; no special handling is needed to prevent deleted pages appearing via that mechanism. However, there is still a bit of FLCS-specific code there to avoid examining pages that are not in memory (unlike in row-store and VLCS) because this is not useful.
The empty deleted page attached to a new tree is still created; it will turn into an empty in-memory page on the first search, and in principle can change back to a deleted page if then evicted immediately. However, once an update is posted to it (even if later rolled back) it will not reconcile empty.
As mentioned previously, for deleted page purposes prepared transactions must be treated as not visible. This differs from the treatment elsewhere (for ordinary updates, prepared values are visible but cause WT_PREPARE_CONFLICT if visited) and the special-case handling is wrapped into a pair of functions __wt_page_del_visible and __wt_page_del_visible_all.
These functions take a boolean argument that enables this special-case handling; this is only an optimization, since in one place (parent page eviction checks) uncommitted transactions have already been excluded and the check for a prepared transaction is redundant.
It turns out that (so far at least) all visibility references to truncations require treating prepared truncations as invisible. In the case of page skipping, it is necessary to visit pages with a prepared truncation so as to be able to generate WT_PREPARE_CONFLICT if needed. This is also true when reading pages in: we cannot skip reading a page because its truncation is visible-all unless it is actually committed.
The other visibility checks appear in reconciliation and eviction, and in those cases we need to treat prepared truncations as invisible because we cannot write them to disk.
After recovery, when the write generations are bumped, it is necessary to check and possibly discard the transaction IDs (and sometimes the timestamps) in loaded page_del structures, so that they contain the values they would if unpacked with the new write generations. Otherwise we might start using transaction IDs from a previous run. This is done by __wt_delete_redo_window_cleanup in bt_delete.c.
Internal pages are never fast-truncated. In most cases if a truncate spans all the children of an internal page, at the point when the child refs are discarded a reverse split will be triggered and this will cause the internal page to be discarded as well. However, it is possible for internal pages to become deleted pages if they reconcile empty. At this point the state is set to WT_REF_DELETED and no page_del is created, as with other cases of reconciling empty. If this portion of the namespace is subsequently searched, instantiation occurs; however, instantiation will create an empty leaf page. There is a hook in __wt_btree_new_leaf_page that changes the type from WT_REF_FLAG_INTERNAL to WT_REF_FLAG_LEAF at this point. (It might be tidier if this change happened at the time of deletion instead.)
Source file | Type | Symbol | Description |
---|---|---|---|
btmem.h | Enumerator | WT_REF_DELETED | The deleted ref state (with the other ref states). |
btmem.h | Read flag | WT_READ_TRUNCATE | The flag passed to tree-walk that causes eligible pages to be truncated instead of visited. |
btmem.h | Page header flag | WT_PAGE_FT_UPDATE | The on-disk flag that indicates the presence of page delete information in deleted-address cells. |
btmem.h | Update flag | WT_UPDATE_RESTORED_FAST_TRUNCATE | A flag used in prepare handling to identify updates from instantiation. |
btmem.h | Structure member | WT_PAGE_MODIFY::instantiated | The flag marking an in-memory WT_REF as an instantiated page. |
btmem.h | Structure member | WT_PAGE_MODIFY::inst_updates | The updates used to instantiate an unresolved truncation. |
btmem.h | Type | WT_PAGE_DELETED | The structure that holds page deletion information. |
btmem.h | Structure member | WT_REF::page_del | The page deletion information for a |
cell.h | Enumerator | WT_CELL_ADDR_DEL | The deleted-address cell type (along with the other cell types). |
cell.h | Misc | WT_CELL | The allocation of space in WT_CELL for the on-disk deletion information. |
cell.h | Structure member | WT_CELL_UNPACK_ADDR::page_del | The deletion information unpacked from an address cell. |
reconcile.h | Structure member | WT_CHILD_MODIFY_STATE::del | A |
txn.h | Enumerator | WT_TXN_OP_REF_DELETE | The operation type for a fast-truncate transaction operation. |
txn.h | Structure member | WT_TXN_OP::ref | The data for a fast-truncate transaction operation: the |
btree_inline.h | Structure member | WT_ADDR_COPY::page_deleted | Page deletion information unpacked from the on-disk cell by __wt_ref_addr_copy . |
btree_inline.h | Structure member | WT_ADDR_COPY::del_set | True if page deletion information was unpacked. |
btree_inline.h | Hook | in __wt_ref_addr_copy | Return the page delete information unpacked from the address. |
btree_inline.h | Function | __wt_page_del_visible | Function for checking thread visibility of a WT_PAGE_DELETED . |
btree_inline.h | Function | __wt_page_del_visible_all | Function for checking global visibility of a WT_PAGE_DELETED . |
btree_inline.h | Function | __wt_page_del_committed_set | Function for checking whether a WT_PAGE_DELETED is committed. |
btree_inline.h | Function | __wt_btcur_skip_page | The page-skip function used by cursor next and previous. |
reconcile_inline.h | Hook | in __wti_rec_cell_build_addr | A hook to choose |
timestamp_inline.h | Macro | WT_TIME_AGGREGATE_UPDATE_PAGE_DEL | Akin to |
txn_inline.h | Function | __wt_txn_op_delete_apply_prepare_state | The code for updating WT_PAGE_DELETED at prepare time. |
txn_inline.h | Function | __wt_txn_op_commit_apply_timestamps | Part of the code for updating WT_PAGE_DELETED at commit time. (The rest is in __wt_txn_commit itself. Note that the rollback-time update is in bt_delete.c .) |
txn_inline.h | Hook | in __wt_txn_op_set_timestamp | Call __wt_txn_op_delete_apply_prepare_state and __wt_txn_op_commit_apply_timestamps . |
txn_inline.h | Function | __wt_txn_modify_page_delete | This records a fast-truncate in the current transaction. |
cell_inline.h | Function | __cell_page_del_window_cleanup | Akin to __cell_kv_window_cleanup except for WT_PAGE_DELETED. |
cell_inline.h | Function | __cell_redo_page_del_cleanup | Function that redoes the timestamp cleanup for a WT_PAGE_DELETED structure; used after we bump write generations at the end of recovery. |
cell_inline.h | Hook | in __wt_cell_unpack_safe | Unpack the page deletion information from deleted-address cells. |
cell_inline.h | Hook | in __wt_cell_pack_addr | Pack the page deletion information into deleted-address cells. |
bt_curnext.c | Hook | in __wt_btcur_next_prefix | Accept a flag argument that sets WT_READ_TRUNCATE. |
bt_curnext.c | Hook | in __wt_btcur_next | Accept a flag argument that sets |
bt_curprev.c | Hook | in __wt_btcur_prev | Accept a flag argument that sets |
bt_cursor.c | Function | __wt_btcur_range_truncate (and others) | The cursor-level truncate code lives here. |
bt_debug.c | Hook | in __debug_cell_int | Prints the deletion information for WT_CELL_ADDR_DEL cells. |
bt_debug.c | Hook | in __debug_ref | Prints the deletion information for deleted and instantiated pages. |
bt_delete.c | Function | __wt_delete_page | The implementation of fast-truncate itself. |
bt_delete.c | Function | __wt_delete_page_rollback | Code for rolling back a fast-truncate, used by transaction rollback (but not RTS). |
bt_delete.c | Function | __delete_redo_window_cleanup_internal | Page-visitor part of __wt_delete_redo_window_cleanup. |
bt_delete.c | Function | __delete_redo_window_cleanup_skip | Custom page skip function for __wt_delete_redo_window_cleanup. |
bt_delete.c | Function | __wt_delete_redo_window_cleanup | Iterate a tree to do redo time window cleanup on already-loaded WT_PAGE_DELETED structures after recovery. |
bt_delete.c | Function | __wt_delete_page_skip | The page skip function to skip deleted pages that's used in ordinary tree walks. |
bt_delete.c | Function | __tombstone_update_alloc | Allocate a tombstone for instantiation. |
bt_delete.c | Function | __instantiate_tombstone | Allocate and remember a tombstone during instantiation. (Possibly this and __tombstone_update_alloc should be folded together eventually.) |
bt_delete.c | Function | __instantiate_col_var | Instantiate a VLCS page. |
bt_delete.c | Function | __instantiate_row | Instantiate a row-store page. |
bt_delete.c | Function | __wt_delete_page_instantiate | Perform instantiation of tombstones on a deleted page when reading it into memory. |
bt_discard.c | Hook | in __free_page_modify | Discard the truncate-related fields of WT_PAGE_MODIFY . |
bt_discard.c | Hook | in __wt_free_ref | Discard the |
bt_handle.c | Hook | in __wt_btree_new_leaf_page | Change the ref type from |
bt_page.c | Hook | in __wt_page_inmem | When counting how many refs to allocate on a column-store internal page, figure out when we need to allocate an extra one to insert an empty page in the leftmost slot. |
bt_page.c | Hook | in __inmem_col_int | Load deleted-address cells as deleted WT_REF structures. |
bt_page.c | Hook | in __inmem_row_int | Load deleted-address cells as deleted WT_REF structures. |
bt_page.c | Hook | in __page_read | Call the instantiation code when needed. Also, avoid reading deleted pages if we don't need the pre-deletion contents, or if there aren't any at all. Get an empty page instead and mark it instantiated. |
bt_read.c | Hook | in __wt_page_in_func | Return |
bt_split.c | Hook | in __split_parent_discard_ref | Discard the page_del field of the WT_REF. |
bt_split.c | Hook | in __split_parent | For VLCS trees, avoid discarding the leftmost child even if it's deleted. |
bt_vrfy.c | Hook | in __verify_tree | Allow for deleted-address cells when checking the cell type in the address against the page type. |
bt_vrfy.c | Hook | in __verify_tree | Allow namespace gaps in VLCS but not in FLCS. |
bt_vrfy.c | Hook | in __verify_page_content_int | Handle deleted-address cells. |
bt_vrfy_dsk.c | Function | __verify_dsk_addr_page_del | Validate and crosscheck the page deletion information discovered in on-disk address cells. |
bt_walk.c | Hook | in __tree_walk_internal | Disable fast-truncate for FLCS. |
bt_walk.c | Hook | in __tree_walk_internal | Call __wt_delete_page_skip when WT_READ_SEE_DELETED is not set. |
bt_walk.c | Hook | in __tree_walk_internal | Call __wt_delete_page when WT_READ_TRUNCATE is set. |
bt_walk.c | Hook | in __tree_walk_skip_count_callback | Call |
conn_dhandle.c | Hook | in __wt_dhandle_update_write_gens | Call |
evict_page.c | Function | __evict_delete_ref | Delete evicted pages and check for/trigger reverse splits. |
evict_page.c | Hook | in __evict_page_clean_update | Delete pages with __evict_delete_ref that are clean and have no on-disk address. |
evict_page.c | Hook | in __evict_page_dirty_update | Use __evict_delete_ref to delete pages that reconciled empty. |
evict_page.c | Hook | in __evict_page_clean_update | Check for instantiated pages and set the ref state back to WT_REF_DELETED. |
evict_page.c | Hook | in __evict_child_check | Prohibit evicting internal pages with uncommitted truncations. |
rec_child.c | Function | __rec_child_deleted | Handle the processing for deleted and instantiated pages during internal page reconciliation. |
rec_child.c | Hook | in __wt_rec_child_modify | Call |
rec_col.c | Hook | in __wt_rec_col_int | Write out deleted address cells when needed. |
rec_col.c | Hook | in __wt_rec_col_var | Reconcile empty instead if we get one big deleted-value cell. |
rec_row.c | Hook | in __wt_rec_row_int | Write out deleted address cells when needed. |
rec_write.c | Hook | in __rec_split_write_header | Set WT_PAGE_FT_UPDATE on the page header if appropriate. |
rec_write.c | Hook | in __rec_write_wrapup | Clear the instantiated flag and discard the page deletion information for instantiated pages. |
txn.c | Hook | in __wt_txn_commit | Clear inst_updates and set the committed field in WT_PAGE_DELETED . |
txn.c | Hook | in __wt_txn_prepare | Call __wt_txn_op_delete_apply_prepare_state. |
txn.c | Hook | in __wt_txn_rollback | Call |
rts_visibility.c | Hook | in __wt_rollback_page_needs_abort | Check page_del when deciding whether the page contains unstable values that need to be examined. |
rts_btree_walk.c | Hook | in __rollback_to_stable_page_skip | Check |
Note also that txn_log.c contains the functions __wt_txn_truncate_log and __wt_txn_truncate_end for logging truncates, and various hooks for handling truncate log records. (Further hooks exist in log_auto.c.) However, truncates are logged at the cursor level, not the page level: the logging functions are called from the cursor code, and page-level fast-truncate actions themselves are not logged. Replaying a truncation from the log may therefore fast-truncate different pages than the original operation did; in fact it likely will fast-truncate more, since more pages will be on disk and eligible at replay time. This is correct because there is supposed to be no semantic difference between fast-truncate and slow-truncate.
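The equivalence of fast-truncate and slow-truncate can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not WiredTiger code: every name in it (toy_page, toy_tree, toy_truncate, and so on) is hypothetical, keys are plain integers, visibility and timestamps are ignored, and a page is treated as eligible for fast-truncate simply when it is on disk and entirely inside the truncated range. The real eligibility rules are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGES 4
#define TOY_KEYS_PER_PAGE 4
#define TOY_NKEYS (TOY_PAGES * TOY_KEYS_PER_PAGE)

/* Hypothetical stand-in for a leaf page; not a WiredTiger structure. */
typedef struct {
    bool on_disk;                      /* assumption: only on-disk pages are eligible */
    bool page_deleted;                 /* whole page deleted (fast path) */
    bool tombstone[TOY_KEYS_PER_PAGE]; /* per-key deletes (slow path) */
} toy_page;

typedef struct {
    toy_page pages[TOY_PAGES];
    int fast_count; /* number of pages that took the fast path */
} toy_tree;

/* Set up a tree in which every key exists and page i is on disk iff on_disk[i]. */
static void
toy_tree_init(toy_tree *t, const bool on_disk[TOY_PAGES])
{
    memset(t, 0, sizeof(*t));
    for (int i = 0; i < TOY_PAGES; i++)
        t->pages[i].on_disk = on_disk[i];
}

/* Truncate the inclusive key range [start, stop]. */
static void
toy_truncate(toy_tree *t, int start, int stop)
{
    assert(0 <= start && start <= stop && stop < TOY_NKEYS);
    for (int p = 0; p < TOY_PAGES; p++) {
        int first = p * TOY_KEYS_PER_PAGE;
        int last = first + TOY_KEYS_PER_PAGE - 1;

        if (last < start || first > stop)
            continue; /* page lies outside the range */

        /* Fast path: an eligible page wholly inside the range is deleted as a unit. */
        if (t->pages[p].on_disk && start <= first && last <= stop) {
            t->pages[p].page_deleted = true;
            t->fast_count++;
            continue;
        }

        /* Slow path: delete the covered keys one at a time. */
        for (int k = first; k <= last; k++)
            if (k >= start && k <= stop)
                t->pages[p].tombstone[k - first] = true;
    }
}

/* A key is visible unless its page or the key itself was deleted. */
static bool
toy_key_visible(const toy_tree *t, int key)
{
    const toy_page *pg = &t->pages[key / TOY_KEYS_PER_PAGE];
    return !pg->page_deleted && !pg->tombstone[key % TOY_KEYS_PER_PAGE];
}

/* True if both trees expose exactly the same set of keys. */
static bool
toy_same_contents(const toy_tree *a, const toy_tree *b)
{
    for (int k = 0; k < TOY_NKEYS; k++)
        if (toy_key_visible(a, k) != toy_key_visible(b, k))
            return false;
    return true;
}
```

In this model, applying the same toy_truncate range to two trees that differ only in which pages are on disk changes how many pages take the fast path, but toy_same_contents still reports identical results. That mirrors the reason a cursor-level log record is enough: the page-level choices made at replay time are an implementation detail.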