Data Structures | Source Location | |
---|---|---|
|
|
Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.
The WiredTiger block manager subsystem manages the reading and writing of data from the disk. It is designed to facilitate high performance, economic use of disk space and customizability.
A block is a chunk of data that is stored on the disk and operated on as a single unit. Each WiredTiger data file (any file in the home directory with the .wt suffix) is made up of these blocks. Each block consists of a page header, a block header and contains a single page of the btree from which it was generated. WiredTiger is a no-overwrite storage engine, and when blocks are re-written, they are written to new locations in the file. The size of a block is a multiple of the allocation size which is set during creation of the associated WiredTiger data file see: WT_SESSION::create.
Once a block is written an address cookie is returned. This address cookie is stored as the addr
on the associated page ref. The WT_REF
structure can be found in btmem.h
. The address cookie is opaque to other parts of the system and cannot be interpreted meaningfully.
The address cookie is made up of 4 components:
The block header contains the following fields:
The page header is not described in this document but can be found in btmem.h
.
The block manager decides where in the file a block will be written. It has two forms of writing modes, "first fit" and "best fit". The default behavior is best fit. While operating in best fit mode the block manager will search a skip list of extents sorted by size, returning either an exact match or the next largest. This is done to avoid fragmenting the file when possible. In first fit mode the block manager will place the newly created block in the first available extent. First fit mode is used for all root pages.
Additionally the block manager is a no-overwrite system. As such once a block is written it cannot be modified. This is for crash recovery reasons, because if the system were to crash during an overwrite the block state would be unknown. This doesn't mean that the associated page cannot be modified, once the associated page is modified a subsequent reconciliation will result in a new block being created.
A file is divided up into blocks. The first block in a file is special as it contains metadata about the file and is referred to as the "descriptor block". It contains the WiredTiger major and minor version, a checksum of the block contents as well as a "magic" number to check against.
The descriptor block serves as a safety check to ensure that the file being loaded into the block manager is actually a WiredTiger data file, that it belongs to a compatible version of WiredTiger and that the entire file has not been corrupted. WiredTiger also uses checksums to defend against file corruption which is described in the Checksum section.
Internally, the block manager uses a data structure called an extent list or a WT_EXTLIST
to track file usage. An extent list consists of a series of extents (or WT_EXT
elements). Each extent uses a file offset and size to track a portion of the file.
There are three extent lists that are maintained per checkpoint:
alloc:
The file ranges allocated for a given checkpoint.avail:
The file ranges that are unused and available for allocation.discard:
The file ranges freed in the current checkpoint.The alloc and discard extent lists are maintained as a skiplist sorted by file offset. The avail extent list also maintains an extra skiplist sorted by the extent size to aid with allocating new blocks.
There are a number of configuration options that affect the block manager's behavior. This does not aim to be an exhaustive list, however, these are the configuration options that are more commonly of interest to developers.
All of the configuration options below are passed into the WT_SESSION::create
API at the time of file creation.
A block's size must be a multiple of the underlying file allocation unit. This unit is controlled by the allocation_size parameter.
For example, if we specify an allocation_size of 4KB
, blocks of size 8KB
and 12KB
would be permitted but NOT 10KB
. The allocation_size is set to 4KB
by default which is a good choice unless the OS or storage device has special requirements.
The checksum
"on" configuration can be provided during creation of the file. This configuration instructs the block manager to checksum the full length of the buffer provided to be written into the block. Be default it is enabled. When disabled the block manager still does perform a checksum operation but only the first 64 bytes of the buffer are included.
The checksum is used when reading blocks to validate their contents, it is compared with the checksum extracted from the address cookie and it is compared with a checksum generated from the buffer that was held in the block being read. In both cases the checksum has to match.
There are other options that can be provided for this configuration option, they are not discussed here.
When a new file is created in WiredTiger via WT_SESSION::create, the file is created on disk and the associated allocation_size
is written out to the metadata file. However the block manager itself only exists on the btree structure and is allocated when opening a closed btree.
When an existing btree is opened for the first time, the location of the root block is contained in the metadata file WiredTiger.wt
. The block manager will read the block at the location specified and return the page image as a buffer to the layer above. This will then be instantiated as a page in memory.
From there subsequent page addresses can be read from the root page and the process repeated as required. If a cursor traverses to a page which hasn't been read into memory the same process will take place.
Two cases exist for writing out data using the block manager: checkpoint and eviction. When a page image is written out the block manager the bm->write
API is called. See bt_io.c
for more detail.
For details on checkpoint at the WiredTiger level see: Checkpoint.
At the block manager level, a checkpoint corresponds with a set of blocks that are stored in the file associated with the given URI. Typically each file will contain a minimum of two checkpoints. Upon opening an existing file the most recent checkpoint is read.
During a checkpoint new blocks are only written out for dirty pages. A block can be included in multiple checkpoints. Assuming a page X
is dirty and gets checkpointed in checkpoint A
, it will be created as a new block on disk. Now the same page X
isn't modified and another checkpoint is taken. The page is clean and as such will not require a new block to be written for it. The address of the original block is still valid.
Checkpoints are created in depth first order, leaf blocks are created, then the parent blocks. This is a requirement as the parent blocks contain the addresses of the leaf blocks.
The block manager doesn't guarantee that calling bm->write
will result in the data being flushed to disk. In the checkpoint scenario WiredTiger will also call bm->sync
once all blocks have been written which will call the file system dependent flush function.
Checkpoint deletion and merging
As a checkpoint progresses it takes a snapshot of the three extent lists kept by the block manager, these extent lists are written out to disk as part of the checkpoint in blocks. Between checkpoints these extent lists are being updated via normal operation of WiredTiger.
Suppose we have a checkpoint A
, which has an alloc list which contains 3 blocks I
, J
, K
as such its extent lists are as follows:
Alloc: I
, J
, K
Avail: L
, M
Discard: Empty
A second checkpoint B
completes and has removed a page which corresponds with block J
, it also has allocated an additional block L
.
Checkpoint B's
extent lists are as follows: Alloc: L
Avail: M
Discard: J
Finally we complete a 3rd checkpoint C
which allocates an additional block M
. Upon completion of this checkpoint we are able to remove checkpoint A
, to do that, the block manager will merge checkpoint A's
extent lists into checkpoint B's
.
What's important here is that if a block appears in both the alloc list and the discard list it can be freed which means it goes on the avail list.
Which gives us the following lists for checkpoint C:
Alloc: M
Avail: Empty Discard: Empty
And the following lists for checkpoint B:
Alloc: I
, K
, L
Avail: J
Discard: Empty
These extent lists are written out with the checkpoint C
. Anything on the avail list is considered free space and can be reused as of the completion of checkpoint C
.
We don't want to list each block individually in the extent lists, so instead of listing each block separately in the list, we use extents, which can describe a range in the file, that is, any number of contiguous blocks.
Checkpoint cookies
After a checkpoint is written, the block manager creates a checkpoint cookie to describe it. Similar to an address cookie, a checkpoint cookie is a sequence of bytes that is encoded and decoded by the block manager, and is opaque to other parts of the system. A checkpoint cookie is composed of four address cookies and two additional values:
As described in Address Cookies, an address cookie used with a tiered storage has an additional value (the object_id
). As a result, the checkpoint cookie for a tiered btree will include four additional object_id
values that are not in other checkpoints.
The program in tools/wt_ckpt_decode.py
can be used to unpack a set of bytes representing a checkpoint cookie and show its values.
For more detail on how WiredTiger eviction works see: Eviction.
Eviction also utilizes the block manager. When a page is evicted and contains data that needs to be maintained, logically a block needs to be written. Eviction calls bm->write
however it does not instruct the block manager to sync the data.
As new blocks are written, the block manager will place them where they fit best. Because of this it's common that removal of data will not result in the file shrinking. The file can only be shrunk when there are available blocks at the end of the file.
To manage this, WiredTiger provides a compaction API call WT_SESSION::compact. The block manager operates in first fit mode during compaction to maximize block movement towards the beginning of the file. WiredTiger walks the btree and asks the block manager if relocating that page will reduce the file size. If so, the page is marked dirty, forcing the block to be rewritten. WiredTiger then performs two checkpoints, as at least two checkpoints are required to delete the checkpoint originally containing the block.
For more details about the compaction process, please refer to Compaction.