Version 10.0.2
S3 Storage Source
Data StructuresSource Location
S3_FILE_HANDLE
S3_FILE_SYSTEM
S3_STORAGE
ext/storage_sources/s3_store/

Caution: the Architecture Guide is not updated in lockstep with the code base and is not necessarily correct or complete for any specific release.

S3 Storage Source Implementation

S3 storage source, or the S3 store, is a WiredTiger extension that implements the storage source abstraction for AWS's S3 object-store. The store extends the WiredTiger's capability to store the data files on the S3 cloud. The extension provides a means to construct, configure and control a custom file system linked to an S3 bucket. The extension internally uses the API provided by the AWS C++ SDK to interact with the S3 cloud.

The S3 store implements the following key functionalities defined by the storage source interface:

  • Initialize the store, the AWS SDK and create a logging module for the store and the SDK.
  • Register the S3 store as a storage source available to Wiredtiger.
  • Provide an API to access and terminate the store. A reference count is incremented each time the storage source reference is taken and decremented with the termination. When the reference count goes zero, the storage source is de-registered with WiredTiger and any allocated resources are freed.
  • Another essential functionality the S3 store provides is to create a WT_FILE_SYSTEM implementation for a given bucket. The store requires a bucket and an access token to create the filesystem. The bucket is expected to be a string argument, which is the bucket name and the corresponding region separated by a ';', e.g. "bucket1;ap-southeast-2". AWS requires an access token to authenticate programmatic access to S3. The access token comprises of an access key ID and a secret access key as a set. The S3 store requires the token to be provided as a single string argument of the access key followed by the secret key, separated by a ';'.
  • The S3 store also configures a directory on the file system local to the WiredTiger database as a cache for the object files. The cache is used to improve the read latency for the object files read after an initial download from the S3.

S3 Connection class

The S3Connection class represents an active connection to an S3 bucket. It encapsulates and implements all the operations the S3 filesystem requires. The instantiation of the class object authenticates with the AWS and validates the existence of and access to the configured bucket. The S3Connection exposes an API to list the bucket contents filtered by a directory and a prefix, check for an object's existence in the bucket, put an object to the cloud, and get the object from the cloud. Though not required for the file system's implementation, the class also provides the means to delete the objects to clean up artifacts from the internal unit testing.

S3_FILE_SYSTEM

WiredTiger supports a database over a custom file system by giving a WT_FILE_SYSTEM interface for the end-user to implement. The S3 store itself provides a means to configure as many file systems as needed, each linked to an S3 bucket. These file systems, called the S3_FILE_SYSTEM, are a custom implementation provided by the S3 store. The S3 filesystem contains an instance of the S3Connection class, hence providing an active connection to the S3 bucket embedded in the file system instance.

The S3 file system uses the functionality provided by the S3 connection to simulate a write-once-read-many file system. The filesystem can list the directory contents, check for a file's presence, and access the file by creating a filehandle. The files that are not in the cache are first downloaded from the bucket and put in the cache before a filehandle is made to access them.

The S3 store uses the filesystem's put-object functionality to flush the files into the cloud. The Tier Manager inside WiredTiger directs the flush of the files. The files pushed to the cloud are also copied to the local cache for easy access.

S3_FILE_HANDLE

The S3 store also implements the WT_FILE_HANDLE interface, S3_FILE_HANDLE, to access the files on the S3 file system. When the handle is created, the local cache gets a copy of the S3 object if it doesn't already have it. That copy will be accessed by the handle. Since the local cache is a directory on the WiredTiger database's file system, the file is accessed using a WT_FILE_HANDLE implementation native to WiredTiger. Hence, a copy of native WT_FILE_HANDLE is kept encapsulated in the S3 filehandle. Since the object store is archival, all the write access methods are disabled.

S3 Logging and Statistics

The S3 store provides a logger implementation that redirects the generated logs to the WiredTiger logging streams. AWS SDK also registers the same logger, and the best attempt is made to match the SDK's logging levels to WiredTiger's.

Several statistics are also collected to count various interactions with S3 and the operations on the objects and files.