Local Caching of Storage Containers

Vertica's Execution Engine uses the StorageMerge operator to read data from storage containers in cases where it is important to read container data in its projection-specified sort order. This is particularly useful for operations that must preserve the sort order of the data that is read from multiple storage containers, before merging it into a single storage container. Common operations that enforce sort order include mergeout, and some queries with ORDER BY clauses—for example CREATE TABLE…AS query, where query includes an ORDER BY clause.

The Execution Engine typically allocates multiple threads to the StorageMerge operator. Each thread is assigned a single Scan operator to open and read container contents. If the number of containers to read is greater than the number of available threads, the Execution Engine is likely to assign individual Scan operators to multiple containers. In this case, Scan operators might need to switch among different containers and reopen them multiple times before all required data is fetched and assembled. Doing so is especially problematic when reading storage containers on remote filesystems such as S3. The extra overhead incurred by reopening and reading remote storage containers can significantly impact performance and usage costs.

You can configure your database so the Execution Engine caches on local disk the data of S3 storage containers that require multiple opens. The size of temp space allocated per query to the StorageMerge operator for caching is set by configuration parameter StorageMergeMaxTempCacheMB. By default, this configuration parameter is set to -1 (unlimited). If caching requests exceed temp space limits or available disk space, Vertica caches as much container data as it can, and then reads from S3.

The actual temp space that is allocated is the lesser of two settings:

To turn off caching, set StorageMergeMaxTempCacheMB to 0.