Physical Design Automation in the HP Vertica Analytic Database

Automatic physical database design is a challenging task. Different customers have different requirements and expectations, bounded by their resource constraints. To deal with these challenges in HP Vertica, we adopt a customizable approach by allowing users to tailor their designs for specific scenarios and applications. To meet different customer requirements, any physical database design tool should allow its users to trade off query performance and storage footprint for different applications.

In this blog, we present a technical overview of the Database Designer (DBD), a customizable physical design tool that primarily operates under three design policies:

  • Load-optimized: DBD proposes the minimum required set of super projections (containing all columns) that permit fast load and deliver the required fault tolerance.
  • Query-optimized: DBD may propose additional (possibly non-super) projections so that all workload queries are fully optimized.
  • Balanced: DBD proposes projections until additional projections no longer bring sufficient benefit in query optimization.

These options let users trade off query performance against storage footprint while accounting for update costs. The policies indirectly control the number of projections proposed, achieving the desired balance among query performance, storage, and load constraints.
In real-world environments, query workloads often evolve over time. A projection that was helpful in the past may not be relevant today and could be wasting space or slowing down loads. This space could instead be reused to create new projections that optimize current workloads. To cater to such workload changes, DBD operates in two different modes:

  • Comprehensive: DBD creates an entirely new physical design optimized for the current workload, retaining the parts of the existing design that remain beneficial and dropping the parts that are not.
  • Incremental: Customers can optionally create additional projections that optimize new queries without disturbing the existing physical design. Incremental mode is appropriate when workloads have not changed significantly. With no input queries, DBD optimizes purely for storage and load.

[Figure: DBD comprehensive mode]

The key challenges in projection design are picking appropriate column sets, sort orders, cluster data distributions, and column encodings that optimize query performance while reducing space overhead and allowing faster recovery. DBD proceeds in two major sequential phases. During the query optimization phase, DBD chooses projection columns, sort orders, and cluster distributions (segmentation) that optimize query performance. DBD enumerates candidate projections after extracting interesting column subsets by analyzing the workload queries for predicate, join, group-by, order-by, and aggregate columns. Run-length encoding (RLE) is given special preference for columns appearing early in the sort order, because it benefits both query performance and storage. DBD then invokes the query optimizer for each workload query, presenting the candidate projections as choices. The query optimizer evaluates the query plans over all candidate projections, progressively narrowing the candidate set until a stopping condition (based on the design policy) is reached. Query and table filters are applied during this process to drop queries that are already sufficiently optimized by the chosen projections, and tables that have reached the target number of projections set by the design policy. Because DBD directly uses the optimizer's cost and benefit model, it stays synchronized as the optimizer evolves over time.
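
To make the first step concrete, here is a minimal Python sketch of extracting interesting column subsets from one workload query. The parsed query representation and the column names are hypothetical; this is an illustration of the idea, not DBD's actual parser.

    # Hypothetical parsed form of one workload query (illustrative only).
    query = {
        "predicates": ["orders.region", "orders.order_date"],
        "joins":      [("orders.customer_id", "customers.id")],
        "group_by":   ["orders.region"],
        "order_by":   ["orders.order_date"],
        "aggregates": ["SUM(orders.amount)"],
    }

    # Interesting column subsets: these seed candidate sort orders and
    # segmentation (cluster distribution) choices for the tables involved.
    interesting = {
        "predicate_cols": set(query["predicates"]),
        "join_cols":      {col for pair in query["joins"] for col in pair},
        "group_by_cols":  set(query["group_by"]),
        "order_by_cols":  set(query["order_by"]),
    }

    print(interesting)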

[Figure: DBD input parameters]

During the storage optimization phase, DBD finds the best non-RLE column encoding schemes that achieve the smallest storage footprint for the designed projections, via a series of empirical encoding experiments on sample data. In addition, DBD creates the required number of buddy projections, which contain the same data but are distributed differently across the cluster, making the design tolerant to node-down scenarios. When a node is down, buddy projections are used to supply the missing data. In HP Vertica, identical buddy projections (with the same sort orders and column encodings) enable faster recovery by allowing a direct copy of their physical storage structures, and DBD automatically produces such designs.
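
To make the buddy idea concrete, here is a minimal sketch assuming a simple hash-based segmentation over a four-node cluster. The node count, the use of Python's built-in hash, and the offset-by-one placement are illustrative assumptions, not Vertica's internal layout; the point is that a buddy projection maps each row to a different node, so any single node's data also exists elsewhere.

    # Illustrative only: hash segmentation with offset buddy placement.
    NODES = 4

    def primary_node(segmentation_key: str) -> int:
        # Node that stores this row in the primary projection.
        return hash(segmentation_key) % NODES

    def buddy_node(segmentation_key: str, offset: int = 1) -> int:
        # The buddy projection shifts placement by a fixed offset, so the row
        # survives the loss of its primary node.
        return (primary_node(segmentation_key) + offset) % NODES

    for key in ["row-1", "row-2", "row-3"]:
        print(key, "primary node:", primary_node(key), "buddy node:", buddy_node(key))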

When DBD is invoked with an input set of workload queries, the queries are parsed and useful query metadata is extracted (e.g., the predicate, group-by, order-by, aggregate, and join columns). Design proceeds in iterations. In each iteration, one new projection is proposed for each table under design. Once an iteration is done, queries that have been optimized by the newly proposed projections are removed, and the remaining queries serve as input to the next iteration. If a design table has reached its target number of projections (determined by the design policy), it is excluded from future iterations so that no more projections are proposed for it. This process repeats until no design tables or design queries remain.
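
The iteration structure can be outlined as follows. This is an illustrative sketch rather than Vertica's implementation; propose_projection, is_optimized, and target_projections are hypothetical stand-ins for DBD internals and are passed in as parameters.

    def design_iterations(design_tables, design_queries, policy,
                          propose_projection, is_optimized, target_projections):
        # design maps each table to the projections proposed for it so far.
        design = {table: [] for table in design_tables}
        tables = set(design_tables)
        queries = list(design_queries)

        while tables and queries:
            # One new projection per design table per iteration.
            for table in list(tables):
                design[table].append(propose_projection(table, queries))
                # Table filter: stop once the policy's target for this table is reached.
                if len(design[table]) >= target_projections(policy, table):
                    tables.discard(table)
            # Query filter: queries optimized by the new projections drop out.
            queries = [q for q in queries if not is_optimized(q, design)]

        return design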

To form the complete search space for enumerating projections, we identify the following design features in a projection definition:

  • Feature 1: Sort order
  • Feature 2: Segmentation
  • Feature 3: Column encoding schemes
  • Feature 4: Column sets (select columns)

We enumerate choices for features 1 and 2 above and use the optimizer's cost and benefit model to compare and evaluate them (during the query optimization phase). The choices made for features 3 and 4 typically do not affect query performance significantly. The winners selected by the cost and benefit model are then extended to full projections by filling in the choices for features 3 and 4, which have a large impact on load performance and storage (during the storage optimization phase).
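
A minimal sketch of what enumerating features 1 and 2 might look like, under illustrative assumptions (the column names, the cap on sort-order length, and the hash-or-unsegmented segmentation choices are ours, not DBD's actual enumeration rules):

    from itertools import permutations

    def enumerate_sort_and_segmentation(interesting_columns, max_sort_cols=3):
        # Feature 1: candidate sort orders drawn from the interesting columns.
        # Feature 2: candidate segmentations (hash on a column, or unsegmented).
        candidates = []
        for k in range(1, max_sort_cols + 1):
            for sort_order in permutations(interesting_columns, k):
                for seg_col in list(interesting_columns) + [None]:   # None = unsegmented
                    candidates.append({"sort_order": sort_order,
                                       "segment_by": seg_col})
        return candidates

    cols = ["customer_id", "order_date", "region"]
    print(len(enumerate_sort_and_segmentation(cols)))   # 15 sort orders x 4 segmentations = 60
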
In summary, the HP Vertica Database Designer is a customizable physical design tool whose configurable input parameters let users trade off query performance, storage footprint, fault tolerance, and recovery time to meet their requirements, and optionally override individual design features.

Optimizing Big Data Storage in HP Vertica

With the explosion of data volumes that enterprises are capturing, technologies such as HP Vertica offer non-expert users a way to analyze and monetize their Big Data. If you are a non-expert user, the Database Designer (DBD) module in HP Vertica can help you choose a physical database design that minimizes storage footprint while optimizing the performance of your query workload. The DBD is built to recommend good physical designs quickly, using minimal computing resources.

One key technique by which HP Vertica improves performance is data-specific encoding and compression. The encoding optimization component automatically chooses the best techniques, sparing scarce human experts from manually specifying encodings. While very effective, this process can be resource and time intensive, sometimes requiring all of a cluster's resources for many hours to produce an optimal design. Keeping that cost down matters: the less resource intensive the optimization process is, the more often it will be used and the more data people can analyze.

The HP Vertica Analytic Database was designed from the ground up to analyze massive amounts of data, building on research from MIT, Brown, and Brandeis. At least three current HP Vertica customers have over 1 PB in their production databases. One of the key technologies for handling petabyte-scale data sets is sophisticated encoding and compression, because it minimizes storage and I/O bandwidth requirements. HP Vertica automatically chooses an optimal physical database design, including encoding and compression, based on the schema, sample data, and query workload. As we show below, this storage optimization process is highly accurate even when it works from a small sample of the data.

With storage optimization, the specific challenges to address are:

  • Identifying appropriate encoding/compression methods for each column type based on cardinality, sortedness, or both.
  • Finding the best encoding/compression method among multiple candidate choices that yields the least storage footprint without compromising retrieval performance.
  • Identifying a representative data sample for storage experiments, which requires:
    • Finding an appropriate sample size
    • Identifying an appropriate sampling technique

Data Encoding: Encoding is the process of converting data into a standard format. In HP Vertica, encoded data can be processed directly, which distinguishes encoding from compression. HP Vertica uses a number of different encoding strategies, depending on column data type, cardinality, and sort order. The query executor operates on the encoded data representation whenever possible to avoid the cost of decoding, and it passes encoded values to other operations, saving memory bandwidth. In contrast, row stores and most other column stores typically decode data elements before performing any operation. The table below presents the different encoding types along with their descriptions and the column types for which each is ideal.
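
As a toy illustration of operating directly on encoded data (not Vertica's executor), consider a SUM and a COUNT computed over an RLE-encoded column without ever expanding the runs:

    # An RLE column stores (value, run_length) pairs instead of one entry per row.
    rle_column = [(10, 4), (12, 2), (10, 3)]   # encodes 10,10,10,10,12,12,10,10,10

    # SUM on the encoded form: one multiply-add per run rather than one add per row.
    total = sum(value * run for value, run in rle_column)   # 10*4 + 12*2 + 10*3 = 94

    # COUNT needs only the run lengths.
    row_count = sum(run for _, run in rle_column)           # 9

    print(total, row_count)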

Data Compression: Compression is the process of transforming encoded data into a compact format. Compressed data cannot be directly processed; it must first be decompressed. HP Vertica uses integer packing for unencoded integers and LZO for compressible data. When compression is used extensively, it allows a column store to occupy substantially less storage than a row store. In a column store, every value stored in a column has the same data type which enables more effective encoding and compression, particularly in sorted columns. In a row store, each row contains multiple columns with different data types, resulting in a much less effective use of compression.

Storage Optimization: The best encoding type for each column is the one that achieves the best compression (the smallest storage footprint). To find it, the DBD tries all of the ideal encoding candidates for each column, based on the column's type and properties. Using a data sample, the DBD compares the storage footprints of the encoding candidates, skipping inappropriate choices. For example, RLE is not an ideal choice for a column identified as high cardinality through statistics or primary key constraints, so the DBD won't try the RLE encoding type for such columns, as the following table shows. Also, in most cases, the Database Designer won't run experiments for a column whose encoding choice is unambiguous, unless that choice affects the storage footprint of other columns. (A sketch of this selection procedure follows the table.)

  • AUTO (default)
    Description: Automatically picked based on properties of the data itself, when insufficient usage examples are known. For string, Boolean, and float types, Lempel-Ziv-Oberhumer based (LZO) compression is used. For integer-based types, the compression scheme is based on the delta between consecutive column values. CPU requirements are relatively small.
    Ideal for: Sorted, high-cardinality columns such as primary keys. Also suitable for general-purpose applications for which no other encoding or compression scheme is applicable.
  • RLE (Run-Length Encoding)
    Description: Replaces sequences (runs) of identical values with a single pair containing the value and the number of occurrences. The storage footprint for RLE and AUTO encoding of string types is the same.
    Ideal for: Sorted, low-cardinality columns. Ideal when the run length is large, such as when low-cardinality columns are sorted.
  • BLOCK-DICT (Block Dictionary)
    Description: Within a block, distinct values are stored in a dictionary, and a list of references to that dictionary represents the data block. CPU requirements are significantly higher than for the default encoding schemes.
    Ideal for: Unsorted, low-cardinality columns, such as stock prices.
  • BLOCKDICT-COMP (Compressed Block Dictionary)
    Description: Similar to BLOCK-DICT, except that dictionary indexes are entropy coded. Requires significantly more CPU time to encode and decode.
    Ideal for: Unsorted, low-cardinality columns with extremely skewed value distribution.
  • DELTAVAL (Delta Value)
    Description: Data is recorded as the difference from the smallest value in the data block. CPU requirements are minimal, and the data never expands.
    Ideal for: Unsorted, high-cardinality integer or integer-based columns.
  • DELTARANGE-COMP (Compressed Delta Range)
    Description: Stores each value as a delta from the previous one. This scheme has a high cost for both compression and decompression.
    Ideal for: High-cardinality float columns that are sorted or confined to a range.
  • COMMONDELTA-COMP (Compressed Common Delta)
    Description: Builds a dictionary of all the deltas in the block and then stores indexes into the delta dictionary using entropy coding. If the delta distribution is favorable, columns can be stored in less than one bit per row. CPU requirements are extensive.
    Ideal for: Sorted columns with predictable sequences and only occasional sequence breaks (for example, timestamps recorded at periodic intervals, or primary keys).
  • GCD-DELTA (Greatest Common Divisor Delta Value)
    Description: Data is recorded as the difference from the smallest value in the data block, divided by the greatest common divisor (GCD) of all entries in the block. CPU requirements for decoding are minimal and the data never expands, but encoding may take more time than DELTAVAL.
    Ideal for: Many-valued, unsorted integer or integer-based columns whose values are multiples of a common factor (for example, timestamps are stored internally in microseconds, so data that is precise only to the millisecond consists of multiples of 1000).
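
The selection procedure can be sketched roughly as follows. The candidate-filtering rules are a loose reading of the table above, and encoded_size is a hypothetical callback that encodes the sample and returns its size in bytes; this is an illustration, not DBD's actual logic.

    def candidate_encodings(col_type, high_cardinality, is_sorted):
        # Restrict the candidates tried for a column, based on a loose reading
        # of the table above (illustrative only, not DBD's rules).
        candidates = ["AUTO"]
        if not high_cardinality:
            candidates += ["BLOCK-DICT", "BLOCKDICT-COMP"]
            if is_sorted:
                candidates.append("RLE")   # long runs need sorted, low-cardinality data
        elif col_type in ("integer", "timestamp"):
            candidates += ["DELTAVAL", "GCD-DELTA", "COMMONDELTA-COMP"]
        return candidates

    def pick_encoding(sample, col_type, high_cardinality, is_sorted, encoded_size):
        # Try every appropriate candidate on the sample and keep whichever
        # produces the smallest storage footprint.
        return min(candidate_encodings(col_type, high_cardinality, is_sorted),
                   key=lambda enc: encoded_size(enc, sample))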

Data Sampling for Storage Optimization: Using an appropriate sampling technique is critical to the quality of the encoding/compression achieved. We experimented with two different methods: (a) random sampling and (b) clustered sampling. Random sampling selects a random set of rows for the encoding tests. Its main disadvantage is performance, because the random sample is selected by scanning every row and applying a random function.

We also noticed that a random sample is usually detrimental for certain encoding types. For example, it tends to increase the apparent number of distinct values, which hurts RLE, and to disrupt the propensity toward sequences, which negatively affects COMMONDELTA-COMP and a few others. To overcome these disadvantages, we implemented a clustered sampling technique. Clustered sampling selects contiguous rows from equally spaced bands in the database, based on a preselected sample size and band count. This approach improves performance because HP Vertica reads only the required amount of data, and it yields a sample that better preserves the runs and sequences the encoding experiments depend on.
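
Here is a minimal sketch of clustered sampling, assuming rows are addressed by position; the band count and the example numbers are illustrative, not Vertica's defaults:

    def clustered_sample_ranges(total_rows, sample_size, band_count):
        # Split the sample across equally spaced bands; each band contributes
        # one contiguous run of rows, preserving local runs and sequences.
        rows_per_band = sample_size // band_count
        band_spacing = total_rows // band_count
        ranges = []
        for band in range(band_count):
            start = band * band_spacing
            ranges.append((start, min(start + rows_per_band, total_rows)))
        return ranges   # contiguous [start, end) row ranges to read

    # Example: a 1-million-row sample drawn from a 100-million-row table in 10 bands.
    print(clustered_sample_ranges(100_000_000, 1_000_000, 10))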

Determining an Appropriate Sample Size for Storage Optimization: Selecting an appropriate sample size is critical for both the performance and the quality of the encoding/compression achieved. To decide on an appropriate sample size, we experimented with six real customer data sets, trying sample sizes in increments of one million rows, from 1 million up to 10 million. We then measured the compression ratio achieved, as defined by this equation:

Compression ratio = [storage footprint (in bytes) with optimal encodings, selected using all available data] / [storage footprint (in bytes) with approximate encodings, selected using a data sample]
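
A toy computation of this ratio for a sweep of sample sizes might look like the following; all byte counts are made-up placeholders, not measurements from the experiment:

    # Footprint with encodings chosen from all available data (hypothetical value).
    optimal_footprint = 1_000_000_000          # bytes

    # Footprints with encodings chosen from N-row samples (hypothetical values).
    approx_footprint = {
        1_000_000:  1_004_000_000,
        5_000_000:  1_002_000_000,
        10_000_000: 1_001_000_000,
    }

    for sample_size in sorted(approx_footprint):
        ratio = optimal_footprint / approx_footprint[sample_size]   # <= 1.0; closer to 1.0 is better
        print(f"{sample_size:>10,} rows: compression ratio = {ratio:.3f}")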

[Figure: compression ratio achieved at different sample sizes]

The compression ratio cannot exceed 1.0, and the closer it is to 1.0, the better. We found that a clustered sample of one million rows is sufficient to achieve a compression ratio of 99.6%. We also noticed that 1 million samples are enough to find the best encoding for most columns; the remaining columns might end up with the second- or third-best encoding.

Summary: The Database Designer module of the HP Vertica Analytic Database automatically recommends physical designs that optimize query performance, storage footprint, fault tolerance, and recovery time to meet different customer scenarios and applications. In this blog, we focused on how the Database Designer performs big data storage optimization.
