Vertica

Archive for April, 2014

Optimizing Big Data Storage in HP Vertica

With the explosion of data volumes that enterprises are capturing, new technologies such as HP Vertica give non-expert users a way to analyze and monetize their Big Data. If you are a non-expert user, the Database Designer (DBD) module in HP Vertica can help you choose a physical database design that minimizes storage footprint while optimizing the performance of the input query workload. The DBD is built to recommend good physical designs as quickly as possible, using minimal computing resources.

One key technique by which HP Vertica improves performance is data-specific encoding and compression. The encoding optimization component automatically chooses the best techniques, eliminating the need for scarce human experts to specify encodings manually. While very effective, exhaustive optimization can be resource and time intensive, often requiring all of a cluster’s resources for many hours to produce an optimal design. That is often not cost effective: the less resource intensive the optimization process is, the more it will be used and the more data people can analyze.

The HP Vertica Analytic Database was designed from the ground up to analyze massive amounts of data, building on research from MIT, Brown, and Brandeis. At least three current HP Vertica customers have over 1 PB in their production databases. Key to handling petabyte-scale data sets are sophisticated encoding and compression algorithms, because they minimize storage and I/O bandwidth requirements. HP Vertica automatically chooses an optimal physical database design, including encoding and compression, based on the schema, sample data, and query workload. As we show below, the storage optimization process is highly accurate even when it works from a small data sample.

With storage optimization, the specific challenges to address are:

  • Identifying appropriate encoding/compression methods for each column type based on cardinality, sortedness, or both.
  • Finding the best encoding/compression method among multiple candidate choices that yields the least storage footprint without compromising retrieval performance.
  • Identifying a representative data sample for storage experiments, which requires:
    • Finding an appropriate sample size
    • Identifying an appropriate sampling technique

Data Encoding: Encoding is the process of converting data into a standard format. In HP Vertica, encoded data can be processed directly, which distinguishes encoding from compression. HP Vertica uses a number of different encoding strategies, depending on column data type, table cardinality, and sort order. The query executor in HP Vertica operates on the encoded data representation whenever possible to avoid the cost of decoding. It also passes encoded values to other operations, saving memory bandwidth. In contrast, row stores and most other column stores typically decode data elements before performing any operation. The table below presents the different encoding types, with a description of each and the column types for which it is ideal.
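
To make the distinction concrete, here is a minimal Python sketch (our illustration, not Vertica’s implementation) of what operating on encoded data means: a COUNT over a run-length-encoded column only has to touch one (value, run length) pair per run instead of one entry per row.

    from itertools import groupby

    def rle_encode(values):
        # Encode a column as (value, run length) pairs.
        return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

    def count_where(rle_column, predicate):
        # Evaluate COUNT(*) WHERE predicate(value) directly on the runs,
        # without expanding the encoding back into one entry per row.
        return sum(run for value, run in rle_column if predicate(value))

    # A sorted, low-cardinality column: 1,000,000 rows collapse to just 3 runs.
    column = ["CA"] * 500_000 + ["MA"] * 300_000 + ["NY"] * 200_000
    encoded = rle_encode(column)   # [('CA', 500000), ('MA', 300000), ('NY', 200000)]
    print(count_where(encoded, lambda s: s == "MA"))   # 300000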

Data Compression: Compression is the process of transforming encoded data into a compact format. Compressed data cannot be processed directly; it must first be decompressed. HP Vertica uses integer packing for unencoded integers and LZO for compressible data. Used extensively, compression allows a column store to occupy substantially less storage than a row store. In a column store, every value stored in a column has the same data type, which enables far more effective encoding and compression, particularly in sorted columns. In a row store, each row contains multiple columns with different data types, which makes compression much less effective.
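
As a rough illustration (using Python’s zlib here as a stand-in for the LZO compressor the product actually uses), a homogeneous, sorted column compresses dramatically, but the compressed bytes must be decompressed before they can be queried:

    import json
    import zlib

    # A toy, sorted, low-cardinality column such as 'state'.
    states = ["CA"] * 50_000 + ["MA"] * 30_000 + ["NY"] * 20_000
    raw = json.dumps(states).encode()
    packed = zlib.compress(raw)
    print(len(raw), len(packed))   # the same-typed, sorted column shrinks enormously

    # Unlike encoded data, compressed data cannot be operated on directly:
    # it must be decompressed back into values before it can be queried.
    values = json.loads(zlib.decompress(packed))
    assert values.count("MA") == 30_000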

Storage Optimization: The best encoding type for each column is the one that achieves the best compression (smallest storage footprint). To find this optimal encoding, the DBD tries all suitable encoding candidates for each column, based on the column’s type and properties. It then compares the storage footprints of the candidates on a chosen data sample, skipping inappropriate choices. For example, RLE encoding is not an ideal choice for a column identified as high cardinality through statistics or primary key constraints, so the DBD won’t try the RLE encoding type for such columns, as the following table shows (a simplified sketch of this selection loop appears after the table). Also, in most cases, the Database Designer won’t experiment on a column whose encoding choice is unambiguous unless that choice affects the storage footprint of other columns.

Encoding types, descriptions, and ideal column types:

AUTO (default)
Description: Automatically picks an encoding based on properties of the data itself, when insufficient examples are known. For string, Boolean, and float types, Lempel-Ziv-Oberhumer (LZO) compression is used. For integer-based types, the compression scheme is based on the delta between consecutive column values. CPU requirements are relatively small.
Ideal for: Sorted, high-cardinality columns such as primary keys. Also suitable for general-purpose applications for which no other encoding or compression scheme is applicable.

RLE (Run-Length Encoding)
Description: Replaces sequences (runs) of identical values with a single pair that contains the value and the number of occurrences. The storage footprint for RLE and AUTO encoding of string types is the same.
Ideal for: Sorted, low-cardinality columns. Ideal when the run length is large, such as when low-cardinality columns are sorted.

BLOCK-DICT (Block Dictionary)
Description: Within a block, distinct values are stored in a dictionary, and a list of references to that dictionary represents the data block. CPU requirements are significantly higher than for the default encoding schemes.
Ideal for: Unsorted, low-cardinality columns, such as stock prices.

BLOCKDICT-COMP (Compressed Block Dictionary)
Description: Similar to BLOCK-DICT except that dictionary indexes are entropy coded. Requires significantly more CPU time to encode and decode.
Ideal for: Unsorted, low-cardinality columns with extremely skewed value distributions.

DELTAVAL (Delta Value)
Description: Data is recorded as the difference from the smallest value in the data block. CPU requirements are minimal, and the data never expands.
Ideal for: Unsorted, high-cardinality integer or integer-based columns.

DELTARANGE-COMP (Compressed Delta Range)
Description: Stores each value as a delta from the previous one. This scheme has a high cost for both compression and decompression.
Ideal for: High-cardinality float columns that are sorted or confined to a range.

COMMONDELTA-COMP (Compressed Common Delta)
Description: Builds a dictionary of all the deltas in the block and then stores indexes into that dictionary using entropy coding. If the delta distribution is excellent, columns can be stored in less than one bit per row. CPU requirements are extensive.
Ideal for: Sorted columns with predictable sequences and only occasional sequence breaks (for example, timestamps recorded at periodic intervals, or primary keys).

GCD-DELTA (Greatest Common Divisor Delta Value)
Description: Data is recorded as the difference from the smallest value in the data block, divided by the greatest common divisor (GCD) of all entries in the block. CPU requirements for decoding are minimal and the data never expands, but encoding may take more time than DELTAVAL.
Ideal for: Many-valued, unsorted integer or integer-based columns whose values are multiples of a common factor (for example, timestamps are stored internally in microseconds, so data that is precise only to the millisecond consists entirely of multiples of 1,000).
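
The following Python sketch illustrates the selection idea in a highly simplified form; it is not the DBD’s actual algorithm, and the helper names and size estimates are invented for the example. For an integer column it tries only the candidates that make sense given the column’s cardinality, measures the resulting footprint on a sample, and keeps the smallest.

    import json
    import zlib

    def rle_size(values):
        # Approximate footprint of run-length encoding: one pair per run.
        runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
        return runs * 16   # assume ~16 bytes per (value, run length) pair

    def delta_size(values):
        # Approximate footprint of delta-from-minimum encoding for integers.
        lo = min(values)
        return len(json.dumps([v - lo for v in values]).encode())

    def auto_size(values):
        # Approximate footprint of generic compression (zlib as a stand-in for LZO).
        return len(zlib.compress(json.dumps(values).encode()))

    def choose_encoding(sample, high_cardinality):
        # Try only the sensible candidates; keep the one with the smallest footprint.
        candidates = {"AUTO": auto_size, "DELTAVAL": delta_size}
        if not high_cardinality:
            candidates["RLE"] = rle_size   # RLE is skipped for high-cardinality columns
        footprints = {name: fn(sample) for name, fn in candidates.items()}
        return min(footprints, key=footprints.get), footprints

    # Example: a sorted, low-cardinality integer column; the printed footprints
    # show how strongly RLE compares against the other candidates.
    sample = [1] * 4000 + [2] * 3000 + [3] * 3000
    print(choose_encoding(sample, high_cardinality=False))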

Data Sampling for Storage Optimization: Implementing an appropriate sampling technique is critical to the quality of the encoding/compression achieved. We experimented with two different methods: (a) random sampling and (b) clustered sampling. Random sampling involves selecting a random set of rows for the encoding tests. Its main disadvantage is performance, because the random sample is selected by scanning every row and applying a random function.

We also noticed that such a random sample is usually detrimental for certain encoding types. For example, it tends to inflate the apparent number of distinct values, which hurts RLE, and it breaks up the natural sequences in the data, which negatively affects COMMONDELTA-COMP and a few others. To overcome these disadvantages, we implemented a clustered sampling technique. Clustered sampling selects contiguous rows from equally spaced bands in the database, based on a preselected sample size and band count. This approach improves performance, because HP Vertica reads only the required amount of data, and it yields a sample that is far better suited to encoding experiments.
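
A minimal sketch of the clustered-sampling idea (the function and parameter names are illustrative, not Vertica’s): divide the table’s row range into equally spaced bands and read a contiguous run of rows from the start of each band.

    def clustered_sample_ranges(total_rows, sample_size, band_count):
        # Return (start, end) row ranges: band_count contiguous bands, equally
        # spaced across the table, totalling roughly sample_size rows.
        rows_per_band = sample_size // band_count
        stride = total_rows // band_count
        return [(band * stride, min(band * stride + rows_per_band, total_rows))
                for band in range(band_count)]

    # Example: a ~1M-row sample from a 500M-row table, read as 10 contiguous bands.
    for start, end in clustered_sample_ranges(500_000_000, 1_000_000, 10):
        print(start, end)

Because every band is read sequentially, the sample preserves local sort order and run structure, which is exactly what RLE and the delta-based encodings depend on.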

Determining an Appropriate Sample Size for Storage Optimization: Selecting an appropriate sample size is critical for both the performance and the quality of the encoding/compression achieved. To decide on an appropriate sample size, we experimented with six real customer data sets, trying sample sizes in increments of one million rows, from 1 million up to 10 million. We then measured the compression ratio achieved, defined by this equation:

Compression ratio = (storage footprint, in bytes, with optimal encodings selected using all available data) / (storage footprint, in bytes, with approximate encodings selected using a data sample)

The compression ratio cannot exceed 1.0, and the closer it is to 1.0, the better. We found that a clustered sample of one million rows is sufficient to achieve a compression ratio of 99.6%. We also noticed that one million rows are enough to find the best encoding for most columns; the remaining columns might end up with the second- or third-best encoding.
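
As a sanity check on the definition above, here is a tiny sketch of the ratio computation; the footprint numbers are purely illustrative.

    def compression_ratio(footprint_with_optimal, footprint_with_approximate):
        # Footprint under encodings chosen from all data, divided by the footprint
        # under encodings chosen from a sample; at most 1.0, and higher is better.
        return footprint_with_optimal / footprint_with_approximate

    # Illustrative footprints: sample-chosen encodings cost about 0.4% extra storage.
    print(round(compression_ratio(1.000e12, 1.004e12), 3))   # 0.996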

Summary: The Database Designer module of the HP Vertica Analytic Database automatically recommends physical designs that optimize query performance, storage footprint, fault tolerance, and recovery for a wide range of customer scenarios and applications. In this blog, we focused on how the Database Designer performs big data storage optimization.

A Vertica training update: Installing Vertica and Management Console

The training team has started the process of updating and revamping our web-based training offerings, available here. Our first released rebuild is the course Installing Vertica and Management Console. Here are some of the updates that we made to the material:

  • The course covers installing both Vertica Analytics and Management Console
  • We’ve streamlined the description of the installation process, so that you don’t need to work through each step
  • We’ve improved the description of how a cluster is defined, and highlighted some of the most common configuration errors
  • We’ve included pointers to relevant sections of the documentation
  • The course is self-running, and takes less than 10 minutes to complete. You can, of course, pause and replay each page of the course.
  • The course includes voiceover descriptions and a Table of Contents for navigation, moving us away from the text-based presentation model

To start the course, click here

We hope you enjoy it, and look forward to your feedback!

Meet the Team: Amy Miller

 

It’s another weekly edition of our Meet the Team feature! Our very own Amy Miller from Support shares the story of her career here at Vertica, what she loves about her job and her team, and how she wins national hockey tournaments.

Facebook and Vertica: A Case for MPP Databases

I have just come back from a business trip to China, where I visited several large Chinese telecom customers to talk about the recent big Vertica win at Facebook. Two questions these customers constantly asked me were: What’s the future of MPP databases? Will Hadoop become the one database that rules the whole analytic space?

These seemed odd questions, considering that Facebook, one of the juggernauts of the open source community in general and the Hadoop world in particular, recently picked Vertica as the anchoring database for its ever-increasing analytical demands and has since put the biggest Vertica cluster (~300 nodes, with effective data storage of 6+ PB) into production. If a Hadoop powerhouse and the inventor of Hive (the most popular SQL-on-Hadoop database) like Facebook, with its teams of brilliant programmers and boundless resources, still thinks it needs an MPP database like Vertica in its “Big Data” technology stack for the foreseeable future, that sends a clear and strong message. Obviously Facebook thinks the answer to both questions is no, or at least not so fast. In the meantime, Facebook will continue to use Hive/HBase and other Hadoop technologies for the tasks they are good at: ETL, handling unstructured data, and complex data-mining styles of deep analysis.
 
So why does Facebook think that it needs an MPP database? Facebook has been running an EDW (Oracle Exadata, ~50 TB) for some time but feels that the existing EDW is running out of steam: it cannot keep up with rapid data growth, especially as Facebook’s mobile platform becomes more and more popular. Facebook would like to take advantage of an established commercial MPP database for its lower cost, robust ecosystem, improved data security, and better scalability and performance. The main reasons for going with an MPP database can be summarized as follows:

  • Rapidly expanding analytical needs at Facebook
  • MapReduce is too slow, and it raises security concerns
  • In-memory databases (IMDB) are too expensive and too immature
  • Current SQL-on-Hadoop databases are not good enough and too immature

Facebook invited four MPP vendors (including Vertica) to participate in two rounds of competitive POCs before declaring Vertica the ultimate winner, on the basis of Vertica’s low TCO, ease of management, and superior ad-hoc query performance.

Many SQL-on-Hadoop offerings have appeared in the last couple of years, both open source and proprietary, including but not limited to Hive, Hadapt, Citus, Impala, Stinger, and Apache Drill. Though their effort to make Hadoop more SQL-friendly is welcome, my general impression is that they are still a long way from closing the performance gap with the popular MPP databases in the marketplace (e.g., Vertica). Depending on your perspective, you may even argue that the gap is not narrowing at any pace that suggests it will close any time soon.

There is strong reason for me to believe that the SQL-on-Hadoop camp may have overestimated the effectiveness of bolting open source SQL optimizers (e.g., PostgreSQL’s) onto HDFS, and severely underestimated the effort and time it takes to produce an enterprise-quality MPP database, whose core optimizer and execution engine require years of intensive real-world use to mature, hundreds (if not thousands) of customers to validate, and millions of cases to test and train against. This is certainly more about practice than theory or concept. Query optimization is fundamentally a software problem, and there is a limit to what any “brute force” hardware-based approach can do. To echo and rephrase what the authors of “MapReduce and Parallel Databases: Friends or Foes?” said, smart software (like MPP databases) is still a good idea in the age of Hadoop and “Big Data”, and there is plenty of room and opportunity for MPP databases to thrive for a long time to come.

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.
