Understanding the Difference Between Column-Stores and OLAP Data Cubes

Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP). These workloads are characterized by queries that select a subset of tuples and then aggregate and group along one or more dimensions. For example, in a sales database, one might wish to find the sales of technology products by month and store. The SQL query to do this would look like:

SELECT month, store, COUNT(*)
FROM sales, products
WHERE productType = 'technology'
AND products.id = sales.productID
GROUP BY month, store

In this post, we study how column-stores and data cubes would each evaluate this query on a sample database.

Column Store Analysis
In column-stores, this query would be answered by scanning the productType column of the products table to find the ids of products with type technology.  These ids would then be used to filter the productID column of the sales table to find the positions of records with the appropriate product type.  Finally, these positions would be used to select data from the month and store columns for input into the GROUP BY operator.  Unlike a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.
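
To make this pipeline concrete, here is a minimal Python sketch of how a column-store might evaluate the query above. The column contents (and the use of plain in-memory lists) are invented for illustration; a real engine would operate on compressed, on-disk columns:

from collections import defaultdict

# Each column is stored (and read) independently; plain lists stand in
# for column files here.
products_id = [1, 2, 3, 4, 5]
products_type = ["grocery", "grocery", "technology", "technology", "technology"]

# The sales table, sorted on (productID, month, storeID) as described below.
sales_productID = [1, 2, 3, 3, 4, 5]
sales_month = [1, 4, 2, 2, 6, 10]
sales_storeID = [1, 1, 2, 3, 2, 3]

# 1. Scan only the productType column to find technology product ids.
tech_ids = {pid for pid, ptype in zip(products_id, products_type)
            if ptype == "technology"}

# 2. Scan only the productID column of sales to find matching positions.
positions = [i for i, pid in enumerate(sales_productID) if pid in tech_ids]

# 3. Use those positions to read just the month and storeID columns.
counts = defaultdict(int)
for i in positions:
    counts[(sales_month[i], sales_storeID[i])] += 1

print(dict(counts))  # {(2, 2): 1, (2, 3): 1, (6, 2): 1, (10, 3): 1}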

Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and from the ability to directly offset to ranges of satisfying tuples.  For example, notice that the sales table is sorted on productID, then month, then storeID.  Here, all of the records for a given productID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinct productID.  Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space.  Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID, month) pair, storeID is sorted.  For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months, and stores, then the productID column can be stored in 1,000 entries (one per product), the month column can be stored in 1,000 x 12 = 12,000 entries, and the storeID column can be stored in 1,000 x 12 x 10 = 120,000 entries.  With this compression, the amount of data read from disk is less than 5% of its uncompressed size.
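
A back-of-the-envelope check of that arithmetic, using a simplified run-length encoder and the uniform distribution assumed above (which works out to about 8 sales per (productID, month, storeID) combination):

def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# 1,000,000 sales spread uniformly over 1,000 products x 12 months x 10 stores,
# with the table sorted on (productID, month, storeID).
n, products, months, stores = 1_000_000, 1_000, 12, 10
per_cell = n // (products * months * stores)  # about 8 sales per combination

rows = [(p, m, s)
        for p in range(products) for m in range(months) for s in range(stores)
        for _ in range(per_cell)]

runs_per_column = [len(run_length_encode([r[i] for r in rows])) for i in range(3)]
print(runs_per_column)                         # [1000, 12000, 120000]
print(sum(runs_per_column) / (3 * len(rows)))  # ~0.046: under 5% of the entries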

Data Cube Analysis
Data cube-based solutions (sometimes referred to as MOLAP systems, for "multidimensional online analytical processing") are represented by commercial products such as Essbase.  They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables and the values of the cells represent pre-computed aggregates over the data.  A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above.  Here, the cube includes "roll-up" cells that summarize the values of the cells in the same row, column, or "stack" (same (x, y) position).  If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like:

[Figure: the COUNT data cube over the productID, month, and storeID dimensions, with roll-up "sum" cells along each dimension]

Here, each cell contains the count of the number of records with a given (productID, month, storeID) value.  For example, there is one record with storeID=1, productID=2, and month=April.  The "sum" fields indicate the values of the COUNT "rolled up" on specific dimensions; for example, looking at the lower left-hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months.  Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products, as indicated by their dark shading in the figure above).  Then, the system reads the pre-aggregated values from the sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in February and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.
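
Here is a sketch of that lookup in Python, using a dict as a stand-in for the dense array and the same invented sample records as in the earlier sketch; a real MOLAP engine would pre-compute and store the roll-up cells at load time rather than derive them on the fly:

from collections import defaultdict

# (productID, month, storeID) records; months are numbers, so April is 4.
sales = [(1, 1, 1), (2, 4, 1), (3, 2, 2), (3, 2, 3), (4, 6, 2), (5, 10, 3)]

cube = defaultdict(int)
for pid, month, store in sales:
    cube[(pid, month, store)] += 1
    # "Roll-up" cells, marked with None, summarize along each dimension.
    cube[(None, month, store)] += 1  # across all products
    cube[(pid, None, store)] += 1    # across all months
    cube[(pid, month, None)] += 1    # across all stores

# Query: technology sales (products 3, 4, and 5) by month and store.
tech = {3, 4, 5}
result = defaultdict(int)
for (pid, month, store), count in cube.items():
    if pid in tech and month is not None and store is not None:
        result[(month, store)] += count

print(dict(result))  # {(2, 2): 1, (2, 3): 1, (6, 2): 1, (10, 3): 1}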

The advantages of a data cube should be clear: it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries.  It can be used to efficiently compute a hierarchy of aggregates; for example, the sum columns in the above cube make it very fast to compute the number of sales in a given month across all stores, or the number of sales of a particular product across the entire year in a given store.  Because the data is stored in an array structure, and each element is the same size, direct offsetting to particular values may be possible.  However, data cubes have several limitations:

  • Sparsity: Looking at the above cube, most of the cells are empty.  This is not simply an artifact of our sample data set being small; the number of cells in a cube is the product of the cardinalities of the dimensions in the cube.  Our 3D cube with 10 stores, 1,000 products, and 12 months would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion!  Such high-dimensionality cubes cannot be stored without compression.  Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible.  For example, a common technique is to store the cube as a table containing the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!
  • Inflexible, limited ad-hoc query support: Data cubes work great when a cube aggregated on the dimensions of interest, using the desired aggregation functions, is available.  Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or wants to include aggregates on customerID in addition to the other attributes.  If no such cube is available, the user has no choice but to fall back to queries on an underlying relational system.  Furthermore, if the user wants to drill down into the underlying data, asking, for example, "who was the customer who bought a technology product at store 2 in February?", the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions).  To deal with these limitations, some cube systems support what is called "HOLAP", or "hybrid online analytical processing", where they automatically redirect queries that cannot be answered with cubes to a relational system; but such queries run only as fast as the underlying relational system can execute them.
  • Long load times: Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database).  Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.

Summary and Discussion

Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed.  They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.

In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store.

Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system.  That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world.  Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.

Relational databases for storing and querying RDF

The Resource Description Framework (RDF) is a way to describe information about relationships between entities and objects. It was originally developed by the W3C as a way to describe information about resources on the Web. It is intended to be the data model used in the Semantic Web, where web pages contain not just text but also structured records describing the data they contain and the relationships in that data.

RDF has seen widespread adoption in recent years. For example, the entire MIT library catalog is available in RDF format. More recently, a number of biology researchers have begun to publish their data in RDF, including the UniProt comprehensive catalog of protein sequence, function, and annotation data.

Understanding RDF

An RDF document consists of a collection of statements of the form subject-property-object. For example, a library database that stores data about authors and books might have statement triples like "User1 has-name 'Sam Madden'", "User1 is-an Author", "User1 wrote Book1", "Book1 is-a Book", "Book1 has-title 'Who ate my cheese?'", etc.

It should be clear that an RDF document, containing a collection of triples about a group of resources, is a structured database that users may want to browse, search, or query in a number of ways. Building tools that make it possible to do this efficiently is one of the goals of our research. In particular, we are interested in the performance of different on-disk storage representations for a collection of triples.

Designing tools to handle RDF efficiently

Our first attempts to do this have focused on leveraging relational database technology. The obvious relational representation of an RDF document is as a table with three columns, which would conventionally be stored as a series of 3-tuples laid out on disk in a row-major format. This representation, however, performs quite poorly for many types of queries. Suppose, for example, we want to find all the authors of the book "Who ate my cheese?".  We first have to find the triple "bookM has-title 'Who ate my cheese?'". We then have to perform a self join with the triples table to find all of the triples of the form "personN wrote bookM". Finally, for each author, we have to perform another self join to find triples of the form "personN has-name 'Sam Madden'".
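
Here is a Python sketch of those three steps over an in-memory triples table; the triples are invented for illustration, and in SQL each step after the first becomes a self join against the same three-column table:

triples = [
    ("User1", "has-name", "Sam Madden"),
    ("User1", "is-an", "Author"),
    ("User1", "wrote", "Book1"),
    ("Book1", "is-a", "Book"),
    ("Book1", "has-title", "Who ate my cheese?"),
]

# Step 1: find the book(s) with the given title.
books = {s for s, p, o in triples
         if p == "has-title" and o == "Who ate my cheese?"}

# Step 2: "self join" to find everyone who wrote one of those books.
authors = {s for s, p, o in triples if p == "wrote" and o in books}

# Step 3: another "self join" to fetch each author's name.
names = [o for s, p, o in triples if p == "has-name" and s in authors]

print(names)  # ['Sam Madden']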

Hence, we have been looking at alternative representations that eliminate these self joins (we still expose a logical model of a collection of triples that the user queries, but we transform user queries to apply to our modified physical representation). For example, one possible representation is to store a table where the first column contains the subject and each additional column corresponds to a particular property. This representation is sometimes called a "property representation", as shown at the bottom of the figure above.  Though this representation can have many NULL values if there are a variety of subjects with diverse properties defined, it has the advantage that all of the properties of a given object are now stored together.
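
To illustrate the property representation, here is a sketch that pivots the same invented triples into one row per subject with one column per property; None plays the role of NULL, and single-valued properties are assumed for simplicity:

from collections import defaultdict

triples = [
    ("User1", "has-name", "Sam Madden"),
    ("User1", "is-an", "Author"),
    ("User1", "wrote", "Book1"),
    ("Book1", "is-a", "Book"),
    ("Book1", "has-title", "Who ate my cheese?"),
]

rows = defaultdict(dict)
for s, p, o in triples:
    rows[s][p] = o  # assumes each subject has at most one value per property

columns = sorted({p for _, p, _ in triples})
print("subject", columns)
for subject, props in rows.items():
    print(subject, [props.get(p) for p in columns])  # None where undefined
# subject ['has-name', 'has-title', 'is-a', 'is-an', 'wrote']
# User1 ['Sam Madden', None, None, 'Author', 'Book1']
# Book1 [None, 'Who ate my cheese?', 'Book', None, None]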

Our work in this area, "Scalable Semantic Web Data Management Using Vertical Partitioning," appeared in the VLDB Conference in Vienna in September. It showed that using a column-oriented database, along with this property representation, allows us to overcome the overhead of representing NULLs, while providing two orders of magnitude better performance than the naive triples representation. This is particularly true when processing queries that must access many triples during execution (e.g., computing the number of books grouped by subject area or institution). Of course, there is a fair amount of subtlety to getting good performance out of such a representation. Have a look at our conference paper for the details!

Caveats for column- and row-store databases

As we’ve discussed elsewhere in this blog, column-stores can perform worse than row-stores for certain classes of queries. In particular, for lookups of a single record (e.g., all of the information about a particular author), a row-oriented database (using a property representation) may outperform a column-oriented system. This is because the row store only has to seek to one location on disk to read the data for the record, whereas a column store has to seek into each column to reconstruct the entire record.

There are other situations where neither a row- nor column-oriented property representation is ideal. Imagine, for example, a user browsing an RDF-based Web site containing our library database. During browsing, suppose the user navigates from books or articles, to authors, to related books and articles, and so on. Such browsing queries in a property representation will lead to (slow) self-joins on the property table, just as they did in the triples table. Hence, a more sensible representation for a browsing-oriented database would be to store a given record R near to records the user is likely to navigate to from R. This is the topic of our current research in this area.

* Editor's note: While this post will show up in the blog as written by Sam Madden, it has two authors: Samuel Madden (MIT) and Daniel Abadi (Yale)

Database parallelism choices greatly impact scalability

Large databases require the use of parallel computing resources to get good performance. There are several fundamentally different parallel architectures in use today; in this post, Dave DeWitt, Mike Stonebraker, and I review three approaches and reflect on the pros and cons of each. Though these tradeoffs were articulated in the research community twenty years ago, we wanted to revisit these issues to bring readers up to speed before publishing upcoming posts that will discuss recent developments in parallel database design.


Shared-memory systems don’t scale well as the shared bus becomes the bottleneck

In a shared-memory approach, as implemented on many symmetric multi-processor machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program. Complex distributed locking and commit protocols are not needed because the lock manager and buffer pool are both stored in the memory system where they can be easily accessed by all the processors.

Unfortunately, shared-memory systems have fundamental scalability limitations, as all I/O and memory requests have to be transferred over the same bus that all of the processors share. This causes the bandwidth of the bus to rapidly become a bottleneck. In addition, shared-memory multiprocessors require complex, customized hardware to keep their L2 data caches consistent. Hence, it is unusual to see shared-memory machines with more than 8 or 16 processors unless they are custom-built from non-commodity parts (and if they are custom-built, they are very expensive). As a result, shared-memory systems don’t scale well.


Shared-disk systems don’t scale well either

Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) system or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been widely used by Sun Microsystems and Hewlett-Packard.

Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the interconnection network that connects each of the CPUs to the shared-disk subsystem can become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the processors, there is no obvious place for the lock table or buffer pool to reside. To set locks, one must either centralize the lock manager on one processor or resort to a complex distributed locking protocol. This protocol must use messages to implement in software the same sort of cache-consistency protocol implemented by shared-memory multiprocessors in hardware. Either of these approaches to locking is likely to become a bottleneck as the system is scaled.

To make shared-disk technology work better, vendors typically implement a “shared-cache” design. Shared cache works much like shared disk, except that, when a node in a parallel cluster needs to access a disk page, it first checks to see if the page is in its local buffer pool (“cache”). If not, it checks to see if the page is in the cache of any other node in the cluster. If neither of those efforts works, it reads the page from disk.

Such a cache appears to work fairly well on OLTP but performs less well for data warehousing workloads. The problem with the shared-cache design is that cache hits are unlikely to happen because warehouse queries are typically answered using sequential scans of the fact table (or via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster, sequential scans do not typically benefit from large amounts of cache. Thus, the entire burden of answering such queries is placed on the disk subsystem. As a result, a shared cache just creates overhead and limits scalability.

In addition, the same scalability problems that exist in the shared-memory model also occur in the shared-disk architecture. The bus between the disks and the processors will likely become a bottleneck, and resource contention for certain disk blocks, particularly as the number of CPUs increases, can be a problem. To reduce bus contention, customers frequently configure their large clusters with many Fibre Channel controllers (disk buses), but this complicates system design because now administrators must partition data across the disks attached to the different controllers.

Shared-nothing scales the best

In a shared-nothing approach, by contrast, each processor has its own set of disks. Data is “horizontally partitioned” across nodes. Each node has a subset of the rows from each table in the database. Each node is then responsible for processing only the rows on its own disks. Such architectures are especially well suited to the star schema queries present in data warehouse workloads, as only a very limited amount of communication bandwidth is required to join one or more (typically small) dimension tables with the (typically much larger) fact table.
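
As a toy illustration, here is a sketch of hash-based horizontal partitioning with purely local aggregation; each "node" is just a Python list standing in for an independent machine with its own disks, and the table contents are invented:

NODES = 4
fact_rows = [("ca", 10), ("ma", 50), ("ca", 55), ("nv", 30), ("ma", 65)]

# Horizontally partition the fact table across nodes by hashing a key.
partitions = [[] for _ in range(NODES)]
for row in fact_rows:
    partitions[hash(row[0]) % NODES].append(row)

# Each node aggregates only the rows on its own disks...
local_results = []
for node_rows in partitions:
    local = {}
    for state, amount in node_rows:
        local[state] = local.get(state, 0) + amount
    local_results.append(local)

# ...and a coordinator merges the small per-node results.
totals = {}
for local in local_results:
    for state, amount in local.items():
        totals[state] = totals.get(state, 0) + amount
print(totals)  # e.g. {'ca': 65, 'ma': 115, 'nv': 30}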

In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because shared-nothing systems typically suffer far less bus and resource contention than shared-memory or shared-disk machines, shared nothing can be made to scale to hundreds or even thousands of machines. Because of this, it is generally regarded as the best-scaling architecture.

As a closing point, we note that this shared-nothing approach is completely compatible with other advanced database techniques we’ve discussed on this blog, such as compression and vertical partitioning. Systems that combine all of these techniques are likely to offer the best performance and scalability when compared to more traditional architectures.

Follow up on compression post: Columns, indices, and sorting

Earlier this week I wrote about the advantages of compression in column-oriented databases. A reader had questions about an example I used and the issue of sorting and indexes. I thought the commenter’s points and questions were worth exploring in some depth.

Keeping columns in order

The commenter’s key question was really about maintaining the correspondence between columns in a column-oriented database. The issue isn’t really “getting the data back in the original order” as much as it is ensuring that the system has a way of accessing records in other columns that correspond to a particular record in a sorted column. I’ll explain the details of how this works.

The simplest way to do this is to just store one copy of the table sorted on one or more attributes. For example, suppose we have the following table sorted on state:

name   state  salary
joe    ca     10k
dave   ca     55k
mary   ma     50k
nancy  ma     65k
mike   nv     30k

We can physically represent this as three columns, each stored in the order shown in a different file. We can RLE encode the state column to reduce its space, and, if we secondarily sort the table on salary, can probably also benefit from delta encoding salaries and dictionary encoding names. Any query that includes a predicate over state (e.g., state = 'ca') can be evaluated quickly, because we can figure out which positions (e.g., 1 and 2) in the three columns need to be looked up to find matching records, without decompressing the state column or looking at all of the records of the other two columns.
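
As a sketch, evaluating such a predicate directly on the run-length encoded state column might look like the following; the runs are (value, length) pairs, positions are 0-based here, and the column is never decompressed:

# RLE-encoded state column from the table above: (value, run_length) pairs.
state_runs = [("ca", 2), ("ma", 2), ("nv", 1)]

def matching_positions(runs, predicate):
    """Return positions of records satisfying the predicate, testing each
    distinct run once instead of once per record."""
    positions, offset = [], 0
    for value, length in runs:
        if predicate(value):
            positions.extend(range(offset, offset + length))
        offset += length
    return positions

print(matching_positions(state_runs, lambda v: v == "ca"))  # [0, 1]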

Of course, we need to be able to efficiently perform this lookup of a given set of positions in a column (e.g., finding the names and salaries that correspond to positions 1 and 2 in the state column). We can do this much faster than a conventional join (which is what the commenter's foreign key suggestion would require) because the name and salary columns are stored in the same order as the state column (so the 1st state corresponds to the 1st entry in the name file, etc.). This allows us to either:

  1. Directly offset to the locations in the columns where a particular position is stored, if the column is uncompressed and fields are fixed length (e.g., each record is the same size). This would be the case for salary in the previous example if it were stored uncompressed. If records are variable-sized or the column is compressed (as would be the case if name is stored as a text field, for example), we cannot apply direct offsetting and must use the approach below.
  2. Use a sparse index to map positions to entries in a column (a minimal sketch follows this list). A sparse index is simply an index with one entry per *page* of the column file, where an entry indicates the range of positions on the corresponding page. So if the name column has 10,000 names stored on 100 pages, there would be 100 entries in the sparse index. To look up a given position, the system searches the index to quickly find the pages that contain the positions of interest. Because these sparse indices are very compact, they can usually be stored in memory or on a single disk page, and this extra lookup imposes very little overhead (unlike a traditional database index, which can be quite large and require a significant number of I/Os to query).

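Here is the sparse-index lookup sketched in Python, with an invented page size; the point is that the index has one entry per page, not one per record:

import bisect

PAGE_SIZE = 100  # positions per page; invented for illustration

# One entry per page: the first position stored on that page.
# For 10,000 names on 100 pages, the index has just 100 entries.
sparse_index = [page * PAGE_SIZE for page in range(100)]

def page_for_position(position):
    """Binary-search the sparse index for the page holding `position`."""
    return bisect.bisect_right(sparse_index, position) - 1

print(page_for_position(1))     # page 0
print(page_for_position(250))   # page 2
print(page_for_position(9999))  # page 99
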
Notice that the positions will be accessed sequentially within each column file, which further improves performance (especially when a range of consecutive positions is accessed).

That said, the commenter was correct that, in this example, we really only get the maximum benefit of RLE on the state column; the benefit for a secondary or tertiary sort on name or salary (if we sort on them at all) is potentially much smaller.

One way to address this is to replicate subsets of columns, sorting each subset in a different order. We call such subsets “projections” (where the columns in a projection are still each stored in separate files). For example, a user might create one projection with the above three columns sorted in state order and another projection with just name and salary sorted in salary order (which will compress well using delta or RLE encoding). That way, if a query wants to find employees in a particular salary range, it can do that quickly using this second projection. Of course, for a query to be able to use a particular projection, that projection must have the columns used in the query (or the system must run an expensive join operation to glue together two projections that are sorted in different orders).

Future post: Projections to maximize query performance

This, of course, presents an interesting physical database design question: What projections should the user (or administrator) create to maximize query performance for a given query workload? This is the idea behind the automatic database design features in C-Store and Vertica, which will be the subject of a future post.

Good things come in small packages: The advantage of compression in column databases

In this post, we’ll discuss how column-oriented databases are able to more effectively exploit compression than a typical row-oriented system. There are two key points:

  • Columns of data from the same attribute compress better than rows of tuples in a table, and
  • A well-architected database engine using appropriate compression techniques can operate directly on the compressed data, without decompressing it.

Achieving a high degree of compression is highly desirable. Depending on the dataset, column databases use up to 90% less storage space than the raw data loaded into the database. This means column databases require less storage hardware (and associated space, cooling & power) and/or allow users to analyze much more data on the hardware they already have.

There are several common approaches to database compression. One involves compressing tables using a block-based dictionary compression algorithm like Lempel-Ziv (the algorithm used by gzip and other tools). A variant of these block-based dictionary schemes is the value-based approach, which replaces individual values in records with shorter values from a dictionary; for example, every time the string "John Smith" occurs, it might be replaced with the code "*A". Other codes would be used for other common strings.
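
A minimal sketch of value-based dictionary encoding; the codes here are small integers rather than strings like "*A", but the idea is the same:

def dictionary_encode(values):
    """Replace each value with a short integer code; return codes plus dictionary."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return codes, dictionary

names = ["John Smith", "Jane Doe", "John Smith", "John Smith"]
codes, dictionary = dictionary_encode(names)
print(codes)       # [0, 1, 0, 0] (each repeated string costs one small code)
print(dictionary)  # {'John Smith': 0, 'Jane Doe': 1}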

Another type of compression can be applied to sorted sequences of numbers or values that generally differ by just a few bits. For example, if a collection of tuples contains the values "10001, 10002, 10005, 10008" in one attribute, those can be compactly encoded as a series of differences, or deltas: "10001, +1, +3, +3". Though this hasn't reduced the number of entries in the list, the differences are much smaller than the original numbers and so can be stored with fewer bits (in this case, it takes 16 bits to encode 10001, but just 2 to encode +1 and 3 to encode +3). Text data works here too, if you think of a string as really being a sequence of numeric ASCII character codes.
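
And a minimal sketch of delta encoding for the sequence above; decoding is just a running sum, so the original values are always recoverable:

def delta_encode(values):
    """Store the first value, then successive differences."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    values = deltas[:1]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

print(delta_encode([10001, 10002, 10005, 10008]))  # [10001, 1, 3, 3]
print(delta_decode([10001, 1, 3, 3]))              # [10001, 10002, 10005, 10008]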

These methods effectively reduce the space requirements for storing data and work similarly in column- or row-oriented systems. But column-oriented databases can also take advantage of other forms of compression that row-oriented databases cannot easily leverage. We will spend the rest of this post examining alternative compression techniques that are advantageous to column-oriented database systems.

The column compression advantage

To understand the additional advantage that column-oriented databases have with respect to compression, let’s examine how a column-oriented database structures data on disk and how compression on that stored data works.

Think of a column-oriented database as storing each column of a table in a separate file. This means that when the database system wants to access data, it only needs to read the files representing those columns that it cares about. This is unlike a row-oriented database that must always scan over all of the columns in the rows it is reading.

When the database wants to reconstruct a tuple in the original table (remember, column-oriented databases typically still provide a relational API!), it needs some way to establish correspondence between different columns. The simplest way to do this is to ensure that the positions in the different files line up, e.g., that the nth entry in column 1 belongs to the same tuple as the nth entry in column 2, and so on. Note, however, that these columns can be stored in any order the system wants, and that certain orders (e.g., those sorted on one or more of the attributes, such as customer state and then customer name) may allow the system to rapidly find records of interest (e.g., Smiths in California).

Rather than simply storing each column once in one sort order, column-oriented databases will often store each column multiple times, in different sort orders, much as materialized views are used in traditional database systems. This improves query performance since there is a higher probability of there being a “good” sort order for a particular query. But storing different columns in different sort orders provides the database with another benefit: sorted columns of data compress extremely well.

There are several reasons for this. First, when a sorted column has only a few distinct values (such as a column of states, genders, departments, or even product codes), there will be many repeated values that can be compressed away. For example, given a list of numbers like 1,1,1,1,2,2,2,3,3, it's clearly more compact to encode this as 4 x 1, 3 x 2, 2 x 3 (we've used just six numbers instead of nine). This type of compression is called run-length encoding. Of course, this technique can be used with textual data just as effectively as with numeric data.

A second reason that columns compress so well is that there is relatively little variability between successive values in a column; that is, columns are low entropy. This means that schemes like Lempel-Ziv encoding, which are especially good at compressing such low-entropy data, can be substantially more effective in column-based systems: individual columns tend to have fewer differences between successive values than entire rows do, and hence compress better.

Improved compression benefits

The benefit of techniques like run-length encoding can be tremendous. Imagine a table with 50 million sales records, sorted on state and then last name as described above. To represent the state column using run-length encoding, the system needs just 50 state names and 50 run lengths (e.g., AK x 1,000,000, AL x 1,000,000, etc.). Perhaps more surprisingly, there's also a big win to be had for the last name column. Though it won't be strictly sorted, it will be sorted within each state (e.g., when the two columns are joined, they will form a sorted list of tuples like (AK, Aaronson), (AK, Adams), ..., (AL, Adams), (AL, Baker), (AL, Cook), ....). Last names within each state can thus be run-length or delta encoded quite effectively. For example, in the 50 million record database, if there are 10,000 last names in the database, and customers are evenly spread across 50 states, there will be 50,000,000 / (50 x 10,000) = 100 consecutive repeats of each last name in each state. This means run-length encoding will reduce the size of the last name column by a factor of 100.

The compression techniques described above have another benefit: the database system often doesn't need to convert compressed records back into their uncompressed representation before operating on the data (direct operation can be applied to run-length, delta encoded, and some kinds of dictionary encoded data). For example, it's trivial to determine whether a predicate is satisfied by a run-length encoded (run length, value) pair by simply applying the predicate to the value. And when that predicate is applied, it applies to the entire run's worth of tuples. It is possible to efficiently join or aggregate run-length encoded data too, and the performance wins can be dramatic. For example, in his SIGMOD 2006 paper "Integrating Compression and Execution in a Column-Oriented Database System", my student Daniel Abadi showed that for many queries over the TPC-H data set, performance gains in C-Store (an academic predecessor to the Vertica Database) are an order of magnitude or more relative to an uncompressed database.
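
As a sketch of such direct operation, both the predicate and the aggregate below touch each run exactly once, covering the whole run's worth of tuples without decompressing anything (the runs are invented for illustration):

# An RLE-encoded state column: (value, run_length) pairs.
runs = [("AK", 1_000_000), ("AL", 1_000_000), ("AR", 900_000)]

# Predicate: applied once per run, not once per tuple.
matching = [(value, length) for value, length in runs if value == "AL"]

# Aggregate: COUNT(*) GROUP BY state falls straight out of the run lengths.
counts = {value: length for value, length in runs}

print(matching)  # [('AL', 1000000)]
print(counts)    # {'AK': 1000000, 'AL': 1000000, 'AR': 900000}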

Notice that these compression techniques would be much less effective in a row-oriented system, because successive tuples in a table have a lot less in common than successive values in a column. Though it's probably possible to re-architect a row-oriented database to get some of the advantages described above, doing so is tantamount to making the system column-oriented, since it requires the compressed data to be stored separately in a compact representation. So, the bottom line is that redundant columns, multiple sort orders, and column orientation pave the way for dramatic performance gains from compression.

Note: This post has been updated to improve clarity about compression techniques.
