This blog post was authored by Steve Sarsfield.
For decades, it’s been widely accepted that snowflake and star schemas are the way to get optimal performance from your data warehouse. You normalize data by identifying the kinds of records you typically ingest and creating a schema optimized for the types of queries you want to run. The schema sets up logical tables that group, and sometimes duplicate, the data so queries run fast.
Taking this approach can help avoid performance problems in a standard RDBMS, which must scan through many rows to find the data required for a given query. Breaking up data into separate tables helps address the scalability limits of a row-store database. Thus, schemas for normalized data typically comprise one or more large fact tables and many smaller dimension tables. Analytical queries often involve joins between a large fact table and multiple dimension tables.
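A typical analytical query against such a schema joins the fact table to several dimension tables. The sketch below uses hypothetical table and column names (`fact_sales`, `dim_date`, `dim_store`) purely for illustration:

```sql
-- Illustrative star-schema query: one fact table joined to two dimensions.
-- Table and column names are hypothetical, not from any particular warehouse.
SELECT d.year,
       s.region,
       SUM(f.sale_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date   d ON f.date_key  = d.date_key
JOIN   dim_store  s ON f.store_key = s.store_key
GROUP BY d.year, s.region;
```

Every dimension added to the analysis adds another join, which is exactly the cost that denormalization tries to avoid.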
Schemas are designed both for the data you ingest and for the analysis you run on that data. Changes in either might require you to refactor your schemas. Refactoring is tiresome and can be expensive. Much has been written about this by Ralph Kimball, Bill Inmon, and many others. There are many ways to refactor, all with potential benefit and peril. When you refactor, you run the risk of slowing down other queries that rely on the current schema, or of breaking them altogether. Thus, analytics teams tend to avoid refactoring, even at the expense of query performance.
In contrast, a column store database such as Vertica has few limitations on the number of columns each table can have, as table width has much less impact on performance. A column store does not need to scan across millions of rows to find the data it wants; it can access each column directly. A column store database lets you create and use schemas that are optimized for a particular set of data and a particular type of question. You can also create projections on the fly to use the data for another purpose.
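For example, a projection can store a sorted, query-specific copy of just the columns a workload touches. The names below (`fact_sales`, `store_key`) are hypothetical; this is a minimal sketch of Vertica's `CREATE PROJECTION` syntax, not a tuned design:

```sql
-- Illustrative projection: only the columns one workload needs,
-- sorted and segmented to favor per-store aggregation.
CREATE PROJECTION fact_sales_by_store AS
SELECT store_key, date_key, sale_amount
FROM   fact_sales
ORDER BY store_key, date_key
SEGMENTED BY HASH(store_key) ALL NODES;
```

In practice, Database Designer can propose projections for you based on sample queries, so hand-writing them is optional.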
Given this, some Vertica users create wide tables that combine the fact and dimension table columns that their queries require. These tables can dramatically speed up query execution. However, maintaining redundant sets of normalized and denormalized data carries its own administrative costs.
With release 8.1, Vertica introduced flattened tables, which minimize these problems. Flattened tables can include columns that get their values by querying other tables. Operations on the source tables and the flattened table are decoupled; changes in one are not automatically propagated to the other. This minimizes the overhead that is otherwise typical of denormalized tables.
Flattened tables offer the following benefits:
• Performance: Queries run faster on a flattened table than on a standard or optimized join.
• Simplicity: The absence of complex joins simplifies queries for analysts.
• Ease: Creating flattened tables requires no refactoring.
• Storage and price: Vertica flattened tables require only minimal additional disk space, so they have no significant impact on license costs.
If your schemas show signs of needing to be refactored, check out flattened tables. For more information, see the Vertica online documentation and other resources.