Archive for the ‘sorted data’ Category

Life Beyond Indices: The Query Benefits of Storing Sorted Data

With the Vertica Analytics Platform, there are a number of benefits to storing compressed and sorted data, as well as operating directly on compressed data, that we have discussed in previous posts. In this post, I’m going to discuss how the Vertica Analytics Platform takes advantage of this sorted data to make query execution blindingly fast, which obviates the need for traditional DBMS indexes.

Unlike traditional DBMS solutions, Vertica has no user-defined indexes. Indexes in an analytic database take up DBA time (figuring out which indexes to make), storage capacity, and load time (to keep them up to date). Even if an index consumes only 10% of the size of the original data and takes 10% more time during load, storing even a few indexes on terabytes of data is costly.  As we have mentioned before, a true column store isn’t just a vertically-partitioned row store.

How does Vertica query huge volumes without indexes? It’s easy… the data is sorted by column value, something we can do because we wrote both our storage engine and execution engine from scratch. We don’t store the data by insert order, nor do we limit sorting to within a set of disk blocks. Instead, we have put significant engineering effort into keeping the data totally sorted during its entire lifetime in Vertica. It should be clear how sorted data increases compression ratios (by putting similar values next to each other in the data stream), but it might be less obvious at first how we use sorted data to increase query speed as well.

Let’s start with the simplest and easiest to understand example: the data is sorted the way a query requests it (ORDER BY). Consider a snippet of trading data sorted by stock and price (see Table 1).  If the user’s query requests all the data ordered by the stock and price, they might use something like:

SELECT stock, price FROM ticks ORDER BY stock, price;

Clearly Vertica is off the hook to do any sort at runtime: data is just read off disk (with perhaps some merging) and we are done.

Table 1: Illustration of data sorted on (stock, price). Other columns are omitted for clarity.

A more interesting query might ask for a single stock’s data ordered by price:

SELECT stock, price FROM ticks WHERE stock=’IBM’ ORDER BY price;

Finding rows in storage (disk or memory) that match stock=’IBM’ is quite easy when the data is sorted, simply by applying your favorite search algorithm (no indexes are required!). Furthermore, it isn’t even necessary to sort the stock=’IBM’ rows because the predicate ensures the secondary sort becomes primary within the rows that match as illustrated below:

Table 2: when only rows that match stock=’IBM’ are considered, the results are ordered by price, and thus no additional sorting is required.

Next, let us consider a query that computes the average price for each stock symbol:

SELECT stock, avg(price) FROM ticks GROUP BY stock;

In general, the aggregator operator does not know a priori how many distinct stocks there are nor in what order that they will be encountered. One common approach to computing the aggregation is to keep some sort of lookup table in memory with the partial aggregates for each distinct stock. When a new tuple is read by the aggregator, its corresponding row in the table is found (or a new one is made) and the aggregate is updated as shown below:

Table 3: Illustration of aggregation when data is not sorted on stock. The aggregator has processed the first 4 rows: It has updated HPQ three times with 100, 102 and 103 for an average of 101.66, and it has updated IBM once for an average of 100. Now it encounters ORCL and needs to make a new entry in the table.

This scheme, often denoted as “Group By Hash” because a hash table is used as the lookup data structure, does a good job when there are a small number of groups. However, when there are a large number of groups, it takes significant RAM to store the hash table and provisions need to be made when RAM is exhausted (typically by spilling to disk).

With Vertica, a second type of aggregation algorithm is possible because the data is already sorted, so every distinct stock symbol appears together in the input stream. In this case, the aggregator can easily find the average stock price for each symbol while keeping only one intermediate average at any point in time. Once it sees a new symbol, the same symbol will never be seen again and the current average may be generated. This is illustrated below:

Table 4: Illustration of aggregation when data is sorted on stock. The aggregator has processed the first 7 rows. It has already computed the final averages of stock A and of stock HPQ and has seen the first value of stock IBM resulting in the current average of 100. When the aggregator encounters the next IBM row with price 103 it will update the average to 101.5. When the ORCL row is encountered the output row IBM,101.5 is produced.

This scheme, commonly called “one pass aggregation” has pipelined parallelism (the same concept as instruction pipelining) if the data is already sorted according to stock. This means we can start producing tuples for downstream operators to consume almost immediately. Given that the Vertica execution is multi-threaded, and all modern machines have multiple cores, pipelined parallelism decreases query execution time.

Of course, one pass aggregation is used in other systems (often called SORT GROUP BY), but they require a sort at runtime to sort the data by stock. Forcing a sort before the aggregation costs execution time and it prevents pipelined parallelism because all the tuples must be seen by the sort before any can be sent on. Using an index is also a possibility, but that requires more I/O, both to get the index and then to get the actual values. This is a reasonable approach for systems that aren’t designed for reporting, such as those that are designed for OLTP, but for analytic systems that often handle queries that contain large numbers of groups it is a killer.

I hear you ask what kind of real-world queries aggregate large numbers of groups? There are at least two very common scenarios that our customers encounter: distinct counts and correlated subqueries with aggregation that have been flattened into joins. Our web analytics customers typically have queries that look for distinct visitors given some condition such as:

SELECT count(DISTINCT visitor_id) FROM user_sessions WHERE <filtering predicates>;

The applicability of one-pass aggregation can be seen if we rewrite the query to an equivalent form:

SELECT COUNT(sq.visitor_id) from (select visitor_id FROM user_sessions WHERE <filtering predicates> GROUP BY visitor_id) as sq

And as such is amenable to the same “group by pipeline” optimization of data sorted on visitor_id. As you are probably glazing over at this point, I will postpone further technical discussion of flattened subqueries for a future discussion if there is sufficient interest.

Another area where having pre-sorted data helps is the computation of SQL-99 analytics. We can optimize the PARTITON BY clause in a manner very similar to GROUP BY when the partition keys are sequential in the data stream. We can also optimize the analytic ORDER BY clause similarly to the normal SQL ORDER BY clause.

The final area to consider is Merge-Join. Of course this is not a new idea, but other database systems typically have Sort-Merge-Join, whereby a large join can be performed by pre-sorting the data from both input relations according to the join keys. Since Vertica already has the data sorted, it is often possible to skip the costly sort and begin the join right away.

Since sorting is such a fundamental part of our system, we have built sophisticated infrastructure in the Vertica Optimizer to track the sortedness of various intermediate results. Our infrastructure takes into account that some columns are equivalent after joining, that some columns have had constant predicates, that some expressions (e.g. price * 100) maintain sortedness, and a host of other factors. By keeping careful track, we maximize the opportunities to apply the optimizations shown above, all without any additional storage.

Of course, Vertica is not limited to a single sort order for each table. In fact, if redundant copies of the data need to be stored to survive node failures, the different copies can be stored with different sort orders. Different sort orders furthers the chance that we can apply one of our sort-based optimizations. And lest you think we are simply swapping determining sort order for determining indexes for a new DBA headache, the optimal sort order of the physical storage is typically automatically determined by the Vertica Database Designer!

If anyone wants me to spell out a specific topic in more detail leave a comment below and let me know!


More Time Series Analytics: Event-based Window Functions

An event-based window function assigns input rows to windows based on their non-timestamp column values.  The Vertica event-based window functions, one of Vertica’s many in-database analytics, assign to each input row an integer value representing the window ID, starting from 0. The window ID is incremented when a new window starts.

Two previous posts discussed Gap Filling and Interpolation (GFI) time series analytics in Vertica and some use cases. GFI analytics groups the input rows according to consecutive uniform time intervals referred to as time slices, and then performs gap filling and interpolation on the group of rows associated with each time slice. We say the grouping of the input rows in GFI is time-based.

In some use cases, however, the rows in the input time series data need to be grouped according to certain column values, instead of their timestamps. For example, given an input stream of MSFT stock quotes, the stock analyst may want to place the input quotes into a new group whenever the spread (the difference between the ask price and the bid price) goes above $0.05. If we view each such group as awindow of events, then the window endpoints are defined by the occurrence of certain event types. In the above example, the window border-defining event type is a stock quote whose spread is above $0.05.

The above Financial Services query example can be formulated as follows.

SELECT symbol, ask, bid, timestamp, CONDITIONAL_TRUE_EVENT(ask – bid > 0.05) OVER (PARTITION BY symbol ORDER BY timestamp) spread_window_id

FROM Tickstore;

The Almighty Event-based Window Function CONDITIONAL_TRUE_EVENT

The primary event window function in Vertica 4.0 time series analytics is CONDITIONAL_TRUE_EVENT, which we abbreviate as CTE in subsequent text. CTE takes an input Boolean expression P (P for predicate). P is evaluated once for each input row. When P is evaluated to true, the associated input row is labeled with a new window ID.

Below is a pictorial example illustrating the semantics of CTE. Let the input Tickstore table contain the following rows (the ask column is omitted for simplicity). Ignore the output column window_id for now.

symbol bid timestamp window_id
XYZ 10.0 03:00:00 0
XYZ 11.0 03:00:03 1
XYZ 10.5 03:00:06 1
XYZ 11.0 03:00:09 2



Now let us answer the following query. The values of its output column window_id are shown in the above table.

SELECT symbol, bid, timestamp, CONDITIONAL_TRUE_EVENT(bid > 10.6) OVER(PARTITION BY symbol ORDER BY timestamp) window_id

FROM Tickstore;

The input and output data can be visualized in the following figure. The blue dots represent the input rows. Whenever the bid price goes above the threshold value $10.6, the output window_id is incremented.


Accessing Previous Rows in Event-based Window Functions

In the example of event-based window functions provided above, the Boolean expression P within CTE only accesses values from the current row. However, sometimes the window border-defining event type involves a sequence of recent past rows as well as the current row. For example, we may want to define a new window whenever the average value of bid and ask in the current row is above that in the last row. This Boolean expression can be formulated in Vertica as (bid1 + ask1) / 2 – (LAG(bid1) + LAG(ask1))/2 > 0. More generally, we use the analytic functional syntax LAG(x, n) to retrieve the value of column X in the nth to last input row. The second parameter n is optional, and defaults to 1.

With its ability to access previous rows, we can show that CTE can express any event-based window functions whose input involves the current row and the past n rows for any arbitrary finite number n. A formal proof is omitted here in an attempt to keep the readers on this post.

Another Event-based Window Function CONDITIONAL_CHANGE_EVENT

Having covered the extremely powerful event-based window function CTE, we now introduce a second function CONDITIONAL_CHANGE_EVENT, abbreviate as CCE in this post. CCE takes an input expression E of any data type. E is evaluated once for each input row. When the value of E on the current row is different from the value of E on the previous row, the current row is labeled with a new window ID.

Semantically, CCE is a mere special version of CTE, because CCE(E(current row)) º CTE(E(current row) <> E(previous row)). However, proper use of CCE can result in more compact query formulation, and possibly better run-time performance.

More Use Cases of Event-based Window Functions

Besides Financial Services, event-based window functions have applications in many other industry sectors as well. Let us turn to log analysis, where the log data can be produced by click streams, software programs, online games, etc. say the popular MMORPG game World of Warcraft logs a sequence of action events for each in-game character. Each character can work on one major task at a time (e.g. slay dragons, obtain magical artifacts, etc). For each task being taken on by the character, its log events consist of a START TASK event, followed by a sequence of action events pertinent to accomplishing this task. An example table schema that stores such log data can be (character, event_type, timestamp). Now the log analyst would like to group the log events based on the tasks they are associated with. This can be accomplished by this event-based window function: CONDITIONAL_TRUE_EVENT(eventType = ‘START TASK’) OVER (PARTITION BY character ORDER BY timestamp) task_id.

It turns out that for clickstream analytics, CTE is a powerful tool to implement in-database sessionization capability with unmatched flexibility and performance. This will be the subject of a future post.

Reading between the Lines with Vertica 4.0

In the recent blockbuster flick Avatar, both the hero and heroine possess the skill referred to as “I See you” in Na’vi speak. Or as we Earthlings may say, perhaps to a more accurate degree, that they know how to read between the lines.

In a key scene, Neytiri and Jake speak under the Tree of Voices. Neytiri tells Jake that he is one of the Omaticaya now and it is time for him to choose a companion. As she starts listing the fine candidates in her tribe, Jake suppresses a smile and replies (in Na’vi): “I’ve already chosen. But this woman must also choose me.” At Jake’s response, Neytiri’s face turns into relief and satisfaction: “She already has.”

The skill of interpolation is indispensable to business as well. Say you are a financial analyst looking to finding possible price correlations between the bid prices of Google and Microsoft over the past 3 months. While the stock ticker stream stored in your database contains all the necessary bid events of both stocks, these events do not always occur at regular time intervals, preventing you from comparing apples to apples. Here’s what is stored in your database:

Symbol Bid Timestamp
MSFT 30.83 09:59:59
GOOG 529.10 10:00:00
MSFT 30.87 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

To begin your analysis, you’d like to normalize your data, by extracting the bid prices at regular time intervals, say one bid price per second for each stock. You know the bid price of MSFT is at $30.83 at time 09:59:59, and it subsequently rises to $30.87 at 10:00:02. What should the bid price be between these time points, so at 10am? In financial analysis, it is standard practice to assume that a stock’s bid price remains constant until the next bid event occurs. Therefore, between 09:59:59 (inclusive) and 10:00:02 (exclusive), the bid price of MSFT remains at $30.83. Based on this understanding your normalized output, starting at 10am, will look like (interpolated values are in green):

Symbol Bid Timestamp
MSFT 30.83 10:00:00
GOOG 529.10 10:00:00
MSFT 30.83 10:00:01
GOOG 529.10 10:00:01
MSFT 30.87 10:00:02
GOOG 529.10 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

Voilà! You have successfully synthesized the output events at every second. This is an example of interpolation — with the bid prices of MSFT at 09:59:59 and 10:00:02, you can interpolate the bid prices at any time point in between. Thanks to interpolation, you can now conduct further analysis on your stock data.

Performing Interpolation in Vertica 4.0

You have three options for performing interpolation:

  1. Use a statistical software program pulling data out of your database. While such software often supports sophisticated interpolation schemes, you know this won’t cut it for the hundreds of GBs of bid stream data stored in your database due to the scalability challenges in both CPU and I/O.
  2. Write your own C++/Java/Perl program. Always a viable option, but you know your time could be better spent.
  3. Use Vertica 4.0 to interpolate. You can now easily perform interpolation as well as other powerful time-series analytics within Vertica at high efficiency and low cost.

When designing our new, patent-pending time-series analytics features, we focused on the following design goals:

  • Ease of use by following similar syntax to existing SQL
  • Powerful semantics in formulating time series computation
  • Highly efficient and scalable native execution in Vertica’s MPP column-store query engine (i.e., bring analytics computation closer to the data)

In sum, we baked time series analytics into our SQL syntax and execution engine, for simplicity and speed. In contrast, another popular solution is to use UDFs for analytics. However, the design and implementation of UDFs often lead to less intuitive syntax as well as non-trivial runtime overhead.

In Vertica 4.0, your interpolation solution can be expressed in just three lines of SQL:

<blockquote>SELECT slice_time, symbol, TS_FIRST_VALUE(bid) AS first_bid
FROM Tickstore
TIMESERIES slice_time AS '1 second' OVER
(PARTITION BY symbol ORDER BY timestamp);

Congratulations! Now you know how to read between the lines with Vertica 4.0. Go and make a dent in the stock market.

Peeking under the Covers

Let’s take a closer look at what the above query does, focusing on the input and output bid events on the Google stock.


The two red dots denote the two input GOOG events. The vertical dashed lines delimit the 1-second time intervals, referred to as time slices. The dashed horizontal lines denote the bid values of GOOG at those time points when there are no input events. The blue stars denote the output GOOG bid events, lying in the intersections of the vertical and horizontal lines.

Note that there is a time slice gap between 10:00:01 and 10:00:02 – there are no input events. However, the output is guaranteed to have a GOOG event for that time slice. This behavior is commonly referred to as gap filling.

In future posts, we will talk more about time series analytics in Vertica 4.0. These features are not only applicable to gaining an edge in financial markets; they are equally powerful and effective to help you gain insights into other types of times series data such as web browsing click streams and call detail records.

Column Store vs. Column Store

It has been 5 years since Vertica was founded and it is great to see that Column Stores are becoming prevalent and widely regarded as the preferred architectures for data warehousing and analytics. Mainstream and upstart vendors alike are announcing columnar storage and columnar compression as “features” of their row-oriented DBMS. While this is excellent news for the column store enthusiasts, marketing messages are rife with false information that creates confusion for buyers. Could you be mistaking an imitation diamond for the real thing?

Here’s what you should know about when evaluating or buying a Column store DBMS.

What makes a True Columnar DBMS

A true column store, like Vertica, must have the following 4 features:

Columnar Storage, Compression and Retrieval

Data is stored in columns such that it is possible to retrieve data in a column without fetching other columns. This has the benefits of I/O reduction as well as improved compression. Data is compressed on column-by-column basis, with the compression technique chosen based on properties of the data. Block level columnar compression in row-oriented databases fails to meet this criterion – compression in these systems is typically limited to a single technique and does not eliminate unnecessary columns (and the resulting I/O) on retrieval.

Columnar on Disk, not just In-memory

Some so-called columnar DBMS vendors rely on caching the entire data into memory in columnar format. These systems experience a performance cliff when the data sizes grow beyond what can fit into memory or require a huge hardware footprint. It is no secret that memory continues to be the most expensive component in any system, so this approach is likely to limit your scalability. Check out some recently published 1TB TPCH benchmarks by columnar vendors and notice how much hardware and memory was needed for this tiny amount of data!!

Columnar Optimizer & Execution Engine

To really take advantage of a column store architecture, the query optimizer must be deeply aware of columnar storage and optimization techniques.  Late materialization is just one example of an optimization technique that can significantly speed up joins in a column store. Here, the result of the join can be computed by simply fetching the join key columns off the disk and the remaining columns are only fetched at the very end of query execution.

Going hand in hand with the optimizer, the execution engine of a true columnar database looks radically different from the typical processing model employed in a typical modern row-oriented DBMS. A true columnar engine can do predicates, joins, aggregates, sorts and analytics on compressed data, thereby saving not only on I/O but also CPU cycles. The problem then shifts to optimizing memory bandwidth and techniques like vectorizing or operating on columns are used to allow more efficient use of the L2 cache.

No amount of retrofitting can turn a row-oriented optimizer and engine into column-oriented ones.

For more on this subject, see Dan Abadi’s excellent research on this topic:,,

Optimized Loads and Transactions

While analytic DBMS workloads are heavy on queries v/s transaction throughput, this does not mean they are “read-only”. Many vendors implement columnar storage as a feature assuming “archival” or “read-only” access or reducing compression if updates are supported.  A true columnar RDBMS should provide the ability to do fast loads, and handle SQL deletes and updates to the data without sacrificing query performance or compression benefits.

Lacking any one of the above elements significantly reduces the benefits of column stores. Vertica is the only analytic database in the market today with all of the above features. That being said, a columnar architecture is just one of the many design choices that makes Vertica the DBMS of choice for large-scale real-time analytics – I’ll talk more about these in a future blog post.

And don’t take our word for it.  Try it out for yourself.

Vertica 3.5 FlexStore: The next generation of Column-stores

Last month Vertica delivered release 3.5 of our analytic database software. Over the past few years, Vertica has continued to innovate and mature its core database functionality with major releases every 6 months. I would like to thank all our customers and partners whose feedback has been instrumental in developing our product. The centerpiece of the Vertica 3.5 release is – FlexStore – our next generation columnar database architecture.  With FlexStore, Vertica now has all the benefits of a columnar database with several of the benefits traditionally considered the forte of traditional row-oriented databases.

There are three main ideas introduced by FlexStore:

  1. grouping of multiple (possibly all) columns into a single file
  2. automatically choosing the disk storage format based on data load patterns
  3. ability to differentiate storage media by their performance characteristics and to enable intelligent placement of data based on usage patterns

Let us look at a couple of practical examples where Vertica can derive huge performance gains from patent-pending FlexStore features.

One of the key innovations in Vertica has been the ability to eliminate the need for batch loading data warehouses – most Vertica customers trickle load data throughout the day, while providing real-time data latency to their end-users.  There are no indexes to build or materialized views to refresh.  Data is trickle loaded into the Write-Optimized-Store (WOS) – a fully queryable in-memory store and over time moved over to Read-Optimized-Store (ROS) on disk. The migration and access to data across the WOS and ROS is completely transparent to the user and managed automatically by the Tuple Mover.  Data from both the WOS & ROS are also automatically combined whenever necessary in any queries.

FlexStore improves the efficiency of this trickle load process by enabling Vertica to choose whether the incoming data is to be stored into a row-oriented format in the WOS or in a row or column oriented format in the ROS, depending on the size of the data  being loaded. This determination is similarly made entirely automatically without any user intervention.  The row-oriented format on disk groups several columns into a single file and applies columnar compression within each file. This reduces the number of file accesses needed to access the data during queries.  Over time, data from multiple small loads are combined if necessary and reorganized into a more highly optimized column oriented format.

FlexStore also allows user control over placement of columns on different media.  This intelligent data placement also provides an opportunity to incorporate the use of solid state drives for database storage in a cost effective manner.  Even if your architecture consists of a homogeneous set of disks, it is well known that storing data on inner v/s outer tracks can result in different performance.  By presenting the inner and outer tracks as two different storage locations to Vertica, FlexStore allows intelligence placement of columns so that frequently accessed columns can be placed on the outer tracks and infrequently used columns can be placed on inner tracks.

Grouping of two or more columns into a single file is also available as a user directive when defining projections.  Columns that are frequently accessed together can be grouped to reduce the number of file accesses necessary to retrieve the data.  Columns that are related in a domain specific way, such as bid and ask values in tick data are candidates to for the grouping directive. Grouping of columns also enables use of interesting compression and encoding techniques across multiple columns.

The grouped column feature can be combined with the data placement feature to fine tune performance or storage efficiency. For instance, columns that are infrequently accessed can be combined into a single group and stored on slower disks or on the inner tracks of a disk.  In this manner, Flexstore lays the foundation of Vertica’s Information Lifecycle Management strategy, where data is differentiated based on its value over its the course of its lifecycle.  You can expect more advances from Vertica in this area over the course of the next few releases.

Vertica 3.5 also introduces our next generation query optimizer and execution engine with improved performance and many new optimizations. For example, because Vertica stores data sorted, merge joins are extremely efficient in Vertica and in Vertica 3.5, merge joins can be now be used in many more situations. Deletes and updates, traditionally considered a weakness of column store architectures are not only faster but also, have little impact on query performance. Several of our early-adopter customers reported 30% faster out of the box query performance and over 100x improvement on delete performance after upgrade to Vertica 3.5!

Along with performance and scalability, Vertica continues to invest in enhancing and simplifying the operational capabilities of the database and improving overall user experience. Vertica 3.5 introduces a rich set of monitoring tables fully query-able using SQL.

Watch this space to learn about more upcoming technical innovations from Vertica as we continue to build the fastest and simplest database for analytics!

Get Started With Vertica Today

Subscribe to Vertica