Author Archive

The Right Tool for the Job: Using Hadoop with Vertica for Big Data Analytics

by Mingsheng Hong, Vertica Product Marketing Engineer

I have an entrepreneur friend who used to carry a butter knife around.  He claimed this “almighty” tool was the only one he ever needed!  While the butter knife does serve a wide range of purposes (especially with a stretch of the imagination), in practice it doesn’t always yield optimal results.  For example, as a screwdriver, it may work for common screws, but certainly not a Phillips (unless you push down very hard and hope not to strip the screw).  As a hammer, you may be able to drive finishing nails, but your success and mileage may vary.  As a pry bar, well, I think you get my point!  Clearly one tool isn’t sufficient for all purposes – a good toolbox includes various tools each fulfilling a specific purpose.

When it comes to Big Data analytics, Hadoop (as a platform) has received an incredible amount of attention.  Some highlights include: a scalable architecture based on commodity hardware, flexible programming language support, and a strong open source community committed to its ongoing development.  However, Hadoop is not without limitations: due to its batch-oriented nature, Hadoop alone cannot be deployed as a real-time analytics solution.  Its highly technical, low-level programming interface makes it extremely flexible and friendly to developers, but not optimal for business analysts.  And in an enterprise business intelligence environment, Hadoop's limited integration with existing BI tools leaves people scratching their heads trying to figure out how it fits in.

As Hadoop has continued to gain traction in the market and (in my opinion) moved beyond the peak of the hype cycle, it is becoming clear that to maximize its effectiveness, one should leverage Hadoop in conjunction with other business intelligence platforms and tools.  Best practices are emerging regarding the choice of such companions, as well as how to leverage each component in a joint deployment.

Among the various BI platforms and tools, Vertica has proved an excellent choice. Many of its customers have successfully leveraged the joint deployment of Hadoop and Vertica to tackle BI challenges in algorithmic trading, web analytics, and countless other industry verticals.

What makes the joint deployment so effective, and what are the common use cases?

First, both platforms have a lot in common:

  • Purpose-built from scratch for Big Data transformation and analytics
  • Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
  • Native HA support with low administration overhead

In the Big Data space crowded with existing and emerging solutions, the above architectural elements have been accepted as must-haves for any solution to deliver scalability, cost effectiveness and ease of use.  Both platforms have obtained strong market traction in the last few years, with customer success stories from a wide range of industry verticals.

While agreeing on things can be pleasant, it is the following key differences that make Hadoop and Vertica complement each other when addressing Big Data challenges:

Aspect / Feature: Hadoop vs. Vertica

Interface and extensibility
  • Hadoop: The map-reduce programming interface is designed for developers. The platform is acclaimed for its multi-language support, as well as the ready-made analytic library packages supplied by a strong community.
  • Vertica: The interface complies with BI industry standards (SQL, ODBC, JDBC, etc.), enabling both technologists and business analysts to leverage Vertica in their analytic use cases. Vertica's 5.0 analytics SDK lets users plug custom analytic logic into the platform, with in-process, parallel execution. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.

Tool chain / ecosystem
  • Hadoop: Hadoop and HDFS integrate well with many other open source tools; integration with existing BI tools is still emerging.
  • Vertica: Vertica integrates with BI tools thanks to its standards-compliant interface. Through Vertica's Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.

Storage management
  • Hadoop: Hadoop replicates data three times by default for HA. It segments data across the machine cluster for load balancing, but the data segmentation scheme is opaque to end users and cannot be tweaked to optimize for analytic jobs.
  • Vertica: Vertica's columnar compression often achieves a 10:1 compression ratio. A typical Vertica deployment replicates data once for HA, and the two replicas can have different physical layouts in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.

Runtime optimization
  • Hadoop: Because HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be resegmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
  • Vertica: The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch-oriented data processing.

Auto tuning
  • Hadoop: Map-reduce programs are written in procedural languages (Java, Python, etc.), which give developers fine-grained control over the analytic logic, but also require the developers to optimize their jobs carefully.
  • Vertica: The Vertica Database Designer provides automatic performance tuning for a given input workload. Queries are specified in declarative SQL, and are automatically optimized by Vertica's columnar optimizer.



After working with a number of customers on joint Hadoop and Vertica deployments, we have identified a number of best practices for combining the power of both platforms.  For example, Hadoop is ideal for initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and where batch jobs usually suffice; Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.  Sessionizing clickstreams, Monte Carlo analysis, and web-scale graph analytics are some examples.  For those analytic features supported by both platforms, we have observed significant performance advantages in Vertica, due to the key architectural differences between the two platforms described above.

Finally, by leveraging Vertica’s Hadoop connector, users can easily move data between the two platforms.  Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS.  A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.   As another example, when an analytic job retrieves input data from the Vertica storage, its initial stages of computation, often consisting of filter, join and aggregation, should be conducted in Vertica for optimal performance.  The intermediate result can then be fed into a map-reduce job for further processing, such as building a decision tree or some other machine learning model.
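
As a sketch of that first conversion step, the following Python fragment flattens semi-structured JSON click records into fixed-schema relational tuples, the role played by the map-reduce conversion job described above. It is illustrative only; the field names and flat schema are assumptions, not an actual Vertica or Hadoop API.

```python
import json

def clicks_to_tuples(json_lines):
    """Flatten web-server log records (one JSON object per line) into
    fixed-schema relational tuples suitable for bulk loading.
    The field names here are hypothetical, not a real schema."""
    for line in json_lines:
        rec = json.loads(line)
        # Keep only the fields the downstream analytic queries need.
        yield (rec.get("user_id"), rec.get("url"), rec.get("timestamp"))

log = ['{"user_id": "U0", "url": "/home", "timestamp": "15:00:00"}']
print(list(clicks_to_tuples(log)))  # [('U0', '/home', '15:00:00')]
```

In a real deployment this logic would run as the map phase of a Hadoop job, with the emitted tuples loaded into Vertica via the Hadoop connector.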

Big Data with Hadoop and Vertica – OSCON ‘11

The recent OSCON '11 was filled with exciting technology and best-practice discussions on Big Data, Java, and many other subjects. There I had an opportunity to deliver a talk to the open source community on the subject of this post. In a subsequent talk, my colleagues Steve Watt and Glenn Gebhart presented a compelling demo illustrating the power of combining Hadoop and Vertica to analyze unstructured and structured data. We were delighted at the feedback both talks received, from follow-up conversations in person as well as on Twitter. This interview captured the gist of the numerous conversations we had with other OSCON attendees about Vertica's real-time analytics capabilities and its underlying technology.

Sessionize with Style: Part 2

Why is the Vertica Approach Unmatched?

In Part 1 of this two-part article, we described Vertica’s approach to sessionization. To understand why this approach is head and shoulders above the state-of-the-art, let us closely examine a few existing approaches.

The Aster Data Approach – From Boston to LA via China

The first approach comes from Aster Data. One query example from this article is quoted below.

SELECT sessionId, userId, timestampValue
FROM Sessionize('timestamp', 60) ON
  (SELECT userid, timestampValue FROM webclicks WHERE userid = 50);

There are a few usability hurdles with this design.

  1. The Sessionize function does not take userId as its input. It therefore must be assuming the input table webclicks has a column named userId, to be used in this computation. Hard-coding column names makes the product less usable.
  2. This design of sessionization support unnecessarily deviates from standard SQL in terms of both syntax and semantics, artificially creating a steep learning curve for the end users to climb. For example, the keyword ON has a different meaning from the standard use of ON in SQL ANSI joins. As a proposal for an alternative design, why not retain the subquery as a standard FROM clause subquery, and move the Sessionize function to the SELECT clause? This could preserve the required semantic ordering of evaluating the subquery before the Sessionize function, while minimizing the deviation from Standard SQL.
  3. This example implies that the predicate userid = 50 has to be placed in a subquery in order for that predicate to be pushed before the Sessionize computation. This predicate push down technique, well known to relational databases since the invention of relational algebra in the 1970’s, should have been taken care of in the query optimizer automatically. Forcing the end user to come up with “clever” SQL formulations violates the spirit of declarative programming in SQL, again making the product less usable.

Apparently Aster Data took another stab at the design; a query example in the new design is quoted below.

Select ts, userid, session
From sessionize(
  on clicks partition by userid order by ts
  Timecolumn('ts') timeout(60)
);

While this design represents progress, it still suffers from unnecessary deviation from the Standard SQL (see Comment #2 above). Also, since the proposed Sessionize syntax is not fully interoperable with other existing SQL constructs, it makes the query formulation cumbersome when the user wants to perform sessionization and other SQL computation in the same query. For example, how many subqueries need to be involved, if the user wants to first join the table clicks with a dimension table, perform a group-by, and then perform sessionization? In comparison, the Vertica approach employs not a single subquery to accomplish this task.

Finally, in case you were wondering: in Aster Data, Sessionize is a MapReduce function written in Java. Does the sessionization task justify the employment of a big-hammer execution mechanism like MapReduce? No! As we mentioned before, the run-time complexity of sessionization is at the level of the simplest SQL '99 analytic functions such as RANK and ROW_NUMBER, which take nothing more than a single pass over the sorted data.

Using MapReduce to implement sessionization is a mistake on the architectural level. It does not matter how fast your MapReduce implementation is – if you travel from Boston to LA by flying around the globe, does it matter that you are taking a supersonic aircraft? Besides, while MapReduce is more versatile than SQL when processing computational tasks, it has been shown that when processing SQL-like query tasks, due to its significant overhead a MapReduce engine can be slower than a SQL engine by an order of magnitude or more.

The Teradata Approach – a Horse Wagon Fun Ride from Boston to LA

In this article, Teradata revealed its support for sessionization. We quote its query example below.

with dt (IP, Click_Timestamp, samesession) as
  (select IP, Click_Timestamp,
     case when (Click_Timestamp
                - (max(Click_Timestamp) over
                     (partition by IP order by Click_Timestamp
                      rows between 1 preceding and 1 preceding)) minute)
             < (interval '30' minute)
          then 0 else 1 end
   from webclicks)
select sum(samesession) over (partition by IP order by Click_Timestamp
         rows unbounded preceding) as Session_No,
       IP, Click_Timestamp
from dt;

From this example, we can see that Teradata’s sessionization support is not native. Instead, it is expressed in terms of existing SQL ’99 functions. Such a design bears the following consequences.

  1. The resulting formulation is quite cumbersome and unintuitive. If you understand why this query actually performs sessionization, we salute you for being a SQL guru; an average DBA might be at a complete loss as to what the query intends to do. Also, the WITH clause is an unnecessarily big-hammer construct to employ, as a subquery could have been used. Better yet, no subquery at all would be nice!
  2. The use of the SQL window aggregate function MAX further complicates the formulation unnecessarily. The SQL analytic function LAG should have been used instead.
  3. Last but certainly not least, the complex query formulation involving two query blocks is likely to lead to suboptimal run-time performance.


Vertica’s sessionization support is unmatched among its peers, and here is why.

  1. Thanks to Vertica’s analytic function CONDITIONAL_TRUE_EVENT (CTE), the query formulation of sessionization is extremely compact – no subquery or unnecessary SQL construct is involved.
  2. Thanks to CTE, Vertica’s sessionization goes beyond its standard semantics, and is thus capable of supporting a wider range of use cases.
  3. The run-time performance of Vertica's sessionization is close to optimal: the computation is fully pipelined for each user ID, and is fully load-balanced across all user IDs.

In closing this post, we would like to offer our dear esteemed readers the challenge to come up with a solution that beats Vertica’s implementation of sessionization.

Sessionize with Style – Part 1

The Vertica Approach

Sessionization is a common analytic operation in clickstream analysis. Given an input clickstream table, where each row records a webpage click made by a particular user (or IP address), the sessionization operation identifies each user's web browsing sessions from the recorded clicks, by grouping that user's clicks based on the time intervals between them. Conceptually, if two clicks from the same user are made too far apart in time (as defined by a time-out threshold), they are treated as coming from two different browsing sessions.

Here is an example input clickstream table with a simplified schema. Ignore the output column session_id for now.

user_id  timestamp  URL  session_id
U0       15:00:00        0
U0       15:00:25        0
U0       15:00:45        0
U0       15:01:45        1

The standard semantics of sessionization takes a single input parameter: the time-out threshold, which is a constant time interval value. An example time-out threshold value is 30 seconds. Sessionization performs its computation on two columns in the input clickstream table, the user_id and the timestamp of the click. The output session_id column produced by sessionization is shown in the above table.
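
To make these semantics concrete, here is a minimal single-pass Python sketch of standard sessionization (an illustration, not Vertica's implementation): a click opens a new session for its user whenever the gap from that user's previous click exceeds the time-out threshold.

```python
def sessionize(clicks, timeout):
    """Assign a session_id to each (user_id, ts) click, starting a new
    session when the gap from the user's previous click exceeds
    `timeout`. Clicks must be sorted by timestamp per user; ts is in
    seconds for simplicity."""
    out, last_ts, session = [], {}, {}
    for user, ts in clicks:
        if user not in last_ts:
            session[user] = 0
        elif ts - last_ts[user] > timeout:
            session[user] += 1
        last_ts[user] = ts
        out.append((user, ts, session[user]))
    return out

# The example clicks above, as seconds since 15:00:00, with a
# 30-second timeout: the last click comes 60 seconds after the
# previous one, so it starts a new session.
clicks = [("U0", 0), ("U0", 25), ("U0", 45), ("U0", 105)]
print(sessionize(clicks, 30))
```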

Vertica’s Sessionization Support

Sessionization in Vertica is built on top of the event-based window function CONDITIONAL_TRUE_EVENT (or CTE for short). Recall the semantics of CTE with input Boolean expression P: CTE(P) is evaluated once per input row, and defines a new window starting at the current row whenever P evaluates to true for that row. For example, given a sequence of values <1, 2, 3, 4> for column X, CTE(X > 2) assigns to these rows the sequence of window IDs <0, 0, 1, 2>. Also, recall that the expression P in CTE can access column values in the current row as well as in previous rows. For example, CTE(X > LAG(X)) defines a new window whenever the value of column X in the current row is greater than X in the previous row.
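
The window-ID assignment performed by CTE can be sketched in a few lines of Python (an illustration of the semantics over a single partition, not Vertica's implementation). The predicate receives the current row and the previous row, mirroring CTE's ability to reference prior rows via LAG.

```python
def conditional_true_event(rows, predicate):
    """Sketch of CONDITIONAL_TRUE_EVENT over one partition: increment
    the window ID whenever the predicate is true for the current row.
    predicate(cur, prev) sees the current row and the previous row
    (None for the first row)."""
    window_id, prev, out = 0, None, []
    for cur in rows:
        if predicate(cur, prev):
            window_id += 1
        out.append(window_id)
        prev = cur
    return out

# CTE(X > 2) over X = <1, 2, 3, 4> from the text:
print(conditional_true_event([1, 2, 3, 4], lambda x, _: x > 2))  # [0, 0, 1, 2]
# CTE(X > LAG(X)): a new window whenever X grows.
print(conditional_true_event([1, 2, 3, 4],
                             lambda x, p: p is not None and x > p))
```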

Despite its powerful semantics, the run-time complexity of CTE is at the level of the simplest SQL '99 analytic functions such as RANK and ROW_NUMBER: it takes only a single pass over the sorted data, while retaining a minimal amount of state in the computation.

Thanks to CTE, sessionization with its standard semantics can be expressed in Vertica as follows.

SELECT user_id, timestamp, CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds') OVER (PARTITION BY user_id ORDER BY timestamp)

FROM clickstream;

Beyond the Standard Semantics of Sessionization

One limitation of the standard semantics of sessionization is that the time-out threshold is a constant value. However, different users may have different styles and preferences for internet browsing, and therefore the same time-out threshold may not accurately identify sessions for all users.

For example, say user A is a slower web surfer than average, perhaps because A is multi-tasking heavily. Say that for an average user, the absence of page clicks in a particular web domain D for 30 seconds indicates the end of a session. For user A, however, the typical interval between two clicks in the same domain is 1 minute, as she is busy tweeting, listening to music, and harvesting in Farmville at the same time. A better solution is therefore to adaptively determine the session time-out threshold of user A based on her recent browsing behavior (e.g., the average interval between consecutive clicks among the last 10 clicks A has performed). This allows the clickstream analyst to customize the time-out threshold for different users.

For example, to adaptively compute the time-out threshold for a user based on her last 10 clicks, with a "fudge factor" of 3 seconds, we can use the following CTE expression: CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > (LAG(timestamp, 1) - LAG(timestamp, 11)) / 10 + '3 seconds'), where (LAG(timestamp, 1) - LAG(timestamp, 11)) / 10 is the average interval between the last 10 clicks. The fudge factor can be a multiplicative factor instead of an additive one; for example, it can be 110% of the average time interval of the last 10 clicks.
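
The adaptive scheme can be sketched in Python as follows (illustrative only; the 30-second default used before enough history accumulates is an assumption, since the text does not specify behavior for the first few clicks).

```python
from collections import deque

def adaptive_sessionize(timestamps, lookback=10, fudge=3):
    """Adaptive time-out sketch for one user: start a new session when
    the gap to the previous click exceeds the average gap over the
    last `lookback` clicks plus a `fudge` factor (seconds). Until
    lookback+1 clicks exist, fall back to an assumed 30-second
    default threshold."""
    recent = deque(maxlen=lookback + 1)   # last lookback+1 timestamps
    session, out = 0, []
    for ts in timestamps:
        if recent:
            if len(recent) == lookback + 1:
                # average of the last `lookback` inter-click gaps
                threshold = (recent[-1] - recent[0]) / lookback + fudge
            else:
                threshold = 30            # assumed default
            if ts - recent[-1] > threshold:
                session += 1
        recent.append(ts)
        out.append((ts, session))
    return out

# Twelve clicks 10 s apart, then a 50 s gap: the adaptive threshold is
# 10 + 3 = 13 s, so the last click starts a new session.
print(adaptive_sessionize(list(range(0, 120, 10)) + [160]))
```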

Another sessionization use case involving a more sophisticated time-out threshold is to use different threshold values based on other factors, such as the time of day or the nature of the website being browsed. For example, the time-out threshold for Wall Street Journal Online should be higher than that for the xkcd webcomic, as WSJ articles take longer to read, on average, than xkcd comic strips.


The Vertica approach to sessionization enjoys the multiple benefits of ease of use, strong expressive power, as well as highly efficient and scalable execution. Wondering how some alternative approaches will stack up against Vertica’s (hint: they won’t)? That’s what we will answer in a future post.

More Time Series Analytics: Event-based Window Functions

An event-based window function assigns input rows to windows based on their non-timestamp column values.  The Vertica event-based window functions, one of Vertica’s many in-database analytics, assign to each input row an integer value representing the window ID, starting from 0. The window ID is incremented when a new window starts.

Two previous posts discussed Gap Filling and Interpolation (GFI) time series analytics in Vertica and some use cases. GFI analytics groups the input rows according to consecutive uniform time intervals referred to as time slices, and then performs gap filling and interpolation on the group of rows associated with each time slice. We say the grouping of the input rows in GFI is time-based.

In some use cases, however, the rows in the input time series data need to be grouped according to certain column values, instead of their timestamps. For example, given an input stream of MSFT stock quotes, a stock analyst may want to place the input quotes into a new group whenever the spread (the difference between the ask price and the bid price) goes above $0.05. If we view each such group as a window of events, then the window endpoints are defined by the occurrence of certain event types. In the above example, the window border-defining event type is a stock quote whose spread is above $0.05.

The above Financial Services query example can be formulated as follows.

SELECT symbol, ask, bid, timestamp, CONDITIONAL_TRUE_EVENT(ask - bid > 0.05) OVER (PARTITION BY symbol ORDER BY timestamp) spread_window_id

FROM Tickstore;

The Almighty Event-based Window Function CONDITIONAL_TRUE_EVENT

The primary event window function in Vertica 4.0 time series analytics is CONDITIONAL_TRUE_EVENT, which we abbreviate as CTE in subsequent text. CTE takes an input Boolean expression P (P for predicate). P is evaluated once for each input row. When P is evaluated to true, the associated input row is labeled with a new window ID.

Below is a pictorial example illustrating the semantics of CTE. Let the input Tickstore table contain the following rows (the ask column is omitted for simplicity). Ignore the output column window_id for now.

symbol  bid   timestamp  window_id
XYZ     10.0  03:00:00   0
XYZ     11.0  03:00:03   1
XYZ     10.5  03:00:06   1
XYZ     11.0  03:00:09   2



Now let us answer the following query. The values of its output column window_id are shown in the above table.

SELECT symbol, bid, timestamp, CONDITIONAL_TRUE_EVENT(bid > 10.6) OVER(PARTITION BY symbol ORDER BY timestamp) window_id

FROM Tickstore;

The input and output data can be visualized in the following figure. The blue dots represent the input rows. Whenever the bid price goes above the threshold value $10.6, the output window_id is incremented.


Accessing Previous Rows in Event-based Window Functions

In the example of event-based window functions provided above, the Boolean expression P within CTE only accesses values from the current row. However, sometimes the window border-defining event type involves a sequence of recent past rows as well as the current row. For example, we may want to define a new window whenever the average of bid and ask in the current row is above that in the previous row. This Boolean expression can be formulated in Vertica as (bid + ask) / 2 - (LAG(bid) + LAG(ask)) / 2 > 0. More generally, the analytic function syntax LAG(X, n) retrieves the value of column X in the nth-to-last input row. The second parameter n is optional, and defaults to 1.
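
This look-back capability can be sketched in Python as follows (illustrative; `lag` here is a hypothetical helper standing in for LAG(X, n), and is not a Vertica API).

```python
from collections import deque

def cte_with_lag(rows, predicate, depth):
    """Sketch of CONDITIONAL_TRUE_EVENT whose predicate may look back
    up to `depth` rows: predicate(cur, lag), where lag(n) returns the
    nth previous row, or None if it does not exist."""
    history = deque(maxlen=depth)
    window_id, out = 0, []
    for cur in rows:
        def lag(n=1):
            return history[-n] if n <= len(history) else None
        if predicate(cur, lag):
            window_id += 1
        out.append(window_id)
        history.append(cur)
    return out

# New window when the bid/ask midpoint rises above the previous row's:
rows = [(10.0, 10.2), (10.1, 10.3), (10.0, 10.2)]   # (bid, ask)
mid = lambda r: (r[0] + r[1]) / 2
print(cte_with_lag(rows, lambda cur, lag: lag() is not None
                   and mid(cur) > mid(lag()), depth=1))  # [0, 1, 1]
```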

With its ability to access previous rows, we can show that CTE can express any event-based window functions whose input involves the current row and the past n rows for any arbitrary finite number n. A formal proof is omitted here in an attempt to keep the readers on this post.

Another Event-based Window Function CONDITIONAL_CHANGE_EVENT

Having covered the extremely powerful event-based window function CTE, we now introduce a second function, CONDITIONAL_CHANGE_EVENT, abbreviated as CCE in this post. CCE takes an input expression E of any data type. E is evaluated once for each input row. When the value of E on the current row differs from the value of E on the previous row, the current row is labeled with a new window ID.

Semantically, CCE is a special case of CTE, because CCE(E(current row)) ≡ CTE(E(current row) <> E(previous row)). However, proper use of CCE can result in a more compact query formulation, and possibly better run-time performance.

More Use Cases of Event-based Window Functions

Besides financial services, event-based window functions have applications in many other industry sectors as well. Let us turn to log analysis, where the log data can be produced by clickstreams, software programs, online games, and so on. Say the popular MMORPG game World of Warcraft logs a sequence of action events for each in-game character. Each character can work on one major task at a time (e.g., slaying dragons or obtaining magical artifacts). For each task taken on by the character, its log events consist of a START TASK event, followed by a sequence of action events pertinent to accomplishing that task. An example table schema for such log data is (character, event_type, timestamp). Now suppose the log analyst would like to group the log events by the tasks they are associated with. This can be accomplished with the event-based window function CONDITIONAL_TRUE_EVENT(event_type = 'START TASK') OVER (PARTITION BY character ORDER BY timestamp) task_id.
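
A Python sketch of this task grouping (illustrative; how events before the first START TASK are assigned is an assumption of this sketch, not stated in the text):

```python
def task_ids(events):
    """Group one character's log events by task: each START TASK event
    begins a new window, mirroring the CONDITIONAL_TRUE_EVENT
    formulation above. Events up to and including the first START TASK
    fall into window 0."""
    task, out = 0, []
    for i, ev in enumerate(events):
        if ev == "START TASK" and i > 0:
            task += 1
        out.append((ev, task))
    return out

print(task_ids(["START TASK", "slay dragon", "START TASK", "find artifact"]))
```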

It turns out that for clickstream analytics, CTE is a powerful tool to implement in-database sessionization capability with unmatched flexibility and performance. This will be the subject of a future post.

Gap Filling and Interpolation (GFI)

A Swiss-Army Knife for Time Series Analytics

Gap Filling and Interpolation (GFI) is a set of patent-pending time series analytics features in Vertica 4.0.  In a previous post, we introduced GFI and showcased their expressive power and ease of use through a use case in financial services. In this post, through additional use cases, we will show that GFI can enable Vertica users in a wide range of industry sectors to achieve a diverse set of goals.

Use Case #1: Aggregate bandwidth analysis for Video-On-Demand (VOD) applications

In a Telecommunications VOD workload, assume we have the following (simplified) fact table schema for tracking the quality of service (QoS) in serving video content to end users: vod_qos_fact(user_id, video_id, timestamp, bandwidth).

Each time a user requests a video clip, a row is inserted into the fact table, recording the timestamp and the bandwidth consumption of serving the video clip at that time.  In addition, when the bandwidth consumption changes significantly for a video clip being served, a new row is inserted to record the new bandwidth consumption. An example fact table might be:

In this example, the  VOD QoS analyst might like to know the following:  for every minute, what is the aggregate bandwidth consumption across all users? This question can be answered in Vertica 4.0 in two pipelined steps.

First, the GFI time series analytics is used to compute the interpolated bandwidth consumption of each user for every minute. The result is shown below.

Second, the interpolated bandwidth consumption is aggregated for every minute. The resulting SQL formulation of these two steps is the following.

SELECT slice_time, SUM(i_bandwidth) agg_bandwidth
FROM (
  SELECT user_id, slice_time, TS_FIRST_VALUE(bandwidth) AS i_bandwidth
  FROM vod_qos_fact
  TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY user_id ORDER BY timestamp)
) tmp
GROUP BY slice_time;
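
The two pipelined steps can be sketched in Python (a simplified model of constant "last value" interpolation plus per-slice aggregation, not Vertica's TS_FIRST_VALUE implementation; timestamps are plain seconds, and the handling of users with no reading before a slice is an assumption of the sketch):

```python
from collections import defaultdict

def interpolate_then_aggregate(fact_rows, slice_len, start, end):
    """Step 1: for each user, carry the last observed bandwidth forward
    into every time slice (constant interpolation). Step 2: sum the
    interpolated per-user values for each slice. fact_rows is a list of
    (user_id, ts, bandwidth), sorted by ts within each user."""
    per_user = defaultdict(list)
    for user, ts, bw in fact_rows:
        per_user[user].append((ts, bw))

    agg = defaultdict(float)
    for events in per_user.values():
        i, current = 0, None
        for slice_time in range(start, end, slice_len):
            # advance to the last event at or before this slice
            while i < len(events) and events[i][0] <= slice_time:
                current = events[i][1]
                i += 1
            if current is not None:       # gap-fill with the last value
                agg[slice_time] += current
    return dict(agg)

facts = [("u1", 0, 5.0), ("u1", 90, 3.0), ("u2", 30, 2.0)]
print(interpolate_then_aggregate(facts, slice_len=60, start=0, end=180))
# {0: 5.0, 60: 7.0, 120: 5.0}; u2 has no reading before slice 0, so it
# contributes nothing there.
```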

In comparison to other popular alternatives such as stored procedures, MapReduce scripts, and ODBC/JDBC programs, the Vertica approach enjoys the following benefits:

  • Code and architecture simplicity: The computation is expressed in SQL, and is performed within the Vertica database. No additional programming language or execution platform is needed.
  • Efficiency and scalability: The Vertica approach effectively balances the data and computation across all cluster nodes, and minimizes the amount of disk and network I/O needed to accomplish this computation task.

These benefits apply to the following use cases as well.

Use Case #2: Storage compression for inventory management applications

In a retail workload, assume we have the following (simplified) fact table schema for inventory management: inventory_fact(item_id, item_location, timestamp, quantity_in_stock).

For each inventory item in a certain location, every time its stock quantity changes, there is a new row inserted into this fact table with the updated quantity value and the associated timestamp of that inventory change.

For example, I recently went to a nearby Apple retail store to purchase an iPad. After that transaction, Apple's inventory_fact table would be populated with a new row, indicating that for this particular Apple store, the quantity of iPads in stock has decreased by one.

For a popular product like the iPad, the inventory level in each store may change thousands of times or more each day, creating a large number of rows in the inventory_fact table. However, for the purpose of inventory analysis, the inventory data can often be stored in a much coarser time granularity, such as one row per day for each item and location pair.

The GFI time series analytics in Vertica 4.0 can be used to efficiently aggregate the raw inventory data into a proper time granularity, thus consuming much less storage as well as significantly speeding up inventory analysis. Note that in comparison, traditional SQL analytic functions such as FIRST_VALUE cannot be used to correctly compute the aggregation (the reasoning behind this statement is left as an exercise for the interested reader).

Use Case #3: Power reduction for wireless sensor applications

In a Smartdust project, computer scientists may deploy a large number of sensors across a wilderness region to measure environmental parameters, such as light and temperature. The sensors transmit their measurement readings via wireless signals to one or more base stations, where installed applications process the measurements.

Since these sensors are battery-driven, efficient power management is crucial to ensure a prolonged operation. The key to efficient power management is to minimize the amount of data transmission from sensors to the base stations.

A base station application may require that measurement updates be available at regular time intervals; for example, once every second. One straightforward implementation is to have each sensor transmit a new measurement for each time interval. However, this transmission policy could lead to a huge battery drain on the sensors, especially considering that the consecutive measurements from the same sensor often carry the same value.

Instead, we could have the sensors submit their readings only when the measurement values change, and then use GFI on the base station application to losslessly recover the required measurement values at regular intervals.


The above three use cases respectively illustrate the use of Vertica's GFI time series analytics in Telecommunications, Retail, and Science applications. You may be wondering how to assess whether GFI applies to your particular problem. One way is to look for the following common characteristics of problems to which GFI is applicable:

  1. You have multiple time series event streams (say from multiple users or multiple stock symbols), where the events do not occur at regular time intervals. However, you would like to aggregate or correlate over these event streams (use case #1)
  2. The raw data is stored in a denser time granularity than needed in your analysis computation (use case #2)
  3. The raw data is stored in a sparser time granularity than needed in your analysis computation (use case #3)

With Vertica, the built-in GFI functionality allows you to simplify and accelerate your time-series analytics.

Reading between the Lines with Vertica 4.0

In the recent blockbuster flick Avatar, both the hero and the heroine possess the skill referred to as "I See You" in Na'vi speak. Or, as we Earthlings might put it, perhaps more accurately: they know how to read between the lines.

In a key scene, Neytiri and Jake speak under the Tree of Voices. Neytiri tells Jake that he is one of the Omaticaya now and it is time for him to choose a companion. As she starts listing the fine candidates in her tribe, Jake suppresses a smile and replies (in Na’vi): “I’ve already chosen. But this woman must also choose me.” At Jake’s response, Neytiri’s face turns into relief and satisfaction: “She already has.”

The skill of interpolation is indispensable in business as well. Say you are a financial analyst looking to find possible correlations between the bid prices of Google and Microsoft over the past 3 months. While the stock ticker stream stored in your database contains all the necessary bid events for both stocks, these events do not always occur at regular time intervals, preventing you from comparing apples to apples. Here’s what is stored in your database:

Symbol Bid Timestamp
MSFT 30.83 09:59:59
GOOG 529.10 10:00:00
MSFT 30.87 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

To begin your analysis, you’d like to normalize your data by extracting the bid prices at regular time intervals, say one bid price per second for each stock. You know the bid price of MSFT is $30.83 at 09:59:59, and that it subsequently rises to $30.87 at 10:00:02. What should the bid price be between these two time points, say at 10am? In financial analysis, it is standard practice to assume that a stock’s bid price remains constant until the next bid event occurs. Therefore, between 09:59:59 (inclusive) and 10:00:02 (exclusive), the bid price of MSFT remains $30.83. Based on this understanding, your normalized output, starting at 10am, will look like the following (interpolated values are marked with *):

Symbol Bid Timestamp
MSFT 30.83* 10:00:00
GOOG 529.10 10:00:00
MSFT 30.83* 10:00:01
GOOG 529.10* 10:00:01
MSFT 30.87 10:00:02
GOOG 529.10* 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

Voilà! You have successfully synthesized the output events at every second. This is an example of interpolation — with the bid prices of MSFT at 09:59:59 and 10:00:02, you can interpolate the bid prices at any time point in between. Thanks to interpolation, you can now conduct further analysis on your stock data.
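The constant interpolation just described is simple enough to sketch in Python. This is a minimal illustration of the concept only, not how Vertica executes it; it assumes events arrive sorted by timestamp:

```python
from datetime import datetime, timedelta

def interpolate(events, start, end):
    """Emit one (time, bid) pair per second between start and end,
    carrying the most recent bid forward (constant interpolation).
    Assumes events are sorted by timestamp."""
    parse = lambda s: datetime.strptime(s, "%H:%M:%S")
    events = [(parse(t), v) for t, v in events]
    out, i, last = [], 0, None
    t, stop = parse(start), parse(end)
    while t <= stop:
        while i < len(events) and events[i][0] <= t:
            last = events[i][1]  # latest bid at or before t
            i += 1
        out.append((t.strftime("%H:%M:%S"), last))
        t += timedelta(seconds=1)
    return out

msft = [("09:59:59", 30.83), ("10:00:02", 30.87), ("10:00:03", 30.89)]
print(interpolate(msft, "10:00:00", "10:00:03"))
# [('10:00:00', 30.83), ('10:00:01', 30.83), ('10:00:02', 30.87), ('10:00:03', 30.89)]
```

Running the same function over the GOOG events produces the per-second GOOG rows of the normalized output above.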

Performing Interpolation in Vertica 4.0

You have three options for performing interpolation:

  1. Use a statistical software program pulling data out of your database. While such software often supports sophisticated interpolation schemes, you know this won’t cut it for the hundreds of GBs of bid stream data stored in your database due to the scalability challenges in both CPU and I/O.
  2. Write your own C++/Java/Perl program. Always a viable option, but you know your time could be better spent.
  3. Use Vertica 4.0 to interpolate. You can now easily perform interpolation as well as other powerful time-series analytics within Vertica at high efficiency and low cost.

When designing our new, patent-pending time-series analytics features, we focused on the following design goals:

  • Ease of use by following similar syntax to existing SQL
  • Powerful semantics in formulating time series computation
  • Highly efficient and scalable native execution in Vertica’s MPP column-store query engine (i.e., bring analytics computation closer to the data)

In sum, we baked time series analytics into our SQL syntax and execution engine, for simplicity and speed. In contrast, another popular solution is to use UDFs for analytics. However, the design and implementation of UDFs often lead to less intuitive syntax as well as non-trivial runtime overhead.

In Vertica 4.0, your interpolation solution can be expressed in just three lines of SQL:

<blockquote>SELECT slice_time, symbol, TS_FIRST_VALUE(bid) AS first_bid
FROM Tickstore
TIMESERIES slice_time AS '1 second' OVER (PARTITION BY symbol ORDER BY timestamp);</blockquote>

Congratulations! Now you know how to read between the lines with Vertica 4.0. Go and make a dent in the stock market.

Peeking under the Covers

Let’s take a closer look at what the above query does, focusing on the input and output bid events for the GOOG stock.


The two red dots denote the two input GOOG events. The vertical dashed lines delimit the 1-second time intervals, referred to as time slices. The dashed horizontal lines denote the bid values of GOOG at those time points when there are no input events. The blue stars denote the output GOOG bid events, lying in the intersections of the vertical and horizontal lines.

Note that there is a gap between 10:00:01 and 10:00:02: that time slice contains no input events. The output, however, is still guaranteed to include a GOOG event for that time slice. This behavior is commonly referred to as gap filling.
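The gap-filling behavior can be sketched in Python as well. This is a toy emulation of the per-slice semantics, not Vertica’s engine; event times are simplified to integer seconds after 10:00:00:

```python
def first_value_per_slice(events, slices):
    """For each time slice, emit one row: the first event value inside
    the slice, or, if the slice is empty (a gap), the value carried
    forward from the last preceding event (constant interpolation)."""
    out, i, carried = [], 0, None
    for slice_start, slice_end in slices:
        first = None
        while i < len(events) and events[i][0] < slice_end:
            if events[i][0] >= slice_start and first is None:
                first = events[i][1]
            carried = events[i][1]
            i += 1
        out.append((slice_start, first if first is not None else carried))
    return out

goog = [(0, 529.10), (3, 529.13)]        # GOOG bids, seconds after 10:00:00
slices = [(s, s + 1) for s in range(4)]  # four 1-second time slices
print(first_value_per_slice(goog, slices))
# [(0, 529.1), (1, 529.1), (2, 529.1), (3, 529.13)]
```

The empty slices at seconds 1 and 2 still produce GOOG rows, filled with the carried-forward bid of 529.10.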

In future posts, we will talk more about time series analytics in Vertica 4.0. These features are not only applicable to gaining an edge in financial markets; they are equally powerful in helping you gain insights into other types of time series data, such as web browsing click streams and call detail records.
