Reading between the Lines with Vertica 4.0

In the recent blockbuster flick Avatar, both the hero and heroine possess the skill referred to as “I See you” in Na’vi speak. Or as we Earthlings may say, perhaps to a more accurate degree, that they know how to read between the lines.

In a key scene, Neytiri and Jake speak under the Tree of Voices. Neytiri tells Jake that he is one of the Omaticaya now and it is time for him to choose a companion. As she starts listing the fine candidates in her tribe, Jake suppresses a smile and replies (in Na’vi): “I’ve already chosen. But this woman must also choose me.” At Jake’s response, Neytiri’s face turns into relief and satisfaction: “She already has.”

The skill of interpolation is indispensable to business as well. Say you are a financial analyst looking to finding possible price correlations between the bid prices of Google and Microsoft over the past 3 months. While the stock ticker stream stored in your database contains all the necessary bid events of both stocks, these events do not always occur at regular time intervals, preventing you from comparing apples to apples. Here’s what is stored in your database:

Symbol Bid Timestamp
MSFT 30.83 09:59:59
GOOG 529.10 10:00:00
MSFT 30.87 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

To begin your analysis, you’d like to normalize your data, by extracting the bid prices at regular time intervals, say one bid price per second for each stock. You know the bid price of MSFT is at \$30.83 at time 09:59:59, and it subsequently rises to \$30.87 at 10:00:02. What should the bid price be between these time points, so at 10am? In financial analysis, it is standard practice to assume that a stock’s bid price remains constant until the next bid event occurs. Therefore, between 09:59:59 (inclusive) and 10:00:02 (exclusive), the bid price of MSFT remains at \$30.83. Based on this understanding your normalized output, starting at 10am, will look like (interpolated values are in green):

Symbol Bid Timestamp
MSFT 30.83 10:00:00
GOOG 529.10 10:00:00
MSFT 30.83 10:00:01
GOOG 529.10 10:00:01
MSFT 30.87 10:00:02
GOOG 529.10 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

Voilà! You have successfully synthesized the output events at every second. This is an example of interpolation — with the bid prices of MSFT at 09:59:59 and 10:00:02, you can interpolate the bid prices at any time point in between. Thanks to interpolation, you can now conduct further analysis on your stock data.

Performing Interpolation in Vertica 4.0

You have three options for performing interpolation:

1. Use a statistical software program pulling data out of your database. While such software often supports sophisticated interpolation schemes, you know this won’t cut it for the hundreds of GBs of bid stream data stored in your database due to the scalability challenges in both CPU and I/O.
2. Write your own C++/Java/Perl program. Always a viable option, but you know your time could be better spent.
3. Use Vertica 4.0 to interpolate. You can now easily perform interpolation as well as other powerful time-series analytics within Vertica at high efficiency and low cost.

When designing our new, patent-pending time-series analytics features, we focused on the following design goals:

• Ease of use by following similar syntax to existing SQL
• Powerful semantics in formulating time series computation
• Highly efficient and scalable native execution in Vertica’s MPP column-store query engine (i.e., bring analytics computation closer to the data)

In sum, we baked time series analytics into our SQL syntax and execution engine, for simplicity and speed. In contrast, another popular solution is to use UDFs for analytics. However, the design and implementation of UDFs often lead to less intuitive syntax as well as non-trivial runtime overhead.

In Vertica 4.0, your interpolation solution can be expressed in just three lines of SQL:

```<blockquote>SELECT slice_time, symbol, TS_FIRST_VALUE(bid) AS first_bid
FROM Tickstore
TIMESERIES slice_time AS '1 second' OVER```
`(PARTITION BY symbol ORDER BY timestamp);`
`</blockquote>`

Congratulations! Now you know how to read between the lines with Vertica 4.0. Go and make a dent in the stock market.

Peeking under the Covers

Let’s take a closer look at what the above query does, focusing on the input and output bid events on the Google stock.

The two red dots denote the two input GOOG events. The vertical dashed lines delimit the 1-second time intervals, referred to as time slices. The dashed horizontal lines denote the bid values of GOOG at those time points when there are no input events. The blue stars denote the output GOOG bid events, lying in the intersections of the vertical and horizontal lines.

Note that there is a time slice gap between 10:00:01 and 10:00:02 – there are no input events. However, the output is guaranteed to have a GOOG event for that time slice. This behavior is commonly referred to as gap filling.

In future posts, we will talk more about time series analytics in Vertica 4.0. These features are not only applicable to gaining an edge in financial markets; they are equally powerful and effective to help you gain insights into other types of times series data such as web browsing click streams and call detail records.