Vertica

Archive for June, 2010

There Will Come Soft Rains

I am an early riser, especially on the weekends in summer when the sun rises early. This means I get 2-3 hours of time to myself before anyone else

is awake. Much to my wife’s chagrin, I often spend the time catching up on Vertica email and tasks that didn’t quite get done during the week. This is one of the only times where I can write email without getting semi-immediate responses back.

This morning, I did get several emails, though from our automated performance test systems, which reminded of Ray Bradbury’s short story “There Will Come Soft Rains” in which the humans are gone, but their automated systems still remain.

Of course the story’s main points aren’t about software automation early morning emails, but sitting here in the quiet before the day wakes, it seems somehow appropriate.

The automation systems we have at Vertica please me. Of course I am biased — I mostly wrote what we call “ServerFarm” which runs regression tests regularly (every few checkins) as well as allows developers to run tests on demand on a farm of machines. This system came about like most good productivity enhancing systems do: I was annoyed. Specifically I was annoyed of having to run our 1 hour long test suite on my laptop before checking in. I had taken to saving patches for my changes and running them manually at night so that I didn’t tie up my development machine during the day. ServerFarm started out as a way for me to automate that process and has grown from there.

Automated and regular regression testing is, of course, software best practice, but it is cool to see it working well at Vertica. By lowering the bar for using ServerFarm, even the crustiest of our developers now uses the Farm. The one thing that could be better is reporting of the results, especially over time. Thankfully we have a clever intern this summer helping us out with that. I am totally psyched to see what he comes up with.

Reading between the Lines with Vertica 4.0

In the recent blockbuster flick Avatar, both the hero and heroine possess the skill referred to as “I See you” in Na’vi speak. Or as we Earthlings may say, perhaps to a more accurate degree, that they know how to read between the lines.

In a key scene, Neytiri and Jake speak under the Tree of Voices. Neytiri tells Jake that he is one of the Omaticaya now and it is time for him to choose a companion. As she starts listing the fine candidates in her tribe, Jake suppresses a smile and replies (in Na’vi): “I’ve already chosen. But this woman must also choose me.” At Jake’s response, Neytiri’s face turns into relief and satisfaction: “She already has.”

The skill of interpolation is indispensable to business as well. Say you are a financial analyst looking to finding possible price correlations between the bid prices of Google and Microsoft over the past 3 months. While the stock ticker stream stored in your database contains all the necessary bid events of both stocks, these events do not always occur at regular time intervals, preventing you from comparing apples to apples. Here’s what is stored in your database:

Symbol Bid Timestamp
MSFT 30.83 09:59:59
GOOG 529.10 10:00:00
MSFT 30.87 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

To begin your analysis, you’d like to normalize your data, by extracting the bid prices at regular time intervals, say one bid price per second for each stock. You know the bid price of MSFT is at $30.83 at time 09:59:59, and it subsequently rises to $30.87 at 10:00:02. What should the bid price be between these time points, so at 10am? In financial analysis, it is standard practice to assume that a stock’s bid price remains constant until the next bid event occurs. Therefore, between 09:59:59 (inclusive) and 10:00:02 (exclusive), the bid price of MSFT remains at $30.83. Based on this understanding your normalized output, starting at 10am, will look like (interpolated values are in green):

Symbol Bid Timestamp
MSFT 30.83 10:00:00
GOOG 529.10 10:00:00
MSFT 30.83 10:00:01
GOOG 529.10 10:00:01
MSFT 30.87 10:00:02
GOOG 529.10 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

Voilà! You have successfully synthesized the output events at every second. This is an example of interpolation — with the bid prices of MSFT at 09:59:59 and 10:00:02, you can interpolate the bid prices at any time point in between. Thanks to interpolation, you can now conduct further analysis on your stock data.

Performing Interpolation in Vertica 4.0

You have three options for performing interpolation:

  1. Use a statistical software program pulling data out of your database. While such software often supports sophisticated interpolation schemes, you know this won’t cut it for the hundreds of GBs of bid stream data stored in your database due to the scalability challenges in both CPU and I/O.
  2. Write your own C++/Java/Perl program. Always a viable option, but you know your time could be better spent.
  3. Use Vertica 4.0 to interpolate. You can now easily perform interpolation as well as other powerful time-series analytics within Vertica at high efficiency and low cost.

When designing our new, patent-pending time-series analytics features, we focused on the following design goals:

  • Ease of use by following similar syntax to existing SQL
  • Powerful semantics in formulating time series computation
  • Highly efficient and scalable native execution in Vertica’s MPP column-store query engine (i.e., bring analytics computation closer to the data)

In sum, we baked time series analytics into our SQL syntax and execution engine, for simplicity and speed. In contrast, another popular solution is to use UDFs for analytics. However, the design and implementation of UDFs often lead to less intuitive syntax as well as non-trivial runtime overhead.

In Vertica 4.0, your interpolation solution can be expressed in just three lines of SQL:

<blockquote>SELECT slice_time, symbol, TS_FIRST_VALUE(bid) AS first_bid
FROM Tickstore
TIMESERIES slice_time AS '1 second' OVER
(PARTITION BY symbol ORDER BY timestamp);
</blockquote>

Congratulations! Now you know how to read between the lines with Vertica 4.0. Go and make a dent in the stock market.

Peeking under the Covers

Let’s take a closer look at what the above query does, focusing on the input and output bid events on the Google stock.

 

The two red dots denote the two input GOOG events. The vertical dashed lines delimit the 1-second time intervals, referred to as time slices. The dashed horizontal lines denote the bid values of GOOG at those time points when there are no input events. The blue stars denote the output GOOG bid events, lying in the intersections of the vertical and horizontal lines.

Note that there is a time slice gap between 10:00:01 and 10:00:02 – there are no input events. However, the output is guaranteed to have a GOOG event for that time slice. This behavior is commonly referred to as gap filling.

In future posts, we will talk more about time series analytics in Vertica 4.0. These features are not only applicable to gaining an edge in financial markets; they are equally powerful and effective to help you gain insights into other types of times series data such as web browsing click streams and call detail records.

Get Started With Vertica Today

Subscribe to Vertica