Vertica

Archive for the ‘use cases’ Category

The Disruptive Power of Big Data

Aside from the sheer quantity of digital data created every day—about 2.5 exabytes1 —there’s more to Big Data than volume. Big Data offers enterprise leaders the opportunity to dramatically change the way their organizations operate to gain competitive advantage and find new revenue opportunities. But realizing the value Big Data promises requires a new approach. Traditional data warehouses and business intelligence tools weren’t built for the scale of Big Data, and can’t provide insight quickly enough to be useful or even keep up.

But this isn’t just a case of data growth outstripping technology growth. Big Data embodies fundamental differences that necessitate new approaches and new technologies. Big Data takes many forms, three in particular we’ll discuss here:

  • Transactional data
  • Sentiment and perceptual data based on conversations taking place in social media
  • Data from networked sensors—the so-called “Internet of Things”

Transactional Data

As businesses have expanded—and expanded onto the Internet—the volume of business transactions has grown. The Economist reported in 2010 that Wal-Mart processes more than 1 million customer transactions every hour and maintains databases exceeding 2.5 petabytes (million gigabytes)2. Imagine how those numbers have grown since then.

What’s even more critical is that companies can now capture not just sales transactions, but the detailed histories and clickstreams that lead to the sale. From web-based clickstream analysis to call data records, pre- and post-transaction histories are more robust than ever—and our ability to collect, analyze and act on that data must adjust accordingly.

The social media explosion

Today’s online customer has progressed well beyond accessing information. Today’s consumers are not only interacting and collaborating with each other, but they’re talking about and interacting with your brand. Facebook has more than 1 billion active subscribers3, and it’s estimated they share almost 700,000 individual pieces on content every minute. On Twitter, more than a billion tweets go out every two to three days4. (You can watch them mapped geographically in real-time at tweetping.net.)

Product reviews, user communities, forums and blogs allow consumers to generate content that contains critical insight for the business. The proliferation of user-generated content in these social channels has lead to new techniques and tools for “sentiment analysis”—the ability to measure emotion to determine how your company and brand are perceived.

The Internet of Things

The amount of information generated by devices rather than people is also growing explosively.
Mobile devices—and the apps people use on them—regularly broadcast individuals’ location, performance and other factors to the network. Retailers and distributors are using radio frequency identification (RFID), bar and QR codes to track inventory and enhance their supply chain and inventory performance. The healthcare industry seeks to improve care and reduce costs through remote patient monitoring. The automotive industry is embedding sensors in vehicles. And utilities are beginning to rely on smart meters to track usage. McKinsey Global Institute reports that more than 30 million networked sensors are in use in the transportation, automotive, industrial, utilities and retail sectors—and the number is growing by 30 percent every year.5

We recently presented a webinar on the Internet of Things and the Power of Sensor Data, which delves into this exciting area in much more detail.

Disrupting conventional analytics – developing a ‘conversational relationship with data’

Using Big Data to make operations more efficient, improve competitiveness and increase revenue is not about generating traditional statistics or producing standard reports.

Just as important as systems to collect and store data are systems to analyze and extract insight from that data. Without insight, you can’t gain new knowledge into your markets, your products and your operations.

When you have this insight at your disposal, you can act faster and with greater probability of success.

Extracting business value from Big Data requires a new approach. We believe that Big Data analytics is an iterative process. We describe it as developing a conversational relationship with your data. Analytics becomes a continuous improvement loop, which uses the results of analyses to frame better, more meaningful analyses, which, in turn, produce more definitive results. When results are available in minutes, analysts can ask, “What if?”

When properly applied, Big Data analytics enables business leaders to:

  • Understand market reaction and brand perception
  • Identify key buying factors
  • Segment populations to customize actions
  • Enable experimentation
  • Accurately predict outcomes
  • Reinvent and enhance inventory and supply chain systems and processes
  • Disrupt their industries, gain an edge over competitors and enable new business models

Big Data already proved its game-changing power during the 2012 U.S. presidential election. Obama campaign chairman Jim Messina said: “We were going to demand data on everything, we were going to measure everything…We were going to put an analytics team inside of us to study us the entire time to make sure we were being smart about things.”
And, in fact, Big Data analytics helped the Obama campaign ratchet up the three key levers in any election: voter registration, persuasion and turnout. Rolling Stone magazine singled out Messina and the campaign’s CTO, Harper Reed, as two among a handful of unsung heroes in Obama’s victory.

You can hear more about how HP Vertica contributed to the high-tech strategy behind Obama’s reelection in a recent webinar featuring Chris Wegrzyn, director of data architecture for the Democratic National Committee.

The traditional data warehouse won’t get it done

The concept of the data warehouse evolved in the 1980s. Then, data warehouses were simply databases into which data from multiple sources was consolidated for the purpose of query and reporting. But today, these systems fall short when confronted with the volume, velocity and variety of Big Data. Why? They fail to enable the conversational approach to data required by Big Data analytics.

Traditional databases and data warehouses don’t easily scale to the hundreds of terabytes or even petabytes needed for many Big Data applications. Data is often not compressed, so huge amounts of storage and I/O bandwidth are needed to load, store and retrieve data. Data is still stored in tables by row, so access to a single data element through many rows—a common operation in business analytics—requires retrieving practically all of the data in a dataset to extract the specific element(s) needed. That strains I/O bandwidth and extends processing time. We have seen cases where the velocity of incoming data exceeds the capacity of the system to load it into the database, and queries produce answers in hours rather than the seconds or minutes needed for iterative business analytics. As a result, systems cost too much to maintain, and they fail to deliver the insight business leaders seek.

Take sentiment analysis, for example. The goal is to extract meaningful information from unstructured data so results can be stored in databases and analyzed. But the formats of resulting data are less predictable, more varied and subject to change during iterative analytics. This requires frequent changes to relational database structure and to processes that load data into them. For IT, it means the iterative approach to extracting business insight from Big Data requires new approaches, new tools and new skills.

Challenges for business leaders

Big Data is not just a technical challenge. Gaining and applying business insight compels business leaders to adopt new and disruptive ways of thinking and working.
Successful leaders we have known in data-driven organizations become more familiar with the sources of data available to them. Rather than asking IT what information is available in the database, they view information as a key competitive asset and explore how insights might be extracted from it to offer immediate and sustainable competitive advantage.

A solution for Big Data analytics

HP Vertica Analytics Platform is a new kind of database designed from the ground up for business analytics at the scale of Big Data. Compared to traditional databases and data warehouses, it drives down the cost of capturing, storing and analyzing data. And it produces answers 50 to 1,000 times faster to enable the iterative, conversational analytics approach needed.

  • HP Vertica Analytics Platform compresses data to reduce storage costs and speed access by up to 90 percent.
  • It stores data by columns rather than rows and caches data in memory to make analytic queries 50 to 1,000 times faster.
  • It uses massively parallel processing (MPP) to spread huge data volumes over any hardware, including low-cost commodity servers.
  • It uses data replication, failover and recovery to achieve automatic high availability.
  • It includes a pre-packaged, in-database analytics library to handle complex analytics and development framework.
  • It supports the R statistical programming language so analysts can create user-defined analytics inside the database.
  • It dynamically integrates with Hadoop to analyze large sets of structured, semi-structured and unstructured data.

HP Vertica Analytics Platform means better, faster business insight at less cost.


Test drive the HP Vertica Analytics Platform at www.vertica.com/evaluate.


[1] “Big Data: The Management Revolution,” Andrew McAfee and Erik Brynjolfsson, Harvard Business Review, October, 1012.

[2]“Data, data everywhere,” The Economist, Feb 25, 2010.

[3]Facebook key facts.

[4] http://www.mediabistro.com/alltwitter/tweetping_b35247

[5] “Big data: The next frontier for innovation, competition, and productivity,” The McKinsey Global Institute, June 2011.

Sensor Data and the Internet of Things: When Big Data Gets Really Big

I remember back in the 1990s when Sun Microsystems claimed that “Java anywhere” would even make refrigerators intelligent to know when you were out of milk, triggering a series of events that ultimately resulted in a grocery delivery chain bringing your milk to your door step the very next day.

Fast forward to today. There are millions (and soon billions) of devices that are connected to the Internet — cars, medical equipment, buildings, meters, power grids, and, yes, even refrigerators. These connected devices comprise the Internet of Things (also known as Machine to Machine or M2M).

But why is this important to your world of Big Data analytics?

The Internet of Things is generating an unfathomable amount of sensor data  — data that product manufacturers, particularly, need to manage and analyze to build better products, predict failures to reduce costs, and understand customer behavior to differentiate and improve loyalty.

In fact, a recent report by IDC’s The Digital Universe 2020 forecasts that machine-generated data will increase to 42 percent of all data by 2020, up from 11 percent in 2005.

The use cases are proven and here. Some are even mainstream. Think Progressive Insurance’s Snapshot pay-as-you-drive insurance commercials that have taken over our airwaves. Others are around us, and you may not even know it. Over your next work day, think about how many devices are connected and distributing information just waiting for analysis — your car, train, flight, or bus; traffic lights, road side signs, the elevator and escalator, an ATM, your check-out system.

But, more importantly, join us for our upcoming Webcast: Unlocking the Massive Potential of Sensor Data and the Internet of Things on Thursday, February 14th at noon EST (9:00AM PST).

We look forward to continuing the conversation and share these and other emerging use cases, real-world case studies, and a technology perspective to help you prepare for this massive opportunity ushered in by sensor data and the Internet of Things!

Top 3 Discussions from TDWI 2012

At last week’s TDWI Conference in Orlando, FL there was a general buzz about how to derive value from Big Data for competitive advantage. That makes perfect sense, given that the general conference theme was to “Get Smarter with Big Data.”

Naturally, attendees – largely data scientists, business analysts, architects, and developers — were on hand to learn more about which technologies to recommend back at headquarters for their Big Data initiatives. We at HP Vertica were busy addressing a range of questions at our booth. However, the conversations went beyond features, benefits, and how we complement BI, ETL, and technologies like Hadoop. Attendees were also generally interested in learning from their peers and presenters how to:

    1.  Build a Business Case
    2. One attendee from a Fortune 100 company could understand how his company could gain greater analytics performance from a columnar database (such as HP Vertica Analytics Platform). However, he acknowledged that his company needed to step back and learn how the insight from Big Data analytics would address top-line objectives: reduce costs, generate revenue, differentiate, and ultimately improve customer satisfaction.

    3. Define the Meaning and Value of Big Data
    4. My colleague, Chris Selland, participated in a panel (Business Value: The Fourth V of Big Data) that answered a range of questions from the audience. It was clear that attendees still struggled with a clear common definition of Big Data. The three V analogy (Velocity, Volume, and Variety) seemed to provide more clarity. However, the fourth V (Value) underscored how Big Data analytics is supporting overall business benefits. For an example, check out our recently published case study — Cardlytics Serves Up Success with HP Vertica.  

    5. Adopt Common Use Cases
    6. There were some interesting lunch-time discussions centered on how Big Data analytics could address common use cases by industry. In health care, for example, health care providers are using analytics to ultimately provide improved preventive care based on family history, test results, etc. A large logistics company is using analytics for route optimization in reducing fuel costs and overall emissions. Clickstream analytics, fraud detection, inventory management, and a myriad of use cases are emerging as companies gain greater insight from their Big Data, once they have removed the technology barriers (performance, scalability, and overall low TCO).

    How does your organization plan to derive value from Big Data? We’d love to hear your use case.

Heating up Data in Vertica

A couple of weeks ago Vertica saw the interns teaming up and working on their own UDx projects. These projects were then presented under the eyes of judges who scored each group on various criteria. One hot project was our Heat Map UDT. After some additional work, the Heat Map is becoming a promising addition to the Vertica UDx family!  Let’s get this show on the road!

Columns of data in need of some quick and dirty analysis?  Enter Vertica and the Heat Map Transform, a highly extensible and parameterized analysis tool.

Imagine you had a client who came to you for advice on how to improve their popular first person shooter.  After some talking you come to the conclusion that the maps may not be well balanced and players simply die too often.  If only there was some way to keep them alive longer, and still have an exciting and fast-paced level!  A good first step might be to figure out where the dangerous zones are in the maps so that the client may figure out where the maps may need balance changes.  Perhaps by balancing out the map, players will start to utilize the level in its entirety and allow for more tactical, strategic play.  Conveniently, the client has been logging all of the death and killing locations into his Vertica database. Now how to quickly process this information to get some useful visual results?  Easy! Let’s try out the Heat Map UDT from the Vertica Extension Packages GitHub repository.

(more…)

Life Beyond Indices: The Query Benefits of Storing Sorted Data

With the Vertica Analytics Platform, there are a number of benefits to storing compressed and sorted data, as well as operating directly on compressed data, that we have discussed in previous posts. In this post, I’m going to discuss how the Vertica Analytics Platform takes advantage of this sorted data to make query execution blindingly fast, which obviates the need for traditional DBMS indexes.

Unlike traditional DBMS solutions, Vertica has no user-defined indexes. Indexes in an analytic database take up DBA time (figuring out which indexes to make), storage capacity, and load time (to keep them up to date). Even if an index consumes only 10% of the size of the original data and takes 10% more time during load, storing even a few indexes on terabytes of data is costly.  As we have mentioned before, a true column store isn’t just a vertically-partitioned row store.

How does Vertica query huge volumes without indexes? It’s easy… the data is sorted by column value, something we can do because we wrote both our storage engine and execution engine from scratch. We don’t store the data by insert order, nor do we limit sorting to within a set of disk blocks. Instead, we have put significant engineering effort into keeping the data totally sorted during its entire lifetime in Vertica. It should be clear how sorted data increases compression ratios (by putting similar values next to each other in the data stream), but it might be less obvious at first how we use sorted data to increase query speed as well.

Let’s start with the simplest and easiest to understand example: the data is sorted the way a query requests it (ORDER BY). Consider a snippet of trading data sorted by stock and price (see Table 1).  If the user’s query requests all the data ordered by the stock and price, they might use something like:

SELECT stock, price FROM ticks ORDER BY stock, price;

Clearly Vertica is off the hook to do any sort at runtime: data is just read off disk (with perhaps some merging) and we are done.

Table 1: Illustration of data sorted on (stock, price). Other columns are omitted for clarity.

A more interesting query might ask for a single stock’s data ordered by price:

SELECT stock, price FROM ticks WHERE stock=’IBM’ ORDER BY price;

Finding rows in storage (disk or memory) that match stock=’IBM’ is quite easy when the data is sorted, simply by applying your favorite search algorithm (no indexes are required!). Furthermore, it isn’t even necessary to sort the stock=’IBM’ rows because the predicate ensures the secondary sort becomes primary within the rows that match as illustrated below:

Table 2: when only rows that match stock=’IBM’ are considered, the results are ordered by price, and thus no additional sorting is required.

Next, let us consider a query that computes the average price for each stock symbol:

SELECT stock, avg(price) FROM ticks GROUP BY stock;

In general, the aggregator operator does not know a priori how many distinct stocks there are nor in what order that they will be encountered. One common approach to computing the aggregation is to keep some sort of lookup table in memory with the partial aggregates for each distinct stock. When a new tuple is read by the aggregator, its corresponding row in the table is found (or a new one is made) and the aggregate is updated as shown below:

Table 3: Illustration of aggregation when data is not sorted on stock. The aggregator has processed the first 4 rows: It has updated HPQ three times with 100, 102 and 103 for an average of 101.66, and it has updated IBM once for an average of 100. Now it encounters ORCL and needs to make a new entry in the table.

This scheme, often denoted as “Group By Hash” because a hash table is used as the lookup data structure, does a good job when there are a small number of groups. However, when there are a large number of groups, it takes significant RAM to store the hash table and provisions need to be made when RAM is exhausted (typically by spilling to disk).

With Vertica, a second type of aggregation algorithm is possible because the data is already sorted, so every distinct stock symbol appears together in the input stream. In this case, the aggregator can easily find the average stock price for each symbol while keeping only one intermediate average at any point in time. Once it sees a new symbol, the same symbol will never be seen again and the current average may be generated. This is illustrated below:

Table 4: Illustration of aggregation when data is sorted on stock. The aggregator has processed the first 7 rows. It has already computed the final averages of stock A and of stock HPQ and has seen the first value of stock IBM resulting in the current average of 100. When the aggregator encounters the next IBM row with price 103 it will update the average to 101.5. When the ORCL row is encountered the output row IBM,101.5 is produced.

This scheme, commonly called “one pass aggregation“ has pipelined parallelism (the same concept as instruction pipelining) if the data is already sorted according to stock. This means we can start producing tuples for downstream operators to consume almost immediately. Given that the Vertica execution is multi-threaded, and all modern machines have multiple cores, pipelined parallelism decreases query execution time.

Of course, one pass aggregation is used in other systems (often called SORT GROUP BY), but they require a sort at runtime to sort the data by stock. Forcing a sort before the aggregation costs execution time and it prevents pipelined parallelism because all the tuples must be seen by the sort before any can be sent on. Using an index is also a possibility, but that requires more I/O, both to get the index and then to get the actual values. This is a reasonable approach for systems that aren’t designed for reporting, such as those that are designed for OLTP, but for analytic systems that often handle queries that contain large numbers of groups it is a killer.

I hear you ask what kind of real-world queries aggregate large numbers of groups? There are at least two very common scenarios that our customers encounter: distinct counts and correlated subqueries with aggregation that have been flattened into joins. Our web analytics customers typically have queries that look for distinct visitors given some condition such as:

SELECT count(DISTINCT visitor_id) FROM user_sessions WHERE <filtering predicates>;

The applicability of one-pass aggregation can be seen if we rewrite the query to an equivalent form:

SELECT COUNT(sq.visitor_id) from (select visitor_id FROM user_sessions WHERE <filtering predicates> GROUP BY visitor_id) as sq

And as such is amenable to the same “group by pipeline” optimization of data sorted on visitor_id. As you are probably glazing over at this point, I will postpone further technical discussion of flattened subqueries for a future discussion if there is sufficient interest.

Another area where having pre-sorted data helps is the computation of SQL-99 analytics. We can optimize the PARTITON BY clause in a manner very similar to GROUP BY when the partition keys are sequential in the data stream. We can also optimize the analytic ORDER BY clause similarly to the normal SQL ORDER BY clause.

The final area to consider is Merge-Join. Of course this is not a new idea, but other database systems typically have Sort-Merge-Join, whereby a large join can be performed by pre-sorting the data from both input relations according to the join keys. Since Vertica already has the data sorted, it is often possible to skip the costly sort and begin the join right away.

Since sorting is such a fundamental part of our system, we have built sophisticated infrastructure in the Vertica Optimizer to track the sortedness of various intermediate results. Our infrastructure takes into account that some columns are equivalent after joining, that some columns have had constant predicates, that some expressions (e.g. price * 100) maintain sortedness, and a host of other factors. By keeping careful track, we maximize the opportunities to apply the optimizations shown above, all without any additional storage.

Of course, Vertica is not limited to a single sort order for each table. In fact, if redundant copies of the data need to be stored to survive node failures, the different copies can be stored with different sort orders. Different sort orders furthers the chance that we can apply one of our sort-based optimizations. And lest you think we are simply swapping determining sort order for determining indexes for a new DBA headache, the optimal sort order of the physical storage is typically automatically determined by the Vertica Database Designer!

If anyone wants me to spell out a specific topic in more detail leave a comment below and let me know!

Andrew

Tales from a Cocktail Party: How Customers Use Vertica

On Wednesday, I gave one of the lightning talks at the 4th Extremely Large Databases Conference (XLDB) at Stanford University about how real customers use Vertica to store and analyze their ‘extremely large’ databases. The 5 minute format is tough.  I had to pick only three customers   but I hope I was able to get across how cool the things they are doing with Vertica are.

Given the proximity to San Francisco, it occurs to me that Vertica is basically in the same business that Levi Strauss was in during the 19th century gold rush: selling supplies to a rapid and lucrative market where there is lots of money to be made. Our modern day gold rush is between businesses to see who can use the massive amounts of data they collect to make the best X possible, where X is anything from ‘television distribution network’ to ‘massive online game’ and everything in between.

Given that we are in the system software business, we sell a (vital) component of an end product and thus I largely don’t see complete applications. I most typically see the SQL that those applications generate, and on most days I am focused on the nitty-gritty details of making it work as fast as possible.  Sometimes, though, I get to step back and see what is really being done with our products and it’s usually quite cool.

So, in this blog post, I wanted to give you some cool cocktail party tidbits about what our customers do with their data…

One telecom company told me that (unsurprisingly) most of those 500 channels on your cable box never get watched.  So most of the time, the company doesn’t waste their network’s bandwidth sending all of the channels to your house.  Instead, they use switched video and then use Vertica to analyze the patterns of who watches what and when in every local distribution group, so that they can make network planning decisions like “how much switched video capacity do we really need for those 100 subscribers?”

Cocktail item: This telco has an (anonymized) record of what station each television box is tuned to for every minute of every day.

Another of our customers analyzes SMS message content for big mobile service providers. To be honest, I don’t know exactly what they are doing with this information, but you might imagine they have interesting data and interesting conclusions, and they use Vertica to parse this data.

Cocktail item: Question: What is the most common text message? Answer: The single letter ‘k.’

My aunt is a devoted online farmer, but I doubt she realizes how much technical firepower is used to analyze her online buying habits. Zynga, the developer of FarmVille, Mafia Wars and a number of other popular online games, has a massive Vertica cluster that they use to analyze past in-game decisions with the goal of making their games more fun to play.  This means you will spend more time playing the games, and hopefully spending more money with Zynga.

Cocktail Item: Zynga’s revenue is massive, estimated by various sources to be $200M or greater in 2009, and also reportedly on track for a killer 2010.  Who would have thought virtual farming or playing a gangster would be so lucrative?

Vertica’s customers are from a wide range of industries – gaming, telecommunications, financial, healthcare, and more – but they share one thing in common.  They all use Vertica to analyze massive amounts of data in real time, converting it into usable information that helps drive business decisions.  And now, I feel like a cocktail!

Reading between the Lines with Vertica 4.0

In the recent blockbuster flick Avatar, both the hero and heroine possess the skill referred to as “I See you” in Na’vi speak. Or as we Earthlings may say, perhaps to a more accurate degree, that they know how to read between the lines.

In a key scene, Neytiri and Jake speak under the Tree of Voices. Neytiri tells Jake that he is one of the Omaticaya now and it is time for him to choose a companion. As she starts listing the fine candidates in her tribe, Jake suppresses a smile and replies (in Na’vi): “I’ve already chosen. But this woman must also choose me.” At Jake’s response, Neytiri’s face turns into relief and satisfaction: “She already has.”

The skill of interpolation is indispensable to business as well. Say you are a financial analyst looking to finding possible price correlations between the bid prices of Google and Microsoft over the past 3 months. While the stock ticker stream stored in your database contains all the necessary bid events of both stocks, these events do not always occur at regular time intervals, preventing you from comparing apples to apples. Here’s what is stored in your database:

Symbol Bid Timestamp
MSFT 30.83 09:59:59
GOOG 529.10 10:00:00
MSFT 30.87 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

To begin your analysis, you’d like to normalize your data, by extracting the bid prices at regular time intervals, say one bid price per second for each stock. You know the bid price of MSFT is at $30.83 at time 09:59:59, and it subsequently rises to $30.87 at 10:00:02. What should the bid price be between these time points, so at 10am? In financial analysis, it is standard practice to assume that a stock’s bid price remains constant until the next bid event occurs. Therefore, between 09:59:59 (inclusive) and 10:00:02 (exclusive), the bid price of MSFT remains at $30.83. Based on this understanding your normalized output, starting at 10am, will look like (interpolated values are in green):

Symbol Bid Timestamp
MSFT 30.83 10:00:00
GOOG 529.10 10:00:00
MSFT 30.83 10:00:01
GOOG 529.10 10:00:01
MSFT 30.87 10:00:02
GOOG 529.10 10:00:02
MSFT 30.89 10:00:03
GOOG 529.13 10:00:03

Voilà! You have successfully synthesized the output events at every second. This is an example of interpolation — with the bid prices of MSFT at 09:59:59 and 10:00:02, you can interpolate the bid prices at any time point in between. Thanks to interpolation, you can now conduct further analysis on your stock data.

Performing Interpolation in Vertica 4.0

You have three options for performing interpolation:

  1. Use a statistical software program pulling data out of your database. While such software often supports sophisticated interpolation schemes, you know this won’t cut it for the hundreds of GBs of bid stream data stored in your database due to the scalability challenges in both CPU and I/O.
  2. Write your own C++/Java/Perl program. Always a viable option, but you know your time could be better spent.
  3. Use Vertica 4.0 to interpolate. You can now easily perform interpolation as well as other powerful time-series analytics within Vertica at high efficiency and low cost.

When designing our new, patent-pending time-series analytics features, we focused on the following design goals:

  • Ease of use by following similar syntax to existing SQL
  • Powerful semantics in formulating time series computation
  • Highly efficient and scalable native execution in Vertica’s MPP column-store query engine (i.e., bring analytics computation closer to the data)

In sum, we baked time series analytics into our SQL syntax and execution engine, for simplicity and speed. In contrast, another popular solution is to use UDFs for analytics. However, the design and implementation of UDFs often lead to less intuitive syntax as well as non-trivial runtime overhead.

In Vertica 4.0, your interpolation solution can be expressed in just three lines of SQL:

<blockquote>SELECT slice_time, symbol, TS_FIRST_VALUE(bid) AS first_bid
FROM Tickstore
TIMESERIES slice_time AS '1 second' OVER
(PARTITION BY symbol ORDER BY timestamp);
</blockquote>

Congratulations! Now you know how to read between the lines with Vertica 4.0. Go and make a dent in the stock market.

Peeking under the Covers

Let’s take a closer look at what the above query does, focusing on the input and output bid events on the Google stock.

 

The two red dots denote the two input GOOG events. The vertical dashed lines delimit the 1-second time intervals, referred to as time slices. The dashed horizontal lines denote the bid values of GOOG at those time points when there are no input events. The blue stars denote the output GOOG bid events, lying in the intersections of the vertical and horizontal lines.

Note that there is a time slice gap between 10:00:01 and 10:00:02 – there are no input events. However, the output is guaranteed to have a GOOG event for that time slice. This behavior is commonly referred to as gap filling.

In future posts, we will talk more about time series analytics in Vertica 4.0. These features are not only applicable to gaining an edge in financial markets; they are equally powerful and effective to help you gain insights into other types of times series data such as web browsing click streams and call detail records.