Author Archive

No, You Do Not Need One Projection Per Query in Vertica

Projections are the Vertica Analytic Database’s only data structure. Every row of every table stored in Vertica is stored in a projection. There is no such thing as a query that “does not hit the projections.” If someone says those words, you should immediately suspect their motives and/or competence. We are quite open about projections (see previous posts such as this, this, and this) and we think they are part of Vertica’s fundamental technical advantage. For a more detailed description of projections, please see our VLDB paper from last year.

The idea that Vertica needs a special projection for every query in order to get good performance is just wrong. This rumor is spread as a FUD sales tactic by one of our more unscrupulous competitors, who knows it is not true and yet promulgates it anyway. We have typically assumed that people see through this transparent tactic, but after reading the same idea repeated in some otherwise credible articles and posts on the internet recently, I feel the need to set the record straight. Our competitor’s FUD-inducing claim is semi-believable because it plays on a classic DBA nightmare: full table scans in Row Store systems.

There is a fundamental technical difference between a native Column Store (e.g. Vertica) and a Row Store for ad hoc analysis when you do not have a specialized data structure for that query. In a Row Store, if you don’t have an appropriate index, the system must fall back to a full table scan to retrieve the data. Scanning an entire table’s worth of data is almost always a disastrous amount of I/O for large tables. In a Column Store, however, even if you don’t have the optimal physical structure for a specific query (for example, the optimal projection in Vertica), you simply end up with column scans for the columns the query refers to.

Furthermore, due to the fact that we built our column storage and execution engine from the ground up with this kind of scenario in mind, our specialized storage format can often avoid reading all the column data from the disk even when a “full column scan” is needed. Along with the other well discussed benefits such as better compression, the fundamental I/O benefit for ad hoc queries is why a Column Store architecture is so much better suited to many data warehouse scenarios than a Row Store.
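To see the scale of the difference, here is a back-of-envelope comparison in Python; the table size, column count, and compression ratio are illustrative assumptions, not Vertica measurements.

```python
# Back-of-envelope I/O comparison (illustrative numbers, not measurements):
# a 50-column fact table with 1 billion rows, 8 bytes per column value.
rows = 1_000_000_000
columns = 50
bytes_per_value = 8

# A row store without a covering index must scan every column of every row.
row_store_scan = rows * columns * bytes_per_value

# A column store touching only the 3 columns a query references scans just
# those, and compression (assume a modest 4x columnar ratio) shrinks them more.
referenced_columns = 3
compression_ratio = 4
column_store_scan = rows * referenced_columns * bytes_per_value // compression_ratio

print(f"row store:    {row_store_scan / 1e9:.0f} GB")
print(f"column store: {column_store_scan / 1e9:.0f} GB")
print(f"ratio:        {row_store_scan // column_store_scan}x less I/O")
```

Even before Vertica’s further tricks for skipping column data, the “full column scan” fallback is orders of magnitude cheaper than a full table scan.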

When Customers Buy You Beer, You Are on to Something

A few weeks ago, Shilpa, our VP of engineering, was in New York City visiting prospective customers. While there, she also had an informal meetup with new and existing customers. One of our new customers liked Vertica so much that he literally handed Shilpa money to buy the Vertica Engineering team beer.

So, she did what all good managers do – delegate the acquisition to Sumeet. Thanks to his efforts we had a very special addition to one of our recent engineering lunches.

Nick, cheers from the entire engineering team! Thank you for your gift – we will all keep working hard to ensure your experience with Vertica continues to be a pleasure.


Vertica Lunch

If you are intrigued, don’t take my anecdotal customer stories for why Vertica is great – try it yourself with the Vertica Community Edition.

P.S. If you are interested in working somewhere customers like your product so much they send you tasty beverages, we are hiring in all areas. Within engineering specifically we are looking for hackers from the lowest level depths of the database server, up through the client interfaces, the management console and third party integration programs. Consider coming in to talk with us:

When UPDATE is actually INSERT

At the VLDB 2012 conference a few weeks ago, we had a chance to listen to Jiri Schindler giving a tutorial about NoSQL.  His interesting and informative presentation covered the fundamental architecture and I/O usage patterns of RDBMS systems and various NoSQL data management systems, such as HBase, Cassandra, and MongoDB.

During the presentation, Schindler listed basic I/O access patterns for columnar databases using the slide below. It is hard to summarize the operation of the various columnar database systems on a single slide, and Schindler did a great job given the constraints of the presentation. However, while his characterization might hold for other columnar databases, the Vertica Analytic Database has a different I/O pattern for UPDATEs, which we wanted to explain in more detail.

First, Vertica does not require synchronous I/O of a recovery log. Unlike most other RDBMS systems,  Vertica implements durability and fault tolerance via distributed replication.

Second, since Vertica never modifies storage in place, it avoids the other I/O intensive operations referenced in the slide.

When a user issues an UPDATE statement, Vertica performs the equivalent of a delete followed by an insert. The existing row is deleted by inserting a Delete Vector (a small record saying that the row was deleted), and a new copy of the row with the appropriately updated columns is inserted. Both the Delete Vector and the new version of the row are stored in a memory buffer known as the WOS (write optimized store). After sufficient data has accumulated in the WOS from INSERTs, UPDATEs, DELETEs, and COPYs (bulk loads), they are moved in bulk to disk storage known as the ROS.

It is important to note that existing files in the ROS are not modified while data is moved from the WOS to the ROS – rather, a new set of sorted and encoded column files is created. To avoid a large number of files accumulating over time, the Tuple Mover regularly merges column files together using an algorithm that limits the number of times any tuple is rewritten and that uses large contiguous disk operations, which is quite efficient on most modern file and disk systems.

This arrangement has several advantages:

  1. From the user’s point of view, the update statement completes quickly and future queries get the expected answer (by filtering out the original values at runtime using the Delete Vectors).
  2. The cost of sorting, encoding, and writing column files to disk is amortized over a large number of rows by utilizing the in memory WOS.
  3. I/O is always in proportion to the number of rows inserted or modified – it is never the case that an update of a small number of rows causes I/O on a significant amount of previously stored data.
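The update path can be sketched abstractly. The following Python toy (with hypothetical class and field names, not Vertica’s actual structures) models an append-only store where UPDATE marks the old row in a delete vector and appends the new version, and scans filter deleted positions at runtime.

```python
# Minimal sketch (not Vertica's actual structures) of UPDATE as delete+insert.
# Storage is append-only; deletes are recorded as positions in a "delete
# vector" and filtered out at read time.

class ColumnStore:
    def __init__(self):
        self.rows = []        # append-only row storage (a ROS analogue)
        self.deleted = set()  # delete vector: positions of deleted rows

    def insert(self, row):
        self.rows.append(row)

    def update(self, predicate, new_values):
        # Mark matching rows deleted and append updated copies; nothing is
        # rewritten in place, so I/O is proportional to the rows touched.
        matches = [pos for pos, row in enumerate(self.rows)
                   if pos not in self.deleted and predicate(row)]
        for pos in matches:
            self.deleted.add(pos)
            self.insert({**self.rows[pos], **new_values})

    def scan(self):
        # Queries filter out deleted positions at runtime.
        return [r for pos, r in enumerate(self.rows) if pos not in self.deleted]

store = ColumnStore()
store.insert({"stock": "IBM", "price": 100})
store.insert({"stock": "HPQ", "price": 50})
store.update(lambda r: r["stock"] == "IBM", {"price": 105})
print(store.scan())  # [{'stock': 'HPQ', 'price': 50}, {'stock': 'IBM', 'price': 105}]
```

Note how the update never rewrites previously stored data: it only appends one small delete record and one new row.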

For more details about how data is stored, and about Vertica’s overall architecture and design decisions, please consider reading our VLDB 2012 paper.



VLDB 2012 – Istanbul Bound!

I’ll be giving a talk next week about Vertica at VLDB 2012. If you happen to be in Istanbul, please stop by (Nga and I have a T-Shirt for you). Our paper can be found at the VLDB website:

The Vertica Analytic Database: C-Store 7 Years Later

At Vertica/HP, we pride ourselves on cutting-edge technology, informed by the latest academic research and applied with cutting-edge software craftsmanship. Over the years, we have benefited from close collaboration with academic researchers, befitting a company founded by Mike Stonebraker.

Vertica Systems was originally founded to commercialize the ideas from the C-Store research project, developed at MIT and other top universities and originally described in a VLDB 2005 paper. This year, I am proud that we have come full circle and published a rigorous technical description of the Vertica Analytic Database in VLDB 2012.

We look forward to many more years of technical breakthroughs and cool innovation in analytic database systems. Speaking of which, we are hiring! If you are a superstar (cliché, I know) and are interested in working with us to

  • Design, build and test challenging distributed systems, database internals, and analytics systems software
  • Bring one of the very few new database engines to new customers who desperately need it

Drop us a line at

Vertica Analytics Anywhere

by Andrew Lamb

As Vertica evolved to address the needs of diverse analytics users, a common refrain we heard from our customers was that data modeling and exploration is a key activity for data scientists. This is the phase when data is available but they aren’t quite sure how to harness it yet. Over a series of experiments and iterations, the right data model emerges, and at that point it can be operationalized in Vertica for ongoing interactive use. People often use Hadoop for this phase, which gives them the flexibility to access any data, but it means they must write MapReduce programs for analytics and are unable to leverage the sophisticated analytics available in Vertica. This insight led us to decouple our Analytics Engine from our columnar storage to further extend our patent-pending FlexStore architecture. With Vertica 6, it is now possible to use the full expressive power of the Vertica Analytics Engine and its analytics without having to load the data into Vertica!

With External Tables combined with User-Defined Loads in Vertica 6, we not only support conventional external tables backed by files on a database server, but also external tables backed by any user-defined data source. We have already written adapters for HDFS, FTP or HTTP servers, JSON and XML objects, IDOL, and of course, other databases via ODBC. (Stay tuned for future blog posts on each of these!) The ability to analyze arbitrary data sources in this federated fashion enables powerful mash-ups, such as joining structured data in Vertica with semi-structured data (think log files) in HDFS, unstructured data (think audio or images) indexed in IDOL, or master data in other legacy relational databases. The combined data set can then be analyzed using the native analytics in Vertica such as Timeseries, Event Series Pattern Matching, and SQL, as well as a growing body of user-defined analytic custom extensions in C++, and now R!

Of course, as you might expect, analytics over external data is significantly slower than analytics over data stored in Vertica’s native, highly compressed columnar storage format, but it offers the same “late binding” flexibility people love about NoSQL interfaces while continuing to leverage familiar SQL interfaces and BI tools. And, thanks to Vertica’s fast MPP engine and C++ implementation, it is significantly faster than alternatives like Pig or Hive on top of Hadoop. You may choose to leave less valuable information in cheaper and slower storage such as HDFS and never move it into Vertica. And if you change your mind, or when the right data model is discovered, or you just want a go-fast switch, then with a slight tweak of syntax – voila! – the same data is loaded into Vertica to automatically get full high availability, high compression, backup, recovery, automatic storage optimization, and the other benefits of an enterprise-class analytic platform!

The figure illustrates how external tables fit into the overall architecture of Vertica.

To use an external table, you define a table with an external keyword and provide information about the data source. Whenever that external table is read, the database retrieves data from the external source and parses it into the appropriate relational form and the rest of the query plan proceeds as normal.
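Conceptually, an external table is a schema plus a data source that is re-read and re-parsed on every scan. A minimal Python sketch, assuming an illustrative CSV-over-anything source (none of these names are Vertica APIs):

```python
# Conceptual sketch of an external table: the schema lives in the database,
# but rows are pulled from a user-defined source and parsed on every read.
# All names here are illustrative, not Vertica APIs.
import csv
import io

def http_like_source():
    # Stand-in for any user-defined source (HDFS, FTP, HTTP, ODBC, ...).
    yield "IBM,100\nHPQ,50\n"

def external_table(source, schema):
    # Re-read and re-parse the source each time the table is scanned:
    # that is the "late binding" an external table gives you.
    for chunk in source():
        for raw in csv.reader(io.StringIO(chunk)):
            yield {name: cast(value) for (name, cast), value in zip(schema, raw)}

schema = [("stock", str), ("price", int)]
rows = list(external_table(http_like_source, schema))
print(rows)  # [{'stock': 'IBM', 'price': 100}, {'stock': 'HPQ', 'price': 50}]
```

The rest of the query plan consumes these rows exactly as it would rows read from native storage, which is why the full analytics engine applies unchanged.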

And of course, we also enable the legacy use case for external tables: simpler and/or quicker ETL/ELT. Rather than loading data into a temporary staging table prior to transformation in the database, the transformation begins by reading the data directly from the external files it lives in, thus avoiding an unnecessary materialization in database storage structures.

We believe that this separation of analytics from storage will let more people use Vertica’s analytics on more data in more interesting ways! And that is after all, what Vertica is all about!

Reports of SQL’s Death Are Greatly Exaggerated

Apache Log Analysis in Vertica

I am a proud new father, and of course the first thing I did was put pictures of my daughter online on our web server. Naturally, I then wanted to see who had been looking at them. I could have gone the Google Analytics route, but being a geek I wanted to explore the data myself rather than just get a static report.

Before Vertica, I probably would have written a Perl script for this kind of analysis (because Perl has the best regexp support of any language I know), but as soon as I start doing anything more complicated than summarizing, things get ugly quickly. Specifically, I wanted to group the web log entries into sessions (“sessionize”) to analyze visits rather than page views. According to the interwebs, Hadoop is often used for this kind of analysis, but it still requires a program to compute the statistics of interest, though it can distribute that computation across many machines.

Since working at Vertica, I have become convinced that SQL is an excellent language for this kind of analysis – it lets one easily express declaratively what is painful to express programmatically (e.g. COUNT DISTINCT). An often-cited problem with SQL-based analysis is that you first need to get your data into a database by writing a load script that parses your logs into some structured table format. This is far from impossible (any analysis needs to put the logs into a structured format), but it is annoying that the parsing code (the script) and the structure definition (the SQL DDL) aren’t bound together.

Recently, I have been working at Vertica on our extensibility mechanism to extend our database from within. So I selfishly used my desire to analyze my own web logs to justify writing an example of parsing Apache logs inside the database. On a recent cross country flight I whipped up a simple Apache log parser (now included as an example in our SDK in 5.0 – if you try it out let me know what you think!). The hardest part of the parser was dealing with the Apache log format (which for some reason changed sometime since the first batch of logs I have from 2005).
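For a flavor of what such a parser does, here is a minimal Python sketch of parsing the Apache combined log format; the regular expression and sample line are simplified assumptions, not the C++ SDK example itself.

```python
import re

# A simplified sketch of parsing the Apache combined log format (the SDK
# example referenced above does this in C++ inside Vertica).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request_url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referring_url>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return one dict per log entry, one key per logical log field."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('203.0.113.9 - changed [10/Apr/2011:08:11:43 -0400] '
        '"GET /pics/1.jpg HTTP/1.1" 200 48321 '
        '"http://example.com/pics/" "Mozilla/5.0"')
row = parse_line(line)
print(row["ip"], row["request_url"], row["status"])
```

The in-database version does the same field extraction, but emits the fields as relational columns directly, so the parsing logic and table structure live together.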

Armed with the log parser, the analysis of who was looking at my daughter’s pictures became pretty easy. Furthermore, because I had access to the raw log data in a database, I ended up finding several other fascinating patterns that I hadn’t specifically set out to find. The more I see, the more convinced I am that Hal Varian has it right and that data analysis will be the sexy job of the next decade.

The analysis steps were simple:

  • rsync logs from my web server to my laptop
  • Get logs into Vertica with straightforward SQL:

CREATE TABLE raw_logs (filename VARCHAR(500), raw_log VARCHAR(4000));

COPY raw_logs (filename AS 'access_log', raw_log) FROM
DELIMITER E'\1'; -- avoid field parsing on tabs

  • Install and run the log parser code

-- Install the parser code
CREATE LIBRARY ParserLib AS '/tmp/';
CREATE TRANSFORM FUNCTION ApacheParser
AS LANGUAGE 'C++' NAME 'ApacheParserFactory' LIBRARY ParserLib;

-- Parse the logs into a new table
CREATE TABLE parsed_logs AS
SELECT filename, ApacheParser(raw_log) OVER (PARTITION BY filename)
FROM raw_logs;

Voila! Now I have a structured table with one row per log entry (i.e. per file served by the server) and one column per logical log field. It is now a simple task to collect the clicks into sessions (see Sessionize with Style):

CREATE TABLE parsed_sessions AS
SELECT *,
  CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > '30 seconds')
    OVER (PARTITION BY ip ORDER BY ts) || REPLACE(ip, '.', '') AS session_id
FROM parsed_logs;
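For readers who prefer to see the rule spelled out, here is the same 30-second sessionization logic sketched in plain Python – a toy equivalent of what CONDITIONAL_TRUE_EVENT computes here, not Vertica code:

```python
from datetime import datetime, timedelta

# Within each ip, a new session starts whenever the gap since the previous
# hit exceeds the timeout (30 seconds, as in the SQL above).
def sessionize(hits, timeout=timedelta(seconds=30)):
    sessions = {}             # (ip, session_no) -> list of timestamps
    last_ts, session_no = {}, {}
    for ip, ts in sorted(hits):   # sort by ip, then timestamp
        if ip not in last_ts or ts - last_ts[ip] > timeout:
            session_no[ip] = session_no.get(ip, -1) + 1
        last_ts[ip] = ts
        sessions.setdefault((ip, session_no[ip]), []).append(ts)
    return sessions

t0 = datetime(2011, 4, 10, 8, 0, 0)
hits = [("1.2.3.4", t0),
        ("1.2.3.4", t0 + timedelta(seconds=10)),   # same session
        ("1.2.3.4", t0 + timedelta(seconds=120)),  # gap > 30s: new session
        ("5.6.7.8", t0)]
print({k: len(v) for k, v in sessionize(hits).items()})
```

The SQL version does this in one declarative pass, with the sort and partitioning handled by the engine.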

Now I am ready to ask questions like:

How many sessions, ips, clicks and total bytes were served for my daughter’s pages?

select
  count(distinct session_id) as session_count,
  count(distinct ip) as ip_count,
  count(*) as total_click_count,
  sum(response_size)/(1024*1024) as Mbytes
from parsed_sessions
where extract(year from ts) = '2011' and username = 'changed';

 session_count | ip_count | total_click_count | Mbytes
---------------+----------+-------------------+---------
           313 |      162 |             11151 | 7353.86
(1 row)

Who looked at the most pictures?

select
  max(ts) as max_ts,
  count(*) as click_count
from parsed_sessions
where extract(year from ts) = '2011' and username = 'changed'
group by ip, session_id
order by click_count desc
limit 10;

         max_ts         | click_count
------------------------+-------------
 2011-04-10 08:11:43-04 |         294
 2011-04-12 11:53:51-04 |         197
 2011-04-12 09:22:20-04 |         191
 2011-04-12 06:17:36-04 |         184
 2011-04-10 10:46:36-04 |         171
 2011-04-18 11:18:52-04 |         167
 2011-04-10 14:47:31-04 |         160
 2011-04-12 13:35:12-04 |         159
 2011-04-12 18:15:56-04 |         157
 2011-04-14 10:04:10-04 |         153
(10 rows)

So now I was curious: who were those top 10 clickers? At this point, querying the raw data (as opposed to an aggregated report) was super helpful.

select distinct cnt_rnk, ps.session_id, ip
from parsed_sessions ps JOIN click_counts cc USING (session_id)
where extract(year from ts) = '2011' and
      username = 'changed' and
      ps.session_id IN (select session_id from click_counts where cnt_rnk <= 5)
order by cnt_rnk;

Without divulging any of the actual results (privacy, you know), it turns out 7 of the top 10 were my wife and me, one was my mother-in-law, one was a family friend, and one is still a mystery that I am looking into.

As I was poking around, I noticed another interesting pattern that I wasn’t specifically looking for: a lot of requests came in from Google searches. So my next logical query was:

What are people searching for on Google and where does it lead them to on my site?

select ts, request_url, referring_url
from parsed_logs
where referring_url ilike '' and
      extract(year from ts) = '2011'
order by ts desc
limit 10;

           ts           |               request_url               | referring_url
------------------------+-----------------------------------------+---------------
 2011-06-13 21:36:57-04 | /classes/commit/fft-factoring.pdf       |
 2011-06-13 19:55:26-04 | /classes/commit/fft-factoring.pdf       |
 2011-06-13 02:58:03-04 | /classes/mechatronics/ion-generator.pdf |
 2011-06-10 20:35:54-04 | /classes/commit/fft-factoring.pdf       |
 2011-06-10 18:29:56-04 | /favicon.ico                            |
 2011-06-01 14:54:13-04 | /classes/mechatronics/ion-generator.pdf |
 2011-06-01 09:36:51-04 | /classes/commit/fft-factoring.pdf       |
 2011-05-31 16:41:56-04 | /classes/commit/fft-factoring.pdf       |
 2011-05-29 14:29:16-04 | /classes/commit/                        |
 2011-05-25 03:44:12-04 | /classes/commit/fft-factoring.pdf       |
(10 rows)

I could see the query terms peeking out of that mess, but it isn’t easy to analyze because the query string is URI-encoded within the referring URL. I thought it would be cool to programmatically pick out the query terms, and so I spent some time messing with an unsatisfactory regexp-based solution. Then I found out that Hieu, one of our intrepid interns this summer, had already made a URI decoder using our SDK and the uriparser library:

CREATE TRANSFORM FUNCTION UriExtractor
AS LANGUAGE 'C++' NAME 'UriExtractorFactory' LIBRARY ParserLib;

-- Extract the search terms from the URIs of Google searches

CREATE TABLE search_terms AS
SELECT request_url, value as search_term
FROM (SELECT request_url, UriExtractor(referring_url) OVER (PARTITION BY request_url)
      FROM search_referrals) AS sq
WHERE name = 'q';

SELECT * FROM search_terms LIMIT 10;

                 request_url                 | search_term
---------------------------------------------+-------------
 /                                           | andrew nerdnetworks
 /classes/6.033/cyberbeanie.pdf              | 6.033 cyberbeanie
 /classes/6.033/cyberbeanie.pdf              | Jerome H. Saltzer and M. Frans Kaashoek. 6.033 class notes
 /classes/6.033/cyberbeanie/cyberbeanie.html | 6.033 Bibliography Saltzer Computer systems
 /classes/6.033/cyberbeanie.html             | link_send
 /classes/6.033/cyberbeanie.html             | “Topics in the Engineering of Computer Systems”
 /classes/6.033/cyberbeanie.html             | Jerome H. Saltzer, M. Frans Kaashoek. Topics in the Engineering of Computer Systems. M.I.T. 6.033 class notes
 /classes/6.033/spank/spankster.html         | MITPerson
 /classes/6.033/spankster.pdf                | MITPerson
 /classes/6.033/spankster.pdf                | “chunk server”
(10 rows)

Note that some of the actual values in the above data have been changed to protect other people’s privacy.
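Outside the database, the same kind of extraction can be sketched with Python’s standard library; this is a simplified stand-in for the UriExtractor UDx, pulling the q= parameter from a made-up Google referrer URL.

```python
from urllib.parse import urlsplit, parse_qs

# Simplified stand-in for the in-database URI decoder: decode the query
# string of a referrer URL and pull out the q= search parameter.
def search_term(referring_url):
    query = urlsplit(referring_url).query
    terms = parse_qs(query).get("q")   # parse_qs also URI-decodes the value
    return terms[0] if terms else None

url = "http://www.google.com/search?q=6.033+cyberbeanie&hl=en"
print(search_term(url))  # 6.033 cyberbeanie
```

The UDx version has the advantage of running inside the query, so the decoded terms are immediately joinable against everything else in the database.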

Now I need to go back to my day job making Vertica better, but I truly do hope people are able to take the Apache log parser and quickly and easily find their own interesting insights.

Life Beyond Indices: The Query Benefits of Storing Sorted Data

With the Vertica Analytics Platform, there are a number of benefits to storing compressed and sorted data, as well as operating directly on compressed data, that we have discussed in previous posts. In this post, I’m going to discuss how the Vertica Analytics Platform takes advantage of this sorted data to make query execution blindingly fast, which obviates the need for traditional DBMS indexes.

Unlike traditional DBMS solutions, Vertica has no user-defined indexes. Indexes in an analytic database cost DBA time (figuring out which indexes to create), storage capacity, and load time (keeping them up to date). Even if an index consumes only 10% of the size of the original data and adds only 10% to load time, storing even a few indexes on terabytes of data is costly. As we have mentioned before, a true column store isn’t just a vertically partitioned row store.

How does Vertica query huge volumes without indexes? It’s easy: the data is sorted by column value, something we can do because we wrote both our storage engine and execution engine from scratch. We don’t store data in insert order, nor do we limit sorting to within a set of disk blocks. Instead, we have put significant engineering effort into keeping the data totally sorted during its entire lifetime in Vertica. It should be clear how sorted data increases compression ratios (by putting similar values next to each other in the data stream), but it might be less obvious at first how we use sorted data to increase query speed as well.

Let’s start with the simplest and easiest to understand example: the data is sorted the way a query requests it (ORDER BY). Consider a snippet of trading data sorted by stock and price (see Table 1).  If the user’s query requests all the data ordered by the stock and price, they might use something like:

SELECT stock, price FROM ticks ORDER BY stock, price;

Clearly, Vertica is off the hook for any runtime sort: the data is just read off disk (with perhaps some merging) and we are done.

Table 1: Illustration of data sorted on (stock, price). Other columns are omitted for clarity.

A more interesting query might ask for a single stock’s data ordered by price:

SELECT stock, price FROM ticks WHERE stock='IBM' ORDER BY price;

Finding rows in storage (disk or memory) that match stock=’IBM’ is quite easy when the data is sorted, simply by applying your favorite search algorithm (no indexes are required!). Furthermore, it isn’t even necessary to sort the stock=’IBM’ rows because the predicate ensures the secondary sort becomes primary within the rows that match as illustrated below:

Table 2: when only rows that match stock=’IBM’ are considered, the results are ordered by price, and thus no additional sorting is required.
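Searching sorted storage can be illustrated with a few lines of Python: two binary searches find the boundaries of the matching range, and because price is the secondary sort key, the resulting slice is already ordered. (A toy illustration, not Vertica’s storage code.)

```python
import bisect

# Columns totally sorted on (stock, price), as in Table 1.
stock = ["A", "A", "HPQ", "HPQ", "HPQ", "IBM", "IBM", "ORCL"]
price = [ 10,  11,  100,   102,   103,   100,   103,   50  ]

# Matching stock='IBM' is two binary searches for the range boundaries;
# no index structure is needed.
lo = bisect.bisect_left(stock, "IBM")
hi = bisect.bisect_right(stock, "IBM")

# Because price is the secondary sort key, the slice is already ordered
# by price: no runtime sort for the ORDER BY.
matching_prices = price[lo:hi]
print(lo, hi, matching_prices)  # 5 7 [100, 103]
```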

Next, let us consider a query that computes the average price for each stock symbol:

SELECT stock, avg(price) FROM ticks GROUP BY stock;

In general, the aggregator operator does not know a priori how many distinct stocks there are, nor in what order they will be encountered. One common approach to computing the aggregation is to keep some sort of lookup table in memory with the partial aggregates for each distinct stock. When a new tuple is read by the aggregator, its corresponding row in the table is found (or a new one is made) and the aggregate is updated, as shown below:

Table 3: Illustration of aggregation when data is not sorted on stock. The aggregator has processed the first 4 rows: It has updated HPQ three times with 100, 102 and 103 for an average of 101.66, and it has updated IBM once for an average of 100. Now it encounters ORCL and needs to make a new entry in the table.

This scheme, often denoted as “Group By Hash” because a hash table is used as the lookup data structure, does a good job when there are a small number of groups. However, when there are a large number of groups, it takes significant RAM to store the hash table and provisions need to be made when RAM is exhausted (typically by spilling to disk).
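A minimal Python sketch of group-by-hash aggregation (illustrative, not Vertica’s implementation):

```python
# "Group By Hash" sketch: partial aggregates keyed by group value. Memory
# grows with the number of distinct groups, regardless of input order.
def group_by_hash(rows):
    partial = {}  # stock -> (running sum, count)
    for stock, price in rows:
        s, c = partial.get(stock, (0, 0))
        partial[stock] = (s + price, c + 1)
    # Averages can only be finalized once all input has been seen.
    return {stock: s / c for stock, (s, c) in partial.items()}

rows = [("HPQ", 100), ("IBM", 100), ("HPQ", 102), ("HPQ", 103), ("ORCL", 50)]
print(group_by_hash(rows))
```

Note that no output can be produced until the entire input is consumed, and the `partial` table holds one entry per distinct group for the whole run.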

With Vertica, a second type of aggregation algorithm is possible because the data is already sorted, so every distinct stock symbol appears together in the input stream. In this case, the aggregator can easily find the average stock price for each symbol while keeping only one intermediate average at any point in time. Once it sees a new symbol, the same symbol will never be seen again and the current average may be generated. This is illustrated below:

Table 4: Illustration of aggregation when data is sorted on stock. The aggregator has processed the first 7 rows. It has already computed the final averages of stock A and of stock HPQ and has seen the first value of stock IBM resulting in the current average of 100. When the aggregator encounters the next IBM row with price 103 it will update the average to 101.5. When the ORCL row is encountered the output row IBM,101.5 is produced.

This scheme, commonly called “one pass aggregation,” exhibits pipelined parallelism (the same concept as instruction pipelining) when the data is already sorted by stock. This means we can start producing tuples for downstream operators to consume almost immediately. Given that Vertica execution is multi-threaded and all modern machines have multiple cores, pipelined parallelism decreases query execution time.
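One-pass aggregation is equally easy to sketch; the generator below yields each finished group immediately, which is exactly what enables pipelining (again an illustration, not Vertica’s implementation):

```python
# "One pass" aggregation sketch: because input arrives sorted by stock, each
# group's average can be emitted as soon as the next group begins, holding
# only one partial aggregate in memory at a time.
def one_pass_avg(sorted_rows):
    current, total, count = None, 0, 0
    for stock, price in sorted_rows:
        if stock != current:
            if current is not None:
                yield current, total / count  # group finished: emit immediately
            current, total, count = stock, 0, 0
        total += price
        count += 1
    if current is not None:
        yield current, total / count          # emit the final group

rows = [("A", 10), ("HPQ", 100), ("HPQ", 102), ("HPQ", 103),
        ("IBM", 100), ("IBM", 103)]
print(list(one_pass_avg(rows)))
```

Wrapping the results in `list()` hides the pipelining here, but a downstream operator iterating the generator would receive `("A", 10.0)` before the HPQ rows had even been read.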

Of course, one-pass aggregation is used in other systems (often called SORT GROUP BY), but they require a runtime sort of the data by stock. Forcing a sort before the aggregation costs execution time, and it prevents pipelined parallelism because the sort must see all the tuples before any can be sent on. Using an index is also a possibility, but that requires more I/O: first to read the index and then to read the actual values. That is a reasonable approach for systems not designed for reporting, such as those designed for OLTP, but for analytic systems that often handle queries with large numbers of groups, it is a killer.

I hear you asking: what kinds of real-world queries aggregate large numbers of groups? There are at least two very common scenarios among our customers: distinct counts, and correlated subqueries with aggregation that have been flattened into joins. Our web analytics customers typically have queries that look for distinct visitors given some condition, such as:

SELECT count(DISTINCT visitor_id) FROM user_sessions WHERE <filtering predicates>;

The applicability of one-pass aggregation can be seen if we rewrite the query to an equivalent form:

SELECT COUNT(sq.visitor_id) FROM (SELECT visitor_id FROM user_sessions WHERE <filtering predicates> GROUP BY visitor_id) AS sq;

The rewritten form is amenable to the same “group by pipeline” optimization when the data is sorted on visitor_id. As you are probably glazing over at this point, I will postpone further technical discussion of flattened subqueries for a future post if there is sufficient interest.

Another area where having pre-sorted data helps is the computation of SQL-99 analytics. We can optimize the PARTITION BY clause in a manner very similar to GROUP BY when the partition keys are sequential in the data stream. We can also optimize the analytic ORDER BY clause similarly to the normal SQL ORDER BY clause.

The final area to consider is merge join. This is not a new idea: other database systems typically have sort-merge join, whereby a large join is performed by first sorting the data from both input relations on the join keys. Since Vertica already has the data sorted, it is often possible to skip the costly sort and begin the join right away.
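A merge join over two pre-sorted inputs can be sketched in a few lines of Python (a simplified illustration that assumes unique keys on the right side):

```python
# Merge-join sketch: with both inputs already sorted on the join key, the
# join is a single coordinated scan with no runtime sort.
def merge_join(left, right, key=lambda row: row[0]):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) < key(right[j]):
            i += 1
        elif key(left[i]) > key(right[j]):
            j += 1
        else:
            # Emit all pairings for this key (assumes right-side keys are
            # unique for brevity; real merge joins handle duplicates on both
            # sides).
            k = key(left[i])
            while i < len(left) and key(left[i]) == k:
                out.append(left[i] + right[j][1:])
                i += 1
            j += 1
    return out

ticks = [("HPQ", 100), ("HPQ", 102), ("IBM", 100)]         # sorted on stock
names = [("HPQ", "Hewlett-Packard"), ("IBM", "IBM Corp")]  # sorted on stock
print(merge_join(ticks, names))
```

Each input is read exactly once, in order, so the join can also pipeline its output to downstream operators.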

Since sorting is such a fundamental part of our system, we have built sophisticated infrastructure in the Vertica Optimizer to track the sortedness of various intermediate results. Our infrastructure takes into account that some columns are equivalent after joining, that some columns have had constant predicates, that some expressions (e.g. price * 100) maintain sortedness, and a host of other factors. By keeping careful track, we maximize the opportunities to apply the optimizations shown above, all without any additional storage.

Of course, Vertica is not limited to a single sort order for each table. In fact, if redundant copies of the data need to be stored to survive node failures, the different copies can be stored with different sort orders. Different sort orders further increase the chance that we can apply one of our sort-based optimizations. And lest you think we have simply swapped one DBA headache (choosing indexes) for another (choosing sort orders), the optimal sort order of the physical storage is typically determined automatically by the Vertica Database Designer!

If anyone wants me to spell out a specific topic in more detail leave a comment below and let me know!

