Vertica

Author Archive

Reports of SQL’s Death Are Greatly Exaggerated

Apache Log Analysis in Vertica
.

I am a proud new father and of course the first thing I did was to put pictures of my daughter online on our webserver because I wanted to see who had been looking at them. I could have gone the Google Analytics route, but being a geek I wanted to explore the data myself rather than just get a static report.

Before Vertica, for this kind of analysis I probably would have written a perl script (because it has the best regexp support of any language I know), but as soon as I start doing anything more complicated than summarizing, things get ugly quickly. Specifically I wanted to group the web log entries into sessions (“sessionize”) to analyze visits rather than page views. According to the interwebs, it seems Hadoop is often used to do this kind of analysis, but that still requires a program to compute the statistics of interest, though it can distribute that computation across many machines.

Since working at Vertica, I have become convinced that SQL is an excellent language for this kind of analysis — it allows one to easily express declaratively what is painful to express programatically (e.g. COUNT distinct). An often cited problem of SQL based analysis is that you need to get your data into a database first by writing a load script that parses your logs into some structured table format. Far from impossible (and any analysis needs to put the logs into a structured format) but annoying as the parsing code (the script) and structure definition (SQL DDL) aren’t bound together.

Recently, I have been working at Vertica on our extensibility mechanism to extend our database from within. So I selfishly used my desire to analyze my own web logs to justify writing an example of parsing Apache logs inside the database. On a recent cross country flight I whipped up a simple Apache log parser (now included as an example in our SDK in 5.0 – if you try it out let me know what you think!). The hardest part of the parser was dealing with the Apache log format (which for some reason changed sometime since the first batch of logs I have from 2005).

Armed with the log parser, the analysis of who was looking at my daughter’s pictures became pretty easy. Furthermore, because I had access to the raw log data in a database, I ended up finding several other fascinating patterns that I hadn’t specifically set out to find. The more I see, the more convinced I am that Hal Varian has it right and that data analysis will be the sexy job of the next decade.

The analysis steps were simple:

  • • rsync logs from my web server to my laptop
  • • Get logs into Vertica with straightforward SQL:

CREATE TABLE raw_logs
(filename VARCHAR(500),raw_log varchar(4000))
SEGMENTED BY HASH(filename) ALL NODES;

COPY raw_logs(filename as ‘access_log’ , raw_log) FROM
‘/home/alamb/access_log’
DELIMITER E’\1′\”"; — avoid field parsing on tabs

  • • Install and run the log parser code

– Install the parser code
CREATE LIBRARY ParserLib AS ‘/tmp/ApacheParser2.so’;
CREATE TRANSFORM FUNCTION ApacheParser
AS LANGUAGE ‘C++’ NAME ‘ApacheParserFactory’ LIBRARY ParserLib;

– Parse the logs into a new table
CREATE TABLE parsed_logs AS
SELECT filename, ApacheParser(raw_log) OVER (PARTITION BY filename)
FROM raw_logs;

Voila! Now I have a structured table with one row per log entry (i.e. file served by the server) and one column per logical log field. It is now a simple task to collect the clicks into sessions (see Sessionize with Style)

CREATE TABLE parsed_sessions as
SELECT
..*,
..CONDITIONAL_TRUE_EVENT(ts – LAG(ts) > ’30 seconds’)
….OVER (PARTITION BY ip ORDER BY ts) || replace(ip,’.',”) as session_id
FROM parsed_logs;

Now I am ready to ask questions like:

How many sessions, ips, clicks and total bytes were served for my daughter’s pages?

select
..count(distinct session_id) as session_count,
..count(distinct ip) as ip_count,
..count(*) as total_click_count,
..sum(response_size)/(1024*1024) as Mbytes
from parsed_sessions
where extract(year from ts) = ’2011′ and username = ‘changed’;

session_count -| ip_count | total_click_count | Mbytes
________________________________________________________
oooooooooo 313 |ooooo 162 |oooooooooooo 11151 | 7353.86
(1 row)

Who looked at the most pictures?

select
..max(ts) max_ts,
..count(*) as click_count
from parsed_sessions
where extract(year from ts) = ’2011′ and username = ‘changed’
group by ip, session_id
order by click_count desc
limit 10;

ooooooo max_ts ooooooo | click_count
_______________________|_____________
2011-04-10 08:11:43-04 | ooooooo 294
2011-04-12 11:53:51-04 | ooooooo 197
2011-04-12 09:22:20-04 | ooooooo 191
2011-04-12 06:17:36-04 | ooooooo 184
2011-04-10 10:46:36-04 | ooooooo 171
2011-04-18 11:18:52-04 | ooooooo 167
2011-04-10 14:47:31-04 | ooooooo 160
2011-04-12 13:35:12-04 | ooooooo 159
2011-04-12 18:15:56-04 | ooooooo 157
2011-04-14 10:04:10-04 | ooooooo 153
(10 rows)

So now I was curious: Who where those top 10 clickers? At this point, querying the raw data (as opposed to an aggregated report) was super helpful.

select distinct cnt_rnk, ps.session_id, ip
from parsed_sessions ps JOIN click_counts cc USING (session_id)
where extract(year from ts) = ’2011′ and
oooo username = ‘changed’ and
ooooo ps.session_id IN (select session_id from click_counts where cnt_rnk <= 5)
order by cnt_rnk;

Without divulging any actual of the actual results (privacy, you know), it turns out 7 of the top 10 were my wife and I, one was my mother in law, one was a family friend and one is still a mystery which I am looking into.

As I was poking around, I noticed another interesting pattern that I wasn’t specifically looking for: a lot of requests came in from Google searches. So my next logical query was:

What are people searching for on Google and where does it lead them to on my site?

select ts,
request_url,
referring_url
from parsed_logs
where referring_url ilike ‘%google.com%’ and
ooooo extract(year from ts) = ’2011′
order by ts desc
limit 10;

ooooooooo ts ooooooooo | oooooooooo request_url oooooooooo | referring_url
_______________________|___________________________________|____________________________
2011-06-13 21:36:57-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/search?q=dft+math+using+matrices&hl=en&prmd=ivnsfd&ei=VLn2TdLSLsnr0gHC9MiMCw&start=30&sa=N&biw=1221&bih=812
2011-06-13 19:55:26-04 | /classes/commit/fft-factoring.pdf | http://www.google.com.sa/search?sourceid=navclient&aq=0&oq=fft+matrix+factoriz&ie=UTF-8&rlz=1T4RNSN_enSA402SA402&q=fft+matrix+factorization
2011-06-13 02:58:03-04 | /classes/mechatronics/ion-generator.pdf | http://www.google.com/search?client=ubuntu&channel=fs&q=airflow+detector&ie=utf-8&oe=utf-8
2011-06-10 20:35:54-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=4&ved=0CC8QFjAD&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=fft%20matrix%20decomposition&ei=WrjyTdqLKYP2tgPRw_i7Cw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
2011-06-10 18:29:56-04 | /favicon.ico | http://www.google.com/search?hl=en&safe=off&client=iceweasel-a&rls=org.mozilla%3Aen-US%3Aunofficial&q=DFT+identities&aq=f&aqi=&aql=t&oq=
2011-06-01 14:54:13-04 | /classes/mechatronics/ion-generator.pdf | http://www.google.com/search?q=ion+%22measure+airflow%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
2011-06-01 09:36:51-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=5&ved=0CDYQFjAE&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=dft%20factorization&ei=akDmTcHBCILw0gGgofCeCw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
2011-05-31 16:41:56-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=9&ved=0CFYQFjAI&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=fft%20matrix%20notation&ei=fVLlTZ76Nc-A-waBmKHyBg&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A&sig2=4jpsx3kmRWJilGnX0VOaAg
2011-05-29 14:29:16-04 | /classes/commit/asplos.ps | http://www.google.com/url?sa=t&source=web&cd=4&ved=0CCsQFjAD&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Fasplos.ps&rct=j&q=naive%20partitioning%20in%20stream%20graph&ei=tY7iTfzGOpC5hAfFspnzBw&usg=AFQjCNHNNERPnNXT7aFGvfaJB3bjpK5Oxg&cad=rja
2011-05-25 03:44:12-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=14&ved=0CC4QFjADOAo&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=DERIVATION%20OF%20DFT&ei=L7PcTZ7zL4q_0AH007C_Dw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
(10 rows)

I could see the query terms peeking out of that mess, but it isn’t easy to analyze because the query string is URI encoded within the referring url. I thought it would be cool to programmatically pick out the query terms, and so I spent some time messing with an unsatisfactory regexp based solution. Then I found out that Hieu, one of our interepid interns this summer, had already made a URI decoder using our SDK and the uriparser library:

CREATE TRANSFORM FUNCTION UriExtractor
AS LANGUAGE ‘C++’ NAME ‘UriExtractorFactory’ LIBRARY ParserLib;

Extract the search terms from the URIs of Google searches

CREATE table search_terms
AS
SELECT request_url, value as search_term
FROM
..(SELECT request_url, UriExtractor(referring_url) OVER (PARTITION BY request_url) FROM search_referrals ) AS sq
WHERE sq.name = ‘q’;

SELECT * FROM search_terms LIMIT 10;

ooooooooooooooo request_url ooooooooooooooo | search_term
____________________________________________|_______________________________________
/ ooooooooooooooooooooooooooooooooooooooooo | andrew nerdnetworks
/classes/6.033/cyberbeanie.pdf oooooooooooo | 6.033 cyberbeanie
/classes/6.033/cyberbeanie.pdf oooooooooooo | Jerome H. Saltzer and M. Frans Kaashoek. 6.033 class notes
/classes/6.033/cyberbeanie/cyberbeanie.html | 6.033 Bibliography Saltzer Computer systems
/classes/6.033/cyberbeanie.html ooooooooooo | link_send
/classes/6.033/cyberbeanie.html ooooooooooo | “Topics in the Engineering of Computer Systems”
/classes/6.033/cyberbeanie.html ooooooooooo | Jerome H. Saltzer, M. Frans Kaashoek. Topics in the Engineering of Computer Systems. M.I.T. 6.033 class notes
/classes/6.033/spank/spankster.html ooooooo | MITPerson
/classes/6.033/spankster.pdf oooooooooooooo | MITPerson
/classes/6.033/spankster.pdf oooooooooooooo | “chunk server”
(10 rows)

Note that some of the actual values in the above data have been changed to protect other people’s privacy.

Now I need to go back to my day job making Vertica better, but I truly do hope people are able to take the Apache log parser and quickly and easily find their own interesting insights.

Life Beyond Indices: The Query Benefits of Storing Sorted Data

With the Vertica Analytics Platform, there are a number of benefits to storing compressed and sorted data, as well as operating directly on compressed data, that we have discussed in previous posts. In this post, I’m going to discuss how the Vertica Analytics Platform takes advantage of this sorted data to make query execution blindingly fast, which obviates the need for traditional DBMS indexes.

Unlike traditional DBMS solutions, Vertica has no user-defined indexes. Indexes in an analytic database take up DBA time (figuring out which indexes to make), storage capacity, and load time (to keep them up to date). Even if an index consumes only 10% of the size of the original data and takes 10% more time during load, storing even a few indexes on terabytes of data is costly.  As we have mentioned before, a true column store isn’t just a vertically-partitioned row store.

How does Vertica query huge volumes without indexes? It’s easy… the data is sorted by column value, something we can do because we wrote both our storage engine and execution engine from scratch. We don’t store the data by insert order, nor do we limit sorting to within a set of disk blocks. Instead, we have put significant engineering effort into keeping the data totally sorted during its entire lifetime in Vertica. It should be clear how sorted data increases compression ratios (by putting similar values next to each other in the data stream), but it might be less obvious at first how we use sorted data to increase query speed as well.

Let’s start with the simplest and easiest to understand example: the data is sorted the way a query requests it (ORDER BY). Consider a snippet of trading data sorted by stock and price (see Table 1).  If the user’s query requests all the data ordered by the stock and price, they might use something like:

SELECT stock, price FROM ticks ORDER BY stock, price;

Clearly Vertica is off the hook to do any sort at runtime: data is just read off disk (with perhaps some merging) and we are done.

Table 1: Illustration of data sorted on (stock, price). Other columns are omitted for clarity.

A more interesting query might ask for a single stock’s data ordered by price:

SELECT stock, price FROM ticks WHERE stock=’IBM’ ORDER BY price;

Finding rows in storage (disk or memory) that match stock=’IBM’ is quite easy when the data is sorted, simply by applying your favorite search algorithm (no indexes are required!). Furthermore, it isn’t even necessary to sort the stock=’IBM’ rows because the predicate ensures the secondary sort becomes primary within the rows that match as illustrated below:

Table 2: when only rows that match stock=’IBM’ are considered, the results are ordered by price, and thus no additional sorting is required.

Next, let us consider a query that computes the average price for each stock symbol:

SELECT stock, avg(price) FROM ticks GROUP BY stock;

In general, the aggregator operator does not know a priori how many distinct stocks there are nor in what order that they will be encountered. One common approach to computing the aggregation is to keep some sort of lookup table in memory with the partial aggregates for each distinct stock. When a new tuple is read by the aggregator, its corresponding row in the table is found (or a new one is made) and the aggregate is updated as shown below:

Table 3: Illustration of aggregation when data is not sorted on stock. The aggregator has processed the first 4 rows: It has updated HPQ three times with 100, 102 and 103 for an average of 101.66, and it has updated IBM once for an average of 100. Now it encounters ORCL and needs to make a new entry in the table.

This scheme, often denoted as “Group By Hash” because a hash table is used as the lookup data structure, does a good job when there are a small number of groups. However, when there are a large number of groups, it takes significant RAM to store the hash table and provisions need to be made when RAM is exhausted (typically by spilling to disk).

With Vertica, a second type of aggregation algorithm is possible because the data is already sorted, so every distinct stock symbol appears together in the input stream. In this case, the aggregator can easily find the average stock price for each symbol while keeping only one intermediate average at any point in time. Once it sees a new symbol, the same symbol will never be seen again and the current average may be generated. This is illustrated below:

Table 4: Illustration of aggregation when data is sorted on stock. The aggregator has processed the first 7 rows. It has already computed the final averages of stock A and of stock HPQ and has seen the first value of stock IBM resulting in the current average of 100. When the aggregator encounters the next IBM row with price 103 it will update the average to 101.5. When the ORCL row is encountered the output row IBM,101.5 is produced.

This scheme, commonly called “one pass aggregation“ has pipelined parallelism (the same concept as instruction pipelining) if the data is already sorted according to stock. This means we can start producing tuples for downstream operators to consume almost immediately. Given that the Vertica execution is multi-threaded, and all modern machines have multiple cores, pipelined parallelism decreases query execution time.

Of course, one pass aggregation is used in other systems (often called SORT GROUP BY), but they require a sort at runtime to sort the data by stock. Forcing a sort before the aggregation costs execution time and it prevents pipelined parallelism because all the tuples must be seen by the sort before any can be sent on. Using an index is also a possibility, but that requires more I/O, both to get the index and then to get the actual values. This is a reasonable approach for systems that aren’t designed for reporting, such as those that are designed for OLTP, but for analytic systems that often handle queries that contain large numbers of groups it is a killer.

I hear you ask what kind of real-world queries aggregate large numbers of groups? There are at least two very common scenarios that our customers encounter: distinct counts and correlated subqueries with aggregation that have been flattened into joins. Our web analytics customers typically have queries that look for distinct visitors given some condition such as:

SELECT count(DISTINCT visitor_id) FROM user_sessions WHERE <filtering predicates>;

The applicability of one-pass aggregation can be seen if we rewrite the query to an equivalent form:

SELECT COUNT(sq.visitor_id) from (select visitor_id FROM user_sessions WHERE <filtering predicates> GROUP BY visitor_id) as sq

And as such is amenable to the same “group by pipeline” optimization of data sorted on visitor_id. As you are probably glazing over at this point, I will postpone further technical discussion of flattened subqueries for a future discussion if there is sufficient interest.

Another area where having pre-sorted data helps is the computation of SQL-99 analytics. We can optimize the PARTITON BY clause in a manner very similar to GROUP BY when the partition keys are sequential in the data stream. We can also optimize the analytic ORDER BY clause similarly to the normal SQL ORDER BY clause.

The final area to consider is Merge-Join. Of course this is not a new idea, but other database systems typically have Sort-Merge-Join, whereby a large join can be performed by pre-sorting the data from both input relations according to the join keys. Since Vertica already has the data sorted, it is often possible to skip the costly sort and begin the join right away.

Since sorting is such a fundamental part of our system, we have built sophisticated infrastructure in the Vertica Optimizer to track the sortedness of various intermediate results. Our infrastructure takes into account that some columns are equivalent after joining, that some columns have had constant predicates, that some expressions (e.g. price * 100) maintain sortedness, and a host of other factors. By keeping careful track, we maximize the opportunities to apply the optimizations shown above, all without any additional storage.

Of course, Vertica is not limited to a single sort order for each table. In fact, if redundant copies of the data need to be stored to survive node failures, the different copies can be stored with different sort orders. Different sort orders furthers the chance that we can apply one of our sort-based optimizations. And lest you think we are simply swapping determining sort order for determining indexes for a new DBA headache, the optimal sort order of the physical storage is typically automatically determined by the Vertica Database Designer!

If anyone wants me to spell out a specific topic in more detail leave a comment below and let me know!

Andrew

Tales from a Cocktail Party: How Customers Use Vertica

On Wednesday, I gave one of the lightning talks at the 4th Extremely Large Databases Conference (XLDB) at Stanford University about how real customers use Vertica to store and analyze their ‘extremely large’ databases. The 5 minute format is tough.  I had to pick only three customers   but I hope I was able to get across how cool the things they are doing with Vertica are.

Given the proximity to San Francisco, it occurs to me that Vertica is basically in the same business that Levi Strauss was in during the 19th century gold rush: selling supplies to a rapid and lucrative market where there is lots of money to be made. Our modern day gold rush is between businesses to see who can use the massive amounts of data they collect to make the best X possible, where X is anything from ‘television distribution network’ to ‘massive online game’ and everything in between.

Given that we are in the system software business, we sell a (vital) component of an end product and thus I largely don’t see complete applications. I most typically see the SQL that those applications generate, and on most days I am focused on the nitty-gritty details of making it work as fast as possible.  Sometimes, though, I get to step back and see what is really being done with our products and it’s usually quite cool.

So, in this blog post, I wanted to give you some cool cocktail party tidbits about what our customers do with their data…

One telecom company told me that (unsurprisingly) most of those 500 channels on your cable box never get watched.  So most of the time, the company doesn’t waste their network’s bandwidth sending all of the channels to your house.  Instead, they use switched video and then use Vertica to analyze the patterns of who watches what and when in every local distribution group, so that they can make network planning decisions like “how much switched video capacity do we really need for those 100 subscribers?”

Cocktail item: This telco has an (anonymized) record of what station each television box is tuned to for every minute of every day.

Another of our customers analyzes SMS message content for big mobile service providers. To be honest, I don’t know exactly what they are doing with this information, but you might imagine they have interesting data and interesting conclusions, and they use Vertica to parse this data.

Cocktail item: Question: What is the most common text message? Answer: The single letter ‘k.’

My aunt is a devoted online farmer, but I doubt she realizes how much technical firepower is used to analyze her online buying habits. Zynga, the developer of FarmVille, Mafia Wars and a number of other popular online games, has a massive Vertica cluster that they use to analyze past in-game decisions with the goal of making their games more fun to play.  This means you will spend more time playing the games, and hopefully spending more money with Zynga.

Cocktail Item: Zynga’s revenue is massive, estimated by various sources to be $200M or greater in 2009, and also reportedly on track for a killer 2010.  Who would have thought virtual farming or playing a gangster would be so lucrative?

Vertica’s customers are from a wide range of industries – gaming, telecommunications, financial, healthcare, and more – but they share one thing in common.  They all use Vertica to analyze massive amounts of data in real time, converting it into usable information that helps drive business decisions.  And now, I feel like a cocktail!

There Will Come Soft Rains

I am an early riser, especially on the weekends in summer when the sun rises early. This means I get 2-3 hours of time to myself before anyone else

is awake. Much to my wife’s chagrin, I often spend the time catching up on Vertica email and tasks that didn’t quite get done during the week. This is one of the only times where I can write email without getting semi-immediate responses back.

This morning, I did get several emails, though from our automated performance test systems, which reminded of Ray Bradbury’s short story “There Will Come Soft Rains” in which the humans are gone, but their automated systems still remain.

Of course the story’s main points aren’t about software automation early morning emails, but sitting here in the quiet before the day wakes, it seems somehow appropriate.

The automation systems we have at Vertica please me. Of course I am biased — I mostly wrote what we call “ServerFarm” which runs regression tests regularly (every few checkins) as well as allows developers to run tests on demand on a farm of machines. This system came about like most good productivity enhancing systems do: I was annoyed. Specifically I was annoyed of having to run our 1 hour long test suite on my laptop before checking in. I had taken to saving patches for my changes and running them manually at night so that I didn’t tie up my development machine during the day. ServerFarm started out as a way for me to automate that process and has grown from there.

Automated and regular regression testing is, of course, software best practice, but it is cool to see it working well at Vertica. By lowering the bar for using ServerFarm, even the crustiest of our developers now uses the Farm. The one thing that could be better is reporting of the results, especially over time. Thankfully we have a clever intern this summer helping us out with that. I am totally psyched to see what he comes up with.

Vertica Under the Hood: The Query Optimizer

As we bring our 4.0 release to market, we are starting a series of educational blog posts to provide a in-depth look at Vertica’s core technology. We start with one of our crown jewels – the Vertica Query Optimizer.

The goal of query optimizers in general is to allow users to get the maximal performance from their database without worrying about details of how it gets done.  At Vertica, we take this goal to heart in everything that we build.  From the first day starting out, the Vertica Optimizer team has focused on creating a product that reduces the need for manual tuning as much as possible.  This lets users focus on their business needs rather than tuning our technology.

Before we dive into the unique innovations within our optimizer, let’s get a few simple facts straight:

  • The Vertica Optimizer is not limited to classic Star and Snowflake Schemas – it hasn’t been since version 2.5. Many of our 130+ customers in production today are using non-star schemas with great success.  In fact, our Optimizer easily handles very complicated queries – from workloads as simple as TPCH, containing only relatively simple Star queries with a few tables, to complex queries containing hundreds of joins with mixes of INNER/OUTER joins and a variety of predicates and sub-queries.
  • It is not common and certainly not necessary to have one projection per query to get great performance from Vertica. While the Optimizer understands and chooses the optimal plan in the presence of several choices, few customers have found it necessary to do custom tuning for individual queries except in very unusual circumstances.  It is far more typical to have great performance without such tuning at all.
  • The Vertica Optimizer is the only true columnar optimizer developed from scratch to make best use of a column store engine.  Unlike some other column store vendors, we do not use any part of the Postgres optimizer.

Why? Because fundamentally, we believe that no amount of retrofitting can turn a row-oriented optimizer into a column-oriented one.

For the optimizer geeks out there, here are some of the capabilities that we believe give the Vertica Optimizer that special edge over others, even mature ones:

  • The entire Optimizer is designed as a set of extensible modules so that we can change the brains of the optimizer without rewriting much of the code. This means we can incorporate knowledge gleaned from end-user experiences into the Optimizer, without a lot of engineering effort.  After all, when you build a system from scratch, you can build it smarter and better!
  • Unlike standard optimizers that determine the optimal single-node plan and then introduce parallelizing operators into it as an after thought, our patent-pending optimizer algorithms account for data distribution during the join order enumeration phase of the optimizer. We use sophisticated heuristics based on knowledge of physical properties of the available projections to control the explosion in search space.
  • Unlike standard optimizers that restrict the join search space to left-deep plans, the Vertica Optimizer considers bushy plans very naturally.
  • The Vertica Optimizer is cost-based with a cost-model based not just on I/O but also CPU and Network transfer costs and takes into account the unique details of our columnar operators and runtime environment.
  • The Vertica Optimizer employs many techniques that take advantage of the specifics of our sorted columnar storage and compression – for example, late materialization, compression aware costing and planning, stream aggregation, sort elimination, merge joins, etc.
  • The Vertica Database designer works hand-in-glove with the optimizer by producing a physical design that can take advantage of the many clever optimizations available to the optimizer.

While innovating on the core algorithms, we have also incorporated many of the best practices developed over the past 30 years of optimizer research, such as:

  • Using histograms to calculate selectivity.
  • Optimizing queries to favor co-located joins where possible.  Note that optimizer can handle physical designs with arbitrary distribution properties and uses distribution techniques such as re-segmented or broadcast joins.
  • Transformations such as converting outer joins to inner joins, taking advantage of primary/foreign key and null constraints, sub-query de-correlation, view flattening, introducing transitive predicates based on join keys and automatically pruning out unnecessary parts of the query.

As a testament to the quality of our optimizer, we are proud to say that customers rarely override the plans produced by our optimizer.  This removes an entire class of management from the DBA and letting our algorithms take full advantage of our ever-improving execution engine. That being said, we believe that performance and ease-of-use speak for themselves and so we invite you to Test Drive the Vertica Database on your schema, your queries and your data!