
Physical Design Automation in the HP Vertica Analytic Database

Automatic physical database design is a challenging task. Different customers have different requirements and expectations, bounded by their resource constraints. To deal with these challenges in HP Vertica, we take a customizable approach that lets users tailor their designs for specific scenarios and applications. To meet these varied requirements, a physical database design tool should allow its users to trade off query performance against storage footprint for each application.

In this blog, we present a technical overview of the Database Designer (DBD), a customizable physical design tool that primarily operates under three design policies:

  • Load-optimized – DBD proposes the minimum required set of super projections (containing all columns) that permit fast loads and deliver the required fault tolerance.
  • Query-optimized – DBD may propose additional (possibly non-super) projections such that all workload queries are fully optimized.
  • Balanced – DBD proposes projections until it reaches the point where additional projections do not bring sufficient benefit in query optimization.

These options let users trade off query performance and storage footprint while taking update costs into account. The policies indirectly control the number of projections proposed to achieve the desired balance among query performance, storage, and load constraints.
In real-world environments, query workloads often evolve over time. A projection that was helpful in the past may not be relevant today and could be wasting space or slowing down loads. This space could instead be reused to create new projections that optimize current workloads. To cater to such workload changes, DBD operates in two different modes:

  • Comprehensive – DBD creates an entirely new physical design that optimizes for the current workload, retaining the parts of the existing design that are beneficial and dropping the parts that are not.
  • Incremental – Customers can optionally create additional projections that optimize new queries without disturbing the existing physical design. Use the incremental mode when workloads have not changed significantly. With no input queries, DBD optimizes purely for storage and load.
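
For readers who drive the Database Designer programmatically rather than through the interactive tools, the policy and mode above map onto DBD meta-functions. The calls below are a minimal sketch based on the 7.x meta-function names; exact names, argument values, and signatures vary by release, so check the documentation for your version.

=> SELECT DESIGNER_CREATE_DESIGN('sales_design');
-- Mode: 'COMPREHENSIVE' or 'INCREMENTAL' (assumed value names)
=> SELECT DESIGNER_SET_DESIGN_TYPE('sales_design', 'COMPREHENSIVE');
-- Policy: 'QUERY', 'LOAD', or 'BALANCED' (assumed value names)
=> SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('sales_design', 'BALANCED');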


The key challenges in projection design are picking appropriate column sets, sort orders, cluster data distributions, and column encodings that optimize query performance while reducing space overhead and allowing faster recovery. The DBD proceeds in two major sequential phases. During the query optimization phase, DBD chooses projection columns, sort orders, and cluster distributions (segmentation) that optimize query performance. DBD enumerates candidate projections after extracting interesting column subsets by analyzing the query workload for predicate, join, group-by, order-by, and aggregate columns. Run length encoding (RLE) is given special preference for columns appearing early in the sort order, because it benefits both query performance and storage. DBD then invokes the query optimizer for each workload query, presenting the candidate projections as choices. The query optimizer evaluates the query plans for all candidate projections, progressively narrowing the set of candidates until a stopping condition (based on the design policy) is reached. Query and table filters are applied during this process to drop queries that are already sufficiently optimized by the chosen projections, as well as tables that have reached the target number of projections set by the design policy. DBD's direct use of the optimizer's cost and benefit model guarantees that the two remain synchronized as the optimizer evolves over time.


During the storage optimization phase, DBD finds the best non-RLE column encoding schemes that achieve the smallest storage footprint for the designed projections, via a series of empirical encoding experiments on sample data. In addition, DBD creates the required number of buddy projections, which contain the same data but are distributed differently across the cluster, making the design tolerant to node-down scenarios. When a node is down, buddy projections are used to source the missing data. In HP Vertica, identical buddy projections (with the same sort orders and column encodings) enable faster recovery by allowing a direct copy of their physical storage structures, and DBD automatically produces such designs.

When DBD is invoked with an input set of workload queries, the queries are parsed and useful query metadata is extracted (e.g., the predicate, group-by, order-by, aggregate, and join columns). Design proceeds in iterations. In each iteration, one new projection is proposed for each table under design. Once an iteration is done, queries that have been optimized by the newly proposed projections are removed, and the remaining queries serve as input to the next iteration. If a design table has reached its target number of projections (determined by the design policy), it is not considered in future iterations, ensuring that no more projections are proposed for it. This process repeats until no design tables or design queries remain for which to propose projections.

To form the complete search space for enumerating projections, we identify the following design features in a projection definition:

  • Feature 1: Sort order
  • Feature 2: Segmentation
  • Feature 3: Column encoding schemes
  • Feature 4: Column sets (select columns)

We enumerate choices for features 1 and 2 above, and use the optimizer's cost and benefit model to compare and evaluate them (during the query optimization phase). Note that the choices made for features 3 and 4 typically do not affect query performance significantly. The winners decided by the cost and benefit model are then extended to full projections by filling out the choices for features 3 and 4, which have a large impact on load performance and storage (during the storage optimization phase).
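
To make these features concrete, here is a hand-written projection over a hypothetical sales table (illustrative only, not actual DBD output) with each design feature called out:

CREATE PROJECTION sales_store_dept
(
    store_code ENCODING RLE,  -- Feature 3: RLE favored on leading sort-order columns
    dept_code  ENCODING RLE,
    sale_qty                  -- Feature 4: only the columns the workload needs
)
AS SELECT store_code, dept_code, sale_qty
   FROM sales
   ORDER BY store_code, dept_code            -- Feature 1: sort order
   SEGMENTED BY HASH(store_code) ALL NODES   -- Feature 2: segmentation
   KSAFE 1;                                  -- identical buddy projection for node-down tolerance

The KSAFE 1 clause requests the identical buddy projection that enables the copy-based recovery described above.
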
In summary, the HP Vertica Database Designer is a customizable physical database design tool that works with a set of configurable input parameters that allow users to trade off query performance, storage footprint, fault tolerance and recovery time to meet their requirements and optionally override design features.

Building Bridges with Gumdrops and Toothpicks


In June 2014, the HP Vertica summer interns headed to the East End House in East Cambridge, MA to work with students through a community service project. Sarah Perkins, a business planner on the Project Management team, organized the project. Since 1875, the East End House has offered innovative programs to the community and continues to strive for excellence. Their programs help support families and individuals through curricula that enhance education standards. Programming supports the whole family with comprehensive services like a Food Pantry, Community Workshops, Parent Education and Senior Programming.

The interns, accompanied by mentors Sarah Lemaire and Jaimin Dave, helped students build bridges with very limited materials: fifty gumdrops and one hundred toothpicks! The goal was to build a bridge that spanned a six-inch gap and would hold at least 300 grams, or 120 pennies.

Teams of interns paired up to work with four students, ranging from third to eighth grade. They watched and assisted as the students discussed strategies, drew prototypes and started to build and re-build their structures. As the students worked on their bridges, they got to know more about the Vertica interns, including their majors, hometowns, and the projects they are working on for the summer. Throughout the course of the day, if students could correctly answer questions about their interns, they would win HP Vertica swag, including toy bulldozers, stress balls, flashlights, and more.

Once the bridges were built, the interns and students tested them across a six-inch gap. Students placed pennies on a paper plate on top of the bridge, one-by-one, until the bridge collapsed under the weight. The winning team’s bridge, led by interns Swikriti Jain and Jun Yin, held 255 pennies, which weigh more than 1 1/3 pounds! The top two teams won a bundle of HP Vertica swag, including t-shirts, water bottles, and baseballs.

The HP Vertica interns had a great time learning about students at the East End House, and helping them build successful bridges. It was a unique opportunity to interact with students of many ages, while also encouraging them to remain active in school and participate in extracurricular activities.

Can Big Data Analytics Save Our World?


If you ask Conservation International this question, they may just say yes. After all, Conservation International has teamed up with HP Earth Insights to provide organizations around the world, from environmentalists to policy makers, with a real-time look at what is happening within our planet's most valuable natural resource: the rain forest.

But how does their work relate to you as a start-up organization or a Fortune 500 company?
First, they have surprisingly similar analytical needs to many other start-ups and corporations, collecting data regularly from 16 sites around the globe, performing more than 4 million climate measurements as of this February, and managing more than 3 TB of biodiversity information. As the name implies, this information is incredibly, well… diverse, including everything from photos to hand-recorded measurements to weather station and camera trap imagery. While your company may not be recording and analyzing the metadata of candid photos of elephants or chimpanzees, chances are many of you out there are working with more than one type of data.

Collecting and Analyzing Multiple Data Types
All of these different data types have to be funneled into a database, analyzed, and then acted on. Running queries over millions of climate readings begins to look a lot like running queries over the diverse customer data many other companies deal with every day. Many agricultural companies collect sensor data from across their farmlands to forecast how the climate will affect their crops for the upcoming year. These days, utility companies are launching Advanced Metering Infrastructures (AMI) to deal with the staggering amounts of sensor data collected from the energy usage of millions of homes. HP Vertica happens to work well as a Meter Data Management (MDM) system (read more here).

Visualizing the Data and Reaching More People
Working with HP, Conservation International has built from the ground up its own analytics system and dashboard for visualizing data from all 16 rainforest sites around the globe. CI DBAs discover trends based on over 140 million simulations and analyze the metadata from over 1.7 million photos. Not only is their custom interface intuitive, it also lets them generate PDFs instantly and share to social media directly from the dashboard. For CI, this means more people now see more of their impact in more places, helping them proactively address environmental threats. For you, it might mean anything from less time spent prepping data to present to management to simply fewer emails to send.

The Power of Prediction for the Greater Good
Like many companies, CI uses standard methodology in processing their data and uses R for their analysis, as is very common in scientific studies. Using R, CI can proactively assess where future trouble spots will be and which parts of their monitored ecosystems are most threatened. Many other HP Vertica customers use R in surprisingly similar ways, such as predicting which neighborhoods a future power outage might affect most, or how serious next year's dry season will be for a farmer's crops.

See Conservation International at the HP Vertica Big Data Conference
These are just a few examples of how an incredibly unique organization uses HP Vertica to analyze unique data, yet does it in ways that many other groups might find surprisingly familiar. Sometimes after a closer look, we can see that many organizations have a lot more in common with their data needs than they may think, and HP Vertica is the right tool for the job.

Be sure to attend our upcoming Big Data Conference in Boston, MA, where Conservation International is leading the hackathon!

Tech Support Series: Optimizing for Deletes

This blog is just the first in a series that addresses frequently asked tech support questions. For now, we’ll talk about optimizing your database for deletion.

You may find that, from time to time, recovery and query execution slow down due to high volumes of delete vectors: performing a large number of deletes or updates can negatively affect query performance and recovery because of delete replay.

Delete replay occurs when ROS containers are merged together. The data marked for deletion in each of the ROS containers needs to be re-marked once the containers are merged. This can hold up your ETL processes because the Tuple Mover lock (T lock) is held until the replay deletes finish.

Luckily, optimizing your database for deletes can help speed up your processes. If you expect to perform a high number of deletes, first consider the reason for deletion. The following is a list of common reasons for high delete usage:

  • You regularly delete historical data and upload new data at specific intervals
  • You constantly update data or you want to delete data that was loaded by mistake
  • You often delete staging tables

To optimize your database for deletion, follow the suggestions that correspond to your reason for deletion.

  1. If you regularly delete historical data to make room for newer data, use partitioning to chunk data into groups that will be deleted together. For example, if you regularly delete the previous month’s data, partition data by month. When you use partitioning, you can use the DROP_PARTITION function to discard all ROS containers that contain data for the partition. This operation removes historical data quickly because no purging or replay deletes are involved.
  2. You may also want to delete a high volume of data because it was loaded by mistake or because you frequently update data (which involves frequently deleting data). In these cases, you may see a high volume of delete vectors. There are three good ways to keep them under control:
    1. Create delete-optimized projections by placing a high-cardinality column at the end of the sort order. This helps the replay delete process quickly identify the rows to be marked for deletion.
    2. Make sure your Ancient History Mark (AHM) is advancing and stays close to the Last Good Epoch (LGE) or Current Epoch. You may also want to periodically use the MAKE_AHM_NOW function to advance the ancient history mark to the greatest allowable value. When a mergeout occurs, all data marked for deletion before the AHM is purged, minimizing the amount of replay deletes.
    3. Periodically check the number of delete vectors in your tables using the DELETE_VECTORS system table. The automatic Tuple Mover will eventually purge deleted data, but if you find that your tables have a large number of delete vectors, you can manually purge records using the PURGE_TABLE function.
  3. You may find that you frequently delete staging tables. To streamline this process, truncate the staging table with the TRUNCATE TABLE statement instead of deleting its rows. Truncating a table discards the ROS containers that hold the data instead of creating delete vectors, and is therefore more efficient than deleting.

 
Frequently deleting data is often a cause of slow query performance. Fortunately, you can optimize your database for deletions with these tips and avoid the headache.

How to:

Drop a partition:

=> SELECT DROP_PARTITION('table_name', partition_value);

Get epoch:

=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM SYSTEM;

Set AHM to greatest allowable value:

=> SELECT MAKE_AHM_NOW();

Determine number of delete vectors:

=> SELECT * FROM v_monitor.DELETE_VECTORS;

Purge data:

=> SELECT PURGE_TABLE('table_name');
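
The following companion snippets sketch the remaining suggestions from the list above, using a hypothetical sales schema (table and column names are placeholders); adjust them to your own design.

Partition a fact table by month so old months can be dropped:

=> CREATE TABLE sales (
       sale_date  DATE NOT NULL,
       store_code INT,
       txn_id     INT,
       amount     NUMERIC(10,2)
   )
   PARTITION BY EXTRACT(YEAR FROM sale_date) * 100 + EXTRACT(MONTH FROM sale_date);

Create a delete-optimized projection (high-cardinality column last in the sort order):

=> CREATE PROJECTION sales_delete_opt AS
   SELECT sale_date, store_code, txn_id, amount
   FROM sales
   ORDER BY sale_date, store_code, txn_id   -- txn_id assumed to be high-cardinality
   SEGMENTED BY HASH(txn_id) ALL NODES;

Truncate a staging table instead of deleting its rows:

=> TRUNCATE TABLE staging_sales;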

Introducing HP Vertica “Dragline”

Today, we announced “Dragline,” the code name for the latest release of the HP Vertica Analytics Platform. Focused on the strategic value of all data to every organization, “Dragline” includes a range of industrial-strength features befitting its code name for serious Big Data initiatives.

Our data sheet provides detailed feature descriptions of this release’s full range of capabilities and benefits, so we’ll just focus on three top features that are sure to capture your attention (after all, they were highly requested by our growing customer base).

By the way, learn about these and all of the new “Dragline” features in our upcoming Webinar.


“Project Maverick” – Speed in All Directions


Speed is a given … but what about in every possible direction? “Project Maverick” has multiple elements, including fast, self-service analytics about individual customers or devices. Live Aggregate Projections, a key new feature of “Project Maverick,” accelerates these live lookups (frequent, highly concurrent queries about an individual customer or device) by up to 10x, computing metrics on the data as it arrives to deliver targeted, personalized analytics without programming accelerator layers.

For telecommunications companies, as well as utilities and energy providers who, due to deregulation, are pursuing smart metering initiatives to differentiate themselves from competitors, this capability is hot. With Live Aggregate Projections, these providers can deliver smart metering reports that educate customers on their consumption rates vis-à-vis their neighbors, promoting a greener planet and establishing a tighter relationship with their provider. Interested in learning how HP Vertica easily chews through the enormous volume and scale of smart meter data and the Internet of Things? Check out our newly published white paper, including detailed benchmarks.
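
As a rough sketch of what this pre-aggregation looks like in projection DDL (a hypothetical smart-meter schema; see the release documentation for the feature’s exact syntax and restrictions), a live aggregate projection is defined with a GROUP BY in its defining query, so the totals are maintained as data arrives:

CREATE PROJECTION meter_daily_usage AS
SELECT meter_id,
       reading_date,
       SUM(kwh) AS total_kwh   -- maintained incrementally as readings are loaded
FROM meter_readings
GROUP BY meter_id, reading_date;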


Mixed Dynamic Workload Management – Make ’em All Happy


Another major advancement of “Project Maverick” is Dynamic Mixed Workload Management. Commonly found in many data warehouse technologies and requested by some of our largest customers, this powerful new feature identifies and adapts to varying query complexities — simple and ad-hoc queries as well as long-running advanced queries — and dynamically assigns the appropriate amount of resources to meet the needs of all data consumers.

So, now, you can preserve your precious hardware and system resources, while pleasing even your most demanding internal and external constituents – from analysts who seek ad-hoc insights to data scientists who manage complex analytics to business executives who need customized views or dashboards on Monday mornings.


Cost-Optimized Storage – Don’t Break the Bank


Hadoop is commonly used as a data lake to store bulk data. That’s why, with each release, we offer tighter integration that delivers the most open SQL on Hadoop. “Dragline” builds on that strong heritage by helping you manage multiple storage tiers for cost-effective analytics. You can perform advanced SQL queries on bulk data stored in HDFS using the HP Vertica Analytics Platform without moving the data or using any connectors. You can then move data into HP Vertica when your organization requires faster performance for in-depth analytics.

HP Vertica supports all of the major Hadoop distributions, but the power of MapR and HP Vertica on the same cluster is something special. Join our upcoming Webinar with MapR to get the full details behind this solution and to build your business case for SQL on Hadoop.

Try HP Vertica 7 today and stay tuned for more blog posts and materials to support this release.

Work hard, have fun and make a difference!


My name is Jaimin and I work as a Software Engineer in the Distributed Query Optimizer Team at HP Vertica. I wanted to share with you what I think makes Vertica the best place to work! I will explain the kind of impact you can make as an employee/intern at HP Vertica, while sharing my personal experiences.

As a student, I researched many companies I might want to work for to get a better understanding of the everyday life of software engineers. However, what I was most interested in learning about was the kinds of things engineers might do that went above and beyond the normal day-to-day stuff.

Is writing code something unique to the job?

No! Right?

As Software Engineers, we write code, develop algorithms, and implement them. But here at HP Vertica, we do lots of other things besides simply writing code.

Go above and beyond!

Vertica is different from other companies as far as normal day-to-day stuff goes.

Let me ask you this question: How many new graduates would you guess get a chance to file a patent within their first six months of joining a company? How many get a chance to write a paper within their first six months? Not many, right?

In my experience at HP Vertica, I’ve seen that just about all new graduate engineers file at least one patent in their first year at work. This speaks to the fact that the work we do here at Vertica is completely innovative. Our projects have a huge business impact.

Be the captain of your ship!

Vertica offers engineers incredible opportunities! All you have to do is be willing to accept them. One of the best things about HP Vertica is that you work in an environment where other engineers are smarter than you! You’ll find yourself constantly challenged to learn new, interesting, and exciting things. You’ll get better exposure and, more importantly, you have a massive role to play in the company’s growth and development.

Something else that’s unique about HP Vertica—the projects you work on as an intern become part of the shipping product! As a result, you’ll get the chance to see your code in action, and sometimes you can learn what customers have to say about your feature in particular. You won’t be allowed to sit idle for a minute because we have a very short release cycle. This will keep you on your toes and encourage you to think of something new day in and day out.

Here, engineers are not forced to work on this and that—they have a great deal of autonomy and frequently get to choose the things they work on. If you have an idea you think can help improve the product, you are encouraged to see it through. And, you’ll also get a chance to participate in various technical events that take place within HP and submit your ideas.

Taking initiative is always encouraged and you’ll be expected to make, discuss, and defend your design decisions with your mentors instead of just following directions. You’ll also be able to learn about the complexities of building a database and how we achieve the performance advantages in HP Vertica.

It is also easy to move between teams. It is entirely up to you; the only question is what you want to do.

Share and gain knowledge!

Knowledge sharing is another important thing at Vertica. We hold lunch talks where we discuss new papers related to database systems. Every now and then, people from various teams give tech talks so that each team is aware of what the other groups are doing.

Before joining Vertica as a fresh graduate, I did not have any experience working on a database optimizer, though I had worked a bit on optimizations when I took a compiler class. Because of the great culture and environment at Vertica, I didn’t find the transition difficult at all. Sometimes it was challenging, but it allowed me to learn a lot by working with the incredibly smart people at the company on challenging projects (I wonder how many people get to work on the design and implementation of queries involving set operators during their first year of work).

Have fun!

We frequently unwind by doing fun things at work, including watching the Olympic Games or other sporting events during lunch, or playing table tennis and board games when we can. Vertica provides a lot of flexibility, and it comes with huge responsibility. You’re expected to get your work done on time—if you do that, no one will have any problem with having a little fun. Interns also go on outdoor field trips, including horseback riding, hiking the Blue Hills, going to a movie, participating in a bocce tournament, and water activities such as motor boat racing. Once, we went to Boston Harbor and learned how to sail from one of our in-house experts at Vertica.

We are looking for people to join Vertica! Do you have any interest in being challenged in an innovative design environment? Then apply today!

How to make ROLLUP fly in Vertica?


ROLLUP is a very common Online Analytic Processing (OLAP) function and is part of ANSI SQL. Many customers use ROLLUP to write reports that automatically perform sub-total aggregations across multiple dimensions at different levels in one SQL query. The following is an example:

Select DEPT_CODE, STORE_CODE, SUM(SALE_TOT_QTY) As TOTAL_SALES
From AGG_CUST_MONTHLY A INNER JOIN DIM_DATE B
ON ( A.DATE_KEY = B.DATE_KEY )
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331
GROUP BY ROLLUP(DEPT_CODE, STORE_CODE);

Note: ROLLUP syntax is not supported in the current version of Vertica 7.0. The query above is shown only as an illustration. See the blog text for more details.

The business question the above query intends to answer is: For the month of March, 2014, show the total sales across both departments (dept_code) and stores (store_code) to generate sub-total/total sales figures at three levels:

  1. At the detailed level of each pair of (dept_code, store_code)
  2. At the level of only (dept_code) but aggregating across all stores
  3. Finally a grand sales total aggregating across all departments and stores

Assuming performance is not an issue, ROLLUP is clearly a very powerful and useful function that can do a lot of data crunching in the background and deliver quite a complex report.

Unlike the more familiar GROUP BY, in which the column order is not material, ROLLUP works by creating subtotals that “roll up” from the most detailed level to a grand total, and it requires an ordered list of grouping expressions as arguments. First, ROLLUP calculates the standard aggregate values specified in the plain GROUP BY (without ROLLUP); then it moves from right to left through the list of grouping columns, starting with the lowest level of aggregation, to create progressively higher-level subtotals. This process continues until all grouping columns have been consumed and their aggregations computed. The operator essentially “rolls up” its grouping operations, culminating in a grand total at the end.

If N is the number of grouping columns, the ROLLUP function creates (N+1) levels of subtotals (including the final grand total). The column order in a ROLLUP operator is critical, since ROLLUP works by removing the rightmost column at each step. For example:

ROLLUP(A, B, C) creates 4 groups: (A, B, C), (A, B), (A), ().

Note there are NO groups such as (A, C) and (B, C).
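
In other words, over a hypothetical table T with a measure column m, ROLLUP(A, B, C) is shorthand for a UNION ALL of those four grouping levels, with NULLs standing in for the rolled-up columns (the same expansion the workaround below applies to the two-column case):

SELECT A, B, C, SUM(m) FROM T GROUP BY A, B, C
UNION ALL
SELECT A, B, NULL, SUM(m) FROM T GROUP BY A, B
UNION ALL
SELECT A, NULL, NULL, SUM(m) FROM T GROUP BY A
UNION ALL
SELECT NULL, NULL, NULL, SUM(m) FROM T;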

Unfortunately, the current version of Vertica 7.0 does not support ROLLUP. So is there an alternative or workaround for existing Vertica customers? And will the workaround perform on a large data set? Fortunately, the answer to both questions is yes. However, it does take some ingenuity in physical design and SQL rewriting to make ROLLUP work and perform in the currently supported Vertica releases (version 5.1 and later).

The key steps are to create and pre-load a local temp table that is sorted on ALL the ROLLUP columns in exactly the same order, rewrite the query at each aggregation level as a GROUP BY sub-query, and finally UNION ALL these sub-queries together. By applying this strategy to the ROLLUP SQL posted at the beginning of this blog, we arrive at the following workaround:

DROP TABLE IF EXISTS AGG_TEMP CASCADE;

CREATE LOCAL TEMPORARY TABLE AGG_TEMP
ON COMMIT PRESERVE ROWS
AS
/*+ DIRECT */
( Select DEPT_CODE, STORE_CODE, SUM(SALE_TOT_QTY) As TOTAL_SALES
From AGG_CUST_MONTHLY A INNER JOIN DIM_DATE B
ON ( A.DATE_KEY = B.DATE_KEY )
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331
GROUP BY DEPT_CODE, STORE_CODE )
ORDER BY DEPT_CODE,
STORE_CODE
SEGMENTED BY HASH (DEPT_CODE, STORE_CODE) ALL NODES;

SELECT ANALYZE_STATISTICS('AGG_TEMP');

SELECT DEPT_CODE, STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE, STORE_CODE
UNION ALL
SELECT DEPT_CODE, NULL As STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE
UNION ALL
SELECT NULL As DEPT_CODE, NULL As STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP;

The fact that the temp table AGG_TEMP is sorted on the ROLLUP columns (dept_code, store_code) is important for performance, because it means that all the GROUP BY sub-queries at the individual levels can exploit the highly efficient GROUPBY PIPELINED operator (visible in the query’s explain plan) in Vertica to get the job done quickly and accurately.

A nice property of the GROUPBY PIPELINED operator (in contrast to the more resource-intensive GROUPBY HASH) is that its performance is largely independent of the actual table size. As a result, this workaround can easily scale to tables with 10+ billion or even 100+ billion rows without any noticeable drop in query performance (as measured by first-fetch time).
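
You can confirm that the sorted temp table actually earns the pipelined operator by examining the plan; for the first sub-query above, EXPLAIN should report GROUPBY PIPELINED rather than GROUPBY HASH:

=> EXPLAIN
   SELECT DEPT_CODE, STORE_CODE, SUM(TOTAL_SALES)
   FROM AGG_TEMP
   GROUP BY DEPT_CODE, STORE_CODE;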

To conclude, I would like to mention that serious work is under way in Vertica Engineering to implement ROLLUP (and many other OLAP functions) in upcoming Vertica releases. Stay tuned!

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.
