Vertica

Archive for July, 2012

Workload Analyzer – from past history to future performance!

by CJ Sun

One of the routine yet challenging tasks for a database administrator (DBA) is to ensure that the database is in a “healthy” state – from the viewpoints of smooth operations and optimal performance. In most databases, performance tuning is a huge time sink because of the myriad changes in the application or database deployment that could impact performance:

  • Data/Schema changes: loading more data, adding/dropping tables
  • Workload changes: queries, concurrency pattern changes
  • System configuration parameter changes
  • Cluster/Hardware changes: adding new nodes, storage/memory/network upgrade
  • User/role management: new users, new role/privilege grants

At HP Vertica, our design philosophy has been to make database tuning a science rather than the black art it often is! Easier said than done of course, but we’ve been working on this from day one! From the outset, we made sure that physical design could be automated using the Database Designer, which deploys data with optimal query performance, a minimized storage footprint, and the desired level of data redundancy. With Vertica 6, we’ve taken this approach to a new level with the Workload Analyzer (WLA), which makes health checks and tuning a breeze! By analyzing historical performance-related events and activities, the WLA can locate where performance degradation occurred, analyze the root cause, evaluate the current system status, and recommend tuning actions for the DBA to rectify the problem.

The workflow is illustrated as follows:

Can you guess what’s going on? Vertica collects every ounce of data conceivable about itself as it is running, using a component called the Data Collector. Surprise! This is a Big Data problem. Then we use the world’s best analytics platform (i.e., Vertica) to analyze this historical data, combined with the current system status (catalog objects, locks, sessions, data layout, etc.), to identify problems and recommend solutions.

Here are just some of the problems the WLA can detect from the data collected by the Data Collector:

  • stale data statistics causing a sub-optimal query plan
  • any operation that spilled data to disk
  • outdated physical design compared to current workload
  • incorrect system parameter settings
  • mis-configured system services
  • potential security risks
  • data skew and fragmentation problems
  • excess resource usage

And here are some tuning recommendations you may see:

  • set configuration parameter
  • invoke database designer to design projections and drop unused ones
  • reconfigure background system services
  • adjust table partitioning and projection segmentation
  • enforce password policy on privileged user

The ultimate power to accept the recommendations is left to the DBA, of course, who can review them and choose whether or when to apply them.
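For reference, past recommendations can also be reviewed straight from SQL; here is a minimal sketch, assuming the TUNING_RECOMMENDATIONS system table is exposed in your version:

-- list the recommendations produced by previous WLA runs
select * from v_monitor.tuning_recommendations;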

Using the Workload Analyzer

The Workload Analyzer is an online tool that can be run against a live database. It has been integrated into the new Management Console (MC) since v6.0 and is scheduled to refresh daily by default. The DBA can monitor the tuning recommendations in MC and take the recommended actions at an appropriate time.

In addition to the MC, the Workload Analyzer can be invoked manually from the command line, where users specify the two-dimensional scope of the analysis: (1) whether it covers the whole system or a specific schema/table, and (2) how far back the WLA should look in the system event history. The command looks like the following:

select analyze_workload('scope', ['since_time']);
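For example, a couple of invocations might look like the following; this is a minimal sketch, with the scope values ('' for the whole catalog, 'public.sales' for a single table) and the look-back window chosen purely for illustration:

-- analyze the whole catalog using all available event history
select analyze_workload('');

-- analyze one table, considering only events from the last 24 hours
select analyze_workload('public.sales', now() - interval '1 day');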

Configuring and Extending the WLA

Best of all, the Workload Analyzer is configurable and extensible! The WLA includes a rule-based framework to convert observations into tuning recommendations. Since rules work independently from each other, they can be individually disabled or enabled anytime to control WLA scope.

WLA rules can be configured using one or more parameters. A parameter value can be changed to tailor each rule for a different customer. For example, the rule detecting data skew has parameters for minimal table size and data skew factor threshold, which can be adjusted for customers with different data skew tolerances. In the rule for monitoring CPU utilization, the parameter for CPU usage threshold and duration can be set lower after a hardware upgrade. These parameter changes can be done on-the-fly without requiring cluster restart or server upgrade.

In Vertica 6.0, the system offers 19 built-in rules. In addition to these system-defined rules, the WLA also supports user-defined rules to extend its tuning capability. Users can define their own rules to meet their business requirements. For example, a customer may create a rule to monitor queries from a particular client and raise a flag when the query runtime exceeds a certain threshold. These user-defined rules can be added to or dropped from the system, and enabled or disabled at will.

There you have it – what better proof point for Vertica than using it to analyze data about itself! The Workload Analyzer offers self-diagnosis capabilities to detect performance degradation from workload history, and intelligently recommends tuning actions to the DBA.  It even detects some types of security vulnerabilities. Do you have WLA rules to share with our community?

On the Trail of a Red-Tailed Hawk! – Part 2

Part 2:  Why User-Defined-Loads is music to my ears!

Webcams and hawks – really?  I heard a healthy dose of skepticism expressed about what User-defined Loads can accomplish.  So how about I show something a bit more practical?

I’m a fan of music of all flavors. Among the various devices I’ve had connected to my stereo at home are an LP player and an open-reel tape deck.  While it’s fun to get out the old vinyl records from time to time, they’re a bit cumbersome and don’t travel very well; more often I just want to pull up iTunes and hit “Play”.  So if I have a recording that I particularly enjoy, I try to digitize it.  But it is a huge pain to split it up into tracks.  I thought – why not automate it? You guessed right – with UDL, of course!

Vertica’s not much for music; natively it couldn’t tell the difference between The Beach Boys and Van Halen.  In fact, it can’t make heads or tails of most any music format.  So I need one (or both) of Vertica’s two other UDL tools.  One is a User-Defined Parser:  I could write a parser that directly parsed the WAV format, for example.  This would give me the best performance, but would take some time to code.

The other is a User-Defined Filter: a Filter takes in data, processes it, and emits it in a new form. Filters are intended for things like decompression and transcoding. It’d be nice to support FLAC files and the like (c’mon, you can’t use lossy MP3 files for this; this is the good stuff!), and my computer has more free time than I do, so a simple Filter seems like the way to go.

We conveniently have an external-filter example on Github as well; it opens an external process and passes data through it via pipes. So let’s go install it:

$ cd Vertica-Extension-Packages/shell_load_package
$ make install

Now, let’s once again make a table and try loading some data:

>>> CREATE TABLE music (filename varchar(64), time_index float, L float, R float)
     ORDER BY filename, time_index;
>>> COPY music (filename AS 'Sample', time_index, L, R)
     FROM '/home/aseering/Music/Recordings/Sample.flac'
     FIXEDWIDTH COLSIZES (17, 17, 17)
     WITH FILTER ExternalFilter(cmd='sox --type flac - --type dat -');

The FIXEDWIDTH parser takes advantage of the format of the ‘.dat’ file:  It’s three floating-point columns; each floating-point number can be up to 16 characters wide (and is padded with spaces to that length if it’s smaller), and there’s one extra leading space as a separator.
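To make the column sizing concrete, the data rows the parser consumes are whitespace-padded columns of time, left-channel, and right-channel samples, roughly like this (the values below are invented for illustration):

 0.00000000000000 0.02328491210938 0.02447509765625
 0.00002267573696 0.02282714843750 0.02416992187500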

Of course, using ExternalSource, I can load directly from my audio input as well, assuming a properly configured sound system on the computer:

>>> COPY music (filename AS 'Sample', time_index, data filler int, L AS data, R AS data)
     FIXEDWIDTH COLSIZES (17, 18)
     WITH SOURCE ExternalSource(cmd='arecord -d 10')
     FILTER ExternalFilter(cmd='sox --type wav - --type dat -');

Once I’ve loaded the data, I have to split up tracks.  On recordings of old records, the only real way I know to split tracks is to look for areas where the audio is quiet for a while.  Vertica can do this with window analytics.  The following is a good start, though there is of course a ton of flexibility if you want to tune the algorithm:

>>> SELECT sqrt(avg(L*L + R*R)
         over (order by time_index rows between 100 preceding and 100 following))
     AS avg_volume
     FROM music WHERE filename = 'Sample';
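From there, a hedged sketch of one way to finish the job: treat stretches where that rolling volume drops below some threshold as candidate track boundaries (the 0.01 cutoff is arbitrary and would need tuning for a real recording):

>>> SELECT time_index
     FROM (SELECT time_index,
                  sqrt(avg(L*L + R*R)
                      over (order by time_index rows between 100 preceding and 100 following)) AS avg_volume
           FROM music WHERE filename = 'Sample') v
     WHERE avg_volume < 0.01
     ORDER BY time_index;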

And there we go!  Now I can enjoy my Carpenters on the move.

Most people think of databases as tools for managing payroll or user accounts, or diving through sales records.  Vertica can do all of that.  But Vertica 6 is so much cooler than that:  It gives you a platform for running your own custom code in a robust, distributed way.  You can mix-and-match UDLs (and all the other UDx types) however you want; develop libraries of UDL tools; even share those tools on Github if you want, or download tools that others have shared.  (We take pull requests!)  You can also use UDLs with external tables to provide a real queryable table that’s backed by any data store that you want, so you can run any task through Vertica’s parallel execution engine.

Are you convinced now? Whether you are or not, I’ve spent the past few hours digitizing several of my favorite LPs. Now I can sit by my window with my headphones on, listening to music and working away until the hawk shows up. Life is good!

Teaching the elephant new tricks

by Shilpa Lawande & Rajat Venkatesh

Introduction

A growing number of Vertica customers use Hadoop as an ETL/pre-processing tool or HDFS as a “data parking lot” – either to stage files before they are loaded or to simply store data whose worth is yet to be proven. For several years, Vertica has provided a Hadoop Connector that offers bi-directional connectivity between the two systems. In this post, we evaluate the pros and cons of the current approach and discuss a new technique enabled by the newly released Vertica 6.

Current Design

The most efficient topology of network connections is shown in the diagram above: there is one database connection for each partition (or subset of partitions), and all of these connections transfer data in parallel. Most solutions in the market, including the current Vertica-Hadoop connector, follow this design. They open a JDBC connection for each partition (or subset of partitions) and execute a batch insert to transfer the data. Some products, like Vertica, may optimize the transfer by using a bulk-load API or by transforming the data to avoid using resources on the database. Apache Sqoop is the end result of this evolution and provides a platform to implement generic or optimized connectors for all databases that provide a JDBC driver.

Problems

The architecture described above has some inherent weaknesses owing to a fundamental impedance mismatch between Vertica and Hadoop.

Even though Vertica and Hadoop are both MPP systems, there is no good way to coordinate the degree of parallelism across the two, resulting in inefficient use of the combined resources. Even a moderately large Hadoop cluster can overwhelm a database with too many JDBC connections, so it is a common best practice to reduce the number of reduce (or map) tasks that connect to the database.

Vertica’s customers typically transfer a large file (with many HDFS chunks) or a set of files into one table. Each of the resulting JDBC connections initiates a new transaction, overwhelming the transaction module with too many actors. And since each connection is its own transaction, it’s not possible to roll back a transfer as a whole; a failed transfer may leave partial data – the worst kind of failure. A common best practice is to first store the data in a temporary table, verify the transfer, and then move it to its final resting table. Another common strategy is to add a column that identifies the rows inserted by a transfer, verify the rows with that unique id, and then keep or delete them. Depending on the database and storage engine, verifying and moving the data may take up significant resources. Regardless of the approach or database, there is significant management overhead for loading data from Hadoop. A majority of Vertica’s customers use Vertica precisely so that they don’t have to worry about ACID for bulk loads. They don’t want to be in the business of writing and maintaining complex scripts.
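A sketch of the staging-table workaround described above, with illustrative table names (the verification step is elided; in practice it might compare row counts or checksums):

-- create an empty staging table shaped like the target
CREATE TABLE sales_staging AS SELECT * FROM sales WHERE false;
-- ... MR/Sqoop tasks batch-insert into sales_staging over many JDBC connections ...
-- once the transfer is verified, move the rows and clean up
INSERT INTO sales SELECT * FROM sales_staging;
DROP TABLE sales_staging;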

In our survey of customers, we found that many of them have come up with creative solutions to the “too-many-connections” problem. By popular demand, we’re in the process of creating a repository for the Hadoop connector source on Github and hope to see some of these solutions contributed there. However, by far the most preferred solution is to simply write the results of an MR job to HDFS files, then transfer those files to Vertica and load them. And that’s what we’ve chosen to do in our new design as well.

New Design

In the new HDFS connector, currently in private beta, we have taken the first step toward solving the problems described earlier by allowing files to be loaded directly from HDFS into Vertica, thereby eliminating the non-ACID behavior caused by multiple connections. Further, we solve the problem of too many connections by using one TCP/IP connection per file in HDFS. How did we do this? With Vertica 6, we introduced User-Defined Loads, which expose APIs that extend Vertica’s native load command, COPY, to load from any source and in any format. With a few days of effort, using the UDL API and the Apache Hadoop REST API for HDFS files, we wrote a plugin that lets COPY read HDFS files. The benefits are obvious: COPY can process many files within the same transaction, in parallel, across the Vertica cluster. Parallelism is limited only by the hardware of the machines. And since it is a single transaction, a failure results in a complete rollback.

The resulting network topology looks like the diagram below.

Let’s see how the HDFS connector works using some examples.

Imagine that there are 6 files in a directory in HDFS:

-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem1.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem11.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem12.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem2.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem31.tbl

The Vertica Copy command to load lineitem.tbl works as follows:

copy lineitem source
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem.tbl');

If you want to load all 6 files, you simply use the glob (*) feature. In this case, all 6 files are loaded into Vertica and processed in parallel across the cluster.

copy lineitem source
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem*');

In this example, lineitem.tbl is a tab-delimited file. What about all the other popular file formats? User-Defined Load provides an API to plug in a parser for any file format, as well as APIs to decompress, decrypt, or perform any arbitrary binary transformation. Check out this blog post showing how UDL can be used to process image file data.
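As a hedged sketch of how such pieces could compose with the Hdfs source, suppose the file were gzip-compressed; a decompression filter would simply slot into the same COPY statement (the GzipUnpacker filter name and the .gz path below are hypothetical, not part of the shipped connector):

copy lineitem source
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem.tbl.gz')
filter GzipUnpacker();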

Further, with Vertica 6 external tables, in combination with the HDFS connector, you can now use Vertica SQL to analyze data in HDFS directly (see the “Vertica Analytics Anywhere” blog post). This means that, while you are in the exploratory stage with some data, you can experiment with different data models and still use SQL analytics (or the Vertica R SDK) to plough through the data.

CREATE EXTERNAL TABLE hdfs.lineitem (<list of columns>) AS COPY
FROM Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/*');
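Once the external table is defined, any Vertica SQL works against it; the data stays in HDFS and is read at query time. For example:

SELECT count(*) FROM hdfs.lineitem;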

Summary

“Wait a minute! Most chunks are sent to another Hadoop datanode and then to Vertica. This is worse than what we had before.” Yes, we have exchanged expensive JDBC connections for one extra hop for most chunks. However, the new strategy is still faster and maintains transaction semantics. The next step is to enable COPY to parse data chunks in parallel. Even so, we believe the current solution has significant advantages over using JDBC connections and already solves major pain points for Vertica-Hadoop users.

An under-performing connector requires far more resources – machine and human – to bring together data from various sources. With an efficient connector, users can transfer data in real time and on a whim, allowing them to use the right tool for the job.

If you are interested in our private beta for the HDFS connector, please send a note to beta@vertica.com with a brief description of your use case.

Vertica Analytics Anywhere

by Andrew Lamb

As Vertica has evolved to address the needs of diverse analytics users, a common refrain we hear from our customers is that data modeling and exploration is a key activity for data scientists. This is the phase when data is available but they aren’t quite sure how to harness it yet. Over a series of experiments and iterations, the right data model emerges, and at that point it can be operationalized in Vertica for ongoing interactive use. People often use Hadoop for this phase, which gives them the flexibility to access any data, but it means they must write MR programs for analytics and cannot leverage the sophisticated analytics available in Vertica. This insight led us to decouple our Analytics Engine from our columnar storage, further extending our patent-pending FlexStore architecture. With Vertica 6, it is now possible to use the full expressive power of the Vertica Analytics Engine and its analytics without having to load the data into Vertica!

With External Tables combined with User-Defined Loads in Vertica 6, we not only support conventional external tables backed by files on a database server, but also external tables backed by any user-defined data source. We have already written adapters for HDFS, FTP or HTTP servers, JSON and XML objects, IDOL, and, of course, other databases via ODBC. (Stay tuned for future blog posts on each of these!) The ability to analyze arbitrary data sources in this federated fashion enables powerful mash-ups, such as joining structured data in Vertica with semi-structured data (think log files) in HDFS, unstructured data (think audio or images) indexed in IDOL, or master data in other legacy relational databases. The combined data set can then be analyzed using Vertica’s native analytics, such as Timeseries, Event Series Pattern Matching, and SQL, as well as a growing body of user-defined analytic extensions in C++, and now R!

Of course, as you might expect, analytics over external data is significantly slower than analytics over data stored in Vertica’s native, highly compressed columnar storage format, but it offers the same “late binding” flexibility people love about NoSQL interfaces while continuing to leverage familiar SQL interfaces and BI tools. And, thanks to Vertica’s fast MPP engine and C++ implementation, it is still significantly faster than alternatives like Pig or Hive on top of Hadoop. You may now choose to leave less valuable information in cheaper, slower storage such as HDFS and never move it into Vertica. And if you change your mind, or when the right data model is discovered, or you just want a go-fast switch, then with a slight tweak of syntax, voila! The same data is loaded into Vertica and automatically gets full high availability, high compression, backup, recovery, automatic storage optimization, and the other benefits of an enterprise-class analytic platform!
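As a rough sketch of that syntax tweak, reusing the HDFS example from the connector post above (the table name is illustrative, and the column list is left as a placeholder just as in that example): instead of declaring the table external, create a native table and point COPY at the same source.

CREATE TABLE lineitem_native (<list of columns>);
COPY lineitem_native SOURCE
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/*');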

The figure illustrates how external tables fit into the overall architecture of Vertica.

To use an external table, you define a table with the EXTERNAL keyword and provide information about the data source. Whenever that external table is read, the database retrieves data from the external source, parses it into the appropriate relational form, and the rest of the query plan proceeds as normal.
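A minimal sketch of the conventional, file-backed case (the path, columns, and delimiter below are made up for illustration):

CREATE EXTERNAL TABLE ext_events (event_time timestamp, user_id int, message varchar(1000))
AS COPY FROM '/data/exports/events.csv' DELIMITER ',';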

And of course, we also enable the legacy use case for external tables, which is simpler and/or quicker ETL/ELT. Rather than loading data into a temporary staging table prior to transformation in the database, the transformation begins by reading the data directly from the external files it lives in, thus avoiding an unnecessary materialization in database storage structures.
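Continuing the sketch above (the target table and the transformation itself are illustrative), the transformation can read straight from the external table into its destination:

INSERT INTO events_clean
SELECT event_time, user_id, lower(message)
FROM ext_events
WHERE user_id IS NOT NULL;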

We believe that this separation of analytics from storage will let more people use Vertica’s analytics on more data in more interesting ways! And that is after all, what Vertica is all about!
