All of the interns at Vertica have found interesting and useful projects to work on during their stay. The focus of my efforts has been the Data Quest project.
At Vertica, an important goal is to keep customers happy. To do this effectively, it’s necessary to know what a customer wants, even if they aren’t sure themselves. One potential source of this information is diagnostic data, which can tell us how Vertica is actually used in the wild. Internally, the Vertica Analytics Platform tracks a wealth of information about its state and usage in over 300 system tables. If something goes wrong, customers have the option of creating a diagnostic dump file capturing the current state of these system tables, so that the information can be used to resolve problems and improve future experiences. As of this writing, there are over 515 dump files available, containing more than 100 GB of compressed data.
The Data Quest goal has been to take all of this Vertica meta-data, pump it into a database, and get cracking on the analysis. Unfortunately, getting the data ready is a nontrivial task, and a number of hurdles had to be crossed before it could be loaded and analyzed.
First, the data in the dumps is messy. While the data is pipe (|) delimited, columns often contain embedded newlines, extraneous delimiters, or both, so the data needs to be cleaned up before being handed off. The problem was tackled with an algorithm that works inward from both ends of each row to isolate the problem column, then applies appropriate escaping and quoting to clean up the data. On a parallel front, the diagnostic generation scripts were updated to make future diagnostic data a snap to load.
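As a rough illustration, the two-ended repair described above might look something like the following sketch. The real cleanup code is internal; the function name, the assumption that the messy column’s position is already known, and the backslash-escaping convention are all mine.

```python
def repair_row(raw, n_cols, messy_col):
    """Repair a pipe-delimited row that may contain extra '|' characters
    inside one known messy column.

    Fields before the messy column are matched from the left end of the
    row, fields after it from the right end; everything left over in the
    middle must belong to the messy column, so its stray delimiters are
    escaped and it is rejoined into a single field.
    """
    fields = raw.split("|")
    if len(fields) <= n_cols:
        return fields  # no extra delimiters in this row
    left = fields[:messy_col]                      # taken from the left
    n_right = n_cols - messy_col - 1
    right = fields[len(fields) - n_right:]         # taken from the right
    # Escape the stray delimiters so the loader sees one field.
    middle = "\\|".join(fields[messy_col:len(fields) - n_right])
    return left + [middle] + right
```

For example, a three-column row whose second field contains two raw pipes, such as `42|hello|wor|ld|x`, comes back as the three fields `42`, `hello\|wor\|ld`, and `x`.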
Next, this diagnostic information can come from a variety of different versions of Vertica. As Vertica has grown in capabilities, additional system tables have been added to provide more information to users, administrators, and developers. The new information comes at a cost – older system table schemas are no longer compatible with the new ones. To make the Data Quest project as valuable as possible, care was taken in solving this problem: methods were created for automatic version merging at load time, along with auto-generated views that span all versions.
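One plausible shape for such auto-generated views is a UNION ALL over each version’s table, with NULL filling in the columns a given version lacks. This is a hypothetical sketch, not the actual Data Quest generator, and the table and column names in the example are invented rather than Vertica’s real schema:

```python
def make_union_view(view_name, versioned_tables):
    """Generate a CREATE VIEW statement spanning several versions of a
    system table.

    versioned_tables maps each per-version table name to its column
    list. The view selects the union of all columns, substituting
    NULL for any column a particular version does not have, so every
    branch of the UNION ALL has the same shape.
    """
    # Merged column set, preserving first-seen order across versions.
    all_cols = []
    for cols in versioned_tables.values():
        for c in cols:
            if c not in all_cols:
                all_cols.append(c)
    selects = []
    for table, cols in versioned_tables.items():
        exprs = [c if c in cols else f"NULL AS {c}" for c in all_cols]
        selects.append(f"SELECT {', '.join(exprs)} FROM {table}")
    return (f"CREATE VIEW {view_name} AS\n"
            + "\nUNION ALL\n".join(selects) + ";")
```

Given, say, a v4 table with columns `node, request` and a v5 table that adds `duration_us`, the generated view reads the v4 rows with `NULL AS duration_us`, so a single query can span both versions.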
Finally, the goal of the project is analysis. Now that the data is loaded, it’s time to mine the wealth of information available.
First off let’s look at a sample of customer data in Vertica databases. There are a number of sources of this information including column statistics, table information, and projection data.
What if we wanted to know information about how this data is distributed? Questions worth asking would include how many tables and projections customers have. On the optimization side it’s useful to know what the most prevalent data types are. In this case, Integers and Varchars far outweigh the rest. Numerics, Floats, and Chars are also ranked highly.
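To make the flavor of these aggregations concrete, here is a toy version of the data-type ranking, run with Python’s built-in sqlite3 against a handful of made-up column-metadata rows rather than Vertica’s actual system tables:

```python
import sqlite3

# A tiny stand-in for a "columns" system table, with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE columns (table_name TEXT, column_name TEXT, data_type TEXT)"
)
conn.executemany(
    "INSERT INTO columns VALUES (?, ?, ?)",
    [("orders", "id", "int"), ("orders", "note", "varchar"),
     ("users", "id", "int"), ("users", "name", "varchar"),
     ("users", "balance", "numeric")],
)

# Rank data types by how many columns use them, most prevalent first
# (ties broken alphabetically so the output is deterministic).
ranking = conn.execute(
    "SELECT data_type, COUNT(*) AS n FROM columns "
    "GROUP BY data_type ORDER BY n DESC, data_type"
).fetchall()
print(ranking)  # [('int', 2), ('varchar', 2), ('numeric', 1)]
```

The real query is the same GROUP BY shape, just pointed at the loaded diagnostic tables instead of this in-memory sample.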
One point of curiosity was figuring out when the busiest times of the week were for running queries.
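A question like this boils down to bucketing each query’s start time by weekday and hour and counting. The sketch below shows one way to do that in Python; the function name and the sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

def busiest_buckets(timestamps):
    """Count query start times per (weekday, hour) bucket.

    timestamps: iterable of ISO-8601 strings like '2011-08-01T09:15:00'.
    Returns (bucket, count) pairs ordered from busiest to quietest.
    """
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        counts[(WEEKDAYS[t.weekday()], t.hour)] += 1
    return counts.most_common()
```

For example, with two queries on a Monday at 9-something and one on a Tuesday afternoon, `("Monday", 9)` comes out on top with a count of 2.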
However, this database doesn’t need to be used only for aggregations. One of our developers wanted a glimpse of the sort of real-world data that ends up in various system tables in a production database. Instead of faking some numbers or even making educated guesses, a couple of quick queries reveal what sorts of data you would expect to find in any of the system tables.
All of these examples were produced with single queries against the Diagnostic database. Data is automatically loaded into the Diagnostic Dump Database daily, so stay tuned for further results and insights!