by Andrew Lamb
As Vertica evolved to address the needs of diverse analytics users, a common refrain we heard from our customers was that data modeling and exploration is a key activity for data scientists. This is the phase when data is available but they aren't quite sure how to harness it yet. Over a series of experiments and iterations, the right data model emerges, and at that point it can be operationalized in Vertica for ongoing interactive use. People often use Hadoop for this phase, which gives them the flexibility to access any data, but it means they must write MapReduce programs for analytics and cannot leverage the sophisticated analytics available in Vertica. This insight led us to decouple our Analytics Engine from our columnar storage, further extending our patent-pending FlexStore architecture. With Vertica 6, it is now possible to use the full expressive power of the Vertica Analytics Engine and its analytics without having to load the data into Vertica!
With External Tables combined with User-Defined Loads in Vertica 6, we not only support conventional external tables backed by files on a database server, but also external tables backed by any user-defined data source. We have already written adapters for HDFS, FTP and HTTP servers, JSON and XML objects, IDOL, and, of course, other databases via ODBC. (Stay tuned for future blog posts on each of these!) The ability to analyze arbitrary data sources in this federated fashion enables powerful mash-ups, such as joining structured data in Vertica with semi-structured data (think log files) in HDFS, unstructured data (think audio or images) indexed in IDOL, or master data in other legacy relational databases. The combined data set can then be analyzed using Vertica's native analytics, such as Timeseries and Event Series Pattern Matching, standard SQL, and a growing body of user-defined analytic extensions in C++, and now R!
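As a sketch of what such a federated mash-up might look like, here is a hypothetical join between a native Vertica table and an external table over web logs in HDFS. The table names, columns, and the `Hdfs` source parameters are all illustrative, not a definitive recipe:

```sql
-- Native Vertica table holding structured customer data (hypothetical schema)
CREATE TABLE customers (customer_id INT, name VARCHAR(100));

-- External table over semi-structured log files in HDFS, backed by a
-- user-defined HDFS source (the URL and column list are assumptions)
CREATE EXTERNAL TABLE web_logs (customer_id INT, url VARCHAR(2000), ts TIMESTAMP)
AS COPY SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/logs/*');

-- Ordinary SQL joins the two; Vertica reads HDFS at query time
SELECT c.name, COUNT(*) AS page_views
FROM customers c JOIN web_logs l ON c.customer_id = l.customer_id
GROUP BY c.name;
```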
Of course, as you might expect, analytics over external data is significantly slower than over data stored in Vertica's native, highly compressed columnar format, but it offers the same "late binding" flexibility people love about NoSQL interfaces while continuing to leverage familiar SQL interfaces and BI tools. And, thanks to Vertica's fast MPP engine and C++ implementation, it is still significantly faster than alternatives like Pig or Hive on top of Hadoop. Now, you may choose to leave less valuable information in cheaper, slower storage such as HDFS and never move it into Vertica. And if you change your mind, or when the right data model is discovered, or you just want a go-fast switch, then with a slight tweak of syntax, voila! The same data is loaded into Vertica and automatically gets full high availability, high compression, backup, recovery, automatic storage optimization, and the other benefits of an enterprise-class analytic platform!
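To give a flavor of that "slight tweak" (with an illustrative schema and file path), the switch amounts to dropping the EXTERNAL keyword and issuing a one-time COPY:

```sql
-- Before: data stays in the files and is re-read on every query
CREATE EXTERNAL TABLE events (id INT, payload VARCHAR(500))
AS COPY FROM '/data/events/*.csv' DELIMITER ',';

-- After: the same definition minus EXTERNAL, plus a one-time load
-- into Vertica's compressed columnar storage
CREATE TABLE events_native (id INT, payload VARCHAR(500));
COPY events_native FROM '/data/events/*.csv' DELIMITER ',';
```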
The figure illustrates how external tables fit into the overall architecture of Vertica.
To use an external table, you define a table with the EXTERNAL keyword and provide information about the data source. Whenever that external table is read, the database retrieves data from the external source, parses it into the appropriate relational form, and the rest of the query plan proceeds as normal.
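A minimal sketch of that definition, assuming CSV files on the database server (the path and columns here are hypothetical):

```sql
-- External table backed by CSV files on the database server
CREATE EXTERNAL TABLE sensor_readings (
    sensor_id INT,
    reading   FLOAT,
    taken_at  TIMESTAMP
) AS COPY FROM '/data/sensors/*.csv' DELIMITER ',';

-- Each query re-reads and re-parses the files at execution time
SELECT sensor_id, AVG(reading) FROM sensor_readings GROUP BY sensor_id;
```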
And of course, we also enable the classic use case for external tables: simpler and quicker ETL/ELT. Rather than loading data into a temporary staging table prior to transformation in the database, the transformation reads the data directly from the external files it lives in, avoiding an unnecessary materialization in database storage structures.
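A sketch of this staging-free pattern, with an illustrative raw-order feed (the schema, path, and casts are assumptions for the example):

```sql
-- Raw CSV files exposed directly as an external table; no staging copy
CREATE EXTERNAL TABLE raw_orders (order_id INT, amount VARCHAR(20), placed VARCHAR(30))
AS COPY FROM '/staging/orders/*.csv' DELIMITER ',';

CREATE TABLE orders (order_id INT, amount NUMERIC(10,2), placed TIMESTAMP);

-- The transformation happens as part of the read itself
INSERT INTO orders
SELECT order_id, amount::NUMERIC(10,2), placed::TIMESTAMP
FROM raw_orders;
```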
We believe that this separation of analytics from storage will let more people use Vertica’s analytics on more data in more interesting ways! And that is after all, what Vertica is all about!