When (and When Not) to Use Data Integration with HP Vertica
In December, HP released version 7 of the HP Vertica analytics platform, which includes, among other enhancements, a great new feature called HP Vertica Flex Zone (Flex Zone). Flex Zone enables you to quickly and easily load, explore and analyze certain forms of semi-structured data. It eliminates the need to define or apply coding-intensive schemas before the data is loaded for exploration.
One of Flex Zone’s important values is that it can save you hours of work setting up and managing data extraction. Rather than defining schemas and mappings in an ETL tool and later worrying about whether the structure will change, you simply pull data into Flex Zone, and the structure is understood automatically. Flex Zone is powerful for exploring common types of data. Flex Tables can immediately leverage:
- Delimited data – semi-structured text files. These are often referred to as flat files because the information is not stored in a relational database.
- JSON – a human-readable format that is often used in social media and newer online applications
For these types of files, which are very common in modern IT infrastructure, you do not need an ETL tool to extract, transform and load the data. This functionality is included with Flex Zone and can save you many hours of pre-processing data for analytics. It can also save you time in the long run by reducing the need to monitor ETL processes. Other mechanisms let Vertica ingest data from other common big data sources, such as Hive and HDFS.
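To illustrate, loading and querying a JSON file with a Flex Table takes only a few statements. The table name, file path and field names below are hypothetical examples:

```sql
-- Create a flex table: no columns need to be defined up front
CREATE FLEX TABLE social_posts();

-- Load a JSON file directly; keys are discovered at load time
-- (the file path is a hypothetical example)
COPY social_posts FROM '/data/posts.json' PARSER fjsonparser();

-- Query it with vanilla SQL, treating JSON keys as columns
SELECT "user.name", text FROM social_posts LIMIT 10;
```

Nested keys are referenced with dotted names, so no separate flattening step is required before you start exploring.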
Flex Zone’s ability to automatically understand structure is powerful. Working out structure by hand normally takes time, slowing the overall exploration of the data, and should the structure of the data change, maintaining it is also time-consuming. By integrating these less structured data sources and supporting vanilla SQL queries against them, Vertica brings a key feature of relational databases to bear: abstracting the storage representation from the query semantics.
ETL – Extract, Transform, Load
However, Flex Zone does not cover every source. Most ETL tools offer hundreds of connectors that allow for connection into anything from Salesforce to Oracle to DB2 on the mainframe. For these types of sources, you can either use an ETL tool or export data from the application into a supported format in order to use Flex Tables.
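For example, if an application can export its records to a delimited file, that export can be loaded straight into a Flex Table. The table name, file path and delimiter below are hypothetical examples:

```sql
-- Load a CSV export from a source application into a flex table
CREATE FLEX TABLE crm_export();
COPY crm_export FROM '/exports/accounts.csv'
    PARSER fdelimitedparser(delimiter=',');
```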
With ETL, users take on the process of extracting data and transforming it to make it fit for purpose. This longer process may be necessary. During the ETL process, users can ensure that the data conforms to the schema and that data quality standards are upheld. They can establish business rules and reject any records that don’t conform to standards. They can recode certain values in the data to standardize them (e.g. ST, Street and strt can all be recoded to ‘STREET’). They can also extract data from sources with proprietary formats, such as SAP, MS SQL, AS/400 and hundreds of others. Therefore, in order to deliver accurate analytics and gain access to unusual file formats, ETL is still necessary for certain data.
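As a sketch, the recoding and rejection rules described above might look like the following SQL applied to a staging table. The table and column names are hypothetical; in practice an ETL tool would typically apply equivalent rules before the load:

```sql
-- Recode street-suffix variants to a single standard value
UPDATE staging_addresses
SET street_suffix = 'STREET'
WHERE UPPER(street_suffix) IN ('ST', 'STRT');

-- Reject records that violate a basic quality rule
DELETE FROM staging_addresses
WHERE zip_code IS NULL OR LENGTH(zip_code) <> 5;
```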
Some companies need to keep track of where data came from and what was changed along the way. The data lineage features of many ETL tools help you track where a change occurred: the result is a report that traces a change from the target end of a job back to the source end. If this is an important part of your process, you may need an ETL tool.
The good news is that Vertica has several partners who offer free, open-source ETL tools with support for Vertica, like Talend and Pentaho, as well as commercial partners like Syncsort, Informatica and others. See http://www.vertica.com/partners/data-integration-etl/ for a complete list.