Vertica's Blog

Index Data into Your HP Vertica Database with the New IDOL CFS Vertica Module

HP is pleased to announce the new IDOL CFS Vertica Module, which allows the HP Connector Framework Server (CFS) to index data into an HP Vertica database.

The new indexing capability makes real integration between HP IDOL and HP Vertica possible, allowing you to use Vertica to perform analytics on data that has been indexed by IDOL. The CFS Vertica Module is compatible only with IDOL 10.9 and later, and with Vertica server versions 7.1.x and later. In this blog, we'll give you a high-level overview of how the new integration works by walking you through a simple example.

Scenario:
Your organization has a large repository of documents, written by many different authors. You want to find the length of documents written by each individual author.

Using IDOL CFS with the HP Vertica Indexer
The power of IDOL allows CFS to process the data it retrieves from connectors and index that information into HP Vertica. The process of getting data from a repository into HP Vertica breaks down into the following five steps:

  1. Connectors scan files from repositories and send documents to CFS
  2. CFS performs pre-import tasks (optional)
  3. CFS uses KeyView to filter document content and extract sub-files
  4. CFS performs post-import tasks (optional)
  5. CFS indexes data into existing HP Vertica flex tables

[Diagram: connectors send documents to CFS, which runs import tasks and KeyView filtering, then indexes the results into HP Vertica]

Step 1: Connectors
IDOL provides many different connectors through which you can access data from different sources. For example, IDOL has a SharePoint connector, a social media connector, and an Exchange connector. The connectors scan repositories and send files to CFS, where they are processed. By default, the files sent to CFS contain only metadata extracted from the repository; they contain both the metadata and the file content only after the KeyView filtering step (step 3). As discussed later, you can configure this process with pre-import and post-import tasks.

Step 2: Pre-import tasks
You can choose to run optional pre-import tasks on the metadata contained in the files before KeyView filtering takes place. In IDOL, import tasks let you manipulate incoming data from a repository to better suit your needs. For example, you can run a facial recognition import task. You can also run post-import tasks on the files after the KeyView filtering step, when the files contain both metadata and content (see step 4).

Step 3: KeyView
You might be wondering: what exactly is the KeyView step? In a nutshell, KeyView filters and extracts elements from the files and records you are retrieving, and you can use import tasks to customize what it does. For example, you can run a pre-import task that adds the AUTN_NO_FILTER field to the document. The AUTN_NO_FILTER field specifies that you do not want to extract document content, so when this field is set, CFS knows during the KeyView step not to extract the content. This is the case for our example; to get the file size and author information we want, we need only the metadata associated with the documents. The metadata that KeyView does extract is what ultimately ends up in our HP Vertica database.
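As an illustration, a pre-import task like this can be written as a small Lua script that CFS runs on each document before filtering. The sketch below follows the standard CFS Lua interface; the script path is hypothetical:

    -- lua/no_filter.lua (hypothetical path)
    -- CFS calls handler() once for each incoming document.
    function handler( document )
        -- AUTN_NO_FILTER tells CFS to skip content extraction during
        -- the KeyView step, so only metadata is indexed.
        document:addField( "AUTN_NO_FILTER", "true" )
        return true
    end

You would then register the script as a pre-import task in the CFS configuration file, for example with Pre0=Lua:lua/no_filter.lua in the [ImportTasks] section.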

Steps 4 and 5: Post-import tasks and Indexing
After CFS has processed the document and performed any post-import tasks (step 4), it automatically indexes the document (step 5). By default, CFS indexes the document into the index or indexes (separated by commas) specified by the IndexerSections parameter in the [Indexing] section of its configuration file. CFS can index into IDOL Server, IDOL OnDemand, and now, an HP Vertica database.

To have CFS index your information into Vertica, open the CFS configuration file and use the IndexerSections parameter to specify the name of the section that contains your indexing settings.

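A sketch of the relevant setting; verticaIndexer is the section name used in this example:

    [Indexing]
    IndexerSections=verticaIndexer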

Then, create a new section with the same name that you specified in the IndexerSections parameter.

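A sketch of what this section might contain. TableName names the flex table we create below; the IndexerType and ConnectionString values shown here are illustrative assumptions (CFS connects through the Vertica ODBC driver, as noted next):

    [verticaIndexer]
    IndexerType=Vertica
    ConnectionString=DSN=VerticaDSN
    TableName=myFlexTable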

Save and close the configuration file.

The Vertica indexer is part of the CFS product. However, to use it, you must have the Vertica ODBC drivers installed and configured on the same machine as CFS, because CFS uses the ODBC connection to send JSON-formatted data to the existing HP Vertica flex table.

Creating HP Vertica Flex Tables
Because metadata is variable, you need a destination that can handle variable data. HP Vertica flex tables (short for flexible tables) are designed especially for loading and querying semi-structured data, which makes them a perfect fit for use with IDOL CFS. Note that the flex table must already exist for CFS to insert data into it. In our example, we previously created a flex table called myFlexTable (see it listed under TableName in the Vertica indexer example above). When we created the flex table, we included column definitions for the data we want to retrieve, along with the CFS data that is inserted automatically.

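A minimal sketch of such a statement. The author and filesize columns are assumed example metadata; the other three columns hold the CFS fields described later in this post:

    CREATE FLEX TABLE myFlexTable (
        DREREFERENCE varchar,              -- unique document ID, inserted by CFS
        VERTICA_INDEXER_TIMESTAMP int,     -- indexing time, inserted by CFS
        VERTICA_INDEXER_DELETED boolean,   -- true if deleted from the repository
        author varchar,                    -- example metadata column (assumed)
        filesize int                       -- example metadata column (assumed)
    );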

We also created a projection to make sure we view only the latest record for any given document.

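One way to do this is a Top-K projection that keeps only the most recent record for each DREREFERENCE; this sketch assumes the table and column names used above:

    CREATE PROJECTION flexProjection AS
        SELECT DREREFERENCE, author, filesize, __raw__,
               VERTICA_INDEXER_TIMESTAMP, VERTICA_INDEXER_DELETED
        FROM myFlexTable
        LIMIT 1 OVER (PARTITION BY DREREFERENCE
                      ORDER BY VERTICA_INDEXER_TIMESTAMP DESC);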

For more information about HP Vertica flex tables, see the documentation here.

When CFS indexes data to an HP Vertica flex table, it issues a COPY command over ODBC with the JSON-formatted data.

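Conceptually, the command might look like the following, using the flex table JSON parser:

    COPY myFlexTable FROM STDIN PARSER fjsonparser();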

Our JSON data might look like this.

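The record below is hypothetical; the field values are made up for illustration:

    {
      "DREREFERENCE": "file:///docs/report.docx",
      "author": "J. Smith",
      "filesize": 24576,
      "VERTICA_INDEXER_TIMESTAMP": 1418169600,
      "VERTICA_INDEXER_DELETED": false
    }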

Here's where we can see the length of documents written by different authors. The record contains our expected metadata, like author and file size, but we also see some automatically inserted fields: DREREFERENCE, VERTICA_INDEXER_TIMESTAMP, and VERTICA_INDEXER_DELETED. DREREFERENCE is a unique document ID used by IDOL. VERTICA_INDEXER_TIMESTAMP is a timestamp that CFS inserts into the JSON record sent to HP Vertica, representing the time at which the information was indexed; it is used to distinguish and sort different versions of the record. VERTICA_INDEXER_DELETED is a Boolean value that, if true, denotes that the document was deleted from the source repository. You can use this field to filter out deleted documents.

Accessing Your Data in HP Vertica
Now that the data is in HP Vertica, we can access it as usual. To view the data, query the projection we created earlier.

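For example, using the column names assumed above:

    SELECT DREREFERENCE, author, filesize
    FROM flexProjection;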

Use the mapToString() function on the __raw__ column of flexProjection to inspect a record's full contents in readable JSON text format. Notice that with this statement we can see all the metadata that was extracted, even fields we didn't include as columns in our projection.

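A sketch of such a query; mapToString() renders the flex table's __raw__ VMap column as readable JSON text:

    SELECT mapToString(__raw__) FROM flexProjection;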

The new CFS Vertica Module opens up new possibilities for your data: you can now use all of the powerful IDOL features and integrate your data with HP Vertica for analysis. Stay tuned for more blogs about this new integration.

Learn more:
To read more about HP Vertica flex tables, see the flex table documentation.
If you are an IDOL customer, read more about IDOL CFS (password required).
See this post on our new community!

HP Vertica Best Practices: Resource Management

In a perfect world, every query you ever run would receive full attention from all system resources. And in a single user environment, when you are only running one query, this is in fact the case; the system can devote all its resources (CPU, memory, disk I/O, etc.) to your one query. But chances are, you are running more than one query at a time, maybe even more than you can count.

This means that queries have to share resources when they run. Since the performance of a given query depends on how many resources it has been allocated, it’s easy to see how things can get jammed up. Luckily for you, HP Vertica has a resource management Read More »

Announcing HP Distributed R


Today, HP announced HP Distributed R, a massive leap forward in the world of predictive and statistical analytics. A scalable, high-performance engine for the R language, HP Distributed R allows tasks to be split across multiple nodes, enabling scale where before there simply was none. Now data scientists can analyze billions of rows for regression, PageRank, and much more, all while using the familiar RStudio and R console relied on by an estimated two-million-strong user base. Below is an overview of a workshop hosted at HP Labs, focused on the newfound benefits of Distributed R.

Over the last two Read More »

Partner Interview: Logi Analytics & Haven OnDemand


Last week I sat down with Steven Schneider from Logi Analytics to discuss their integration with the new Haven OnDemand.

HP: What is new with the Haven OnDemand integration with Logi Analytics?

Logi Analytics: So, with IDOL and its new web-services oriented architecture as well as Vertica and its natural SQL layer, you can use Logi’s direct-connect model to not only query and access features, but also integrate that data with other sources.

For example, with Logi you can query your data using Vertica, return the data to IDOL, and execute an operation such as natural language translation, etc. Then take the result and create a visualization. It's Read More »

Enter to Win the March Data Madness Machine Learning Mania Contest!


If you’re a frequent visitor to our blog, you may recall reading about the March Data Madness Sentiment Tracker that we demonstrated at the MIT Sloan Sports Analytics Conference just before the 2013 NCAA Men’s Basketball “March Madness” Tournament.

The demonstration focused on tracking the “sentiment of the crowd” by collecting and analyzing roughly a half million tweets with the HP Vertica and HP IDOL engines. These results were displayed with a Tibco Spotfire dashboard and offered great conversation fodder at the event:

Volume of tweets by team
Volume of tweets by player
Positive, negative, and neutral sentiment groupings
Volume of tweets by Read More »

Getting Started With HP Vertica OnDemand

We recently introduced HP Vertica OnDemand. HP Vertica OnDemand is the massively parallel, super-fast analytics you know and love, coupled with the convenience and accessibility of the cloud. With HP Vertica OnDemand, you can rapidly go from signup to loading data without ever worrying about things like server hardware, configuration, or IT.

Scale your OnDemand service as your needs change. With a selection of tiered plans, you can add additional features and space to your OnDemand database as your business grows. Get started today! Check out this video to see how you can get started with HP Vertica OnDemand now.

For more information, see our HP Vertica OnDemand documentation.

To sign up for HP Vertica OnDemand, visit https://www.pronq.com/

Read More »

Thoughts About HP Vertica for SQL on Hadoop


HP recently announced HP Vertica for SQL on Hadoop. We've leveraged our years of experience in big data analytics and opened up our platform to let users tap into the full power of Hadoop. It's a rich, fast, and enterprise-ready implementation of SQL on Hadoop that we're very proud to introduce.

We know that you have a choice when it comes to SQL-on-Hadoop engines. There are several such engines on the market for a reason: they are a very powerful way to perform analytics on big data stored in Hadoop using the familiar SQL language. Users can leverage any reporting or analytical tool to Read More »
