HP is pleased to announce the new IDOL CFS Vertica Module. The CFS Vertica module allows the HP Connector Framework Server (CFS) to index into an HP Vertica database.
The new indexing capability makes real integration between HP IDOL and HP Vertica possible, allowing you to use Vertica to perform analytics on data that has been indexed by IDOL. The CFS Vertica Module is compatible only with IDOL 10.9 and later and version 7.1.x and later of the Vertica server. In this blog, we’ll give you a high-level overview of how the new integration works by walking you through a simple example, described below.
Your organization has a large repository of documents, written by many different authors. You want to find the length of documents written by each individual author.
Using IDOL CFS with the HP Vertica Indexer
The power of IDOL allows CFS to process data it retrieves from connectors and index the information into HP Vertica. The process of getting data from a repository into HP Vertica can be broken down into the following five steps:
- Connectors scan files from repositories and send documents to CFS
- CFS performs pre-import tasks (optional)
- CFS uses KeyView to filter document content and extract sub-files
- CFS performs post-import tasks (optional)
- CFS indexes data into existing HP Vertica flex tables
Step 1: Connectors
IDOL provides many different connectors through which you can access data from different sources. For example, IDOL has a SharePoint connector, a social media connector, and an Exchange connector. The connectors scan and send files to CFS, where they are processed. By default, the files sent to CFS contain only metadata extracted from the repository. The files contain both the metadata and the file content only after the KeyView filtering step (step 3). As discussed later, you can configure this process with pre-import and post-import tasks.
Step 2: Pre-import tasks
You can also choose to run optional pre-import tasks on the metadata contained in the files before KeyView filtering takes place. In IDOL, import tasks help you manipulate incoming data from a repository to better suit your needs. For example, you can run a facial recognition import task. You can also run post-import tasks on the files after the KeyView filtering step, when the files contain both metadata and content (see step 4).
Step 3: KeyView
You might be wondering, what exactly is the KeyView step? In a nutshell, KeyView filters and extracts elements from the files and records you are retrieving. You can also use it to customize imports. For example, you can run a pre-import task that adds the AUTN_NO_FILTER field to the document. The AUTN_NO_FILTER field specifies that you do not want to extract document content. Because we set this field, CFS knows during the KeyView step not to extract the document content. This is the case for our example; to get the file size and author information we want, we need only the metadata associated with the documents. The metadata that CFS does extract is what will ultimately end up in our HP Vertica database.
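To make the effect of AUTN_NO_FILTER concrete, here is a toy Python model of steps 2 and 3. It is not CFS code: the dictionary-based document, the `pre_import_task` and `keyview_filter` functions, and the `content` key are all hypothetical stand-ins. The only name taken from CFS is the AUTN_NO_FILTER field itself.

```python
# Toy model of the pre-import and filtering steps.
# Everything except the AUTN_NO_FILTER field name is hypothetical.

def pre_import_task(document: dict) -> dict:
    """Return a copy of the document marked so that content extraction is skipped."""
    marked = dict(document)
    marked["AUTN_NO_FILTER"] = ""  # in this model, the field's presence is the signal
    return marked


def keyview_filter(document: dict, file_text: str) -> dict:
    """Stand-in for filtering: attach content unless the marker field is present."""
    filtered = dict(document)
    if "AUTN_NO_FILTER" not in filtered:
        filtered["content"] = file_text
    return filtered


metadata_only = keyview_filter(
    pre_import_task({"author": "alice", "file_size": 1200}),
    "full text of the file",
)
print(metadata_only)  # metadata plus the marker field, but no "content" key
```

In this model, a document that skips the pre-import task would come out of `keyview_filter` with both its metadata and a `content` entry.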
Steps 4 and 5: Post-import tasks and Indexing
After CFS has processed the document and performed any post-import tasks (step 4), it automatically indexes the document (step 5). By default, CFS indexes the document into the index or indexes (separated by commas) specified by the IndexerSections parameter in the [Indexing] section of its configuration file. CFS can index into IDOL Server, IDOL OnDemand, and now, a Vertica database.
To have CFS index your information into Vertica, open the CFS configuration file and use the IndexerSections parameter to specify the section containing the indexing settings, as shown here:
Then, create a new section with the same name that you specified in the IndexerSections parameter.
Save and close the configuration file.
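The relationship between IndexerSections and the section it names can be sketched in Python with the standard `configparser` module. This is an illustration under assumptions, not the CFS configuration loader: the section name `vertica` is made up, a real Vertica indexer section needs further settings that are not shown, and we have not verified that every CFS configuration file parses cleanly with `configparser`. Only `[Indexing]`, `IndexerSections`, and `TableName` come from this post.

```python
import configparser

# Hypothetical configuration fragment; "vertica" is an invented section name.
SAMPLE_CONFIG = """
[Indexing]
IndexerSections=vertica

[vertica]
TableName=myFlexTable
"""


def indexer_sections(config_text: str) -> dict:
    """Resolve each name in IndexerSections to the settings of its section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(config_text)

    # IndexerSections may list several sections, separated by commas.
    raw = parser["Indexing"]["IndexerSections"]
    names = [name.strip() for name in raw.split(",") if name.strip()]

    missing = [name for name in names if not parser.has_section(name)]
    if missing:
        raise ValueError(f"IndexerSections names undefined sections: {missing}")
    return {name: dict(parser[name]) for name in names}


print(indexer_sections(SAMPLE_CONFIG))
```

The check for missing sections mirrors the rule in the steps above: every name listed in IndexerSections must have a section of the same name.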
The Vertica indexer is part of the CFS product. However, to use the Vertica indexer, you must have the Vertica ODBC drivers installed and configured on the same machine as CFS. This is necessary because CFS uses the ODBC connection to send JSON-formatted data to the existing HP Vertica flex table.
Creating HP Vertica Flex Tables
Because metadata is variable, you need a destination that can handle variable data. HP Vertica flex tables (short for flexible tables) are designed especially for loading semi-structured data into your HP Vertica database and querying it, which makes them a perfect fit for use with IDOL CFS. Note that the flex table must already exist for CFS to insert the data into it. In our example, we’ve previously created a flex table called myFlexTable (see it listed under TableName in the Vertica indexer code example above). When we created the flex table, we included column definitions for data we want to retrieve, along with CFS data that is inserted automatically:
We also created a projection to make sure we view only the latest record for any given document:
For more information about HP Vertica flex tables, see the HP Vertica documentation.
When CFS indexes data to an HP Vertica flex table, it issues a COPY command using ODBC with the JSON-formatted data:
Our JSON data might look like this:
Here’s where we can see the length of documents written by different authors. The file contains our expected metadata, like author and file size, but we also see some automatically inserted fields: DREREFERENCE, VERTICA_INDEXER_TIMESTAMP, and VERTICA_INDEXER_DELETED.
DREREFERENCE is a unique document ID used by IDOL. VERTICA_INDEXER_TIMESTAMP is a timestamp that CFS inserts in the JSON record sent to HP Vertica; it represents the time at which the information was indexed for Vertica and is used to distinguish and sort different versions of the JSON record. VERTICA_INDEXER_DELETED is a Boolean value that, if true, denotes that the document was deleted from the source repository. You can use this field to filter out deleted documents.
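To show how these three fields work together, here is a small Python sketch that keeps only the newest record for each document, drops deleted documents, and then totals file size per author, as in our example scenario. The records are invented: DREREFERENCE, VERTICA_INDEXER_TIMESTAMP, and VERTICA_INDEXER_DELETED are the field names described above, while AUTHOR, FILESIZE, every value, and the integer timestamp format are assumptions for illustration. In practice you would do this work in Vertica; the Python only illustrates the logic.

```python
import json
from collections import defaultdict

# Hypothetical records; AUTHOR, FILESIZE, and all values are made up.
RECORDS_JSON = """
[
  {"DREREFERENCE": "doc-1", "AUTHOR": "alice", "FILESIZE": 1200,
   "VERTICA_INDEXER_TIMESTAMP": 100, "VERTICA_INDEXER_DELETED": false},
  {"DREREFERENCE": "doc-1", "AUTHOR": "alice", "FILESIZE": 1500,
   "VERTICA_INDEXER_TIMESTAMP": 200, "VERTICA_INDEXER_DELETED": false},
  {"DREREFERENCE": "doc-2", "AUTHOR": "bob", "FILESIZE": 800,
   "VERTICA_INDEXER_TIMESTAMP": 150, "VERTICA_INDEXER_DELETED": false},
  {"DREREFERENCE": "doc-2", "AUTHOR": "bob", "FILESIZE": 800,
   "VERTICA_INDEXER_TIMESTAMP": 250, "VERTICA_INDEXER_DELETED": true},
  {"DREREFERENCE": "doc-3", "AUTHOR": "bob", "FILESIZE": 300,
   "VERTICA_INDEXER_TIMESTAMP": 120, "VERTICA_INDEXER_DELETED": false}
]
"""


def latest_live_records(records):
    """Keep the newest record per document, then drop documents marked deleted."""
    latest = {}
    for record in records:
        ref = record["DREREFERENCE"]
        current = latest.get(ref)
        if current is None or (
            record["VERTICA_INDEXER_TIMESTAMP"] > current["VERTICA_INDEXER_TIMESTAMP"]
        ):
            latest[ref] = record
    # Filter after picking the newest version, so a deleted document does not
    # fall back to an older, non-deleted record.
    return [r for r in latest.values() if not r["VERTICA_INDEXER_DELETED"]]


def total_size_by_author(records):
    """Sum FILESIZE per AUTHOR over the given records."""
    totals = defaultdict(int)
    for record in records:
        totals[record["AUTHOR"]] += record["FILESIZE"]
    return dict(totals)


live = latest_live_records(json.loads(RECORDS_JSON))
print(total_size_by_author(live))  # {'alice': 1500, 'bob': 300}
```

Note the order of operations: doc-2 is excluded entirely because its newest record is marked deleted, even though an older, non-deleted version exists.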
Accessing Your Data in HP Vertica
Now that the data is in HP Vertica, we can access it as usual. To view the data in Vertica, query the projection we created earlier:
You can also use the mapToString() function (with the __raw__ column of flexProjection) to inspect its contents in readable JSON text format. Notice that with this statement, we can see all the metadata that was extracted, even though we didn’t view it in our projection:
Using the new CFS Vertica module, you open up new possibilities for your data. You now have the ability to use all of the powerful IDOL features and integrate your data with HP Vertica for analysis. Stay tuned for more blogs about this new integration.