Vertica

Archive for the ‘Technical’ Category

The HP Vertica Community has Moved!

Visit our new community site: https://community.dev.hp.com/.

The HP Vertica online community now has a new home. We’ve joined the Big Data and Analytics Community, part of the HP Developer Community, located at https://community.dev.hp.com/.

Why are we doing this?

We’ve joined the new community so that you’ll have a centralized place to go for all your big data questions and answers. Using the Big Data and Analytics Community, you can:

  • Connect with customers across all our Big Data offerings, including HP Vertica Enterprise and Community Editions, HP Vertica OnDemand, HP IDOL, and HP IDOL OnDemand.
  • Learn more about HP Haven, the HP Big Data Platform that allows you to harness 100% of your data, including business, machine, and human-generated data.
  • Browse blogs, forum posts, and best practices documents for all big data topics.

In short, the Big Data and Analytics Community provides you with one-stop shopping for product information, guidance on best practices, and solutions to technical problems.

What about the old content?

To preserve the rich exchange of knowledge in our previous community and forum, we migrated all of the content from our old forum to our new Big Data and Analytics location. All your questions and answers are accessible on the new forum.

Want a preview?

Here’s what the new community looks like:

[Annotated screenshot of the new community site]

We look forward to greeting you in our new space!

Index Data into Your HP Vertica Database with the New IDOL CFS Vertica Module

HP is pleased to announce the new IDOL CFS Vertica Module. The CFS Vertica module allows the HP Connector Framework Server (CFS) to index into an HP Vertica database.

The new indexing capability makes real integration between HP IDOL and HP Vertica possible, allowing you to use Vertica to perform analytics on data that has been indexed by IDOL. The CFS Vertica Module is compatible only with IDOL 10.9 and later and version 7.1.x and later of the Vertica server. In this blog, we’ll give you a high-level overview of how the new integration works by walking you through a simple example, described below.

Scenario:
Your organization has a large repository of documents, written by many different authors. You want to find the length of documents written by each individual author.

Using IDOL CFS with the HP Vertica Indexer
The power of IDOL allows CFS to process data it retrieves from connectors and index the information into HP Vertica. The process of getting data from a repository into HP Vertica can be broken down into the following five steps:

  1. Connectors scan files from repositories and send documents to CFS
  2. CFS performs pre-import tasks (optional)
  3. CFS uses KeyView to filter document content and extract sub-files
  4. CFS performs post-import tasks (optional)
  5. CFS indexes data into existing HP Vertica flex tables

[Diagram: connectors send files to CFS, which processes them and indexes the data into HP Vertica]

Step 1: Connectors
IDOL provides many different connectors through which you can access data from different sources. For example, IDOL has a SharePoint connector, a social media connector, and an Exchange connector. The connectors scan and send files to CFS, where they are processed. By default, the files sent to CFS contain only metadata extracted from the repository. The files contain both the metadata and the file content only after the KeyView filtering step (step 3). As discussed later, you can configure this process with pre-import and post-import tasks.

Step 2: Pre-import tasks
You can also choose to run optional pre-import tasks on the metadata contained in the files before KeyView filtering takes place. In IDOL, import tasks help you manipulate incoming data from a repository to better suit your needs. For example, you can run a facial recognition import task. You can also run post-import tasks on the files after the KeyView filtering step, when the files contain both metadata and content (see step 4).

Step 3: KeyView
You might be wondering: what exactly is the KeyView step? In a nutshell, KeyView filters and extracts elements from the files and records you retrieve. You can also use it to customize imports. For example, you can run a pre-import task that adds the AUTN_NO_FILTER field to a document; this field tells CFS not to extract the document's content during the KeyView step. That is exactly what we do in our example: to get the file size and author information we want, we need only the metadata associated with the documents. The metadata that KeyView does extract is what ultimately ends up in our HP Vertica database.

Steps 4 and 5: Post-import tasks and Indexing
After CFS has processed the document and performed any post-import tasks (step 4), it automatically indexes the document (step 5). By default, CFS indexes the document into the index or indexes (separated by commas) specified by the IndexerSections parameter in the [Indexing] section of its configuration file. CFS can index into IDOL Server, IDOL OnDemand, and now, an HP Vertica database.

To have CFS index your information into Vertica, open the CFS configuration file and use the IndexerSections parameter to specify the section containing the indexing setting, as shown here:

[Image: the IndexerSections setting in the CFS configuration file]
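Based on the surrounding text, that setting would look something like the following sketch (the section name vertica is our choice; confirm the exact syntax against the CFS documentation):

```ini
[Indexing]
; Name of the section (defined below) that configures the Vertica indexer.
IndexerSections=vertica
```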

Then, create a new section with the same name that you specified in the IndexerSections parameter:

[Image: the Vertica indexer section of the CFS configuration file]
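A sketch of what that indexer section might contain, using the flex table from this example; the DSN name is hypothetical, and parameter names should be checked against the CFS reference:

```ini
[vertica]
; Index through the Vertica ODBC connection.
IndexerType=Vertica
; ODBC connection string; myVerticaDSN is a hypothetical DSN name.
ConnectionString=DSN=myVerticaDSN
; The pre-existing flex table that receives the JSON records.
TableName=myFlexTable
```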

Save and close the configuration file.

The Vertica indexer is part of the CFS product. However, to use the Vertica indexer, you must have the Vertica ODBC drivers installed and configured on the same machine as CFS. This is necessary because CFS uses the ODBC connection to send JSON-formatted data to the existing HP Vertica flex table.

Creating HP Vertica Flex Tables
Because metadata is variable, you need a destination that can handle variable data. HP Vertica flex tables (short for flexible tables) are designed especially for loading and querying semi-structured data, which makes them a perfect fit for use with IDOL CFS. Note that the flex table must already exist before CFS can insert data into it. In our example, we previously created a flex table called myFlexTable (see it listed under TableName in the Vertica indexer example above). When we created the flex table, we included column definitions for the data we want to retrieve, along with the data that CFS inserts automatically:

[Image: CREATE FLEX TABLE statement]
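The statement might look roughly like this sketch. The metadata column names (author, filesize) are assumptions for this example; the CFS fields are the ones discussed below. Any other keys in the incoming JSON still land in the flex table's __raw__ map automatically.

```sql
-- A sketch: materialize the columns we care about; flex tables keep
-- all remaining keys in the __raw__ VMap column.
CREATE FLEX TABLE myFlexTable (
    DREREFERENCE              varchar,    -- IDOL's unique document ID
    author                    varchar,    -- assumed metadata field name
    filesize                  int,        -- assumed metadata field name
    VERTICA_INDEXER_TIMESTAMP timestamp,  -- inserted by CFS
    VERTICA_INDEXER_DELETED   boolean     -- inserted by CFS
);
```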

We also created a projection to make sure we view only the latest record for any given document:

[Image: CREATE PROJECTION statement]
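One way to express "latest record per document" is a Top-K projection that keeps only the newest row for each DREREFERENCE. This is a sketch, not the statement from the original post; check your Vertica version's support for Top-K projections:

```sql
-- Keep only the most recent record per document.
CREATE PROJECTION flexProjection AS
    SELECT DREREFERENCE, author, filesize,
           VERTICA_INDEXER_TIMESTAMP, VERTICA_INDEXER_DELETED, __raw__
    FROM myFlexTable
    LIMIT 1 OVER (PARTITION BY DREREFERENCE
                  ORDER BY VERTICA_INDEXER_TIMESTAMP DESC);
```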

For more information about HP Vertica flex tables, see the documentation here.

When CFS indexes data to an HP Vertica flex table, it issues a COPY command using ODBC with the JSON formatted data:

[Image: COPY statement]
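Conceptually, the load is equivalent to a COPY that uses fjsonparser, Vertica's built-in JSON parser for flex tables (a sketch):

```sql
-- Load JSON records into the flex table via the flex JSON parser.
COPY myFlexTable FROM STDIN PARSER fjsonparser();
```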

Our JSON data might look like this:

[Image: sample JSON record]
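A plausible record, with illustrative values (the author and filesize key names are our assumptions for the metadata fields; the other fields are described below):

```json
{
    "DREREFERENCE": "3f2a9c71e0b14d5e",
    "author": "Jane Doe",
    "filesize": 51200,
    "VERTICA_INDEXER_TIMESTAMP": "2015-02-10 14:32:07",
    "VERTICA_INDEXER_DELETED": false
}
```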

Here’s where we can see the length of documents written by different authors. The file contains our expected metadata, like author and file size, but we also see some automatically inserted fields: DREREFERENCE, VERTICA_INDEXER_TIMESTAMP, and VERTICA_INDEXER_DELETED. DREREFERENCE is a unique document ID used by IDOL. VERTICA_INDEXER_TIMESTAMP is a timestamp that CFS inserts into the JSON record sent to HP Vertica; it represents the time at which the information was indexed and is used to distinguish and sort different versions of the record. VERTICA_INDEXER_DELETED is a Boolean value that, when true, indicates that the document was deleted from the source repository. You can use this field to filter out deleted documents.

Accessing Your Data in HP Vertica
Now that the data is in HP Vertica, we can access it as usual. To view the data in Vertica, query the projection we created earlier:

[Image: SELECT statement against the projection]
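For instance, a sketch of a query answering our scenario, the length of documents per author, using the column names assumed earlier:

```sql
-- Document counts and total bytes per author, newest record per document.
SELECT author,
       COUNT(*)      AS docs,
       SUM(filesize) AS total_bytes
FROM   flexProjection
WHERE  NOT VERTICA_INDEXER_DELETED
GROUP  BY author
ORDER  BY total_bytes DESC;
```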

Use the mapToString() function on the __raw__ column of flexProjection to inspect its contents as readable JSON text. Notice that with this statement, we can see all the metadata that was extracted, even fields we didn’t materialize as columns in our projection:

[Images: mapToString() query and its JSON output]
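The inspection query might look like this sketch:

```sql
-- Render the full VMap (all extracted metadata) as readable JSON text.
SELECT DREREFERENCE, mapToString(__raw__)
FROM   flexProjection;
```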

Using the new CFS Vertica module, you open up new possibilities for your data. You now have the ability to use all of the powerful IDOL features and integrate your data with HP Vertica for analysis. Stay tuned for more blogs about this new integration.

Learn more:
To read more about HP Vertica flex tables, see the flex table documentation.
If you are an IDOL customer, read more about IDOL CFS (password required).
See this post on our new community!

HP Vertica Best Practices: Resource Management

In a perfect world, every query you ever run would receive full attention from all system resources. And in a single user environment, when you are only running one query, this is in fact the case; the system can devote all its resources (CPU, memory, disk I/O, etc.) to your one query. But chances are, you are running more than one query at a time, maybe even more than you can count.

This means that queries have to share resources when they run. Since the performance of a given query depends on how many resources it has been allocated, it’s easy to see how things can get jammed up. Luckily for you, HP Vertica has a resource management feature that can help you allocate resources in a manner that suits your business needs.

Watch this video for an overview of HP Vertica resource management and learn how it can help you optimize your HP Vertica database for use with third-party tools.

For more information about HP Vertica resource management, visit our documentation.

Check out our other video about the resource pool cascade feature, which can further help you with resource management.

Getting Started With HP Vertica OnDemand

We recently introduced HP Vertica OnDemand. HP Vertica OnDemand is the massively parallel, super-fast analytics you know and love, coupled with the convenience and accessibility of the cloud. With HP Vertica OnDemand, you can rapidly go from signup to loading data without ever worrying about things like server hardware, configuration, or IT.

Scale your OnDemand service as your needs change. With a selection of tiered plans, you can add additional features and space to your OnDemand database as your business grows.

Get started today! Check out this video to see how you can get started with HP Vertica OnDemand now.

For more information, see our HP Vertica OnDemand documentation.

To sign up for HP Vertica OnDemand, visit https://www.pronq.com/

Click here to learn more about installing HP Vertica client drivers.

HP Vertica Storage Location for HDFS

Do you find yourself running low on disk space on your HP Vertica database? You could delete older data, but that sacrifices your ability to perform historical queries. You could add new nodes to your cluster or add storage to your existing nodes. However, these options require additional expense.

The HP Vertica Storage Locations for HDFS feature introduced in HP Vertica Version 7.1 offers you a new solution: storing data on an Apache Hadoop cluster. You can use this feature to store data in a Hadoop Distributed File System (HDFS) while still being able to query it through HP Vertica.

Watch this video for an overview of the HP Vertica Storage Locations for HDFS feature and an example of how you can use it to free storage space on your HP Vertica cluster.

For more information about this feature, see the HP Vertica Storage Location for HDFS section of the documentation.

HP Vertica Best Practices: Native Connection Load Balancing

You may be aware that each client connection to a host in your HP Vertica cluster requires a small overhead in memory and processor time. For a single connection, this impact is minimal, almost unnoticeable. Now imagine you have many clients all connecting to the same host at the same time. In this situation, the compounded overhead can potentially affect database performance.

To limit the database performance consequences caused by multiple client connections, you might manually assign certain client connections to certain hosts. But this can become tedious and difficult as more and more client connections are added. Luckily, HP Vertica offers a feature that can do all this for you. It’s called native connection load balancing.

Native connection load balancing is available in HP Vertica 7.0 and later releases. It is a feature built into both the server and the client libraries that helps spread the CPU and memory overhead caused by client connections across the hosts in the database. When you enable native load balancing on the server and client, you won’t have to manually assign clients to specific hosts to reduce overhead.

Watch this best practices video to learn more about HP Vertica native connection load balancing and how to enable and disable it on the server and client.

For more information, see Native Connection Load Balancing in our documentation.

What Is a Range Join and Why Is It So Fast?

[Photo: view from the Asilomar Conference Grounds]

Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One HP Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.

First, you need to understand what kind of queries turn into range joins. Generally, these are queries with inequality (greater than, less than, or between) predicates. For example, a map of the IPv4 address space might give details about addresses between a start and end IP for each subnet. Or, a slowly changing dimension table might, for each key, record attributes with their effective time ranges.
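For instance, the subnet lookup might be phrased like this sketch (hypothetical table and column names):

```sql
-- For each observed address, find the subnet whose range contains it.
SELECT f.ip_addr, d.subnet_name
FROM   fact_events f
JOIN   subnet_map d
  ON   f.ip_addr BETWEEN d.start_ip AND d.end_ip;
```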

A rudimentary approach to handling such joins would be as follows: For each fact table row, check each dimension row to see if the range condition is true (effectively taking the Cartesian product and filtering the results). A more sophisticated, and often more efficient, approach would be to use some flavor of interval trees. However, HP Vertica uses a simpler approach based on sorting.

Basically, if the ranges don’t overlap very much (or at all), sorting the table by range allows sections of the table to be skipped (using a binary search or similar). For large tables, this can reduce the join time by orders of magnitude compared to “brute force”.

Let’s take the example of a table fact, with a column fv, which we want to join to a table dim using a BETWEEN predicate against attributes dv_start and dv_end (fv >= dv_start AND fv <= dv_end). The dim table contains the following data:

[Figure: contents of the dim table]

We can choose, arbitrarily, to sort the data on dv_start. This way, we can eliminate ranges that have a dv_start that is too large to be relevant to a particular fv value. In the second figure, this is illustrated for the lookup of an fv value of 62. The left shaded red area does not need to be checked, because 62 is not greater than these dv_start values.

[Figure: lookup of fv = 62; the shaded red regions are skipped, and only the blue region is scanned]

Optimizing dv_end is slightly trickier, because we have no proof that the data is sorted by dv_end (in fact, in this example, it is not). However, we can keep the largest dv_end seen in the table starting from the beginning, and search based on that. In this manner, the red area on the right can be skipped, because all of these rows have a dv_end that is not greater than 62. The part in blue, between the red areas, is then scanned to look for matches.
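The skipping logic can be sketched in a few lines of Python. This is an illustration of the idea, not Vertica's actual implementation; dim is a list of (dv_start, dv_end) pairs sorted by dv_start.

```python
import bisect

def range_join_lookup(fv, dim):
    """Return all (dv_start, dv_end) rows in dim whose range contains fv.

    dim must be sorted by dv_start. Two boundaries let us skip most rows:
      * rows after the last dv_start <= fv can never match;
      * rows before the first running-max dv_end >= fv can never match.
    """
    starts = [s for s, _ in dim]
    # Right boundary: first row whose dv_start exceeds fv (binary search).
    hi = bisect.bisect_right(starts, fv)

    # Running maximum of dv_end; in a real system this is precomputed once.
    prefix_max, m = [], float("-inf")
    for _, e in dim:
        m = max(m, e)
        prefix_max.append(m)

    # Left boundary: prefix_max is non-decreasing, so binary search works.
    lo = bisect.bisect_left(prefix_max, fv, 0, hi)

    # Only the narrow middle band needs an actual range check.
    return [(s, e) for s, e in dim[lo:hi] if s <= fv <= e]
```

With dim = [(0, 10), (20, 30), (40, 70), (60, 65), (80, 90)] and fv = 62, only the two middle rows are scanned, and both match.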

If you managed to follow the example, you can see our approach is simple. Yet it has helped many customers in practice. The IP subnet lookup case was the first prominent one, with a 1000x speedup. But if you got lost in this example, don’t worry… the beauty of languages like SQL is that there is a community of researchers and developers who figure these things out for you. So next time you see us at a conference, don’t hesitate to ask about HP Vertica features. You just might see a blog post about it afterward.
