Vertica

Archive for the ‘Uncategorized’ Category

Thoughts About HP Vertica for SQL on Hadoop

Recently, HP has announced HP Vertica for SQL on Hadoop. We’ve leveraged our years of experience in big data analytics and opened up our platform to allow users to tap into the full power of Hadoop. It’s a rich, fast, and enterprise-ready implementation of SQL on Hadoop that we’re very proud to introduce.

We know that you have a choice when it comes to SQL-on-Hadoop engines. There are several on the market for a reason: they are a very powerful way to perform analytics on big data stored in Hadoop using the familiar SQL language. Users can leverage any reporting or analytical tool to analyze and study the data rather than writing their own Java and MapReduce code.

However, not all SQL-on-Hadoop engines are created equal. We think HP Vertica for SQL on Hadoop has some very big differences. These include:

  • Platform Agnostic – When you adopt a SQL-on-Hadoop query engine, it may be tied to a single distribution of Hadoop. Not so with HP Vertica for SQL on Hadoop. Our implementation works with the Hortonworks, Cloudera, and MapR distributions.
  • SQL Completeness – The richer the SQL engine, the wider the range of analytics you can perform without extensive coding and data movement. You get a very rich set of analytical functions with HP Vertica for SQL on Hadoop: enterprise-ready, advanced analytics that support JOINs, complex data types, and other capabilities available only from our SQL-on-Hadoop implementation.
  • Manageability – Tools for managing queries and cluster resources are fairly scarce and immature in the Hadoop world. With the tools we include, you can divide resources among different queries and different types of queries, and unplanned, resource-intensive queries can be cancelled or temporarily interrupted when necessary.
  • Data Source Transparency – You can query data in common standard storage formats such as Parquet, Avro, and ORC. Because the native formats are read in place, you avoid having to move the data.
  • Path to Optimization – When you need to boost performance, HP Vertica for SQL on Hadoop offers optimizations such as compression, columnar storage, and projections.
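
To make the data-source-transparency point concrete, here is a hedged sketch of querying Parquet files in place through an external table. The table name, columns, and HDFS path are invented for illustration, the exact parser keyword and URL scheme vary by Vertica version, and the vertica-python client is an assumption; check the documentation for your release.

```python
# Hypothetical sketch: expose Parquet files in HDFS as an external table so
# SQL queries read them where they sit, with no data movement.
# Table name, columns, and path are illustrative only.
CREATE_EXTERNAL = """
CREATE EXTERNAL TABLE web_clicks (
    user_id  INT,
    click_ts TIMESTAMP,
    url      VARCHAR(2048)
) AS COPY FROM 'hdfs:///data/clicks/*.parquet' PARQUET;
"""

QUERY = "SELECT COUNT(*) FROM web_clicks WHERE click_ts > '2014-01-01';"

def define_and_query(cursor):
    # cursor is a DB-API cursor, e.g. from the vertica-python client.
    cursor.execute(CREATE_EXTERNAL)
    cursor.execute(QUERY)
    return cursor.fetchone()[0]
```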

And don’t forget that this offering comes from HP Software, so users can take advantage of the full power of our Haven platform for big data. Encompassing proven technologies from HP Software, including Autonomy, Vertica, and ArcSight, Haven enables forward-thinking organizations to make use of virtually all information sources, both inside and outside their four walls, to make better, faster decisions.


HP Vertica Storage Location for HDFS

Do you find yourself running low on disk space on your HP Vertica database? You could delete older data, but that sacrifices your ability to perform historical queries. You could add new nodes to your cluster or add storage to your existing nodes. However, these options require additional expense.

The HP Vertica Storage Locations for HDFS feature introduced in HP Vertica Version 7.1 offers you a new solution: storing data on an Apache Hadoop cluster. You can use this feature to store data in a Hadoop Distributed File System (HDFS) while still being able to query it through HP Vertica.
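
In outline, the feature pairs a CREATE LOCATION statement pointing at HDFS with a storage policy that keeps cold data on that tier. The sketch below is hypothetical: the node name, path, label, and table are invented, the exact URL scheme depends on your version and configuration, and the vertica-python client is assumed.

```python
# Hypothetical sketch: tier older data to HDFS while keeping it queryable.
# Host names, paths, labels, and object names are examples only.
STATEMENTS = [
    # 1. Register an HDFS path as an additional 'data' storage tier on all nodes.
    "CREATE LOCATION 'webhdfs://namenode:50070/vertica/cold' "
    "ALL NODES USAGE 'data' LABEL 'coldstorage';",
    # 2. Tell Vertica to keep this table's data on the labeled tier.
    "SELECT SET_OBJECT_STORAGE_POLICY('sales_2012', 'coldstorage');",
]

def archive_to_hdfs(cursor):
    # cursor is a DB-API cursor, e.g. from the vertica-python client.
    for stmt in STATEMENTS:
        cursor.execute(stmt)
```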

Watch this video for an overview of the HP Vertica Storage Locations for HDFS feature and an example of how you can use it to free storage space on your HP Vertica cluster.

For more information about this feature, see the HP Vertica Storage Location for HDFS section of the documentation.

HP Vertica Best Practices: Native Connection Load Balancing

You may be aware that each client connection to a host in your HP Vertica cluster requires a small overhead in memory and processor time. For a single connection, this impact is minimal, almost unnoticeable. Now imagine you have many clients all connecting to the same host at the same time. In this situation, the compounded overhead can potentially affect database performance.

To limit the performance impact of multiple client connections, you might manually assign certain clients to certain hosts. But this becomes tedious and difficult as more and more client connections are added. Luckily, HP Vertica offers a feature that does all of this for you: native connection load balancing.

Native connection load balancing is available in HP Vertica 7.0 and later releases. It is a feature built into both the server and the client libraries that helps spread the CPU and memory overhead caused by client connections across the hosts in the database. When you enable native load balancing on the server and client, you won’t have to manually assign clients to specific hosts to reduce overhead.
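
Enabling it takes two steps: a one-time server-side policy, and an opt-in flag on each client connection. The Python sketch below assumes the vertica-python driver (the JDBC and ODBC clients expose an equivalent ConnectionLoadBalance property); the host, credentials, and database name are placeholders.

```python
# Server side (run once via any SQL client): choose a load-balancing policy.
ENABLE_POLICY = "SELECT SET_LOAD_BALANCE_POLICY('ROUNDROBIN');"

# Client side: opt in, so the initial host may redirect this connection to a
# less-loaded node. Assumes the vertica-python driver; values are placeholders.
conn_info = {
    "host": "node01.example.com",     # initial contact node
    "port": 5433,
    "user": "dbadmin",
    "password": "example",
    "database": "VMart",
    "connection_load_balance": True,  # the client-side opt-in flag
}

# Usage would look roughly like:
#   import vertica_python
#   with vertica_python.connect(**conn_info) as conn:
#       ...
```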

Watch this best practices video to learn more about HP Vertica native connection load balancing and how to enable and disable it on the server and client.

For more information, see Native Connection Load Balancing in our documentation.

Workshop on Distributed Computing in R

R is used by millions of data scientists. In the near future, these data scientists will have to rely on distributed computing to meet the computational demands of Big Data. Wouldn’t it be helpful if R provided simple ways to harness the power of multiple servers?

HP is hosting an R workshop on January 26-27, 2015, where R users will brainstorm on this topic. The workshop is being organized by Indrajit Roy, Principal Researcher at HP Labs, and Michael Lawrence, R-core member at Genentech. A number of well-known R contributors, including members affiliated with universities, national labs, and industry, will present their views at the workshop.

Here is a summary of the workshop goals, as stated on the workshop’s web page:

“As data sizes increase, so does the need to provide R users with tools to efficiently analyze large datasets. The goal of this workshop is to standardize the API for exposing distributed computing in R, learn from the experiences of attendees in using R for large scale analysis, and collaborate in open source. We want to encourage R contributors (including students) to implement parallel versions of their favorite algorithms. By standardizing the infrastructure for distributed computing, we will be able to increase the availability of parallel algorithms in R, and ensure that R is an appealing choice even for analysis on really large data.”
Read more about the workshop and its agenda on the workshop’s web page.

Get involved if you are interested in R and Big Data!

What Is a Range Join and Why Is It So Fast?

[Photo: view from the Asilomar Conference Grounds]

Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One HP Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.

First, you need to understand what kind of queries turn into range joins. Generally, these are queries with inequality (greater than, less than, or between) predicates. For example, a map of the IPv4 address space might give details about addresses between a start and end IP for each subnet. Or, a slowly changing dimension table might, for each key, record attributes with their effective time ranges.

A rudimentary approach to handling such joins would be as follows: For each fact table row, check each dimension row to see if the range condition is true (effectively taking the Cartesian product and filtering the results). A more sophisticated, and often more efficient, approach would be to use some flavor of interval trees. However, HP Vertica uses a simpler approach based on sorting.

Basically, if the ranges don’t overlap very much (or at all), sorting the table by range allows sections of the table to be skipped (using a binary search or similar). For large tables, this can reduce the join time by orders of magnitude compared to “brute force”.

Let’s take the example of a table fact, with a column fv, which we want to join to a table dim using a BETWEEN predicate against attributes dv_start and dv_end (fv >= dv_start AND fv <= dv_end). The dim table contains the following data:

[Figure: contents of the dim table, with dv_start and dv_end columns]

We can choose, arbitrarily, to sort the data on dv_start. This way, we can eliminate ranges whose dv_start is too large to be relevant to a particular fv value. In the second figure, this is illustrated for the lookup of an fv value of 62. The left shaded red area does not need to be checked, because 62 is less than these dv_start values, so the BETWEEN predicate cannot hold.

[Figure: the dim table with the skipped (red) regions and the scanned (blue) region for fv = 62]

Optimizing on dv_end is slightly trickier, because we have no guarantee that the data is also sorted by dv_end (in fact, in this example, it is not). However, we can keep the largest dv_end seen so far while scanning from the beginning of the table, and search based on that running maximum. In this manner, the red area on the right can be skipped, because all of these rows have a dv_end that is less than 62. The part in blue, between the red areas, is then scanned to look for matches.
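
The sort-and-skip idea above can be sketched in a few lines of Python. This is an illustration of the technique, not Vertica's actual implementation; the data values are invented, and the table is sorted ascending on dv_start here.

```python
import bisect

def build_range_index(dim):
    """dim: iterable of (dv_start, dv_end) pairs; sorted ascending on dv_start."""
    rows = sorted(dim)
    starts = [s for s, _ in rows]
    max_end, cur = [], float("-inf")
    for _, e in rows:              # running maximum of dv_end: non-decreasing,
        cur = max(cur, e)          # so we can binary-search it below
        max_end.append(cur)
    return rows, starts, max_end

def range_lookup(index, fv):
    rows, starts, max_end = index
    hi = bisect.bisect_right(starts, fv)         # skip rows with dv_start > fv
    lo = bisect.bisect_left(max_end, fv, 0, hi)  # skip prefix whose max dv_end < fv
    # scan only the band between the two skipped regions
    return [(s, e) for s, e in rows[lo:hi] if s <= fv <= e]

idx = build_range_index([(1, 10), (20, 30), (40, 50), (60, 70), (80, 90)])
matches = range_lookup(idx, 62)   # [(60, 70)]
```

With mostly non-overlapping ranges, both skips are binary searches, so each probe touches only a small band of the table instead of every row, which is where the orders-of-magnitude win over the brute-force scan comes from.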

If you managed to follow the example, you can see that our approach is simple. Yet it has helped many customers in practice; the IP subnet lookup case was the first prominent one, with a 1000x speedup. And if you got lost in this example, don’t worry… the beauty of languages like SQL is that there is a community of researchers and developers who figure these things out for you. So next time you see us at a conference, don’t hesitate to ask about HP Vertica features. You just might see a blog post about it afterward.

The HP Vertica Community is Moving!

The HP Vertica online community will soon have a new home. In the next few months, we’ll be joining the Big Data and Analytics Community, part of the HP Developer Community, located at https://community.dev.hp.com/.

Why are we doing this?

We’re joining the new community so that you’ll have a centralized place to go for all your big data questions and answers. Using the Big Data and Analytics Community, you will be able to:

  • Connect with customers across all our Big Data offerings, including HP Vertica Enterprise and Community Editions, HP Vertica OnDemand, HP IDOL, and HP IDOL OnDemand.
  • Learn more about HP Haven, the HP Big Data Platform that allows you to harness 100% of your data, including business, machine, and human-generated data.

In short, the Big Data and Analytics Community will provide you with one-stop shopping for product information, guidance on best practices, and solutions to technical problems.

What about existing content?

To preserve the rich exchange of knowledge in our current community and forum, we are migrating all of the content from our current forum to our new Big Data and Analytics location. All your questions and answers will be saved and accessible on the new forum.

When will this happen?

The migration process is just beginning and we estimate it will take a number of weeks. As the new launch date nears, we’ll share more information with you about the actions you’ll need to take to access the new forum.

Want a preview?

Here’s a sneak peek at the new community plans:

[Screenshot: annotated preview of the new community site]

We look forward to greeting you in our new space! Stay tuned for more detailed information to come.

HP Vertica Gives Back this Holiday Season

EastEndHouseThanks

This holiday season, four teams of HP Vertica employees and families made a trip to East End House in Cambridge, MA to help with the annual Thanksgiving Basket Giveaway. If this organization sounds familiar, you might have read our blog about our summer interns visiting the same location to work with students to build bridges made of toothpicks and gumdrops.

This time around, Vertica volunteers assisted with a program that provided food to individuals and families for Thanksgiving. On Monday, the team helped stuff hundreds of bags with donated goods like whole frozen turkeys, boxed stuffing, canned fruits and vegetables, potatoes, and even fresh kale. They bagged over 22 thousand pounds of fresh produce! All of these items were generously donated by individuals and companies. The following day, more Vertica volunteers helped distribute the (now overflowing) bags to over 1,200 families to enjoy this Thanksgiving.

The HP Vertica volunteers are pleased to know they contributed. In the words of Tim Severyn, East End House’s Director of Community Programs, “we couldn’t have done it without you.”

East Cambridge is thankful to have a community center that provides such a great service to local families and HP Vertica looks forward to working with it in the future!

Learn more about East End House and how you can give back to the community here: http://www.eastendhouse.org/get-involved

Get Started With Vertica Today

Subscribe to Vertica