Author Archive

Thoughts About HP Vertica for SQL on Hadoop


HP recently announced HP Vertica for SQL on Hadoop. We’ve leveraged our years of experience in big data analytics and opened up our platform to allow users to tap into the full power of Hadoop. It’s a rich, fast, and enterprise-ready implementation of SQL on Hadoop that we’re very proud to introduce.

We know that you have a choice when it comes to SQL-on-Hadoop engines. There are several SQL-on-Hadoop engines on the market for a reason – they are a very powerful way to perform analytics on big data stored in Hadoop using the familiar SQL language. Users can leverage any reporting or analytical tool to analyze and study the data rather than write their own Java and MapReduce code.

However, not all SQL-on-Hadoop engines are created equal. We think HP Vertica for SQL on Hadoop has some very big differences. These include:

  • Platform Agnostic – When you adopt a SQL-on-Hadoop query engine, it may be tied to a single distribution of Hadoop. Not so with HP Vertica for SQL on Hadoop. Our implementation works with the Hortonworks, Cloudera and MapR distributions.
  • SQL Completeness – The richer the SQL engine, the wider the range of analytics that you can perform without extensive coding and data movement. You get a very rich set of analytical functions with HP Vertica for SQL on Hadoop. HP Vertica offers enterprise-ready, advanced analytics that support JOINs, complex data types, and other capabilities only available from our SQL on Hadoop implementation.
  • Manageability – Tools for managing queries and managing the resources of your cluster are fairly scarce and immature in the Hadoop world. However, with some of the tools we include, you can divide resources among different queries and different types of queries. If unplanned and resource-intensive queries have to be cancelled or temporarily interrupted, they can be.
  • Data Source Transparency – It’s important to be able to query common standard data storage formats such as Parquet, Avro and ORC. When you can use native formats, you avoid having to move the data.
  • Path to Optimization – When you need to boost performance, HP Vertica for SQL on Hadoop offers optimizations like compression, columnar storage, and projections.

You can’t really forget the fact that this offering comes from HP Software. Users should be able to take advantage of all the power of our Haven platform for big data. Encompassing proven technologies from HP Software, including Autonomy, Vertica, and ArcSight, Haven enables forward-thinking organizations to make use of virtually all information sources from both inside and outside their four walls to make better, faster decisions.

Download the report here.


Live Aggregate Projections with HP Vertica


The Dragline release of HP Vertica offers an exciting new feature that is unique in the world of big data analytics platforms. We now offer Live Aggregate projections as part of the platform. The impact is that you can really fly through certain types of big data analytics that typically grind down any analytics system.

Before I get into that, however, it’s important to back up and give some background on HP Vertica projections. Many databases use indexes and materialized views to improve query performance. However, these secondary structures have drawbacks. Materialized views and indexes can bloat and become a very inefficient way to optimize data analytics. They can be time-consuming to keep up-to-date during data loading, can require frequent rebuilding, and they can be tedious to manage.

HP Vertica has always had a better alternative to materialized views and indexes. Vertica has no raw uncompressed base tables, no materialized views, and no indexes. Our optimizations consist of optimized collections of table columns, which we call “projections”. There are several different types of projections. At the core, a projection is an optimized collection of pre-sorted columns that may contain some or all of the columns of one or more tables. A projection that joins one or more tables is called a pre-join projection, with the benefit of speeding up joins. A projection that contains a pre-calculated aggregate function such as average, top-K, sum, etc. is called an aggregate projection, which is a new feature of our Dragline release.

What’s cool about aggregate projections is that queries that rely on aggregate functions like SUM, MIN/MAX and COUNT no longer bog down the system with excessive I/O and calculation. Now, these aggregates can be computed and updated as data loads. The HP Vertica query optimizer creates the projections and always keeps them up-to-date, ready to answer your aggregate queries without having to grind and churn through the data.
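To make the idea of updating aggregates at load time concrete, here is a minimal conceptual sketch in Python. It is not how HP Vertica implements Live Aggregate projections; the class names (`RunningAggregate`, `LiveAggregate`) and the meter-reading data are hypothetical and exist only to illustrate why a query can be answered from a small, always-current summary instead of rescanning the raw rows.

```python
from dataclasses import dataclass


@dataclass
class RunningAggregate:
    """Summary values for one group, maintained incrementally."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        # Each new value updates the summary in constant time.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


class LiveAggregate:
    """Hypothetical stand-in for an aggregate that is kept current on load."""

    def __init__(self) -> None:
        self._groups: dict[str, RunningAggregate] = {}

    def load(self, rows) -> None:
        # rows is an iterable of (group_key, value) pairs, e.g. one load batch.
        for key, value in rows:
            self._groups.setdefault(key, RunningAggregate()).add(value)

    def summary(self, key: str) -> dict:
        # Answering the query reads the summary; no raw rows are scanned.
        agg = self._groups.get(key, RunningAggregate())
        return {
            "count": agg.count,
            "sum": agg.total,
            "min": agg.minimum,
            "max": agg.maximum,
        }


if __name__ == "__main__":
    usage = LiveAggregate()
    usage.load([("meter-1", 2.0), ("meter-2", 5.0)])  # first batch
    usage.load([("meter-1", 4.0)])                     # later batch
    print(usage.summary("meter-1"))
```

The trade-off in this sketch is the usual one for pre-computed aggregates: a little extra work per loaded row in exchange for aggregate queries that no longer depend on the size of the underlying data.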

In real life analytics situations, this new feature accelerates the speed and performance by computing metrics on the data as it arrives for targeted and personalized analytics without programming accelerator layers. It’s particularly powerful if you’re implementing smart metering applications, for example, where you are helping your customers understand their usage and compare it to others in the neighborhood. The aggregate information is available in the projection without having to recalculate it over and over again so your data analytics system is free to take on other workloads without the fuss.

Speeding up aggregate functions should help with many use cases for today and tomorrow. We live in a world where data volumes from smart devices such as smart buildings, mobile phones, GPS devices and sensors are ever-increasing. We’re finding value in leveraging this data to predict usage based on history, predict equipment failure, optimize heating/cooling/lighting costs, detect fraud and more. HP Vertica continues to believe that projections offer a superior solution to materialized views and indexes. Projections remove the trade-off between performance and data size and offer the ultimate in flexibility for fast big data analytics.

Enter the Flex Zone – Modernizing the Enterprise Data Warehouse

I’ve had the privilege of attending The Data Warehousing Institute’s (TDWI) conference this week. The Las Vegas show is usually one of their biggest gatherings. This year, there were about 600 of us gathered together to talk about the latest and greatest in the data warehouse and business intelligence world. HP Vertica was a sponsor.

The latest buzz was around the many new data discovery tools announced by vendors. Vendors recognize that there is a significant amount of undiscovered data in most businesses. As data warehouse teams go merrily along delivering daily analytics, piles and piles of dark data that might have value build up within the business. To innovate, users are recognizing that some of this unexplored data could be quite valuable, and it’s spurring on the development of a new breed of data discovery tools that enable users to develop new views of structured, semi-structured, and unstructured data.

Of course, this is the very reason that we have developed HP Vertica Flex Zone. The ability to ingest semi-structured data and use current visualization tools is one of the key tenets of HP Vertica Flex Zone. With HP Vertica Flex Zone, you can leverage your existing business intelligence (BI) and visualization tools to visually explore and draw conclusions from data patterns across a full spectrum of structured and semi-structured data. Analysts, data scientists, and business users can now explore and visualize information without burdening or waiting for their IT organizations to run the lengthy and costly ETL tools and processes typical of legacy databases and data warehouses.

Most agreed that special data discovery tools should converge with standard analytical platforms in the coming months. Discovery should be as much a part of your business as daily analytics.

There were some first-rate executive sessions led by Fern Halper and Philip Russom, who talked about the transformation of analytics over the years. Analytics has become more mainstream, more understood by the masses of business users. Therefore, innovation comes when we can deliver business intelligence for this new generation of information consumers.

The panel discussions and sessions focused very much on business value and put forth a call-to-action for some. Innovate. Feed the business users’ need for information that will help drive revenue, improve efficiency, and achieve compliance with regulations. It was clear that the data warehouse must be modernized (and that is happening today). Data warehouse pros aren’t satisfied with the daily static analytics that they delivered in the past. They are looking for new data sources, including big data, and new-age data analytic platforms to help achieve their business goals.

Get started modernizing your enterprise data warehouse – evaluate HP Vertica 7 today.

Enter the FlexZone – Let’s talk ETL

When (and When Not) to Use Data Integration with HP Vertica

In December, HP released version 7 of the HP Vertica analytics platform which includes, among others, a great new feature called HP Vertica Flex Zone (Flex Zone). Flex Zone enables you to quickly and easily load, explore and analyze some forms of semi-structured data. It eliminates the need for coding-intensive schemas to be defined or applied before the data is loaded for exploration.

One of Flex Zone’s important benefits is that it can save you hours of work setting up and managing data extraction. Rather than setting up schemas and mappings in an ETL tool and later worrying about whether the structure will change, the process is simplified with Flex Zone. Data is simply pulled into Flex Zone and the structure is automatically understood. Flex Zone is powerful for the exploration of common types of data. Flex Tables can immediately leverage:

  • Delimited data – semi-structured text files. These are often referred to as flat files because the information is not stored in a relational database.
  • JSON – A readable file that is often used in social media and new online applications

For these types of files, which are very common in modern IT infrastructure, you do not need an ETL tool to extract, transform and load the data. This functionality is included with Flex Zone and can save you many hours in pre-processing data for analytics. It can also save you time in the long run by lowering the need to monitor ETL processes. Other mechanisms let Vertica ingest data from other common big data sources such as Hive and HDFS.

Having a function in Flex Zone that automatically understands structure is powerful. This is something that normally takes time, slowing the overall process of exploration of the data. Should the structure of the data change, maintaining it is also time-consuming. By integrating these less structured data sources and supporting vanilla SQL queries against them, Vertica brings a key feature of relational databases to bear: abstracting the storage representation from the query semantics.
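As a rough illustration of what “understanding structure automatically” means, the Python sketch below loads JSON records without any schema declared up front, discovers the available columns from the data itself, and then answers a column-oriented query. This is a conceptual toy, not Flex Zone’s implementation; the function names (`load_records`, `discover_columns`, `select`) and the sample records are hypothetical.

```python
import json


def load_records(lines):
    """Parse one JSON object per non-empty line; no schema is declared first."""
    return [json.loads(line) for line in lines if line.strip()]


def discover_columns(records):
    """Collect the keys seen across all records, in first-seen order."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def select(records, columns):
    """Return one tuple per record; keys a record lacks come back as None."""
    return [tuple(record.get(column) for column in columns) for record in records]


if __name__ == "__main__":
    raw = [
        '{"user": "ann", "likes": 3}',
        '{"user": "bob", "city": "Austin"}',  # structure differs from the first row
    ]
    records = load_records(raw)
    print(discover_columns(records))
    print(select(records, ["user", "city"]))
```

Note how the second record adds a new key without any schema change: the query layer simply treats a missing key as a null value, which is the kind of separation between storage representation and query semantics the paragraph above describes.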

ETL – Extract Transform Load

However, most ETL tools offer hundreds of connectors that allow for connection into anything from Salesforce to Oracle to DB2 on the mainframe. For these types of uses, you can either use an ETL tool or export data from the application into a supported format in order to use Flex Tables.

With ETL, users take on the process of extracting data and transforming it to make it fit-for-purpose. The longer process may be necessary, however. During the ETL process, users can ensure that the data conforms to the schema and that data quality standards are upheld. Users can establish business rules and reject any records that don’t conform to standards. Users can recode certain values in the data to standardize them (e.g. ST, Street, strt can be recoded to ‘STREET’). Users can also extract data from sources that have proprietary formats, like SAP, MS SQL and AS/400 and hundreds of others. Therefore, to deliver accurate analytics and gain access to unusual file formats, ETL is still necessary for certain data.
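The transform step described above can be sketched in a few lines of Python. This is a minimal illustration under assumed rules, not the behavior of any particular ETL product: the required fields (`id`, `address`), the list of street variants, and the function names are all hypothetical.

```python
# Hypothetical spellings that should all be standardized to "STREET".
STREET_VARIANTS = {"st", "st.", "street", "strt"}


def recode_street(address):
    """Replace any known street variant in the address with 'STREET'."""
    words = address.split()
    return " ".join(
        "STREET" if word.lower() in STREET_VARIANTS else word for word in words
    )


def transform(records, required=("id", "address")):
    """Split records into accepted (cleaned) and rejected (non-conforming)."""
    accepted, rejected = [], []
    for record in records:
        # Business rule (assumed): every required field must be present and non-empty.
        if any(not record.get(field) for field in required):
            rejected.append(record)
            continue
        cleaned = dict(record)  # copy so the source record is left untouched
        cleaned["address"] = recode_street(record["address"])
        accepted.append(cleaned)
    return accepted, rejected


if __name__ == "__main__":
    rows = [
        {"id": 1, "address": "12 Main st"},
        {"id": 2, "address": ""},            # fails the required-field rule
        {"id": 3, "address": "9 Oak strt"},
    ]
    good, bad = transform(rows)
    print(good)
    print(bad)
```

Keeping the rejected records in a separate list, rather than silently dropping them, mirrors the way ETL jobs typically route non-conforming rows to a reject file for later review.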

Some companies need to keep track of where data came from and what was changed in the data. The Data Lineage features of many ETL tools help you track where a change occurred. The result of the data lineage shows in a report which traces a change from the target end component of a Job up to the source end. If this is an important part of your process, you may need an ETL tool.

The good news is that Vertica has several partners who offer free open source ETL with support for Vertica, like Talend and Pentaho, as well as commercial partners like Syncsort, Informatica and others. See for a complete list.

Data Governance and Chicago’s CampIT event

I recently had the privilege to talk at a CampIT event in Chicago, a very well-attended event at the Stephens Convention Center near Chicago’s O’Hare airport. Analytics professionals gathered and shared ideas on technologies like Hadoop, big data analytics, columnar store databases and in-memory technologies – to name just a few of the topics.

Challenges of Modern Analytics

In my presentation, I covered some of the challenges in modern analytics. Perhaps the biggest technical challenge we’re facing is the ever-growing volume of data in our organizations. More data means that our legacy analytical solutions are slowing, making it harder and harder to deliver analytics at the right time for the right audience. Business users may lack the technical understanding of how this affects them. They only know that they can’t get answers and business intelligence as readily as they need to.

Another challenge is that IT professionals continue to be asked to do more with less funding. According to Gartner, IT spending increased only about 0.8% this year. IT is spending all of its funds on keeping the wheels on the bus spinning, but few funds in IT are available to innovate. Other budgets, like marketing and sales technology spend, are increasing, however. IT is still seen as a cost center in many organizations, while the business side is considered to be revenue-generating.

Data Governance Can Help

Data governance can help us tap into the business-focused budgets with a couple of important edicts:

    1. IT should form an alliance with business users
    Take a real interest in some of the challenges that your business users have by inviting them for coffee or giving them an opportunity to beef about their challenges.

    2. IT should focus on important business aspects of the IT initiative
    If you ask your business users, the most important aspects of IT aren’t technical. The three most important business aspects of any initiative are revenue, efficiency and compliance. IT should be trying hard to help the company make more money, be more efficient in the way that day-to-day business is done, and comply with state, local, federal or industry regulations.

    3. The data governance team should initially pick projects that can provide quick return on investment and track benefits.
    Quick wins that are profitable to the corporation form an agile approach to data governance. Initiatives shouldn’t take months or years, but days or weeks. When users see the value that IT is bringing to the organization, they will want to work with you on solving their issues.

    4. Analytics is just one of the systems of opportunity to begin your data governance initiative.
    Providing fast analytics with Vertica’s help is just one system of opportunity to move your data governance initiative forward.

Tap into Business Budgets

By understanding your business users’ needs, providing a strong ROI and talking about the business benefits of Vertica, you can sell the benefits of big data analytics into your organization. Again, it’s about revenue, efficiency and compliance in your business. It speaks to revenue when you have execution windows to run analysis that you have never had before and can now find new ways to reach your customers. It speaks to efficiency when you increase speed, typically hundreds of times faster than the old way of doing analytics, and avoid worries about a long analysis taking up too much processing time. It speaks to compliance when you can deliver analysis that’s fast and accurate, and that you don’t have to check and re-check before you deliver it to a broader audience.
