Vertica

Archive for September, 2014

All Aboard the Modern Data Architecture Roadshow!

Fresh on the heels of HP and Hortonworks’ partnership, HP is sponsoring the upcoming #Hadoop and the Modern Data Architecture roadshow. We are proud to be Gold sponsors of the event in Tyson’s Corner, VA on September 23rd, 2014. The day-long workshop will focus on gaining insight into the business value derived from Hadoop, understanding the technical role of Hadoop within your data center, and looking at the future of Hadoop as presented by the builders and architects of the platform.

Our HP team at the event will focus on HAVEn, our big data analytics solution, and we’re sending some of our top experts from Vertica and Autonomy to answer any questions you might have. Vertica is HP’s next-generation analytics platform, focused on handling structured data; the HP Vertica engine is helping data analysts all over the world perform ad hoc queries in record time. Within HAVEn, Vertica is paired with HP Autonomy, which handles the unstructured-data analysis portion of your big data needs.

[Image: the components of the HP HAVEn platform]

  • Hadoop/HDFS: Catalogue massive volumes of distributed data.
  • Autonomy IDOL: Process and index all information.
  • Vertica: Analyze at extreme scale in real-time.
  • Enterprise Security: Collect & unify machine data.
  • nApps: Powering HP Software + your apps.

Come by the conference tomorrow to learn all about how HP HAVEn and Hortonworks work together to meet your Big Data needs. Get more information here.

We look forward to seeing you there!

What’s New in Dragline (7.1.0): Top-K Projections

Top-K Projections Video

Top K Projections from Vertica Systems on Vimeo.

HP Vertica 7.1 introduces Top-K projections. A Top-K projection is a type of live aggregate projection that returns the top k rows from a partition.

Top-K projections are useful when you want to retrieve the top rows from frequently updated, aggregated data. For example, say you want to view a list of the 5 singers who have the most votes in a competition. That list changes every time a vote comes in. But because the data is aggregated as it is loaded into your table and its Top-K projection, all you have to do is retrieve the top 5 rows to see the top 5 singers at any moment. Using a Top-K projection is more efficient than aggregating (counting all the votes) every time you want to find the top 5 singers.
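To make this concrete, here is a minimal sketch of what such a Top-K projection might look like. The vote_totals table, its columns, and the projection name are hypothetical; the statements follow the CREATE PROJECTION ... LIMIT ... OVER form that 7.1 introduces for Top-K projections.

    -- Hypothetical table of running vote totals, per singer and contest round.
    CREATE TABLE vote_totals (
        contest_round INT,
        singer        VARCHAR(64),
        total_votes   INT
    );

    -- Top-K projection: keep the 5 highest vote totals within each round,
    -- maintained as rows are loaded rather than recomputed at query time.
    CREATE PROJECTION vote_totals_top5
    AS SELECT contest_round, singer, total_votes
       FROM vote_totals
       LIMIT 5 OVER (PARTITION BY contest_round ORDER BY total_votes DESC);

    -- A matching Top-K query, which Vertica can answer from the projection
    -- instead of scanning and sorting the full table.
    SELECT contest_round, singer, total_votes
    FROM vote_totals
    LIMIT 5 OVER (PARTITION BY contest_round ORDER BY total_votes DESC);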

Check out this video to learn more about Top-K projections, and stay tuned for our next video, ‘Projections with Expressions’.

Top-K Documentation

How Vertica Met Facebook’s 35 TB/hour Ingest SLA


There were many interesting technical sessions at the 2014 Vertica Big Data Conference, but I feel the most captivating session was on loading 35 TB per hour. The sheer number of nodes in the clusters, the amount of data ingested, and the use cases made for a truly educational session. This session covered how Vertica was able to meet Facebook’s ingest SLA (Service Level Agreement) of 35 TB/hour.

Background

Facebook’s SLA required continuous loading that would not disrupt queries, while still budgeting enough capacity to catch up after maintenance windows, unplanned outages, or ETL process issues. With continuous loading, planning for such events comes down to either catching up on missed batches or skipping them. There are also numerous sizing considerations, such as the average and peak ingest volumes. There may also be an availability SLA that requires users to have access to data within a certain amount of time after an event happens.

In planning for Facebook’s SLA, the processes that drove the large fact tables were scheduled to run every 15 minutes, or 4 times an hour. Factoring in the size of the largest possible batch, and allowing time for batches to catch up, the required rate worked out to 35 TB/hour. Achieving that rate required writing to disk once at a speed of 15,000 MB/sec.

Discovery

The first stage of the discovery process was to determine how fast one node can parse with a COPY statement. For the test, a COPY was initiated from any node, with the files staged on node 1. Only the CPU on node 1 performed parsing; the parsed rows were then delivered to all nodes and sorted on each node. A rate of 170 MB/sec could be achieved with 4 COPY statements running in parallel (on one-year-old hardware); adding more statements did not increase the parse rate. In this configuration, memory was consumed mainly during the sort phase.

Scaling out to parse on multiple nodes, the throughput scaled nearly linearly to about 510 MB/sec. In this test, a COPY was initiated from any node, with files staged on nodes 1, 2, and 3. Again, only the CPUs on nodes 1, 2, and 3 performed parsing; the parsed rows were then delivered to all nodes and sorted on each node. Once 50-60 nodes were parsing, more memory was consumed by the receive side than by the sort, and throughput does not scale well past that point. Vertica is designed to ensure there is never head-of-line blocking in a query plan, which means that beyond 50 or 60 parsing nodes the overhead of that many streams converging on each node requires a large amount of memory for receive buffers.
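As a rough sketch of this kind of distributed parse (the table name, file paths, and node names here are made up), a single COPY statement can point at files staged on several nodes, so that each named node parses its own local files:

    -- One COPY statement with parsing spread across three nodes: each ON clause
    -- makes that node parse its locally staged files, while the sorted data is
    -- still distributed across every node in the cluster.
    COPY fact_events
    FROM '/staging/batch_0001/*.gz' ON v_fb_node0001 GZIP,
         '/staging/batch_0002/*.gz' ON v_fb_node0002 GZIP,
         '/staging/batch_0003/*.gz' ON v_fb_node0003 GZIP
    DELIMITER '|'
    DIRECT;

The single-node test described earlier is simply the degenerate case with one ON clause, with several such statements run in parallel to keep the parsing CPU busy.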

At 50 to 60 nodes, the overhead of running a COPY statement is large enough that it becomes difficult to run 3 or 4 statements in parallel and continue to scale. If the cluster were simply scaled up, queries would be disrupted because the nodes would be completely utilized for 10 minutes at a time. Running COPY for 15 minutes would leave some CPU for queries, but not enough headroom to catch up. In addition, Facebook required no more than a 20% degradation in query performance.

Ephemeral Nodes

Scaling further, to 180 nodes, a COPY statement remains expensive to run in terms of memory. The best approach is to isolate the workload, separating data loading from querying, using Vertica’s ephemeral node feature. This feature is typically used in preparation for removing a node from a cluster; in this solution, however, some nodes were marked ephemeral before any data was loaded. Data loaded into the database was therefore never stored on the ephemeral nodes, and those nodes were not used for querying.
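As a hedged sketch (the node name is made up, and the exact ALTER NODE syntax should be confirmed against the documentation for your release), marking a node ephemeral looks roughly like this:

    -- Mark a loader node as ephemeral: it can still parse COPY input, but newly
    -- loaded data is not stored on it and it does not participate in queries.
    ALTER NODE v_fb_node0151 EPHEMERAL;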

With the loader nodes isolated from querying, their CPUs could be utilized 100%. The implementation required adjusting resource pool scaling on the ingest nodes only, using an undocumented parameter, SysMemoryOverrideMB, set in vertica.conf on those nodes. There are additional considerations around memory utilization on the ingest nodes, such as setting aside memory for the catalog. After some minor tuning, the results looked like this:

 

Nodes   Nodes (Ingest)   Nodes (Data)   Ingest Rate (TB/hr)   Peak Ingest Rate (TB/hr)
180     30               150            24                    41
270     45               215            36                    55
360     60               300            45                    72

 

File System

There were also considerations around using a clustered file system. For instance, a batch process might produce all the files to be loaded in a COPY statement and then transfer those files to the loader nodes. If one of the loader nodes has a hardware failure, the batch fails during the COPY. With a NAS or clustered file system, recovery would be as simple as re-running the batch. However, the disk I/O and network bandwidth required to write extra copies of the data for each batch made a NAS or clustered file system too expensive, so it was easier to simply re-run the batches.

Given the massive network pipes at Facebook, it was cheaper to re-run an ingest than to build a high-availability solution that could stage data in case part of a batch did not complete. However, if it’s critical to have data available to users quickly, it may be worth investing more in an architecture that allows the COPY statement to restart without having to re-stage data.

Along the way, it was discovered that file systems lose performance as they age when subjected to a high rate of churn. The typical Vertica environment keeps about a year or two of history. With a short window of history and 3-4% of the file system capacity being loaded each day, the file system is scrubbed frequently. Although some file systems age better than others, it’s good practice to monitor the percentage of CPU time spent in system (kernel) mode.

Facebook’s churn rate was about 1.5 PB every 3 days in a 3.5-4 PB cluster. After about a week, the file system would become extremely fragmented. This is not to suggest a Vertica issue; rather, it reflects operational issues that run from the hardware up through the software stack.

Recovery

There will be times when a recovery cannot finish, because loading takes place 24 hours a day and long-running queries are in flight. The larger the cluster, the less avoidable hardware failures, and nodes in recovery, become. As of 7.1, standby nodes can be assigned to fault groups. For example, a rack of servers can be treated as a fault domain, with a spare node kept on standby within it. If a node fails and crosses a downtime threshold, the standby node within the same rack becomes active. This approach is most suitable when nodes must recover fairly quickly and the cluster is continuously busy.
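As a rough sketch of the rack-as-fault-domain idea (the fault group and node names are hypothetical, and the statement for designating the standby node itself is omitted because it varies by release):

    -- Group the nodes in one rack into a fault group so Vertica treats the
    -- rack as a single fault domain.
    CREATE FAULT GROUP rack_a;
    ALTER FAULT GROUP rack_a ADD NODE v_fb_node0001;
    ALTER FAULT GROUP rack_a ADD NODE v_fb_node0002;
    -- ...and so on for the remaining nodes in the rack. A spare node in the
    -- same rack is designated as a standby; if an active node stays down past
    -- the downtime threshold, the standby takes its place.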

The most common practice for a standby cluster is to run the ETL twice to maintain two separate instances of Vertica. With clusters at Facebook potentially having hundreds of nodes, this approach was not feasible as a high-availability solution.

Looking Forward

Facebook is taking steps to give better operational performance to other groups in the organization while ensuring that none of its clusters sit idle. The focus is moving away from dedicated resources and toward dynamically allocating resources and offering them as a service. The goal is for any cluster, regardless of size, to be able to request data sets from any input source and ingest them into Vertica. In terms of size, 10 nodes is considered small, and several hundred nodes is considered large.

To accomplish this, an adapter retrieves the source data and puts it into a form that can be staged or streamed. At the message bus layer, the data is pre-hashed, meaning that loading on the Vertica side stays local: each row arrives at the node it hashes to, as well as at its buddy. This is accomplished by implementing, in the message distribution layer, the same hash function that is used on the projection. Based on the value of the segmentation column, the node on which a row will reside can be predicted, so data can be sent explicitly to its target nodes and nothing needs to be shuffled between nodes afterward. In this elastic and scalable structure, the ephemeral node approach for ingest is no longer needed.
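The hash being replicated upstream is the one defined by the projection’s segmentation clause. As a sketch (the table, columns, and projection name are hypothetical), a segmented projection declares that mapping explicitly, and the message distribution layer re-implements the same HASH-based routing of a row’s segmentation column to its target node:

    -- The SEGMENTED BY HASH(...) clause determines which node owns each row.
    -- Re-implementing the same hash in the message distribution layer lets the
    -- sender route every row (and its buddy copy) straight to the right nodes.
    CREATE PROJECTION fact_events_by_user
    AS SELECT user_id, event_time, event_type
       FROM fact_events
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE 1;

    -- Vertica also exposes the same hash function in SQL, which is useful for
    -- validating the external implementation against the database's own values.
    SELECT user_id, HASH(user_id) FROM fact_events LIMIT 10;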

Meet our Summer 2014 Interns!

Did your 2014 summer internship include rubber bulldozers, ice cream, and bumper boats? Were you able to develop and work on real projects for real customers, while eating free bagels and bananas? If not, then consider applying to HP Vertica for the summer of 2015.

Our 2014 interns had a great time scaling mountains (all right, it was a 635-foot hill) and building bridges (made out of toothpicks and gumdrops), while developing software that makes the HP Vertica database faster, more accurate, and more secure.

Our interns work closely with their mentors to solve hard problems and improve our product. Along the way, we encourage them to collaborate with cross-functional teams, attend technical talks that aren’t necessarily related to their projects, and create funny short videos about our database features. One of those videos appeared in EVP and GM Colin Mahony’s welcome presentation at the HP Vertica Big Data Conference!

Our interns tackled (and solved!) some interesting problems during the summer of 2014, including:

  • Improving trickle loading using Kestrel and Apache Storm.
  • Improving the encoding algorithm for Vmap data in a flex table.
  • Creating an R package for pattern mining.
  • Integrating HP Vertica with Apache Hadoop YARN.
  • Enhancing the documentation about database locks.
  • Implementing key-based client authentication for the HP Vertica Amazon Machine Image (AMI).
  • Adding features to and improving the performance of our test-tracking application.
  • Improving the scalability and performance of HP Vertica Database Designer.
  • Improving query optimizer plans for columns that are guaranteed to be unique.
  • Developing a tool that processes diagnostic information.

Everyone at HP Vertica works hard, but we like to have fun, too. We make sure to include the interns in our company outings and weekly gaming nights, but we also plan extra activities for them: hiking, mini-golf, volleyball, and tubing were some of this summer’s highlights. And our in-work and out-of-work activities this summer usually included copious consumption of ice cream.

Over the years, many of our best employees have been former interns. So if you want to improve your technical skills, gain an understanding of our column-store database, make new friends, and have a lot of fun in the process, now is the best time to apply for an internship at HP Vertica.

What’s New in Dragline (7.1.0): Live Aggregate Projections

Live Aggregate Projections Video

Live Aggregate Projections from Vertica Systems on Vimeo.

 

HP Vertica 7.1 introduces live aggregate projections. A live aggregate projection is a projection that contains one or more columns of data that have been aggregated from a table.

If you frequently query data that requires aggregation, you could benefit from using a live aggregate projection. Because data in your live aggregate projection is aggregated at load time, rather than at the time you run a query, you can save time and resources. On subsequent data loads, HP Vertica updates the table and loads the aggregated values into the live aggregate projection. If you query the live aggregate projection any time after that, you’ll not only see the same results you would get by querying the table and aggregating the data yourself, but you’ll also use fewer resources in the process.
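As a minimal sketch (the clicks table and its columns are hypothetical), a live aggregate projection is defined with aggregate columns and a GROUP BY, and the matching aggregate query can then be satisfied from the pre-aggregated data:

    -- Hypothetical fact table of raw click events.
    CREATE TABLE clicks (
        user_id    INT,
        page_id    INT,
        click_time TIMESTAMP
    );

    -- Live aggregate projection: the counts are maintained as data is loaded,
    -- instead of being recomputed on every query.
    CREATE PROJECTION clicks_agg
    AS SELECT user_id, page_id, COUNT(*) AS num_clicks
       FROM clicks
       GROUP BY user_id, page_id;

    -- This aggregate query can be answered from the projection.
    SELECT user_id, page_id, COUNT(*)
    FROM clicks
    GROUP BY user_id, page_id;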

Check out this video to learn more about Live Aggregate Projections. Stay tuned for our next video, ‘Top-K Projections’.

See Also:

 Live Aggregate Projections with HP Vertica

Live Aggregate Projection documentation

Smart Grid Solution Demo

That Giant Sucking Sound is Your Big Data Science Project


Vertica recently hosted its second annual Big Data Conference in Boston, Massachusetts. It was very well attended, with over eight hundred folks and about two hundred companies represented. We at Vertica love these events for a few reasons: first, because our customers tend to be our best spokespeople for a sound product, and second, because it’s a chance for us to learn from them.

In one of the sessions, the presenter asked the audience how many of them had Hadoop installed today. Almost all the hands went up. This wasn’t too surprising given that the session was on Hadoop and Vertica integration. Then the presenter asked how many of those folks had actually paid for Hadoop. Most of the hands went down. Then the presenter asked how many of those folks felt that they were getting business value out of their investment. Only two or three hands stayed up. This was eye-opening for us at HP, and it was surprising to the audience as well. Everyone seemed to think they were doing something wrong with Hadoop that was causing them to miss out on the value.

Over the next few days, I made a point of tracking down folks in the audience and getting their thoughts on what the issues were; since most of them were Vertica customers, I knew many of them already. I thought it would be helpful to identify the signs of a big data science project: a project where a team has installed something like Hadoop and is experimenting with it in the hope of achieving new analytic insights, but isn’t on a clear path to deriving value from it. Some clear themes emerged, and they align with what my colleagues in the industry and I have been observing over the last few years. So, without further ado, here are the top five signs that you may have a big data science project in your enterprise:

    1. The project isn’t tied to business value, but has lots of urgency. Somebody on the leadership team went to a big data presentation and has hit the panic button. As a result, the team rushes ahead and does…something. And maybe splinters into different teams doing different things. We all know how well this will turn out.
    2. The technologies were chosen primarily because they beef up resumes. There’s so much hype around big data and the shortage of people with relevant skills that salaries are inflated. And in the face of a project with high urgency, nobody wants to stand still. So download some software! That open source stuff is great, right? While it’s generally true that multiple technologies can solve the same big data problems, some will fit with the business more readily than others. Maybe they’re easier to deploy. Maybe they don’t require extensive skill retooling for the staff. Maybe the TCO is better. Those are all good things to keep in mind during technology selection. But selecting technology for “resume polishing”? Not so much.
    3. The project is burdened with too much process. Most organizations already have well-defined governance processes in place for technology projects. And, so the reasoning goes, big data is basically just a bunch more of the same data and same old reporting & analysis. So when it’s time to undertake a highly experimental big data analytics project which requires agility and adaptability, rigid process usually results in a risk-averse mindset where failure at any level is seen as a bad thing. For projects like these, failure during the experimentation isn’t just expected, it’s a critical part of innovation.
    4. The “can’t-do” attitude. It’s been a well understood fact of life for decades that IT departments often feel under siege – the business always asks for too much, never knows what it wants, and wants it yesterday. As a result, the prevailing attitude in many IT teams today is to start by saying “no”, and then line up a set of justifications for why radical change is bad.
    5. The decision-making impedance mismatch. Sometimes, organizations need to move fast to develop their insights. Maybe it’s driven by the competition, or maybe it’s driven by a change in leadership. And…then they move slooooowly, and miss the opportunity. Other times, the change represents a big one with impact across the company, and requires extensive buy-in and consensus. And…then it moves at a breakneck pace and causes the organization to develop antibodies and reject the project.

     

So if your organization has one or more big data projects underway, ask whether it suffers from any of these issues. If so, you may have a big data science project on your hands.

What’s New in Dragline (7.1.0): Using HP Vertica Pulse

Using HP Vertica Pulse Video

Using Pulse from Vertica Systems on Vimeo.

In our previous video, we showed you how to install HP Vertica Pulse, our add-on sentiment analysis package that allows you to analyze and extract the sentiment from text.

Take a look at this video to learn how to use and tune HP Vertica Pulse to work for your specific business needs.

You can download HP Vertica Pulse as an add-on package for your Enterprise Edition of HP Vertica at my.vertica.com.

HP Vertica Pulse documentation.
