
HP Vertica presents three talks at ICDE, Chicago, March 31-April 4, 2014


For the past three years, HP Vertica has presented innovative topics at prestigious database conferences such as the International Conference on Data Engineering (ICDE), the Very Large Databases (VLDB) conference, and the Extremely Large Database (XLDB) conference. This year, our engineering team proudly announces three talks at the upcoming ICDE held in Chicago, IL, USA, March 31-April 4, 2014.

  • On 3/31/2014, at the SMDB workshop, Ben Vandiver introduces Flex Zone, one of the new features of our recent release, HP Vertica Analytics Platform 7. Flex Zone enables the smooth data loading and exploration flexibility of NoSQL solutions while maintaining a unified SQL interface over structured and semi-structured data (a minimal sketch follows this list).
  • On 4/1/2014, Ramakrishna Varadarajan presents Vertica’s customizable physical design tool, called the Database Designer, which produces designs optimized for various scenarios and applications. For a given workload and space budget, Database Designer automatically recommends a physical design that optimizes query performance, storage footprint, fault tolerance, and database recovery to meet different customer requirements.
  • On 4/2/2014, Jaimin Dave introduces HP Vertica SQL Query Optimizer. The Query Optimizer was written from the ground up for the HP Vertica Analytic Database. Jaimin will discuss its design and the tradeoffs encountered during its implementation. He’ll also argue that the full power of today’s database systems can be realized only with a carefully designed custom Query Optimizer, written specifically for the system in which it operates.
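As a taste of what Ben will cover, loading and querying semi-structured data with Flex Zone looks roughly like this (a minimal sketch; the table name, file path, and JSON keys here are hypothetical, not from the talk):

    -- Create a flex table with no predeclared schema and load raw JSON.
    CREATE FLEX TABLE visits();
    COPY visits FROM '/data/visits.json' PARSER fjsonparser();

    -- JSON keys can then be queried like ordinary SQL columns.
    SELECT userid, location FROM visits LIMIT 10;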

Click here to find the details and schedule. Please do attend the talks, stop by, and say hello to our presenters and their co-authors. We’ll be happy to tell you more about our designs and the trade-offs we encountered.

When UPDATE is actually INSERT

At the VLDB 2012 conference a few weeks ago, we had a chance to listen to Jiri Schindler give a tutorial about NoSQL. His interesting and informative presentation covered the fundamental architecture and I/O usage patterns of RDBMSs and various NoSQL data management systems, such as HBase, Cassandra, and MongoDB.

During the presentation, Schindler listed basic I/O access patterns for columnar databases using the slide below. It is hard to summarize the operation of the various columnar database systems on a single slide, and Schindler did a great job given the constraints of the presentation. However, while his characterization might hold for other columnar databases, the Vertica Analytic Database has a different I/O pattern for UPDATEs, which we wanted to explain in more detail.

First, Vertica does not require synchronous I/O to a recovery log. Unlike most other RDBMSs, Vertica implements durability and fault tolerance via distributed replication.

Second, since Vertica never modifies storage in place, it avoids the other I/O intensive operations referenced in the slide.

When a user issues an UPDATE statement, Vertica performs the equivalent of a delete followed by an insert. The existing row is deleted by inserting a Delete Vector (a small record saying that the row was deleted), and a new copy of the row with the appropriately updated columns is inserted. Both the Delete Vector and the new version of the row are stored in a memory buffer known as the WOS (write optimized store). After sufficient data has accumulated in the WOS from INSERTs, UPDATEs, DELETEs, and COPYs (bulk loads), it is moved in bulk to disk storage known as the ROS (read optimized store).
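As a concrete sketch (the visits table and its columns are hypothetical), a single UPDATE statement is executed internally as this delete-plus-insert pair:

    -- One logical UPDATE...
    UPDATE visits
       SET appearances = appearances + 1
     WHERE userid = 1;

    -- ...is carried out internally as two steps, both buffered in the WOS:
    --   1. A Delete Vector marking the existing row as deleted is inserted.
    --   2. A new copy of the row with the updated column values is inserted.
    -- No existing column file in the ROS is modified in place.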

It is important to note that existing files in the ROS are not modified while data is moved from the WOS to the ROS – rather, a new set of sorted and encoded column files is created. To avoid a large number of files accumulating over time, the Tuple Mover regularly merges column files together using an algorithm that limits the number of times any tuple is rewritten and also uses large contiguous disk operations, which are quite efficient on most modern file and disk systems.
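Both steps normally run automatically in the background, but Vertica also exposes them as Tuple Mover tasks that can be invoked by hand, which makes the mechanism easy to observe (shown against the hypothetical visits table from the sketch above):

    -- Write accumulated WOS data out to a new set of ROS files.
    SELECT DO_TM_TASK('moveout', 'visits');

    -- Merge small ROS containers into larger, sorted ones.
    SELECT DO_TM_TASK('mergeout', 'visits');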

This arrangement has several advantages:

  1. From the user’s point of view, the UPDATE statement completes quickly, and future queries get the expected answer (by filtering out the original values at runtime using the Delete Vectors; a query sketch follows this list).
  2. The cost of sorting, encoding, and writing column files to disk is amortized over a large number of rows by utilizing the in-memory WOS.
  3. I/O is always in proportion to the number of rows inserted or modified – it is never the case that an update of a small number of rows causes I/O on a significant amount of previously stored data.
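For the curious, the Delete Vectors themselves can be inspected through the DELETE_VECTORS system table (assuming a recent Vertica release; the grouping here is just one way to slice it):

    -- Total rows currently marked deleted, per projection.
    SELECT schema_name, projection_name,
           SUM(deleted_row_count) AS deleted_rows
      FROM v_monitor.delete_vectors
     GROUP BY schema_name, projection_name;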

For more details about how data is stored, as well as Vertica’s overall architecture and design decisions, please consider reading our VLDB 2012 paper.

How to Load New Data and Modify Existing Data Simultaneously

Many Vertica customers tell us "we have an OLTP workload," which is not Vertica’s architectural sweet spot. However, when we dig into what they are actually doing, it often turns out that they are simply bulk loading mostly new data, with a small number of updates to existing rows. In Vertica 6, we added support for the MERGE statement to let users do just that.

Let’s look at the example shown in Figure 1. In this example, users and their numbers of appearances at a specific location (given by the X and Y columns) are merged from the table New_Location (a) into the existing table Location (b); the merged results are shown in (c), with the updated and new data in pink. The user with UserID 1 comes to Burger King again, so his total number of appearances must be updated to 2, while users 2 and 3 go to new locations, so their data must be inserted. A MERGE statement expressing this logic is sketched below.
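Here is a minimal sketch of that MERGE (the appearances column name is our assumption; the example names only the tables and the UserID, X, and Y columns):

    -- Increment the count for rows that already exist; insert the rest.
    MERGE INTO Location l
    USING New_Location n
       ON l.UserID = n.UserID AND l.X = n.X AND l.Y = n.Y
    WHEN MATCHED THEN
         UPDATE SET appearances = l.appearances + 1
    WHEN NOT MATCHED THEN
         INSERT (UserID, X, Y, appearances)
         VALUES (n.UserID, n.X, n.Y, 1);

Each source row either increments an existing row’s appearance count or inserts a brand-new row, all in a single bulk pass over the data.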

