How to Load New Data and Modify Existing Data Simultaneously

Posted August 2, 2016 by Chana Zolty

Originally posted 8/27/2012

Many Vertica customers tell us “we have an OLTP workload” which is not Vertica’s architectural sweet spot. However, when we dig into what they are actually doing, it often turns out that they are simply bulk loading mostly new data with some small number of updates to existing rows. In Vertica 6, we have added support for the MERGE statement to allow users to do just that.

Let’s look at the example shown in Figure 1. In this example, users and their numbers of appearances at a specific location (given by X, Y columns) are being merged from the table New_Location (a) into the existing table Location (b), and the merged results are shown in (c), with the updated and new data in pink. The user with UserID 1 comes to Burger King again, thus his total number of appearances must be updated to 2; while users 2 and 3 go to a new location, so their data must be inserted.

Figure 1: Merge numbers of user appearances at a specific location from table New_Location into table Location

With Vertica versions before 6, this loading process can be done in two SQL steps: updating the existing tuples with changed tuples as shown in Table 1(a), and inserting new tuples as shown in Table 1(b).

Table 1: Two-step process to load new and modified tuples

In Vertica 6, this two-step process can be done in a single SQL MERGE statement as shown in Table 2.

Table 2: Merge statement that merges data of Figure 1

In a MERGE statement, the table of the existing tuples (Location) is called the “target” table, while the table of the new-and-modified tuples (New_Location) is called the “source” table. The meaning (semantics) of the above MERGE statement can be described as follows:

For each tuple in the source table, New Location:
  • If there is a match that satisfied the specified condition, same UserID, X and Y, with any tuples of table Location, then
    • Update the matched tuple in the target table, Location, with Count value equals to sum of Count in source and target tables.
  • If there is no match found for the specified condition then
    • Insert this tuple into the target table, Location
In our popular customer use cases, the target is usually large and consists of hundreds of millions of tuples. The source table is usually much smaller than the target table and consists of hundreds of thousands of tuples.

Our experiments have shown that MERGE is always faster than its corresponding two-step update-and-then-insert process on the above example on different data sets. On the most popular use case in which there are 500 thousand and 100 million tuples in New_Location and Location respectively, and half of the tuples in New_Location are new, MERGE is over 70 times faster than its corresponding two-step process.

If the source data is kept in a file instead of a table, the MERGE statement still can be used by defining an external table for this data file. External table is also a new feature in Vertica 6.

In summary, MERGE is a succinct and powerful way to simultaneously bulk load and update tables in a single SQL statement.