Many Vertica customers tell us “we have an OLTP workload” which is not Vertica’s architectural sweet spot. However, when we dig into what they are actually doing, it often turns out that they are simply bulk loading mostly new data with some small number of updates to existing rows. In Vertica 6, we have added support for the MERGE statement to allow users to do just that.
Let’s look at the example shown in Figure 1. In this example, users and their numbers of appearances at a specific location (given by X, Y columns) are being merged from the table New_Location (a) into the existing table Location (b), and the merged results are shown in (c), with the updated and new data in pink. The user with UserID 1 comes to Burger King again, thus his total number of appearances must be updated to 2; while users 2 and 3 go to a new location, so their data must be inserted.
With Vertica versions before 6, this loading process can be done in two SQL steps: updating the existing tuples with changed tuples as shown in Table 1(a), and inserting new tuples as shown in Table 1(b).
SET count = Location.count + src.count
FROM New_Location src
WHERE Location.userid = src.userid
AND Location.x = src.x
AND Location.y = src.y;
(a) Loading Step 1: Update Modified Tuples
|INSERT INTO Location
SELECT userid, x, y, count, name
WHERE (userid, x, y) NOT IN
(SELECT userid, x, y
(b) Loading Step 2: Insert New Tuples
Table 1: Two-step process to load new and modified tuples
In Vertica 6, this two-step process can be done in a single SQL MERGE statement as shown in Table 2.
|MERGE INTO Location tgt
USING New_Location src
ON src.userid = tgt.userid
AND src.x = tgt.x
AND src.y = tgt.y
WHEN MATCHED THEN
UPDATE SET count = tgt.count + src.count
WHEN NOT MATCHED THEN
INSERT VALUES (src.userid, src.x, src.y, src.count, src.name);
Table 2: Merge statement that merges data of Figure 1
In a MERGE statement, the table of the existing tuples (Location) is called the “target” table, while the table of the new-and-modified tuples (New_Location) is called the “source” table. The meaning (semantics) of the above MERGE statement can be described as follows:
For each tuple in the source table, New Location
- If there is a match that satisfied the specified condition, same UserID, X and Y, with any tuples of table Location, then
- Update the matched tuple in the target table, Location, with Count value equals to sum of Count in source and target tables.
- If there is no match found for the specified condition then
- Insert this tuple into the target table, Location
In our popular customer use cases, the target is usually large and consists of hundreds of millions of tuples. The source table is usually much smaller than the target table and consists of hundreds of thousands of tuples.
Our experiments have shown that MERGE is always faster than its corresponding two-step update-and-then-insert process on the above example on different data sets. On the most popular use case in which there are 500 thousand and 100 million tuples in New_Location and Location respectively, and half of the tuples in New_Location are new, MERGE is over 70 times faster than its corresponding two-step process.
If the source data is kept in a file instead of a table, the MERGE statement still can be used by defining an external table for this data file. External table is also a new feature in Vertica 6; its examples and use cases can be found in the “Vertica Analytics Anywhere” blog.
In summary, MERGE is a succinct and powerful way to simultaneously bulk load and update tables in a single SQL statement.