This blog post was authored by Curtis Bennett.
Deletes in Vertica are a complicated topic. I’ve had many people say to me that they’d heard that one should never do deletes in Vertica, or that deletes in Vertica are slow. Nothing could be further from the truth. But like anything in a database, they should be tuned – especially if you intend to do them frequently. And in Vertica, they have to be managed. Vertica didn’t even support deletes until version 3. Wrap your brain around that one for a second.
In Vertica, deleting a record doesn’t actually delete the record, instead something called a delete vector is created. A delete vector is a file that contains a position and an epoch. These tell Vertica the “where” and the “when” of the delete. If I have 1,000 rows in a table, and I delete 100 of them, what I end up with on disk is a file of 1,000 rows, and a new file of 100 delete vectors. When I select from that table, I expect to see 900 rows. Vertica takes the 1,000 rows, and subtracts 100 delete vectors from it. It’s also possible that these delete vectors could be spread out over several days, and Vertica has the ability to show you the data in the table at various stages throughout that process using an “AT TIME” query, but that’s a whole other class.
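As a quick sketch of what that looks like in practice (the table and row counts here are hypothetical), you can watch delete vectors accumulate in the DELETE_VECTORS system table:

```sql
-- Hypothetical table with 1,000 rows; delete 100 of them.
DELETE FROM my_table WHERE id <= 100;
COMMIT;

-- The 1,000 rows are still on disk; Vertica just recorded
-- delete vectors (position + epoch) against each projection.
SELECT projection_name, deleted_row_count
FROM v_monitor.delete_vectors
WHERE schema_name = 'public';

-- A SELECT now returns 900 rows: 1,000 stored rows minus
-- the rows masked by delete vectors.
SELECT COUNT(*) FROM my_table;
```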
When you issue a SELECT statement, the optimizer decides which projection is best suited to handle that query, and uses it. Of course, if there is just one projection, it has to use that one. A DELETE or UPDATE isn’t as fortunate. If there are multiple projections on a table, a DELETE (or UPDATE) has to be applied to ALL of the projections on that table – and each of them can accrue delete vectors independently of the others. Here’s the important bit: a delete against a table is only as fast as the slowest projection on that table.
It’s also important to understand that an UPDATE is actually a DELETE plus an INSERT performed together in the same transaction. So, UPDATE statements follow the same rules. The same goes for MERGE commands, if you’re using them to perform updates or deletes.
What does all of this actually mean? Let’s say you have a table called “Person” and it has roughly 7.6 billion rows in it – one for every man, woman and child in the world. Roughly 15,000 rows are inserted into this table every hour, and another 6,316 rows are deleted (on average), every hour. This corresponds to the number of births and deaths on the planet. The primary key on our table is a bit of a mystery (what is it that makes us unique, after all?), but for discussion purposes, let’s say we’ve rolled that all into a surrogate key called soul_id. And on this table, it works really well. Everything works great, until a DBA decides this table needs a new projection that has just a handful of columns in it – continent_cd, country_cd, and hair_color, and is sorted in that order.
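To make the example concrete, here’s roughly what that setup might look like – all of the column names and definitions below are illustrative, not a real schema:

```sql
-- Hypothetical table: one row per person, sorted by our surrogate key.
CREATE TABLE person (
    soul_id       BIGINT NOT NULL PRIMARY KEY,
    continent_cd  CHAR(2),
    country_cd    CHAR(3),
    hair_color    VARCHAR(20)
    -- ... plus many more columns
)
ORDER BY soul_id;

-- The DBA's new projection: a handful of columns,
-- sorted in an order that doesn't include soul_id at all.
CREATE PROJECTION person_by_region (continent_cd, country_cd, hair_color)
AS SELECT continent_cd, country_cd, hair_color
   FROM person
   ORDER BY continent_cd, country_cd, hair_color;
```

For SELECT queries grouped by region, that second projection is great. For deletes, as we’re about to see, it’s trouble.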
Now we have a problem. Given that roughly 2 people die every second, it doesn’t take long before a DELETE shows up in our system. And since a DELETE applies to ALL of the projections on our table, the second projection is going to cause a massive problem. “DELETE FROM person WHERE soul_id = 7623948859;” isn’t going to do well against a projection that doesn’t even have soul_id in it. So what happens? Does the delete fail? The good news is, no – it succeeds, but it takes a long time. The second projection has to ask the first projection what “soul_id = 7623948859” means, and the only vocabulary it has for the answer is its own ORDER BY clause. Because the second projection’s entire view of the world is “continent_cd, country_cd, and hair_color,” the first projection responds in kind: it can say which continent and country that person lived in, and what their hair color was, but no more. And it’s not terribly helpful to know only that we’re deleting some dark-haired Japanese person, because the second projection doesn’t have the level of detail needed to find that one row quickly.
It works, but it’s slow. The good news is that Vertica provides a relatively obscure function we can use to find such issues. It’s called EVALUATE_DELETE_PERFORMANCE.
EVALUATE_DELETE_PERFORMANCE works by evaluating the cardinality of the ORDER BY clauses on all of your projections in order to determine whether sufficient granularity exists to find a single tuple within the context of the projection’s definition. In other words – how hard is it to find a single record in a projection definition? In our above example, a projection ordered by “continent_cd, country_cd and hair_color” would not be good at all. Imagine how many dark-haired people live in China, on the continent of Asia. Pretty much all of them – that’s not going to be a useful way of finding a single individual. This function will identify those projections so that you can be aware of them. It doesn’t solve the problem for you – you will still need to look at the projection definition, and make adjustments on your own if you feel it is necessary. It’s also worth noting here that this function can also improve the performance of UPDATE statements as well, since an UPDATE performs a DELETE statement internally.
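Invoking it is simple – it’s an ordinary meta-function call. The schema and table names below are from our hypothetical example; check the documentation for your version for the exact scoping rules:

```sql
-- Evaluate the projections of a single table...
SELECT EVALUATE_DELETE_PERFORMANCE('public.person');

-- ...or pass an empty string to evaluate every
-- projection in the database (this is the long-lunch option).
SELECT EVALUATE_DELETE_PERFORMANCE('');
```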
If you’re feeling bored, and you’re looking for work to do, kick off EVALUATE_DELETE_PERFORMANCE – and then go to lunch. Or, if you have 5,000 tables in your system, go home for the weekend. You don’t have to monitor the output, the function populates the PROJECTION_DELETE_CONCERNS table with projections that it finds.
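When you get back, a quick query shows whatever it flagged. The column list here is a sketch from memory – adjust to whatever your version of PROJECTION_DELETE_CONCERNS actually exposes:

```sql
-- Each row is a projection whose sort order makes
-- single-row deletes expensive, with an explanatory comment.
SELECT projection_name, comment
FROM v_catalog.projection_delete_concerns;
```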