This blog post was authored by Curtis Bennett.
Many clients who are new to Vertica are also new to big data. While Vertica’s reliance on industry-standard SQL can make the transition very easy, the introduction of multiple nodes in support of a database platform can take some getting used to. It is the very existence of multiple nodes that makes big data, well, big. It is also these nodes that can make things confusing and complicated, at least until you fully understand them. For decades, data architects could simply create a logical and physical model for their tables, design the primary keys, implement the tables, and everything would work more or less OK. In a multi-node environment, a new wrinkle has been added to this equation – it now becomes important to understand where data resides within the cluster, and how that data relates to other data in the database. Understanding this bigger picture is what makes big data more challenging.
Consider, for example, two tables – SALES and CUSTOMER. Let’s assume that these tables are both large enough that I have decided to SEGMENT both of them. In Vertica parlance, to segment a table implies that you’ve evenly divided the data up such that only a part of the table’s data lives on any one individual node. Segmentation is the best option for large tables because it means that no single node has to handle all the data volume for that table by itself. In really large clusters, it’s highly likely that a single node would not even be able to physically hold all the data for a single table anyway. A table is segmented on a segmentation key – a hash of a column or set of columns which determines where the data resides.
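As an illustration – the table definition below is a hypothetical sketch, not a schema from the original system – a Vertica table can be given its segmentation key at creation time with a SEGMENTED BY HASH clause:

```sql
-- Hypothetical SALES table, segmented by its primary key.
-- HASH(sales_id) determines which node each row lives on;
-- ALL NODES spreads the segments across the whole cluster.
CREATE TABLE sales (
    sales_id    INT NOT NULL PRIMARY KEY,
    customer_id INT NOT NULL,
    sale_date   DATE,
    amount      NUMERIC(12,2)
)
SEGMENTED BY HASH(sales_id) ALL NODES;
```

Small tables that should live in full on every node can instead be declared UNSEGMENTED ALL NODES, which replicates them rather than segmenting them.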
In our tables, we have a customer, we’ll call him Bob, who has many sales associated with him. The tables each have a primary key – sales_id and customer_id – and we have chosen to segment each of these tables by their primary key values. Segmenting on a primary key value is a great idea, because it ensures an even distribution of our data, since primary key values don’t repeat. But now we have a problem – Bob lives on one node, and all of Bob’s sales live on different nodes! This means if I’m joining Bob to all of his sales, I have to go out and ask all the nodes for this information. This might not be that bad if Bob has a few sales, but if I want ALL customers and ALL of their sales, this could become nightmarishly bad. A better solution for this problem is to segment both CUSTOMER and SALES by the CUSTOMER_ID. This ensures that both Bob and all of Bob’s sales live harmoniously on the same node. In Vertica, this concept is known as “Identically Segmented Projections” and can make queries between segmented tables very efficient.
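A sketch of that improved layout (column names are hypothetical): both tables declare the same segmentation key, so the join column hashes identically on both sides.

```sql
-- Both tables hash on customer_id, so a given customer's row and
-- all of that customer's sales land on the same node. A join on
-- customer_id can then run locally on each node, with no data
-- movement across the cluster.
CREATE TABLE customer (
    customer_id INT NOT NULL PRIMARY KEY,
    name        VARCHAR(100)
)
SEGMENTED BY HASH(customer_id) ALL NODES;

CREATE TABLE sales (
    sales_id    INT NOT NULL PRIMARY KEY,
    customer_id INT NOT NULL,
    amount      NUMERIC(12,2)
)
SEGMENTED BY HASH(customer_id) ALL NODES;
```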
Segmentation key values should follow two main rules – consistency and simplicity. Consistent segmentation keys are keys that are commonly used throughout a system. Most database systems have a central idea that ties everything together – a customer, an invoice, an account_id, or something similar. Often many tables share this same key, and that column shows up in many of the join conditions. Including this column in as many of the segmentation keys as possible will make for more efficient queries. Simplicity means not overcomplicating the segmentation key. It might be tempting to make the segmentation key on the SALES table (customer_id, sale_date), but such a key would complicate joins between these tables. Even if keeping the keys the same means we end up with a little skew because some customers have a lot more data than others, making the segmentation keys match is more important. Following these two core principles of consistency and simplicity with regard to segmentation keys will pay dividends in the long run.
But what if you didn’t do that, and you’ve been in production for a while, and you’re staring at the business end of a database with 10,000 tables in it, and you don’t have any idea where to even start? No problem! There are ways of figuring this out.
The PROJECTIONS system table has a couple of columns that are helpful here – is_segmented and segment_expression. The first is a flag that tells us whether the projection is segmented or not. Tables that are not segmented (i.e., replicated) do not have segmentation expressions, so we can safely ignore those. The segment_expression contains the actual segmentation clause syntax that was used at projection creation time, as fully qualified, comma-delimited column names.
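A quick look at the expressions themselves might go like this (the output shown is illustrative, not from a real cluster):

```sql
-- List the segmentation expression for every segmented projection.
SELECT projection_name, segment_expression
FROM projections
WHERE is_segmented;

--  projection_name      | segment_expression
-- ----------------------+-------------------------------
--  date_dimension_super | hash(date_dimension.date_key)
--  sales_b0             | hash(sales.customer_id)
```

Note that every qualified column name carries its table name as a prefix, which gets in the way of comparing expressions across tables.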
If you want to see tables that share the same segmentation key, we’ll need to eliminate the table_name in the output here. We can use a regexp_replace:
SELECT REGEXP_REPLACE(SUBSTR(segment_expression, 1, 500), '\w+\.')
FROM projections
WHERE is_segmented ;
This command simply ignores the words that come before a ‘.’ – “hash(date_dimension.date_key)” becomes just “hash(date_key)”. With this, I can easily aggregate my results:
SELECT REGEXP_REPLACE(SUBSTR(segment_expression, 1, 500), '\w+\.'), COUNT(*)
FROM projections
WHERE is_segmented
  AND projection_name NOT ILIKE '%b1'
GROUP BY 1
ORDER BY 2 DESC ;
Because Vertica designates the primary and the buddy with “_b0” or “_b1” at the end of the projection name, I can remove the buddy entry by ignoring any projection ending with “b1”.
At this point, useful data starts to appear. Anything that returns from this query with a count of “1” is a segmentation island. That is, it is segmented uniquely, and anything else that joins to it that is also segmented is going to perform a less-than-optimal query. It is worth noting that the segmentation key is quite irrelevant if the table is not joined to anything, but tables that do join elsewhere are worth reviewing.
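The islands can be pulled out directly by wrapping the aggregate above in a HAVING clause – a small variation on the query already shown:

```sql
-- Segmentation islands: expressions used by exactly one projection.
SELECT REGEXP_REPLACE(SUBSTR(segment_expression, 1, 500), '\w+\.') AS seg_key
FROM projections
WHERE is_segmented
  AND projection_name NOT ILIKE '%b1'
GROUP BY 1
HAVING COUNT(*) = 1 ;
```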
Another thing that can start to appear in this data is segmentation expressions that violate the core principle of consistency. Consider a system with the following simplified segmentation expressions:
regexp_replace | count
hash(a) | 10
hash(b) | 8
hash(a, c, d) | 2
hash(a, b, c) | 1
hash(b, c) | 1
hash(e) | 1
In this example, it likely doesn’t make sense to have projections segmented by (a, c, d) and (a, b, c) when 10 projections are already segmented on just (a). It would be more efficient to change the segmentation on these tables to just (a) so that they join more easily to the majority of tables in the system. Furthermore, depending on the use cases, it might make sense to replace the projection segmented by (a, b, c) with two projections – one segmented by (a) and another segmented by (b).
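A sketch of that last suggestion (table and column names are hypothetical): Vertica allows a table to carry additional projections with different segmentation keys, and the optimizer picks whichever one best fits a given query.

```sql
-- Hypothetical table t, currently segmented by HASH(a, b, c).
-- Add a second projection segmented by just a, so that joins
-- on a can run locally on each node.
CREATE PROJECTION t_seg_a AS
SELECT * FROM t
SEGMENTED BY HASH(a) ALL NODES;

-- Populate the new projection so it becomes usable:
SELECT REFRESH('t');
```

The trade-off is extra storage and load-time work for each additional projection, so this is worth doing only for join patterns that actually occur.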
A Better Solution
Going through this list and determining what to fix could be tedious, especially if you have a lot of segmented tables. There’s a better way:
SELECT p.projection_schema || '.' || p.projection_name,
       REGEXP_REPLACE(SUBSTR(p.segment_expression, 1, 500), '\w+\.'),
       pc.projection_column_name
FROM projections p
JOIN projection_columns pc USING (projection_id)
WHERE REGEXP_REPLACE(SUBSTR(p.segment_expression, 1, 500), '\w+\.')
   IN (SELECT REGEXP_REPLACE(SUBSTR(p.segment_expression, 1, 500), '\w+\.')
       FROM projections p
       WHERE p.projection_name NOT ILIKE '%b1'
       GROUP BY 1 HAVING COUNT(*) = 1 )
  AND p.projection_name NOT ILIKE '%b1'
  AND pc.projection_column_name IN (
      SELECT hash_expr FROM (
          SELECT REPLACE(SUBSTR(REGEXP_REPLACE(SUBSTR(p.segment_expression, 1, 500), '\w+\.'), 6), ')', '') AS hash_expr,
                 COUNT(*)
          FROM projections p
          WHERE p.projection_name NOT ILIKE '%b1'
          GROUP BY 1 ORDER BY 2 DESC LIMIT 20 ) AS foo )
ORDER BY 1, 2, 3 ;
This is a complicated query, but it builds on the simple one above. In a nutshell, it says: “Give me all the segmentation islands (that is, all the projections segmented uniquely) whose segmentation expression contains a column matching one of the most commonly used segmentation expressions in my system.” This query becomes more useful when you have a large number of tables that share a common segmentation expression. For example, if I have many tables that are segmented by hash(customer_id), then any table that is hashed by something like (customer_id, transaction_dt) would show up in this list if its segmentation expression is unique. You could also tweak the query to find projections that appear not just once, but two or fewer times, by adjusting the “HAVING COUNT(*)” value.
Making as many segmented tables as possible share a consistent segmentation expression can have profound performance implications in a large-scale environment. These queries also make it easy to visualize how the tables in your system are segmented.