This blog post was authored by Curtis Bennett.
In Vertica, projections can either be replicated (unsegmented), or segmented. A segmented projection divides the data up across all the nodes in your cluster. Segmentation works by hashing a key value, and then using some simple math, figuring out which node that piece of data will live on. Ideally, the segmentation value is a primary key, or something unique (or close to unique) in order to provide even data distribution. Otherwise, you may end up with more data on one node than on another node.
Projection skew occurs when the volume of a given segmented projection is much larger on one node than it is on another node. Small amounts of skew are OK, but tables that are significantly skewed should be segmented in a different way. Large tables with a high degree of skew will cause performance issues and could create stability issues if unmanaged.
Here is a query that can easily identify skewed segmented projections. This query works by identifying the sizes of all segmented projections with more than 1 million rows, and then uses the FIRST_VALUE analytic functions to compare the sizes of that projections’ smallest node footprint versus its largest node footprint. This is calculated as a skew percentage. You could easily subtract one from the other if you wanted to see the actual size difference as well.
select distinct lower(trim(ps.projection)) as projection
, first_value(to_char(used_bytes/1024^2, '999,999.9999')) over (w order by used_bytes asc) as min_used_MB
, first_value(to_char(used_bytes/1024^2, '999,999,999.9999')) over (w order by used_bytes desc) as max_used_MB
, to_char((first_value(used_bytes) over (w order by used_bytes asc) /
first_value(used_bytes) over (w order by used_bytes desc)-1) *-1*100, '99.9999%') as skew_pct
(select node_name, projection_id, projection_schema || '.' || projection_name as projection
, sum(used_bytes) as used_bytes
from projection_storage group by 1,2,3 ) as ps
join projections p using (projection_id)
and ps.used_bytes > 1000000
window w as (partition by ps.projection)
order by 4 desc limit 30 ;
projection | min_used_MB | max_used_MB | skew_pct
test1.fact_balance_b0 | 1.1077 | 2.5187 | 56.0233%
test1.fact_balance_b1 | 1.1623 | 2.4834 | 53.1959%
test9.fact_certification_assignment_status_b0 | 260.5255 | 525.7532 | 50.4472%
test9.fact_certification_assignment_status_b1 | 260.5248 | 523.0742 | 50.1935%
test9.fact_required_adjustments_aging_b0 | 4.1575 | 8.3436 | 50.1717%
test8.dim_assignment_b0 | 8.4535 | 16.9286 | 50.0640%
test8.dim_assignment_b1 | 8.4535 | 16.9263 | 50.0569%
test9.fact_required_adjustments_aging_b1 | 4.1627 | 8.3328 | 50.0452%
store.store_orders_fact_b0 | 1,192.2614 | 1,709.3171 | 30.2493%
store.store_orders_fact_b1 | 1,192.2603 | 1,709.3109 | 30.2491%
Small tables that have significant skew might not be worth worrying about, but use your judgment here – certainly in this example it’s worth fixing test9.fact_certification_assignment_status, but it might be OK to not worry too much about test1.fact_balance, even though it has a higher skew percentage. It is recommended that anything with over 20% skew be addressed, if possible.
In this example, test9.fact_certification_assignment_status has a lot of skew, but store.store_orders_fact is worse – even though the percentage is not as great, the amount of data difference is over 500MB between nodes.