Identifying Projection Skew

This blog post was authored by Curtis Bennett.

In Vertica, projections can be either replicated (unsegmented) or segmented. A segmented projection divides the data up across all the nodes in your cluster. Segmentation works by hashing a key value and then using some simple math to determine which node each row will live on. Ideally, the segmentation value is a primary key, or something unique (or close to unique), so that the data is distributed evenly. Otherwise, you may end up with more data on one node than on another.
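To get a feel for why a unique key distributes evenly and a low-cardinality key does not, you can bucket the hash of a candidate column yourself. The sketch below assumes a three-node cluster and assumes that store.store_orders_fact has the columns order_number (close to unique) and store_key (few distinct values); those column names are illustrative, not taken from this post. MOD over HASH only approximates the idea of "hash, then simple math" and is not necessarily the exact mapping Vertica uses internally.

```sql
-- Assumptions: 3 nodes; order_number and store_key are hypothetical columns.
-- Candidate 1: a nearly unique key should give three buckets of similar size.
SELECT MOD(HASH(order_number), 3) AS bucket
     , COUNT(*) AS row_cnt
FROM store.store_orders_fact
GROUP BY 1
ORDER BY 1;

-- Candidate 2: a key with few distinct values tends to give uneven buckets.
SELECT MOD(HASH(store_key), 3) AS bucket
     , COUNT(*) AS row_cnt
FROM store.store_orders_fact
GROUP BY 1
ORDER BY 1;
```

If the row counts for one candidate differ widely between buckets, that column is likely to produce a skewed projection.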

Projection skew occurs when the volume of a given segmented projection is much larger on one node than it is on another node. Small amounts of skew are OK, but tables that are significantly skewed should be segmented in a different way. Large tables with a high degree of skew will cause performance issues and could create stability issues if unmanaged.
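For a single projection, the per-node footprint can be read directly from the projection_storage system table. This is a minimal sketch that uses one of the projection names from the example output later in this post; substitute your own schema and projection name.

```sql
-- Per-node rows and storage for one projection.
SELECT node_name
     , SUM(row_count) AS row_cnt
     , SUM(used_bytes) / 1024^2 AS used_mb
FROM projection_storage
WHERE projection_schema = 'test9'
  AND projection_name = 'fact_certification_assignment_status_b0'
GROUP BY node_name
ORDER BY node_name;
```

A well-segmented projection shows roughly the same used_mb on every node.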

Here is a query that can easily identify skewed segmented projections. It sums the storage each segmented projection uses on each node, keeps the per-node footprints larger than 1,000,000 bytes, and then uses the FIRST_VALUE analytic function to compare each projection's smallest node footprint with its largest. The result is reported as a skew percentage. You could easily subtract one from the other if you wanted to see the actual size difference as well.

-- Compare each segmented projection's smallest and largest per-node footprint.
select distinct lower(trim(ps.projection)) as projection
, first_value(to_char(used_bytes/1024^2, '999,999,999.9999')) over (w order by used_bytes asc) as min_used_MB
, first_value(to_char(used_bytes/1024^2, '999,999,999.9999')) over (w order by used_bytes desc) as max_used_MB
-- skew_pct = (1 - smallest / largest) * 100
, to_char((first_value(used_bytes) over (w order by used_bytes asc) /
first_value(used_bytes) over (w order by used_bytes desc)-1) *-1*100, '99.9999%') as skew_pct
from
-- total bytes per node for each projection
(select node_name, projection_id, projection_schema || '.' || projection_name as projection
, sum(used_bytes) as used_bytes
from projection_storage group by 1,2,3 ) as ps
join projections p using (projection_id)
where p.is_segmented
-- ignore per-node footprints of 1,000,000 bytes or less
and ps.used_bytes > 1000000
window w as (partition by ps.projection)
order by 4 desc limit 30 ;


                projection                    |  min_used_MB  |    max_used_MB    | skew_pct
----------------------------------------------+---------------+-------------------+-----------
test1.fact_balance_b0                         |        1.1077 |            2.5187 |  56.0233%
test1.fact_balance_b1                         |        1.1623 |            2.4834 |  53.1959%
test9.fact_certification_assignment_status_b0 |      260.5255 |          525.7532 |  50.4472%
test9.fact_certification_assignment_status_b1 |      260.5248 |          523.0742 |  50.1935%
test9.fact_required_adjustments_aging_b0      |        4.1575 |            8.3436 |  50.1717%
test8.dim_assignment_b0                       |        8.4535 |           16.9286 |  50.0640%
test8.dim_assignment_b1                       |        8.4535 |           16.9263 |  50.0569%
test9.fact_required_adjustments_aging_b1      |        4.1627 |            8.3328 |  50.0452%
store.store_orders_fact_b0                    |    1,192.2614 |        1,709.3171 |  30.2493%
store.store_orders_fact_b1                    |    1,192.2603 |        1,709.3109 |  30.2491%
…
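The skew_pct column can be checked by hand. The query's expression (min / max - 1) * -1 * 100 is equivalent to the formula below, shown here with the rounded MB values from the first and ninth rows of the output (the query itself works from bytes, so the last digits differ slightly).

```latex
\text{skew\_pct} = \left(1 - \frac{\text{min\_used}}{\text{max\_used}}\right) \times 100

\text{test1.fact\_balance\_b0:}\quad
\left(1 - \frac{1.1077}{2.5187}\right) \times 100 \approx 56.02\%

\text{store.store\_orders\_fact\_b0:}\quad
\left(1 - \frac{1192.2614}{1709.3171}\right) \times 100 \approx 30.25\%
```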

Small tables that have significant skew might not be worth worrying about, but use your judgment here – certainly in this example it’s worth fixing test9.fact_certification_assignment_status, but it might be OK to not worry too much about test1.fact_balance, even though it has a higher skew percentage. It is recommended that anything with over 20% skew be addressed, if possible.
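One possible way to address a skewed projection is to create a replacement that is segmented on a more evenly distributed column and then refresh it. The following is only a sketch: assignment_id is a hypothetical column standing in for a unique or nearly unique key, the _reseg projection name is made up, and KSAFE 1 is assumed because the _b0 and _b1 buddy projections in the output suggest a K-safe cluster.

```sql
-- Hypothetical names: assignment_id and the _reseg suffix are placeholders.
CREATE PROJECTION test9.fact_certification_assignment_status_reseg AS
SELECT *
FROM test9.fact_certification_assignment_status
ORDER BY assignment_id
SEGMENTED BY HASH(assignment_id) ALL NODES KSAFE 1;

-- Populate the new projection from the data already in the table.
SELECT REFRESH('test9.fact_certification_assignment_status');
```

Once the new projection is populated and you have confirmed that its per-node sizes are even, the skewed projection can be removed with DROP PROJECTION.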

In this example, test9.fact_certification_assignment_status has a lot of skew, but store.store_orders_fact is worse. Even though its skew percentage is lower, the difference between its smallest and largest node footprints is over 500 MB.
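To surface cases like this directly, the subtraction mentioned earlier can be added to the query and used for ranking. The variant below is a sketch under the same assumptions as the original query (the same system tables and the same 1,000,000-byte per-node filter); it uses plain MIN and MAX aggregates instead of FIRST_VALUE and sorts by the absolute gap in MB.

```sql
-- Rank segmented projections by the absolute gap between their
-- smallest and largest per-node footprint.
SELECT ps.projection
     , MIN(ps.used_bytes) / 1024^2 AS min_used_mb
     , MAX(ps.used_bytes) / 1024^2 AS max_used_mb
     , (MAX(ps.used_bytes) - MIN(ps.used_bytes)) / 1024^2 AS diff_mb
     , (1 - MIN(ps.used_bytes) / MAX(ps.used_bytes)) * 100 AS skew_pct
FROM (
    -- total bytes per node for each projection
    SELECT node_name, projection_id
         , projection_schema || '.' || projection_name AS projection
         , SUM(used_bytes) AS used_bytes
    FROM projection_storage
    GROUP BY 1, 2, 3
) AS ps
JOIN projections p USING (projection_id)
WHERE p.is_segmented
  AND ps.used_bytes > 1000000
GROUP BY ps.projection
ORDER BY diff_mb DESC
LIMIT 30;
```

Sorting by diff_mb puts large tables with moderate skew ahead of tiny tables with a high percentage, which matches the judgment call described above.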

About the Author

Soniya Shah
Information Developer

Currently, a first year law student with a background in science and technology. Experienced technical writer, with specializations in software documentation, big data, blog development, and website development. I build user-centered content to communicate complex and technical information more easily.

I used to work for Vertica full time for about 3 years. I still work at Vertica part time while going to law school.

Update: Soniya is now doing her law internship, and no longer working at Vertica. Good luck, Soniya!
