Are Your Columns too Wide?

Posted November 14, 2019 by Bryan Herger, Vertica Big Data Solution Architect at Micro Focus

This tip expands on the earlier post on encoding and compression at Checking and Improving Column Compression and Encoding

When you have millions to billions of rows, data type becomes a bit more important: even an extra 10 bytes per row across a huge data set will impact storage or performance (or both!).

When I created the big_fact_table, I included some VARCHAR fields, but failed to set a width. Vertica defaults to VARCHAR(80) in this case. However, all of my strings are 8-character alpahnumerics, so why have VARCHAR(80) when VARCHAR(10) will suffice? I created a new table named “varchar10” with the same fields but altered the VARCHAR fields to VARCHAR(10) with ALTER TABLE…ALTER COLUMN…SET DATA TYPE VARCHAR(10).

This had almost no impact on table size, thanks to compression and encoding:

dbadmin=> select anchor_table_name, projection_name, used_bytes from projection_storage where anchor_table_name like 'big_fact_table%' limit 5;
anchor_table_name | projection_name | used_bytes
-----------------------------+-----------------------------------------+------------
big_fact_table | big_fact_table_super | 6843970195
big_fact_table_varchar10 | big_fact_table_varchar10_super | 6843968398

But, when I load data with INSERT…SELECT, I see a significant performance improvement with the smaller VARCHAR:

dbadmin=> insert into public.big_fact_table select * from public.big1090;
OUTPUT
445580312
(1 row)
Time: First fetch (1 row): 380333.748 ms. All rows formatted: 380333.823 ms
dbadmin=> insert into public.big_fact_table_varchar10 select * from public.big1090;
OUTPUT
445580312
(1 row)
Time: First fetch (1 row): 269333.788 ms. All rows formatted: 269333.864 ms

That is around 25% faster! Why? During load and query, Vertica must allocate enough resources assuming the fields will actually contain the full data size. If you allocate too much memory, this will slow down sorts, segmentation, and so on as Vertica needs to move around data types that have lots of empty space. So optimizing data type and length helps Vertica optimize memory usage and run faster!

You can check whether you’ve allocated too much space in a VARCHAR or VARBINARY with SELECT MAX(LENGTH()) FROM ; and compare to the column DDL.

For more information and limitations:
https://www.vertica.com/docs/latest/HTML/Content/Authoring/AdministratorsGuide/Tables/ColumnManagement/ReduceColumnWidth.htm

Related Posts:
Checking and Improving Column Compression and Encoding
Get the Row Count from an Outer Table Join
Encode Projection Columns with Zstandard Compression: Quick Tip