This tip expands on the earlier post on encoding and compression at Checking and Improving Column Compression and Encoding
When you have millions to billions of rows, data type becomes a bit more important: even an extra 10 bytes per row across a huge data set will impact storage or performance (or both!).
When I created the big_fact_table, I included some VARCHAR fields, but failed to set a width. Vertica defaults to VARCHAR(80) in this case. However, all of my strings are 8-character alpahnumerics, so why have VARCHAR(80) when VARCHAR(10) will suffice? I created a new table named “varchar10” with the same fields but altered the VARCHAR fields to VARCHAR(10) with ALTER TABLE…ALTER COLUMN…SET DATA TYPE VARCHAR(10).
This had almost no impact on table size, thanks to compression and encoding:
dbadmin=> select anchor_table_name, projection_name, used_bytes from projection_storage where anchor_table_name like 'big_fact_table%' limit 5;
anchor_table_name | projection_name | used_bytes
big_fact_table | big_fact_table_super | 6843970195
big_fact_table_varchar10 | big_fact_table_varchar10_super | 6843968398
But, when I load data with INSERT…SELECT, I see a significant performance improvement with the smaller VARCHAR:
dbadmin=> insert into public.big_fact_table select * from public.big1090;
Time: First fetch (1 row): 380333.748 ms. All rows formatted: 380333.823 ms
dbadmin=> insert into public.big_fact_table_varchar10 select * from public.big1090;
Time: First fetch (1 row): 269333.788 ms. All rows formatted: 269333.864 ms
That is around 25% faster! Why? During load and query, Vertica must allocate enough resources assuming the fields will actually contain the full data size. If you allocate too much memory, this will slow down sorts, segmentation, and so on as Vertica needs to move around data types that have lots of empty space. So optimizing data type and length helps Vertica optimize memory usage and run faster!
You can check whether you’ve allocated too much space in a VARCHAR or VARBINARY with SELECT MAX(LENGTH()) FROM ; and compare to the column DDL.
For more information and limitations:
Checking and Improving Column Compression and Encoding
Get the Row Count from an Outer Table Join
Encode Projection Columns with Zstandard Compression: Quick Tip