This blog post was authored by Shrirang Kamat.
The following design considerations will help you improve the performance and reduce the memory usage of your Vertica queries.
When creating table definitions, carefully choose the size of each lookup column based on your data; properly sized columns improve performance. For example, if you have an INVENTORY table that stores information about books in a library, the TITLE column could be defined as VARCHAR(200) rather than VARCHAR(2000).
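One way to pick a safe size is to inspect the data itself. A minimal sketch, assuming the INVENTORY table from the example is already loaded (the table and column names mirror the example, not a fixed schema):

```sql
-- Measure the longest and average actual lengths of the lookup column,
-- then size the column with some headroom (e.g., round up to VARCHAR(200)).
SELECT MAX(LENGTH(Title))        AS max_title_length,
       AVG(LENGTH(Title))::INT   AS avg_title_length
FROM INVENTORY;
```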
To improve query performance, Vertica performs late materialization of lookup columns whenever possible. For queries that join two tables, lookup columns in the SELECT clause that belong to the inner relation of a HASH join are materialized early. The memory Vertica acquires to build hash tables is based on the defined size of these lookup columns, which can have significant performance and memory implications for queries. Sizing lookup columns appropriately may also resolve the “inner join did not fit in memory” error that occurs when the memory required to build a HASH join exceeds the available memory.
For this post, we used a data set hosted on Kaggle
that contains checkouts from a Seattle library. The data set contains two file types, Checkouts and Inventory. Note that this example was run on a small virtual machine with no tuning improvements. These performance numbers should not be treated as Vertica performance benchmarks.
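The tables can be created and loaded along these lines. This is a sketch only: the column sizes, file paths, and delimiter are assumptions based on the example, not the exact schema used in this post.

```sql
-- Illustrative DDL; the sizes shown match the "before" state described
-- below, where Title is oversized at VARCHAR(2000).
CREATE TABLE INVENTORY (
    BibNumber NUMERIC(38),
    Title     VARCHAR(2000)
);

CREATE TABLE CHECKOUTS (
    BibNumber        NUMERIC(38),
    ItemBarcode      VARCHAR(50),   -- assumed size
    ItemType         VARCHAR(50),   -- assumed size
    Collection       VARCHAR(50),   -- assumed size
    CallNumber       VARCHAR(100),  -- assumed size
    CheckoutDateTime TIMESTAMP,
    Notes            VARCHAR(500)   -- assumed size
);

-- Bulk load from the downloaded CSV files (paths are placeholders).
COPY INVENTORY FROM '/data/inventory.csv' DELIMITER ',' ENCLOSED BY '"' DIRECT;
COPY CHECKOUTS FROM '/data/checkouts.csv' DELIMITER ',' ENCLOSED BY '"' DIRECT;
```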
We will measure performance and the memory acquired by the following query:
SELECT checkouts.*, title
FROM CHECKOUTS JOIN INVENTORY USING (BibNumber)
WHERE CheckoutDateTime < '2006-01-01';
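To see how Vertica will execute the query without running it, prefix it with EXPLAIN:

```sql
EXPLAIN
SELECT checkouts.*, title
FROM CHECKOUTS JOIN INVENTORY USING (BibNumber)
WHERE CheckoutDateTime < '2006-01-01';
```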
The explain plan for the above query indicates that Vertica will use a HASH join and materialize two columns, inventory.BibNumber and inventory.Title, from the inner relation into the HASH join operator.
+-JOIN HASH [Cost: 584K, Rows: 4M] (PATH ID: 1) Inner (BROADCAST)
| Join Cond: (checkouts.BibNumber = inventory.BibNumber)
| Materialize at Output: checkouts.ItemBarcode, checkouts.ItemType, checkouts.Collection, checkouts.CallNumber, checkouts.CheckoutDateTime, checkouts.notes
| Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for checkouts [Cost: 104K, Rows: 4M] (PATH ID: 2)
| | Projection: public.checkouts_P_b0
| | Materialize: checkouts.BibNumber
| | Filter: (checkouts.CheckoutDateTime < '2006-01-01 00:00:00'::timestamp)
| | Execute on: All Nodes
| | Runtime Filter: (SIP1(HashJoin): checkouts.BibNumber)
| +-- Inner -> STORAGE ACCESS for inventory [Cost: 14K, Rows: 3M] (PATH ID: 3)
| | Projection: public.inventory_NEWP_b0
| | Materialize: inventory.BibNumber, inventory.Title
| | Execute on: All Nodes
The size of the columns materialized from the inner relation plays an important role in the memory acquired to build the HASH join. In our case, inventory.BibNumber is defined as NUMERIC(38) and inventory.Title is defined as VARCHAR(2000). The query above produced the first 1,000 rows in 8 seconds, and all rows were formatted in 45 seconds. The query started with 3 GB of memory and acquired an additional 6 GB during execution.
We altered inventory.Title from VARCHAR(2000) to VARCHAR(200) using the following ALTER TABLE statement. The same query then produced the first 1,000 rows in 3 seconds, and all rows were formatted in 38 seconds. The query started with 3 GB of memory and made no additional acquisitions.
ALTER TABLE INVENTORY ALTER COLUMN TITLE SET DATA TYPE varchar(200);
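You can confirm the new column definition through the v_catalog.columns system table (the ILIKE filter is just one way to locate the table):

```sql
-- Verify that Title is now varchar(200).
SELECT column_name, data_type
FROM v_catalog.columns
WHERE table_name ILIKE 'inventory';
```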
While Vertica is considering an enhancement request in this area for a future release, you can get an immediate gain in performance and memory acquisition by right-sizing your lookup columns as described above.
Note: We used the following queries to check for memory usage:
dbadmin=> select max(memory_kb) from dc_resource_acquisitions where transaction_id=45035996280563129 and statement_id=1;
dbadmin=> select max(memory_kb) from dc_resource_acquisitions where transaction_id=45035996280563117 and statement_id=1;
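The transaction_id and statement_id for your own query can be looked up in the v_monitor.query_requests system table; the predicate on request below is an illustrative way to find the statement you just ran:

```sql
-- Find the identifiers of recent statements touching INVENTORY.
SELECT transaction_id, statement_id, request
FROM v_monitor.query_requests
WHERE request ILIKE '%INVENTORY%'
ORDER BY start_timestamp DESC
LIMIT 5;
```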