Hierarchical Partitioning

Posted April 16, 2018 by Soniya Shah, Information Developer

High angle view of Beijing Guomao.
This blog post was authored by Shrirang Kamat. This document demonstrates how you can use hierarchical partitions, introduced in Vertica 9.0, to solve a few example use cases. For more information, see Hierarchical Partitioning. Partitioning a table helps with managing the information life cycle. Since data for partitions is segregated into different ROS containers, predicates on partitioning columns may help prune ROS containers more effectively during query execution. As a best practice, we advise that you only partition fact tables. Partitioning dimension tables or partitioning with fine grained partitioning keys may create too many ROS containers, which will increase the catalog size. In this document we will discuss how hierarchical partitioning may be used to solve the following use cases.

Use Case 1

We want to partition the store_orders table by day and we want to keep six years of historical data. Without using hierarchical partitioning, this problem is more difficult to handle. Since the data for partitions is segregated into different ROS containers, we will encounter the max ROS container limit of 1024 once we try to load the third year of data (365 days times 3 years equals 1,095 containers). Considering that we hit the container limit with less than three years of data, we will definitely encounter it if we need to store six years of data. Previously, you could work around this problem by using the now deprecated merge_partition API to merge partitions that are older than six months by the month. This workaround runs into issues such as the Tuple Mover Mergeout algorithm, which does not handle ROS containers with merged partitions correctly. Creating more than 1024 partitions using a single COPY statement would result in a too many partitions error. This use case is ideal for hierarchical partitioning. Hierarchical partitioning works well for tables partitioned on date column; you specify additional partitions and then grouping by expression. The following DDL statement can be used to create the store_orders table, which is partitioned by date. Partitions older than six months will be automatically merged by months. The Tuple Mover Mergeout algorithm will merge partitions based on the grouping expression specified in table definition. Two important things to note here are: 1. You can load data for all six years in one statement and hierarchical partitions will merge partitions into ROS containers according to the “GROUP BY” clause at load. 2. You must be aware of partition grouping. Since the data for the oldest month is grouped by month, it is efficient to run partition operations on the entire month versus each day. The operation on a day from the oldest month will require that the ROS container to be partitioned. Please refer to the Vertica documentation for details on hierarchical partitions and the CALENDAR_HIERARCHY_DAY function. => CREATE TABLE public.store_orders ( order_no int, order_date timestamp NOT NULL, shipper varchar(20), ship_date date ) PARTITION BY order_date::DATE GROUP BY CALENDAR_HIERARCHY_DAY(order_date::DATE,6,6); The first argument, active_months, of 6 will keep 5 months and 1 day, up to 6 months. If you want to keep a minimum of 6 months, then the first argument, active_months, should be set to 7.

Use Case 2

We want to partition the store_orders table by hour and want to keep 60 months of historical data. We frequently get data corrections up to the last 7 days and we have to replace data for an hour with new data. Most dashboard applications fetch data for date ranges within the most recent three months and ROS pruning is key to achieve expected performance. This is another ideal case for hierarchical partitioning. Since we want to partition by hour, we may not be able to use CALENDAR_HIERARCHY_DAY function, but we can create a custom GROUP BY clause to handle the above requirements. The following DDL statements will partition the table by the hour. We will leave 7 days worth of data in the ROS containers per partition and merge partitions older than 3 days by day up to 3 months, and partition the partitions older than 3 months by month. As you can see, we will create 43,800 partitions at the end of 5 years. But, the number of ROS containers will be less than half of the ROS container limit of 1024, because we are grouping the partitions into ROS containers. => CREATE TABLE public.store_orders1 ( order_no int, order_date timestamp NOT NULL, shipper varchar(20), ship_date date ) PARTITION BY (date_trunc('hour',store_orders.order_date)) GROUP BY (CASE WHEN ("datediff"('month', (date_trunc('hour',store_orders.order_date)), current_timestamp) >= 3) THEN (date_trunc('month', (date_trunc('hour',store_orders.order_date)))) WHEN ("datediff"('day', (date_trunc('hour',store_orders.order_date)), current_timestamp) >=7) THEN (date_trunc('day', (date_trunc('hour',store_orders.order_date)))) ELSE date_trunc('hour',store_orders.order_date) END); With hierarchical partitioning, you can create more than 1024 partitions on a table as long as the table is defined with the GROUP BY expression that creates less than 1024 groups. In use case 2, you will have 43800 partitions but approximately 300 groups. The following are best practices to keep in mind when using hierarchical partitions: 1. If you are building a custom GROUP BY clause, create a maximum of 3 levels of hierarchies. 2. You can add the GROUP BY clause to partitioned tables by running the ALTER TABLE statement. For an example, see CALENDAR_HIERARCHY_DAY. 3. When partitions are grouped into a ROS container, you need to make sure that partition operations are performed on a group entirely. Otherwise Vertica will be forced to partition ROS containers, which may fail on the ROS container limit or too many data partitions error. In use case 2, we avoided grouping partitions into years because the customer wanted to drop the oldest partition by month. For example: It is a best practice to swap all partitions from January 2014 entirely because all partitions for the month of January are grouped in a single ROS container. SELECT SWAP_PARTITIONS_BETWEEN_TABLES('store_orders','2014-01-01 00:00:00','2014-01-01 23:00:00','tests.store_orders ',true); – recommended It is not recommended to swap a subset of partitions grouped into a ROS container as shown below. Doing so will run into errors such as “Too many data partitions”. SELECT SWAP_PARTITIONS_BETWEEN_TABLES('store_orders','2014-01-01 00:00:00','2014-01-01 13:00:00','tests.store_orders ',true);