Using Partition Columns
An ORC or Parquet file contains data columns. You can also add partition columns to these files at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values. Partitioning can improve the performance of queries that restrict results by the partitioned column.
For example, if you have a table with a date column, and you know you will be writing queries restricted to particular dates, you can partition by date. Thus, Vertica can skip reading some files entirely when executing your date-restricted queries. This behavior is called partition pruning.
You can create partitions regardless of where you store the files—in HDFS, in an S3 bucket, on a local file system, or in a shared file system such as NFS.
You can use Hive or the Vertica Parquet Writer to create partitions, or you can create them manually. For information about creating partitions as part of exporting data from Vertica, see Partitioning and Sorting Data. See Partitioning Hive Tables for information about tuning partitions.
By default, both Hive and Vertica write Hadoop columnar format files that contain the data for all table columns without partitioning. The column data is laid out in stripes, or groups of row data. When Vertica loads this data it reads all of the stripes.
If you partition the data, however, you can avoid writing some of that data into the files and thus reduce the amount to be read. Instead of storing a column's data in the files, you create a directory structure that partitions the data based on the value in a column.
For example, if the data includes a date column, you can write each date as a separate partition. Each partition is a directory with a name of the form "column=value". If you have a date column named "created" that is partitioned by day, you would have the following directory structure:
path/created=2016-11-01/*
path/created=2016-11-02/*
path/created=2016-11-03/*
path/...
As this example shows, the files in each subdirectory contain all columns except the "created" column.
You can partition by more than one column, creating a layered structure. For example, adding another partitioned column, "region", to the preceding example would produce the following directory structure:
path/created=2016-11-01/region=northeast/*
path/created=2016-11-01/region=central/*
path/created=2016-11-01/region=southeast/*
path/created=2016-11-01/...
path/created=2016-11-02/region=northeast/*
path/created=2016-11-02/region=central/*
path/created=2016-11-02/region=southeast/*
path/created=2016-11-02/...
path/created=2016-11-03/...
path/...
With this change, the data files contain all columns except "created" and "region".
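The layout described above can be modeled with a short Python sketch. This is an illustration only, not how Hive or the Vertica Parquet Writer is implemented; the function name `partition_rows` and the sample rows are hypothetical. It groups rows by their partition directory and drops the partition columns from the data that would be written to the files.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_rows(base, rows, partition_cols):
    """Group rows by 'column=value' directory and strip the partition columns."""
    groups = defaultdict(list)
    for row in rows:
        directory = PurePosixPath(base)
        # One directory level per partition column, in the order given.
        for col in partition_cols:
            directory = directory / f"{col}={row[col]}"
        # The data files keep every column except the partition columns.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        groups[str(directory)].append(data)
    return dict(groups)


# Hypothetical sample data using the column names from the text.
rows = [
    {"id": 1, "name": "a", "created": "2016-11-01", "region": "northeast"},
    {"id": 2, "name": "b", "created": "2016-11-01", "region": "central"},
    {"id": 3, "name": "c", "created": "2016-11-02", "region": "northeast"},
]
layout = partition_rows("path", rows, ["created", "region"])
for directory in sorted(layout):
    print(directory, layout[directory])
```

Each key of `layout` corresponds to one partition directory, and each value holds only the "id" and "name" columns.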
The files must contain at least one real (not partitioned) column. You cannot partition by every column in a table.
You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning on columns of type BOOLEAN, FLOAT, or NUMERIC.
Under some circumstances Hive writes a partition with a value of __HIVE_DEFAULT_PARTITION__. Vertica treats these values as NULL.
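As a rough illustration of how partition values are carried by directory names rather than by the files, the following Python sketch extracts "column=value" pairs from a path and maps the Hive default-partition marker to a null (`None`). The function name `parse_partition_values` and the file name `part-0.orc` are hypothetical, and the sketch does not claim to reproduce Vertica's parser.

```python
HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"


def parse_partition_values(path):
    """Return {column: value} for each 'column=value' component of the path."""
    values = {}
    for part in path.split("/"):
        if "=" in part:
            column, _, value = part.partition("=")
            # The Hive default-partition marker is treated as NULL.
            values[column] = None if value == HIVE_DEFAULT else value
    return values


print(parse_partition_values(
    "path/created=2016-11-01/region=__HIVE_DEFAULT_PARTITION__/part-0.orc"
))
```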
When creating an external table from partitioned data, you must do all of the following:
- In the column definition in the table, list the partition columns last and in order.
- In the path, use wildcards to include all of the levels of directories and files.
- In the ORC or PARQUET statement, specify the partition columns, in order, in the hive_partition_cols parameter. (The argument name is the same even if you didn't use Hive to do the partitioning; it refers to Hive-style partitions.)
The following example creates an external table using the partitioned data shown previously. The table includes four columns. Two columns, "id" and "name", are in the data files. The other two, "created" and "region", are partitioned.
=> CREATE EXTERNAL TABLE t (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'hdfs:///path/*/*/*' ORC(hive_partition_cols='created,region');
The path includes one wildcard (*) for each level of directory partitioning and then one more for the files. The number of wildcards must always be one more than the number of partitioned columns.
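The wildcard rule is simple arithmetic: one `*` per partition level plus one for the files. A minimal Python sketch, using a hypothetical helper named `partition_glob`, builds the path pattern from the number of partitioned columns:

```python
def partition_glob(base, num_partition_cols):
    """Build a path with one '*' per partition level plus one for the files."""
    wildcards = ["*"] * (num_partition_cols + 1)
    return "/".join([base.rstrip("/")] + wildcards)


# Two partition columns ("created" and "region") need three wildcards.
print(partition_glob("hdfs:///path", 2))
```

With two partitioned columns this yields the same pattern used in the preceding CREATE EXTERNAL TABLE statement.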
You do not need to include all of the partitioned columns in hive_partition_cols if those columns are not relevant for your queries. However, the partition columns must be the last columns in the table definition. For example, you can define the following table for the partitioned data shown previously:
=> CREATE EXTERNAL TABLE t2 (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'hdfs:///path/*/*/*' ORC(hive_partition_cols='region');
Values in the "created" column are all null because no data appears in the files for that column and hive_partition_cols does not include it.
However, the following example produces an error.
=> CREATE EXTERNAL TABLE t3 (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'hdfs:///path/*/*/*' ORC(hive_partition_cols='created');
In this example, the table definition includes the "region" column after the "created" column, and "region" is not included in hive_partition_cols. Because this column is not listed as a partition column, Vertica interprets it as a data column and produces an error because the column is not present.
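The ordering rule can be expressed as a check that the columns named in hive_partition_cols are exactly the trailing columns of the table definition, in the same order. The following Python sketch models that rule as stated in this section. The function `check_partition_cols` is hypothetical, and the sketch does not reproduce when or how Vertica reports the error.

```python
def check_partition_cols(table_cols, hive_partition_cols):
    """Raise ValueError unless the partition columns are the last table columns, in order."""
    n = len(hive_partition_cols)
    if n == 0:
        return
    trailing = list(table_cols[-n:])
    if trailing != list(hive_partition_cols):
        raise ValueError(
            f"partition columns {list(hive_partition_cols)} must be the last "
            f"columns of the table, but the table ends with {trailing}"
        )


cols = ["id", "name", "created", "region"]
check_partition_cols(cols, ["created", "region"])  # table t: accepted
check_partition_cols(cols, ["region"])             # table t2: accepted
try:
    check_partition_cols(cols, ["created"])        # table t3: rejected
except ValueError as err:
    print("error:", err)
```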
If Vertica cannot convert a partition value to the declared type for that column, it sets the value to NULL. The following example incorrectly declares region to be an integer rather than a varchar.
=> CREATE EXTERNAL TABLE t4 (id int, name varchar(50), created date, region int)
   AS COPY FROM 'hdfs:///path/*/*/*' ORC(hive_partition_cols='region');
Vertica cannot coerce the value in a directory named "region=northeast" into an integer, so it sets that column value to NULL for all rows it reads from this directory. If you declare the column as NOT NULL, Vertica rejects those rows. If the number of rejected rows exceeds REJECTMAX, Vertica reports an error.
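The conversion behavior can be pictured as a "convert or fall back to null" step applied to the text after the equals sign in the directory name. The following Python sketch is only a model of that idea. The function `coerce_partition_value` and its small type map are hypothetical and cover just the three types used in this section.

```python
from datetime import date


def coerce_partition_value(raw, declared_type):
    """Convert a partition value to the declared type, or return None on failure."""
    converters = {"int": int, "date": date.fromisoformat, "varchar": str}
    try:
        return converters[declared_type](raw)
    except ValueError:
        # A value that cannot be converted becomes NULL (None here).
        return None


print(coerce_partition_value("northeast", "varchar"))  # kept as a string
print(coerce_partition_value("northeast", "int"))      # cannot convert: None
print(coerce_partition_value("2016-11-01", "date"))    # parsed as a date
```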
If you change how files are partitioned on disk, you must re-create your external tables.
When executing queries with predicates, Vertica skips subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. See Improving Query Performance for more information about partition pruning and other techniques for optimizing queries.
The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.
=> SELECT * FROM t WHERE region='northeast';
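The effect of such a query on the directory tree can be modeled in a few lines of Python. This is a conceptual sketch, not Vertica's planner. The function `prune_partitions` is hypothetical, and the directory names are taken from the layout shown earlier. Only directories whose partition values satisfy the equality predicate are kept, and everything else is skipped without being read.

```python
def prune_partitions(directories, predicate):
    """Keep directories whose 'column=value' parts match every predicate entry."""
    kept = []
    for directory in directories:
        values = dict(
            part.split("=", 1) for part in directory.split("/") if "=" in part
        )
        if all(values.get(col) == want for col, want in predicate.items()):
            kept.append(directory)
    return kept


directories = [
    "path/created=2016-11-01/region=northeast",
    "path/created=2016-11-01/region=central",
    "path/created=2016-11-02/region=northeast",
    "path/created=2016-11-02/region=southeast",
]
# Models WHERE region='northeast': all dates qualify, only one region does.
for directory in prune_partitions(directories, {"region": "northeast"}):
    print(directory)
```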
To verify that Vertica is pruning partitions, look in the explain plan for a message similar to the following:
files with unmatched Hive partition have been pruned