Auditing Database Size

You can use your Vertica software until columnar data reaches the maximum raw data size that your license agreement allows. Vertica periodically runs an audit of the columnar data size to verify that your database complies with this agreement. You can also run your own audits of database size with two functions:

  • AUDIT: Estimates the raw data size of a database, schema, or table.
  • AUDIT_FLEX: Estimates the size of one or more flexible tables in a database, schema, or projection.

The following two examples audit the database and schema online_sales:

=> SELECT AUDIT('', 'database');
(1 row)
=> SELECT AUDIT('online_sales', 'schema');
(1 row)

Raw Data Size

AUDITand AUDIT_FLEX use statistical sampling to estimate the raw data size of data stored in Vertica tables—that is, the uncompressed data that the database stores. For most data types, Vertica evaluates the raw data size as if the data were exported from the database in text format, rather than as compressed data. For arrays and sets, it evaluates the elements individually rather than the collection.

By using statistical sampling, the audit minimizes its impact on database performance. The tradeoff between accuracy and performance impact is a small margin of error. Reports on your database size include the margin of error, so you can assess the accuracy of the estimate.

Data in ORC and Parquet-based external tables are also audited whether they are stored locally in the Vertica cluster's file system or remotely in S3 or on a Hadoop cluster. AUDIT always uses the file size of the underlying data files as the amount of data in the table. For example, suppose you have an external table based on 1GB of ORC files stored on a Hadoop cluster's HDFS. Then an audit of the table reports it as being 1GB in size.

The Vertica audit does not verify that these files contain actual ORC or Parquet data. It just checks the size of the files that correspond to the external table definition.

Unaudited Data

Table data that appears in multiple projections is counted only once. An audit also excludes the following data:

  • Temporary table data.
  • Data in SET USING columns.
  • Non-columnar data accessible through external table definitions. Data in columnar formats such as ORC and Parquet count against your totals.
  • Data that was deleted but not yet purged.
  • Data stored in system and work tables such as monitoring tables, Data Collector tables, and Database Designer tables.
  • Delimiter characters.

Evaluating Data Type Footprint

Vertica evaluates the footprint of different data types as follows:

  • Strings and binary types—CHAR, VARCHAR, BINARY, VARBINARY—are counted as their actual size in bytes using UTF-8 encoding.
  • Numeric data types are evaluated as if they were printed. Each digit counts as a byte, as does any decimal point, sign, or scientific notation. For example, -123.456 counts as eight bytes—six digits plus the decimal point and minus sign.
  • Date/time data types are evaluated as if they were converted to text, including hyphens, spaces, and colons. For example, vsql prints a timestamp value of 2011-07-04 12:00:00 as 19 characters, or 19 bytes.

Controlling Audit Accuracy

AUDIT can specify the level of an audit's error tolerance and confidence, by default set to 5 and 99 percent, respectively. For example, you can obtain a high level of audit accuracy by setting error tolerance and confidence level to 0 and 100 percent, respectively. Unlike estimating raw data size with statistical sampling, Vertica dumps all audited data to a raw format to calculate its size. If you audit the entire database at this level, the audit also includes contents of the WOS.

Vertica discourages database-wide audits at this level. Doing so can have a significant adverse impact on database performance.

The following example audits the database with 25% error tolerance:

=> SELECT AUDIT('', 25);
(1 row)

The following example audits the database with 25% level of tolerance and 90% confidence level:

=> SELECT AUDIT('',25,90);
(1 row)

These accuracy settings have no effect on audits of external tables based on ORC or Parquet files. Audits of external tables based on these formats always use the file size of ORC or Parquet files.