Calculating the Database Size
You can use your Vertica software until your columnar data reaches the maximum raw data size that the license agreement provides. This section describes when data is monitored, what data is included in the estimate, and the general methodology used to produce an estimate. For more information about monitoring for data size, see Monitoring Database Size for License Compliance.
How Vertica Estimates Raw Data Size
Vertica uses statistical sampling to calculate an accurate estimate of the raw data size of the database. In this context, raw data means the uncompressed data stored in a single Vertica database. For the purpose of license size audit and enforcement, Vertica evaluates the raw data size as if the data had been exported from the database in text format, rather than as compressed data.
Vertica conducts your database size audit using statistical sampling. This method allows Vertica to estimate the size of the database without significantly affecting database performance. The trade-off between accuracy and impact on performance is a small margin of error, inherent in statistical sampling. Reports on your database size include the margin of error, so you can assess the accuracy of the estimate. To learn more about simple random sampling, see Simple Random Sampling.
Excluding Data From Raw Data Size Estimate
Not all data in the Vertica database is evaluated as part of the raw data size. Specifically, Vertica excludes the following data:
- Multiple projections (underlying physical copies) of data from a logical database entity (table). Data appearing in multiple projections of the same table is counted only once.
- Data stored in temporary tables.
- Data accessible through external table definitions.
- Data that has been deleted, but that remains in the database. To understand more about deleting and purging data, see Purging Deleted Data.
- Data stored in the WOS.
- Data stored in system and work tables such as monitoring tables, Data Collector tables, and Database Designer tables.
- Delimiter characters.
- Data stored in SET USING denormalized columns. For information on denormalized (flattened) tables, see Flattened Tables.
Evaluating Data Type Footprint Size
Vertica treats the data sampled for the estimate as if it had been exported from the database in text format (such as printed from vsql). This means that Vertica evaluates the data type footprint sizes as follows:
- Strings and binary types (CHAR, VARCHAR, BINARY, VARBINARY) are counted as their actual size in bytes using UTF-8 encoding.
- Numeric data types are counted as if they had been printed. Each digit counts as a byte, as does any decimal point, sign, or scientific notation. For example, -123.456 counts as eight bytes (six digits plus the decimal point and minus sign).
- Date/time data types are counted as if they had been converted to text, including any hyphens or other separators. For example, a timestamp column containing the value for noon on July 4th, 2011 would be 19 bytes. As text, vsql would print the value as 2011-07-04 12:00:00, which is 19 characters, including the space between the date and the time.
Using AUDIT to Estimate Database Size
To supply a more accurate database size estimate than statistical sampling can provide, use the AUDIT function to perform a full audit. This function has parameters to set both the error_tolerance
and confidence_level
. Using one or both of these parameters increases or decreases the function's performance impact.
For instance, lowering the error_tolerance
to zero (0) and raising the confidence_level
to 100, provides the most accurate size estimate, and increases the performance impact of calling the AUDIT function. During a detailed, low error-tolerant audit, all of the data in the database is dumped to a raw format to calculate its size. Since performing a stringent audit can significantly affect database performance, never perform a full audit of a production database. See AUDIT for details.
Note: Unlike estimating raw data size using statistical sampling, a full audit performs SQL queries on the full database contents, including the contents of the WOS.