Transforming Data During Loads

To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading. Transforming data while loading is useful for computing values to insert into a target database column from other columns in the source database.

To transform data during load, use the following syntax to specify the target column for which you want to compute values, as an expression:

COPY [ [database-name.]schema-name.]table [([Column as Expression] / column[FORMAT 'format']
     [ ,...])]
FROM ...

Understanding Transformation Requirements

When transforming data during loads, the COPY statement must contain at least one parsed column. The parsed column can be a FILLER column.

Specify only RAW data in the parsed column source data. If you specify nulls in that RAW data, the columns are evaluated with the same rules as for SQL statement expressions.

You can intersperse parsed and computed columns in a COPY statement.

Loading FLOAT Values

Vertica parses floating-point values internally. COPY does not require you to cast floats explicitly, unless you need to transform the values for another reason.

Using Expressions in COPY Statements

The expression in a COPY statement can be as simple as a single column, or more complex, such as a case statement for multiple columns. An expression can specify multiple columns, and multiple expressions can refer to the same parsed column. You can use expressions for columns of all supported data types.

COPY expressions can use many Vertica-supported SQL functions, operators, constants, NULLs, and comments, including these functions:

Requirements and restrictions

  • COPY expressions cannot use SQL meta-functions (Vertica-specific), analytic functions, aggregate functions, or computed columns.
  • For computed columns, you must list all parsed columns in the COPY statement expression. Do not specify FORMAT or RAW in the source data for a computed column.
  • Expressions used in a COPY statement can contain only constants. The return data type of the expression must be coercible to that of the target column. Parsed column parameters are also coerced to match the expression.

Handling Expression Errors

Errors in expressions within your COPY statement are SQL errors. As such, they are handled differently from parse errors. When a parse error occurs, COPY rejects the row and adds it to the rejected data file or table. COPY also adds the reason for a rejected row to the exceptions file or the rejected data table. For example, COPY parsing does not implicitly cast data types. If a type mismatch occurs between the data being loaded and a column type (such as attempting to load a text value into a FLOAT column), COPY rejects the row, and continues processing.

If an error occurs in an expression in your COPY statement, then by default the entire load fails. For example, if your COPY statement has a transform function expression, and a syntax error exists in that expression, the entire load is rolled back. All SQL errors, including COPY rollback from an expression, are stored in the Vertica-specific log file. However, unlike parse rejections and exception messages, SQL expression errors are brief, and may require further research.

You can have COPY treat errors in transformation expressions like parse errors. Rejected rows are added to the same file or table, and exceptions are added to the same exceptions file or table. To enable this behavior, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General Parameters.)

Loading data with expression rejections is potentially slower than loading with the same number of parse rejections. Enable expression rejections if your data has a few bad rows; doing so allows the rest of the data to be loaded. If you are concerned about the time it takes to complete a load with many bad rows, use the REJECTMAX parameter to set a limit. If COPY finds more than REJECTMAX bad rows, it aborts and rolls back the load.

See Capturing Load Rejections and Exceptions for more information about managing rejected data.

Transformation Example

Following is a small transformation example.

  1. Create a table t.

    => CREATE TABLE t (
        year VARCHAR(10), 
        month VARCHAR(10), 
        day VARCHAR(10), 
        k timestamp
    );
  2. Use COPY to copy the table, computing values for the year, month, and day columns in the target database, based on the timestamp columns in the source table.
  3. Load the parsed column, timestamp, from the source data to the target database.

    => COPY t(year AS TO_CHAR(k, 'YYYY'), 
            month AS TO_CHAR(k, 'Month'), 
            day AS TO_CHAR(k, 'DD'), 
            k FORMAT 'YYYY-MM-DD') FROM STDIN NO COMMIT;
    2009-06-17
    1979-06-30
    2007-11-26
    \. 
    
  4. Select the table contents to see the results:

    SELECT * FROM t;
     year |   month   | day |         k    
    ------+-----------+-----+---------------------
     2009 | June      | 17  | 2009-06-17 00:00:00
     1979 | June      | 30  | 1979-06-30 00:00:00
     2007 | November  | 26  | 2007-11-26 00:00:00
    (3 rows)
    

Deriving Table Columns From Data File Columns

You can use COPY to derive a table column from the data file to load. For more information, see Manipulating Source Data Columns.