Stored Procedures: Use Cases and Examples

Stored procedures in Vertica are best suited for complex, analytical workflows rather than small, transaction-heavy ones. Recommended use cases include information lifecycle management (ILM) activities such as extract, transform, and load (ETL), and data preparation for more complex analytical tasks like machine learning. For example (a sketch of the partition-expiration case follows this list):

  • Swapping partitions according to age
  • Exporting data at end-of-life and dropping the partitions
  • Saving inputs, outputs, and metadata from a machine learning model (e.g. who ran the model, the version of the model, how many times the model was run, who received the results, etc.) for auditing purposes
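
A minimal sketch of the partition-expiration case might look like the following. The procedure name, its parameters, and the assumption that the table is partitioned by a date key are all illustrative; ADD_MONTHS, DROP_PARTITIONS, and QUOTE_LITERAL are built-in Vertica functions:

=> CREATE PROCEDURE expire_partitions_by_age(p_table VARCHAR(128), p_keep_months INT) AS $$
DECLARE
    cutoff VARCHAR(30);
BEGIN
    -- Compute the oldest partition key to keep (hypothetical retention policy).
    cutoff := EXECUTE 'SELECT ADD_MONTHS(NOW()::DATE, -1 * $1)::VARCHAR' USING p_keep_months;
    -- Drop every partition whose key falls before the cutoff date.
    EXECUTE 'SELECT DROP_PARTITIONS(' || QUOTE_LITERAL(p_table) || ', ''0001-01-01'', ' || QUOTE_LITERAL(cutoff) || ');';
END;
$$;

A call such as CALL expire_partitions_by_age('public.sales', 24); would then keep only the most recent 24 months of partitions.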

Searching for a Value

The find_my_value() procedure searches for a user-specified value in any table column in a given schema and stores the locations of instances of the value in a user-specified table:

=> CREATE PROCEDURE find_my_value(p_table_schema VARCHAR(128), p_search_value VARCHAR(1000), p_results_schema VARCHAR(128), p_results_table VARCHAR(128)) AS $$
DECLARE
    sql_cmd VARCHAR(65000);
    sql_cmd_result VARCHAR(65000);
    results VARCHAR(65000);
BEGIN
    IF p_table_schema IS NULL OR p_table_schema = '' OR
        p_search_value IS NULL OR p_search_value = '' OR
        p_results_schema IS NULL OR p_results_schema = '' OR
        p_results_table IS NULL OR p_results_table = '' THEN
        RAISE EXCEPTION 'Please provide a schema to search, a search value, a results table schema, and a results table name.';
    END IF;
  
    sql_cmd := 'CREATE TABLE IF NOT EXISTS ' || QUOTE_IDENT(p_results_schema) || '.' || QUOTE_IDENT(p_results_table) || 
        '(found_timestamp TIMESTAMP, found_value VARCHAR(1000), table_name VARCHAR(128), column_name VARCHAR(128));';
    EXECUTE sql_cmd;

    sql_cmd_result := EXECUTE 'SELECT LISTAGG(c USING PARAMETERS max_length=1000000, separator='' '') 
        FROM (SELECT ''
        (SELECT '''''' || NOW() || ''''''::TIMESTAMP , ''''' || QUOTE_IDENT(p_search_value) || ''''','''''' || table_name || '''''', '''''' || column_name || '''''' 
            FROM '' || table_schema || ''.'' || table_name || '' 
            WHERE '' || column_name || ''::'' || 
            CASE 
                WHEN data_type_id IN (17, 115, 116, 117) THEN data_type 
                ELSE ''VARCHAR('' || LENGTH(''' || QUOTE_IDENT(p_search_value)|| ''') || '')'' END || '' = ''''' || QUOTE_IDENT(p_search_value) || ''''''' || DECODE(LEAD(column_name) OVER(ORDER BY table_schema, table_name, ordinal_position), NULL, '' LIMIT 1);'', '' LIMIT 1)

        UNION ALL '') c 
            FROM (SELECT table_schema, table_name, column_name, ordinal_position, data_type_id, data_type 
            FROM columns WHERE NOT is_system_table AND table_schema ILIKE ''' || QUOTE_IDENT(p_table_schema) || ''' AND data_type_id < 1000 
            ORDER BY table_schema, table_name, ordinal_position) foo) foo;';
            
    results := EXECUTE 'INSERT INTO ' || QUOTE_IDENT(p_results_schema) || '.' || QUOTE_IDENT(p_results_table) || ' ' || sql_cmd_result;

    RAISE INFO 'Matches Found: %', results;
END;
$$;
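
The procedure relies on LISTAGG to stitch one probe query per column into a single UNION ALL statement before executing it. As a standalone illustration of that building block, the following query (against the same columns system table the procedure reads) concatenates the column names of one schema into a single string:

=> SELECT LISTAGG(column_name USING PARAMETERS separator=', ')
     FROM (SELECT column_name FROM columns
           WHERE table_schema = 'public' AND NOT is_system_table
           ORDER BY table_name, ordinal_position) foo;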

For example, to search the public schema for instances of the string 'dog' and then store the results in public.table_list:

=> CALL find_my_value('public', 'dog', 'public', 'table_list');
 find_my_value
---------------
             0
(1 row)

=> SELECT * FROM public.table_list;
      found_timestamp       | found_value |  table_name   | column_name
----------------------------+-------------+---------------+-------------
 2021-08-25 22:13:20.147889 | dog         | another_table | b
 2021-08-25 22:13:20.147889 | dog         | some_table    | c
(2 rows)

Optimizing Tables

You can automate loading data from Parquet files and optimizing the resulting table with the create_optimized_table() procedure. This procedure:

  1. Creates an external table whose structure is inferred from the Parquet files by the Vertica INFER_EXTERNAL_TABLE_DDL function (shown standalone after this list).
  2. Creates a managed Vertica table modeled on the external table, resizing all VARCHAR columns to the maximum length of the data to be loaded.
  3. Creates a super projection using the optional segmentation/order by columns passed in as a parameter.
  4. Adds an optional primary key to the managed table passed in as a parameter.
  5. Loads a sample data set (1 million rows) from the external table into the Vertica-managed table.
  6. Drops the external table.
  7. Runs the ANALYZE_STATISTICS function on the Vertica-managed table.
  8. Runs the DESIGNER_DESIGN_PROJECTION_ENCODINGS function to get a properly encoded super projection for the Vertica-managed table.
  9. Truncates the now-optimized Vertica-managed table (we will load the entire data set in a separate script / stored procedure).
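
Step 1 relies on INFER_EXTERNAL_TABLE_DDL which, run standalone with the same parameters the procedure passes, returns the CREATE EXTERNAL TABLE statement as a single string:

=> SELECT INFER_EXTERNAL_TABLE_DDL('/home/dbadmin/rga_example/*' USING PARAMETERS format = 'parquet', table_name = 'public.parquet_table_temp');

The procedure captures that string, strips the double quotes from it, and executes it to create the temporary external table.
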
=> CREATE OR REPLACE PROCEDURE create_optimized_table(p_file_path VARCHAR(1000), p_table_schema VARCHAR(128), p_table_name VARCHAR(128), p_seg_columns VARCHAR(1000), p_pk_columns VARCHAR(1000)) LANGUAGE PLvSQL AS $$
DECLARE
    command_sql VARCHAR(1000);
    seg_columns VARCHAR(1000);
    BEGIN
    IF (QUOTE_LITERAL(p_file_path) IS NULL OR QUOTE_LITERAL(p_file_path) = '') THEN
        RAISE EXCEPTION 'Please provide a file path.';
  
    ELSEIF (QUOTE_IDENT(p_table_schema) IS NULL OR QUOTE_IDENT(p_table_schema) = '') THEN
        RAISE EXCEPTION 'Please provide a table schema.';
  
    ELSEIF (QUOTE_IDENT(p_table_name) IS NULL OR QUOTE_IDENT(p_table_name) = '') THEN
        RAISE EXCEPTION 'Please provide a table name.';  
  
    END IF;
  
    IF (QUOTE_IDENT(p_seg_columns) IS NULL OR QUOTE_IDENT(p_seg_columns) = '') THEN
        seg_columns := '';
  
    ELSE
      seg_columns := 'ORDER BY ' || QUOTE_IDENT(p_seg_columns) || ' SEGMENTED BY HASH(' || QUOTE_IDENT(p_seg_columns) || ') ALL NODES';
    END IF;
  
    EXECUTE 'DROP TABLE IF EXISTS ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp CASCADE;';

    command_sql := EXECUTE 'SELECT infer_external_table_ddl(''' || QUOTE_LITERAL(p_file_path) || ''' USING PARAMETERS format = ''parquet'', table_name = ''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp'');';
    command_sql := 'SELECT REPLACE(\$\$' || command_sql || '\$\$, ''"'', '''');'; 
    command_sql := EXECUTE command_sql;
    EXECUTE command_sql;
  
    command_sql := 'SELECT LISTAGG(y USING PARAMETERS separator='' '') 
        FROM ((SELECT 0 x, ''SELECT ''''CREATE TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '('' y 
            UNION ALL SELECT ordinal_position, column_name || '' '' || 
            CASE WHEN data_type LIKE ''varchar%'' 
                THEN ''varchar('''' || (SELECT MAX(LENGTH('' || column_name || '')) 
                    FROM '' || table_schema || ''.'' || table_name || '') || '''')'' ELSE data_type END || NVL2(LEAD('' || column_name || '', 1) OVER (ORDER BY ordinal_position), '','', '')'') 
                    FROM columns WHERE table_schema = ''' || QUOTE_IDENT(p_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_table_name) || '_temp'' 
                    UNION ALL SELECT 10000, ''' || seg_columns || ''' UNION ALL SELECT 10001, '';'''''') ORDER BY x) foo WHERE y <> '''';';
    command_sql := EXECUTE command_sql;
    command_sql := EXECUTE command_sql;
    EXECUTE command_sql;
  
    IF QUOTE_IDENT(p_pk_columns) IS NOT NULL AND QUOTE_IDENT(p_pk_columns) <> '' THEN
      EXECUTE 'ALTER TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ' ADD CONSTRAINT ' || QUOTE_IDENT(p_table_name) || '_pk PRIMARY KEY (' || QUOTE_IDENT(p_pk_columns) || ') ENABLED;';
    
    END IF;
  
    EXECUTE 'INSERT INTO ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ' SELECT * FROM ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp LIMIT 1000000;';
    EXECUTE 'DROP TABLE IF EXISTS ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp CASCADE;';
    EXECUTE 'SELECT analyze_statistics(''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ''');';
    EXECUTE 'SELECT designer_design_projection_encodings(''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ''', ''/tmp/toss.sql'', TRUE, TRUE);';
    EXECUTE 'TRUNCATE TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ';';
  
  END;
  $$;

=> CALL create_optimized_table('/home/dbadmin/rga_example/*', 'public', 'parquet_table', 'c1,c2', 'c1');
create_optimized_table
------------------------
                      0
(1 row)

=> SELECT export_objects('', 'parquet_table');
 export_objects                            
----------------
CREATE TABLE public.parquet_table
(
    c1 int NOT NULL,
    c2 varchar(4),
    c3 date,
    CONSTRAINT parquet_table_pk PRIMARY KEY (c1) ENABLED
);
 
 
CREATE PROJECTION public.parquet_table_super /*+createtype(D)*/
(
c1 ENCODING COMMONDELTA_COMP,
c2 ENCODING ZSTD_FAST_COMP,
c3 ENCODING DELTAVAL
)
AS
SELECT parquet_table.c1,
        parquet_table.c2,
        parquet_table.c3
FROM public.parquet_table
ORDER BY parquet_table.c1,
          parquet_table.c2
SEGMENTED BY hash(parquet_table.c1, parquet_table.c2) ALL NODES OFFSET 0;
 

SELECT MARK_DESIGN_KSAFE(0);
 
(1 row)

Pivoting Tables Dynamically

The stored procedure unpivot() takes a source table and a target table as input. It unpivots the source table and writes the result into the target table.

This example uses the following table:

=> SELECT * FROM make_the_columns_into_rows;
c1  | c2  |                  c3                  |             c4             |    c5    | c6
-----+-----+--------------------------------------+----------------------------+----------+----
123 | ABC | cf470c5b-50e3-492a-8483-b9e4f20d195a | 2021-08-24 18:49:40.835802 |  1.72964 | t
567 | EFG | 25ea7636-d924-4b4f-81b5-1e1c884b06e3 | 2021-08-04 18:49:40.835802 | 41.46100 | f
890 | XYZ | f588935a-35a4-4275-9e7f-ebb3986390e3 | 2021-08-29 19:53:39.465778 |  8.58207 | t
(3 rows)

This table contains the following columns:

=> \d make_the_columns_into_rows
                                               List of Fields by Tables
Schema |           Table            | Column |     Type      | Size | Default | Not Null | Primary Key | Foreign Key
--------+----------------------------+--------+---------------+------+---------+----------+-------------+-------------
public | make_the_columns_into_rows | c1     | int           |    8 |         | f        | f           |
public | make_the_columns_into_rows | c2     | varchar(80)   |   80 |         | f        | f           |
public | make_the_columns_into_rows | c3     | uuid          |   16 |         | f        | f           |
public | make_the_columns_into_rows | c4     | timestamp     |    8 |         | f        | f           |
public | make_the_columns_into_rows | c5     | numeric(10,5) |    8 |         | f        | f           |
public | make_the_columns_into_rows | c6     | boolean       |    1 |         | f        | f           |
(6 rows)

The target table has columns from the source table pivoted into rows as key/value pairs. It also has a ROWID column to tie the key/value pairs back to their original row from the source table:

=> CREATE PROCEDURE unpivot(p_source_table_schema VARCHAR(128), p_source_table_name VARCHAR(128), p_target_table_schema VARCHAR(128), p_target_table_name VARCHAR(128)) AS $$
DECLARE
    explode_command VARCHAR(10000);
BEGIN
    explode_command := EXECUTE 'SELECT ''explode(string_to_array(''''['''' || '' || LISTAGG(''NVL('' || column_name || ''::VARCHAR, '''''''')'' USING PARAMETERS separator='' || '''','''' || '') || '' || '''']'''')) OVER (PARTITION BY rn)'' explode_command FROM (SELECT table_schema, table_name, column_name, ordinal_position FROM columns ORDER BY table_schema, table_name, ordinal_position LIMIT 10000000) foo WHERE table_schema = ''' || QUOTE_IDENT(p_source_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_source_table_name) || ''';';

    EXECUTE 'CREATE TABLE ' || QUOTE_IDENT(p_target_table_schema) || '.' || QUOTE_IDENT(p_target_table_name) || ' 
        AS SELECT rn rowid, column_name key, value FROM (SELECT (ordinal_position - 1) op, column_name 
            FROM columns WHERE table_schema = ''' || QUOTE_IDENT(p_source_table_schema) || ''' 
            AND table_name = ''' || QUOTE_IDENT(p_source_table_name) || ''') a 
            JOIN (SELECT rn, ' || explode_command || ' 
            FROM (SELECT ROW_NUMBER() OVER() rn, * 
            FROM ' || QUOTE_IDENT(p_source_table_schema) || '.' || QUOTE_IDENT(p_source_table_name) || ') foo) b ON b.position = a.op';
END;
$$;
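
The dynamically built query hinges on the STRING_TO_ARRAY and EXPLODE functions: each source row's values are concatenated into an array literal, and EXPLODE emits one output row per array element, with a position column that the procedure joins back to ordinal_position - 1. A standalone illustration with a literal one-row source (the values are arbitrary):

=> SELECT rn, EXPLODE(STRING_TO_ARRAY('[123,ABC,t]')) OVER (PARTITION BY rn)
     FROM (SELECT 1 AS rn) foo;
 rn | position | value
----+----------+-------
  1 |        0 | 123
  1 |        1 | ABC
  1 |        2 | t
(3 rows)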

Call the procedure:

=> CALL unpivot('public', 'make_the_columns_into_rows', 'public', 'columns_into_rows');
 unpivot
---------
       0
(1 row)

=> SELECT * FROM columns_into_rows ORDER BY rowid, key;
rowid | key |                value
-------+-----+--------------------------------------
     1 | c1  | 123
     1 | c2  | ABC
     1 | c3  | cf470c5b-50e3-492a-8483-b9e4f20d195a
     1 | c4  | 2021-08-24 18:49:40.835802
     1 | c5  | 1.72964
     1 | c6  | t
     2 | c1  | 890
     2 | c2  | XYZ
     2 | c3  | f588935a-35a4-4275-9e7f-ebb3986390e3
     2 | c4  | 2021-08-29 19:53:39.465778
     2 | c5  | 8.58207
     2 | c6  | t
     3 | c1  | 567
     3 | c2  | EFG
     3 | c3  | 25ea7636-d924-4b4f-81b5-1e1c884b06e3
     3 | c4  | 2021-08-04 18:49:40.835802
     3 | c5  | 41.46100
     3 | c6  | f
(18 rows)

The unpivot() procedure can handle new columns in the source table as well.

Add a new column z to the source table, and then unpivot the table with the same procedure. (Because the procedure creates its target with a plain CREATE TABLE, drop the previous results table first, or pass a new target table name.)

=> ALTER TABLE make_the_columns_into_rows ADD COLUMN z VARCHAR;
ALTER TABLE

=> UPDATE make_the_columns_into_rows SET z = 'ZZZ' WHERE c1 IN (123, 890);
OUTPUT
--------
      2
(1 row)

=> CALL unpivot('public', 'make_the_columns_into_rows', 'public', 'columns_into_rows');
unpivot
---------
       0
(1 row)

=> SELECT * FROM columns_into_rows;
rowid | key |                value
-------+-----+--------------------------------------
     1 | c1  | 567
     1 | c2  | EFG
     1 | c3  | 25ea7636-d924-4b4f-81b5-1e1c884b06e3
     1 | c4  | 2021-08-04 18:49:40.835802
     1 | c5  | 41.46100
     1 | c6  | f
     1 | z   |   -- new column
     2 | c1  | 123
     2 | c2  | ABC
     2 | c3  | cf470c5b-50e3-492a-8483-b9e4f20d195a
     2 | c4  | 2021-08-24 18:49:40.835802
     2 | c5  | 1.72964
     2 | c6  | t
     2 | z   | ZZZ   -- new column
     3 | c1  | 890
     3 | c2  | XYZ
     3 | c3  | f588935a-35a4-4275-9e7f-ebb3986390e3
     3 | c4  | 2021-08-29 19:53:39.465778
     3 | c5  | 8.58207
     3 | c6  | t
     3 | z   | ZZZ   -- new column
(21 rows)

Machine Learning: Optimizing AUC Estimation

The ROC function can approximate the AUC (area under the curve), whose accuracy depends on the num_bins parameter: greater values of num_bins produce more precise approximations, but can impact performance.
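
For reference, the statement that the procedure below builds and runs on each iteration looks like this with an explicit num_bins (the categorical_test_data table is created later in this example):

=> SELECT auc FROM (SELECT ROC(obs, prob USING PARAMETERS num_bins=200, AUC=true) OVER() FROM categorical_test_data) subq WHERE auc IS NOT NULL;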

You can use the stored procedure accurate_auc(), which automatically determines the optimal num_bins value for a given epsilon (error term), to approximate the AUC:

=> CREATE PROCEDURE accurate_auc(relation VARCHAR, observation_col VARCHAR, probability_col VARCHAR, epsilon FLOAT) AS $$
DECLARE
    auc_value FLOAT;
    previous_auc FLOAT;
    nbins INT;
BEGIN
    IF epsilon > 0.25 THEN
        RAISE EXCEPTION 'epsilon must not be bigger than 0.25';
    END IF;
    IF epsilon < 1e-12 THEN
        RAISE EXCEPTION 'epsilon must be bigger than 1e-12';
    END IF;
    auc_value := 0.5;
    previous_auc := 0; -- epsilon and auc should always be less than 1
    nbins := 100;
    WHILE abs(auc_value - previous_auc) > epsilon and nbins < 1000000 LOOP
        RAISE INFO 'auc_value: %', auc_value;
        RAISE INFO 'previous_auc: %', previous_auc;
        RAISE INFO 'nbins: %', nbins;
        previous_auc := auc_value;
        auc_value := EXECUTE 'SELECT auc FROM (select roc(' || QUOTE_IDENT(observation_col) || ',' || QUOTE_IDENT(probability_col) || ' USING parameters num_bins=$1, auc=true) over() FROM ' || QUOTE_IDENT(relation) || ') subq WHERE auc IS NOT NULL' USING nbins;
        nbins := nbins * 2;
    END LOOP;
    RAISE INFO 'Result_auc_value: %', auc_value;
END;
$$;

For example, suppose you have observation and probability data in the file test_data.csv. Load the data into the table categorical_test_data:

=> \set datafile '\'/data/test_data.csv\''
=> CREATE TABLE categorical_test_data(obs INT, pred INT, prob FLOAT);
CREATE TABLE

=> COPY categorical_test_data FROM :datafile DELIMITER ',';

Call accurate_auc(). For this example, the approximated AUC will be within an epsilon of 0.01:

=> CALL accurate_auc('categorical_test_data', 'obs', 'prob', 0.01);
INFO 2005:  auc_value: 0.5
INFO 2005:  previous_auc: 0
INFO 2005:  nbins: 100
INFO 2005:  auc_value: 0.749597423510467
INFO 2005:  previous_auc: 0.5
INFO 2005:  nbins: 200
INFO 2005:  Result_auc_value: 0.750402576489533