Stored Procedures: Use Cases and Examples
Stored procedures in Vertica are best suited for complex, analytical workflows rather than small, transaction-heavy ones. Recommended use cases include information lifecycle management (ILM) activities such as extract, transform, and load (ETL), and data preparation for more complex analytical tasks such as machine learning. For example:
- Swapping partitions according to age
- Exporting data at end-of-life and dropping the partitions
- Saving inputs, outputs, and metadata from a machine learning model (e.g. who ran the model, the version of the model, how many times the model was run, who received the results, etc.) for auditing purposes
Searching for a Value
The find_my_value() procedure searches for a user-specified value in any table column in a given schema and stores the locations of instances of the value in a user-specified table:
=> CREATE PROCEDURE find_my_value(p_table_schema VARCHAR(128), p_search_value VARCHAR(1000), p_results_schema VARCHAR(128), p_results_table VARCHAR(128)) AS $$
DECLARE
    sql_cmd VARCHAR(65000);
    sql_cmd_result VARCHAR(65000);
    results VARCHAR(65000);
BEGIN
    IF p_table_schema IS NULL OR p_table_schema = ''
        OR p_search_value IS NULL OR p_search_value = ''
        OR p_results_schema IS NULL OR p_results_schema = ''
        OR p_results_table IS NULL OR p_results_table = '' THEN
        RAISE EXCEPTION 'Please provide a schema to search, a search value, a results table schema, and a results table name.';
        RETURN;
    END IF;

    sql_cmd := 'CREATE TABLE IF NOT EXISTS ' || QUOTE_IDENT(p_results_schema) || '.' || QUOTE_IDENT(p_results_table) || '(found_timestamp TIMESTAMP, found_value VARCHAR(1000), table_name VARCHAR(128), column_name VARCHAR(128));';

    sql_cmd_result := EXECUTE 'SELECT LISTAGG(c USING PARAMETERS max_length=1000000, separator='' '') FROM (SELECT '' (SELECT '''''' || NOW() || ''''''::TIMESTAMP , ''''' || QUOTE_IDENT(p_search_value) || ''''','''''' || table_name || '''''', '''''' || column_name || '''''' FROM '' || table_schema || ''.'' || table_name || '' WHERE '' || column_name || ''::'' || CASE WHEN data_type_id IN (17, 115, 116, 117) THEN data_type ELSE ''VARCHAR('' || LENGTH(''' || QUOTE_IDENT(p_search_value) || ''') || '')'' END || '' = ''''' || QUOTE_IDENT(p_search_value) || ''''''' || DECODE(LEAD(column_name) OVER(ORDER BY table_schema, table_name, ordinal_position), NULL, '' LIMIT 1);'', '' LIMIT 1) UNION ALL '') c FROM (SELECT table_schema, table_name, column_name, ordinal_position, data_type_id, data_type FROM columns WHERE NOT is_system_table AND table_schema ILIKE ''' || QUOTE_IDENT(p_table_schema) || ''' AND data_type_id < 1000 ORDER BY table_schema, table_name, ordinal_position) foo) foo;';

    results := EXECUTE 'INSERT INTO ' || QUOTE_IDENT(p_results_schema) || '.' || QUOTE_IDENT(p_results_table) || ' ' || sql_cmd_result;

    RAISE INFO 'Matches Found: %', results;
END;
$$;
For example, to search the public schema for instances of the string 'dog' and then store the results in public.table_list:
=> CALL find_my_value('public', 'dog', 'public', 'table_list');
 find_my_value
---------------
             0
(1 row)

=> SELECT * FROM public.table_list;
      found_timestamp       | found_value |  table_name   | column_name
----------------------------+-------------+---------------+-------------
 2021-08-25 22:13:20.147889 | dog         | another_table | b
 2021-08-25 22:13:20.147889 | dog         | some_table    | c
(2 rows)
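The core idea behind find_my_value() is to enumerate every column of a schema from the system catalog and generate one probe query per column, joined with UNION ALL. A minimal Python sketch of that dynamic-SQL generation, under simplifying assumptions (the catalog list and build_search_sql helper are hypothetical; the real procedure reads the columns system table, casts by data type, and quotes identifiers with QUOTE_IDENT):

```python
# Sketch: generate one probe query per (table, column) pair, mirroring
# the UNION ALL query that find_my_value() assembles with LISTAGG.
# No identifier quoting or type handling here; illustration only.
def build_search_sql(columns, schema, search_value):
    """columns: list of (table_name, column_name) pairs from a catalog query."""
    probes = [
        f"(SELECT NOW()::TIMESTAMP, '{search_value}', '{tbl}', '{col}' "
        f"FROM {schema}.{tbl} "
        f"WHERE {col}::VARCHAR = '{search_value}' LIMIT 1)"
        for tbl, col in columns
    ]
    return " UNION ALL ".join(probes) + ";"

# Hypothetical catalog rows for the example schema:
catalog = [("some_table", "c"), ("another_table", "b")]
print(build_search_sql(catalog, "public", "dog"))
```

The generated statement probes each column once with LIMIT 1, which is why each match appears at most once per column in the results table.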
Optimizing Tables
You can automate loading data from Parquet files and optimizing your queries with the create_optimized_table() procedure. This procedure:
- Creates an external table whose structure is built from Parquet files using the Vertica INFER_EXTERNAL_TABLE_DDL function.
- Creates a native Vertica table modeled on the external table, resizing all VARCHAR columns to the maximum length of the data to be loaded.
- Creates a super projection using the optional segmentation/order by columns passed in as a parameter.
- Adds an optional primary key to the managed table passed in as a parameter.
- Loads a sample data set (1 million rows) from the external table into the Vertica-managed table.
- Drops the external table.
- Runs the ANALYZE_STATISTICS function on the Vertica-managed table.
- Runs the DESIGNER_DESIGN_PROJECTION_ENCODINGS function to get a properly encoded super projection for the Vertica-managed table.
- Truncates the now-optimized Vertica-managed table (we will load the entire data set in a separate script / stored procedure).
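The most interesting step above is the VARCHAR resize: the DDL inferred from Parquet typically declares oversized VARCHAR columns, and the procedure rewrites each one to the maximum observed data length. A minimal Python sketch of that rewrite, assuming hypothetical inputs (in the procedure, the column list comes from the columns system table and the max lengths from a generated MAX(LENGTH(...)) query):

```python
# Sketch: rebuild a CREATE TABLE statement, shrinking each VARCHAR column
# to the maximum observed data length. Inputs are hypothetical stand-ins
# for catalog metadata and a MAX(LENGTH(col)) scan of the external table.
def optimized_ddl(schema, table, cols, max_lens):
    """cols: list of (name, data_type); max_lens: {name: observed max length}."""
    defs = []
    for name, dtype in cols:
        if dtype.startswith("varchar"):
            dtype = f"varchar({max_lens[name]})"  # resize to observed max
        defs.append(f"{name} {dtype}")
    return f"CREATE TABLE {schema}.{table} (" + ", ".join(defs) + ");"

ddl = optimized_ddl("public", "parquet_table",
                    [("c1", "int"), ("c2", "varchar(65000)"), ("c3", "date")],
                    {"c2": 4})
print(ddl)
# the oversized varchar(65000) shrinks to varchar(4), matching the
# export_objects output shown below for this example's data
```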
=> CREATE OR REPLACE PROCEDURE create_optimized_table(p_file_path VARCHAR(1000), p_table_schema VARCHAR(128), p_table_name VARCHAR(128), p_seg_columns VARCHAR(1000), p_pk_columns VARCHAR(1000)) LANGUAGE PLvSQL AS $$
DECLARE
    command_sql VARCHAR(1000);
    seg_columns VARCHAR(1000);
BEGIN
    IF (QUOTE_LITERAL(p_file_path) IS NULL OR QUOTE_LITERAL(p_file_path) = '') THEN
        RAISE EXCEPTION 'Please provide a file path.';
    ELSEIF (QUOTE_IDENT(p_table_schema) IS NULL OR QUOTE_IDENT(p_table_schema) = '') THEN
        RAISE EXCEPTION 'Please provide a table schema.';
    ELSEIF (QUOTE_IDENT(p_table_name) IS NULL OR QUOTE_IDENT(p_table_name) = '') THEN
        RAISE EXCEPTION 'Please provide a table name.';
    END IF;

    IF (QUOTE_IDENT(p_seg_columns) IS NULL OR QUOTE_IDENT(p_seg_columns) = '') THEN
        seg_columns := '';
    ELSE
        seg_columns := 'ORDER BY ' || QUOTE_IDENT(p_seg_columns) || ' SEGMENTED BY HASH(' || QUOTE_IDENT(p_seg_columns) || ') ALL NODES';
    END IF;

    EXECUTE 'DROP TABLE IF EXISTS ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp CASCADE;';

    command_sql := EXECUTE 'SELECT infer_external_table_ddl(''' || QUOTE_LITERAL(p_file_path) || ''' USING PARAMETERS format = ''parquet'', table_name = ''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp'');';
    command_sql := 'SELECT REPLACE(\$\$' || command_sql || '\$\$, ''"'', '''');';
    command_sql := EXECUTE command_sql;
    EXECUTE command_sql;

    command_sql := 'SELECT LISTAGG(y USING PARAMETERS separator='' '') FROM ((SELECT 0 x, ''SELECT ''''CREATE TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '('' y UNION ALL SELECT ordinal_position, column_name || '' '' || CASE WHEN data_type LIKE ''varchar%'' THEN ''varchar('''' || (SELECT MAX(LENGTH('' || column_name || '')) FROM '' || table_schema || ''.'' || table_name || '') || '''')'' ELSE data_type END || NVL2(LEAD('' || column_name || '', 1) OVER (ORDER BY ordinal_position), '','', '')'') FROM columns WHERE table_schema = ''' || QUOTE_IDENT(p_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_table_name) || '_temp'' UNION ALL SELECT 10000, ''' || seg_columns || ''' UNION ALL SELECT 10001, '';'''''') ORDER BY x) foo WHERE y <> '''';';
    command_sql := EXECUTE command_sql;
    command_sql := EXECUTE command_sql;
    EXECUTE command_sql;

    IF QUOTE_IDENT(p_pk_columns) IS NOT NULL OR QUOTE_IDENT(p_pk_columns) <> '' THEN
        EXECUTE 'ALTER TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ' ADD CONSTRAINT ' || QUOTE_IDENT(p_table_name) || '_pk PRIMARY KEY (' || QUOTE_IDENT(p_pk_columns) || ') ENABLED;';
    END IF;

    EXECUTE 'INSERT INTO ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ' SELECT * FROM ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp LIMIT 1000000;';
    EXECUTE 'DROP TABLE IF EXISTS ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_temp CASCADE;';
    EXECUTE 'SELECT analyze_statistics(''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ''');';
    EXECUTE 'SELECT designer_design_projection_encodings(''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ''', ''/tmp/toss.sql'', TRUE, TRUE);';
    EXECUTE 'TRUNCATE TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ';';
END;
$$;
=> SELECT create_optimized_table('/home/dbadmin/rga_example/*', 'public', 'parquet_table', 'c1,c2', 'c1');
 create_optimized_table
------------------------
                      0
(1 row)

=> SELECT export_objects('', 'parquet_table');
                export_objects
-----------------------------------------------
CREATE TABLE public.parquet_table
(
    c1 int NOT NULL,
    c2 varchar(4),
    c3 date,
    CONSTRAINT parquet_table_pk PRIMARY KEY (c1) ENABLED
);

CREATE PROJECTION public.parquet_table_super /*+createtype(D)*/
(
 c1 ENCODING COMMONDELTA_COMP,
 c2 ENCODING ZSTD_FAST_COMP,
 c3 ENCODING DELTAVAL
)
AS
 SELECT parquet_table.c1,
        parquet_table.c2,
        parquet_table.c3
 FROM public.parquet_table
 ORDER BY parquet_table.c1,
          parquet_table.c2
SEGMENTED BY hash(parquet_table.c1, parquet_table.c2) ALL NODES OFFSET 0;

SELECT MARK_DESIGN_KSAFE(0);
(1 row)
Pivoting Tables Dynamically
The stored procedure unpivot() takes a source table and a target table as input. It unpivots the source table and outputs the results into the target table.
This example uses the following table:
=> SELECT * FROM make_the_columns_into_rows;
 c1  | c2  |                  c3                  |             c4             |    c5    | c6
-----+-----+--------------------------------------+----------------------------+----------+----
 123 | ABC | cf470c5b-50e3-492a-8483-b9e4f20d195a | 2021-08-24 18:49:40.835802 |  1.72964 | t
 567 | EFG | 25ea7636-d924-4b4f-81b5-1e1c884b06e3 | 2021-08-04 18:49:40.835802 | 41.46100 | f
 890 | XYZ | f588935a-35a4-4275-9e7f-ebb3986390e3 | 2021-08-29 19:53:39.465778 |  8.58207 | t
(3 rows)
This table contains the following columns:
=> \d make_the_columns_into_rows
                                   List of Fields by Tables
 Schema |           Table            | Column |     Type      | Size | Default | Not Null | Primary Key | Foreign Key
--------+----------------------------+--------+---------------+------+---------+----------+-------------+-------------
 public | make_the_columns_into_rows | c1     | int           |    8 |         | f        | f           |
 public | make_the_columns_into_rows | c2     | varchar(80)   |   80 |         | f        | f           |
 public | make_the_columns_into_rows | c3     | uuid          |   16 |         | f        | f           |
 public | make_the_columns_into_rows | c4     | timestamp     |    8 |         | f        | f           |
 public | make_the_columns_into_rows | c5     | numeric(10,5) |    8 |         | f        | f           |
 public | make_the_columns_into_rows | c6     | boolean       |    1 |         | f        | f           |
(6 rows)
The target table has columns from the source table pivoted into rows as key/value pairs. It also has a ROWID column to tie the key/value pairs back to their original row in the source table:
=> CREATE PROCEDURE unpivot(p_source_table_schema VARCHAR(128), p_source_table_name VARCHAR(128), p_target_table_schema VARCHAR(128), p_target_table_name VARCHAR(128)) AS $$
DECLARE
    explode_command VARCHAR(10000);
BEGIN
    explode_command := EXECUTE 'SELECT ''explode(string_to_array(''''['''' || '' || LISTAGG(''NVL('' || column_name || ''::VARCHAR, '''''''')'' USING PARAMETERS separator='' || '''','''' || '') || '' || '''']'''')) OVER (PARTITION BY rn)'' explode_command FROM (SELECT table_schema, table_name, column_name, ordinal_position FROM columns ORDER BY table_schema, table_name, ordinal_position LIMIT 10000000) foo WHERE table_schema = ''' || QUOTE_IDENT(p_source_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_source_table_name) || ''';';

    EXECUTE 'CREATE TABLE ' || QUOTE_IDENT(p_target_table_schema) || '.' || QUOTE_IDENT(p_target_table_name) || ' AS SELECT rn rowid, column_name key, value FROM (SELECT (ordinal_position - 1) op, column_name FROM columns WHERE table_schema = ''' || QUOTE_IDENT(p_source_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_source_table_name) || ''') a JOIN (SELECT rn, '
        || explode_command ||
        ' FROM (SELECT ROW_NUMBER() OVER() rn, * FROM ' || QUOTE_IDENT(p_source_table_schema) || '.' || QUOTE_IDENT(p_source_table_name) || ') foo) b ON b.position = a.op';
END;
$$;
Call the procedure:
=> CALL unpivot('public', 'make_the_columns_into_rows', 'public', 'columns_into_rows');
 unpivot
---------
       0
(1 row)

=> SELECT * FROM columns_into_rows ORDER BY rowid, key;
 rowid | key |                value
-------+-----+--------------------------------------
     1 | c1  | 123
     1 | c2  | ABC
     1 | c3  | cf470c5b-50e3-492a-8483-b9e4f20d195a
     1 | c4  | 2021-08-24 18:49:40.835802
     1 | c5  | 1.72964
     1 | c6  | t
     2 | c1  | 890
     2 | c2  | XYZ
     2 | c3  | f588935a-35a4-4275-9e7f-ebb3986390e3
     2 | c4  | 2021-08-29 19:53:39.465778
     2 | c5  | 8.58207
     2 | c6  | t
     3 | c1  | 567
     3 | c2  | EFG
     3 | c3  | 25ea7636-d924-4b4f-81b5-1e1c884b06e3
     3 | c4  | 2021-08-04 18:49:40.835802
     3 | c5  | 41.46100
     3 | c6  | f
(18 rows)
The unpivot() procedure can also handle new columns in the source table. Add a new column z to the source table, and then unpivot the table again with the same procedure:
=> ALTER TABLE make_the_columns_into_rows ADD COLUMN z VARCHAR;
ALTER TABLE

=> UPDATE make_the_columns_into_rows SET z = 'ZZZ' WHERE c1 IN (123, 890);
 OUTPUT
--------
      2
(1 row)

=> CALL unpivot('public', 'make_the_columns_into_rows', 'public', 'columns_into_rows');
 unpivot
---------
       0
(1 row)

=> SELECT * FROM columns_into_rows;
 rowid | key |                value
-------+-----+--------------------------------------
     1 | c1  | 567
     1 | c2  | EFG
     1 | c3  | 25ea7636-d924-4b4f-81b5-1e1c884b06e3
     1 | c4  | 2021-08-04 18:49:40.835802
     1 | c5  | 41.46100
     1 | c6  | f
     1 | z   |                                       -- new column
     2 | c1  | 123
     2 | c2  | ABC
     2 | c3  | cf470c5b-50e3-492a-8483-b9e4f20d195a
     2 | c4  | 2021-08-24 18:49:40.835802
     2 | c5  | 1.72964
     2 | c6  | t
     2 | z   | ZZZ                                   -- new column
     3 | c1  | 890
     3 | c2  | XYZ
     3 | c3  | f588935a-35a4-4275-9e7f-ebb3986390e3
     3 | c4  | 2021-08-29 19:53:39.465778
     3 | c5  | 8.58207
     3 | c6  | t
     3 | z   | ZZZ                                   -- new column
(21 rows)
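The transformation unpivot() performs is a classic melt: each source row becomes one (rowid, key, value) triple per column. A minimal Python sketch of that logic (row numbering here is simple enumeration; in the procedure, Vertica assigns it with ROW_NUMBER() and the column list comes from the columns system table, which is why newly added columns are picked up automatically):

```python
# Sketch: melt a table into (rowid, key, value) triples, mirroring
# what unpivot() does with EXPLODE over a stringified row.
def unpivot(rows, columns):
    out = []
    for rn, row in enumerate(rows, start=1):   # rowid ties triples to a source row
        for col, val in zip(columns, row):     # one triple per column
            out.append((rn, col, val))
    return out

result = unpivot([(123, "ABC"), (567, "EFG")], ["c1", "c2"])
print(result)
# [(1, 'c1', 123), (1, 'c2', 'ABC'), (2, 'c1', 567), (2, 'c2', 'EFG')]
```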
Machine Learning: Optimizing AUC Estimation
The ROC function can approximate the AUC (area under the curve). The accuracy of the approximation depends on the num_bins parameter: greater values of num_bins give more precise approximations, but may impact performance.
You can use the stored procedure accurate_auc() to approximate the AUC; it automatically determines the optimal num_bins value for a given epsilon (error term):
=> CREATE PROCEDURE accurate_auc(relation VARCHAR, observation_col VARCHAR, probability_col VARCHAR, epsilon FLOAT) AS $$
DECLARE
    auc_value FLOAT;
    previous_auc FLOAT;
    nbins INT;
BEGIN
    IF epsilon > 0.25 THEN
        RAISE EXCEPTION 'epsilon must not be bigger than 0.25';
    END IF;
    IF epsilon < 1e-12 THEN
        RAISE EXCEPTION 'epsilon must be bigger than 1e-12';
    END IF;

    auc_value := 0.5;
    previous_auc := 0; -- epsilon and auc should be always less than 1
    nbins := 100;
    WHILE abs(auc_value - previous_auc) > epsilon and nbins < 1000000 LOOP
        RAISE INFO 'auc_value: %', auc_value;
        RAISE INFO 'previous_auc: %', previous_auc;
        RAISE INFO 'nbins: %', nbins;
        previous_auc := auc_value;
        auc_value := EXECUTE 'SELECT auc FROM (select roc(' || QUOTE_IDENT(observation_col) || ',' || QUOTE_IDENT(probability_col) || ' USING parameters num_bins=$1, auc=true) over() FROM ' || QUOTE_IDENT(relation) || ') subq WHERE auc IS NOT NULL' USING nbins;
        nbins := nbins * 2;
    END LOOP;
    RAISE INFO 'Result_auc_value: %', auc_value;
END;
$$;
For example, suppose we have the following data in test_data.csv:
1,0,0.186
1,1,0.993
1,1,0.9
1,1,0.839
1,0,0.367
1,0,0.362
0,1,0.6
1,1,0.726
0,0,0.087
0,0,0.004
0,1,0.562
1,0,0.477
0,0,0.258
1,0,0.143
0,0,0.403
1,1,0.978
1,1,0.58
1,1,0.51
0,0,0.424
0,1,0.546
0,1,0.639
0,1,0.676
0,1,0.639
1,1,0.757
1,1,0.883
1,0,0.301
1,1,0.846
1,0,0.129
1,1,0.76
1,0,0.351
1,1,0.803
1,1,0.527
1,1,0.836
1,0,0.417
1,1,0.656
1,1,0.977
1,1,0.815
1,1,0.869
0,0,0.474
0,0,0.346
1,0,0.188
0,1,0.805
1,1,0.872
1,0,0.466
1,1,0.72
0,0,0.163
0,0,0.085
0,0,0.124
1,1,0.876
0,0,0.451
0,0,0.185
1,1,0.937
1,1,0.615
0,0,0.312
1,1,0.924
1,1,0.638
1,1,0.891
0,1,0.621
1,0,0.421
0,0,0.254
0,0,0.225
1,1,0.577
0,1,0.579
0,1,0.628
0,1,0.855
1,1,0.955
0,0,0.331
1,0,0.298
0,0,0.047
0,0,0.173
1,1,0.96
0,0,0.481
0,0,0.39
0,0,0.088
1,0,0.417
0,0,0.12
1,1,0.871
0,1,0.522
0,0,0.312
1,1,0.695
0,0,0.155
0,0,0.352
1,1,0.561
0,0,0.076
0,1,0.923
1,0,0.169
0,0,0.032
1,1,0.63
0,0,0.126
0,0,0.15
1,0,0.348
0,0,0.188
0,1,0.755
1,1,0.813
0,0,0.418
1,0,0.161
1,0,0.316
0,1,0.558
1,1,0.641
1,0,0.305
Load the data into the table categorical_test_data:
=> \set datafile '\'/data/test_data.csv\''
=> CREATE TABLE categorical_test_data(obs INT, pred INT, prob FLOAT);
CREATE TABLE
=> COPY categorical_test_data FROM :datafile DELIMITER ',';
Call accurate_auc(). For this example, the approximated AUC will be within an epsilon of 0.01:
=> CALL accurate_auc('categorical_test_data', 'obs', 'prob', 0.01);
INFO 2005:  auc_value: 0.5
INFO 2005:  previous_auc: 0
INFO 2005:  nbins: 100
INFO 2005:  auc_value: 0.749597423510467
INFO 2005:  previous_auc: 0.5
INFO 2005:  nbins: 200
INFO 2005:  Result_auc_value: 0.750402576489533
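The convergence loop at the heart of accurate_auc() can be sketched independently of Vertica: double num_bins until two successive AUC estimates agree within epsilon. In this Python sketch, a simple threshold-sweep trapezoidal estimator stands in for the ROC function's binned AUC (an assumption for illustration; the procedure actually calls ROC(... num_bins=..., auc=true)), and the small obs/prob lists are made-up data:

```python
# Sketch: accurate_auc()'s doubling loop, with a binned trapezoidal
# AUC estimator standing in for Vertica's ROC function.
def binned_auc(obs, prob, num_bins):
    """Sweep num_bins thresholds; trapezoidal area under the ROC points."""
    pos = sum(obs)
    neg = len(obs) - pos
    pts = []
    for i in range(num_bins + 1):
        t = i / num_bins  # classification threshold for this bin edge
        tpr = sum(1 for o, p in zip(obs, prob) if p >= t and o) / pos
        fpr = sum(1 for o, p in zip(obs, prob) if p >= t and not o) / neg
        pts.append((fpr, tpr))
    pts.sort()
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def accurate_auc(obs, prob, epsilon):
    """Double num_bins until successive estimates agree within epsilon."""
    auc_value, previous_auc, nbins = 0.5, 0.0, 100
    while abs(auc_value - previous_auc) > epsilon and nbins < 1_000_000:
        previous_auc = auc_value
        auc_value = binned_auc(obs, prob, nbins)
        nbins *= 2
    return auc_value

# Made-up labels and scores for illustration:
obs  = [1, 1, 0, 1, 0, 0, 1, 0]
prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.85, 0.2]
print(accurate_auc(obs, prob, 0.01))  # converges near 0.9375
```

The doubling stops as soon as extra bins stop changing the estimate, which is why the real procedure reached its answer at num_bins=200 in the run above.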