Converting Multicharacter Delimiters for Loading

Posted October 13, 2015 by Tim Donar

Do you have issues loading data in Vertica from a single input file containing multicharacter delimiters like ^~^? If so, you can develop a parser to convert the delimiters from multiple characters into a single character and then load the data in parallel streams. This blog includes an example of such a parser that you can adapt for your own use case. Alternatively, you can convert the code and implement the parser as a Vertica UDF.

The workflow for converting the delimiters and loading the data is as illustrated in the following figure:

 

 

The steps below match the numbers in the figure:

  1. Read the data in the source file.
  2. Split the source file for parallel processing.
  3. Run the Lex parser (describe in this document) to convert the multicharacter delimiters to single-character delimiters.
  4. Load the data into Vertica using the COPY statement.

Multicharacter Delimiters

What if the multicharacter delimiter in your source file looks like any of the following? Or each row contains a different delimiter between each column?

Boston~^~!MA|!|Vertica
Dallas^~^Texas~!@Corp

The example Lex parser described in the next section provides a method for converting delimiter characters to characters that are compatible with loading.

Lex Parser

To scan the text patterns in the source file and convert the delimiters, one approach is to use regular expressions and compiled C code. (A regular expression lets you perform pattern matching on character strings.)

The following Lex parser code (parse.l) converts 2- and 3-character delimiter patterns to a single record separator character.

Here’s the full source code for the Lex parser. You can use this as the basis for your own UDF to pre-process data.

%{
#include
#include
%}
%%
[^|~][~^|!][^|@]? {
/* convert multi-char delimiter */
/*
Example delimiters:
^~^
|^|
|^
^|
~^
^|^
|~
~!@
|~|
*/ /* Write the record separator character */
printf("x1E");
}
%%
main() {
yylex();
return(0);
}

To compile the Lex parser code, use the following command:
$ flex parse.l; cc lex.yy.c -o parse -ll

If you don’’t have the flex utility, use the yum installer utility to download and install this utility so you can compile the Lex parser:
$ yum install flex
$ yum install flex-devel

Load Commands

To use STDIN input instead of a flat file, you need to change the standard COPY statement. Use the Lex parser (parse) to convert the delimiter characters. Then, pipe them into the database using the new COPY statement.

The script to do this, which we named load.sh, looks like this:
inputfile=$1
db_host="localhost"
db_name="vdb"
db_username="dbadmin"
db_password="mypassword"
db_schema="public"
db_table="sample_tb"
./parse < $inputfile | vsql -h $db_host -d $db_name -U $db_username -w $db_password -t -c "COPY $db_schema.$db_table FROM STDIN DELIMITER E'x1E';"

Adding the Linux split Utility

To increase the load performance of a single input file, split the file into multiple pieces. Using a split method, Vertica can convert the delimiters and load the data in multiple parallel streams.

The following script, run_split.sh, takes a single-source input file and creates n output splits. This script passes each split through the Lex parser and then loads the data into Vertica using the COPY statement.

The run_split.sh script is as follows:
file=$1
n=12 # number of split segments
if [ "$file" == "" ]
then
echo "Error: Enter a source file"
exit 1
fi
size=$(wc -l < $file)
lineCount=$((size/$n))
echo "split $file [$size][$lineCount]"
split -d -l $lineCount < $file
# list split files
ls x* |
while read f
do
# start load job in background
./load.sh $f &
done
echo -n "Wait for load jobs to complete"
while ps -ef | grep vsql | grep -v grep > /dev/null
do
sleep 1
echo -n "."
done;
echo "done"

Performance Results

To see the results of parallelizing the parsing and loading, we ran the following performance tests.

We established the baseline performance with a simple pipe-delimited input file: a 7.5GB file with 59,986,000 rows. The COPY statement looked like the following:
COPY table FROM '/path/file.txt' DELIMITER '|';

Using the inline stream parsing and loading with 16x parallel streams, we achieved the best performance:

  • Single-file load, parsing multicharacter delimiters: 297 seconds
  • Single-file load, no parsing: 78 seconds
  • Parallel load, parsing multicharacter delimiters: 62 seconds

For More Information

This document describes one method to pre-process data and a simple method to split files during load. There are several other techniques available to accomplish the same end result, and ways to parallelize data loading.

Vertica supports user-defined functions (UDFs), which can be embedded into a COPY statement. For information about developing UDFs, see Extending Vertica in the Vertica documentation.

Outside of the database, there are several commercial ETL tools that you can use to pre-process data prior to loading the data into Vertica.