Performing the Initial Database Load

To perform the initial database load, use COPY with its DIRECT parameter from vsql.

Tip: Vertica supports multiple schema types. If you have a star schema, load the smaller tables before you load the largest tables.

Only a superuser can use the COPY statement to bulk load data. There are two exceptions to the superuser requirement:

  1. Run COPY to load from a stream on the host (such as STDIN) rather than a file (see Streaming Data Via JDBC).
  2. Use the COPY statement with the FROM LOCAL option.

A non-superuser can also perform a batch load with a JDBC prepared statement, which invokes COPY to load data as a background task.

Extracting Data From an Existing Database

If possible, export the data in text form to a local file or attached disk. When working with large amounts of load data (> 500GB), Vertica recommends that you test the load process using smaller load files as described in Configuration Procedure to avoid compatibility or file formatting issues.
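One way to follow this recommendation is to carve a small test file out of the full export before attempting the full load. A minimal sketch, using illustrative file names (export_full.txt and export_sample.txt are not from this document):

```shell
# Simulate a full export file; in practice this is the file exported
# from your source database (the name is illustrative).
seq 1 100000 > export_full.txt

# Take the first 1,000 rows as a small test load file to verify
# formatting and compatibility before running the full load.
head -n 1000 export_full.txt > export_sample.txt
```

Once a test COPY of the sample file succeeds, the same load process can be applied to the full file.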

ETL products typically use ODBC or JDBC to extract data, which gives them program-level access to modify load file column values, as needed.

Database systems typically provide a variety of export methods.

Tip: To export data from an Oracle database, run a SELECT query in Oracle’s SQL*Plus command-line query tool using a specified column delimiter, suppressed headers, and so forth. Redirect the output to a local file.

Smaller tables generally fit into a single load file. Split any large tables into 250-500GB load files. For example, a 10 TB fact table will require 20-40 load files to maintain performance.
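The split command is one way to break a large export into multiple load files. The following sketch uses a small line-based split for illustration; for a real multi-hundred-GB export you would typically split by size instead (for example, split -b 250G), and the file names are assumptions, not from this document:

```shell
# Simulate a large exported table (illustrative; in practice this is
# your multi-hundred-GB export file).
seq 1 100000 > bigtable.txt

# Split into load files of 25,000 lines each, named bigtable_load.aa,
# bigtable_load.ab, and so on. For real data, split by size instead:
#   split -b 250G bigtable.txt bigtable_load.
split -l 25000 bigtable.txt bigtable_load.
```

Each resulting piece can then be delivered to a different node and loaded in parallel.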

Checking for Delimiter Characters in Load Data

The default delimiter for the COPY statement is a vertical bar (|). Before loading your data, make sure that no CHAR(N) or VARCHAR(N) data values include the delimiter character.

To test for the existence of a specific character in a column, use a query such as this:

SELECT COUNT(*) FROM T WHERE X LIKE '%|%';

If only a few rows contain |, you can eliminate them from the load file using a WHERE clause and load them separately using a different delimiter.
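If the data is already in flat files, the same separation can be done with grep: filter the clean rows into the main load file and the problem rows into a separate file for a second load with a different delimiter. A minimal sketch with illustrative file names and data:

```shell
# Illustrative load data: three clean rows and one row whose value
# contains the default '|' delimiter.
printf 'alpha\nbeta\ngam|ma\ndelta\n' > load.txt

# Rows safe to load with the default delimiter:
grep -v '|' load.txt > load_clean.txt

# Problem rows, to be loaded separately with a different delimiter:
grep '|' load.txt > load_problem.txt
```

load_clean.txt can then go through the main COPY, and load_problem.txt through a second COPY that specifies a delimiter not present in the data.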

Tip: For loading data from an Oracle database, use a WHERE clause to exclude problem rows from the main load file, and use the negated WHERE clause with REGEXP_REPLACE to handle the problem rows separately.

Moving Data From an Existing Database to Vertica Nodes

To move data from an existing database to Vertica, deliver chunks of data to the different Vertica nodes, either by attaching a transport disk to each node or by copying files over the network.

Loading From a Local Hard Disk

USB 2.0 disks can deliver data at about 30 MB per second, or 108 GB per hour. USB 2.0 disks are easy to use for transporting data from Linux to Linux. Set up an ext4 filesystem on the disk and write large files there. Linux 2.6 has USB plug-and-play support, so a USB 2.0 disk is instantly usable on various Linux systems.

For other UNIX variants, if there is no common filesystem format available, use the disk without a filesystem to copy a single large file. For example:

$ cp bigfile /dev/sdc1

Even without a filesystem on the disk, plug-and-play support still works on Linux to provide a device node for the disk. To find out the assigned device, plug in the disk and enter:

$ dmesg | tail -40

SATA disks are usually internal, but they can be mounted in an external enclosure; an internal SATA disk can also be unmounted safely and moved to another machine.

Loading Over the Network

A 1Gbps (gigabits per second) network can deliver about 50 MB/s, or 180GB/hr. Vertica can load about 30-50GB/hour/node for a 1-Ksafe projection design. Therefore, you should use a dedicated 1Gbps LAN; a slower LAN is proportionally slower. Vertica recommends not loading data across an external network, because the delays over distance slow the TCP protocol to a small fraction of its available bandwidth, even without competing traffic.

Note: The actual load rates you obtain can be higher or lower depending on the properties of the data, number of columns, number of projections, and hardware and network speeds. Load speeds can be further improved by using multiple parallel streams.
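The figures above support rough load planning. For example, assuming the ~180GB/hr network rate quoted in this section (actual rates vary, per the note), the transfer time for a 10 TB table can be estimated with simple shell arithmetic:

```shell
# Rough planning arithmetic using the figures quoted in the text.
DATA_GB=10240    # a 10 TB fact table, in GB
NET_RATE=180     # ~180 GB/hr over a dedicated 1 Gbps LAN

# Hours to transfer, rounded up (integer arithmetic).
echo $(( (DATA_GB + NET_RATE - 1) / NET_RATE ))
```

The per-node load rate (30-50GB/hour/node) can be plugged into the same arithmetic to estimate total load time across the cluster.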

Loading From Windows

Use NTFS for loading files directly from Windows to Linux. Although Red Hat Linux as originally installed can read Windows FAT32 file systems, FAT32 is not recommended; among other limitations, it cannot hold files larger than 4GB, which makes it unsuitable for large load files.