Fast Data Loading with Vertica

Posted August 20, 2018 by Phil Molea, Sr. Information Developer, Vertica

Curtis Bennett authored this post. Vertica is well known for its blinding query performance at big data scale, but it can also ingest data at very high speed. It can even load data non-stop while being queried, enabling real-time analysis of data.

Basic Loading Methods

There are many ways of loading data into Vertica, all of which can run while the database is concurrently queried. The simplest method is to use Vertica's COPY command to load data in from a source file or files, and Vertica supports a wide variety of input file types. The basic loading methods are:

• Trickle Load – insert data into memory as it arrives
• Bulk Load to Memory – ideal for fast loading of smaller data volumes (a few gigabytes)
• Bulk Load to Disk – ideal for large data volumes (many gigabytes or terabytes)

Vertica can load many megabytes per second per machine per load stream. With Vertica's shared-nothing architecture, loading can be performed in parallel, scaling linearly as servers are added to the Vertica cluster.
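To make this concrete, here is a minimal sketch of the disk and memory paths using the COPY command. The table and file paths are hypothetical, but the DIRECT keyword is the standard way to send a load straight to disk; omitting it lets Vertica stage smaller batches in memory first:

-- Hypothetical table used in the examples throughout this post.
CREATE TABLE sales (
    sale_id   INT,
    store_id  INT,
    sale_date DATE,
    amount    NUMERIC(12,2)
);

-- Bulk load to disk: DIRECT writes straight to disk storage,
-- the right choice for files in the multi-gigabyte range and up.
COPY sales FROM '/data/sales_2018.csv' DELIMITER ',' NULL '' DIRECT;

-- Bulk load to memory: without DIRECT, smaller batches are staged
-- in memory first, which is also how trickle loads behave.
COPY sales FROM '/data/sales_increment.csv' DELIMITER ',';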

Vertica Customer Loading Experiences

Before examining how Vertica loads data this rapidly, here is a look at some load speeds reported by Vertica customers:

• Trickle loads, 48MB/minute – SNMP data generated by devices in the Comcast cable network is trickle-loaded on a 24×7 basis at rates as high as 135,000 rows/second. The system runs on 5 HP ProLiant DL380 servers.
• Bulk loads to memory, 300MB/minute – 50MB to 300MB of call detail records (1K record size, 150 columns per row) are loaded every 10 minutes. Vertica runs on 6 HP ProLiant DL380 servers.
• Bulk loads to disk, 5GB/minute – The loading and enrichment (i.e., summary table creation) of 1.5TB of call detail records formerly took 5 days in a row-oriented data warehouse database. Vertica required 5 hours to load the same data.
• Global investment firm, trickle loads at 2.6GB/minute – Historic financial trade and quote (TaQ) data was bulk loaded into the database at a rate of 125GB/hour. New TaQ data was trickled into the database at rates as high as 90,000 rows per second (480 bytes per row).
• Global electronics firm, 4.6TB/hour – In a recent POC, a client implemented micro-batch loading from a variety of source files containing quality-control measurements for an IoT workload, using a combination of Vertica loading methods such as trickle and direct loads.

Supported Sources

Vertica supports a wide variety of input file types. These files can come from many sources, including the database servers themselves, files stored locally on a client's machine, Hadoop (HDFS), Spark, AWS S3 cloud storage, data on a SAN, and other external data sources. Vertica also supports streaming data. Some of the common types of supported input files include:

• ASCII/CSV/text files in both delimited and fixed-width formats
• BZIP/GZIP/LZO compressed files
• ORC
• PARQUET
• JSON
• NATIVE – Vertica's own proprietary binary file format
• Virtually any other file format, using Vertica's User-Defined Load (UDL) functionality
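As a hedged illustration (the paths, bucket name, and URL schemes below are placeholders, and the exact schemes supported vary by Vertica version and configuration), the same COPY command handles most of these formats and sources directly:

-- A GZIP-compressed delimited file shipped from a client machine.
COPY sales FROM LOCAL '/home/user/sales.csv.gz' GZIP DELIMITER ',';

-- Columnar files written by Hadoop or Spark.
COPY sales FROM 'hdfs:///data/sales/*.parquet' PARQUET;
COPY sales FROM 's3://my-bucket/data/sales/part-0001.orc' ORC;

-- JSON, via the flexible JSON parser.
COPY sales FROM '/data/sales.json' PARSER FJSONPARSER();

-- Vertica's own binary format, the fastest file format to parse.
COPY sales FROM '/data/sales.bin' NATIVE;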

Streaming Data

Traditional data warehouses are often loaded with complex ETL or ELT solutions that involve either batch loading or micro-batches in small incremental bursts. While these approaches are often sufficient for classical data warehouses, truly big data analytic environments in IoT, click-stream analytics, or other large workloads require something more sophisticated. To this end, Vertica supports real-time streaming of continuous data through the native Apache Kafka scheduler that comes with Vertica Enterprise Edition. Furthermore, this support is bi-directional: you can also push data from Vertica into Apache Kafka for ingest by other applications or systems.
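At the SQL level, the Kafka integration surfaces as a source and a set of parsers for COPY. A minimal sketch follows; the broker, topic, and table names are made up, and in production the Kafka scheduler issues these micro-batches continuously rather than you running them by hand:

-- Read one micro-batch from a Kafka topic and parse it as JSON.
-- 'web_events|0|-2' means: topic web_events, partition 0, start at
-- the earliest available offset. stop_on_eof ends the batch when the
-- stream is exhausted instead of waiting for new messages.
COPY web_events
SOURCE KafkaSource(stream='web_events|0|-2',
                   brokers='kafka01.example.com:9092',
                   stop_on_eof=true)
PARSER KafkaJSONParser();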

Linear Scalability

Vertica is linearly scalable. It can run on a variety of platforms, including on-premises Linux hardware as well as all of the major cloud providers: AWS, Google, and Microsoft Azure. A number of factors can influence load performance, including, but not limited to, hardware specifications, concurrency, file sizes and widths, and network latency. However, load performance can be increased linearly by simply adding more nodes to the process. Through its Resource Management capabilities, Vertica can also isolate load activities from query activities, ensuring that resources dedicated to load operations do not interfere with queries and vice versa. One prominent Silicon Valley social media company achieved a peak ingest rate of up to 72 TB/hour using up to 360 nodes and resource management isolation techniques. At 180 nodes, they achieved consistent performance of between 24 and 41 TB/hour.
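As a sketch of that isolation (the pool name and settings here are arbitrary placeholders, not recommendations), a dedicated resource pool reserves memory for load jobs so that COPY streams and queries do not compete:

-- Reserve per-node memory for ETL work.
CREATE RESOURCE POOL load_pool
    MEMORYSIZE '8G'        -- memory reserved on each node for loads
    PLANNEDCONCURRENCY 4   -- expected number of concurrent load streams
    MAXCONCURRENCY 8;      -- hard cap on concurrent requests in the pool

-- Route the ETL user's sessions (and its COPY statements) to that pool.
ALTER USER etl_user RESOURCE POOL load_pool;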

Vertica Eon Mode

With Eon mode, Vertica takes full advantage of the elastic compute and scalable storage capabilities of the cloud. In Eon mode, Vertica can exploit cloud economics: it can scale and shrink rapidly in response to changing workloads, and it separates the size of your data from the compute you want to dedicate to querying it. Eon mode is also built to handle failure gracefully, even in volatile cloud environments. Currently, Eon mode works on AWS.

In Eon mode, the source data is available on shared storage, and the local cache on each node serves as the secondary copy. This means that adding and removing nodes does not redistribute the primary source of data. This shared storage model enables elasticity, making it both time- and cost-effective to adapt cluster resources to the cluster's usage pattern. This model gives you flexibility: you can have heavy storage with minimal compute, or small storage with heavy compute, depending on your workload needs. Most importantly, you can add compute nodes in real time for additional processing capability without having to rebalance the underlying data.

The optimizer and execution engine that make Vertica great are largely untouched in Eon mode. Under the hood, Vertica uses a storage API to interact with the underlying storage.

PSTL

Quite possibly the fastest way of ingesting data into Vertica is through PSTL, the Parallel Streaming Transformation Loader (pronounced "pistol"). PSTL is the result of multiple generations of streaming solutions, coupled with Vertica's collective experience in open- and closed-source software with many customers over the years. It is a highly scalable, fault-tolerant, extensible, self-service application framework that enables users to write SQL over dynamic streaming data sources, and it provides all of the necessary components for a complete end-to-end pipeline that would otherwise require vast amounts of custom code. PSTL is a customized real-time ETL offering from Vertica Professional Services and is available as an open source offering on the Vertica GitHub page. It is built on both Vertica's integration with Kafka and Apache Spark Structured Streaming. One prominent mobile gaming company was able to load 2.42 billion rows of data (451GB) into Vertica in 7 minutes and 35 seconds on an 8-node Enterprise Vertica cluster, roughly 59GB/minute. This same implementation, scaled up to around 80 nodes, could achieve 36 TB/hour.