Vertica Architecture Basics

There are several key concepts at the core of the Vertica architecture that you should understand, which are explained the the following sections.

Column Storage

Vertica stores data in a column format so it can be queried for best performance. Compared to row-based storage, column storage reduces disk I/O making it ideal for read-intensive workloads. Vertica reads only the columns needed to answer the query. For example:

=> SELECT avg(price) FROM tickstore WHERE symbol = 'AAPL' and date = '5/31/13';

For this example query, a column store reads only three columns while a row store reads all columns:

Data Encoding and Compression

Vertica uses encoding and compression to optimize query performance and save storage space.

Encoding converts data into a standard format. Vertica uses a number of different encoding strategies, depending on column data type, table cardinality, and sort order. Encoding increases performance because there is less disk I/O during query execution. In addition, you can store more data in less space.

Compression transforms data into a compact format. Vertica uses several different compression methods and automatically chooses the best one for the data being compressed. Using compression, Vertica stores more data, provides more views, and uses less hardware than other databases. Using compression lets you keep much more historical data in physical storage.

The following shows compression using sorting and cardinality:

For more information, see Data Encoding and Compression.

Clustering

Clustering supports scaling and redundancy. You can scale your database cluster by adding more nodes, and you can improve reliability by distributing and replicating data across your cluster.

Column data gets distributed across nodes in a cluster, so if one node becomes unavailable the database continues to operate. When a node is added to the cluster, or comes back online after being unavailable, it automatically queries other nodes to update its local data.

Projections

A projection consists of a set of columns with the same sort order, defined by a column to sort by or a sequence of columns by which to sort. Like an index or materialized view in a traditional database, a projection accelerates query processing. When you write queries in terms of the original tables, the query uses the projections to return query results.

Projections are distributed and replicated across nodes in your cluster, ensuring that if one node becomes unavailable, another copy of the data remains available. For more information, see K-Safety in an Enterprise Mode Database.

Automatic data replication, failover, and recovery provide for active redundancy, which increases performance. Nodes recover automatically by querying the system.

Logical and Physical Schema

Vertica stores information about database objects in the logical schema and the physical schema. The difference between the two schemas and how they relate to data storage is an important and unique aspect of the Vertica architecture.

A logical schema consists of objects such as tables, constraints, and views. Vertica supports any relational schema design that you choose. A physical schema consists of collections of table columns called projections. A projection can contain some or all of the columns of a table.

Continuous Performance

Vertica queries and loads data continuously 24x7.

Concurrent loading and querying provides real-time views and eliminates the need for nightly load windows. On-the-fly schema changes allow you to add columns and projections without shutting down your database; Vertica manages updates while keeping the database available.

Terminology

It is helpful to understand the following terms when using Vertica:

Host

A computer system with a 64-bit Intel or AMD processor, RAM, hard disk, and TCP/IP network interface (IP address and hostname). Hosts share neither disk space nor main memory with each other.

Instance

An instance of Vertica consists of the running Vertica process and disk storage (catalog and data) on a host. Only one instance of Vertica can be running on a host at any time.

Node

A host configured to run an instance of Vertica. It is a member of the database cluster. For a database to have the ability to recover from the failure of a node requires a database K-safety value of at least 1 (3+ nodes).

Cluster

The concept of Cluster in the Vertica Analytics Platform is a collection of hosts with the Vertica software packages (RPM or DEB) that are in one admin tools domain. You can access and manage a cluster from one admintools initiator host.

Database

A cluster of nodes that, when active, can perform distributed data storage and SQL statement execution through administrative, interactive, and programmatic user interfaces.

Although you can define more than one database on a cluster, Vertica supports running only one database per cluster at a time.