Best Practices for Selecting Shards, Nodes, and Depots in Vertica Eon Mode Beta

Vertica in Eon Mode Beta lets you run the analytic database in a cloud environment, which raises new questions about how to maximize performance. This document provides guidelines for configuring your database to achieve the best results, and explains how to determine the proper number of shards and nodes and the proper depot size for your system.

How Many Shards Do I Need?

The number of shards (S) you need depends on how many segments your data is split into. Under ideal conditions, each node serves one shard for any given session. Therefore, the number of shards governs the "node parallelism" of the workload.

For example:

  • If you have 6 shards and 12 nodes, only 6 nodes participate in query processing.
  • If you have 6 shards and 4 nodes, Vertica treats the missing 2 nodes as if they were DOWN, and an alternative node serves each of their shards for the query. In either case, the degree of node parallelism never exceeds 6.
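
Both examples follow one rule: the number of nodes that can work on a single query is the smaller of the node count and the shard count. A minimal sketch of that rule in POSIX shell follows; the function name `node_parallelism` is hypothetical and is not a Vertica command.

```shell
# Hypothetical helper (not part of Vertica): the number of nodes that can
# take part in one query is the smaller of the shard count and node count.
node_parallelism() {
    shards=$1
    nodes=$2
    if [ "$nodes" -lt "$shards" ]; then
        echo "$nodes"
    else
        echo "$shards"
    fi
}

node_parallelism 6 12   # prints 6: only 6 of the 12 nodes participate
node_parallelism 6 4    # prints 4: some nodes serve more than one shard
```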

You set the number of shards when you create the database. It cannot be changed afterward.

$ /opt/vertica/bin/admintools -t create_db -d eondb --hosts=eng1,eng2 --shard-count=6

Note: Having more shards may improve the performance of parallelized queries. However, increasing the number of shards yields diminishing returns.

When setting the shard count, note the following:

  • Setting the correct shard count requires understanding your aggregate workload and may require a few experiments.
  • Typically, users select a shard count between 6 and 16. If you are migrating from an existing Enterprise mode cluster, the shard count in Eon Mode Beta is the same as that in the Enterprise mode cluster.

How Many Nodes Do I Need?

In Eon Mode Beta, adding and removing nodes does not require an expensive data rebalance operation. Therefore, Vertica expects the number of nodes to increase with demand and to decrease when the capacity is no longer needed. These changes are fairly quick and elastic.

When determining node count (N), consider the following:

  • Your best query performance occurs when N=S; Vertica considers this the baseline.
  • When N > S, you can run a larger workload (that is, more queries). However, the execution time of an individual query does not improve beyond the baseline.
  • When N < S, query performance is equal to or lower than the baseline.
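
One way to see why N < S cannot beat the baseline is to count how many shards the busiest node must serve. The sketch below assumes shards are spread across nodes as evenly as possible, which is a simplification for illustration; the function name `max_shards_per_node` is hypothetical and is not a Vertica command.

```shell
# Hypothetical helper (not part of Vertica): the most shards any single
# node serves for a query, assuming an even spread of shards over nodes.
max_shards_per_node() {
    shards=$1
    nodes=$2
    # Integer ceiling of shards / nodes.
    echo $(( (shards + nodes - 1) / nodes ))
}

max_shards_per_node 6 6    # prints 1: the baseline, one shard per node
max_shards_per_node 6 4    # prints 2: at least one node does double work
max_shards_per_node 6 12   # prints 1: extra nodes do not reduce this further
```

Under this assumption, adding nodes beyond N = S leaves the busiest node's share unchanged, which matches the guideline that the extra nodes help throughput rather than the execution time of an individual query.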

How Much Depot Do I Need per Node?

The depot (D) is a Least Recently Used (LRU) cache. Vertica uses 80% of the volume size for this cache.

When you create the database, set the depot location and, optionally, the depot size:

$ /opt/vertica/bin/admintools -t create_db -d eondb --hosts=eng1,eng2 --shard-count=6 \
    --depot-path=/home/dbadmin/depot/ --depot-size=20G

You can modify the depot settings at any time using the following commands:

Note: Although a function to alter the depot size exists, Vertica recommends against modifying the depot size.

=> SELECT ALTER_LOCATION_SIZE('/home/dbadmin/depot/', '', '60G');
=> SELECT ALTER_LOCATION_SIZE('/home/dbadmin/depot/', 'v_eon_walkthrough_db_node0001', '50%'); 

Other Considerations

  • Use instance (ephemeral) storage for the depots in Eon Mode Beta. Although you can use EBS volumes for depot storage, doing so costs more and may degrade performance.
  • Because the data persists on S3, the local persistence offered by EBS volumes does not provide performance improvements.
  • Vertica recommends NOT using RAID or LVM on instance volumes. Use the volumes as plain ext4 partitions instead. The additional processing needed to manage LVM and RAID incurs unnecessary overhead.
  • Because the depot serves as a local cache, a larger depot improves query performance. For example:
    • Your total database is 1.2 TB compressed on S3 and you have 6 shards.
    • Each node must cache 1.2 TB / 6 = 200 GB of data to hold an entire shard.
    • If the depot volume is 250 GB, the entire shard fits in the depot cache (80% of 250 GB = 200 GB).
    • Typically a node caches at least two shards. Therefore, a depot volume of D = 500 GB (2 x 250 GB) is optimal in this situation, because all operations then run against a hot depot cache. Anything smaller may result in cache misses and consequently slower query performance.
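
The sizing arithmetic can be sketched in POSIX shell. With 1.2 TB split across 6 shards, each shard is 200 GB, and because only 80% of the volume is used for the cache, the volume must be the cached data size divided by 0.8. The function name `depot_volume_gb` is hypothetical and is not a Vertica command; the sketch assumes sizes in whole gigabytes with 1 TB = 1000 GB, and data spread evenly across shards.

```shell
# Hypothetical helper (not part of Vertica): depot volume size in GB such
# that the usable 80% of the volume holds the given number of whole shards.
depot_volume_gb() {
    db_size_gb=$1        # total compressed database size, in GB
    shard_count=$2       # number of shards (S)
    shards_per_node=$3   # how many shards each node should keep cached
    # Per-shard size, rounded up to a whole GB.
    shard_gb=$(( (db_size_gb + shard_count - 1) / shard_count ))
    # Divide by 0.8 (multiply by 100, divide by 80), rounded up.
    echo $(( (shard_gb * shards_per_node * 100 + 79) / 80 ))
}

depot_volume_gb 1200 6 1   # prints 250: one 200 GB shard fits in 80% of 250 GB
depot_volume_gb 1200 6 2   # prints 500: room for two shards per node
```

Treat the result as a lower bound for planning; real shard sizes are rarely perfectly even.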

Monitoring Depot Activity

Monitor depot activity using the following system and DC tables:

=> SELECT * FROM vs_depot_lru;
=> SELECT * FROM dc_depot_evictions;
=> SELECT * FROM dc_depot_fetches; 

Conclusion

Selecting N, S, and D is a cost-performance trade-off. The right values are a subjective judgment that depends on how much performance you want and what you are willing to pay for it.