There were many interesting technical sessions at the 2014 Vertica Big Data Conference. However, I feel that the most captivating session was on loading 35TB per hour. Just the sheer number of nodes in the clusters, amount of data ingested, and use cases made for a truly educational session. This session covered how Vertica was able to meet Facebook’s ingest SLA (Service Level Agreement) of 35TB/hour.
Facebook’s SLA required continuous loading without disrupting queries, while still budgeting enough capacity to catch up after maintenance, unplanned events, or ETL process issues. Planning for these events under continuous loading presents two options: catch up on the missed batches or skip them. There are also numerous considerations, such as the average ingest size and the peak ingest size. There may also be an availability SLA requiring that users have access to data within a certain amount of time after an event happens.
In planning for Facebook’s SLA, the processes that drove the large fact tables were scheduled to run every 15 minutes, or 4 times an hour. Factoring in the largest possible batch, and allowing time for batches to catch up, the requirement worked out to 35 TB/hour. Achieving this rate required writing to disk at 15,000 MB/sec, writing the data only once.
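As a rough sanity check of that arithmetic (a sketch; the decimal unit convention and the reading of the extra margin as catch-up headroom are my assumptions, not stated in the session):

```python
# Back-of-the-envelope check of the ingest SLA.
# Assumption: decimal units (1 TB = 1,000,000 MB).
tb_per_hour = 35
sustained_mb_per_sec = tb_per_hour * 1_000_000 / 3600
print(f"sustained: {sustained_mb_per_sec:,.0f} MB/sec")  # ~9,722 MB/sec

# The 15,000 MB/sec target then implies roughly 1.5x headroom over the
# sustained rate -- the budget for catching up after an outage.
headroom = 15_000 / sustained_mb_per_sec
print(f"headroom: {headroom:.2f}x")
```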
The first stage of the discovery process was to determine how fast one node could parse with a COPY statement. For the test, a COPY was initiated from an arbitrary node, with the files staged on node 1. Only the CPU on node 1 performed parsing; the parsed rows were then delivered to all nodes and sorted on each node. It was found that a rate of 170 MB/sec could be achieved with 4 COPY statements running in parallel (on one-year-old hardware); additional statements did not increase the parse rate. In this configuration, memory was consumed during the sort phase.
Scaling out to parse on multiple nodes, throughput scaled nearly linearly to about 510 MB/sec. In this test, a COPY was initiated from an arbitrary node, with files staged on nodes 1, 2, and 3. Again, only the CPUs on nodes 1, 2, and 3 performed parsing; the parsed rows were then delivered to all nodes and sorted on each node. By the time 50-60 nodes were parsing, more memory was consumed by receive than by sort, and scaling broke down past that point. Vertica is designed to ensure that there is never head-of-line blocking in a query plan, which means that beyond 50 or 60 parsing nodes, the overhead of that many streams converging on each node requires a great deal of memory for receive buffers.
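Putting the measured figures together shows why this limit matters (a sketch using the session's numbers; the extrapolation assumes perfectly linear scaling, which, as described above, breaks down past 50-60 nodes):

```python
# Sketch: what the measured per-node parse rate implies for cluster sizing.
# Numbers from the session: ~170 MB/sec per parsing node (4 parallel COPYs),
# ~510 MB/sec across 3 nodes, i.e. near-linear scaling.
per_node_mb_sec = 170
target_mb_sec = 15_000
nodes_needed = target_mb_sec / per_node_mb_sec
print(f"~{nodes_needed:.0f} parsing nodes at linear scaling")  # ~88 nodes

# But past ~50-60 parsing nodes, every receiving node must keep a receive
# buffer per incoming stream, so memory -- not parse CPU -- becomes the limit.
```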
At 50 to 60 nodes, the overhead of running a COPY statement is large enough that it becomes difficult to run 3 or 4 statements in parallel and continue to scale. If the cluster were simply scaled up, queries would be disrupted because the nodes would be completely utilized for 10 minutes at a time. Spreading the COPY across the full 15 minutes would leave some CPU for queries, but not enough room to catch up. In addition, Facebook required no more than 20% degradation in query performance.
Continuing to scale to 180 nodes, running a COPY statement becomes expensive in terms of memory. The best approach is to isolate the workloads, separating loading from querying using Vertica’s ephemeral node feature. This feature is typically used in preparation for removing a node from a cluster; in this solution, however, some nodes were marked ephemeral before data was loaded. Data loaded into the database was therefore never placed on the ephemeral nodes, and those nodes were not used for querying.
With the loader nodes isolated from querying, their CPUs could be utilized at 100%. The implementation required adjusting resource pool scaling only on the ingest nodes, using an undocumented parameter, SysMemoryOverrideMB, set in vertica.conf on those nodes. There are additional considerations around memory utilization on the ingest nodes, such as setting aside memory for the catalog. After some minor tuning, the results looked like this:
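For illustration only, such an override is a simple key/value setting in vertica.conf; the value below is a hypothetical placeholder, and since the parameter is undocumented, its exact semantics may vary by version:

```
SysMemoryOverrideMB = 196608
```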
Ingest Rate (TB/HR) | Peak Ingest Rate (TB/HR)
There were also considerations around using a cluster file system. For instance, a batch process may produce all the files to be loaded in a COPY statement, after which those files are transferred to the loader nodes. If one of the loader nodes has a hardware failure, the batch fails during the COPY. With a NAS or cluster file system, recovery would be possible by simply re-running the batch. However, the disk I/O and network bandwidth required to write extra copies of the data for each batch made a NAS or cluster file system too expensive. It was therefore easier to re-run the batches.
Given the massive pipes at Facebook, it was cheaper to re-run an ingest than to build a high-availability solution that could preserve staged data when part of a batch failed to complete. However, if it is critical to make data available to users quickly, it may be worth investing more in an architecture that allows the COPY statement to restart without re-staging data.
Along the way, it was discovered that file systems lose performance as they age under a high rate of churn. The typical Vertica environment holds about a year or two of history. With a short window of history and 3-4% of the file system capacity loaded each day, the file system is churned frequently. Although some file systems age better than others, it is good practice to monitor the percentage of CPU time spent in system (kernel) mode.
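One minimal way to watch that metric, sketched below under the assumption of a Linux host (the field layout comes from the `/proc/stat` "cpu" line):

```python
# A sketch of monitoring %sys: the share of CPU time spent in system
# (kernel) mode, computed from the aggregate "cpu" line in Linux /proc/stat.
# A steady rise over weeks can indicate an aging, fragmented file system.

def system_cpu_fraction(stat_line: str) -> float:
    # Fields after the "cpu" label: user nice system idle iowait irq softirq ...
    fields = [int(x) for x in stat_line.split()[1:]]
    return fields[2] / sum(fields)

if __name__ == "__main__":
    with open("/proc/stat") as f:
        frac = system_cpu_fraction(f.readline())
    print(f"{frac:.1%} of CPU time in system mode")
```

Sampling this periodically (e.g. from cron) and alerting on an upward trend is enough to catch the aging behavior described above.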
Facebook’s churn rate was about 1.5 PB every 3 days in their 3.5-4 PB cluster. After about a week, the file system would become extremely fragmented. This was not a Vertica issue, but rather an operational issue spanning the stack from the hardware up through the software.
There will be times when a recovery cannot finish if loading takes place 24 hours a day and there are long-running queries. The larger the cluster, the more likely hardware failures, and therefore nodes in recovery, become. As of version 7.1, standby nodes can be assigned to fault groups. For example, a rack of servers can be treated as a fault domain, with a spare node kept in standby within that domain. If a node fails and stays down past a downtime threshold, the standby node within the same rack becomes active. This approach is well suited to clusters that are continuously busy and where nodes must enter recovery fairly quickly.
The most common practice for a standby cluster is to run the ETL twice, maintaining two separate Vertica instances. With Facebook’s clusters potentially running to hundreds of nodes, this approach was not feasible as a high-availability solution.
Facebook is taking steps to give better operational performance to other groups in the organization while ensuring that none of their clusters sit idle. The focus is moving away from dedicated resources and toward dynamically allocating resources and offering them as a service. The goal is for any cluster, regardless of size, to be able to request data sets from any input source and ingest them into Vertica. In terms of size, 10 nodes is considered small, and several hundred nodes large.
To accomplish this, an adapter retrieves the source data and puts it into a form that can be staged or streamed. At the message bus layer, the data is pre-hashed, meaning that by the time rows reach Vertica the hash resolves locally, and each row also hashes to its buddy. This is accomplished by implementing the hash function used on the projection in the message distribution layer. Based on the value of the segmentation column, the node on which a row will reside can be predicted, so data can be sent explicitly to the target nodes with no need to redistribute data between them. With this elastic and scalable structure, the ephemeral node approach to ingest would no longer be needed.
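The pre-hashing idea can be sketched as follows. This is a simplified stand-in: Vertica's actual segmentation hash is internal to the database, so a generic stable hash is used here, and the next-node buddy placement is an assumption for illustration:

```python
# Simplified sketch of pre-hashing in the message distribution layer.
# Hashing a row's segmentation-column value predicts which node will store
# the row (and which buddy holds the replica), so rows can be shipped
# straight to their target nodes with no inter-node redistribution.
import hashlib

def target_node(seg_value: str, num_nodes: int) -> int:
    # Stable stand-in hash (NOT Vertica's actual hash function).
    digest = hashlib.md5(seg_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def buddy_node(node: int, num_nodes: int) -> int:
    # Assumption: the buddy projection places the replica on the
    # next node in the ring.
    return (node + 1) % num_nodes

for key in ("user_1001", "user_1002", "user_1003"):
    n = target_node(key, num_nodes=8)
    print(f"{key} -> node {n}, buddy {buddy_node(n, 8)}")
```

Because the same function runs in the message distribution layer and in the database, every producer agrees on each row's destination without consulting Vertica.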