One of the more interesting and popular Vertica solutions that I’ve come across is the so-called multi-tenant implementation. A multi-tenant database is simply a database designed to support multiple unique customers. Often, these customers interact with the database through some sort of web interface which uses an Vertica database in the backend.
Commonly, these types of implementations segregate clients out by schema. That is, every client that shares the database has its own unique schema. Normally, the tables in each schema are duplicated among schemas, such that each client has its own set of tables. Sometimes there are client-specific changes to individual tables. This is generally kept to a minimum, because it does complicate the database design. In addition, there might be several global tables that are shared across all clients equally. These are generally metadata tables and are stored in the public schema, or some other named schema.
One of the chief challenges in this setup is the noisy neighbor problem. Noisy neighbor is when you have one client that dominates cluster resources because of its skewed high usage. Often a handful of very large, very active clients can negatively impact the smaller client’s ability to get resources or run optimally. You can use several strategies that can overcome this problem:
- Resource Pools
- Multiple Clusters
The biggest weapon in Vertica’s arsenal is the Resource Manager. The Resource Manager is the subsystem that manages resources for the various queries running in Vertica. By default, the Resource Manager will do nothing to isolate noisy neighbors. It needs to be configured.
The most common approach to managing noisy neighbors is to create client-specific resource pools and set a MaxConcurrency on them. By capping the amount of queries that can run at a specific time in a pool, resources that might otherwise get consumed will remain available for other clients.
The other parameter here that’s useful is MaxMemorySize. This parameter caps the total amount of memory the pool is able to borrow from the GENERAL pool. Once this limit is reached, queries will queue until resources are available. Setting a MemorySize value is generally not recommended here — this reserves a specific amount of memory for the pool, and makes it available only to that pool even if it’s not being used.
Another less frequently used, but very powerful feature available here is cascading pools. Cascading pools are a way of setting up tiers of pools for the purposes of separating smaller, fast-running queries from larger, longer-running queries. You can create different resource pools with different configurations. Commonly, these are pools that represent fast, medium, or long queries, or some variation. You can design a fast pool to move queries through the system quickly by setting a small MaxMemorySize and a higher priority. The long pool can have a lower priority, perhaps a higher MaxMemorySize, and a longer QueueTimeout, since more queries might end up here. Configure the medium pool to lie somewhere in between. When a query exceeds the RunTimeCap of one pool, it will cascade to another pool defined by the CascadeTo parameter. Setting up cascading pools can be a great way of managing a high volume workload.
In addition to resource pools, you can get clever with the segmentation clause in Vertica. The projection segmentation clause defines where data resides based on the hash of some key value. Normally, the segmentation clause is the hash of something fairly unique like a primary key value, which ensures even distribution of the data across all the nodes. However, instead of picking a random key, you can segment projections on business keys like client_name or client_id. That way, all data for that client will reside on a specific node.
If you’re trying to accomplish a multi-tenant environment using a single Vertica cluster, you might want to stop and ask yourself what you hope to gain. There are some definite disadvantages to putting all your eggs in a single basket. Vertica is a big data, multi-node MPP database. It’s designed for big data, and does a great job handling it fast. If your big data comes in the form of thousands of tables, and hundreds of tenants, then the best solution might not be one giant cluster. For some clients, breaking things out into several, smaller cluster provides many advantages:
- Clients can be moved around. If the cluster is not behaving normally, you could code the ability to move clients off a cluster onto a completely different cluster. This removes a single point of failure from your system.
- Clusters can be located in different geographies. A European cluster could be created to service clients in Europe, for example. Also, on the off chance that a data center goes offline, you’re not completely reliant on a single location. Yes, data centers can and do fail.
- One client can’t ruin it for everyone. The noisy neighbor problem becomes more challenging the more neighbors you have. Could you write a query so bad that it affects performance cluster-wide? You probably could. Could your clients?
- Upgrades and maintenance are less impactful and less time-consuming. Would it take longer to upgrade 10 3-node clusters or 1 30-node cluster? Probably the former, but which carries less risk, less stress, and provides less impact?
We’ve talked about segmentation as a strategy for defeating noisy neighbor, but what if we go the other way? If you’re using a relatively small cluster, with more or less evenly-sized client information, replication might be the best way to mitigate the noisy neighbor issue. For example, you could have 6 different clients sharing a 3-node cluster. With a minimum of 3 nodes, you still take advantage of the powerful Vertica K-Safety feature, which automatically provides failover capabilities. But you could hard-code your application to recognize which node belongs to a specific client. In our hypothetical 3-node cluster, let’s service six different clients – A, B, C, D, E and F. Since all of my data is fully replicated, all data for all 6 clients lives on all 3 nodes. But my interface is coded in such a way that clients A and B only ever use node 1, clients C and D only ever use node 2, and clients E and F only ever use node 3. Because my data is fully replicated, if client F issues a horrible query, the only other client it could ever impact is client E. And if it became a recurring problem, I could modify my interface to switch client E to using node 2, where maybe C and D are not doing too much that I can add a 3rd client there with little overall impact.
The bottom line is that, it’s hard to predict which clients (or tenants) are going to cause you a problem up front. Being able to easily (and cheaply) move them around, isolate them, or just minimize the amount of impact they have on the system is critical for ease of use and uptime.