Implement Multi-Tenancy Using Vertica

Overview

Vertica allows you to design and implement multi-tenancy systems to service multiple customers. Evaluate your performance, storage, and security needs to choose a model that best suits your data design and workload. Carefully consider the number of tables and columns that each tenant requires over time. Tables and columns translate into catalog objects. If the storage requirements for catalog objects increase, the memory size increases. In addition, a larger catalog can negatively affect database performance.

This document provides best practices for Vertica multi-tenant environment. The document describes why you might implement multi-tenancy and three approaches for configuring your Vertica database to support multiple tenants.

About multi-tenancy
Advantages and disadvantages of multi-tenancy
Using Vertica for multi-tenancy
- Multiple schemas: One schema per tenant
- One schema: Add tenant_id field to base tables
- Multiple clusters: One cluster per tenant

Introduction to Multi-Tenancy

In a Vertica multi-tenant environment, one instance of a Vertica database supports multiple tenants accessing the data. Each tenant may access its own data set or multiple tenants may share specific data sets.

Software as a Service (SaaS) environments use multi-tenancy. A Vertica customer has database that provides services to several customers. Typically, the database has multiple data sets, each of which corresponds to a custom service provided to each customer (tenant).

Consider an example: XYZ Company has a domain www.xyz.com. XYZ Company has multiple customers that share that web space, but use their own URLs. ABC Company owns www.abc.com, and DEF Company owns www.def.com. XYZ Company configures a Vertica database such that each domain has its own data set, defined as schemas, in that database. Each customer can load, manipulate, and analyze only their own data. They cannot access data in any other schema.

This configuration saves money for customers of the XYZ Company. The XYZ Company incurs the costs of purchasing a Vertica database and hiring database administrators instead of its customers. The XYZ Company manages the database and provides services to its customers.

Advantages and Disadvantages of Using Vertica for Multi-Tenancy

Use Vertica multi-tenancy if it suits your business needs.

Advantages	Disadvantages
With Vertica you can create a multi-tenant environment using these features: User-defined resource pools allow tenants to share memory resources. Tenants who have higher workloads or noisy neighbors need their own resource pools. The resource pools do not help manage the CPU, I/O, and network demands of multiple tenants. Schemas to segregate each tenant allow each tenant to access data in their designated schema. Object-level backups can back up each tenant individually and on different schedules. Storage locations direct the storage location of the data. You can store each tenant's data in a different storage location for I/O or security purposes. Perform database upgrades only once if you have one instance of the database to affect all the tenants. Copy data between schemas using EXPORT TO and COPY FROM. These statements support copying data at the table or schema level.	If your environment requires a SaaS multi-tenancy model, take into consideration the following issues with Vertica and multi-tenancy: If you have a large number of tenants, you may require a complex database configuration so that no tenant becomes starved for resources. Some Vertica database operations have system-wide impact, and they bypass the schema isolation. For example, if you create a projection for a specific tenant and Vertica does not refresh the projection, it can hold back the Ancient History Mark (AHM) for all the tenants. In addition, the DDL operations can cause system-wide conflicts. Adding a new tenant might impact existing tenants when creating database objects. An increase in tenants may result in a large database catalog. Catalog lock hold times increase, slowing all the operations that depend on the catalog lock.

Advantages

Disadvantages

With Vertica you can create a multi-tenant environment using these features:

User-defined resource pools allow tenants to share memory resources. Tenants who have higher workloads or noisy neighbors need their own resource pools. The resource pools do not help manage the CPU, I/O, and network demands of multiple tenants.
Schemas to segregate each tenant allow each tenant to access data in their designated schema.
Object-level backups can back up each tenant individually and on different schedules.
Storage locations direct the storage location of the data. You can store each tenant's data in a different storage location for I/O or security purposes.
Perform database upgrades only once if you have one instance of the database to affect all the tenants.
Copy data between schemas using EXPORT TO and COPY FROM. These statements support copying data at the table or schema level.

If your environment requires a SaaS multi-tenancy model, take into consideration the following issues with Vertica and multi-tenancy:

If you have a large number of tenants, you may require a complex database configuration so that no tenant becomes starved for resources.
Some Vertica database operations have system-wide impact, and they bypass the schema isolation. For example, if you create a projection for a specific tenant and Vertica does not refresh the projection, it can hold back the Ancient History Mark (AHM) for all the tenants. In addition, the DDL operations can cause system-wide conflicts. Adding a new tenant might impact existing tenants when creating database objects.
An increase in tenants may result in a large database catalog. Catalog lock hold times increase, slowing all the operations that depend on the catalog lock.

Three Approaches to Multi-Tenancy in Vertica

You can choose from three approaches to configure multi-tenancy in a Vertica database. Each approach has its advantages and disadvantages. Vertica recommends that you carefully evaluate each approach to see which approach best suits your environment.

Multiple Schemas: One Schema per Tenant

In this approach, you create a schema for each tenant. You configure role-level and user-level security to ensure that each tenant has access to only its specific data.

When to Use

Use the one schema per tenant approach for true multi-tenancy. This approach balances risk, flexibility, performance, and cost for a Vertica multi-tenancy environment. It is easy to set up and manage. You can use user-defined resource pools because the tenant workload is predictable. This approach is appropriate when tenants require direct and secure access to the database from their applications or clients.

Advantages and Disadvantages

Advantages	Disadvantages
Each tenant has its own schema and tables with easy database management. Schemas are mutually exclusive and do not require manual filtering of data. Unlimited schemas, projections, and system-wide number of ROS containers. Easy backup and restore of individual tenant data. You can define individual data retention policies for each tenant.	A large number of tenants with their schemas and tables result in a large catalog. A large catalog size results in long catalog lock hold times that affect all the database operations. Analysis of data across all tenants when each tenant has separate schema is complex. You need resource pools for each tenant to manage noisy neighbors. When the cluster goes down, it affects all the tenants. If one node fails, another node services two tenants that impacts database performance. Data separation into per-customer tables takes longer to compress than in a single table. You need to limit the size and the number of tenants to avoid large catalog problems. Vertica uses scaled-out clusters and distributes queries to all the nodes. This design limits the number of cores. For example, suppose you have 40 concurrent threads. You must provide resources and a connection to all tenants. 40 small tenants can lock a connection to the cluster, while a large tenant potentially might wait a long time to connect.

Advantages

Disadvantages

Each tenant has its own schema and tables with easy database management.
Schemas are mutually exclusive and do not require manual filtering of data.
Unlimited schemas, projections, and system-wide number of ROS containers.
Easy backup and restore of individual tenant data.
You can define individual data retention policies for each tenant.

A large number of tenants with their schemas and tables result in a large catalog. A large catalog size results in long catalog lock hold times that affect all the database operations.
Analysis of data across all tenants when each tenant has separate schema is complex.
You need resource pools for each tenant to manage noisy neighbors.
When the cluster goes down, it affects all the tenants. If one node fails, another node services two tenants that impacts database performance.
Data separation into per-customer tables takes longer to compress than in a single table.
You need to limit the size and the number of tenants to avoid large catalog problems.
Vertica uses scaled-out clusters and distributes queries to all the nodes. This design limits the number of cores.
For example, suppose you have 40 concurrent threads. You must provide resources and a connection to all tenants. 40 small tenants can lock a connection to the cluster, while a large tenant potentially might wait a long time to connect.

One Schema: Add tenant_id Field to Base Tables

In this approach, you assign a unique ID to each tenant. Each table contains a tenant_id field and stores data for all the tenants. The tenants access data they have read permissions to, because every query specifies the tenant_id. You must implement security filters at the record level or at the role level to maintain data privacy across tenants.

The following query returns all rows in the click_data table for tenant ABC:

SELECT * FROM click_data WHERE tenant_id="ABC";

When to Use

Use the one-schema approach when Vertica is a backend database for hosted tenants. In this approach, the same table stores related records, regardless of tenant_id. If you want to analyze all the data from all the tenants, that data is available in a single table.

For example, for the XYZ Company, the click_data table records the number of clicks per day per tenant. You need to access only click_data table to analyze this data across all the tenants. For less than 10 tenants, you can achieve excellent performance by implementing RLE encoding and using ORDER BY tenant_id. This approach accommodates many tenants. However, you must evaluate the impact of the workloads and the catalog sizes of individual tenants.

The one-schema approach is effective when the tenant workloads use queries with memory of less than 5 GB. Tenants in different geographical time zones do not require resources concurrently. The tenants with same physical projections can affect other tenants service-level agreements (SLAs). The tenants sharing same schema and tables are less secure. The one-schema approach is suited to fixed applications where the data and the reports are secure and controlled by one application.

Advantages and Disadvantages

Advantages Disadvantages

Advantages	Disadvantages
Role-based configuration access ensures tenant privacy. Design of fewer schemas and tables avoids large catalog and ensures few global locks that degrades performance. Easy analysis of data across all the tenants as compared to analysis of each tenant with own schema. Design is scalable with no architectural upper limit on the number of tenants. The ability to manage tenants of varying database footprints with equal efficiency. The ability of BI platforms to support multi-tenancy by using security filters on all the queries. Better data compression within a single table with comparable data types for all tenants instead of data separated into tenant-specific tables. Ability to create partitions using `tenant_id` and time-based expressions. Using `tenant_id` enables partition pruning during query execution, because `tenant_id` is in the WHERE clause of every query. Partition pruning is evaluated on a date/time range such as the current month. Each partition contains all the data for that month for all tenants. Using time-based expressions enables the option of creating offline archives of older partitions. Offline archives are backed up, dropped, and restored at a later date if required. Using time-based expressions reduces the impact of Tuple Mover operations. New data is processed in the active partition.	A large number of tenants result in excessive partitions. Excessive partitions lead to ROS pushback and disk fragmentation. Bulk loading of new data across partition results in Tuple Mover moveout issues. All tenants' data in the same table must follow the same data retention policy to use partitioning. Backing up and restoring individual tenant data is difficult because tables contain data for multiple tenants. Administrative challenges including setting up access for each tenant, creating and assigning `tenant_id`, and supporting database operations using the `tenant_id`. Queries from tenants monopolize system resources. If you have tenants with heavy workloads, you need to create resource pools for each tenant. Auditing raw data per tenant license is difficult. Limited disk space for each tenant usage. Conflicts between tenants with schema can change. If one cluster goes down, it affects all the tenants.

Role-based configuration access ensures tenant privacy.
Design of fewer schemas and tables avoids large catalog and ensures few global locks that degrades performance.
Easy analysis of data across all the tenants as compared to analysis of each tenant with own schema.
Design is scalable with no architectural upper limit on the number of tenants.
The ability to manage tenants of varying database footprints with equal efficiency.
The ability of BI platforms to support multi-tenancy by using security filters on all the queries.
Better data compression within a single table with comparable data types for all tenants instead of data separated into tenant-specific tables.
Ability to create partitions using tenant_id and time-based expressions.

Using tenant_id enables partition pruning during query execution, because tenant_id is in the WHERE clause of every query. Partition pruning is evaluated on a date/time range such as the current month. Each partition contains all the data for that month for all tenants.
Using time-based expressions enables the option of creating offline archives of older partitions. Offline archives are backed up, dropped, and restored at a later date if required. Using time-based expressions reduces the impact of Tuple Mover operations. New data is processed in the active partition.

A large number of tenants result in excessive partitions. Excessive partitions lead to ROS pushback and disk fragmentation.
Bulk loading of new data across partition results in Tuple Mover moveout issues.
All tenants' data in the same table must follow the same data retention policy to use partitioning.
Backing up and restoring individual tenant data is difficult because tables contain data for multiple tenants.
Administrative challenges including setting up access for each tenant, creating and assigning tenant_id, and supporting database operations using the tenant_id.
Queries from tenants monopolize system resources. If you have tenants with heavy workloads, you need to create resource pools for each tenant.
Auditing raw data per tenant license is difficult.
Limited disk space for each tenant usage.
Conflicts between tenants with schema can change.
If one cluster goes down, it affects all the tenants.

Multiple Clusters: One Cluster per Tenant

In this approach, you set up multiple clusters, one for each tenant. Each cluster has an instance of Vertica running on it. Each tenant has full isolation from other clients and full access to the resources on their cluster.

Using multiple clusters automates cluster provisioning and scales clusters effectively. It is easy and inexpensive in the cloud to add new clusters. Separate clusters for each tenant provides full isolation for each tenant. Isolating tenants eliminates impact from noisy neighbors.

When to Use

Use the multiple-cluster approach to multi-tenancy when Vertica is hosted in the cloud. This approach is favorable when tenant workloads are varied and unpredictable.

Advantages and Disadvantages

Advantages	Disadvantages
Each tenant has its own schema and tables with easy database management. Schemas are mutually exclusive and do not require manual filtering of data. You can implement user-based and role-based security. Large catalogs are manageable on a single cluster. Each tenant has exclusive access to system resources. The tenants do not impact each other. Easy backup and restore of individual tenant data. Each tenant has individual data retention policy. If a cluster fails, it only affects one tenant.	There are additional hardware infrastructure costs like networks, switches, storage, and servers to support multiple clusters. System and database management costs increase with multiple clusters. Database administrators need to monitor the software, perform backups and restores, and perform software upgrades on all the clusters. Multiple instances of Vertica require duplicate scripts and procedures for data loading, ETL, and reporting.

Advantages

Disadvantages

Each tenant has its own schema and tables with easy database management. Schemas are mutually exclusive and do not require manual filtering of data. You can implement user-based and role-based security.
Large catalogs are manageable on a single cluster.
Each tenant has exclusive access to system resources. The tenants do not impact each other.
Easy backup and restore of individual tenant data.
Each tenant has individual data retention policy. If a cluster fails, it only affects one tenant.

There are additional hardware infrastructure costs like networks, switches, storage, and servers to support multiple clusters.
System and database management costs increase with multiple clusters. Database administrators need to monitor the software, perform backups and restores, and perform software upgrades on all the clusters.
Multiple instances of Vertica require duplicate scripts and procedures for data loading, ETL, and reporting.

Conclusion

We saw three approaches, each with its advantages and disadvantages:

Multiple schemas: One schema per tenant: A design with independent schemas is easy to maintain but has limitations on the number of tenants you can host in the same cluster.
One schema: Add tenant_id field to base tables: A design with one schema where all tables have a tenant_id field makes cross-tenant data analysis easy. However, this approach requires careful access management for each tenant's data for security and database performance.
Multiple clusters: One cluster per tenant: A design with a separate cluster per client can be expensive and time-consuming to implement and maintain.

Make sure to review this information and evaluate your database needs in order to implement a design that works best for your use case.