Docker CE and Kubernetes - FAQs

Adoption of containerization has expanded in the past few years because of its ease of deployment and its lightweight use of machine resources. Starting with version 10.1.1, Vertica can be deployed as a Docker container (One Node Enterprise Mode - Community Edition). In addition, support for Vertica in Kubernetes was introduced for multi-node containerized environments. This allows a user to deploy a self-healing, highly available, and scalable Eon Mode database in a Kubernetes (K8s) cluster with minimal manual intervention. This document briefly introduces you to Vertica in containerized environments: Docker Community Edition (CE) and Kubernetes (K8s). It answers basic and advanced questions about Docker CE Vertica images, Vertica-K8s images, generic deployment-specific questions, and troubleshooting.

This article is intended for users interested in learning or using Vertica in a containerized environment.

Note that Community Edition is referred to as CE and Kubernetes as K8s throughout this document.

What is Vertica?

Vertica is a unified analytics platform, based on a massively scalable architecture with the broadest set of analytical functions spanning event and time series, pattern matching, geospatial, and end-to-end in-database machine learning. Vertica enables you to easily apply these powerful functions to the largest and most demanding analytical workloads, arming you and your customers with predictive business insights faster than any analytics data warehouse in the market. Vertica provides a unified analytics platform across major public clouds and on-premises data centers, and integrates data in cloud object storage (S3, GCS, Azure Blob) and HDFS without you having to move any data. To learn more, see the Vertica website.

What is Containerization, Docker, and Kubernetes?

Containerization, simply put, involves wrapping standard Linux processes and all their dependencies into an isolated environment.

Docker is software that can package an application and its dependencies in a virtual container that runs on any Linux, Windows, or macOS computer. This enables the application to run on-premises or in a public or private cloud.

Kubernetes (K8s) is a container orchestration platform that automates and simplifies application deployment, scaling, and management using containers.

Docker CE

Basics

How is the Vertica one node CE container different from Vertica VM?
- A Vertica CE VM needs a virtualization platform to run and is more heavyweight (in size and memory) compared to the Vertica CE one-node container.
- Deploying a VM can take several minutes whereas a Docker based container can be spun up within a minute using a Dockerfile.
- Container technology provides the freedom to run environments independently of the host operating system and "package" environments in a lightweight manner.
What does a Vertica CE image contain? How is it created?
1. A Vertica CE image (CentOS 7.9) is available on Docker Hub: Vertica Docker Hub. The CE image is built using the Dockerfile provided on the Vertica GitHub page, Vertica Containers. The Dockerfile contains a set of instructions to build and package a Vertica image. This includes:
  - Installing prerequisites (libraries and dependencies)
  - Installing Vertica
  - Setting the correct locale
  - Adding or modifying directory permissions
  - Creating the user and groups
  - Compiling the VMart binary and generating data
  - Setting environment variables necessary for the entrypoint script
2. This image can be easily exported and is downloadable from Docker Hub.
3. We also provide Dockerfiles for CentOS 8.3 and Ubuntu 18.04/20.04. For more information on Dockerfiles, see https://github.com/vertica/Vertica-containers.
Can I create a custom Vertica Image with a different dataset?

Yes. We publish the Dockerfile (Vertica Containers - GitHub) that you can easily modify to replace the VMart dataset with yours. You can load the required dataset after you launch a container by connecting to Vertica within the container.
How many images does Vertica provide? What are the specs of each image?

We publish the CE image on Docker Hub (https://hub.docker.com/u/vertica). Additionally, we provide images specific to K8s (minimal and full-fledged), a Vertica operator image, and an image for the Vertica logger. The minimal images exclude TensorFlow libraries and have a smaller footprint. The operator image runs the container for our Vertica-K8s operator. The logger image is used to run sidecar containers in K8s for collecting logs. We intend to publish a new vertica-k8s image for each release, but not for hotfixes.
What is the advantage of deploying Vertica in K8s or in a container?
1. For customers and users moving towards cloud native and microservices infrastructure, deploying traditional Vertica could feel out of place. We provide an operator, which can simplify deployment of Vertica in K8s and other compatible cloud orchestration platforms. The Operator automates several of the tasks that an admin would otherwise perform.
  
  Deploying a Vertica cluster is now as simple as applying a YAML file with the right configurations. Modifying the cluster follows suit. Scaling or adding/removing nodes (pods in K8s) is easier and time efficient. We do, however, recommend using one pod per node strategy to not impact Vertica's performance.
2. One-node CE gives you a quick and easy way to deploy Vertica on a small scale. This is useful for, but not limited to, building POCs or testing out a dataset.
What are advantages and disadvantages of building my own Docker image vs using Vertica pre-built from Docker Hub?
1. Using the pre-built image from Docker Hub gives you ease of use. By default, the Vertica image comes pre-built with the VMart dataset, which you can try out of the box. It is also pre-configured with several settings that our team has determined are needed to run Vertica binaries within a container.
2. Creating your own image or container gives you flexibility. You could build binaries for a different Vertica version, inject a custom dataset, or even try out a different OS. However, we caution that Vertica still has some OS, storage, and environment-specific requirements that must be fulfilled. Once you go beyond our published and recommended image, it is up to you to ensure that you meet those requirements and that nothing breaks.
3. The default CE image installs many optional Vertica packages (for example, for Machine Learning, Tensorflow integration, ORC integration, Kafka, and GIS extensions). These increase the size of the image, which increases load-time.

Advanced

How is the data mounted in the Docker container? Is it an Enterprise mode database?
- Data directories can be mounted onto the container either via docker-volumes or bind-mounts. If no bind-mounts are specified on container startup, the container runtime will mount all directories in the container on the host at
  /var/lib/docker/volumes/<vol-name>/<dir-name>.
- If your /var does not have enough disk space, you may be unable to load large datasets. In such cases, you can use a bind-mounted directory that has enough space to persist data from within the container.
- This by default creates a One Node verticaDB - Enterprise Mode.
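
As a rough sketch of the bind-mount option, the following Python snippet assembles a `docker run` command that mounts a host directory into the container. The host path `/mnt/bigdisk/vertica-data`, the container name, and the helper name `docker_run_command` are hypothetical. The `/data` target is assumed from the docker-compose example in this article and may differ for other image versions.

```python
import shlex


def docker_run_command(host_dir, image="vertica/vertica-ce", name="vertica_ce"):
    """Build a `docker run` argument list that bind-mounts host_dir at /data."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", "5433:5433",
        "-p", "5444:5444",
        # Bind mount: data written under /data is stored in host_dir
        # instead of a Docker-managed volume under /var/lib/docker.
        "--mount", f"type=bind,source={host_dir},target=/data",
        image,
    ]


cmd = docker_run_command("/mnt/bigdisk/vertica-data")
print(shlex.join(cmd))  # copy-pasteable shell command
```

Building the command as a list keeps each argument separate, so paths containing spaces are quoted correctly when printed.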

How can I use this image in a docker-compose file?

Following is an example deployment:

docker-compose.yaml
version: "3.9"
services:
  vertica:
    environment:
      APP_DB_USER: "newdbadmin"
      APP_DB_PASSWORD: "vertica"
      TZ: "Europe/Prague"
    container_name: vertica-ce
    image: vertica/vertica-ce
    ports:
      - "5433:5433"
      - "5444:5444"
    deploy:
      mode: global
    volumes:
      - type: volume
        source: vertica-data2
        target: /data
volumes:
  vertica-data2:

Next, you can run docker-compose up:

docker-compose --file ./docker-compose.yaml --project-directory <directory_name> up -d

Can I use this image with Docker Swarm or Docker Stack?

We do not officially support Docker Swarm. However, because containerization is inherently flexible, you can get something set up with some infrastructure configuration.

We encourage users to use Kubernetes because it solves various orchestration problems, such as how to set up Docker nodes and network them together.

You could try to fashion something, but bear in mind that the simplest use case would be Enterprise Mode and not Eon Mode. Eon Mode may require further effort.

Kubernetes (K8s)

Basics

What is the difference between deploying a cluster of VMs for Vertica cluster vs using a K8s cluster?

VM management software such as VMware vSphere distributes VMs across several resource pools, which are derived from physical ESX machines. The distribution and scheduling of the VMs can be configured in ESX. Similarly, K8s aims to distribute and schedule containers across physical nodes. Typically, there exists one physical node that acts as the K8s master and several that are worker nodes. There can be one or more of each.
How is Vertica deployed in K8s?
1. As of this article, we only support Vertica's Eon Mode on K8s. Vertica provides an operator to deploy an Eon Mode database as a series of StatefulSets on a Kubernetes cluster, each StatefulSet corresponding to a Vertica subcluster. Each pod in this deployment corresponds to a Vertica "Node". For example, if you deploy a 3-pod Vertica StatefulSet cluster, it means you have 3 Vertica nodes to work with. You can then access and treat each of these pods just like how you would treat a Vertica node. For production, we recommend you keep a one-to-one mapping of pods-to-K8s worker nodes. However, you could easily deploy multiple pods per your worker node to provision a larger cluster. (This can however impact performance for obvious reasons.)
2. Client connectivity happens through Service objects. We create a separate Service object for each subcluster. Each Service object does load balancing between the Ready pods in the subcluster. Pods are considered Ready if the Vertica daemon process is running and accepting connections.
3. We provide the Vertica-k8s operator that deploys and manages the StatefulSet deployment (https://github.com/vertica/vertica-kubernetes). A custom resource definition (CRD) must be installed in your cluster so that the API server recognizes requests to create or manage Vertica-specific resources (subclusters and such).
4. A StatefulSet is an ordered list of pods. This is ideal and necessary as each node in Vertica is unique and requires its state to be persistent. Each pod is bound to its data using the persistent volume claim (PVC) and its corresponding persistent volume (PV).
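
To illustrate the ordered pod identities, here is a minimal sketch that lists the pod names a subcluster would get. Kubernetes names StatefulSet pods `<statefulset-name>-<ordinal>`. The assumption that the StatefulSet itself is named `<vdb-name>-<subcluster-name>` is inferred from the shell prompt shown later in this article (`vertica-sample-defaultsubcluster-0`) and is not confirmed operator behavior. The helper name `pod_names` is hypothetical.

```python
def pod_names(vdb_name, subcluster, size):
    """Return the ordered pod names for a subcluster of `size` pods."""
    # Assumed naming scheme, inferred from the prompt in this article.
    statefulset = f"{vdb_name}-{subcluster}"
    # StatefulSet pods receive stable ordinals starting at 0.
    return [f"{statefulset}-{ordinal}" for ordinal in range(size)]


print(pod_names("vertica-sample", "defaultsubcluster", 3))
```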

Advanced

What kind of services does Vertica StatefulSet deploy?

A headless service that maintains DNS records and ordered names for each pod, and a load balancing service for each subcluster (ClusterIP/NodePort/LoadBalancer) that manages internal traffic and external client requests for the pods in your cluster.
What is a headless service for Vertica-K8s? Why do we need it?

Vertica StatefulSet uses a Headless Service to control the domain of its pods. The domain managed by this service takes the form:
$(service name).$(namespace).svc.cluster.local, where "cluster.local" is the cluster domain.
For headless Services, a cluster IP is not allocated, kube-proxy does not handle these services, and there is no load balancing or proxying done by K8s for them.
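
The domain pattern above can be sketched as a small helper. With a headless Service governing a StatefulSet, each pod is also addressable as `<pod-name>.<service-domain>`. The service name `vertica-sample` used here is a hypothetical example, not necessarily the name the operator creates.

```python
def service_domain(service, namespace, cluster_domain="cluster.local"):
    """Domain managed by the headless Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_fqdn(pod, service, namespace, cluster_domain="cluster.local"):
    """Stable DNS name of one StatefulSet pod behind the headless Service."""
    return f"{pod}.{service_domain(service, namespace, cluster_domain)}"


# Hypothetical service name; the pod name matches the prompt shown in this article.
print(pod_fqdn("vertica-sample-defaultsubcluster-0", "vertica-sample", "default"))
```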
How can I connect to Vertica from an external client?
- K8s provides external connectivity via either NodePort or LoadBalancer (LB) service objects. A NodePort maps each ClusterIP object to a unique port on each of the K8s nodes. A LB creates a single service object that can be used across the cluster.
- A LB service type is only supported on cloud platforms such as AKS and EKS. Bare-metal deployments do not support LB service type. You may use a third-party LB such as MetalLB or HAProxy to use the LB service type.
- Using a NodePort service type out of the box is difficult for connecting from an external client such as vsql because, by default, K8s assigns a port in the range 30000-32767. Each pod maps its 5433 port to this port on the node. However, vsql and other clients connect to port 5433 by default when connecting to Vertica. Therefore, if you plan to use a NodePort service type, make sure to use an external LB such as HAProxy to map the 3xxxx port to port 5433. If the client allows a port override for Vertica, you can use the NodePort directly to access the DB.
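
For clients that allow a port override, the following sketch builds a vsql invocation against a NodePort. The default NodePort range in Kubernetes is 30000-32767, and cluster administrators can change it. The host `10.0.0.5`, the port `31000`, and both helper names are hypothetical. `-h`, `-p`, and `-U` are standard vsql options.

```python
# Kubernetes default NodePort range: 30000-32767 inclusive.
DEFAULT_NODE_PORT_RANGE = range(30000, 32768)


def in_default_nodeport_range(port):
    """True if `port` falls in the default NodePort range."""
    return port in DEFAULT_NODE_PORT_RANGE


def vsql_command(node_host, node_port, user="dbadmin"):
    """Build a vsql command that overrides the default 5433 port."""
    return ["vsql", "-h", node_host, "-p", str(node_port), "-U", user]


print(" ".join(vsql_command("10.0.0.5", 31000)))
```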
Does K8s load-balance my requests or does Vertica take part in the load balance amongst pods?

We recommend that you turn off the Vertica native load balancer and allow the Kubernetes services to manage load balancing for the subcluster. You can configure the native Vertica load balancer within the Kubernetes cluster, but the results might be unexpected. For example, if you set the Vertica load balancing policy to ROUNDROBIN, the load balancing might appear random, due to the 2 layers of load-balancing happening. The clusterIP service automatically load balances across the pods that are attached to that service.
What kind of storage provisioners does Vertica support?

Storage provisioners come into the picture when you want to decide on the local storage. Before you decide on the storage provisioner, review filesystem formats we currently support, File System Formats.

Though we have not thoroughly tested each provisioner, we expect that the agnostic behavior of using storage classes in K8s lets you use most of them if the underlying filesystems are the supported ones. For example, we typically use https://github.com/rancher/local-path-provisioner to create a local-path storage class that simply uses host directories as mount points.
Which network plugin should I use in my K8s deployment?

Though not an extensive list, Flannel and Calico should work fine. However, we suggest using iptables instead of IPVS for kube-proxy due to a known issue of client-disconnects with IPVS under heavy loads.
What object store endpoints are supported for communal storage? What do I do if I am using an https enabled endpoint?
- We currently support AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, and HDFS. We also support on-premises S3-compatible object storage such as MinIO and Pure Storage. For a full list and more information, check out On-Premise Object Storage.
  The way of setting up the Custom Resource for your cluster may vary according to which endpoint you use. For more details, see Configuring Communal Storage in the Vertica documentation.
- If you are using the HTTPS endpoint, you need to provide a CA cert file for secure TLS communication. You need to create a secret that references your certificate and specify it in your CR config file.
  Note: AWS S3 works out of the box with Vertica, and you do not need a CA/key file to use HTTPS with an AWS endpoint.
How does a database like Vertica get past pod restarts without losing data?
- Each pod uses a PVC and PV to bind to a persistent storage location, which stores the catalog, temporary, and depot data. Because we use Vertica's Eon Mode to deploy the cluster, the DB data is stored in communal storage (S3/Azure/GCS/HDFS) and is to some extent independent of node (in our case, pod) restarts. The Vertica-k8s operator also provides an easy way to revive the cluster even when all the pods go down. All you need to do is set "initPolicy" in the spec of your YAML manifest to "Revive" instead of "Create".
  For more information on local storage check the "Local Volume Mounts" section in Containerized Vertica.
- DNS names provide continuity between pod life cycles. Each pod is assigned an ordered and stable DNS name that is unique within its cluster. When a Vertica pod fails, the rescheduled pod uses the same DNS name as its predecessor. Rescheduled pods require information about the environment to become part of the cluster. This information is provided by the Downward API.
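
As an illustration of the revive setting, this sketch generates a VerticaDB manifest with `initPolicy` set to "Revive". The field names follow the custom resource example in this article. The endpoint and credential fields are omitted for brevity, and the helper name `vdb_manifest` is hypothetical. The output is JSON, which kubectl accepts in place of YAML. Only the two `initPolicy` values named in this article are allowed here, and the operator may accept others.

```python
import json


def vdb_manifest(name, communal_path, init_policy="Create"):
    """Return a minimal VerticaDB manifest as a dict (sketch, not complete)."""
    if init_policy not in ("Create", "Revive"):
        raise ValueError(f"unexpected initPolicy: {init_policy}")
    return {
        "apiVersion": "vertica.com/v1beta1",
        "kind": "VerticaDB",
        "metadata": {"name": name},
        "spec": {
            # "Revive" reattaches to an existing database in communal storage.
            "initPolicy": init_policy,
            "communal": {"path": communal_path},
            "subclusters": [{"name": "defaultsubcluster", "size": 3}],
        },
    }


manifest = vdb_manifest("vertica-sample", "s3://<bucket-name>/<key-name>", "Revive")
print(json.dumps(manifest, indent=2))
```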
How can I monitor or export logs from Vertica-K8s nodes?

The Vertica pod allows a sidecar, a utility container that can access and perform utility tasks for the Vertica server process. For example, logging is a common utility task. Idiomatic Kubernetes practices retrieve logs from stdout and stderr on the host node for log aggregation. To facilitate this practice, Vertica offers the vlogger sidecar image that sends the contents of vertica.log to stdout on the host node. To know more about how to use vlogger, see Adding a Sidecar Container in Creating a Custom Resource.
Can I use a bare-metal K8s cluster with a cloud S3-compatible endpoint?

Though you can use such a configuration, it is not advisable. Having your compute on-prem and your communal storage in the cloud is bound to have serious performance implications, especially if the depot isn't large enough and data is fetched from the communal endpoint repeatedly.
How can I upgrade my Vertica-k8s cluster?
- The operator automates Vertica server version upgrades for a custom resource. For more information on upgrades, see Upgrade Vertica with K8s Operator.
- You can use the upgradePolicy setting in the CR to determine whether your cluster remains online or is taken offline during the version upgrade. In online mode, the cluster continues to operate during the upgrade, but the data is in read-only mode while the operator upgrades the image for the primary subcluster.
- Ensure the upgrade path is incremental. You cannot skip versions.

What health/readiness probes does Vertica-k8s use?

Kubernetes uses health probes to monitor an application so that traffic is routed to a pod only when it is ready, and so that the pod is rescheduled if it becomes unhealthy. K8s provides several kinds of health probes. We intend to use only the readiness probe to poll for Vertica readiness. It determines when the pod is considered to be in a Ready state. A pod in the Ready state is included in the service object and has traffic routed to it. We consider a pod Ready if the Vertica daemon is running and accepting connections.
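
The "accepting connections" part of that condition can be approximated from any machine that can reach the pod with a plain TCP check on the Vertica client port. This is only a sketch of the idea and is not the probe the operator actually configures. The helper name `accepts_connections` is hypothetical.

```python
import socket


def accepts_connections(host, port=5433, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    print(accepts_connections("127.0.0.1"))
```

A successful TCP connection shows only that something is listening on the port. It does not prove that the database itself is up.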

We also have a few processes to monitor in our container:

sshd: This is the ssh daemon that accepts ssh connections on port 22.

The following describes how it is checked:

- sshd: checked by container exit. The container entrypoint will return.

If the container entrypoint returns, the pod will be deleted. It is then rescheduled by the StatefulSet controller. The Vertica daemon process is restarted by the operator after the pod is rescheduled.

My Vertica-k8s cluster is successfully installed. How do I enter vsql prompt?

Once the cluster pods are in "Running" state, you can jump into the bash prompt of any one pod of your subcluster with a:
```
kubectl exec -it <pod_name> -n <namespace_name> --  bash
```

Once in the pod, you can then proceed to treat it like any Vertica node and run vsql:

[dbadmin@vertica-sample-defaultsubcluster-0 /]$ vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.
 
Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit
 
dbadmin=>

You can go directly to vsql without first starting bash, such as:

kubectl exec -it <pod_name> -n <namespace_name>  -- vsql

What are the recommended resources for my Worker Nodes for Vertica-k8s? Can I run all the pods on a single node?
- Before you proceed with the StatefulSet deployment, note that the resource recommendations per pod are the same as what we recommend per node for any Vertica DB: Vertica Node and Cluster sizing.
- You can however have over-provisioned worker nodes, and then assign the resources per pod in your yaml configuration. This allows you flexibility to set hard and soft limits under heavy workloads.
- Though you could run all pods on a single worker node, we recommend you use podAntiAffinity spec to configure your deployment as One Pod per Node. This is because resource contention between multiple pods running on the same node will be detrimental to the performance. For more information on how to set up one pod per node, see Node Affinity in Creating a Custom Resource.
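
A one-pod-per-node rule is expressed in Kubernetes as a required podAntiAffinity term keyed on the node hostname. The sketch below builds that stanza as JSON. The label key and value (`app.kubernetes.io/name: vertica`) are assumptions, so check the labels on your own pods. Where the stanza attaches in the VerticaDB custom resource should be confirmed in Creating a Custom Resource.

```python
import json


def one_pod_per_node_affinity(label_key, label_value):
    """Build a podAntiAffinity stanza that keeps matching pods on separate nodes."""
    return {
        "podAntiAffinity": {
            # "required..." makes this a hard scheduling rule, not a preference.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": label_key, "operator": "In", "values": [label_value]}
                        ]
                    },
                    # One matching pod per distinct node hostname.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# Assumed label; replace with the label your Vertica pods actually carry.
affinity = one_pod_per_node_affinity("app.kubernetes.io/name", "vertica")
print(json.dumps(affinity, indent=2))
```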
How can I check logs and events for my Vertica custom resource deployments?

You can check the configuration and related events of a Vertica custom resource using the following command:
```
[root@master ~]# kubectl describe vdb <custom_resource_name>
Name:         vertica-sample
Namespace:    default
Labels:       <none>
Annotations:  vertica.com/buildDate: Wed Oct 13 20:01:44 2021
              vertica.com/buildRef: ea6f825da94b647040285340d091dc780c39d0e2
              vertica.com/version: v11.0.1-0
API Version:  vertica.com/v1beta1
Kind:         VerticaDB
.
.
.
.
.
.
.
 Up Node Count:    6
Events:             <none>
```
From 11.0SP2 onwards, the Vertica pod allows a sidecar, a utility container that can access and perform utility tasks for the Vertica server process. For example, logging is a common utility task. Idiomatic Kubernetes practices retrieve logs from stdout and stderr on the host node for log aggregation. To facilitate this practice, Vertica offers the vlogger sidecar image that sends the contents of vertica.log to stdout on the host node.

You may also exec into the pods and check vertica.log and admintools.log. For operator-specific logs, you can use:
```
[root@master ~]# kubectl logs <Operator-pod-name> -c manager
```
How many subclusters can I create? How many pods per subcluster? Is there a limit?

Though it is not a hard limit, we recommend that you stay within the confines of the K8s node/pod limits: https://kubernetes.io/docs/setup/best-practices/cluster-large/.
What if I reduce the pods from 3 to 2? Does this affect the K-safety?

The Operator has an encapsulated admission-controller that ensures that the kSafety value cannot change after the initial creation of the VerticaDB. If unset, at the time of creation, it defaults to 1, regardless of the number of pods. If you try to create a single pod with kSafety of 1, the webhook rejects the request.

We do not strictly prevent kSafety from changing. We only prevent changes to it in the custom resource (CR); you can still go into vsql and change it. We recommend that any production cluster set it to 1. Test environments can get away with 0 but are limited to 3 nodes.
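
The rules above can be sketched as a pre-flight check to run before editing the pod count. It combines the 3-node limit for kSafety 0 stated here with the general Vertica requirement that K-safety 1 needs at least three nodes. It illustrates the rules only and is not the admission webhook's actual logic. The function name `ksafety_problem` is hypothetical.

```python
def ksafety_problem(k_safety, total_pods):
    """Return a message describing the problem, or None if the combination is fine."""
    if k_safety not in (0, 1):
        return "kSafety must be 0 or 1"
    if k_safety == 1 and total_pods < 3:
        return "kSafety 1 needs at least 3 pods"
    if k_safety == 0 and total_pods > 3:
        return "kSafety 0 is limited to 3 pods"
    return None


# Reducing a kSafety 1 cluster from 3 pods to 2 is flagged.
print(ksafety_problem(1, 3))
print(ksafety_problem(1, 2))
```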
Do I need to set the system configuration to Vertica recommended values on the K8s hosts? Is this taken care of by the containers?

Vertica requires various kernel parameters to be set (For more information, see OS configuration). If these are set on the host machine, the pods themselves do not need to set them. The pod just inherits those settings from the host. This means it does not have to run with a privileged security context. This is good because it is a best practice not to run with host privileges.

The user can set node tolerations/taints, selectors, or affinity to ensure Vertica pods are scheduled on nodes that have the required sysctl settings. The following parameters were taken from the Vertica documentation:

kernel.pid_max

vm.swappiness

vm.max_map_count

vm.min_free_kbytes
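
To see what a host currently has for these parameters, the following sketch reads them from `/proc/sys`, where the kernel exposes `vm.swappiness` as `/proc/sys/vm/swappiness`, and so on. It only reports values. The recommended settings are deliberately not encoded here and should be taken from the Vertica OS configuration documentation. The helper names are hypothetical.

```python
from pathlib import Path

VERTICA_SYSCTLS = [
    "kernel.pid_max",
    "vm.swappiness",
    "vm.max_map_count",
    "vm.min_free_kbytes",
]


def read_sysctl(name, root="/proc/sys"):
    """Read one kernel parameter; dots in the name map to directories."""
    return Path(root).joinpath(*name.split(".")).read_text().strip()


def report(root="/proc/sys"):
    """Return {parameter: value}, with None for parameters that cannot be read."""
    values = {}
    for name in VERTICA_SYSCTLS:
        try:
            values[name] = read_sysctl(name, root)
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name} = {value}")
```

Run it on each K8s worker node, not inside a pod, because the pod inherits these settings from its host.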
What are the licensing limitations in Vertica clusters I created in K8s?

By default, we use the Community Edition (CE) license if no license is provided. The CE license limits the number of pods in a cluster to 3 and the dataset size to 1 TB. Use your own license to extend the cluster past these limits.

To use your own license, add it to a Secret in the same namespace as the operator. The following command copies the license into a Secret named license:
```
kubectl create secret generic license --from-file=license.key=/path/to/license.key
```
Next, specify the name of the Secret in the CR by populating the licenseSecret field:
```
# vertica-crd.yaml
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: vertica-sample
spec:
  licenseSecret: license
  communal:
    path: "s3://<bucket-name>/<key-name>"
    endpoint: http://path/to/endpoint
    credentialSecret: s3-creds  
  subclusters:
    - name: defaultsubcluster
      size: 3
```
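
Before applying a manifest, the Community Edition limits stated above (3 pods, 1 TB) can be checked against the planned subcluster sizes. This is a minimal sketch. The helper name `needs_own_license` and the sample sizes are hypothetical.

```python
CE_MAX_PODS = 3      # CE license: at most 3 pods in the cluster
CE_MAX_DATA_TB = 1   # CE license: at most 1 TB of data


def needs_own_license(subcluster_sizes, data_tb):
    """True if the planned cluster exceeds either Community Edition limit."""
    return sum(subcluster_sizes) > CE_MAX_PODS or data_tb > CE_MAX_DATA_TB


print(needs_own_license([3], 0.5))     # single 3-pod subcluster, 0.5 TB
print(needs_own_license([3, 3], 0.5))  # two subclusters, 6 pods in total
```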

Troubleshooting

Pods do not start up and remain in pending state. How do I debug?

Some typical issues you can look for in such a case are:
- PVCs/PVs are not getting provisioned. Verify that the PVCs are in the "Bound" state with "kubectl get pvc" and "kubectl describe pvc <pvc_name>".
- The communal endpoint is not accessible or the credentials are incorrect. Use "kubectl describe vdb <db_name>" to verify.
- The number of pods is more than what your license supports. By default, the Helm chart uses the free Community Edition license. This license is limited to 3 nodes and 1 TB of data. If you are using more than 3 pods, you need a license configured as a secret.
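
The first check in the list above can be scripted by parsing the output of `kubectl get pvc -o json` and listing every claim whose phase is not "Bound". The sketch below works on the JSON text, so it can be fed from a file or a pipe. The PVC names in the sample are hypothetical, and so is the helper name `unbound_pvcs`.

```python
import json


def unbound_pvcs(pvc_list_json):
    """Return (name, phase) for every PVC whose phase is not Bound."""
    items = json.loads(pvc_list_json).get("items", [])
    result = []
    for item in items:
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            result.append((item["metadata"]["name"], phase))
    return result


# Hypothetical sample shaped like `kubectl get pvc -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "data-pod-0"}, "status": {"phase": "Bound"}},
        {"metadata": {"name": "data-pod-1"}, "status": {"phase": "Pending"}},
    ]
})
print(unbound_pvcs(sample))
```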
My cluster is running, but not in ready state

Typically, deploying a Vertica cluster in K8s can take more than 10 minutes the first time. This may happen as the operator tries to connect to the communal endpoint, add nodes to the DB, and bring up the DB. If the pods don't go into the ready state (1/1) for more than 30 minutes, we recommend that you:
- Check whether your Vertica custom resource failed. Use "kubectl describe vdb <db_name>". Most of the issues detected by the operator or the admission controller are listed under the Events section.
- Verify that you have appropriate image pull credentials for Docker Hub. Docker has added a restriction on the number of images you can pull anonymously.
Note: There might be several permutations and combinations of deploying Vertica in Kubernetes or related platforms. We urge you to use the supported platforms. Our Operator is still under development, and we are working on supporting other platforms soon.

For More Information

Containerized Vertica

How to generate a Custom Resource

Troubleshooting Vertica-K8s Cluster
