How to Run Cassandra and Kubernetes Together?


By admin

Kubernetes lets you run distributed applications across different systems, while Cassandra provides a distributed database that can serve as the data layer for those applications. Developers use Cassandra and Kubernetes together so that data is managed the same way no matter where the applications run. In this post, we are going to focus on how to run Kubernetes and Apache Cassandra together so that you can manage your containerized applications effortlessly.

What is Apache Cassandra?

When you run your Kubernetes applications on a cloud server, you need something that makes deployment easier. Cassandra is such a tool: it is a fault-tolerant database that helps you scale your applications and provides a straightforward data management model. Since Cassandra is a database, it needs persistent storage for its data, and that storage must be provisioned for the containerized applications running on Kubernetes.

Cassandra is a distributed NoSQL DBMS designed to scale horizontally and handle very large amounts of data. It is built to keep working through node failures, which makes it a good fit for applications containerized on Kubernetes. There are a few important things about Cassandra that you should know:

  • Cassandra is written in Java.
  • It consists of nodes, racks, data centers, and clusters.
  • It uses an IP address to identify the nodes.

To make Cassandra easier to run on Kubernetes, the community has built K8ssandra, a distribution of Apache Cassandra prepared for Kubernetes. It reduces latency in application operation and improves scaling as well. However, before running Cassandra on Kubernetes, you need a way to automate the operational knowledge that running Cassandra requires.

One way to do this is with a Kubernetes operator, which automates the deployment process of applications: it encodes domain-specific knowledge and interacts with external systems on your behalf. Cass-operator is such a Kubernetes operator; it supports open-source Kubernetes as well as managed distributions such as Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Pivotal Container Service (PKS).

All you need to do is install cass-operator on your Kubernetes cluster. We will look into that in later sections and see how to install cass-operator and configure it using YAML configuration files on your Kubernetes cluster.

What Would You Need to Migrate Cassandra to Kubernetes?

If you run many K8s clusters, you need a convenient way to maintain all of them, and migrating Cassandra to Kubernetes helps with that. In this section, we are going to focus on what it takes to move Cassandra onto Kubernetes.

1. Data Storage

Cassandra keeps part of its data in RAM, in a structure known as the MemTable, while the rest is saved to disk as SSTables. The disk also keeps a transaction record of writes, called the Commit Log. In Kubernetes, durable data like this is stored through the PersistentVolume mechanism. Cassandra itself ships with default mechanisms for storing and distributing data, including mechanisms designed especially for data replication.

Since Cassandra already replicates data across its nodes and clusters, you do not necessarily need a distributed storage system such as Ceph or GlusterFS. You can store the data on the node's own disk using a hostPath volume, or use local persistent volumes. Distributed storage still makes sense if you want to create multiple environments for developers. And when developers work separately on different application features, a single Cassandra node is often enough.
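As a sketch of the local-volume option, a StorageClass plus a local PersistentVolume pinned to one worker node might look like this (all names, sizes, and paths are illustrative, not from any particular deployment):

```yaml
# Illustrative: a StorageClass with no dynamic provisioner, plus a local
# PersistentVolume bound to a specific node's disk for one Cassandra pod.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage              # hypothetical name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cassandra-data-node1       # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  local:
    path: /mnt/disks/cassandra     # directory on the worker node's disk
  nodeAffinity:                    # required for local volumes: pin to one node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1    # hypothetical node name
```

With WaitForFirstConsumer, the volume is only bound once a pod is scheduled, so the scheduler can place the Cassandra pod on the node that actually owns the disk.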

Once the data is in distributed storage, you know it is safe: if a Kubernetes node fails, the distributed storage still holds the data for you.

2. Monitoring

How do you monitor the performance of, and events inside, Kubernetes clusters? There is a tool for that: Prometheus. The question, then, is how to expose Cassandra's metrics to Prometheus.

To export Cassandra metrics, you can use jmx_exporter or cassandra_exporter. Most people choose the first option because it is easier to use. Either exporter runs as a Java agent; for example, cassandra-exporter is started with a flag like this:

-javaagent:<plugin-dir-name>/cassandra-exporter.jar=--listen=:9180
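Once the exporter is listening on port 9180, Prometheus needs a scrape job pointing at the Cassandra pods. A minimal sketch, assuming the pods are reachable through a headless service named cassandra in the default namespace (the job name and targets are illustrative):

```yaml
# prometheus.yml fragment -- scrape each Cassandra pod's exporter on :9180
scrape_configs:
  - job_name: cassandra            # hypothetical job name
    static_configs:
      - targets:
          - cassandra-0.cassandra.default.svc.cluster.local:9180
          - cassandra-1.cassandra.default.svc.cluster.local:9180
          - cassandra-2.cassandra.default.svc.cluster.local:9180
```

In a real cluster you would more likely use Kubernetes service discovery (`kubernetes_sd_configs`) instead of static targets, so pods are picked up automatically as they come and go.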

3. How to Choose Kubernetes Primitives?

You can map a Cassandra cluster onto Kubernetes resources: Cassandra Node → Pod, Cassandra Rack → StatefulSet, Cassandra Datacenter → pool of StatefulSets. But how do you represent the Cassandra cluster itself? Kubernetes has a mechanism for defining custom resources (CRDs), so you can define a custom resource for the cluster and manage it with a Kubernetes operator and controller.
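To make the rack-to-StatefulSet mapping concrete, here is a minimal outline of one Cassandra rack expressed as a StatefulSet (a sketch only, not a production manifest; the names, replica count, and image tag are assumptions):

```yaml
# One Cassandra rack = one StatefulSet; each replica pod is one Cassandra node.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra-rack1            # hypothetical rack name
spec:
  serviceName: cassandra           # headless service the pods register under
  replicas: 3                      # three Cassandra nodes in this rack
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:3.11    # assumed image tag
          ports:
            - containerPort: 9042  # CQL native transport
            - containerPort: 7000  # inter-node communication
```

A datacenter would then be a pool of such StatefulSets, one per rack, and the cluster as a whole is what the CRD-based custom resource describes.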

4. How to Identify Pods?

Cassandra cluster nodes correspond to Kubernetes pods, and Cassandra identifies its nodes by IP address. Every Kubernetes pod has its own IP address, but pod IPs change when pods are rescheduled, which confuses Cassandra when nodes rejoin the cluster. There are a few things you can do to help Cassandra identify the pods reliably:

  1. You can track the pods with UUIDs or host identifiers. You can also use the IP addresses and save their information by creating a table.
  2. You can create a service for individual cluster nodes using ClusterIP.
  3. You can run Cassandra nodes on the host network instead of a dedicated pod network by setting hostNetwork: true.

These are the 3 different ways to help Cassandra identify pods in Kubernetes, and you can choose any one of them depending on your requirements.
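For example, option 3 above amounts to a small change in the pod template, so each Cassandra pod uses its worker node's IP directly (a sketch; the image tag is an assumption):

```yaml
# Pod template fragment: the pod shares its node's network namespace, so the
# Cassandra node is identified by the stable host IP rather than a pod IP.
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS resolution working
      containers:
        - name: cassandra
          image: cassandra:3.11            # assumed image tag
```

The trade-off is that host networking bypasses pod network isolation and limits you to one Cassandra pod per node, since the Cassandra ports are bound on the host.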

5. Backups

A CronJob can back up the data from Cassandra nodes on a schedule. However, since Cassandra keeps recent writes in memory, you first need to flush the data from MemTables to SSTables. One way is to drain the node (with nodetool drain): the node stops receiving data and becomes unreachable, but you can then take a snapshot that captures the data and saves the schema.

A drain alone is not a complete backup, though. With a Google-made script, you can collect more data files from a node into an archive for Kubernetes. But please note that the script does not flush data on the Cassandra node before taking the snapshot. Here, have a look at an example of such a script for backing up data on a Cassandra node:

set -eu

if [[ -z "${1:-}" ]]; then
  echo "Please provide a keyspace"
  exit 1
fi

KEYSPACE="$1"

result=$(nodetool snapshot "${KEYSPACE}")

if [[ $? -ne 0 ]]; then
  echo "Error while making snapshot"
  exit 1
fi

timestamp=$(echo "$result" | awk '/Snapshot directory: / { print $3 }')

mkdir -p /tmp/backup

for path in $(find "/var/lib/cassandra/data/${KEYSPACE}" -name "$timestamp"); do
  table=$(echo "${path}" | awk -F "[/-]" '{print $7}')
  mkdir -p "/tmp/backup/$table"
  mv "$path" "/tmp/backup/$table"
done

tar -zcf /tmp/backup.tar.gz -C /tmp/backup .

nodetool clearsnapshot "${KEYSPACE}"
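To run a backup like this on a schedule, the script can be wrapped in a CronJob. A minimal sketch, assuming the script is stored in a ConfigMap and mounted into a container that has nodetool available (the names, schedule, image, and script path are all assumptions):

```yaml
# Illustrative CronJob running the backup script nightly.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cassandra-backup                    # hypothetical name
spec:
  schedule: "0 2 * * *"                     # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: cassandra:3.11         # assumed image; provides nodetool
              command: ["/bin/bash", "/scripts/backup.sh", "my_keyspace"]
              volumeMounts:
                - name: backup-script
                  mountPath: /scripts
          volumes:
            - name: backup-script
              configMap:
                name: cassandra-backup-script   # ConfigMap holding the script
```

Note that nodetool must be able to reach the target node's JMX port, and the job needs access to the node's data directory (or it should run as a sidecar on the Cassandra pod itself) for the file-copying part of the script to work.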

How to Set Up the Cass-Operator Definitions?

In case we have not yet covered the different parts of Cassandra, now is the time. Components such as nodes, racks, and data centers play different roles in the system, and they are also known as "definitions". Let's learn about them first, and then we can go on and set up the cass-operator definitions on Kubernetes.

Nodes: A node is a computer system that runs an instance of Cassandra. A node can be a physical host, a machine instance in the cloud, or a Docker container.

Racks: Racks are sets of Cassandra nodes close to one another, typically connected to the same network switch. In cloud deployments, a rack refers to machine instances running in the same zone.

Data center: A combination of racks makes up a data center. The racks should be in the same location and connected to the same network. In cloud deployments, a data center typically corresponds to a cloud region or availability zone.

Clusters: A collection of data centers forms a cluster, and the whole cluster supports the same application. Data centers can be physical or cloud-based, and a single cluster can span both. Clusters can also be distributed across geographic locations to reduce latency.

Now that we know the definitions Cassandra uses, we can use GKE or another Kubernetes engine to connect Cassandra to Kubernetes so that the two run together. Let's follow the steps mentioned below:

Step 1: Apply the Cass-operator YAML Files to Your Cluster

Here, you use the kubectl command-line tool to apply the YAML files. These configuration files apply the cass-operator definitions to the Kubernetes cluster. Such manifests (also called API object descriptions) define the desired state of the resources in your applications. You can go to the cass-operator GitHub page to find the version-specific manifests.

Here is an example of the kubectl command on a GKE cloud running Kubernetes:

[Screenshot: kubectl command for GKE cloud running Kubernetes 1.16]

Step 2: Apply YAML to the Storage Configuration

The next step is to apply a storage configuration YAML file with kubectl. This configuration file defines the storage settings the Cassandra nodes will use. A StorageClass resource acts as a layer between physical storage and persistent volumes in a Kubernetes cluster. See the example below, where we use SSD as the storage type.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: server-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: none
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Step 3: Apply the YAML Files that Define the Datacenter

You can use kubectl again to apply the YAML file that defines the Cassandra Datacenter. Check out the example below:

# Sized to work on 3 k8s workers nodes with 1 core / 4 GB RAM
# See neighboring example-cassdc-full.yaml for docs for each parameter
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "3.11.6"
  managementApiAuth:
    insecure: {}
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: server-storage
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.PasswordAuthenticator
      authorizer: org.apache.cassandra.auth.CassandraAuthorizer
      role_manager: org.apache.cassandra.auth.CassandraRoleManager
    jvm-options:
      initial_heap_size: "800M"
      max_heap_size: "800M"

After this step, you can check the resources you have created in the Google Cloud Console: click the Clusters tab and look at the services that are running. These computing units are deployable, so you can manage them within the Kubernetes cluster.

Other Cassandra Solutions for Kubernetes

You can use a StatefulSet or a Helm chart to deploy Cassandra on Kubernetes. However, you need to figure out which solution best suits your use of Cassandra on Kubernetes. Here are some solutions you can use:

StatefulSet or Helm-Chart-Based Solutions

You can use StatefulSet- or Helm-chart-based solutions to deploy a Cassandra cluster on Kubernetes. This method is common and effective, but the failure of a node during operation can ruin your whole deployment, and standard Kubernetes tools cannot recover from this situation alone. With Helm-chart-based solutions, you also cannot replace nodes, restore data, or monitor the cluster out of the box.

Hence, that leads us to the other choice for running Cassandra on Kubernetes: operators.

Kubernetes Operator

Have a look at some of the available operators apart from the one we used above:

1. Cassandra-operator

Cassandra-operator is written in Java and released under the Apache 2.0 license; it manages Cassandra deployments on Kubernetes. It supports monitoring, cluster management, and data backup. However, it does not support multiple racks within a single data center.

2. Navigator

Navigator implements DB-as-a-Service on Kubernetes, and it is compatible with both Elasticsearch and Cassandra. With Navigator, you can control access to the database via Kubernetes RBAC.

3. CassKop

You can use the CassKop operator to interact with Cassandra nodes through its CassKop plugin. The plugin is Python-based and helps the Cassandra nodes communicate with one another.

Bottom Line

Cassandra makes it easy to scale data for Kubernetes applications. Since so many operators support running Cassandra and Kubernetes together, you should not face much trouble, and this guide will help you run them side by side. If you face any issue while performing the steps, connect with us through the comment box below.
