Kubernetes for Data Engineering¶

Kubernetes orchestrates containerized workloads at scale. For DE, it provides the compute layer when separating storage (S3) and compute, running Spark, Airflow, JupyterHub.

Architecture¶

Control Plane: API Server (central hub), etcd (distributed KV store), Scheduler, Controller Manager

Worker Nodes: kubelet, kube-proxy, container runtime (containerd, CRI-O)

Core Abstractions¶

Abstraction	Use Case
Pod	Smallest deployable unit
Deployment	Production apps with rolling updates
StatefulSet	Stateful apps (databases)
Service	Network exposure (ClusterIP, NodePort, LoadBalancer)
ConfigMap	Application configuration
Secret	Sensitive data
PersistentVolume + PVC	Storage surviving pod restarts

kubectl Cheatsheet¶

kubectl get pods -n namespace
kubectl describe pod <name>
kubectl logs <pod-name>
kubectl apply -f manifest.yaml
kubectl delete -f manifest.yaml

Spark on Kubernetes¶

Production-ready since Spark 3.1 (March 2021).

helm install my-release spark-operator/spark-operator \
  --namespace spark-operator --create-namespace \
  --set webhook.enable=true --version 1.1.11

Always pin Spark Operator version. Latest may have breaking changes.

Cloud-Native Data Architecture¶

Storage Layer: S3 (or compatible)
Compute Layer: Kubernetes (Spark, Presto, Airflow)

Separation enables elastic compute independent of storage.

Key Facts¶

Container filesystem is ephemeral - use PV/PVC or S3 for persistent data
etcd is sensitive to network latency
Always pause clusters after use to save costs
Spark History Server for job log inspection

Gotchas¶

Pin all Helm chart versions for reproducibility
etcd performance degrades with cross-datacenter latency
Pod resource limits must be set to prevent noisy-neighbor issues
K8s cluster version matters - test compatibility with Spark/Airflow versions