As data organizations scale, managing complex computational workflows—Data Pipelines or DAGs (Directed Acyclic Graphs)—becomes a monumental challenge. Apache Airflow has emerged as the industry standard for orchestrating these processes. However, running Airflow on a single virtual machine quickly hits resource limits.
The solution? Kubernetes. By deploying Airflow on a Kubernetes (K8s) cluster, you gain elasticity, fault tolerance, and isolated environments for your tasks. In this guide, we will walk through why this combination is powerful and provide a step-by-step tutorial on setting up Airflow in K8s using Helm.
[WP Block: Image Placeholder]
- Suggested Image: A conceptual diagram showing the Apache Airflow logo on one side and the Kubernetes logo on the other, with arrows indicating integration.
- Alt Text: Apache Airflow and Kubernetes integration diagram.
Why Run Airflow on Kubernetes?
Running Airflow on a traditional setup requires pre-allocating fixed resources (CPU/RAM) for workers. If your workflows are bursty, you waste money during idle times and face bottlenecks during peak times. Kubernetes solves this with on-demand scheduling: compute is allocated when a task needs it and released when it finishes.
- Dynamic Scaling: With the KubernetesExecutor, Airflow spins up a new K8s Pod for every single task that runs. When the task finishes, the Pod terminates. You only pay for the compute you use.
- Isolation: Since every task runs in its own container, you eliminate dependency conflicts. One task needing Python 3.7 won’t conflict with another needing Python 3.10.
- High Availability: Kubernetes automatically restarts failed Airflow components (Scheduler, Webserver), ensuring your orchestration layer remains robust.
- Resource Management: You can precisely define CPU and Memory requests and limits for individual tasks directly within your DAG definition.
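To make that last point concrete, the sketch below builds the kind of per-task resource spec you would attach to a task, for example through the KubernetesExecutor's `pod_override` mechanism or the `container_resources` argument of KubernetesPodOperator. It uses plain dictionaries rather than the real `kubernetes` client models so the shape is visible without any dependencies; the field names mirror a standard K8s container `resources` block, and the helper name is ours, not Airflow's.

```python
# Plain-dict sketch of a per-task Kubernetes resource spec.
# In a real DAG you would express the same nested structure with the
# kubernetes client models (e.g. k8s.V1ResourceRequirements) and pass
# it to the task; the keys and values are identical.

def task_resources(cpu_request, mem_request, cpu_limit, mem_limit):
    """Build a K8s-style resources block for a single Airflow task."""
    return {
        "requests": {"cpu": cpu_request, "memory": mem_request},
        "limits": {"cpu": cpu_limit, "memory": mem_limit},
    }

# A lightweight task gets a small pod; a heavy transform can ask for more.
small = task_resources("250m", "256Mi", "500m", "512Mi")
large = task_resources("2", "4Gi", "4", "8Gi")
```

Because each task carries its own spec, the scheduler can pack small tasks densely while still guaranteeing the big ones their headroom.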
[WP Block: Image Placeholder]
- Suggested Image: An architecture diagram illustrating the KubernetesExecutor workflow: Airflow Scheduler -> K8s API -> K8s Pods (Workers) being created dynamically.
- Alt Text: Airflow KubernetesExecutor architecture diagram.
Prerequisites
Before we begin the setup, ensure you have the following tools installed and configured:
- A Kubernetes Cluster: You can use a local cluster like Minikube or Kind, or a managed cloud service like GKE (Google), EKS (AWS), or AKS (Azure).
- kubectl: The command-line tool for interacting with Kubernetes.
- Helm: The package manager for Kubernetes (think apt or brew for K8s). We will use the official Helm Chart.
Step-by-Step Deployment Guide
Step 1: Set up the Kubernetes Namespace
It is best practice to isolate large applications within their own logical space in the cluster. We will create a namespace called airflow.
Bash
kubectl create namespace airflow
Step 2: Add the Official Airflow Helm Repository
Helm uses repositories to find charts. We need to add the official Apache Airflow repository to our local Helm setup.
Bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
Step 3: Configure your Deployment (values.yaml)
The strength of Helm lies in configuration files, usually named values.yaml. While you can install Airflow with default settings, you almost always need to customize it.
Create a file named override-values.yaml. For this tutorial, we will focus on enabling the KubernetesExecutor and setting up a simple workflow. (Note: In production, you should use an external database like RDS or Cloud SQL, but for this guide, we will use the default PostgreSQL chart included).
YAML
# override-values.yaml

# Use KubernetesExecutor so tasks run in their own pods
executor: "KubernetesExecutor"

# Default admin user for the web UI
webserver:
  defaultUser:
    enabled: true
    username: admin
    password: password  # CHANGE THIS IN PRODUCTION
    role: Admin
    email: admin@example.com

# In production, set enabled to false and use an external DB
postgresql:
  enabled: true
Step 4: Install Airflow via Helm
Now, we run the installation command. We point Helm to our namespace, our custom configuration file, and give the release a name (my-airflow).
Bash
helm install my-airflow apache-airflow/airflow \
--namespace airflow \
-f override-values.yaml
This command may take a few minutes. Helm is creating deployments, services, statefulsets (for the database), and secrets. You can monitor the progress by running:
Bash
kubectl get pods -n airflow --watch
[WP Block: Image Placeholder]
- Suggested Image: A screenshot of a terminal showing the output of kubectl get pods -n airflow, with all components (scheduler, webserver, postgresql) showing “Running” or “Completed” status.
- Alt Text: Kubernetes terminal output showing running Airflow pods.
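If you would rather script the wait than eyeball `--watch`, the check boils down to "is every pod Running or Succeeded?". The sketch below shows that logic in Python: feed it the parsed output of `kubectl get pods -n airflow -o json` and it reports whether the deployment has settled. The sample listing is fabricated for illustration; the JSON shape matches kubectl's real output.

```python
import json

def all_pods_settled(pod_list: dict) -> bool:
    """True when every pod in a `kubectl get pods -o json` listing
    has reached the Running or Succeeded phase."""
    return all(
        item["status"]["phase"] in ("Running", "Succeeded")
        for item in pod_list["items"]
    )

# Example: a fabricated listing shaped like kubectl's JSON output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "my-airflow-scheduler-0"}, "status": {"phase": "Running"}},
  {"metadata": {"name": "my-airflow-webserver-0"}, "status": {"phase": "Pending"}}
]}
""")
print(all_pods_settled(sample))  # prints False: the webserver is still Pending
```

In practice you would pipe the live output in, e.g. `kubectl get pods -n airflow -o json | python check_pods.py`, and loop until the function returns True.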
Step 5: Access the Airflow Web UI
By default, the web server is not accessible from outside the cluster. To access it without setting up complex Ingress rules, we can use port forwarding. Run this command in a separate terminal window:
Bash
kubectl port-forward svc/my-airflow-webserver 8080:8080 -n airflow
Now, open your browser and navigate to http://localhost:8080. Log in with the credentials we defined in override-values.yaml (Username: admin, Password: password), and you will land on the Airflow dashboard.
Writing a Kubernetes-Native DAG
Now that Airflow is running, you need to tell it to run tasks on Kubernetes. While the default operators (like PythonOperator) work out-of-the-box with KubernetesExecutor, the most powerful tool is the KubernetesPodOperator. This allows you to run any Docker image as a task.
Python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from datetime import datetime

with DAG(
    dag_id='example_kubernetes_pod_operator',
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # This task will run a completely separate container
    # using the Alpine Linux image and run a shell command.
    run_container_task = KubernetesPodOperator(
        namespace='airflow',
        image="alpine",
        cmds=["echo"],
        arguments=["Hello from a Kubernetes Pod!"],
        name="airflow-test-pod",
        task_id="run_container_task",
        get_logs=True,  # Fetch logs from the pod back to Airflow
    )
Conclusion
Setting up Apache Airflow on Kubernetes provides the ideal platform for modern, scalable data engineering. While the initial setup requires understanding Helm and K8s concepts, the long-term benefits of resource isolation, dynamic scaling, and reliability are massive.
For production deployments, remember to move beyond this guide by configuring an external PostgreSQL database, setting up persistent storage for DAGs (e.g., Git-Sync or Persistent Volumes), and configuring SSL/TLS for the webserver.

