Mastering Workflow Automation: Deploying Apache Airflow on Kubernetes (A Step-by-Step Guide)

As data organizations scale, managing complex computational workflows, typically modeled as DAGs (Directed Acyclic Graphs), becomes a serious challenge. Apache Airflow has emerged as the industry standard for orchestrating these pipelines. However, running Airflow on a single virtual machine quickly hits resource limits.
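
A DAG is simply a set of tasks plus dependency edges with no cycles, and the scheduler's core job is resolving it into a valid execution order. As a minimal illustration (using Python's standard-library `graphlib`, not Airflow's own API), a three-task extract-transform-load pipeline resolves like this:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Mapping: task -> set of tasks it depends on.
# 'transform' needs 'extract'; 'load' needs 'transform'.
pipeline = {
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'transform', 'load']
```

Airflow does the same resolution at scale, additionally tracking schedules, retries, and task state.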

The solution? Kubernetes. By deploying Airflow on a Kubernetes (K8s) cluster, you gain elasticity, fault tolerance, and isolated environments for your tasks. In this guide, we will walk through why this combination is powerful and provide a step-by-step tutorial on setting up Airflow in K8s using Helm.


[WP Block: Image Placeholder]

  • Suggested Image: A conceptual diagram showing the Apache Airflow logo on one side and the Kubernetes logo on the other, with arrows indicating integration.
  • Alt Text: Apache Airflow and Kubernetes integration diagram.

Why Run Airflow on Kubernetes?

Running Airflow on a traditional setup requires pre-allocating fixed resources (CPU/RAM) for workers. If your workflows are bursty, you waste money during idle times and face bottlenecks during peak times. Kubernetes solves this via cloud-native features.

  • Dynamic Scaling: With the KubernetesExecutor, Airflow spins up a new K8s Pod for every single task that runs. When the task finishes, the Pod terminates. You only pay for the compute you use.
  • Isolation: Since every task runs in its own container, you eliminate dependency conflicts. One task needing Python 3.7 won’t conflict with another needing Python 3.10.
  • High Availability: Kubernetes automatically restarts failed Airflow components (Scheduler, Webserver), ensuring your orchestration layer remains robust.
  • Resource Management: You can precisely define CPU and Memory requests and limits for individual tasks directly within your DAG definition.
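
Those per-task resources are expressed as Kubernetes quantity strings such as `500m` (half a CPU core) or `512Mi`. The dict below sketches the shape of a container's resource section; the `to_millicores` helper is our own illustration for interpreting CPU quantities, not part of any Airflow or Kubernetes API:

```python
# Sketch: the resource shape a task's pod would carry
# (Kubernetes quantity strings for CPU and memory).
task_resources = {
    "requests": {"cpu": "500m", "memory": "512Mi"},
    "limits": {"cpu": "1", "memory": "1Gi"},
}

def to_millicores(cpu: str) -> int:
    """Convert a Kubernetes CPU quantity ('500m' or '1') to millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)

print(to_millicores(task_resources["requests"]["cpu"]))  # 500
print(to_millicores(task_resources["limits"]["cpu"]))    # 1000
```

In a real DAG, a shape like this is typically attached to a task via `executor_config` with a pod override built from the `kubernetes` client's model classes; the exact mechanism varies by Airflow and provider version.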

[WP Block: Image Placeholder]

  • Suggested Image: An architecture diagram illustrating the KubernetesExecutor workflow: Airflow Scheduler -> K8s API -> K8s Pods (Workers) being created dynamically.
  • Alt Text: Airflow KubernetesExecutor architecture diagram.

Prerequisites

Before we begin the setup, ensure you have the following tools installed and configured:

  1. A Kubernetes Cluster: You can use a local cluster like Minikube or Kind, or a managed cloud service like GKE (Google), EKS (AWS), or AKS (Azure).
  2. kubectl: The command-line tool for interacting with Kubernetes.
  3. Helm: The package manager for Kubernetes (think apt or brew for K8s). We will use the official Helm Chart.

Step-by-Step Deployment Guide

Step 1: Set up the Kubernetes Namespace

It is best practice to isolate large applications within their own logical space in the cluster. We will create a namespace called airflow.

Bash

kubectl create namespace airflow

Step 2: Add the Official Airflow Helm Repository

Helm uses repositories to find charts. We need to add the official Apache Airflow repository to our local Helm setup.

Bash

helm repo add apache-airflow https://airflow.apache.org
helm repo update

Step 3: Configure your Deployment (values.yaml)

The strength of Helm lies in configuration files, usually named values.yaml. While you can install Airflow with default settings, you almost always need to customize it.

Create a file named override-values.yaml. For this tutorial, we will focus on enabling the KubernetesExecutor and a simple default setup. (Note: in production, you should use an external database like RDS or Cloud SQL, but for this guide, we will use the PostgreSQL chart bundled with the release.)

YAML

# override-values.yaml

# Use KubernetesExecutor so tasks run in their own pods
executor: "KubernetesExecutor"

# Default admin user for the web UI
webserver:
  defaultUser:
    username: admin
    password: password # CHANGE THIS IN PRODUCTION
    role: Admin
    email: admin@example.com

# In production, set to False and use external DB
postgresql:
  enabled: true
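
For reference, a production override would instead disable the bundled database and point the chart at an external one via its metadata-connection values. A sketch (host and credentials are placeholders; check the values reference for your chart version):

```yaml
# production-values.yaml (sketch only)
postgresql:
  enabled: false            # do not deploy the bundled database

data:
  metadataConnection:
    user: airflow
    pass: your-db-password            # placeholder
    protocol: postgresql
    host: your-db-host.example.com    # placeholder (e.g., RDS/Cloud SQL endpoint)
    port: 5432
    db: airflow
```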

Step 4: Install Airflow via Helm

Now, we run the installation command. We point Helm to our namespace, our custom configuration file, and give the release a name (my-airflow).

Bash

helm install my-airflow apache-airflow/airflow \
  --namespace airflow \
  -f override-values.yaml

This command may take a few minutes. Helm is creating deployments, services, statefulsets (for the database), and secrets. You can monitor the progress by running:

Bash

kubectl get pods -n airflow --watch

[WP Block: Image Placeholder]

  • Suggested Image: A screenshot of a terminal showing the output of kubectl get pods -n airflow, with all components (scheduler, webserver, postgresql) showing “Running” or “Completed” status.
  • Alt Text: Kubernetes terminal output showing running Airflow pods.

Step 5: Access the Airflow Web UI

By default, the web server is not accessible from outside the cluster. To access it without setting up complex Ingress rules, we can use port forwarding. Run this command in a separate terminal window:

Bash

kubectl port-forward svc/my-airflow-webserver 8080:8080 -n airflow

Now, open your browser and navigate to http://localhost:8080. Log in with the credentials we defined in override-values.yaml (Username: admin, Password: password). You will see the Airflow dashboard populated with example DAGs.


Writing a Kubernetes-Native DAG

Now that Airflow is running, let's write a DAG that takes full advantage of Kubernetes. While the default operators (like PythonOperator) work out of the box with the KubernetesExecutor, the most flexible tool is the KubernetesPodOperator, which lets you run any Docker image as a task.

Python

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from datetime import datetime

with DAG(
    dag_id='example_kubernetes_pod_operator',
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:

    # This task will run a completely separate container
    # using the Alpine Linux image and run a shell command.
    run_container_task = KubernetesPodOperator(
        namespace='airflow',
        image="alpine",
        cmds=["echo"],
        arguments=["Hello from a Kubernetes Pod!"],
        name="airflow-test-pod",
        task_id="run_container_task",
        get_logs=True, # Fetch logs from the pod back to Airflow
    )

Conclusion

Setting up Apache Airflow on Kubernetes provides the ideal platform for modern, scalable data engineering. While the initial setup requires understanding Helm and K8s concepts, the long-term benefits of resource isolation, dynamic scaling, and reliability are massive.

For production deployments, remember to move beyond this guide by configuring an external PostgreSQL database, choosing a mechanism for distributing your DAG files (e.g., Git-Sync or Persistent Volumes), and configuring SSL/TLS for the webserver.
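
On the DAG-distribution point, the official chart ships first-class git-sync support. A minimal sketch (the repository URL is a placeholder):

```yaml
dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git   # placeholder
    branch: main
    subPath: dags        # folder inside the repo containing DAG files
```

With this enabled, a sidecar continuously pulls your repository into the Airflow pods, so deploying a new DAG is just a git push.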
