Airflow in Kubernetes with GitOps and CI/CD for Big Data ETL

Deploying Airflow on Kubernetes with GitSync (the git-sync sidecar) keeps the DAG definitions in a running Airflow deployment automatically synchronized with a Git repository. Here's a step-by-step guide:

Set Up Kubernetes Cluster: Deploy and configure a Kubernetes cluster where you'll run Airflow. You can use managed Kubernetes services like Amazon EKS, Google Kubernetes Engine, or self-managed solutions like kops or kubeadm.

Install Helm: Helm is a package manager for Kubernetes that simplifies deploying and managing applications. Install the Helm client on the machine you use to manage the cluster; Helm 3 has no server-side component to install in the cluster itself.

Install Airflow Helm Chart: Use the official Airflow Helm chart to deploy Airflow in Kubernetes. Add the Airflow Helm repository and install the chart with the desired configuration options. You can customize the Airflow deployment settings such as the number of scheduler and worker pods, storage options, and external database connection settings.

helm repo add apache-airflow https://airflow.apache.org
helm install my-airflow apache-airflow/airflow
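
As a rough sketch of the customization mentioned above, the chart's defaults can be dumped for review and individual values overridden with `--set` at install time. The keys `scheduler.replicas` and `workers.replicas` and the replica counts are assumptions based on the official chart's values file; check them against the output of `helm show values` for your chart version.

```shell
# Release name from the steps above; replica counts are illustrative assumptions.
RELEASE="my-airflow"
SCHEDULER_REPLICAS=2
WORKER_REPLICAS=3

# Refresh the local chart index and save the chart's default values for review.
dump_default_values() {
  helm repo update
  helm show values apache-airflow/airflow > default-values.yaml
}

# Install the chart, overriding the scheduler and worker replica counts.
# Use this in place of the plain `helm install` above, not in addition to it.
install_airflow() {
  helm install "$RELEASE" apache-airflow/airflow \
    --set "scheduler.replicas=${SCHEDULER_REPLICAS}" \
    --set "workers.replicas=${WORKER_REPLICAS}"
}
```

Run `dump_default_values` first to see which keys exist, then call `install_airflow` once the overrides look right.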

Configure GitSync: GitSync is a sidecar container that runs alongside Airflow pods such as the scheduler and synchronizes DAG definitions from a Git repository into the Airflow DAGs folder. In the official chart its settings live under dags.gitSync, and the sidecar image is set under images.gitSync (optional, since the chart ships a default image). The chart points the DAGs folder at the synced checkout itself, so no AIRFLOW__CORE__DAGS_FOLDER override is needed. Set these options in the Helm values file (values.yaml) or with --set command-line arguments.

dags:
  gitSync:
    enabled: true
    repo: "https://github.com/your-org/airflow-dags.git"
    branch: main
    subPath: ""   # assumes DAG files live at the repository root
images:
  gitSync:
    repository: registry.k8s.io/git-sync/git-sync
    tag: v3.2.0

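Deploy Airflow with GitSync: Apply the GitSync settings by passing the values file to Helm. Because the release my-airflow was already created in the earlier install step, a second helm install with the same name would fail; use helm upgrade instead. The --install flag also covers the case where the release does not exist yet.

helm upgrade --install my-airflow apache-airflow/airflow -f values.yaml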
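
To confirm that the values file actually reached the release, the stored values can be read back from Helm. The snippet below is a minimal sketch that assumes the release name `my-airflow` used above and simply looks for a `gitSync` key in the output of `helm get values`.

```shell
# Release name from the steps above.
RELEASE="my-airflow"

# Show the release status (revision, deploy time, chart notes).
show_release_status() {
  helm status "$RELEASE"
}

# Succeed only if the user-supplied values stored with the release
# contain a gitSync section.
git_sync_values_applied() {
  helm get values "$RELEASE" | grep -q "gitSync"
}
```

For example, `git_sync_values_applied && echo "GitSync values present"` gives a quick yes/no check after each upgrade.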

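Verify Deployment: Check that the Airflow pods are running in your Kubernetes cluster. GitSync runs as a sidecar container inside Airflow pods such as the scheduler rather than as a separate pod, so it does not appear as its own entry in the list. Use kubectl get pods to list all pods in the namespace where Airflow is deployed.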

kubectl get pods
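
Since the pod list alone does not show which containers a pod holds, one way to check for the GitSync sidecar is to print the container names of the scheduler pods. This is a sketch under assumptions: the `component=scheduler` and `release=<name>` labels and the `git-sync` container name reflect the official chart's defaults and may differ in your deployment.

```shell
# Release name from the steps above.
RELEASE="my-airflow"

# Print each scheduler pod followed by the names of its containers.
scheduler_containers() {
  kubectl get pods -l "component=scheduler,release=${RELEASE}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
}

# Succeed only if some scheduler pod lists a git-sync container.
has_git_sync() {
  scheduler_containers | grep -q "git-sync"
}
```

Calling `scheduler_containers` prints one line per pod; `has_git_sync` turns the same information into an exit status that a CI job can act on.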

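Access Airflow UI: Access the Airflow web UI to view and manage your DAGs. The chart exposes the Airflow web server as a Kubernetes service. With the official chart's defaults this service is of type ClusterIP, so it has no external IP address unless you change the service type or add an Ingress. Inspect the service to see how it is exposed, then reach the web UI through its external address or a port-forward.

kubectl get svc my-airflow-webserver -o wide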
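
When the service has no external address, a port-forward is a simple way to reach the UI from a workstation. The sketch below assumes the service is named `my-airflow-webserver` and listens on port 8080, which matches the official chart's defaults for Airflow 2.x deployments; other chart versions may name the service differently.

```shell
# Release name from the steps above; the local port is an arbitrary choice.
RELEASE="my-airflow"
LOCAL_PORT=8080

# Forward localhost:LOCAL_PORT to port 8080 of the webserver service.
# The command blocks until interrupted with Ctrl+C.
open_airflow_ui() {
  kubectl port-forward "svc/${RELEASE}-webserver" "${LOCAL_PORT}:8080"
}
```

While `open_airflow_ui` is running, the web UI should be reachable at http://localhost:8080.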

Manage DAGs in Git Repository: Store your Airflow DAG definitions in a Git repository. GitSync will automatically synchronize the DAGs from the repository to the Airflow DAGs folder defined in the configuration.

Update DAGs: Make changes to your DAGs in the Git repository. GitSync will automatically pull the changes and update the DAGs in the Airflow deployment. You can monitor the synchronization process in the GitSync logs.
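
The update loop described here can be sketched as two small helpers: one that pushes a DAG change to the tracked branch and one that streams the GitSync logs. The file path in the usage note is hypothetical, and the pod labels and the `git-sync` container name are assumptions based on the official chart's defaults.

```shell
# Release name and tracked branch from the steps above.
RELEASE="my-airflow"
BRANCH="main"

# Commit one DAG file and push it to the tracked branch.
# Run from inside a clone of the DAG repository.
publish_dag() {
  dag_file="$1"
  message="$2"
  git add "$dag_file" &&
    git commit -m "$message" &&
    git push origin "$BRANCH"
}

# Stream the GitSync sidecar logs from the scheduler pods.
follow_git_sync_logs() {
  kubectl logs -l "component=scheduler,release=${RELEASE}" \
    -c git-sync --tail=20 -f
}
```

A typical round trip would be `publish_dag dags/etl_daily.py "Add daily ETL DAG"` followed by `follow_git_sync_logs` to watch the new commit being picked up.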

Monitor and Troubleshoot: Monitor the Airflow deployment and GitSync synchronization process using Kubernetes logs and monitoring tools. Troubleshoot any issues with DAG synchronization or Airflow operation by inspecting pod logs and cluster resources.
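
A first troubleshooting pass usually combines pod descriptions, recent events, and container logs. The helper below is only a sketch: it assumes the scheduler pods carry the `component=scheduler` and `release=<name>` labels and that the main container is named `scheduler`, as in the official chart's defaults.

```shell
# Release name from the steps above.
RELEASE="my-airflow"
SELECTOR="component=scheduler,release=${RELEASE}"

# Collect first-pass diagnostics for the scheduler pods.
scheduler_diagnostics() {
  # Pod status, restart counts, and pod-level events.
  kubectl describe pods -l "$SELECTOR"
  # Namespace events sorted by time, to spot scheduling or image-pull failures.
  kubectl get events --sort-by=.lastTimestamp
  # Recent output from the Airflow scheduler container itself.
  kubectl logs -l "$SELECTOR" -c scheduler --tail=100
}
```

Redirecting the output, for example `scheduler_diagnostics > scheduler-debug.txt`, keeps a snapshot that can be attached to a bug report.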

By following these steps, you can deploy Airflow in Kubernetes with GitSync to automate the synchronization of DAG definitions from a Git repository to your Airflow deployment, enabling efficient management of ETL workflows in a containerized environment.