# Airflow Setup for Data and Content Delivery
This project sets up Apache Airflow to manage data and content delivery workflows. It includes configurations for various providers and connections to external services.
## Requirements
- Apache Airflow
- Python 3.12 or higher
- Required Python packages listed in `requirements.txt`
## Pre-requisites
To start from scratch, follow these steps:
- Install `uv` and `pip` if not already installed:

  ```shell
  pip install uv pip
  ```

- Create the virtual environment:

  ```shell
  uv venv .venv
  ```

- Activate the virtual environment:
  - On macOS/Linux:

    ```shell
    source .venv/bin/activate
    ```

  - On Windows:

    ```shell
    .venv\Scripts\activate
    ```

- Install the required packages:

  ```shell
  uv sync --locked
  ```
NOTE: You can also use `mise` to install the required tools and packages:

```shell
brew install mise
mise install
```
## Configuration for local development
Create a `.env` file in the root directory of the project. You can use the provided `.env-example` as a template. This file should contain your Airflow configuration settings, including database connections and other environment variables.
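As an illustration, a minimal `.env` could look like the sketch below. The two `USE_LOCAL_*` flags are the ones used by the Docker setup in this project; `AIRFLOW__CORE__LOAD_EXAMPLES` is a standard Airflow environment variable shown here only as an example — check `.env-example` for the actual settings this project expects.

```shell
# Flags used by the local Docker DB setup (see the Docker section)
USE_LOCAL_INTERNAL_DOCKER_DB=true
USE_LOCAL_PRODUCT_DOCKER_DB=true

# Standard Airflow setting, shown as an illustrative example only
AIRFLOW__CORE__LOAD_EXAMPLES=false
```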
## Play around with Jupyter Notebooks
You can explore the Jupyter notebooks in the `notebooks` directory to understand how to use Airflow with various data sources and sinks. You can run them in your local environment or on a Jupyter server.
To run the notebooks, start Jupyter Lab:

```shell
uv run --with jupyter jupyter lab
```
## Run tests

```shell
uv run pytest --cov=dags --cov-report html:cov_html
```
## Running Airflow in Docker
To run Airflow in Docker, you can use the provided `docker-compose.yml` file. It sets up the necessary services, including the Airflow web server, scheduler, and database.
To start Airflow in Docker, build the Docker image first:

```shell
docker build -t content-delivery-airflow:latest -f Dockerfile.local .
```

Then start the services:

```shell
docker compose up
```
You can use local DBs for your tests by ensuring that the environment variables in your `.env` file are configured correctly. To do that, you may want to export the product DB and import it locally:
```shell
# set these variables in your .env file to use local DBs
# USE_LOCAL_INTERNAL_DOCKER_DB=true
# USE_LOCAL_PRODUCT_DOCKER_DB=true

# dump the product DB from AWS RDS manually (you will be prompted for the password),
# or use the script dump-product-db.sh
pg_dump -U read_only_user -h product-data-db.cisims9lnzeb.eu-central-1.rds.amazonaws.com -d product_data_db > product-data-db.sql

# if it does not exist yet, create the database in the local DB
docker exec -it content-delivery-airflow-internal-relational-db-1 /bin/bash -c "PGPASSWORD=postgres psql --username postgres"
CREATE DATABASE "product-data-db";

# import the dump into the local DB
docker exec -i content-delivery-airflow-internal-relational-db-1 /bin/bash -c "PGPASSWORD=postgres psql --username postgres product-data-db" < product-data-db.sql
```
## Use the bash script for local setup
You can use the provided `local-setup.sh` script to automate the setup process for local development. This script will create a virtual environment, install the required packages, and set up the Airflow configuration.

```shell
./local-setup.sh
```
## Useful Commands
- To export uv dependencies into a requirements file:

  ```shell
  uv export --no-hashes --no-config --format requirements-txt > requirements.txt
  ```

  NOTE: Remove the `-e .` line from the generated `requirements.txt` file if you want to use it with pip directly.
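One way to strip that line automatically is a small `sed` invocation. A sketch, using a throwaway demo file with made-up package pins so it is self-contained (`-i.bak` keeps a backup and works with both GNU and BSD sed):

```shell
# Create a demo requirements file (package pins are hypothetical)
printf '%s\n' '-e .' 'pydantic==2.7.0' 'requests==2.32.0' > /tmp/requirements-demo.txt

# Delete the line that is exactly "-e ." ; apply the same command to your
# generated requirements.txt instead of the demo file
sed -i.bak '/^-e \.$/d' /tmp/requirements-demo.txt

cat /tmp/requirements-demo.txt
```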
## Unified metadata
The models for the unified metadata are based on the unified-metadata-schema. A shell script generates Pydantic code from the latest schema in the unified-metadata repository: it clones the repository, generates a JSON Schema, and then generates the Pydantic models.

```shell
./dags/content_delivery_airflow/unified_metadata/update-and-generate-models.sh
```
## Command line utility
You can execute commands using the Airflow REST API, either by calling it directly or by using the `mwaa` util.

To trigger the smoke test DAG, run:
```shell
aws sso login --profile dcd-dev
uv run mwaa smoketest
```
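For the direct route, a call to the stable Airflow 2.x REST API endpoint `/api/v1/dags/{dag_id}/dagRuns` could look like the sketch below. The host, port, credentials, and DAG id are all placeholder assumptions, not values from this repository — adjust them to your deployment.

```shell
# All values below are placeholder assumptions; adjust to your deployment.
AIRFLOW_HOST="http://localhost:8080"   # assumption: default local webserver port
DAG_ID="smoke_test"                    # hypothetical DAG id

# Trigger a DAG run via the stable REST API (requires API auth to be configured):
#   curl -X POST "${AIRFLOW_HOST}/api/v1/dags/${DAG_ID}/dagRuns" \
#     -H "Content-Type: application/json" \
#     --user "airflow:airflow" \
#     -d '{"conf": {}}'
echo "POST ${AIRFLOW_HOST}/api/v1/dags/${DAG_ID}/dagRuns"
```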