# Airflow Setup for Data and Content Delivery
This project sets up Apache Airflow to manage data and content delivery workflows. It includes configurations for various providers and connections to external services.
## Requirements
- Apache Airflow
- Python 3.12 or higher
- Required Python packages listed in `requirements.txt`
## Pre-requisites
To start from scratch, follow these steps:
- Install `uv` and `pip` if not already installed:

  ```shell
  pip install uv pip
  ```

- Create the virtual environment:

  ```shell
  uv venv .venv
  ```

- Activate the virtual environment:
  - On macOS/Linux:

    ```shell
    source .venv/bin/activate
    ```

  - On Windows:

    ```shell
    .venv\Scripts\activate
    ```

- Install the required packages:

  ```shell
  uv sync --locked
  ```
NOTE: You can also use `mise` to install the required tools and packages:

```shell
brew install mise
mise install
```
## Configuration for local development
Create a `.env` file in the root directory of the project. You can use the provided `.env-example` as a template. This file should contain your Airflow configuration settings, including database connections and other environment variables.
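As an illustration, a minimal `.env` could look like the sketch below. The two `USE_LOCAL_*` flags are the ones used by the Docker setup in this project; `AIRFLOW__CORE__LOAD_EXAMPLES` is a standard Airflow environment variable shown here only as an example — check `.env-example` for the actual settings this project expects.

```shell
# Flags used by the local Docker DB setup (see the Docker section)
USE_LOCAL_INTERNAL_DOCKER_DB=true
USE_LOCAL_PRODUCT_DOCKER_DB=true

# Standard Airflow setting, shown as an illustrative example only
AIRFLOW__CORE__LOAD_EXAMPLES=false
```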
## Play around with Jupyter Notebooks
You can explore the Jupyter notebooks in the `notebooks` directory to understand how to use Airflow with various data sources and sinks. You can run them in your local environment or on a Jupyter server.
To run the notebooks, start Jupyter Lab:

```shell
uv run --with jupyter jupyter lab
```
## Run tests

```shell
uv run pytest --cov=dags --cov-report html:cov_html
```
## Running Airflow in Docker
To run Airflow in Docker, you can use the provided `docker-compose.yml` file. It sets up the necessary services, including the Airflow web server, scheduler, and database.
To start Airflow in Docker, build the Docker image first:

```shell
docker build -t content-delivery-airflow:latest -f Dockerfile.local .
```

Then start the services:

```shell
docker compose up
```
You can use local DBs for your tests by ensuring that the environment variables in your `.env` file are configured correctly. To do that, you may want to export the product DB and import it locally:
```shell
# set these variables in your .env file to use local DBs
# USE_LOCAL_INTERNAL_DOCKER_DB=true
# USE_LOCAL_PRODUCT_DOCKER_DB=true

# dump the product DB from AWS RDS manually (you will be prompted for the password),
# or use the script dump-product-db.sh
pg_dump -U read_only_user -h product-data-db.cisims9lnzeb.eu-central-1.rds.amazonaws.com -d product_data_db > product-data-db.sql

# if it does not exist yet, create the database in the local DB
docker exec -it content-delivery-airflow-internal-relational-db-1 /bin/bash -c "PGPASSWORD=postgres psql --username postgres"
CREATE DATABASE "product-data-db";

# import the dump into the local DB
docker exec -i content-delivery-airflow-internal-relational-db-1 /bin/bash -c "PGPASSWORD=postgres psql --username postgres product-data-db" < product-data-db.sql
```
## Use the bash script for local setup
You can use the provided `local-setup.sh` script to automate the setup process for local development. This script will create a virtual environment, install the required packages, and set up the Airflow configuration.

```shell
./local-setup.sh
```
## Useful Commands
- To export uv dependencies into a requirements file:

  ```shell
  uv export --no-hashes --no-config --format requirements-txt > requirements.txt
  ```

  NOTE: Remove the `-e .` line from the generated `requirements.txt` file if you want to use it with pip directly.
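One way to strip that line automatically is a small `sed` invocation. A sketch, using a throwaway demo file with made-up package pins so it is self-contained (`-i.bak` keeps a backup and works with both GNU and BSD sed):

```shell
# Create a demo requirements file (package pins are hypothetical)
printf '%s\n' '-e .' 'pydantic==2.7.0' 'requests==2.32.0' > /tmp/requirements-demo.txt

# Delete the line that is exactly "-e ." ; apply the same command to your
# generated requirements.txt instead of the demo file
sed -i.bak '/^-e \.$/d' /tmp/requirements-demo.txt

cat /tmp/requirements-demo.txt
```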
## Unified metadata
The models for the unified metadata are based on the unified-metadata-schema. A shell script generates Pydantic code from the latest schema in the unified-metadata repository: it clones the repository, generates a JSON Schema, and then generates the Pydantic models.

```shell
./dags/content_delivery_airflow/unified_metadata/update-and-generate-models.sh
```
## Command line utility
You can execute commands using the Airflow REST API, either by calling it directly or by using the `mwaa` util.

To trigger the smoke test DAG, run:
```shell
aws sso login --profile dcd-dev
uv run mwaa smoketest
```
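For the direct route, a call to the stable Airflow 2.x REST API endpoint `/api/v1/dags/{dag_id}/dagRuns` could look like the sketch below. The host, port, credentials, and DAG id are all placeholder assumptions, not values from this repository — adjust them to your deployment.

```shell
# All values below are placeholder assumptions; adjust to your deployment.
AIRFLOW_HOST="http://localhost:8080"   # assumption: default local webserver port
DAG_ID="smoke_test"                    # hypothetical DAG id

# Trigger a DAG run via the stable REST API (requires API auth to be configured):
#   curl -X POST "${AIRFLOW_HOST}/api/v1/dags/${DAG_ID}/dagRuns" \
#     -H "Content-Type: application/json" \
#     --user "airflow:airflow" \
#     -d '{"conf": {}}'
echo "POST ${AIRFLOW_HOST}/api/v1/dags/${DAG_ID}/dagRuns"
```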