Data and Content Delivery Platform
In this repository you will find the core components of our data and content delivery platform.
NOTE: This documentation should be considered deprecated. Components like the API and MWAA will be phased out in their current form.
The platform is capable of retrieving data from multiple sources (databases, SNS/SQS messages, S3 buckets and other input channels) in different formats, transforming it into a format which can be used by multiple clients, and then exposing the data with a reliable contract to different consumers.
We expose data to the following teams:
- Connect / Public API
- Search
- Website / Platform Teams (soon)
- AI Integration cases
The platform acts as a middle layer between the teams which source information and create raw data and the consuming teams which offer end-user-facing solutions. The data and content should be delivered in a way that it can be used across multiple channels, and all data types should follow a comparable structure.
This enables the API team to offer all datasets in the same manner, the website team to build collections and generic UI widgets, and the search team to effectively offer search results.
The platform consists of multiple components; you can find technical details about the building blocks in the following subsections:
architecture
Technical and organizational constraints
There are a few technological boundaries for our decisions:
- we use systems in AWS
- we stick to existing programming languages at Statista
- we try to use existing frameworks, instead of reinventing the wheel
Context and scope: Business and technical context, external interfaces
Our Platform is a middle layer between the teams which source information and create raw data and the consuming teams. We do not create content and we do not define the usage of the data and content.
We consume:
- Aggregated GCS Data - from Product DB and/or from parquet files in a landing zone S3 bucket which is filled by the Data Production team
- Market Data - from Product DB and/or from parquet files in a landing zone S3 bucket which is filled by the Data Production team
- Statistic data and content from Numera, by listening to SNS updates from the Numera AWS accounts
We offer data:
- API Team - we offer JSON files in S3 buckets available for the API teams
- Search - we call the RFD12 unified metadata ingestion API
- Website Teams - The website teams can call our REST API and listen to messages (SNS) to get notified about updates
- AI-Solutions Team - we replicate our JSON files in the S3 buckets to a destination bucket in the AiSolutions AWS account
As of the date of writing (10.12.2025), the REST API (retrieval API) is in experimental mode, so it can only be used for POCs by platform teams. The integrations will change over time.
Building blocks and runtime views
We are using these core modules:

MWAA Amazon Managed Workflows for Apache Airflow
MWAA is an Amazon-managed version of Airflow. Airflow is an orchestration system that starts jobs, delegates them to worker nodes inside Airflow or to compute outside of Airflow, and handles dependencies between jobs. In these Airflow clusters we execute Python logic.
S3 landing zone buckets
Since synchronization via databases is not stable, we offer a landing zone. In this landing zone the Product Data team places parquet files with the data from the Product DB. We consume these files (V2).
S3 API-bucket
This bucket contains the data we produce for the API team; it is also replicated to the AI-Solutions team.
SNS/SQS
In order to send information to other teams, we decouple the sending logic via messaging.
Lambda and Fargate
There are cases where we want to use compute outside of Airflow. We are using:
- lambdas to call the RFD12 Ingestion API and to consume messages from Numera
- fargate tasks to operate heavy tasks outside of airflow workers
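The Numera-consuming lambda mentioned above can be sketched as a minimal handler. This is only an illustration of the SNS-over-SQS delivery shape; the payload field names (`statistic_id`) and the return value are assumptions, not the real schema:

```python
import json


def handler(event, context):
    """Hypothetical sketch of the Numera-update consumer lambda.

    SQS delivers the SNS envelope as the record body; the inner
    "Message" field carries the actual update payload.
    """
    updated_ids = []
    for record in event.get("Records", []):
        # Unwrap the SNS envelope from the SQS record body.
        envelope = json.loads(record["body"])
        message = json.loads(envelope["Message"])
        updated_ids.append(message["statistic_id"])
        # The real lambda would now write the content to S3 as JSON
        # and register the update in the internal DB.
    return {"updated": updated_ids}
```

In the real setup the handler additionally needs S3 and DB access; this sketch only shows the message unwrapping.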
Internal DB
We do not store our data only as files; we also store structured versions in our internal DB. This is a serverless AWS RDS instance (Postgres-compatible).
Retrieval API
We offer our data as a REST API, called the retrieval API.
You can call the service via REST at 'ENV-api.data-cd.STAGE.aws.statista.com', or for personal environments at 'env-NAME-api.data-cd.STAGE.aws.statista.com'.
It contains an endpoint to retrieve a collection via ID, for example:
- https://api.data-cd.dev.aws.statista.com/v2/collections/543d3086-277c-5af7-e3de-86000e246abf
A detailed definition of the API spec can be found in datadog under APIs, search for 'dcd-retrieval-api-service'.
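Calling the endpoint above can be sketched as follows. The host and path pattern are taken from the example URL in this document; everything else about the API (auth, response shape) is an assumption, so check the 'dcd-retrieval-api-service' spec in Datadog before relying on it:

```python
def collection_url(stage: str, collection_id: str) -> str:
    """Build the retrieval API URL for a collection on a given stage.

    Matches the example endpoint in this document; personal environments
    use a different host prefix (env-NAME-api...).
    """
    return f"https://api.data-cd.{stage}.aws.statista.com/v2/collections/{collection_id}"


# Example usage (not executed here -- requires network access to the VPC):
# import requests
# resp = requests.get(collection_url("dev", "543d3086-277c-5af7-e3de-86000e246abf"))
# resp.raise_for_status()
# collection = resp.json()
```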
Runtime views
The data flows follow these approaches.
XMI
An example update of XMI data follows this approach:
- The data is uploaded by Product Data team as parquet file to landing zone
- A nightly DAG takes the data and stores it in the internal DB
- multiple other DAGs read from the internal DB, generate JSON market mapping and market collection files in S3 for the API team, and write RFD12 metadata files
- another DAG sends the metadata via SNS to a lambda, which then updates the metadata via an API call in the search systems.
XMI data is updated multiple times a week (full load).
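The "generate collection files" step in the flow above can be sketched as a plain Python transform. The field names (`market_id`, `name`) and the output shape are hypothetical; the real logic lives in the Airflow DAGs:

```python
import json


def build_market_collection(market_rows: list) -> str:
    """Hypothetical sketch of the transform step: rows read from the
    internal DB become one JSON collection file for the API team.

    The schema here is illustrative only, not the real contract.
    """
    collection = {
        "type": "market",
        "items": [{"id": row["market_id"], "name": row["name"]} for row in market_rows],
    }
    return json.dumps(collection)
```

A DAG task would call this on the queried rows and upload the resulting string to the S3 API bucket.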
GCS
- The data is uploaded by Product Data team as parquet file to landing zone
- on manual request, the data is stored in the internal DB (quarterly)
- multiple other DAGs read from the internal DB, generate JSON survey collection files in S3 for the API team, and write RFD12 metadata files
- another DAG sends the metadata via SNS to a lambda, which then updates the metadata via an API call in the search systems.
GCS data is updated once a quarter (full load).
Statistics
- A statistic is updated in numera
- Numera sends a message via SNS (statistic data is updated thousands of times a day with single-statistic updates)
- we consume the message from an SQS queue in our account which is subscribed to the Numera SNS
- a lambda saves the content in S3 as JSON and adds an entry in the internal DB that there is an update
- a DAG checks the internal DB for updates every hour, reads the JSON and writes the data as a collection into the internal DB
- multiple other DAGs read from the internal DB, generate JSON statistics collection files in S3 for the API team, and write RFD12 metadata files
- another DAG sends the metadata via SNS to a lambda, which then updates the metadata via an API call in the search systems.
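The hourly check in the flow above can be sketched as a simple filter over the update entries in the internal DB. The column names (`updated_at`) and the in-memory representation are assumptions:

```python
from datetime import datetime


def pending_updates(entries: list, last_run: datetime) -> list:
    """Sketch of the hourly check: pick the update entries recorded in
    the internal DB since the previous DAG run.

    In reality this would be a SQL query; the dict-based entries here
    are illustrative only.
    """
    return [entry for entry in entries if entry["updated_at"] > last_run]
```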
Deployment view: Hardware and technical infrastructure, deployment
We are using workflow management via Amazon Managed Workflows for Apache Airflow (MWAA), which means the whole MWAA infrastructure is a black box for us.
For the deployment we use CDK triggered by GitHub Actions. The DAGs are pulled by MWAA itself from S3 buckets which are part of the configuration. In order to deploy changes in our logic, we upload the DAGs and the corresponding Python code to the S3 bucket via GitHub Actions.
There is an MWAA environment for dev, stage and prod. Within the dev stage we use personal environments to test things, which are deleted automatically.
We use a trunk-based deployment strategy with release branches. On every merge to main, the deployment to DEV is triggered automatically. To deploy to STAGE and PROD, use the following steps:
- Create a release branch with the Create Release Branch workflow. This will create a new release branch with the pattern release-YYYY-MM-DD-timestamp.
- Trigger the deployment with the Deploy from Release Branch workflow. Select the corresponding release branch via the Use workflow from selection and select the environment you want to deploy to. In this step, the retrieval-api-service Docker image is also built, published and used for the deployment.
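The release branch name produced by the Create Release Branch workflow follows the pattern documented above; it can be sketched in shell as follows (the exact timestamp format is an assumption):

```shell
# Hypothetical reconstruction of the release branch naming:
# release-YYYY-MM-DD-timestamp, with a Unix epoch as the timestamp.
BRANCH="release-$(date +%Y-%m-%d)-$(date +%s)"
echo "$BRANCH"
```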
Bugfixes
In case we find a bug on STAGE or PROD, we tend to stick to the trunk-based approach on DEV. This means:
- We first fix and test the bugfix on DEV (main).
- After the fix, we have two options to apply it:
  1. We create a new release from the main branch and do another release using the normal approach.
  2. In the worst case (hotfix), we cherry-pick the commit onto the former release branch and do another deployment from the fixed release branch afterwards. We do not merge the change back to main, since the release branch is a one-time branch.
Backup and recovery
For backup and recovery we use three concepts:
- Infrastructure as code - All infrastructure is defined in CDK, meaning we could re-create the whole infrastructure from scratch by code if needed.
- Re-create data from sources - Our data can be re-generated from the sources (Numera and Data Production), so a loss of all data would only delay the delivery of new updates; we can always re-trigger all jobs.
- Backup and import as self-service - We implemented DAGs which can be triggered manually to create DB dumps, store them in S3, and import them from a dump in S3 into the DB.
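A possible layout for the dump objects written by the backup DAGs can be sketched as a key-building helper. The prefix and naming here are entirely hypothetical; the real layout lives in the DAG code:

```python
from datetime import datetime


def dump_key(db_name: str, when: datetime) -> str:
    """Hypothetical S3 key layout for a DB dump created by the backup
    DAG; the import DAG would resolve the same key to restore it."""
    return f"backups/{db_name}/{when:%Y-%m-%d}.dump"
```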
Check section Backup-and-recovery for details.