
Aggregator

Purpose

The aggregator is a TypeScript tool that discovers every repository with a /docs folder in the GitHub organizations where its GitHub App is installed, then pulls that documentation into a unified structure. It is the first step in the documentation pipeline.

How It Works

The aggregator follows a simple four-step process:

  1. Authentication - Retrieves GitHub App credentials from AWS Secrets Manager
  2. Discovery - Scans all GitHub organizations where the app is installed, finding repositories with /docs folders
  3. Extraction - Uses sparse checkout to clone only the /docs folder from each repository, keeping downloads lightweight
  4. Index Generation - Creates the homepage (docs/index.md) with links to all discovered repositories
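The extraction step can be sketched as a pair of git invocations. The source does not show the exact commands docs-extractor.ts runs, so this is only one plausible way to do a sparse checkout; the helper name `buildSparseCheckoutCommands` and the example URL are hypothetical, and the flags assume Git 2.25 or newer.

```typescript
// One git invocation, expressed as an argument list (no shell quoting needed).
type GitArgs = string[];

// Hypothetical helper, not taken from docs-extractor.ts: builds the commands
// for a shallow, blob-less, sparse clone narrowed to the docs/ directory.
function buildSparseCheckoutCommands(cloneUrl: string, targetDir: string): GitArgs[] {
  return [
    // --depth 1 skips history, --filter=blob:none defers file contents until
    // they are needed, --sparse starts with only top-level files checked out.
    ["clone", "--depth", "1", "--filter=blob:none", "--sparse", cloneUrl, targetDir],
    // Narrow the working tree to docs/. In cone mode, files in the
    // repository root stay checked out as well.
    ["-C", targetDir, "sparse-checkout", "set", "docs"],
  ];
}

const exampleCommands = buildSparseCheckoutCommands(
  "https://github.com/example-org/example-repo.git", // placeholder URL
  "tmp/example-repo",
);
for (const args of exampleCommands) {
  console.log(["git"].concat(args).join(" "));
}
```

Each argument list could be passed to a process runner such as Node's `execFileSync("git", args)`. Returning the arguments instead of executing them keeps the sketch free of side effects.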

Structure

The aggregator is organized into focused modules:

  • aws-client.ts - Fetches GitHub App credentials from Secrets Manager
  • github-client.ts - Authenticates with GitHub App, discovers repositories
  • docs-extractor.ts - Clones /docs folders using sparse checkout
  • index-generator.ts - Generates the site homepage (docs/index.md) from aggregated docs
  • index.ts - Main entry point that orchestrates the pipeline

Index Generation

The index generator (index-generator.ts) creates the site homepage at docs/index.md. This page serves as a catalog of all aggregated documentation.

Discovery Process

  1. Scans the docs/ folder structure to find all aggregated repositories
  2. For each repository, attempts to read metadata from entry files in priority order:
    • index.md (preferred)
    • README.md (falls back to this, becomes index.html)
    • readme.md (last resort, becomes readme.html)
  3. If no entry file is found, generates a placeholder index.md with:
    • A warning message indicating documentation is missing
    • A direct link to the repository's /docs folder on GitHub
    • Instructions for adding a README.md file
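The priority order and the placeholder fallback described above can be sketched as two small pure functions. The function names, the placeholder wording, and the `tree/HEAD/docs` GitHub URL shape are assumptions for illustration; the real index-generator.ts may read the file system directly and word the placeholder differently.

```typescript
// Entry files in the priority order listed above.
const ENTRY_FILE_PRIORITY = ["index.md", "README.md", "readme.md"];

// Returns the highest-priority entry file present in a repo's docs folder,
// or null when none exists. The file list is passed in so the function
// stays free of file system access.
function pickEntryFile(fileNames: string[]): string | null {
  for (const candidate of ENTRY_FILE_PRIORITY) {
    if (fileNames.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return null;
}

// Hypothetical placeholder content: a warning, a link to the repository's
// /docs folder, and a hint to add a README.md. The wording is invented.
function buildPlaceholderIndex(org: string, repo: string): string {
  const docsUrl = `https://github.com/${org}/${repo}/tree/HEAD/docs`;
  return [
    `# ${repo}`,
    "",
    "**Warning:** no entry file was found for this repository.",
    "",
    `Browse the [docs folder on GitHub](${docsUrl}).`,
    "",
    "To replace this page, add a README.md file to the /docs folder.",
    "",
  ].join("\n");
}

const found = pickEntryFile(["guide.md", "readme.md", "README.md"]);
console.log(found);
console.log(buildPlaceholderIndex("PIT-Numera", "compliance-service"));
```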

Special Handling for docs-builder

The statista/docs-builder repository (this repo) is handled differently during aggregation:

  • Other repos: Content goes to docs/{org}/{repo}/
  • docs-builder: Content goes directly to docs/ root

This ensures that meta-documentation such as docs/about/ appears at the top level rather than being buried under docs/statista/docs-builder/about/. The docs-builder repo also does not appear as a card on the index page.
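A minimal sketch of this routing rule follows. The function name `resolveTargetDir`, the default `docs` root argument, and the exact-match comparison are assumptions; the source only states where the content ends up.

```typescript
// The repository that hosts the documentation builder itself.
const DOCS_BUILDER_FULL_NAME = "statista/docs-builder";

// Hypothetical helper: returns the output directory for a repository's docs.
// docs-builder content lands in the docs root, everything else is nested
// under docs/{org}/{repo}.
function resolveTargetDir(org: string, repo: string, docsRoot: string = "docs"): string {
  const fullName = `${org}/${repo}`;
  if (fullName === DOCS_BUILDER_FULL_NAME) {
    return docsRoot;
  }
  return `${docsRoot}/${org}/${repo}`;
}

console.log(resolveTargetDir("statista", "docs-builder"));
console.log(resolveTargetDir("PIT-Numera", "compliance-service"));
```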

Metadata Extraction

The generator reads YAML frontmatter to extract display information:

---
teaser:
  roof-title: "My Service"
  title: "A comprehensive guide to My Service"
---

Extracted Fields:

  • Card Title: teaser.roof-title → title → repository name (fallback)
  • Card Description: teaser.title → "No description available" (fallback)
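The fallback chain can be sketched as below, reading the card title order as `teaser.roof-title`, then a top-level `title`, then the repository name. That reading, the type names, and the choice to treat empty strings as missing are assumptions; the frontmatter is taken as an already-parsed object, so no YAML library is involved.

```typescript
// Shape of the frontmatter fields the generator cares about (all optional).
interface DocFrontmatter {
  title?: string;
  teaser?: {
    "roof-title"?: string;
    title?: string;
  };
}

interface CardInfo {
  title: string;
  description: string;
}

// Hypothetical helper: resolves the display fields for one repository card.
// `||` is used so that empty strings fall through to the next candidate.
function toCardInfo(frontmatter: DocFrontmatter, repoName: string): CardInfo {
  const teaser = frontmatter.teaser || {};
  return {
    title: teaser["roof-title"] || frontmatter.title || repoName,
    description: teaser.title || "No description available",
  };
}

const fullCard = toCardInfo(
  { teaser: { "roof-title": "My Service", title: "A comprehensive guide to My Service" } },
  "my-service",
);
const bareCard = toCardInfo({}, "my-service");
console.log(fullCard, bareCard);
```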

Output Format

Generates a card grid layout grouped by organization:

---
title: Home
---
# Repositories

## PIT-Numera

<div class="grid cards" markdown>

- [**Numera Compliance Service**](PIT-Numera/compliance-service/) <br> Management of compliance requests
- [**Numera Statistic Service**](PIT-Numera/statistic-service/) <br> Storing Statistic content

</div>
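A page like the one above could be assembled roughly as follows. The `RepoCard` type and `renderIndexPage` function are hypothetical, and keeping organizations and cards in input order is an assumption, since the source does not say how they are sorted.

```typescript
interface RepoCard {
  org: string;
  title: string;
  link: string;
  description: string;
}

// Hypothetical renderer: groups cards by organization and emits one
// "grid cards" block per organization, in first-seen order.
function renderIndexPage(cards: RepoCard[]): string {
  const orgOrder: string[] = [];
  const cardsByOrg: { [org: string]: RepoCard[] } = {};
  for (const card of cards) {
    if (!cardsByOrg[card.org]) {
      cardsByOrg[card.org] = [];
      orgOrder.push(card.org);
    }
    cardsByOrg[card.org].push(card);
  }

  const lines: string[] = ["---", "title: Home", "---", "# Repositories", ""];
  for (const org of orgOrder) {
    lines.push(`## ${org}`, "", '<div class="grid cards" markdown>', "");
    for (const card of cardsByOrg[org]) {
      lines.push(`- [**${card.title}**](${card.link}) <br> ${card.description}`);
    }
    lines.push("", "</div>", "");
  }
  return lines.join("\n");
}

console.log(
  renderIndexPage([
    {
      org: "PIT-Numera",
      title: "Numera Compliance Service",
      link: "PIT-Numera/compliance-service/",
      description: "Management of compliance requests",
    },
  ]),
);
```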

Different entry files produce different link formats:

| Entry File | Link Format | Resulting URL |
| --- | --- | --- |
| index.md | {org}/{repo}/ | {org}/{repo}/index.html |
| README.md | {org}/{repo}/ | {org}/{repo}/index.html |
| readme.md | {org}/{repo}/readme.html | {org}/{repo}/readme.html |
| none (placeholder) | {org}/{repo}/ | {org}/{repo}/index.html (generated) |

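These link formats account for the use_directory_urls: false setting in the MkDocs configuration, under which each page is built as its own .html file: index.md and README.md both become index.html, while a lowercase readme.md becomes readme.html and must be linked explicitly.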
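The mapping from entry file to card link reduces to a single special case, sketched here. The function name `cardLink` is hypothetical, and `null` stands for the generated placeholder.

```typescript
// Hypothetical helper: returns the link used on the index card for a repo.
// Only a lowercase readme.md needs an explicit file name, because it is
// built as readme.html instead of index.html.
function cardLink(org: string, repo: string, entryFile: string | null): string {
  const base = `${org}/${repo}/`;
  return entryFile === "readme.md" ? `${base}readme.html` : base;
}

console.log(cardLink("PIT-Numera", "compliance-service", "index.md"));
console.log(cardLink("PIT-Numera", "compliance-service", "readme.md"));
```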

Usage

Run the full aggregation:

cd aggregator
pnpm build
pnpm start

Regenerate only the index (requires docs already aggregated):

pnpm generate-index

The aggregator outputs to the docs/ directory, organized by organization and repository name.