Skip to content

DWH Service

How does it work

DWH will store data in form of CSV files to S3 at least one a night.

tracking-data.csv:

  • as_of_date is the date of calculation starting with YYYY-MM-DD
  • content_id is a mix between content type and id e.g. topic_882
  • original_content_id is the legacy statista 3.5 id
  • content_type topic, statistic,...
  • number_days_online total number of days online as decimal e.g. 4459.0
  • ga_views_since_2024 as integer e.g. 25945 interpreted by us as overall
  • downloads_since_2024 as integer e.g. 5 interpreted by us as overall
  • registration_calls_since_2024 as integer e.g. 315 interpreted by us as overall
  • ga_views_current_year
  • downloads_current_year
  • registration_calls_current_year
  • ga_views_180_d
  • downloads_180_d
  • registration_calls_180_d
  • ga_views_30_d
  • downloads_30_d
  • registration_calls_30_d
  • ga_views_7_d
  • downloads_7_d
  • registration_calls_7_d
  • ga_views_daily_since_2024 an average daily view number as float e.g. 33.47741935483871
  • downloads_daily_since_2024 an average daily view number as float e.g. 0.0
  • registration_calls_daily_since_2024 an average daily view number as float e.g. 0.4064516129032258
  • ga_views_daily_current_year
  • downloads_daily_current_year
  • registration_calls_daily_current_year
  • ga_views_daily_180_d
  • downloads_daily_180_d
  • registration_calls_daily_180_d
  • ga_views_daily_30_d
  • downloads_daily_30_d
  • registration_calls_daily_30_d
  • ga_views_daily_7_d
  • downloads_daily_7_d
  • registration_calls_daily_7_d
    as_of_date;content_id;original_content_id;content_type;number_days_online;ga_views_since_2024;downloads_since_2024;registration_calls_since_2024;ga_views_current_year;downloads_current_year;registration_calls_current_year;ga_views_180_d;downloads_180_d;registration_calls_180_d;ga_views_30_d;downloads_30_d;registration_calls_30_d;ga_views_7_d;downloads_7_d;registration_calls_7_d;ga_views_daily_since_2024;downloads_daily_since_2024;registration_calls_daily_since_2024;ga_views_daily_current_year;downloads_daily_current_year;registration_calls_daily_current_year;ga_views_daily_180_d;downloads_daily_180_d;registration_calls_daily_180_d;ga_views_daily_30_d;downloads_daily_30_d;registration_calls_daily_30_d;ga_views_daily_7_d;downloads_daily_7_d;registration_calls_daily_7_d
    2024-02-16 12:07:13.812766;topic_882;882;topic;4459.0;25945;0;315;1388;0;21;4873;0;75;866;0;11;137;0;1;33.47741935483871;0.0;0.4064516129032258;30.844444444444445;0.0;0.4666666666666667;27.072222222222223;0.0;0.4166666666666667;28.866666666666667;0.0;0.36666666666666664;19.571428571428573;0.0;0.14285714285714285
    2024-02-16 12:07:13.812766;statistic_500455;500455;statistic;2954.0;12920;52;74;729;3;6;3387;16;29;481;2;4;87;1;1;16.670967741935485;0.06709677419354838;0.09548387096774194;16.2;0.06666666666666667;0.13333333333333333;18.816666666666666;0.08888888888888889;0.16111111111111112;16.033333333333335;0.06666666666666667;0.13333333333333333;12.428571428571429;0.14285714285714285;0.14285714285714285
    2024-02-16 12:07:13.812766;topic_2814;2814;topic;2907.0;55482;0;101;3283;0;7;7501;0;20;2655;0;5;568;0;0;71.58967741935484;0.0;0.13032258064516128;72.95555555555555;0.0;0.15555555555555556;41.672222222222224;0.0;0.1111111111111111;88.5;0.0;0.16666666666666666;81.14285714285714;0.0;0.0
    2024-02-16 12:07:13.812766;statistic_164047;164047;statistic;4946.0;545614;293;452;44745;17;49;147598;82;144;26925;10;26;5850;1;8;704.018064516129;0.37806451612903225;0.5832258064516129;994.3333333333334;0.37777777777777777;1.0888888888888888;819.9888888888889;0.45555555555555555;0.8;897.5;0.3333333333333333;0.8666666666666667;835.7142857142857;0.14285714285714285;1.1428571428571428
    2024-02-16 12:07:13.812766;statistic_1347219;1347219;statistic;448.0;1020;100;28;43;5;0;215;14;8;28;2;0;7;0;0;2.2767857142857144;0.22321428571428573;0.0625;0.9555555555555556;0.1111111111111111;0.0;1.1944444444444444;0.07777777777777778;0.044444444444444446;0.9333333333333333;0.06666666666666667;0.0;1.0;0.0;0.0
    ........
    

highscore-data.csv:

  • as_of_date is the date of calculation starting with YYYY-MM-DD
  • content_id is a mix between content type and id e.g. topic_882
  • original_content_id is the legacy statista 3.5 id
  • content_type topic, statistic,...
  • The hi_score_bucket is a string with one of 5 categories e.g. Top Performer
  • The hi_score_percentage_ranking is a floating number e.g. 0.172144116778161
    as_of_date;content_id;original_content_id;content_type;hi_score_bucket;hi_score_percentage_ranking
    2024-01-23 12:57:06;statistic_829296;829296;statistic;High Performer;0.10360364442398418
    2024-01-23 12:57:06;statistic_1253289;1253289;statistic;High Performer;0.172144116778161
    ........
    

This data is loaded from S3 every night and stored into a database.

Prerequisites to work on the service

Check out the Numera new joiner guide for additional information. Please refer to our Naming Conventions when choosing names

There is a repository offering the most important databases for local work as docker compose setup https://github.com/PIT-Numera/local-dev-setup

additionally you have to prepare the following:

  • Create GitHub token
  • Configure settings

Create a GitHub token

Create an access token in Github. After creating the token, copy and paste it to a safe place. Once you leave that view, the token will not be visible to you anymore.

Make sure to configure SSO for this token, authorizing PIT-Numera and PIT-User-Tools.

Create a .settings.xml file

Copy the following settings to a file called settings.xml, located in ${HOME}/.m2:

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">

    <activeProfiles>
        <activeProfile>github</activeProfile>
    </activeProfiles>

    <profiles>
        <profile>
            <id>github</id>
            <repositories>
                <repository>
                    <id>github</id>
                    <url>https://maven.pkg.github.com/PIT-Numera/maven-packages/</url>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
            </repositories>
        </profile>
    </profiles>

    <servers>
        <server>
            <id>github</id>
            <username>your-github-user-name</username>
            <password>your-github-token</password>
        </server>
    </servers>
</settings>

where your-github-user-name is your GitHub username (something like firstname-lastname_statista) and your-github-token is the token (some string starting with ghp_) you created in the previous step.

Start it locally

  1. Start the services from https://github.com/PIT-Numera/local-dev-setup
  2. Create a dwh database in your local postgres with user numera and pw numera
  3. Create a run configuration in Intellij and start it. Example:
    -Dmessaging.env-name=env_john
    -Dmessaging.bucket.name="env-john-centralmsgbuck-envjohncentralmsgbucke-adrgh4fndtts"`
    
  4. Fill your data by calling the triggerSync endpoint
mvn install 
cd dhw-rest
mvn spring-boot:run -Dspring-boot.run.profiles=localdb -DskipTests

Access to S3 bucket

The S3 bucket is not in our control, but we are allowed to access the dwh bucket using the assume role functionality.

export AWS_ACCESS_KEY_ID="AS..."
export AWS_SECRET_ACCESS_KEY="7k..."
export AWS_SESSION_TOKEN="IQ..."
. ./scripts/assume-aws-role.sh "arn:aws:iam::966693701884:role/access_numera_bucket"
aws s3 ls 's3://numera-tracking-data-dev/'