DWH Service
How does it work
DWH will store data in form of CSV files to S3 at least one a night.
tracking-data.csv:
- as_of_date is the date of calculation starting with YYYY-MM-DD
- content_id is a mix between content type and id e.g. topic_882
- original_content_id is the legacy statista 3.5 id
- content_type topic, statistic,...
- number_days_online total number of days online as decimal e.g. 4459.0
- ga_views_since_2024 as integer e.g. 25945 interpreted by us as overall
- downloads_since_2024 as integer e.g. 5 interpreted by us as overall
- registration_calls_since_2024 as integer e.g. 315 interpreted by us as overall
- ga_views_current_year
- downloads_current_year
- registration_calls_current_year
- ga_views_180_d
- downloads_180_d
- registration_calls_180_d
- ga_views_30_d
- downloads_30_d
- registration_calls_30_d
- ga_views_7_d
- downloads_7_d
- registration_calls_7_d
- ga_views_daily_since_2024 an average daily view number as float e.g. 33.47741935483871
- downloads_daily_since_2024 an average daily view number as float e.g. 0.0
- registration_calls_daily_since_2024 an average daily view number as float e.g. 0.4064516129032258
- ga_views_daily_current_year
- downloads_daily_current_year
- registration_calls_daily_current_year
- ga_views_daily_180_d
- downloads_daily_180_d
- registration_calls_daily_180_d
- ga_views_daily_30_d
- downloads_daily_30_d
- registration_calls_daily_30_d
- ga_views_daily_7_d
- downloads_daily_7_d
- registration_calls_daily_7_d
as_of_date;content_id;original_content_id;content_type;number_days_online;ga_views_since_2024;downloads_since_2024;registration_calls_since_2024;ga_views_current_year;downloads_current_year;registration_calls_current_year;ga_views_180_d;downloads_180_d;registration_calls_180_d;ga_views_30_d;downloads_30_d;registration_calls_30_d;ga_views_7_d;downloads_7_d;registration_calls_7_d;ga_views_daily_since_2024;downloads_daily_since_2024;registration_calls_daily_since_2024;ga_views_daily_current_year;downloads_daily_current_year;registration_calls_daily_current_year;ga_views_daily_180_d;downloads_daily_180_d;registration_calls_daily_180_d;ga_views_daily_30_d;downloads_daily_30_d;registration_calls_daily_30_d;ga_views_daily_7_d;downloads_daily_7_d;registration_calls_daily_7_d 2024-02-16 12:07:13.812766;topic_882;882;topic;4459.0;25945;0;315;1388;0;21;4873;0;75;866;0;11;137;0;1;33.47741935483871;0.0;0.4064516129032258;30.844444444444445;0.0;0.4666666666666667;27.072222222222223;0.0;0.4166666666666667;28.866666666666667;0.0;0.36666666666666664;19.571428571428573;0.0;0.14285714285714285 2024-02-16 12:07:13.812766;statistic_500455;500455;statistic;2954.0;12920;52;74;729;3;6;3387;16;29;481;2;4;87;1;1;16.670967741935485;0.06709677419354838;0.09548387096774194;16.2;0.06666666666666667;0.13333333333333333;18.816666666666666;0.08888888888888889;0.16111111111111112;16.033333333333335;0.06666666666666667;0.13333333333333333;12.428571428571429;0.14285714285714285;0.14285714285714285 2024-02-16 12:07:13.812766;topic_2814;2814;topic;2907.0;55482;0;101;3283;0;7;7501;0;20;2655;0;5;568;0;0;71.58967741935484;0.0;0.13032258064516128;72.95555555555555;0.0;0.15555555555555556;41.672222222222224;0.0;0.1111111111111111;88.5;0.0;0.16666666666666666;81.14285714285714;0.0;0.0 2024-02-16 12:07:13.812766;statistic_164047;164047;statistic;4946.0;545614;293;452;44745;17;49;147598;82;144;26925;10;26;5850;1;8;704.018064516129;0.37806451612903225;0.5832258064516129;994.3333333333334;0.37777777777777777;1.0888888888888888;819.9888888888889;0.45555555555555555;0.8;897.5;0.3333333333333333;0.8666666666666667;835.7142857142857;0.14285714285714285;1.1428571428571428 2024-02-16 12:07:13.812766;statistic_1347219;1347219;statistic;448.0;1020;100;28;43;5;0;215;14;8;28;2;0;7;0;0;2.2767857142857144;0.22321428571428573;0.0625;0.9555555555555556;0.1111111111111111;0.0;1.1944444444444444;0.07777777777777778;0.044444444444444446;0.9333333333333333;0.06666666666666667;0.0;1.0;0.0;0.0 ........
highscore-data.csv:
- as_of_date is the date of calculation starting with YYYY-MM-DD
- content_id is a mix between content type and id e.g. topic_882
- original_content_id is the legacy statista 3.5 id
- content_type topic, statistic,...
- The hi_score_bucket is a string with one of 5 categories e.g. Top Performer
- The hi_score_percentage_ranking is a floating number e.g. 0.172144116778161
as_of_date;content_id;original_content_id;content_type;hi_score_bucket;hi_score_percentage_ranking 2024-01-23 12:57:06;statistic_829296;829296;statistic;High Performer;0.10360364442398418 2024-01-23 12:57:06;statistic_1253289;1253289;statistic;High Performer;0.172144116778161 ........
This data is loaded from S3 every night and stored into a database.
Prerequisites to work on the service
Check out the Numera new joiner guide for additional information. Please refer to our Naming Conventions when choosing names
There is a repository offering the most important databases for local work as docker compose setup https://github.com/PIT-Numera/local-dev-setup
additionally you have to prepare the following:
- Create GitHub token
- Configure settings
Create a GitHub token
Create an access token in Github. After creating the token, copy and paste it to a safe place. Once you leave that view, the token will not be visible to you anymore.
Make sure to configure SSO for this token, authorizing PIT-Numera and PIT-User-Tools.
Create a .settings.xml file
Copy the following settings to a file called settings.xml, located in ${HOME}/.m2:
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
<activeProfiles>
<activeProfile>github</activeProfile>
</activeProfiles>
<profiles>
<profile>
<id>github</id>
<repositories>
<repository>
<id>github</id>
<url>https://maven.pkg.github.com/PIT-Numera/maven-packages/</url>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
</profile>
</profiles>
<servers>
<server>
<id>github</id>
<username>your-github-user-name</username>
<password>your-github-token</password>
</server>
</servers>
</settings>
where your-github-user-name is your GitHub username (something like firstname-lastname_statista)
and your-github-token is the token (some string starting with ghp_) you created in the previous step.
Start it locally
- Start the services from https://github.com/PIT-Numera/local-dev-setup
- Create a dwh database in your local postgres with user numera and pw numera
- Create a run configuration in Intellij and start it. Example:
-Dmessaging.env-name=env_john -Dmessaging.bucket.name="env-john-centralmsgbuck-envjohncentralmsgbucke-adrgh4fndtts"` - Fill your data by calling the triggerSync endpoint
mvn install
cd dhw-rest
mvn spring-boot:run -Dspring-boot.run.profiles=localdb -DskipTests
Access to S3 bucket
The S3 bucket is not in our control, but we are allowed to access the dwh bucket using the assume role functionality.
export AWS_ACCESS_KEY_ID="AS..."
export AWS_SECRET_ACCESS_KEY="7k..."
export AWS_SESSION_TOKEN="IQ..."
. ./scripts/assume-aws-role.sh "arn:aws:iam::966693701884:role/access_numera_bucket"
aws s3 ls 's3://numera-tracking-data-dev/'