Roularta is a leading publishing company in Belgium. Because digital news and channels move at a rapid pace and generate massive volumes of data, Roularta decided in 2019 to invest in a Spark-based data platform to drive true real-time website analytics and unlock insights from previously untouched (big) data sources. In this talk we’ll first explain why and how Roularta moved from a classical data warehouse to a Spark-based Lakehouse using Delta. We’ll then outline the publishing and marketing use cases delivered in the last 12 months and highlight, for each use case, the advantages of Spark and how the team further tuned performance to deliver insights with high velocity.
How a Media Data Platform Drives Real-time Insights & Analytics using Apache Spark
1. How a media data platform drives
real-time insights & analytics using Spark
Bart Van Der Vurst
Partner at element61
2. The media landscape is undergoing its biggest transformation
Ongoing market changes
▪ rise of the “Tech Giants”
▪ death of anonymous tracking and
third-party cookies
▪ GDPR / privacy
▪ digital transformation
▪ growing paid content, digital subscriptions
▪ search for personalisation
3. Data is flooding the media landscape
Enormous amount of data is tracked continuously
▪ Online behavioural or clickstream data
▪ Reading behaviour & attention time
▪ Interests & cross-brand tracking
▪ Digital subscription data
▪ Digital newspaper
▪ Weekly magazines
▪ Programmatic advertising & marketing
▪ Impressions
▪ Clicks
VALUE?
4. Media companies need a data & digital turnaround
Challenge to tackle
▪ From tackling small to huge data volumes
▪ Subscription data vs. clickstream data
▪ From reporting to data-driven (media) services
▪ From classical BI to a modern data platform
5. 2 years ago,
Roularta & element61 embarked on this challenge
Who is Roularta Media Group?
▪ Roularta Media Group is a Belgian multimedia
group, market leader in the field of magazines,
local media and business television
since 1954
▪ 1300+ employees, EUR 295+ million revenue
Who is element61?
▪ Analytics & AI consultancy team (70 people)
& Databricks partner in Belgium
8. A data analytics platform serving as foundation
of reporting, real-time personalization & insight services
9. The available BI stack couldn’t deliver on requirements
BI Stack vs. Modern Requirements
▪ One Analytical Data Hub
▪ Manage big volumes of data – 2.5 TB/month
▪ High-performance platform – 20 million transactions/day
▪ True real-time dependencies
(content recommendation, advertising audiences, etc.)
▪ Process structured & unstructured data
▪ GDPR compliant
▪ Advanced scoring, modeling and AI/ML capabilities
▪ Data democratization/self-service
▪ Dashboards in different departments
10. We’ve built this modern data analytics platform
with Azure & Databricks
Approach taken
▪ Leverage latest best practices (e.g. Delta Lake)
▪ First use case live in <6 months
▪ Built iteratively & use-case driven
▪ First focused on reporting, now added AI services
11. We use Delta Lake end-to-end
What it means
▪ Generally available in Q2 2019, introduced at Roularta in Q3 2019
▪ Used across all data lake layers and for both real-time & batch data processing
▪ Cornerstone for the GDPR solution & compliance put into place
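The slide does not show how Delta Lake supports the GDPR solution. As a minimal sketch of one common pattern (an erasure request applied to the Delta tables in each layer), the Python below may help. The table paths, the `user_id` column and both function names are hypothetical and not taken from the talk. `DeltaTable.forPath`, `delete` and `vacuum` are part of the delta-spark Python API.

```python
import re

# Only accept simple identifiers so they can be embedded safely in a SQL predicate.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def build_erasure_predicate(user_ids, column="user_id"):
    """Build a SQL predicate matching the given user ids (hypothetical column name)."""
    unique_ids = sorted(set(user_ids))
    if not unique_ids:
        raise ValueError("at least one user id is required")
    for uid in unique_ids:
        if not _SAFE_ID.fullmatch(uid):
            raise ValueError(f"unsupported characters in user id: {uid!r}")
    quoted = ", ".join(f"'{uid}'" for uid in unique_ids)
    return f"{column} IN ({quoted})"


def erase_users(spark, table_paths, user_ids):
    """Delete the users' rows from every Delta table path given (assumed layout)."""
    # Imported lazily so the predicate helper works without delta-spark installed.
    from delta.tables import DeltaTable

    predicate = build_erasure_predicate(user_ids)
    for path in table_paths:
        table = DeltaTable.forPath(spark, path)
        table.delete(predicate)
        # Deleted rows stay in older data files until vacuum removes the files
        # that fall outside the retention period (default retention used here).
        table.vacuum()
```

Under these assumptions, a call could look like `erase_users(spark, ["/lake/bronze/clicks", "/lake/silver/clicks"], ["u-123"])`, where both paths are placeholders.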
14. We use predictive article quality scoring
for editorial tuning & paywall decisions
Traffic score | Engagement score | Conversion score
(Predicted) article quality score
Data of every article
Want to know more?
(Re-)watch the Data & AI session “Building an ML Tool to predict Article Quality Scores using Delta & MLFlow”
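The slide names three component scores that feed an article quality score, but not how they are combined or predicted. As an illustration only, the sketch below assumes each score is already scaled to the range 0 to 1 and combines them with a weighted average. The weights, the paywall threshold and all names are hypothetical. The talk referenced above describes an ML-based prediction, which this sketch does not reproduce.

```python
from dataclasses import dataclass

# Hypothetical weights: the slides do not state how the scores are combined.
WEIGHTS = {"traffic": 0.3, "engagement": 0.3, "conversion": 0.4}


@dataclass(frozen=True)
class ArticleScores:
    """Component scores for one article, each assumed to lie in [0, 1]."""
    traffic: float
    engagement: float
    conversion: float


def quality_score(scores, weights=WEIGHTS):
    """Weighted average of the three component scores."""
    total_weight = sum(weights.values())
    weighted_sum = (
        weights["traffic"] * scores.traffic
        + weights["engagement"] * scores.engagement
        + weights["conversion"] * scores.conversion
    )
    return weighted_sum / total_weight


def suggest_paywall(scores, threshold=0.6):
    """Suggest putting an article behind the paywall (hypothetical threshold)."""
    return quality_score(scores) >= threshold
```

With these assumed weights, an article that scores well on conversion is more likely to be suggested for the paywall than one that mainly draws traffic.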
16. The best is yet to come!
• Enhanced content classification algorithms
for traffic, engagement & conversion
• Dynamic content-tagging for advertising
• Consumer segmentation & profiling
• Publisher dashboards
• Marketeer dashboards
• Further use of (Artificial) Intelligence for Dynamic Paywall
• Newsfeed curation based on behaviour & content potential
17. Roularta has a media data platform capable of scaling
… and the best is yet to come
• Enhanced content classification algorithms
for traffic, engagement & conversion
• Publisher dashboards
• Dynamic content-tagging for advertising
• Consumer segmentation
• Marketeer dashboards
• Further AI for Dynamic Paywall
• Newsfeed curation based on behaviour &
content potential