As digital consumption of rich media content explodes and audience expectations reach new peaks, media providers are challenged not only to deliver high-quality audience experiences but also to provide audience analytics in real time, enabling actionable insights for content publishers. Arkena, one of Europe's leading media services organizations, chose to power its analytics platform with Hortonworks Data Platform to cost-effectively store and analyze over 3.5 terabytes of data per day. Join Hortonworks and Arkena as they share the industry challenges they faced and the solution they created, which enables real-time and better analytics for their customers.
5. AGENDA
1. Who we are: CDN / OTT business
2. Why media experience & video analytics are so important for a CDN
3. Video analytics challenges & difficulties
4. Why we chose Hadoop technology
5. Architecture & results
6. Why we selected Hortonworks
10. WHO WE ARE
YOUR TRUSTED MEDIA PARTNER
A TDF Group business unit, at a glance:
• 16 CDN PoPs
• 1 Tbps connectivity
• 400 live radios & 360 live TVs
• 630 hours of on-demand video processed daily
• 13 offices in 9 countries: United Kingdom, Norway, USA, Finland, Denmark, Poland, France, Spain, Sweden
• A team of 400 employees
12. MEDIA COMPANIES & OPERATORS
SOLUTIONS AND SERVICES
Cloud4Media: a SaaS/PaaS service that provides all the necessary tools for managing and exchanging media assets.
Playout: optimized for the audiovisual industry, designed to distribute your live and on-demand content.
OTT / CDN: a solution with modular components that enables content owners, telecom operators and broadcasters to provide video content to viewers worldwide.
Video Platform: provides enterprises, organizations and the public sector with an all-in-one tool to publish, manage and distribute video, live or on-demand, to every device.
Play: part of the Arkena Video Platform, handling video playback. Learn how to customize Play to fit your needs.
Mobile Publisher: an add-on to the Arkena Video Platform that lets you publish live broadcasts and on-demand videos directly from your iPhone.
14. ARKENA OTT / CDN
A UNIQUE EUROPEAN PRESENCE, ESPECIALLY FRANCE AND THE NORDICS
We offer advanced CDN solutions for media:
• Live & on-demand streaming
• First-class origin
• Transmuxing service
• Ad insertion (audio: Triton, Radionomy, Adswizz)
• Timeshifting and catch-up services
Arkena will offer a set of new media analytics services:
• Real-time analytics
• Advanced media analytics
16. ARKENA OTT / CDN
A UNIQUE EUROPEAN PRESENCE, ESPECIALLY FRANCE AND THE NORDICS
Content management & animation
• Metadata and catalog organization
• Offer scheduling and promotions
• Subscription, rental and purchase models
• Automatic sorting and an optimized API for OTT app display
User account management
• User ownership tracking
• DRM entitlements
• Device pairing and restrictions
• Multiscreen favorites and resume
Content processing & protection
• Adaptive streaming and download support
• Multiple audio tracks and subtitles support
• Smooth Streaming with PlayReady and DASH with Marlin
• Geoblocking and streaming limits
17. Arkena OTT / CDN Analytics
Challenge: "an infrastructure capable of handling millions of simultaneous connections/requests".
18. CDN Architecture
A media-specialized CDN with a strong presence in Europe, especially France and the Nordics: a CDN dedicated to audiovisual media streaming.
• Video and audio delivery, live and on-demand services
• Multiscreen workflow expertise and broadcast / IP convergence
We deliver your content with optimal performance on all devices.
[Diagram: an origin / transmux service feeding caching servers at each PoP, connected through regional IP networks]
• More than 300 CDN customers in Europe
• 16 European PoPs, local to final end users
• Capacity: 1 Tbps, very high storage capacity (~PB)
• More than 1000 streaming servers
19. CDN Architecture
Same architecture, with log traffic from the caching servers forwarded to the analytics cluster.
20. Why Media Experience & Video Analytics Are So Important
Why we need an efficient analytics system:
• Customer trust
• Real-time analytics and advanced metrics
• Advanced media analytics to monetize your audience
• Billing & payment: reporting, billing
22. Video Analytics Challenge
• Daily raw log size (uncompressed, no replication): 20 GB to 200 GB per day
• Average raw log data input rate: 20 Mbps to 120 Mbps
• A peak rate of 60K events/second
• Raw logs are kept for 3-9 months
• 1 CDN edge server generates an average of 15-22 million lines/day
• We compute 15 metrics at every batch: volume, hits, session duration, concurrent sessions, unique viewers...
• All metrics are available over 15 dimensions: country, city, user agent, browser, HTTP status code...
• Real-time statistics should be provided within 3 minutes
DASH adaptive bitrate streaming: 1 movie (HD, 1 hour) in DASH format with 8 video tracks and 1 audio track produces 4200 log events.
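As a rough illustration of the per-batch computation described above, here is a minimal plain-Python sketch that derives a few of the listed metrics (hits, volume, unique sessions) over one dimension (country). The event field names are assumptions for illustration, not Arkena's actual log schema:

```python
from collections import defaultdict

# Illustrative CDN access-log events; field names are assumed, not Arkena's schema.
events = [
    {"session": "s1", "country": "FR", "bytes": 1_000_000},
    {"session": "s1", "country": "FR", "bytes": 2_000_000},
    {"session": "s2", "country": "SE", "bytes": 500_000},
    {"session": "s3", "country": "FR", "bytes": 750_000},
]

def metrics_by_country(batch):
    """Compute hits, volume and unique sessions per country for one batch."""
    acc = defaultdict(lambda: {"hits": 0, "volume": 0, "sessions": set()})
    for e in batch:
        m = acc[e["country"]]
        m["hits"] += 1                    # one hit per log event
        m["volume"] += e["bytes"]         # delivered bytes
        m["sessions"].add(e["session"])   # distinct sessions (viewer proxy)
    # Replace session sets by their cardinality for the final result.
    return {c: {"hits": m["hits"], "volume": m["volume"],
                "unique_sessions": len(m["sessions"])}
            for c, m in acc.items()}

print(metrics_by_country(events))
```

In production this aggregation runs on every micro-batch and for each of the 15 dimensions, but the per-group accumulation pattern is the same.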
30. Video Analytics Story
2012-2013: Home-made, open source. Arkena Analytics was built and developed in-house. A major problem in production caused significant downtime.
2014: POC. Market analysis: make or buy? The project was launched with the partners and the team was built (1 project manager, 1 developer, 1 system engineer).
2015: V1. The analytics platform was released to the operations team and the services were opened to customers.
34. TRANSPORT
Flume
Apache Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming data into the HDP cluster. Flume is already integrated in HDP: YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an HDP cluster.
Rsyslog
Rsyslog is a rocket-fast system for log processing. It offers high performance, strong security features and a modular design to transport data from our edge servers. We use RELP (the Reliable Event Logging Protocol) to provide reliable delivery of event messages.
Safe transport
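As a hedged illustration of the RELP-based transport described above, a minimal rsyslog forwarding rule on an edge server might look like the following sketch (the hostname, port and queue sizes are assumptions, not Arkena's production values):

```
# Illustrative rsyslog (RainerScript) forwarding rule on an edge server.
module(load="omrelp")
action(type="omrelp"
       target="log-aggregator.example" port="2514"
       queue.type="LinkedList"          # in-memory queue,
       queue.filename="edge_fwd"        # disk-assisted: spills to disk
       queue.maxDiskSpace="10g"         # when the aggregator is unreachable
       action.resumeRetryCount="-1")    # retry forever, never drop events
```

The disk-assisted queue is what gives the "no lost lines" property mentioned later in the deck: events buffer locally until the aggregator acknowledges them over RELP.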
35. STORE DATA
Shared data set
• In-house solution: can't query the whole data set.
• HDP: a single entry point through HDFS; we can query and cross-correlate everything from (almost) the beginning of time.
Opportunities
• In-house solution: rigid, a nightmare for the operational teams.
• HDP: gives us new opportunities (machine learning, new metrics, ...).
Stability & trust
• In-house solution: add clusters to scale out (we had 3!).
• HDP: add nodes to scale out (storage + compute).
36. OPERATION
Reliability & scalability
YARN
• View your cluster as a single data operating system
• Run multiple jobs on multiple processing engines
• High availability with a standby ResourceManager
• Easy scale-out by adding more YARN NodeManagers
Queue management
• Make sure business-critical jobs never lack resources
• Separate operation tasks from business tasks
• Validate new job versions with no production impact
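The queue separation above is typically expressed with YARN's CapacityScheduler. A minimal sketch of a `capacity-scheduler.xml` fragment (the queue names and percentages are assumptions, not Arkena's actual settings) could be:

```xml
<!-- Illustrative CapacityScheduler fragment: two queues so that
     business-critical jobs are isolated from operational tasks. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>business,operations</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.business.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.operations.capacity</name>
  <value>30</value>
</property>
```

Submitting a job to the `operations` queue then lets a new job version be validated without starving the production queue.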
38. OPERATION
Real-time compute: HDP stack, Spark Streaming
HDP packages the most recent Hadoop software technology in the same stack (Spark, Hive, Tez, ...). Apache Spark is a fast, in-memory data processing engine; we process the data every 2 minutes.
HDP's YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response.
We use a lambda architecture:
• Real-time processing: Spark Streaming.
• Synchronize the data into HDFS.
• Consolidate the data with Hive/Tez.
• Ingest into Elasticsearch.
[Diagram: events split into a near-real-time path and a batch store path]
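The lambda flow above can be sketched in plain Python. This only illustrates the speed-layer / batch-layer split; the real pipeline uses Spark Streaming, HDFS and Hive/Tez, and the asset names are made up:

```python
from collections import Counter

raw_store = []           # stands in for HDFS: the immutable raw event log
speed_view = Counter()   # stands in for the Spark Streaming real-time view

def ingest_microbatch(batch):
    """Speed layer: update approximate real-time hit counts per asset,
    while also appending the raw events to the batch store."""
    raw_store.extend(batch)                       # 'synchronize into HDFS'
    speed_view.update(e["asset"] for e in batch)  # incremental real-time view

def batch_recompute():
    """Batch layer: periodically recompute exact views from the raw data
    (the role Hive/Tez plays in the deck's pipeline)."""
    return Counter(e["asset"] for e in raw_store)

ingest_microbatch([{"asset": "movie1"}, {"asset": "movie2"}])
ingest_microbatch([{"asset": "movie1"}])
# The batch view can correct or confirm the incremental speed view:
assert speed_view["movie1"] == batch_recompute()["movie1"] == 2
```

The key property is that the speed layer is cheap and incremental while the batch layer can always be recomputed from the raw store, so transient streaming errors are never permanent.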
39. OPERATION
Reduce day-to-day operational cost
Easy to use in the long run: easy setup and installation, machine provisioning and capacity planning, easier provisioning and faster cluster deployment.
Ambari
• Expand clusters automatically as new nodes come online.
• Track cluster health, job progress and KPIs with alerts, customizable views and customizable dashboards.
• A REST API makes deployment & configuration easy to automate with modern configuration-management tools (Ansible).
43. ARKENA CDN : HDP Cluster
[Diagram: pipeline stages: transport; multiple processing (live processing and batch processing); archiving; operations]
44. ARKENA CDN : Hardware Cluster
A peak rate of 60K events/second; raw logs are kept for 3-9 months.
• HDP cluster: 8 machines
• Elasticsearch cluster: 5 machines
• API cluster: 6 VMs
HDP compute cluster:
We chose DELL R730 servers, configured with 16 cores, 128 GB RAM and 14 x 1 TB disks. We tried to respect the Hadoop rule of thumb (1 disk : 8 GB RAM : 1 physical core) to optimize I/O performance, with 10 file channels per machine, and we kept 2 disks for the system.
Elasticsearch cluster:
We chose 5 M610 machines, an odd number for redundancy and failover.
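The sizing rule of thumb quoted above can be sanity-checked with a few lines of Python; the figures are taken from the slide, and the check itself is only illustrative:

```python
# Sanity-check of the Hadoop sizing rule of thumb quoted above
# (1 data disk : 8 GB RAM : 1 physical core) against the stated
# DELL R730 config: 16 cores, 128 GB RAM, 14 x 1 TB disks,
# with 2 disks reserved for the OS.
cores, ram_gb, disks, os_disks = 16, 128, 14, 2
data_disks = disks - os_disks        # 12 disks left for HDFS / Flume channels
assert data_disks * 8 <= ram_gb      # 96 GB needed, 128 GB available
assert data_disks <= cores           # 12 data disks vs 16 cores
print(f"{data_disks} data disks fit the rule with headroom")
```

With 12 data disks the box has RAM and core headroom, which is consistent with running 10 Flume file channels per machine plus the system on the remaining disks.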
46. ARKENA CDN : Transport
Transport technology
It's not just how quickly you move data, but how safely you move it from the edge to the cluster without losing any lines. To build a resilient solution, we mixed two pieces of software.
From edge to cluster
Rsyslog transports the logs from the edge to a log-aggregator component. Rsyslog features we rely on:
• RELP protocol
• Native in Linux
• Disk-assisted queue buffering
Ingest into HDFS
Apache Flume is used to fetch the logs from Rsyslog and push them to HDFS.
47. ARKENA CDN : Transport
The log aggregator (with Rsyslog)
The log aggregator is responsible for reliably forwarding the logs to the compute cluster. If the compute cluster is unavailable, or on network issues, the logs are spooled on disk and stay on the aggregators until the compute cluster comes back online.
Logs are sent from the edge servers to log aggregators; there is one aggregator per PoP. Log aggregators are not specific to any PoP: we can reproduce this setup on any PoP (designated here as "PoPx" or "PoPy") just by deploying generic log aggregators.
49. ARKENA CDN : HDFS
Ingest into HDFS
The logs are ingested into HDFS once the local Rsyslog on each Hadoop node receives an event. Apache Flume is used to fetch the logs from Rsyslog and push them to HDFS.
The local Rsyslog forwards each event to the local Flume agent (a TCP connection to `localhost`). The Flume agent then sends the logs to HDFS, while buffering them on disk for durability.
51. ARKENA CDN : HDFS
Ingest into HDFS
A syslog source, listening on a TCP socket, receives the incoming Rsyslog events.
A FileChannel listens for incoming events on the Rsyslog TCP source and writes them locally to 10 different data directories on 10 separate physical hard disk drives. Each data directory acts as a FIFO, and load is balanced evenly from the single Rsyslog TCP source across the 10 data directories.
The FileChannel is plugged into 4 HDFS sinks. When enough events have been buffered in the channel, they are sent to the 4 HDFS sinks in an evenly balanced fashion.
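A hedged sketch of what such a Flume agent configuration might look like follows; the agent name, port and paths are assumptions (only three of the ten data directories are spelled out), and one of the four identical HDFS sinks is shown:

```properties
# Illustrative Flume agent config; names, ports and paths are assumptions.
a1.sources = syslog
a1.channels = fc
a1.sinks = hdfs1 hdfs2 hdfs3 hdfs4

# Syslog TCP source receiving events from the local rsyslog
a1.sources.syslog.type = syslogtcp
a1.sources.syslog.host = 127.0.0.1
a1.sources.syslog.port = 5140
a1.sources.syslog.channels = fc

# File channel striped over 10 data dirs, one per physical disk
a1.channels.fc.type = file
a1.channels.fc.checkpointDir = /data/00/flume-checkpoint
a1.channels.fc.dataDirs = /data/01/flume,/data/02/flume,/data/10/flume

# Four HDFS sinks draining the same channel in parallel
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = fc
a1.sinks.hdfs1.hdfs.path = hdfs:///logs/cdn/%Y-%m-%d
a1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
# hdfs2, hdfs3 and hdfs4 are configured identically
```

Multiple sinks attached to one file channel drain it concurrently, which is how the slide's "evenly balanced" fan-out to four HDFS writers is achieved.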
56. Why We Selected Hortonworks
Avoid vendor lock-in
Hortonworks Data Platform stays as close to the open-source trunk as possible and is developed 100% in the open, so you are never locked in. It presents a single, tested and completely open Hadoop platform with no proprietary bolt-ons.
Transparency
Price model & unlimited support throughout our projects.
"Hortonworks loves and lives open source innovation", and Arkena does as well!
57. Why We Selected Hortonworks
Connect with the community
Hortonworks employs a large number of Apache project committers & innovators, so that you are represented in the open-source community. Only Hortonworks can deliver the deepest level of support across all the components of the Hadoop platform.
Support from the experts
They provide the highest quality of support for deploying at scale.
"Hortonworks loves and lives open source innovation", and Arkena does as well.
59. What Happened After the Release
We identified some improvement items after the production release, in two areas: transport and operation.
About The team
Reda Benzair
Projet Roles :
Architect & Project management
Work Experience
Executive MBA, Graduate from Engineering
School and Master of Advanced Study
university (DEA). 15 years of experience in
SmartJog SAS (become Arkena in 2013)
TDF subsidiary. Since 2013 VP Technical
development, leading technical
development team located in Paris,
Stockholm and Warsaw.
Projet Roles : Senior Software Engineer,
Spark, System
Work Experience
A passionate programmer with a strong
interest in devops and software
craftmanship, Erwan has been working on
complex distributed architectures during
the last 10 years. He joined Arkena as a
general-purpose Analytics engineer,
worked on the Hadoop data processing
pipeline, developped a decent chunk
Erwan Queffelec Julien Girardin
Projet Roles : Senior System
administrator and python developper.
Work Experience
A passionate with Linux system with a
strong interest in devops and python
development. Strong experience with
complex distributed architectures.
65. Hadoop Summit 2016 - Dublin
Date: Wednesday 13 - Thursday 14 April, 2016
Venue: Convention Centre Dublin
Website: www.hadoopsummit.org
Why should you attend?
• Hadoop Summit is Europe's premier industry event for Apache Hadoop users, developers and vendors
• Two full days of practical and cutting-edge education designed by the community, for the community
• Over 90 sessions spanning 7 tracks dedicated to enabling the next-generation data platform
• A Community Showcase featuring the industry's who's who
• Crash courses for those just beginning with Hadoop
• Community-driven meetups
• Birds of a Feather (BoF) meetings to promote collaboration
• Comprehensive pre-event hands-on classroom training
• A social program which provides ample opportunity to network and make new industry connections
• An amazing event party at the Guinness Storehouse Brewery
Plus much, much more!
Register now to take advantage of our Early Bird rates!
TALK TRACK
Good morning.
I’m Justin Sears and I run Industry Marketing at Hortonworks.
I’m excited to be speaking with you today about how Hortonworks is powering the future of data.
[NEXT SLIDE]
TALK TRACK
Here are just a few of the modern data apps that convert yesterday’s impossible challenges into today’s new products, cures, conveniences and life saving innovations.
These apps are either custom-built by our customers or they come off the shelf, created by Hortonworks or one of our ecosystem partners to solve a particular problem.
Symantec and other cyber security leaders have built powerful apps to detect threats to digital information.
Leading pharma, automotive, consumer electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields.
And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field or to the cash register to do things that have never before been possible.
[NEXT SLIDE]
SPEAKER NOTES (translated from French)
Good morning, ladies and gentlemen. Let me introduce myself: Reda Benzair, VP of Technology at Arkena, a media services company of the TDF group, present in 9 countries, with 1500 customers in media & telecoms.
We provide content management solutions (exchange, storage, ...), linear and on-demand distribution, an OTT platform in the cloud, and a CDN for video distribution.
Here are some references from customers who trust us with media management and distribution, for example in sports, such as beIN on their CDN and OTT.
We offer our media customers a CDN (caching) distribution platform with a strong presence in France and Europe, transmux and origin services for distribution, and monetization of audio broadcasts. We also provide a real-time statistics service for our customers.
The challenge of a CDN is to provide an infrastructure capable of supporting and processing millions of simultaneous connections.
It is important for a distribution company to have an efficient and robust statistics solution, for billing reasons: it allows our broadcaster customers to properly monetize their internet distribution. A stable platform builds a relationship of trust with our customers; it is the element the customer looks at.
To give you an idea of the volumes and constraints we have to manage with the data flow generated by our distribution platforms.
Like any IT company, we decided to build our own in-house analytics solution based on open-source technologies. That changed at the end of 2013, following an operational problem and the difficulty of making the solution evolve.
A quick view of our architecture running in production today. With the latest generation of machines, we were able to deliver an HDP cluster with only 8 machines, which remains reasonable in price/performance terms: R730 (16 cores, 128 GB RAM, 14 x 1 TB disks) and M610 (12 cores, 48 GB RAM).
The Hortonworks stack is completely open and open source, with no dependency or constraint, which gives us the freedom to change vendor if the need arises.
We are pleased to have completed our first year as a publicly traded company, and 2015 marked several milestones for us:
Customers. We more than doubled our customer base in 2015 and now have over 800 support subscription customers. We believe this traction is indicative of the unique value proposition we bring to the market in the form of 100% open source, our standing within the Apache community, and multi-product offerings that address both Data in Motion and Data at Rest.
Scale. We became the fastest enterprise software company to reach $100 million in annual revenue (according to Barclays research). In fact, we once again experienced triple digit annual revenue growth coming in at $122 million for 2015.
Employees. We hired the best and brightest people. We exited the year with about 850 employees, and from an engineering perspective, our efforts are applied across well over 200 Apache committer seats, which allowed us to accelerate innovation through our teams and the community.
Expand Market Opportunity. With the acquisition of Onyara last year, we expanded our Big Data and Analytics market focus to also target adjacent opportunities within the Internet of Things.