Collecting data into a data lake without impacting operational systems is a challenge for many companies.
At the Paris Data Engineers meetup on March 26, 2019, Dimitri Capitaine presented Data Collector, a Change Data Capture (CDC) tool developed in-house at OVH. Data Collector provides reliable, high-performance replication of databases all the way to the data lake.
Hugo Larcher then presented a use case on exploiting aeronautical data, with a touch of IoT and DataViz.
Change Data Capture with Data Collector @OVH
1. #PARISDATAENG’ MEETUP
CHANGE DATA CAPTURE WITH
DATA COLLECTOR
HUGO LARCHER
DATA & SOFTWARE ENGINEER @OVH
@hugoch
DIMITRI CAPITAINE
SENIOR DEVOPS BIG DATA @OVH
@pirion
2. Big data common pattern
Message bus, Twitter feed, website statistics, …
Big data Cluster
Software (Hadoop)
Compute (CPU/RAM)
Storage
Data « at rest »
Data « in motion »
CSV, JSON, Database dump, …
Better understanding
Better decisions
Analyze, exploit data
Collect & store various data
Perform massive operations
3. Data + OVH = ❤
OVH
Data collector
Cloudera platform (fully managed)
Analytics Data Platform
Apache Spark as a Service
Machine Learning
NVIDIA NGC catalog
Collect data
Store data
Process, Analyze
Learn, predict
Free Lab
Free Lab
Free Lab
Object storage
Block storage
Storage dedicated servers
File storage
Managed databases
Logs & Metrics
9. Data Collector client
A lightweight data replication tool
Kafka
Data Source
Data Collector
OVH Cloud
Sink
Source
10. Data Collector client
Performance
• 300 000 events/s in "Query" Mode
• ~40 000 events/s in "Change data capture" Mode
Reliability
• Failure tolerant
• Encrypted
• Source Filter
Simplicity
• Remote control by API
• Modular
A lightweight data replication tool
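The "Query" mode named on this slide can be pictured with a small example. The sketch below shows only the general idea of query-based collection, assuming that "Query" mode means periodically querying the source for rows past a high-water mark. It is not Data Collector's implementation: the `events` table, its columns, and the `poll_changes` helper are hypothetical names chosen for illustration. Log-based change data capture typically reads the database's replication log instead, which is not shown here.

```python
import sqlite3


def poll_changes(conn, last_id):
    """Return rows with an id above last_id, plus the new high-water mark.

    Hypothetical helper: a generic polling query, not Data Collector's API.
    """
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    # Keep the previous mark when nothing new was found.
    new_last = rows[-1][0] if rows else last_id
    return rows, new_last


# In-memory source table standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

# First poll picks up everything written so far.
batch1, mark = poll_changes(conn, 0)

# A later write is picked up by the next poll only.
conn.execute("INSERT INTO events (payload) VALUES ('c')")
batch2, mark = poll_changes(conn, mark)

print(batch1)  # [(1, 'a'), (2, 'b')]
print(batch2)  # [(3, 'c')]
```

A polling query of this kind sees inserts past the mark but misses updates and deletes on earlier rows, which is one reason a separate change data capture mode is useful.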
11. Data @ OVH
How to get data from Kafka to the data lake?
Datalake
Kafka
35. Plane tracking... at scale
17,000+ receivers
200k flights/day
105,000,000 pts/hour
ANALYTICS DATA PLATFORM
For this demo:
3 master nodes
3 compute nodes
2x 2 TB NVMe per node
2.4 GHz, 8 vCores per node
80 GB RAM per node
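To put the ingest figures on this slide into per-second terms, here is a short back-of-the-envelope calculation. It uses only the numbers quoted above (105,000,000 points/hour and 200k flights/day) and assumes the hourly rate is sustained over a full day, which the slide does not state.

```python
# Figures taken from the slide.
POINTS_PER_HOUR = 105_000_000
FLIGHTS_PER_DAY = 200_000

SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24

# Average ingest rate in points per second.
points_per_second = POINTS_PER_HOUR / SECONDS_PER_HOUR

# Assumption: the hourly rate holds for 24 hours.
points_per_day = POINTS_PER_HOUR * HOURS_PER_DAY
points_per_flight = points_per_day / FLIGHTS_PER_DAY

print(round(points_per_second))      # 29167
print(f"{points_per_day:,}")         # 2,520,000,000
print(f"{points_per_flight:,.0f}")   # 12,600
```

That is roughly 29,000 points per second on average. This sits below the ~40 000 events/s quoted for "Change data capture" mode on slide 10, although the two figures are not necessarily measured on the same pipeline.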
36. Welcome Analytics Data Platform!
From zero… to production!
Flexible infra, flexible payment
On top of OVH Public Cloud
Competitive pricing
Ready to use Hadoop cluster
Secured and configured
Performance
Soon: High-speed storage instances (NVMe)
Within 1 hour
37. Analyzing a flight dataset
Archive data
3 months, ~3.5 TB
Raspberry Pi
OVH DATA COLLECTOR
ANALYTICS DATA PLATFORM
+
42. Useful URLs
✓ Lab Data Collector: https://labs.ovh.com/ovh-data-collector
✓ Data Collector Agent Github: https://github.com/Pirionfr/lookatch-agent
✓ Lab Spark as a Service: https://labs.ovh.com/analytics-data-compute
✓ Big data Analytics Data Platform offer: https://www.ovh.com/fr/platform/big-data/analytics-data-platform.xml
✓ Big Data Cloudera offer: https://www.ovh.com/fr/platform/big-data/managed-cluster.xml
✓ AI solutions: https://www.ovh.com/fr/platform/ai-machine-learning.xml
✓ NVIDIA NGC: https://www.ovh.com/fr/public-cloud/instances/gpu-tesla.xml
✓ Lab Machine Learning: https://labs.ovh.com/machine-learning-platform
✓ Lab Premium Databases: https://labs.ovh.com/ha-database