"Are you developing or declining? Don't become an IT-dinosaur"Sigma Software
Weitere ähnliche Inhalte
Ähnlich wie Realtime data ingestion on Spark and Presto. The story about 2.5M events per second and Data Lake in S3 or saying bye, HDFS by Boris Trofimov
Ähnlich wie Realtime data ingestion on Spark and Presto. The story about 2.5M events per second and Data Lake in S3 or saying bye, HDFS by Boris Trofimov (20)
2. ABOUT ME
• Leading DWH @ AOL / Vidible division
• Major expertise: Big Data and Enterprise
• Cofounder of Odessa JUG
• Passionate follower of Scala
• Associate professor at ONPU
8. THE WORLD OF BIG DATA
DATA MANAGEMENT
§ INGESTION & ETL: flexible data pipelines
§ INTEGRATION: multiple 3rd-party sources
§ WAREHOUSING: efficient data organization
DATA ANALYSIS
§ REPORTING: organizing data into informational summaries
§ DATA ANALYTICS: finding meaningful correlations in the data
§ DATA MINING: extracting new knowledge
§ DATA SCIENCE: insights, models & predictions, machine learning
§ VISUALISATION: getting insights
INFRASTRUCTURE
§ RELIABLE SERVICES: private vs public clouds, quick scale out/down, instant deployments, efficient maintenance
§ MONITORING: big picture and total control over every service; metrics and alerts
32. MIGRATING SPARK TO EMR
• EASY CREATE, EASY DESTROY (see the sketch below)
• MULTIPLE EMR CLUSTERS
• Separation of concerns
• Simplified autoscaling rules
• M4.4XLARGE AS THE MAJOR BUILD UNIT
• STATELESS EMR CLUSTER
• CUSTOMIZED EMR DEPLOYMENT PROCEDURE
• Docker images carry the Spark driver and YARN configs, so the job deploys as a typical YARN app
• Option to use custom Spark versions
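To illustrate the "easy create, easy destroy" model, here is a minimal boto3 sketch of spinning up a transient, stateless EMR cluster built from m4.4xlarge nodes. The cluster name, release label, instance count, region, and IAM roles are illustrative assumptions, not the production values.

# Minimal sketch: create a disposable EMR cluster from m4.4xlarge nodes.
# Every name, count, and role below is an illustrative assumption.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region assumed

response = emr.run_job_flow(
    Name="spark-ingestion",                 # hypothetical cluster name
    ReleaseLabel="emr-5.8.0",               # EMR release is an assumption
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.4xlarge",
        "SlaveInstanceType": "m4.4xlarge",  # m4.4xlarge as the build unit
        "InstanceCount": 10,                # initial size; autoscaling adjusts it
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,      # easy destroy
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # default roles assumed
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

Since the cluster keeps no state, tearing it down is a single terminate_job_flows call with that JobFlowId, which is what makes one cluster per pipeline practical.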
34. EMR AUTOSCALING
• We use a custom strategy to scale the EMR cluster out/in; it proved the most stable and reliable solution
• It checks the input rate on a regular basis and decides how many nodes to add or remove
• We built this formula (see the sketch below):
NEW NODES = (RATE * 60s) / (MAX_PARTITION_SIZE * 16 vcores) + BUFFER
- RATE: incoming events per second
- 60s: aggregation time for Spark Streaming
- 16 vcores: every node (m4.4xlarge) runs exactly 1 executor with 16 vcores
- MAX_PARTITION_SIZE: empirically precalculated maximum comfortable partition size for 1 Spark vcore
- BUFFER: 20% headroom to mitigate accidental spikes in rate
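The formula translates directly into a few lines of Python; this is a minimal sketch, where the MAX_PARTITION_SIZE value is a placeholder rather than the real empirical number:

# Minimal sketch of the node-count formula. MAX_PARTITION_SIZE below is a
# placeholder; the real value was precalculated empirically.
import math

VCORES_PER_NODE = 16          # m4.4xlarge: exactly 1 executor, 16 vcores
BATCH_SECONDS = 60            # Spark Streaming aggregation time
MAX_PARTITION_SIZE = 250_000  # events one vcore handles comfortably (assumed)
BUFFER = 0.20                 # 20% headroom for accidental rate spikes

def target_node_count(rate_per_second: float) -> int:
    """Nodes needed to absorb the current incoming event rate."""
    events_per_batch = rate_per_second * BATCH_SECONDS
    base = events_per_batch / (MAX_PARTITION_SIZE * VCORES_PER_NODE)
    return math.ceil(base * (1 + BUFFER))

# Example: target_node_count(2_500_000) sizes the cluster for 2.5M events/s.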
36. CDH vs EMR
CDH | EMR
Cannot scale out/in on demand | Can scale out/in on demand
No extra cost (for the community license) | Extra ~30% on top of EC2 costs; per-second billing (!)
Adding machines to CDH requires restarting YARN | No YARN restart
Easy configuration management via Cloudera Manager | Limited configuration, available only during EMR creation
Classic YARN cluster | Ordinary YARN under the hood, but imposes the EMR-driven way to deploy apps
Single CDH per region | EMR cluster on demand as the unit of clustering
41. WRITING FASTER – FILE FORMATS
• Best and most stable performance with uncompressed ORC (see the sketch below):
• Spark apps write raw data in ORC
• Presto reads ORC and writes aggregations in ORC
• replication uses ORC to send deltas to Vertica
• Best performance with HDFS block size and ORC stripe size of 64M
• feasible thanks to the strict 6-hour retention policy
• Enabling hive.orc.use-column-names=true
• simplifies the Spark app, allowing it to write the dataframe as is; Presto accesses columns by name
• allows the dataframe schema and the database schema to evolve/be modified independently
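Here is a minimal PySpark sketch of that write path. The S3 paths are hypothetical, and passing the stripe size as an orc.* writer option is an assumption about the Spark version in use (64M expressed in bytes):

# Minimal sketch: write a dataframe as uncompressed ORC with a 64M stripe.
# S3 paths are hypothetical; the orc.stripe.size option pass-through is an
# assumption about the Spark version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-writer").getOrCreate()

df = spark.read.json("s3://example-bucket/raw-events/")  # hypothetical input

(df.write
    .format("orc")
    .option("compression", "none")                # uncompressed ORC won
    .option("orc.stripe.size", 64 * 1024 * 1024)  # 64M stripe
    .save("s3://example-bucket/orc/"))            # hypothetical output

On the Presto side, hive.orc.use-column-names=true is a Hive connector property set in the catalog's properties file; matching columns by name instead of position is what lets the two schemas evolve independently.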
43. WHAT WE HAVE ACHIEVED
• Scalable production
• Ability to grow further, beyond 1M/s
• Stable production environment
• More stateless components, easier to recover
• Less expensive
• Smaller Spark cluster (-30%)
• Presto cluster 30% smaller than the MemSQL-driven one
• Simplified maintenance
• EMR scale out/in does not require a YARN or app restart
57. AUTOSCALING
● Autoscaling the shared cluster based on memory use
● Autoscaling dedicated pipelines with Python scripts (see the sketch below)
● Optimal node count is a complex problem
○ Optimize for run time?
○ Optimize for cost?
○ Easier to solve on Spark
● Core nodes are not autoscaled
● Can run into EC2 instance limits
● Random issues with scaling up or scaling down
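As a rough illustration of such a script, here is a minimal boto3 sketch that resizes only the TASK instance group, leaving core nodes alone; the cluster ID and how the target count is obtained are illustrative assumptions:

# Minimal sketch: resize the TASK instance group of an EMR cluster.
# The cluster ID is a placeholder; core nodes are deliberately untouched.
import boto3

emr = boto3.client("emr")
CLUSTER_ID = "j-XXXXXXXXXXXXX"  # hypothetical cluster id

def resize_task_group(target_count: int) -> None:
    groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
    task = next(g for g in groups if g["InstanceGroupType"] == "TASK")
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{
            "InstanceGroupId": task["Id"],
            "InstanceCount": target_count,  # may be capped by EC2 limits
        }],
    )

Because a resize request can hit EC2 instance limits or stall, it helps to re-read the group's running count afterwards and alert when it diverges from the target.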