"Are you developing or declining? Don't become an IT-dinosaur"Sigma Software
Weitere ähnliche Inhalte
Ähnlich wie Realtime data ingestion on Spark and Presto. The story about 2.5M events per second and Data Lake in S3 or saying bye, HDFS by Boris Trofimov
Ähnlich wie Realtime data ingestion on Spark and Presto. The story about 2.5M events per second and Data Lake in S3 or saying bye, HDFS by Boris Trofimov (20)
2. ABOUT ME
• Leading DWH @ AOL / Vidible division
• Major expertise: Big Data and Enterprise
• Cofounder of Odessa JUG
• Passionate follower of Scala
• Associate professor at ONPU
8. THE WORLD OF BIG DATA
DATA MANAGEMENT
§ INGESTION & ETL: flexible data pipelines
§ INTEGRATION: multiple 3rd-party sources
§ WAREHOUSING: efficient data organization
DATA ANALYSIS
§ REPORTING: organizing data into informational summaries
§ DATA ANALYTICS: finding meaningful correlations in the data
§ DATA MINING: extracting new knowledge
§ DATA SCIENCE: insights, models & predictions, machine learning
§ VISUALISATION: getting insights
INFRASTRUCTURE
§ RELIABLE SERVICES: private vs public clouds, quick scale out/down, instant deployments, efficient maintenance
§ MONITORING: big picture and total control over every service; metrics and alerts
32. MIGRATING SPARK TO EMR
• EASY CREATE, EASY DESTROY (see the sketch below)
• MULTIPLE EMR CLUSTERS
• Separation of concerns
• Simplified autoscaling rules
• M4.4XLARGE AS THE MAJOR BUILD UNIT
• STATELESS EMR CLUSTER
• CUSTOMIZED EMR DEPLOYMENT PROCEDURE
• Docker images carry the Spark driver and YARN configs, so the job deploys as a typical YARN app
• Option to use custom Spark versions
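To illustrate the "easy create, easy destroy" model, here is a minimal boto3 sketch of spinning up a transient, stateless EMR cluster built from m4.4xlarge nodes. The cluster name, release label, instance count, region, and IAM roles are illustrative assumptions, not the production values.

# Minimal sketch: create a disposable EMR cluster from m4.4xlarge nodes.
# Every name, count, and role below is an illustrative assumption.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region assumed

response = emr.run_job_flow(
    Name="spark-ingestion",                 # hypothetical cluster name
    ReleaseLabel="emr-5.8.0",               # EMR release is an assumption
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.4xlarge",
        "SlaveInstanceType": "m4.4xlarge",  # m4.4xlarge as the build unit
        "InstanceCount": 10,                # initial size; autoscaling adjusts it
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,      # easy destroy
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # default roles assumed
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

Since the cluster keeps no state, tearing it down is a single terminate_job_flows call with that JobFlowId, which is what makes one cluster per pipeline practical.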
34. EMR AUTOSCALING
• We use a custom strategy to scale the EMR cluster out/in; it proved the most stable and reliable solution
• It checks the input rate on a regular basis and decides how many nodes to add or remove
• We built this formula (see the sketch below):
NEW NODES = (RATE * 60s) / (MAX_PARTITION_SIZE * 16 vcores) + BUFFER
- RATE: incoming events per second
- 60s: aggregation time for Spark Streaming
- 16 vcores: every node (m4.4xlarge) runs exactly 1 executor with 16 vcores
- MAX_PARTITION_SIZE: empirically precalculated maximum comfortable partition size for 1 Spark vcore
- BUFFER: 20% headroom to mitigate accidental spikes in rate
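The formula translates directly into a few lines of Python; this is a minimal sketch, where the MAX_PARTITION_SIZE value is a placeholder rather than the real empirical number:

# Minimal sketch of the node-count formula. MAX_PARTITION_SIZE below is a
# placeholder; the real value was precalculated empirically.
import math

VCORES_PER_NODE = 16          # m4.4xlarge: exactly 1 executor, 16 vcores
BATCH_SECONDS = 60            # Spark Streaming aggregation time
MAX_PARTITION_SIZE = 250_000  # events one vcore handles comfortably (assumed)
BUFFER = 0.20                 # 20% headroom for accidental rate spikes

def target_node_count(rate_per_second: float) -> int:
    """Nodes needed to absorb the current incoming event rate."""
    events_per_batch = rate_per_second * BATCH_SECONDS
    base = events_per_batch / (MAX_PARTITION_SIZE * VCORES_PER_NODE)
    return math.ceil(base * (1 + BUFFER))

# Example: target_node_count(2_500_000) sizes the cluster for 2.5M events/s.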
36. CDH vs EMR
CDH | EMR
Cannot scale out/in on demand | Can scale out/in on demand
No extra cost (for the community license) | Extra ~30% on top of EC2 costs; per-second billing (!)
Adding machines to CDH requires restarting YARN | No YARN restart
Easy configuration management via Cloudera Manager | Limited configuration, available only during EMR creation
Classic YARN cluster | Ordinary YARN under the hood, but imposes the EMR-driven way to deploy apps
Single CDH per region | EMR cluster on demand as the unit of clustering
41. WRITING FASTER – FILE FORMATS
• Best and most stable performance with uncompressed ORC (see the sketch below):
• Spark apps write raw data in ORC
• Presto reads ORC and writes aggregations in ORC
• replication uses ORC to send deltas to Vertica
• Best performance with HDFS block size and ORC stripe size of 64M
• feasible thanks to the strict 6-hour retention policy
• Enabling hive.orc.use-column-names=true
• simplifies the Spark app, allowing it to write the dataframe as is; Presto accesses columns by name
• allows the dataframe schema and the database schema to evolve/be modified independently
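Here is a minimal PySpark sketch of that write path. The S3 paths are hypothetical, and passing the stripe size as an orc.* writer option is an assumption about the Spark version in use (64M expressed in bytes):

# Minimal sketch: write a dataframe as uncompressed ORC with a 64M stripe.
# S3 paths are hypothetical; the orc.stripe.size option pass-through is an
# assumption about the Spark version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-writer").getOrCreate()

df = spark.read.json("s3://example-bucket/raw-events/")  # hypothetical input

(df.write
    .format("orc")
    .option("compression", "none")                # uncompressed ORC won
    .option("orc.stripe.size", 64 * 1024 * 1024)  # 64M stripe
    .save("s3://example-bucket/orc/"))            # hypothetical output

On the Presto side, hive.orc.use-column-names=true is a Hive connector property set in the catalog's properties file; matching columns by name instead of position is what lets the two schemas evolve independently.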
43. WHAT WE HAVE ACHIEVED
• Scalable production
• Ability to grow further, beyond 1M/s
• Stable production environment
• More stateless components, easier to recover
• Less expensive
• Smaller Spark cluster (-30%)
• Presto cluster 30% smaller than the MemSQL-driven one
• Simplified maintenance
• EMR scale out/in does not require a YARN or app restart
57. AUTOSCALING
● Autoscaling the shared cluster based on memory use
● Autoscaling dedicated pipelines with Python scripts (see the sketch below)
● Optimal node count is a complex problem
○ Optimize for run time?
○ Optimize for cost?
○ Easier to solve on Spark
● Core nodes are not autoscaled
● Can run into EC2 instance limits
● Random issues with scaling up or scaling down
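As a rough illustration of such a script, here is a minimal boto3 sketch that resizes only the TASK instance group, leaving core nodes alone; the cluster ID and how the target count is obtained are illustrative assumptions:

# Minimal sketch: resize the TASK instance group of an EMR cluster.
# The cluster ID is a placeholder; core nodes are deliberately untouched.
import boto3

emr = boto3.client("emr")
CLUSTER_ID = "j-XXXXXXXXXXXXX"  # hypothetical cluster id

def resize_task_group(target_count: int) -> None:
    groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
    task = next(g for g in groups if g["InstanceGroupType"] == "TASK")
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{
            "InstanceGroupId": task["Id"],
            "InstanceCount": target_count,  # may be capped by EC2 limits
        }],
    )

Because a resize request can hit EC2 instance limits or stall, it helps to re-read the group's running count afterwards and alert when it diverges from the target.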