2. Agenda
• Introducing me and Appier
• How do we build our pipeline?
• Why do we use SparkSQL + HDFS?
• Why do we use Parquet?
3. Who am I?
• Data Team Lead at Appier
• Spark Code Contributor
• Personal Email: thegiive@gmail.com
• Speaker at
• Spark Summit 2014 SF
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
4. What is Appier?
• AI and data company
• Our mission is to make advertising the preferred
content that connects businesses and users
• Backed by Sequoia Capital
5. Data Team in Appier
• Deals with petabytes of data per day
• Handles a 2K~3K core cluster on AWS
• Builds and maintains a robust data pipeline
• Data correctness is a must
• Part of the pipeline needs < 1 min latency
• Overall infrastructure must be low cost
8. Heavy Spark User
• ML: custom Spark applications (no MLlib)
• ETL: Spark applications
• SQL: SparkSQL + Parquet
• Streaming: Spark Streaming + Kafka
9. Why Spark?
• We love Spark and are familiar with it
• Appier contributed >10 commits in the last quarter
• Perfect for ML applications
• A general engine for many kinds of workloads
• You don't have to learn a lot of big-data jargon
10. Why SQL is important?
• Before SparkSQL: 5 engineers coding in Scala
• After SparkSQL: all engineers can get involved in
data projects, and data analysts can query
data on their own
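To make the before/after concrete, here is a toy sketch using Python's built-in sqlite3 in place of SparkSQL (purely to keep the example self-contained): the same question answered once in imperative code that an engineer would write, and once in SQL that an analyst could write directly.

```python
import sqlite3

# Tiny in-memory table standing in for a pipeline output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 7), ("a", 2)])

# Before SQL: engineers write imperative aggregation code.
totals = {}
for user, clicks in conn.execute("SELECT user, clicks FROM events"):
    totals[user] = totals.get(user, 0) + clicks

# After SQL: analysts express the same question declaratively.
sql_totals = dict(conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user"))

assert totals == sql_totals  # both give {'a': 5, 'b': 7}
```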
12. Why SparkSQL?
• We know Spark
• Knowledge of tuning Spark applications can be
reused for SparkSQL
• Any table/UDF defined in a SparkSQL application
can be reused in ML applications
• SparkSQL and DataFrames will become more important
in the Spark ecosystem
14. We tried Cassandra
• Pros
• Easy to use and to build applications with
• Easy to scale up
• Hides all the heavy lifting inside the platform
• Cons
• Not so easy to maintain
• Not so easy to tune for performance
• Hides all the heavy lifting inside the platform
15. We tried Aerospike
• Pros
• Very good performance
• Easy to maintain
• Easy to scale
• Hides the heavy lifting inside the platform, with a better implementation
• Cons
• Expensive!
16. HDFS + files
• Pros
• Low cost
• Good read and write
performance on big data
• HDFS is very stable
• We know all the details
• Easy to scale up
• Cons
• We have to implement all the
details ourselves
• We have to implement all the
maintenance scripts
17. Why did we give up Aerospike?
• Cost is too high
• We prefer to spend money on people rather than
machines
18. Why did we give up Cassandra?
• We are not familiar with Cassandra (main
reason)
• Very easy to implement a POC
• Reduced a lot of effort in the starting phase
• We found it hard to maintain in later phases (again:
we are not familiar with Cassandra)
19. Why do we use HDFS/files?
• Low cost
• Implementation takes a lot of time
• But a solid engineering team isn't afraid of that
• We can control every detail
• We can build a maintainable platform
20. The main reason
• We love Spark
• We had used HDFS before, and have come to love
it even more since
24. Column Pruning and Predicate Pushdown
ID | Name | Age
1 | Alice | 23
2 | Beverly | 32
3 | Cate | 15
• Column pruning: SELECT Name FROM xxx reads only the Name column
• Predicate pushdown: SELECT * FROM xxx WHERE Age > 20 filters rows at the storage layer
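Parquet applies both ideas at the storage layer; the following plain-Python sketch (illustrative only, not how Parquet is implemented) mimics their effect on the sample table above.

```python
# The sample table as rows of (ID, Name, Age).
rows = [(1, "Alice", 23), (2, "Beverly", 32), (3, "Cate", 15)]

# Column pruning: SELECT Name FROM xxx
# Only the Name column is touched; ID and Age are never materialized.
names = [name for _id, name, _age in rows]

# Predicate pushdown: SELECT * FROM xxx WHERE Age > 20
# The filter runs during the scan, so non-matching rows are
# never handed upstream to the query engine.
adults = [row for row in rows if row[2] > 20]

print(names)   # ['Alice', 'Beverly', 'Cate']
print(adults)  # [(1, 'Alice', 23), (2, 'Beverly', 32)]
```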
25. Different Encodings
Encoding | Use Case
Run Length Encoding | Repeated data
Delta Encoding | Sequential data with an order (timestamps, auto-created IDs, ...)
Dictionary Encoding | Small-scale data sets (IPs, ...)
Prefix Encoding | Delta encoding for strings
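A minimal plain-Python sketch of three of these encodings (illustrative only; Parquet's actual implementations are byte-level and far more elaborate):

```python
from itertools import groupby

def run_length_encode(values):
    """Run Length Encoding: collapse repeated values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def delta_encode(values):
    """Delta Encoding: store the first value, then successive differences.
    Suits ordered sequences such as timestamps or auto-created IDs."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def dictionary_encode(values):
    """Dictionary Encoding: map a small set of distinct values (e.g. IPs)
    to compact integer codes."""
    table = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return table, [table[v] for v in values]
```

For example, `run_length_encode("aaabbc")` gives `[('a', 3), ('b', 2), ('c', 1)]`, `delta_encode([100, 101, 103, 106])` gives `[100, 1, 2, 3]`, and `dictionary_encode(["10.0.0.1", "10.0.0.2", "10.0.0.1"])` gives codes `[0, 1, 0]`.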
28. The real reason is
• SparkSQL treats Parquet/JSON as first-class citizens
• ORC and RCFile are not on their roadmap
• Parquet performs well in every aspect
29. Good lessons we learned
• Files (Parquet) are better storage than any other DB
• Easy to back up and replicate
• Easy to change the storage solution
• Easy to debug
• Easy to maintain
30. Conclusion
• Spark, Spark, Spark
• SparkSQL + Parquet is a very good combined
solution
• Don't blindly trust any solution/service; don't put a
critical service on a platform you don't trust
• A solid team can do anything you want