2. Agenda
• Introducing me and Appier
• How do we build our pipeline?
• Why do we use SparkSQL + HDFS?
• Why do we use Parquet?
3. Who am I?
• Data Team Lead at Appier
• Spark Code Contributor
• Personal Email: thegiive@gmail.com
• Speaker at
• Spark Summit 2014 SF
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
4. What is Appier?
• AI and data company
• Our mission is to make advertising the preferred
content that connects businesses and users
• Backed by Sequoia Capital
5. Data Team in Appier
• Deals with petabytes of data per day
• Handles a 2K~3K core cluster on AWS
• Builds and maintains a robust data pipeline
• Data correctness is a must
• Part of the pipeline needs < 1 min latency
• Overall infrastructure must be low cost
8. Heavy Spark User
• ML: custom Spark applications (no MLlib)
• ETL: Spark applications
• SQL: SparkSQL + Parquet
• Streaming: Spark Streaming + Kafka
9. Why Spark?
• We love Spark and are familiar with it
• Appier contributed >10 commits in the last quarter
• Perfect for ML applications
• A general engine for many kinds of workloads
• You don't have to learn a lot of big-data jargon
10. Why SQL is important?
• Before SparkSQL: 5 engineers coding in Scala
• After SparkSQL: all engineers can get involved in
data projects, and data analysts can query
data on their own
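To make the before/after concrete, here is a toy sketch using Python's built-in sqlite3 in place of SparkSQL (purely to keep the example self-contained): the same question answered once in imperative code that an engineer would write, and once in SQL that an analyst could write directly.

```python
import sqlite3

# Tiny in-memory table standing in for a pipeline output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 7), ("a", 2)])

# Before SQL: engineers write imperative aggregation code.
totals = {}
for user, clicks in conn.execute("SELECT user, clicks FROM events"):
    totals[user] = totals.get(user, 0) + clicks

# After SQL: analysts express the same question declaratively.
sql_totals = dict(conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user"))

assert totals == sql_totals  # both give {'a': 5, 'b': 7}
```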
12. Why SparkSQL?
• We know Spark
• Knowledge of tuning Spark applications can be
reused for SparkSQL
• Any table/UDF defined in a SparkSQL application
can be reused in ML applications
• SparkSQL and DataFrames will become more important
in the Spark ecosystem
14. We tried Cassandra
• Pros
• Easy to use and to build applications with
• Easy to scale up
• Hides all the heavy lifting inside the platform
• Cons
• Not so easy to maintain
• Not so easy to tune for performance
• Hides all the heavy lifting inside the platform
15. We tried Aerospike
• Pros
• Very good performance
• Easy to maintain
• Easy to scale
• Hides the heavy lifting inside the platform, with a better implementation
• Cons
• Expensive!
16. HDFS + files
• Pros
• Low cost
• Good read and write
performance on big data
• HDFS is very stable
• We know all the details
• Easy to scale up
• Cons
• We have to implement all the
details ourselves
• We have to implement all the
maintenance scripts
17. Why did we give up Aerospike?
• Cost is too high
• We prefer to spend money on people rather than
machines
18. Why did we give up Cassandra?
• We are not familiar with Cassandra (main
reason)
• Very easy to implement a POC
• Reduced a lot of effort in the starting phase
• We found it hard to maintain in later phases (again:
we are not familiar with Cassandra)
19. Why do we use HDFS/files?
• Low cost
• Implementation takes a lot of time
• But a solid engineering team isn't afraid of that
• We can control every detail
• We can build a maintainable platform
20. The main reason
• We love Spark
• We had used HDFS before, and have come to love
it even more since
24. Column Pruning and Predicate Pushdown
ID | Name | Age
1 | Alice | 23
2 | Beverly | 32
3 | Cate | 15
• Column pruning: SELECT Name FROM xxx reads only the Name column
• Predicate pushdown: SELECT * FROM xxx WHERE Age > 20 filters rows at the storage layer
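Parquet applies both ideas at the storage layer; the following plain-Python sketch (illustrative only, not how Parquet is implemented) mimics their effect on the sample table above.

```python
# The sample table as rows of (ID, Name, Age).
rows = [(1, "Alice", 23), (2, "Beverly", 32), (3, "Cate", 15)]

# Column pruning: SELECT Name FROM xxx
# Only the Name column is touched; ID and Age are never materialized.
names = [name for _id, name, _age in rows]

# Predicate pushdown: SELECT * FROM xxx WHERE Age > 20
# The filter runs during the scan, so non-matching rows are
# never handed upstream to the query engine.
adults = [row for row in rows if row[2] > 20]

print(names)   # ['Alice', 'Beverly', 'Cate']
print(adults)  # [(1, 'Alice', 23), (2, 'Beverly', 32)]
```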
25. Different Encodings
Encoding | Use Case
Run Length Encoding | Repeated data
Delta Encoding | Sequential data with an order (timestamps, auto-created IDs, ...)
Dictionary Encoding | Small-scale data sets (IPs, ...)
Prefix Encoding | Delta encoding for strings
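A minimal plain-Python sketch of three of these encodings (illustrative only; Parquet's actual implementations are byte-level and far more elaborate):

```python
from itertools import groupby

def run_length_encode(values):
    """Run Length Encoding: collapse repeated values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def delta_encode(values):
    """Delta Encoding: store the first value, then successive differences.
    Suits ordered sequences such as timestamps or auto-created IDs."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def dictionary_encode(values):
    """Dictionary Encoding: map a small set of distinct values (e.g. IPs)
    to compact integer codes."""
    table = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return table, [table[v] for v in values]
```

For example, `run_length_encode("aaabbc")` gives `[('a', 3), ('b', 2), ('c', 1)]`, `delta_encode([100, 101, 103, 106])` gives `[100, 1, 2, 3]`, and `dictionary_encode(["10.0.0.1", "10.0.0.2", "10.0.0.1"])` gives codes `[0, 1, 0]`.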
28. The real reason is
• SparkSQL treats Parquet/JSON as first-class citizens
• ORC and RCFile are not on their roadmap
• Parquet performs well in every aspect
29. Good lessons we learned
• Files (Parquet) are better storage than any other DB
• Easy to back up and replicate
• Easy to change the storage solution
• Easy to debug
• Easy to maintain
30. Conclusion
• Spark, Spark, Spark
• SparkSQL + Parquet is a very good combined
solution
• Don't blindly trust any solution/service; don't put a
critical service on a platform you don't trust
• A solid team can do anything you want