These slides show how to integrate Spark, a powerful big data tool, with scikit-learn. Using Spark for data preprocessing to produce a training data set for scikit-learn can cause performance issues, so I share some tips on how to overcome them.
2. Who Am I?
● Kent (施晨揚)
● Passionate about Machine Learning & Big Data
● Father of two kids
https://www.facebook.com/texib
3. What? Key Factors That Influence Performance
• Large Raw Data Size (4 Billion Records)
• Large Number of Cookies (40 Million Records)
• Machine Learning Library - Prediction Function Cost
4. How?
● Spark
o Parallel Computing
o Scalable
o Very Powerful Data Processing Tool
o but its Machine Learning Library ….
● Python Scikit-Learn
o Very Powerful Machine Learning Library
o but parts of it can only use a single core XD
7. Use Python to Prepare Prediction Data
Pipeline: Prepare Data → Train Model → Prepare Prediction Data → Do Prediction
(Train Model: about 30 mins; Prepare Prediction Data: more than 1 week)
8. Aggregation - Where Is It Slow?
• Aggregating 4 Billion Rows down to 40 Million Cookies is a Very Time-Consuming Job (roughly 50% of the runtime), as sketched below
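For context, here is a minimal sketch of the slow path being measured, assuming the raw log carries per-event counters keyed by cookie; the field names (cookie, clicks, views) and the tiny in-memory RDD are illustrative stand-ins, not from the slides.

from pyspark import SparkContext

sc = SparkContext(appName="aggregation-sketch")

# Stand-in for the 4-billion-row raw log; the field names are illustrative.
raw_rdd = sc.parallelize([
    {"cookie": "a", "clicks": 1, "views": 3},
    {"cookie": "a", "clicks": 2, "views": 1},
    {"cookie": "b", "clicks": 0, "views": 5},
])

# Key every event by cookie, then reduce: this shuffles all the raw rows
# across the cluster to produce one aggregate per cookie.
events = raw_rdd.map(lambda e: (e["cookie"], (e["clicks"], e["views"])))
per_cookie = events.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(per_cookie.collect())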
9. Use mapPartitions()
• Instead of using reduceByKey() with your own aggregation logic
• How (see the sketch after these steps):
• Step 1: use the DB (Redshift) to prepare the prediction data, ordered by cookie
• Step 2: use mapPartitions() to do batch prediction locally on each partition
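Below is a minimal, self-contained sketch of step 2, assuming the DB step already produced (cookie, feature_vector) pairs; the dummy model, feature values, and the predict_partition helper are illustrative, not from the slides. The point is that predict() is called once per partition instead of once per cookie.

import numpy as np
from pyspark import SparkContext
from sklearn.linear_model import LogisticRegression

sc = SparkContext(appName="batch-prediction-sketch")

# Stand-in for the real model: a tiny classifier fitted on dummy data.
model = LogisticRegression().fit([[0.0, 0.0], [1.0, 1.0]], [0, 1])
bc_model = sc.broadcast(model)  # ship the fitted model to every executor once

def predict_partition(rows):
    """Materialize one partition, then call predict() once on the whole
    batch instead of once per cookie."""
    rows = list(rows)
    if not rows:
        return iter([])
    cookies = [cookie for cookie, _ in rows]
    X = np.array([features for _, features in rows])
    preds = bc_model.value.predict(X)  # one vectorized call per partition
    return iter(zip(cookies, preds))

# (cookie, feature_vector) pairs, pre-sorted by cookie in the DB step.
feature_rdd = sc.parallelize(
    [("cookie-%d" % i, [float(i % 2), float(i % 2)]) for i in range(8)], 2)

print(feature_rdd.mapPartitions(predict_partition).collect())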
14. Conclusion
• Using the DB to pre-sort the data is better than doing the aggregation in Spark
• Batch prediction is better than predicting one record at a time
15. Another Case - Spam Article Classifier
• Article Structure Classifier
• Article Content Classifier
• Bag of Words
• High-Dimensional Feature Space
• Very Sparse Vectors
• Large Number of Documents
17. New Pipeline (a code sketch follows)
• sc.textFile → RDD of raw text
• text to terms → RDD of term lists
• distinct terms RDD → collect terms and build the vectorizer
• terms to sparse vectors (TF-IDF transform) → RDD of sparse vectors
• collect the sparse vectors into a list
• use vstack to stack the list into one sparse matrix
• build the classifier
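A rough, self-contained sketch of this pipeline under stated assumptions: documents live one per line in a hypothetical articles.txt, terms are whitespace tokens, and the vectorizer and classifier choices are illustrative; for simplicity the TF-IDF transform is applied after stacking, slightly later than in the diagram.

from collections import Counter

from pyspark import SparkContext
from scipy.sparse import csr_matrix, vstack
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

sc = SparkContext(appName="spam-classifier-sketch")

docs = sc.textFile("articles.txt")                       # sc.textFile -> RDD
terms = docs.map(lambda line: line.split())              # text to terms
vocab = {t: i for i, t in enumerate(                     # distinct terms ->
    terms.flatMap(lambda ts: ts).distinct().collect())}  # term index
bc_vocab = sc.broadcast(vocab)

def to_sparse_row(ts):
    """Turn one document's term list into a 1 x |vocab| sparse count row."""
    counts = Counter(bc_vocab.value[t] for t in ts if t in bc_vocab.value)
    cols = sorted(counts)
    data = [counts[c] for c in cols]
    return csr_matrix((data, ([0] * len(cols), cols)),
                      shape=(1, len(bc_vocab.value)))

rows = terms.map(to_sparse_row).collect()  # collect sparse vectors to a list
X = vstack(rows)                           # vstack the list -> sparse matrix
X = TfidfTransformer().fit_transform(X)    # tf-idf transform
y = [i % 2 for i in range(X.shape[0])]     # placeholder labels; real labels
clf = SGDClassifier().fit(X, y)            # build the classifier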