11. What You Probably Need Is A Team
Business Analyst: knowing how to use different tools under different circumstances
Statistician: how to process big data?
DBA: how to deal with unstructured data
Software Engineer: knowing how to use statistics
12. Four Dimensions
Single Machine: Memory, R, Local File
Cloud: Distributed, Hadoop, HDFS
Statistics: Analysis, Linear Algebra
Architect: Management, Standard
Concept: MapReduce, Linear Algebra, Logistic Regression
Tool: Hadoop, PostgreSQL, R
Analyst: How to use these tools
Hackers: R, Python, Java
13. What Do Data Scientists Do?
“80% are doing summing and averaging”
Content
1. Data Munging
2. Data Analysis
3. Interpreting Results
14. Applications of Data Analysis
Text Mining: classify spam mail
Build Index: data search engine
Social Network Analysis: find opinion leaders
Recommendation System: what do users like?
Opinion Mining: positive/negative opinion
Fraud Analysis: credit card fraud
17. Predictive Analysis
Learn from experience (data) to predict future behavior
What to predict?
e.g. Who is likely to click on that ad?
For what?
e.g. To decide which ad to show, based on click probability and revenue.
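The ad-selection idea can be sketched as an expected-value calculation: multiply each ad's predicted click probability by the revenue a click would bring, then show the ad with the highest product. This is a minimal sketch; the ad names, probabilities, and revenue figures are hypothetical and chosen only for illustration.

```python
# Hypothetical candidate ads: predicted click probability and revenue per click.
ads = {
    "cosmetics": {"click_prob": 0.15, "revenue_per_click": 2.0},
    "telecom_plan": {"click_prob": 0.05, "revenue_per_click": 8.0},
    "beer": {"click_prob": 0.10, "revenue_per_click": 2.5},
}


def expected_revenue(ad):
    """Expected revenue of one impression = P(click) * revenue per click."""
    return ad["click_prob"] * ad["revenue_per_click"]


def choose_ad(candidates):
    """Return the name of the ad with the highest expected revenue."""
    return max(candidates, key=lambda name: expected_revenue(candidates[name]))


best = choose_ad(ads)
print(best, round(expected_revenue(ads[best]), 3))
```

With these made-up numbers, the ad with the highest click probability is not the one chosen, because a rarer click on a more valuable ad is worth more on average.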
18. Surprising Conclusions
Customers who buy beer will also buy Pampers?
People who browse phone rate plans are likely to switch vendors
People in the same group tend to use the same telecom vendor
19. Predictive Model
Based on personal behavior, a predictive model uses personal characteristics to generate a probability score: the higher the score, the more likely the behavior.
20. Simple Predictive Model
Linear Model
e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying each weight by the click probability (15%) gives the probability score:
Female 13.5%, Male 1.5%
Rule Model
e.g.
If the user is female
And income is over 30k
And the user hasn't seen the ad yet
Then the click rate is 11%
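Both simple models fit in a few lines. The sketch below reproduces the 13.5% and 1.5% scores of the linear model and encodes the three-condition rule. The slide gives no click rate for users who fail the rule, so the `default` argument is a hypothetical placeholder, and the function names are illustrative.

```python
# Weights from the cosmetic-ad example: 90% female, 10% male.
GENDER_WEIGHT = {"female": 0.9, "male": 0.1}
BASE_CLICK_PROB = 0.15  # click probability from the example


def linear_score(gender):
    """Linear model: scale the click probability by the gender weight."""
    return GENDER_WEIGHT[gender] * BASE_CLICK_PROB


def rule_score(gender, income, has_seen_ad, default=None):
    """Rule model: return 11% only when all three conditions hold.

    `default` is a hypothetical placeholder for users the rule does not cover.
    """
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return default


print(f"female: {linear_score('female'):.1%}, male: {linear_score('male'):.1%}")
print(rule_score("female", 45_000, has_seen_ad=False))
```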
21. Machine Learning
Induction
From the specific to the general
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
-- Tom Mitchell (1998)
Discover an effective model
Start from a simple model
Update the model as data is fed in
Keep improving predictive power
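The loop of starting simple, updating on each example, and watching performance improve can be sketched with a one-variable linear model trained by stochastic gradient steps. In Mitchell's terms, the task T is predicting y from x, the performance measure P is mean squared error, and the experience E is the stream of examples. The data set (points on y = 2x + 1) and the learning rate are assumptions made for this illustration.

```python
# Task T: predict y from x. Performance P: mean squared error.
# Experience E: the (x, y) examples fed to the model one at a time.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]  # hypothetical data on y = 2x + 1


def mse(w, b):
    """Performance measure P: mean squared error over the data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


w, b = 0.0, 0.0          # start from a simple model that always predicts 0
learning_rate = 0.1      # assumed step size
history = [mse(w, b)]    # error before seeing any data

for epoch in range(50):
    for x, y in data:
        error = (w * x + b) - y
        # Nudge the model after each example (stochastic gradient step).
        w -= learning_rate * error * x
        b -= learning_rate * error
    history.append(mse(w, b))

print(f"MSE before: {history[0]:.3f}, after: {history[-1]:.5f}")
```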
24. Decision Tree
Rate > 1,299/Month?
  Yes: Probability to switch vendor 15%
  No: Probability to switch vendor 3%
25. Decision Tree
Rate > 1,299/Month?
  Yes: Income > 22,000?
    Yes: Probability to switch vendor 10%
    No: Probability to switch vendor 22%
  No: Probability to switch vendor 3%
26. Decision Tree
Rate > 1,299/Month?
  Yes: Income > 22,000?
    Yes: Probability to switch vendor 10%
    No: Probability to switch vendor 22%
  No: Free for intranet?
    Yes: Probability to switch vendor 1%
    No: Probability to switch vendor 7%
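A decision tree of this size is just nested conditions. The sketch below encodes the three-level tree as a function. It assumes that, within each pair of leaves, the first percentage listed belongs to the "Yes" branch; the slide text does not confirm that mapping. The `free_intranet` flag stands for the slide's "Free for intranet" test, and the function name is illustrative.

```python
def switch_probability(monthly_rate, income, free_intranet):
    """Walk the three-level tree and return the leaf's switch probability.

    Assumption: the first percentage listed for each split is the "Yes" leaf.
    """
    if monthly_rate > 1299:
        # Higher-rate customers are split further by income.
        return 0.10 if income > 22000 else 0.22
    # Lower-rate customers are split by the "Free for intranet" test.
    return 0.01 if free_intranet else 0.07


print(switch_probability(monthly_rate=1500, income=18000, free_intranet=False))
```

Each added split refines one leaf of the previous tree: the 15% leaf becomes 10% and 22%, and the 3% leaf becomes 1% and 7%.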
30. Unsupervised Learning
Dimensionality Reduction
e.g. Building a new index
Clustering
e.g. Customer segmentation
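Customer segmentation by clustering can be illustrated with a tiny one-dimensional k-means. This is a sketch only: the monthly bill amounts are hypothetical, and the deterministic choice of starting centers is a simplification made so the result is reproducible.

```python
def kmeans_1d(values, k=2, max_iterations=100):
    """Tiny 1-D k-means for k >= 2; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly through the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest center.
        new_labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels


# Hypothetical monthly bills for six customers.
monthly_bills = [300, 350, 320, 1400, 1500, 1350]
centers, labels = kmeans_1d(monthly_bills, k=2)
print(centers, labels)
```

No target column is involved: the two segments emerge from the bill amounts alone, which is what makes the method unsupervised.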
31. Lift
The better the lift, the greater the cost?
The more decision rules, the more campaigns?
Design a strategy for each persona?
The lift for 4 campaigns?
The lift for 20 campaigns?
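Lift compares the response rate inside a targeted segment with the response rate of the whole population; a lift above 1 means the segment responds better than average. The sketch below computes it per segment. The segment names and counts are hypothetical.

```python
# Hypothetical campaign segments: (name, customers contacted, responders).
segments = [
    ("high rate, low income", 1000, 220),
    ("high rate, high income", 2000, 200),
    ("low rate", 7000, 210),
]

total_customers = sum(n for _, n, _ in segments)
total_responders = sum(r for _, _, r in segments)
baseline_rate = total_responders / total_customers  # response rate with no targeting


def lift(customers, responders):
    """Lift = segment response rate / overall response rate."""
    return (responders / customers) / baseline_rate


for name, customers, responders in segments:
    print(f"{name}: lift {lift(customers, responders):.2f}")
```

Splitting customers into more segments can raise the lift of the best segment, but each extra segment also means another campaign to design and run, which is the trade-off the questions above point at.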
32. Overfitting
Can we use the butter production rate to predict the stock market?
33. Overfitting
Treating noise as information
Over-assumption
Over-interpretation
What an overfitted model learns is not the truth
Like memorizing all the answers to a single test
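The "memorizing the answers" analogy can be made concrete. In the sketch below the labels are pure coin flips, so there is nothing real to learn, yet a model that memorizes the training set scores perfectly on it. The data, the seed, and the fallback to the most common label for unseen customers are all assumptions made for this illustration.

```python
import random

random.seed(0)

# Labels are pure coin flips, so the customer id carries no real signal.
train = {cust_id: random.choice(["yes", "no"]) for cust_id in range(200)}
test = {cust_id: random.choice(["yes", "no"]) for cust_id in range(200, 400)}

# "Model" that memorizes every training answer, like cramming for one exam.
memorized = dict(train)
most_common = max(("yes", "no"), key=list(train.values()).count)


def predict(cust_id):
    """Return the memorized answer, or the most common label for unseen ids."""
    return memorized.get(cust_id, most_common)


def accuracy(dataset):
    """Share of customers in `dataset` whose label is predicted correctly."""
    return sum(predict(i) == label for i, label in dataset.items()) / len(dataset)


train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two accuracies is the signature of overfitting: the model has learned the noise in one sample, not a pattern that carries over to new data.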
36. R Language
Statistics on the fly
Built-in math and graphics functions
Free and open source
http://cran.r-project.org/src/base/
37. R Language
Functional Programming: use function definitions to retrieve answers
Interpreted Language: statistics on the fly
Object-Oriented Language: S3 and S4 methods
38. Most Used Analytic Languages
The most popular languages are R, Python (39%), SQL (37%), and SAS (20%).
By Gregory Piatetsky, Aug 27, 2013.
40. Data Scientists at Google and Apple Use R
What is your programming language of choice, R, Python or something else?
“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.”
http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/
“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred”
http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
42. Data Description
Account Information
state
account length
area code
phone number
User Behavior
international plan
voice mail plan, number vmail messages
total day minutes, total day calls, total day charge
total eve minutes, total eve calls, total eve charge
total night minutes, total night calls, total night charge
total intl minutes, total intl calls, total intl charge
number customer service calls
Target
Churn (Yes/No)
43. Split data into training and testing datasets
70% as training dataset
30% as testing dataset
> install.packages("C50")
> library(C50)
> data(churn)
> str(churnTrain)
> # drop columns that are not used as predictors
> churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
> set.seed(2)
> # assign each row to group 1 (about 70%) or group 2 (about 30%)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]
45. Prediction Result
> # churn.rp: the classification model fitted earlier on trainset
> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)
Rows: actual churn; columns: pred
        no   yes
no     859    18
yes     41   100
46. Use Confusion Matrix
> library(caret)
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics

predictions yes  no
        yes 100  18
        no   41 859

               Accuracy : 0.942
                 95% CI : (0.9259, 0.9556)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7393
 Mcnemar's Test P-Value : 0.004181
            Sensitivity : 0.70922
            Specificity : 0.97948
         Pos Pred Value : 0.84746
         Neg Pred Value : 0.95444
             Prevalence : 0.13851
         Detection Rate : 0.09823
   Detection Prevalence : 0.11591
      Balanced Accuracy : 0.84435
       'Positive' Class : yes
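The headline statistics in the confusionMatrix output follow directly from the four cell counts. The sketch below recomputes them by hand, taking "yes" (churn) as the positive class as the output states; the variable names are illustrative.

```python
# Counts from the confusion matrix ('yes' = churn is the positive class).
tp, fp = 100, 18   # predicted yes: actually yes / actually no
fn, tn = 41, 859   # predicted no:  actually yes / actually no
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)          # share of real churners that were caught
specificity = tn / (tn + fp)          # share of non-churners correctly cleared
pos_pred_value = tp / (tp + fp)       # how often a "yes" prediction is right
neg_pred_value = tn / (tn + fn)       # how often a "no" prediction is right
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what chance alone would produce.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.5f} "
      f"specificity={specificity:.5f} kappa={kappa:.4f}")
```

Accuracy looks high at about 94%, but sensitivity shows that roughly 3 in 10 actual churners are still missed, which matters when churners are the minority class.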
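47. Use Testing Data to Validate Result
library(ROCR)
# class probabilities instead of class labels
predictions <- predict(churn.rp, testset, type="prob")
# first column of the class-probability matrix
pred.to.roc <- predictions[, 1]
# true labels are taken from the last column of testset
pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
# ROC curve with the AUC in the plot title
plot(perf.tpr.rocr, colorize=TRUE, main=paste("AUC:", (perf.rocr@y.values)))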
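The AUC shown in the plot title has a simple interpretation: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way by counting pairs. It is an illustration of the definition, not the ROCR implementation, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn probabilities and true outcomes (1 = churned).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, independent of any single cutoff.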
48. Finding the Most Important Variable
library(rminer)
# fit an SVM on the training set
model=fit(churn~.,trainset,model="svm")
# sensitivity analysis to rank the input variables
VariableImportance=Importance(model,trainset,method="sensv")
L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses)
# bar chart of variable importance
mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)
49. Python Language
Dynamic Language
Execution at runtime
Dynamic Typing
Interpreted Language
See the result after execution
OOP
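A few lines show what dynamic typing and runtime execution mean in practice. This is a minimal sketch; the `Customer` class and its attributes are hypothetical.

```python
value = 42                          # the name is bound to an int
kind_before = type(value).__name__
value = "forty-two"                 # the same name is rebound to a str at runtime
kind_after = type(value).__name__


class Customer:
    """A minimal class, since Python is also object-oriented."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}"


customer = Customer("Ann")
customer.plan = "4G"                # attributes can be added to an instance at runtime
print(kind_before, kind_after, customer.greet(), customer.plan)
```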
50. Benefits of Python
Cross-platform (Python VM)
Third-party resources
(data analysis, graphics, website development)
Simple and easy to learn
55. Monitor Social Media and News
Monitor posts on social media
Configure keywords and alerts
Use a line plot to show daily post statistics
Apple Daily, NOWnews, udn, Central News Agency, Storm Media, and other financial media
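The monitoring steps (match posts against configured keywords, count matches per day, raise an alert) can be sketched with the standard library. The posts, keywords, and alert threshold below are hypothetical, and the daily counts are the series a line plot would show.

```python
from collections import Counter
from datetime import date

# Hypothetical posts: (publication date, text).
posts = [
    (date(2014, 3, 1), "New phone plan announced"),
    (date(2014, 3, 1), "Stock market closes higher"),
    (date(2014, 3, 2), "Phone plan prices criticized"),
    (date(2014, 3, 2), "Customers switch phone plan"),
    (date(2014, 3, 2), "Weather update"),
]

KEYWORDS = ["phone plan"]   # configured keywords (hypothetical)
ALERT_THRESHOLD = 2         # alert when a day has at least this many matches


def matches(text):
    """True if the post mentions any configured keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


# Count matching posts per day, then flag the days that reach the threshold.
daily_counts = Counter(day for day, text in posts if matches(text))
alerts = sorted(day for day, count in daily_counts.items() if count >= ALERT_THRESHOLD)

for day in sorted(daily_counts):
    print(day.isoformat(), daily_counts[day])
print("alert days:", [d.isoformat() for d in alerts])
```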
68. So… What is Big Data?
Knowing Who You Are?
Personal recommendation
Customer relationship management
Knowing What the Future Looks Like?
From history, we can see the future
Predictive analysis
Knowing What is Hidden Beneath?
Correlation, Correlation, Correlation
89. What You Should Do
Focus on algorithms
Divide and Conquer, Trie, Collaborative Filtering
Being an expert in a single programming language
But knowing which tools and algorithms you can use to solve your problem
Define your role
Statistician
Software engineer