Alibaba has many batch and stream processing scenarios, and many of its data analysts are non-technical users who prefer a GUI or scripting tool for working with data to support business decisions. We would like to share our experience developing algorithms on Apache Flink and building a web UI and client that make it easy to apply those algorithms to data analysis and to training and inference with machine learning models.
4. Why based on Flink/Blink
• More requirements on stream processing
• Advanced Flink architecture
• User
– Low learning curve
– Less coding
– More functions
21. Alink Functions (Part 1 of 3)
• Statistics and Visualization
- Current and History
- Basic Statistics
• Mean, Variance, StdVar, CV, StdErr, Moment, Central Moment, Skewness, Kurtosis
• Histogram, TopK, BottomK, Frequency, Percentile, Quantile, Median, Mode
• Covariance, Coef of Correlation, Cross Table, Ranking List
- Statistical Analysis
• PCA, Correspondence Analysis, Multi-collinearity
• T-Test, Chi2-Test, KS-Test, AD-Test
22. Demo for Statistics and Visualization
• IJCAI-17 Dataset
- https://tianchi.aliyun.com/datalab/index.htm
- Trading amounts and locations of Alipay users
- 19.6 million users, 67 million trades
23. Stat Demo: Current and History
• AllStat for History, stat from start to now
• WindowStat for Current, stat over last 3 seconds
• Trading amounts
• Frequency of shop_level
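The difference between the two stat modes above can be sketched in plain Python. This is only an illustration of the idea, not Alink's implementation: the class names mirror the slide, but the methods, the event format, and the sample values are assumptions made for the example. AllStat keeps a running mean and variance over everything seen so far, while WindowStat keeps only events from the last 3 seconds.

```python
from collections import deque


class AllStat:
    """Running count, mean and variance over every value seen so far (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


class WindowStat:
    """Mean over events whose timestamp lies within the last `span` seconds."""

    def __init__(self, span=3.0):
        self.span = span
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, ts, x):
        self._events.append((ts, x))
        self._total += x
        # Evict events that have fallen out of the window.
        while self._events and self._events[0][0] <= ts - self.span:
            _, old = self._events.popleft()
            self._total -= old

    def mean(self):
        return self._total / len(self._events) if self._events else float("nan")


# Hypothetical (timestamp in seconds, trading amount) events.
events = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0), (4.0, 40.0), (5.0, 50.0)]
history, current = AllStat(), WindowStat(span=3.0)
for ts, pay in events:
    history.add(pay)
    current.add(ts, pay)
```

After the loop, `history` summarizes all five amounts, while `current` reflects only the events with timestamps 4.0 and 5.0.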
25. Stat Demo: Distribution
• Get 2 data streams: shop_level=‘low’ and shop_level=‘high’
• Consider 2 features: comment_cnt and pay
• Probabilistic Distribution
27. Stat Demo: Relationship of Features
• Numerical Features: pay, comment_cnt and shop_level_int
- Multicollinearity, Coef of Correlation
• Categorical Features: province and shop_level
- Correspondence Analysis, Cross Table
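As a rough illustration of two of the measures named above, the sketch below computes a Pearson correlation coefficient for numerical columns and a cross table for categorical columns. The column names follow the slide, but the values are made-up toy records, not rows from the IJCAI-17 dataset, and the helper functions are hypothetical rather than Alink APIs.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cross_table(rows, cols):
    """Count co-occurrences of two categorical columns."""
    return Counter(zip(rows, cols))


# Hypothetical toy records.
pay = [10.0, 20.0, 30.0, 40.0]
comment_cnt = [1.0, 2.0, 3.0, 4.0]
province = ["A", "A", "B", "B"]
shop_level = ["low", "high", "low", "low"]

r = pearson(pay, comment_cnt)               # perfectly linear toy data
table = cross_table(province, shop_level)   # e.g. table[("B", "low")]
```

A correlation close to 1 or -1 between two numerical features is one simple warning sign of multicollinearity; the cross table is the input that correspondence analysis would work from.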
32. Alink Functions (Part 3 of 3)
• Classification
– Logistic Regression
– Linear SVM
– Perceptron
– Multi-Class Logistic Regression
– Multi-Class Linear SVM
– Multi-Class Perceptron
– Random Forest
– ID3
– C45
– CART
– Naïve Bayes
– KNN
• OnlineLearning
– FTRL
– Perceptron
– Passive Aggressive (PA)
– PA-I
– PA-II
• Regression
– Linear Regression
– Lasso Regression
– Ridge Regression
– Linear SVR
– Linear Regression Stepwise
• Others
– One Hot Encoding
– EvalClassification
– EvalRegression
– MLModelPrediction
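The three Passive Aggressive entries under OnlineLearning differ only in how the step size is computed. The sketch below follows the standard textbook update rules for binary labels in {-1, +1}; it is not taken from Alink, and the function name `pa_update` and its parameters are assumptions made for this example.

```python
def pa_update(w, x, y, variant="PA", C=1.0):
    """One Passive-Aggressive step for a binary label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)  # passive: margin already >= 1, or nothing to learn from
    if variant == "PA":
        tau = loss / sq_norm                      # smallest step that reaches margin 1
    elif variant == "PA-I":
        tau = min(C, loss / sq_norm)              # step size capped at C
    elif variant == "PA-II":
        tau = loss / (sq_norm + 1.0 / (2.0 * C))  # step size softened by C
    else:
        raise ValueError("unknown variant: " + variant)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


# Hypothetical single training example.
w0 = [0.0, 0.0]
x, y = [1.0, 2.0], 1
w_pa = pa_update(w0, x, y, "PA")
w_pa1 = pa_update(w0, x, y, "PA-I", C=0.1)
w_pa2 = pa_update(w0, x, y, "PA-II", C=1.0)
```

Plain PA corrects each mistake fully, which makes it sensitive to noisy labels; PA-I and PA-II trade some of that aggressiveness for robustness through the parameter C.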
33. Demo: Text Classification
• Dataset
- http://jmcauley.ucsd.edu/data/amazon/
- 142.8 million product reviews with rating value (1 ~ 5)
- Task: From review content, predict rating value
41. Demo: Recommendation of Movies
• Dataset:
– MovieLens 20M Dataset
– 20 million ratings; 27,000 movies; 138,000 users
– Rating value range: 0.5 ~ 5
• Task
– Recommend movies for each user
• Algorithm:
– ALS (Alternating Least Squares)
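To show the alternating idea behind ALS, here is a deliberately small sketch that is not Alink's implementation. It assumes a rank-1 model (one factor per user and per item) so that each least-squares step has a closed form; practical ALS uses several factors and solves a small linear system per user and per item. The ratings, the regularization value, and all function names are hypothetical.

```python
def objective(ratings, p, q, reg):
    """Regularized squared error of the rank-1 model r[u][i] ~ p[u] * q[i]."""
    err = sum((r - p[u] * q[i]) ** 2 for u, i, r in ratings)
    return err + reg * (sum(v * v for v in p) + sum(v * v for v in q))


def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """Alternate closed-form ridge solutions for user and item factors."""
    p = [1.0] * n_users
    q = [1.0] * n_items
    for _ in range(iters):
        # Fix item factors; solve each user's 1-D ridge regression.
        for u in range(n_users):
            num = sum(r * q[i] for uu, i, r in ratings if uu == u)
            den = reg + sum(q[i] ** 2 for uu, i, _ in ratings if uu == u)
            p[u] = num / den
        # Fix user factors; solve each item's 1-D ridge regression.
        for i in range(n_items):
            num = sum(r * p[u] for u, ii, r in ratings if ii == i)
            den = reg + sum(p[u] ** 2 for u, ii, _ in ratings if ii == i)
            q[i] = num / den
    return p, q


def recommend(ratings, p, q):
    """For each user, pick the unrated item with the highest predicted rating."""
    seen = {(u, i) for u, i, _ in ratings}
    best = {}
    for u in range(len(p)):
        unrated = [i for i in range(len(q)) if (u, i) not in seen]
        if unrated:
            best[u] = max(unrated, key=lambda i: p[u] * q[i])
    return best


# Hypothetical (user, movie, rating) triples within the 0.5 ~ 5 range.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 3.5), (2, 2, 1.0)]
start = objective(ratings, [1.0] * 3, [1.0] * 3, 0.1)
p, q = als_rank1(ratings, n_users=3, n_items=3, reg=0.1)
final = objective(ratings, p, q, 0.1)
picks = recommend(ratings, p, q)
```

Because each half-step exactly minimizes the objective with the other side held fixed, the regularized error never increases from one iteration to the next.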