I was invited to share with the SMU Masters of IT in Business students (i) how I got to my current position as a data scientist and (ii) what I do in that position.
The talk includes suggested areas to focus on (e.g., distributed systems and processing) and ways to gain more experience (e.g., volunteering). I also go through the problems that we solve at Lazada using machine learning and a high-level architecture of how we do it.
25. Machine Learning via MOOCs:
- Machine Learning (Stanford)
- Statistical Learning (Stanford)
- Social and Economic Networks (Stanford)
- Text Mining and Analytics (Urbana-Champaign)
26. Distributed storage and processing via MOOCs:
- Mining Massive Datasets (Stanford)
- Big data with Apache Spark (UC Berkeley)
- Scalable Machine Learning with Apache Spark (UC Berkeley)
28. Volunteer for things people don't want to do
- Volunteered for project on Twitter tracking with $0 budget
29. Twitter project: Connect to API, download tweets 24/7 over 2 weeks, analyze tweets; learnt how to:
- Work with APIs
- Recover from failure automatically
- Work with data that can't fit in memory
- Text analytics and sentiment analysis
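Two of the lessons above, recovering from failure automatically and working with data that can't fit in memory, can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the Twitter project: `with_retries`, `count_words`, the retry limits, and the choice of `ConnectionError` as the failure type are all assumptions made for the example.

```python
import time
from collections import Counter
from typing import Callable, Iterable


def with_retries(fetch: Callable[[], list], max_attempts: int = 5,
                 base_delay: float = 1.0, sleep=time.sleep) -> list:
    """Call fetch(), retrying with exponential backoff if the connection fails."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


def count_words(lines: Iterable[str]) -> Counter:
    """Count words one line at a time, so the full dataset never sits in memory."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts
```

Passing an open file handle (or any other iterator of lines) to `count_words` keeps memory use proportional to the vocabulary rather than to the number of tweets.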
41. My journey so far...
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork
- Python
- SQL
- Machine Learning
- Distributed Storage & Processing
- Finding use cases
- Software Engineering
- Designing data products
- Spark & Scala
42. So what can you do?
- Get very good at basic SQL
- Get very good at either R or Python
- Understand basic machine learning techniques
- Understand distributed systems and processing
- Improve communication by writing and sharing
- Get experience by doing projects on machine learning and distributed processing (e.g., Open data, Volunteering, Kaggle, etc.)
45. A rough guide to each role
- Engineers: Collect, store, maintain
- Scientists: Explore, prepare, model
- Tool Developers: Expose, integrate, platform-ize
- Lines may blur between roles
52. Product categorization
- Product title & description
- Machine Learning Categorization
- Rules-based Categorization
- Crowd Categorization
- Product Category
- Quality Checking and Validation
- Sufficient confidence
- If insufficient confidence
- API for self-service
- Production
- Scheduled batch jobs
- Product Category
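The slide only lists the diagram's components, so the exact flow between them is not recoverable here. One plausible reading is a confidence-based router: accept an automated category when confidence is sufficient, otherwise hand the product to crowd categorization. The sketch below assumes that reading. The order of the steps, the 0.8 threshold, the keyword rules, and the function names are all hypothetical.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the talk


def rules_categorize(title: str) -> Optional[str]:
    """Toy rules-based categorizer: a simple keyword lookup."""
    rules = {"iphone": "Mobiles", "sneaker": "Shoes"}
    for keyword, category in rules.items():
        if keyword in title.lower():
            return category
    return None


def categorize(title: str,
               model: Callable[[str], Tuple[str, float]],
               threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[Optional[str], str]:
    """Return (category, source); a category of None means 'send to the crowd'."""
    category = rules_categorize(title)
    if category is not None:
        return category, "rules"
    category, confidence = model(title)  # model returns (label, confidence)
    if confidence >= threshold:
        return category, "ml"
    return None, "crowd"  # insufficient confidence: route to crowd categorization
```

The same function could sit behind a self-service API for single products or be called from a scheduled batch job over the whole catalogue.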
53. Product Ranking for onsite display
- Product Data
- Purchase Data
- Behavioral Data (e.g., clickstream)
- Other Data (e.g., ratings, etc.)
- Merging datasets
- Feature Engineering
- Model product rankings
- Data Cleaning
- Rule-based modifiers
- Measurement & A/B Testing
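To make the "rule-based modifiers" step concrete, here is a minimal sketch of how business rules might adjust a model's ranking score before products are sorted for display. The `Product` fields, the specific rules (demote out-of-stock items, boost sponsored ones), and the weights are assumptions for illustration only. The talk does not say which modifiers are used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Product:
    sku: str
    model_score: float      # output of a ranking model, assumed to exist upstream
    in_stock: bool = True
    sponsored: bool = False


def apply_modifiers(product: Product) -> float:
    """Adjust the model score with simple, hypothetical business rules."""
    score = product.model_score
    if not product.in_stock:
        score *= 0.1        # heavily demote items that cannot be bought
    if product.sponsored:
        score += 0.2        # small boost for sponsored items
    return score


def rank(products: List[Product]) -> List[Product]:
    """Sort products by modified score, highest first."""
    return sorted(products, key=apply_modifiers, reverse=True)
```

Keeping the modifiers separate from the model makes it possible to A/B test a rule change without retraining.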
54. Recommendations for newsletter subscribers
- Product Data
- Purchase Data
- Behavioral Data (e.g., clickstream)
- Other Data (e.g., ratings, etc.)
- Merging datasets
- Feature Engineering
- Data Cleaning
- Customer Segmentation
- Forecasted Top Sellers
- Recommendations
- Newsletter Creation
- Measurement & A/B Testing
- Rule-based modifiers
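As a rough illustration of how customer segments, forecasted top sellers, and A/B testing might fit together, the sketch below recommends each subscriber the top sellers for their segment that they have not already bought, and assigns subscribers to a test variant deterministically. The function names, the data shapes, and the hash-based 50/50 split are assumptions, not the approach described in the talk.

```python
import hashlib
from typing import Dict, List, Set


def recommend(segment: str, purchased: Set[str],
              top_sellers_by_segment: Dict[str, List[str]], n: int = 3) -> List[str]:
    """Pick up to n forecasted top sellers for the segment, skipping past purchases."""
    candidates = top_sellers_by_segment.get(segment, [])
    return [sku for sku in candidates if sku not in purchased][:n]


def assign_variant(customer_id: str) -> str:
    """Deterministically split customers into two groups using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"
```

Because the assignment depends only on the customer ID, a subscriber stays in the same group across newsletters, which keeps measurement consistent.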