Sharing about my data science journey and what I do at Lazada

Hi, I’m Eugene
I’m here to share about
my data science journey and
what I do at Lazada
4th April 2016
SMU Masters of IT in Business

Before I begin, any questions you
would like addressed?
I’ll answer throughout my sharing.

Studied Psychology and Business
at Singapore Management
University (SMU); wanted to use
data to create positive impact

Did economic and political analysis at
Ministry of Trade & Industry (MTI)

Joined IBM to pursue passion
in working with data

First step into data science
as a data analyst, where I…

Developed dashboards and analytics for
end-to-end supply chain optimization

Worked on an anti-money laundering and
entity resolution system for a global bank

Collected and analyzed tweets to provide insight on
tweet share and sentiment for electronics conglomerate

Then, was transferred to workforce analytics team,
working on data from IBM’s 450k employees to build…

Forecast models for global job demand to
optimize recruitment and workforce allocation

Job recommendation engine to increase internal
transfers, skill renewal, satisfaction, and reduce attrition

Currently at Lazada’s Data
Science team; more later

Skill sets needed to be a data analyst
and how I acquired them

Probability, statistics and
experimental design from
education in Psychology

Technical skills in SPSS Statistics and R from
undergraduate education in Psychology

Written and verbal communication from essays and
presentations (SMU), and briefs and stakeholder
engagement with industry leaders (MTI)

Teamwork from projects in SMU and MTI

Skill sets needed to be a data scientist
and how I acquired them
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork

More R via MOOCs:
- Data Analysis and statistical inference (Duke)
- Computing for Data Analysis (Johns Hopkins)

Python via MOOCs:
- Computer Science and Programming in Python (MIT)
- Interactive programming in Python (Rice)

SQL via any site with in-browser query engine

Machine Learning via MOOCs:
- Machine Learning (Stanford)
- Statistical Learning (Stanford)
- Social and Economic Networks (Stanford)
- Text Mining and Analytics (Urbana-Champaign)

Distributed storage and processing via MOOCs:
- Mining Massive Datasets (Stanford)
- Big data with Apache Spark (UC Berkeley)
- Scalable Machine Learning with Apache Spark (UC Berkeley)

Learning alone is insufficient;
I also had to practice (a lot)

Volunteer for things people don’t want to do
- Volunteered for project on Twitter tracking with $0 budget

Twitter project: Connect to API, download tweets
24/7 over 2 weeks, analyze tweets; learnt how to:
- Work with APIs
- Recover from failure automatically
- Work with data that can’t fit in memory
- Text analytics and sentiment analysis
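
Two of the lessons above, recovering from failure automatically and working with data that can't fit in memory, can be sketched roughly as follows. This is a minimal illustration, not the code from the actual project: `with_retries`, `count_matching`, and `flaky_fetch` are hypothetical names, the sample tweets are made up, and a real collector would call the Twitter API instead of the simulated fetch.

```python
import time

def with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)

def count_matching(lines, keyword):
    """Stream over an iterable of lines (e.g. an open file) one at a time,
    so the full dataset never has to sit in memory."""
    keyword = keyword.lower()
    return sum(1 for line in lines if keyword in line.lower())

# Simulated source that fails twice before succeeding (assumption for the demo).
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")
    return ["Love my new phone", "Battery died again", "PHONE screen is great"]

# Pass a no-op sleep so the demo does not actually wait between retries.
tweets = with_retries(flaky_fetch, sleep=lambda seconds: None)
phone_mentions = count_matching(tweets, "phone")
print(phone_mentions)
```

In a long-running collector, the same retry wrapper would sit around each API request, and `count_matching` would be fed an open file handle rather than a list.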

Volunteer with DataKind SG to help NGOs
tackle problems through data science

Volunteer to facilitate Johns Hopkins Data Science
Specialization (Statistical Inference)

Kaggle meaningfully on competitions with real-world
applications; competitions I’ve tried include…

Otto Product Classification:
Classify products into 9 main
product categories

Springleaf Marketing Response:
Predict if customers will respond to direct mail

Telstra Network Disruptions:
Predict severity of service disruption

Skill sets to be a better data scientist
(what I’m focusing on now)
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork
- Python
- SQL
- Machine Learning
- Distributed Storage & Processing

Finding problems and opportunities
people overlook

Designing and building
data products end-to-end

Building data products
using Spark (Scala)

My journey so far…
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork
- Python
- SQL
- Machine Learning
- Distributed Storage & Processing
- Finding use cases
- Software Engineering
- Designing data products
- Spark & Scala

So what can you do?
- Get very good at basic SQL
- Get very good at either R or Python
- Understand basic machine learning techniques
- Understand distributed systems and processing
- Improve communication by writing and sharing
- Get experience by doing projects on machine
learning and distributed processing (e.g., open
data, volunteering, Kaggle)

Lazada Data Science: Data Engineers,
Scientists, Tool Developers

A rough guide to each role
- Engineers: collect, store, maintain
- Scientists: explore, prepare, model
- Tool Developers: expose, integrate, platform-ize
Lines may blur between roles

Product-related:
- Product Categorization
- Attribute Extraction
- Spam Detection
- Image Quality Checking

Consumer-related:
- Recommendations
- Product Ranking
- Consumer Segmentation
- Customer Lifetime Value

Seller-related:
- Price Elasticity
- Detecting Counterfeits

Operation-related:
- Delivery time forecasting

Product categorization
- Input: product title & description
- Categorization: machine learning, rules-based, crowd
- Branches: sufficient confidence / if insufficient confidence
- Quality checking and validation
- Output: product category
- Production: API for self-service, scheduled batch jobs
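
The confidence branch in the categorization flow above could look something like the sketch below. It is an illustration under stated assumptions, not Lazada's implementation: the 0.8 threshold, the `categorize` and `toy_predict` names, the keyword "model", and the choice to send low-confidence items to the crowd queue are all hypothetical.

```python
def toy_predict(title):
    """Hypothetical stand-in for a trained classifier.
    Returns a (category, confidence) pair."""
    keywords = {"phone": "Mobiles", "shirt": "Fashion"}
    for word, category in keywords.items():
        if word in title.lower():
            return category, 0.95
    return "Unknown", 0.30

def categorize(title, predict, crowd_queue, threshold=0.8):
    """Accept the predicted category when confidence is sufficient;
    otherwise queue the product for manual (crowd) categorization."""
    category, confidence = predict(title)
    if confidence >= threshold:
        return category
    crowd_queue.append(title)  # insufficient confidence: defer to people
    return None

crowd_queue = []
confident = categorize("Blue cotton shirt", toy_predict, crowd_queue)
uncertain = categorize("Mystery gadget", toy_predict, crowd_queue)
print(confident, uncertain, crowd_queue)
```

Keeping the threshold as a parameter makes it easy to trade automation rate against the volume of items sent for manual review.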

Product Ranking for onsite display
- Inputs: product data, purchase data, behavioral data (e.g., clickstream), other data (e.g., ratings)
- Pipeline components: merging datasets, feature engineering, modeling product rankings, data cleaning, rule-based modifiers
- Measurement & A/B testing
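
For the measurement and A/B testing step, one common way to compare two variants is a two-proportion z-test on conversion rates. The sketch below is a generic textbook version using only the Python standard library. The deck does not say which test is actually used, and the function name and the conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between
    variant A (control) and variant B (treatment)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative numbers only: 100/1000 conversions for A, 150/1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be chance alone; sample size and test duration should still be fixed before the experiment starts.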

Recommendations for newsletter subscribers
- Inputs: product data, purchase data, behavioral data (e.g., clickstream), other data (e.g., ratings)
- Pipeline components: merging datasets, feature engineering, data cleaning, customer segmentation, forecasted top sellers, recommendations, rule-based modifiers
- Newsletter creation
- Measurement & A/B testing

Majority of time spent coding (thankfully)
- Coding: 55%
- Engagement: 30%
- Others: 15%

Coding Breakdown
- Data Preparation: 50%
- Modeling: 20%
- Productionizing: 30%

Data Preparation
- Merging data
- Imputing nulls
- Removing duplicates
- Handling outliers
- Fixing formats
- Etc, etc, etc
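
Several of the data preparation steps above can be sketched with plain Python. This is a minimal illustration on made-up records. The `sku` and `price` field names, the median imputation, and the cap at 10x the median are assumptions chosen for the example, and merging data is left out to keep it short.

```python
from statistics import median

def prepare(rows):
    """Fix formats, remove duplicates, impute nulls, and cap outliers."""
    # Fixing formats: normalize SKU casing/whitespace, parse price strings.
    cleaned = []
    for row in rows:
        price = row.get("price")
        cleaned.append({
            "sku": row["sku"].strip().upper(),
            "price": float(price) if price not in (None, "") else None,
        })
    # Removing duplicates: keep the first record seen for each SKU.
    seen, unique = set(), []
    for row in cleaned:
        if row["sku"] not in seen:
            seen.add(row["sku"])
            unique.append(row)
    # Imputing nulls: fill missing prices with the median known price.
    fill = median(r["price"] for r in unique if r["price"] is not None)
    for row in unique:
        if row["price"] is None:
            row["price"] = fill
    # Handling outliers: cap prices at 10x the median (arbitrary choice).
    cap = 10 * fill
    for row in unique:
        row["price"] = min(row["price"], cap)
    return unique

raw = [
    {"sku": " a1 ", "price": "10.0"},
    {"sku": "A1", "price": "10.0"},   # duplicate once the SKU is normalized
    {"sku": "b2", "price": None},     # missing price
    {"sku": "c3", "price": "20"},
    {"sku": "d4", "price": "9999"},   # likely a data entry error
]
prepared = prepare(raw)
print(prepared)
```

At real data volumes the same steps would more likely be written with pandas or Spark, but the order of operations is similar.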

Building the model
- Feature engineering
- Machine learning
- Validation
- Iterate, iterate, iterate

Deploying to production
- Proof-of-concept
- Developing API
- Scheduling jobs
- Continuous integration
- Fixing bugs

Engagement (with stakeholders)
- Roadmap planning (quarterly)
- Aligning solution with problem
- Explaining and getting buy-in

Other tasks
- Providing assistance
- Research and brainstorming
- Team sharing

Any further questions?
eugeneyanziyou@gmail.com
eugene.yan@lazada.com
