2. Who am I?
● Data Scientist at ZEFR, ad tech, LA
● Previously worked in healthcare, SaaS, and finance
3. Agenda
● Data Science
● My perspective
○ Problems
○ Pitfalls
○ Minimum skills
○ How to build your skills
● Resources
4. Data Science, a short history
● 1960, Peter Naur used it as a substitute for computer science
● 1997, C. F. Jeff Wu gave the “Statistics = Data Science?” lecture
● 2008, DJ Patil and Jeff Hammerbacher used “data scientist” to describe their job
● 2011, McKinsey predicted a shortage of 140k analysts and 1.5M managers by 2018
● 2015, “Data Scientists Don’t Scale”
● 2016, “Why You’re Not Getting Value from Your Data Science”
https://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/
7. Data Science, too broad
● BI Analyst/Engineer
● Analytics Engineer
● Data Engineer
● Statistician
● Research Scientist
● Machine Learning Engineer
● AI Engineer
● Solutions Specialist (with analytical background)
● Software Architect
● Financial Modeler
● Actuary
● ...
8. Data Science, definition
“Data Scientist is a Data Analyst who lives in California”
“Data Scientist is statistics on a Mac”
“...someone who is better at statistics than any software engineer and better at software
engineering than any statistician”
10. Data Science, process
● Data wrangling (get data from any source, reshape, scale up if needed)
● Problem formulation and modeling (ML, DL, AI)
● Communicate the findings (visualization, UI/UX)
● Productize (SWE, Data Engineering, DevOps)
In the context of:
● Benefit (business value)
● Cost (development, infrastructure, and architecture)
11. My perspective, what does ZEFR do?
● Ingest hundreds of millions of videos per day
● Help brands show relevant ads
● Identify content for monetization
● Data science
○ Optimize advertising campaigns
○ Forecast inventory
○ Process text, image, audio, and video
○ Petabyte scale
12. My perspective, scale and automation
Requirements
● Billions of examples, millions of features to train the models on
● Scoring on a similar scale of data
● Models retrained in near real time
Implications
● Have to use cloud computing and distributed systems
● Small deltas in quality and algorithm efficiency are magnified into massive cost or
benefit deltas
● Solid software engineering and automation are key
13. My perspective, example
Task
● Train a better forecasting model (vs. a benchmark statistic)
● Hundreds of terabytes of historical data available
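One way to make "better than a benchmark statistic" concrete is to score the model and the benchmark on the same held-out period. The sketch below is a minimal illustration, not the method used at ZEFR: it assumes the benchmark is the historical mean and the metric is mean absolute error, and all numbers and function names are hypothetical.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observations and forecasts."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def skill_vs_benchmark(actual, model_forecast, benchmark_forecast):
    """Fractional MAE reduction relative to the benchmark.

    Positive means the model beats the benchmark. Assumes the benchmark
    error is non-zero.
    """
    model_mae = mean_absolute_error(actual, model_forecast)
    benchmark_mae = mean_absolute_error(actual, benchmark_forecast)
    return 1.0 - model_mae / benchmark_mae


# Hypothetical toy numbers, for illustration only.
history = [100, 110, 120, 130]
actual = [140, 150, 160]                   # held-out observations
benchmark = [mean(history)] * len(actual)  # assumed benchmark: historical mean
model = [138, 152, 159]                    # made-up model output

print(skill_vs_benchmark(actual, model, benchmark))
```

A skill score near zero would suggest the extra modeling and infrastructure cost is hard to justify.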
Process
● Wrangling: pre-process and featurize (Spark, S3, Redshift)
● Modeling: VW, H2O, hyper-parameter optimization
● Communication: justify the cost of a 100-node EMR cluster ($1,000 per day)
● Productize: test, deploy, automate with Jenkins, ECS, and Kafka
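The cost figure above can be turned into a quick back-of-the-envelope calculation. The sketch below uses the slide's numbers (100 nodes, $1,000 per day) and assumes, hypothetically, that the cluster runs around the clock every day and that an optimization reduces cluster time by 5%. The function names are illustrative only.

```python
CLUSTER_COST_PER_DAY = 1_000.0  # USD per day, from the slide
NODES = 100                     # cluster size, from the slide


def cost_per_node_hour(cost_per_day, nodes, hours_per_day=24):
    """Blended cost of one node for one hour (assumes 24h operation)."""
    return cost_per_day / (nodes * hours_per_day)


def annual_savings(cost_per_day, efficiency_gain, days_per_year=365):
    """Yearly savings if the same work needs `efficiency_gain` (a fraction)
    less cluster time. Assumes the cluster runs every day of the year."""
    return cost_per_day * efficiency_gain * days_per_year


print(round(cost_per_node_hour(CLUSTER_COST_PER_DAY, NODES), 3))
print(round(annual_savings(CLUSTER_COST_PER_DAY, 0.05), 2))  # hypothetical 5% gain
```

Under these assumptions a node-hour costs roughly $0.42, and a 5% efficiency gain is worth about $18,250 per year on this one cluster.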
14. My perspective, the grind
Weeks of tuning the infrastructure, finding the right features, and reasoning through algorithm complexity
15. My perspective, pitfalls
● Unreasonable expectations
○ Hype: “just hire a few PhDs”
○ Is data science too easy?
● Throwing it over the fence*
○ Data science builds models in R/Python; engineering reimplements them in Java, C, or Scala
● Dismissing the importance of good software engineering practices
○ Use tests, understand algorithmic complexity, do code reviews, keep experiments reproducible
● Dismissing the importance of understanding and formulating the problem
○ Get out and talk to people
● Dismissing or not understanding architecture, infrastructure, and cost/benefit
* Full disclosure: the article was written by my boss, Jonathan Morra, at ZEFR
16. My perspective, data science platforms
● Many companies have recognized the disconnect between data science and
engineering
● Facebook and Uber have in-house platforms
● A number of commercial solutions: Sense, Domino Data Lab, DataScience,
DataRobot, Yhat, just to name a few
● Very expensive and inflexible in our case
https://blog.dominodatalab.com/uber-and-the-need-for-a-data-science-platform/
https://medium.com/@novakkm/the-purpose-of-platforms-in-data-science-965e2124edf8#.vwlz3idyw
https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
17. My perspective, minimum data science requirements
● Statically-typed language (C, Java, Scala)
● Dynamically-typed language (Python, R)
● SQL (lag, partition, joins, rank, nested subqueries)
● NoSQL (JSON, MongoDB, Couch)
● Data wrangling (Pandas, dplyr, Julia, PySpark, Dask)
● Command-line fu
● Cloud computing (spin up instances, S3, ssh) and environment isolation
● Software engineering best practices (testing, version control, complexity)
● ML theory (bias/variance, complexity, encoding, hashing, feature engineering)
● ML practice (sklearn, R, Julia, MLlib, H2O, TensorFlow)
● Basic stats (experiment design, hypothesis testing, moments)
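As an illustration of the "hashing" item under ML theory, here is a minimal sketch of the hashing trick, which maps an unbounded vocabulary of categorical features into a fixed-length vector. It is a pure-Python toy, not any library's implementation; the function name, the choice of MD5, and the bucket count are assumptions made for the example.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector using the hashing trick.

    Distinct tokens can collide in the same bucket; a larger n_buckets
    makes collisions less likely at the cost of a wider vector.
    """
    vector = [0] * n_buckets
    for token in tokens:
        # A stable hash is used because Python's built-in hash() for
        # strings is randomized between interpreter runs.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1
    return vector


# Hypothetical feature tokens, for illustration only.
print(hash_features(["country=US", "device=mobile", "country=US"]))
```

The vector length stays fixed no matter how many distinct tokens appear, which is what makes the approach attractive when there are millions of features.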
18. My perspective, how to build your skills
● Take courses in areas of weakness (Udacity, Coursera)
● Showcase your skills with projects on GitHub
● Write a blog about things you’re good at to refine your understanding
● Do Kaggle competitions
● Contribute to Stack Overflow and/or Cross Validated
● Contribute to open source projects (sklearn, tensorflow, dask, spark)
19. Resources
Newsletters, blogs and people to follow
Data Elixir, Data Science Weekly, The Morning Paper, Intuition Machine, The Wild
Week in AI, MLConf, Talking Machines, Partially Derivative, Brandon Rohrer, Julia
Evans, Chris Fregly, Bryan Smith, Stitch Fix, Unofficial Google Data Science Blog,
Variance Explained, Wes McKinney, Peter Norvig’s IPython notebooks, Frank Chen of
a16z, Fast Forward Labs, Chris Olah, Andrej Karpathy, OpenAI, Indico, John Cook, ...