Data Science
A practitioner’s perspective
Amir Ziai
@amirziai
Who am I?
● Data Scientist at ZEFR, ad tech, LA
● Previously worked in healthcare, SaaS, and finance
Agenda
● Data Science
● My perspective
○ Problems
○ Pitfalls
○ Minimum skills
○ How to build your skills
● Resources
Data Science, a short history
● 1960, Peter Naur used the term as a substitute for computer science
● 1997, Jeff Wu gave the “Statistics = Data Science?” lecture
● 2008, DJ Patil and Jeff Hammerbacher used “data scientist” to describe their job
● 2011, McKinsey, shortage of 140k analysts and 1.5M managers by 2018
● 2015, Data Scientists don’t scale
● 2016, Why You’re Not Getting Value from Your Data Science
https://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/
Data Science, growth
Data Science, hyped?
http://www.kdnuggets.com/wp-content/uploads/gartner-2014-hype-cycle.jpeg
Data Science, too broad
● BI Analyst/Engineer
● Analytics Engineer
● Data Engineer
● Statistician
● Research Scientist
● Machine Learning Engineer
● AI Engineer
● Solutions Specialist (with analytical background)
● Software Architect
● Financial Modeler
● Actuary
● ...
Data Science, definition
“Data Scientist is a Data Analyst who lives in California”
“Data Scientist is statistics on a Mac”
“...someone who is better at statistics than any software engineer and better at software
engineering than any statistician”
Data Science, the many Venn diagrams
Data Science, process
● Data wrangling (get data from any source, reshape, scale up if needed)
● Problem formulation and modeling (ML, DL, AI)
● Communicate the findings (visualization, UI/UX)
● Productize (SWE, Data Engineering, DevOps)
In the context of:
● Benefit (business value)
● Cost (development, infrastructure, and architecture)
My perspective, what does ZEFR do?
● Ingest hundreds of millions of videos per day
● Help brands show relevant ads
● Identify content for monetization
● Data science
○ Optimize advertising campaigns
○ Forecast inventory
○ Process text, image, audio, and video
○ Petabyte scale
My perspective, scale and automation
Requirements
● Billions of examples, millions of features to train the models with
● Scoring on a similar scale of data
● Models to be re-trained near real-time
Implications
● Have to use cloud computing and distributed systems
● Small deltas in quality and algorithm efficiency magnified to massive cost or
benefit deltas
● Solid software engineering and automation are key
My perspective, example
Task
● Train a better forecasting model (vs. a benchmark statistic)
● Hundreds of terabytes of historical data available
Process
● Wrangling: pre-process and featurize (Spark, S3, Redshift)
● Modeling: Vowpal Wabbit (VW), H2O, hyper-parameter optimization
● Communication: justify cost of 100-node EMR cluster ($1,000 per day)
● Productize: test, deploy, automate with Jenkins, ECS and Kafka
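To make the cost conversation concrete, here is a minimal Python sketch of the arithmetic behind the cluster figure above. Only the 100 nodes and $1,000 per day come from the slide. The function names, the 5% efficiency gain, and the assumption that the cluster runs every day of the year are hypothetical and chosen purely for illustration.

```python
# Figures taken from the slide.
NODES = 100
CLUSTER_COST_PER_DAY = 1000.0  # USD


def cost_per_node_hour(cluster_cost_per_day: float, nodes: int) -> float:
    """Spread the daily cluster cost evenly over nodes and 24 hours."""
    return cluster_cost_per_day / nodes / 24


def annual_savings(cluster_cost_per_day: float, time_saved_fraction: float,
                   days: int = 365) -> float:
    """Savings if the same work needs `time_saved_fraction` less cluster time.

    Assumes (hypothetically) that the cluster runs every day and that cost
    scales linearly with cluster time.
    """
    return cluster_cost_per_day * time_saved_fraction * days


node_hour = cost_per_node_hour(CLUSTER_COST_PER_DAY, NODES)
saved = annual_savings(CLUSTER_COST_PER_DAY, time_saved_fraction=0.05)

print(f"{node_hour:.3f}")  # prints 0.417 (USD per node-hour)
print(f"{saved:,.0f}")     # prints 18,250 (USD per year for a 5% gain)
```

Under these assumptions, even a small efficiency gain adds up to a five-figure annual sum for a single cluster, which is the kind of number a cost justification needs.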
My perspective, the grind
Weeks of tuning the infrastructure,
finding the right features, reasoning
through algorithm complexity
My perspective, pitfalls
● Unreasonable expectations
○ Hype, just hire a few PhDs
○ Is data science too easy?
● Throwing it over the fence*
○ Data science builds models in R/Python, engineering re-implements them in Java, C, Scala
● Dismissing the importance of good software engineering practices
○ Use tests, understand algo complexity, do code reviews, make experiments reproducible
● Dismissing the importance of understanding and formulating the problem
○ Get out and talk to people
● Dismissing or not understanding architecture, infrastructure, and cost/benefit
* Full disclosure: article is written by my boss Jonathan Morra at ZEFR
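The "use tests" and "reproducible experiments" points can be sketched in a few lines of Python. This is an illustrative example only: the function name `train_test_split`, the seed values, and the 80/20 split are assumptions, not something taken from the talk.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` with a fixed seed, then split it.

    A local random.Random instance keeps the result independent of any
    global random state, so the same seed always gives the same split.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def test_split_is_reproducible():
    rows = list(range(100))
    first = train_test_split(rows, seed=7)
    second = train_test_split(rows, seed=7)
    assert first == second                 # same seed, same split
    train, test = first
    assert len(train) == 80 and len(test) == 20
    assert sorted(train + test) == rows    # no row lost or duplicated


test_split_is_reproducible()
```

Passing the seed explicitly, rather than relying on global state, means a reviewer can rerun the experiment and get the same partition.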
My perspective, data science platforms
● Many companies have recognized the problem with the disconnect between
data science and engineering
● Facebook and Uber have in-house platforms
● A number of commercial solutions: Sense, Domino Data Lab, DataScience,
DataRobot, Yhat, just to name a few
● Very expensive and inflexible in our case
https://blog.dominodatalab.com/uber-and-the-need-for-a-data-science-platform/
https://medium.com/@novakkm/the-purpose-of-platforms-in-data-science-965e2124edf8#.vwlz3idyw
https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
My perspective, minimum data science requirements
- Statically-typed language (C, Java, Scala)
- Dynamically-typed language (Python, R)
- SQL (lag, partition, joins, rank, nested subqueries)
- NoSQL (JSON, MongoDB, Couch)
- Data wrangling (Pandas, dplyr, Julia, PySpark, Dask)
- Command-line fu
- Cloud computing (spin up instances, S3, ssh) and environment isolation
- Software engineering best practices (testing, version control, complexity)
- ML theory (bias/variance, complexity, encoding, hashing, feature engineering)
- ML practice (sklearn, R, Julia, MLlib, H2O, TensorFlow)
- Basic stats (experiment design, hypothesis testing, moments)
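As one example of the "hashing" item under ML theory, the sketch below shows the hashing trick: mapping an open-ended set of string features into a fixed-length vector. It is a minimal standard-library illustration, not the implementation used by any particular tool. The function names, the bucket count, and the example feature strings are all hypothetical.

```python
import hashlib


def bucket_index(feature: str, n_buckets: int) -> int:
    """Map a feature name to a stable bucket in [0, n_buckets).

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and therefore not reproducible across runs.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_features(tokens, n_buckets=16):
    """Turn a list of string features into a fixed-length count vector."""
    vector = [0] * n_buckets
    for token in tokens:
        vector[bucket_index(token, n_buckets)] += 1
    return vector


# Hypothetical features for a single example.
example = ["channel=music", "lang=en", "title:cat", "title:video", "title:cat"]
vec = hash_features(example, n_buckets=16)

print(len(vec), sum(vec))  # prints 16 5
```

The vector length stays fixed no matter how many distinct features appear, at the price of occasional collisions when two features land in the same bucket. More buckets mean fewer collisions but more memory.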
My perspective, how to build your skills
● Take courses in areas of weakness (Udacity, Coursera)
● Showcase your skills with projects on GitHub
● Write a blog about things you’re good at to refine your understanding
● Do Kaggle competitions
● Contribute to StackOverflow and/or CrossValidated
● Contribute to open source projects (sklearn, tensorflow, dask, spark)
Resources
Newsletters, blogs and people to follow
Data Elixir, Data Science Weekly, The Morning Paper, Intuition Machine, The Wild
Week in AI, MLConf, Talking Machines, Partially Derivative, Brandon Rohrer, Julia
Evans, Chris Fregly, Bryan Smith, Stitch Fix, Unofficial Google Data Science Blog,
Variance Explained, Wes McKinney, Peter Norvig’s IPython notebooks, Frank Chen of
a16z, Fast Forward Labs, Chris Olah, Andrej Karpathy, OpenAI, Indico, John Cook, ...
