Designing and Scoping a Data Science Project
Data Science for Beginners, Session 1
About these Sessions
Session Format
Session:
• One topic
• Learn 4-6 concepts related to that topic
• Try apps or code related to that topic
Before each session:
• Install required tools (see the ‘tool installs’ instructions sheet)
• Do background reading
Session Topics
People
• Designing a data science project
• Communicating results
Tools
• Python basics
• Enterprise data tools
Getting Data
• Acquiring data
• Cleaning and exploring data
Special data types
• Handling text data
• Handling geospatial data
• Handling big data
Learning from data
• Predicting values from data
• Learning relationships from data
• Learning classes from data
Sessions Timeline
1. Scoping a data science project
2. Python basics
3. Acquiring data
4. Communicating results
5. Cleaning and exploring data
6. Predicting values from data
7. Handling text data
8. Handling geospatial data
9. Learning relationships from data
10. Enterprise data tools
11. Learning classes from data
12. Handling big data
Session 1: your 5-7 things
• What is data science?
• Data science is a process
• What’s a data scientist?
• Data science competitions
• Writing a problem statement
What is Data Science?
Defining Data Science
“A data scientist… excels at analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.”
“The analysis of data using the scientific method”
“A data scientist is an individual, organization or application that performs statistical
analysis, data mining and retrieval processes on a large amount of data to identify
trends, figures and other relevant information.”
Understanding through Data
Data Science is a Process
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Ask an interesting question
Write hypotheses that can be explored
● Do people have more phones than toilets?
● How is Ebola spreading?
● Is using wood fires sustainable in rural Tanzania?
● Can we feed 9 billion people?
Make them simple, actionable, incremental
Get the data
● Data files (CSV, Excel, JSON, XML...)
● Databases (SQLite, MySQL, Oracle, PostgreSQL...)
● APIs
● Report tables (tables on websites, in PDF reports...)
● Text (reports and other documents…)
● Maps and GIS data (OpenStreetMap, shapefiles, NASA earth images...)
● Images (satellite images, drone footage, pictures, videos…)
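As a rough illustration of the first two source types in the list above, the sketch below reads the same tiny table from CSV text, JSON text and an in-memory SQLite database using only the Python standard library. The sample data, the table name `pumps` and the helper names are invented for this example and are not part of the session material.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = "village,pumps\nA,3\nB,5\n"
JSON_TEXT = '[{"village": "A", "pumps": 3}, {"village": "B", "pumps": 5}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts (every value stays a string)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def rows_from_sqlite(conn, query):
    """Run a query and return each result row as a dict."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]


def make_demo_db():
    """Build a throwaway in-memory database holding the same sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pumps (village TEXT, pumps INTEGER)")
    conn.executemany("INSERT INTO pumps VALUES (?, ?)", [("A", 3), ("B", 5)])
    return conn


if __name__ == "__main__":
    print(rows_from_csv(CSV_TEXT))
    print(rows_from_json(JSON_TEXT))
    print(rows_from_sqlite(make_demo_db(), "SELECT village, pumps FROM pumps ORDER BY village"))
```

Note that the CSV reader returns every value as text, while JSON and SQLite keep numbers as numbers. Differences like this are one reason a reformatting step usually follows data collection.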
Most data is small, but…
Reformat the data
Explore the data
Model the Data
Communicate results
What’s a Data Scientist?
The Data Science Venn Diagram
How do you become a data scientist?
Learning and Practice
● Kaggle - online data science competitions
● DrivenData - social good data science competitions
● InnoCentive - some data science challenges
● CrowdAnalytix - business data science competitions
Should you become a data scientist?
● Not necessarily. There are lots of data science students desperate for good problems to work on.
● You might want to become someone who can work with data scientists
● Which means learning how to specify data problems well
Problem examples:
Data Science Competitions
Who Does What
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Problem Owner
Competitor
?
DrivenData
Kaggle
DataKind
Example project: Pump It Up
Tanzania wells: “Your goal is to predict the operating condition of a waterpoint for each record in the dataset”
Example project: Cervical cancer
DrivenData competition guidelines
Impact: “… clear win for the organisation in terms of effective planning, resources saved or people served… good story around how they generate social impact…”
Challenge: “… challenging enough for a rich competition…”
Feasibility: “… the right kind of data to answer the question at hand… does it have enough signal to be useful?…”
Privacy: “… can answer this question while protecting the privacy of individuals in the dataset and the operational privacy of an organisation…”
Writing a Problem Statement
Design your project
Context: who needs this work, and what are they doing it for?
Needs: what are you trying to fix?
Vision: what do you expect your final result to look like?
Outcome: how do you get your results to the people who need them? What happens next?
Design your questions
Is the question concrete enough?
Can you translate the question into an experiment?
Is it actionable?
What actions will be taken given the answer?
What data is needed to do the analysis?
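One way to keep the project and question checklists honest is to write the answers down in a fixed template and see what is still blank. The sketch below is only an illustration: the `ProblemStatement` class, its field names and the draft answers are hypothetical, chosen to mirror the prompts on the two slides above.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemStatement:
    """Hypothetical template mirroring the design prompts."""
    context: str = ""      # who needs this work, and what for?
    needs: str = ""        # what are you trying to fix?
    vision: str = ""       # what should the final result look like?
    outcome: str = ""      # how do results reach people, and what happens next?
    question: str = ""     # the concrete, testable question
    actions: str = ""      # what actions follow from the answer?
    data_needed: str = ""  # what data does the analysis require?


def unanswered(statement):
    """Return the names of prompts that are still blank."""
    return [f.name for f in fields(statement)
            if not getattr(statement, f.name).strip()]


if __name__ == "__main__":
    # Invented draft answers, for illustration only.
    draft = ProblemStatement(
        question="Which waterpoints are likely to need repair?",
        data_needed="Waterpoint records with their operating condition",
    )
    print(unanswered(draft))
```

A draft that still lists `context` or `actions` as unanswered is probably not ready to hand to a data scientist.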
Data Science Ethics
Data Risk and Ethics
You’re responsible for your data outputs
Could your outputs increase risk to anyone?
How will you respect privacy and security?
Data Risk
Risk: “The probability of something happening multiplied by the resulting cost or benefit if it does”
Risk of: physical, legal, reputational, privacy harm
Likelihood (e.g. low, medium, high)
Risk to: data subjects, collectors, processors, releasers, users
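The definition above (probability multiplied by cost) can be turned into a simple ranking of risks. The sketch below is a minimal illustration, not a standard method: the numeric values given to "low", "medium" and "high", the 1 to 10 cost scale and the three example risks are all assumptions made up for this example.

```python
from dataclasses import dataclass

# Assumed mapping from likelihood labels to rough probabilities.
LIKELIHOOD = {"low": 0.1, "medium": 0.5, "high": 0.9}


@dataclass
class Risk:
    harm: str        # risk of: physical, legal, reputational, privacy
    affected: str    # risk to: data subjects, collectors, processors, ...
    likelihood: str  # "low", "medium" or "high"
    cost: float      # assumed relative cost on a 1-10 scale

    @property
    def score(self):
        """Risk = probability of the event times its cost."""
        return LIKELIHOOD[self.likelihood] * self.cost


def rank(risks):
    """Sort risks from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # Hypothetical risks, for illustration only.
    register = [
        Risk("privacy", "data subjects", "medium", 8),
        Risk("reputational", "releasers", "low", 5),
        Risk("physical", "collectors", "high", 10),
    ]
    for risk in rank(register):
        print(f"{risk.harm:<13} {risk.affected:<14} {risk.score:.1f}")
```

The scores only make sense relative to each other, so the ranking is more useful than the individual numbers.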
PII: Personally Identifiable Information
“Personally identifiable information (PII) is any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII.”
PII Red Flags
Names, addresses, phone numbers
Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier)
Members of small populations
Untranslated text
Codes (e.g. “41”)
Slang terms
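Some of the red flags above can be checked automatically before a dataset is shared. The sketch below covers two of them under stated assumptions: it flags column names containing a small, hypothetical keyword list, and it lists groups with fewer than `k` records as a rough stand-in for "members of small populations". The keyword list, the threshold of 5 and the column names are invented for illustration, and a check like this does not replace a manual review.

```python
import re
from collections import Counter

# Hypothetical keyword list; a real review would need a much broader one.
KEYWORDS = {"name", "address", "phone", "lat", "lon",
            "latitude", "longitude", "gps"}


def flag_columns(columns):
    """Map each suspicious column name to the keywords it contains."""
    flagged = {}
    for column in columns:
        # Split on anything that is not a letter or digit, e.g. "home_lat".
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        hits = sorted(tokens & KEYWORDS)
        if hits:
            flagged[column] = hits
    return flagged


def small_groups(records, key, k=5):
    """Return the values of `key` shared by fewer than k records."""
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n < k}


if __name__ == "__main__":
    # Invented column names and records, for illustration only.
    columns = ["respondent_name", "home_lat", "home_lon", "population", "phone"]
    print(flag_columns(columns))
    records = [{"district": "X"}] * 6 + [{"district": "Y"}] * 2
    print(small_groups(records, "district"))
```

Matching whole tokens rather than substrings avoids flagging a column such as `population` just because it happens to contain the letters "lat".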
Exercises
3-minute exercise: Ask interesting questions
Either your own questions:
Questions that data might help with
Stories you want to tell with data
Datasets you’d like to explore
Or pick an existing question:
● Competition questions: Kaggle, DrivenData
● A data science project that interested you
3-minute exercise: Get the data
Pick one of your questions
List the ideal data you need to answer it
List the data that’s (probably) available
Think about what you’ll do if the data you need isn’t available
What compromises could you make?
Where would you look for more data?
Are there proxies (other datasets that tell you something about your question)?
3-minute exercise: Design your communications
List the types of people you’d want to show your results to
How do you want them to change the world? Can they take actions, change opinions, etc.?
Describe the types of outputs that might be persuasive to them - visuals, text, numbers, stories, art… be as wild with this as you want
Things to do before next week
See file Tool Install Instructions
• Make friends with the terminal window
• Install IPython
• Install Git
