Pittsburgh Data Jam 2016
Bringing Big Data Education and Awareness to
Pittsburgh High School Students
February 26, 2016
Introductions
Saman Haqqi - President - Pittsburgh Dataworks
 saman.haqqi@pghdataworks.org
Brian Macdonald – Data Scientist – Oracle Corporation
 brian.macdonald@oracle.com
Pitt Science Outreach
 Margaret Farrell mef85@pitt.edu
 Laura Marshall LJM82@pitt.edu
 Jenny Lundahl jal225@pitt.edu
 Jackie Choffo jac335@pitt.edu
 Kyle Wiche KAW196@pitt.edu
 Chris Davis CJD81@pit.edu
Mentors
 Each team will be assigned a mentor
Can ask questions via email at any time
 Copy everyone on your team
 Copy your teacher
Pitt Science Outreach students
 Send email to all
Have a regularly scheduled call with your mentor
 Don’t wait until right before presentations.
Data Analysis Workshop
Today’s Goals
Identifying relevant variables
Depicting them graphically
Doing the analysis
Drawing conclusions
Making recommendations
What technology will you use?
Lots of tools are available
Keep it simple at the beginning
Use Excel
Tableau is also available
Many Others
 R, SAS, Cognos, Oracle Business Intelligence, Google Apps, Matlab, Python, Spotfire, QlikView
Data Analysis Process
A standard repeatable process to guide data analysis.
Used formally and informally
 If you do analysis, you will do these steps.
Used for Big Data or not so Big Data
Becomes second nature as you do more analysis.
Is not about using a cool data analysis tool
 Although they are extremely helpful.
The Data Analysis Process
Define your Problem
Identify Data
Plan your Analysis
 Explore Data
 Prepare Data
 Model Data
Tell A Story
Make Recommendations
Determine What’s Next
Today’s Focus
In practice it looks like this
https://cyberitgs.wikispaces.com/Sandbox+Yerlan
Basic Steps for Analysis
Data Exploration
Data Preparation
Build Models
Data Exploration
Exploratory Data Analysis (EDA)
 Goal is to get an understanding of what data you have
What are your variables?
Basic Statistics
Graph Data
Look for missing values
Look for outliers
Will this data help you answer your question?
Basic Statistics
Goal is to get a basic understanding of your data
 Mean (Average)
• Sum of values/Count of values
 Median
• Mid Point of Values
 Maximum, Minimum (Range)
 Standard Deviation (σ) & Variance (σ²)
• How spread out the values are compared to the mean
 Quartiles
• Nice buckets of the spread of the data
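The statistics above can also be computed outside Excel. The following is a minimal Python sketch, assuming a small made-up list of numbers (the values are not from the workshop data). It uses the population forms of standard deviation and variance; a sample-based calculation would use `statistics.stdev` and `statistics.variance` instead.

```python
import statistics

# Made-up values for illustration only (not workshop data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)            # sum of values / count of values
median = statistics.median(values)        # midpoint of the sorted values
value_range = max(values) - min(values)   # maximum minus minimum
std_dev = statistics.pstdev(values)       # population standard deviation (σ)
variance = statistics.pvariance(values)   # population variance (σ²)

# Quartiles: three cut points that split the sorted data into four buckets.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(mean, median, value_range, std_dev, variance, (q1, q2, q3))
```

The second quartile is the median, so comparing `q2` with `median` is a quick sanity check.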
Demo - Statistics in Excel
Graphing Data
Helps visualize patterns in the data
Especially with large data sets.
 https://www.mapbox.com/labs/twitter-gnip/locals/#12/40.4620/-80.0151
Spot exceptions
Use the best graph for the data types
Help tell your story
Demo - Graphing in Excel
Missing Values
Can have large impact on basic statistics
Count # of missing values of every variable (column)
Important to understand why data is missing
 Data entry
 Wasn’t collected
 Isn’t relevant
Should you use the variable?
Should you fill in missing values?
 Use mean, median, max, min, 0.
 You need to determine best method
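Counting and filling missing values can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows, the column names, and the choice of the median as the fill value are all hypothetical, and `None` stands in for an empty cell.

```python
import statistics

# Hypothetical table: one dict per row; None marks a missing value.
rows = [
    {"age": 15, "score": 82},
    {"age": None, "score": 91},
    {"age": 17, "score": None},
    {"age": 16, "score": 75},
    {"age": None, "score": 88},
]

def count_missing(rows):
    """Return {column: number of missing values} for every column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def fill_with_median(rows, column):
    """Return a copy of rows with missing entries in `column` set to the median."""
    present = [row[column] for row in rows if row[column] is not None]
    fill_value = statistics.median(present)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]

missing = count_missing(rows)
filled = fill_with_median(rows, "age")
print(missing)
print([row["age"] for row in filled])
```

The fill function returns a new list and leaves the original rows untouched, so the filled and unfilled versions can be compared before deciding which method to keep.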
Outliers
Outliers are values at the extreme
Much larger or smaller than most of your data
May have many causes
 Data Entry Error
 Instrument Malfunction
 Real Exceptional data
Is 140 °F an Outlier?
Some are easy to spot within a single variable
Some are only found with multiple variables
Outliers
Need to decide how to treat Outliers
 Is the variable ok to use? Do you question the validity of the data?
 Remove them from your data set?
 Keep them as is?
 Change the value (i.e. make it less extreme)
 Infer the real meaning
• A -90 °F temperature in Miami is likely 90 °F
Make sure you understand implications
Document your decision making
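One common way to flag outliers within a single variable is the 1.5 × IQR rule. The slides do not name a specific method, so this rule is an assumption chosen for illustration, and the temperatures below are made up. A minimal Python sketch:

```python
import statistics

# Made-up daily high temperatures in °F; 140 and -90 are planted outliers.
temps = [88, 91, 90, 87, 140, 89, 92, -90, 90, 86]

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1                      # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(temps)
print(outliers)
```

The function only flags candidates. Deciding whether to remove, keep, or correct each one is still a judgment call that should be documented.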
Demo – Missing Values &
Outlier Detection in Excel
One Last Thought on Exploring Data
You must be observant
Count the Number of F’s in the following sentence.
 You will have 15 Seconds
FINISHED FILES ARE THE RE-
SULT OF YEARS OF SCIENTIF-
IC STUDY COMBINED WITH
THE EXPERIENCE OF YEARS.
Leave your assumptions at the door!
FINISHED FILES ARE THE RE-
SULT OF YEARS OF SCIENTIF-
IC STUDY COMBINED WITH
THE EXPERIENCE OF YEARS.
Exploration Exercise
Using Excel
Sort
Filter
Summarize
Create Crosstabs
Charting
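For comparison with the Excel steps, the same four operations (sort, filter, summarize, crosstab) can be sketched in Python. The rows and column names below are hypothetical and exist only to show the mechanics.

```python
from collections import Counter

# Hypothetical survey rows (made-up values for illustration only).
rows = [
    {"grade": 9, "sport": "soccer", "hours": 4},
    {"grade": 10, "sport": "track", "hours": 6},
    {"grade": 9, "sport": "track", "hours": 3},
    {"grade": 10, "sport": "soccer", "hours": 5},
    {"grade": 9, "sport": "soccer", "hours": 2},
]

# Sort: order rows by hours, largest first.
by_hours = sorted(rows, key=lambda r: r["hours"], reverse=True)

# Filter: keep only the grade 9 rows.
grade9 = [r for r in rows if r["grade"] == 9]

# Summarize: average hours for the filtered rows.
avg_hours_grade9 = sum(r["hours"] for r in grade9) / len(grade9)

# Crosstab: count rows for every (grade, sport) combination.
crosstab = Counter((r["grade"], r["sport"]) for r in rows)

print([r["hours"] for r in by_hours], avg_hours_grade9, dict(crosstab))
```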
Basic Steps for Analysis
Data Exploration
Data Preparation
Build Models
Data Preparation
 This step will fix any issues you found during data exploration
 Fix missing values
 Remove bad data
 Create new variables
 Add/Subtract/Multiply/Divide multiple variables
 Ratios
 Binning
 Other functions like Square Root or Exponents
Anything else you feel appropriate
 Have fun and experiment. You cannot hurt the data.
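Creating new variables can be sketched as follows. This Python example assumes hypothetical home-sale rows and arbitrary bin boundaries; it shows one ratio (price divided by living area) and one binned variable.

```python
# Hypothetical home-sale rows (made-up values for illustration only).
homes = [
    {"price": 150000, "living_area": 1200},
    {"price": 240000, "living_area": 1600},
    {"price": 410000, "living_area": 2500},
]

def price_bin(price):
    """Binning: place a sale price into one of three labeled ranges.

    The cut-offs are arbitrary and chosen only for this example.
    """
    if price < 200000:
        return "low"
    if price < 400000:
        return "mid"
    return "high"

prepared = [
    {
        **home,
        # Ratio: a new variable built by dividing two existing ones.
        "price_per_area": home["price"] / home["living_area"],
        "price_bin": price_bin(home["price"]),
    }
    for home in homes
]

for row in prepared:
    print(row)
```

Building a new list keeps the original rows intact, which makes it easy to experiment and start over.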
Demo – Data Preparation
Preparation Exercise
Using Excel
Merge data
New Calculations
Fix Missing Data
Fix Outliers
Basic Steps for Analysis
Data Exploration
Data Preparation
Build Models
Explaining Insights
How do you know what you see is valid?
And not due to chance?
Correlation
http://musicthatmakesyoudumb.virgil.gr/
Correlation
The degree to which two or more attributes or measurements on the same group of elements show a tendency to vary together
Positive when values increase together
Negative when one value increases as the other decreases
http://www.mathsisfun.com/data/correlation.html
What can you tell me about this graph?
[Chart: Ice Cream Consumption/Capita (y-axis, 0.2 to 0.6) plotted against Drownings (x-axis, 0 to 80), with a linear trend line]
Does Ice Cream Consumption Cause Drowning?
Obviously not
Correlation does not imply Causation
 One may cause the other, but correlation only describes how they vary together.
 There may be other reasons, e.g. hot temperatures.
Be very cautious with Causation
 There are tests to determine causation
How do I know if variables are correlated?
R = Correlation Coefficient
 Values between -1 & 1
 Positive Correlation > 0 - As one variable increases, the other increases
 Perfect Positive Correlation = 1
 Negative Correlation < 0 - As one variable increases, the other decreases
 Perfect Negative Correlation = -1
 0 = No correlation
 Can be shown with a trend line
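The correlation coefficient can be computed by hand from its definition. The sketch below is a plain Python version of the Pearson formula, run on made-up data; in Excel, the CORREL function returns the same quantity.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases as hours increases
falling = [10, 8, 6, 4, 2]   # decreases as hours increases

r_up = pearson_r(hours, rising)
r_down = pearson_r(hours, falling)
print(r_up, r_down)
```

The two made-up series lie exactly on straight lines, so they sit at the two extremes of the range described above.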
Understanding R and R²
How do I know if variables are correlated?
R² = Coefficient of Determination
 Tells how well one variable predicts the other variable
 Values between 0 & 1
 If R² = 0.850, 85% of the total variation in y can be explained by the linear relationship between x and y
 R² is more commonly used
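The "share of variation explained" reading can be made concrete with a short calculation. This Python sketch fits a least-squares trend line to made-up data and then compares the leftover variation with the total variation in y. The function names are hypothetical; in Excel, the RSQ function or the trend line's display option gives the same value.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variation in ys explained by the fitted line."""
    slope, intercept = linear_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left over after the line is fitted.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of ys around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(round(r2, 3))
```

For a single x variable and a straight-line fit, this value equals the square of the correlation coefficient R.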
Understanding R and R²
Some Terminology
Independent Variable
 These are the variables that you modify
 In trend equation they are the X values
Dependent Variable
 These values depend on the values of the Independent variables.
 In trend equation they are the Y values
y = 0.0045x + 691.18
y is Living Area
x is Sale Price
Slope = 0.0045, Intercept = 691.18
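Once a trend equation is known, predicting a value is a single multiplication and addition. The sketch below applies the slide's equation in Python. The sale price of 200,000 is a made-up input, and the slide does not state units, so none are assumed here.

```python
SLOPE = 0.0045       # slope from the slide's trend equation
INTERCEPT = 691.18   # intercept from the slide's trend equation

def predict_living_area(sale_price):
    """Apply y = 0.0045x + 691.18, where x is sale price and y is living area."""
    return SLOPE * sale_price + INTERCEPT

# 200_000 is a made-up sale price used only to show the arithmetic:
# 0.0045 * 200000 = 900, and 900 + 691.18 = 1591.18.
predicted = predict_living_area(200_000)
print(round(predicted, 2))
```

When the sale price is 0, the prediction equals the intercept, which is a quick way to check that the two constants were not swapped.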
Demo – Modeling Data
Modeling Exercise
Using Excel
Create scatter plot
Show Coefficient of determination
Create a formula to predict a value
What did the Data Tell You
Did it support your initial question?
 What conclusions can you make?
 Make sure they are fact based
 Check your bias
What is your story?
 Is it compelling?
• Does x influence y?
 Can it support actions to be taken?
 If not, is there still some benefit?
What did the Data Tell You
What recommendations will you make?
 Will you stand behind them?
 If not, why not?
 Can they really be implemented?
 What is the value of implementing the recommendation?
What new questions would you ask?
 To clarify your analysis?
 To expand on your analysis?
 Can better questions be asked?
And the most important Item
Have
Questions?
Always ask questions!!!!
Timing
Introductions – 10 Minutes
Overview/Data exploration Lecture – 35 Minutes
Exploration Hands-on – 30 Minutes
Data Prep Lecture – 20 Minutes
Data Prep Hands-on – 25 Minutes
Data Modeling Lecture – 20 Minutes
Data Modeling Hands-on – 30 Minutes
Questions/Wrap Up – 10 Minutes
Total 3:00

2016 Pittsburgh Data Jam Student Workshop
