SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Downloaden Sie, um offline zu lesen
CS639:
Data Management for
Data Science
Lecture 1: Intro to Data Science and Course Overview
Theodoros Rekatsinas
1
2
3
Big science is data driven.
Big science is data driven.
4
Increasingly many companies see
themselves as data driven.
Increasingly many companies see
themselves as data driven.
5
Even more “traditional” companies…
The world is increasingly
driven by data…
6
This class teaches the basics of
how to use & manage data to
obtain useful insights.
Today’s Lecture
1. Motivation, Admin, & Setup
2. Introduction to Data Science
7
1. Motivation, Admin & Setup
8
Section 1
What you will learn about in this section
1. Motivation for studying Data Science
2. Administrative structure
3. Course logistics
9
Section 1
10
Section 1
Data Analysis has always been around
R.A. Fisher “Correlation does not imply causation”
Peter Luhn
A pioneer in hash coding and full text processing
Coined the term “business intelligence”
11
Section 1
Data Analysis has always been around
Introduced the box-plot
Also introduced the term “bit” (later used by Shannon)
The time of data-driven AI
John Tukey
Tom Mitchell
12
Section 1
Data Analysis has always been around
Essays on Scientific Discovery based on data-intensive science
Father of ACID
(requirements for reliable transaction processing)
Known for AI programming
(at Google)
By Jim Gray
Peter Norvig
13
Section 1
Data Analysis is popular
14
Section 1
Why should you study data science?
• Mercenary-make more $$$:
• Startups need data science talent right away = low employee #
• Massive industry…
• Intellectual:
• Science: data poor to data rich
• No idea how to handle the data!
• Fundamental ideas to/from all of CS:
• Systems, theory, AI, logic, stats, analysis….
15
Section 1
What this course is (and is not)
• Discuss the fundamentals data management for data science
workflows
• How to represent and store data
• How to extract and prepare data for analysis
• How to analyze data and how to visualize and communicate insights
• You will learn how to design data science pipelines
• This is not a databases, systems, or machine learning class. We will
touch many topics covered in these classes but we will not go into
details.
Who we are…
Instructor (me) Theo Rekatsinas
• Faculty in the Computer Sciences and part of the UW-Database Group
• Research: data integration and cleaning, statistical analytics, and machine
learning.
• thodrek@cs.wisc.edu
• Office hours: MWF after class @CS 4361
16
Section 1
17
Section 1
Frank Zou
Huawei
Wang
Communication w/ Course Staff
• Piazza https://piazza.com/wisc/spring2019/cs639
• Class mailing list: compsci639-4-s19@lists.wisc.edu
• Office hours: Listed on the website
• Also By appointment!
18
Section 1
The goal is to get you
to answer each
other’s questions so
you can benefit and
learn from each other.
The goal is to get you
to answer each
other’s questions so
you can benefit and
learn from each other.
Course Website:
https://thodrek.github.io/cs639_spring19/
19
Section 1
Course Github:
https://github.com/thodrek/cs639_spring19
20
Lectures
• Lecture slides cover essential material
• This is your best reference.
• We will have pointers as needed
• Recommended textbooks listed on website
• Try to cover same thing in many ways: Lecture, lecture slides,
homework, exams (no shock)
• Attendance makes your life easier…
Section 1
Graded Elements
• Six Programming assignments (45%)
• Focus on different aspects of data science
• Midterm (20%)
• Final exam (35%)
21
Dates are posted on
the website!!!
Dates are posted on
the website!!!
Section 1
What is expected from you
• Attend lectures
• If you don’t, it’s at your own peril
• Be active and think critically
• Ask questions, post comments on forums
• Do programming projects
• Start early and be honest
• Study for tests and exams
22
Section 1
Programming Assignments
• Six programming assignments
• Python plus Jupiter notebooks
• These are individual assignments
• Submission via Canvas
• ~1 week per programming assignments
• Ask questions, post comments on forums
• Start early!
• You have late days. The policy is described on the website
23
Section 1
Programming setup for class
24
1. For all assignments we will use the provided Virtual Machine.
• Ubuntu + necessary python libraries
• Link provided on the website
• We will not provide support for any other platform (you can still use your own machine)
2. To deploy and run the provided VM:
1. Download and Install Virtual Box: https://www.virtualbox.org/wiki/Downloads
2. Download the Class VM from:
https://www.dropbox.com/s/xjvj3jlaurzjfas/cs639_vm.ova.zip?dl=0
3. Import the VM by following the instructions here:
https://blogs.oracle.com/oswald/importing-a-vdi-in-virtualbox
4. Run the VM and login with the following credentials:
1. Username: CS639_DS_USER
2. Password: cs639_ds_user
3. Come to office hours if you need help with installation!
Please help out your
peers by posting issues
/ solutions on Piazza!
Please help out your
peers by posting issues
/ solutions on Piazza!
Section 1
2. Introduction to Data Science
25
Section 2
What you will learn about in this section
1. What is Data Science?
2. Data Science workflows
3. What should a Data Scientist know?
4. Overview of lecture coverage
26
Section 2
Data Science is an emerging field
• https://www.oreilly.com/ideas/what-is-data-science
27
Section 2
Data Science products
28
Section 2
One definition of data science
29
Section 2
Data science is a broad field that refers to the collective
processes, theories, concepts, tools and technologies that
enable the review, analysis and extraction of valuable
knowledge and information from raw data.
Source: Techopedia
Data science is not databases
30
Section 2
Databases Data Science
Data Value “Precious” “Cheap”
Data Volume Modest Massive
Examples Bank records,
Personnel records,
Census,
Medical records
Online clicks,
GPS logs,
Tweets,
Building sensor readings
Priorities Consistency,
Error recovery,
Auditability
Speed,
Availability,
Query richness
Structured Strongly (Schema) Weakly or none (Text)
Properties Transactions, ACID* CAP* theorem (2/3),
eventual consistency
Realizations SQL NoSQL:
Riak, Memcached,
MongoDB, CouchDB,
Hbase, Cassandra,…
ACID = Atomicity, Consistency, Isolation and Durability CAP = Consistency, Availability, Partition Tolerance
Data science is not databases
31
Section 2
Databases Data Science
Querying the past Querying the future
Business intelligence (BI) is the transformation of raw data into meaningful and
useful information for business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop and otherwise create new
strategic business opportunities - Wikipedia
Data science workflow
32
Section 2
https://cacm.acm.org/blogs/blog-cacm/169199-data-science-
workflow-overview-and-challenges/fulltext
Data science workflow
33
Section 2
Data science workflow
34
Section 2
Digging Around
in Data
Hypothesize
Model
Large Scale
Exploitation
Evaluate
Interpret
Clean,
prep
What is hard about Data Science
35
Section 2
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline
integrity, etc.)
• Using statistical tests correctly
• Prototype  Production transitions
• Data pipeline complexity (who do you ask?)
What is hard about Data Science
36
Section 2
What is hard about Data Science
37
Section 2
What are Data Scientists really doing?
38
Section 2
https://visit.figure-eight.com/rs/416-ZBE-
142/images/CrowdFlower_DataScienceReport_2016.pdf
Lectures: 1st part – Data Storage
1. Overview: What is data science
• Lectures 1-3
2. Foundations: Relational data models & SQL
• Lectures 4-7
• How to manipulate data with SQL, a declarative language
• reduced expressive power but the system can do more for you
• Query optimization
3. MapReduce and NoSQL systems: MapReduce, KeyValue Stores,
Graph DBs
• Lectures 8-14
• Dealing with massive amounts of data and non-relational data
39
Section 2
Lectures: 2nd part – Predictive analytics
4. Statistical Reasoning: Inference, Sampling Bayesian Methods
• Lectures 15-17
• How to reason about patterns in data
5. Machine Learning: Decision Trees, Evaluation of ML models, Ensembles
• Lectures 18-22
• Overview of different ML paradigms
6. Optimization: How to train ML models (efficiently)?
• Lecture 23
• Loss Functions, Optimization via Gradient Descent
• Stochastic Gradient Descent (SGD), Parallel SGD
40
Section 2
Lectures: 3rd part – Data Integration
7. Information Extraction: Named entity recognition and relation extraction
• Lecture 24
• How to identify entities of interest in unstructured data?
• How to find relationships between them?
8. Data Integration: Combine information from different data sources
• Lecture 25
• How to find if a real-world entity is mentioned in different sources?
• How to align data from different sources?
9. Data Cleaning: Remove errors and noise from data
• Lecture 26
• How can we detect errors in data to be used for analytics?
• How can we fix these errors automatically?
41
Section 2
Lectures: 4th part – Communicating Insights
10. Data Visualization: Creating data charts that convey interesting
findings
• Lectures 27-29
• How to convey insights most effectively?
• How to explore raw data?
11. Data Privacy: Sharing sensitive information
• Lecture 30-32
• How can we share sensitive data?
• How can we perform analytics on sensitive data?
42
Section 2

Weitere ähnliche Inhalte

Ähnlich wie Lecture_1_Intro.pdf

Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
wekineheshete
 
1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf
Ayele40
 

Ähnlich wie Lecture_1_Intro.pdf (20)

Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Data Scientists
 Data Scientists Data Scientists
Data Scientists
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 

Mehr von paijitk (12)

IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdfIDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
 
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdfลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
 
Chapter_01wht.pdf
Chapter_01wht.pdfChapter_01wht.pdf
Chapter_01wht.pdf
 
functions2-200924082810.pdf
functions2-200924082810.pdffunctions2-200924082810.pdf
functions2-200924082810.pdf
 
stringsinpython-181122100212.pdf
stringsinpython-181122100212.pdfstringsinpython-181122100212.pdf
stringsinpython-181122100212.pdf
 
Python-review1.pdf
Python-review1.pdfPython-review1.pdf
Python-review1.pdf
 
Functions_21_22.pdf
Functions_21_22.pdfFunctions_21_22.pdf
Functions_21_22.pdf
 
Functions_19_20.pdf
Functions_19_20.pdfFunctions_19_20.pdf
Functions_19_20.pdf
 
03-Data-Exploration.en.th.pptx
03-Data-Exploration.en.th.pptx03-Data-Exploration.en.th.pptx
03-Data-Exploration.en.th.pptx
 
02-Lifecycle.en.th.pptx
02-Lifecycle.en.th.pptx02-Lifecycle.en.th.pptx
02-Lifecycle.en.th.pptx
 
01. Introduction.en.th.pptx
01. Introduction.en.th.pptx01. Introduction.en.th.pptx
01. Introduction.en.th.pptx
 
Lecture_2_Stats.pdf
Lecture_2_Stats.pdfLecture_2_Stats.pdf
Lecture_2_Stats.pdf
 

Kürzlich hochgeladen

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 

Kürzlich hochgeladen (20)

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 

Lecture_1_Intro.pdf

  • 1. CS639: Data Management for Data Science Lecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1
  • 2. 2
  • 3. 3 Big science is data driven. Big science is data driven.
  • 4. 4 Increasingly many companies see themselves as data driven. Increasingly many companies see themselves as data driven.
  • 6. The world is increasingly driven by data… 6 This class teaches the basics of how to use & manage data to obtain useful insights.
  • 7. Today’s Lecture 1. Motivation, Admin, & Setup 2. Introduction to Data Science 7
  • 8. 1. Motivation, Admin & Setup 8 Section 1
  • 9. What you will learn about in this section 1. Motivation for studying Data Science 2. Administrative structure 3. Course logistics 9 Section 1
  • 10. 10 Section 1 Data Analysis has always been around R.A. Fisher “Correlation does not imply causation” Peter Luhn A pioneer in hash coding and full text processing Coined the term “business intelligence”
  • 11. 11 Section 1 Data Analysis has always been around Introduced the box-plot Also introduced the term “bit” (later used by Shannon) The time of data-driven AI John Tukey Tom Mitchell
  • 12. 12 Section 1 Data Analysis has always been around Essays on Scientific Discovery based on data-intensive science Father of ACID (requirements for reliable transaction processing) Known for AI programming (at Google) By Jim Gray Peter Norvig
  • 14. 14 Section 1 Why should you study data science? • Mercenary-make more $$$: • Startups need data science talent right away = low employee # • Massive industry… • Intellectual: • Science: data poor to data rich • No idea how to handle the data! • Fundamental ideas to/from all of CS: • Systems, theory, AI, logic, stats, analysis….
  • 15. 15 Section 1 What this course is (and is not) • Discuss the fundamentals data management for data science workflows • How to represent and store data • How to extract and prepare data for analysis • How to analyze data and how to visualize and communicate insights • You will learn how to design data science pipelines • This is not a databases, systems, or machine learning class. We will touch many topics covered in these classes but we will not go into details.
  • 16. Who we are… Instructor (me) Theo Rekatsinas • Faculty in the Computer Sciences and part of the UW-Database Group • Research: data integration and cleaning, statistical analytics, and machine learning. • thodrek@cs.wisc.edu • Office hours: MWF after class @CS 4361 16 Section 1
  • 18. Communication w/ Course Staff • Piazza https://piazza.com/wisc/spring2019/cs639 • Class mailing list: compsci639-4-s19@lists.wisc.edu • Office hours: Listed on the website • Also By appointment! 18 Section 1 The goal is to get you to answer each other’s questions so you can benefit and learn from each other. The goal is to get you to answer each other’s questions so you can benefit and learn from each other.
  • 19. Course Website: https://thodrek.github.io/cs639_spring19/ 19 Section 1 Course Github: https://github.com/thodrek/cs639_spring19
  • 20. 20 Lectures • Lecture slides cover essential material • This is your best reference. • We will have pointers as needed • Recommended textbooks listed on website • Try to cover same thing in many ways: Lecture, lecture slides, homework, exams (no shock) • Attendance makes your life easier… Section 1
  • 21. Graded Elements • Six Programming assignments (45%) • Focus on different aspects of data science • Midterm (20%) • Final exam (35%) 21 Dates are posted on the website!!! Dates are posted on the website!!! Section 1
  • 22. What is expected from you • Attend lectures • If you don’t, it’s at your own peril • Be active and think critically • Ask questions, post comments on forums • Do programming projects • Start early and be honest • Study for tests and exams 22 Section 1
  • 23. Programming Assignments • Six programming assignments • Python plus Jupiter notebooks • These are individual assignments • Submission via Canvas • ~1 week per programming assignments • Ask questions, post comments on forums • Start early! • You have late days. The policy is described on the website 23 Section 1
  • 24. Programming setup for class 24 1. For all assignments we will use the provided Virtual Machine. • Ubuntu + necessary python libraries • Link provided on the website • We will not provide support for any other platform (you can still use your own machine) 2. To deploy and run the provided VM: 1. Download and Install Virtual Box: https://www.virtualbox.org/wiki/Downloads 2. Download the Class VM from: https://www.dropbox.com/s/xjvj3jlaurzjfas/cs639_vm.ova.zip?dl=0 3. Import the VM by following the instructions here: https://blogs.oracle.com/oswald/importing-a-vdi-in-virtualbox 4. Run the VM and login with the following credentials: 1. Username: CS639_DS_USER 2. Password: cs639_ds_user 3. Come to office hours if you need help with installation! Please help out your peers by posting issues / solutions on Piazza! Please help out your peers by posting issues / solutions on Piazza! Section 1
  • 25. 2. Introduction to Data Science 25 Section 2
  • 26. What you will learn about in this section 1. What is Data Science? 2. Data Science workflows 3. What should a Data Scientist know? 4. Overview of lecture coverage 26 Section 2
  • 27. Data Science is an emerging field • https://www.oreilly.com/ideas/what-is-data-science 27 Section 2
  • 29. One definition of data science 29 Section 2 Data science is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data. Source: Techopedia
  • 30. Data science is not databases 30 Section 2 Databases Data Science Data Value “Precious” “Cheap” Data Volume Modest Massive Examples Bank records, Personnel records, Census, Medical records Online clicks, GPS logs, Tweets, Building sensor readings Priorities Consistency, Error recovery, Auditability Speed, Availability, Query richness Structured Strongly (Schema) Weakly or none (Text) Properties Transactions, ACID* CAP* theorem (2/3), eventual consistency Realizations SQL NoSQL: Riak, Memcached, MongoDB, CouchDB, Hbase, Cassandra,… ACID = Atomicity, Consistency, Isolation and Durability CAP = Consistency, Availability, Partition Tolerance
  • 31. Data science is not databases 31 Section 2 Databases Data Science Querying the past Querying the future Business intelligence (BI) is the transformation of raw data into meaningful and useful information for business analysis purposes. BI can handle enormous amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities - Wikipedia
  • 32. Data science workflow 32 Section 2 https://cacm.acm.org/blogs/blog-cacm/169199-data-science- workflow-overview-and-challenges/fulltext
  • 34. Data science workflow 34 Section 2 Digging Around in Data Hypothesize Model Large Scale Exploitation Evaluate Interpret Clean, prep
  • 35. What is hard about Data Science 35 Section 2 • Overcoming assumptions • Making ad-hoc explanations of data patterns • Overgeneralizing • Communication • Not checking enough (validate models, data pipeline integrity, etc.) • Using statistical tests correctly • Prototype  Production transitions • Data pipeline complexity (who do you ask?)
  • 36. What is hard about Data Science 36 Section 2
  • 37. What is hard about Data Science 37 Section 2
  • 38. What are Data Scientists really doing? 38 Section 2 https://visit.figure-eight.com/rs/416-ZBE- 142/images/CrowdFlower_DataScienceReport_2016.pdf
  • 39. Lectures: 1st part – Data Storage 1. Overview: What is data science • Lectures 1-3 2. Foundations: Relational data models & SQL • Lectures 4-7 • How to manipulate data with SQL, a declarative language • reduced expressive power but the system can do more for you • Query optimization 3. MapReduce and NoSQL systems: MapReduce, KeyValue Stores, Graph DBs • Lectures 8-14 • Dealing with massive amounts of data and non-relational data 39 Section 2
  • 40. Lectures: 2nd part – Predictive analytics 4. Statistical Reasoning: Inference, Sampling Bayesian Methods • Lectures 15-17 • How to reason about patterns in data 5. Machine Learning: Decision Trees, Evaluation of ML models, Ensembles • Lectures 18-22 • Overview of different ML paradigms 6. Optimization: How to train ML models (efficiently)? • Lecture 23 • Loss Functions, Optimization via Gradient Descent • Stochastic Gradient Descent (SGD), Parallel SGD 40 Section 2
  • 41. Lectures: 3rd part – Data Integration 7. Information Extraction: Named entity recognition and relation extraction • Lecture 24 • How to identify entities of interest in unstructured data? • How to find relationships between them? 8. Data Integration: Combine information from different data sources • Lecture 25 • How to find if a real-world entity is mentioned in different sources? • How to align data from different sources? 9. Data Cleaning: Remove errors and noise from data • Lecture 26 • How can we detect errors in data to be used for analytics? • How can we fix these errors automatically? 41 Section 2
  • 42. Lectures: 4th part – Communicating Insights 10. Data Visualization: Creating data charts that convey interesting findings • Lectures 27-29 • How to convey insights most effectively? • How to explore raw data? 11. Data Privacy: Sharing sensitive information • Lecture 30-32 • How can we share sensitive data? • How can we perform analytics on sensitive data? 42 Section 2