19EC2029
Dr. D. Sugumar
Associate Professor/ECE
Karunya
Lots of data is being collected and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hrs
(100 MB/s)
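The read-time figure above follows from simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential read rate; the function name is illustrative and not part of the slides.

```python
def read_time_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read size_bytes sequentially at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # roughly 2.78, which the slide rounds to 3 hrs
```

The same function can be reused for the larger volumes on the slide, for example a petabyte at the same rate.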
There's certainly a lot of it!
[Figure: data produced each year, plotted on a logarithmic scale running from 1 Petabyte through 1 Exabyte to 1 Zettabyte (1 TB = 1000 GB, 1 Petabyte == 1000 TB)]
Data produced each year
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB
For comparison
Human brain's capacity: 14 PB
100 years of HD video + audio: 60 PB
Additional marker on the chart: 120 PB
References
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
Data, data everywhere…
data → information → knowledge → wisdom
I'd call it data, not information
[Figure: volume of information over time, from small to large (terabytes, petabytes, exabytes)]
1990’s: Relational Databases & Data Warehouses
2000’s: Content Management
2010’s: Key-Value Storages & Unstructured Data
Big Data is any data that is expensive to manage and hard to extract value from
• Volume: the size of the data
• Velocity: the latency of data processing relative to the growing demand for interactivity
• Variety and Complexity: the diversity of sources, formats, quality, structures
Hmmm… where am I on this diagram?
the companies are expanding as fast as the data!
Is this really about size?
Big Data?
I agree with this…
Make data easier to use ~ by using it!
It may be true that Data Science isn't a science, but that doesn't mean it's not useful!
• Naive definition:
  • Big data only depends on the data size
  • 1 Gigabyte? 1 Terabyte? 1 Petabyte?
• Naive interpretation misses important aspects
  • Time:
    • Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second
  • Diversity:
    • Analyzing spreadsheets with numeric data is different from analyzing Web pages that contain a mixture of text and images
  • Distribution:
    • Analyzing data from a single source is different from analyzing data from multiple sources
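To make the time aspect concrete, the short sketch below converts an arrival rate of 1 Gigabyte per second into a daily volume and compares it with 1 Gigabyte per day. The names are illustrative assumptions, and decimal units (1 TB = 1000 GB) are used as elsewhere in the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_gb(gb_per_second: float) -> float:
    """Gigabytes accumulated in one day at a constant arrival rate."""
    return gb_per_second * SECONDS_PER_DAY


slow = 1.0                   # 1 GB arriving per day
fast = daily_volume_gb(1.0)  # 1 GB arriving per second, expressed per day

print(fast)         # 86400.0 GB per day, i.e. 86.4 TB
print(fast / slow)  # 86400.0 times more data to analyze each day
```

The size of a single batch is the same in both cases; only the rate differs, which is why a size-only definition misses the point.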
• Following Gartner's IT Glossary:
  • Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
• The three Vs
  • Volume
  • Velocity
  • Variety
Some people actually use 10 Vs to define big data!
• Variability
• Veracity
• Validity
• Vulnerability
• Volatility
• Visualization
• Value
• Scale of the data must be "big"
• No clear definition
• "that demand […] innovative forms of information processing" (Gartner)
[Figure: Data center storage worldwide. © Statista 2018]
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time
• Diversity in data types and data sources
• Structured
  • Data with defined types and structure
  • Example: comma separated values
• Semi-Structured
  • Textual data with parseable pattern
  • Example: XML files with schema
• Quasi-Structured
  • Textual data with erratic formats that can be formatted with effort
  • Example: Clickstream data
• Unstructured
  • Data that has no inherent structure, often with multiple formats
  • Example: Web site, videos
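The difference between the structured, semi-structured, and quasi-structured categories shows up in how much work it takes to read a value out. The sketch below uses only the Python standard library; the sample CSV, XML, and clickstream URL are made-up illustrations, not data from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Structured: fixed columns, one record per row.
csv_text = "id,price\n1,9.99\n2,4.50\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["price"]) for r in rows)

# Semi-structured: self-describing tags that a parser can walk.
xml_text = "<orders><order id='1'><price>9.99</price></order></orders>"
root = ET.fromstring(xml_text)
xml_prices = [float(p.text) for p in root.iter("price")]

# Quasi-structured: a clickstream URL needs extra effort to pull fields out.
click = "https://shop.example.com/search?q=usb+cable&page=2"
parsed = urlparse(click)
query = parse_qs(parsed.query)

print(round(total, 2))                         # 14.49
print(xml_prices)                              # [9.99]
print(parsed.path, query["q"], query["page"])  # /search ['usb cable'] ['2']
```

Unstructured data such as images or video has no comparable field to extract, so it usually needs feature extraction before any of this applies.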
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
  Social Network, Semantic Web (RDF), …
Streaming Data
  You can only afford to scan the data once
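For streaming data, where a single scan is all you can afford, statistics have to be updated incrementally. One common option is Welford's one-pass algorithm for mean and variance, sketched below; this is an illustrative technique chosen here, not one named in the slides, and the sample readings are made up.

```python
from typing import Iterable, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """One-pass (Welford) count, mean and sample variance; keeps no past items."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


# A generator stands in for a stream: each value is seen once, then gone.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, var = running_mean_variance(readings)
print(count, mean, var)
```

Memory use stays constant regardless of how many items arrive, which is the property a single-scan setting requires.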
Aggregation and Statistics
  Data warehousing and OLAP
Indexing, Searching, and Querying
  Keyword based search
  Pattern matching (XML/RDF)
Knowledge discovery
  Data Mining
  Statistical Modeling
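As a small illustration of indexing and keyword-based search, the sketch below builds an inverted index (word to document ids) and answers AND queries against it. The documents, function names, and whitespace tokenization are simplifying assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase word to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "big data storage",
    2: "data mining and statistics",
    3: "keyword based search",
}
index = build_index(docs)
print(search(index, "data"))         # {1, 2}
print(search(index, "data mining"))  # {2}
```

Building the index costs one pass over the documents; after that, each query touches only the posting sets for its own words rather than rescanning the text.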
• “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018 (McKinsey Global Institute, June 2011)
• New Data Science institutes being created or repurposed: NYU, Columbia, Washington, UCB, ...
• New degree programs, courses, boot-camps:
  • e.g., at Berkeley: Stats, I-School, CS, Astronomy…
  • One proposal (elsewhere) for an MS in “Big Data Science”
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
• Data science (DS) is a multidisciplinary field of study with the goal of addressing the challenges in big data
• Data science principles apply to all data, big and small
• Theories and techniques from many fields and disciplines are used to investigate and analyze a large amount of data to help decision makers in many industries such as science, engineering, economics, politics, finance, and education
• Computer Science
  • Pattern recognition, visualization, data warehousing, high performance computing, databases, AI
• Mathematics
  • Mathematical modeling
• Statistics
  • Statistical and stochastic modeling, probability
• Gartner’s 2014 Hype Cycle
Is "Data Science" important or just trendy?
• Companies learn your secrets, shopping patterns, and preferences
  • For example, can we know if a woman is pregnant, even if she doesn’t want us to know? Target case study
• Data Science and election (2008, 2012)
  • 1 million people installed the Obama Facebook app that gave access to info on “friends”
• Data Scientist
• The Sexiest Job of the 21st Century
• They find stories, extract knowledge. They are not reporters
• Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions
• National Security
• Cyber Security
• Business Analytics
• Engineering
• Healthcare
• And more ….
• Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, SQL)
• Data Mining
• Database Storage and Management
• Machine Learning and discovery
• Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis
Tools? Big Data? Machine Learning?
• Computational Geometry, Optimization, Stochastics, Scientific Computing, Machine Learning
• Data Structures and Algorithms, Databases, Distributed Computing, Software Engineering, Artificial Intelligence, Machine Learning
• Linear Models, Statistical Tests, Inference, Time Series Analysis, Machine Learning
• Intelligent Systems, Robotics, Marketing, Medicine, Autonomous Driving, Social Networks
• 1. Understand the statistics and machine learning concepts that are vital for data science
• 2. Learn to statistically analyze a dataset
• 3. Critically evaluate data visualizations based on their design and use for communicating stories from data
• The student will be able to
  • 1. Understand the key concepts in data science, its applications and the toolkit used by data scientists
  • 2. Realize how data is collected, managed and stored for data science
  • 3. Apply various machine learning techniques in real-world applications
  • 4. Implement data collection and management
  • 5. Apply visualization tools for data visualization
  • 6. Possess the required knowledge and expertise to become a proficient data scientist
• Module 1: Introduction to Data Analytics (7 hrs)
Introduction, Terminology, Data science process, Data science toolkit, Types of data, Introduction to Python, Data Analysis in Excel, Analytics Problem Solving, Exploratory Data Analysis, Example applications.
• Module 2: Data collection management and Statistics (7 hrs)
Introduction, Sources of data, Data collection and APIs, Exploring and fixing data, Data storage and management, Advanced SQL using multiple data sources, Statistics and Hypothesis Testing, Inferential Statistics, Big Data Storage and Processing Framework, Hadoop.
• Module 3: Data analysis (8 hrs)
Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and distributions, Variance, Distribution properties and arithmetic, Samples/CLT, Basic machine learning algorithms, Linear regression, SVM, Naive Bayes.
• Module 4: Data visualization (8 hrs)
Introduction, Types of data visualization, Data for visualization: Data types, Data encodings, Retinal variables, Mapping variables to encodings, Visual encodings, Data Visualization in Python-Superset or in Microsoft Power BI.
• Module 5: Computing and Applications (8 hrs)
Using Python for Data Science, Using Open Source R for Data Science, Using SQL in Data Science, Software Applications for Data Science, Applications of Data Science, Technologies for visualization such as Data Visualization in Microsoft Power BI.
• Module 6: Trends and Technologies (7 hrs)
Recent trends in various data collection and analysis techniques, various visualization techniques, application development methods used in data science, NYC Parking Case Study: Apache Spark.
Text Books:
• 1. Cathy O’Neil and Rachel Schutt, Doing Data Science: Straight Talk from the Frontline, O’Reilly, 2014. ISBN: 978-1-449-35865-5
• 2. Davy Cielen, Arno D. B. Meysman, Mohamed Ali, Introducing Data Science, Dreamtech Press, 2016. ISBN: 978-93-5119-937-3
Reference Books:
• 1. Joel Grus, Data Science from Scratch, O’Reilly, 2015. ISBN: 978-1-491-90142-7
• 2. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman, Mining of Massive Datasets, v2.1, Cambridge University Press, 2014. ISBN: 9781139924801
• 3. John W. Foreman, Data Smart: Using Data Science to Transform Information into Insight, Wiley, 2014. ISBN: 978-81-265-4614-5
• 4. https://github.com/maximrohit/SPARK-R-SQL-NYC-PArking-Ticket-Analysis
• Big data has a high volume, velocity, and variety
• Different data structures
  • Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
  • Maths, computer science, statistics, applications
→ Data scientists require a diverse skillset
• Not computer scientists
  • But should know about databases, data structures, algorithms, etc.
• Not mathematicians
  • But should know about optimization, stochastics, etc.
• Not statisticians
  • But should know about regression, statistical tests, etc.
• Not domain experts
  • But must work together with them
Data Scientists
Quantitative
• Maths
• Algorithms
• Statistics
Technical
• Programming
• Infrastructures
Skeptical
• Create hypotheses, but be skeptical about them
Collaborative
• Teamwork
• Communication skills
A bit of everything … but actually as much as possible of everything
• According to Microsoft Research:
  • Polymath
    • Do it all
  • Data Evangelist
    • Data analysis, disseminating and acting on insights
  • Data Preparer
    • Querying existing data, preparing data for analysis
  • Data Shapers
    • Analyzing and preparing data
  • Data Analyzer
    • Analyzing data
  • Platform Builder
    • Collect data and create infrastructures
  • Moonlighters (50%/20%)
    • Spare time data scientists
  • Insight Actors
    • Use the outcome and act on insights.
Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)

