INF2190 - Data Analytics:
Introduction, Methods and Practical
Approaches
Winter 2016 – Week 1
Dr. Attila Barta
atibarta@cs.toronto.edu
Introduction to the Course
• Instructor: Attila Barta, Ph.D. Computer Science UofT.
• Details of the course can be found in the syllabus (published on Blackboard).
• The current course is based on the course first taught by Prof. Periklis Andritsos in Winter 2014, with updates to reflect current trends in Big Data technologies.
• All material is under copyright by FI unless explicitly specified otherwise.
• Time and place: Thursday, 6:30pm-9:30pm.
Data Analytics – (old) definitions
• Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
• Data Mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.
• Business Intelligence covers data analysis that relies heavily on aggregation, focusing on business information.
(Wikipedia, Jan 2016).
Where does Data Analytics fit in the (new) big picture?
Enterprise Data Analytics Architecture – Copyright Attila Barta
The Data Analytics world has changed significantly in the last 5 years with the arrival of Big Data.
Evolution of the database technologies
• Before data analytics there was data, lots of it:
  • Hierarchical databases (early ’70s), IBM IMS still extensively in use
  • Network databases (mid ’70s), CA IDMS still in use
  • Relational databases (mid ’80s), DB2, Sybase, Oracle, MS-SQL Server
  • Object-oriented databases (early ’90s), Poet, O2
  • Data Warehouses (early ’90s)
    • It all started with RedBrick, the first time the database research community had to catch up to industry
    • The Inmon vs. Kimball debate starts, as well as normalized vs. de-normalized, star vs. snowflake schema…
  • Data Analytics (early ’90s), the famous beer and diapers story
  • Graph Databases (mid ’90s), UofT a leader in web databases, semantic databases
  • Semi-structured databases (late ’90s), ToX (UofT) still one of the best native XML databases
  • Data Mining (late ’90s)
  • Stream databases (early 2000s), network sensors (Berkeley)
  • Big Data (late 2000s)
Big Data – How we got here
• In a 2001 research report[1] Gartner analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing Big Data[2]. (Wikipedia).
• What was happening in 2001? Three major trends:
  • Volume: the Sloan Digital Sky Survey began collecting astronomical data in 2000 at a rate of 200 GB/night.
  • Velocity: sensor networks (web of things) and streaming databases (Message Oriented Middleware).
  • Variety: semi-structured and native XML databases alongside object-oriented and relational databases.
• What happened after 2001?
  • Rise of search engines and portals (Yahoo and Google):
    • Problem: how to store and query (cheaply) large amounts of (semi-structured) data.
    • Answer: Hadoop on commodity Linux farms.
  • Memory got cheaper: in-memory data grids.
  • Rise of Social Media: petabytes of pictures, unstructured and semi-structured data.
  • Increased computational power and large memory: visual analytics.
Big Data – Definitions and Examples
• In 2012, Gartner updated its definition as follows: “Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization”[3].
• In 2012, IDC defined Big Data technologies as “a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”[4].
• In 2012, Forrester characterized Big Data as “increases in data volume, velocity, variety, and variability”[5].
• Big Data Characteristics:
1. Data Volume: data size on the order of petabytes.
• Example: on June 13, 2012, Facebook announced that they had reached 100 PB of data. On November 8, 2012, they announced that their warehouse grows by half a PB per day.
2. Data Velocity: real-time processing of streaming data, including real-time analytics.
• Example: a jet engine generates 20 TB of data per hour that has to be processed in near real time.
3. Data Variety: structured, semi-structured, text, images, video, audio, etc.
• Example: 80% of enterprise data is unstructured. YouTube: 500 TB of video uploaded per year.
4. Data Variability: data flows can be inconsistent, with periodic peaks.
• Example: blogs commenting on the new BlackBerry 10; stock market data that reacts to market events.
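To get a feel for what the volume and velocity examples imply, the figures can be converted into sustained throughput. The sketch below is only back-of-the-envelope arithmetic; it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a constant data rate, neither of which is stated on the slide.

```python
# Back-of-the-envelope rates for the volume and velocity examples.
# Assumption (not from the slide): decimal units and a constant rate.
GB = 10**9
TB = 10**12
PB = 10**15


def bytes_per_second(total_bytes: float, seconds: float) -> float:
    """Average sustained throughput needed to keep up with a data source."""
    return total_bytes / seconds


# Velocity example: 20 TB of jet engine data per hour.
jet_engine = bytes_per_second(20 * TB, 3600)

# Volume example: a warehouse growing by half a PB per day.
warehouse_growth = bytes_per_second(0.5 * PB, 24 * 3600)

# Days for a warehouse growing at 0.5 PB/day to add another 100 PB.
days_to_add_100_pb = (100 * PB) / (0.5 * PB)

print(f"Jet engine:         {jet_engine / GB:.2f} GB/s")
print(f"Warehouse growth:   {warehouse_growth / GB:.2f} GB/s")
print(f"Days to add 100 PB: {days_to_add_100_pb:.0f}")
```

Under these assumptions both examples work out to several gigabytes per second of sustained ingest, which is why they are presented as near-real-time processing problems rather than batch loads.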
Big Data – Reference Architecture
An architecture for Big Data has to address the following capabilities:
1. Real-time complex event processing (including sense and response, streaming data).
2. Massive volumes of data (petabytes), relational and non-relational (e.g. social media, location, RFID).
3. Parallel processing/fast loading, typically based on Hadoop/Spark.
4. High-performance query systems based on in-memory data architectures.
5. Advanced analytics, e.g. visual analytics, columnar databases.
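The parallel processing capability is usually explained through the map/shuffle/reduce pattern popularized by Hadoop. The following is a minimal single-process sketch of that pattern in plain Python, not Hadoop or Spark code; the function names and the word-count task are illustrative assumptions. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over all documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents, for illustration only.
docs = ["big data big analytics", "data lakes and data marts"]
counts = word_count(docs)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across commodity machines, which is the property the architecture relies on.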
Big Data – Reference Architecture (contd.)
[Figure: Big Data Reference Architecture – Copyright by Attila Barta. Components shown: Virtual Infrastructure Workload Management; Infrastructure Services; Event Mgmt.; Query (SQL, non-SQL); Processing; Advanced Analytics; Data Stream; Stream Processing; Data Management (relational DBMS, non-relational DBMS, Distributed File System, In-Memory Data Grid). Annotations: shared-nothing hardware, massively parallel; commodity, own or rent; massive load via parallel processing.]
Big Data – Sample Technology Placement
[Figure: the reference architecture annotated with sample technologies. Components shown: Virtual Infrastructure Workload Management; Infrastructure Services; Event Mgmt.; Query (SQL, non-SQL); Processing; Advanced Analytics; Client Omni-Channel Interactions; Stream Processing; Data Management (relational DBMS, non-relational DBMS, Distributed File System, In-Memory Data Grid). Technology labels: Tableau, SAS, Spotfire, HANA; Tibco BusinessEvents; Tibco ActiveSpaces, HANA, Kafka; R, MapReduce, Spark SQL; HDFS, Spark, Cassandra; PaaS, IaaS.]
Traditional Data Analytics
• Enterprise Data Warehouse: highly normalized, usually multi-level, relational or star schema.
• Data Marts: a simple form of a data warehouse that is focused on a single subject (or functional area).
• Data Cubes: multi-dimensional data sets, usually specific to a certain BI tool (e.g. Cognos, BO, MS).
• OLAP: analyze multidimensional data interactively using consolidation (roll-up), drill-down, and slicing and dicing. Works on data cubes (MOLAP) or RDBMS (ROLAP).
• Mgmt. Inf. System: fixed, regularly scheduled (canned) reports, usually based on decision support systems.
• Statistical Computing (R): statistical computing and modeling packages, e.g. SAS, R.
• Diagnostic: operational analytics that address the “why did it happen” question, based on data aggregation and/or modeling.
Characteristics:
• Complex to deploy (a new data warehouse takes months to build); most run on specialized hardware (e.g. SAS deployed on AIX).
• Proprietary technologies with significant up-front and running costs; difficult to migrate to a cloud solution.
• Difficult to change both at the data source level (data warehouse) and at the analytical level (canned reports).
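The OLAP operations named above (roll-up, drill-down, slice, dice) can be illustrated on a tiny in-memory fact table. This is a sketch in plain Python under assumed data: the rows, dimension names and helper functions are hypothetical and do not correspond to any particular BI tool.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales).
FACTS = [
    (2015, "Q1", "East", "beer", 100),
    (2015, "Q1", "West", "beer", 80),
    (2015, "Q2", "East", "diapers", 60),
    (2015, "Q2", "West", "beer", 90),
    (2016, "Q1", "East", "diapers", 70),
    (2016, "Q1", "West", "diapers", 50),
]
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}
MEASURE = 4  # column index of the sales measure


def aggregate(rows, by):
    """Sum the sales measure grouped by the named dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in by)
        totals[key] += row[MEASURE]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice_cube(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows
            if all(r[DIMS[d]] in values for d, values in allowed.items())]


# Roll-up: consolidate to the coarsest time level (year).
rollup = aggregate(FACTS, by=["year"])
# Drill-down: break the yearly totals down by quarter.
drilldown = aggregate(FACTS, by=["year", "quarter"])
# Slice: only the East region, then total by product.
east_by_product = aggregate(slice_cube(FACTS, "region", "East"), by=["product"])
# Dice: beer sales in 2015, totalled by region.
beer_2015 = aggregate(dice_cube(FACTS, year={2015}, product={"beer"}), by=["region"])

print(rollup, drilldown, east_by_product, beer_2015, sep="\n")
```

A MOLAP engine would precompute aggregates like these inside the cube, while a ROLAP engine would translate each operation into GROUP BY and WHERE clauses against the relational store.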
Big Data era Data Analytics
• Stream Processing: stream processor for sensor data, multi-media, geo-location, GIS, etc.
• In-Memory Data Grid: sense-and-response capability, in-memory data aggregation.
• NO-SQL Database: object pair, document, semi-structured, XML in-memory databases.
• Columnar Database: in-memory columnar databases, support for the R language.
• Data Lakes: Distributed File System (HDFS, Cassandra) based; relational, non-relational, multi-media, sensor or document data.
• Analytical Appliances: specialized analytical hardware, e.g. Netezza, Oracle Exadata.
• Operational Reporting: real-time insights based on streaming data, e.g. sensor, geo-location, GIS, multi-media.
• Data Visualization: self-service data visualization tools, e.g. Tableau, Spotfire.
• Big Data Search: MapReduce real-time or batch search.
• Descriptive Analysis: What happened?
• Predictive Analysis: What will happen?
• Prescriptive Analysis: What to do about it? Decision support automation.
Characteristics:
• High volume and data diversity, support for new data types.
• High horizontal and vertical scalability.
• Easy to set up and change.
• Low ownership cost: mostly open source and commodity hardware, cloud solutions readily available.
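The stream processing and sense-and-response ideas above can be sketched as a sliding window held in memory that raises a flag when the recent average crosses a threshold. This is only an illustrative toy under assumed inputs; the class name, window size, threshold and readings are all hypothetical, and a production stream processor would add time-based windows, partitioning and fault tolerance.

```python
from collections import deque


class SlidingWindowMonitor:
    """Keeps the last `size` readings in memory and flags high averages."""

    def __init__(self, size: int, threshold: float) -> None:
        # deque(maxlen=...) silently drops the oldest reading when full.
        self.window = deque(maxlen=size)
        self.threshold = threshold

    def sense(self, reading: float) -> bool:
        """Add a reading; return True when the window average exceeds the threshold."""
        self.window.append(reading)
        average = sum(self.window) / len(self.window)
        return average > self.threshold


# Hypothetical sensor readings arriving one at a time.
monitor = SlidingWindowMonitor(size=3, threshold=100.0)
readings = [90, 95, 100, 110, 120, 80]

# "Response" here is just collecting the readings that triggered an alert.
alerts = [r for r in readings if monitor.sense(r)]
print(alerts)
```

The last reading (80) still triggers an alert because the decision is made on the aggregated window, not on the individual event, which is the in-memory aggregation behaviour the slide attributes to this layer.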
Objective of this course: the elusive Data Scientist…
• “Data Scientist: The Sexiest Job of the 21st Century” (Harvard Business Review, Oct 2012)
  • Data scientists today are akin to the Wall Street “quants” of the 1980s and 1990s.
  • The Hot Job of the Decade.
• 185 Data Scientist job vacancies in Toronto as of Jan 6, 2016, on Indeed Canada alone.
• How will this course qualify you?
  • Foundation in Data Mining algorithms and techniques.
  • Foundation in Big Data architecture and challenges.