1. INF2190 - Data Analytics:
Introduction, Methods and Practical
Approaches
Winter 2016 – Week 1
Dr. Attila Barta
atibarta@cs.toronto.edu
2. Introduction to the Course
Instructor: Attila Barta, Ph.D. in Computer Science, UofT.
Details of the course can be found in the syllabus (published on Blackboard).
This course is based on the course first taught by Prof. Periklis Andritsos in Winter 2014, with updates to reflect current trends in Big Data technologies.
All material is under copyright by FI unless specified explicitly.
Time and place: Thursday, 6:30pm-9:30pm.
3. Data Analytics – (old) definitions
Analysis of data is a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering
useful information, suggesting conclusions, and supporting
decision-making.
Data Mining is a particular data analysis technique that
focuses on modeling and knowledge discovery for predictive
rather than purely descriptive purposes.
Business Intelligence covers data analysis that relies
heavily on aggregation, focusing on business information.
(Wikipedia, Jan 2016).
4. Where does Data Analytics fit in the (new) big picture?
Enterprise Data Analytics Architecture – Copyright Attila Barta
The Data Analytics world changed significantly in the last 5 years with the arrival of Big Data.
5. Evolution of the database technologies
Before data analytics there was data, lots of it:
Hierarchical databases (early ’70s), IBM IMS still extensively in use
Network databases (mid ‘70s), CA IDMS still in use
Relational databases (mid ‘80s), DB2, Sybase, Oracle, MS-SQL Server
Object-oriented databases (early ’90s), Poet, O2
Data Warehouses (early ‘90s)
all started with RedBrick – the first time the database research community
had to catch up to industry
The Inmon vs Kimball debate starts, as well as normalized vs de-normalized,
star vs snowflake schema…
Data Analytics (early ’90s), the famous beer and diapers story
Graph Databases (mid ‘90s), UofT a leader in web databases, semantic databases
Semi-structured database (late ‘90s), ToX (UofT) still one of the best XML native
databases
Data Mining (late ‘90s)
Stream databases (early 2000s), network sensors – Berkeley
Big Data (late ‘2000s)
6. Big Data – How we got here
In a 2001 research report[1] Gartner analyst Doug Laney defined data growth challenges and
opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity
(speed of data in and out), and variety (range of data types and sources). Gartner, and now
much of the industry, continue to use this "3Vs" model for describing Big Data[2]. (Wikipedia).
What was happening in 2001? Three major trends:
Sloan Digital Sky Survey began collecting astronomical data in 2000 at a rate of 200GB/night – volume
Sensor networks (web of things) and streaming databases (Message Oriented Middleware) – velocity
Semi-structured databases, XML native databases beside object-oriented, relational databases – variety
What happened after 2001?
Rise of search engines and portals - Yahoo and Google:
Problem: how to store and query cheaply large amounts of (semi-structured) data.
Answer: Hadoop on commodity Linux farms.
Memory got cheaper – in-memory data grids.
Rise of Social Media – petabytes in pictures, unstructured and semi-structured data.
Increased computational power and large memory – visual analytics.
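Hadoop's answer to "store and query cheaply" rests on the MapReduce model. As a minimal illustration (not Hadoop's actual Java API), the classic word count can be sketched in a single Python process, with the map, shuffle, and reduce phases spelled out:

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit (word, 1) pairs from each input document.
    for doc in docs:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; here, a per-word count.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data lakes hold big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])  # 3 3
```

In a real cluster, the map and reduce phases run in parallel across commodity nodes and the shuffle moves data over the network; the single-process version above only shows the data flow.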
7. Big Data – Definitions and Examples
• In 2012, Gartner updated its definition as follows: "Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"[3].
• In 2012, IDC defined Big Data technologies as "a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis"[4].
• In 2012, Forrester characterized Big Data as "increases in data volume, velocity, variety, and variability"[5].
•Big Data Characteristics:
1. Data Volume: data size in order of petabytes.
• Example: on June 13, 2012, Facebook announced that they had reached 100 PB of data. On
November 8, 2012, they announced that their warehouse grows by half a PB per day.
2. Data Velocity: real time processing of streaming data, including real time analytics.
• Example: a jet engine generates 20TB data/hour that has to be processed near real time.
3. Data Variety: structured, semi-structured, text, images, video, audio, etc.
• Example: 80% of enterprise data is unstructured. YouTube - 500TB of video uploaded per year
4. Data Variability: data flows can be inconsistent with periodic peaks.
• Example: blogs commenting the new Blackberry 10; stock market data that reacts to market events.
8. Big Data – Reference Architecture
An architecture for Big Data has to address the following capabilities:
1. Real-time complex event processing (including sense and response, streaming
data).
2. Massive volumes of data (petabytes), relational and non-relational (e.g. social
media, location, RFID).
3. Parallel processing/fast loading, typically based on Hadoop/Spark.
4. High-performance query systems based on in-memory data architectures.
5. Advanced analytics, e.g. visual analytics, columnar databases.
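Capability 1 above, real-time complex event processing, usually amounts to computing aggregates over a moving window of a stream rather than over data at rest. A minimal sketch of the idea (the class name and thresholds are hypothetical, not from any product named in these slides):

```python
from collections import deque

class SlidingWindowAverage:
    """Running average over the last `size` readings of a stream."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def push(self, value):
        # Add the new reading and evict the oldest once the window is full.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

# Sensor readings arrive one at a time; "sense and respond" logic would
# fire when the window average crosses a threshold.
sensor = SlidingWindowAverage(size=3)
for reading in [10, 10, 10, 40]:
    avg = sensor.push(reading)
print(avg)  # (10 + 10 + 40) / 3 = 20.0
```

Stream engines generalize this pattern to many windows, keys, and operators running continuously over unbounded input.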
9. Big Data – Reference Architecture (contd.)
[Diagram: Big Data Reference Architecture – Copyright by Attila Barta. Layers, bottom to top: Infrastructure Services (Virtual Infrastructure, Workload Management – shared-nothing hardware, massively parallel; commodity, own or rent); Data Management (Distributed File System, In-Memory Data Grid, Relational DBMS, Non-relational DBMS – massive load via parallel processing); Stream Processing (Data Stream); Event Mgmt.; Query (SQL, non-SQL) Processing; Advanced Analytics.]
10. Big Data – Sample Technology Placement
[Diagram: the same reference architecture with sample technologies placed on each layer. Client Omni-Channel Interactions; Advanced Analytics – Tableau, SAS, Spotfire, HANA; Event Mgmt. – Tibco BusinessEvents; Query (SQL, non-SQL) Processing – R, MapReduce, Spark SQL; In-Memory Data Grid – Tibco ActiveSpaces, HANA, Kafka; Distributed File System – HDFS, Spark, Cassandra; Stream Processing; Relational and Non-relational DBMS; Infrastructure Services – PaaS, IaaS.]
11. Traditional Data Analytics
Enterprise Data Warehouse: Highly normalized, usually multi-level, relational or star schema.
Data Marts: A simple form of a data warehouse that is focused on a single subject (or functional area).
Data Cubes: Multi-dimensional data sets, usually specific to a certain BI tool (e.g. Cognos, BO, MS).
OLAP: Analyze multidimensional data interactively using consolidation (roll-up), drill-down, and slicing and dicing. Works on data cubes (MOLAP) or RDBMS (ROLAP).
Mgmt. Inf. System: Fixed, regularly scheduled (canned) reports usually based on decision support systems.
Statistical Computing (R): Statistical computing and modeling packages, e.g. SAS, R.
Diagnostic: Operational analytics that address the "why did it happen" based on data aggregation and/or modeling.
Characteristics:
• Complex to deploy (a new data warehouse takes months to build); most run on specialized hardware (e.g. SAS only runs on AIX).
• Proprietary technologies with significant up-front and running costs; difficult to migrate to a cloud solution.
• Difficult to change both at the data source level (data warehouse) and at the analytical level (canned reports).
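The OLAP operations named above (roll-up, slice) can be illustrated on a toy fact table in plain Python. The table, dimension names, and sales figures below are hypothetical, chosen only to make the operations concrete:

```python
from collections import defaultdict

# A tiny fact table: (region, quarter, product, sales). Hypothetical numbers.
facts = [
    ("East", "Q1", "beer",    100),
    ("East", "Q1", "diapers",  80),
    ("East", "Q2", "beer",    120),
    ("West", "Q1", "beer",     90),
    ("West", "Q2", "diapers",  60),
]

def roll_up(facts, dims):
    """Roll-up: total sales, keeping only the dimensions listed in `dims`
    and aggregating away the rest."""
    totals = defaultdict(int)
    for region, quarter, product, sales in facts:
        row = {"region": region, "quarter": quarter, "product": product}
        key = tuple(row[d] for d in dims)
        totals[key] += sales
    return dict(totals)

def slice_cube(facts, quarter):
    """Slice: fix one dimension (here, quarter) and keep the rest intact."""
    return [f for f in facts if f[1] == quarter]

print(roll_up(facts, ["region"]))    # {('East',): 300, ('West',): 150}
print(len(slice_cube(facts, "Q1")))  # 3
```

MOLAP engines precompute such aggregates in a cube structure, while ROLAP issues the equivalent GROUP BY queries against an RDBMS; the operations themselves are the same.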
12. Big Data era Data Analytics
Stream Processing: Stream processor for sensor data, multi-media, geo-location, GIS, etc.
In-Memory Data Grid: Sense and Response capability, in-memory data aggregation.
NoSQL Database: Key-value pair, document, semi-structured, XML in-memory databases.
Columnar Database: In-memory columnar databases, support for the R language.
Data Lakes: Distributed File System (HDFS, Cassandra) based relational, non-relational, multi-media, sensor or document data.
Analytical Appliances: Specialized analytical hardware, e.g. Netezza, Oracle Exadata.
Operational Reporting: Real-time insights based on streaming data, e.g. sensor, geo-location, GIS, multi-media.
Data Visualization: Self-service data visualization tools, e.g. Tableau, Spotfire.
Big Data Search: MapReduce real-time or batch search.
Descriptive Analysis: What happened?
Predictive Analysis: What will happen?
Prescriptive Analysis: What to do about it? Decision support automation.
Characteristics:
• High volume and data diversity, support for new data types.
• High horizontal and vertical scalability.
• Easy to set up and change.
• Low ownership cost, mostly open source and commodity hardware, cloud solutions readily available.
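Why columnar databases suit the analytics side of this stack can be shown with a toy comparison: an aggregate over a row layout must touch every record in full, while a columnar layout reads only the one field it needs. The field names and values below are hypothetical:

```python
# Row layout: each record carries every field; an aggregate scans them all.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]
row_total = sum(r["amount"] for r in rows)

# Columnar layout: one contiguous array per field; the same aggregate
# reads only the `amount` column. Columnar engines exploit exactly this,
# plus per-column compression, to speed up scans and aggregates.
columns = {
    "id":     [1, 2, 3],
    "name":   ["a", "b", "c"],
    "amount": [10, 20, 30],
}
col_total = sum(columns["amount"])

print(row_total, col_total)  # 60 60
```

The results are identical; the difference in a real engine is how much data moves from storage and memory to compute them.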
13. Objective of this course: the elusive Data Scientist…
“Data Scientist: The Sexiest Job of the 21st Century” –
Harvard Business Review, Oct 2012
Data scientists today are akin to the Wall Street “quants” of the
1980s and 1990s.
The Hot Job of the Decade.
185 Data Scientist job vacancies available in Toronto
as of Jan 6, 2016, on Indeed Canada alone.
How will this course qualify you?
Foundation in Data Mining algorithms and techniques.
Foundation on Big Data architecture and challenges.