BDIA Roundtable
Live Webcast on April 9, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=c84869fcca958d278b210cfca2a023a0
Big Data can offer big value and big challenges, and there are lots of solutions and promises out there. But in order to harness the most insight from Big Data, organizations need to solve pain points with more than triage. Since data challenges continue to permeate the information landscape, businesses would do well to incorporate solutions that fit into the infrastructure and provide a sustainable method for managing and analyzing Big Data.
Register for this Roundtable Webcast to hear veteran Analysts Robin Bloor, Mike Ferguson and Richard Winter as they offer their perspectives on the evolving Big Data industry. They’ll comment on the proposed Big Data Information Architecture, and take questions from the audience. This is the second event of The Bloor Group's Interactive Research Report for 2014 which will focus on illuminating optimal Big Data Information Architectures. The series will include a dozen interviews with today's Big Data visionaries, plus three interactive Webcasts and a detailed findings report.
Visit InsideAnalysis.com for more information.
Foundation for Success: How Big Data Fits in an Information Architecture
1. Grab some coffee and enjoy
the pre-show banter before
the top of the hour!
2. “The Inevitable Shift: How Big Data
Impacts Enterprise Architecture”
RoundTable Webcast | April 9, 2014
3. Host
Eric Kavanagh
CEO, The Bloor Group
@eric_kavanagh eric.kavanagh@bloorgroup.com
4. Big Data Information Architecture
Exploratory Webcast
January 22, 2014
Roundtable Webcast
April 9, 2014
Findings Webcast
June 25, 2014
#BigDataArch
5. Analysts
Robin Bloor
Chief Analyst, The Bloor Group
Richard Winter
President & Founder, WinterCorp
Mike Ferguson
Managing Director, Intelligent Business Strategies
10. Big Data – A Poorly Defined Term
WHAT IS BIG DATA?
Traditional data
Business data
Log file data
Operational data
Mobile data
Location data
Social network data
Public data
Commercial databases
Streaming data
Internet of Things
11. Atoms and Molecules
The ATOM of data has become the EVENT
A TRANSACTION is a MOLECULE of ATOMIC EVENTS
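One way to picture the atom and molecule idea is to treat each event as a small record and assemble a transaction by grouping the events that share a transaction id. The sketch below is illustrative only: the field names (txn_id, event, amount) and the sample events are assumptions, not something taken from the slides.

```python
from collections import defaultdict

# Hypothetical atomic events; each carries the id of the transaction it
# belongs to. Field names and values are made up for illustration.
events = [
    {"txn_id": "T1", "event": "item_added", "amount": 20.0},
    {"txn_id": "T1", "event": "item_added", "amount": 5.0},
    {"txn_id": "T2", "event": "item_added", "amount": 12.5},
    {"txn_id": "T1", "event": "payment_taken", "amount": 0.0},
]


def group_events(events):
    """Assemble atomic events into transactions keyed by txn_id."""
    transactions = defaultdict(list)
    for e in events:
        transactions[e["txn_id"]].append(e)
    return dict(transactions)


transactions = group_events(events)

# A transaction-level value is then just an aggregate over its events.
totals = {txn: sum(e["amount"] for e in evs) for txn, evs in transactions.items()}
```

Keeping the events as the stored unit means the transaction view can always be rebuilt, while event-level analysis stays possible.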
16. The Workload Paradigm Shift
§ Previously, we viewed database workloads as an i/o optimization problem
§ With analytics the workload is a very variable mix of i/o and calculation
§ No databases were built precisely for this – not even Big Data databases
17. The Big Data Applications
It’s pretty much all about BI & ANALYTICS
18. The Biological System
§ Our human control system works at different speeds:
• Almost instant reflex
• Swift response
• Considered response
§ Organizations will gradually implement similar control systems
§ This suggests a data-flow-based architecture
§ The EDW is memory
19. The Corporate Biological System
§ Right now this division into two different data flows is already occurring
§ Currently we can distinguish between:
• Real-time/Business-time applications
• Analytical applications
§ We should build specific architectures for this
20. WINTERCORP
Big Data Information Architecture
Bloor Group Roundtable
Richard Winter
WinterCorp
April 2014
THE LARGE SCALE DATA MANAGEMENT EXPERTS
21. Big Data and the Data Reservoir
From Robin’s charts:
30. Big Data Information Architecture
Mike Ferguson
Managing Director
Intelligent Business Strategies
Bloor Group Big Data Roundtable
April 2014
Twitter: @mikeferguson1
31. For Many Years The Traditional Data Warehouse and BI
Environment Has Been Used For Analysis & Reporting
Operational
systems
web
Portal
Employees
Partners
Customers
BI
Tools
Platform
Integration / DQ
Data
Reports &
analytics
DW
Data warehouse &
data marts
32. However There Are New Types of Data That Businesses
Now Want to Analyse
§ Web data
• Clickstream data, e-commerce logs
• Social networks data e.g., Twitter
§ Semi-structured data e.g., e-mail, XML, JSON
§ Unstructured content
• How much is TEXT worth to you
§ Sensor data
• Temperature, light, vibration, location, liquid flow, pressure, RFIDs
§ Vertical industries structured transaction data
• E.g. Telecom call data records, retail
Source: Analytics: The Real-World Use of Big Data, Said Business School Oxford and IBM
33. The Impact of Big Data – We Now Have Different
Platforms Optimised For Different Analytical Workloads
Big Data workloads now mean we require multiple platforms for analytical processing
Streaming
data
Advanced Analytic
(multi-structured data)
Hadoop
data store
DW & marts
Data Warehouse
RDBMS
NoSQL
DBMS
EDW
NoSQL DB
e.g. graph DB
mart
Advanced Analytics
(structured data)
DW
Appliance
Analytical
RDBMS
MDM (C, R, U, D): Cust, Prod, Asset
Graph
analysis
Investigative
analysis,
Data refinery
Data mining,
model
development
Traditional
query,
reporting &
analysis
Real-time
stream
processing &
decision
management
Master data
management
34. Hadoop Is A Platform At The Heart of Big Data Analytics
– There Are Multiple Ways To Access Hadoop
Access routes shown in the diagram:
• Java MapReduce APIs to HDFS, HBase, Cascading
• SQL via a vendor SQL on Hadoop engine
• WebHDFS (an HTTP interface to HDFS with REST APIs)
• Pig Latin scripts
• MapReduce application
• BI Tools / Apps
Underlying layers: HDFS (files, indexes, partitions), MapReduce, Hadoop 2.0 framework (YARN)
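As a rough illustration of the WebHDFS access route, the sketch below builds REST URLs of the general form http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>, using the OPEN and LISTSTATUS operations. The host name, port and file paths are placeholders chosen for the example, not values from the presentation, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL: http://<host>:<port>/webhdfs/v1<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder NameNode address and HDFS paths (assumptions for illustration).
open_url = webhdfs_url("namenode.example.com", 50070,
                       "/logs/web/2014-04-09.log", "OPEN")
list_url = webhdfs_url("namenode.example.com", 50070, "/logs", "LISTSTATUS")
```

Any HTTP client can then issue a GET against these URLs, which is what makes this route usable from tools that have no Hadoop client libraries installed.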
35. Popular Hadoop Use Cases
§ Hadoop as a data refinery
• Offloading data integration from a DW
§ Hadoop for investigative analysis in an analytical sandbox
§ Hadoop as an on-line data warehouse archive
36. The Hadoop Data Refinery
EDW
Graph
DBMS
Analytical DBMS
DW
Appliance
CRM
ERP
SCM
Ops
XML,
JSON
Web
logs
social
NoSQL DB
web
Data marts
insights
ELT
processing
cloud
37. A Centralised Hadoop Based Data Refinery is One Way to
Scale at Reduced Cost
Data Hub - Consume, Clean, Integrate, Analyse And Provision
Data From Hadoop To Any Analytical Platform
DW & marts
mart
business
insight
NoSQL DB
e.g. graph DB EDW
Generated
MapReduce
ELT jobs
sandbox
ELT Processing
Advanced Analytics
(structured data)
RDBMS social Cloud Files office docs
Web logs web services
sensors feeds
DW
Appliance
Exploratory analysis
Staging area /
landing zone
Sometimes analysts refer to this as a Data Refinery
Data Refinery
What is the purpose of the data refinery?
Is it to process un-modelled data or all data?
38. Investigative Analysis Can Be Done In A Hadoop Sandbox
Click stream web log data
Customer interaction data
Social interaction data (e.g. Twitter, Facebook)
Sensor data
Rich media data (video, audio)
External web content
Documents
Internal web content
Seismic data (oil & gas)
Investigative /
Exploratory
Analysis
Data Scientists
master data archived DW data
MDM System (C, R, U, D): Product, Asset, Customer
EDW
mart
new
business
insight
sandbox
Multi-structured
data
Historical Data
39. Joining Big Data With Master Data During Exploratory
Analysis Can Produce Insight for Competitive Advantage
Streaming Data
Graph Data Multi-Structured
NoSQL DB
e.g. graph DB C
+
Master Data Business Value
Created
R
U
Master
data
D
sentiment Customer
sentiment &
Product sentiment
Customer online
behaviour
Prospects &
Influencers
Sensor data Field service
optimization
Risk mgm’t
Asset performance
customer
product
customer
customer
asset
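The join between big data results and master data can be sketched in a few lines. The example below assumes sentiment scores have already been derived in a sandbox and carry a customer id that matches the master data key. All records, field names and the "gold" segment are hypothetical.

```python
# Hypothetical master data (e.g. from an MDM system), keyed by customer id.
customers = {
    "C001": {"name": "Alice", "segment": "gold"},
    "C002": {"name": "Bob", "segment": "silver"},
}

# Hypothetical sentiment scores derived from social data in a sandbox.
sentiment = [
    {"customer_id": "C001", "product": "router", "score": -0.6},
    {"customer_id": "C002", "product": "router", "score": 0.4},
    {"customer_id": "C999", "product": "modem", "score": 0.1},  # no master record
]


def enrich(sentiment, customers):
    """Inner-join sentiment rows to master data on customer_id."""
    return [
        {**row, **customers[row["customer_id"]]}
        for row in sentiment
        if row["customer_id"] in customers
    ]


enriched = enrich(sentiment, customers)

# Example question the joined data can answer: which gold customers are unhappy?
unhappy_gold = [r["name"] for r in enriched
                if r["segment"] == "gold" and r["score"] < 0]
```

The sentiment score on its own says little; it is the match to a known customer, product or asset that turns it into something a business can act on.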
40. New Insights Can Be Added Into A DW To Enrich What You
Already Know
DW
D
I
new
insights
Operational
systems
Data Scientists
sandbox
Web
logs
social
web cloud
e.g. Deriving insight from social web sites for sentiment analytics
41. Alternatively New Insights In Hadoop Can Be Integrated With A
DW Using Data Virtualization To Provide Enriched Information
DW
D
I
Data Virtualisation
SQL on
Hadoop
new
insights
OLTP systems
Data Scientists
sandbox
Web
logs
social
web cloud
e.g. Deriving insight from social web sites for sentiment analytics
42. Using Hadoop As A Data Archive Means Data Can Be Kept
On-line, Analysed And Still Integrated With Data In The DW
DW
D
I
new
insights
OLTP systems
Data Virtualisation
SQL on
Hadoop
Archive unused data or data older than n years
Archived data
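The "older than n years" rule can be expressed as a simple cutoff date. The following sketch splits rows into those that stay in the DW and those that move to the Hadoop archive. The row layout, the order_date field and the choice of n = 3 are assumptions made for the example.

```python
from datetime import date


def years_before(d, n):
    """Return the date n years before d (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year - n)
    except ValueError:  # 29 Feb with a non-leap target year
        return d.replace(year=d.year - n, day=28)


def split_for_archive(rows, today, n_years):
    """Split rows into (keep_in_dw, move_to_archive) using a date cutoff."""
    cutoff = years_before(today, n_years)
    keep = [r for r in rows if r["order_date"] >= cutoff]
    archive = [r for r in rows if r["order_date"] < cutoff]
    return keep, archive


# Hypothetical DW rows.
rows = [
    {"id": 1, "order_date": date(2009, 1, 15)},
    {"id": 2, "order_date": date(2012, 6, 30)},
    {"id": 3, "order_date": date(2011, 4, 9)},
]

keep, archive = split_for_archive(rows, today=date(2014, 4, 9), n_years=3)
```

Because the archived rows remain queryable in Hadoop, the cutoff becomes a cost decision and not a decision about which history to lose.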
43. Real-time Data From NoSQL DBMSs Can Also Be Joined To
DW Data Using Data Virtualization
DW
D
I
Data Virtualisation
Nested
data !!
real-time
insights
OLTP systems
Web
logs
social
NoSQL DB
Column Family DB
Document DB
sensors
Nested data like JSON needs to be handled by the data virtualisation server
44. Investigative Analysis Can Be Done In A Graph DBMS
– New Insight Can Also Come From Graph Analysis
Investigative /
Exploratory
Analysis
Data Scientists
MDM System (C, R, U, D): Product, Asset, Customer
new
business
Insight
Structured data
master data
Multi-structured
data
Graph
DBMS
45. SQL Access To Big Data - Options
• SQL access to big data in Hadoop
• SQL access to big data in an analytical RDBMS
• SQL access to streaming data in motion
• SQL access to big data via data virtualisation (data virtualisation server, DW)
• SQL access to a combination of the above
46. SQL on Hadoop Challenges
– Multi-structured Data May Need to Be Analysed
{ "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home",
      "number": "0161-123-1234"
    },
    {
      "type": "mobile",
      "number": "07779-123234"
    }
  ]
}
JSON data
Text data
Image Data
SQL??
SQL??
SQL??
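One common way to make nested JSON like this queryable with SQL is to flatten it into relational rows first. The sketch below repeats the parent fields once per entry in the phoneNumbers array, using the slide's sample document. The output column names are assumptions, and real SQL on Hadoop engines each have their own mechanisms for this.

```python
import json

doc = json.loads("""
{ "firstName": "Wayne", "lastName": "Rooney", "age": 25,
  "address": { "streetAddress": "21 Sir Matt Busby Way", "city": "Manchester",
               "country": "England", "postalCode": "M1 6DY" },
  "phoneNumbers": [ { "type": "home", "number": "0161-123-1234" },
                    { "type": "mobile", "number": "07779-123234" } ] }
""")


def flatten_person(doc):
    """Return one flat row per phone number, repeating the parent fields."""
    base = {
        "first_name": doc["firstName"],
        "last_name": doc["lastName"],
        "city": doc["address"]["city"],
    }
    return [
        {**base, "phone_type": p["type"], "phone_number": p["number"]}
        for p in doc["phoneNumbers"]
    ]


rows = flatten_person(doc)
```

The nested address object maps cleanly to extra columns, while the phoneNumbers array forces a choice between repeated rows, as here, and a separate child table.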
47. SQL on Hadoop Challenges
– Multi-structured Data May Need to Be Analysed
Web log data
SQL??
SQL??
Tab delimited
file data
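Web log data usually needs a parsing step before SQL can be applied to it. As a sketch, the code below turns one log line into named columns with a regular expression. It assumes the Apache Common Log Format, and the sample line is invented for the example.

```python
import re

# Assumes Apache Common Log Format:
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Parse one log line into a dict of columns, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    row = m.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


# Made-up sample line.
line = '192.0.2.10 - - [09/Apr/2014:16:00:01 +0000] "GET /products/42 HTTP/1.1" 200 5120'
row = parse_log_line(line)
```

Tab delimited files are the easier case, since splitting each line on the tab character already yields columns.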
48. Hadoop Storage Is Independent of Any SQL Engine Accessing
HDFS - Multiple SQL Engines Can Coexist On The Same Data
Storage is independent
of any SQL engine
SQL SQL SQL SQL
Source: Hortonworks
§ Key points about Hadoop
• It is possible to have MULTIPLE SQL engines on the same data
• Different SQL engines run on different Hadoop frameworks (M/R, Tez,
Spark) or on no framework at all i.e. directly access HDFS or HBase data
49. Relational DBMS / Hadoop Integration – Several Vendors Have
Integrated RDBMS with Hadoop to Run Analytics
SQL, XQuery
Relational DBMS
External
Polymorphic
table function(s)
HDFS / HBase / Hive
Allows join across data in a
single RDBMS and Hadoop
RDBMS optimizer handles
transparent access to external
analytical platforms on behalf
of the user
CitusDB
Exasol EXAPowerlytics
IBM PureData System for Analytics and DB2 HDFS clients
Oracle HDFS Client
Pivotal HAWQ PXF
Teradata SQL-H
RDBMS and Hadoop could
be deployed on the same
hardware cluster or on
different hardware clusters
50. Self-Service Access To Big Data Via Data Virtualization
Self-Service BI
Business analyst
Self-service Data Discovery & Visualisation or Dashboard Server
Data Virtualization and Optimization
personal & office data
Predictive models
DW
Transaction systems
Data Management Tools (ETL, DQ, etc.)
Product examples: Cirro, Cisco, Denodo, Informatica Data Services, ScleraDB
BUT what about optimization? Can the data virtualisation server push down analytics to underlying platforms to make them do the work?
51. Conclusions - People In Different Roles In The Analytical
Landscape Need to Work Together To Deliver Value
sandbox Analytical Operational
Exploratory analysis
Model producer
Business Analyst Business Manager/
Operations Worker
Data Scientist
Model consumer
Data discovery &
visualisation
Information Producer
• Build reports
• Build and publish
dashboards
Information consumer
Decision maker
Action taker