This TDWI EU 2012 presentation looks at the various options for implementing a data store for analytical purposes and shows that there is no 'one size fits all' solution.
6. ⇨
Back to basics: BI & DWH
⇨
The Need for Speed
⇨
Database Architectures
⇨
#BigData & the Hadoop Hoopla
⇨
The forgotten power of OLAP & MDX
⇨
A Cloudy future?
⇨
Shootout: Evaluating alternatives
7. 7
“Business Intelligence (BI): Process of identifying, collecting, combining, analyzing, interpreting and communicating internal and external information to support decision making processes”
“Concepts and methods to improve business decision making by using fact-based support systems”
First definition of BI: 1958!
8. 8
Business Intelligence is....
Doing useful stuff with data, in order to…
Support the
Decision Making
process
So why not simply use
Decision Support Systems?
9. 9
How it all started in 1958...
⇨ Hans Peter Luhn (IBM) → A
Business Intelligence System
The notion of intelligence is also defined here, in a
more general sense, as the “ability to apprehend
the interrelationships of presented facts in such a
way as to guide action towards a desired goal.”
Full text on Timo Elliott's blog:
http://timoelliott.com/blog/2007/11/the_real_pioneer_of_business_i.html
10. 10
Luhn's Vision (1958!)
A Business Intelligence System
Abstract: An automatic system is being developed to disseminate
information to the various sections of any industrial, scientific or
government organization. This intelligence system will utilize data-processing machines for auto-abstracting and auto-encoding of documents
and for creating interest profiles for each of the “action points” in an
organization. Both incoming and internally generated documents are
automatically abstracted, characterized by a word pattern, and sent
automatically to appropriate action points. This paper shows the flexibility
of such a system in identifying known information, in finding who needs to
know it and in disseminating it efficiently either in abstract form or as a
complete document.
13. 13
The Evolution of Enterprise Business
Intelligence
Enterprise Decision Management
Embracing all relevant data sources
BI injected into everyday business
processes
Master Data Management
Advanced Data Mining / Analytics
Business Activity Monitoring
Common Information &
Processes
Business Value
Disconnected Silos of
Information
Query/Reporting/Online Analytical Processing
Content Management/Data Warehousing
Search
2000 2005 2010 2015
Slide 13
Image courtesy Jim Fitzgerald, IBM research
26. 26
Remember the Origins
•The general conception of a
separate architecture for BI has been
around longer, but this is the first
formal relational architecture and
definition published.
•One thing left out of most designs:
the box labeled business process
definitions.
“An architecture for a business and
information system”, B. A. Devlin, P. T.
Murphy, IBM Systems Journal, Vol.27,
No. 1, (1988)
27. 27
2012: we're still doing this!
[Diagram] Sources (CSV files, ERP, DBMS, files) → ETL Process → Staging Area → ETL → Central DWH & Data Marts (DBMS) → EUL
End User Layer, in case you were wondering ;-)
28. 28
We’ve (also) accumulated over 20 years of changes
Source environments: databases, documents, flat files, XML, queues, ERP, applications
→ via ETL, EAI, EII, EDR, stream processing →
Warehouse database, data marts, ODS, content store
→ via SQL / Service API →
Data consumers: databases, dashboards, OLAP, productivity tools, BAM/BPM, reporting, ETL, data mining, applications
29. 29
The assumption of the warehouse as a
database is gone
Data at rest: traditional tabular or structured data → databases;
non-traditional data (logs, audio, documents) → parallel programming platforms
Data in motion: message streams → streaming DBs/engines
Slide 29
Copyright Third Nature, Inc.
31. 31
Why Do BI Projects Fail?
1. Query Performance Too Slow (BI Survey 9)
70% of DWHs experience performance-constrained issues of various types (Gartner DWH MQ 2010)
Poor Query Performance: the No. 1 reason for replacing a DWH (TDWI Best Practices)
32. 32
Two dimensions of Speed
Companies wishing to maximize BI benefits
should focus on
1) support quality
2) implementation time
3) query response time and
4) breadth of deployment,
in that order.
33. 33
Minimize Implementation Time
⇨ Use 'RTF' ('Ready To Fly') off-the-shelf solutions: appliances or cloud offerings
+ use Agile methods like Scrum or DSDM
+ look at Data Vault model & methodology
34. 34
Minimize Query Response Time
Why replace a data warehouse solution?
Source: TDWI Next Generation Data Warehouse Platforms, by Philip Russom
35. 35
Solving Performance Problems
Replace every single thing before the database?
Migrating to an analytic database is twice as likely as migrating to another row-store database.
36. 36
Applying “Laborware”: think twice...
⇨
Apply traditional optimization techniques:
⇨ Redesign solution
⇨ Add/optimize indexes
⇨ Horizontal partitioning
⇨ Add materialized views
⇨ Rewrite queries
⇨ Reorganize data
⇨ Offload old data
⇨ …
⇨ Costs will increase & recur!
44. 44
Storage costs keep going down
Year    Size in GB    US $/GB
1955         0.012    6,382,933.00
1960         0.01     3,686,400.00
1970         0.1        265,933.00
1980         2.5         16,000.00
1990         0.34         5,406.00
2000        40                7.17
2010     2,000                0.05
2012     3,000                0.07
“By the end of 2012, drives will have 100 times more
capacity at 1/100 of the cost per GB compared to 2000”
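The quote roughly checks out against the table above; a quick sanity check (using the 2000 and 2012 rows from the table):

```python
# Sanity check of the "100x capacity at 1/100 of the cost" claim,
# using the 2000 and 2012 figures from the table above.
gb_2000, cost_2000 = 40, 7.17      # typical drive size (GB), US $/GB in 2000
gb_2012, cost_2012 = 3000, 0.07    # typical drive size (GB), US $/GB in 2012

capacity_ratio = gb_2012 / gb_2000     # how many times more capacity
cost_ratio = cost_2000 / cost_2012     # how many times cheaper per GB

print(f"{capacity_ratio:.0f}x capacity, {cost_ratio:.0f}x cheaper per GB")
```

So 75x the capacity at roughly 1/100 of the cost per GB: close to the claim, if not exactly 100x on both axes.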
45. 45
Your next data warehouse?
The next-generation SDXC memory card specification, released to members in April 2009, dramatically improves consumers' digital lifestyles by increasing storage capacity from more than 32 GB up to 2 TB and increasing bus interface speed up to 104 MB per second in 2009, with a road map to 300 MB per second.
47. 47
Choosing the right architecture is a trade off
Conflicting goals pulling at the BI Architecture:
Flexibility · Agility · Real-Time · Complexity · Integration · Auditability · Data Volume · Advanced Analysis · Performance · Low Cost · Skills & Standards
48. 48
What this means…
• No ‘one size fits all’ solution
• Easy to over or under provision
• There are always exceptions
• Clueless analysts
• Tech savvy managers (even C-level)
• Excel Junkies
49. 49
Comparing Solutions
⇨ By Technology? Columns, MPP, In-Memory, etc.
⇨ By Storage Type? Files, tables, OLAP
⇨ By Deployment Type? Appliance, Cloud, SaaS
⇨ By Features/API? SQL, MapReduce, R, etc.
⇨ By Speed? TPC-H, Airline DB, Custom
⇨ By Licence Type/Price? CPU, data size, memory usage
51. 51
Major BI Vendors have SQL DB's
IBM: DB2, Netezza
Microsoft: SQL Server
Oracle: MySQL, Oracle DB, Exalytics (TimesTen)
SAP: Sybase (IQ) & SAP HANA
...and all others are DB agnostic: MicroStrategy, SAS, Tableau, Tibco Spotfire, LogiXML, Pentaho, Jaspersoft, etc.
52. 52
Analytical DBs: What's Different?
⇨ MPP: Massive Parallel Processing
⇨ Column based data organization
⇨ Data compression
⇨ Read optimization
⇨ In memory operation
⇨ Different disk configuration options
⇨ In DB analytics
⇨ Data mining
⇨ Statistics
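Column-based organization, the second item above, can be illustrated in a few lines of plain Python (a sketch of the idea, not any vendor's implementation): an aggregate query touches only the columns it needs, and repetitive columns compress well.

```python
# Sketch: row-wise vs column-wise layout, and why column stores can
# scan a single column cheaply and compress it well.
rows = [
    ("2012-01-01", "NL", 100),
    ("2012-01-01", "NL", 120),
    ("2012-01-02", "DE", 90),
]

# Column-wise: one array per attribute instead of one record per row
columns = {
    "date":    [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

# A query like SUM(amount) reads only the 'amount' column, not whole rows
total = sum(columns["amount"])

# Repetitive columns compress well, e.g. with run-length encoding:
def rle(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1         # extend the current run
        else:
            out.append([v, 1])      # start a new run
    return out

print(total)                    # 310
print(rle(columns["country"]))  # [['NL', 2], ['DE', 1]]
```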
53. 53
Architecture: SMP vs MPP
Different storage approaches:
● Shared Disk (clustering)
● Shared Nothing
Most DWH appliance & new software vendors
use Shared Nothing, MPP, Scale Out architecture
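The Shared Nothing idea can be sketched in a toy example (illustrative only; real MPP systems do this across physical nodes): rows are assigned to exactly one node by hashing a distribution key, so each node can scan its own slice in parallel.

```python
# Toy sketch of shared-nothing data distribution: each node owns a
# disjoint slice of the rows, chosen by hashing a distribution key.
N_NODES = 4
nodes = [[] for _ in range(N_NODES)]

def node_for(key):
    # hash distribution (illustrative; real systems use stable hash functions)
    return hash(key) % N_NODES

for order_id in range(20):
    nodes[node_for(order_id)].append(order_id)

# every row lives on exactly one node ("shared nothing"),
# so a full scan splits into N_NODES independent parallel scans
assert sum(len(n) for n in nodes) == 20
print([len(n) for n in nodes])
```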
64. 64
Disk Usage/Configuration (2)
⇨ 3. Use standard devices, e.g. VectorWise, Vertica ADB
⇨ Option 3 is easiest to set up (but some ADBs auto-configure)
⇨ Speed depends on other things too
65. 65
RAIS instead of RAID
⇨ 1. Failover Node (Hot Standby) ⇨ 2. Data Distribution
[Diagram: data blocks A, B, C mirrored and distributed across nodes, plus a hot standby node]
67. 67
ILM: Software meets Hardware
⇨ Different approaches
⇨ Usage (e.g. TeraData)
⇨ Age (e.g. Oracle)
⇨ Partitions (e.g. Sybase IQ)
Data temperature tiers: Burning Hot, Warm, Cool, Cold
(fast SAS disks for hot data, cheaper SATA disks for cold data)
Image: www.etre.com
68. 68
Beware of (Interconnect) Bottlenecks
[Diagram: fast & expensive SAN ← 1 Gb/s → fast & expensive server(s);
1 Gb/s shared between DWH, ERP, MAIL and CRM VMs = an undersized virtual DWH]
You want:
* Dedicated hardware
* InfiniBand QDR 12x: 96 Gb/s, or
* 100 Gb Ethernet: 100 Gb/s
OR: local storage (MPP with DASD)
76. 76
*THIS* is Hadoop:
a Distributed File System
Data Distribution Data Retrieval using M/R
77. 77
#BigData & NoSQL: No Standards
“Each NoSQL DB has its own strengths/weaknesses;
most are not (directly) suited for typical BI workloads”
78. 78
The Great Divide(s)
⇨ Pure SQL DB's
⇨ All OS Column Stores
⇨ Paraccel, Kognitio
⇨ In Database Analytics
⇨ Map/Reduce (many)
⇨ R (GreenPlum)
⇨ SAS (TeraData)
⇨ Everything (Netezza iClass)
⇨ NoSQL Databases
⇨ Hive (Hadoop)
⇨ MongoDB
⇨ CouchDB
⇨ etc.
79. Worlds Colliding
MapReduce (NoSQL)            | SQL (RDBMS)
Programming model            | Query language
No DBMS/SQL required         | DBMS required
Schema free                  | Fixed schema
Exclusively <key,value>      | Complex structure
Java, Python, C++, C, etc.   | SQL
Text/data mining             | Not good at text
Eventually Consistent        | ACID compliant
80. 80
What is MapReduce?
⇨ M/R is now patented by Google (Patent #7,650,331)
⇨ Used in many ADBs:
⇨Hadoop, CouchDB
⇨AsterData
⇨GreenPlum
⇨Vertica
⇨...
MapReduce is a programming
model and an associated
implementation for processing and
generating large data sets.
Users specify a map function that
processes a key/value pair to
generate a set of intermediate
key/value pairs, and a reduce
function that merges all
intermediate values associated with
the same intermediate key.
(Dean & Ghemawat, Google, 2004)
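The model quoted above fits in a few lines of Python (an in-process sketch for illustration; real frameworks distribute the map, shuffle and reduce phases across nodes):

```python
# Minimal in-process sketch of the map/reduce programming model:
# word count, the canonical example.
from itertools import groupby

def map_fn(doc):
    # map: emit intermediate <key, value> pairs
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce: merge all values for one intermediate key
    return (key, sum(values))

docs = ["big data big hoopla", "big deal"]

# "shuffle" phase: sort/group intermediate pairs by key
pairs = sorted(kv for doc in docs for kv in map_fn(doc))
result = dict(reduce_fn(k, [v for _, v in grp])
              for k, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)   # {'big': 3, 'data': 1, 'deal': 1, 'hoopla': 1}
```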
82. 82
M/R & SQL: How to get there
⇨ SQL on top of M/R
⇨ e.g. Hive-Hadoop
⇨ M/R invoking SQL
⇨ e.g. Greenplum
⇨ SQL invoking M/R
⇨ e.g. TeraData/Aster Data
⇨ Most ADB vendors implementing/investigating M/R
⇨ e.g. Vertica (Hadoop integration), Oracle, Netezza, etc.
92. 92
(R/H/M)OLAP
⇨ OnLine Analytical Processing
⇨ Analyse multidimensional data
⇨ Basic architecture:
Analysis front end → (MDX) → OLAP engine/server → Data Warehouse
93. 93
Stars and Cubes
⇨ Star schema
⇨ Dimension & fact tables
⇨ Best foundation for cubes
⇨ Cubes (logical/physical)
⇨ Dimensions
⇨Hierarchies
⇨ Levels
⇨Attributes
⇨ Measures
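A minimal star schema can be sketched in SQLite (table and column names here are illustrative, not from any particular product): one fact table holding the measure, one dimension table carrying a category→product hierarchy, and a roll-up query of the kind a cube would precompute.

```python
# Hypothetical minimal star schema: a fact table with a measure,
# plus a dimension table with a simple hierarchy (category -> product).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product    TEXT,
        category   TEXT            -- hierarchy level above product
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        unit_sales INTEGER          -- the measure
    );
    INSERT INTO dim_product VALUES (1, 'Beer', 'Drink'), (2, 'Bread', 'Food');
    INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
""")

# Roll the measure up to the 'category' level of the hierarchy
cur = con.execute("""
    SELECT d.category, SUM(f.unit_sales)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""")
result = cur.fetchall()
print(result)   # [('Drink', 15), ('Food', 7)]
```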
94. 94
The power of OLAP
Aggregates, positional calculations (prior vs current), range
calculations (ytd, mtd), level calculations (child to parent
contribution)
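The calculation types named above can be sketched on a plain monthly series (an OLAP engine expresses these declaratively over cube dimensions; this is only the arithmetic behind them):

```python
# Sketch of positional and range calculations on a monthly profit series.
from itertools import accumulate

profit = [10, 12, 9, 15]                     # Jan..Apr

# positional: current period vs prior period
growth = [cur - prev for prev, cur in zip(profit, profit[1:])]

# range: year-to-date running total
ytd = list(accumulate(profit))

print(growth)   # [2, -3, 6]
print(ytd)      # [10, 22, 31, 46]
```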
95. 95
MDX
⇨ Short for 'Multi Dimensional Expressions'
⇨ ~ SQL for OLAP:
⇨
SELECT
{set for column headers} ON COLUMNS,
{set for row headers} on ROWS
FROM [Cube Name]
WHERE {set for filtering}
⇨
SELECT
{[Measures].[Unit Sales]} ON COLUMNS,
{[Product].[Drink], [Product].[Food]} ON ROWS
FROM [Sales]
WHERE [Time].[1997]
96. 96
The Power of MDX
Positional: [Measures].[Profit], [Time].PrevMember
Range: Aggregate(YTD(), [Measures].[Profit])
“MDX is far
more powerful
than SQL for
the typical BI
questions”
97. 97
Adding OLAP to the mix
⇨ Virtual Cubes, e.g.
⇨ Kognitio Pablo
⇨ Pentaho Mondrian
⇨ Microstrategy
⇨ Physical Cubes, e.g.
⇨ Microsoft Analysis Services
⇨ Oracle Essbase
⇨ Jedox Palo
Physical cubes allow 'write back': what-if, forecasting, budgeting & planning
99. 99
The promises of the Cloud
⇨ “Utility computing”
⇨ Unlimited capacity
⇨ Pay as you go/by the sip
⇨ Lower costs
⇨ Always up to date
⇨ Invisible OS
⇨ Security
⇨ Safety
111. 111
The Shootout!
⇨
Things to ask your (potential) vendor
⇨ References
⇨ Assist in a paid POC
⇨ License model & unit of cost: CPU, Core,
Server, (raw) Data volume, Memory used
⇨ Free dev/test editions (only pay for
production use)
⇨ Support options (updates only, mail/phone
support, etc)
⇨ If migrating: trade in discount
⇨ Opt out/de-integration options
112. 112
Does your DB cover the Basics?
⇨ Full SQL 2003 support?
⇨ Easy backup/restore features?
⇨ Scaling up or out?
⇨ Failover & persistence?
⇨ External (management) Tool integration?
117. 117
Beware of Benchmarks!
⇨ Differences in
⇨# threads
⇨# cores
⇨# disks
⇨# nodes
⇨CPU generation/speed
1. Always run a POC on your own data & query workload
2. Don't trust the MQs
118. ⇨ Ongoing Market Consolidation
⇨ More additional/alternative storage engines
⇨ Hybrid Row/Column solutions
⇨ Every DB will get In-DB analytics
⇨ Every DB will get Hadoop/MR extensions
⇨ Everything in-memory
122. Web: www.tholis.com
Email: jos<at>tholis.com
Phone: +31-(0)6-51169606
Skype: tholis.jos
LinkedIn: jvdongen
Twitter: josvandongen
IRC: _grumpy
Jos van Dongen
In BI since 1991
Principal Consultant
Author/Speaker/Analyst
Proud member of #BBBT
Editor's notes
The original definition of business intelligence
What most people in the BI/DWH department tend to forget is that BI is not about technology, cool dashboards or the fastest analytical database. Nor is it about building ETL flows and publishing hundreds of reports. It is about helping the business user and manager make more insightful decisions. If a simple Excel spreadsheet gets you there: great! Unfortunately, things are usually more complex than that...
Seminar Open Source BI
November 2008
Tholis Consulting
In order to deliver full business impact, business intelligence must shift from retrospective analysis by experts to mechanisms that make it fully operational in a business context
e.g. automatically triggered by external events as well as driven by people making decisions.
The former is action within processes, while the latter is more often action on processes.
As core business processes become more service oriented, there is increased scope for injecting decision-driven services. The technology evolution of software architecture means we can mix BI and decision services with application services. This allows us to maintain both application-oriented and data-oriented architectures.
If business intelligence is going to directly impact business processes then we need a closed loop system to evaluate and improve results on an ongoing basis. This is where the combination of performance management concepts, business process models and data all come together.
Current waterfall methods of design and construction are inadequate because they don’t allow evolution in different areas at different speeds, nor do they take into account the service model architecture over the application function-centric architecture.
Data warehouses usually follow a predictable evolution. After over 25 years, we have seen the “stages” companies go through on their path to enterprise data warehousing. Moving from Stage 1 (What Happened?) into Stage 2 (Why Did It Happen?) requires new capabilities for ad hoc analysis. Then as you evolve to Stage 3 (Predicting What Will Happen) you again grow in your platform and database requirements. As you cross the chasm into Stages 4 and 5 (Operational Intelligence) you require a platform capable of “active” analysis.
Monitor: passive monitoring, basic description of what's going on
Identify: active monitoring, human intervention may be required, identify exceptions, alerts
Explore: examine exceptions, determine what happened, boundaries and data
Analyze: determine root causes, more detailed analysis, model building
Predict: model problems and processes, determine future outcomes
Prescribe: optimize, determine choices between options, define actions to take
Step one is ad-hoc analysis, most frequently done manually and not to a strict schedule.
Prediction implies automation of processes and is systematic.
The process requirement, and its absence in our environments, is coming back in BI, model and tool requirements.
The warehouse concept is no longer a simple database-oriented model. It’s grown up into a large collection of data management, storage, processing and delivery components that must all work together.
There have been many changes from the once per night batch oriented design, with a single data model capturing the entire enterprise, and 100% of the organization’s data readily available through a single user interface.
We now have operational data stores and other staging areas to address mixed data latencies, different data types, the requirement to manage master data and clean up problems in operational data.
In larger environments we've created warehouse-mart architectures and offloaded some of the processing or even refined the data further.
Data, particularly in the case of planning, scenario modeling / what-if analysis, or scorecards, has writeback requirements.
Data types are more varied and complex than SQL standard types.
The new view:
Data warehouse as a platform. This means meeting application needs as well as traditional BI workloads. We have to think in terms of data and decision services, as well as traditional query-response models.
Access to both historical and current data.
Multiple storage methods, possibly distributed.
Multiple access methods.
Data usage decoupled from the underlying platform.
More fluid management of data, regardless of location.
Any architecture now will have multiple repositories for data, multiple technologies to cope with the different needs.
The primary technology classes line up like this.
For most BI programs, the low hanging fruit has been picked. The BI market is changing and BI programs, skills and architectures need to change with it.
That means learning about the storage and processing technologies and architectures, and how they can be put together.
Lots of time is wasted on evaluating different solutions; by just taking what you already have (MySQL, SQL Server, Oracle, whatever), lots of time can be saved in your first (pilot) project.
For bigger-scale efforts: use 'Ready To Fly' solutions, either off-premise (cloud-based) or on-premise (appliances).
Selecting any solution is a trade off between conflicting goals; often, high performance and low cost don’t go well together; requiring full auditability and real time data access at the same time can also cause problems. For any combination of factors, a decision has to be made what factor has the more weight in a selection process.