Welcome to the webinar on

Designing High Performance Datawarehouse

Presented by

&
Contents

1. What happened in the Data 1.0 World
2. What is shaping the new Data 2.0 World
3. Designing High Performance Datawarehouse
4. Q&A
What happened in the Data 1.0 World?
Before 2000

Do we need a DWH?

2000s

Select success: top down & bottom up

Advent of ODS

Now

Business led

We’ve got BI / DWH Tools

Volume | Variety | Velocity | Value

Performance vs. Volume: Game Changer

Need insights from unstructured data as well

Drill-down reporting from DWH – getting into mainstream

Analytics is a differentiator

Data Silos
Metrics for success?
OLAP = Insights
Painful Implementations

Show me the ROI
Standardized KPIs
Analytics as differentiator?

(DATA) Big, Real time, In-memory – what to do with existing initiatives?

Retaining skills and expertise
Data 2.0: scale, performance, knowledge, relevance
Challenges in current DW environment - Survey
42% say: Can’t scale to big data volumes
27% say: Inadequate data load speed
27% say: Poor query response
25% say: Existing DW modeled for reports & OLAP only

24%
24%
23%
19%

Can’t score analytic models fast enough

18%

Cost of scaling up or out is too expensive

15%

Can’t support high concurrent user count

15%
Inadequate support for in-memory processing

9%

18%
Current platform needs great manual effort for performance
Poorly suited to real-time workloads
Can’t support in-database analytics
Poor CPU speed and capacity

Current platform is a legacy; we must phase it out

TDWI research based on 278 respondents – Top Responses
Data 2.0 World

Data sources: Social Media Data, Text Data, Sensor Data, Syndicated Data, Numeric Data

High Performance Data Warehouse: Speed, Concurrency Enabled, Able to handle Complexity, Ability to Scale

Benefits: True Sentiment, Faster Compliance, Faster Reach

Every 18 months, non-rich structured and unstructured enterprise data doubles.

Big Data Analytics
• Analytics = Competitive Advantage
• Efficiencies driving down costs
• Customer experience & service

Business is now equipped to consume, identify and act upon this data for superior insights
So what is a High Performance Datawarehouse?

Key Dimensions
High Performance Data Warehouse: Concurrency, Speed, Scale, Complexity
SPEED
• Streaming Big Data
• Event Processing
• Real time operation
• Operational BI
• Near time Analytics
• Dashboard Refresh
• Fast Queries

CONCURRENCY
• Competing Workloads – OLAP, Analytics
• Intraday data loads
• Thousands of users
• Ad hoc queries

SCALE
• Big Data volumes
• Detailed source data
• Thousands of reports
• Scale out into: cloud, clusters, grids, etc.

COMPLEXITY
• Big Data variety
• Unstructured
• Sensor
• Social media
• Many sources / targets
• Complex models and SQL
• High availability
Designing High Performance Datawarehouse
Industry recognized top techniques
45% say: Creating Summary Tables
44% say: Adding Indexes
33% say: Altering SQL Statements or routines

24%
24%

Changing physical data models

16%

Using in-memory databases

21%

16%

Upgrading Hardware

20%
16%

Choosing between column- and row-oriented data storage
Restricting or throttling user queries

15%

Moving an application to a separate data mart

10%
Applying workload management controls

Shifting some workloads to off-peak hours
Adjusting system parameters

6%
Others

TDWI research based on 329 responses from 114 respondents
Designing Summary Tables

45% say: Creating Summary Tables
Summary table design process
COLLECT: A good sampling of queries. These may come from user interviews, testing / QA queries, production queries, reports or any other means that provide a good representation of expected production queries.

ANALYZE: The dimension hierarchy levels, dimension attributes, and fact table measures that are required by each query or report.

IDENTIFY: The row counts associated with each dimension level represented.

BALANCE: The most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows (or less).

MINIMIZE: The columns that are carried in the summary table, in favor of joining back to the dimension table. The larger the summary table, the smaller the performance advantage it provides.

Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
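
As a rough illustration of the BALANCE step, the sketch below builds a toy fact table in an in-memory SQLite database, derives a district-by-week summary table from it, and compares the two row counts against the 1/100 guideline. The table names (fact_sales, agg_sales_district_week), the column names and the synthetic data are assumptions made for this example and do not come from the deck.

```python
import sqlite3

def build_summary(n_stores=200, n_items=50, n_weeks=4, stores_per_district=20):
    """Build a toy fact table and a district/week summary; return both row counts."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE fact_sales ("
        "store_id INTEGER, district_id INTEGER, item_id INTEGER, "
        "fiscal_week INTEGER, sales_qty INTEGER, sale_amt REAL)"
    )
    # One synthetic fact row per store, item and week (hypothetical data).
    rows = [
        (s, s // stores_per_district, i, w, 1, 2.5)
        for s in range(n_stores)
        for i in range(n_items)
        for w in range(n_weeks)
    ]
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)", rows)

    # Roll stores up to districts and drop the item dimension entirely.
    con.execute(
        "CREATE TABLE agg_sales_district_week AS "
        "SELECT district_id, fiscal_week, "
        "       SUM(sales_qty) AS sales_qty, SUM(sale_amt) AS sale_amt "
        "FROM fact_sales GROUP BY district_id, fiscal_week"
    )
    fact_rows = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    agg_rows = con.execute(
        "SELECT COUNT(*) FROM agg_sales_district_week"
    ).fetchone()[0]
    con.close()
    return fact_rows, agg_rows

fact_rows, agg_rows = build_summary()
meets_guideline = agg_rows / fact_rows <= 1 / 100
print(fact_rows, agg_rows, meets_guideline)
```

A candidate summary that fails this check is probably too close to the grain of the fact table to be worth maintaining.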
Capturing requirements for Summary table
• Choosing Aggregates to Create: there are two basic pieces of information which are required to select the appropriate aggregates.
  – Expected usage patterns of the data
  – Data volumes and distributions in the fact table
[Table: expected usage patterns for Reports 1 to 11. Columns: Report; dimension level for Store, Item and Date; Measures. Levels referenced: Store, District, Region; Dept, Category; Calendar Year, Calendar Month, Fiscal Quarter, Fiscal Period, Fiscal Week. Measures: Sales_Qty, Sale_Amt.]

Data volumes

Dimension        Level            # of Populated Members
Store Geography  Division         1
Store Geography  Region           3
Store Geography  District         50
Store Geography  Store            3980
Item Category    Subject          279
Item Category    Category         1987
Item Category    Department       4145
Date             Fiscal Year      3
Date             Fiscal Quarter   12
Date             Fiscal Period    36
Date             Fiscal Week      156
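
The member counts above can be used to rank roll-up candidates by how sharply the count drops from one hierarchy level to its parent. The sketch below is a minimal illustration; it assumes the counts pair with the levels in the order listed (Division through Fiscal Week), and it treats member counts as a proxy for row counts, which ignores sparsity in the fact table.

```python
# Populated member counts per level, coarsest level first.
# Assumption: counts are paired with levels in the order they are listed.
HIERARCHIES = {
    "Store Geography": [("Division", 1), ("Region", 3),
                        ("District", 50), ("Store", 3980)],
    "Item Category": [("Subject", 279), ("Category", 1987),
                      ("Department", 4145)],
    "Date": [("Fiscal Year", 3), ("Fiscal Quarter", 12),
             ("Fiscal Period", 36), ("Fiscal Week", 156)],
}

def rollup_reductions(hierarchies):
    """Return (dimension, from_level, to_level, factor), largest factor first."""
    results = []
    for dim, levels in hierarchies.items():
        # Compare every level with its immediate parent in the hierarchy.
        for (parent, parent_n), (child, child_n) in zip(levels, levels[1:]):
            results.append((dim, child, parent, child_n / parent_n))
    return sorted(results, key=lambda r: r[3], reverse=True)

reductions = rollup_reductions(HIERARCHIES)
for dim, child, parent, factor in reductions:
    print(f"{dim}: {child} -> {parent} reduces members about {factor:.1f}x")
```

Under these assumptions, rolling Store up to District gives by far the largest reduction, which makes district-level aggregates a natural first candidate.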
Summary table design considerations
Aggregate storage column selection
• Semi-additive and all non-additive fact data need not be stored in the summary table
• Add as many “pre calculated” columns as possible
• “Count” columns could be added for non-additive facts to preserve a portion of the information

Recreating vs. Updating Aggregates
• Efficient for aggregation programs to update the aggregate tables with the newly loaded data
• Regeneration more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table

Storing Aggregate Rows
• A combined table containing basic level fact rows and aggregate rows
• A single aggregate table which holds all aggregate data for a single base fact table
• A separate table for each aggregate created (most preferred option)

Storing Aggregate Dimension Data
• Multiple hierarchies in a single dimension
• Store all of the aggregate dimension records together in a single table
• Use a separate table for each level in the dimension
• Add dimension data to aggregate fact table
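
The trade-off between updating and regenerating an aggregate, and the role of a "count" column, can be sketched in plain Python. This is an illustrative model only: the aggregate is a dictionary keyed by (district, week), and the function names and sample rows are hypothetical.

```python
from collections import defaultdict

def regenerate(fact_rows):
    """Rebuild the aggregate from scratch: key -> [sum_amt, row_count]."""
    agg = defaultdict(lambda: [0.0, 0])
    for district, week, amt in fact_rows:
        cell = agg[(district, week)]
        cell[0] += amt
        cell[1] += 1
    return dict(agg)

def update_in_place(agg, new_rows):
    """Fold only the newly loaded fact rows into an existing aggregate."""
    for district, week, amt in new_rows:
        cell = agg.setdefault((district, week), [0.0, 0])
        cell[0] += amt
        cell[1] += 1
    return agg

def average_amt(agg, key):
    """An average is non-additive, so derive it from the stored sum and count."""
    total, count = agg[key]
    return total / count

history = [("D1", 1, 10.0), ("D1", 1, 30.0), ("D2", 1, 5.0)]
new_load = [("D1", 1, 20.0), ("D2", 2, 8.0)]

incremental = update_in_place(regenerate(history), new_load)
rebuilt = regenerate(history + new_load)
print(incremental == rebuilt, average_amt(incremental, ("D1", 1)))
```

Both paths give the same aggregate here because the load only appends rows. Once late corrections or deletions are involved, the update logic grows, which is the case where regeneration becomes the simpler option.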
Efficient Indexing for Datawarehouse

44% say: Adding Indexes
Dimension table indexing
• Create a non-clustered primary key on the surrogate key of each dimension table
• A clustered index on the business key should be considered. It can:
  – Enhance the query response when the business key is used in the WHERE clause
  – Help avoid lock escalation during the ETL process
• For large type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date and surrogate key
• Create non-clustered indexes on columns in the dimension that will be used for searching, sorting, or grouping
• If there’s a hierarchy in a dimension, such as Category - Sub Category - Product ID, then create an index on the hierarchy

Index columns                                                     Index type
EmployeeKey                                                       Non-clustered
EmployeeNationalIDAlternateKey                                    Clustered
EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey   Non-clustered
FirstName, LastName, DepartmentName                               Non-clustered
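
The index plan above could be turned into DDL along the following lines. This sketch renders SQL Server style statements as strings. The table name DimEmployee and the generated index names are assumptions, and creating one index per search column (rather than a single three-column index) is also an assumption, since the slide does not say which is intended.

```python
def render_index_ddl(table, columns, clustered=False, primary_key=False):
    """Render one SQL Server style index or primary-key statement as text."""
    kind = "CLUSTERED" if clustered else "NONCLUSTERED"
    col_list = ", ".join(columns)
    if primary_key:
        return (f"ALTER TABLE {table} ADD CONSTRAINT PK_{table} "
                f"PRIMARY KEY {kind} ({col_list});")
    name = f"IX_{table}_{'_'.join(columns)}"
    return f"CREATE {kind} INDEX {name} ON {table} ({col_list});"

# (columns, clustered, primary_key) following the dimension index plan.
PLAN = [
    (["EmployeeKey"], False, True),                      # surrogate key
    (["EmployeeNationalIDAlternateKey"], True, False),   # business key
    (["EmployeeNationalIDAlternateKey", "StartDate",
      "EndDate", "EmployeeKey"], False, False),          # type 2 SCD lookup
    (["FirstName"], False, False),                       # search columns,
    (["LastName"], False, False),                        # one index each
    (["DepartmentName"], False, False),                  # (assumption)
]

ddl = [render_index_ddl("DimEmployee", cols, clustered, pk)
       for cols, clustered, pk in PLAN]
print("\n".join(ddl))
```

A table can have only one clustered index, so the plan marks exactly one entry as clustered.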
Fact table indexing

• Create a clustered, composite index composed of each of the foreign keys in the fact table
• Keep the most commonly queried date column as the leftmost column in the index
• There can be more than one date in the fact table, but there is usually one date that is of the most interest to business users. A clustered index on this column has the effect of quickly segmenting the amount of data that must be evaluated for a given query

Index columns                                                                                       Index type
OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey     Clustered
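
The effect of putting the date key first can be observed with a small experiment. The sketch below uses SQLite, which does not distinguish clustered from non-clustered indexes in the way the slide does, so an ordinary composite index stands in for the clustered one. The table, index and column names are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

def plan_for(con, sql, params=()):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_sales ("
    "order_date_key INTEGER, product_key INTEGER, customer_key INTEGER, "
    "sales_amount REAL)"
)
# Composite index on the foreign keys with the date key leftmost.
con.execute(
    "CREATE INDEX ix_fact_sales_date_first "
    "ON fact_sales (order_date_key, product_key, customer_key)"
)
con.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(20130100 + d, p, p % 7, 1.0) for d in range(1, 29) for p in range(50)],
)

# A date-range query can seek into the index instead of scanning every row.
date_plan = plan_for(
    con,
    "SELECT SUM(sales_amount) FROM fact_sales "
    "WHERE order_date_key BETWEEN ? AND ?",
    (20130101, 20130107),
)
con.close()
print(date_plan)
```

The plan should report a search on the date-leading index, which is the "segmenting" effect the slide describes.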
Column Oriented databases
Row Store and Column Store
Most queries do not process all the attributes of a particular relation.

Row Store
(+) Easy to add/modify a record
(-) Might read in unnecessary data

Column Store
(+) Only need to read in relevant data
(-) Tuple writes require multiple accesses

• One can try to obtain some of the performance benefits of a column-store using a row-store by making some changes to the physical structure of the row store:
– Vertical partitioning
– Using index-only plans
– Using materialized views
Vertical Partitioning
• Process:
– Full Vertical partitioning of each relation
• Each column = 1 physical table
• This can be achieved by adding an integer position column to every table
• Adding an integer position column is better than adding a primary key

– Join on Position for multi column fetch
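
A minimal sketch of full vertical partitioning, using SQLite as the row store: each logical column becomes its own two-column table (pos, value), and a query that needs several columns joins them on pos. The table names and sample rows are assumptions made for the example.

```python
import sqlite3

ROWS = [("North", "Widget", 10.0),
        ("South", "Gadget", 4.5),
        ("North", "Gadget", 7.0)]
COLUMNS = ["region", "product", "amount"]

con = sqlite3.connect(":memory:")
# One physical (pos, value) table per logical column of the relation.
for idx, col in enumerate(COLUMNS):
    con.execute(f"CREATE TABLE sales_{col} (pos INTEGER PRIMARY KEY, value)")
    con.executemany(
        f"INSERT INTO sales_{col} VALUES (?, ?)",
        [(pos, row[idx]) for pos, row in enumerate(ROWS)],
    )

# A query over two of the three columns reads only those two tables,
# stitched back together by joining on the position column.
totals = con.execute(
    "SELECT r.value, SUM(a.value) "
    "FROM sales_region AS r JOIN sales_amount AS a ON a.pos = r.pos "
    "GROUP BY r.value ORDER BY r.value"
).fetchall()
con.close()
print(totals)
```

The product column is never touched by this query, which is the benefit being sought. The cost is the extra join and the per-row position value stored in every partition.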
Index-only plans
• Process:
– Add B+Tree index for every Table.column
– Plans never access the actual tuples on disk
– Headers are not stored, so per tuple overhead is less
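
An index-only plan can be demonstrated with SQLite, which reports it as a "covering index". For brevity the sketch uses one composite index over the two columns the query needs, rather than one index per column as described above. All names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, amount REAL, notes TEXT)"
)
# The index holds every column the query below needs.
con.execute("CREATE INDEX ix_sales_region_amount ON sales (region, amount)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("North", "Widget", 10.0, "x" * 200),
     ("South", "Gadget", 4.5, "y" * 200)],
)

query = ("SELECT region, SUM(amount) FROM sales "
         "GROUP BY region ORDER BY region")
plan = " | ".join(
    row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query)
)
result = con.execute(query).fetchall()
con.close()
print(plan)
print(result)
```

Because the wide notes column is not in the index, the query can be answered without reading the table rows at all.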
Using Hadoop for Datawarehouse
Hadoop ecosystem
• Ecosystem of open source projects
• Hosted by the Apache Foundation
• Google developed and shared the concepts
• Distributed file system that has the ability to scale out

Components
• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Metadata Management (HCatalog)
• Query (Hive)
• Scripting (Pig)
• Non-Relational Database (HBase)
• Workflow & Scheduling (Oozie)
• Management & Monitoring (Ambari, ZooKeeper)
• Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
Promising uses of Hadoop in DW context

Data staging: Hadoop allows organizations to deploy an extremely scalable and economical ETL environment

Data archiving: Hadoop’s scalability and low cost enable organizations to keep all data forever in a readily accessible online environment

Schema flexibility: Hadoop can quickly and easily ingest any data format

Processing flexibility: Hadoop enables the growing practice of “late binding” – instead of transforming data as it’s ingested by Hadoop, structure is applied at runtime

Distributed DW architecture: Off load workloads for big data and advanced analytics to HDFS, discovery platforms and MapReduce
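
"Late binding" can be illustrated without Hadoop itself. In the sketch below, raw lines are kept exactly as they arrived and a schema is applied only when the data is read. The record formats, field names and the read_with_schema helper are all hypothetical.

```python
import json

# Raw records are stored as ingested; no schema is enforced on the way in.
RAW_LINES = [
    '{"user": "a", "event": "click", "ts": "2013-01-05T10:00:00"}',
    '{"user": "b", "event": "view", "ts": "2013-01-05T10:01:00", "device": "mobile"}',
    "u=c;event=click;ts=2013-01-05T10:02:00",  # a second source format
]

def read_with_schema(lines, fields):
    """Parse each raw line at read time and project the requested fields."""
    for line in lines:
        if line.startswith("{"):
            record = json.loads(line)
        else:
            record = dict(part.split("=", 1) for part in line.split(";"))
            record["user"] = record.pop("u", None)  # normalise the field name
        yield tuple(record.get(field) for field in fields)

clicks_by_user = [
    user
    for user, event in read_with_schema(RAW_LINES, ["user", "event"])
    if event == "click"
]
print(clicks_by_user)
```

A different query can ask for a different set of fields from the same raw lines, with missing fields simply read as None, and nothing needs to be reloaded.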
What led to Datawarehouse at Facebook
The Problem
• Data, data and more data
• 200 GB per day in March 2008
• 2+ TB (compressed) per day

The Hadoop Experiment
• Superior in availability, scalability and manageability compared to commercial databases
• Uses Hadoop File System (HDFS)

Challenges with Hadoop
• Programmability & metadata
• MapReduce hard to program
• Need to publish data in well known schemas
HIVE
What is Hive?
• A system for managing and querying structured data built on top of Hadoop
• Uses MapReduce for execution
• Uses HDFS for storage

Key Building Principles
• SQL on structured data as a familiar data warehousing tool
• Pluggable map/reduce scripts in language of your choice
• Rich data types
• Performance

Tables
• Each table has a corresponding directory in HDFS
• Each table points to existing data directories in HDFS
• Split data based on hash of a column – mainly for parallelism
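
Splitting data on the hash of a column can be sketched as follows. CRC32 is used here only as a convenient deterministic hash; it is a stand-in, not the hash function Hive uses, and the function names and sample rows are hypothetical.

```python
import zlib

def bucket_for(value, num_buckets):
    """Pick a bucket from a deterministic hash of the bucketing column value."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def split_into_buckets(rows, column_index, num_buckets):
    """Distribute rows across buckets by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column_index], num_buckets)].append(row)
    return buckets

user_ids = [101, 102, 103, 101, 104, 102]
rows = [(user_id, f"event-{n}") for n, user_id in enumerate(user_ids)]
buckets = split_into_buckets(rows, column_index=0, num_buckets=4)

for number, bucket in enumerate(buckets):
    print(number, bucket)
```

All rows for a given user land in the same bucket, so each bucket can be processed by a separate worker without coordinating on that key.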
Analytical platforms
Analytical platforms overview
1010data
Aster Data (Teradata)
Calpont
Datallegro (Microsoft)
Exasol
Greenplum (EMC)
IBM SmartAnalytics
Infobright
Kognitio
Netezza (IBM)
Oracle Exadata
Paraccel
Pervasive
Sand Technology
SAP HANA
Sybase IQ (SAP)
Teradata
Vertica (HP)

Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general purpose solutions.

Deployment Options
• Software only (Paraccel, Vertica)
• Appliance (SAP, Exadata, Netezza)
• Hosted (1010data, Kognitio)

• Kelley Blue Book – Consolidates millions of auto transactions each week to calculate car valuations
• AT&T Mobility – Tracks purchasing patterns for 80M customers daily to optimize targeted marketing
Which platform do you choose?

Platforms: Hadoop, Analytic Database, General Purpose RDBMS

Data types: Structured, Semi-Structured, Unstructured
Thank You
Please send your feedback and corporate training / consulting services requirements on BI to sameer@compulinkacademy.com

Designing high performance datawarehouse

  • 1. Welcome to the webinar on Designing High Performance Datawarehouse Presented by &
  • 2. Contents 1 What happened in the Data 1.0 World 2 What is shaping the new Data 2.0 World 3 Designing High Performance Datawarehouse 4 Q&A
  • 3. What happened in the Data 1.0 World? Before 2000 Do we need a DWH? 2000s Select success : top down & bottom up Advent of ODS Now Business led We’ve got BI / DWH Tools Volume | Variety | Velocity | Value Performance vs. Volume : Game Changer Need insights from nonstructured data as well Drill-down Reporting from DWH – getting into mainstream Analytics is a differentiator Data Silos Metrics for success? OLAP = Insights Painful Implementations Show me the ROI Standardized KPIs Analytics as differentiator? (DATA) Big, Real time, In-memory – what do with existing initiatives? Retaining skills and expertise Data 2.0 : scale, performance, knowledge, relevance
  • 4. Challenges in current DW environment - Survey 42% say Can’t scale to big data volumes 27% say Inadequate data load speed 27% say Poor query response 25% Existing DW modeled for reports & OLAP only 24% 24% 23% 19% Can’t score analytic models Fast enough 18% Cost of scaling up or out is too expensive 15% Can’t support high Concurrent user count 15% Inadequate support for In-memory processing 9% 18% Current platform needs great Manual effort for performance Poorly suited to real-time workloads Can’t support in-database analytics Poor CPU speed and capacity Current platform is a legacy, We must phase it out TDWI research based on 278 respondents – Top Responses`
  • 5. Social Media Data Data 2.0 World True Sentiment Faster Compliance Text Data Sensor Data High Performance Data Warehouse Concurrency Enabled Able to handle Complexity Ability to Scale Syndicated Data Faster Reach Speed Numeric Data Every 18 months, non-rich structured and unstructured enterprise data doubles. Big Data Analytics Analytics = Competitive Advantage Efficiencies driving down costs Customer experience & service Business is now equipped to consume, identify and act upon this data for superior insights
  • 6. So what is a High Performance Datawarehouse? Key Dimensions
  • 8. High Performance Data Warehouse
      – SPEED: Streaming big data; Event processing; Real-time operation; Operational BI; Near-time analytics; Dashboard refresh; Fast queries
      – CONCURRENCY: Competing workloads (OLAP, analytics); Intraday data loads; Thousands of users; Ad hoc queries
      – SCALE: Big data volumes; Detailed source data; Thousands of reports; Scale out into cloud, clusters, grids, etc.
      – COMPLEXITY: Big data variety (unstructured, sensor, social media); Many sources / targets; Complex models and SQL; High availability
  • 10. Industry recognized top techniques: 45% say Creating summary tables; 44% say Adding indexes; 33% say Altering SQL statements or routines; 24% 24% Changing physical data models 16% Using in-memory databases 21% 16% Upgrading hardware 20% 16% Choosing between column-row oriented data storage; Restricting or throttling user queries 15% Moving an application to a separate data mart 10% Applying workload management controls; Shifting some workloads to off-peak hours; Adjusting system parameters 6% Others. TDWI research based on 329 responses from 114 respondents
  • 12. Summary table design process
      – COLLECT: A good sampling of queries. These may come from user interviews, testing / QA queries, production queries, reports or any other means that provide a good representation of expected production queries.
      – ANALYZE: The dimension hierarchy levels, dimension attributes, and fact table measures that are required by each query or report.
      – IDENTIFY: The row counts associated with each dimension level represented.
      – BALANCE: The most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows (or less).
      – MINIMIZE: The columns that are carried in the summary table, in favor of joining back to the dimension table. The larger the summary table, the less performance advantage it provides.
      – Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
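The design process above ends in a table that pre-aggregates the fact table at a chosen dimension level. A minimal sketch follows, using Python's built-in sqlite3 module; the table names (`fact_sales`, `dim_store`, `agg_sales_district_week`), the columns and the sample rows are all hypothetical and only illustrate the idea of rolling store-level facts up to district and fiscal week.

```python
import sqlite3

# Hypothetical schema and data: store-level sales rolled up to district/week.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, district TEXT);
    CREATE TABLE fact_sales (
        store_key INTEGER, fiscal_week INTEGER, sales_qty INTEGER, sale_amt REAL
    );
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(1, "North"), (2, "North"), (3, "South")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, 2, 20.0), (1, 1, 1, 10.0), (2, 1, 4, 40.0),
    (3, 1, 1, 15.0), (3, 2, 3, 45.0), (2, 2, 2, 20.0),
])

# The summary table keeps only the grouping levels and the additive measures.
conn.execute("""
    CREATE TABLE agg_sales_district_week AS
    SELECT s.district, f.fiscal_week,
           SUM(f.sales_qty) AS sales_qty,
           SUM(f.sale_amt)  AS sale_amt
    FROM fact_sales f JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.district, f.fiscal_week
""")

summary_rows = conn.execute(
    "SELECT district, fiscal_week, sales_qty, sale_amt "
    "FROM agg_sales_district_week ORDER BY district, fiscal_week").fetchall()
fact_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(f"{fact_count} fact rows -> {len(summary_rows)} summary rows")
```

With realistic volumes the reduction would be far larger than in this toy data; the 1/100th guideline is a target for the ratio between the two row counts.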
  • 13. Capturing requirements for Summary table
      • Choosing aggregates to create: two basic pieces of information are required to select the appropriate aggregates:
          – Expected usage patterns of the data
          – Data volumes and distributions in the fact table
      • Report usage matrix: Reports 1 to 11, each listed with the Store Geography level (District, Store, Region), Item level (Category, Dept), Date level (Calendar Year, Calendar Month, Fiscal Quarter, Fiscal Period, Fiscal Week) and measures (Sales_Qty, Sale_Amt) it requires
      • Number of populated members per dimension level:
          – Store Geography: Division 1, Region 3, District 50, Store 3980
          – Item: Subject 279, Category 1987, Department 4145
          – Date: Fiscal Year 3, Fiscal Quarter 12, Fiscal Period 36, Fiscal Week 156
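One way to use the member counts is to rank adjacent hierarchy levels by how much they shrink when rolled up. The sketch below assumes the eleven counts on the slide line up, in order, with the eleven levels listed (Division through Fiscal Week); that pairing is an assumption, and member counts are only a proxy for the row counts of the resulting summary tables.

```python
# Member counts per level, coarsest level first (assumed pairing, see above).
hierarchies = {
    "Store Geography": [("Division", 1), ("Region", 3), ("District", 50), ("Store", 3980)],
    "Item": [("Subject", 279), ("Category", 1987), ("Department", 4145)],
    "Date": [("Fiscal Year", 3), ("Fiscal Quarter", 12),
             ("Fiscal Period", 36), ("Fiscal Week", 156)],
}

def rollup_ratios(levels):
    """Return (child, parent, ratio) for each adjacent pair of levels,
    where ratio = child member count / parent member count."""
    return [(child, parent, child_n / parent_n)
            for (parent, parent_n), (child, child_n) in zip(levels, levels[1:])]

# Largest reductions first: these are the most promising roll-ups.
ranked = sorted(
    (r for levels in hierarchies.values() for r in rollup_ratios(levels)),
    key=lambda r: r[2], reverse=True)

for child, parent, ratio in ranked:
    print(f"{child} -> {parent}: {ratio:.1f}x fewer members")
```

Under these assumptions the Store to District roll-up stands out, which is consistent with District appearing in many of the listed reports.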
  • 14. Summary table design considerations
      • Aggregate storage column selection
          – Semi-additive and all non-additive fact data need not be stored in the summary table
          – Add as many “pre calculated” columns as possible
          – “Count” columns could be added for non-additive facts to preserve a portion of the information
      • Recreating vs. updating aggregates
          – It is efficient for aggregation programs to update the aggregate tables with the newly loaded data
          – Regeneration is more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table
      • Storing aggregate rows
          – A combined table containing basic level fact rows and aggregate rows
          – A single aggregate table which holds all aggregate data for a single base fact table
          – A separate table for each aggregate created (most preferred option)
      • Storing aggregate dimension data
          – Multiple hierarchies in a single dimension
          – Store all of the aggregate dimension records together in a single table
          – Use a separate table for each level in the dimension
          – Add dimension data to aggregate fact table
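The choice between updating and recreating an aggregate can be sketched as two small routines over the same tables. This is an illustration only, using Python's sqlite3 module; `fact_sales`, `agg_sales`, `load_batch` and `rebuild` are hypothetical names, and the incremental path handles only additive measures on newly inserted rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (district TEXT, fiscal_week INTEGER, sale_amt REAL);
    CREATE TABLE agg_sales (district TEXT, fiscal_week INTEGER, sale_amt REAL,
                            fact_rows INTEGER,
                            PRIMARY KEY (district, fiscal_week));
""")

def load_batch(rows):
    """Insert new fact rows, then apply only their deltas to the aggregate."""
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    deltas = {}
    for district, week, amt in rows:
        total, count = deltas.get((district, week), (0.0, 0))
        deltas[(district, week)] = (total + amt, count + 1)
    for (district, week), (total, count) in deltas.items():
        cur = conn.execute(
            "UPDATE agg_sales SET sale_amt = sale_amt + ?, fact_rows = fact_rows + ? "
            "WHERE district = ? AND fiscal_week = ?",
            (total, count, district, week))
        if cur.rowcount == 0:  # no aggregate row yet for this key
            conn.execute("INSERT INTO agg_sales VALUES (?, ?, ?, ?)",
                         (district, week, total, count))

def rebuild():
    """Regenerate the aggregate from the full fact table."""
    conn.execute("DELETE FROM agg_sales")
    conn.execute(
        "INSERT INTO agg_sales "
        "SELECT district, fiscal_week, SUM(sale_amt), COUNT(*) "
        "FROM fact_sales GROUP BY district, fiscal_week")

load_batch([("North", 1, 10.0), ("North", 1, 5.0), ("South", 1, 7.0)])
load_batch([("North", 1, 2.5), ("South", 2, 4.0)])
incremental = conn.execute("SELECT * FROM agg_sales ORDER BY 1, 2").fetchall()
rebuild()
rebuilt = conn.execute("SELECT * FROM agg_sales ORDER BY 1, 2").fetchall()
print(incremental == rebuilt)
```

The incremental path touches only the keys present in the new batch; the rebuild rescans every fact row but needs no logic to decide what changed, which is the trade-off the slide describes.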
  • 15. Efficient Indexing for Datawarehouse 44% say Adding Indexes
  • 16. Dimension table indexing
      • Create a non-clustered primary key on the surrogate key of each dimension table
      • Consider a clustered index on the business key. It can:
          – Enhance query response when the business key is used in the WHERE clause
          – Help avoid lock escalation during the ETL process
      • For large type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date and surrogate key
      • Create non-clustered indexes on columns in the dimension that will be used for searching, sorting, or grouping
      • If there is a hierarchy in a dimension, such as Category, Sub Category, Product ID, then create an index on the hierarchy
      • Example indexes (index columns: index type)
          – EmployeeKey: non-clustered
          – EmployeeNationalIDAlternateKey: clustered
          – EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey: non-clustered
          – FirstName, LastName, DepartmentName: non-clustered
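The four-part index for type 2 SCDs supports the lookup an ETL job typically runs: given a business key and a date, find the surrogate key of the row that was valid on that date. The sketch below uses SQLite through Python's sqlite3 module, which has no clustered / non-clustered keyword, so it shows only the column order of the index and the lookup itself. The table, the column names and the `9999-12-31` open-end convention are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_employee (
        employee_key    INTEGER PRIMARY KEY,  -- surrogate key
        national_id     TEXT NOT NULL,        -- business key
        start_date      TEXT NOT NULL,        -- record begin date (ISO format)
        end_date        TEXT NOT NULL,        -- '9999-12-31' marks the current row
        department_name TEXT
    );
    -- Four-part index: business key, begin date, end date, surrogate key.
    CREATE INDEX ix_employee_scd2
        ON dim_employee (national_id, start_date, end_date, employee_key);
""")
conn.executemany("INSERT INTO dim_employee VALUES (?, ?, ?, ?, ?)", [
    (1, "A-100", "2010-01-01", "2012-06-30", "Sales"),
    (2, "A-100", "2012-07-01", "9999-12-31", "Marketing"),
    (3, "B-200", "2011-03-15", "9999-12-31", "Finance"),
])

def surrogate_key_for(national_id, as_of):
    """Return the surrogate key of the dimension row valid on `as_of`, or None."""
    row = conn.execute(
        "SELECT employee_key FROM dim_employee "
        "WHERE national_id = ? AND start_date <= ? AND end_date >= ?",
        (national_id, as_of, as_of)).fetchone()
    return row[0] if row else None

print(surrogate_key_for("A-100", "2011-05-01"))
print(surrogate_key_for("A-100", "2013-01-01"))
```

Because every column the query needs is in the index, the lookup can be answered without reading the wider dimension row.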
  • 17. Fact table indexing
      • Create a clustered, composite index composed of each of the foreign keys in the fact table
      • Keep the most commonly queried date column as the leftmost column in the index
      • There can be more than one date in the fact table, but there is usually one date that is of the most interest to business users. A clustered index led by this column quickly narrows the amount of data that must be evaluated for a given query
      • Example index (clustered), index columns: OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey
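Why the leading date column matters can be shown with a plain sorted list standing in for a clustered index. This is an analogy, not how any particular database engine stores data: rows kept in key order let a date range be located with two binary searches instead of a full scan. The keys and amounts below are made up.

```python
from bisect import bisect_left

# Fact rows as (order_date_key, product_key, customer_key, sale_amt),
# kept physically sorted on the composite key, date first.
fact_rows = sorted([
    (20130105, 7, 101, 25.0),
    (20130220, 3, 102, 40.0),
    (20130111, 3, 103, 10.0),
    (20130302, 7, 101, 5.0),
    (20130128, 9, 104, 30.0),
])

def rows_in_date_range(first_key, last_key):
    """Slice out all rows whose date key lies in [first_key, last_key]."""
    lo = bisect_left(fact_rows, (first_key,))
    hi = bisect_left(fact_rows, (last_key + 1,))
    return fact_rows[lo:hi]

january = rows_in_date_range(20130101, 20130131)
january_total = sum(row[3] for row in january)
print(len(january), january_total)
```

If the list were sorted on product key first, the same date filter would have to inspect every row.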
  • 19. Row Store and Column Store
      • Most queries do not process all the attributes of a particular relation.
      • Row store: (+) Easy to add/modify a record; (-) Might read in unnecessary data
      • Column store: (+) Only need to read in relevant data; (-) Tuple writes require multiple accesses
      • One can obtain the performance benefits of a column store using a row store by making some changes to the physical structure of the row store:
          – Vertical partitioning
          – Using index-only plans
          – Using materialized views
  • 20. Vertical Partitioning
      • Process:
          – Full vertical partitioning of each relation: each column = 1 physical table
          – This can be achieved by adding an integer position column to every table
          – Adding an integer position is better than adding the primary key
          – Join on position for multi-column fetch
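Vertical partitioning as described can be sketched in a few lines with Python's sqlite3 module: each column of a logical `sales` relation becomes its own two-column table of (position, value), and a multi-column fetch joins those tables on position. The relation, its columns and the `sales_<column>` naming are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = [("North", 2, 20.0), ("South", 1, 15.0), ("North", 4, 40.0)]
columns = ["district", "sales_qty", "sale_amt"]

# One physical table per column, keyed by an integer position.
for i, col in enumerate(columns):
    conn.execute(f"CREATE TABLE sales_{col} (pos INTEGER PRIMARY KEY, value)")
    conn.executemany(f"INSERT INTO sales_{col} VALUES (?, ?)",
                     [(pos, row[i]) for pos, row in enumerate(rows)])

# A single-column query reads only that column's table.
total_amt = conn.execute("SELECT SUM(value) FROM sales_sale_amt").fetchone()[0]

# A multi-column fetch joins the column tables on position.
north = conn.execute("""
    SELECT d.value, a.value
    FROM sales_district d JOIN sales_sale_amt a ON a.pos = d.pos
    WHERE d.value = 'North'
    ORDER BY d.pos
""").fetchall()
print(total_amt, north)
```

The cost of this layout shows up in the second query: every additional column requested adds another join.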
  • 21. Index-only plans
      • Process:
          – Add a B+Tree index for every Table.column
          – Plans never access the actual tuples on disk
          – Headers are not stored, so per-tuple overhead is less
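A simplified illustration of an index-only plan, using SQLite through Python's sqlite3 module: with an index on every column, a query that touches a single column can be answered from that index alone, which SQLite reports as a covering index in its `EXPLAIN QUERY PLAN` output. The table and index names are hypothetical, and this single-column case is narrower than the general technique, where several per-column indexes are combined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (district TEXT, sales_qty INTEGER, sale_amt REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", 2, 20.0), ("South", 1, 15.0), ("North", 4, 40.0)])

# One index per column of the table.
for col in ("district", "sales_qty", "sale_amt"):
    conn.execute(f"CREATE INDEX ix_sales_{col} ON sales ({col})")

query = "SELECT sale_amt FROM sales WHERE sale_amt > 18"

# The last field of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
amounts = sorted(r[0] for r in conn.execute(query))
print(plan)
print(amounts)
```

The plan text names the index used; when it mentions a covering index, the base table rows were not read for this query.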
  • 22. Using Hadoop for Datawarehouse
  • 23. Hadoop ecosystem: an ecosystem of open source projects, hosted by the Apache Foundation; Google developed and shared the concepts; a distributed file system that has the ability to scale out
      – Distributed Storage (HDFS)
      – Distributed Processing (MapReduce)
      – Metadata Management (HCatalog)
      – Query (Hive)
      – Scripting (Pig)
      – Non-Relational Database (HBase)
      – Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
      – Workflow & Scheduling (Oozie)
      – Management & Monitoring (Ambari, ZooKeeper)
  • 24. Promising uses of Hadoop in DW context
      – Data staging: Hadoop allows organizations to deploy an extremely scalable and economical ETL environment
      – Data archiving: Hadoop’s scalability and low cost enable organizations to keep all data forever in a readily accessible online environment
      – Schema flexibility: Hadoop can quickly and easily ingest any data format
      – Processing flexibility: Hadoop enables the growing practice of “late binding”: instead of transforming data as it’s ingested by Hadoop, structure is applied at runtime
      – Distributed DW architecture: Off load workloads for big data and advanced analytics to HDFS, discovery platforms and MapReduce
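The "late binding" idea, where structure is applied at runtime rather than at ingestion, can be sketched without Hadoop at all. In the Python illustration below, raw lines are stored exactly as they arrived, in two made-up formats, and a reader function imposes a common shape only when the data is queried. The record layouts and field names are assumptions.

```python
import json

# Raw data kept as ingested: two JSON events and one pipe-delimited record.
raw_store = [
    '{"user": "u1", "event": "click", "ts": "2013-05-01T10:00:00"}',
    '{"user": "u2", "event": "view", "ts": "2013-05-01T10:01:00", "page": "/home"}',
    'u3|purchase|2013-05-01T10:02:00|19.99',
]

def read_events(lines):
    """Apply a schema at read time; nothing was transformed when the data was stored."""
    for line in lines:
        if line.startswith("{"):
            rec = json.loads(line)
            yield {"user": rec["user"], "event": rec["event"], "amount": None}
        else:
            user, event, _ts, amount = line.split("|")
            yield {"user": user, "event": event, "amount": float(amount)}

events = list(read_events(raw_store))
purchases = [e for e in events if e["event"] == "purchase"]
print(len(events), purchases)
```

A different analysis could read the same raw lines with a different reader, for example one that keeps the `page` field, without reloading anything.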
  • 25. What led to Datawarehouse at Facebook
      • The Problem: Data, data and more data
          – 200 GB per day in March 2008
          – 2+ TB (compressed) per day
      • The Hadoop Experiment
          – Superior in availability, scalability and manageability compared to commercial databases
          – Uses Hadoop File System (HDFS)
      • Challenges with Hadoop: Programmability & Metadata
          – MapReduce hard to program
          – Need to publish data in well known schemas
      • HIVE
          – What is Hive? A system for managing and querying structured data built on top of Hadoop; uses MapReduce for execution; uses HDFS for storage
          – Key building principles: SQL on structured data as a familiar data warehousing tool; pluggable map/reduce scripts in language of your choice; rich data types; performance
          – Tables: each table has a corresponding directory in HDFS; each table points to existing data directories in HDFS; split data based on hash of a column, mainly for parallelism
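Splitting data on the hash of a column can be illustrated in plain Python. The sketch below is not Hive's own hash function or file layout; it uses `zlib.crc32` as a stand-in deterministic hash, and the bucket count and user ids are made up. The point is that equal values always land in the same bucket, so buckets can be processed in parallel.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Map a column value to a bucket number via a deterministic hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

user_ids = [f"user{i}" for i in range(20)]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for uid in user_ids:
    buckets[bucket_for(uid)].append(uid)

for b, members in buckets.items():
    print(b, len(members))
```

Each bucket could then be handed to a separate worker, and a join on the bucketed column only needs to compare rows that share a bucket number.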
  • 27. Analytical platforms overview
      – Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general purpose solutions
      – Platforms: 1010data, Aster Data (Teradata), Calpont, Datallegro (Microsoft), Exasol, Greenplum (EMC), IBM SmartAnalytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, Paraccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)
      – Deployment options: software only (Paraccel, Vertica); appliance (SAP, Exadata, Netezza); hosted (1010data, Kognitio)
      – Kelley Blue Book: consolidates millions of auto transactions each week to calculate car valuations
      – AT&T Mobility: tracks purchasing patterns for 80M customers daily to optimize targeted marketing
  • 28. Which platform do you choose? Platforms: Hadoop, Analytic Database, General Purpose RDBMS. Data types: Structured, Semi-Structured, Unstructured
  • 29. Thank You Please send your Feedback & Corporate Training /Consulting Services requirements on BI to sameer@compulinkacademy.com