MEDI-CAPS UNIVERSITY
Faculty of Engineering
Mr. Sagar Pandya
Information Technology Department
sagar.pandya@medicaps.ac.in
Data Mining and Warehousing
Course Code: IT3ED02
Course Name: Data Mining and Warehousing
Hours Per Week (L-T-P): 3-0-0
Total Credits: 3
 Unit 1. Introduction
 Unit 2. Data Mining
 Unit 3. Association and Classification
 Unit 4. Clustering
 Unit 5. Business Analysis
Text Books
 Han, Kamber and Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
India, 2012.
 Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press.
 Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA.
Reference Books
 Sam Anahory and Dennis Murray, Data Warehousing in the Real World,
Pearson Education Asia.
 W. H. Inmon, Building the Data Warehouse, 4th Ed., Wiley India.
and many others
Unit-1 Introduction
 Data warehousing Components –Building a Data warehouse,
 Need for data warehousing,
 Basic elements of data warehousing,
 Data Mart,
 Data Extraction, Clean-up, and Transformation Tools –Metadata,
 Star, Snowflake and Galaxy Schemas for Multidimensional databases,
 Fact and dimension data,
 Partitioning Strategy-Horizontal and Vertical Partitioning.
What is Data?
 Data is a collection of unprocessed items that may consist of text,
numbers, images and video. Today, data can be represented in
various forms such as sound, images and video.
 Structured: numbers, text, etc.
 Unstructured: images, video, etc.
What is Information?
 Meaningful data is called information.
 Information refers to the data that have been processed in such a way
that the knowledge of the person who uses the data is increased.
 Example:- 1A$ - Data (No meaning)
1$ - Information (Currency)
 For a decision to be meaningful, the processed data must satisfy
the following characteristics:
• Timely − Information should be available when required.
• Accuracy − Information should be accurate.
• Completeness − Information should be complete.
What is Metadata?
 Metadata describes other data.
 Data about data,
 For example - an image may include metadata that describes how
large the picture is, the color depth, the image resolution, when the
image was created, and other data.
 A text document's metadata may contain information about how long
the document is, who the author is, when the document was written,
and a short summary of the document.
 1) Operational Metadata
 2) Extraction and Transformation Metadata
 3) End User Metadata
What is Database and DBMS?
 A database is a collection of inter-related data that supports efficient
retrieval, insertion and deletion of data and organizes the data in the
form of tables.
 The software which is used to manage database is called Database
Management System (DBMS).
 A database management system stores data in such a way that it
becomes easier to retrieve, manipulate, and produce information.
 For Example, MySQL, Oracle etc. are popular commercial DBMS
used in different applications.
Operational vs. Informational Systems
 Operational systems, as their name implies, are the systems that support
the everyday operations of the enterprise.
 These are the backbone systems of any enterprise, and include order
entry, inventory, manufacturing, payroll and accounting.
 Due to their importance to the organization, operational systems
were almost always the first parts of the enterprise to be
computerized.
Operational vs. Informational Systems
 Informational systems deal with analyzing data and making
decisions, often major, about how the enterprise will operate now,
and in the future.
 Not only do informational systems have a different focus from
operational ones, they often have a different scope.
 Where operational data needs are normally focused upon a single
area, informational data needs often span a number of different areas
and need large amounts of related operational data.
Data Warehouse
 The term "Data Warehouse" was first coined by Bill Inmon in 1990.
He is considered the father of the data warehouse.
 According to Inmon, a data warehouse is a subject-oriented,
integrated, time-variant, and non-volatile collection of data.
 According to Ralph Kimball, a data warehouse is a copy of transaction
data specifically structured for query and analysis.
 A single, complete and consistent store of data obtained from a
variety of different sources, made available to end users in a way
they can understand and use in a business context.
Data Warehouse
 This data helps analysts to take informed decisions in an
organization.
 A Data Warehouse is a group of data specific to the entire
organization, not only to a particular group of users.
 It is not used for daily operations and transaction processing but used
for making decisions.
Data Warehouse
 Data is a collection of raw material in an unorganized format. We
have to convert that data into information. To make a decision,
we need to collect the data; using that data we get information,
and finally we take the decision.
 Example:- In an organization, we have many departments such as the Sales
department, Product department, HR department and many others. Before releasing
any product to the market, the CEO collects data from the Sales
department and the Product department to take decisions on profits
and losses.
Data Warehouse
 In an organisation, there are several departments, and each
department performs different kinds of transactions; all these
transactions are saved in an Operational Data Store (ODS).
 The main characteristic of an ODS is that its data is volatile and it does not
maintain any history. So what does volatile mean? It means
the data changes at regular intervals of time.
 Example:- At Big Bazaar, the CEO needs to take a decision about a
particular product, so he needs 3 to 5 years of data. But an ODS
does not maintain any history. So, every organisation should
maintain historical data to take decisions based on product sales.
Data Warehouse
 Data warehousing is the process of constructing and using a data
warehouse.
 A data warehouse is a database, which is kept separate from the
organization's operational database.
 A data warehouse helps executives to organize, understand, and use
their data to take strategic decisions.
 It possesses consolidated historical data, which helps the
organization to analyze its business.
 There is no frequent updating done in a data warehouse.
What can a Data Warehouse do & can’t do?
What can a Data Warehouse do?
 Get Answer Faster
 Make Decision Faster
 Optimize Performance
 Reduce Risk and Cost
What can a Data Warehouse not do?
 Can’t create data itself
 Cleaning of data is required
Need for Data Warehouse
1. Improving Integration:
 An organization registers data in different systems, which support the
various business processes.
In order to create an overall picture of business operations, customers
and suppliers – thus creating a single version of the truth – the data
must come together in one place and be made compatible.
 Both external (from the environment) and internal data (from ERP
and financial systems) should merge into the data warehouse and
then be grouped.
Need for Data Warehouse
2. Speeding up response times
 The source systems are fully optimized in order to process many
small transactions, such as orders, in a short time.
 Creating information about the performance of the organization only
requires a few large ‘transactions’ during which large amounts of
data are being gathered and aggregated.
 The structure of a data warehouse is specifically designed to quickly
analyze such large amounts of data.
Need for Data Warehouse
3. Faster and more flexible reporting:
 The structure of both data warehouses and data marts enables end
users to report in a flexible manner and to quickly perform
interactive analysis on the basis of various predefined angles
(dimensions).
 They may, for example, jump with a single mouse click from year
level to quarter level to month level, and quickly switch between the
customer dimension and the product dimension while the
indicator remains fixed.
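A minimal pandas sketch of this kind of drill-down and dimension switching, assuming a hypothetical sales table with year, quarter, month, customer, product and amount columns:

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
sales = pd.DataFrame({
    "year":     [2020, 2020, 2020, 2021],
    "quarter":  ["Q1", "Q1", "Q2", "Q1"],
    "month":    ["Jan", "Feb", "Apr", "Jan"],
    "customer": ["A", "B", "A", "C"],
    "product":  ["P1", "P2", "P1", "P3"],
    "amount":   [100.0, 250.0, 80.0, 300.0],
})

# Drill down: year level -> quarter level -> month level.
by_year = sales.groupby("year")["amount"].sum()
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()
by_month = sales.groupby(["year", "quarter", "month"])["amount"].sum()

# Switch dimension while the indicator (total amount) stays fixed.
by_customer = sales.groupby("customer")["amount"].sum()
by_product = sales.groupby("product")["amount"].sum()
print(by_quarter)
print(by_product)
```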
Need for Data Warehouse
 In most organizations, data about specific parts of the business exists
somewhere, in some form, and usually in large volumes.
 Data is available, but not information – and not the right information
at the right time.
 Bring together information from multiple resources as to provide a
consistent database source for decision support queries.
 To help workers in their everyday business activity and improve their
productivity.
 To help knowledge workers (Executives, Managers, Analysts) make
faster and better decisions – decision support systems.
Data Warehouse Features
 Subject Orientation:- Data is organized by subject.
 Integration:- Consistency in defining parameters across sources.
 Non-Volatility:- Data, once loaded, remains stable and is not changed or deleted.
 Time-Variance:- Data is associated with points in time, and history is preserved.
 Data Granularity:- Details of data are kept at a low level.
Data Warehouse Characteristics
Subject-oriented
 A data warehouse is subject-oriented because it provides information
around a subject rather than the organization's ongoing operations.
 It is a subject-oriented database which supports the business needs of
department-specific users.
 Example: Sales, HR, Accounts, Marketing, etc.
Subject-oriented
 A data warehouse targets the modeling and analysis of data for
decision-makers.
 Therefore, data warehouses typically provide a concise and
straightforward view around a particular subject, such as customer,
product, or sales, instead of the global organization's ongoing
operations.
 This is done by excluding data that are not useful concerning the
subject and including all data needed by the users to understand the
subject.
Integrated
 In a data warehouse, integration means the establishment of a
common unit of measure for all similar data from dissimilar
databases.
 The data also needs to be stored in the data warehouse in a common
and universally acceptable manner.
 A data warehouse integrates various heterogeneous data sources like
RDBMS, flat files, and online transaction records.
 This integration helps in effective analysis of data. Consistency in
naming conventions, attribute measures, encoding structure etc. have
to be ensured.
Integrated
 Consider three different applications labeled A, B and C.
 The information stored in these applications includes Gender, Date, and
Balance; however, each application stores the data in a different way.
• In Application A, the gender field stores logical values like M or F.
• In Application B, the gender field is a numerical value.
• In Application C, the gender field is stored in the form of a
character value.
• The same is the case with Date and Balance.
 However, after the transformation and cleaning process, all this data is
stored in a common format in the Data Warehouse.
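A minimal sketch of this kind of integration step in Python; the per-application conventions and the mapping rules below are assumptions for illustration only:

```python
# Assumed source conventions: App A uses 'M'/'F', App B uses 1/0,
# App C uses 'male'/'female'. The warehouse standard is 'M'/'F'.
def standardize_gender(value, source):
    if source == "A":
        return value                              # already 'M' or 'F'
    if source == "B":
        return "M" if value == 1 else "F"         # numeric code -> letter
    if source == "C":
        return value.strip().lower()[0].upper()   # 'male' -> 'M', 'female' -> 'F'
    raise ValueError(f"unknown source {source}")

rows = [("M", "A"), (1, "B"), ("female", "C")]
print([standardize_gender(v, s) for v, s in rows])  # ['M', 'M', 'F']
```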
Time-Variant
 A data warehouse is a time-variant database which supports business
management in analyzing the business and comparing it across
different time periods such as Year, Quarter, Month, Week
and Date.
 Historical information is kept in a data warehouse.
 For example, one can retrieve data that is 3 months, 6 months, 12
months or even older from a data warehouse.
 This contrasts with a transaction system, where often only the
most current data is kept.
 Another aspect of time variance is that once data is inserted in the
warehouse, it can't be updated or changed.
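A small illustration in plain Python of retrieving such historical slices, using invented records and a rough 30-day month window:

```python
from datetime import date, timedelta

# Hypothetical warehouse rows keyed by a date attribute.
warehouse = [
    {"sale_date": date(2020, 1, 15), "amount": 120.0},
    {"sale_date": date(2020, 7, 3),  "amount": 340.0},
    {"sale_date": date(2021, 1, 20), "amount": 90.0},
]

def last_n_months(rows, n, today=date(2021, 2, 1)):
    cutoff = today - timedelta(days=30 * n)   # rough month window
    return [r for r in rows if r["sale_date"] >= cutoff]

print(len(last_n_months(warehouse, 3)))   # rows from roughly the last 3 months
print(len(last_n_months(warehouse, 12)))  # rows from roughly the last 12 months
```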
Non-Volatile
 Non-volatile means the previous data is not erased when new data is
added to it.
 A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not
reflected in the data warehouse.
 Typical activities such as deletes, inserts, and changes that are
performed in an operational application environment are completely
nonexistent in a DW environment.
 Only two types of data operations are performed in data
warehousing:
1. Data loading
2. Data access
Data Warehouse VS Operational Database
S.no. | Data Warehouse | Operational Database
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | Data warehouse systems are used by knowledge workers such as executives, managers, and analysts. | Operational database systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
Data Warehouse VS Operational Database
S.no. | Data Warehouse | Operational Database
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
Data Warehouse VS Operational Database
S.no. | Data Warehouse | Operational Database
9 | The number of users is in hundreds. | The number of users is in thousands.
10 | The number of records accessed is in millions. | The number of records accessed is in tens.
11 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
12 | These are highly flexible. | It provides high performance.
How Datawarehouse works?
 A Data Warehouse works as a central repository where information
arrives from one or more data sources.
 Data flows into a data warehouse from the transactional system and
other relational databases.
 Data may be:
1. Structured
2. Semi-structured
3. Unstructured data
How Datawarehouse works?
 The data is processed, transformed, and ingested so that users can
access the processed data in the Data Warehouse through Business
Intelligence tools, SQL clients, and spreadsheets.
 A data warehouse merges information coming from different sources
into one comprehensive database.
 By merging all of this information in one place, an organization can
analyze its customers more holistically.
 This helps to ensure that it has considered all the information
available.
 Data warehousing makes data mining possible.
 Data mining is looking for patterns in the data that may lead to
higher sales and profits.
Benefits of a Data Warehouse
1) Delivers enhanced business intelligence
 By having access to information from various sources from a single
platform, decision makers will no longer need to rely on limited data
or their instinct.
2) Saves time
 Executives can query the data themselves with little to no IT support,
saving more time and money.
3) Enhances data quality and consistency
 A data warehouse converts data from multiple sources into a
consistent format. Since the data from across the organization is
standardized, each department will produce results that are
consistent. This will lead to more accurate data, which will become
the basis for solid decisions.
Benefits of a Data Warehouse
4) Improves the decision-making process
 By transforming data into purposeful information, decision makers
can perform more functional, precise, and reliable analysis and create
more useful reports with ease.
5) Drives Revenue
 It is often said that “data is the new oil,” referring to the high dollar value of data in
today’s world. Creating more standardized and better quality data is
the key strength of a data warehouse, and this key strength translates
clearly into significant revenue gains. The data warehouse formula
works like this: better business intelligence leads to better
decisions, and in turn better decisions create a higher return on
investment across any sector of your business.
Online Analytical Processing (OLAP)
• Involves historical processing of information.
• OLAP systems are used by knowledge workers such as executives,
managers and analysts.
• It focuses on Information out.
• Based on Star Schema, Snowflake Schema and Fact Constellation
Schema.
• Contains historical data.
• Provides summarized and consolidated data.
• Provides summarized and multidimensional view of data.
• Number of users is in hundreds.
• Number of records accessed is in millions.
• Database size is from 100 GB to 1 TB
Online Transactional Processing (OLTP)
• Involves day-to-day processing.
• OLTP systems are used by clerks, DBAs, or database professionals.
• It focuses on Data in.
• Based on Entity Relationship Model.
• Contains current data.
• Provides primitive and highly detailed data.
• Provides detailed and flat relational view of data.
• Number of users is in thousands.
• Number of records accessed is in tens.
• Database size is from 100 MB to 1 GB.
Data Mart
• A data mart is a simple section of the data warehouse that delivers a
single functional data set.
• Often holds only one subject area- for example, Finance, or Sales.
• May hold more summarized data.
Data Mart
• Windows-based or Unix/Linux-based servers are used to implement
data marts.
• They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of
time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may become complex in the long run if its
planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data
warehouse.
• Data marts are flexible.
Need Of Data Mart
 Data Mart focuses only on functioning of particular department of an
organization.
 It is maintained by single authority of an organization.
 Since it stores data related to a specific part of an organization,
data retrieval from it is very quick.
 Designing and maintaining a data mart is quite easy compared to
a data warehouse.
 It reduces the response time of users as it stores a small volume of data.
 It is small in size, due to which accessing data from it is very fast.
 This storage unit is used by most organizations for the smooth
running of their departments.
Types of Data Mart:
 There are three main types of data marts:
1. Dependent: Dependent data marts are created by drawing data
directly from operational, external or both sources.
2. Independent: Independent data mart is created without the use of a
central data warehouse.
3. Hybrid: This type of data marts can take data from data warehouses
or operational systems.
Dependent Data Mart
 A dependent data mart is created by extracting data from the central
repository, the data warehouse.
 First data warehouse is created by extracting data (through ETL tool)
from external sources and then data mart is created from data
warehouse.
 Dependent data mart is created in top-down approach of
Datawarehouse architecture.
 This model of data mart is used by big organizations.
Independent Data Mart
 The second approach is Independent data marts (IDM).
 Independent Data Mart is created directly from external sources
instead of data warehouse.
 First data mart is created by extracting data from external sources
and then Datawarehouse is created from the data present in data
mart.
 Independent data mart is designed in bottom-up approach of
Datawarehouse architecture.
 This model of data mart is used by small organizations and is cost
effective comparatively.
Hybrid Data Mart
 This type of Data Mart is created by extracting data from operational
source or from data warehouse.
 It is best suited for multiple database environments and fast
implementation turnaround for any organization.
 It also requires least data cleansing effort.
 A hybrid data mart also supports large storage structures, and it is best
suited for flexible, smaller data-centric applications.
 1) Path-1 reflects accessing data directly from external sources and
 2) Path-2 reflects dependent data model of data mart.
Steps in Implementing a Datamart
 Implementing a Data Mart is a rewarding but complex procedure.
 The significant steps in implementing a data mart are: design the
schema, construct the physical storage, populate the data mart with
data from source systems, access it to make informed decisions, and
manage it over time.
Advantages of Data Mart
 Implementation of data mart needs less time as compared to
implementation of Datawarehouse as data mart is designed for a
particular department of an organization.
 Organizations are provided with choices to choose model of data
mart depending upon cost and their business.
 Data can be easily accessed from data mart.
 It serves frequently run queries, which enables analysis of business
trends.
Disadvantages of Data Mart
 Since it stores data related only to a specific function, it does not
store the huge volume of data related to each and every department of an
organisation the way a data warehouse does.
 It can become a big hurdle to maintain.
Difference between Datawarehouse & Data Mart
Data Warehouse | Data Mart
A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is a subtype of a data warehouse, architected to meet the requirements of a specific user group.
It may hold multiple subject areas. | It holds only one subject area, for example, Finance or Sales.
It holds very detailed information. | It may hold more summarized data.
A data warehouse is data-oriented. | A data mart is project-oriented.
In data warehousing, a fact constellation is used. | In a data mart, star schema and snowflake schema are used.
It is a centralized system. | It is a decentralized system.
ETL Process
 The mechanism of extracting information from source systems and
bringing it into the data warehouse is commonly called ETL, which
stands for Extraction, Transformation and Loading.
ETL Process
 It is a process in which an ETL tool extracts the data from various
data source systems, transforms it in the staging area and then finally,
loads it into the Data Warehouse system.
Why do you need ETL?
 It helps companies to analyze their business data for taking critical
business decisions.
 Transactional databases cannot answer complex business questions
that can be answered by ETL.
 ETL provides a method of moving the data from various sources into
a data warehouse.
 A well-designed and documented ETL system is almost essential to the
success of a Data Warehouse project.
 ETL helps to Migrate data into a Data Warehouse. Convert to the
various formats and types to adhere to one consistent system.
 ETL is a predefined process for accessing and manipulating source
data into the target database.
ETL Process - Extraction
 Extraction is the operation of extracting information from a source
system for further use in a data warehouse environment. This is the
first stage of the ETL process.
 Extraction process is often one of the most time-consuming tasks in
the ETL.
 The source systems might be complicated and poorly documented,
and thus determining which data needs to be extracted can be
difficult.
 The data has to be extracted several times in a periodic manner to
supply all changed data to the warehouse and keep it up-to-date.
ETL Process - Extraction
 It is important to extract the data from various source systems and
store it into the staging area first and not directly into the data
warehouse because the extracted data is in various formats and can
be corrupted also.
 Hence loading it directly into the data warehouse may damage
it. Therefore, this is one of the most important steps of ETL process.
 The extraction step should be designed in such a way that it does not
have a negative effect on the source system.
 Data extraction time slots for different systems vary as per their time
zones and operational hours.
ETL Process - Transformation
 The second step of the ETL process is transformation. In this step, a
set of rules or functions are applied on the extracted data to convert it
into a single standard format.
 Data extracted from source server is raw and not usable in its original
form. Therefore it needs to be cleansed, mapped and transformed.
 The main objective of this step is to load the extracted data into the
target database in a clean and uniform format.
 For example, consider two sources A and B.
 The date format of A is dd/mm/yyyy and the format of B is mm/dd/yy.
 During transformation, these date formats are brought into a single,
general format.
ETL Process - Transformation
 In this step, a set of rules or functions are applied on the extracted
data to convert it into a single standard format. It may involve
following processes/tasks:
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States and America into USA, etc.
 Joining – joining multiple attributes (columns) into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally
key-attribute).
 Enrichment – Full name to ‘First Name’, ‘Middle Name’ & ‘Last
Name’.
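A minimal pandas sketch of a few of these transformation tasks (cleaning, mapping, splitting, sorting); the column names and default values are assumptions for illustration:

```python
import pandas as pd

# Raw extract with inconsistent values (illustrative only).
raw = pd.DataFrame({
    "full_name": ["Amit Kumar Shah", "Neha Rani Verma", "Ravi N. Patel"],
    "country":   ["U.S.A", "United States", "America"],
    "amount":    [120.0, None, 75.5],
})

# Cleaning: fill NULLs with a default and map country variants to one value.
raw["amount"] = raw["amount"].fillna(0.0)
raw["country"] = raw["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})

# Splitting: one attribute into several (first / middle / last name).
parts = raw["full_name"].str.split(" ", n=2, expand=True)
raw["first_name"] = parts[0]
raw["middle_name"] = parts[1]
raw["last_name"] = parts[2]

# Sorting: order tuples on a key attribute.
clean = raw.sort_values("amount", ascending=False)
print(clean[["first_name", "country", "amount"]])
```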
ETL Process - Transformation
 Following are Data Integrity Problems:
1) Different spelling of the same person like Jon, John, etc.
2) There are multiple ways to denote company name like Google,
Google pvt. ltd., Google Inc.
3) Use of different names like Mumbai, Bombay.
4) There may be a case that different account numbers are generated
by various applications for the same customer.
5) In some records, required fields remain blank.
6) Invalid products collected at POS, since manual entry can lead to
mistakes.
ETL Process - Loading
 The third and final step of the ETL process is loading. In this step,
the transformed data is finally loaded into the data warehouse.
 Sometimes the data is updated by loading into the data warehouse
very frequently and sometimes it is done after longer but regular
intervals.
 The rate and period of loading solely depends on the requirements
and varies from system to system.
 In case of load failure, recovery mechanisms should be configured to
restart from the point of failure without loss of data integrity.
 Data warehouse admins need to monitor, resume, or cancel loads as per
the prevailing server performance.
ETL Process
 The ETL process can also use the pipelining concept, i.e., as soon as some
data is extracted, it can be transformed, and during that period some new
data can be extracted. And while the transformed data is being loaded
into the data warehouse, the already extracted data can be
transformed.
 The block diagram of the pipelined ETL process is shown below:
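A minimal generator-based sketch of the same pipelining idea in Python, assuming in-memory lists stand in for the source system and the warehouse:

```python
# Generators process rows lazily, so extraction, transformation and
# loading of successive rows can overlap, as in a pipelined ETL flow.
source = [" alice,100 ", " bob,200 ", " carol,300 "]
warehouse = []

def extract(rows):
    for row in rows:                 # rows stream out one at a time
        yield row.strip()

def transform(rows):
    for row in rows:                 # transform while extraction continues
        name, amount = row.split(",")
        yield {"name": name.title(), "amount": float(amount)}

def load(rows):
    for row in rows:                 # load while earlier rows are transformed
        warehouse.append(row)

load(transform(extract(source)))
print(warehouse)
```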
Selecting an ETL Tool
 Selection of an appropriate ETL tool is an important decision that
has to be made for the success of an ODS or data
warehousing application.
 The ETL tools are required to provide coordinated access to multiple
data sources so that relevant data may be extracted from them.
 An ETL tool would generally contain tools for data cleansing, re-
organization, transformations, aggregation, calculation and automatic
loading of information into the target database.
 An ETL tool should provide a simple user interface that allows data
cleansing and data transformation rules to be specified using a point-
and-click approach.
ETL tools
 When all mappings and transformations have been defined, the ETL
tool should automatically generate the data
extract/transformation/load programs.
 There are many data warehousing tools available in the market.
Here are some of the most prominent ones:
 1. MarkLogic
 2. Oracle
 3. Amazon RedShift
 4. Sybase
Components of Data Warehouse
 Source Data Component
 Data Staging Component (ETL)
 Metadata Component
 End user tools and applications
 Data Warehouse Management
Data Warehouse Architecture
 DATA WAREHOUSE ARCHITECTURE is complex, as it’s an
information system that contains historical and cumulative data
from multiple sources.
 Data warehouses and their architectures vary depending upon the
specifics of an organization's situation. Three common architectures
are:
 Data Warehouse Architecture (Basic)
 Data Warehouse Architecture (with a staging area)
 Data Warehouse Architecture (with a staging area and data
mart)
Data Warehouse Architecture (Basic)
 Operational System:- An operational system is a method used in
data warehousing to refer to a system that is used to process the day-
to-day transactions of an organization.
 Flat Files:- A Flat file system is a system of files in which
transactional data is stored, and every file in the system must have a
different name.
 End-User access Tools:- The principal purpose of a data warehouse
is to provide information to the business managers for strategic
decision-making. These customers interact with the warehouse using
end-client access tools.
• Example:- Reporting and Query Tools, Application Development
Tools, Executive Information Systems Tools, Online Analytical
Processing Tools, Data Mining Tools
Data Warehouse Architecture (With Staging Area)
• We must clean and process operational information before putting it
into the warehouse.
 We can do this programmatically, although most data warehouses use
a staging area (a place where data is processed before entering the
warehouse).
 A staging area simplifies data cleansing and consolidation for
operational data coming from multiple source systems, especially
for enterprise data warehouses where all relevant data of an
enterprise is consolidated.
Data Warehouse Architecture (With Staging Area)
• Data Warehouse Staging Area is a temporary location where a
record from source systems is copied.
Data Warehouse Architecture (With Staging Area
and Data Marts)
 We may want to customize our warehouse's architecture for multiple
groups within our organization.
 We can do this by adding data marts.
 A data mart is a segment of a data warehouse that provides
information for reporting and analysis on a section, unit, department
or operation in the company, e.g., sales, payroll, production, etc.
 The figure illustrates an example where purchasing, sales, and stocks
are separated.
 In this example, a financial analyst wants to analyze historical data
for purchases and sales or mine historical information to make
predictions about customer behavior.
Types of Data Warehouse Architectures
 DATA WAREHOUSE ARCHITECTURE is complex, as it’s an
information system that contains historical and cumulative data
from multiple sources. There are 3 methods for constructing a data
warehouse: Single Tier, Two Tier and Three Tier.
Types of Data Warehouse Architectures
Single-Tier Architecture
 The objective of a single layer is to minimize the amount of data
stored.
 This goal is to remove data redundancy.
 This architecture is not frequently used in practice.
Two-Tier Architecture
 Two-layer architecture separates physically available sources and
data warehouse.
 This architecture is not expandable and also not supporting a large
number of end-users.
 It also has connectivity problems because of network limitations.
Types of Data Warehouse Architectures
Three-tier architecture
 This is the most widely used architecture.
 Generally, a data warehouse adopts a three-tier architecture,
consisting of:
 1. Bottom tier
 2. Middle tier
 3. Top tier
Types of Data Warehouse Architectures – 3 Tier
1. Bottom Tier: The database of the data warehouse serves as the
bottom tier. It is usually a relational database system. Data is
cleansed, transformed, and loaded into this layer using back-end
tools.
2. Middle Tier: The middle tier in Data warehouse is an OLAP server
which is implemented using either ROLAP or MOLAP model. For a
user, this application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end-user and the
database.
3. Top-Tier: The top tier is a front-end client layer. Top tier is the tools
and API that you connect and get data out from the data warehouse.
It could be Query tools, reporting tools, managed query tools,
Analysis tools and Data mining tools.
Types of Data Warehouse Architectures – 3 Tier
1.) Top Tier
 The Top Tier consists of the Client-side front end of the architecture.
 The Transformed and Logic applied information stored in the Data
Warehouse will be used and acquired for Business purposes in this
Tier.
 Several Tools for Report Generation and Analysis are present for the
generation of desired information.
 Data mining which has become a great trend these days is done here.
 All requirement analysis documents, costing, and the features that
determine a profit-based business deal are prepared based on these tools,
which use the data warehouse information.
Types of Data Warehouse Architectures – 3 Tier
2.) Middle Tier
 The Middle Tier consists of the OLAP servers.
 OLAP stands for Online Analytical Processing.
 OLAP is used to provide information to business analysts and
managers.
 As it is located in the Middle Tier, it rightfully interacts with the
information present in the Bottom Tier and passes on the insights to
the Top Tier tools which processes the available information.
 Mostly Relational or Multidimensional OLAP is used in Data
warehouse architecture.
Types of Data Warehouse Architectures – 3 Tier
 Bottom Tier:- The Bottom Tier mainly consists of the Data Sources,
ETL Tool, and Data Warehouse.
 1. Data Sources:- The Data Sources consists of the Source Data that is
acquired and provided to the Staging and ETL tools for further process.
 2. ETL Tools:- ETL tools are very important because they help in
combining logic, raw data, and schema into one and load the
information into the Data Warehouse or Data Marts.
 Sometimes, ETL loads the data into the Data Marts and then
information is stored in Data Warehouse. This approach is known as
the Bottom-Up approach.
 The approach where ETL loads information to the Data Warehouse
directly is known as the Top-down Approach.
Data Warehouse Approaches
 A data-warehouse is a heterogeneous collection of different data
sources organized under a unified schema. There are 2 approaches
for constructing data-warehouse: Top-down approach and Bottom-up
approach are explained as below.
1. Top-down approach: The needed components are discussed below:
1.) External Sources –
External source is a source from where data is collected irrespective of
the type of data. Data can be structured, semi structured and
unstructured as well.
2.) Stage Area –
Since the data, extracted from the external sources does not follow a
particular format, so there is a need to validate this data to load into
Datawarehouse. For this purpose, it is recommended to use ETL tool.
Data Warehouse Approaches
E (Extract): Data is extracted from the external data source.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after transforming it
into the standard format.
3.) Data-warehouse – After cleansing of data, it is stored in the
Datawarehouse as central repository. It actually stores the meta data
and the actual data gets stored in the data marts. Note that
Datawarehouse stores the data in its purest form in this top-down
approach.
4.) Data Marts – Data mart is also a part of storage component. It
stores the information of a particular function of an organization which
is handled by single authority. We can also say that data mart contains
subset of the data stored in Datawarehouse.
Data Warehouse Approaches
5.) Data Mining – The practice of analyzing the big data present in the
data warehouse is data mining. It is used to find the hidden patterns
present in the database or data warehouse with the help of data
mining algorithms.
Advantages of Top-Down Approach –
1. Since the data marts are created from the data warehouse, it provides a
consistent dimensional view of the data marts.
2. Also, this model is considered the strongest model for business
changes. That’s why big organizations prefer to follow this approach.
3. Creating a data mart from the data warehouse is easy.
 Disadvantages of Top-Down Approach – The cost and time taken in
designing and maintaining it are very high.
Data Warehouse Approaches
2. Bottom-up approach:
1. First, the data is extracted from external sources (same as happens in
the top-down approach).
2. Then, the data goes through the staging area (as explained above) and is
loaded into data marts instead of the data warehouse. The data marts are
created first and provide reporting capability. Each addresses a single
business area.
3. These data marts are then integrated into the data warehouse.
 This approach is given by Kimball as – data marts are created first
and provide a thin view for analysis, and the data warehouse is created
after the complete data marts have been created.
Data Warehouse Approaches
 Advantages of Bottom-Up Approach –
1. As the data marts are created first, the reports are quickly
generated.
2. We can accommodate a larger number of data marts here, and in this
way the data warehouse can be extended.
3. Also, the cost and time taken in designing this model are comparatively
low.
 Disadvantage of Bottom-Up Approach –
1. This model is not as strong as the top-down approach, as the
dimensional view of the data marts is not as consistent as it is in the
top-down approach.
Difference Between Top-down Approach and
Bottom-up Approach
S.no. | Top-Down Approach | Bottom-Up Approach
1 | Provides a definite and consistent view of information, as information from the data warehouse is used to create the data marts. | Reports can be generated easily, as data marts are created first and it is relatively easy to interact with them.
2 | Strong model and hence preferred by big companies. | Not as strong, but the data warehouse can be extended and more data marts can be created.
3 | Time, cost and maintenance are high. | Time, cost and maintenance are low.
Design of Data Warehouse
 An important point about Data Warehouse is its efficiency. To create
an efficient Data Warehouse, we construct a framework known as the
Business Analysis Framework.
 There are four types of views in regard to the design of a DW.
 1. Top-Down View: This View allows only specific information
needed for a data warehouse to be selected.
 2. Data Source View: This view shows all the information from the
source of data to how it is transformed and stored.
 3. Data Warehouse View: This view shows the information present
in the Data warehouse through fact tables and dimension tables.
 4. Business Query View: This is a view that shows the data from the
user’s point of view.
Advantages of Data Warehouse
 1. Integrating data from multiple sources.
 2. Performing new types of analyses.
 3. Reducing cost to access historical data.
 Other benefits may include:
 1. Standardizing data across the organization, a "single version of the
truth“.
 2. Improving turnaround time for analysis and reporting.
 3. Sharing data and allowing others to easily access data.
 4. Removing informational processing load from transaction-
oriented databases.
Disadvantages of Data Warehouse
 The major disadvantage is that a data warehouse can be costly to
maintain and that becomes a problem if the warehouse is
underutilized.
 It seems that managers have unrealistic expectations about what they
will get from having a data warehouse.
 There are considerable disadvantages involved in moving data from
multiple, often highly disparate, data sources to one data warehouse
that translate into long implementation time, high cost, lack of
flexibility, dated information, and limited capabilities.
 The data warehouse may seem easy, but actually, it is too complex
for the average users.
 Not an ideal option for unstructured data.
Metadata
 The name metadata suggests some high-level technological
concept.
 However, it is quite simple: metadata is data about data, and it
defines the data warehouse.
 It is used for building, maintaining and managing the data
warehouse.
 In the Data Warehouse Architecture, meta-data plays an important
role as it specifies the source, usage, values, and features of data
warehouse data.
 It also defines how data can be changed and processed.
 It is closely connected to the data warehouse.
Metadata
For example, a line in a sales database may contain: 4030 KJ732 299.90
 This is meaningless data until we consult the metadata, which tells us it
was:
• Model number: 4030
• Sales Agent ID: KJ732
• Total sales amount of $299.90
 Therefore, Meta Data are essential ingredients in the transformation
of data into knowledge.
Metadata
 Metadata helps to answer the following questions
• What tables, attributes, and keys does the Data Warehouse contain?
• Where did the data come from?
• How many times do data get reloaded?
• What transformations were applied with cleansing?
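A minimal sketch of the kind of metadata record that answers such questions; every value below is invented for illustration:

```python
# Hypothetical metadata entry for one warehouse table.
sales_fact_metadata = {
    "table": "sales_fact",
    "attributes": ["date_key", "product_key", "store_key", "amount"],
    "primary_key": ["date_key", "product_key", "store_key"],
    "source": "orders table in the operational OLTP system",
    "reload_frequency": "daily",
    "transformations": [
        "map country variants (U.S.A, United States, America) to 'USA'",
        "fill missing amounts with 0.0",
    ],
}

# Answering "where did the data come from?" from the metadata.
print(sales_fact_metadata["source"])
```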
Data Warehouse Models
S.no. | Database System | Data Warehouse
1 | It supports operational processes. | It supports analysis and performance reporting.
2 | An operational database is one where data changes frequently. | A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
3 | It focuses on current transactional data. | It focuses on historical data.
4 | Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data Warehouse Models
S.no. | Database System | Data Warehouse
5 | ER based. | Star/Snowflake based.
6 | Application oriented. | Subject oriented.
7 | It is slow for analytics queries. | It is fast for analysis queries.
8 | Relational databases are created for on-line transaction processing (OLTP). | Data warehouses are designed for on-line analytical processing (OLAP).
9 | Data stored in the database is up to date. | Current and historical data is stored in the data warehouse; it may not be up to date.
Dimensional Modeling
 DIMENSIONAL MODELING (DM) is a data structure technique
optimized for data storage in a Data warehouse.
 The purpose of dimensional model is to optimize the database for fast
retrieval of data.
 The concept of Dimensional Modelling was developed by Ralph
Kimball and consists of "fact" and "dimension" tables.
 A Dimensional model is designed to read, summarize, analyze
numeric information like values, balances, counts, weights, etc. in a
data warehouse.
 In contrast, relational models are optimized for addition, updating and
deletion of data in a real-time online transaction system.
Elements of Dimensional Data Model
 Fact:- Facts are the measurements/metrics or facts from your
business process. For a Sales business process, a measurement would
be quarterly sales number.
 Dimension:- Dimension provides the context surrounding a business
process event. In simple terms, they give who, what, where of a fact.
In the Sales business process, for the fact quarterly sales number,
dimensions would be
• Who – Customer Names
• Where – Location
• What – Product Name
 In other words, a dimension is a window to view information in the
facts.
Elements of Dimensional Data Model
 Attributes
 The Attributes are the various characteristics of the dimension.
 In the Location dimension, the attributes can be
• State
• Country
• Zipcode etc.
 Attributes are used to search, filter, or classify facts. Dimension
tables contain attributes.
Elements of Dimensional Data Model
 Fact Table
 A fact table is a primary table in a dimensional model.
 A Fact Table contains
1. Measurements/facts
2. Foreign key to dimension table
Elements of Dimensional Data Model
 Dimension table
• A dimension table contains dimensions of a fact.
• They are joined to fact table via a foreign key.
• Dimension tables are de-normalized tables.
• The Dimension Attributes are the various columns in a dimension
table
• Dimensions offer descriptive characteristics of the facts with the
help of their attributes.
• There is no set limit on the number of dimensions.
• A dimension can also contain one or more hierarchical
relationships.
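A minimal pandas sketch of a fact table joined to its dimension tables through foreign keys; all table names, columns and values are invented for illustration:

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per member.
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Electronics", "Electronics"],
})
dim_customer = pd.DataFrame({
    "customer_key": [10, 11],
    "customer_name": ["Asha", "Vikram"],
    "city": ["Indore", "Mumbai"],
})

# Fact table: measurements plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "customer_key": [10, 11, 11],
    "quantity": [1, 2, 1],
    "amount": [55000.0, 40000.0, 52000.0],
})

# A star-join query: total sales amount by product name.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_customer, on="customer_key")
          .groupby("product_name")["amount"].sum())
print(report)
```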
Multidimensional schema
 Multidimensional Schema is especially designed to model data
warehouse systems.
 The schemas are designed to address the unique needs of very large
databases designed for the analytical purpose (OLAP).
 Types of Data Warehouse Schema:
 Following are 3 chief types of multidimensional schemas each
having its unique advantages.
• Star Schema
• Snowflake Schema
• Galaxy Schema
Star Schema
 In the STAR Schema, the center of the star can have one fact table
and a number of associated dimension tables.
 The star schema is the fundamental schema among the data mart schemas,
and it is the simplest.
 This schema is widely used to develop or build a data warehouse and
dimensional data marts.
 It is known as star schema as its structure resembles a star.
 The star schema is the simplest type of Data Warehouse schema.
 It is also known as Star Join Schema and is optimized for querying
large data sets.
Star Schema
 In a star schema, the fact table will be at the center and is connected
to the dimension tables.
 The tables are completely in a denormalized structure.
 SQL query performance is good, as fewer joins are
involved.
 Data redundancy is high and occupies more disk space.
 It is said to be star as its physical model resembles to the star shape
having a fact table at its center and the dimension tables at its
peripheral representing the star’s points.
 Usually the fact tables in a star schema are in third normal
form(3NF) whereas dimensional tables are de-normalized.
Star Schema
Characteristics of Star Schema:
 Every dimension in a star schema is represented with only a single dimension table.
 The dimension table should contain the set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized.
 For instance, in the figure above, Country_ID does not have a Country lookup table as
an OLTP design would have.
 The schema is widely supported by BI tools.
Star Schema
 Advantages of Star Schema –
1. Simpler Queries:
The join logic of a star schema is quite simple compared with the join
logic needed to fetch data from a highly normalized transactional
schema.
2. Simplified Business Reporting Logic:
Compared with a highly normalized transactional schema, the
star schema simplifies common business reporting logic, such as
as-of reporting and period-over-period reporting.
3. Feeding Cubes:
Star schema is widely used by all OLAP systems to design OLAP
cubes efficiently. In fact, major OLAP systems deliver a ROLAP
mode of operation which can use a star schema as a source without
designing a cube structure.
Star Schema
 Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly
de-normalized state.
2. Not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas don’t support many-to-many relationships within
business entities – at least not frequently.
Snowflake Schema
 SNOWFLAKE SCHEMA is a logical arrangement of tables in a
multidimensional database such that the ER diagram resembles a
snowflake shape.
 A Snowflake Schema is an extension of a Star Schema, and it adds
additional dimensions.
 The dimension tables are normalized which splits data into
additional tables.
 The snowflake schema is a variant of the star schema.
 The snowflake effect affects only the dimension tables and does not
affect the fact tables.
Snowflake Schema
 A snowflake schema is an extension of star schema where the
dimension tables are connected to one or more dimensions.
 The tables are partially denormalized in structure.
 The performance of SQL queries is a bit lower when compared to the star
schema, as more joins are involved.
 Data redundancy is low, and it occupies less disk space when compared
to the star schema.
 The snowflake structure materializes when the dimensions of a star
schema are detailed and highly structured, having several levels of
relationships, and the child tables have multiple parent tables.
Snowflake Schema
 Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema is that it uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Due to multiple tables, query performance is reduced.
• The primary challenge that you will face while using the snowflake
schema is that you need to perform more maintenance effort
because of the larger number of lookup tables.
Snowflake Schema
• For example, the item dimension table in star schema is normalized
and split into two dimension tables, namely item and supplier table.
• Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier-key.
• The supplier key is linked to the supplier dimension table.
• The supplier dimension table contains the attributes supplier_key
and supplier_type.
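A minimal pandas sketch of this normalization of the item dimension into item and supplier tables; the data values are invented for illustration:

```python
import pandas as pd

# Star-schema item dimension (denormalized: supplier_type repeated per item).
item_dim = pd.DataFrame({
    "item_key":      [1, 2, 3],
    "item_name":     ["Pen", "Notebook", "Stapler"],
    "type":          ["stationery", "stationery", "office"],
    "brand":         ["Cello", "Classmate", "Kangaro"],
    "supplier_key":  [100, 100, 200],
    "supplier_type": ["wholesale", "wholesale", "retail"],
})

# Snowflake: split out a supplier dimension and keep only the foreign key.
supplier_dim = (item_dim[["supplier_key", "supplier_type"]]
                .drop_duplicates()
                .reset_index(drop=True))
item_dim_snowflaked = item_dim.drop(columns=["supplier_type"])

print(supplier_dim)
print(item_dim_snowflaked)
```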
Snowflake Schema
 Advantages: There are two main advantages of snowflake schema
given below:
• It provides structured data which reduces the problem of data
integrity.
• It uses small disk space because data are highly structured.
Snowflake Schema
 Disadvantages:
• Snowflaking reduces space consumed by dimension tables, but
compared with the entire data warehouse the saving is usually
insignificant.
• Avoid snowflaking or normalization of a dimension table, unless
required and appropriate.
• Do not snowflake hierarchies of one dimension table into separate
tables. Hierarchies should belong to the dimension table only and
should never be snowflaked.
• Multiple hierarchies can belong to the same dimension if the dimension
has been designed at the lowest possible level of detail.
Fact Constellation Schema
 A Fact constellation means two or more fact tables sharing one or
more dimensions. It is also called Galaxy schema.
 A fact constellation schema describes a logical structure of a data
warehouse or data mart. It can be designed with a collection of
de-normalized fact tables and shared, conformed dimension
tables.
 The schema is viewed as a collection of stars hence the name
Galaxy Schema.
 The fact constellation schema is also a type of multidimensional
model.
 In Galaxy schema shares dimensions are called Conformed
Dimensions.
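The idea of several stars sharing conformed dimensions can be sketched as two fact tables referencing the same dimension table. This is only an illustrative outline; the names sales_fact, shipping_fact and time_dim are invented and not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One conformed dimension shared by both fact tables.
cur.execute("""
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY,
        day      TEXT,
        month    TEXT,
        year     INTEGER
    )""")

# Two fact tables (two "stars") sharing the same time dimension:
# this shared use is what makes the schema a fact constellation / galaxy.
cur.execute("""
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER,
        units_sold   INTEGER,
        sales_amount REAL
    )""")
cur.execute("""
    CREATE TABLE shipping_fact (
        time_key      INTEGER REFERENCES time_dim(time_key),
        item_key      INTEGER,
        units_shipped INTEGER,
        shipping_cost REAL
    )""")
conn.close()
```

Because both fact tables use the identical time_dim definition, results from the two stars can be compared along the shared (conformed) time dimension.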
Fact Constellation Schema
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
Fact Constellation Schema
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Characteristics of Galaxy Schema:
• The dimensions in this schema are split into separate dimension
tables based on the levels of a hierarchy.
• For example, if geography has four levels of hierarchy, such as region,
country, state, and city, then the Galaxy schema should have four
dimension tables.
• Moreover, it is possible to build this type of schema by splitting one
star schema into multiple star schemas.
• The dimension tables are large in this schema, and they need to be
built based on the levels of the hierarchy.
• This schema is helpful for aggregating fact tables for a better
understanding.
Fact Table vs Dimension Table
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.NO FACT TABLE DIMENSION TABLE
1 A fact table contains the
measures defined on the attributes
of the dimension tables.
A dimension table contains the
attributes over which the fact table
calculates its metrics.
2 Located at the center of a
star or snowflake schema
and surrounded by
dimension tables.
Connected to the fact table and
located at the edges of the star or
snowflake schema.
3
Fact tables could contain
information like sales
against a set of dimensions
like Product and Date.
Every dimension table contains
attributes which describe the
details of the dimension, e.g., a
Product dimension can contain
Product ID, Product Category, etc.
Fact Table vs Dimension Table
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.NO FACT TABLE DIMENSION TABLE
4 The fact table holds foreign keys
that reference the primary keys
of the dimension tables.
A dimension table has a primary
key column that uniquely identifies
each dimension record.
5
Does not contain hierarchies.
Contains hierarchies. For example,
Location could contain country,
pin code, state, city, etc.
6 A fact table has fewer
attributes than a dimension table.
A dimension table has more
attributes than a fact table.
7 The number of fact tables in a
schema is smaller than the number
of dimension tables.
The number of dimension tables in
a schema is larger than the number
of fact tables.
Type of Facts
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
• Additive – As the name implies, additive measures are measures
which can be added across all dimensions, e.g., sales amount.
• Non-additive – Unlike additive measures, non-additive measures
cannot be added across any dimension, e.g., ratios or percentages.
• Semi-additive – Semi-additive measures can be added across some
dimensions but not others, e.g., an account balance can be summed
across accounts but not across time.
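A small, self-contained Python sketch of the semi-additive case may help: daily account balances can be summed across the account dimension for a single day, but along the time dimension they are typically averaged (or the closing value is taken) rather than summed. The rows below are invented.

```python
# Daily closing balances: (date, account_id, balance)
rows = [
    ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 40.0),
]

# Additive across the account dimension: total balance held on one day.
total_jan_01 = sum(b for d, a, b in rows if d == "2024-01-01")       # 150.0

# NOT additive across the time dimension: summing a balance over days
# is meaningless, so a semi-additive measure is averaged (or the last
# value is taken) along time instead.
days = {d for d, a, b in rows}
avg_balance_A = sum(b for d, a, b in rows if a == "A") / len(days)   # 110.0

print(total_jan_01, avg_balance_A)
```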
Designing fact table steps
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Here is an overview of the four steps for designing a fact table, as
described by Kimball:
1. Choose the business process to model – the first step is to decide
which business process to model by gathering and understanding the
business needs and the available data.
2. Declare the grain – declaring the grain means describing exactly
what one fact table record represents.
3. Choose the dimensions – once the grain of the fact table is stated
clearly, determine the dimensions for the fact table.
4. Identify the facts – carefully identify which facts will appear in the
fact table.
Star Vs Snowflake Schema: Key Differences
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.no Star Schema Snow Flake Schema
1 Hierarchies for the dimensions
are stored in the dimension
table itself.
Hierarchies are divided into
separate tables.
2 It contains a fact table
surrounded by dimension tables.
One fact table surrounded by
dimension tables, which are in
turn surrounded by other
dimension tables.
3 In a star schema, a single join
creates the relationship between
the fact table and any dimension
table.
A snowflake schema requires
many joins to fetch the data.
4 Simple DB design. Very complex DB design.
Star Vs Snowflake Schema: Key Differences
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.no Star Schema Snow Flake Schema
5 Denormalized data structure;
queries also run faster.
Normalized data structure.
6 High level of data redundancy. Very low level of data redundancy.
7 A single dimension table contains
aggregated data.
Data is split into different
dimension tables.
8 Cube processing is faster. Cube processing may be slower
because of the complex joins.
9 Offers higher-performing queries
using star-join query
optimization; tables may be
connected to multiple
dimensions.
The snowflake schema is
represented by a centralized fact
table which is unlikely to be
connected to multiple dimensions
directly.
Data Warehouse Models
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 From the perspective of data warehouse architecture, we have the
following data warehouse models −
• Enterprise warehouse:- collects all of the information about
subjects spanning the entire organization.
• Data mart:- a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart.
• Virtual warehouse:-
• It is a virtual view over the operational databases.
• A virtual warehouse has a logical description of all the databases
and their structure.
• This method creates a single logical database from all the data
sources.
Data Lake
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data.
 It is a place to store every type of data in its native format, with no
fixed limits on account size or file size.
 It offers high data quantity to increase analytic performance and
native integration.
 A Data Lake is like a large container, very similar to real lakes and
rivers.
 Just as a lake has multiple tributaries coming in, a data lake has
structured data, unstructured data, machine-to-machine data, and logs
flowing through in real time.
Data Lake
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
Data Lake vs Data Warehouse: Key Differences
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.no Data Lakes Data Warehouse
1
Data lakes store everything.
Data Warehouse focuses only
on Business Processes.
2 Data are mainly unprocessed Highly processed data.
3 It can be Unstructured, semi-
structured and structured.
It is mostly in tabular form &
structure.
4 A data lake is mostly used by
data scientists.
Business professionals widely
use the data warehouse.
5
Can use open-source tools like
Hadoop/MapReduce.
Mostly commercial tools like
Google BigQuery, IBM, Amazon,
Oracle.
Big Data vs Data Warehouse
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.NO. BIG DATA DATA WAREHOUSE
1 Big data is a technology to store
and manage large amount of data.
Data warehouse is an architecture
used to organize the data.
2 Big data can handle structured,
non-structured, and semi-structured
data.
A data warehouse only handles
structured data (relational or
non-relational).
3.
Big data does its processing using a
distributed file system.
A data warehouse doesn't use a
distributed file system for
processing.
4.
Big data doesn't rely on SQL
queries to fetch data from the
database.
In a data warehouse, we use SQL
queries to fetch data from relational
databases.
Data Warehousing – Partitioning Strategy
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partitioning is done to enhance performance and facilitate easy
management of data.
 Partitioning also helps in balancing the various requirements of the
system.
It optimizes hardware performance and simplifies the
management of the data warehouse by partitioning each fact table into
multiple separate partitions.
 Why is it Necessary to Partition?
 Partitioning is important for the following reasons −
1. For easy management,
2. To assist backup/recovery,
3. To enhance performance.
Data Warehousing – Partitioning Strategy
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 For Easy Management
 The fact table in a data warehouse can grow up to hundreds of
gigabytes in size.
 A fact table of this size is very hard to manage as a single entity;
therefore, it needs partitioning.
 To Assist Backup/Recovery
 If we do not partition the fact table, then we have to load the
complete fact table with all the data.
 Partitioning allows us to load only as much data as is required on a
regular basis.
 It reduces the time to load and also enhances the performance of the
system.
Data Warehousing – Partitioning Strategy
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Note − To cut down on the backup size, all partitions other than the
current partition can be marked as read-only.
 We can then put these partitions into a state where they cannot be
modified.
 Then they can be backed up. It means only the current partition is to
be backed up.
 To Enhance Performance
 By partitioning the fact table into sets of data, the query procedures
can be enhanced.
 Query performance is enhanced because now the query scans only
those partitions that are relevant.
 It does not have to scan the whole data.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 There are various ways in which a fact table can be partitioned.
 In horizontal partitioning, we have to keep in mind the requirements
for manageability of the data warehouse.
 Partitioning by Time into Equal Segments:
 In this partitioning strategy, the fact table is partitioned on the basis
of time period.
 Here each time period represents a significant retention period within
the business.
 For example, if the user queries for month-to-date data, then it is
appropriate to partition the data into monthly segments.
 We can reuse the partitioned tables by removing the data in them.
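As a rough illustration of partitioning by time into equal segments, the hypothetical Python sketch below routes fact rows into monthly partitions keyed by "YYYY-MM"; the row layout and values are invented.

```python
from collections import defaultdict

# Hypothetical fact rows: (transaction_date, account_id, value).
fact_rows = [
    ("2024-01-15", "A", 120.0),
    ("2024-01-28", "B", 75.0),
    ("2024-02-03", "A", 40.0),
]

# Route each row to a monthly partition, i.e. horizontal partitioning
# by time into equal (one-month) segments.
partitions = defaultdict(list)
for date, account, value in fact_rows:
    month = date[:7]                 # "YYYY-MM" acts as the partition key
    partitions[month].append((date, account, value))

# A month-to-date query now scans only the current month's partition
# instead of the whole fact table.
print(partitions["2024-01"])
```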
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partition by Time into Different-sized Segments
 This kind of partitioning is done where the aged data is accessed
infrequently. It is implemented as a set of small partitions for
relatively current data and a larger partition for inactive data.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Points to Note
• The detailed information remains available online.
• The number of physical tables is kept relatively small, which reduces
the operating cost.
• This technique is suitable where a mix of dipping into recent history
and data mining through the entire history is required.
• This technique is not useful where the partitioning profile changes on
a regular basis, because repartitioning will increase the operating cost
of the data warehouse.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partition on a Different Dimension
 The fact table can also be partitioned on the basis of dimensions
other than time such as product group, region, supplier, or any other
dimension.
 Let's take an example.
 Suppose a market function has been structured into distinct regional
departments, for example on a state-by-state basis.
 If each region wants to query information captured within its own
region, it proves more effective to partition the fact table into
regional partitions.
 This speeds up the queries because they do not need to scan
information that is not relevant.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Points to Note
• The query does not have to scan irrelevant data, which speeds up the
query process.
• This technique is not appropriate where the dimensions are likely to
change in the future. So, it is worth confirming that the dimension
will not change in the future.
• If the dimension changes, then the entire fact table would have to be
repartitioned.
 Note − It is recommended to partition only on the basis of the time
dimension, unless you are certain that the suggested dimension
grouping will not change within the life of the data warehouse.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partition by Size of Table
 When there is no clear basis for partitioning the fact table on any
dimension, we should partition the fact table on the basis of its size.
 We can set a predetermined size as a critical point. When the table
exceeds the predetermined size, a new table partition is created.
 Points to Note
• This partitioning is complex to manage.
• It requires metadata to identify what data is stored in each partition.
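A bare-bones illustration of size-based partitioning, under the assumption that a partition is simply a Python list and the "critical point" is a row-count threshold; both are invented for this sketch.

```python
MAX_ROWS_PER_PARTITION = 2   # hypothetical "critical point"

partitions = [[]]            # start with a single empty partition

def insert_fact(row):
    # When the current partition exceeds the predetermined size,
    # a new table partition is created and becomes the insert target.
    if len(partitions[-1]) >= MAX_ROWS_PER_PARTITION:
        partitions.append([])
    partitions[-1].append(row)

for r in [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30)]:
    insert_fact(r)

# Metadata (here simply the partition index) is needed to know where
# each row lives, which is the management overhead noted above.
print(len(partitions), partitions)
```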
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partitioning Dimensions
 If a dimension contains a large number of entries, then it may be
necessary to partition the dimension. Here we have to check the size of
the dimension.
 Consider a large design that changes over time. If we need to store all
the variations in order to apply comparisons, that dimension may be
very large. This would definitely affect the response time.
 Round Robin Partitions
 In the round-robin technique, when a new partition is needed, the old
one is archived. It uses metadata to allow user access tools to refer to
the correct table partition.
 This technique makes it easy to automate table management facilities
within the data warehouse.
Partitioning Strategy - Vertical Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
Vertical partitioning splits the data vertically. The following image
depicts how vertical partitioning is done.
Partitioning Strategy - Vertical Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Vertical partitioning can be performed in the following two ways −
• Normalization
• Row Splitting
 Normalization:- Normalization is the standard relational method of
database organization. In this method, redundant columns are moved
into separate tables so that duplicate rows collapse into a single row,
which reduces space.
 Row Splitting:- Row splitting tends to leave a one-to-one map
between the partitions. The motive of row splitting is to speed up
access to a large table by reducing its size.
 Note − While using vertical partitioning, make sure that there is no
requirement to perform a major join operation between two
partitions.
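The row-splitting form of vertical partitioning can be sketched as follows; this is an invented, in-memory example, and the column names and the hot/cold split are assumptions rather than content from the slides.

```python
# Hypothetical wide fact rows keyed by transaction_id.
wide_rows = {
    1: {"account_id": "A", "value": 120.0, "long_description": "large text blob"},
    2: {"account_id": "B", "value": 75.0,  "long_description": "large text blob"},
}

# Row splitting: keep the frequently used columns in one partition and
# move the rarely used, bulky columns into another, with a one-to-one
# mapping between the two partitions on the same key.
hot_partition  = {k: {"account_id": v["account_id"], "value": v["value"]}
                  for k, v in wide_rows.items()}
cold_partition = {k: {"long_description": v["long_description"]}
                  for k, v in wide_rows.items()}

# Most queries touch only the small, hot partition; the bulky columns
# are joined back by key only when they are actually needed.
txn = 1
print({**hot_partition[txn], **cold_partition[txn]})
```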
Identify Key to Partition
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 It is crucial to choose the right partition key. Choosing the wrong
partition key will lead to reorganizing the fact table.
 Let's have an example. Suppose we want to partition the following
table.
 Account_Txn_Table
 transaction_id
 account_id
 transaction_type
 value
 transaction_date
 region
 branch_name
Identify Key to Partition
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 We can choose to partition on any key. The two possible keys could
be 1) region 2) transaction_date.
 Suppose the business is organized into 30 geographical regions and
each region has a different number of branches. That will give us 30
partitions, which is reasonable. This partitioning is good enough
because our requirements capture has shown that the vast majority of
queries are restricted to the user's own business region.
 If we partition by transaction_date instead of region, then the latest
transactions from every region will be in one partition. Now a user
who wants to look at data within his own region has to query across
multiple partitions.
 Hence it is worth determining the right partitioning key.
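The effect of choosing region versus transaction_date as the partition key can be sketched with a few invented rows from the Account_Txn_Table example; the partition_by helper and the sample data below are illustrative assumptions only.

```python
from collections import defaultdict

# Hypothetical rows from Account_Txn_Table: (transaction_date, region, value).
txns = [
    ("2024-03-01", "north", 10.0),
    ("2024-03-01", "south", 20.0),
    ("2024-03-02", "north", 15.0),
]

def partition_by(rows, key_index):
    # Group rows into partitions keyed by the chosen column.
    parts = defaultdict(list)
    for row in rows:
        parts[row[key_index]].append(row)
    return parts

by_region = partition_by(txns, 1)   # one partition per region
by_date   = partition_by(txns, 0)   # one partition per transaction_date

# A "my region, all dates" query touches a single region partition...
north_scan = by_region["north"]
# ...but under date partitioning the same query must scan every partition.
north_scan_by_date = [r for part in by_date.values() for r in part if r[1] == "north"]

print(len(north_scan), len(north_scan_by_date))
```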
Summary
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 A data warehouse is a subject-oriented, integrated, time-variant,
and non-volatile collection of data that is used in organizational
decision making.
 A data mart is defined as an implementation of a data warehouse with
a smaller and more tightly restricted scope of data and data warehouse
functions, serving a single department or part of an organization.
 The mechanism of extracting information from source systems and
bringing it into the data warehouse is commonly called ETL, which
stands for Extraction, Transformation and Loading.
 Metadata is data about data. Metadata does not just give a
description of an entity; it also gives other details explaining the
syntax and semantics of the data elements.
Summary
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 A virtual warehouse has a logical description of all the databases and
their structure.
 In the star schema, the center of the star can have one fact table
and a number of associated dimension tables.
 A snowflake schema is an extension of a star schema that adds
additional dimension tables. It has normalized dimensions.
 A fact constellation means two or more fact tables sharing one or
more dimensions. It is also called a Galaxy schema.
 Partitioning is done to enhance performance and facilitate easy
management of data.
 A partitioning strategy helps with easy management, assists
backup/recovery, and enhances performance.
Unit – 1
Any - 5 Assignment Questions Marks:-20
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Q.1 What is a data warehouse? Explain the data warehouse
architecture with a diagram.
 Q.2 Discuss the Star, Snowflake and Galaxy schemas for
multidimensional databases.
 Q.3 Give reasons why it is necessary to separate the data warehouse
from the operational database.
 Q.4 What is the need for a data warehouse? Explain the characteristics
of a data warehouse.
 Q.5 What is a Data Mart? What are the types of Data Marts?
 Q.6 Explain the ETL process in a data warehouse.
 Q.7 Explain:
 1) Metadata 2) Fact Table 3) Vertical Partitioning
Questions
Thank You
Great God, Medi-Caps, All the attendees
Mr. Sagar Pandya
sagar.pandya@medicaps.ac.in
www.sagarpandya.tk
LinkedIn: /in/seapandya
Twitter: @seapandya
Facebook: /seapandya

More Related Content

What's hot

How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...Christopher Bradley
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringHadi Fadlallah
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management DATAVERSITY
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDATAVERSITY
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing conceptspcherukumalla
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
The what, why, and how of master data management
The what, why, and how of master data managementThe what, why, and how of master data management
The what, why, and how of master data managementMohammad Yousri
 
Data Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical ApproachesData Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical ApproachesDATAVERSITY
 
Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsBoris Otto
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl conceptsjeshocarme
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientistVijayMohan Vasu
 
Making the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics PlatformsMaking the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics PlatformsPrecisely
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data WarehousingEyad Manna
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science IntroductionGang Tao
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data WarehouseShanthi Mukkavilli
 

What's hot (20)

How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data Warehouse
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
Ebook - The Guide to Master Data Management
Ebook - The Guide to Master Data Management Ebook - The Guide to Master Data Management
Ebook - The Guide to Master Data Management
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
The what, why, and how of master data management
The what, why, and how of master data managementThe what, why, and how of master data management
The what, why, and how of master data management
 
Data Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical ApproachesData Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical Approaches
 
Data analytics
Data analyticsData analytics
Data analytics
 
Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management Systems
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
Making the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics PlatformsMaking the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics Platforms
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Big Data analytics
Big Data analyticsBig Data analytics
Big Data analytics
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 

Similar to Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema, Partitioning

BVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxBVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxDrNilimaThakur
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business IntelligenceSukirti Garg
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousingwork
 
Dataware housing
Dataware housingDataware housing
Dataware housingwork
 
Data warehouse
Data warehouseData warehouse
Data warehouseMR Z
 
Unit i big data introduction
Unit  i big data introductionUnit  i big data introduction
Unit i big data introductionSujaMaryD
 
dw_concepts_2_day_course.ppt
dw_concepts_2_day_course.pptdw_concepts_2_day_course.ppt
dw_concepts_2_day_course.pptDougSchoemaker
 
The Data Warehouse Essays
The Data Warehouse EssaysThe Data Warehouse Essays
The Data Warehouse EssaysMelissa Moore
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business IntelligenceSukirti Garg
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data scienceVipul Kalamkar
 
Data warehousing interview questions
Data warehousing interview questionsData warehousing interview questions
Data warehousing interview questionsSatyam Jaiswal
 
Data miningvs datawarehouse
Data miningvs datawarehouseData miningvs datawarehouse
Data miningvs datawarehouseSuman Astani
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docxAbshar Fatima
 
Data Warehousing Datamining Concepts
Data Warehousing Datamining ConceptsData Warehousing Datamining Concepts
Data Warehousing Datamining Conceptsraulmisir
 

Similar to Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema, Partitioning (20)

IT Ready - DW: 1st Day
IT Ready - DW: 1st Day IT Ready - DW: 1st Day
IT Ready - DW: 1st Day
 
Big data vs datawarehousing
Big data vs datawarehousingBig data vs datawarehousing
Big data vs datawarehousing
 
Big data vs datawarehousing
Big data vs datawarehousingBig data vs datawarehousing
Big data vs datawarehousing
 
BVRM 402 IMS UNIT V
BVRM 402 IMS UNIT VBVRM 402 IMS UNIT V
BVRM 402 IMS UNIT V
 
BVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxBVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptx
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
 
Dataware housing
Dataware housingDataware housing
Dataware housing
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Unit i big data introduction
Unit  i big data introductionUnit  i big data introduction
Unit i big data introduction
 
dw_concepts_2_day_course.ppt
dw_concepts_2_day_course.pptdw_concepts_2_day_course.ppt
dw_concepts_2_day_course.ppt
 
The Data Warehouse Essays
The Data Warehouse EssaysThe Data Warehouse Essays
The Data Warehouse Essays
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
 
Data warehousing interview questions
Data warehousing interview questionsData warehousing interview questions
Data warehousing interview questions
 
Abstract
AbstractAbstract
Abstract
 
Data miningvs datawarehouse
Data miningvs datawarehouseData miningvs datawarehouse
Data miningvs datawarehouse
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docx
 
Data Warehousing Datamining Concepts
Data Warehousing Datamining ConceptsData Warehousing Datamining Concepts
Data Warehousing Datamining Concepts
 

More from Medicaps University (14)

data mining and warehousing computer science
data mining and warehousing computer sciencedata mining and warehousing computer science
data mining and warehousing computer science
 
Unit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptxUnit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptx
 
Unit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptxUnit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptx
 
UNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptxUNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptx
 
UNIT-2.pptx
UNIT-2.pptxUNIT-2.pptx
UNIT-2.pptx
 
UNIT-1 CSA.pptx
UNIT-1 CSA.pptxUNIT-1 CSA.pptx
UNIT-1 CSA.pptx
 
Scheduling
SchedulingScheduling
Scheduling
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 
Clock synchronization
Clock synchronizationClock synchronization
Clock synchronization
 
Distributed Objects and Remote Invocation
Distributed Objects and Remote InvocationDistributed Objects and Remote Invocation
Distributed Objects and Remote Invocation
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
Clustering - K-Means, DBSCAN
Clustering - K-Means, DBSCANClustering - K-Means, DBSCAN
Clustering - K-Means, DBSCAN
 
Association and Classification Algorithm
Association and Classification AlgorithmAssociation and Classification Algorithm
Association and Classification Algorithm
 
Data Mining
Data MiningData Mining
Data Mining
 

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 

Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema, Partitioning

  • 1. MEDI-CAPS UNIVERSITY Faculty of Engineering Mr. Sagar Pandya Information Technology Department sagar.pandya@medicaps.ac.in
  • 2. Data Mining and Warehousing Mr. Sagar Pandya Information Technology Department sagar.pandya@medicaps.ac.in Course Code Course Name Hours Per Week Total Credits L T P IT3ED02 Data Mining and Warehousing 3 0 0 3
  • 3. IT3ED02 Data Mining and Warehousing 3-0-0 Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Unit 1. Introduction  Unit 2. Data Mining  Unit 3. Association and Classification  Unit 4. Clustering  Unit 5. Business Analysis
  • 4. Reference Books Text Books  Han, Kamber and Pi, Data Mining Concepts & Techniques, Morgan Kaufmann, India, 2012.  Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press.  Z. Markov, Daniel T. Larose Data Mining the Web, Jhon wiley & son, USA. Reference Books  Sam Anahory and Dennis Murray, Data Warehousing in the Real World, Pearson Education Asia.  W. H. Inmon, Building the Data Warehouse, 4th Ed Wiley India. and many others Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 5. Unit-1 Introduction  Data warehousing Components –Building a Data warehouse,  Need for data warehousing,  Basic elements of data warehousing,  Data Mart,  Data Extraction, Clean-up, and Transformation Tools –Metadata,  Star, Snow flake and Galaxy Schemas for Multidimensional databases,  Fact and dimension data,  Partitioning Strategy-Horizontal and Vertical Partitioning. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 6. What is Data? Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Data is collection of unprocessed items that may consists of text, numbers, images and video. Today, data can be represented in various forms like sound, images and video.  Structured: numbers, text etc.  Unstructured: images, video etc.
  • 7. What is Information?  Meaningful data is called information.  Information refers to the data that have been processed in such a way that the knowledge of the person who uses the data is increased.  Example:- 1A$ - Data (No meaning) 1$ - Information (Currency)  For the decision to be meaningful, the processed data must qualify for the following characteristics − • Timely − Information should be available when required. • Accuracy − Information should be accurate. • Completeness − Information should be complete. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 8. What is Metadata?  Metadata describes other data.  Data about data,  For example - an image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data.  A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.  1) Operational Metadata  2) Extraction and Transformation Metadata  3) End User Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 9. What is Database and DBMS?  Database is a collection of inter-related data which helps in efficient retrieval, insertion and deletion of data from database and organizes the data in the form of tables.  The software which is used to manage database is called Database Management System (DBMS).  A database management system stores data in such a way that it becomes easier to retrieve, manipulate, and produce information.  For Example, MySQL, Oracle etc. are popular commercial DBMS used in different applications. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 10. Operational vs. Informational Systems  Operational systems, as their name implies, are the systems that help the every day operation of the enterprise.  These are the backbone systems of any enterprise, and include order entry, inventory, manufacturing, payroll and accounting.  Due to their importance to the organization, operational systems were almost always the first parts of the enterprise to be computerized. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 11. Operational vs. Informational Systems  Informational systems deal with analyzing data and making decisions, often major, about how the enterprise will operate now, and in the future.  Not only do informational systems have a different focus from operational ones, they often have a different scope.  Where operational data needs are normally focused upon a single area, informational data needs often span a number of different areas and need large amounts of related operational data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 12. Data Warehouse  The term "Data Warehouse" was first coined by Bill Inmon in 1990. He was considered as a father of data warehouse.  According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data.  According to Ralph Kimball, Data Warehouse is a transaction data specifically structured for query and analysis.  A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 13. Data Warehouse  This data helps analysts to take informed decisions in an organization.  A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.  It is not used for daily operations and transaction processing but used for making decisions.  This data helps analysts to take informed decisions in an organization. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 14. Data Warehouse  Data is a collection of raw material in unorganized format. Now we have to convert that data into Information format. To make decision, we need to collect the data, using that data we get some information and finally we take decision.  Example:- In an organization, we have many departments like Sales dept, Product dept, Hr department and many other. Before releasing any product to the market, CEO collects the data form the Sales department and product department to take some decisions on profits & losses. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 15. Data Warehouse  In an Organisation, there are several department available and each individual department perform different kind of transactions, all these transactions are saved in Operational data store (ODS).  The main characteristics of ODS is data is volatile and it doesn’t maintain any history data. So what is volatile ? Data in volatile means, the data changes in regular interval of time.  Example :- Big Bazaar, CEO needs to take decision about a particular product. So he needs 3 to 5 years of data. But in ODS, it doesn’t maintain any history data. So, every organisation should maintain history data to take decisions based on product sales. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 16. Data Warehouse  Data warehousing is the process of constructing and using a data warehouse.  A data warehouse is a database, which is kept separate from the organization's operational database.  A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.  It possesses consolidated historical data, which helps the organization to analyze its business.  There is no frequent updating done in a data warehouse. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 17. Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 18. What can a Data Warehouse do & can’t do? What can a Data Warehouse do?  Get Answer Faster  Make Decision Faster  Optimize Performance  Reduce Risk and Cost What can a Data Warehouse not do?  Can’t create data itself  Cleaning of data is required Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 19. Need for Data Warehouse 1. Improving Integration:  An organization registers data in different systems, which support the various business processes.  In order to create an overall picture of business operations, customers and suppliers – thus creating a single version of the truth – the data must come together in one place and made compatible.  Both external (from the environment) and internal data (from ERP and financial systems) should merge into the data warehouse and then be grouped. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 20. Need for Data Warehouse 2. Speeding up response times  The source systems are fully optimized in order to process many small transactions, such as orders, in a short time.  Creating information about the performance of the organization only requires a few large ‘transactions’ during which large amounts of data are being gathered and aggregated.  The structure of a data warehouse is specifically designed to quickly analyze such large amounts of data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 21. Need for Data Warehouse 3. Faster and more flexible reporting:  The structure of both data warehouses and data marts enables end users to report in a flexible manner and to quickly perform interactive analysis on the basis of various predefined angles (dimensions).  They may, for example, with a single mouse click jump from year level – to quarter – to month level and quickly switch between the customer dimension and the product dimension whereby the indicator remains fixed. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 22. Need for Data Warehouse  In most organization, data about specific parts of businesses is there which contains lots and lots of data, somewhere, in some form.  Data is available but not information – and not the right information at the right time.  Bring together information from multiple resources as to provide a consistent database source for decision support queries.  To help workers in their everyday business activity and improve their productivity.  To help knowledge workers (Executives, Managers, Analysts) make faster and better decisions – decision support systems. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 23. Data Warehouse Features Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 24. Data Warehouse Features Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Subject Orientation:- Subject orientation means that data is organized by subject.  Integration:- Consistency of defining parameters.  Non-Volatility:- It means data storage medium must be stable.  Time-Variance:- It means timeliness of data and access terms.  Data Granularity:- It means that details of data are kept at low level.
  • 25. Data Warehouse Characteristics Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 26. Subject-oriented Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations.  Data warehouse is a subject oriented database, which supports the business need of individual department specific user.  Example : Sales, HR, Accounts, Marketing etc.
  • 27. Subject-oriented Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data warehouse target on the modeling and analysis of data for decision-makers.  Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the global organization's ongoing operations.  This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.
  • 29. Integrated Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In Data Warehouse, integration means the establishment of a common unit of measure for all similar data from the dissimilar database.  The data also needs to be stored in the Datawarehouse in common and universally acceptable manner.  A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online transaction records.  This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure etc. have to be ensured.
  • 31. Integrated Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  There are three different application labeled A, B and C.  Information stored in these applications are Gender, Date, and Balance. However, each application's data is stored different way. • In Application A gender field store logical values like M or F • In Application B gender field is a numerical value, • In Application C application, gender field stored in the form of a character value. • Same is the case with Date and balance.  However, after transformation and cleaning process all this data is stored in common format in the Data Warehouse.
  • 32. Time-Variant Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Data Warehouse is a time variant data base, which supports the business management in analyzing the business and comparing the business with different time periods like Year, Quarter, Month, Week and Date.  Historical information is kept in a data warehouse.  For example, one can retrieve files from 3 months, 6 months, 12 months, or even previous data from a data warehouse.  These variations with a transactions system, where often only the most current file is kept.  Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
  • 34. Non- Volatile Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Non-volatile means the previous data is not erased when new data is added to it.  A data warehouse is kept separate from the operational database and therefore frequent changes in operational database is not reflected in the data warehouse.  Typical activities such as deletes, inserts, and changes that are performed in an operational application environment are completely nonexistent in a DW environment.  Only two types of data operations performed in the Data Warehousing are 1. Data loading 2. Data access
  • 35. Non- Volatile Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 36. Data Warehouse VS Operational Database S.no. Data Warehouse Operational Database 1 It involves historical processing of information. It involves day-to-day processing. 2 Data warehouse systems are used by knowledge workers such as executives, managers, and analysts. Operational Database systems are used by clerks, DBAs, or database professionals. 3 It is used to analyze the business. It is used to run the business. 4 It focuses on Information out. It focuses on Data in. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 37. Data Warehouse VS Operational Database S.no. Data Warehouse Operational Database 5 It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. It is based on Entity Relationship Model. 6 It focuses on Information out. It is application oriented. 7 It contains historical data. It contains current data. 8 It provides summarized and consolidated data. It provides primitive and highly detailed data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 38. Data Warehouse VS Operational Database S.no. Data Warehouse Operational Database 9 The number of users is in hundreds. The number of users is in thousands. 10 The number of records accessed is in millions. The number of records accessed is in tens. 11 The database size is from 100GB to 100 TB. The database size is from 100 MB to 100 GB. 12 These are highly flexible. It provides high performance. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 39. How Datawarehouse works?  A Data Warehouse works as a central repository where information arrives from one or more data sources.  Data flows into a data warehouse from the transactional system and other relational databases.  Data may be: 1. Structured 2. Semi-structured 3. Unstructured data Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 40. How Datawarehouse works?  The data is processed, transformed, and ingested so that users can access the processed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets.  A data warehouse merges information coming from different sources into one comprehensive database.  By merging all of this information in one place, an organization can analyze its customers more holistically.  This helps to ensure that it has considered all the information available.  Data warehousing makes data mining possible.  Data mining is looking for patterns in the data that may lead to higher sales and profits. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 41. Benefits of a Data Warehouse 1) Delivers enhanced business intelligence  By having access to information from various sources from a single platform, decision makers will no longer need to rely on limited data or their instinct. 2) Saves times  executives can query the data themselves with little to no IT support, saving more time and money. 3) Enhances data quality and consistency  A data warehouse converts data from multiple sources into a consistent format. Since the data from across the organization is standardized, each department will produce results that are consistent. This will lead to more accurate data, which will become the basis for solid decisions. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 42. Benefits of a Data Warehouse 4) Improves the decision-making process  By transforming data into purposeful information, decision makers can perform more functional, precise, and reliable analysis and create more useful reports with ease. 5) Drives Revenue  “data is the new oil,” referring to the high dollar value of data in today’s world. Creating more standardized and better quality data is the key strength of a data warehouse, and this key strength translates clearly to significant revenue gains. The data warehouse formula works like this: Better business intelligence helps with better decisions, and in turn better decisions create a higher return on investment across any sector of your business. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 43. Benefits of a Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 44. Online Analytical Processing (OLAP) • Involves historical processing of information. • OLAP systems are used by knowledge workers such as executives, managers and analysts. • It focuses on Information out. • Based on Star Schema, Snowflake, Schema and Fact Constellation Schema. • Contains historical data. • Provides summarized and consolidated data. • Provides summarized and multidimensional view of data. • Number or users is in hundreds. • Number of records accessed is in millions. • Database size is from 100 GB to 1 TB Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 45. Online Transactional Processing (OLTP) • Involves day-to-day processing. • OLTP systems are used by clerks, DBAs, or database professionals. • It focuses on Data in. • Based on Entity Relationship Model. • Contains current data. • Provides primitive and highly detailed data. • Provides detailed and flat relational view of data. • Number of users is in thousands. • Number of records accessed is in tens. • Database size is from 100 MB to 1 GB. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 46. Data Mart • A data mart is a simple section of the data warehouse that delivers a single functional data set. • Often holds only one subject area- for example, Finance, or Sales. • May hold more summarized data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 47. Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 48. Data Mart • Windows-based or Unix/Linux-based servers are used to implement data marts. • They are implemented on low-cost servers. • The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years. • The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide. • Data marts are small in size. • Data marts are customized by department. • The source of a data mart is a departmentally structured data warehouse. • Data marts are flexible. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 49. Need Of Data Mart  A Data Mart focuses only on the functioning of a particular department of an organization.  It is maintained by a single authority of an organization.  Since it stores the data related to a specific part of an organization, data retrieval from it is very quick.  Designing and maintenance of a data mart is quite easy as compared to a data warehouse.  It reduces the response time of the user as it stores a small volume of data.  It is small in size, due to which accessing data from it is very fast.  This storage unit is used by most organizations for the smooth running of their departments. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 50. Types of Data Mart:  There are three main types of data marts: 1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse. 2. Independent: An independent data mart is created without the use of a central data warehouse. 3. Hybrid: This type of data mart can take data from data warehouses or operational systems. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 51. Dependent Data Mart  Dependent Data Mart is created by extracting the data from central repository, Datawarehouse.  First data warehouse is created by extracting data (through ETL tool) from external sources and then data mart is created from data warehouse.  Dependent data mart is created in top-down approach of Datawarehouse architecture.  This model of data mart is used by big organizations. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 52. Dependent Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 53. Independent Data Mart  The second approach is Independent data marts (IDM).  Independent Data Mart is created directly from external sources instead of data warehouse.  First data mart is created by extracting data from external sources and then Datawarehouse is created from the data present in data mart.  Independent data mart is designed in bottom-up approach of Datawarehouse architecture.  This model of data mart is used by small organizations and is cost effective comparatively. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 54. Independent Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Data Mart
  • 55. Hybrid Data Mart  This type of Data Mart is created by extracting data from an operational source or from a data warehouse.  It is best suited for multiple database environments and fast implementation turnaround for any organization.  It also requires the least data cleansing effort.  A Hybrid Data Mart also supports large storage structures, and it is well suited for flexible, smaller data-centric applications.  1) Path-1 reflects accessing data directly from external sources and  2) Path-2 reflects the dependent data model of a data mart. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 56. Hybrid Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 57. Steps in Implementing a Datamart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Implementing a Data Mart is a rewarding but complex procedure.  The significant steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions and manage it over time.  So, the steps are:
  • 58. Advantages of Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Implementation of a data mart needs less time as compared to implementation of a Datawarehouse, as a data mart is designed for a particular department of an organization.  Organizations are provided with choices to choose the model of data mart depending upon cost and their business.  Data can be easily accessed from a data mart.  It supports frequently accessed queries, so it enables analysis of business trends.
  • 59. Disadvantages of Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Since it stores the data related only to a specific function, it does not store the huge volume of data related to each and every department of an organisation, as a datawarehouse does.  Over time it can become a big hurdle to maintain.
  • 60. Difference between Datawarehouse & Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Data Warehouse Data Mart A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation. A data mart is a subtype of a Data Warehouse. It is architected to meet the requirements of a specific user group. It may hold multiple subject areas. It holds only one subject area. For example, Finance or Sales. It holds very detailed information. It may hold more summarized data. A data warehouse is data-oriented. A data mart is project-oriented. In data warehousing, Fact constellation is used. In a Data Mart, Star Schema and Snowflake Schema are used. It is a Centralized System. It is a Decentralized System.
  • 61. ETL Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.
  • 62. ETL Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area and then finally, loads it into the Data Warehouse system.
  • 63. Why do you need ETL? Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It helps companies to analyze their business data for taking critical business decisions.  Transactional databases cannot answer complex business questions that can be answered by ETL.  ETL provides a method of moving the data from various sources into a data warehouse.  Well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.  ETL helps to Migrate data into a Data Warehouse. Convert to the various formats and types to adhere to one consistent system.  ETL is a predefined process for accessing and manipulating source data into the target database.
  • 64. ETL Process - Extraction Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Extraction is the operation of extracting information from a source system for further use in a data warehouse environment. This is the first stage of the ETL process.  Extraction process is often one of the most time-consuming tasks in the ETL.  The source systems might be complicated and poorly documented, and thus determining which data needs to be extracted can be difficult.  The data has to be extracted several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date.
  • 65. ETL Process - Extraction Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It is important to extract the data from various source systems and store it into the staging area first, and not directly into the data warehouse, because the extracted data is in various formats and can also be corrupted.  Hence loading it directly into the data warehouse may damage it. Therefore, this is one of the most important steps of the ETL process.  The extraction step should be designed in such a way that it does not have a negative effect on the source system.  Data extraction time slots for different systems vary as per time zones and operational hours.
  • 66. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format.  Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed.  The main objective of this step is to load the extracted data into the target database in a clean and general format.  For example, there are two sources A and B.  Date format of A is dd/mm/yyyy and format of B is mm/dd/yy.  In transformation, these date formats are brought into a single general format.
  • 67. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 68. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format. It may involve following processes/tasks:  Filtering – loading only certain attributes into the data warehouse.  Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States and America into USA, etc.  Joining – joining multiple attributes (columns) into one.  Splitting – splitting a single attribute into multiple attributes.  Sorting – sorting tuples on the basis of some attribute (generally key-attribute).  Enrichment – Full name to ‘First Name’, ‘Middle Name’ & ‘Last Name’.
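To make the transformation tasks above concrete, here is a minimal Python sketch that applies a few of them (cleaning, date standardization, and splitting/enrichment of a full name) to hypothetical extracted records; the record fields, mapping table and default values are illustrative assumptions, not part of any particular ETL tool.

from datetime import datetime

# Hypothetical raw records extracted from two source systems (illustrative only).
source_a = [{"name": "Jon Smith", "country": "U.S.A", "sale_date": "25/03/2021", "amount": "120.50"}]
source_b = [{"name": "John Smith", "country": "United States", "sale_date": "03/25/21", "amount": None}]

# Cleaning rule: map country name variants onto one standard value.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(record, date_format):
    """Apply cleaning, date standardization and enrichment to one extracted record."""
    out = {}
    # Cleaning: standardize country names and fill NULL amounts with a default value.
    out["country"] = COUNTRY_MAP.get(record["country"], record["country"])
    out["amount"] = float(record["amount"]) if record["amount"] is not None else 0.0
    # Date standardization: bring both source formats into a single yyyy-mm-dd format.
    out["sale_date"] = datetime.strptime(record["sale_date"], date_format).strftime("%Y-%m-%d")
    # Splitting/Enrichment: split the full name into first-name and last-name attributes.
    first, _, last = record["name"].partition(" ")
    out["first_name"], out["last_name"] = first, last
    return out

staged = [transform(r, "%d/%m/%Y") for r in source_a] + [transform(r, "%m/%d/%y") for r in source_b]
print(staged)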
  • 69. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Following are Data Integrity Problems: 1) Different spelling of the same person like Jon, John, etc. 2) There are multiple ways to denote company name like Google, Google pvt. ltd., Google Inc. 3) Use of different names like Mumbai, Bombay. 4) There may be a case that different account numbers are generated by various applications for the same customer. 5) In some data required files remains blank. 6) Invalid product collected at POS as manual entry can lead to mistakes.
  • 70. ETL Process - Loading Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse.  Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is done after longer but regular intervals.  The rate and period of loading solely depend on the requirements and vary from system to system.  In case of load failure, recovery mechanisms should be configured to restart from the point of failure without data integrity loss.  Data Warehouse admins need to monitor, resume, and cancel loads as per prevailing server performance.
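A minimal loading sketch, assuming Python's built-in sqlite3 as a stand-in for the target warehouse database; the sales_fact table and its columns are illustrative. The transaction wrapper shows the idea of failing cleanly rather than committing partial data when a load fails.

import sqlite3

conn = sqlite3.connect("warehouse.db")          # stand-in for the target warehouse database
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                   sale_date TEXT, country TEXT, first_name TEXT, last_name TEXT, amount REAL)""")
conn.commit()

# Output of the transformation step (illustrative staged rows).
staged = [("2021-03-25", "USA", "Jon", "Smith", 120.50)]

try:
    with conn:                                   # commits on success, rolls back on failure
        conn.executemany(
            "INSERT INTO sales_fact (sale_date, country, first_name, last_name, amount) "
            "VALUES (?, ?, ?, ?, ?)",
            staged)
except sqlite3.Error as exc:
    # A recovery mechanism would restart the load from the point of failure.
    print("Load failed, nothing committed:", exc)
finally:
    conn.close()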
  • 71. ETL Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it can be transformed, and during that period some new data can be extracted. And while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.  The block diagram of the pipelining of the ETL process is shown below:
  • 72. Selecting an ETL Tool Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Selecting an appropriate ETL tool is an important decision that has to be made in building an ODS or data warehousing application.  The ETL tools are required to provide coordinated access to multiple data sources so that relevant data may be extracted from them.  An ETL tool generally contains tools for data cleansing, re-organization, transformations, aggregation, calculation and automatic loading of information into the target database.  An ETL tool should provide a simple user interface that allows data cleansing and data transformation rules to be specified using a point-and-click approach.
  • 73. ETL tools Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  When all mappings and transformations have been defined, the ETL tool should automatically generate the data extract/transformation/load programs.  There are many Data Warehousing tools available in the market. Here are some of the most prominent ones:  1. MarkLogic  2. Oracle  3. Amazon RedShift  4. Sybase
  • 74. Components of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Source Data Component  Data Staging Component (ETL)  Metadata Component  End user tools and applications  Data Warehouse Management
  • 75. Components of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 76. Data Warehouse Architecture Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  DATA WAREHOUSE ARCHITECTURE is complex as it’s an information system that contains historical and cumulative data from multiple sources.  Data warehouses and their architectures vary depending upon the specifics of an organization’s situation. Three common architectures are:  Data Warehouse Architecture (Basic)  Data Warehouse Architecture (with a staging area)  Data Warehouse Architecture (with a staging area and data mart)
  • 77. Data Warehouse Architecture Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Data Warehouse Architecture (Basic)
  • 78. Data Warehouse Architecture (Basic) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Operational System:- An operational system is a method used in data warehousing to refer to a system that is used to process the day- to-day transactions of an organization.  Flat Files:- A Flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.  End-User access Tools:- The principal purpose of a data warehouse is to provide information to the business managers for strategic decision-making. These customers interact with the warehouse using end-client access tools. • Example:- Reporting and Query Tools, Application Development Tools, Executive Information Systems Tools, Online Analytical Processing Tools, Data Mining Tools
  • 79. Data Warehouse Architecture (With Staging Area) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • We must clean and process operational data before putting it into the warehouse.  We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).  A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
  • 80. Data Warehouse Architecture (With Staging Area) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 81. Data Warehouse Architecture (With Staging Area) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • Data Warehouse Staging Area is a temporary location where a record from source systems is copied.
  • 82. Data Warehouse Architecture (With Staging Area and Data Marts) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  We may want to customize our warehouse's architecture for multiple groups within our organization.  We can do this by adding data marts.  A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc.  The figure illustrates an example where purchasing, sales, and stocks are separated.  In this example, a financial analyst wants to analyze historical data for purchases and sales or mine historical information to make predictions about customer behavior.
  • 83. Data Warehouse Architecture (With Staging Area and Data Marts) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 84. Types of Data Warehouse Architectures  DATA WAREHOUSE ARCHITECTURE is complex as it’s an information system that contains historical and cumulative data from multiple sources. There are 3 methods for constructing a data warehouse: Single Tier, Two Tier and Three Tier. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 85. Types of Data Warehouse Architectures Single-Tier Architecture  The objective of a single layer is to minimize the amount of data stored.  This goal is to remove data redundancy.  This architecture is not frequently used in practice. Two-Tier Architecture  Two-layer architecture separates physically available sources and data warehouse.  This architecture is not expandable and also not supporting a large number of end-users.  It also has connectivity problems because of network limitations. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 86. Types of Data Warehouse Architectures Three-tier architecture  This is the most widely used architecture.  Generally, a data warehouse adopts a three-tier architecture.  It consists of the Top, Middle and Bottom tiers:  1 Bottom tier  2 Middle tier  3 Top tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 87. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 1. Bottom Tier: The database of the Datawarehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools. 2. Middle Tier: The middle tier in a Data warehouse is an OLAP server which is implemented using either the ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database. 3. Top Tier: The top tier is a front-end client layer. The top tier is the tools and APIs that you connect to get data out from the data warehouse. It could be Query tools, reporting tools, managed query tools, Analysis tools and Data mining tools.
  • 88. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 89. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 1.) Top Tier  The Top Tier consists of the client-side front end of the architecture.  The transformed and logic-applied information stored in the Data Warehouse will be used and acquired for business purposes in this tier.  Several tools for report generation and analysis are present for the generation of desired information.  Data mining, which has become a great trend these days, is done here.  Requirement analysis documents, costing, and all features that determine a profit-based business deal are prepared based on these tools, which use the Data Warehouse information.
  • 90. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 2.) Middle Tier  The Middle Tier consists of the OLAP Servers  OLAP is Online Analytical Processing Server  OLAP is used to provide information to business analysts and managers  As it is located in the Middle Tier, it rightfully interacts with the information present in the Bottom Tier and passes on the insights to the Top Tier tools which processes the available information.  Mostly Relational or Multidimensional OLAP is used in Data warehouse architecture.
  • 91. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Bottom Tier:- The Bottom Tier mainly consists of the Data Sources, ETL Tool, and Data Warehouse.  1. Data Sources:- The Data Sources consists of the Source Data that is acquired and provided to the Staging and ETL tools for further process.  2. ETL Tools:- ETL tools are very important because they help in combining Logic, Raw Data, and Schema into one and loads the information to the Data Warehouse Or Data Marts.  Sometimes, ETL loads the data into the Data Marts and then information is stored in Data Warehouse. This approach is known as the Bottom-Up approach.  The approach where ETL loads information to the Data Warehouse directly is known as the Top-down Approach.
  • 92. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 93. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data-warehouse is a heterogeneous collection of different data sources organized under a unified schema. There are 2 approaches for constructing data-warehouse: Top-down approach and Bottom-up approach are explained as below. 1. Top-down approach: The needed components are discussed below: 1.) External Sources – External source is a source from where data is collected irrespective of the type of data. Data can be structured, semi structured and unstructured as well. 2.) Stage Area – Since the data, extracted from the external sources does not follow a particular format, so there is a need to validate this data to load into Datawarehouse. For this purpose, it is recommended to use ETL tool.
  • 94. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 95. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in E(Extracted): Data is extracted from External data source. T(Transform): Data is transformed into the standard format. L(Load): Data is loaded into Datawarehouse after transforming it into the standard format. 3.) Data-warehouse – After cleansing of data, it is stored in the Datawarehouse as central repository. It actually stores the meta data and the actual data gets stored in the data marts. Note that Datawarehouse stores the data in its purest form in this top-down approach. 4.) Data Marts – Data mart is also a part of storage component. It stores the information of a particular function of an organization which is handled by single authority. We can also say that data mart contains subset of the data stored in Datawarehouse.
  • 96. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 5.) Data Mining – The practice of analyzing the big data present in the Datawarehouse is data mining. It is used to find the hidden patterns that are present in the database or in the Datawarehouse with the help of data mining algorithms. Advantages of Top-Down Approach – 1. Since the data marts are created from the Datawarehouse, it provides a consistent dimensional view of data marts. 2. Also, this model is considered the strongest model for business changes. That’s why big organizations prefer to follow this approach. 3. Creating a data mart from the Datawarehouse is easy.  Disadvantages of Top-Down Approach – The cost and time taken in designing and maintaining it are very high.
  • 97. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 2. Bottom-up approach: 1. First, the data is extracted from external sources (same as happens in the top-down approach). 2. Then the data goes through the staging area (as explained above) and is loaded into data marts instead of the datawarehouse. The data marts are created first and provide reporting capability. Each addresses a single business area. 3. These data marts are then integrated into the datawarehouse.  This approach is given by Kimball as – data marts are created first and provide a thin view for analysis, and the datawarehouse is created after the data marts have been completed.
  • 98. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 99. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Advantages of Bottom-Up Approach – 1. As the data marts are created first, the reports are quickly generated. 2. We can accommodate more data marts here, and in this way the Datawarehouse can be extended. 3. Also, the cost and time taken in designing this model are comparatively low.  Disadvantage of Bottom-Up Approach – 1. This model is not as strong as the top-down approach, as the dimensional view of data marts is not as consistent as it is in the above approach.
  • 100. Difference Between Top-down Approach and Bottom-up Approach Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no. Top-Down Approach Bottom-Up Approach 1 Provides a definite and consistent view of information as information from the data warehouse is used to create Data Marts Reports can be generated easily as Data marts are created first and it is relatively easy to interact with data marts. 2 Strong model and hence preferred by big companies Not as strong but data warehouse can be extended and the number of data marts can be created 3 Time, Cost and Maintenance is high Time, Cost and Maintenance are low.
  • 101. Design of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  An important point about Data Warehouse is its efficiency. To create an efficient Data Warehouse, we construct a framework known as the Business Analysis Framework.  There are four types of views in regard to the design of a DW.  1. Top-Down View: This View allows only specific information needed for a data warehouse to be selected.  2. Data Source View: This view shows all the information from the source of data to how it is transformed and stored.  3. Data Warehouse View: This view shows the information present in the Data warehouse through fact tables and dimension tables.  4. Business Query View: This is a view that shows the data from the user’s point of view.
  • 102. Advantages of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  1. Integrating data from multiple sources.  2. Performing new types of analyses.  3. Reducing cost to access historical data.  Other benefits may include:  1. Standardizing data across the organization, a "single version of the truth“.  2. Improving turnaround time for analysis and reporting.  3. Sharing data and allowing others to easily access data.  4. Removing informational processing load from transaction- oriented databases.
  • 103. Disadvantages of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The major disadvantage is that a data warehouse can be costly to maintain and that becomes a problem if the warehouse is underutilized.  It seems that managers have unrealistic expectations about what they will get from having a data warehouse.  There are considerable disadvantages involved in moving data from multiple, often highly disparate, data sources to one data warehouse that translate into long implementation time, high cost, lack of flexibility, dated information, and limited capabilities.  The data warehouse may seem easy, but actually, it is too complex for the average users.  Not an ideal option for unstructured data.
  • 104. Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The name Meta Data suggests some high-level technological concept.  However, it is quite simple. Metadata is data about data which defines the data warehouse.  It is used for building, maintaining and managing the data warehouse.  In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source, usage, values, and features of data warehouse data.  It also defines how data can be changed and processed.  It is closely connected to the data warehouse.
  • 105. Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in For example, a line in a sales database may contain:  This is meaningless data until we consult the metadata that tells us it was • Model number: 4030 • Sales Agent ID: KJ732 • Total sales amount of $299.90  Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
  • 106. Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Metadata helps to answer the following questions • What tables, attributes, and keys does the Data Warehouse contain? • Where did the data come from? • How many times do data get reloaded? • What transformations were applied with cleansing?
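A small sketch of how such metadata might be recorded, here simply as a Python dictionary; the field names are assumptions for illustration and do not follow any specific metadata standard.

# Illustrative metadata entry for one warehouse table (field names are assumptions).
sales_fact_metadata = {
    "table": "sales_fact",
    "attributes": ["sale_date", "country", "amount"],
    "keys": {"foreign": ["date_key", "country_key"]},
    "source": "orders table in the OLTP order-entry system",   # where the data came from
    "load_frequency": "daily",                                  # how often data gets reloaded
    "transformations": [                                        # cleansing applied
        "country names mapped to a standard value",
        "dates converted to yyyy-mm-dd",
    ],
}

# Answering "where did the data come from?" for this table:
print(sales_fact_metadata["source"])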
  • 107. Data Warehouse Models Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no. DATABASE SYSTEM DATA WAREHOUSE 1 It supports operational processes. It supports analysis and performance reporting. 2 Operational Database are those databases where data changes frequently. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. 3 It focuses on current transactional data. It focuses on historical data. 4 Data is balanced within the scope of this one system. Data must be integrated and balanced from multiple system.
  • 108. Data Warehouse Models Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no. DATABASE SYSTEM DATA WAREHOUSE 5 ER based. Star/Snowflake. 6 Application oriented. Subject oriented. 7 It is slow for analytics queries. It is fast for analysis queries. 8 Relational databases are created for on-line transactional Processing (OLTP) Data Warehouse designed for on-line Analytical Processing (OLAP) 9 Data stored in the Database is up to date. Current and Historical Data is stored in Data Warehouse. May not be up to date.
  • 109. Dimensional Modeling Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  DIMENSIONAL MODELING (DM) is a data structure technique optimized for data storage in a Data warehouse.  The purpose of a dimensional model is to optimize the database for fast retrieval of data.  The concept of Dimensional Modelling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.  A Dimensional model is designed to read, summarize, and analyze numeric information like values, balances, counts, weights, etc. in a data warehouse.  In contrast, relational models are optimized for addition, updating and deletion of data in a real-time Online Transaction System.
  • 110. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Fact:- Facts are the measurements/metrics or facts from your business process. For a Sales business process, a measurement would be quarterly sales number.  Dimension:- Dimension provides the context surrounding a business process event. In simple terms, they give who, what, where of a fact. In the Sales business process, for the fact quarterly sales number, dimensions would be • Who – Customer Names • Where – Location • What – Product Name  In other words, a dimension is a window to view information in the facts.
  • 111. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Attributes  The Attributes are the various characteristics of the dimension.  In the Location dimension, the attributes can be • State • Country • Zipcode etc.  Attributes are used to search, filter, or classify facts. Dimension Tables contain Attributes
  • 112. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Fact Table  A fact table is a primary table in a dimensional model.  A Fact Table contains 1. Measurements/facts 2. Foreign key to dimension table
  • 113. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Dimension table • A dimension table contains dimensions of a fact. • They are joined to the fact table via a foreign key. • Dimension tables are de-normalized tables. • The Dimension Attributes are the various columns in a dimension table. • Dimensions offer descriptive characteristics of the facts with the help of their attributes. • There is no set limit on the number of dimensions. • A dimension can also contain one or more hierarchical relationships.
  • 114. Multidimensional schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Multidimensional Schema is especially designed to model data warehouse systems.  The schemas are designed to address the unique needs of very large databases designed for the analytical purpose (OLAP).  Types of Data Warehouse Schema:  Following are 3 chief types of multidimensional schemas each having its unique advantages. • Star Schema • Snowflake Schema • Galaxy Schema
  • 115. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In the STAR Schema, the center of the star can have one fact table and a number of associated dimension tables.  Star schema is the most fundamental of the data mart schemas, and it is the simplest.  This schema is widely used to develop or build a data warehouse and dimensional data marts.  It is known as star schema as its structure resembles a star.  The star schema is the simplest type of Data Warehouse schema.  It is also known as Star Join Schema and is optimized for querying large data sets.
  • 116. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In a star schema, the fact table will be at the center and is connected to the dimension tables.  The tables are completely in a denormalized structure.  SQL query performance is good as there are fewer joins involved.  Data redundancy is high and occupies more disk space.  It is said to be a star as its physical model resembles a star shape, having a fact table at its center and the dimension tables at its periphery representing the star’s points.  Usually the fact tables in a star schema are in third normal form (3NF) whereas dimension tables are de-normalized.
  • 117. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 118. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Characteristics of Star Schema:  Every dimension in a star schema is represented by only one dimension table.  The dimension table should contain the set of attributes.  The dimension table is joined to the fact table using a foreign key.  The dimension tables are not joined to each other.  The fact table contains keys and measures.  The Star schema is easy to understand and provides optimal disk usage.  The dimension tables are not normalized.  For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.  The schema is widely supported by BI Tools.
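A minimal star schema sketch, assuming Python's built-in sqlite3 and illustrative table and column names: one fact table holding foreign keys and numeric measures, two denormalized dimension tables, and a typical star-join query that reaches each dimension with a single join.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables (each dimension is a single table).
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT, category TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

-- Fact table: foreign keys to the dimensions plus the numeric measures.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES dim_product(product_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_product  VALUES (1, 'Laptop', 'Acme', 'Electronics');
INSERT INTO dim_location VALUES (1, 'Indore', 'MP', 'India');
INSERT INTO sales_fact   VALUES (1, 1, 3, 1500.0);
""")

# A typical star-join query: a single join from the fact table to each dimension it needs.
query = """
SELECT p.category, l.country, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_location l ON f.location_key = l.location_key
GROUP BY p.category, l.country;
"""
print(conn.execute(query).fetchall())   # [('Electronics', 'India', 1500.0)]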
  • 119. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 120. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 121. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Advantages of Star Schema – 1. Simpler Queries: The join logic of a star schema is quite simple compared to the join logic needed to fetch data from a highly normalized transactional schema. 2. Simplified Business Reporting Logic: Compared to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period reporting. 3. Feeding Cubes: The star schema is widely used by all OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.
  • 122. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Disadvantages of Star Schema – 1. Data integrity is not enforced well, since the schema is in a highly de-normalized state. 2. Not as flexible for analytical needs as a normalized data model. 3. Star schemas don’t support many-to-many relationships within business entities – at least not directly.
  • 123. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  SNOWFLAKE SCHEMA is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape.  A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions.  The dimension tables are normalized which splits data into additional tables.  The snowflake schema is a variant of the star schema.  The snowflake effect affects only the dimension tables and does not affect the fact tables.
  • 124. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A snowflake schema is an extension of the star schema where the dimension tables are connected to one or more further dimension tables.  The tables are partially denormalized in structure.  The performance of SQL queries is a bit lower when compared to the star schema, as more joins are involved.  Data redundancy is low and occupies less disk space when compared to the star schema.  The snowflake structure materializes when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables.
  • 125. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 126. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Characteristics of Snowflake Schema: • The main benefit of the snowflake schema is that it uses smaller disk space. • It is easier to add a dimension to the schema. • Due to the multiple tables, query performance is reduced. • The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance effort because of the additional lookup tables.
  • 127. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 128. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 129. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • For example, the item dimension table in star schema is normalized and split into two dimension tables, namely item and supplier table. • Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier-key. • The supplier key is linked to the supplier dimension table. • The supplier dimension table contains the attributes supplier_key and supplier_type.
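A sketch of the item/supplier example above, again using Python's sqlite3 with illustrative column names: the supplier attributes are split into their own table, so reaching supplier_type from the fact table now costs one extra join compared with a star schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked item dimension: supplier attributes are split out into their own table.
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key)
);
CREATE TABLE sales_fact (item_key INTEGER REFERENCES dim_item(item_key), dollars_sold REAL);

INSERT INTO dim_supplier VALUES (10, 'wholesale');
INSERT INTO dim_item     VALUES (1, 'Laptop', 'computer', 'Acme', 10);
INSERT INTO sales_fact   VALUES (1, 1500.0);
""")

# Reaching supplier_type from the fact table now needs an extra join
# (sales_fact -> dim_item -> dim_supplier), the cost of normalizing the dimension.
query = """
SELECT s.supplier_type, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN dim_item     i ON f.item_key     = i.item_key
JOIN dim_supplier s ON i.supplier_key = s.supplier_key
GROUP BY s.supplier_type;
"""
print(conn.execute(query).fetchall())   # [('wholesale', 1500.0)]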
  • 130. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Advantages: There are two main advantages of snowflake schema given below: • It provides structured data which reduces the problem of data integrity. • It uses small disk space because data are highly structured.
  • 131. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Disadvantages: • Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant. • Avoid snowflaking or normalization of a dimension table, unless required and appropriate. • Do not snowflake hierarchies of one dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflaked. • Multiple hierarchies can belong to the same dimension if the dimension has been designed at the lowest possible level of detail.
  • 132. Fact Constellation Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Fact constellation means two or more fact tables sharing one or more dimensions. It is also called Galaxy schema.  The Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact Constellation Schema can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.  The schema is viewed as a collection of stars, hence the name Galaxy Schema.  The fact constellation schema is also a type of multidimensional model.  In a Galaxy schema, shared dimensions are called Conformed Dimensions.
  • 133. Fact Constellation Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 134. Fact Constellation Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Characteristics of Galaxy Schema: • The dimensions in this schema are separated into separate dimensions based on the various levels of hierarchy. • For example, if geography has four levels of hierarchy like region, country, state, and city then the Galaxy schema should have four dimensions. • Moreover, it is possible to build this type of schema by splitting the one-star schema into more Star schemas. • The dimensions in this schema are large, as they need to be built based on the levels of hierarchy. • This schema is helpful for aggregating fact tables for better understanding.
  • 135. Fact Table vs Dimension Table Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.NO FACT TABLE DIMENSION TABLE 1 The fact table contains the measures defined on the attributes of a dimension table. The dimension table contains the attributes on which the fact table calculates the metrics. 2 Located at the center of a star or snowflake schema and surrounded by dimensions. Connected to the fact table and located at the edges of the star or snowflake schema. 3 Fact tables could contain information like sales against a set of dimensions like Product and Date. Every dimension table contains attributes which describe the details of the dimension. E.g., Product dimensions can contain Product ID, Product Category, etc.
  • 136. Fact Table vs Dimension Table Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.NO FACT TABLE DIMENSION TABLE 4 The fact table holds the primary keys of the dimensions as foreign keys. A dimension table has a primary key column that uniquely identifies each dimension record. 5 Does not contain hierarchies. Contains hierarchies. For example, Location could contain country, pin code, state, city, etc. 6 A fact table has fewer attributes than a dimension table. A dimension table has more attributes than a fact table. 7 The number of fact tables in a schema is less than the number of dimension tables. The number of dimension tables in a schema is more than the number of fact tables.
  • 137. Type of Facts Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • Additive – As the name implies, additive measures are measures which can be summed across all dimensions. • Non-additive – different from additive measures, non-additive measures are measures that cannot be summed across any dimension. • Semi-additive – semi-additive measures are measures that can be summed across only some dimensions and not across others.
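A small illustrative example of a semi-additive measure (month-end account balance): it can be summed across the account dimension but not across the time dimension. The records and values below are made up for illustration.

# Illustrative records of a semi-additive measure: month-end account balances.
balances = [
    {"account": "A", "month": "2021-01", "balance": 100.0},
    {"account": "B", "month": "2021-01", "balance": 250.0},
    {"account": "A", "month": "2021-02", "balance": 120.0},
    {"account": "B", "month": "2021-02", "balance": 230.0},
]

# Additive across the account dimension: total balance held in January is meaningful.
jan_total = sum(r["balance"] for r in balances if r["month"] == "2021-01")          # 350.0

# Not additive across the time dimension: summing A's balances over months (100 + 120)
# means nothing; an average or the latest value is used instead.
a_avg_over_time = sum(r["balance"] for r in balances if r["account"] == "A") / 2    # 110.0

print(jan_total, a_avg_over_time)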
  • 138. Designing fact table steps Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Here is an overview of the four steps to designing a fact table described by Kimball: 1. Choose the business process to model – The first step is to decide what business process to model by gathering and understanding business needs and available data. 2. Declare the grain – declaring the grain means describing exactly what a fact table record represents. 3. Choose the dimensions – once the grain of the fact table is stated clearly, it is time to determine the dimensions for the fact table. 4. Identify facts – identify carefully which facts will appear in the fact table.
  • 139. Star Vs Snowflake Schema: Key Differences Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no Star Schema Snow Flake Schema 1 Hierarchies for the dimensions are stored in the dimensional table. Hierarchies are divided into separate tables. 2 It contains a fact table surrounded by dimension tables. One fact table surrounded by dimension tables which are in turn surrounded by further dimension tables. 3 In a star schema, only a single join creates the relationship between the fact table and any dimension table. A snowflake schema requires many joins to fetch the data. 4 Simple DB Design. Very Complex DB Design.
  • 140. Star Vs Snowflake Schema: Key Differences Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no Star Schema Snow Flake Schema 5 Denormalized data structure; queries also run faster. Normalized Data Structure. 6 High level of data redundancy Very low level of data redundancy 7 A single dimension table contains aggregated data. Data is split into different dimension tables. 8 Cube processing is faster. Cube processing might be slow because of the complex joins. 9 Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. The Snowflake Schema is represented by a centralized fact table which is unlikely to be connected with multiple dimensions directly.
  • 141. Data Warehouse Models Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  From the perspective of data warehouse architecture, we have the following data warehouse models − • Enterprise warehouse:- collects all of the information about subjects spanning the entire organization. • Data Mart:- a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart. • Virtual warehouse • It is a virtual view of databases. • Virtual Warehouse have a logical description of all the databases and their structure. • This method creates single Database from all the data sources.
  • 142. Data Lake Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Data Lake is a storage repository that can store large amount of structured, semi-structured, and unstructured data.  It is a place to store every type of data in its native format with no fixed limits on account size or file.  It offers high data quantity to increase analytic performance and native integration.  Data Lake is like a large container which is very similar to real lake and rivers.  Just like in a lake you have multiple tributaries coming in, a data lake has structured data, unstructured data, machine to machine, logs flowing through in real-time.
  • 143. Data Lake Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 144. Data Lake Vs Data Warehouse: Key Differences Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no Data Lakes Data Warehouse 1 Data lakes store everything. Data Warehouse focuses only on Business Processes. 2 Data are mainly unprocessed Highly processed data. 3 It can be unstructured, semi-structured and structured. It is mostly in tabular form & structure. 4 Data Lake is mostly used by Data Scientists Business professionals widely use a Data Warehouse 5 Can use open source tools like Hadoop/MapReduce Mostly commercial tools like Google BigQuery, IBM, Amazon, Oracle.
  • 145. Big Data vs Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.NO. BIG DATA DATA WAREHOUSE 1 Big data is a technology to store and manage large amount of data. Data warehouse is an architecture used to organize the data. 2 Big data can handle structure, non-structure, semi-structured data. Data warehouse only handles structure data (relational or not relational) 3. Big data does processing by using distributed file system. Data warehouse doesn’t use distributed file system for processing. 4. Big data doesn’t follow any SQL queries to fetch data from database. In data warehouse we use SQL queries to fetch data from relational databases.
  • 146. Data Warehousing – Partitioning Strategy Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partitioning is done to enhance performance and facilitate easy management of data.  Partitioning also helps in balancing the various requirements of the system.  It optimizes the hardware performance and simplifies the management of data warehouse by partitioning each fact table into multiple separate partitions.  Why is it Necessary to Partition?  Partitioning is important for the following reasons − 1. For easy management, 2. To assist backup/recovery, 3. To enhance performance.
  • 147. Data Warehousing – Partitioning Strategy Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  For Easy Management  The fact table in a data warehouse can grow up to hundreds of gigabytes in size.  This huge size of fact table is very hard to manage as a single entity. Therefore it needs partitioning.  To Assist Backup/Recovery  If we do not partition the fact table, then we have to load the complete fact table with all the data.  Partitioning allows us to load only as much data as is required on a regular basis.  It reduces the time to load and also enhances the performance of the system.
  • 148. Data Warehousing – Partitioning Strategy Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only.  We can then put these partitions into a state where they cannot be modified.  Then they can be backed up. It means only the current partition is to be backed up.  To Enhance Performance  By partitioning the fact table into sets of data, the query procedures can be enhanced.  Query performance is enhanced because now the query scans only those partitions that are relevant.  It does not have to scan the whole data.
  • 149. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  There are various ways in which a fact table can be partitioned.  In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.  Partitioning by Time into Equal Segments:  In this partitioning strategy, the fact table is partitioned on the basis of time period.  Here each time period represents a significant retention period within the business.  For example, if the user queries for month to date data then it is appropriate to partition the data into monthly segments.  We can reuse the partitioned tables by removing the data in them.
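A minimal application-level sketch of partitioning fact rows by month, assuming illustrative row fields; a real warehouse would use the database's own partitioning feature, but the routing idea and the benefit of scanning only the relevant partition are the same.

from collections import defaultdict

# Hypothetical fact rows; each carries a transaction date.
fact_rows = [
    {"transaction_id": 1, "transaction_date": "2021-01-14", "value": 120.0},
    {"transaction_id": 2, "transaction_date": "2021-01-30", "value": 75.5},
    {"transaction_id": 3, "transaction_date": "2021-02-02", "value": 310.0},
]

# Horizontal partitioning by time: one partition (here, one list) per calendar month.
partitions = defaultdict(list)
for row in fact_rows:
    month_key = row["transaction_date"][:7]              # e.g. "2021-01"
    partitions[f"sales_fact_{month_key}"].append(row)

# A month-to-date query only has to scan the single relevant partition.
current = partitions["sales_fact_2021-02"]
print(len(partitions), sum(r["value"] for r in current))  # 2 partitions, 310.0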
  • 150. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition by Time into Different-sized Segments  This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data, larger partition for inactive data.
  • 151. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Points to Note • The detailed information remains available online. • The number of physical tables is kept relatively small, which reduces the operating cost. • This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required. • This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operation cost of the data warehouse.
  • 152. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition on a Different Dimension  The fact table can also be partitioned on the basis of dimensions other than time such as product group, region, supplier, or any other dimension.  Let's have an example.  Suppose a market function has been structured into distinct regional departments like on a state by state basis.  If each region wants to query on information captured within its region, it would prove to be more effective to partition the fact table into regional partitions.  This will cause the queries to speed up because it does not require to scan information that is not relevant.
  • 153. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Points to Note • The query does not have to scan irrelevant data, which speeds up the query process. • This technique is appropriate only where the dimension is unlikely to change in future. So, it is worth determining that the dimension does not change in future. • If the dimension changes, then the entire fact table would have to be repartitioned.  Note − It is recommended to perform the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
  • 154. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition by Size of Table  When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size.  We can set the predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created.  Points to Note • This partitioning is complex to manage. • It requires metadata to identify what data is stored in each partition.
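A small sketch of size-based partitioning, with a made-up row threshold: a new partition is started once the current one reaches the predetermined size, and a metadata list records what each partition holds, as the notes above require.

MAX_ROWS_PER_PARTITION = 2              # the predetermined size threshold (illustrative)

partitions = [[]]                       # list of partitions, newest last
partition_metadata = [{"name": "sales_fact_p1", "rows": 0}]

def load_row(row):
    """Append a row, starting a new partition once the current one reaches the threshold."""
    if len(partitions[-1]) >= MAX_ROWS_PER_PARTITION:
        partitions.append([])
        partition_metadata.append({"name": f"sales_fact_p{len(partitions)}", "rows": 0})
    partitions[-1].append(row)
    partition_metadata[-1]["rows"] += 1

for i in range(5):
    load_row({"transaction_id": i})

# The metadata identifies what is stored in each partition: 2 + 2 + 1 rows here.
print(partition_metadata)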
  • 155. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partitioning Dimensions  If a dimension contains large number of entries, then it is required to partition the dimensions. Here we have to check the size of a dimension.  Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.  Round Robin Partitions  In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow user access tool to refer to the correct table partition.  This technique makes it easy to automate table management facilities within the data warehouse.
  • 156. Partitioning Strategy - Vertical Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done.
  • 157. Partitioning Strategy - Vertical Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Vertical partitioning can be performed in the following two ways − • Normalization • Row Splitting  Normalization:- Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space.  Row Splitting:- Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.  Note − While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.
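A minimal sketch of row splitting on a hypothetical account table: frequently used columns go into one partition and bulky, rarely used columns into another, keeping a one-to-one mapping between the partitions through the key.

# A wide account table (illustrative columns and rows).
account_rows = [
    {"account_id": 1, "name": "Asha", "region": "West", "statement_text": "long text...", "notes": "..."},
    {"account_id": 2, "name": "Ravi", "region": "East", "statement_text": "long text...", "notes": "..."},
]

# Row splitting: frequently used columns in one partition, bulky rarely used columns in another.
# The two partitions keep a one-to-one mapping through account_id.
hot_partition = [{"account_id": r["account_id"], "name": r["name"], "region": r["region"]}
                 for r in account_rows]
cold_partition = [{"account_id": r["account_id"], "statement_text": r["statement_text"],
                   "notes": r["notes"]}
                  for r in account_rows]

# Queries that only need name and region now scan the much smaller hot partition.
print([r["name"] for r in hot_partition if r["region"] == "West"])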
  • 158. Identify Key to Partition Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table.  Let's have an example. Suppose we want to partition the following table.  Account_Txn_Table  transaction_id  account_id  transaction_type  value  transaction_date  region  branch_name
  • 159. Identify Key to Partition Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  We can choose to partition on any key. The two possible keys could be 1) region 2) transaction_date  Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.  If we partition by transaction_date instead of region, then the latest transaction from every region will be in one partition. Now the user who wants to look at data within his own region has to query across multiple partitions.  Hence it is worth determining the right partitioning key.
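A small sketch contrasting the two candidate keys on made-up transactions: with region as the partition key a user's regional query touches a single partition, while with transaction_date it has to scan every partition.

txns = [
    {"account_id": "A1", "region": "North", "transaction_date": "2021-03-01", "value": 10.0},
    {"account_id": "A2", "region": "North", "transaction_date": "2021-03-02", "value": 20.0},
    {"account_id": "A3", "region": "South", "transaction_date": "2021-03-01", "value": 30.0},
]

def partition_by(rows, key):
    """Group rows into partitions keyed by one column (application-level stand-in)."""
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts

by_region = partition_by(txns, "region")
by_date = partition_by(txns, "transaction_date")

# "Total value for my own region" touches exactly one partition under the region key...
north_total = sum(r["value"] for r in by_region["North"])
# ...but has to scan every date partition under the transaction_date key.
north_total_scan = sum(r["value"] for part in by_date.values() for r in part if r["region"] == "North")
print(north_total, north_total_scan)   # both 30.0, but the second scans all partitions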
  • 160. Summary Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that is used in organizational decision making.  A data mart is defined as an implementation of a data warehouse with a small and more tightly restricted scope of data and data warehouse functions, serving a single department or part of an organization.  The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.  Metadata is data about data. Metadata does not give just a description of the entity, but also gives other details explaining the syntax and semantics of the data elements.
  • 161. Summary Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Virtual Warehouse has a logical description of all the databases and their structure.  In the STAR Schema, the center of the star can have one fact table and a number of associated dimension tables.  A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. It has normalized dimensions.  A Fact constellation means two or more fact tables sharing one or more dimensions. It is also called Galaxy schema.  Partitioning is done to enhance performance and facilitate easy management of data.  Partitioning strategies help in easy management, assist backup/recovery, and enhance performance.
  • 162. Unit – 1 Any - 5 Assignment Questions Marks:-20 Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Q.1 What is Data Warehouse? Explain the data warehouse architecture with diagram.  Q.2 Discuss Star, Snowflake and Galaxy schema for multidimensional Database.  Q.3 Give reason, why it is necessary to separate data warehouse from operational database.  Q.4 What is the need of data warehouse. Explain characteristics of data warehouse.  Q.5 What is Data Mart? What are the types of Data Mart?  Q.6 Explain ETL Process in data warehouse.  Q.7 Explain:  1) Metadata 2) Fact Table 3) Vertical Partitioning
  • 164. Thank You Great God, Medi-Caps, All the attendees Mr. Sagar Pandya sagar.pandya@medicaps.ac.in www.sagarpandya.tk LinkedIn: /in/seapandya Twitter: @seapandya Facebook: /seapandya