MEDI-CAPS UNIVERSITY
Faculty of Engineering
Mr. Sagar Pandya
Information Technology Department
sagar.pandya@medicaps.ac.in
Data Mining and Warehousing
Course Code: IT3ED02
Course Name: Data Mining and Warehousing
Hours Per Week (L-T-P): 3-0-0
Total Credits: 3
 Unit 1. Introduction
 Unit 2. Data Mining
 Unit 3. Association and Classification
 Unit 4. Clustering
 Unit 5. Business Analysis
Text Books
 Han, Kamber and Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
India, 2012.
 Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press.
 Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA.
Reference Books
 Sam Anahory and Dennis Murray, Data Warehousing in the Real World,
Pearson Education Asia.
 W. H. Inmon, Building the Data Warehouse, 4th Ed., Wiley India.
and many others
Unit-1 Introduction
 Data warehousing Components –Building a Data warehouse,
 Need for data warehousing,
 Basic elements of data warehousing,
 Data Mart,
 Data Extraction, Clean-up, and Transformation Tools –Metadata,
 Star, Snowflake and Galaxy Schemas for Multidimensional databases,
 Fact and dimension data,
 Partitioning Strategy-Horizontal and Vertical Partitioning.
What is Data?
 Data is a collection of unprocessed items that may consist of text,
numbers, images and video. Today, data can be represented in
various forms such as sound, images and video.
 Structured: numbers, text, etc.
 Unstructured: images, video, etc.
What is Information?
 Meaningful data is called information.
 Information refers to the data that have been processed in such a way
that the knowledge of the person who uses the data is increased.
 Example:- 1A$ - Data (No meaning)
1$ - Information (Currency)
 For a decision to be meaningful, the processed data must satisfy
the following characteristics:
• Timely − Information should be available when required.
• Accuracy − Information should be accurate.
• Completeness − Information should be complete.
What is Metadata?
 Metadata describes other data.
 Data about data,
 For example - an image may include metadata that describes how
large the picture is, the color depth, the image resolution, when the
image was created, and other data.
 A text document's metadata may contain information about how long
the document is, who the author is, when the document was written,
and a short summary of the document.
 1) Operational Metadata
 2) Extraction and Transformation Metadata
 3) End User Metadata
What is Database and DBMS?
 A database is a collection of inter-related data that supports efficient
retrieval, insertion and deletion of data and organizes the data in the
form of tables.
 The software which is used to manage database is called Database
Management System (DBMS).
 A database management system stores data in such a way that it
becomes easier to retrieve, manipulate, and produce information.
 For Example, MySQL, Oracle etc. are popular commercial DBMS
used in different applications.
Operational vs. Informational Systems
 Operational systems, as their name implies, are the systems that support
the everyday operations of the enterprise.
 These are the backbone systems of any enterprise, and include order
entry, inventory, manufacturing, payroll and accounting.
 Due to their importance to the organization, operational systems
were almost always the first parts of the enterprise to be
computerized.
Operational vs. Informational Systems
 Informational systems deal with analyzing data and making
decisions, often major, about how the enterprise will operate now,
and in the future.
 Not only do informational systems have a different focus from
operational ones, they often have a different scope.
 Where operational data needs are normally focused upon a single
area, informational data needs often span a number of different areas
and need large amounts of related operational data.
Data Warehouse
 The term "Data Warehouse" was first coined by Bill Inmon in 1990.
He is considered the father of the data warehouse.
 According to Inmon, a data warehouse is a subject-oriented,
integrated, time-variant, and non-volatile collection of data.
 According to Ralph Kimball, a data warehouse is a copy of transaction
data specifically structured for query and analysis.
 A single, complete and consistent store of data obtained from a
variety of different sources, made available to end users in a way
they can understand and use in a business context.
Data Warehouse
 This data helps analysts to take informed decisions in an
organization.
 A Data Warehouse is a group of data specific to the entire
organization, not only to a particular group of users.
 It is not used for daily operations and transaction processing but used
for making decisions.
Data Warehouse
 Data is a collection of raw material in an unorganized format. We
have to convert that data into information. To make a decision,
we need to collect the data; using that data we get information,
and finally we take the decision.
 Example:- In an organization, we have many departments such as the Sales
department, Product department, HR department and many others. Before releasing
any product to the market, the CEO collects data from the Sales
department and the Product department to take decisions on profits
and losses.
Data Warehouse
 In an organisation, there are several departments, and each
department performs different kinds of transactions; all these
transactions are saved in an Operational Data Store (ODS).
 The main characteristic of an ODS is that its data is volatile and it does not
maintain any history. So what does volatile mean? It means
the data changes at regular intervals of time.
 Example:- At Big Bazaar, the CEO needs to take a decision about a
particular product, so he needs 3 to 5 years of data. But an ODS
does not maintain any history. So, every organisation should
maintain historical data to take decisions based on product sales.
Data Warehouse
 Data warehousing is the process of constructing and using a data
warehouse.
 A data warehouse is a database, which is kept separate from the
organization's operational database.
 A data warehouse helps executives to organize, understand, and use
their data to take strategic decisions.
 It possesses consolidated historical data, which helps the
organization to analyze its business.
 There is no frequent updating done in a data warehouse.
What can a Data Warehouse do & can’t do?
What can a Data Warehouse do?
 Get Answer Faster
 Make Decision Faster
 Optimize Performance
 Reduce Risk and Cost
What can a Data Warehouse not do?
 Can’t create data itself
 Cleaning of data is required
Need for Data Warehouse
1. Improving Integration:
 An organization registers data in different systems, which support the
various business processes.
In order to create an overall picture of business operations, customers
and suppliers – thus creating a single version of the truth – the data
must come together in one place and be made compatible.
 Both external (from the environment) and internal data (from ERP
and financial systems) should merge into the data warehouse and
then be grouped.
Need for Data Warehouse
2. Speeding up response times
 The source systems are fully optimized in order to process many
small transactions, such as orders, in a short time.
 Creating information about the performance of the organization only
requires a few large ‘transactions’ during which large amounts of
data are being gathered and aggregated.
 The structure of a data warehouse is specifically designed to quickly
analyze such large amounts of data.
Need for Data Warehouse
3. Faster and more flexible reporting:
 The structure of both data warehouses and data marts enables end
users to report in a flexible manner and to quickly perform
interactive analysis on the basis of various predefined angles
(dimensions).
 They may, for example, jump with a single mouse click from year
level to quarter level to month level, and quickly switch between the
customer dimension and the product dimension while the
indicator remains fixed.
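A minimal pandas sketch of this kind of drill-down and dimension switching, assuming a hypothetical sales table with year, quarter, month, customer, product and amount columns:

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
sales = pd.DataFrame({
    "year":     [2020, 2020, 2020, 2021],
    "quarter":  ["Q1", "Q1", "Q2", "Q1"],
    "month":    ["Jan", "Feb", "Apr", "Jan"],
    "customer": ["A", "B", "A", "C"],
    "product":  ["P1", "P2", "P1", "P3"],
    "amount":   [100.0, 250.0, 80.0, 300.0],
})

# Drill down: year level -> quarter level -> month level.
by_year = sales.groupby("year")["amount"].sum()
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()
by_month = sales.groupby(["year", "quarter", "month"])["amount"].sum()

# Switch dimension while the indicator (total amount) stays fixed.
by_customer = sales.groupby("customer")["amount"].sum()
by_product = sales.groupby("product")["amount"].sum()
print(by_quarter)
print(by_product)
```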
Need for Data Warehouse
 In most organizations, data about specific parts of the business exists
somewhere, in some form, and usually in large volumes.
 Data is available, but not information – and not the right information
at the right time.
 Bring together information from multiple resources as to provide a
consistent database source for decision support queries.
 To help workers in their everyday business activity and improve their
productivity.
 To help knowledge workers (Executives, Managers, Analysts) make
faster and better decisions – decision support systems.
Data Warehouse Features
 Subject Orientation:- Data is organized by subject.
 Integration:- Consistency in defining parameters across sources.
 Non-Volatility:- Data, once loaded, remains stable and is not changed or deleted.
 Time-Variance:- Data is associated with points in time, and history is preserved.
 Data Granularity:- Details of data are kept at a low level.
Data Warehouse Characteristics
Subject-oriented
 A data warehouse is subject-oriented because it provides information
around a subject rather than the organization's ongoing operations.
 It is a subject-oriented database which supports the business needs of
department-specific users.
 Example: Sales, HR, Accounts, Marketing, etc.
Subject-oriented
 A data warehouse targets the modeling and analysis of data for
decision-makers.
 Therefore, data warehouses typically provide a concise and
straightforward view around a particular subject, such as customer,
product, or sales, instead of the global organization's ongoing
operations.
 This is done by excluding data that are not useful concerning the
subject and including all data needed by the users to understand the
subject.
Integrated
 In a data warehouse, integration means the establishment of a
common unit of measure for all similar data from dissimilar
databases.
 The data also needs to be stored in the data warehouse in a common
and universally acceptable manner.
 A data warehouse integrates various heterogeneous data sources like
RDBMS, flat files, and online transaction records.
 This integration helps in effective analysis of data. Consistency in
naming conventions, attribute measures, encoding structure etc. have
to be ensured.
Integrated
 Consider three different applications labeled A, B and C.
 The information stored in these applications includes Gender, Date, and
Balance; however, each application stores the data in a different way.
• In Application A, the gender field stores logical values like M or F.
• In Application B, the gender field is a numerical value.
• In Application C, the gender field is stored in the form of a
character value.
• The same is the case with Date and Balance.
 However, after the transformation and cleaning process, all this data is
stored in a common format in the Data Warehouse.
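A minimal sketch of this kind of integration step in Python; the per-application conventions and the mapping rules below are assumptions for illustration only:

```python
# Assumed source conventions: App A uses 'M'/'F', App B uses 1/0,
# App C uses 'male'/'female'. The warehouse standard is 'M'/'F'.
def standardize_gender(value, source):
    if source == "A":
        return value                              # already 'M' or 'F'
    if source == "B":
        return "M" if value == 1 else "F"         # numeric code -> letter
    if source == "C":
        return value.strip().lower()[0].upper()   # 'male' -> 'M', 'female' -> 'F'
    raise ValueError(f"unknown source {source}")

rows = [("M", "A"), (1, "B"), ("female", "C")]
print([standardize_gender(v, s) for v, s in rows])  # ['M', 'M', 'F']
```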
Time-Variant
 A data warehouse is a time-variant database which supports business
management in analyzing the business and comparing it across
different time periods such as Year, Quarter, Month, Week
and Date.
 Historical information is kept in a data warehouse.
 For example, one can retrieve data that is 3 months, 6 months, 12
months or even older from a data warehouse.
 This contrasts with a transaction system, where often only the
most current data is kept.
 Another aspect of time variance is that once data is inserted in the
warehouse, it can't be updated or changed.
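A small illustration in plain Python of retrieving such historical slices, using invented records and a rough 30-day month window:

```python
from datetime import date, timedelta

# Hypothetical warehouse rows keyed by a date attribute.
warehouse = [
    {"sale_date": date(2020, 1, 15), "amount": 120.0},
    {"sale_date": date(2020, 7, 3),  "amount": 340.0},
    {"sale_date": date(2021, 1, 20), "amount": 90.0},
]

def last_n_months(rows, n, today=date(2021, 2, 1)):
    cutoff = today - timedelta(days=30 * n)   # rough month window
    return [r for r in rows if r["sale_date"] >= cutoff]

print(len(last_n_months(warehouse, 3)))   # rows from roughly the last 3 months
print(len(last_n_months(warehouse, 12)))  # rows from roughly the last 12 months
```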
Non-Volatile
 Non-volatile means the previous data is not erased when new data is
added to it.
 A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not
reflected in the data warehouse.
 Typical activities such as deletes, inserts, and changes that are
performed in an operational application environment are completely
nonexistent in a DW environment.
 Only two types of data operations are performed in data
warehousing:
1. Data loading
2. Data access
Data Warehouse VS Operational Database
S.no. | Data Warehouse | Operational Database
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | Data warehouse systems are used by knowledge workers such as executives, managers, and analysts. | Operational database systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
Data Warehouse VS Operational Database
S.no. | Data Warehouse | Operational Database
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
Data Warehouse VS Operational Database
S.no. | Data Warehouse | Operational Database
9 | The number of users is in hundreds. | The number of users is in thousands.
10 | The number of records accessed is in millions. | The number of records accessed is in tens.
11 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
12 | These are highly flexible. | It provides high performance.
How Datawarehouse works?
 A Data Warehouse works as a central repository where information
arrives from one or more data sources.
 Data flows into a data warehouse from the transactional system and
other relational databases.
 Data may be:
1. Structured
2. Semi-structured
3. Unstructured data
How Datawarehouse works?
 The data is processed, transformed, and ingested so that users can
access the processed data in the Data Warehouse through Business
Intelligence tools, SQL clients, and spreadsheets.
 A data warehouse merges information coming from different sources
into one comprehensive database.
 By merging all of this information in one place, an organization can
analyze its customers more holistically.
 This helps to ensure that it has considered all the information
available.
 Data warehousing makes data mining possible.
 Data mining is looking for patterns in the data that may lead to
higher sales and profits.
Benefits of a Data Warehouse
1) Delivers enhanced business intelligence
 By having access to information from various sources from a single
platform, decision makers will no longer need to rely on limited data
or their instinct.
2) Saves time
 Executives can query the data themselves with little to no IT support,
saving more time and money.
3) Enhances data quality and consistency
 A data warehouse converts data from multiple sources into a
consistent format. Since the data from across the organization is
standardized, each department will produce results that are
consistent. This will lead to more accurate data, which will become
the basis for solid decisions.
Benefits of a Data Warehouse
4) Improves the decision-making process
 By transforming data into purposeful information, decision makers
can perform more functional, precise, and reliable analysis and create
more useful reports with ease.
5) Drives Revenue
 It is often said that “data is the new oil,” referring to the high dollar value of data in
today’s world. Creating more standardized and better quality data is
the key strength of a data warehouse, and this key strength translates
clearly into significant revenue gains. The data warehouse formula
works like this: better business intelligence leads to better
decisions, and in turn better decisions create a higher return on
investment across any sector of your business.
Online Analytical Processing (OLAP)
• Involves historical processing of information.
• OLAP systems are used by knowledge workers such as executives,
managers and analysts.
• It focuses on Information out.
• Based on Star Schema, Snowflake Schema and Fact Constellation
Schema.
• Contains historical data.
• Provides summarized and consolidated data.
• Provides summarized and multidimensional view of data.
• Number of users is in hundreds.
• Number of records accessed is in millions.
• Database size is from 100 GB to 1 TB
Online Transactional Processing (OLTP)
• Involves day-to-day processing.
• OLTP systems are used by clerks, DBAs, or database professionals.
• It focuses on Data in.
• Based on Entity Relationship Model.
• Contains current data.
• Provides primitive and highly detailed data.
• Provides detailed and flat relational view of data.
• Number of users is in thousands.
• Number of records accessed is in tens.
• Database size is from 100 MB to 1 GB.
Data Mart
• A data mart is a simple section of the data warehouse that delivers a
single functional data set.
• Often holds only one subject area- for example, Finance, or Sales.
• May hold more summarized data.
Data Mart
• Windows-based or Unix/Linux-based servers are used to implement
data marts.
• They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of
time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may become complex in the long run if its
planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data
warehouse.
• Data marts are flexible.
Need Of Data Mart
 Data Mart focuses only on functioning of particular department of an
organization.
 It is maintained by single authority of an organization.
 Since it stores data related to a specific part of an organization,
data retrieval from it is very quick.
 Designing and maintaining a data mart is quite easy compared to
a data warehouse.
 It reduces the response time of users as it stores a small volume of data.
 It is small in size, due to which accessing data from it is very fast.
 This storage unit is used by most organizations for the smooth
running of their departments.
Types of Data Mart:
 There are three main types of data marts:
1. Dependent: Dependent data marts are created by drawing data
directly from operational, external or both sources.
2. Independent: Independent data mart is created without the use of a
central data warehouse.
3. Hybrid: This type of data marts can take data from data warehouses
or operational systems.
Dependent Data Mart
 A dependent data mart is created by extracting data from the central
repository, the data warehouse.
 First data warehouse is created by extracting data (through ETL tool)
from external sources and then data mart is created from data
warehouse.
 Dependent data mart is created in top-down approach of
Datawarehouse architecture.
 This model of data mart is used by big organizations.
Independent Data Mart
 The second approach is Independent data marts (IDM).
 Independent Data Mart is created directly from external sources
instead of data warehouse.
 First data mart is created by extracting data from external sources
and then Datawarehouse is created from the data present in data
mart.
 Independent data mart is designed in bottom-up approach of
Datawarehouse architecture.
 This model of data mart is used by small organizations and is cost
effective comparatively.
Hybrid Data Mart
 This type of Data Mart is created by extracting data from operational
source or from data warehouse.
 It is best suited for multiple database environments and fast
implementation turnaround for any organization.
 It also requires least data cleansing effort.
 A hybrid data mart also supports large storage structures, and it is best
suited for flexible, smaller data-centric applications.
 1) Path-1 reflects accessing data directly from external sources and
 2) Path-2 reflects dependent data model of data mart.
Steps in Implementing a Datamart
 Implementing a Data Mart is a rewarding but complex procedure.
 The significant steps in implementing a data mart are: design the
schema, construct the physical storage, populate the data mart with
data from source systems, access it to make informed decisions, and
manage it over time.
Advantages of Data Mart
 Implementation of data mart needs less time as compared to
implementation of Datawarehouse as data mart is designed for a
particular department of an organization.
 Organizations are provided with choices to choose model of data
mart depending upon cost and their business.
 Data can be easily accessed from data mart.
 It serves frequently run queries, which enables analysis of business
trends.
Disadvantages of Data Mart
 Since it stores data related only to a specific function, it does not
store the huge volume of data related to each and every department of an
organisation the way a data warehouse does.
 It can become a big hurdle to maintain.
Difference between Datawarehouse & Data Mart
Data Warehouse | Data Mart
A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is a subtype of a data warehouse, architected to meet the requirements of a specific user group.
It may hold multiple subject areas. | It holds only one subject area, for example, Finance or Sales.
It holds very detailed information. | It may hold more summarized data.
A data warehouse is data-oriented. | A data mart is project-oriented.
In data warehousing, a fact constellation is used. | In a data mart, star schema and snowflake schema are used.
It is a centralized system. | It is a decentralized system.
ETL Process
 The mechanism of extracting information from source systems and
bringing it into the data warehouse is commonly called ETL, which
stands for Extraction, Transformation and Loading.
ETL Process
 It is a process in which an ETL tool extracts the data from various
data source systems, transforms it in the staging area and then finally,
loads it into the Data Warehouse system.
Why do you need ETL?
 It helps companies to analyze their business data for taking critical
business decisions.
 Transactional databases cannot answer complex business questions
that can be answered by ETL.
 ETL provides a method of moving the data from various sources into
a data warehouse.
 A well-designed and documented ETL system is almost essential to the
success of a Data Warehouse project.
 ETL helps to Migrate data into a Data Warehouse. Convert to the
various formats and types to adhere to one consistent system.
 ETL is a predefined process for accessing and manipulating source
data into the target database.
ETL Process - Extraction
 Extraction is the operation of extracting information from a source
system for further use in a data warehouse environment. This is the
first stage of the ETL process.
 Extraction process is often one of the most time-consuming tasks in
the ETL.
 The source systems might be complicated and poorly documented,
and thus determining which data needs to be extracted can be
difficult.
 The data has to be extracted several times in a periodic manner to
supply all changed data to the warehouse and keep it up-to-date.
ETL Process - Extraction
 It is important to extract the data from various source systems and
store it into the staging area first and not directly into the data
warehouse because the extracted data is in various formats and can
be corrupted also.
 Hence loading it directly into the data warehouse may damage
it. Therefore, this is one of the most important steps of ETL process.
 The extraction step should be designed in such a way that it does not
have a negative effect on the source system.
 Data extraction time slots for different systems vary as per their time
zones and operational hours.
ETL Process - Transformation
 The second step of the ETL process is transformation. In this step, a
set of rules or functions are applied on the extracted data to convert it
into a single standard format.
 Data extracted from source server is raw and not usable in its original
form. Therefore it needs to be cleansed, mapped and transformed.
 The main objective of this step is to load the extracted data into the
target database in a clean and uniform format.
 For example, consider two sources A and B.
 The date format of A is dd/mm/yyyy and the format of B is mm/dd/yy.
 During transformation, these date formats are brought into a single,
general format.
ETL Process - Transformation
 In this step, a set of rules or functions are applied on the extracted
data to convert it into a single standard format. It may involve
following processes/tasks:
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States and America into USA, etc.
 Joining – joining multiple attributes (columns) into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally
key-attribute).
 Enrichment – Full name to ‘First Name’, ‘Middle Name’ & ‘Last
Name’.
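A minimal pandas sketch of a few of these transformation tasks (cleaning, mapping, splitting, sorting); the column names and default values are assumptions for illustration:

```python
import pandas as pd

# Raw extract with inconsistent values (illustrative only).
raw = pd.DataFrame({
    "full_name": ["Amit Kumar Shah", "Neha Rani Verma", "Ravi N. Patel"],
    "country":   ["U.S.A", "United States", "America"],
    "amount":    [120.0, None, 75.5],
})

# Cleaning: fill NULLs with a default and map country variants to one value.
raw["amount"] = raw["amount"].fillna(0.0)
raw["country"] = raw["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})

# Splitting: one attribute into several (first / middle / last name).
parts = raw["full_name"].str.split(" ", n=2, expand=True)
raw["first_name"] = parts[0]
raw["middle_name"] = parts[1]
raw["last_name"] = parts[2]

# Sorting: order tuples on a key attribute.
clean = raw.sort_values("amount", ascending=False)
print(clean[["first_name", "country", "amount"]])
```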
ETL Process - Transformation
 Following are Data Integrity Problems:
1) Different spelling of the same person like Jon, John, etc.
2) There are multiple ways to denote company name like Google,
Google pvt. ltd., Google Inc.
3) Use of different names like Mumbai, Bombay.
4) There may be a case that different account numbers are generated
by various applications for the same customer.
5) In some records, required fields remain blank.
6) Invalid products collected at POS, since manual entry can lead to
mistakes.
ETL Process - Loading
 The third and final step of the ETL process is loading. In this step,
the transformed data is finally loaded into the data warehouse.
 Sometimes the data is updated by loading into the data warehouse
very frequently and sometimes it is done after longer but regular
intervals.
 The rate and period of loading solely depends on the requirements
and varies from system to system.
 In case of load failure, recovery mechanisms should be configured to
restart from the point of failure without loss of data integrity.
 Data warehouse admins need to monitor, resume, or cancel loads as per
the prevailing server performance.
ETL Process
 The ETL process can also use the pipelining concept, i.e., as soon as some
data is extracted, it can be transformed, and during that period some new
data can be extracted. And while the transformed data is being loaded
into the data warehouse, the already extracted data can be
transformed.
 The block diagram of the pipelined ETL process is shown below:
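A minimal generator-based sketch of the same pipelining idea in Python, assuming in-memory lists stand in for the source system and the warehouse:

```python
# Generators process rows lazily, so extraction, transformation and
# loading of successive rows can overlap, as in a pipelined ETL flow.
source = [" alice,100 ", " bob,200 ", " carol,300 "]
warehouse = []

def extract(rows):
    for row in rows:                 # rows stream out one at a time
        yield row.strip()

def transform(rows):
    for row in rows:                 # transform while extraction continues
        name, amount = row.split(",")
        yield {"name": name.title(), "amount": float(amount)}

def load(rows):
    for row in rows:                 # load while earlier rows are transformed
        warehouse.append(row)

load(transform(extract(source)))
print(warehouse)
```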
Selecting an ETL Tool
 Selection of an appropriate ETL tool is an important decision that
has to be made for the success of an ODS or data
warehousing application.
 The ETL tools are required to provide coordinated access to multiple
data sources so that relevant data may be extracted from them.
 An ETL tool would generally contain tools for data cleansing, re-
organization, transformations, aggregation, calculation and automatic
loading of information into the target database.
 An ETL tool should provide a simple user interface that allows data
cleansing and data transformation rules to be specified using a point-
and-click approach.
ETL tools
 When all mappings and transformations have been defined, the ETL
tool should automatically generate the data
extract/transformation/load programs.
 There are many data warehousing tools available in the market.
Here are some of the most prominent ones:
 1. MarkLogic
 2. Oracle
 3. Amazon RedShift
 4. Sybase
Components of Data Warehouse
 Source Data Component
 Data Staging Component (ETL)
 Metadata Component
 End user tools and applications
 Data Warehouse Management
Data Warehouse Architecture
 DATA WAREHOUSE ARCHITECTURE is complex, as it’s an
information system that contains historical and cumulative data
from multiple sources.
 Data warehouses and their architectures vary depending upon the
specifics of an organization's situation. Three common architectures
are:
 Data Warehouse Architecture (Basic)
 Data Warehouse Architecture (with a staging area)
 Data Warehouse Architecture (with a staging area and data
mart)
Data Warehouse Architecture (Basic)
 Operational System:- An operational system is a method used in
data warehousing to refer to a system that is used to process the day-
to-day transactions of an organization.
 Flat Files:- A Flat file system is a system of files in which
transactional data is stored, and every file in the system must have a
different name.
 End-User access Tools:- The principal purpose of a data warehouse
is to provide information to the business managers for strategic
decision-making. These customers interact with the warehouse using
end-client access tools.
• Example:- Reporting and Query Tools, Application Development
Tools, Executive Information Systems Tools, Online Analytical
Processing Tools, Data Mining Tools
Data Warehouse Architecture (With Staging Area)
• We must clean and process operational information before putting it
into the warehouse.
 We can do this programmatically, although most data warehouses use
a staging area (a place where data is processed before entering the
warehouse).
 A staging area simplifies data cleansing and consolidation for
operational data coming from multiple source systems, especially
for enterprise data warehouses where all relevant data of an
enterprise is consolidated.
Data Warehouse Architecture (With Staging Area)
• Data Warehouse Staging Area is a temporary location where a
record from source systems is copied.
Data Warehouse Architecture (With Staging Area
and Data Marts)
 We may want to customize our warehouse's architecture for multiple
groups within our organization.
 We can do this by adding data marts.
 A data mart is a segment of a data warehouse that provides
information for reporting and analysis on a section, unit, department
or operation in the company, e.g., sales, payroll, production, etc.
 The figure illustrates an example where purchasing, sales, and stocks
are separated.
 In this example, a financial analyst wants to analyze historical data
for purchases and sales or mine historical information to make
predictions about customer behavior.
Types of Data Warehouse Architectures
 DATA WAREHOUSE ARCHITECTURE is complex, as it’s an
information system that contains historical and cumulative data
from multiple sources. There are 3 methods for constructing a data
warehouse: Single Tier, Two Tier and Three Tier.
Types of Data Warehouse Architectures
Single-Tier Architecture
 The objective of a single layer is to minimize the amount of data
stored.
 This goal is to remove data redundancy.
 This architecture is not frequently used in practice.
Two-Tier Architecture
 Two-layer architecture separates physically available sources and
data warehouse.
 This architecture is not expandable and also not supporting a large
number of end-users.
 It also has connectivity problems because of network limitations.
Types of Data Warehouse Architectures
Three-tier architecture
 This is the most widely used architecture.
 Generally, a data warehouse adopts a three-tier architecture,
consisting of:
 1. Bottom tier
 2. Middle tier
 3. Top tier
Types of Data Warehouse Architectures – 3 Tier
1. Bottom Tier: The database of the data warehouse serves as the
bottom tier. It is usually a relational database system. Data is
cleansed, transformed, and loaded into this layer using back-end
tools.
2. Middle Tier: The middle tier in Data warehouse is an OLAP server
which is implemented using either ROLAP or MOLAP model. For a
user, this application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end-user and the
database.
3. Top-Tier: The top tier is a front-end client layer. Top tier is the tools
and API that you connect and get data out from the data warehouse.
It could be Query tools, reporting tools, managed query tools,
Analysis tools and Data mining tools.
Types of Data Warehouse Architectures – 3 Tier
1.) Top Tier
 The Top Tier consists of the Client-side front end of the architecture.
 The Transformed and Logic applied information stored in the Data
Warehouse will be used and acquired for Business purposes in this
Tier.
 Several Tools for Report Generation and Analysis are present for the
generation of desired information.
 Data mining which has become a great trend these days is done here.
 All requirement analysis documents, costing, and the features that
determine a profit-based business deal are prepared based on these tools,
which use the data warehouse information.
Types of Data Warehouse Architectures – 3 Tier
2.) Middle Tier
 The Middle Tier consists of the OLAP servers.
 OLAP stands for Online Analytical Processing.
 OLAP is used to provide information to business analysts and
managers.
 As it is located in the Middle Tier, it rightfully interacts with the
information present in the Bottom Tier and passes on the insights to
the Top Tier tools which processes the available information.
 Mostly Relational or Multidimensional OLAP is used in Data
warehouse architecture.
Types of Data Warehouse Architectures – 3 Tier
 Bottom Tier:- The Bottom Tier mainly consists of the Data Sources,
ETL Tool, and Data Warehouse.
 1. Data Sources:- The Data Sources consists of the Source Data that is
acquired and provided to the Staging and ETL tools for further process.
 2. ETL Tools:- ETL tools are very important because they help in
combining logic, raw data, and schema into one and load the
information into the Data Warehouse or Data Marts.
 Sometimes, ETL loads the data into the Data Marts and then
information is stored in Data Warehouse. This approach is known as
the Bottom-Up approach.
 The approach where ETL loads information to the Data Warehouse
directly is known as the Top-down Approach.
Data Warehouse Approaches
 A data-warehouse is a heterogeneous collection of different data
sources organized under a unified schema. There are 2 approaches
for constructing data-warehouse: Top-down approach and Bottom-up
approach are explained as below.
1. Top-down approach: The needed components are discussed below:
1.) External Sources –
External source is a source from where data is collected irrespective of
the type of data. Data can be structured, semi structured and
unstructured as well.
2.) Stage Area –
Since the data, extracted from the external sources does not follow a
particular format, so there is a need to validate this data to load into
Datawarehouse. For this purpose, it is recommended to use ETL tool.
Data Warehouse Approaches
E (Extract): Data is extracted from the external data source.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after transforming it
into the standard format.
3.) Data-warehouse – After cleansing of data, it is stored in the
Datawarehouse as central repository. It actually stores the meta data
and the actual data gets stored in the data marts. Note that
Datawarehouse stores the data in its purest form in this top-down
approach.
4.) Data Marts – Data mart is also a part of storage component. It
stores the information of a particular function of an organization which
is handled by single authority. We can also say that data mart contains
subset of the data stored in Datawarehouse.
Data Warehouse Approaches
5.) Data Mining – The practice of analyzing the big data present in the
data warehouse is data mining. It is used to find the hidden patterns
present in the database or data warehouse with the help of data
mining algorithms.
Advantages of Top-Down Approach –
1. Since the data marts are created from the data warehouse, it provides a
consistent dimensional view of the data marts.
2. Also, this model is considered the strongest model for business
changes. That’s why big organizations prefer to follow this approach.
3. Creating a data mart from the data warehouse is easy.
 Disadvantages of Top-Down Approach – The cost and time taken in
designing and maintaining it are very high.
Data Warehouse Approaches
2. Bottom-up approach:
1. First, the data is extracted from external sources (same as happens in
the top-down approach).
2. Then, the data goes through the staging area (as explained above) and is
loaded into data marts instead of the data warehouse. The data marts are
created first and provide reporting capability. Each addresses a single
business area.
3. These data marts are then integrated into the data warehouse.
 This approach is given by Kimball as – data marts are created first
and provide a thin view for analysis, and the data warehouse is created
after the complete data marts have been created.
Data Warehouse Approaches
 Advantages of Bottom-Up Approach –
1. As the data marts are created first, the reports are quickly
generated.
2. We can accommodate a larger number of data marts here, and in this
way the data warehouse can be extended.
3. Also, the cost and time taken in designing this model are comparatively
low.
 Disadvantage of Bottom-Up Approach –
1. This model is not as strong as the top-down approach, as the
dimensional view of the data marts is not as consistent as it is in the
top-down approach.
Difference Between Top-down Approach and
Bottom-up Approach
S.no. | Top-Down Approach | Bottom-Up Approach
1 | Provides a definite and consistent view of information, as information from the data warehouse is used to create the data marts. | Reports can be generated easily, as data marts are created first and it is relatively easy to interact with them.
2 | Strong model and hence preferred by big companies. | Not as strong, but the data warehouse can be extended and more data marts can be created.
3 | Time, cost and maintenance are high. | Time, cost and maintenance are low.
Design of Data Warehouse
 An important point about Data Warehouse is its efficiency. To create
an efficient Data Warehouse, we construct a framework known as the
Business Analysis Framework.
 There are four types of views in regard to the design of a DW.
 1. Top-Down View: This View allows only specific information
needed for a data warehouse to be selected.
 2. Data Source View: This view shows all the information from the
source of data to how it is transformed and stored.
 3. Data Warehouse View: This view shows the information present
in the Data warehouse through fact tables and dimension tables.
 4. Business Query View: This is a view that shows the data from the
user’s point of view.
Advantages of Data Warehouse
 1. Integrating data from multiple sources.
 2. Performing new types of analyses.
 3. Reducing cost to access historical data.
 Other benefits may include:
 1. Standardizing data across the organization, a "single version of the
truth“.
 2. Improving turnaround time for analysis and reporting.
 3. Sharing data and allowing others to easily access data.
 4. Removing informational processing load from transaction-
oriented databases.
Disadvantages of Data Warehouse
 The major disadvantage is that a data warehouse can be costly to
maintain and that becomes a problem if the warehouse is
underutilized.
 It seems that managers have unrealistic expectations about what they
will get from having a data warehouse.
 There are considerable disadvantages involved in moving data from
multiple, often highly disparate, data sources to one data warehouse
that translate into long implementation time, high cost, lack of
flexibility, dated information, and limited capabilities.
 The data warehouse may seem easy, but actually, it is too complex
for the average users.
 Not an ideal option for unstructured data.
Metadata
 The name metadata suggests some high-level technological
concept.
 However, it is quite simple: metadata is data about data, and it
defines the data warehouse.
 It is used for building, maintaining and managing the data
warehouse.
 In the Data Warehouse Architecture, meta-data plays an important
role as it specifies the source, usage, values, and features of data
warehouse data.
 It also defines how data can be changed and processed.
 It is closely connected to the data warehouse.
Metadata
For example, a line in a sales database may contain: 4030 KJ732 299.90
 This is meaningless data until we consult the metadata, which tells us it
was:
• Model number: 4030
• Sales Agent ID: KJ732
• Total sales amount of $299.90
 Therefore, Meta Data are essential ingredients in the transformation
of data into knowledge.
Metadata
 Metadata helps to answer the following questions
• What tables, attributes, and keys does the Data Warehouse contain?
• Where did the data come from?
• How many times do data get reloaded?
• What transformations were applied with cleansing?
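A minimal sketch of the kind of metadata record that answers such questions; every value below is invented for illustration:

```python
# Hypothetical metadata entry for one warehouse table.
sales_fact_metadata = {
    "table": "sales_fact",
    "attributes": ["date_key", "product_key", "store_key", "amount"],
    "primary_key": ["date_key", "product_key", "store_key"],
    "source": "orders table in the operational OLTP system",
    "reload_frequency": "daily",
    "transformations": [
        "map country variants (U.S.A, United States, America) to 'USA'",
        "fill missing amounts with 0.0",
    ],
}

# Answering "where did the data come from?" from the metadata.
print(sales_fact_metadata["source"])
```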
Data Warehouse Models
S.no. | Database System | Data Warehouse
1 | It supports operational processes. | It supports analysis and performance reporting.
2 | An operational database is one where data changes frequently. | A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
3 | It focuses on current transactional data. | It focuses on historical data.
4 | Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data Warehouse Models
S.no. | Database System | Data Warehouse
5 | ER based. | Star/Snowflake based.
6 | Application oriented. | Subject oriented.
7 | It is slow for analytics queries. | It is fast for analysis queries.
8 | Relational databases are created for on-line transaction processing (OLTP). | Data warehouses are designed for on-line analytical processing (OLAP).
9 | Data stored in the database is up to date. | Current and historical data is stored in the data warehouse; it may not be up to date.
Dimensional Modeling
 DIMENSIONAL MODELING (DM) is a data structure technique
optimized for data storage in a Data warehouse.
 The purpose of dimensional model is to optimize the database for fast
retrieval of data.
 The concept of Dimensional Modelling was developed by Ralph
Kimball and consists of "fact" and "dimension" tables.
 A Dimensional model is designed to read, summarize, analyze
numeric information like values, balances, counts, weights, etc. in a
data warehouse.
 In contrast, relational models are optimized for addition, updating and
deletion of data in a real-time online transaction system.
Elements of Dimensional Data Model
 Fact:- Facts are the measurements/metrics or facts from your
business process. For a Sales business process, a measurement would
be quarterly sales number.
 Dimension:- Dimension provides the context surrounding a business
process event. In simple terms, they give who, what, where of a fact.
In the Sales business process, for the fact quarterly sales number,
dimensions would be
• Who – Customer Names
• Where – Location
• What – Product Name
 In other words, a dimension is a window to view information in the
facts.
Elements of Dimensional Data Model
 Attributes
 The Attributes are the various characteristics of the dimension.
 In the Location dimension, the attributes can be
• State
• Country
• Zipcode etc.
 Attributes are used to search, filter, or classify facts. Dimension
tables contain attributes.
Elements of Dimensional Data Model
 Fact Table
 A fact table is a primary table in a dimensional model.
 A Fact Table contains
1. Measurements/facts
2. Foreign key to dimension table
Elements of Dimensional Data Model
 Dimension table
• A dimension table contains dimensions of a fact.
• They are joined to fact table via a foreign key.
• Dimension tables are de-normalized tables.
• The Dimension Attributes are the various columns in a dimension
table
• Dimensions offer descriptive characteristics of the facts with the
help of their attributes.
• There is no set limit on the number of dimensions.
• A dimension can also contain one or more hierarchical
relationships.
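A minimal pandas sketch of a fact table joined to its dimension tables through foreign keys; all table names, columns and values are invented for illustration:

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per member.
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Electronics", "Electronics"],
})
dim_customer = pd.DataFrame({
    "customer_key": [10, 11],
    "customer_name": ["Asha", "Vikram"],
    "city": ["Indore", "Mumbai"],
})

# Fact table: measurements plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "customer_key": [10, 11, 11],
    "quantity": [1, 2, 1],
    "amount": [55000.0, 40000.0, 52000.0],
})

# A star-join query: total sales amount by product name.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_customer, on="customer_key")
          .groupby("product_name")["amount"].sum())
print(report)
```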
Multidimensional schema
 Multidimensional Schema is especially designed to model data
warehouse systems.
 The schemas are designed to address the unique needs of very large
databases designed for the analytical purpose (OLAP).
 Types of Data Warehouse Schema:
 Following are 3 chief types of multidimensional schemas each
having its unique advantages.
• Star Schema
• Snowflake Schema
• Galaxy Schema
Star Schema
 In the STAR Schema, the center of the star can have one fact table
and a number of associated dimension tables.
 The star schema is the fundamental schema among the data mart schemas,
and it is the simplest.
 This schema is widely used to develop or build a data warehouse and
dimensional data marts.
 It is known as star schema as its structure resembles a star.
 The star schema is the simplest type of Data Warehouse schema.
 It is also known as Star Join Schema and is optimized for querying
large data sets.
Star Schema
 In a star schema, the fact table will be at the center and is connected
to the dimension tables.
 The tables are completely in a denormalized structure.
 SQL query performance is good, as fewer joins are
involved.
 Data redundancy is high and occupies more disk space.
 It is said to be star as its physical model resembles to the star shape
having a fact table at its center and the dimension tables at its
peripheral representing the star’s points.
 Usually the fact tables in a star schema are in third normal
form(3NF) whereas dimensional tables are de-normalized.
Star Schema
Characteristics of Star Schema:
 Every dimension in a star schema is represented with only a single dimension table.
 The dimension table should contain the set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized.
 For instance, in the figure above, Country_ID does not have a Country lookup table as
an OLTP design would have.
 The schema is widely supported by BI tools.
Star Schema
 Advantages of Star Schema –
1. Simpler Queries:
The join logic of a star schema is quite simple compared with the join
logic needed to fetch data from a highly normalized transactional
schema.
2. Simplified Business Reporting Logic:
Compared with a highly normalized transactional schema, the
star schema simplifies common business reporting logic, such as
as-of reporting and period-over-period reporting.
3. Feeding Cubes:
Star schema is widely used by all OLAP systems to design OLAP
cubes efficiently. In fact, major OLAP systems deliver a ROLAP
mode of operation which can use a star schema as a source without
designing a cube structure.
Star Schema
 Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly
de-normalized state.
2. Not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas don’t support many-to-many relationships within
business entities – at least not frequently.
Snowflake Schema
 SNOWFLAKE SCHEMA is a logical arrangement of tables in a
multidimensional database such that the ER diagram resembles a
snowflake shape.
 A Snowflake Schema is an extension of a Star Schema, and it adds
additional dimensions.
 The dimension tables are normalized which splits data into
additional tables.
 The snowflake schema is a variant of the star schema.
 The snowflake effect affects only the dimension tables and does not
affect the fact tables.
Snowflake Schema
 A snowflake schema is an extension of star schema where the
dimension tables are connected to one or more dimensions.
 The tables are partially denormalized in structure.
 The performance of SQL queries is a bit lower when compared to the star
schema, as more joins are involved.
 Data redundancy is low, and it occupies less disk space when compared
to the star schema.
 The snowflake structure materializes when the dimensions of a star
schema are detailed and highly structured, having several levels of
relationships, and the child tables have multiple parent tables.
Snowflake Schema
 Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema is that it uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Due to multiple tables, query performance is reduced.
• The primary challenge that you will face while using the snowflake
schema is that you need to perform more maintenance effort
because of the larger number of lookup tables.
Snowflake Schema
• For example, the item dimension table in star schema is normalized
and split into two dimension tables, namely item and supplier table.
• Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier-key.
• The supplier key is linked to the supplier dimension table.
• The supplier dimension table contains the attributes supplier_key
and supplier_type.
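A minimal pandas sketch of this normalization of the item dimension into item and supplier tables; the data values are invented for illustration:

```python
import pandas as pd

# Star-schema item dimension (denormalized: supplier_type repeated per item).
item_dim = pd.DataFrame({
    "item_key":      [1, 2, 3],
    "item_name":     ["Pen", "Notebook", "Stapler"],
    "type":          ["stationery", "stationery", "office"],
    "brand":         ["Cello", "Classmate", "Kangaro"],
    "supplier_key":  [100, 100, 200],
    "supplier_type": ["wholesale", "wholesale", "retail"],
})

# Snowflake: split out a supplier dimension and keep only the foreign key.
supplier_dim = (item_dim[["supplier_key", "supplier_type"]]
                .drop_duplicates()
                .reset_index(drop=True))
item_dim_snowflaked = item_dim.drop(columns=["supplier_type"])

print(supplier_dim)
print(item_dim_snowflaked)
```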
Snowflake Schema
 Advantages: There are two main advantages of snowflake schema
given below:
• It provides structured data which reduces the problem of data
integrity.
• It uses small disk space because data are highly structured.
Snowflake Schema
 Disadvantages:
• Snowflaking reduces space consumed by dimension tables, but
compared with the entire data warehouse the saving is usually
insignificant.
• Avoid snowflaking or normalization of a dimension table, unless
required and appropriate.
• Do not snowflake hierarchies of one dimension table into separate
tables. Hierarchies should belong to the dimension table only and
should never be snowflaked.
• Multiple hierarchies can belong to the same dimension if the dimension
has been designed at the lowest possible level of detail.
Fact Constellation Schema
 A Fact constellation means two or more fact tables sharing one or
more dimensions. It is also called Galaxy schema.
 A fact constellation schema describes a logical structure of a data
warehouse or data mart. It can be designed with a collection of
de-normalized fact tables and shared, conformed dimension
tables.
 The schema is viewed as a collection of stars hence the name
Galaxy Schema.
 The fact constellation schema is also a type of multidimensional
model.
 In Galaxy schema shares dimensions are called Conformed
Dimensions.
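The idea of several stars sharing conformed dimensions can be sketched as two fact tables referencing the same dimension table. This is only an illustrative outline; the names sales_fact, shipping_fact and time_dim are invented and not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One conformed dimension shared by both fact tables.
cur.execute("""
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY,
        day      TEXT,
        month    TEXT,
        year     INTEGER
    )""")

# Two fact tables (two "stars") sharing the same time dimension:
# this shared use is what makes the schema a fact constellation / galaxy.
cur.execute("""
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER,
        units_sold   INTEGER,
        sales_amount REAL
    )""")
cur.execute("""
    CREATE TABLE shipping_fact (
        time_key      INTEGER REFERENCES time_dim(time_key),
        item_key      INTEGER,
        units_shipped INTEGER,
        shipping_cost REAL
    )""")
conn.close()
```

Because both fact tables use the identical time_dim definition, results from the two stars can be compared along the shared (conformed) time dimension.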
Fact Constellation Schema
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
Fact Constellation Schema
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Characteristics of Galaxy Schema:
• The dimensions in this schema are split into separate dimension
tables based on the levels of a hierarchy.
• For example, if geography has four levels of hierarchy, such as region,
country, state, and city, then the Galaxy schema should have four
dimension tables.
• Moreover, it is possible to build this type of schema by splitting one
star schema into multiple star schemas.
• The dimension tables are large in this schema, and they need to be
built based on the levels of the hierarchy.
• This schema is helpful for aggregating fact tables for a better
understanding.
Fact Table vs Dimension Table
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.NO FACT TABLE DIMENSION TABLE
1 A fact table contains the
measures defined on the attributes
of the dimension tables.
A dimension table contains the
attributes over which the fact table
calculates its metrics.
2 Located at the center of a
star or snowflake schema
and surrounded by
dimension tables.
Connected to the fact table and
located at the edges of the star or
snowflake schema.
3
Fact tables could contain
information like sales
against a set of dimensions
like Product and Date.
Every dimension table contains
attributes which describe the
details of the dimension, e.g., a
Product dimension can contain
Product ID, Product Category, etc.
Fact Table vs Dimension Table
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.NO FACT TABLE DIMENSION TABLE
4 The fact table holds foreign keys
that reference the primary keys
of the dimension tables.
A dimension table has a primary
key column that uniquely identifies
each dimension record.
5
Does not contain hierarchies.
Contains hierarchies. For example,
Location could contain country,
pin code, state, city, etc.
6 A fact table has fewer
attributes than a dimension table.
A dimension table has more
attributes than a fact table.
7 The number of fact tables in a
schema is smaller than the number
of dimension tables.
The number of dimension tables in
a schema is larger than the number
of fact tables.
Type of Facts
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
• Additive – As the name implies, additive measures are measures
which can be added across all dimensions, e.g., sales amount.
• Non-additive – Unlike additive measures, non-additive measures
cannot be added across any dimension, e.g., ratios or percentages.
• Semi-additive – Semi-additive measures can be added across some
dimensions but not others, e.g., an account balance can be summed
across accounts but not across time.
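A small, self-contained Python sketch of the semi-additive case may help: daily account balances can be summed across the account dimension for a single day, but along the time dimension they are typically averaged (or the closing value is taken) rather than summed. The rows below are invented.

```python
# Daily closing balances: (date, account_id, balance)
rows = [
    ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 40.0),
]

# Additive across the account dimension: total balance held on one day.
total_jan_01 = sum(b for d, a, b in rows if d == "2024-01-01")       # 150.0

# NOT additive across the time dimension: summing a balance over days
# is meaningless, so a semi-additive measure is averaged (or the last
# value is taken) along time instead.
days = {d for d, a, b in rows}
avg_balance_A = sum(b for d, a, b in rows if a == "A") / len(days)   # 110.0

print(total_jan_01, avg_balance_A)
```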
Designing fact table steps
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Here is an overview of the four steps for designing a fact table, as
described by Kimball:
1. Choose the business process to model – the first step is to decide
which business process to model by gathering and understanding the
business needs and the available data.
2. Declare the grain – declaring the grain means describing exactly
what one fact table record represents.
3. Choose the dimensions – once the grain of the fact table is stated
clearly, determine the dimensions for the fact table.
4. Identify the facts – carefully identify which facts will appear in the
fact table.
Star Vs Snowflake Schema: Key Differences
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.no Star Schema Snow Flake Schema
1 Hierarchies for the dimensions
are stored in the dimension
table itself.
Hierarchies are divided into
separate tables.
2 It contains a fact table
surrounded by dimension tables.
One fact table surrounded by
dimension tables, which are in
turn surrounded by other
dimension tables.
3 In a star schema, a single join
creates the relationship between
the fact table and any dimension
table.
A snowflake schema requires
many joins to fetch the data.
4 Simple DB design. Very complex DB design.
Star Vs Snowflake Schema: Key Differences
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.no Star Schema Snow Flake Schema
5 Denormalized data structure;
queries also run faster.
Normalized data structure.
6 High level of data redundancy. Very low level of data redundancy.
7 A single dimension table contains
aggregated data.
Data is split into different
dimension tables.
8 Cube processing is faster. Cube processing may be slower
because of the complex joins.
9 Offers higher-performing queries
using star-join query
optimization; tables may be
connected to multiple
dimensions.
The snowflake schema is
represented by a centralized fact
table which is unlikely to be
connected to multiple dimensions
directly.
Data Warehouse Models
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 From the perspective of data warehouse architecture, we have the
following data warehouse models −
• Enterprise warehouse:- collects all of the information about
subjects spanning the entire organization.
• Data mart:- a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart.
• Virtual warehouse:-
• It is a virtual view over the operational databases.
• A virtual warehouse has a logical description of all the databases
and their structure.
• This method creates a single logical database from all the data
sources.
Data Lake
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data.
 It is a place to store every type of data in its native format, with no
fixed limits on account size or file size.
 It offers high data quantity to increase analytic performance and
native integration.
 A Data Lake is like a large container, very similar to real lakes and
rivers.
 Just as a lake has multiple tributaries coming in, a data lake has
structured data, unstructured data, machine-to-machine data, and logs
flowing through in real time.
Data Lake
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
Data Lake vs Data Warehouse: Key Differences
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.no Data Lakes Data Warehouse
1
Data lakes store everything.
Data Warehouse focuses only
on Business Processes.
2 Data are mainly unprocessed Highly processed data.
3 It can be Unstructured, semi-
structured and structured.
It is mostly in tabular form &
structure.
4 A data lake is mostly used by
data scientists.
Business professionals widely
use the data warehouse.
5
Can use open-source tools like
Hadoop/MapReduce.
Mostly commercial tools like
Google BigQuery, IBM, Amazon,
Oracle.
Big Data vs Data Warehouse
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
S.NO. BIG DATA DATA WAREHOUSE
1 Big data is a technology to store
and manage large amount of data.
Data warehouse is an architecture
used to organize the data.
2 Big data can handle structured,
non-structured, and semi-structured
data.
A data warehouse only handles
structured data (relational or
non-relational).
3.
Big data does its processing using a
distributed file system.
A data warehouse doesn't use a
distributed file system for
processing.
4.
Big data doesn't rely on SQL
queries to fetch data from the
database.
In a data warehouse, we use SQL
queries to fetch data from relational
databases.
Data Warehousing – Partitioning Strategy
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partitioning is done to enhance performance and facilitate easy
management of data.
 Partitioning also helps in balancing the various requirements of the
system.
It optimizes hardware performance and simplifies the
management of the data warehouse by partitioning each fact table into
multiple separate partitions.
 Why is it Necessary to Partition?
 Partitioning is important for the following reasons −
1. For easy management,
2. To assist backup/recovery,
3. To enhance performance.
Data Warehousing – Partitioning Strategy
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 For Easy Management
 The fact table in a data warehouse can grow up to hundreds of
gigabytes in size.
 A fact table of this size is very hard to manage as a single entity;
therefore, it needs partitioning.
 To Assist Backup/Recovery
 If we do not partition the fact table, then we have to load the
complete fact table with all the data.
 Partitioning allows us to load only as much data as is required on a
regular basis.
 It reduces the time to load and also enhances the performance of the
system.
Data Warehousing – Partitioning Strategy
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Note − To cut down on the backup size, all partitions other than the
current partition can be marked as read-only.
 We can then put these partitions into a state where they cannot be
modified.
 Then they can be backed up. It means only the current partition is to
be backed up.
 To Enhance Performance
 By partitioning the fact table into sets of data, the query procedures
can be enhanced.
 Query performance is enhanced because now the query scans only
those partitions that are relevant.
 It does not have to scan the whole data.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 There are various ways in which a fact table can be partitioned.
 In horizontal partitioning, we have to keep in mind the requirements
for manageability of the data warehouse.
 Partitioning by Time into Equal Segments:
 In this partitioning strategy, the fact table is partitioned on the basis
of time period.
 Here each time period represents a significant retention period within
the business.
 For example, if the user queries for month-to-date data, then it is
appropriate to partition the data into monthly segments.
 We can reuse the partitioned tables by removing the data in them.
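As a rough illustration of partitioning by time into equal segments, the hypothetical Python sketch below routes fact rows into monthly partitions keyed by "YYYY-MM"; the row layout and values are invented.

```python
from collections import defaultdict

# Hypothetical fact rows: (transaction_date, account_id, value).
fact_rows = [
    ("2024-01-15", "A", 120.0),
    ("2024-01-28", "B", 75.0),
    ("2024-02-03", "A", 40.0),
]

# Route each row to a monthly partition, i.e. horizontal partitioning
# by time into equal (one-month) segments.
partitions = defaultdict(list)
for date, account, value in fact_rows:
    month = date[:7]                 # "YYYY-MM" acts as the partition key
    partitions[month].append((date, account, value))

# A month-to-date query now scans only the current month's partition
# instead of the whole fact table.
print(partitions["2024-01"])
```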
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partition by Time into Different-sized Segments
 This kind of partitioning is done where the aged data is accessed
infrequently. It is implemented as a set of small partitions for
relatively current data and a larger partition for inactive data.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Points to Note
• The detailed information remains available online.
• The number of physical tables is kept relatively small, which reduces
the operating cost.
• This technique is suitable where a mix of dipping into recent history
and data mining through the entire history is required.
• This technique is not useful where the partitioning profile changes on
a regular basis, because repartitioning will increase the operating cost
of the data warehouse.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partition on a Different Dimension
 The fact table can also be partitioned on the basis of dimensions
other than time such as product group, region, supplier, or any other
dimension.
 Let's take an example.
 Suppose a market function has been structured into distinct regional
departments, for example on a state-by-state basis.
 If each region wants to query information captured within its own
region, it proves more effective to partition the fact table into
regional partitions.
 This speeds up the queries because they do not need to scan
information that is not relevant.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Points to Note
• The query does not have to scan irrelevant data, which speeds up the
query process.
• This technique is not appropriate where the dimensions are likely to
change in the future. So, it is worth confirming that the dimension
will not change in the future.
• If the dimension changes, then the entire fact table would have to be
repartitioned.
 Note − It is recommended to partition only on the basis of the time
dimension, unless you are certain that the suggested dimension
grouping will not change within the life of the data warehouse.
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partition by Size of Table
 When there is no clear basis for partitioning the fact table on any
dimension, we should partition the fact table on the basis of its size.
 We can set a predetermined size as a critical point. When the table
exceeds the predetermined size, a new table partition is created.
 Points to Note
• This partitioning is complex to manage.
• It requires metadata to identify what data is stored in each partition.
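A bare-bones illustration of size-based partitioning, under the assumption that a partition is simply a Python list and the "critical point" is a row-count threshold; both are invented for this sketch.

```python
MAX_ROWS_PER_PARTITION = 2   # hypothetical "critical point"

partitions = [[]]            # start with a single empty partition

def insert_fact(row):
    # When the current partition exceeds the predetermined size,
    # a new table partition is created and becomes the insert target.
    if len(partitions[-1]) >= MAX_ROWS_PER_PARTITION:
        partitions.append([])
    partitions[-1].append(row)

for r in [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30)]:
    insert_fact(r)

# Metadata (here simply the partition index) is needed to know where
# each row lives, which is the management overhead noted above.
print(len(partitions), partitions)
```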
Partitioning Strategy - Horizontal Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Partitioning Dimensions
 If a dimension contains a large number of entries, then it may be
necessary to partition the dimension. Here we have to check the size of
the dimension.
 Consider a large design that changes over time. If we need to store all
the variations in order to apply comparisons, that dimension may be
very large. This would definitely affect the response time.
 Round Robin Partitions
 In the round-robin technique, when a new partition is needed, the old
one is archived. It uses metadata to allow user access tools to refer to
the correct table partition.
 This technique makes it easy to automate table management facilities
within the data warehouse.
Partitioning Strategy - Vertical Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
Vertical partitioning splits the data vertically. The following image
depicts how vertical partitioning is done.
Partitioning Strategy - Vertical Partitioning
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Vertical partitioning can be performed in the following two ways −
• Normalization
• Row Splitting
 Normalization:- Normalization is the standard relational method of
database organization. In this method, redundant columns are moved
into separate tables so that duplicate rows collapse into a single row,
which reduces space.
 Row Splitting:- Row splitting tends to leave a one-to-one map
between the partitions. The motive of row splitting is to speed up
access to a large table by reducing its size.
 Note − While using vertical partitioning, make sure that there is no
requirement to perform a major join operation between two
partitions.
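The row-splitting form of vertical partitioning can be sketched as follows; this is an invented, in-memory example, and the column names and the hot/cold split are assumptions rather than content from the slides.

```python
# Hypothetical wide fact rows keyed by transaction_id.
wide_rows = {
    1: {"account_id": "A", "value": 120.0, "long_description": "large text blob"},
    2: {"account_id": "B", "value": 75.0,  "long_description": "large text blob"},
}

# Row splitting: keep the frequently used columns in one partition and
# move the rarely used, bulky columns into another, with a one-to-one
# mapping between the two partitions on the same key.
hot_partition  = {k: {"account_id": v["account_id"], "value": v["value"]}
                  for k, v in wide_rows.items()}
cold_partition = {k: {"long_description": v["long_description"]}
                  for k, v in wide_rows.items()}

# Most queries touch only the small, hot partition; the bulky columns
# are joined back by key only when they are actually needed.
txn = 1
print({**hot_partition[txn], **cold_partition[txn]})
```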
Identify Key to Partition
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 It is crucial to choose the right partition key. Choosing the wrong
partition key will lead to reorganizing the fact table.
 Let's have an example. Suppose we want to partition the following
table.
 Account_Txn_Table
 transaction_id
 account_id
 transaction_type
 value
 transaction_date
 region
 branch_name
Identify Key to Partition
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 We can choose to partition on any key. The two possible keys could
be 1) region 2) transaction_date.
 Suppose the business is organized into 30 geographical regions and
each region has a different number of branches. That will give us 30
partitions, which is reasonable. This partitioning is good enough
because our requirements capture has shown that the vast majority of
queries are restricted to the user's own business region.
 If we partition by transaction_date instead of region, then the latest
transactions from every region will be in one partition. Now a user
who wants to look at data within his own region has to query across
multiple partitions.
 Hence it is worth determining the right partitioning key.
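The effect of choosing region versus transaction_date as the partition key can be sketched with a few invented rows from the Account_Txn_Table example; the partition_by helper and the sample data below are illustrative assumptions only.

```python
from collections import defaultdict

# Hypothetical rows from Account_Txn_Table: (transaction_date, region, value).
txns = [
    ("2024-03-01", "north", 10.0),
    ("2024-03-01", "south", 20.0),
    ("2024-03-02", "north", 15.0),
]

def partition_by(rows, key_index):
    # Group rows into partitions keyed by the chosen column.
    parts = defaultdict(list)
    for row in rows:
        parts[row[key_index]].append(row)
    return parts

by_region = partition_by(txns, 1)   # one partition per region
by_date   = partition_by(txns, 0)   # one partition per transaction_date

# A "my region, all dates" query touches a single region partition...
north_scan = by_region["north"]
# ...but under date partitioning the same query must scan every partition.
north_scan_by_date = [r for part in by_date.values() for r in part if r[1] == "north"]

print(len(north_scan), len(north_scan_by_date))
```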
Summary
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 A data warehouse is a subject-oriented, integrated, time-variant,
and non-volatile collection of data that is used in organizational
decision making.
 A data mart is defined as an implementation of a data warehouse with
a smaller and more tightly restricted scope of data and data warehouse
functions, serving a single department or part of an organization.
 The mechanism of extracting information from source systems and
bringing it into the data warehouse is commonly called ETL, which
stands for Extraction, Transformation and Loading.
 Metadata is data about data. Metadata does not just give a
description of an entity; it also gives other details explaining the
syntax and semantics of the data elements.
Summary
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 A virtual warehouse has a logical description of all the databases and
their structure.
 In the star schema, the center of the star can have one fact table
and a number of associated dimension tables.
 A snowflake schema is an extension of a star schema that adds
additional dimension tables. It has normalized dimensions.
 A fact constellation means two or more fact tables sharing one or
more dimensions. It is also called a Galaxy schema.
 Partitioning is done to enhance performance and facilitate easy
management of data.
 A partitioning strategy helps with easy management, assists
backup/recovery, and enhances performance.
Unit – 1
Any - 5 Assignment Questions Marks:-20
Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
 Q.1 What is a data warehouse? Explain the data warehouse
architecture with a diagram.
 Q.2 Discuss the Star, Snowflake and Galaxy schemas for
multidimensional databases.
 Q.3 Give reasons why it is necessary to separate the data warehouse
from the operational database.
 Q.4 What is the need for a data warehouse? Explain the characteristics
of a data warehouse.
 Q.5 What is a Data Mart? What are the types of Data Marts?
 Q.6 Explain the ETL process in a data warehouse.
 Q.7 Explain:
 1) Metadata 2) Fact Table 3) Vertical Partitioning
Questions
Thank You
Great God, Medi-Caps, All the attendees
Mr. Sagar Pandya
sagar.pandya@medicaps.ac.in
www.sagarpandya.tk
LinkedIn: /in/seapandya
Twitter: @seapandya
Facebook: /seapandya

More Related Content

What's hot

How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...Christopher Bradley
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringHadi Fadlallah
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management DATAVERSITY
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDATAVERSITY
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing conceptspcherukumalla
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
The what, why, and how of master data management
The what, why, and how of master data managementThe what, why, and how of master data management
The what, why, and how of master data managementMohammad Yousri
 
Data Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical ApproachesData Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical ApproachesDATAVERSITY
 
Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsBoris Otto
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl conceptsjeshocarme
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientistVijayMohan Vasu
 
Making the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics PlatformsMaking the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics PlatformsPrecisely
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data WarehousingEyad Manna
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science IntroductionGang Tao
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data WarehouseShanthi Mukkavilli
 

What's hot (20)

How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data Warehouse
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
Ebook - The Guide to Master Data Management
Ebook - The Guide to Master Data Management Ebook - The Guide to Master Data Management
Ebook - The Guide to Master Data Management
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
The what, why, and how of master data management
The what, why, and how of master data managementThe what, why, and how of master data management
The what, why, and how of master data management
 
Data Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical ApproachesData Modeling Best Practices - Business & Technical Approaches
Data Modeling Best Practices - Business & Technical Approaches
 
Data analytics
Data analyticsData analytics
Data analytics
 
Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management Systems
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
Making the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics PlatformsMaking the Case for Legacy Data in Modern Data Analytics Platforms
Making the Case for Legacy Data in Modern Data Analytics Platforms
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Big Data analytics
Big Data analyticsBig Data analytics
Big Data analytics
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 

Similar to Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema, Partitioning

BVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxBVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxDrNilimaThakur
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business IntelligenceSukirti Garg
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousingwork
 
Dataware housing
Dataware housingDataware housing
Dataware housingwork
 
Data warehouse
Data warehouseData warehouse
Data warehouseMR Z
 
Unit i big data introduction
Unit  i big data introductionUnit  i big data introduction
Unit i big data introductionSujaMaryD
 
dw_concepts_2_day_course.ppt
dw_concepts_2_day_course.pptdw_concepts_2_day_course.ppt
dw_concepts_2_day_course.pptDougSchoemaker
 
The Data Warehouse Essays
The Data Warehouse EssaysThe Data Warehouse Essays
The Data Warehouse EssaysMelissa Moore
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business IntelligenceSukirti Garg
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data scienceVipul Kalamkar
 
Data warehousing interview questions
Data warehousing interview questionsData warehousing interview questions
Data warehousing interview questionsSatyam Jaiswal
 
Data miningvs datawarehouse
Data miningvs datawarehouseData miningvs datawarehouse
Data miningvs datawarehouseSuman Astani
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docxAbshar Fatima
 
Data Warehousing Datamining Concepts
Data Warehousing Datamining ConceptsData Warehousing Datamining Concepts
Data Warehousing Datamining Conceptsraulmisir
 

Similar to Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema, Partitioning (20)

IT Ready - DW: 1st Day
IT Ready - DW: 1st Day IT Ready - DW: 1st Day
IT Ready - DW: 1st Day
 
Big data vs datawarehousing
Big data vs datawarehousingBig data vs datawarehousing
Big data vs datawarehousing
 
Big data vs datawarehousing
Big data vs datawarehousingBig data vs datawarehousing
Big data vs datawarehousing
 
BVRM 402 IMS UNIT V
BVRM 402 IMS UNIT VBVRM 402 IMS UNIT V
BVRM 402 IMS UNIT V
 
BVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxBVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptx
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
 
Dataware housing
Dataware housingDataware housing
Dataware housing
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Unit i big data introduction
Unit  i big data introductionUnit  i big data introduction
Unit i big data introduction
 
dw_concepts_2_day_course.ppt
dw_concepts_2_day_course.pptdw_concepts_2_day_course.ppt
dw_concepts_2_day_course.ppt
 
The Data Warehouse Essays
The Data Warehouse EssaysThe Data Warehouse Essays
The Data Warehouse Essays
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
 
Data warehousing interview questions
Data warehousing interview questionsData warehousing interview questions
Data warehousing interview questions
 
Abstract
AbstractAbstract
Abstract
 
Data miningvs datawarehouse
Data miningvs datawarehouseData miningvs datawarehouse
Data miningvs datawarehouse
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docx
 
Data Warehousing Datamining Concepts
Data Warehousing Datamining ConceptsData Warehousing Datamining Concepts
Data Warehousing Datamining Concepts
 

More from Medicaps University (14)

data mining and warehousing computer science
data mining and warehousing computer sciencedata mining and warehousing computer science
data mining and warehousing computer science
 
Unit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptxUnit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptx
 
Unit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptxUnit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptx
 
UNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptxUNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptx
 
UNIT-2.pptx
UNIT-2.pptxUNIT-2.pptx
UNIT-2.pptx
 
UNIT-1 CSA.pptx
UNIT-1 CSA.pptxUNIT-1 CSA.pptx
UNIT-1 CSA.pptx
 
Scheduling
SchedulingScheduling
Scheduling
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 
Clock synchronization
Clock synchronizationClock synchronization
Clock synchronization
 
Distributed Objects and Remote Invocation
Distributed Objects and Remote InvocationDistributed Objects and Remote Invocation
Distributed Objects and Remote Invocation
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
Clustering - K-Means, DBSCAN
Clustering - K-Means, DBSCANClustering - K-Means, DBSCAN
Clustering - K-Means, DBSCAN
 
Association and Classification Algorithm
Association and Classification AlgorithmAssociation and Classification Algorithm
Association and Classification Algorithm
 
Data Mining
Data MiningData Mining
Data Mining
 

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 

Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema, Partitioning

  • 1. MEDI-CAPS UNIVERSITY Faculty of Engineering Mr. Sagar Pandya Information Technology Department sagar.pandya@medicaps.ac.in
  • 2. Data Mining and Warehousing Mr. Sagar Pandya Information Technology Department sagar.pandya@medicaps.ac.in Course Code Course Name Hours Per Week Total Credits L T P IT3ED02 Data Mining and Warehousing 3 0 0 3
  • 3. IT3ED02 Data Mining and Warehousing 3-0-0 Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Unit 1. Introduction  Unit 2. Data Mining  Unit 3. Association and Classification  Unit 4. Clustering  Unit 5. Business Analysis
  • 4. Reference Books Text Books  Han, Kamber and Pi, Data Mining Concepts & Techniques, Morgan Kaufmann, India, 2012.  Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press.  Z. Markov, Daniel T. Larose Data Mining the Web, Jhon wiley & son, USA. Reference Books  Sam Anahory and Dennis Murray, Data Warehousing in the Real World, Pearson Education Asia.  W. H. Inmon, Building the Data Warehouse, 4th Ed Wiley India. and many others Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 5. Unit-1 Introduction  Data warehousing Components –Building a Data warehouse,  Need for data warehousing,  Basic elements of data warehousing,  Data Mart,  Data Extraction, Clean-up, and Transformation Tools –Metadata,  Star, Snow flake and Galaxy Schemas for Multidimensional databases,  Fact and dimension data,  Partitioning Strategy-Horizontal and Vertical Partitioning. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 6. What is Data? Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Data is collection of unprocessed items that may consists of text, numbers, images and video. Today, data can be represented in various forms like sound, images and video.  Structured: numbers, text etc.  Unstructured: images, video etc.
  • 7. What is Information?  Meaningful data is called information.  Information refers to the data that have been processed in such a way that the knowledge of the person who uses the data is increased.  Example:- 1A$ - Data (No meaning) 1$ - Information (Currency)  For the decision to be meaningful, the processed data must qualify for the following characteristics − • Timely − Information should be available when required. • Accuracy − Information should be accurate. • Completeness − Information should be complete. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 8. What is Metadata?  Metadata describes other data.  Data about data,  For example - an image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data.  A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.  1) Operational Metadata  2) Extraction and Transformation Metadata  3) End User Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 9. What is Database and DBMS?  Database is a collection of inter-related data which helps in efficient retrieval, insertion and deletion of data from database and organizes the data in the form of tables.  The software which is used to manage database is called Database Management System (DBMS).  A database management system stores data in such a way that it becomes easier to retrieve, manipulate, and produce information.  For Example, MySQL, Oracle etc. are popular commercial DBMS used in different applications. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 10. Operational vs. Informational Systems  Operational systems, as their name implies, are the systems that help the every day operation of the enterprise.  These are the backbone systems of any enterprise, and include order entry, inventory, manufacturing, payroll and accounting.  Due to their importance to the organization, operational systems were almost always the first parts of the enterprise to be computerized. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 11. Operational vs. Informational Systems  Informational systems deal with analyzing data and making decisions, often major, about how the enterprise will operate now, and in the future.  Not only do informational systems have a different focus from operational ones, they often have a different scope.  Where operational data needs are normally focused upon a single area, informational data needs often span a number of different areas and need large amounts of related operational data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 12. Data Warehouse  The term "Data Warehouse" was first coined by Bill Inmon in 1990. He was considered as a father of data warehouse.  According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data.  According to Ralph Kimball, Data Warehouse is a transaction data specifically structured for query and analysis.  A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 13. Data Warehouse  This data helps analysts to take informed decisions in an organization.  A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.  It is not used for daily operations and transaction processing but used for making decisions.  This data helps analysts to take informed decisions in an organization. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 14. Data Warehouse  Data is a collection of raw material in unorganized format. Now we have to convert that data into Information format. To make decision, we need to collect the data, using that data we get some information and finally we take decision.  Example:- In an organization, we have many departments like Sales dept, Product dept, Hr department and many other. Before releasing any product to the market, CEO collects the data form the Sales department and product department to take some decisions on profits & losses. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 15. Data Warehouse  In an Organisation, there are several department available and each individual department perform different kind of transactions, all these transactions are saved in Operational data store (ODS).  The main characteristics of ODS is data is volatile and it doesn’t maintain any history data. So what is volatile ? Data in volatile means, the data changes in regular interval of time.  Example :- Big Bazaar, CEO needs to take decision about a particular product. So he needs 3 to 5 years of data. But in ODS, it doesn’t maintain any history data. So, every organisation should maintain history data to take decisions based on product sales. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 16. Data Warehouse  Data warehousing is the process of constructing and using a data warehouse.  A data warehouse is a database, which is kept separate from the organization's operational database.  A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.  It possesses consolidated historical data, which helps the organization to analyze its business.  There is no frequent updating done in a data warehouse. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 17. Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 18. What can a Data Warehouse do & can’t do? What can a Data Warehouse do?  Get Answer Faster  Make Decision Faster  Optimize Performance  Reduce Risk and Cost What can a Data Warehouse not do?  Can’t create data itself  Cleaning of data is required Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 19. Need for Data Warehouse 1. Improving Integration:  An organization registers data in different systems, which support the various business processes.  In order to create an overall picture of business operations, customers and suppliers – thus creating a single version of the truth – the data must come together in one place and made compatible.  Both external (from the environment) and internal data (from ERP and financial systems) should merge into the data warehouse and then be grouped. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 20. Need for Data Warehouse 2. Speeding up response times  The source systems are fully optimized in order to process many small transactions, such as orders, in a short time.  Creating information about the performance of the organization only requires a few large ‘transactions’ during which large amounts of data are being gathered and aggregated.  The structure of a data warehouse is specifically designed to quickly analyze such large amounts of data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 21. Need for Data Warehouse 3. Faster and more flexible reporting:  The structure of both data warehouses and data marts enables end users to report in a flexible manner and to quickly perform interactive analysis on the basis of various predefined angles (dimensions).  They may, for example, with a single mouse click jump from year level – to quarter – to month level and quickly switch between the customer dimension and the product dimension whereby the indicator remains fixed. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 22. Need for Data Warehouse  In most organization, data about specific parts of businesses is there which contains lots and lots of data, somewhere, in some form.  Data is available but not information – and not the right information at the right time.  Bring together information from multiple resources as to provide a consistent database source for decision support queries.  To help workers in their everyday business activity and improve their productivity.  To help knowledge workers (Executives, Managers, Analysts) make faster and better decisions – decision support systems. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 23. Data Warehouse Features Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 24. Data Warehouse Features Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Subject Orientation:- Subject orientation means that data is organized by subject.  Integration:- Consistency of defining parameters.  Non-Volatility:- It means data storage medium must be stable.  Time-Variance:- It means timeliness of data and access terms.  Data Granularity:- It means that details of data are kept at low level.
  • 25. Data Warehouse Characteristics Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 26. Subject-oriented Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations.  Data warehouse is a subject oriented database, which supports the business need of individual department specific user.  Example : Sales, HR, Accounts, Marketing etc.
  • 27. Subject-oriented Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data warehouse target on the modeling and analysis of data for decision-makers.  Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the global organization's ongoing operations.  This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.
  • 29. Integrated Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In Data Warehouse, integration means the establishment of a common unit of measure for all similar data from the dissimilar database.  The data also needs to be stored in the Datawarehouse in common and universally acceptable manner.  A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online transaction records.  This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure etc. have to be ensured.
  • 31. Integrated Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  There are three different application labeled A, B and C.  Information stored in these applications are Gender, Date, and Balance. However, each application's data is stored different way. • In Application A gender field store logical values like M or F • In Application B gender field is a numerical value, • In Application C application, gender field stored in the form of a character value. • Same is the case with Date and balance.  However, after transformation and cleaning process all this data is stored in common format in the Data Warehouse.
  • 32. Time-Variant Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Data Warehouse is a time variant data base, which supports the business management in analyzing the business and comparing the business with different time periods like Year, Quarter, Month, Week and Date.  Historical information is kept in a data warehouse.  For example, one can retrieve files from 3 months, 6 months, 12 months, or even previous data from a data warehouse.  These variations with a transactions system, where often only the most current file is kept.  Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
  • 34. Non- Volatile Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Non-volatile means the previous data is not erased when new data is added to it.  A data warehouse is kept separate from the operational database and therefore frequent changes in operational database is not reflected in the data warehouse.  Typical activities such as deletes, inserts, and changes that are performed in an operational application environment are completely nonexistent in a DW environment.  Only two types of data operations performed in the Data Warehousing are 1. Data loading 2. Data access
  • 35. Non- Volatile Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 36. Data Warehouse VS Operational Database S.no. Data Warehouse Operational Database 1 It involves historical processing of information. It involves day-to-day processing. 2 Data warehouse systems are used by knowledge workers such as executives, managers, and analysts. Operational Database systems are used by clerks, DBAs, or database professionals. 3 It is used to analyze the business. It is used to run the business. 4 It focuses on Information out. It focuses on Data in. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 37. Data Warehouse VS Operational Database S.no. Data Warehouse Operational Database 5 It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. It is based on Entity Relationship Model. 6 It focuses on Information out. It is application oriented. 7 It contains historical data. It contains current data. 8 It provides summarized and consolidated data. It provides primitive and highly detailed data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 38. Data Warehouse VS Operational Database S.no. Data Warehouse Operational Database 9 The number of users is in hundreds. The number of users is in thousands. 10 The number of records accessed is in millions. The number of records accessed is in tens. 11 The database size is from 100GB to 100 TB. The database size is from 100 MB to 100 GB. 12 These are highly flexible. It provides high performance. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 39. How Datawarehouse works?  A Data Warehouse works as a central repository where information arrives from one or more data sources.  Data flows into a data warehouse from the transactional system and other relational databases.  Data may be: 1. Structured 2. Semi-structured 3. Unstructured data Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 40. How Datawarehouse works?  The data is processed, transformed, and ingested so that users can access the processed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets.  A data warehouse merges information coming from different sources into one comprehensive database.  By merging all of this information in one place, an organization can analyze its customers more holistically.  This helps to ensure that it has considered all the information available.  Data warehousing makes data mining possible.  Data mining is looking for patterns in the data that may lead to higher sales and profits. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 41. Benefits of a Data Warehouse 1) Delivers enhanced business intelligence  By having access to information from various sources from a single platform, decision makers will no longer need to rely on limited data or their instinct. 2) Saves times  executives can query the data themselves with little to no IT support, saving more time and money. 3) Enhances data quality and consistency  A data warehouse converts data from multiple sources into a consistent format. Since the data from across the organization is standardized, each department will produce results that are consistent. This will lead to more accurate data, which will become the basis for solid decisions. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 42. Benefits of a Data Warehouse 4) Improves the decision-making process  By transforming data into purposeful information, decision makers can perform more functional, precise, and reliable analysis and create more useful reports with ease. 5) Drives Revenue  “data is the new oil,” referring to the high dollar value of data in today’s world. Creating more standardized and better quality data is the key strength of a data warehouse, and this key strength translates clearly to significant revenue gains. The data warehouse formula works like this: Better business intelligence helps with better decisions, and in turn better decisions create a higher return on investment across any sector of your business. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 43. Benefits of a Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 44. Online Analytical Processing (OLAP) • Involves historical processing of information. • OLAP systems are used by knowledge workers such as executives, managers and analysts. • It focuses on Information out. • Based on Star Schema, Snowflake, Schema and Fact Constellation Schema. • Contains historical data. • Provides summarized and consolidated data. • Provides summarized and multidimensional view of data. • Number or users is in hundreds. • Number of records accessed is in millions. • Database size is from 100 GB to 1 TB Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 45. Online Transactional Processing (OLTP) • Involves day-to-day processing. • OLTP systems are used by clerks, DBAs, or database professionals. • It focuses on Data in. • Based on Entity Relationship Model. • Contains current data. • Provides primitive and highly detailed data. • Provides detailed and flat relational view of data. • Number of users is in thousands. • Number of records accessed is in tens. • Database size is from 100 MB to 1 GB. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 46. Data Mart • A data mart is a simple section of the data warehouse that delivers a single functional data set. • Often holds only one subject area- for example, Finance, or Sales. • May hold more summarized data. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 47. Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 48. Data Mart • Windows-based or Unix/Linux-based servers are used to implement data marts. • They are implemented on low-cost servers. • The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years. • The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide. • Data marts are small in size. • Data marts are customized by department. • The source of a data mart is a departmentally structured data warehouse. • Data marts are flexible. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 49. Need Of Data Mart  A Data Mart focuses only on the functioning of a particular department of an organization.  It is maintained by a single authority of an organization.  Since it stores the data related to a specific part of an organization, data retrieval from it is very quick.  Designing and maintenance of a data mart is quite easy as compared to a data warehouse.  It reduces the response time of the user as it stores a small volume of data.  It is small in size, due to which accessing data from it is very fast.  This storage unit is used by most organizations for the smooth running of their departments. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 50. Types of Data Mart:  There are three main types of data marts: 1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse. 2. Independent: An independent data mart is created without the use of a central data warehouse. 3. Hybrid: This type of data mart can take data from data warehouses or operational systems. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 51. Dependent Data Mart  Dependent Data Mart is created by extracting the data from central repository, Datawarehouse.  First data warehouse is created by extracting data (through ETL tool) from external sources and then data mart is created from data warehouse.  Dependent data mart is created in top-down approach of Datawarehouse architecture.  This model of data mart is used by big organizations. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 52. Dependent Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 53. Independent Data Mart  The second approach is Independent data marts (IDM).  Independent Data Mart is created directly from external sources instead of data warehouse.  First data mart is created by extracting data from external sources and then Datawarehouse is created from the data present in data mart.  Independent data mart is designed in bottom-up approach of Datawarehouse architecture.  This model of data mart is used by small organizations and is cost effective comparatively. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 54. Independent Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Data Mart
  • 55. Hybrid Data Mart  This type of Data Mart is created by extracting data from an operational source or from a data warehouse.  It is best suited for multiple database environments and fast implementation turnaround for any organization.  It also requires the least data cleansing effort.  A Hybrid Data Mart also supports large storage structures, and it is well suited for flexible, smaller data-centric applications.  1) Path-1 reflects accessing data directly from external sources and  2) Path-2 reflects the dependent data model of a data mart. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 56. Hybrid Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 57. Steps in Implementing a Datamart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Implementing a Data Mart is a rewarding but complex procedure.  The significant steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions and manage it over time.  So, the steps are:
  • 58. Advantages of Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Implementation of a data mart needs less time as compared to implementation of a Datawarehouse, as a data mart is designed for a particular department of an organization.  Organizations are provided with choices to choose the model of data mart depending upon cost and their business.  Data can be easily accessed from a data mart.  It supports frequently accessed queries, so it enables analysis of business trends.
  • 59. Disadvantages of Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Since it stores the data related only to a specific function, it does not store the huge volume of data related to each and every department of an organisation, as a datawarehouse does.  Over time it can become a big hurdle to maintain.
  • 60. Difference between Datawarehouse & Data Mart Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Data Warehouse Data Mart A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation. A data mart is a subtype of a Data Warehouse. It is architected to meet the requirements of a specific user group. It may hold multiple subject areas. It holds only one subject area. For example, Finance or Sales. It holds very detailed information. It may hold more summarized data. A data warehouse is data-oriented. A data mart is project-oriented. In data warehousing, Fact constellation is used. In a Data Mart, Star Schema and Snowflake Schema are used. It is a Centralized System. It is a Decentralized System.
  • 61. ETL Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.
  • 62. ETL Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area and then finally, loads it into the Data Warehouse system.
  • 63. Why do you need ETL? Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It helps companies to analyze their business data for taking critical business decisions.  Transactional databases cannot answer complex business questions that can be answered by ETL.  ETL provides a method of moving the data from various sources into a data warehouse.  Well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.  ETL helps to Migrate data into a Data Warehouse. Convert to the various formats and types to adhere to one consistent system.  ETL is a predefined process for accessing and manipulating source data into the target database.
  • 64. ETL Process - Extraction Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Extraction is the operation of extracting information from a source system for further use in a data warehouse environment. This is the first stage of the ETL process.  Extraction process is often one of the most time-consuming tasks in the ETL.  The source systems might be complicated and poorly documented, and thus determining which data needs to be extracted can be difficult.  The data has to be extracted several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date.
  • 65. ETL Process - Extraction Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It is important to extract the data from various source systems and store it into the staging area first, and not directly into the data warehouse, because the extracted data is in various formats and can also be corrupted.  Hence loading it directly into the data warehouse may damage it. Therefore, this is one of the most important steps of the ETL process.  The extraction step should be designed in such a way that it does not have a negative effect on the source system.  Data extraction time slots for different systems vary as per time zones and operational hours.
  • 66. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format.  Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed.  The main objective of this step is to load the extracted data into the target database in a clean and general format.  For example, there are two sources A and B.  Date format of A is dd/mm/yyyy and format of B is mm/dd/yy.  In transformation, these date formats are brought into a single general format.
  • 67. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 68. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format. It may involve following processes/tasks:  Filtering – loading only certain attributes into the data warehouse.  Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States and America into USA, etc.  Joining – joining multiple attributes (columns) into one.  Splitting – splitting a single attribute into multiple attributes.  Sorting – sorting tuples on the basis of some attribute (generally key-attribute).  Enrichment – Full name to ‘First Name’, ‘Middle Name’ & ‘Last Name’.
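To make the transformation tasks above concrete, here is a minimal Python sketch that applies a few of them (cleaning, date standardization, and splitting/enrichment of a full name) to hypothetical extracted records; the record fields, mapping table and default values are illustrative assumptions, not part of any particular ETL tool.

from datetime import datetime

# Hypothetical raw records extracted from two source systems (illustrative only).
source_a = [{"name": "Jon Smith", "country": "U.S.A", "sale_date": "25/03/2021", "amount": "120.50"}]
source_b = [{"name": "John Smith", "country": "United States", "sale_date": "03/25/21", "amount": None}]

# Cleaning rule: map country name variants onto one standard value.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(record, date_format):
    """Apply cleaning, date standardization and enrichment to one extracted record."""
    out = {}
    # Cleaning: standardize country names and fill NULL amounts with a default value.
    out["country"] = COUNTRY_MAP.get(record["country"], record["country"])
    out["amount"] = float(record["amount"]) if record["amount"] is not None else 0.0
    # Date standardization: bring both source formats into a single yyyy-mm-dd format.
    out["sale_date"] = datetime.strptime(record["sale_date"], date_format).strftime("%Y-%m-%d")
    # Splitting/Enrichment: split the full name into first-name and last-name attributes.
    first, _, last = record["name"].partition(" ")
    out["first_name"], out["last_name"] = first, last
    return out

staged = [transform(r, "%d/%m/%Y") for r in source_a] + [transform(r, "%m/%d/%y") for r in source_b]
print(staged)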
  • 69. ETL Process - Transformation Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Following are Data Integrity Problems: 1) Different spelling of the same person like Jon, John, etc. 2) There are multiple ways to denote company name like Google, Google pvt. ltd., Google Inc. 3) Use of different names like Mumbai, Bombay. 4) There may be a case that different account numbers are generated by various applications for the same customer. 5) In some data required files remains blank. 6) Invalid product collected at POS as manual entry can lead to mistakes.
  • 70. ETL Process - Loading Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse.  Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is done after longer but regular intervals.  The rate and period of loading solely depend on the requirements and vary from system to system.  In case of load failure, recovery mechanisms should be configured to restart from the point of failure without data integrity loss.  Data Warehouse admins need to monitor, resume, and cancel loads as per prevailing server performance.
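A minimal loading sketch, assuming Python's built-in sqlite3 as a stand-in for the target warehouse database; the sales_fact table and its columns are illustrative. The transaction wrapper shows the idea of failing cleanly rather than committing partial data when a load fails.

import sqlite3

conn = sqlite3.connect("warehouse.db")          # stand-in for the target warehouse database
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                   sale_date TEXT, country TEXT, first_name TEXT, last_name TEXT, amount REAL)""")
conn.commit()

# Output of the transformation step (illustrative staged rows).
staged = [("2021-03-25", "USA", "Jon", "Smith", 120.50)]

try:
    with conn:                                   # commits on success, rolls back on failure
        conn.executemany(
            "INSERT INTO sales_fact (sale_date, country, first_name, last_name, amount) "
            "VALUES (?, ?, ?, ?, ?)",
            staged)
except sqlite3.Error as exc:
    # A recovery mechanism would restart the load from the point of failure.
    print("Load failed, nothing committed:", exc)
finally:
    conn.close()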
  • 71. ETL Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it can be transformed, and during that period some new data can be extracted. And while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.  The block diagram of the pipelining of the ETL process is shown below:
  • 72. Selecting an ETL Tool Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Selecting an appropriate ETL tool is an important decision that has to be made in building an ODS or data warehousing application.  The ETL tools are required to provide coordinated access to multiple data sources so that relevant data may be extracted from them.  An ETL tool generally contains tools for data cleansing, re-organization, transformations, aggregation, calculation and automatic loading of information into the target database.  An ETL tool should provide a simple user interface that allows data cleansing and data transformation rules to be specified using a point-and-click approach.
  • 73. ETL tools Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  When all mappings and transformations have been defined, the ETL tool should automatically generate the data extract/transformation/load programs.  There are many Data Warehousing tools available in the market. Here are some of the most prominent ones:  1. MarkLogic  2. Oracle  3. Amazon RedShift  4. Sybase
  • 74. Components of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Source Data Component  Data Staging Component (ETL)  Metadata Component  End user tools and applications  Data Warehouse Management
  • 75. Components of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 76. Data Warehouse Architecture Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  DATA WAREHOUSE ARCHITECTURE is complex as it’s an information system that contains historical and cumulative data from multiple sources.  Data warehouses and their architectures vary depending upon the specifics of an organization’s situation. Three common architectures are:  Data Warehouse Architecture (Basic)  Data Warehouse Architecture (with a staging area)  Data Warehouse Architecture (with a staging area and data mart)
  • 77. Data Warehouse Architecture Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Data Warehouse Architecture (Basic)
  • 78. Data Warehouse Architecture (Basic) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Operational System:- An operational system is a method used in data warehousing to refer to a system that is used to process the day- to-day transactions of an organization.  Flat Files:- A Flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.  End-User access Tools:- The principal purpose of a data warehouse is to provide information to the business managers for strategic decision-making. These customers interact with the warehouse using end-client access tools. • Example:- Reporting and Query Tools, Application Development Tools, Executive Information Systems Tools, Online Analytical Processing Tools, Data Mining Tools
  • 79. Data Warehouse Architecture (With Staging Area) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • We must clean and process operational data before putting it into the warehouse.  We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).  A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
  • 80. Data Warehouse Architecture (With Staging Area) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 81. Data Warehouse Architecture (With Staging Area) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • Data Warehouse Staging Area is a temporary location where a record from source systems is copied.
  • 82. Data Warehouse Architecture (With Staging Area and Data Marts) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  We may want to customize our warehouse's architecture for multiple groups within our organization.  We can do this by adding data marts.  A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc.  The figure illustrates an example where purchasing, sales, and stocks are separated.  In this example, a financial analyst wants to analyze historical data for purchases and sales or mine historical information to make predictions about customer behavior.
  • 83. Data Warehouse Architecture (With Staging Area and Data Marts) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 84. Types of Data Warehouse Architectures  DATA WAREHOUSE ARCHITECTURE is complex as it’s an information system that contains historical and cumulative data from multiple sources. There are 3 methods for constructing a data warehouse: Single Tier, Two Tier and Three Tier. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 85. Types of Data Warehouse Architectures Single-Tier Architecture  The objective of a single layer is to minimize the amount of data stored.  This goal is to remove data redundancy.  This architecture is not frequently used in practice. Two-Tier Architecture  Two-layer architecture separates physically available sources and data warehouse.  This architecture is not expandable and also not supporting a large number of end-users.  It also has connectivity problems because of network limitations. Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 86. Types of Data Warehouse Architectures Three-tier architecture  This is the most widely used architecture.  Generally, a data warehouse adopts a three-tier architecture.  It consists of the Top, Middle and Bottom tiers:  1 Bottom tier  2 Middle tier  3 Top tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 87. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 1. Bottom Tier: The database of the Datawarehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools. 2. Middle Tier: The middle tier in a Data warehouse is an OLAP server which is implemented using either the ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database. 3. Top Tier: The top tier is a front-end client layer. The top tier is the tools and APIs that you connect to get data out from the data warehouse. It could be Query tools, reporting tools, managed query tools, Analysis tools and Data mining tools.
  • 88. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 89. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 1.) Top Tier  The Top Tier consists of the client-side front end of the architecture.  The transformed and logic-applied information stored in the Data Warehouse will be used and acquired for business purposes in this tier.  Several tools for report generation and analysis are present for the generation of desired information.  Data mining, which has become a great trend these days, is done here.  Requirement analysis documents, costing, and all features that determine a profit-based business deal are prepared based on these tools, which use the Data Warehouse information.
  • 90. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 2.) Middle Tier  The Middle Tier consists of the OLAP Servers  OLAP is Online Analytical Processing Server  OLAP is used to provide information to business analysts and managers  As it is located in the Middle Tier, it rightfully interacts with the information present in the Bottom Tier and passes on the insights to the Top Tier tools which processes the available information.  Mostly Relational or Multidimensional OLAP is used in Data warehouse architecture.
  • 91. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Bottom Tier:- The Bottom Tier mainly consists of the Data Sources, ETL Tool, and Data Warehouse.  1. Data Sources:- The Data Sources consists of the Source Data that is acquired and provided to the Staging and ETL tools for further process.  2. ETL Tools:- ETL tools are very important because they help in combining Logic, Raw Data, and Schema into one and loads the information to the Data Warehouse Or Data Marts.  Sometimes, ETL loads the data into the Data Marts and then information is stored in Data Warehouse. This approach is known as the Bottom-Up approach.  The approach where ETL loads information to the Data Warehouse directly is known as the Top-down Approach.
  • 92. Types of Data Warehouse Architectures – 3 Tier Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 93. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data-warehouse is a heterogeneous collection of different data sources organized under a unified schema. There are 2 approaches for constructing data-warehouse: Top-down approach and Bottom-up approach are explained as below. 1. Top-down approach: The needed components are discussed below: 1.) External Sources – External source is a source from where data is collected irrespective of the type of data. Data can be structured, semi structured and unstructured as well. 2.) Stage Area – Since the data, extracted from the external sources does not follow a particular format, so there is a need to validate this data to load into Datawarehouse. For this purpose, it is recommended to use ETL tool.
  • 94. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 95. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in E(Extracted): Data is extracted from External data source. T(Transform): Data is transformed into the standard format. L(Load): Data is loaded into Datawarehouse after transforming it into the standard format. 3.) Data-warehouse – After cleansing of data, it is stored in the Datawarehouse as central repository. It actually stores the meta data and the actual data gets stored in the data marts. Note that Datawarehouse stores the data in its purest form in this top-down approach. 4.) Data Marts – Data mart is also a part of storage component. It stores the information of a particular function of an organization which is handled by single authority. We can also say that data mart contains subset of the data stored in Datawarehouse.
  • 96. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 5.) Data Mining – The practice of analyzing the big data present in the Datawarehouse is data mining. It is used to find the hidden patterns that are present in the database or in the Datawarehouse with the help of data mining algorithms. Advantages of Top-Down Approach – 1. Since the data marts are created from the Datawarehouse, it provides a consistent dimensional view of data marts. 2. Also, this model is considered the strongest model for business changes. That’s why big organizations prefer to follow this approach. 3. Creating a data mart from the Datawarehouse is easy.  Disadvantages of Top-Down Approach – The cost and time taken in designing and maintaining it are very high.
  • 97. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 2. Bottom-up approach: 1. First, the data is extracted from external sources (same as happens in the top-down approach). 2. Then the data goes through the staging area (as explained above) and is loaded into data marts instead of the datawarehouse. The data marts are created first and provide reporting capability. Each addresses a single business area. 3. These data marts are then integrated into the datawarehouse.  This approach is given by Kimball as – data marts are created first and provide a thin view for analysis, and the datawarehouse is created after the data marts have been completed.
  • 98. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 99. Data Warehouse Approaches Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Advantages of Bottom-Up Approach – 1. As the data marts are created first, the reports are quickly generated. 2. We can accommodate more data marts here, and in this way the Datawarehouse can be extended. 3. Also, the cost and time taken in designing this model are comparatively low.  Disadvantage of Bottom-Up Approach – 1. This model is not as strong as the top-down approach, as the dimensional view of data marts is not as consistent as it is in the above approach.
  • 100. Difference Between Top-down Approach and Bottom-up Approach Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no. Top-Down Approach Bottom-Up Approach 1 Provides a definite and consistent view of information as information from the data warehouse is used to create Data Marts Reports can be generated easily as Data marts are created first and it is relatively easy to interact with data marts. 2 Strong model and hence preferred by big companies Not as strong but data warehouse can be extended and the number of data marts can be created 3 Time, Cost and Maintenance is high Time, Cost and Maintenance are low.
  • 101. Design of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  An important point about Data Warehouse is its efficiency. To create an efficient Data Warehouse, we construct a framework known as the Business Analysis Framework.  There are four types of views in regard to the design of a DW.  1. Top-Down View: This View allows only specific information needed for a data warehouse to be selected.  2. Data Source View: This view shows all the information from the source of data to how it is transformed and stored.  3. Data Warehouse View: This view shows the information present in the Data warehouse through fact tables and dimension tables.  4. Business Query View: This is a view that shows the data from the user’s point of view.
  • 102. Advantages of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  1. Integrating data from multiple sources.  2. Performing new types of analyses.  3. Reducing cost to access historical data.  Other benefits may include:  1. Standardizing data across the organization, a "single version of the truth“.  2. Improving turnaround time for analysis and reporting.  3. Sharing data and allowing others to easily access data.  4. Removing informational processing load from transaction- oriented databases.
  • 103. Disadvantages of Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The major disadvantage is that a data warehouse can be costly to maintain and that becomes a problem if the warehouse is underutilized.  It seems that managers have unrealistic expectations about what they will get from having a data warehouse.  There are considerable disadvantages involved in moving data from multiple, often highly disparate, data sources to one data warehouse that translate into long implementation time, high cost, lack of flexibility, dated information, and limited capabilities.  The data warehouse may seem easy, but actually, it is too complex for the average users.  Not an ideal option for unstructured data.
  • 104. Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The name Meta Data suggests some high-level technological concept.  However, it is quite simple. Metadata is data about data which defines the data warehouse.  It is used for building, maintaining and managing the data warehouse.  In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source, usage, values, and features of data warehouse data.  It also defines how data can be changed and processed.  It is closely connected to the data warehouse.
  • 105. Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in For example, a line in a sales database may contain:  This is meaningless data until we consult the metadata that tells us it was • Model number: 4030 • Sales Agent ID: KJ732 • Total sales amount of $299.90  Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
  • 106. Metadata Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Metadata helps to answer the following questions • What tables, attributes, and keys does the Data Warehouse contain? • Where did the data come from? • How many times do data get reloaded? • What transformations were applied with cleansing?
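A small sketch of how such metadata might be recorded, here simply as a Python dictionary; the field names are assumptions for illustration and do not follow any specific metadata standard.

# Illustrative metadata entry for one warehouse table (field names are assumptions).
sales_fact_metadata = {
    "table": "sales_fact",
    "attributes": ["sale_date", "country", "amount"],
    "keys": {"foreign": ["date_key", "country_key"]},
    "source": "orders table in the OLTP order-entry system",   # where the data came from
    "load_frequency": "daily",                                  # how often data gets reloaded
    "transformations": [                                        # cleansing applied
        "country names mapped to a standard value",
        "dates converted to yyyy-mm-dd",
    ],
}

# Answering "where did the data come from?" for this table:
print(sales_fact_metadata["source"])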
  • 107. Data Warehouse Models Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no. DATABASE SYSTEM DATA WAREHOUSE 1 It supports operational processes. It supports analysis and performance reporting. 2 Operational Database are those databases where data changes frequently. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. 3 It focuses on current transactional data. It focuses on historical data. 4 Data is balanced within the scope of this one system. Data must be integrated and balanced from multiple system.
  • 108. Data Warehouse Models Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no. DATABASE SYSTEM DATA WAREHOUSE 5 ER based. Star/Snowflake. 6 Application oriented. Subject oriented. 7 It is slow for analytics queries. It is fast for analysis queries. 8 Relational databases are created for on-line transactional Processing (OLTP) Data Warehouse designed for on-line Analytical Processing (OLAP) 9 Data stored in the Database is up to date. Current and Historical Data is stored in Data Warehouse. May not be up to date.
  • 109. Dimensional Modeling Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  DIMENSIONAL MODELING (DM) is a data structure technique optimized for data storage in a Data warehouse.  The purpose of a dimensional model is to optimize the database for fast retrieval of data.  The concept of Dimensional Modelling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.  A Dimensional model is designed to read, summarize, and analyze numeric information like values, balances, counts, weights, etc. in a data warehouse.  In contrast, relational models are optimized for addition, updating and deletion of data in a real-time Online Transaction System.
  • 110. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Fact:- Facts are the measurements/metrics or facts from your business process. For a Sales business process, a measurement would be quarterly sales number.  Dimension:- Dimension provides the context surrounding a business process event. In simple terms, they give who, what, where of a fact. In the Sales business process, for the fact quarterly sales number, dimensions would be • Who – Customer Names • Where – Location • What – Product Name  In other words, a dimension is a window to view information in the facts.
  • 111. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Attributes  The Attributes are the various characteristics of the dimension.  In the Location dimension, the attributes can be • State • Country • Zipcode etc.  Attributes are used to search, filter, or classify facts. Dimension Tables contain Attributes
  • 112. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Fact Table  A fact table is a primary table in a dimensional model.  A Fact Table contains 1. Measurements/facts 2. Foreign key to dimension table
  • 113. Elements of Dimensional Data Model Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Dimension table • A dimension table contains dimensions of a fact. • They are joined to the fact table via a foreign key. • Dimension tables are de-normalized tables. • The Dimension Attributes are the various columns in a dimension table. • Dimensions offer descriptive characteristics of the facts with the help of their attributes. • There is no set limit on the number of dimensions. • A dimension can also contain one or more hierarchical relationships.
  • 114. Multidimensional schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Multidimensional Schema is especially designed to model data warehouse systems.  The schemas are designed to address the unique needs of very large databases designed for the analytical purpose (OLAP).  Types of Data Warehouse Schema:  Following are 3 chief types of multidimensional schemas each having its unique advantages. • Star Schema • Snowflake Schema • Galaxy Schema
  • 115. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In the STAR Schema, the center of the star can have one fact table and a number of associated dimension tables.  Star schema is the most fundamental of the data mart schemas, and it is the simplest.  This schema is widely used to develop or build a data warehouse and dimensional data marts.  It is known as star schema as its structure resembles a star.  The star schema is the simplest type of Data Warehouse schema.  It is also known as Star Join Schema and is optimized for querying large data sets.
  • 116. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In a star schema, the fact table will be at the center and is connected to the dimension tables.  The tables are completely in a denormalized structure.  SQL query performance is good as there are fewer joins involved.  Data redundancy is high and occupies more disk space.  It is said to be a star as its physical model resembles a star shape, having a fact table at its center and the dimension tables at its periphery representing the star’s points.  Usually the fact tables in a star schema are in third normal form (3NF) whereas dimension tables are de-normalized.
  • 117. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 118. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Characteristics of Star Schema:  Every dimension in a star schema is represented by only one dimension table.  The dimension table should contain the set of attributes.  The dimension table is joined to the fact table using a foreign key.  The dimension tables are not joined to each other.  The fact table contains keys and measures.  The Star schema is easy to understand and provides optimal disk usage.  The dimension tables are not normalized.  For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.  The schema is widely supported by BI Tools.
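A minimal star schema sketch, assuming Python's built-in sqlite3 and illustrative table and column names: one fact table holding foreign keys and numeric measures, two denormalized dimension tables, and a typical star-join query that reaches each dimension with a single join.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables (each dimension is a single table).
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT, category TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

-- Fact table: foreign keys to the dimensions plus the numeric measures.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES dim_product(product_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_product  VALUES (1, 'Laptop', 'Acme', 'Electronics');
INSERT INTO dim_location VALUES (1, 'Indore', 'MP', 'India');
INSERT INTO sales_fact   VALUES (1, 1, 3, 1500.0);
""")

# A typical star-join query: a single join from the fact table to each dimension it needs.
query = """
SELECT p.category, l.country, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_location l ON f.location_key = l.location_key
GROUP BY p.category, l.country;
"""
print(conn.execute(query).fetchall())   # [('Electronics', 'India', 1500.0)]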
  • 119. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 120. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 121. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Advantages of Star Schema – 1. Simpler Queries: The join logic of a star schema is quite simple compared to the join logic needed to fetch data from a highly normalized transactional schema. 2. Simplified Business Reporting Logic: Compared to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period reporting. 3. Feeding Cubes: The star schema is widely used by all OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.
  • 122. Star Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Disadvantages of Star Schema – 1. Data integrity is not enforced well, since the schema is in a highly de-normalized state. 2. Not as flexible for analytical needs as a normalized data model. 3. Star schemas don’t support many-to-many relationships within business entities – at least not directly.
  • 123. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  SNOWFLAKE SCHEMA is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape.  A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions.  The dimension tables are normalized which splits data into additional tables.  The snowflake schema is a variant of the star schema.  The snowflake effect affects only the dimension tables and does not affect the fact tables.
  • 124. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A snowflake schema is an extension of the star schema where the dimension tables are connected to one or more further dimension tables.  The tables are partially denormalized in structure.  The performance of SQL queries is a bit lower when compared to the star schema, as more joins are involved.  Data redundancy is low and occupies less disk space when compared to the star schema.  The snowflake structure materializes when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables.
  • 125. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 126. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Characteristics of Snowflake Schema: • The main benefit of the snowflake schema is that it uses smaller disk space. • It is easier to add a dimension to the schema. • Due to the multiple tables, query performance is reduced. • The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance effort because of the additional lookup tables.
  • 127. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 128. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 129. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • For example, the item dimension table in star schema is normalized and split into two dimension tables, namely item and supplier table. • Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier-key. • The supplier key is linked to the supplier dimension table. • The supplier dimension table contains the attributes supplier_key and supplier_type.
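A sketch of the item/supplier example above, again using Python's sqlite3 with illustrative column names: the supplier attributes are split into their own table, so reaching supplier_type from the fact table now costs one extra join compared with a star schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked item dimension: supplier attributes are split out into their own table.
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key)
);
CREATE TABLE sales_fact (item_key INTEGER REFERENCES dim_item(item_key), dollars_sold REAL);

INSERT INTO dim_supplier VALUES (10, 'wholesale');
INSERT INTO dim_item     VALUES (1, 'Laptop', 'computer', 'Acme', 10);
INSERT INTO sales_fact   VALUES (1, 1500.0);
""")

# Reaching supplier_type from the fact table now needs an extra join
# (sales_fact -> dim_item -> dim_supplier), the cost of normalizing the dimension.
query = """
SELECT s.supplier_type, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN dim_item     i ON f.item_key     = i.item_key
JOIN dim_supplier s ON i.supplier_key = s.supplier_key
GROUP BY s.supplier_type;
"""
print(conn.execute(query).fetchall())   # [('wholesale', 1500.0)]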
  • 130. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Advantages: There are two main advantages of snowflake schema given below: • It provides structured data which reduces the problem of data integrity. • It uses small disk space because data are highly structured.
  • 131. Snowflake Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Disadvantages: • Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant. • Avoid snowflaking or normalization of a dimension table, unless required and appropriate. • Do not snowflake hierarchies of one dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflaked. • Multiple hierarchies can belong to the same dimension if the dimension has been designed at the lowest possible level of detail.
  • 132. Fact Constellation Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Fact constellation means two or more fact tables sharing one or more dimensions. It is also called Galaxy schema.  The Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact Constellation Schema can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.  The schema is viewed as a collection of stars, hence the name Galaxy Schema.  The fact constellation schema is also a type of multidimensional model.  In a Galaxy schema, shared dimensions are called Conformed Dimensions.
  • 133. Fact Constellation Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 134. Fact Constellation Schema Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Characteristics of Galaxy Schema: • The dimensions in this schema are separated into separate dimensions based on the various levels of hierarchy. • For example, if geography has four levels of hierarchy like region, country, state, and city then the Galaxy schema should have four dimensions. • Moreover, it is possible to build this type of schema by splitting the one-star schema into more Star schemas. • The dimensions in this schema are large, as they need to be built based on the levels of hierarchy. • This schema is helpful for aggregating fact tables for better understanding.
  • 135. Fact Table vs Dimension Table Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.NO FACT TABLE DIMENSION TABLE 1 The fact table contains the measures defined on the attributes of a dimension table. The dimension table contains the attributes on which the fact table calculates the metrics. 2 Located at the center of a star or snowflake schema and surrounded by dimensions. Connected to the fact table and located at the edges of the star or snowflake schema. 3 Fact tables could contain information like sales against a set of dimensions like Product and Date. Every dimension table contains attributes which describe the details of the dimension. E.g., Product dimensions can contain Product ID, Product Category, etc.
  • 136. Fact Table vs Dimension Table Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.NO FACT TABLE DIMENSION TABLE 4 The fact table holds the primary keys of the dimensions as foreign keys. A dimension table has a primary key column that uniquely identifies each dimension record. 5 Does not contain hierarchies. Contains hierarchies. For example, Location could contain country, pin code, state, city, etc. 6 A fact table has fewer attributes than a dimension table. A dimension table has more attributes than a fact table. 7 The number of fact tables in a schema is less than the number of dimension tables. The number of dimension tables in a schema is more than the number of fact tables.
  • 137. Type of Facts Mr. Sagar Pandya sagar.pandya@medicaps.ac.in • Additive – As the name implies, additive measures are measures which can be summed across all dimensions. • Non-additive – different from additive measures, non-additive measures are measures that cannot be summed across any dimension. • Semi-additive – semi-additive measures are measures that can be summed across only some dimensions and not across others.
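A small illustrative example of a semi-additive measure (month-end account balance): it can be summed across the account dimension but not across the time dimension. The records and values below are made up for illustration.

# Illustrative records of a semi-additive measure: month-end account balances.
balances = [
    {"account": "A", "month": "2021-01", "balance": 100.0},
    {"account": "B", "month": "2021-01", "balance": 250.0},
    {"account": "A", "month": "2021-02", "balance": 120.0},
    {"account": "B", "month": "2021-02", "balance": 230.0},
]

# Additive across the account dimension: total balance held in January is meaningful.
jan_total = sum(r["balance"] for r in balances if r["month"] == "2021-01")          # 350.0

# Not additive across the time dimension: summing A's balances over months (100 + 120)
# means nothing; an average or the latest value is used instead.
a_avg_over_time = sum(r["balance"] for r in balances if r["account"] == "A") / 2    # 110.0

print(jan_total, a_avg_over_time)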
  • 138. Designing fact table steps Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Here is an overview of the four steps to designing a fact table described by Kimball: 1. Choose the business process to model – The first step is to decide what business process to model by gathering and understanding business needs and available data. 2. Declare the grain – declaring the grain means describing exactly what a fact table record represents. 3. Choose the dimensions – once the grain of the fact table is stated clearly, it is time to determine the dimensions for the fact table. 4. Identify facts – identify carefully which facts will appear in the fact table.
  • 139. Star Vs Snowflake Schema: Key Differences Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no Star Schema Snow Flake Schema 1 Hierarchies for the dimensions are stored in the dimensional table. Hierarchies are divided into separate tables. 2 It contains a fact table surrounded by dimension tables. One fact table surrounded by dimension tables which are in turn surrounded by further dimension tables. 3 In a star schema, only a single join creates the relationship between the fact table and any dimension table. A snowflake schema requires many joins to fetch the data. 4 Simple DB Design. Very Complex DB Design.
  • 140. Star Vs Snowflake Schema: Key Differences Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no Star Schema Snow Flake Schema 5 Denormalized data structure; queries also run faster. Normalized Data Structure. 6 High level of data redundancy Very low level of data redundancy 7 A single dimension table contains aggregated data. Data is split into different dimension tables. 8 Cube processing is faster. Cube processing might be slow because of the complex joins. 9 Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. The Snowflake Schema is represented by a centralized fact table which is unlikely to be connected with multiple dimensions directly.
  • 141. Data Warehouse Models Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  From the perspective of data warehouse architecture, we have the following data warehouse models − • Enterprise warehouse:- collects all of the information about subjects spanning the entire organization. • Data Mart:- a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart. • Virtual warehouse • It is a virtual view of databases. • Virtual Warehouse have a logical description of all the databases and their structure. • This method creates single Database from all the data sources.
  • 142. Data Lake Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Data Lake is a storage repository that can store large amount of structured, semi-structured, and unstructured data.  It is a place to store every type of data in its native format with no fixed limits on account size or file.  It offers high data quantity to increase analytic performance and native integration.  Data Lake is like a large container which is very similar to real lake and rivers.  Just like in a lake you have multiple tributaries coming in, a data lake has structured data, unstructured data, machine to machine, logs flowing through in real-time.
  • 143. Data Lake Mr. Sagar Pandya sagar.pandya@medicaps.ac.in
  • 144. Data Lake Vs Data Warehouse: Key Differences Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.no Data Lakes Data Warehouse 1 Data lakes store everything. Data Warehouse focuses only on Business Processes. 2 Data are mainly unprocessed Highly processed data. 3 It can be unstructured, semi-structured and structured. It is mostly in tabular form & structure. 4 Data Lake is mostly used by Data Scientists Business professionals widely use a Data Warehouse 5 Can use open source tools like Hadoop/MapReduce Mostly commercial tools like Google BigQuery, IBM, Amazon, Oracle.
  • 145. Big Data vs Data Warehouse Mr. Sagar Pandya sagar.pandya@medicaps.ac.in S.NO. BIG DATA DATA WAREHOUSE 1 Big data is a technology to store and manage large amount of data. Data warehouse is an architecture used to organize the data. 2 Big data can handle structure, non-structure, semi-structured data. Data warehouse only handles structure data (relational or not relational) 3. Big data does processing by using distributed file system. Data warehouse doesn’t use distributed file system for processing. 4. Big data doesn’t follow any SQL queries to fetch data from database. In data warehouse we use SQL queries to fetch data from relational databases.
  • 146. Data Warehousing – Partitioning Strategy Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partitioning is done to enhance performance and facilitate easy management of data.  Partitioning also helps in balancing the various requirements of the system.  It optimizes the hardware performance and simplifies the management of data warehouse by partitioning each fact table into multiple separate partitions.  Why is it Necessary to Partition?  Partitioning is important for the following reasons − 1. For easy management, 2. To assist backup/recovery, 3. To enhance performance.
  • 147. Data Warehousing – Partitioning Strategy Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  For Easy Management  The fact table in a data warehouse can grow up to hundreds of gigabytes in size.  This huge size of fact table is very hard to manage as a single entity. Therefore it needs partitioning.  To Assist Backup/Recovery  If we do not partition the fact table, then we have to load the complete fact table with all the data.  Partitioning allows us to load only as much data as is required on a regular basis.  It reduces the time to load and also enhances the performance of the system.
  • 148. Data Warehousing – Partitioning Strategy Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only.  We can then put these partitions into a state where they cannot be modified.  Then they can be backed up. It means only the current partition is to be backed up.  To Enhance Performance  By partitioning the fact table into sets of data, the query procedures can be enhanced.  Query performance is enhanced because now the query scans only those partitions that are relevant.  It does not have to scan the whole data.
  • 149. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  There are various ways in which a fact table can be partitioned.  In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.  Partitioning by Time into Equal Segments:  In this partitioning strategy, the fact table is partitioned on the basis of time period.  Here each time period represents a significant retention period within the business.  For example, if the user queries for month to date data then it is appropriate to partition the data into monthly segments.  We can reuse the partitioned tables by removing the data in them.
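A minimal application-level sketch of partitioning fact rows by month, assuming illustrative row fields; a real warehouse would use the database's own partitioning feature, but the routing idea and the benefit of scanning only the relevant partition are the same.

from collections import defaultdict

# Hypothetical fact rows; each carries a transaction date.
fact_rows = [
    {"transaction_id": 1, "transaction_date": "2021-01-14", "value": 120.0},
    {"transaction_id": 2, "transaction_date": "2021-01-30", "value": 75.5},
    {"transaction_id": 3, "transaction_date": "2021-02-02", "value": 310.0},
]

# Horizontal partitioning by time: one partition (here, one list) per calendar month.
partitions = defaultdict(list)
for row in fact_rows:
    month_key = row["transaction_date"][:7]              # e.g. "2021-01"
    partitions[f"sales_fact_{month_key}"].append(row)

# A month-to-date query only has to scan the single relevant partition.
current = partitions["sales_fact_2021-02"]
print(len(partitions), sum(r["value"] for r in current))  # 2 partitions, 310.0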
  • 150. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition by Time into Different-sized Segments  This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data, larger partition for inactive data.
  • 151. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Points to Note • The detailed information remains available online. • The number of physical tables is kept relatively small, which reduces the operating cost. • This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required. • This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operation cost of the data warehouse.
  • 152. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition on a Different Dimension  The fact table can also be partitioned on the basis of dimensions other than time such as product group, region, supplier, or any other dimension.  Let's have an example.  Suppose a market function has been structured into distinct regional departments like on a state by state basis.  If each region wants to query on information captured within its region, it would prove to be more effective to partition the fact table into regional partitions.  This will cause the queries to speed up because it does not require to scan information that is not relevant.
  • 153. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Points to Note • The query does not have to scan irrelevant data, which speeds up the query process. • This technique is appropriate only where the dimension is unlikely to change in future. So, it is worth determining that the dimension does not change in future. • If the dimension changes, then the entire fact table would have to be repartitioned.  Note − It is recommended to perform the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
  • 154. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition by Size of Table  When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size.  We can set the predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created.  Points to Note • This partitioning is complex to manage. • It requires metadata to identify what data is stored in each partition.
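A small sketch of size-based partitioning, with a made-up row threshold: a new partition is started once the current one reaches the predetermined size, and a metadata list records what each partition holds, as the notes above require.

MAX_ROWS_PER_PARTITION = 2              # the predetermined size threshold (illustrative)

partitions = [[]]                       # list of partitions, newest last
partition_metadata = [{"name": "sales_fact_p1", "rows": 0}]

def load_row(row):
    """Append a row, starting a new partition once the current one reaches the threshold."""
    if len(partitions[-1]) >= MAX_ROWS_PER_PARTITION:
        partitions.append([])
        partition_metadata.append({"name": f"sales_fact_p{len(partitions)}", "rows": 0})
    partitions[-1].append(row)
    partition_metadata[-1]["rows"] += 1

for i in range(5):
    load_row({"transaction_id": i})

# The metadata identifies what is stored in each partition: 2 + 2 + 1 rows here.
print(partition_metadata)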
  • 155. Partitioning Strategy - Horizontal Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partitioning Dimensions  If a dimension contains large number of entries, then it is required to partition the dimensions. Here we have to check the size of a dimension.  Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.  Round Robin Partitions  In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow user access tool to refer to the correct table partition.  This technique makes it easy to automate table management facilities within the data warehouse.
  • 156. Partitioning Strategy - Vertical Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done.
  • 157. Partitioning Strategy - Vertical Partitioning Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Vertical partitioning can be performed in the following two ways − • Normalization • Row Splitting  Normalization:- Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space.  Row Splitting:- Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.  Note − While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.
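A minimal sketch of row splitting on a hypothetical account table: frequently used columns go into one partition and bulky, rarely used columns into another, keeping a one-to-one mapping between the partitions through the key.

# A wide account table (illustrative columns and rows).
account_rows = [
    {"account_id": 1, "name": "Asha", "region": "West", "statement_text": "long text...", "notes": "..."},
    {"account_id": 2, "name": "Ravi", "region": "East", "statement_text": "long text...", "notes": "..."},
]

# Row splitting: frequently used columns in one partition, bulky rarely used columns in another.
# The two partitions keep a one-to-one mapping through account_id.
hot_partition = [{"account_id": r["account_id"], "name": r["name"], "region": r["region"]}
                 for r in account_rows]
cold_partition = [{"account_id": r["account_id"], "statement_text": r["statement_text"],
                   "notes": r["notes"]}
                  for r in account_rows]

# Queries that only need name and region now scan the much smaller hot partition.
print([r["name"] for r in hot_partition if r["region"] == "West"])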
  • 158. Identify Key to Partition Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table.  Let's have an example. Suppose we want to partition the following table.  Account_Txn_Table  transaction_id  account_id  transaction_type  value  transaction_date  region  branch_name
  • 159. Identify Key to Partition Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  We can choose to partition on any key. The two possible keys could be 1) region 2) transaction_date  Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.  If we partition by transaction_date instead of region, then the latest transaction from every region will be in one partition. Now the user who wants to look at data within his own region has to query across multiple partitions.  Hence it is worth determining the right partitioning key.
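A small sketch contrasting the two candidate keys on made-up transactions: with region as the partition key a user's regional query touches a single partition, while with transaction_date it has to scan every partition.

txns = [
    {"account_id": "A1", "region": "North", "transaction_date": "2021-03-01", "value": 10.0},
    {"account_id": "A2", "region": "North", "transaction_date": "2021-03-02", "value": 20.0},
    {"account_id": "A3", "region": "South", "transaction_date": "2021-03-01", "value": 30.0},
]

def partition_by(rows, key):
    """Group rows into partitions keyed by one column (application-level stand-in)."""
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts

by_region = partition_by(txns, "region")
by_date = partition_by(txns, "transaction_date")

# "Total value for my own region" touches exactly one partition under the region key...
north_total = sum(r["value"] for r in by_region["North"])
# ...but has to scan every date partition under the transaction_date key.
north_total_scan = sum(r["value"] for part in by_date.values() for r in part if r["region"] == "North")
print(north_total, north_total_scan)   # both 30.0, but the second scans all partitions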
  • 160. Summary Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that is used in organizational decision making.  A data mart is defined as an implementation of a data warehouse with a small and more tightly restricted scope of data and data warehouse functions, serving a single department or part of an organization.  The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.  Metadata is data about data. Metadata does not give just a description of the entity, but also gives other details explaining the syntax and semantics of the data elements.
  • 161. Summary Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  A Virtual Warehouse has a logical description of all the databases and their structure.  In the STAR Schema, the center of the star can have one fact table and a number of associated dimension tables.  A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. It has normalized dimensions.  A Fact constellation means two or more fact tables sharing one or more dimensions. It is also called Galaxy schema.  Partitioning is done to enhance performance and facilitate easy management of data.  Partitioning strategies help in easy management, assist backup/recovery, and enhance performance.
  • 162. Unit – 1 Any - 5 Assignment Questions Marks:-20 Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Q.1 What is Data Warehouse? Explain the data warehouse architecture with diagram.  Q.2 Discuss Star, Snowflake and Galaxy schema for multidimensional Database.  Q.3 Give reason, why it is necessary to separate data warehouse from operational database.  Q.4 What is the need of data warehouse. Explain characteristics of data warehouse.  Q.5 What is Data Mart? What are the types of Data Mart?  Q.6 Explain ETL Process in data warehouse.  Q.7 Explain:  1) Metadata 2) Fact Table 3) Vertical Partitioning
  • 164. Thank You Great God, Medi-Caps, All the attendees Mr. Sagar Pandya sagar.pandya@medicaps.ac.in www.sagarpandya.tk LinkedIn: /in/seapandya Twitter: @seapandya Facebook: /seapandya