Databases, Data Warehousing, Data Mining, Decision Support System (DSS), OLAP, OLTP, MOLAP, ROLAP, Data Mart, Metadata, ETL Process, Drill Up, Roll Down, Slicing, Dicing, Star Schema, Snowflake Schema, Dimensional Modelling
1. JBIMS MIM Second Year (2015 – 18)
Data Management
15-I-131 Mufaddal Nullwala
2. What is a Database?
A database is a collection of interrelated data, or an organised mechanism to
manage, store and retrieve data.
Properties Of Database:
• Efficient
• Robust
• Stable
Example:
• Students Information
• Bank Registrar / Book of Accounts
• Employees Master
3. What is a database
management system?
It is software used to manage and access a database in an efficient
way.
Advantages:
• It gives you data in a few clicks whenever you require it
• Searching for critical information is easy
Example:
• Oracle 11g
• MSSQL
• MySQL
4. ER Diagram
ER-Diagram is a visual representation of data that
describes how data is related to each other.
Components of E-R Diagram are:
• Entity - An Entity can be any object, place, person or class.
• Attribute - An Attribute describes a property or characteristic of
an entity.
• Relationship - A Relationship describes relations
between entities. Several types of relationship can exist
between entities.
5. Relationships in an
ER Diagram
For a binary relationship set
the mapping cardinality must
be one of the following types:
One to one
One to many
Many to one
Many to many
6. Going up in this structure is called generalisation, where entities
are clubbed together to represent a more generalised view.
Specialisation is the opposite of generalisation. In specialisation, a
group of entities is divided into sub-groups based on their
characteristics.
7. Database Keys:
Keys are used to establish and identify relation between tables.
Types of Keys:
PRIMARY KEY
• Serves as the row level addressing mechanism in the relational database model.
• It can be formed through the combination of several items.
• Indicates uniqueness within records or rows in a table.
FOREIGN KEY
• A column or set of columns within a table that must match the primary key of a
second table.
• It is the primary key from another table; this is the only way join relationships can be established.
8. Primary Key : In Table A, Parcel no. is the Primary Key
but the Foreign key in Table B.
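A minimal sketch of the two key types in SQLite, mirroring the Parcel no. example above (table and column names are illustrative, not from the slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite must opt in to FK enforcement

# Table A: parcel_no is the primary key
conn.execute("CREATE TABLE parcels (parcel_no INTEGER PRIMARY KEY, address TEXT)")

# Table B: parcel_no is a foreign key referencing Table A
conn.execute("""
    CREATE TABLE tax_records (
        record_id INTEGER PRIMARY KEY,
        parcel_no INTEGER NOT NULL REFERENCES parcels (parcel_no),
        tax_paid  REAL
    )
""")

conn.execute("INSERT INTO parcels VALUES (101, '12 Main St')")
conn.execute("INSERT INTO tax_records VALUES (1, 101, 5400.0)")

# A row referencing a parcel that does not exist in Table A is rejected
fk_enforced = False
try:
    conn.execute("INSERT INTO tax_records VALUES (2, 999, 100.0)")
except sqlite3.IntegrityError:
    fk_enforced = True
```

The foreign key is exactly what makes the join between the two tables possible: `SELECT * FROM tax_records JOIN parcels USING (parcel_no)`.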
9. CRUD Operations
• Create new tables & records
• Retrieve records from tables
• Update tables definition and records data
• Delete existing tables and records
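As a minimal sketch, the four CRUD operations can be exercised against an in-memory SQLite database (table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create: a new table and a new record
conn.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO students VALUES (1, 'Asha')")

# Retrieve: read records back from the table
rows = conn.execute("SELECT name FROM students").fetchall()

# Update: change existing record data
conn.execute("UPDATE students SET name = 'Asha K' WHERE roll_no = 1")
updated = conn.execute("SELECT name FROM students WHERE roll_no = 1").fetchone()[0]

# Delete: remove records (a table itself is deleted with DROP TABLE)
conn.execute("DELETE FROM students WHERE roll_no = 1")
remaining = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
```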
10. What is OLTP?
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can
assume that OLTP systems provide source data to data warehouses, whereas OLAP systems
help to analyze it.
Online Transaction Processing is characterised by a large number of short online
transactions:
• INSERT
• UPDATE
• DELETE
OLTP systems are used for order entry, financial transactions, CRM (Customer
Relationship Management), retail sales, etc. Such systems have a large number of users
who conduct short transactions.
An important attribute of an OLTP system is its ability to maintain concurrency. To avoid
single points of failure, OLTP systems are often decentralized.
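As an illustrative sketch (account data invented), here is a short OLTP-style transaction in Python's sqlite3: a funds transfer that either fully commits or fully rolls back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acct_id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 100.0)])

try:
    # The connection as a context manager opens a transaction:
    # it commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE acct_id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE acct_id = 2")
except sqlite3.Error:
    pass  # on failure, neither UPDATE would be visible

balances = dict(conn.execute("SELECT acct_id, balance FROM accounts"))
```

Concurrency control in a production OLTP system is far more involved, but the all-or-nothing shape of each short transaction is the same.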
11. Why is OLTP Important?
Source of data (operational data)
Controls and runs fundamental business tasks
Reveals a snapshot of ongoing business processes
Short and fast inserts and updates initiated by end users
Typically very fast (performance optimised)
Space requirements: can be relatively small if historical data is archived
Database design highly optimised
Operational data is critical to running the business; therefore back it up religiously
12. Design Principles
Application oriented
Used to run the business
Detailed data
Current, up-to-date
Isolated data
Repetitive access
Clerical users
Performance sensitive
Few records accessed at a time (tens)
Read/update access
No data redundancy
Database size: 100 MB - 100 GB
14. Data Warehouse
In computing, a data warehouse (DW or DWH), also known as
an enterprise data warehouse (EDW), is a system used for
reporting and data analysis, and is considered a core
component of business intelligence.
DWs are central repositories of integrated data from one or
more disparate sources. They store current and historical data
in one single place and are used for creating analytical reports
for knowledge workers throughout the enterprise. Examples of
reports could range from annual and quarterly comparisons
and trends to detailed daily sales analysis.
The data stored in the warehouse is uploaded from the
operational systems (such as marketing or sales). The data may
pass through an operational data store and may require data
cleansing for additional operations to ensure data quality
before it is used in the DW for reporting.
15. Data Warehouse continued..
A collection of data that is used primarily in organisational
decision making
A decision support database that is maintained separately
from the organisation’s operational databases.
A data warehouse is a
• subject-oriented,
• integrated,
• time-varying,
• non-volatile
collection of data in support of management's decision-making.
16. What is a Data Warehouse?
A single, complete and consistent store of
data obtained from a variety of different
sources, made available to end users in a
way they can understand and use in a
business context.
[Barry Devlin]
17. Characteristics of Data Warehouse
Subject oriented: Data are organised based on how the
users refer to them.
Integrated: All inconsistencies regarding naming convention
and value representations are removed.
Nonvolatile: Data are stored in read-only format and do not
change over time.
Time variant: Data are not current but normally time series.
18. Why Separate Data
Warehouse?
Performance
• Operational databases designed & tuned for known workloads
• Complex OLAP queries would degrade performance, taxing operations
• Special data organisation, access & implementation methods needed for
multidimensional views & queries
Function
• Missing data: Decision support requires historical data, which operational databases
do not typically maintain
• Data consolidation: Decision support requires consolidation (aggregation,
summarisation) of data from many heterogeneous sources: operational databases,
external sources.
• Data quality: Different sources typically use inconsistent data representations, codes,
and formats which have to be reconciled.
19. The Complete Decision Support
System (Source: Franconi)
[diagram: Information sources (operational DBs, semistructured sources) feed the data warehouse server and data marts (Tier 1) via extract, transform, load and refresh; OLAP servers, e.g. MOLAP and ROLAP (Tier 2), serve clients/DSS (Tier 3) for analysis, query/reporting, and data mining]
20. Three-Tier Architecture
Warehouse database server
Almost always a relational DBMS; rarely flat files
OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
Multidimensional OLAP (MOLAP): special purpose server that directly implements
multidimensional data and operations.
Clients
Query and reporting tools
Analysis tools
Data mining tools (e.g., trend analysis, prediction)
21. Data Marts
A data mart is a scaled down version of a data warehouse that focuses on a particular
subject area.
A data mart is a subset of an organisational data store, usually oriented to a specific
purpose or major data subject, that may be distributed to support business needs.
Data marts are analytical data stores designed to focus on specific business functions
for a specific community within an organisation.
Usually designed to support the unique business requirements of a specified
department or business process
Implemented as the first step in proving the usefulness of the technologies to solve
business problems
E.g. departmental subsets that focus on selected subjects. Marketing data mart:
customer, products, sales
22. Why Data mart?
A data mart is the access layer of the data warehouse environment that is
used to get data out to the users.
The data mart is a subset of the data warehouse and is usually oriented to a
specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In
some deployments, each department or business unit is considered the owner
of its data mart including all the hardware, software and data.
This enables each department to isolate the use, manipulation and
development of their data. In other deployments where conformed
dimensions are used, this business unit ownership will not hold true for
shared dimensions like customer, product, etc.
Organizations build data warehouses and data marts because the information
in the database is not organized in a way that makes it readily accessible,
requiring queries that are too complicated or resource-consuming.
23. From the Data Warehouse to
Data Marts
[diagram: the organisationally structured data warehouse (normalised, detailed data, more history) feeds departmentally structured data marts and individually structured information (less history)]
24. Characteristics of the
Departmental Data Mart
• Small
• Flexible
• Customised by department
• OLAP
• Source is the departmentally structured data warehouse
[diagram: data warehouse feeding the data mart]
25. The Meta Data
The last and one of the most important components of DW
environments.
It is information that is kept about the
warehouse rather than information kept within
the warehouse.
The metadata is simply data about data.
It is important for designing, constructing,
retrieving, and controlling the warehouse data.
26. Types of Meta Data
Technical metadata: Include where the data come from, how
the data were changed, how the data are organised, how the data
are stored, who owns the data, who is responsible for the data and
how to contact them, who can access the data , and the date of last
update.
Business metadata: Include what data are available, where the
data are, what the data mean, how to access the data, predefined
reports and queries, and how current the data are.
27. Application of Data Ware
House
Industry: Application
Finance: Credit card analysis
Insurance: Claims and fraud analysis
Telecommunication: Call record analysis
Transport: Logistics management
Consumer goods: Promotion analysis
Data service providers: Value-added data
Utilities: Power usage analysis
28. What is OLAP?
Definition - OLAP performs
multidimensional analysis of business
data and provides the capability for complex
calculations, trend analysis, and
sophisticated data modelling, thereby
providing users the insight and understanding
they need for better decision making.
Users can pivot, filter, drill down and drill
up data and generate a number of views.
Application - It is the foundation for many
kinds of business applications for Business
Performance Management, Planning,
Budgeting, Forecasting, Financial
Reporting, Analysis, Simulation
Models, Knowledge Discovery, and Data
Warehouse Reporting.
29. An OLAP structure created from the operational data is called an
OLAP cube. The cube holds data more like a 3-D spreadsheet than a
relational database, allowing different views of the data to be
displayed quickly.
30. The term OLAP was first introduced by E. F. Codd, who pioneered
Relational Database Management Systems (RDBMS). Below are
the twelve rules defined by Codd that OLAP technology
must support.
Multidimensional conceptual view: Supports EIS (Executive Information System) slice-and-dice operations and is usually required in financial modelling.
Transparency: Part of an open system that supports heterogeneous data sources. The end user should not be concerned about the details of data access or conversions.
Accessibility: Presents the user with a single logical schema of the data. OLAP engines act as middleware, sitting between heterogeneous data sources and an OLAP front-end.
Consistent reporting performance: Performance should not degrade as the number of dimensions in the model increases.
Client/server architecture: Requires open, modular systems. Not only should the product be client/server, but the server component of an OLAP product should allow various clients to be attached with minimum effort and programming for integration.
Generic dimensionality: Not limited to 3-D and not biased toward any particular dimension. A function applied to one dimension should also be applicable to another.
Dynamic sparse-matrix handling: Related both to the idea of nulls in relational databases and to the notion of compressing large files; a sparse matrix is one in which not every cell contains data. OLAP systems should accommodate varying storage and data-handling options.
Multiuser support: Supports multiple concurrent users, including their individual views or slices of a common database.
Unrestricted cross-dimensional operations: All dimensions are created equal, so all forms of calculation must be allowed across all dimensions, not just the measures dimension.
Intuitive data manipulation: Users shouldn't have to use menus or perform complex multi-step operations when an intuitive drag-and-drop action will do.
Flexible reporting: Users should be able to print just what they need, and any changes to the underlying model should be automatically reflected in reports.
Unlimited dimensional and aggregation levels: Supports at least 15, and preferably 20, dimensions.
31. The OLAP Report, one of the most internationally authoritative sources of information on OLAP
products and applications, defines OLAP in five keywords: Fast Analysis of Shared
Multidimensional Information, or FASMI for short.
Fast
The system is targeted to deliver most responses to users within about five
seconds, with the simplest analyses taking no more than one second and very
few taking more than 20 seconds.
Analysis
The system can cope with any business logic and statistical analysis that is
relevant for the application and the user, and keep it easy enough for the target
user.
Shared
The system implements all the security requirements for confidentiality and, if
multiple write access is needed, concurrent update locking at an appropriate
level. Not all applications need users to write data back, but for the growing
number that do, the system should be able to handle multiple updates in a
timely, secure manner.
Multidimensional
The system must provide a multidimensional conceptual view of the data,
including full support for hierarchies and multiple hierarchies.
Information
The capacity of various products is measured in terms of how much input data
they can handle, not how many gigabytes they take to store it.
33. Slice
• Performs a selection on one dimension of the given cube, resulting in a sub-cube.
• Reduces the dimensionality of the cube.
• Sets one or more dimensions to specific values and keeps a subset of dimensions for the selected values.
34. Dice
• Define a sub-cube by performing a selection of one or more dimensions.
• Refers to a range selection condition on one dimension, or to selection
conditions on more than one dimension.
• Reduces the number of member values of one or more dimensions.
Pivot (or rotate)
• Rotates the data axis to view the data from different perspectives.
• Groups data with different dimensions.
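As an illustrative sketch (the data is invented), the slice and dice operations above can be mimicked on a tiny cube stored as a Python dict keyed by (product, region, year):

```python
# A tiny 3-D "cube": (product, region, year) -> sales (illustrative data)
cube = {
    ("TV",    "North", 2015): 10, ("TV",    "South", 2015): 7,
    ("TV",    "North", 2016): 12, ("TV",    "South", 2016): 9,
    ("Radio", "North", 2015): 4,  ("Radio", "South", 2015): 6,
    ("Radio", "North", 2016): 5,  ("Radio", "South", 2016): 8,
}

def slice_cube(cube, year):
    """Slice: fix one dimension (year), reducing the cube to 2-D."""
    return {(p, r): v for (p, r, y), v in cube.items() if y == year}

def dice_cube(cube, products, regions):
    """Dice: select values on more than one dimension, giving a sub-cube."""
    return {k: v for k, v in cube.items()
            if k[0] in products and k[1] in regions}

sliced = slice_cube(cube, 2015)              # 2-D view for the year 2015
diced = dice_cube(cube, {"TV"}, {"North"})   # sub-cube: TV sales in the North
```

A pivot would simply present the same cells with the axes swapped, e.g. regions as rows and products as columns.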
35. OLAP Architectures
MOLAP:
• Information retrieval is fast.
• Uses sparse arrays to store data sets.
• Best suited for inexperienced users, since it is very easy to use.
• Maintains a separate database for data cubes.
• DBMS facility is weak.
• Static database.
ROLAP:
• Information retrieval is comparatively slow.
• Uses relational tables.
• Best suited for experienced users.
• May not require space other than that available in the data warehouse.
• DBMS facility is strong.
• Dynamic database.
36. Dimensional Modelling
Dimensional modelling is a method of data modelling that stores
data in such a way that it is relatively easy to retrieve from the
database.
Different ways of storing data give us different advantages.
For example, ER modelling gives us the advantage of storing
data with less redundancy. Dimensional modelling, on the other
hand, gives us the advantage of storing data in a fashion that
makes it easier to retrieve information once the data is stored in
the database.
37. Dimensional Modeling V/S
ER Modeling
Dimensional Models are designed for reading, summarising
and analysing numeric information, whereas Relational
Models are optimised for adding and maintaining data using
real-time operational systems.
38. Dimensional Modeling
It is composed of "fact" and "dimension" tables.
A "fact" is a numeric value that a business wishes to count or
sum
A "dimension" is essentially an entry point for getting at the
facts. Dimensions are things of interest to the business.
40. Star schema
The star schema architecture is the simplest data warehouse
schema.
It is called a star schema because the diagram resembles a star,
with points radiating from a centre.
The centre of the star consists of a fact table, and the points of
the star are the dimension tables.
42. Star Schema
Fact tables: A fact table typically has two types of columns:
foreign keys to dimension tables, and measures, which contain
numeric facts. A fact table can contain fact data at a detail or
aggregated level.
A dimension is a structure, usually composed of one or more
hierarchies, that categorises data.
http://datawarehouse4u.info/Data-warehouse-schema-architecture-star-
schema.html
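To make the fact/dimension split concrete, here is a hypothetical star schema sketched in SQLite (table names, columns, and data are invented for illustration): dimension tables surround a fact table whose columns are foreign keys plus a numeric measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: entry points for getting at the facts
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day  TEXT);

    -- Fact table: foreign keys to the dimensions plus a numeric measure
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product (product_id),
        store_id   INTEGER REFERENCES dim_store (store_id),
        date_id    INTEGER REFERENCES dim_date (date_id),
        amount     REAL
    );
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'TV')")
conn.execute("INSERT INTO dim_store VALUES (1, 'Mumbai')")
conn.execute("INSERT INTO dim_date VALUES (1, '2016-01-01')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 250.0)")

# A typical star-schema query: join the fact to a dimension and aggregate
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    WHERE s.city = 'Mumbai'
""").fetchone()[0]
```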
43. Snowflake Schema
The snowflake schema is a more complex variation of the star
schema used in a data warehouse, in which the tables that
describe the dimensions are normalised.
47. ETL process
The process of extracting data from source systems and
bringing it into the data warehouse is commonly called ETL,
which stands for extraction, transformation, and loading.
48. ETL Steps
Initiation
Build reference data
Extract from sources
Validate
Transform
Load into staging tables
Audit reports
Publish
Archive
Clean up
50. Steps of ETL process
Extracts data from homogeneous or
heterogeneous data sources
Transforms the data for storing it in proper
format or structure for querying and analysis
purpose
Loads it into the final target (database, more
specifically, operational data store, data mart,
or data warehouse)
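A minimal end-to-end sketch of these three steps, using only Python's standard library (the CSV content and table names are invented):

```python
import csv
import io
import sqlite3
from datetime import datetime

# Source data, as it might arrive from an operational system
source_csv = "order_id,order_date,amount\n1,31/12/2016,100.5\n2,01/01/2017,80.0\n"

# Extract: read rows out of the source file
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: normalise dd/mm/yyyy dates into the ISO yyyy-mm-dd format
for row in rows:
    row["order_date"] = datetime.strptime(
        row["order_date"], "%d/%m/%Y").strftime("%Y-%m-%d")

# Load: write the cleaned rows into the target table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (:order_id, :order_date, :amount)", rows)

loaded = conn.execute("SELECT order_date FROM orders ORDER BY order_id").fetchall()
```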
51. Extraction
Extracting the data from different sources – the data
sources can be files (like CSV, JSON, XML) or RDBMS
etc.
This is the first step in ETL process. It covers data extraction
from the source system and makes it accessible for further
processing. The main objective of the extraction step is to
retrieve all required data from source system with as little
resources as possible. The extraction step should be designed in
a way that it does not negatively affect the source system. Most
data projects consolidate data from different source systems.
Each separate source uses a different format. Common data-source
formats include relational databases and flat files such as CSV,
JSON, and XML. Thus the extraction process must convert the data
into a format suitable for further transformation.
52. Transformation
Transforming the data – this may involve
cleaning, filtering, validating and applying
business rules.
In this step, certain rules are applied on the extracted
data. The main aim of this step is to load the data to
the target database in a cleaned and general format
(depending on the organisation's requirement). This
is because when data is collected from different
sources, each source will have its own standards.
For example, if we have two different data sources A
and B: in source A the date format is dd/mm/yyyy,
and in source B it is yyyy-mm-dd.
53. Transformation continued..
In the transforming step we convert these dates to a general format. The
other things that are carried out in this step are:
Cleaning (e.g. “Male” to “M” and “Female” to “F” etc.)
Filtering (e.g. selecting only certain columns to load)
Enriching (e.g. Full name to First Name , Middle Name , Last Name)
Splitting a column into multiple columns and vice versa
Joining together data from multiple sources
In some cases data does not need any transformations and here the data is
said to be “rich data” or “direct move” or “pass through” data.
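The cleaning and enriching examples above can be sketched as small Python functions (the sample values and function names are illustrative):

```python
def clean_gender(value):
    """Cleaning: map free-form gender values to a single-letter code."""
    return {"Male": "M", "Female": "F"}.get(value, value)

def split_full_name(full_name):
    """Enriching: split 'First Middle Last' into separate fields."""
    first, middle, last = full_name.split(" ", 2)
    return {"first": first, "middle": middle, "last": last}

g = clean_gender("Male")
name = split_full_name("Amit Kumar Shah")
```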
54. Loading
Loading - data is loaded into a data warehouse or
any other database or application that houses data.
This is the final step in the ETL process. In this step, the
extracted and transformed data is loaded into the target
database. To make the load efficient, it is common to disable
indexes and constraints before loading and rebuild them
afterwards.
All three steps of the ETL process can run in parallel. Data
extraction takes time, so the transformation step is executed
simultaneously, preparing data for loading. As soon as some data
is ready, it is loaded without waiting for the previous steps to
complete.
55. ETL Tools
1. Oracle Warehouse Builder (OWB)
2. SAP Data Services.
3. IBM Infosphere Information Server.
4. SAS Data Management.
5. PowerCenter Informatica.
6. Elixir Repertoire for Data ETL.
7. Data Migrator (IBI)
8. SQL Server Integration Services (SSIS)