KUMARAGURU COLLEGE OF TECHNOLOGY
COIMBATORE
DATA WAREHOUSING AND DATA MINING
Presented by
K. Santhosh (07bcs43)
E-mail: ksanthoshselvam@gmail.com
Contact No.: 9788153199
V. Siddharth (07bcs50)
E-mail: siddharthindian@yahoo.com
Contact No.: 9843286841
ABSTRACT:
Fast, accurate and scalable data analysis techniques are needed to extract useful
information from huge volumes of data. A data warehouse is a single, integrated source of
decision-support information, formed by collecting data from multiple sources, internal to
the organization as well as external, and transforming and summarizing this information
to enable improved decision making. A data warehouse is designed for easy access by
users to large amounts of information, and data access is typically supported by
specialized analytical tools and applications. Typical applications include decision
support systems and executive information systems.
Data mining is the exploration and analysis of large quantities of data in order to
discover valid, novel, potentially useful, and ultimately understandable patterns in data. It
is "an information extraction activity whose goal is to discover hidden facts contained in
databases": the process of extracting valid, previously unknown, comprehensible and
actionable information from large databases and using it to make crucial business
decisions. Data mining finds patterns and subtle relationships in data and infers rules that
allow the prediction of future results. A data mining model is a description of a specific
aspect of a dataset; it produces output values for an assigned set of input values. Typical
applications include market segmentation, customer profiling, fraud detection, evaluation
of retail promotions, and credit risk analysis.
Introduction:
Increasingly, organizations are analyzing current and historical data to identify useful
patterns and support business strategies. A large amount of the right information is the
key to survival in today's competitive environment, and this kind of information can be
made available only if there is a totally integrated enterprise data warehouse.
What is data warehousing?
A data warehouse is a subject-oriented, integrated, non-volatile and time-variant
collection of data in support of management's decisions.
NEED FOR A DATA WAREHOUSE :
• IT or business staff spending a lot of time developing special reports for decision-
makers.
• Lots of PC-based or small-server systems obtaining extracts of data, incapable of
presenting a holistic view of the entire gamut of information.
• The same data present on different systems and in different departments, with users
unaware of this fact.
• Difficulty in getting meaningful information in a timely manner.
• Multiple systems giving different answers to the same business questions.
• Little analysis by decision makers and policy planners due to the non-availability of
sophisticated tools and easily decipherable, timely and comprehensive information.
PURPOSE OF A DATA WAREHOUSE:
• Better business intelligence for end users.
• Reduction in time to locate, access and analyze information.
• Consolidation of disparate information sources.
• Replacement of older, less-responsive decision support systems
• Faster time to market for products and services
• Strategic advantage over competitors
Data Warehouse Characteristics:
1. Subject-oriented: the warehouse is organized around the major subjects of the
enterprise rather than the major application areas. This is reflected in the need to store
decision-support data rather than application-oriented data.
2. Integrated: the source data come together from different enterprise-wide application
systems. The source data are often inconsistent, and the integrated data source must be
made consistent to present a unified view of the data to the users.
3. Time-variant: the source data in the warehouse is only accurate and valid at some
point in time or over some time interval. The time-variance of the data warehouse is
also shown in the extended time that the data is held, the implicit or explicit
association of time with all data, and the fact that the data represents a series of
snapshots.
4. Non-volatile: data is not updated in real time but is refreshed from operational
systems on a regular basis. New data is always added as a supplement to the database,
rather than a replacement. The database continually absorbs this new data,
incrementally integrating it with previous data.
DATA WAREHOUSE LIFE CYCLE:
Data warehousing is a concept, not a product that can be purchased off the shelf. It is a
set of hardware and software components integrated together that can be used to analyze
massive amounts of stored data efficiently. Building a warehouse is a process, and the
following are the five steps towards building a successful data warehouse:
1. JUSTIFICATION
2. REQUIREMENT ANALYSIS
3. DESIGN
4. DEVELOPMENT AND IMPLEMENTATION
5. DEPLOYMENT
Main Components:
1. Operational data sources: data for the warehouse is supplied from mainframe
operational data held in first-generation hierarchical and network databases,
departmental data held in proprietary file systems, private data held on workstations
and private servers, and external systems such as the Internet, commercially available
databases, or databases associated with an organization's suppliers or customers.
2. Operational data store (ODS): a repository of current and integrated operational data
used for analysis. It is often structured and supplied with data in the same way as the
data warehouse, but may in fact simply act as a staging area for data to be moved into
the warehouse.
3. Load manager: also called the front-end component, it performs all the operations
associated with the extraction and loading of data into the warehouse. These
operations include simple transformations of the data to prepare the data for entry
into the warehouse.
4. Warehouse manager: performs all the operations associated with the management of
the data in the warehouse. The operations performed by this component include
analysis of data to ensure consistency, transformation and merging of source data,
creation of indexes and views, generation of denormalizations and aggregations, and
archiving and backing up data.
5. Query manager: also called the back-end component, it performs all the operations
associated with the management of user queries. The operations performed by this
component include directing queries to the appropriate tables and scheduling the
execution of queries.
6. Detailed data, lightly and highly summarized data, and archive/backup data.
7. Meta-data.
8. End-user access tools: these can be categorized into five main groups: data reporting
and query tools, application development tools, executive information system (EIS)
tools, online analytical processing (OLAP) tools, and data mining tools.
Data Flows
1. Inflow: the processes associated with the extraction, cleansing, and loading of the
data from the source systems into the data warehouse.
2. Upflow: the processes associated with adding value to the data in the warehouse
through summarizing, packaging, and distribution of the data.
3. Downflow: the processes associated with archiving and backing up of data in the
warehouse.
4. Outflow: the processes associated with making the data available to the end-users.
5. Meta-flow: the processes associated with the management of the meta-data.
Tools and Technologies:
The critical steps in the construction of a data warehouse are:
a. Extraction
b. Cleansing
c. Transformation
After these critical steps, loading the results into the target system can be carried out
either by separate products or by a single integrated solution. The tools fall into three
categories:
1. Code generators
2. Database data replication tools
3. Dynamic transformation engines
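The extraction, cleansing, transformation and loading steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real ETL product: the source records, field names and target table are all hypothetical.

```python
import sqlite3

# Illustrative "extracted" source records (all names and values hypothetical).
source_rows = [
    {"cust": " Alice ", "amount": "120.50"},
    {"cust": "Bob", "amount": "bad-value"},
    {"cust": "alice", "amount": "30.00"},
]

# Cleansing: normalize names, discard rows whose amount fails to parse.
clean = []
for row in source_rows:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue
    clean.append({"cust": row["cust"].strip().lower(), "amount": amount})

# Transformation: summarize per customer before loading.
totals = {}
for row in clean:
    totals[row["cust"]] = totals.get(row["cust"], 0.0) + row["amount"]

# Loading: write the summarized data into the target system (SQLite here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_summary (cust TEXT PRIMARY KEY, total REAL)")
db.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())
print(dict(db.execute("SELECT * FROM sales_summary")))  # {'alice': 150.5}
```

In practice the same three steps are performed at much larger scale by the code generators, replication tools and transformation engines listed above.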
The importance of managing meta-data (integration):
1. The integration of meta-data, that is, "data about data".
2. Meta-data is used for a variety of purposes, and its management is a critical issue in
achieving a fully integrated data warehouse.
3. The major purpose of meta-data is to show the pathway back to where the data began,
so that the warehouse administrators know the history of any item in the warehouse.
4. The meta-data associated with data transformation and loading must describe the
source data and any changes that were made to the data.
5. The meta-data associated with data management describes the data as it is stored in
the warehouse.
6. Meta-data is required by the query manager to generate appropriate queries, and is
also associated with the users of queries.
Data Warehousing Issues:
1. Semantic integration: when getting data from multiple sources, mismatches (e.g.,
different currencies or database schemas) must be eliminated.
2. Heterogeneous sources: data must be accessed from a variety of source formats and
repositories. Replication capabilities can be exploited here.
3. Load, refresh, purge: data must be loaded, periodically refreshed, and too-old data
purged.
4. Metadata management: the source, loading time, and other information must be
tracked for all data in the warehouse.
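The semantic-integration issue above can be illustrated with a small sketch: two hypothetical source systems use different field names and currencies, and their records are normalized to one schema in one currency before loading. The rates, field names and records are all illustrative assumptions.

```python
# Illustrative fixed conversion rates (an assumption; real systems would
# use current rates from a reference source).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}

# Two sources with mismatched schemas and currencies (hypothetical data).
source_a = [{"customer": "Alice", "amount_usd": 100.0}]
source_b = [{"cust_name": "Bob", "amount": 200.0, "ccy": "EUR"}]

# Normalize both into a single unified schema, all amounts in USD.
unified = []
for rec in source_a:
    unified.append({"customer": rec["customer"], "amount_usd": rec["amount_usd"]})
for rec in source_b:
    unified.append({"customer": rec["cust_name"],
                    "amount_usd": rec["amount"] * RATES_TO_USD[rec["ccy"]]})

print(unified)  # both records now share one schema and one currency
```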
Star Schema:
A logical structure that has a fact table containing factual data in the center,
surrounded by dimension tables containing reference data (which can be denormalized)
Snowflake Schema:
A variant of the star schema where dimension tables do not contain denormalized
data.
Starflake Schema:
A hybrid structure that contains a mixture of star and snowflake schemas.
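A star schema can be sketched concretely with SQLite from Python's standard library. The fact and dimension tables below (a hypothetical retail example; all table and column names are illustrative) show the central fact table surrounded by denormalized dimension tables, and a typical analytical query that joins them.

```python
import sqlite3

# Build a tiny star schema: one fact table, two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY,
                          day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product,
                          date_id INTEGER REFERENCES dim_date,
                          units INTEGER, revenue REAL);
""")
db.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
db.execute("INSERT INTO dim_date VALUES (10, '01', 'Jan', 2010)")
db.execute("INSERT INTO fact_sales VALUES (1, 10, 5, 49.5)")

# A typical decision-support query joins the fact table to its dimensions.
row = db.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.year
""").fetchone()
print(row)  # ('Hardware', 2010, 49.5)
```

In a snowflake schema, `dim_product` would itself be normalized, e.g. `category` moved into its own table referenced by a foreign key.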
The benefits of data warehousing:
The potential benefits of data warehousing include:
1. High returns on investment
2. Substantial competitive advantage
3. Increased productivity of corporate decision-makers
4. More cost-effective decision making
5. Better enterprise intelligence
6. Enhanced customer service
7. Better asset/liability management
8. Business process reengineering
9. Empowerment of all employees
Applications:
On-Line Transaction Processing (OLTP):
OLTP systems are among the major kinds of enterprise applications. Examples: order
entry systems, inventory control systems, reservation systems, point-of-sale systems,
tracking systems, etc.
Executive Information Systems (EIS):
These present information at the highest level of summarization using corporate business
measures. They are designed for extreme ease of use and, in many cases, only a mouse is
required. Graphics are usually generously incorporated to provide at-a-glance indications
of performance.
Decision Support Systems (DSS):
These ideally present information in graphical and tabular form, providing the user with
the ability to drill down on selected information, with increased detail and data
manipulation options.
DATA MINING
What is data mining?
Data mining refers to the process of analyzing data from different perspectives and
summarizing it into useful information. Data mining software is one of a number of tools
used for analyzing data. It allows users to analyze data from many different dimensions
or angles, categorize it, and summarize the relationships identified. Data mining is about
techniques for finding and describing structural patterns in data.
Definition:
Data mining is the process of finding correlations or patterns among fields in large
relational databases: the process of extracting valid, previously unknown,
comprehensible, and actionable information from large databases and using it to make
crucial business decisions (Simoudis, 1996).
Different Types of Data Mining:
1. Business data mining
2. Scientific data mining
3. Internet data mining
Five major elements of Data Mining:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.
Requirements of Data Mining:
1. Handling of different types of data
2. Efficiency and scalability of algorithms
3. Usefulness, certainty and expressiveness of results
4. Expression of various kinds of mining results
5. Interactive mining of knowledge at multiple levels
6. Mining information from different sources of data
7. Protection of privacy and data security
Various kinds of data on which Data Mining is applied:
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Multimedia databases
5. Spatial and temporal data
6. Object-relational databases
Data mining applications:
A major application of data mining is Web mining.
What is Web Mining?
“Web mining can be broadly defined as the automated discovery
and analysis of useful information from the Web documents and services using data
mining techniques.”
Web mining is the application of data mining or other information processing
techniques to the World Wide Web, to find useful patterns. People can take advantage of
these patterns to access the WWW more efficiently.
NEED FOR WEB MINING:
Nowadays, the World Wide Web is a popular and interactive medium, ideal for
publishing information. It is huge, diverse and dynamic, and thus raises issues of
scalability, multimedia data and temporal data respectively. As a result, users are
currently "drowning" in an information overload that expands at a rate that far outpaces
human ability to process and exploit it.
Domains of Web Mining:
There are three domains that pertain to Web mining:
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
1. Web Content Mining
Web content mining is an automatic process that extracts patterns from on-line
information, such as HTML files, images, or e-mails, and it already goes beyond mere
keyword extraction or simple statistics of words and phrases in documents. Web content
mining is the "process of information or resource discovery from millions of sources
across the World Wide Web". There are two approaches in Web content mining:
1. Agent-based approaches
2. Database approaches
Agent-Based approaches:
The agent-based approach involves artificial intelligence systems that can "act
autonomously or semi-autonomously on behalf of a particular user, to discover and
organize Web-based information ". Some intelligent Web agents can use a user profile to
search for relevant information, then organize and interpret the discovered information
(e.g., Harvest).
Database approaches:
The database approach focuses on "integrating and organizing the heterogeneous
and semi-structured data on the Web into more structured and high-level collections of
resources." These metadata "are organized into structured collections (e.g., relational or
object-oriented databases) and can be analyzed".
2. Web Structure Mining
Web structure mining deals with the data that describes the organization of content.
Intra-page structure information includes the arrangement of various HTML or XML
tags within a given page. This can be represented as a tree structure, where the <html>
tag becomes the root of the tree. The principal kind of inter-page structure information is
the hyperlinks connecting one page to another.
3. Web Usage Mining
Web servers record and accumulate data about user interactions whenever requests for
resources are received. Analyzing the Web access logs of different Web sites can help in
understanding user behavior and the Web structure, thereby improving the design of this
colossal collection of resources.
Web Mining Techniques
The common techniques for Web mining are:
1. Clustering/classification
2. Association rules
3. Path analysis
4. Sequential patterns
1. Clustering/classification
This technique is used to develop profiles of items with similar characteristics. This
ability enhances the discovery of relationships that are otherwise not obvious. E.g.,
classification of Web access logs allows a company to discover the average age of
customers who order a certain product.
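The average-age example above can be sketched as a simple grouping over hypothetical access-log records (the field names and data are illustrative assumptions, not a real log format):

```python
# Hypothetical enriched Web access-log records: each record links a
# customer's age to the product they ordered.
log = [
    {"customer_age": 25, "product": "laptop"},
    {"customer_age": 35, "product": "laptop"},
    {"customer_age": 60, "product": "radio"},
]

# Group ages by product ordered.
groups = {}
for rec in log:
    groups.setdefault(rec["product"], []).append(rec["customer_age"])

# Average customer age per product, as in the classification example above.
avg_age = {prod: sum(ages) / len(ages) for prod, ages in groups.items()}
print(avg_age)  # {'laptop': 30.0, 'radio': 60.0}
```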
2. Association rules
Rules that govern "databases of transactions where each transaction consists of a
set of items." This technique is used to predict the correlation of items "where the
presence of one set of items in a transaction implies (with a certain degree of confidence)
the presence of other items."
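The "degree of confidence" mentioned above can be computed directly: confidence(A implies B) is the fraction of transactions containing A that also contain B. A minimal sketch over hypothetical market-basket transactions (item names and data are illustrative):

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def confidence(antecedent, consequent):
    # confidence(A -> B) = count(A and B together) / count(A)
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    return len(with_both) / len(with_a)

# Bread appears in 3 baskets; milk appears in 2 of those 3.
print(confidence("bread", "milk"))  # 0.666...
```

Real association-rule miners (e.g. the Apriori approach) additionally require a minimum support before computing confidence, so that rules are backed by enough transactions.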
3. Path analysis
A Technique that involves the generation of some form of graph that "represents
relation[s] defined on Web pages." This can be the physical layout of a Web site in which
the Web pages are nodes and the hypertext links between these pages are directed edges.
Eg: what paths do users travel before they go to a particular URL.
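The graph described above can be sketched by treating pages as nodes and user click transitions as directed edges, then counting edge traversals over hypothetical per-session page sequences (the URLs and sessions are illustrative):

```python
# Hypothetical user sessions: each is the ordered list of pages visited.
sessions = [
    ["/home", "/products", "/checkout"],
    ["/home", "/products", "/reviews"],
    ["/home", "/checkout"],
]

# Count how often each directed edge (page -> next page) is traversed.
edge_counts = {}
for pages in sessions:
    for src, dst in zip(pages, pages[1:]):
        edge_counts[(src, dst)] = edge_counts.get((src, dst), 0) + 1

# The most-traveled edge approximates the most common navigation step.
most_common = max(edge_counts, key=edge_counts.get)
print(most_common, edge_counts[most_common])  # ('/home', '/products') 2
```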
4. Sequential patterns
Applied to Web access server transaction logs. The purpose is to discover
sequential patterns that indicate user visit patterns over a certain period.
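Sequential-pattern discovery differs from path analysis in that the items need not be adjacent, only ordered. A minimal counting sketch over the same kind of hypothetical session data (URLs and sessions are illustrative):

```python
from collections import Counter

# Hypothetical sessions: ordered page visits per user.
sessions = [
    ["/home", "/faq", "/buy"],
    ["/home", "/buy"],
    ["/faq", "/home", "/buy"],
]

# Count every ordered pair of pages within a session (not necessarily
# adjacent) to surface recurring visit patterns.
pair_counts = Counter()
for pages in sessions:
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            pair_counts[(pages[i], pages[j])] += 1

print(pair_counts.most_common(1))  # [(('/home', '/buy'), 3)]
```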
Web mining as a tool:
Web mining can be a promising tool to address the shortcomings of search engines,
such as incomplete indexing and the unverified reliability of retrieved information. Web
mining not only discovers information from mounds of data on the WWW, but also
monitors and predicts user visit habits. This gives designers more reliable information for
structuring and designing a Web site. Web mining technology can, for example, help
librarians design Web sites with paths that can be traveled easily by end users, saving
time and effort (e.g., Web mining technology in academic librarianship).
Conclusion:
Data warehousing provides the means to change raw data into information for making
effective business decisions: the emphasis is on information, not data. The data
warehouse is the hub for decision-support data.
Data mining is a useful tool with multiple algorithms that can be tuned for specific
tasks. It can benefit business, medicine, and science, though it needs more efficient
algorithms to speed up the mining process. Web mining is a huge, interdisciplinary and
very dynamic scientific area, converging from several research communities such as
databases, information retrieval and artificial intelligence, especially machine learning
and natural language processing. This area is so broad today partly due to the interests of
various research communities.