1-_Intro_to_Data_Minning__DWH.ppt

Lecture 1
Ms. Qurat-ul-Ain
Email: quratulain.raja15@gmail.com

Course Details
 Course Title: Data Warehousing & Data Mining
 Credit Hours: 3
 Course Prerequisite: DBMS

Course Contents
Data Warehousing Concepts, Data Warehousing System And Components
Data Transformation Process Functions
Online Analytical Processing (OLAP) And OLAP Tools.
Data Crawling & Programming With Python
Data Warehousing Applications
Concepts Of Data Mining
Data Pre-processing, Pre-mining And Outlier Detection
Data Mining Learning Methods & Data Mining Classes (Association Rule Mining, Clustering, Classification)
Fundamentals of Other Algorithms Related To Data Mining (Fuzzy Logic, Genetic Algorithm And Neural Network)
Decision Trees
Web Mining

Text Books
 Data Warehousing Fundamentals, Paulraj Ponniah
 The Data Warehouse Toolkit, Ralph Kimball, John Wiley & Sons
 Decision Support in the Data Warehouse, Paul Gray and Hugh J. Watson, Prentice Hall
 Data Mining: Concepts and Techniques, Jiawei Han, Second Edition and above
 Data Mining and Analysis: Fundamental Concepts and Algorithms, 1st Edition, M. Zaki & W. Meira
 Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline Kamber, Jian Pei, 2011
 Anything that you can find to help you learn.

History of IT
 The “dark ages”: paper forms in file cabinets
 Computerized systems emerge
 Initially for big projects like Social Security
 Same functionality as old paper-based systems
 The “golden age”: databases are everywhere
 Most activities tracked electronically
 Stored data provides detailed history of activity
 The next step: use data for decision-making
 The focus of this course!
 Made possible by omnipresence of IT
 Identify inefficiencies in current processes
 Quantify likely impact of decisions

What Is a Data Warehouse?
 In many organizations, we want a central “store” of all of our entities,
concepts, metadata, and historical information
 For doing data validation, complex mining, analysis, prediction, …
 This is the data warehouse
 To this point we’ve focused on scenarios where the data “lives” in the
sources – here we may have a “master” version (and archival version) in a
central database
 For performance reasons, availability reasons, archival reasons, …

 More specifically, a collective data repository
– Containing snapshots of the operational data (history)
– Obtained through data cleansing (Extract-Transform-Load process)
– Useful for analytics

 Experts say…
– Ralph Kimball: “a copy of transaction data specifically structured for
query and analysis”
– Bill Inmon: “A data warehouse is a subject-oriented, integrated,
non-volatile, time-variant collection of data in support of
management’s decisions.”

Properties of a Data Warehouse
 Subject-oriented
– The data in the DWH is organized in such a way that all the data elements
relating to the same real-world event or object are linked together
– Typical subject areas in DWs are Customer, Product, Order, Claim,
Account,…

 Non-Volatile
– Data in the DW is never over-written or deleted - once committed, the data
is static, read-only, and retained for future reporting
– Data is loaded, but not updated
– When subsequent changes occur, a new version or snapshot record is
written,…

 Time-varying
– The changes to the data in the DW are tracked and recorded so that
reports show changes over time
– Different environments have different time horizons associated
• While for operational systems a 60-to-90 day time horizon is
normal, DWs have a 5-to-10 year horizon
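
The non-volatile and time-varying properties above can be illustrated with a minimal sketch. It assumes a toy in-memory table; the names `CustomerSnapshot`, `AppendOnlyStore`, `history`, and `as_of` are hypothetical and not part of any warehouse product. A change in the source is loaded as a new snapshot row rather than overwriting the old one, so past states remain available for reporting.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerSnapshot:
    """One immutable version of a customer record (hypothetical schema)."""
    customer_id: int
    city: str
    valid_from: date


class AppendOnlyStore:
    """Toy warehouse table: rows are loaded, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot):
        # A change in the source becomes a NEW row; old rows stay untouched.
        self._rows.append(snapshot)

    def history(self, customer_id):
        """All versions of one customer, oldest first."""
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return sorted(rows, key=lambda r: r.valid_from)

    def as_of(self, customer_id, when):
        """The version that was valid on a given date, or None."""
        valid = [r for r in self.history(customer_id) if r.valid_from <= when]
        return valid[-1] if valid else None


store = AppendOnlyStore()
store.load(CustomerSnapshot(1, "Lahore", date(2020, 1, 1)))
store.load(CustomerSnapshot(1, "Karachi", date(2022, 6, 1)))  # customer moved
```

Calling `store.as_of(1, date(2021, 1, 1))` returns the Lahore version, while a later date returns the Karachi version; both rows stay in the store.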

General Definition
– A large repository of some organization’s electronically stored data
– Specifically designed to facilitate reporting and analysis

Characteristics of DW
Subject oriented: Data are organized by how users refer to them
Integrated: Inconsistencies are removed in both nomenclature and conflicting information (i.e., data are ‘clean’)
Non-volatile: Read-only data. Data do not change over time.
Time variant: Data are time series, not current status

Subject Oriented
 Data Warehouse is designed around
“subjects” rather than processes
 A company may have
 Retail Sales System
 Outlet Sales System
 Catalog Sales System
 DW will have a Sales Subject Area

Subject Oriented
[Figure: the Retail Sales, Outlet Sales, and Catalog Sales systems (OLTP systems) feed a single Sales Subject Area, giving subject-oriented sales information in the data warehouse]

Integrated
 Heterogeneous Source Systems
 Need to Integrate source data
 For Example: Product codes could be different in different systems
 Arrive at common code in DW
 Information integrated in advance
 Stored in DW for direct querying and analysis
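
As a rough illustration of the product-code example above, the sketch below translates local codes from three source systems into one common warehouse code before loading. The code values, the `CODE_MAPS` table, and the `integrate` helper are all made up for illustration; a real integration step would be driven by metadata rather than a hard-coded dictionary.

```python
# Hypothetical per-system mappings from local product codes to one common DW code.
CODE_MAPS = {
    "retail":  {"R-100": "P001", "R-200": "P002"},
    "outlet":  {"OUT_A": "P001"},
    "catalog": {"7731": "P002"},
}


def integrate(source, rows):
    """Translate (local_code, quantity) rows from one source into DW rows."""
    mapping = CODE_MAPS[source]
    integrated = []
    for local_code, quantity in rows:
        if local_code not in mapping:
            # Unmapped codes are surfaced rather than silently loaded.
            raise KeyError(f"{source}: no common code for {local_code!r}")
        integrated.append({"product": mapping[local_code],
                           "quantity": quantity,
                           "source": source})
    return integrated


warehouse = []
warehouse += integrate("retail", [("R-100", 3), ("R-200", 1)])
warehouse += integrate("outlet", [("OUT_A", 5)])
warehouse += integrate("catalog", [("7731", 2)])

# Once codes are unified, a cross-system total is a simple aggregation.
totals = {}
for row in warehouse:
    totals[row["product"]] = totals.get(row["product"], 0) + row["quantity"]
```

Because the integration happens at load time, the later aggregation does not need to know that the three systems used different codes.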

Integrated
[Figure: each Source has an Extractor/Monitor feeding the Integration System (with Metadata), which loads the Data Warehouse queried by Clients]

Non-Volatile
 Operational update of data does not occur in the data warehouse
environment.
 Does not require transaction processing, recovery, and concurrency
control mechanisms
 Requires only two operations in data accessing:
 initial loading of data and access of data.

Non-Volatile (Read-Mostly)
[Figure: users write to and read from the OLTP system; users only read from the DW]

Time Variant
 The time horizon for the data warehouse is significantly longer than that of
operational systems.
 Operational database: current value data.
 Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)

Time Variant
 Most business analysis has a time component
 Trend Analysis (historical data is required)
[Figure: sales by year, 2001 to 2004]

Why a Data Warehouse (DWH)?
 Data recording and storage is growing.
 History is an excellent predictor of the future.
 Gives a total view of the organization.
 Intelligent decision-support is required for decision-making.

Reason-1: Why a Data Warehouse?
 Data sets are growing.
 The size of data sets is going up.
 The cost of data storage is coming down.
 The amount of data the average business collects and stores is doubling
every year
 Total hardware and software cost to store and manage 1 Mbyte of data
 1990: ~ $15
 2002: ~ 15¢ (down 100 times)
 By 2007: < 1¢ (down more than 1,500 times from 1990)

 A Few Examples
 WalMart: 24 TB
 France Telecom: ~ 100 TB
 CERN: Up to 20 PB by 2006
 Stanford Linear Accelerator Center (SLAC): 500TB

Caution!
A Warehouse of Data is NOT a Data Warehouse

Caution!
Size is NOT Everything

 Businesses demand Intelligence (BI).
 Complex questions from integrated data.
 “Intelligent Enterprise”

DBMS Approach
List of all items that were sold last month?
List of all items purchased by Tariq Majeed?
The total sales of the last month grouped by branch?
How many sales transactions occurred during the month of January?
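
Questions of this kind can be written as SQL in advance because we know exactly what we are asking. The sketch below uses Python's built-in `sqlite3` module with a made-up `sales` table; the schema, the rows, and the choice of January 2024 as "last month" are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, customer TEXT, branch TEXT, "
    "amount REAL, sold_on TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("Tea",   "Tariq Majeed",     "Lahore",  250.0, "2024-01-05"),
        ("Sugar", "Tariq Majeed",     "Lahore",  120.0, "2024-01-09"),
        ("Rice",  "Another Customer", "Karachi", 900.0, "2024-01-20"),
        ("Tea",   "Another Customer", "Karachi", 250.0, "2024-02-02"),
    ],
)

# "The total sales of the last month grouped by branch?"
# (January 2024 stands in for "last month" here.)
sales_by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sold_on BETWEEN '2024-01-01' AND '2024-01-31' "
    "GROUP BY branch ORDER BY branch"
).fetchall()

# "List of all items purchased by Tariq Majeed?"
items_for_tariq = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT item FROM sales WHERE customer = ? ORDER BY item",
        ("Tariq Majeed",),
    )
]

# "How many sales transactions occurred during the month of January?"
january_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sold_on LIKE '2024-01-%'"
).fetchone()[0]
```

Each query has a fixed shape that is known before the data is examined, which is what distinguishes this approach from the exploratory questions of the intelligent enterprise.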

Intelligent Enterprise
Which items sell together? Which items to stock?
Where and how to place the items? What discounts to offer?
How best to target customers to increase sales at a branch?
Which customers are most likely to respond to my next promotional campaign, and why?

Stages of Data Warehouse
 Businesses want much more…
 What happened?
 Why did it happen?
 What will happen?
 What is happening?
 What do you want to happen?

What is a Data Warehouse?
A complete repository of historical corporate data extracted
from transaction systems that is available for ad-hoc access
by knowledge workers.
 Complete repository
 History
 Transaction System
 Ad-Hoc access
 Knowledge workers

What is a Data Warehouse?
Transaction System
 Management Information System (MIS)
 Could be typed sheets (NOT transaction system)
Ad-Hoc access
 Does not have a certain access pattern.
 Queries not known in advance.
 Difficult to write SQL in advance.
Knowledge workers
 Typically NOT IT literate (Executives, Analysts, Managers).
 NOT clerical workers.
 Decision makers.

Features of a DWH
– DW typically
– Reside on computers dedicated to this function
– Run on enterprise scale DBMS such as Oracle, IBM DB2,
Teradata, or Microsoft SQL Server
– Retain data for long periods of time
– Consolidate data obtained from a variety of sources
– Are built around their own carefully designed data model

Data Management in Enterprises
 Vertical fragmentation of informational systems
 Result of application (user)-driven development of
operational systems
[Figure: operational systems split by department (Sales Administration, Finance, Manufacturing, ...), each with its own applications such as Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, and Inventory]

Data Management in Enterprises
 Two approaches for accessing data:
 Query-driven (lazy)
 Warehouse (eager)

The Need for DW
 Query-driven (lazy, on-demand)
[Figure: Clients query an Integration System (with Metadata), which reaches each Source through a Wrapper]

Query-Driven Approach
Disadvantages
 Delay in query processing
 Inefficient and potentially expensive for frequent
queries
 Competes with local processing at sources

The Warehousing Approach
 Information integrated in advance
 Stored in WH for direct querying and analysis
[Figure: each Source has an Extractor/Monitor feeding the Integration System (with Metadata), which loads the Data Warehouse queried by Clients]

Advantages of DWH Approach
 High query performance
 Doesn’t interfere with local processing at sources
 Information copied at warehouse
 Can modify, annotate, summarize, restructure, etc.
 Can store historical information
 Security, auditing
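
The difference between the query-driven (lazy) and warehouse (eager) approaches can be sketched in a few lines. This is a toy model under stated assumptions: `Source`, `lazy_total`, and `Warehouse` are hypothetical names, and each "source" is just a list of amounts with a counter recording how often it is scanned.

```python
class Source:
    """Stand-in for an operational system; counts how often it is hit."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def scan(self):
        self.hits += 1
        return list(self.rows)


def lazy_total(sources):
    """Query-driven: every query goes back to every source."""
    return sum(amount for s in sources for amount in s.scan())


class Warehouse:
    """Eager: extract once in advance, then answer queries locally."""

    def __init__(self, sources):
        self.rows = [amount for s in sources for amount in s.scan()]

    def total(self):
        return sum(self.rows)


lazy_sources = [Source([10, 20]), Source([5])]
for _ in range(3):                      # three identical queries
    lazy_result = lazy_total(lazy_sources)

eager_sources = [Source([10, 20]), Source([5])]
wh = Warehouse(eager_sources)           # one extraction pass
for _ in range(3):
    eager_result = wh.total()
```

Both approaches return the same total, but the lazy sources are each scanned three times while the eager sources are scanned once. The trade-off not shown here is freshness: the copied data is only as current as the last load.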

Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, …
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining, the automated analysis
of massive data sets

Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems

What Is Data Mining?
 Alternative name
 Knowledge discovery in databases (KDD)
 Watch out: Is everything “data mining”?
 Query processing
 Expert systems or statistical programs
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount of
data

Let’s start data mining with an interesting statement.
 The statement, given by Donald Rumsfeld, Defense Secretary of the USA, in an
interview, is as under.
 As we know, there are known knowns. There are things we know that we know, like
your name and your parents’ names. We also know there are known
unknowns.
 That is to say, we know that there are some things we do not know, like what someone is
thinking about you, what you will eat after six days, what the result of a lottery
will be, and so on.
 But there are also unknown unknowns, the ones we don't know that we don't know.
Are they beneficial if you know them? Or is it harmful not to know them?

There are also unknown knowns, things we'd like to know, but don't know, but
know someone who can doctor them and pass them off as known knowns. To
associate Rumsfeld’s above quotation with data mining, we identify four core
phrases as
1. Known knowns
2. Known unknowns
3. Unknown unknowns
4. Unknown knowns
 Items 1, 3, and 4 deal with “knowns”. Data mining has relevance to the
third point, unknown unknowns.
 It is the art of digging out exactly what we don’t know but must know in
our business.
 The methodology is to first convert “unknown unknowns” into “known
unknowns” and then finally into “known knowns”.

What is Data Mining?: Slightly Informal
Tell me something that I should know. When you don’t know what you should
be knowing, how do you write SQL?
You can’t!
In other words, you ask your DWH or data repository to tell you something that
you don’t know but should know. Since we don’t know what we actually don’t
know and what we must know, we can’t write SQL queries to get answers as we
do in OLTP systems.
Data mining is an exploratory approach, where browsing through data using
data mining techniques may reveal something that might be of interest to the
user as information that was unknown previously. Hence, in data mining we
don’t know the results in advance.

Why Data Mining?—Potential Applications
 Data analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
 Risk analysis and management
 Forecasting, customer retention, quality control, competitive
analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other Applications
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis

Market Analysis and Management
 Where does the data come from?
 Credit card transactions, discount coupons, customer complaint calls
 Target marketing
 Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
 Cross-market analysis
 Associations/co-relations between product sales, & prediction based on such
association
 Customer profiling
 What types of customers buy what products
 Customer requirement analysis
 Identifying the best products for different customers
 Predict what factors will attract new customers

Fraud Detection & Mining Unusual
Patterns
 Approaches: Clustering & model construction for frauds, outlier analysis
 Applications: Health care, retail, credit card service, telecomm.
 Medical insurance
 Professional patients, and ring of doctors
 Unnecessary or correlated screening tests
 Telecommunications:
 Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
 Retail industry
 Analysts estimate that 38% of retail shrink is due to dishonest
employees

Other Applications
 Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preference and behavior
pages, analyzing effectiveness of Web marketing, improving Web
site organization, etc.

Data Mining: A KDD Process
 Data mining: the core of the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Steps of a KDD Process
 Learning the application domain
 Relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
 Find useful features, dimensionality/variable reduction.
 Choosing functions of data mining
 Summarization, classification, regression, association, clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 Visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
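
The steps above can be sketched as a chain of small functions. This is only a toy illustration: the customer records are made up, the function names are hypothetical, and the "mining" step is a plain summarization (mean spend per age group) standing in for a real algorithm.

```python
raw = [
    {"customer": "A", "age": 25, "spend": 120.0},
    {"customer": "B", "age": None, "spend": 80.0},   # missing value
    {"customer": "C", "age": 41, "spend": 300.0},
    {"customer": "C", "age": 41, "spend": 300.0},    # duplicate row
    {"customer": "D", "age": 38, "spend": 280.0},
]


def clean(rows):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for r in rows:
        if None in r.values():
            continue
        key = tuple(sorted(r.items()))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out


def select(rows):
    """Data selection: keep only the task-relevant attributes."""
    return [(r["age"], r["spend"]) for r in rows]


def transform(pairs):
    """Data transformation: discretize age into two coarse groups."""
    return [("under_30" if age < 30 else "30_plus", spend) for age, spend in pairs]


def mine(pairs):
    """'Mining' step (here just summarization): mean spend per age group."""
    sums, counts = {}, {}
    for group, spend in pairs:
        sums[group] = sums.get(group, 0.0) + spend
        counts[group] = counts.get(group, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


def evaluate(patterns, min_gap=50.0):
    """Pattern evaluation: keep the result only if the groups differ enough."""
    gap = abs(patterns.get("30_plus", 0.0) - patterns.get("under_30", 0.0))
    return patterns if gap >= min_gap else {}


patterns = evaluate(mine(transform(select(clean(raw)))))
```

Even in this tiny example most of the code is cleaning, selection, and transformation, which is consistent with the remark that preprocessing can take the majority of the effort.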

Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, and Filtering; Databases and Data Warehouse; plus a Knowledge-base]

Claude Shannon's Info. Theory More
Volume

 Data mining evolved as a mechanism to address the limitations of OLTP
systems in dealing with massive data sets with high dimensionality, new data
types, multiple heterogeneous data sources, etc.
 The conventional systems couldn’t keep pace with the ever-changing
and ever-growing data sets.
 Data mining algorithms are built to deal with high-dimensional data, new
data types (images, video, etc.), complex associations among data items,
distributed data sources, and associated issues (security, etc.)

How Data Mining is different?
 Traditional Database (Transactions): querying
data in well-defined processes. Reliable storage

Data Mining: On What Kinds of Data?
 Relational database
 Data warehouse
 Transactional database
 Advanced database and information repository
 Spatial and temporal data
 Time-series data
 Stream data
 Multimedia database
 Text databases & WWW

Data Mining Functionalities
 Concept description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics
 Association (correlation and causality)
 Diaper → Beer [support = 0.5%, confidence = 75%]
 Classification and Prediction
 Construct models (functions) that describe and distinguish classes or
concepts for future prediction
 Presentation: decision-tree, classification rule, neural network
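
The two numbers attached to the Diaper and Beer association rule on this slide are conventionally its support and confidence. A minimal sketch of how they are computed follows; the baskets are made-up data, so the resulting values differ from the slide's figures, and the helper names `support` and `confidence` are chosen for illustration.

```python
# Illustrative market baskets (made-up data).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diaper", "beer", "chips"},
    {"diaper", "milk"},
    {"bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
```

Here three of the eight baskets contain both items and five contain diapers, so the rule's support is 3/8 and its confidence is 3/5.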

Data Mining Functionalities
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass similarity
 Outlier analysis
 Outlier: a data object that does not comply with the general behavior
of the data
 Useful in fraud detection, rare events analysis
 Trend and evolution analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
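
One simple way to flag an object that "does not comply with the general behavior of the data" is a z-score test. The sketch below is one possible technique among many, not the method the course prescribes; the call durations are made-up data and the threshold of 2.0 standard deviations is an assumption.

```python
from statistics import mean, pstdev

# Illustrative call durations in minutes (made-up data); one value is unusual.
durations = [3, 4, 5, 4, 3, 5, 4, 60]


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations
    from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []          # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


outliers = zscore_outliers(durations)
```

With these values only the 60-minute call is flagged. On very small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.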

Are All the “Discovered” Patterns
Interesting?
 Data mining may generate thousands of patterns: Not all of them are
interesting
 Suggested approach: Human-centered, query-based, focused mining
 Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
 Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]

Data Mining: Classification Schemes
 Different views, different classifications
 Kinds of data to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted

Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text,
multi-media, heterogeneous, WWW
 Knowledge to be mined
 Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier analysis,
etc.
 Multiple/integrated functions and mining at multiple
levels

Multi-Dimensional View of Data Mining
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-
data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data
Warehousing
 Data mining systems, DBMS, Data warehouse systems
coupling
 On-line analytical mining data
 Integration of mining and OLAP technologies
 Interactive mining multi-level knowledge
 Necessity of mining knowledge and patterns at different levels of
abstraction.
 Integration of multiple mining functions
 Characterized classification, first clustering and then association

Data Mining
 A neural network is a series of algorithms that endeavors to recognize
underlying relationships in a set of data through a process that mimics the
way the human brain operates. In this sense, neural networks refer to
systems of neurons, either organic or artificial in nature.
 Rule induction is an area of machine learning in which formal rules are
extracted from a set of observations. The rules extracted may represent a full
scientific model of the data, or merely represent local patterns in the data.

Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing
one: knowledge fusion

Major Issues in Data Mining
 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of
abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy

Summary
 Data mining: discovering interesting patterns from large amounts of data
 A natural evolution of database technology, in great demand, with wide
applications
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining

Where to Find References?
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
 Data mining and KDD
 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations
 Database systems
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
 Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
 Journals: Machine Learning, Artificial Intelligence, etc.
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of statistics, etc.
 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. visualization and computer graphics, etc.