Lecture 1
Ms. Qurat-ul-Ain
Email: quratulain.raja15@gmail.com
Course Details
 Course Title: Data Warehousing & Data Mining
 Credit Hours: 3
 Course Prerequisite: DBMS
Course Contents
Data Warehousing Concepts, Data Warehousing System And Components
Data Transformation Process Functions
Online Analytical Processing (OLAP) And OLAP Tools.
Data Crawling & Programming With Python
Data Warehousing Applications
Concepts Of Data Mining
Data Pre-processing, Pre-mining And Outlier Detection
Data Mining Learning Methods & Data Mining Classes (Association Rule Mining, Clustering,
Classification)
Fundamentals of Other Algorithms Related To Data Mining (Fuzzy Logic, Genetic Algorithm
And Neural Network)
Decision Trees
Web Mining
Text Books
 Fundamentals of Data Warehousing - Paulraj Ponniah
 The Data Warehouse Toolkit by Ralph Kimball - John Wiley & Sons Publications.
 Decision Support in the Data Warehouse by Paul Gray, Hugh J. Watson - Prentice
Hall.
 Jiawei Han, “Data Mining: Concepts and Techniques”, Second Edition and above
 Data Mining and Analysis: Fundamental Concepts and Algorithms, 1st Edition, M.
Zaki & W. Meira
 Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline Kamber,
Jian Pei, 2011
 Anything that you can find to help you learn.
History of IT
 The “dark ages”: paper forms in file cabinets
 Computerized systems emerge
 Initially for big projects like Social Security
 Same functionality as old paper-based systems
 The “golden age”: databases are everywhere
 Most activities tracked electronically
 Stored data provides detailed history of activity
 The next step: use data for decision-making
 The focus of this course!
 Made possible by omnipresence of IT
 Identify inefficiencies in current processes
 Quantify likely impact of decisions
What Is a Data Warehouse?
 In many organizations, we want a central “store” of all of our entities,
concepts, metadata, and historical information
 For doing data validation, complex mining, analysis, prediction, …
 This is the data warehouse
 To this point we’ve focused on scenarios where the data “lives” in the
sources – here we may have a “master” version (and archival version) in a
central database
 For performance reasons, availability reasons, archival reasons, …
What Is a Data Warehouse?
 More specifically, a collective data repository
– Containing snapshots of the operational data (history)
– Obtained through data cleansing (Extract-Transform-Load process)
– Useful for analytics
What Is a Data Warehouse?
 Experts say…
– Ralph Kimball: “a copy of transaction data specifically structured for
query and analysis”
– Bill Inmon: “A data warehouse is a subject-oriented, integrated,
non-volatile, time-variant collection of data in support of
management’s decisions.”
Properties of a Data Warehouse?
 Subject-oriented
– The data in the DWH is organized so that all the data elements
relating to the same real-world event or object are linked together
– Typical subject areas in DWs are Customer, Product, Order, Claim,
Account, …
Properties of a Data Warehouse?
 Non-Volatile
– Data in the DW is never over-written or deleted - once committed, the data
is static, read-only, and retained for future reporting
– Data is loaded, but not updated
– When subsequent changes occur, a new version or snapshot record is
written,…
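The non-volatile rule can be pictured as an append-only table: a change in the source never overwrites a row, it adds a new snapshot. Below is a minimal Python sketch of that idea. The `CustomerVersion` and `CustomerHistory` names, the fields, and the sample cities and dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerVersion:
    """One immutable snapshot of a customer record (hypothetical schema)."""
    customer_id: int
    city: str
    valid_from: date


class CustomerHistory:
    """Append-only store: rows are loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, city, valid_from):
        # A change in the source system becomes a NEW row; old rows stay intact.
        self._rows.append(CustomerVersion(customer_id, city, valid_from))

    def as_of(self, customer_id, when):
        """Return the version that was valid on the given date, or None."""
        candidates = [r for r in self._rows
                      if r.customer_id == customer_id and r.valid_from <= when]
        return max(candidates, key=lambda r: r.valid_from, default=None)


history = CustomerHistory()
history.load(1, "Lahore", date(2020, 1, 1))
history.load(1, "Karachi", date(2022, 6, 1))  # customer moved: new snapshot
```

Because old snapshots are kept, a report can ask what the record looked like on any past date, which is also what makes the warehouse time-variant.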
Properties of a Data Warehouse?
 Time-varying
– The changes to the data in the DW are tracked and recorded so that
reports show changes over time
– Different environments have different time horizons associated
• While for operational systems a 60-to-90 day time horizon is
normal, DWs have a 5-to-10 year horizon
General Definition
– A large repository of some organization’s electronically stored data
– Specifically designed to facilitate reporting and analysis
Characteristics of DW
Subject oriented: Data are organized by how users refer to them
Integrated: Inconsistencies are removed in both nomenclature and conflicting information (i.e., data are ‘clean’)
Non-volatile: Read-only data; data do not change over time
Time variant: Data are time series, not current status
Subject Oriented
 Data Warehouse is designed around
“subjects” rather than processes
 A company may have
 Retail Sales System
 Outlet Sales System
 Catalog Sales System
 DW will have a Sales Subject Area
Subject Oriented
[Figure: the Retail Sales, Outlet Sales, and Catalog Sales OLTP systems feed a single subject-oriented Sales Subject Area in the data warehouse]
Integrated
 Heterogeneous Source Systems
 Need to Integrate source data
 For Example: Product codes could be different in different systems
 Arrive at common code in DW
 Information integrated in advance
 Stored in DW for direct querying and analysis
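The product-code example can be sketched as a lookup that translates each source system's local code into one warehouse code before loading. This is only an illustration: the system names, the codes, the amounts, and the `integrate` helper are made up, and a real integration layer would also handle units, naming, and conflicting values.

```python
# Hypothetical per-system product codes mapped to one warehouse code.
CODE_MAP = {
    ("retail", "R-1001"): "PROD-001",
    ("outlet", "0451"): "PROD-001",
    ("catalog", "CAT/77"): "PROD-001",
    ("retail", "R-2002"): "PROD-002",
}


def integrate(rows):
    """Translate (system, local_code, amount) rows to warehouse codes and total them."""
    totals = {}
    unmapped = []
    for system, local_code, amount in rows:
        dw_code = CODE_MAP.get((system, local_code))
        if dw_code is None:
            unmapped.append((system, local_code))  # set aside for manual review
            continue
        totals[dw_code] = totals.get(dw_code, 0) + amount
    return totals, unmapped


source_rows = [
    ("retail", "R-1001", 120),
    ("outlet", "0451", 80),
    ("catalog", "CAT/77", 50),
    ("retail", "R-2002", 30),
    ("outlet", "9999", 10),  # code with no mapping yet
]
totals, unmapped = integrate(source_rows)
```

Once the three local codes resolve to the same warehouse code, sales from all three systems can be queried as one product.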
Integrated
[Figure: sources feed extractors/monitors, which feed an integration system (with metadata) that loads the data warehouse queried by clients]
Non-Volatile
 Operational update of data does not occur in the data warehouse
environment.
 Does not require transaction processing, recovery, and concurrency
control mechanisms
 Requires only two operations in data accessing:
 initial loading of data and access of data.
Non-Volatile (Read-Mostly)
[Figure: users write to and read from the OLTP system; users only read from the DW]
Time Variant
 The time horizon for the data warehouse is significantly longer than that of
operational systems.
 Operational database: current value data.
 Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
Time Variant
 Most business analysis has a time
component
 Trend Analysis (historical data is
required)
[Figure: sales by year, 2001 to 2004]
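Trend analysis over historical data can be as simple as fitting a straight line to yearly totals. The sketch below uses ordinary least squares. The sales figures are invented for illustration and are not taken from the chart, and `linear_trend` is a hypothetical helper name.

```python
def linear_trend(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


years = [2001, 2002, 2003, 2004]
sales = [100.0, 110.0, 120.0, 130.0]  # made-up figures, for illustration only
slope, intercept = linear_trend(years, sales)
forecast_2005 = slope * 2005 + intercept  # extrapolate one year ahead
```

Without the earlier years stored in the warehouse there would be nothing to fit the line to, which is why trend analysis needs historical data.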
Why a Data Warehouse (DWH)?
 Data recording and storage is growing.
 History is an excellent predictor of the future.
 Gives a total view of the organization.
 Intelligent decision-support is required for decision-making.
Reason-1: Why a Data Warehouse?
 Data sets are growing.
 The size of data sets is going up.
 The cost of data storage is coming down.
 The amount of data the average business collects and stores is doubling
every year
 Total hardware and software cost to store and manage 1 Mbyte of data
 1990: ~ $15
 2002: ~ ¢15 (down 100 times)
 By 2007: < ¢1 (down more than 1,500 times from 1990)
Reason-1: Why a Data Warehouse?
 A Few Examples
 WalMart: 24 TB
 France Telecom: ~ 100 TB
 CERN: Up to 20 PB by 2006
 Stanford Linear Accelerator Center (SLAC): 500 TB
Caution!
A warehouse of data is NOT a data warehouse.
Caution!
Size is NOT everything.
Reason-2: Why a Data Warehouse?
 Businesses demand Intelligence (BI).
 Complex questions from integrated data.
 “Intelligent Enterprise”
Reason-2: Why a Data Warehouse?
DBMS Approach
List of all items that were sold last month?
List of all items purchased by Tariq Majeed?
The total sales of the last month grouped by branch?
How many sales transactions occurred during the month of January?
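Questions of the DBMS kind (items bought by one customer, totals by branch, counts for a month) map directly onto ordinary SQL, because the question is known in advance. A small sketch with Python's built-in `sqlite3` module follows. The `sales` table, its columns, and every row except the customer name from the slide are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (branch TEXT, item TEXT, customer TEXT, sold_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("Lahore", "Milk", "Tariq Majeed", "2024-01-05", 200.0),
        ("Lahore", "Bread", "Tariq Majeed", "2024-01-05", 100.0),
        ("Karachi", "Milk", "A. Khan", "2024-01-17", 250.0),
        ("Karachi", "Eggs", "A. Khan", "2024-02-02", 300.0),
    ],
)

# List of all items purchased by one customer.
items = [row[0] for row in conn.execute(
    "SELECT DISTINCT item FROM sales WHERE customer = ? ORDER BY item",
    ("Tariq Majeed",),
)]

# Total sales for one month, grouped by branch.
by_branch = dict(conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sold_on BETWEEN '2024-01-01' AND '2024-01-31' "
    "GROUP BY branch"
))

# Number of sales transactions during January.
(jan_count,) = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sold_on LIKE '2024-01-%'"
).fetchone()
```

Each query names exactly the rows and aggregates it wants. The "intelligent enterprise" questions are different because there the pattern to look for is not known beforehand.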
Reason-2: Why a Data Warehouse?
Intelligent Enterprise
Which items sell together? Which items to stock?
Where and how to place the items? What discounts to offer?
How best to target customers to increase sales at a branch?
Which customers are most likely to respond to my next promotional campaign, and why?
Reason-3: Why a Data Warehouse?
 Businesses want much more…
 What happened?
 Why did it happen?
 What will happen?
 What is happening?
 What do you want to happen?
[Figure: stages of the data warehouse]
What is a Data Warehouse?
A complete repository of historical corporate data extracted
from transaction systems that is available for ad-hoc access
by knowledge workers.
 Complete repository
 History
 Transaction System
 Ad-Hoc access
 Knowledge workers
What is a Data Warehouse?
Transaction System
 Management Information System (MIS)
 Could be typed sheets (NOT transaction system)
Ad-Hoc access
 Does not have a certain access pattern.
 Queries not known in advance.
 Difficult to write SQL in advance.
Knowledge workers
 Typically NOT IT literate (Executives, Analysts, Managers).
 NOT clerical workers.
 Decision makers.
Features of a DWH
– DW typically
– Reside on computers dedicated to this function
– Run on enterprise scale DBMS such as Oracle, IBM DB2,
Teradata, or Microsoft SQL Server
– Retain data for long periods of time
– Consolidate data obtained from a variety of sources
– Are built around their own carefully designed data model
Data Management in Enterprises
 Vertical fragmentation of informational systems
 Result of application (user)-driven development of
operational systems
[Figure: operational systems split by department (Sales, Administration, Finance, Manufacturing, ...), with applications such as Sales Planning, Stock Management, Suppliers, Debt Management, Num. Control, and Inventory]
Data Management in Enterprises
 Two approaches for accessing data:
 Query-driven (lazy)
 Warehouse (eager)
The Need for DW
Query-Driven Approach
 Query-driven (lazy, on-demand)
[Figure: clients query an integration system (with metadata), which reaches each source through a wrapper]
Disadvantages
 Delay in query processing
 Inefficient and potentially expensive for frequent
queries
 Competes with local processing at sources
The Warehousing Approach
[Figure: sources feed extractors/monitors, which feed an integration system (with metadata) that loads the data warehouse queried by clients]
 Information integrated in advance
 Stored in WH for direct querying and analysis
Advantages of DWH Approach
 High query performance
 Doesn’t interfere with local processing at sources
 Information copied at warehouse
 Can modify, annotate, summarize, restructure, etc.
 Can store historical information
 Security, no auditing
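The difference between the query-driven (lazy) and warehouse (eager) approaches can be sketched by counting how often the sources are touched. The `Source` and `Warehouse` classes below are hypothetical stand-ins, not a real integration system, and the numbers are made up.

```python
class Source:
    """Stand-in for an operational system; counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def fetch(self):
        self.hits += 1
        return list(self.rows)


def query_driven_total(sources):
    # Lazy: every client query is sent to all sources at query time.
    return sum(amount for s in sources for amount in s.fetch())


class Warehouse:
    # Eager: data is extracted and integrated once, in advance.
    def __init__(self, sources):
        self.rows = [amount for s in sources for amount in s.fetch()]

    def total(self):
        return sum(self.rows)  # answered locally, sources are not touched


lazy_sources = [Source([10, 20]), Source([5])]
lazy_results = [query_driven_total(lazy_sources) for _ in range(3)]

eager_sources = [Source([10, 20]), Source([5])]
warehouse = Warehouse(eager_sources)
eager_results = [warehouse.total() for _ in range(3)]
```

Both approaches return the same totals, but the lazy one hits every source on every query, while the warehouse reads each source once at load time. The price of the eager approach is that the copy can go stale until the next load.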
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras,
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining, the automated analysis
of massive data sets
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What Is Data Mining?
 Alternative name
 Knowledge discovery in databases (KDD)
 Watch out: Is everything “data mining”?
 Query processing
 Expert systems or statistical programs
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amounts of
data
Let’s start data mining with an interesting statement.
 The statement, given by Donald Rumsfeld, Defense Secretary of the USA, in an
interview, is as follows.
 As we know, there are known knowns. There are things we know that we know, like
your name or your parents’ names. We also know there are known
unknowns.
 That is to say, we know that there are some things we do not know, like what someone is
thinking about you, what you will eat six days from now, what the result of a lottery will be,
and so on.
 But there are also unknown unknowns, the ones we don't know that we don't know.
Is it beneficial to know them? Or is it harmful not to know them?
What Is Data Mining?
There are also unknown knowns, things we'd like to know, but don't know, but
know someone who can doctor them and pass them off as known knowns. To
associate Rumsfeld’s quotation above with data mining, we identify four core
phrases:
1. Known knowns
2. Known unknowns
3. Unknown unknowns
4. Unknown knowns
 Items 1, 2, and 4 deal with “knowns”. Data mining is relevant to the
third item, the unknown unknowns.
 It is the art of digging out exactly what we don’t know but must know in
our business.
 The methodology is to first convert “unknown unknowns” into “known
unknowns” and then finally into “known knowns”.
What Is Data Mining? Slightly Informal
Tell me something that I should know. When you don’t know what you should
know, how do you write SQL?
You can’t!
“Tell me something that I should know” means asking your DWH or data repository
to tell you something that you don’t know but should know. Since we don’t know
what we don’t know, or what we need to know, we can’t write
SQL queries to get the answers as we do in OLTP systems.
Data mining is an exploratory approach: browsing through data with
data mining techniques may reveal previously unknown information that is of
interest to the user. Hence, in data mining we
don’t know the results in advance.
Why Data Mining?—Potential Applications
 Data analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
 Risk analysis and management
 Forecasting, customer retention, quality control, competitive
analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other Applications
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis
Market Analysis and Management
 Where does the data come from?
 Credit card transactions, discount coupons, customer complaint calls
 Target marketing
 Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
 Cross-market analysis
 Associations/correlations between product sales, and prediction based on such
associations
 Customer profiling
 What types of customers buy what products
 Customer requirement analysis
 Identifying the best products for different customers
 Predict what factors will attract new customers
Fraud Detection & Mining Unusual
Patterns
 Approaches: Clustering & model construction for frauds, outlier analysis
 Applications: Health care, retail, credit card service, telecomm.
 Medical insurance
 Professional patients, and ring of doctors
 Unnecessary or correlated screening tests
 Telecommunications:
 Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
 Retail industry
 Analysts estimate that 38% of retail shrink is due to dishonest
employees
Other Applications
 Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preferences and behavior,
analyze the effectiveness of Web marketing, improve Web
site organization, etc.
Data Mining: A KDD Process
 Data mining: core of the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
 Learning the application domain
 Relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
 Find useful features, dimensionality/variable reduction.
 Choosing functions of data mining
 Summarization, classification, regression, association, clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 Visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
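The steps above can be read as a pipeline in which each stage feeds the next. The toy sketch below chains selection, cleaning, mining, and evaluation on a handful of invented records. The function names and the "mining" step (a plain frequency count) are deliberately simplistic assumptions, not the method of any particular tool.

```python
from collections import Counter

# Invented raw records with typical quality problems.
raw = [
    {"customer": "c1", "item": "Milk ", "qty": 2},
    {"customer": "c2", "item": "milk", "qty": 1},
    {"customer": "c3", "item": None, "qty": 1},  # incomplete record
    {"customer": "c4", "item": "BREAD", "qty": 3},
    {"customer": "c5", "item": "milk", "qty": 1},
]


def select(records):
    # Data selection: keep only the task-relevant attribute.
    return [r["item"] for r in records]


def clean(items):
    # Data cleaning: drop missing values, normalize spelling and case.
    return [i.strip().lower() for i in items if i]


def mine(items):
    # Data mining (trivially simple here): count item frequencies.
    return Counter(items)


def evaluate(patterns, min_count=2):
    # Pattern evaluation: keep only patterns that are frequent enough.
    return {item: n for item, n in patterns.items() if n >= min_count}


knowledge = evaluate(mine(clean(select(raw))))
```

Even in this toy version most of the code deals with selecting and cleaning, which matches the remark that preprocessing can take the largest share of the effort.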
Architecture: Typical Data Mining System
[Figure: layers from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; data cleaning, data integration, and filtering; Databases and Data Warehouse. A Knowledge-base sits alongside the upper layers.]
Claude Shannon's Info. Theory More
Volume
 Data mining evolved as a mechanism to address the limitations of OLTP
systems in dealing with massive data sets with high dimensionality, new data
types, multiple heterogeneous data sources, etc.
 Conventional systems couldn’t keep pace with the ever-changing
and ever-increasing data sets.
 Data mining algorithms are built to deal with high-dimensional data, new
data types (images, video, etc.), complex associations among data items,
distributed data sources, and associated issues (security, etc.)
How Is Data Mining Different?
 Traditional database (transactions): querying
data in well-defined processes; reliable storage
Data Mining: On What Kinds of Data?
 Relational database
 Data warehouse
 Transactional database
 Advanced database and information repository
 Spatial and temporal data
 Time-series data
 Stream data
 Multimedia database
 Text databases & WWW
Data Mining Functionalities
 Concept description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics
 Association (correlation and causality)
 Diaper → Beer [support 0.5%, confidence 75%]
 Classification and Prediction
 Construct models (functions) that describe and distinguish classes or
concepts for future prediction
 Presentation: decision-tree, classification rule, neural network
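The two numbers attached to an association rule such as Diaper → Beer are its support (the share of all transactions that contain both items) and its confidence (the share of transactions with the antecedent that also contain the consequent). A short sketch follows. The baskets are made up, so the resulting figures differ from the slide's 0.5% and 75%, and `support_confidence` is a hypothetical helper name.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up baskets, for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk"},
    {"bread", "chips"},
]
support, confidence = support_confidence(baskets, {"diaper"}, {"beer"})
```

Here 3 of 8 baskets contain both items and 3 of the 4 diaper baskets also contain beer. A rule can have high confidence and still be rare, which is why both measures are reported together.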
Data Mining Functionalities
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass similarity
 Outlier analysis
 Outlier: a data object that does not comply with the general behavior
of the data
 Useful in fraud detection, rare events analysis
 Trend and evolution analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
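Outlier analysis can be sketched with a simple z-score rule: flag values that lie more than a chosen number of standard deviations from the mean. This is one basic technique among many and is sensitive to the outliers themselves. The call durations, the threshold of 2.0, and the `zscore_outliers` name are assumptions for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold * std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up call durations in minutes; one call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5, 60]
outliers = zscore_outliers(call_minutes)
```

In a fraud-detection setting the flagged call would be handed to an analyst for review rather than treated as proof of fraud.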
Are All the “Discovered” Patterns
Interesting?
 Data mining may generate thousands of patterns: Not all of them are
interesting
 Suggested approach: Human-centered, query-based, focused mining
 Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
 Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.
Data Mining: Confluence of Multiple
Disciplines
[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Data Mining: Classification Schemes
 Different views, different classifications
 Kinds of data to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted
Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text,
multi-media, heterogeneous, WWW
 Knowledge to be mined
 Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier analysis,
etc.
 Multiple/integrated functions and mining at multiple
levels
Multi-Dimensional View of Data Mining
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-
data mining, stock market analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data
Warehousing
 Data mining systems, DBMS, Data warehouse systems
coupling
 On-line analytical mining data
 Integration of mining and OLAP technologies
 Interactive mining multi-level knowledge
 Necessity of mining knowledge and patterns at different levels of
abstraction.
 Integration of multiple mining functions
 Characterized classification, first clustering and then association
Data Mining
 A neural network is a series of algorithms that endeavors to recognize
underlying relationships in a set of data through a process that mimics the
way the human brain operates. In this sense, neural networks refer to
systems of neurons, either organic or artificial in nature.
 Rule induction is an area of machine learning in which formal rules are
extracted from a set of observations. The rules extracted may represent a full
scientific model of the data, or merely represent local patterns in the data.
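The smallest possible "neural network" is a single artificial neuron, the perceptron. The sketch below trains one to reproduce the logical AND of two inputs. It is far simpler than the networks used in practice, and the function names, learning rate, and epoch count are illustrative assumptions.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1):
    """Train one artificial neuron with the classic perceptron update rule."""
    n_inputs = len(samples[0][0])
    weights = [0] * n_inputs
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Weights move only when the prediction is wrong (error != 0).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table of logical AND as (inputs, expected output) pairs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_samples]
```

The learned weights are the "underlying relationship" the neuron has extracted from the examples. Unlike an induced rule, they are not directly readable as an if-then statement.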
Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing
one: knowledge fusion
Major Issues in Data Mining
 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of
abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy
Summary
 Data mining: discovering interesting patterns from large amounts of data
 A natural evolution of database technology, in great demand, with wide
applications
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining
Where to Find References?
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
 Data mining and KDD
 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations
 Database systems
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
 Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
 Journals: Machine Learning, Artificial Intelligence, etc.
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of statistics, etc.
 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. visualization and computer graphics, etc.
71

Weitere ähnliche Inhalte

Ähnlich wie 1-_Intro_to_Data_Minning__DWH.ppt

20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.pptSumathiG8
 
Dataware housing
Dataware housingDataware housing
Dataware housingwork
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysNEWYORKSYS-IT SOLUTIONS
 
11667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect411667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect4ambujm
 
Data warehouse
Data warehouseData warehouse
Data warehouseMR Z
 
20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.pptSamPrem3
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?RTTS
 
11666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect311666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect3ambujm
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemKiran kumar
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Datawarehousing & DSS
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSSDeepali Raut
 
Module 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptxModule 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptxnikshaikh786
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data WarehouseSOMASUNDARAM T
 

Ähnlich wie 1-_Intro_to_Data_Minning__DWH.ppt (20)

20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt
 
Dataware housing
Dataware housingDataware housing
Dataware housing
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 
11667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect411667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect4
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data warehousing and Data mining
Data warehousing and Data mining Data warehousing and Data mining
Data warehousing and Data mining
 
20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
 
11666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect311666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect3
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Datawarehousing & DSS
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSS
 
Module 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptxModule 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptx
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Datawarehouse
DatawarehouseDatawarehouse
Datawarehouse
 
DWIntro.ppt
DWIntro.pptDWIntro.ppt
DWIntro.ppt
 
DWIntro.ppt
DWIntro.pptDWIntro.ppt
DWIntro.ppt
 
DWIntro.ppt
DWIntro.pptDWIntro.ppt
DWIntro.ppt
 

Kürzlich hochgeladen

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 


1-_Intro_to_Data_Minning__DWH.ppt

  • 1. Lecture 1 Ms. Qurat-ul-Ain Email: quratulain.raja15@gmail.com
  • 2. Course Details  Course Title: Data Warehousing & Data Mining  Credit Hours: 3  Course Prerequisite: DBMS
  • 3. Course Contents Data Warehousing Concepts, Data Warehousing System And Components Data Transformation Process Functions Online Analytical Processing (OLAP) And OLAP Tools. Data Crawling & Programming With Python Data Warehousing Applications Concepts Of Data Mining Data Pre-processing, Pre-mining And Outlier Detection Data Mining Learning Methods & Data Mining Classes (Association Rule Mining, Clustering, Classification) Fundamental of other Algorithms Related To Data Mining(Fuzzy Logic, Genetic Algorithm And Neural Network) Decision Trees Web Mining
  • 4. Text Books  Fundamentals of Data Warehousing - Paulraj Ponniah  The Data Warehouse Toolkit by Ralph Kimball - John Wiley & Sons Publications.  Decision Support in the Data Warehouse by Paul Gray, Hugh J. Watson - Prentice Hall.  Jiawei Han ”Data Mining: Concepts and Techniques”, Second Edition and above  Data Mining and Analysis: Fundamental Concepts and Algorithms, 1st Edition, M. Zaki & W. Meira  Data Mining: Concepts and Techniques, 3rd Edition Jiawei Han, Micheline Kamber, Jian Pei; , 2011  Anything that you can find to help you learn.
  • 5. History of IT  The “dark ages”: paper forms in file cabinets  Computerized systems emerge  Initially for big projects like Social Security  Same functionality as old paper-based systems  The “golden age”: databases are everywhere  Most activities tracked electronically  Stored data provides detailed history of activity  The next step: use data for decision-making  The focus of this course!  Made possible by omnipresence of IT  Identify inefficiencies in current processes  Quantify likely impact of decisions
  • 6. What Is a Data Warehouse?  In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information  For doing data validation, complex mining, analysis, prediction, …  This is the data warehouse  To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database  For performance reasons, availability reasons, archival reasons, …
  • 7. What Is a Data Warehouse?  More specifically, a collective data repository – Containing snapshots of the operational data (history) – Obtained through data cleansing (Extract-Transform-Load process) – Useful for analytics
  • 8. What Is a Data Warehouse?  Experts say… – Ralph Kimball: “a copy of transaction data specifically structured for query and analysis” – Bill Inmon: “A data warehouse is a: – Subject oriented – Integrated – Non-volatile – Time variant collection of data in support of management’s decisions.”
  • 9. Properties of a Data Warehouse?  Subject-oriented – The data in the DWH is organized in such a way that all the data elements relating to the same real-world event or object are linked together  Typical subject areas in DWs are Customer, Product, Order, Claim, Account,…
  • 10. Properties of a Data Warehouse?  Non-Volatile – Data in the DW is never over-written or deleted - once committed, the data is static, read-only, and retained for future reporting – Data is loaded, but not updated – When subsequent changes occur, a new version or snapshot record is written,…
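A minimal sketch of the non-volatile idea from slide 10, assuming a simple in-memory store: a change never overwrites an existing row, it is loaded as a new version. The class names, the customer fields, and the dates are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified afterwards
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date


class CustomerHistory:
    """Append-only store: rows are loaded, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, city, valid_from):
        # A change in the source becomes a new version row.
        self._rows.append(CustomerVersion(customer_id, city, valid_from))

    def as_of(self, customer_id, when):
        """Return the version that was current on `when`, or None."""
        candidates = [
            r for r in self._rows
            if r.customer_id == customer_id and r.valid_from <= when
        ]
        return max(candidates, key=lambda r: r.valid_from, default=None)


history = CustomerHistory()
history.load(1, "Lahore", date(2020, 1, 1))
history.load(1, "Karachi", date(2023, 6, 1))  # customer moved: old row is kept

print(history.as_of(1, date(2021, 1, 1)).city)
print(history.as_of(1, date(2024, 1, 1)).city)
```

Because old versions are retained, a report can be rerun "as of" any past date, which is also what makes the time-variant property possible.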
  • 11. Properties of a Data Warehouse?  Time-varying – The changes to the data in the DW are tracked and recorded so that reports show changes over time – Different environments have different time horizons associated • While for operational systems a 60-to-90 day time horizon is normal, DWs have a 5-to-10 year horizon
  • 12. General Definition – A large repository of some organization’s electronically stored data – Specifically designed to facilitate reporting and analysis
  • 13. Characteristics of DW  Subject oriented: Data are organized by how users refer to them.  Integrated: Inconsistencies are removed in both nomenclature and conflicting information (i.e. data are ‘clean’).  Non-volatile: Read-only data. Data do not change over time.  Time variant: Data are time series, not current status.
  • 14. Subject Oriented  Data Warehouse is designed around “subjects” rather than processes  A company may have  Retail Sales System  Outlet Sales System  Catalog Sales System  DW will have a Sales Subject Area
  • 15. Subject Oriented Retail Sales System Outlet Sales System Catalog Sales System Sales Subject Area Subject-Oriented Sales Information Data Warehouse OLTP Systems
  • 16. Integrated  Heterogeneous Source Systems  Need to Integrate source data  For Example: Product codes could be different in different systems  Arrive at common code in DW  Information integrated in advance  Stored in DW for direct querying and analysis
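The product-code example on slide 16 can be sketched as follows. This is only an illustration under stated assumptions: the source names, the local codes, the common warehouse codes, and the amounts are all hypothetical.

```python
# Hypothetical mappings from each source system's local product code
# to one common warehouse code.
CODE_MAP = {
    "retail": {"R-100": "P001", "R-200": "P002"},
    "catalog": {"CAT7": "P001", "CAT9": "P002"},
}


def integrate(source, rows):
    """Translate (local_code, amount) rows from one source to warehouse codes."""
    mapping = CODE_MAP[source]
    return [(mapping[code], amount) for code, amount in rows]


# Integration happens in advance, at load time, not at query time.
warehouse_rows = (
    integrate("retail", [("R-100", 50.0), ("R-200", 20.0)])
    + integrate("catalog", [("CAT7", 30.0)])
)

# Once codes are unified, sales of the same product can be added up directly.
totals = {}
for code, amount in warehouse_rows:
    totals[code] = totals.get(code, 0.0) + amount

print(totals)
```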
  • 17. Integrated Clients Data Warehouse Source Source Source . . . Extractor/ Monitor Integration System . . . Metadata Extractor/ Monitor Extractor/ Monitor
  • 18. Non-Volatile  Operational update of data does not occur in the data warehouse environment.  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing:  initial loading of data and access of data.
  • 20. Time Variant  The time horizon for the data warehouse is significantly longer than that of operational systems.  Operational database: current value data.  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
  • 21. Time Variant  Most business analysis has a time component  Trend Analysis (historical data is required) 2001 2002 2003 2004 Sales
  • 22. Why a Data Warehouse (DWH)?  Data recording and storage are growing.  History is an excellent predictor of the future.  Gives a total view of the organization.  Intelligent decision support is required for decision-making.
  • 23. Reason-1: Why a Data Warehouse?  Data sets are growing.  The size of data sets is going up.  The cost of data storage is coming down.  The amount of data the average business collects and stores is doubling every year.  Total hardware and software cost to store and manage 1 Mbyte of data:  1990: ~ $15  2002: ~ ¢15 (Down 100 times)  By 2007: < ¢1 (Down 150 times)
  • 24. Reason-1: Why a Data Warehouse?  A Few Examples  WalMart: 24 TB  France Telecom: ~ 100 TB  CERN: Up to 20 PB by 2006  Stanford Linear Accelerator Center (SLAC): 500TB
  • 25. Caution! A Warehouse of Data is NOT a Data Warehouse
  • 27. Reason-2: Why a Data Warehouse?  Businesses demand Intelligence (BI).  Complex questions from integrated data.  “Intelligent Enterprise”
  • 28. Reason-2: Why a Data Warehouse? DBMS Approach: List of all items that were sold last month? List of all items purchased by Tariq Majeed? The total sales of the last month grouped by branch? How many sales transactions occurred during the month of January?
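The questions on this slide are the kind a conventional DBMS answers directly with SQL, because the query is known in advance. A small sketch using Python's built-in `sqlite3` module is given below; the `sales` table, its columns, and every row except the customer name from the slide are assumptions made for the example, and "last month" is taken to be January 2024.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, customer TEXT, branch TEXT, "
    "amount REAL, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("Tea", "Tariq Majeed", "Lahore", 300.0, "2024-01-05"),
        ("Rice", "Tariq Majeed", "Karachi", 900.0, "2024-01-17"),
        ("Tea", "Sara Khan", "Lahore", 300.0, "2024-01-20"),
        ("Soap", "Sara Khan", "Karachi", 150.0, "2024-02-02"),
    ],
)

# List of all items purchased by Tariq Majeed.
items_by_customer = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT item FROM sales WHERE customer = ? ORDER BY item",
        ("Tariq Majeed",),
    )
]

# Total sales of the month, grouped by branch.
sales_by_branch = dict(
    conn.execute(
        "SELECT branch, SUM(amount) FROM sales "
        "WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31' "
        "GROUP BY branch"
    )
)

# Number of sales transactions during January.
january_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sale_date) = '01'"
).fetchone()[0]

print(items_by_customer, sales_by_branch, january_count)
```

Questions such as "which items sell together?" have no comparably direct query, which is where the warehouse and mining techniques come in.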
  • 29. Reason-2: Why a Data Warehouse? Intelligent Enterprise: Which items sell together? Which items to stock? Where and how to place the items? What discounts to offer? How best to target customers to increase sales at a branch? Which customers are most likely to respond to my next promotional campaign, and why?
  • 30. Reason-3: Why a Data Warehouse?  Businesses want much more…  What happened?  Why did it happen?  What will happen?  What is happening?  What do you want to happen? Stages of Data Warehouse
  • 31. What is a Data Warehouse? A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers.  Complete repository  History  Transaction System  Ad-Hoc access  Knowledge workers
  • 32. What is a Data Warehouse? Transaction System  Management Information System (MIS)  Could be typed sheets (NOT transaction system) Ad-Hoc access  Does not have a certain access pattern.  Queries not known in advance.  Difficult to write SQL in advance. Knowledge workers  Typically NOT IT literate (Executives, Analysts, Managers).  NOT clerical workers.  Decision makers.
  • 33. Features of a DWH – DW typically – Reside on computers dedicated to this function – Run on enterprise scale DBMS such as Oracle, IBM DB2, Teradata, or Microsoft SQL Server – Retain data for long periods of time – Consolidate data obtained from a variety of sources – Are built around their own carefully designed data model
  • 34. Data Management in Enterprises  Vertical fragmentation of informational systems  Result of application (user)-driven development of operational systems Sales Administration Finance Manufacturing ... Sales Planning Stock Mngmt ... Suppliers ... Debt Mngmt Num. Control ... Inventory
  • 35. Data Management in Enterprises  Two approaches for accessing data:  Query-Driven (Lazy)  Warehouse (Eager)
  • 36. The Need for DW Source Source Source . . . Integration System . . . Metadata Clients Wrapper Wrapper Wrapper  Query-driven (lazy, on-demand)
  • 37. Query-Driven Approach Disadvantages  Delay in query processing  Inefficient and potentially expensive for frequent queries  Competes with local processing at sources
  • 38. The Warehousing Approach Data Warehouse Clients Source Source Source . . . Extractor/ Monitor Integration System . . . Metadata Extractor/ Monitor Extractor/ Monitor  Information integrated in advance  Stored in WH for direct querying and analysis
  • 39. Advantages of DWH Approach  High query performance  Doesn’t interfere with local processing at sources  Information copied at warehouse  Can modify, annotate, summarize, restructure, etc.  Can store historical information  Security, no auditing
  • 40. Why Data Mining?  The Explosive Growth of Data: from terabytes to petabytes  Data collection and data availability  Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras,  We are drowning in data, but starving for knowledge!  “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 40
  • 41. Evolution of Database Technology  1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s:  Data mining, data warehousing, multimedia databases, and Web databases  2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems 41
  • 42. What Is Data Mining?  Alternative name  Knowledge discovery in databases (KDD)  Watch out: Is everything “data mining”?  Query processing  Expert systems or statistical programs  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data 42
  • 43. What Is Data Mining? Let’s start data mining with an interesting statement.  The statement, given by Donald Rumsfeld, Defense Secretary of the USA, in an interview, is as follows.  As we know, there are known knowns. There are things we know that we know, like your name and your parents’ names. We also know there are known unknowns.  That is to say, we know that there are some things we do not know, like what someone is thinking about you, what you will eat six days from now, what the result of a lottery will be, and so on.  But there are also unknown unknowns, the ones we don't know that we don't know. Would it be beneficial to know them? Or is it harmful not to know them?
  • 44. What Is Data Mining? There are also unknown knowns, things we'd like to know, but don't know, but know someone who can doctor them and pass them off as known knowns. To associate Rumsfeld’s quotation with data mining, we identify four core phrases: 1. Known knowns 2. Known unknowns 3. Unknown unknowns 4. Unknown knowns  Items 1, 2, and 4 deal with “knowns”. Data mining is relevant to the third item (shown in red on the slide).  It is the art of digging out exactly what we don’t know but must know in our business.  The methodology is to first convert “unknown unknowns” into “known unknowns” and then finally into “known knowns”.
  • 45. What is Data Mining?: Slightly Informal Tell me something that I should know. When you don’t know what you should be knowing, how do you write SQL? You can’t! “Tell me something that I should know” means asking your DWH or data repository to tell you something that you don’t know but should know. Since we don’t know what we actually don’t know and what we must know, we can’t write SQL queries to get answers as we do in OLTP systems. Data mining is an exploratory approach, where browsing through data using data mining techniques may reveal something of interest to the user that was previously unknown. Hence, in data mining we don’t know the results in advance.
  • 46. Why Data Mining?—Potential Applications  Data analysis and decision support  Market analysis and management  Target marketing, customer relationship management (CRM), market basket analysis, market segmentation  Risk analysis and management  Forecasting, customer retention, quality control, competitive analysis  Fraud detection and detection of unusual patterns (outliers)  Other Applications  Text mining (news group, email, documents) and Web mining  Stream data mining  Bioinformatics and bio-data analysis 46
  • 47. Market Analysis and Management  Where does the data come from?  Credit card transactions, discount coupons, customer complaint calls  Target marketing  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.  Determine customer purchasing patterns over time  Cross-market analysis  Associations/co-relations between product sales, & prediction based on such association  Customer profiling  What types of customers buy what products  Customer requirement analysis  Identifying the best products for different customers  Predict what factors will attract new customers 47
  • 48. Fraud Detection & Mining Unusual Patterns  Approaches: Clustering & model construction for frauds, outlier analysis  Applications: Health care, retail, credit card service, telecomm.  Medical insurance  Professional patients, and ring of doctors  Unnecessary or correlated screening tests  Telecommunications:  Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm  Retail industry  Analysts estimate that 38% of retail shrink is due to dishonest employees 48
  • 49. Other Applications  Internet Web Surf-Aid  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc. 49
  • 50. Data Mining: A KDD Process  Data mining—core of knowledge discovery process 50 Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 51. Steps of a KDD Process  Learning the application domain  Relevant prior knowledge and goals of application  Creating a target data set: data selection  Data cleaning and preprocessing: (may take 60% of effort!)  Data reduction and transformation  Find useful features, dimensionality/variable reduction.  Choosing functions of data mining  Summarization, classification, regression, association, clustering.  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Pattern evaluation and knowledge presentation  Visualization, transformation, removing redundant patterns, etc.  Use of discovered knowledge 51
  • 52. Architecture: Typical Data Mining System Data Warehouse Data cleaning & data integration Filtering Databases Database or data warehouse server Data mining engine Pattern evaluation Graphical user interface Knowledge-base
  • 53. Claude Shannon's Info. Theory More Volume 53
  • 54.  Data mining evolved as a mechanism to address the limitations of OLTP systems in dealing with massive data sets with high dimensionality, new data types, multiple heterogeneous data sources, etc.  The conventional systems couldn’t keep pace with the ever-changing and ever-growing data sets.  Data mining algorithms are built to deal with high-dimensional data, new data types (images, video, etc.), complex associations among data items, distributed data sources, and associated issues (security, etc.)
  • 55. How Data Mining is different?  Traditional Database (Transactions): querying data in well-defined processes; reliable storage
  • 56. Data Mining: On What Kinds of Data?  Relational database  Data warehouse  Transactional database  Advanced database and information repository  Spatial and temporal data  Time-series data  Stream data  Multimedia database  Text databases & WWW 56
  • 57. Data Mining Functionalities  Concept description: Characterization and discrimination  Generalize, summarize, and contrast data characteristics  Association (correlation and causality)  Diaper → Beer [support 0.5%, confidence 75%]  Classification and Prediction  Construct models (functions) that describe and distinguish classes or concepts for future prediction  Presentation: decision-tree, classification rule, neural network
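The two bracketed percentages on the diaper and beer rule are conventionally its support and confidence. A minimal sketch of how both are computed is shown here; the five transactions are made up for the example, so the resulting numbers differ from the slide's.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return count(lhs | rhs) / count(lhs)


rule_support = support({"diaper", "beer"})
rule_confidence = confidence({"diaper"}, {"beer"})
print(f"diaper -> beer [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Support says how often the rule applies at all; confidence says how reliable it is when the left-hand side is present.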
  • 58. Data Mining Functionalities  Cluster analysis  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns  Maximizing intra-class similarity & minimizing interclass similarity  Outlier analysis  Outlier: a data object that does not comply with the general behavior of the data  Useful in fraud detection, rare events analysis  Trend and evolution analysis  Trend and deviation: regression analysis  Sequential pattern mining, periodicity analysis 58
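As one concrete illustration of outlier analysis, the sketch below flags values that fall far outside the bulk of the data using Tukey's interquartile-range rule. The slides do not name a specific method, so this choice, the threshold `k=1.5`, and the sample amounts are all assumptions.

```python
import statistics

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [52, 48, 50, 51, 49, 47, 53, 500]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(amounts)
print(outliers)
```

In a fraud-detection setting the flagged records would be candidates for review, not proof of fraud.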
  • 59. Are All the “Discovered” Patterns Interesting?  Data mining may generate thousands of patterns: Not all of them are interesting  Suggested approach: Human-centered, query-based, focused mining  Interestingness measures  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm  Objective vs. subjective interestingness measures  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.  Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty. 59
  • 60. Data Mining: Confluence of Multiple Disciplines 60 Data Mining Database Systems Statistics Other Disciplines Algorithm Machine Learning Visualization
  • 61. Data Mining: Classification Schemes  Different views, different classifications  Kinds of data to be mined  Kinds of knowledge to be discovered  Kinds of techniques utilized  Kinds of applications adapted 61
  • 62. Multi-Dimensional View of Data Mining  Data to be mined  Relational, data warehouse, transactional, stream, object- oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, WWW  Knowledge to be mined  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.  Multiple/integrated functions and mining at multiple levels 62
  • 63. Multi-Dimensional View of Data Mining  Techniques utilized  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.  Applications adapted  Retail, telecommunication, banking, fraud analysis, bio- data mining, stock market analysis, Web mining, etc. 63
  • 64. OLAP Mining: Integration of Data Mining and Data Warehousing  Data mining systems, DBMS, Data warehouse systems coupling  On-line analytical mining data  Integration of mining and OLAP technologies  Interactive mining multi-level knowledge  Necessity of mining knowledge and patterns at different levels of abstraction.  Integration of multiple mining functions  Characterized classification, first clustering and then association 64
  • 67. Data Mining  A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature.  Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
  • 68. Major Issues in Data Mining  Mining methodology  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web  Performance: efficiency, effectiveness, and scalability  Pattern evaluation: the interestingness problem  Incorporation of background knowledge  Handling noise and incomplete data  Parallel, distributed and incremental mining methods  Integration of the discovered knowledge with existing one: knowledge fusion 68
  • 69. Major Issues in Data Mining  User interaction  Data mining query languages and ad-hoc mining  Expression and visualization of data mining results  Interactive mining of knowledge at multiple levels of abstraction  Applications and social impacts  Domain-specific data mining & invisible data mining  Protection of data security, integrity, and privacy 69
  • 70. Summary  Data mining: discovering interesting patterns from large amounts of data  A natural evolution of database technology, in great demand, with wide applications  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation  Mining can be performed in a variety of information repositories  Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.  Data mining systems and architectures  Major issues in data mining 70
  • 71. Where to Find References?  More conferences on data mining  PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.  Data mining and KDD  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.  Journal: Data Mining and Knowledge Discovery, KDD Explorations  Database systems  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA  Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.  AI & Machine Learning  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.  Journals: Machine Learning, Artificial Intelligence, etc.  Statistics  Conferences: Joint Stat. Meeting, etc.  Journals: Annals of statistics, etc.  Visualization  Conference proceedings: CHI, ACM-SIGGraph, etc.  Journals: IEEE Trans. visualization and computer graphics, etc. 71