INTRODUCTION

A young, fast-growing, and promising field.

• Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process
• It extracts hidden information from data
• It is an interdisciplinary subfield of computer science
• It is the computational process of discovering patterns in large data sets
• It involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems

INTRODUCTION (CONTD.)

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Aside from the raw analysis step, it involves:
• database and data management aspects
• data pre-processing
• model and inference considerations
• complexity considerations
• post-processing of discovered structures
• visualization
• online updating

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
  • E.g., global backbone telecommunication networks carry tens of petabytes every day
    (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
• Data collection and data availability
  • Automated data collection tools, database systems, the Web, a computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, …

Why Data Mining?

"Necessity is the mother of invention." Data mining: automated analysis of massive data sets.

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!

Evolution of Database Technology

Data mining can be viewed as a result of the natural evolution of IT.

• 1960s: data collection, database creation, and network DBMS
• 1970s: relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)

Evolution of Database Technology

• 1990s: data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems

What Is Data Mining?

• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything "data mining"? The following are not:
  • Simple search and query processing
  • (Deductive) expert systems

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines]

Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process.

[Figure: databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]

Knowledge Process

1. Data cleaning: remove noise and inconsistent data
2. Data integration: combine multiple data sources
3. Data selection: retrieve the data relevant to the analysis
4. Data transformation: transform data into a form appropriate for data mining
5. Data mining: the essential step, where intelligent methods are applied to extract data patterns
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: present the results using visualization and representation techniques

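The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the purchase records, the function names, and the "mining" method (plain occurrence counting) are all hypothetical stand-ins, not part of the original slides.

```python
from collections import Counter

# Two hypothetical sources of purchase records (illustrative data only).
source_a = [{"cust": "c1", "item": "milk", "amount": 2.5},
            {"cust": "c2", "item": None, "amount": 1.0}]   # noisy record
source_b = [{"cust": "c3", "item": "milk", "amount": 3.0},
            {"cust": "c4", "item": "bread", "amount": 1.5}]

def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: reduce each record to the form the miner expects (item names)."""
    return [r["item"] for r in records]

def mine(items):
    """Step 5: a trivial stand-in for a mining method, counting occurrences."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {p: n for p, n in patterns.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{p}: {n}" for p, n in sorted(patterns.items())]

data = integrate(clean(source_a), clean(source_b))
result = present(evaluate(mine(transform(select(data, ["item"]))), min_count=2))
print(result)  # ['milk: 2']
```

In practice each step is far more involved; the point of the sketch is only the order in which the steps feed one another.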
Example: A Web Mining Framework

Web mining usually involves:
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence

Layers from the bottom up, with increasing potential to support business decisions (typical role in parentheses):
1. Data sources: paper, files, Web documents, scientific experiments, database systems (DBA)
2. Data preprocessing/integration, data warehouses (DBA)
3. Data exploration: statistical summary, querying, and reporting (data analyst)
4. Data mining: information discovery (data analyst)
5. Data presentation: visualization techniques (business analyst)
6. Decision making (end user)

KDD Process: A Typical View from ML and Statistics

Input data → data pre-processing → data mining → post-processing

• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

This is the view typical of the machine learning and statistics communities.

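As one concrete instance of the pre-processing stage, normalization rescales an attribute to a common range. The slide does not say which method is meant; the sketch below assumes min-max normalization to [0, 1], and the function name and income values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000]
normalized = min_max_normalize(incomes)
print(normalized)  # [0.0, 0.5, 1.0]
```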
Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  • Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structure data, graphs, social networks, and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data
  • Multimedia databases
  • Text databases
  • The World Wide Web

RDBMS

• A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model.
• The data in a single table represents a relation.
• Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.
• A relationship can then be established between each row in the table and a row in another table by creating a foreign key: a column or group of columns in one table that points to the primary key of another table.

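Primary and foreign keys can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `branch` and `employee` tables borrow their names from the AllElectronics example later in the deck, but the columns and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# branch_id is the primary key: it uniquely identifies each branch row.
conn.execute("CREATE TABLE branch (branch_id INTEGER PRIMARY KEY, city TEXT)")
# employee.branch_id is a foreign key pointing at branch's primary key.
conn.execute("""
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        name      TEXT,
        branch_id INTEGER REFERENCES branch(branch_id)
    )
""")

conn.execute("INSERT INTO branch VALUES (1, 'Chicago')")
conn.execute("INSERT INTO employee VALUES (10, 'Ann', 1)")  # branch 1 exists

try:
    # Branch 99 does not exist, so the foreign key constraint rejects this row.
    conn.execute("INSERT INTO employee VALUES (11, 'Bob', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```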
RDBMS

• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
• The DBMS of a relational database is called an RDBMS; it is the software that manages the relational database.
• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.
• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
• The relational database has become the predominant choice for storing data.

Relational database terminology

A relation is defined as a set of tuples that have the same attributes.

RDBMS (contd.)

• Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch
• Relation: customer is a group of entities describing customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category)
• Tables can also be used to represent the relationships between or among multiple entities
• Database queries (SQL): data is accessed using relational operations such as join, selection, and projection

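The three relational operations named above map directly onto SQL clauses. The sketch below uses Python's `sqlite3` module; the `customer` columns follow the slide, while the `purchase` table and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, age INTEGER);
    CREATE TABLE purchase (cust_id INTEGER, item TEXT);
    INSERT INTO customer VALUES (1, 'Ann', 25), (2, 'Bob', 41);
    INSERT INTO purchase VALUES (1, 'laptop'), (2, 'printer');
""")

# Projection: keep only some columns.
names = conn.execute(
    "SELECT cust_name FROM customer ORDER BY cust_id").fetchall()

# Selection: keep only the rows that satisfy a predicate.
under_30 = conn.execute(
    "SELECT cust_id FROM customer WHERE age < 30").fetchall()

# Join: combine rows from two tables on a shared key.
joined = conn.execute("""
    SELECT c.cust_name, p.item
    FROM customer AS c JOIN purchase AS p ON c.cust_id = p.cust_id
    ORDER BY c.cust_id
""").fetchall()

print(names)     # [('Ann',), ('Bob',)]
print(under_30)  # [(1,)]
print(joined)    # [('Ann', 'laptop'), ('Bob', 'printer')]
```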
Mining Relational Databases

• Data mining can go further than querying by searching for trends or data patterns
• Examples:
  • Analyze customer data to predict the risk of customers based on their income and age
  • Detect deviations, e.g. by comparing sales with the previous year
• Relational databases are among the most commonly available and richest information repositories for data mining

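The deviation example can be made concrete with a year-over-year comparison. This is a sketch under assumptions: the 20% threshold, the function name, and the sales figures are invented for illustration.

```python
def sales_deviations(previous, current, threshold=0.2):
    """Return items whose sales changed by more than `threshold` (a fraction)."""
    flagged = {}
    for item, prev in previous.items():
        if item not in current or prev == 0:
            continue  # no basis for a relative comparison
        change = (current[item] - prev) / prev
        if abs(change) > threshold:
            flagged[item] = round(change, 2)
    return flagged

last_year = {"TV": 100, "PC": 200, "VCR": 50}
this_year = {"TV": 105, "PC": 120, "VCR": 80}
deviations = sales_deviations(last_year, this_year)
print(deviations)  # {'PC': -0.4, 'VCR': 0.6}
```

TV sales moved by only 5%, so they are not reported; PC and VCR sales moved by 40% and 60% and are flagged.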
What is a Data Warehouse?

• Defined in many different ways, but not rigorously:
  • A decision support database that is maintained separately from the organization's operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing: the process of constructing and using data warehouses

DATA WAREHOUSES

A data warehouse is a repository of information collected from multiple sources and stored under a unified schema.

It is constructed via:
• Data cleaning
• Data integration
• Data transformation
• Data loading and periodic data refreshing

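The construction steps listed above can be sketched on toy data. Everything here is an assumption made for illustration: the two source schemas, the unified schema (`city`, `amount`, `channel`), and the in-memory list standing in for the warehouse.

```python
# Two hypothetical operational sources with different schemas.
store_sales = [{"city": " chicago", "amount_usd": "120.50"}]
web_sales = [{"location": "Toronto", "total": 80.0}]

def to_unified(record, city_key, amount_key, channel):
    """Clean one source record and map it onto the unified warehouse schema."""
    return {
        "city": record[city_key].strip().title(),  # cleaning: trim and fix case
        "amount": float(record[amount_key]),       # transformation: one numeric type
        "channel": channel,                        # integration: remember the source
    }

warehouse = []  # stands in for a warehouse fact table

def load(rows):
    """Loading: append a batch; calling this again later models a periodic refresh."""
    warehouse.extend(rows)

load(to_unified(r, "city", "amount_usd", "store") for r in store_sales)
load(to_unified(r, "location", "total", "web") for r in web_sales)
print(warehouse)
```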
DATA WAREHOUSES (contd.)

• A data warehouse is modeled by a multidimensional data structure
• Data cube: precomputation and fast access of summarized data
  • Each dimension corresponds to an attribute or a set of attributes in the schema
  • Each cell stores the value of some aggregate measure (count, sum, etc.)
• Example: in AllElectronics the cube has three dimensions:
  • Address (with country values USA, Canada, Mexico)
  • Time (with quarter values Q1, Q2, Q3, Q4)
  • Item (with item type values)

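A data cube of this shape can be precomputed from fact rows by summing each row into every combination of kept and aggregated dimensions. This is a brute-force sketch, not how a warehouse engine does it; the fact rows and the `ALL` marker are assumptions.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (address, quarter, item type, sales amount).
facts = [("USA", "Q1", "TV", 100), ("USA", "Q2", "TV", 150),
         ("Canada", "Q1", "PC", 80), ("USA", "Q1", "PC", 60)]

ALL = "*"  # marks a dimension that has been aggregated away

def build_cube(rows):
    """Precompute the sum measure for every cell, including all roll-up cells."""
    cube = defaultdict(int)
    for address, quarter, item, amount in rows:
        # Each fact contributes to 2**3 cells: keep or aggregate each dimension.
        for a, q, i in product((address, ALL), (quarter, ALL), (item, ALL)):
            cube[(a, q, i)] += amount
    return dict(cube)

cube = build_cube(facts)
print(cube[("USA", ALL, "TV")])  # 250: TV sales in the USA over all quarters
print(cube[(ALL, "Q1", ALL)])    # 240: all Q1 sales
print(cube[(ALL, ALL, ALL)])     # 390: grand total
```

Once the cube is built, each summary is a single dictionary lookup, which is the "fast access of summarized data" the slide refers to.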
Multidimensional Data

Sales volume as a function of product, month, and region.

Dimensions: Product, Location, Time

Hierarchical summarization paths:
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month → Day, with Week as an alternative level above Day

[Figure: data cube with axes Product, Region, and Month]

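Summarizing along a hierarchy (often called roll-up) means re-keying each cell by its parent level and summing. The sketch below walks the Time path from day to month to quarter to year; the daily sales values and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by (year, month, day).
daily = {(2024, 1, 15): 10, (2024, 2, 3): 20, (2024, 4, 9): 5}

def roll_up(cells, to_parent):
    """Aggregate cells one level up a hierarchy by summing per parent key."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[to_parent(key)] += value
    return dict(out)

monthly = roll_up(daily, lambda k: (k[0], k[1]))                     # day → month
quarterly = roll_up(monthly, lambda k: (k[0], (k[1] - 1) // 3 + 1))  # month → quarter
yearly = roll_up(quarterly, lambda k: k[0])                          # quarter → year

print(quarterly)  # {(2024, 1): 30, (2024, 2): 5}
print(yearly)     # {2024: 35}
```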
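A Sample Data Cube

[Figure: a three-dimensional data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in the U.S.A.]
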
Data mining functionalities

• Tasks can be classified as:
  • Predictive: make predictions about values of data using known results found from different data
  • Descriptive: characterize the properties of a target data set, exploring the properties of the data examined
• Data mining functionalities are used to specify the kinds of patterns to be found:
  • Characterization and discrimination
  • Mining of frequent patterns, associations, and correlations
  • Classification and regression
  • Cluster analysis
  • Outlier analysis

Characterization and Discrimination

• Data characterization is a summarization of the general characteristics or features of a target class of data
• The output of characterization can be presented in various forms:
  • Pie charts
  • Bar charts
  • Curves
  • Multidimensional data cubes
  • Multidimensional tables
• Descriptions can also be presented as generalized relations or as characteristic rules
• Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result can be viewed along any dimension, such as occupation, to view these customers according to their type of employment.

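The AllElectronics example can be sketched as a filter followed by a per-dimension summary. The customer records and the function name are hypothetical; a real characterization would generalize over many attributes, not just one.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"name": "Ann", "occupation": "engineer", "annual_spend": 6200},
    {"name": "Bob", "occupation": "teacher", "annual_spend": 1500},
    {"name": "Cy", "occupation": "engineer", "annual_spend": 7100},
    {"name": "Di", "occupation": "doctor", "annual_spend": 5400},
]

def characterize(rows, min_spend, dimension):
    """Share of each `dimension` value among customers spending over `min_spend`."""
    target = [r for r in rows if r["annual_spend"] > min_spend]
    if not target:
        return {}
    counts = Counter(r[dimension] for r in target)
    return {value: n / len(target) for value, n in counts.items()}

summary = characterize(customers, 5000, "occupation")
print(summary)  # engineers make up two thirds of the target class, doctors one third
```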
Data Discrimination

• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes
• The output is presented in forms similar to those of characterization descriptions
• Discrimination descriptions expressed in the form of rules are called discrimination rules
• The target and contrasting classes are specified by the user
• Example: a user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period

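A simple way to picture discrimination is to compute the same summary for the target class and the contrasting class and put the two side by side. In this sketch the product records, the chosen features (`price`, `rating`), and the use of means as the summary are all assumptions.

```python
def mean(values):
    return sum(values) / len(values)

def discriminate(target, contrast, features):
    """For each feature, pair the target-class mean with the contrasting-class mean."""
    return {f: (mean([r[f] for r in target]), mean([r[f] for r in contrast]))
            for f in features}

# Hypothetical software products, split by how their sales changed.
increased = [{"price": 40, "rating": 4.5}, {"price": 60, "rating": 4.0}]
decreased = [{"price": 200, "rating": 3.0}, {"price": 300, "rating": 2.5}]

comparison = discriminate(increased, decreased, ["price", "rating"])
print(comparison)  # {'price': (50.0, 250.0), 'rating': (4.25, 2.75)}
```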
Mining Frequent Patterns, Associations, Correlations

• Frequent patterns:
  • Frequent itemsets (e.g. milk and bread)
  • Frequent subsequences (e.g. laptop, then digital camera, then memory card)
  • Frequent substructures (e.g. graphs, trees)
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data
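Frequent itemsets can be found on small data by counting every candidate combination. This is a brute-force sketch for illustration, not an efficient algorithm such as Apriori; the transactions and the 50% minimum support are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"},
                {"milk", "juice"}, {"bread"}]

def frequent_itemsets(baskets, size, min_support):
    """Return itemsets of `size` items found in at least `min_support` of baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ('bread', 'milk') and ('milk', 'bread') the same key.
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    n = len(baskets)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
print(pairs)  # {('bread', 'milk'): 0.5}
```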

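Association analysis (example)

• Items frequently purchased together:
  buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
• X is a variable representing a customer
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that the customer will buy software as well
• A support of 1% means that 1% of all the transactions under analysis show computer and software purchased together
• The rule involves a single predicate (buys), so it is a single-dimensional association rule, abbreviated "computer => software [1%, 50%]"
• Multidimensional association rule:
  age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%]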
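Support and confidence follow directly from transaction counts: support is the fraction of transactions containing every item in the rule, and confidence is that support divided by the support of the left-hand side. The baskets below are hypothetical and far smaller than real data, so the resulting percentages differ from the slide's 1% and 50%.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the baskets."""
    ante = support(baskets, antecedent)
    if ante == 0:
        return 0.0  # the rule never applies
    return support(baskets, set(antecedent) | set(consequent)) / ante

# Hypothetical transactions.
baskets = [{"computer", "software"}, {"computer"}, {"printer"},
           {"computer", "software", "printer"}]

rule_support = support(baskets, {"computer", "software"})
rule_confidence = confidence(baskets, {"computer"}, {"software"})
print(rule_support)     # 0.5
print(rule_confidence)  # two of the three computer buyers also bought software
```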
Classification and Regression for Predictive Analysis

• Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts
  • The model is derived from the analysis of a set of training data
  • Models can be represented as:
    • Classification rules (IF-THEN rules)
    • Decision trees
    • Mathematical formulae or neural networks
• Regression: a statistical methodology for numeric prediction

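Both ideas can be illustrated in a few lines. The IF-THEN rules below are written by hand rather than learned from training data, and the regression is ordinary least squares for a single predictor; the attribute names, class labels, and data points are assumptions.

```python
def classify(customer):
    """Hand-written IF-THEN rules (hypothetical, not learned from data)."""
    if customer["age"] < 30 and customer["income"] == "high":
        return "buys_laptop"
    if customer["age"] >= 30 and customer["income"] == "low":
        return "no_purchase"
    return "unknown"

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

label = classify({"age": 25, "income": "high"})
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 2x + 1
print(label)             # buys_laptop
print(slope, intercept)  # 2.0 1.0
```

Classification predicts a categorical label, while regression predicts a number; here a new x of 5 would give a predicted value of `slope * 5 + intercept`.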
Cluster Analysis and Outlier Analysis

• Cluster analysis:
  • Determines the similarity among data on predefined attributes
  • The most similar data are grouped into clusters
• Outlier analysis:
  • Outliers: a data set may contain objects that do not comply with the general behavior or model of the data
  • The analysis of outlier data is referred to as outlier analysis or anomaly mining
  • Outliers can be detected using statistical tests

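The two tasks can be sketched on one-dimensional data. The slide names no specific method, so this example assumes k-means for clustering and a z-score test for outliers; the starting centroids, the threshold of 2.0, and the data are illustrative.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group points around the nearest centroid, then recompute the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

def z_score_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

ages = [21, 22, 23, 60, 61, 62]
clusters = kmeans_1d(ages, centroids=[20.0, 65.0])
print(clusters)  # [[21, 22, 23], [60, 61, 62]]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = z_score_outliers(readings)
print(outliers)  # [50]
```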
Which Technologies Are Used?

[Figure: data mining at the center, drawing on machine learning, pattern recognition, statistics, visualization, database technology, high-performance computing, algorithms, and applications]

Potential Applications of Data Mining

Where there are data, there are data mining applications.

• Data analysis and decision support (business intelligence)
  • Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
• Other applications
  • Text mining (newsgroups, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis

Major Issues in Data Mining (1)

• Mining methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multi-dimensional space
  • Data mining: an interdisciplinary effort
  • Boosting the power of discovery in a networked environment
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
  • Interactive mining
  • Incorporation of background knowledge
  • Presentation and visualization of data mining results

Major Issues in Data Mining (2)

• Efficiency and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
  • Handling complex types of data
  • Mining dynamic, networked, and global data repositories
• Data mining and society
  • Social impacts of data mining
  • Privacy-preserving data mining
  • Invisible data mining

Introduction to DataMining

  • 2. 2 A young, fast growing and promising field
  • 3. INTRODUCTION 3      Data mining (the analysis step of the "Knowledge Discovery and Data Mining" process, or KDD) Extracting hidden information An interdisciplinary subfield of computer science The computational process of discovering patterns in large data sets Involving methods at the intersection of Artificial intelligence, Machine learning, Statistics, and Database systems.
  • 4. INTORODUCTION(CONTD..) 4 The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves • database and data management aspects •    • data pre-processing model inference considerations complexity considerations, post-processing of discovered structures, visualization, and online updating.
  • 5. Why Data Mining? The explosive growth of data: from terabytes to petabytes. E.g., global backbone telecommunication networks carry tens of petabytes every day (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte). Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society. Major sources of abundant data: Business (Web, e-commerce, transactions, stocks, …); Science (remote sensing, bioinformatics, scientific simulation, …); Society and everyone (news, digital cameras, …).
  • 6. Why Data Mining? "Necessity is the mother of invention": data mining is the automated analysis of massive data sets.
  • 7. What Motivated Data Mining? We are drowning in data, but starving for knowledge!
  • 8. Evolution of Database Technology: Data mining can be viewed as a result of the natural evolution of IT. 1960s: data collection, database creation, and network DBMS. 1970s: relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.).
  • 9. Evolution of Database Technology (contd.): 1990s: data mining, data warehousing, multimedia databases, and Web databases. 2000s: stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems.
  • 11. What Is Data Mining? Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: is everything "data mining"? Simple search and query processing, and (deductive) expert systems, are not.
  • 12. Data Mining: Confluence of Multiple Disciplines (diagram): database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines all feed into data mining.
  • 13. Knowledge Discovery (KDD) Process: Data mining is the core of the knowledge discovery process. Diagram flow: Databases → data cleaning and data integration → Data Warehouse → selection → task-relevant data → data mining → pattern evaluation.
  • 14. Knowledge Process: (1) Data cleaning: remove noise and inconsistent data. (2) Data integration: combine multiple sources. (3) Data selection: retrieve the data relevant to the analysis. (4) Data transformation: transform data into a form appropriate for data mining. (5) Data mining: an essential process where intelligent methods are applied to extract data patterns. (6) Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures. (7) Knowledge presentation: visualization and representation techniques.
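
The seven steps of slide 14 can be read as a chain of small functions. The following is only a minimal sketch: the records, field names, function names, and the choice of item frequency as the "interestingness" measure are all hypothetical and are not taken from the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"cust": 1, "item": "milk", "store": "A"},
    {"cust": 1, "item": "bread", "store": "A"},
    {"cust": 2, "item": None, "store": "A"},
]
source_b = [
    {"cust": 2, "item": "milk", "store": "B"},
    {"cust": 3, "item": "MILK ", "store": "B"},
    {"cust": 3, "item": "bread", "store": "B"},
]

def clean(records):
    """Step 1: drop records with missing items (noise)."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: normalize item names and group them into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["cust"], set()).add(r["item"].strip().lower())
    return baskets

def mine(baskets):
    """Step 5: count in how many baskets each item occurs."""
    return Counter(item for basket in baskets.values() for item in basket)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep only items whose relative frequency reaches min_support."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

records = select(integrate(clean(source_a), clean(source_b)), ["cust", "item"])
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.7)

# Step 7: knowledge presentation, here just a printout.
for item, support in patterns.items():
    print(f"{item}: support {support:.0%}")
```

In a real project each step is far more involved; the point here is only the order of the steps and the fact that each one consumes the output of the previous one.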
  • 15. Example: A Web Mining Framework: Web mining usually involves data cleaning; data integration from multiple sources; warehousing the data; data cube construction; data selection for data mining; data mining; presentation of the mining results; and patterns and knowledge to be used or stored in a knowledge base.
  • 16. Data Mining in Business Intelligence (pyramid diagram; the potential to support business decisions increases toward the top). Layers from top to bottom: Decision Making; Data Presentation (visualization techniques); Data Mining (information discovery); Data Exploration (statistical summary, querying, and reporting); Data Preprocessing/Integration, Data Warehouses; Data Sources (paper, files, Web documents, scientific experiments, database systems). Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA.
  • 17. KDD Process: A Typical View from ML and Statistics: Input Data → Data Pre-Processing (data integration, normalization, feature selection, dimension reduction) → Data Mining (pattern discovery, association and correlation, classification, clustering, outlier analysis, …) → Post-Processing (pattern evaluation, pattern selection, pattern interpretation, pattern visualization). This is a view from typical machine learning and statistics communities.
  • 18. Data Mining: On What Kinds of Data? Database-oriented data sets and applications: relational database, data warehouse, transactional database. Advanced data sets and advanced applications: data streams and sensor data; time-series data, temporal data, sequence data (incl. bio-sequences); structure data, graphs, social networks, and multi-linked data; object-relational databases; heterogeneous databases and legacy databases; spatial data; multimedia databases; text databases; the World-Wide Web.
  • 19. RDBMS: A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model. Data in a single table represents a relation. Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row. A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.
  • 20. RDBMS (contd.): Database normalization: the relational model offers various levels of refinement of table organization and reorganization. The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database. The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory. Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules. The relational database has become the predominant choice for storing data.
  • 21. Relational database terminology: a relation is defined as a set of tuples that have the same attributes.
  • 22. RDBMS (contd.): Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch. Relation: customer is a group of entities describing customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category). Tables are also used to represent the relationships between or among multiple entities. Database queries (SQL): data is accessed using relational operations such as join, selection, and projection.
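
Slides 19 to 22 mention primary keys, foreign keys, and the relational operations join, selection, and projection. Here is a small sketch using Python's built-in `sqlite3` module. The `customer` columns loosely follow the slide; the `purchase` table, its columns, and all rows are hypothetical, invented only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    cust_id       INTEGER PRIMARY KEY,   -- primary key: identifies each row
    cust_name     TEXT,
    age           INTEGER,
    annual_income INTEGER
);
CREATE TABLE purchase (                  -- hypothetical table, not on the slides
    purchase_id INTEGER PRIMARY KEY,
    cust_id     INTEGER NOT NULL REFERENCES customer(cust_id),  -- foreign key
    item_name   TEXT,
    amount      REAL
);
""")

conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "Ann", 25, 45000), (2, "Bob", 41, 60000)],
)
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?, ?)",
    [(1, 1, "laptop", 900.0), (2, 2, "printer", 150.0), (3, 1, "software", 120.0)],
)

# Join (customer with purchase), selection (age < 30),
# and projection (only cust_name and item_name) in one query.
rows = conn.execute("""
    SELECT c.cust_name, p.item_name
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id
    WHERE c.age < 30
    ORDER BY p.purchase_id
""").fetchall()

for name, item in rows:
    print(name, item)
```

With the foreign key enforced, inserting a purchase for a `cust_id` that does not exist in `customer` raises `sqlite3.IntegrityError`.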
  • 23. Mining Relational Databases: Data mining can go further than querying by searching for trends or data patterns. Examples: analyze customer data to predict the risk of customers based on their income and age; detect deviations, e.g., by comparing sales with the previous year. RDBMSs are among the most commonly available and richest information repositories for data mining.
  • 24. What is a Data Warehouse? Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization's operational database. Supports information processing by providing a solid platform of consolidated, historical data for analysis. "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon) Data warehousing: the process of constructing and using data warehouses.
  • 25. DATA WAREHOUSES: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema. It is constructed via data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
  • 27. DATA WAREHOUSES (contd.): A data warehouse is modeled by a multidimensional data structure. Data cube: precomputation and fast access of summarized data. Each dimension corresponds to an attribute or a set of attributes in a schema. Each cell stores the value of some aggregate measure (count, sum, etc.). Example: in AllElectronics the cube has three dimensions: address (with country values U.S.A., Canada, Mexico), time (with quarter values Q1, Q2, Q3, Q4), and item (with type values).
  • 28. Multidimensional Data: Sales volume as a function of product, month, and region. Dimensions: Product, Location, Time. Hierarchical summarization paths: Industry → Category → Product; Region → Country → City → Office; Year → Quarter → Month or Week → Day.
  • 29. A Sample Data Cube (figure): dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TVs in the U.S.A.
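
To make the "sum" cells of the sample cube concrete, here is a small sketch that precomputes every aggregate cell from a list of sales facts. The dimension values follow slide 29; the sales amounts and the function name `build_cube` are hypothetical.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical sales facts: (product, date, country, amount).
sales = [
    ("TV", "1Qtr", "U.S.A.", 100),
    ("TV", "2Qtr", "U.S.A.", 120),
    ("PC", "1Qtr", "U.S.A.", 80),
    ("TV", "1Qtr", "Canada", 40),
    ("VCR", "3Qtr", "Mexico", 15),
]

def build_cube(facts):
    """Precompute every cell, including the rolled-up 'sum' cells."""
    cube = defaultdict(int)
    for prod, date, country, amount in facts:
        # Each fact contributes to 2**3 cells: along every dimension we
        # either keep the concrete value or roll it up to "sum".
        for cell in cartesian((prod, "sum"), (date, "sum"), (country, "sum")):
            cube[cell] += amount
    return dict(cube)

cube = build_cube(sales)

# The highlighted cell of slide 29: total annual sales of TVs in the U.S.A.
print(cube[("TV", "sum", "U.S.A.")])
# Grand total over all products, quarters, and countries.
print(cube[("sum", "sum", "sum")])
```

Precomputing the cells is what gives the fast access mentioned on slide 27: answering a summary question becomes a single dictionary lookup instead of a scan over all facts.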
  • 30. Data mining functionalities: Tasks can be classified as predictive (making predictions about values of data using known results found from different data) or descriptive (characterizing properties of a target data set; exploring the properties of the data examined). Data mining functionalities are used to specify the kinds of patterns: characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; outlier analysis.
  • 31. Characterization and Discrimination: Data characterization is a summarization of the general characteristics or features of a target class of data. The output of characterization can be presented in various forms: pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables. Descriptions can also be presented as generalized relations or as characteristic rules. Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result can be viewed along any dimension, such as occupation, to see these customers according to their type of employment.
  • 32. Data Discrimination: Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes. The output representation is similar to that of characterization descriptions. Discrimination descriptions expressed in the form of rules are called discrimination rules. The target and contrasting classes are specified by the user. Example: a user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period.
  • 33. Mining Frequent Patterns, Associations, Correlations: Frequent patterns include frequent item sets (milk, bread), frequent subsequences (laptop, digital camera, memory card), and frequent substructures (graphs, trees). Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 34. Association analysis (example): Items frequently purchased together: buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%], where X is a variable representing a customer. A confidence, or certainty, of 50% means that a customer who buys a computer has a 50% chance of buying software as well. A support of 1% means that 1% of all the transactions under analysis contain both computer and software. This is a single-dimensional association rule, written "computer ⇒ software [1%, 50%]". Multidimensional association rule: age(X, "20..29") ∧ income(X, "40K..49K") ⇒ buys(X, "laptop") [support = 2%, confidence = 60%].
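
Support and confidence for a rule such as computer ⇒ software can be computed directly from a list of transactions. The sketch below assumes a tiny hypothetical transaction set, so its numbers differ from the 1% and 50% quoted on the slide; the function names are illustrative.

```python
# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return count(antecedent | consequent, transactions) / n_antecedent

s = support({"computer"}, {"software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support={s:.0%}, confidence={c:.0%}]")
```

Here 2 of the 5 transactions contain both items (support 40%), and 2 of the 3 transactions containing a computer also contain software (confidence about 67%).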
  • 35. Classification and Regression for Predictive Analysis: Classification is the process of finding a model (function) that describes and distinguishes data classes or concepts. The model is derived from the analysis of a set of training data. Models are represented as classification rules (IF-THEN rules), decision trees, mathematical formulae, or neural networks. Regression: a statistical methodology for numeric prediction.
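
As a rough illustration of deriving a model from training data, the sketch below fits a one-level decision tree (a "decision stump") on a single numeric attribute and prints it as an IF-THEN rule. The income values, risk labels, and function names are hypothetical, and a stump is only the simplest possible stand-in for the model types listed on the slide.

```python
def fit_stump(xs, labels):
    """Find (threshold, left_label, right_label) with the fewest training errors."""
    values = sorted(set(xs))
    classes = sorted(set(labels))
    best = None
    # Candidate thresholds are the midpoints between neighbouring values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for left in classes:
            for right in classes:
                errors = sum(
                    (left if x <= threshold else right) != y
                    for x, y in zip(xs, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    if best is None:
        raise ValueError("need at least two distinct attribute values")
    return best[1:]

def predict(stump, x):
    """Apply the learned IF-THEN rule to a new value."""
    threshold, left, right = stump
    return left if x <= threshold else right

# Hypothetical training data: annual income (in thousands) and credit risk.
incomes = [20, 25, 30, 60, 70, 80]
risks = ["high", "high", "high", "low", "low", "low"]

stump = fit_stump(incomes, risks)
threshold, left, right = stump
print(f"IF income <= {threshold} THEN risk = {left} ELSE risk = {right}")
print(predict(stump, 50))
```

A full decision tree applies the same kind of split recursively to each resulting subset and usually chooses splits by a purity measure rather than the raw error count.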
  • 36. Cluster Analysis and Outlier Analysis: Cluster analysis: determining similarity among data on predefined attributes; the most similar data are grouped into clusters. Outlier analysis: a data set may contain objects that do not comply with the general model of the data; these objects are outliers. The analysis of outlier data is referred to as outlier analysis or anomaly mining. Outliers can be detected using statistical tests.
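
One simple statistical test for outliers is the z-score: flag every value that lies more than a chosen number of standard deviations from the mean. The slide does not name a specific test, so the z-score, the threshold of 2.0, the data, and the function name below are all assumptions made for illustration.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily purchase counts with one suspicious value.
purchases = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = zscore_outliers(purchases)
print(outliers)
```

The z-score is easy to compute but is itself distorted by the outliers it is looking for, since they inflate both the mean and the standard deviation; on very small samples a high threshold may never be reached.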
  • 37. Which Technologies Are Used? (diagram): machine learning, applications, algorithms, pattern recognition, statistics, visualization, database technology, and high-performance computing surround data mining.
  • 38. Potential Applications of Data Mining: Where there are data, there are data mining applications. Data analysis and decision support (business intelligence): market analysis and management (target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation); risk analysis and management (forecasting, customer retention, improved underwriting, quality control, competitive analysis); fraud detection and detection of unusual patterns (outliers). Other applications: text mining (news groups, email, documents) and Web mining; stream data mining; bioinformatics and bio-data analysis.
  • 39. Major Issues in Data Mining (1): Mining methodology: mining various and new kinds of knowledge; mining knowledge in multi-dimensional space; data mining as an interdisciplinary effort; boosting the power of discovery in a networked environment; handling noise, uncertainty, and incompleteness of data; pattern evaluation and pattern- or constraint-guided mining. User interaction: interactive mining; incorporation of background knowledge; presentation and visualization of data mining results.
  • 40. Major Issues in Data Mining (2): Efficiency and scalability: efficiency and scalability of data mining algorithms; parallel, distributed, stream, and incremental mining methods. Diversity of data types: handling complex types of data; mining dynamic, networked, and global data repositories. Data mining and society: social impacts of data mining; privacy-preserving data mining; invisible data mining.