2. WHY DATA MINING?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation,
…
Society and everyone: news, digital cameras, YouTube Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets 2
3. EVOLUTION OF SCIENCES
Before 1600, empirical science
1600-1950s, theoretical science
E h di i li h h i l t Th ti l d l ft ti t Each discipline has grown a theoretical component. Theoretical models often motivate
experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g.Over the last 50 years, most disciplines have grown a third, computational branch (e.g.
empirical, theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,y y, p yp f ,
Comm. ACM, 45(11): 50-54, Nov. 2002 3
4. EVOLUTION OF DATABASE TECHNOLOGY
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s: 1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990 1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000 2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems 4
5. WHAT IS DATA MINING?
Data mining (knowledge discovery from data)
E i f i i i i l i li i i l Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of datauge a ou t o data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.g g, g, g ,
Watch out: Is everything “data mining”?
Simple search and query processing
(Deductive) expert systems 5
7. STEPS FOR KNOWLEDGE DISCOVERY PROCESSSTEPS FOR KNOWLEDGE DISCOVERY PROCESS
Data cleaningg
to remove noise or irrelevant data
Data integration
where multiple data sources are combined where multiple data sources are combined
Data selection
where data relevant to the task analysis is retrieved
Data transformation
where data are transformed or consolidated into
forms appropriate for mining by performing
summary or aggregation operations
Data mining
an essential process where intelligent methods arean essential process where intelligent methods are
applied in order to extract data patterns 7
8. S K D PSTEPS FOR KNOWLEDGE DISCOVERY PROCESS
Pattern evaluation
To identify the truly interesting patterns
representing knowledge based on some
interestingness measureinterestingness measure
Knowledge representation
Where visualization and knowledge representation
techniques are used to present the mined knowledge
to the user
Data mining is the process of discovering interesting knowledge from
large amount of data stored either in the data bases, data
warehouses, or other information repositories
8
, p
9. ARCHITECTURE OF DATA MINING SYSTEM
Based on this view, the architecture of data,
mining system has the following major
components-
D t b d t h th i f ti Database, data warehouse or other information
repository
Database or data warehouse server
Knowledge base
Data mining engine
Pattern evaluation modulePattern evaluation module
Graphical user interface
9
11. DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
Database
S i iTechnology Statistics
Data Mining
Machine
Learning
Visualization
Learning
PatternPattern
Recognition
Algorithm
Other
Disciplines
11
12. WHY NOT TRADITIONAL DATA ANALYSIS?
Tremendous amount of data
Algorithms must be highly scalable to handle such as tera-bytesAlgorithms must be highly scalable to handle such as tera bytes
of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor dataData streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs scientific simulations Software programs, scientific simulations
New and sophisticated applications
12
13. DATA MINING- ON WHAT KIND OF DATA
Different data store on which data mining can beg
done
It is applicable to any kind of information
itrepository
However, the challenges and techniques of
mining may differ for each of the repositorymining may differ for each of the repository
systems
13
14. DATA MINING: ON WHAT KINDS OF DATA?
Database-oriented data sets and applications
Relational database data warehouse transactional database Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal dataSpatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web 14
18. TRANSACTIONAL DATABASE
Consists of files where
each record represent
a transaction
T i ll i l d Typically includes a
transaction identity
number and a list of
items making up the
transaction
18
19. DATA MINING FUNCTIONALITIES
Multidimensional concept description: Characterization and
discriminationdiscrimination
Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper Beer [0.5%, 75%] (Correlation or causality?)
Cl ifi i d di i Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future predictionp p
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown or missing numerical values 19
20. DATA MINING FUNCTIONALITIES (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., clusterp , g ,
houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysisy
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysiso se o e cep o ? Use a e ec o , a e eve s a a ys s
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e g digital camera large SD memory Sequential pattern mining: e.g., digital camera large SD memory
Periodicity analysis
Similarity-based analysis
O h di d i i l l Other pattern-directed or statistical analyses 20