2. Topics to cover
Introduction
Types of Data
Data Mining Functionalities
Interestingness of Patterns
Classification of Data Mining Systems
Data Mining Task Primitives
Integration of a Data Mining System with a Data Warehouse
Issues
Data Preprocessing
7. DATABASE
• Database: Shared collection of logically
related data (and a description of this
data), designed to meet the information
needs of an organization.
• Database management system (DBMS): A
software system that enables users to
define, create, and maintain the database.
8. Who does it, and how?
• A Database Management System (DBMS) does this job.
• Using software tools such as Access, FileMaker, Lotus Notes,
Oracle, or SQL Server.
• It includes tools to add, modify or delete data from the
database, ask questions (or queries) about the data
stored in the database and produce reports summarizing
selected contents.
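A minimal sketch of the DBMS operations listed above (add, modify, delete, query, report), using Python's built-in sqlite3 module as a stand-in for the tools named on the slide. The clients table, its columns, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Add data.
conn.executemany(
    "INSERT INTO clients (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ben", "Leeds"), ("Chen", "Pune")],
)
# Modify data.
conn.execute("UPDATE clients SET city = ? WHERE name = ?", ("York", "Ben"))
# Delete data.
conn.execute("DELETE FROM clients WHERE name = ?", ("Chen",))

# Query: which clients are in Pune?
pune_clients = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM clients WHERE city = ? ORDER BY name", ("Pune",)
    )
]

# Report: number of clients per city.
report = dict(conn.execute("SELECT city, COUNT(*) FROM clients GROUP BY city"))
print(pune_clients, report)
```

The same statements would look much the same in any SQL-based DBMS; only the connection step is specific to sqlite3.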
9. Why do we need a database?
• Keep records of our:
– Clients
– Staff
– Volunteers
• Keep a record of activities
• Keep sales records
• Develop reports
• Run queries
10. Data vs. information
• What is data?
– Data is unprocessed information.
• What is information?
– Information is data that have been organized and communicated in a
logical and meaningful manner.
11. Purpose of Database system/
Stages of Database System
– Data is converted into information, and information is converted
into knowledge.
– Knowledge: information evaluated and organized so that it can be
used purposefully.
– The purpose is to transform:
Data (unprocessed information) → Information (processed data) → Knowledge (information evaluated using measures) → Action (data analysis and future prediction)
12. Data Mining works with Warehouse Data
• Data warehousing provides the
enterprise with a memory.
• Data mining provides the
enterprise with intelligence.
14. What is Data Mining?
• Nowadays, huge data sets have become available due
to advances in technology.
• As a result, there is increasing interest in various
scientific communities in exploring emerging
data mining techniques for the analysis of these large
data sets.
• Data mining is the extraction of implicit, previously
unknown, and potentially useful
information, patterns, and associations from data.
• Data mining is the exploration and analysis, by
automatic or semi-automatic means, of large
quantities of data in order to discover meaningful
patterns.
15. WHO USES DATA MINING?
• Banking
– future prediction
• Amazon.com (online stores)
– recommendation
• Facebook
– predicting how active a user will be after
3 months
13/03/16 Seval Ünver | CENG 553 15
17. DATA MINING IS NOT…
• Data warehousing
• SQL / ad hoc queries /
reporting
• Online Analytical
Processing (OLAP)
• Data visualization
DATA MINING IS…
• Exploring data
• Finding patterns
• Making predictions
18. KDD Process
• Knowledge discovery in databases (KDD) is a
multistep process of finding useful information
and patterns in data.
• Data mining is the use of algorithms to extract
the information and patterns derived by the KDD
process.
• Many texts treat KDD and Data Mining as the
same process, but it is also possible to think of
Data Mining as the discovery part of KDD.
20. STEPS OF KDD PROCESS
1. Selection
Data extraction: obtain data from heterogeneous data
sources such as databases, data warehouses, the World Wide Web, or
other information repositories.
2. Preprocessing
Data cleaning: incomplete, noisy, or inconsistent data is
cleaned. Missing data may be ignored or predicted;
erroneous data may be deleted or corrected.
3. Transformation
Data integration: combine data from multiple sources
into a coherent store. Data can be encoded in common
formats, normalized, or reduced.
21. Steps of KDD Process
4. Data mining
Apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation
Pattern evaluation: evaluate the interestingness of the resulting
patterns, or apply interestingness measures to filter the
discovered patterns.
Knowledge presentation: present the mined knowledge;
visualization techniques can be used.
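The five KDD steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: the two sources, the record layout, the choice of co-occurrence counting as the mining algorithm, and the MIN_COUNT threshold are all invented here, not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "items": "Bread, Milk"}, {"id": 2, "items": None}]
source_b = [{"id": 3, "items": "milk,bread,eggs"}, {"id": 4, "items": "Eggs"}]

# 1. Selection: gather the data from heterogeneous sources.
selected = source_a + source_b

# 2. Preprocessing: here, records with missing data are simply ignored.
cleaned = [r for r in selected if r["items"] is not None]

# 3. Transformation: encode every record in one common format
#    (a set of lowercase item names).
transformed = [
    frozenset(item.strip().lower() for item in r["items"].split(","))
    for r in cleaned
]

# 4. Data mining: count how often each pair of items occurs together.
pair_counts = Counter(
    pair
    for basket in transformed
    for pair in combinations(sorted(basket), 2)
)

# 5. Evaluation: keep only pairs that meet a minimum count (assumed threshold).
MIN_COUNT = 2
patterns = {pair: n for pair, n in pair_counts.items() if n >= MIN_COUNT}

# Knowledge presentation: a plain printout stands in for visualization.
for pair, n in patterns.items():
    print(f"{pair[0]} + {pair[1]}: seen together {n} times")
```

In a real project each step is usually far larger than the mining step itself; the sketch only shows how the output of one step feeds the next.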
22. Types of Data /
What kind of Data can be mined
• Data mining should be applicable to any kind of information
repository. However, algorithms and approaches may differ
when applied to different types of data.
• Relational Databases
• Data Warehouse
• Transaction Databases
• Advanced DB systems and information repositories
– Spatial databases
– Time-series data
– Multimedia databases
– WWW
23. Relational Databases
– A relational database consists
of a set of tables containing
either values of entity
attributes, or values of
attributes from entity
relationships.
– Tables have columns and rows,
where columns represent
attributes and rows represent
tuples.
– A tuple in a relational table
corresponds to either an object
or a relationship between
objects and is identified by a set
of attribute values representing
a unique key.
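A small sketch of these ideas with sqlite3: two entity tables, one relationship table, and a key that identifies each tuple. The table names (customer, item, purchase) and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Entity tables: each row (tuple) describes one object, identified by its key.
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, label TEXT)")

# Relationship table: each row links two objects; the key is the pair of ids.
conn.execute("""
    CREATE TABLE purchase (
        cust_id INTEGER REFERENCES customer(cust_id),
        item_id INTEGER REFERENCES item(item_id),
        PRIMARY KEY (cust_id, item_id)
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO item VALUES (10, 'Laptop')")
conn.execute("INSERT INTO purchase VALUES (1, 10)")

# A second tuple with the same key is rejected, so the key stays unique.
try:
    conn.execute("INSERT INTO purchase VALUES (1, 10)")
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

# Follow the relationship back to the attribute values of both entities.
rows = conn.execute("""
    SELECT c.name, i.label
    FROM purchase p
    JOIN customer c ON c.cust_id = p.cust_id
    JOIN item i ON i.item_id = p.item_id
""").fetchall()
print(rows, duplicate_accepted)
```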
24. Data Warehouse
• A data warehouse is a
repository of data collected from
multiple data sources (often
heterogeneous) and is
intended to be used as a
whole under the same
unified schema. A data
warehouse gives the option
to analyze data from
different sources under the
same roof.
25. Transaction Databases
• A transaction database is a set of
records representing
transactions, each with a time
stamp, an identifier and a set of
items. Associated with the
transaction files could also be
descriptive data for the items.
• Transactions are usually stored
either in flat files or in two
normalized transaction tables,
one for the transactions and one
for the transaction items.
• Applications: airline reservations,
railway reservations, log records,
etc.
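The two-table layout can be sketched with plain Python lists standing in for the tables. The identifiers, time stamps, and items below are invented for illustration.

```python
from collections import defaultdict

# Table 1: one row per transaction (identifier, time stamp). Values are invented.
transactions = [
    ("T100", "2016-03-13 09:15"),
    ("T200", "2016-03-13 09:40"),
]

# Table 2: one row per (transaction, item) pair.
transaction_items = [
    ("T100", "bread"), ("T100", "milk"),
    ("T200", "milk"), ("T200", "eggs"), ("T200", "butter"),
]

# Join the two tables back into "identifier -> set of items" for mining.
items_by_transaction = defaultdict(set)
for trans_id, item in transaction_items:
    items_by_transaction[trans_id].add(item)

for trans_id, timestamp in transactions:
    print(trans_id, timestamp, sorted(items_by_transaction[trans_id]))
```

Most itemset-based mining algorithms expect the joined form (one item set per transaction), so this regrouping is a common first step.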
26. MULTIMEDIA DATABASE
• Multimedia databases include video,
images, audio, sound clips, and text data.
They can be stored in extended object-relational
or object-oriented databases, or
simply on a file system.
• Ex: digital music players, social media,
electronic publishing.
27. Spatial Databases
• A spatial database is a
database that is enhanced to
store and access spatial data
that defines a geometric
space.
• These data are often
associated with geographic
locations and features, or
constructed features like
cities. Data on spatial
databases are stored as
coordinates, points, lines,
polygons and topology.
• Ex: store geographical
information like maps, and
global or regional
positioning.
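As an illustration of data stored as coordinates and polygons, the sketch below answers a typical spatial query: does a point lie inside a region? It uses the standard ray-casting test. The rectangular city_boundary and the sample points are assumptions; real spatial databases provide such operations built in.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count the polygon edges crossed by a horizontal ray
    going right from the point. An odd count means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical city boundary stored as a list of (x, y) vertices.
city_boundary = [(0, 0), (4, 0), (4, 3), (0, 3)]

print(point_in_polygon((2, 1), city_boundary))  # a point within the boundary
print(point_in_polygon((5, 1), city_boundary))  # a point east of the boundary
```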
28. Time Series Database
• A Time-Series
Database is a
database that
contains data for each
point in time.
• Examples: weather
data, stock market
data, logged browser
activity, ocean tides.
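A minimal sketch of time-series data as (time point, value) pairs, with a moving average as one simple analysis over it. The daily temperature readings and the window size are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical daily temperature readings, one value per point in time.
start = date(2016, 3, 1)
temps = [12.0, 14.0, 13.0, 15.0, 17.0]
series = [(start + timedelta(days=i), t) for i, t in enumerate(temps)]


def moving_average(series, window):
    """Average each value with the (window - 1) values before it."""
    out = []
    for i in range(window - 1, len(series)):
        values = [v for _, v in series[i - window + 1 : i + 1]]
        out.append((series[i][0], sum(values) / window))
    return out


smoothed = moving_average(series, window=3)
for day, value in smoothed:
    print(day.isoformat(), value)
```

Smoothing like this is often a preprocessing step before looking for trends or deviations in the series.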
30. World Wide Web
• The World Wide Web is the most
heterogeneous and dynamic repository
available.
• Data in the World Wide Web is organized
in inter-connected documents. These
documents can be text, audio, video, raw
data, and even applications.
32. Integration of a Data Mining System with a Database/Data
Warehouse System
The integration schemes are as follows −
• No Coupling − In this scheme, the data mining system does not
utilize any of the database or data warehouse functions. It fetches
the data directly from a particular source and processes that data
using some data mining algorithms. The data mining result is stored
in another file. (Ex: collect data directly from a transactional database.)
• Loose Coupling/Semi−tight Coupling − In this scheme, the data
mining system may use some of the functions of the database and data
warehouse system. It fetches the data from the data repository
managed by these systems, or directly from particular sources, and
performs data mining on that data. (Ex: data taken from a transactional
DB plus a database/DWH.)
• Tight coupling − In this scheme, the data mining system is smoothly
integrated into the database or data warehouse system. The data
mining subsystem is treated as one functional component of an
information system.
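The three schemes can be contrasted on one tiny task, counting how often each item was sold. This is a rough sketch under assumptions: the flat-file contents and the sales table are invented, sqlite3 stands in for the database system, and a SQL aggregate is only a stand-in for a mining component that runs inside the database.

```python
import csv
import io
import sqlite3
from collections import Counter

# Hypothetical flat file with one (transaction, item) pair per line.
FLAT_FILE = "trans_id,item\nT1,milk\nT1,bread\nT2,milk\n"

# No coupling: read the flat file directly and do all the work in the application.
rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))
no_coupling = Counter(r["item"] for r in rows)

# Load the same data into a database for the other two schemes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(r["trans_id"], r["item"]) for r in rows],
)

# Loose coupling: use the database only to fetch data, then mine outside it.
fetched = conn.execute("SELECT item FROM sales").fetchall()
loose_coupling = Counter(item for (item,) in fetched)

# Tight coupling (stand-in): the counting itself runs inside the database engine.
tight_coupling = dict(
    conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item")
)

print(dict(no_coupling), dict(loose_coupling), tight_coupling)
```

All three produce the same counts; what changes is where the data lives and where the computation runs.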
34. Data Mining Task Primitives
• We can specify a data mining task in the form of a data mining
query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task
primitives.
• Note − These primitives allow us to communicate in an interactive
manner with the data mining system. Here is the list of Data Mining
Task Primitives −
1. Kind of knowledge to be mined.
2. Set of task relevant data to be mined.
3. Representation for visualizing the discovered patterns.
4. Background knowledge to be used in discovery process.
5. Interestingness measures and thresholds for pattern evaluation.
35. Data Mining Task Primitives-
Example of Data mining query
• use database AllElectronics_db
  use state_location_hierarchy for B.address
  mine characteristics as customerPurchasing
  analyze count%
  in relevance to C.age, I.type, I.place_made
  from customer C, item I, purchase P, items_sold S, branch B
  where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID
    and P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
  with noise threshold = 5%
  display as table
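One way to see how the query above maps onto the five task primitives is to write it as a plain data structure. The MiningQuery class and its field names are hypothetical, introduced only for this sketch; they are not part of any data mining query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container with one group of fields per task primitive."""
    kind_of_knowledge: str            # primitive 1
    relevant_attributes: list         # primitive 2
    source_tables: list               # primitive 2
    conditions: list                  # primitive 2
    display_as: str                   # primitive 3
    background_knowledge: str         # primitive 4
    thresholds: dict = field(default_factory=dict)  # primitive 5


query = MiningQuery(
    kind_of_knowledge="characterization",  # "mine characteristics as ..."
    relevant_attributes=["C.age", "I.type", "I.place_made"],
    source_tables=["customer C", "item I", "purchase P", "items_sold S", "branch B"],
    conditions=['P.method_paid = "AmEx"', 'B.address = "Canada"', "I.price >= 100"],
    display_as="table",
    background_knowledge="concept hierarchy on B.address",
    thresholds={"noise": 0.05},  # "with noise threshold = 5%"
)
print(query)
```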
36. Data Mining Task Primitives-cont..
1. Kind of knowledge to be mined
– It refers to the kind of functions to be performed.
These functions are −
• Characterization
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
2. Set of task-relevant data to be mined
– This is the portion of the database in which the user is interested.
This portion includes the following −
• Database Attributes
• Data Warehouse dimensions of interest
37. Data Mining Task Primitives-cont..
3. Representation for visualizing the discovered
patterns
– This refers to the form in which discovered patterns
are to be displayed. These representations may
include the following −
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
38. Data Mining Task Primitives-cont..
4. Background knowledge
– Background knowledge allows data to be mined at multiple
levels of abstraction. For example, concept hierarchies are
one form of background knowledge that allows data to be mined at
multiple levels of abstraction.
5. Interestingness measures and thresholds for
pattern evaluation
– These are used to evaluate the patterns discovered by the process of
knowledge discovery. There are different interestingness measures for
different kinds of knowledge.
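As a concrete case, support and confidence are two widely used interestingness measures for association rules. The slides do not name specific measures, so this choice, the toy baskets, and the threshold values are assumptions for illustration.

```python
# Toy transaction data (invented).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
]


def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


# Assumed thresholds; a rule is kept only if it meets both.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.6

antecedent, consequent = {"bread"}, {"milk"}
s = support(antecedent | consequent)
c = confidence(antecedent, consequent)
interesting = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
print(f"bread -> milk: support={s:.2f}, confidence={c:.2f}, interesting={interesting}")
```

Other kinds of knowledge use other measures; a classifier, for example, would more naturally be judged by its accuracy on held-out data.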
40. Classification of Data Mining Systems (cont.)
• Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multimedia,
heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Editor's notes
Data mining has become popular in many applications, especially in business. To name a few examples: CapitalOne bank uses data mining to predict whether a loan applicant will default on the loan, given information about his/her demographics, credit history, type of loan, etc.
Netflix (the largest DVD-by-mail rental company) and Amazon.com use data mining to provide recommendations to their customers (“you might also be interested in ___”).
British law enforcement and intelligence agencies use data mining to look for data patterns that might point to developing crime trends or security threats.
Facebook uses data mining to predict how active a user will be after 3 months.
Children's Hospital in Boston uses data mining to sift through emergency room patient records to detect domestic abuse.
Pandora (an Internet music radio offering customized music) chooses the next song to play using data mining algorithms.