This presentation explains what data mining is and which techniques and algorithms are available for it. It is intended to help you understand the core concepts of data mining.
What is Datamining? Which algorithms can be used for Datamining?
1. DATAMINING
Seval Ünver
E1900810 | CENG 553
Middle East Technical University
Computer Engineering Department
14.05.2013 CENG 553
2. Outline
• Introduction
• Data vs. Information
• Who uses datamining?
• Common uses of datamining
• Datamining is…
• Supervised and Unsupervised Learning
• Predictive Models
• Datamining Process
• Some Popular Datamining Algorithms
• Data Warehouse
• Conceptual Modelling of Data Warehouse
• Example of Star Schema, Snowflake Schema, Fact Constellation
• Evolution of OLTP, OLAP and Data Warehouse
08.10.2013 Seval Ünver | CENG 553 2
3. Introduction
• Nowadays, large data sets have become available
due to advances in technology.
• As a result, there is an increasing interest in
various scientific communities to explore the use
of emerging data mining techniques for the
analysis of these large data sets *.
• Data mining is the semi-automatic discovery of
patterns, associations, changes, anomalies, and
statistically significant structures and events in
data **.
* Grossman et al., 2001
** Shmueli G, 2012
4. What is Datamining?
• Process of semi-automatically analyzing large
databases to find patterns that are *:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: should be possible to act on the item
– understandable: humans should be able to
interpret the pattern
• Also known as Knowledge Discovery in
Databases
* Prof. S. Sudarshan CSE Dept, IIT Bombay
5. Big data: Cash Register
• Past: It was a
calculator.
• Now: It saves every
detail of every
action.
– The movements of
each product.
– The movements of
each user.
6. Data vs. Information
• Data is useless by itself.
• Data is not just numbers
or letters. It consists of
numbers, letters and
their meaning. The
meaning is called
metadata.
• Information is
interpreted data.
• Converting the data to
information is called data
processing.
7. Who uses Datamining?
• CapitalOne Bank
– future prediction
• Netflix (the largest DVD-by-mail rental company)
– Recommendation (you might also be interested in…)
• Amazon.com
– recommendation
• British law enforcement
– crime trends or security threats
• Facebook
– predicting how active a user will be after 3 months
• Children's Hospital in Boston
– detecting domestic abuse
• Pandora (an Internet music radio)
– chooses the next song to play
8. Common uses of Datamining:
• Direct mail marketing
• Web site personalization
• Credit card fraud detection
• Gas & jewelry
• Bioinformatics
• Text analysis
– SAS lie detector
• Market basket analysis
– Beer & baby diapers:
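Market basket analysis is usually summarized with two standard measures: support (how often items appear together) and confidence (how often the consequent appears when the antecedent does). The following is a minimal sketch of both, assuming a handful of made-up baskets; the data and the function names `support` and `confidence` are illustrative and not taken from the presentation.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical transactions, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

pair_support = support(baskets, {"beer", "diapers"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(f"support(beer, diapers) = {pair_support:.2f}")
print(f"confidence(diapers -> beer) = {rule_confidence:.2f}")
```

A rule such as "diapers -> beer" is typically considered interesting only when both values exceed thresholds chosen by the analyst.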
9. Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value added data
Utilities | Power usage analysis
11. Datamining is not…
• Data warehousing
• SQL / Ad Hoc Queries / Reporting
• Software Agents
• Online Analytical Processing (OLAP)
• Data Visualization
12. Supervised vs. Unsupervised Learning
• Supervised:
– Problem solving
– Driven by real business problems and historical data
– Quality of results depends on quality of data
• Unsupervised:
– Exploration (aka clustering)
– Relevance often an issue
• Beer and baby diapers
– Useful when trying to get an initial understanding of the data
– Non-obvious patterns can sometimes pop out of a completed
data analysis project
26. Pros and Cons of Neural Networks
• Pros
– Can learn more complicated class boundaries
– Fast application
– Can handle large number of features
• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing number of nodes
27. Supervised Algorithm Summary
• Decision Trees
– Understandable
– Relatively fast
– Easy to translate into SQL queries
• kNN
– Quick and easy
– Models tend to be very large
• Neural Networks
– Difficult to interpret
– Can require significant amounts of time to train
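To illustrate why kNN is "quick and easy" but produces large models, here is a minimal sketch of a k-nearest-neighbour classifier. There is no training step: the "model" is the stored training set itself, and every prediction scans all of it. The function name `knn_predict`, the squared Euclidean distance, and the toy data are assumptions made for this example.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows.

    `train` is a list of (feature_tuple, label) pairs; it is kept whole,
    which is why a kNN model grows with the amount of training data.
    """
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data, invented for illustration.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.3), "low"),
    ((5.0, 5.2), "high"), ((5.5, 4.8), "high"), ((4.9, 5.1), "high"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.2, 5.0)))
```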
28. K-Means Clustering
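• The user starts by specifying the number of clusters (K)
• K data points are randomly selected as the initial centroids
• Repeat until no change:
– Hyperplanes separating the K centroids are generated, so that each data point is assigned to its nearest centroid
– The centroid of each of the K clusters is recomputed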
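The loop on this slide can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the fixed random seed, the iteration cap, and the sample points are all assumptions made for the example. Assigning each point to its nearest centroid is equivalent to partitioning the space with separating hyperplanes.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K data points picked at random
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:  # no change: stop
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the starting centroids are random, different seeds can yield different cluster numbering, and on harder data different clusterings.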
29. Data Warehouse
A data warehouse is a database used for
reporting and data analysis.
30. Data Mining works with Warehouse Data
• Data Mining provides
the Enterprise with
intelligence
• Data Warehousing
provides the Enterprise
with a memory
31. Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of
dimension tables
– Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
– Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
32. Example of Star Schema
• Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– location: location_key, street, city, state_or_province, country
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
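A star schema is typically queried by joining the fact table to the dimension tables it references and aggregating the measures. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. It is a reduced version of the slide's schema (only the time and item dimensions, with fewer columns), and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day INTEGER,
                   month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                   brand TEXT, type TEXT);
-- Fact table: foreign keys into the dimensions plus numeric measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
-- Hypothetical sample rows.
INSERT INTO time VALUES (1, 5, 1, 'Q1', 2013), (2, 9, 4, 'Q2', 2013);
INSERT INTO item VALUES (10, 'cola', 'BrandA', 'drink'),
                        (11, 'chips', 'BrandB', 'snack');
INSERT INTO sales VALUES (1, 10, 3, 6.0), (1, 11, 2, 5.0), (2, 10, 4, 8.0);
""")

# Join the fact table to its dimensions and aggregate the measures.
rows = conn.execute("""
SELECT t.quarter, i.brand, SUM(s.units_sold), SUM(s.dollars_sold)
FROM sales AS s
JOIN time AS t ON t.time_key = s.time_key
JOIN item AS i ON i.item_key = s.item_key
GROUP BY t.quarter, i.brand
ORDER BY t.quarter, i.brand
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake schema the same query would need one extra join per normalized dimension level, for example from item to a separate supplier table.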
33. Example of Snowflake Schema
• Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– location: location_key, street, city_key
– city: city_key, city, state_or_province, country
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type
34. Example of Fact Constellation
• Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table
– Keys: time_key, item_key, shipper_key, from_location, to_location
– Measures: dollars_cost, units_shipped
• Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– location: location_key, street, city, province_or_state, country
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– shipper: shipper_key, shipper_name, location_key, shipper_type
35. Evolution of OLTP, OLAP and Data Warehouse
[Figure: evolution timeline; axis label: Time]
36. Evolutionary Step | Business Question | Enabling Technology
Data Collection (1960s) | "What was my total revenue in the last five years?" | computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | faster and cheaper computers with more storage, relational databases
Data Warehousing and Decision Support | "What were unit sales in New England last March? Drill down to Boston." | faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, data warehouses
Data Mining | "What's likely to happen to Boston unit sales next month? Why?" | faster and cheaper computers with more storage, advanced computer algorithms
37. As a Result
• Applying data mining requires a large amount of
quality data.
• The aim of data mining is to derive rules and
equations that can be used to predict the future.
• Success in such work depends on database experts
and data mining specialists working together.
• The work can take a long time, so it requires time and
patience.
38. Thank You
If you have questions, you can contact me
via email: e1900810@ceng.metu.edu.tr
Seval Ünver | METU CENG
Editor's Notes
• The US Government uses data mining to track fraud
• A supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross selling
• Target marketing
• Holding on to good customers
• Weeding out bad customers
• Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
• Nearest neighbour
• Decision tree classifier: divides the decision space into piecewise constant regions
• Probabilistic/generative models
• Neural networks: partition by non-linear boundaries
• Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels
• Widely used learning method
• Easy to interpret: can be re-represented as if-then-else rules
• Approximates a function by piecewise constant regions
• Does not require any prior knowledge of the data distribution; works well on noisy data
• Has been applied to: classifying medical patients by disease, equipment malfunctions by cause, and loan applicants by likelihood of payment
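The point that a decision tree can be re-represented as if-then-else rules can be sketched as follows. The tree here is hand-written, not learned from data, and the nested-tuple encoding, the attribute names (`income`, `years_employed`), the thresholds, and the class labels are all hypothetical choices for this example.

```python
def tree_to_rules(node, conditions=()):
    """Yield one 'IF ... THEN label' rule for every leaf of the tree.

    A leaf is a string (the predicted class label). An internal node is a
    tuple (attribute, threshold, left_subtree, right_subtree), where the
    left branch means attribute <= threshold.
    """
    if isinstance(node, str):
        yield "IF " + (" AND ".join(conditions) or "TRUE") + " THEN " + node
        return
    attribute, threshold, left, right = node
    yield from tree_to_rules(left, conditions + (f"{attribute} <= {threshold}",))
    yield from tree_to_rules(right, conditions + (f"{attribute} > {threshold}",))


# Hypothetical loan-screening tree, written by hand for illustration.
tree = ("income", 30000, "reject", ("years_employed", 2, "review", "approve"))

rules = list(tree_to_rules(tree))
for rule in rules:
    print(rule)
```

Each rule corresponds to one root-to-leaf path, which is also why such a tree translates readily into SQL `WHERE` clauses.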
• Pros
– Reasonable training time
– Fast application
– Easy to interpret
– Easy to implement
– Can handle large number of features
• Cons
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data
• Pros
– Fast training
• Cons
– Slow during application
– No feature selection
– Notion of proximity vague
• Set of nodes connected by directed weighted edges
• Useful for learning complex data like handwriting, speech and image recognition
• Data warehouse mining:
– assimilate data from operational sources
– mine static data
• Mining log data
• Continuous mining: example in process control
• Stages in mining:
– data selection
– pre-processing: cleaning
– transformation
– mining
– result evaluation
– visualization