Prof. Pier Luca Lanzi	

Data Mining
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Why Data Mining?	

“Necessity is the mother of invention”	

	

Explosive Growth of Data	

Pressing need for the automated analysis of massive data	

	

Emerged in the late 1980s	

Major developments in the mid 1990s

http://consumer.media.seagate.com/2012/06/the-digital-den/how-much-data-is-generated-in-a-minute/

Some Data About Data
(http://wikibon.org/blog/big-data-statistics/)	

•  2.7 Zettabytes of data exist in the digital universe today

•  Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data	

•  Walmart handles more than 1 million customer transactions every hour,
stored into databases estimated to contain more than 2.5 petabytes of data	

•  More than 5 billion people are calling, texting, tweeting and browsing on
mobile phones worldwide	

•  100 terabytes of data uploaded daily to Facebook	

•  According to Twitter’s own research in early 2012, it sees roughly 175 million
tweets every day, and has more than 465 million accounts	


Evolution of Technology	

•  1960s: data collection, database creation,  network DBMS	

•  1970s: relational data model, relational DBMS implementation	

•  1980s: RDBMS, advanced data models (extended-relational, OO,
deductive, etc.); application-oriented DBMS 
(spatial, scientific, engineering, etc.)	

•  1990s: data mining, data warehousing, multimedia databases, 
and Web databases	

•  2000s: stream data management and mining, web technology (XML,
data integration), global information systems 	

•  2010s: social networks, NoSQL, unstructured data, etc.	


http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/

What is the Commercial Viewpoint?	

•  Huge amounts of data are being collected 
and warehoused every day

§ Web data, e-commerce	

§ Purchases at department stores	

§ Bank/Credit Card transactions	

•  Computers have become
cheaper and more powerful	

•  Competitive pressure is strong 
to provide better, customized services
(e.g., CRM or Customer Relationship Management)	

•  Poor data across businesses and the government costs 
the U.S. economy $3.1 trillion a year 
(http://wikibon.org/blog/big-data-statistics/)


What is the Scientific Viewpoint?	

•  Data collected and stored at 
enormous speeds (GB/hour)	

§ remote sensors on a satellite	

§ telescopes scanning the skies	

§ microarrays generating gene 
expression data	

§ scientific simulations 
generating terabytes of data	

•  Traditional techniques infeasible for raw data	

•  Data mining may help scientists 	

§ in classifying and segmenting data	

§ in forming hypotheses


Why Mining Large Datasets?	

•  There is often information 
“hidden” in the data that is 
not readily evident	

•  Human analysts may take weeks 
to discover useful information	

•  Much of the data is
never analyzed at all!	

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999; the vertical axis runs from 0 to 4,000,000.]


Word cloud from Science, “Dealing with Data”, 11 February 2011, volume 331, issue 6018, pages 639-806

Science 11 February 2011
“Dealing with Data”

Examples

Examples	

•  Customer attrition	

§ Given: customer information for the past months	

§ Problem: predict who is likely to attrite next month, 
or estimate customer value 	

§ Data: historical customer records	

•  Credit assessment	

§ Given: a loan application	

§ Problem: predict whether the bank should 
approve the loan	

§ Data: records from other loans	


http://wikibon.org/blog/role-of-the-data-scientist/

https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

What is Data Mining?

What is Data Mining?	

•  The non-trivial process of identifying (1) valid, (2) novel, (3) potentially
useful, and (4) understandable patterns in data.	

•  Alternative names:

§ Data Fishing, Data Dredging (1960-)	

§ Data Mining (1990-), used by DB and business 	

§ Knowledge Discovery in Databases (1989-), used by AI	

§ Business Intelligence, Information Harvesting, Information Discovery,
Knowledge Extraction, ... 	

§ Currently, Data Mining and Knowledge Discovery 
are used interchangeably	

•  Data Mining is not looking up a number in the phone directory, nor is it
querying a Web search engine for information about “Amazon”


An Example Using Contact Lens Data	

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None

An example of a possible pattern

if astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard

is it a “good” pattern?
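
One way to start answering the question is to count how often the rule fires on the contact lens data and how often it is right when it does. The sketch below rebuilds the 24 records from the table; the variable and function names are illustrative choices, not part of the lecture.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATISM = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

# Records whose recommendation is not "none", as read from the table.
EXCEPTIONS = {
    ("young", "myope", "no", "normal"): "soft",
    ("young", "myope", "yes", "normal"): "hard",
    ("young", "hypermetrope", "no", "normal"): "soft",
    ("young", "hypermetrope", "yes", "normal"): "hard",
    ("pre-presbyopic", "myope", "no", "normal"): "soft",
    ("pre-presbyopic", "myope", "yes", "normal"): "hard",
    ("pre-presbyopic", "hypermetrope", "no", "normal"): "soft",
    ("presbyopic", "myope", "yes", "normal"): "hard",
    ("presbyopic", "hypermetrope", "no", "normal"): "soft",
}

# Build all 24 attribute combinations with their recommended lenses.
records = [
    {"age": a, "prescription": p, "astigmatism": s, "tear_rate": t,
     "lenses": EXCEPTIONS.get((a, p, s, t), "none")}
    for a, p, s, t in product(AGES, PRESCRIPTIONS, ASTIGMATISM, TEAR_RATES)
]

def rule_applies(r):
    """Condition part of the example rule."""
    return (r["astigmatism"] == "yes"
            and r["tear_rate"] == "normal"
            and r["prescription"] == "myope")

covered = [r for r in records if rule_applies(r)]
correct = [r for r in covered if r["lenses"] == "hard"]
all_hard = [r for r in records if r["lenses"] == "hard"]

print(f"rule covers {len(covered)} of {len(records)} records")
print(f"correct on {len(correct)} of {len(covered)} covered records")
print(f"finds {len(correct)} of {len(all_hard)} 'hard' cases")
```

On this table the rule is never wrong where it applies, but it misses one of the "hard" recommendations, so accuracy and coverage tell different parts of the story.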

Example: Credit Risk	

[Figure: scatter plot with axes salary and loan; each point is labeled “not repaid” or “repaid”]

Example: Credit Risk	

[Figure: salary/loan scatter plot with a threshold k marked on the salary axis]

IF salary < k THEN not repaid

How Can We Evaluate a Pattern?	

•  Is it valid?	

§ The pattern has to be valid with respect 
to a certainty level (e.g., the rule holds for 86% of the cases)

•  Is it novel?	

§ The value k should be previously 
unknown and not obvious

•  Is it useful? Is it actionable?	

§ The pattern should provide information 
useful to the bank for assessing credit risk	

•  Is it understandable?	
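
The validity check can be made concrete with a short sketch: for a rule of the form "IF salary < k THEN not repaid", count the records on which the rule fires and the share of them that were indeed not repaid. The loan records and the threshold below are invented purely for illustration; they are not data from the lecture.

```python
# Hypothetical (salary, repaid) records, made up for this example only.
loans = [
    (18_000, False), (22_000, False), (25_000, True), (27_000, False),
    (31_000, False), (35_000, True), (40_000, True), (48_000, True),
    (52_000, False), (60_000, True),
]

def rule_validity(records, k):
    """Fraction of records with salary < k that were not repaid."""
    covered = [repaid for salary, repaid in records if salary < k]
    if not covered:
        return None  # the rule fires on nothing, so validity is undefined
    return sum(1 for repaid in covered if not repaid) / len(covered)

k = 33_000  # assumed threshold
validity = rule_validity(loans, k)
print(f"rule holds for {validity:.0%} of the cases it covers")
```

Novelty, usefulness, and understandability are not captured by this number; they depend on what the bank already knows and can act on.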


but there is another important question …	

	

was it “worth” finding it? (I mean $ worth)	

	

how much did the search cost?	

	

how much value did it bring?

What is the General Idea?	

•  Build computer programs that navigate through databases
automatically, seeking regularities or patterns	

•  There will be problems	

§ Most patterns are banal and uninteresting	

§ Most patterns are spurious, inexact, or contingent on
accidental coincidences in the particular dataset used 	

§ Real data is imperfect: Some parts will be garbled, 
and some will be missing	

•  Algorithms need to be robust enough to cope with imperfect
data and to extract regularities that are inexact but useful	
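
The warning about spurious patterns can be illustrated with purely random data. The sketch below, under assumed sizes (10 samples, 1000 candidate features) and a fixed seed, searches random features for the one most correlated with a random target; any strong correlation it finds is accidental by construction.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)           # fixed seed so the run is repeatable
n_samples, n_features = 10, 1000  # assumed sizes for the illustration

# Target and features are independent random numbers: no real relationship.
target = [rng.random() for _ in range(n_samples)]
features = [[rng.random() for _ in range(n_samples)]
            for _ in range(n_features)]

strongest = max(abs(pearson(f, target)) for f in features)
print(f"strongest |correlation| among {n_features} random features: "
      f"{strongest:.2f}")
```

With few records and many candidate patterns, some pattern will almost always look convincing, which is why validation on new data matters.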


http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/

http://www.tylervigen.com

http://wikibon.org/blog/role-of-the-data-scientist/

What are the Related Fields?	

[Figure: Knowledge Discovery and Data Mining, drawing on Databases, Statistics, Machine Learning, and Visualization]


Statistics, Machine Learning, 
and Data Mining	

•  Statistics is more theory-based, focuses on testing hypotheses	

•  Machine learning is more heuristic-based, focuses on building
programs that learn, and is more general than Data Mining

•  Knowledge Discovery	

§ integrates theory and heuristics	

§ focuses on the entire process of discovery, including 
data cleaning, learning, integration and visualization

Distinctions are blurred!	


Why Is It Different?	

•  Tremendous amount of data	

§ High scalability to handle terabytes of data	

•  High-dimensionality of data 	

§ A microarray may have tens of thousands of dimensions

•  High complexity of data	

§ Data streams and sensor data	

§ Time-series data, temporal data, sequence data 	

§ Structured data, graphs, social networks and multi-linked data

§ Heterogeneous databases and legacy databases	

§ Spatial, spatiotemporal, multimedia, text and Web data	

§ Software programs, scientific simulations	

•  New and sophisticated applications	


http://wikibon.org/blog/big-data-statistics/

Knowledge Discovery Process	

[Figure: selection → cleaning → transformation → mining → evaluation]

Knowledge Discovery Process 
What are the main steps?	

•  Learning the application domain to extract 
relevant prior knowledge and goals	

•  Data selection	

•  Data cleaning	

•  Data reduction and transformation	

•  Mining	

§ Select the mining approach: classification, 
regression, association, clustering, etc.	

§ Choose the mining algorithm(s)

§ Perform mining: search for patterns of interest	

•  Pattern evaluation and knowledge presentation	

§ visualization, transformation, 
removing redundant patterns, etc.	

•  Use of discovered knowledge	
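
The steps above can be sketched as a chain of small functions. Everything here is a toy stand-in: the records are invented, the function names are hypothetical, and the "mining" step is a deliberately trivial rule (flag records whose loan-to-salary ratio is above average) chosen only to show how the stages connect.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "salary": 30_000, "loan": 5_000, "notes": "n/a"},
    {"id": 2, "salary": None, "loan": 7_000, "notes": "call back"},
    {"id": 3, "salary": 52_000, "loan": 20_000, "notes": ""},
    {"id": 4, "salary": 41_000, "loan": 4_000, "notes": ""},
]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: derive a loan-to-salary ratio."""
    return [dict(r, ratio=r["loan"] / r["salary"]) for r in records]

def mine(records):
    """Toy mining step: flag records whose ratio exceeds the average."""
    avg = mean(r["ratio"] for r in records)
    return [r for r in records if r["ratio"] > avg]

def evaluate(patterns, records):
    """Evaluation: summarize how much of the data the result covers."""
    return {"flagged": len(patterns), "total": len(records)}

selected = select(raw, ["salary", "loan"])
cleaned = clean(selected)
transformed = transform(cleaned)
patterns = mine(transformed)
summary = evaluate(patterns, transformed)
print(summary)
```

In practice the stages are iterated rather than run once, and the first step (understanding the application domain) decides what each function should do.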


What are the typical Data Mining tasks?

http://wikibon.org/blog/big-data-statistics/

What are the Major Data Mining Tasks?	

•  Classification: predicting an item class	

•  Clustering: finding clusters in data	

•  Associations: frequently occurring events…

•  Visualization: to facilitate human discovery	

•  Summarization: describing a group	

•  Deviation Detection: finding changes	

•  Estimation: predicting a continuous value	

•  Link Analysis: finding relationships

•  New tasks keep appearing over time, e.g., sentiment analysis 
and opinion mining


Data Mining Tasks: classification	

[Figure: salary/loan scatter plot with threshold k and two new points marked “?” to be classified]

IF salary < k THEN not repaid

Data Mining Tasks: 
Classification and Prediction

•  Finding models (functions) that describe 
and distinguish classes or concepts 	

•  The goal is to describe the data or 
to make future predictions

•  E.g., classify countries based on climate, 
or classify cars based on gas mileage	

•  Presentation: decision-tree, classification rule, neural network	

•  Prediction: Predict some unknown numerical values 	
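
As a minimal sketch of what "finding a model" can mean, the code below learns the threshold k of a one-rule classifier ("salary < k means not repaid") from labeled examples and then uses it on a new case. This is a simple decision stump chosen for illustration, not a method prescribed by the lecture, and the training records are invented.

```python
def learn_threshold(records):
    """Return (k, accuracy) of the best rule 'salary < k -> not repaid'."""
    salaries = sorted({salary for salary, _ in records})
    # Candidate thresholds: midpoints between consecutive distinct salaries.
    candidates = [(a + b) / 2 for a, b in zip(salaries, salaries[1:])]
    best_k, best_acc = None, -1.0
    for k in candidates:
        # A record is a hit when the rule's prediction matches its label.
        hits = sum((salary < k) == (not repaid) for salary, repaid in records)
        acc = hits / len(records)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc

def predict_repaid(salary, k):
    """Apply the learned rule to a new applicant."""
    return salary >= k

# Hypothetical training data: (salary, repaid).
training = [
    (18_000, False), (22_000, False), (27_000, False),
    (30_000, True), (36_000, True), (45_000, True), (50_000, False),
]

k, accuracy = learn_threshold(training)
print(f"learned k = {k}, training accuracy = {accuracy:.2f}")
print("new applicant with salary 40,000 ->",
      "repaid" if predict_repaid(40_000, k) else "not repaid")
```

The last training record is deliberately one the rule gets wrong: real data rarely allows a perfect threshold, and training accuracy alone says little about future cases.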


Data Mining Tasks: Clustering	

•  Given a cloud of data points we’d like to understand their structure	

•  The class label is unknown	

•  Group data to form new 
classes, e.g., cluster houses
to find distribution patterns	

•  Clustering is based on the 
principle of maximizing the 
intra-class similarity and 
minimizing the inter-class 
similarity
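
A small sketch of one common clustering method, k-means, is given below. It is named here as an illustrative choice rather than the lecture's method: it pursues the principle above by repeatedly assigning each point to its nearest centroid and moving each centroid to the mean of its points. The points and starting centroids are made up.

```python
from math import dist
from statistics import mean

def k_means(points, centroids, max_iterations=100):
    """Plain k-means on tuples; returns (centroids, clusters)."""
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(mean(coord) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: assignments no longer change the centroids
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points, e.g. house locations on a grid.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(houses, [(1, 1), (8, 8)])

for c, members in zip(centroids, clusters):
    print(c, members)
```

The result depends on the number of clusters and on the starting centroids, both of which are assumptions the analyst has to supply.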


Data Mining Tasks: Associations 	

Bread, Peanuts, Milk, Fruit, Jam
Bread, Jam, Soda, Chips, Milk, Fruit
Steak, Jam, Soda, Chips, Bread
Jam, Soda, Chips, Milk, Bread
Fruit, Soda, Chips, Milk
Jam, Soda, Peanuts, Milk, Fruit
Fruit, Soda, Peanuts, Milk
Fruit, Peanuts, Cheese, Yogurt

Is there something interesting to be noted?	


Data Mining Tasks: Associations 	

•  Finds interesting associations and/or correlation relationships
among a large set of data items.

•  E.g., 98% of people who purchase tires and auto accessories also
get automotive services done	
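
Association strength is commonly measured with support and confidence. The sketch below computes both for the shopping baskets of the earlier Associations slide; grouping the items into eight baskets follows the slide layout and should be read as an assumption, and the function names are illustrative.

```python
# The eight baskets, as read from the slide layout (an assumption).
baskets = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also
    contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support(Bread, Jam)      =", support({"Bread", "Jam"}))
print("confidence(Bread -> Jam) =", confidence({"Bread"}, {"Jam"}))
print("confidence(Jam -> Bread) =", confidence({"Jam"}, {"Bread"}))
print("confidence(Chips -> Soda) =", confidence({"Chips"}, {"Soda"}))
```

Confidence is not symmetric: every basket with Bread also has Jam, but not every basket with Jam has Bread.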


Data Mining Tasks: Other Tasks	

•  Outlier analysis	

§ Outlier: a data object that does not comply
with the general behavior of the data	

§ It can be considered as noise or exception 
but is quite useful in fraud detection, rare events analysis	

•  Trend and evolution analysis	

§ Trend and deviation: regression analysis	

§ Sequential pattern mining, periodicity analysis	

§ Similarity-based analysis	

•  Text Mining, Graph Mining, Data Streams	

•  Sentiment Analysis, Opinion Mining, etc.	

•  Other pattern-directed or statistical analyses	
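
For the outlier analysis item, a minimal sketch of one simple statistical test is shown below: the modified z-score based on the median absolute deviation, with the commonly used cutoff of 3.5. Both the choice of test and the transaction amounts are assumptions made for illustration.

```python
from statistics import median

def mad_outliers(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median; the score is undefined
    # 0.6745 rescales the MAD so scores are comparable to z-scores
    # under a normal distribution.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > cutoff]

# Hypothetical transaction amounts with one suspicious value.
amounts = [12, 15, 14, 13, 16, 15, 14, 250]
outliers = mad_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or the interesting event (as in fraud detection) is a decision for the analyst, not the test.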


Relevant issues

Are all the “Discovered” Patterns Interesting?	

•  Data Mining may generate thousands of patterns, 
but not all of them are interesting.

•  Interestingness measures: a pattern is interesting if it is easily understood by
humans, valid on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to confirm 	

•  Objective vs. subjective interestingness measures:	

•  Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.	

•  Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc.	


Can We Find All and Only 
Interesting Patterns?	

•  Completeness	

§ Find all the interesting patterns	

§ Can a data mining system find all 
the interesting patterns?	

§ Association vs. classification vs. clustering	

•  Optimization	

§ Search for only interesting patterns:	

§ Can a data mining system find only 
the interesting patterns?	

§ Two approaches: (1) first generate all the patterns and then
filter out the uninteresting ones; (2) generate only the
interesting patterns (mining query optimization)


What About Background Knowledge?	

•  A typical kind of background knowledge: Concept hierarchies	

•  Schema hierarchy	

§ E.g., Street < City < ProvinceOrState < Country

•  Set-grouping hierarchy	

§ E.g., {20-39} = young, {40-59} = middle_aged	

•  Operation-derived hierarchy	

§ email address: hagonzal@cs.uiuc.edu	

§ login-name < department < university < country

•  Rule-based hierarchy	

§ LowProfitMargin(X) <= Price(X, P1) and Cost(X, P2) 
and (P1 - P2) < $50
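
Three of these hierarchies can be sketched as small functions. The function names are hypothetical, the comparison in the profit-margin rule is assumed to be "less than $50", and the e-mail example only splits the address into its parts; mapping the last part to a country is left out because the slide does not say how that is done.

```python
def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside 20-59

def email_levels(address):
    """Operation-derived hierarchy: split an address into its levels,
    from most specific (login name) to most general."""
    login, domain = address.split("@")
    return [login] + domain.split(".")

def low_profit_margin(price, cost, limit=50):
    """Rule-based hierarchy: margin below the limit (assumed to be $50)."""
    return (price - cost) < limit

print(age_group(27), age_group(45))
print(email_levels("hagonzal@cs.uiuc.edu"))
print(low_profit_margin(price=120, cost=90))
```

Replacing raw values with such higher-level concepts before mining lets patterns be found and stated at the level of "young customers" rather than individual ages.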


http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/

I noted that you did not mention …

http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/

http://gettingattention.org/2014/01/big-data-nonprofit-communications/

http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/	

	

http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0	

	

http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

https://www.youtube.com/watch?v=CO2mGny6fFs

 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 

DMTM 2015 - 02 Data Mining

  • 1. Prof. Pier Luca Lanzi Data Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi Why Data Mining? “Necessity is the mother of invention” Explosive Growth of Data Pressing need for the automated analysis of massive data Emerged in the late 1980s Major developments in the mid 1990s
  • 3. Prof. Pier Luca Lanzi http://consumer.media.seagate.com/2012/06/the-digital-den/how-much-data-is-generated-in-a-minute/
  • 4. Prof. Pier Luca Lanzi Some Data About Data (http://wikibon.org/blog/big-data-statistics/) • 2.7 Zettabytes of data exist in the digital universe today • Facebook stores, accesses, and analyzes 30+ Petabytes of user-generated data • Walmart handles more than 1 million customer transactions every hour, stored in databases estimated to contain more than 2.5 petabytes of data • More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide • 100 terabytes of data are uploaded daily to Facebook • According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day, and has more than 465 million accounts
  • 5. Prof. Pier Luca Lanzi Evolution of Technology • 1960s: data collection, database creation, network DBMS • 1970s: relational data model, relational DBMS implementation • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: data mining, data warehousing, multimedia databases, and Web databases • 2000s: stream data management and mining, web technology (XML, data integration), global information systems • 2010s: social networks, NoSQL, unstructured data, etc.
  • 6. Prof. Pier Luca Lanzi http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
  • 7. Prof. Pier Luca Lanzi What is the Commercial Viewpoint? • Huge amounts of data are being collected and warehoused every day § Web data, e-commerce § Purchases at department stores § Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive pressure is strong to provide better, customized services (e.g., CRM or Customer Relationship Management) • Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year (http://wikibon.org/blog/big-data-statistics/)
  • 8. Prof. Pier Luca Lanzi What is the Scientific Viewpoint? • Data collected and stored at enormous speeds (GB/hour) § remote sensors on a satellite § telescopes scanning the skies § microarrays generating gene expression data § scientific simulations generating terabytes of data • Traditional techniques are infeasible for raw data • Data mining may help scientists § in classifying and segmenting data § in hypothesis formation
  • 9. Prof. Pier Luca Lanzi Why Mine Large Datasets? • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all! [Chart: “The Data Gap”, 1995 to 1999; series: total new disk (TB) since 1995 and number of analysts; vertical axis from 0 to 4,000,000]
  • 10. Prof. Pier Luca Lanzi Word cloud from Science “Dealing with Data” 11 February 2011 volume 331 issue 6018 pages 639-806
  • 11. Prof. Pier Luca Lanzi Science 11 February 2011 “Dealing with Data”
  • 12. Prof. Pier Luca Lanzi Examples
  • 13. Prof. Pier Luca Lanzi Examples • Customer attrition § Given: customer information for the past months § Problem: predict who is likely to attrite next month, or estimate customer value § Data: historical customer records • Credit assessment § Given: a loan application § Problem: predict whether the bank should approve the loan § Data: records from other loans
  • 14. Prof. Pier Luca Lanzi http://wikibon.org/blog/role-of-the-data-scientist/
  • 15. Prof. Pier Luca Lanzi https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
  • 16. Prof. Pier Luca Lanzi What is Data Mining?
  • 17. Prof. Pier Luca Lanzi What is Data Mining? • The non-trivial process of identifying (1) valid, (2) novel, (3) potentially useful, and (4) understandable patterns in data. • Alternative names: § Data Fishing, Data Dredging (1960-) § Data Mining (1990-), used by DB and business § Knowledge Discovery in Databases (1989-), used by AI § Business Intelligence, Information Harvesting, Information Discovery, Knowledge Extraction, ... § Currently, Data Mining and Knowledge Discovery are used interchangeably • Data mining is not looking up a number in the phone directory, nor is it querying a Web search engine for information about “Amazon”
  • 18. Prof. Pier Luca Lanzi An Example Using Contact Lens Data
      Age | Spectacle prescription | Astigmatism | Tear production rate | Recommended lenses
      Young | Myope | No | Reduced | None
      Young | Myope | No | Normal | Soft
      Young | Myope | Yes | Reduced | None
      Young | Myope | Yes | Normal | Hard
      Young | Hypermetrope | No | Reduced | None
      Young | Hypermetrope | No | Normal | Soft
      Young | Hypermetrope | Yes | Reduced | None
      Young | Hypermetrope | Yes | Normal | Hard
      Pre-presbyopic | Myope | No | Reduced | None
      Pre-presbyopic | Myope | No | Normal | Soft
      Pre-presbyopic | Myope | Yes | Reduced | None
      Pre-presbyopic | Myope | Yes | Normal | Hard
      Pre-presbyopic | Hypermetrope | No | Reduced | None
      Pre-presbyopic | Hypermetrope | No | Normal | Soft
      Pre-presbyopic | Hypermetrope | Yes | Reduced | None
      Pre-presbyopic | Hypermetrope | Yes | Normal | None
      Presbyopic | Myope | No | Reduced | None
      Presbyopic | Myope | No | Normal | None
      Presbyopic | Myope | Yes | Reduced | None
      Presbyopic | Myope | Yes | Normal | Hard
      Presbyopic | Hypermetrope | No | Reduced | None
      Presbyopic | Hypermetrope | No | Normal | Soft
      Presbyopic | Hypermetrope | Yes | Reduced | None
      Presbyopic | Hypermetrope | Yes | Normal | None
  • 19. Prof. Pier Luca Lanzi An example of possible pattern if astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
  • 20. Prof. Pier Luca Lanzi is it a “good” pattern?
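One way to start answering that question is to count how many cases the rule covers and how often it is right. The sketch below is only an illustration: it rebuilds the 24 cases of the contact lens slide and checks the rule from slide 19. The names `ROWS` and `evaluate_rule` are hypothetical, and "coverage" and "accuracy" are just two simple measures among many.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATISM = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

# Recommended lenses for the 24 cases on the slide, listed in the order
# produced by product(): tear rate varies fastest, age slowest.
LABELS = (
    "none soft none hard none soft none hard "   # young
    "none soft none hard none soft none none "   # pre-presbyopic
    "none none none hard none soft none none"    # presbyopic
).split()

# Each row: (age, prescription, astigmatism, tear rate, recommended lenses)
ROWS = [
    attrs + (label,)
    for attrs, label in zip(
        product(AGES, PRESCRIPTIONS, ASTIGMATISM, TEAR_RATES), LABELS
    )
]


def evaluate_rule(rows):
    """Return (covered cases, coverage, accuracy) for the rule:
    astigmatism = yes and tear rate = normal and prescription = myope
    -> recommendation = hard."""
    covered = [
        r for r in rows
        if r[2] == "yes" and r[3] == "normal" and r[1] == "myope"
    ]
    correct = [r for r in covered if r[4] == "hard"]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(covered), coverage, accuracy


print(evaluate_rule(ROWS))
```

On these 24 cases the rule covers 3 rows (coverage 0.125) and is correct on all of them, so it is accurate but applies to only a small part of the data.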
  • 21. Prof. Pier Luca Lanzi Example: Credit Risk [Figure: plot with axes salary and loan; legend: not repaid, repaid]
  • 22. Prof. Pier Luca Lanzi Example: Credit Risk [Figure: the same salary and loan plot with a threshold k marked] IF salary < k THEN not repaid
  • 23. Prof. Pier Luca Lanzi How Can We Evaluate a Pattern? • Is it valid? § The pattern has to be valid with respect to a certainty level (e.g., the rule holds in 86% of cases) • Is it novel? § The value k should be previously unknown and not obvious • Is it useful? Is it actionable? § The pattern should provide information useful to the bank for assessing credit risk • Is it understandable?
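The validity check can be made concrete with a small sketch. It assumes the credit-risk rule has the form "IF salary < k THEN not repaid", and it uses made-up loan records; `RECORDS`, `rule_accuracy` and `best_threshold` are hypothetical names, and choosing k by maximizing accuracy is one simple option, not the method of the slides.

```python
# Hypothetical loan records: (salary in thousands, loan was repaid)
RECORDS = [
    (18, False), (22, False), (25, True), (27, False),
    (31, True), (35, True), (40, True),
]


def rule_accuracy(records, k):
    """Fraction of records for which 'IF salary < k THEN not repaid'
    (and 'repaid' otherwise) agrees with the actual outcome."""
    hits = sum((salary < k) == (not repaid) for salary, repaid in records)
    return hits / len(records)


def best_threshold(records):
    """Try the midpoints between consecutive distinct salaries and
    return the first one with the highest accuracy."""
    salaries = sorted({salary for salary, _ in records})
    candidates = [(a + b) / 2 for a, b in zip(salaries, salaries[1:])]
    return max(candidates, key=lambda k: rule_accuracy(records, k))


k = best_threshold(RECORDS)
print(k, rule_accuracy(RECORDS, k))
```

With these records the search settles on k = 23.5, where the rule is right for 6 of the 7 records. Whether that level of certainty is acceptable, and whether the value of k is novel or useful, remains a judgment for the analyst.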
  • 24. Prof. Pier Luca Lanzi but there is another important question … was it “worth” finding it? (I mean $ worth) how much did the search cost? how much value did it bring?
  • 25. Prof. Pier Luca Lanzi What is the General Idea? • Build computer programs that navigate through databases automatically, seeking regularities or patterns • There will be problems § Most patterns are banal and uninteresting § Most patterns are spurious, inexact, or contingent on accidental coincidences in the particular dataset used § Real data is imperfect: Some parts will be garbled, and some will be missing • Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful
  • 26. Prof. Pier Luca Lanzi http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
  • 27. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 28. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 29. Prof. Pier Luca Lanzi http://wikibon.org/blog/role-of-the-data-scientist/
  • 30. Prof. Pier Luca Lanzi What are the Related Fields? [Diagram relating Databases, Statistics, Machine Learning, and Visualization to Knowledge Discovery and Data Mining]
  • 31. Prof. Pier Luca Lanzi Statistics, Machine Learning, and Data Mining • Statistics is more theory-based and focuses on testing hypotheses • Machine learning is more heuristic-based and focuses on building programs that learn; it is more general than Data Mining • Knowledge Discovery § integrates theory and heuristics § focuses on the entire process of discovery, including data cleaning, learning, integration and visualization Distinctions are blurred!
  • 32. Prof. Pier Luca Lanzi Why Is It Different? • Tremendous amount of data § High scalability to handle terabytes of data • High dimensionality of data § Microarray data may have tens of thousands of dimensions • High complexity of data § Data streams and sensor data § Time-series data, temporal data, sequence data § Structured data, graphs, social networks and multi-linked data § Heterogeneous databases and legacy databases § Spatial, spatiotemporal, multimedia, text and Web data § Software programs, scientific simulations • New and sophisticated applications
  • 33. Prof. Pier Luca Lanzi http://wikibon.org/blog/big-data-statistics/
  • 34. Prof. Pier Luca Lanzi Knowledge Discovery Process [Diagram of the steps: selection, cleaning, transformation, mining, evaluation]
  • 35. Prof. Pier Luca Lanzi Knowledge Discovery Process What are the main steps? • Learning the application domain to extract relevant prior knowledge and goals • Data selection • Data cleaning • Data reduction and transformation • Mining § Select the mining approach: classification, regression, association, clustering, etc. § Choose the mining algorithm(s) § Perform mining: search for patterns of interest • Pattern evaluation and knowledge presentation § visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge
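The steps listed above can be sketched as a chain of small functions. This is a toy illustration on made-up records, not a real mining system: the field names, the age cut-off at 40 and the `min_confidence` value are all assumptions, and the "mining" step only finds the majority outcome per age group.

```python
from collections import Counter, defaultdict

# Hypothetical raw records; one of them has a missing age.
RAW = [
    {"id": 1, "age": 25, "salary": 18, "repaid": False},
    {"id": 2, "age": 47, "salary": 40, "repaid": True},
    {"id": 3, "age": None, "salary": 35, "repaid": True},
    {"id": 4, "age": 31, "salary": 22, "repaid": False},
    {"id": 5, "age": 52, "salary": 31, "repaid": True},
    {"id": 6, "age": 29, "salary": 27, "repaid": True},
]


def select(records):
    """Selection: keep only the attributes relevant to the question."""
    return [{"age": r["age"], "repaid": r["repaid"]} for r in records]


def clean(records):
    """Cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transformation: discretize age into two groups."""
    return [
        {"age_group": "young" if r["age"] < 40 else "middle_aged",
         "repaid": r["repaid"]}
        for r in records
    ]


def mine(records):
    """Mining: for each age group, find the majority outcome and its share."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["age_group"]][r["repaid"]] += 1
    patterns = []
    for group, outcomes in counts.items():
        outcome, hits = outcomes.most_common(1)[0]
        patterns.append((group, outcome, hits / sum(outcomes.values())))
    return patterns


def evaluate(patterns, min_confidence=0.6):
    """Evaluation: keep only patterns that hold often enough."""
    return [p for p in patterns if p[2] >= min_confidence]


patterns = evaluate(mine(transform(clean(select(RAW)))))
print(sorted(patterns))
```

Each function corresponds to one box of the process; in practice every step is far more involved, and the process usually loops back to earlier steps once the patterns have been inspected.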
  • 36. Prof. Pier Luca Lanzi What are the typical Data Mining tasks?
  • 37. Prof. Pier Luca Lanzi http://wikibon.org/blog/big-data-statistics/
  • 38. Prof. Pier Luca Lanzi What are the Major Data Mining Tasks? • Classification: predicting an item class • Clustering: finding clusters in data • Associations: frequently occurring events • Visualization: to facilitate human discovery • Summarization: describing a group • Deviation Detection: finding changes • Estimation: predicting a continuous value • Link Analysis: finding relationships • New tasks keep appearing over time, e.g., sentiment analysis and opinion mining
  • 39. Prof. Pier Luca Lanzi Data Mining Tasks: Classification [Figure: salary and loan plot with threshold k and two unlabeled points marked “?”] IF salary < k THEN not repaid
  • 40. Prof. Pier Luca Lanzi Data Mining Tasks: Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts • The goal is to describe the data or to make future predictions • E.g., classify countries based on climate, or classify cars based on gas mileage • Presentation: decision tree, classification rule, neural network • Prediction: predict some unknown numerical values
  • 41. Prof. Pier Luca Lanzi Data Mining Tasks: Clustering • Given a cloud of data points we’d like to understand their structure • The class label is unknown • Group data to form new classes, e.g., cluster houses to find distribution patterns • Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
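As an illustration of the clustering principle, here is a minimal sketch of k-means, one common clustering technique that the slide itself does not name. The points are made up, the initialization simply takes the first k points, and the function name `kmeans` is hypothetical.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    centroids = list(points[:k])  # naive, deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups
POINTS = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
centroids, clusters = kmeans(POINTS, k=2)
print(clusters)
```

Points that end up in the same cluster are close to each other (high intra-class similarity) and far from the other cluster (low inter-class similarity). Real implementations choose the initial centroids more carefully and stop when the assignment no longer changes.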
  • 42. Prof. Pier Luca Lanzi Data Mining Tasks: Associations Bread Peanuts Milk Fruit Jam Bread Jam Soda Chips Milk Fruit Steak Jam Soda Chips Bread Jam Soda Chips Milk Bread Fruit Soda Chips Milk Jam Soda Peanuts Milk Fruit Fruit Soda Peanuts Milk Fruit Peanuts Cheese Yogurt Is there something interesting to be noted? 42
  • 43. Prof. Pier Luca Lanzi Data Mining Tasks: Associations • Finds interesting associations and/or correlation relationships among large sets of data items. • E.g., 98% of people who purchase tires and auto accessories also get automotive services done
  • 44. Prof. Pier Luca Lanzi Data Mining Tasks: Other Tasks • Outlier analysis § Outlier: a data object that does not comply with the general behavior of the data § It can be considered as noise or an exception but is quite useful in fraud detection and rare events analysis • Trend and evolution analysis § Trend and deviation: regression analysis § Sequential pattern mining, periodicity analysis § Similarity-based analysis • Text Mining, Graph Mining, Data Streams • Sentiment Analysis, Opinion Mining, etc. • Other pattern-directed or statistical analyses
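A statement such as "98% of people who purchase X also get Y" is usually expressed through the support and confidence of a rule. The sketch below computes both on a handful of made-up baskets; the baskets are hypothetical (they are not the transactions of the previous slide), and so are the function names.

```python
# Hypothetical market baskets, one set of items per transaction
TRANSACTIONS = [
    {"bread", "jam", "milk"},
    {"bread", "jam", "soda"},
    {"milk", "fruit"},
    {"bread", "jam", "chips", "soda"},
    {"fruit", "soda", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


print(support({"bread", "jam"}, TRANSACTIONS))
print(confidence({"bread"}, {"jam"}, TRANSACTIONS))
```

Here bread and jam appear together in 3 of the 5 baskets (support 0.6), and every basket with bread also contains jam (confidence 1.0). Association mining algorithms search for all rules whose support and confidence exceed user-given thresholds.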
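For outlier analysis, a very simple baseline is to flag values that lie far from the mean in units of standard deviation (a z-score test). The sketch below is only one of many possible approaches; the values and the threshold of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily transaction amounts with one unusual value
AMOUNTS = [10, 12, 11, 13, 12, 11, 95]
print(zscore_outliers(AMOUNTS))
```

On this list only the value 95 is flagged. Whether such a value is noise to be discarded or an exception worth investigating (e.g., possible fraud) depends on the application.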
  • 45. Prof. Pier Luca Lanzi Relevant issues
  • 46. Prof. Pier Luca Lanzi Are all the “Discovered” Patterns Interesting? • Data Mining may generate thousands of patterns; not all of them are interesting. • Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm • Objective vs. subjective interestingness measures: § Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. § Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc.
  • 47. Prof. Pier Luca Lanzi Can We Find All and Only Interesting Patterns? • Completeness § Find all the interesting patterns § Can a data mining system find all the interesting patterns? § Association vs. classification vs. clustering • Optimization § Search for only the interesting patterns § Can a data mining system find only the interesting patterns? § Two approaches: (1) first generate all the patterns and then filter out the uninteresting ones; (2) generate only the interesting patterns (mining query optimization)
  • 48. Prof. Pier Luca Lanzi What About Background Knowledge? • A typical kind of background knowledge: Concept hierarchies • Schema hierarchy § E.g., Street < City < ProvinceOrState < Country • Set-grouping hierarchy § E.g., {20-39} = young, {40-59} = middle_aged • Operation-derived hierarchy § email address: hagonzal@cs.uiuc.edu § login-name < department < university < country • Rule-based hierarchy § LowProfitMargin(X) <= Price(X, P1) and Cost(X, P2) and (P1 - P2) < $50
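Concept hierarchies of this kind are easy to encode as small functions. The sketch below covers the set-grouping, operation-derived and rule-based examples from the slide. The function names are hypothetical, and the rule-based example assumes that the comparison in the slide is "(P1 - P2) < $50".

```python
def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside these ranges


def email_levels(address):
    """Operation-derived hierarchy: split an email address into its
    parts, from the most specific (login name) to the most general."""
    login, domain = address.split("@")
    return [login] + domain.split(".")


def low_profit_margin(price, cost):
    """Rule-based hierarchy: an item has a low profit margin when
    price minus cost is below 50 (assumed reading of the slide)."""
    return (price - cost) < 50


print(age_group(25), age_group(45))
print(email_levels("hagonzal@cs.uiuc.edu"))
print(low_profit_margin(price=120, cost=90))
```

Mining can then be run on the generalized values (for example on "young" instead of the exact age), which often yields patterns that are more general and easier to understand.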
  • 49. Prof. Pier Luca Lanzi http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
  • 50. Prof. Pier Luca Lanzi I noted that you did not mention …
  • 51. Prof. Pier Luca Lanzi http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
  • 53. Prof. Pier Luca Lanzi http://gettingattention.org/2014/01/big-data-nonprofit-communications/
  • 54. Prof. Pier Luca Lanzi http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/ http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0 http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
  • 55. Prof. Pier Luca Lanzi https://www.youtube.com/watch?v=CO2mGny6fFs