Data Mining and Knowledge Discovery in Business Databases

  1. From Data Mining to Knowledge Discovery: An Introduction
     Gregory Piatetsky-Shapiro, KDnuggets
  2. Outline
     • Introduction
     • Data Mining Tasks
     • Application Examples
  3. Trends leading to Data Flood
     • More data is generated:
       • Bank, telecom, other business transactions ...
       • Scientific data: astronomy, biology, etc.
       • Web, text, and e-commerce
     • More data is captured:
       • Storage technology faster and cheaper
       • DBMS capable of handling bigger DB
  4. Examples
     • Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
       • storage and analysis a big problem
     • Walmart reported to have a 24-terabyte DB
     • AT&T handles billions of calls per day
       • data cannot be stored; analysis is done on the fly
  5. Growth Trends
     • Moore’s law
       • Computer speed doubles every 18 months
     • Storage law
       • total storage doubles every 9 months
     • Consequence
       • very little data will ever be looked at by a human
     • Knowledge Discovery is NEEDED to make sense and use of data.
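
To make the two doubling rates on this slide concrete, the sketch below compounds them over the same period. The function name `growth_factor` and the 36-month horizon are illustrative choices, not taken from the slides.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


horizon = 36  # assumed horizon, in months
speed = growth_factor(horizon, 18)    # Moore's law as stated on the slide
storage = growth_factor(horizon, 9)   # storage law as stated on the slide

print(f"After {horizon} months: speed x{speed:.0f}, storage x{storage:.0f}")
```

Under these stated rates, storage grows four times faster than computer speed over three years, which is the gap behind the slide's point that very little data will ever be looked at by a human.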
  6. Knowledge Discovery Definition
     • Knowledge Discovery in Data is the non-trivial process of identifying
       • valid
       • novel
       • potentially useful
       • and ultimately understandable patterns in data.
     • from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996
  7. Related Fields
     [Diagram labels: Statistics, Machine Learning, Databases, Visualization, Data Mining and Knowledge Discovery]
  8. Knowledge Discovery Process
     [Diagram. Step labels: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation, Understanding. Data labels: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge]
  9. Outline
     • Introduction
     • Data Mining Tasks
     • Application Examples
  10. Data Mining Tasks: Classification
      • Learn a method for predicting the instance class from pre-labeled (classified) instances
      • Many approaches: Statistics, Decision Trees, Neural Networks, ...
  11. Classification: Linear Regression
      • Linear Regression
        • w0 + w1*x + w2*y >= 0
      • Regression computes the weights wi from data to minimize squared error to ‘fit’ the data
      • Not flexible enough
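
The linear decision rule on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: it assumes the two classes are encoded as targets +1 ("blue") and -1 ("green"), fits w0, w1, w2 by ordinary least squares via the normal equations, and uses made-up sample points. The names `fit_least_squares` and `classify` are hypothetical.

```python
def fit_least_squares(points, targets):
    """Return [w0, w1, w2] minimizing sum((w0 + w1*x + w2*y - t)^2)."""
    rows = [(1.0, x, y) for x, y in points]
    # Augmented normal equations [A^T A | A^T t].
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(3):
            if k != col:
                factor = m[k][col] / m[col][col]
                m[k] = [a - factor * b for a, b in zip(m[k], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def classify(w, x, y):
    """Apply the slide's rule: w0 + w1*x + w2*y >= 0 means 'blue'."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


# Made-up training data (an assumption for illustration only).
blue = [(6.0, 1.0), (7.0, 3.0), (8.0, 2.0), (6.0, 4.0)]
green = [(1.0, 1.0), (2.0, 3.0), (1.0, 4.0), (3.0, 2.0)]
points = blue + green
targets = [1.0] * len(blue) + [-1.0] * len(green)

w = fit_least_squares(points, targets)
predictions = [classify(w, x, y) for x, y in points]
```

A single straight boundary works here because the sample data is linearly separable; the slide's "not flexible enough" remark applies when the classes cannot be split by one line.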
  12. Classification: Decision Trees
      if X > 5 then blue
      else if Y > 3 then blue
      else if X > 2 then green
      else blue
      [Figure: X-Y plot with split points X = 5, Y = 3, X = 2]
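
The rule chain on this slide translates directly into code. The function name `classify_tree` is hypothetical; the thresholds and colours are the ones given on the slide.

```python
def classify_tree(x: float, y: float) -> str:
    """Decision tree from the slide, written as nested tests."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


for point in [(6, 0), (1, 4), (3, 1), (1, 1)]:
    print(point, classify_tree(*point))
```

Each test splits the plane along one axis, so the tree carves out the rectangular green region 2 < X <= 5, Y <= 3 that a single linear boundary cannot describe.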
  13. Classification: Neural Nets
      • Can select more complex regions
      • Can be more accurate
      • Also can overfit the data: find patterns in random noise
  14. Data Mining Central Quest
      • Find true patterns and avoid overfitting (false patterns due to randomness)
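
Overfitting can be illustrated with a deliberately extreme case. The sketch below is an assumption-laden toy, not from the slides: labels are assigned by coin flip, so there is no true pattern, and a 1-nearest-neighbour classifier (swapped in here for a neural net because it is short to write) simply memorizes the training set.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def random_point():
    return (rng.random(), rng.random())


# Labels are pure coin flips: any "pattern" found is a false one.
train = [(random_point(), rng.choice(["blue", "green"])) for _ in range(200)]
test = [(random_point(), rng.choice(["blue", "green"])) for _ in range(1000)]


def predict(point):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(data):
    return sum(predict(p) == label for p, label in data) / len(data)


train_acc = accuracy(train)  # memorized data: every point is its own neighbour
test_acc = accuracy(test)    # fresh data: should hover around chance level

print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

The gap between perfect training accuracy and roughly chance-level accuracy on fresh data is the signature of a model that has learned randomness rather than a true pattern.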
  15. Data Mining Tasks: Clustering
      • Find “natural” grouping of instances given un-labeled data
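
One common way to find such groupings is k-means, sketched below. The slide does not name an algorithm, so k-means is an assumption here, as are the sample points, the starting centroids, and the function name `kmeans`.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up unlabeled data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied at any point; the two groups emerge only from distances between instances.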
  16. Major Data Mining Tasks
      • Classification: predicting an item class
      • Clustering: finding clusters in data
      • Associations: e.g. A & B & C occur frequently
      • Visualization: to facilitate human discovery
      • Estimation: predicting a continuous value
      • Deviation Detection: finding changes
      • Link Analysis: finding relationships
      • …
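
The association task ("A & B & C occur frequently") can be sketched as brute-force itemset counting. The baskets, the support threshold, and the function name `frequent_itemsets` are assumptions for illustration; practical tools use pruning strategies rather than enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; each set holds the items seen together.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C", "D"},
    {"B", "D"},
    {"A", "B", "C"},
]


def frequent_itemsets(baskets, size, min_support):
    """Count every itemset of the given size; keep those seen often enough."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {s: n for s, n in counts.items() if n >= min_support}


pairs = frequent_itemsets(baskets, size=2, min_support=3)
triples = frequent_itemsets(baskets, size=3, min_support=3)
print(pairs)
print(triples)
```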
  17. Data Mining Software Guide: www.KDnuggets.com
  18. Outline
      • Introduction
      • Data Mining Tasks
      • Application Examples
  19. Major Application Areas for Data Mining Solutions
      • Advertising
      • Bioinformatics
      • Customer Relationship Management (CRM)
      • Database Marketing
      • Fraud Detection
      • eCommerce
      • Health Care
      • Investment/Securities
      • Manufacturing, Process Control
      • Sports and Entertainment
      • Telecommunications
      • Web
  20. Case Study: Search Engines
      • Early search engines relied mainly on keywords on a page and were subject to manipulation
      • Google's success is due to its algorithm, which relies mainly on links to the page
      • Google founders Sergey Brin and Larry Page were students at Stanford in 1998 doing research in databases and data mining, which led to Google
  21. Case Study: Direct Marketing and CRM
      • Most major direct marketing companies are using modeling and data mining
      • Most financial companies are using customer modeling
      • Modeling is easier than changing customer behaviour
      • Some successes
        • Verizon Wireless reduced churn rate from 2% to 1.5%
  22. Biology: Molecular Diagnostics
      • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
        • 72 samples, about 7,000 genes
        [Figure labels: ALL, AML]
        • Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)
        • Outcome predictions?
  23. AF1q: New Marker for Medulloblastoma?
      • AF1Q: ALL1-fused gene from chromosome 1q
      • transmembrane protein
      • Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
  24. Case Study: Security and Fraud Detection
      • Credit Card Fraud Detection
      • Money laundering
        • FAIS (US Treasury)
      • Securities Fraud
        • NASDAQ Sonar system
      • Phone fraud
        • AT&T, Bell Atlantic, British Telecom/MCI
      • Bio-terrorism detection at Salt Lake Olympics 2002
  25. Data Mining and Terrorism: Controversy in the News
      • TIA: Terrorism (formerly Total) Information Awareness Program
        • DARPA program closed by Congress
        • some functions transferred to intelligence agencies
      • CAPPS II: screen all airline passengers
        • controversial
      • …
      • Invasion of Privacy or Defensive Shield?
  26. Criticism of analytic approach to Threat Detection:
      • Data Mining will
        • invade privacy
        • generate millions of false positives
      • But can it be effective?
  27. Can Data Mining and Statistics be Effective for Threat Detection?
      • Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
      • Reality: Analytical models correlate many items of information to reduce false positives.
      • Example: Identify one biased coin from 1,000.
        • After one throw of each coin, we cannot identify it
        • After 30 throws, one biased coin will stand out with high probability.
        • Can identify 19 biased coins out of 100 million with sufficient number of throws
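
The arithmetic behind this slide can be checked with a short calculation. The slide does not say how biased the coin is, so the sketch assumes the extreme case of a coin that always lands heads; with a weaker bias more throws would be needed. All names below are hypothetical.

```python
import math

# Critics' estimate: a 5% error rate applied to 100 million records.
error_rate_percent = 5
suspects = 100_000_000
naive_false_positives = suspects * error_rate_percent // 100

# Coin example: 1,000 coins, one of which is biased.
# ASSUMPTION: the biased coin always shows heads.
fair_coins = 999


def expected_fair_all_heads(throws):
    """Expected number of fair coins that show heads on every throw."""
    return fair_coins * 0.5 ** throws


def prob_some_fair_coin_all_heads(throws):
    """Chance that at least one fair coin mimics the biased one."""
    return -math.expm1(fair_coins * math.log1p(-(0.5 ** throws)))


print(naive_false_positives)
print(expected_fair_all_heads(1), expected_fair_all_heads(30))
print(prob_some_fair_coin_all_heads(30))
```

After a single throw about half of the fair coins look identical to the biased one, so one observation per item yields a flood of false positives. After 30 throws the chance that any fair coin still matches is below one in a million, which is the sense in which correlating many observations reduces false positives.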
  28. Another Approach: Link Analysis
      • Can Find Unusual Patterns in the Network Structure
  29. Analytic technology can be effective
      • Combining multiple models and link analysis can reduce false positives
      • Today there are millions of false positives with manual analysis
      • Data Mining is just one additional tool to help analysts
      • Analytic Technology has the potential to reduce the current high rate of false positives
  30. Data Mining with Privacy
      • Data Mining looks for patterns, not people!
      • Technical solutions can limit privacy invasion
        • Replacing sensitive personal data with anon. ID
        • Give randomized outputs
        • Multi-party computation: distributed data
        • …
      • Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
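
The first technical solution, replacing sensitive personal data with an anonymous ID, could look roughly like the sketch below. It uses a keyed hash (HMAC-SHA256) as one possible technique; this is an assumption, not necessarily the approach described by Bayardo & Srikant. The key, the sample records, and the function name `anon_id` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def anon_id(personal_identifier: str) -> str:
    """Map an identifier to a stable pseudonym that does not reveal it."""
    digest = hmac.new(SECRET_KEY, personal_identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Made-up records: (name, transaction amount).
records = [("Alice Example", 120.0), ("Bob Example", 75.5), ("Alice Example", 30.0)]
anonymized = [(anon_id(name), amount) for name, amount in records]

for row in anonymized:
    print(row)
```

Because the same person always maps to the same ID, patterns across records remain minable while names stay out of the analysis data. This is pseudonymization rather than full anonymization, since anyone holding the key can recompute the mapping.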
  31. The Hype Curve for Data Mining and Knowledge Discovery
      [Figure labels: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming]
  32. Summary
      • Thank You!
      • www.KDnuggets.com: the website for Data Mining and Knowledge Discovery
      • Contact: Gregory Piatetsky-Shapiro [email_address]
      • That’s all folks!
