  • Welcome to today’s presentation on an exciting new data mining workbench from Insightful Corporation. Today we will discuss Insightful Miner, our highly scalable, next-generation predictive modeling solution for both new data miners and skilled analytic professionals. My name is Richard Leavitt, and I am the Director of Product Marketing for Insightful. With me today is Jim Walter, our VP of Development, and we look forward to giving you an overview of this exciting new product line.
  • Data Mining Technology

    1. 1. Data Mining Technology Conference
          Insightful Corporation
          Jim Walter, Vice President of Research & Development
          Brand Niemann, Computer Scientist, US EPA
    2. 2. Agenda
          • A Brief History of Analysis
          • Data Mining Content Types
              • Numbers
              • Text
              • Image & Signal
          • Brand Niemann – XML
          • Time Permitting
          • Technology adoption - what to look for
          • Demonstration
    3. 3. Historical Context
          [Timeline axis: 1900…, 1950, 1960, 1970, 1980, 1990, 2000]
          • Simple Summarization: Hollerith card
          • Complex Calculation: Language (e.g. FTN)
          • Storage & Retrieval: Hierarchical & Relational
          • Complex Reporting: OLAP / BI
          • Data Mining: Machine Learning
          • Fusion & Synthesis: ??
    9. 9. Trend: Increasing Complexity
          [Chart: complexity plotted against time]
          Stages: Collection, Aggregation, Simple Reporting, Flexible Reporting, Relationships & Interactions
          Roles: Clerk, Information Worker
    15. 15. Technology Adoption Model
          [Chart: adoption density plotted against time]
          Segments: Techies, Early Adopter, Early Majority, Late Majority, Technology Laggards
          Marked on the curve: Chasm
    18. 18. Technology Adoption Model
          [Chart: adoption density plotted against time]
          Segments: Early Adopter, Early Majority, Late Majority, Technology Laggards
          Numeric (Structured): DBMS, OLAP, BI, Numeric Data Mining
          Text & Image (Unstructured): Keyword Search, TextCat, Image Mining, Text Mining, Q&A
          Fusion (future): Fusion
    19. 19. Trend: From big to HUGE!
    20. 20. Why is Data Mining Needed?
          • Very large data (large N)
          • Many dimensions (large P)
          • Complicating factors
              • Time
              • Space
              • Seasonality
              • Interactions
          • Key relationships not yet known
              • Frequently changing
          • Need to project forward
          • Trends
          • Increasing complexity
          • Increasing diversity
          • Increasing scale
    21. 21. Content Types
          • Numeric
          • Text
          • Image & Signal
          • Fusion XML
    22. 22. [Diagram: enterprise data flow]
          Enterprise Data Silos: ERP, CRM, SCM, …
          Flat File Extracts: ERP, CRM, SCM, …
          Preprocess: Clean, Merge, Transform, Aggregate
          Rollup (DM, DW): Mnthly Trx Sum, Cust Rec, …
          Build View
              • De-Normalize
              • Cust = Row
              • Create feature
          Build Model
          Evaluate Model
          Deploy
          Life-cycle stages: Access; Clean, Interpret & Extract; Train; Evaluate; Predict
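
The rollup and "Build View" steps on this slide (monthly transaction sums, one customer per row, derived features) might look roughly like the sketch below. It is only an illustration, not the Insightful Miner workflow: the field names `cust_id`, `month`, and `amount`, the sample transactions, and the derived features are all assumptions.

```python
from collections import defaultdict

# Hypothetical flat-file extract: one row per transaction (field names are assumptions).
transactions = [
    {"cust_id": "C1", "month": "2003-01", "amount": 120.0},
    {"cust_id": "C1", "month": "2003-01", "amount": 30.0},
    {"cust_id": "C1", "month": "2003-02", "amount": 50.0},
    {"cust_id": "C2", "month": "2003-02", "amount": 200.0},
]

def monthly_rollup(rows):
    """Rollup: sum transaction amounts per (customer, month)."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["cust_id"], r["month"])] += r["amount"]
    return dict(totals)

def customer_view(rows):
    """Build view: de-normalize to one record per customer and add features."""
    view = {}
    for (cust, _month), total in monthly_rollup(rows).items():
        rec = view.setdefault(cust, {"n_months": 0, "total_spend": 0.0})
        rec["n_months"] += 1
        rec["total_spend"] += total
    for rec in view.values():
        # Derived feature: average spend over the months with activity.
        rec["avg_monthly_spend"] = rec["total_spend"] / rec["n_months"]
    return view

view = customer_view(transactions)
```

Each key of `view` is one customer, which is the "Cust = Row" layout that model training expects.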
    28. 28. Data Access
          • Delimited text
          • Fixed format text
          • Common vendor formats
              • S-PLUS, SAS, SPSS…
              • Excel, Lotus, Access…
          • DBMS
              • ODBC/JDBC
              • Native
          • Domain specific
              • User-created
          50%
          Rows: cases, customers, citizens
          Columns: features, variables
          Independent variables (predictors); dependent variable (response)
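
As a rough illustration of the first two access formats on this slide, the sketch below reads the same two cases from delimited text and from fixed-format text, then splits the columns into predictors and a response. The column names, field widths, and values are invented for the example.

```python
import csv
import io

# Hypothetical sample data: the same two cases in both formats.
delimited = "age,income,purchase\n34,52000,1\n51,87000,0\n"
fixed = " 34 52000 1\n 51 87000 0\n"

# Fixed-format layout as (name, start, end) character offsets; an assumption.
FIELDS = [("age", 0, 3), ("income", 3, 9), ("purchase", 9, 11)]

def read_delimited(text):
    """Parse comma-delimited text with a header row into integer records."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: int(v) for k, v in row.items()} for row in reader]

def read_fixed(text, fields):
    """Parse fixed-width text by slicing each line at known offsets."""
    return [
        {name: int(line[start:end]) for name, start, end in fields}
        for line in text.splitlines()
    ]

rows = read_delimited(delimited)

# Rows are cases; columns are features. Split predictors from the response.
X = [(r["age"], r["income"]) for r in rows]
y = [r["purchase"] for r in rows]
```

Both readers return the same list of records, so downstream steps do not need to know which format the data arrived in.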
    29. 29. Initial Exploration Scatter Plot
    30. 30. Secondary Exploration Color Plot Note “noise” at the boundary
    31. 31. Training
    32. 32. Evaluate the Training effectiveness
    33. 35. Data Mining Life Cycle
          (1) Access
          (2) Exploration & Feature Extraction
          (3) Training
          (4) Evaluation
          (5) Prediction
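
The five life-cycle stages can be walked through end to end with a toy example. The sketch below is an assumption-laden illustration, not the method shown in the talk: the data points are invented, and a nearest-centroid classifier stands in for whatever model the presenters trained.

```python
# (1) Access: a small in-memory table stands in for a file or DBMS read.
data = [
    # (x1, x2, label) -- invented values
    (1.0, 1.2, 0), (0.8, 0.9, 0), (1.1, 0.7, 0), (0.9, 1.0, 0),
    (3.0, 3.2, 1), (3.1, 2.8, 1), (2.9, 3.0, 1), (3.2, 3.1, 1),
]

# (2) Exploration & feature extraction: here the features are just (x1, x2).
def features(row):
    x1, x2, _label = row
    return (x1, x2)

# Hold out every second row for evaluation.
train, test = data[::2], data[1::2]

# (3) Training: compute one centroid per class.
def train_centroids(rows):
    sums = {}
    for row in rows:
        f, label = features(row), row[2]
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += f[0]
        s[1] += f[1]
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

# (5) Prediction: assign the class whose centroid is closest.
def predict(model, f):
    return min(
        model,
        key=lambda label: (f[0] - model[label][0]) ** 2 + (f[1] - model[label][1]) ** 2,
    )

# (4) Evaluation: accuracy on the held-out rows.
def accuracy(model, rows):
    hits = sum(predict(model, features(r)) == r[2] for r in rows)
    return hits / len(rows)

model = train_centroids(train)
acc = accuracy(model, test)
new_label = predict(model, (2.7, 3.3))
```

Evaluation uses rows the model never saw during training, which is the point of keeping stage (4) separate from stage (3).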
    34. 36. Text Mining Applications
          [Diagram labels: Visualization, Interpretation, Data Extraction, Other Text Mining, Question Answering, Exploratory Search, Information Extraction, GUI, Database, Documents, Linguistic Primitives]
    35. 37. KEY DIFFERENTIATOR: Creating Structure from Unstructured Text
          [Diagram labels: Data Extraction, Information Extraction, Database, Documents, Linguistic Primitives]
    36. 38. Classical Structured Analysis Techniques
          [Diagram labels: Visualization, Interpretation, Linguistic Primitives, Regression, Neural Nets, CART…, Pattern Matching, Linguistic Normalization, GUI, Database]
    37. 39. Current State of Art
          Levels: data, features, structural relationships, information
              • morphological normalization
              • semantic normalization
          • syntactic normalization: governing verb of each sentence, subject, object, etc.
          • facts databanks
          “NEXT GENERATION” SEARCH STOPS HERE!
    38. 40. SEARCH ENGINES STORE WORDS; DEEP EXTRACTION STORES FACTS
          Nothing can reconstruct original facts from a count of keywords. Instead, InFact stores facts…
          "From 1949 to 1960 China was in alliance with the Soviet Union, although this relationship was already under severe strain in the late 1950s. There followed, in 1960-72, a period of isolation, during which China sought to identify itself as a natural leader of the developing world in its resistance to "US imperialism". From 1972 China found itself in de facto alliance with the US against perceived Soviet expansionism. That epoch came to a definitive end in 1989, when relations with the Soviet Union were normalized and the Beijing massacre introduced new and severe strains into Sino-US relations."
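
The contrast on this slide can be made concrete with a small sketch. The keyword index below is computed from an abridged version of the quoted passage; the fact records are written by hand for illustration and are not InFact output. The record layout (`subject`, `relation`, `object`, `start`, `end`) is an assumption.

```python
import re
from collections import Counter

# Abridged from the passage quoted on the slide.
text = (
    "From 1949 to 1960 China was in alliance with the Soviet Union. "
    "From 1972 China found itself in de facto alliance with the US."
)

# What a keyword index keeps: word counts, with no structure.
words = Counter(re.findall(r"[a-z]+", text.lower()))

# What a fact store might keep (hand-written, hypothetical schema).
facts = [
    {"subject": "China", "relation": "alliance_with", "object": "Soviet Union",
     "start": 1949, "end": 1960},
    {"subject": "China", "relation": "alliance_with", "object": "US",
     "start": 1972, "end": 1989},
]

def allies_in(year, subject="China"):
    """Answer 'who was <subject> allied with in <year>?' from the fact store."""
    return [
        f["object"]
        for f in facts
        if f["subject"] == subject
        and f["relation"] == "alliance_with"
        and f["start"] <= year <= f["end"]
    ]
```

The counts show that "alliance" occurs twice but cannot say with whom or when, whereas `allies_in(1955)` can be answered directly from the records.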
    39. 41. Q&A Example
    40. 44. EXPLORATORY SEARCH
    41. 45. TEXT MINING BASED ON SHALLOW IE
          Categories: QUANTITIES; COMPANY, ORGANIZATION, COUNTRY NAMES; PRODUCTS
          Items: UBL China 176,000 60 cents CEA LVNL JDAM UK Army Helicopter Wedgetail Aircraft Boeing
    42. 46. TEXT MINING BASED ON DEEP IE
          Categories: QUANTITIES; COMPANY, ORGANIZATION, COUNTRY NAMES; PRODUCTS
          Items: UBL China 176,000 60 cents CEA LVNL JDAM UK Army Helicopter Wedgetail Aircraft Boeing
          Relations: cooperate, collaborate, roll out, buy, test, drop, lay off, mothball
    43. 47. Integrating Text Mining & Visualization
    45. 49. Image Mining Common Themes
          • Application Domains:
              • Medical Imaging
                  • CT, MR, US
              • Microscopic Imaging
              • Video Processing
              • Machine Vision
              • Document Imaging
              • Remote Sensing
              • Tactical Imaging IR, EO
              • More…
          Insightful Imaging Library: Segmentation, Enhancement, Feature Extraction, Clustering, Classification, Registration
    46. 50. Segmentation of Noisy Images
          • Problem:
              • Noisy image data is hard to interpret and leads to inconsistent outcomes.
          • Application:
              • Prostate outlining in ultrasound images.
          • Solution:
              • Non-linear model fitting enables more efficient and consistent delineation for improving cancer treatment planning.
              • Technology suitable for other image processing applications with high noise, such as SONAR images.
    47. 51. Two Observers’ Delineation of the Prostate on Ultrasound Images
          Manual Delineation: Note the large variation between the observers.
          Delineation Using Imaging Library Technology: The inter-observer variation is significantly lower.
    48. 52. BRINGING IMAGES INTO THE REALM OF STATISTICS
          • Searching for patterns and objects in images
          • Analyzing image properties with statistical tools
          • Organizing databases of images (satellite/medical)
          • Interpreting and classifying visual information
    49. 53. AUTOMATED FEATURE EXTRACTION FROM IMAGES: MULTIPLE FEATURES, MULTIPLE SCALES
          [Figure: image subdivided at several scales, with cells labeled 0 to 3, 20 to 23, and 210 to 213, and features indexed Index 1 … Index k]
    50. 54. SEARCH FOR AIR STRIPS
          ANALYZE WITH S-PLUS
          CLASSIFY TISSUES AND DISEASES
          COURTESY: U. OF DELAWARE & EAST TENNESSEE STATE UNIVERSITY
    51. 55. Fusion: Future Model for Analyzing Data
          [Diagram labels: Text Mining: information extraction from unstructured data; Text, Image, Video; Presentation and Analysis; Data Mining & Prediction; Data Warehouse; Data Integration]
    52. 56. Critical Issues
          • Support all required content
              • Numeric, text, image, signal…
          • Support full life cycle
          • Visualization
          • “Platform” – look for language-based tools
          • Scalable – look for a pipeline construct
          • Extensible – check out the architecture
    53. 57. Scalability: Pipeline
    63. 67. Scalability: Pipeline Training Scoring
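
One way to read the "pipeline construct" is that data flows through the processing stages in blocks, so no stage ever holds the whole data set in memory. The sketch below shows that idea with Python generators. It is not Insightful Miner's implementation; the stage names, the block size, and the scoring rule are assumptions.

```python
from itertools import islice

def read_rows(n):
    """Stand-in for a file or database reader: yields one row at a time."""
    for i in range(n):
        yield {"id": i, "x": float(i % 10)}

def chunked(rows, size):
    """Group a row stream into blocks of at most `size` rows."""
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def transform(blocks):
    """Feature step: add a derived column to every row of each block."""
    for block in blocks:
        yield [dict(r, x_sq=r["x"] ** 2) for r in block]

def score(blocks, threshold=25.0):
    """Scoring step: flag rows with a hypothetical threshold rule."""
    for block in blocks:
        yield [dict(r, flag=r["x_sq"] > threshold) for r in block]

def run(n_rows, chunk_size):
    """Drive the pipeline; only one block is materialized at a time."""
    flagged, peak = 0, 0
    for block in score(transform(chunked(read_rows(n_rows), chunk_size))):
        peak = max(peak, len(block))
        flagged += sum(r["flag"] for r in block)
    return flagged, peak

flagged, peak = run(1000, 64)
```

Because every stage is lazy, memory use is bounded by the block size rather than by the number of rows.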
    64. 68. 1-D Visualization
          • Histograms
          • Barcharts
          • Piecharts
          • Density
          • Dot
          • …
    65. 69. 2-D Visualization
          • Scatterplot
          • Line
          • Box
          • Strip
          • QQ
          • …
    66. 70. 3-D Visualization
          • Contour
          • Level
          • Surface
          • Cloud
          • …
    67. 71. Useful Variations
          • Color plot
          • Shape plot
          • Scatterplot matrix
          • Overlays
          • Trellis (conditioning)
          • Should be automated
          • Should be extensible
          • Rotation
          • Overlays
    68. 72. Eye Candy
          • Slick-looking
          • Unused dimensions
          • Hard to interpret
          • 3-D Bar
          • 3-D Pie
          • Brush & spin
          • Multi-plane plots
    69. 73. Issues
          • Language-based vs. menu-driven
          • Platform vs. application (build vs. use)
          • Open-ended vs. point solution
          • Command line
          • Extensibility
              • Open methods
              • C++/Java extensibility
              • XML/Web Services
          • Visual programming metaphor
          • Pipeline architecture
          • PL1 vs. object oriented
          • Scalability & Interactivity
    73. 77. Architecture & Technologies
          [Diagram labels]
          Layers: Application, Platform, Library
          Components: Graphical User Interface, Library API, Algorithm, I/O, Data, Platform API, Pipeline, Interpreter, Viz, User, User
          Technologies: COM, ASP, OLE, JSP, DDE, EJB; C++, Java, some Ftn, some 4gl; Java & C++; XML
    74. 78. Extensible Architecture
          [Diagram labels: Engine, Vendor API, User Code (Algorithms), User Code (GUI)]
    75. 79. What influences Success?
          • Data mining is typically I/O bound
              • RAID disk systems
              • Locate analysis near data (i.e. not across network)
              • Databases & warehouses often too slow – ETL
              • Use sampling – especially during exploration
          • Data is often very dirty
              • Data mining tools typically offer sophisticated methods – use them
              • Discarding data can skew results
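
The slide recommends sampling during exploration but does not name a method. One common choice for I/O-bound data is reservoir sampling, which draws a fixed-size uniform sample in a single pass without knowing the row count in advance. The sketch below is a minimal version (Algorithm R); the function name and seed handling are assumptions made for the example.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Return a uniform random sample of up to k rows from a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Usage: sample 100 rows from a stream of 10,000 without loading them all.
sample = reservoir_sample(range(10_000), 100, seed=42)
```

Because the input is consumed as a stream, the same function works on a file reader or database cursor that is too large to hold in memory.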
    76. 80. Desktop Performance
          • Components should be validated on data sets as large as 25 GB, with as many as 125,000,000 rows and 5,000 columns.
          • Reading and writing files operate at ~5.0 MB/s. E.g., read or write a 7,000,000-row by 30-column data set (1.2 GB) in less than 3½ minutes.
          • ~6½ minutes to train an ensemble of trees on a 1,000,000-row by 30-column data set (180 MB).
          • Scoring components (predictors) should perform at ~500,000 rows per minute.
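
The figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 MB = 1,000,000 bytes), which the slide does not specify. Under that assumption, 1.2 GB at exactly 5.0 MB/s takes 240 seconds, so finishing in under 3½ minutes implies a rate nearer 5.7 MB/s; the "~5.0 MB/s" is evidently a rounded figure.

```python
MB = 1_000_000   # assumption: decimal megabytes
GB = 1_000 * MB

file_size = 1_200 * MB           # the 1.2 GB data set from the slide
rows, cols = 7_000_000, 30

# Time to read 1.2 GB at the quoted ~5.0 MB/s.
read_seconds = file_size / (5 * MB)

# Rate implied by the quoted "less than 3 1/2 minutes" (210 seconds).
implied_rate_mb_s = file_size / 210 / MB

# Average storage per value in the 7,000,000 x 30 data set.
bytes_per_value = file_size / (rows * cols)

# Time to score the largest validated data set at ~500,000 rows per minute.
scoring_minutes = 125_000_000 / 500_000
```

At the quoted scoring rate, the 125,000,000-row upper bound would take a little over four hours to score.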
    77. 81. Conclusions
          • Data is becoming larger, more complex and more diverse in form
          • Data mining is needed to extract complex relationships from large, high-dimensional data
          • Data mining is being applied to all content types including numeric, text, image and signal data
          • Unstructured data (e.g. text, image) must be structured to be analyzed…
    78. 82. Deciding on a Solution
          • Content breadth of offerings and skills
              • Numeric, text, image & signal
          • Product features
              • Support full life cycle
              • Language-based
              • Scalable
              • Visualization
              • Extensible
    79. 83. Questions for audience
          • Who knows something about ML?
              • Ignorant
              • Knowledgeable
              • Expert
          • Tree example
              • Customer data
              • Reporting (list all customers, age, income, purchase)
              • Sort and report
              • Create new variable (age*income)
              • Viz (scatter, color plots)
              • Tree learning
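
The tree example outlined on this slide can be sketched with a single-split tree (a decision stump). This is a deliberate simplification of tree learning, and the customer records below are invented so that neither age nor income alone separates buyers from non-buyers, while the derived variable age*income does.

```python
# Hypothetical customer data: (age, income in $1000s, purchase).
customers = [
    (25, 30, 0), (55, 20, 0), (30, 40, 0), (60, 22, 0),
    (35, 60, 1), (28, 80, 1), (50, 45, 1), (65, 35, 1),
]

ages = [c[0] for c in customers]
incomes = [c[1] for c in customers]
labels = [c[2] for c in customers]

# Create new variable (age*income), as on the slide.
age_income = [a * i for a, i in zip(ages, incomes)]

def best_split(values, labels):
    """Find the threshold t minimizing errors for 'predict 1 if value > t'.

    Candidate thresholds are midpoints between adjacent distinct values.
    Returns (threshold, number_of_misclassified_rows).
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

age_split = best_split(ages, labels)
income_split = best_split(incomes, labels)
product_split = best_split(age_income, labels)
```

Comparing the three results shows why the feature-creation step matters: only the split on the derived variable classifies every invented customer correctly.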
