Data Search

      Searching and Finding information in
    Unstructured and Structured Data Sources


     Erik Fransen                  11.00-12.00 P.M. November, 3
     Senior Business Consultant    IRM UK, DW/BI 2009, London

     Centennium BI expertisehuis
     The Hague, The Netherlands
     e.fransen@centennium.nl
2
Agenda
    • Introduction;
    • Industry models;
    • Combining structured & unstructured data
      – “Pure Portal”
      – “Index it all”
      – “Structure it all”
    • Summary.




3
Profile
    • Erik Fransen
    • Background: Knowledge Engineering,
      Middlesex University;
    • Expertise areas:
      –   Business Intelligence
      –   Knowledge engineering
      –   Knowledge & Content management
      –   Data warehousing
      –   Analytics
    • CBIP.
4
Introduction




5
Combining BI with unstructured data
    •   Integrated access to relevant information (‘provide complete picture’);
    •   Unstructured data like documents provide valuable context to numerical
        data;
         – Customer complaints
         – Competitor’s press releases
         – Marketing documents
         – …
    •   Insurance fraud analysis (i.e. claim statistics and claim forms);
    •   Competitive Intelligence (i.e. market share data and competitor news);
    •   Customer retention (i.e. sales data and customer complaints);

    •   Data Search acts as a bridge between structured and unstructured data.




6
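(Un)structured data keeps growing…
     [Figure: timeline of information milestones plotted against data volume in gigabytes.
      Milestones: cave paintings and bone tools 40,000 BC; writing 3500 BC; paper 105;
      printing 1450; electricity and telephone 1870; transistor 1947; computing 1950;
      Internet (DARPA) late 1960s; the Web 1993.
      Database milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03.
      Volume markers: 1999, 2000, 2001, 2005, 2009. Annotation: >80% unstructured.
      Source: Forrester]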

 7
Industry Model:
    Bill Inmon’s DW 2.0™
    [Figure: DW 2.0 diagram with the labels “Text” and “Data”]
    •   Hold data at the lowest detail;
    •   Hold data to infinity;
    •   Have integrity of data and have
        online high-performance
        transaction processing;
    •   Tightly couple metadata to the
        data warehouse environment;
    •   …
    •   Link structured data
        and unstructured data.




8
Industry Model:
    Information Access Architecture (Gartner)




9
Industry Model:
     Enterprise Search Platform (Forrester)




10
Data Search Scenarios

     Searching and Finding information in
     Unstructured and Structured Data Sources




11
Global architecture
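     [Figure: architecture diagram. Components shown, left to right:
      Structured side: OLTP and financial apps; middleware; ODS and DWH; data marts and
      cubes; reports, OLAP, mining.
      Unstructured side: content management system, file servers, email,
      intranet/internet; search index database; search, text mining, visualisation.
      Master & meta data sits above the sources; a portal sits at the far right.]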

12
Three data search scenarios
     [Figure: the global architecture diagram with the three scenarios marked on it:
      “Structure it all” on the structured side next to the DWH,
      “Index it all” between the data marts and the search index,
      “Pure Portal” next to the portal.]

13
Scenario 1: Pure Portal

     Many portlets, one user interface;
     Business user manually combines content
     from several independent sources;
     Risk: too complex for user.


14
1: Pure Portal
     [Figure: the global architecture diagram with “Pure Portal” marked next to the portal.]

15
Integrate news with BI information




                                   Source: Aruba


16
Structured BI info…




17
… and Photos, Files and Maps




18
Scenario 2: “Index it all”

     Enterprise Search from one user interface;
     Business user knows what to look for and expects
     a “complete picture” as a result;
     Risk: Many irrelevant search results due to the
     nature of document indexing.
19
2: Index it all
     [Figure: the global architecture diagram with “Index it all” marked between the
      data marts and the search index.]

20
Scenario 2: “Index it all”


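     [Figure: data flow for this scenario.
      Unstructured data sources feed the search index, which serves the search application.
      Structured data sources feed the data warehouse architecture, which produces
      reports for the BI application.
      A BI report is indexed as if it were a document.
      The search application and the BI application share one user interface.]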
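
The idea of indexing a BI report "as if it were a document" can be illustrated with a small sketch. This is not how Cognos, OmniFind or any other product works internally; it is a minimal inverted index in Python, and the item ids, the `kind` field and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(items):
    """Map each token to the set of item ids whose text contains it."""
    index = defaultdict(set)
    for item_id, item in items.items():
        for token in tokenize(item["text"]):
            index[token].add(item_id)
    return index


def search(index, query):
    """Return the ids of items that contain every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical content: two documents and one BI report rendered to plain text.
# The report is indexed exactly like the documents; only "kind" differs.
items = {
    "doc-1": {"kind": "document", "text": "Customer complaint about late claim payment"},
    "doc-2": {"kind": "document", "text": "Competitor press release on new pricing"},
    "report-1": {"kind": "bi_report", "text": "Claim statistics per region, quarterly report"},
}

index = build_index(items)
hits = sorted(search(index, "claim"))
for item_id in hits:
    print(item_id, items[item_id]["kind"])
```

A single query for "claim" returns both the complaint and the report, which is the "complete picture" the scenario aims for. The same mechanism also explains the stated risk: plain term matching has no notion of relevance, so broad queries return many weak hits.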




21
Example: IBM Cognos 8 Go! Search
                                Integration with enterprise
                                search applications (IBM
                                OmniFind, Google OneBox
                                for Enterprise, Yahoo,
                                Autonomy)

                                Search results return all
                                relevant structured content
                                (reports, analyses, etc.)
                                and unstructured content
                                (Word documents, PDFs,
                                etc.) within a single interface.




22
Example: IBM OmniFind




23
Example: IBM OmniFind




24
SAP BusinessObjects Intelligent Search




25
SAP BusinessObjects Intelligent Search




26
Scenario 3: “Structure it all”

     Generate structure using document warehousing
     and text mining;
     Business user knows exactly what to look for;
     Risk: Limited flexibility for user.


27
3: Structure it all
     [Figure: the global architecture diagram with “Structure it all” marked on the
      structured side next to the DWH.]

28
Generating structure in document warehouse
     • Identify Sources
       – Sources are not fixed
       – Iterative process, sources lead to new sources
     • Retrieve Documents
       – Internal sources retrieval: file servers, CMS/DMS
       – External source retrieval, using crawlers, spiders
       – Sources are not fixed
       – Iterative process, sources lead to new sources
     • Preprocess Documents
       – Format documents in a consistent manner
       – Files must be in suitable form for text analysis
     • Text Mining
       – Linguistic analysis
       – Key features are extracted
       – Indexing documents
       – Summarizing documents
     • Compile Metadata
       – Carefully attach metadata to document
       – Used for querying, matching, navigation support
       – Store in document warehouse

     Source: Dan Sullivan

     [Figure: data warehouse architecture and document warehouse architecture,
      joined by “Combine (meta)data”]
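
The steps on this slide can be sketched as a small pipeline. This is only an illustration in Python under simplifying assumptions: the "text mining" step is a naive term-frequency keyword count, the summary is just the first sentence, and `WarehouseEntry`, `run_pipeline`, the stopword list and the sample source are hypothetical names, not part of any tool mentioned in the talk.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# A tiny, hypothetical stopword list; real tools ship far larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with"}


@dataclass
class WarehouseEntry:
    """One document plus the metadata attached to it."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def retrieve(sources):
    """Retrieve step: here the mapping already holds the raw content."""
    return list(sources.items())


def preprocess(raw):
    """Preprocess step: strip tag-like markup and normalise whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def mine(text, top_n=3):
    """Text mining step: naive keywords by frequency, first sentence as summary."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    keywords = [word for word, _ in Counter(tokens).most_common(top_n)]
    summary = text.split(". ")[0]
    return {"keywords": keywords, "summary": summary}


def run_pipeline(sources):
    """Run retrieve, preprocess, mine and compile-metadata for every source."""
    warehouse = []
    for name, raw in retrieve(sources):
        text = preprocess(raw)
        warehouse.append(WarehouseEntry(source=name, text=text, metadata=mine(text)))
    return warehouse


# Hypothetical identified source.
sources = {
    "complaint-17.html": "<p>Billing error on invoice. Customer reports billing error twice.</p>",
}
warehouse = run_pipeline(sources)
entry = warehouse[0]
print(entry.source, entry.metadata)
```

The stored metadata is what later supports querying, matching and navigation; the document text itself stays in the warehouse entry.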
29
Document warehouse

        • Contains complete documents or URLs
        • Metadata about documents:
          summaries, authors’ names, publication
          dates, titles, sources, keywords, etc.
        • Translations of documents
        • Thematic clustering of similar documents
        • Topical or thematic indexes
        • Extracted key features (structure)
          – Dimensions and Facts, linked to documents,
            summaries etc.
          – Combine with the data warehouse

        [Figure: document warehouse architecture]



30
BI reporting on dimensional model

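     [Figure: dimensional model spanning both warehouses.
      Fact tables: Sales Facts (data warehouse side) and Call Facts (document warehouse side).
      Dimensions: Product, Customer, Sales person, Time, Action, Competitor, Telco Term.
      Customer and Time are drawn between the two fact tables.]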
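
A report over such a model could combine both fact tables through the dimensions they have in common. The sketch below is in Python and assumes that Customer and Time are the shared dimensions, which the slide layout suggests but does not state; all rows, field names and the `combine` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows. Sales facts come from the data warehouse;
# call facts are assumed to be derived by text mining call notes
# stored in the document warehouse.
sales_facts = [
    {"customer": "C1", "month": "2009-10", "revenue": 120.0},
    {"customer": "C1", "month": "2009-11", "revenue": 80.0},
    {"customer": "C2", "month": "2009-11", "revenue": 200.0},
]
call_facts = [
    {"customer": "C1", "month": "2009-11", "action": "complaint", "competitor": "X"},
    {"customer": "C1", "month": "2009-11", "action": "complaint", "competitor": None},
    {"customer": "C2", "month": "2009-11", "action": "question", "competitor": None},
]


def combine(sales, calls):
    """Aggregate both fact sets per (customer, month), the assumed shared dimensions."""
    report = defaultdict(lambda: {"revenue": 0.0, "complaints": 0})
    for row in sales:
        report[(row["customer"], row["month"])]["revenue"] += row["revenue"]
    for row in calls:
        # Make sure every (customer, month) seen in calls appears in the report.
        cell = report[(row["customer"], row["month"])]
        if row["action"] == "complaint":
            cell["complaints"] += 1
    return dict(report)


report = combine(sales_facts, call_facts)
for key in sorted(report):
    print(key, report[key])
```

Placing revenue next to complaint counts for the same customer and month is the kind of combined view the customer retention example at the start of the talk describes.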
31
Generate structure using text mining tools




     Example taken from SPSS PASW Text Analytics; many other tools are available
     (IBM, SAS, Oracle, SAP BO, Microsoft, etc.).
32
Generating structure using UIMA
     • Unstructured Information Management Architecture
     • Originates from IBM, now Apache UIMA




     http://incubator.apache.org/uima/           Source: IBM

     UIMA is supported by all main BI vendors.
33
Example: Generating structure using UIMA


     • Analyzed by a collection of text analytics
     • Detected Semantic Entities and Relations Highlighted
     • Represented in UIMA Common Analysis Structure (CAS)
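
To give a feel for what such a representation holds, here is a simplified stand-in in Python. It is not the Apache UIMA API and not the real CAS; it only mirrors the general idea of keeping the original text untouched and storing detected entities as annotations with begin and end offsets. The `Annotation` class, the `annotate` function, the lexicon and the sample sentence are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed span over the document text, given as character offsets."""
    begin: int
    end: int
    type: str


def annotate(text, lexicon):
    """Toy annotator: mark every literal occurrence of a lexicon term."""
    annotations = []
    for entity_type, terms in lexicon.items():
        for term in terms:
            for match in re.finditer(re.escape(term), text):
                annotations.append(Annotation(match.start(), match.end(), entity_type))
    return sorted(annotations, key=lambda a: a.begin)


# Hypothetical document and lexicon.
text = "Acme filed a claim in London."
lexicon = {"Organization": ["Acme"], "Location": ["London"]}

annotations = annotate(text, lexicon)
covered = [(a.type, text[a.begin:a.end]) for a in annotations]
print(covered)
```

Because annotations only point into the text, several analysis components can add their own annotation types to the same document without interfering with each other.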




34
Summary
     • Growing business need for combining BI with
       unstructured data;
     • Data Search bridges the gap between both
       worlds
       – Scenario 1: “Pure Portal”
       – Scenario 2: “Index it all”
       – Scenario 3: “Structure it all”
     • Scenarios can be combined.

                             Questions?
35
