Getting Started with Unstructured Data
    Christine Connors & Kevin Lynch
    TriviumRLG LLC

    November 17, 2011


Thursday, November 17, 2011
Meta

    ✤   Presenter: Christine Connors

         ✤    @cjmconnors

    ✤   Presenter: Kevin Lynch

         ✤    @kevinjohnlynch

    ✤   Principals at www.triviumrlg.com

    ✤   Partnering with Dataversity


Agenda

    ✤   What is unstructured data?

    ✤   Where do we find it?

    ✤   How important is it?

    ✤   How do we visualize it?

    ✤   Machine processing for actionable data

    ✤   Tools


What is unstructured data?


    ✤   Data which

         ✤    Is not in a database

         ✤    Does not adhere to a formal data model

    ✤   Content




Isn’t that a misnomer?

    ✤   Problematic term

    ✤   The presence of object metadata or aesthetic markup does not alone
        give ‘structure’ in this sense of the word

         ✤    Object metadata = machine or applied properties

         ✤    Aesthetic markup = stylesheets; rendering information

    ✤   Semi-structured data is typically treated as unstructured for the
        purposes of machine processing and analysis


Types of ‘un’structured data



    ✤   Text-based documents

         ✤    Word processing, presentations, email, blogs, wikis, tweets, web
              pages, web components (read/write web)

    ✤   Audio/video files




Where do we find it?

    ✤   Office productivity suites

    ✤   Content management systems

    ✤   Digital asset management systems

    ✤   Web content management systems

         ✤    Wikis, blogs, comment & discussion threads

    ✤   Social networking tools

         ✤    Twitter, Yammer, instant messengers

Is it really that important?
    ✤   Structured: 15%

    ✤   Unstructured: 85%




What’s in that 80-85%?




    ✤   Progress reports -
        created in a word processor




What’s in that 80-85%?




    ✤   Dashboards -
        created in presentation software




What’s in that 80-85%?



    ✤   Progress reports -
        color-coded text in a
        spreadsheet




What’s in that 80-85%?



    ✤   Brainstorming -
        in messaging systems

    ✤   Decision making - in email




What’s in that 80-85%?




    ✤   Business intelligence - on the
        web and more




How can we make the data more actionable?

    ✤   Identify it

    ✤   Convert to a format you can work with

    ✤   Add structure, meaning:

         ✤    information extraction

         ✤    annotation

         ✤    content analytics
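The "convert to a format you can work with" step can be sketched for one common case, a web page. This is a minimal illustration, not a production converter: it assumes the source is HTML and uses only Python's standard `html.parser`. The class name `TextExtractor`, the helper `html_to_text`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping <style>/<script> bodies (rendering information)."""

    SKIP = {"style", "script"}
    # Tags treated as block boundaries so adjacent blocks do not run together.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page's text with markup removed and whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical sample page.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Status</h1><p>Project is <b>on track</b>.</p></body></html>"
)
text = html_to_text(page)
print(text)
```

The plain text that comes out is what the later steps (information extraction, annotation, content analytics) would operate on.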


What about enterprise search?


    ✤   First line of defense

    ✤   Points you to the highest relevance-ranked data via pattern matching
        and statistical analysis

    ✤   Does not assist in other visualizations or transformations without
        further machine processing
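As a rough illustration of relevance ranking by statistical analysis, the sketch below scores documents against a query with a simple TF-IDF weighting. TF-IDF is swapped in here as one well-known technique; the slides do not say which method any particular search product uses. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (score, document_index) pairs, highest TF-IDF score first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    results = []
    for i, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed inverse document frequency.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        results.append((score, i))
    # Sort by descending score; ties keep the original document order.
    return sorted(results, key=lambda r: (-r[0], r[1]))


# Hypothetical sample documents.
docs = [
    "Quarterly progress report for the search project",
    "Lunch menu for Friday",
    "Progress report: enterprise search rollout progress",
]
ranked = rank("progress report", docs)
for score, index in ranked:
    print(round(score, 3), docs[index])
```

The output is only an ordered list of documents, which matches the slide's point: ranking points you at content but does not by itself produce entities, relationships, or other structure.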




Information Extraction


    ✤   Token identification - “tokenization”

    ✤   Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective,
        etc.)

    ✤   Phrase identification - noun phrase

    ✤   Entity extraction - people, places, events, dates, organizations
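Two of the steps above, tokenization and entity extraction, can be sketched with regular expressions alone. This is a deliberately small illustration: the patterns `TOKEN_RE` and `DATE_RE` are hypothetical, cover only English month-day-year dates, and stand in for the much richer models in tools such as GATE or UIMA. Part-of-speech tagging and phrase identification need trained models or grammars and are not shown.

```python
import re

# Hypothetical patterns for illustration only.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


def extract_dates(text):
    """Return (start, end, matched_text) for each date-like span."""
    return [(m.start(), m.end(), m.group()) for m in DATE_RE.finditer(text)]


sentence = "The webinar ran on November 17, 2011."
tokens = tokenize(sentence)
dates = extract_dates(sentence)
print(tokens)
print(dates)
```

Keeping the character offsets with each extracted date is what lets later steps attach the entity back to its place in the source text.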




Information Extraction

    ✤   Cluster analysis - group related information, where relationship may
        not be known

    ✤   Classification - mapping to specific categories

    ✤   Dependency identification / Rule generation

    ✤   Relationship detection - e.g. “Joe” “is CEO” at “IBM”

    ✤   Summarization - key concepts or key sentences
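The relationship detection example (“Joe” “is CEO” at “IBM”) can be sketched as a pattern over text. The regular expression below is a hypothetical, narrow heuristic: capitalized words stand in for real person and organization extraction, and only the "X is [the] ROLE at/of Y" phrasing is covered.

```python
import re

# Hypothetical pattern: "<Person> is [the] <Role> at|of <Organization>".
RELATION_RE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is (?:the )?"
    r"(?P<role>[A-Z]{2,}|[a-z]+) (?:at|of) "
    r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def detect_relations(text):
    """Return (person, role, organization) triples found in the text."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in RELATION_RE.finditer(text)
    ]


relations = detect_relations(
    "Joe is CEO at IBM. Fred Center is the CEO of Center Micros."
)
print(relations)
```

Each triple is a small piece of structure recovered from free text, which could then be loaded into a data store or visualized.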


Open Tools
   ✤    GATE – General Architecture for
        Text Engineering, from the
        University of Sheffield, with many
        users and excellent documentation.

   ✤    GATE has customizable document
        and corpus processing pipelines.
        GATE is an architecture, a
        framework, and a development
        environment, with a clean separation
        of algorithms, data, and
        visualization.


Open Tools

   ✤    UIMA – Unstructured Information
        Management Architecture (IBM’s
        Watson uses this), originated at
        IBM, now an Apache project.

   ✤    Component software architecture
        with a document processing
        pipeline similar to GATE. Focus on
        performance and scalability, with
        distributed processing (web
        services).


UIMA
    UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new
    types based on existing ones and update the Common Analysis Structure (CAS) for
    upstream processing.

    Common Analysis Structure (CAS); the UIMA CAS representation is now aligned with the XMI standard

    ✤   Artifact (e.g., document): “Fred Center is the CEO of Center Micros”

    ✤   Analysis results (i.e., artifact metadata)

         ✤    Parser: NP, VP, PP

         ✤    Named Entity: Person, Organization

         ✤    Relationship: CeoOf (Arg1: Person, Arg2: Org)

    Chart by IBM
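The layering in the chart can be mimicked in a few lines of Python. This is not the UIMA API: `Cas`, `Annotation`, and both annotator functions are hypothetical stand-ins that only show the idea of annotators adding stand-off annotations to a shared structure, with a later annotator building on types an earlier one discovered.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type_name: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: artifact text plus stand-off annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def entity_annotator(cas):
    """Toy gazetteer lookup standing in for named entity recognition."""
    gazetteer = {"Fred Center": "Person", "Center Micros": "Organization"}
    for name, type_name in gazetteer.items():
        for m in re.finditer(re.escape(name), cas.text):
            cas.annotations.append(Annotation(type_name, m.start(), m.end()))


def ceo_annotator(cas):
    """Derive CeoOf relationships from existing Person/Organization annotations."""
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            if person.end > org.begin:
                continue
            between = cas.text[person.end:org.begin].strip()
            if between == "is the CEO of":
                cas.annotations.append(
                    Annotation("CeoOf", person.begin, org.end,
                               {"arg1": person, "arg2": org})
                )


cas = Cas("Fred Center is the CEO of Center Micros")
for annotator in (entity_annotator, ceo_annotator):
    annotator(cas)

for rel in cas.select("CeoOf"):
    print(cas.covered_text(rel.features["arg1"]), "->",
          cas.covered_text(rel.features["arg2"]))
```

Because annotations only record offsets into the artifact, the original text is never modified, and each annotator can be developed and swapped independently.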
UIMA




                              Image by
                                IBM
Commercial Tools

    ✤   Oracle Data Mining (Text Mining)

    ✤   IBM SPSS

    ✤   SAS Text Miner

    ✤   Smartlogic

    ✤   Lots of acquisitions going on in the “big data” space

         ✤    HP acquired Autonomy

         ✤    Oracle acquired Endeca

A Note on Tools

    ✤    UIMA and GATE – comprehensive suites of capabilities, each with a
         learning curve.

    ✤    Commercial tools range from unstructured-data capabilities inside DBMSs
         like Oracle to business intelligence tools from Business Objects (which
         acquired Inxight, a Xerox PARC spin-off).

    ✤    Your mileage will vary. The biggest differentiator is your knowledge
         of your data.




What can unstructured data look like post-processing?




Machine Processing


    ✤   Unstructured data

    ✤   Natural language processing, statistical analysis, rules-based
        classification, semantic analysis

    ✤   Machine processing platform: federated search, API, index

    ✤   Visualizations and data stores
Questions?




Thank you
     Christine Connors
     Kevin Lynch
     www.triviumrlg.com



