Getting Started with Unstructured Data
    Christine Connors & Kevin Lynch
    TriviumRLG LLC

    November 17, 2011


Thursday, November 17, 2011
Meta

    ✤   Presenter: Christine Connors

         ✤    @cjmconnors

    ✤   Presenter: Kevin Lynch

         ✤    @kevinjohnlynch

    ✤   Principals at www.triviumrlg.com

    ✤   Partnering with Dataversity


Agenda

    ✤   What is unstructured data?

    ✤   Where do we find it?

    ✤   How important is it?

    ✤   How do we visualize it?

    ✤   Machine processing for actionable data

    ✤   Tools


What is unstructured data?


    ✤   Data which

         ✤    Is not in a database

         ✤    Does not adhere to a formal data model

    ✤   Content




Isn’t that a misnomer?

    ✤   Problematic term

    ✤   The presence of object metadata or aesthetic markup does not alone
        give ‘structure’ in this sense of the word

         ✤    Object metadata = machine or applied properties

         ✤    Aesthetic markup = stylesheets; rendering information

    ✤   Semi-structured data is typically treated as unstructured for the
        purposes of machine processing and analysis


Types of ‘un’structured data



    ✤   Text-based documents

         ✤    Word processing, presentations, email, blogs, wikis, tweets, web
              pages, web components (read/write web)

    ✤   Audio/video files




Where do we find it?

    ✤   Office productivity suites

    ✤   Content management systems

    ✤   Digital asset management systems

    ✤   Web content management systems

         ✤    Wikis, blogs, comment & discussion threads

    ✤   Social networking tools

         ✤    Twitter, Yammer, instant messengers

Is it really that important?

    [Pie chart: Structured 15%, Unstructured 85%]




What’s in that 80-85%?




    ✤   Progress reports -
        created in a word processor




What’s in that 80-85%?




    ✤   Dashboards -
        created in presentation software




What’s in that 80-85%?



    ✤   Progress reports -
        color coded text in a
        spreadsheet




What’s in that 80-85%?



    ✤   Brainstorming -
        in messaging systems

    ✤   Decision making - in email




What’s in that 80-85%?




    ✤   Business intelligence - on the
        web and more




How can we make the data more
    actionable?

    ✤   Identify it

    ✤   Convert to a format you can work with

    ✤   Add structure, meaning:

         ✤    information extraction

         ✤    annotation

         ✤    content analytics
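
The "convert to a format you can work with" step can be sketched in Python. This is a minimal illustration using only the standard library, not a recommended toolchain; the sample message and the field names in the resulting record are assumptions made up for the example.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message, invented for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 progress report
Date: Thu, 17 Nov 2011 09:30:00 -0500

We decided to ship the beta on Friday.
"""


def email_to_record(raw: str) -> dict:
    """Turn a raw email message into a flat record with named fields."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        # Normalize the free-form date header to ISO 8601.
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
        # The body stays free text; extraction and annotation would come next.
        "body": msg.get_payload().strip(),
    }


record = email_to_record(RAW_EMAIL)
print(record["subject"], record["sent"])
```

The headers become queryable fields, while the body is still unstructured text that needs information extraction or annotation before it carries meaning a machine can use.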


What about enterprise search?


    ✤   First line of defense

    ✤   Points you to the highest relevance-ranked data via pattern matching
        and statistical analysis

    ✤   Does not assist in other visualizations or transformations without
        further machine processing
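
As a rough illustration of relevance ranking by pattern matching plus statistics, the sketch below scores documents against a query with a simple TF-IDF weighting. It is a toy under stated assumptions, not how any particular search product works; the documents, the tokenizer, and the smoothing choice are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (score, doc_id) pairs sorted by descending TF-IDF score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    results = []
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Adding 1.0 keeps terms that occur in every document from scoring zero.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        results.append((score, doc_id))
    # Highest score first; ties broken alphabetically by document id.
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical document collection.
DOCS = {
    "report": "Quarterly progress report for the search project",
    "memo": "Lunch menu for the office party",
    "notes": "Search relevance notes: search ranking and search logs",
}
ranking = rank("search ranking", DOCS)
print([doc_id for _, doc_id in ranking])
```

The output is only an ordered list of documents. Turning those documents into entities, categories, or charts still requires the further machine processing described in the slide.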




Information Extraction


    ✤   Token identification - “tokenization”

    ✤   Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective,
        etc.)

    ✤   Phrase identification - e.g. noun phrases

    ✤   Entity extraction - people, places, events, dates, organizations
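
Two of the steps above, tokenization and entity extraction, can be sketched with regular expressions alone. This is a deliberately crude illustration: the patterns are assumptions chosen for the sample sentence, and real extractors (and POS taggers, which are not shown) rely on trained models or curated dictionaries.

```python
import re

# Words, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Dates written like "November 17, 2011".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)
# Two or more consecutive capitalized words: a crude stand-in for a name finder.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def tokenize(text):
    """Return (token, start, end) triples so later steps can point back at the text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def extract_entities(text):
    """Return (label, surface text) pairs for dates and candidate names."""
    entities = [("DATE", m.group()) for m in DATE_RE.finditer(text)]
    entities += [("NAME", m.group()) for m in NAME_RE.finditer(text)]
    return entities


TEXT = "Christine Connors and Kevin Lynch presented on November 17, 2011."
tokens = tokenize(TEXT)
entities = extract_entities(TEXT)
print(entities)
```

Keeping character offsets with each token is a design choice that lets later stages attach annotations to exact spans of the original text.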




Information Extraction

    ✤   Cluster analysis - group related information, where relationship may
        not be known

    ✤   Classification - mapping to specific categories

    ✤   Dependency identification / Rule generation

    ✤   Relationship detection - e.g. “Joe” “is CEO” at “IBM”

    ✤   Summarization - key concepts or key sentences
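
Relationship detection of the "Joe is CEO at IBM" kind can be sketched as a single hand-written pattern. The pattern, the role list, and the `Relation` type below are hypothetical; production systems typically combine entity extraction with parsing or learned models rather than one regular expression.

```python
import re
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    person: str
    role: str
    organization: str


# Hypothetical pattern: "<Person> is [the] <ROLE> (at|of) <Organization>".
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is (?:the )?"
    r"(?P<role>CEO|CTO|CFO) (?:at|of) "
    r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def detect_relation(sentence: str) -> Optional[Relation]:
    """Return the first person-role-organization triple found, or None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return Relation(match.group("person"), match.group("role"), match.group("org"))


print(detect_relation("Joe is CEO at IBM."))
print(detect_relation("The meeting is at noon."))
```

The result is a typed triple that can be stored or queried, which is the sense in which extraction "adds structure" to a sentence.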


Open Tools
   ✤    GATE – General Architecture for
        Text Engineering, from the
        University of Sheffield, with many
        users and excellent documentation.

   ✤    GATE has customizable document
        and corpus processing pipelines.
        GATE is an architecture, a
        framework, and a development
        environment, with a clean separation
        of algorithms, data, and
        visualization.


Open Tools

   ✤    UIMA – Unstructured Information
        Management Architecture (IBM’s
        Watson uses this), originated at
        IBM, now an Apache project.

   ✤    Component software architecture
        with a document processing
        pipeline similar to GATE. Focus on
        performance and scalability, with
        distributed processing (web
        services).


UIMA
    UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new
    types based on existing ones and update the Common Analysis Structure (CAS) for
    downstream processing. The UIMA CAS representation is now aligned with the XMI standard.

    Common Analysis Structure (CAS): analysis results (i.e., artifact metadata), by layer

    Layer           Annotations
    Relationship    CeoOf (Arg1: Person, Arg2: Org)
    Named Entity    Person; Organization
    Parser          NP; VP; PP
    Artifact        "Fred Center is the CEO of Center Micros" (e.g., a document)

    Chart by IBM

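
The layering in the CAS chart can be imitated in a few lines of Python. This is a conceptual sketch only: `Cas`, `Annotation`, and both annotator functions are hypothetical names invented here and are not UIMA's actual Java API. It shows one annotator adding entity annotations and a second annotator deriving a new type from them.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type_name: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a Common Analysis Structure: the artifact plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def entity_annotator(cas):
    # Hypothetical fixed lookup; a real annotator would use a model or dictionary.
    for type_name, surface in (("Person", "Fred Center"), ("Organization", "Center Micros")):
        start = cas.text.find(surface)
        if start >= 0:
            cas.annotations.append(Annotation(type_name, start, start + len(surface)))


def ceo_of_annotator(cas):
    # Derives a new type from existing ones: a Person, then "CEO of", then an Organization.
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            if "CEO of" in cas.text[person.end:org.begin]:
                cas.annotations.append(
                    Annotation("CeoOf", person.begin, org.end, {"arg1": person, "arg2": org})
                )


cas = Cas("Fred Center is the CEO of Center Micros")
for annotator in (entity_annotator, ceo_of_annotator):  # a two-step pipeline
    annotator(cas)

relation = cas.select("CeoOf")[0]
print(cas.covered_text(relation.features["arg1"]), "->", cas.covered_text(relation.features["arg2"]))
```

Because every annotator reads from and writes to the same shared structure, steps can be added or reordered without changing the artifact itself.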
UIMA




    [Image by IBM]

Commercial Tools

    ✤   Oracle Data Mining (Text Mining)

    ✤   IBM SPSS

    ✤   SAS Text Miner

    ✤   Smartlogic

    ✤   Lots of acquisitions going on in the “big data” space

         ✤    HP acquired Autonomy

         ✤    Oracle acquired Endeca

A Note on Tools

    ✤    UIMA and GATE – comprehensive suites of capabilities, each with a
         learning curve.

    ✤    Commercial tools range from unstructured capabilities inside DBMSs
         like Oracle, to Business Objects business intelligence tools (Business
         Objects acquired Inxight, a Xerox PARC spin-off).

    ✤    Your mileage will vary. The biggest differentiator is your knowledge
         of your data.




What can unstructured data look
    like post-processing?




Machine Processing


    [Diagram labels]
    Top row: Unstructured Data, Natural Language Processing, Statistical Analysis,
    Rules-based Classification, Semantic Analysis
    Middle: Machine Processing Platform, Federated Search, API, Index
    Bottom: Visualizations, Data Stores

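
Of the processing stages named in the diagram, rules-based classification is the easiest to sketch. The categories and keyword rules below are assumptions invented for illustration; real rule sets are usually built from a taxonomy and knowledge of your own data.

```python
import re

# Hypothetical categories and keyword rules.
RULES = {
    "finance": {"invoice", "budget", "payment"},
    "engineering": {"bug", "release", "deploy"},
    "hr": {"vacation", "hiring", "benefits"},
}


def classify(text, rules=RULES):
    """Return every category whose keywords appear in the text, best match first."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = {category: len(words & keywords) for category, keywords in rules.items()}
    # Most keyword hits first; ties broken alphabetically by category name.
    ordered = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, count in ordered if count > 0]


print(classify("The release is blocked by a payment bug; deploy after the fix"))
print(classify("Lunch at noon"))
```

A document can land in several categories or in none, which is often what you want before a statistical or semantic stage refines the result.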
Questions?




Thank you
     Christine Connors
     Kevin Lynch
     www.triviumrlg.com



