SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Downloaden Sie, um offline zu lesen
Background
                                                 Methodology
                                                   Evaluation
                                                 Conclusions
                                                  References




                               LODIE
                Linked Open Data Information Extraction

                  Fabio Ciravegna                     Anna Lisa Gentile   Ziqi Zhang

                                 OAK Group, Department of Computer Science
                                       The University of Sheffield, UK


                                    9th October 2012
                          Semantic Web and Information Extraction
                                  SWAIE @ EKAW 2012



Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                         1 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


LODIE: Overview


   Linked Open Data for Web-scale Information Extraction
            Web-scale IE
                     number of documents, domains, facts
                     efficient and effective methods required
            Linked Open Data to seed learning
                 “[. . . ] a recommended best practice for exposing, sharing, and
                 connecting data [. . . ] using URIs and RDF”(linkeddata.org).
                     a large KB of typed instances, relations, annotations (e.g., RDFa)
            Adapting to specific user information needs
                     users define specific IE tasks by specifying the types of instances
                     and relations to be learnt



  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                          2 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Challenges: user Information needs



                  SoA defines a generic IE task - KnowItAll, StatSnowball,
                      PROSPERA, NELL, ExtremeExtraction [3, 4, 5, 11]
                      extracts “people, oragnisation, location" etc and their
                      generic relations



                    RQ how to let users define Web-IE tasks tailored to their
                       own needs - “drugs that treat hayfever"




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                3 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Challenges: training data




                  SoA requires certain amount of training/learning resources to
                      be manually specified



                    RQ how to automatically obtain these (and filter noise) from
                       the LOD




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                  4 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Challenges: learning strategies



                  SoA Typically semi-supervised bootstrapping based learning
                      from unstructured texts, prone to propagation of errors



                    RQ how to combine multi-strategy learning (e.g., from both
                       structured and unstructured contents) to avoid drifting
                       away from the learning task




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                 5 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Challenges: publication of triples




                  SoA No integration with existing KB



                    RQ how to integrate with LOD




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                  6 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


LODIE Architecture Overview




                                         Figure: Architecture diagram.
  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                         7 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


User needs formalisation

   Goal: Support users in formalising their information needs in a
   machine understandable format
            Hypothesis
                     Users define information needs in terms of ontologies
                     Users use different vocabularies in ontology creation

            Methods
                     Baseline: manually identify relevant ontologies on the LOD and
                     define a view on them using tools like neon-toolkit.org
                     Ontology Design Pattern: reuse existing Content ODP building
                     block and apply re-engineering patterns to bridge the
                     “vocabulary gap”



  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                      8 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Learning Seed Identification and Filtering - I
   Goal 1: Automatically identify training data in the forms of triples and
   annotations to seed learning
            Hypothesis:
                     LOD can already contain answers to user needs in the forms of
                     triples and annotations
                     The Web contains additional linguistic realisations of triples

            Method
                     From LOD - SPQRL queries to fetch seed triples (and
                     annotations) matching the user needs
                     From the Web - Search for linguistic realisations of triples
                     (identified above):
                             co-occurrence of related instances in textual contexts e.g.
                             sentences
                             structural elements e.g., tables

  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                           9 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Learning Seed Identification and Filtering - II
   Goal 2: Filter noisy training data and select the most informative
   examples for learning
            Hypothesis:
                     Identified learning seeds can contain noise (causing
                     “drifting-away”)
                     ... and can be redundant (causing unnecessary overheads)
                     good learning examples are consistent w.r.t. the learning task
                     and diverse.
            Method
                     Consistency measure - cluster seed instances of different classes
                     N times (varying parameters), does i always appear in the cluster
                     representing the same class?
                     Variability measure - cluster seed instances of the same class,
                     how many clusters are generated and how dense are they
  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                         10 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Multi-Strategy Learning

   Goal: Learning from different types (e.g., structured, unstructured)
   with different strategies to improve both recall and precision
            Hypothesis:
                     The same pieces of knowledge can be repeated in different forms,
                     e.g., in tables v.s. sentences (re-enforcing evidenece, Precision)
                     Some knowledge may be found only in one form or another
                     (Recall)

            Method: multi-strategy learning
                     Learning from structures such as tables and lists [10, 8]
                     Inducing wrappers for regular pages [7]
                     Lexical-syntactic pattern learning from free texts
                     Combine outputs from different processes


  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                          11 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Integration with the LOD



   Goal: Assign unique identifier to the extracted knowledge
            Hypothesis:
                     Knowledge that already exists in the LOD can be re-extracted and
                     must be integrated

            Method:
                     simple, scalable disambiguation methods, e.g., by feature
                     overlapping [2] and string distance metrics [9]




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                        12 / 19
Background
                                                  Methodology
                                                    Evaluation
                                                  Conclusions
                                                   References


User Feedback



  Goal: Integrate user’s feedback on learning
           Hypothesis:
                    Automatic extraction is imperfect and user’s feedback can help
                    improve learning

           Method:
                    Expose extracted knowledge via an interface and collect user
                    feedback - errors can be used as negative example for re-training




 Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                         13 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Evaluation




            suitability to formalise the user needs



            suitability of the approach to IE




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                  14 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Evaluation: suitability to formalise the user needs


   Task described in natural language –> equivalent IE task
            feasibility and efficiency (reasonable time, limited overhead)
            effectiveness
                     is result representative of the user needs?
                             users judge for description of task
                             users judge for resulting instances
                     is result suitable to seeding IE?
                             usefulness of triples in terms to learning using the proposed quality
                             measures




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                                     15 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Evaluation: suitability of the approach to IE


            definition of new task: population of sections of the schema.org
            ontology
                     precision
                     partial evaluation of recall
                             fraction of the available annotated instances
                             checking recall with respect to already available annotated
                             instances, not provided for training

            comparative large scale IE evaluations
                     TAC Knowledge Base Population [6]
                     TREC Entity Extraction task [1]




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                           16 / 19
Background
                                                   Methodology
                                                     Evaluation
                                                   Conclusions
                                                    References


Impact


   LODIE timeliness
            LOD: first very large-scale information resource available for IE
            covering for a growing number of domains
   LODIE output
            Web-scale IE task corpora, linked resources, etc.
            developed code available as open source under the MIT licence
            all the data generated will be made available using a licence such
            as Open Data Commons (opendatacommons.org)




  Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                 17 / 19
Background
                                                 Methodology
                                                   Evaluation
                                                 Conclusions
                                                  References


         Krisztian Balog and Pavel Serdyukov.
         Overview of the TREC 2010 Entity Track.
         In Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010). NIST, 2011.

         S Banerjee and T Pedersen.
         An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet.
         In CICLing ’02: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text
         Processing, pages 136–145, London, UK, 2002. Springer-Verlag.

         Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell.
         Toward an Architecture for Never-Ending Language Learning.
         pages 1306–1313.

         Oren Etzioni, Ana-maria Popescu, Daniel S Weld, Doug Downey, and Alexander Yates.
         Web-Scale Information Extraction in KnowItAll ( Preliminary Results ).
         pages 100–110, 2004.

         Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Gary Kratkiewicz, Nicolas Ward, and Ralph
         Weischedel.
         Extreme Extraction – Machine Reading in a Week.
         (1):1437–1446, 2011.

         Heng Ji and Ralph Grishman.
         Knowledge Base Population : Successful Approaches and Challenges.

         N Kushmerick.
         Wrapper Induction for information Extraction.
         In IJCAI97, 1997.

         Girija Limaye.

Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                                                                 18 / 19
Background
                                                 Methodology
                                                   Evaluation
                                                 Conclusions
                                                  References


         Annotating and Searching Web Tables Using Entities , Types and Relationships.

         Vanessa Lopez, Miriam Fernández, Nico Stieler, Enrico Motta, Walton Hall, Milton Keynes Mkaa, and United Kingdom.
         PowerAqua : supporting users in querying and exploring the Semantic Web content.

         David Milne and Ian H Witten.
         Learning to Link with Wikipedia.
         pages 509–518, 2007.

         Ndapandula Nakashole and Martin Theobald.
         Scalable knowledge harvesting with high precision and high recall.
         on Web search and data mining, (1955):227–236, 2011.




Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang                                                                               19 / 19

Weitere ähnliche Inhalte

Andere mochten auch

Exploring the similarity between Social Knowledge Sources and Twitter for Cro...
Exploring the similarity between Social Knowledge Sources and Twitter for Cro...Exploring the similarity between Social Knowledge Sources and Twitter for Cro...
Exploring the similarity between Social Knowledge Sources and Twitter for Cro...Andrea Varga
 
Volatile Classification of Point of Interests based on Social Activity Streams
Volatile Classification of Point of Interests based on Social Activity StreamsVolatile Classification of Point of Interests based on Social Activity Streams
Volatile Classification of Point of Interests based on Social Activity StreamsAmparo Elizabeth Cano Basave
 
Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Anna Lisa Gentile
 
Methodology and Campaign Design for the Evaluation of Semantic Search Tools
Methodology and Campaign Design for the Evaluation of Semantic Search ToolsMethodology and Campaign Design for the Evaluation of Semantic Search Tools
Methodology and Campaign Design for the Evaluation of Semantic Search ToolsStuart Wrigley
 
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...Stuart Wrigley
 
Asterid: Linked Data Asterisms
Asterid: Linked Data AsterismsAsterid: Linked Data Asterisms
Asterid: Linked Data AsterismsGregoire Burel
 
Smart Cities and E-governance
Smart Cities and E-governanceSmart Cities and E-governance
Smart Cities and E-governancesteveking1225
 
E governance
E governanceE governance
E governanceGoa App
 
Role of technology in SMART governance “Smart City, Safe City"
Role of technology in SMART governance “Smart City, Safe City"Role of technology in SMART governance “Smart City, Safe City"
Role of technology in SMART governance “Smart City, Safe City"KRITYANAND UNESCO CLUB Jamshedpur
 

Andere mochten auch (13)

Exploring the similarity between Social Knowledge Sources and Twitter for Cro...
Exploring the similarity between Social Knowledge Sources and Twitter for Cro...Exploring the similarity between Social Knowledge Sources and Twitter for Cro...
Exploring the similarity between Social Knowledge Sources and Twitter for Cro...
 
Volatile Classification of Point of Interests based on Social Activity Streams
Volatile Classification of Point of Interests based on Social Activity StreamsVolatile Classification of Point of Interests based on Social Activity Streams
Volatile Classification of Point of Interests based on Social Activity Streams
 
Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013
 
SEALS @ WWW2012
SEALS @ WWW2012SEALS @ WWW2012
SEALS @ WWW2012
 
Methodology and Campaign Design for the Evaluation of Semantic Search Tools
Methodology and Campaign Design for the Evaluation of Semantic Search ToolsMethodology and Campaign Design for the Evaluation of Semantic Search Tools
Methodology and Campaign Design for the Evaluation of Semantic Search Tools
 
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
 
Mapping Keywords to
Mapping Keywords to Mapping Keywords to
Mapping Keywords to
 
Asterid: Linked Data Asterisms
Asterid: Linked Data AsterismsAsterid: Linked Data Asterisms
Asterid: Linked Data Asterisms
 
e-governance in India
e-governance in Indiae-governance in India
e-governance in India
 
Smart Cities and E-governance
Smart Cities and E-governanceSmart Cities and E-governance
Smart Cities and E-governance
 
E governance
E governanceE governance
E governance
 
Role of technology in SMART governance “Smart City, Safe City"
Role of technology in SMART governance “Smart City, Safe City"Role of technology in SMART governance “Smart City, Safe City"
Role of technology in SMART governance “Smart City, Safe City"
 
E governance
E governanceE governance
E governance
 

Ähnlich wie Lodie overview

Programming Exercises Evaluation Systems: An Interoperability Survey
Programming Exercises Evaluation Systems: An Interoperability SurveyProgramming Exercises Evaluation Systems: An Interoperability Survey
Programming Exercises Evaluation Systems: An Interoperability SurveyRicardo Queirós
 
Supporting Emergence: Interaction Design for Visual Analytics Approach to ESDA
Supporting Emergence: Interaction Design for Visual Analytics Approach to ESDASupporting Emergence: Interaction Design for Visual Analytics Approach to ESDA
Supporting Emergence: Interaction Design for Visual Analytics Approach to ESDAJesse Lingeman
 
Dispersion, coordination and performance in GSD: a systematic review
Dispersion, coordination and performance in GSD: a systematic reviewDispersion, coordination and performance in GSD: a systematic review
Dispersion, coordination and performance in GSD: a systematic reviewAnh Nguyen Duc
 
CAA 2012 Closing Address
CAA 2012 Closing Address CAA 2012 Closing Address
CAA 2012 Closing Address jisc-elearning
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsInterpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsMathieu d'Aquin
 
Internet mediatedresearch edirisingha_draft
Internet mediatedresearch edirisingha_draftInternet mediatedresearch edirisingha_draft
Internet mediatedresearch edirisingha_draftPalitha Edirisingha
 
My fire st petersburg 27 june 2012 (d hladky)
My fire st petersburg 27 june 2012 (d hladky)My fire st petersburg 27 june 2012 (d hladky)
My fire st petersburg 27 june 2012 (d hladky)AI4BD GmbH
 
Making methods with vision APIs, online data & network building (lessons lear...
Making methods with vision APIs, online data & network building (lessons lear...Making methods with vision APIs, online data & network building (lessons lear...
Making methods with vision APIs, online data & network building (lessons lear...Janna Joceli Omena
 
Vii 4 Sh17 Sorathia
Vii 4 Sh17 SorathiaVii 4 Sh17 Sorathia
Vii 4 Sh17 SorathiaIESS
 
At ddatapres121211updated
At ddatapres121211updatedAt ddatapres121211updated
At ddatapres121211updatedrubiosv
 
Enriching SMW based Virtual Research Environments with external data, Jan Nov...
Enriching SMW based Virtual Research Environments with external data, Jan Nov...Enriching SMW based Virtual Research Environments with external data, Jan Nov...
Enriching SMW based Virtual Research Environments with external data, Jan Nov...KDZ - Zentrum für Verwaltungsforschung
 
Soeren okfn greece meetup
Soeren okfn greece meetupSoeren okfn greece meetup
Soeren okfn greece meetupOKFN-GR
 
Design Principles of Advanced Task Elicitation Systems
Design Principles of Advanced Task Elicitation SystemsDesign Principles of Advanced Task Elicitation Systems
Design Principles of Advanced Task Elicitation SystemsProf. Dr. Alexander Maedche
 
Learning Analytics - A New Discipline and Linked Data
Learning Analytics - A New Discipline and Linked DataLearning Analytics - A New Discipline and Linked Data
Learning Analytics - A New Discipline and Linked DataDragan Gasevic
 
Research Associate, LE
Research Associate, LEResearch Associate, LE
Research Associate, LEjanrudin
 

Ähnlich wie Lodie overview (20)

Programming Exercises Evaluation Systems: An Interoperability Survey
Programming Exercises Evaluation Systems: An Interoperability SurveyProgramming Exercises Evaluation Systems: An Interoperability Survey
Programming Exercises Evaluation Systems: An Interoperability Survey
 
Supporting Emergence: Interaction Design for Visual Analytics Approach to ESDA
Supporting Emergence: Interaction Design for Visual Analytics Approach to ESDASupporting Emergence: Interaction Design for Visual Analytics Approach to ESDA
Supporting Emergence: Interaction Design for Visual Analytics Approach to ESDA
 
Dispersion, coordination and performance in GSD: a systematic review
Dispersion, coordination and performance in GSD: a systematic reviewDispersion, coordination and performance in GSD: a systematic review
Dispersion, coordination and performance in GSD: a systematic review
 
CAA 2012 Closing Address
CAA 2012 Closing Address CAA 2012 Closing Address
CAA 2012 Closing Address
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsInterpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning Analytics
 
Internet mediatedresearch edirisingha_draft
Internet mediatedresearch edirisingha_draftInternet mediatedresearch edirisingha_draft
Internet mediatedresearch edirisingha_draft
 
My fire st petersburg 27 june 2012 (d hladky)
My fire st petersburg 27 june 2012 (d hladky)My fire st petersburg 27 june 2012 (d hladky)
My fire st petersburg 27 june 2012 (d hladky)
 
A2 annotation approach
A2 annotation approachA2 annotation approach
A2 annotation approach
 
Utrecht sb- gráinne conole
Utrecht  sb- gráinne conoleUtrecht  sb- gráinne conole
Utrecht sb- gráinne conole
 
Making methods with vision APIs, online data & network building (lessons lear...
Making methods with vision APIs, online data & network building (lessons lear...Making methods with vision APIs, online data & network building (lessons lear...
Making methods with vision APIs, online data & network building (lessons lear...
 
Vii 4 Sh17 Sorathia
Vii 4 Sh17 SorathiaVii 4 Sh17 Sorathia
Vii 4 Sh17 Sorathia
 
At ddatapres121211updated
At ddatapres121211updatedAt ddatapres121211updated
At ddatapres121211updated
 
Enriching SMW based Virtual Research Environments with external data, Jan Nov...
Enriching SMW based Virtual Research Environments with external data, Jan Nov...Enriching SMW based Virtual Research Environments with external data, Jan Nov...
Enriching SMW based Virtual Research Environments with external data, Jan Nov...
 
STI Summit 2011 - Visual analytics and linked data
STI Summit 2011 - Visual analytics and linked dataSTI Summit 2011 - Visual analytics and linked data
STI Summit 2011 - Visual analytics and linked data
 
Soeren okfn greece meetup
Soeren okfn greece meetupSoeren okfn greece meetup
Soeren okfn greece meetup
 
WRI Update: Governance of Forests Initiative (GFI)
WRI Update: Governance of Forests Initiative (GFI)WRI Update: Governance of Forests Initiative (GFI)
WRI Update: Governance of Forests Initiative (GFI)
 
Design Principles of Advanced Task Elicitation Systems
Design Principles of Advanced Task Elicitation SystemsDesign Principles of Advanced Task Elicitation Systems
Design Principles of Advanced Task Elicitation Systems
 
Learning Analytics - A New Discipline and Linked Data
Learning Analytics - A New Discipline and Linked DataLearning Analytics - A New Discipline and Linked Data
Learning Analytics - A New Discipline and Linked Data
 
Research Associate, LE
Research Associate, LEResearch Associate, LE
Research Associate, LE
 
UX Research
UX ResearchUX Research
UX Research
 

Lodie overview

  • 1. Background Methodology Evaluation Conclusions References LODIE Linked Open Data Information Extraction Fabio Ciravegna Anna Lisa Gentile Ziqi Zhang OAK Group, Department of Computer Science The University of Sheffield, UK 9th October 2012 Semantic Web and Information Extraction SWAIE @ EKAW 2012 Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 1 / 19
  • 2. Background Methodology Evaluation Conclusions References LODIE: Overview Linked Open Data for Web-scale Information Extraction Web-scale IE number of documents, domains, facts efficient and effective methods required Linked Open Data to seed learning “[. . . ] a recommended best practice for exposing, sharing, and connecting data [. . . ] using URIs and RDF”(linkeddata.org). a large KB of typed instances, relations, annotations (e.g., RDFa) Adapting to specific user information needs users define specific IE tasks by specifying the types of instances and relations to be learnt Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 2 / 19
  • 3. Background Methodology Evaluation Conclusions References Challenges: user Information needs SoA defines a generic IE task - KnowItAll, StatSnowball, PROSPERA, NELL, ExtremeExtraction [3, 4, 5, 11] extracts “people, oragnisation, location" etc and their generic relations RQ how to let users define Web-IE tasks tailored to their own needs - “drugs that treat hayfever" Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 3 / 19
  • 4. Background Methodology Evaluation Conclusions References Challenges: training data SoA requires certain amount of training/learning resources to be manually specified RQ how to automatically obtain these (and filter noise) from the LOD Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 4 / 19
  • 5. Background Methodology Evaluation Conclusions References Challenges: learning strategies SoA Typically semi-supervised bootstrapping based learning from unstructured texts, prone to propagation of errors RQ how to combine multi-strategy learning (e.g., from both structured and unstructured contents) to avoid drifting away from the learning task Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 5 / 19
  • 6. Background Methodology Evaluation Conclusions References Challenges: publication of triples SoA No integration with existing KB RQ how to integrate with LOD Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 6 / 19
  • 7. Background Methodology Evaluation Conclusions References LODIE Architecture Overview Figure: Architecture diagram. Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 7 / 19
  • 8. Background Methodology Evaluation Conclusions References User needs formalisation Goal: Support users in formalising their information needs in a machine understandable format Hypothesis Users define information needs in terms of ontologies Users use different vocabularies in ontology creation Methods Baseline: manually identify relevant ontologies on the LOD and define a view on them using tools like neon-toolkit.org Ontology Design Pattern: reuse existing Content ODP building block and apply re-engineering patterns to bridge the “vocabulary gap” Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 8 / 19
  • 9. Background Methodology Evaluation Conclusions References Learning Seed Identification and Filtering - I Goal 1: Automatically identify training data in the forms of triples and annotations to seed learning Hypothesis: LOD can already contain answers to user needs in the forms of triples and annotations The Web contains additional linguistic realisations of triples Method From LOD - SPQRL queries to fetch seed triples (and annotations) matching the user needs From the Web - Search for linguistic realisations of triples (identified above): co-occurrence of related instances in textual contexts e.g. sentences structural elements e.g., tables Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 9 / 19
  • 10. Background Methodology Evaluation Conclusions References Learning Seed Identification and Filtering - II Goal 2: Filter noisy training data and select the most informative examples for learning Hypothesis: Identified learning seeds can contain noise (causing “drifting-away”) ... and can be redundant (causing unnecessary overheads) good learning examples are consistent w.r.t. the learning task and diverse. Method Consistency measure - cluster seed instances of different classes N times (varying parameters), does i always appear in the cluster representing the same class? Variability measure - cluster seed instances of the same class, how many clusters are generated and how dense are they Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 10 / 19
  • 11. Background Methodology Evaluation Conclusions References Multi-Strategy Learning Goal: Learning from different types (e.g., structured, unstructured) with different strategies to improve both recall and precision Hypothesis: The same pieces of knowledge can be repeated in different forms, e.g., in tables v.s. sentences (re-enforcing evidenece, Precision) Some knowledge may be found only in one form or another (Recall) Method: multi-strategy learning Learning from structures such as tables and lists [10, 8] Inducing wrappers for regular pages [7] Lexical-syntactic pattern learning from free texts Combine outputs from different processes Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 11 / 19
  • 12. Background Methodology Evaluation Conclusions References Integration with the LOD Goal: Assign unique identifier to the extracted knowledge Hypothesis: Knowledge that already exists in the LOD can be re-extracted and must be integrated Method: simple, scalable disambiguation methods, e.g., by feature overlapping [2] and string distance metrics [9] Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 12 / 19
  • 13. Background Methodology Evaluation Conclusions References User Feedback Goal: Integrate user’s feedback on learning Hypothesis: Automatic extraction is imperfect and user’s feedback can help improve learning Method: Expose extracted knowledge via an interface and collect user feedback - errors can be used as negative example for re-training Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 13 / 19
  • 14. Background Methodology Evaluation Conclusions References Evaluation suitability to formalise the user needs suitability of the approach to IE Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 14 / 19
  • 15. Background Methodology Evaluation Conclusions References Evaluation: suitability to formalise the user needs Task described in natural language –> equivalent IE task feasibility and efficiency (reasonable time, limited overhead) effectiveness is result representative of the user needs? users judge for description of task users judge for resulting instances is result suitable to seeding IE? usefulness of triples in terms to learning using the proposed quality measures Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 15 / 19
  • 16. Background Methodology Evaluation Conclusions References Evaluation: suitability of the approach to IE definition of new task: population of sections of the schema.org ontology precision partial evaluation of recall fraction of the available annotated instances checking recall with respect to already available annotated instances, not provided for training comparative large scale IE evaluations TAC Knowledge Base Population [6] TREC Entity Extraction task [1] Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 16 / 19
  • 17. Background Methodology Evaluation Conclusions References Impact LODIE timeliness LOD: first very large-scale information resource available for IE covering for a growing number of domains LODIE output Web-scale IE task corpora, linked resources, etc. developed code available as open source under the MIT licence all the data generated will be made available using a licence such as Open Data Commons (opendatacommons.org) Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 17 / 19
  • 18. Background Methodology Evaluation Conclusions References Krisztian Balog and Pavel Serdyukov. Overview of the TREC 2010 Entity Track. In Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010). NIST, 2011. S Banerjee and T Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In CICLing ’02: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag. Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. Toward an Architecture for Never-Ending Language Learning. pages 1306–1313. Oren Etzioni, Ana-maria Popescu, Daniel S Weld, Doug Downey, and Alexander Yates. Web-Scale Information Extraction in KnowItAll ( Preliminary Results ). pages 100–110, 2004. Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Gary Kratkiewicz, Nicolas Ward, and Ralph Weischedel. Extreme Extraction – Machine Reading in a Week. (1):1437–1446, 2011. Heng Ji and Ralph Grishman. Knowledge Base Population : Successful Approaches and Challenges. N Kushmerick. Wrapper Induction for information Extraction. In IJCAI97, 1997. Girija Limaye. Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 18 / 19
  • 19. Background Methodology Evaluation Conclusions References Annotating and Searching Web Tables Using Entities , Types and Relationships. Vanessa Lopez, Miriam Fernández, Nico Stieler, Enrico Motta, Walton Hall, Milton Keynes Mkaa, and United Kingdom. PowerAqua : supporting users in querying and exploring the Semantic Web content. David Milne and Ian H Witten. Learning to Link with Wikipedia. pages 509–518, 2007. Ndapandula Nakashole and Martin Theobald. Scalable knowledge harvesting with high precision and high recall. on Web search and data mining, (1955):227–236, 2011. Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 19 / 19