Towards Web-Scale Information Extraction
Eugene Agichtein, Mathematics & Computer Science, Emory University
[email_address] http://www.mathcs.emory.edu/~eugene/
The Value of Text Data
Example: Answering Queries Over Text (from William Cohen's IE tutorial, 2003)
Source text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted table PEOPLE:
Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | Founder | Free Soft..
Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte
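
The query on this slide can be tried directly once the extracted tuples sit in a relational table. The following is a minimal sketch, assuming an in-memory SQLite database; the helper name `names_at` is hypothetical, and the rows are hard-coded here rather than produced by a real extraction system.

```python
import sqlite3

# Rows as an extraction system might emit them from the passage on the slide.
# Table and column names follow the slide; the in-memory database is an assumption.
EXTRACTED_PEOPLE = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]


def names_at(organization, rows=EXTRACTED_PEOPLE):
    """Load extracted tuples into SQLite and return the names for one organization."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
        conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT Name FROM PEOPLE WHERE Organization = ? ORDER BY Name",
            (organization,),
        )
        return [name for (name,) in cursor]
    finally:
        conn.close()
```

Calling `names_at("Microsoft")` returns `['Bill Gates', 'Bill Veghte']`, matching the result shown on the slide.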
Outline
Information Extraction Tasks
Entity Tagging
Example: Extracting Entities from Text
Address: 4089 Whispering Pines Nobel Drive San Diego CA 92122
Address labels: House number (4089), Building (Whispering Pines), Road (Nobel Drive), City (San Diego), State (CA), Zip (92122)
Citation: Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002
Sequence | Segment(s_i) | Label(s_i)
S1 | Ronald Fagin | Author
S2 | Combining Fuzzy Information from Multiple Systems | Title
S3 | Proc. of ACM SIGMOD | Conference
S4 | 2002 | Year
Hand-Coded Methods
[IBM Avatar] ContactPattern ← RegularExpression(Email.body, "can be reached at")
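
A hand-coded rule like the ContactPattern example can be approximated with an ordinary regular expression. The sketch below is only an illustration: the trigger phrase comes from the slide, while capturing the phone number or e-mail address that follows it, and the names `CONTACT_PATTERN` and `find_contacts`, are assumptions rather than the IBM Avatar rule language.

```python
import re

# Hypothetical rule: the trigger phrase is from the slide; what gets captured
# after it (an e-mail address or a phone-like number) is an assumption.
CONTACT_PATTERN = re.compile(
    r"can be reached at\s+"
    r"(?P<contact>[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\+?\d[\d\s().-]{5,}\d)",
    re.IGNORECASE,
)


def find_contacts(email_body):
    """Return every contact string that directly follows the trigger phrase."""
    return [m.group("contact") for m in CONTACT_PATTERN.finditer(email_body)]
```

Rules of this kind are easy to read and debug, but each new phrasing ("you can call me on", "my number is") needs its own pattern, which is the usual argument for the learned methods on the next slide.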
Machine Learning Methods
Example sentence: "The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300." [From AliBaba]
Representation Models [Cohen and McCallum, 2003]
Any of these models can be used to capture words, formatting, or both. Running example: "Abraham Lincoln was born in Kentucky."
Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
Classify Pre-segmented Candidates: a classifier decides which class each candidate segment belongs to.
Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes.
Boundary Models: mark BEGIN and END boundaries.
Finite State Machines: most likely state sequence?
Context Free Grammars: most likely parse? (tags shown: NNP, V, P, NP, PP, VP, S)
… and beyond
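
Two of the representation models above, lexicon lookup and the sliding window, can be combined in a few lines. This is a minimal sketch under stated assumptions: the lexicon is a toy subset, the "classifier" is just a lexicon membership test standing in for a trained model, and the function names are hypothetical.

```python
# Toy lexicon; a real one would list every U.S. state, as the slide suggests.
STATE_LEXICON = {"alabama", "alaska", "kentucky", "new york", "wisconsin", "wyoming"}


def lexicon_classifier(tokens):
    """Stand-in for a trained classifier: label a window by lexicon membership."""
    return "STATE" if " ".join(tokens).lower() in STATE_LEXICON else None


def sliding_window_extract(sentence, classify=lexicon_classifier, max_window=3):
    """Slide windows of 1..max_window tokens over the sentence; keep labeled windows."""
    tokens = sentence.rstrip(".").split()
    found = []
    for size in range(1, max_window + 1):  # try alternate window sizes
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            label = classify(window)
            if label is not None:
                found.append((" ".join(window), label))
    return found
```

On the slide's running example, `sliding_window_extract("Abraham Lincoln was born in Kentucky.")` yields `[('Kentucky', 'STATE')]`. Swapping `classify` for a learned model changes the decision rule without touching the windowing loop.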
Popular Machine Learning Methods. For details: [Feldman, 2006 and Cohen, 2004]
Some Available Entity Taggers
Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
Outline
Relation Extraction Examples
Source text: May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…
DiseaseOutbreaks relation:
Date | Disease Name | Location
Jan. 1995 | Malaria | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia | U.S.
May 1995 | Ebola | Zaire
Source text: "We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex." [From AliBaba]
Extracted relations: CBF-A interacts with CBF-C; CBF-A and CBF-C form the CBF-A-CBF-C complex; CBF-B associates with the CBF-A-CBF-C complex.
Relation Extraction Approaches
Open Information Extraction [Banko et al., IJCAI 2007]
Event Extraction
Event Extraction: Integration Challenges
Other Information Extraction Tutorials
Summary: Accuracy of Extraction Tasks [Feldman, ICML 2006 tutorial]
Multilingual Information Extraction
Outline
Scaling Information Extraction to the Web
Scaling Up Information Extraction
Efficient Scanning for Information Extraction
[Figure labels: Text Database, Classifier, filtered, Extraction System, Output Tuples]
Exploiting Keyword and Phrase Indexes
Simple Strategy: Iterative Set Expansion
[Figure labels: Text Database, Extraction System, Output Tuples (e.g., <Malaria, Ethiopia>), Query Generation (e.g., [Ebola AND Zaire])]
Cost components: time for answering a query, time for retrieving a document, time for processing a document
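
The loop behind Iterative Set Expansion can be sketched as follows: start from seed tuples, turn each tuple into a keyword query, run the extraction system on the retrieved documents, and repeat with any new tuples. Everything concrete here is an assumption made for illustration: the three-document corpus, the "<Disease> outbreak in <Location>" rule, and the helper names `search` and `extract`.

```python
import re
from collections import deque

# Toy corpus standing in for the text database (contents invented for illustration).
DOCUMENTS = {
    "d1": "Ebola outbreak in Zaire; officials also report a Malaria outbreak in Ethiopia.",
    "d2": "Malaria outbreak in Ethiopia follows Cholera outbreak in Sudan.",
    "d3": "SARS outbreak in China is unrelated to the other reports.",
}

# Hypothetical extraction rule: "<Disease> outbreak in <Location>".
OUTBREAK = re.compile(r"([A-Z]\w+) outbreak in ([A-Z]\w+)")


def search(terms):
    """Keyword-index stand-in: ids of documents that contain every term."""
    return [doc_id for doc_id, text in DOCUMENTS.items() if all(t in text for t in terms)]


def extract(text):
    """Extraction-system stand-in: return the (disease, location) tuples in a text."""
    return set(OUTBREAK.findall(text))


def iterative_set_expansion(seed_tuples):
    """Query with known tuples, extract from retrieved documents, repeat until no new tuples."""
    known = set(seed_tuples)
    queue = deque(seed_tuples)
    processed_docs = set()
    while queue:
        disease, location = queue.popleft()
        for doc_id in search([disease, location]):  # e.g. [Ebola AND Zaire]
            if doc_id in processed_docs:
                continue
            processed_docs.add(doc_id)
            for tup in extract(DOCUMENTS[doc_id]):
                if tup not in known:
                    known.add(tup)
                    queue.append(tup)
    return known, processed_docs
```

Seeded with `("Ebola", "Zaire")`, this toy run processes d1 and d2 and finds three tuples, but never retrieves d3, so `("SARS", "China")` is missed. Only two of the three documents are processed, which is where the cost savings come from, at the price of recall.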
Reachability via Querying [Agichtein et al. 2003b]
t1 retrieves document d1 that contains t2.
Upper recall limit: determined by the size of the biggest connected component.
[Figure: reachability graph over tuples t1 to t5 and documents d1 to d5; example tuples: <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>]
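
The recall limit can be made concrete by building the reachability graph and walking it from a seed tuple. The sketch below is illustrative only: the `retrieves` and `contains` mappings are invented (the slide confirms just that t1 retrieves d1, which contains t2), and it follows directed edges from one seed rather than computing connected components.

```python
from collections import deque


def reachability_graph(retrieves, contains):
    """Add edge t -> t2 when a query built from t retrieves a document containing t2."""
    graph = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for t2 in contains.get(d, ()):
                if t2 != t:
                    graph[t].add(t2)
    return graph


def reachable_from(graph, seed):
    """Breadth-first search: every tuple that querying can reach from the seed."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical data: which documents each tuple's query retrieves,
# and which tuples each document contains.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3"], "d4": ["t4", "t5"], "d5": ["t5"]}
```

With this data, starting from t1 reaches only t1, t2, and t3, so recall from that seed cannot exceed 3 of 5 tuples no matter how many queries are issued.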
Reachability Graph for DiseaseOutbreaks (DiseaseOutbreaks, New York Times 1995)
Getting Around Reachability Limits
QXtract [Agichtein and Gravano 2003]
[Figure labels: User-Provided Seed Tuples, Seed Sampling, Information Extraction, Classifier Training, Query Generation, Queries]
Using Generic Indexes: Summary
Index Structures for Information Extraction
Bindings Engine (BE) [Cafarella and Etzioni 2005]
[Figure: neighbor index layout with fields #docs, docid, #posns, pos, #neighbors, blk_offset, and (neighbor, str) pairs; example neighbors: AdjT-left "such", NP-right "Philadelphia"; indexed terms shown: as, billy, cities, friendly, give, mayors, nickels, philadelphia, such, words]
Result: in document 19: "I love cities such as Philadelphia."
Related Approach: [K. Chakrabarti et al. 2006]
Workload-Driven Indexing [S. Chakrabarti et al. 2006]: Indexing Thousands of Entity Types
Workload-Driven Indexing (continued)
Parallelization/Adaptive Processing
IBM WebFountain [Gruhl et al. 2004]
UIMA (IBM Research)
Map/Reduce [Dean & Ghemawat, OSDI 2004]
Map/Reduce (continued)
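
As an illustration of how extraction fits the Map/Reduce model, the sketch below runs a map step per document and a reduce step per extracted tuple, all in a single process. It is not the Dean & Ghemawat implementation: the extraction rule, the function names, and the in-memory grouping that stands in for the distributed shuffle are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical extraction rule: "<Disease> outbreak in <Location>".
OUTBREAK = re.compile(r"([A-Z]\w+) outbreak in ([A-Z]\w+)")


def map_extract(doc_id, text):
    """Map: emit ((disease, location), doc_id) for each match in one document."""
    for disease, location in OUTBREAK.findall(text):
        yield (disease, location), doc_id


def reduce_support(key, doc_ids):
    """Reduce: collapse all emissions for one tuple into its supporting-document count."""
    return key, len(set(doc_ids))


def run_map_reduce(documents):
    """Single-process stand-in for the shuffle: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_extract(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_support(k, v) for k, v in grouped.items())
```

Because each `map_extract` call touches only one document and each `reduce_support` call touches only one key, both steps could be spread across machines without changing the functions themselves.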
Summary
References and Supplemental Materials

Editor's notes

  1. Check attribution. Lexicon: lookup. Classify candidates. Sliding window: when candidates are not known. Boundary model: window + classification in one. Finite state machine: for the complete path. Grammars.
  2. All of these provide a form of API for integration with other code.