SYMPOSIUM ON BIAS AND DIVERSITY IN IR
A TESTBED FOR DIVERSIFICATION IN SEARCH
Koblenz, August 31, 2011
Michael Matthews, Barcelona Media/Yahoo! Research

OVERVIEW
Introduction to LivingKnowledge
Testbed - The Diversity Engine
Getting started - Our first application!
Adding text analysis
Adding multimedia analysis
Evaluation
Indexing and search
Developing applications
Future work

DIVERSITY ENGINE
Provides collections, annotation tools and an evaluation framework to allow for collaborative and comparable research
Supports indexing and searching on a wide variety of document annotations including entities, bias, trust, polarity, and multimedia features
Supports development of bias- and diversity-aware applications

ARCHITECTURE
Document Collections (NYT, Yahoo! News, ARC Crawls)
Analysis Pipeline
Index/Search
Application Development
Evaluation Framework

DESIGN DECISIONS
Use Open Source tools when available
Programming Language - Java 1.6
Data format - LK XML
Analysis tools: Operating System - Linux (any programming language)
Indexing/Search - Solr
GUI - JSP, HTML, JavaScript, CSS

LK-XML format.

DOCUMENT COLLECTIONS
Supported formats: ARC (Internet Memory crawls), Text, HTML, Kyoto, BBN, NYT
Collections
  Testing examples included with Diversity Engine
  Large ARCs available from Internet Memory
  Converters provided for other collections (MPQA, BBN, NYT) that have licensing restrictions

ANALYSIS MODULES

INDEXING/SEARCH
Solr
  Enterprise search platform built on top of Lucene
  XML input and output allows for easy integration with Diversity Engine
  Plug-in framework allows customization
  Built-in facet capabilities support indexing and searching on annotations
Integration
  Converter from LK XML to Solr XML
  Plug-in for facet ranking and speed improvements

APPLICATION DEVELOPMENT
Future Predictor
Media Content Analysis
Support development – coding required!
Real World Problems
HTML Extraction
Scaling to Large Collections
Provenance
Some pluggable GUI components
Examples to ease learning curve

EVALUATION FRAMEWORK
Evaluates any possible annotation pipeline
Measures correctness and quality
Outputs Precision + Recall
Compares annotation output of pipeline with ground truth data

OUR FIRST APPLICATION
Download Diversity Engine release from SourceForge
  tar xzvf [release file]
  cd testbed
  ant build
  apps/testbed conf/testbed/tutorial-application.xml
What happened?
  197 text files and 127 image files converted from ARC format to LK XML and stored in devapps/example/data/lkxml
  2 annotators were run over the collection
    OpenNLP for tokenization, sentence splitting, POS tags
    SST named entity recognizer
    Results stored in devapps/example/data/lkxml
  Files were converted to Solr XML format and indexed using Solr
    Solr XML stored to devapps/example/data/solr
  HTML visualization
    Files stored in devapps/example/data/html
ant deploy-testbed
  Solr running at http://localhost:8983/solr/
  Example app running at http://localhost:8983/testbed/

EXAMPLE SOLR OUTPUT
http://localhost:8983/solr/select/?q=putin

EXAMPLE APPLICATION
http://localhost:8983/testbed/results.jsp?query=putin

EXAMPLE DOCUMENT

CONFIGURATION FILE
<lk-application logDir="log" appDir="devapps/example">
	<corpus dir="corpora/examples/smallarc" format="arc"/>
	<image-pipeline>
		<annotators>
		</annotators>
	</image-pipeline>
	<pipeline>
		<annotators>
			<annotator exec="./opennlp"/>
			<annotator exec="./sst"/>
		</annotators>
	</pipeline>
	<visualize/>
	<indexer solrHomeDir="solr/solr"
		solrDataDir="solr/solr/data"
		converter="conf/testbed/tutorial-lk2solr.xml"/>
	<searcher appTitle="LivingKnowledge - Example Application" appShortTitle="Example Application" appUrl="http://localhost:8983/solr/">
		<facets>
			<facet field="per" description="Person"/>
			<facet field="loc" description="Location"/>
		</facets>
	</searcher>
</lk-application>

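The facets declared in the configuration can also be requested directly from Solr. The following is a minimal sketch, assuming that `per` and `loc` are the names of the indexed Solr fields (the slides only show them as facet fields in the testbed configuration). It uses Solr's standard `facet=true` and `facet.field` request parameters; the class name `SolrFacetQuery` is hypothetical and not part of the Diversity Engine.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Diversity Engine.
class SolrFacetQuery {

    /** Builds a Solr select URL that also asks for facet counts on the given fields. */
    static String buildUrl(String solrBase, String query, List<String> facetFields) {
        try {
            StringBuilder url = new StringBuilder(solrBase);
            if (!solrBase.endsWith("/")) {
                url.append('/');
            }
            url.append("select?q=").append(URLEncoder.encode(query, "UTF-8"));
            url.append("&facet=true");
            for (String field : facetFields) {
                // Assumption: the facet field names match the indexed Solr field names.
                url.append("&facet.field=").append(URLEncoder.encode(field, "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr/", "putin",
                Arrays.asList("per", "loc")));
    }
}
```

Opening the printed URL in a browser while the tutorial Solr instance is running should return the matching documents together with a facet section, provided the field names match the index.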
TEXT ANALYSIS
Extend the text pipeline in the configuration file from
	<pipeline>
		<annotators>
			<annotator exec="./opennlp"/>
			<annotator exec="./sst"/>
		</annotators>
	</pipeline>
to
	<pipeline>
		<annotators>
			<annotator exec="./opennlp"/>
			<annotator exec="./sst"/>
			<annotator exec="./facts"/>
			<annotator exec="./unitn_tagger"/>
			<annotator exec="./unitn_subjexpr"/>
		</annotators>
	</pipeline>
Then rerun the pipeline and the visualization
  apps/testbed –run pipeline conf/testbed/tutorial-application.xml
  apps/testbed –run visualization conf/testbed/tutorial-application.xml

TEXT ANALYSIS - FACTS
devapps/example/data/lkxml/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.facts.xml

TEXT ANALYSIS - FACTS
devapps/example/data/html/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.html

IMAGE ANALYSIS
	<image-pipeline>
		<annotators>
			<annotator exec="./soton_haarfacedetector"/>
		</annotators>
	</image-pipeline>
	<pipeline>
		<annotators>
			<annotator exec="./opennlp"/>
			<annotator exec="./sst"/>
			<annotator exec="./facts"/>
			<annotator exec="./unitn_tagger"/>
			<annotator exec="./unitn_subjexpr"/>
			<annotator exec="./imageannots"/>
		</annotators>
	</pipeline>
apps/testbed –run pipeline,image-pipeline –pipeline imageannots conf/testbed/tutorial-application.xml
ls devapps/example/data/lkxml/img/*

ANALYSIS API
Documents in LK XML format
Annotators are passed a single document directory. They should add annotations for each document in the directory
Files will have a consistent naming convention
  LkText file = id + ".lktext.xml"
  LkMedia = id + ".lkmedia.xml"
  LkAnnotation = id + "." + annotatorId + ".xml"
Annotators will be processed sequentially in the order listed in the XML file
Annotators can be written in any language but must run on Linux. Helper classes will exist for Java, but there is no obligation to use them.
Add application calling your new annotator to apps directory
Add your application to the configuration file as before

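The naming convention above is enough to locate inputs and outputs without the Java helper classes. The following is a minimal sketch of that convention only; the class `LkFileNames` and the annotator id `myannotator` are hypothetical, and the sketch does not read or write LK XML itself.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper illustrating the file naming convention; not part of the Diversity Engine.
class LkFileNames {
    private static final String TEXT_SUFFIX = ".lktext.xml";

    static String textFile(String id) {
        return id + TEXT_SUFFIX;
    }

    static String mediaFile(String id) {
        return id + ".lkmedia.xml";
    }

    static String annotationFile(String id, String annotatorId) {
        return id + "." + annotatorId + ".xml";
    }

    /** Recovers the sorted document ids from the LkText files found in a directory listing. */
    static List<String> documentIds(String[] fileNames) {
        List<String> ids = new ArrayList<String>();
        for (String name : fileNames) {
            if (name.endsWith(TEXT_SUFFIX)) {
                ids.add(name.substring(0, name.length() - TEXT_SUFFIX.length()));
            }
        }
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: LkFileNames <document directory>");
            return;
        }
        String[] names = new File(args[0]).list();
        if (names == null) {
            System.err.println("not a readable directory: " + args[0]);
            return;
        }
        // For each document, print the file a (hypothetical) annotator "myannotator" would write.
        for (String id : documentIds(names)) {
            System.out.println(textFile(id) + " -> " + annotationFile(id, "myannotator"));
        }
    }
}
```

An annotator written in another language would follow the same pattern: list the directory, derive each id from its LkText file, and write one annotation file per id.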
ANALYSIS API - JAVA
Extend class org.diversityengine.annotator.AbstractAnnotator
Implement methods
  getName()
  getType() - TEXT or IMAGE
  For image analysis, implement LkAnnotation getLkAnnotation(ImageDocument document)
  For text analysis, implement LkAnnotation getLkAnnotation(TextDocument document)
In main, instantiate and call the annotator
  NewAnnotator annotator = new NewAnnotator();
  annotator.processDirectory(args[0]);
Add application calling your new annotator to apps directory
Add your application to the configuration file as before

EVALUATION
Evaluation works with the same configuration file. Simply add an evaluation element
<lk-application logDir="log" appDir="devapps/evaluation">
	<corpus dir="corpora/evaluation/sst/text/" format="bbn"/>
	<pipeline>
		<annotators>
			<annotator exec="./sst"/>
		</annotators>
	</pipeline>
	<evaluation evalDir="evaluation/sst/">
		<evaluator provides="ENTITIES" goldDir="corpora/evaluation/sst/gold/" goldAnnotator="sstgold" annotator="sst" />
	</evaluation>
</lk-application>
apps/testbed conf/evaluation/sst.xml

EVALUATION RESULTS
cat evaluation/sst/sst.ENTITIES.xml
<evaluation goldDir="/home/mikemat/code/livingknowledge/WP6/testbed/corpora/evaluation/sst/gold/" lkDir="/home/mikemat/code/livingknowledge/WP6/testbed/devapps/evaluation/data/lkxml" annotation="sst" goldAnnotation="sstgold" provides="ENTITIES">
	<docs>
		<doc id="WSJ0375" N="19" tp="18" fp="1" fn="1" />
		<doc id="WSJ0380" N="19" tp="15" fp="4" fn="1" />
		<doc id="WSJ0376" N="72" tp="61" fp="11" fn="7" />
		<doc id="WSJ0377" N="26" tp="17" fp="9" fn="6" />
		<doc id="WSJ0378" N="10" tp="10" fp="0" fn="0" />
		<doc id="WSJ0379" N="24" tp="19" fp="5" fn="2" />
	</docs>
	<totals N="170" tp="140" fp="30" fn="17" p="0.8235294117647058" r="0.89171974522293" f="0.8562691131498471" />
</evaluation>

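The totals row is consistent with the standard definitions of precision, recall and F-measure computed from the summed true positives, false positives and false negatives. The following sketch reproduces those numbers; the class `EntityScores` is hypothetical and is not the evaluator shipped with the testbed.

```java
// Hypothetical helper reproducing the totals row; not the testbed's own evaluator.
class EntityScores {

    /** Fraction of predicted annotations that are correct: tp / (tp + fp). */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Fraction of gold annotations that were found: tp / (tp + fn). */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Totals from the example output: tp=140, fp=30, fn=17.
        double p = precision(140, 30);
        double r = recall(140, 17);
        System.out.println("p=" + p + " r=" + r + " f=" + f1(p, r));
    }
}
```

Note that in this output N equals tp + fp for every document, so N appears to count the annotations produced by the pipeline rather than the gold annotations.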
INDEXING AND SEARCH
Search engines - traditional
  Bag-of-words representation
  Inverted index (words -> documents) for efficiency
  10 docs ranked according to tf-idf similarity with query
Search engines - today
  Much metadata associated with documents
  Ranking based on 100s of features (date, location, PageRank, click data, personalization, etc.)
  Richer display
    Facets for exploratory search
    Answers when appropriate
    etc.
Many open source options - Lucene/Solr most widely used

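The traditional model can be illustrated in a few lines. The following is a toy sketch of an inverted index with a simple tf × idf ranking; it uses a smoothed idf of log(1 + N/df) and is not Lucene's scoring formula. The class `TinyInvertedIndex` and its tokenization (lowercasing and splitting on non-word characters) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only; Lucene/Solr use far more elaborate analysis and scoring.
class TinyInvertedIndex {
    // term -> (document id -> term frequency in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    /** Indexes a document as a bag of words and returns its id. */
    int add(String text) {
        int docId = docCount++;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Returns up to limit document ids, best match first. */
    List<Integer> search(String query, int limit) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String token : query.toLowerCase().split("\\W+")) {
            Map<Integer, Integer> docs = postings.get(token);
            if (docs == null) {
                continue;
            }
            // Rare terms get a higher weight than terms found in many documents.
            double idf = Math.log(1.0 + (double) docCount / docs.size());
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        int end = Math.max(0, Math.min(limit, ranked.size()));
        return new ArrayList<>(ranked.subList(0, end));
    }
}
```

Only the postings of the query terms are touched at search time, which is the efficiency gain the inverted index provides over scanning every document.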
APACHE LUCENE/SOLR

FACETED SEARCH (diagram by Yonik Seeley)

FACETED SEARCH
price ranges for product query
related people or locations for news query
Exploratory Search
Show documents that match the query term and a selected facet
Make inferences not clear from simple document list
Living Knowledge Analysis is modeled very well by facets
Topics as determined by entity and fact extraction
Location and Time diversity dimensions
Opinions as determined by opinion extraction
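The facet idea can be sketched independently of Solr: count the values of an annotation field over the documents matching a query, then narrow the result set when the user selects one value. The following is a toy illustration, assuming documents reduced to `per` (person) and `loc` (location) annotation lists; the class `FacetCounter` and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of facet counting and drill-down; Solr does this inside the index.
class FacetCounter {

    /** Builds a document reduced to its person and location annotations. */
    static Map<String, List<String>> doc(List<String> persons, List<String> locations) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("per", persons);
        d.put("loc", locations);
        return d;
    }

    /** Counts how many documents carry each value of the given field. */
    static Map<String, Integer> counts(List<Map<String, List<String>>> docs, String field) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, List<String>> d : docs) {
            List<String> values = d.get(field);
            if (values == null) {
                continue;
            }
            for (String v : values) {
                result.merge(v, 1, Integer::sum);
            }
        }
        return result;
    }

    /** Keeps only the documents whose field contains the selected facet value. */
    static List<Map<String, List<String>>> filter(List<Map<String, List<String>>> docs,
                                                  String field, String value) {
        List<Map<String, List<String>>> kept = new ArrayList<>();
        for (Map<String, List<String>> d : docs) {
            List<String> values = d.get(field);
            if (values != null && values.contains(value)) {
                kept.add(d);
            }
        }
        return kept;
    }
}
```

Calling `counts` on the result of `filter` gives the facet values that remain after a selection, which is what lets a reader see associations that a flat result list hides.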

LK XML TO SOLR

ESSIR LivingKnowledge DiversityEngine tutorial

  • 1. SYMPOSIUM ON BIAS AND DIVERSITY IN IRA TESTBED FOR DIVERSIFICATON IN SEARCH Koblenz, August 31, 2011 Michael Matthews, Barcelona Media/Yahoo! Research 1
  • 2. OVERVIEW Introduction to LivingKnowledge; Testbed – The Diversity Engine; Getting started – Our first application!; Adding text analysis; Adding multimedia analysis; Evaluation; Indexing and search; Developing applications; Future work
  • 3. DIVERSITY ENGINE Provides collections, annotation tools and an evaluation framework to allow for collaborative and comparable research. Supports indexing and searching on a wide variety of document annotations including entities, bias, trust, polarity, and multimedia features. Supports development of bias- and diversity-aware applications.
  • 4. ARCHITECTURE Document Collections; Analysis Pipeline; Index/Search; Application Development; NYT; Yahoo! News; ARC Crawls; Evaluation Framework
  • 5. DESIGN DECISIONS Use open source tools when available. Programming language – Java 1.6. Data format – LK XML. Analysis tools: operating system – Linux (any software language). Indexing/Search – Solr. GUI – JSP, HTML, JavaScript, CSS.
  • 7. DOCUMENT COLLECTIONS Supported formats: ARC (Internet Memory crawls), text, HTML; Kyoto, BBN, NYT. Collections: testing examples included with Diversity Engine; large ARCs available from Internet Memory; converters provided for other collections (MPQA, BBN, NYT) that have licensing restrictions.
  • 9. INDEXING/SEARCH Solr: enterprise search platform built on top of Lucene; XML input and output allows for easy integration with Diversity Engine; plug-in framework allows customization; built-in facet capabilities support indexing and searching on annotations. Integration: converter from LK XML to Solr XML; plug-in for facet ranking and speed improvements.
  • 13. Support development – coding required!
  • 16. Scaling to Large Collections
  • 18. Some pluggable GUI components
  • 19. Examples to ease learning curve
  • 23. Evaluates any possible annotation pipeline
  • 26. Compares annotation output of pipeline with ground truth data
  • 27. OUR FIRST APPLICATION Download the Diversity Engine release from SourceForge, then run: tar xzvf [release file]; cd testbed; ant build; apps/testbed conf/testbed/tutorial-application.xml. What happened? 197 text files and 127 image files were converted from ARC format to LK XML and stored in devapps/example/data/lkxml. 2 annotators were run over the collection: OpenNLP for tokenization, sentence splitting and POS tags; the SST named entity recognizer. Results were stored in devapps/example/data/lkxml. Files were converted to Solr XML format and indexed using Solr. Solr XML was stored in devapps/example/data/solr. HTML visualization files were stored in devapps/example/data/html. Then run: ant deploy-testbed. Solr runs at http://localhost:8983/solr/ and the example app runs at http://localhost:8983/testbed/
  • 28. EXAMPLE SOLR OUTPUT http://localhost:8983/solr/select/?q=putin
  • 29. EXAMPLE APPLICATION http://localhost:8983/testbed/results.jsp?query=putin
  • 31. CONFIGURATION FILE <lk-application logDir="log" appDir="devapps/example"> <corpus dir="corpora/examples/smallarc" format="arc"/> <image-pipeline> <annotators> </annotators> </image-pipeline> <pipeline> <annotators> <annotator exec="./opennlp"/> <annotator exec="./sst"/> </annotators> </pipeline> <visualize/> <indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data" converter="conf/testbed/tutorial-lk2solr.xml"/> <searcher appTitle="LivingKnowledge - Example Application" appShortTitle="Example Application" appUrl="http://localhost:8983/solr/"> <facets> <facet field="per" description="Person"/> <facet field="loc" description="Location"/> </facets> </searcher> </lk-application>
  • 32. TEXT ANALYSIS <pipeline> <annotators> <annotator exec="./opennlp"/> <annotator exec="./sst"/> </annotators> </pipeline> <pipeline> <annotators> <annotator exec="./opennlp"/> <annotator exec="./sst"/> <annotator exec="./facts"/> <annotator exec="./unitn_tagger"/> <annotator exec="./unitn_subjexpr"/> </annotators> </pipeline> apps/testbed –run pipeline conf/testbed/tutorial-application.xml apps/testbed –run visualization conf/testbed/tutorial-application.xml
  • 33. TEXT ANALYSIS - FACTS devapps/example/data/lkxml/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.facts.xml
  • 34. TEXT ANALYSIS - FACTS devapps/example/data/html/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.html
  • 35. IMAGE ANALYSIS <image-pipeline> <annotators> <annotator exec="./soton_haarfacedetector"/> </annotators> </image-pipeline> <pipeline> <annotators> <annotator exec="./opennlp"/> <annotator exec="./sst"/> <annotator exec="./facts"/> <annotator exec="./unitn_tagger"/> <annotator exec="./unitn_subjexpr"/> <annotator exec="./imageannots"/> </annotators> </pipeline> apps/testbed –run pipeline,image-pipeline –pipeline imageannots conf/testbed/tutorial-application.xml ls devapps/example/data/lkxml/img/*
  • 36. ANALYSIS API Documents are in LK XML format. Annotators are passed a single document directory and should add annotations for each document in that directory. Files have a consistent naming convention: LkText file = id + ".lktext.xml"; LkMedia = id + ".lkmedia.xml"; LkAnnotation = id + "." + annotatorId + ".xml". Annotators are processed sequentially in the order listed in the XML file. Annotators can be written in any language but must run on Linux; helper classes exist for Java, but there is no obligation to use them. Add the application calling your new annotator to the apps directory. Add your application to the configuration file as before.
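As a small illustration of the naming convention on this slide, the following sketch builds the three file names from a document id. The class `LkFileNames` and its method names are hypothetical and are not part of the Diversity Engine helper classes; only the suffix patterns come from the slide.

```java
// Hypothetical helper: class and method names are illustrative only,
// not part of the Diversity Engine API. The suffixes follow the slide.
class LkFileNames {

    // LkText file = id + ".lktext.xml"
    static String textFile(String id) {
        return id + ".lktext.xml";
    }

    // LkMedia file = id + ".lkmedia.xml"
    static String mediaFile(String id) {
        return id + ".lkmedia.xml";
    }

    // LkAnnotation file = id + "." + annotatorId + ".xml"
    static String annotationFile(String id, String annotatorId) {
        return id + "." + annotatorId + ".xml";
    }

    public static void main(String[] args) {
        String id = args.length > 0 ? args[0] : "doc1";
        System.out.println(textFile(id));
        System.out.println(mediaFile(id));
        System.out.println(annotationFile(id, "facts"));
    }
}
```

Under this convention, an annotator with id facts writes its output for a document next to the LkText file, which matches the .facts.xml file shown on the TEXT ANALYSIS - FACTS slide.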
  • 37. ANALYSIS API – JAVA Extend class org.diversityengine.annotator.AbstractAnnotator. Implement the methods getName() and getType() (TEXT or IMAGE). For image analysis implement LkAnnotation getLkAnnotation(ImageDocument document). For text analysis implement LkAnnotation getLkAnnotation(TextDocument document). In main, instantiate and call the annotator: NewAnnotator annotator = new NewAnnotator(); annotator.processDirectory(args[0]); Add the application calling your new annotator to the apps directory. Add your application to the configuration file as before.
  • 38. EVALUATION Evaluation works with the same configuration file: simply add an evaluation element. <lk-application logDir="log" appDir="devapps/evaluation"> <corpus dir="corpora/evaluation/sst/text/" format="bbn"/> <pipeline> <annotators> <annotator exec="./sst"/> </annotators> </pipeline> <evaluation evalDir="evaluation/sst/"> <evaluator provides="ENTITIES" goldDir="corpora/evaluation/sst/gold/" goldAnnotator="sstgold" annotator="sst" /> </evaluation> </lk-application> apps/testbed conf/evaluation/sst.xml
  • 39. EVALUATION RESULTS <evaluation goldDir="/home/mikemat/code/livingknowledge/WP6/testbed/corpora/evaluation/sst/gold/" lkDir="/home/mikemat/code/livingknowledge/WP6/testbed/devapps/evaluation/data/lkxml" annotation="sst" goldAnnotation="sstgold" provides="ENTITIES"> <docs> <doc id="WSJ0375" N="19" tp="18" fp="1" fn="1" /> <doc id="WSJ0380" N="19" tp="15" fp="4" fn="1" /> <doc id="WSJ0376" N="72" tp="61" fp="11" fn="7" /> <doc id="WSJ0377" N="26" tp="17" fp="9" fn="6" /> <doc id="WSJ0378" N="10" tp="10" fp="0" fn="0" /> <doc id="WSJ0379" N="24" tp="19" fp="5" fn="2" /> </docs> <totals N="170" tp="140" fp="30" fn="17" p="0.8235294117647058" r="0.89171974522293" f="0.8562691131498471" /> </evaluation> cat evaluation/sst/sst.ENTITIES.xml
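The p, r and f attributes in the totals element are consistent with the standard precision, recall and F1 definitions applied to the tp, fp and fn counts. The sketch below recomputes them; the class `EvalMetrics` is hypothetical and is not the testbed's evaluator code, and the use of these standard formulas is an assumption that happens to reproduce the reported numbers.

```java
// Hypothetical recomputation of the totals row; not the testbed's own code.
class EvalMetrics {

    // Precision = tp / (tp + fp); defined as 0 when nothing was predicted.
    static double precision(int tp, int fp) {
        return (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Recall = tp / (tp + fn); defined as 0 when there is no gold annotation.
    static double recall(int tp, int fn) {
        return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // F1 = harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Totals from the slide: tp=140, fp=30, fn=17.
        double p = precision(140, 30);
        double r = recall(140, 17);
        System.out.println("p=" + p + " r=" + r + " f=" + f1(p, r));
    }
}
```

With the slide's totals this gives p = 140/170, r = 140/157 and f = 280/327, which agree with the reported 0.8235, 0.8917 and 0.8563.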
  • 40. INDEXING AND SEARCH Search engines – traditional: bag-of-words representation; inverted index (words -> documents) for efficiency; 10 docs ranked according to tf-idf similarity with the query. Search engines – today: much metadata associated with documents; ranking based on 100s of features (date, location, pagerank, click data, personalization, etc.); richer display; facets for exploratory search; answers when appropriate; etc. Many open source options; Lucene/Solr is the most widely used.
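To make the "inverted index (words -> documents)" idea concrete, here is a minimal sketch. The class `TinyInvertedIndex` is hypothetical; it only maps lowercased tokens to document ids and does no ranking, so it is far simpler than what Lucene/Solr actually does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: word -> sorted set of document ids. No tf-idf ranking.
class TinyInvertedIndex {

    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Tokenize on non-word characters and record the document id under each token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
    }

    // Return the ids of documents containing the word (empty set if none).
    Set<Integer> lookup(String word) {
        Set<Integer> docs = postings.get(word.toLowerCase());
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Putin visits Berlin");
        index.add(2, "Elections in Berlin");
        System.out.println(index.lookup("berlin"));
        System.out.println(index.lookup("putin"));
    }
}
```

The lookup touches only the posting list of the query word rather than scanning every document, which is the efficiency argument on the slide.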
  • 41. APACHE LUCENE/SOLR
  • 42. FACETED SEARCH Diagram by Yonik Seeley
  • 44. price ranges for product query
  • 45. related people or locations for news query
  • 47. Show documents that match the query term and a selected facet
  • 48. Make inferences not clear from simple document list
  • 49. Living Knowledge Analysis is modeled very well by facets
  • 50. Topics as determined by entity and fact extraction
  • 51. Location and Time diversity dimensions
  • 52. Opinions as determined by opinion extraction
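The facet idea in the preceding slides (for example, related people or locations for a news query) amounts to counting annotation values across the documents that match a query. The sketch below shows that counting step in isolation; `FacetCounter` is a hypothetical class, the input is assumed to be the per-document values of one facet field, and Solr's built-in faceting is what the testbed actually relies on.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical facet counter: for one facet field, count how many of the
// matching documents carry each value. Not how Solr implements faceting.
class FacetCounter {

    static Map<String, Integer> count(List<List<String>> valuesPerDocument) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (List<String> values : valuesPerDocument) {
            // A value repeated inside one document is counted once for that document.
            for (String value : new HashSet<String>(values)) {
                Integer current = counts.get(value);
                counts.put(value, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Person entities of three documents matching some query (made-up values).
        List<List<String>> per = Arrays.asList(
                Arrays.asList("Putin", "Merkel"),
                Arrays.asList("Putin", "Putin"),
                Arrays.asList("Merkel", "Barroso"));
        System.out.println(count(per));
    }
}
```

Selecting a facet value then corresponds to restricting the result list to the documents that contributed to that count.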
  • 54. Diversity Engine provides a simple language to map LK XML to Solr XML
  • 55. LK2SOLR CONVERSION <indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data" converter="conf/testbed/tutorial-lk2solr.xml"/> <lktosolr> <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter"/> <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter"/> <field solr="keywords" annotation="TOP_ENTITIES" value="$text" /> <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/> </lktosolr> solr – name of the field in Solr; annotation – name of the LK XML annotation; value – value of the annotation; filter – allows post-processing on the annotation; type – only date is supported currently
  • 56. ADDING FACTS TO INDEX <lktosolr> <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter"/> <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter"/> <field solr="keywords" annotation="TOP_ENTITIES" value="$text" /> <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/> <field solr="yago" annotation="yago-entities" value="$text" /> <field solr="yago-country" annotation="facts" value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" /> </lktosolr> apps/testbed –run convert-solr conf/testbed/tutorial-application.xml ls devapps/example/data/solr/* apps/testbed –run index conf/testbed/tutorial-application.xml
  • 57. FACTS TO SOLR <field solr="yago" annotation="yago-entities" value="$text" />
  • 58. FACTS TO SOLR <field solr="yago-country" annotation="facts" value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
  • 59. ADDING IMAGES TO INDEX <lktosolr> <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter"/> <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter"/> <field solr="keywords" annotation="TOP_ENTITIES" value="$text" /> <field solr="yago" annotation="yago-entities" value="$text" /> <field solr="yago-country" annotation="facts" value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" /> <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/> <field solr="image" annotation="IMAGE_ANNOTS" value="$text" /> <field solr="bestimage" annotation="BEST_IMAGES" value="$text" /> </lktosolr> apps/testbed –run convert-solr conf/testbed/tutorial-application.xml ls devapps/example/data/solr/* apps/testbed –run index conf/testbed/tutorial-application.xml
  • 60. APPLICATION DEVELOPMENT Examples; HTML Extraction; Scaling to Large Collections; Provenance; Some pluggable GUI components
  • 61. FACT/IMAGE APPLICATION <searcher appTitle="LivingKnowledge - Example Application" appShortTitle="Example Application" appUrl="http://localhost:8983/solr/"> <facets> <facet field="yago" description="Yago"/> <facet field="yago-country" description="Country"/> <facet field="per" description="Person"/> <facet field="loc" description="Location"/> <facet field="image" description="Images"/> </facets> </searcher> ant deploy-testbed
  • 62. FACT/IMAGE APPLICATION http://localhost:8983/testbed/results.jsp?query=putin
  • 63. OPINION APPLICATION Opinions are at sentence level, not document level: same analysis, but different indexing. cat conf/testbed/tutorial-lk2solr-sentence.xml <lktosolr solrDoc="SENTENCES" contextSize="1"> <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter" source="solrdoc" /> <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter" source="solrdoc" /> <field solr="keywords" annotation="TOP_ENTITIES" value="$text" /> <field solr="yago" annotation="yago-entities" value="$text" source="solrdoc" /> <field solr="image" annotation="IMAGE_ANNOTS" value="$text" /> <field solr="bestimage" annotation="BEST_IMAGES" value="$text" /> <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/> <field solr="polarity" annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective" value="xpath:/node()[@pol]/@pol" source="solrdoc" filter="org.diversityengine.solr.converter.filters.PolarityValueFilter"/> <field solr="pol-int" annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective" value="xpath:concat(/node()[@pol and @int]/@pol,/node()[@int and @pol]/@int)" source="solrdoc"/> </lktosolr> apps/testbed –run convert-solr,index conf/testbed/tutorial-application-sentence.xml ls devapps/example/data/solr/*
  • 64. SOLR XML – SENTENCE
  • 65. OPINION APPLICATION Modify webapp/WEB-INF/web.xml: <web-app xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5"> <description> LivingKnowledge Testbed Example Application </description> <display-name>Testbed Examples</display-name> <context-param> <param-name>applicationDef</param-name> <param-value>conf/testbed/tutorial-application-sentence.xml</param-value> <description>The Living Knowledge application description XML file </description> </context-param> </web-app> ant deploy-testbed
  • 66. OPINION APPLICATION http://localhost:8983/testbed/results.jsp?query=putin
  • 68. HTML EXTRACTION Boilerplate can lead to false positive results and inaccurate facet aggregation. Real example: before extraction was developed, the most common person for most queries was one named in a top story title (shown on all pages) on the day of the crawl! Titles, authors and dates are important for bias- and diversity-aware search.
  • 69. PROVENANCE How an annotation is derived is often as important as the annotation itself. Users want to verify results. Developers need to validate results. Open Provenance provides an open source solution. Testbed annotations can be extended with Open Provenance chains.
  • 71. SCALING TO LARGE COLLECTIONS In the real world, even "small" datasets have millions of documents. NLP/image processing is expensive: 1 doc/sec = more than 11 days for 1 million docs! A Hadoop Mapper allows for scaling, which is linear in the number of machines. The ZipCollection writer allows partitioning data into subsets for processing.
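The back-of-the-envelope figure on this slide can be reproduced with a few lines of arithmetic. The class `ScalingEstimate` is hypothetical, and the model assumes perfectly linear scaling with the number of machines, as the slide claims for the Hadoop Mapper; real clusters add overhead that this sketch ignores.

```java
// Hypothetical estimate of processing time, assuming perfectly linear scaling.
class ScalingEstimate {

    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    // Days needed to process `docs` documents when each of `machines` machines
    // handles `docsPerSecondPerMachine` documents per second.
    static double days(long docs, double docsPerSecondPerMachine, int machines) {
        double totalSeconds = docs / (docsPerSecondPerMachine * machines);
        return totalSeconds / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // The slide's case: 1 million docs at 1 doc/sec on a single machine.
        System.out.println(days(1000000L, 1.0, 1));
        // Same workload spread over 10 machines (illustrative number).
        System.out.println(days(1000000L, 1.0, 10));
    }
}
```

One million seconds is about 11.6 days on one machine; under the linear-scaling assumption, ten machines would bring that down to a little over one day.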
  • 75. FUTURE WORK More components; Maven to manage dependencies; better integration of Timeline and Geo visualization components; integration of ranking algorithms; better documentation
  • 76. Thanks! LivingKnowledge Partners! You for coming!! Questions?