Argo: a platform for interoperable
and customisable text mining
Sophia Ananiadou
National Centre for Text Mining
School of Computer Science
The University of Manchester
Overview
• Sharing tools, resources and text mining workflows
• Challenges
• Interoperable infrastructure for processing and
annotation
Open AIRE-COAR Conference, Ananiadou
NaCTeM
• 1st publicly funded national
text mining centre
• Location: Manchester Institute
of Biotechnology
• Phase I - Biology (2004-2008)
• Phase II - Biology, Medicine,
Social Sciences (2008-2011)
• Phase III – Biology, Medicine,
Humanities, Social Sciences;
Fully sustainable centre (2011- )
www.nactem.ac.uk
Challenges
Language Technology
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
….
Chinese
Hindi
Urdu
Japanese
Korean….
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
….
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities….
Text Types
Newswire
Scientific Literature
Full papers/abstracts
Twitter
Patents
Clinical records, EMR
Textbooks, monographs
Online forums….
Technology
Sentence Splitter
Paragraph Splitter
NP Chunkers
C-parser
D-parser
Semantic parser
NE recognizers
Relation recognizers
…….
Diversity of Languages
Diversity of Contexts
Diversity of Applications
TM Workflows
TM Modules
Shared!
Metadata
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
….
Chinese
Hindi
Urdu
Japanese
Korean…
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
….
Language Technology
Linguistic Resources
Knowledge Resources
Resource-Rich
Big Data
Big Text
Cloud Computing
Crowd Sourcing
Big Ontology
Text Types
Newswire
Scientific Literature
Full papers/abstracts
Twitter
Patents
Clinical records, EMR
Textbooks, monographs
Online forums….
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities….
OPEN SCIENCE
Requirements from TM infrastructure
• Modularity of TM modules
• Interoperability among TM modules and resources
• Generic across different languages, domains, and text
types
– Adaptability
Interoperability and Adaptability in Resource-rich TM Infrastructures
[Figure: three interoperable modules. Resources: dictionaries, ontologies. Adaptation: rule writing, (annotated) text. Modules such as POS Tagger, Named Entity and Dependency Parser vary across languages (English, French, German, Japanese, Greek), text types and domains.]
Example: extracting proteins, annotations
GENIA
PennBioIE
AIMed
GENETAG
Incompatibility
Type definitions
Texts
Problem: Inconsistency
The problem with incompatibility
• Difficult to evaluate NERs
Which NER is best for my task?
[Figure: NER A and NER B are each evaluated on Corpus C and Corpus D.]
• On one corpus: A: 93%, B: 36%. A is better than B.
• On the other: A: 63%, B: 90%. B is better than A.
Why do results differ so much across corpora and NERs?
Text mining workflows
• A pipeline that executes particular tools and resources in
order
• Example: semantic search
• Various versions (language- or domain-specific) of basic
components needed for different applications and tasks
• Different workflows can be created, compared and evaluated
by the ability to seamlessly “mix and match” various versions
of components
PoS
Tagger
Dictionary
Lookup
NE
Extraction
Chunking Parsing
Semantic
Query
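The "mix and match" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how U-Compare or Argo is implemented: the document representation (a plain dict), the toy components and all function names are hypothetical, and only three of the pipeline stages are mocked up.

```python
from typing import Callable, Dict, List

# A document is a plain dict: the raw text plus layers of annotations.
# (Hypothetical representation, chosen only to keep the sketch short.)
Document = Dict[str, object]
Component = Callable[[Document], Document]


def whitespace_tokenizer(doc: Document) -> Document:
    """Toy tokenizer: split the text on whitespace."""
    return {**doc, "tokens": str(doc["text"]).split()}


def toy_pos_tagger(doc: Document) -> Document:
    """Toy PoS tagger: capitalised tokens are tagged NNP, everything else NN."""
    tags = ["NNP" if tok[:1].isupper() else "NN" for tok in doc["tokens"]]
    return {**doc, "pos": tags}


def make_dictionary_lookup(dictionary: Dict[str, str]) -> Component:
    """Build a lookup component for a given (domain-specific) dictionary."""
    def lookup(doc: Document) -> Document:
        entities = [(tok, dictionary[tok]) for tok in doc["tokens"] if tok in dictionary]
        return {**doc, "entities": entities}
    return lookup


def run_workflow(components: List[Component], text: str) -> Document:
    """Execute the components in order, each reading the previous one's output."""
    doc: Document = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


# Two workflows that differ only in the dictionary version they plug in.
bio_workflow = [whitespace_tokenizer, toy_pos_tagger, make_dictionary_lookup({"p53": "Protein"})]
chem_workflow = [whitespace_tokenizer, toy_pos_tagger, make_dictionary_lookup({"ethanol": "Chemical"})]

bio_result = run_workflow(bio_workflow, "p53 binds ethanol")
chem_result = run_workflow(chem_workflow, "p53 binds ethanol")
```

Because every component takes and returns the same document shape, swapping one version of a component for another changes one list entry and nothing else.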
Text mining workflows
Interoperability
Common Data Representation and Types
IBM Journal of Research and Development (2011)
U-Compare: a modular NLP workflow construction and evaluation system.
Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J.
Common Type System
• A common type system is required for complete
interoperability
• Solution: Maintain local type systems and bridge them
via a sharable type system
A single common type system is almost impossible to impose
on all developers.
[Figure: Local Type System A and Local Type System B are each bridged to the U-Compare Sharable Type System.]
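The bridging idea can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the local type names, the `shared.*` type names and the `to_shared` helper are hypothetical and are not taken from the actual U-Compare type system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    begin: int       # character offset where the span starts
    end: int         # character offset where the span ends
    type_name: str   # type label in whichever type system produced it


# Hypothetical local type names, each mapped onto a hypothetical shared type.
BRIDGE_A: Dict[str, str] = {"protein_molecule": "shared.Protein", "DNA_region": "shared.DNA"}
BRIDGE_B: Dict[str, str] = {"GENE_OR_PROTEIN": "shared.Protein"}


def to_shared(annotations: List[Annotation], bridge: Dict[str, str]) -> List[Annotation]:
    """Re-label local annotations with shared types; drop types with no bridge."""
    return [
        Annotation(a.begin, a.end, bridge[a.type_name])
        for a in annotations
        if a.type_name in bridge
    ]


tool_a_output = [Annotation(0, 3, "protein_molecule"), Annotation(10, 14, "DNA_region")]
tool_b_output = [Annotation(0, 3, "GENE_OR_PROTEIN")]

shared_a = to_shared(tool_a_output, BRIDGE_A)
shared_b = to_shared(tool_b_output, BRIDGE_B)

# Once both outputs use the shared types they can be compared directly.
agreed = set(shared_a) & set(shared_b)
```

Each developer keeps a local type system and maintains only one mapping to the shared one, rather than a mapping to every other tool.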
U-Compare Type System
Syntactic Level
Document Level
Semantic Level
U-Compare: Evaluate and Compare TM Workflows
[Figure: components from a library (Sentence Splitter A/B, POS tagger A/B, NER) are combined into Workflow A, Workflow B and Workflow C, each yielding its own F-score (F-Score A, F-Score B, F-Score C).]
Interchangeable components shown:
• UIMA SD, OpenNLP SD, GENIA SD
• UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• ABNER, MedT-NER, GENIA Tagger as NER
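To give a sense of how many candidate workflows such a component library yields, the sketch below enumerates every combination of one component per stage. The component names come from the slide; the stage labels and the `enumerate_workflows` helper are assumptions, and the sketch further assumes that every component at one stage can feed every component at the next, which a shared type system is meant to enable but which may not hold for every pairing.

```python
from itertools import product

# Component names as they appear on the slide; the stage labels are assumptions.
STAGES = {
    "sentence detection": ["UIMA SD", "OpenNLP SD", "GENIA SD"],
    "tokenization": ["UIMA Tokenizer", "OpenNLP Tokenizer", "GENIA Tagger as Tokenizer"],
    "tagging": ["GENIA Tagger", "Stepp Tagger", "OpenNLP Tagger"],
    "NER": ["ABNER", "MedT-NER", "GENIA Tagger as NER"],
}


def enumerate_workflows(stages):
    """Return every workflow that picks exactly one component per stage."""
    return [dict(zip(stages, combo)) for combo in product(*stages.values())]


workflows = enumerate_workflows(STAGES)
print(len(workflows))  # 3 * 3 * 3 * 3 candidate workflows
print(workflows[0])
```

Each enumerated workflow could then be run against the same gold standard and ranked by F-score, which is the comparison the slide describes.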
Argo
• Web-based application
• Interactive creation of
workflows
• Cloud and high-performance computing
• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA
• http://argo.nactem.ac.uk
Database: The Journal of Biological Databases and Curation (2012)
Argo: an integrative, interactive, text mining-based workbench supporting curation.
Rak, R., Rowley, A., Black, W.J. and Ananiadou, S.
[Figure: Argo features and user roles. Features: structured data, remote processing, workflow diagramming, manual editing, processing components, UIMA compliance. Roles: workflow designer, annotator/curator, developers.]
Processing Components
• Approaching 100 components (U-Compare)
– Additional 50 will be added soon
• META-NET
• Developed or co-developed by NaCTeM
– Planned: Make the library open to others to contribute
• Generic Listener component
– Developers can plug in their own locally run UIMA
component to a workflow in Argo
Remote Processing
• Single machine execution
– In-house high-performance machines
• Distributed processing
– HTCondor
– VMware vCloud (EBI) EUPMC
– Planned: EC2, Azure, …
Workflows
• Users create workflows as block diagrams
• Workflows can be shared among users
– Read only
– Planned: Read & write
– Planned: downloadable workflows
• Workflows can be deployed as web services
– Plain text (input only), XMI, RDF, BioC
Workflows view
Workflow Editor
Sample Use Cases
1 Recognition of chemical entities (chemical NER)
2 Semi-automatic curation of metabolic pathways
3 Evaluation of inter-annotator agreement
4 Information extraction as a Web service
Use Case 1: Chemical NER
Supplies gold
standard corpus
Removes golden annotations
so that they can be created
automatically
Combinations of syntactic and
semantic components create
annotations
Compares and reports precision, recall
and F1 of the different branches
against the gold standard corpus
Chemical Entity Recogniser
• Chemical model evaluated at BioCreative IV
CHEMDNER challenge
• The challenge
– Data: 10,000 manually annotated PubMed abstracts
– Task: automatically recognise names of chemical entities in text
Chemical Entity Recogniser
• Our solution
– Ranked unique mentions: ranked 1st out of 18 groups
– All mentions: ranked 3rd out of 19 groups
Subtask                 Precision %  Recall %  F-score %
Ranked unique mentions  91           85        88
All mentions            93           81        87
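As a quick consistency check, the reported F-scores can be recomputed from the reported precision and recall. The sketch assumes the F-score is the balanced F1 (the harmonic mean of precision and recall); the `f_score` helper is illustrative and not part of any challenge tooling.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score %) as given in the table.
reported = {
    "Ranked unique mentions": (91, 85, 88),
    "All mentions": (93, 81, 87),
}

for subtask, (p, r, f) in reported.items():
    computed = f_score(p, r)
    print(f"{subtask}: computed F = {computed:.1f}, reported F = {f}")
```

Both recomputed values round to the reported figures, so the table is internally consistent under the F1 assumption.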
Use Case 2: Semi-automatic Curation –
Metabolic Pathways
Search for
relevant
documents
Manual correction of
automatic annotations
NER for chemicals,
genes, process
indicators
Linking to
ontologies: CTD,
ChEBI, UniProt
Save results in
various formats,
e.g., RDF for
querying and
incorporation into
databases
Manual Annotation Editor
Create new
annotations by
selecting text
Create, modify or
delete annotations
Edit details of
annotations
Open a graphical
interface to link
annotations to
ontologies
Filtering and converting
annotations
Manual Annotation Editor: linking to ontologies
Automatic pre-selection can be
modified by the user
Details show
ontology entry
webpage
Use Case 3: Information extraction
as a Web service
Web service-
enabled
reader
Web service-
enabled
writer
Language Universal
• Reusable modules
• Generic TM modules: Competence
• Annotated Text, corpora: Performance
• Standards of Data Representation and Types for
Resources: Competence
• Dictionaries, Thesauri, Ontologies: Performance

OpenAIRE-COAR conference 2014: Argo - a platform for interoperable and customisable text analytics, by Sophia Ananiadou - University of Manchester

  • 1. Argo: a platform for interoperable and customisable text mining Sophia Ananiadou National Centre for Text Mining School of Computer Science The University of Manchester
  • 2. Overview • Sharing tools, resources and text mining workflows • Challenges • Interoperable infrastructure for processing and annotation 2Open AIRE-COAR ConferenceAnaniadou
  • 3. NaCTeM • 1st publicly funded national text mining centre • Location: Manchester Institute of Biotechnology • Phase I - Biology (2004-2008) • Phase II - Biology, Medicine, Social Sciences (2008-2011) • Phase III – Biology, Medicine, Humanities, Social Sciences; Fully sustainable centre (2011- ) www.nactem.ac.uk
  • 4. Challenges Language Technology Languages English French German Spanish Portuguese Italian Polish …. Chinese Hindu Urdu Japanese Korean….Tasks Translation Information Extraction Semantic Search Question Answering Sentiment Analysis Summarization Knowledge Discovery …. Domains Finance/Business Health Biology Social Sciences Humanities…. Text Types Newswire Scientific Literature Full papers/abstracts Twitter Patents Clinical records, EMR Textbooks, monographs Online forums…. Technology Sentence Splitter Paragraph Splitter NP Chunkers C-parser D-parser Semantic parser NE recognizers Relation recognizers ……. Diversity of Languages Diversity of Contexts Diversity of Applications TM Workflows TM Modules Shared! 4Open AIRE-COAR ConferenceAnaniadou
  • 5. Metadata Languages English French German Spanish Portuguese Italian Polish …. Chinese Hindu Urdu Japanese Korean…Tasks Translation Information Extraction Semantic Search Question Answering Sentiment Analysis Summarization Knowledge Discovery …. Language Technology Linguistic Resources Knowledge Resources Resource-Rich Big DataBig Text Cloud Computing Crowd Sourcing Big Ontology Text Types Newswire Scientific Literature Full papers/abstracts Twitter Patents Clinical records, EMR Textbooks, monographs Online forums…. Domains Finance/Business Health Biology Social Sciences Humanities…. 5Open AIRE-COAR ConferenceAnaniadou OPEN SCIENCE
  • 6. Requirements from TM infrastructure • Modularity of TM modules • Interoperability among TM modules and resources • Generic across different languages, domains, and text types – Adaptability 6Open AIRE-COAR ConferenceAnaniadou
  • 7. Module Interoperability and Adaptability ModuleModule Resources Dictionaries Ontologies Adaptation Rule Writing (Annotated) Text Interoperability and Adaptability in Resource-rich TM INFRASTRUCTURES! Dependency Parser English French German JapaneseGreek POS Tagger Named Entity Languages Text Types Domains 7Open AIRE-COAR ConferenceAnaniadou
  • 8. Example: extracting proteins, annotations 8 GENIA PennBioIE AIMed GENETAG Incompatibility Type definitions Texts Problem: Inconsistency Open AIRE-COAR ConferenceAnaniadou
  • 9. The problem with incompatibility • Difficult to evaluate NERs 9 Corpus C Corpus D NER A Which NER is best for my task? NER B A: 93% B: 36% A is better than B. A: 63% B: 90% B is better than A. Why so different among different corpora and NERs ? Open AIRE-COAR ConferenceAnaniadou
  • 10. Text mining workflows • A pipeline that executes particular tools and resources in order • Example: semantic search • Various versions (language- or domain-specific) of basic components needed for different applications and tasks • Different workflows can be created, compared and evaluated by the ability to seamlessly “mix and match” various versions of components PoS Tagger Dictionary Lookup NE Extraction Chunking Parsing Semantic Query 10Open AIRE-COAR ConferenceAnaniadou
  • 11. Text mining workflows Interoperability Common Data Representation and Types IBM Journal of Research and Development (2011) U-Compare: a modular NLP workflow construction and evaluation system. Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. 11Open AIRE-COAR ConferenceAnaniadou
  • 12. Common Type System • A common type system is required for the complete interoperability • Solution: Maintain local type systems and bridge them via a sharable type system 12 A single common type is almost impossible to impose for all developers. U-Compare Sharable Type System Local Type System A Local Type System B bridging bridging 12Open AIRE-COAR ConferenceAnaniadou
  • 13. U-Compare Type System • Syntactic Level • Document Level • Semantic Level
  • 14. U-Compare: Evaluate and Compare TM Workflows • Component library: Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER • Workflows A, B and C combine alternative versions of the components and are compared by F-score (F-Score A, F-Score B, F-Score C) • Example components – UIMA SD, OpenNLP SD, GENIA SD – UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer – GENIA Tagger, Stepp Tagger, OpenNLP Tagger – ABNER, MedT-NER, GENIA Tagger as NER
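The comparison on this slide amounts to enumerating every combination of alternative components and ranking the resulting workflows by a score. A minimal sketch follows; the scoring function is a placeholder supplied by the caller, not U-Compare's evaluation code, and the function names are assumptions.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Workflow = Tuple[str, ...]


def enumerate_workflows(stages: Sequence[Sequence[str]]) -> List[Workflow]:
    """Return one workflow per combination of alternatives, one choice per stage."""
    return list(product(*stages))


def rank_workflows(
    workflows: Sequence[Workflow], score: Callable[[Workflow], float]
) -> List[Tuple[Workflow, float]]:
    """Score every workflow and sort from best to worst."""
    scored = [(workflow, score(workflow)) for workflow in workflows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Component names as they appear on the slide.
stages = [
    ["Sentence Splitter A", "Sentence Splitter B"],
    ["POS tagger A", "POS tagger B"],
    ["NER"],
]
workflows = enumerate_workflows(stages)
```

In practice `score` would run the workflow on a gold-standard corpus and return its F-score; here it is left to the caller so that no results are invented.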
  • 15. Argo • Web-based application • Interactive creation of workflows • Cloud and high-performance computing • Integrated TM/NLP processing system • GUI for workflow creation • Library of ready-to-use processing components • Statistics, visualizations, developer APIs • Supports UIMA • http://argo.nactem.ac.uk • Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012) Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
  • 17. Processing Components • Approaching 100 components (U-Compare) – Additional 50 will be added soon • META-NET • Developed or co-developed by NaCTeM – Planned: Make the library open to others to contribute • Generic Listener component – Developers can plug in their own locally run UIMA component to a workflow in Argo
  • 18. Remote Processing • Single machine execution – In-house high-performance machines • Distributed processing – HTCondor – VMware vCloud (EBI) EUPMC – Planned: EC2, Azure, …
  • 19. Workflows • Users create workflows as block diagrams • Workflows can be shared among users – Read only – Planned: Read & write – Planned: downloadable workflows • Workflows can be deployed as web services – Plain text (input only), XMI, RDF, BioC
  • 20. Workflows view
  • 22. Sample Use Cases • 1. Recognition of chemical entities (chemical NER) • 2. Semi-automatic curation of metabolic pathways • 3. Evaluation of inter-annotator agreement • 4. Information extraction as a Web service
  • 23. Use Case 1: Chemical NER • Workflow stages: – Supplies gold standard corpus – Removes gold annotations so that they can be created automatically – Combinations of syntactic and semantic components create annotations – Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
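The final comparison step of this workflow can be sketched as exact-match scoring of predicted spans against gold spans. This is a minimal illustration, assuming annotations are (start, end, type) tuples and that only exact matches count; the evaluation component in Argo may use different matching criteria, and the spans below are made up.

```python
from typing import Dict, Set, Tuple

# (start offset, end offset, type label)
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 of predicted spans against gold spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: two gold chemical mentions and the output of two workflow branches.
gold = {(0, 7, "Chemical"), (20, 27, "Chemical")}
branch_a = {(0, 7, "Chemical")}
branch_b = {(0, 7, "Chemical"), (20, 27, "Chemical"), (30, 35, "Chemical")}

report = {
    "branch_a": precision_recall_f1(gold, branch_a),
    "branch_b": precision_recall_f1(gold, branch_b),
}
```

In this toy case branch A is more precise while branch B has higher recall, which is the kind of trade-off that a per-branch report makes visible.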
  • 24. Chemical Entity Recogniser • Chemical model evaluated at BioCreative IV CHEMDNER challenge • The challenge – Data: 10,000 manually annotated PubMed abstracts – Task: automatically recognise names of chemical entities in text
  • 25. Chemical Entity Recogniser • Our solution – Ranked unique mentions: ranked 1st out of 18 groups – All mentions: ranked 3rd out of 19 groups • Results (Precision % / Recall % / F-score %) – Ranked unique mentions: 91 / 85 / 88 – All mentions: 93 / 81 / 87
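Assuming the reported F-score is the usual balanced F1 (the harmonic mean of precision P and recall R), the table values are consistent with each other after rounding:

```latex
F_1 = \frac{2PR}{P + R}

\text{Ranked unique mentions: } \frac{2 \cdot 91 \cdot 85}{91 + 85} = \frac{15470}{176} \approx 87.9 \approx 88

\text{All mentions: } \frac{2 \cdot 93 \cdot 81}{93 + 81} = \frac{15066}{174} \approx 86.6 \approx 87
```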
  • 26. Use Case 2: Semi-automatic Curation – Metabolic Pathways • Search for relevant documents • Manual correction of automatic annotations • NER for chemicals, genes, process indicators • Linking to ontologies: CTD, ChEBI, UniProt • Save results in various formats, e.g., RDF for querying and incorporation into databases
  • 27. Manual Annotation Editor • Create new annotations by selecting text • Create, modify or delete annotations • Edit details of annotations • Open a graphical interface to link annotations to ontologies
  • 28. Filtering and converting annotations
  • 29. Manual Annotation Editor: linking to ontologies • Automatic pre-selection can be modified by the user • Details show ontology entry webpage
  • 30. Use Case 3: Information extraction as a Web service • Web service-enabled reader • Web service-enabled writer
  • 31. Language Universal • Reusable modules • Generic TM modules: Competence • Annotated Text, corpora: Performance • Standards of Data Representation and Types for Resources: Competence • Dictionaries, Thesauri, Ontologies: Performance