SlideShare ist ein Scribd-Unternehmen logo
1 von 6
Full Text Biomedical Literature Processing:More Than a Scaling Challenge Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor, (UC Denver) Gully Burns (ISI) , Lawrence Hunter (UC Denver)
Obtaining Documents Identify documents by querying PubMed Challenging due to variations in names Not all documents are freely available One project identified 3034 documents 1253 (41%) licensed, available without charge 418 (14 %) available in PubMed Central 	 Availability effects experiment reproducibility Downloading can be problematic Manual download is slow. PMC Open Access is limited Arrange bulk download from publishers based on existing licenses
File Formats Documents are available in many formats: HTML, XML, PDF, plain text Convert to plain text for NLP tool input Stripping XML or HTML markup is relatively easy ISI is working on PDF Extract to find correct flow Keep document zoning, other markup  headings, sections, captions, italics Identify source character encoding properly XML stores the encoding in file, others do not
Character Representation Encoding is a mapping from bytes to characters Difficult to discern wich encoding a file uses  ASCII, UTF-8, MacRoman,  ISO-8859-1, or other? Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters Java regular expression classes (, ) don’t match non-ASCII characters Some characters look like others:  dash, en dash, minus  space, em space, non-breaking-space
Scaling Use a cluster when you need more than a desktop Prefer an easy migration from desktop to cluster Concurrency (threading) issues are minimized since most NLP processes are independent Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48 core) cluster NFS shares disks between nodes SGE starts and manages processes on cluster
Acknowledgements UC Denver Helen Johnson Tom Christiansen Karin Verspoor, NIH grant R01 LM010120-01 Larry Hunter,  NIH 2R01LM009254-04 NIH 2R01LM008111-04A1 NIH 5R01GM083649-02 ISI Gully Burns, NSF grant #0849977

Weitere ähnliche Inhalte

Was ist angesagt?

SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutioneXascale Infolab
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionaryEditor IJMTER
 
Anton Dorfman - Reversing data formats what data can reveal
Anton Dorfman - Reversing data formats what data can revealAnton Dorfman - Reversing data formats what data can reveal
Anton Dorfman - Reversing data formats what data can revealDefconRussia
 
Brain Imaging Data Structure and Center for Reproducible Neuroscince
Brain Imaging Data Structure and Center for Reproducible NeuroscinceBrain Imaging Data Structure and Center for Reproducible Neuroscince
Brain Imaging Data Structure and Center for Reproducible NeuroscinceKrzysztof Gorgolewski
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extractionunyil96
 
Corpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical ToolsCorpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical ToolsJitendra Patil
 

Was ist angesagt? (8)

Ld4 l triannon
Ld4 l triannonLd4 l triannon
Ld4 l triannon
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
Anton Dorfman - Reversing data formats what data can reveal
Anton Dorfman - Reversing data formats what data can revealAnton Dorfman - Reversing data formats what data can reveal
Anton Dorfman - Reversing data formats what data can reveal
 
Open source software for building open access repositories
Open source software for building open access repositoriesOpen source software for building open access repositories
Open source software for building open access repositories
 
Brain Imaging Data Structure and Center for Reproducible Neuroscince
Brain Imaging Data Structure and Center for Reproducible NeuroscinceBrain Imaging Data Structure and Center for Reproducible Neuroscince
Brain Imaging Data Structure and Center for Reproducible Neuroscince
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extraction
 
Corpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical ToolsCorpus Linguistics :Analytical Tools
Corpus Linguistics :Analytical Tools
 

Andere mochten auch (9)

Roeder posterismb2010
Roeder posterismb2010Roeder posterismb2010
Roeder posterismb2010
 
Spring survey
Spring surveySpring survey
Spring survey
 
Uml
UmlUml
Uml
 
Roeder rocky 2011_46
Roeder rocky 2011_46Roeder rocky 2011_46
Roeder rocky 2011_46
 
Spring Framework 101
Spring Framework 101Spring Framework 101
Spring Framework 101
 
Sge
SgeSge
Sge
 
Maven
MavenMaven
Maven
 
Spring Intro
Spring IntroSpring Intro
Spring Intro
 
Spring MVC Basics
Spring MVC BasicsSpring MVC Basics
Spring MVC Basics
 

Kürzlich hochgeladen

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Kürzlich hochgeladen (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Rocky2010 roeder full_textbiomedicalliteratureprocesing

  • 1. Full Text Biomedical Literature Processing:More Than a Scaling Challenge Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor, (UC Denver) Gully Burns (ISI) , Lawrence Hunter (UC Denver)
  • 2. Obtaining Documents Identify documents by querying PubMed Challenging due to variations in names Not all documents are freely available One project identified 3034 documents 1253 (41%) licensed, available without charge 418 (14 %) available in PubMed Central Availability effects experiment reproducibility Downloading can be problematic Manual download is slow. PMC Open Access is limited Arrange bulk download from publishers based on existing licenses
  • 3. File Formats Documents are available in many formats: HTML, XML, PDF, plain text Convert to plain text for NLP tool input Stripping XML or HTML markup is relatively easy ISI is working on PDF Extract to find correct flow Keep document zoning, other markup headings, sections, captions, italics Identify source character encoding properly XML stores the encoding in file, others do not
  • 4. Character Representation Encoding is a mapping from bytes to characters Difficult to discern wich encoding a file uses ASCII, UTF-8, MacRoman, ISO-8859-1, or other? Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters Java regular expression classes (, ) don’t match non-ASCII characters Some characters look like others: dash, en dash, minus space, em space, non-breaking-space
  • 5. Scaling Use a cluster when you need more than a desktop Prefer an easy migration from desktop to cluster Concurrency (threading) issues are minimized since most NLP processes are independent Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48 core) cluster NFS shares disks between nodes SGE starts and manages processes on cluster
  • 6. Acknowledgements UC Denver Helen Johnson Tom Christiansen Karin Verspoor, NIH grant R01 LM010120-01 Larry Hunter, NIH 2R01LM009254-04 NIH 2R01LM008111-04A1 NIH 5R01GM083649-02 ISI Gully Burns, NSF grant #0849977