Rocky2010 roeder full_textbiomedicalliteratureprocesing


Chris Roeder


Full Text Biomedical Literature Processing: More Than a Scaling Challenge

Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver), Gully Burns (ISI), Lawrence Hunter (UC Denver)

Obtaining Documents

- Identify documents by querying PubMed
  - Challenging due to variations in names
- Not all documents are freely available
  - One project identified 3034 documents
  - 1253 (41%) licensed, available without charge
  - 418 (14%) available in PubMed Central
  - Availability affects experiment reproducibility
- Downloading can be problematic
  - Manual download is slow
  - PMC Open Access is limited
  - Arrange bulk download from publishers based on existing licenses
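One way to cope with name variation when querying PubMed is to OR the known variants together in a single search term. The sketch below only builds a request URL for the NCBI E-utilities ESearch endpoint; it does not contact the network. It assumes the names in question are author names, and the helper names `author_query` and `esearch_url` are hypothetical, not something taken from the slides.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author_query(variants):
    """Join spelling variants of one author name into a single OR query."""
    return " OR ".join(f'"{name}"[Author]' for name in variants)


def esearch_url(term, retmax=100):
    """Build an ESearch URL that would return up to `retmax` PubMed IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Example: two spellings of the same (illustrative) author name.
term = author_query(["Hunter L", "Hunter LE"])
url = esearch_url(term)
```

Fetching `url` and collecting the returned IDs is left out on purpose; bulk retrieval of the full text itself still depends on licensing, as the slide notes.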

File Formats

- Documents are available in many formats: HTML, XML, PDF, plain text
- Convert to plain text for NLP tool input
  - Stripping XML or HTML markup is relatively easy
  - ISI is working on PDF Extract to find the correct text flow
- Keep document zoning and other markup: headings, sections, captions, italics
- Identify the source character encoding properly
  - XML stores the encoding in the file; other formats do not
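Stripping markup while keeping the zoning information can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the pipeline the authors used: the class name `ZonedTextExtractor` and the set of tags treated as zones are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class ZonedTextExtractor(HTMLParser):
    """Collect text pieces, each labelled with the nearest enclosing zone tag."""

    # Tags treated as zones in this sketch (an assumption, adjust as needed).
    KEEP = {"h1", "h2", "h3", "p", "caption", "i", "em"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags
        self.spans = []   # (zone, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            zone = next((t for t in reversed(self.stack) if t in self.KEEP), "other")
            self.spans.append((zone, text))


extractor = ZonedTextExtractor()
extractor.feed("<h1>Results</h1><p>The <i>BRCA1</i> gene</p>")
extractor.close()

# Plain text for the NLP tools; the zone labels stay available in `spans`.
plain_text = " ".join(text for _, text in extractor.spans)
```

Keeping the `(zone, text)` pairs alongside the plain text lets a downstream tool treat headings, captions, and italicized terms differently without re-parsing the source.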

Character Representation

- An encoding is a mapping from bytes to characters
- Difficult to discern which encoding a file uses
  - ASCII, UTF-8, MacRoman, ISO-8859-1, or other?
- Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters
- Java regular expression classes (, ) don’t match non-ASCII characters
- Some characters look like others:
  - dash, en dash, minus
  - space, em space, non-breaking space
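The three problems on this slide can be reproduced in a few lines. The sketch below is in Python rather than Java, so the regular expression part only mimics ASCII-only matching through the `re.ASCII` flag. The function names and the table of lookalike characters are assumptions made for illustration, and folding lookalikes to ASCII is a lossy choice that may not suit every corpus.

```python
import re

CANDIDATES = ("ascii", "utf-8", "iso-8859-1", "mac_roman")


def decode_with_each(raw, candidates=CANDIDATES):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)  # strict mode raises on bad bytes
        except UnicodeDecodeError:
            results[name] = None
    return results


# Visually similar characters folded to a single ASCII form.
LOOKALIKES = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
    "\u00a0": " ",  # non-breaking space
    "\u2003": " ",  # em space
})


def fold_lookalikes(text):
    """Replace dash and space lookalikes with plain '-' and ' '."""
    return text.translate(LOOKALIKES)


def ascii_words(text):
    """Tokenize with an ASCII-only word class, splitting at accented letters."""
    return re.findall(r"\w+", text, flags=re.ASCII)


raw = "na\u00efve".encode("utf-8")   # the word "naive" with a diaeresis
decoded = decode_with_each(raw)
```

Here `decoded["ascii"]` is `None` because strict ASCII decoding fails loudly, while `decoded["iso-8859-1"]` succeeds and silently yields mojibake, since every byte value is valid in ISO-8859-1. A successful decode is therefore not evidence that the encoding was guessed correctly.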

Scaling

- Use a cluster when you need more than a desktop
- Prefer an easy migration from desktop to cluster
- Concurrency (threading) issues are minimized since most NLP processes are independent
- Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48-core) cluster
  - NFS shares disks between nodes
  - SGE starts and manages processes on the cluster
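Because the documents can be processed independently, the same script can run unchanged on a desktop and as an SGE array job, with each task taking its own slice of the file list from the NFS-shared disk. The sketch below assumes the job is submitted as an array numbered from 1 with step 1 (for example `qsub -t 1-48`), and reads the `SGE_TASK_ID` and `SGE_TASK_LAST` variables that SGE sets for array tasks. The function names `task_slot` and `shard` are hypothetical.

```python
import os


def task_slot(env=os.environ):
    """Return (task_id, n_tasks); fall back to (1, 1) outside an array job."""
    task_id = env.get("SGE_TASK_ID", "")
    last = env.get("SGE_TASK_LAST", "")
    # Outside an array job these are unset or not numeric.
    if task_id.isdigit() and last.isdigit():
        return int(task_id), int(last)
    return 1, 1


def shard(items, task_id, n_tasks):
    """Give task `task_id` (1-based) every n_tasks-th item, with no overlap."""
    return items[task_id - 1::n_tasks]


# On a desktop this yields the whole list; on the cluster, one slice per task.
documents = [f"doc{i:04d}.txt" for i in range(10)]
task_id, n_tasks = task_slot()
my_documents = shard(documents, task_id, n_tasks)
```

Striding through the list rather than cutting it into contiguous blocks tends to balance the load when document sizes are correlated with their order, though either scheme works since no task depends on another.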

Acknowledgements

- UC Denver
  - Helen Johnson
  - Tom Christiansen
  - Karin Verspoor, NIH grant R01 LM010120-01
  - Larry Hunter, NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
- ISI
  - Gully Burns, NSF grant #0849977
