1. Full Text Biomedical Literature Processing: More Than a Scaling Challenge. Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver), Gully Burns (ISI), Lawrence Hunter (UC Denver)
2. Obtaining Documents
- Identify documents by querying PubMed
  - Challenging due to variations in names
- Not all documents are freely available
  - One project identified 3034 documents
  - 1253 (41%) licensed, available without charge
  - 418 (14%) available in PubMed Central
  - Availability affects experiment reproducibility
- Downloading can be problematic
  - Manual download is slow
  - PMC Open Access is limited
  - Arrange bulk download from publishers based on existing licenses
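As a rough illustration of the querying step, the sketch below builds a PubMed search URL and re-derives the availability percentages quoted above. It assumes the NCBI E-utilities ESearch endpoint, which the slides do not name; the function names and the example search term are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch is used; the slides only say "querying PubMed".
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(term, retmax=100):
    """Return an ESearch URL asking PubMed for IDs of records matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def availability_share(available, total):
    """Percentage of identified documents that are available, as a whole number."""
    return round(100 * available / total)


# Hypothetical search term, for illustration only.
url = build_pubmed_query_url("breast cancer[Title]")

# Figures from the slide: 3034 documents identified in one project.
licensed_share = availability_share(1253, 3034)
pmc_share = availability_share(418, 3034)
```

Fetching the URL and paging through results is left out on purpose; bulk retrieval of the full text itself still depends on the licensing arrangements described above.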
3. File Formats
- Documents are available in many formats: HTML, XML, PDF, plain text
- Convert to plain text for NLP tool input
  - Stripping XML or HTML markup is relatively easy
  - ISI is working on PDF Extract to find the correct flow
- Keep document zoning and other markup: headings, sections, captions, italics
- Identify the source character encoding properly
  - XML stores the encoding in the file; other formats do not
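One way to strip HTML markup while keeping a minimal form of zoning can be sketched with Python's standard `html.parser`. This is an illustrative sketch, not the pipeline described in the slides: the class `ZonedTextExtractor`, the helper `declared_xml_encoding`, and the two zone labels are hypothetical names, and real documents would need more zones (captions, sections, italics).

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class ZonedTextExtractor(HTMLParser):
    """Collect (zone, text) pairs, where zone is 'heading' or 'body'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.zones = []
        self._open_tags = []

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; tolerate sloppy HTML.
        if tag in self._open_tags:
            while self._open_tags and self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or any(t in SKIPPED_TAGS for t in self._open_tags):
            return
        in_heading = any(t in HEADING_TAGS for t in self._open_tags)
        self.zones.append(("heading" if in_heading else "body", text))


def declared_xml_encoding(raw):
    """Return the encoding named in an XML declaration, or None if absent.

    Only handles ASCII-compatible encodings; UTF-16 input would need BOM sniffing.
    """
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else None


extractor = ZonedTextExtractor()
extractor.feed("<html><head><style>p{}</style></head>"
               "<body><h1>Methods</h1><p>Cells were cultured.</p></body></html>")
zones = extractor.zones
```

Keeping the zone label next to each text span lets later NLP steps treat headings differently from body text without re-parsing the original markup.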
4. Character Representation
- An encoding is a mapping from bytes to characters
- It is difficult to discern which encoding a file uses
  - ASCII, UTF-8, MacRoman, ISO-8859-1, or other?
- Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters
- Java's predefined regular expression character classes don’t match non-ASCII characters by default
- Some characters look like others:
  - dash, en dash, minus
  - space, em space, non-breaking space
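The silent failures described above are easy to reproduce. The sketch below decodes the same UTF-8 bytes under the wrong encoding, shows where the spurious `?` comes from, and maps a few look-alike characters to plain ASCII. The mapping table and function names are hypothetical, and whether such normalization is appropriate depends on the downstream tools.

```python
import unicodedata

# "café" encoded as UTF-8: the final character occupies two bytes.
utf8_bytes = "caf\u00e9".encode("utf-8")

# Wrong encoding, no error raised: each byte becomes its own character.
misread = utf8_bytes.decode("latin-1")

# Forcing text into ASCII with errors="replace" yields the spurious '?'.
lossy = "caf\u00e9".encode("ascii", errors="replace")

# Look-alikes named on the slide, mapped to their ASCII counterparts.
LOOKALIKES = str.maketrans({
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u2003": " ",  # EM SPACE
})


def normalize_lookalikes(text):
    """Replace visually similar dashes and spaces with ASCII '-' and ' '."""
    return text.translate(LOOKALIKES)


def describe(text):
    """List the Unicode name of each character, to expose look-alikes."""
    return [unicodedata.name(ch, "UNNAMED") for ch in text]
```

Calling `describe` on a suspicious token is a quick way to tell a hyphen-minus from an en dash or a minus sign before deciding how to normalize it.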
5. Scaling
- Use a cluster when you need more than a desktop
- Prefer an easy migration from desktop to cluster
- Concurrency (threading) issues are minimized since most NLP processes are independent
- We are having success with Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48-core) cluster
  - NFS shares disks between nodes
  - SGE starts and manages processes on the cluster
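Because the documents can be processed independently, one simple way to spread work over a cluster is to give each Grid Engine array task its own slice of the document list. The sketch below assumes the job is submitted as an array job (for example with `qsub -t`), in which case Grid Engine exports the task index as `SGE_TASK_ID`; the function names and the round-robin split are illustrative choices, not the setup described in the slides.

```python
import os


def documents_for_task(documents, task_id, task_count):
    """Return the documents assigned to a 1-based array-task ID (round-robin)."""
    if not 1 <= task_id <= task_count:
        raise ValueError("task_id must be between 1 and task_count")
    return documents[task_id - 1::task_count]


def current_task_id(environ=os.environ):
    """Read the array-task index exported by Grid Engine as SGE_TASK_ID.

    Falls back to 1 on a desktop run, where the variable is unset, and for
    non-array jobs, where Grid Engine sets it to the string "undefined".
    """
    value = environ.get("SGE_TASK_ID", "1")
    return 1 if value == "undefined" else int(value)
```

The same script then runs unchanged on a desktop (one task that receives every document) and on the cluster, provided the document list lives on a path every node can see, such as an NFS share.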
6. Acknowledgements
- UC Denver
  - Helen Johnson
  - Tom Christiansen
  - Karin Verspoor, NIH grant R01 LM010120-01
  - Larry Hunter, NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
- ISI
  - Gully Burns, NSF grant #0849977