Image Mining from Gel Diagrams in
Biomedical Publications
Tobias Kuhn and Michael Krauthammer
Krauthammer Lab, Department of Pathology
Yale University School of Medicine
5th International Symposium on
Semantic Mining in Biomedicine (SMBM)
3 September 2012
Zurich, Switzerland
Introduction
The inclusion of figure images is a recent trend in the area of
literature mining.
The increasing number of open access publications makes such
images available for automated analysis.
Image mining techniques can be used for image search interfaces,
for relation mining, and to complement text mining approaches.
T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 2 / 19
Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/
Gel Images
Our approach focuses on gel images:
• They are the result of gel electrophoresis (e.g. Southern,
Western and Northern blotting)
• They are often shown in biomedical publications as evidence for
the discussed findings (e.g. protein-protein interactions and
protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
Relations from Gel Images
Condition     Measurement   Result
MDA-MB-231    14-3-3σ       high expression
NHEM          14-3-3σ       no expression
C8161.9       14-3-3σ       high expression
LOX           14-3-3σ       low expression
MDA-MB-231    β-actin       high expression
NHEM          β-actin       high expression
C8161.9       β-actin       high expression
LOX           β-actin       high expression

Condition                      Measurement   Result
IL-1β (–) DEX (–) RU486 (–)    p-p38         low expression
IL-1β (+) DEX (–) RU486 (–)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (–)    p-p38         no expression
IL-1β (+) DEX (+) RU486 (–)    p-p38         low expression
IL-1β (–) DEX (–) RU486 (+)    p-p38         no expression
IL-1β (+) DEX (–) RU486 (+)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (+)    p-p38         low expression
IL-1β (+) DEX (+) RU486 (+)    p-p38         high expression
...                            ...           ...
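One way to hold such relations in code is as simple three-field records. The sketch below is only an illustration: the class name `GelRelation` and the helper `conditions_with` are hypothetical, and the rows are copied from the first example table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GelRelation:
    condition: str    # lane label, e.g. a cell line or a treatment combination
    measurement: str  # row label, e.g. the protein being probed
    result: str       # qualitative band strength


# Rows copied from the first example table.
relations = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
]


def conditions_with(result, measurement, rows):
    """Return the conditions under which `measurement` shows `result`."""
    return [r.condition for r in rows
            if r.measurement == measurement and r.result == result]


high = conditions_with("high expression", "14-3-3σ", relations)
```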
Image Mining Processes
In principle, image mining involves the same processes as classical
literature mining [1] (with some subtle but important differences):
• Document categorization (image categorization has to deal
with the two-dimensional space of pixels, instead of text)
• Named entity tagging (pinpointing the mention of an entity is
more difficult with images; OCR errors have to be considered)
• Fact extraction (analysis of graphical elements instead of
parsing complete sentences)
• Collection-wide analysis
[1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature.
International Journal of Medical Informatics, 67(1-3):7–18.
Procedure
[Pipeline diagram: articles → (1) figures → (2) segments → (3) text → (4) gels → (5) gel panels → (6) named entities → (7) relations]
1 Figure Extraction
2 Segmentation
3 Text Recognition
4 Gel Segment Detection
5 Gel Panel Detection
6 Named Entity Recognition
7 Relation Extraction
Figure Extraction
[Diagram: step 1, articles → figures]
We use structured XML files of the open access subset of PubMed
Central.
(Figure extraction from PDF files or even bitmaps of scanned articles
would be more difficult, but definitely feasible.)
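A minimal sketch of pulling figure records out of such XML, using only the Python standard library. The element names (`fig`, `label`, `caption`, `graphic` with an `xlink:href` attribute) follow the JATS-style markup commonly found in PubMed Central files; that the original pipeline reads exactly these elements is an assumption, and the sample fragment is made up.

```python
import xml.etree.ElementTree as ET

# Namespaced attribute that holds the image reference (assumed JATS-style markup).
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Tiny made-up article fragment, for illustration only.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of 14-3-3 sigma expression.</p></caption>
      <graphic xlink:href="example-1"/>
    </fig>
    <fig id="F2">
      <label>Figure 2</label>
      <caption><p>Overview of the workflow.</p></caption>
      <graphic xlink:href="example-2"/>
    </fig>
  </body>
</article>"""


def extract_figures(xml_text):
    """Return one dict per <fig> element: id, label, caption text, image reference."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        caption = fig.find("caption")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": "".join(caption.itertext()).strip() if caption is not None else "",
            "image": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


figures = extract_figures(SAMPLE)
```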
Segmentation and Text Recognition
[Diagram: steps 2 and 3, figures → segments → text]
For segmentation and text recognition we rely on our previous work [2].
This includes:
• Detection of layout elements
• Text region detection
• OCR (using the Microsoft Document Imaging package of MS
Office)
[2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for
biomedical images. Journal of Biomedical Informatics, 43(6):924–931.
Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region
detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
Gel Segment Detection
[Diagram: step 4, segments and text → gels]
Random forest classifiers (based on 75 random trees) on the following
features of image segments:
• coordinates of the relative position within the image
• relative and absolute width and height
• 16 grayscale histogram features
• color features: red, green and blue
• 13 texture features
• number of recognized characters
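Two of the feature groups listed above can be sketched in plain Python. The equal-width binning, the normalization, and the function names are assumptions; the slides do not say how the histogram is computed. The color and texture features and the classifier itself are left out.

```python
def grayscale_histogram(pixels, bins=16):
    """Normalized histogram of 8-bit gray values (0..255) over equal-width bins."""
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            counts[value * bins // 256] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def geometry_features(x, y, w, h, image_w, image_h):
    """Relative position, relative size and absolute size of a segment."""
    return [x / image_w, y / image_h, w / image_w, h / image_h, w, h]


# Toy 2x3 segment with dark and bright pixels, placed inside a 400x200 figure.
segment = [[0, 10, 20], [250, 255, 128]]
hist = grayscale_histogram(segment)
features = geometry_features(10, 20, 100, 40, 400, 200) + hist
```

A feature vector like `features` would then be one row of the training matrix for the random forest.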
Gel Segment Detection Results
Manually annotated training and testing sets of 500 random figures
each.
Results for three different thresholds:
                 Threshold   Precision   Recall   F-score
high recall      0.15        0.439       0.909    0.592
                 0.30        0.765       0.739    0.752
high precision   0.60        0.926       0.301    0.455
Accuracy (area under ROC curve): 98.0%
Unbalanced set: 3% gel segments vs. 97% non-gel segments
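The F-score is the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below shows how the three measures follow from classifier scores at a given threshold, and recomputes the F-scores from the reported precision and recall values. The function names are hypothetical. Because the reported inputs are already rounded, the recomputed values can differ from the reported ones in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(scores, labels, threshold):
    """Precision, recall and F-score when segments scoring >= threshold count as gels."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


# Reported (precision, recall) pairs per threshold, copied from the results table.
reported = {0.15: (0.439, 0.909), 0.30: (0.765, 0.739), 0.60: (0.926, 0.301)}
recomputed = {t: round(f_score(p, r), 3) for t, (p, r) in reported.items()}
```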
Gel Panel Detection
[Diagram: step 5, gels → gel panels]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the
high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected
gel region
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
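The seed-and-grow idea of the panel detection algorithm can be sketched as follows. Several details are assumptions: segments are taken to be axis-aligned boxes `(x, y, w, h)`, "adjacent" is taken to mean within `gap` pixels, and the high-precision and high-recall classifiers are taken to correspond to the thresholds 0.60 and 0.15 from the segment detection results. Label collection is omitted.

```python
def adjacent(a, b, gap=5):
    """True if boxes (x, y, w, h) overlap or lie within `gap` pixels of each other."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - gap <= bx + bw and bx - gap <= ax + aw and
            ay - gap <= by + bh and by - gap <= ay + ah)


def merge(a, b):
    """Bounding box enclosing both boxes."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def detect_panels(segments, p_gel, seed_t=0.60, grow_t=0.15, gap=5):
    """Seed panels from high-precision hits, then absorb adjacent high-recall hits."""
    unused = set(range(len(segments)))
    panels = []
    for i in range(len(segments)):
        if i not in unused or p_gel[i] < seed_t:
            continue
        unused.discard(i)
        panel = segments[i]
        grown = True
        while grown:  # repeat until no further neighbour can be merged
            grown = False
            for j in sorted(unused):
                if p_gel[j] >= grow_t and adjacent(panel, segments[j], gap):
                    panel = merge(panel, segments[j])
                    unused.discard(j)
                    grown = True
        panels.append(panel)
    return panels


# Three stacked gel strips plus one distant, low-scoring segment.
segments = [(0, 0, 100, 20), (0, 22, 100, 20), (300, 300, 50, 50), (0, 44, 100, 20)]
p_gel = [0.8, 0.2, 0.05, 0.4]
panels = detect_panels(segments, p_gel)
```

Seeding only from confident hits keeps precision high, while growing with the permissive threshold recovers the weaker strips of the same panel.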
Named Entity Recognition
[Diagram: step 6, gel panels → named entities]
Detection of gene and protein names in gel labels:
• Tokenization of gel label texts
• Lookup in Entrez Gene database
• Case-sensitive matching
• Exclude tokens:
  • Fewer than 3 characters
  • Arabic or Roman numerals
  • Common short words (from a list of the 100 most frequent words
    in biomedical articles)
  • 22 general words frequently used in gel diagrams (e.g. min, hrs,
    line, type, protein, DNA)
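The lookup and the exclusion rules above can be sketched in a few lines. The three word sets are tiny hypothetical stand-ins for the Entrez Gene names, the 100 frequent words, and the 22 gel-specific words; the tokenizer (runs of ASCII letters and digits) and the reading of the numeral rule as Roman numerals are likewise assumptions.

```python
import re

# Hypothetical stand-ins for the real resources named on the slide.
GENE_NAMES = {"p38", "GAPDH", "Akt", "ERK"}               # would come from Entrez Gene
COMMON_WORDS = {"the", "and", "with", "for"}              # would be the 100 frequent words
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "DNA"}

# Non-empty Roman numeral, upper or lower case.
ROMAN = re.compile(
    r"^(?=[IVXLCDM])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)


def is_number(token):
    """True for Arabic numbers and Roman numerals."""
    return token.isdigit() or bool(ROMAN.match(token))


def gene_tokens(label):
    """Return the tokens of a gel label that pass the filters and match a gene name."""
    hits = []
    for tok in re.findall(r"[A-Za-z0-9]+", label):
        if len(tok) < 3 or is_number(tok):
            continue
        if tok.lower() in COMMON_WORDS or tok in GEL_WORDS:
            continue
        if tok in GENE_NAMES:  # case-sensitive match
            hits.append(tok)
    return hits


found = gene_tokens("GAPDH 30 min III Akt gapdh")
```

Here `gapdh` is rejected only because matching is case-sensitive, which mirrors the design choice listed above.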
Named Entity Recognition Results
Recognized gene/protein tokens in 2000 random figures:
                                                absolute   relative
Total                                           156        100.0%
Incorrect                                       54         34.6%
  – Not mentioned (OCR errors)                  28         17.9%
  – Not references to genes or proteins         26         16.7%
Correct                                         102        65.4%
  – Partially correct (could be more specific)  14         9.0%
  – Fully correct                               88         56.4%
Relation Extraction
[Diagram: step 7, named entities → relations]
Relation extraction is future work and we do not have concrete
results at this point.
It would involve the following steps:
• Gene/protein name disambiguation
• Identify semantic roles (condition, measurement, ...)
• Quantify degree of expression
Combination with classical text mining techniques seems promising.
Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed
Central:
Total articles                           410 950
Processed articles                       386 428
Total figures from processed articles    1 110 643
Processed figures                        884 152
Detected gel panels                      85 942
Detected gel panels per figure           0.097
Detected gel labels                      309 340
Detected gel labels per panel            3.599
Detected gene tokens                     1 854 609
Detected gene tokens in gel labels       75 610
Gene token ratio                         0.033
Gene token ratio in gel labels           0.068
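The two per-unit figures in the table follow directly from its counts, as the short check below shows. The two coverage fractions are not stated on the slide; they are derived here from the same counts. The dictionary keys and the helper `ratio` are hypothetical names.

```python
# Counts copied from the overall results table.
counts = {
    "total_articles": 410_950,
    "processed_articles": 386_428,
    "total_figures": 1_110_643,
    "processed_figures": 884_152,
    "gel_panels": 85_942,
    "gel_labels": 309_340,
}


def ratio(numerator, denominator, digits=3):
    """Quotient of two named counts, rounded to `digits` decimal places."""
    return round(counts[numerator] / counts[denominator], digits)


panels_per_figure = ratio("gel_panels", "processed_figures")
labels_per_panel = ratio("gel_labels", "gel_panels")
article_coverage = ratio("processed_articles", "total_articles")
figure_coverage = ratio("processed_figures", "total_figures")
```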
Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory
accuracy, but it is clear that this procedure is far from perfect.
Shouldn’t we standardize biomedical diagrams? A Unified
Modeling Language (UML) for biomedicine?
Conclusions and Future Work
Conclusions:
• Gel segments can be detected with high accuracy
• Detection of gel panels at high precision
• Gene/protein name recognition in gel labels at satisfactory
precision
→ Image mining from gel diagrams is feasible
Future Work:
• Relation extraction
• Combination with classical text mining techniques
• Other named entity types: cell lines, drugs, ...
• Standard for biomedical diagrams?
Thank you for your attention!
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...ijitcs
 
Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0TELKOMNIKA JOURNAL
 
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...CSCJournals
 
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...CSCJournals
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug designSurmil Shah
 
Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...IJCSEA Journal
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identificationarx-deidentifier
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolarx-deidentifier
 
IEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarIEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarOgan Gurel MD
 
IRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET Journal
 
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...IJERA Editor
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurIAEME Publication
 
diffraction techniques
 diffraction techniques diffraction techniques
diffraction techniqueskarthi keyan
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...arx-deidentifier
 
Segmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeSegmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeAboul Ella Hassanien
 
Advances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesAdvances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesecij
 

Was ist angesagt? (18)

(2011) Comparison of Face Image Quality Metrics
(2011) Comparison of Face Image Quality Metrics(2011) Comparison of Face Image Quality Metrics
(2011) Comparison of Face Image Quality Metrics
 
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
 
Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0
 
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
 
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug design
 
Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identification
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
 
IEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarIEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU Seminar
 
IRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET- Plant Disease Identification System
IRJET- Plant Disease Identification System
 
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructur
 
diffraction techniques
 diffraction techniques diffraction techniques
diffraction techniques
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
 
Segmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeSegmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosome
 
CV
CVCV
CV
 
Advances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesAdvances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic images
 

Ähnlich wie Image Mining from Gel Diagrams in Biomedical Publications

Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
 
Introduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherIntroduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherAnjani Dhrangadhariya
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsS P Sajjan
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesAshutosh Jogalekar
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentationjohn236zaq
 
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Institute of Information Systems (HES-SO)
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Kevin Mader
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1Double Check ĆŐNSULTING
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuAlexander Pico
 
Chemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsChemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsunyil96
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformaticsphilmaweb
 
BolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisBolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisJustin P. Bolinger
 
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...IOSR Journals
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsElena Sügis
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)Michael Atkins
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례mothersafe
 

Ähnlich wie Image Mining from Gel Diagrams in Biomedical Publications (20)

Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
Introduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherIntroduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and Cypher
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphs
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related Sciences
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentation
 
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang Su
 
Viva201393(1).pptxbaru
Viva201393(1).pptxbaruViva201393(1).pptxbaru
Viva201393(1).pptxbaru
 
Research summary
Research summaryResearch summary
Research summary
 
Chemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsChemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientists
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
BolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisBolingerJustin - Honors Thesis
BolingerJustin - Honors Thesis
 
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
 
A01110107
A01110107A01110107
A01110107
 
Bio ontology drtc-seminar_anwesha
Bio ontology drtc-seminar_anweshaBio ontology drtc-seminar_anwesha
Bio ontology drtc-seminar_anwesha
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례
 

Mehr von Tobias Kuhn

Nanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingNanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingTobias Kuhn
 
Linked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsLinked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsTobias Kuhn
 
Genuine semantic publishing
Genuine semantic publishingGenuine semantic publishing
Genuine semantic publishingTobias Kuhn
 
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataA Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataTobias Kuhn
 
The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer Tobias Kuhn
 
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...Tobias Kuhn
 
nanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublicationsnanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for NanopublicationsTobias Kuhn
 
Semantic Publishing and Nanopublications
Semantic Publishing and NanopublicationsSemantic Publishing and Nanopublications
Semantic Publishing and NanopublicationsTobias Kuhn
 
Scientific Data Publishing
Scientific Data PublishingScientific Data Publishing
Scientific Data PublishingTobias Kuhn
 
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...Tobias Kuhn
 
Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Tobias Kuhn
 
Data Publishing and Post-Publication Reviews
Data Publishing and Post-Publication ReviewsData Publishing and Post-Publication Reviews
Data Publishing and Post-Publication ReviewsTobias Kuhn
 
Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications Tobias Kuhn
 
Meme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation NetworksMeme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation NetworksTobias Kuhn
 
A Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural LanguageA Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural LanguageTobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureTobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureTobias Kuhn
 
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Tobias Kuhn
 
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiAutomatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiTobias Kuhn
 

Image Mining from Gel Diagrams in Biomedical Publications

  • 1. Image Mining from Gel Diagrams in Biomedical Publications
    Tobias Kuhn and Michael Krauthammer
    Krauthammer Lab, Department of Pathology, Yale University School of Medicine
    5th International Symposium on Semantic Mining in Biomedicine (SMBM)
    3 September 2012, Zurich, Switzerland
  • 2. Introduction
    The inclusion of figure images is a recent trend in literature mining. The growing number of open access publications makes such images available for automated analysis. Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.
  • 3. Yale Image Finder
    http://krauthammerlab.med.yale.edu/imagefinder/
  • 4. Gel Images
    Our approach focuses on gel images:
    • They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
    • They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expression under different conditions)
    • About 15% of all subfigures are gel images
    • They are structured according to common regular patterns
  • 5. Relations from Gel Images
    Condition     Measurement   Result
    MDA-MB-231    14-3-3σ       high expression
    NHEM          14-3-3σ       no expression
    C8161.9       14-3-3σ       high expression
    LOX           14-3-3σ       low expression
    MDA-MB-231    β-actin       high expression
    NHEM          β-actin       high expression
    C8161.9       β-actin       high expression
    LOX           β-actin       high expression

    Condition                       Measurement   Result
    IL-1β (–) DEX (–) RU486 (–)     p-p38         low expression
    IL-1β (+) DEX (–) RU486 (–)     p-p38         high expression
    IL-1β (–) DEX (+) RU486 (–)     p-p38         no expression
    IL-1β (+) DEX (+) RU486 (–)     p-p38         low expression
    IL-1β (–) DEX (–) RU486 (+)     p-p38         no expression
    IL-1β (+) DEX (–) RU486 (+)     p-p38         high expression
    IL-1β (–) DEX (+) RU486 (+)     p-p38         low expression
    IL-1β (+) DEX (+) RU486 (+)     p-p38         high expression
    ...                             ...           ...
  • 6. Image Mining Processes
    In principle, image mining involves the same processes as classical literature mining [1] (with some subtle but important differences):
    • Document categorization (image categorization has to deal with the two-dimensional space of pixels, instead of text)
    • Named entity tagging (pinpointing the mention of an entity is more difficult with images; OCR errors have to be considered)
    • Fact extraction (analysis of graphical elements instead of parsing complete sentences)
    • Collection-wide analysis
    [1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics, 67(1-3):7–18.
  • 7. Procedure
    [Diagram: articles → figures → segments → text → gels → gel panels → named entities → relations]
    1. Figure Extraction
    2. Segmentation
    3. Text Recognition
    4. Gel Segment Detection
    5. Gel Panel Detection
    6. Named Entity Recognition
    7. Relation Extraction
  • 8. Figure Extraction
    [Diagram: step 1, articles → figures]
    We use structured XML files of the open access subset of PubMed Central. (Figure extraction from PDF files or even bitmaps of scanned articles would be more difficult, but definitely feasible.)
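    As a rough illustration of this step, the sketch below pulls figure references out of an article XML string. It assumes a JATS-style layout in which each figure is a `fig` element with a `label` and a `graphic` child carrying an `xlink:href` attribute; the function name `extract_figures` and the sample document are hypothetical and are not taken from the authors' pipeline.

```python
import xml.etree.ElementTree as ET

# Attribute key for xlink:href once ElementTree has expanded the namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_figures(article_xml):
    """Return one dict (id, label, href) per <fig> element in the article."""
    root = ET.fromstring(article_xml)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "href": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


# Hypothetical miniature article, for illustration only.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1"><label>Figure 1</label><graphic xlink:href="example-1"/></fig>
    <fig id="F2"><label>Figure 2</label><graphic xlink:href="example-2"/></fig>
  </body>
</article>"""

for figure in extract_figures(SAMPLE):
    print(figure)
```

    Working from structured XML keeps this step simple: the figure boundaries are given by the markup, so no layout analysis is needed before segmentation.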
  • 9. Segmentation and Text Recognition
    [Diagram: steps 2 and 3, figures → segments → text]
    For segmentation and text recognition we rely on our previous work [2]. This includes:
    • Detection of layout elements
    • Text region detection
    • OCR (using the Microsoft Document Imaging package of MS Office)
    [2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for biomedical images. Journal of Biomedical Informatics, 43(6):924–931.
        Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
  • 10. Gel Segment Detection
    [Diagram: step 4, segments → gels]
    Random forest classifiers (based on 75 random trees) on the following features of image segments:
    • coordinates of the relative position within the image
    • relative and absolute width and height
    • 16 grayscale histogram features
    • color features: red, green and blue
    • 13 texture features
    • number of recognized characters
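    To make the feature list more concrete, here is a minimal sketch of how part of such a feature vector could be assembled for one segment. The segment representation (a dict with `box`, `pixels` and `num_chars`) and both function names are assumptions for illustration; the colour and texture features are left out, and the resulting vectors could be passed to any random forest implementation configured with 75 trees.

```python
def grayscale_histogram(pixels, bins=16):
    """Normalised histogram of 8-bit grayscale values (0..255)."""
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1  # avoid division by zero for empty segments
    return [count / total for count in counts]


def segment_features(segment, image_width, image_height):
    """Build a partial feature vector (position, size, histogram, characters)."""
    x, y, w, h = segment["box"]
    features = [
        (x + w / 2) / image_width,   # relative horizontal position of the centre
        (y + h / 2) / image_height,  # relative vertical position of the centre
        w / image_width,             # relative width
        h / image_height,            # relative height
        float(w),                    # absolute width
        float(h),                    # absolute height
    ]
    features += grayscale_histogram(segment["pixels"])  # 16 histogram features
    features.append(float(segment["num_chars"]))        # recognised characters
    return features


# Hypothetical segment: a 30x40 box at (10, 20) in a 100x200 image.
example = {"box": (10, 20, 30, 40), "pixels": [0, 15, 16, 255], "num_chars": 3}
print(segment_features(example, 100, 200))
```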
  • 11. Gel Segment Detection Results
    Manually annotated training and testing sets of 500 random figures each. Results for three different thresholds:
    Threshold                Precision   Recall   F-score
    0.15 (high recall)       0.439       0.909    0.592
    0.30                     0.765       0.739    0.752
    0.60 (high precision)    0.926       0.301    0.455
    Accuracy (area under ROC curve): 98.0%
    Unbalanced set: 3% gel segments vs. 97% non-gel segments
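    The relationship between the threshold and the three reported measures can be sketched as follows. The function names and the toy scores are illustrative assumptions; the F-score is taken to be the usual harmonic mean of precision and recall, which reproduces the values reported for the first two thresholds.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(scores, labels, threshold):
    """Evaluate a score threshold against boolean gold labels."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


# Toy classifier scores and gold labels (hypothetical data).
scores = [0.9, 0.7, 0.4, 0.2, 0.1]
labels = [True, False, True, False, False]
for threshold in (0.15, 0.30, 0.60):
    print(threshold, precision_recall_f(scores, labels, threshold))

# Recomputing the F-score from the reported precision and recall.
print(round(f_score(0.439, 0.909), 3))
print(round(f_score(0.765, 0.739), 3))
```

    Lowering the threshold accepts more segments, which raises recall at the cost of precision. For the 0.60 threshold, recomputing from the rounded precision and recall gives about 0.454 rather than the reported 0.455, presumably because the inputs are themselves rounded.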
  • 12. Gel Panel Detection
    [Diagram: step 5, gels → gel panels]
    Algorithm:
    • Start with a gel segment according to the high-precision classifier
    • Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
    • Collect labels in the form of text segments around the detected gel region
    Results on another set of 500 manually annotated figures:
    Precision   Recall   F-score
    0.951       0.379    0.542
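    The first two steps of the algorithm above amount to seeded region growing, which can be sketched as follows. The thresholds 0.60 and 0.15 are the high-precision and high-recall settings from the detection results; the segment representation, the `gap` tolerance used to decide adjacency, and the function names are assumptions for illustration. Label collection is not shown.

```python
HIGH_PRECISION = 0.60  # threshold used to pick seed segments
HIGH_RECALL = 0.15     # threshold used when growing a panel


def adjacent(a, b, gap=5):
    """True if boxes (x, y, w, h) overlap or lie within `gap` pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - gap <= bx + bw and bx - gap <= ax + aw and
            ay - gap <= by + bh and by - gap <= ay + ah)


def detect_panels(segments):
    """Group segment indices into panels, starting from high-precision seeds."""
    unassigned = set(range(len(segments)))
    seeds = sorted(i for i in unassigned
                   if segments[i]["score"] >= HIGH_PRECISION)
    panels = []
    for seed in seeds:
        if seed not in unassigned:
            continue  # already merged into an earlier panel
        panel = [seed]
        unassigned.discard(seed)
        grew = True
        while grew:  # repeat until no further adjacent segment qualifies
            grew = False
            for i in sorted(unassigned):
                if segments[i]["score"] >= HIGH_RECALL and any(
                        adjacent(segments[i]["box"], segments[j]["box"])
                        for j in panel):
                    panel.append(i)
                    unassigned.discard(i)
                    grew = True
        panels.append(sorted(panel))
    return panels


# Hypothetical segments with gel scores from a classifier.
example_segments = [
    {"box": (0, 0, 100, 20), "score": 0.80},     # seed
    {"box": (0, 22, 100, 20), "score": 0.20},    # adjacent, passes high recall
    {"box": (0, 44, 100, 20), "score": 0.05},    # adjacent, but score too low
    {"box": (300, 300, 50, 50), "score": 0.30},  # isolated, no seed nearby
]
print(detect_panels(example_segments))
```

    Requiring a high-precision seed before accepting any high-recall neighbours is consistent with the reported trade-off: panels are rarely wrong, but a panel without a confident seed segment is never found.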
  • 13. Named Entity Recognition
    [Diagram: step 6, gel panels → named entities]
    Detection of gene and protein names in gel labels:
    • Tokenization of gel label texts
    • Lookup in Entrez Gene database
    • Case-sensitive matching
    • Exclude tokens:
      • Less than 3 characters
      • Arabic or Latin numbers
      • Common short words (from a list of the 100 most frequent words in biomedical articles)
      • 22 general words frequently used in gel diagrams (e.g. min, hrs, line, type, protein, DNA)
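    The filtering rules above can be sketched as a small dictionary lookup. Everything concrete here is a stand-in: `GENE_NAMES` is a tiny hypothetical set in place of Entrez Gene, `COMMON_WORDS` replaces the list of the 100 most frequent words, `GEL_WORDS` contains only the six examples named above, and "Latin numbers" is interpreted as Roman numerals.

```python
import re

GENE_NAMES = {"TP53", "AKT1", "MAPK14"}          # hypothetical dictionary
COMMON_WORDS = {"the", "and", "with", "for"}      # stand-in for the top-100 list
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}

# Crude check: any run of Roman numeral letters counts as a number.
ROMAN = re.compile(r"^[IVXLCDM]+$")


def is_number(token):
    """True for Arabic digits or (assumed) Roman numerals."""
    return token.isdigit() or bool(ROMAN.match(token))


def gene_tokens(label, gene_names=GENE_NAMES):
    """Return the tokens of a gel label that match the gene dictionary."""
    hits = []
    for token in re.findall(r"[\w\-]+", label):
        if len(token) < 3 or is_number(token):
            continue
        if token.lower() in COMMON_WORDS or token.lower() in GEL_WORDS:
            continue
        if token in gene_names:  # case-sensitive lookup
            hits.append(token)
    return hits


print(gene_tokens("TP53 and DNA 30 min"))
print(gene_tokens("tp53 III akt1"))
```

    Case-sensitive matching trades recall for precision: a lower-case "tp53" is not matched here, but ordinary words that happen to coincide with gene symbols in a different case are not matched either.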
  • 14. Named Entity Recognition Results
    Recognized gene/protein tokens in 2000 random figures:
                                                     absolute   relative
    Total                                                 156     100.0%
    Incorrect                                              54      34.6%
      – Not mentioned (OCR errors)                         28      17.9%
      – Not references to genes or proteins                26      16.7%
    Correct                                               102      65.4%
      – Partially correct (could be more specific)         14       9.0%
      – Fully correct                                      88      56.4%
  • 15. Relation Extraction
    [Diagram: step 7, named entities → relations]
    Relation extraction is future work and we do not have concrete results at this point. It would involve the following steps:
    • Gene/protein name disambiguation
    • Identify semantic roles (condition, measurement, ...)
    • Quantify degree of expression
    Combination with classical text mining techniques seems promising.
  • 16. Overall Results on PubMed Central
    We ran our pipeline on the whole open access subset of PubMed Central:
    Total articles                            410 950
    Processed articles                        386 428
    Total figures from processed articles   1 110 643
    Processed figures                         884 152
    Detected gel panels                        85 942
    Detected gel panels per figure              0.097
    Detected gel labels                       309 340
    Detected gel labels per panel               3.599
    Detected gene tokens                    1 854 609
    Detected gene tokens in gel labels         75 610
    Gene token ratio                            0.033
    Gene token ratio in gel labels              0.068
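    The two per-unit figures in this table follow directly from the counts, as the short check below shows. The dictionary keys are illustrative names. The two gene token ratios cannot be recomputed this way, because the total token counts they are based on are not given on the slide.

```python
# Counts as reported for the open access subset of PubMed Central.
COUNTS = {
    "processed_figures": 884_152,
    "gel_panels": 85_942,
    "gel_labels": 309_340,
}


def ratio(numerator, denominator, digits=3):
    """Rounded quotient, as used for the per-unit rows of the table."""
    return round(numerator / denominator, digits)


panels_per_figure = ratio(COUNTS["gel_panels"], COUNTS["processed_figures"])
labels_per_panel = ratio(COUNTS["gel_labels"], COUNTS["gel_panels"])
print(panels_per_figure, labels_per_panel)
```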
  • 17. Discussion: Standardized Biomedical Diagrams?
    It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
    Shouldn’t we standardize biomedical diagrams? A Unified Modeling Language (UML) for biomedicine?
  • 18. Conclusions and Future Work
    Conclusions:
    • Gel segments can be detected with high accuracy
    • Detection of gel panels at high precision
    • Gene/protein name recognition in gel labels at satisfactory precision
    → Image mining from gel diagrams is feasible
    Future Work:
    • Relation extraction
    • Combination with classical text mining techniques
    • Other named entity types: cell lines, drugs, ...
    • Standard for biomedical diagrams?
  • 19. Thank you for your attention! Questions?