Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh
Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang,
James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan,
Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray
Choudhury
Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences
and Technology
Pennsylvania State University, University Park, PA, USA

Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
Current Support: Dow Chemical

http://chemxseer.ist.psu.edu
Talk Overview
● Challenges and Motivation.
● Functionalities
– Fulltext Search
– Author Search
– Table Search
– Figure Search
– Expertise Search
– Chemical Name and Formula Tagging
– Chemical Name and Formula Search
● Summary.
Based on the cyberinfrastructure for CiteSeerX
Built on Solr/Lucene, SeerSuite, and other open source software (OSS)
ChemXSeer RSC
ChemXSeer Fulltext Search
ChemXSeer Author Search
ChemXSeer Table Search
• Tables are widely used to present experimental results or statistical data in scientific documents.
• Existing search engines treat tabular data as regular text:
– Structural information and semantics are not preserved.
• We automatically identify tables and extract table metadata in XML.
Table Metadata Representation:
• Environment metadata: (document specifics: type, title, …)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (caption, footnote, …)
• Layout metadata: (number of rows, columns, headers, …)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)

Y. Liu et al., AAAI 2007, JCDL 2007.
Sample Table Metadata Extracted File
<Table>
<DocumentOrigin>Analyst</DocumentOrigin>
<DocumentName>b006011i.pdf</DocumentName>
<Year>2001</Year>
<DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle>
<Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269 3060</Author>
<TheNumOfCiters></TheNumOfCiters>
<Citers></Citers>
<TableCaption>Table 1 Temperature effect o n r esistance chan ge ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption>
<TableColumnHeading>D R Temperature/ ° C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading>
<TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent>
<TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
<ColumnNum>5</ColumnNum>
<TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
<PageNumOfTable>3</PageNumOfTable>
<Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>
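A minimal sketch of how a record like this could be consumed, assuming Python and the standard library's xml.etree.ElementTree. The record below is a shortened, cleaned-up stand-in for the sample above, and the helper names load_table_record and matches are hypothetical, not part of ChemXSeer.

```python
import xml.etree.ElementTree as ET

# Shortened, cleaned-up stand-in for the sample record (values abbreviated).
RECORD = """<Table>
  <DocumentOrigin>Analyst</DocumentOrigin>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <TableCaption>Table 1 Temperature effect on resistance change</TableCaption>
  <TableContent>100 223 5 ~22 min Yes</TableContent>
  <ColumnNum>5</ColumnNum>
  <PageNumOfTable>3</PageNumOfTable>
</Table>"""


def load_table_record(xml_text):
    """Map each child tag of <Table> to its stripped text ('' if empty)."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def matches(record, query, fields=("TableCaption", "TableContent")):
    """Hypothetical field-restricted match: case-insensitive substring test."""
    q = query.lower()
    return any(q in record.get(field, "").lower() for field in fields)


record = load_table_record(RECORD)
print(record["DocumentOrigin"], record["Year"], int(record["ColumnNum"]))
print(matches(record, "resistance"))  # caption contains the query
print(matches(record, "Analyst"))     # only in DocumentOrigin, which is not searched
```

Keeping each metadata type in its own field is what lets a query target the caption or cell contents separately from the surrounding document text.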
ChemXSeer Table Search
ChemXSeer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures.
No search engine allows searching on figures and their data in chemical documents.
Tools that automate data extraction from figures and allow search on them can:
• Increase our understanding of the key concepts of papers.
• Provide data for automatic comparative analyses.
• Enable regeneration of figures in different contexts.
• Enable search for documents with figures containing specific experimental results.
X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
Our Contribution
ChemXSeer Name and Formula Extraction and Search
• Extraction and search of chemical names and formulae in scientific documents has been shown to be very useful.
• Extraction and search of chemical names is hard:
– Many new chemical molecules are created every day, so any dictionary-based name recognizer will eventually fail.
– Names need to be segmented to obtain semantically meaningful sub-terms such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.
• Identifying formulae is hard:
– “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
– “… such as hydroxyl radical OH, superoxide O2- …” (formula)
• For searching, formulae cannot be treated as text. Search requires:
– Domain knowledge (formula identification)
– Structural knowledge (substructure finding and search)

B. Sun et al., WWW 2007, WWW 2008, TOIS
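To illustrate why formula identification needs more than pattern matching, here is a small Python sketch. The regular expression is an assumption made for this example (a naive "looks like element symbols and counts" test), not the method used in ChemXSeer.

```python
import re

# Naive shape test: one or more groups of (capital letter, optional lowercase
# letter, optional digits). It looks only at the token, never at its context.
FORMULA_SHAPE = re.compile(r"\b(?:[A-Z][a-z]?\d*)+\b")


def candidate_formulas(text):
    """Return every token whose shape resembles a chemical formula."""
    return FORMULA_SHAPE.findall(text)


address = "YSI 5301, Yellow Springs, OH, USA"
chemistry = "such as hydroxyl radical OH, superoxide O2-"

print(candidate_formulas(address))    # ['YSI', 'OH', 'USA']
print(candidate_formulas(chemistry))  # ['OH', 'O2']
```

"OH" is a candidate in both sentences, so the token's shape alone cannot separate the Ohio abbreviation from the hydroxyl radical. The surrounding words have to be taken into account.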
Chemical Entity Extraction and Tagging
● Name tagging
– Each chemical name can be a phrase
– Example
  ● "... Determination of lactic acid and ..."
  ● "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."

● Formula tagging
– Each formula is a single term
– Example
  ● "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
  ● "... YSI 5301, Yellow Springs, OH, USA ..."

● Tagging examples
– Name tagging: "... of <name-type>lactic acid</name-type> and ..."
– Formula tagging: "... radical <formula-type>OH</formula-type> , superoxide ..."
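Downstream code can read tagged output of this form with a few lines of Python. This is a sketch that assumes the inline tag names shown in the tagging examples above; extract_entities and strip_tags are hypothetical helpers written for illustration.

```python
import re

# Matches <name-type>...</name-type> or <formula-type>...</formula-type>.
# The backreference \1 requires the closing tag to match the opening tag.
TAG = re.compile(r"<(name-type|formula-type)>(.*?)</\1>")


def extract_entities(tagged_text):
    """Return (kind, surface text) pairs in document order."""
    return [(m.group(1).split("-")[0], m.group(2)) for m in TAG.finditer(tagged_text)]


def strip_tags(tagged_text):
    """Recover the plain text by dropping the tags and keeping their content."""
    return TAG.sub(r"\2", tagged_text)


tagged = ("... of <name-type>lactic acid</name-type> and ... "
          "radical <formula-type>OH</formula-type> , superoxide ...")

print(extract_entities(tagged))  # [('name', 'lactic acid'), ('formula', 'OH')]
print(strip_tags(tagged))
```

A name keeps its full phrase ("lactic acid") while a formula is a single term ("OH"), matching the distinction drawn above.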
Online Chemical Entity Tagger
● We have an open source chemical name and formula tagger and a web-based interface for evaluation.
● The interface takes a PDF file as input and returns the text of the PDF with names or formulas tagged.
Online Chemical Entity Tagger: Chemical Name Tagging Example
● Results on a sample PDF.
● Some chemical formulas are erroneously identified as chemical names (loss of precision).
● High recall (most chemical names are identified).
Online Chemical Entity Tagger: Chemical Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas are not identified (loss of recall).
● High precision (words identified as formulas are actual formulas).
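The precision and recall trade-off described in the two tagging examples can be made concrete with a short Python sketch. The character spans below are invented for illustration and are not results from the tagger.

```python
def precision_recall(predicted, gold):
    """Precision = correct predictions / all predictions;
    recall = correct predictions / all gold annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (start, end) character spans of formulas in one document.
gold_formulas = {(10, 12), (30, 32), (50, 53), (70, 72)}
predicted_formulas = {(10, 12), (30, 32)}  # two formulas were missed

p, r = precision_recall(predicted_formulas, gold_formulas)
print(p, r)  # 1.0 0.5
```

This mirrors the formula tagging case: everything tagged is correct (high precision), but some formulas are missed (lower recall).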
Chemical Name Indexing and Search
• Index schemes:
– Which tokens should be indexed?
– Indexing all subsequences generates a very large index.
– “but” is a morpheme in “butane”, but not in “nembutal”.

● Segmentation-based index scheme
– Used for indexing chemical names.
– First segment a chemical name hierarchically, then index the substring at each node if it is frequent.
– Example: acetaldoxime -> aldoxime -> oxime.
– A search for “oxime” returns all three, ordered by the ranking function.
– This cannot be done with ordinary text search.
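The segmentation-based scheme can be sketched in Python as follows. This is a toy under stated assumptions: the segmentations are hand-written rather than derived automatically, the extra names (benzaldoxime, acetoxime, butanol) and the min_frequency threshold are invented for the example, and ranking is omitted.

```python
from collections import defaultdict

# Hypothetical, hand-written segmentation nodes for each name (whole name
# first). The real system derives these automatically from the corpus.
SEGMENTATIONS = {
    "acetaldoxime": ["acetaldoxime", "aldoxime", "oxime"],
    "benzaldoxime": ["benzaldoxime", "aldoxime", "oxime"],
    "acetoxime": ["acetoxime", "oxime"],
    "butane": ["butane", "but", "ane"],
    "butanol": ["butanol", "but", "anol"],
    "nembutal": ["nembutal"],  # "but" is not a segment of this name
}


def build_index(segmentations, min_frequency=2):
    """Index each whole name, plus every node substring that is frequent."""
    frequency = defaultdict(int)
    for nodes in segmentations.values():
        for node in set(nodes):
            frequency[node] += 1  # number of names containing this node

    index = defaultdict(set)
    for name, nodes in segmentations.items():
        index[name].add(name)  # whole names are always searchable
        for node in nodes:
            if frequency[node] >= min_frequency:
                index[node].add(name)
    return index


def search(index, term):
    """Return matching names in alphabetical order (no ranking function)."""
    return sorted(index.get(term, set()))


index = build_index(SEGMENTATIONS)
print(search(index, "oxime"))  # ['acetaldoxime', 'acetoxime', 'benzaldoxime']
print(search(index, "but"))    # ['butane', 'butanol']
```

Because only segmentation nodes are indexed, a query for "but" finds butane and butanol without also returning nembutal, and the index stays far smaller than one built from all substrings.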
Example Formula Search

http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeers)
Built on top of millions of papers in CiteSeerX.
A similar system was developed for Dow Chemical.
Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
Finds an expert based on their publications.
Many approaches:
• Keyphrases
• Citations
• Download count
• Affiliation
Treeratpituk, Chen, JCDL’13
Future Work
Lots of interesting work to do! Few computer/machine learning scientists are involved.
• Acquisitions: more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
– spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features
DEMO


Editor's Notes

  1. The first data mining task is to detect chemical names and formulas in the literature. The task of entity tagging is then to find the hidden label of each term in the text.
  2. Most of the substrings in the tree are semantically meaningful.