Richard Resnick, CEO of II-SDV, discussed the challenges of keyword and biological sequence searching in life sciences. Integrating the two search types can provide more relevant results from broad patent authority coverage. However, keyword searching in life sciences is challenging due to issues like spelling variations and domain-specific terminology. Sequence searching also presents challenges regarding alignment and interpretation of results. Building reports from different search platforms is further challenging due to inconsistencies in output formats and a lack of cross-platform integration. An integrated life science search platform could provide a complete report for analysis by combining text and sequence search results into a single workfile.
2. KEYWORD SEARCHING IN THE LIFE SCIENCES IS CHALLENGING
How do you spell “somatostatin”?
Ala-Gly-Cys-Lys-Asn-Phe-Phe-Trp-Lys-Thr-Phe-Thr-Ser-Cys
somato* AND (Mus musculus OR mouse)
TGAACCTCACAGC
ATGGAGCCCCTCT
CTTTGGCTTCCAC
ACCTAGCTGGAAT
GCCTCAGCTGCT
100%/4.2%/100%
is not aa
3. Relevance of results to life sciences
Completeness of patent authority coverage
Size of bubble corresponds to the number of hits returned
GOAL: HIGHLY RELEVANT RESULTS FROM BROAD PATENT AUTHORITY COVERAGE
4. SEQUENCE SEARCHING PRESENTS CHALLENGES
CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATATTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGAGGGAGG
GTTTCTCCACTGATGCTGTTGCTAGGGATCCTTGTCCTGGCTTCAGTTTCTGCAACGCATGCCAAGTCATCACCTTACCAGAAGAAAACA
GAGAACCCCTGCGCCCAGAGGTGCCTCCAGAGTTGTCAACAGGAACCGGATGACTTGAAGCAAAAGGCATGCGAGTCTCGCTGCACCAAG
CTCGAGTATGATCCTCGTTGTGTCTATGATCCTCGAGGACACACTGGCACCACCAACCAACGTTCCCCTCCAGGGGAGCGGACACGTGGC
CGCCAACCCGGAGACTACGATGATGACCGCCGTCAACCCCGAAGAGAGGAAGGAGGCCGATGGGGACCAGCTGGACCGAGGGAGCGTGAA
AGAGAAGAAGACTGGAGACAACCAAGAGAAGATTGGAGGCGACCAAGTCATCAGCAGCCACGGAAAATAAGGCCCGAAGGAAGAGAAGGA
GAACAAGAGTGGGGAACACCAGGTAGCCATGTGAGGGAAGAAACATCTCGGAACAACCCTTTCTACTTCCCGTCAAGGCGGTTTAGCACC
CGCTACGGGAACCAAAACGGTAGGATCCGGGTCCTGCAGAGGTTTGACCAAAGGTCAAGGCAGTTTCAGAATCTCCAGAATCACCGTATT
GTGCAGATCGAGGCCAAACCTAACACTCTTGTTCTTCCCAAGCACGCTGATGCTGATAACATCCTTGTTATCCAGCAAGGTATCAAATCT
AATTCTATTCTAAACTACATATATTTTGTTGCTTGATACATATGATTCATTGGATTGCAGGGCAAGCCACCGTGACCGTAGCAAATGGCA
ATAACAGAAGAGCTTTAATCTTGACGAGGGCCATGCACTCAGAATCCCATCCGTTTCATTTCCTACATCTTGACGACATGACACCAGAAC
TCAGAGTAGCTAAATCTCATGCCGTTAACACACCCGGCCAGTTTGAGGTAGGTACCTCTTTCTTCTCACATATATATTCAATTCTCAATT
ATCATCTTACATGTTGTGGGTGTTGCTTCACAGGATTTCTTCCCGGCGAGCAGCCGAGACCAATCATCCTACTTGCAGGGATTCAGCAGG
AATACTTTGGAGGCCGCCTTCAATGTAAGCAAATGTGTCATAATTATGGAATTAAAAGAACGATCATGTTATAAACTTATAATATATATA
TACATAGGCGGAATTCAATGAGATACGGAGGGTGCTGTTAGAAGAGAATGCAGGAGGTGAGCAAGAGGAGAGAGGGCAGAGGCGATGGAG
TACTCGGAGTAGTGAGAACAATGAAGGAGTGATAGTCGAAGTGTCAAAGGAGCACGTTGAAGAACTTACTAAGCACGCTAAATCCGTCTC
AAAGAAAGGCTCCGAAGAAGAGGGAGATATCACCAACCCAATCAACTTGAGAGAAGGCGAGCCCGATCTTTCTGACAACTTTGGGAGGTT
ATTTGAGGTGAAGCCAGACAAGAAGAACCCCCAGCTTCAGGACCTGGACATGATGCTCACCTGTGTAGAGATCAAAGAAGGAGCTTTGAT
GCTCCCACACTTCAACTCAAAGGCCATGGTCATCGTCGTCATCAACAAAGGAACTGGAAACCTTGAACTCGTAGCTGTAAGAAAAGAGCA
ACAACAGAGGGGACGGCGGGAACAAGAGTGGGAAGAAGAGGAGGAAGATGAAGAAGAGGAGGGAAGTAACAGAGAGGTGCGTAGGTACAC
AGCGAGGTTGAAGGAAGGCGATGTGTTCATCATGCCAGCAGCTCATCCAGTAGCCATCAACGCTTCCTCCGAACTCCATCTGCTTGGCTT
CGGTATCAACGCTGAAAACAACCACAGAATCTTCCTTGCAGGTGATAAGGACAATGTGGTAGACCAGATAGAGAAGCAAGCGAAGGATTT
AGCATTCCCTGGTTCGGGTGAACAAGTTGAGAAGCTCATCAAAAACCAGAGGGAGTCTCACTTTGTGAGTGCTCGTCCTCAATCTCAATC
TCCGTCGTCTCCTGAAAAAGAGGACCAAGAGGAGGAAAACCAGGGAGGGAAGGGTCCACTCCTTTCAATTTTGAAGGCTTTTAACTGAGA
ATGGAGGAAACTTGTTATGTATCCATAATAAGATCACGCTTTTGTAATCTACTATCCAAAAACTTATCAATAAATAAAAACGTTTGTGCG
TTGTTTCTCCAAGAAATACGGGTGGCGCTTATGGTTGTTTATTTATACGAAACTAATTAAATACATCATAACGGCAACGACCTCTTATTT
TGTAATTTTCTT
BLAST?
90% ID?
Do I want total query coverage or total subject coverage?
Global alignment?
What word size?
How do my sequence hits relate to my text search results?
Fragment?
Motif?
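The identity and coverage questions above can be made concrete with a small calculation over one pairwise alignment. This is a minimal sketch, not any search tool's actual scoring: the function name `alignment_stats` is hypothetical, and measuring identity over alignment columns is an assumption, since tools define these figures differently.

```python
def alignment_stats(aligned_query, aligned_subject, query_len, subject_len):
    """Return (percent identity, query coverage, subject coverage).

    aligned_query / aligned_subject: equal-length strings, '-' marks a gap.
    query_len / subject_len: lengths of the full, unaligned sequences.
    Assumption: identity is measured over alignment columns.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have equal length")
    # Count columns where both sequences carry the same residue.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    # Residues of each sequence that take part in the alignment.
    query_residues = sum(1 for q in aligned_query if q != "-")
    subject_residues = sum(1 for s in aligned_subject if s != "-")
    identity = 100.0 * matches / len(aligned_query)
    query_cov = 100.0 * query_residues / query_len
    subject_cov = 100.0 * subject_residues / subject_len
    return identity, query_cov, subject_cov


# A short query that matches perfectly inside a much longer subject:
# full identity and full query coverage, but low subject coverage.
print(alignment_stats("ACGTACGT", "ACGTACGT", query_len=8, subject_len=80))
```

A short fragment can therefore score 100% identity and 100% query coverage while covering only a small part of the subject, which is why the choice between query and subject coverage matters.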
5. pn:EP* AND somato*^5 AND [mus musculus] AND clm:[transgenic animal ~3] AND pd:[19950101 TO 20140215]
KEYWORD SEARCHING IN THE LIFE SCIENCES PRESENTS CHALLENGES
How do my text search results relate to my sequence hits?
How do I figure out this system’s query syntax?
What if a keyword is misspelled in a patent claim?
How can I exclude patents unrelated to my domain easily?
How do I build and maintain reliable synonym lists?
Can I be sure that all of the documents I need to review exist in the underlying database?
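As an illustration of the synonym-list problem, the sketch below expands lists of equivalent terms into a Boolean query of the kind shown on the earlier slide (somato* AND (Mus musculus OR mouse)). The helper names `or_group` and `build_query` are hypothetical, and the quoting of multi-word phrases is an assumption; each search platform has its own syntax.

```python
def or_group(terms):
    """Join terms into a parenthesised OR group, quoting multi-word phrases."""
    unique = []
    for term in terms:
        term = term.strip()
        # Skip blanks and case-insensitive duplicates, keep first spelling.
        if term and term.lower() not in (t.lower() for t in unique):
            unique.append(term)
    quoted = ['"%s"' % t if " " in t else t for t in unique]
    return "(" + " OR ".join(quoted) + ")"


def build_query(groups):
    """AND together one OR group per concept."""
    return " AND ".join(or_group(group) for group in groups)


# One list per concept; every entry in a list is treated as a synonym.
concepts = [
    ["somato*"],
    ["Mus musculus", "mouse"],
]
print(build_query(concepts))
# prints: (somato*) AND ("Mus musculus" OR mouse)
```

Keeping the synonym lists as data, separate from the query string, is what makes them maintainable: adding a spelling variant changes one list entry rather than every saved query.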
6. BUILDING A REPORT FROM DIFFERENT PLATFORMS IS CHALLENGING
Lack of life science specificity in search platforms creates multiple false-positive hits that require additional user review
Varying underlying algorithms can create an apples-to-oranges comparison
Different output formats make it difficult to analyze and compare results
Little cross-platform integration necessitates downloading multiple files for manual collation
7. Identify prior art surrounding gene modification in peanut for gene families implicated in food allergies.
“Ara h 1” is a seed storage protein from Arachis hypogaea. It is notable because sensitization to it was found in 95% of peanut-allergic patients from North America.
We’re seeking prior art that describes vaccines related to these allergies or sequences that hit the Ara h 1 gene.
CASE STUDY
8. Run a sequence search against the prior art for the peanut “ara h 1” gene sequence:
Arachis hypogaea cultivar LUHUA 8 Ara h 1 allergen (ara h 1) gene (cds)
Identify relevant documents related to peanuts and claiming transgenic modification of plants that decrease allergy risks, and limited to the documents published after January 1st 2010
Text Search Sequence Search
CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATA
TTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGA
GGGAGGGTTTCTCCACTGATGCTGTTGCT…
SOLUTION: INTEGRATED LIFE SCIENCE SEARCH PLATFORMS
Union
Combine into a single, unique workfile
9. A COMPLETE REPORT FOR ANALYSIS
Claims contain vaccin* in green
Bioinformatics-related patents in red
Sequence search results in blue
A single, unified report for analyzing results.
10. STANDARD KEYWORDS AND BOOLEAN SYNTAX AREN’T ENOUGH
Life science applications are more than collections of discrete, specific keywords.
They include field-specific ontological terms that can have synonyms, alternate spellings, and varying word order.
Building a single query that addresses all of these issues, plus allows the flexibility of Boolean, proximity, wildcard, field grouping, range searches, and term boosting, can be difficult.
11. USE EXISTING ONTOLOGY TERMS OR DEFINE YOUR OWN
As you type, suggested matching terms appear, based on the ontologies you choose
Simply typing “transgenic” with the NCBI ontology list offers “Transgenic Plants” as one option
At any time, type in the ? symbol for a complete list of field choices
Specify words in claims, date ranges, and many more options to further refine your query
Define your own ontologies and synonyms that are relevant for your specific search area
Includes synonyms and alternate spellings for the genus and species of peanut
Hit “Search” or <return> to run the search
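The type-ahead behaviour described on this slide can be sketched as a lookup over the ontologies the user has selected. This is an assumption about how such a feature might work, not the platform's implementation: the function `suggest`, the dictionary layout, and every term other than “Transgenic Plants” are illustrative.

```python
def suggest(typed, ontologies, selected):
    """Return (ontology, term) pairs whose term contains the typed text.

    ontologies: dict mapping an ontology name to a list of its terms.
    selected:   names of the ontologies the user has switched on.
    """
    needle = typed.lower()
    suggestions = []
    for name in selected:
        for term in ontologies.get(name, []):
            if needle in term.lower():
                suggestions.append((name, term))
    return suggestions


# Illustrative term lists; only "Transgenic Plants" comes from the slide.
ontologies = {
    "NCBI": ["Transgenic Plants", "Seed Storage Proteins"],
    "custom": ["Arachis hypogaea", "peanut", "groundnut"],
}
print(suggest("transgenic", ontologies, selected=["NCBI"]))
```

A user-defined list such as the hypothetical "custom" entry above sits alongside the published ontologies, so the same lookup serves both.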
13. THE “TEXT SEARCH” WORKFILE
Sort by any column
Rank for priority
Color code to categorize
Quickly assign colors/ranks using keyboard shortcuts: 3 (for 3 stars), O (for orange)
14. All the results seem relevant, but we want to annotate the documents talking about vaccines in the claims with a green color.
NAVIGATE A WORKFILE
Easily apply bulk annotations for future workfile manipulation
Keyboard shortcuts allow fast workfile evaluation
(next record)
(close preview)
(previous record)
15. FILTER A WORKFILE
Type in free text, use wildcards, or type in “?” to filter by terms in a specific field
16. FILTER A WORKFILE
Apply the filter to pull out the subset of documents that match your query.
12 documents contain the word “vaccine”, or related terms, in the claims.
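The wildcard filter on a single field can be approximated as follows. This is a rough sketch under stated assumptions: a workfile is modelled as a list of dictionaries, the field name `claims` and the document identifiers are hypothetical, and `*` is taken to mean "any run of word characters".

```python
import re


def wildcard_to_regex(pattern):
    """Turn a wildcard term such as 'vaccin*' into a whole-word regex."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)


def filter_by_field(records, field, pattern):
    """Keep the records whose given field matches the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [rec for rec in records if regex.search(rec.get(field, ""))]


# Hypothetical workfile rows; identifiers and claim text are made up.
workfile = [
    {"pn": "DOC-1", "claims": "A vaccine comprising a modified allergen."},
    {"pn": "DOC-2", "claims": "A transgenic peanut plant."},
    {"pn": "DOC-3", "claims": "A method of vaccination against peanut allergy."},
]
subset = filter_by_field(workfile, "claims", "vaccin*")
print([rec["pn"] for rec in subset])
# prints: ['DOC-1', 'DOC-3']
```

Because the filter returns a new list, the full workfile stays intact and the filter can be reset at any time.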
17. Let’s annotate these in green.
MAKING DOCUMENTS WITH VACCINES IN THE CLAIMS GREEN
18. MAKING DOCUMENTS WITH VACCINES IN THE CLAIMS GREEN
Here is what our subset (vaccine in claims) looks like.
You can reset the filter to see other documents that are in the workfile.
19. Let’s annotate in red the documents that are probably not relevant.
Notice that “Bio-informatics” is a synonym list and includes multiple spellings.
MAKING BIOINFORMATICS DOCUMENTS RED
40 documents relate to bioinformatics methods.
21. Now it’s time to complete the analysis with sequence search results.
ara h 1 CDS sequence
GenePast 90% ID over the length of the query or the subject (1000 results)
PREPARE YOUR SEQUENCE SEARCH RESULTS
22. We export these results to a LifeQuest workfile.
Apply a filter to keep only the hits whose patent sequence location is in the claims: that leaves 81 results in 25 patents.
FILTER YOUR SEQUENCE SEARCH RESULTS & EXPORT
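The distinction between results and patents in this step (several hits can fall in the same patent) can be shown with a small sketch. The field names `patent` and `location` and the sample rows are hypothetical; the real export format may differ.

```python
def hits_in_claims(hits):
    """Keep hits located in the claims; also return the distinct patents."""
    kept = [hit for hit in hits if hit["location"] == "claims"]
    patents = {hit["patent"] for hit in kept}
    return kept, patents


# Made-up sequence hits: one patent can contribute several hits.
hits = [
    {"patent": "DOC-1", "location": "claims"},
    {"patent": "DOC-1", "location": "claims"},
    {"patent": "DOC-2", "location": "description"},
    {"patent": "DOC-3", "location": "claims"},
]
kept, patents = hits_in_claims(hits)
print(len(kept), "results in", len(patents), "patents")
# prints: 3 results in 2 patents
```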
23. Save it as a new “SEQ search” workfile, and open it to analyze.
EXPORT YOUR SEARCH RESULTS TO A WORKFILE
26. CONSOLIDATE TEXT SEARCH AND SEQUENCE SEARCH RESULTS
Merge the two workfiles together (union) to get
a complete set for final analysis.
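A union of two workfiles into one set of unique records might look like the sketch below. It assumes each record is a dictionary keyed by a publication number field named `pn`, that the first workfile's values win on conflict, and that a `sources` field is added to record provenance; none of these details come from the platform itself.

```python
def union_workfiles(text_wf, seq_wf, key="pn"):
    """Merge two workfiles into one list with a single record per key."""
    merged = {}
    for source, workfile in (("text", text_wf), ("sequence", seq_wf)):
        for record in workfile:
            entry = merged.setdefault(
                record[key], {key: record[key], "sources": []}
            )
            # Keep existing values; only fill in fields not seen yet.
            for field, value in record.items():
                entry.setdefault(field, value)
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())


# Made-up records: DOC-2 appears in both searches.
text_wf = [{"pn": "DOC-1", "color": "green"}, {"pn": "DOC-2", "color": "red"}]
seq_wf = [{"pn": "DOC-2", "color": "blue"}, {"pn": "DOC-3", "color": "blue"}]
for row in union_workfiles(text_wf, seq_wf):
    print(row["pn"], row["color"], row["sources"])
```

Recording which search produced each record keeps the overlap visible after the merge, which is the information the color coding on the report conveys.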
27. Sort, filter, analyze, and export!
EVALUATE THE MERGED DATA SETS
vaccin* in claims
bioinformatics related
sequence hit in claims
29. GENERATE A COMPLETE REPORT FOR ANALYSIS
Includes results from both sequence & text searches
Create color codes for your specific categories
Merge with other outputs or export to any format
Sort or filter by any field
Rank hits (1, 2, 3 stars) to easily identify priority
Claims contain vaccin*
bioinformatics related
found using the “ara h 1” DNA sequence
A single, unified report for analyzing results.