Boehringer Ingelheim has been developing dedicated Life Science SEARCHCORPORA for startups, scientific literature and news tracking based on the Web Data Analysis platform Deep SEARCH 9.
Using the Deep SEARCH 9 approach, Boehringer Ingelheim can tap directly into any web resource, such as online databases, websites or news feeds.
Use case 1: SEARCHCORPUS® for life science startups:
We find startup information we could not find in public search engines.
Use case 2: Life science news SEARCHCORPUS®:
Hundreds of incoming emails and alerts are processed every day, and the websites and articles behind the news tags are crawled automatically.
Scientists can subscribe to these services to have compilations of personalized deep-search results sent to them automatically, or they can use faceted search on the life science SEARCHCORPORA interactively.
II-SDV 2016 Aleksandar Kapisoda, Klaus Kater - Deep Web Search
1. Boehringer Ingelheim Pharma GmbH & Co. KG
Research Networking - Aleksandar Kapisoda
Deep Web Search
Deep SEARCH 9 GmbH
Klaus Kater
2. Content
1. Intro
2. Search Approach
• Public Search Approach
• SEARCHCORPUS® Approach
3. Use Cases
• SEARCHCORPUS® for life science startups:
We find startup information we could not find in public search engines.
• Life science news SEARCHCORPUS®:
Hundreds of incoming emails and alerts are processed every day, and the websites and
articles behind the news tags are crawled automatically.
4. Technical Features
5. Outlook
4. 1. Intro
2015
(Deep Web) Search
We showed that we can crawl and find content that public search engines do not find.
5. 1. Intro
What we did in 2015…
2015 (Deep Web) Search
Timeline: 2014, 2015, 2016
During the year we
established our internal
processes to build targeted
SEARCHCORPORA.
We built solutions and
rolled them out.
And we found more than we
bargained for.
6. 1. Intro
2016
Deep (Web Search)
This year we will talk about a misconception we were confronted with
when comparing our SEARCHCORPUS® based search results
with search results from public search engines.
7. 2. The Public Search Approach
Public Search Misconception
Clashing with Incomplete Search Results
8. Let’s make up a „Weißwurst Misconception“…
2. The Public Search Approach
Clashing with Incomplete Search Results
Anybody understands that Weißwurst without Weißwurst mustard is
like Fish‘n‘Chips without Chips.
…to make it easier to understand the “Public Search Misconception”.
9. Web search is like trying to find Weißwurst mustard
in a Convenience Store 1)
2. The Public Search Approach
Clashing with Incomplete Search Results
You will find loads of local and
not so local mustards.
But if Weißwurst mustard is
located in the specialities
section, you will only find it by
chance or not at all…
1) Not a Bavarian convenience store.
10. 1) Not a Bavarian convenience store.
No Weißwurst
mustard!
Web search is like trying to find Weißwurst mustard
in a Convenience Store 1)
2. The Public Search Approach
Clashing with Incomplete Search Results
So you may believe that the
store does not carry Weißwurst
mustard at all.
11. 2. The Public Search Approach
Clashing with Incomplete Search Results
There are two common misperceptions that trap researchers
using public search:
• If a search has results,
we believe that these results are complete.
• If a search doesn’t have results,
we believe there is nothing that can be found.
Both perceptions are wrong and represent the Public Search misconception:
We believe that there is nothing to be found, even though the information may be
available.
We just don’t know where and need the right tools to find it.
This store
doesn‘t have
Weißwurst
mustard…
12. 2. The Public Search Approach
Why Results Are Missed
An explanation why results are missed
Assume we want to monitor startup activities in the area
of CRISPR being used in the fight against diabetes type 1:
+CRISPR +diabetes type 1
13. 2. The Public Search Approach
Why Results Are Missed
14. 2. The Public Search Approach
Why Results Are Missed
An explanation why results are missed
To avoid getting overloaded with biotechnological research papers,
we try to tell the search engine that we are interested in +startups....
+CRISPR +diabetes type 1
+startup
15. 2. The Public Search Approach
Why Results Are Missed
+CRISPR +diabetes type 1
+startup
Only documents in which all terms
match are returned. These documents
are actually about startups.
But only if the startups were
mentioned in some press release
or report.
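The effect of requiring every term can be illustrated with a minimal sketch. It is not how any public search engine or Deep SEARCH 9 is implemented; the function names and the three sample documents are hypothetical, and plain substring matching stands in for real indexing.

```python
def matches_all(document: str, required_terms: list[str]) -> bool:
    """Return True only if every required term occurs in the document (case-insensitive)."""
    text = document.lower()
    return all(term.lower() in text for term in required_terms)


def search(documents: dict[str, str], required_terms: list[str]) -> list[str]:
    """Return the ids of all documents that contain every required term."""
    return [doc_id for doc_id, text in documents.items()
            if matches_all(text, required_terms)]


# Hypothetical sample documents, for illustration only.
documents = {
    "press_release": "Startup X uses CRISPR to target diabetes type 1.",
    "company_site": "We edit beta cells with CRISPR to treat diabetes type 1.",
    "research_paper": "CRISPR screens in diabetes type 1 mouse models.",
}

broad = search(documents, ["CRISPR", "diabetes type 1"])
narrow = search(documents, ["CRISPR", "diabetes type 1", "startup"])

print(broad)   # all three documents match
print(narrow)  # only the press release, which uses the word "startup"
```

In this sketch the company's own website drops out of the narrowed search because it never calls itself a startup, which mirrors the limitation described on the slide.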
18. 3. Use Cases
SEARCHCORPUS® for Life Science Startups:
Situation:
Researchers manually search for startup activities and companies that are active in
specific areas of interest. Interests change frequently.
Problem:
Searching for startups by scientific topics generates an enormous amount of noise that
needs to be filtered manually.
Approach:
Implementation of a startup SEARCHCORPUS® spanning global startup companies.
Status:
Existing startup SEARCHCORPUS® for targeted search
25. 3. Use Cases
Life Science News SEARCHCORPUS®
Situation:
Researchers are manually filtering hundreds of websites, emails and news feeds
• News items that are not screened immediately are lost
Approach:
A targeted news SEARCHCORPUS® using periodic targeted crawling and extraction of
news from sources used by Boehringer Ingelheim scientists.
1. Tracker is made available to researchers in the corporate intranet
2. News archive with faceted search using ontology-based query term expansion
3. Search-profile-based email alerting whenever matching news is crawled
Status:
Existing news SEARCHCORPUS® for targeted search
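Query term expansion and profile-based alerting can be sketched roughly as follows. This is an illustration under assumptions, not the Deep SEARCH 9 implementation: the small `ONTOLOGY` dictionary stands in for the ontology mentioned on the slide, and the profile names, terms and function names are hypothetical.

```python
# Hypothetical synonym table standing in for an ontology.
ONTOLOGY = {
    "diabetes type 1": ["type 1 diabetes", "t1d"],
    "crispr": ["cas9", "gene editing"],
}


def expand_term(term: str, ontology: dict[str, list[str]]) -> list[str]:
    """Return the term itself plus any synonyms listed in the ontology."""
    key = term.lower()
    return [key] + ontology.get(key, [])


def profile_matches(news_text: str, profile_terms: list[str],
                    ontology: dict[str, list[str]]) -> bool:
    """A profile matches if every term, or one of its synonyms, occurs in the text."""
    text = news_text.lower()
    return all(
        any(variant in text for variant in expand_term(term, ontology))
        for term in profile_terms
    )


def alerts_for(news_text: str, profiles: dict[str, list[str]],
               ontology: dict[str, list[str]]) -> list[str]:
    """Return the names of all search profiles that should receive an alert."""
    return [name for name, terms in profiles.items()
            if profile_matches(news_text, terms, ontology)]


# Hypothetical search profiles and news item.
profiles = {
    "alice": ["CRISPR", "diabetes type 1"],
    "bob": ["oncology"],
}
news = "New Cas9 therapy enters trial for T1D patients"
recipients = alerts_for(news, profiles, ONTOLOGY)
print(recipients)  # ['alice']
```

The news item mentions neither "CRISPR" nor "diabetes type 1" literally, so without expansion the first profile would have missed it.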
26. 3. Use Cases
Life science news SEARCHCORPUS®
• The viewer is updated by the minute; targets could be crawled as frequently as every 10 s.
• Crawling frequency and crawling schedule are defined per target.
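A per-target crawl schedule could look roughly like the sketch below. It assumes each target simply has a fixed interval in seconds; the target names, the `build_schedule` function and the priority-queue approach are illustrative assumptions and do not describe the platform's actual scheduler.

```python
import heapq


def build_schedule(intervals: dict[str, int], horizon: int) -> list[tuple[int, str]]:
    """Return (time_in_seconds, target) crawl events up to `horizon`, in time order."""
    if any(interval <= 0 for interval in intervals.values()):
        raise ValueError("crawl intervals must be positive")
    # Every target is due at time 0; the heap always yields the next due crawl.
    queue = [(0, target) for target in sorted(intervals)]
    heapq.heapify(queue)
    events = []
    while queue:
        due, target = heapq.heappop(queue)
        if due > horizon:
            continue  # beyond the planning horizon, do not reschedule
        events.append((due, target))
        heapq.heappush(queue, (due + intervals[target], target))
    return events


# Hypothetical targets: one crawled every 10 s, one every 60 s.
intervals = {"fast-feed": 10, "daily-site": 60}
events = build_schedule(intervals, horizon=30)
for due, target in events:
    print(f"t={due:>3}s crawl {target}")
```

A real scheduler would run continuously and trigger crawls at wall-clock times; the fixed horizon here only keeps the example finite.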
28. 4. Technical Features
Software:
Deep SEARCH 9 platform for advanced web analytics:
• Concurrent targeted crawling
• Content extraction
• Document caching
• Content annotation (RDF based and via APIs, e.g. Luxid)
• Scheduler for periodic jobs
• Integration of ds9 search and visualization in BI Intranet through API
• News tracker GUI for real-time news monitoring
• Faceted search GUI with RDF based query term expansion
Hardware:
Three-server cluster running ds9, a JDBC database, an RDF triple store and Elasticsearch.
Currently 90 TB disk space.
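The idea behind faceted search can be sketched in a few lines of plain Python. This is a simplified illustration only: it does not use Elasticsearch or the ds9 API, and the document fields (`source`, `topic`) and sample records are hypothetical.

```python
from collections import Counter


def facet_counts(documents: list[dict], field: str) -> Counter:
    """Count how many documents fall under each value of the given facet field."""
    return Counter(doc[field] for doc in documents if field in doc)


def filter_by_facet(documents: list[dict], field: str, value: str) -> list[dict]:
    """Keep only the documents whose facet field equals the selected value."""
    return [doc for doc in documents if doc.get(field) == value]


# Hypothetical crawled documents with two facet fields.
docs = [
    {"title": "A", "source": "newsfeed", "topic": "CRISPR"},
    {"title": "B", "source": "website", "topic": "CRISPR"},
    {"title": "C", "source": "newsfeed", "topic": "diabetes"},
]

by_source = facet_counts(docs, "source")
crispr_docs = filter_by_facet(docs, "topic", "CRISPR")
crispr_by_source = facet_counts(crispr_docs, "source")

print(dict(by_source))
print([doc["title"] for doc in crispr_docs])
print(dict(crispr_by_source))
```

Selecting a facet value narrows the result set, and the remaining facet counts are recomputed on the narrowed set, which is what lets a user drill down interactively.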
29. 5. Outlook
• SEARCHCORPORA®
• Setup of more comprehensive SEARCHCORPORA® (startup, news)
• Extending targeted SEARCHCORPORA® (Life Science domain)
• More viewers for data visualisation (results)
• Communication with other third party software via API / webservice
• Integration of Semantic Web Technologies
• Terminology
• RDF import/export