II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)

The Road to Federated Text Mining:
Are we there yet?
II-SDV 2014
Guy Singh

What is federated search?
“Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation”
- Wikipedia
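
The definition above can be illustrated with a minimal sketch. It assumes two hypothetical in-memory "engines" (the names `ENGINES`, `search_engine` and `federated_search` are invented for illustration and are not part of any product API); a real federation would call remote services instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "search engines"; real ones would be remote services.
ENGINES = {
    "abstracts": [
        "BRCA1 mutations raise breast cancer risk",
        "Aspirin and heart disease",
    ],
    "reports": [
        "Biopsy consistent with breast cancer",
        "Normal chest X-ray",
    ],
}


def search_engine(name, query):
    """Return (engine name, document) pairs whose text contains the query."""
    q = query.lower()
    return [(name, doc) for doc in ENGINES[name] if q in doc.lower()]


def federated_search(query):
    """Send one query to every engine concurrently and collect all hits."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_engine, name, query) for name in ENGINES]
        hits = []
        for future in futures:
            hits.extend(future.result())
    return hits


results = federated_search("breast cancer")
```

The user issues one query; the fan-out to each engine and the collection of hits happen behind that single call.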

Current Situation
• Volume of data ever increasing
• Proprietary content can reside within Enterprise
• No need for everyone to keep standard sources up-to-date
• Data from content providers can reside on their sites
[Diagram labels: Internal Content; External Content; MEDLINE; Clinical Trials; Publisher Content; FDA Drug Labels; Patents]
Linguamatics Customer Confidential

Increasing Range of Data Sources
[Diagram: Data Sources, with Scientific Literature, Social Media, News, Web Pages, Internal Documents, Patents, RSS, Clinical Trials]

Varying in Structure

How does text mining differ from keyword search?
Example: What genes affect breast cancer?

Why does document structure matter?
• Searching across documents using keywords is relatively trivial
– Do not need to be aware of where the words occur and in what context
• Text mining documents with varying structure requires a more sophisticated approach. Need to:
– Know where words matching entities/concepts occur
– Disambiguate depending on context and location
– Find terms in particular regions/parts of document for targeted searches
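
A small sketch of the contrast drawn on this slide, under stated assumptions: the document layout (`DOC`), the toy gene dictionary (`GENES`) and both functions are hypothetical and only illustrate the idea. CAT is the gene symbol for catalase, which collides with "CAT scanner" in ordinary text.

```python
import re

# Hypothetical document split into named regions; the layout is an assumption.
DOC = {
    "title": "BRCA1 variants in breast cancer",
    "methods": "Samples were stored in a CAT scanner room.",
    "results": "Loss of CAT expression was seen in tumour cells.",
}

# Toy gene dictionary (assumption, for illustration only).
GENES = {"BRCA1", "CAT"}


def keyword_search(doc, word):
    """Plain keyword search: True if the word appears anywhere in the document."""
    return any(word.lower() in text.lower() for text in doc.values())


def find_entities(doc, regions):
    """Return (region, gene, start offset) for gene symbols in the chosen regions."""
    hits = []
    for region in regions:
        # Candidate symbols: runs of two or more capitals/digits starting with a capital.
        for match in re.finditer(r"\b[A-Z][A-Z0-9]+\b", doc[region]):
            if match.group() in GENES:
                hits.append((region, match.group(), match.start()))
    return hits


anywhere = keyword_search(DOC, "cat")                # no notion of where or in what context
targeted = find_entities(DOC, ["title", "results"])  # restricted to chosen regions
```

Keyword search only reports that "cat" occurs somewhere. The region-aware version records where each entity occurs and lets the query skip the methods region, where the "CAT scanner" mention would otherwise be returned as a gene.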

Approaches to dealing with different data sources
• Integrate the data together into a data warehouse
– Extract, Transform and Load each data source into a new database
– Multiple copies of the data
– Data normalisation can be difficult and challenging
– Time consuming and expensive process
– Most database vendors take this approach
– Allows users to perform a single search across all the content
• Leave the data where it is, federated content
– Data remains in its original form and location
– Multiple data types
– Multiple network locations
– Single search across multiple different data sources

How do we get to Federated Text Mining?
[Diagram: Data Normalisation; Link the Content Servers; Merge Results; Federated Text Mining]

Data Normalisation – Virtual Indexes
[Diagram labels: Pathology Reports Index; Journal Abstracts Index; Virtual Index]

Data Normalisation – Document Structure
[Diagram labels: Pathology Reports; Journal Abstracts]

Data Normalisation – Entities
[Diagram labels: Journal Abstracts; Pathology Reports; Combined (Normalized)]
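
One way to picture entity normalisation across two sources is the sketch below. It is an assumption-laden illustration, not the method used in I2E: the synonym table, the function names and the sample mentions are all hypothetical. ERBB2 is the official symbol of the gene also known as HER2.

```python
# Hypothetical synonym table mapping surface forms to one preferred term.
SYNONYMS = {
    "her2": "ERBB2",
    "her-2/neu": "ERBB2",
    "erbb2": "ERBB2",
}


def normalise(term):
    """Map a term to its preferred form; unknown terms pass through unchanged."""
    cleaned = term.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def combine(*sources):
    """Count mentions per normalised entity across several sources."""
    counts = {}
    for mentions in sources:
        for term in mentions:
            key = normalise(term)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up mentions standing in for terms found in each source.
journal_abstracts = ["ERBB2", "HER2", "TP53"]
pathology_reports = ["HER-2/neu", "her2"]
combined = combine(journal_abstracts, pathology_reports)
```

After normalisation, mentions written differently in each source are counted under the same entity, which is what allows the two sources to be presented as one combined view.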

Linking Content Servers

Linked Servers: Development Status
• I2E 4.1 introduced a new feature – Linked Server
• One I2E server can be linked to another I2E server
• Provides access to remote and local indexes and queries through a single I2E interface (Linked Servers)
– Indexes and queries on remote servers on the network appear the same as local indexes

I2E 4.1 Linked Servers
[Diagram labels: I2E Enterprise on Customer network; In-house Indexes; I2E OnDemand SaaS Infrastructure; I2E OnDemand Standard Indexes; I2E Enterprise Access Custom Indexes; Access via Linked Servers; Access via single UI]

Merging Results (Part I)
Single Server, Multiple Queries

I2E 3.0 (2009) – Merging Results (part I) from one server
Profiling Individuals
• Example from news reports related to pharmaceutical industry
• Pick up properties from one document or many
© Linguamatics 2012 - Customer Confidential

I2E 3.0 – Merging Results (part I) from one server
[Screenshot labels: Document Identifier; Patient information; Disease history; Patient data; Medications and dosages; Hit displayed in context]

Merging Results (Part II)
Multiple Servers, Multiple Queries

Each Server supplying separate set of results
[Diagram: Content Server 1; Content Server 2; Content Server 3; Content Server 4; Merge into a single set of results]
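
The merge step on this slide could look roughly like the sketch below. It is not the I2E implementation: the `Hit` record, the de-duplication rule (keep the highest-scoring copy of a document) and the assumption that scores from different servers are comparable are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One result row; the fields are hypothetical."""
    server: str
    doc_id: str
    score: float


def merge_results(*result_sets):
    """Merge per-server hit lists into one list, best score first.

    If the same doc_id comes back from more than one server, only the
    highest-scoring copy is kept.
    """
    best = {}
    for hits in result_sets:
        for hit in hits:
            current = best.get(hit.doc_id)
            if current is None or hit.score > current.score:
                best[hit.doc_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


server_1 = [Hit("Content Server 1", "A", 0.9), Hit("Content Server 1", "B", 0.4)]
server_2 = [Hit("Content Server 2", "B", 0.7), Hit("Content Server 2", "C", 0.5)]
merged = merge_results(server_1, server_2)
```

Keeping the originating server on each hit preserves provenance in the merged set, so a user can still see which content server supplied a given result.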

The Road to Federated Text Mining

I2E 4.0: Multiple Clients, Multiple Results
[Diagram: I2E Server 1 (Internal Documents) on the internal network; I2E Server 2 (FDA Drug Labels) on the external network]

I2E 4.1/4.2: Single Client, Multiple Results
[Diagram: I2E Server 1 (Internal Documents); I2E Server 2 (FDA Drug Labels); Linked server]

Q4 2014: Single Client, Single Result, Multiple Servers
[Diagram: I2E Server 1 (Internal Documents); I2E Server 2 (FDA Drug Labels); Linked server]

Q4 2014: Federated Text Mining Example
• Single Query
• Differently structured data sources on different servers
– Journal Articles (PubMed Central) on Enterprise Server
– MEDLINE on I2E OnDemand
• Single set of results

The Road to Federated – Are we there yet?
[Timeline: I2E 4.0 (Dec 2012); I2E 4.1 (October 2013); next release, in development (Q4 2014). Steps shown: Data Normalisation; Linking Content Servers; Merging the Results (part II)]

Demo

Demo
[Diagram labels: Cambridge; VPN; Nice; Linked Server; Journal Abstracts; Pathology Reports]

Thank you
