II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)
1. The Road to Federated Text Mining: Are we there yet?
II-SDV 2014
Guy Singh
2. What is federated search?
“Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation”
- Wikipedia
3. Current Situation
• Volume of data ever increasing
• Proprietary content can reside within Enterprise
• No need for everyone to keep standard sources up-to-date
• Data from content providers can reside on their sites
Linguamatics Customer Confidential
[Diagram labels: Internal Content; External Content; MEDLINE; Clinical Trials; Publisher Content; FDA Drug Labels; Patents]
4. Increasing Range of Data Sources
[Diagram: Data Sources: Scientific Literature; Social Media; News; Web Pages; Internal Documents; Patents; RSS; Clinical Trials]
5. Varying in Structure
6. How does text mining differ from keyword search?
Example: What genes affect breast cancer
7. Why does document structure matter?
• Searching across documents using keywords is relatively trivial
– Do not need to be aware of where the words occur and in what context
• Text mining documents with varying structure requires a more sophisticated approach; need to:
– Know where words matching entities/concepts occur
– Disambiguate depending on context and location
– Find terms in particular regions/parts of document for targeted searches
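The difference between a plain keyword match and a region-targeted match can be sketched as follows. This is a minimal illustration, assuming a document is represented as a mapping from region name to text; the sample report, the region names, and the functions `keyword_search` and `region_search` are hypothetical and not part of I2E.

```python
import re

# Hypothetical document: region name -> text. Real sources label regions differently.
report = {
    "title": "Pathology report",
    "history": "Family history of breast cancer.",
    "diagnosis": "No evidence of malignancy.",
}

def keyword_search(doc, term):
    """Plain keyword search: ignores where in the document the term occurs."""
    pattern = re.escape(term)
    return any(re.search(pattern, text, re.IGNORECASE) for text in doc.values())

def region_search(doc, term, region):
    """Targeted search: the term only counts if it occurs in the named region."""
    text = doc.get(region, "")
    return re.search(re.escape(term), text, re.IGNORECASE) is not None

print(keyword_search(report, "breast cancer"))              # True
print(region_search(report, "breast cancer", "diagnosis"))  # False
```

The keyword search reports a hit because the term appears in the family history, while the region-targeted search shows the term is absent from the diagnosis, which is the distinction a targeted query depends on.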
8. Approaches to dealing with different data sources
• Integrate the data together into a data warehouse
– Extract, Transform and Load each data source into a new database
– Multiple copies of the data
– Data normalisation can be difficult
– Time-consuming and expensive process
– Most database vendors take this approach
– Allows users to perform a single search across all the content
• Leave the data where it is, federated content
– Data remains in its original form and location
– Multiple data types
– Multiple network locations
– Single search across multiple different data sources
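The federated approach, where one query is sent to each source in place, might look like the sketch below. It assumes in-memory lists as stand-ins for remote content sources; `SOURCES`, `search_source`, and `federated_search` are hypothetical names chosen for illustration, and a real deployment would call each server over the network instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for independent content sources.
SOURCES = {
    "internal_documents": ["BRCA1 assay notes", "Project plan"],
    "fda_drug_labels": ["Label: tamoxifen", "Label: BRCA1 companion test"],
}

def search_source(name, query):
    """Search one source where it lives; return (source, hit) pairs."""
    return [(name, doc) for doc in SOURCES[name] if query.lower() in doc.lower()]

def federated_search(query):
    """Send the same query to every source and collect the per-source hits."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_source, name, query) for name in SOURCES]
        results = []
        for future in futures:  # iterate in submission order for stable output
            results.extend(future.result())
    return results

print(federated_search("brca1"))
```

No data is copied into a central store here: each source answers the query itself, and only the hits travel back to the caller.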
9. How do we get to Federated Text Mining?
[Diagram: Data Normalisation; Link the Content Servers; Merge Results; Federated Text Mining]
10. Data Normalisation – Virtual Indexes
[Diagram: Pathology Reports Index; Journal Abstracts Index; Virtual Index]
11. Data Normalisation – Document Structure
[Diagram: Pathology Reports; Journal Abstracts]
12. Data Normalisation – Entities
[Diagram: Journal Abstracts; Pathology Reports; Combined (Normalized)]
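One way to picture entity normalisation across two sources is to map each surface form to a single preferred term before combining counts. The sketch below is an assumption about the general idea, not a description of how I2E does it; the synonym table `PREFERRED`, the function `normalise`, and the sample mentions are all hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: surface form (lowercase) -> preferred term.
PREFERRED = {
    "breast cancer": "Breast Neoplasms",
    "breast carcinoma": "Breast Neoplasms",
    "carcinoma of the breast": "Breast Neoplasms",
}

def normalise(mention):
    """Map a mention to its preferred term; leave unknown mentions unchanged."""
    cleaned = mention.strip()
    return PREFERRED.get(cleaned.lower(), cleaned)

# Hypothetical mentions extracted from two differently written sources.
journal_abstracts = ["breast cancer", "Breast carcinoma"]
pathology_reports = ["carcinoma of the breast", "fibroadenoma"]

# Combined (normalised) view: counts are comparable across both sources.
combined = Counter(normalise(m) for m in journal_abstracts + pathology_reports)
print(combined)
```

Without the mapping, the three breast cancer mentions would be counted as three unrelated terms, so results from the two sources could not be combined meaningfully.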
14. Linked Servers
Development Status
• I2E 4.1 introduced a new feature – Linked Server
• One I2E server can be linked to another I2E server
• Provides access to remote and local indexes and queries through a single I2E interface (Linked Servers)
– Indexes and queries on remote servers on the network appear the same as local indexes
15. I2E 4.1 Linked Servers
[Diagram labels: I2E Enterprise on Customer network; I2E OnDemand SaaS Infrastructure; In-house Indexes; I2E OnDemand Standard Indexes; I2E Enterprise Access Custom Indexes; Access via Linked Servers; Access via single UI]
20. Each Server supplying separate set of results
[Diagram: Content Server 1; Content Server 2; Content Server 3; Content Server 4]
Merge into a single set of results
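Merging per-server result sets into one list could be sketched as below. This assumes each server returns hits already sorted by a relevance score and that scores are comparable across servers, which the slides do not state; the tuples, server names, and `merge_results` are hypothetical.

```python
import heapq

# Hypothetical per-server results as (score, server, document),
# each list already sorted by descending score.
server_1 = [(0.9, "server 1", "doc A"), (0.4, "server 1", "doc B")]
server_2 = [(0.7, "server 2", "doc C"), (0.2, "server 2", "doc D")]

def merge_results(*result_sets):
    """Merge already-sorted result sets into one list ordered by descending score."""
    return list(heapq.merge(*result_sets, key=lambda hit: hit[0], reverse=True))

merged = merge_results(server_1, server_2)
print([doc for _, _, doc in merged])  # ['doc A', 'doc C', 'doc B', 'doc D']
```

`heapq.merge` interleaves the sorted inputs without re-sorting everything, so the approach extends to any number of content servers.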
23. I2E 4.0: Multiple Clients, Multiple Results
[Diagram: I2E Server 2 (FDA Drug Labels); I2E Server 1 (Internal Documents); external network; internal network]
24. I2E 4.1/4.2: Single Client, Multiple Results
[Diagram: I2E Server 2 (FDA Drug Labels); I2E Server 1 (Internal Documents); external network; internal network; Linked server]
26. Q4 2014: Single Client, Single Result, Multiple Servers
[Diagram: I2E Server 2 (FDA Drug Labels); I2E Server 1 (Internal Documents); external network; internal network; Linked server]
27. Q4 2014: Federated Text Mining Example
• Single Query
• Differently structured data sources on different servers
– Journal Articles (PubMed Central) on Enterprise Server
– MEDLINE on I2E OnDemand
• Single set of results
28. The Road to Federated – Are we there yet?
[Timeline: I2E 4.0 (Dec 2012); I2E 4.1 (October 2013); Next release: in development (Q4 2014)]
[Milestones: Merging the Results (part II); Data Normalisation; Linking Content Servers]