2. Data Search
Searching and Finding information in
Unstructured and Structured Data Sources
Erik Fransen, Senior Business Consultant
Centennium BI expertisehuis
The Hague, The Netherlands
e.fransen@centennium.nl
IRM UK, DW/BI 2009, London
November 3, 11.00-12.00 P.M.
3. Agenda
• Introduction;
• Industry models;
• Combining structured & unstructured data
– “Pure Portal”
– “Index it all”
– “Structure it all”
• Summary.
6. Combining BI with unstructured data
• Integrated access to relevant information (‘provide complete picture’);
• Unstructured data like documents provide valuable context to numerical
data;
– Customer complaints
– Competitor’s press releases
– Marketing documents
– …
• Insurance fraud analysis (e.g. claim statistics and claim forms);
• Competitive Intelligence (e.g. market share data and competitor news);
• Customer retention (e.g. sales data and customer complaints);
• Data Search acts as a bridge between structured and unstructured data.
7. (un)structured data keeps growing….
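[Figure: growth of data volume (gigabytes) over time, with the years 1999, 2000, 2001, 2005 and 2009 marked; more than 80% is unstructured. Milestones: cave paintings and bone tools (40,000 BC), writing (3500 BC), paper (105), printing (1450), electricity and telephone (1870), transistor (1947), computing (1950), Internet (DARPA, late 1960s), the Web (1993). Database milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03. Source: Forrester]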
8. Industry Model: Text Data
Bill Inmon’s DW 2.0™
• Hold data at the lowest detail;
• Hold data to infinity;
• Have integrity of data and have
online high-performance
transaction processing;
• Tightly couple metadata to the
data warehouse environment;
• …
• Link structured data
and unstructured data;
11. Data Search Scenarios
Searching and Finding information in
Unstructured and Structured Data Sources
12. Global architecture
[Figure: global architecture. Structured side: OLTP, financial apps, ODS, DWH, data marts, OLAP cubes, reports, mining…, with master & meta data. Unstructured side: content management system, fileservers, database, email, intranet/internet, with search index, search, text mining and visualisation. Middleware and a portal sit between the two sides.]
13. Three data search scenarios
[Figure: the global architecture of slide 12, with the three scenarios marked on it: “Structure it all”, “Index it all” and “Pure Portal”.]
14. Scenario 1: Pure Portal
Many portlets, one user interface;
Business user may manually combine content
from several independent sources;
Risk: too complex for user.
15. 1: Pure Portal
[Figure: the global architecture of slide 12, with the “Pure Portal” scenario marked.]
19. Scenario 2: “index it all”
Enterprise Search from one user interface;
Business user knows what to look for and expects
a “complete picture” as a result;
Risk: Many irrelevant search results due to the
nature of document indexing.
20. 2: Index it all
[Figure: the global architecture of slide 12, with the “Index it all” scenario marked.]
21. Scenario 2: “Index it all”
[Figure: unstructured data sources feed a search index, which a search application exposes through the user interface. Structured data sources feed the data warehouse architecture, on which a BI application produces reports. A BI report is indexed as if it were a document.]
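The idea of indexing a BI report "as if it was a document" can be sketched in a few lines. This is a minimal illustration only, not how any particular enterprise search product works: the `SearchIndex` class, the item ids and the sample texts are all hypothetical, and matching is plain exact-token lookup.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    """A tiny inverted index: token -> set of item ids (hypothetical design)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.items = {}  # item id -> kind ("document" or "bi_report")

    def add(self, item_id, kind, text):
        # Documents and BI reports go through exactly the same code path.
        self.items[item_id] = kind
        for token in tokenize(text):
            self.postings[token].add(item_id)

    def search(self, query):
        """Return sorted (item_id, kind) pairs that contain every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted((item_id, self.items[item_id]) for item_id in hits)


index = SearchIndex()
index.add("complaint-17.doc", "document", "Customer complaint about late claim payment")
index.add("claims-by-region", "bi_report", "Claim statistics by region and quarter")
index.add("press-release.pdf", "document", "Competitor announces new pricing")

results = index.search("claim")
for item_id, kind in results:
    print(kind, item_id)
```

Because the report is reduced to its text, the index knows nothing about measures or dimensions; that is one way to read the risk of irrelevant results named for this scenario.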
22. Example: IBM Cognos 8 Go! Search
Integration with enterprise search applications (IBM OmniFind, Google OneBox for Enterprise, Yahoo, Autonomy).
Search results return all relevant structured content (reports, analyses, etc.) and unstructured content (Word documents, PDFs, etc.) within a single interface.
27. Scenario 3: “Structure it all”
Generate structure using document warehousing
and text mining;
Business user knows exactly what to look for;
Risk: Limited flexibility for user.
28. 3: Structure it all
[Figure: the global architecture of slide 12, with the “Structure it all” scenario marked.]
29. Generating structure in document warehouse
• Identify Sources
– Sources are not fixed
– Iterative process, sources lead to new sources
• Retrieve Documents
– Internal source retrieval: file servers, CMS/DMS
– External source retrieval, using crawlers, spiders
• Preprocess Documents
– Format documents in a consistent manner
– Files must be in suitable form for text analysis
• Text Mining
– Linguistic analysis
– Key features are extracted
– Indexing documents
– Summarizing documents
• Compile Metadata
– Carefully attach metadata to document
– Used for querying, matching, navigation support
– Store in document warehouse
Source: Dan Sullivan
[Figure: data warehouse architecture and document warehouse architecture; combine (meta)data.]
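The preprocess, text mining and metadata steps can be sketched as a small pipeline. This is a rough illustration under stated assumptions: `DocumentRecord`, the stopword list, the frequency-based keyword extraction and the first-sentence summary are hypothetical stand-ins for what a real text mining tool would do with linguistic analysis.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "its"}


@dataclass
class DocumentRecord:
    """One entry in a hypothetical document warehouse."""
    source: str
    text: str
    summary: str = ""
    keywords: list = field(default_factory=list)


def preprocess(raw):
    """Bring a document into a consistent plain-text form."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)        # drop simple markup
    return re.sub(r"\s+", " ", no_tags).strip()   # normalise whitespace


def extract_keywords(text, top_n=3):
    """Crude key-feature extraction: the most frequent non-stopword tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def summarize(text):
    """Crude summary: the first sentence of the document."""
    return re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]


def build_record(source, raw):
    """Preprocess, mine and attach metadata, ready to store in the warehouse."""
    text = preprocess(raw)
    return DocumentRecord(source, text, summarize(text), extract_keywords(text))


raw = "<p>Competitor lowers roaming tariff.  The new roaming tariff starts in May.</p>"
record = build_record("press-release.html", raw)
print(record.summary)
print(record.keywords)
```

The resulting record carries the kind of metadata (summary, keywords, source) that can later be used for querying and matching, or be linked to the data warehouse.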
30. Document warehouse
Contains complete documents or URLs
Metadata about documents:
summaries, authors’ names, publication
dates, titles, sources, keywords, etc.
Translations of documents
Thematic clustering of similar documents
Topical or thematic indexes
Extracted key features (structure)
Dimensions and Facts, linked to documents,
summaries etc.
Combine with the data warehouse
31. BI reporting on dimensional model
[Figure: dimensional model spanning both warehouses. Fact tables: Sales Facts (data warehouse) and Call Facts (document warehouse). Dimensions: Product, Customer, Time, Sales person, Action, Competitor, Telco Term.]
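A report over both warehouses could look roughly like the sketch below, which uses Python's built-in `sqlite3` module. The table and column names are hypothetical, and it is an assumption here that the customer dimension is shared between the sales facts and the call facts (the latter derived from text-mined call notes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
-- Facts from the data warehouse (hypothetical).
CREATE TABLE sales_facts (customer_id INTEGER, amount REAL);
-- Facts from the document warehouse: one row per competitor mention in a call.
CREATE TABLE call_facts (customer_id INTEGER, competitor TEXT);

INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO sales_facts VALUES (1, 100.0), (1, 50.0), (2, 80.0);
INSERT INTO call_facts VALUES (1, 'TelcoX'), (1, 'TelcoX'), (1, 'TelcoY');
""")

# Each fact table is aggregated separately per customer, so joining two
# fact tables cannot multiply rows and inflate the totals.
query = """
SELECT c.name,
       (SELECT COALESCE(SUM(s.amount), 0) FROM sales_facts s
         WHERE s.customer_id = c.customer_id) AS total_sales,
       (SELECT COUNT(*) FROM call_facts f
         WHERE f.customer_id = c.customer_id) AS competitor_mentions
FROM dim_customer c
ORDER BY c.name
"""
rows = conn.execute(query).fetchall()
for name, total_sales, mentions in rows:
    print(name, total_sales, mentions)
```

A customer with solid sales but many competitor mentions in call notes is the kind of combined signal the customer retention use case is after.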
32. Generate structure using text mining tools
Example taken from SPSS PASW Text Analytics; many other tools are available (IBM, SAS, Oracle, SAP BO, Microsoft, etc.).
33. Generating structure using UIMA
• Unstructured Information Management Architecture
• Originates from IBM, now Apache UIMA: http://incubator.apache.org/uima/
• UIMA is supported by all main BI vendors.
Source: IBM
34. Example: Generating structure using UIMA
• Text is analyzed by a collection of text analytics;
• Detected semantic entities and relations are highlighted;
• Results are represented in the UIMA Common Analysis Structure (CAS).
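The general idea behind such a structure, annotations stored as typed spans alongside the untouched original text, can be illustrated in plain Python. This is not the Apache UIMA API: the `Annotation` class, the `annotate` function and the regex "annotators" are hypothetical simplifications, and relations between entities are left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a typed span over the original text."""
    type: str
    begin: int
    end: int


def annotate(text, patterns):
    """Run each regex 'annotator' and collect typed spans, ordered by offset."""
    found = []
    for type_name, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            found.append(Annotation(type_name, match.start(), match.end()))
    return sorted(found, key=lambda a: (a.begin, a.end))


text = "Erik Fransen spoke in London on 3 November 2009."
patterns = {
    "Person": r"Erik Fransen",
    "Location": r"London",
    "Date": r"\d{1,2} November \d{4}",
}

annotations = annotate(text, patterns)
for a in annotations:
    # The covered text is recovered from the offsets; the source text is never modified.
    print(a.type, repr(text[a.begin:a.end]))
```

Keeping the annotations separate from the text lets several independent analytics add their own types over the same document, and the typed spans can then be loaded as structure into dimensions and facts.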
35. Summary
• Growing business need for combining BI with
unstructured data;
• Data Search bridges the gap between both
worlds
– Scenario 1: “Pure Portal”
– Scenario 2: “Index it all”
– Scenario 3: “Structure it all”
• Scenarios can be combined.
Questions?