In many real-life scenarios, searching for information is not the user's end goal. In this presentation I look into the specific example of corporate strategy and business development in a university setting.
In today's academic institutions, strategic questions concern the dependency on funding instruments, the public-private partnerships that exist (and those that should be extended!), and the match between the topic areas addressed by the research staff and those that policy makers claim are important. The professional search tasks that arise when answering questions in this domain are usually addressed by business intelligence (BI) tools, not by search engines. However, professionals are known to be busy people driven by their own research interests, and not particularly fond of keeping the customer relationship management (CRM) or knowledge management systems up to date for the organisation's strategic interest. This results in incomplete and inaccurate data.
Instead of requiring research staff (or their administrative support) to provide this management information, I will illustrate by example how the desired information usually already exists in the documents inherent to the academic work process. Information retrieval could thus play an important role in the computer systems that support the business analytics involved, and could significantly improve the coverage of entities of interest, i.e., reduce the effort involved in achieving good recall in business analytics. The ranking functionality over the enterprise's (textual) content should, however, not be an isolated component. Our example setting integrates the information derived from research proposals, research publications and the financial systems, providing an excellent motivation for a more unified approach to structured and unstructured data.
1. May 31st, 2013 First SICSA MMI Information Retrieval Workshop
Looking beyond plain text for
document representation in
the enterprise
Arjen P. de Vries
arjen@acm.org
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
4. Strategic and business
development needs
What funding schemes are the primary source
of income?
E.g., can we move to Europe when Dutch funding
dries up?
Who has active relations with partner X?
“Valorisation”; new national funding requirements
What industry sectors do we depend upon?
E.g., how many projects in smart cities? Green
energy? Cloud computing? Etc.
How are strategic decisions implemented?
E.g., has objective “move from Telecom toward ICT”
been achieved, and how does it develop over time?
6. Date: Wed, 15 May 2013 15:14:49 +0200
From: Theme Coordinator “INFORMATION”
To: Group Leaders Information Theme
Subject: List of company relations for internal CWI
distribution
Dear Information Theme Group Leaders,
The theme coordinators have been asked whether they "could
make a short list of company contacts, indicating the nature
of each contact".
Could you send me the names of Dutch companies you are currently
working with or have worked with in the recent past by the end
of Friday 17th May.
The Theme Coordinator
7. Date: Fri, 24 May 2013 11:33:04 +0200
From: Theme Coordinator Life Sciences
To: Group Leaders Life Sciences Team
Subject: Life Sciences: contacts with NL companies?
Dear all,
The CWI themes are currently collecting all contacts we have
with Dutch industry and companies (but also hospitals and TNO
etc.) in order to get an overview. I am doing this for
the theme "Life Sciences".
Can you please send me a list of your contacts with short
description?
Life Sciences Theme Coordinator
8. From: Project Leader Project X
Date: Sun, 26 May 2013 17:34:15 +0200
To: Project X
Subject: [Project X: 33] @WP leaders
X-BeenThere: Project X @ Y.org
Dear WP leaders,
I received the following request from Programme Management:
> May I ask you to send me a list of the EU research and the
international research currently running at the partners that is
related to Project X (international embedding).
This is my most urgent item. Could you send me, as soon as
possible, a list covering the following points:
- a list of running EU projects involving people from your WP;
please indicate who the partners are, the funding source,
whether it is a STREP (or NoE or ...), and whether your WP
provides a participant or the coordinator;
- a list of EU projects applied for, with the same details
- a list of any other international collaborations that are
not covered by a formal project
Please send me the lists as soon as possible, but no later than
Tuesday 18:00. Thank you for your help. The Project Leader
10. The High Cost of Not Finding Info
If you employ 1000 knowledge workers:
50% of content unindexed: $2.5 million/year
6.25% of effort is spent reproducing
information that already exists: $5 million/year
Knowledge workers spend 15-25% of
their time on non-productive
information-related activities
Feldman and Sherman.
IDC Technical Report #29127, 2003
Butler Group Report: Enterprise Search and Retrieval. Oct-2006
“many organisations are frittering away up to 10% of their staff
costs on wasted effort because employees simply can’t find
the right information to do their jobs.”
11. So… “the real world”
“Real” companies (as opposed to
academic institutions) attempt to address
these information needs a priori, by
setting up a Customer Relationship
Management system (CRM)
Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view of
the customer", Communications of the ACM 46(4) (2003): 95-99
13. However…
So-called “Professionals” are well known
to focus on their own expertise
They do not have (or take) the time to
maintain adequate descriptions of their
network, skills, projects, etc.; nor do they
for most other types of “management
overhead”
15. Funding Proposals
Submitted proposals (are supposed to)
pass through the faculty’s (TUD) “contract
managers” or the institute’s (CWI)
“project bureau”
E.g., checks for liability, IPR and valid budget
Proposal and (partial) metadata are added to
a content management system (CMS)
The CMS used at my faculty at TUD is DECOS; a
few other faculties plan to use Microsoft
SharePoint; CWI deploys BSCW
17. Step 1
Index all the proposals submitted with
your favourite IR system
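As a rough illustration of Step 1, here is a minimal sketch of indexing and ranking proposal texts. It assumes a toy in-memory inverted index with plain TF-IDF scoring; the proposal identifiers and texts are hypothetical, and a real deployment would use an existing IR system fed from the CMS.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by a plain TF-IDF sum over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical proposal texts; real input would come from the CMS.
proposals = {
    "P1": "Smart cities sensor networks for green energy",
    "P2": "Cloud computing infrastructure for big data",
    "P3": "Green energy markets and cloud computing",
}
index = build_index(proposals)
ranked = search(index, len(proposals), "green energy")
```

Here `ranked` lists only the proposals that mention a query term, which is the behaviour the later steps build on when structured data is added.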
18. Incompleteness
The DECOS metadata entered is usually
incomplete from the start
For many projects, for example, only the coordinator
is entered as a partner
Also, a proposal’s metadata does not reflect
subsequent changes; e.g., as in PuppyIR:
People hired after funding secured
Partner change when key person moved job
Teams evolved
Priorities shifted
New tasks introduced and tasks (re-)assigned
…
20. Inaccuracy
Key information necessary for strategy &
business development scenarios is missing
Adding it is error-prone
Infer domain (big data, green energy, cloud
computing, …) from keywords or content
Extract names automatically
Copy amounts manually; inconsistencies in
tables in proposal text are not uncommon
21. Incomplete & inaccurate Data
Ambiguity
When describing domain, e.g., cloud
computing vs. clouds in environmental models
Names of people and companies involved
Typos & OCR mistakes
Entity resolution
Amounts of funding per partner, own
contribution
Funding request may not equal funding
received
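The name ambiguity and entity resolution issues listed above can be sketched as follows. This is only an illustration, not the method used in the talk: it assumes greedy clustering on normalised strings with Python's `difflib`, and the suffix list, the threshold of 0.85 and the example mentions are all hypothetical choices.

```python
import difflib
import re
import unicodedata

# Assumed, non-exhaustive list of legal suffixes to ignore when comparing.
LEGAL_SUFFIXES = {"bv", "nv", "ltd", "inc", "gmbh"}


def normalise(name):
    """Strip accents, dots, punctuation, case and common legal suffixes."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    tokens = re.findall(r"[a-z0-9]+", ascii_name.replace(".", "").lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def resolve(mentions, threshold=0.85):
    """Greedily cluster mentions whose normalised forms are similar."""
    clusters = []  # list of (normalised form of first member, members)
    for mention in mentions:
        key = normalise(mention)
        for canonical, members in clusters:
            ratio = difflib.SequenceMatcher(None, key, canonical).ratio()
            if ratio >= threshold:
                members.append(mention)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append((key, [mention]))
    return [members for _, members in clusters]


mentions = [
    "Spinque B.V.",
    "Spinque BV",
    "Spinqe",  # typo or OCR mistake
    "Centrum Wiskunde & Informatica",
    "Centrum Wiskunde en Informatica",
]
entities = resolve(mentions)
```

A string threshold like this will also merge distinct entities with similar names, which is one reason a human check of the results remains useful.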
22. The real world to the rescue (1)
Not much work gets done without
payments…
23. ERP
All large organisations deploy Enterprise
Resource Planning (ERP) systems
Typical modules include accounting, human
resources, manufacturing, and logistics
ERP integrates the modules, data
storing/retrieving processes, and
management and analysis functionalities
Baan, Oracle, PeopleSoft, SAP, …
24. More complete and more
accurate data from ERP
Financial details of each project as executed
Project leader
People who are reimbursed from the project
Exact duration of project activities
...
25. Step 2
Index all the ERP data with your favourite
DB + IR system
Link the ERP project identifiers to the CMS
proposal identifiers
Surprisingly, an n:m relationship…
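Because the relationship between ERP projects and CMS proposals is n:m, a single foreign key on either side is not enough; a separate link table is the usual way to represent it. The sketch below assumes hypothetical identifiers (`ERP-…`, `CMS-…`) and keeps the link table in memory.

```python
from collections import defaultdict

# Hypothetical link table; real identifiers come from the ERP and CMS exports.
links = [
    ("ERP-100", "CMS-7"),  # one proposal executed as two ERP projects
    ("ERP-101", "CMS-7"),
    ("ERP-102", "CMS-8"),  # one ERP project covering two proposals
    ("ERP-102", "CMS-9"),
]

# Index the link table in both directions.
proposals_of = defaultdict(set)
projects_of = defaultdict(set)
for erp_id, cms_id in links:
    proposals_of[erp_id].add(cms_id)
    projects_of[cms_id].add(erp_id)


def sibling_projects(erp_id):
    """Other ERP projects that share at least one proposal with erp_id."""
    siblings = set()
    for cms_id in proposals_of.get(erp_id, set()):
        siblings |= projects_of[cms_id]
    return siblings - {erp_id}
```

With these example rows, `sibling_projects("ERP-100")` finds `ERP-101` through the shared proposal `CMS-7`.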
27. Institutional Repository
Publication metadata helps validate
(and may even extend) the existing
management info required:
Authors
Author affiliations
Projects and funding schemes (from
acknowledgements)?
Again incomplete data though…
My faculty especially is notoriously bad at
maintaining its part of the institutional
repository
28. Step 3
Crawl the Institutional Repository using
the Open Archives Initiative (OAI)
harvesting protocol
Index all the publications data with your
favourite DB + IR system
Relate projects to publications by author
name, similar title, etc.
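Relating projects to publications by author name and similar title could look roughly like the sketch below. It assumes exact (case-insensitive) author matching and character-level title similarity from `difflib`, blended with an arbitrary weight of 0.5; the records and field names are hypothetical, and harvesting them over OAI is left out.

```python
import difflib


def title_similarity(a, b):
    """Character-level similarity in [0, 1] between two titles."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_score(project, publication, title_weight=0.5):
    """Blend author overlap and title similarity into one score in [0, 1]."""
    members = {m.lower() for m in project["members"]}
    authors = {a.lower() for a in publication["authors"]}
    overlap = len(members & authors) / len(authors) if authors else 0.0
    similarity = title_similarity(project["title"], publication["title"])
    return (1 - title_weight) * overlap + title_weight * similarity


def rank_publications(project, publications):
    """Return (score, publication id) pairs, best match first."""
    scored = [(link_score(project, pub), pub["id"]) for pub in publications]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical records for illustration only.
project = {
    "title": "Information retrieval for children",
    "members": ["A. Author", "B. Writer"],
}
publications = [
    {"id": "pub1", "title": "Information retrieval for young children",
     "authors": ["A. Author", "C. Other"]},
    {"id": "pub2", "title": "Probabilistic databases in practice",
     "authors": ["D. Someone"]},
]
ranking = rank_publications(project, publications)
```

Returning a ranked list rather than a hard yes/no link fits the incomplete and inaccurate data described earlier: a person can inspect the top candidates.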
29. Result: Unified Access
Proposals
from an XML dump of the CMS
Actual project administration
from CSVs extracted from ERP
Publications
crawled using OAI, from the IRP
33. How to search that graph???!
Rank (un-/semi-)structured data to deal
with incompleteness & inaccuracies
Structured data representation for
attributes including project revenue,
people’s names, starting dates, etc.
Use cases varying from “expert search” to
“data cleaning” and “visual analytics”
34. Search by Strategy
First, visually construct search strategies
by connecting “building blocks”
Next, generate the search engine specified
by that search strategy
36. Strategies: DB+IR query plans
From strategy to relational DB:
• Data flow (Spinque: strategy)
Building blocks with inputs and an output, e.g. BB1(in1,in2,in3, u1,u2) and BB2(in1)
• Query: strategy made operational (Spinque: PRA)
• Database (Spinque: RDBMS (MonetDB))
CREATE VIEW a AS
SELECT ..
CREATE VIEW b AS
SELECT ..
CREATE VIEW c AS
SELECT ..
37. Probabilistic Relational Algebra
From strategy to relational DB:
• SQL: explicit probabilities
CREATE VIEW x AS
SELECT a1, a3,
1-prod(1-prob) AS prob
FROM y
GROUP BY a1, a3;
• PRA: probabilistic relational algebra
(Fuhr and Roelleke, TOIS 2001)
x = Project DISTINCT [$1,$3](y);
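The projection above merges duplicate tuples by combining their probabilities as 1 - prod(1 - prob), which treats the duplicates as independent events. A minimal Python sketch of that aggregation follows; the function name `project_distinct` and the rows in `y` are hypothetical, and this is not Spinque's implementation.

```python
from collections import defaultdict


def project_distinct(rows, columns):
    """Project probabilistic tuples onto `columns` (0-based) and merge
    duplicates with 1 - prod(1 - p), assuming independent events."""
    not_prob = defaultdict(lambda: 1.0)  # running product of (1 - p) per key
    for *values, prob in rows:
        key = tuple(values[c] for c in columns)
        not_prob[key] *= 1.0 - prob
    return {key: 1.0 - q for key, q in not_prob.items()}


# Hypothetical relation y(a1, a2, a3, prob).
y = [
    ("alice", "doc1", "cloud", 0.5),
    ("alice", "doc2", "cloud", 0.5),
    ("bob", "doc3", "energy", 0.8),
]

# $1 and $3 in the PRA expression are 1-based; here they become 0 and 2.
x = project_distinct(y, columns=(0, 2))
```

The two `alice` rows collapse into one tuple with probability 1 - 0.5 * 0.5 = 0.75, while the single `bob` row keeps its probability of 0.8.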
43. Result List Interactions
Zoom in on item using “+”:
Open item in left pane
Shows results of item as query, using a
result-type specific search strategy
Goal: provide the contextually most related nodes
from the underlying graph
Marking any item red/yellow/green for
later use
48. Strategic and business
development needs
What are our industry relations?
Which of these partners collaborate with
more than one group?
What funding schemes support these
collaborations?
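Once the integrated data is available, the second and third question reduce to a grouping over (partner, group, funding scheme) records. The sketch below assumes such records have already been extracted; all partner, group and scheme names are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical (partner, group, funding scheme) records from the unified data.
collaborations = [
    ("Partner A", "Group 1", "Scheme X"),
    ("Partner A", "Group 2", "Scheme Y"),
    ("Partner B", "Group 1", "Scheme X"),
    ("Partner C", "Group 2", "Scheme Y"),
    ("Partner C", "Group 3", "Scheme Y"),
]

groups_of = defaultdict(set)
schemes_of = defaultdict(set)
for partner, group, scheme in collaborations:
    groups_of[partner].add(group)
    schemes_of[partner].add(scheme)

# Partners that collaborate with more than one group, with the funding
# schemes that support those collaborations.
multi_group = {
    partner: sorted(schemes_of[partner])
    for partner, groups in groups_of.items()
    if len(groups) > 1
}
```

With these example rows, `multi_group` contains Partner A and Partner C but not Partner B, which works with a single group only.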
51. Multi party relations
Grouping of external relations: Foreign Univ., NL Univ., Funding agency, Public NL, Public foreign, Private sector
Note: External relations with at least two departments; node size w.r.t. number of relations
52. Initial Findings
The integrated search helps improve
recall, reducing the effort involved and
leading to higher quality analyses
Many things that could be done even
more automatically (albeit not perfectly)
seem less important than expected
We use very simple rules to extract URIs and
companies; no information extraction yet
Information professionals will always look into
the results in detail
53. Open issues
Integrate visualization
Idea: select result list and facet
Too many facets
Idea: group facets
Result explanations
Idea: describe path through graph
Entity support ++
54. Open issues
What strategy is good? Why?
Idea: test using past usage data
What are the right user roles?
Who should do the searches?
Who should write strategies?
~ who writes the SQL queries in traditional DB?
Human in the loop for retrieval, but not
yet for indexing…