1. HORIZONTAL INTEGRATION OF
BIG INTELLIGENCE DATA
The Role of Ontology in the Era of Big Data
T. Malyuta, Ph. D
New York City College of Technology, NY, NY
B. Smith, Ph. D
University at Buffalo, Buffalo, NY
R. Rudnicki
CUBRC, Buffalo, NY
2. 2
Big Data Problem
• Wikipedia defines Big Data as “…a collection of
data sets so large and complex that it becomes
difficult to process using on-hand database
management tools.”
• Gartner defines Big Data with three 'V's:
• Volume
• Velocity (of production and analysis)
• Variety
• This means that Big Data are beyond our control
(as opposed to those complex and big systems with
diverse and changing data where the complexity is
known)
3. 3
Big Data Solution – Agility
• Dimensions of agility
• Storage paradigms that accommodate massive volumes of
heterogeneous data
• Data processing paradigms that can deal with the massive volumes
of heterogeneous data coming onstream
• Dynamic data stores that can easily accommodate diverse and a
priori unknown data types and semantics
• Methods and tools that leverage dynamic and diverse content
4. 4
Agile Integration and Interoperability
• Today, the main problem with Big Data is using it
• Utilization of 'Variety' – diverse types and semantics –
requires data integration and interoperability
• Traditional integration approaches fail
• Agile integration paradigms are needed
5. 5
The Problem of Horizontal Integration of
Big Intelligence Data
• HI =Def. the ability to exploit multiple data sources as
if they are one
• Recognized issues for HI with existing approaches
• Data silos
• Lexicon/semantics silos
• Requirement for HI of Big Intelligence Data – Agile
Semantic Interoperability
A strategy for HI must be agile in the sense that it can be quickly
extended to new zones of emerging data according to need
Ontology allows an incremental approach – a return already from
the very first investment (as we showed for I2WD)
Ontology can provide the needed agility
6. 6
Agile Semantic Interoperability
• A good solution has to be
• Able to grow incrementally
• Able to be developed in a distributed manner
• Without losing consistency
• Independent of particular implementations, and data producers and
consumers
• Applicable to data in an agile manner
• We call our solution: 'semantic enhancement' (SE) of data
7. 7
SE
• SE is realized with the help of ontologies that are used
to annotate (tag) data
• Vocabulary of ontologies used for annotations provides agile
horizontal integration
• Ontologies, by virtue of their nature and organization, provide
semantic enhancement of data
[Figure: ontology fragment with the nodes Skill, ComputerSkill, ProgrammingSkill, SQL, Java, C++, Education, and Technical Education, used to annotate the table below]
PersonID Name Description
111 Java Programming
222 SQL Database
8. 8
The Meaning of 'Enhancement'
• Semantic enhancement/enrichment of data = arm's
length approach (no change to data) – through
simple annotation we associate an entire knowledge
system with a database field
• Enables analytics to process data, e.g. about computer
skills, “vertically” along the Skill hierarchy, as well as
“horizontally” via relations between Skill and Education
• Furthermore, while the data in the database do not change, their
analysis can become richer and richer as our understanding of
reality changes
• For this richness to be leveraged by different
communities, persons, and applications it needs to
have the properties mentioned above and be
constructed in accordance with the principles of the
SE
9. 9
SE Principles
• Create a Shared Semantic Resource (SSR) of
ontologies to be used for annotation
• Establish an agile strategy for building ontologies
within this SSR, and apply and extend these
ontologies to annotate new source data as they
come onstream
• Strategy pioneered in biomedical and other scientific fields:
leaves data as they are, and incrementally tags data
sources with terms from a growing, consistent,
non-redundant set of ontologies
• Problem: Given the immense and growing variety of
data sources, the development methodology must
be applied by multiple different groups
10. 10
Achieving the Goal
• Methodology of incremental distributed ontology
development
• A common ontology architecture incorporating a
common, domain-neutral, upper-level ontology
(BFO)
• A shared governance and change management
process
• A simple, repeatable process for ontology
development
• An ontology registry
• A process of intelligence data capture through
'annotation' or 'tagging' of source data artifacts
11. 11
Main Methodological Points
• Ontological realism
• Based on Doctrine;
• Involves SMEs in label selection and definition
• Thoroughly tested*
• Arms-length process, with minimal disturbance to existing
data and data semantics
• Reference ontologies – capture generic content and are
designed for aggressive reuse in multiple different types of
context
• Single reference ontology for each domain of interest
• Application ontologies – are tied to specific local
applications
• An application ontology is created by combining local content with generic
content taken over from relevant reference ontologies
• Application ontologies remain interoperable because they are based on a common set of reference ontologies
* Barry Smith and Werner Ceusters, “Ontological Realism as a Methodology for Coordinated Evolution of
Scientific Ontologies”, Applied Ontology, 5 (2010), 139–188.
12. 12
Arms-length Process
• Focusing on the terms (labels, acronyms, codes) used in our
source data.
• Where multiple distinct terms {t1, …, tn} are used in separate
data sources with one and the same meaning, they are
associated with a single preferred label drawn from a standard
set of such labels
• All the separate data items associated with the {t1, …, tn} are thereby
linked together through the corresponding preferred labels.
• Preferred labels form the basis for the ontologies we build
[Figure: SE ontology labels linked to heterogeneous contents that use the native terms ABC, KLM, and XYZ]
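The linking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it reuses the three person-identifier fields from the sources example later in the deck, and the function name `link_by_preferred_label` is an assumption made for this sketch.

```python
from collections import defaultdict

# Native terms from three sources that share one meaning, each associated
# with a single preferred label. The source data are left unchanged.
PREFERRED_LABEL = {
    "Db1.PersonID": "SE.PersonID",
    "Db2.ID": "SE.PersonID",
    "Db3.EmplID": "SE.PersonID",
}

# Data items as (native term, value) pairs, as they appear in the sources.
DATA_ITEMS = [
    ("Db1.PersonID", 111),
    ("Db2.ID", 333),
    ("Db3.EmplID", 444),
]


def link_by_preferred_label(items, preferred):
    """Group data items under the preferred label of their native term."""
    linked = defaultdict(list)
    for term, value in items:
        label = preferred.get(term)
        if label is not None:  # terms without an annotation are skipped
            linked[label].append((term, value))
    return dict(linked)


linked = link_by_preferred_label(DATA_ITEMS, PREFERRED_LABEL)
print(linked["SE.PersonID"])
```

Because the mapping lives outside the sources, adding a new source only means adding entries to `PREFERRED_LABEL`; nothing in the existing data needs to change.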
13. 13
Reference and Application Ontologies
Reference Ontology
• vehicle =def. an object used for transporting people or goods
• tractor =def. a vehicle that is used for towing
• crane =def. a vehicle that is used for lifting and moving heavy objects
• vehicle platform =def. means of providing mobility to a vehicle
• wheeled platform =def. a vehicle platform that provides mobility through the use of wheels
• tracked platform =def. a vehicle platform that provides mobility through the use of continuous tracks
Application Definitions
• artillery vehicle =def. vehicle designed for the transport of one or more artillery weapons
• wheeled tractor =def. a tractor that has a wheeled platform
• tracked tractor =def. a tractor that has a tracked platform
• artillery tractor =def. an artillery vehicle that is a tractor
• wheeled artillery tractor =def. an artillery tractor that has a wheeled platform
14. 14
Illustration of Ontology Types (Toy Example)
[Figure: is-a hierarchy over Vehicle, Tractor, Artillery Vehicle, Wheeled Tractor, Artillery Tractor, and Wheeled Artillery Tractor. Legend: black – reference ontologies; red – application ontologies.]
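The toy hierarchy can be written down as a small is-a graph, which makes it easy to check that every application term resolves to reference terms. The sketch below is an assumption-laden illustration in Python: the parent links are read off the definitions on the previous slide rather than from the figure itself, and the names `PARENTS`, `ancestors`, and `is_a` are invented for this example.

```python
# Direct is-a parents for each type, derived from the slide definitions
# (e.g. an artillery tractor is an artillery vehicle that is a tractor).
PARENTS = {
    "vehicle": [],
    "tractor": ["vehicle"],
    "artillery vehicle": ["vehicle"],
    "wheeled tractor": ["tractor"],
    "artillery tractor": ["artillery vehicle", "tractor"],
    # An artillery tractor with a wheeled platform is also a wheeled tractor.
    "wheeled artillery tractor": ["artillery tractor", "wheeled tractor"],
}

# Terms defined in the reference ontology; the rest are application terms.
REFERENCE_TERMS = {"vehicle", "tractor"}


def ancestors(term):
    """Return every type that `term` is a subtype of, transitively."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def is_a(term, candidate):
    """True if `term` is the same type as, or a subtype of, `candidate`."""
    return term == candidate or candidate in ancestors(term)


print(sorted(ancestors("wheeled artillery tractor")))
print(is_a("artillery tractor", "vehicle"))
```

Every application term here has at least one reference term among its ancestors, which is what keeps separately developed application ontologies interoperable.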
15. 15
Role of Reference Ontologies
• Normalized
• Maintains a set of consistent ontologies
• Eliminates redundancy
• Modular
• A set of plug-and-play ontology modules
• Enables distributed consistent development
• Surveyable
16. 16
SE Architecture
• The Upper Level Ontology (ULO) in the SE hierarchy
must be maximally general (no overlap with domain
ontologies)
• The Mid-Level Ontologies (MLOs) introduce successively
less general and more detailed representations of types
which arise in successively narrower domains until we
reach the Lowest Level Ontologies (LLOs).
• The LLOs are maximally specific representations of the
entities in a particular one-dimensional domain
18. 18
Current State
• Completed
• Data Representation and Integration Framework (DRIF):
architectural solution and implementation to create Dataspace
(cloud of intelligence data)
• Lossless representation of sources with their native semantics
• Semantic Enhancement (SE): suite of prototype ontologies with
coverage allowing annotation of these native semantics
• Index exposing the content of the Dataspace via SE with proven
benefits
• Methodology and architecture for ontology development
• In progress
• Assembling the Shared Semantic Resource (SSR) as a separate
store and enabling its use outside the Dataspace; in discussions
with various agencies
19. 19
The SSR
[Figure: DoD, Air Force, Navy, and NSA use the Reference Ontologies (Shared Semantic Resource) and Application Ontologies for purposes of Intelligence Reporting, Video Analysis, and NLP Analysis. Ontology labels shown: Geospatial, Weapon, Agent, Organization, Information, Artifact, Event, Agent-related, Weapon-related, …]
20. 20
Challenges to HI
• Too many lexicons
• The scope of the domain: signal, sensor, image, …
intelligence about … the whole world
• Difficult to conduct governance and management of
ontology development to ensure consistent evolution
• Lack of expertise
21. 21
Preventing Failure
• The method we use offers solutions to some of the common
reasons for failure
• Lack of Consensus
• Realism offers an objective standard for settling disputes over
terminology. Ontology development becomes an empirical science
instead of an exercise in the publication of dialects
• Governance helps to resolve conflicts and achieve consensus
• High Maintenance
• Arm's length implementation places no additional overhead onto
applications
• Parochialism
• Architecture and methodology prevent development of vocabularies
that apply only to a single perspective
• Poor Quality
• Experience prevents common mistakes in vocabularies that cause
downstream problems with search and analytics
23. 23
Integrated Store of Intelligence Data
• Lossless integration without heavy pre-processing
• Ability to:
• Incorporate multiple integration models / approaches /
points of view of data and data-semantics
• Perform continuous semantic enrichment of the
integrated store
• Scalability
24. 24
Solution Components
• Cloud implementation
• Cloudbase (Accumulo)
• Data Representation and Integration Framework
• Comprehensive unified representation of data, data semantics, and
metadata
• This work was funded by US Army CERDEC Intelligence
and Information Warfare Directorate (I2WD)
25. 25
Dealing with Semantic Heterogeneity
• Physical integration. A separate data store homogenizing semantics in a particular data-model – works only for special cases, entails loss and distortion of data and semantics, creates a new data silo.
• Virtual integration. A projection onto a homogeneous data-model exposed to users – is more flexible, but may have the problem of data availability (e.g. military, intelligence). Also, a particular homogeneous model has limited usage, does not expose all content, and does not support enrichment.
26. 26
Pursuit of the Holy Grail of Intelligence
Data Integration
• In a highly dynamic semantic environment evolving in ad
hoc ways
• how to have it all and have it available immediately and at any time?
• Traditional physical and virtual integration approaches fail to respond to
these requirements
• how to use these data resources efficiently (integrate, query, and
analyze)?
27. 27
Workable Solution
• A physical store incorporating heterogeneous contents. The Data Representation and Integration Framework (DRIF) is based on a decomposed representation of structured data (RDF-style) and allows collection of data resources without loss or distortion, thereby achieving representational integration.
• Light-weight Semantic Enhancement (SE) supports semantic integration and provides a decent utilization capability without adding storage and processing weight to the already storage- and processing-heavy Dataspace.
28. 28
DRIF Dataspace
• Integration without heavy pre-processing (ad-hoc rapid
integration):
• Of any data artifact regardless of the model (or absence of it)
and modality
• Without loss or distortion of data and data-semantics
• Continuous evolution and enrichment
• Pay-as-you-go solution
• While data and data-semantics are expected to be enriched
and refined, they can be efficiently utilized immediately after
entering the Dataspace through querying, navigation, and
drilling
29. Organization of the DRIF Dataspace
Registration
Ingestion
Extraction [Transformation] / Enrichment
30. 30
Semantic Enhancement of the Dataspace
• Simple yet efficient harmonization strategy
• Takes place not by changing the data semantics to which it is applied,
but rather by adding an extra semantic layer to it
• Long-lasting solution that can be applied consistently and in cumulative
fashion to new models entering the Dataspace
• Strategy compliant with and complementing the
DRIF
• Source data models are not changed
• Can be used efficiently, and in a unified fashion, in
search, reasoning, and analytics
• Provides views of the Dataspace at different levels of detail
• Mapping to a particular Über-model or choosing a
single comprehensive model for harmonization does
not provide the benefits described
31. 31
Illustration
• DRIF Dataspace accommodates lots of data models and
is a microcosm of a collection of systems with diverse and
heterogeneous data
• Incremental annotations of these data models through SE
ontologies
• Preserving the native content of data resources
• Presenting the native content via the SE annotations
• Benefits of the approach
32. 32
Sources
• Source database Db1, with tables Person and
Skill, containing person data and data pertaining to skills of
different kinds, respectively.
Person:
PersonID  SkillID
111       222
Skill:
SkillID  Name  Description
222      Java  Programming
• Source database Db2, with the table Person, containing data
about IT personnel and their skills:
ID   SkillDescr
333  SQL
• Source database Db3, with the table ProgrSkill, containing
data about programmers' skills:
EmplID  SkillName
444     Java
33. 33
Representation in the Dataspace
Representation of data-models, SE, and SE annotations as Concepts and ConceptAssociations
SE annotations (blue):
Label                Relation  SE Label
Db1.Name             Is-a      SE.Skill
Db2.SkillDescr       Is-a      SE.ComputerSkill
Db3.SkillName        Is-a      SE.ProgrammingSkill
Db1.PersonID         Is-a      SE.PersonID
Db2.ID               Is-a      SE.PersonID
Db3.EmplID           Is-a      SE.PersonID
SE hierarchies (red):
Label                Relation  SE Label
SE.ComputerSkill     Is-a      SE.Skill
SE.ProgrammingSkill  Is-a      SE.ComputerSkill
Native representation of structured data:
Value and Associated Label  Relation        Value and Associated Label
111, Db1.PersonID           hasSkillID      222, Db1.SkillID
222, Db1.SkillID            hasName         Java, Db1.Name
222, Db1.SkillID            hasDescription  Programming, Db1.Description
333, Db2.ID                 hasSkillDescr   SQL, Db2.SkillDescr
444, Db3.EmplID             hasSkillName    Java, Db3.SkillName
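To make the decomposed representation concrete, the following Python sketch turns source rows into the value-and-label associations listed above. It is only an illustration under stated assumptions: the function `decompose`, its parameters, and the `has<Field>` naming rule are inferred from the slide and are not DRIF's actual interface.

```python
def decompose(db, key_field, row):
    """Decompose one row into (value, label) / relation / (value, label) triples.

    The key field becomes the subject of every triple; each remaining field
    becomes an object whose label is qualified by the database name.
    """
    subject = (row[key_field], f"{db}.{key_field}")
    triples = []
    for field, value in row.items():
        if field != key_field:
            triples.append((subject, f"has{field}", (value, f"{db}.{field}")))
    return triples


# The five source rows from the sources slide, with their key fields.
triples = (
    decompose("Db1", "PersonID", {"PersonID": 111, "SkillID": 222})
    + decompose("Db1", "SkillID",
                {"SkillID": 222, "Name": "Java", "Description": "Programming"})
    + decompose("Db2", "ID", {"ID": 333, "SkillDescr": "SQL"})
    + decompose("Db3", "EmplID", {"EmplID": 444, "SkillName": "Java"})
)

for subject, relation, obj in triples:
    print(subject, relation, obj)
```

Each field keeps its native, database-qualified label, so nothing about the source model is lost; the SE annotations are stored as separate Is-a associations on those labels.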
34. 34
Indexed Contents Based on the SE
Index entries based on the SE and native (blue) vocabularies
Index Entry     Associated Field-Value
111, PersonID   Type: Person
                Skill: Java
                Db1.Description: Programming
333, PersonID   Type: Person
                ComputerSkill: SQL
444, PersonID   Type: Person
                ProgrammingSkill: Java
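One way such index entries could be derived from the native triples and the SE annotations is sketched below in Python. This is a guess at the logic, not the implemented indexer: the function `build_index`, the rule of following links such as SkillID instead of indexing them, and the hard-coded `Type: Person` are all assumptions made for this example.

```python
# SE annotations of native labels (from the representation slide).
ANNOTATIONS = {
    "Db1.Name": "SE.Skill",
    "Db2.SkillDescr": "SE.ComputerSkill",
    "Db3.SkillName": "SE.ProgrammingSkill",
    "Db1.PersonID": "SE.PersonID",
    "Db2.ID": "SE.PersonID",
    "Db3.EmplID": "SE.PersonID",
}

# Native representation of the structured data.
TRIPLES = [
    ((111, "Db1.PersonID"), "hasSkillID", (222, "Db1.SkillID")),
    ((222, "Db1.SkillID"), "hasName", ("Java", "Db1.Name")),
    ((222, "Db1.SkillID"), "hasDescription", ("Programming", "Db1.Description")),
    ((333, "Db2.ID"), "hasSkillDescr", ("SQL", "Db2.SkillDescr")),
    ((444, "Db3.EmplID"), "hasSkillName", ("Java", "Db3.SkillName")),
]


def build_index(triples, annotations):
    """Build one index entry per person, preferring SE labels for fields."""
    by_subject = {}
    for subject, _relation, obj in triples:
        by_subject.setdefault(subject, []).append(obj)

    def collect(node, seen):
        fields = []
        for obj in by_subject.get(node, []):
            if obj in by_subject and obj not in seen:
                # The object is described by further triples (e.g. a SkillID),
                # so follow the link rather than indexing the key itself.
                fields.extend(collect(obj, seen | {obj}))
            else:
                value, label = obj
                se_label = annotations.get(label)
                # Use the SE label when annotated, else the native label.
                name = se_label.split(".", 1)[1] if se_label else label
                fields.append((name, value))
        return fields

    index = {}
    for subject in by_subject:
        value, label = subject
        if annotations.get(label) == "SE.PersonID":
            index[(value, "PersonID")] = [("Type", "Person")] + collect(
                subject, {subject}
            )
    return index


index = build_index(TRIPLES, ANNOTATIONS)
for entry, fields in index.items():
    print(entry, fields)
```

Fields without an annotation, such as Db1.Description, stay searchable under their native label, which mirrors the mixed SE and native vocabulary of the index shown above.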
35. 35
Benefits of DRIF + SE
• Leverages syntactic integration provided by DRIF, semantic
integration provided by the SE vocabulary and annotations of
native sources, and rich semantics provided by ontologies in
general
• Entering Skill = Java (which will be re-written at run time as: Skill =
Java OR ComputerSkill = Java OR ProgrammingSkill = Java OR
NetworkSkill = Java) will return: persons 111 and 444
• Entering ComputerSkill = Java OR ComputerSkill = SQL will
return: persons 333 and 444
• Entering ProgrammingSkill = Java will return: person 444
• Entering Description = Programming will return: person 111
• Allows querying, searching, and manipulating native
representations
• Light-weight non-intrusive approach that can be improved
and refined without impacting the Dataspace
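The run-time query rewriting described above can be sketched as follows in Python. The sketch assumes a simplified in-memory index and a subtype table; the placement of NetworkSkill under ComputerSkill is an assumption (the slide only shows that it appears in the rewritten Skill query), and the names `expand` and `search` are invented for this example.

```python
# Direct subtypes in the SE Skill hierarchy. NetworkSkill's position is an
# assumption; the slide only lists it in the rewritten Skill query.
SUBTYPES = {
    "Skill": ["ComputerSkill"],
    "ComputerSkill": ["ProgrammingSkill", "NetworkSkill"],
    "ProgrammingSkill": [],
    "NetworkSkill": [],
}

# Simplified version of the SE-based index from the previous slide.
INDEX = {
    111: {"Type": "Person", "Skill": "Java", "Db1.Description": "Programming"},
    333: {"Type": "Person", "ComputerSkill": "SQL"},
    444: {"Type": "Person", "ProgrammingSkill": "Java"},
}


def expand(label):
    """Return `label` followed by all of its transitive subtypes."""
    labels = [label]
    for child in SUBTYPES.get(label, []):
        labels.extend(expand(child))
    return labels


def matches(field, label):
    """A field matches by full name or by its unqualified native name."""
    return field == label or field.endswith("." + label)


def search(conditions):
    """Evaluate an OR of (label, value) conditions after rewriting each one."""
    hits = set()
    for label, value in conditions:
        for expanded in expand(label):
            for person, fields in INDEX.items():
                if any(matches(f, expanded) and v == value
                       for f, v in fields.items()):
                    hits.add(person)
    return sorted(hits)


print(search([("Skill", "Java")]))
print(search([("ComputerSkill", "Java"), ("ComputerSkill", "SQL")]))
print(search([("ProgrammingSkill", "Java")]))
print(search([("Description", "Programming")]))
```

The analyst only needs to know the SE vocabulary; the expansion over subtypes is what lets a general query such as Skill = Java reach data that a source annotated more specifically.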
36. 36
Index Contents without the SE
Index entries based on native vocabularies
Index Entry     Associated Field-Value
111, PersonID   Type: Person
                Name: Java
                Description: Programming
333, ID         Type: Person
                SkillDescr: SQL
444, EmplID     Type: Person
                SkillName: Java
37. 37
Problems
• Even for our toy example we can see how much manual
effort the analyst needs to apply in performing search
without SE – and even then the information he will gain will
be meager in comparison with what is made available
through the Index with SE.
• For example, if an analyst is familiar with the labels used in Db1
and is thus in a position to enter Name = Java, his query will still
return only: person 111. Directly salient Db3 information (person 444) will
thus be missed.
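For contrast, the same search against the native-only index can be sketched in Python as below. The dictionary mirrors the index without SE shown on the previous slide; the function name `native_search` is an assumption made for this example.

```python
# Index built from native vocabularies only (no SE annotations).
NATIVE_INDEX = {
    111: {"Type": "Person", "Name": "Java", "Description": "Programming"},
    333: {"Type": "Person", "SkillDescr": "SQL"},
    444: {"Type": "Person", "SkillName": "Java"},
}


def native_search(field, value):
    """Return persons whose native field equals the value exactly."""
    return sorted(
        person
        for person, fields in NATIVE_INDEX.items()
        if fields.get(field) == value
    )


# Knowing only Db1's label finds person 111 and misses person 444 from Db3.
print(native_search("Name", "Java"))

# Finding every Java-skilled person requires knowing each source's own label.
print(sorted(set(native_search("Name", "Java"))
             | set(native_search("SkillName", "Java"))))
```

The second query shows the manual effort involved: the analyst must enumerate every native field name by hand, which is the work the SE vocabulary takes over.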
38. 38
Additional Notes on the SE process
• Original data and data-semantics are included in the Dataspace
without loss or distortion; thus there is no need to cover all
semantics of the Dataspace – what is unlikely to be used in
search or is not important for integration will still be available
when needed
• A complex ontology is not needed – a common and shared
vocabulary is sufficient for virtual semantic integration and
search/analytics
• The approach is very flexible, and investments can be made in
specific areas according to need (pay-as-you-go)
• The approach is tunable – if the chosen annotations of a
particular subset of a source data-model are too general for data
analyses, the respective ontologies can be further developed
and source models re-annotated
39. 39
Benefits of the Approach
• Does not interfere with the source content
• Enhancement enables this content to evolve in a cumulative
fashion as it accommodates new kinds of data
• Does not depend on the data resources and can be developed
independently from them in an incremental and distributed fashion
• Provides a more consistent, homogeneous, and well-articulated
presentation of the content which originates in multiple internally
inconsistent and heterogeneous systems
• Makes management and exploitation of the content more cost-
effective
• The use of the selected ontologies brings integration with other
government initiatives and brings the system closer to the federally
mandated net-centric data strategy
• Creates integrated content that is effectively searchable and
to which more powerful analytics can be
applied
40. 40
Towards Globalization and Sharing
• Using the SE approach to
create a Shared Semantic
Resource for the Intelligence
Community to enable
interoperability across
systems
• Applying it directly to or
projecting its contents on a
particular integration
solution
41. 41
References
• Smith B. et al., “Horizontal Integration of Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community”, STIDS Conference, 2012.
• Smith B. et al., “Ontology for the Intelligence Analyst”, Crosstalk: The Journal of Defense Software Engineering, 2012.
• Salmen D. et al., “Integration of Intelligence Data through Semantic Enhancement”, STIDS Conference, 2011.