May 2013
Knowledge at the Speed of Need: HADOOP Integration and the Semantic Web
National Cyber Practice
Blue Canopy White Paper
Written by Nick Savage, Robert Bergstrom
2. May 2012 Page 2 Blue Canopy Group, LLC Internal Only
Contents
Executive Summary
Solving the “Big Data” Challenge – Volume, Velocity, and Variety
Identifying a High Volume Data Scenario
Identifying a High Velocity Data Scenario
Identifying a High Variety Data Scenario
The Distinct Roles of HADOOP Integration versus the Semantic Web
The Blue Fusion Technical Architecture
Tools of the Trade
Conclusion
Executive Summary
In government and business alike, a variety of industries need the capability to efficiently exchange trillions of dollars and petabytes of data for billions of consumers and workers, in fractions of a second, every day. The data produced includes continuous streams written to log files for security tracking, and patient or client records that must remain online, or be archived offline, to comply with mandates or laws. In healthcare, for example, patient records must typically be maintained for 7 to 10 years, whereas military patient records must be maintained beyond the life of the soldier and his or her beneficiaries. How do you look for trends and patterns in so much data, when its future value cannot be predicted or measured? Before ingesting the data, for example, you must determine where to store it, both to avoid read-versus-write contention and to query it effectively.
Another challenge for the technologist is managing streaming data generated rapidly by networks, web searches, sensors, or phone conversations: large volumes of data that may be read infrequently, with only limited computing and storage capacity available. How do you mine this data? How do you integrate and correlate this dynamically changing data with your structured data (i.e., databases, marts, or warehouses) when no foreign keys are available?
Data velocity is another operational challenge, faced by institutions where time is critical: searching for one defined state in order to save lives, avoid catastrophic events, or confirm major positive activities. Certain systems must perform complex event processing, extracting information from various data sources quickly and reducing many events to a single activity, state, or alert; the analyst or expert system must then predict certain events based on patterns in the data and infer an anomaly or complex event. In real-world operations (e.g., situational awareness scenarios), a complex event requiring data and operational involvement from multiple federal agencies (e.g., the intelligence community, CDC, healthcare providers, FBI) and their systems must be identified with velocity, or expediency, before a bioterrorist can unleash a pandemic smallpox threat. Blue Fusion is the Blue Canopy solution that adapts to new information, new ontologies, and new relationships based on the domain, and presents the results to the user in a simple way. Organizations must also overcome disparate-data challenges by managing stovepipe systems across the enterprise, in structured (e.g., database, XML) and unstructured (e.g., email, BLOBs, PowerPoint slides) formats, to extract the most knowledge under different data-delivery requirements. How do you maintain performance and fault tolerance while integrating and correlating the multiple data sources?
This paper provides the architectural details of the Blue Canopy technical approach, Blue Fusion: a loosely coupled architecture and an ontology-based application environment that leverages the open-source distributed Hadoop framework and the flexibility of the Semantic Web to produce a BIG DATA integration solution that is fast, dynamic, cost-effective, and reliable.
Solving the “Big Data” Challenge - Volume,
Velocity, and Variety
You must first formulate a data-processing profile and analyze the type of "Big Data" challenge the user and the organization are facing: establish whether you are addressing data-velocity requirements driven by critical conditions, data-volume issues (e.g., fusing data infrequently into a large integrated data store), or a third category in which the obstacle is consolidating a large variety of data sources, types, and delivery systems into a seamless but loosely coupled environment. A well-conceived, loosely coupled architecture accommodates combinations of the three data profiles (i.e., high volume, high velocity, and high variety). In the first case, we will explore a high-volume big data scenario.
Identifying a High Volume Data Scenario
Figure 1 Banking Use Case - Hi Volume Data Scenario
5. May 2012 Page 5 2013 Blue Canopy Group, LLC CN Savage
(Hi-Volume Data) As depicted at a high level in Figure 1, in this banking use case the banking analyst runs analytics on very large amounts of data from possibly disparate data sources, identifying “gaps” and “overlaps” in a composite view of the subject that is not time sensitive. Forensics can be performed in this case to search for the pre-defined conditions and anomalies based on the ontology. To reiterate the objective: the emphasis is on addressing high volume, not data velocity; the disparate data must be ingested, collected, normalized, and analyzed.
The Situation
The banking system runs thousands of transactions daily. An internal criminal has inserted an algorithm that deposits “half pennies” into a “shadow” account collecting funds from all banking accounts in the region. Individually miniscule, this series of thefts may escape the analyst's notice while the funds rapidly accrue into millions of dollars. Without the aid of a forward-thinking expert system, the analyst may take months to uncover the pattern, if it is uncovered at all.
The Approach
The Blue Fusion approach rolls all of the day's banking information, including every transaction and email, into a nightly backup to be analyzed later based on the ontology definitions and business rules loaded into memory on the rules server. These rules are created to look for correlations and patterns related to fraud, generating an alert to the bank analyst. In this case, the system is directed to seek out a pattern of large series of unusually small transactions that correlate with very large withdrawals, or with withdrawals committed just below the $10,000 reporting threshold.
1. The Blue Fusion rules server determines where the information will be
stored and/or queried once the data has been ingested into HDFS.
2. Once the data is map-reduced, a non-RDF triple is stored in the HBase
cache using the timestamp as the unique identifier. Why? HBase does
not currently recognize data graphs.
3. After each ingest of new data, all data is pushed to the RDF database for
storage and possible display. The notification system looks for the pre-
defined alerts or conditions defined in the ontologies.
4. The triples are pushed to the RDF database for immediate storage or
SPARQL queries, and the result set from each query is displayed on the
user's dashboard.
5. Based on the user's rights and settings, the analyst may see the account
number, the number of associated transactions, and the total amount stolen,
whereas the investigator may have access to the culprit's name and other
vital detailed information.
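The kind of fraud rule described in this use case can be sketched in Java. This is only a minimal illustration of the pattern the ontology and business rules would encode; the class name, thresholds, and tuning values below are hypothetical assumptions, not part of the Blue Fusion product.

```java
import java.util.List;

// Hypothetical sketch: flag accounts that receive many sub-cent deposits
// ("half pennies") and also issue withdrawals placed just below the
// $10,000 reporting threshold.
public class FraudPatternRule {
    static final double REPORTING_THRESHOLD = 10_000.00;
    static final double MICRO_DEPOSIT = 0.01;        // "half penny" scale
    static final int MIN_MICRO_DEPOSITS = 1_000;     // assumed tuning value

    // Withdrawals deliberately placed just under the reporting threshold.
    public static boolean isStructuredWithdrawal(double amount) {
        return amount >= 0.9 * REPORTING_THRESHOLD && amount < REPORTING_THRESHOLD;
    }

    // Suspicious when a large series of micro-deposits correlates with
    // at least one structured withdrawal on the same account.
    public static boolean isSuspicious(List<Double> deposits, List<Double> withdrawals) {
        long microDeposits = deposits.stream().filter(d -> d <= MICRO_DEPOSIT).count();
        boolean structured = withdrawals.stream()
                .anyMatch(FraudPatternRule::isStructuredWithdrawal);
        return microDeposits >= MIN_MICRO_DEPOSITS && structured;
    }
}
```

In a real deployment these thresholds would come from the ontology and business rules loaded into the rules server, not from constants in code.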
Identifying a High Velocity Data Scenario
In the second case, we will explore a high velocity data scenario.
Figure 2 Bioterrorism Hi-Velocity Data Scenario
(Hi-Velocity) In this scenario the analyst must focus on mining near-real-time and
streaming data to resolve critical conditions. As depicted at a high level in Figure 2,
in this bioterrorism use case the law enforcement analyst runs analytics on very large
amounts of data from possibly disparate data sources, identifying “gaps” and
“overlaps” in a composite view of the subject, but this time the information returned
is time sensitive. Forensics can be performed in this case to search for the pre-defined
conditions and anomalies based on the ontology. The emphasis is on addressing data
velocity: the disparate data must be ingested, collected, map-reduced, and saved in
triple stores to identify a state, based on monitoring the network, sensors in the
atmosphere, and weather indicators, in addition to the structured data contained in the
transactional database, the business intelligence system, and the reporting system.
The Situation
Bioterrorists infect members of an operational cell with smallpox. Their plan is
to fly to football games held in domed facilities, where 100,000 sports fans
are in attendance during the late summer while the weather is still very hot.
Authorities will have seven to seventeen days to contain the emergency. Several
civilians become infected prior to the planned attack and visit area hospitals
showing symptoms of high fever and rash. After several days, the hospitals conduct
blood tests, and the results return positive for smallpox. A powerful sensor that can
detect airborne pathogens generates data indicating elevated levels of the variola
virus in a subway tunnel area through which the terrorists traveled from the airport.
The Operational Approach
The local hospitals continue to treat the patients based on a limited amount of
information. The patients have a high fever, which is a common symptom of many
diseases. Meanwhile, the CDC analyst monitors the systems, searching for any
critical conditions related to pandemics. As the disease progresses, rash and
lesions appear on the patients. Blood tests are conducted on the severely ill
patients, and the results return positive for variola, indicating smallpox. At this
stage, the race against the clock begins as the medical and criminal-justice
organizations are notified.
The Blue Fusion Approach
1. In the background, the Blue Fusion ontology rules created to look for
correlations and patterns are processed by the Complex Event server.
2. The rules server queries the data for reports of large sums of capital
distributed to a suspicious non-profit, criminal activity on 911 calls, web
activity, and symptoms recorded for other patients and hospitals.
3. Initially, this information was stored in the HBase cache.
4. Within days, Blue Fusion receives information from the pathogen
bio-sensor system, retained in the triple stores in the RDF database, and
simultaneously the CDC receives a single blood test returned positive
for a patient.
5. The CDC and health analysts alert the authorities. The expert system
correlates the new data and retrieves the information to be processed in
the RDF database to conduct additional queries for law enforcement
personnel to track down the operational cell.
6. The community of interest mobilizes to immunize the population and
capture the criminals.
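The state-based correlation at the heart of this scenario can be sketched as a simple time-window check. This is a hypothetical illustration only; the class name and signal names are assumptions, and a production Complex Event server would evaluate many such conditions drawn from the ontology.

```java
// Hypothetical sketch of the complex-event correlation described above:
// an alert fires only when a positive bio-sensor reading and a positive
// blood test occur within a configurable time window of each other.
public class PandemicEventCorrelator {
    private final java.time.Duration window;
    private java.time.Instant sensorPositive;
    private java.time.Instant labPositive;

    public PandemicEventCorrelator(java.time.Duration window) {
        this.window = window;
    }

    public void onSensorPositive(java.time.Instant t) { sensorPositive = t; }
    public void onLabPositive(java.time.Instant t)    { labPositive = t; }

    // True only when both signals are present and close together in time.
    public boolean alert() {
        if (sensorPositive == null || labPositive == null) return false;
        return java.time.Duration.between(sensorPositive, labPositive)
                .abs().compareTo(window) <= 0;
    }
}
```

Either signal alone (a fever report, a single sensor spike) leaves the condition in a "discovery" state; only the correlated pair escalates it to an alert.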
Identifying a High Variety Data Scenario
In the final case, we will explore the hybrid where the data fusion requirements
require the developer to manage a high variety of data sources, types, and delivery
requirements.
Figure 3 Hi-Variety Data Scenario
(Hi-Variety) In this financial hacker scenario the analyst must focus on defining
relationships and uncovering links between data that identify, segment, and
stratify subject populations based on a wide variety of variables, types, and
structures, semantically integrating data to create a holistic 360-degree view of the
subject across a defined continuum. As depicted at a high level in Figure 3, in this
financial use case the financial analyst and law enforcement analyst work together to
run analytics on very large amounts of data from possibly disparate data sources,
identifying “gaps” and “overlaps” in a composite view of the subject, where mobile
applications exist to initiate trades and video must be merged with transactional data,
emails, and flat files. Forensics can be performed in this case to search for the
pre-defined conditions and anomalies based on the ontology. However, the system is
in constant flux, building on the original ontologies defined in the repository to
ensure that it is adapting and learning from the new relationships. The emphasis is on
addressing flexibility and complex data integration: the disparate data must be
ingested, collected, map-reduced, and saved as a very high number of triples to
identify the relationships at the data-graph level.
The Situation
An international hackers' organization has been hired to create havoc in the stock
market by contaminating and limiting access to major banking, financial, and
corporate sites, in addition to many high-profile social media sites. The hiring
criminal firm plans to exploit the stock market by placing "short sales" on the
stock of the affected firms; as the global market plummets and panic ensues, the
firm makes billions of dollars on the negative events, routing the gains to private
accounts. One hacker with features similar to those of a high-level deceased
executive has assumed his identity, with all required ID cards and access codes.
The individual liquidates millions and disappears into the night, with all parties
unaware of the executive's whereabouts and demise. The body of the missing
executive is discovered the next day. In the meantime, the hackers have disabled
blogs and email systems and penetrated the network, accessing personal credit card
and banking information for millions of people. The stolen identities are sold on
the black market.
The Blue Fusion Approach
The Blue Fusion approach gathers BIG DATA from multiple sources, first
ingesting the data exchanged between several large organizations. Based on
established data-exchange agreements between the FBI and the affected financial
institutions, the structured data is ingested into HDFS and the HBase cache for
integration with related emails and security and network audit log files, for
future data-mining purposes, to be processed after the highest-priority events
have been completed. The rules server map-reduces the higher-priority real-time
data produced (i.e., network intrusion alerts, phone activity) and stores the
triples in the RDF database based on the ontologies that define the relationships
between stock-price changes and the formulas that identify anomalies in selling
and buying patterns on the market. The expert system identifies the companies and
individuals that capitalize on the major losses. The location and description of
related videos from the financial office of the missing executive indicate strange
activity in his office after a Blue Fusion alert is generated showing that his
logout time did not coincide with his departure from the office. The link to the
video is accessed through a query against the data graph stored in the RDF
database, identifying the imposter in the FBI hacker database. The initial
information stored in the HBase cache is mined, and the relevant triples are moved
to the RDF database. The expert system correlates the new data and retrieves the
information to be processed in the RDF database to conduct additional queries for
law enforcement personnel. The communities of interest mobilize to trace the
money trail, correlate the hackers' methods of operation from the hacker database,
and locate their current whereabouts.
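The anomaly formula mentioned above, identifying parties who profit from engineered losses, might be sketched as follows. The class name, multipliers, and thresholds are hypothetical illustrations, not formulas from the Blue Fusion rules server.

```java
// Hypothetical sketch of an anomaly rule for selling patterns: flag a party
// whose short position is many times larger than its historical average and
// was opened shortly before a steep price drop in the shorted stock.
public class ShortSaleAnomaly {
    static final double SIZE_MULTIPLE = 10.0;   // assumed tuning value
    static final double DROP_THRESHOLD_PCT = -20.0;

    public static boolean isAnomalous(double positionSize,
                                      double historicalAvgPosition,
                                      double priceChangePct) {
        boolean outsizedPosition = positionSize > SIZE_MULTIPLE * historicalAvgPosition;
        boolean steepDrop = priceChangePct <= DROP_THRESHOLD_PCT;
        return outsizedPosition && steepDrop;
    }
}
```

In the architecture described here, a rule like this would be expressed as ontology definitions and evaluated over triples in the RDF database rather than hard-coded.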
The Distinct Roles of HADOOP Integration versus the Semantic Web
Technically, what roles do the HADOOP framework and the Semantic Web each provide?
In this case the HADOOP components fulfill several functions; the primary roles are
storage, distributed processing, and fault tolerance, ingesting the data and serving as a
data cache via HBase. HBase stores data in a large, scalable, distributed database that
runs on HADOOP. From a technological perspective, HBase is designed for fast
reading rather than writing; it is therefore better to run queries on "time sensitive"
data in the Resource Description Framework (RDF) database. Why? Semantic Web
graph data in RDF forms triple stores (i.e., subject-predicate-object triples) and
requires optimization with indices that do not exist in the HBase database structure;
significant Java coding, plus JENA and REST servers, would be required to poll for
data changes and maintain notifications to the user's dashboard. When time is not
critical, however, HDFS and HBase are the best choice when it has yet to be
determined what data mining must occur for a given data source. Based on time
sensitivity, a rules server must determine whether the information should:
1. Remain in HDFS and be stored in an RDF database for immediate processing, or
2. Be batched at a later time and stored in HBase for possible future mining.
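That routing decision can be sketched in Java as follows; the class, method, and destination names are hypothetical, not part of the Blue Fusion codebase.

```java
// Hypothetical sketch of the rules-server routing decision described above:
// time-sensitive data becomes triples in the RDF database for immediate
// SPARQL processing; everything else is batched into the HBase cache until
// its mining requirements are known.
public class StorageRouter {
    public enum Destination { RDF_IMMEDIATE, HBASE_BATCH }

    public static Destination route(boolean timeSensitive) {
        return timeSensitive ? Destination.RDF_IMMEDIATE
                             : Destination.HBASE_BATCH;
    }
}
```

The real decision would of course consult the ontologies and business rules rather than a single flag, but the split between the two storage paths is the essential design point.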
Once the data is stored in the RDF database in graph form as triples, the analyst
or the expert system can use SPARQL to execute queries against the data. It is
recommended to leverage an RDF database with the capacity to store on the order of
one billion to 50 billion triples, to ensure that the various permutations of
relationships are well defined for the domain while maintaining speed when writing
to and querying the database.
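As an illustration, a SPARQL query over such a triple store might look like the following, here reusing the banking scenario from earlier. The bf: vocabulary, class, and property names are hypothetical, not part of a published Blue Fusion ontology.

```sparql
PREFIX bf: <http://example.org/bluefusion#>

# Find accounts with withdrawals placed just below the $10,000 threshold
SELECT ?account ?amount
WHERE {
  ?txn bf:account ?account ;
       a          bf:Withdrawal ;
       bf:amount  ?amount .
  FILTER (?amount >= 9000 && ?amount < 10000)
}
ORDER BY DESC(?amount)
```

Such a query runs directly against the data graph, with no joins or foreign keys required; the relationships are carried by the triples themselves.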
What do the Ontology approach and the Semantic Web provide?
First and foremost, the ontological approach gives designers and developers the
flexibility and adaptability to dynamically define the rules and relationships among
structured, semi-structured, and unstructured data. BlueFusion uses the XML
structure of incoming data and a reference ontology to automatically derive a conceptual
graph representing the semantics of the data. The RDF schema contains the metadata (i.e.,
the description of the data). Most importantly, for each data source a conceptual graph is
created according to the framework, and it can be saved in the RDF database so that the
triples can be processed and queried using SPARQL (the SPARQL Protocol and RDF
Query Language). This data graph represents the semantic integration of the original
data sources. Integration ontologies can be extended to repurpose or share data and to
support different tasks across disparate applications, and eventually across
organizational boundaries if the application is internet-based.
The automated agent evaluates incoming XML messages and compares the information to
the integrated ontology, using the established reference ontology for lookup purposes. The
reference ontology is implemented in OWL (the Web Ontology Language) and contains a
model of XML document structure, along with rules and axioms to interpret a generic XML
document. With regard to integration, each data instance is stored as an instance of a
concept, each of which has relations to other concepts that represent metadata such as
time stamps and source data.
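For instance, a single ingested XML data instance might end up represented as a concept instance with timestamp and source metadata along these lines, shown in Turtle syntax. The bf: vocabulary and instance names are hypothetical illustrations, not a published Blue Fusion schema.

```turtle
@prefix bf:  <http://example.org/bluefusion#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# One transaction instance, with relations to metadata concepts
bf:txn-0042 a bf:Transaction ;
    bf:amount     "9950.00"^^xsd:decimal ;
    bf:timeStamp  "2013-05-01T14:02:11Z"^^xsd:dateTime ;
    bf:sourceData bf:RegionalBankFeed .
```

Because the metadata travels with each instance as ordinary triples, queries can filter on source and time without any schema changes to the store.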
The Blue Fusion Architecture
Figure 4 The BlueFusion Technical Approach
The BlueFusion approach provides the analyst, administrator, developer, and, most
importantly, the end user with the flexibility to expand the robustness of the
enterprise by providing the methodology and the tools required to leverage the
strengths of HADOOP and the Semantic Web.
As depicted in Figure 4, an administrator has the ability to build a configurable,
loosely coupled architecture based on the use case and the available tools that are
the "best fit" for the requirements.
Tools of the Trade
The Administrator can perform the following tasks through the BlueFusion Admin Tool:
Configure Storage Requirements
1. Determine which data sources will exist for the life of the enterprise and
remain the system of record, and which will be purged at certain intervals:
A. The original XML documents created prior to ingestion into HDFS.
B. The cache, which contains the tuples in non-RDF format; this
information cannot be queried with SPARQL, so custom Java
coding is required.
C. The RDF database "data incubator," which contains the triples that
may hold information for mining purposes where conditions have
not yet been identified by the analyst or expert system but are ready
to be analyzed or queried quickly. Based on the configuration, the
data incubator may be a replica of the HBase cache, but in RDF
format, where data graphs are recognized and SPARQL can be executed.
2. Based on the business rules and ontologies, configure which triples will be
stored in the High Priority Database (Red - Alert Condition Database) for
immediate processing, and which triples will be stored in the Medium Priority
Database (Yellow - Discovery Condition Database) when only partial criteria
have been identified for a defined event. For example, in the case of
bioterrorism, the early symptoms of some pandemic diseases are the same as
those of many common diseases: the victim has a rash and a high fever. The
analyst must have a way to keep monitoring that activity until the
condition is correlated, or eliminated as an opportunity or threat. Once the
condition has been defined and processed through the dashboard, the event
will be promoted to the Alert Condition Database so that it can be identified
earlier if it recurs as a future event.
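The Red/Yellow classification above can be sketched as a simple criteria count; the class name and the notion of counted criteria are hypothetical simplifications of what the ontology would actually express.

```java
// Hypothetical sketch of the two-tier priority scheme described above:
// a partially matched condition sits in the Discovery (Yellow) database
// until all of its criteria are correlated, at which point it is promoted
// to the Alert (Red) database for immediate processing.
public class ConditionStore {
    public enum Priority { RED_ALERT, YELLOW_DISCOVERY }

    public static Priority classify(int criteriaMatched, int criteriaRequired) {
        return criteriaMatched >= criteriaRequired
                ? Priority.RED_ALERT
                : Priority.YELLOW_DISCOVERY;
    }
}
```

In the bioterrorism example, rash plus high fever alone would classify as YELLOW_DISCOVERY; adding a positive blood test completes the criteria and promotes the event to RED_ALERT.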
Configure Access Control Levels, granting Create, Read, Update, and Delete
(CRUD) rights to the:
1. User Dashboard features, to determine what is presented to the end user
based on user level and use case.
2. Visual and command-line SPARQL applets, to ensure the analyst has
access to query and perform analysis on the use case.
3. HADOOP cache, to determine the duration for storing the data, the ACL
to the cache, and related logs.
4. Semantic materialized views, to ensure the appropriate user-level
information is reported based on the use case.
5. HADOOP performance-tuning features, to manage the workload
balance between the various data stores and distribution management.
6. Semantic Situational Awareness data-profile performance variables, or a
template based on the use case, to determine temporal and subject
settings as they relate to data velocity, volume, and variety, with
workload-balance tuning to spawn more expert agents to mine data.
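As an illustration, the CRUD rights configuration could be modeled as follows. The user-level names and the rights granted to each are hypothetical examples, not BlueFusion defaults.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of CRUD access control: each user level maps to a
// set of rights that the Admin Tool would apply to dashboards, applets,
// caches, and views.
public class AccessControl {
    public enum Right { CREATE, READ, UPDATE, DELETE }

    public static Set<Right> rightsFor(String userLevel) {
        switch (userLevel) {
            case "administrator": return EnumSet.allOf(Right.class);
            case "analyst":       return EnumSet.of(Right.READ, Right.UPDATE);
            default:              return EnumSet.of(Right.READ); // least privilege
        }
    }
}
```

This matches the earlier scenarios, where the analyst sees account numbers and totals while only the investigator's level unlocks names and other vital details.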
The BlueFusion approach is to establish three separate RDF database instances,
using HBase as a cache.