Reasoning over big data

Reasoning Over Big Data Stores
Eric Little, PhD
VP Data Science
Polytechnic School of Engineering - NYU
eric.little@osthus.com

Who We Are & What We Do
• OSTHUS, Inc. is the U.S. subsidiary of OSTHUS GmbH
• Global presence: offices in Germany, U.S. & China
• Provides advanced solutions, consulting and technology services for Pharmaceutical and Biotech R&D
• Technology provider for the Allotrope effort, globally aligning several pharma and biotech companies

Semantic Technologies – Smart Data Piece
Semantic Technologies
- Provide several important features for emerging new technologies
  • Controlled vocabularies
  • Taxonomies
  • Metadata structures
  • Ontology models
  • Logical inference
Data today continues to evolve and grow in both size and complexity.
We need hybrid solutions that can provide real insights
- Analytics is growing into a new kind of field – Data Science
- Is data science about interacting with machines or humans?
- Must be able to strike a balance between complexity of the data and simplicity of the presentation to the user

Metadata, Reference Data & Master Data
• While often lumped together, these are distinct kinds of data
• Semantic Technologies can help with the organization of these kinds of data – but should not be used in isolation
• Scalability is achieved using complementary approaches
Increased conceptual complexity
Increased scalability issues

Where Semantics Can Fall Short
Graphs are good for information – not so good for high-bandwidth applications where speed and scalability are the primary drivers.
Can require highly specialized hardware, software techniques or engineers
Semantics should be confined to the metadata aspects of the problem – use other tech for the rest

The Big Data Problem
Big Data is a real challenge – but it is starting to become a buzzword
- Many “Big Data Problems” can be reduced to smaller data problems
Applications exist that require complex inferencing over very large data sets
- A current client has lab readings from 40,000+ devices
How to do this effectively?

Why Not Just Build the Data Lake?
Data lakes are fine when you are gathering and storing the data
- What happens later on when a lot of data is in there?
The benefit is that data can stay in its original form – no real ETL
But running analytics across disparate stores is very challenging
“Without metadata, every subsequent use of data means analysts start from scratch.” (Gartner 2014)

Reasoning Over Big Data Is A Growing Topic
There has been an inordinate amount of time and energy spent on just queries.
- This is not reasoning though – it is just retrieval
What is Reasoning?
- More than just automated query sets run in sequence or parallel
- Reasoning is about inferring new information that isn’t in the raw data.
- It is a heuristic – where one discovers or learns something new for themselves
- Deductive, Inductive, Abductive
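The difference between retrieval and reasoning described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `IS_A` facts, the instrument names, and the helper functions `retrieve` and `infer_ancestors` do not come from the talk. The sketch shows only the deductive case: a lookup finds what is stored, while inference derives a fact that was never stored.

```python
# Hypothetical facts: each pair (child, parent) asserts "child is a kind of parent".
IS_A = {
    ("HPLC", "Chromatograph"),
    ("Chromatograph", "LabInstrument"),
    ("LabInstrument", "Device"),
}


def retrieve(child, parent, facts=IS_A):
    """Retrieval: succeeds only if the exact fact is stored."""
    return (child, parent) in facts


def infer_ancestors(child, facts=IS_A):
    """Deductive inference: follow is-a links transitively to derive new facts."""
    found = set()
    frontier = [child]
    while frontier:
        node = frontier.pop()
        for c, p in facts:
            if c == node and p not in found:
                found.add(p)
                frontier.append(p)
    return found


if __name__ == "__main__":
    print(retrieve("HPLC", "Device"))        # the fact is not stored
    print(infer_ancestors("HPLC"))           # but it can be derived
```

The stored data never states that an HPLC is a Device, so the query alone misses it; the inference step produces that statement from the chain of stored facts.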

Types of Reasoning One Can Use
Logical Reasoning (does not always assume set theory)
Mathematical Reasoning (which is logical reasoning, but assumes set theory as the basis)

Types of Semantic Inference (Forward and Backward Chaining)
Backward chaining
- Uses Modus Ponens
- Finds a true consequent and affirms the related antecedent (verifies connection)
Forward chaining
- Uses Modus Ponens
- Finds a true antecedent and affirms a related consequent (new knowledge)
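The two directions of inference on this slide, antecedent to consequent and consequent back to antecedent, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the reasoner used in the talk: the rule set `RULES`, the fact names, and the functions `forward_chain` and `backward_chain` are all hypothetical, and rules are restricted to simple "if all of these, then that" form.

```python
# Hypothetical rules: (set of antecedents, consequent).
RULES = [
    ({"reading_out_of_range"}, "device_suspect"),
    ({"device_suspect", "calibration_overdue"}, "needs_service"),
]


def forward_chain(facts, rules=RULES):
    """Start from known facts; fire rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)  # new knowledge
                changed = True
    return known


def backward_chain(goal, facts, rules=RULES, _seen=None):
    """Start from a goal; check whether its antecedents can be established."""
    seen = _seen or set()
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules
        return False
    seen = seen | {goal}
    return any(
        consequent == goal
        and all(backward_chain(a, facts, rules, seen) for a in antecedents)
        for antecedents, consequent in rules
    )


if __name__ == "__main__":
    observed = {"reading_out_of_range", "calibration_overdue"}
    print(forward_chain(observed))
    print(backward_chain("needs_service", observed))
```

Forward chaining is data-driven and materializes everything that follows from the facts; backward chaining is goal-driven and only explores the rules relevant to one question, which is one reason the choice matters at scale.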

Ontology Layering Is Important for Scale
Layers shown in the figure:
- Data Source Models
- Multi- & Single-Source Data Integration Models
- Domain Models (Objs, Attributes, Process & Relations)
- System Lvl Models (Rules)
- Upper-Lvl Models
Figure labels:
- Data Traceability (Provenance)
- User Driven Ontologies
- Meta-data Levels (Human Concepts)
- Data-centric Levels (Machine Language)
Metaphysics – not just data models
Data Sources connected directly to higher classifications
Federation allows for improved scale

Combining Semantics and NoSQL
Get your semantics experts and your big data scientists on the same page
- Utilize tables where possible – avoid multi-node graph hops
- Use graphs for metadata – leave instance data in place when possible
- Large graphs should be avoided
- Lots of columns and rows are fine – joins across tables are not
- Break graph information into other formats wherever possible
Pre-compute phases are important
- Pre-compute multi-table joins based on SME input, known semantic patterns, business rules/logics, etc.
- Use statistical methods to cluster data (e.g., normalcy calcs)
Use the tech that is right for the job
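One way to picture the advice on this slide (graphs for metadata, instance data left in tables, joins pre-computed) is the sketch below. It is an assumption-laden illustration, not the architecture from the talk: SQLite stands in for whatever tabular or NoSQL store is actually used, and the `METADATA` mapping, the table and column names, and the functions `build_store` and `mean_by` are hypothetical.

```python
import sqlite3

# Hypothetical metadata "graph": concepts point at where the instance data lives.
# It stays small; the instance data itself never enters the graph.
METADATA = {
    "Device": {"table": "device_readings", "column": "device_id"},
    "Temperature": {"table": "device_readings", "column": "temp_c"},
    "Lab": {"table": "device_readings", "column": "lab"},
}


def build_store():
    """Create source tables, then pre-compute the join into one wide table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE devices (device_id TEXT PRIMARY KEY, lab TEXT);
        CREATE TABLE readings (device_id TEXT, temp_c REAL);
        INSERT INTO devices VALUES ('d1', 'north'), ('d2', 'south');
        INSERT INTO readings VALUES ('d1', 20.5), ('d1', 21.5), ('d2', 37.0);
        -- Pre-compute phase: do the join once and store the result.
        CREATE TABLE device_readings AS
            SELECT r.device_id, d.lab, r.temp_c
            FROM readings r JOIN devices d ON d.device_id = r.device_id;
    """)
    return con


def mean_by(con, measure, group):
    """Resolve concepts through the metadata, then scan a single table."""
    m, g = METADATA[measure], METADATA[group]
    if m["table"] != g["table"]:
        raise ValueError("concepts must share a pre-joined table")
    # Identifiers come from the trusted METADATA mapping, not from user input.
    sql = (
        f'SELECT {g["column"]}, AVG({m["column"]}) '
        f'FROM {m["table"]} GROUP BY {g["column"]}'
    )
    return dict(con.execute(sql).fetchall())


if __name__ == "__main__":
    con = build_store()
    print(mean_by(con, "Temperature", "Lab"))
```

At query time no join and no graph traversal over instance data is needed: the metadata lookup is a couple of dictionary hops, and the analytic work is a scan over one wide table.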

One Example of Using RDF in Cloud-scalable Applications
Example of a current approach being used – there are others
Can scale across multiple cloud nodes (where triple stores have issues)
Triples are indexed items
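The slide does not say how the triples are indexed, so the following is only a sketch of one common pattern, offered as an assumption: each triple is written to several key-value indexes so that any lookup pattern becomes a key lookup, which is the kind of access that distributed stores partition well. The class `TripleIndex`, its three indexes, and the sample triples are hypothetical.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory index; each dict stands in for a key-value table."""

    def __init__(self):
        self.spo = defaultdict(set)  # subject   -> {(predicate, object)}
        self.pos = defaultdict(set)  # predicate -> {(object, subject)}
        self.osp = defaultdict(set)  # object    -> {(subject, predicate)}

    def add(self, s, p, o):
        """Write one triple into all three indexes."""
        self.spo[s].add((p, o))
        self.pos[p].add((o, s))
        self.osp[o].add((s, p))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None is a wildcard."""
        if s is not None:
            candidates = {(s, p2, o2) for p2, o2 in self.spo.get(s, ())}
        elif o is not None:
            candidates = {(s2, p2, o) for s2, p2 in self.osp.get(o, ())}
        elif p is not None:
            candidates = {(s2, p, o2) for o2, s2 in self.pos.get(p, ())}
        else:
            candidates = {
                (s2, p2, o2) for s2, po in self.spo.items() for p2, o2 in po
            }
        # Filter on any remaining bound positions.
        return {
            t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)
        }


if __name__ == "__main__":
    idx = TripleIndex()
    idx.add("d1", "type", "HPLC")
    idx.add("d2", "type", "HPLC")
    idx.add("d1", "locatedIn", "north")
    print(idx.match(p="type", o="HPLC"))
```

The trade-off is storage for speed: every triple is stored three times, but no pattern with at least one bound position requires a full scan.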

THANK YOU – QUESTIONS?
Eric Little, PhD
VP Data Science
OSTHUS, Inc.
eric.little@osthus.com
(M) 321-480-4818
www.linkedin.com/pub/eric-little
