Big Data and the Semantic Web: Challenges and Opportunities
1. Big Data Tech Conclave, 26—27 April 2013
Bangalore, India
Srinath Srinivasa
Open Systems Laboratory
IIIT Bangalore
http://osl.iiitb.ac.in/
sri@iiitb.ac.in
http://www.bda2013.net/
OSL Releases
Topical Anchors: Given a list of noun phrases, identify a semantic topic for these terms.
Powered by a Wikipedia co-occurrence graph hosted by Agama.
Web APIs enable use of Topical Anchors in third-party applications.
OSL Releases
Topic Expansion: Given a term, expands it into semantically relevant topical clusters with different senses.
Uses co-occurrence datasets from Wikipedia 2006 or 2011.
Web APIs enable use by third-party applications.
OSL Releases
Agama: A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval).
Currently Agama powers a co-occurrence graph of all noun phrases from Wikipedia articles, hosted in OSL, managing tens of millions of nodes and hundreds of millions of edges.
More data beats better algorithms..
meets
No data is an island..
Outline
● Big Data Characteristics
● Big Data Analytics
● Pattern-driven and Model-driven Analytics
● Big Data and the Semantic Web
● Semantic Challenges
● The myth of a global ontology
● Convergent and divergent semantics
● Semantic interoperability
● Technology Challenges
● Storage, traversal and retrieval of large-scale semantic networks
● Inference on Big Data
● On the road ahead
Big Data
Data that is
● Too large to be processed by conventional
databases and data management techniques
(Volume)
● Too diverse in structure that no single data model
captures all elements of the data (Variety)
● Transient and/or impermanent, especially when
pertaining to dynamic phenomena (Velocity)
Big Data
● Transaction records
● Network streams
● Experimental output
● Social media data
● Demographic records
● Citation data
● Clickstreams
● Log data
● Weather data
● …
Some Big Data Stats
● YouTube users upload 48 hours of video every minute
http://gigaom.com/2011/05/25/youtube48hoursofvideoperminute/
● Facebook data grows by 500TB daily
http://www.slashgear.com/facebookdatagrowsbyover500tbdaily23243691/
● WalMart handles more than 1 million customer
transactions every hour http://www.economist.com/node/15557443
● Akamai analyzes 75 million events per day for
targeted advertising http://wikibon.org/blog/tamingbigdata/
● 90% of data in the world today was created in the last
2 years http://wikibon.org/blog/bigdatainfographics/
Big Data Analytics
Examine Big Data for useful (often actionable)
knowledge
The long spectrum of Big Data Analytics, from pattern driven to model driven:
Pattern identification
Association rule mining
Classification/Clustering
Record Linkage
Security analytics
Complex Event Processing
Opinion mining
Predictive modeling
Pattern Driven Analytics
● Discovery and visualization of recurring patterns in datasets
● Mostly quantitative
● Paradigms in pattern discovery:
● Sampling and aggregation
● Thresholding and filtering
Image Source: Wikipedia
Pattern Driven Analytics
Sampling and Aggregation
● Query-based pattern aggregation
● Based on an initial idea of what we are looking for
[Figure: flow from Hypothesis and Data through Query, Patterns and Aggregation to Presentation]
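A minimal sketch of query-based aggregation, assuming a hypothetical list of transaction records: the query encodes the initial hypothesis, and only matching records are aggregated. The field names, the data and the `aggregate` helper are illustrative, not part of the talk.

```python
from collections import defaultdict

# Hypothetical transaction records; field names are illustrative.
RECORDS = [
    {"region": "south", "amount": 120.0},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 60.0},
    {"region": "east", "amount": 40.0},
    {"region": "north", "amount": 100.0},
]


def aggregate(records, predicate, key, value):
    """Select records matching the query, then sum `value` per `key`."""
    totals = defaultdict(float)
    for rec in records:
        if predicate(rec):  # the query encodes the initial hypothesis
            totals[rec[key]] += rec[value]
    return dict(totals)


# Hypothesis (assumed): only transactions of 50 or more matter per region.
totals = aggregate(RECORDS, lambda r: r["amount"] >= 50, "region", "amount")
for region, total in sorted(totals.items()):
    print(region, total)
```

The patterns that surface depend entirely on the query, which is why this paradigm needs an initial idea of what to look for.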
Pattern Driven Analytics
Thresholding and Filtering
● Based on sifting through the entire dataset (or a
view) to look for “interesting” patterns without
the context of a query
[Figure: flow from Data and Interestingness criteria through Patterns, Filtering and Segregation to Presentation]
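Thresholding can be sketched as a scan over the whole dataset with a fixed interestingness criterion and no query. The example below assumes hypothetical market baskets and uses support (the fraction of baskets containing an item pair) as the criterion; other criteria would fit the same shape.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; items are illustrative.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs whose support meets the interestingness threshold."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


patterns = frequent_pairs(BASKETS, min_support=0.5)
print(patterns)
```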
Model Driven Analytics
Analytics as a model-discovery problem
[Figure: observable data (images) mapped to a latent concept, "Wedding". Image source: Wikipedia]
Model Driven Analytics
● Pattern discovery coupled with semantic modeling
● Non-trivial qualitative modeling challenges
● Model discovery:
● Descriptive model discovery: fit a model to explain the observed data
● Predictive model discovery: discover a model that can predict values of data elements into the future
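The difference between the two kinds of model discovery can be illustrated with the simplest possible model, a straight line. This is only a sketch on made-up numbers: `fit_line` describes the observed data, and `predict` reuses the same model to extrapolate one step ahead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (descriptive model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Use the fitted model to estimate a value outside the observed range."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical observations: day index versus some measured quantity.
days = [1, 2, 3, 4, 5]
values = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(days, values)
print(model, predict(model, 6))
```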
Linked Data
[Figure: The Linked Data Cloud as of September 2011. Image source: Wikipedia]
Linked Data
● Using Semantic Web technologies to connect data elements from disparate data sources
● From Web of Documents to Web of Data
● Elements of Linked Data
● URIs
● HTTP
● Resource Description Framework (RDF)
● Serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
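As a rough illustration of how URIs and RDF fit together, the sketch below writes a few (subject, predicate, object) triples as N-Triples, a line-based subset of Turtle. All `http://example.org/` URIs are hypothetical; only the `rdf:type` URI is a real vocabulary term, and the simple `to_ntriples` helper does not handle escaping or typed literals.

```python
# Hypothetical URIs under example.org; only RDF_TYPE is a real vocabulary term.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"

triples = [
    (EX + "IIITB", RDF_TYPE, EX + "University"),
    (EX + "IIITB", EX + "locatedIn", EX + "Bangalore"),
    (EX + "IIITB", EX + "name", '"IIIT Bangalore"'),  # literal object
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # Objects starting with a quote are literals; everything else is a URI.
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```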
Big Data and the Semantic Web
[Figure: Big Data feeds the Semantic Web through Model Discovery; the Semantic Web feeds Big Data through Catalyzation and Predictive Modeling]
Big Data → Semantic Web
● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia
● Open Biomedical Ontology (OBO) (http://www.oboedit.org/) created from mining PubMed publications
● Enterprise-scale Big Data Analytics helping build organizational models, operational intelligence solutions, etc. Examples: Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com), Loom data management suite by Revelytix (www.revelytix.com)
Semantic Web → Big Data
Schema.org
● Collection of schemata on various topics that are recognized by major
search providers and used to semantically interpret web content
SourceMap
● Linked data augmented with web content and crowdsourced data used
to provide details about companies like their carbon footprint, energy
use, water use, etc. www.sourcemap.com
OpenStreetMap
● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=bernersleelinkeddata)
Big Data and the Semantic Web:
Challenges
Semantic challenges
● The myth of a global ontology
● Convergent and divergent semantics
Technology and system challenges
● Characteristics of a semantic graph
● Managing graph structured data
The Myth of a Global Ontology
Several “core” semantic ontologies exist:
● WordNet
● YAGO
● OpenCyc
● SUMO
However, none of them (even automated ones) can
capture all possible semantic associations and all
possible perspectives on a given topic
The Myth of a Global Ontology
The open world problem
● We don't know what we don't know..
● Representation bias in big data sources
The neutral-but-useless perspective
● Localized, utilitarian descriptions are often more useful than neutral, global descriptions. Example: use of “zones” as a geographical element in Indian Railways
● Difficult for disparate perspectives to coexist in a single ontology without violating design principles like Occam's razor
Convergent and Divergent Semantics
[Figure: the Palestine, Israeli, historians' and UN points of view (POV) converge on the Wikipedia article on the West Bank conflict: encyclopedic semantics]
Convergent and Divergent Semantics
[Figure: the IPL event schedule diverges into traffic planning, advertisement planning around IPL, legal structuring around IPL, TV programme scheduling and security planning]
Semantic Interoperability
● Binary predicates like RDF may not capture the complete semantics of the association
But it is too difficult to work with higher-order predicates
● Semantic queries are characterized by contextual relevance and default assumptions
● Linked Data can be useful primarily within the context of a model
Model-building from predicates is as complex a problem as identifying predicates from data
Semantic Challenges: Summary
● Hard to distinguish data from noise without a model
Especially hard when we are using data to help build a model!
● There may not be a single global model explaining the data
● Model construction is as challenging as predicate mining, if not more so
● No clarity on the underlying processes that aid in knowledge aggregation
Knowledge aggregation happens differently depending on the kind of
knowledge being aggregated (encyclopedic versus operational knowledge)
Tech Challenges
Storing Big Semantic Data
● Semantic data lacks the physical access coherence needed for efficient storage in relational tables
● Logical proximity of triples is more important than physical proximity
● Read/write storage models change logical proximity
● RDF graphs tend to be extremely dense and/or clustered
● Need efficient methods of graph storage and retrieval
Semantic store for Big Data
● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object)
● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries
● Main design criteria: storage and read-ahead policies of triples based on their logical proximity rather than physical proximity in order to enable Bulk Synchronous Parallel (BSP) processing
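A toy in-memory sketch of the idea, not the design of any real triple store: triples are indexed by subject and by object, so everything logically adjacent to a node can be fetched together regardless of insertion order. The `TripleStore` class, its methods and the `ex:` identifiers are all hypothetical.

```python
from collections import defaultdict


class TripleStore:
    """Toy store; indexes group triples that share a subject or an object."""

    def __init__(self):
        self.by_subject = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        self.by_subject[s].add((s, p, o))
        self.by_object[o].add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching a pattern; None acts as a wildcard."""
        if s is not None:
            candidates = self.by_subject.get(s, set())
        elif o is not None:
            candidates = self.by_object.get(o, set())
        else:
            candidates = set().union(*self.by_subject.values())
        return sorted(
            t for t in candidates
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def neighborhood(self, node):
        """Triples logically adjacent to a node: candidates for read-ahead."""
        return sorted(
            self.by_subject.get(node, set()) | self.by_object.get(node, set())
        )


store = TripleStore()
store.add("ex:IIITB", "ex:locatedIn", "ex:Bangalore")
store.add("ex:OSL", "ex:partOf", "ex:IIITB")
store.add("ex:Agama", "ex:hostedBy", "ex:OSL")
print(store.match(p="ex:partOf"))
print(store.neighborhood("ex:IIITB"))
```

A production store would back these indexes with disk pages laid out by the same logical grouping, which is where the read-ahead policy comes in.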
Semantic store for Big Data
AllegroGraph (http://www.franz.com/agraph/allegrograph/)
● NoSQL Graph based native storage for RDF triples
● ACID compliant
● Interfaces with Solr for free text indexing
● Triple and text level indexing
● MongoDB integration
● RDFS++ Reasoning with dynamic materialization
● SPARQL queries on named graphs and Prolog based
inferencing engine
Semantic store for Big Data
Sesame http://www.openrdf.org/
● Open source Java framework for parsing, storing,
querying and inferencing over RDF data
● Collections of RDF triples can be manipulated in memory
using a graph data model
● Compliant with SPARQL 1.1 protocol recommendation
● Provides two levels of APIs: SAIL (Storage and Inference
Layer) for low level RDF processing and Repository layer
for programmatic interfacing with Sesame
Semantic store for Big Data
Mulgara http://www.mulgara.org/
● Native storage model for RDF
● Supports multiple models (databases) per server
● ACID transactions and concurrency support
● Copy-on-write cache semantics
● Full-text search and support for data types
● Primarily useful as a repository – no evidence of
support for logical inferences over RDF
Semantic store for Big Data
Other examples:
● InfiniteGraph from Objectivity http://www.objectivity.com/
● BigData http://www.bigdata.com/bigdata/blog/
– A high scaleout storage and computing engine
● Agama https://github.com/arrac/agama/wiki/Agama
– Storage, search and traversal support (Ruby library) for
very large graphs
● Neo4j http://www.neo4j.org/
– Embedded, disk-based transactional graph database
written in Java
Logical inference over Big Data
● Problem: Find factual answers to specific questions by reasoning over large-scale data.
● Performing extremely large-scale deductions over large semantic datasets in interactive response time
● Need to contend with potentially inconsistent predicates,
incomplete or missing values and default assumptions
● Varieties of inference over datasets
● Deduction
● Induction
● Abduction
● Statistical inference
Logical inference over Big Data
Common approaches for scalable inferencing:
● Horn clause inferencing
● Variants of random walks on knowledge graphs
● Distributed MCMC (Markov Chain Monte Carlo)
methods
Horn Clauses
Horn clauses are predicates of the form:
$p_1 \wedge p_2 \wedge \dots \wedge p_n \rightarrow u$
where each $p_i$ and $u$ is an atomic sentence with no negation, and $u$ is the single consequent
Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts
Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce
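Backward chaining over Horn clauses can be sketched in a few lines for the propositional case. The rules and facts below are made up for illustration, atoms are plain strings with no variables or unification, and the sketch runs on a single machine rather than over MapReduce.

```python
# Each rule is (antecedents, consequent); facts are atoms known to be true.
# Hypothetical knowledge base for illustration only.
RULES = [
    (("rain", "outdoors"), "wet"),
    (("wet", "cold"), "shivering"),
    (("winter",), "cold"),
]
FACTS = {"rain", "outdoors", "winter"}


def prove(goal, rules, facts, visiting=frozenset()):
    """Prove `goal` from facts, or via a rule whose antecedents are all provable."""
    if goal in facts:
        return True
    if goal in visiting:  # guard against cyclic rule sets
        return False
    visiting = visiting | {goal}
    for antecedents, consequent in rules:
        if consequent == goal and all(
            prove(a, rules, facts, visiting) for a in antecedents
        ):
            return True
    return False


print(prove("shivering", RULES, FACTS))
print(prove("sunburnt", RULES, FACTS))
```

Each recursive call expands one level of the antecedent tree; independent subgoals are what a parallel implementation would distribute.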
Random Walks on Big Data
Random walks on RDF graphs as a means of:
● Belief materialization
● Soft inference
[Figure: a graph over nodes a to f with edges labeled R, assuming transitivity of R]
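One way to read "soft inference" here: if R is assumed transitive, the fraction of short random walks from a that reach another node can serve as a confidence score that R holds between them. The sketch below uses a hypothetical edge set (the slide's exact edges are not recoverable), and the walk length and stopping probability are arbitrary choices.

```python
import random
from collections import Counter

# Hypothetical edges for a relation R; not the edges from the slide.
R_EDGES = {
    "a": ["c"],
    "b": ["c"],
    "c": ["e"],
    "d": ["f"],
    "e": [],
    "f": [],
}


def walk_scores(edges, start, walks=2000, max_steps=4, stop_prob=0.3, seed=0):
    """Estimate how often short random walks from `start` reach each node."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(walks):
        node = start
        seen = set()
        for _ in range(max_steps):
            successors = edges.get(node, [])
            if not successors or rng.random() < stop_prob:
                break
            node = rng.choice(successors)
            seen.add(node)
        visits.update(seen)  # count each node at most once per walk
    return {node: count / walks for node, count in visits.items()}


scores = walk_scores(R_EDGES, "a")
print(sorted(scores.items()))
```

Nodes closer to the start score higher, and nodes that R never reaches from the start get no score at all.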
Random Walks on Big Data
Large scale graph processing solutions for
scaling random walks over Big Data:
● Apache Giraph http://giraph.apache.org/
● Pregel [Malewicz et al., 2010]
● Grappa http://www.cs.washington.edu/node/4217/
MCMC
A “generic” problem solving method based on local
sampling, useful for soft inferences on semantic data
Time homogeneous Markov Chain:
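In standard textbook notation (the symbols $X_n$, $i$, $j$ and $p_{ij}$ are assumptions, not taken from the slide), a Markov chain is time homogeneous when its one-step transition probabilities do not depend on the step index $n$:

```latex
P(X_{n+1} = j \mid X_n = i) = p_{ij} \quad \text{for all } n \ge 0
```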
MCMC
A homogeneous Markov chain can be represented as a set of
“states” and “transition probabilities” across states
Given an initial “prior” probability distribution across states
the “stationary distribution” or “equilibrium condition”
is defined as:
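In standard notation, writing $P = (p_{ij})$ for the matrix of transition probabilities and $\pi$ for a probability distribution over states (both symbols are assumptions, not taken from the slide), the stationary distribution is one that a step of the chain leaves unchanged:

```latex
\pi = \pi P,
\qquad \text{i.e.} \quad
\pi_j = \sum_{i} \pi_i \, p_{ij} \ \text{ for every state } j,
\qquad \sum_{j} \pi_j = 1
```

For a finite chain that is irreducible and aperiodic, repeatedly applying $P$ to any initial prior converges to this same $\pi$.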
MCMC
Markov Chain Monte Carlo
Given a state space S and an “equilibrium” distribution, choose a sample s of the state space S so that a Markov chain on s has that distribution as its stationary distribution
MCMC for logical inference
For a logical inference problem, the equilibrium condition would be of the form [0,1]^m, defined over a set of m predicates
Example Sampling algorithms for MCMC
Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling
Metropolis-Hastings algorithm
http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
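A minimal Metropolis-Hastings sketch on a small discrete state space, assuming a hypothetical set of unnormalized equilibrium weights and a symmetric proposal that steps to a neighboring state on a ring. It only illustrates that visit frequencies approach the target; it is not tuned for semantic data.

```python
import random
from collections import Counter


def metropolis_hastings(target, steps, seed=0):
    """Visit states 0..k-1 with frequencies approaching `target` (normalized)."""
    rng = random.Random(seed)
    k = len(target)
    state = 0
    counts = Counter()
    for _ in range(steps):
        # Symmetric proposal: move to a neighbor on a ring of k states.
        proposal = (state + rng.choice((-1, 1))) % k
        # Accept with probability min(1, target[proposal] / target[state]).
        if rng.random() < target[proposal] / target[state]:
            state = proposal
        counts[state] += 1
    return [counts[s] / steps for s in range(k)]


TARGET = [1.0, 2.0, 3.0, 4.0]  # hypothetical unnormalized equilibrium weights
freqs = metropolis_hastings(TARGET, steps=200_000)
print([round(f, 2) for f in freqs])
```

Only ratios of target weights are used, so the target never has to be normalized; that property is what makes the method practical when the normalizing constant is expensive to compute.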
Scaling MCMC for Big Data
Distributed MCMC
Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Some examples include: [Murray, 2010; Singh et al., 2011]
Distributional models for MCMC are beyond the scope of this talk..
On the road ahead..
Some promising directions for Big Data and
Semantics
● Diffusion models for large scale inference
● Cognitive models for semantics over large scale data
● Model-based reasoning and reasoning across models
● Soft (probabilistic) inferences, confidence measures,
relevance feedback
● Continuous learning over Big Data
Thank You!
References
● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intromcmclukas.pdf
● Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184
● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.
● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of the NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/
● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.
● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.
● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.