Big Data and the Semantic Web: Challenges and Opportunities

Srinath Srinivasa, Professor at the International Institute of Information Technology, Bangalore (IIIT Bangalore)
Big Data Tech Conclave, 26—27 April 2013
Bangalore, India
Open Systems Laboratory
IIIT Bangalore
http://osl.iiitb.ac.in/
sri@iiitb.ac.in
http://www.bda2013.net/
OSL Releases
Topical Anchors: Given a list of noun phrases, identify a semantic topic for these terms.
Powered by a Wikipedia co-occurrence graph hosted by Agama.
Web APIs enable use of Topical Anchors in third-party applications.
OSL Releases
Topic Expansion: Given a term, expand it into semantically relevant topical clusters with different senses.
Uses co-occurrence datasets from Wikipedia 2006 or 2011.
Web APIs enable use by third-party applications.
OSL Releases
Agama: A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval).
Agama currently powers a co-occurrence graph of all noun phrases from Wikipedia articles, hosted at OSL, managing tens of millions of nodes and hundreds of millions of edges.
“More data beats better algorithms” meets “No data is an island”
Outline
● Big Data Characteristics
● Big Data Analytics
  ● Pattern-driven and Model-driven Analytics
● Big Data and the Semantic Web
● Semantic Challenges
  ● The myth of a global ontology
  ● Convergent and divergent semantics
  ● Semantic interoperability
● Technology Challenges
  ● Storage, traversal and retrieval of large-scale semantic networks
  ● Inference on Big Data
● On the road ahead
Big Data
Data that is
● Too large to be processed by conventional databases and data management techniques (Volume)
● So diverse in structure that no single data model captures all elements of the data (Variety)
● Transient and/or impermanent, especially when pertaining to dynamic phenomena (Velocity)
Big Data
● Transaction records
● Network streams
● Experimental output
● Social media data 
● Demographic records
● Citation data 
● Clickstreams
● Log data
● Weather data 
● …
Some Big Data Stats
● YouTube users upload 48 hours of video every minute
  http://gigaom.com/2011/05/25/youtube-48-hours-of-video-per-minute/
● Facebook data grows by over 500 TB daily
  http://www.slashgear.com/facebook-data-grows-by-over-500-tb-daily-23243691/
● WalMart handles more than 1 million customer transactions every hour
  http://www.economist.com/node/15557443
● Akamai analyzes 75 million events per day for targeted advertising
  http://wikibon.org/blog/taming-big-data/
● 90% of data in the world today was created in the last 2 years
  http://wikibon.org/blog/big-data-infographics/
Big Data Analytics
Examine Big Data for useful (often actionable) knowledge
The long spectrum of Big Data Analytics, from pattern driven to model driven:
● Pattern identification
● Association rule mining
● Classification/Clustering
● Record Linkage
● Security analytics
● Complex Event Processing
● Opinion mining
● Predictive modeling
Pattern Driven Analytics
● Discovery and visualization of recurring patterns in datasets
● Mostly quantitative
● Paradigms in pattern discovery:
  ● Sampling and aggregation
  ● Thresholding and filtering
Image source: Wikipedia
Pattern Driven Analytics
Sampling and Aggregation
● Query-based pattern aggregation
● Based on an initial idea of what we are looking for
[Figure: a Hypothesis drives a Query over the Data, followed by Patterns, Aggregation and Presentation]
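The query-driven flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `aggregate` helper and the umbrella hypothesis are all hypothetical, and the sampling step is left out for brevity.

```python
from collections import defaultdict

# Hypothetical transaction records: (store, product, amount).
RECORDS = [
    ("north", "umbrella", 12.0),
    ("north", "sunscreen", 8.0),
    ("south", "umbrella", 10.0),
    ("south", "umbrella", 11.0),
    ("north", "umbrella", 7.0),
]


def aggregate(records, predicate, key):
    """Keep records that satisfy the query predicate, then sum amounts per key."""
    totals = defaultdict(float)
    for record in records:
        if predicate(record):
            totals[key(record)] += record[2]
    return dict(totals)


# Hypothesis: umbrella sales differ by store. The query encodes that hypothesis.
umbrella_by_store = aggregate(
    RECORDS,
    predicate=lambda r: r[1] == "umbrella",
    key=lambda r: r[0],
)
print(umbrella_by_store)
```

The point of the sketch is that nothing outside the query's predicate is ever examined, so a pattern the hypothesis did not anticipate stays invisible.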
Pattern Driven Analytics
Thresholding and Filtering
● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query
[Figure: Data and Interestingness criteria feed Patterns, followed by Filtering and Segregation, then Presentation]
Model Driven Analytics
Analytics as a model-discovery problem
[Figure: images as Observable Data, explained by the Latent Concept “Wedding”. Images source: Wikipedia]
Model Driven Analytics
● Pattern discovery coupled with semantic modeling
● Non-trivial qualitative modeling challenges
● Model discovery:
  ● Descriptive model discovery: fit a model to explain the observed data
  ● Predictive model discovery: discover a model that can predict values of data elements into the future
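The two kinds of model discovery can be illustrated with the simplest possible model, a straight line. In this sketch the least-squares fit plays the descriptive role and extrapolating it plays the predictive role. The data and the `fit_line` helper are hypothetical; real model discovery over Big Data would involve far richer model families.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: daily data volume over five days.
days = [1, 2, 3, 4, 5]
volume = [12.0, 14.0, 16.0, 18.0, 20.0]

# Descriptive: a model that explains the observed data.
slope, intercept = fit_line(days, volume)

# Predictive: use the same model to estimate a value in the future.
forecast_day_6 = slope * 6 + intercept
print(slope, intercept, forecast_day_6)
```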
Linked Data
[Figure: The Linked Data Cloud as of September 2011. Image source: Wikipedia]
Linked Data
● Using Semantic Web technologies to connect data elements from disparate data sources
● From Web of Documents to Web of Data
● Elements of Linked Data
  ● URIs
  ● HTTP
  ● Resource Description Framework (RDF)
  ● Serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
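To make these elements concrete, the sketch below models RDF statements as (subject, predicate, object) tuples whose terms are URIs or literals, and writes them out in N-Triples, one of the simpler line-based serializations. The `URI`, `Literal` and `to_ntriples` names are hypothetical helpers written for this example, not a library API, and the DBpedia-style URIs are used only for illustration.

```python
from typing import NamedTuple


class URI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str


def to_ntriples_term(term):
    """Render one term: URIs in angle brackets, literals as quoted strings."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(
        f"{to_ntriples_term(s)} {to_ntriples_term(p)} {to_ntriples_term(o)} ."
        for s, p, o in triples
    )


BANGALORE = URI("http://dbpedia.org/resource/Bangalore")
TRIPLES = [
    (BANGALORE, URI("http://dbpedia.org/ontology/country"),
     URI("http://dbpedia.org/resource/India")),
    (BANGALORE, URI("http://www.w3.org/2000/01/rdf-schema#label"),
     Literal("Bangalore")),
]

output = to_ntriples(TRIPLES)
print(output)
```

Because subjects and objects are URIs that can be dereferenced over HTTP, a triple in one dataset can point at a resource described in another, which is what turns separate datasets into linked data.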
Big Data and the Semantic Web
[Figure: Big Data feeds the Semantic Web through Model Discovery; the Semantic Web feeds Big Data through Catalyzation and Predictive Modeling]
Big Data → Semantic Web
● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia
● Open Biomedical Ontology (OBO) (http://www.oboedit.org/), created by mining PubMed publications
● Enterprise-scale Big Data Analytics helps build organizational models, operational intelligence solutions, etc. Examples: the Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com) and the Loom data management suite by Revelytix (www.revelytix.com)
Semantic Web → Big Data
Schema.org
● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content
SourceMap
● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. www.sourcemap.com
OpenStreetMap
● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data)
Big Data and the Semantic Web: Challenges
Semantic challenges
● The myth of a global ontology
● Convergent and divergent semantics
Technology and system challenges
● Characteristics of a semantic graph
● Managing graph-structured data
The Myth of a Global Ontology
Several “core” semantic ontologies exist:
● WordNet
● YAGO
● OpenCyc
● SUMO
However, none of them (not even the automatically constructed ones) can capture all possible semantic associations and all possible perspectives on a given topic
The Myth of a Global Ontology
The open world problem
● We don't know what we don't know.. 
● Representation bias in big data sources
The neutral­but­useless perspective
● Localized, utilitarian descriptions are often more useful than neutral, global descriptions. Example: the use of “zones” as a geographical element in Indian Railways
● It is difficult for disparate perspectives to co-exist in a single ontology without violating design principles like Occam's razor
Convergent and Divergent Semantics
[Figure: Encyclopedic Semantics. The Palestine POV, Israeli POV, Historians' POV and UN's POV converge on the Wikipedia article on the West Bank conflict]
Convergent and Divergent Semantics
[Figure: the IPL event schedule feeds divergent uses: Traffic planning, Advertisement planning around IPL, Legal structuring around IPL, TV programme scheduling, Security planning]
Semantic Interoperability
● Binary predicates, as used in RDF, may not capture the complete semantics of an association
  But it is too difficult to work with higher-order predicates
● Semantic queries are characterized by contextual relevance and default assumptions
● Linked Data can be useful primarily within the context of a model
  Model-building from predicates is as complex a problem as identifying predicates from data
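The limitation of binary predicates shows up as soon as a fact has more than two participants. A common workaround, shown here as an assumption rather than something the talk prescribes, is to introduce an intermediate node and hang one binary triple per role off it. The `reify` helper, the blank-node naming and the match details are all hypothetical.

```python
import itertools

_ids = itertools.count(1)


def reify(relation, **roles):
    """Break an n-ary fact into binary triples around a fresh intermediate node."""
    node = f"_:{relation}{next(_ids)}"
    triples = [(node, "rdf:type", relation)]
    triples += [(node, role, value) for role, value in roles.items()]
    return node, triples


# A four-way association that no single binary triple can express.
node, triples = reify(
    "Match",
    homeTeam="TeamA",
    awayTeam="TeamB",
    venue="Bangalore",
    date="2013-04-26",
)
for triple in triples:
    print(triple)
```

The cost is visible in the output: one fact becomes five triples, and the meaning of the whole only emerges when a consumer knows to read them together, which is the model-dependence the slide points at.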
Semantic Challenges: Summary
● Hard to distinguish data from noise without a model
  Especially hard when we are using data to help build a model!
● There may not be a single global model explaining the data
● Model construction is as challenging as predicate mining, if not more so
● No clarity on the underlying processes that aid in knowledge aggregation
  Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)
Tech Challenges
Storing Big Semantic Data
● Semantic data lacks the physical access coherence needed for efficient storage in relational tables
● Logical proximity of triples is more important than physical proximity
● Read/write storage models change logical proximity
● RDF graphs tend to be extremely dense and/or clustered
● Need efficient methods of graph storage and retrieval
Semantic store for Big Data
● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object)
● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries
● Main design criteria: storage and read-ahead policies of triples based on their logical proximity rather than physical proximity, in order to enable Bulk Synchronous Parallel (BSP) processing
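A minimal in-memory sketch of the triple-oriented storage idea follows. It keeps three indexes so that a pattern with any one position bound can be answered without scanning everything. The `TripleStore` class and its `match` method are hypothetical names for this example; read-ahead policies, persistence, SPARQL parsing and BSP processing are all out of scope here.

```python
from collections import defaultdict


class TripleStore:
    """Toy store indexing triples by subject, predicate and object."""

    def __init__(self):
        self._spo = defaultdict(lambda: defaultdict(set))
        self._pos = defaultdict(lambda: defaultdict(set))
        self._osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self._spo[s][p].add(o)
        self._pos[p][o].add(s)
        self._osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            for p2, objs in self._spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            for o2, subjs in self._pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjs:
                        yield (s2, p, o2)
        elif o is not None:
            for s2, preds in self._osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            for s2, by_pred in self._spo.items():
                for p2, objs in by_pred.items():
                    for o2 in objs:
                        yield (s2, p2, o2)


store = TripleStore()
store.add("IIITB", "locatedIn", "Bangalore")
store.add("Bangalore", "locatedIn", "India")
store.add("IIITB", "type", "Institute")

located = set(store.match(p="locatedIn"))
print(located)
```

Grouping triples under a shared subject, predicate or object is a crude stand-in for logical proximity: triples that are likely to be read together sit in the same index bucket regardless of insertion order.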
Semantic store for Big Data
AllegroGraph (http://www.franz.com/agraph/allegrograph/)
● NoSQL graph-based native storage for RDF triples
● ACID compliant
● Interfaces with Solr for free-text indexing
● Triple and text level indexing
● MongoDB integration
● RDFS++ reasoning with dynamic materialization
● SPARQL queries on named graphs and a Prolog-based inferencing engine
Semantic store for Big Data
Sesame http://www.openrdf.org/
● Open source Java framework for parsing, storing, querying and inferencing over RDF data
● Collections of RDF triples can be manipulated in memory using a graph data model
● Compliant with the SPARQL 1.1 protocol recommendation
● Provides two levels of APIs: SAIL (Storage and Inference Layer) for low-level RDF processing, and the Repository layer for programmatic interfacing with Sesame
Semantic store for Big Data
Mulgara http://www.mulgara.org/
● Native storage model for RDF
● Supports multiple models (databases) per server
● ACID transactions and concurrency support
● Copy-on-write cache semantics
● Full-text search and support for data types
● Primarily useful as a repository; no evidence of support for logical inferences over RDF
Semantic store for Big Data
Other examples:
● InfiniteGraph from Objectivity http://www.objectivity.com/
● Big-Data http://www.bigdata.com/bigdata/blog/
  – A high scale-out storage and computing engine
● Agama https://github.com/arrac/agama/wiki/Agama
  – Storage, search and traversal support (Ruby library) for very large graphs
● Neo4j http://www.neo4j.org/
  – Embedded, disk-based transactional graph database written in Java
Logical inference over Big Data
● Problem: Find factual answers to specific questions by reasoning over large-scale data
● Performing extremely large-scale deductions over large semantic datasets in interactive response time
● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions
● Varieties of inference over datasets
  ● Deduction
  ● Induction
  ● Abduction
  ● Statistical inference
Logical inference over Big Data
Common approaches for scalable inferencing:
● Horn clause inferencing
● Variants of random walks on knowledge graphs
● Distributed MCMC (Markov Chain Monte Carlo) methods
Horn Clauses
Horn clauses are predicates of the form:
p1 ∧ p2 ∧ ... ∧ pn → u
where each term is an atomic sentence with no negation, and there is a single consequent
Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts
Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce
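Backward chaining over propositional Horn clauses can be sketched as a short recursive function. This is a single-machine illustration only: the rule encoding as (consequent, antecedents) pairs, the `backward_chain` name and the example symbols are assumptions, and the MapReduce parallelization is not shown.

```python
def backward_chain(goal, rules, facts, _seen=None):
    """Return True if goal is derivable from facts using Horn rules.

    rules is a list of (consequent, antecedents) pairs, read as
    antecedent_1 AND ... AND antecedent_n -> consequent.
    """
    seen = _seen or frozenset()
    if goal in facts:
        return True  # grounded in a known fact
    if goal in seen:
        return False  # already being expanded on this branch: avoid cycles
    seen = seen | {goal}
    # Try every rule whose consequent is the goal; all antecedents must hold.
    return any(
        all(backward_chain(p, rules, facts, seen) for p in antecedents)
        for consequent, antecedents in rules
        if consequent == goal
    )


RULES = [
    ("u", ["p1", "p2"]),  # p1 AND p2 -> u
    ("p2", ["q"]),        # q -> p2
]
FACTS = {"p1", "q"}

print(backward_chain("u", RULES, FACTS))
print(backward_chain("v", RULES, FACTS))
```

Each rule tried for a goal opens an independent subtree of antecedents, which is the structure that makes the search a natural candidate for parallel evaluation.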
Random Walks on Big Data
Random walks on RDF graphs as a means of:
● Belief materialization
● Soft inference
[Figure: a graph over nodes a, b, c, d, e and f, with edges labeled R]
Assuming transitivity of R
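Under transitivity, R(a, x) is implied whenever x is reachable from a along R edges, and random walks give a soft score for that implication instead of a yes/no answer. The sketch below estimates the fraction of bounded random walks from a source that hit a target. The edge set is hypothetical (the edges of the figure are not reproduced here), as are the `reach_score` helper and its parameters.

```python
import random

# Hypothetical R-edges over the nodes a..f.
EDGES = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}


def reach_score(graph, source, target, walks=2000, max_steps=4, seed=0):
    """Estimate the fraction of random walks from source that reach target."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = 0
    for _ in range(walks):
        node = source
        for _ in range(max_steps):
            successors = graph.get(node, [])
            if not successors:
                break  # dead end: this walk stops here
            node = rng.choice(successors)
            if node == target:
                hits += 1
                break
    return hits / walks


score_af = reach_score(EDGES, "a", "f")
score_ae = reach_score(EDGES, "a", "e")
print(score_af, score_ae)
```

In this toy graph f is reachable from a along several paths while e is reachable along one, so R(a, f) receives a higher score than R(a, e) even though both hold under strict transitivity.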
Random Walks on Big Data
Large-scale graph processing solutions for scaling random walks over Big Data:
● Apache Giraph http://giraph.apache.org/ 
● Pregel [Malewicz et al., 2010]
● Grappa http://www.cs.washington.edu/node/4217/ 
MCMC
A “generic” problem solving method based on local sampling, useful for soft inferences on semantic data
Time-homogeneous Markov chain: a chain whose transition probabilities between states do not change from one step to the next
MCMC
A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states
Given an initial “prior” probability distribution across states, the “stationary distribution” or “equilibrium condition” is the distribution over states that a further transition step leaves unchanged
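In standard notation, which is an assumption here rather than the slide's own symbols, a chain with transition matrix P has stationary distribution π when π P = π. For a small, well-behaved chain this can be found by repeatedly applying P to a prior until the distribution stops changing. The two-state matrix and the `step` and `stationary` helpers below are hypothetical.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(prior, P, tol=1e-12, max_iter=10_000):
    """Apply P repeatedly, starting from the prior, until the distribution settles."""
    dist = list(prior)
    for _ in range(max_iter):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical two-state chain; each row of P sums to 1.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi = stationary([1.0, 0.0], P)
print(pi)
```

For this chain the balance condition 0.1 · π[0] = 0.5 · π[1] gives π = (5/6, 1/6), and the iteration reaches the same point from any prior.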
MCMC
Markov Chain Monte Carlo
Given a state space S and a target “equilibrium” distribution, choose a sample s of the state space S so that a Markov chain on s has that distribution as its stationary distribution
MCMC for logical inference
For a logical inference problem, the equilibrium condition would be of the form [0,1]^m, defined over a set of m predicates
Example sampling algorithms for MCMC
● Gibbs sampling http://en.wikipedia.org/wiki/Gibbs_sampling
● Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
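As a small illustration of the second algorithm, the sketch below runs Metropolis-Hastings over four discrete states arranged in a ring, with a symmetric "step left or right" proposal. The target weights, the ring layout and the `metropolis_hastings` helper are assumptions chosen to keep the example short; it is not tied to any particular logical inference problem.

```python
import random
from collections import Counter


def metropolis_hastings(weights, steps, seed=0):
    """Sample states 0..n-1 with probability proportional to weights."""
    rng = random.Random(seed)  # seeded for reproducible runs
    n = len(weights)
    state = 0
    visits = Counter()
    for _ in range(steps):
        # Symmetric proposal: move one position left or right on the ring.
        proposal = (state + rng.choice((-1, 1))) % n
        # With a symmetric proposal the acceptance ratio is just the weight ratio.
        accept = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept:
            state = proposal
        visits[state] += 1
    return [visits[s] / steps for s in range(n)]


# Unnormalized target: the normalized distribution is 0.1, 0.2, 0.3, 0.4.
WEIGHTS = [1.0, 2.0, 3.0, 4.0]
freqs = metropolis_hastings(WEIGHTS, steps=200_000)
print(freqs)
```

Only ratios of weights are ever used, so the target never has to be normalized. That property is what makes the method attractive when the state space is too large to enumerate.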
Scaling MCMC for Big Data
Distributed MCMC
Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Some examples include [Murray 2010; Singh et al. 2011].
Distributional models for MCMC are beyond the scope of this talk.
On the road ahead..
Some promising directions for Big Data and Semantics
● Diffusion models for large-scale inference
● Cognitive models for semantics over large-scale data
● Model-based reasoning and reasoning across models
● Soft (probabilistic) inferences, confidence measures, relevance feedback
● Continuous learning over Big Data
Thank You!
References
● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf
● Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184
● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.
● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of the NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/
● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.
● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.
● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.
350 views21 Folien
The Web and the Mind von
The Web and the MindThe Web and the Mind
The Web and the MindSrinath Srinivasa
1K views59 Folien
Big Social Machines: Architecture and Challenges von
Big Social Machines: Architecture and ChallengesBig Social Machines: Architecture and Challenges
Big Social Machines: Architecture and ChallengesSrinath Srinivasa
563 views29 Folien

Más de Srinath Srinivasa(15)

Modeling sustainability in social networks von Srinath Srinivasa
Modeling sustainability in social networksModeling sustainability in social networks
Modeling sustainability in social networks
Srinath Srinivasa184 views
Big Social Machines: Architecture and Challenges von Srinath Srinivasa
Big Social Machines: Architecture and ChallengesBig Social Machines: Architecture and Challenges
Big Social Machines: Architecture and Challenges
Srinath Srinivasa563 views
The Power Law of Social Media: What CIOs Should Know von Srinath Srinivasa
The Power Law of Social Media: What CIOs Should KnowThe Power Law of Social Media: What CIOs Should Know
The Power Law of Social Media: What CIOs Should Know
Srinath Srinivasa1.1K views
Aggregating Operational Knowledge in Community Settings von Srinath Srinivasa
Aggregating Operational Knowledge in Community SettingsAggregating Operational Knowledge in Community Settings
Aggregating Operational Knowledge in Community Settings
Srinath Srinivasa509 views
Semantics hidden within co-occurrence patterns von Srinath Srinivasa
Semantics hidden within co-occurrence patternsSemantics hidden within co-occurrence patterns
Semantics hidden within co-occurrence patterns
Srinath Srinivasa1.4K views
The open problem of open-world computing von Srinath Srinivasa
The open problem of open-world computingThe open problem of open-world computing
The open problem of open-world computing
Srinath Srinivasa1.2K views
Trends In Graph Data Management And Mining von Srinath Srinivasa
Trends In Graph Data Management And MiningTrends In Graph Data Management And Mining
Trends In Graph Data Management And Mining
Srinath Srinivasa2.8K views

Último

The relative risk of cancer from smoking and vaping nicotine von
The relative risk of cancer from smoking and vaping nicotine The relative risk of cancer from smoking and vaping nicotine
The relative risk of cancer from smoking and vaping nicotine yfzsc5g7nm
171 views25 Folien
The AI apocalypse has been canceled von
The AI apocalypse has been canceledThe AI apocalypse has been canceled
The AI apocalypse has been canceledTina Purnat
125 views19 Folien
status epilepticus-management von
status epilepticus-managementstatus epilepticus-management
status epilepticus-managementVamsi Krishna Koneru
8 views91 Folien
sales forecasting (Pharma) von
sales forecasting (Pharma)sales forecasting (Pharma)
sales forecasting (Pharma)sristi51
7 views13 Folien
VarSeq 2.5.0: VSClinical AMP Workflow from the User Perspective von
VarSeq 2.5.0: VSClinical AMP Workflow from the User PerspectiveVarSeq 2.5.0: VSClinical AMP Workflow from the User Perspective
VarSeq 2.5.0: VSClinical AMP Workflow from the User PerspectiveGolden Helix
20 views24 Folien
PATIENTCOUNSELLING in.pptx von
PATIENTCOUNSELLING  in.pptxPATIENTCOUNSELLING  in.pptx
PATIENTCOUNSELLING in.pptxskShashi1
16 views16 Folien

Último(20)

The relative risk of cancer from smoking and vaping nicotine von yfzsc5g7nm
The relative risk of cancer from smoking and vaping nicotine The relative risk of cancer from smoking and vaping nicotine
The relative risk of cancer from smoking and vaping nicotine
yfzsc5g7nm171 views
The AI apocalypse has been canceled von Tina Purnat
The AI apocalypse has been canceledThe AI apocalypse has been canceled
The AI apocalypse has been canceled
Tina Purnat125 views
sales forecasting (Pharma) von sristi51
sales forecasting (Pharma)sales forecasting (Pharma)
sales forecasting (Pharma)
sristi517 views
VarSeq 2.5.0: VSClinical AMP Workflow from the User Perspective von Golden Helix
VarSeq 2.5.0: VSClinical AMP Workflow from the User PerspectiveVarSeq 2.5.0: VSClinical AMP Workflow from the User Perspective
VarSeq 2.5.0: VSClinical AMP Workflow from the User Perspective
Golden Helix20 views
PATIENTCOUNSELLING in.pptx von skShashi1
PATIENTCOUNSELLING  in.pptxPATIENTCOUNSELLING  in.pptx
PATIENTCOUNSELLING in.pptx
skShashi116 views
Referral-system_April-2023.pdf von manali9054
Referral-system_April-2023.pdfReferral-system_April-2023.pdf
Referral-system_April-2023.pdf
manali905437 views
MEDICAL RESEARCH.pptx von rishi2789
MEDICAL RESEARCH.pptxMEDICAL RESEARCH.pptx
MEDICAL RESEARCH.pptx
rishi278947 views
melani glossophobia.pdf von Paygeon
melani glossophobia.pdfmelani glossophobia.pdf
melani glossophobia.pdf
Paygeon9 views
eTEP -RS Dr.TVR.pptx von Varunraju9
eTEP -RS Dr.TVR.pptxeTEP -RS Dr.TVR.pptx
eTEP -RS Dr.TVR.pptx
Varunraju998 views
Taking Action to Improve the Patient Journey With Transthyretin Amyloidosis (... von PeerVoice
Taking Action to Improve the Patient Journey With Transthyretin Amyloidosis (...Taking Action to Improve the Patient Journey With Transthyretin Amyloidosis (...
Taking Action to Improve the Patient Journey With Transthyretin Amyloidosis (...
PeerVoice7 views
Structural Racism and Public Health: How to Talk to Policymakers and Communit... von katiequigley33
Structural Racism and Public Health: How to Talk to Policymakers and Communit...Structural Racism and Public Health: How to Talk to Policymakers and Communit...
Structural Racism and Public Health: How to Talk to Policymakers and Communit...
katiequigley33290 views
Pharma Franchise For Critical Care Medicine | Saphnix Lifesciences von Saphnix Lifesciences
Pharma Franchise For Critical Care Medicine | Saphnix LifesciencesPharma Franchise For Critical Care Medicine | Saphnix Lifesciences
Pharma Franchise For Critical Care Medicine | Saphnix Lifesciences

Big Data and the Semantic Web: Challenges and Opportunities

  • 1. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Big Data and the Semantic Web: Challenges and Opportunities Srinath Srinivasa Open Systems Laboratory IIIT Bangalore http://osl.iiitb.ac.in/ sri@iiitb.ac.in
  • 2. http://www.bda2013.net/
  • 3. OSL Releases: Topical Anchors. Given a list of noun phrases, identify a semantic topic for these terms. Powered by a Wikipedia co-occurrence graph hosted by Agama. Web APIs enable use of Topical Anchors in third-party applications.
  • 4. OSL Releases: Topic Expansion. Given a term, expands it into semantically relevant topical clusters with different senses. Uses co-occurrence datasets from Wikipedia 2006 or 2011. Web APIs enable use by third-party applications.
  • 5. OSL Releases: Agama. A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval). Agama currently powers a co-occurrence graph of all noun phrases from Wikipedia articles, hosted in OSL, managing tens of millions of nodes and hundreds of millions of edges.
  • 6. “More data beats better algorithms” meets “No data is an island”
  • 7. Outline ● Big Data Characteristics ● Big Data Analytics (Pattern-driven and Model-driven Analytics) ● Big Data and the Semantic Web ● Semantic Challenges (The myth of a global ontology; Convergent and divergent semantics; Semantic interoperability) ● Technology Challenges (Storage, traversal and retrieval of large-scale semantic networks; Inference on Big Data) ● On the road ahead
  • 8. Big Data: Data that is ● Too large to be processed by conventional databases and data management techniques (Volume) ● So diverse in structure that no single data model captures all elements of the data (Variety) ● Transient and/or impermanent, especially when pertaining to dynamic phenomena (Velocity)
  • 9. Big Data ● Transaction records ● Network streams ● Experimental output ● Social media data ● Demographic records ● Citation data ● Clickstreams ● Log data ● Weather data ● …
  • 10. Some Big Data Stats ● YouTube users upload 48 hours of video every minute http://gigaom.com/2011/05/25/youtube-48-hours-of-video-per-minute/ ● Facebook data grows by 500TB daily http://www.slashgear.com/facebook-data-grows-by-over-500-tb-daily-23243691/ ● WalMart handles more than 1 million customer transactions every hour http://www.economist.com/node/15557443 ● Akamai analyzes 75 million events per day for targeted advertising http://wikibon.org/blog/taming-big-data/ ● 90% of data in the world today was created in the last 2 years http://wikibon.org/blog/big-data-infographics/
  • 11. Big Data Analytics: Examine Big Data for useful (often actionable) knowledge. The long spectrum of Big Data Analytics, running from pattern-driven to model-driven: Pattern identification, Association rule mining, Classification/Clustering, Record Linkage, Security analytics, Complex Event Processing, Opinion mining, Predictive modeling.
  • 12. Pattern Driven Analytics ● Discovery and visualization of recurring patterns in datasets ● Mostly quantitative ● Paradigms in pattern discovery: sampling and aggregation; thresholding and filtering. (Image source: Wikipedia)
  • 13. Pattern Driven Analytics: Sampling and Aggregation ● Query-based pattern aggregation ● Based on an initial idea of what we are looking for. Diagram labels: Hypothesis, Data, Query, Patterns, Aggregation, Presentation.
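A minimal Python sketch of the query-driven flow on this slide, assuming a hypothetical hypothesis ("average transaction amount differs by region") and made-up records. The field names and the `aggregate_by` helper are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

# Hypothetical transaction records; the fields are assumptions for illustration.
records = [
    {"region": "south", "amount": 120.0},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 60.0},
    {"region": "north", "amount": 40.0},
]

def aggregate_by(rows, key, value):
    """Query step: group rows by `key` and average `value` within each group."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, row count]
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += row[value]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# The hypothesis decides which query is run; the aggregate is what gets presented.
mean_by_region = aggregate_by(records, "region", "amount")
```

The point of the sketch is that the query exists before the data is examined: only the groups named by the hypothesis are ever aggregated.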
  • 14. Pattern Driven Analytics: Thresholding and Filtering ● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query. Diagram labels: Data, Interestingness criteria, Patterns, Filtering and Segregation, Presentation.
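By contrast, thresholding scans everything and keeps whatever passes an interestingness criterion. The sketch below assumes co-occurrence count (support) as that criterion and uses invented shopping baskets; `frequent_pairs` and `min_support` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items seen together.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def frequent_pairs(transactions, min_support):
    """Count every item pair in every transaction, then keep pairs at or above the threshold."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# No query is posed up front; the threshold alone decides what is "interesting".
interesting = frequent_pairs(baskets, min_support=3)
```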
  • 15. Model Driven Analytics: Analytics as a model-discovery problem. Figure labels: Wedding, Observable Data, Latent Concept. (Images source: Wikipedia)
  • 16. Model Driven Analytics ● Pattern discovery coupled with semantic modeling ● Non-trivial qualitative modeling challenges ● Model discovery: ● Descriptive model discovery: fit a model to explain the observed data ● Predictive model discovery: discover a model that can predict values of data elements into the future
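The descriptive/predictive distinction can be illustrated with the simplest possible model, a least-squares line. This is only a sketch under the assumption that a linear model is appropriate; the data points are invented and the slide does not prescribe any particular model family.

```python
def fit_line(xs, ys):
    """Descriptive step: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical observations at time steps 0..3.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)   # model that explains the observed data
forecast = slope * 4.0 + intercept    # predictive use: extrapolate to an unseen step
```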
  • 17. Linked Data: The Linked Data Cloud as of September 2011. (Image source: Wikipedia)
  • 18. Linked Data ● Using Semantic Web technologies to connect data elements from disparate data sources ● From Web of Documents to Web of Data ● Elements of Linked Data: URIs; HTTP; Resource Description Framework (RDF); serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
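To make the "connect data elements from disparate sources" idea concrete, here is a small sketch that models RDF statements as plain Python tuples. The `http://example.org/` namespace, the predicates, and the `objects` helper are placeholders invented for illustration; no RDF library is used.

```python
# Each statement is a (subject, predicate, object) triple; URIs identify things.
EX = "http://example.org/"  # placeholder namespace, not a real dataset

# Two independently published sources that happen to share one URI.
source_a = {(EX + "IIITB", EX + "locatedIn", EX + "Bangalore")}
source_b = {(EX + "Bangalore", EX + "country", EX + "India")}

def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Because both sources use the same URI for the city, a plain set union links them.
linked = source_a | source_b
cities = objects(linked, EX + "IIITB", EX + "locatedIn")
countries = {c for city in cities for c in objects(linked, city, EX + "country")}
```

The shared URI is what does the linking; neither source needs to know about the other.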
  • 19. Big Data and the Semantic Web. Diagram labels: Big Data, Semantic Web, Model Discovery, Catalyzation and Predictive Modeling.
  • 20. Big Data → Semantic Web ● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia ● Open Biomedical Ontology (OBO) (http://www.oboedit.org/) created from mining PubMed publications ● Enterprise-scale Big Data Analytics helping build organizational models, operational intelligence solutions, etc. Examples: Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com), Loom data management suite by Revelytix (www.revelytix.com)
  • 21. Semantic Web → Big Data. Schema.org ● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content. SourceMap ● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. www.sourcemap.com. OpenStreetMap ● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data)
  • 22. Big Data and the Semantic Web: Challenges. Semantic challenges ● The myth of a global ontology ● Convergent and divergent semantics. Technology and system challenges ● Characteristics of a semantic graph ● Managing graph-structured data
  • 23. The Myth of a Global Ontology. Several “core” semantic ontologies exist: ● WordNet ● YAGO ● OpenCyc ● SUMO. However, none of them (even automated ones) can capture all possible semantic associations and all possible perspectives on a given topic.
  • 24. The Myth of a Global Ontology. The open world problem ● We don't know what we don't know ● Representation bias in big data sources. The neutral-but-useless perspective ● Localized, utilitarian descriptions are often more useful than neutral, global descriptions. Example: use of “zones” as a geographical element in Indian Railways ● It is difficult for disparate perspectives to co-exist in a single ontology, violating design principles like Occam's razor
  • 25. Convergent and Divergent Semantics. Diagram labels: Wikipedia article on West Bank conflict; Palestine POV, Israeli POV, Historians' POV, UN's POV; Encyclopedic Semantics.
  • 26. Convergent and Divergent Semantics. Diagram labels: IPL event schedule; Traffic planning, Advertisement planning around IPL, Legal structuring around IPL, TV programme scheduling, Security planning.
  • 27. Semantic Interoperability ● Binary predicates, as in RDF, may not capture the complete semantics of an association, but higher-order predicates are too difficult to work with ● Semantic queries are characterized by contextual relevance and default assumptions ● Linked Data can be useful primarily within the context of a model. Model-building from predicates is as complex a problem as identifying predicates from data
  • 28. Semantic Challenges: Summary ● Hard to distinguish data from noise without a model, and especially hard when we are using the data to help build a model ● There may not be a single global model explaining the data ● Model construction is as challenging as predicate mining, if not more so ● No clarity on the underlying processes that aid in knowledge aggregation. Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)
  • 29. Tech Challenges: Storing Big Semantic Data ● Semantic data lacks the physical access coherence needed for efficient storage in relational tables ● Logical proximity of triples is more important than physical proximity ● Read/write storage models change logical proximity ● RDF graphs tend to be extremely dense and/or clustered ● Need efficient methods of graph storage and retrieval
  • 30. Semantic store for Big Data ● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object) ● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries ● Main design criteria: storage and read-ahead policies of triples based on their logical proximity rather than physical proximity, in order to enable Bulk Synchronous Parallel (BSP) processing
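A toy in-memory triple store gives a feel for "logical proximity": triples are indexed by the node they touch, so everything about one subject is found together regardless of insertion order. This is a sketch only. The `TripleStore` class is hypothetical, does not reflect how any product listed in this talk is implemented, and uses wildcard pattern matching as a stand-in for a single SPARQL triple pattern.

```python
from collections import defaultdict

class TripleStore:
    """Toy store indexing (subject, predicate, object) triples by subject and by object."""

    def __init__(self):
        self._by_subject = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self._by_subject[s].add(triple)
        self._by_object[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            candidates = self._by_subject.get(s, set())
        elif o is not None:
            candidates = self._by_object.get(o, set())
        else:
            candidates = set().union(*self._by_subject.values())
        return {
            t for t in candidates
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        }

store = TripleStore()
store.add("a", "knows", "b")
store.add("b", "knows", "c")
store.add("a", "likes", "c")

about_a = store.match(s="a")               # all triples logically adjacent to "a"
who_knows_c = store.match(p="knows", o="c")
```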
  • 31. Semantic store for Big Data: AllegroGraph (http://www.franz.com/agraph/allegrograph/) ● NoSQL graph-based native storage for RDF triples ● ACID compliant ● Interfaces with Solr for free-text indexing ● Triple- and text-level indexing ● MongoDB integration ● RDFS++ reasoning with dynamic materialization ● SPARQL queries on named graphs and a Prolog-based inferencing engine
  • 32. Semantic store for Big Data: Sesame http://www.openrdf.org/ ● Open source Java framework for parsing, storing, querying and inferencing over RDF data ● Collections of RDF triples can be manipulated in memory using a graph data model ● Compliant with the SPARQL 1.1 protocol recommendation ● Provides two levels of APIs: SAIL (Storage and Inference Layer) for low-level RDF processing and a Repository layer for programmatic interfacing with Sesame
  • 33. Semantic store for Big Data: Mulgara http://www.mulgara.org/ ● Native storage model for RDF ● Supports multiple models (databases) per server ● ACID transactions and concurrency support ● Copy-on-write cache semantics ● Full-text search and support for data types ● Primarily useful as a repository: no evidence of support for logical inferences over RDF
  • 34. Semantic store for Big Data: Other examples ● InfiniteGraph from Objectivity http://www.objectivity.com/ ● Big-Data http://www.bigdata.com/bigdata/blog/ : a high scale-out storage and computing engine ● Agama https://github.com/arrac/agama/wiki/Agama : storage, search and traversal support (Ruby library) for very large graphs ● Neo4j http://www.neo4j.org/ : embedded, disk-based transactional graph database written in Java
  • 35. Logical inference over Big Data ● Problem: find factual answers to specific questions by reasoning over large-scale data ● Performing extremely large-scale deductions over large semantic datasets in interactive response time ● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions ● Varieties of inference over datasets: deduction, induction, abduction, statistical inference
  • 36. Logical inference over Big Data. Common approaches for scalable inferencing: ● Horn clause inferencing ● Variants of random walks on knowledge graphs ● Distributed MCMC (Markov Chain Monte Carlo) methods
  • 37. Horn Clauses. Horn clauses are predicates of the form p1 ∧ p2 ∧ ... ∧ pn → u, where each term is an atomic sentence with no negation and there is a single consequent. Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts. Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce.
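Backward chaining as described on this slide can be sketched in a few lines for the propositional case. The rule base, the fact set, and the `prove` function below are illustrative assumptions; the sketch is sequential and does not attempt the MapReduce parallelization the slide mentions.

```python
# Each consequent maps to a list of alternative bodies; a body is a list of antecedents
# that must all hold (p1 AND p2 AND ... AND pn -> u). Contents are hypothetical.
rules = {
    "mortal": [["human"]],
    "human": [["greek"]],
}
facts = {"greek"}

def prove(goal, rules, facts, seen=frozenset()):
    """Backward chaining: start at the consequent and recurse into antecedents until grounded."""
    if goal in facts:
        return True
    if goal in seen:  # goal already on the current proof path; stop to avoid looping
        return False
    seen = seen | {goal}
    return any(
        all(prove(antecedent, rules, facts, seen) for antecedent in body)
        for body in rules.get(goal, [])
    )

is_mortal = prove("mortal", rules, facts)      # mortal <- human <- greek (a fact)
is_immortal = prove("immortal", rules, facts)  # no rule and no fact supports this goal
```

Each recursive call expands one level of the antecedent tree; independent subgoals are the natural unit that a distributed version would hand to separate workers.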
  • 38. Random Walks on Big Data. Random walks on RDF graphs as a means of: ● Belief materialization ● Soft inference. Diagram: nodes a, b, c, d, e, f connected by edges labeled R, assuming transitivity of R.
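One way to read "soft inference" here: if R is assumed transitive, the fraction of random walks from a that reach some node x can serve as a confidence score for the unstated fact R(a, x). The sketch below uses an invented edge list (the slide's actual diagram edges are not recoverable from the transcript) and a hypothetical `walk_scores` function.

```python
import random

# Hypothetical R-edges over the node names used on the slide; the topology is an assumption.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": [], "f": []}

def walk_scores(graph, start, steps, walks, seed=0):
    """Run short random walks from `start` and report the fraction of walks visiting each node."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = {}
    for _ in range(walks):
        node = start
        for _ in range(steps):
            if not graph[node]:  # dead end: this walk stops early
                break
            node = rng.choice(graph[node])
            hits[node] = hits.get(node, 0) + 1
    return {n: count / walks for n, count in hits.items()}

scores = walk_scores(edges, "a", steps=3, walks=1000)
```

In this acyclic example every walk passes through d, so R(a, d) gets full confidence, while f is never reached and receives no score at all.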
  • 39. Random Walks on Big Data. Large-scale graph processing solutions for scaling random walks over Big Data: ● Apache Giraph http://giraph.apache.org/ ● Pregel [Malewicz et al., 2010] ● Grappa http://www.cs.washington.edu/node/4217/
  • 40. MCMC. A “generic” problem-solving method based on local sampling, useful for soft inferences on semantic data. Time-homogeneous Markov chain: [equation not preserved in this transcript]
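The slide's own equation did not survive in the transcript. For reference, the standard textbook statement of time homogeneity is given below; the symbols $X_n$ and $p_{ij}$ are assumed notation, not necessarily the notation used on the slide.

```latex
% A Markov chain (X_n) is time homogeneous when its one-step transition
% probabilities do not depend on the step index n:
\[
  \Pr\left(X_{n+1} = j \mid X_n = i\right) = p_{ij}
  \qquad \text{for all } n \ge 0 .
\]
```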
  • 41. MCMC. A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states. Given an initial “prior” probability distribution across states, the “stationary distribution” or “equilibrium condition” is defined as: [equation not preserved in this transcript]
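The defining equation is missing from the transcript. A standard formulation is sketched below, with $P = (p_{ij})$ as the transition matrix, $\pi^{(0)}$ as the initial distribution, and $\pi$ as the stationary distribution; all of these symbols are assumptions rather than the slide's notation.

```latex
% Stationarity: one more step of the chain leaves the distribution unchanged.
\[
  \pi = \pi P
  \quad\Longleftrightarrow\quad
  \pi_j = \sum_i \pi_i \, p_{ij} \quad \text{for every state } j .
\]
% For an ergodic chain, the prior is forgotten in the limit:
\[
  \lim_{n \to \infty} \pi^{(0)} P^{\,n} = \pi .
\]
```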
  • 42. MCMC: Markov Chain Monte Carlo. Given a state space S and an “equilibrium” distribution, choose a sample s of the state space S so that a Markov chain on s results in that distribution as its stationary distribution. MCMC for logical inference: for a logical inference problem, the equilibrium condition would be of the form [0,1]^m, defined over a set of m predicates. Example sampling algorithms for MCMC: Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling ; Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
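As a concrete instance of building a chain whose stationary distribution is a chosen target, here is a small Metropolis sampler (the symmetric-proposal special case of Metropolis-Hastings) over four discrete states. The target weights, the ring-shaped proposal, and the `metropolis` function are assumptions made for illustration; nothing here is specific to logical inference.

```python
import random

def metropolis(weights, n_samples, seed=0):
    """Sample states 0..n-1 with frequency proportional to `weights` (all must be positive)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    n = len(weights)
    state = 0
    counts = [0] * n
    for _ in range(n_samples):
        # Symmetric proposal: step to the left or right neighbour on a ring of states.
        proposal = (state + rng.choice((-1, 1))) % n
        # Accept with probability min(1, w[proposal] / w[state]); otherwise stay put.
        if rng.random() < weights[proposal] / weights[state]:
            state = proposal
        counts[state] += 1
    return [c / n_samples for c in counts]

target = [1.0, 2.0, 3.0, 4.0]        # unnormalized target; normalizes to 0.1, 0.2, 0.3, 0.4
freqs = metropolis(target, 200_000)  # empirical visit frequencies
```

Only ratios of weights are used, so the target never has to be normalized. That property is what makes the method practical when the normalizing constant is intractable.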
  • 43. Scaling MCMC for Big Data: Distributed MCMC. Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Some examples include [Murray 2010; Singh et al 2011]. Distributional models for MCMC are beyond the scope of this talk.
  • 44. On the road ahead. Some promising directions for Big Data and Semantics ● Diffusion models for large-scale inference ● Cognitive models for semantics over large-scale data ● Model-based reasoning and reasoning across models ● Soft (probabilistic) inferences, confidence measures, relevance feedback ● Continuous learning over Big Data
  • 45. Thank You!
  • 46. References
    ● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf
    ● Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184
    ● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.
    ● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of the NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/
    ● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.
    ● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.
    ● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.