SEMANTIC SEARCH OVER BIG LINKED DATA
Dr. Thanh Tran
…AND THERE WAS LINKED DATA!
(Source: http://linkeddata.org/)
RDF
A W3C Web standard for data representation and exchange
Allows different kinds of data to be captured as graphs
Graphs contain resource descriptions
Each is a set of triples
• Attribute values
• Relations to other resources
[Figure: example RDF graph with the nodes Freddie Mercury, Brian May, Queen, Liar and 1971]
source: http://linkeddata.org/
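A minimal sketch of the "description = set of triples" idea, using plain Python tuples. The node names come from the example graph; the edge labels (type, member, writer, formed in) are taken from the later slides of this deck, and using plain strings instead of IRIs is a simplifying assumption. The helper `describe` is hypothetical.

```python
from collections import defaultdict

# (subject, predicate, object) triples of the running Queen example.
# Real RDF would use IRIs and typed literals; strings keep the sketch short.
TRIPLES = {
    ("Freddie Mercury", "type", "Artist"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
    ("Queen", "formed in", "1971"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Liar", "type", "Single"),
}


def describe(resource, triples):
    """Return the description of a resource as predicate -> set of objects."""
    description = defaultdict(set)
    for subject, predicate, obj in triples:
        if subject == resource:
            description[predicate].add(obj)
    return dict(description)


queen_description = describe("Queen", TRIPLES)
print(queen_description)
```

Attribute values (such as 1971) and relations to other resources (such as Band) end up in the same structure, which is what lets different kinds of data be captured in one graph.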
LINKED DATA CLOUD
(Source: http://linkeddata.org/)
OPPORTUNITIES (1)
Data.gov: effective dissemination and consumption of public sector data
(Source: http://www.data.gov)
The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…
“written by freddie queen single”
WKP: Page
OPPORTUNITIES (2)
Linked Data Cloud: effective dissemination and consumption of data across
datasets, across domains
[Figure: data graph with the nodes Freddie Mercury (MusicBrainz: Artist), Brian May, Queen (MusicBrainz: Band), Liar (MusicBrainz: Single) and 1971]
“written by freddie queen single”
OPPORTUNITIES
Linked Data Cloud: effective dissemination and consumption of data across
datasets, across domains
The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…
WKP: Page
[Figure: data graph with the nodes Freddie Mercury (MusicBrainz: Artist), Brian May, Queen (MusicBrainz: Band), Liar (MusicBrainz: Single) and 1971]
“written by freddie queen single”
OPPORTUNITIES
Linked Data Cloud: effective dissemination and consumption of data across
datasets, across domains
The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…
WKP: Page
[Figure: data graph with the nodes Freddie Mercury (MusicBrainz: Artist), Brian May, Queen (MusicBrainz: Band), Queen Elizabeth 1 (Freebase: Person), Liar (MusicBrainz: Single), 1971 and single]
“written by freddie queen single”
COGNITIVE CHALLENGES
Structured data / database solutions require information needs to be given as
structured queries
Writing structured queries requires knowledge about
• Query language syntax and semantics
• Datasets and their schemas
• Links between datasets
<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, member, Queen>
“written by freddie queen single”
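To illustrate what the three triple patterns above ask for, here is a small sketch that evaluates a conjunction of patterns against a set of triples. It is not a SPARQL engine: variables are written with a leading "?" (the slide's x becomes "?x"), and `match_pattern` and `evaluate` are hypothetical helpers. The data is the example graph from this deck.

```python
TRIPLES = [
    ("Freddie Mercury", "type", "Artist"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Liar", "type", "Single"),
]


def match_pattern(pattern, triple, binding):
    """Extend a variable binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Return all bindings that satisfy every pattern (nested-loop join)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


QUERY = [
    ("?x", "type", "Single"),
    ("Freddie Mercury", "writer", "?x"),
    ("Freddie Mercury", "member", "Queen"),
]
answers = evaluate(QUERY, TRIPLES)
print(answers)
```

Writing QUERY by hand requires knowing the predicate names and the shape of the data, which is exactly the cognitive burden that keyword search tries to remove.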
SEMANTIC SEARCH OVER BIG LINKED DATA!
VISION
Enabling end users to retrieve and explore relevant knowledge
from Big Linked Data via intuitive interfaces!
THE INFORMATION WORKBENCH DEMO
Facets
Syntactic Completions
Keywords
Semantic Completions
(Source: http://www.fluidops.com/information-workbench/)
FOLLOWING AGENDA
Technical Challenges
Big Picture of Previous & Current Work
Contributions & Innovations
Keyword Search over Big Linked Data
Where are we now?
What is to be done?
TECHNICAL CHALLENGES
Linked Data is Big Data
Volume: numerous large datasets
• Processing all datasets possible/ needed?
Velocity: streams from sensors, live feeds etc.
• How to provide fresh, timely results?
• Preprocessing possible?
Variety: different data formats + schemas are
unknown, heterogeneous and rapidly changing
• Making sense of the data?
• Integrate and combine knowledge from different datasets?
BIG PICTURE
Previous & Current Work
Acquire
• Source selection [ISWC10, TKDE12b]
Organize
• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]
Analyze
• Descriptive resource summary [ISWC11]
• Structural summary of datasets [TKDE12a]
Search
• Entity & relational search and ranking [SIGIR11, CIKM11b]
• Keyword query processing [ICDE09, SIGMOD09]
Volume: Fast access? All data/datasets?
BIG PICTURE
Previous & Current Work
Acquire
• Source selection [ISWC10, TKDE12b]
• Stream-based processing of external sources [ISWC10b]
• Combining local & external sources [ESWC12]
Organize
• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]
• On-demand search-driven data integration [WebSci12]
Analyze
• Descriptive resource summary [ISWC11]
• Structural summary of datasets [TKDE12a]
Search
• Entity & relational search and ranking [SIGIR11, CIKM11b]
• Keyword query processing [ICDE09, SIGMOD09]
• Explorative Linked Data query processing [ESWC11]
• Multi-datasets search [WWW12]
Volume: Fast access? All data/datasets?
Velocity: Fresh results? Preprocessing?
Variety: Heterogeneous Datasets/Schemas; Structured + Unstructured
KEYWORD SEARCH OVER BIG LINKED DATA
KEYWORD SEARCH PROBLEM (1)
[Figure: data graph with the nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971 and single, the classes Person, Artist, Band and Single, and the relation writer]
Set of Queries: 1) Query 1, 2) Query 2, …
Selection
Set of Results: 1) Result 1, 2) Result 2, …
“written by freddie queen single”
KEYWORD SEARCH PROBLEM (2)
Goal
• Finding “substructures”, e.g. Steiner Graph
• Connecting keyword matching elements
• AND-semantics: results contain one keyword-matching element
for every query keyword
Problem
• Keywords produce a large number of matching elements
• There is a large number of connecting graphs
• Search complexity increases exponentially with the size
of the data graphs and the number of query keywords
• Data graphs are large
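The goal and the problem above can be made concrete with a naive online exploration. The sketch below is not the approach of this deck; it is a baseline under several assumptions: the example graph is treated as undirected, keywords match node labels by substring, and a "distinct root" connection (one node that reaches every keyword element, scored by summed distances) stands in for a Steiner graph. All function names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import product

EDGES = [
    ("Freddie Mercury", "type", "Artist"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
    ("Queen", "formed in", "1971"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Liar", "type", "Single"),
    ("Queen", "producer", "Liar"),
]


def build_adjacency(edges):
    """Undirected adjacency sets; edge labels are ignored in this sketch."""
    adjacency = defaultdict(set)
    for subject, _, obj in edges:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return adjacency


def bfs_distances(adjacency, start):
    """Hop distance from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def keyword_matches(adjacency, keyword):
    """Nodes whose label contains the keyword (case-insensitive)."""
    return sorted(n for n in adjacency if keyword.lower() in n.lower())


def best_connection(adjacency, keywords):
    """Cheapest (cost, root, keyword elements) under AND-semantics, or None."""
    candidates = [keyword_matches(adjacency, k) for k in keywords]
    best = None
    # One matching element per keyword: the number of combinations is the
    # product of the candidate list sizes, which is what blows up in practice.
    for combination in product(*candidates):
        distance_maps = [bfs_distances(adjacency, n) for n in combination]
        for root in adjacency:
            if all(root in d for d in distance_maps):
                cost = sum(d[root] for d in distance_maps)
                if best is None or cost < best[0]:
                    best = (cost, root, combination)
    return best


ADJACENCY = build_adjacency(EDGES)
connection = best_connection(ADJACENCY, ["freddie", "queen", "single"])
print(connection)
```

Every combination triggers a fresh traversal of the graph, so the cost grows with both the number of matching elements and the graph size. This is the motivation for replacing online exploration with index-based processing.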
INDEX-BASED TOP-K KEYWORD QUERY PROCESSING [CIKM11B]
Cast the problem as one of index-based join processing
• Index-based data access (retrieval)
• Join (combine)
D-LENGTH 2-HOP COVER GRAPH INDEX (1)
Use d-length 2-hop cover for graph indexing, i.e. a set of
neighbourhood labels $NB_n$ for every node $n$
• If there is a path of length $2d$ or less between $u$ and $v$ then $NB_u \cap NB_v \neq \emptyset$
• All paths of length $2d$ or less between $u$ and $v$ are: $\{(u, \dots, w, \dots, v) \mid w \in NB_u \cap NB_v\}$
• $u$ and $v$ are called center nodes and $w$ is the hop node
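A small sketch of the neighbourhood idea, under simplifying assumptions: the example graph is treated as undirected, and each neighbourhood stores only the distance to every node within d hops rather than the actual path entries an index would keep. The names `neighborhood` and `hop_nodes` are hypothetical.

```python
from collections import defaultdict, deque

EDGES = [
    ("Freddie Mercury", "type", "Artist"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
    ("Queen", "formed in", "1971"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Liar", "type", "Single"),
    ("Queen", "producer", "Liar"),
]


def build_adjacency(edges):
    """Undirected adjacency sets over the nodes of the example graph."""
    adjacency = defaultdict(set)
    for subject, _, obj in edges:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return adjacency


def neighborhood(adjacency, node, d):
    """d-length neighbourhood of node: every node within d hops -> distance."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if distances[current] == d:
            continue  # do not expand beyond d hops
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def hop_nodes(nb_u, nb_v):
    """Shared nodes w, with the length of the connection u..w..v through w."""
    return {w: nb_u[w] + nb_v[w] for w in nb_u.keys() & nb_v.keys()}


ADJACENCY = build_adjacency(EDGES)
nb_freddie = neighborhood(ADJACENCY, "Freddie Mercury", 1)
nb_single = neighborhood(ADJACENCY, "Single", 1)
nb_brian = neighborhood(ADJACENCY, "Brian May", 1)

print(hop_nodes(nb_freddie, nb_single))  # connected within 2d = 2 hops
print(hop_nodes(nb_brian, nb_single))    # empty: no connection within 2 hops
```

With d = 1 each node stores only its direct neighbours, yet connections of length up to 2 can still be answered by intersecting two neighbourhoods.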
D-LENGTH 2-HOP COVER GRAPH INDEX (2)
A set of d-length neighborhoods is a d-length 2-hop cover
During construction, pruning paths reduces that size!
[Figure: example data graph (nodes Freddie Mercury, Brian May, Queen, Liar and 1971; classes Artist, Band and Single; relations writer, member, producer and formed in) next to its 1-length 2-hop cover path index, with center/hop nodes and hop nodes marked]
TOP-K JOIN: NEIGHBORHOOD JOIN
[Figure: neighborhoods of the 2-length 2-hop cover for the example graph, built from path entries such as <Freddie Mercury, member, Queen>, <Brian May, member, Queen>, <Queen, producer, Liar>, <Queen, formed in, 1971> and <Freddie Mercury, writer, Liar>]
• Retrieve neighborhoods NBu and NBv for u and v
• Join path entries in NBu and NBv on hop nodes (rank join on sorted
inputs)
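The two steps above can be sketched as a small rank join. This is an illustrative variant of a threshold-based rank join, not the operator from the cited paper: each neighborhood is reduced to a list of (hop node, cost) entries sorted by ascending cost, cost is taken to be path length, and the entries below are what the example graph yields for d = 2 when read as undirected. `rank_join` is a hypothetical name.

```python
import heapq

# Neighborhood entries (hop node, path length), sorted by ascending cost.
NB_FREDDIE = [
    ("Freddie Mercury", 0), ("Artist", 1), ("Liar", 1), ("Queen", 1),
    ("1971", 2), ("Band", 2), ("Brian May", 2), ("Single", 2),
]
NB_SINGLE = [("Single", 0), ("Liar", 1), ("Freddie Mercury", 2), ("Queen", 2)]


def rank_join(left, right, k):
    """Return the k cheapest joins on hop node and the number of entries read."""
    if not left or not right:
        return [], 0
    inputs = (left, right)
    position = [0, 0]   # next unread index per input
    seen = ({}, {})     # hop node -> costs read so far, per input
    candidates = []     # min-heap of (total cost, hop node)
    results = []
    while len(results) < k:
        # Lower bound on any join result that needs a still-unread entry:
        # next unread cost on one side plus the cheapest cost on the other.
        bounds = [
            inputs[side][position[side]][1] + inputs[1 - side][0][1]
            for side in (0, 1)
            if position[side] < len(inputs[side])
        ]
        threshold = min(bounds) if bounds else float("inf")
        if candidates and candidates[0][0] <= threshold:
            results.append(heapq.heappop(candidates))  # safe to emit now
            continue
        if not bounds:
            break  # both inputs exhausted, no candidates left
        # Read alternately from both inputs while entries remain.
        side = 0 if position[0] <= position[1] else 1
        if position[side] >= len(inputs[side]):
            side = 1 - side
        hop, cost = inputs[side][position[side]]
        position[side] += 1
        seen[side].setdefault(hop, []).append(cost)
        for other_cost in seen[1 - side].get(hop, []):
            heapq.heappush(candidates, (cost + other_cost, hop))
    return results, sum(position)


top, entries_read = rank_join(NB_FREDDIE, NB_SINGLE, k=1)
print(top, entries_read)
```

Because both inputs are sorted, the join can stop as soon as the best buffered candidate is no worse than anything that could still be produced, so only part of each neighborhood is read.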
TOP-K JOIN: GRAPH JOIN
[Figure: graph join over the neighborhoods of Freddie Mercury, Artist, Queen and member, with Liar and Single attached via hop nodes]
Keyword Graphs: comprise all paths of max length 2d between Freddie Mercury and Queen
Expand to obtain Keyword Graph
Neighborhoods containing free hop nodes
KEYWORD QUERY PROCESSING / PLANNING
Process
• Index access to retrieve keyword neighborhoods
• Rank (neighborhood/graph) join to connect keyword elements
Planning: which join order?
[Figure: join plan over the keyword elements Freddie Mercury, writer, Queen and Single]
KEYWORD QUERY PROCESSING / PLANNING
Join order also determines results
• No single join order delivers all results
(some might even be empty)
• We do not know in advance which orders
deliver which results
Consider all possible join orders
“written by freddie queen single 1971”
[Figure: alternative join orders over the keyword elements Freddie Mercury, writer, Queen, Single and 1971 on the example data graph; some orders produce results for d = 1, others produce no results for d = 1]
INTEGRATED QUERY PLAN
Terminate early after computing top-k instead of all results
• Use rank join operators
• Introduce top-k union operator
[Figure: integrated query plan over the keyword elements Freddie Mercury, writer, Queen and Single]
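One way to picture a top-k union is as a lazy merge of the sorted result streams of several sub-plans that stops after k results. The sketch below uses Python's `heapq.merge` and `itertools.islice`; the sub-plans, their names and their costs are made up for illustration, and lower cost is assumed to mean a better result.

```python
import heapq
from itertools import islice


def sub_plan(name, costs, log):
    """Hypothetical sub-plan: lazily yields (cost, plan name), cheapest first."""
    for cost in costs:
        log.append((name, cost))  # records which results were actually computed
        yield (cost, name)


def top_k_union(plans, k):
    """Merge sorted result streams and stop after the k best results."""
    return list(islice(heapq.merge(*plans), k))


computed = []
plans = [
    sub_plan("order A", [1, 4, 6], computed),
    sub_plan("order B", [2, 3, 5], computed),
    sub_plan("order C", [], computed),  # a join order that yields nothing
]
best_two = top_k_union(plans, k=2)
print(best_two)
print(len(computed), "of 6 results computed")
```

Empty sub-plans drop out of the merge without any special handling, and the remaining results of the other sub-plans are never computed once k results have been delivered.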
TOP-K PLANS
Integrated Query Plan is composition of sub-plans
• Some might produce no results
• Some sub-plans produce results earlier than others
Rank not only results, but also rank operators (hence plans)
• Global score of rank join operator, based on current results and
upper bounds for subsequent join operations
• Only the operator with the highest global score can push results to
subsequent operators
• Otherwise, activate lower level data access operators
INDEX-BASED TOP-K KEYWORD QUERY PROCESSING [CIKM11B]
Benefits
• One order of magnitude faster than online
graph exploration
• Compared with graph indexing approaches, our solution
reduces storage requirements by up to 86% and improves
performance by more than 50% on average
SEARCH TECHNOLOGY INNOVATIONS
Integrated
Zero Upfront Effort / On-Demand
• Does not require preprocessing, upfront integration (Watson)
Fresh Results / Timely Response
Relational
• Entities (Yahoo!, Google, Facebook Graph Search)
• Plus relations, paths, graphs…
Zero Manual Effort
• Does not require experts to specify search forms (E-commerce
search), structure templates, translation rules and domain
adaptation (Wolfram Alpha, Watson)
• Interpretation of keywords and structural context, i.e. relevant
relations between entities through online graph exploration
WHAT HAVE WE ACHIEVED?
Volume: fast access? all data/datasets?
• Quick IR-style keyword-based lookup
• Reduce search space / result candidates
• Handles hundreds of datasets with response times within a
few seconds (with local sources)
• Ranking performance consistently superior to the state of
the art (20% improvement in terms of F-measure)
according to keyword search benchmark 2012
• Structured, semi-structured → unstructured?
Hybrid data management?
WHAT HAVE WE ACHIEVED?
Velocity: fresh results? preprocessing?
• On-demand stream-based processing, i.e. exploration of
sources, data integration and result combination at
querying time
• No need to process / store all data
• Fresh results from external sources can be guaranteed
WHAT HAVE WE ACHIEVED?
Variety: different datasets, schemas and formats
• Interpretation of data semantics and matching across
datasets performed at querying time
• No assumptions of schema, i.e. can handle
unknown, possibly semi-structured data
• Works well when data sources are homogeneous, i.e.
large overlaps / matching signals are numerous and
specific → heterogeneous data from different domains
with small overlaps / no specific matching signals?
BIG PICTURE
Previous & Current & Future Work
Acquire
• Source selection [ISWC10, TKDE12b]
• Stream-based processing of external sources [ISWC10b]
• Combining local & external sources [ESWC12]
Organize
• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]
• On-demand search-driven data integration [WebSci12]
• Heterogeneous data integration [ICDE13, WSDM13]
• Integration of hybrid big data
Analyze
• Descriptive entity summary [ISWC11]
• Structural summary of datasets [TKDE12a]
• Probabilistic models of text and structure [ICML13, SIGMOD13]
• Hybrid big data management
Search
• Entity & relational search and ranking [SIGIR11, CIKM11b]
• Keyword query processing [ICDE09, SIGMOD09]
• Explorative Linked Data query processing [ESWC11]
• Multi-datasets search [WWW12]
Volume: Fast access? All data/datasets?
Velocity: Fresh results? Preprocessing?
Variety: Heterogeneous Datasets/Schemas; Structured + Unstructured
CONCLUSIONS
Vision
• Enabling end users to retrieve and explore relevant knowledge from
Big Linked Data via intuitive interfaces!
Status quo
• End users can retrieve complex knowledge (complex graphs) from
hundreds of Linked Data sources
1-3 years from now
• Improve “integrated view” coverage from 30% to 80%
• Coverage of structured and unstructured results (from sensors,
social networks etc.)
3-5 years from now
• Robust probabilistic models of hybrid Big Linked Data
• For search, ranking, as well as analytics and prediction?
THANKS!
Tran Duc Thanh
ducthanh.tran@kit.edu
http://sites.google.com/site/kimducthanh/
REFERENCES (1)
• [ICML13] Veli Bicer, Thanh Tran
Topical Relational Model
Submitted to International Conference on Machine Learning (ICML’13).
• [SIGMOD13]
TopGuess: Query Selectivity Estimation over Text-rich Data Graphs
Submitted to SIGMOD13.
• [ICDE13] Yongtao Ma, Thanh Tran
TYPifier: Inferring the Type Semantics of Structured Data
In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April, 2013
• [WSDM13] Yongtao Ma, Thanh Tran
TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous
Web Data Integration
In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February, 2013
• [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph
Managing Structured and Semi-structured RDF Data Using Structure Indexes
In Transactions on Knowledge and Data Engineering journal.
• [TKDE12b] Thanh Tran, Lei Zhang
Keyword Query Routing
In Transactions on Knowledge and Data Engineering journal.
• [WWW12] Daniel Herzig, Thanh Tran
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April, 2012
• [WebSci12] Thanh Tran, Yongtao Ma, and Gong Cheng
Pay-less Entity Consolidation – Exploiting Entity Search User Feedbacks for Pay-as-you-go
Entity Data Integration
In Proceedings of Web Science Conference 2012 (WebSci'12). Evanston, USA, June, 2012
• [CIKM11a] Günter Ladwig, Thanh Tran
Index Structures and Top-k Join Algorithms for Native Keyword Search Databases
In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11).
Glasgow, UK, October, 2011
• [CIKM11b] Veli Bicer, Thanh Tran
Ranking Support for Keyword Search on Structured Data using Relevance Models
In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11).
Glasgow, UK, October, 2011
REFERENCES (2)
• [ISWC11] Gong Cheng, Thanh Tran and Yuzhong Qu
RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization
In Proceedings of 10th International Semantic Web Conference (ISWC'11).
Koblenz, Germany, October, 2011
• [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S.
Thompson, Thanh Tran Duc
Repeatable and Reliable Search System Evaluation using Crowdsourcing
In Proceedings of 34th Annual International ACM SIGIR
Conference (SIGIR'11), Beijing, China, July, 2011
• [DEXA11] Andreas Wagner, Günter Ladwig, Thanh Tran
Browsing-oriented Semantic Faceted Search
In Proceedings of 22nd International Conference on Database and Expert Systems Applications
(DEXA'11). Toulouse, France, August, 2011
• [ISWC10a] Thanh Tran, Lei Zhang, Rudi Studer
Summary Models for Routing Keywords to Linked Data Sources
In Proceedings of 9th International Semantic Web Conference (ISWC'10).
Shanghai, China, November, 2010
• [ISWC10b] Günter Ladwig, Thanh Tran
Linked Data Query Processing Strategies
In Proceedings of 9th International Semantic Web Conference (ISWC'10).
Shanghai, China, November, 2010
• [JWS09] Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu, Yue
Pan
Semplore: A Scalable IR Approach to Search the Web of Data
In Journal of Web Semantics 7 (3), September, 2009
• [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
In Proceedings of the 25th International Conference on Data Engineering (ICDE'09).
Shanghai, China, March 2009
• [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong
Yu, Thanh Tran, Peter Haase, Rudi Studer
Hermes: A Travel through Semantics in the Data Web
In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009
BACKUP
QUERY INTERPRETATION [ICDE09, SIGMOD09]
Focus on query interpretations instead of final answers
Leverage the power of underlying DB query engine for processing
interpretations
Reduction of search space
• Query interpretation on structure summary generated from data
• Exploration on reduced search space!
Focus on top-k results
• Top-k procedure for exploring and finding the k best results
[Figure: structure summary with the nodes Freddie Mercury, Queen, Queen Elizabeth 1 and single, the classes Person, Artist, Band, Single and Literal, and the relations member, producer, writer and marital status]
<x, type, Single>
<Queen, producer, x>
<Freddie Mercury, writer, x>
<Queen, type, Band>
<Freddie Mercury, type, Artist>
“written by freddie queen single”
QUERY INTERPRETATION
Benefits
• Outperforms online bidirectional search by at least one order of
magnitude
• Performance comparable with index-based approaches, but
requires less space
Drawbacks
• “Meaningful” interpretations may generate empty results
• Relies on DB query engine, native tailored optimization not possible
BIG PICTURE
Previous & Current Work
Acquire
• Source selection [ISWC10, TKDE12b]
• Stream-based processing of external sources [ISWC10b]
Organize
• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]
• On-demand search-driven data integration [WebSci12]
Analyze
• Descriptive resource summary [ISWC11]
• Structural summary of datasets [TKDE12a]
Search
• Entity & relational search and ranking [SIGIR11, CIKM11b]
• Keyword query processing [ICDE09, SIGMOD09]
• Explorative Linked Data query processing [ESWC11]
Volume: Fast access? All data/datasets?
Velocity: Fresh results? Preprocessing?
BIG PICTURE
Previous & Current Work
Acquire
• Source selection [ISWC10, TKDE12b]
• Stream-based processing of external sources [ISWC10b]
• Combining local & external sources [ESWC12]
Organize
• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]
• On-demand search-driven data integration [WebSci12]
Analyze
• Descriptive entity summary [ISWC11]
• Structural summary of datasets [TKDE12a]
Search
• Entity & relational search and ranking [SIGIR11, CIKM11b]
• Keyword query processing [ICDE09, SIGMOD09]
• Explorative Linked Data query processing [ESWC11]
• Multi-datasets search [WWW12]
Volume: Fast access? All data/datasets?
Velocity: Fresh results? Preprocessing?
Variety: Heterogeneous Datasets/Schemas; Structured + Unstructured
SEMANTIC SEARCH TECHNIQUES FOR
LINKING
Linking homogeneous data
• Given structured entity description, find
matching entities described using
same/similar schema
Linking heterogeneous data
• Given structured entity, find matching
entities described using different
schemas
Linking hybrid data
• Given text mentions, find matching
entities (no schema)
Keyword search
• Given keywords, find matching entities
(no schema)
Example records (in slide order)
• (name, age): (Tran Thanh, 31)
• (name, age): (Tran Thanh, 31)
• (id, description): (p1, "Tran Duc Thanh, age 31, works at..")
• (label, age): (Tran Duc Thanh, 31)
• (name, age): (Tran Thanh, 31)
• (…, content): ("Tran Duc Thanh, a researcher at KIT…")
• (name, age): (Tran Thanh, 31)
• (query): (Tran Duc Thanh)
Search-based Linking
• Adopt methods for semantic matching and ranking for schema-
agnostic linking in hybrid & heterogeneous data scenarios
• Embed linking into the search process to leverage user
feedback
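As a rough illustration of what "schema-agnostic" can mean, the sketch below compares records purely by the tokens of their values and ignores attribute names, so a structured record, a record under a different schema and a text snippet can all be scored the same way. Jaccard overlap is a stand-in chosen for brevity, not the matching or ranking method of the cited work; the records follow the example data of this slide.

```python
import re


def bag_of_tokens(record):
    """Lower-cased word tokens of all values; attribute names are ignored."""
    values = record.values() if isinstance(record, dict) else [record]
    tokens = set()
    for value in values:
        tokens |= set(re.findall(r"\w+", str(value).lower()))
    return tokens


def similarity(a, b):
    """Jaccard overlap between the token bags of two records or texts."""
    tokens_a, tokens_b = bag_of_tokens(a), bag_of_tokens(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


source = {"name": "Tran Thanh", "age": 31}
other_schema = {"label": "Tran Duc Thanh", "age": 31}
text_mention = "Tran Duc Thanh, a researcher at KIT"

print(similarity(source, other_schema))
print(similarity(source, text_mention))
```

The same scoring function also covers the keyword search case, since a keyword query is just another bag of tokens.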

The State of Passkeys with FIDO Alliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Big data search

  • 1. SEMANTIC SEARCH OVER BIG LINKED DATA Dr. Thanh Tran
  • 2. …AND THERE WAS LINKED DATA!
  • 4. RDF A W3C Web standard for data representation and exchange. Allows different kinds of data to be captured as graphs. Graphs contain resource descriptions. Each is a set of triples • Attribute values • Relations to other resources [Figure: example graph with nodes Freddie Mercury, Brian May, Queen, Liar, 1971]
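As a rough illustration of the RDF slide, a resource description can be modelled as the set of triples that share a subject. The sketch below is a plain-Python stand-in, not an RDF library; the predicate names ("member", "writer", "formed in") are borrowed from later slides, and the helper `describe` is hypothetical.

```python
from collections import defaultdict

# Hypothetical triples mirroring the slide's example graph.
# Each triple is (subject, predicate, object).
TRIPLES = [
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Queen", "formed in", "1971"),
]

def describe(resource, triples):
    """Return the description of a resource: predicate -> list of objects."""
    description = defaultdict(list)
    for s, p, o in triples:
        if s == resource:
            description[p].append(o)
    return dict(description)

print(describe("Freddie Mercury", TRIPLES))
```

Objects that are themselves subjects elsewhere (here Queen) act as relations to other resources, while literal objects such as 1971 act as attribute values.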
  • 5. LINKED DATA CLOUD (Source: http://linkeddata.org/)
  • 6. OPPORTUNITIES (1) Data.gov: effective dissemination and consumption of public sector data (Source: http://www.data.gov)
  • 7. OPPORTUNITIES (2) Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains. Keyword query: “written by freddie queen single”. WKP: Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…
  • 8. OPPORTUNITIES Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains. Keyword query: “written by freddie queen single”. WKP: Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real… [Figure: graph with nodes Freddie Mercury, Brian May, Queen, Liar, 1971; type labels MusicBrainz: Artist, MusicBrainz: Band, MusicBrainz: Single]
  • 10. OPPORTUNITIES Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains. Keyword query: “written by freddie queen single”. WKP: Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real… [Figure: graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; type labels Freebase: Person, MusicBrainz: Artist, MusicBrainz: Band, MusicBrainz: Single]
  • 11. COGNITIVE CHALLENGES Structured data / database solutions require information needs to be expressed as structured queries. Writing structured queries requires knowledge about • Query language syntax and semantics • Datasets and their schemas • Links between datasets. Example: the keyword query “written by freddie queen single” corresponds to the structured query <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, member, Queen>
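To make the structured query on the cognitive challenges slide concrete, here is a minimal sketch of evaluating a conjunction of triple patterns over a small in-memory triple list. It is an illustration only: the data, the "?x" variable convention and the function `match` are assumptions, and a real system would use a query language such as SPARQL over a triple store.

```python
# Hypothetical data; only the three query patterns come from the slide.
TRIPLES = [
    ("Liar", "type", "Single"),
    ("Queen", "type", "Band"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
]

def is_var(term):
    """Variables are written with a leading '?' (an assumed convention)."""
    return term.startswith("?")

def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for pat, val in zip(first, triple):
            # Substitute an already bound variable by its value.
            pat = new.get(pat, pat) if is_var(pat) else pat
            if is_var(pat):
                new[pat] = val      # bind a fresh variable
            elif pat != val:
                ok = False          # constant or bound value mismatch
                break
        if ok:
            yield from match(rest, triples, new)

QUERY = [
    ("?x", "type", "Single"),
    ("Freddie Mercury", "writer", "?x"),
    ("Freddie Mercury", "member", "Queen"),
]
print(list(match(QUERY, TRIPLES)))
```

Writing QUERY by hand requires knowing the predicate names and how the resources are linked, which is exactly the burden the slide describes.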
  • 12. SEMANTIC SEARCH OVER BIG LINKED DATA!
  • 13. VISION Enabling end users to retrieve and explore relevant knowledge from Big Linked Data via intuitive interfaces!
  • 14. THE INFORMATION WORKBENCH DEMO Facets Syntactic Completions Keywords Semantic Completions
  • 16. FOLLOWING AGENDA Technical Challenges Big Picture of Previous & Current Work Contributions & Innovations Keyword Search over Big Linked Data Where are we now? What is to be done?
  • 17. TECHNICAL CHALLENGES Linked Data is Big Data. Volume: numerous large datasets • Is processing all datasets possible / needed? Velocity: streams from sensors, live feeds etc. • How to provide fresh, timely results? • Is preprocessing possible? Variety: different data formats; schemas are unknown, heterogeneous and rapidly changing • How to make sense of the data? • How to integrate and combine knowledge from different datasets?
  • 18. BIG PICTURE Previous & Current Work. Acquire • Source selection [ISWC10a, TKDE12b] Organize • Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a] Analyze • Descriptive resource summary [ISWC11] • Structural summary of datasets [TKDE12a] Search • Entity & relational search and ranking [SIGIR11, CIKM11b] • Keyword query processing [ICDE09, SIGMOD09]. Volume: Fast access? All data/datasets?
  • 19. BIG PICTURE Previous & Current Work. Acquire • Source selection [ISWC10a, TKDE12b] • Stream-based processing of external sources [ISWC10b] • Combining local & external sources [ESWC12] Organize • Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a] • On-demand search-driven data integration [WebSci12] Analyze • Descriptive resource summary [ISWC11] • Structural summary of datasets [TKDE12a] Search • Entity & relational search and ranking [SIGIR11, CIKM11b] • Keyword query processing [ICDE09, SIGMOD09] • Explorative Linked Data query processing [ESWC11] • Multi-datasets search [WWW12]. Volume: Fast access? All data/datasets? Velocity: Fresh results? Preprocessing? Variety: Heterogeneous Datasets/Schemas, Structured + Unstructured
  • 20. KEYWORD SEARCH OVER BIG LINKED DATA
  • 22. KEYWORD SEARCH PROBLEM (1) Keyword query: “written by freddie queen single”. Set of Queries: 1) Query 1, 2) Query 2, … Selection. Set of Results: 1) Result 1, 2) Result 2, … [Figure: data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; types Person, Artist, Band, Single; edge writer]
  • 23. KEYWORD SEARCH PROBLEM (2) Goal • Finding “substructures”, e.g. Steiner graphs • Connecting keyword matching elements • AND-semantics: results contain one keyword matching element for every query keyword Problem • Keywords produce a large number of matching elements • Large number of connecting graphs • Search complexity increases exponentially with the size of the data graphs & the number of query keywords • Data graphs are large in size
  • 24. INDEX-BASED TOP-K KEYWORD QUERY PROCESSING [CIKM11B] Cast the problem as one of index-based join processing • Index-based data access (retrieval) • Join (combine)
  • 25. D-LENGTH 2-HOP COVER GRAPH INDEX (1) Use a d-length 2-hop cover for graph indexing, i.e. a set of neighbourhood labels $NB_n$, one for every node $n$ • If there is a path of length $2d$ or less between $u$ and $v$, then $NB_u \cap NB_v \neq \emptyset$ • All paths of length $2d$ or less between $u$ and $v$ are of the form $(u, \ldots, w, \ldots, v)$ with $w \in NB_u \cap NB_v$ • $u$ and $v$ are called center nodes and $w$ is the hop node
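The 2-hop cover property on this slide can be sketched as follows, under simplifying assumptions: the graph is treated as undirected and unlabelled, and a neighbourhood is stored as a set of nodes rather than as the path entries the slides describe. The edge list and the helper names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical undirected edges of the example graph.
EDGES = [
    ("Freddie Mercury", "Queen"),
    ("Brian May", "Queen"),
    ("Freddie Mercury", "Liar"),
    ("Queen", "Liar"),
    ("Queen", "1971"),
]

def build_adjacency(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def neighbourhood(adj, node, d):
    """All nodes within d hops of `node` (including itself), found by BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        if dist[cur] == d:
            continue  # do not expand beyond depth d
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return set(dist)

def hop_nodes(adj, u, v, d):
    """Candidate hop nodes w: non-empty iff u and v are at most 2d hops apart."""
    return neighbourhood(adj, u, d) & neighbourhood(adj, v, d)

adj = build_adjacency(EDGES)
print(hop_nodes(adj, "Freddie Mercury", "Brian May", 1))
```

With d = 1 the only shared node is Queen, so every path of length at most 2 between the two center nodes passes through that hop node.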
  • 26. D-LENGTH 2-HOP COVER GRAPH INDEX (2) A set of d-length neighborhoods is a d-length 2-hop cover. During construction, pruning paths reduces that size! [Figure: example data graph (nodes Freddie Mercury, Brian May, Queen, Liar, 1971; types Artist, Band, Single) and its 1-length 2-hop cover path index, listing center/hop nodes, hop nodes and path entries such as Freddie Mercury writer Liar, Freddie Mercury member Queen, Brian May member Queen, Queen producer Liar, Queen formed in 1971]
  • 27. TOP-K JOIN: NEIGHBORHOOD JOIN • Retrieve neighborhoods NBu and NBv for u and v • Join path entries in NBu and NBv on hop nodes (rank join on sorted inputs) [Figure: neighborhood path entries over the example graph (Freddie Mercury, Brian May, Queen, Liar, 1971, Artist, Band, Single; edges member, writer, producer, formed in) being joined on hop nodes; labelled "2-length 2-hop cover"]
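A minimal sketch of the neighbourhood join on this slide, assuming each neighbourhood is a list of (hop node, path from the center node to that hop node) entries. For brevity it uses a hash join followed by a top-k selection on path length; this is a swapped-in simplification, not the rank join on sorted inputs that the slide names. All names and data are hypothetical.

```python
from collections import defaultdict
from heapq import nsmallest

# Hypothetical 1-length neighbourhoods: (hop node, path from center to hop node).
NB_FREDDIE = [
    ("Freddie Mercury", ("Freddie Mercury",)),
    ("Queen", ("Freddie Mercury", "Queen")),
    ("Liar", ("Freddie Mercury", "Liar")),
]
NB_BRIAN = [
    ("Brian May", ("Brian May",)),
    ("Queen", ("Brian May", "Queen")),
]

def neighbourhood_join(nb_u, nb_v, k):
    """Join path entries on their hop node and keep the k shortest paths."""
    by_hop = defaultdict(list)
    for hop, path in nb_v:
        by_hop[hop].append(path)
    joined = []
    for hop, path_u in nb_u:
        for path_v in by_hop.get(hop, []):
            # path_v runs center -> hop; drop the shared hop node and reverse it.
            full = path_u + path_v[-2::-1]
            joined.append((len(full) - 1, full))  # (path length, path)
    return nsmallest(k, joined)

print(neighbourhood_join(NB_FREDDIE, NB_BRIAN, 3))
```

Only entries that agree on the hop node (here Queen) are combined, so two neighbourhoods of length d yield connecting paths of length at most 2d.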
  • 28. TOP-K JOIN: GRAPH JOIN Keyword Graphs comprise all paths of max length 2d between Freddie Mercury and Queen. Expand to obtain Keyword Graph Neighborhoods containing free hop nodes. [Figure: keyword graphs built from path entries over Freddie Mercury, Queen, Artist, Liar and Single (edge member), with free hop nodes marked "hop node 1 …"]
  • 29. KEYWORD QUERY PROCESSING / PLANNING Process • Index access to retrieve keyword neighborhoods • Rank (neighborhoods/graph) join to connect keyword elements Planning: which join order? [Figure: keyword elements Freddie Mercury, writer, Queen, Single]
  • 30. KEYWORD QUERY PROCESSING / PLANNING Join order also determines results • No single join order delivers all results (some might even be empty) • We do not know in advance which orders deliver which results. Consider all possible join orders. Example query: “written by freddie queen single 1971” [Figure: alternative join orders over the keyword elements Freddie Mercury, writer, Queen, Single, Liar and 1971; one is annotated "Produce results for d = 1!", another "Produce no results for d = 1!"]
  • 31. INTEGRATED QUERY PLAN Terminate early after computing top-k instead of all results • Use rank join operators • Introduce top-k union operator [Figure: integrated plan over the keyword elements Freddie Mercury, Queen, Single, writer]
  • 32. TOP-K PLANS Integrated Query Plan is composition of sub-plans • Some might produce no results • Some sub-plans produce results earlier than others Rank not only results, but also rank operators (hence plans) • Global score of rank join operator, based on current results and upper bounds for subsequent join operations • Only the operator with the highest global score can push results to subsequent operators • Otherwise, activate lower level data access operators
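The early-termination idea behind the integrated plan can be sketched with a lazy top-k union over sub-plans that each emit results in descending score order. This is only an illustration under that sortedness assumption: it does not implement the global operator scores or upper bounds described on the slide, and the sub-plan names, scores and result labels are made up.

```python
import heapq
from itertools import islice

pulled = []  # records which sub-plan was asked for a result, to show laziness

def sub_plan(name, results):
    """A hypothetical sub-plan yielding (score, result) in descending score order."""
    for r in results:
        pulled.append(name)
        yield r

def top_k_union(k, *ranked_inputs):
    """Lazily merge inputs sorted by descending score and stop after k results."""
    merged = heapq.merge(*ranked_inputs, key=lambda r: r[0], reverse=True)
    return list(islice(merged, k))

plan_a = sub_plan("a", [(0.9, "g1"), (0.5, "g2"), (0.1, "g3")])
plan_b = sub_plan("b", [(0.8, "g4"), (0.2, "g5")])
plan_c = sub_plan("c", [])  # a sub-plan that produces no results

top2 = top_k_union(2, plan_a, plan_b, plan_c)
print(top2, len(pulled))
```

Because the union stops after k results, the lower-ranked results of each sub-plan are never computed, and an empty sub-plan costs almost nothing.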
  • 33. INDEX-BASED TOP-K KEYWORD QUERY PROCESSING [CIKM11B] Benefits • One order of magnitude faster than online graph exploration • Compared with graph indexing approaches, our solution reduces storage requirements by up to 86% and improves performance by more than 50% on average
  • 34. SEARCH TECHNOLOGY INNOVATIONS Integrated Zero Upfront Effort / On-Demand • Does not require preprocessing, upfront integration (Watson) Fresh Results / Timely Response Relational • Entities (Yahoo!, Google, Facebook Graph Search) • Plus relations, paths, graphs… Zero Manual Effort • Does not require expert to specify search forms (E-commerce search), structure templates, translation rules and domain adaptation (Wolfram Alpha, Watson) • Interpretation of keywords and structural context, i.e. relevant relations between entities through online graph exploration
  • 35. WHAT HAVE WE ACHIEVED? Volume: fast access? all data/datasets? • Quick IR-style keyword-based lookup • Reduce search space / result candidates • Handle hundreds of datasets with response times within a few seconds (with local sources) • Ranking performance consistently superior to the state of the art (20% improvement in terms of F-measure) according to keyword search benchmark 2012 • Structured, semi-structured → unstructured? hybrid data management?
  • 36. WHAT HAVE WE ACHIEVED? Velocity: fresh results? preprocessing? • On-demand stream-based processing, i.e. exploration of sources, data integration and result combination at querying time • No need to process / store all data • Fresh results from external sources can be guaranteed
  • 37. WHAT HAVE WE ACHIEVED? Variety: different datasets, schemas and formats • Interpretation of data semantics and matching across datasets performed at querying time • No assumptions about the schema, i.e. can handle unknown, possibly semi-structured data • Works well when data sources are homogeneous, i.e. large overlaps / matching signals are numerous and specific → heterogeneous data from different domains with small overlaps / no specific matching signals?
  • 38. BIG PICTURE Previous & Current & Future Work. Acquire • Source selection [ISWC10a, TKDE12b] • Stream-based processing of external sources [ISWC10b] • Combining local & external sources [ESWC12] Organize • Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a] • On-demand search-driven data integration [WebSci12] • Heterogeneous data integration [ICDE13, WSDM13] • Integration of hybrid big data Analyze • Descriptive entity summary [ISWC11] • Structural summary of datasets [TKDE12a] • Probabilistic models of text and structure [ICML13, SIGMOD13] • Hybrid big data management Search • Entity & relational search and ranking [SIGIR11, CIKM11b] • Keyword query processing [ICDE09, SIGMOD09] • Explorative Linked Data query processing [ESWC11] • Multi-datasets search [WWW12]. Volume: Fast access? All data/datasets? Velocity: Fresh results? Preprocessing? Variety: Heterogeneous Datasets/Schemas, Structured + Unstructured
  • 39. CONCLUSIONS Vision • Enabling end users to retrieve and explore relevant knowledge from Big Linked Data via intuitive interfaces! Status quo • End users can retrieve complex knowledge (complex graphs) from hundreds of Linked Data sources. 1-3 years from now • Improve “integrated view” coverage from 30% to 80% • Coverage of structured and unstructured results (from sensors, social networks etc.). 3-5 years from now • Robust probabilistic models of hybrid Big Linked Data • For search, ranking, as well as analytics and prediction?
  • 41. REFERENCES (1) • [ICML13] Veli Bicer, Thanh Tran. Topical Relational Model. Submitted to International Conference on Machine Learning (ICML'13). • [SIGMOD13] TopGuess: Query Selectivity Estimation over Text-rich Data Graphs. Submitted to SIGMOD13. • [ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April 2013. • [WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February 2013. • [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal. • [TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal. • [WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April 2012. • [WebSci12] Thanh Tran, Yongtao Ma, Gong Cheng. Pay-less Entity Consolidation – Exploiting Entity Search User Feedbacks for Pay-as-you-go Entity Data Integration. In Proceedings of Web Science Conference 2012 (WebSci'12). Evanston, USA, June 2012. • [CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October 2011. • [CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October 2011.
  • 42. REFERENCES (2) • [ISWC11] Gong Cheng, Thanh Tran, Yuzhong Qu. RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization. In Proceedings of 10th International Semantic Web Conference (ISWC'11). Koblenz, Germany, October 2011. • [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11). Beijing, China, July 2011. • [DEXA11] Andreas Wagner, Günter Ladwig, Thanh Tran. Browsing-oriented Semantic Faceted Search. In Proceedings of 22nd International Conference on Database and Expert Systems Applications (DEXA'11). Toulouse, France, August 2011. • [ISWC10a] Thanh Tran, Lei Zhang, Rudi Studer. Summary Models for Routing Keywords to Linked Data Sources. In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November 2010. • [ISWC10b] Günter Ladwig, Thanh Tran. Linked Data Query Processing Strategies. In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November 2010. • [JWS09] Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu, Yue Pan. Semplore: A Scalable IR Approach to Search the Web of Data. In Journal of Web Semantics 7 (3), September 2009. • [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009. • [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July 2009.
  • 44. QUERY INTERPRETATION [ICDE09, SIGMOD09] Focus on query interpretations instead of final answers • Leverage the power of the underlying DB query engine for processing interpretations. Reduction of search space • Query interpretation on structure summary generated from data • Exploration on reduced search space! Focus on top-k results • Top-k procedure for exploring and finding the k best results. Example: “written by freddie queen single” is interpreted as <x, type, Single> <Queen, producer, x> <Freddie Mercury, writer, x> <Queen, type, Band> <Freddie Mercury, type, Artist> [Figure: structure summary with nodes Person, Artist, Band, Single, Literal and edges member, producer, writer, marital status; matched elements Freddie Mercury, Queen, Queen Elizabeth 1, single]
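One way to picture the "structure summary generated from data" on this slide is to collapse every resource to its type, so that exploration runs over a small class-level graph instead of the full data graph. The sketch below assumes each resource has exactly one type and uses made-up triples; it is not the summary construction from the cited papers.

```python
# Hypothetical data triples; type and predicate names follow the slide.
TRIPLES = [
    ("Freddie Mercury", "type", "Artist"),
    ("Brian May", "type", "Artist"),
    ("Queen", "type", "Band"),
    ("Liar", "type", "Single"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Queen", "producer", "Liar"),
]

def structure_summary(triples):
    """Map each relation between resources to a relation between their types."""
    types = {s: o for s, p, o in triples if p == "type"}
    return {
        (types[s], p, types[o])
        for s, p, o in triples
        if p != "type" and s in types and o in types
    }

summary = structure_summary(TRIPLES)
print(sorted(summary))
```

Here four relation triples collapse into three summary edges; on real datasets the reduction is what makes exploring candidate query interpretations affordable.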
  • 45. QUERY INTERPRETATION Benefits • Outperforms online bidirectional search by at least one order of magnitude • Performance comparable with index-based approaches, but requires less space Drawbacks • “Meaningful” interpretations may generate empty results • Relies on DB query engine, native tailored optimization not possible
  • 46. BIG PICTURE Previous & Current Work. Acquire • Source selection [ISWC10a, TKDE12b] • Stream-based processing of external sources [ISWC10b] Organize • Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a] • On-demand search-driven data integration [WebSci12] Analyze • Descriptive resource summary [ISWC11] • Structural summary of datasets [TKDE12a] Search • Entity & relational search and ranking [SIGIR11, CIKM11b] • Keyword query processing [ICDE09, SIGMOD09] • Explorative Linked Data query processing [ESWC11]. Volume: Fast access? All data/datasets? Velocity: Fresh results? Preprocessing?
  • 47. BIG PICTURE Previous & Current Work. Acquire • Source selection [ISWC10a, TKDE12b] • Stream-based processing of external sources [ISWC10b] • Combining local & external sources [ESWC12] Organize • Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a] • On-demand search-driven data integration [WebSci12] Analyze • Descriptive entity summary [ISWC11] • Structural summary of datasets [TKDE12a] Search • Entity & relational search and ranking [SIGIR11, CIKM11b] • Keyword query processing [ICDE09, SIGMOD09] • Explorative Linked Data query processing [ESWC11] • Multi-datasets search [WWW12]. Volume: Fast access? All data/datasets? Velocity: Fresh results? Preprocessing? Variety: Heterogeneous Datasets/Schemas, Structured + Unstructured
  • 48. SEMANTIC SEARCH TECHNIQUES FOR LINKING Linking homogeneous data • Given structured entity description, find matching entities described using same/similar schema. Linking heterogeneous data • Given structured entity, find matching entities described using different schemas. Linking hybrid data • Given text mentions, find matching entities (no schema). Keyword search • Given keywords, find matching entities (no schema). Example records shown on the slide, in order: (name: Tran Thanh, age: 31); (name: Tran Thanh, age: 31); (id: p1, description: Tran Duc Thanh, age 31, works at..); (label: Tran Duc Thanh, age: 31); (name: Tran Thanh, age: 31); (content: Tran Duc Thanh, a researcher at KIT…); (name: Tran Thanh, age: 31); (query: Tran Duc Thanh). Search-based Linking • Adopt methods for semantic matching and ranking for schema-agnostic linking in hybrid & heterogeneous data scenarios • Embed linking into the search process to leverage user feedback

Editor's Notes

  1. Not only single datasets but the entire LDC. Opportunity: combine information from different sources and domains to address complex information needs. For our scenario: information about Freddie and the singles written by him from Wikipedia, or more precisely WKP, a dataset we obtain by combining DBpedia with the texts in Wikipedia.
  2. Information about the single, Queen and Freddie from MusicBrainz. sameAs links can be used to combine information.
  3. Some more information about Queen; however, it is not the same as Queen in MusicBrainz. When combining information from different datasets, we need to know which resources refer to the same real-world object, so that we combine and integrate information only from the same resources.
  5. To give an idea of this vision, I would like to show a screenshot of a technology demonstrator called IWB. It supports the process of Big Linked Data Semantic Search: it starts with keyword search (interpreting the query intent), followed by browsing / exploration / refinement of the result set via faceted search.
  6. Comparison with bidirectional search [V. Kacholia et al.] and search based on graph indexing [H. He et al.]. Time for query computation + time for processing queries. Outperforms bidirectional search by at least one order of magnitude. Performance comparable with indexing-based approaches, but requires less space. No schema / summary needed. Supports different types of data, e.g. RDF graphs, document graphs, hybrid data graphs. No empty results. Native tailored optimization.