1. Large Knowledge Collider (LarKC) :
A Platform for Web Scale Reasoning
Ning Zhong (1,3), Frank van Harmelen (2), Yi Zeng (3), Zhisheng Huang (2)
(1) Maebashi Institute of Technology, Japan
(2) Vrije Universiteit Amsterdam, the Netherlands
(3) International WIC Institute, Beijing University of Technology, China
http://www.larkc.eu
2. The World is Creating the Linked Data Every Day!
Late breaking news: Google Video is now also annotated with RDFa (from Yahoo and Facebook), using vocabularies
3. [Slide graphic: “four million documents per day”]
13. What to do for the success of Web-scale Semantic Data Processing?
Refining Search by Reasoning [Berners-Lee 1999]
Refining Reasoning by Search [Fensel & Frank 2007]
Unifying Search and Reasoning (ReaSearch) [Fensel & Frank 2007]
14. The LarKC Consortium
13 partner institutions (from 11 countries, 2 from Asia)
15. The Large Knowledge Collider
a platform for infinitely scalable reasoning
on the data-web
16. “a configurable platform for
infinitely scalable semantic web reasoning”
A “pipeline” suggests a linear structure, but LarKC also supports other plug-in configurations.
23. Parallelization
“I am dealing with Web-scale data: 7×10^8 triples”
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
24. Data dependencies
“Two for the price of one?” “2nd for half price?”
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
25. Split Responsibility
“Two for the price of one?” “2nd for half price?”
Categories: Fruit, Vegetables, Household, Packaged, Rest
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
26. Load Balancing
“Two for the price of one?” “2nd for half price?”
Categories: Fruit, Vegetables, Household, Packaged, Rest
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
27. Data dependencies
“With a box of detergent and a box of cereal, get a free pen!”
For RDF data, any triple can refer to any URI.
Categories: Fruit, Vegetables, Household, Packaged, Rest
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
28. Towards Parallelization and Distribution
Different parallel computing models:
− Peer-to-peer (MaRVIN)
− Map-Reduce (Reasoning-Hadoop)
29. The MaRVIN Way: Divide-Conquer-Swap
[Diagram: input data spread over many parallel compute nodes; partial results circulate between the nodes and are collected as output data]
Eyal Oren, Spyros Kotoulas
30. MARVIN
(Massive RDF Versatile Inference Network)
… is:
− a distributed technique for computing RDFS/OWL closure
… scales by:
− distributing computation over many nodes
− approximate (sound but incomplete) reasoning
− anytime convergence (more complete over time)
… runs on:
− in principle: any grid, using Ibis middleware
− the DAS-3 distributed supercomputer (300 nodes)
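The divide-conquer-swap loop can be sketched as follows. This is a toy Python sketch, not the real MaRVIN: two in-process “peers”, a single RDFS rule (rdfs11, transitivity of subClassOf), and a random repartition of the union playing the role of the swap step. The key property it illustrates is anytime convergence: every intermediate result is sound, and completeness grows with the number of rounds.

```python
import random

random.seed(0)

def closure(triples):
    """Exhaustive application of rdfs11 (subClassOf transitivity)."""
    triples = set(triples)
    while True:
        new = {(a, p, d)
               for (a, p, b) in triples if p == "subClassOf"
               for (c, q, d) in triples if q == "subClassOf" and c == b}
        if new <= triples:
            return triples
        triples |= new

# a subclass chain C0 < C1 < ... < C6, to be split over two toy "peers"
data = {(f"C{i}", "subClassOf", f"C{i+1}") for i in range(6)}
target = closure(data)              # what a single machine would derive

parts = [set(), set()]
for t in data:                      # divide: random initial partition
    parts[random.randrange(2)].add(t)

for _ in range(8):
    parts = [closure(p) for p in parts]   # conquer: local closure per peer
    pool = sorted(parts[0] | parts[1])    # swap: repartition the union
    random.shuffle(pool)
    parts = [set(pool[::2]), set(pool[1::2])]

derived = parts[0] | parts[1]
print(f"derived {len(derived & target)}/{len(target)} triples; "
      f"sound: {derived <= target}")
```

At any point the derived set is a sound subset of the full closure; stopping early simply trades completeness for time, which is exactly the anytime behaviour claimed above.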
34. The MapReduce
Distributed Programming Model
Initially designed and developed at Google in 2004 for large-scale data
processing [Dean & Ghemawat 2004].
The computation is expressed with two functions: map and reduce.
Map-Reduce on 64 machines:
Peak inference rates at 8M triples/sec
Sustained inference rates at 4M triples/sec
[Diagram: input triples (A p C, A q B, D r D, E r D, …, F r C) are emitted by Map as ⟨key, triple⟩ pairs, grouped, and combined by Reduce]
Jacopo Urbani
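As a sketch of how a single RDFS rule fits this model (toy data and names of our own choosing; the real system by Jacopo Urbani runs on Hadoop over billions of triples): map emits each triple under a join key, and reduce fires rule rdfs9 within each group.

```python
from collections import defaultdict

# toy triples (subject, predicate, object)
triples = [
    ("cat", "subClassOf", "mammal"),
    ("mammal", "subClassOf", "animal"),
    ("tom", "type", "cat"),
]

def map_fn(triple):
    """Emit each triple under a join key: schema triples under their
    subject, instance triples under their class."""
    s, p, o = triple
    if p == "subClassOf":
        yield s, triple
    elif p == "type":
        yield o, triple

def reduce_fn(key, values):
    """Rule rdfs9: (x type C), (C subClassOf D) => (x type D)."""
    supers = [o for (_, p, o) in values if p == "subClassOf"]
    instances = [s for (s, p, _) in values if p == "type"]
    for x in instances:
        for d in supers:
            yield (x, "type", d)

# driver: shuffle map output by key, then reduce each group
groups = defaultdict(list)
for t in triples:
    for k, v in map_fn(t):
        groups[k].append(v)

derived = [t for k, vs in groups.items() for t in reduce_fn(k, vs)]
print(derived)
```

A full closure iterates map/reduce to a fixpoint: here a second pass over the input plus the derived triple would also conclude that tom is an animal.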
36. Stopping Rules
On very large datasets,
incompleteness is the rule
Must stop before we are finished
When to stop?
Stopping rules are important
− determine length of computation
(don’t stop too late)
− quality of result
(don’t stop too early)
37. Take inspiration from economics, biology, psychology
Lael Schooler
Humans have good heuristics for when to stop problem solving, e.g. “Name capital cities in Europe”:
London, Paris, Berlin, Rome, Amsterdam, … Milan, Madrid, …., ….., Paris, ….
Stopping cues: time between solutions, wrong answers, repetitions
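A minimal machine version of the “repetitions” cue can be sketched as follows (the answer stream and the `patience` threshold are illustrative assumptions, not from the talk): stop once consecutive answers stop being new.

```python
def stream_answers():
    # hypothetical answer stream from an anytime process: over time it
    # starts repeating itself instead of producing new solutions
    yield from ["London", "Paris", "Berlin", "Rome", "Amsterdam",
                "Milan", "Madrid", "Paris", "Oslo", "Berlin", "Rome"]

def collect_until_stale(answers, patience=2):
    """Stop after `patience` consecutive repetitions -- a crude stand-in
    for the human 'time between new solutions' heuristic."""
    seen, stale = [], 0
    for a in answers:
        if a in seen:
            stale += 1
            if stale >= patience:
                break           # don't stop too late
        else:
            seen.append(a)
            stale = 0           # a fresh answer resets the clock
    return seen

print(collect_until_stale(stream_answers()))
```

Tuning `patience` is exactly the trade-off on the previous slide: a small value risks stopping too early (lower quality), a large one stops too late (longer computation).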
38. When to switch between tasks?
Lael Schooler
hard task & easy task, combined task
Humans (& animals) are very good at finding this optimum
40. Take data-selection seriously
Where do the axioms come from?
• Which subset to use?
• Relevance measures Zhisheng Huang
• Example: syntactic relevance:
• δ(α,β)=1 if α,β share a concept symbol
• δ(α,β)=k if δ(α,γ)=k-1 and
β,γ share a concept symbol
• very simple measure,
very syntactically unstable, but:
Gives a high-quality sound approximation
(> 90% recall, 100% precision for small k)
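The recursive definition of δ is essentially a breadth-first search over axioms linked by shared concept symbols. A sketch, under the simplifying assumption that each axiom is abstracted to its set of concept symbols (real axioms would be parsed OWL/DL formulas):

```python
from collections import deque

# hypothetical axioms, each reduced to its concept symbols
axioms = {
    "a1": {"Cat", "Mammal"},
    "a2": {"Mammal", "Animal"},
    "a3": {"Animal", "LivingThing"},
    "a4": {"Car", "Vehicle"},
}

def syntactic_relevance(start, axiom_symbols):
    """BFS computing delta: delta(alpha, beta) = 1 if they share a
    concept symbol; delta(alpha, beta) = k if some gamma with
    delta(alpha, gamma) = k-1 shares a symbol with beta."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b, syms in axiom_symbols.items():
            if b not in dist and axiom_symbols[a] & syms:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

print(syntactic_relevance("a1", axioms))
# a4 never appears: no chain of shared symbols reaches it
```

Selecting the subset with δ ≤ k for a small k is what yields the sound approximation quoted above.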
41. Take identifiers seriously
exploit the grounding of logical symbols
in natural language
• Google distance as relevance measure
Zhisheng Huang
NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)})
= symmetric conditional probability of co-occurrence
= estimate of semantic distance
Gives an almost perfect “forgetting function”
for matching class definitions in 2 vocabularies
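The formula transcribes directly into code. The page counts below are hypothetical, in the style of the classic horse/rider example of Cilibrasi & Vitányi; all counts are assumed positive.

```python
import math

def ngd(fx, fy, fxy, M):
    """Normalized Google Distance from page counts (all assumed > 0):
    fx, fy = hits for x and y alone; fxy = hits for both together;
    M = total number of pages indexed."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# hypothetical counts in the style of the horse/rider example
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))
```

Small NGD means the two terms co-occur far more often than chance, i.e. they are semantically close; matching class names across two vocabularies then amounts to thresholding this distance.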
42. Unifying Search and Reasoning from the
Viewpoint of Granularity
Barriers for Web-scale Problem Solving:
(1) most relevant data vs. the search-results space [Berners-Lee 1999].
(2) traditional reasoning systems vs. Web-scale data vs. rational time [Fensel 2007].
Refining Search by Reasoning [Berners-Lee 1999]
Refining Reasoning by Search [Fensel & Frank 2007]
Unifying Search and Reasoning (ReaSearch) [Fensel & Frank 2007]
Granularity: Human Problem Solving → inspire! → Web Problem Solving
Basic-level advantage, cognitive memory retention
Multi-level, multi-perspective, variable precision
44. The Starting Point Strategy
[Collins 1969] Collins, A.M., Quillian, M.R.: Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior 8 (1969) 240-247
45. (I) The Starting Point Strategy
The “basic level advantage” [Rogers2007]: concepts at the basic level are used more frequently than other terms [Wisniewski1989].
• (Frequency) Total Interest: TI(i) = ∑_{j=1..n} m(i, j)
Going beyond a “familiar term” at the basic level, “interest retention” considers frequency and recency at the same time.
Interest retention models <--> cognitive memory retention models [Anderson, Schooler 1991].
• (Frequency and Recency) Exponential Model for Interest Retention: EIR(i) = ∑_{j=1..n} m(i, j) × A·e^(−b·T_i)
• (Frequency and Recency) Power Model for Interest Retention: PIR(i) = ∑_{j=1..n} m(i, j) × A·T_i^(−b)
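The three measures differ only in how each mention is weighted. A sketch with hypothetical mention counts and arbitrary parameters A and b (the slides write T_i; this sketch applies the decay per mention year, which captures the same frequency-plus-recency intent):

```python
import math

# hypothetical mention counts m(i, j): occurrences of term i among the
# papers of year j (the original work mined DBLP publication titles)
years = [2005, 2006, 2007, 2008]
mentions = {"ontology": [3, 5, 8, 12], "expert system": [20, 6, 2, 1]}

A, b, now = 1.0, 0.5, 2009      # arbitrary model parameters

def total_interest(m):
    # frequency only
    return sum(m)

def exp_retention(m):
    # frequency and recency: older mentions decay exponentially
    return sum(c * A * math.exp(-b * (now - y)) for c, y in zip(m, years))

def power_retention(m):
    # frequency and recency: older mentions decay by a power law
    return sum(c * A * (now - y) ** -b for c, y in zip(m, years))

for term, m in mentions.items():
    print(term, total_interest(m),
          round(exp_retention(m), 2), round(power_retention(m), 2))
```

Note how the declining term wins on raw frequency (TI) while the rising term wins on both retention scores, which is the point made about “Knowledge” on slide 48.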
46. Interest Retention and Interest Prediction
[Figures: a comparative study of TI during 1990-2008 and IR in 2009; differences in the contribution values from papers published in different years; comparative studies of predicted vs. real publication numbers under the power law and the exponential law models]
47. Evaluations and the Released Dataset
• Interest retention vs. future interests: authors with >= 100 publications, top 9 interests, data from 2000 to 2007, 1226 persons; 49.54% predict 3 out of 9 interests.
• 615,124 computer scientists in the SwetoDBLP dataset.
• http://wiki.larkc.eu/csri-rdf
48. DBLP-SSE: DBLP Search Support Engine
Recent interests are extracted using the power law interest retention model.
Terms with high frequency do not necessarily have high interest retention (e.g. “Knowledge”).
49. DBLP-SSE : DBLP Search Support Engine
Logged-in user: Dieter Fensel
Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language
Query : Artificial Intelligence
List 1 : without current interests constraints (Top 5 results)
* PROLOG Programming for Artificial Intelligence, Second Edition.
* Artificial Intelligence Architectures for Composition and Performance
Environment.
* Artificial Intelligence in Music Education: A Critical Review.
* Music, Intelligence and Artificiality. Artificial Intelligence and Music
Education.
* Musical Knowledge: What can Artificial Intelligence Bring to the
Musician?
* ...
List 2 : with current interests constraints (Top 5 results)
* Web Intelligence and Artificial Intelligence in Education.
* Artificial Intelligence Exchange and Service Tie to All Test
Environments (AI-ESTATE)-A New Standard for System Diagnostics.
* Semantic Model for Artificial Intelligence Based on Molecular
Computing.
* Open Information Systems Semantics for Distributed Artificial
Intelligence.
* Artificial Intelligence and Financial Services.
*…
50. Multi-level Completeness Strategy
Limited time → low completeness; more time available → high completeness
One practical question: how to choose the nodes to be reasoned over?
51. Choosing the pivotal nodes in the network first!
Another question: if I stop here, what is the completeness now?
52. Multi-level Completeness Strategy
Nodes are grouped together by node degree under a perspective.
Completeness Prediction Function:
PC(i) = [ |Nrel(i)| × (|Nsub(i)| − |Nsub(i′)|) ] / [ |Nrel(i)| × (|N| − |Nsub(i′)|) + |Nrel(i′)| × (|Nsub(i′)| − |N|) ]
Query: “Who are the authors in Artificial Intelligence?”
degree(n, Pcn) to stop   Satisfied authors   AI authors
70                                  2885          151
30                                 17121          579
11                                 78868         1142
4                                 277417         1704
1                                 575447         2225
0                                 615124         2355
[Figures: unifying search and reasoning with multi-level completeness and anytime behavior; comparison of predicted and actual completeness values]
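The anytime behaviour of degree-ordered processing can be simulated. The network below is entirely synthetic (heavy-tailed degrees, relevance correlated with degree); only the degree thresholds are borrowed from the table above.

```python
import random

random.seed(0)

# synthetic network: 10,000 authors with heavy-tailed coauthor degrees;
# "AI authors" (the relevant answers) are more likely among high-degree hubs
authors = []
for _ in range(10_000):
    degree = int(random.paretovariate(1.2))
    is_ai = random.random() < min(0.9, 0.01 * degree)
    authors.append((degree, is_ai))

total_ai = sum(a for _, a in authors)

def anytime_query(min_degree):
    """Process only nodes with degree >= min_degree (pivotal nodes first);
    return (#nodes visited, completeness of the answer set so far)."""
    visited = [(d, a) for d, a in authors if d >= min_degree]
    found = sum(a for _, a in visited)
    return len(visited), found / total_ai

for threshold in (70, 30, 11, 4, 1):
    n, comp = anytime_query(threshold)
    print(f"degree >= {threshold:2d}: visited {n:5d}, completeness {comp:.2%}")
```

Because relevant answers concentrate on hubs, completeness grows much faster than the fraction of nodes visited, which is why stopping early at a high degree threshold can already be useful.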
54. A Case Study on Multi-level Specificity Strategy
Answers to “Who are the authors in Artificial Intelligence?” in multiple levels of specificity, according to the hierarchical ontology of Artificial Intelligence:
Specificity   Relevant Keywords          Number of Authors
Level 1       Artificial Intelligence                 2355
Level 2       Agents                                  9157
              Automated Reasoning                      222
              Cognition                              19775
              Constraints                             8744
              Games                                   3817
              Knowledge Representation                1537
              Natural Language                        2939
              Robot                                  16425
              …                                          …
Level 3       Case-Based Reasoning                    1133
              Cognitive Modeling                        76
              Decision Trees                          1112
              Search                                 32079
              Translation                             4414
              Web Intelligence                         122
              …                                          …
A comparative study on the answers in different levels of specificity:
Specificity   Number of authors   Completeness
Level 1                    2355          0.85%
Level 1,2                207468         75.11%
Level 1,2,3              276205           100%
55. The Multi-perspective Strategy
Multiple representations of knowledge [Minsky2006]
User needs may differ from each other <--> users expect answers from different perspectives.
[Figure: normalized degree distribution of predicates in the SwetoDBLP dataset]
56. The Multi-perspective Strategy
Under different perspectives, the distribution characteristics are different!
Fig. 2. Coauthor number distribution in the SwetoDBLP dataset.
Fig. 3. Log-log diagram of Figure 2.
Fig. 4. A zoomed-in version of Figure 2.
Fig. 5. A zoomed-in version of the coauthor distribution for “Artificial Intelligence”.
Fig. 6. Publication number distribution in the SwetoDBLP dataset.
Fig. 7. Log-log diagram of Figure 6.
57. Comparison of Results
from Different Perspectives
A partial result of the multi-level specificity reasoning task: the list of authors in “Artificial Intelligence” at level 1, from two perspectives.
Publication number perspective Coauthor number perspective
Thomas S. Huang (387) Carl Kesselman (312)
John Mylopoulos (261) Thomas S. Huang (271)
Hsinchun Chen (260) Edward A. Fox (269)
Henri Prade (252) Lei Wang (250)
Didier Dubois (241) John Mylopoulos (245)
Thomas Eiter (219) Ewa Deelman (237)
... ...
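The two columns are one answer set sorted under different keys. A minimal sketch (publication and coauthor counts partly taken from the table, partly hypothetical, as marked):

```python
# the same answer set ranked under two perspectives
authors = [
    ("Thomas S. Huang", 387, 271),   # (name, publications, coauthors)
    ("John Mylopoulos", 261, 245),
    ("Hsinchun Chen", 260, 180),     # coauthor count is hypothetical
    ("Carl Kesselman", 150, 312),    # publication count is hypothetical
]

def top_k(records, perspective, k=3):
    """Rank one answer set under a chosen perspective."""
    key = {"publications": lambda r: r[1], "coauthors": lambda r: r[2]}
    ranked = sorted(records, key=key[perspective], reverse=True)
    return [name for name, _, _ in ranked][:k]

print(top_k(authors, "publications"))
print(top_k(authors, "coauthors"))
```

No single ordering serves every user: the perspective is a query parameter, not a property of the data.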
58. Summarizing
The Semantic Web is rapidly becoming real
Scale is becoming a real problem
Different ways of scaling up:
− parallelization
− exploiting cognitive heuristics
Stopping rules, cognitive memory retention, etc.
− data-selection for incomplete reasoning.
− new forms of reasoning
60. Acknowledgement
Slides for this talk are mainly from 3 previous talks:
Frank van Harmelen. Large Scale Reasoning on the Semantic Web, or: When Success is Becoming a Problem. Invited talk at the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Yi Zeng. Unifying Web-scale Search and Reasoning from the Viewpoint of Granularity. The 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar, University of Amsterdam, 2008.
61. Contact Info
Want to play with LarKC?
Want to contribute plugins?
Want to deploy LarKC?
Frank.van.Harmelen@cs.vu.nl
http://www.larkc.eu
Asia:
Ning Zhong: zhong@maebashi-it.ac.jp
Yi Zeng: yzeng@emails.bjut.edu.cn
@ WIC
62. References
[Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperSanFrancisco (1999)
[Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2) (2007) 94-96
[Michalski1986] Michalski, R.S., Winston, P.H.: Variable precision logic. Artificial Intelligence 29(2) (1986) 121-146
[Minsky2006] Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster (2006)
[Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and explanations of the basic-level advantage. Journal of Experimental Psychology: General 136(3) (2007) 451-469
[Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of Learning and Cognitive Processes. Lawrence Erlbaum Associates, Hillsdale, NJ (1976) 321-361
[Aleman-Meza2007] Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: SwetoDBLP ontology of computer science publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3) (2007) 151-155
[Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental Psychology. Teachers College, Columbia University (1913)