1. Large Knowledge Collider (LarKC) :
A Platform for Web Scale Reasoning
Ning Zhong (1,3), Frank van Harmelen (2), Yi Zeng (3), Zhisheng Huang (2)
(1) Maebashi Institute of Technology, Japan
(2) Vrije Universiteit Amsterdam, the Netherlands
(3) International WIC Institute, Beijing University of Technology, China
http://www.larkc.eu
2. The World is Creating the Linked Data Every Day!
Late breaking news: Google Video is now also annotated with RDFa (from Yahoo and Facebook), using vocabularies
3. [Slide graphic: “four million documents per day”]
13. What to do for the success of Web-scale Semantic Data Processing?
Refining Search by Reasoning [Berners-Lee 1999]
Refining Reasoning by Search [Fensel & Frank 2007]
Unifying Search and Reasoning (ReaSearch) [Fensel & Frank 2007]
14. The LarKC Consortium
13 partner institutions (from 11 countries, 2 from Asia)
15. The Large Knowledge Collider
a platform for infinitely scalable reasoning
on the data-web
16. “a configurable platform for
infinitely scalable semantic web reasoning”
A “pipeline” suggests a linear structure, but LarKC also supports other plug-in configurations.
23. Parallelization
“I am dealing with Web-scale data: 7×10^8 triples”
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
24. Data dependencies
“Two for the price of one?” “2nd for half price?”
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
25. Split Responsibility
“Two for the price of one?” “2nd for half price?”
Categories: Fruit, Vegetables, Household, Packaged, Rest
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
26. Load Balancing
“Two for the price of one?” “2nd for half price?”
Categories: Fruit, Vegetables, Household, Packaged, Rest
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
27. Data dependencies
“With a box of detergent and a box of cereal, get a free pen!”
For RDF data, any triple can refer to any URI.
Categories: Fruit, Vegetables, Household, Packaged, Rest
Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
28. Towards Parallelization and Distribution
Different parallel computing models:
− Peer-to-peer (MaRVIN)
− Map-Reduce (Reasoning-Hadoop)
29. The MaRVIN Way: Divide-Conquer-Swap
[Diagram: input data spread over many parallel compute nodes; partial results circulate between the nodes and are collected as output data]
Eyal Oren, Spyros Kotoulas
30. MARVIN
(Massive RDF Versatile Inference Network)
… is:
− a distributed technique for computing RDFS/OWL closure
… scales by:
− distributing computation over many nodes
− approximate (sound but incomplete) reasoning
− anytime convergence (more complete over time)
… runs on:
− in principle: any grid, using Ibis middleware
− the DAS-3 distributed supercomputer (300 nodes)
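The divide-conquer-swap loop can be sketched as follows. This is a toy Python sketch, not the real MaRVIN: two in-process “peers”, a single RDFS rule (rdfs11, transitivity of subClassOf), and a random repartition of the union playing the role of the swap step. The key property it illustrates is anytime convergence: every intermediate result is sound, and completeness grows with the number of rounds.

```python
import random

random.seed(0)

def closure(triples):
    """Exhaustive application of rdfs11 (subClassOf transitivity)."""
    triples = set(triples)
    while True:
        new = {(a, p, d)
               for (a, p, b) in triples if p == "subClassOf"
               for (c, q, d) in triples if q == "subClassOf" and c == b}
        if new <= triples:
            return triples
        triples |= new

# a subclass chain C0 < C1 < ... < C6, to be split over two toy "peers"
data = {(f"C{i}", "subClassOf", f"C{i+1}") for i in range(6)}
target = closure(data)              # what a single machine would derive

parts = [set(), set()]
for t in data:                      # divide: random initial partition
    parts[random.randrange(2)].add(t)

for _ in range(8):
    parts = [closure(p) for p in parts]   # conquer: local closure per peer
    pool = sorted(parts[0] | parts[1])    # swap: repartition the union
    random.shuffle(pool)
    parts = [set(pool[::2]), set(pool[1::2])]

derived = parts[0] | parts[1]
print(f"derived {len(derived & target)}/{len(target)} triples; "
      f"sound: {derived <= target}")
```

At any point the derived set is a sound subset of the full closure; stopping early simply trades completeness for time, which is exactly the anytime behaviour claimed above.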
34. The MapReduce
Distributed Programming Model
Initially designed and developed at Google in 2004 for large-scale data
processing [Dean & Ghemawat 2004].
The computation is expressed with two functions: map and reduce.
Map-Reduce on 64 machines:
Peak inference rates at 8M triples/sec
Sustained inference rates at 4M triples/sec
[Diagram: input triples (A p C, A q B, D r D, E r D, …, F r C) are emitted by Map as ⟨key, triple⟩ pairs, grouped, and combined by Reduce]
Jacopo Urbani
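As a sketch of how a single RDFS rule fits this model (toy data and names of our own choosing; the real system by Jacopo Urbani runs on Hadoop over billions of triples): map emits each triple under a join key, and reduce fires rule rdfs9 within each group.

```python
from collections import defaultdict

# toy triples (subject, predicate, object)
triples = [
    ("cat", "subClassOf", "mammal"),
    ("mammal", "subClassOf", "animal"),
    ("tom", "type", "cat"),
]

def map_fn(triple):
    """Emit each triple under a join key: schema triples under their
    subject, instance triples under their class."""
    s, p, o = triple
    if p == "subClassOf":
        yield s, triple
    elif p == "type":
        yield o, triple

def reduce_fn(key, values):
    """Rule rdfs9: (x type C), (C subClassOf D) => (x type D)."""
    supers = [o for (_, p, o) in values if p == "subClassOf"]
    instances = [s for (s, p, _) in values if p == "type"]
    for x in instances:
        for d in supers:
            yield (x, "type", d)

# driver: shuffle map output by key, then reduce each group
groups = defaultdict(list)
for t in triples:
    for k, v in map_fn(t):
        groups[k].append(v)

derived = [t for k, vs in groups.items() for t in reduce_fn(k, vs)]
print(derived)
```

A full closure iterates map/reduce to a fixpoint: here a second pass over the input plus the derived triple would also conclude that tom is an animal.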
36. Stopping Rules
On very large datasets,
incompleteness is the rule
Must stop before we are finished
When to stop?
Stopping rules are important
− determine length of computation
(don’t stop too late)
− quality of result
(don’t stop too early)
37. Take inspiration from economics, biology, psychology
Lael Schooler
Humans have good heuristics for when to stop problem solving, e.g. “Name capital cities in Europe”:
London, Paris, Berlin, Rome, Amsterdam, … Milan, Madrid, …., ….., Paris, ….
Stopping cues: time between solutions, wrong answers, repetitions
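A minimal machine version of the “repetitions” cue can be sketched as follows (the answer stream and the `patience` threshold are illustrative assumptions, not from the talk): stop once consecutive answers stop being new.

```python
def stream_answers():
    # hypothetical answer stream from an anytime process: over time it
    # starts repeating itself instead of producing new solutions
    yield from ["London", "Paris", "Berlin", "Rome", "Amsterdam",
                "Milan", "Madrid", "Paris", "Oslo", "Berlin", "Rome"]

def collect_until_stale(answers, patience=2):
    """Stop after `patience` consecutive repetitions -- a crude stand-in
    for the human 'time between new solutions' heuristic."""
    seen, stale = [], 0
    for a in answers:
        if a in seen:
            stale += 1
            if stale >= patience:
                break           # don't stop too late
        else:
            seen.append(a)
            stale = 0           # a fresh answer resets the clock
    return seen

print(collect_until_stale(stream_answers()))
```

Tuning `patience` is exactly the trade-off on the previous slide: a small value risks stopping too early (lower quality), a large one stops too late (longer computation).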
38. When to switch between tasks?
Lael Schooler
hard task & easy task, combined task
Humans (& animals) are very good at finding this optimum
40. Take data-selection seriously
Where do the axioms come from?
• Which subset to use?
• Relevance measures Zhisheng Huang
• Example: syntactic relevance:
• δ(α,β)=1 if α,β share a concept symbol
• δ(α,β)=k if δ(α,γ)=k-1 and
β,γ share a concept symbol
• very simple measure,
very syntactically unstable, but:
Gives a high-quality sound approximation
(> 90% recall, 100% precision for small k)
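The recursive definition of δ is essentially a breadth-first search over axioms linked by shared concept symbols. A sketch, under the simplifying assumption that each axiom is abstracted to its set of concept symbols (real axioms would be parsed OWL/DL formulas):

```python
from collections import deque

# hypothetical axioms, each reduced to its concept symbols
axioms = {
    "a1": {"Cat", "Mammal"},
    "a2": {"Mammal", "Animal"},
    "a3": {"Animal", "LivingThing"},
    "a4": {"Car", "Vehicle"},
}

def syntactic_relevance(start, axiom_symbols):
    """BFS computing delta: delta(alpha, beta) = 1 if they share a
    concept symbol; delta(alpha, beta) = k if some gamma with
    delta(alpha, gamma) = k-1 shares a symbol with beta."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b, syms in axiom_symbols.items():
            if b not in dist and axiom_symbols[a] & syms:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

print(syntactic_relevance("a1", axioms))
# a4 never appears: no chain of shared symbols reaches it
```

Selecting the subset with δ ≤ k for a small k is what yields the sound approximation quoted above.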
41. Take identifiers seriously
exploit the grounding of logical symbols
in natural language
• Google distance as relevance measure
Zhisheng Huang
NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)})
= symmetric conditional probability of co-occurrence
= estimate of semantic distance
Gives an almost perfect “forgetting function”
for matching class definitions in 2 vocabularies
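The formula transcribes directly into code. The page counts below are hypothetical, in the style of the classic horse/rider example of Cilibrasi & Vitányi; all counts are assumed positive.

```python
import math

def ngd(fx, fy, fxy, M):
    """Normalized Google Distance from page counts (all assumed > 0):
    fx, fy = hits for x and y alone; fxy = hits for both together;
    M = total number of pages indexed."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# hypothetical counts in the style of the horse/rider example
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))
```

Small NGD means the two terms co-occur far more often than chance, i.e. they are semantically close; matching class names across two vocabularies then amounts to thresholding this distance.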
42. Unifying Search and Reasoning from the
Viewpoint of Granularity
Barriers for Web-scale Problem Solving:
(1) most relevant data vs. the search-results space [Berners-Lee 1999].
(2) traditional reasoning systems vs. Web-scale data vs. rational time [Fensel 2007].
Refining Search by Reasoning [Berners-Lee 1999]
Refining Reasoning by Search [Fensel & Frank 2007]
Unifying Search and Reasoning (ReaSearch) [Fensel & Frank 2007]
Granularity: Human Problem Solving → inspire! → Web Problem Solving
Basic-level advantage, cognitive memory retention
Multi-level, multi-perspective, variable precision
44. The Starting Point Strategy
[Collins 1969] Collins, A.M., Quillian, M.R.: Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior 8 (1969) 240-247
45. (I) The Starting Point Strategy
The “basic level advantage” [Rogers2007]: concepts at the basic level are used more frequently than other terms [Wisniewski1989].
• (Frequency) Total Interest: TI(i) = ∑_{j=1..n} m(i, j)
Going beyond a “familiar term” at the basic level, “interest retention” considers frequency and recency at the same time.
Interest retention models <--> cognitive memory retention models [Anderson, Schooler 1991].
• (Frequency and Recency) Exponential Model for Interest Retention: EIR(i) = ∑_{j=1..n} m(i, j) × A·e^(−b·T_i)
• (Frequency and Recency) Power Model for Interest Retention: PIR(i) = ∑_{j=1..n} m(i, j) × A·T_i^(−b)
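The three measures differ only in how each mention is weighted. A sketch with hypothetical mention counts and arbitrary parameters A and b (the slides write T_i; this sketch applies the decay per mention year, which captures the same frequency-plus-recency intent):

```python
import math

# hypothetical mention counts m(i, j): occurrences of term i among the
# papers of year j (the original work mined DBLP publication titles)
years = [2005, 2006, 2007, 2008]
mentions = {"ontology": [3, 5, 8, 12], "expert system": [20, 6, 2, 1]}

A, b, now = 1.0, 0.5, 2009      # arbitrary model parameters

def total_interest(m):
    # frequency only
    return sum(m)

def exp_retention(m):
    # frequency and recency: older mentions decay exponentially
    return sum(c * A * math.exp(-b * (now - y)) for c, y in zip(m, years))

def power_retention(m):
    # frequency and recency: older mentions decay by a power law
    return sum(c * A * (now - y) ** -b for c, y in zip(m, years))

for term, m in mentions.items():
    print(term, total_interest(m),
          round(exp_retention(m), 2), round(power_retention(m), 2))
```

Note how the declining term wins on raw frequency (TI) while the rising term wins on both retention scores, which is the point made about “Knowledge” on slide 48.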
46. Interest Retention and Interest Prediction
[Figures: a comparative study of TI during 1990-2008 and IR in 2009; differences in the contribution values from papers published in different years; comparative studies of predicted vs. real publication numbers under the power law and the exponential law models]
47. Evaluations and the Released Dataset
• Interest retention vs. future interests: authors with >= 100 publications, top 9 interests, data from 2000 to 2007, 1226 persons; 49.54% predict 3 out of 9 interests.
• 615,124 computer scientists in the SwetoDBLP dataset.
• http://wiki.larkc.eu/csri-rdf
48. DBLP-SSE: DBLP Search Support Engine
Recent interests are extracted using the power law interest retention model.
Terms with high frequency do not necessarily have high interest retention (e.g. “Knowledge”).
49. DBLP-SSE : DBLP Search Support Engine
Logged-in user: Dieter Fensel
Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language
Query : Artificial Intelligence
List 1 : without current interests constraints (Top 5 results)
* PROLOG Programming for Artificial Intelligence, Second Edition.
* Artificial Intelligence Architectures for Composition and Performance
Environment.
* Artificial Intelligence in Music Education: A Critical Review.
* Music, Intelligence and Artificiality. Artificial Intelligence and Music
Education.
* Musical Knowledge: What can Artificial Intelligence Bring to the
Musician?
* ...
List 2 : with current interests constraints (Top 5 results)
* Web Intelligence and Artificial Intelligence in Education.
* Artificial Intelligence Exchange and Service Tie to All Test
Environments (AI-ESTATE)-A New Standard for System Diagnostics.
* Semantic Model for Artificial Intelligence Based on Molecular
Computing.
* Open Information Systems Semantics for Distributed Artificial
Intelligence.
* Artificial Intelligence and Financial Services.
*…
50. Multi-level Completeness Strategy
Limited time → low completeness; more time available → high completeness
One practical question: how to choose the nodes to be reasoned over?
51. Choosing the pivotal nodes in the network first!
Another question: if I stop here, what is the completeness now?
52. Multi-level Completeness Strategy
Nodes are grouped together by node degree under a perspective.
Completeness Prediction Function:
PC(i) = [ |Nrel(i)| × (|Nsub(i)| − |Nsub(i′)|) ] / [ |Nrel(i)| × (|N| − |Nsub(i′)|) + |Nrel(i′)| × (|Nsub(i′)| − |N|) ]
Query: “Who are the authors in Artificial Intelligence?”
degree(n, Pcn) to stop   Satisfied authors   AI authors
70                                  2885          151
30                                 17121          579
11                                 78868         1142
4                                 277417         1704
1                                 575447         2225
0                                 615124         2355
[Figures: unifying search and reasoning with multi-level completeness and anytime behavior; comparison of predicted and actual completeness values]
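The anytime behaviour of degree-ordered processing can be simulated. The network below is entirely synthetic (heavy-tailed degrees, relevance correlated with degree); only the degree thresholds are borrowed from the table above.

```python
import random

random.seed(0)

# synthetic network: 10,000 authors with heavy-tailed coauthor degrees;
# "AI authors" (the relevant answers) are more likely among high-degree hubs
authors = []
for _ in range(10_000):
    degree = int(random.paretovariate(1.2))
    is_ai = random.random() < min(0.9, 0.01 * degree)
    authors.append((degree, is_ai))

total_ai = sum(a for _, a in authors)

def anytime_query(min_degree):
    """Process only nodes with degree >= min_degree (pivotal nodes first);
    return (#nodes visited, completeness of the answer set so far)."""
    visited = [(d, a) for d, a in authors if d >= min_degree]
    found = sum(a for _, a in visited)
    return len(visited), found / total_ai

for threshold in (70, 30, 11, 4, 1):
    n, comp = anytime_query(threshold)
    print(f"degree >= {threshold:2d}: visited {n:5d}, completeness {comp:.2%}")
```

Because relevant answers concentrate on hubs, completeness grows much faster than the fraction of nodes visited, which is why stopping early at a high degree threshold can already be useful.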
54. A Case Study on Multi-level Specificity Strategy
Answers to “Who are the authors in Artificial Intelligence?” in multiple levels of specificity, according to the hierarchical ontology of Artificial Intelligence:
Specificity   Relevant Keywords          Number of Authors
Level 1       Artificial Intelligence                 2355
Level 2       Agents                                  9157
              Automated Reasoning                      222
              Cognition                              19775
              Constraints                             8744
              Games                                   3817
              Knowledge Representation                1537
              Natural Language                        2939
              Robot                                  16425
              …                                          …
Level 3       Case-Based Reasoning                    1133
              Cognitive Modeling                        76
              Decision Trees                          1112
              Search                                 32079
              Translation                             4414
              Web Intelligence                         122
              …                                          …
A comparative study on the answers in different levels of specificity:
Specificity   Number of authors   Completeness
Level 1                    2355          0.85%
Level 1,2                207468         75.11%
Level 1,2,3              276205           100%
55. The Multi-perspective Strategy
Multiple representations of knowledge [Minsky2006]
User needs may differ from each other <--> users expect answers from different perspectives.
[Figure: normalized degree distribution of predicates in the SwetoDBLP dataset]
56. The Multi-perspective Strategy
Under different perspectives, the distribution characteristics are different!
Fig. 2. Coauthor number distribution in the SwetoDBLP dataset.
Fig. 3. Log-log diagram of Figure 2.
Fig. 4. A zoomed-in version of Figure 2.
Fig. 5. A zoomed-in version of the coauthor distribution for “Artificial Intelligence”.
Fig. 6. Publication number distribution in the SwetoDBLP dataset.
Fig. 7. Log-log diagram of Figure 6.
57. Comparison of Results
from Different Perspectives
A partial result of the multi-level specificity reasoning task: the list of authors in “Artificial Intelligence” at level 1, from two perspectives.
Publication number perspective Coauthor number perspective
Thomas S. Huang (387) Carl Kesselman (312)
John Mylopoulos (261) Thomas S. Huang (271)
Hsinchun Chen (260) Edward A. Fox (269)
Henri Prade (252) Lei Wang (250)
Didier Dubois (241) John Mylopoulos (245)
Thomas Eiter (219) Ewa Deelman (237)
... ...
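The two columns are one answer set sorted under different keys. A minimal sketch (publication and coauthor counts partly taken from the table, partly hypothetical, as marked):

```python
# the same answer set ranked under two perspectives
authors = [
    ("Thomas S. Huang", 387, 271),   # (name, publications, coauthors)
    ("John Mylopoulos", 261, 245),
    ("Hsinchun Chen", 260, 180),     # coauthor count is hypothetical
    ("Carl Kesselman", 150, 312),    # publication count is hypothetical
]

def top_k(records, perspective, k=3):
    """Rank one answer set under a chosen perspective."""
    key = {"publications": lambda r: r[1], "coauthors": lambda r: r[2]}
    ranked = sorted(records, key=key[perspective], reverse=True)
    return [name for name, _, _ in ranked][:k]

print(top_k(authors, "publications"))
print(top_k(authors, "coauthors"))
```

No single ordering serves every user: the perspective is a query parameter, not a property of the data.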
58. Summarizing
The Semantic Web is rapidly becoming real
Scale is becoming a real problem
Different ways of scaling up:
− parallelization
− exploiting cognitive heuristics
Stopping rules, cognitive memory retention, etc.
− data-selection for incomplete reasoning.
− new forms of reasoning
60. Acknowledgement
Slides for this talk are mainly from 3 previous talks:
Frank van Harmelen. Large Scale Reasoning on the Semantic Web, or: When Success is Becoming a Problem. Invited talk at the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Yi Zeng. Unifying Web-scale Search and Reasoning from the Viewpoint of Granularity. The 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar, University of Amsterdam, 2008.
61. Contact Info
Want to play with LarKC?
Want to contribute plugins?
Want to deploy LarKC?
Frank.van.Harmelen@cs.vu.nl
http://www.larkc.eu
Asia:
Ning Zhong: zhong@maebashi-it.ac.jp
Yi Zeng: yzeng@emails.bjut.edu.cn
@ WIC
62. References
[Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperSanFrancisco (1999)
[Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2) (2007) 94-96
[Michalski1986] Michalski, R.S., Winston, P.H.: Variable precision logic. Artificial Intelligence 29(2) (1986) 121-146
[Minsky2006] Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster (2006)
[Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and explanations of the basic-level advantage. Journal of Experimental Psychology: General 136(3) (2007) 451-469
[Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of Learning and Cognitive Processes. Lawrence Erlbaum Associates, Hillsdale, NJ (1976) 321-361
[Aleman-Meza2007] Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: SwetoDBLP ontology of computer science publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3) (2007) 151-155
[Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental Psychology. Teachers College, Columbia University (1913)