Large Knowledge Collider (LarKC) :
      A Platform for Web Scale Reasoning

 Ning Zhong (1,3), Frank van Harmelen (2), Yi Zeng (3), Zhisheng Huang (2)

 (1) Maebashi Institute of Technology, Japan
 (2) Vrije Universiteit Amsterdam, the Netherlands
 (3) International WIC Institute, Beijing University of Technology, China

                        http://www.larkc.eu




                                                                       1
The World is Creating
                                the Linked Data Every Day!




        Late breaking news:
        Google Video now also annotated with RDF-a
        (using vocabularies from Yahoo and Facebook)




                                                         2
four million documents per day
                                            3
4
http://www.zemanta.com/




                          5
toxic releases       consumer expenditure
recent earthquakes   consumer price index
crime statistics     tornado reports
assaults on police   trade statistics
social benefits      river elevations
unemployment rates   energy consumption
                                             6
Things to do with data.gov




                             7
8
9
<rdf:RDF>
 <rdf:Description rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.rdf
  <rdfs:label>Description of the artist Yeah Yeah Yeahs</rdfs:label>
  <foaf:primaryTopic rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4b
 </rdf:Description>
  <mo:MusicArtist rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#a
  <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
  <foaf:name>Yeah Yeah Yeahs</foaf:name>
  <ov:sortLabel>Yeah Yeah Yeahs</ov:sortLabel>
  <bio:event>
    <bio:Birth><bio:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime
  </bio:event>
  <owl:sameAs rdf:resource="http://dbpedia.org/resource/Yeah_Yeah_Yeahs"/>
<mo:image rdf:resource="/music/images/artists/7col_in/584c04d2-4acc-491b-8a0a-e63
<foaf:page rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.html"/
<mo:musicbrainz rdf:resource="http://musicbrainz.org/artist/584c04d2-4acc-491b-8a0a
<foaf:homepage rdf:resource="http://www.yeahyeahyeahs.com/"/>
<mo:wikipedia rdf:resource="http://en.wikipedia.org/wiki/Yeah_Yeah_Yeahs"/>
<mo:myspace rdf:resource="http://www.myspace.com/yeahyeahyeahs"/>
<mo:member rdf:resource="/music/artists/a1439b8d-672a-446f-a7ff-6f09d68254b3#art
<mo:member rdf:resource="/music/artists/14d44067-99c2-4f77-b58b-138f0b6911fa#ar
<mo:member rdf:resource="/music/artists/20dc35ec-6cc1-4c66-98a3-4a6116cb3869#a
...                                                                          10
<foaf:made>
  <mo:Record>
   <dc:title>It's Blitz!</dc:title>
   <mo:musicbrainz rdf:resource="http://musicbrainz.org/release/9c4177fe-bdce-4f9d-ab
   <rev:hasReview rdf:resource="/music/reviews/hnp2#review"/>
  </mo:Record>
</foaf:made>
.....
<mo:MusicArtist rdf:about="/music/artists/a1439b8d-672a-446f-a7ff-6f09d68254b3#arti
  <foaf:name>Brian Chase</foaf:name>
</mo:MusicArtist>

<mo:MusicArtist rdf:about="/music/artists/14d44067-99c2-4f77-b58b-138f0b6911fa#art
 <foaf:name>Karen O</foaf:name>
</mo:MusicArtist>

<mo:MusicArtist rdf:about="/music/artists/20dc35ec-6cc1-4c66-98a3-4a6116cb3869#art
  <foaf:name>Nick Zinner</foaf:name>
</mo:MusicArtist>
</rdf:RDF>


                                                                              11
AND much more…




                 12
What to do for the success of Web-scale
        Semantic Data Processing?

Refining Search by Reasoning              Refining Reasoning by Search
[Berners-Lee 1999]                        [Fensel & van Harmelen 2007]

     Unifying Search and Reasoning (ReaSearch) [Fensel & van Harmelen 2007]




                                                                         13
The LarKC Consortium
 13 partner institutions (from 11 countries, 2 from Asia)




                                                            14
The Large Knowledge Collider

          a platform for infinitely scalable reasoning
             on the data-web
“a configurable platform for
infinitely scalable semantic web reasoning”
                            “pipeline” suggests
                              linear structure:




                              but in LarKC also:




                                                   16
What to do about
the problem of success:

         parallelization




                      17
Supermarket!




Takes seconds

                18
Supermarket!




Takes a couple of minutes

                            19
Supermarket!




Get a better register



                        20
Massive Data
(even Web Scale
     Data!)




       Ooops!




                  21
From Linked Data Website
More than 7x10^8 triples




                     22
Parallelization

                         I am with Web-scale
                         data : 7x10^8 triples




Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------                             23
Total : 340
Data dependencies

   two for the price of one?
   2nd for half price?

Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
                                 24
Split Responsibility

   two for the price of one?
   2nd for half price?

Categories: Fruit, Vegetables, Household, Packaged, Rest

Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
                                                 25
Load Balancing

   two for the price of one?
   2nd for half price?

Categories: Fruit, Vegetables, Household, Packaged, Rest

Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total : 340
                                                26
Data dependencies

   With a box of detergent and a box of cereal get a free pen!

   For RDF data, any triple can refer to any URI.

Categories: Fruit, Vegetables, Household, Packaged, Rest

Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
                                                               27
Total : 340
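
The data-dependency problem for RDF can be made concrete with a small sketch. It assumes triples are (subject, predicate, object) tuples and that each URI is assigned to a "register" (partition) by a CRC hash. Both the hashing scheme and the example URIs are illustrative assumptions, not LarKC's actual partitioning. The function counts triples whose object lives in a different partition than their subject.

```python
import zlib

def partition_of(uri, n_parts):
    # Stable hash; the built-in hash() is salted per process for strings.
    return zlib.crc32(uri.encode("utf-8")) % n_parts

def cross_partition_refs(triples, n_parts):
    """Count triples whose subject and object fall in different partitions."""
    return sum(
        1
        for s, _p, o in triples
        if partition_of(s, n_parts) != partition_of(o, n_parts)
    )

# Hypothetical example data (URIs invented for illustration).
triples = [
    ("ex:band", "mo:member", "ex:drummer"),
    ("ex:band", "mo:member", "ex:singer"),
    ("ex:band", "owl:sameAs", "dbpedia:band"),
    ("ex:singer", "foaf:knows", "ex:drummer"),
]

for n in (1, 2, 4):
    print(n, "partitions:", cross_partition_refs(triples, n), "cross references")
```

With a single partition there are no cross references by construction. With more partitions the count depends on where the hash places each URI, and every such triple needs communication between registers.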
Towards Parallelization and Distribution



   Different parallel computing models:
   −   Peer-to-peer (MaRVIN)
   −   Map-Reduce (Reasoning-Hadoop)




                                           28
The MaRVIN Way!

[Figure: input data flows through a network of compute nodes to output data.]
Credits: Eyal Oren, Spyros Kotoulas
                                                    29
                  Divide-Conquer-Swap
MARVIN
        (Massive RDF Versatile Inference Network)
… is:
 −   a distributed technique for computing RDFS/OWL closure


… scales by:
 −   distributing computation over many nodes
 −   approximate (sound but incomplete) reasoning
 −   anytime convergence (more complete over time)


… runs on:
 −   in principle: any grid, using Ibis middleware
 −   the DAS-3 distributed supercomputer (300 nodes)
                                                              30
Divide-Conquer-Swap




                      SPLIT
                      COMPUTE
                      JOIN
                      (Repeat)
                        31
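
The split / compute / join / repeat loop can be sketched as a single-process simulation. This is not MaRVIN's implementation: the two RDFS rules, the random re-partitioning that stands in for the swap step, and all function names are assumptions made for illustration. Each round derives only what a partition can see locally, so the result is sound but may be incomplete when stopped early.

```python
import random

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def local_step(triples):
    """Apply two RDFS rules inside one partition; return newly derived triples."""
    sub = [(s, o) for (s, p, o) in triples if p == SUBCLASS]
    derived = set()
    for a, b in sub:                      # subClassOf transitivity
        for c, d in sub:
            if b == c:
                derived.add((a, SUBCLASS, d))
    for s, p, o in triples:               # type inheritance
        if p == TYPE:
            for c, d in sub:
                if o == c:
                    derived.add((s, TYPE, d))
    return derived - set(triples)

def divide_conquer_swap(triples, nodes=4, rounds=50, seed=0):
    rng = random.Random(seed)
    known = set(triples)
    for _ in range(rounds):
        # SPLIT: scatter the current triples over the nodes at random.
        parts = [set() for _ in range(nodes)]
        for t in known:
            parts[rng.randrange(nodes)].add(t)
        # COMPUTE: each node reasons over its own partition only.
        new = set()
        for part in parts:
            new |= local_step(part)
        # JOIN: merge the derivations; the next round re-splits (the "swap").
        known |= new
    return known

data = {("A", SUBCLASS, "B"), ("B", SUBCLASS, "C"), ("x", TYPE, "A")}
full = data | {("A", SUBCLASS, "C"), ("x", TYPE, "B"), ("x", TYPE, "C")}
print(len(divide_conquer_swap(data, nodes=4)), "of", len(full), "closure triples")
```

With one node the loop degenerates to ordinary fixpoint computation. With several nodes the number of rounds trades completeness against time, which matches the anytime behaviour described above.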
Current performance

200 Million triples in 7.2 minutes on 64 nodes.




                                                  32
Reasoning-Hadoop!

RDFS/OWL reasoning with the MapReduce framework.




                                                   33
The MapReduce
             Distributed Programming Model
  Initially designed and developed by Google in 2004 for large data
  processing [Dean & Ghemawat 2004].
  The computation is expressed with two functions: map and reduce.
Map-Reduce on 64 machines:

 Peak inference rates at 8M triples/sec
 Sustained inference rates at 4M triples/sec

[Figure: Map-Reduce example. Input triples: A p C, A q B, D r D, E r D, F r C.
 Map emits keyed records such as <C,_,_>, <A,_,_>, <F,_,_>.
 Reduce outputs: C 2, p 1, r 3, q 1, D 3, F 1.]
                                                                 Jacopo Urbani
                                                                             34
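
The reduce outputs in the figure (C 2, p 1, r 3, q 1, D 3, F 1) are consistent with counting how often each term occurs in the five input triples. Assuming that reading, the map and reduce functions can be sketched in plain Python. The in-memory `shuffle` stands in for the framework's grouping step and all names are illustrative. This is not the Reasoning-Hadoop code.

```python
from collections import defaultdict

def map_phase(triple):
    """Emit (term, 1) for every term of a triple."""
    for term in triple:
        yield term, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one term."""
    return key, sum(values)

def term_counts(triples):
    pairs = (pair for t in triples for pair in map_phase(t))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

triples = [
    ("A", "p", "C"),
    ("A", "q", "B"),
    ("D", "r", "D"),
    ("E", "r", "D"),
    ("F", "r", "C"),
]
counts = term_counts(triples)
print(counts)
```

Because each map call sees one triple and each reduce call sees one key, both phases can be spread over many machines without coordination.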
What to do about
the problem of success:

    cognitive heuristics




                       35
Stopping Rules
On very large datasets,
incompleteness is the rule
Must stop before we are finished
When to stop?
Stopping rules are important
−   determine length of computation
    (don’t stop too late)
−   quality of result
    (don’t stop too early)
Take inspiration from
     economics, biology, psychology

                                                  Lael Schooler

Humans have good heuristics for when to stop
 problem solving:

“Name capital cities in Europe”:
 London, Paris, Berlin, Rome, Amsterdam, …
 Milan, Madrid, …, …, Paris, …

Cues: time between solutions; wrong answers (e.g. Milan); repetitions (e.g. Paris)
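
One of these cues, repetitions, can be turned into a stopping rule. The sketch below is a minimal illustration under stated assumptions: the function name and the `max_stale` threshold are invented here, and the rule stops after a given number of consecutive repeated answers. It is not a model taken from the talk.

```python
def collect_until_stale(candidates, max_stale=2):
    """Collect answers until max_stale consecutive repetitions are seen."""
    seen = []
    stale = 0
    for answer in candidates:
        if answer in seen:
            stale += 1
            if stale >= max_stale:
                break          # stop: nothing new is coming in
        else:
            seen.append(answer)
            stale = 0          # a new answer resets the counter
    return seen

stream = ["London", "Paris", "Berlin", "Paris", "Rome", "Paris", "London", "Madrid"]
print(collect_until_stale(stream))
```

A small threshold stops early and may miss late answers such as Madrid. A large one keeps computing for little gain, which is the trade-off between stopping too early and too late.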
When to switch between tasks?

                                  Lael Schooler
[Figure: hard task & easy task → combined task]

Humans (& animals) are very
 good at finding this optimum
What to do about
the problem of success:

         data selection




                      39
Take data-selection seriously

Where do the axioms come from?
• Which subset to use?
• Relevance measures                    Zhisheng Huang

  • Example: syntactic relevance:
   • δ(α,β)=1 if α,β share a concept symbol
   • δ(α,β)=k if δ(α,γ)=k-1 and
                   β,γ share a concept symbol
  • very simple measure,
     very syntactically unstable, but:

    Gives a high quality sound approximation
    (> 90% recall, 100% precision for small k)
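
The recursive definition of δ amounts to a breadth-first search over axioms that share a concept symbol. The sketch below assumes each axiom is given as a set of concept symbols and takes the smallest k when several apply. The axiom names and symbols are made up for illustration.

```python
from collections import deque

def relevance_levels(start, axioms):
    """Return {axiom: k} with k the syntactic relevance distance from start.

    axioms maps an axiom name to the set of concept symbols it mentions.
    """
    levels = {}
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        current, k = frontier.popleft()
        for name, symbols in axioms.items():
            # delta = k + 1 if this axiom shares a symbol with one at distance k
            if name not in visited and symbols & axioms[current]:
                visited.add(name)
                levels[name] = k + 1
                frontier.append((name, k + 1))
    return levels

def select(start, axioms, k):
    """Pick the subset of axioms within relevance distance k of start."""
    return {name for name, d in relevance_levels(start, axioms).items() if d <= k}

axioms = {
    "a": {"Bird", "Animal"},
    "b": {"Bird", "Flies"},
    "c": {"Flies", "Wing"},
    "d": {"Car"},
}
print(relevance_levels("a", axioms))
```

Reasoning then starts with `select(query, axioms, 1)` and widens k only if the answer is not yet found, which keeps the axiom set small.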
Take identifiers seriously
exploit the grounding of logical symbols
  in natural language
• Google distance as relevance measure
                                                                      Zhisheng Huang
  $$\mathrm{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}$$
 = symmetric conditional probability
     of co-occurrence
 = estimate of semantic distance


Gives almost perfect “forgetting function”
for matching class definitions in 2 vocabularies
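
The NGD formula translates directly into code. In the standard definition of normalized Google distance, f(x) is the number of pages containing x, f(x, y) the number containing both terms, and M the total number of indexed pages. The counts used below are invented for illustration, and returning infinity for terms that never co-occur is a choice made for this sketch.

```python
import math

def ngd(fx, fy, fxy, M):
    """Normalized Google distance from page counts fx, fy, fxy and index size M."""
    if fxy == 0:
        return math.inf  # never co-occur: treat as maximally distant
    lx, ly = math.log(fx), math.log(fy)
    return (max(lx, ly) - math.log(fxy)) / (math.log(M) - min(lx, ly))

# Hypothetical counts: identical usage gives distance 0.
print(ngd(1000, 1000, 1000, 10**9))
print(ngd(1000, 100, 10, 10**6))
```

Symbols whose distance to the query terms exceeds a threshold can then be "forgotten" before reasoning starts.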
Unifying Search and Reasoning from the
                    Viewpoint of Granularity
                       Barriers for Web-scale Problem Solving

(1) most relevant data vs search results space [Berners-Lee 1999].
(2) Traditional reasoning systems vs Web-scale data vs rational time [Fensel 2007].


    Refining Search by Reasoning                  Refining Reasoning by Search
    [Berners-Lee 1999]                            [Fensel & van Harmelen 2007]

         Unifying Search and Reasoning (ReaSearch) [Fensel & van Harmelen 2007]


                                  Granularity


         Human Problem Solving                  Web Problem Solving
                                    Inspire!

               Basic level advantage, Cognitive Memory Retention
               Multi-level, multi-perspective, Variable Precision
                                                                                      42
Concrete Strategies



•    The Starting Point.
•    Multi-level Completeness.
•    Multi-level Specificity.
•    Multi-perspective.




                                  43

The Starting Point Strategy




[Collins 1969] Collins, A.M. and Quillian, M.R. Retrieval time from
semantic memory. Journal of Verbal Learning and Verbal
Behaviour, 8, 240-247.

                                                                      44
(I) The Starting Point Strategy
The “basic level advantage” [Rogers2007].
Concepts at the basic level are used more frequently than other terms
[Wisniewski1989].

• (Frequency) Total Interest :

   $$TI(i) = \sum_{j=1}^{n} m(i,j)$$

Going a step beyond the “familiar terms” of the basic level, “interest retention” considers
frequency and recency at the same time.

Interest retention models <--> Cognitive memory retention models
[Anderson, Schooler 1991].
• (Frequency and Recency) Exponential Model for Interest Retention :

   $$EIR(i) = \sum_{j=1}^{n} m(i,j) \times A e^{-b T_i}$$

• (Frequency and Recency) Power Model for Interest Retention :

   $$PIR(i) = \sum_{j=1}^{n} m(i,j) \times A T_i^{-b}$$
                                                                                45
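
The three interest measures can be compared on a toy example. The sketch assumes that m(i, j) is the number of an author's papers on interest i in time interval j and that T is the time elapsed since that interval. The slides do not define these symbols, the sketch uses a per-interval elapsed time where the slide writes T_i, and the values of A and b are placeholders.

```python
import math

def total_interest(counts):
    """TI: plain frequency, summed over all intervals."""
    return sum(counts)

def exponential_retention(counts, ages, A=1.0, b=0.5):
    """EIR: each interval's count decays exponentially with its age."""
    return sum(m * A * math.exp(-b * t) for m, t in zip(counts, ages))

def power_retention(counts, ages, A=1.0, b=0.5):
    """PIR: each interval's count decays as a power of its age (ages > 0)."""
    return sum(m * A * t ** (-b) for m, t in zip(counts, ages))

ages = [3, 2, 1]          # years elapsed, oldest interval first
old_topic = [5, 0, 0]     # five papers three years ago
new_topic = [0, 0, 5]     # five papers last year

for name, counts in (("old", old_topic), ("new", new_topic)):
    print(name, total_interest(counts),
          round(exponential_retention(counts, ages), 3),
          round(power_retention(counts, ages), 3))
```

Both topics have the same total interest, but the retention models rank the recent one higher. This is why a frequent term need not have high interest retention.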
Interest Retention and Interest Prediction


[Figures]
• A comparative study of TI during 1990-2008 and IR in 2009
• Difference on the contribution values from papers published in different years
• A comparative study on the prediction and real publication numbers by the power law model
• A comparative study on the prediction and real publication numbers by the exponential law model




                                                                                     46
Evaluations and the Released Dataset

•   interest retentions vs future interests.
    publication >= 100
    top 9 interests
    2000 to 2007
    1226 persons
    49.54% predict 3 out of 9 interests.

•   615,124 computer scientists in the SwetoDBLP dataset.
•   http://wiki.larkc.eu/csri-rdf




                                                            47
DBLP-SSE : DBLP Search Support Engine

Recent interests are extracted using the power law interest retention model.
Terms with high frequency do not necessarily have high interest retention. (e.g.
“Knowledge”)




                                                                                   48
DBLP-SSE : DBLP Search Support Engine
          Log in      Dieter Fensel

          Top 9       Web, Service, Semantic, Architecture, Model, Ontology,
          interests   Knowledge, Computing, Language
          Query :     Artificial Intelligence

          List 1 :    without current interests constraints (Top 5 results)

                      * PROLOG Programming for Artificial Intelligence, Second Edition.
                      * Artificial Intelligence Architectures for Composition and Performance
                      Environment.
                      * Artificial Intelligence in Music Education: A Critical Review.
                      * Music, Intelligence and Artificiality. Artificial Intelligence and Music
                      Education.
                      * Musical Knowledge: What can Artificial Intelligence Bring to the
                      Musician?
                      * ...

          List 2 :    with current interests constraints (Top 5 results)


                      * Web Intelligence and Artificial Intelligence in Education.
                      * Artificial Intelligence Exchange and Service Tie to All Test
                      Environments (AI-ESTATE)-A New Standard for System Diagnostics.
                      * Semantic Model for Artificial Intelligence Based on Molecular
                      Computing.
                      * Open Information Systems Semantics for Distributed Artificial
                      Intelligence.
                      * Artificial Intelligence and Financial Services.
                      *…
                                                                                         49
Multi-level Completeness Strategy



Low completeness                 Limited Time


High completeness                More time Available




One practical question :

How to choose the nodes to be reasoned over?




                                                       50
Choosing the pivotal nodes
   in the network first !




                                                          51
                             Another question: if I stop here, what
                             is the completeness so far?
Multi-level Completeness Strategy

     Nodes are grouped together by Node degrees under a perspective.

Completeness Prediction Function :

   $$PC(i) = \frac{|N_{rel}(i)| \times (|N_{sub}(i)| - |N_{sub}(i')|)}{|N_{rel}(i)| \times (|N| - |N_{sub}(i')|) + |N_{rel}(i')| \times (|N_{sub}(i')| - |N|)}$$

Query: “Who are authors in Artificial Intelligence?”

degree(n, Pcn) to stop   Satisfied authors    AI authors
         70              2885                 151
         30              17121                579
         11              78868                1142
          4              277417               1704
          1              575447               2225
          0              615124               2355

[Figure: Unifying search and reasoning with multilevel completeness and anytime behavior.]
[Figure: Comparison of predicted and actual completeness value.]
                                                                                                                  52
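
The actual completeness reached at each stopping degree can be read off the table. The sketch assumes completeness is the number of AI authors found so far divided by the number found when every node is processed (the degree 0 row). The slides do not spell out this definition here, and the function name is illustrative.

```python
# (degree to stop at, satisfied authors, AI authors) as listed in the table
rows = [
    (70, 2885, 151),
    (30, 17121, 579),
    (11, 78868, 1142),
    (4, 277417, 1704),
    (1, 575447, 2225),
    (0, 615124, 2355),
]

def completeness_at(stop_degree, rows):
    """Fraction of all AI authors found when stopping at stop_degree."""
    total_ai = rows[-1][2]  # the degree 0 row covers every node
    for degree, _satisfied, ai in rows:
        if degree == stop_degree:
            return ai / total_ai
    raise KeyError(stop_degree)

for degree, satisfied, _ai in rows:
    share = 100 * completeness_at(degree, rows)
    print(f"stop at degree {degree:>2}: {satisfied:>6} nodes, {share:.2f}% complete")
```

Processing only the highest-degree nodes already yields a usable partial answer, and completeness grows as more time allows lower-degree nodes to be included.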
Multi-level Specificity Strategy



  general          Limited Time




  Specific         More time Available




                                         53
A Case Study on Multi-level Specificity Strategy
Answers to “Who are the authors in Artificial Intelligence?” in multiple levels of specificity
according to the hierarchical ontology of Artificial Intelligence.

Specificity    Relevant Keywords           Number of Authors
Level 1        Artificial Intelligence     2355
Level 2        Agents                      9157
               Automated Reasoning         222
               Cognition                   19775
               Constraints                 8744
               Games                       3817
               Knowledge Representation    1537
               Natural Language            2939
               Robot                       16425
               …                           …
Level 3        Case-Based Reasoning        1133
               Cognitive Modeling          76
               Decision Trees              1112
               Search                      32079
               Translation                 4414
               Web Intelligence            122
               …                           …

A comparative study on the answers in different levels of specificity.

Specificity    Number of authors    Completeness
Level 1        2355                 0.85%
Level 1,2      207468               75.11%
Level 1,2,3    276205               100%
                                                                                          54
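
The completeness column of the comparative study follows from the author counts: each value is the cumulative number of authors divided by the number found when all three levels are used. A short check, with variable names chosen for illustration:

```python
# (levels included, cumulative number of authors) as listed in the table
levels = [
    ("Level 1", 2355),
    ("Level 1,2", 207468),
    ("Level 1,2,3", 276205),
]

total = levels[-1][1]  # all levels together define 100%
report = [(name, n, round(100 * n / total, 2)) for name, n in levels]

for name, n, pct in report:
    print(f"{name:<12} {n:>7} {pct:>6}%")
```

The jump between the first two rows shows how much of the answer sits in the more specific sub-areas rather than under the general keyword.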
The Multi-perspective Strategy

       Multiple representations of knowledge [Minsky2006]
       User needs differ from each other,
       so users expect answers from different perspectives.




Normalized Degree Distribution of predicates in SwetoDBLP dataset
                                                                    55
The Multi-perspective Strategy
           Under different perspectives, the distribution characteristics are different!




Fig. 2. Coauthor number distribution in the SwetoDBLP dataset.
Fig. 3. log-log diagram of Figure 2.
Fig. 4. A zoomed in version of Figure 2.
Fig. 5. A zoomed in version of coauthor distribution for "Artificial Intelligence".
Fig. 6. Publication number distribution in the SwetoDBLP dataset.
Fig. 7. log-log diagram of Figure 6.

                                                                                                                56
Comparison of Results
                from Different Perspectives


  A partial result of the multilevel specificity reasoning task "The list of authors
  in Artificial Intelligence" at level 1 from two perspectives.
      Publication number perspective              Coauthor number perspective
Thomas S. Huang (387)                      Carl Kesselman (312)
John Mylopoulos (261)                      Thomas S. Huang (271)
Hsinchun Chen (260)                        Edward A. Fox (269)
Henri Prade (252)                          Lei Wang (250)
Didier Dubois (241)                        John Mylopoulos (245)
Thomas Eiter (219)                         Ewa Deelman (237)
...                                        ...




                                                                                      57
Summarizing

The Semantic Web is rapidly becoming real
Scale is becoming a real problem
Different ways of scaling up:
 −   parallelization
 −   exploiting cognitive heuristics
         Stopping rules, cognitive memory retention, etc.
 −   data-selection for incomplete reasoning.
 −   New Forms of Reasoning.
LarKC Chinese Forum




                      59
Acknowledgement
The slides for this talk are drawn mainly from 3 previous talks :


   Frank van Harmelen. Large Scale Reasoning on the Semantic Web or:
   When success is becoming a problem. Invited talk at the 2009 International
   Joint Conferences on Active Media Technology and Brain Informatics.
   Yi Zeng. Unifying Web-scale Search and Reasoning from the viewpoint of
   Granularity. The 2009 International Joint Conferences on Active Media
   Technology and Brain Informatics.
   Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar,
   University of Amsterdam, 2008.




                                                                                60
Contact Info
  Want to play with LarKC?
  Want to contribute plugins?
  Want to deploy LarKC?

Frank.van.Harmelen@cs.vu.nl
    http://www.larkc.eu
                          Asia:
           Ning Zhong: zhong@maebashi-it.ac.jp
           Yi Zeng: yzeng@emails.bjut.edu.cn
                          @ WIC
                                             61
References
[Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original
Design and Ultimate Destiny of the World Wide Web by Its Inventor.
HarperSanFrancisco (1999)
[Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web
scale. IEEE Internet Computing 11(2) (2007) 94-96
[Michalski1986] Michalski, R.S. and Winston, P.H. Variable precision logic. Artificial
Intelligence, 29(2), 121–146, 1986.
[Minsky2006] Minsky, M. The Emotion Machine : commonsense thinking, artificial
intelligence, and the future of the human mind. Simon & Schuster, 2006.
[Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and
explanations of the basic-level advantage. Journal of Experimental Psychology:
General 136(3) (2007) 451-469
[Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of
learning and cognitive processes. Hillsdale, NJ: Lawrence Erlbaum Associates
(1976) 321-361
[Aleman-Meza2007] Aleman-Meza, B. Hakimpour, F., Arpinar, I., Sheth, A.:
Swetodblp ontology of computer science publications. Web Semantics: Science,
Services and Agents on the World Wide Web 5(3) (2007) 151-155
[Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental
Psychology Hermann Ebbinghaus. Teachers College, Columbia University (1913)
                                                                                  62
Thank you!




             63

Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Large Knowledge Collider (LarKC) : A Platform for Web Scale Reasoning

  • 1. Large Knowledge Collider (LarKC): A Platform for Web Scale Reasoning. Ning Zhong (1,3), Frank van Harmelen (2), Yi Zeng (3), Zhisheng Huang (2). (1) Maebashi Institute of Technology, Japan; (2) Vrije University Amsterdam, the Netherlands; (3) International WIC Institute, Beijing University of Technology, China. http://www.larkc.eu
  • 2. The World is Creating the Linked Data Every Day! Late breaking news: Google Video now also annotated with RDF-a (using vocabularies from Yahoo and Facebook)
  • 3. four million documents per day
  • 6. toxic releases, recent earthquakes, crime statistics, assaults on police, social benefits, unemployment rates, consumer expenditure, consumer price index, tornado reports, trade statistics, river elevations, energy consumption
  • 7. Things to do with data.gov
  • 10. RDF description of the artist Yeah Yeah Yeahs (long lines are truncated in the slide):
      <rdf:RDF>
       <rdf:Description rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.rdf
        <rdfs:label>Description of the artist Yeah Yeah Yeahs</rdfs:label>
        <foaf:primaryTopic rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4b
       </rdf:Description>
       <mo:MusicArtist rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#a
        <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
        <foaf:name>Yeah Yeah Yeahs</foaf:name>
        <ov:sortLabel>Yeah Yeah Yeahs</ov:sortLabel>
        <bio:event>
         <bio:Birth><bio:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime
        </bio:event>
        <owl:sameAs rdf:resource="http://dbpedia.org/resource/Yeah_Yeah_Yeahs"/>
        <mo:image rdf:resource="/music/images/artists/7col_in/584c04d2-4acc-491b-8a0a-e63
        <foaf:page rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.html"/
        <mo:musicbrainz rdf:resource="http://musicbrainz.org/artist/584c04d2-4acc-491b-8a0a
        <foaf:homepage rdf:resource="http://www.yeahyeahyeahs.com/"/>
        <mo:wikipedia rdf:resource="http://en.wikipedia.org/wiki/Yeah_Yeah_Yeahs"/>
        <mo:myspace rdf:resource="http://www.myspace.com/yeahyeahyeahs"/>
        <mo:member rdf:resource="/music/artists/a1439b8d-672a-446f-a7ff-6f09d68254b3#art
        <mo:member rdf:resource="/music/artists/14d44067-99c2-4f77-b58b-138f0b6911fa#ar
        <mo:member rdf:resource="/music/artists/20dc35ec-6cc1-4c66-98a3-4a6116cb3869#a
        ...
  • 11. RDF description, continued:
        <foaf:made>
         <mo:Record>
          <dc:title>It's Blitz!</dc:title>
          <mo:musicbrainz rdf:resource="http://musicbrainz.org/release/9c4177fe-bdce-4f9d-ab
          <rev:hasReview rdf:resource="/music/reviews/hnp2#review"/>
         </mo:Record>
        </foaf:made>
        .....
       <mo:MusicArtist rdf:about="/music/artists/a1439b8d-672a-446f-a7ff-6f09d68254b3#arti
        <foaf:name>Brian Chase</foaf:name>
       </mo:MusicArtist>
       <mo:MusicArtist rdf:about="/music/artists/14d44067-99c2-4f77-b58b-138f0b6911fa#art
        <foaf:name>Karen O</foaf:name>
       </mo:MusicArtist>
       <mo:MusicArtist rdf:about="/music/artists/20dc35ec-6cc1-4c66-98a3-4a6116cb3869#art
        <foaf:name>Nick Zinner</foaf:name>
       </mo:MusicArtist>
      </rdf:RDF>
  • 13. What to do for the success of Web-scale Semantic Data Processing? Refining Search by Reasoning [Berners-Lee 1999]; Refining Reasoning by Search [Fensel & van Harmelen 2007]; Unifying Search and Reasoning (ReaSearch) [Fensel & van Harmelen 2007]
  • 14. The LarKC Consortium: 13 partner institutions (from 11 countries, 2 from Asia)
  • 15. The Large Knowledge Collider a platform for infinitely scalable reasoning on the data-web
  • 16. “a configurable platform for infinitely scalable semantic web reasoning”. “Pipeline” suggests a linear structure, but in LarKC also: [figure not preserved in the transcript]
  • 17. What to do about the problem of success: parallelization
  • 21. Massive Data (even Web Scale Data!) Ooops!
  • 22. From the Linked Data website: more than 7×10^8 triples
  • 23. Parallelization. “I am with Web-scale data: 7×10^8 triples.” Receipt: Cashier1: 53, Cashier2: 14, Cashier3: 33, Cashier4: 72, Cashier2: 34, Cashier3: 13, Cashier4: 32. Total: 340
  • 24. Data dependencies: two for the price of one? 2nd for half price? (Same receipt as on slide 23.)
  • 25. Split Responsibility: Fruit, Vegetables, Household, Packaged, Rest. Two for the price of one? 2nd for half price? (Same receipt as on slide 23.)
  • 26. Load Balancing: Fruit, Vegetables, Household, Packaged, Rest. Two for the price of one? 2nd for half price? (Same receipt as on slide 23.)
  • 27. Data dependencies: “With a box of detergent and a box of cereal get a free pen!” For RDF data, any triple can refer to any URI. Categories: Fruit, Vegetables, Household, Packaged, Rest. (Same receipt as on slide 23.)
  • 28. Towards Parallelization and Distribution. Different parallel computing models: − Peer-to-peer (MaRVIN) − Map-Reduce (Reasoning-Hadoop)
  • 29. The MaRVIN Way! Divide-Conquer-Swap (Eyal Oren, Spyros Kotoulas). [Figure: input data passes through a set of compute nodes to produce output data.]
  • 30. MARVIN (Massive RDF Versatile Inference Network) … is: − a distributed technique for computing RDFS/OWL closure … scales by: − distributing computation over many nodes − approximate (sound but incomplete) reasoning − anytime convergence (more complete over time) … runs on: − in principle: any grid, using Ibis middleware − the DAS-3 distributed supercomputer (300 nodes)
  • 31. Divide-Conquer-Swap: SPLIT, COMPUTE, JOIN, then repeat.
  • 32. Current performance: 200 million triples in 7.2 minutes on 64 nodes.
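The SPLIT / COMPUTE / JOIN loop of slides 29 to 31 can be illustrated with a small single-process sketch. This is not the MaRVIN implementation: the two RDFS rules, the random re-partitioning that stands in for peer-to-peer swapping, and all function names and parameters (`local_closure_step`, `divide_conquer_swap`, `n_nodes`, `rounds`) are assumptions made for illustration. It does show why the result is sound at every step but only becomes more complete over time.

```python
import random

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def local_closure_step(triples):
    """Apply two RDFS rules once to the triples held by a single node."""
    sub = {(s, o) for s, p, o in triples if p == SUBCLASS}
    derived = set()
    for s, p, o in triples:
        for a, b in sub:
            if p == SUBCLASS and o == a:   # subclass transitivity
                derived.add((s, SUBCLASS, b))
            if p == TYPE and o == a:       # type inheritance
                derived.add((s, TYPE, b))
    return derived - set(triples)

def divide_conquer_swap(triples, n_nodes=4, rounds=50, seed=0):
    """Hypothetical simplification: re-split randomly in every round."""
    rng = random.Random(seed)
    known = set(triples)
    for _ in range(rounds):
        # SPLIT: scatter all known triples over the nodes
        partitions = [set() for _ in range(n_nodes)]
        for t in sorted(known):
            partitions[rng.randrange(n_nodes)].add(t)
        # COMPUTE: each node reasons only over its own partition
        new = set()
        for part in partitions:
            new |= local_closure_step(part)
        # JOIN: merge the derived triples, then repeat
        known |= new
    return known

data = [("A", SUBCLASS, "B"), ("B", SUBCLASS, "C"),
        ("C", SUBCLASS, "D"), ("x", TYPE, "A")]
full = divide_conquer_swap(data, n_nodes=1, rounds=5)
partial = divide_conquer_swap(data, n_nodes=4, rounds=3)
```

Stopping the loop early gives a subset of the closure that contains no wrong triples, because two premises can only fire a rule when they happen to land on the same node.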
  • 33. Reasoning-Hadoop! RDFS/OWL reasoning with the MapReduce framework.
  • 34. The MapReduce Distributed Programming Model (Jacopo Urbani). Initially designed and developed by Google in 2004 for large data processing [Dean & Ghemawat 2004]. The computation is expressed with two functions: map and reduce. Map-Reduce on 64 machines: peak inference rates at 8M triples/sec, sustained inference rates at 4M triples/sec. [Figure: input triples are sent to Map tasks, which emit keyed tuples such as <A,_,_>, <C,_,_> and <F,_,_>; tuples with the same key are grouped and passed to Reduce tasks.]
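As a rough illustration of how one RDFS rule fits the map/reduce shape described on slide 34, the sketch below simulates a single job in plain Python. It is not the Reasoning-Hadoop code: the choice of rule (type inheritance through `rdfs:subClassOf`), the key design, and the names `map_triple`, `reduce_class` and `run_job` are assumptions for illustration only.

```python
from collections import defaultdict

def map_triple(triple):
    """Emit (key, value) pairs keyed on the class that joins both premises."""
    s, p, o = triple
    if p == "rdf:type":
        yield o, ("instance", s)
    elif p == "rdfs:subClassOf":
        yield s, ("superclass", o)

def reduce_class(key, values):
    """Join every instance of the key class with every direct superclass."""
    instances = [v for tag, v in values if tag == "instance"]
    supers = [v for tag, v in values if tag == "superclass"]
    for inst in instances:
        for sup in supers:
            yield (inst, "rdf:type", sup)

def run_job(triples):
    """Simulate the shuffle phase: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for t in triples:
        for key, value in map_triple(t):
            groups[key].append(value)
    derived = set()
    for key, values in groups.items():
        derived.update(reduce_class(key, values))
    return derived

derived = run_job([
    ("x", "rdf:type", "A"),
    ("A", "rdfs:subClassOf", "B"),
    ("y", "rdf:type", "C"),
])
```

A single job applies the rule once; reaching a fixpoint would require rerunning the job on the union of input and output until nothing new is derived.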
  • 35. What to do about the problem of success: cognitive heuristics
  • 36. Stopping Rules. On very large datasets, incompleteness is the rule: we must stop before we are finished. When to stop? Stopping rules are important: − they determine the length of computation (don’t stop too late) − they determine the quality of the result (don’t stop too early)
  • 37. Take inspiration from economics, biology, psychology (Lael Schooler). Humans have good heuristics for when to stop problem solving: time between solutions, wrong answers, repetitions. Example: “Name capital cities in Europe”: London, Paris, Berlin, Rome, Amsterdam, … Milan, Madrid, …, Paris, …
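Two of the cues on slide 37, repetitions and the gap between solutions, can be turned into a simple stopping rule. The sketch below is only one possible reading of the slide: the thresholds `max_repeats` and `max_gap`, the use of `None` to mark a step without an answer, and the function name are all hypothetical. The wrong-answer cue is left out because it needs a way to check answers.

```python
def collect_until_stop(answers, max_repeats=2, max_gap=3):
    """Consume a stream of answers and stop on repetitions or long gaps.

    `answers` yields an answer, or None for a step that produced nothing.
    """
    seen = []
    repeats = 0
    gap = 0
    for answer in answers:
        if answer is None:
            gap += 1          # another step without any solution
        elif answer in seen:
            repeats += 1      # the solver starts to repeat itself
            gap = 0
        else:
            seen.append(answer)
            gap = 0
        if repeats >= max_repeats or gap >= max_gap:
            break
    return seen

stream = ["London", "Paris", None, "Berlin", "Paris", "Rome", "London", "Madrid"]
capitals = collect_until_stop(stream)
stalled = collect_until_stop(["London", None, None, None, "Paris"])
```

In the first run the second repetition ("London") triggers the stop, so "Madrid" is never reached: the rule trades some completeness for a shorter computation.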
  • 38. When to switch between tasks? (Lael Schooler) [Figure: hard task, easy task and combined task.] Humans (& animals) are very good at finding this optimum.
  • 39. What to do about the problem of success: data selection
  • 40. Take data-selection seriously (Zhisheng Huang). Where do the axioms come from? • Which subset to use? • Relevance measures • Example: syntactic relevance: • δ(α,β)=1 if α,β share a concept symbol • δ(α,β)=k if δ(α,γ)=k-1 and β,γ share a concept symbol • very simple measure, very syntactically unstable, but: gives a high quality sound approximation (> 90% recall, 100% precision for small k)
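The recursive definition of syntactic relevance on slide 40 amounts to a breadth-first expansion over shared concept symbols. The sketch below assumes each axiom is represented simply as the set of concept symbols it mentions; that representation, the function name `relevance_levels` and the toy axioms are illustrative assumptions, not the authors' code.

```python
def relevance_levels(query_symbols, axioms, max_k):
    """Return {axiom name: k} for axioms within relevance distance max_k.

    `axioms` maps an axiom name to the set of concept symbols it uses.
    Level 1 axioms share a symbol with the query; level k axioms share
    a symbol with some axiom of level k-1.
    """
    levels = {}
    frontier = set(query_symbols)
    for k in range(1, max_k + 1):
        newly = [name for name, symbols in axioms.items()
                 if name not in levels and symbols & frontier]
        if not newly:
            break
        for name in newly:
            levels[name] = k
        # Only symbols of the newest level can reach still-unseen axioms
        frontier = set().union(*(axioms[name] for name in newly))
    return levels

toy_axioms = {
    "a1": {"Bird", "Animal"},
    "a2": {"Animal", "LivingThing"},
    "a3": {"Car", "Vehicle"},
}
levels = relevance_levels({"Bird"}, toy_axioms, max_k=3)
```

Reasoning over only the axioms with small k is what makes the approximation sound but possibly incomplete: nothing outside the ontology is added, some relevant axioms may simply be left out.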
  • 41. Take identifiers seriously (Zhisheng Huang): exploit the grounding of logical symbols in natural language • Google distance as relevance measure: $\mathrm{NGD}(x,y) = \dfrac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}$ = symmetric conditional probability of co-occurrence = estimate of semantic distance. Gives almost perfect “forgetting function” for matching class definitions in 2 vocabularies
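The NGD formula on slide 41 translates directly into code. In the sketch below, `fx`, `fy` and `fxy` stand for the hit counts f(x), f(y) and f(x,y), and `m` for M, the total number of indexed pages; the parameter names and the choice to return infinity when the two terms never co-occur are assumptions of this sketch.

```python
import math

def ngd(fx, fy, fxy, m):
    """Normalized Google Distance from hit counts.

    fx, fy: number of pages containing x and y respectively
    fxy:    number of pages containing both x and y
    m:      total number of pages indexed
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur (assumed convention)
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_x, log_y) - log_xy) / (math.log(m) - min(log_x, log_y))

same = ngd(1000, 1000, 1000, 10**6)   # terms that always co-occur
far = ngd(100, 100, 1, 10**4)         # terms that almost never co-occur
```

Terms that always appear together get distance 0, and the distance grows as co-occurrence becomes rarer relative to the individual frequencies.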
  • 42. Unifying Search and Reasoning from the Viewpoint of Granularity. Barriers for Web-scale Problem Solving: (1) most relevant data vs search results space [Berners-Lee 1999]; (2) traditional reasoning systems vs Web-scale data vs rational time [Fensel 2007]. Refining Search by Reasoning [Berners-Lee 1999]; Refining Reasoning by Search [Fensel & van Harmelen 2007]; Unifying Search and Reasoning (ReaSearch) [Fensel & van Harmelen 2007]. Granularity: Human Problem Solving inspires Web Problem Solving. Basic level advantage, Cognitive Memory Retention; Multi-level, multi-perspective, Variable Precision
  • 43. Concrete Strategies • The Starting Point. • Multi-level Completeness. • Multi-level Specificity. • Multi-perspective.
  • 44. The Starting Point Strategy. [Collins 1969] Collins, A.M. and Quillian, M.R. Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behaviour, 8, 240-247.
  • 45. (I) The Starting Point Strategy. The “Basic level advantage” [Rogers2007]: concepts in a basic level are used more frequently than other terms [Wisniewski1989]. • (Frequency) Total Interest: $TI(i) = \sum_{j=1}^{n} m(i,j)$. As a step forward from the “familiar term” in basic level, “interest retention” focuses on frequency and recency at the same time. Interest retention models <--> cognitive memory retention models [Anderson, Schooler 1991]. • (Frequency and Recency) Exponential Model for Interest Retention: $EIR(i) = \sum_{j=1}^{n} m(i,j) \times A e^{-b T_i}$ • (Frequency and Recency) Power Model for Interest Retention: $PIR(i) = \sum_{j=1}^{n} m(i,j) \times A T_i^{-b}$
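The three interest measures on slide 45 can be compared on a small example. The sketch below makes several assumptions that the slide does not state: m(i, j) is read as the number of times interest i appears in year j, the elapsed time is taken per year as `now - year` (the slide writes it as T_i), and the constants `A` and `b` are placeholder values rather than fitted parameters.

```python
import math

def total_interest(counts):
    """TI: plain frequency, summing m(i, j) over all years j."""
    return sum(counts.values())

def exponential_retention(counts, now, A=1.0, b=0.5):
    """EIR: each year's count decays exponentially with elapsed time."""
    return sum(m * A * math.exp(-b * (now - year))
               for year, m in counts.items())

def power_retention(counts, now, A=1.0, b=0.5):
    """PIR: each year's count decays as a power law of elapsed time.

    Requires every year to be strictly earlier than `now`.
    """
    return sum(m * A * (now - year) ** (-b)
               for year, m in counts.items())

# Hypothetical counts of one interest term per publication year
counts = {2007: 4, 2008: 1}
ti = total_interest(counts)
pir = power_retention(counts, now=2009, A=1.0, b=1.0)
eir_no_decay = exponential_retention(counts, now=2009, A=1.0, b=0.0)
```

With decay switched off (b = 0) the retention value equals the total interest; with decay, older occurrences count for less, which is how a frequent but stale term can rank below a recent one.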
  • 46. Interest Retention and Interest Prediction. [Figures: (a) a comparative study of TI during 1990-2008 and IR in 2009; (b) difference in the contribution values from papers published in different years; (c) a comparative study on the predicted and real publication numbers by the exponential law model; (d) a comparative study on the predicted and real publication numbers by the power law model.]
  • 47. Evaluations and the Released Dataset • Interest retentions vs future interests: publication >= 100, top 9 interests, 2000 to 2007, 1226 persons; 49.54% predict 3 out of 9 interests. • 615,124 computer scientists in the SwetoDBLP dataset. • http://wiki.larkc.eu/csri-rdf
  • 48. DBLP-SSE: DBLP Search Support Engine. Recent interests are extracted using the power law interest retention model. Terms with high frequency do not necessarily have high interest retention (e.g. “Knowledge”).
  • 49. DBLP-SSE: DBLP Search Support Engine. Log in: Dieter Fensel. Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language. Query: Artificial Intelligence.
      List 1: without current interests constraints (Top 5 results)
      * PROLOG Programming for Artificial Intelligence, Second Edition.
      * Artificial Intelligence Architectures for Composition and Performance Environment.
      * Artificial Intelligence in Music Education: A Critical Review.
      * Music, Intelligence and Artificiality. Artificial Intelligence and Music Education.
      * Musical Knowledge: What can Artificial Intelligence Bring to the Musician?
      * ...
      List 2: with current interests constraints (Top 5 results)
      * Web Intelligence and Artificial Intelligence in Education.
      * Artificial Intelligence Exchange and Service Tie to All Test Environments (AI-ESTATE)-A New Standard for System Diagnostics.
      * Semantic Model for Artificial Intelligence Based on Molecular Computing.
      * Open Information Systems Semantics for Distributed Artificial Intelligence.
      * Artificial Intelligence and Financial Services.
      * …
  • 50. Multi-level Completeness Strategy. Limited time: low completeness. More time available: high completeness. One practical question: how to choose the nodes to be reasoned over?
  • 51. Choosing the pivotal nodes in the network first! Another question: if I stop here, what is the completeness like now?
  • 52. Multi-level Completeness Strategy. Nodes are grouped together by node degrees under a perspective. Completeness Prediction Function: $PC(i) = \dfrac{|N_{rel}(i)| \times (|N_{sub}(i)| - |N_{sub}(i')|)}{|N_{rel}(i)| \times (|N| - |N_{sub}(i')|) + |N_{rel}(i')| \times (|N_{sub}(i')| - |N|)}$. Query: “Who are authors in Artificial Intelligence?”
      degree(n, Pcn) to stop   Satisfied authors   AI authors
      70                       2885                151
      30                       17121               579
      11                       78868               1142
      4                        277417              1704
      1                        575447              2225
      0                        615124              2355
    [Figures: unifying search and reasoning with multilevel completeness and anytime behavior; comparison of predicted and actual completeness value.]
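The table on slide 52 is enough to work out the actual completeness reached at each stopping point. The sketch below reads the columns as: the degree threshold at which processing stops, the number of authors at or above that threshold, and how many of those are Artificial Intelligence authors. That reading, and the function name, are assumptions; the prediction function PC is not implemented here because the slide does not define i′.

```python
# (degree threshold, satisfied authors, AI authors), copied from the slide
ROWS = [
    (70, 2885, 151),
    (30, 17121, 579),
    (11, 78868, 1142),
    (4, 277417, 1704),
    (1, 575447, 2225),
    (0, 615124, 2355),
]

def actual_completeness(rows):
    """Fraction of all AI authors found when stopping at each threshold."""
    total_ai = rows[-1][2]  # threshold 0 covers every author
    return {degree: ai / total_ai for degree, _, ai in rows}

completeness = actual_completeness(ROWS)
for degree, value in completeness.items():
    print(f"stop at degree >= {degree:2d}: {value:.2%} of AI authors found")
```

Processing only the authors with degree 70 or more touches 2885 of the 615124 nodes, yet already returns about 6.4% of the answers, which is the anytime behavior the strategy relies on.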
  • 53. Multi-level Specificity Strategy. Limited time: general. More time available: specific.
  • 54. A Case Study on Multi-level Specificity Strategy. Answers to “Who are the authors in Artificial Intelligence?” in multiple levels of specificity according to the hierarchical ontology of Artificial Intelligence:
      Specificity   Relevant Keywords          Number of Authors
      Level 1       Artificial Intelligence    2355
      Level 2       Agents                     9157
                    Automated Reasoning        222
                    Cognition                  19775
                    Constraints                8744
                    Games                      3817
                    Knowledge Representation   1537
                    Natural Language           2939
                    Robot                      16425
                    …                          …
      Level 3       Case-Based Reasoning       1133
                    Cognitive Modeling         76
                    Decision Trees             1112
                    Search                     32079
                    Translation                4414
                    Web Intelligence           122
                    …                          …
    A comparative study on the answers in different levels of specificity:
      Specificity   Number of authors   Completeness
      Level 1       2355                0.85%
      Level 1,2     207468              75.11%
      Level 1,2,3   276205              100%
  • 55. The Multi-perspective Strategy. Multiple representation of Knowledge [Minsky2006]. User needs may differ from each other <--> expect answers from different perspectives. [Figure: normalized degree distribution of predicates in the SwetoDBLP dataset.]
  • 56. The Multi-perspective Strategy. Under different perspectives, the distribution characteristics are different! [Figures: Fig. 2. Coauthor number distribution in the SwetoDBLP dataset. Fig. 3. Log-log diagram of Figure 2. Fig. 4. A zoomed in version of Figure 2. Fig. 5. A zoomed in version of coauthor distribution for “Artificial Intelligence”. Fig. 6. Publication number distribution in the SwetoDBLP dataset. Fig. 7. Log-log diagram of Figure 6.]
  • 57. Comparison of Results from Different Perspectives. A partial result of the multilevel specificity reasoning task: the list of authors in “Artificial Intelligence” in level 1 from two perspectives.
      Publication number perspective   Coauthor number perspective
      Thomas S. Huang (387)            Carl Kesselman (312)
      John Mylopoulos (261)            Thomas S. Huang (271)
      Hsinchun Chen (260)              Edward A. Fox (269)
      Henri Prade (252)                Lei Wang (250)
      Didier Dubois (241)              John Mylopoulos (245)
      Thomas Eiter (219)               Ewa Deelman (237)
      ...                              ...
  • 58. Summarizing. The Semantic Web is rapidly becoming real. Scale is becoming a real problem. Different ways of scaling up: − parallelization − exploiting cognitive heuristics (stopping rules, cognitive memory retention, etc.) − data-selection for incomplete reasoning − new forms of reasoning.
  • 60. Acknowledgement. Slides for this talk are mainly from 3 previous talks: (1) Frank van Harmelen. Large Scale Reasoning on the Semantic Web or: When success is becoming a problem. Invited talk at the 2009 International Joint Conferences on Active Media Technology and Brain Informatics. (2) Yi Zeng. Unifying Web-scale Search and Reasoning from the viewpoint of Granularity. The 2009 International Joint Conferences on Active Media Technology and Brain Informatics. (3) Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar, University of Amsterdam, 2008.
  • 61. Contact Info. Want to play with LarKC? Want to contribute plugins? Want to deploy LarKC? Frank.van.Harmelen@cs.vu.nl, http://www.larkc.eu. Asia: Ning Zhong: zhong@maebashi-it.ac.jp; Yi Zeng: yzeng@emails.bjut.edu.cn @ WIC
  • 62. References
      [Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperSanFrancisco (1999)
      [Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2) (2007) 94-96
      [Michalski1986] Michalski, R.S., Winston, P.H.: Variable precision logic. Artificial Intelligence 29(2) (1986) 121-146
      [Minsky2006] Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster (2006)
      [Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and explanations of the basic-level advantage. Journal of Experimental Psychology: General 136(3) (2007) 451-469
      [Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of learning and cognitive processes. Hillsdale, NJ: Lawrence Erlbaum Associates (1976) 321-361
      [Aleman-Meza2007] Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: Swetodblp ontology of computer science publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3) (2007) 151-155
      [Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental Psychology. Teachers College, Columbia University (1913)