I gave this talk on Oct 2 at the Semantic Technology and Business conference. In it I discuss how I process Freebase data with the open source Infovore framework, which processes Freebase and other RDF data quickly using Hadoop, Map/Reduce, and Amazon Web Services.
39. Tools Popular With :BaseKB Users
[Bar chart; horizontal axis runs from 0 to 60. Bars, in slide order:]
Freebase
DBpedia
any relational database
machine learning
Jena
Amazon Web Services
PHP
map/reduce frameworks (e.g. Hadoop)
MongoDB
Sesame
Virtuoso OpenLink
other NoSQL database
Solid State Drives (SSD)
other cloud computing service
Neo4J
Ruby
Drupal
alternative JVM languages (e.g. Scala or Clojure)
other triple store
any key/value store (e.g. JDBM or Berkeley DB)
OWLIM
Allegrograph
4store
Factual
dotNetRDF
Stardog
Kasabi/Talis Platform
Oracle Spatial RDF
43. Jena Framework
• SDB: relational database-backed triple store
• TDB: native disk-based triple store
• Model: in-memory triple store
“We use Jena Models like PHP programmers use hashtables”
-- Kendall Clark, Clark and Parsia
60. freebaseRDFPrefilter removes…
Wasteful Facts
• 120M+ copies of the “a” predicate
• 60M+ access control predicates
Violent and Dangerous facts
ns:common.topic ns:type.type.instance ?o .
is repeated 30M times, and if you group on ?s and keep them in memory…
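The filtering idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the actual freebaseRDFPrefilter code: the function names, the whitespace-based line splitting, and the access-control predicate name `ns:type.object.permission` are assumptions made for the example.

```python
# Sketch of a prefilter pass over simple "subject predicate object ." lines.
# NOT the real freebaseRDFPrefilter; names below are illustrative assumptions.

# Predicates dropped wherever they appear.
DROPPED_PREDICATES = {
    "rdf:type",                   # what Turtle abbreviates as "a"
    "ns:type.object.permission",  # assumed name of an access-control predicate
}

# (subject, predicate) pairs dropped because one subject carries too many facts.
DROPPED_PAIRS = {("ns:common.topic", "ns:type.type.instance")}


def split_triple(line):
    """Split one line into (subject, predicate, object), dropping the final dot."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    subject, predicate, obj = body.split(None, 2)
    return subject, predicate, obj


def is_wanted(line):
    """Return True if the triple on this line should survive the prefilter."""
    subject, predicate, _obj = split_triple(line)
    if predicate == "a":
        predicate = "rdf:type"
    return (predicate not in DROPPED_PREDICATES
            and (subject, predicate) not in DROPPED_PAIRS)


def prefilter(lines):
    """Keep only the non-blank lines whose triples pass is_wanted()."""
    return [line for line in lines if line.strip() and is_wanted(line)]
```

Calling `prefilter(lines)` on a list of such lines returns only those whose predicate, or subject and predicate together, is not on either drop list. A real pass would run inside a mapper so that the dropped facts never reach the shuffle.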
61. … uneven bin distribution …
[Diagram: bins labeled 330, 331, 332, 333, 334, 335, …]
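To make the skew concrete, here is a small Python sketch under stated assumptions: the data is synthetic, the bin count of 6 is arbitrary, and `bin_for` is a hypothetical stand-in for a hash partitioner rather than Hadoop's own. It shows that grouping on a subject that repeats tens of thousands of times forces all of those rows into one bin.

```python
import zlib
from collections import Counter


def bin_for(key, n_bins):
    """Stable hash partitioner: the same key always lands in the same bin."""
    return zlib.crc32(key.encode("utf-8")) % n_bins


def bin_sizes(keys, n_bins):
    """Count how many rows each bin receives when rows are grouped by key."""
    sizes = Counter(bin_for(k, n_bins) for k in keys)
    return [sizes.get(b, 0) for b in range(n_bins)]


# Synthetic data: 1,000 ordinary subjects with 3 facts each,
# plus one "hot" subject that appears 30,000 times.
subjects = [f"ns:m.{i:05d}" for i in range(1000) for _ in range(3)]
subjects += ["ns:common.topic"] * 30000

sizes = bin_sizes(subjects, 6)
hot_bin = bin_for("ns:common.topic", 6)
print(sizes, "hot bin:", hot_bin)
```

However the ordinary subjects spread out, the hot bin receives at least 30,000 of the 33,000 rows, so one worker does most of the work while the others sit idle.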
75. Descriptions
ns:m.010bfy ns:common.topic.description
"Riverside \u00E9 uma cidade localizada no estado norte-americano
de Texas, no Condado de Walker."@pt .
ns:m.010bs8 ns:common.topic.description
"El Campo is a city in Wharton County, Texas, United States. The
population was 10,945 at the 2000 census, making it the largest city in
Wharton County."@en .
This does not compute!
78. Labels and Names
ns:american_football.football_division rdfs:label
"American football division"@en .
ns:american_football.football_conference rdfs:label
"Grupper inom amerikansk fotboll"@sv .
ns:american_football.football_player ns:type.object.name
"Football-Spieler"@de .
ns:american_football.football_team ns:type.object.name
"American football-team"@nl .
85. Examples…
ns:m.010bs8 ns:common.topic.description
"El Campo is a city in Wharton County, Texas, United States. The
population was 10,945 at the 2000 census, making it the largest city in
Wharton County."@en .
ns:american_football.football_division rdfs:label
"American football division"@en .
Freebase always uses the same key in the ?s, ?p, and ?o fields, but...
86. It wasn’t always this way
… the old quad dump used mids in the subject field, but other keys in the destination field …
112. Pig Script – count common types
$ pig
grunt> run chopper/src/main/pig/lib/chopper.pig
grunt> a = LOAD '/freebase/20130915/a/' USING
com.ontology2.chopper.io.PrimitiveTripleInput();
grunt> oNodes = FOREACH a GENERATE o;
grunt> groupNodes = GROUP oNodes BY o;
grunt> countedNodes = FOREACH groupNodes GENERATE
group AS uri:chararray,COUNT(oNodes) AS cnt:long;
grunt> sortedNodes = ORDER countedNodes BY cnt DESC;
grunt> top100 = LIMIT sortedNodes 100;
grunt> DUMP top100;
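For readers without a cluster at hand, the same count can be sketched locally in Python. This is only an illustration of what the script computes: the function name `count_objects` and the sample triples are hypothetical, and the triples are assumed to be already parsed into (subject, predicate, object) tuples.

```python
from collections import Counter


def count_objects(triples, top_n=100):
    """Count how often each object occurs and return the top_n most common.

    Mirrors the Pig steps: project o, group by o, count, order descending, limit.
    """
    counts = Counter(o for _s, _p, o in triples)
    return counts.most_common(top_n)


# Hypothetical sample standing in for the "a" (rdf:type) partition.
triples = [
    ("ns:m.010bs8", "a", "ns:common.topic"),
    ("ns:m.010bs8", "a", "ns:location.citytown"),
    ("ns:m.010bfy", "a", "ns:common.topic"),
    ("ns:m.010bfy", "a", "ns:location.citytown"),
    ("ns:m.0abc", "a", "ns:common.topic"),
]

for uri, cnt in count_objects(triples, top_n=2):
    print(uri, cnt)
```

The in-memory `Counter` only works while the distinct objects fit on one machine; the Pig version exists precisely because the full dump does not.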
124. :BaseKB Now
• Created weekly by an automated process
• Delivered to Amazon S3
• Accepted facts are 100% valid RDF
• Rejected facts collected for inspection
• “Violent” predicates removed to fight skew
• Horizontally divided for fast processing
http://basekb.com/
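The accept/reject split in the list above can be sketched as follows. This is a rough illustration, not Infovore's validator: the regular expression only checks the overall shape of an N-Triples line (it is far looser than a real RDF parser), and `sort_facts` and the `example.org` URIs are hypothetical.

```python
import re

# Rough shape checks, NOT a full N-Triples grammar.
IRI = r'<[^<>"\s]+>'
LITERAL = (r'"(?:[^"\\]|\\.)*"'
           r'(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?')
TRIPLE = re.compile(rf"^\s*({IRI})\s+({IRI})\s+({IRI}|{LITERAL})\s*\.\s*$")


def sort_facts(lines):
    """Route each line to 'accepted' if it looks like a triple, else 'rejected'."""
    accepted, rejected = [], []
    for line in lines:
        (accepted if TRIPLE.match(line) else rejected).append(line)
    return accepted, rejected


lines = [
    '<http://example.org/ns/m.010bs8> <http://example.org/ns/label> "El Campo"@en .',
    '<http://example.org/ns/m.010bs8> <http://example.org/ns/p> "unterminated .',
    'only two fields .',
]
accepted, rejected = sort_facts(lines)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping the rejected lines, rather than discarding them, is what makes later inspection possible.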