Presentation on Apache Stanbol (incubating) and related projects given by Olivier Grisel during ApacheCon 2011.
More information:
- http://incubator.apache.org/stanbol/
- http://www.iks-project.eu
2. My Background 11/7/11 Olivier Grisel - R&D Engineer nuxeo Open Source ECM European project: IKS Stuff I do: Machine Learning Natural Language Processing All things data
3. Agenda 11/7/11 The Web of Data: what, why, how? CMS integration demo Semantic Components in Stanbol Building models for Stanbol
6. 11/7/11 “To a computer, then, the web is a flat, boring world devoid of meaning” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
7. 11/7/11 “This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
8. 11/7/11 “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
9. 11/7/11 “Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
13. Decoding User Intents 11/7/11 Next Generation User Interfaces Siri - conversational interface IBM DeepQA: Watson for Health Care Tell Google about your stuff Publish structured prediction of your products "3 bedrooms flat near Montmartre" Useful for non-public data as well Intranet query: "ApacheCon slides" Intranet query: "Xerox invoices" Intranet query: "Xerox salesperson email"
14. The Web of Data - How? 11/7/11 RDF / TripleStores / SPARQL Graph stores with dynamic schemas Strong interoperability JSON-LD Upgrade your JSON with scoped vocabularies Web / Mobile / JS developer friendly RDFa + schema.org & rNews Publish annotation in structured markup Vocabulary understood by Search Engines
15. HTML example 11/7/11 <p> My name is Manu Sporny and you can give me a ring via 1-800-555-0155. <img src="http://manu.sporny.org/images/manu.png" /> I have a <a href="http://manu.sporny.org/">blog</a>. </p>
16. RDFa example 11/7/11 <p vocab="http://schema.org/" prefix="foaf: http://xmlns.com/foaf/0.1/" about="#manu" typeof="Person" > My name is <span property="name" >Manu Sporny</span> and you can give me a ring via <span property="telephone" >1-800-555-0155</span>. <img rel="image" src="http://manu.sporny.org/images/manu.png" /> I have a <a rel="foaf:weblog" href="http://manu.sporny.org/">blog</a>. </p>
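The same statements can also be written as JSON-LD, the format listed on the "How?" slide. The following is a minimal sketch, not taken from the talk: it mirrors the RDFa attributes above (`vocab`, `prefix`, `about`, `typeof`, `property`, `rel`) with the corresponding JSON-LD keywords, and the helper name `to_jsonld` is hypothetical.

```python
import json

# JSON-LD rendering of the RDFa example: schema.org is the default
# vocabulary and "foaf" is a scoped prefix, as in the markup above.
person = {
    "@context": {
        "@vocab": "http://schema.org/",
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@id": "#manu",
    "@type": "Person",
    "name": "Manu Sporny",
    "telephone": "1-800-555-0155",
    # Links (rel="..." in RDFa) become node references with "@id".
    "image": {"@id": "http://manu.sporny.org/images/manu.png"},
    "foaf:weblog": {"@id": "http://manu.sporny.org/"},
}


def to_jsonld(doc):
    """Serialize a JSON-LD document (hypothetical helper for this sketch)."""
    return json.dumps(doc, indent=2, sort_keys=True)


print(to_jsonld(person))
```

The document stays plain JSON for clients that ignore `@context`, which is what makes the format friendly to web and mobile developers.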
22. Apache Stanbol 11/7/11 Enhancer Text analysis with Apache OpenNLP / Tika EntityHub / ContentHub Linked Data Indexing with Apache Solr Graph Storage with Apache Clerezza / Jena Reasoner / Rules Inference with Apache Jena & OWLApi Components / HTTP Services OSGi with Apache Felix / JAX-RS with Jersey
28. Minimalist HTTP Client 11/7/11 curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data "John Smith was born in London." http://stanbol.demo.nuxeo.com/engines
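The curl call above can be reproduced with the Python standard library. This is a sketch under assumptions: the function names `build_enhance_request` and `enhance` are hypothetical, and the demo server URL is the one from the slide, which may no longer be reachable.

```python
import urllib.request

# Endpoint taken from the slide; it is a 2011 demo server and may be offline.
STANBOL_ENGINES_URL = "http://stanbol.demo.nuxeo.com/engines"


def build_enhance_request(text, url=STANBOL_ENGINES_URL, accept="text/turtle"):
    """Build the same POST request as the curl command, without sending it."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": accept, "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, url=STANBOL_ENGINES_URL, timeout=10):
    """Send the text to the enhancer and return the response body as a string."""
    request = build_enhance_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires a reachable Stanbol instance):
# print(enhance("John Smith was born in London."))
```

Changing the `accept` argument asks the server for another RDF serialization; which ones a given deployment supports is not stated on the slide.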
32. Stanbol Enhancer 11/7/11 Chain of Enhancement Engines Language Detection (Tika) Named Entity Detection (OpenNLP) Linked Data dereferencing (Solr) Refactoring / Translation (Jena)
33. Stanbol EntityHub 11/7/11 Referenced Sites DBpedia Geonames (NY Times, MusicBrainz, ProductDB, UniProt...) Fast local offline indices (Solr) Batch indexing utilities for RDF dumps Multilingual fulltext search in labels & descriptions Vocabulary mapping / merging
34. Stanbol Reasoner 11/7/11 RDFS / OWL-lite / OWL2 Consistency checks Cardinality checks: each person has 1 birth date Range constraints: birth dates are valid dates Materializing types / properties Types from subclass: Musician > Artist > Person Symmetric property: A worked with B Transitive property: A is located in B Query-time expansion / inference?
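The three materialization cases on the Reasoner slide can be illustrated in a few lines of plain Python. This is only a toy sketch of the inference rules, not how Stanbol implements them (the slides name Apache Jena and the OWL API for that); all function names and the place names in the usage note are made up for illustration.

```python
# Subclass hierarchy from the slide: Musician > Artist > Person.
SUBCLASS_OF = {"Musician": "Artist", "Artist": "Person"}


def materialize_types(direct_type, subclass_of=SUBCLASS_OF):
    """Return the direct type followed by all its superclasses.

    Assumes a single-parent, acyclic hierarchy.
    """
    types = [direct_type]
    while types[-1] in subclass_of:
        types.append(subclass_of[types[-1]])
    return types


def symmetric_closure(pairs):
    """If (A, B) holds for a symmetric property, (B, A) holds as well."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """If (A, B) and (B, C) hold for a transitive property, add (A, C)."""
    closure = set(pairs)
    while True:
        inferred = {
            (a, d) for a, b in closure for c, d in closure if b == c
        } - closure
        if not inferred:
            return closure
        closure |= inferred
```

For example, `materialize_types("Musician")` yields all three types, and `transitive_closure({("Montmartre", "Paris"), ("Paris", "France")})` adds the pair `("Montmartre", "France")`. Materializing such facts at index time trades storage for simpler queries, which is the alternative to the query-time expansion the slide asks about.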
38. Universal Topic Classification 11/7/11 Use Apache Lucene / Solr MoreLikeThis to perform a truncated nearest neighbors query in the TF-IDF vector space of Wikipedia
39. Universal Topic Classification 11/7/11 Index text of all articles grouped by topic Solr MoreLikeThis query on new document DBpedia dumps provide: Text summaries for each article “subject” relationships between articles and topics “broader” / “narrower” SKOS hierarchy between topics
40. About the Data 11/7/11 500k purely technical categories “People_with_missing_birth_place”, “Rivers_in_Romania” 70k “semantically grounded” categories Paths to roots require both “technical” and “grounded” categories Scale: 1.2M topic / topic links 30M topic / article links
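The idea behind the topic classifier, a truncated nearest-neighbor query in TF-IDF space, can be sketched without Solr. The code below is an in-memory stand-in for the MoreLikeThis query, not the Stanbol implementation: the whitespace tokenizer, the smoothed IDF formula and every function name are assumptions made for this illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption of this sketch)."""
    return text.lower().split()


def normalize(vector):
    """Scale a sparse vector to unit length; an empty vector is returned as-is."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else vector


def build_index(topic_texts):
    """Index one TF-IDF vector per topic.

    topic_texts maps a topic name to the concatenated text of its articles.
    """
    counts = {topic: Counter(tokenize(text)) for topic, text in topic_texts.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_topics = len(counts)
    idf = {t: math.log((1 + n_topics) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        topic: normalize({t: c * idf[t] for t, c in term_counts.items()})
        for topic, term_counts in counts.items()
    }
    return idf, vectors


def classify(text, index, top_k=3):
    """Return the top_k topics closest to the text by cosine similarity."""
    idf, vectors = index
    term_counts = Counter(tokenize(text))
    query = normalize({t: c * idf[t] for t, c in term_counts.items() if t in idf})
    scores = {
        topic: sum(w * vector.get(t, 0.0) for t, w in query.items())
        for topic, vector in vectors.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, score in ranked[:top_k] if score > 0]
```

With a toy index of two topics, one about astronauts and one about pensions, a headline mentioning NASA ranks the astronaut topic first. At Wikipedia scale the same query needs an inverted index, which is what Lucene / Solr provides.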
41. Some results (Wikinews) 11/7/11 US children who celebrate Independence Day more likely to become Republicans, says Harvard study Fireworks Voting theory Republican Party (United States) Statistics Electoral systems
42. Some results (Wikinews) 11/7/11 U.S. space agency NASA sues ex-astronaut American astronauts Aviation halls of fame Edwards Air Force Base Apollo program Exploration of the Moon
43. Some results (Wikinews) 11/7/11 Hundreds of thousands of British public sector workers strike over planned pension changes Retirement in the United Kingdom United Kingdom pensions and benefits Pensions in the United Kingdom Labor disputes by country Labor disputes
44. Some results (PLoS One) 11/7/11 Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways Renal physiology Kidney Nephrology Hypertension Membrane biology
45. Wrap Up 11/7/11 Web of Data brings Structured Context Frame to decode User Intention NLP + Entities & Topics indices to automate Content Enrichment to provide Disambiguation
46. Resources 11/7/11 Documentation, svn, mailing list: http://incubator.apache.org/stanbol IKS project blog: http://blog.iks-project.eu Blog posts about Semantic ECM: http://blogs.nuxeo.com/dev/semantic/
47. Thank you for your attention! 11/7/11 Olivier Grisel [email_address] https://twitter.com/ogrisel
48. Training models for NER from Wikipedia Extract sentences with link positions in Wikipedia articles DBpedia to find the type of the target entity (Person, Location, Organization) Apache Pig scripts to compute the join + format the result as training files for OpenNLP Apache OpenNLP to build and evaluate the models Apache Hadoop / Apache Whirr for distributed processing
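The formatting step of this pipeline can be sketched as follows. OpenNLP's name finder is trained on whitespace-tokenized sentences, one per line, with entities wrapped in `<START:type> ... <END>` markers. The talk does this join and formatting with Apache Pig; the Python function below is only a stand-in to show the target format, and its name, the link tuple layout and the type lookup dictionary are assumptions.

```python
def to_opennlp_line(sentence, links, entity_types):
    """Annotate one sentence in the OpenNLP name finder training format.

    sentence: whitespace-tokenized text of the sentence.
    links: non-overlapping (start, end, target) character spans of wiki links.
    entity_types: maps a link target to its type, e.g. looked up in DBpedia.
    Links whose target has no known type are left as plain text.
    """
    parts = []
    cursor = 0
    for start, end, target in sorted(links):
        entity_type = entity_types.get(target)
        if entity_type is None:
            continue
        parts.append(sentence[cursor:start])
        parts.append(" <START:%s> %s <END> " % (entity_type, sentence[start:end]))
        cursor = end
    parts.append(sentence[cursor:])
    # Collapse the extra spaces introduced around the markers.
    return " ".join("".join(parts).split())


sentence = "John Smith was born in London ."
london = sentence.index("London")
links = [(0, len("John Smith"), "John_Smith"), (london, london + len("London"), "London")]
types = {"John_Smith": "person", "London": "location"}
print(to_opennlp_line(sentence, links, types))
```

Each output line becomes one training sample; the type labels (`person`, `location`) are illustrative and would follow whatever mapping from DBpedia classes the pipeline defines.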