Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion

6.940 Aufrufe

Veröffentlicht am

Flink Forward 2015

Veröffentlicht in: Technologie
  • Follow the link, new dating source: ♥♥♥ http://bit.ly/36cXjBY ♥♥♥
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • Dating direct: ❤❤❤ http://bit.ly/36cXjBY ❤❤❤
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • Gehören Sie zu den Ersten, denen das gefällt!

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion

  1. 1. A Semantic Big Data Companion Stefano Bortoli @stefanobortoli bortoli@okkam.it Flavio Pompermaier @fpompermaier pompermaier@okkam.it
  2. 2. The company (briefly) • Okkam is – a SME based in Trento, Italy. – Started as spin-off of the University of Trento and FBK (2010) • Okkam core business is – large-scale data integration using semantic technologies and an Entity Name System • Okkam operative sectors – Services for public administration – Services for restaurants (and more) – Research projects • FP7, H2020, and Local agencies
  3. 3. Who we are • Stefano Bortoli, PhD – works as technical director and researcher at Okkam S.R.L. (Trento, Italy). His research and development interests are in the area of Information Integration, with special focus in entity- centric applications exploiting semantic technologies. • Flavio Pompermaier, MSc. – works as senior software engineer at Okkam S.R.L. (Trento, Italy). Flavio is a passionate developer working with state of the art technologies, combining semantic with big data technologies.
  4. 4. Our contributions • Early Flink adopters and promoters (since Stratosphere) • Example on how to use Flink with MongoDB – https://github.com/okkam-it/flink-mongodb-test • 2 pull requests – [FLINK-1928] [hbase] Added HBase write example – [FLINK-1828] [hadoop] Fixed missing call to configure() for Configurable HadoopOutputFormats • 8 JIRA tickets – OPEN FLINK-2503 Inconsistencies in FileInputFormat hierarchy – RESOLVED FLINK-1978 POJO serialization NPE – OPEN FLINK-1834 Is mapred.output.dir conf parameter really required? – CLOSED FLINK-1828 Impossible to output data to an HBase table – OPEN FLINK-1827 Move test classes in test folders and fix scope of test dependencies – OPEN FLINK-2800 kryo serialization problem – RESOLVED FLINK-2394 HadoopOutFormat OutputCommitter is default to FileOutputCommiter – OPEN FLINK-1241 Record processing counter for Dashboard • Hundreads of email threads and discussions 
  5. 5. What we do
  6. 6. Our toolbox
  7. 7. Our Flink Use Cases • Our objective is to build and manage (very) large entity-centric knowledge bases to serve different purposes and domains • So far, we used Apache Flink for: – Domain reasoning (Flink + Parquet + Thrift) – RDF data lifecycle (Flink + Parquet + Jena/Sesame ) – RDF data intelligence (Flink + ELKiBi) – Duplicate record detection (Flink + HBase + Solr) – Entiton Record linkage (Flink + MongoDB + Kryo) – Telemetry analysis (Flink + MongoDB + Weka)
  8. 8. Why we need Flink Entiton data model Database record RDF statement Triplestore NOSQL + Indexes + Quad provenance IRI predicate object object Type Subject local IRI Subject Global IRI RDF Type Expensive datawearhouse
  9. 9. Entiton using Parquet+Thrift namespace java it.okkam.flink.entitons.serialization.thrift struct EntitonQuad { 1: required string p; //pred 2: required string o; //obj 3: optional string ot; //obj-type 4: required string g; //sourceIRI } struct EntitonAtom { 1: required string s; //local-IRI 2: optional string oid; // ens-IRI 3: required list<string> types; //rdf-types 4: required list<EntitonQuad> quads; // quads } struct EntitonMolecule { 1: required EntitonAtom r; //root atom 2: optional list<EntitonAtom> atoms; //other atoms } Quad Subject local IRI Subject ENS IRI RDF Type
  10. 10. Hardware-wise • We compete with expensive data wearhouse solution – e.g. Oracle Exadata Database Machines, IBM Netezza, etc. • Test on small machines fosters optimization – If you don’t want to wait, make your code faster! • Our code is ready to scale, without big investments • Fancy stuff can be done without millions of euros in HW 8 x Gigabyte Brix 16GB RAM 256GB SSD 1T HDD Intel I7 4770 3,2Ghz + 1 Gbit Switch
  11. 11. ENS Maintenance • Entity Name System (ENS) (FP7 ’08-’10) – A Web-scale support for minting and reusing persistent entity identifiers for SW or LOD
  12. 12. ENS Maintenance • Duplicate detection of 9.2M entities in 6h45 (using Flink incubator 0.6) – Query Apache Solr global index to perform flexible blocking given a subset of attributes of each entity (names) – Distinct/Join pairs of candidate duplicate – Rich Filter implementing Match function – Consolidate sets of candidate duplicates grouping • Tricks: – If HBase does not distribute uniformly regions do rebalance() – Compress byte[] with LZ4 in custom HBase input format to reduce network traffic – Reverse keys to speed up join execution (up to 10%, in some cases)
  13. 13. Tax Reasoner • Pilot project for ACI and Val d’Aosta Objectives: to produce analytics and investigate: 1. Who did not pay Vehicle Excise Duty (Kraftfahrzeugsteuer)? 2. Who did not pay Vehicle Insurance? 3. Who skipped Vehicle Inspection? 4. Who did not pay Vehicle Sales Taxes? 5. Who violated exceptions to the above? Dataset: 15 data sources for 5 year with 12M records about 950k vehicles and 500k subjects for a total of 90M facts Challenge: consider events (time) and infer implicit information. Apache Flink jobs: 1. From RDF to Entiton 2. Domain Specific Temporal Inference (Tax Reasoner) 3. Build ElasticSearch Indexes
  14. 14. Tax Reasoner
  15. 15. Tax Reasoner Temporal Inference Execution Plan 1h ETA with SSD (2h30 on HDD) on developer machine 11M new facts inferred It took 1 DAY to perform the select query for one of the sources!!
  16. 16. RDF Data Intelligence SIREn Solution Pivot BroswerTimeline for details about vehicleBusiness Intelligence Analytics with SIREn Solution KiBi Geospatial indicators
  17. 17. Using Entiton with MongoDB • Inspired by the work of Gregg Kellog (2012) – http://www.slideshare.net/gkellogg1/jsonld-and-mongodb • We updated JAOB (Java Architecture for OWL Binding) – Serialize RDF into POJO and viceversa – Provides also a Maven plugin to compile OWL ontology into POJOs – https://github.com/okkam-it/jaob • Database access layer data (using SpringData): – POJO  RDF  JSON-LD + KRYO  MongoDB – MongoDB  KRYO  POJO • Bottom line: – we use (framed) JSON-LD to allow (complex) tree queries on an entiton database modeled according to a domain ontology – We exploit Kryo deserialization for fast reading – We enjoy SpringData abstraction to implement Data access APIs
  18. 18. Using Entiton with MongoDB @Document(collection = Entiton.TABLE) @CompoundIndexes({ @CompoundIndex(name = "nestedId", def = "{ 'jsonld.@id': 1 }", unique = true), …. }) public class Entiton<T extends Thing> implements Serializable { @Id private String id; @Version private String version; private DBObject jsonld; private Binary kryo; @GeoSpatialIndexed(type=GeoSpatialIndexType.GEO_2DSPHERE) private Point point; public String javaClass; … }
  19. 19. Entiton (Mongo JSON-LD)
  20. 20. Contextual personalization in
  21. 21. Recommendation Architecture
  22. 22. Telemetry data collection
  23. 23. Flink batch processes
  24. 24. Contextual Personalization on demand
  25. 25. Lesson learned • Reversing String Tuples ids leads to performance improvements of joins • When you make joins, ensure distinct dataset keys • Reuse objects to reduce the impact of garbage collection • When writing Flink jobs, start with small and debuggable unit tests first, then run it on the cluster on the entire dataset (waiting for big data debugging of Dr. Leich@TUB) • Serialization matters: less memory required, less gc, faster data loading  faster execution • HD speed matters when RAM is not enough, SSD rulez • Parquet rulez: self-describing data, push-down filters • Use Gelly consciously, sometimes joins are good enough • If your code crashes, it is usually your fault: from 0.9 Flink is quite stable for batch jobs execution (at least)
  26. 26. Suggested improvements • Running jobs on a cluster is always tricky because of bad maven dependencies management in flink – Define a POM for the build settings, a parent, and a bill-of- materials BOM (releasetrain) and common resources folder – Jar validation on client submit wrt deployed Flink dist bundle giving warning on possibly conflicting classes • Enable fast compilation (–Dmaven.test.skip=true) FLINK-1827 • Better monitoring, we’re eager to use the new web client! • Complete Hadoop compatibility – counters and custom grouping/sorting functions • Start thinking about education of professionals – e.g. courses and certification
  27. 27. Future works • Benchmark Entiton serialization models on Parquet (Avro vs Thrift vs Protobuf) • Manage declarative data fusion policies – a-la LDIF: http://ldif.wbsg.de/ • Define a formal entiton operations algebra (e.g. merge, project, select, filter) • Try out Cloudera Kudu – novel Hadoop storage engine addressing both bulk loading stability, scan performance and random access – https://github.com/cloudera/kudu
  28. 28. Conclusions • We think we are walking along the “last mile” towards real world enterprise Semantic Applications • Combining big data and semantics allows us to be flexible, expressive and, thanks to Flink, very scalable with very competitive costs • Apache Flink gives us the leverage to shuffle data around without much headache • We proved cool stuff can be done in a simple and efficient way, with the right tools and mindset • Hopefully, Flink will help us reducing tax evasion in Italy, not bad for a German Squirrel, ah?
  29. 29. Thanks for your attention Any Questions (before beer)?