SlideShare ist ein Scribd-Unternehmen logo
1 von 29
A Semantic Big Data
Companion
Stefano Bortoli @stefanobortoli
bortoli@okkam.it
Flavio Pompermaier @fpompermaier
pompermaier@okkam.it
The company (briefly)
• Okkam is
– a SME based in Trento, Italy.
– Started as spin-off of the
University of Trento and FBK (2010)
• Okkam core business is
– large-scale data integration using
semantic technologies and
an Entity Name System
• Okkam operative sectors
– Services for public administration
– Services for restaurants (and more)
– Research projects
• FP7, H2020, and Local agencies
Who we are
• Stefano Bortoli, PhD
– works as technical director and researcher at Okkam S.R.L.
(Trento, Italy). His research and development interests are in the
area of Information Integration, with special focus in entity-
centric applications exploiting semantic technologies.
• Flavio Pompermaier, MSc.
– works as senior software engineer at Okkam S.R.L. (Trento, Italy).
Flavio is a passionate developer working with state of the art
technologies, combining semantic with big data technologies.
Our contributions
• Early Flink adopters and promoters (since Stratosphere)
• Example on how to use Flink with MongoDB
– https://github.com/okkam-it/flink-mongodb-test
• 2 pull requests
– [FLINK-1928] [hbase] Added HBase write example
– [FLINK-1828] [hadoop] Fixed missing call to configure() for Configurable HadoopOutputFormats
• 8 JIRA tickets
– OPEN FLINK-2503 Inconsistencies in FileInputFormat hierarchy
– RESOLVED FLINK-1978 POJO serialization NPE
– OPEN FLINK-1834 Is mapred.output.dir conf parameter really required?
– CLOSED FLINK-1828 Impossible to output data to an HBase table
– OPEN FLINK-1827 Move test classes in test folders and fix scope of test dependencies
– OPEN FLINK-2800 kryo serialization problem
– RESOLVED FLINK-2394 HadoopOutFormat OutputCommitter is default to FileOutputCommiter
– OPEN FLINK-1241 Record processing counter for Dashboard
• Hundreads of email threads and discussions 
What we do
Our toolbox
Our Flink Use Cases
• Our objective is to build and manage (very)
large entity-centric knowledge bases to serve
different purposes and domains
• So far, we used Apache Flink for:
– Domain reasoning (Flink + Parquet + Thrift)
– RDF data lifecycle (Flink + Parquet + Jena/Sesame )
– RDF data intelligence (Flink + ELKiBi)
– Duplicate record detection (Flink + HBase + Solr)
– Entiton Record linkage (Flink + MongoDB + Kryo)
– Telemetry analysis (Flink + MongoDB + Weka)
Why we need Flink
Entiton data model
Database record
RDF statement
Triplestore
NOSQL
+ Indexes
+
Quad
provenance IRI
predicate
object
object Type
Subject
local IRI
Subject
Global IRI
RDF Type
Expensive
datawearhouse
Entiton using Parquet+Thrift
namespace java
it.okkam.flink.entitons.serialization.thrift
struct EntitonQuad {
1: required string p; //pred
2: required string o; //obj
3: optional string ot; //obj-type
4: required string g; //sourceIRI
}
struct EntitonAtom {
1: required string s; //local-IRI
2: optional string oid; // ens-IRI
3: required list<string> types; //rdf-types
4: required list<EntitonQuad> quads; // quads
}
struct EntitonMolecule {
1: required EntitonAtom r; //root atom
2: optional list<EntitonAtom> atoms; //other atoms
}
Quad
Subject
local IRI
Subject
ENS IRI
RDF
Type
Hardware-wise
• We compete with expensive data wearhouse solution
– e.g. Oracle Exadata Database Machines, IBM Netezza, etc.
• Test on small machines fosters optimization
– If you don’t want to wait, make your code faster!
• Our code is ready to scale, without big investments
• Fancy stuff can be done without millions of euros in HW
8 x Gigabyte Brix
16GB RAM
256GB SSD
1T HDD
Intel I7 4770 3,2Ghz
+
1 Gbit Switch
ENS Maintenance
• Entity Name System (ENS) (FP7 ’08-’10)
– A Web-scale support for minting and reusing
persistent entity identifiers for SW or LOD
ENS Maintenance
• Duplicate detection of 9.2M entities in 6h45
(using Flink incubator 0.6)
– Query Apache Solr global index to perform flexible blocking given
a subset of attributes of each entity (names)
– Distinct/Join pairs of candidate duplicate
– Rich Filter implementing Match function
– Consolidate sets of candidate duplicates grouping
• Tricks:
– If HBase does not distribute uniformly regions do rebalance()
– Compress byte[] with LZ4 in custom HBase input format to reduce
network traffic
– Reverse keys to speed up join execution (up to 10%, in some cases)
Tax Reasoner
• Pilot project for ACI and Val d’Aosta
Objectives: to produce analytics and investigate:
1. Who did not pay Vehicle Excise Duty (Kraftfahrzeugsteuer)?
2. Who did not pay Vehicle Insurance?
3. Who skipped Vehicle Inspection?
4. Who did not pay Vehicle Sales Taxes?
5. Who violated exceptions to the above?
Dataset: 15 data sources for 5 year with 12M records about 950k
vehicles and 500k subjects for a total of 90M facts
Challenge: consider events (time) and infer implicit information.
Apache Flink jobs:
1. From RDF to Entiton
2. Domain Specific Temporal Inference (Tax Reasoner)
3. Build ElasticSearch Indexes
Tax Reasoner
Tax Reasoner
Temporal Inference Execution Plan
1h ETA with SSD (2h30 on HDD)
on developer machine
11M new facts inferred
It took 1 DAY to
perform the
select query for
one of the
sources!!
RDF Data Intelligence
SIREn Solution Pivot BroswerTimeline for details about vehicleBusiness Intelligence Analytics with SIREn Solution KiBi
Geospatial indicators
Using Entiton with MongoDB
• Inspired by the work of Gregg Kellog (2012)
– http://www.slideshare.net/gkellogg1/jsonld-and-mongodb
• We updated JAOB (Java Architecture for OWL Binding)
– Serialize RDF into POJO and viceversa
– Provides also a Maven plugin to compile OWL ontology into POJOs
– https://github.com/okkam-it/jaob
• Database access layer data (using SpringData):
– POJO  RDF  JSON-LD + KRYO  MongoDB
– MongoDB  KRYO  POJO
• Bottom line:
– we use (framed) JSON-LD to allow (complex) tree queries on an entiton
database modeled according to a domain ontology
– We exploit Kryo deserialization for fast reading
– We enjoy SpringData abstraction to implement Data access APIs
Using Entiton with MongoDB
@Document(collection = Entiton.TABLE)
@CompoundIndexes({
@CompoundIndex(name = "nestedId", def = "{ 'jsonld.@id': 1 }", unique = true),
….
})
public class Entiton<T extends Thing> implements Serializable {
@Id private String id;
@Version private String version;
private DBObject jsonld;
private Binary kryo;
@GeoSpatialIndexed(type=GeoSpatialIndexType.GEO_2DSPHERE)
private Point point;
public String javaClass;
…
}
Entiton (Mongo JSON-LD)
Contextual personalization in
Recommendation Architecture
Telemetry data collection
Flink batch processes
Contextual Personalization on demand
Lesson learned
• Reversing String Tuples ids leads to performance
improvements of joins
• When you make joins, ensure distinct dataset keys
• Reuse objects to reduce the impact of garbage collection
• When writing Flink jobs, start with small and debuggable
unit tests first, then run it on the cluster on the entire
dataset (waiting for big data debugging of Dr. Leich@TUB)
• Serialization matters: less memory required, less gc, faster
data loading  faster execution
• HD speed matters when RAM is not enough, SSD rulez
• Parquet rulez: self-describing data, push-down filters
• Use Gelly consciously, sometimes joins are good enough
• If your code crashes, it is usually your fault: from 0.9 Flink
is quite stable for batch jobs execution (at least)
Suggested improvements
• Running jobs on a cluster is always tricky because of bad
maven dependencies management in flink
– Define a POM for the build settings, a parent, and a bill-of-
materials BOM (releasetrain) and common resources folder
– Jar validation on client submit wrt deployed Flink dist bundle
giving warning on possibly conflicting classes
• Enable fast compilation (–Dmaven.test.skip=true) FLINK-1827
• Better monitoring, we’re eager to use the new web client!
• Complete Hadoop compatibility
– counters and custom grouping/sorting functions
• Start thinking about education of professionals
– e.g. courses and certification
Future works
• Benchmark Entiton serialization models on
Parquet (Avro vs Thrift vs Protobuf)
• Manage declarative data fusion policies
– a-la LDIF: http://ldif.wbsg.de/
• Define a formal entiton operations algebra (e.g.
merge, project, select, filter)
• Try out Cloudera Kudu
– novel Hadoop storage engine addressing both bulk
loading stability, scan performance and random access
– https://github.com/cloudera/kudu
Conclusions
• We think we are walking along the “last mile”
towards real world enterprise Semantic Applications
• Combining big data and semantics allows us to be
flexible, expressive and, thanks to Flink, very scalable
with very competitive costs
• Apache Flink gives us the leverage to shuffle data
around without much headache
• We proved cool stuff can be done in a simple and
efficient way, with the right tools and mindset
• Hopefully, Flink will help us reducing tax evasion in
Italy, not bad for a German Squirrel, ah?
Thanks for your attention
Any Questions
(before beer)?

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Stephan Ewen
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...Flink Forward
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Flink Forward
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingFlink Forward
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingFabian Hueske
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Martin Junghanns
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)Robert Metzger
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads
Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloadsTill Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads
Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloadsFlink Forward
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesVinay Shukla
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink StreamingTuri, Inc.
 

Was ist angesagt? (20)

Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
 
Apache flink
Apache flinkApache flink
Apache flink
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads
Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloadsTill Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads
Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
 

Andere mochten auch

Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingFlink Forward
 
Aljoscha Krettek – Notions of Time
Aljoscha Krettek – Notions of TimeAljoscha Krettek – Notions of Time
Aljoscha Krettek – Notions of TimeFlink Forward
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkFlink Forward
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Till Rohrmann
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Flink Forward
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Flink Forward
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
 
Flink Case Study: Bouygues Telecom
Flink Case Study: Bouygues TelecomFlink Case Study: Bouygues Telecom
Flink Case Study: Bouygues TelecomFlink Forward
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinFlink Forward
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkFlink Forward
 
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkMartin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkFlink Forward
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Flink Forward
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 BasicFlink Forward
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsFlink Forward
 

Andere mochten auch (20)

Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Aljoscha Krettek – Notions of Time
Aljoscha Krettek – Notions of TimeAljoscha Krettek – Notions of Time
Aljoscha Krettek – Notions of Time
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Flink Case Study: Bouygues Telecom
Flink Case Study: Bouygues TelecomFlink Case Study: Bouygues Telecom
Flink Case Study: Bouygues Telecom
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. Spark
 
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkMartin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and Storms
 

Ähnlich wie S. Bartoli & F. Pompermaier – A Semantic Big Data Companion

Intro to-technologies-Green-City-Hackathon-Athens
Intro to-technologies-Green-City-Hackathon-AthensIntro to-technologies-Green-City-Hackathon-Athens
Intro to-technologies-Green-City-Hackathon-AthensStoitsis Giannis
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
Ladies Be Architects - Integration - Multi-Org, Security, JSON, Backup & Restore
Ladies Be Architects - Integration - Multi-Org, Security, JSON, Backup & RestoreLadies Be Architects - Integration - Multi-Org, Security, JSON, Backup & Restore
Ladies Be Architects - Integration - Multi-Org, Security, JSON, Backup & Restoregemziebeth
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)Cisco DevNet
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...eswcsummerschool
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
20100707 e z_rmll_gig_v1
20100707 e z_rmll_gig_v120100707 e z_rmll_gig_v1
20100707 e z_rmll_gig_v1Gilles Guirand
 
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015Mark Wilkinson
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014nkabra
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With SparkShivaji Dutta
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 
WebSocket in Enterprise Applications 2015
WebSocket in Enterprise Applications 2015WebSocket in Enterprise Applications 2015
WebSocket in Enterprise Applications 2015Pavel Bucek
 
Building bridges - Plone Conference 2015 Bucharest
Building bridges   - Plone Conference 2015 BucharestBuilding bridges   - Plone Conference 2015 Bucharest
Building bridges - Plone Conference 2015 BucharestAndreas Jung
 
Fedora Overview
Fedora OverviewFedora Overview
Fedora Overvieweposthumus
 
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20Phil Wilkins
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFiHortonworks
 
PowerPoint
PowerPointPowerPoint
PowerPointVideoguy
 

Ähnlich wie S. Bartoli & F. Pompermaier – A Semantic Big Data Companion (20)

Intro to-technologies-Green-City-Hackathon-Athens
Intro to-technologies-Green-City-Hackathon-AthensIntro to-technologies-Green-City-Hackathon-Athens
Intro to-technologies-Green-City-Hackathon-Athens
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Ladies Be Architects - Integration - Multi-Org, Security, JSON, Backup & Restore
Ladies Be Architects - Integration - Multi-Org, Security, JSON, Backup & RestoreLadies Be Architects - Integration - Multi-Org, Security, JSON, Backup & Restore
Ladies Be Architects - Integration - Multi-Org, Security, JSON, Backup & Restore
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
 
70487.pdf
70487.pdf70487.pdf
70487.pdf
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
20100707 e z_rmll_gig_v1
20100707 e z_rmll_gig_v120100707 e z_rmll_gig_v1
20100707 e z_rmll_gig_v1
 
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With Spark
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Prashant_Agrawal_CV
Prashant_Agrawal_CVPrashant_Agrawal_CV
Prashant_Agrawal_CV
 
WebSocket in Enterprise Applications 2015
WebSocket in Enterprise Applications 2015WebSocket in Enterprise Applications 2015
WebSocket in Enterprise Applications 2015
 
Building bridges - Plone Conference 2015 Bucharest
Building bridges   - Plone Conference 2015 BucharestBuilding bridges   - Plone Conference 2015 Bucharest
Building bridges - Plone Conference 2015 Bucharest
 
Fedora Overview
Fedora OverviewFedora Overview
Fedora Overview
 
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
PowerPoint
PowerPointPowerPoint
PowerPoint
 

Mehr von Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

Mehr von Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Kürzlich hochgeladen

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Kürzlich hochgeladen (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion

  • 1. A Semantic Big Data Companion Stefano Bortoli @stefanobortoli bortoli@okkam.it Flavio Pompermaier @fpompermaier pompermaier@okkam.it
  • 2. The company (briefly) • Okkam is – a SME based in Trento, Italy. – Started as spin-off of the University of Trento and FBK (2010) • Okkam core business is – large-scale data integration using semantic technologies and an Entity Name System • Okkam operative sectors – Services for public administration – Services for restaurants (and more) – Research projects • FP7, H2020, and Local agencies
  • 3. Who we are • Stefano Bortoli, PhD – works as technical director and researcher at Okkam S.R.L. (Trento, Italy). His research and development interests are in the area of Information Integration, with special focus in entity- centric applications exploiting semantic technologies. • Flavio Pompermaier, MSc. – works as senior software engineer at Okkam S.R.L. (Trento, Italy). Flavio is a passionate developer working with state of the art technologies, combining semantic with big data technologies.
  • 4. Our contributions • Early Flink adopters and promoters (since Stratosphere) • Example on how to use Flink with MongoDB – https://github.com/okkam-it/flink-mongodb-test • 2 pull requests – [FLINK-1928] [hbase] Added HBase write example – [FLINK-1828] [hadoop] Fixed missing call to configure() for Configurable HadoopOutputFormats • 8 JIRA tickets – OPEN FLINK-2503 Inconsistencies in FileInputFormat hierarchy – RESOLVED FLINK-1978 POJO serialization NPE – OPEN FLINK-1834 Is mapred.output.dir conf parameter really required? – CLOSED FLINK-1828 Impossible to output data to an HBase table – OPEN FLINK-1827 Move test classes in test folders and fix scope of test dependencies – OPEN FLINK-2800 kryo serialization problem – RESOLVED FLINK-2394 HadoopOutFormat OutputCommitter is default to FileOutputCommiter – OPEN FLINK-1241 Record processing counter for Dashboard • Hundreads of email threads and discussions 
  • 7. Our Flink Use Cases • Our objective is to build and manage (very) large entity-centric knowledge bases to serve different purposes and domains • So far, we used Apache Flink for: – Domain reasoning (Flink + Parquet + Thrift) – RDF data lifecycle (Flink + Parquet + Jena/Sesame ) – RDF data intelligence (Flink + ELKiBi) – Duplicate record detection (Flink + HBase + Solr) – Entiton Record linkage (Flink + MongoDB + Kryo) – Telemetry analysis (Flink + MongoDB + Weka)
  • 8. Why we need Flink Entiton data model Database record RDF statement Triplestore NOSQL + Indexes + Quad provenance IRI predicate object object Type Subject local IRI Subject Global IRI RDF Type Expensive datawearhouse
  • 9. Entiton using Parquet+Thrift namespace java it.okkam.flink.entitons.serialization.thrift struct EntitonQuad { 1: required string p; //pred 2: required string o; //obj 3: optional string ot; //obj-type 4: required string g; //sourceIRI } struct EntitonAtom { 1: required string s; //local-IRI 2: optional string oid; // ens-IRI 3: required list<string> types; //rdf-types 4: required list<EntitonQuad> quads; // quads } struct EntitonMolecule { 1: required EntitonAtom r; //root atom 2: optional list<EntitonAtom> atoms; //other atoms } Quad Subject local IRI Subject ENS IRI RDF Type
  • 10. Hardware-wise • We compete with expensive data wearhouse solution – e.g. Oracle Exadata Database Machines, IBM Netezza, etc. • Test on small machines fosters optimization – If you don’t want to wait, make your code faster! • Our code is ready to scale, without big investments • Fancy stuff can be done without millions of euros in HW 8 x Gigabyte Brix 16GB RAM 256GB SSD 1T HDD Intel I7 4770 3,2Ghz + 1 Gbit Switch
  • 11. ENS Maintenance • Entity Name System (ENS) (FP7 ’08-’10) – A Web-scale support for minting and reusing persistent entity identifiers for SW or LOD
  • 12. ENS Maintenance • Duplicate detection of 9.2M entities in 6h45 (using Flink incubator 0.6) – Query Apache Solr global index to perform flexible blocking given a subset of attributes of each entity (names) – Distinct/Join pairs of candidate duplicate – Rich Filter implementing Match function – Consolidate sets of candidate duplicates grouping • Tricks: – If HBase does not distribute uniformly regions do rebalance() – Compress byte[] with LZ4 in custom HBase input format to reduce network traffic – Reverse keys to speed up join execution (up to 10%, in some cases)
  • 13. Tax Reasoner • Pilot project for ACI and Val d’Aosta Objectives: to produce analytics and investigate: 1. Who did not pay Vehicle Excise Duty (Kraftfahrzeugsteuer)? 2. Who did not pay Vehicle Insurance? 3. Who skipped Vehicle Inspection? 4. Who did not pay Vehicle Sales Taxes? 5. Who violated exceptions to the above? Dataset: 15 data sources for 5 year with 12M records about 950k vehicles and 500k subjects for a total of 90M facts Challenge: consider events (time) and infer implicit information. Apache Flink jobs: 1. From RDF to Entiton 2. Domain Specific Temporal Inference (Tax Reasoner) 3. Build ElasticSearch Indexes
  • 15. Tax Reasoner Temporal Inference Execution Plan 1h ETA with SSD (2h30 on HDD) on developer machine 11M new facts inferred It took 1 DAY to perform the select query for one of the sources!!
  • 16. RDF Data Intelligence SIREn Solution Pivot BroswerTimeline for details about vehicleBusiness Intelligence Analytics with SIREn Solution KiBi Geospatial indicators
  • 17. Using Entiton with MongoDB • Inspired by the work of Gregg Kellog (2012) – http://www.slideshare.net/gkellogg1/jsonld-and-mongodb • We updated JAOB (Java Architecture for OWL Binding) – Serialize RDF into POJO and viceversa – Provides also a Maven plugin to compile OWL ontology into POJOs – https://github.com/okkam-it/jaob • Database access layer data (using SpringData): – POJO  RDF  JSON-LD + KRYO  MongoDB – MongoDB  KRYO  POJO • Bottom line: – we use (framed) JSON-LD to allow (complex) tree queries on an entiton database modeled according to a domain ontology – We exploit Kryo deserialization for fast reading – We enjoy SpringData abstraction to implement Data access APIs
  • 18. Using Entiton with MongoDB @Document(collection = Entiton.TABLE) @CompoundIndexes({ @CompoundIndex(name = "nestedId", def = "{ 'jsonld.@id': 1 }", unique = true), …. }) public class Entiton<T extends Thing> implements Serializable { @Id private String id; @Version private String version; private DBObject jsonld; private Binary kryo; @GeoSpatialIndexed(type=GeoSpatialIndexType.GEO_2DSPHERE) private Point point; public String javaClass; … }
  • 25. Lesson learned • Reversing String Tuples ids leads to performance improvements of joins • When you make joins, ensure distinct dataset keys • Reuse objects to reduce the impact of garbage collection • When writing Flink jobs, start with small and debuggable unit tests first, then run it on the cluster on the entire dataset (waiting for big data debugging of Dr. Leich@TUB) • Serialization matters: less memory required, less gc, faster data loading  faster execution • HD speed matters when RAM is not enough, SSD rulez • Parquet rulez: self-describing data, push-down filters • Use Gelly consciously, sometimes joins are good enough • If your code crashes, it is usually your fault: from 0.9 Flink is quite stable for batch jobs execution (at least)
  • 26. Suggested improvements • Running jobs on a cluster is always tricky because of bad maven dependencies management in flink – Define a POM for the build settings, a parent, and a bill-of- materials BOM (releasetrain) and common resources folder – Jar validation on client submit wrt deployed Flink dist bundle giving warning on possibly conflicting classes • Enable fast compilation (–Dmaven.test.skip=true) FLINK-1827 • Better monitoring, we’re eager to use the new web client! • Complete Hadoop compatibility – counters and custom grouping/sorting functions • Start thinking about education of professionals – e.g. courses and certification
  • 27. Future works • Benchmark Entiton serialization models on Parquet (Avro vs Thrift vs Protobuf) • Manage declarative data fusion policies – a-la LDIF: http://ldif.wbsg.de/ • Define a formal entiton operations algebra (e.g. merge, project, select, filter) • Try out Cloudera Kudu – novel Hadoop storage engine addressing both bulk loading stability, scan performance and random access – https://github.com/cloudera/kudu
  • 28. Conclusions • We think we are walking along the “last mile” towards real world enterprise Semantic Applications • Combining big data and semantics allows us to be flexible, expressive and, thanks to Flink, very scalable with very competitive costs • Apache Flink gives us the leverage to shuffle data around without much headache • We proved cool stuff can be done in a simple and efficient way, with the right tools and mindset • Hopefully, Flink will help us reducing tax evasion in Italy, not bad for a German Squirrel, ah?
  • 29. Thanks for your attention Any Questions (before beer)?