Scaling up Linked Data

  1. Scaling up Linked Data Presented by: Marin Dimitrov (Ontotext)
  2. EUCLID Objective [Architecture diagram. Labels: Metadata, Streaming providers, Downloads, Other content; Data acquisition (Physical Wrapper, LD Wrapper, R2R Transf.); Integrated Dataset (RDF/XML); Vocabulary Mapping, Interlinking, Cleansing; LD Dataset Access (SPARQL Endpoint, Publishing, RDFa); Musical Content Application; Analysis & Mining Module; Visualization Module] EUCLID – Scaling up Linked Data
  3. Motivation: Music! • Our aim: build a music-based portal using Linked Data technologies • So far (chapters 1, 2, 3 and 5), we have studied different mechanisms for: – Linked Data management via SPARQL queries – Reasoning over Linked Data – Linked Data access (RDF dumps, endpoints, RDFa) – Linked Data storage in repositories • In this chapter, we will study current research and technologies to scale up to very large volumes of Linked Data
  4. Agenda: 1. Introduction to Big (Linked) Data 2. NoSQL databases for Linked Data 3. Hadoop for Linked Data 4. Stream processing for Linked Data 5. … and more
  5. INTRODUCTION TO BIG (LINKED) DATA
  6. Introduction to Big Data • Big Data: management of data that is “too complex” to be processed with traditional solutions • “Big” does not primarily stand for size; it is an analogy for “overwhelming” • “Big” can mean “high variety”, “high volume” or “high velocity”
  7. The 3Vs of Big Data • Variety: different forms of data • Volume: petabytes of data • Velocity: real-time data streams
  8. The 3Vs of Big Data • Variety – Data characteristic: structured, semi-structured and unstructured – Challenge: data integration – Solution: semantic technologies are a good fit • Volume – Data characteristic: large volumes of data – Challenge: reasoning and querying – Solution: distributed storage & processing, parallel processing • Velocity – Data characteristic: streams, sensors, near real-time data, IoT – Challenge: reasoning & querying – Solution: stream reasoning & querying
  9. The Extended Vs of Big Data • Veracity: uncertainty of the data • Variability: variation in meaning in different contexts • Value: turning data into information into insight • Not easy to measure • Depend on context and intended use • Linked Data & Semantic Technologies can help
  10. Beyond Big Data
  11. Beyond Big Data (2) • Semantic Technologies: Semantic technologies extract meaning from data, ranging from quantitative data and text, to video, voice and images. Many of these techniques have existed for years and are based on advanced statistics, data mining, machine learning and knowledge management. One reason they are garnering more interest is the renewed business requirement for monetizing information as a strategic asset. Even more pressing is the technical need. Increasing volumes, variety and velocity (big data) in IM and business operations requires semantic technology that makes sense out of data for humans, or automates decisions. Source: Gartner Inc. “Gartner Identifies Top Technology Trends Impacting Information Infrastructure in 2013”
  12. Towards Big Linked Data • Variety: this characteristic is the most inherent to Linked Data – Agile data model – Different vocabularies • Volume: [chart covering 2007 to 2011] • Velocity – RDF Streams – Semantic Sensors
  13. Towards Big Linked Data (2)
  14. Big Linked Data & Linked Big Data • Exponential growth of Linked Data in the last five years • Big Linked Data: Big Data approach adopted by the Linked Data community, especially to handle Volume and Velocity • Linked Big Data: Linked Data approach adopted by the Big Data community – RDF data model for Variety – Enrich Big Data with metadata and semantics – Interlink Big Data sets & reduce duplication – Simplify data access, discovery & integration. Source: M. Dimitrov. “Semantic Technologies for Big Data”
  15. NOSQL DATABASES FOR LINKED DATA
  16. RDF Databases • Native or RDBMS-based RDF databases – OWLIM (http://www.ontotext.com/owlim) – Virtuoso Universal Server (http://virtuoso.openlinksw.com/) – Stardog (http://stardog.com) – AllegroGraph (http://www.franz.com/agraph/allegrograph/) – Systap Bigdata (http://www.systap.com/) – Jena TDB (http://jena.apache.org/documentation/tdb/) – Oracle, DB2
  17. RDF Database Advantages • RDF (graph) based data model – Global identifiers of resources/entities – Agile schema • Inference of implicit facts – Forward, backward, hybrid reasoning strategy • Expressive query language (SPARQL) • Compliance with standards
  18. NoSQL Databases • “Not Only SQL” • A group of database technologies that do not follow the relational data model • Typical requirements – Distributed – High availability – Handle big data & query volumes (scalability) – Hierarchical or graph data structures – Flexible schema
  19. NoSQL Taxonomy • Key/value stores – Each key is associated with a value (DHT) • Wide-column stores – Each key is associated with many attributes; columns are stored together • Document databases – Each key is associated with a complex data structure • Graph databases – Data is represented as nodes and edges
  20. Key/Value Stores • Efficient key/value lookups • Schema-less • Simpler read/write operations – Low latency & high throughput • Examples – DynamoDB, Azure Table Storage, Riak, Redis, MemcacheDB, Voldemort
  21. Wide-Column Stores • A key is associated with several attributes • Data in the same column is stored together • Efficient for complex aggregations over data • Schema-less / dynamic schema • Easy to add new columns • Columns can be grouped together (column family) • Examples: – HBase (http://hbase.apache.org) – Cassandra (http://cassandra.apache.org) • Example table (Artist, Album, Song): (The Beatles, Let it be, Get back); (Queen, Jazz, Fun it)
  22. HBase • Open source column-oriented store • Based on Google’s BigTable • Built on top of HDFS and Hadoop • Horizontally scalable, automatic sharding • High availability / automatic failover • Strongly consistent reads/writes • Java/REST API
  23. Document Databases • Each key is associated with a complex data structure (document) • Documents can contain key/value pairs, key/array pairs, or even nested structures • Schema-less / dynamic schema – New fields can be easily added to the document structure • Typical document formats – JSON, XML • Examples: – Couchbase (http://www.couchbase.com) – MongoDB (http://www.mongodb.org)
  24. Document Databases (2) • Example:
     Key “The Beatles”: { Homepage: "thebeatles.com", Origin: "Liverpool", Albums: [ {Title: "Let it be", Year: "1970", Duration: "35:16"}, {Title: "Help!", Year: "1965"}, {Title: "Revolver", Year: "1966", Duration: "35:01"} ] }
     Key “Elvis Presley”: { FullName: "Elvis Aaron Presley", Homepage: "elvis.com", Origin: "Memphis", Albums: [ {Title: "Blue Hawaii", Year: "1961", Duration: "32:02"} ] }
  25. Couchbase • Document-oriented database – Documents are stored as JSON • Flexible schema – Document structure easy to change • Optimised to run in-memory and on several nodes – Ejection and eventual persistence • Incremental views & indexes • Scalability, rebalancing, replication, failover • RESTful API
  26. Graph Databases: Motivation • Graphs: representation of highly connected data • Examples: network of friends in a high school; relationships among artists in Last.fm (http://sixdegrees.hu/last.fm/); a fragment of Facebook; relationships between tweets
  27. Graph Databases • Based on the property graph model • Support for query languages and core graph-based tasks – Reachability, traversal, adjacency and pattern matching • Examples – Neo4j (http://neo4j.org) – Dex (http://sparsity-technologies.com/dex.php) – HyperGraphDB (http://www.hypergraphdb.org)
  28. Graph Databases: Property Graph Model Example • Nodes and edges may have properties • Properties: key-value pairs • The Beatles (Homepage: thebeatles.com, Origin: Liverpool) created: Let it be (Year: 1970, Duration: 35:16), Help! (Year: 1965), Revolver (Year: 1966, Duration: 35:01) • Elvis Presley (Fullname: Elvis Aaron Presley, Homepage: elvis.com, Origin: Memphis) created: Blue Hawaii (Year: 1961, Duration: 32:02)
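The property graph model on this slide can be sketched in a few lines of plain Python. The `Node` and `Edge` classes and the `neighbours` helper are illustrative assumptions, not the API of any graph database, and attaching Year/Duration to the album nodes is one possible reading of the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)  # key-value pairs on the node

@dataclass
class Edge:
    source: Node
    label: str
    target: Node
    properties: dict = field(default_factory=dict)  # edges may carry properties too

beatles = Node("The Beatles", {"Homepage": "thebeatles.com", "Origin": "Liverpool"})
let_it_be = Node("Let it be", {"Year": 1970, "Duration": "35:16"})
help_album = Node("Help!", {"Year": 1965})
revolver = Node("Revolver", {"Year": 1966, "Duration": "35:01"})

edges = [Edge(beatles, "created", album) for album in (let_it_be, help_album, revolver)]

def neighbours(node, label, edges):
    """Adjacency lookup: follow outgoing edges that carry the given label."""
    return [e.target for e in edges if e.source is node and e.label == label]

albums = [n.name for n in neighbours(beatles, "created", edges)]
```

A real graph database would index adjacency instead of scanning the edge list, which is what makes traversal cheap on highly connected data.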
  29. Neo4j • Graph database – Nodes, Relationships, Properties, Paths – Indexes over properties • Flexible schema • Cypher graph query language • ACID transactions • High availability, distributed clusters • RESTful and Java APIs
  30. Rya • RDF store based on Accumulo – Column-store, HDFS – Sesame query parser, SAIL implementation • 3-table index – SPO, POS, OSP – Sufficient for all triple patterns – All triple parts (S, P, O) encoded in the RowID – Clustered index. Source: R. Punnoose, A. Crainiceanu, D. Rapp “Rya: A Scalable RDF Triple Store for the Clouds”
  31. Rya (2) • Query processing – Sesame (SPARQL) query plan translated to Accumulo range scans & lookups – Parallel scans for joins (10–20x speedup) – Batch scans (Accumulo) to reduce the number of range scans – Statistics for triple pattern selectivity, query re-ordering • Performance evaluation (LUBM) – No significant degradation when the data grows by 2–3 orders of magnitude. Source: R. Punnoose, A. Crainiceanu, D. Rapp “Rya: A Scalable RDF Triple Store for the Clouds”
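Why three index orders (SPO, POS, OSP) are enough for every triple pattern can be sketched with sorted in-memory lists standing in for the Accumulo tables. This is a simplified illustration under stated assumptions, not Rya's code: the separator `SEP`, the helper names and the sample triples are all hypothetical.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator between triple parts inside a row key

def row_key(*parts):
    """Encode the parts of a triple (or of a prefix) into one row key."""
    return "".join(part + SEP for part in parts)

def build_indexes(triples):
    """Three sorted key lists; the whole triple lives in the key, values are empty."""
    return {
        "spo": sorted(row_key(s, p, o) for s, p, o in triples),
        "pos": sorted(row_key(p, o, s) for s, p, o in triples),
        "osp": sorted(row_key(o, s, p) for s, p, o in triples),
    }

def range_scan(keys, prefix):
    """Return every key that starts with prefix; they are contiguous in sorted order."""
    hits = []
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits

def match(indexes, s=None, p=None, o=None):
    """Answer a triple pattern (None means variable) with a single prefix scan."""
    if s is not None and (p is not None or o is None):
        order, bound = "spo", [s, p, o]
    elif p is not None:
        order, bound = "pos", [p, o, s]
    elif o is not None:
        order, bound = "osp", [o, s, p]
    else:
        order, bound = "spo", []          # no bound part: full scan
    prefix_parts = []
    for part in bound:                     # keep only the leading bound parts
        if part is None:
            break
        prefix_parts.append(part)
    results = []
    for key in range_scan(indexes[order], row_key(*prefix_parts)):
        parts = dict(zip(order, key.split(SEP)))  # decode back to s, p, o
        results.append((parts["s"], parts["p"], parts["o"]))
    return results

triples = [
    ("Beatles", "created", "LetItBe"),
    ("Beatles", "created", "Help"),
    ("Queen", "created", "Jazz"),
    ("Beatles", "origin", "Liverpool"),
]
indexes = build_indexes(triples)
```

Each of the eight bound/unbound combinations maps to a prefix of one of the three orders, so no pattern needs a filter over unrelated rows.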
  32. “NoSQL Databases for RDF: An Empirical Evaluation” • Goal – Store RDF data in HBase, Couchbase, Hive & Cassandra – Benchmark query performance against a native distributed RDF database (4store) • HBase prototype – Jena for SPARQL queries – 3 index tables (SPO, POS, OSP) – Row key encodes S+P+O, cells are empty – Jena query plan translated to HBase filters & lookups. Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
  33. “NoSQL Databases for RDF: An Empirical Evaluation” (2) • Hive+HBase prototype – SPARQL to HiveQL translation – Property table • Row key is S • A column for each P • Cell value stores O • Multi-valued attributes have different timestamps. Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
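The property-table layout described on this slide can be sketched as nested dictionaries. This is a minimal in-memory illustration, not the prototype's HBase schema: the counter standing in for cell timestamps and the function names are assumptions.

```python
from collections import defaultdict
from itertools import count

def build_property_table(triples):
    """rows[s][p] holds a list of (timestamp, o) cell versions."""
    rows = defaultdict(lambda: defaultdict(list))
    clock = count(1)  # hypothetical stand-in for cell timestamps
    for s, p, o in triples:
        # Row key is S, column is P; each extra value is a new timestamped version.
        rows[s][p].append((next(clock), o))
    return rows

def values(rows, s, p):
    """Read all versions of one cell, oldest first."""
    return [o for _, o in rows.get(s, {}).get(p, [])]

table = build_property_table([
    ("The Beatles", "album", "Let it be"),
    ("The Beatles", "origin", "Liverpool"),
    ("The Beatles", "album", "Help!"),
])
```

Lookups by subject touch a single row, which suits star-shaped queries; patterns with an unbound subject would need a scan or an additional index.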
  34. “NoSQL Databases for RDF: An Empirical Evaluation” (3) • CumulusRDF prototype – Sesame for SPARQL queries, Cassandra for data management – 3 index tables (SPO, POS, OSP) – Sesame query plan translated to Cassandra index lookups • Couchbase prototype – Map RDF into JSON documents • All triples with the same S are stored in the same document (molecule) • 2 JSON arrays for Ps and Os – Jena as a SPARQL query engine – 3 indexes (Couchbase views): SPO, POS, OSP. Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
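The Couchbase mapping (one document per subject, with parallel arrays for predicates and objects) might look like the sketch below. The field names `p` and `o` and the helper functions are assumptions made for illustration; the slide does not give the prototype's exact document layout.

```python
import json

def to_documents(triples):
    """Group triples by subject into one JSON-style document (molecule) each."""
    docs = {}
    for s, p, o in triples:
        doc = docs.setdefault(s, {"p": [], "o": []})
        doc["p"].append(p)   # predicates and objects sit at matching positions
        doc["o"].append(o)
    return docs

def triples_of(docs, s):
    """Rebuild the triples of one subject from its document."""
    doc = docs.get(s, {"p": [], "o": []})
    return [(s, p, o) for p, o in zip(doc["p"], doc["o"])]

docs = to_documents([
    ("ex:Queen", "ex:album", "ex:Jazz"),
    ("ex:Queen", "ex:song", "ex:FunIt"),
    ("ex:Beatles", "ex:album", "ex:LetItBe"),
])
queen_json = json.dumps(docs["ex:Queen"])
```

Here `queen_json` is the serialized document that would be stored under the key `ex:Queen`.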
  35. “NoSQL Databases for RDF: An Empirical Evaluation” (4) • Benchmarks – BSBM with 10M, 100M and 1B triples – Clusters of 1, 2, 4, 8 and 16 nodes – AWS cost & query execution time. Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
  36. “NoSQL Databases for RDF: An Empirical Evaluation” (5) • Results – Simple SPARQL queries can be executed more efficiently on a NoSQL datastore – Data loading time for some NoSQL datastores is comparable to or better than that of the native RDF store – Complex SPARQL queries perform significantly slower on NoSQL systems • Query optimisations are required – MapReduce operations (Hive & Couchbase) introduce high latency for view maintenance / query execution. Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
  37. HADOOP FOR LINKED DATA
  38. Working with Distributed Data • Apache Hadoop (http://hadoop.apache.org) is an open source implementation of MapReduce • MapReduce – Distributed batch processing – Map phase partitions the input set (K/V pairs), Reduce phase performs aggregated processing over the partitions in parallel – Shuffle intermediate results (from Map nodes to Reduce nodes) • Allows for the processing of large distributed data sets across clusters of computers – On a distributed file system (HDFS) – Scales up to thousands of nodes, each offering local processing power and storage
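The map, shuffle and reduce phases can be illustrated with a single-process sketch. It only mimics the data flow and is not Hadoop's API; the function names and the album-counting job are assumptions chosen to match the deck's music example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key, as the shuffle between map and reduce does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Aggregate each key's values; on a cluster each key group can run in parallel."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

albums = [("The Beatles", "Let it be"), ("Queen", "Jazz"), ("The Beatles", "Help!")]
album_counts = run_job(
    albums,
    mapper=lambda record: [(record[0], 1)],      # emit (artist, 1) per album
    reducer=lambda artist, ones: sum(ones),      # count albums per artist
)
```

On a real cluster the mapper runs on the nodes that hold the input blocks and the shuffle moves data over the network, which is usually the expensive step.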
  39. “Scalable Distributed Reasoning with MapReduce” • Goal – Utilise Hadoop for large-scale reasoning • Approach – Implement each RDFS rule (join) via a Map & Reduce function – Map outputs the original triple as value, and the join term as key – Reducer receives all triples needed to perform the join. Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
  40. “Scalable Distributed Reasoning with MapReduce” (2) [Figure] Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
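As a sketch of this approach, the RDFS subclass rule (if x has type C and C is a subclass of D, then x has type D) can be written as a map function that emits the join term as key and a reduce function that performs the join. The choice of rule, the single-process driver `apply_rule` and the `ex:` names are assumptions for illustration, not the paper's code.

```python
from collections import defaultdict

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def mapper(triple):
    """Emit (join term, original triple); the join term is the shared class C."""
    s, p, o = triple
    if p == TYPE:
        yield o, triple      # x rdf:type C           -> key C
    elif p == SUBCLASS:
        yield s, triple      # C rdfs:subClassOf D    -> key C

def reducer(key, triples):
    """All triples sharing one class arrive together, so the join is local."""
    instances = [s for s, p, _ in triples if p == TYPE]
    superclasses = [o for _, p, o in triples if p == SUBCLASS]
    for x in instances:
        for d in superclasses:
            yield (x, TYPE, d)

def apply_rule(triples):
    """One MapReduce pass, simulated in a single process."""
    groups = defaultdict(list)
    for triple in triples:
        for key, value in mapper(triple):
            groups[key].append(value)
    derived = set()
    for key, values in groups.items():
        derived.update(reducer(key, values))
    return derived

data = [
    ("ex:Revolver", TYPE, "ex:Album"),
    ("ex:Album", SUBCLASS, "ex:MusicalWork"),
    ("ex:MusicalWork", SUBCLASS, "ex:Work"),
]
derived = apply_rule(data)
```

A single pass derives only one step of the class hierarchy; reaching the full closure means repeating the job on the enlarged data set until nothing new is derived.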
  41. “Scalable Distributed Reasoning with MapReduce” (3) • Challenge – Too many duplicates (unique to derived triple ratio of 1:50) • Optimisations – Replicate schema triples on each node (in memory) • Needed for each join; usually a small set – Rule re-ordering • Which rule may be triggered by another rule? • Reduce the number of required iterations. Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
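The schema-replication optimisation can be sketched as a map-side join: every node holds the small schema in memory, so instance triples are joined where they are stored and only the derived triples need deduplication. The function names and sample data are hypothetical, and the Python set stands in for the duplicate-removal step.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def load_schema(triples):
    """Collect the subclass pairs that each node would keep in memory."""
    schema = {}
    for s, p, o in triples:
        if p == SUBCLASS:
            schema.setdefault(s, set()).add(o)
    return schema

def mapper(triple, schema):
    """Join one instance triple against the in-memory schema; no shuffle is needed for the join."""
    s, p, o = triple
    if p == TYPE:
        for superclass in schema.get(o, ()):
            yield (s, TYPE, superclass)

def derive(instance_triples, schema):
    derived = set()                      # the set removes duplicate derivations
    for triple in instance_triples:
        derived.update(mapper(triple, schema))
    return derived

schema = load_schema([("ex:Album", SUBCLASS, "ex:Work")])
instances = [
    ("ex:Revolver", TYPE, "ex:Album"),
    ("ex:Revolver", TYPE, "ex:Album"),   # duplicate input yields a duplicate derivation
    ("ex:Help", TYPE, "ex:Album"),
]
inferred = derive(instances, schema)
```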
  42. “Scalable Distributed Reasoning with MapReduce” (4) • Results – Throughput of 4.5M triples/sec on a 16-node cluster – More than 16 nodes do not improve performance significantly. Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
  43. Lessons Learned from Large-scale Reasoning (J. Urbani) • 1st Law: Treat schema triples differently – Replicate on all nodes to minimise subsequent data transfer • 2nd Law: Data skew dominates data distribution – No universal partitioning scheme for input data – Computation tasks moved to the nodes storing the data (data locality) • 3rd Law: Certain problems only appear at a very large scale – Proof-of-concept prototypes are often not representative. Source: Jacopo Urbani “Three Laws Learned from Web-scale Reasoning”
  44. STREAM PROCESSING FOR LINKED DATA
  45. Streaming Data • A large amount of new data is constantly being created, or existing data is being updated at a rapid rate – Traffic data, sensor networks, social networks, financial markets • Many data sources create a constant “stream of information” – Not always practical to store all data and then query it – Continuous queries over transient data • More recent data is more important – Describes the current state of a dynamic system
  46. Stream Processing • Streams are observed through windows • Continuous queries can be registered over the stream • Continuous queries are iteratively evaluated over the data in the current window – Can leverage static background knowledge (e.g., schema information) • Generates a stream of answers
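The window-based evaluation described above can be sketched as follows. This is a batch simulation over a finite list, not a streaming engine; the window convention (half-open intervals starting at time 0), the function names and the toll-gate data are assumptions made for illustration.

```python
def sliding_windows(stream, width, step):
    """Yield (window_end, items) for windows [end - width, end) over (timestamp, item) pairs."""
    stream = list(stream)
    if not stream:
        return
    last = max(ts for ts, _ in stream)
    end = width
    while end - width <= last:          # stop once a window starts after the last item
        yield end, [item for ts, item in stream if end - width <= ts < end]
        end += step

def continuous_query(stream, background, width, step):
    """Re-evaluate the same query on each window, producing a stream of answers."""
    for end, items in sliding_windows(stream, width, step):
        # Join the window contents with static background knowledge (toll -> district).
        districts = sorted({background[toll] for toll, _car in items if toll in background})
        yield end, districts

readings = [(1, ("toll1", "carA")), (12, ("toll2", "carB")), (25, ("toll1", "carC"))]
toll_district = {"toll1": "North", "toll2": "South"}
answers = list(continuous_query(readings, toll_district, width=20, step=10))
```

With a width of 20 and a step of 10, consecutive windows overlap, so the reading at time 12 contributes to two answers.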
  47. Linked Stream Data • A representation of sensor/stream data following the Linked Data principles – Sensor data can be enriched with semantics – Facilitates data discovery and integration of heterogeneous data sources • Challenges – RDF triples must be annotated with timestamps – Extensions to the SPARQL language: windows, continuous queries, streaming operators – Continuous semantics – Scalability (Volume) – High throughput and low latency (Velocity) – Approximate reasoning
  48. Querying Streams with SPARQL Extensions • Queries over streaming data are evaluated by specifying continuous queries • The results of a continuous query are updated as new data arrives • Several SPARQL extensions with streaming operators based on CQL (Continuous Query Language) – C-SPARQL – SPARQLStream – EP-SPARQL, CQELS, Instants
  49. C-SPARQL (1) • C-SPARQL is an extension of SPARQL 1.1
     1. RDF Streams: sequence of RDF triples annotated with timestamps: <(s,p,o), timestamp>
     2. FROM STREAM extension for stream sources and windows:
        FromStrClause → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[RANGE' Window ']'
        Window → LogicalWindow | PhysicalWindow
        LogicalWindow → Number TimeUnit WindowOverlap
        TimeUnit → 'MSEC' | 'SEC' | 'MIN' | 'HOUR' | 'DAY'
        WindowOverlap → 'STEP' Number TimeUnit | 'TUMBLING'
        PhysicalWindow → 'TRIPLES' Number
  50. C-SPARQL (2)
     3. Registration • Creates a continuous query over the data source • The query output is variable bindings, an RDF graph, or a new stream
        Registration → 'REGISTER' ('QUERY'|'STREAM') QName 'AS' Query
  51. C-SPARQL (3) • Example query: retrieve each car registered at a toll, together with the district that contains the toll's street.
        REGISTER QUERY CarsEnteringInDistricts AS
        SELECT DISTINCT ?district ?car
        FROM STREAM <www.uc.eu/tollgates.trdf> [RANGE 40 SEC STEP 10 SEC]
        WHERE {
          ?toll t:registers ?car .
          ?toll c:placedIn ?street .
          ?district c:contains ?street .
        }
     Source: Barbieri, Davide Francesco, et al. "Querying rdf streams with c-sparql." ACM SIGMOD Record 39.1 (2010): 20-26.
  52. C-SPARQL (4) [Figure] Source: M. Balduini et al. “Tutorial on Stream Reasoning for Linked Data (ISWC’2013)”
  53. SPARQLStream (1) • Utilizes the same definition of RDF streams as in C-SPARQL: <(s,p,o), timestamp> • The language is defined as follows:
        NamedStream → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[' Window ']'
        Window → 'NOW-' Integer TimeUnit [UpperBound] [Slide]
        UpperBound → 'TO NOW-' Integer TimeUnit
        Slide → 'SLIDE' Integer TimeUnit
        TimeUnit → 'MS' | 'S' | 'MINUTES' | 'HOURS' | 'DAY'
        Select → 'SELECT' [XStream] [DISTINCT | REDUCED] …
        XStream → 'ISTREAM' | 'DSTREAM' | 'RSTREAM'
     Source: Jean-Paul Calbimonte and Oscar Corcho. “SPARQLStream: Ontology-based access to data streams.” Tutorial at ISWC 2013
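The three XStream operators follow CQL's relation-to-stream operators. Assuming those CQL semantics carry over unchanged, they can be sketched as set operations on the query results of two consecutive window evaluations; the function names and sample values are illustrative only.

```python
def rstream(previous, current):
    """RSTREAM: emit every result of the current window evaluation."""
    return set(current)

def istream(previous, current):
    """ISTREAM: emit only results that are new compared with the previous evaluation."""
    return set(current) - set(previous)

def dstream(previous, current):
    """DSTREAM: emit results that were present before and have now disappeared."""
    return set(previous) - set(current)

previous_results = {"obs1", "obs2"}
current_results = {"obs2", "obs3"}
emitted = {
    "RSTREAM": rstream(previous_results, current_results),
    "ISTREAM": istream(previous_results, current_results),
    "DSTREAM": dstream(previous_results, current_results),
}
```

ISTREAM is the usual choice when a consumer only wants to be told about new matches, while RSTREAM repeats overlapping results in every evaluation.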
  54. SPARQLStream (2) • Example query: retrieve an RSTREAM with the observations captured by all sensors in the last 10 minutes.
        PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT RSTREAM ?sensor ?observation
        FROM STREAM <www.semsorgrid4env.eu/SensorReadings.srdf> [FROM NOW - 10 MINUTES TO NOW STEP 1 MINUTE]
        WHERE {
          ?observation a ssn:Observation;
                       ssn:observedBy ?sensor .
        }
  55. Classification of Existing Systems [Figure] Source: M. Balduini et al. “Tutorial on Stream Reasoning for Linked Data (ISWC’2013)”
  56. W3C Semantic Sensor Networks • SSN Ontology – http://www.w3.org/2005/Incubator/ssn/ssnx/ssn – OWL DL ontology – Used to semantically describe sensors, sensor networks & data – Recommendations for applying the ontology for Linked Sensor Data
  57. W3C Semantic Sensor Networks (2) • Different perspectives – Sensor, data/observation, system
  58. … AND MORE
  59. A Trillion RDF Triples • Use case – Use RDF and Linked Data for the customer management database of a big telecom – Franz Inc / AllegroGraph
  60. uRiKA Appliance • YarcData • Big Data appliance for graph analytics – 8K processors, 1TB RAM – In-memory RDF database – SPARQL 1.1 support
  61. RDFS Reasoning on GPUs • Similar approach to Urbani et al. for large-scale reasoning with Hadoop – Handle rules with 2 antecedents – Rule reordering – Dictionary encoding • Shared-memory architecture – Efficient GPU algorithm implementation is challenging. Source: Norman Heino & Jeff Z. Pan “RDFS Reasoning on Massively Parallel Hardware” ISWC 2012
  62. RDFS Reasoning on GPUs (2) • Data parallelism – Apply one rule (thread) to one instance triple, join to a schema triple if possible – Hundreds / thousands of threads working in parallel • Challenge – Duplicate removal • Benchmark – 5x speedup of computation – But… memory transfer overhead is significant. Source: Norman Heino & Jeff Z. Pan “RDFS Reasoning on Massively Parallel Hardware” ISWC 2012
  63. Benchmarks • BSBM v3.1 (April 2013) – http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/ – Includes benchmarks with up to 150 billion triples – 750x scale increase since the last BSBM result (200M triples) • LDBC – Industry-neutral, non-profit organisation – Benchmarks for RDF and graph databases, similar to TPC – Big data volume, complex queries
  64. SUMMARY
  65. Summary • Linked Data is a good fit for the Variety challenge of Big Data • Linked Data can simplify data discovery, data access and data integration challenges for Big Data • Exponential growth of Linked Data • Linked Data benchmarks target bigger workloads
  66. Summary (2) • Ongoing R&D towards scaling up Linked Data for high data Volume and Velocity – NoSQL datastores for RDF data management – Hadoop for scalable RDF reasoning – GPUs for scalable RDF reasoning • Adapting Linked Data & SPARQL for streaming data scenarios
  67. For exercises, quizzes and further material, visit our website: http://www.euclid-project.eu • Other channels: @euclid_project, euclidproject • eBook • Course