m2r2: A Framework for Results Materialization and Reuse

This presentation introduces m2r2, a framework for materializing and reusing results in high-level dataflow systems for big data. The framework operates at the logical plan level so that it stays independent of the query language and the execution engine. Its components are a plan matcher and rewriter, a plan optimizer, a results cache, a plan repository, and a garbage collector. In an evaluation with the TPC-H benchmark on Pig, reusing past results reduced query execution time by 65% on average for queries with a reuse opportunity. Future work includes integration with other high-level systems and lowering materialization cost by overlapping it with regular query execution.

m2r2: A Framework for Results
Materialization and
Reuse in High-Level Dataflow Systems
for Big Data
2nd International Conference
on Big Data Science and Engineering (BDSE 2013)
Vasiliki Kalavri, Hui Shang, Vladimir Vlassov
{kalavri, hshang, vladv}@kth.se
4 December 2013, Sydney, Australia

Outline
➔ Motivation
➔ Materialized Views in Relational DBMSs
➔ High-Level Dataflow Systems for Big Data
◆ similarities in design and implementation
➔ m2r2 design
◆ design goals and system components
➔ Prototype Implementation Details
➔ Evaluation Results
➔ Conclusions and Future Work

Motivation
➔ Avoid computational redundancies
◆ filter out bad records, spam e-mail
◆ data representation transformations
➔ Microsoft has found a 30%-60% similarity
in queries submitted for execution
➔ A Berkeley MapReduce workload
characterization study shows a strong need
for caching job results

Materialized Views in RDBMSs
➔ A derived relation, stored in the database
◆ Queries are computed using the views instead of
the base relations
➔ Challenges
◆ View Design: What to materialize?
◆ View Maintenance: How to update the views?
◆ View Exploitation: How to use the views for query
optimization?
● view matching and query rewriting

High-Level Dataflow Systems (1)
High-Level Dataflow Systems for Big Data
(Pig, Hive, Jaql, DryadLINQ, etc.) exhibit
broad similarities at multiple design levels:
➔ Language Layer
◆ Declarative, SQL-like language
◆ Statements define transformations on collections of datasets
➔ Data Operators
◆ Encapsulate the logic of the transformations to be performed
◆ Relational, Expressions, Control-flow

High-Level Dataflow Systems (2)
Pig Latin
HiveQL
Jaql

High-Level Dataflow Systems (3)
● The Logical Plan
○ Parser → AST → DAG of operators
● Compilation to an Execution Plan
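
The "DAG of operators" produced from the AST can be pictured with a small data structure. The following is only a minimal sketch, not Pig's or Hive's actual plan classes: the `Operator` type, its fields, and the example query are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """One node of a logical plan (hypothetical type, not a real Pig class)."""
    kind: str           # e.g. LOAD, FILTER, GROUP, FOREACH, STORE
    params: str = ""    # operator arguments, kept as plain text here
    inputs: tuple = ()  # upstream operators feeding this one


def topological_order(sink):
    """Return the operators so that every input precedes its consumer."""
    order, seen = [], set()

    def visit(op):
        if id(op) in seen:
            return
        seen.add(id(op))
        for parent in op.inputs:
            visit(parent)
        order.append(op)

    visit(sink)
    return order


# A made-up query: load, filter, group, aggregate, store.
load = Operator("LOAD", "lineitem")
flt = Operator("FILTER", "l_quantity > 10", (load,))
grp = Operator("GROUP", "l_orderkey", (flt,))
agg = Operator("FOREACH", "SUM(l_extendedprice)", (grp,))
store = Operator("STORE", "out", (agg,))

plan_kinds = [op.kind for op in topological_order(store)]
```

Walking the plan from the sink back to its inputs gives the order in which a compiler could translate operators into an execution plan.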

m2r2: materialize - match - rewrite - reuse
➔ A language-independent, extensible
framework for
◆ storing
◆ managing and
◆ using
previous job and sub-job results
➔ Operates on the logical plan level, in
order to support different languages and
backend execution engines

m2r2 Components
➔ Plan Matcher and Rewriter
◆ How to be independent of the high-level
language and execution engine?
◆ Shark: Hive on Spark, PonIC: Pig on
Stratosphere, etc.? → Match at the Logical Plan
level!
➔ Plan Optimizer
➔ Results Cache
➔ Plan Repository
➔ Garbage Collector
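
To make the idea of matching at the logical plan level concrete, here is a minimal sketch under simplifying assumptions. It is not the match and rewrite algorithm from the talk, whose details are not recoverable from this transcript. It assumes exact sub-plan matches only, uses a plain dictionary in place of the Plan Repository, and the `Op` type, the fingerprint scheme, and the cache path are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Logical plan node (hypothetical type used only for this sketch)."""
    kind: str
    params: str = ""
    inputs: tuple = ()


def fingerprint(op):
    """Hash an operator together with the fingerprints of its inputs."""
    parts = [op.kind, op.params] + [fingerprint(i) for i in op.inputs]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


def rewrite(op, repository):
    """Replace each largest matched sub-plan with a LOAD of its cached result.

    `repository` maps sub-plan fingerprints to result locations. The check
    runs top-down, so a match on a larger sub-plan wins over smaller ones.
    """
    cached_path = repository.get(fingerprint(op))
    if cached_path is not None:
        return Op("LOAD", cached_path)
    return Op(op.kind, op.params,
              tuple(rewrite(i, repository) for i in op.inputs))


# An earlier job materialized the filtered relation (hypothetical path).
load = Op("LOAD", "lineitem")
filtered = Op("FILTER", "l_quantity > 10", (load,))
repository = {fingerprint(filtered): "/cache/filtered_lineitem"}

# A new query shares the LOAD + FILTER prefix with the earlier job.
query = Op("STORE", "out", (Op("GROUP", "l_orderkey", (filtered,)),))
rewritten = rewrite(query, repository)
```

Because the sketch only inspects operator kinds, parameters, and inputs, nothing in it depends on the source language or on the engine that eventually runs the plan, which is the property the slide argues for.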

m2r2 Implementation
➔ Built on top of
Pig/Hadoop
➔ HDFS as the Results Cache
➔ MySQL Cluster as the
Repository
◆ in-memory, highly-available
and fault-tolerant
➔ Garbage Collection as a
separate module
◆ eviction policy based on reuse
frequency and last access time
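
A garbage collection policy that combines reuse frequency with last access time might look like the sketch below. The slides do not say how the two criteria are combined, so requiring both conditions, the threshold values, and the `CacheEntry` fields are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """Metadata for one materialized result (hypothetical record layout)."""
    path: str
    reuse_count: int    # how many later queries reused this result
    last_access: float  # seconds since the epoch


def select_for_eviction(entries, now, min_reuses=2,
                        max_idle_seconds=7 * 24 * 3600):
    """Return entries that are rarely reused AND have been idle for too long.

    Both thresholds are illustrative defaults, not values from the talk.
    """
    return [e for e in entries
            if e.reuse_count < min_reuses
            and now - e.last_access > max_idle_seconds]


now = 1_000_000_000.0
day = 24 * 3600
entries = [
    CacheEntry("/cache/a", reuse_count=5, last_access=now - 30 * day),
    CacheEntry("/cache/b", reuse_count=0, last_access=now - 30 * day),
    CacheEntry("/cache/c", reuse_count=0, last_access=now - 3600),
]
evicted_paths = [e.path for e in select_for_eviction(entries, now)]
```

Here only `/cache/b` is selected: `/cache/a` is old but frequently reused, and `/cache/c` is unused so far but was written recently and may still be picked up.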

Evaluation Setup
➔ Cluster Setup
◆ Pig 0.11, Hadoop 1.0.4 and MySQL Cluster 7.2.12
deployed on top of OpenStack
◆ 20 Ubuntu 11.10 VMs
➔ Data and Queries
◆ TPC-H Benchmark for Pig
◆ 20 queries, 6 of which offer a reuse
opportunity
◆ 107 GB of data generated with the TPC-H DBGEN tool

Conclusions
➔ The logical plan is the proper layer to
build a language-independent reuse
framework
➔ When a reuse opportunity exists,
query execution time can be reduced
substantially
◆ 65% on average in our experiments
➔ The materialization overhead is
small and dominated by I/O
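
As a reading aid for the 65% figure, the short sketch below converts a relative reduction in execution time into the equivalent speedup factor for a single query. It is plain arithmetic, not a result from the evaluation, and an average of per-query reductions does not in general equal the average speedup; the function name is hypothetical.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional time reduction (0 <= r < 1) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# A query that runs in 35% of its original time is about 2.86 times faster.
speedup = speedup_from_reduction(0.65)
```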

Future Work
➔ Integrate with other high-level systems
➔ Explore the possibility of sharing results
among different frameworks
➔ Obtain execution traces and perform a
more realistic evaluation
➔ Minimize costs by overlapping
materialization with regular query
execution

Slide 10: Match and Rewrite Algorithm (slide content not recovered)

Slide 13: Speedup using Sub-Jobs (slide content not recovered)

Slide 14: Speedup using Whole Jobs (slide content not recovered)
