high_level_parallel_processing_model


This document summarizes and compares three high-level parallel processing models: Pig Latin, SCOPE, and Hive. It discusses how each addresses the limitations of traditional approaches to large-scale data analysis by providing a high-level scripting language that is compiled into optimized parallel tasks. While the underlying ideas are similar, the models differ in programming style, extensibility, data model, and optimization strategy. Each makes its own tradeoffs among flexibility, performance, and usability for large-scale data analysis.

High Level Parallel Processing Models for
Data Analysis
Mingliang Sun

Motivation

● Ever-increasing amount of data

● High cost of traditional approaches

● Limitation of the bare MapReduce
approach

Example
A. Pavlo et al., “A Comparison of Approaches to Large-Scale Data Analysis,”
Proceedings of the 35th SIGMOD International Conference on Management of Data,
New York, NY, USA, 2009

● Pros of Parallel DW:
○ superior runtime performance
● Cons of Parallel DW:
○ time-consuming up-front set-up
○ sophisticated configuration and tuning

New Model – Pig Latin
● Comes from Yahoo
● Pig Latin, a high-level data analysis scripting
language
● Features of Pig, and motivation for them
● Language features, data model, and motivation for them
● Implementation of Pig
● A novel debugging approach brought by the system
● A few real usage scenarios

New Model - SCOPE
● Developed by Microsoft
● SCOPE, a declarative and extensible scripting
language
● Underlying parallel data processing and storage
system
● Language features and data model
● System design and architecture
● TPC-H benchmark

New Model - Hive
● Comes from Facebook
● HiveQL, a high-level data analysis scripting language
● Language features, data model, and type system
● Data storage in HDFS (Hadoop Distributed File System)
● System architecture and components
● Usage statistics at Facebook

Comparison
Systems compared: RDB/DW, Pig Latin, SCOPE, Hive

● Programming Style
○ RDB/DW: SQL/MDX: a single block of declarative constraints that collectively define the result
○ Pig Latin: "A sequence of steps where each step specifies only a single, high-level relational-algebra style data transformation"
○ SCOPE: "A sequence of data processing commands"; "Has a strong resemblance to SQL -- an intentional design choice"
○ Hive: "HiveQL comprises of a subset of SQL and some extensions"; "Working towards making HiveQL subsume SQL syntax"

● Extensibility
○ RDB/DW: Vendor / product specific UDF (User Defined Function)
○ Pig Latin: Currently supports Java UDFs, with future support of arbitrary languages
○ SCOPE: Supports C#
○ Hive: Supports UDFs of arbitrary programming languages; data types can also be customized
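The step-by-step programming style and the role of a UDF can be illustrated with a minimal Python sketch. This is not Pig Latin, SCOPE, or HiveQL syntax; the helper names (`load`, `filter_rows`, `group_by`, `aggregate`, `average_pagerank`) and the sample data are hypothetical and chosen only for illustration. Each statement applies a single relational-style transformation, and the user-defined function is plugged into the aggregation step.

```python
from collections import defaultdict


def load(rows):
    """Step 1: materialize the input relation as a list of dict rows."""
    return list(rows)


def filter_rows(rows, predicate):
    """Step 2: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """Step 3: collect rows into groups that share the same key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def aggregate(groups, udf):
    """Step 4: apply a user-defined function to every group."""
    return {key: udf(members) for key, members in groups.items()}


def average_pagerank(rows):
    """Hypothetical UDF: mean of the 'pagerank' column of a group."""
    return sum(row["pagerank"] for row in rows) / len(rows)


# Hypothetical input data (values chosen to be exact in binary floating point).
urls = load([
    {"url": "a.example", "category": "news", "pagerank": 0.75},
    {"url": "b.example", "category": "news", "pagerank": 0.25},
    {"url": "c.example", "category": "blog", "pagerank": 0.125},
    {"url": "d.example", "category": "blog", "pagerank": 1.0},
])

# The "script": a sequence of steps, one transformation per step.
good_urls = filter_rows(urls, lambda row: row["pagerank"] > 0.2)
groups = group_by(good_urls, "category")
result = aggregate(groups, average_pagerank)
print(result)
```

A SQL-style formulation would express the same computation as one declarative block (select, where, group by) and leave the order of operations to the system, which is the contrast the Programming Style row draws.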

Comparison (Cont'd)
Systems compared: RDB/DW, Pig Latin, SCOPE, Hive

● Nested Data Model
○ RDB/DW: No, unless one is willing to violate 1NF
○ Pig Latin: Yes, supports complex data types (set, map, and tuple)
○ SCOPE: (Not directly mentioned or demonstrated in paper)
○ Hive: Yes, supports complex data (map, list, and struct)

● Data Ownership
○ RDB/DW: Yes
○ Pig Latin: No
○ SCOPE: No
○ Hive: Yes or No

● Data Storage
○ RDB/DW: Internal data structure
○ Pig Latin: HDFS (Hadoop Distributed File System) files
○ SCOPE: Cosmos files
○ Hive: HDFS files
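To make the nested data model row concrete, the following Python sketch contrasts a record that holds a list and a map directly with the flat rows a strictly 1NF relational schema would require. The `PageVisit` type, its fields, and the `flatten` helper are hypothetical and serve only as an illustration; they are not taken from any of the three systems.

```python
from dataclasses import dataclass, field


@dataclass
class PageVisit:
    """Hypothetical nested record: one row holds a list and a map."""
    user: str
    tags: list = field(default_factory=list)        # list-valued field
    properties: dict = field(default_factory=dict)  # map-valued field


def flatten(visit):
    """Normalize the list field to 1NF: one atomic row per (user, tag)."""
    return [(visit.user, tag) for tag in visit.tags]


nested = PageVisit(
    user="alice",
    tags=["news", "sports"],
    properties={"browser": "firefox"},
)

flat_rows = flatten(nested)
print(nested)
print(flat_rows)
```

In the nested form a single record carries everything about the visit; in the flat form the same information is spread over several rows (and, for the map field, typically a second table) that must be joined back together.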

Comparison (Cont'd)
Systems compared: RDB/DW, Pig Latin, SCOPE, Hive

● Data Schema
○ RDB/DW: Predefined and stored in system
○ Pig Latin: Defined on the fly
○ SCOPE: Defined on the fly
○ Hive: Defined on the fly and/or stored in system (Metadata)

● Interoperability
○ RDB/DW: Poor (must operate on system-owned, internal data)
○ Pig Latin: Good (operate on external data)
○ SCOPE: Good (operate on external data)
○ Hive: Good (operate on both internal and external data)

● Optimization
○ RDB/DW: SQL execution plan
○ Pig Latin: Basic optimization; not directly discussed in the paper
○ SCOPE: Compile-time: better execution plan. Run-time: reduced traffic / workload (rack-awareness, partial aggregation, grouping heuristics)
○ Hive: "Currently has a naive rule-based optimizer with a small number of simple rules"; plan to build a cost-based optimizer and adaptive optimization
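Partial aggregation, one of the run-time optimizations listed for SCOPE above, can be sketched in a few lines of Python. This is a generic illustration of the technique, not SCOPE's implementation: the partitions stand in for data held on separate machines, and the function names are hypothetical. Each partition is pre-aggregated locally so that only one record per distinct key, rather than every input record, has to cross the network.

```python
from collections import Counter


def partial_aggregate(partition):
    """Local step: count keys within one partition before data moves."""
    return Counter(partition)


def final_aggregate(partials):
    """Global step: merge the per-partition counts into final totals."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds counts
    return totals


# Hypothetical data: each inner list lives on a different machine.
partitions = [["a", "b", "a"], ["b", "b", "c"], ["a"]]

partials = [partial_aggregate(p) for p in partitions]
totals = final_aggregate(partials)

# Records shipped across the network without and with partial aggregation.
records_without = sum(len(p) for p in partitions)
records_with = sum(len(c) for c in partials)

print(totals, records_without, records_with)
```

The saving grows with the number of repeated keys per partition; when every key is unique within a partition, the local step shrinks nothing.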

Conclusions
● The ideas behind these 3 papers are very
similar
○ Addressing the same problem: limitation of the bare
MapReduce model
○ Similar approach: high-level data processing scripts
compiled into optimized, low-level parallel processing tasks
supported by the underlying parallel processing system
● Yet there are interesting differences
○ data schema, data ownership, and extensibility
○ Underlying system
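The shared approach named in the conclusions, a high-level request compiled into low-level parallel tasks, can be sketched as follows. This is a toy, single-process illustration under stated assumptions: `compile_group_count` and `run_mapreduce` are hypothetical names, the "script" is reduced to a single group-and-count request, and the shuffle is simulated with an in-memory dictionary rather than a real cluster.

```python
from collections import defaultdict


def compile_group_count(key_column):
    """Compile a hypothetical 'group by key_column and count' request
    into a (map function, reduce function) pair."""

    def map_fn(record):
        # Emit one (key, 1) pair per input record.
        yield record[key_column], 1

    def reduce_fn(key, values):
        # Sum all the ones that were emitted for this key.
        return key, sum(values)

    return map_fn, reduce_fn


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy runner: map, shuffle by key in memory, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            shuffled[key].append(value)
    return dict(reduce_fn(key, values) for key, values in shuffled.items())


# Hypothetical input records.
records = [{"lang": "pig"}, {"lang": "hive"}, {"lang": "pig"}]

map_fn, reduce_fn = compile_group_count("lang")
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

The user states only what to group and count; the map and reduce functions that a programmer would otherwise write by hand in the bare MapReduce model are generated from that request.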
