This document summarizes and compares three high-level parallel processing models: Pig Latin, SCOPE, and Hive. Each addresses the limitations of traditional approaches to large-scale data analysis by providing a high-level scripting language that is compiled into optimized parallel tasks. While the underlying ideas are similar, the models differ in programming style, extensibility, data model, and optimization strategy. Overall, the three systems make different tradeoffs among flexibility, performance, and usability for large-scale data analysis.
2. Motivation
● Ever-increasing amount of data
● High cost of traditional approaches
● Limitations of the bare MapReduce approach
3. Example
A. Pavlo et al., “A Comparison of Approaches to Large-scale Data Analysis,” Proceedings of the 35th SIGMOD International Conference on Management of Data, New York, NY, USA, 2009
● Pros of Parallel DW:
○ superior runtime performance
● Cons of Parallel DW:
○ time-consuming up-front set-up
○ sophisticated configuration and tuning
4. New Model – Pig Latin
● Comes from Yahoo
● Pig Latin, a high-level data analysis scripting language
● Features of Pig, and motivation for them
● Language features, data model, and motivation for them
● Implementation of Pig
● A novel debugging approach brought by the system
● A few real usage scenarios
5. New Model - SCOPE
● Developed by Microsoft
● SCOPE, a declarative and extensible scripting language
● Underlying parallel data processing and storage system
● Language features and data model
● System design and architecture
● TPC-H benchmark
6. New Model - Hive
● Comes from Facebook
● HiveQL, a high-level data analysis scripting language
● Language features, data model, and type system
● Data storage in HDFS (Hadoop Distributed File System)
● System architecture and components
● Usage statistics at Facebook
7. Comparison
| | RDB/DW | Pig Latin | SCOPE | Hive |
|---|---|---|---|---|
| Programming Style | SQL/MDX: a single block of declarative constraints that collectively define the result | "A sequence of steps where each step specifies only a single, high-level relational-algebra style data transformation" | "A sequence of data processing commands"; "Has a strong resemblance to SQL -- an intentional design choice" | "HiveQL comprises of a subset of SQL and some extensions"; "Working towards making HiveQL subsume SQL syntax" |
| Extensibility | Vendor/product-specific UDF (User Defined Function) support | Currently supports Java UDFs; future support for arbitrary languages | Supports C# UDFs | Supports UDFs in arbitrary programming languages; data types can also be customized |
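The difference in programming style can be illustrated with a small sketch. The code below is not Pig Latin, SCOPE, or HiveQL; it uses plain Python and the standard `sqlite3` module to contrast a single declarative block with a sequence of single-purpose steps. The `visits` data and both function names are hypothetical, chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user, url, seconds_on_page)
visits = [
    ("amy", "a.com", 30),
    ("bob", "a.com", 5),
    ("amy", "b.com", 45),
    ("cal", "b.com", 60),
    ("bob", "c.com", 8),
]


def declarative_style(rows):
    """RDB/DW style: one declarative block states the result;
    the engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user TEXT, url TEXT, seconds INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT url, AVG(seconds) FROM visits "
        "WHERE seconds >= 10 GROUP BY url ORDER BY url"
    ).fetchall()
    conn.close()
    return result


def stepwise_style(rows):
    """Dataflow style: a sequence of steps, each performing a single
    high-level transformation on the output of the previous step."""
    # Step 1: filter out short visits
    long_visits = [r for r in rows if r[2] >= 10]
    # Step 2: group the remaining visits by URL
    groups = defaultdict(list)
    for _user, url, seconds in long_visits:
        groups[url].append(seconds)
    # Step 3: compute one aggregate per group
    return [(url, sum(s) / len(s)) for url, s in sorted(groups.items())]


print(declarative_style(visits))
print(stepwise_style(visits))
```

Both functions compute the same per-URL average. The stepwise version exposes each intermediate result, which is the property that makes step-by-step inspection and debugging natural in a dataflow language.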
8. Comparison (Cont'd)
| | RDB/DW | Pig Latin | SCOPE | Hive |
|---|---|---|---|---|
| Nested Data Model | No, unless one is willing to violate 1NF | Yes, supports complex data types (bag, map, and tuple) | (Not directly mentioned or demonstrated in paper) | Yes, supports complex data types (map, list, and struct) |
| Data Ownership | Yes | No | No | Yes or No |
| Data Storage | Internal data structure | HDFS (Hadoop Distributed File System) files | Cosmos files | HDFS files |
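To make the Nested Data Model row concrete, here is a minimal Python sketch of the same data in a nested form and in a flat, first-normal-form (1NF) form. The record layout and the helper names `flatten` and `nest` are assumptions made for illustration; they do not come from any of the three systems.

```python
# Nested form: one record per user, holding a collection of (url, seconds) tuples.
# A strict 1NF relational schema cannot store this inner collection in one column.
nested = {
    "amy": [("a.com", 30), ("b.com", 45)],
    "bob": [("a.com", 5)],
}


def flatten(nested_records):
    """Normalize to 1NF: one flat row per (user, url, seconds)."""
    return [
        (user, url, seconds)
        for user, visits in sorted(nested_records.items())
        for url, seconds in visits
    ]


def nest(flat_rows):
    """Rebuild the nested form by grouping flat rows on the user key."""
    grouped = {}
    for user, url, seconds in flat_rows:
        grouped.setdefault(user, []).append((url, seconds))
    return grouped


flat = flatten(nested)
print(flat)
print(nest(flat) == nested)
```

A system with a nested data model can keep the grouped form as a first-class value, whereas a 1NF system has to store the flat rows and regroup them at query time.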
9. Comparison (Cont'd)
| | RDB/DW | Pig Latin | SCOPE | Hive |
|---|---|---|---|---|
| Data Schema | Predefined and stored in system | Defined on the fly | Defined on the fly | Defined on the fly and/or stored in system (Metadata) |
| Interoperability | Poor (must operate on system-owned, internal data) | Good (operate on external data) | Good (operate on external data) | Good (operate on both internal and external data) |
| Optimization | SQL execution plan optimization | Basic; not directly discussed in the paper | Compile-time: better execution plan; Run-time: reduced traffic / workload (rack-awareness, partial aggregation, grouping heuristics) | "Currently has a naive rule-based optimizer with a small number of simple rules"; plan to build a cost-based optimizer and adaptive optimization |
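Partial aggregation, listed among the run-time optimizations in the comparison, can be sketched in a few lines of Python. This is a simplified simulation rather than any system's actual implementation: the `partitions` data and all function names are hypothetical, and "shuffling" is modeled as simply building a list of records that would cross the network.

```python
from collections import Counter

# Hypothetical input: keys already split across three machines.
partitions = [
    ["a", "b", "a", "a"],
    ["b", "b", "c"],
    ["a", "c", "c", "c"],
]


def shuffle_without_partial_aggregation(parts):
    """Every input record is sent across the network as a (key, 1) pair."""
    return [(key, 1) for part in parts for key in part]


def shuffle_with_partial_aggregation(parts):
    """Each machine pre-aggregates locally and sends one pair per distinct key."""
    shuffled = []
    for part in parts:
        shuffled.extend(sorted(Counter(part).items()))
    return shuffled


def final_aggregate(shuffled):
    """Combine the shuffled (key, count) pairs into global totals."""
    totals = Counter()
    for key, count in shuffled:
        totals[key] += count
    return dict(totals)


plain = shuffle_without_partial_aggregation(partitions)
partial = shuffle_with_partial_aggregation(partitions)
print(len(plain), len(partial))
print(final_aggregate(plain) == final_aggregate(partial))
```

The final totals are identical, but fewer records are shuffled when each partition aggregates locally first. This only works for aggregates that can be combined from partial results, such as counts and sums.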
10. Conclusions
● The ideas behind these three papers are very similar
○ Addressing the same problem: the limitations of the bare MapReduce model
○ Similar approach: high-level data processing scripts compiled into optimized, low-level parallel processing tasks supported by the underlying parallel processing system
● Yet there are interesting differences
○ Data schema, data ownership, and extensibility
○ Underlying system
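The shared approach, a high-level script compiled into low-level parallel tasks, can be sketched as a toy compiler in Python. This is a single-process illustration under stated assumptions: `compile_script` and `run_job` are hypothetical names, and the real systems generate distributed execution plans rather than in-memory loops.

```python
from collections import defaultdict


def compile_script(filter_fn, key_fn, value_fn, reduce_fn):
    """Turn a high-level 'filter, group, aggregate' description into a
    (mapper, reducer) pair, the shape a MapReduce-style runtime expects."""

    def mapper(record):
        # Filtering and key extraction happen on the map side.
        if filter_fn(record):
            yield key_fn(record), value_fn(record)

    def reducer(key, values):
        # Aggregation happens on the reduce side, once per key.
        return key, reduce_fn(values)

    return mapper, reducer


def run_job(mapper, reducer, partitions):
    """Simulate the map, shuffle, and reduce phases in one process."""
    shuffled = defaultdict(list)
    for part in partitions:              # map phase, one task per partition
        for record in part:
            for key, value in mapper(record):
                shuffled[key].append(value)  # shuffle: group values by key
    return [reducer(k, v) for k, v in sorted(shuffled.items())]  # reduce phase


# Hypothetical input partitions of (url, seconds) records.
data = [
    [("a.com", 30), ("a.com", 5)],
    [("b.com", 45), ("b.com", 60), ("c.com", 8)],
]

mapper, reducer = compile_script(
    filter_fn=lambda r: r[1] >= 10,   # keep visits of at least 10 seconds
    key_fn=lambda r: r[0],            # group by URL
    value_fn=lambda r: r[1],          # carry the visit length
    reduce_fn=max,                    # longest visit per URL
)
print(run_job(mapper, reducer, data))
```

The user describes only what to filter, group, and aggregate; the split into map-side and reduce-side work is decided by the compiler, which is where each system's optimizer has room to act.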