12: MapReduce and DBMS Hybrids
Zubair Nabi
zubair.nabi@itu.edu.pk
May 26, 2013
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
1 Hive
Introduction
Data warehousing solution built atop Hadoop by Facebook
Now an Apache open source project
Queries are expressed in SQL-like HiveQL and compiled into
MapReduce jobs
Also contains a type system for describing RDBMS-like tables
A system catalog, the Hive-Metastore, which contains schemas and
statistics, is used for data exploration and query optimization
Stores 2PB of uncompressed data at Facebook and is heavily used for
simple summarization, business intelligence, and machine learning, among
many other applications [1]
Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.
[1] https://www.facebook.com/note.php?note_id=89508453919
Data Model
Tables:
Similar to RDBMS tables
Each table has a corresponding HDFS directory
The contents of the table are serialized and stored in files within that
directory
Serialization can be either system-provided or user-defined
Serialization information of each table is also stored in the
Hive-Metastore for query optimization
Tables can also be defined for data stored in external sources such as
HDFS, NFS, and local FS
Data Model (2)
Partitions:
Determine the distribution of data within sub-directories of the main
table directory
For instance, for a table T stored in /wh/T and partitioned on columns
ds and ctry, data with ds value 20090101 and ctry value US will be
stored in files within /wh/T/ds=20090101/ctry=US
Buckets:
Data within partitions is divided into buckets
Buckets are calculated based on the hash of a column within the
partition
Each bucket is stored within a file in the partition directory
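The partition and bucket layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names `partition_dir` and `bucket_for` are hypothetical, and CRC32 is an assumed stand-in for whatever hash function Hive actually applies to the bucketing column.

```python
import zlib


def partition_dir(table_dir, partition_values):
    """Build the sub-directory for a row from its (column, value) partition pairs."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket from a hash of the bucketing column.

    CRC32 is an assumption made for this sketch, not Hive's own hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


path = partition_dir("/wh/T", [("ds", "20090101"), ("ctry", "US")])
print(path)                 # /wh/T/ds=20090101/ctry=US
print(bucket_for(42, 32))   # some bucket index in the range 0..31
```

Because the bucket is a pure function of the column value, all rows with the same value land in the same file of the partition directory.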
Column Data Types
Primitive types: integers, floats, strings, dates, and booleans
Nestable collection types: arrays and maps
Custom types: user-defined
HiveQL
Supports select, project, join, aggregate, union all, and sub-queries
Tables are created using data definition statements with specific
serialization formats, partitioning, and bucketing
Data is loaded from external sources and inserted into tables
Support for multi-table insert – multiple queries on the same input data
using a single HiveQL statement
User-defined column transformation and aggregation functions in Java
Custom map-reduce scripts written in any language can be embedded
Example: Facebook Status
Status updates are stored in flat files in an NFS directory
/logs/status_updates
This data is loaded on a daily basis to a Hive table:
status_updates(userid int, status string, ds string)
Using:
LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2013-05-26')
Detailed profile information, such as gender and academic institution, is
present in the table: profiles(userid int, school string, gender int)
Example: Facebook Status (2)
Query to work out the frequency of status updates based on gender and
academic institution
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND
          a.ds='2013-05-26')
     ) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION(ds='2013-05-26')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION(ds='2013-05-26')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school
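The point of the multi-table insert is that the joined input is scanned once and feeds both summaries. A minimal Python sketch of that idea follows; the sample rows, the school names, and the `summarize` helper are all hypothetical, and this is plain in-memory code rather than anything Hive executes.

```python
from collections import Counter

# Hypothetical sample data standing in for the two Hive tables.
status_updates = [
    {"userid": 1, "status": "hello", "ds": "2013-05-26"},
    {"userid": 2, "status": "hi", "ds": "2013-05-26"},
    {"userid": 1, "status": "again", "ds": "2013-05-26"},
    {"userid": 3, "status": "old", "ds": "2013-05-25"},
]
profiles = {
    1: {"school": "School A", "gender": 0},
    2: {"school": "School B", "gender": 1},
    3: {"school": "School A", "gender": 1},
}


def summarize(updates, profiles, ds):
    """Join updates to profiles once and fill both summaries in the same pass."""
    gender_summary, school_summary = Counter(), Counter()
    for row in updates:
        profile = profiles.get(row["userid"])
        if row["ds"] != ds or profile is None:
            continue  # mirrors the join condition and the ds filter
        gender_summary[profile["gender"]] += 1
        school_summary[profile["school"]] += 1
    return dict(gender_summary), dict(school_summary)


gender_summary, school_summary = summarize(status_updates, profiles, "2013-05-26")
print(gender_summary)   # {0: 2, 1: 1}
print(school_summary)   # {'School A': 2, 'School B': 1}
```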
Metastore
Similar to the metastore maintained by traditional warehousing
solutions such as Oracle and IBM DB2 (distinguishes Hive from Pig and
Cascading, which have no such store)
Stored in either a traditional DB such as MySQL or an FS such as NFS
Contains the following objects:
Database: namespace for tables
Table: metadata for a table including columns and their types, owner,
storage, and serialization information
Partition: metadata for a partition; similar to the information for a table
2 HadoopDB
Introduction
Two options for data analytics on shared-nothing clusters:
1 Parallel databases, such as Teradata and Oracle, but they:
Assume that failures are a rare event
Assume that hardware is homogeneous
Have never been tested in deployments with more than a few dozen nodes
2 MapReduce, but it:
Has all the shortcomings pointed out by DeWitt and Stonebraker, as
discussed before
Is at times an order of magnitude slower than parallel DBs
Hybrid
Combine the scalability and non-existent monetary cost of MapReduce
with the performance of parallel DBs
HadoopDB is such a hybrid
Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and
interface-level hybrids, HadoopDB is a systems-level hybrid
Uses MapReduce as the communication layer atop a cluster of nodes
running single-node DBMS instances
PostgreSQL as the database layer, Hadoop as the communication
layer, and Hive as the translation layer
Commercialized through the startup Hadapt [2]
[2] http://hadapt.com/
HadoopDB
Consists of four components:
1 Database Connector: Interface between per-node database systems
and Hadoop TaskTrackers
2 Catalog: Meta-information about per-node databases
3 Data Loader: Data partitioning across single-node databases
4 SQL to MapReduce to SQL (SMS) Planner: Translation between
SQL and MapReduce
HadoopDB Architecture
[Figure: HadoopDB architecture diagram]
Database Connector
Uses a Java Database Connectivity (JDBC)-compliant Hadoop
InputFormat
The MapReduce job supplies the connector with the SQL query and other
information
The connector connects to the DB, executes the SQL query, and
returns results in the form of key/value pairs
Hadoop in essence sees the DB as just another data source
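The connector's role (connect, run the SQL it was handed, return key/value pairs) can be illustrated with a short Python sketch. This is an analogy only: the real connector is a Java InputFormat over JDBC, whereas here an in-memory SQLite database stands in for the node-local DBMS, and `read_as_key_values` and the `visits` table are hypothetical names.

```python
import sqlite3


def read_as_key_values(conn, sql, key_column=0):
    """Run the query and yield (key, remaining-columns) pairs, one per result row."""
    for row in conn.execute(sql):
        yield row[key_column], row[:key_column] + row[key_column + 1:]


# In-memory SQLite stands in for a node-local database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [("a.com", 3), ("b.com", 5)])

pairs = list(read_as_key_values(conn, "SELECT url, hits FROM visits ORDER BY url"))
print(pairs)   # [('a.com', (3,)), ('b.com', (5,))]
```

From the consumer's side the pairs look the same as records read from any other input source, which is the sense in which the database is "just another data source".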
Catalog
Contains information, such as:
1 Connection parameters, such as DB location, format, and any
credentials
2 Metadata about the datasets, replica locations, and partitioning scheme
Stored as an XML file in HDFS
Data Loader
Consists of two key components:
1 Global Hasher: Executes a custom Hadoop job to repartition raw data
files from the HDFS into n parts, where n is the number of nodes in the
cluster
2 Local Hasher: Copies a partition from the HDFS to the node-local DB
of each node and further partitions it into smaller size chunks
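The two hashing stages can be sketched as follows. This is a simplified in-memory illustration, not the actual loader: `global_hash`, `local_hash`, and `stable_hash` are hypothetical names, CRC32 is an assumed hash, and the salt in the local stage is a choice made here so that chunk assignment is not tied to node assignment.

```python
import zlib


def stable_hash(key, salt=""):
    """Deterministic hash of a key; CRC32 is an illustrative choice."""
    return zlib.crc32(f"{salt}{key}".encode("utf-8"))


def global_hash(rows, key, num_nodes):
    """Stage 1: split the raw rows into one part per node in the cluster."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[stable_hash(row[key]) % num_nodes].append(row)
    return parts


def local_hash(partition, key, num_chunks):
    """Stage 2: split one node's part into smaller chunks for its local database."""
    chunks = [[] for _ in range(num_chunks)]
    for row in partition:
        chunks[stable_hash(row[key], salt="local:") % num_chunks].append(row)
    return chunks


rows = [{"id": i} for i in range(100)]
node_parts = global_hash(rows, "id", num_nodes=4)
node0_chunks = local_hash(node_parts[0], "id", num_chunks=3)
print([len(p) for p in node_parts])
print([len(c) for c in node0_chunks])
```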
SQL to MapReduce to SQL (SMS) Planner
Extends HiveQL in two key ways:
1 Before query execution, the Hive Metastore is updated with references
to HadoopDB tables, table schemas, formats, and serialization
information
2 All operators whose partitioning keys match those of the node-local
databases are converted into SQL queries and pushed to the database layer
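A heavily simplified sketch of the push-down decision for a single aggregation is given below. It assumes that "matching" means the grouping key equals the key the table was partitioned on; the function `plan_group_by` and its return format are hypothetical, and the real planner works on Hive's operator plans rather than on strings.

```python
def plan_group_by(table, partition_key, group_key):
    """Return (layer, sql) for a COUNT(*) ... GROUP BY over one table."""
    if group_key == partition_key:
        # Every row of a group lives on one node, so the database can aggregate fully.
        sql = f"SELECT {group_key}, COUNT(*) FROM {table} GROUP BY {group_key}"
        return ("database", sql)
    # Otherwise the database only scans; the grouping is left to MapReduce.
    return ("mapreduce", f"SELECT {group_key} FROM {table}")


pushed = plan_group_by("status_updates", "userid", "userid")
not_pushed = plan_group_by("status_updates", "userid", "ds")
print(pushed)
print(not_pushed)
```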
3 nCluster
Introduction
The declarative nature of SQL is too limiting for describing most big
data computation
The underlying subsystems are also suboptimal as they do not
consider domain-specific optimizations
nCluster makes use of SQL/MR, a framework that inserts user-defined
functions in any programming language into SQL queries
By itself, nCluster is a shared-nothing parallel database geared
towards analytic workloads
Originally designed by Aster Data Systems and later acquired by
Teradata
Used by Barnes and Noble, LinkedIn, SAS, etc.
SQL/MR Functions
Dynamically polymorphic: input and output schemas are decided at
runtime
Parallelizable across cores and machines
Composable because their input and output behaviour is identical to
SQL subqueries
Amenable to static and dynamic optimizations just like SQL subqueries
or a relation
Can be implemented in a number of languages including Java, C#,
C++, Python, etc. and can thus make use of third-party libraries
Executed within processes to provide sandboxing and resource
allocation
Syntax
SELECT ...
FROM functionname(
  ON table-or-query
  [PARTITION BY expr, ...]
  [ORDER BY expr, ...]
  [clausename(arg, ...) ...]
)
...
SQL/MR function appears in the FROM clause
ON, which specifies the input to the function, is the only required clause
PARTITION BY partitions the input to the function on one or more
attributes from the schema
Syntax (2)
SELECT ...
FROM functionname(
  ON table-or-query
  [PARTITION BY expr, ...]
  [ORDER BY expr, ...]
  [clausename(arg, ...) ...]
)
...
ORDER BY sorts the input to the function and can only be used after a
PARTITION BY clause
Any number of custom clauses can also be defined whose names and
arguments are passed as a key/value map to the function
Implemented as relations so easily nestable
Execution Model
Functions are equivalent to either map (row function) or reduce
(partition function) functions
As in MapReduce, these functions are executed across many
nodes and machines
Contracts identical to MapReduce functions:
Each row of the input table is operated on by exactly one row function
Each group of rows defined by the PARTITION BY clause is operated on by
exactly one partition function, in the order specified by the ORDER BY
clause
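The two contracts can be mimicked in single-process Python to make the difference concrete. This is a sketch only: `run_row_function`, `run_partition_function`, and the sample `clicks` data are hypothetical, and nothing here is distributed the way nCluster executes it.

```python
from itertools import groupby


def run_row_function(rows, fn):
    """Map-like: fn is applied to each input row exactly once and may emit any number of rows."""
    out = []
    for row in rows:
        out.extend(fn(row))
    return out


def run_partition_function(rows, fn, partition_by, order_by):
    """Reduce-like: fn sees each PARTITION BY group once, with its rows sorted by ORDER BY."""
    ordered = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    out = []
    for key, group in groupby(ordered, key=lambda r: r[partition_by]):
        out.extend(fn(key, list(group)))
    return out


clicks = [{"user": "u1", "ts": 5}, {"user": "u2", "ts": 1}, {"user": "u1", "ts": 2}]

# Row function: keep only clicks with a timestamp of at least 2.
recent = run_row_function(clicks, lambda r: [r] if r["ts"] >= 2 else [])

# Partition function: first and last timestamp per user.
def span(user, group):
    return [{"user": user, "first": group[0]["ts"], "last": group[-1]["ts"]}]

spans = run_partition_function(clicks, span, partition_by="user", order_by="ts")
print(recent)
print(spans)
```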
Programming Interface
The query planner passes the function a Runtime Contract, which
contains the names and types of the input columns and the names and
values of the argument clauses
The function then completes this contract by filling in the output
schema and making a call to complete()
Row and partition functions are implemented through the
operateOnSomeRows and operateOnPartition methods,
respectively
These methods are passed an iterator over their input rows and an
emitter object for returning output rows to the database
Partition functions can also optionally implement the combiner
interface
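The contract handshake described above might look roughly like the following. The `RuntimeContract` class here is a hypothetical stand-in written for this sketch; its fields and methods (`setOutputColumns`, `getOutputColumns`, `isCompleted`) are assumptions and not the actual nCluster API. Only the idea of filling in the output schema and then calling `complete()` comes from the slides.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ContractSketch {

    // Hypothetical stand-in for the Runtime Contract; not the nCluster API.
    static final class RuntimeContract {
        final Map<String, String> inputColumns;     // column name -> type, set by the planner
        final Map<String, List<String>> arguments;  // clause name -> values, set by the planner
        private Map<String, String> outputColumns;  // filled in by the function
        private boolean completed = false;

        RuntimeContract(Map<String, String> inputColumns, Map<String, List<String>> arguments) {
            this.inputColumns = inputColumns;
            this.arguments = arguments;
        }

        void setOutputColumns(Map<String, String> outputColumns) {
            if (completed) {
                throw new IllegalStateException("contract already completed");
            }
            this.outputColumns = outputColumns;
        }

        void complete() {
            if (outputColumns == null) {
                throw new IllegalStateException("output schema not set");
            }
            completed = true;
        }

        boolean isCompleted() { return completed; }

        Map<String, String> getOutputColumns() { return outputColumns; }
    }

    // What a sessionize-style function might do in its constructor:
    // read its argument clauses, declare the output schema, complete the contract.
    static RuntimeContract negotiate(RuntimeContract contract) {
        String timeColumn = contract.arguments.get("TIMECOLUMN").get(0);
        if (!contract.inputColumns.containsKey(timeColumn)) {
            throw new IllegalArgumentException("unknown time column: " + timeColumn);
        }
        Map<String, String> output = new LinkedHashMap<>(contract.inputColumns);
        output.put("session", "int");
        contract.setOutputColumns(output);
        contract.complete();
        return contract;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("ts", "timestamp");
        input.put("userid", "int");
        RuntimeContract contract = new RuntimeContract(
                input, Map.of("TIMECOLUMN", List.of("ts"), "TIMEOUT", List.of("60")));
        System.out.println(negotiate(contract).getOutputColumns().keySet());
    }
}
```

Because the output schema is derived from the input columns at plan time, the same function can be applied to tables with different schemas.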
Installation
Functions must be installed before they can be used
Can be supplied as a .zip along with third-party libraries
Install-time examination also enables static analysis of properties, such
as row function or partition function, support for combining, etc.
Arbitrary files, such as configuration files and binaries, can also be
installed; they are replicated to all workers
Each function is provided with a temporary directory which is garbage
collected after execution
Architecture
One or more Queen nodes process queries and hash partition them
across Worker nodes
The query planner honours the Runtime Contract with the
function and invokes its initializer (the constructor, in the case of Java)
Functions are executed within the Worker databases as separate
processes for isolation, security, resource allocation, forced
termination, etc.
The worker database implements a “bridge” which manages its
communication with the SQL/MR function
The SQL/MR function process contains a “runner” which manages its
communication with the worker database
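As a rough illustration of hash partitioning across Worker nodes, the sketch below assigns rows to workers by hashing a partition key. It assumes that the data rows are what gets distributed and that the key is a user ID; the class and method names are hypothetical, and the actual hash function used by nCluster is not given in the slides.

```java
import java.util.ArrayList;
import java.util.List;

final class HashPartitionSketch {

    // Map a partition key to one of numWorkers workers.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int workerFor(Object key, int numWorkers) {
        return Math.floorMod(key.hashCode(), numWorkers);
    }

    // Distribute keys over workers; rows with equal keys always land on the same worker.
    static List<List<Integer>> distribute(List<Integer> userIds, int numWorkers) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            workers.add(new ArrayList<>());
        }
        for (Integer id : userIds) {
            workers.get(workerFor(id, numWorkers)).add(id);
        }
        return workers;
    }

    public static void main(String[] args) {
        List<List<Integer>> workers = distribute(List.of(238909, 7656, 238909, 7656, 238909), 4);
        System.out.println(workers);
    }
}
```

Co-locating all rows of a key on one worker is what lets a partition function run locally without further data movement.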
Architecture (2)
[Figure: architecture diagram]
Example: Wordcount
    SELECT token, COUNT(*)
    FROM tokenizer(
        ON input_table
        DELIMITER(' ')
    )
    GROUP BY token;
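The query above relies on a tokenizer row function whose body is not shown. A minimal sketch of what such a function might compute is given below; the class and method names are hypothetical, and the `wordcount` helper merely stands in for the GROUP BY and COUNT(*) that the database itself would perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class WordcountSketch {

    // Row function: split one input row on the delimiter and emit one token per output row.
    static List<String> tokenize(String row, String delimiter) {
        List<String> tokens = new ArrayList<>();
        // Pattern.quote treats the delimiter literally rather than as a regex.
        for (String token : row.split(Pattern.quote(delimiter))) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for GROUP BY token with COUNT(*).
    static Map<String, Integer> wordcount(List<String> rows, String delimiter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            for (String token : tokenize(row, delimiter)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordcount(List.of("a b a", "b c"), " ");
        System.out.println(counts);
    }
}
```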
Example: Clickstream Sessionization
Divide a user’s clicks on a website into sessions
A session includes the user’s clicks within a specified time period
Input:
Timestamp   User ID
10:00:00    238909
00:58:24    7656
10:00:24    238909
02:30:33    7656
10:01:23    238909
10:02:40    238909

Output:
Timestamp   User ID   Session ID
10:00:00    238909    0
10:00:24    238909    0
10:01:23    238909    0
10:02:40    238909    1
00:58:24    7656      0
02:30:33    7656      1
Example: Clickstream Sessionization (2)
    SELECT ts, userid, session
    FROM sessionize (
        ON clicks
        PARTITION BY userid
        ORDER BY ts
        TIMECOLUMN ('ts')
        TIMEOUT (60)
    );
Example: Clickstream Sessionization (3)
    public class Sessionize implements PartitionFunction {

        private int timeColumnIndex;  // position of the TIMECOLUMN column in each input row
        private int timeout;          // value of the TIMEOUT clause

        public Sessionize(RuntimeContract contract) {
            // Get time column and timeout from contract
            // Define output schema
            contract.complete();
        }

        // Called once per partition (one user's clicks, ordered by ts)
        public void operateOnPartition(
                PartitionDefinition partition,
                RowIterator inputIterator,
                RowEmitter outputEmitter) {
            // Implement the partition function logic
            // Emit output rows
        }

    }
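The partition function logic left as a comment in the skeleton above could be filled in along the following lines. This is a standalone sketch rather than nCluster code: it works on plain lists instead of `RowIterator` and `RowEmitter`, and it assumes that timestamps are in seconds and that a new session starts whenever the gap since the previous click exceeds the timeout, which matches the example table shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

final class SessionizeSketch {

    // Assign session ids to one user's click times (in seconds), sorted ascending.
    // A new session starts when the gap to the previous click exceeds the timeout.
    static List<Integer> sessionize(List<Integer> sortedTimes, int timeoutSeconds) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        Integer previous = null;
        for (int ts : sortedTimes) {
            if (previous != null && ts - previous > timeoutSeconds) {
                session++;
            }
            sessions.add(session);
            previous = ts;
        }
        return sessions;
    }

    // Convert a time of day to seconds since midnight.
    static int seconds(int h, int m, int s) {
        return h * 3600 + m * 60 + s;
    }

    public static void main(String[] args) {
        List<Integer> user238909 = List.of(
                seconds(10, 0, 0), seconds(10, 0, 24), seconds(10, 1, 23), seconds(10, 2, 40));
        List<Integer> user7656 = List.of(seconds(0, 58, 24), seconds(2, 30, 33));
        System.out.println(sessionize(user238909, 60));
        System.out.println(sessionize(user7656, 60));
    }
}
```

Since PARTITION BY userid and ORDER BY ts deliver each user's clicks together and in time order, a single forward pass per partition is enough.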
Summary
Hive, HadoopDB, and nCluster explore three different points in the design
space
1 Hive uses MapReduce to give DBMS-like functionality
2 HadoopDB uses MapReduce and DBMS side-by-side
3 nCluster implements MapReduce within a DBMS
References
1 Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham
Murthy. 2009. Hive: a warehousing solution over a map-reduce
framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629.
2 Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi
Silberschatz, and Alexander Rasin. 2009. HadoopDB: an architectural
hybrid of MapReduce and DBMS technologies for analytical workloads.
Proc. VLDB Endow. 2, 1 (August 2009), 922-933.
3 Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009.
SQL/MapReduce: a practical approach to self-describing, polymorphic,
and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2
(August 2009), 1402-1413.
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

MapReduce and DBMS Hybrids

  • 1. 12: MapReduce and DBMS Hybrids Zubair Nabi zubair.nabi@itu.edu.pk May 26, 2013 Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
  • 3. Outline: 1 Hive, 2 HadoopDB, 3 nCluster, 4 Summary
  • 10. Introduction
      • Data warehousing solution built atop Hadoop by Facebook
      • Now an Apache open source project
      • Queries are expressed in SQL-like HiveQL and compiled into map-reduce jobs
      • Also contains a type system for describing RDBMS-like tables
      • A system catalog, the Hive-Metastore, contains schemas and statistics and is used for data exploration and query optimization
      • Stores 2PB of uncompressed data at Facebook and is heavily used for simple summarization, business intelligence, and machine learning, among many other applications (https://www.facebook.com/note.php?note_id=89508453919)
      • Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.
  • 16. Data Model
      • Tables: similar to RDBMS tables
      • Each table has a corresponding HDFS directory
      • The contents of the table are serialized and stored in files within that directory
      • Serialization can be either system-provided or user-defined
      • Serialization information for each table is also stored in the Hive-Metastore for query optimization
      • Tables can also be defined for data stored in external sources such as HDFS, NFS, and local FS
  • 23. Data Model (2)
      • Partitions: determine the distribution of data within sub-directories of the main table directory
          • For instance, for a table T stored in /wh/T and partitioned on columns ds and ctry, data with ds value 20090101 and ctry value US will be stored in files within /wh/T/ds=20090101/ctry=US
      • Buckets: data within partitions is divided into buckets
          • Buckets are calculated based on the hash of a column within the partition
          • Each bucket is stored within a file in the partition directory
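The directory layout and bucketing rule above can be sketched in a few lines of Python. This is only an illustration: `partition_path` and `bucket_for` are hypothetical helper names, and CRC32 is a stand-in hash chosen for reproducibility, not the hash function Hive itself uses.

```python
import zlib

def partition_path(table_dir, partition_values):
    """Build the sub-directory for one partition from (column, value) pairs."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)

def bucket_for(column_value, num_buckets):
    """Map a column value to a bucket index via a hash (CRC32 is an assumption)."""
    return zlib.crc32(str(column_value).encode("utf-8")) % num_buckets

# The example from the slide: table T in /wh/T, partitioned on ds and ctry.
path = partition_path("/wh/T", [("ds", "20090101"), ("ctry", "US")])
print(path)

# Rows with the same column value always land in the same bucket file.
print(bucket_for(1234, 32) == bucket_for(1234, 32))
```

Because the bucket index depends only on the column value and the bucket count, a reader can locate the file for a given value without scanning the whole partition.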
  • 26. Column Data Types
      • Primitive types: integers, floats, strings, dates, and booleans
      • Nestable collection types: arrays and maps
      • Custom types: user-defined
  • 32. HiveQL
      • Supports select, project, join, aggregate, union all, and sub-queries
      • Tables are created using data definition statements with specific serialization formats, partitioning, and bucketing
      • Data is loaded from external sources and inserted into tables
      • Support for multi-table insert: multiple queries on the same input data using a single HiveQL statement
      • User-defined column transformation and aggregation functions in Java
      • Custom map-reduce scripts written in any language can be embedded
  • 36. Example: Facebook Status
      • Status updates are stored in flat files in an NFS directory, /logs/status_updates
      • This data is loaded on a daily basis into a Hive table: status_updates(userid int, status string, ds string)
      • Using:
            LOAD DATA LOCAL INPATH '/logs/status_updates'
            INTO TABLE status_updates PARTITION (ds='2013-05-26')
      • Detailed profile information, such as gender and academic institution, is present in the table: profiles(userid int, school string, gender int)
  • 38. Example: Facebook Status (2)
      • Query to work out the frequency of status updates based on gender and academic institution:
            FROM (SELECT a.status, b.school, b.gender
                  FROM status_updates a JOIN profiles b
                  ON (a.userid = b.userid AND
                      a.ds='2013-05-26')
                 ) subq1
            INSERT OVERWRITE TABLE gender_summary
                PARTITION(ds='2013-05-26')
                SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
            INSERT OVERWRITE TABLE school_summary
                PARTITION(ds='2013-05-26')
                SELECT subq1.school, COUNT(1) GROUP BY subq1.school
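To make the multi-table insert concrete, here is a small in-memory Python sketch of what the query computes: one pass over the joined rows feeds both summaries. The sample rows and school names are made up for illustration, and `summarize` is a hypothetical helper; this mirrors the query's semantics, not how Hive executes it.

```python
from collections import Counter

# Made-up sample data in the shape of the two tables from the slides.
status_updates = [  # (userid, status, ds)
    (1, "hello", "2013-05-26"),
    (2, "at work", "2013-05-26"),
    (1, "lunch", "2013-05-26"),
    (3, "old post", "2013-05-25"),
]
profiles = {  # userid -> (school, gender)
    1: ("School A", 0),
    2: ("School B", 1),
    3: ("School A", 1),
}

def summarize(updates, profiles, ds):
    """Join updates with profiles for one day and fill both summaries in one pass."""
    gender_summary, school_summary = Counter(), Counter()
    for userid, _status, day in updates:
        if day != ds or userid not in profiles:
            continue  # the join condition: matching userid and the requested ds
        school, gender = profiles[userid]
        gender_summary[gender] += 1   # SELECT gender, COUNT(1) GROUP BY gender
        school_summary[school] += 1   # SELECT school, COUNT(1) GROUP BY school
    return dict(gender_summary), dict(school_summary)

gender_summary, school_summary = summarize(status_updates, profiles, "2013-05-26")
print(gender_summary)
print(school_summary)
```

The point of the single `FROM ... INSERT ... INSERT` statement is the same as the single loop here: the shared input is scanned and joined once rather than once per output table.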
  • 43. Metastore
      • Similar to the metastore maintained by traditional warehousing solutions such as Oracle and IBM DB2 (this distinguishes Hive from Pig or Cascading, which have no such store)
      • Stored in either a traditional DB such as MySQL or an FS such as NFS
      • Contains the following objects:
          • Database: namespace for tables
          • Table: metadata for a table, including columns and their types, owner, storage, and serialization information
          • Partition: metadata for a partition; similar to the information for a table
  • 44. Outline: section 2, HadoopDB
  • 51. Introduction
      • Two options for data analytics on shared-nothing clusters:
          1. Parallel databases, such as Teradata, Oracle, etc., but they:
              • Assume that failures are a rare event
              • Assume that hardware is homogeneous
              • Were never tested in deployments with more than a few dozen nodes
          2. MapReduce, but it has:
              • All the shortcomings pointed out by DeWitt and Stonebraker, as discussed before
              • At times, performance an order of magnitude slower than parallel DBs
  • 57. Hybrid
      • Combine the scalability and non-existent monetary cost of MapReduce with the performance of parallel DBs
      • HadoopDB is such a hybrid
      • Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and interface-level hybrids, HadoopDB is a systems-level hybrid
      • Uses MapReduce as the communication layer atop a cluster of nodes running single-node DBMS instances
      • PostgreSQL as the database layer, Hadoop as the communication layer, and Hive as the translation layer
      • Commercialized through the startup Hadapt (http://hadapt.com/)
  • 61. HadoopDB
      • Consists of four components:
          1. Database Connector: interface between per-node database systems and Hadoop TaskTrackers
          2. Catalog: meta-information about per-node databases
          3. Data Loader: data partitioning across single-node databases
          4. SQL to MapReduce to SQL (SMS) Planner: translation between SQL and MapReduce
  • 62. HadoopDB Architecture (figure; not reproduced in the transcript)
  • 66. Database Connector
      • Uses a Java Database Connectivity (JDBC)-compliant Hadoop InputFormat
      • The MapReduce job hands the connector the SQL query and other information
      • The connector connects to the DB, executes the SQL query, and returns results in the form of key/value pairs
      • Hadoop in essence sees the DB as just another data source
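The connector's contract (take a SQL query, run it against the node-local database, hand back key/value pairs) can be sketched as follows. This is an assumption-laden illustration: Python's built-in `sqlite3` stands in for the per-node PostgreSQL instance and JDBC, `run_on_node` is a hypothetical name, and the `visits` table is made-up sample data.

```python
import sqlite3

def run_on_node(conn, sql, params=()):
    """Execute the query handed over by the job and yield (key, value) pairs.

    Assumption: the first column is treated as the key, the rest as the value.
    """
    for row in conn.execute(sql, params):
        yield row[0], row[1:]

# Stand-in for a node-local database with some sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [("a.com", 3), ("b.com", 5)])

pairs = list(run_on_node(conn, "SELECT url, hits FROM visits ORDER BY url"))
print(pairs)
```

From the caller's side nothing about the result reveals that a database produced it, which is the sense in which Hadoop sees the DB as just another data source.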
  • 69. Catalog
      • Contains information such as:
          1. Connection parameters, such as DB location, format, and any credentials
          2. Metadata about the datasets, replica locations, and partitioning scheme
      • Stored as an XML file on the HDFS
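As a rough picture of what such an XML catalog could hold, the sketch below parses a small document and answers "which node holds which partitions of a table". The element and attribute names (`catalog`, `node`, `partition`, `host`, `url`) are invented for this example and are not HadoopDB's actual catalog schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout: connection parameters per node, plus the
# partitions of each table stored on that node.
CATALOG_XML = """
<catalog>
  <node host="node1" url="jdbc:postgresql://node1:5432/db">
    <partition table="visits" id="0"/>
    <partition table="visits" id="1"/>
  </node>
  <node host="node2" url="jdbc:postgresql://node2:5432/db">
    <partition table="visits" id="2"/>
  </node>
</catalog>
""".strip()

def partitions_by_node(xml_text, table):
    """Return {host: [partition ids]} for one table, skipping nodes without any."""
    root = ET.fromstring(xml_text)
    result = {}
    for node in root.findall("node"):
        ids = [int(p.get("id")) for p in node.findall("partition")
               if p.get("table") == table]
        if ids:
            result[node.get("host")] = ids
    return result

print(partitions_by_node(CATALOG_XML, "visits"))
```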
  • 71. Data Loader
      • Consists of two key components:
          1. Global Hasher: executes a custom Hadoop job to repartition raw data files from the HDFS into n parts, where n is the number of nodes in the cluster
          2. Local Hasher: copies a partition from the HDFS to the node-local DB of each node and further partitions it into smaller chunks
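The two-level partitioning can be sketched in plain Python. The function names, the CRC32 hash, and the salt used for the second level are all assumptions made for the example; in HadoopDB the first step runs as a Hadoop job and the second loads into the node-local database.

```python
import zlib

def stable_hash(key, salt=b""):
    """Deterministic hash of a key (CRC32 is a stand-in, chosen for reproducibility)."""
    return zlib.crc32(salt + str(key).encode("utf-8"))

def global_hash(records, key_of, num_nodes):
    """Global Hasher step: split all records into one part per cluster node."""
    parts = [[] for _ in range(num_nodes)]
    for rec in records:
        parts[stable_hash(key_of(rec)) % num_nodes].append(rec)
    return parts

def local_hash(partition, key_of, num_chunks):
    """Local Hasher step: split one node's part into smaller chunks.

    A salt is used so the chunk choice is not tied to the node choice.
    """
    chunks = [[] for _ in range(num_chunks)]
    for rec in partition:
        chunks[stable_hash(key_of(rec), salt=b"local:") % num_chunks].append(rec)
    return chunks

records = [(i, f"row{i}") for i in range(20)]
parts = global_hash(records, key_of=lambda r: r[0], num_nodes=4)
chunks = [local_hash(p, key_of=lambda r: r[0], num_chunks=3) for p in parts]
print([len(p) for p in parts])
```

Because both steps hash on the same key, all records sharing a key end up on the same node and in the same chunk, which is what later allows operators on that key to run inside the node-local database.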
  • 73. SQL to MapReduce to SQL (SMS) Planner
      • Extends HiveQL in two key ways:
          1. Before query execution, the Hive Metastore is updated with references to HadoopDB tables, table schemas, formats, and serialization information
          2. All operators with partitioning keys similar to the node-local database are converted into SQL queries and pushed to the database layer
  • 74. Outline: section 3, nCluster
  • 80. Introduction
      • The declarative nature of SQL is too limiting for describing most big data computation
      • The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations
      • nCluster makes use of SQL/MR, a framework that inserts user-defined functions in any programming language into SQL queries
      • By itself, nCluster is a shared-nothing parallel database geared towards analytic workloads
      • Originally designed by Aster Data Systems and later acquired by Teradata
      • Used by Barnes and Noble, LinkedIn, SAS, etc.
  • 86. SQL/MR Functions
      • Dynamically polymorphic: input and output schemas are decided at runtime
      • Parallelizable across cores and machines
      • Composable, because their input and output behaviour is identical to that of SQL subqueries
      • Amenable to static and dynamic optimizations, just like a SQL subquery or a relation
      • Can be implemented in a number of languages, including Java, C#, C++, and Python, and can thus make use of third-party libraries
      • Executed in separate processes to provide sandboxing and resource allocation
  • 89. Syntax
        SELECT ...
        FROM functionname(
            ON table-or-query
            [PARTITION BY expr, ...]
            [ORDER BY expr, ...]
            [clausename(arg, ...) ...]
        )
        ...
      • The SQL/MR function appears in the FROM clause
      • ON, the only required clause, specifies the input to the function
      • PARTITION BY partitions the input to the function on one or more attributes from the schema
  • 92. Syntax (2)
        SELECT ...
        FROM functionname(
            ON table-or-query
            [PARTITION BY expr, ...]
            [ORDER BY expr, ...]
            [clausename(arg, ...) ...]
        )
        ...
      • ORDER BY sorts the input to the function and can only be used after a PARTITION BY clause
      • Any number of custom clauses can also be defined; their names and arguments are passed to the function as a key/value map
      • SQL/MR functions are implemented as relations, so they can easily be nested
  • 96. Execution Model
      • Functions are equivalent to either map functions (row functions) or reduce functions (partition functions)
      • As in MapReduce, these functions are executed across many nodes and machines
      • Their contracts are identical to those of MapReduce functions:
          • Each row of the input table is processed by exactly one row function
          • Each group of rows defined by the PARTITION BY clause is processed by exactly one partition function, in the order specified by the ORDER BY clause
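The two contracts can be illustrated with a small single-machine model in plain Java. This is only a sketch under stated assumptions: the class and method names (`ExecutionModelSketch`, `applyRowFunction`, `applyPartitionFunction`) are hypothetical and are not part of the nCluster API, and the real system runs these functions in parallel across workers rather than in one loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical single-machine model of the two SQL/MR contracts.
// None of these names come from the nCluster API.
class ExecutionModelSketch {

    // Row function contract (map-like): every input row is handled exactly
    // once, independently of all other rows, and may yield any number of rows.
    static <I, O> List<O> applyRowFunction(List<I> rows, Function<I, List<O>> rowFn) {
        List<O> out = new ArrayList<>();
        for (I row : rows) {
            out.addAll(rowFn.apply(row));
        }
        return out;
    }

    // Partition function contract (reduce-like): rows are grouped by the
    // PARTITION BY key, each group is sorted by the ORDER BY comparator,
    // and the function is invoked exactly once per group.
    static <I, K extends Comparable<K>, O> List<O> applyPartitionFunction(
            List<I> rows,
            Function<I, K> partitionBy,
            Comparator<I> orderBy,
            Function<List<I>, List<O>> partitionFn) {
        Map<K, List<I>> groups = new TreeMap<>();
        for (I row : rows) {
            groups.computeIfAbsent(partitionBy.apply(row), k -> new ArrayList<>()).add(row);
        }
        List<O> out = new ArrayList<>();
        for (List<I> group : groups.values()) {
            group.sort(orderBy);
            out.addAll(partitionFn.apply(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Each row is {key, value}.
        List<int[]> rows = Arrays.asList(
                new int[] {2, 30}, new int[] {1, 20}, new int[] {2, 10}, new int[] {1, 5});

        // Row function: double the value column of every row.
        List<Integer> doubled =
                applyRowFunction(rows, (int[] r) -> Arrays.asList(r[1] * 2));
        System.out.println(doubled); // [60, 40, 20, 10]

        // Partition function: per key, report the smallest value, which is
        // the first row of the group after ordering by the value column.
        List<String> firsts = applyPartitionFunction(
                rows,
                (int[] r) -> r[0],
                Comparator.comparingInt((int[] r) -> r[1]),
                (List<int[]> g) -> Arrays.asList(g.get(0)[0] + ":" + g.get(0)[1]));
        System.out.println(firsts); // [1:5, 2:10]
    }
}
```

The partition function only sees rows that share a key, already sorted, which is what lets it carry state from one row to the next within a group.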
  • 101. Programming Interface
      • The query planner passes the function a Runtime Contract, which contains the names and types of the input columns and the names and values of the argument clauses
      • The function completes this contract by filling in the output schema and calling complete()
      • Row and partition functions are implemented through the operateOnSomeRows and operateOnPartition methods, respectively
      • These methods are passed an iterator over their input rows and an emitter object for returning output rows to the database
      • A partition function can also optionally implement the combiner interface
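The contract handshake can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the real interface: `ContractSketchDemo`, `Contract`, `addOutputColumn`, `outputSchema`, and `AppendSessionColumn` are hypothetical names, and only the idea of filling in an output schema and then calling `complete()` comes from the slides.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Runtime Contract handshake. The actual
// nCluster classes and method names may differ from everything shown here.
class ContractSketchDemo {

    static class Contract {
        final Map<String, String> inputColumns;          // column name -> type, set by the planner
        final Map<String, List<String>> argumentClauses; // clause name -> argument values
        private final Map<String, String> outputColumns = new LinkedHashMap<>();
        private boolean completed = false;

        Contract(Map<String, String> inputColumns,
                 Map<String, List<String>> argumentClauses) {
            this.inputColumns = inputColumns;
            this.argumentClauses = argumentClauses;
        }

        // The function declares its output schema one column at a time.
        void addOutputColumn(String name, String type) {
            if (completed) {
                throw new IllegalStateException("contract already completed");
            }
            outputColumns.put(name, type);
        }

        // Sealing the contract tells the planner the output schema is final.
        void complete() {
            completed = true;
        }

        boolean isCompleted() {
            return completed;
        }

        Map<String, String> outputSchema() {
            return Collections.unmodifiableMap(outputColumns);
        }
    }

    // A polymorphic function: its output schema is derived from whatever
    // input schema it is handed, plus one extra column.
    static class AppendSessionColumn {
        AppendSessionColumn(Contract contract) {
            for (Map.Entry<String, String> col : contract.inputColumns.entrySet()) {
                contract.addOutputColumn(col.getKey(), col.getValue());
            }
            contract.addOutputColumn("session", "int");
            contract.complete();
        }
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("ts", "timestamp");
        input.put("userid", "int");
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        clauses.put("TIMEOUT", Arrays.asList("60"));

        Contract contract = new Contract(input, clauses);
        new AppendSessionColumn(contract);
        System.out.println(contract.outputSchema()); // {ts=timestamp, userid=int, session=int}
    }
}
```

Because the output schema is computed from the input schema inside the constructor, the same function can be applied to tables with different columns, which is the sense in which SQL/MR functions are dynamically polymorphic.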
  • 106. Installation
      • Functions must be installed before they can be used
      • They can be supplied as a .zip along with third-party libraries
      • Install-time examination also enables static analysis of properties, such as whether the function is a row function or a partition function, whether it supports combining, etc.
      • Arbitrary files, such as configuration files and binaries, can also be installed; they are replicated to all workers
      • Each function is provided with a temporary directory, which is garbage collected after execution
  • 111. Architecture
      • One or more Queen nodes process queries and hash partition them across Worker nodes
      • The query planner honours the Runtime Contract with the function and invokes its initializer (the constructor in the case of Java)
      • Functions are executed within the Worker databases as separate processes for isolation, security, resource allocation, forced termination, etc.
      • The worker database implements a “bridge”, which manages its communication with the SQL/MR function
      • The SQL/MR function process contains a “runner”, which manages its communication with the worker database
  • 112. Architecture (2) [figure not preserved in this transcript]
  • 113. Example: Wordcount
        SELECT token, COUNT(*)
        FROM tokenizer(
            ON input-table
            DELIMITER(' ')
        )
        GROUP BY token;
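To make the division of labour in this query concrete, here is a minimal Java sketch of what it computes: `tokenize` plays the role of the `tokenizer` row function, and `countTokens` stands in for the `GROUP BY token` with `COUNT(*)` that the database performs. The class and method names are hypothetical, and dropping empty tokens is an assumption about the tokenizer's behaviour rather than something the slides specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical single-machine equivalent of the wordcount query.
class WordcountSketch {

    // Row function: split one input row into tokens on the delimiter,
    // mirroring tokenizer(... DELIMITER(' ')). Empty tokens are dropped
    // (an assumption made for this sketch).
    static List<String> tokenize(String row, String delimiter) {
        List<String> tokens = new ArrayList<>();
        for (String t : row.split(Pattern.quote(delimiter))) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Plays the role of SELECT token, COUNT(*) ... GROUP BY token.
    static Map<String, Integer> countTokens(List<String> rows, String delimiter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            for (String token : tokenize(row, delimiter)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("to be or not", "to be");
        System.out.println(countTokens(rows, " ")); // {be=2, not=1, or=1, to=2}
    }
}
```

In the SQL/MR version only the tokenizing step is user code; the grouping and counting are ordinary SQL, so the optimizer can treat them as it would for any other relation.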
  • 116. Example: Clickstream Sessionization
      • Divide a user’s clicks on a website into sessions
      • A session includes the user’s clicks within a specified time period
        Input clicks:
        Timestamp   User ID
        10:00:00    238909
        00:58:24    7656
        10:00:24    238909
        02:30:33    7656
        10:01:23    238909
        10:02:40    238909
        Sessionized output:
        Timestamp   User ID   Session ID
        10:00:00    238909    0
        10:00:24    238909    0
        10:01:23    238909    0
        10:02:40    238909    1
        00:58:24    7656      0
        02:30:33    7656      1
  • 117. Example: Clickstream Sessionization (2)
        SELECT ts, userid, session
        FROM sessionize (
            ON clicks
            PARTITION BY userid
            ORDER BY ts
            TIMECOLUMN ('ts')
            TIMEOUT (60)
        );
  • 118. Example: Clickstream Sessionization (3)
        public class Sessionize implements PartitionFunction {

            private int timeColumnIndex;
            private int timeout;

            public Sessionize(RuntimeContract contract) {
                // Get time column and timeout from contract
                // Define output schema
                contract.complete();
            }

            public void operateOnPartition(
                    PartitionDefinition partition,
                    RowIterator inputIterator,
                    RowEmitter outputEmitter) {
                // Implement the partition function logic
                // Emit output rows
            }

        }
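The partition logic that the skeleton leaves as a comment could look roughly like the following standalone Java sketch. It deliberately avoids the nCluster iterator and emitter types, whose methods are not shown in the slides, and works on a plain list of timestamps instead. `SessionizeSketch`, `assignSessions`, and `toSeconds` are hypothetical names, and starting a new session when the gap is strictly greater than the timeout is an assumption chosen to match the example table with `TIMEOUT (60)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone version of the sessionization logic.
class SessionizeSketch {

    // Takes one user's click times in seconds, already sorted ascending
    // (the effect of PARTITION BY userid ORDER BY ts), and returns one
    // session id per click. A new session starts whenever the gap to the
    // previous click exceeds the timeout (assumed strict comparison).
    static List<Integer> assignSessions(List<Integer> sortedTimes, int timeoutSeconds) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        for (int i = 0; i < sortedTimes.size(); i++) {
            if (i > 0 && sortedTimes.get(i) - sortedTimes.get(i - 1) > timeoutSeconds) {
                session++;
            }
            sessions.add(session);
        }
        return sessions;
    }

    // Converts "HH:MM:SS" to seconds since midnight.
    static int toSeconds(String hms) {
        String[] p = hms.split(":");
        return Integer.parseInt(p[0]) * 3600
                + Integer.parseInt(p[1]) * 60
                + Integer.parseInt(p[2]);
    }

    public static void main(String[] args) {
        List<Integer> user238909 = Arrays.asList(
                toSeconds("10:00:00"), toSeconds("10:00:24"),
                toSeconds("10:01:23"), toSeconds("10:02:40"));
        List<Integer> user7656 = Arrays.asList(
                toSeconds("00:58:24"), toSeconds("02:30:33"));

        System.out.println(assignSessions(user238909, 60)); // [0, 0, 0, 1]
        System.out.println(assignSessions(user7656, 60));   // [0, 1]
    }
}
```

The gaps for user 238909 are 24, 59, and 77 seconds, so only the last click opens a new session, which agrees with the sessionized table in the example.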
  • 119. Outline
      1 Hive
      2 HadoopDB
      3 nCluster
      4 Summary
  • 123. Summary
      • Hive, HadoopDB, and nCluster explore three different points in the design space
          1 Hive uses MapReduce to give DBMS-like functionality
          2 HadoopDB uses MapReduce and DBMS side-by-side
          3 nCluster implements MapReduce within a DBMS
  • 124. References
      1 Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629.
      2 Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. 2009. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2, 1 (August 2009), 922-933.
      3 Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2 (August 2009), 1402-1413.