Pig on Tez
PRESENTED BY

Cheolsoo Park, Netflix
Rohini Palaniswamy, Yahoo!

The Apache Software Foundation
Apache Pig on Tez Team
Name | Role | Company
Alex Bain | Apache Pig Contributor | LinkedIn
Cheolsoo Park | VP, Apache Pig | Netflix
Daniel Dai | Apache Pig PMC | Hortonworks
Mark Wagner | Apache Pig Committer | LinkedIn
Olga Natkovich | Apache Pig PMC, Pig on Tez Project Manager | Yahoo!
Rohini Palaniswamy | Apache Pig PMC | Yahoo!

Agenda
• Overview
• Pig and Hive
• Pig on Tez
  - Why Tez?
  - Benefits of Tez
  - Design
  - Operator DAGs
  - Performance
  - Known Issues
  - Where are we?
  - What next?

Pig Overview
• Apache top-level project for ETL on Hadoop.
• Pig Latin: a procedural scripting language that translates a sequence of data processing steps into MapReduce jobs.
• Easy to write, read, and reuse, and very extensible.
• Feature parity with SQL (FILTER BY, CROSS, JOIN (OUTER, INNER), ORDER BY, LIMIT, RANK, ROLLUP, CUBE), custom Loader and Storer, user-defined functions (Java and non-Java), nested FOREACH, streaming, macros, and much more.

PAGEVIEWS = LOAD '/data/pageviews' AS (user, url);
GRP = GROUP PAGEVIEWS BY user;
CNT = FOREACH GRP GENERATE group, COUNT(PAGEVIEWS.url) AS numvisits;
STORE CNT INTO '/data/visited' USING PigStorage(',');

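As a rough illustration of what the page-view script computes, the group-and-count step can be sketched in Python. The sample rows and the function name `count_visits` are made up for this sketch, and null handling in Pig's COUNT is ignored.

```python
from collections import Counter


def count_visits(pageviews):
    """Count rows per user, mirroring GROUP ... BY user followed by COUNT."""
    return Counter(user for user, _url in pageviews)


# Hypothetical sample rows of (user, url); not taken from the slides.
pageviews = [("alice", "/a"), ("bob", "/b"), ("alice", "/c")]
visits = count_visits(pageviews)

# Emit comma-separated lines, similar in spirit to PigStorage(',').
for user, numvisits in sorted(visits.items()):
    print(f"{user},{numvisits}")
```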
Pig and Hive
Language
- Pig: Pig Latin (procedural)
- Hive: SQL (declarative)
Features
- Pig: Feature rich. Can easily add new operators and constructs, e.g. nested FOREACH, switch case, macros, scalars.
- Hive: Limited to SQL operators.
Developer code
- Pig: Load/StoreFunc, Algebraic and Accumulator Java UDFs, non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby), custom partitioners.
- Hive: StorageHandler, Java UDFs.
Complex processing
- Pig: Well suited. Multi-query works well with 1000s of lines of Pig script.
- Hive: Not a good fit.
Server
- Pig: Client only. Can work with the Hive Metastore using HCatLoader/HCatStorer.
- Hive: Requires a Metastore server, and data has to be registered in it. HiveServer2 for JDBC.

Pig and Hive - Continued
Tez as execution engine
- Pig: Planned for 0.14
- Hive: Planned for Hive 0.13
ORCFile support
- Pig: Patch available. Currently through HCatLoader.
- Hive: From Hive 0.12 onwards. Huge performance gains.
Vectorization
- Pig: No. Maybe in the future.
- Hive: Yes. Huge performance gains.
Transactions
- Pig: No
- Hive: Yes. In the works.
Cost-based optimizer
- Pig: No
- Hive: Yes. In the works.
JDBC support, integration with BI tools
- Pig: No
- Hive: Yes. HiveServer2 with MicroStrategy/Tableau.
Area of application
- Pig: Standard pipeline processing language
- Hive: Interactive analytics/reporting platform

Why Tez?
• Built on top of YARN
  - Multi-tenancy (queues, capacity management)
  - Resource allocation
• DAG execution framework
  - A more natural fit for Pig and Hive than MR, as their execution plans are DAGs.
  - Better than running a DAG of MR jobs that pass data between jobs using HDFS as the intermediate store.
• Different types of edges
  - ONE_ONE, BROADCAST, SHUFFLE
• Flexible Input-Processor-Output runtime model
  - Custom vertex processors, e.g. Map Processor, Reduce Processor, Pig Processor
  - Custom inputs, e.g. MRFileInput (input to map), ShuffledMergedInput (input to reduce)
  - Custom outputs, e.g. OnFileSortedOutput (output of map), MRFileOutput (output of reduce)
• Multiple inputs and outputs
• Highly extensible
• Security
• Support from the Tez community and the Hive community

Why Tez? - As an end user
• Better performance
• Reduced resource usage (containers/memory/CPU)
• Reduced network I/O
• Reduced Namenode and Datanode load

Benefits of Tez
Feature: No intermediate data storage
• Less pressure on the Namenode
  - Fewer calls for listing and getting block locations
  - Smaller namespace usage
  - Cuts down on GC
• Less pressure on the Datanode
  - Cuts down on network IO for both writing and reading
  - Saves space, as intermediate data is not stored with 3 replicas
• Eliminates the extra step of maps reading from HDFS in every intermediate job in the DAG
  - Saves on capacity by eliminating the need for map task containers
Feature: Single AM for whole DAG
• Saves on capacity. For a 5-stage MR job, there would be 5 AM containers launched.
• Eliminates the queue and resource contention faced in MR by jobs that start only after the previous job in the DAG completes.

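The capacity argument for a single AM and no intermediate HDFS storage can be written out as simple arithmetic. This is a minimal sketch: it assumes a linear pipeline where each stage is its own MR job, and the function names are invented for illustration.

```python
def am_containers(stages, engine):
    """ApplicationMaster containers launched for a pipeline of `stages` stages."""
    if engine == "mr":
        return stages  # assumption: one MR job, and so one AM, per stage
    if engine == "tez":
        return 1       # a single AM drives the whole DAG
    raise ValueError(f"unknown engine: {engine}")


def hdfs_handoffs(stages, engine):
    """Times intermediate data is written to HDFS between consecutive stages."""
    if engine == "mr":
        return max(stages - 1, 0)  # each job hands its output to the next via HDFS
    if engine == "tez":
        return 0                   # data moves along DAG edges instead
    raise ValueError(f"unknown engine: {engine}")


print(am_containers(5, "mr"), am_containers(5, "tez"))
print(hdfs_handoffs(5, "mr"), hdfs_handoffs(5, "tez"))
```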
Benefits of Tez - Continued
Feature: Container reuse
• Reduced launch overhead
  - Container request and release overhead
  - Resource localization overhead
  - JVM launch time overhead
• Reduced network IO
  - Reduce tasks can be launched on the same node as the map
  - 1-1 edge tasks can be launched on the same node
Feature: Vertex caching
• Memory structures, such as small tables used for joins, can be cached in the JVM and reused by the next task when the container is reused. Provides a significant speedup.
Feature: Custom inputs and outputs
• Using unsorted input and output where possible saves a lot of CPU and increases performance.
Feature: Dynamic reducer estimation
• Saves on capacity. The number of reducers can be based on data size instead of being fixed.

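Sizing reducers by data volume rather than a fixed count can be sketched as below. This is an illustration under assumptions, not Pig's or Tez's actual estimator: the 1 GB per reducer target and the cap of 999 are illustrative defaults chosen for this sketch, and `estimate_reducers` is a hypothetical name.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Pick a reducer count proportional to data size, clamped to [1, max_reducers]."""
    if input_bytes <= 0:
        return 1  # always keep at least one reducer
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


print(estimate_reducers(2_500_000_000))  # 2.5 GB of intermediate data
```

With a fixed reducer count, a small intermediate output would still occupy every reducer container. An estimate like this requests only as many as the data warrants.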
Pig on Tez - Design
Logical Plan -> (LogToPhyTranslationVisitor) -> Physical Plan
Physical Plan -> (TezCompiler) -> Tez Plan -> Tez Execution Engine
Physical Plan -> (MRCompiler) -> MR Plan -> MR Execution Engine

Pig on Tez - Join
l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x;

[Figure: two plans for the join. In one, left and right splits feed a single "Load L and R" stage followed by "Join". In the other, left splits feed "Load L", right splits feed "Load R", and both feed "Join". Other labels: "Configuration per job", "Configuration per input".]

Pig on Tez - Split + Group-by
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;

[Figure: two plans for the script. Labels in one plan: "Load foo", "Split multiplex", "de-multiplex", "Group by y", "Group by z", "HDFS", "Load g1 and Load g2", "Join". Labels in the other: "Load foo", "Multiple outputs", "Group by y", "Group by z", "HDFS", "Reduce follows reduce", "Join".]

Pig on Tez - Order-by
f = LOAD 'foo' AS (x, y);
o = ORDER f BY x;

[Figure: two plans for the order-by. Labels in one plan: "Load & Sample", "Aggregate", "HDFS", "Stage sample map on distributed cache", "Partition", "Sort". Labels in the other: "Load & Sample", "Sample", "Aggregate", "Pass through input via 1-1 edge", "Broadcast sample map", "Partition", "Sort".]

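The sample, partition, sort flow named in the order-by slide can be sketched as a sample-based range partitioner. This is a single-process illustration under assumptions: the function names, the sample size, and the quantile rule are choices made for this sketch, not Pig's implementation.

```python
import bisect
import random


def sample_boundaries(keys, num_partitions, sample_size=100, seed=0):
    """Sample the keys and pick num_partitions - 1 cut points at even quantiles."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        parts[bisect.bisect_right(boundaries, key)].append(key)
    return parts


def order_by(keys, num_partitions=3):
    """Sort each range partition independently; concatenation is globally sorted."""
    if not keys:
        return []
    boundaries = sample_boundaries(keys, num_partitions)  # the "sample map"
    parts = range_partition(keys, boundaries)             # partition step
    return [key for part in parts for key in sorted(part)]  # sort step


data = [random.Random(1).randint(0, 1000) for _ in range(50)]
print(order_by(data) == sorted(data))
```

Because the boundaries are shared by every partitioning task, the sample map has to reach all of them. The slide's labels suggest this happens either through the distributed cache or through a broadcast.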
Pig on Tez - Skewed join
l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x USING 'skewed';

[Figure: two plans for the skewed join. Labels in one plan: "Sample L", "Load & Sample", "Aggregate", "HDFS", "Stage sample map on distributed cache", "Partition L and Partition R", "Join". Labels in the other: "Load & Sample", "Aggregate", "Pass through input via 1-1 edge", "Broadcast sample map", "Partition L", "Partition R", "Join".]

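The idea behind a skewed join is to use a sample of the left input to find hot keys, spread those keys' left rows over several reducers, and copy the matching right rows to each of them. The sketch below runs in one process and makes several assumptions: it takes a full count instead of a sample, uses a made-up threshold, and simulates reducers with lists. It is not Pig's implementation.

```python
from collections import Counter, defaultdict


def skewed_join(left, right, num_reducers=4, skew_threshold=2):
    """Join (x, y) rows with (x, z) rows on x, spreading hot keys across reducers."""
    # "Sample" step (a full count here): find keys too frequent for one reducer.
    freq = Counter(key for key, _ in left)
    hot = {key for key, n in freq.items() if n > skew_threshold}

    reducers = [{"L": [], "R": []} for _ in range(num_reducers)]
    next_slot = defaultdict(int)

    # Partition L: hot keys go round-robin, other keys go by hash.
    for key, y in left:
        if key in hot:
            target = next_slot[key] % num_reducers
            next_slot[key] += 1
        else:
            target = hash(key) % num_reducers
        reducers[target]["L"].append((key, y))

    # Partition R: rows for hot keys are copied to every reducer.
    for key, z in right:
        targets = range(num_reducers) if key in hot else [hash(key) % num_reducers]
        for target in targets:
            reducers[target]["R"].append((key, z))

    # Join locally inside each reducer.
    joined = []
    for reducer in reducers:
        right_by_key = defaultdict(list)
        for key, z in reducer["R"]:
            right_by_key[key].append(z)
        for key, y in reducer["L"]:
            joined.extend((key, y, z) for z in right_by_key[key])
    return joined


left = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
right = [(1, "p"), (2, "q"), (3, "r")]
print(sorted(skewed_join(left, right)))
```

Each left row lands on exactly one reducer, so every matching pair is produced once even though the right rows for hot keys are replicated.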
Performance numbers
[Chart: time in secs (y-axis, 0 to 5000) for MR and Tez on four queries. Speedup of Tez over MR: Replicated Join (2.8x); Join + Groupby (1.5x); Join + Groupby + Orderby (1.5x); 3 way Split + Join + Groupby + Orderby (2.6x).]

Factors affecting performance
• Number of stages in the DAG
  - The more stages in the DAG, the larger Tez's advantage over MR.
• Cluster/queue capacity
  - The more congested the queue, the larger Tez's advantage over MR, due to container reuse.
• Size of intermediate output
  - The larger the intermediate output, the larger Tez's advantage over MR, due to reduced HDFS usage.
• Size of data in the job
  - For smaller data and more stages, Tez's advantage over MR is larger, because launch overhead is a higher percentage of total time for smaller jobs.
• Vertex caching

Container usage
Query | MR | Tez | Savings | Tez with container reuse
Replicated Join | 7563 | 7562 | 1 | 180
Join + Groupby + Orderby | 7655 | 7603 | 52 | 180
Join + Groupby + Orderby | 7663 | 7609 | 54 | 180
3 way Split + Join + Groupby + Order by | 622 | 563 | 59 | 180

Note: The cluster had 25 nodes with 180 containers (1.5G each), and Tez reused those containers repeatedly for tasks.

Known issues
• Container reuse will have issues when there are
  - Static variables in LoadFunc, StoreFunc, UDFs
  - Memory leaks in LoadFunc, StoreFunc, UDFs
• With single-DAG execution of the whole script, AM retries can be very costly until Tez supports checkpointing and resuming.

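The static-variable problem under container reuse can be illustrated with a class-level field that outlives a task. This is a Python analogy with hypothetical names (`CountingUdf`, `run_task`). A real Pig UDF would be Java, where a `static` field behaves the same way when the JVM is reused.

```python
class CountingUdf:
    """Hypothetical UDF that keeps state in a class-level (static) variable."""

    seen = []  # shared by every instance created in the same process

    def exec(self, value):
        CountingUdf.seen.append(value)
        return len(CountingUdf.seen)


def run_task(rows):
    """Simulate one task: a fresh UDF instance, but the same process."""
    udf = CountingUdf()
    return [udf.exec(row) for row in rows]


first = run_task(["a", "b"])
second = run_task(["c"])  # runs in the "reused container"
print(first, second)
```

With one process per task the second task would start counting from 1. In a reused process it continues from the first task's leftover state, which is the kind of surprise the slide warns about.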
Where are we?
• Major operators
  - Split, Union
  - Group-by, Distinct, Limit
  - Order-by
  - Load, Store, Filter-by, Foreach
  - Hash join, Replicated join, Skewed join, Merge join
• UDFs (Java and non-Java)
• Streaming
• Multi-query on and off
• Macros
• Scalars
• 95% of e2e tests pass for finished features.

What next?
• Feature parity with MR
  - Local mode
  - Port all unit and e2e tests
  - Support for remaining operators: CROSS, RANK, CUBE, ROLLUP
  - Support for native MapReduce (low priority)
• Merge tez branch with trunk
• Stability
  - Handling failures
  - Testing and tuning for large data and DAGs with > 10 stages
• Usability
  - Counters
  - Progress information
  - Log information and debuggability

What next? - Performance Improvements
• Dynamic reducer estimation
• Better memory management
• Calculate input splits in the AM and let Tez combine input splits for pig.maxCombinedSplitSize
• Vertex grouping to write data directly into one output directory from multiple vertices in case of union
• Using unsorted shuffle in Union, Orderby, Skewed Join, etc. to improve performance
• Shared edges for multiple outputs if the same data has to go to multiple downstream vertices, e.g. multi-query off, skewed join sample aggregation output
• HDFS caching

Contributors Welcome

Pig User Group Meetup at LinkedIn
14th March 2014

Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache TezGal Vinograd
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 

Was ist angesagt? (20)

Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 

Andere mochten auch

Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksHortonworks
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudAmazon Web Services
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudAmazon Web Services
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewAmazon Web Services
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 

Andere mochten auch (11)

Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance Benchmarks
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS Cloud
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS Cloud
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 

Ähnlich wie February 2014 HUG : Pig On Tez

Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 
Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Big Data Joe™ Rossi
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120Hyoungjun Kim
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 
The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!gagravarr
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 

Ähnlich wie February 2014 HUG : Pig On Tez (20)

Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
 
Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 

February 2014 HUG : Pig On Tez

  • 1. Pig on Tez. Presented by Cheolsoo Park, Netflix, and Rohini Palaniswamy, Yahoo! The Apache Software Foundation
  • 2. Apache Pig on Tez Team (Name: Role, Company)
      - Alex Bain: Apache Pig Contributor, LinkedIn
      - Cheolsoo Park: VP, Apache Pig, Netflix
      - Daniel Dai: Apache Pig PMC, Hortonworks
      - Mark Wagner: Apache Pig Committer, LinkedIn
      - Olga Natkovich: Apache Pig PMC, Pig on Tez Project Manager, Yahoo!
      - Rohini Palaniswamy: Apache Pig PMC, Yahoo!
  • 3. Agenda
      - Overview
      - Pig and Hive
      - Pig on Tez: Why Tez?, Benefits of Tez, Design, Operator DAGs, Performance, Known Issues, Where are we?, What next?
  • 4. Pig Overview
      - Apache top-level project for ETL on Hadoop.
      - Pig Latin: a procedural scripting language that translates a sequence of data processing steps into MapReduce jobs.
      - Easy to write, read and reuse, and very extensible.
      - Feature parity with SQL (FILTER BY, CROSS, JOIN (OUTER, INNER), ORDER BY, LIMIT, RANK, ROLLUP, CUBE), custom Loader and Storer, user-defined functions (Java and non-Java), nested FOREACH, streaming, macros and much more.
      Example script:
        PAGEVIEWS = LOAD '/data/pageviews' AS (user, url);
        GRP = GROUP PAGEVIEWS BY user;
        CNT = FOREACH GRP GENERATE group, COUNT(PAGEVIEWS.url) AS numvisits;
        STORE CNT INTO '/data/visited' USING PigStorage(',');
  • 5. Pig and Hive
      - Language. Pig: Pig Latin, procedural. Hive: SQL, declarative.
      - Features. Pig: feature rich; new operators and constructs are easy to add, e.g. nested FOREACH, switch case, macros, scalars. Hive: limited to SQL operators.
      - Developer code. Pig: Load/StoreFunc, Algebraic and Accumulator Java UDFs, non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby), custom partitioners. Hive: StorageHandler, Java UDFs.
      - Complex processing. Pig: well suited; multi-query works well with 1000s of lines of Pig script. Hive: not a good fit.
      - Server. Pig: client only; can work with the Hive Metastore using HCatLoader/Storer. Hive: requires a Metastore server and data has to be registered in it; HiveServer2 for JDBC.
  • 6. Pig and Hive - Continued
      - Tez as execution engine. Pig: planned for 0.14. Hive: planned for Hive 0.13.
      - ORCFile support. Pig: patch available; currently through HCatLoader. Hive: from Hive 0.12 onwards; huge performance gains.
      - Vectorization. Pig: no; maybe in future. Hive: yes; huge performance gains.
      - Transactions. Pig: no. Hive: yes; in the works.
      - Cost-based optimizer. Pig: no. Hive: yes; in the works.
      - JDBC support, integration with BI tools. Pig: no. Hive: yes; HiveServer2 with MicroStrategy/Tableau.
      - Area of application. Pig: pipeline processing language standard. Hive: interactive analytics/reporting platform.
  • 7. Why Tez?
      - Built on top of YARN
          - Multi-tenancy (queues, capacity management)
          - Resource allocation
      - DAG execution framework
          - A more natural fit for Pig and Hive than MR, as their execution plans are DAGs.
          - Better than running a DAG of MR jobs that pass data between jobs using HDFS as the intermediate store.
      - Different types of edges
          - ONE_ONE, BROADCAST, SHUFFLE
      - Flexible Input-Processor-Output runtime model
          - Custom vertex processors, e.g. Map Processor, Reduce Processor, Pig Processor
          - Custom inputs, e.g. MRFileInput (input to map), ShuffledMergedInput (input to reduce)
          - Custom outputs, e.g. OnFileSortedOutput (output of map), MRFileOutput (output of reduce)
          - Multiple inputs and outputs
      - Highly extensible
      - Security
      - Support from the Tez community and the Hive community
  • 8. Why Tez? – As an end user
      - Better performance
      - Reduced resource usage (containers/memory/CPU)
      - Reduced network I/O
      - Reduced Namenode and Datanode load
  • 9. Benefits of Tez (Feature: Benefits)
      - No intermediate data storage
          - Less pressure on the Namenode: fewer calls for listing and getting block locations; smaller namespace usage; cuts down on GC.
          - Less pressure on the Datanode: cuts down on network IO for both writing and reading; saves space as there are no 3 replicas.
          - Eliminates the extra step of map reads from HDFS in every intermediate job in the DAG: saves on capacity by eliminating the need for map task containers.
      - Single AM for whole DAG
          - Saves on capacity. For a 5-stage MR job, there would be 5 AM containers launched.
          - Eliminates the queue and resource contention faced in MR by jobs that start after the previous job in the DAG completes.
  • 10. Benefits of Tez - Continued (Feature: Benefits)
      - Container reuse
          - Reduced launch overhead: container request and release overhead; resource localization overhead; JVM launch time overhead.
          - Reduced network IO: reduce tasks can be launched on the same node as map; 1-1 edge tasks can be launched on the same node.
      - Vertex caching
          - Memory structures like small tables used for joins can be cached in the JVM and reused for the next task on container reuse. Provides significant performance speedup.
      - Custom inputs and outputs
          - Using unsorted input and output where possible saves a lot of CPU usage and increases performance.
      - Dynamic reducer estimation
          - Saves on capacity. Can have reducers based on data size instead of a fixed number of reducers.
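The "dynamic reducer estimation" benefit on slide 10 (reducers based on data size rather than a fixed count) can be illustrated with a minimal sketch. The function name and the default values for bytes per reducer and the reducer cap are placeholders chosen for this example; they are not taken from the slides and are not presented as Pig's or Tez's actual logic.

```python
import math


def estimate_reducers(input_bytes: int,
                      bytes_per_reducer: int = 1_000_000_000,
                      max_reducers: int = 999) -> int:
    """Pick a reducer count proportional to the data size.

    bytes_per_reducer and max_reducers are illustrative defaults
    (assumptions for this sketch), not values from the presentation.
    """
    if input_bytes <= 0:
        return 1  # always run at least one reducer
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    for size in (10, 2_500_000_000, 10**15):
        print(size, "bytes ->", estimate_reducers(size), "reducers")
```

With a fixed reducer count, a small intermediate output would still occupy every reducer slot; sizing from the observed data volume is what lets the engine save capacity.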
  • 11. Pig on Tez - Design
      [Diagram] Logical Plan → LogToPhyTranslationVisitor → Physical Plan, which then branches:
      - TezCompiler → Tez Plan → Tez Execution Engine
      - MRCompiler → MR Plan → MR Execution Engine
  • 12. Pig on Tez – Join
        l = LOAD 'left' AS (x, y);
        r = LOAD 'right' AS (x, z);
        j = JOIN l BY x, r BY x;
      [Diagram] Two plans are shown. In one, the left and right splits feed a single "Load L and R" stage followed by Join. In the other, the left splits feed "Load L" and the right splits feed "Load R", both followed by Join. Labels: "Configuration per job", "Configuration per input".
  • 13. Pig on Tez – Split + Group-by
        f = LOAD 'foo' AS (x, y, z);
        g1 = GROUP f BY y;
        g2 = GROUP f BY z;
        j = JOIN g1 BY group, g2 BY group;
      [Diagram] Two plans are shown. One: Load foo → Split (multiplex, de-multiplex) → Group by y, Group by z → HDFS → Load g1 and Load g2 → Join. The other: Load foo (multiple outputs) → Group by y, Group by z → Join (reduce follows reduce).
  • 14. Pig on Tez – Order-by
        f = LOAD 'foo' AS (x, y);
        o = ORDER f BY x;
      [Diagram] Two plans are shown. One: Load & Sample → Aggregate → HDFS → (stage sample map on distributed cache) → Partition → Sort. The other: Load & Sample → Sample Aggregate → (broadcast sample map; pass through input via 1-1 edge) → Partition → Sort.
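The order-by plan on slide 14 (sample, aggregate the sample into a map, partition, sort) follows the general pattern of sample-based range partitioning. The sketch below shows that pattern in plain Python, assuming in-memory lists stand in for distributed data; the function names, sample size and seed are hypothetical and do not reproduce Pig's implementation.

```python
import bisect
import random


def sample_boundaries(keys, num_partitions, sample_size=100, seed=0):
    """'Sample' + 'Aggregate': derive partition cut points from a key sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # num_partitions - 1 cut points taken at even quantiles of the sample
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(records, boundaries, key=lambda r: r[0]):
    """'Partition': route each record to the key range it falls into."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect.bisect_right(boundaries, key(rec))].append(rec)
    return parts


def order_by(records, num_partitions=4):
    """'Sort': sort each partition; concatenation is globally ordered."""
    boundaries = sample_boundaries([r[0] for r in records], num_partitions)
    parts = range_partition(records, boundaries)
    return [rec for part in parts for rec in sorted(part, key=lambda r: r[0])]


if __name__ == "__main__":
    data = [(x, f"row{x}") for x in range(20)]
    random.Random(1).shuffle(data)
    print(order_by(data, num_partitions=3)[:5])
```

Because every key in partition i is smaller than every key in partition i+1, each partition can be sorted independently. The boundaries list plays the role of the "sample map" that the slide shows being staged on the distributed cache in one plan and broadcast in the other.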
  • 15. Pig on Tez – Skewed join
        l = LOAD 'left' AS (x, y);
        r = LOAD 'right' AS (x, z);
        j = JOIN l BY x, r BY x USING 'skewed';
      [Diagram] Two plans are shown. One: Sample L → Aggregate → HDFS → (stage sample map on distributed cache) → Partition L and Partition R → Join. The other: Load & Sample → Aggregate → (broadcast sample map; pass through input via 1-1 edge) → Partition L, Partition R → Join.
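The skewed-join plan on slide 15 samples the left input, builds a sample map, and then partitions both inputs before joining. One common way to use such a map is sketched below: rows of a heavy left key are spread over several reducers, and right rows with that key are copied to each of them. This is an illustration under stated assumptions (the whole left key list serves as the "sample", and all names and thresholds are hypothetical), not Pig's actual partitioner.

```python
from collections import Counter, defaultdict


def plan_skew(left_key_sample, num_reducers, threshold):
    """Map each heavy key to the list of reducers that share it."""
    plan = {}
    for key, n in Counter(left_key_sample).items():
        if n > threshold:
            width = min(num_reducers, -(-n // threshold))  # ceiling division
            start = hash(key) % num_reducers
            plan[key] = [(start + i) % num_reducers for i in range(width)]
    return plan


def skewed_join(left, right, num_reducers=4, threshold=2):
    """Inner join of (key, value) pairs with heavy keys spread out."""
    plan = plan_skew([k for k, _ in left], num_reducers, threshold)
    reducers = [{"L": defaultdict(list), "R": defaultdict(list)}
                for _ in range(num_reducers)]
    next_slot = defaultdict(int)
    for k, v in left:  # Partition L: round-robin over the key's reducers
        targets = plan.get(k, [hash(k) % num_reducers])
        r = targets[next_slot[k] % len(targets)]
        next_slot[k] += 1
        reducers[r]["L"][k].append(v)
    for k, v in right:  # Partition R: copy to every reducer holding the key
        for r in plan.get(k, [hash(k) % num_reducers]):
            reducers[r]["R"][k].append(v)
    out = []
    for red in reducers:  # Join: each reducer joins only what it received
        for k, lvals in red["L"].items():
            for lv in lvals:
                for rv in red["R"].get(k, []):
                    out.append((k, lv, rv))
    return out


if __name__ == "__main__":
    left = [(1, "a"), (1, "b"), (1, "c"), (1, "d"), (2, "e")]
    right = [(1, "x"), (2, "y"), (3, "z")]
    print(sorted(skewed_join(left, right)))
```

Each left row lands on exactly one reducer, so every matching pair is produced once even though the right rows for a heavy key are duplicated.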
  • 16. Performance numbers
      [Bar chart] Y-axis: time in secs (0 to 5000). Series: MR, Tez. Queries, with the speedup given on the slide:
      - Replicated Join (2.8x)
      - Join + Groupby (1.5x)
      - Join + Groupby + Orderby (1.5x)
      - 3 way Split + Join + Groupby + Orderby (2.6x)
  • 17. Factors affecting performance
      - Number of stages in the DAG: the more stages in the DAG, the better Tez performs relative to MR.
      - Cluster/queue capacity: the more congested a queue is, the better Tez performs relative to MR, due to container reuse.
      - Size of intermediate output: the larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage.
      - Size of data in the job: for smaller data and more stages, Tez performs better relative to MR, as launch overhead is a high percentage of the total time for smaller jobs.
      - Vertex caching
  • 18. Container usage
      Query                                    MR     Tez    Savings   Tez with container reuse
      Replicated Join                          7563   7562   1         180
      Join + Groupby + Orderby                 7655   7603   52        180
      Join + Groupby + Orderby                 7663   7609   54        180
      3 way Split + Join + Groupby + Order by  622    563    59        180
      Note: The cluster size was 25 nodes with 180 containers (1.5G each), and Tez reused them again and again for tasks.
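The container-usage figures on slide 18 can be checked with a few lines of arithmetic. The sketch below recomputes the Savings column as MR minus Tez and, assuming each container in the Tez column ran a single task (an interpretation, not stated on the slide), estimates how many tasks each of the 180 reused containers would need to run. Query labels are copied as they appear in the transcript.

```python
import math

# (query label, MR containers, Tez containers without reuse), from slide 18
ROWS = [
    ("Replicated Join", 7563, 7562),
    ("Join + Groupby + Orderby", 7655, 7603),
    ("Join + Groupby + Orderby", 7663, 7609),
    ("3 way Split + Join + Groupby + Order by", 622, 563),
]
REUSED_CONTAINERS = 180  # "Tez with container reuse" column


def summarize(rows, reused=REUSED_CONTAINERS):
    """Recompute savings and an assumed tasks-per-container upper bound."""
    return [
        {
            "query": label,
            "savings": mr - tez,
            # Assumption: one task per container when reuse is off.
            "tasks_per_reused_container": math.ceil(tez / reused),
        }
        for label, mr, tez in rows
    ]


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row)
```

The recomputed savings match the slide's Savings column, which supports the row and column alignment used when reading the flattened table.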
  • 19. Known issues
      - Container reuse will have issues when there are:
          - Static variables in LoadFunc, StoreFunc, UDFs
          - Memory leaks in LoadFunc, StoreFunc, UDFs
      - With single DAG execution of the whole script, AM retries can be very costly until Tez supports checkpointing and resuming.
  • 20. Where are we?
      - Major operators
          - Load, Store, Filter-by, Foreach
          - Split, Union
          - Group-by, Distinct, Limit
          - Order-by
          - Hash join, Replicated join, Skewed join, Merge join
      - UDFs (Java and non-Java)
      - Streaming
      - Multi-query on and off
      - Macros
      - Scalars
      - 95% of e2e tests pass for finished features.
  • 21. What next?
      - Feature parity with MR
          - Local mode
          - Port all unit and e2e tests
          - Support for remaining operators: CROSS, RANK, CUBE, ROLLUP
          - Support for native MapReduce (low priority)
      - Merge tez branch with trunk
      - Stability
          - Handling failures
          - Testing and tuning for large data and DAGs with > 10 stages
      - Usability
          - Counters
          - Progress information
          - Log information and debuggability
  • 22. What next? – Performance Improvements
      - Dynamic reducer estimation
      - Better memory management
      - Calculate input splits in the AM and let Tez combine input splits for pig.maxCombinedSplitSize
      - Vertex grouping to write data directly into one output directory from multiple vertices in case of union
      - Using unsorted shuffle in Union, Orderby, Skewed Join, etc. to improve performance
      - Shared edges for multiple outputs if the same data has to go to multiple downstream vertices, e.g. multi-query off, skewed join sample aggregation output
      - HDFS caching
  • 23. Contributors Welcome
  • 24. Pig User Group Meetup at LinkedIn, 14th March 2014
  • 25. Questions???

Editor's notes

  1. Pigmix queries
  2. - Either turn off container reuse or fix code
  3. - Both Algebraic and Accumulator UDFs
  4. - Both Algebraic and Accumulator UDFs