Pig: Data Analysis Tool in the Cloud
  Jeff Zhang (zjffdu@gmail.com)
  Committer of Pig at the ASF
Agenda
  Background
  What is Pig
  Brief introduction to Pig internals
  Demo
  Q/A
Data Explosion
  Web 2.0
Then Along Comes Pig
What is Pig
  Apache Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  Ease of programming
  Optimization opportunities
  Extensibility
  Built upon Hadoop
A simple example of Pig Latin
  Input (time_stamp, url):
    1291950309812, http://snda.com/page_1
    1291950309822, http://snda.com/page_2
    1291950309832, http://snda.com/page_3
    ….
  Script:
    raw_data = load '/java_one/pv' using PigStorage(',') as (time_stamp:long, url:chararray);
    pages = foreach raw_data generate url;
    pages = group pages by url;
    pages = foreach pages generate group as url, COUNT(pages.url) as pv;
    pages = order pages by pv desc;
    top10 = limit pages 10;
    dump top10;
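For readers who have not used Pig Latin, the script above can be mirrored step by step in plain Python. This is only a single-machine sketch of the same logic, not how Pig executes it. The fourth sample row and the function name `top_pages` are assumptions added for illustration, and ties are broken by URL here, which Pig itself does not promise.

```python
from collections import Counter

# Sample rows in the same "time_stamp, url" layout as the slide's input.
RAW_LINES = [
    "1291950309812, http://snda.com/page_1",
    "1291950309822, http://snda.com/page_2",
    "1291950309832, http://snda.com/page_3",
    "1291950309842, http://snda.com/page_1",  # extra row, added for illustration
]


def top_pages(lines, limit=10):
    """Count page views per URL and return the `limit` most viewed URLs."""
    # load ... as (time_stamp, url); foreach ... generate url
    urls = [line.split(",", 1)[1].strip() for line in lines]
    # group by url, then COUNT per group
    counts = Counter(urls)
    # order by pv desc (ties broken by URL), then limit
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]
```

Calling `top_pages(RAW_LINES)` returns a list of `(url, pv)` pairs, most viewed first, which corresponds to what `dump top10` would print.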
Operators in Pig Latin
  Load     - a = load 'data' using PigStorage('') as (f1:int, f2:double, f3:chararray);
  Store    - store a into '/test/output' using PigStorage(',');
  Dump     - dump a;
  Filter   - b = filter a by f1 > 0 and f3 == 'java_one';
  Foreach  - b = foreach a generate f1, f3;
  Group    - b = group a by f3;
  Join     - c = join a by f1, b by f1;
  Describe - describe b;
  ….
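The relational operators listed above have close analogues on ordinary in-memory collections. The following Python sketch shows what filter, foreach, group and join compute on a handful of rows. The sample rows, the second relation `other`, and the helper names `group_by` and `inner_join` are assumptions made for illustration; Pig runs the same operations as distributed jobs rather than list comprehensions.

```python
from collections import defaultdict

# Rows with the schema (f1:int, f2:double, f3:chararray) used on the slide.
a = [(1, 1.2, "java"), (-2, 2.3, "c++"), (3, 4.5, "java")]

# filter a by f1 > 0
filtered = [row for row in a if row[0] > 0]

# foreach a generate f1, f3
projected = [(f1, f3) for f1, _f2, f3 in a]


def group_by(rows, key_index):
    """group rows by one field: maps each key to the bag of full rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def inner_join(left, right, left_key, right_key):
    """Inner join: each output row is the left fields followed by the right fields."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# group a by f3
grouped = group_by(a, 2)

# join a by f1, other by f1 (other is a hypothetical second relation)
other = [(1, "x"), (3, "y"), (4, "z")]
joined = inner_join(a, other, 0, 0)
```

As in Pig, the grouped result keeps the whole input row inside each bag, and the join output concatenates the fields of both sides.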
Data Structure in Pig
  Cell -> field in a database
    Primitive types: int, long, float, double, bytearray, chararray, null
    Complex types: map, tuple, bag (DataBag)
  Tuple -> row
    (1, 1.2, "java")
  DataBag -> table or view
    { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }
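One way to picture these structures is through their nearest Python counterparts. The mapping below is an analogy under stated assumptions, not Pig's internal representation: a list stands in for a bag because bags are unordered and may contain duplicates, and the `properties` map and the `bag_size` helper are hypothetical examples.

```python
# A tuple is an ordered row of fields; fields may have different types.
row = (1, 1.2, "java")

# A bag is a collection of tuples. A plain list is used as a stand-in.
bag = [(1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c")]

# A map associates chararray keys with values (hypothetical sample content).
properties = {"lang": "java", "score": 1.2}

# Complex types nest: grouping produces tuples of the form (key, bag).
grouped_row = ("java", [(1, 1.2, "java")])


def bag_size(b):
    """Number of tuples in a bag, roughly what COUNT reports for a bag."""
    return len(b)
```

The nesting in `grouped_row` is the shape that a group operation hands to the next statement, which is why aggregate functions such as COUNT take a bag as their argument.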
How to use Pig
  Grunt (interactive shell)
  Java API
  Other languages (in future)
Architecture of Pig
  Grunt (interactive shell)
  PigServer (Java API)
  Parser (Pig Latin -> LogicalPlan)
  PigContext
  Optimizer (LogicalPlan -> LogicalPlan)
  Compiler (LogicalPlan -> PhysicalPlan -> MapReducePlan)
  ExecutionEngine
  Hadoop
Three basic operations of Pig
  Group by
  Join
  Order
How Pig Does Group By
  Stages: Data Source -> Split -> Mapper -> Partition -> Reducer
  Data source: (A,1) (B,2) (C,3) (B,4) (B,5) (C,6) (A,7) (E,8) (D,9)
  Split 1: (A,1) (B,2) (C,3)
  Split 2: (B,4) (B,5) (C,6)
  Split 3: (A,7) (E,8) (D,9)
  Reducer 1: (A,{(A,1),(A,7)}) (C,{(C,3),(C,6)}) (E,{(E,8)})
  Reducer 2: (B,{(B,2),(B,4),(B,5)}) (D,{(D,9)})
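The group-by flow on this slide can be simulated in a few lines of Python. The sketch reads the flattened diagram as three input splits feeding two reducers, which is an interpretation of the slide rather than something it states outright. The `partition` function is a hypothetical stand-in chosen so that keys A, C and E land on one reducer and B and D on the other, as in the diagram; Hadoop's default partitioner hashes the key instead.

```python
from collections import defaultdict

# The three input splits, as read from the slide's diagram.
SPLITS = [
    [("A", 1), ("B", 2), ("C", 3)],
    [("B", 4), ("B", 5), ("C", 6)],
    [("A", 7), ("E", 8), ("D", 9)],
]


def partition(key, num_reducers):
    """Hypothetical partitioner: A, C, E -> reducer 0; B, D -> reducer 1."""
    return (ord(key) - ord("A")) % num_reducers


def group_by_key(splits, num_reducers=2):
    """Simulate map, partition and reduce for a group-by on the first field."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    # Map phase: emit (key, record) and route it to a reducer by key.
    for split in splits:
        for record in split:
            key = record[0]
            reducer_inputs[partition(key, num_reducers)][key].append(record)
    # Reduce phase: each reducer emits one (key, bag of records) pair per key.
    return [sorted(inputs.items()) for inputs in reducer_inputs]
```

Because every record with the same key is routed to the same reducer, each reducer can build the complete bag for its keys without talking to the others.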
How Pig Does Join
  Stages: Data Source -> Split -> Mapper -> Partition -> Reducer
  Data source A: (1,A1) (4,A4) (3,A3) (5,A5) (2,A2)
  Data source B: (5,B5) (1,B1) (3,B3) (2,B2) (4,B4)
  Splits of A: (1,A1) (4,A4) | (3,A3) (5,A5) | (2,A2)
  Splits of B: (5,B5) (1,B1) | (3,B3) (2,B2) | (4,B4)
  Reducer 1: ((1,A1),(1,B1)) ((3,A3),(3,B3)) ((5,A5),(5,B5))
  Reducer 2: ((2,A2),(2,B2)) ((4,A4),(4,B4))
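The join flow works the same way as group-by, except that records from both inputs are routed by the join key and paired up in the reducer. The Python sketch below simulates that reduce-side join on the two relations from the slide. The partitioning rule `(key - 1) % num_reducers` is an assumption picked so that odd keys reach the first reducer and even keys the second, matching the diagram.

```python
from collections import defaultdict

# The two input relations, as read from the slide's diagram.
A = [(1, "A1"), (4, "A4"), (3, "A3"), (5, "A5"), (2, "A2")]
B = [(5, "B5"), (1, "B1"), (3, "B3"), (2, "B2"), (4, "B4")]


def reduce_side_join(left, right, num_reducers=2):
    """Simulate a reduce-side inner join on the first field of each record."""
    # Per reducer: key -> (records from left, records from right).
    reducer_inputs = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    # Map phase: tag each record with its source and route it by join key.
    for tag, relation in enumerate((left, right)):
        for record in relation:
            key = record[0]
            target = (key - 1) % num_reducers  # hypothetical partitioner
            reducer_inputs[target][key][tag].append(record)
    # Reduce phase: per key, pair every left record with every right record.
    outputs = []
    for inputs in reducer_inputs:
        joined = []
        for key in sorted(inputs):
            lefts, rights = inputs[key]
            joined.extend((l, r) for l in lefts for r in rights)
        outputs.append(joined)
    return outputs
```

Keys present in only one input produce no output here, which is the behaviour of an inner join.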
How Pig Does Sort
  Stages: Data Source -> Split -> Mapper -> Range Partition -> Reducer
  Values traced in the diagram: (100) (200) (900) (50) (600) (800) (300) (400)
  Each reducer receives one contiguous range of values.
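Range partitioning is what lets several reducers produce one globally sorted result: each reducer owns a contiguous key range, so concatenating the reducer outputs in order gives a sorted list. The Python sketch below illustrates the idea on the eight values that appear in the slide's diagram. The single boundary of 500 is an assumption for this example; in a real run Pig derives the boundaries by sampling the input.

```python
import bisect

# The values that appear in the slide's diagram.
VALUES = [100, 200, 900, 50, 600, 800, 300, 400]


def range_partition_sort(values, boundaries):
    """Send each value to the reducer owning its range, then sort per reducer.

    `boundaries` holds the ascending cut points between reducers, so
    len(boundaries) + 1 reducers are used.
    """
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        reducers[bisect.bisect_left(boundaries, v)].append(v)
    # Each reducer sorts only the values in its own range.
    return [sorted(r) for r in reducers]


def concatenate(reducer_outputs):
    """Join the reducer outputs in reducer order."""
    return [v for out in reducer_outputs for v in out]
```

With `boundaries=[500]`, the first reducer handles the values up to 500 and the second handles the rest, and `concatenate` returns the full list in ascending order.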
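UDF (User-Defined Function)
  Pig Latin script:
    register myudf.jar;
    raw_data = load '/java_one/udf' as (name:chararray);
    firstnames = foreach raw_data generate myudf.FirstName(name);
    store firstnames into '/java_one/udf_output';
  Java implementation:
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Evaluation function that returns the first name for a full name.
    public class FirstName extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            // The first field of the input tuple is the full name.
            String name = input.get(0).toString();
            ….
            return firstname;
        }
    }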
What Storage Pig Supports
  HDFS
    Plain text
    Binary format
    Customized format (XML, JSON, Protobuf, Thrift…)
  RDBMS (DBStorage)
  Cassandra (CassandraStorage)
  HBase (HBaseStorage)
Where Pig Can Be Applied
  Data analysis
  Text processing
  ETL
  Machine learning
Who’s using Pig
  More: http://wiki.apache.org/pig/PoweredBy
References
  http://pig.apache.org (Pig official site)
  http://hadoop.apache.org (Hadoop official site)
  https://github.com/zjffdu/RAF-PIG (Rich API for Pig)
Demo
Thank you! Q&A
Analyze Large Data Sets with Apache Pig: A Platform for Data Analysis in the Cloud
