The document discusses the data analysis tool Apache Pig. Pig is a platform for analyzing large datasets that uses a high-level language called Pig Latin for expressing data analysis programs. Pig is built on Hadoop and allows for easy programming while also providing optimization opportunities and extensibility. The document provides an overview of Pig internals including its data structures, basic operations like group, join and order, and architecture. Examples of Pig Latin code and use cases for Pig are also presented.
5. What is Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Ease of programming
- Optimization opportunities
- Extensibility
- Built upon Hadoop
7. Operators in Pig Latin
- Load: a = load 'data' using PigStorage('') as (f1:int, f2:double, f3:chararray);
- Store: store a into '/test/output' using PigStorage(',');
- Dump: dump a;
- Filter: b = filter a by f1 > 0 and f3 == 'java_one';
- Foreach: b = foreach a generate f1, f3;
- Group: b = group a by f3;
- Join: c = join a by f1, b by f1;
- Describe: describe b;
- …
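For readers who think in Java, the semantics of filter and group can be approximated with plain Java streams. This is only an analogy, not how Pig executes these operators: the class `PigOperatorsSketch`, the `Row` type and the sample rows are hypothetical, and the field names f1, f2, f3 simply mirror the load schema on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PigOperatorsSketch {

    // Hypothetical row type mirroring the schema (f1:int, f2:double, f3:chararray).
    public static class Row {
        public final int f1;
        public final double f2;
        public final String f3;

        public Row(int f1, double f2, String f3) {
            this.f1 = f1;
            this.f2 = f2;
            this.f3 = f3;
        }
    }

    // Made-up sample data standing in for "load 'data'".
    public static List<Row> sampleRows() {
        return Arrays.asList(
                new Row(1, 1.2, "java"),
                new Row(2, 2.3, "c++"),
                new Row(-3, 4.5, "java"));
    }

    // Rough analogue of: b = filter a by f1 > 0;
    public static List<Row> filterPositive(List<Row> a) {
        return a.stream().filter(r -> r.f1 > 0).collect(Collectors.toList());
    }

    // Rough analogue of grouping by f3 and keeping only f1 for each group.
    // A TreeMap is used so that the keys come back in a stable order.
    public static Map<String, List<Integer>> groupF1ByF3(List<Row> a) {
        return a.stream().collect(Collectors.groupingBy(
                r -> r.f3,
                TreeMap::new,
                Collectors.mapping(r -> r.f1, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Row> filtered = filterPositive(sampleRows());
        System.out.println(groupF1ByF3(filtered));
    }
}
```

In Pig the same steps would run as distributed jobs on Hadoop; the sketch only shows what each operator computes on a small in-memory list.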
8. Data Structures in Pig
- Cell (like a field in a database)
  - Primitive types: int, long, float, double, bytearray, chararray, null
  - Complex types: map, tuple, databag
- Tuple (like a row): (1, 1.2, "java")
- DataBag (like a table or view): { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }
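The cell / tuple / bag hierarchy above can be pictured with ordinary Java collections. The sketch below is an illustration only: it does not use Pig's own data classes, and the names `PigDataModelSketch`, `tuple` and `sampleBag` are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PigDataModelSketch {

    // A tuple modeled as an ordered list of cells (fields of any type).
    public static List<Object> tuple(Object... fields) {
        return new ArrayList<>(Arrays.asList(fields));
    }

    // A bag modeled as a collection of tuples, using the values from the slide.
    public static List<List<Object>> sampleBag() {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(tuple(1, 1.2, "java"));
        bag.add(tuple(2, 2.3, "c++"));
        bag.add(tuple(3, 4.5, "c"));
        return bag;
    }

    public static void main(String[] args) {
        // Walk the bag row by row and read the third cell of each tuple.
        for (List<Object> row : sampleBag()) {
            System.out.println(row.get(2));
        }
    }
}
```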
9. How to Use Pig
- Grunt (interactive shell)
- Java API
- Other languages (in future)
14. How Pig Does Sort
[Diagram: Data Source → Split → Mapper → Range Partition → Reducer, shown sorting the sample values 100, 200, 900, 50, 600, 800, 300, 400.]
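The idea behind the range-partition stage can be sketched in a few lines of plain Java. This is a single-process illustration under simplifying assumptions, not Pig's implementation: the class and method names are hypothetical, the whole input is used as the sample for choosing boundaries, and the input is assumed to be non-empty with at least as many values as partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RangePartitionSort {

    // Choose (numPartitions - 1) boundary values from a sorted copy of the sample.
    public static int[] boundaries(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return cuts;
    }

    // A value goes to partition p, where p is the number of boundaries <= value.
    public static int partitionFor(int value, int[] cuts) {
        int p = 0;
        while (p < cuts.length && value >= cuts[p]) {
            p++;
        }
        return p;
    }

    public static List<Integer> sort(List<Integer> input, int numPartitions) {
        int[] cuts = boundaries(input, numPartitions);

        // "Mapper" side: route every value to the partition that owns its range.
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int v : input) {
            partitions.get(partitionFor(v, cuts)).add(v);
        }

        // "Reducer" side: sort each partition locally, then concatenate in order.
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : partitions) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(100, 200, 900, 50, 600, 800, 300, 400);
        System.out.println(sort(data, 3)); // prints [50, 100, 200, 300, 400, 600, 800, 900]
    }
}
```

Because every value in one partition is smaller than every value in the next, each reducer can sort independently and the outputs, read in partition order, form a globally sorted result.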
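15. UDF (User-Defined Function)
Pig Latin script:
  register myudf.jar;
  raw_data = load '/java_one/udf' as (name:chararray);
  firstnames = foreach raw_data generate myudf.FirstName(name);
  store firstnames into '/java_one/udf_output';
Java implementation:
  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  public class FirstName extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
      // Read the first field of the input tuple as the full name.
      String name = input.get(0).toString();
      ….
      return firstname;
    }
  }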
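The slide elides how the first name is derived from the full name. One possible rule, shown here purely as an assumption, is to take the first whitespace-separated token. The sketch is kept free of Pig dependencies so it can run on its own; `FirstNameSketch` and `firstName` are hypothetical names, not part of the original UDF.

```java
public class FirstNameSketch {

    // Assumed rule (not from the slides): the first name is the first
    // whitespace-separated token of the full name.
    public static String firstName(String fullName) {
        if (fullName == null) {
            return null;
        }
        String trimmed = fullName.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        return trimmed.split("\\s+")[0];
    }

    public static void main(String[] args) {
        System.out.println(firstName("Ada Lovelace")); // prints Ada
    }
}
```

Logic like this could sit inside the body of exec; returning null for missing input is a design choice made for this sketch.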
16. What Storage Pig Supports
- HDFS
- Plain text
- Binary format
- Customized format (XML, JSON, Protobuf, Thrift…)
- RDBMS (DBStorage)
- Cassandra (CassandraStorage)
- HBase (HBaseStorage)
17. Where Pig Can Be Applied
- Data analysis
- Text processing
- ETL
- Machine learning
19. References
- http://pig.apache.org (Pig official site)
- http://hadoop.apache.org (Hadoop official site)
- https://github.com/zjffdu/RAF-PIG (Rich API for Pig)