Pivotal eXtension
Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
Data Analysis Timeline
ISAM files
COBOL/JCL
Data Analysis Timeline
ISAM files
COBOL/JCL
RDBMS
SQL
Data Analysis Timeline
ISAM files
COBOL/JCL
RDBMS
SQL
HDFS files
Map Reduce/Hive
Data Analysis Timeline
HDFS files
Map Reduce/Hive
SQL
Simplified View of
Co-existence
[Diagram: two stacks side by side. On one side, SQL running over an RDBMS and its files; on the other, Map Reduce, Hive and HBase running over HDFS files. Between them sits "The Great Divide".]
PXF addresses the
divide.
Pivotal eXtension Framework (PXF)
• History
o Based on external table functionality of RDBMS
o Built at Pivotal by a small team in Israel
• Goals
o Single hop
o No Materialization of data
o Fully parallel for high throughput
o Extensible
Motivation for building PXF
• Use the SQL engine’s statistical/analytic functions (e.g.
MADlib) on third-party data stores, e.g.
o HBase data
o Hive data
o Native data on HDFS in a variety of formats
• Join in-database dimensions with other fact tables
• Fast ingest of data into SQL native format (insert into …
select * from …)
Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS,
and want to store their data there
• M/R is very limiting
• Integrating with third-party systems, e.g. Accumulo
• Existing techniques involved copying data to HDFS,
which is very brittle and inefficient
High Level Flow
[Diagram: the SQL engine asks the Name Node "Where is the data for table foo?" and is told "On DataNodes 1, 3 and 5"; it then reads from those Data Nodes directly.]
- Protocol is HTTP
- End points are running on all data nodes
Major components
• Fragmenter
o Get the locations of fragments for a table
• Accessor
o Understand and read the fragment, return records
• Resolver
o Convert the records into a SQL engine format
• Analyzer
o Provide source stats to the query optimizer
PXF Architecture
[Diagram: on the Pivotal side, a PSQL client runs
select * from external table foo location=”pxf://namenode:50070/financedata”
against the HAWQ Master, and HAWQ Segments perform the reads. On the Hadoop (PHD) side, each Data Node runs a container with PXF end-points (PXF Fragmenter and PXF Accessor/Resolver) next to local HDFS, alongside M/R, Pig, Hive and Zookeeper. In the numbered flow (0-6), the master obtains splits[..] from the Fragmenter, the segments call getSplit(0) against the Accessor/Resolver, and records stream back as PXFWritable; metadata and data paths to native PHD storage are shown separately.]
Classes
• The four major components are defined as interfaces and
base classes that can be extended, e.g. Fragmenter; a sketch of a
custom Fragmenter follows the FragmentsOutput class below.
/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
    public FragmentsOutput();
    public void addFragment(String sourceName, String[] replicas, byte[] metadata);
    public void addFragment(String sourceName, String[] replicas, byte[] metadata,
                            String userData);
    public List<FragmentInfo> getFragments();
}
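To make the extension point concrete, here is a minimal sketch of a custom Fragmenter. The Fragmenter base class itself is not shown on these slides, so the sketch assumes, by analogy with the Analyzer class later in the deck, that it extends Plugin, takes an InputData in its constructor, and exposes a GetFragments() method returning FragmentsOutput; the class name, source paths and host names are purely illustrative.

// Hedged sketch, not the shipped API: the Fragmenter base class, its
// InputData constructor and the GetFragments() signature are assumed by
// analogy with the Analyzer class shown later; FragmentsOutput is taken
// verbatim from the slide above. Imports of the PXF types are omitted.
public class MyDataSourceFragmenter extends Fragmenter {

    public MyDataSourceFragmenter(InputData metaData) {
        super(metaData);
    }

    // Return one fragment per chunk of the source, together with the hosts
    // that hold a replica of it, so the SQL engine can schedule local reads.
    public FragmentsOutput GetFragments() throws Exception {
        FragmentsOutput fragments = new FragmentsOutput();
        // Illustrative values; a real fragmenter would derive these from the
        // source's own metadata (HDFS block locations, HBase regions, ...).
        String[] replicas = new String[] { "datanode1", "datanode3", "datanode5" };
        fragments.addFragment("/financedata/part-00000", replicas, new byte[0]);
        fragments.addFragment("/financedata/part-00001", replicas, new byte[0]);
        return fragments;
    }
}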
Accessor Interface
/* Internal interface that defines the access to data on the source
 * data store (e.g., a file on HDFS, a region of an HBase table, etc.).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IReadAccessor {
    public boolean openForRead() throws Exception;
    public OneRow readNextObject() throws Exception;
    public void closeForRead() throws Exception;
}
/*
 * An interface for writing data into a data store
 * (e.g., a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
    public boolean openForWrite() throws Exception;
    public OneRow writeNextObject(OneRow onerow) throws Exception;
    public void closeForWrite() throws Exception;
}
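As an illustration of the read path, here is a minimal sketch of an IReadAccessor that streams lines from a single local file. It is not one of the shipped accessors (those live under com.pivotal.pxf.accessors, e.g. the LineBreakAccessor named in the long-form syntax later); the OneRow(key, data) constructor and the hard-coded file path are assumptions made for the sketch.

import java.io.BufferedReader;
import java.io.FileReader;

// Hedged sketch of a read accessor returning one text line per record.
// Assumes OneRow can be built from a (key, data) pair; that constructor is
// not shown on the slides.
public class LineFileReadAccessor implements IReadAccessor {
    private BufferedReader reader;
    private long lineNumber;

    public boolean openForRead() throws Exception {
        // Illustrative source; a real accessor would take the path and offsets
        // from the fragment metadata handed to it by the framework.
        reader = new BufferedReader(new FileReader("/tmp/pxf-data"));
        lineNumber = 0;
        return true;
    }

    public OneRow readNextObject() throws Exception {
        String line = reader.readLine();
        if (line == null) {
            return null;   // no more records in this fragment
        }
        return new OneRow(lineNumber++, line);
    }

    public void closeForRead() throws Exception {
        reader.close();
    }
}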
Resolver Interface
/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
    public List<OneField> getFields(OneRow row) throws Exception;
}
/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object.
 * Every implementation of a serialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
    public OneRow setFields(DataInputStream inputStream) throws Exception;
}
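A matching sketch of an IReadResolver, splitting each line produced by the accessor above into comma-delimited fields, roughly what a text resolver does for the format 'TEXT' (delimiter = ',') tables shown later. The OneRow.getData() accessor, the OneField(type, value) constructor and the type code are assumptions; only the interface itself comes from the slide.

import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a read resolver for comma-delimited text rows.
// OneRow.getData() and the OneField(type, value) constructor are assumed;
// neither appears on the slides.
public class CsvReadResolver implements IReadResolver {
    private static final int TEXT_TYPE = 25;   // illustrative type code for a text column

    public List<OneField> getFields(OneRow row) throws Exception {
        String line = (String) row.getData();
        List<OneField> fields = new ArrayList<OneField>();
        for (String value : line.split(",")) {
            fields.add(new OneField(TEXT_TYPE, value));
        }
        return fields;
    }
}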
Analyzer Interface
/*
 * Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get the
 * table's statistics that are used by the optimizer to plan queries.
 */
public abstract class Analyzer extends Plugin {
    public Analyzer(InputData metaData) {
        super(metaData);
    }

    /** path is a data source name (e.g., file, dir, wildcard, table name);
     * returns the data statistics in JSON format.
     *
     * NOTE: It is highly recommended to implement extremely fast logic
     * that returns *estimated* statistics. Scanning all the data for exact
     * statistics is considered bad practice.
     */
    public String GetEstimatedStats(String data) throws Exception {
        /* Return default values */
        return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
    }
}
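A custom Analyzer only needs to subclass this and override GetEstimatedStats with cheap, estimated figures. The sketch below sticks to what the slide shows (the InputData constructor and DataSourceStatsInfo.dataToJSON); how DataSourceStatsInfo would be filled with real estimates is not shown there, so that step is left as a comment and the defaults are returned.

// Hedged sketch of a custom Analyzer; everything it calls appears on the
// slide above. Populating DataSourceStatsInfo with real block/tuple
// estimates would need API not shown here, so that part stays a comment.
public class MyDataSourceAnalyzer extends Analyzer {

    public MyDataSourceAnalyzer(InputData metaData) {
        super(metaData);
    }

    public String GetEstimatedStats(String data) throws Exception {
        // Estimate cheaply here (e.g. from file sizes or source metadata);
        // never scan the full data set, per the note on the base class.
        DataSourceStatsInfo stats = new DataSourceStatsInfo();
        return DataSourceStatsInfo.dataToJSON(stats);
    }
}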
Syntax - Long Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');
Say WHAT???
Syntax - Short Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');
Whew!!
Built-in Profiles
• A number of profiles are built in, and more are being contributed
o HBase, Hive, HDFS Text, Avro, SequenceFiles,
GemFireXD, Accumulo, Cassandra, JSON
o PXF will be open-sourced completely, for use with your
favorite SQL engine.
o But you can write your own connectors right now, and
use them with HAWQ.
Predicate Pushdown
• SQL engines may push parts of the “WHERE” clause
down to PXF.
• e.g. “where id > 500 and id < 1000”
• PXF provides a FilterBuilder class
• Filters can be combined
• Simple expression: “constant <OP> column”
• Complex expression: “object(s) <OP> object(s)”
Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally run a join across both tables
More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html
Questions?
Pivotal eXtension
Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech