Simplifying Apache Cascading
Ming Yuan / Alyssa Romeo
Capital One
May 24, 2016
Apache Cascading
• Open-source framework implementing the "chain of responsibility" design pattern
• An abstraction over the MapReduce, Tez, or Flink processing engines for transforming big data sets on Hadoop
• APIs for constructing and executing data-processing flows
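The pipe model can be illustrated outside Cascading with a tiny sketch (plain Python, not the Cascading API): each record flows through a chain of operations, and any step may transform or drop it.

```python
# Illustrative sketch of the chain-of-responsibility idea behind a pipe
# assembly: compose steps into one function applied to each record.
def make_pipe(*steps):
    """Compose steps into a single per-record processing function."""
    def pipe(record):
        for step in steps:
            record = step(record)
            if record is None:      # a step may filter a record out
                return None
        return record
    return pipe

# two toy "operations"
upper_name = lambda r: {**r, "name": r["name"].upper()}
keep_valid = lambda r: r if r.get("id") is not None else None

pipe = make_pipe(keep_valid, upper_name)
print(pipe({"id": 1, "name": "alice"}))   # {'id': 1, 'name': 'ALICE'}
print(pipe({"id": None, "name": "bob"}))  # None
```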
PDS Framework on Cascading
A lightweight layer on top of Apache Cascading to
– Manage metadata for inputs and outputs in properties files
– Define data-processing rules in properties files
– Support development in a parallel manner
– Make testing easier and more flexible
Case Studies

Cascading application 1 – 60% code reduction:

Source code | Lines using Cascading directly | Lines after rewriting on the framework
TranOptimizerTrxnDtl.java | 473 | 134
TrxnDtlTransformation.java | 278 | 81
PlanTypeCdeCalculation.java | 152 | 144
MyMain.java | – | 12
Total | 903 | 371

Cascading application 2 – 70% code reduction:

Source code | Lines using Cascading directly | Lines after rewriting on the framework
PmsmJoin.java | 210 | 87
JoinFunc.java | 257 | 38
MyMain.java | – | 12
Total | 467 | 137
(Diagram: a data-processing step connects sources to a sink through data-processing rules; the root configuration references a schema file for each source and sink and a processing-rules file.)
Managing Multiple Steps on the Framework
(Diagram: an Application Initiator and an Application Controller coordinate multiple transformation steps; the root configuration references the schema files and the processing-rules files for each step.)
Root Configuration
Entries in the root configuration file define application-level components, including
– Hadoop configurations
– Global configuration entries for the application
– Definitions for file taps (location and schema)
– Definitions for Hive taps

Example root configuration file:
ATPT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atpt_mntry_dq_schema.txt
ATPT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATPT_retain_schema.txt
ATPT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atpt_mntry_vldtd_hive_extract_us
ATGT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atgt_mntry_dq_schema.txt
ATGT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATGT_retain_schema.txt
ATGT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atgt_mntry_vldtd_hive_extract_us
HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties
FIRST_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hiveone.properties
SECOND_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hivetwo.properties
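Because the root file is plain KEY=value properties, loading it takes very little code. A hypothetical sketch of such a loader (Python, standard library only; the real framework is Java on Cascading):

```python
# Hedged sketch: parse a root configuration file of KEY=value entries
# (key names like HADOOP_PROPS_PATH come from the example above).
def load_root_config(text):
    """Parse KEY=value lines into a dict; skip blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

sample = """ATPT_SCHEME_PATH=/devl/rwa/.../whirl_atpt_mntry_dq_schema.txt
HADOOP_PROPS_PATH=/devl/rwa/.../hadoop.properties"""
cfg = load_root_config(sample)
print(cfg["HADOOP_PROPS_PATH"])
```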
Schema Configuration – FileTap
Schema file (one field per line; the first two columns are the field name and type):
atgt_org|decimal|FALSE|1|NA
atgt_acct|string|FALSE|1|NA
atgt_rec_nbr|decimal|FALSE|1|NA
atgt_logo|decimal|FALSE|1|NA
atgt_type|string|FALSE|1|NA
atgt_mt_eff_date|decimal|FALSE|1|NA

Retain-fields file (field names only):
atgt_org|
atgt_acct|
atgt_rec_nbr|
atgt_logo|
atgt_type|
atgt_mt_eff_date|

Tap pmsmTap = new Hfs(
    getTextDelimitedFromConfig("ATPT_SCHEME_PATH", null, false, " "),
    getFromConfigure("ATPT_DATA_PATH")
);
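A sketch of how such pipe-delimited schema files could be parsed into field definitions (illustrative Python; any column meaning beyond name and type is an assumption, not taken from the framework):

```python
# Hedged sketch: turn a pipe-delimited schema file into (name, type) pairs.
# Lines may carry only a name (as in the retain-fields file above).
def parse_schema(text):
    """Return a list of (field_name, field_type_or_None) tuples."""
    fields = []
    for line in text.splitlines():
        line = line.strip().rstrip("|")   # tolerate a trailing delimiter
        if not line:
            continue
        parts = line.split("|")
        name = parts[0]
        ftype = parts[1] if len(parts) > 1 else None
        fields.append((name, ftype))
    return fields

schema = """atgt_org|decimal|FALSE|1|NA
atgt_acct|string|FALSE|1|NA"""
print(parse_schema(schema))  # [('atgt_org', 'decimal'), ('atgt_acct', 'string')]
```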
Schema Configuration – HiveTap
Example Hive tap properties file:
DATA_BASE=dhdp_coaf
APP_COLUMN_NAMES=app_id, created_dt, …
APP_COLUMN_TYPES=Bigint, String, …
TABLE=MyTable
PARTITION_KEYS=odate
SER_LIB=org.apache.hadoop…   (optional; defaults to ParquetHiveSerDe)
APP_PATH=hdfs://…

HiveTap hiveTap = getHiveTapFromConfig("SECOND_HIVE_TAP", sinkMode, booleanValue);
Data Processing Rules
• Processing rules are documented as properties
• Out-of-the-box macros define the transformation logic
• The framework translates the processing rules into Cascading API calls on the fly

Example processing rules:
ARRMT_ID_CHAIN obj(atpt_chain)
TRXN_SEQ_NUM atpt_mt_hi_tran_trk_id
POST_DT str(atpt_mt_posting_date)
TRXN_CD int(atpt_mt_txn_code)
AGT_ID substr(atpt_mt_hi_rep_id, 2, 4)

Equivalent Cascading calls generated by the framework:
result.set(outputFields.getPos("ARRMT_ID_CHAIN"), argument.getObject(new Fields("atpt_chain")));
result.set(outputFields.getPos("TRXN_SEQ_NUM"), argument.getObject(new Fields("atpt_mt_hi_tran_trk_id")));
result.set(outputFields.getPos("POST_DT"), argument.getString(new Fields("atpt_mt_posting_date")));
result.set(outputFields.getPos("TRXN_CD"), argument.getInteger(new Fields("atpt_mt_txn_code")));
result.set(outputFields.getPos("AGT_ID"), argument.getString(new Fields("atpt_mt_hi_rep_id")).substring(2, 4));
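The translation from rules to field-setting calls can be sketched as a small interpreter (illustrative Python applied to a plain record; the real framework generates the Cascading calls shown above, and the sample record values are hypothetical):

```python
import re

# Hedged sketch: interpret rules of the form "TARGET macro(SOURCE, ...)"
# against an input record, mirroring a few macros from the slides.
MACROS = {
    "obj": lambda rec, src: rec[src],
    "str": lambda rec, src: str(rec[src]),
    "int": lambda rec, src: int(rec[src]),
    "substr": lambda rec, src, a, b: str(rec[src])[int(a):int(b)],
}

def apply_rules(rules_text, record):
    out = {}
    for line in rules_text.strip().splitlines():
        target, expr = line.split(None, 1)
        m = re.match(r"(\w+)\((.*)\)$", expr.strip())
        if m:
            name, args = m.group(1), [a.strip() for a in m.group(2).split(",")]
            out[target] = MACROS[name](record, *args)
        else:                       # "default" macro: plain field copy
            out[target] = record[expr.strip()]
    return out

rules = """POST_DT str(atpt_mt_posting_date)
TRXN_CD int(atpt_mt_txn_code)
AGT_ID substr(atpt_mt_hi_rep_id, 2, 4)"""
rec = {"atpt_mt_posting_date": 20160524,
       "atpt_mt_txn_code": "7",
       "atpt_mt_hi_rep_id": "AB1234"}
print(apply_rules(rules, rec))
```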
Data Processing Rules – Macros

obj
  Syntax: TARGET obj(SOURCE)
  Generates: result.set(outputFields.getPos("TARGET"), argument.getObject(new Fields("SOURCE")));

default
  Syntax: TARGET SOURCE
  Generates: result.set(outputFields.getPos("TARGET"), argument.getObject(new Fields("SOURCE")));

as-is
  Syntax: TARGET asis(default)
  Generates: result.set(outputFields.getPos("TARGET"), default);

string
  Syntax: TARGET str(SOURCE)
  Generates: result.set(outputFields.getPos("TARGET"), argument.getString(new Fields("SOURCE")));

int
  Syntax: TARGET int(SOURCE)
  Generates: result.set(outputFields.getPos("TARGET"), argument.getInteger(new Fields("SOURCE")));

sub-string
  Syntax: TARGET substr(SOURCE, 2, 4)
  Generates: result.set(outputFields.getPos("TARGET"), argument.getString(new Fields("SOURCE")).substring(2, 4));

replace
  Syntax: TARGET replace(SOURCE, A, B, C, D, default)
  Generates:
    String rawValue = argument.getString(new Fields("SOURCE"));
    if (A.equals(rawValue))
        result.set(outputFields.getPos("TARGET"), B);
    else if (C.equals(rawValue))
        result.set(outputFields.getPos("TARGET"), D);
    else
        result.set(outputFields.getPos("TARGET"), "default");

replace null
  Syntax: TARGET repnull(SOURCE, default)
  Generates:
    String rawValue = argument.getString(new Fields("SOURCE"));
    if (rawValue == null)
        result.set(outputFields.getPos("TARGET"), "default");
    else
        result.set(outputFields.getPos("TARGET"), rawValue);

replace null with whitespace
  Syntax: TARGET repnullws(SOURCE)
  Generates:
    String rawValue = argument.getString(new Fields("SOURCE"));
    if (rawValue == null)
        result.set(outputFields.getPos("TARGET"), " ");
    else
        result.set(outputFields.getPos("TARGET"), rawValue);

not null
  Syntax: TARGET notnull(SOURCE)
  Generates:
    String rawValue = argument.getString(new Fields("SOURCE"));
    if (rawValue == null)
        throw new RuntimeException();
    else
        result.set(outputFields.getPos("TARGET"), rawValue);

convert date
  Syntax: TARGET dateconv(SOURCE, yyyymmdd, dd-mm-yyyy)
  Generates:
    String rawValue = argument.getString(new Fields("SOURCE"));
    String targetValue = /* rawValue reformatted from yyyymmdd to dd-mm-yyyy */;
    result.set(outputFields.getPos("TARGET"), targetValue);

move decimal
  Syntax: TARGET movedeci(SOURCE, -2)
  Generates:
    double rawValue = argument.getDouble(new Fields("SOURCE"));
    result.set(outputFields.getPos("TARGET"), rawValue / Math.pow(10, -2));
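A few of the macros above can be expressed directly as small functions (hedged Python sketches: movedeci mirrors the slide's pseudocode, which divides by 10^places, and dateconv substitutes Python format codes for patterns like yyyymmdd):

```python
from datetime import datetime

# Hedged Python equivalents of three macros from the table above.
def repnull(value, default):
    """replace-null: substitute a default when the source is null."""
    return default if value is None else value

def dateconv(value, src_fmt, dst_fmt):
    """convert-date: reformat a date string between two patterns."""
    return datetime.strptime(value, src_fmt).strftime(dst_fmt)

def movedeci(value, places):
    """move-decimal: mirrors the slide's pseudocode, rawValue / 10**places."""
    return float(value) / (10 ** places)

print(repnull(None, "N/A"))                        # N/A
print(dateconv("20160524", "%Y%m%d", "%d-%m-%Y"))  # 24-05-2016
print(movedeci("12345", 2))                        # 123.45
```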
Exception Handling
“Whenever an operation fails and throws an exception, if there is an
associated trap, the offending Tuple is saved to the resource specified by the
trap Tap.”
-- Cascading documentation
FlowDef flowDef = FlowDef.flowDef()
    .addSource(ipAmcpPipe, ipAmcpInTap)
    .addSource(ipAtptPipe, ipAtptInTap)
    .addTailSink(transformPipe, outTap)
    .addTrap(ipAtptPipe, badRecordsTap);
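The trap behavior can be illustrated outside Cascading: records whose processing throws an exception are diverted to a side collection instead of failing the whole flow (toy Python sketch, not the Cascading API):

```python
# Hedged sketch of the trap idea: offending records are saved aside,
# like Tuples written to a trap Tap, while good records flow through.
def run_with_trap(records, operation):
    """Apply operation to each record; failing records go to the trap."""
    good, trap = [], []
    for record in records:
        try:
            good.append(operation(record))
        except Exception:
            trap.append(record)
    return good, trap

good, trap = run_with_trap(["1", "2", "x"], int)
print(good, trap)  # [1, 2] ['x']
```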
How to Adopt the Framework
• Create a root configuration file
• Create a schema file for each input and output (or reuse DQ schema files)
• Define processing rules
• Add all of the files to HDFS
• Subclass PDSBaseFunction for each processing step
@Override
protected void operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) {
    this.populateTupleSet(functionCall);
    TupleEntry argument = functionCall.getArguments();
    Tuple result = functionCall.getContext();
    Fields outputFields = functionCall.getDeclaredFields();
    result.set(outputFields.getPos("CHK_NUM"), check_number_calculation(argument));
    functionCall.getOutputCollector().add(result);
}

@Override
protected String getConfigPath() {
    return "/path/to/rulesfile";
}
How to Adopt the Framework (continued)
• Subclass the PDSBaseDriver class and implement the “transform” method
• Create a “main” class
• Run tests
public class TestHarness {
    public static void main(String[] args) {
        new MyDriverImp().process("/path/to/rootconfig");
    }
}

@Override
protected FlowDef transform() {
    // Keys such as PMAM_SCHEME_PATH and OUTPUT_DATA_PATH are entries
    // in the root configuration file
    Fields pmamfields = getFieldsFromConfigEntry("PMAM_SCHEME_PATH");
    String apparrFilePath = this.getFromConfigure("OUTPUT_DATA_PATH");
    Tap pmsmTap = new Hfs(
        this.getTextDelimitedFromConfig("PMSM_SCHEME_PATH", null, false, fieldDelimiter),
        apparrFilePath);
    FlowDef flowDef = FlowDef.flowDef()
        .addSource(ipAmcpPipe, ipAmcpInTap)
        .addTailSink(transformPipe, outTap)
        .addTrap(ipAtptPipe, badRecordsTap);
    return flowDef;
}
Conclusion
• Benefits
– Reduce the total effort of developing and testing Cascading applications
• Provide a re-usable layer to reduce the amount of “plumbing” code
• Make Cascading modules configurable
– Improve the code quality
• Modularize Cascading applications and support best practices in Java coding
• Support additional features (such as logging and exception handling)
– Build an open architecture for future extension and integration
• Technical specification
– Compatible with JDK 1.5 and above; the jar file was compiled with JDK 1.7
– Tested with Cascading 2.5
For questions, please reach out to Ming.Yuan@capitalone.com
Appendix: PDSBaseDriver Class

process(String path) — override: No
  Takes the path to the root configuration file, initializes all required configurations, invokes "transform()" in the subclass, and executes the Cascading flows.

init(String path) — override: No
  Takes the path to the root configuration file, parses the file, and stores the configuration entries accordingly.

getFromConfig(String key) — override: No
  Takes a String-typed key and returns the String-typed value if the key has been defined in the root configuration file; returns null otherwise.

getFieldsFromConfigEntry(String key) — override: No
  Takes a String-typed key. If, in the root configuration file, the key has been assigned the path to a schema file, the method returns a Fields object built from all column names in the schema file. The Fields object is cached automatically.

getFieldsFromConfigEntry(String key, String[] appendences) — override: No
  Takes a String-typed key from the root configuration file. If the key has been assigned the path to a schema file, it returns a Fields object containing all column names in the schema file plus all names in the input string array. This Fields object is NOT cached.

getTextDelimitedFromConfig(String key, String[] appendences, boolean hasHeader, String delimiter) — override: No
  Creates and returns a TextDelimited object from a configuration key in the root configuration file. The second parameter appends column names programmatically; the third and fourth parameters set whether the file has a header row and the field delimiter.

transform() — override: Yes
  The subclass should build a FlowDef object describing the application's processing flow in this method.
Appendix: PDSBaseFunction Class

prepare(FlowProcess f, OperationCall<Tuple> call) — override: No
  Overrides the same method in the Cascading BaseOperation class.

cleanup(FlowProcess f, OperationCall<Tuple> call) — override: No
  Overrides the same method in the Cascading BaseOperation class.

init(String key, String filePath) — override: No
  Parses a mapping-rules file and initializes the PDSBaseFunction object.

populateTupleSet(FunctionCall<Tuple> call) — override: No
  Populates values in the output tuple based on the input values and the pre-defined processing rules.

getConfigPath() — override: Yes
  Returns the path to the processing-rules file in HDFS.

operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) — override: Yes
  Should invoke populateTupleSet() to execute the pre-defined transformation rules, and should invoke any additional custom transformation methods for complex logic.
Appendix: Class Diagram
* Yellow indicates components from the Cascading package
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Recently uploaded (20)

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Simplifying Apache Cascading

7
Root Configuration
Root file entries configure application-level components, including
– Hadoop configurations
– Global configuration entries for the application
– Definitions for File Taps (location and schema)
– Definitions for Hive Taps

ATPT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atpt_mntry_dq_schema.txt
ATPT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATPT_retain_schema.txt
ATPT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atpt_mntry_vldtd_hive_extract_us
ATGT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atgt_mntry_dq_schema.txt
ATGT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATGT_retain_schema.txt
ATGT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atgt_mntry_vldtd_hive_extract_us
HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties
FIRST_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hiveone.properties
SECOND_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hivetwo.properties
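Since the root file is plain key=value text, it parses as a standard Java properties file. The following is a minimal sketch of that idea only; the `load` helper is illustrative, and the actual framework resolves these files on HDFS rather than from a string:

```java
import java.io.StringReader;
import java.util.Properties;

public class RootConfigSketch {
    // Illustrative loader: the framework itself reads the root file from HDFS,
    // but the entry format is plain key=value, which java.util.Properties handles.
    static Properties load(String content) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(content));
        return props;
    }

    public static void main(String[] args) throws Exception {
        String root =
            "HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties\n" +
            "FIRST_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hiveone.properties\n";
        Properties cfg = load(root);
        // Prints the configured Hadoop properties path
        System.out.println(cfg.getProperty("HADOOP_PROPS_PATH"));
    }
}
```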
8
Schema Configuration – FileTap

Root configuration (excerpt):
ATPT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atpt_mntry_dq_schema.txt
ATPT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atpt_mntry_vldtd_hive_extract_us
ATGT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atgt_mntry_dq_schema.txt
ATGT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATGT_retain_schema.txt
ATGT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atgt_mntry_vldtd_hive_extract_us
HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties
FIRST_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hiveone.properties
SECOND_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hivetwo.properties

Schema file:
atgt_org|decimal|FALSE|1|NA
atgt_acct|string|FALSE|1|NA
atgt_rec_nbr|decimal|FALSE|1|NA
atgt_logo|decimal|FALSE|1|NA
atgt_type|string|FALSE|1|NA
atgt_mt_eff_date|decimal|FALSE|1|NA

A schema file may also list the field names only:
atgt_org| atgt_acct| atgt_rec_nbr| atgt_logo| atgt_type| atgt_mt_eff_date|

Tap pmsmTap = new Hfs(
    getTextDelimitedFromConfig("ATPT_SCHEME_PATH", null, false, " "),
    getFromConfigure("ATPT_DATA_PATH"));
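The schema file shown above is pipe-delimited with the column name in the first position. As a sketch, extracting the field names (e.g. to feed a Cascading Fields object) could look like this; the `fieldNames` helper is hypothetical, not part of the framework's API:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaSketch {
    // Hypothetical parser: each non-empty line is "name|type|...", and only
    // the leading column name is needed to build the list of field names.
    static List<String> fieldNames(String schemaText) {
        List<String> names = new ArrayList<>();
        for (String line : schemaText.split("\n")) {
            if (line.trim().isEmpty()) continue;
            names.add(line.split("\\|")[0].trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String schema = "atgt_org|decimal|FALSE|1|NA\n"
                      + "atgt_acct|string|FALSE|1|NA\n";
        System.out.println(fieldNames(schema)); // prints [atgt_org, atgt_acct]
    }
}
```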
9
Schema Configuration – HiveTap

Root configuration (excerpt):
ATPT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atpt_mntry_dq_schema.txt
ATPT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atpt_mntry_vldtd_hive_extract_us
ATGT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atgt_mntry_dq_schema.txt
ATGT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATGT_retain_schema.txt
ATGT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atgt_mntry_vldtd_hive_extract_us
HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties
FIRST_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hiveone.properties
SECOND_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hivetwo.properties

Hive tap properties file (schema file):
DATA_BASE=dhdp_coaf
APP_COLUMN_NAMES=app_id, created_dt,…
APP_COLUMN_TYPES=Bigint, String, …
TABLE=MyTable
PARTITION_KEYS=odate
SER_LIB=org.apache.hadoop… (optional; defaults to ParquetHiveSerDe)
APP_PATH=hdfs://….

HiveTap hiveTap = getHiveTapFromConfig("SECOND_HIVE_TAP", sinkMode, booleanValue);
10
Data Processing Rules
• Processing rules are documented as properties
• Out-of-the-box macros define the transformation logic
• The framework translates the processing rules into Cascading API calls on the fly

Processing rules:
ARRMT_ID_CHAIN obj(atpt_chain)
TRXN_SEQ_NUM atpt_mt_hi_tran_trk_id
POST_DT str(atpt_mt_posting_date)
TRXN_CD int(atpt_mt_txn_code)
AGT_ID substr(atpt_mt_hi_rep_id, 2, 4)

Generated Cascading calls:
result.set(outputFields.getPos("ARRMT_ID_CHAIN"), argument.getObject(new Fields("atpt_chain")));
result.set(outputFields.getPos("TRXN_SEQ_NUM"), argument.getObject(new Fields("atpt_mt_hi_tran_trk_id")));
result.set(outputFields.getPos("POST_DT"), argument.getString(new Fields("atpt_mt_posting_date")));
result.set(outputFields.getPos("TRXN_CD"), argument.getInteger(new Fields("atpt_mt_txn_code")));
result.set(outputFields.getPos("AGT_ID"), argument.getString(new Fields("atpt_mt_hi_rep_id")).substring(2, 4));
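The rule lines above follow a "TARGET macro(args)" shape. One way such a line could be tokenized before being mapped to a Cascading call is sketched below; the grammar and the `parse` helper are assumptions inferred from the examples, not the framework's actual parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleParseSketch {
    private static final Pattern RULE =
        Pattern.compile("(\\w+)\\s+(\\w+)\\(([^)]*)\\)");

    // Splits "TARGET macro(arg1, arg2)" into {target, macro, rawArgs}.
    // Rules without a macro (plain "TARGET SOURCE") are not handled here.
    static String[] parse(String ruleLine) {
        Matcher m = RULE.matcher(ruleLine.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized rule: " + ruleLine);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("AGT_ID substr(atpt_mt_hi_rep_id, 2, 4)");
        // prints AGT_ID / substr / atpt_mt_hi_rep_id, 2, 4
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
    }
}
```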
11
Data Processing Rules – Macros

– obj: TARGET obj(SOURCE)
  result.set(outputFields.getPos("TARGET"), argument.getObject(new Fields("SOURCE")));
– default: TARGET SOURCE
  Same as obj: result.set(outputFields.getPos("TARGET"), argument.getObject(new Fields("SOURCE")));
– as-is: TARGET asis(default)
  result.set(outputFields.getPos("TARGET"), default);
– string: TARGET str(SOURCE)
  result.set(outputFields.getPos("TARGET"), argument.getString(new Fields("SOURCE")));
– int: TARGET int(SOURCE)
  result.set(outputFields.getPos("TARGET"), argument.getInteger(new Fields("SOURCE")));
– sub-string: TARGET substr(SOURCE, 2, 4)
  result.set(outputFields.getPos("TARGET"), argument.getString(new Fields("SOURCE")).substring(2, 4));
– replace: TARGET replace(SOURCE, A, B, C, D, default)
  If the source value equals A, set TARGET to B; if it equals C, set TARGET to D; otherwise set TARGET to "default".
– replace null: TARGET repnull(SOURCE, default)
  If the source value is null, set TARGET to "default"; otherwise pass the value through.
– replace null with whitespace: TARGET repnullws(SOURCE)
  If the source value is null, set TARGET to " "; otherwise pass the value through.
– not null: TARGET notnull(SOURCE)
  If the source value is null, throw a RuntimeException; otherwise pass the value through.
– convert date: TARGET dateconv(SOURCE, yyyymmdd, dd-mm-yyyy)
  Reformat the source date value from the first pattern to the second.
– move decimal: TARGET movedeci(SOURCE, -2)
  Divide the source value by 10^-2, shifting the decimal point.
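Two of the conditional macros above can be sketched in plain Java, operating on strings rather than Cascading TupleEntry arguments; the method names are illustrative, not the framework's internals:

```java
public class MacroSketch {
    // repnull(SOURCE, default): null becomes the default, otherwise pass through.
    static String repnull(String rawValue, String dflt) {
        return rawValue == null ? dflt : rawValue;
    }

    // replace(SOURCE, A, B, C, D, default): A maps to B, C maps to D,
    // and any other value becomes the default.
    static String replace(String rawValue, String a, String b,
                          String c, String d, String dflt) {
        if (a.equals(rawValue)) return b;
        if (c.equals(rawValue)) return d;
        return dflt;
    }

    public static void main(String[] args) {
        System.out.println(repnull(null, "N/A"));                  // prints N/A
        System.out.println(replace("Y", "Y", "1", "N", "0", "?")); // prints 1
    }
}
```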
12
Exception Handling
"Whenever an operation fails and throws an exception, if there is an associated trap, the offending Tuple is saved to the resource specified by the trap Tap." – Cascading documentation

FlowDef flowDef = FlowDef.flowDef()
    .addSource(ipAmcpPipe, ipAmcpInTap)
    .addSource(ipAtptPipe, ipAtptInTap)
    .addTailSink(transformPipe, outTap)
    .addTrap(ipAtptPipe, badRecordsTap);
13
How to Adopt the Framework
• Create a root configuration file
• Create a schema file for each input and output (or reuse DQ schema files)
• Define processing rules
• Add all of the files to HDFS
• Subclass PDSBaseFunction for each processing step

@Override
protected void operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) {
    this.populateTupleSet(functionCall);
    TupleEntry argument = functionCall.getArguments();
    Tuple result = functionCall.getContext();
    Fields outputFields = functionCall.getDeclaredFields();
    result.set(outputFields.getPos("CHK_NUM"), check_number_calculation(argument));
    functionCall.getOutputCollector().add(result);
}

@Override
protected String getConfigPath() {
    return "/path/to/rulesfile";
}
14
How to Adopt the Framework
• Subclass the PDSBaseDriver class and implement the "transform" method
• Create a "main" class
• Run tests

@Override
protected FlowDef transform() {
    Fields pmamfields = getFieldsFromConfigEntry("PMAM_SCHEME_PATH");
    String apparrFilePath = this.getFromConfigure("OUTPUT_DATA_PATH");
    Tap pmsmTap = new Hfs(
        this.getTextDelimitedFromConfig("PMSM_SCHEME_PATH", null, false, fieldDelimiter),
        apparrFilePath);
    FlowDef flowDef = FlowDef.flowDef()
        .addSource(ipAmcpPipe, ipAmcpInTap)
        .addTailSink(transformPipe, outTap)
        .addTrap(ipAtptPipe, badRecordsTap);
    return flowDef;
}

public class TestHarness {
    public static void main(String[] args) {
        new MyDriverImp().process("/path/to/rootconfig");
    }
}

The upper-case names (PMAM_SCHEME_PATH, OUTPUT_DATA_PATH, PMSM_SCHEME_PATH) are key words in the root configuration file.
15
Conclusion
• Benefits
– Reduce the total effort of developing and testing Cascading applications
  • Provide a re-usable layer to reduce the amount of "plumbing" code
  • Make Cascading modules configurable
– Improve code quality
  • Modularize Cascading applications and support best practices in Java coding
  • Support additional features (such as logging and exception handling)
– Build an open architecture for future extension and integration
• Technical specification
– Compatible with JDK 1.5 and above; the jar file was compiled with JDK 1.7
– Tested with Cascading 2.5
16
For questions, please reach out to Ming.Yuan@capitalone.com
17
Appendix: PDSBaseDriver Class

– process(String path) (override: no)
  Takes the path to the root configuration file, initializes all required configurations, invokes "transform()" in its subclass, and executes the Cascading flows.
– init(String path) (override: no)
  Takes the path to the root configuration file, parses the file, and stores the configuration entries accordingly.
– getFromConfig(String key) (override: no)
  Takes a String-typed key and returns the String-typed value if the key has been used in the root configuration file; returns null otherwise.
– getFieldsFromConfigEntry(String key) (override: no)
  If the key has been assigned a path to a schema file in the root configuration file, returns a Fields object built from all column names in the schema file. The Fields object is automatically cached.
– getFieldsFromConfigEntry(String key, String[] appendences) (override: no)
  Same as above, but also appends all names in the input string array. The resulting Fields object is NOT cached.
– getTextDelimitedFromConfig(String key, String[] appendences, boolean hasHeader, String delimiter) (override: no)
  Creates and returns a TextDelimited object from a configuration key in the root configuration file. The second parameter appends column names programmatically; the third and fourth parameters describe the input/output files.
– transform() (override: yes)
  Subclasses build a FlowDef object describing the application's processing flow in this method.
18
Appendix: PDSBaseFunction Class

– prepare(FlowProcess f, OperationCall<Tuple> call) (override: no)
  Overrides the same method in the Cascading BaseOperation class.
– cleanup(FlowProcess f, OperationCall<Tuple> call) (override: no)
  Overrides the same method in the Cascading BaseOperation class.
– init(String key, String filePath) (override: no)
  Parses a mapping-rules file and initializes the PDSBaseFunction object.
– populateTupleSet(FunctionCall<Tuple> call) (override: no)
  Populates the values in the output tuple based on the input values and the pre-defined processing rules.
– getConfigPath() (override: yes)
  Returns the HDFS path to the processing-rules file.
– operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) (override: yes)
  Should invoke populateTupleSet() to execute the pre-defined transformation rules, and should invoke any additional custom transformation methods for complex logic.
19
Appendix: Class Diagram
* Yellow indicates components from the Cascading package.