Beyond unit tests:
Testing for Spark/Hadoop workflows
Anant Nag
Senior Software Engineer
LinkedIn
Shankar Manian
Staff Software Engineer
LinkedIn
A day in the life of a data engineer
● Produce D.E.D Report by 5 AM.
● At 1 AM, an alert goes off saying the pipeline has failed
● Dev wakes up, curses his bad luck and starts a backfill job at high priority
● Finds the cluster busy, starts killing jobs to make way for the D.E.D job
● Debugs failures, finds today is daylight savings day and data partitions have gone haywire.
● Most days it works on retry
● Some days, we are not so lucky
NO D.E.D => We are D.E.A.D
Scale @ LinkedIn
10s of clusters
1000s of machines
1000s of users
100s of 1000s of Azkaban workflows running per month
Powers key business impacting features
People you may know
Who viewed my profile
Nightmares at Data street
● Cluster gets upgraded
● Data partition changes
● Code needs to be rewritten in a new technology
● Different version of a dependent jar is available
● ...
Do you know your dependencies?
● Direct dependencies
● Indirect dependencies
● Hidden dependencies
● Semantic dependencies
“Hey, I am changing column X in data P to format B. Do you foresee any issues?”
Paranoia is justified
No Confidence to make changes
Lack of Agility
Loss of Innovation
Introducing Marvin
Architecture
● Workflow definition
● Test definitions
● Test execution environment:
○ Local
○ Production
● Test data
Workflow definition
hadoop {
workflow('workflow1') {
sparkJob('job1') {
uses 'com.linkedin.example.SparkJob'
executes 'exampleSpark.jar'
jars 'jar1.jar,jar2.jar'
executorMemory '2G'
numExecutors 400
}
pigJob('job2') {
uses 'src/main/pig/pigScript.pig'
depends 'job1'
}
targets 'job2'
}
}
Test definition
hadoop {
workflow('countByCountryFlow') {
pigJob('countByCountry') {
uses 'src/main/pig/count_by_country.pig'
reads files: [
'input_data': "/data/input"
]
writes files: [
'output_path': "/jobs/output"
]
}
targets 'countByCountry'
}
}
hadoop {
workflowTestSuite("test1") {
addWorkflow('countByCountryFlow') {
}
}
workflowTestSuite("test2") {
...
}
}
Overriding parameters
hadoop {
workflow('countByCountryFlow') {
pigJob('countByCountry') {
uses 'src/main/pig/count_by_country.pig'
reads files: [
'input_data': "/data/input"
]
writes files: [
'output_path': "/jobs/output"
]
}
targets 'countByCountry'
}
}
hadoop {
workflowTestSuite("test1") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
reads files: [
'input_data': '/path/to/test/data'
]
writes files: [
'output_path': '/path/to/test/output'
]
}
}
}
}
● 10s of clusters @ LinkedIn
● Multiple versions of Pig - 0.11, 0.15
● Some clusters update now, some later
● Code should run on all the versions
Configuration override
Configuration override
● Write multiple tests
● One test for each version of pig
● Override pig version in the tests
workflowTestSuite("testWithPig11") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
'pig.home': '/path/to/pig/11'
]
}
}
}
workflowTestSuite("testWithPig12") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
'pig.home': '/path/to/pig/12'
]
}
}
}
workflowTestSuite("testWithPig15") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
'pig.home': '/path/to/pig/15'
]
}
}
}
True story..
● Complex pipeline with 10s of Jobs
● Most of them are Spark Jobs
● DataFrames API
● Rewrite all spark jobs to use DataFrames
● Is my new code ready for production??
● Write tests
○ Assertions on output
● All tests succeed after changes (^_^)
Data Validation and Assertion
● Types of checks
○ Record level checks
○ Aggregate level
■ Data transformation
■ Data aggregation
■ Data distribution
● Assert against Expectation
Record level Validation
us 148083
in 46074
cn 34332
br 30836
gb 24387
fr 14983
...
hadoop {
workflowTestSuite('test1') {
addWorkflow('countFlow', 'testCountFlow') {}
assertionWorkflow('assertNonNegativeCount') {
sparkJob("assertNonNegativeCount") {
}
targets 'assertNonNegativeCount'
}
}
}
import spark.implicits._
import com.databricks.spark.avro._

val count = spark.read.avro(input)
require(count.map(r => r.getAs[Long]("count")).filter(_ < 0).count() == 0)
Aggregated validation
● Binary spam classifier C1
● C1 classifies 30% of test input as spam
● Wrote a new classifier C2
● C2 can deviate at most 10%
● Write aggregated validations
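A minimal sketch of such an aggregated validation in Spark, assuming both classifier outputs are Avro datasets with a boolean is_spam column (the paths, the column name and the spamRate helper are illustrative, not from the deck):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import com.databricks.spark.avro._

// Fraction of records a classifier marked as spam
def spamRate(df: DataFrame): Double =
  df.filter(col("is_spam") === true).count().toDouble / df.count()

val c1Rate = spamRate(spark.read.avro("/jobs/classifier_c1/output"))
val c2Rate = spamRate(spark.read.avro("/jobs/classifier_c2/output"))

// C2 may deviate from C1 by at most 10 percentage points
require(math.abs(c2Rate - c1Rate) <= 0.10,
  s"C2 spam rate $c2Rate deviates from C1 rate $c1Rate by more than 10%")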
Execution
Test execution
$> gradle azkabanTest -Ptestname=test1
● Test results on the terminal
● Reports for the passed and failed tests
Tests [1/1]:=> Flow TEST-countByCountry completed with status SUCCEEDED
Summary of the completed tests
Job Statistics for individual test
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flow Name | Latest Exec ID | Status | Running | Succeeded | Failed | Ready | Cancelled | Disabled | Total
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
TEST-countByCountry | 2997645 | SUCCEEDED | 0 | 2 | 0 | 0 | 0 | 0 | 2
Test automation
● Auto deployment process for Azkaban artifacts
● Tests run as part of the deployment
● Tests fail => Artifact deployment fails
● No untested code can go to production
Test Data
Test data
● Real data
○ Very large data
○ Tests run very slow
● Randomly generated data
○ Not equivalent to real data
○ Real issues can never be caught
● Manually created data
○ Covers all the cases
○ Too much effort
● Anything else????
Requirements of test data
● Representative of real data
● Smaller in size
● Automated generation
● Sharable
● Discoverable
Data sampling
● Sampling definition:
○ Sampling logic + parameters
○ Sampling logic example:
SELECT * FROM ${table} TABLESAMPLE(${sample_percent} PERCENT) s;
○ Parameters: table, sample_percent
● Joinable samples
○ Join tables first and then sample
○ Hash bucket on the column to join and pick same buckets
● Metadata
○ Expiry date
○ Refresh rate
○ Permissions
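A hedged Spark sketch of the hash-bucket idea above: every table keeps the same buckets of the join key, so rows that join in the full data also join in the samples (the key name, bucket count and paths are assumptions for illustration):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{abs, col, hash}
import com.databricks.spark.avro._

val buckets = 100
val keep = 5 // roughly a 5% sample

// Keep the same hash buckets of the join column in every sampled table
def bucketSample(df: DataFrame, joinCol: String): DataFrame =
  df.filter(abs(hash(col(joinCol))) % buckets < keep)

val membersSample  = bucketSample(spark.read.avro("/data/members"), "member_id")
val activitySample = bucketSample(spark.read.avro("/data/activity"), "member_id")
// membersSample.join(activitySample, "member_id") behaves like a sample of the full join,
// because both sides keep exactly the same hash buckets of member_id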
Sampling pipeline
● Separate repository for sampling definition + metadata
● Sampling processor generates Azkaban workflows
● Each definition corresponds to a single workflow
● Auto deployment to Azkaban
● Workflows scheduled based on metadata
○ Refresh rate
○ Expiry date
● Samples produced are stored in HDFS
● Permissions set using metadata
Sampling discovery
● Publish metadata for sampled data to a data discovery service
● LinkedIn WhereHows
● Publish metadata
○ Original data for a sample
○ All samples of a data
○ Default sample of a data
○ Lifecycle details
Architecture
The road ahead...
Flexible execution environment
● Sandboxed
● Replicate production settings
● Save and Restore
● Ability to run in a single box
Data validation framework
● Validation logic in schema
● Automated validation on read and write
● Serves as contract between producers and consumers
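The slide leaves the mechanism open; purely as a hedged sketch (not the deck's actual design), schema-carried validation could look something like this in Spark, assuming each column's schema metadata stores a "check" SQL expression:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Hypothetical sketch: a column's metadata carries a "check" expression,
// e.g. the count column has check = "count >= 0". validateOnRead applies
// every such rule and fails fast on violation.
def validateOnRead(df: DataFrame): DataFrame = {
  df.schema.fields
    .filter(_.metadata.contains("check"))
    .foreach { field =>
      val rule = field.metadata.getString("check")
      val violations = df.filter(expr(s"NOT ($rule)")).count()
      require(violations == 0, s"Column ${field.name} violates '$rule' in $violations rows")
    }
  df
}

A reader wrapper could call validateOnRead right after spark.read, and a writer wrapper could do the same before writing, which is how the rules would act as the contract between producers and consumers.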
Sampling pipeline
● Hash Bucketing on data ingestion
● Automated discovery of samples through DALI API
● Sampling for model training and testing
● Obfuscation of sensitive data
Azkaban DSL: https://github.com/linkedin/linkedin-gradle-plugin-for-apache-hadoop
Azkaban: https://github.com/azkaban/azkaban

Editor's Notes

  1. Hello everyone, thank you for coming. I am Shankar, this is Anant, and we are going to talk about testing big data workflows.
  2. It has been challenging to do that. As you might have experienced, there are external factors that are hard to control and that influence the outcome. Quite often, we have to upgrade our clusters to the latest version of Hadoop. As stable as Hadoop has been, upgrades almost always cause some issue or other for some of the existing pipelines. Sometimes it's the way the data is partitioned or ingested that causes our joins to fail. The technology space in big data being as vibrant as it is, we often find new and better ways of doing things. Maybe we want to rewrite the job in Spark for better performance. Or Tez. Or something else that might come up in the future. Or it's sometimes as simple as a newer version of a jar your code depends on. These are just some of the examples. The list is endless.
  3. Just being aware of all the dependencies that can break your pipeline is super hard. Too soon your dependencies are so complicated and interwoven that no one knows why a pipeline is failing. Direct dependencies are the easiest ones. These are the data or the flows that produce the data your flow is dependent on. But those flows in turn could be dependent on some other flow, and so on, forming a large list of indirect dependencies. Sometimes the dependencies are hidden. One of the flows could be dependent on intermediate data that some other flow is producing and did not clean up, and no one knew about the dependency. One fine day the other flow's owner decides to clean up and voila, your flow is failing. Dependencies on datasets and their schemas are much easier compared to tracking semantic dependencies. A time field could be changing from PST to UTC. Does it affect you? Will your flow run with that change? It's hard to tell.
  4. There is an easy solution to all of this: "Don't change anything". That might sound paranoid, but in light of all of this, it might be justified. There is one serious issue with that solution. If you are not feeling confident to constantly make changes, you will be making fewer and fewer of them and losing the innovative edge your company has in the marketplace. Clearly, no company would want that. Is there a better alternative? Is there a way to be confident that, whatever changes, our flow will continue to work? Is there a way to find out about a potential failure before it happens in production?
  5. These are the questions that motivated us to work on a workflow testing framework we are calling Marvin. To talk more about Marvin and how it helps address these scenarios, let me call Anant.
  6. I'm going to talk about the end-to-end testing framework: the different components involved, how those components are implemented and how they interact with each other. What are the requirements for a workflow testing framework, what do we need? We definitely need a way to programmatically write our Azkaban workflows. We can do that using the Azkaban DSL; I'll talk more about what the Azkaban DSL looks like in the next slides. Once we have written our workflows, we should be able to write tests for them. We can again use the Azkaban DSL to write tests. Once we have written the tests, where do we run them? Can we run them locally? We can, but it's very difficult to replicate the production environment on a local machine. Can we run them on the production environment? The tests might be slow and there can be latency issues, but our tests will be more robust and our code will be immune to environment or configuration changes. What is next? We have the test definitions, we know where to run them and how to run them, but we don't have the test data. Again, we can keep it locally, but since the tests are being run on Azkaban, we should keep it in HDFS. This is how the complete architecture looks: we have the test definitions, the test execution environment and the test data. Now we can go see what a workflow definition looks like.
  7. The workflow definition is written in a language called the Hadoop DSL. Using this language, one can define multiple workflows, add jobs to a workflow, add properties to the jobs, and more. In this example, we have declared a workflow called workflow1. This workflow contains two jobs: job1, which is a Spark job, and job2, which is a Pig job. There is a linear dependency between these jobs, and job2 depends on job1. Each job also has some properties which are required for the job to run. For example, to run a Spark job you need the execution jar and the class which should be run, or you might want to set the executorMemory or the number of executors. When this DSL is compiled, it produces Azkaban workflows. Visually, the DAG will look something like this. Now that we know how to define a workflow programmatically, we are ready to write tests for our workflows.
  8. What are the requirements for a good testing framework, what should we be able to do? We should be able to create a named test; we can do that using the workflowTestSuite construct. We should be able to add the workflow that we want to test; this can be done using the addWorkflow construct, providing the name of the workflow that you want to add as a parameter. We should also be able to add multiple tests, so we can have multiple workflowTestSuite blocks with different names.
  9. Adding a workflow to test is not enough; we may want to override certain parameters of the workflow. Workflows themselves don't have any parameters. Who has parameters? Jobs! We should be able to find a job and change its parameters, such as input_data: point it to test data, point the output to a test output, and don't change anything else. Apart from the parameters, there are other things we can change: environment variables and configurations.
  10. Let me give you a real-world example of why a configuration or environment override might be necessary. LinkedIn has tens of clusters. Some are used as ETL clusters, some as development clusters, some as production clusters. Different clusters might not have the same set of software or the same environment. Since there are multiple versions of Pig floating around, some clusters will have Pig 0.11, some will have Pig 0.15, and so on. We update our clusters with the latest versions, but some are updated now and some later, so there is always a mismatch in the installed software. I cannot always go to all the clusters and run my code on each of them.
  11. So what can I do to make sure that my code runs on all versions of Pig and on all the clusters? I can write multiple tests for my code, where each test runs the code against a particular version of Pig. If all of my tests succeed, then I know that my Pig code works for all the Pig versions and will run on the other clusters as well.
  12. Any test is incomplete without assertions. Assertions let you add validation on the output of your function: they help you compare the output to an expected value and see whether it meets expectations. They also help you categorize your failures; if a particular assertion fails, you know where to look. So why do we need assertions for our workflow tests? To answer this question, let me tell you a true story…
  13. I had a complex pipeline with tens of jobs in the workflow. It collected metrics from Hadoop and ingested them into HDFS, from where the data was consumed by other pipelines. Most of my jobs were Spark jobs, and I wrote them using Spark 1.4.
  14. Then Spark came out with the new DataFrames API. I was very excited about it and wanted to rewrite all of my jobs to use DataFrames. But before starting the rewrite, how could I be sure that the new code was ready for production? What did I do? I wrote tests and added assertions on the output. Now, even after rewriting the code, the output data should stay the same and all my assertions should pass. And when I ran my code, all of my tests succeeded and I could move it to production. A minimal sketch of such a rewrite follows.
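  Here is a hedged, minimal sketch of that kind of rewrite. It is not the actual LinkedIn job: the input path, column names, and job name are made up, and it is written against a current Spark API rather than the Spark 1.4 mentioned in the story. The point is that both versions produce the same output, so the existing output assertions can validate the change.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dataFramesRewrite").getOrCreate()

    // Before: RDD-based aggregation of metric events, assuming "metric,value" text lines
    val beforeCounts = spark.sparkContext
      .textFile("/data/metrics")
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)

    // After: the equivalent aggregation expressed with the DataFrames API
    val afterCounts = spark.read
      .csv("/data/metrics")
      .toDF("metric", "value")
      .groupBy("metric")
      .count()

    // Same output location as before, so the existing assertions still apply
    afterCounts.write.mode("overwrite").parquet("/jobs/metrics/output")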
  15. The moral of the story is that we need data validation and assertions, especially for workflows. Now, what kinds of checks or validations can we have on our output data? We can have record-level checks: validation rules applied to each record, evaluated record by record. We can have aggregated-level checks, which means validation rules applied to a number of records at once; we might not be able to write straightforward aggregated rules, so we may have to transform, aggregate, or distribute the data before performing any checks. We can also have a direct comparison against the expected output. For example, you can compare the schema of the output to the expected schema, or keep hardcoded expected data that should be equal to the output data. A sketch of such a direct comparison follows.
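  As an illustration of the last option, here is a small Spark sketch (the paths are assumptions) that compares the job output directly against an expected, hand-curated dataset, including a schema check.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("outputComparison").getOrCreate()

    val actual   = spark.read.parquet("/path/to/test/output")     // produced by the flow under test
    val expected = spark.read.parquet("/path/to/expected/output") // hand-curated expected data

    // Schema check: the output schema must match the expected schema
    require(actual.schema == expected.schema,
      s"Schema mismatch: ${actual.schema.treeString} vs ${expected.schema.treeString}")

    // Data check: both set differences must be empty for the datasets to be equal
    require(actual.except(expected).count() == 0 && expected.except(actual).count() == 0,
      "Output data does not match the expected data")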
  16. As an example of a record-level check, suppose you have a table with two columns: the first column is the ID of the country, and the second column is the number of members from that country. In the data we can see the countries US, India, and China with their corresponding member counts. One check we can perform here is that the member count of each country is not negative. To add this kind of assertion to our test definition, we can add an assertion workflow, which is just another workflow, but one that runs after the workflow we are testing. In this assertion workflow, we can have a job that validates our output and checks that each country's count is not negative. This can easily be written using Spark, where we use the require method to find all the records with negative count values and assert that the number of such records is zero, as sketched below.
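  A minimal sketch of that assertion job follows; the path and column names are illustrative assumptions, not the actual schema.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("recordLevelCheck").getOrCreate()

    // Output of the countByCountry flow under test: (country, member_count)
    val counts = spark.read.parquet("/path/to/test/output")

    // Record-level rule: no country may have a negative member count
    val negativeRecords = counts.filter(counts("member_count") < 0).count()
    require(negativeRecords == 0,
      s"$negativeRecords countries have a negative member count")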
  17. We can also have aggregated-level validation. As an example, suppose your team has developed a binary spam classifier called C1. You expect it to perform well, and it classifies about 30% of your test input as spam. Now your team has started working on a new classifier called C2, which is very similar to C1. You don't expect C2 to deviate much from C1; it can deviate at most 10% from C1. What can I do to make sure that classifier C2 works as expected? I can again write tests and have aggregated validations on the test output. Even with the new classifier, my validations should pass. A sketch of such a check follows.
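  A minimal sketch of such an aggregated check, assuming hypothetical output paths and an is_spam column for both classifiers:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("aggregatedCheck").getOrCreate()

    // Fraction of records that a classifier's output marks as spam
    def spamRate(path: String): Double = {
      val df = spark.read.parquet(path)
      df.filter(df("is_spam") === true).count().toDouble / df.count()
    }

    val rateC1 = spamRate("/jobs/classifier-c1/output")  // expected to be around 0.30
    val rateC2 = spamRate("/jobs/classifier-c2/output")

    // Aggregated rule: C2's spam rate may deviate from C1's by at most 10% (relative)
    require(math.abs(rateC2 - rateC1) / rateC1 <= 0.10,
      f"C2 spam rate $rateC2%.3f deviates more than 10%% from C1 rate $rateC1%.3f")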
  18. Now we know how to define workflows, how to define tests, and how to add assertions on top of the tests, and we have the test execution environment in place. But how do we deploy and execute our tests? The Azkaban DSL comes as part of a Gradle plugin, which has tasks to deploy an artifact to Azkaban, run the executions on Azkaban, and fetch the results. We have added a new task to the Gradle plugin that can deploy our tests separately, run them, and print the results on the terminal. It also creates a nice report of the passed and failed tests, along with information about specific jobs. On your screen you can see such an example: we created a test called Test-countByCountry and ran it on Azkaban; the test succeeded, and it had two succeeded jobs. So the azkabanTest task gives us the ability to run ad-hoc tests. Ad-hoc tests are fine while you are developing, but testing is incomplete unless you have an automated way of running these tests: you want to run your tests whenever there is a new change in the code, and these tests should run before your code is even deployed to production.
  19. For this we have a test automation system in place. LinkedIn has an auto-deployment system for deploying artifacts to Azkaban; this system simply takes the Azkaban artifacts from Artifactory and uploads them to Azkaban. Now the tests can run as part of the auto-deployment process: our deployment service can run all the tests before deploying the artifacts, and if any test fails, the deployment to production also fails. This way we make sure that buggy or untested code doesn't go to production.
  20. So far we have talked about writing the tests, writing the assertions, and running the tests on Azkaban, but we are still missing a very important component of our testing framework: test data. We still don't know where to find our test data, how to generate it, or what its requirements are.
  21. There are different ways in which we can get test data. We can use the real data, but the real data is very large and our tests will run very slowly. We can use randomly generated data, but it's not equivalent to the real data; our tests will run very fast, but the real issues can never be caught. Or we can look at the original data and manually create our test data. Manually created data is good and can cover all the cases, but it takes too much effort to create: you have to look at the original data, study it, and then build the test data by hand. Is there anything else? Before trying to find an alternative, let us take the learnings from this slide and come up with the requirements for ideal test data.
  22. So we want our test data to be representative of the real data, to catch most of the real-world issues. It should be smaller in size so that we can run all our tests relatively fast. We should not have to put in a lot of effort to generate the test data, so we should be able to generate it automatically. The test data should be sharable: once someone has created test data for their tests, that data should be available for other teams to use as well. It should be discoverable: other teams should be able to discover your data before trying to generate their own. The test framework allows you to override inputs and outputs to any data of your choice, so people can create their own samples and use them. The sampling pipeline's goal is to enable discoverability and sharing of existing samples, as well as standardization and debuggability of sampling definitions. In addition, it can improve performance by materializing samples at the ingestion stage.
  23. At LinkedIn, we realized that sampled data can be a good candidate for test data, so we came up with the concept of a sampling definition. A sampling definition is nothing but the sampling logic plus some parameters. The sampling logic could be as simple as the one shown here, with sample_percent provided as a parameter. We need this kind of sampling definition because we want developers to share their sampling logic so that others can simply plug in their own parameters and generate samples of their own. A minimal sketch of such a definition follows.
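  The slide's actual snippet isn't reproduced here, but a sampling definition could look roughly like this minimal Spark sketch, where the sampling logic is fixed and sample_percent is supplied as a parameter; all names and paths are illustrative assumptions.

    import org.apache.spark.sql.SparkSession

    object SampleByPercent {
      def main(args: Array[String]): Unit = {
        // e.g. args = ["/data/input", "/data/input_sample", "1.0"]
        val Array(inputPath, outputPath, samplePercent) = args

        val spark = SparkSession.builder().appName("samplingDefinition").getOrCreate()

        // The sampling logic: a uniform random sample of sample_percent of the records.
        // A fixed seed keeps the sample reproducible across runs.
        spark.read.parquet(inputPath)
          .sample(withReplacement = false, fraction = samplePercent.toDouble / 100.0, seed = 42L)
          .write.mode("overwrite").parquet(outputPath)
      }
    }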
  24. We now know the requirements for the test data and how sampled data meets them. But there are still some parts missing: we still need to automate the generation and make the data discoverable and sharable. We've created a sampling pipeline that takes care of all these things.