SlideShare a Scribd company logo
1 of 54
Download to read offline
Hadoop:
Big Data Stacks validation w/ iTest
How to tame the elephant?

 Konstantin Boudnik, Ph.D., Cloudera Inc.
 cos@apache.org
 Hadoop Committer, Pig Contributor
 Roman Shaposhnik, Cloudera Inc.
 rvs@cloudera.com
 Hadoop Contributor, Oozie Committer
 Andre Arcilla
 belowsov@gmail.com
 Hadoop Integration Architect
This content is licensed under
Creative Commons BY-NC-SA 3.0
Agenda
● Problem we are facing
   ○ Big Data Stacks
   ○ Why validation
● What is "Success" and the effort to achieve it
● Solutions
   ○ Ops testing
   ○ Platform certification
   ○ Application testing
● Stack on stack
   ○ Test artifacts are First Class Citizen
   ○ Assembling validation stack (vstack)
   ○ Tailoring vstack for target clusters
   ○ D3: Deployment/Dependencies/Determinism
Not on Agenda
● Development cycles
   ○ Features, fixes, commits, patch testing
● Release engineering
   ○ Release process
   ○ Package preparations
   ○ Release notes
   ○ RE deployments
   ○ Artifact management
   ○ Branching strategies
● Application deployment
   ○ Cluster update strategies (rolling vs. offline)
   ○ Cluster upgrades
   ○ Monitoring
   ○ Statistics collection
● We aren't done yet, but...
What's a Big Data Stack anyway?

        Just a base layer!
What is Big Data Stack?

         Guess again...
What is Big Data Stack?


          A bit more...
What is Big Data Stack?


          A bit or two more...
What is Big Data Stack?


           A bit or two more + a bit = HIVE
What is Big Data Stack?
What is Big Data Stack?
            And a Sqoop of flavor...
What is Big Data Stack?
             A Babylon tower?
What is the glue that holds
        the bricks?

● Packaging
   ○ RPM, DEB, YINST, EINST?
   ○ Not the best fit for Java
● Maven
   ○ Part of the Java ecosystem
   ○ Not the best tool for non-Java
     artifacts
● Common APIs we will assume
   ○ Versioned artifacts
   ○ Dependencies
   ○ BOMs
How can we possibly guarantee that everything is
                    right?
Development & Deployment
       Discipline !
Is it enough really?
Of course not...
Components:
 ● I want Pig 0.7
 ● You need Pig 0.8
Components:
 ● I want Pig 0.7
 ● You need Pig 0.8
Configurations:
 ● Have I put in 5th data volume to DN's conf of my
   6TB nodes?
 ● Have I _not_ copied it accidently to my 4TB
   nodes?
Components:
 ● I want Pig 0.7
 ● You need Pig 0.8
Configurations:
 ● Have I put in 5th data volume to DN's conf of my
   6TB nodes?
 ● Have I _not_ copied it accidently to my 4TB
   nodes?
Auxiliary services:
 ● Does my Sqoop has Oracle connector?
Honestly:
 ● Can anyone remember all these details?
What if you've missed some?
How would you like your 10th re-spin of a release?
Make it bloody this time, please...
Redeployments...
Angry customers...
LOST REVENUE ;(
And don't you have anything better to do with that
                 life of yours?
Who Needs A Stack Anyway?

● Developers
   ○ reference platform which is complete and works
   ○ less handholding for downstream groups
● QE
   ○ definitive testable artifact, something to explore and
     certify
   ○ free QE to do QE
● Operations
   ○ deploy a "Hadoop", not a "Hadoop erector set"
   ○ confidence in dev/QE effort

● Eliminate cross-team effort "bleed"
● Align teams on a common "axis"
Successful Stack Deployment System

               "Customers Are Happy!"

● Deploy stack configuration X to a cluster Y, for such X & Y:
   ○ satisfy devs, QE and Ops
   ○ variable enough to absorb all the crazy use cases
Who Needs Stack Testing Anyway?
● All stack customers
   ○ Yes QE for sure (integration, system, CI levels...), but
      also
   ○ Devs: did my latest patch break security auth or slowed
      HDFS to a crawl?
   ○ Ops: you ask for "striping". Did YOU test striping? Can
      WE test it as well?
   ○ Customers: did you test MY stack
Successful Integration Testing System

                "Customers Are Happy!"

 1. QE: easy to deploy, configure, run, add tests
 2. Stakeholders: provides relevant info about stack quality
     1. requires plenty of relevant tests and datapoints
         1. see 1)
Effort Required To Deploy/Test Stacks
The Parable of Garage
● Lots of stuff coexisting together
    ○ fancy, polished machines (mountain bike)
    ○ utility tools (hedge trimmer, shop vacuum)
    ○ ugly misfits (old paint pan, smelly dry mushrooms)
● Cannot be "simplified" due to complexity and evolving
  nature
● Need to manage complexity: use, change, check
● Keep objects inviolate: cannot expect a bucket to mount
  directly on a wall
● Reliance on external services: studs, drywall, garage door
The Complexity Management Solution

● Provide the Framework and the means for objects to plug
  into this framework
● Framework
    ○ holds everything together, enables management
    ○ sophisticated design, well engineered, sturdy
● "Glue"/"Bubblegum logic"
    ○ binds componets to the framework via minimally-
      invasive API
    ○ easy to design, engineering complexity is lower
    ○ follows well-documented process
● Components
    ○ actual participants of the garage
    ○ require no modifications
Engineering Effort Matrix

            Development    Level of       External                     Develop in
                                                        Investment
               Effort     Complexity    Dependencies                    Parallel




Framework     Medium         High           Many         One time         No




                                         Few or none
Component                 Low-Medium                    Incremental,
            Low-Medium                   provided by                      Yes
Glue                       by example                  per component
                                          framework




     ● Framework - high complexity one time effort
     ● Components - per-component incremental effort,
       average complexity, code-by-example
Validation Stack for Big Data
              A Babylon tower vs Tower of Hanoi
Validation Stack (tailored)
              Or something like this...
Use accepted platform, tools, practices
● JVM is simply The Best
    ○ Disclaimer: not to start religious war
● Yet, Java isn't dynamic enough (as in JDK6)
    ○ But we don't care what's your implementation language
    ○ Groovy, Scala, Clojure, JPython (?)
● Everyone knows JUnit/TestNG
    ○ alas not everyone can use it effectively
● Dependency tracking and packaging
    ○ Maven
● Information radiators facilitate data comprehension and
  sharing
    ○ TeamCity
    ○ Jenkins
Few more pieces
● Tests/workloads have to be artifact'ed
   ○ It's not good to go fishing for test classes
● Artifacts have to be self-contained
   ○ Reading 20 pages to find a URL to copy a file from?
      "Forget about it" (C)
● Standard execution interface
   ○ JUnit's Runner is as good as any custom one
● A recognizable reporting format
   ○ XML sucks, but at least it has a structure
A test artifact (PigSmoke 0.9-SNAPSHOT)
<project>
 <groupId>org.apache.pig</groupId>
 <artifactId>pigsmoke</artifactId>
 <packaging>jar</packaging>
 <version>0.9.0-SNAPSHOT</version>
 <dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.9.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.9.0-SNAPSHOT</version>
  </dependency>
 </dependencies>
</project>
How do we write iTest artifacts
$ cat TestHadoopTinySmoke.groovy
     class TestHadoopTinySmoke {
         ....
           @BeforeClass
            static void setUp() throws IOException {
                   String pattern = null; //Let's unpack everything
                   JarContent.unpackJarContainer(TestHadoopSmoke.class, '.' , pattern);
                    .....
             }

          @Test
          void testCacheArchive() {
              def conf = (new Configuration()).get("fs.default.name");
              ....
              sh.exec("hadoop fs -rmr ${testDir}/cachefile/out",
                        "hadoop ....
Add suitable dependencies (if desired)
<project>
 ...
   <dependency>
     <groupId>org.apache.pig</groupId>
     <artifactId>pigsmoke</artifactId>
     <version>0.9-SNAPSHOT</version>
     <scope>test</scope>
   </dependency>
   <!-- OMG: Hadoop dependency _WAS_ missed -->
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-core</artifactId>
     <version>0.20.2-CDH3B4-SNAPSHOT</version>
   </dependency>
 ...
Unpack data (if needed)
...
      <execution>
         <id>unpack-testartifact-jar</id>
       <phase>generate-test-resources</phase>
       <goals>
        <goal>unpack</goal>
       </goals>
       <configuration>
        <artifactItems>
         <artifactItem>
           <groupId>org.apache.pig</groupId>
           <artifactId>pigsmoke</artifactId>
           <version>0.9-SNAPSHOT</version>
           <type>jar</type>
           <outputDirectory>${project.build.directory}</outputDirectory>
           <includes>test/data/**/*</includes>
         </artifactItem>
        </artifactItems>
       </configuration>
      </execution>
...
Find runtime libraries (if required)
 ...
       <execution>
         <id>find-lzo-jar</id>
         <phase>pre-integration-test</phase>
         <goals> <goal>execute</goal> </goals>
         <configuration>
          <source>
            try {
              project.properties['lzo.jar'] = new File("${HADOOP_HOME}/lib").list(
                [accept:{d, f-> f ==~ /hadoop.*lzo.*.jar/ }] as FilenameFilter
              ).toList().get(0);
            } catch (java.lang.IndexOutOfBoundsException ioob) {
              log.error "No lzo.jar has been found under ${HADOOP_HOME}/lib. Check your
installation.";
              throw ioob;
            }
          </source>
         </configuration>
       </execution>
  </project>
Take it easy: iTest will do the rest
...
      <execution>
       <id>check-testslist</id>
       <phase>pre-integration-test</phase>
       <goals> <goal>execute</goal> </goals>
       <configuration>
        <source><![CDATA[
          import org.apache.itest.*

         if (project.properties['org.codehaus.groovy.maven.destination'] &&
             project.properties['org.codehaus.groovy.maven.jar']) {
              def prefix = project.properties['org.codehaus.groovy.maven.destination'];
             JarContent.listContent(project.properties['org.codehaus.groovy.maven.jar']).
               each {
                 TestListUtils.touchTestFiles(prefix, it);
               };
         }]]>
        </source>
       </configuration>
      </execution>
...
Tailored validation stack
<project>
 <groupId>com.cloudera.itest</groupId>
 <artifactId>smoke-tests</artifactId>
 <packaging>pom</packaging>
 <version>1.0-SNAPSHOT</version>
 <name>hadoop-stack-validation</name>

 ...

  <!-- List of modules which should be executed as a part of stack testing run -->
  <modules>
   <module>pig</module>
   <module>hive</module>
   <module>hadoop</module>
   <module>hbase</module>
   <module>sqoop</module>
  </modules>

 ...
</project>
Just let Jenkins do its job (results)
Just let Jenkins do its job (trending)
What else needs to be taken care of?
  ● Packaged deployment
     ○ packaged artifact verification
     ○ stack validation

Little Puppet on top of JVM:

static PackageManager pm;

@BeforeClass
 static void setUp() {
 ....
 pm = PackageManager.getPackageManager()
 pm.addBinRepo("default", "http://archive.canonical.com/", key)
 pm.refresh()
 pm.install("hadoop-0.20")
The coolest thing about single platform:
void commonPackageTest(String[] gpkgs, Closure smoke, ...) {
   pkgs.each { pm.install(it) }
   pkgs.each { assertTrue("package ${it.name} is
notinstalled",
                           pm.isInstalled(it)) }
   pkgs.each { pm.svc_do(it, "start")}
   smoke.call(args)
}
@Test
void testHadoop() {
  commonPackageTest(["hadoop-0.20", ...],
                        this.&commonSmokeTestsuiteRun,
                        TestHadoopSmoke.class)
  commonPackageTest(["hadoop-0.20", ...],
                        { sh.exec("hadoop fs -ls /") })
}
Fully automated unified reporting
iTest: current status

 ● Version 0.1 available at http://github.com/cloudera/iTest
 ● Apache2.0 licensed
 ● Contributors are welcome (free cookies to first 25)
Putting all technologies together:
 ● Puppet, iTest, Whirr:
    1. Change hits a SCM repo
    2. Hudson build produces Maven + Packaged artifacts
    3. Automatic deployment of modified stacks
    4. Automatic validation using corresponding stack of
       integration tests
    5. Rinse and repeat

 ● Challenges:
    ○ Maven versions vs. packaged versions vs. source
    ○ Strict, draconian discipline in test creation
    ○ Battling combinatoric explosion of stacks
    ○ Size of the cluster (pseudo-distributed <-> 500 nodes)
    ○ Self contained dependencies (JDK to the rescue!)
    ○ Sure, but does it brew espresso?
Definition of Success
● Build powerful platform that allows to
  sustain a high rate of product innovation
  without accumulating the technical debt
● Tighten organizational feedback loop to
  accelerate product evolution
● Improve culture and processes to achieve
  Agility with Stability as the organization
  builds new technologies

More Related Content

What's hot

Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldLester Martin
 
Scaling etl with hadoop shapira 3
Scaling etl with hadoop   shapira 3Scaling etl with hadoop   shapira 3
Scaling etl with hadoop shapira 3Gwen (Chen) Shapira
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAsMaxym Kharchenko
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Gruter
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopGwen (Chen) Shapira
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Chris Nauroth
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 

What's hot (20)

Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Deeplearning
Deeplearning Deeplearning
Deeplearning
 
Scaling etl with hadoop shapira 3
Scaling etl with hadoop   shapira 3Scaling etl with hadoop   shapira 3
Scaling etl with hadoop shapira 3
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Hadoop - How It Works
Hadoop - How It WorksHadoop - How It Works
Hadoop - How It Works
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAs
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 

Viewers also liked

2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in SeoulJongwook Woo
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopMark Johnson
 
Big data testing (1)
Big data testing (1)Big data testing (1)
Big data testing (1)vodqancr
 
Big Data, Big Trouble: Getting into the Flow of Hadoop Testing
Big Data, Big Trouble: Getting into the Flow of Hadoop TestingBig Data, Big Trouble: Getting into the Flow of Hadoop Testing
Big Data, Big Trouble: Getting into the Flow of Hadoop TestingTechWell
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...RTTS
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (7)

2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and Hadoop
 
Big data testing (1)
Big data testing (1)Big data testing (1)
Big data testing (1)
 
Big Data, Big Trouble: Getting into the Flow of Hadoop Testing
Big Data, Big Trouble: Getting into the Flow of Hadoop TestingBig Data, Big Trouble: Getting into the Flow of Hadoop Testing
Big Data, Big Trouble: Getting into the Flow of Hadoop Testing
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
How to be an awesome test automation professional
How to be an awesome test automation professionalHow to be an awesome test automation professional
How to be an awesome test automation professional
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to Hadoop: Big Data Stacks validation w/ iTest How to tame the elephant?

Throwing complexity over the wall: Rapid development for enterprise Java (Jav...
Throwing complexity over the wall: Rapid development for enterprise Java (Jav...Throwing complexity over the wall: Rapid development for enterprise Java (Jav...
Throwing complexity over the wall: Rapid development for enterprise Java (Jav...Dan Allen
 
My "Perfect" Toolchain Setup for Grails Projects
My "Perfect" Toolchain Setup for Grails ProjectsMy "Perfect" Toolchain Setup for Grails Projects
My "Perfect" Toolchain Setup for Grails ProjectsGR8Conf
 
Golang @ Tokopedia
Golang @ TokopediaGolang @ Tokopedia
Golang @ TokopediaQasim Zaidi
 
Towards Continuous Deployment with Django
Towards Continuous Deployment with DjangoTowards Continuous Deployment with Django
Towards Continuous Deployment with DjangoRoger Barnes
 
Deploying software at Scale
Deploying software at ScaleDeploying software at Scale
Deploying software at ScaleKris Buytaert
 
Creating a reasonable project boilerplate
Creating a reasonable project boilerplateCreating a reasonable project boilerplate
Creating a reasonable project boilerplateStanislav Petrov
 
Django deployment with PaaS
Django deployment with PaaSDjango deployment with PaaS
Django deployment with PaaSAppsembler
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...Ambassador Labs
 
Simplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - Sydney
Simplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - SydneySimplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - Sydney
Simplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - SydneyAbhinav Gupta
 
Does Django scale? An introduction to Scalability
Does Django scale? An introduction to ScalabilityDoes Django scale? An introduction to Scalability
Does Django scale? An introduction to ScalabilityDavid Arcos
 
OWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA TestersOWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA TestersJavan Rasokat
 
An introduction to maven gradle and sbt
An introduction to maven gradle and sbtAn introduction to maven gradle and sbt
An introduction to maven gradle and sbtFabio Fumarola
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Developers Testing - Girl Code at bloomon
Developers Testing - Girl Code at bloomonDevelopers Testing - Girl Code at bloomon
Developers Testing - Girl Code at bloomonIneke Scheffers
 
ContainerCon - Test Driven Infrastructure
ContainerCon - Test Driven InfrastructureContainerCon - Test Driven Infrastructure
ContainerCon - Test Driven InfrastructureYury Tsarev
 
PaaSTA: Running applications at Yelp
PaaSTA: Running applications at YelpPaaSTA: Running applications at Yelp
PaaSTA: Running applications at YelpNathan Handler
 
2014 11 20 Drupal 7 -> 8 test migratie
2014 11 20 Drupal 7 -> 8 test migratie2014 11 20 Drupal 7 -> 8 test migratie
2014 11 20 Drupal 7 -> 8 test migratiehcderaad
 
Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...
Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...
Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...mfrancis
 

Similar to Hadoop: Big Data Stacks validation w/ iTest How to tame the elephant? (20)

Throwing complexity over the wall: Rapid development for enterprise Java (Jav...
Throwing complexity over the wall: Rapid development for enterprise Java (Jav...Throwing complexity over the wall: Rapid development for enterprise Java (Jav...
Throwing complexity over the wall: Rapid development for enterprise Java (Jav...
 
My "Perfect" Toolchain Setup for Grails Projects
My "Perfect" Toolchain Setup for Grails ProjectsMy "Perfect" Toolchain Setup for Grails Projects
My "Perfect" Toolchain Setup for Grails Projects
 
Golang @ Tokopedia
Golang @ TokopediaGolang @ Tokopedia
Golang @ Tokopedia
 
Towards Continuous Deployment with Django
Towards Continuous Deployment with DjangoTowards Continuous Deployment with Django
Towards Continuous Deployment with Django
 
Integration testing - A&BP CC
Integration testing - A&BP CCIntegration testing - A&BP CC
Integration testing - A&BP CC
 
Deploying software at Scale
Deploying software at ScaleDeploying software at Scale
Deploying software at Scale
 
Creating a reasonable project boilerplate
Creating a reasonable project boilerplateCreating a reasonable project boilerplate
Creating a reasonable project boilerplate
 
Django deployment with PaaS
Django deployment with PaaSDjango deployment with PaaS
Django deployment with PaaS
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
 
Simplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - Sydney
Simplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - SydneySimplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - Sydney
Simplified CI/CD Flows for Salesforce via SFDX - Downunder Dreamin - Sydney
 
Does Django scale? An introduction to Scalability
Does Django scale? An introduction to ScalabilityDoes Django scale? An introduction to Scalability
Does Django scale? An introduction to Scalability
 
OWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA TestersOWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA Testers
 
An introduction to maven gradle and sbt
An introduction to maven gradle and sbtAn introduction to maven gradle and sbt
An introduction to maven gradle and sbt
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Developers Testing - Girl Code at bloomon
Developers Testing - Girl Code at bloomonDevelopers Testing - Girl Code at bloomon
Developers Testing - Girl Code at bloomon
 
ContainerCon - Test Driven Infrastructure
ContainerCon - Test Driven InfrastructureContainerCon - Test Driven Infrastructure
ContainerCon - Test Driven Infrastructure
 
PaaSTA: Running applications at Yelp
PaaSTA: Running applications at YelpPaaSTA: Running applications at Yelp
PaaSTA: Running applications at Yelp
 
2014 11 20 Drupal 7 -> 8 test migratie
2014 11 20 Drupal 7 -> 8 test migratie2014 11 20 Drupal 7 -> 8 test migratie
2014 11 20 Drupal 7 -> 8 test migratie
 
Gallio Crafting A Toolchain
Gallio Crafting A ToolchainGallio Crafting A Toolchain
Gallio Crafting A Toolchain
 
Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...
Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...
Improved developer productivity thanks to Maven and OSGi - Lukasz Dywicki (Co...
 

More from Dmitri Shiryaev

Enterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdf
Enterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdfEnterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdf
Enterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdfDmitri Shiryaev
 
Uniting Data JavaOne2013
Uniting Data JavaOne2013Uniting Data JavaOne2013
Uniting Data JavaOne2013Dmitri Shiryaev
 
RFID Technology and Internet of Things
RFID Technology and Internet of ThingsRFID Technology and Internet of Things
RFID Technology and Internet of ThingsDmitri Shiryaev
 
Composite Applications with SOA, BPEL and Java EE
Composite  Applications with SOA, BPEL and Java EEComposite  Applications with SOA, BPEL and Java EE
Composite Applications with SOA, BPEL and Java EEDmitri Shiryaev
 
A Guide to the SOA Galaxy: Strategy, Design and Best Practices
A Guide to the SOA Galaxy: Strategy, Design and Best PracticesA Guide to the SOA Galaxy: Strategy, Design and Best Practices
A Guide to the SOA Galaxy: Strategy, Design and Best PracticesDmitri Shiryaev
 
SOA Strategy and Architecture
SOA Strategy and ArchitectureSOA Strategy and Architecture
SOA Strategy and ArchitectureDmitri Shiryaev
 

More from Dmitri Shiryaev (6)

Enterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdf
Enterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdfEnterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdf
Enterprise Cyber-Physical Edge Virtualization Engine (EVE) Project.pdf
 
Uniting Data JavaOne2013
Uniting Data JavaOne2013Uniting Data JavaOne2013
Uniting Data JavaOne2013
 
RFID Technology and Internet of Things
RFID Technology and Internet of ThingsRFID Technology and Internet of Things
RFID Technology and Internet of Things
 
Composite Applications with SOA, BPEL and Java EE
Composite  Applications with SOA, BPEL and Java EEComposite  Applications with SOA, BPEL and Java EE
Composite Applications with SOA, BPEL and Java EE
 
A Guide to the SOA Galaxy: Strategy, Design and Best Practices
A Guide to the SOA Galaxy: Strategy, Design and Best PracticesA Guide to the SOA Galaxy: Strategy, Design and Best Practices
A Guide to the SOA Galaxy: Strategy, Design and Best Practices
 
SOA Strategy and Architecture
SOA Strategy and ArchitectureSOA Strategy and Architecture
SOA Strategy and Architecture
 

Recently uploaded

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Recently uploaded (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Hadoop: Big Data Stacks validation w/ iTest How to tame the elephant?

  • 1. Hadoop: Big Data Stacks validation w/ iTest How to tame the elephant? Konstantin Boudnik, Ph.D., Cloudera Inc. cos@apache.org Hadoop Committer, Pig Contributor Roman Shaposhnik, Cloudera Inc. rvs@cloudera.com Hadoop Contributor, Oozie Committer Andre Arcilla belowsov@gmail.com Hadoop Integration Architect
  • 2. This content is licensed under Creative Commons BY-NC-SA 3.0
  • 3. Agenda ● Problem we are facing ○ Big Data Stacks ○ Why validation ● What is "Success" and the effort to achieve it ● Solutions ○ Ops testing ○ Platform certification ○ Application testing ● Stack on stack ○ Test artifacts are First Class Citizen ○ Assembling validation stack (vstack) ○ Tailoring vstack for target clusters ○ D3: Deployment/Dependencies/Determinism
  • 4. Not on Agenda ● Development cycles ○ Features, fixes, commits, patch testing ● Release engineering ○ Release process ○ Package preparations ○ Release notes ○ RE deployments ○ Artifact management ○ Branching strategies ● Application deployment ○ Cluster update strategies (rolling vs. offline) ○ Cluster upgrades ○ Monitoring ○ Statistics collection ● We aren't done yet, but...
  • 5. What's a Big Data Stack anyway? Just a base layer!
  • 6. What is Big Data Stack? Guess again...
  • 7. What is Big Data Stack? A bit more...
  • 8. What is Big Data Stack? A bit or two more...
  • 9. What is Big Data Stack? A bit or two more + a bit = HIVE
  • 10. What is Big Data Stack?
  • 11. What is Big Data Stack? And a Sqoop of flavor...
  • 12. What is Big Data Stack? A Babylon tower?
  • 13. What is the glue that holds the bricks? ● Packaging ○ RPM, DEB, YINST, EINST? ○ Not the best fit for Java ● Maven ○ Part of the Java ecosystem ○ Not the best tool for non-Java artifacts ● Common APIs we will assume ○ Versioned artifacts ○ Dependencies ○ BOMs
  • 14. How can we possibly guarantee that everything is right?
  • 15. Development & Deployment Discipline !
  • 16. Is it enough really?
  • 18. Components: ● I want Pig 0.7 ● You need Pig 0.8
  • 19. Components: ● I want Pig 0.7 ● You need Pig 0.8 Configurations: ● Have I put in 5th data volume to DN's conf of my 6TB nodes? ● Have I _not_ copied it accidently to my 4TB nodes?
  • 20. Components: ● I want Pig 0.7 ● You need Pig 0.8 Configurations: ● Have I put in 5th data volume to DN's conf of my 6TB nodes? ● Have I _not_ copied it accidently to my 4TB nodes? Auxiliary services: ● Does my Sqoop has Oracle connector?
  • 21. Honestly: ● Can anyone remember all these details?
  • 22. What if you've missed some?
  • 23. How would you like your 10th re-spin of a release? Make it bloody this time, please...
  • 27. And don't you have anything better to do with that life of yours?
  • 28. Who Needs A Stack Anyway? ● Developers ○ reference platform which is complete and works ○ less handholding for downstream groups ● QE ○ definitive testable artifact, something to explore and certify ○ free QE to do QE ● Operations ○ deploy a "Hadoop", not a "Hadoop erector set" ○ confidence in dev/QE effort ● Eliminate cross-team effort "bleed" ● Align teams on a common "axis"
  • 29. Successful Stack Deployment System "Customers Are Happy!" ● Deploy stack configuration X to a cluster Y, for such X & Y: ○ satisfy devs, QE and Ops ○ variable enough to absorb all the crazy use cases
  • 30. Who Needs Stack Testing Anyway? ● All stack customers ○ Yes QE for sure (integration, system, CI levels...), but also ○ Devs: did my latest patch break security auth or slowed HDFS to a crawl? ○ Ops: you ask for "striping". Did YOU test striping? Can WE test it as well? ○ Customers: did you test MY stack
  • 31. Successful Integration Testing System "Customers Are Happy!" 1. QE: easy to deploy, configure, run, add tests 2. Stakeholders: provides relevant info about stack quality 1. requires plenty of relevant tests and datapoints 1. see 1)
  • 32. Effort Required To Deploy/Test Stacks
  • 33. The Parable of Garage ● Lots of stuff coexisting together ○ fancy, polished machines (mountain bike) ○ utility tools (hedge trimmer, shop vacuum) ○ ugly misfits (old paint pan, smelly dry mushrooms) ● Cannot be "simplified" due to complexity and evolving nature ● Need to manage complexity: use, change, check ● Keep objects inviolate: cannot expect a bucket to mount directly on a wall ● Reliance on external services: studs, drywall, garage door
  • 34. The Complexity Management Solution ● Provide the Framework and the means for objects to plug into this framework ● Framework ○ holds everything together, enables management ○ sophisticated design, well engineered, sturdy ● "Glue"/"Bubblegum logic" ○ binds componets to the framework via minimally- invasive API ○ easy to design, engineering complexity is lower ○ follows well-documented process ● Components ○ actual participants of the garage ○ require no modifications
  • 35. Engineering Effort Matrix Development Level of External Develop in Investment Effort Complexity Dependencies Parallel Framework Medium High Many One time No Few or none Component Low-Medium Incremental, Low-Medium provided by Yes Glue by example per component framework ● Framework - high complexity one time effort ● Components - per-component incremental effort, average complexity, code-by-example
  • 36. Validation Stack for Big Data A Babylon tower vs Tower of Hanoi
  • 37. Validation Stack (tailored) Or something like this...
  • 38. Use accepted platform, tools, practices ● JVM is simply The Best ○ Disclaimer: not to start religious war ● Yet, Java isn't dynamic enough (as in JDK6) ○ But we don't care what's your implementation language ○ Groovy, Scala, Clojure, JPython (?) ● Everyone knows JUnit/TestNG ○ alas not everyone can use it effectively ● Dependency tracking and packaging ○ Maven ● Information radiators facilitate data comprehension and sharing ○ TeamCity ○ Jenkins
  • 39. Few more pieces ● Tests/workloads have to be artifact'ed ○ It's not good to go fishing for test classes ● Artifacts have to be self-contained ○ Reading 20 pages to find a URL to copy a file from? "Forget about it" (C) ● Standard execution interface ○ JUnit's Runner is as good as any custom one ● A recognizable reporting format ○ XML sucks, but at least it has a structure
  • 40. A test artifact (PigSmoke 0.9-SNAPSHOT) <project> <groupId>org.apache.pig</groupId> <artifactId>pigsmoke</artifactId> <packaging>jar</packaging> <version>0.9.0-SNAPSHOT</version> <dependencies> <dependency> <groupId>org.apache.pig</groupId> <artifactId>pigunit</artifactId> <version>0.9.0-SNAPSHOT</version> </dependency> <dependency> <groupId>org.apache.pig</groupId> <artifactId>pig</artifactId> <version>0.9.0-SNAPSHOT</version> </dependency> </dependencies> </project>
  • 41. How do we write iTest artifacts $ cat TestHadoopTinySmoke.groovy class TestHadoopTinySmoke { .... @BeforeClass static void setUp() throws IOException { String pattern = null; //Let's unpack everything JarContent.unpackJarContainer(TestHadoopSmoke.class, '.' , pattern); ..... } @Test void testCacheArchive() { def conf = (new Configuration()).get("fs.default.name"); .... sh.exec("hadoop fs -rmr ${testDir}/cachefile/out", "hadoop ....
  • 42. Add suitable dependencies (if desired) <project> ... <dependency> <groupId>org.apache.pig</groupId> <artifactId>pigsmoke</artifactId> <version>0.9-SNAPSHOT</version> <scope>test</scope> </dependency> <!-- OMG: Hadoop dependency _WAS_ missed --> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>0.20.2-CDH3B4-SNAPSHOT</version> </dependency> ...
  • 43. Unpack data (if needed) ... <execution> <id>unpack-testartifact-jar</id> <phase>generate-test-resources</phase> <goals> <goal>unpack</goal> </goals> <configuration> <artifactItems> <artifactItem> <groupId>org.apache.pig</groupId> <artifactId>pigsmoke</artifactId> <version>0.9-SNAPSHOT</version> <type>jar</type> <outputDirectory>${project.build.directory}</outputDirectory> <includes>test/data/**/*</includes> </artifactItem> </artifactItems> </configuration> </execution> ...
  • 44. Find runtime libraries (if required) ... <execution> <id>find-lzo-jar</id> <phase>pre-integration-test</phase> <goals> <goal>execute</goal> </goals> <configuration> <source> try { project.properties['lzo.jar'] = new File("${HADOOP_HOME}/lib").list( [accept:{d, f-> f ==~ /hadoop.*lzo.*.jar/ }] as FilenameFilter ).toList().get(0); } catch (java.lang.IndexOutOfBoundsException ioob) { log.error "No lzo.jar has been found under ${HADOOP_HOME}/lib. Check your installation."; throw ioob; } </source> </configuration> </execution> </project>
  • 45. Take it easy: iTest will do the rest ... <execution> <id>check-testslist</id> <phase>pre-integration-test</phase> <goals> <goal>execute</goal> </goals> <configuration> <source><![CDATA[ import org.apache.itest.* if (project.properties['org.codehaus.groovy.maven.destination'] && project.properties['org.codehaus.groovy.maven.jar']) { def prefix = project.properties['org.codehaus.groovy.maven.destination']; JarContent.listContent(project.properties['org.codehaus.groovy.maven.jar']). each { TestListUtils.touchTestFiles(prefix, it); }; }]]> </source> </configuration> </execution> ...
  • 46. Tailored validation stack <project> <groupId>com.cloudera.itest</groupId> <artifactId>smoke-tests</artifactId> <packaging>pom</packaging> <version>1.0-SNAPSHOT</version> <name>hadoop-stack-validation</name> ... <!-- List of modules which should be executed as a part of stack testing run --> <modules> <module>pig</module> <module>hive</module> <module>hadoop</module> <module>hbase</module> <module>sqoop</module> </modules> ... </project>
  • 47. Just let Jenkins do its job (results)
  • 48. Just let Jenkins do its job (trending)
  • 49. What else needs to be taken care of? ● Packaged deployment ○ packaged artifact verification ○ stack validation Little Puppet on top of JVM: static PackageManager pm; @BeforeClass static void setUp() { .... pm = PackageManager.getPackageManager() pm.addBinRepo("default", "http://archive.canonical.com/", key) pm.refresh() pm.install("hadoop-0.20")
  • 50. The coolest thing about single platform: void commonPackageTest(String[] gpkgs, Closure smoke, ...) { pkgs.each { pm.install(it) } pkgs.each { assertTrue("package ${it.name} is notinstalled", pm.isInstalled(it)) } pkgs.each { pm.svc_do(it, "start")} smoke.call(args) } @Test void testHadoop() { commonPackageTest(["hadoop-0.20", ...], this.&commonSmokeTestsuiteRun, TestHadoopSmoke.class) commonPackageTest(["hadoop-0.20", ...], { sh.exec("hadoop fs -ls /") }) }
  • 52. iTest: current status ● Version 0.1 available at http://github.com/cloudera/iTest ● Apache2.0 licensed ● Contributors are welcome (free cookies to first 25)
  • 53. Putting all technologies together: ● Puppet, iTest, Whirr: 1. Change hits a SCM repo 2. Hudson build produces Maven + Packaged artifacts 3. Automatic deployment of modified stacks 4. Automatic validation using corresponding stack of integration tests 5. Rinse and repeat ● Challenges: ○ Maven versions vs. packaged versions vs. source ○ Strict, draconian discipline in test creation ○ Battling combinatoric explosion of stacks ○ Size of the cluster (pseudo-distributed <-> 500 nodes) ○ Self contained dependencies (JDK to the rescue!) ○ Sure, but does it brew espresso?
  • 54. Definition of Success ● Build powerful platform that allows to sustain a high rate of product innovation without accumulating the technical debt ● Tighten organizational feedback loop to accelerate product evolution ● Improve culture and processes to achieve Agility with Stability as the organization builds new technologies