1. DevOps for Big Data
Enabling Continuous Delivery for
data analytics applications based on
Hadoop, Vertica, and Tableau
Max Martynov, VP of Technology
Grid Dynamics
2. Introductions
• Grid Dynamics
─ Solutions company, specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years
• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous
upgrades
─ Targets web-scale and big data applications
3. State of DevOps and Continuous Delivery
Continuous Delivery Value
• Agility
• Transparency
• Efficiency
• Consistency
• Quality
• Control
Findings from the 2014 State of DevOps Report
• Strong IT performance is a competitive advantage
• DevOps practices improve IT performance
• Organizational culture matters
• Job satisfaction is the No. 1 predictor of organizational performance
4. Continuous Delivery Infrastructure
• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments
• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review
• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CICD pipeline
* Big Data applications bring additional challenges in these areas due to large data volumes, complex business logic, and large-scale environments.
5. Implementing Continuous Delivery for Big Data:
Initial State of the Project
• Medium size distributed development team
• Diverse technology stack – Hadoop + Vertica + Tableau
• Only one environment existed, and it was production
• Delivery pipeline:
• Procurement of hardware for a new environment took months
[Pipeline diagram: Development Team → Production]
7. Hadoop Analytical Application
[Cluster diagram: Manager, Master, Database, Slaves 1 - N]
10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers
How to quickly reproduce this environment for dev-test purposes?
8. 1. Stop-Gap Measure
• Same hardware, different logical “zones” implemented on the file system
• Automated build and deployment
• Delivery pipeline:
[Pipeline diagram: Development Team → Production cluster with file-system zones /test1-N, /stage, /prod]
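The zone convention can be sketched as a simple path-prefix mapping (the zone and dataset names here are illustrative, not taken from the actual project):

```python
# Sketch of the stop-gap "logical zones": dev/test, stage, and prod share one
# physical cluster, separated only by a top-level directory prefix on the
# shared file system. Zone and dataset names are illustrative.

ZONES = {"/test1", "/test2", "/stage", "/prod"}

def zone_path(zone: str, relative: str) -> str:
    """Resolve a dataset path inside a logical zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{relative.lstrip('/')}"

# The same job can run against prod or a test zone by switching one prefix:
print(zone_path("/prod", "events/2014/10"))   # /prod/events/2014/10
print(zone_path("/test1", "events/2014/10"))  # /test1/events/2014/10
```

Because only the prefix changes, every job and deployment script stays identical across zones, which is what made this approach workable as a stop-gap.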
9. 1. Stop-Gap Measure: Pros and Cons
Pros
• Better than before: code can be tested before it goes to production
• All logical environments have access to the same production data
• Zero additional environment costs
Cons
• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware
• Performance issues: tests affect production performance
• Impossible to run “destructive” tests that affect shared production data
• Impossible to test upgrades of middleware (new versions of H* components)
10. 2. Hadoop Dynamic Environments
[Diagram: Dev/QA/Ops request an environment; the platform orchestrates environment provisioning and application deployment, composing data, components, custom application services, and environment policies into Dev, QA, Stage, and Prod environments]
11. 2. Hadoop Dynamic Environments (continued)
• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications
• A new environment can be created from scratch in 2-3 hours
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance
• An application can be deployed to an environment within 10 minutes
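As an illustration of what such a self-service request might look like (every field and profile name here is hypothetical, not the actual portal's API), the requester supplies only a profile and sizing, leaving provisioning details to the orchestrator:

```python
# Hypothetical self-service environment request: what a Dev/QA engineer
# might submit to the portal. All field and profile names are illustrative.

def make_environment_request(profile: str, nodes: int, app_version: str) -> dict:
    """Build a request for one of the environment profiles mentioned above."""
    profiles = {"dev-sandbox", "multi-node-qa", "perf-cluster"}
    if profile not in profiles:
        raise ValueError(f"unknown profile: {profile}")
    if profile == "dev-sandbox" and nodes != 1:
        raise ValueError("a dev sandbox is single-node")
    return {
        "profile": profile,
        "nodes": nodes,
        "application_version": app_version,
    }

print(make_environment_request("dev-sandbox", 1, "1.4.2"))
```

Keeping the request this small is the point of self-service: the orchestrator, not the requester, owns the deployment automation.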
12. 3. Vertica and Tableau Dynamic Environments
[Diagram: Dev/QA/Ops request an environment; the platform orchestrates environment provisioning and application deployment, composing data, UDF, VSQL, and configuration components into Dev, QA, Stage, and Prod environments; one component is provided as a shared service]
13. 4. Tests & Test Data
• Dev and QA teams implemented automated tests
• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced representative snapshot of obfuscated production data (10TB -> 10GB)
Test pyramid (top to bottom):
• Exploratory tests: manual tests; snapshot of production data
• Integration tests (integration with data): auto tests at the “API” level, validating job output; snapshot of production data
• Component tests: auto tests at the “API” level, testing job output; test-generated data
• Unit tests: Java code, auto-generated data; build-time validation
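Option 2 above can be sketched as sampling plus obfuscation (the field names, sampling fraction, and in-memory processing are all illustrative; a real snapshot job for a 10 TB dataset would run on the cluster):

```python
import hashlib
import random

def obfuscate(record: dict, pii_fields=("user_id", "email")) -> dict:
    """Replace direct identifiers with stable one-way hashes so joins still work."""
    out = dict(record)
    for field in pii_fields:
        if field in out:
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()[:12]
    return out

def reduced_snapshot(records, fraction=0.001, seed=42):
    """Keep a small, reproducible sample and obfuscate every kept record."""
    rng = random.Random(seed)
    return [obfuscate(r) for r in records if rng.random() < fraction]

data = [{"user_id": i, "amount": i * 10} for i in range(10_000)]
sample = reduced_snapshot(data, fraction=0.01)
print(len(sample))  # a ~1% sample of the input
```

Hashing rather than randomizing identifiers keeps referential integrity across tables, so the reduced snapshot still behaves like production data in join-heavy analytics jobs.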
14. 5. CICD pipeline
With all components ready, implementing a CICD pipeline is easy:
[Pipeline diagram: Development Team → 1. Develop & experiment (dev sandbox) → 2. Commit (GitHub Flow) → 3. Build & unit test → 4. Deploy → 5. Test (QA environment) → 6. Release]
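The automated part of the flow can be sketched as an ordered list of stages, where each must pass before the next runs (stage names mirror the steps above; the lambda bodies are stand-ins for real build, deploy, and test commands):

```python
# Minimal pipeline runner: stages execute in order and the pipeline stops at
# the first failure, so nothing reaches "release" without passing QA tests.

def run_pipeline(stages) -> bool:
    for name, stage in stages:
        ok = stage()
        print(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False
    return True

stages = [
    ("build & unit test", lambda: True),   # step 3
    ("deploy to QA", lambda: True),        # step 4
    ("integration test", lambda: True),    # step 5
    ("release", lambda: True),             # step 6
]
run_pipeline(stages)  # True when every stage passes
```

In practice a CI server plays this role, but the control flow is the same: a strictly ordered sequence with fail-fast semantics.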
17. Results
• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes, tests are run on separate environments
─ Features are deployed to production only after validation
• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention
• Reduced costs
─ No need to procure in-house hardware and manage in-house datacenter
─ Dynamic environments save money because they run only when they are needed
19. OCTOBER 14
Thank You
Max Martynov, VP of Technology, Grid Dynamics
mmartynov@griddynamics.com
Victoria Livschitz, CEO and Founder, Qubell
vlivschitz@qubell.com