1. DevOps for Big Data
Enabling Continuous Delivery for
data analytics applications based on
Hadoop, Vertica, and Tableau
Max Martynov, VP of Technology
Grid Dynamics
2. Introductions
• Grid Dynamics
─ Solutions company, specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years
• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous
upgrades
─ Targets web-scale and big data applications
3. State of DevOps and Continuous Delivery
Continuous Delivery Value
• Agility
• Transparency
• Efficiency
• Consistency
• Quality
• Control
Findings from the 2014 State of DevOps Report
• Strong IT performance is a competitive advantage
• DevOps practices improve IT performance
• Organizational culture matters
• Job satisfaction is the No. 1 predictor of organizational performance
4. Continuous Delivery Infrastructure
• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments
• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review
• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CICD pipeline
* Big Data applications bring additional challenges in these areas due to large data volumes, complex business logic, and large-scale environments.
5. Implementing Continuous Delivery for Big Data:
Initial State of the Project
• Medium size distributed development team
• Diverse technology stack – Hadoop + Vertica + Tableau
• Only one environment existed, and it was production
• Delivery pipeline:
• Procurement of hardware for a new environment took months
[Pipeline diagram: Development Team → Production]
7. Hadoop Analytical Application
[Cluster diagram: Manager, Master, Database, Slaves 1 - N]
10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers
How to quickly reproduce this environment for dev-test purposes?
8. 1. Stop-Gap Measure
• Same hardware, different logical “zones” implemented on the file system
• Automated build and deployment
• Delivery pipeline:
[Pipeline diagram: Development Team → Production cluster with file-system zones /test1-N, /stage, /prod]
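The zone convention can be sketched as a simple path-prefix mapping (the zone and dataset names here are illustrative, not taken from the actual project):

```python
# Sketch of the stop-gap "logical zones": dev/test, stage, and prod share one
# physical cluster, separated only by a top-level directory prefix on the
# shared file system. Zone and dataset names are illustrative.

ZONES = {"/test1", "/test2", "/stage", "/prod"}

def zone_path(zone: str, relative: str) -> str:
    """Resolve a dataset path inside a logical zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{relative.lstrip('/')}"

# The same job can run against prod or a test zone by switching one prefix:
print(zone_path("/prod", "events/2014/10"))   # /prod/events/2014/10
print(zone_path("/test1", "events/2014/10"))  # /test1/events/2014/10
```

Because only the prefix changes, every job and deployment script stays identical across zones, which is what made this approach workable as a stop-gap.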
9. 1. Stop-Gap Measure: Pros and Cons
Pros
• Better than before: code can be tested before it goes to production
• All logical environments have access to the same production data
• Zero additional environment costs
Cons
• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware
• Performance issues: tests affect production performance
• Impossible to run “destructive” tests that affect shared production data
• Impossible to test upgrades of middleware (new versions of H* components)
10. 2. Hadoop Dynamic Environments
[Diagram: Dev/QA/Ops request an environment; the platform orchestrates environment provisioning and application deployment, composing data, components, custom application services, and environment policies into Dev, QA, Stage, and Prod environments]
11. 2. Hadoop Dynamic Environments (continued)
• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications
• A new environment can be created from scratch in 2-3 hours
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance
• An application can be deployed to an environment within 10 minutes
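As an illustration of what such a self-service request might look like (every field and profile name here is hypothetical, not the actual portal's API), the requester supplies only a profile and sizing, leaving provisioning details to the orchestrator:

```python
# Hypothetical self-service environment request: what a Dev/QA engineer
# might submit to the portal. All field and profile names are illustrative.

def make_environment_request(profile: str, nodes: int, app_version: str) -> dict:
    """Build a request for one of the environment profiles mentioned above."""
    profiles = {"dev-sandbox", "multi-node-qa", "perf-cluster"}
    if profile not in profiles:
        raise ValueError(f"unknown profile: {profile}")
    if profile == "dev-sandbox" and nodes != 1:
        raise ValueError("a dev sandbox is single-node")
    return {
        "profile": profile,
        "nodes": nodes,
        "application_version": app_version,
    }

print(make_environment_request("dev-sandbox", 1, "1.4.2"))
```

Keeping the request this small is the point of self-service: the orchestrator, not the requester, owns the deployment automation.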
12. 3. Vertica and Tableau Dynamic Environments
[Diagram: Dev/QA/Ops request an environment; the platform orchestrates environment provisioning and application deployment, composing data, UDF, VSQL, and configuration components into Dev, QA, Stage, and Prod environments; one component is provided as a shared service]
13. 4. Tests & Test Data
• Dev and QA teams implemented automated tests
• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced representative snapshot of obfuscated production data (10TB -> 10GB)
Test pyramid (top to bottom):
• Exploratory tests: manual tests; snapshot of production data
• Integration tests (integration with data): auto tests at the “API” level, validating job output; snapshot of production data
• Component tests: auto tests at the “API” level, testing job output; test-generated data
• Unit tests: Java code, auto-generated data; build-time validation
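Option 2 above can be sketched as sampling plus obfuscation (the field names, sampling fraction, and in-memory processing are all illustrative; a real snapshot job for a 10 TB dataset would run on the cluster):

```python
import hashlib
import random

def obfuscate(record: dict, pii_fields=("user_id", "email")) -> dict:
    """Replace direct identifiers with stable one-way hashes so joins still work."""
    out = dict(record)
    for field in pii_fields:
        if field in out:
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()[:12]
    return out

def reduced_snapshot(records, fraction=0.001, seed=42):
    """Keep a small, reproducible sample and obfuscate every kept record."""
    rng = random.Random(seed)
    return [obfuscate(r) for r in records if rng.random() < fraction]

data = [{"user_id": i, "amount": i * 10} for i in range(10_000)]
sample = reduced_snapshot(data, fraction=0.01)
print(len(sample))  # a ~1% sample of the input
```

Hashing rather than randomizing identifiers keeps referential integrity across tables, so the reduced snapshot still behaves like production data in join-heavy analytics jobs.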
14. 5. CICD pipeline
With all components ready, implementing a CICD pipeline is easy:
[Pipeline diagram: Development Team → 1. Develop & experiment (dev sandbox) → 2. Commit (GitHub Flow) → 3. Build & unit test → 4. Deploy → 5. Test (QA environment) → 6. Release]
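The automated part of the flow can be sketched as an ordered list of stages, where each must pass before the next runs (stage names mirror the steps above; the lambda bodies are stand-ins for real build, deploy, and test commands):

```python
# Minimal pipeline runner: stages execute in order and the pipeline stops at
# the first failure, so nothing reaches "release" without passing QA tests.

def run_pipeline(stages) -> bool:
    for name, stage in stages:
        ok = stage()
        print(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False
    return True

stages = [
    ("build & unit test", lambda: True),   # step 3
    ("deploy to QA", lambda: True),        # step 4
    ("integration test", lambda: True),    # step 5
    ("release", lambda: True),             # step 6
]
run_pipeline(stages)  # True when every stage passes
```

In practice a CI server plays this role, but the control flow is the same: a strictly ordered sequence with fail-fast semantics.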
17. Results
• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes, tests are run on separate environments
─ Features are deployed to production only after validation
• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention
• Reduced costs
─ No need to procure in-house hardware and manage in-house datacenter
─ Dynamic environments save money because they run only when they are needed
19. OCTOBER 14
Thank You
Max Martynov, VP of Technology, Grid Dynamics
mmartynov@griddynamics.com
Victoria Livschitz, CEO and Founder, Qubell
vlivschitz@qubell.com