2. What Data Pipelines Are Made Of
• Big Data applications:
  • Ingestion
  • Storage
  • Processing
  • Serving
  • Workflows
  • Machine learning
• Data Sources and Destinations
• Tests?
• Schemas??
3. Archetypes of Data Pipeline Builders
Data People (Data Scientists / Analysts / BI Devs):
• Exploratory workloads
• Data centric
• Simple deployment
Software Developers:
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment
4. Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentality between data professionals/engineers
• Mixture of technologies
• Data as integration point
• Often schema-less
• Lack of tools
5. Continuous Delivery
• Keep software in a production-ready state
• Test all the changes: unit, integration
• Exercise deployments
• Faster feedback cycle
7. The Case for CI/CD/DevOps in Big Data Projects
• Coordination: data engineers, analysts, business, ops
• Integrate and test critical jobs
• Complex infrastructure: multiple distributed systems
• Need to decouple cluster operation via APIs/DSLs
• DevOps team to manage cluster operations: scaling, monitoring, deployment
• Include CI/CD practices as part of the delivery process
9. How are these techniques applicable to Big Data applications?
10. What Do We Need for Deploying Our Apps?
• Source control system: Git, Hg, etc.
• A CI process to run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to deploy to the cluster
• A mechanism to monitor the behaviour and performance of the app
11. Who are we?
Software developers with years of Big Data experience
What do we want?
Simple and robust way to deploy Big Data applications
How will we get it?
Write thousands of lines of code on top of Mesos
12. Amaterasu - Simple Continually Deployed Data Apps
• Amaterasu is the Shinto goddess of the sun
• In the Japanese manga series Naruto, Amaterasu is a supernatural power in the shape of a black flame that can only be put out by its caster
• Started as a framework to reliably execute Spark driver programs
13. Amaterasu - Simple Continually Deployed Data Apps
• Big Data apps in multiple frameworks (currently only Spark is supported)
• Multiple languages (soon)
• Workflow as YAML
• Simple to write, easy to deploy
• Reliable execution (via Mesos)
• Multiple environments
14. Big Data Pipeline Ops Requirements
• Support managing multiple distributed technologies: Apache Spark, HDFS, Kafka, Cassandra, etc.
• Treat the data center as the OS while providing resource isolation, scalability and fault tolerance
• Ability to run multiple tasks per machine to maximize utilization
15. Why Mesos?
• General-purpose, battle-tested cluster resource scheduler
• Can run major modern Big Data systems: Hadoop, Spark, Kafka, Cassandra
• Can deploy Spark as part of the execution
• Supports scheduled and long-running apps
• Improves resource management and efficiency
• Great APIs
• DC/OS provides an even richer environment
16. Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - Git repositories
• Support for local directories is planned for a future release
• Repo structure:
  • maki.yml - the workflow definition (see the sketch after this list)
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
• Benefits of using Git:
  • Branching
  • Tooling
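To make the repository layout concrete, here is a minimal sketch of what a maki.yml workflow definition might look like. The exact schema is not shown on these slides, so the field names, nesting and file names below are illustrative assumptions only.

```yaml
# Illustrative maki.yml - field names are assumptions, not the official schema
job-name: orders-pipeline          # hypothetical job name
flow:
  - name: ingest                   # first action, maps to a script under src/
    runner:
      group: spark
      type: scala
    file: ingest.scala
  - name: aggregate                # later action, can read the outputs of ingest
    runner:
      group: spark
      type: scala
    file: aggregate.scala
```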
18. Amaterasu is not a workflow engine; it's a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications
19. Actions DSL
• Your Scala Spark code (more languages to come)
• A few changes:
  • Don't create a new sc/sqlContext - use the one in scope, or access it via AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame and AmaContext.getRDD are used to access data from previously executed actions (see the sketch after this list)
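A minimal sketch of an action under these rules, assuming the AmaContext accessors named above are provided in scope by the runtime; the getDataFrame parameter names/order, the dataset names and the output path are illustrative assumptions, not the documented API.

```scala
// Illustrative Amaterasu action (Scala + Spark).
// No new SparkContext/SQLContext is created - Amaterasu provides them in scope
// (or via AmaContext.sc / AmaContext.sqlContext).

// Hypothetical call: read a DataFrame produced by a previously executed action.
// Assumed (illustrative) parameters: (previous action name, dataset name).
val orders = AmaContext.getDataFrame("ingest", "orders")

// From here on it is ordinary Spark code.
val bigOrders = orders.filter(orders("total") > 1000)

// Write the result where a later action (or the serving layer) can pick it up;
// the path is illustrative.
bigOrders.write.parquet("/data/pipelines/orders/big-orders")
```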
21. Environments
• Configuration is stored per environment
• Stored as JSON
• Contains (see the sketch after this list):
  • Spark master URI
  • Input/output path
  • Work dir
  • User-defined key-values
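Based on the items listed above, a per-environment file might look roughly like the JSON below; the key names are illustrative assumptions, not the actual Amaterasu configuration schema.

```json
{
  "master": "mesos://mesos-master:5050",
  "inputRootPath": "hdfs:///data/dev/input",
  "outputRootPath": "hdfs:///data/dev/output",
  "workingDir": "hdfs:///tmp/amaterasu/dev",
  "configuration": {
    "retention.days": "30",
    "sample.ratio": "0.1"
  }
}
```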
25. Future Development
• Continuous integration and test automation
• R, shell and Python support (R is already in progress)
• Extend environments to support:
  • Full Spark configuration (spark-defaults.conf, etc.)
  • Extendable configuration model
• Better tooling
• DC/OS Universe package
• Other frameworks: Flink, Vowpal Wabbit
• YARN?