15. Apache Mesos
“Apache Mesos abstracts CPU, memory, storage,
and other compute resources away from machines
(physical or virtual), enabling fault-tolerant and
elastic distributed systems to easily be built and run
effectively.”
17. Mesos Features
Scalability: scales to 10,000s of nodes
Fault tolerance: replicated masters with leader election via ZooKeeper
Docker support: support for running tasks in Docker containers
Native containers: native isolation between tasks using Linux Containers
Scheduling: multi-resource scheduling (memory, CPU, disk, and ports)
API support: Java, Python, and C++ APIs for developing new parallel applications
Monitoring: Web UI for viewing cluster state
22. Docker Containerizer
Mesos supports launching tasks that contain Docker images.
Users can launch a Docker image either as a Task or as an Executor.
To enable the Docker Containerizer, run the mesos-agent with
"docker" listed among its containerizer options:
mesos-agent --containerizers=docker,mesos
24. Mesos Frameworks
Aurora: developed at Twitter and later migrated to the Apache Project. Aurora is a
framework that keeps services running across a shared pool of machines and is
responsible for keeping them running indefinitely.
Marathon: a container-orchestration framework for Mesos. Marathon helps run other
frameworks on Mesos, and can also run application containers such as Jetty, JBoss
Server, and Play Server.
Chronos: a fault-tolerant job scheduler for Mesos, developed at Airbnb as a
replacement for cron.
25. Resilient Distributed Datasets (RDDs)
A large collection of data which is:
- Immutable
- Distributed
- Lazily evaluated
- Type inferred
- Cacheable
(Figure: the Spark Stack)
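The properties above can be seen in a short sketch (a minimal example assuming a local Spark installation; all names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of RDD properties on a local master; not a production setup.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("RDDProps"))

val nums    = sc.parallelize(1 to 1000000)  // distributed across partitions
val squares = nums.map(n => n.toLong * n)   // immutable: map returns a NEW RDD;
                                            // type inferred as RDD[Long]
squares.cache()                             // cacheable: kept in memory after first computation
// Nothing has executed yet -- RDDs are lazily evaluated.
// An action such as count() triggers the actual computation:
val total = squares.count()
```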
26. Many big-data applications need to process large data streams in near-real time:
- Monitoring systems
- Alert systems
- Computing systems
Why Spark Streaming?
28. Framework for large-scale stream processing
➔ Created at UC Berkeley
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.
What is Spark Streaming?
29. Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
Spark Streaming
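A minimal sketch of this micro-batch model (the socket source, host/port, and 2-second batch interval are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchSketch")
// Chop the live stream into 2-second batches
val ssc = new StreamingContext(conf, Seconds(2))

// Each batch of lines arrives as an RDD; DStream operations
// are executed as RDD operations on every batch
val lines  = ssc.socketTextStream("localhost", 9999) // illustrative source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // processed results are emitted batch by batch

ssc.start()
ssc.awaitTermination()
```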
32. ● To use Mesos from Spark, you need a Spark binary package available in a
location accessible to Mesos (HTTP, S3, or HDFS), and a Spark driver program
configured to connect to Mesos.
● Configuring the driver program to connect to Mesos:
val sconf = new SparkConf()
.setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
.setAppName("MyStreamingApp")
.set("spark.executor.uri","hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
.set("spark.mesos.coarse", "true")
.set("spark.cores.max", "30")
.set("spark.executor.memory", "10g")
val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
...
Spark Streaming over a HA Mesos Cluster
33. Real-time stream processing systems must be operational 24/7, which
requires them to recover from all kinds of failures in the system.
● Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in
the cluster.
● In Streaming, driver failures can be recovered from by checkpointing application state.
● Write-Ahead Logs (WAL) and acknowledgements can ensure zero data loss.
Spark Streaming Fault-tolerance
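These recovery mechanisms are enabled roughly as follows (a sketch; the checkpoint directory and app name are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/checkpoints" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("FaultTolerantApp")
    // Write-Ahead Log: persist received data before processing it
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir) // periodically snapshot application state
  // ... define DStream transformations here ...
  ssc
}

// On driver restart, recover state from the checkpoint instead of starting fresh
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```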
36. ● Figure out the bottleneck: CPU, memory, IO, or network
● If parsing is involved, use a parser that gives high performance
● Proper data modeling
● Compression and serialization
Creating a scalable pipeline
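For the compression and serialization point, a common tuning sketch looks like this (the property values are starting points to experiment with, not prescriptions):

```scala
import org.apache.spark.SparkConf

// Sketch of serialization/compression tuning for a Spark pipeline.
val conf = new SparkConf()
  .setAppName("TunedPipeline")
  // Kryo is typically faster and more compact than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Compress serialized RDD partitions to trade CPU for IO/network
  .set("spark.rdd.compress", "true")
  .set("spark.io.compression.codec", "lz4")
```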