In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Sandeep Deshmukh, Ashish Tadose

Project Status
Mentor List
Ted Dunning: Apache Member, MapR
Alan Gates: Apache Member, Hortonworks
Taylor Goetz: Apache Member, Hortonworks
Justin Mclean: Apache Member, Class Software
Chris Nauroth: Apache Member, Hortonworks
Hitesh Shah: Apache Member, Hortonworks
Apex In Apache Incubation Stage

Apache Apex (Incubating) Committer List
Open-sourced in July 2015
Over 50 committers already…
And growing….

Apex Platform Overview
Figure label: Enterprise Edition

Directed Acyclic Graph (DAG)
Application Programming Model
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations and emits one or more output streams
• Each Operator is your custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
Figure: operators connected by streams of tuples, with an output stream.
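The programming model above can be illustrated with a minimal sketch. It is written in plain Python for brevity (real Apex operators are Java), and every name in it (`Operator`, `Dag`, `add_operator`, `add_stream`, `run`) is hypothetical rather than part of the Apex API. It runs in a single thread and does not check that the graph is acyclic.

```python
from collections import defaultdict, deque


class Operator:
    """Hypothetical stand-in for an operator: consumes one tuple, emits zero or more."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # business logic: tuple -> iterable of output tuples

    def process(self, tup):
        return list(self.fn(tup))


class Dag:
    """Operators are the nodes; streams are the directed edges between them."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, name, fn):
        self.operators[name] = Operator(name, fn)
        return self

    def add_stream(self, source, sink):
        self.streams[source].append(sink)
        return self

    def run(self, entry, tuples):
        """Push tuples through the graph and collect what the leaf operators emit."""
        results = defaultdict(list)
        queue = deque((entry, t) for t in tuples)
        while queue:
            name, tup = queue.popleft()
            for out in self.operators[name].process(tup):
                sinks = self.streams.get(name, [])
                if not sinks:
                    results[name].append(out)  # no downstream stream: final output
                for sink in sinks:
                    queue.append((sink, out))
        return dict(results)


dag = (Dag()
       .add_operator("parse", lambda line: [int(line)])
       .add_operator("filter_even", lambda n: [n] if n % 2 == 0 else [])
       .add_operator("square", lambda n: [n * n])
       .add_stream("parse", "filter_even")
       .add_stream("filter_even", "square"))

output = dag.run("parse", ["1", "2", "3", "4"])
print(output)  # {'square': [4, 16]}
```

Keeping the business logic in small per-tuple functions is what lets a platform run many instances of the same operator in parallel; this sketch leaves that part out.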

Apex Component Overview
Figure: Apex components on a Hadoop cluster.
• Hadoop edge node: CLI, REST API and DT RTS Management Server
• Hadoop node: YARN container running the Apex App Master
• Hadoop nodes: YARN containers running Streaming Containers, each with threads (Thread1 to Thread-N) that host operators (Op1, Op2, Op3)
• Legend: Part of Community Edition

Apex Features …
• Native Hadoop Integration
• Partitioning and Scaling out
• Advanced Windowing Support
• Stateful Fault-tolerance
• Processing Semantics
• Compute Locality
• Dynamic updates

IMC Components in Apex
• Processing data in motion
• Preventing data loss – buffer server
• In-memory data stores for querying data

Why In-Memory Computing?
Figure: typical latencies.

In-memory computing will have a long-term, disruptive impact by
radically changing users' expectations, application design principles,
product architectures and vendor strategies.

RAM is the new disk, disk is the new tape.

In-memory computing is the future of computing. It offers massive
[benefits], not only in TCO reduction but across all four value dimensions:
performance, process innovation, simplification and flexibility.

In-Memory Data Grid (IMDG)
What are IMDGs?
• IMDGs host data in memory and distribute it across a cluster of commodity servers
• The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed querying and indexing
Why are they important?
• Performance – using RAM is faster than using disk
• Extremely high availability of data – by keeping it in memory across a highly distributed cluster
• Data structure – using a key/value store allows greater flexibility for the application developer; the store is an object store similar in interface to a typical concurrent hash map
• Scalable data partitioning
• Transactional ACID support
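To make the IMDG idea concrete, here is a small sketch of a key/value store whose entries are hash-partitioned across servers, with one backup copy per entry. It is a toy model in plain Python, not Geode's implementation: `DataGrid`, its methods and the server names are all hypothetical, and the placement rule (hash the key, then take the next servers in the list) is an assumption chosen for simplicity.

```python
import hashlib


class DataGrid:
    """Toy in-memory grid: each key lives on a primary node plus `redundancy` backups."""

    def __init__(self, node_names, redundancy=1):
        self.node_names = list(node_names)
        self.redundancy = redundancy
        self.nodes = {name: {} for name in self.node_names}  # one dict per live server

    def _owners(self, key):
        # Stable hash so that placement does not change between runs.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.node_names)
        count = min(self.redundancy + 1, len(self.node_names))
        return [self.node_names[(start + i) % len(self.node_names)] for i in range(count)]

    def put(self, key, value):
        for owner in self._owners(key):
            if owner in self.nodes:  # skip servers that have failed
                self.nodes[owner][key] = value

    def get(self, key):
        for owner in self._owners(key):
            if owner in self.nodes and key in self.nodes[owner]:
                return self.nodes[owner][key]
        return None

    def fail(self, name):
        """Simulate losing a server: its in-memory data disappears."""
        self.nodes.pop(name, None)


grid = DataGrid(["server-1", "server-2", "server-3"], redundancy=1)
for i in range(100):
    grid.put(f"order-{i}", {"id": i})

primary = grid._owners("order-7")[0]
grid.fail(primary)
print(grid.get("order-7"))  # {'id': 7}, served from the backup copy
```

The interface is deliberately close to a hash map (`put` and `get`). The sketch does not restore redundancy after a failure or rebalance data when servers join, both of which a production grid would need to handle.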

High Level Architecture - Geode

Geode Features
Core Features
• Linear scalability & latency-minimizing data distribution
• Performance optimized persistence - High availability & durability
• Configurable consistency - region types { partitioned, replicated & local }
• Distributed transactions
• Cluster resilience & failover
Advanced Features
• Server Function Execution - Send computation to data
• Asynchronous Events - Deliver events to a receiver without impacting the write path
• Continuous Queries & Client subscriptions - Useful for refreshing the client cache
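"Send computation to data" can be sketched as follows. This is a conceptual illustration in plain Python, not the Geode function execution API: `PartitionedRegion`, `fetch_all` and `execute` are hypothetical names, and each dict stands in for the partition held in memory by one server.

```python
class PartitionedRegion:
    def __init__(self, partitions):
        # Each partition stands in for the data held in memory by one server.
        self.partitions = partitions

    def fetch_all(self):
        """Data-to-compute: copy every entry to the caller."""
        moved = {}
        for part in self.partitions:
            moved.update(part)
        return moved

    def execute(self, fn, combine):
        """Compute-to-data: run fn next to each partition, return only the small results."""
        partials = [fn(part) for part in self.partitions]
        return combine(partials)


region = PartitionedRegion([
    {"a": 10, "b": 20},
    {"c": 30},
    {"d": 40, "e": 50},
])

total = region.execute(lambda part: sum(part.values()), sum)
print(total)  # 150
```

With `execute`, only one number per partition crosses the network in a real cluster, whereas `fetch_all` would move every entry before any work starts.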

Application Patterns
• Caching for speed and scale
– Read-through, Write-through, Write-behind
• Geode as the OLTP system of record
– Data in-memory for low latency, on disk for durability
• Parallel compute engine
• Real-time analytics
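The three caching modes named above differ only in when the backing store is touched. The sketch below shows that difference in plain Python; `CachingRegion` and its methods are hypothetical names, and a dict stands in for the database or other system behind the cache.

```python
class CachingRegion:
    def __init__(self, backing_store, write_behind=False):
        self.store = backing_store  # stand-in for a database or other system of record
        self.cache = {}
        self.write_behind = write_behind
        self.pending = []  # queued writes, used only in write-behind mode

    def get(self, key):
        # Read-through: on a miss, load from the backing store and keep the value.
        if key not in self.cache and key in self.store:
            self.cache[key] = self.store[key]
        return self.cache.get(key)

    def put(self, key, value):
        self.cache[key] = value
        if self.write_behind:
            self.pending.append((key, value))  # write-behind: defer the store update
        else:
            self.store[key] = value  # write-through: update the store now

    def flush(self):
        """Apply queued write-behind updates to the backing store, in order."""
        for key, value in self.pending:
            self.store[key] = value
        flushed = len(self.pending)
        self.pending.clear()
        return flushed


db = {"user:1": "alice"}
through = CachingRegion(db)
first_read = through.get("user:1")  # miss, then loaded from db
through.put("user:2", "bob")  # db is updated immediately

db2 = {}
behind = CachingRegion(db2, write_behind=True)
behind.put("user:3", "carol")
before_flush = "user:3" in db2  # False: the write is still queued
behind.flush()
after_flush = db2.get("user:3")  # "carol"
```

Write-behind keeps the store off the write path at the cost of a window in which the store lags the cache; in this sketch `flush` is called by hand, where a real system would drain the queue in the background.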

Geode Features
Geode reads with consistent latency and CPU
• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size
Figure: speedup, latency (ms) and CPU % plotted against the number of server hosts (2 to 10).

Geode 3.5-4.5X Faster Than Cassandra for YCSB

Roadmap
• HDFS persistence
• Off-heap storage
• Lucene indexes
• Spark integration
• Cloud Foundry service
…and other ideas from the Geode community!

Streaming meets In Memory Data Grid

Apex + Geode
Apex Operator check-pointing in Geode store
• Better latency for checkpoint operations than HDFS check-pointing
• Makes Apex DAG a complete in-memory pipeline
• https://issues.apache.org/jira/browse/APEXCORE-283
Write Apex data streams to Geode store
• Apex output operator implementation which writes data to Geode region
• Use cases
– Ingest streaming data in Geode for further processing
– Store Data processed by Apex pipeline in Geode store to serve user queries
• https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
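Both integration points can be pictured with a short sketch. It is plain Python with a dict standing in for a Geode region, so none of it reflects the actual code behind the two tickets above: `Region`, `RegionCheckpointStore`, `RegionOutputOperator` and their methods are hypothetical, and keying checkpoints by operator id and window id, as well as writing output once per window, are assumptions made for illustration.

```python
import pickle


class Region(dict):
    """Stand-in for a key/value region; a plain dict keeps the sketch self-contained."""


class RegionCheckpointStore:
    """Saves serialized operator state under the key (operator_id, window_id)."""

    def __init__(self, region):
        self.region = region

    def save(self, operator_id, window_id, state):
        self.region[(operator_id, window_id)] = pickle.dumps(state)

    def load(self, operator_id, window_id):
        return pickle.loads(self.region[(operator_id, window_id)])

    def latest_window(self, operator_id):
        windows = [w for (op, w) in self.region if op == operator_id]
        return max(windows) if windows else None


class RegionOutputOperator:
    """Buffers tuples during a window and writes them to the region at the window's end."""

    def __init__(self, region, key_fn):
        self.region = region
        self.key_fn = key_fn  # derives the region key from a tuple
        self.batch = {}

    def process(self, tup):
        self.batch[self.key_fn(tup)] = tup

    def end_window(self):
        self.region.update(self.batch)  # one bulk write per window
        self.batch = {}


checkpoints = RegionCheckpointStore(Region())
checkpoints.save(operator_id=1, window_id=10, state={"count": 3})
checkpoints.save(operator_id=1, window_id=20, state={"count": 7})
restored = checkpoints.load(1, checkpoints.latest_window(1))
print(restored)  # {'count': 7}

results = Region()
out = RegionOutputOperator(results, key_fn=lambda t: t["id"])
out.process({"id": "a", "total": 5})
out.process({"id": "b", "total": 9})
out.end_window()
print(results["b"]["total"])  # 9
```

Recovery in this model means loading the state saved for the latest checkpointed window, and queries against the region only ever see results from completed windows.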
