"Hadoop Ecosystem and Low Latency Streaming Architecture" was presented by Vijay Mandava and Lan Jiang to the Detroit Java User Group on 3/23/2015. It gives a basic introduction to the Hadoop ecosystem and then focuses on low latency streaming architecture, including frameworks such as Flume, Kafka and Storm.
1. Hadoop Ecosystem and Low Latency Streaming Architecture
InSemble Inc.
http://www.insemble.com
2. Agenda
• What is Big Data and why is it relevant?
• Hadoop Ecosystem
• Reference Architecture for Low Latency Streaming
• Flume, Kafka and Storm
• Demo
3. Big Data Definitions
• Wikipedia defines it as “Data Sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process
data within a tolerable elapsed time”
• Gartner defines it as Data with the following characteristics
– High Velocity
– High Variety
– High Volume
• Another definition is “Big Data is large-volume, unstructured data
which cannot be handled by traditional database management systems”
4. Why a game changer
• Schema on Read
– Interpreting data at processing time
– Keys and values are not intrinsic properties of the data; they are
chosen by the person analyzing the data
• Move code to data
– With traditional systems, we bring data to code and I/O becomes a
bottleneck
– With distributed systems, we have to deal with our own
checkpointing/recovery
• More data beats better algorithms
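The "Schema on Read" bullet can be illustrated with a minimal sketch. The sample events, the field names and the helper `group_by` are hypothetical and exist only for illustration; the point is that the stored lines stay raw and each analysis picks its own key and value when it reads them.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
# (Sample data and field names are hypothetical.)
RAW_EVENTS = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart", "ms": 340}',
    '{"user": "alice", "page": "/cart", "ms": 200}',
]

def group_by(raw_lines, key_field, value_field):
    """Interpret the raw lines at read time, using a caller-chosen key and value."""
    grouped = {}
    for line in raw_lines:
        record = json.loads(line)  # the schema is applied here, at processing time
        grouped.setdefault(record[key_field], []).append(record[value_field])
    return grouped

# Two analyses choose different keys over the same stored bytes.
by_user = group_by(RAW_EVENTS, "user", "ms")
by_page = group_by(RAW_EVENTS, "page", "ms")
```

Neither grouping required changing how the events were stored, which is why the access pattern matters less up front.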
5. Enterprise Relevance
• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture
7. Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
8. Big Data Journey
➢ Real time Insight from all channels
➢ IT is key differentiator for your business
➢ Perfect alignment of Business and IT
➢ Ad Hoc Data Exploration
➢ Batch, Interactive, Real time use cases
➢ Predictive Analytics, Machine Learning
➢ Consolidated Analytics
➢ ETL
➢ Time Constraints
➢ Security standards defined
➢ Governance Standards Defined
➢ Integrated with the Enterprise
➢ Evaluate Business Benefits
➢ Understand Ecosystem
➢ Identify Platform
Stages: Aware of Benefits, Execute, Expand, Managed, Optimized
Stage activities: Scout for Opportunities, Pilot project, Multiple Use cases, Governance Model, Core competency
Figure axes: Journey Over Time, Business Value
Effect labels: GOOD, GREAT
10. Flume Architecture
• Distributed system for
collecting and aggregating
data from multiple sources to
a centralized data store
• Agent is a JVM that hosts
the Flume components
• Channel stores each
message until it is picked up by a
sink
• Different types of Flume
sources
• Source and Sink are
decoupled
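The source, channel and sink roles above can be sketched in plain Python. This is not Flume code or the Flume API: `Channel`, `run_source` and `run_sink` are hypothetical names, and the sketch only shows how a buffering channel lets the source and sink run independently.

```python
from collections import deque

class Channel:
    """Buffers events until a sink takes them (hypothetical stand-in for a Flume channel)."""
    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None

    def __len__(self):
        return len(self._events)

def run_source(lines, channel):
    """Source side: wrap incoming lines as events and put them on the channel."""
    for line in lines:
        channel.put({"body": line})

def run_sink(channel, store, batch_size):
    """Sink side: drain up to batch_size events into the destination store."""
    delivered = 0
    while delivered < batch_size:
        event = channel.take()
        if event is None:
            break
        store.append(event["body"])
        delivered += 1
    return delivered

channel = Channel()
central_store = []
run_source(["log line 1", "log line 2", "log line 3"], channel)
first = run_sink(channel, central_store, batch_size=2)
# One event is still buffered in the channel until the sink runs again.
second = run_sink(channel, central_store, batch_size=2)
```

The source never calls the sink directly, so either side can be slower or temporarily stopped without losing events held in the channel.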
13. Kafka Introduction
• Messaging System which is distributed, partitioned and replicated
• Kafka brokers run as a cluster
• Producers and Consumers can be written in any language
14. Topic
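• Each partition is an ordered, immutable sequence of messages, each identified by a sequence number
• Retains messages for a configurable period of time
• The “offset” (a consumer's position in the partition) is controlled by the consumer
• Each partition is replicated and has one “leader” and 0 or more “followers”.
Reads and writes are done only on the leader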
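A minimal sketch of the offset idea, assuming a single in-memory partition. `PartitionLog` and `Consumer` are hypothetical classes, not the Kafka client API, and the sketch leaves out retention and replication entirely.

```python
class PartitionLog:
    """Append-only log for one partition; each message gets the next offset."""
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to this message

    def read(self, offset, max_messages):
        return self._messages[offset:offset + max_messages]

class Consumer:
    """Tracks its own offset; the log keeps no per-consumer state."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=2):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset  # rewinding re-reads messages that are still retained

log = PartitionLog()
for text in ["m0", "m1", "m2"]:
    log.append(text)

consumer = Consumer(log)
first = consumer.poll()     # reads m0 and m1
second = consumer.poll()    # reads m2
consumer.seek(1)
replayed = consumer.poll()  # reads m1 and m2 again
```

Because reading does not remove messages, a consumer can replay from any retained offset, and several consumers can read the same partition at different positions.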
15. Producers and Consumers
• Producer controls which partition each message goes to
• Supports both Queuing and Pub/Sub
– Abstraction called Consumer group
• Ordering within Partition
– Ordering is preserved for a subscriber only when it is the sole subscriber
(within its consumer group) reading that partition
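The partitioning and consumer-group bullets can be sketched as follows. The functions `choose_partition` and `assign_partitions`, the CRC32 hash and the round-robin assignment are assumptions chosen for illustration; Kafka's own partitioner and group assignment may differ.

```python
import zlib

def choose_partition(key, num_partitions):
    """Producer side: map a message key to a partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer of a group, round-robin."""
    assignment = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment

NUM_PARTITIONS = 4

# Messages with the same key always map to the same partition, so their order is kept.
same_partition = (
    choose_partition("user-42", NUM_PARTITIONS)
    == choose_partition("user-42", NUM_PARTITIONS)
)

# Queuing: one group with two consumers splits the partitions between them.
queue_style = assign_partitions(NUM_PARTITIONS, ["c1", "c2"])

# Pub/sub: two groups with one consumer each; every group sees all partitions.
pubsub_style = {
    group: assign_partitions(NUM_PARTITIONS, members)
    for group, members in {"billing": ["b1"], "audit": ["a1"]}.items()
}
```

Within a group each partition has a single reader, which is what keeps per-partition ordering intact for that group.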
16. Storm Introduction
• Distributed real time computational system
–Process unbounded streams of data
–Can use multiple programming languages
–Scalable, fault-tolerant and guarantees that data will be processed
• Use Cases
–Real time analytics, online machine learning
–Continuous Computation
–Distributed RPC
–ETL
• Concepts
–Topology
–Spouts
–Bolts
17. Concepts
• Storm Cluster
– Master node(Nimbus)
• Distributing code
• Assigns tasks to machines
• Monitors for failures
– Worker nodes(Supervisor)
• Starts/stops worker processes
• Each worker process executes subset of a topology
– Zookeeper
• Coordinates between Nimbus and Supervisors
• Nimbus and Supervisors are completely stateless
• State maintained by Zookeeper or local disks
18. Details
• Stream
– Unbounded sequence of tuples
• Spout(write logic)
– Source of stream. Emits tuples
• Bolt(write logic)
– Processes streams and emits tuples
• Topology
– DAG of spouts and bolts
– Submit a topology to a Storm cluster
– Each node runs in parallel, and the degree of parallelism is configurable per node
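The spout, bolt and topology concepts can be sketched with Python generators. This is not the Storm API: the function names and the word-count example are hypothetical, the stream is finite, and the stages run sequentially in one process rather than in parallel across a cluster.

```python
from collections import Counter

def sentence_spout():
    """Spout: source of the stream. A real spout is unbounded; this one is finite."""
    for sentence in ["storm processes streams", "kafka feeds storm"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

# Topology: spout -> split bolt -> count bolt, a small DAG with a single path.
emitted = list(count_bolt(split_bolt(sentence_spout())))
final_counts = dict(emitted)  # later tuples overwrite earlier counts for a word
```

Each bolt only sees tuples and emits tuples, so stages can be rearranged or scaled without changing their internal logic.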
19. Stream groupings
• Tells a topology how to send tuples between two components
• Since tasks are executed in parallel, the grouping determines which task each
tuple is sent to
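Two groupings that Storm provides are shuffle grouping and fields grouping. The sketch below imitates them in plain Python; the function names, the CRC32 hash and the round-robin substitute for random shuffling are assumptions made for illustration, not Storm's implementation.

```python
import itertools
import zlib

def fields_grouping(tuple_value, field_index, num_tasks):
    """Route by the value of one field, so equal values reach the same task."""
    field = str(tuple_value[field_index]).encode("utf-8")
    return zlib.crc32(field) % num_tasks

def make_shuffle_grouping(num_tasks):
    """Spread tuples evenly over tasks; round-robin stands in for a random shuffle."""
    counter = itertools.count()
    return lambda tuple_value: next(counter) % num_tasks

NUM_TASKS = 3
words = [("storm",), ("kafka",), ("storm",), ("flume",), ("storm",)]

shuffle = make_shuffle_grouping(NUM_TASKS)
shuffle_targets = [shuffle(w) for w in words]
fields_targets = [fields_grouping(w, 0, NUM_TASKS) for w in words]
```

With the fields-style routing every ("storm",) tuple lands on the same task, which a per-word counter needs; the shuffle-style routing balances load but splits equal words across tasks.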
22. Demo - Twitter TopN Trending Topic
• Use Flume Twitter Source to ingest data and
publish events to a Kafka topic
• Use Kafka as messaging backbone
• Use Storm as a real-time event processing
system to calculate the TopN trending topics
• Use Redis to store the TopN result
• Use Node.js/JQuery for visualization
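The demo's code is not included in the slides. As a rough sketch of the TopN step only, assuming topics are identified by hashtags, the counting could look like the following; the sample tweets and the functions `extract_hashtags` and `top_n` are hypothetical, and the sketch omits Flume, Kafka, Storm and Redis.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def extract_hashtags(tweet_text):
    """Return the hashtags in a tweet, lower-cased so #Storm and #storm match."""
    return [tag.lower() for tag in HASHTAG.findall(tweet_text)]

def top_n(tweets, n):
    """Count hashtags across tweets and return the n most frequent as (tag, count)."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_hashtags(text))
    return counts.most_common(n)

# Hypothetical sample input standing in for the live Twitter stream.
SAMPLE_TWEETS = [
    "Learning #Storm and #Kafka today",
    "#kafka as a messaging backbone",
    "Streaming with #storm #kafka #flume",
]

trending = top_n(SAMPLE_TWEETS, 2)
```

In a streaming setting the same count would be kept incrementally, typically over a time window, with the current TopN written to the result store after each update.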