7. These Three Trends
• A shift to scalable, elastic computing
infrastructure.
• An explosion in the complexity and variety of
data available.
• The power and value that come from
combining disparate data for comprehensive
analysis.
8. What is Hadoop?
• A file store, HDFS (Hadoop Distributed File
System)
• A distributed processing system:
– 1.0: MapReduce
– 2.0: YARN (a distributed operating system)
• Processing moves to the data, not data to the processing
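As a rough illustration of the MapReduce model named above, the following single-process Python sketch runs the map, shuffle, and reduce phases of a word count. The function names and sample input are invented for illustration. It uses no Hadoop API, and a real job would run each phase in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, two "records"
    print(word_count(["big data big cluster", "data moves"]))
```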
10. HDFS
• Designed to store very large data sets reliably
across the machines of a cluster, and to stream
those data sets at high bandwidth to distributed
computations
• HDFS Comics
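One way to picture how HDFS stores a large file reliably is a sketch, under stated assumptions, of splitting a file into blocks and copying each block to several nodes. The 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults and are not stated on the slide. The round-robin placement and the node names are simplifications invented here, since real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed: Hadoop 2.x default block size
REPLICATION = 3                 # assumed: default HDFS replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block a file of file_size would occupy."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration; not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"]))
```

With every block held on three nodes, losing any single node still leaves two copies of each block available to readers.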
12. YARN
• A cluster management technology
• YARN combines a central resource manager
that reconciles the way applications use
Hadoop system resources with node
manager agents that monitor the processing
operations of individual cluster nodes
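The division of labour between the central resource manager and the per-node agents can be sketched as a toy model. The class and method names below echo YARN's terminology but are hypothetical and are not YARN's API. The "most free memory first" rule is an assumption made for illustration, not one of YARN's actual schedulers.

```python
class NodeManager:
    """Per-node agent that tracks how much memory its node has free."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch(self, memory_mb):
        """Reserve memory for one container on this node."""
        self.free_mb -= memory_mb


class ResourceManager:
    """Central scheduler that places container requests on nodes."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app, memory_mb):
        """Place one container on the node with the most free memory."""
        candidate = max(self.node_managers, key=lambda nm: nm.free_mb)
        if candidate.free_mb < memory_mb:
            return None  # cluster is full; a real scheduler would queue this
        candidate.launch(memory_mb)
        return (app, candidate.name, memory_mb)


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("a", 4096), NodeManager("b", 2048)])
    print(rm.allocate("app1", 3072))
    print(rm.allocate("app2", 2048))
    print(rm.allocate("app3", 2048))
```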
15. Spark
• Spark offers an integrated framework for
advanced analytics, including a machine
learning library (MLlib), a graph engine
(GraphX), a streaming analytics engine (Spark
Streaming) and a fast interactive query tool
(Shark)
17. Flume
• A distributed, reliable, and available service for
efficiently collecting, aggregating, and moving
large amounts of streaming data into the Hadoop
Distributed File System (HDFS)
• It has a simple and flexible architecture based on
streaming data flows, and it is robust and fault
tolerant, with tunable reliability mechanisms for
failover and recovery
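The streaming data flow and its recovery behaviour can be sketched as a buffer that sits between a source and a sink. This is a minimal model, assuming that events leave the buffer only after the sink accepts them. `MemoryChannel` and `drain` are illustrative names, not Flume's API, and the sinks in the usage example are stand-ins for an HDFS writer.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        """Called by the source; refuses events once the buffer is full."""
        if len(self.events) >= self.capacity:
            raise OverflowError("channel full")
        self.events.append(event)


def drain(channel, sink, batch_size):
    """Deliver up to batch_size events; keep them queued if the sink fails."""
    count = min(batch_size, len(channel.events))
    batch = [channel.events[i] for i in range(count)]
    try:
        sink(batch)
    except OSError:
        return 0  # rollback: nothing removed, so the batch is retried later
    for _ in batch:
        channel.events.popleft()  # commit: remove only after delivery
    return len(batch)


if __name__ == "__main__":
    channel = MemoryChannel(capacity=10)
    for line in ["e1", "e2", "e3"]:
        channel.put(line)

    def broken_sink(batch):
        raise OSError("HDFS unavailable")  # simulated outage

    delivered = []
    print(drain(channel, broken_sink, 2))       # nothing delivered
    print(drain(channel, delivered.extend, 2))  # retry succeeds
    print(delivered, list(channel.events))
```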
18. Sqoop
• A tool designed for efficiently transferring bulk
data between Hadoop and structured
datastores such as relational databases