4. Hadoop Ecosystem
• HDFS: Hadoop Distributed File System is a distributed file system that can be installed on commodity servers. HDFS offers a way
to store large data files across numerous machines and is designed to be fault tolerant through its data replication feature.
• YARN: Yet Another Resource Negotiator, also known as MapReduce v2. It is a framework for job scheduling and cluster resource
management.
• Flume: a tool/service for collecting, aggregating, and moving large amounts of log data into and out of Hadoop.
• ZooKeeper: a framework that enables highly reliable distributed coordination of the nodes in a cluster.
• Sqoop: "SQL-to-Hadoop", a tool for efficient transfer between Hadoop and structured data sources, i.e. relational
databases, and other Hadoop data stores, e.g. Hive or HBase.
• Oozie: a workflow scheduler system to manage Hadoop jobs; the jobs may include non-MapReduce jobs.
• Pig: initially developed at Yahoo!, Pig is a framework consisting of a high-level scripting language (Pig Latin) along with a runtime
environment that allows users to run MapReduce on a Hadoop cluster.
• Mahout: Mahout is a scalable machine learning and data mining library.
• R Connectors: connectors that allow statistical analysis with R over data stored in the cluster.
• Hive: Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• HBase: a column-oriented, non-relational database management system that runs on top of HDFS.
• Ambari: this component of the Hadoop ecosystem is used for provisioning, managing, and monitoring the Hadoop cluster.
5. Hadoop HDFS
• Storage layer of the Hadoop cluster; data is replicated over multiple machines.
• Master/slave architecture (1 master, n slaves): NameNode / DataNodes.
• Designed for large files (blocks of 64 MB by default) and for streaming access to large volumes of data.
• A file being written is split into blocks. Each block is replicated 3 times by default and distributed across the cluster (node,
rack, data center), so the network topology is used for efficient placement and retrieval (see the sketch below).
• Blocks on failed nodes are automatically re-replicated to maintain failover.
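A minimal sketch of writing and reading a file through the HDFS Java FileSystem API; the path and the string content are illustrative assumptions, and a real cluster would pick up its settings from core-site.xml / hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // handle to the configured file system

        Path file = new Path("/tmp/example/notes.txt");      // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("block placement and replication are handled by the NameNode");
        }

        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());                // read the same content back
        }
    }
}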
7. Data Formats
• Data is extracted from Hadoop nodes by the Hadoop client.
• MapReduce needs data formats that are:
• Splittable
• Searchable via a dynamic schema
• Capable of a binary format
• Compressible
• Able to encode and decode
• Serializable and de-serializable
• Usable for RPC
• Language independent (neutral)
• Examples of such serialization frameworks: Apache Thrift, Google Protocol Buffers.
8. Avro
• Schemas are defined in JSON; data is encoded in a compact binary format.
• A data serialization framework whose files are compressible and splittable for MapReduce.
• Used for RPC communication between Hadoop nodes and from client programs to Hadoop services.
• Language-neutral data serialization system: write and read in languages such as C, C#, C++,
Java, JavaScript, PHP, Python, Ruby.
• Language-independent schema, so code generation is optional and the encoding is compact.
It has rich schema resolution capability.
• Data types: primitive types (null, boolean, int, long, float, double, bytes, and string) and
complex types (record, enum, array, map, union, and fixed).
• Java supports specific, generic, and reflect (slow) mappings of Java types to the Avro data types,
so pre-generated classes are optional.
• Avro supports in-memory serialization and deserialization.
• A data file has a header containing metadata (the Avro schema and a sync marker), followed by a series of
blocks containing the serialized Avro objects (see the sketch after this list).
• Ref: Avro Documentation
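As a hedged illustration of the schema-plus-binary-encoding idea above, the sketch below defines a small schema in JSON and writes one record to an Avro data file using the generic mapping; the schema fields and file name are assumptions for the example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroGenericWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a "User" record with two fields.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Generic mapping: no code generation needed, records are built by field name.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // The data file starts with a header (schema + sync marker), then blocks of records.
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("users.avro"));
            fileWriter.append(user);
        }
    }
}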
9. Parquet
• Columnar storage format for nested data; improves file size and query performance, and is well suited to nested data storage.
• ORCFile (Optimized Row Columnar file) is a related format used in the Hive project.
• In-memory data models can be used to read and write Parquet files (see the sketch below).
• Data types: primitive types (boolean, int32, int64, int96, float, double, binary,
fixed_len_byte_array) and logical types (UTF8, ENUM, DECIMAL, DATE, LIST, MAP).
• Parquet does not need a sync marker as Avro does, because block boundaries are stored in the footer
metadata.
• The nested structure is encoded with two integers per value: the definition level and the repetition level.
• A file has a header followed by blocks; each block holds a row group of column chunks, and each column
chunk contains many pages.
• Supported compression codecs: Snappy, gzip, LZO.
• The default block size is the same as the HDFS block size of 128 MB. The default page size is 1 MB, and the
page is the smallest unit of storage.
• Parquet files can be processed with Hive, Impala, and Pig.
• Ref: Parquet Documentation
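A minimal sketch of writing a Parquet file via the parquet-avro bridge, one common way to use an in-memory data model with Parquet; the schema, output path, and codec choice are assumptions, and newer Parquet versions prefer an OutputFile over a raw Path in the builder.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetAvroWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema reused from the Avro example.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Row groups, column chunks, and pages are laid out by the writer;
        // block boundaries end up in the footer metadata, so no sync marker is needed.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(user);
        }
    }
}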
10. Flume
• High-volume ingestion into Hadoop of event-based data, such as web/app server log
files and JMS messages. Flume agents are configured through properties (a sample configuration follows this list).
• A Flume agent runs sources and sinks connected via channels. Sinks include HDFS,
HBase, and Solr; each component has a type such as logger, file, or directory.
• Fan-out lets a source deliver to multiple channels (e.g. one file channel and
one memory channel) in the same agent.
• Load balancing can be achieved by having one agent's sink send to two downstream
Flume agents for processing.
• Source categories: Avro, Exec, HTTP, JMS, NetCat, sequence generator, spooling
directory, syslog, Thrift, Twitter. Sink categories: Avro, Elasticsearch, File Roll,
HBase, HDFS, IRC, logger, morphline (Solr), null, Thrift. Channels: file, JDBC,
memory. Interceptors: host, morphline, regex filtering, static, timestamp,
UUID.
• Ref: Flume Documentation
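A hedged sketch of the properties-based agent configuration mentioned above, wiring a spooling-directory source through a memory channel into an HDFS sink; the agent name, directories, and capacity are placeholder assumptions.

# hypothetical agent "a1": one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# spooling directory source: ingest files dropped into a local directory
a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/log/app/spool
a1.sources.r1.channels = c1

# memory channel buffering events between source and sink
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# HDFS sink writing events out to the cluster
a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel   = c1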
11. Sqoop
• Extracts data from structured data sources such as an RDBMS; MapReduce and Hive
are used under the hood.
• Sqoop is driven by CLI commands (see the example below); refer to the documentation
for the full command set.
• Ref: Sqoop Documentation
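A hedged example of the CLI usage referred to above: importing one relational table into HDFS, with MapReduce doing the parallel copy; the JDBC URL, credentials, table, and target directory are placeholder assumptions.

# import the "orders" table from a hypothetical MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4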
12. HBase
• A non-relational, distributed, column-oriented database.
• Integrates with MapReduce; offers a REST API and a Java client API, bulk
imports, block cache and Bloom filters for real-time queries, and
replication across clusters / backup options.
• A single table is partitioned into regions; regions are assigned to
region servers across the cluster.
• RDBMS concepts map as: entity (table), attributes (columns), relation (FK),
*-to-* (junction table), natural keys (vs. artificial IDs).
• Rows are split into regions; each region is hosted on one server. Writes go
to memory first and are flushed to local files (a drawback); reads merge rows
from memory and the flushed files; reads and writes are consistent at the row level.
• Table: the design space. Row: an atomic key/value container. Column: a key in the
K/V container inside a row, holding a value and a timestamp.
Column family: divides the columns into physical files.
• HBase is good for large datasets, sparse datasets, loosely coupled
(denormalized) records, and concurrent clients.
• NoSQL API: get, put, append, increment, scan, delete,
checkAndPut, checkAndMutate, checkAndDelete, batch (a small client sketch follows this list).
• Use cases: monitoring device logs.
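A minimal sketch of the put/get client calls listed above against a hypothetical "metrics" table with one column family "d"; connection settings are assumed to come from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();    // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Row key designed as device id + timestamp, a common monitoring-log pattern.
            byte[] rowKey = Bytes.toBytes("device42#2016-01-01T00:00");

            // put: write one cell into column family "d", qualifier "temp"
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            // get: read the row back and extract the same cell
            Result result = table.get(new Get(rowKey));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
            System.out.println(Bytes.toString(value));
        }
    }
}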
13. Crunch
• A high-level API for writing and testing complex MapReduce pipelines. It uses
a data model of arbitrary serializable types, which makes it good for non-tuple data such as images,
audio, and seismic data.
• A pipeline is composed of processing stages, i.e. a DAG.
• Pipeline implementations: MapReduce, in-memory, Spark.
• Input sources / output targets: Avro, Parquet, sequence files, HBase, HFiles, CSV,
JDBC, text.
• Three interfaces represent distributed datasets: PCollection<T>, PTable, PGroupedTable.
Works with Spark and Hadoop (see the word-count sketch below).
• It works with arbitrary objects and complex data types, supports an in-memory
execution engine, and follows a data-flow pattern.
• Ref: Crunch Documentation
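A hedged word-count-style sketch using the PCollection/PTable interfaces named above on the MapReduce pipeline implementation; the input/output paths are placeholders and the tokenization is deliberately simple.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

        // PCollection: a distributed, immutable collection of elements.
        PCollection<String> lines = pipeline.readTextFile("/data/in");     // hypothetical path

        // parallelDo: apply a DoFn to every element, emitting zero or more outputs.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // PTable: a distributed multimap; count() groups and sums per key.
        PTable<String, Long> counts = words.count();

        pipeline.writeTextFile(counts, "/data/out");                       // hypothetical path
        pipeline.done();                                                   // run the DAG
    }
}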
14. Spark
• In-memory data processing; an alternative to
MapReduce for certain applications.
• 10x (on disk) to 100x (in-memory) faster for
algorithms that repeatedly access the same data.
• Well suited to iterative machine learning and data
mining algorithms. APIs for Java, Scala, Python.
• Spark SQL (structured data processing), MLlib
(machine learning algorithms), GraphX (graph processing),
Spark Streaming (live data streams).
• A cluster manager is used (YARN, Mesos).
• Stage: each job is divided into smaller sets of tasks.
SparkContext: the connection to a Spark cluster, used to
create RDDs, accumulators, and broadcast variables
on that cluster.
• RDD (Resilient Distributed Dataset) is the core abstraction
in Spark: a fault-tolerant, immutable, partitioned
collection of elements operated on in parallel.
Operations include map, filter, and persist. Supported inputs
include text files, sequence files, and any Hadoop InputFormat (see the sketch below).
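A minimal Java sketch of the RDD operations mentioned above (textFile, filter, cache, count); the application name and input path are assumptions, and in a real job spark-submit supplies the master URL via the cluster manager.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkErrorCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("error-count");      // master set by spark-submit
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {         // SparkContext: connection to the cluster
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log"); // hypothetical input

            // Transformations are lazy; the RDD lineage makes the result fault tolerant.
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));
            errors.cache();                                              // keep partitions in memory for reuse

            long count = errors.count();                                 // action triggers the job
            System.out.println("error lines: " + count);
        }
    }
}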
16. ZooKeeper
• Distributed computing issues: network reliability, latency, bandwidth,
network security, topology changes, many administrators, heterogeneous networks,
transport cost.
• It allows distributed processes to coordinate through a hierarchical namespace of
data registers, stored in a file-system-like model.
• Services provided: naming, configuration, locking & synchronization, group services.
• A leader is elected on service startup.
• Configuration stores the data location, transaction log, cluster membership info,
and the myid file.
• CLI/API operations: create, delete, exists, set data, get data, get children, sync (see the client sketch below).
• Features: atomicity, notifications (watches), ordering, versioned writes, sequential nodes,
high availability, ephemeral nodes (tied to the session lifecycle, cannot have children).
• Ref: Apache Zookeeper
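A hedged sketch of the operations listed above using the plain ZooKeeper Java client: create an ephemeral node, read it back, and list the children of its parent. The connect string and paths are placeholder assumptions, and the /workers parent node is assumed to already exist.

import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkEphemeralNode {
    public static void main(String[] args) throws Exception {
        // Connect string and session timeout are assumptions for the example.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> { });

        // Ephemeral node: lives only as long as this session, cannot have children.
        zk.create("/workers/worker-1", "host1:9000".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        byte[] data = zk.getData("/workers/worker-1", false, null);   // get data
        List<String> members = zk.getChildren("/workers", false);     // get children
        System.out.println(new String(data) + " " + members);

        zk.close();                                                    // ephemeral node disappears with the session
    }
}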
17. Storm
• Real-time streaming, key to the Lambda architecture [batch and real-time data querying].
• Handles 1M+ messages per second per node. Fast, fault tolerant, scalable, parallel stream processing.
• Tuples [key/value, immutable] form streams. Spouts [analogous to Flume sources] emit tuples; bolts perform computation on tuples.
• Topology: a DAG of spouts and bolts, i.e. a streaming computation (see the sketch after this list).
• Topology groupings: shuffle, localOrShuffle, fields grouping.
• Architecture: uses Nimbus to generate/control tasks, plus ZooKeeper, supervisors, and workers.
• Trident: built on top of Storm for merges and joins, aggregation, grouping, functions, and filters. Stateful, for incremental
processing; a stream-oriented, micro-batch-oriented API.
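A hedged sketch of building and submitting a topology DAG as described above. ExampleSpout, SplitBolt, and CountBolt are hypothetical spout/bolt classes standing in for real implementations; TopologyBuilder, the grouping calls, Config, and StormSubmitter are the standard Storm API.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spouts emit tuples; bolts compute on them. The wiring below is the DAG.
        builder.setSpout("sentences", new ExampleSpout(), 2);            // hypothetical spout class
        builder.setBolt("split", new SplitBolt(), 4)                     // hypothetical bolt class
               .shuffleGrouping("sentences");                            // shuffle grouping
        builder.setBolt("count", new CountBolt(), 4)                     // hypothetical bolt class
               .fieldsGrouping("split", new Fields("word"));             // fields grouping by "word"

        Config conf = new Config();
        conf.setNumWorkers(2);                                           // worker processes for the topology
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}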
Fundamentals for data migration:
• Kappa Architecture – a simplification of Lambda without the batch processing layer.
• Lambda Architecture – an immutable sequence of data processed via both streaming and batch paths.
• CAP Theorem – states that a distributed database cannot guarantee consistency, availability, and partition tolerance at the same
time; hence the trade-offs made by NoSQL systems.
Ref: Big Data Analytics with Hadoop, Philippe Julio – http://www.slideshare.net/PhilippeJulio/hadoop-architecture/43-HIGH_AVAILABILITY_SOLUTIONS_NameNode_JobTracker
Ref: The Hadoop Ecosystem in a Nutshell – http://www.sagarjain.com/the-hadoop-ecosystem-in-a-nutshell/