4. Hadoop Ecosystem
• HDFS: Hadoop Distributed File System is a distributed file system that can be installed on commodity servers. HDFS offers a way
to store large data files across numerous machines and is designed to be fault tolerant through its data replication feature.
• YARN: Yet Another Resource Negotiator, also known as MapReduce v2. It is a framework for job scheduling and cluster resource
management.
• Flume: a tool/service for collecting, aggregating, and moving large amounts of log data into and out of Hadoop.
• ZooKeeper: a framework that enables highly reliable distributed coordination of the nodes in a cluster.
• Sqoop: "SQL-to-Hadoop", a tool for efficient transfer between Hadoop and structured data sources, i.e. relational
databases, and other Hadoop data stores, e.g. Hive or HBase.
• Oozie: a workflow scheduler system to manage Hadoop jobs; the jobs may include non-MapReduce jobs.
• Pig: initially developed at Yahoo!, Pig is a framework consisting of a high-level scripting language (Pig Latin) along with a runtime
environment that allows users to run MapReduce on a Hadoop cluster.
• Mahout: Mahout is a scalable machine learning and data mining library.
• R Connectors: connectors that allow statistical analysis with R over data stored in the cluster.
• Hive: Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• HBase: a column-oriented, non-relational database management system that runs on top of HDFS.
• Ambari: this component of the Hadoop ecosystem is used for provisioning, managing, and monitoring the Hadoop cluster.
5. Hadoop HDFS
• Storage layer of the Hadoop cluster; data is replicated over multiple machines.
• Master/slave architecture (1 master, n slaves): NameNode / DataNodes.
• Designed for large files (blocks of 64 MB by default) and for streaming access to large volumes of data.
• A file being written is split into blocks. Each block is replicated 3 times by default and distributed across the cluster (node,
rack, data center), so the network topology is used for efficient placement and retrieval (see the sketch below).
• Blocks on failed nodes are automatically re-replicated to maintain failover.
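A minimal sketch of writing and reading a file through the HDFS Java FileSystem API; the path and the string content are illustrative assumptions, and a real cluster would pick up its settings from core-site.xml / hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // handle to the configured file system

        Path file = new Path("/tmp/example/notes.txt");      // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("block placement and replication are handled by the NameNode");
        }

        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());                // read the same content back
        }
    }
}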
7. Data Formats
• Data is extracted from Hadoop nodes by the Hadoop client.
• MapReduce needs data formats that are:
• Splittable
• Searchable via a dynamic schema
• Capable of a binary format
• Compressible
• Able to encode and decode
• Serializable and de-serializable
• Usable for RPC
• Language independent (neutral)
• Examples of such serialization frameworks: Apache Thrift, Google Protocol Buffers.
8. Avro
• Schemas are defined in JSON; data is encoded in a compact binary format.
• A data serialization framework whose files are compressible and splittable for MapReduce.
• Used for RPC communication between Hadoop nodes and from client programs to Hadoop services.
• Language-neutral data serialization system: write and read in languages such as C, C#, C++,
Java, JavaScript, PHP, Python, Ruby.
• Language-independent schema, so code generation is optional and the encoding is compact.
It has rich schema resolution capability.
• Data types: primitive types (null, boolean, int, long, float, double, bytes, and string) and
complex types (record, enum, array, map, union, and fixed).
• Java supports specific, generic, and reflect (slow) mappings of Java types to the Avro data types,
so pre-generated classes are optional.
• Avro supports in-memory serialization and deserialization.
• A data file has a header containing metadata (the Avro schema and a sync marker), followed by a series of
blocks containing the serialized Avro objects (see the sketch after this list).
• Ref: Avro Documentation
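As a hedged illustration of the schema-plus-binary-encoding idea above, the sketch below defines a small schema in JSON and writes one record to an Avro data file using the generic mapping; the schema fields and file name are assumptions for the example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroGenericWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a "User" record with two fields.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Generic mapping: no code generation needed, records are built by field name.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // The data file starts with a header (schema + sync marker), then blocks of records.
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("users.avro"));
            fileWriter.append(user);
        }
    }
}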
9. Parquet
• Columnar storage format for nested data; improves file size and query performance, and is well suited to nested data storage.
• ORCFile (Optimized Row Columnar file) is a related format used in the Hive project.
• In-memory data models can be used to read and write Parquet files (see the sketch below).
• Data types: primitive types (boolean, int32, int64, int96, float, double, binary,
fixed_len_byte_array) and logical types (UTF8, ENUM, DECIMAL, DATE, LIST, MAP).
• Parquet does not need a sync marker as Avro does, because block boundaries are stored in the footer
metadata.
• The nested structure is encoded with two integers per value: the definition level and the repetition level.
• A file has a header followed by blocks; each block holds a row group of column chunks, and each column
chunk contains many pages.
• Supported compression codecs: Snappy, gzip, LZO.
• The default block size is the same as the HDFS block size of 128 MB. The default page size is 1 MB, and the
page is the smallest unit of storage.
• Parquet files can be processed with Hive, Impala, and Pig.
• Ref: Parquet Documentation
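A minimal sketch of writing a Parquet file via the parquet-avro bridge, one common way to use an in-memory data model with Parquet; the schema, output path, and codec choice are assumptions, and newer Parquet versions prefer an OutputFile over a raw Path in the builder.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetAvroWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema reused from the Avro example.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Row groups, column chunks, and pages are laid out by the writer;
        // block boundaries end up in the footer metadata, so no sync marker is needed.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(user);
        }
    }
}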
10. Flume
• High-volume ingestion into Hadoop of event-based data, such as web/app server log
files and JMS messages. Flume agents are configured through properties (a sample configuration follows this list).
• A Flume agent runs sources and sinks connected via channels. Sinks include HDFS,
HBase, and Solr; each component has a type such as logger, file, or directory.
• Fan-out lets a source deliver to multiple channels (e.g. one file channel and
one memory channel) in the same agent.
• Load balancing can be achieved by having one agent's sink send to two downstream
Flume agents for processing.
• Source categories: Avro, Exec, HTTP, JMS, NetCat, sequence generator, spooling
directory, syslog, Thrift, Twitter. Sink categories: Avro, Elasticsearch, File Roll,
HBase, HDFS, IRC, logger, morphline (Solr), null, Thrift. Channels: file, JDBC,
memory. Interceptors: host, morphline, regex filtering, static, timestamp,
UUID.
• Ref: Flume Documentation
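A hedged sketch of the properties-based agent configuration mentioned above, wiring a spooling-directory source through a memory channel into an HDFS sink; the agent name, directories, and capacity are placeholder assumptions.

# hypothetical agent "a1": one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# spooling directory source: ingest files dropped into a local directory
a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/log/app/spool
a1.sources.r1.channels = c1

# memory channel buffering events between source and sink
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# HDFS sink writing events out to the cluster
a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel   = c1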
11. Sqoop
• Extracts data from structured data sources such as an RDBMS; MapReduce and Hive
are used under the hood.
• Sqoop is driven by CLI commands (see the example below); refer to the documentation
for the full command set.
• Ref: Sqoop Documentation
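A hedged example of the CLI usage referred to above: importing one relational table into HDFS, with MapReduce doing the parallel copy; the JDBC URL, credentials, table, and target directory are placeholder assumptions.

# import the "orders" table from a hypothetical MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4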
12. HBase
• A non-relational, distributed, column-oriented database.
• Integrates with MapReduce; offers a REST API and a Java client API, bulk
imports, block cache and Bloom filters for real-time queries, and
replication across clusters / backup options.
• A single table is partitioned into regions; regions are assigned to
region servers across the cluster.
• RDBMS concepts map as: entity (table), attributes (columns), relation (FK),
*-to-* (junction table), natural keys (vs. artificial IDs).
• Rows are split into regions; each region is hosted on one server. Writes go
to memory first and are flushed to local files (a drawback); reads merge rows
from memory and the flushed files; reads and writes are consistent at the row level.
• Table: the design space. Row: an atomic key/value container. Column: a key in the
K/V container inside a row, holding a value and a timestamp.
Column family: divides the columns into physical files.
• HBase is good for large datasets, sparse datasets, loosely coupled
(denormalized) records, and concurrent clients.
• NoSQL API: get, put, append, increment, scan, delete,
checkAndPut, checkAndMutate, checkAndDelete, batch (a small client sketch follows this list).
• Use cases: monitoring device logs.
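A minimal sketch of the put/get client calls listed above against a hypothetical "metrics" table with one column family "d"; connection settings are assumed to come from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();    // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Row key designed as device id + timestamp, a common monitoring-log pattern.
            byte[] rowKey = Bytes.toBytes("device42#2016-01-01T00:00");

            // put: write one cell into column family "d", qualifier "temp"
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            // get: read the row back and extract the same cell
            Result result = table.get(new Get(rowKey));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
            System.out.println(Bytes.toString(value));
        }
    }
}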
13. Crunch
• A high-level API for writing and testing complex MapReduce pipelines. It uses
a data model of arbitrary serializable types, which makes it good for non-tuple data such as images,
audio, and seismic data.
• A pipeline is composed of processing stages, i.e. a DAG.
• Pipeline implementations: MapReduce, in-memory, Spark.
• Input sources / output targets: Avro, Parquet, sequence files, HBase, HFiles, CSV,
JDBC, text.
• Three interfaces represent distributed datasets: PCollection<T>, PTable, PGroupedTable.
Works with Spark and Hadoop (see the word-count sketch below).
• It works with arbitrary objects and complex data types, supports an in-memory
execution engine, and follows a data-flow pattern.
• Ref: Crunch Documentation
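A hedged word-count-style sketch using the PCollection/PTable interfaces named above on the MapReduce pipeline implementation; the input/output paths are placeholders and the tokenization is deliberately simple.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

        // PCollection: a distributed, immutable collection of elements.
        PCollection<String> lines = pipeline.readTextFile("/data/in");     // hypothetical path

        // parallelDo: apply a DoFn to every element, emitting zero or more outputs.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // PTable: a distributed multimap; count() groups and sums per key.
        PTable<String, Long> counts = words.count();

        pipeline.writeTextFile(counts, "/data/out");                       // hypothetical path
        pipeline.done();                                                   // run the DAG
    }
}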
14. Spark
• In-memory data processing; an alternative to
MapReduce for certain applications.
• 10x (on disk) to 100x (in-memory) faster for
algorithms that repeatedly access the same data.
• Well suited to iterative machine learning and data
mining algorithms. APIs for Java, Scala, Python.
• Spark SQL (structured data processing), MLlib
(machine learning algorithms), GraphX (graph processing),
Spark Streaming (live data streams).
• A cluster manager is used (YARN, Mesos).
• Stage: each job is divided into smaller sets of tasks.
SparkContext: the connection to a Spark cluster, used to
create RDDs, accumulators, and broadcast variables
on that cluster.
• RDD (Resilient Distributed Dataset) is the core abstraction
in Spark: a fault-tolerant, immutable, partitioned
collection of elements operated on in parallel.
Operations include map, filter, and persist. Supported inputs
include text files, sequence files, and any Hadoop InputFormat (see the sketch below).
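A minimal Java sketch of the RDD operations mentioned above (textFile, filter, cache, count); the application name and input path are assumptions, and in a real job spark-submit supplies the master URL via the cluster manager.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkErrorCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("error-count");      // master set by spark-submit
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {         // SparkContext: connection to the cluster
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log"); // hypothetical input

            // Transformations are lazy; the RDD lineage makes the result fault tolerant.
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));
            errors.cache();                                              // keep partitions in memory for reuse

            long count = errors.count();                                 // action triggers the job
            System.out.println("error lines: " + count);
        }
    }
}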
16. ZooKeeper
• Distributed computing issues: network reliability, latency, bandwidth,
network security, topology changes, many administrators, heterogeneous networks,
transport cost.
• It allows distributed processes to coordinate through a hierarchical namespace of
data registers, stored in a file-system-like model.
• Services provided: naming, configuration, locking & synchronization, group services.
• A leader is elected on service startup.
• Configuration stores the data location, transaction log, cluster membership info,
and the myid file.
• CLI/API operations: create, delete, exists, set data, get data, get children, sync (see the client sketch below).
• Features: atomicity, notifications (watches), ordering, versioned writes, sequential nodes,
high availability, ephemeral nodes (tied to the session lifecycle, cannot have children).
• Ref: Apache Zookeeper
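A hedged sketch of the operations listed above using the plain ZooKeeper Java client: create an ephemeral node, read it back, and list the children of its parent. The connect string and paths are placeholder assumptions, and the /workers parent node is assumed to already exist.

import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkEphemeralNode {
    public static void main(String[] args) throws Exception {
        // Connect string and session timeout are assumptions for the example.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> { });

        // Ephemeral node: lives only as long as this session, cannot have children.
        zk.create("/workers/worker-1", "host1:9000".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        byte[] data = zk.getData("/workers/worker-1", false, null);   // get data
        List<String> members = zk.getChildren("/workers", false);     // get children
        System.out.println(new String(data) + " " + members);

        zk.close();                                                    // ephemeral node disappears with the session
    }
}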
17. Storm
• Real-time streaming, key to the Lambda architecture [batch and real-time data querying].
• Handles 1M+ messages per second per node. Fast, fault tolerant, scalable, parallel stream processing.
• Tuples [key/value, immutable] form streams. Spouts [analogous to Flume sources] emit tuples; bolts perform computation on tuples.
• Topology: a DAG of spouts and bolts, i.e. a streaming computation (see the sketch after this list).
• Topology groupings: shuffle, localOrShuffle, fields grouping.
• Architecture: uses Nimbus to generate/control tasks, plus ZooKeeper, supervisors, and workers.
• Trident: built on top of Storm for merges and joins, aggregation, grouping, functions, and filters. Stateful, for incremental
processing; a stream-oriented, micro-batch-oriented API.
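A hedged sketch of building and submitting a topology DAG as described above. ExampleSpout, SplitBolt, and CountBolt are hypothetical spout/bolt classes standing in for real implementations; TopologyBuilder, the grouping calls, Config, and StormSubmitter are the standard Storm API.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spouts emit tuples; bolts compute on them. The wiring below is the DAG.
        builder.setSpout("sentences", new ExampleSpout(), 2);            // hypothetical spout class
        builder.setBolt("split", new SplitBolt(), 4)                     // hypothetical bolt class
               .shuffleGrouping("sentences");                            // shuffle grouping
        builder.setBolt("count", new CountBolt(), 4)                     // hypothetical bolt class
               .fieldsGrouping("split", new Fields("word"));             // fields grouping by "word"

        Config conf = new Config();
        conf.setNumWorkers(2);                                           // worker processes for the topology
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}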
Fundamentals for data migration:
• Kappa Architecture – a simplification of Lambda without the batch processing layer.
• Lambda Architecture – an immutable sequence of data processed via both streaming and batch paths.
• CAP Theorem – states that a distributed database cannot guarantee consistency, availability, and partition tolerance at the same
time; hence the trade-offs made by NoSQL systems.
Ref: Big Data Analytics with Hadoop, Philippe Julio – http://www.slideshare.net/PhilippeJulio/hadoop-architecture/43-HIGH_AVAILABILITY_SOLUTIONS_NameNode_JobTracker
Ref: The Hadoop Ecosystem in a Nutshell – http://www.sagarjain.com/the-hadoop-ecosystem-in-a-nutshell/