Architectural Road Map for
Hadoop Ecosystem
Implementation
Sudhir Nallagangu
Content Index
• Industry Definitions, History and Capabilities
• Hadoop EcoSystem Architecture/Components
• Road Map for Implementation
• Architecture Decision points
• Use cases and Data Sciences
• Questions?
Industry Definitions, History and Capabilities
Definition - BigData is a broad, generic term for data sets so large that traditional data
processing (RDBMS) techniques and infrastructure are not adequate. Two core challenges are
faced:
• Storage - To store petabytes of structured, semi-structured and unstructured data that
grows by terabytes daily.
• Processing - To process, analyze and visualize that data in a reasonable amount of time.
History - Google, faced with the task of saving, indexing and searching billions of web pages,
turned to distributed storage and processing innovations. They published their techniques
as white papers - the Google File System (GFS, 2003) for the storage challenge and
MapReduce (2004) for the processing challenge.
Inspired by the above, an open source development team created Hadoop core (HDFS
and MapReduce) in 2006. This provided techniques to manage BigData using
commodity hardware (no supercomputers) in a regular enterprise datacenter setting.
• HDFS - A distributed file storage system that solves ever-growing storage needs with
horizontally scalable infrastructure.
• MapReduce - A distributed “divide and conquer” processing technique that leverages
“data locality”. Data locality means execution/processing happens at the place where the
data resides, which is more efficient than transferring data over the network.
• The core system and services are designed to operate, survive and recover when hardware
and software failures occur as a regular occurrence.
Industry Definitions, History and Capabilities
Capabilities
Use Cases
Data Warehouse modernization - Identified as the early adoption use case. Integrate big data and data
warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of
analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining
what data should be moved to the data warehouse.
Big Data Exploration - Addresses the challenge that every large organization faces: information is stored in many
different systems and silos. Enables you to explore and mine big data to find, visualize, and understand all your
data and improve decision making. By creating a unified view of information across all data sources, you gain
enhanced value and new insights, and can make important decisions such as pricing. Helps arrive at Master Data
Management.
Customer Behavior analysis, Retention, Segmentation - With access to consumer behavior data, companies can
learn what prompts users to stick around and learn about customer characteristics to improve marketing efforts. Data
Science led algorithms help build recommendation systems.
Cyber Security Intelligence - Data Science led intrusion, fraud and anomaly detection systems - Lower risk, detect
fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis
platforms with big data technologies to process and analyze new types of data.
Industry specific - Every industry has invested in BigData to address challenges specific to that industry.
Healthcare, for instance, uses patient data to improve patient outcomes; agriculture uses it to boost crop yields. One
interesting case involves the UN using cell phone data to track the spread of malaria across Africa.
Sources of Data
• Traditional data sources - RDBMS and transactional data, feeds from current data warehouses.
• User and machine-generated content - eMail, help desk calls, social media, web and software logs.
• Internet of Things - cameras, information-sensing mobile devices, aerial sensory technologies, and
genomics.
Hadoop EcoSystem Components
Early Hadoop supported the core services of storage (HDFS) and processing (MapReduce).
• All interfaces to HDFS & MR were through low-level Java APIs; there were no higher-level abstractions.
• There was no robust security model built in.
• Early adopters were heavy tech corporations like Yahoo, Facebook, LinkedIn …
The Hadoop EcoSystem evolved to meet traditional enterprise needs (and is still evolving). The additions include, but are not limited to:
• Higher-level abstractions (Pig, Hive, Cascading)
• Data ingestion tools (DistCp, Sqoop, Flume)
• Random real-time read/write access for large datasets (HBase)
• Distributed coordination services (ZooKeeper)
• Workflow engines (Oozie, Falcon)
• Security (Apache Ranger, Knox)
• Improved resource management (YARN, now part of Hadoop core)
• In-memory processing (Spark)
• Machine learning (Mahout)
• Provisioning and monitoring services
Architecture and Design Goal
• As the Hadoop ecosystem matured rather rapidly, enterprises now have the task of effectively integrating these
several tools into complete solutions.
• A rich ecosystem of tools, APIs, and development options provides choice and flexibility, but can make it
challenging to determine the best choices to implement a solution.
• Enterprises need to understand the role of each tool and sub-component; architects need to ask the right
questions, pick the right tools, and make the right decisions for the implementation.
Hadoop EcoSystem Components..
(Picture Courtesy HortonWorks HDP)
Hadoop EcoSystem Core
HDFS • Scalable, fault-tolerant, distributed storage system across many servers. An HDFS cluster is
comprised of a NameNode (master) that manages the cluster metadata and
DataNodes (workers) that store the actual data.
• Key features
• Provides all standard file system commands available on POSIX systems
• Rack Awareness - allows a node’s physical location to be considered when allocating
storage and scheduling tasks
• Replication - provides a default replication factor of 3 so data survives the failure of a
particular DataNode. It detects errors and automatically creates a separate copy if required
• Minimal data motion. MapReduce moves compute processes to the data on HDFS and
not the other way around. Processing tasks can occur on the physical node where the
data resides. This significantly reduces network traffic, thereby improving
overall latency/performance.
• Utilities diagnose the health of the file system and can rebalance the data across
nodes (a small API sketch follows below)
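For orientation, here is a minimal, hypothetical sketch of talking to HDFS through the Java FileSystem API; the NameNode address and paths are made-up examples, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // normally picks up core-site.xml / hdfs-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");  // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; the NameNode records metadata, DataNodes store the blocks
        Path file = new Path("/data/landing/events.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if present
            out.writeUTF("first record\n");
        }

        // List the directory - a pure metadata operation answered by the NameNode
        for (FileStatus status : fs.listStatus(new Path("/data/landing"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
        fs.close();
    }
}

The same listing is available from the standard shell interface, for example hdfs dfs -ls /data/landing.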
MapReduce
• The core divide-and-conquer distributed processing engine. Earlier versions also carried
resource management, but that responsibility has since moved to YARN.
• Includes JobTracker (master) and TaskTracker (worker) components to run batch
jobs.
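As a concrete illustration of the model, below is a minimal word-count sketch against the MapReduce Java API (org.apache.hadoop.mapreduce); the class names are illustrative and the input/output paths are supplied as arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: runs where the data blocks live, emitting (word, 1) pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) ctx.write(new Text(token), ONE);
            }
        }
    }

    // Reduce phase: receives all counts for a given word and sums them
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, a job like this is typically launched along the lines of hadoop jar wordcount.jar WordCount <input> <output>.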
YARN • Foundation of the new generation of the Hadoop core operating system. Provides resource
management and a central operating platform. Addresses limitations of the MapReduce 1.0
JobTracker service in the areas of scalability and support for non-MapReduce programs
• Key features
• Multi-tenancy - Allows multiple access engines (either open-source or proprietary) to
use Hadoop as the common standard for batch, interactive and real-time engines that
can simultaneously access the same data set.
• Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization
over more static MapReduce rules used in early versions of Hadoop
• YARN’s original purpose was to split up major responsibilities of the JobTracker/
TaskTracker into separate entities: a global ResourceManager, a per-application
ApplicationMaster, a per-node slave NodeManager, a per-application Container
running on a NodeManager.
(Diagram: HDFS and MapReduce cluster layout)
Hadoop EcoSystem Abstraction services
•One criticism of MapReduce is that the default development cycle in Java is very long. Writing the mappers and
reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is time consuming. Also
those who are not Java programmers cannot really make good use of Hadoop/HDFS.
•PIG and Hive come to the rescue for analysts and DBAs, and CASCADING for Java programmers. Under the covers, all
three translate their abstractions into a series of MapReduce programs.
PIG • Originated at Yahoo. Pig’s scripting language, Pig Latin, can process petabytes of data with just a few
statements. Recommended use cases are ETL data pipelines, research on raw data and
iterative processing.
• Key features
• Available operators include LOAD, FILTER, GROUP BY, FOREACH, MAX, ORDER, LIMIT, UNION,
CROSS, SPLIT, CUBE, ROLLUP.
• Ability to create User Defined Functions in Java and other languages that can be called
within Pig scripts (a small script sketch follows below).
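To make that concrete, here is an illustrative Pig Latin sketch (the file paths and field names are invented): load raw web logs, keep server errors, and count them per page.

-- load tab-separated web logs from HDFS
logs    = LOAD '/data/raw/weblogs' USING PigStorage('\t')
              AS (ts:chararray, url:chararray, status:int);
errors  = FILTER logs BY status >= 500;          -- keep only server errors
byUrl   = GROUP errors BY url;
counts  = FOREACH byUrl GENERATE group AS url, COUNT(errors) AS hits;
sorted  = ORDER counts BY hits DESC;
top10   = LIMIT sorted 10;
STORE top10 INTO '/data/out/top_error_pages';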
Hive/HQL
• Originated at Facebook, Hive provides a SQL-like abstraction called HQL for data analysts
with a SQL background.
• Key features:
• Under the covers, Hive provides a SQL/schema-like structure over HDFS data by using a
Metastore. The Metastore, which presents “HDFS data” as SQL tables, is typically backed by
MySQL; MySQL does not store the data itself, just the structure of the data stored in HDFS.
• Provides the majority of ANSI SQL-like statements for processing, including EXPLAIN and
ANALYZE statements.
• Provides Partitioning, Indexing, Bucketing features to manage performance.
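A small, hypothetical HQL sketch of this workflow (table, columns and location are made up): define a schema over files already sitting in HDFS, then query it with familiar SQL.

-- External table: Hive records only metadata; the files stay where they are in HDFS
CREATE EXTERNAL TABLE web_logs (
  ts     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Hive compiles this into MapReduce (or Tez) jobs behind the scenes
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE status >= 500
GROUP BY url
ORDER BY hits DESC
LIMIT 10;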
Cascading •PIG is great for early adopter use cases, ad hoc queries, and less complex applications.
Cascading is great for enterprise data workflows and is designed for “scale, complexity and
stability” beyond PIG.
• With Cascading, you can package your entire MapReduce application, including its orchestration
and testing, within a single JAR file
• Processing Model is based on “data pipes”, filters/operations, data taps(sources and sinks).
Hadoop EcoSystem Data Ingestion tools
Sqoop • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache
Hadoop and structured data stores such as relational databases.
• Mostly batch driven; a simple use case would be an organization that runs a nightly Sqoop import to
load the day's data from a production DB into HDFS for Hive/HQL analysis.
• Provides Import, Export commands and runs the process in parallel with data coming in and out of
HDFS
• Parallelizes data transfer for fast performance and optimal system utilization
• Copies data quickly from external systems to Hadoop
• Makes data analysis more efficient
• Mitigates excessive loads to external systems.
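A hypothetical nightly import of that kind might look like the command below; the JDBC URL, credentials file, table and target directory are examples only.

sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/raw/orders/2015-06-01 \
  --num-mappers 4   # four parallel map tasks pull slices of the table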
Flume • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data into HDFS (not out of HDFS)
• Mostly real time and event driven; a common use case is collecting log data from one system - a
bank of web servers' log files - and aggregating it in HDFS for later analysis.
• Stream data from multiple sources into Hadoop for analysis
• Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at
which data can be written to the destination
• Guarantee data delivery
• Scale horizontally to handle additional data volume
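As a sketch of that web-server use case (agent, source and path names invented), a single-agent Flume configuration could look roughly like the properties below: an exec source tailing a log, a memory channel, and an HDFS sink.

# source, channel and sink wiring for one agent named "agent"
agent.sources  = weblog
agent.channels = mem
agent.sinks    = hdfssink

# tail the web server access log
agent.sources.weblog.type     = exec
agent.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

# buffer events in memory between source and sink
agent.channels.mem.type     = memory
agent.channels.mem.capacity = 10000

# land events in HDFS as plain text
agent.sinks.hdfssink.type          = hdfs
agent.sinks.hdfssink.hdfs.path     = hdfs://namenode-host:8020/data/raw/weblogs
agent.sinks.hdfssink.hdfs.fileType = DataStream
agent.sinks.hdfssink.channel       = mem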
Distcp DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to
effect its distribution, error handling and recovery, and reporting
Hadoop EcoSystem Real time Read/Write Storage
HBase HDFS is a distributed file system that is well suited for the storage of large files. HDFS is a general
purpose file system and does not provide fast individual record lookups in files. HBase, on the other
hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This
can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed
"StoreFiles" that exist on HDFS for high-speed lookups.
Key features:
Modelled after Google BigTable, cells are versioned, column oriented (vs RDBMS row orientation)
and rows are sorted(for lookups), map oriented and distributed across servers.
Depends on “ZooKeeper” for the distributed coordination of the cluster. There is no one single master server,
which provides additional flexibility.
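For a feel of the random read/write model, here is a minimal sketch against the HBase 1.x Java client API; the table, column family and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml (ZooKeeper quorum, etc.)
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Random write: row key "cust#1001", column family "profile", qualifier "email"
            Put put = new Put(Bytes.toBytes("cust#1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("jane@example.com"));
            table.put(put);

            // Random read: fast single-row lookup by key, served from indexed StoreFiles
            Result row = table.get(new Get(Bytes.toBytes("cust#1001")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));
        }
    }
}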
Hadoop EcoSystem Workflow tools
Oozie • Oozie is a workflow scheduling engine specialized in running multi-stage jobs in the Hadoop
ecosystem. It has the ability to monitor, track, recover from errors, and maintain dependencies between jobs.
• Workflows are expressed as XML and no programming language is required.
• 3 Types of jobs:
• Oozie Workflow jobs - jobs run on demand, expressed as Directed Acyclic Graphs (DAGs).
• Oozie Coordinator jobs - workflow jobs run periodically on a regular basis, triggered by
time and data availability.
• Oozie Bundle jobs - provide a way to package multiple coordinator and workflow jobs.
• Deployment Model: An Oozie application comprises one file defining the logic of the application
plus other files such as scripts, configuration, and JAR files. A Workflow application consists of a
workflow.xml file and may have configuration files, Pig scripts, Hive scripts, JAR files, etc.
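An illustrative workflow.xml sketch with a single Pig action is shown below; the workflow name, referenced script and schema version are examples only, not taken from the slides.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="daily-etl">
    <start to="clean-logs"/>
    <action name="clean-logs">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean_logs.pig</script>   <!-- hypothetical Pig script shipped with the app -->
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>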
Falcon • Simplifies “data management” (data jobs) for Hadoop and is used mainly as part of data ingestion.
Falcon is a feed processing and feed management system aimed at making it easier for end
consumers to onboard their data onto Hadoop clusters.
• Provides the key services that data processing applications need, so sophisticated data
lifecycle management (DLM) can easily be added to Hadoop applications.
• Services include data processing (using PIG scripts, for example), retry logic, late-arrival
data handling, replication, retention, records audit, governance and metrics.
• Provides integration with metastore/catalog(HCatalog).
• Faster development and higher quality for ETL, reporting and other data processing apps on
Hadoop.
Hadoop EcoSystem Real time Read/Write Storage..
Accumulo Originally developed by the NSA (National Security Agency), Accumulo provides “cell level security access” to
a BigTable-modeled database. Due to its origins in the intelligence community, Accumulo provides
extremely fast access to data in massive tables, while also controlling access to its billions of rows
and millions of columns down to the individual cell.
Key features:
• Group columns within a single file
• Automatic tablet splitting and rebalancing
• Merge tablets and clone tables
• Uses Zookeeper just like HBase
Hadoop EcoSystem Analytic processing tools
Storm • Storm provides the ability to process large amounts of “real time” data (vs. the batch orientation of Hadoop).
• A typical use case would be processing Twitter feeds or online web log errors that need
immediate action.
• Features:
• The input stream of data is managed by a “spout” (like a water spout), which passes data to a “bolt”
that transforms (processes) the data or passes it to another bolt. Scale up by increasing the
number of “bolts” and “spouts”.
• Storm has 2 kinds of nodes - a master (which runs the Nimbus daemon) and worker nodes (which run
Supervisors). Cluster coordination is handled by ZooKeeper, so nodes can fail and restart. Fast
computation/data transfers are done with “ZeroMQ” messaging rather than plain TCP/IP sockets.
• Provides Java, Scala and Python language interfaces
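A rough topology-wiring sketch in Java is shown below; TweetSpout, ParseBolt and AlertBolt are hypothetical placeholder classes, and the backtype.storm packages match the older Storm releases this deck describes (newer releases renamed them to org.apache.storm).

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class TweetTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout(), 1);      // hypothetical spout reading a feed
        builder.setBolt("parse", new ParseBolt(), 4)          // hypothetical bolt, 4 parallel instances
               .shuffleGrouping("tweets");                    // tuples distributed randomly across instances
        builder.setBolt("alert", new AlertBolt(), 2)          // hypothetical bolt raising alerts
               .fieldsGrouping("parse", new Fields("severity"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        new LocalCluster().submitTopology("tweet-alerts", conf, builder.createTopology());
    }
}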
Spark • An in-memory compute engine for machine learning and data science projects. Typical MapReduce
processing shuttles data between hard disks at each stage, but Spark does all data processing/
saving in memory (to a great extent), providing roughly ten-fold performance improvements over disk-based MapReduce.
• Use cases - speeding up batch analysis jobs, iterative machine learning jobs, If you are
interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great
option (although memory requirements must be considered)
• Spark provides the RDD (Resilient Distributed Dataset) mechanism for in-memory data storage
and processing primitives that are applied across the whole data set
• Spark powers a stack of high-level tools including Spark SQL, MLlib for machine
learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly
in the same application.
• Spark can run on Hadoop 2's YARN cluster manager, and can read existing Hadoop
data. Spark SQL can be built and configured to read and write data stored in Hive
• Provides Java, Scala and Python language interfaces
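A minimal sketch with the Spark Java API is shown below (the HDFS path is invented, and the lambda syntax assumes Java 8): an RDD is loaded once, cached in memory, and reused by two computations without re-reading the disk.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-analysis"); // submit with --master yarn to run on YARN
        JavaSparkContext sc = new JavaSparkContext(conf);

        // RDD backed by HDFS data, pinned in memory after the first read
        JavaRDD<String> logs = sc.textFile("hdfs:///data/raw/weblogs").cache();

        long total  = logs.count();                                        // first pass materializes the cache
        long errors = logs.filter(line -> line.contains(" 500 ")).count(); // second pass reuses cached data

        System.out.println(errors + " error lines out of " + total);
        sc.stop();
    }
}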
Mahout Suite of machine learning libraries designed to be scalable and robust. Provides Java Interface.
Supports four main Data Science Use cases:
• Collaborative filtering – mines user behavior and makes product recommendations (e.g.
Amazon recommendations)
• Clustering – takes items in a particular class (such as web pages or newspaper articles) and
organizes them into naturally occurring groups
• Classification – learns from existing categorizations and then assigns unclassified items to the
best category
• Frequent item set mining – analyzes items in a group (e.g. items in a shopping cart or terms in
a query session)
Hadoop EcoSystem Security
Architects and designers of early Hadoop were initially focused on HDFS and MapReduce at a massive scale
with no “security model”. The reason was that most of the workflow (data ingestion, storage, processing, analytics)
was internal to an enterprise. In recent times, as Hadoop and similar systems have gone mainstream with interfaces
extending outside the enterprise, robust security models have been added to the Hadoop architecture.
HDP Security / Ranger
• Hadoop Inbuilt Advanced Security and Authorization
• Centralized Security Administration - HDP provides a console for managing security policies and
access controls from one place.
• Authentication - Kerberos-based authentication, which can generate delegation tokens for time-
bound authentication (like Disney tickets that are valid for certain days), since Kerberos servers will
not be able to scale to tens of thousands of jobs/data requests. Kerberos has the additional
benefit of never sending passwords over the wire. It can integrate with corporate LDAP if required.
• Authorization and Audit - Prevents rogue jobs being executed by insiders. Fine-grained
authorization via file permissions in HDFS, resource-level access control for YARN and MapReduce
and coarser-grained access control at a service level.
Knox • Apache Knox Gateway (“Knox”) provides perimeter security for Hadoop ecoSystem.
• Provides security to all of Hadoop’s REST & HTTP services
• Support for REST APIs for Apache Ambari, Apache Falcon and Apache Ranger.
Hadoop EcoSystem Operational Tools
Ambari Cluster Administration, Provisioning and Monitoring tool
Zookeeper Distributed coordination service used by HBase and Accumulo
Road Map for Implementation
Stage 1: Development Tools
• Most Hadoop sub-components can be independently installed and configured, but it is highly
recommended to go with sandbox versions from a vendor such as HortonWorks or Cloudera.
• Use Linux (or similar) machines for developers, as the Hadoop EcoSystem is designed around Linux and it is usually the only
platform supported in production.
• Sandbox development environments are also provided as Virtual Machine packages that can run on Windows
machines for development. Port forwarding and sharing between host and guest development machines can meet
most development needs.
• Programming IDEs for Java (Eclipse and IntelliJ) can continue with the current setup.
• If it is a large team, use Vagrant and develop a VM with all required software components.
Stage 3: Pre-Prod components setup and testing
• Execute a POC demo of all components. This need not be a use case but more a technology demonstration.
• Complete one life cycle of development, deployment and testing of POC application with all components.
• Install IDS/IPS products in line with the “Data Ingestion” path for data coming from external sources (Twitter feeds etc.) so
fictitious data cannot be injected into processing.
Stage 2: Cluster Setup - Hardware and Network
•Choose optimal hardware as per vendor recommendations for the critical NameNode and ResourceManager/JobTracker
as well as the rest of the worker nodes. Install and configure a multi-node cluster (24 or 32 nodes) as a proof of concept.
•Isolate the Hadoop ecosystem in separate subnets away from web and other traditional infrastructure. Separate “master
nodes, worker nodes, client nodes and management nodes”, where master nodes and worker nodes take input only
from management nodes.
•Keep client nodes on edge nodes (akin to a DMZ) with end user access limited to only those. Users should never get
access to any Hadoop nodes except the edge/gateway nodes.
•Work with the network architect to finalize the topology design: a traditional two-switch rooted architecture (using
expensive chassis switches) OR the BigData-preferred spine-and-leaf model (using inexpensive scale-out fixed
switches).
•Define rack awareness (the Hadoop topology.script configuration) for all servers so the NameNode can make the right
placement decisions; a sketch of such a script follows below.
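A hedged sketch of such a rack-topology script (the subnets and rack names are invented): Hadoop, pointed at the script through the net.topology.script.file.name property in core-site.xml on recent releases, calls it with a batch of host names or IP addresses and expects one rack path per argument on stdout.

#!/bin/bash
# Map each requested host/IP to a rack path; unknown hosts fall back to the default rack.
for host in "$@"; do
  case "$host" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done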
Road Map for Implementation
Stage 4: Early adoption use case
• Identify the Enterprise Data Warehouse or BI team's long-running jobs and queries.
• Formulate a use case or copy an existing use case to mimic above functionality.
• Design and code Sqoop and PIG scripts to load data to HDFS
• Design HDFS and Hive Schema before data load
• Replicate existing functionality using Hadoop ecosystem components.
• Adopt architectural standards for data format, compression (defined in next section)
Stage 5: Mature use case like MDM
• Provide data exploration and analysis to identify, locate and load internal and external data files into the
landing zone
• Clean the data as per business requirements
• Prepare - Design and develop Big Data structures and tables to load the data files
• Analyze - provide productivity tools for data access and analysis.
• Implement enterprise data transformation tools like Cascading if required
Stage 6: Data Science driven Analytic use case.
• Work with the business to identify a Data Science driven use case
• Set up the BigData ecosystem and environment
• Partner with the Data Science team to design the prediction model.
Architectural and Design Decisions
Data Format, Compression and Storage system
FileFormat
• SequenceFiles - Recommended for smaller files (below the 64MB default HDFS block size)
• Avro - Recommended for use cases with ever-evolving schemas, since Avro files have a self-describing format
• RC and ORCFile formats - Columnar formats, efficient for most analytic cases where only a few columns of data are
read.
Compression - It is critical that the chosen approach supports the splittable format required for MapReduce
• Snappy - Efficient compression but does not support splits
• LZO - Slow performance but supports splits
• Bzip2 - Split support but less efficient. Use only if storage is a serious concern
Schema Design - HDFS or HBase, with the latter providing random read/write support
Standard directory structure and standard organization of data
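One hypothetical way these choices combine in practice (table and column names invented): a columnar ORC table with Snappy compression applied inside the ORC container (where splittability is handled by the file format itself), partitioned by date for analytic reads.

CREATE TABLE web_logs_orc (
  ts     STRING,
  url    STRING,
  status INT
)
PARTITIONED BY (dt STRING)                    -- partition pruning keeps scans small
STORED AS ORC                                 -- columnar layout: read only the needed columns
TBLPROPERTIES ("orc.compress" = "SNAPPY");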
Data Movement
• Network - Ensure appropriate bandwidth requirements are met
• Sqoop for data ingestion
• Use single hop from source to HDFS where possible
• Use Database Specific Connectors Whenever Available
• Goldilocks Method of Sqoop Performance Tuning
• Loading Many Tables in Parallel with Fair Scheduler Throttling
• Ask the following questions and design to minimize network traffic
• Timeliness of data ingestion
• Incremental updates
• Data transformation - does the data need to be transformed in flight
• Source system structure, layout and proximity
Data Processing
• MapReduce low level Java API - Use only if high flexibility and control of MapReduce is required
• Spark and Storm are clear choices for most enterprise-level processing
• Use abstractions like Pig and HQL for early adoption or for data analyst usage. Leverage modern performance
improvements like “Tez”, which performs better than MapReduce for Pig and HQL (see the example below).
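For example, on distributions that ship Tez, Hive can be switched to it per session (recent Pig releases accept a comparable -x tez execution mode); the query shown is a placeholder.

SET hive.execution.engine=tez;   -- run subsequent HQL on Tez instead of MapReduce
SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url;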
Architectural and Design Decisions
Patterns
Use patterns for commonly occurring problems
• Removing duplicate keys
• Windowing analysis - helps in arriving at the peaks and lows of, say, an equity
• Updating time series data - One common way of storing this information is as versions in HBase. HBase has a way to store
every change you make to a record as versions. Versions are defined at a column level and ordered by modification
timestamp, so you can go to any point in time to see what the record looked like.
Graph Processing
Giraph - Recommended for mature cases of Graph processing but only for expert graph programmers
Spark GraphX - Recommended as it is available within the single Spark stack, which is set to replace MapReduce.
WorkFlows
• While most data pipelines and processing can be driven manually with scripts, an enterprise should leverage Oozie and Falcon.
They provide robust mechanisms for orchestration.
Programming languages
• Hadoop is developed in Java, and Java is the core language choice for all interfaces like Spark, Storm and MapReduce
(abstractions like Pig and HQL are excluded).
• However, as Hadoop matures, Python is becoming a language of choice. If development teams are already
familiar with Java, then continue with it.
Unit Testing
• All languages and abstractions for Hadoop offer unit testing (including Pig), and it should be used just as in any enterprise
application.
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
DataWorks Summit
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
Cloudera, Inc.
 
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonUsing Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
MapR Technologies
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
Mark Kromer
 

Was ist angesagt? (20)

PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
 
Hadoop
HadoopHadoop
Hadoop
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
 
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonUsing Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 

Ähnlich wie Hadoop - Architectural road map for Hadoop Ecosystem

Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 

Ähnlich wie Hadoop - Architectural road map for Hadoop Ecosystem (20)

Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
paper
paperpaper
paper
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
 
Big Data
Big DataBig Data
Big Data
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Anju
AnjuAnju
Anju
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Big Data
Big DataBig Data
Big Data
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 

Kürzlich hochgeladen

Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 

Kürzlich hochgeladen (20)

Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

Hadoop - Architectural road map for Hadoop Ecosystem

  • 1. Architectural Road Map for Hadoop Ecosystem Implementation Sudhir Nallagangu
  • 2. Content Index • Industry Definitions, History and Capabilities • Hadoop EcoSystem Architecture/Components • Road Map for Implementation • Architecture Decision points • Use cases and Data Sciences • Questions?
  • 3. Industry Definitions, History and Capabilities Definition - BigData is a broad generic term for large data sets that traditional data processing(RDBMS) techniques and infrastructure are not adequate. Two core challenges faced: • Storage - To store PetaBytes of structured, semistructured and unstructured data that grows by TeraBytes daily. • Processing - To process, analyze, visualize in reasonable amount of time. History - Google faced with task of saving, indexing and searching billions of web pages turned to distributed storage and processing innovations. They published their techniques as white papers - Google File System(GFS 2003) for storage challenge and MapReduce(2004) for processing challenge. Inspired by above, a team of open source development team created Hadoop core (HDFS and MapReduce) in 2006. This provided techniques to manage BigData by using commodity hardware (no supercomputers) in a regular enterprise datacenter setting. • HDFS - A distributed file storage system to solve ever growing storage needs by horizontal scalable infrastructure. • MapReduce - A distributed “divide and conquer” processing technique that leverages “data locality”. Data locality execution means execution/processing at the place where data resides(efficient over transferring data over network). • The core system and services designed to operate , survive and recover in hardware and software failure scenarios as regular occurrence.
  • 4. Industry Definitions, History and Capabilities Capabilities Use Cases Data Ware House modernization - Identified as the early adoption use case. Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Big Data Exploration - Addresses the challenge that every large organization faces: information is stored in many different systems and silos. Enables you to explore and mine big data to find, visualize, and understand all your data to improve decision making by creating a unified view of information across all data sources you gain enhanced value and new insights. Make important decisions such as pricing. Helps arrive at Master Data Management. Customer Behavior analysis, Retention, Segmentation - With access to consumer behavior data, companies can learn what prompts user to stick around , learn about customer characteristics to improve marketing efforts. Data Science led algorithms help building recommendation systems. Cyber Security Intelligence - Data Science led Intrusion, fraud, anomaly detection systems - Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types. Industry specific - Every industry have invested in BigData to help specific challenges in those industries. Healthcare for instances uses patient data to improve patient outcomes. agriculture to boost crop yields. One interesting case involves UN using cell phone data to track malaria spread in African subcontinent. Sources of Data • Traditional data sources - RDBMS and transactional data, feed from current Data ware houses. • User and machine-generated content - eMail, help desk calls, social media, web and software logs. • Internet of Things - cameras, information-sensing mobile devices, aerial sensory technologies, and genomics.
  • 5. Hadoop EcoSystem Components Early Hadoop supported core services of Storage (HDFS) and Processing (MapReduce). • All Interfaces to HDFS & MR is through low level Java API. There were no higher level abstractions. • There was no robust Security model built around. • Early adopters were heavy tech corporations like Yahoo, Facebook, linkedIn … Hadoop EcoSystem evolved to meet traditional enterprise needs (still evolving). They include but not limited to: • Higher level abstractions (Pig, Hive, Cascade) • Data Ingestion tools (Distcp, Sqoop, Flume) • Random real time read/write access for large datasets(HBase) • Distribution coordination services(Zookeeper) • workflow Engines (Oozie, Falcon) • Security (Apache Ranger, Knox) • Improved Resource management (YARN. This is now part of Hadoop core). • In-Memory processing (Spark) • Machine learning (Mahout) • Provisioning and Monitoring services Architecture and Design Goal • As Hadoop ecosystem matured rather rapidly, enterprises have the task of effectively integrating these several tools into complete solutions. • A rich ecosystem of tools, APIs, and development options provide choice and flexibility, but can make it challenging to determine the best choices to implement a solution. • Enterprises need to understand the role of each tool. sub-component, architects need to ask right questions, pick right tools, make right decisions for the implementation
  • 6. Hadoop EcoSystem Components.. (Picture Courtesy HortonWorks HDP)
  • 7. Hadoop EcoSystem Core HDFS • Scalable, fault-tolerant, distributed storage system across many servers.HDFS cluster is comprised of a NameNode(Master) that manages the cluster metadata and DataNodes(Worker) that store the actual data. • Key features • Provides all standard file system commands available on POSIX systems • Rack Awareness - allows consideration of a node’s physical location, when allocating storage and scheduling tasks • Replication - provides default replication of 3 in case of a particular datanode failure. It detects errors and automatically creates a separate copy if required • Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network traffic and so there improving overall latency/performance. • Utilities diagnose the health of the files system and can rebalance the data on different nodes Map Reduce • The core of Divide-and-conquer distributed processing Engine. Earlier versions carried resource management but now those are moved to YARN. • Includes JobTracker(Master) and TaskTracker(worker) components to run batch version of jobs. YARN • Foundation of new generation of Hadoop core operating system. Provides resource management and a central operating platform. Addresses limitations of MapReduce 1.0 Jobtracker service in areas of scalability and support for non MapReduce programs • Key features • Multi-tenancy - Allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. • Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization over more static MapReduce rules used in early versions of Hadoop • YARN’s original purpose was to split up major responsibilities of the JobTracker/ TaskTracker into separate entities: a global ResourceManager, a per-application ApplicationMaster, a per-node slave NodeManager, a per-application Container running on a NodeManager.
  • 10. Hadoop EcoSystem Abstraction services •One criticism of MapReduce is that the default development cycle in Java is very long. Writing the mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is time consuming. Also those who are not Java programmers cannot really make good use of Hadoop/HDFS. •PIG and Hive comes to rescue for Analysts and DBA’s. CASCADING for java programmers. Under the covers all the three apply abstractions into a series of MapReduce programs. PIG • Originated at Yahoo. Pig scriptable language “Latin” can process petabytes of data with just few statements. Recommended use case(s) are ETL data pipelines, research on raw data and iterative processing. • Key features • Available functions LOAD, FILTER, GROUP BY, FOREACH, MAX , ORDER, LIMIT, UNION , CROSS , SPLIT, CLUBE, ROLLUP. • Ability to create User Defined functions in Java and other languages that can be called with in Pig scripts. Hive/ HQL • Originated at Facebook, Hive/HQL provides SQL like abstraction called HQL for data analysts with SQL background. • Key features: • Hive under the covers provides a SQL/schema like structure to HDFS data by using a Metastore. Metastore which visualizes “hdfs data” as sql tables is supported by MySQL. MySQL does not store data but just the structure to data stored in HDFS. • Provides majority of ANSI SQL alike statements for processing including Explain, Analyze statements. • Provides Partitioning, Indexing, Bucketing features to manage performance. Cascade •PIG is great for early adopter use cases, ad hoc queries, and less complex applications. Cascading is great for Enterprise data workflows and is designed for “scale, complexity and stability” over PIG. • With Cascading, you can package your entire MapReduce application, including its orchestration and testing, within a single JAR file • Processing Model is based on “data pipes”, filters/operations, data taps(sources and sinks).
  • 11. Hadoop EcoSystem Data Ingestion tools Sqoop • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. • Mostly Batch driven, simple use case will be an organization that runs a nightly sqoop import to load the day's data from a production DB into a HDFS for Hive/HSQL analysis. • Provides Import, Export commands and runs the process in parallel with data coming in and out of HDFS • Parallelizes data transfer for fast performance and optimal system utilization • Copies data quickly from external systems to Hadoop • Makes data analysis more efficient • Mitigates excessive loads to external systems. Flume • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS (not out of HDFS) • Mostly Real time, even driven and a common use case is collecting log data from one system- a bank of web servers log files (aggregating it in HDFS for later analysis). • Stream data from multiple sources into Hadoop for analysis • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination • Guarantee data delivery • Scale horizontally to handle additional data volume Distcp DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting Hadoop EcoSystem Real time Read/Write Storage HBase HDFS is a distributed file system that is well suited for the storage of large files. HDFS is a general purpose file system and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. Key features: Modelled after Google BigTable, cells are versioned, column oriented (vs RDBMS row orientation) and rows are sorted(for lookups), map oriented and distributed across servers. Depends on “ZooKeeper” for distribution services of a cluster. There is no one single master server which provides that additional flexibility.
  • 12. Hadoop EcoSystem Workflow tools Oozie • Ozzie is a workflow scheduling engine specialized in running multi-stage jobs in Hadoop eco- system. It has ability to monitor, track, recover from errors, maintain dependency of jobs. • Workflows are expressed as XML and no development language are required. • 3 Types of jobs: • Oozie workflow jobs - Jobs running on demand. They are Directed Acyclical Graphs (DAGs), Oozie Coordinator jobs - Workflow Jobs running periodically on a regular basis.Triggered by time and data availability. • Oozie Bundle provides a way to package multiple coordinator and workflow jobs. • Deployment Model: An Oozie application comprises one file defining the logic of the application plus other files such as scripts, configuration, and JAR files. A Workflow application consists of a workflow.xml file and may have configuration files, Pig scripts, Hive scripts, JAR files, etc. Falcon • Simplifies “data management(data jobs)” for Hadoop and used mainly as part of “Data Ingestion”. Falcon is a feed processing and feed management system aimed at making it easier for end consumers to onboard their data onto Hadoop clusters. • Provides the key services data processing applications need so Sophisticated DLM can easily be added to Hadoop applications. • Services include Data processing (using PIG scripts for example), retry logic , late arrival data handling, Replication, Retention, Records Audit, Governance, metrics. • Provides integration with metastore/catalog(HCatalog). • Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop. Hadoop EcoSystem Real time Read/Write Storage.. Accumulo Originally developed by NSA(National Security Agency), this provides “cell level security access” to a BigTable modeled database.Due to its origins in the intelligence community, Accumulo provides extremely fast access to data in massive tables, while also controlling access to its billions of rows and millions of columns down to the individual cell. Key features: • Group columns within a single file • Automatic tablet splitting and rebalancing • Merge tablets and clone tables • Uses Zookeeper just like HBase
  • 14. Hadoop EcoSystem Analytic processing tools Storm • Storm provides ability to process of large amounts of “real time” data(vs batch version of Hadoop). • Typical use case would to be say process twitter feeds, online web log errors that needs immediate actions. • Features: • Input stream of data is managed by “spout” (like a water spout) which passes data to “bolt” which transforms (processes) data or pass it another bolt. Ability to scale up by increasing number of “bolts” and “spouts” • Storm have 2 kinds of nodes - master (which runs daemon Nimbus) and worker nodes(that runs supervisor) . All clusters are managed by ZooKeeper and they can fail/restart. Also fast computation/data transfers are done by “ZeroMQ” which is faster than TCP/IP. • Provides Java, Scala and Python language interfaces Spark • An In-memory compute for Machine learning and Data science projects. Typical MapReduce processing involves data transferring between “hard disks” but Spark does all data processing/ saving in memory(to great extent) thus providing 10 fold performance improvement over Storm. • Use cases - speeding up batch analysis jobs, iterative machine learning jobs, If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered) • Spark provides RDD (resilient Distributed Data) mechanism for in-Memory data storage and processing primitives which get applied on whole data • Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application. • Spark can run on Hadoop 2's YARN cluster manager, and can read existing Hadoop data.SparkSQL can be built and configured to read and write data stored in Hive • Provides Java, Scala and Python language interfaces Mahout Suite of machine learning libraries designed to be scalable and robust. Provides Java Interface. Supports four main Data Science Use cases: • Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations) • Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups • Classification – learns from existing categorizations and then assigns unclassified items to the best category • Frequent item set mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session)
15. Hadoop EcoSystem Security

The architects and designers of early Hadoop focused on running HDFS and MapReduce at massive scale, with no real security model. The reason was that most of the workflow (data ingestion, storage, processing, analytics) stayed internal to an enterprise. In recent times, as Hadoop and similar platforms have gone mainstream with interfaces extending outside the enterprise, robust security models have been added to the Hadoop architecture.

HDP Security / Ranger
• Built-in advanced security and authorization for Hadoop.
• Centralized security administration - HDP provides a console for managing security policies and access controls from one place.
• Authentication - Kerberos-based authentication, which can generate delegation tokens for time-bound authentication (like Disney tickets that are valid for a certain number of days), since Kerberos servers cannot scale to tens of thousands of job and data requests. Kerberos has the additional benefit of never sending the password over the wire. It can integrate with corporate LDAP if required.
• Authorization and audit - prevents rogue jobs being executed by insiders. Fine-grained authorization via file permissions in HDFS, resource-level access control for YARN and MapReduce, and coarser-grained access control at the service level.

Knox
• Apache Knox Gateway ("Knox") provides perimeter security for the Hadoop ecosystem.
• Provides security for all of Hadoop's REST and HTTP services.
• Supports REST APIs for Apache Ambari, Apache Falcon, and Apache Ranger.

Hadoop EcoSystem Operational Tools
• Ambari - cluster administration, provisioning, and monitoring tool.
• ZooKeeper - distributed coordination service used by HBase, Accumulo, and others.
16. Road Map for Implementation

Stage 1: Development Tools
• Most Hadoop sub-components can be installed and configured independently, but it is highly recommended to start with a sandbox distribution such as Hortonworks, Cloudera, or another vendor.
• Use Linux machines for developers, as the Hadoop ecosystem is designed around Linux and that is usually the only supported platform in production.
• Sandbox development environments are also provided as virtual machine images that can run on Windows machines for development. Port forwarding and file sharing between host and guest machines can meet most development needs.
• Programming IDEs for Java (Eclipse, IntelliJ) can continue with the current setup.
• For a large team, use Vagrant and build a VM with all required software components.

Stage 2: Cluster Setup - Hardware and Network
• Choose optimal hardware per vendor recommendations for the critical NameNode and ResourceManager/JobTracker as well as the worker nodes. Install and configure a multi-node cluster (24 or 32 nodes) as a proof of concept.
• Isolate the Hadoop ecosystem in separate subnets away from web and other traditional infrastructure. Separate master nodes, worker nodes, client nodes, and management nodes, where master and worker nodes take input only from management nodes.
• Keep client nodes on edge nodes (akin to a DMZ) with end-user access limited to only those. Users should never get access to any Hadoop nodes except the edge/gateway nodes.
• Work with the network architect to choose a topology: the traditional rooted two-switch architecture (using expensive chassis switches) or the spine-and-leaf model preferred for big data (using inexpensive, scale-out fixed switches).
• Define rack awareness (the Hadoop topology script referenced in configuration) for all servers so the NameNode can make the right placement decisions (a hedged sketch of one such script follows below).

Stage 3: Pre-Prod components setup and testing
• Execute a POC demo of all components. This need not be a business use case; it is more a technology demonstration.
• Complete one life cycle of development, deployment, and testing of the POC application with all components.
• Install IDS/IPS products in line with the data ingestion path for data coming from external sources (Twitter feeds, etc.) so fictitious data cannot be injected into processing.
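Following on from the rack-awareness point in Stage 2, here is a hedged sketch of a topology script. Hadoop only requires that the configured script be an executable that accepts host IPs or names as arguments and prints one rack path per argument; the IP-to-rack mapping below is entirely hypothetical.

    #!/usr/bin/env python
    # Minimal sketch of a Hadoop rack-awareness (topology) script.
    # Hadoop calls it with one or more IPs/hostnames and reads one
    # rack path per argument from stdout. The mapping is hypothetical.
    import sys

    RACK_MAP = {
        "10.1.1": "/dc1/rack1",
        "10.1.2": "/dc1/rack2",
        "10.1.3": "/dc1/rack3",
    }
    DEFAULT_RACK = "/dc1/default-rack"

    for host in sys.argv[1:]:
        prefix = ".".join(host.split(".")[:3])  # first three octets of an IPv4 address
        print(RACK_MAP.get(prefix, DEFAULT_RACK))

With this in place, the NameNode can keep one block replica off-rack and job scheduling can prefer rack-local reads.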
17. Road Map for Implementation

Stage 4: Early adoption use case
• Identify long-running jobs and queries from the Enterprise Data Warehouse or BI team.
• Formulate a use case, or copy an existing one, to mimic that functionality.
• Design and code Sqoop and Pig scripts to load data into HDFS.
• Design the HDFS directory layout and Hive schema before the data load (a hedged DDL sketch follows below).
• Replicate the existing functionality using Hadoop ecosystem components.
• Adopt architectural standards for data format and compression (defined in the next section).

Stage 5: Mature use case such as MDM
• Provide data exploration and analysis to identify, locate, and load internal and external third-party data files into the landing zone.
• Clean the data per business requirements.
• Prepare - design and develop big data structures and tables to load the data files.
• Analyze - provide productivity tools for data access and analysis.
• Implement enterprise data transformation tools such as Cascading if required.

Stage 6: Data Science driven analytic use case
• Work with the business to identify a data-science-driven use case.
• Set up the big data ecosystem and environment.
• Engineer with the Data Science team to design the prediction model.
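To make Stage 4 concrete, here is a hedged sketch of the kind of Hive DDL that might sit over data curated in HDFS (for example, written as ORC by a Pig or Hive transformation step after the Sqoop load). The table name, columns, and paths are hypothetical; the ORC format and date partitioning simply follow the format guidance in the next section.

    -- Hypothetical external table over curated data in HDFS.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_orders (
      order_id     BIGINT,
      customer_id  BIGINT,
      order_total  DECIMAL(12,2),
      order_ts     TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION '/data/warehouse/sales_orders';

    -- Register the newly loaded partition after each ingestion run.
    ALTER TABLE sales_orders ADD IF NOT EXISTS
      PARTITION (load_date='2016-01-15')
      LOCATION '/data/warehouse/sales_orders/load_date=2016-01-15';

Existing warehouse reports can then be replicated against this table with HQL before any optimization work begins.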
18. Architectural and Design Decisions

Data Format, Compression and Storage

File format
• SequenceFiles - recommended for smaller files (below the 64 MB default HDFS block size).
• Avro - recommended for use cases with ever-evolving schemas, since the format is self-describing.
• RC and ORC file formats - columnar formats, efficient for most analytic cases where only a few columns are read.

Compression - it is critical that the chosen approach supports a splittable format, as required for MapReduce:
• Snappy - efficient compression but does not support splits on its own.
• LZO - slower performance but supports splits.
• Bzip2 - supports splits but is less efficient. Use only if storage is a serious concern.

Schema design - HDFS or HBase, with the latter providing random read/write support. Use a standard directory structure and a standard organization of data.

Data Movement
• Network - ensure appropriate bandwidth requirements are met.
• Use Sqoop for data ingestion (a hedged example follows below).
• Use a single hop from source to HDFS where possible.
• Use database-specific connectors whenever available.
• Apply the "Goldilocks method" of Sqoop performance tuning (increase parallelism gradually until throughput stops improving or the source database shows strain).
• Load many tables in parallel with Fair Scheduler throttling.
• Ask the following questions and design to minimize network traffic:
  • Timeliness of data ingestion
  • Incremental updates
  • Data transformation - does data need to be transformed in flight?
  • Source system structure, layout, and proximity

Data Processing
• MapReduce low-level Java API - use only if high flexibility and fine control of MapReduce is required.
• Spark and Storm are clear choices for most enterprise-level processing.
• Use abstractions like Pig and HQL for early adoption or for data analysts. Leverage modern performance improvements like Tez, which performs better than MapReduce for Pig and HQL.
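As a hedged illustration of the Sqoop guidance above, the following command sketches a single-hop import of one table into HDFS as splittable, Snappy-compressed Avro. The connection string, table, and target path are hypothetical, and a database-specific (direct) connector can be enabled where the source database supports one.

    # Hypothetical single-hop import into HDFS as Snappy-compressed Avro.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --split-by order_id \
      --num-mappers 8 \
      --target-dir /data/raw/orders/2016-01-15 \
      --as-avrodatafile \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec

Per the Goldilocks approach, --num-mappers would be tuned up gradually until throughput stops improving or the source database shows strain.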
19. Architectural and Design Decisions

Patterns - use patterns for commonly occurring problems:
• Removing duplicate keys.
• Windowing analysis - helps in arriving at the peaks and lows of, say, an equity (see the sketch below).
• Updating time-series data - one common way of storing this information is as versions in HBase. HBase can store every change made to a record as versions. Versions are defined at the column level and ordered by modification timestamp, so you can go back to any point in time and see what the record looked like.

Graph Processing
• Giraph - recommended for mature graph-processing cases, but only for expert graph programmers.
• Spark GraphX - recommended, as it is available within the same Spark stack that is set to replace MapReduce.

Workflows
• While most data pipelines and processing can be run manually with scripts, an enterprise should leverage Oozie and Falcon; they provide robust mechanisms for processing.

Programming languages
• Hadoop is developed in Java, and Java is the core language choice for all interfaces such as Spark, Storm, and MapReduce (abstractions like Pig and HQL are excluded).
• However, as Hadoop matures, Python is becoming a language of choice. If development teams are already familiar with Java, continue with Java.

Unit Testing
• All languages and abstractions for Hadoop offer unit testing (including Pig), and it should be used just as in any enterprise application.
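As a hedged sketch of the windowing-analysis pattern above, the following PySpark snippet computes a rolling five-day high and low per equity symbol; the source table and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("equity-window-analysis").getOrCreate()

    # Hypothetical Hive table of daily closes: (symbol, trade_date, close_price)
    prices = spark.table("equity_daily_prices")

    # Rolling window of the current row plus the four preceding rows, per symbol.
    w = (Window.partitionBy("symbol")
               .orderBy("trade_date")
               .rowsBetween(-4, 0))

    result = (prices
              .withColumn("rolling_high", F.max("close_price").over(w))
              .withColumn("rolling_low", F.min("close_price").over(w)))

    result.show()

The same pattern extends to peak/trough detection or smoothing by swapping in other aggregate functions over the window.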