Architectural Road Map for
Hadoop Ecosystem
Implementation
Sudhir Nallagangu
Content Index
• Industry Definitions, History and Capabilities
• Hadoop EcoSystem Architecture/Components
• Road Map for Implementation
• Architecture Decision points
• Use cases and Data Sciences
• Questions?
Industry Definitions, History and Capabilities
Definition - BigData is a broad, generic term for data sets so large that traditional data
processing (RDBMS) techniques and infrastructure are not adequate. Two core challenges are
faced:
• Storage - To store petabytes of structured, semi-structured and unstructured data that
grows by terabytes daily.
• Processing - To process, analyze and visualize that data in a reasonable amount of time.
History - Google, faced with the task of saving, indexing and searching billions of web pages,
turned to distributed storage and processing innovations. They published their techniques
as white papers - the Google File System (GFS, 2003) for the storage challenge and
MapReduce (2004) for the processing challenge.
Inspired by the above, an open source development team created Hadoop core (HDFS
and MapReduce) in 2006. This provided techniques to manage BigData using
commodity hardware (no supercomputers) in a regular enterprise datacenter setting.
• HDFS - A distributed file storage system that solves ever-growing storage needs with
horizontally scalable infrastructure.
• MapReduce - A distributed “divide and conquer” processing technique that leverages
“data locality”. Data locality means execution/processing happens at the place where the
data resides, which is more efficient than transferring data over the network.
• The core system and services are designed to operate, survive and recover when hardware
and software failures occur as a regular occurrence.
Industry Definitions, History and Capabilities
Capabilities
Use Cases
Data Warehouse modernization - Identified as the early adoption use case. Integrate big data and data
warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of
analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining
what data should be moved to the data warehouse.
Big Data Exploration - Addresses the challenge that every large organization faces: information is stored in many
different systems and silos. Enables you to explore and mine big data to find, visualize, and understand all your
data and improve decision making. By creating a unified view of information across all data sources, you gain
enhanced value and new insights, and can make important decisions such as pricing. Helps arrive at Master Data
Management.
Customer Behavior analysis, Retention, Segmentation - With access to consumer behavior data, companies can
learn what prompts users to stick around and learn about customer characteristics to improve marketing efforts. Data
Science led algorithms help build recommendation systems.
Cyber Security Intelligence - Data Science led intrusion, fraud and anomaly detection systems - Lower risk, detect
fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis
platforms with big data technologies to process and analyze new types of data.
Industry specific - Every industry has invested in BigData to address challenges specific to that industry.
Healthcare, for instance, uses patient data to improve patient outcomes; agriculture uses it to boost crop yields. One
interesting case involves the UN using cell phone data to track the spread of malaria across Africa.
Sources of Data
• Traditional data sources - RDBMS and transactional data, feeds from current data warehouses.
• User and machine-generated content - eMail, help desk calls, social media, web and software logs.
• Internet of Things - cameras, information-sensing mobile devices, aerial sensory technologies, and
genomics.
Hadoop EcoSystem Components
Early Hadoop supported the core services of storage (HDFS) and processing (MapReduce).
• All interfaces to HDFS & MR were through low-level Java APIs; there were no higher-level abstractions.
• There was no robust security model built in.
• Early adopters were heavy tech corporations like Yahoo, Facebook, LinkedIn …
The Hadoop EcoSystem evolved to meet traditional enterprise needs (and is still evolving). The additions include, but are not limited to:
• Higher-level abstractions (Pig, Hive, Cascading)
• Data ingestion tools (DistCp, Sqoop, Flume)
• Random real-time read/write access for large datasets (HBase)
• Distributed coordination services (ZooKeeper)
• Workflow engines (Oozie, Falcon)
• Security (Apache Ranger, Knox)
• Improved resource management (YARN, now part of Hadoop core)
• In-memory processing (Spark)
• Machine learning (Mahout)
• Provisioning and monitoring services
Architecture and Design Goal
• As the Hadoop ecosystem matured rather rapidly, enterprises now have the task of effectively integrating these
several tools into complete solutions.
• A rich ecosystem of tools, APIs, and development options provides choice and flexibility, but can make it
challenging to determine the best choices to implement a solution.
• Enterprises need to understand the role of each tool and sub-component; architects need to ask the right
questions, pick the right tools, and make the right decisions for the implementation.
Hadoop EcoSystem Components..
(Picture Courtesy HortonWorks HDP)
Hadoop EcoSystem Core
HDFS • Scalable, fault-tolerant, distributed storage system across many servers. An HDFS cluster is
comprised of a NameNode (master) that manages the cluster metadata and
DataNodes (workers) that store the actual data.
• Key features
• Provides all standard file system commands available on POSIX systems
• Rack Awareness - allows a node’s physical location to be considered when allocating
storage and scheduling tasks
• Replication - provides a default replication factor of 3 so data survives the failure of a
particular DataNode. It detects errors and automatically creates a separate copy if required
• Minimal data motion. MapReduce moves compute processes to the data on HDFS and
not the other way around. Processing tasks can occur on the physical node where the
data resides. This significantly reduces network traffic, thereby improving
overall latency/performance.
• Utilities diagnose the health of the file system and can rebalance the data across
nodes (a small API sketch follows below)
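For orientation, here is a minimal, hypothetical sketch of talking to HDFS through the Java FileSystem API; the NameNode address and paths are made-up examples, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // normally picks up core-site.xml / hdfs-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");  // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; the NameNode records metadata, DataNodes store the blocks
        Path file = new Path("/data/landing/events.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if present
            out.writeUTF("first record\n");
        }

        // List the directory - a pure metadata operation answered by the NameNode
        for (FileStatus status : fs.listStatus(new Path("/data/landing"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
        fs.close();
    }
}

The same listing is available from the standard shell interface, for example hdfs dfs -ls /data/landing.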
MapReduce
• The core divide-and-conquer distributed processing engine. Earlier versions also carried
resource management, but that responsibility has since moved to YARN.
• Includes JobTracker (master) and TaskTracker (worker) components to run batch
jobs.
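As a concrete illustration of the model, below is a minimal word-count sketch against the MapReduce Java API (org.apache.hadoop.mapreduce); the class names are illustrative and the input/output paths are supplied as arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: runs where the data blocks live, emitting (word, 1) pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) ctx.write(new Text(token), ONE);
            }
        }
    }

    // Reduce phase: receives all counts for a given word and sums them
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, a job like this is typically launched along the lines of hadoop jar wordcount.jar WordCount <input> <output>.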
YARN • Foundation of the new generation of the Hadoop core operating system. Provides resource
management and a central operating platform. Addresses limitations of the MapReduce 1.0
JobTracker service in the areas of scalability and support for non-MapReduce programs
• Key features
• Multi-tenancy - Allows multiple access engines (either open-source or proprietary) to
use Hadoop as the common standard for batch, interactive and real-time engines that
can simultaneously access the same data set.
• Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization
over more static MapReduce rules used in early versions of Hadoop
• YARN’s original purpose was to split up major responsibilities of the JobTracker/
TaskTracker into separate entities: a global ResourceManager, a per-application
ApplicationMaster, a per-node slave NodeManager, a per-application Container
running on a NodeManager.
(Diagram: HDFS and MapReduce cluster layout)
Hadoop EcoSystem Abstraction services
•One criticism of MapReduce is that the default development cycle in Java is very long. Writing the mappers and
reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is time consuming. Also
those who are not Java programmers cannot really make good use of Hadoop/HDFS.
•PIG and Hive come to the rescue for analysts and DBAs, and CASCADING for Java programmers. Under the covers, all
three translate their abstractions into a series of MapReduce programs.
PIG • Originated at Yahoo. Pig’s scripting language, Pig Latin, can process petabytes of data with just a few
statements. Recommended use cases are ETL data pipelines, research on raw data and
iterative processing.
• Key features
• Available operators include LOAD, FILTER, GROUP BY, FOREACH, MAX, ORDER, LIMIT, UNION,
CROSS, SPLIT, CUBE, ROLLUP.
• Ability to create User Defined Functions in Java and other languages that can be called
within Pig scripts (a small script sketch follows below).
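To make that concrete, here is an illustrative Pig Latin sketch (the file paths and field names are invented): load raw web logs, keep server errors, and count them per page.

-- load tab-separated web logs from HDFS
logs    = LOAD '/data/raw/weblogs' USING PigStorage('\t')
              AS (ts:chararray, url:chararray, status:int);
errors  = FILTER logs BY status >= 500;          -- keep only server errors
byUrl   = GROUP errors BY url;
counts  = FOREACH byUrl GENERATE group AS url, COUNT(errors) AS hits;
sorted  = ORDER counts BY hits DESC;
top10   = LIMIT sorted 10;
STORE top10 INTO '/data/out/top_error_pages';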
Hive/HQL
• Originated at Facebook, Hive provides a SQL-like abstraction called HQL for data analysts
with a SQL background.
• Key features:
• Under the covers, Hive provides a SQL/schema-like structure over HDFS data by using a
Metastore. The Metastore, which presents “HDFS data” as SQL tables, is typically backed by
MySQL; MySQL does not store the data itself, just the structure of the data stored in HDFS.
• Provides the majority of ANSI SQL-like statements for processing, including EXPLAIN and
ANALYZE statements.
• Provides Partitioning, Indexing, Bucketing features to manage performance.
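A small, hypothetical HQL sketch of this workflow (table, columns and location are made up): define a schema over files already sitting in HDFS, then query it with familiar SQL.

-- External table: Hive records only metadata; the files stay where they are in HDFS
CREATE EXTERNAL TABLE web_logs (
  ts     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Hive compiles this into MapReduce (or Tez) jobs behind the scenes
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE status >= 500
GROUP BY url
ORDER BY hits DESC
LIMIT 10;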
Cascading •PIG is great for early adopter use cases, ad hoc queries, and less complex applications.
Cascading is great for enterprise data workflows and is designed for “scale, complexity and
stability” beyond PIG.
• With Cascading, you can package your entire MapReduce application, including its orchestration
and testing, within a single JAR file
• Processing Model is based on “data pipes”, filters/operations, data taps(sources and sinks).
Hadoop EcoSystem Data Ingestion tools
Sqoop • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache
Hadoop and structured data stores such as relational databases.
• Mostly batch driven; a simple use case would be an organization that runs a nightly Sqoop import to
load the day's data from a production DB into HDFS for Hive/HQL analysis.
• Provides Import, Export commands and runs the process in parallel with data coming in and out of
HDFS
• Parallelizes data transfer for fast performance and optimal system utilization
• Copies data quickly from external systems to Hadoop
• Makes data analysis more efficient
• Mitigates excessive loads to external systems.
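A hypothetical nightly import of that kind might look like the command below; the JDBC URL, credentials file, table and target directory are examples only.

sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/raw/orders/2015-06-01 \
  --num-mappers 4   # four parallel map tasks pull slices of the table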
Flume • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data into HDFS (not out of HDFS)
• Mostly real time and event driven; a common use case is collecting log data from one system - a
bank of web servers' log files - and aggregating it in HDFS for later analysis.
• Stream data from multiple sources into Hadoop for analysis
• Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at
which data can be written to the destination
• Guarantee data delivery
• Scale horizontally to handle additional data volume
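As a sketch of that web-server use case (agent, source and path names invented), a single-agent Flume configuration could look roughly like the properties below: an exec source tailing a log, a memory channel, and an HDFS sink.

# source, channel and sink wiring for one agent named "agent"
agent.sources  = weblog
agent.channels = mem
agent.sinks    = hdfssink

# tail the web server access log
agent.sources.weblog.type     = exec
agent.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

# buffer events in memory between source and sink
agent.channels.mem.type     = memory
agent.channels.mem.capacity = 10000

# land events in HDFS as plain text
agent.sinks.hdfssink.type          = hdfs
agent.sinks.hdfssink.hdfs.path     = hdfs://namenode-host:8020/data/raw/weblogs
agent.sinks.hdfssink.hdfs.fileType = DataStream
agent.sinks.hdfssink.channel       = mem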
Distcp DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to
effect its distribution, error handling and recovery, and reporting
Hadoop EcoSystem Real time Read/Write Storage
HBase HDFS is a distributed file system that is well suited for the storage of large files. HDFS is a general
purpose file system and does not provide fast individual record lookups in files. HBase, on the other
hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This
can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed
"StoreFiles" that exist on HDFS for high-speed lookups.
Key features:
Modelled after Google BigTable, cells are versioned, column oriented (vs RDBMS row orientation)
and rows are sorted(for lookups), map oriented and distributed across servers.
Depends on “ZooKeeper” for the distributed coordination of the cluster. There is no one single master server,
which provides additional flexibility.
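For a feel of the random read/write model, here is a minimal sketch against the HBase 1.x Java client API; the table, column family and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml (ZooKeeper quorum, etc.)
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Random write: row key "cust#1001", column family "profile", qualifier "email"
            Put put = new Put(Bytes.toBytes("cust#1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("jane@example.com"));
            table.put(put);

            // Random read: fast single-row lookup by key, served from indexed StoreFiles
            Result row = table.get(new Get(Bytes.toBytes("cust#1001")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));
        }
    }
}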
Hadoop EcoSystem Workflow tools
Oozie • Oozie is a workflow scheduling engine specialized in running multi-stage jobs in the Hadoop
ecosystem. It has the ability to monitor, track, recover from errors, and maintain dependencies between jobs.
• Workflows are expressed as XML and no programming language is required.
• 3 Types of jobs:
• Oozie Workflow jobs - jobs run on demand, expressed as Directed Acyclic Graphs (DAGs).
• Oozie Coordinator jobs - workflow jobs run periodically on a regular basis, triggered by
time and data availability.
• Oozie Bundle jobs - provide a way to package multiple coordinator and workflow jobs.
• Deployment Model: An Oozie application comprises one file defining the logic of the application
plus other files such as scripts, configuration, and JAR files. A Workflow application consists of a
workflow.xml file and may have configuration files, Pig scripts, Hive scripts, JAR files, etc.
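An illustrative workflow.xml sketch with a single Pig action is shown below; the workflow name, referenced script and schema version are examples only, not taken from the slides.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="daily-etl">
    <start to="clean-logs"/>
    <action name="clean-logs">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean_logs.pig</script>   <!-- hypothetical Pig script shipped with the app -->
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>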
Falcon • Simplifies “data management” (data jobs) for Hadoop and is used mainly as part of data ingestion.
Falcon is a feed processing and feed management system aimed at making it easier for end
consumers to onboard their data onto Hadoop clusters.
• Provides the key services that data processing applications need, so sophisticated data
lifecycle management (DLM) can easily be added to Hadoop applications.
• Services include data processing (using PIG scripts, for example), retry logic, late-arrival
data handling, replication, retention, records audit, governance and metrics.
• Provides integration with metastore/catalog(HCatalog).
• Faster development and higher quality for ETL, reporting and other data processing apps on
Hadoop.
Hadoop EcoSystem Real time Read/Write Storage..
Accumulo Originally developed by the NSA (National Security Agency), Accumulo provides “cell level security access” to
a BigTable-modeled database. Due to its origins in the intelligence community, Accumulo provides
extremely fast access to data in massive tables, while also controlling access to its billions of rows
and millions of columns down to the individual cell.
Key features:
• Group columns within a single file
• Automatic tablet splitting and rebalancing
• Merge tablets and clone tables
• Uses Zookeeper just like HBase
Hadoop EcoSystem Analytic processing tools
Storm • Storm provides the ability to process large amounts of “real time” data (vs. the batch orientation of Hadoop).
• A typical use case would be processing Twitter feeds or online web log errors that need
immediate action.
• Features:
• The input stream of data is managed by a “spout” (like a water spout), which passes data to a “bolt”
that transforms (processes) the data or passes it to another bolt. Scale up by increasing the
number of “bolts” and “spouts”.
• Storm has 2 kinds of nodes - a master (which runs the Nimbus daemon) and worker nodes (which run
Supervisors). Cluster coordination is handled by ZooKeeper, so nodes can fail and restart. Fast
computation/data transfers are done with “ZeroMQ” messaging rather than plain TCP/IP sockets.
• Provides Java, Scala and Python language interfaces
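A rough topology-wiring sketch in Java is shown below; TweetSpout, ParseBolt and AlertBolt are hypothetical placeholder classes, and the backtype.storm packages match the older Storm releases this deck describes (newer releases renamed them to org.apache.storm).

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class TweetTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout(), 1);      // hypothetical spout reading a feed
        builder.setBolt("parse", new ParseBolt(), 4)          // hypothetical bolt, 4 parallel instances
               .shuffleGrouping("tweets");                    // tuples distributed randomly across instances
        builder.setBolt("alert", new AlertBolt(), 2)          // hypothetical bolt raising alerts
               .fieldsGrouping("parse", new Fields("severity"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        new LocalCluster().submitTopology("tweet-alerts", conf, builder.createTopology());
    }
}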
Spark • An in-memory compute engine for machine learning and data science projects. Typical MapReduce
processing shuttles data between hard disks at each stage, but Spark does all data processing/
saving in memory (to a great extent), providing roughly ten-fold performance improvements over disk-based MapReduce.
• Use cases - speeding up batch analysis jobs, iterative machine learning jobs, If you are
interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great
option (although memory requirements must be considered)
• Spark provides the RDD (Resilient Distributed Dataset) mechanism for in-memory data storage
and processing primitives that are applied across the whole data set
• Spark powers a stack of high-level tools including Spark SQL, MLlib for machine
learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly
in the same application.
• Spark can run on Hadoop 2's YARN cluster manager, and can read existing Hadoop
data. Spark SQL can be built and configured to read and write data stored in Hive
• Provides Java, Scala and Python language interfaces
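A minimal sketch with the Spark Java API is shown below (the HDFS path is invented, and the lambda syntax assumes Java 8): an RDD is loaded once, cached in memory, and reused by two computations without re-reading the disk.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-analysis"); // submit with --master yarn to run on YARN
        JavaSparkContext sc = new JavaSparkContext(conf);

        // RDD backed by HDFS data, pinned in memory after the first read
        JavaRDD<String> logs = sc.textFile("hdfs:///data/raw/weblogs").cache();

        long total  = logs.count();                                        // first pass materializes the cache
        long errors = logs.filter(line -> line.contains(" 500 ")).count(); // second pass reuses cached data

        System.out.println(errors + " error lines out of " + total);
        sc.stop();
    }
}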
Mahout Suite of machine learning libraries designed to be scalable and robust. Provides Java Interface.
Supports four main Data Science Use cases:
• Collaborative filtering – mines user behavior and makes product recommendations (e.g.
Amazon recommendations)
• Clustering – takes items in a particular class (such as web pages or newspaper articles) and
organizes them into naturally occurring groups
• Classification – learns from existing categorizations and then assigns unclassified items to the
best category
• Frequent item set mining – analyzes items in a group (e.g. items in a shopping cart or terms in
a query session)
Hadoop EcoSystem Security
Architects and designers of early Hadoop were initially focused on HDFS and MapReduce at a massive scale
with no “security model”. The reason was that most of the workflow (data ingestion, storage, processing, analytics)
was internal to an enterprise. In recent times, as Hadoop and similar systems have gone mainstream with interfaces
extending outside the enterprise, robust security models have been added to the Hadoop architecture.
HDP Security / Ranger
• Hadoop Inbuilt Advanced Security and Authorization
• Centralized Security Administration - HDP provides a console for managing security policies and
access controls from one place.
• Authentication - Kerberos-based authentication, which can generate delegation tokens for time-
bound authentication (like Disney tickets that are valid for certain days), since Kerberos servers will
not be able to scale to tens of thousands of jobs/data requests. Kerberos has the additional
benefit of never sending passwords over the wire. It can integrate with corporate LDAP if required.
• Authorization and Audit - Prevents rogue jobs being executed by insiders. Fine-grained
authorization via file permissions in HDFS, resource-level access control for YARN and MapReduce
and coarser-grained access control at a service level.
Knox • Apache Knox Gateway (“Knox”) provides perimeter security for Hadoop ecoSystem.
• Provides security to all of Hadoop’s REST & HTTP services
• Support for REST APIs for Apache Ambari, Apache Falcon and Apache Ranger.
Hadoop EcoSystem Operational Tools
Ambari Cluster Administration, Provisioning and Monitoring tool
Zookeeper Distributed coordination service used by HBase and Accumulo
Road Map for Implementation
Stage 1: Development Tools
• Most Hadoop sub-components can be independently installed and configured, but it is highly
recommended to go with sandbox versions from a vendor such as HortonWorks or Cloudera.
• Use Linux (or similar) machines for developers, as the Hadoop EcoSystem is designed around Linux and it is usually the only
platform supported in production.
• Sandbox development environments are also provided as Virtual Machine packages that can run on Windows
machines for development. Port forwarding and sharing between host and guest development machines can meet
most development needs.
• Programming IDEs for Java (Eclipse and IntelliJ) can continue with the current setup.
• If it is a large team, use Vagrant and develop a VM with all required software components.
Stage 3: Pre-Prod components setup and testing
• Execute a POC demo of all components. This need not be a use case but more a technology demonstration.
• Complete one life cycle of development, deployment and testing of POC application with all components.
• Install IDS/IPS products in line with the “Data Ingestion” path for data coming from external sources (Twitter feeds etc.) so
fictitious data cannot be injected into processing.
Stage 2: Cluster Setup - Hardware and Network
•Choose optimal hardware as per vendor recommendations for the critical NameNode and ResourceManager/JobTracker
as well as the rest of the worker nodes. Install and configure a multi-node cluster (24 or 32 nodes) as a proof of concept.
•Isolate the Hadoop ecosystem in separate subnets away from web and other traditional infrastructure. Separate “master
nodes, worker nodes, client nodes and management nodes”, where master nodes and worker nodes take input only
from management nodes.
•Keep client nodes on edge nodes (akin to a DMZ) with end user access limited to only those. Users should never get
access to any Hadoop nodes except the edge/gateway nodes.
•Work with the network architect to finalize the topology design: a traditional two-switch rooted architecture (using
expensive chassis switches) OR the BigData-preferred spine-and-leaf model (using inexpensive scale-out fixed
switches).
•Define rack awareness (the Hadoop topology.script configuration) for all servers so the NameNode can make the right
placement decisions; a sketch of such a script follows below.
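A hedged sketch of such a rack-topology script (the subnets and rack names are invented): Hadoop, pointed at the script through the net.topology.script.file.name property in core-site.xml on recent releases, calls it with a batch of host names or IP addresses and expects one rack path per argument on stdout.

#!/bin/bash
# Map each requested host/IP to a rack path; unknown hosts fall back to the default rack.
for host in "$@"; do
  case "$host" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done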
Road Map for Implementation
Stage 4: Early adoption use case
• Identify the Enterprise Data Warehouse or BI team's long-running jobs and queries.
• Formulate a use case or copy an existing use case to mimic above functionality.
• Design and code Sqoop and PIG scripts to load data to HDFS
• Design HDFS and Hive Schema before data load
• Replicate existing functionality using Hadoop ecosystem components.
• Adopt architectural standards for data format, compression (defined in next section)
Stage 5: Mature use case like MDM
• Provide data exploration and analysis to identify, locate and load internal and external data files into the
landing zone
• Clean the data as per business requirements
• Prepare - Design and develop Big Data structures and tables to load the data files
• Analyze - provide productivity tools for data access and analysis.
• Implement enterprise data transformation tools like Cascading if required
Stage 6: Data Science driven Analytic use case.
• Work with the business to identify a Data Science driven use case
• Set up the BigData ecosystem and environment
• Partner with the Data Science team to design the prediction model.
Architectural and Design Decisions
Data Format, Compression and Storage system
FileFormat
• SequenceFiles - Recommended for smaller files (below the 64MB default HDFS block size)
• Avro - Recommended for use cases with ever-evolving schemas, since Avro files have a self-describing format
• RC and ORCFile formats - Columnar formats, efficient for most analytic cases where only a few columns of data are
read.
Compression - It is critical that the chosen approach supports the splittable format required for MapReduce
• Snappy - Efficient compression but does not support splits
• LZO - Slow performance but supports splits
• Bzip2 - Split support but less efficient. Use only if storage is a serious concern
Schema Design - HDFS or HBase, with the latter providing random read/write support
Standard directory structure and standard organization of data
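One hypothetical way these choices combine in practice (table and column names invented): a columnar ORC table with Snappy compression applied inside the ORC container (where splittability is handled by the file format itself), partitioned by date for analytic reads.

CREATE TABLE web_logs_orc (
  ts     STRING,
  url    STRING,
  status INT
)
PARTITIONED BY (dt STRING)                    -- partition pruning keeps scans small
STORED AS ORC                                 -- columnar layout: read only the needed columns
TBLPROPERTIES ("orc.compress" = "SNAPPY");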
Data Movement
• Network - Ensure appropriate bandwidth requirements are met
• Sqoop for data ingestion
• Use single hop from source to HDFS where possible
• Use Database Specific Connectors Whenever Available
• Goldilocks Method of Sqoop Performance Tuning
• Loading Many Tables in Parallel with Fair Scheduler Throttling
• Ask the following questions and design to minimize network traffic
• Timeliness of data ingestion
• Incremental updates
• Data transformation - does the data need to be transformed in flight
• Source system structure, layout and proximity
Data Processing
• MapReduce low level Java API - Use only if high flexibility and control of MapReduce is required
• Spark and Storm are clear choices for most enterprise-level processing
• Use abstractions like Pig and HQL for early adoption or for data analyst usage. Leverage modern performance
improvements like “Tez”, which performs better than MapReduce for Pig and HQL (see the example below).
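For example, on distributions that ship Tez, Hive can be switched to it per session (recent Pig releases accept a comparable -x tez execution mode); the query shown is a placeholder.

SET hive.execution.engine=tez;   -- run subsequent HQL on Tez instead of MapReduce
SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url;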
Architectural and Design Decisions
Patterns
Use patterns for commonly occurring problems
• Removing duplicate keys
• Windowing analysis - helps in arriving at the peaks and lows of, say, an equity
• Updating time series data - One common way of storing this information is as versions in HBase. HBase has a way to store
every change you make to a record as versions. Versions are defined at a column level and ordered by modification
timestamp, so you can go to any point in time to see what the record looked like.
Graph Processing
Giraph - Recommended for mature cases of Graph processing but only for expert graph programmers
Spark GraphX - Recommended as it is available within the single Spark stack, which is set to replace MapReduce.
WorkFlows
• While most data pipelines and processing can be driven manually with scripts, an enterprise should leverage Oozie and Falcon.
They provide robust mechanisms for orchestration.
Programming languages
• Hadoop is developed in Java, and Java is the core language choice for all interfaces like Spark, Storm and MapReduce
(abstractions like Pig and HQL are excluded).
• However, as Hadoop matures, Python is becoming a language of choice. If development teams are already
familiar with Java, then continue with it.
Unit Testing
• All languages and abstractions for Hadoop offer unit testing (including Pig), and it should be used just as in any enterprise
application.
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
DataWorks Summit
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
Cloudera, Inc.
 
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonUsing Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
MapR Technologies
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
Mark Kromer
 

Was ist angesagt? (20)

PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
 
Hadoop
HadoopHadoop
Hadoop
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
 
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonUsing Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 

Ähnlich wie Hadoop - Architectural road map for Hadoop Ecosystem

Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 

Ähnlich wie Hadoop - Architectural road map for Hadoop Ecosystem (20)

Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
paper
paperpaper
paper
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
 
Big Data
Big DataBig Data
Big Data
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Anju
AnjuAnju
Anju
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Big Data
Big DataBig Data
Big Data
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 

Kürzlich hochgeladen

Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 

Kürzlich hochgeladen (20)

Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

Hadoop - Architectural road map for Hadoop Ecosystem

  • 1. Architectural Road Map for Hadoop Ecosystem Implementation Sudhir Nallagangu
  • 2. Content Index • Industry Definitions, History and Capabilities • Hadoop EcoSystem Architecture/Components • Road Map for Implementation • Architecture Decision points • Use cases and Data Sciences • Questions?
  • 3. Industry Definitions, History and Capabilities Definition - BigData is a broad generic term for large data sets that traditional data processing(RDBMS) techniques and infrastructure are not adequate. Two core challenges faced: • Storage - To store PetaBytes of structured, semistructured and unstructured data that grows by TeraBytes daily. • Processing - To process, analyze, visualize in reasonable amount of time. History - Google faced with task of saving, indexing and searching billions of web pages turned to distributed storage and processing innovations. They published their techniques as white papers - Google File System(GFS 2003) for storage challenge and MapReduce(2004) for processing challenge. Inspired by above, a team of open source development team created Hadoop core (HDFS and MapReduce) in 2006. This provided techniques to manage BigData by using commodity hardware (no supercomputers) in a regular enterprise datacenter setting. • HDFS - A distributed file storage system to solve ever growing storage needs by horizontal scalable infrastructure. • MapReduce - A distributed “divide and conquer” processing technique that leverages “data locality”. Data locality execution means execution/processing at the place where data resides(efficient over transferring data over network). • The core system and services designed to operate , survive and recover in hardware and software failure scenarios as regular occurrence.
  • 4. Industry Definitions, History and Capabilities Capabilities Use Cases Data Ware House modernization - Identified as the early adoption use case. Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Big Data Exploration - Addresses the challenge that every large organization faces: information is stored in many different systems and silos. Enables you to explore and mine big data to find, visualize, and understand all your data to improve decision making by creating a unified view of information across all data sources you gain enhanced value and new insights. Make important decisions such as pricing. Helps arrive at Master Data Management. Customer Behavior analysis, Retention, Segmentation - With access to consumer behavior data, companies can learn what prompts user to stick around , learn about customer characteristics to improve marketing efforts. Data Science led algorithms help building recommendation systems. Cyber Security Intelligence - Data Science led Intrusion, fraud, anomaly detection systems - Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types. Industry specific - Every industry have invested in BigData to help specific challenges in those industries. Healthcare for instances uses patient data to improve patient outcomes. agriculture to boost crop yields. One interesting case involves UN using cell phone data to track malaria spread in African subcontinent. Sources of Data • Traditional data sources - RDBMS and transactional data, feed from current Data ware houses. • User and machine-generated content - eMail, help desk calls, social media, web and software logs. • Internet of Things - cameras, information-sensing mobile devices, aerial sensory technologies, and genomics.
  • 5. Hadoop EcoSystem Components Early Hadoop supported core services of Storage (HDFS) and Processing (MapReduce). • All Interfaces to HDFS & MR is through low level Java API. There were no higher level abstractions. • There was no robust Security model built around. • Early adopters were heavy tech corporations like Yahoo, Facebook, linkedIn … Hadoop EcoSystem evolved to meet traditional enterprise needs (still evolving). They include but not limited to: • Higher level abstractions (Pig, Hive, Cascade) • Data Ingestion tools (Distcp, Sqoop, Flume) • Random real time read/write access for large datasets(HBase) • Distribution coordination services(Zookeeper) • workflow Engines (Oozie, Falcon) • Security (Apache Ranger, Knox) • Improved Resource management (YARN. This is now part of Hadoop core). • In-Memory processing (Spark) • Machine learning (Mahout) • Provisioning and Monitoring services Architecture and Design Goal • As Hadoop ecosystem matured rather rapidly, enterprises have the task of effectively integrating these several tools into complete solutions. • A rich ecosystem of tools, APIs, and development options provide choice and flexibility, but can make it challenging to determine the best choices to implement a solution. • Enterprises need to understand the role of each tool. sub-component, architects need to ask right questions, pick right tools, make right decisions for the implementation
  • 6. Hadoop EcoSystem Components.. (Picture Courtesy HortonWorks HDP)
  • 7. Hadoop EcoSystem Core HDFS • Scalable, fault-tolerant, distributed storage system across many servers.HDFS cluster is comprised of a NameNode(Master) that manages the cluster metadata and DataNodes(Worker) that store the actual data. • Key features • Provides all standard file system commands available on POSIX systems • Rack Awareness - allows consideration of a node’s physical location, when allocating storage and scheduling tasks • Replication - provides default replication of 3 in case of a particular datanode failure. It detects errors and automatically creates a separate copy if required • Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network traffic and so there improving overall latency/performance. • Utilities diagnose the health of the files system and can rebalance the data on different nodes Map Reduce • The core of Divide-and-conquer distributed processing Engine. Earlier versions carried resource management but now those are moved to YARN. • Includes JobTracker(Master) and TaskTracker(worker) components to run batch version of jobs. YARN • Foundation of new generation of Hadoop core operating system. Provides resource management and a central operating platform. Addresses limitations of MapReduce 1.0 Jobtracker service in areas of scalability and support for non MapReduce programs • Key features • Multi-tenancy - Allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. • Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization over more static MapReduce rules used in early versions of Hadoop • YARN’s original purpose was to split up major responsibilities of the JobTracker/ TaskTracker into separate entities: a global ResourceManager, a per-application ApplicationMaster, a per-node slave NodeManager, a per-application Container running on a NodeManager.
  • 10. Hadoop EcoSystem Abstraction services •One criticism of MapReduce is that the default development cycle in Java is very long. Writing the mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is time consuming. Also those who are not Java programmers cannot really make good use of Hadoop/HDFS. •PIG and Hive comes to rescue for Analysts and DBA’s. CASCADING for java programmers. Under the covers all the three apply abstractions into a series of MapReduce programs. PIG • Originated at Yahoo. Pig scriptable language “Latin” can process petabytes of data with just few statements. Recommended use case(s) are ETL data pipelines, research on raw data and iterative processing. • Key features • Available functions LOAD, FILTER, GROUP BY, FOREACH, MAX , ORDER, LIMIT, UNION , CROSS , SPLIT, CLUBE, ROLLUP. • Ability to create User Defined functions in Java and other languages that can be called with in Pig scripts. Hive/ HQL • Originated at Facebook, Hive/HQL provides SQL like abstraction called HQL for data analysts with SQL background. • Key features: • Hive under the covers provides a SQL/schema like structure to HDFS data by using a Metastore. Metastore which visualizes “hdfs data” as sql tables is supported by MySQL. MySQL does not store data but just the structure to data stored in HDFS. • Provides majority of ANSI SQL alike statements for processing including Explain, Analyze statements. • Provides Partitioning, Indexing, Bucketing features to manage performance. Cascade •PIG is great for early adopter use cases, ad hoc queries, and less complex applications. Cascading is great for Enterprise data workflows and is designed for “scale, complexity and stability” over PIG. • With Cascading, you can package your entire MapReduce application, including its orchestration and testing, within a single JAR file • Processing Model is based on “data pipes”, filters/operations, data taps(sources and sinks).
  • 11. Hadoop EcoSystem Data Ingestion tools Sqoop • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. • Mostly Batch driven, simple use case will be an organization that runs a nightly sqoop import to load the day's data from a production DB into a HDFS for Hive/HSQL analysis. • Provides Import, Export commands and runs the process in parallel with data coming in and out of HDFS • Parallelizes data transfer for fast performance and optimal system utilization • Copies data quickly from external systems to Hadoop • Makes data analysis more efficient • Mitigates excessive loads to external systems. Flume • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS (not out of HDFS) • Mostly Real time, even driven and a common use case is collecting log data from one system- a bank of web servers log files (aggregating it in HDFS for later analysis). • Stream data from multiple sources into Hadoop for analysis • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination • Guarantee data delivery • Scale horizontally to handle additional data volume Distcp DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting Hadoop EcoSystem Real time Read/Write Storage HBase HDFS is a distributed file system that is well suited for the storage of large files. HDFS is a general purpose file system and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. Key features: Modelled after Google BigTable, cells are versioned, column oriented (vs RDBMS row orientation) and rows are sorted(for lookups), map oriented and distributed across servers. Depends on “ZooKeeper” for distribution services of a cluster. There is no one single master server which provides that additional flexibility.
  • 12. Hadoop EcoSystem Workflow tools Oozie • Ozzie is a workflow scheduling engine specialized in running multi-stage jobs in Hadoop eco- system. It has ability to monitor, track, recover from errors, maintain dependency of jobs. • Workflows are expressed as XML and no development language are required. • 3 Types of jobs: • Oozie workflow jobs - Jobs running on demand. They are Directed Acyclical Graphs (DAGs), Oozie Coordinator jobs - Workflow Jobs running periodically on a regular basis.Triggered by time and data availability. • Oozie Bundle provides a way to package multiple coordinator and workflow jobs. • Deployment Model: An Oozie application comprises one file defining the logic of the application plus other files such as scripts, configuration, and JAR files. A Workflow application consists of a workflow.xml file and may have configuration files, Pig scripts, Hive scripts, JAR files, etc. Falcon • Simplifies “data management(data jobs)” for Hadoop and used mainly as part of “Data Ingestion”. Falcon is a feed processing and feed management system aimed at making it easier for end consumers to onboard their data onto Hadoop clusters. • Provides the key services data processing applications need so Sophisticated DLM can easily be added to Hadoop applications. • Services include Data processing (using PIG scripts for example), retry logic , late arrival data handling, Replication, Retention, Records Audit, Governance, metrics. • Provides integration with metastore/catalog(HCatalog). • Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop. Hadoop EcoSystem Real time Read/Write Storage.. Accumulo Originally developed by NSA(National Security Agency), this provides “cell level security access” to a BigTable modeled database.Due to its origins in the intelligence community, Accumulo provides extremely fast access to data in massive tables, while also controlling access to its billions of rows and millions of columns down to the individual cell. Key features: • Group columns within a single file • Automatic tablet splitting and rebalancing • Merge tablets and clone tables • Uses Zookeeper just like HBase
  • 14. Hadoop EcoSystem Analytic processing tools Storm • Storm provides ability to process of large amounts of “real time” data(vs batch version of Hadoop). • Typical use case would to be say process twitter feeds, online web log errors that needs immediate actions. • Features: • Input stream of data is managed by “spout” (like a water spout) which passes data to “bolt” which transforms (processes) data or pass it another bolt. Ability to scale up by increasing number of “bolts” and “spouts” • Storm have 2 kinds of nodes - master (which runs daemon Nimbus) and worker nodes(that runs supervisor) . All clusters are managed by ZooKeeper and they can fail/restart. Also fast computation/data transfers are done by “ZeroMQ” which is faster than TCP/IP. • Provides Java, Scala and Python language interfaces Spark • An In-memory compute for Machine learning and Data science projects. Typical MapReduce processing involves data transferring between “hard disks” but Spark does all data processing/ saving in memory(to great extent) thus providing 10 fold performance improvement over Storm. • Use cases - speeding up batch analysis jobs, iterative machine learning jobs, If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered) • Spark provides RDD (resilient Distributed Data) mechanism for in-Memory data storage and processing primitives which get applied on whole data • Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application. • Spark can run on Hadoop 2's YARN cluster manager, and can read existing Hadoop data.SparkSQL can be built and configured to read and write data stored in Hive • Provides Java, Scala and Python language interfaces Mahout Suite of machine learning libraries designed to be scalable and robust. Provides Java Interface. Supports four main Data Science Use cases: • Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations) • Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups • Classification – learns from existing categorizations and then assigns unclassified items to the best category • Frequent item set mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session)
15. Hadoop EcoSystem Security

The architects and designers of early Hadoop focused on running HDFS and MapReduce at massive scale, with no real security model. The reason was that most of the workflow (data ingestion, storage, processing, analytics) stayed internal to an enterprise. In recent times, as Hadoop and similar platforms have gone mainstream with interfaces extending outside the enterprise, robust security models have been added to the Hadoop architecture.

HDP Security / Ranger
• Built-in advanced security and authorization for Hadoop.
• Centralized security administration - HDP provides a console for managing security policies and access controls from one place.
• Authentication - Kerberos-based authentication, which can generate delegation tokens for time-bound authentication (like Disney tickets that are valid for a certain number of days), since Kerberos servers cannot scale to tens of thousands of job and data requests. Kerberos has the additional benefit of never sending the password over the wire. It can integrate with corporate LDAP if required.
• Authorization and audit - prevents rogue jobs being executed by insiders. Fine-grained authorization via file permissions in HDFS, resource-level access control for YARN and MapReduce, and coarser-grained access control at the service level.

Knox
• Apache Knox Gateway ("Knox") provides perimeter security for the Hadoop ecosystem.
• Provides security for all of Hadoop's REST and HTTP services.
• Supports REST APIs for Apache Ambari, Apache Falcon, and Apache Ranger.

Hadoop EcoSystem Operational Tools
• Ambari - cluster administration, provisioning, and monitoring tool.
• ZooKeeper - distributed coordination service used by HBase, Accumulo, and others.
16. Road Map for Implementation

Stage 1: Development Tools
• Most Hadoop sub-components can be installed and configured independently, but it is highly recommended to start with a sandbox distribution such as Hortonworks, Cloudera, or another vendor.
• Use Linux machines for developers, as the Hadoop ecosystem is designed around Linux and that is usually the only supported platform in production.
• Sandbox development environments are also provided as virtual machine images that can run on Windows machines for development. Port forwarding and file sharing between host and guest machines can meet most development needs.
• Programming IDEs for Java (Eclipse, IntelliJ) can continue with the current setup.
• For a large team, use Vagrant and build a VM with all required software components.

Stage 2: Cluster Setup - Hardware and Network
• Choose optimal hardware per vendor recommendations for the critical NameNode and ResourceManager/JobTracker as well as the worker nodes. Install and configure a multi-node cluster (24 or 32 nodes) as a proof of concept.
• Isolate the Hadoop ecosystem in separate subnets away from web and other traditional infrastructure. Separate master nodes, worker nodes, client nodes, and management nodes, where master and worker nodes take input only from management nodes.
• Keep client nodes on edge nodes (akin to a DMZ) with end-user access limited to only those. Users should never get access to any Hadoop nodes except the edge/gateway nodes.
• Work with the network architect to choose a topology: the traditional rooted two-switch architecture (using expensive chassis switches) or the spine-and-leaf model preferred for big data (using inexpensive, scale-out fixed switches).
• Define rack awareness (the Hadoop topology script referenced in configuration) for all servers so the NameNode can make the right placement decisions (a hedged sketch of one such script follows below).

Stage 3: Pre-Prod components setup and testing
• Execute a POC demo of all components. This need not be a business use case; it is more a technology demonstration.
• Complete one life cycle of development, deployment, and testing of the POC application with all components.
• Install IDS/IPS products in line with the data ingestion path for data coming from external sources (Twitter feeds, etc.) so fictitious data cannot be injected into processing.
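Following on from the rack-awareness point in Stage 2, here is a hedged sketch of a topology script. Hadoop only requires that the configured script be an executable that accepts host IPs or names as arguments and prints one rack path per argument; the IP-to-rack mapping below is entirely hypothetical.

    #!/usr/bin/env python
    # Minimal sketch of a Hadoop rack-awareness (topology) script.
    # Hadoop calls it with one or more IPs/hostnames and reads one
    # rack path per argument from stdout. The mapping is hypothetical.
    import sys

    RACK_MAP = {
        "10.1.1": "/dc1/rack1",
        "10.1.2": "/dc1/rack2",
        "10.1.3": "/dc1/rack3",
    }
    DEFAULT_RACK = "/dc1/default-rack"

    for host in sys.argv[1:]:
        prefix = ".".join(host.split(".")[:3])  # first three octets of an IPv4 address
        print(RACK_MAP.get(prefix, DEFAULT_RACK))

With this in place, the NameNode can keep one block replica off-rack and job scheduling can prefer rack-local reads.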
17. Road Map for Implementation

Stage 4: Early adoption use case
• Identify long-running jobs and queries from the Enterprise Data Warehouse or BI team.
• Formulate a use case, or copy an existing one, to mimic that functionality.
• Design and code Sqoop and Pig scripts to load data into HDFS.
• Design the HDFS directory layout and Hive schema before the data load (a hedged DDL sketch follows below).
• Replicate the existing functionality using Hadoop ecosystem components.
• Adopt architectural standards for data format and compression (defined in the next section).

Stage 5: Mature use case such as MDM
• Provide data exploration and analysis to identify, locate, and load internal and external third-party data files into the landing zone.
• Clean the data per business requirements.
• Prepare - design and develop big data structures and tables to load the data files.
• Analyze - provide productivity tools for data access and analysis.
• Implement enterprise data transformation tools such as Cascading if required.

Stage 6: Data Science driven analytic use case
• Work with the business to identify a data-science-driven use case.
• Set up the big data ecosystem and environment.
• Engineer with the Data Science team to design the prediction model.
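To make Stage 4 concrete, here is a hedged sketch of the kind of Hive DDL that might sit over data curated in HDFS (for example, written as ORC by a Pig or Hive transformation step after the Sqoop load). The table name, columns, and paths are hypothetical; the ORC format and date partitioning simply follow the format guidance in the next section.

    -- Hypothetical external table over curated data in HDFS.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_orders (
      order_id     BIGINT,
      customer_id  BIGINT,
      order_total  DECIMAL(12,2),
      order_ts     TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION '/data/warehouse/sales_orders';

    -- Register the newly loaded partition after each ingestion run.
    ALTER TABLE sales_orders ADD IF NOT EXISTS
      PARTITION (load_date='2016-01-15')
      LOCATION '/data/warehouse/sales_orders/load_date=2016-01-15';

Existing warehouse reports can then be replicated against this table with HQL before any optimization work begins.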
18. Architectural and Design Decisions

Data Format, Compression and Storage

File format
• SequenceFiles - recommended for smaller files (below the 64 MB default HDFS block size).
• Avro - recommended for use cases with ever-evolving schemas, since the format is self-describing.
• RC and ORC file formats - columnar formats, efficient for most analytic cases where only a few columns are read.

Compression - it is critical that the chosen approach supports a splittable format, as required for MapReduce:
• Snappy - efficient compression but does not support splits on its own.
• LZO - slower performance but supports splits.
• Bzip2 - supports splits but is less efficient. Use only if storage is a serious concern.

Schema design - HDFS or HBase, with the latter providing random read/write support. Use a standard directory structure and a standard organization of data.

Data Movement
• Network - ensure appropriate bandwidth requirements are met.
• Use Sqoop for data ingestion (a hedged example follows below).
• Use a single hop from source to HDFS where possible.
• Use database-specific connectors whenever available.
• Apply the "Goldilocks method" of Sqoop performance tuning (increase parallelism gradually until throughput stops improving or the source database shows strain).
• Load many tables in parallel with Fair Scheduler throttling.
• Ask the following questions and design to minimize network traffic:
  • Timeliness of data ingestion
  • Incremental updates
  • Data transformation - does data need to be transformed in flight?
  • Source system structure, layout, and proximity

Data Processing
• MapReduce low-level Java API - use only if high flexibility and fine control of MapReduce is required.
• Spark and Storm are clear choices for most enterprise-level processing.
• Use abstractions like Pig and HQL for early adoption or for data analysts. Leverage modern performance improvements like Tez, which performs better than MapReduce for Pig and HQL.
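As a hedged illustration of the Sqoop guidance above, the following command sketches a single-hop import of one table into HDFS as splittable, Snappy-compressed Avro. The connection string, table, and target path are hypothetical, and a database-specific (direct) connector can be enabled where the source database supports one.

    # Hypothetical single-hop import into HDFS as Snappy-compressed Avro.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --split-by order_id \
      --num-mappers 8 \
      --target-dir /data/raw/orders/2016-01-15 \
      --as-avrodatafile \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec

Per the Goldilocks approach, --num-mappers would be tuned up gradually until throughput stops improving or the source database shows strain.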
19. Architectural and Design Decisions

Patterns - use patterns for commonly occurring problems:
• Removing duplicate keys.
• Windowing analysis - helps in arriving at the peaks and lows of, say, an equity (see the sketch below).
• Updating time-series data - one common way of storing this information is as versions in HBase. HBase can store every change made to a record as versions. Versions are defined at the column level and ordered by modification timestamp, so you can go back to any point in time and see what the record looked like.

Graph Processing
• Giraph - recommended for mature graph-processing cases, but only for expert graph programmers.
• Spark GraphX - recommended, as it is available within the same Spark stack that is set to replace MapReduce.

Workflows
• While most data pipelines and processing can be run manually with scripts, an enterprise should leverage Oozie and Falcon; they provide robust mechanisms for processing.

Programming languages
• Hadoop is developed in Java, and Java is the core language choice for all interfaces such as Spark, Storm, and MapReduce (abstractions like Pig and HQL are excluded).
• However, as Hadoop matures, Python is becoming a language of choice. If development teams are already familiar with Java, continue with Java.

Unit Testing
• All languages and abstractions for Hadoop offer unit testing (including Pig), and it should be used just as in any enterprise application.
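As a hedged sketch of the windowing-analysis pattern above, the following PySpark snippet computes a rolling five-day high and low per equity symbol; the source table and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("equity-window-analysis").getOrCreate()

    # Hypothetical Hive table of daily closes: (symbol, trade_date, close_price)
    prices = spark.table("equity_daily_prices")

    # Rolling window of the current row plus the four preceding rows, per symbol.
    w = (Window.partitionBy("symbol")
               .orderBy("trade_date")
               .rowsBetween(-4, 0))

    result = (prices
              .withColumn("rolling_high", F.max("close_price").over(w))
              .withColumn("rolling_low", F.min("close_price").over(w)))

    result.show()

The same pattern extends to peak/trough detection or smoothing by swapping in other aggregate functions over the window.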