Training (Day 1) 
Introduction
Big-data 
Four parameters: 
–Velocity: Streaming data and large volume data movement. 
–Volume: Scale from terabytes to zettabytes. 
–Variety: Manage the complexity of multiple relational and non-relational data types and schemas. 
–Voracity: Produced data must be consumed quickly, before it becomes meaningless.
Not just internet companies 
Big Data shouldn't be a silo; it must be an integrated part of the enterprise information architecture.
Data >> Information >> Business Value 
Retail–By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues. 
Financial Services–By combining data across various groups and services like financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets. 
Government–By collecting and analyzing data across agencies, location and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies. 
Healthcare–Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
Processing Granularity 
Hardware, from fine to coarse granularity: 
–Single-core: single-core, single processor; single-core, multi-processor 
–Multi-core: multi-core, single processor; multi-core, multi-processor 
–Cluster: cluster of processors (single- or multi-core) with shared memory; cluster of processors with distributed memory 
–Grid of clusters 
Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system, hosted by a SAN. 
Beyond that: embarrassingly parallel processing; MapReduce with a distributed file system; cloud computing. 
Corresponding processing levels, from small data to large: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level). 
Reference: Bina Ramamurthy, 2011
How to Process Big Data? 
Need to process large datasets (>100 TB) 
–Just reading 100 TB of data can be overwhelming. 
–Takes ~11 days to read on a standard computer. 
–Takes about a day across a 10 Gbit link (a very high-end storage solution). 
–On a single node (at 50 MB/s): ~23 days. 
–On a 1000-node cluster: ~33 minutes.
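The arithmetic behind these figures is easy to verify; the rates used here (50 MB/s per node, a 10 Gbit/s link) are the ones assumed above:

```python
# Back-of-the-envelope read times for 100 TB at the rates assumed above.
DATA_BYTES = 100e12          # 100 TB

def seconds_to_read(rate_bytes_per_s, nodes=1):
    """Time to scan the full dataset when work is split evenly across nodes."""
    return DATA_BYTES / (rate_bytes_per_s * nodes)

single_node = seconds_to_read(50e6)             # one disk at 50 MB/s
cluster     = seconds_to_read(50e6, nodes=1000) # 1000 such nodes in parallel
link_10gbit = seconds_to_read(10e9 / 8)         # 10 Gbit/s expressed in bytes/s

print(f"single node : {single_node / 86400:.1f} days")   # ~23 days
print(f"1000 nodes  : {cluster / 60:.1f} minutes")       # ~33 minutes
print(f"10 Gbit link: {link_10gbit / 3600:.1f} hours")   # ~22 hours, i.e. about a day
```

The point of the exercise: the dataset itself does not shrink, so the only way to keep scan time reasonable is to divide the read across many nodes.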
Examples 
•Web logs; 
•RFID; 
•sensor networks; 
•social networks; 
•social data (due to the social data revolution), 
•Internet text and documents; 
•Internet search indexing; 
•call detail records; 
•astronomy, 
•atmospheric science, 
•genomics, 
•biogeochemical, 
•biological, and 
•other complex and/or interdisciplinary scientific research; 
•military surveillance; 
•medical records; 
•photography archives; 
•video archives; and 
•large-scale e-commerce.
Not so easy… 
Moving data from storage cluster to computation cluster is not feasible 
In large clusters 
–Failure is expected, rather than exceptional. 
–In large clusters, computers fail every day 
–Data is corrupted or lost 
–Computations are disrupted 
–The number of nodes in a cluster may not be constant. 
–Nodes can be heterogeneous. 
Very expensive to build reliability into each application 
–A programmer worries about errors, data motion, communication… 
–Traditional debugging and performance tools don’t apply 
Need a common infrastructure and standard set of tools to handle this complexity 
–Efficient, scalable, fault-tolerant and easy to use
Why are Hadoop and MapReduce needed? 
The answer to this question comes from another trend in disk drives: 
–seek time is improving more slowly than transfer rate. 
Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. 
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. 
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
Why are Hadoop and MapReduce needed? 
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. 
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. 
MapReduce can be seen as a complement to an RDBMS. 
MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
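To make the seek-versus-streaming trade-off concrete, here is a rough model. The 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not figures from the slides:

```python
# Illustrative comparison: seek-dominated access vs. streaming the whole dataset.
# Assumed drive characteristics: 10 ms per seek, 100 MB/s sustained transfer.
SEEK_S = 0.010
TRANSFER_BPS = 100e6

TOTAL = 1e12                     # 1 TB dataset
RECORD = 100                     # 100-byte records
records = TOTAL / RECORD

# Touch 1% of the records at random locations: one seek per record.
random_1pct = 0.01 * records * (SEEK_S + RECORD / TRANSFER_BPS)

# Stream the entire dataset sequentially: one seek, then pure transfer.
stream_all = SEEK_S + TOTAL / TRANSFER_BPS

print(f"seek to 1% of records: {random_1pct / 86400:.1f} days")
print(f"stream 100% of data  : {stream_all / 3600:.1f} hours")
```

Under these assumptions, randomly updating even 1% of the records takes days, while streaming the full dataset takes hours — which is exactly why MapReduce reads and rewrites data in bulk instead of seeking.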
Why are Hadoop and MapReduce needed?
Hadoop distributions 
Apache™ Hadoop™ 
Apache Hadoop-based Services for Windows Azure 
Cloudera's Distribution Including Apache Hadoop (CDH) 
Hortonworks Data Platform 
IBM InfoSphere BigInsights 
Platform Symphony MapReduce 
MapR Hadoop Distribution 
EMC Greenplum MR (using MapR's M5 Distribution) 
Zettaset Data Platform 
SGI Hadoop Clusters (uses Cloudera distribution) 
Grand Logic JobServer 
OceanSync Hadoop Management Software 
Oracle Big Data Appliance (uses Cloudera distribution)
What’s up with the names? 
When naming software projects, Doug Cutting seems to have been inspired by his family. 
Lucene is his wife's middle name, and her maternal grandmother's first name. 
His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. 
Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
Hadoop features 
A distributed framework for processing and storing data, generally on commodity hardware. 
Completely open source. 
Written in Java 
–Runs on Linux, Mac OS/X, Windows, and Solaris. 
–Client apps can be written in various languages. 
•Scalable: store and process petabytes; scale by adding hardware 
•Economical: 1000’s of commodity machines 
•Efficient: run tasks where data is located 
•Reliable: data is replicated, failed tasks are rerun 
•Primarily used for batch data processing, not real-time / user facing applications
Components of Hadoop 
•HDFS (Hadoop Distributed File System) 
–Modeled on GFS (the Google File System) 
–A reliable, high-bandwidth file system that can store terabytes and petabytes of data. 
•Map-Reduce 
–Uses the map/reduce metaphor from the Lisp language 
–A distributed processing framework that processes the data stored in HDFS as key-value pairs. 
[Diagram: clients (Client 1, Client 2) interact with the DFS and the processing framework; input data flows through Input → Map → Shuffle & Sort → Reduce → Output, with several map tasks feeding the reduce tasks to produce the output data.]
HDFS 
•Very large distributed file system 
–10K nodes, 100 million files, 10 PB 
–Linearly scalable 
–Supports large files (in GBs or TBs) 
•Economical 
–Uses commodity hardware 
–Nodes fail every day; failure is expected, rather than exceptional. 
–The number of nodes in a cluster is not constant. 
•Optimized for batch processing
HDFS Goals 
•Highly fault-tolerant 
–runs on commodity HW, which can fail frequently 
•High throughput of data access 
–Streaming access to data 
•Large files 
–Typical file is gigabytes to terabytes in size 
–Support for tens of millions of files 
•Simple coherency 
–Write-once-read-many access model
HDFS: Files and Blocks 
•Data Organization 
–Data is organized into files and directories 
–Files are divided into uniformly sized large blocks (typically 128 MB) 
–Blocks are distributed across cluster nodes 
•Fault Tolerance 
–Blocks are replicated (default 3) to handle hardware failure 
–Replication based on Rack-Awareness for performance and fault tolerance 
–Keeps checksums of data for corruption detection and recovery 
–The client reads both checksum and data from a DataNode; if the checksum check fails, it tries other replicas
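As a rough sketch of the block arithmetic, using the 128 MB block size and default replication factor of 3 cited above:

```python
import math

BLOCK = 128 * 1024**2   # 128 MB block size, as cited above
REPLICATION = 3         # default replication factor

def block_layout(file_bytes):
    """Number of HDFS blocks for a file and the raw storage it consumes."""
    blocks = math.ceil(file_bytes / BLOCK)
    # The final block only holds the remaining data, so raw usage is the
    # file size times the replication factor, not blocks * BLOCK.
    raw = file_bytes * REPLICATION
    return blocks, raw

blocks, raw = block_layout(1 * 1024**4)   # a 1 TB file
print(blocks)                             # 8192 blocks
print(raw / 1024**4)                      # 3.0 TB of raw storage
```

Each of those 8192 blocks can be placed on a different node, which is what lets a later MapReduce job read the file with thousands of parallel, local reads.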
HDFS: Files and Blocks 
•High Throughput: 
–The client talks to both the NameNode and the DataNodes 
–Data is not sent through the NameNode. 
–Throughput of file system scales nearly linearly with the number of nodes. 
•HDFS exposes block placement so that computation can be migrated to data
HDFS Components 
•NameNode 
–Manages file namespace operations such as opening, creating, and renaming files 
–Maps each file name to its list of blocks and their locations 
–Holds file metadata 
–Handles authorization and authentication 
–Collects block reports from DataNodes on block locations 
–Re-replicates missing blocks 
–Keeps the entire namespace in memory, plus checkpoints and a journal 
•DataNode 
–Handles block storage on multiple volumes and data integrity. 
–Clients access the blocks directly from data nodes for read and write 
–Data nodes periodically send block reports to NameNode 
–Block creation, deletion and replication upon instruction from the NameNode.
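The split of responsibilities can be pictured as two lookup tables. This is a toy illustration (using the sample paths from the architecture diagram), not Hadoop's actual data structures:

```python
# Toy sketch of the NameNode's two core mappings:
# file path -> block ids, and block id -> DataNodes holding a replica.
namespace = {
    "/users/joeYahoo/myFile":        [1, 3],
    "/users/bobYahoo/someData.gzip": [2, 4, 5],
}
block_locations = {
    1: ["dn1", "dn4"], 2: ["dn1", "dn2", "dn3"],
    3: ["dn2", "dn4"], 4: ["dn2", "dn3", "dn4"],
    5: ["dn3", "dn4"],
}

def locate(path):
    """What a client learns from the NameNode before reading DataNodes directly."""
    return [(b, block_locations[b]) for b in namespace[path]]

print(locate("/users/joeYahoo/myFile"))
# [(1, ['dn1', 'dn4']), (3, ['dn2', 'dn4'])]
```

This is why the NameNode is purely a metadata server: the client resolves paths to block locations once, then streams the actual bytes straight from the DataNodes.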
HDFS Architecture 
[Diagram: the NameNode (the master) stores metadata such as name:/users/joeYahoo/myFile -> blocks:{1,3} and name:/users/bobYahoo/someData.gzip -> blocks:{2,4,5}. The DataNodes (the slaves) hold the replicated blocks 1–5. The client fetches metadata from the NameNode, then performs I/O directly against the DataNodes.]
Hadoop DFS Interface 
Simple commands: 
hdfs dfs -ls, -du, -rm, -rmr 
Uploading files: 
hdfs dfs -copyFromLocal foo mydata/foo 
Downloading files: 
hdfs dfs -moveToLocal mydata/foo foo 
hdfs dfs -cat mydata/foo 
Admin: 
hdfs dfsadmin -report
Map Reduce -Introduction 
•Parallel Job processing framework 
•Written in Java 
•Close integration with HDFS 
•Provides : 
–Auto partitioning of job into sub tasks 
–Auto retry on failures 
–Linear Scalability 
–Locality of task execution 
–Plugin based framework for extensibility
Map-Reduce 
•MapReduce programs are executed in two main phases: mapping and reducing. 
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. 
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. 
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over. 
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce 
Map-Reduce Program 
–Based on two functions: Map and Reduce 
–Every Map/Reduce program must specify a Mapper and optionally a Reducer 
–Operate on key and value pairs 
Map-Reduce works like a Unix pipeline: 
cat input | grep | sort | uniq -c | cat > output 
Input | Map | Shuffle & Sort | Reduce | Output 
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 
Map function: takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2) 
Reduce function: takes an intermediate key and its list of values and produces the final key/value pairs: reduce(k2, list(v2)) -> list(k3, v3)
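The two contracts above can be sketched in plain Python. This is a conceptual word-count sketch of the model (mirroring the Unix pipeline analogy), not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):                     # map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):             # reduce(k2, list(v2)) -> list(k3, v3)
    return [(word, sum(counts))]

def map_reduce(records):
    # Map: emit intermediate (key, value) pairs for every input record.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle & sort: bring all values for the same key together.
    intermediate.sort(key=itemgetter(0))
    # Reduce: aggregate each key's list of values.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

lines = enumerate(["the cat", "the dog", "the cat sat"])
print(map_reduce(lines))   # [('cat', 2), ('dog', 1), ('sat', 1), ('the', 3)]
```

In real Hadoop the same two functions are written as a Mapper and a Reducer class, and the framework performs the shuffle-and-sort step across the cluster.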
Map-Reduce on Hadoop
Hadoop and its elements 
[Diagram: input files (File 1 … File N) in HDFS are divided into splits (Split 1 … Split M); on machines 1 … M, a RecordReader turns each split into (key, value) pairs for the mappers (Map 1 … Map M); combiners (Combiner 1 … Combiner C) pre-aggregate map output locally; a partitioner divides the intermediate data into partitions (Partition 1 … Partition P) for the reducers (Reducer 1 … Reducer R, running on machine x), which write the output files (File 1 … File O) back to HDFS.]
Hadoop Eco-system 
•Hadoop Common: The common utilities that support the other Hadoop subprojects. 
•Hadoop Distributed File System (HDFS™): A distributed file system that provides high- throughput access to application data. 
•Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. 
•Other Hadoop-related projects at Apache include: 
–Avro™: A data serialization system. 
–Cassandra™: A scalable multi-master database with no single points of failure. 
–Chukwa™: A data collection system for managing large distributed systems. 
–HBase™: A scalable, distributed database that supports structured data storage for large tables. 
–Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. 
–Mahout™: A scalable machine learning and data mining library. 
–Pig™: A high-level data-flow language and execution framework for parallel computation. 
–ZooKeeper™: A high-performance coordination service for distributed applications.
Exercise – task 
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries. While the last month of data is accessed more frequently, some analytics algorithms build models using historical data as well. 
•Task: 
–Provide an architecture for such a system that meets the following goals: 
–Fast 
–Available 
–Fair 
–Or, provide analytics-algorithm and data-structure design considerations (e.g. k-means clustering, or regression) for three months' worth of this data set. 
•Group / individual presentation
End of session 
Day 1: Introduction

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptDr. Soumendra Kumar Patra
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 

KĂźrzlich hochgeladen (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Hadoop introduction

  • 1. Training (Day 1) Introduction
  • 2. Big-data Four parameters: –Velocity: Streaming data and large volume data movement. –Volume: Scale from terabytes to zettabytes. –Variety: Manage the complexity of multiple relational and non-relational data types and schemas. –Voracity: Produced data has to be consumed fast before it becomes meaningless.
  • 3. Not just internet companies. Big Data shouldn’t be a silo; it must be an integrated part of the enterprise information architecture.
  • 4. Data >> Information >> Business Value Retail – By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing, or product quality issues. Financial Services – By combining data across various groups and services like financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets. Government – By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies. Healthcare – Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost saving.
  • 5. Single-core, single processor; single-core, multi-processor (single-core); multi-core, single processor; multi-core, multi-processor (multi-core); cluster of processors (single or multi-core) with shared memory; cluster of processors with distributed memory (cluster); grid of clusters. Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN. [Diagram: processing granularity, from pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), and mega (block level) to virtual (system level), spanning data sizes from small to large; embarrassingly parallel processing via MapReduce, a distributed file system, and cloud computing. Reference: Bina Ramamurthy, 2011.]
  • 6. How to process big data? Need to process large datasets (>100 TB) –Just reading 100 TB of data can be overwhelming –Takes ~11 days to read on a standard computer –Takes a day across a 10 Gbit link (a very high-end storage solution) –On a single node (@50 MB/s): 23 days –On a 1000-node cluster: 33 minutes
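The timing figures above can be checked with back-of-the-envelope arithmetic; a quick sketch (decimal units, 1 TB = 10^12 bytes):

```python
# Back-of-the-envelope read times for 100 TB, as quoted on the slide.
DATA = 100e12                       # 100 TB in bytes (decimal units)

single_node = DATA / 50e6           # one node reading at 50 MB/s, in seconds
cluster = single_node / 1000        # 1000 nodes reading in parallel

print(round(single_node / 86400))   # days on one node -> 23
print(round(cluster / 60))          # minutes on 1000 nodes -> 33
```

The same arithmetic gives the "~11 days on a standard computer" figure if the single disk streams at about 100 MB/s.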
  • 7. Examples •Web logs; •RFID; •sensor networks; •social networks; •social data (due to the social data revolution); •Internet text and documents; •Internet search indexing; •call detail records; •astronomy, •atmospheric science, •genomics, •biogeochemical, •biological, and •other complex and/or interdisciplinary scientific research; •military surveillance; •medical records; •photography archives; •video archives; and •large-scale e-commerce.
  • 8. Not so easy… Moving data from a storage cluster to a computation cluster is not feasible. In large clusters: –Failure is expected, rather than exceptional –Computers fail every day –Data is corrupted or lost –Computations are disrupted –The number of nodes in a cluster may not be constant –Nodes can be heterogeneous. It is very expensive to build reliability into each application: –A programmer worries about errors, data motion, communication… –Traditional debugging and performance tools don’t apply. We need a common infrastructure and standard set of tools to handle this complexity –Efficient, scalable, fault-tolerant and easy to use
  • 9. Why are Hadoop and MapReduce needed? The answer to this question comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
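To make the seek-versus-transfer trade-off concrete, here is a small sketch using assumed disk figures (a 10 ms average seek and a 100 MB/s transfer rate; both numbers are illustrative, not from the slides):

```python
SEEK_MS = 10          # assumed average seek time, milliseconds
RATE = 100e6          # assumed transfer rate, bytes/second

def read_time(total_bytes, chunk_bytes):
    """Seconds to read total_bytes in randomly placed chunk_bytes pieces."""
    seeks = total_bytes / chunk_bytes           # one seek per chunk
    return seeks * SEEK_MS / 1000 + total_bytes / RATE

one_gb = 1e9
streaming = read_time(one_gb, one_gb)   # one seek, then stream: ~10 s
random_4k = read_time(one_gb, 4096)     # seek-dominated: ~2450 s

print(streaming, random_4k)
```

With seek-dominated access the same gigabyte takes hundreds of times longer, which is why MapReduce is built around streaming reads of large blocks.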
  • 10. Why are Hadoop and MapReduce needed? On the other hand, for updating a small proportion of records in a database, a traditional B-tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-tree is less efficient than MapReduce, which uses sort/merge to rebuild the database. MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
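The sort/merge rebuild mentioned above can be sketched as one sequential merge of sorted runs, in contrast to per-record seeks; a toy example (illustrative only, not Hadoop code):

```python
import heapq

def rebuild(*sorted_runs):
    """Merge sorted (key, value) runs sequentially; later runs win on ties."""
    merged = {}
    for key, value in heapq.merge(*sorted_runs, key=lambda kv: kv[0]):
        merged[key] = value   # one streaming pass, no random seeks
    return sorted(merged.items())

base = [(1, "a"), (2, "b"), (3, "c")]   # existing sorted dataset
updates = [(2, "B"), (4, "d")]          # sorted batch of updates
print(rebuild(base, updates))           # [(1, 'a'), (2, 'B'), (3, 'c'), (4, 'd')]
```

Every record is touched once at streaming speed, which beats a B-tree's per-record seeks when most of the dataset changes.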
  • 11. Why are Hadoop and MapReduce needed?
  • 12. Hadoop distributions Apache™ Hadoop™; Apache Hadoop-based Services for Windows Azure; Cloudera’s Distribution Including Apache Hadoop (CDH); Hortonworks Data Platform; IBM InfoSphere BigInsights; Platform Symphony MapReduce; MapR Hadoop Distribution; EMC Greenplum MR (using MapR’s M5 Distribution); Zettaset Data Platform; SGI Hadoop Clusters (uses Cloudera distribution); Grand Logic JobServer; OceanSync Hadoop Management Software; Oracle Big Data Appliance (uses Cloudera distribution)
  • 13. What’s up with the names? When naming software projects, Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
  • 14. Hadoop features Distributed framework for processing and storing data, generally on commodity hardware. Completely open source. Written in Java –Runs on Linux, Mac OS X, Windows, and Solaris. –Client apps can be written in various languages. •Scalable: store and process petabytes; scale by adding hardware •Economical: 1000s of commodity machines •Efficient: run tasks where the data is located •Reliable: data is replicated; failed tasks are rerun •Primarily used for batch data processing, not real-time / user-facing applications
  • 15. Components of Hadoop •HDFS (Hadoop Distributed File System) –Modeled on GFS –A reliable, high-bandwidth file system that can store TBs and PBs of data. •Map-Reduce –Uses the map/reduce metaphor from the Lisp language –A distributed processing framework paradigm that processes the data stored on HDFS as key-value pairs. [Diagram: clients submit input data to the DFS and processing framework; input flows through Map, Shuffle & Sort, and Reduce stages to produce output data.]
  • 16. HDFS •Very large distributed file system –10K nodes, 100 million files, 10 PB –Linearly scalable –Supports large files (in GBs or TBs) •Economical –Uses commodity hardware –Nodes fail every day. Failure is expected, rather than exceptional. –The number of nodes in a cluster is not constant. •Optimized for batch processing
  • 17. HDFS Goals •Highly fault-tolerant –runs on commodity HW, which can fail frequently •High throughput of data access –Streaming access to data •Large files –Typical file is gigabytes to terabytes in size –Support for tens of millions of files •Simple coherency –Write-once-read-many access model
  • 18. HDFS: Files and Blocks •Data Organization –Data is organized into files and directories –Files are divided into uniformly sized large blocks, typically 128 MB –Blocks are distributed across cluster nodes •Fault Tolerance –Blocks are replicated (default 3) to handle hardware failure –Replication is based on rack awareness for performance and fault tolerance –Keeps checksums of data for corruption detection and recovery –The client reads both checksum and data from the DataNode; if the checksum fails, it tries other replicas
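Given the defaults above (128 MB blocks, replication factor 3), the block count and cluster-wide replica count for a file follow directly; a small sketch:

```python
import math

BLOCK = 128 * 1024**2   # 128 MB block size
REPL = 3                # default replication factor

def block_stats(file_bytes):
    """Return (number of blocks, total replicas stored cluster-wide)."""
    blocks = math.ceil(file_bytes / BLOCK)
    return blocks, blocks * REPL

# A 1 GB file occupies 8 blocks, stored as 24 block replicas in total.
print(block_stats(1024**3))   # (8, 24)
```

Note that even a 1-byte file still occupies one block entry (and three replicas), which is one reason HDFS favors a small number of large files over many tiny ones.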
  • 19. HDFS: Files and Blocks •High Throughput: –Client talks to both NameNode and DataNodes –Data is not sent through the NameNode. –Throughput of the file system scales nearly linearly with the number of nodes. •HDFS exposes block placement so that computation can be migrated to data
  • 20. HDFS Components •NameNode –Manages file namespace operations like opening, creating, renaming etc. –File name to block list + location mapping –File metadata –Authorization and authentication –Collects block reports from DataNodes on block locations –Replicates missing blocks –Keeps ALL namespace in memory plus checkpoints & journal •DataNode –Handles block storage on multiple volumes and data integrity. –Clients access the blocks directly from data nodes for read and write –Data nodes periodically send block reports to the NameNode –Block creation, deletion and replication upon instruction from the NameNode.
  • 21. HDFS Architecture [Diagram: the NameNode (the master) holds metadata mapping file names to block lists (e.g. /users/joeYahoo/myFile → blocks {1,3}; /users/bobYahoo/someData.gzip → blocks {2,4,5}), while the DataNodes (the slaves) store the replicated blocks. The client fetches metadata from the NameNode and performs I/O directly against the DataNodes.]
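The NameNode metadata sketched in the diagram boils down to two maps: file name to block list, and block to replica locations. A toy model of a read lookup (the file paths come from the slide; the DataNode names are hypothetical):

```python
# Hypothetical miniature of the NameNode's in-memory metadata.
namespace = {
    "/users/joeYahoo/myFile": [1, 3],
    "/users/bobYahoo/someData.gzip": [2, 4, 5],
}
# block id -> DataNodes currently holding a replica (illustrative names)
block_locations = {
    1: ["dn1", "dn3"], 2: ["dn2", "dn4"], 3: ["dn1", "dn5"],
    4: ["dn3", "dn4"], 5: ["dn2", "dn5"],
}

def locate(path):
    """A read starts here: resolve a path to (block, replica locations) pairs."""
    return [(b, block_locations[b]) for b in namespace[path]]

print(locate("/users/joeYahoo/myFile"))   # [(1, ['dn1', 'dn3']), (3, ['dn1', 'dn5'])]
```

After this lookup the client streams block data straight from the listed DataNodes, which is why the NameNode never becomes a data bottleneck.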
  • 22. Hadoop DFS Interface Simple commands: hdfs dfs -ls, -du, -rm, -rmr Uploading files: hdfs dfs -copyFromLocal foo mydata/foo Downloading files: hdfs dfs -moveToLocal mydata/foo foo hdfs dfs -cat mydata/foo Admin: hdfs dfsadmin -report
  • 23. Map Reduce – Introduction •Parallel job processing framework •Written in Java •Close integration with HDFS •Provides: –Auto partitioning of a job into sub-tasks –Auto retry on failures –Linear scalability –Locality of task execution –Plugin-based framework for extensibility
  • 24. Map-Reduce •MapReduce programs are executed in two main phases, called –mapping and –reducing. •In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. •In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. •The mapper is meant to filter and transform the input into something that the reducer can aggregate over. •MapReduce uses lists and (key/value) pairs as its main data primitives.
  • 25. Map-Reduce Map-Reduce Program –Based on two functions: Map and Reduce –Every Map/Reduce program must specify a Mapper and optionally a Reducer –Operate on key and value pairs Map-Reduce works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output Input | Map | Shuffle & Sort | Reduce | Output cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist Map function: takes a key/value pair and generates a set of intermediate key/value pairs map(k1, v1) -> list(k2, v2) Reduce function: takes intermediate values and associates them with the same intermediate key reduce(k2, list(v2)) -> list(k3, v3)
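The map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(k3, v3) signatures above can be illustrated with the classic word count. This is a plain-Python simulation of the three phases (the shuffle-and-sort grouping is done in memory here; in Hadoop the framework does it across the cluster):

```python
from collections import defaultdict

def mapper(key, line):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for each word."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """reduce(k2, list(v2)) -> list(k3, v3): sum the counts per word."""
    return [(word, sum(counts))]

def run(lines):
    # Map phase: apply the mapper to every input record
    intermediate = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    # Shuffle & sort: group intermediate values by key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reducer call per key
    return dict(kv for k in sorted(groups) for kv in reducer(k, groups[k]))

print(run(["the cat", "the dog"]))   # {'cat': 1, 'dog': 1, 'the': 2}
```

The same mapper and reducer, written as Java classes (or as streaming scripts), are what a real Hadoop job would ship to the cluster.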
  • 27. Hadoop and its elements [Diagram: input files (File 1 … File N) in HDFS are divided into splits (Split 1 … Split M); on each machine a record reader feeds (key, value) pairs to a mapper (Map 1 … Map M), whose output passes through optional combiners and a partitioner into partitions (Partition 1 … Partition P); reducers (Reducer 1 … Reducer R) consume the partitions and write the output files (File 1 … File O) back to HDFS.]
  • 28. Hadoop Eco-system •Hadoop Common: The common utilities that support the other Hadoop subprojects. •Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. •Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. •Other Hadoop-related projects at Apache include: –Avro™: A data serialization system. –Cassandra™: A scalable multi-master database with no single points of failure. –Chukwa™: A data collection system for managing large distributed systems. –HBase™: A scalable, distributed database that supports structured data storage for large tables. –Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. –Mahout™: A scalable machine learning and data mining library. –Pig™: A high-level data-flow language and execution framework for parallel computation. –ZooKeeper™: A high-performance coordination service for distributed applications.
  • 29. Exercise – task You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well. •Task: –Provide an architecture for such a system to meet the following goals: –Fast –Available –Fair –Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering, or regression) on this data set covering 3 months. •Group / individual presentation
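For the k-means option, one iteration maps naturally onto MapReduce: the map step assigns each point to its nearest centroid, and the reduce step averages each cluster. A minimal in-memory sketch of one iteration on 1-D values (illustrative only, not a solution to the exercise):

```python
from collections import defaultdict

def kmeans_step(points, centroids):
    """One MapReduce-style k-means iteration on 1-D points."""
    # Map: emit (index of nearest centroid, point)
    assigned = defaultdict(list)
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        assigned[idx].append(p)
    # Reduce: the new centroid is the mean of its assigned points
    return [sum(assigned[i]) / len(assigned[i]) if assigned[i] else c
            for i, c in enumerate(centroids)]

print(kmeans_step([1.0, 2.0, 10.0, 11.0], [0.0, 12.0]))   # [1.5, 10.5]
```

Iterating this step until the centroids stop moving gives the full algorithm; in Hadoop each iteration is one MapReduce job, with the centroids broadcast to the mappers.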
  • 30. End of session Day 1: Introduction