HADOOP
- Nagarjuna K
-   nagarjunak@outlook.com
Why Hadoop, and what is it?

A tool to process big data
What is BIG Data?
Facebook, Google+, etc.
   Whatever we do gets stored as data or as logs


Machines generate lots of data too
  Cameras, mobile phones, software such as STAAD Pro, automated machines in
   industry, etc.


We are having an online discussion now; certainly
 your reading of this presentation is being recorded as
 data.
What is BIG Data?                         ..continued

 Exponential growth of data -> challenges for Google, Yahoo,
  Microsoft, Amazon

 They need to go through TBs and PBs of data to answer questions such as:

    Which websites and books were popular?
    What kinds of ads appeal to users?


 Existing tools became inadequate for processing such large
  data sets.
Why is the data so BIG?
 A couple of decades ago -> floppy disks

 After that -> CD/DVD drives

 Half a decade ago -> hard drives (500 GB)

 Now -> hard drives (1 TB) are available in abundance
Why is the data so BIG?

So what?

Even the technology for reading data has taken a leap.
Why is the data so BIG?

Year   Device             Data volume   Transfer speed   Time to process
1990   Optical Drive      1370 MB       4.4 MB/s         5 minutes
2012   1 TB SATA Drives   1 TB          100 MB/s         2.5 Hrs
How to handle such BIG data?

 One BIG elephant?
 Or numerous small chickens?
How to handle such BIG data?
Concept of Torrents
  Reduce the time to read data by reading it from multiple sources
   simultaneously.

  Imagine if we had 100 drives, each holding one hundredth of
   the data. Working in parallel, we could read the data in less
   than two minutes.
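
The arithmetic behind these figures can be checked with a short sketch. It uses only the volumes and transfer speeds quoted in the slides. The decimal unit conversion (1 TB = 1,000,000 MB) and the helper name read_time_seconds are assumptions made for illustration. With decimal units the single-drive time comes out at roughly 2.8 hours, a little above the table's figure, and the 100-drive case at about 100 seconds.

```python
# Back-of-the-envelope read times using the figures quoted in the slides.
MB_PER_TB = 1_000_000  # assumption: decimal units (1 TB = 10^6 MB)


def read_time_seconds(volume_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Time to read volume_mb split evenly across `drives` drives read in parallel."""
    return (volume_mb / drives) / speed_mb_per_s


drive_1990 = read_time_seconds(1370, 4.4)
single_2012 = read_time_seconds(1 * MB_PER_TB, 100)
parallel_2012 = read_time_seconds(1 * MB_PER_TB, 100, drives=100)

print(f"1990 drive:       {drive_1990 / 60:.1f} minutes")
print(f"2012, 1 drive:    {single_2012 / 3600:.1f} hours")
print(f"2012, 100 drives: {parallel_2012:.0f} seconds")
```

This ignores seek time, network transfer and coordination overhead, so it is a lower bound on what a real cluster would achieve.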
How to handle such BIG data? -- Issues

  How to handle systems going up and down?

  How to combine the data from all the systems?
Problem 1: System ups and downs
 Commodity hardware is used for data storage and analysis

 Chances of failure are very high

 So, keep redundant copies of the same data across several machines

 If one machine fails, another copy is still available

 Google came up with a file system -> GFS (Google File System), which
  implemented all these details.
GFS
 Divides data into chunks and stores them in the file system

 Can store data in the range of PBs as well
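
The two ideas above, chunking and redundant copies, can be sketched in a few lines. This is a toy illustration, not how GFS or HDFS actually place data. The round-robin placement, the machine names, the tiny chunk size and the helper names split_into_chunks, place_replicas and live_replicas are all assumptions made for the example.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut the data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, machines: list[str], replicas: int) -> dict[int, list[str]]:
    """Hypothetical round-robin placement: chunk i lives on `replicas` consecutive machines."""
    if replicas > len(machines):
        raise ValueError("cannot place more replicas than there are machines")
    return {
        i: [machines[(i + r) % len(machines)] for r in range(replicas)]
        for i in range(num_chunks)
    }


def live_replicas(placement: dict[int, list[str]], failed: set[str]) -> dict[int, list[str]]:
    """For each chunk, list the machines that still hold a copy after failures."""
    return {i: [m for m in hosts if m not in failed] for i, hosts in placement.items()}


chunks = split_into_chunks(b"0123456789" * 10, chunk_size=16)  # 100 bytes -> 7 chunks
placement = place_replicas(len(chunks), ["m1", "m2", "m3", "m4"], replicas=3)

after_one_failure = live_replicas(placement, failed={"m2"})
print(all(after_one_failure.values()))  # every chunk still has at least one copy

after_three_failures = live_replicas(placement, failed={"m1", "m2", "m3"})
lost = [i for i, hosts in after_three_failures.items() if not hosts]
print(lost)  # chunks whose copies were all on failed machines
```

With three copies per chunk, losing one or two machines leaves every chunk readable. Data is lost only when all machines holding a given chunk fail together.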
Problem 2: How to combine the data?
 Data is analyzed across different machines, but how do we merge the results to
  get a meaningful outcome?

 Yes, all (or some) of the data has to travel across the network. Only then
  can the data be merged.

 Doing this is notoriously challenging

 Again Google -> MapReduce
MapReduce
 Provides a programming model -> abstracts the problem away
  from disk reads and writes, transforming it into a computation
  over keys and values.

 Two phases

   Map

   Reduce
So what is Hadoop?
An operating system?

It provides:

 1. A reliable shared storage system (HDFS)

 2. An analysis system (MapReduce)
History of Hadoop
 Google was the first to launch GFS and MapReduce

 They published the MapReduce paper in 2004, announcing
  a brand new technology to the world

 This technology was already well proven inside Google
  by 2004
             MapReduce paper by Google
History of Hadoop
 Doug Cutting saw an opportunity and led the charge
  to develop an open source version of this
  MapReduce system, called Hadoop.

 Soon after, Yahoo and others rallied around to
  support this effort.

 Now Hadoop is a core part of the infrastructure at:
   Facebook, Yahoo, LinkedIn, Twitter ...
History of Hadoop

GFS -> HDFS

MapReduce -> MapReduce
HDFS                               -- A Brief
Design -> Streaming very large files on a commodity cluster.

1. Very Large Files
  MBs to PBs
2. Streaming
  Write-once, read-many approach
  Once a huge data set is in place -> we tend to use the data, not modify it
  Time to read the whole data set matters more than the latency of reading the first record
3. Commodity Cluster
  No high-end servers
  Yes, high chance of failure (but HDFS is tolerant enough)
  Replication is done
MapReduce                        -- A Brief
Large-scale data processing in parallel.

MapReduce provides:
  Automatic parallelization and distribution
  Fault-tolerance
  I/O scheduling
  Status and monitoring
Two phases in MapReduce
  Map
  Reduce
MapReduce                                     -- A Brief

 Map phase
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs


 Reduce phase
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)
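
The two signatures above can be illustrated with a word count that runs in a single process. This is a minimal sketch of the programming model only, not Hadoop's API. The names map_fn, reduce_fn and run_mapreduce and the sample records are hypothetical, and the grouping step in the middle stands in for the shuffle a real cluster performs across the network.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map (in_key, in_value) -> list(out_key, intermediate_value)"""
    # in_key is a line number (unused here), in_value is the line of text.
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce (out_key, list(intermediate_value)) -> list(out_value)"""
    return [sum(intermediate_values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group intermediate values by key, then run reduce per key."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


records = [(0, "big data needs big tools"), (1, "hadoop handles big data")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

Because map_fn looks at one record at a time and reduce_fn at one key at a time, a framework is free to run many copies of each on different machines. That is what makes the model easy to parallelize.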
MapReduce   -- A Brief
Hadoop Cluster
Hadoop Ecosystems
Versions of Hadoop
We will work with either of:

  Apache hadoop-0.20
  Cloudera hadoop - cdh3
Prerequisites
 Core Java

 Acquaintance with Linux will help.

 A Linux installation on your machine.
Thank you 
 Please email your suggestions to   nagarjunak@outlook.com
