Overview, HDFS and
Map/Reduce
Hadoop Workshop 1
- SW Engineer who has worked as designer / developer on
NOSQL (Mongo, Cassandra, Hadoop)
- Consultant – HBO, ACS, ACXIOM (GM), AT&T
- Specialize in SW Development, architecture and training
- Currently working with Cassandra, Storm, Kafka, Node.js and
MongoDB
- brian.enochson@gmail.com
Hadoop Workshop 2
• Intros, Agenda and software
• Hadoop Overview, Ecosystem and Use Cases
• HDFS – Hadoop Distributed File System
• Map/Reduce – how to create and use
• Hadoop Streaming and Pig
• A little bit about the supporting cast, the Hadoop ecosystem
Hadoop Workshop 3
What is your experience with, and
interest in, Big Data,
NoSQL, Hadoop and software
development?
Hadoop Workshop 4
• Cloudera QuickStart VM
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
We are using Cloudera; there are also Hortonworks, MapR,
plain Apache Hadoop and a multitude of other players.
Hadoop Workshop 5
• The reason for products like Hadoop is the emergence of Big
Data.
• Big Data is not new, but it is more common now because we are
more capable of collecting it.
• Big Data is not just about size. It is also about frequency of delivery
and the unstructured nature of the data.
• The RDBMS was about structure and definition. In many cases it fails
to handle these data requirements.
Hadoop Workshop 6
Hadoop Workshop 7
Hadoop Workshop 8
• Not Only SQL. Doesn't mean “No” SQL
• Category of products developed to handle big data
demands.
• Built to handle distributed architecture
• Take advantage of large clusters of commodity hardware
• There are trade-offs; you live with the CAP Theorem
Hadoop Workshop 9
• Hadoop itself is not a NoSQL product
• HBase, which is part of the Hadoop ecosystem, is a NoSQL
product.
• Hadoop is a computing framework (a lot more to come)
and typically runs on a cluster of machines
• Examples of NoSQL products: Cassandra, MongoDB, Riak,
CouchDB and HBase.
Hadoop Workshop 10
Where does the name come from?
The name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those
are my naming criteria. Kids are good at generating such.
Googol is a kid’s term.
• This is from Doug Cutting, inventor of Hadoop (and also of
Lucene, the search engine, and, along with Mike Cafarella, of
Nutch, a web crawler, among various other items).
Hadoop was created to process data created by Nutch.
Hadoop Workshop 11
• Hadoop is an open source framework for processing
large amounts of data in batch.
• Designed from the ground up to scale out as you add
hardware.
• Hadoop Core has three main components:
  • Common
  • HDFS
  • MapReduce
Hadoop Workshop 12
• The first is called Common and contains (as the name
implies) common functionality.
• In addition Hadoop has its own file system called HDFS
that is made to be fault tolerant and supplies the
cornerstone to let it run on commodity hardware.
• The final component making up the Hadoop system is
called MapReduce and it implements the model that
allows processing the data in a parallel manner.
Hadoop Workshop 13
• Hadoop Distributed File System (aka HDFS, or Hadoop
DFS)
• Runs on top of regular OS file system, typically Linux
ext3
• Fixed-size blocks (64 MB by default in Hadoop 1, 128 MB in Hadoop 2) that are replicated
• Write once, read many; optimized for streaming in and
out
Hadoop Workshop 14
• Responsible for running a job in parallel on many servers
• Handles re-trying a task that fails
• Validating complete results
• Jobs consist of special “map” and “reduce” operations
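The retry behaviour can be pictured with a small plain-Java sketch: run a task, and if it throws, try again until an attempt limit is reached. This is not how the JobTracker is implemented; the class name, the `Callable`-based task and the limit of 4 attempts are assumptions for illustration (4 matches the usual Hadoop 1 default for `mapred.map.max.attempts`, but check your own configuration).

```java
import java.util.concurrent.Callable;

// Hypothetical illustration of bounded task retries; not Hadoop API.
final class RetrySketch {

    // Runs the task up to maxAttempts times and returns the first successful result.
    static <T> T runWithRetries(Callable<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Remember the failure and fall through to the next attempt.
                lastFailure = e;
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        // Every attempt failed: give up and report the last cause.
        throw new IllegalStateException(
                "task failed after " + maxAttempts + " attempts", lastFailure);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated task that fails twice before succeeding.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```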
Hadoop Workshop 15
• Has one “master” server - high quality, beefy box.
  • NameNode process - manages file system
  • JobTracker process - manages tasks
• Has multiple “slave” servers - commodity hardware.
  • DataNode process - manages file system blocks on local drives
  • TaskTracker process - runs tasks on server
• Uses high speed network between all servers
Hadoop Workshop 16
Hadoop Workshop 17
• Hadoop is a lot more than HDFS and MapReduce.
• There is a large (and ever growing) ecosystem built up
around Hadoop.
• These projects make MapReduce less complex, provide
monitoring and job control, add persistence systems
and add real-time capabilities.
Hadoop Workshop 18
• Avro
A serialization system for efficient, cross-language
RPC and persistent data storage.
• Mahout
Machine Learning System
• Pig
High-level query language (Pig Latin) that creates MapReduce
jobs.
• Hive
A distributed data warehouse. Hive manages data
stored in HDFS and provides a query language based on
SQL (and which is translated by the runtime engine to
MapReduce jobs) for querying the data.
Hadoop Workshop 19
• HBase
A distributed, column-oriented database. HBase uses
HDFS for its underlying storage, and supports both batch-style
computations using MapReduce and point queries (random
reads).
• ZooKeeper
A distributed, highly available coordination service.
ZooKeeper provides primitives such as distributed locks that can
be used for building distributed applications.
• Sqoop
A tool for efficient bulk transfer of data between
structured data stores (such as relational databases) and HDFS.
• Oozie
A service for running and scheduling workflows of
Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop
jobs).
Hadoop Workshop 20
• A new MapReduce runtime, called MapReduce 2,
implemented on a new system called YARN (Yet Another
Resource Negotiator), which is a general resource
management system for running distributed applications.
MR2 replaces the “classic” runtime in previous releases.
• HDFS federation, which partitions the HDFS namespace
across multiple namenodes to support clusters with very
large numbers of files.
• HDFS high-availability, which removes the namenode
as a single point of failure by supporting standby
namenodes for failover.
Hadoop Workshop 21
• YARN was added in Hadoop 2
• It is a resource-management platform responsible for
managing compute resources in clusters and using them for
scheduling of users' applications
• YARN can run applications that do not follow the MapReduce
model, unlike the original MapReduce runtime (MR1)
• YARN provides the daemons and APIs necessary to develop
generic distributed applications of any kind, handles and
schedules resource requests (such as memory and CPU) from
such applications, and supervises their execution
Hadoop Workshop 22
Any questions on the Hadoop Overview?
Hadoop Workshop 23
• Hadoop Distributed File System
• Single NameNode and a cluster of DataNodes
• Stores large files (GB and up) across nodes
• Reliability is through replication; the default replication factor is 3.
Hadoop Workshop 24
• The NameNode is the master and, in Hadoop 1, a single point of failure (SPOF).
• Hadoop 2 added failover to remove the SPOF aspect
• Contains
  • Directory tree of all the files
  • Keeps track of where all the data resides across the cluster
• For really large clusters, tracking these files becomes an issue;
Hadoop 2 added NameNode Federation to fix that.
Hadoop Workshop 25
• http://localhost:50070/
• Can view information on the NameNode and blocks, and view
logs.
Hadoop Workshop 26
• Files stored on HDFS are chunked and stored as blocks
on DataNodes
• DataNodes manage the storage attached to the nodes that they run on.
• File data never flows through the NameNode, only through DataNodes
Hadoop Workshop 27
• Typically file system blocks are a few kilobytes.
• HDFS typically uses a block size of 64MB or 128MB.
• The large size minimizes the cost of seeking data.
• Blocks of a file do not need to be on the same node, so large
files in HDFS often span multiple nodes
• >hadoop fsck / -files -blocks
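As a rough illustration of the block arithmetic above, the following plain-Java sketch computes how many blocks a file occupies and how much raw disk its replicas consume. The class and method names are hypothetical, and the 64 MB block size and replication factor of 3 are simply the values quoted in these slides; a real cluster takes both from its configuration.

```java
// Hypothetical helper, not part of the Hadoop API: back-of-the-envelope HDFS block math.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceil(fileBytes / blockBytes).
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("fileBytes must be >= 0 and blockBytes > 0");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Size of the final block; the last block only holds the bytes that remain.
    static long lastBlockBytes(long fileBytes, long blockBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        long remainder = fileBytes % blockBytes;
        return remainder == 0 ? blockBytes : remainder;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 200 * MB;
        long block = 64 * MB;
        System.out.println("blocks      = " + blockCount(file, block));
        System.out.println("last block  = " + lastBlockBytes(file, block) / MB + " MB");
        System.out.println("raw storage = " + rawBytes(file, 3) / MB + " MB");
    }
}
```

For a 200 MB file this reports 4 blocks, a final block of 8 MB and 600 MB of raw storage.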
Hadoop Workshop 28
Hadoop Workshop 29
Hadoop Workshop 30
• You can interact with HDFS from the command line using
“hadoop fs”
• >hadoop fs -help
• >hadoop fs -ls /
https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
Hadoop Workshop 31
• Get some data
• http://www.gutenberg.org/ebooks/20417
• http://www.gutenberg.org/ebooks/5000
• http://www.gutenberg.org/ebooks/4300
• Save to the local directory /tmp/gutenberg
Hadoop Workshop 32
• hadoop fs -mkdir /tmp
• hadoop fs -mkdir /tmp/gutenberg
• hadoop fs -copyFromLocal /tmp/gutenberg/* /tmp/gutenberg
Check it:
hadoop fs -ls /tmp/gutenberg/
Or
http://localhost:50075/
Hadoop Workshop 33
• Let's rename them
hadoop fs -mv /tmp/gutenberg/pg20417.txt /tmp/gutenberg/pg20417.txt.bak
hadoop fs -mv /tmp/gutenberg/pg4300.txt /tmp/gutenberg/pg4300.txt.bak
hadoop fs -mv /tmp/gutenberg/pg5000.txt /tmp/gutenberg/pg5000.txt.bak
• Then
hadoop fs -copyToLocal /tmp/gutenberg/*.bak /tmp/gutenberg
• And again
ls /tmp/gutenberg
Hadoop Workshop 34
• Quite simple, like normal Java file I/O
• Generally you code against the abstract FileSystem class; for HDFS the
implementation is DistributedFileSystem
• A simple way to read from the file system is
java.net.URL. For example:
InputStream in = new URL("hdfs://host/path").openStream();
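A slightly fuller sketch of the `java.net.URL` approach follows. To stay runnable without a cluster it reads a local `file:` URL backed by a temporary file; the class name, method names and sample text are made up. For an `hdfs://` URL the JVM must first be told about the scheme, which Hadoop does through `FsUrlStreamHandlerFactory`; that call appears only as a comment because it needs the Hadoop jars on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: reads a URL's content the same way the slide's one-liner opens it.
final class UrlReadSketch {

    // Reads everything behind a URL into a String, closing the stream when done.
    static String readAll(URL url) throws IOException {
        StringBuilder text = new StringBuilder();
        try (InputStream in = url.openStream();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Writes content to a temporary local file, reads it back through a file: URL, then cleans up.
    static String writeThenRead(String content) {
        try {
            Path tmp = Files.createTempFile("url-read-sketch", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp.toUri().toURL());
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With Hadoop on the classpath, register the hdfs:// scheme once per JVM first:
        // URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory());
        System.out.print(writeThenRead("hello from a local file\n"));
    }
}
```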
Hadoop Workshop 35
• Listing Files
• Getting Status
• Copying to HDFS
• Copying from HDFS
• Viewing
Hadoop Workshop 36
Any questions on HDFS?
Hadoop Workshop 37
• Splits Processing Into Steps
• Mapper
• Reducer
• These are by nature easily parallelized and can take
advantage of available hardware.
Hadoop Workshop 38
• Key Value Pair: two pieces of data, exchanged between
the Map and Reduce phases. Also sometimes called a
tuple
• Map: the user-defined “map” function in the MapReduce
algorithm; converts each input Key Value Pair to 0...n
output Key Value Pairs
• Reduce: the user-defined “reduce” function in the MapReduce
algorithm; converts each input Key plus all of its
Values to 0...n output Key Value Pairs
• Group: a built-in operation that happens between Map
and Reduce; ensures each Key passed to Reduce
includes all of its Values
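The terms above can be sketched without a cluster. The following plain-Java word count is only an in-memory illustration: the method names `map`, `group` and `reduce` mirror the slide's terminology and are not Hadoop classes, and a real job would run the map and reduce phases in parallel across many machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of map -> group -> reduce; none of this is Hadoop API.
final class MapReduceSketch {

    // Map: one input pair (line number, line text) becomes 0..n output pairs (word, 1).
    // The line number is the input key; word count ignores it.
    static List<Map.Entry<String, Integer>> map(long lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Group: collect every value emitted for the same key, with keys kept in sorted order.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: one key plus all of its values becomes one output pair (word, total).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return new SimpleEntry<>(word, total);
    }

    // Runs the three phases in sequence over the given lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            mapped.addAll(map(i, lines.get(i)));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(mapped).entrySet()) {
            Map.Entry<String, Integer> reduced = reduce(e.getKey(), e.getValue());
            result.put(reduced.getKey(), reduced.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(
                Arrays.asList("the quick brown fox", "the lazy dog", "The end")));
    }
}
```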
Hadoop Workshop 39
Hadoop Workshop 40
• Map translates input keys and values to new keys and
values
• System Groups each unique key with all its values
• Reduce translates the values of each unique key to new
keys and values
Hadoop Workshop 41
• Let's look at an example: combining English-to-foreign-language
translations
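The slides that worked through this example did not survive extraction, so the following is only a guess at its shape: a plain-Java sketch in which each input line pairs an English word with one foreign translation, the map step emits (English word, translation), and the reduce step joins all translations for a word. The sample words, the tab-separated input format, the `|` separator and the class name are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of a dictionary-merging job; not Hadoop API.
final class TranslationMerge {

    // Map step: split "english<TAB>foreign" into a key and a value; null for malformed lines.
    static String[] mapLine(String line) {
        String[] parts = line.split("\t", 2);
        return parts.length == 2 ? parts : null;
    }

    // Group + reduce: gather every translation per English word, then join them with '|'.
    static Map<String, String> merge(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] pair = mapLine(line);
            if (pair != null) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, String> reduced = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            reduced.put(e.getKey(), String.join("|", e.getValue()));
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Illustrative entries, as if read from three separate dictionary files.
        List<String> lines = Arrays.asList(
                "dog\tperro",
                "cat\tgato",
                "dog\tchien",
                "cat\tchat",
                "dog\tHund");
        merge(lines).forEach((english, all) -> System.out.println(english + " -> " + all));
    }
}
```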
Hadoop Workshop 42
Hadoop Workshop 43
• The JobTracker is a single point of failure
• Determines # Mapper Tasks from file splits via
InputFormat
• Uses predefined value for # Reducer Tasks
• Client applications use JobClient to submit jobs and
query status
• From the command line, use hadoop job <commands>
• For the web status console, use http://jobtracker-server:50030/
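To make the point about mapper tasks and file splits concrete, here is a simplified plain-Java estimate of how many map tasks a file-based job gets, assuming one split per split-sized chunk of each input file. Real `InputFormat` implementations apply extra rules (minimum and maximum split sizes, non-splittable compressed files, handling of empty files), so treat the class, its names and the 64 MB split size as illustrative assumptions.

```java
import java.util.Arrays;

// Hypothetical estimate, not Hadoop API: one map task per split-sized chunk of each file.
final class SplitEstimate {
    static final long MB = 1024L * 1024L;

    // Files are split independently, so every non-empty file costs at least one task.
    static long mapTasks(long[] fileSizes, long splitBytes) {
        if (splitBytes <= 0) {
            throw new IllegalArgumentException("splitBytes must be positive");
        }
        long tasks = 0;
        for (long size : fileSizes) {
            tasks += (size + splitBytes - 1) / splitBytes; // ceil(size / splitBytes)
        }
        return tasks;
    }

    public static void main(String[] args) {
        long split = 64 * MB;
        long[] oneBigFile = {1024 * MB};
        long[] manySmallFiles = new long[1024];
        Arrays.fill(manySmallFiles, MB);
        System.out.println("one 1 GB file:     " + mapTasks(oneBigFile, split) + " map tasks");
        System.out.println("1024 files of 1 MB: " + mapTasks(manySmallFiles, split) + " map tasks");
    }
}
```

Both inputs hold 1 GB of data, yet under these assumptions the second needs 1,024 map tasks instead of 16, which is one reason a large number of small files is a poor fit.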
Hadoop Workshop 44
• Spawns each Task as a new child JVM
• Max # mapper and reducer tasks set independently
• Can pass child JVM opts via mapred.child.java.opts
• Can re-use JVM to avoid overhead of task initialization
Hadoop Workshop 45
Hadoop Workshop 46
• MRUnit's MapDriver and ReduceDriver are the key classes
• Configure:
  • The mapper under test
  • The input value
  • The expected output key
  • The expected output value
Hadoop Workshop 47
• Testing A Mapper
• Testing A Reducer
Hadoop Workshop 48
• Not for quick jobs
• Scripts / R programs are often better
• Doesn't work well with a lot of small files in HDFS
• Doesn't have that real-time aspect (see Apache Storm)
• Don't let people tell you Hadoop is the answer for
everything
• Hive and Pig are good alternatives (especially for SQL
speakers)
Hadoop Workshop 49
• Questions?
• Contact Me
brian.enochson@gmail.com
Hadoop Workshop 50

Asbury Hadoop Overview

  • 2. - SW Engineer who has worked as designer / developer on NOSQL (Mongo, Cassandra, Hadoop) - Consultant – HBO, ACS, ACXIOM (GM), AT&T - Specialize in SW Development, architecture and training - Currently working with Cassandra, Storm, Kafka, Node.js and MongoDB - brian.enochson@gmail.com Hadoop Workshop 2
  • 3.
      • Intros, agenda and software
      • Hadoop overview, ecosystem and use cases
      • HDFS: the Hadoop Distributed File System
      • Map/Reduce: how to create and use
      • Hadoop Streaming and Pig
      • A little bit about the supporting cast, the Hadoop ecosystem
  • 4. What are your experience and interests with Big Data, NoSQL, Hadoop and software development?
  • 5.
      • Cloudera QuickStart VM: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
      • We are using Cloudera; there are also Hortonworks, MapR, Apache Hadoop and a multitude of other players.
  • 6.
      • The reason for products like Hadoop is the emergence of Big Data.
      • Big Data is not new, but it is more common now because we are more capable of collecting it.
      • Big Data is not just about size. It is also about the frequency of delivery and the unstructured nature of the data.
      • The RDBMS was about structure and definition. In many cases it fails to handle these data requirements.
  • 9.
      • NoSQL stands for "Not Only SQL". It doesn't mean "No" SQL.
      • A category of products developed to handle big data demands.
      • Built for distributed architectures.
      • Takes advantage of large clusters of commodity hardware.
      • There are trade-offs: you have to live with the CAP theorem.
  • 10.
      • Hadoop itself is not a NoSQL product.
      • HBase, which is part of the Hadoop ecosystem, is a NoSQL product.
      • Hadoop is a computing framework (a lot more to come) and typically runs on a cluster of machines.
      • Examples of NoSQL products: Cassandra, MongoDB, Riak, CouchDB and HBase.
  • 11.
      • Where does the name come from?
      • "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."
      • This is from Doug Cutting, inventor of Hadoop. He also created the Lucene search engine and, along with Mike Cafarella, the Nutch web crawler, among various other projects. Hadoop was created to process the data produced by Nutch.
  • 12.
      • Hadoop is an open source framework for processing large amounts of data in batch.
      • Designed from the ground up to scale out as you add hardware.
      • Hadoop Core has three main components:
          • Common
          • HDFS
          • MapReduce
  • 13.
      • The first is called Common and, as the name implies, contains common functionality.
      • In addition, Hadoop has its own file system, HDFS, which is built to be fault tolerant and is the cornerstone that lets Hadoop run on commodity hardware.
      • The final component is MapReduce, which implements the model that allows data to be processed in parallel.
  • 14.
      • Hadoop Distributed File System (aka HDFS, or Hadoop DFS)
      • Runs on top of a regular OS file system, typically Linux ext3
      • Fixed-size blocks (64 MB by default in Hadoop 1, 128 MB in Hadoop 2) that are replicated
      • Write once, read many; optimized for streaming in and out
  • 15.
      • Responsible for running a job in parallel on many servers
      • Handles retrying a task that fails
      • Validates complete results
      • Jobs consist of special "map" and "reduce" operations
  • 16.
      • Has one "master" server: a high quality, beefy box.
          • NameNode process: manages the file system
          • JobTracker process: manages tasks
      • Has multiple "slave" servers: commodity hardware.
          • DataNode process: manages file system blocks on local drives
          • TaskTracker process: runs tasks on the server
      • Uses a high speed network between all servers
  • 18.
      • Hadoop is a lot more than HDFS and MapReduce.
      • There is a large (and ever growing) ecosystem built up around Hadoop.
      • These projects make MapReduce less complex, provide monitoring and job control, add persistence systems and add real-time capabilities.
  • 19.
      • Avro: a serialization system for efficient, cross-language RPC and persistent data storage.
      • Mahout: a machine learning system.
      • Pig: a high-level query language (Pig Latin) that creates MapReduce jobs.
      • Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
  • 20.
      • HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
      • ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
      • Sqoop: a tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.
      • Oozie: a service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).
  • 21.
      • A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. MR2 replaces the "classic" runtime in previous releases.
      • HDFS federation, which partitions the HDFS namespace across multiple namenodes to support clusters with very large numbers of files.
      • HDFS high-availability, which removes the namenode as a single point of failure by supporting standby namenodes for failover.
  • 22.
      • YARN was added in Hadoop 2.
      • It is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
      • Unlike the original MapReduce runtime (MR1), YARN can run applications that do not follow the MapReduce model.
      • YARN provides the daemons and APIs necessary to develop generic distributed applications of any kind, handles and schedules resource requests (such as memory and CPU) from such applications, and supervises their execution.
  • 23. Any questions on the Hadoop Overview?
  • 24.
      • Hadoop Distributed File System
      • A single NameNode and a cluster of DataNodes
      • Stores large files (a gigabyte and up) across nodes
      • Reliability comes from replication; the default replication factor is 3.
  • 25.
      • The NameNode is the master, and in Hadoop 1 it is a single point of failure (SPOF).
      • Hadoop 2 added failover to remove the SPOF.
      • Contains
          • The directory tree of all the files
          • The location of all the data across the cluster
      • For really large clusters, tracking these files becomes an issue; Hadoop 2 added NameNode federation to address that.
  • 26.
      • http://localhost:50070/
      • You can view information on the NameNode and blocks, and view logs.
  • 27.
      • Files stored on HDFS are chunked and stored as blocks on DataNodes.
      • Each DataNode manages the storage attached to the node it runs on.
      • Data never flows through the NameNode, only through DataNodes.
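The split between block metadata and block storage can be sketched in plain Python. This is only an illustration: `place_blocks`, the node names `dn1` to `dn4`, and the round-robin placement are assumptions made for the example (real HDFS uses a rack-aware placement policy), and none of it is Hadoop API.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Return {block_id: [datanode, ...]}, each block on `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [datanodes[(block_id + i) % len(datanodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


# The "NameNode" holds only metadata: which nodes store each block of each file.
datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names
namenode = {"/tmp/gutenberg/pg4300.txt": place_blocks(2, datanodes)}

# A client asks the NameNode where block 1 lives, then reads it from a DataNode.
locations = namenode["/tmp/gutenberg/pg4300.txt"][1]
print(locations)  # ['dn2', 'dn3', 'dn4']
```

The dictionary plays the role of the NameNode's directory tree and block map; the file bytes themselves would only ever be exchanged with the listed DataNodes.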
  • 28.
      • Typical file system blocks are a few kilobytes.
      • HDFS typically uses a block size of 64 MB or 128 MB.
      • The large size minimizes the cost of seeks.
      • The blocks of a file do not need to be on the same node, so large files in HDFS often span multiple nodes.
      • >hadoop fsck / -files -blocks
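The block arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch; `hdfs_block_count` and `raw_storage_bytes` are names made up for the example, and the replication factor of 3 is the default mentioned earlier.

```python
MB = 1024 * 1024


def hdfs_block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Blocks needed for one file; the last block may be only partly filled."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage_bytes(file_size_bytes, replication=3):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size_bytes * replication


one_gb = 1024 * MB
print(hdfs_block_count(one_gb))            # 16 blocks of 64 MB
print(hdfs_block_count(one_gb, 128 * MB))  # 8 blocks of 128 MB
print(raw_storage_bytes(one_gb) // MB)     # 3072 MB stored with replication 3
```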
  • 31.
      • You can interact with HDFS from the command line using "hadoop fs"
      • >hadoop fs -help
      • >hadoop fs -ls /
      • Reference: https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
  • 32.
      • Get some data
          • http://www.gutenberg.org/ebooks/20417
          • http://www.gutenberg.org/ebooks/5000
          • http://www.gutenberg.org/ebooks/4300
      • Save to /tmp/gutenberg on the local file system
  • 33.
      • hadoop fs -mkdir /tmp
      • hadoop fs -mkdir /tmp/gutenberg
      • hadoop fs -copyFromLocal /tmp/gutenberg/* /tmp/gutenberg
      • Check it: hadoop fs -ls /tmp/gutenberg/
      • Or browse http://localhost:50075/
  • 34.
      • Let's rename them
          hadoop fs -mv /tmp/gutenberg/pg20417.txt /tmp/gutenberg/pg20417.txt.bak
          hadoop fs -mv /tmp/gutenberg/pg4300.txt /tmp/gutenberg/pg4300.txt.bak
          hadoop fs -mv /tmp/gutenberg/pg5000.txt /tmp/gutenberg/pg5000.txt.bak
      • Then
          hadoop fs -copyToLocal /tmp/gutenberg/*.bak /tmp/gutenberg
      • And again
          ls /tmp/gutenberg
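  • 35.
      • Quite simple, like normal Java file I/O
      • Generally you write against the FileSystem abstract class; DistributedFileSystem is the HDFS implementation
      • A simple way to read from the file system is java.net.URL. For example: InputStream in = new URL("hdfs://host/path").openStream(); (Java only understands the hdfs:// scheme after URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) has been called once in the JVM.)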
  • 36.
      • Listing files
      • Getting status
      • Copying to HDFS
      • Copying from HDFS
      • Viewing
  • 37. Any questions on HDFS?
  • 38.
      • Splits processing into steps
          • Mapper
          • Reducer
      • These are by nature easily parallelized and can take advantage of available hardware.
  • 39.
      • Key Value Pair: two pieces of data, exchanged between the Map and Reduce phases. Also sometimes called a tuple.
      • Map: the user-defined "map" function in the MapReduce algorithm converts each input Key Value Pair to 0...n output Key Value Pairs.
      • Reduce: the user-defined "reduce" function in the MapReduce algorithm converts each input Key plus all of its Values to 0...n output Key Value Pairs.
      • Group: a built-in operation between Map and Reduce that ensures each Key passed to Reduce includes all of its Values.
  • 41.
      • Map translates input keys and values to new keys and values
      • The system groups each unique key with all its values
      • Reduce translates the values of each unique key to new keys and values
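The map, group and reduce flow can be simulated in a single Python process. This is a sketch of the programming model only, not of Hadoop's Java API or its distributed execution; `run_job`, `map_fn` and `reduce_fn` are names chosen for the example, and word counting is used as the job.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map: one input (offset, line) pair becomes 0..n (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: one key plus all of its values becomes 0..n output pairs."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    # Group: collect every value emitted by the mapper under its key.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce each unique key, visiting keys in sorted order.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


lines = [(0, "the quick brown fox"), (20, "the lazy dog")]
result = run_job(lines, map_fn, reduce_fn)
print(result)
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

On a cluster the map calls run on many machines and the grouping happens over the network, but the contract of each function is the same as here.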
  • 42. Let's look at combining English to foreign language translations.
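One way to picture the translation example is the Python sketch below. The input format (`english,foreign` lines), the sample words and the `|` separator are all assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict


def map_translation(line):
    """Map: 'english,foreign' becomes the pair (english, foreign)."""
    english, foreign = line.strip().split(",", 1)
    yield english, foreign


def reduce_translations(english, translations):
    """Reduce: join every translation of one English word into a single value."""
    yield english, "|".join(sorted(translations))


def combine_dictionaries(lines):
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_translation(line):
            grouped[key].append(value)  # group step
    combined = {}
    for key in sorted(grouped):
        for out_key, out_value in reduce_translations(key, grouped[key]):
            combined[out_key] = out_value
    return combined


french = ["dog,chien", "cat,chat"]    # hypothetical sample data
spanish = ["dog,perro", "cat,gato"]   # hypothetical sample data
print(combine_dictionaries(french + spanish))
# {'cat': 'chat|gato', 'dog': 'chien|perro'}
```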
  • 44.
      • Is a single point of failure
      • Determines the number of mapper tasks from file splits via the InputFormat
      • Uses a predefined value for the number of reducer tasks
      • Client applications use JobClient to submit jobs and query status
      • From the command line, use hadoop job <commands>
      • For the web status console, use http://jobtracker-server:50030/
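How the number of mapper tasks follows from file splits can be sketched in plain Python. This is a simplification: it assumes non-empty files, one split per block-sized chunk of each file, and that a split never spans two files. `count_map_tasks` and the 64 MB split size are illustrative, and real InputFormat implementations can split differently.

```python
MB = 1024 * 1024


def count_map_tasks(file_sizes_bytes, split_size_bytes=64 * MB):
    """Estimate map tasks as the total number of splits over all input files."""
    return sum(
        (size + split_size_bytes - 1) // split_size_bytes
        for size in file_sizes_bytes
    )


one_large_file = [1000 * MB]
many_small_files = [1 * MB] * 1000

print(count_map_tasks(one_large_file))    # 16
print(count_map_tasks(many_small_files))  # 1000
```

Under these assumptions, many small files produce far more map tasks than one large file of the same total size.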
  • 45.
      • Spawns each task as a new child JVM
      • The maximum numbers of mapper and reducer tasks are set independently
      • Can pass child JVM options via mapred.child.java.opts
      • Can reuse JVMs to avoid the overhead of task initialization
  • 47.
      • MRUnit's MapDriver and ReduceDriver are the key classes
      • Configure:
          • Which mapper is under test
          • The input value
          • The expected output key
          • The expected output value
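The testing pattern (configure the mapper under test, feed it input, state the expected output) can be illustrated without Hadoop. The `MapDriver` class below is a hypothetical Python stand-in written for this sketch; it is not the MRUnit API, which is a Java library.

```python
class MapDriver:
    """Hypothetical stand-in for the driver idea; not the MRUnit API."""

    def __init__(self, mapper):
        self.mapper = mapper      # the mapper under test
        self.inputs = []
        self.expected = []

    def with_input(self, key, value):
        self.inputs.append((key, value))
        return self

    def with_output(self, key, value):
        self.expected.append((key, value))
        return self

    def run_test(self):
        # Run the mapper on every input and compare against the expected pairs.
        actual = [
            pair
            for key, value in self.inputs
            for pair in self.mapper(key, value)
        ]
        assert actual == self.expected, f"expected {self.expected}, got {actual}"
        return actual


def word_mapper(_offset, line):
    """Mapper under test: emits (word, 1) for each word in the line."""
    for word in line.split():
        yield word, 1


(MapDriver(word_mapper)
    .with_input(0, "cat cat dog")
    .with_output("cat", 1)
    .with_output("cat", 1)
    .with_output("dog", 1)
    .run_test())
```

A reducer test would follow the same shape, with a key and a list of values as the input.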
  • 48.
      • Testing a mapper
      • Testing a reducer
  • 49.
      • Not for quick jobs
      • Scripts / R programs are often better
      • Doesn't work well with a lot of small files in HDFS
      • Doesn't have a real-time aspect (see Apache Storm)
      • Don't let people tell you Hadoop is the answer for everything
      • Hive and Pig are good alternatives (especially for SQL speakers)
  • 50.
      • Questions?
      • Contact me: brian.enochson@gmail.com