Introduction to Apache Hadoop
BACS 488 – February 6, 2012
Monfort College of Business
Christopher Pezza
Overview
• Data Storage and Analysis
• Comparison with other Systems
• HPC and Grid Computing
• Volunteer Computing
• History of Hadoop
• Analyzing Data with Hadoop
• Hadoop in the Enterprise
• The Collective Wisdom of the Valley
The Problem
• IDC estimates that the digital universe had grown to 1.8 zettabytes by the end of 2011
  ◦ 1 zettabyte = 1,000 exabytes = 1 million petabytes
• Individual data footprints are growing
• Storing and analyzing datasets in the petabyte range requires new and innovative solutions
The Problem
• Storage capacities of hard drives have increased, but transfer rates have not kept up
  ◦ Solution: read from multiple disks at once
• Hardware failure: once you use many pieces of hardware, the chance that one fails is high, so data must be replicated
• Most analysis tasks need to be able to combine data from multiple disks in some way
What Hadoop provides:
• The ability to read and write data in parallel, to or from multiple disks
• Enables applications to work with thousands of nodes and petabytes of data
• A reliable shared storage and analysis system: HDFS for storage, MapReduce for analysis (see the sketch after this list)
• A free, open source license
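To make the storage half concrete, here is a minimal Java sketch of writing a file to HDFS and reading it back through the org.apache.hadoop.fs API. This example is not from the slides; the path and file contents are hypothetical, and the cluster address is assumed to come from the core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS round trip: write a small file, then read it back.
// The path below is a hypothetical example.
public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);          // the cluster's default filesystem (HDFS)

    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if present
      out.writeBytes("Hello, HDFS\n");             // blocks are replicated across DataNodes
    }

    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());           // prints: Hello, HDFS
    }
  }
}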
Who uses Hadoop?
MapReduce vs. RDBMS
• MapReduce premise: the entire dataset—or at least a good portion of it—is processed for each query
  ◦ MapReduce is a batch query processor
• Another trend: seek time is improving more slowly than transfer rate
• MapReduce is good for analyzing the whole dataset, whereas an RDBMS is good for point queries or updates
MapReduce vs. RDBMS
            Traditional RDBMS          MapReduce
Data Size   Gigabytes                  Petabytes
Access      Interactive and batch      Batch
Updates     Read and write many times  Write once, read many times
Structure   Static schema              Dynamic schema
Integrity   High                       Low
Scaling     Nonlinear                  Linear

• MapReduce suits applications where the data is written once and read many times, whereas an RDBMS is good for datasets that are continually updated
Data Structure
• Structured data – data organized into entities that have a defined format
  ◦ The realm of the RDBMS
• Semi-structured data – there may be a schema, but it is often ignored; the schema serves only as a guide to the structure of the data
• Unstructured data – doesn't have any particular internal structure
• MapReduce works well with semi-structured and unstructured data
More differences…
• Relational data is often normalized to retain its integrity and remove redundancy
• Normalization poses problems for MapReduce because it makes reading a record a nonlocal operation
• MapReduce is a linearly scalable programming model: double the input data and a job runs twice as slowly, but also double the cluster and it runs as fast as the original
• Over time, the differences between RDBMS and MapReduce are likely to blur
HPC and Grid Computing
• The approach in HPC is to distribute the work across a cluster of machines that access a shared filesystem, hosted by a SAN
  ◦ With very large datasets, network bandwidth becomes the bottleneck and compute nodes sit idle
• MapReduce tries to co-locate the data with the compute node, so data access is fast because it is local
  ◦ It conserves bandwidth by explicitly modeling network topology
Handling Partial Failure
• MapReduce – the implementation detects failed map or reduce tasks and reschedules replacements on healthy machines
• Shared-nothing architecture – tasks have no dependence on one another
• By contrast, MPI programs have to explicitly manage their own checkpointing and recovery
Why is MapReduce cool?
• Invented by engineers at Google as a system for building production search indexes, because they found themselves solving the same problem over and over again
• A wide range of algorithms can be expressed in it:
  ◦ Image analysis
  ◦ Graph-based problems
  ◦ Machine learning
Volunteer Computing
• SETI@home
• MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate-bandwidth interconnects
• SETI@home runs a perpetual computation on untrusted machines across the Internet, with highly variable connection speeds and no data locality
History of Hadoop
• Created by Doug Cutting
• 2002 – Apache Nutch, an open source web search engine
• 2003 – Google publishes a paper describing the architecture of its distributed filesystem, GFS
• 2004 – Nutch Distributed Filesystem (NDFS)
• 2004 – Google publishes a paper on MapReduce
• 2005 – Nutch MapReduce implementation
• 2006 – Hadoop is created; Cutting joins Yahoo!
• 2008 – Yahoo! announces that its production search index is generated by a 10,000-core Hadoop cluster
Hadoop Projects
• Common
• Avro
• MapReduce
• HDFS
• Pig
• Hive
• HBase
• ZooKeeper
• Sqoop
Analyzing Data with Hadoop
• Case: NCDC weather data
  ◦ What's the highest recorded global temperature for each year in the dataset?
• We express our query as a MapReduce job
• MapReduce breaks the processing into two phases: map and reduce
• The input to our map phase is raw NCDC data
• Map function: pull out the year and air temperature, and filter out temperatures that are missing, suspect, or erroneous (sketched in Java below)
• Reduce function: find the maximum temperature for each year
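A sketch of such a map function, closely following the MaxTemperature example in White's Hadoop: The Definitive Guide (the book this deck cites). The fixed column offsets assume the NCDC record format, in which temperatures are recorded in tenths of a degree Celsius; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // NCDC's sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);    // year field of the fixed-width record
    int airTemperature;
    if (line.charAt(87) == '+') {            // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Filter out missing, suspect, or erroneous readings, then emit (year, temperature)
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}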
MapReduce Example
• The map function extracts the year and temperature:
  ◦ (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
• MapReduce sorts and groups the data by key:
  ◦ (1949, [111, 78])
  ◦ (1950, [0, 22, -11])
• The reduce function iterates through each list and picks the maximum (sketched below):
  ◦ (1949, 111)
  ◦ (1950, 22)
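The matching reduce function, again following the book's MaxTemperature example; class names are illustrative and minor API details vary between Hadoop releases.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {              // e.g. (1949, [111, 78])
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));  // e.g. (1949, 111)
  }
}

A small driver ties the two together and submits the job to the cluster:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = new Job();                     // Hadoop 1.x era style; later versions use Job.getInstance()
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw NCDC input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not already exist

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}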
Hadoop in the Enterprise
• Accelerate nightly batch business processes
• Storage of extremely high volumes of data
• Creation of automatic, redundant backups
• Improving the scalability of applications
• Use of Java for data processing instead of SQL
• Producing just-in-time feeds for dashboards and BI
• Handling urgent, ad hoc requests for data
• Turning unstructured data into relational data
• Taking on tasks that require massive parallelism
• Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment
Hadoop in the News
• The open source LAMP stack transformed web startup economics 10 years ago
• TechCrunch argues that Hadoop is now displacing expensive proprietary solutions in the same way
• Hadoop's architecture of map-reducing across a cluster of commodity nodes is more flexible and cost-effective than traditional data warehouses
• Three areas of application in startups:
  ◦ Analyzing customer behavior
  ◦ Powering new user-facing features
  ◦ Enabling entirely new lines of business
An interesting point to close on…
• From TechCrunch: "What is most remarkable is how the startup world is collectively creating this ecosystem: Yahoo, Facebook, Twitter, LinkedIn, and other companies are actively adding to the tool chain. This illustrates a new thesis or collective wisdom rising from the valley: If a technology is not your core value-add, it should be open-sourced because then others can improve it, and potential future employees can learn it. This rising tide has lifted all boats, and is just getting started."
Training and Certifications
• Hortonworks – believes that Apache Hadoop will process half of the world's data within the next five years
  ◦ Hortonworks Data Platform – an open source distribution of Apache Hadoop
  ◦ Support, training, and partner-enablement programs designed to assist enterprises and solution providers
  ◦ Hortonworks University
Extra Resources
• Running Hadoop on Ubuntu Linux (Single-Node Cluster)
• Running Hadoop on Amazon EC2
Works Cited
• White, Tom (2011). Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly.
• TechCrunch (July 2011). "Hadoop and Startups: Where Open Source Meets Business Data."
• Wikipedia – Apache Hadoop
• Apache Hadoop website
Editor's Notes
1. Storage capacities: One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in about 5 minutes. Twenty years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than 2.5 hours to read all the data off the disk. Solution: Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under 2 minutes. Using only one hundredth of each disk may seem wasteful, but we can store one hundred datasets, each one terabyte, and provide shared access to them. Hardware failure: As soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way to avoid data loss is replication: the system keeps redundant copies of the data so that in the event of failure another copy is available. This is how RAID works, though the Hadoop Distributed Filesystem (HDFS) takes a slightly different approach. Data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. The important point is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs.
2. Yahoo! – a 10,000-core Linux cluster. Facebook – claims to have the largest Hadoop cluster in the world, at 30 PB.
3. MapReduce enables you to run an ad hoc query against your whole dataset and get the results in a reasonable time. For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. They said: "This data was so useful that we've scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow." Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-tree is less efficient than MapReduce, which uses sort/merge to rebuild the database.
4. Structured data – such as XML documents or database tables that conform to a particular predefined schema (the realm of the RDBMS). Semi-structured data – for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data – e.g., plain text or image data. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
5. Problems for MapReduce – normalization makes reading a record a nonlocal operation, and one of the central assumptions MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes. A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason logfiles of all kinds are particularly well suited to analysis with MapReduce. MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they operate on, so they can be used unchanged for a small dataset and for a massive one. More importantly, if you double the size of the input data, a job will run twice as slowly; but if you also double the size of the cluster, the job will run as fast as the original one. This is not generally true of SQL queries. The lines will blur as relational databases start incorporating some of the ideas from MapReduce and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.
6. The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using APIs such as the Message Passing Interface (MPI). HPC works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes). Data locality is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (e.g., it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modeling network topology. MPI gives great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
7. How do you handle partial failure (when you don't know whether a remote process has failed or not) while still making progress with the overall computation? The shared-nothing architecture is what makes MapReduce able to handle partial failure. From a programmer's point of view, the order in which the tasks run doesn't matter. MPI programs give more control to the programmer, but that makes them more difficult to write. In some ways MapReduce is a restrictive programming model, since you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to reducers).
8. The Search for Extra-Terrestrial Intelligence – volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside Earth. It is the most prominent of many volunteer computing projects, and similar to MapReduce in that it breaks a problem into independent pieces to be worked on in parallel.
9. Nutch – its architecture wouldn't scale to index billions of pages. The paper about GFS provided the information needed to solve Nutch's storage needs for the very large files generated as part of the web crawl and indexing process; in particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. NDFS was an open source implementation of GFS. Google introduced MapReduce to the world, and by mid-2005 the Nutch project had developed an open source implementation. Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008, when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster. The New York Times used Amazon's EC2 compute cloud to crunch through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web. The processing took less than 24 hours using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model and Hadoop's easy-to-use parallel programming model. Hadoop also broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds. In November of the same year, Google announced that its MapReduce implementation had sorted one terabyte in 68 seconds. By 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds.
10. MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines. HDFS – a distributed filesystem that runs on large clusters of commodity machines. Pig – a data flow language and execution environment for exploring very large datasets; Pig runs on HDFS and MapReduce clusters. Hive – a distributed data warehouse; Hive manages data stored in HDFS and provides a query language based on SQL for querying the data. HBase – a distributed, column-oriented database; HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads). Sqoop – a tool for efficiently moving data between relational databases and HDFS.
11. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The two functions are also specified by the programmer. For the example, we choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file. The map function is just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the maximum temperature for each year.
12. http://techcrunch.com/2011/07/17/hadoop-startups-where-open-source-meets-business-data/ LAMP (Linux, Apache, MySQL, PHP/Python) – as new open source web servers, databases, and web-friendly programming languages liberated developers from proprietary software and big-iron hardware, startup costs plummeted. This lower barrier to entry changed the startup funding game and led to the emergence of the current angel/seed funding ecosystem. It also enabled the generation of web apps we use today. With Hadoop, startups are creating more intelligent businesses and more intelligent products. Even a modestly successful startup has a user base comparable in population to nation-states. The problem this poses is that understanding the value of every user and transaction becomes more complex. The opportunity it poses is that the collective intelligence of the population can be leveraged into better user experiences. Before Hadoop, analyzing this scale of data required the same kind of enterprise solutions that LAMP was created to avoid. The key to understanding the significance of Hadoop is that it's not just a specific piece of technology, but a movement of developers trying to collectively solve the Big Data problems of their organizations.