Hadoop introduction

Hadoop Introduction
Background && Installation && Hello world && related

Outline

• Background
• Hello world
• Installation
• Related

12/20/12 2

Background
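• Why Hadoop?
• Accessible: runs on commodity clusters or cloud services such as AWS
• Robust: handles most hardware failures gracefully
• Scalable: scales linearly as nodes are added
• Simple: 1 == 10,000 (code written for one machine runs on 10,000)
• Key points:
• Scale-out
• Moving code to data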


Background: History
• Apache top-level project, created by Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
• Yahoo (10,000)
• Facebook (Hive, HBase, …)
• Hulu (HBase)
• Baidu (3,000 TB, one week)
• Twitter (tweet data)


Background
• Comparing SQL databases and Hadoop
• Structure:
• SQL (structured data with a fixed schema)
• Hadoop (key-value pairs; unstructured data such as text and images)
• Scale-out <- Scale-up
• Key-value pairs <- Relational tables
• Functional programming <- Declarative queries
• Offline batch processing (write once, read many times) <- Online transactions

Background – Understanding
• Word count
• As the file size grows, an in-memory table runs out of memory
• Disk-based hash table (more complex)
• Distributed:
• Phase 1: each machine processes its part of the input
• Phase 2: merge the partial results
• Shuffle the partitions to the appropriate machines (e.g., split by alphabet)

• Now we have already built a minimal Hadoop.
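
The two phases and the shuffle above can be sketched in a single Python process. This is only an illustration under stated assumptions: each input chunk stands in for one machine, and the function names (count_chunk, partition_by_first_letter, distributed_word_count) and the first-letter partitioning rule are made up for this sketch, not taken from Hadoop.

```python
from collections import Counter, defaultdict


def count_chunk(lines):
    """Phase 1: one machine counts the words in its own part of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def partition_by_first_letter(counts, num_machines):
    """Shuffle: pick the machine that owns each word, based on its first letter."""
    partitions = defaultdict(Counter)
    for word, n in counts.items():
        partitions[ord(word[0]) % num_machines][word] += n
    return partitions


def distributed_word_count(chunks, num_machines=4):
    """Simulate both phases; each chunk plays the role of one machine."""
    inboxes = defaultdict(Counter)  # what each phase-2 machine receives
    for chunk in chunks:
        partial = count_chunk(chunk)
        parts = partition_by_first_letter(partial, num_machines)
        for machine, part in parts.items():
            inboxes[machine].update(part)  # Phase 2: merge partial counts
    total = Counter()
    for merged in inboxes.values():
        total.update(merged)
    return dict(total)
```

Because every occurrence of a word is sent to the same machine, each phase-2 machine can finish its words without talking to the others.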


Hello World: Word Count
• Two phases:
• Mapping: take the input data and feed it to the mappers
• Reducing: process all output from the mappers and produce the final result

• 1.1 list(filename, file content)
• 1.2 list(word, 1)
• 2.1 list(word, list(1))
• 2.2 list(word, count)
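
The four stages listed above can be traced with plain Python lists. This is a minimal sketch, not Hadoop's API: map_fn, shuffle, reduce_fn, and word_count are hypothetical names, and the grouping that the framework performs between the two phases is imitated here with a sort.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(filename, content):
    """1.1 -> 1.2: turn one (filename, content) record into (word, 1) pairs."""
    return [(word, 1) for word in content.split()]


def shuffle(pairs):
    """1.2 -> 2.1: group the pairs by word, giving (word, [1, 1, ...])."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(word, [one for _, one in group])
            for word, group in groupby(ordered, key=itemgetter(0))]


def reduce_fn(word, ones):
    """2.1 -> 2.2: collapse the list of ones into a single count."""
    return (word, sum(ones))


def word_count(files):
    """Run all stages over a list of (filename, content) records."""
    pairs = [pair for name, text in files for pair in map_fn(name, text)]
    return [reduce_fn(word, ones) for word, ones in shuffle(pairs)]
```

For example, word_count([("a.txt", "to be or not to be")]) yields one (word, count) pair per distinct word, sorted by word.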


Hello World
• mapper.py
• reducer.py
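
The slide does not show the contents of the two scripts. The following is a sketch of what a Hadoop Streaming style mapper and reducer for word count might look like, assuming tab-separated "word<TAB>count" lines on stdin and stdout. The function names run_mapper and run_reducer are hypothetical; in practice each would be the body of its own script reading sys.stdin.

```python
import sys
from itertools import groupby


def run_mapper(lines, out=sys.stdout):
    """mapper.py role: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


def run_reducer(lines, out=sys.stdout):
    """reducer.py role: sum the counts of consecutive lines with the same word.

    Assumes the input is already sorted by word, which is what the
    framework provides between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{word}\t{total}\n")
```

Scripts built this way can be tried without a cluster by piping a file through the mapper, then sort, then the reducer.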


Installation
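• Modes:
• Standalone (local) mode (default)
• Pseudo-distributed mode (recommended for development and debugging)
• Fully distributed mode
• Configuration:
• Basic configuration
• SSH configuration
• Ubuntu configuration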


Hadoop Framework
• HDFS:
• NameNode: tracks, directs, and records (bookkeeping)
• DataNode: low-level I/O operations
• Secondary NameNode
• MapReduce:
• JobTracker
• TaskTracker
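
The split between bookkeeping (NameNode) and low-level I/O (DataNode) can be pictured with a toy in-memory model. This is not the HDFS API: the class and method names, the 4-character block size, and the round-robin placement are all assumptions made for illustration. The sketch also omits replication, and it routes data through the NameNode for brevity, whereas in HDFS clients exchange block data with DataNodes directly.

```python
class DataNode:
    """Stores raw blocks; stands in for the low-level I/O role."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size  # tiny on purpose; not a real HDFS value
        self.files = {}               # filename -> [(block_id, DataNode), ...]

    def put(self, filename, data):
        placements = []
        for i in range(0, len(data), self.block_size):
            block_id = f"{filename}#{i // self.block_size}"
            # Round-robin placement (an assumption of this sketch).
            node = self.datanodes[len(placements) % len(self.datanodes)]
            node.write(block_id, data[i:i + self.block_size])
            placements.append((block_id, node))
        self.files[filename] = placements

    def get(self, filename):
        return "".join(node.read(bid) for bid, node in self.files[filename])
```

Storing "hello hadoop" with two DataNodes splits it into three blocks, and get() reassembles the file from the recorded block locations.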


Related
• Programming:
• Java
• Python
• Jython (compiles Python to Java bytecode)
• Hadoop Streaming (stdin, stdout)
• Dumbo
• Happy


Related
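• Pig: high-level data-flow language
• Hive: SQL-based data warehouse
• HBase: column-oriented database modeled on Google Bigtable
• ZooKeeper: coordination system for shared state
• Chukwa: data collection system
• Mahout: data mining and machine learning
• Hama: matrix computation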


Resources
• Books:
• Hadoop in Action
• Hadoop 实战 (2nd edition)
• Video && Google Course
• URL:
• Bookmarked resources

