Hadoop makes data storage and processing at scale available as a low-cost, open-source solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on-premises or on Azure.
3. About @odimulescu
• Working on the Web since 1997
• Organizer for JaxMUG.com
• Co-Organizer for the Jax Big Data meetup
4. What is Hadoop?
Apache Hadoop is an open source framework
for running data-intensive applications on large
clusters of commodity hardware
5. What is it solving, and how?
Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simplifies programming model
6. Why does it matter?
• Volume - Datasets outgrow local HDDs, let alone RAM
• Velocity - Data grows at tremendous pace
• Variety - Data is heterogeneous
• Value
- Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
- Scaling up has a ceiling (physical, technical, etc.)
7. Why does it matter?
[Pie chart, data types: complex data (images, video, logs, documents, call records, sensor data, mail archives) ~80%; structured data (user profiles, CRM, HR records) ~20%]
* Chart Source: IDC White Paper
8. Use cases
• ETL
• Pattern Recognition
• Recommendation Engines
• Prediction Models
• Log Processing
• Data “sandbox”
11. When not to use?
• Not a database replacement
• Not a data warehouse; it complements one
• Not for interactive reporting
• Not a general purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion *
12. Architecture – Core Components
HDFS
Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce
A simplified programming model for processing and generating large data sets.
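The model boils down to two functions: a map step that emits key/value pairs, a framework-provided shuffle that groups values by key, and a reduce step that aggregates each group. A minimal sketch in plain Python (the sample records and function names are illustrative, not Hadoop APIs):

```python
from itertools import groupby
from operator import itemgetter

# Illustrative input: "year,temperature" records.
RECORDS = ["1990,21", "1990,34", "1991,18", "1991,29"]

def map_phase(line):
    # Map: parse one record and emit a (key, value) pair.
    year, temp = line.split(",")
    return (year, int(temp))

def reduce_phase(key, values):
    # Reduce: aggregate all values that share a key (here, max temp per year).
    return (key, max(values))

# The framework's shuffle: sort mapper output and group it by key.
pairs = sorted(map(map_phase, RECORDS))
result = [reduce_phase(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
# result == [("1990", 34), ("1991", 29)]
```

The programmer writes only `map_phase` and `reduce_phase`; Hadoop supplies the distribution, sorting, and fault tolerance around them.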
13. Architecture - HDFS
Namenode (NN)
1. Client asks the NN for a file
2. NN returns the DNs that hold its blocks
3. Client reads the data directly from the DNs
Datanode 1, Datanode 2, … Datanode N

Namenode - Master
• Filesystem metadata
• Files R/W control
• Blocks replication
Datanode - Slaves
• Blocks R/W per clients
• Replicates blocks per master
• Notifies master about block-ids
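From a Windows client, the same NN/DN read flow is reachable over HTTP via WebHDFS: a read request goes to the NameNode, which answers with an HTTP 307 redirect to a DataNode that serves the bytes. A small sketch that builds the request URL (host and port are assumptions for a default single-node setup):

```python
def webhdfs_open_url(path, host="localhost", port=50070):
    # Build the NameNode URL for op=OPEN; the HTTP response is a
    # 307 redirect to a DataNode that streams the file contents.
    return "http://{0}:{1}/webhdfs/v1{2}?op=OPEN".format(host, port, path)

url = webhdfs_open_url("/user/demo/data.txt")
# url == "http://localhost:50070/webhdfs/v1/user/demo/data.txt?op=OPEN"
```

Any HTTP stack (including .NET's) can follow the redirect, which is why no native Hadoop client is required for basic file access.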
14. Architecture - MapReduce
JobTracker (JT)
Client submits a job through the API; the JT distributes tasks to the TaskTrackers
TaskTracker 1, TaskTracker 2, … TaskTracker N

JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns MR tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker - Slaves
• Runs MR tasks received from JobTracker
• Manages storage and transmission of intermediate output
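The tasks a TaskTracker runs need not be Java: with Hadoop Streaming, any executable works as a mapper or reducer, reading lines on stdin and emitting tab-separated key/value lines on stdout, with the framework sorting by key in between. A word-count sketch written as functions over line iterables so the logic can be tested locally (the function names are illustrative):

```python
def mapper(lines):
    # Streaming mapper: emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so counts for
    # one word are contiguous and can be summed in a single pass.
    current, count = None, 0
    for line in sorted_lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

# sorted() stands in for the framework's shuffle/sort between the phases.
mapped = sorted(mapper(["to be or", "not to be"]))
counts = list(reducer(mapped))
# counts == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

On a cluster the same two scripts would be wired to stdin/stdout and submitted via the streaming jar, which is what makes streaming jobs approachable from non-Java (including Windows scripting) environments.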
15. Architecture - Core Hadoop
Jobs (API): JobTracker → TaskTracker 1, TaskTracker 2, … TaskTracker N
HDFS: NameNode → DataNode 1, DataNode 2, … DataNode N
* Mini OS: Filesystem & Scheduler
18. Installation - Platform Notes
Production
Linux – Official
Development
Linux
OSX
Windows via Cygwin *
Other Unixes
19. Installation
1. Download & configure single-node cluster
hadoop.apache.org/common/releases.html
2. Download a demo VM
Cloudera, Hortonworks, MapR, etc.
3. Download MS HDInsight Server
4. Cloud: Amazon EMR, Azure HDInsight Service
20. Hadoop - Azure Story
Name:
Windows Azure HDInsight Service
Where:
HadoopOnAzure.com
Status:
Public Preview
*On-premise: Microsoft HDInsight Server
38. References
Hadoop at Yahoo!, by Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop on Azure, Getting Started
Hadoop .NET SDK
.NET HDFS File Access
SQL Server Connector for Hadoop