Hadoop makes data storage and processing at scale available as a low-cost, open-source solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on-premises or on Azure.
3. About @odimulescu
• Working on the Web since 1997
• Organizer for JaxMUG.com
• Co-Organizer for the Jax Big Data meetup
4. What is Hadoop?
Apache Hadoop is an open source framework
for running data-intensive applications on large
clusters of commodity hardware
5. What is it solving, and how?
Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simplifies programming model
6. Why does it matter?
• Volume - Datasets outgrow local HDDs, let alone RAM
• Velocity - Data grows at tremendous pace
• Variety - Data is heterogeneous
• Value
- Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
- Scaling up has a ceiling (physical, technical, etc.)
7. Why does it matter?
[Pie chart, data types: complex data (images, video, logs, documents, call records, sensor data, mail archives) ~80%; structured data (user profiles, CRM, HR records) ~20%]
* Chart Source: IDC White Paper
8. Use cases
• ETL
• Pattern Recognition
• Recommendation Engines
• Prediction Models
• Log Processing
• Data “sandbox”
11. When not to use?
• Not a database replacement
• Not a data warehouse; it complements one
• Not for interactive reporting
• Not a general purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion *
12. Architecture – Core Components
HDFS
Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce
A simplified programming model for processing and generating large data sets.
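The model boils down to two functions: a map step that emits key/value pairs, a framework-provided shuffle that groups values by key, and a reduce step that aggregates each group. A minimal sketch in plain Python (the sample records and function names are illustrative, not Hadoop APIs):

```python
from itertools import groupby
from operator import itemgetter

# Illustrative input: "year,temperature" records.
RECORDS = ["1990,21", "1990,34", "1991,18", "1991,29"]

def map_phase(line):
    # Map: parse one record and emit a (key, value) pair.
    year, temp = line.split(",")
    return (year, int(temp))

def reduce_phase(key, values):
    # Reduce: aggregate all values that share a key (here, max temp per year).
    return (key, max(values))

# The framework's shuffle: sort mapper output and group it by key.
pairs = sorted(map(map_phase, RECORDS))
result = [reduce_phase(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
# result == [("1990", 34), ("1991", 29)]
```

The programmer writes only `map_phase` and `reduce_phase`; Hadoop supplies the distribution, sorting, and fault tolerance around them.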
13. Architecture - HDFS
Namenode (NN)
1. Client asks the NN for a file
2. NN returns the DNs that hold its blocks
3. Client reads the data directly from the DNs
Datanode 1, Datanode 2, … Datanode N

Namenode - Master
• Filesystem metadata
• Files R/W control
• Blocks replication
Datanode - Slaves
• Blocks R/W per clients
• Replicates blocks per master
• Notifies master about block-ids
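From a Windows client, the same NN/DN read flow is reachable over HTTP via WebHDFS: a read request goes to the NameNode, which answers with an HTTP 307 redirect to a DataNode that serves the bytes. A small sketch that builds the request URL (host and port are assumptions for a default single-node setup):

```python
def webhdfs_open_url(path, host="localhost", port=50070):
    # Build the NameNode URL for op=OPEN; the HTTP response is a
    # 307 redirect to a DataNode that streams the file contents.
    return "http://{0}:{1}/webhdfs/v1{2}?op=OPEN".format(host, port, path)

url = webhdfs_open_url("/user/demo/data.txt")
# url == "http://localhost:50070/webhdfs/v1/user/demo/data.txt?op=OPEN"
```

Any HTTP stack (including .NET's) can follow the redirect, which is why no native Hadoop client is required for basic file access.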
14. Architecture - MapReduce
JobTracker (JT)
Client submits a job through the API; the JT distributes tasks to the TaskTrackers
TaskTracker 1, TaskTracker 2, … TaskTracker N

JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns MR tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker - Slaves
• Runs MR tasks received from JobTracker
• Manages storage and transmission of intermediate output
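The tasks a TaskTracker runs need not be Java: with Hadoop Streaming, any executable works as a mapper or reducer, reading lines on stdin and emitting tab-separated key/value lines on stdout, with the framework sorting by key in between. A word-count sketch written as functions over line iterables so the logic can be tested locally (the function names are illustrative):

```python
def mapper(lines):
    # Streaming mapper: emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so counts for
    # one word are contiguous and can be summed in a single pass.
    current, count = None, 0
    for line in sorted_lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

# sorted() stands in for the framework's shuffle/sort between the phases.
mapped = sorted(mapper(["to be or", "not to be"]))
counts = list(reducer(mapped))
# counts == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

On a cluster the same two scripts would be wired to stdin/stdout and submitted via the streaming jar, which is what makes streaming jobs approachable from non-Java (including Windows scripting) environments.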
15. Architecture - Core Hadoop
Jobs (API): JobTracker → TaskTracker 1, TaskTracker 2, … TaskTracker N
HDFS: NameNode → DataNode 1, DataNode 2, … DataNode N
* Mini OS: Filesystem & Scheduler
18. Installation - Platform Notes
Production
Linux – Official
Development
Linux
OSX
Windows via Cygwin *
Other Unixes
19. Installation
1. Download & configure single-node cluster
hadoop.apache.org/common/releases.html
2. Download a demo VM
Cloudera, Hortonworks, MapR, etc.
3. Download MS HDInsight Server
4. Cloud: Amazon EMR, Azure HDInsight Service
20. Hadoop - Azure Story
Name:
Windows Azure HDInsight Service
Where:
HadoopOnAzure.com
Status:
Public Preview
*On-premise: Microsoft HDInsight Server
38. References
Hadoop at Yahoo!, by Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop on Azure, Getting Started
Hadoop .NET SDK
.NET HDFS File Access
SQL Server Connector for Hadoop