One elephant went out to play, Azure way
         Orlando Code Camp, 2013




                       Ovidiu Dimulescu

                       @odimulescu
                       speakerdeck.com/odimulescu
Agenda
  •   Overview
  •   Installation
  •   Azure story
  •   .Net Integration
  •   MapReduce
  •   Q&A
About @odimulescu
• Working on the Web since 1997


• Organizer for JaxMUG.com

• Co-Organizer for Jax Big Data meetup
What is Hadoop?



Apache Hadoop is an open source framework
for running data-intensive applications on large
clusters of commodity hardware
What does it solve, and how?
Processing diverse large datasets in practical time at low cost

• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simplifies programming model


Why does it matter?

• Volume - Datasets outgrow local HDDs let alone RAM

• Velocity - Data grows at tremendous pace

• Variety - Data is heterogeneous

• Value

  - Scaling up is expensive (licensing, cpus, disks, fabric, etc.)

  - Scaling up has a ceiling (physical, technical, etc.)
Why does it matter?


Data types (pie chart, 20% / 80% split)

• Complex data: images, video, logs, documents, call records, sensor data, mail archives
• Structured data: user profiles, CRM, HR records

* Chart Source: IDC White Paper
Use cases

• ETL

• Pattern Recognition

• Recommendation Engines

• Prediction Models

• Log Processing

• Data “sandbox”
Who uses it?
Who supports it?
When not to use?

• Not a database replacement

• Not a data warehouse; it complements one

• Not for interactive reporting

• Not a general purpose storage mechanism

• Not for problems that cannot be parallelized in a
  shared-nothing fashion *
Architecture – Core Components

HDFS

Distributed filesystem designed for low cost storage
and high bandwidth access across the cluster.


MapReduce

Simpler programming model for processing and
generating large data sets.
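
To make the MapReduce programming model concrete, here is a minimal in-memory sketch in Python. It is an illustration only and not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster runs these phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["one elephant went out to play", "one more elephant"])
```

The programmer writes only the map and reduce functions; grouping by key and distributing the work is what the framework takes care of.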
Architecture - HDFS


Cluster: Namenode (NN), Datanode 1, Datanode 2 ... Datanode N

1. Client asks the NN for a file
2. NN returns the DNs that have it
3. Client asks a DN for the data

Namenode - Master

• Filesystem metadata
• Files R/W control
• Blocks replication

Datanode - Slaves

• Blocks R/W per clients
• Replicates blocks per master
• Notifies master about block-ids
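
The read path on this slide can be sketched as a toy simulation. This is a sketch under stated assumptions, not HDFS code: the classes DataNode and NameNode, their methods, the round-robin replica placement and the replication factor of 2 are all hypothetical choices made for illustration.

```python
class DataNode:
    """Toy datanode: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy namenode: holds metadata only, never the file contents."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode objects

    def add_file(self, path, chunks, datanodes, replication=2):
        self.file_blocks[path] = []
        for i, chunk in enumerate(chunks):
            block_id = f"{path}#blk{i}"
            # Hypothetical placement policy: round-robin across datanodes.
            replicas = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
            for dn in replicas:
                dn.blocks[block_id] = chunk
            self.file_blocks[path].append(block_id)
            self.block_locations[block_id] = replicas

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def read_file(namenode, path):
    """Client: ask the namenode where the blocks are, then read from datanodes."""
    data = []
    for block_id, replicas in namenode.get_block_locations(path):
        data.append(replicas[0].read_block(block_id))  # first replica, no failover
    return "".join(data)


datanodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode()
nn.add_file("/logs/a.txt", ["hello ", "blue ", "elephant"], datanodes)
content = read_file(nn, "/logs/a.txt")
```

The point of the split is that file data never flows through the master: the namenode answers "where", and the datanodes serve the bytes.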
Architecture - MapReduce

Cluster: JobTracker (JT), TaskTracker 1, TaskTracker 2 ... TaskTracker N

Client starts a job (Jobs API)

JobTracker - Master

• Accepts MR jobs submitted by clients
• Assigns MR tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves

• Runs MR tasks received from JobTracker
• Manages storage and transmission of intermediate output
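
The "re-executes tasks upon failure" behaviour can be illustrated with a small sketch. It assumes a simplified model in which a tracker is just a named function and a failure is a RuntimeError; run_job, healthy and broken are hypothetical names, and speculative execution is not modelled.

```python
def run_job(tasks, trackers, max_attempts=3):
    """Assign each task to a tracker; on failure, re-run it on the next tracker."""
    results, attempts_log = {}, []
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            tracker_name, run = trackers[(i + attempt) % len(trackers)]
            attempts_log.append((task, tracker_name))
            try:
                results[task] = run(task)
                break
            except RuntimeError:
                continue  # this tracker failed; try the task elsewhere
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results, attempts_log


def healthy(task):
    """A tracker that completes its task."""
    return task.upper()


def broken(task):
    """A tracker that always fails."""
    raise RuntimeError("tracker lost")


trackers = [("tt1", broken), ("tt2", healthy)]
results, log = run_job(["map-0", "map-1"], trackers)
```

Because map and reduce tasks are independent of each other, a failed task can simply be run again on another node without restarting the whole job.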
Architecture - Core Hadoop


JobTracker (Jobs API)
TaskTracker 1   TaskTracker 2   TaskTracker N
DataNode 1      DataNode 2      DataNode N
NameNode (HDFS)




* Mini OS: Filesystem & Scheduler
Hadoop - Ecosystem

Management: ZooKeeper, Chukwa, Ambari, HUE

Data Access: Pig, Hive, Sqoop, Impala, Stinger

Data Processing: MapReduce, Giraph, Hama, Mahout

Storage: HDFS, HBase
Installation - Platform Notes

Production

 • Linux – Official

Development

 • Linux
 • OSX
 • Windows via Cygwin *
 • Other Unixes
Installation

1. Download & configure single-node cluster

   hadoop.apache.org/common/releases.html

2. Download a demo VM

      Cloudera, Hortonworks, MapR, etc.

3. Download MS HDInsight Server

4. Cloud: Amazon EMR, Azure HDInsight Service
Hadoop - Azure Story

Name:
  Windows Azure HDInsight Service

Where:
  Hadoop on Azure dot com

Status:
    Public Preview

*On-premise: Microsoft HDInsight Server
HDFS - .Net access

Microsoft Distribution of Hadoop

  C library for HDFS file access



Hadoop .Net HDFS File Access

  Managed C++ Solution
Hadoop .Net SDK

hadoopsdk.codeplex.com

 • MapReduce
 • LINQ to Hive
 • WebHDFS Client
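
For context on what a WebHDFS client wraps: WebHDFS is a REST interface whose URLs take the form http://<host>:<port>/webhdfs/v1/<path>?op=<operation>. The sketch below only builds such URLs and makes no network calls. The helper webhdfs_url and the host name are hypothetical, port 50070 is an assumption (the usual NameNode HTTP port in Hadoop 1.x, which may differ on a given cluster), and none of this reflects the .Net SDK's own API.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1<path>?op=..."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a directory, then open a file as a given user.
list_url = webhdfs_url("namenode.example.com", "/user/demo", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", "/user/demo/input.txt", "OPEN",
                       params={"user.name": "demo"})
```

Since the interface is plain HTTP, any platform with an HTTP client can reach HDFS this way without a Java runtime.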
Hadoop Integration

    ODBC Driver

     Excel PowerPivot
     Other BI tools

   Connector for Hadoop

     Import / Export via SQOOP
slideshare.net/esaliya/mapreduce-in-simple-terms

by Saliya Ekanayake




MapReduce - Clients

Java - Native
 hadoop jar jar_path main_class input_path output_path


C++ - Pipes framework
 hadoop pipes -input path_in -output path_out -program exec_program


Any – Streaming
 hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out


Pig Latin, Hive HQL, C via JNI
C# - Streaming - Mapper
C# - Streaming - Reducer
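
The C# code from these two slides is not reproduced in this text. As a stand-in, here is a word-count sketch of the Streaming contract written in Python rather than C#. It is not the presenter's code. It relies on two properties of Hadoop Streaming: mapper and reducer exchange lines of text in "key<TAB>value" form, and the reducer receives its input sorted by key. The function names are hypothetical, and the local sorted() call stands in for the framework's shuffle.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word found in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of the pipeline: map, sort (the shuffle), reduce.
mapped = list(mapper(["One elephant", "one more elephant"]))
reduced = list(reducer(sorted(mapped)))
```

In an actual Streaming job, each function would live in its own executable that reads lines from standard input and prints its output lines, and those executables would be passed as map_prog and reduce_prog.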
C# - .Net SDK Mapper & Reducer
C# - .Net SDK Driver Class
C# - .Net SDK Driver Class




MRRunner -dll WordFrequency.dll -- input output



MRRunner -dll WordFrequency.dll -class WordFrequency -- input output
C# - .Net SDK Debugging
References
Hadoop at Yahoo!, by Y! Developer Network

MapReduce in Simple Terms, by Saliya Ekanayake

Hadoop on Azure, Getting Started

Hadoop .Net SDK

.Net HDFS File Access

SQL Server Connector for Hadoop
Questions ?



      Ovidiu Dimulescu

      @odimulescu
      speakerdeck.com/odimulescu

Hadoop on Azure, Blue elephants

  • 1. One elephant went out to play, Azure way Orlando Code Camp, 2013 Ovidiu Dimulescu @odimulescu speakerdeck.com/odimulescu
  • 2. Agenda • Overview • Installation • Azure story • .Net Integration • MapReduce • Q &A
  • 3. About @odimulescu • Working on the Web since 1997 • Organizer for JaxMUG.com • Co-Organizer for Jax Big Data meetup
  • 4. What is Hadoop? Apache Hadoop is an open source framework for running data-intensive applications on large clusters of commodity hardware
  • 5. What does it solve, and how? Processing diverse, large datasets in practical time at low cost • Consolidates data in a distributed file system • Moves computation to data rather than data to computation • Simplifies the programming model
  • 6. Why does it matter? • Volume - Datasets outgrow local HDDs, let alone RAM • Velocity - Data grows at a tremendous pace • Variety - Data is heterogeneous • Value - Scaling up is expensive (licensing, CPUs, disks, fabric, etc.) - Scaling up has a ceiling (physical, technical, etc.)
  • 7. Why does it matter? Data types (chart: 20% / 80% split) • Complex data: images, video, logs, documents, call records, sensor data, mail archives • Structured data: user profiles, CRM, HR records • * Chart source: IDC White Paper
  • 8. Use cases • ETL • Pattern Recognition • Recommendation Engines • Prediction Models • Log Processing • Data “sandbox”
  • 11. When not to use? • Not a database replacement • Not a data warehouse; it complements one • Not for interactive reporting • Not a general-purpose storage mechanism • Not for problems that cannot be parallelized in a shared-nothing fashion
  • 12. Architecture – Core Components • HDFS: Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster • MapReduce: Simpler programming model for processing and generating large data sets
  • 13. Architecture - HDFS • Read path: client asks the Namenode (NN) for a file; NN returns the Datanodes (DNs) that have it; client asks a DN for the data • Namenode (master): filesystem metadata; file R/W control; block replication • Datanodes 1..N (slaves): block R/W for clients; replicate blocks as directed by the master; notify the master about block IDs
  • 14. Architecture - MapReduce • Client starts a job through the Jobs API on the JobTracker (JT) • JobTracker (master): accepts MR jobs submitted by clients; assigns MR tasks to TaskTrackers; monitors tasks and TaskTracker status, re-executes tasks upon failure; speculative execution • TaskTrackers 1..N (slaves): run MR tasks received from the JobTracker; manage storage and transmission of intermediate output
  • 15. Architecture - Core Hadoop • Jobs API: JobTracker; TaskTracker 1, TaskTracker 2, ... TaskTracker N • HDFS: NameNode; DataNode 1, DataNode 2, ... DataNode N • * Mini OS: Filesystem & Scheduler
  • 16. Hadoop - Ecosystem • Management: ZooKeeper, Chukwa, Ambari, HUE • Data Access: Pig, Hive, Sqoop, Impala, Stinger • Data Processing: MapReduce, Giraph, Hama, Mahout • Storage: HDFS, HBase
  • 17. Hadoop - Ecosystem • Management: ZooKeeper, Chukwa, Ambari, HUE • Data Access: Pig, Hive, Sqoop, Impala, Stinger • Data Processing: MapReduce, Giraph, Hama, Mahout • Storage: HDFS, HBase
  • 18. Installation - Platform Notes • Production: Linux (official) • Development: Linux, OS X, Windows via Cygwin, other Unixes
  • 19. Installation • 1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html • 2. Download a demo VM: Cloudera, Hortonworks, MapR, etc. • 3. Download MS HDInsight Server • 4. Cloud: Amazon EMR, Azure HDInsight Service
  • 20. Hadoop - Azure Story • Name: Windows Azure HDInsight Service • Where: Hadoop on Azure dot com • Status: Public Preview • * On-premises: Microsoft HDInsight Server
  • 21. Hadoop - Azure Story
  • 22. Hadoop - Azure Story
  • 23. Hadoop - Azure Story
  • 24. Hadoop - Azure Story
  • 25. Hadoop - Azure Story
  • 26. HDFS - .Net access • Microsoft Distribution of Hadoop: C library for HDFS file access • Hadoop .Net HDFS File Access: Managed C++ solution
  • 27. HDFS - .Net access
  • 28. Hadoop .Net SDK hadoopsdk.codeplex.com • MapReduce • LINQ to Hive • WebHDFS Client
  • 29. Hadoop Integration • ODBC Driver: Excel, PowerPivot, other BI tools • Connector for Hadoop: import / export via Sqoop
  • 31. MapReduce - Clients • Java (native): hadoop jar jar_path main_class input_path output_path • C++ (Pipes framework): hadoop pipes -input path_in -output path_out -program exec_program • Any language (Streaming): hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out • Also: Pig Latin, Hive HQL, C via JNI
  • 32. C# - Streaming - Mapper
  • 33. C# - Streaming - Reducer
  • 34. C# - .Net SDK Mapper & Reducer
  • 35. C# - .Net SDK Driver Class
  • 36. C# - .Net SDK Driver Class • MRRunner -dll WordFrequency.dll -- input output • MRRunner -dll WordFrequency.dll -class WordFrequency -- input output
  • 37. C# - .Net SDK Debugging
  • 38. References • Hadoop at Yahoo!, by Y! Developer Network • MapReduce in Simple Terms, by Saliya Ekanayake • Hadoop on Azure, Getting Started • Hadoop .Net SDK • .Net HDFS File Access • SQL Server Connector for Hadoop
  • 39. Questions? • Ovidiu Dimulescu • @odimulescu • speakerdeck.com/odimulescu
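
The streaming slides (31 to 33) show their mapper and reducer as images that did not survive in this transcript. As a rough stand-in, here is a minimal word-frequency sketch of the same streaming contract, written in Python rather than the C# used in the talk. It assumes the usual streaming convention of one tab-separated key/value record per line; the function names (`map_lines`, `reduce_records`, `run_locally`) are made up for illustration, and `run_locally` only imitates the sort that the framework performs between the two stages.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_records(records):
    """Reducer: sum the counts per word. Input must be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reduce_records(sorted(map_lines(lines))))


for record in run_locally(["One elephant went out", "one elephant stayed"]):
    print(record)
```

On a cluster, each function would sit in its own script that reads `sys.stdin` and prints its records, and those two scripts would be passed as `map_prog` and `reduce_prog` to the streaming command on slide 31.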
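
The HDFS read path on slide 13 (client asks the Namenode, the Namenode returns the Datanodes, client fetches from a Datanode) can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions, not the HDFS API: the classes, method names and block IDs below are hypothetical, and the `add_block` helper places replicas directly, which skips the real write pipeline.

```python
class DataNode:
    """Toy datanode: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy namenode: holds metadata only, never file contents."""

    def __init__(self):
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> datanodes holding a replica

    def add_block(self, path, block_id, data, datanodes):
        # Setup shortcut for the sketch: record metadata and place replicas.
        self.files.setdefault(path, []).append(block_id)
        self.locations[block_id] = list(datanodes)
        for dn in datanodes:
            dn.blocks[block_id] = data

    def locate(self, path):
        """Return (block id, replica datanodes) pairs in file order."""
        return [(b, self.locations[b]) for b in self.files[path]]


def read_file(namenode, path, failed=()):
    """Client read: ask the NN where blocks live, then fetch each from a live DN."""
    chunks = []
    for block_id, replicas in namenode.locate(path):
        live = [dn for dn in replicas if dn.name not in failed]
        if not live:
            raise IOError(f"no live replica for block {block_id}")
        chunks.append(live[0].read(block_id))
    return b"".join(chunks)


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
nn = NameNode()
nn.add_block("/logs/a.txt", "blk_1", b"blue ", [dn1, dn2])
nn.add_block("/logs/a.txt", "blk_2", b"elephants", [dn2, dn3])
print(read_file(nn, "/logs/a.txt"))
print(read_file(nn, "/logs/a.txt", failed={"dn2"}))
```

Because every block has a replica on more than one datanode, the second read still succeeds with `dn2` marked as failed, which is the point of block replication on that slide.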