Introducing Hadoop on Azure
M Sheik Uduman Ali
Technical Architect, Aditi Technologies
Instead of reinventing the wheel, Microsoft has made a strong move by
integrating Hadoop into its cloud computing PaaS stack. LINQ2HPC was
embraced by many .NET developers; however, a Hadoop distribution for
Windows is the safer move. This paper evaluates the early preview of
Hadoop on Azure and covers the basics of using it. It would be helpful
to read about MapReduce and Hadoop topology before learning about
Hadoop on Azure.
For comments or questions regarding the content of this paper, please contact
Sunny Neogi (sunnyn@aditi.com) or
Arun Kumar (arung@aditi.com)
www.aditi.com
Why do we need Hadoop?
The simple answer to this question is "big data analysis". Some examples of
big data analysis are:
• Calculating consumer purchasing trends for particular product categories
from data that grows at a rate of 1 million transactions per hour
• Web application log analysis
• Internet search indexing
• Social network data
Since relational databases and their ecosystem were designed around a
"scale-up" strategy with centralized data processing, they are not well
suited to the data warehousing space. In addition, the data persisted by
modern applications is a mix of relational, structured and unstructured
data. Hence, we need a much more powerful system. Hadoop is one of the
successful open source platforms based on the MapReduce principle, which in
turn follows the "making big by small" philosophy.
A big data processing task is called a "job" because it may run very
frequently, periodically, once in a while or only once. It is not part of
day-to-day business processing.
What is MapReduce?
The input data is processed on "n" small physical nodes in a clustered
environment in two phases:
• Map: The input data is organized as <k1, v1> key-value pairs. For
example, if the input data resides in one or more files, k1 would be the
file name and v1 the file content. Hence, the map phase receives a list of
<k1, v1>. It distributes the <k1, v1> pairs across the available map nodes
in the cluster. On every node, the mapping function mostly performs
"filtering and transformation" and produces <k2, v2>. For example, to count
the occurrences of words in a given set of documents, <filename, content>
is <k1, v1>, and each node in the mapping phase emits one pair per word
found in its v1. This generates output such as <"aditi", 1> as <k2, v2> for
every occurrence of the word "Aditi" in a document. Hence, the output of
the mapping phase is a list of <k2, v2>, which may contain many
<"aditi", 1> pairs.
• Reduce: All <k2, v2> are grouped by key into <k2, list(v2)>. In the
word count example, the cluster may produce <"aditi", list(1, 1, 1, 1)>
from the map output of all the documents on different nodes. Every list(v2)
for a k2 is passed to a node for reducing. The output is a list of
<k3, v3>. For example, if a node receives "aditi" as k2, it sums
list(1, 1, 1, 1) and produces 4 as v3. Here, k3 is again "aditi". Each
reducer node does the same for different words.
The grouping of <k2, v2> by key is performed by Hadoop's shuffle and sort
step between the two phases. An optional component called a "combiner" can
pre-aggregate the map output on each node before it is sent to the
reducers. For now, let us keep the focus on the mapper and reducer.
See figure 1.
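The word count flow described above can be sketched in a few lines of Python. This is only a single-process illustration of the map, group and reduce steps, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(filename, content):
    """Map: take <k1, v1> = <filename, content>, emit <k2, v2> = <word, 1>."""
    return [(word.lower(), 1) for word in content.split()]


def shuffle(pairs):
    """Group every <k2, v2> by key into <k2, list(v2)>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(word, counts):
    """Reduce: take <k2, list(v2)>, emit <k3, v3> = <word, total>."""
    return (word, sum(counts))


def word_count(documents):
    """Run the three steps in sequence on a dict of {filename: content}."""
    mapped = []
    for filename, content in documents.items():
        mapped.extend(map_phase(filename, content))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical sample input: three small "files".
documents = {
    "a.txt": "Aditi builds on Azure",
    "b.txt": "aditi aditi runs Hadoop",
    "c.txt": "Hadoop on Azure with Aditi",
}
counts = word_count(documents)
print(counts["aditi"])  # 4, from <"aditi", list(1, 1, 1, 1)>
```

In a real cluster the map calls run on different nodes and the grouping happens while data moves between nodes; here everything runs in one process so that the data shapes at each step are easy to inspect.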
Hadoop Cluster
A Hadoop cluster is an infrastructure of many physical nodes, some
configured for "mapping" and some for "reducing", along with
administrative, tracking and data persistence nodes called the "Name Node",
"Job Tracker", "Task Tracker" and "Data Node". This is a master/slave
architecture: the "Name Node" and "Job Tracker" are masters and the
remaining nodes are slaves. This is shown in figure 2.
To handle big data storage and processing, Hadoop uses HDFS as its file
system, which can handle even 100 TB of content as a single file.
Hadoop Ecosystem on Azure
Since every task is a "job", you can rent the nodes required for your job,
use them and release them. Hence, the elastic computing and data storage
(blob and table storage) in Azure are a good choice for running your Hadoop
job. Hadoop's home ground is Java, and at this early stage on Azure the
Hadoop Java SDK is one of the good options for writing your job. In
addition, "Hadoop on Azure" combines the elasticity of Azure storage with
Hadoop Streaming, which lets you write your job in C# or F# and use Azure
blob storage for data persistence (the scheme is called ASV). Figure 3
shows the Hadoop ecosystem on Azure.
To create directories, get and put files, and issue some data processing
commands on HDFS/ASV, Azure provides an interactive JavaScript console. (In
the actual Hadoop distribution, Java is the main interface for this.) In
addition, Azure supports Hive (a SQL-like language in Hadoop) and Pig Latin
(a high-level data processing language).
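To make the Hadoop Streaming idea concrete: a streaming job runs any executable as the mapper or reducer, feeds it records on standard input, and reads tab-separated key/value lines from its standard output, with the reducer receiving its input sorted by key. The sketch below shows that contract for word count. It is written in Python purely for illustration (the paper suggests C# or F# on Azure), and the names `mapper`, `reducer` and `run` are assumptions of this example rather than part of any Hadoop API.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(role, stdin=sys.stdin, stdout=sys.stdout):
    """Wire one step to text streams, as a streaming executable would."""
    step = mapper if role == "map" else reducer
    for out in step(stdin):
        stdout.write(out + "\n")


# Local dry run: sorted() stands in for the sort Hadoop performs
# between the map and reduce steps.
lines = ["Aditi on Azure\n", "Hadoop on Azure\n"]
mapped = sorted(mapper(lines))
reduced = list(reducer(mapped))
print(reduced)
```

Because both steps only read and write text lines, the same logic can be tested locally, as in the dry run above, before it is packaged as the executables of a streaming job.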
The Web Portal for Hadoop on Azure
The portal at www.hadooponazure.com is the management portal for creating,
releasing and renewing clusters for your job. The steps to run a job are:
1. Develop the mapping and reducing functions either in Java or on your
preferred platform. On non-Windows platforms, these could be shell scripts,
Ruby, PHP, Python, etc. On Azure, you can write the code in .NET.
2. Decide where the input data and output results of the job will be
stored: either in HDFS or in Azure Blob storage.
3. Request a cluster for the job in the portal.
4. Specify all the parameters for the job, including the executable for the
job and the input and output details.
5. Run the job and get the output.
6. Release the cluster.
Here, let us look at step 3: how to create a cluster for a job.
Requesting a new Cluster
After you sign in to the portal, enter the following details for the new
cluster environment, as shown in figure 4:
• DNS name (<dnsname>.cloudapp.net)
• Cluster size, similar to Azure role size: 4 nodes + 2 TB disk space =
small, 32 nodes + 16 TB = extra large
• Cluster login information
After entering these details, press the Request Cluster button. This creates
the cluster environment for your job. The screen shows the progress of
creating the new nodes for the cluster, as shown in figure 5.
After provisioning, you will see the screen shown in figure 6.
You can now create a new job. To access the environment, you can use either
the "Interactive Console" or "Remote Desktop".
When you click New Job, you will see the screen shown in figure 7.
The job shown in that figure is a Hadoop Streaming based job.
----------------------------------------
About the Author:
M Sheik Uduman Ali is a cloud architect at Aditi who has been involved in
cloud practices. He is a blogger and has published an online book about
"Domain Specific Languages in .NET".
ABOUT ADITI
Aditi helps product companies, web businesses and enterprises leverage the power of cloud, e-social and mobile, to
drive competitive advantage. We are Microsoft cloud partner of the year, one of the top 3 Platform-as-a-Service
solution providers globally and one of the top 5 Microsoft technology partners in the US. We are passionate about
emerging technologies and are focused on custom development.
