Hadoop Jon
By HumoyunJon Lee
Humoyun Ahmedov – Software Developer at Konkuk University, Cloud Computing Lab
90% OF THE WORLD’S DATA HAS BEEN GENERATED IN THE LAST
THREE YEARS ALONE, AND IT IS GROWING
AT AN EVEN MORE RAPID RATE.
BIG DATA
The world has seen exponential data growth due to social media,
mobility, e-commerce, and other factors. Big data is commonly
characterized by three Vs:
• Volume – the sheer scale of data
• Variety – the range of formats and sources
• Velocity – the speed at which data arrives
“Big Data is like teenage sex; 
everyone talks about it, 
nobody really knows how to do it, 
everyone thinks everyone else is doing it, 
so everyone claims they are doing it” 
Dan Ariely, Duke University
Big Data Ecosystem
To Address This Issue 
We need HadoopJon
What is Hadoop?
A Shared-Nothing Network
The Apache Hadoop software library is a framework that allows for the 
distributed processing of large data sets across clusters of computers using 
simple programming models. It is designed to scale up from single servers to 
thousands of machines, each offering local computation and storage. Rather 
than rely on hardware to deliver high-availability, the library itself is designed 
to detect and handle failures at the application layer, so delivering a highly-available 
service on top of a cluster of computers, each of which may be prone 
to failures.
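To make "simple programming models" concrete, here is a minimal word-count sketch against the Hadoop 1.x MapReduce API (the era this deck's JobTracker/TaskTracker slides describe). The class name, tokenization, and input/output paths are illustrative assumptions, not code from the deck:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: for each input line, emit a (word, 1) pair per token.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : line.toString().split("\\s+")) {
        if (!tok.isEmpty()) { word.set(tok); ctx.write(word, ONE); }
      }
    }
  }

  // Reducer: sum the 1s for each word to get its total count.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count"); // Hadoop 1.x-style job setup
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles splitting the input, scheduling map and reduce tasks near the data, and rerunning tasks on failed nodes; the programmer only writes the two functions above.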
Installing HadoopJon
Prerequisites:
• Installing Java v1.5+
• Adding a dedicated Hadoop system user
• Configuring SSH access
• Disabling IPv6
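A rough sketch of these prerequisites on a Debian/Ubuntu-style system; the hduser account name and the JDK package name are illustrative assumptions, not part of the deck:

sudo apt-get install openjdk-6-jdk           # any Java 1.5+ JDK will do
sudo addgroup hadoop                         # dedicated group ...
sudo adduser --ingroup hadoop hduser         # ... and dedicated Hadoop system user
su - hduser
ssh-keygen -t rsa -P ""                      # passwordless key so Hadoop can SSH to localhost
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost                                # verify SSH works without a password prompt

# Disable IPv6 (Hadoop of this era binds to IPv4 only); append to /etc/sysctl.conf:
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1
#   net.ipv6.conf.lo.disable_ipv6 = 1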
Configuring Hadoop:
a. hadoop-env.sh
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
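What goes into each file, as a minimal single-node (pseudo-distributed) sketch; the JAVA_HOME path, host names, and ports follow the classic Hadoop 1.x tutorials and are assumptions, not values from the deck:

# hadoop-env.sh – point Hadoop at your JDK
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk    # path varies by system

<!-- core-site.xml – the default file system -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

<!-- mapred-site.xml – where the JobTracker runs -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

<!-- hdfs-site.xml – one replica is enough on a single node -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

After configuring, format the file system once with hadoop namenode -format and start the daemons with start-all.sh; the web interfaces below should then come up.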
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default available at these locations:
• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
Hadoop Key Characteristics:
(Figure: the four Hadoop features – Scalable, Economical, Flexible, Reliable)
• Scalable – New nodes can be added as needed, without changing
data formats, how data is loaded, how jobs are written, or the
applications on top.
• Economical – Hadoop brings massively parallel computing to 
commodity servers. The result is a sizeable decrease in the cost per 
terabyte of storage, which in turn makes it affordable to model all 
your data.
• Flexible – Hadoop is schema-less, and can absorb any type of data, 
structured or not, from any number of sources. Data from multiple 
sources can be joined and aggregated in arbitrary ways enabling 
deeper analyses than any one system can provide. 
• Reliable – When you lose a node, the system redirects work to
another node holding a copy of the data and continues processing
without missing a beat.
Hadoop Ecosystem
HDFS Architecture
Hadoop Distributed File System (HDFS):
• HDFS is designed to store a very large amount of information
(terabytes or petabytes). This requires spreading the data across a
large number of machines.
• HDFS stores data reliably. If individual machines in the cluster fail,
the data remains available thanks to redundancy.
• HDFS provides fast, scalable access to the information loaded on the 
clusters. It is possible to serve a larger number of clients by simply 
adding more machines to the cluster. 
• HDFS integrates well with Hadoop MapReduce, allowing data to be
read and computed upon locally whenever needed.
• HDFS was originally built as infrastructure for the Apache Nutch
web search engine project.
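A few HDFS shell commands make the model concrete; the paths and file name are illustrative:

hadoop fs -mkdir /user/hduser/input                # create a directory in HDFS
hadoop fs -put weblogs.txt /user/hduser/input/     # copy a local file in; it is split into blocks and replicated
hadoop fs -ls /user/hduser/input                   # list it (the second column shows the replication factor)
hadoop fs -cat /user/hduser/input/weblogs.txt      # stream the file back out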
Commodity Hardware Failure:
Hadoop does not require expensive, highly reliable hardware. It is
designed to run on clusters of commodity hardware; an HDFS instance
may consist of hundreds or thousands of server machines, each storing
part of the file system’s data. The fact that there are a huge number of
components, and that each component has a non-trivial probability of
failure, means that some component of HDFS is always non-functional.
Therefore, detection of faults and quick, automatic recovery from them
is a core architectural goal of HDFS.
Continuous Data Access:
Applications that run on HDFS need continuous access to their data
sets. HDFS is designed more for batch processing than for interactive
use by users. The emphasis is on high throughput of data access rather
than low latency.
Very Large Data Files:
Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size, so HDFS is tuned to support
large files.
It is also worth examining the applications for which HDFS does not
work so well. While this may change in the future, these are areas
where HDFS is not a good fit today:
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
What is Pig?
• Pig is an open-source, high-level dataflow system.
• It provides a simple language for queries and data manipulation,
Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.
• Why is it important?
- Companies like Yahoo, Google, and Microsoft are collecting vast
data sets in the form of click streams, search logs, and web crawls.
- Some form of ad-hoc processing and analysis of all of this
information is required.
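For instance, a top-queries job over a search log takes only a few lines of Pig Latin; the file and field names here are made up for illustration:

-- load raw tab-delimited search logs, group by query, count, and rank
logs    = LOAD 'search_logs.txt' AS (user_id:chararray, query:chararray);
grouped = GROUP logs BY query;
counts  = FOREACH grouped GENERATE group AS query, COUNT(logs) AS hits;
ranked  = ORDER counts BY hits DESC;
DUMP ranked;      -- each statement above is compiled into MapReduce stages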
Why was Pig created?
• An ad-hoc way of creating and executing MapReduce jobs on very
large data sets
• Rapid development
• No Java required
• Developed by Yahoo!
Where Should I Use Pig?
• Pig is a data flow language. It sits on top of Hadoop and makes it
possible to create complex jobs to process large volumes of data
quickly and efficiently.
• It will consume any data that you feed it: structured, semi-structured,
or unstructured.
• Pig provides the common data operations (filters, joins, ordering) and
nested data types (tuples, bags, and maps) which are missing in
MapReduce.
• Pig scripts are easier and faster to write than standard Java Hadoop
jobs, and Pig has a lot of clever optimizations, like multi-query
execution, which can make complex queries execute quicker.
What is Hive?
• Hive is a data warehouse infrastructure built on top of Hadoop.
• It facilitates querying large datasets residing in distributed storage.
• It provides a mechanism to project structure onto the data and to
query it using a SQL-like language called HiveQL.
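A short HiveQL sketch of the same top-queries idea; the table and column names are illustrative:

-- project a table structure onto tab-delimited files already sitting in HDFS
CREATE TABLE search_logs (user_id STRING, query STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/hduser/input/search_logs.txt' INTO TABLE search_logs;

-- a familiar SQL-style query, executed under the hood as MapReduce jobs
SELECT query, COUNT(*) AS hits
FROM search_logs
GROUP BY query
ORDER BY hits DESC
LIMIT 10;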
Why Hive Was Developed
• Hive was developed by Facebook and was open-sourced in 2008.
• Data stored in Hadoop is inaccessible to business users.
• High-level languages like Pig, Cascading, etc. are geared towards
developers.
• SQL is a common language that many people know. Hive was
developed to give them access to data stored in HadoopJon by
translating SQL-like queries into MapReduce jobs.
Thank you, everyone.
The morning presentation is over.