EWT Portal Practice Team 2013

Hadoop Cluster Configuration

Table of contents
1. Introduction
2. Prerequisite Software
3. Cluster Configuration on Server1 and Server2
4. Hadoop
5. Flume
6. Hive
7. HBase
8. Example Applications and Organizations using Hadoop
9. References

1. Introduction: Hadoop - A Software Framework for Data-Intensive Computing Applications
a. Hadoop?
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. It includes:
– HDFS – the Hadoop Distributed File System
– HBase – online data access
– MapReduce – offline computing engine
Yahoo! is the biggest contributor.

Here's what makes it especially useful:
• Scalable: it can reliably store and process petabytes.
• Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
• Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.

b. What does it do?
• Hadoop implements Google's MapReduce, using HDFS.
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
• Written in Java, but works with other languages.
• Runs on Linux, Windows, and more.

c. HDFS?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS:
– is highly fault-tolerant and is designed to be deployed on low-cost hardware.
– provides high-throughput access to application data and is suitable for applications that have large data sets.
– relaxes a few POSIX requirements to enable streaming access to file system data.
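To make the replication point concrete, here is a small, hedged illustration using standard HDFS commands once the cluster configured later in this guide is running; the path /user/hduser/sample.txt is only an example:

# copy a local file into HDFS; it is split into blocks and each block is replicated
/home/hduser/hadoop/bin/hadoop fs -mkdir /user/hduser
/home/hduser/hadoop/bin/hadoop fs -put /etc/hosts /user/hduser/sample.txt

# report the file's blocks and which DataNodes hold each replica
/home/hduser/hadoop/bin/hadoop fsck /user/hduser/sample.txt -files -blocks -locations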
d. MapReduce?
• Programming model developed at Google.
• Sort/merge based distributed computing.
• Initially intended for Google's internal search/indexing application, but now used extensively by many other organizations (e.g., Yahoo!, Amazon.com, IBM).
• Functional-style programming (as in LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)

e. How does MapReduce work?
• The runtime partitions the input and provides it to different Map instances.
• Map (key, value) → (key', value')
• The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions (a minimal example follows).
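A minimal sketch of this model using Hadoop Streaming, which lets Map and Reduce be plain Unix commands; the HDFS paths, the /tmp script locations, and the streaming jar path (contrib/streaming in the Hadoop 1.1.2 tarball) are assumptions and should be adjusted to your layout:

# put some text into HDFS as job input
/home/hduser/hadoop/bin/hadoop fs -mkdir /user/hduser/wc-in
/home/hduser/hadoop/bin/hadoop fs -put /etc/hosts /user/hduser/wc-in

# mapper: emit one word per line (the word acts as the key)
cat > /tmp/wc_map.sh <<'EOF'
#!/bin/sh
tr -s ' ' '\n'
EOF
# reducer: input arrives sorted/grouped by key, so uniq -c yields per-word counts
cat > /tmp/wc_reduce.sh <<'EOF'
#!/bin/sh
uniq -c
EOF
chmod +x /tmp/wc_map.sh /tmp/wc_reduce.sh

# run the streaming job and ship the two scripts to the cluster
/home/hduser/hadoop/bin/hadoop jar /home/hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.2.jar \
  -input /user/hduser/wc-in -output /user/hduser/wc-out \
  -mapper wc_map.sh -reducer wc_reduce.sh \
  -file /tmp/wc_map.sh -file /tmp/wc_reduce.sh

# inspect the per-word counts
/home/hduser/hadoop/bin/hadoop fs -cat /user/hduser/wc-out/part-*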

f. Flume?
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. The system is
centrally managed and allows for intelligent dynamic management. It uses a simple
extensible data model that allows for online analytic applications.
[Figure: Flume architecture]

g. Hive?

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
• Hive – SQL on top of Hadoop.
• Rich data types (structs, lists, and maps).
• Efficient implementations of SQL filters, joins, and group-bys on top of MapReduce.
• Allows users to access Hive data without using Hive.

Hive optimizations: efficient execution of SQL on top of MapReduce (a small HiveQL example follows).
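To make the HiveQL point concrete, here is a small, hedged sketch run through the Hive CLI once it is configured (section 6); the table name, column layout, and the /tmp/sample_logs.tsv path are illustrative only:

# create a table and run a SQL-like aggregation from the command line
/home/hduser/hive/bin/hive -e "CREATE TABLE IF NOT EXISTS logs (host STRING, msg STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
/home/hduser/hive/bin/hive -e "LOAD DATA LOCAL INPATH '/tmp/sample_logs.tsv' INTO TABLE logs;"
/home/hduser/hive/bin/hive -e "SELECT host, COUNT(*) FROM logs GROUP BY host;"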

h. HBase?

HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the
database isn't an RDBMS which supports SQL as its primary access language, but there
are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL
database, whereas HBase is very much a distributed database. Technically speaking,
HBase is really more a "Data Store" than "Data Base" because it lacks many of the
features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and
advanced query languages, etc.

i. When Should I Use HBase?
• HBase isn't suitable for every problem.
• First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
• Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase a complete redesign rather than a port.
• Third, make sure you have enough hardware. Even HDFS doesn't do well with fewer than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
• HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only.

j. What Is The Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited to the storage of large files. Its
documentation states that it is not, however, a general-purpose file system, and does not
provide fast individual record lookups in files. HBase, on the other hand, is built on top of
HDFS and provides fast record lookups (and updates) for large tables. This can
sometimes be a point of conceptual confusion. HBase internally puts your data in indexed
"StoreFiles" that exist on HDFS for high-speed lookups.

2. Prerequisite Software
To get the Hadoop, Flume, HBase, and Hive distributions, download recent stable tar files from one of the
Apache Download Mirrors.
Note: the configuration is set up on two Linux servers (Server1 and Server2).

2.1 Download the prerequisite software from the URLs below (on the Server1 machine)
a. Hadoop : http://download.nextag.com/apache/hadoop/common/stable/
b. Flume : http://archive.apache.org/dist/flume/stable/
c. Hive : http://download.nextag.com/apache/hive/stable/
d. HBase : http://www.eng.lsu.edu/mirrors/apache/hbase/stable/

2.2 Download Java 1.6/1.7:
http://www.oracle.com/technetwork/java/javase/downloads/index.html

2.3 Stable versions of the Hadoop components (August 2013)
• Hadoop-1.1.2
• Flume-1.4.0
• HBase-0.94.9
• Hive-0.10.0

3. Cluster configuration on Server1 and Server2

Create a user and password with admin permission (on Server1 and Server2).
3.1 Task: Add a user with a group to the system
groupadd hadoop
useradd -G hadoop hduser

3.2 Task: Add a password to hduser
passwd hduser

3.3 Open the hosts file on the Server1 system and edit /etc/hosts (repeat on the Server2 system). For example:

# /etc/hosts
102.54.94.97    rhino.acme.com    master    # Server1
102.54.94.98    rhino.acme.com    slaves    # Server2

3.4 Create authentication SSH keys on Server1 and Server2

3.4.1 First log in to Server1 as user hduser and generate a public/private key pair using the following command. (Note: same steps on Server2.)
ssh-keygen -t rsa -P ""

3.4.2 Upload the generated public key from Server1 to Server2
Using SSH from Server1, upload the newly generated public key (id_rsa.pub) to Server2 under hduser's .ssh directory as a file named authorized_keys. (Note: same steps from Server2 to Server1.)

3.4.3 Log in from Server1 to Server2 without a password (a consolidated sketch of the whole key exchange follows)
ssh server1
ssh server2
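A minimal sketch of the full key exchange, run as hduser on Server1 and mirrored on Server2; the hostnames server1/server2 and the ~/.ssh paths are assumptions based on the steps above, and ssh-copy-id is used only if it is available on your distribution:

# generate the key pair with an empty passphrase (run once on each server)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# copy the public key to the other server's authorized_keys
ssh-copy-id hduser@server2
# or, if ssh-copy-id is not installed, append it manually:
cat ~/.ssh/id_rsa.pub | ssh hduser@server2 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'

# verify that no password prompt appears
ssh hduser@server2 hostname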

4. Hadoop

Create a directory called hadoop under /home/hduser:
mkdir hadoop
chmod -R 777 hadoop

a. Extract hadoop-1.1.2.tar.gz into the hadoop directory
tar -xzvf hadoop-1.1.2.tar.gz
Check that the files were extracted into /home/hduser/hadoop, then set ownership:
sudo chown -R hduser:hadoop hadoop

b. Create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /hduser/hadoop/tmp
$ sudo chown hduser:hadoop /hduser/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /hduser/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the
respective configuration XML files.

core-site.xml

In file hadoop/conf/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. A URI whose scheme and authority determine
the FileSystem implementation. The uri's scheme determines the config property
(fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


mapred-site.xml

In file conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are
run in-process as a single map and reduce task. </description>
</property>

hdfs-site.xml

In file conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when
the file is created. The default is used if replication is not specified in create time. </description>
</property>

Also edit conf/hadoop-env.sh (set JAVA_HOME) and list the master and slave hostnames in the conf/masters and conf/slaves files; a sketch of typical contents follows.
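A minimal, hedged sketch of what these three files might contain for this two-node setup; the JAVA_HOME path is only an example and must match the JDK installed in section 2.2, and the hostnames come from the /etc/hosts entries above:

# conf/hadoop-env.sh -- point Hadoop at your JDK
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# conf/masters -- host that runs the SecondaryNameNode
master

# conf/slaves -- hosts that run the DataNode/TaskTracker daemons (one per line)
master
slaves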

c. Formatting the HDFS filesystem via the NameNode
hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format

d. Starting your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh

e. Stopping your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/stop-all.sh

For clustering, open the Server2 (slave) system

1. Log in as hduser
2. Make the directory /home/hduser
3. Move the hadoop directory onto the Server2 (slave) system
4. Copy it over with: scp -r hduser@master:/home/hduser/hadoop /home/hduser/hadoop
5. Starting your multi-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
6. Check that the expected processes started on both machines (master and slave); see the sketch below
7. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
8. ps -e | grep java
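A quick way to perform the check in step 6, assuming the JDK's jps tool is on the PATH; the daemon lists below are what a Hadoop 1.x master and slave typically run in this layout and may differ slightly in your setup:

# on the master (Server1), jps should show roughly:
jps
#   NameNode
#   SecondaryNameNode
#   JobTracker
#   DataNode      (the master also acts as a worker in this two-node setup)
#   TaskTracker

# on the slave (Server2):
jps
#   DataNode
#   TaskTracker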

5. Flume
Apache Flume Configuration

1. Extract apache-flume-1.4.0-bin.tar.gz into the flume directory
tar -xzvf apache-flume-1.4.0-bin.tar.gz
Check that the files were extracted into /home/hduser/flume, then set ownership:
sudo chown -R hduser:hadoop flume
2. Open the flume directory and run the commands below
a. sudo cp conf/flume-conf.properties.template conf/flume.conf
b. sudo cp conf/flume-env.sh.template conf/flume-env.sh
c. Open the conf directory and check that 5 files are available

flume.conf

3. Overwrite the file flume/conf/flume.conf with the following content:
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Here exec1 is source name.
agent1.sources.exec1.channels = ch1
agent1.sources.exec1.type = exec
agent1.sources.exec1.command = tail -F /var/log/anaconda.log
# (any readable text or log file can be tailed here instead)

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
# Here HDFS is sink name.
agent1.sinks.HDFS.channel = ch1
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://master:9000/user/root/flumeout.log
agent1.sinks.HDFS.hdfs.fileType = DataStream

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
# the source name can be anything (here exec1)
agent1.sources = exec1
# the sink name can be anything (here HDFS)
agent1.sinks = HDFS

4. Run the command
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

5. Check that the file is written to HDFS (see the sketch below)
6. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
7. hadoop fs -cat /user/root/*
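A hedged end-to-end check of the Flume flow above; the -Dflume.root.logger option only makes the agent log to the console, appending to /var/log/anaconda.log may require sudo, and the sink-generated file names under /user/root will differ from run to run:

# start the agent with console logging (from the flume directory)
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1 -Dflume.root.logger=INFO,console

# in another terminal, append a test line to the tailed log
echo "flume test event" | sudo tee -a /var/log/anaconda.log

# list and read what the HDFS sink wrote
/home/hduser/hadoop/bin/hadoop fs -ls /user/root
/home/hduser/hadoop/bin/hadoop fs -cat /user/root/*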

6. Hive
Apache Hive Configuration

1. Extract hive-0.10.0.tar.gz into the hive directory
tar -xzvf hive-0.10.0.tar.gz
Check that the files were extracted into /home/hduser/hive, then set ownership:
sudo chown -R hduser:hadoop hive

2. Overwrite the file hive/conf/hive-site.xml with the following content:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
</configuration>
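The connection properties above assume a MySQL-backed metastore; below is a minimal, hedged sketch of the one-time setup they imply. The database name hive_metastore and user root come from the config, while the connector jar version and paths are examples. Note also that fs.default.name and mapred.job.tracker here should normally match the values used in core-site.xml and mapred-site.xml earlier in this guide.

# create the metastore database referenced by javax.jdo.option.ConnectionURL
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS hive_metastore;"

# make the MySQL JDBC driver visible to Hive (jar name/version is an example)
cp mysql-connector-java-5.1.25-bin.jar /home/hduser/hive/lib/

# launch the Hive CLI (expects the Hadoop installation from section 4)
export HADOOP_HOME=/home/hduser/hadoop
/home/hduser/hive/bin/hive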

7. HBase
Apache HBase Configuration

1. Extract hbase-0.94.9.tar.gz into the hbase directory
tar -xzvf hbase-0.94.9.tar.gz
Check that the files were extracted into /home/hduser/hbase, then set ownership:
sudo chown -R hduser:hadoop hbase

2. Overwrite the file hbase/conf/hbase-site.xml with the following content:

hbase-site.xml

<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/hadoop/data/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

regionservers

3. Overwrite the file hbase/conf/regionservers with the region server hostnames (one per line):
master
slaves

4. Open the hbase directory and run the command below (a quick smoke test follows)
bin/start-hbase.sh
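A quick, hedged smoke test once HBase is running, using the standard hbase shell; the table and column family names are examples only:

# open the HBase shell
/home/hduser/hbase/bin/hbase shell

# inside the shell: create a table, write one cell, read it back, then leave
#   create 'test', 'cf'
#   put 'test', 'row1', 'cf:greeting', 'hello'
#   scan 'test'
#   exit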

8. Example Applications and Organizations using Hadoop
• A9.com (Amazon): to build Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
• Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2x4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search.
• AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB of RAM and an 800 GB hard disk, giving a total of 37 TB of HDFS capacity.
• Facebook: to store copies of internal log and dimension data sources and use them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.

9. References:

1. http://download.nextag.com/apache/hadoop/common/stable/
2. http://archive.apache.org/dist/flume/stable/
3. http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
4. http://download.nextag.com/apache/hive/stable/
5. http://www.oracle.com/technetwork/java/javase/downloads/index.html
6. http://myadventuresincoding.wordpress.com/2011/12/22/linux-how-to-ssh-between-two-linux-computers-without-needing-a-password/
7. http://hadoop.apache.org/docs/stable/single_node_setup.html
8. http://hadoop.apache.org/docs/stable/cluster_setup.html
