Introduction to the Hadoop Ecosystem
Agenda
• Hadoop Brief History
• What Is a Hadoop Distribution
• Hadoop Distributions
• Core components of Hadoop
• Hadoop Base Platform
• Hadoop Cluster
• Hadoop Distributed File Systems
• Hadoop Map Reduce
• HBase
• Hive
• Pig
• RHadoop
• RHive
Data!
We live in the data age.
The size of the “digital universe” was estimated at 0.18 zettabytes in 2006, with
tenfold growth to 1.8 zettabytes forecast by 2011.
A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million
petabytes, or one billion terabytes.
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data
per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of
storage.
So there’s a lot of data out there.
Data Storage and Analysis
• One-terabyte drives are the norm, but the transfer
speed is around 100 MB/s, so it takes more than two and a half hours to read all the
data off the disk.
• This is a long time to read all data on a single drive, and writing is even slower.
• Imagine if we had 100 drives, each holding one hundredth of the data. Working in
parallel, we could read the data in under two minutes.
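The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly parallel reads; the function name is only illustrative.

```python
def read_time_seconds(total_bytes, bytes_per_second, drives=1):
    """Time to read total_bytes spread evenly over `drives` disks read in parallel."""
    return total_bytes / drives / bytes_per_second

TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

single = read_time_seconds(1 * TB, 100 * MB)            # one drive
parallel = read_time_seconds(1 * TB, 100 * MB, 100)     # 100 drives

print(f"one drive:  {single / 3600:.2f} hours")
print(f"100 drives: {parallel / 60:.2f} minutes")
```

One drive needs 10,000 seconds (about 2.8 hours), while 100 drives need 100 seconds, which matches the "under two minutes" claim.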
The first problem to solve is hardware failure:
• As soon as you start using many pieces of hardware, the chance that one will fail is
fairly high.
• A common way of avoiding data loss is through replication: redundant copies of the
data are kept by the system so that in the event of failure, there is another copy
available. This is where the distributed filesystem (DFS) comes in.
The second problem to solve is combining the data:
• Most analysis tasks need to be able to combine the data in some way; data read
from one disk may need to be combined with the data from any of the other 99
disks.
• Various distributed systems allow data to be combined from multiple sources, but
doing this correctly is notoriously challenging.
• MapReduce provides a programming model that abstracts the problem from disk
reads and writes, transforming it into a computation over sets of keys and values.
A Brief History of Hadoop
Google File System GFS Architecture.
• A GFS cluster consists of multiple
nodes. These nodes are divided into two
types: one Master node and a large number
of Chunkservers
• In 2004, the Nutch developers set about writing an open source
implementation, the Nutch Distributed Filesystem (NDFS).
• Hadoop was created by Doug Cutting and Michael J. Cafarella[8] in 2005. Doug,
who was working at Yahoo at the time,[9] named it after his son's toy elephant.[10] It
was originally developed to support distribution for the Nutch search engine
project.[11]
Apache Nutch is a project of the Apache Software Foundation. Nutch was started in
2002. However, they realized that their architecture wouldn’t scale to the
billions of pages on the Web
Help was at hand with the publication of a paper in 2003 that described
the architecture of Google’s distributed filesystem, called GFS, which
was being used in production at Google
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself
a part of the Lucene project
Google’s Query Processor
The Hadoop Ecosystem
What Hadoop is, and what it’s not
• MapReduce
– Created at Google in 2004
– The MapReduce framework is the powerhouse behind most of today’s big data
processing
– In addition to Hadoop, you’ll find MapReduce inside MPP and NoSQL databases,
such as Vertica or MongoDB
– At its core, Hadoop is an open source MapReduce implementation. Funded by
Yahoo, it emerged in February 2006
– The strength of MapReduce is its ability to distribute computation over multiple servers
• HDFS
– For a MapReduce computation to take place, each server must have access to the data. This
is the role of HDFS, the Hadoop Distributed File System
– HDFS and MapReduce are robust
– Servers in a Hadoop cluster can fail without aborting the computation process
– HDFS ensures data is replicated with redundancy across the cluster
– On completion of a calculation, a node will write its results back into HDFS
• Improving programmability: Pig and Hive
• Improving data access: HBase, Sqoop and Flume
The core of Hadoop
Ambari      Deployment, configuration and monitoring
Flume       Collection and import of log and event data
HBase       Column-oriented database scaling to billions of rows
HCatalog    Schema and data type sharing over Pig, Hive and MapReduce
HDFS        Distributed redundant file system for Hadoop
Hive        Data warehouse with SQL-like access
Mahout      Library of machine learning and data mining algorithms
MapReduce   Parallel computation on server clusters
Pig         High-level programming language for Hadoop computations
Oozie       Workflow and coordination service for managing Hadoop jobs
Sqoop       Imports data from relational databases
Whirr       Cloud-agnostic deployment of clusters
Zookeeper   Configuration management and coordination
The Hadoop Bestiary
Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under
the Apache v2 license. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
Apache Hadoop
Cloudera : Hadoop :: Red Hat : Linux
Cloudera’s Distribution Including Apache Hadoop (CDH)
A packaged set of Hadoop modules that work together
Now at CDH4
Largest contributor of code to Apache Hadoop
CDH4
The world's leading Apache Hadoop distribution.
CDH (Cloudera's Distribution, including Apache Hadoop) is Cloudera's 100% open-source Hadoop distribution, and the world's leading Apache
Hadoop solution.
Cloudera
Installations are available for
• Windows
• Red Hat Linux
Installing and Configuring
HDP using Hortonworks
Management Center
Powered by Apache Hadoop
MapR’s Distribution for Apache Hadoop
MapR is a complete distribution that includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume and more. MapR’s distribution is 100% API
compatible with Hadoop (MapReduce, HDFS and HBase™). MapR Technologies has significantly advanced Hadoop by making it easy, dependable, and
fast.
On Cloud: http://aws.amazon.com/elasticmapreduce/mapr/
MapR Apache Hadoop
Open Platform for Next-Gen Analytics
Intel® Distribution for Apache Hadoop* software (Intel® Distribution) is a software platform that provides distributed
processing and data management for enterprise applications that analyze massive amounts of diverse data. Intel Distribution is an
open source software product that includes Apache Hadoop and other software components along with enhancements and fixes
from Intel. Proven in production at some of the most demanding enterprise deployments in the world, Intel Distribution is
supported by a worldwide engineering team with access to expertise in the entire software stack as well as the underlying
processor, storage, and networking components.
Key Features:
• Up to 30x boost in Hadoop performance with optimizations for Intel® Xeon processors, Intel® SSD storage, and Intel®
10GbE networking
• Data confidentiality without a performance penalty with encryption and decryption in HDFS enhanced by Intel® AES-NI
and role-based access control with cell-level granularity in HBase
• Multi-site scalability and adaptive data replication in HBase and HDFS
• Up to 3.5x improvement in Hive query performance
• Support for statistical analysis with R connector
• Enables graph analytics with Intel® Graph Builder
• Enterprise-grade support and services from Intel
Intel Distribution for Apache Hadoop Software
Apache Hadoop in the cloud and on premises,
accommodating both Windows and Linux.
• Interactive Console
• Run a Pig Latin job from the
Interactive JavaScript Console
• Create and run a JavaScript
MapReduce job
• Execute a job using Hive
• Remote Desktop
• The Hadoop command shell
• View the Job Tracker
• View HDFS
• Open Port
• Connect Excel Hive Add-In To
Hadoop on Azure via HiveODBC
• FTP data to Hadoop on Azure
• Manage Data
• Import Data from Data Market
• Set up ASV: use your Windows
Azure Blob Store account
• Set up S3: use your Amazon S3 account
Apache Hadoop HDInsight Service Dashboard
HDInsight Service makes Apache Hadoop available as a service in the cloud. It makes the
HDFS/MapReduce software framework available in a simpler, more scalable, and cost efficient
environment.
Seamless interoperability with your Microsoft tools
IBM InfoSphere BigInsights
Bringing the power of Hadoop to the enterprise
Hadoop Base Platform
Creating a Hadoop User
It’s good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
For small clusters, some administrators choose to make this user’s home directory an
NFS-mounted drive, to aid with SSH key distribution.
Hadoop Cluster
A common Hadoop cluster architecture consists of a two-level network topology, as
illustrated in Figure 9-1. Typically there are 30 to 40 servers per rack, with a 1 Gb switch
for the rack (only three are shown in the diagram), and an uplink to a core switch or
router (which is normally 1 Gb or better). The salient point is that the aggregate
bandwidth between nodes on the same rack is much greater than that between nodes on
different racks.
Network Topology
• Files split into 128MB blocks
• Blocks replicated across several datanodes
(usually 3)
• Single namenode stores metadata (file
names, block locations, etc)
• Optimized for large files, sequential reads
• Files are append-only
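The block and replica bookkeeping described above can be sketched in a few lines. This is only an illustration: the function names and datanode labels are hypothetical, and the round-robin placement is an assumption standing in for the rack-aware placement policy that a real HDFS namenode applies.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as stated above
REPLICATION = 3                 # the usual replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Map each block index to `replication` distinct datanodes.

    Assumption: simple round-robin placement, not the real HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

blocks = split_into_blocks(500 * 1024 * 1024)  # a 500 MB file
metadata = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in metadata.items():
    print(block, blocks[block], nodes)
```

The `metadata` dictionary plays the role of the namenode's block-location table: the datanodes hold the bytes, while the namenode only records which node holds which block.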
Figure: Namenode and datanodes. File1 is split into blocks 1 to 4, and each block is stored on three of the four datanodes.
Hadoop Distributed File Systems
Block Replication
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a
Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of
all blocks on a DataNode.
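A minimal sketch of how heartbeats and block reports could drive replication decisions is shown below. The class, its method names, and the 30-second timeout are hypothetical choices made for illustration; they are not the actual NameNode implementation or its configured defaults.

```python
HEARTBEAT_TIMEOUT = 30  # seconds; hypothetical value chosen for the example

class ToyNameNode:
    """Tracks heartbeats and block reports to find under-replicated blocks."""

    def __init__(self, replication=3):
        self.replication = replication
        self.last_heartbeat = {}  # datanode -> time of last heartbeat
        self.block_reports = {}   # datanode -> set of block ids it holds

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, blocks):
        self.block_reports[datanode] = set(blocks)

    def live_datanodes(self, now):
        return {dn for dn, seen in self.last_heartbeat.items()
                if now - seen <= HEARTBEAT_TIMEOUT}

    def under_replicated(self, now):
        """Return blocks with fewer live replicas than the replication factor."""
        live = self.live_datanodes(now)
        counts = {}
        for datanode, blocks in self.block_reports.items():
            for block in blocks:
                counts.setdefault(block, 0)
                if datanode in live:
                    counts[block] += 1
        return sorted(b for b, n in counts.items() if n < self.replication)

nn = ToyNameNode()
reports = {"dn1": [1, 2, 4], "dn2": [1, 2, 3], "dn3": [1, 3, 4], "dn4": [2, 3, 4]}
for dn, blocks in reports.items():
    nn.heartbeat(dn, now=0)
    nn.block_report(dn, blocks)
healthy = nn.under_replicated(now=10)

# dn4 falls silent; the other datanodes keep sending heartbeats.
for dn in ("dn1", "dn2", "dn3"):
    nn.heartbeat(dn, now=60)
after_failure = nn.under_replicated(now=60)
print(healthy, after_failure)
```

Once dn4 misses its heartbeats, every block it reported drops to two live replicas, so those blocks become candidates for re-replication.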
Setting Up a Hadoop Cluster
Anatomy of a File Write
The case we’re going to consider is the case of
creating a new file, writing data to it,
then closing the file. See Figure.
1. The client creates the file by calling create() on DistributedFileSystem.
2. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace.
3. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data
queue. The list of datanodes forms a pipeline; we’ll assume the replication level is three, so there are three nodes in
the pipeline.
4. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to
the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third
(and last) datanode in the pipeline.
5. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes,
called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes
in the pipeline.
– If a datanode fails while data is being written to it, the failed datanode is removed from the pipeline and the
remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that
the block is under-replicated, and it arranges for a further replica to be created on another node.
6. When the client has finished writing data, it calls close() on the stream.
7. The close() action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before
contacting the namenode to signal that the file is complete.
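The interplay of the data queue, the pipeline, and the ack queue can be sketched as a small simulation. This is a simplified, synchronous model under stated assumptions: the function name is hypothetical, each datanode is just a Python list, and acknowledgments arrive immediately, whereas the real DFSOutputStream and DataStreamer work asynchronously.

```python
from collections import deque

def write_file(data, packet_size, pipeline):
    """Simulate the data queue / ack queue handling of a file write.

    `pipeline` is a list of lists, one per datanode, that receive packets.
    Returns the number of packets acknowledged by every datanode.
    """
    # Step 3: split the data into packets and place them on the data queue.
    data_queue = deque(data[i:i + packet_size]
                       for i in range(0, len(data), packet_size))
    ack_queue = deque()
    acknowledged = 0
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)  # step 5: packet now awaits acknowledgment
        # Step 4: each datanode stores the packet and forwards it to the next.
        acks = 0
        for datanode in pipeline:
            datanode.append(packet)
            acks += 1
        # The packet leaves the ack queue only when all datanodes acknowledged it.
        if acks == len(pipeline):
            ack_queue.popleft()
            acknowledged += 1
    # Steps 6-7: close() may complete only when nothing is left unacknowledged.
    if ack_queue:
        raise RuntimeError("unacknowledged packets remain")
    return acknowledged

pipeline = [[], [], []]  # replication level three
packets = write_file(b"abcdefghij", 4, pipeline)
print(packets, [b"".join(node) for node in pipeline])
```

After the call, each of the three simulated datanodes holds a full copy of the data.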
Anatomy of a File Read:
To get an idea of how data flows between the client interacting with HDFS, the
namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of
events when reading a file.
1. The client opens the file it wishes to read by calling open() on the DistributedFileSystem object.
2. DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file.
3. The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in
the file, then connects to the first (closest) datanode for the first block in the file.
4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
5. When the end of the block is reached, DFSInputStream closes the connection to the datanode, then finds the best datanode
for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
6. Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It
will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream.
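The read path can be sketched the same way: for each block, pick the closest datanode that holds it and append that block's bytes to the stream. The function name, the datanode labels, and the distance values are hypothetical; a real client gets block locations from the namenode already sorted by network proximity.

```python
def read_file(block_locations, datanodes, distance):
    """Read blocks in order, each from the closest datanode that holds it.

    block_locations: list of (block_id, [datanode, ...]) in file order.
    datanodes: datanode -> {block_id: bytes} (simulated storage).
    distance: function giving the client's distance to a datanode.
    Returns the file contents and the datanode chosen for each block.
    """
    chunks, chosen = [], []
    for block_id, holders in block_locations:
        closest = min(holders, key=distance)
        chunks.append(datanodes[closest][block_id])
        chosen.append(closest)
    return b"".join(chunks), chosen

datanodes = {
    "dn1": {0: b"hello ", 1: b"world"},
    "dn2": {0: b"hello "},
    "dn3": {1: b"world"},
}
block_locations = [(0, ["dn1", "dn2"]), (1, ["dn1", "dn3"])]
distances = {"dn1": 4, "dn2": 0, "dn3": 2}  # assumed network distances

content, chosen = read_file(block_locations, datanodes, distances.get)
print(content, chosen)
```

From the caller's point of view the result is one continuous byte stream, even though each block came from a different node.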
MapReduce data flow
MapReduce Programming Model
• Data type: key-value records
• Map function:
(K_in, V_in) → list(K_inter, V_inter)
• Reduce function:
(K_inter, list(V_inter)) → list(K_out, V_out)
Figure: MapReduce operation. A data collection (size: TByte) is split to supply multiple processors (split 1, split 2, ... split n). Each Map emits <key, 1> pairs for words such as Cat, Bat, Dog and other words; the Reducers (say, Count) write the counts to output partitions P-0000, P-0001 and P-0002.
MAP: Input data → <key, value> pair
REDUCE: <key, value> pair → <result>
MapReduce Operation
The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted
boxes indicate nodes, the light arrows show data transfers on a node, and the heavy
arrows show data transfers between nodes.
Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output (the reduce function picks the maximum value in the list):
(1950, 25)
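The same example can be replayed in a few lines of Python. The grouping step stands in for the shuffle that the framework performs between map and reduce; the function name `reduce_max` is illustrative.

```python
from collections import defaultdict

def reduce_max(key, values):
    """Reduce function: keep the maximum value seen for a key."""
    return (key, max(values))

map_outputs = [
    [(1950, 0), (1950, 20), (1950, 10)],  # output of the first map
    [(1950, 25), (1950, 15)],             # output of the second map
]

# Shuffle: collect all values for the same key into one list.
grouped = defaultdict(list)
for output in map_outputs:
    for key, value in output:
        grouped[key].append(value)

results = [reduce_max(key, values) for key, values in sorted(grouped.items())]
print(dict(grouped))
print(results)
```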
MapReduce Example Word Count
def mapper(line):
    # Emit (word, 1) for every word on the input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum the counts collected for one word.
    yield (key, sum(values))
Word Count Execution
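Input (one split per Map):
  Map 1: the quick brown fox
  Map 2: the fox ate the mouse
  Map 3: how now brown cow
Shuffle & Sort (map output sent to each Reduce):
  Map 1 → Reduce 1: the, 1; brown, 1; fox, 1
  Map 1 → Reduce 2: quick, 1
  Map 2 → Reduce 1: the, 1; fox, 1; the, 1
  Map 2 → Reduce 2: ate, 1; mouse, 1
  Map 3 → Reduce 1: how, 1; now, 1; brown, 1
  Map 3 → Reduce 2: cow, 1
Output:
  Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
  Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
Input → Map → Shuffle & Sort → Reduce → Output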
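The whole word count flow can be run locally as a sketch. The `run_job` driver is a hypothetical stand-in for the framework: it performs the shuffle in memory with a dictionary and uses a single reducer, whereas Hadoop partitions keys across several reduce tasks.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word on the input line."""
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    """Sum the counts collected for one word."""
    yield (key, sum(values))

def run_job(lines):
    """Run map, shuffle & sort, and reduce in memory (single reducer)."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    output = {}
    for key in sorted(shuffled):
        for word, total in reducer(key, shuffled[key]):
            output[word] = total
    return output

counts = run_job(["the quick brown fox",
                  "the fox ate the mouse",
                  "how now brown cow"])
print(counts)
```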
An Optimization: The Combiner
• A combiner is a local aggregation function for repeated
keys produced by the same map
• Works for associative and commutative functions like sum, count, max
• Decreases size of intermediate data
• Example: map-side aggregation for Word Count:
def combiner(key, values):
    # Runs on the map side: pre-sum the counts for one word.
    yield (key, sum(values))
Word Count with Combiner
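Input → Map & Combine → Shuffle & Sort → Reduce → Output
Input (one split per Map):
  Map 1: the quick brown fox
  Map 2: the fox ate the mouse
  Map 3: how now brown cow
Shuffle & Sort (combined map output sent to each Reduce):
  Map 1 → Reduce 1: the, 1; brown, 1; fox, 1
  Map 1 → Reduce 2: quick, 1
  Map 2 → Reduce 1: the, 2; fox, 1
  Map 2 → Reduce 2: ate, 1; mouse, 1
  Map 3 → Reduce 1: how, 1; now, 1; brown, 1
  Map 3 → Reduce 2: cow, 1
Output:
  Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
  Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1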
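The effect of the combiner on the amount of intermediate data can be measured with a small sketch. The helper `map_and_combine` is hypothetical; it applies the combiner to the output of a single map, which is the only place a combiner is allowed to aggregate.

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)

def combiner(key, values):
    yield (key, sum(values))

def map_and_combine(line):
    """Run the mapper on one split, then combine repeated keys locally."""
    local = defaultdict(list)
    for key, value in mapper(line):
        local[key].append(value)
    return [pair for key in sorted(local) for pair in combiner(key, local[key])]

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
pairs_without = sum(len(list(mapper(s))) for s in splits)
pairs_with = sum(len(map_and_combine(s)) for s in splits)
print(map_and_combine(splits[1]))
print(pairs_without, pairs_with)
```

Only the second split contains a repeated word, so the saving here is a single pair; on real text, where common words repeat often within a split, the reduction is much larger.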
MapReduce Execution Details
• Single master controls job execution on multiple slaves
• Mappers preferentially placed on same node or same rack as
their input block
– Minimizes network usage
• Mappers save outputs to local disk before serving them to
reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes
Anatomy of a MapReduce Job Run
The client, which submits the MapReduce job, runs a job submission process that does the following:
• Asks the jobtracker for a new job ID (by calling
getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example,
if the output directory has not been specified or it
already exists, the job is not submitted and an error is
thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot
be computed (because the input paths don’t exist, for
example), then the job is not submitted and an error is
thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed
input splits, to the jobtracker’s filesystem in a directory named after the job ID. The job JAR is copied with a
high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there
are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
Fault Tolerance in MapReduce
1. If a task crashes:
– Retry on another node
» OK for a map because it has no dependencies
» OK for reduce because map outputs are on disk
– If the same task fails repeatedly, fail the job or ignore that
input block (user-controlled)
Note: For these fault tolerance features to work, your
map and reduce tasks must be side-effect-free
2. If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran
» Necessary because their output files were lost along with
the crashed node
3. If a task is going slowly (straggler):
– Launch second copy of task on another node (“speculative
execution”)
– Take the output of whichever copy finishes first, and kill the
other
Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software
bugs, misconfiguration, etc
– Single straggler may noticeably slow down a job
Hadoop comes with a web UI for viewing information about your jobs. It is useful for
following a job’s progress while it is running, as well as finding job statistics and logs
after the job has completed. You can find the UI at http://jobtracker-host:50030/.
Figure 5-2. Screenshot of the job page
Walkthrough the MapReduce Web UI
The task details page
Retrieving the Results
Once the job is finished, there are various ways to retrieve the results. Each reducer produces one
output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp
directory.
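The part file naming, and one way to stitch the files back together, can be sketched as follows. Both helper names are hypothetical, and `merge_parts` assumes the output directory has already been copied to the local filesystem.

```python
from pathlib import Path

def part_file_names(num_reducers):
    """Names of the reducer output files, one per reducer."""
    return [f"part-r-{i:05d}" for i in range(num_reducers)]

def merge_parts(directory):
    """Concatenate the part files of a local output directory, in name order."""
    parts = sorted(Path(directory).glob("part-r-*"))
    return "".join(part.read_text() for part in parts)

names = part_file_names(30)
print(names[0], names[-1])
```

With 30 reducers the names run from part-r-00000 to part-r-00029, as in the max-temp directory described above.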
MapReduce
HBase
HBase Architecture
• HBase is a distributed column-oriented database built on top of HDFS
• HBase is the Hadoop application to use when you require real-time read/write
random-access to very large datasets
What if you need the database features that Hive doesn’t provide, like row-level
updates, rapid query response times, and transactions?
HBase Cluster Members
• HBase depends on ZooKeeper and by default it manages a ZooKeeper instance
as the authority on cluster state
• ZooKeeper hosts vitals such as the location of the root
catalog table and the address of the current cluster
Master
• Regionserver slave nodes are listed in the HBase
conf/regionservers file as you would list datanodes
and tasktrackers in the Hadoop conf/slaves file
• There are multiple implementations of the filesystem
interface (one for the local filesystem, one for the KFS
filesystem, Amazon’s S3, and HDFS, the Hadoop
Distributed Filesystem), and HBase can persist to any of
these implementations
• By default, unless told otherwise, HBase writes to the local filesystem
• To administer your HBase instance, launch the HBase shell by typing:
% hbase shell
hbase(main):001:0>
• To create a table named test with a single column family named data, enter:
hbase(main):007:0> create 'test', 'data'
0 row(s) in 1.3066 seconds
• To prove the new table was created successfully, run the list command. This will output
all tables in user space:
hbase(main):019:0> list
test
1 row(s) in 0.1485 seconds
• To insert data into three different rows and columns in the data column family, and then
list the table content, do the following:
hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
hbase(main):024:0> scan 'test'
ROW     COLUMN+CELL
row1    column=data:1, timestamp=1240148026198, value=value1
row2    column=data:2, timestamp=1240148040035, value=value2
row3    column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
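To make the data model behind these shell commands concrete, here is a toy in-memory table. It is not the HBase client API: the class `ToyTable` and its methods are hypothetical, and they only mimic two properties visible in the session, namely that column families are fixed when the table is created while column qualifiers are added freely, and that a scan returns rows in row-key order.

```python
class ToyTable:
    """In-memory stand-in for an HBase table (not the HBase client API)."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)
        self.rows = {}  # row key -> {"family:qualifier": (timestamp, value)}

    def put(self, row, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[column] = (timestamp, value)

    def scan(self):
        """Yield (row, column, timestamp, value) in row-key order."""
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                timestamp, value = self.rows[row][column]
                yield (row, column, timestamp, value)

test = ToyTable("test", ["data"])
test.put("row3", "data:3", "value3", timestamp=1240148047497)
test.put("row1", "data:1", "value1", timestamp=1240148026198)
test.put("row2", "data:2", "value2", timestamp=1240148040035)
cells = list(test.scan())
for cell in cells:
    print(cell)
```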
Stargate
Stargate is the name of the REST server bundled with HBase.
Query Table List
Examples:
% curl http://localhost:8000/
HTTP/1.1 200 OK
Content-Length: 13
Cache-Control: no-cache
Content-Type: text/plain
test
% curl -H "Accept: text/xml" http://localhost:8000/
HTTP/1.1 200 OK
Cache-Control: no-cache
Content-Type: text/xml
Content-Length: 121
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TableList><Table name="content"/><Table name="urls"/></TableList>
% curl -H "Accept: application/json" http://localhost:8000/
HTTP/1.1 200 OK
Cache-Control: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
{"Table":[{"name":"test"},{"name":"urls"}]}
Set Accept header to text/plain for plain text output.
Set Accept header to text/xml for XML reply.
Set Accept header to application/json for JSON reply.
Set Accept header to application/x-protobuf for protobufs
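The same table-list query can be issued from a program by setting the Accept header. The sketch below uses only the Python standard library; the helper names are hypothetical, the URL is the one used in the curl examples above, and `list_tables` needs a running REST server, so it is defined but not called here.

```python
import json
import urllib.request

def build_table_list_request(base_url="http://localhost:8000/",
                             accept="application/json"):
    """Build a GET request for the table list with the given Accept header."""
    return urllib.request.Request(base_url, headers={"Accept": accept})

def parse_table_list(body):
    """Extract table names from a JSON reply shaped like the example above."""
    return [table["name"] for table in json.loads(body)["Table"]]

def list_tables(base_url="http://localhost:8000/"):
    """Fetch and parse the table list; requires a running REST server."""
    request = build_table_list_request(base_url)
    with urllib.request.urlopen(request) as response:
        return parse_table_list(response.read())

request = build_table_list_request()
tables = parse_table_list('{"Table":[{"name":"test"},{"name":"urls"}]}')
print(request.get_header("Accept"), tables)
```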
• Many parallel algorithms can be expressed by a series
of MapReduce jobs
• But MapReduce is fairly low-level: you must think about
keys, values, partitioning, etc.
• Can we capture common “job building blocks”?
Motivation
Apache Oozie
max-temp-workflow/
├── lib/
│ └── hadoop-examples.jar
└── workflow.xml
An open-source workflow/coordination service to manage data processing jobs
for Hadoop, developed and then open-sourced by Yahoo!
Example 5-14. Oozie workflow definition to run the maximum temperature MapReduce job
<workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
  <start to="max-temp-mr"/>
  <action name="max-temp-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/${wf:user()}/output"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>OldMaxTemperature$OldMaxTemperatureMapper</value>
        </property>
        <property>
          <name>mapred.combiner.class</name>
          <value>OldMaxTemperature$OldMaxTemperatureReducer</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>OldMaxTemperature$OldMaxTemperatureReducer</value>
        </property>
        <property>
          <name>mapred.output.key.class</name>
          <value>org.apache.hadoop.io.Text</value>
        </property>
        <property>
Pig
• Started at Yahoo! Research
• Runs about 30% of Yahoo!’s jobs
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
– Provides relational (SQL) operators (JOIN, GROUP BY,
etc.)
– Easy to plug in Java functions
– Pig Pen development environment for Eclipse
• Higher level data flow language
• Converts scripts into MapReduce jobs and runs them
• Provides good functionality (joins, partitioners)
• Very compact!
• A comparison between Pig & Java
• Faster to develop
• Slower to run
An Example Problem
Suppose you have user data in one file, page view data in another, and
you need to find the top 5 most visited pages by users aged 18 to 25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
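For reference, the dataflow above is small enough to express directly in plain Python on toy data. The function name and the sample records are hypothetical, and the sketch assumes user names are unique, so the join reduces to a membership test.

```python
from collections import Counter

def top_pages(users, pages, min_age=18, max_age=25, limit=5):
    """Return the most visited URLs among users in the given age range."""
    # Load Users + Filter by age
    wanted = {name for name, age in users if min_age <= age <= max_age}
    # Load Pages + Join on name + Group on url + Count clicks
    clicks = Counter(url for name, url in pages if name in wanted)
    # Order by clicks (ties broken by URL) + Take top `limit`
    ordered = sorted(clicks.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:limit]

users = [("amy", 22), ("bob", 31), ("cat", 18), ("dan", 25)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"),
         ("cat", "a.com"), ("dan", "b.com"), ("dan", "c.com"),
         ("cat", "a.com")]

result = top_pages(users, pages)
print(result)
```

This fits in memory only because the data is tiny; the point of the comparison that follows is how much code the same steps take once they must run as distributed MapReduce jobs.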
In MapReduce
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it
            // accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {
        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable,
        Writable> {
        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }
    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable,
        Text> {
        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }
    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }
    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class)
        lp.setMapperClass(LoadPages.class)
        FileInputFormat.addInputPath(lp, n
        Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
        new Path("/user/gates/tmp/inde
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);
        JobConf lfu = new JobConf(MRExampl
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class
        lfu.setMapperClass(LoadAndFilterUs
        FileInputFormat.addInputPath(lfu, new
        Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu
        new Path("/user/gates/tmp/filt
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);
        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pa
        join.setInputFormat(KeyValueTextIn
        join.setOutputKeyClass(Text.class)
        join.setOutputValueClass(Text.clas
        join.setMapperClass(IdentityMapper.class);
j o i n . s e t R e d u c e r C l a s s ( J o i n . c l a s s ) ;
F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j o i n ,
P a t h ( " / u s e r / g a t e s / t m p / i n d e x e d _ p a g e s " ) ) ;
F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j o i n ,
P a t h ( " / u s e r / g a t e s / t m p / f i l t e r e d _ u s e r s " ) ) ;
F i l e O u t p u t F o r m a t . s et O u t p u t P a t h ( j o i n , n e w
P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ;
j o i n . s e t N u m R e d u c e T a s k s ( 5 0 ) ;
J o b j o i n J o b = n e w J o b ( j o i n ) ;
j o i n J o b . a d d D e p e n d i n g J o b ( l o a d P a g e s )
j o i n J o b . a d d D e p e n d i n g J o b ( l o a d U s e r s )
J o b C o n f g r o u p = n e w J o b C o n f ( M R Ex a m p l e . c l a s s ) ;
g r o u p . s e t J o b N a m e ( " G r o u p U R L s " ) ;
g r o u p . s e t I n p u t F o r m a t ( K e y V a l u e T e x t I
g r o u p . s e t O u t p u t K e y C l a s s ( T e x t . c l a s s
g r o u p . s e t O u t p u t V a l u e C l a s s ( L o n g W r i t
g r o u p . s e t O u t p u t F o r m a t ( S e q u e n c e F il e O u t p u t F o r m a t . c l a s s
g r o u p . s e t M a p p e r C l a s s ( L o a d J o i n e d . c l
g r o u p . s e t C o m b i n e r C l a s s ( R e d u c e U r l s .
g r o u p . s e t R e d u c e r C l a s s ( R e d u c e U r l s . c
F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( g r o u p
P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ;
F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( g r o u p ,
P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ;
g r o u p . s e t N u m R e d u c e T a s k s ( 5 0 ) ;
J o b g r o u p J o b = n e w J o b ( g r o u p ) ;
g r o u p J o b . a d d D e p e n d i n g J o b ( j o i n J o b ) ;
J o b C o n f t o p 1 0 0 = n e w J o b C o n f ( M R E x a
t o p 1 0 0 . s e t J o b N a m e ( " T o p 1 0 0 s i t e s " ) ;
t o p 1 0 0 . s e t I n p u t F o r m a t ( S e q u e n c e F i l e
t o p 1 0 0 . s e t O u t p u t K e y C l a s s ( L o n g W r i t a
t o p 1 0 0 . s e t O u t p u t V a l u e C l a s s ( T e x t . c l
t o p 1 0 0 . s e t O u t p u t F o r m a t ( S e q u e n c e F i lo r m a t . c l a s s ) ;
t o p 1 0 0 . s e t M a p p e r C l a s s ( L o a d C l i c k s . c
t o p 1 0 0 . s e t C o m b i n e r C l a s s ( L i m i t C l i c k
t o p 1 0 0 . s e t R e d u c e r C l a s s ( L i m i t C l i c k s
F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( t o p 1 0
P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ;
F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( t o p 1 0 0
P a t h ( " / u s e r / g a t e s / t o p 1 0 0 s i t e s f o r u s e r s 1 8 t o 2
t o p 1 0 0 . s e t N u m R e d u c e T a s k s ( 1 ) ;
J o b l i m i t = n e w J o b ( t o p 1 0 0 ) ;
l i m i t . a d d D e p e n d i n g J o b ( g r o u p J o b ) ;
J o b C o n t r o l j c = n e w J o b C o n t r o l ( " F i1 0 0 s i t e s f o r
1 8 t o 2 5 " ) ;
j c . a d d J o b ( l o a d P a g e s ) ;
j c . a d d J o b ( l o a d U s e r s ) ;
j c . a d d J o b ( j o i n J o b ) ;
j c . a d d J o b ( g r o u p J o b ) ;
j c . a d d J o b ( l i m i t ) ;
j c . r u n ( ) ;
}
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
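The grouping and limiting stages of the Java listing can be hard to follow through the boilerplate. The following is a minimal sketch in plain Python, not Hadoop code, of what those stages appear to do: a mapper emits (url, 1), the framework groups values by key, a reducer sums them, and a final step keeps only the first records. The function names, the sample records, and the explicit descending sort are assumptions made for illustration.

```python
from collections import defaultdict

def map_joined(record):
    """Mapper: emit (url, 1) for each joined (user, url) record."""
    _user, url = record
    yield url, 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_urls(url, counts):
    """Reducer: add up all the values seen for one URL (mirrors ReduceUrls)."""
    return url, sum(counts)

def limit_clicks(pairs, limit=100):
    """Keep at most `limit` records (mirrors LimitClicks)."""
    return pairs[:limit]

# Hypothetical joined records: (user, url)
joined = [("amy", "a.com"), ("bob", "b.com"), ("amy", "b.com"), ("cat", "b.com")]

mapped = [pair for record in joined for pair in map_joined(record)]
summed = [reduce_urls(url, counts) for url, counts in shuffle(mapped).items()]

# Assumption: order by click count, highest first, before limiting.
ranked = sorted(summed, key=lambda kv: kv[1], reverse=True)
top = limit_clicks(ranked, limit=1)
print(top)  # prints [('b.com', 3)]
```

Because summing is associative, the same reducer can also act as a combiner on each mapper's partial output, which is how the listing uses ReduceUrls.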
In Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
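To make the data flow of the Pig Latin script concrete, here is a minimal in-memory sketch in Python. It is not how Pig executes the script. The sample rows are invented, and the join is simplified by assuming user names are unique.

```python
from collections import Counter

# Hypothetical sample data
users = [("amy", 22), ("bob", 31), ("cat", 18)]   # (name, age)
pages = [("amy", "a.com"), ("bob", "b.com"),
         ("amy", "b.com"), ("cat", "b.com")]       # (user, url)

def top_sites(users, pages, min_age=18, max_age=25, n=5):
    # Filtered = filter Users by age >= 18 and age <= 25
    filtered = {name for name, age in users if min_age <= age <= max_age}
    # Joined = join Filtered by name, Pages by user
    joined = [(user, url) for user, url in pages if user in filtered]
    # Grouped / Summed = group by url and count the clicks per url
    clicks = Counter(url for _user, url in joined)
    # Sorted = order by clicks desc; Top5 = limit 5
    return clicks.most_common(n)

print(top_sites(users, pages))  # prints [('b.com', 2), ('a.com', 1)]
```

Each line of the function corresponds to one or two statements of the script, which is the point of the comparison with the Java version.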
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin:
• Load Users → Users = load …
• Filter by age → Filtered = filter …
• Load Pages → Pages = load …
• Join on name → Joined = join …
• Group on url → Grouped = group …
• Count clicks → Summed = … count()…
• Order by clicks → Sorted = order …
• Take top 5 → Top5 = limit …
Pig runs these steps as three MapReduce jobs (Job 1, Job 2, Job 3).
Hive
• Developed at Facebook
• Used for the majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains a list of table schemas
• SQL-like query language (HQL)
• Can call Hadoop Streaming scripts from HQL
• Supports table partitioning, clustering, complex data types, and some optimizations
• Translates SQL into MapReduce jobs
• So you can write:
  SELECT COUNT(*) FROM users WHERE user_id = 56;
• And Hive will translate this into MapReduce jobs
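As a rough picture of what "translating into MapReduce" means for the count query, the sketch below splits it into a map step that applies the WHERE clause and a reduce step that adds up partial counts. This is a conceptual illustration in plain Python, not Hive's actual execution plan. The row layout and function names are assumptions.

```python
def map_phase(rows, user_id):
    """Map: keep rows that match the WHERE clause and emit a partial count of 1."""
    for row in rows:
        if row["user_id"] == user_id:
            yield 1

def reduce_phase(partial_counts):
    """Reduce: add the partial counts to get the final COUNT(*)."""
    return sum(partial_counts)

# Hypothetical rows of the users table
users = [
    {"user_id": 56, "name": "amy"},
    {"user_id": 7, "name": "bob"},
    {"user_id": 56, "name": "amy2"},
]

count = reduce_phase(map_phase(users, user_id=56))
print(count)  # prints 2
```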
Hive Architecture
Apache Hive is built on top of Hadoop to provide data warehouse services.
Sample Hive Queries
• Find top 5 pages visited by users aged 18-25:
SELECT p.url, COUNT(1) AS clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;
• Filter page views through a Python script:
SELECT TRANSFORM(p.user, p.date)
USING 'map_script.py'
AS dt, uid
FROM page_views p
CLUSTER BY dt;
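The slides name map_script.py but do not show it. The following is a hypothetical sketch of such a script, assuming Hive streams each input row to the script as a tab-separated line and reads tab-separated lines back as the output columns. Here the script only swaps (user, date) into (dt, uid); the real script may do more.

```python
import io
import sys

def transform_line(line):
    """Turn one tab-separated (user, date) row into a (dt, uid) row."""
    user, date = line.rstrip("\n").split("\t")
    return f"{date}\t{user}"

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read rows from the input stream and write transformed rows out.
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")

# Local demonstration with in-memory streams instead of Hive's pipes.
sample = io.StringIO("amy\t2010-06-23\nbob\t2010-06-24\n")
out = io.StringIO()
main(sample, out)
print(out.getvalue(), end="")
```

When used from Hive, the script would call main() with no arguments so that it reads standard input and writes standard output.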
Sqoop
• Often, valuable data in an organization is stored in relational
database systems (RDBMS)
• Sqoop is an open-source tool that allows users to extract
data from a relational database into Hadoop for further
processing.
• It’s even possible to use Sqoop to move data from a relational
database into HBase.
A Sample Import
• After you install Sqoop, you can use it to import data into Hadoop.
• Assuming MySQL is installed, let’s log in and create a database:
Example 15-1. Creating a new MySQL database schema
% mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 349
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE DATABASE hadoopguide;
Query OK, 1 row affected (0.02 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> quit;
Bye
• Now let’s log back into the database (not as root, but as yourself this time) and create a table to import into HDFS:
Example 15-2. Populating the database
% mysql hadoopguide
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 352
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -> widget_name VARCHAR(64) NOT NULL,
    -> price DECIMAL(10,2),
    -> design_date DATE,
    -> version INT,
    -> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',
    -> 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4,
    -> NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
    -> 13, 'Our flagship product');
Query OK, 1 row affected (0.00 sec)
mysql> quit;
• Now let’s use Sqoop to import this table into HDFS:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1
10/06/23 14:44:18 INFO tool.CodeGenTool: Beginning code generation
...
10/06/23 14:44:20 INFO mapred.JobClient: Running job: job_201006231439_0002
10/06/23 14:44:21 INFO mapred.JobClient: map 0% reduce 0%
10/06/23 14:44:32 INFO mapred.JobClient: map 100% reduce 0%
10/06/23 14:44:34 INFO mapred.JobClient: Job complete: job_201006231439_0002
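Once the import finishes, the table's rows live in HDFS as text. The sketch below shows one way downstream code might read such records back. It assumes Sqoop's default text output, which is comma-delimited with the literal string "null" for NULL values, and that no field contains a comma. The sample text is a hypothetical rendering of the three widgets rows inserted earlier, and the function names are illustrative.

```python
import csv
import io
from decimal import Decimal

# Hypothetical file contents for the imported widgets table.
SAMPLE = """1,sprocket,0.25,2010-02-10,1,Connects two gizmos
2,gizmo,4.00,2009-11-30,4,null
3,gadget,99.99,1983-08-13,13,Our flagship product
"""

COLUMNS = ["id", "widget_name", "price", "design_date", "version", "design_comment"]

def convert(value, caster):
    """Apply a type conversion unless the value is a SQL NULL."""
    return None if value is None else caster(value)

def parse_widgets(text):
    """Parse comma-delimited records into dictionaries with typed fields."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        values = [None if field == "null" else field for field in fields]
        row = dict(zip(COLUMNS, values))
        row["id"] = convert(row["id"], int)
        row["price"] = convert(row["price"], Decimal)
        row["version"] = convert(row["version"], int)
        rows.append(row)
    return rows

widgets = parse_widgets(SAMPLE)
print(widgets[1]["widget_name"], widgets[1]["design_comment"])
```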
RHadoop
• RHadoop allows data scientists familiar with R to quickly combine the enterprise-grade capabilities of the MapR Hadoop distribution with the analytic capabilities of R.
• RHadoop is an open source collection of three R packages, created by Revolution Analytics, that allow users to manage and analyze data with Hadoop from an R environment.
• The packages have been implemented and tested with Cloudera's distribution of Hadoop (CDH3 and CDH4) and R 2.15.0.
• The packages have also been tested with Revolution R 4.3, 5.0, and 6.0. For rmr see Compatibility.
RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R
• From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.
> library(rhdfs)
> hdfs.init()
> hdfs.ls('/')
> q()
• From R, load the rmr2 library and confirm that you can access the Hadoop cluster by running a simple MapReduce job.
> library("rmr2")
> small.ints <- to.dfs(1:1000)
> out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
> df <- as.data.frame(from.dfs(out))
• Load the rhbase library, create an HBase table, display its description, and drop it.
R --save
> library(rhbase)
> hb.init()
> hb.new.table('testtable', 'colfam1')
> hb.describe.table('testtable')
> hb.delete.table('testtable')
> q()
RHive is an R extension that facilitates distributed computing via Hive queries. It makes it easy to use HQL, which is SQL-like, and to use R objects and functions in HQL.
Examples:
## Load the RHive library into R
> library(RHive)
> rhive.init()
## Try to connect to the Hive server
> rhive.connect("HiveServer_IP")
## Execute HQL (Hive query)
> dt <- rhive.query("select * from emp")
Deployment with Revolution R Enterprise
Questions?
Thank you
Mahabubur Rahaman
Sr. Software Engineer
Orion Informatics Ltd
Dhaka, Bangladesh

Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Introduction to Apache Hadoop Ecosystem

  • 2. Agenda • Hadoop Brief History • What is Hadoop Distributions • Hadoop Distributions • Core components of Hadoop • Hadoop Base Platform • Hadoop Cluster • Hadoop Distributed File Systems • Hadoop Map Reduce • HBase • Hive • Pig • RHadoop • RHive
  • 3. Data! We live in the data age. The size of the “digital universe” was estimated at 0.18 zettabytes in 2006, with a forecast of tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. This flood of data is coming from many sources. Consider the following: • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. So there’s a lot of data out there.
  • 4. Data Storage and Analysis • One-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. • This is a long time to read all the data on a single drive, and writing is even slower. • Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. The first problem to solve is hardware failure: • As soon as you start using many pieces of hardware, the chance that one will fail is fairly high. • A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is where the distributed filesystem (DFS) comes in. The second problem to solve is combining the data: • Most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. • Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. • MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values.
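The read-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly parallel reads with no overhead; the function name is only illustrative.

```python
TERABYTE = 10**12             # bytes, decimal terabyte (assumption)
TRANSFER_RATE = 100 * 10**6   # bytes per second, i.e. 100 MB/s


def read_time_seconds(total_bytes, drives=1, rate=TRANSFER_RATE):
    """Time to read total_bytes spread evenly over `drives` disks read in parallel."""
    return total_bytes / (drives * rate)


single = read_time_seconds(TERABYTE)                 # one drive
parallel = read_time_seconds(TERABYTE, drives=100)   # 100 drives, 1/100 of the data each

print(f"one drive:  {single / 3600:.2f} hours")   # 2.78 hours
print(f"100 drives: {parallel:.0f} seconds")      # 100 seconds
```

One drive needs 10,000 seconds (about 2.8 hours), while 100 drives need 100 seconds, which matches "more than two and a half hours" and "under two minutes".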
  • 5. A Brief History of Hadoop • Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. Apache Nutch is a project of the Apache Software Foundation. • Nutch was started in 2002. However, its developers realized that their architecture wouldn’t scale to the billions of pages on the Web. • Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google. • GFS architecture: a GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master node and a large number of Chunkservers. • In 2004, the Nutch developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS). • Hadoop was created by Doug Cutting and Michael J. Cafarella in 2005. Doug, who was working at Yahoo at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.
  • 8. What Hadoop is, and what it’s not
  • 9. The core of Hadoop • MapReduce • Created at Google in 2004 • The MapReduce framework is the powerhouse behind most of today’s big data processing • In addition to Hadoop, you’ll find MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB • At its core, Hadoop is an open source MapReduce implementation (February 2006), funded by Yahoo • MapReduce distributes computation over multiple servers • HDFS • For a MapReduce computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System • HDFS and MapReduce are robust • Servers in a Hadoop cluster can fail without aborting the computation • HDFS ensures data is replicated with redundancy across the cluster • On completion of a calculation, a node writes its results back into HDFS • Improving programmability: Pig and Hive • Improving data access: HBase, Sqoop and Flume
  • 10. The Hadoop Bestiary • Ambari: deployment, configuration and monitoring • Flume: collection and import of log and event data • HBase: column-oriented database scaling to billions of rows • HCatalog: schema and data type sharing over Pig, Hive and MapReduce • HDFS: distributed redundant file system for Hadoop • Hive: data warehouse with SQL-like access • Mahout: library of machine learning and data mining algorithms • MapReduce: parallel computation on server clusters • Pig: high-level programming language for Hadoop computations • Oozie: workflow and coordination service for Hadoop jobs • Sqoop: imports data from relational databases • Whirr: cloud-agnostic deployment of clusters • ZooKeeper: configuration management and coordination
  • 11. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Apache Hadoop
  • 12. Cloudera : Hadoop :: Red Hat : Linux Cloudera’s Distribution Including Apache Hadoop (CDH) A packaged set of Hadoop modules that work together Now at CDH4 Largest contributor of code to Apache Hadoop CDH4 The world's leading Apache Hadoop distribution. CDH (Cloudera's Distribution, including Apache Hadoop) is Cloudera's 100% open-source Hadoop distribution, and the world's leading Apache Hadoop solution. Cloudera
  • 13. Installations are available for • Windows • Red Hat Linux Installing and Configuring HDP using Hortonworks Management Center Powered by Apache Hadoop
  • 14. MapR’s Distribution for Apache Hadoop MapR is a complete distribution that includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume and more. MapR’s distribution is 100% API compatible with Hadoop (MapReduce, HDFS and HBase™). MapR Technologies has significantly advanced Hadoop by making it easy, dependable, and fast. On Cloud: http://aws.amazon.com/elasticmapreduce/mapr/ MapR Apache Hadoop
  • 15. Open Platform for Next-Gen Analytics Intel® Distribution for Apache Hadoop* software (Intel® Distribution) is a software platform that provides distributed processing and data management for enterprise applications that analyze massive amounts of diverse data. Intel Distribution is an open source software product that includes Apache Hadoop and other software components along with enhancements and fixes from Intel. Proven in production at some of the most demanding enterprise deployments in the world, Intel Distribution is supported by a worldwide engineering team with access to expertise in the entire software stack as well as the underlying processor, storage, and networking components. Key Features: • Up to 30x boost in Hadoop performance with optimizations for Intel® Xeon processors, Intel® SSD storage, and Intel® 10GbE networking • Data confidentiality without a performance penalty with encryption and decryption in HDFS enhanced by Intel® AES-NI and role-based access control with cell-level granularity in Hbase • Multi-site scalability and adaptive data replication in HBase and HDFS • Up to 3.5x improvement in Hive query performance • Support for statistical analysis with R connector • Enables graph analytics with Intel® Graph Builder • Enterprise-grade support and services from Intel Intel Distribution for Apache Hadoop Software
  • 16. Apache Hadoop on cloud and on premises that Accommodates both Windows and Linux. • Interactive Console • run a Pig-Latin Job from the Interactive Javascript Console • Create and run a JavaScript MapReduce job • Execute a job using Hive • Remote Desktop • The Hadoop command shell • View the Job Tracker • View HDFS • Open Port • Connect Excel Hive Add-In To Hadoop on Azure via HiveODBC • FTP data to Hadoop on Azure • Manage Data • Import Data from Data Market • Setup ASV–use your Windows Azure Blob Store account • Setup S3–use your Amazon S3 account Apache Hadoop HDInsight Service Dashboard HDInsight Service makes Apache Hadoop available as a service in the cloud. It makes the HDFS/MapReduce software framework available in a simpler, more scalable, and cost efficient environment.
  • 17. Seamless Interoperability with your Microsoft tools
  • 18. IBM InfoSphere BigInsights Bringing the power of Hadoop to the enterprise
  • 20. Creating a Hadoop User. It’s good practice to create a dedicated Hadoop user account to separate the Hadoop installation from other services running on the same machine. For small clusters, some administrators choose to make this user’s home directory an NFS-mounted drive, to aid with SSH key distribution.
  • 22. Network Topology. A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 9-1. Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack (only three are shown in the diagram), and an uplink to a core switch or router (which is normally 1 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
  • 23. Hadoop Distributed File Systems • Files split into 128 MB blocks • Blocks replicated across several datanodes (usually 3) • Single namenode stores metadata (file names, block locations, etc.) • Optimized for large files, sequential reads • Files are append-only [Figure: the namenode holds the metadata for File1, which is split into blocks 1 to 4; each of the four datanodes stores three of the blocks (1 2 4, 2 1 3, 1 4 3, 3 2 4), so every block has three replicas.]
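To make the block and replica numbers concrete, here is a minimal sketch that computes how many 128 MB blocks a file needs and spreads three replicas of each block over a set of datanodes. The round-robin placement and the datanode names are assumptions for illustration only; real HDFS placement is decided by the namenode and takes rack topology into account.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size, as on the slide
REPLICATION = 3                  # usual replication factor, as on the slide


def block_count(file_size):
    """Number of blocks needed for a file (the last block may be partly filled)."""
    return math.ceil(file_size / BLOCK_SIZE)


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Hypothetical round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


file_size = 1024 ** 3                      # a 1 GiB example file
blocks = block_count(file_size)            # 8 blocks
raw_bytes = file_size * REPLICATION        # raw storage consumed across the cluster
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])

print(blocks, raw_bytes // 1024 ** 3)      # 8 3
print(placement[0])                        # ['dn1', 'dn2', 'dn3']
```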
  • 24. Block Replication The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. Hadoop Distributed File Systems
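The bookkeeping described here can be sketched as follows. This is a toy model, not the NameNode implementation: the class name, the 30-second timeout, and the method names are all hypothetical. It records the last Heartbeat time and the latest Blockreport per DataNode, then counts live replicas per block to find under-replicated ones.

```python
class NameNodeSketch:
    """Toy model of heartbeat and blockreport tracking (hypothetical names and values)."""

    def __init__(self, replication=3, heartbeat_timeout=30.0):
        self.replication = replication
        self.heartbeat_timeout = heartbeat_timeout  # seconds; an assumed value
        self.last_heartbeat = {}   # datanode -> time of last heartbeat
        self.block_reports = {}    # datanode -> set of block ids it holds

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, blocks):
        self.block_reports[datanode] = set(blocks)

    def live_datanodes(self, now):
        """Datanodes whose last heartbeat is recent enough."""
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t <= self.heartbeat_timeout}

    def under_replicated(self, now):
        """Map each block with too few live replicas to its live replica count."""
        live = self.live_datanodes(now)
        counts = {}
        for datanode, blocks in self.block_reports.items():
            for block in blocks:
                counts.setdefault(block, 0)
                if datanode in live:
                    counts[block] += 1
        return {b: c for b, c in counts.items() if c < self.replication}


nn = NameNodeSketch()
for dn in ("dn1", "dn2", "dn3"):
    nn.heartbeat(dn, now=0.0)
    nn.block_report(dn, ["blk_1"])

# dn3 goes silent; dn1 and dn2 keep sending heartbeats.
nn.heartbeat("dn1", now=90.0)
nn.heartbeat("dn2", now=90.0)
print(nn.under_replicated(now=100.0))   # {'blk_1': 2}
```

A real NameNode would then schedule a new replica of the block on another DataNode.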
  • 25. Hadoop Distributed File Systems. Setting Up a Hadoop Cluster: Anatomy of a File Write. The case we’re going to consider is the case of creating a new file, writing data to it, then closing the file. See Figure. 1. The client creates the file by calling create() on DistributedFileSystem. 2. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace. 3. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The list of datanodes forms a pipeline; we’ll assume the replication level is three, so there are three nodes in the pipeline. 4. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline. 5. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline. If a datanode fails while data is being written to it, the failed datanode is removed from the pipeline and the remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. 6. When the client has finished writing data, it calls close() on the stream. 7. The close() action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.
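The data queue and ack queue from steps 3 to 5 can be illustrated with a small single-threaded simulation. This is only a sketch: the packet size, function names, and in-memory "datanodes" are hypothetical, and the real DFSOutputStream streams packets asynchronously rather than one at a time.

```python
from collections import deque


def split_into_packets(data, packet_size=4):
    """Split the client's bytes into fixed-size packets (the size here is arbitrary)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def write_through_pipeline(data, pipeline):
    """Move packets from the data queue through every datanode in the pipeline."""
    data_queue = deque(split_into_packets(data))
    ack_queue = deque()                      # packets sent but not yet fully acknowledged
    stored = {node: [] for node in pipeline}

    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)
        acks = 0
        for node in pipeline:                # each node stores the packet, then forwards it
            stored[node].append(packet)
            acks += 1
        if acks == len(pipeline):            # removed only when all datanodes acknowledged
            ack_queue.popleft()

    return stored, list(ack_queue)


stored, pending = write_through_pipeline(b"hello hdfs!", ["dn1", "dn2", "dn3"])
print(b"".join(stored["dn3"]))   # b'hello hdfs!'
print(pending)                   # []
```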
  • 26. Hadoop Distributed File Systems. Setting Up a Hadoop Cluster: Anatomy of a File Read. To get an idea of how data flows between the client interacting with HDFS, the namenode and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file. 1. The client opens the file it wishes to read by calling open() on the DistributedFileSystem object. 2. DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file. 3. The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. 4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream. When the end of the block is reached, DFSInputStream will close the connection to the datanode. 5. DFSInputStream then finds the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream. 6. Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
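As a rough illustration of steps 3 to 6, the sketch below reads blocks in order and, for each one, picks the closest datanode that holds a replica. The distance values, block names, and helper functions are made up for the example; they stand in for the namenode's answer and the network topology.

```python
def read_file(block_locations, distance, fetch_block):
    """Read blocks in order, each from the closest datanode holding a replica."""
    chunks = []
    for block_id, datanodes in block_locations:
        closest = min(datanodes, key=distance)     # "best" datanode for this block
        chunks.append(fetch_block(closest, block_id))
    return b"".join(chunks)


# Hypothetical cluster state: distance of each datanode from the client,
# and the bytes each datanode would return for a block.
distances = {"dn1": 4, "dn2": 0, "dn3": 2}
storage = {("dn2", "blk_1"): b"hello ", ("dn3", "blk_2"): b"world"}
locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn1", "dn3"])]

data = read_file(locations, distances.get, lambda dn, blk: storage[(dn, blk)])
print(data)   # b'hello world'
```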
  • 28. MapReduce Programming Model • Data type: key-value records • Map function: (K_in, V_in) → list(K_inter, V_inter) • Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
  • 29. MapReduce Operation. [Figure: the data collection (size: TByte) is split (split 1, split 2, …, split n) to supply multiple processors; each split feeds a Map task that emits <key, 1> pairs for words such as Cat, Bat, Dog and other words; Reducers (say, Count) write the counts to part files P-0000, P-0001, P-0002.] MAP: input data → <key, value> pair. REDUCE: <key, value> pair → <result>.
  • 30. The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes. Imagine the first map produced the output: (1950, 0) (1950, 20) (1950, 10) And the second produced: (1950, 25) (1950, 15) The reduce function would be called with a list of all the values: (1950, [0, 20, 10, 25, 15]) with output: (1950, 25) MapReduce Operation(Cont)
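The maximum-temperature example on this slide can be reproduced in a few lines. The sketch below merges the two map outputs by key (a stand-in for the shuffle) and applies a max reducer; the function names are illustrative.

```python
from collections import defaultdict


def shuffle(map_outputs):
    """Group the values from all map outputs by key, in arrival order."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(grouped)


def max_reducer(key, values):
    """Reduce function: keep the maximum value seen for the key."""
    return (key, max(values))


first_map = [(1950, 0), (1950, 20), (1950, 10)]
second_map = [(1950, 25), (1950, 15)]

grouped = shuffle([first_map, second_map])
results = [max_reducer(key, values) for key, values in grouped.items()]

print(grouped)   # {1950: [0, 20, 10, 25, 15]}
print(results)   # [(1950, 25)]
```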
  • 31. MapReduce Example: Word Count • def mapper(line): for word in line.split(): output(word, 1) • def reducer(key, values): output(key, sum(values))
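A runnable version of the word-count pseudocode might look like the following. It is a local, single-process sketch: `output(...)` is replaced by `yield` and `return`, and the `word_count` driver that groups pairs by key is an addition standing in for Hadoop's shuffle.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)


def reducer(key, values):
    """Sum the counts collected for one word."""
    return (key, sum(values))


def word_count(lines):
    """Local driver: run the mapper, group by key, then run the reducer."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))


counts = word_count([
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
])
print(counts["the"], counts["brown"], counts["fox"])   # 3 2 2
```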
  • 32. Word Count Execution. Stages: Input → Map → Shuffle & Sort → Reduce → Output. Input: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”. Map: Map 1 emits (the, 1) (quick, 1) (brown, 1) (fox, 1); Map 2 emits (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1); Map 3 emits (how, 1) (now, 1) (brown, 1) (cow, 1). Reduce and Output: Reduce 1 outputs brown, 2; fox, 2; how, 1; now, 1; the, 3. Reduce 2 outputs ate, 1; cow, 1; mouse, 1; quick, 1.
  • 33. An Optimization: The Combiner • A combiner is a local aggregation function for repeated keys produced by the same map • Works for associative and commutative functions like sum, count, max • Decreases the size of the intermediate data • Example: map-side aggregation for Word Count: def combiner(key, values): output(key, sum(values))
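The effect of a combiner can be seen on a single map's output. In this sketch (function names are illustrative, and the grouping helper stands in for what the framework does on the map side), the combiner merges the two ("the", 1) pairs produced for one input line before anything is sent to the reducers.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combiner(key, values):
    """Local aggregation: same logic as the word-count reducer."""
    return (key, sum(values))


def combine_map_output(pairs):
    """Apply the combiner to the output of one map task only."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(combiner(key, values) for key, values in grouped.items())


raw = mapper("the fox ate the mouse")
combined = combine_map_output(raw)

print(len(raw), len(combined))   # 5 4
print(combined)                  # [('ate', 1), ('fox', 1), ('mouse', 1), ('the', 2)]
```

Sum works here because partial sums can be added again in the reducer; a function such as a plain average would not give the same result when applied in two stages.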
  • 34. Word Count with Combiner. Stages: Input → Map & Combine → Shuffle & Sort → Reduce → Output. Input: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”. Map & Combine: Map 1 emits (the, 1) (quick, 1) (brown, 1) (fox, 1); Map 2 emits (the, 2) (fox, 1) (ate, 1) (mouse, 1); Map 3 emits (how, 1) (now, 1) (brown, 1) (cow, 1). Reduce and Output: Reduce 1 outputs brown, 2; fox, 2; how, 1; now, 1; the, 3. Reduce 2 outputs ate, 1; cow, 1; mouse, 1; quick, 1.
  • 35. MapReduce Execution Details • Single master controls job execution on multiple slaves • Mappers preferentially placed on same node or same rack as their input block – Minimizes network usage • Mappers save outputs to local disk before serving them to reducers – Allows recovery if a reducer crashes – Allows having more reducers than nodes
  • 36. Anatomy of a MapReduce Job Run. Job submission by the client, which submits the MapReduce job, does the following: • Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2). • Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program. • Computes the input splits for the job. If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program. • Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3). • Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
  • 37. Fault Tolerance in MapReduce 1. If a task crashes: – Retry on another node » OK for a map because it has no dependencies » OK for reduce because map outputs are on disk – If the same task fails repeatedly, fail the job or ignore that input block (user-controlled) Note: For these fault tolerance features to work, your map and reduce tasks must be side-effect-free
  • 38. Fault Tolerance in MapReduce 2. If a node crashes: – Re-launch its current tasks on other nodes – Re-run any maps the node previously ran » Necessary because their output files were lost along with the crashed node
  • 39. Fault Tolerance in MapReduce 3. If a task is going slowly (straggler): – Launch a second copy of the task on another node (“speculative execution”) – Take the output of whichever copy finishes first, and kill the other. Surprisingly important in large clusters: – Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc. – A single straggler may noticeably slow down a job
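Two of the recovery strategies above, retrying a crashed task on another node and keeping the first finisher under speculative execution, can be sketched as plain functions. Everything here is illustrative: the node names, the attempt limit of 4, and the finish times are assumptions, and no real scheduling or process killing takes place.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Try the task on successive nodes until one attempt succeeds."""
    errors = []
    for node in nodes[:max_attempts]:
        try:
            return task(node), errors
        except Exception as exc:             # a crashed attempt; move to the next node
            errors.append((node, str(exc)))
    raise RuntimeError(f"task failed {len(errors)} times: {errors}")


def speculative_winner(finish_times):
    """Given node -> finish time for copies of one task, keep the first to finish."""
    return min(finish_times, key=finish_times.get)


def flaky_task(node):
    """A side-effect-free task that crashes on one (hypothetical) bad node."""
    if node == "node1":
        raise IOError("disk failure")
    return f"output from {node}"


result, errors = run_with_retries(flaky_task, ["node1", "node2", "node3"])
winner = speculative_winner({"node2": 42.0, "node4": 17.5})

print(result)   # output from node2
print(errors)   # [('node1', 'disk failure')]
print(winner)   # node4
```

Retrying is only safe because the task has no side effects, which is the point of the note on the first fault tolerance slide.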
  • 40. Hadoop comes with a web UI for viewing information about your jobs. It is useful for following a job’s progress while it is running, as well as finding job statistics and logs after the job has completed. You can find the UI at http://jobtracker-host:50030/. Figure 5-2. Screenshot of the job page Walkthrough the MapReduce Web UI
  • 41. The task details page Walkthrough the MapReduce Web UI
  • 42. Retrieving the Results Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory. Walkthrough the MapReduce Web UI
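The part file names follow a simple pattern: the reducer index, zero-padded to five digits. A one-line sketch of the naming for 30 reducers:

```python
NUM_REDUCERS = 30  # one output file per reducer, as stated on the slide

part_files = [f"part-r-{i:05d}" for i in range(NUM_REDUCERS)]

print(part_files[0], part_files[-1])   # part-r-00000 part-r-00029
```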
  • 44. HBase HBase Architecture • HBase is a distributed column-oriented database built on top of HDFS • HBase is the Hadoop application to use when you require real-time read/write random-access to very large datasets What if you need the database features that Hive doesn’t provide, like row-level updates, rapid query response times, and transactions?
  • 45. HBase cluster members • HBase depends on ZooKeeper, and by default it manages a ZooKeeper instance as the authority on cluster state • ZooKeeper hosts vitals such as the location of the root catalog table and the address of the current cluster Master • Regionserver slave nodes are listed in the HBase conf/regionservers file, as you would list datanodes and tasktrackers in the Hadoop conf/slaves file • There are multiple implementations of the filesystem interface (one for the local filesystem, one for the KFS filesystem, Amazon’s S3, and HDFS, the Hadoop Distributed Filesystem), and HBase can persist to any of these implementations • By default, unless told otherwise, HBase writes to the local filesystem
  • 46. HBase • To administer your HBase instance, launch the HBase shell and create a table:
      % hbase shell
      hbase(main):001:0>
      hbase(main):007:0> create 'test', 'data'
      0 row(s) in 1.3066 seconds
    • To prove the new table was created successfully, run the list command. This will output all tables in user space:
      hbase(main):019:0> list
      test
      1 row(s) in 0.1485 seconds
    • To insert data into three different rows and columns in the data column family, and then list the table content, do the following:
      hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'
      0 row(s) in 0.0454 seconds
      hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'
      0 row(s) in 0.0035 seconds
      hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'
      0 row(s) in 0.0090 seconds
      hbase(main):024:0> scan 'test'
      ROW      COLUMN+CELL
      row1     column=data:1, timestamp=1240148026198, value=value1
      row2     column=data:2, timestamp=1240148040035, value=value2
      row3     column=data:3, timestamp=1240148047497, value=value3
      3 row(s) in 0.0825 seconds
  • 47. HBase
  • 48. HBase Stargate. Stargate is the name of the REST server bundled with HBase. Set Accept header to text/plain for plain text output. Set Accept header to text/xml for XML reply. Set Accept header to application/json for JSON reply. Set Accept header to application/x-protobuf for protobufs. Query Table List Examples:
      % curl http://localhost:8000/
      HTTP/1.1 200 OK
      Content-Length: 13
      Cache-Control: no-cache
      Content-Type: text/plain
      test
      % curl -H "Accept: text/xml" http://localhost:8000/
      HTTP/1.1 200 OK
      Cache-Control: no-cache
      Content-Type: text/xml
      Content-Length: 121
      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <TableList><Table name="content"/><Table name="urls"/></TableList>
      % curl -H "Accept: application/json" http://localhost:8000/
      HTTP/1.1 200 OK
      Cache-Control: no-cache
      Content-Type: application/json
      Transfer-Encoding: chunked
      {"Table":[{"name":"test"},{"name":"urls"}]}
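The same table-list query can be issued from Python's standard library by setting the Accept header on the request. This is a minimal sketch: the URL is the one from the curl examples, the function names are hypothetical, and `list_tables` only works if a Stargate server is actually listening there.

```python
import urllib.request

STARGATE_URL = "http://localhost:8000/"   # address taken from the curl examples


def build_table_list_request(accept="application/json", base_url=STARGATE_URL):
    """Build a GET request whose Accept header selects the reply format."""
    return urllib.request.Request(base_url, headers={"Accept": accept})


def list_tables(accept="application/json", base_url=STARGATE_URL):
    """Send the request and return the response body as text (needs a running server)."""
    with urllib.request.urlopen(build_table_list_request(accept, base_url)) as response:
        return response.read().decode("utf-8")


request = build_table_list_request("text/xml")
print(request.get_method(), request.full_url, request.get_header("Accept"))
# GET http://localhost:8000/ text/xml
```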
  • 49. HBase Stargate Stargate is the name of the REST server bundled with HBase.
  • 50.  Many parallel algorithms can be expressed by a series of MapReduce jobs  But MapReduce is fairly low-level: must think about keys, values, partitioning, etc  Can we capture common “job building blocks”? Motivation
  • 51. Apache Oozie max-temp-workflow/ ├── lib/ │ └── hadoop-examples.jar └── workflow.xml An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo!
  • 52. Apache Oozie. Example 5-14. Oozie workflow definition to run the maximum temperature MapReduce job
      <workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
        <start to="max-temp-mr"/>
        <action name="max-temp-mr">
          <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
              <delete path="${nameNode}/user/${wf:user()}/output"/>
            </prepare>
            <configuration>
              <property>
                <name>mapred.mapper.class</name>
                <value>OldMaxTemperature$OldMaxTemperatureMapper</value>
              </property>
              <property>
                <name>mapred.combiner.class</name>
                <value>OldMaxTemperature$OldMaxTemperatureReducer</value>
              </property>
              <property>
                <name>mapred.reducer.class</name>
                <value>OldMaxTemperature$OldMaxTemperatureReducer</value>
              </property>
              <property>
                <name>mapred.output.key.class</name>
                <value>org.apache.hadoop.io.Text</value>
              </property>
              <property>
  • 53. Pig  Started at Yahoo! Research  Runs about 30% of Yahoo!’s jobs  Features:  Expresses sequences of MapReduce jobs  Data model: nested “bags” of items  Provides relational (SQL) operators (JOIN, GROUP BY, etc)  Easy to plug in Java functions  Pig Pen development environment for Eclipse
  • 54. Pig • Higher-level data flow language • Converts scripts into MapReduce jobs and runs them • Provides good functionality (JOINS, practitioners) • Very compact! • A comparison between Pig & Java: Pig is faster to develop, slower to run
  • 55. An Example Problem. Suppose you have user data in one file, page view data in another, and you need to find the top 5 most visited pages by users aged 18 - 25. Dataflow: Load Users → Filter by age; Load Pages; Join on name → Group on url → Count clicks → Order by clicks → Take top 5. Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
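Before looking at MapReduce code, it may help to see the dataflow itself on in-memory data. The sketch below is plain Python, not Pig or MapReduce; the record layouts, (name, age) for users and (name, url) for page views, and the sample values are assumptions made for the example.

```python
from collections import Counter


def top_pages(users, page_views, min_age=18, max_age=25, n=5):
    """Filter users by age, join page views on name, count clicks per url, take the top n."""
    eligible = {name for name, age in users if min_age <= age <= max_age}   # filter by age
    clicks = Counter(url for name, url in page_views if name in eligible)   # join, group, count
    return clicks.most_common(n)                                            # order, take top n


users = [("amy", 20), ("bob", 30), ("cal", 18)]
page_views = [
    ("amy", "a.example.com"),
    ("amy", "b.example.com"),
    ("bob", "a.example.com"),
    ("cal", "a.example.com"),
]

print(top_pages(users, page_views))
# [('a.example.com', 2), ('b.example.com', 1)]
```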
  • 56. In MapReduce
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.Iterator;
      import java.util.List;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.io.WritableComparable;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.KeyValueTextInputFormat;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.RecordReader;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;
      import org.apache.hadoop.mapred.SequenceFileInputFormat;
      import org.apache.hadoop.mapred.SequenceFileOutputFormat;
      import org.apache.hadoop.mapred.TextInputFormat;
      import org.apache.hadoop.mapred.jobcontrol.Job;
      import org.apache.hadoop.mapred.jobcontrol.JobControl;
      import org.apache.hadoop.mapred.lib.IdentityMapper;

      public class MRExample {
          public static class LoadPages extends MapReduceBase
                  implements Mapper<LongWritable, Text, Text, Text> {
              public void map(LongWritable k, Text val,
                      OutputCollector<Text, Text> oc,
                      Reporter reporter) throws IOException {
                  // Pull the key out
                  String line = val.toString();
                  int firstComma = line.indexOf(',');
                  String key = line.substring(0, firstComma);
                  String value = line.substring(firstComma + 1);
                  Text outKey = new Text(key);
                  // Prepend an index to the value so we know which file
                  // it came from.
                  Text outVal = new Text("1" + value);
                  oc.collect(outKey, outVal);
              }
          }

          public static class LoadAndFilterUsers extends MapReduceBase
                  implements Mapper<LongWritable, Text, Text, Text> {
              public void map(LongWritable k, Text val,
                      OutputCollector<Text, Text> oc,
                      Reporter reporter) throws IOException {
                  // Pull the key out
                  String line = val.toString();
                  int firstComma = line.indexOf(',');
                  String value = line.substring(firstComma + 1);
                  int age = Integer.parseInt(value);
                  if (age < 18 || age > 25) return;
                  String key = line.substring(0, firstComma);
                  Text outKey = new Text(key);
                  // Prepend an index to the value so we know which file
                  // it came from.
                  Text outVal = new Text("2" + value);
                  oc.collect(outKey, outVal);
              }
          }

          public static class Join extends MapReduceBase
                  implements Reducer<Text, Text, Text, Text> {
              public void reduce(Text key,
                      Iterator<Text> iter,
                      OutputCollector<Text, Text> oc,
                      Reporter reporter) throws IOException {
                  // For each value, figure out which file it's from and store it
                  // accordingly.
                  List<String> first = new ArrayList<String>();
                  List<String> second = new ArrayList<String>();
                  while (iter.hasNext()) {
                      Text t = iter.next();
                      String value = t.toString();
                      if (value.charAt(0) == '1')
                          first.add(value.substring(1));
                      else second.add(value.substring(1));
                      reporter.setStatus("OK");
                  }
                  // Do the cross product and collect the values
                  for (String s1 : first) {
                      for (String s2 : second) {
                          String outval = key + "," + s1 + "," + s2;
                          oc.collect(null, new Text(outval));
                          reporter.setStatus("OK");
                      }
                  }
              }
          }

          public static class LoadJoined extends MapReduceBase
                  implements Mapper<Text, Text, Text, LongWritable> {
              public void map(Text k, Text val,
                      OutputCollector<Text, LongWritable> oc,
                      Reporter reporter) throws IOException {
                  // Find the url
                  String line = val.toString();
                  int firstComma = line.indexOf(',');
                  int secondComma = line.indexOf(',', firstComma);
                  String key = line.substring(firstComma, secondComma);
                  // drop the rest of the record, I don't need it anymore,
                  // just pass a 1 for the combiner/reducer to sum instead.
                  Text outKey = new Text(key);
                  oc.collect(outKey, new LongWritable(1L));
              }
          }

          public static class ReduceUrls extends MapReduceBase
                  implements Reducer<Text, LongWritable, WritableComparable, Writable> {
              public void reduce(Text key,
                      Iterator<LongWritable> iter,
                      OutputCollector<WritableComparable, Writable> oc,
                      Reporter reporter) throws IOException {
                  // Add up all the values we see
                  long sum = 0;
                  while (iter.hasNext()) {
                      sum += iter.next().get();
                      reporter.setStatus("OK");
                  }
                  oc.
c o l l e c t ( k e y , n e w L o n g W r i t a b l e ( s u m ) ) ; } } p u b l i c s t a t i c c l a s s L o a d C l i c k s e x t e n d s M a p R e d u c e B a s e im p l e m e n t s M a p p e r < W r i t a b l e C o m p a r a b l e , W r i t a b l e , L o n g W r i t a b l e , T e x t > { p u b l i c v o i d m a p ( W r i t a b l e C o m p a r a b l e k e y , W r i t a b l e v a l , O u t p u t C o l l e c t o r < L o n g W r i t a b l e , T e x t > o c , R e p o r t e r r e p o r t e r )t h r o w s I O E x c e p t i o n { o c . c o l l e c t ( ( L o n g W r i t a b l e ) v a l , ( T e x t ) k e y ) ; } } p u b l i c s t a t i c c l a s s L i m i t C l i c k s e x t e n d s M a p R e d u c e B a s e i m p l e m e n t s R e d u c e r < L o n g W r i t a b l e , T e x t , L o n g W r i t a b l e , T e x t > { i n t c o u n t = 0 ; p u b l i cv o i d r e d u c e ( L o n g W r i t a b l e k e y , I t e r a t o r < T e x t > i t e r , O u t p u t C o l l e c t o r < L o n g W r i t a b l e , T e x t > o c , R e p o r t e r r e p o r t e r ) t h r o w s I O E x c e p t i o n { / / O n l y o u t p u t t h e f i r s t 1 0 0 r e c o r d s w h i l e ( c o u n t< 1 0 0 & & i t e r . h a s N e x t ( ) ) { o c . c o l l e c t ( k e y , i t e r . n e x t ( ) ) ; c o u n t + + ; } } } p u b l i c s t a t i c v o i d m a i n ( S t r i n g [ ] a r g s ) t h r o w s I O E x c e p t i o n { J o b C o n f l p = n e w J o b C o n f ( M R E x a m p l e . c l a s s ) ; l p . s et J o b N a m e ( " L o a d P a g e s " ) ; l p . s e t I n p u t F o r m a t ( T e x t I n p u t F o r m a t . c l a s s ) ; l p . s e t O u t p u t K e y C l a s s ( T e x t . c l a s s ) ; l p . s e t O u t p u t V a l u e C l a s s ( T e x t . c l a s s ) l p . s e t M a p p e r C l a s s ( L o a d P a g e s . c l a s s ) F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( l p , n P a t h ( " /u s e r / g a t e s / p a g e s " ) ) ; F i l e O u t p u t F o r m a t . 
s e t O u t p u t P a t h ( l p , n e w P a t h ( " / u s e r / g a t e s / t m p / i n d e l p . s e t N u m R e d u c e T a s k s ( 0 ) ; J o b l o a d P a g e s = n e w J o b ( l p ) ; J o b C o n f l f u = n e w J o b C o n f ( M R E x a m p l l f u . se t J o b N a m e ( " L o a d a n d F i l t e r U s e r s " ) ; l f u . s e t I n p u t F o r m a t ( T e x t I n p u t F o r m a t l f u . s e t O u t p u t K e y C l a s s ( T e x t . c l a s s ) ; l f u . s e t O u t p u t V a l u e C l a s s ( T e x t . c l a s s l f u . s e t M a p p e r C l a s s ( L o a d A n d F i l t e r U s F i l e I n p u t F o r m a t . a d dI n p u t P a t h ( l f u , n e w P a t h ( " / u s e r / g a t e s / u s e r s " ) ) ; F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( l f u n e w P a t h ( " / u s e r / g a t e s / t m p / f i l t l f u . s e t N u m R e d u c e T a s k s ( 0 ) ; J o b l o a d U s e r s = n e w J o b ( l f u ) ; J o b C o n f j o i n = n e w J o b C o n f (M R E x a m p l e . c l a s s ) ; j o i n . s e t J o b N a m e ( " J o i n U s e r s a n d P a j o i n . s e t I n p u t F o r m a t ( K e y V a l u e T e x t I n j o i n . s e t O u t p u t K e y C l a s s ( T e x t . c l a s s ) j o i n . s e t O u t p u t V a l u e C l a s s ( T e x t . c l a s j o i n . s e t M a p p e r C l a s s ( I d e n t i t y M a pp e r . c l a s s ) ; j o i n . s e t R e d u c e r C l a s s ( J o i n . c l a s s ) ; F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j o i n , P a t h ( " / u s e r / g a t e s / t m p / i n d e x e d _ p a g e s " ) ) ; F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j o i n , P a t h ( " / u s e r / g a t e s / t m p / f i l t e r e d _ u s e r s " ) ) ; F i l e O u t p u t F o r m a t . s et O u t p u t P a t h ( j o i n , n e w P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ; j o i n . s e t N u m R e d u c e T a s k s ( 5 0 ) ; J o b j o i n J o b = n e w J o b ( j o i n ) ; j o i n J o b . 
a d d D e p e n d i n g J o b ( l o a d P a g e s ) j o i n J o b . a d d D e p e n d i n g J o b ( l o a d U s e r s ) J o b C o n f g r o u p = n e w J o b C o n f ( M R Ex a m p l e . c l a s s ) ; g r o u p . s e t J o b N a m e ( " G r o u p U R L s " ) ; g r o u p . s e t I n p u t F o r m a t ( K e y V a l u e T e x t I g r o u p . s e t O u t p u t K e y C l a s s ( T e x t . c l a s s g r o u p . s e t O u t p u t V a l u e C l a s s ( L o n g W r i t g r o u p . s e t O u t p u t F o r m a t ( S e q u e n c e F il e O u t p u t F o r m a t . c l a s s g r o u p . s e t M a p p e r C l a s s ( L o a d J o i n e d . c l g r o u p . s e t C o m b i n e r C l a s s ( R e d u c e U r l s . g r o u p . s e t R e d u c e r C l a s s ( R e d u c e U r l s . c F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( g r o u p P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ; F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( g r o u p , P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ; g r o u p . s e t N u m R e d u c e T a s k s ( 5 0 ) ; J o b g r o u p J o b = n e w J o b ( g r o u p ) ; g r o u p J o b . a d d D e p e n d i n g J o b ( j o i n J o b ) ; J o b C o n f t o p 1 0 0 = n e w J o b C o n f ( M R E x a t o p 1 0 0 . s e t J o b N a m e ( " T o p 1 0 0 s i t e s " ) ; t o p 1 0 0 . s e t I n p u t F o r m a t ( S e q u e n c e F i l e t o p 1 0 0 . s e t O u t p u t K e y C l a s s ( L o n g W r i t a t o p 1 0 0 . s e t O u t p u t V a l u e C l a s s ( T e x t . c l t o p 1 0 0 . s e t O u t p u t F o r m a t ( S e q u e n c e F i lo r m a t . c l a s s ) ; t o p 1 0 0 . s e t M a p p e r C l a s s ( L o a d C l i c k s . c t o p 1 0 0 . s e t C o m b i n e r C l a s s ( L i m i t C l i c k t o p 1 0 0 . s e t R e d u c e r C l a s s ( L i m i t C l i c k s F i l e I n p u t F o r m a t . 
a d d I n p u t P a t h ( t o p 1 0 P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ; F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( t o p 1 0 0 P a t h ( " / u s e r / g a t e s / t o p 1 0 0 s i t e s f o r u s e r s 1 8 t o 2 t o p 1 0 0 . s e t N u m R e d u c e T a s k s ( 1 ) ; J o b l i m i t = n e w J o b ( t o p 1 0 0 ) ; l i m i t . a d d D e p e n d i n g J o b ( g r o u p J o b ) ; J o b C o n t r o l j c = n e w J o b C o n t r o l ( " F i1 0 0 s i t e s f o r 1 8 t o 2 5 " ) ; j c . a d d J o b ( l o a d P a g e s ) ; j c . a d d J o b ( l o a d U s e r s ) ; j c . a d d J o b ( j o i n J o b ) ; j c . a d d J o b ( g r o u p J o b ) ; j c . a d d J o b ( l i m i t ) ; j c . r u n ( ) ; } } Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 57. In Pig Latin
    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';
    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 58. Ease of Translation
    Notice how naturally the components of the job translate into Pig Latin:
    • Load Users → Users = load …
    • Filter by age → Filtered = filter …
    • Load Pages → Pages = load …
    • Join on name → Joined = join …
    • Group on url → Grouped = group …
    • Count clicks → Summed = … COUNT()…
    • Order by clicks → Sorted = order …
    • Take top 5 → Top5 = limit …
    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 59. Ease of Translation (continued) The diagram groups the same steps into three MapReduce jobs: Job 1, Job 2, and Job 3.
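To make the dataflow on the last few slides concrete, here is a minimal single-machine sketch in plain Python of the same steps (filter by age, join on name, group on url, count clicks, order, take the top n). It is only an illustration of the logic, not how Pig or MapReduce executes it. The function name `top_sites`, the sample records, and the assumption that user names are unique are all hypothetical.

```python
from collections import Counter

def top_sites(users, pages, n=5, min_age=18, max_age=25):
    """Return the n most-clicked URLs among users aged min_age..max_age."""
    # Filter by age (assumes each user name appears only once in `users`).
    names = {name for name, age in users if min_age <= age <= max_age}
    # Join on name, then group on url and count clicks.
    clicks = Counter(url for user, url in pages if user in names)
    # Order by clicks (descending) and take the top n.
    return clicks.most_common(n)

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("amy", 22), ("bob", 40), ("cat", 18)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"), ("amy", "c.com")]
print(top_sites(users, pages))
```

Each line of the function corresponds to one or two Pig Latin statements, which is the point the slides make about the Java version being far longer.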
  • 61. Hive • Developed at Facebook • Used for the majority of Facebook jobs • “Relational database” built on Hadoop • Maintains a list of table schemas • SQL-like query language (HQL) • Can call Hadoop Streaming scripts from HQL • Supports table partitioning, clustering, complex data types, and some optimizations • Translates SQL into MapReduce jobs • So you can write: SELECT COUNT(*) FROM users WHERE user_id = 56 • and Hive will translate this into MapReduce jobs
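As a rough illustration of what "translating SQL into MapReduce" means for the count query above, the sketch below splits it into a map step (apply the WHERE clause and emit a 1 per matching row) and a reduce step (sum the 1s). This is a conceptual model in plain Python, not the plan Hive actually generates. The function names, the single `"count"` key, and the sample rows are hypothetical.

```python
def map_row(row):
    """Map step: emit a (key, 1) pair for each row that passes the WHERE clause."""
    if row["user_id"] == 56:
        yield ("count", 1)

def reduce_counts(pairs):
    """Reduce step: sum the values emitted under the single key."""
    return sum(value for _key, value in pairs)

# Hypothetical rows standing in for the 'users' table.
rows = [{"user_id": 56}, {"user_id": 7}, {"user_id": 56}]
pairs = [pair for row in rows for pair in map_row(row)]
print(reduce_counts(pairs))
```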
  • 62. Hive Architecture: Apache Hive is built on top of Hadoop to provide data warehouse services.
  • 63. Sample Hive Queries
    • Find the top 5 pages visited by users aged 18-25:
      SELECT p.url, COUNT(1) AS clicks
      FROM users u JOIN page_views p ON (u.name = p.user)
      WHERE u.age >= 18 AND u.age <= 25
      GROUP BY p.url
      ORDER BY clicks DESC
      LIMIT 5;
    • Filter page views through a Python script:
      SELECT TRANSFORM(p.user, p.date) USING 'map_script.py' AS dt, uid
      FROM page_views p
      CLUSTER BY dt;
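The slide does not show `map_script.py`, so the following is only a guess at what such a script could look like. Hive's TRANSFORM feeds the selected columns to the script as tab-separated lines on standard input and reads tab-separated lines back, so a script that turns `(user, date)` into `(dt, uid)` can simply swap the two fields. The function name `transform` and the sample rows are hypothetical. An in-memory buffer stands in for standard input here; a real script would pass `sys.stdin` instead.

```python
import io

def transform(lines):
    """Turn tab-separated (user, date) rows into tab-separated (dt, uid) rows."""
    for line in lines:
        user, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{user}"

# Hypothetical input; in a real script this would be sys.stdin.
sample = io.StringIO("amy\t2010-01-01\nbob\t2010-01-02\n")
for row in transform(sample):
    print(row)
```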
  • 64. Pig
  • 65. Hive
  • 66. Sqoop • Often, valuable data in an organization is stored in relational database systems (RDBMS) • Sqoop is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing. • It’s even possible to use Sqoop to move data from a relational database into HBase.
  • 67. Sqoop: A Sample Import
    • After you install Sqoop, you can use it to import data into Hadoop.
    • Assuming MySQL is installed, let's log in and create a database.
    Example 15-1. Creating a new MySQL database schema
    % mysql -u root -p
    Enter password:
    Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is 349
    Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    mysql> CREATE DATABASE hadoopguide;
    Query OK, 1 row affected (0.02 sec)
    mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
    Query OK, 0 rows affected (0.00 sec)
    mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
    Query OK, 0 rows affected (0.00 sec)
    mysql> quit;
    Bye
  • 68. Sqoop
    • Now let's log back in to the database (not as root, but as yourself this time) and create a table to import into HDFS.
    Example 15-2. Populating the database
    % mysql hadoopguide
    Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is 352
    Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        -> widget_name VARCHAR(64) NOT NULL,
        -> price DECIMAL(10,2),
        -> design_date DATE,
        -> version INT,
        -> design_comment VARCHAR(100));
    Query OK, 0 rows affected (0.00 sec)
    mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',
        -> 1, 'Connects two gizmos');
    Query OK, 1 row affected (0.00 sec)
    mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4,
        -> NULL);
    Query OK, 1 row affected (0.00 sec)
    mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
        -> 13, 'Our flagship product');
    Query OK, 1 row affected (0.00 sec)
    mysql> quit;
  • 69. Sqoop
    • Now let's use Sqoop to import this table into HDFS:
    % sqoop import --connect jdbc:mysql://localhost/hadoopguide \
    > --table widgets -m 1
    10/06/23 14:44:18 INFO tool.CodeGenTool: Beginning code generation
    ...
    10/06/23 14:44:20 INFO mapred.JobClient: Running job: job_201006231439_0002
    10/06/23 14:44:21 INFO mapred.JobClient: map 0% reduce 0%
    10/06/23 14:44:32 INFO mapred.JobClient: map 100% reduce 0%
    10/06/23 14:44:34 INFO mapred.JobClient: Job complete: job_201006231439_0002
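The slides stop at the import log, so what follows is an assumption about the result: by default Sqoop writes imported rows as plain text, one comma-separated line per row, with SQL NULL written as the string `null`. Under that assumption, the sketch below parses such lines back into typed records. The sample lines are illustrative, derived from the INSERT statements on the previous slide rather than copied from real output, and `parse_widgets` and `FIELDS` are hypothetical names.

```python
import csv
import io
from decimal import Decimal

# Column order follows the CREATE TABLE statement for 'widgets'.
FIELDS = ["id", "widget_name", "price", "design_date", "version", "design_comment"]

def parse_widgets(text):
    """Parse comma-separated widget rows into dicts with typed values."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        row = dict(zip(FIELDS, record))
        row["id"] = int(row["id"])
        row["price"] = Decimal(row["price"])
        row["version"] = int(row["version"])
        # Assumption: a SQL NULL arrives as the literal string "null".
        if row["design_comment"] == "null":
            row["design_comment"] = None
        rows.append(row)
    return rows

# Illustrative lines only, not captured Sqoop output.
sample = (
    "1,sprocket,0.25,2010-02-10,1,Connects two gizmos\n"
    "2,gizmo,4.00,2009-11-30,4,null\n"
)
for widget in parse_widgets(sample):
    print(widget["widget_name"], widget["price"])
```

A comment field that itself contains a comma would be split incorrectly by this sketch; handling that would need an explicit delimiter or escaping choice at import time.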
  • 70. Sqoop
  • 73. RHadoop • RHadoop is an open-source collection of three R packages (rhdfs, rmr2, and rhbase), created by Revolution Analytics, that lets users manage and analyze data with Hadoop from an R environment. • It allows data scientists familiar with R to combine the enterprise-grade capabilities of the MapR Hadoop distribution with the analytic capabilities of R. • The packages have been implemented and tested with Cloudera's distribution of Hadoop (CDH3 and CDH4) and R 2.15.0. • The packages have also been tested with Revolution R 4.3, 5.0, and 6.0. For rmr, see Compatibility.
  • 74. RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R
    • From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.
      > library(rhdfs)
      > hdfs.init()
      > hdfs.ls('/')
      > q()
    • From R, load the rmr2 library and confirm that you can access the Hadoop cluster by running a simple MapReduce job.
      > library("rmr2")
      > small.ints <- to.dfs(1:1000)
      > out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
      > df <- as.data.frame(from.dfs(out))
    • Load the rhbase library, create an HBase table, display its description, and drop it.
      R --save
      > library(rhbase)
      > hb.init()
      > hb.new.table('testtable', 'colfam1')
      > hb.describe.table('testtable')
      > hb.delete.table('testtable')
      > q()
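For readers who do not use R, the rmr2 job on this slide is a map-only job: each input value v is emitted as the key with v squared as the value. A plain-Python analogue, run locally with no Hadoop involved, might look like the sketch below; `map_square` is a hypothetical name and the code only mirrors the logic of the R map function.

```python
def map_square(_key, values):
    """Mirror of the R map function: emit (v, v^2) for every input value."""
    return [(v, v ** 2) for v in values]

# Same input as the slide: the integers 1 through 1000.
out = map_square(None, range(1, 1001))
print(out[:3])
```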
  • 76. RHive is an R extension that facilitates distributed computing via Hive queries. It lets you run HQL, Hive's SQL-like query language, from R, and use R objects and functions inside HQL.
  • 77. Examples:
    ## Load the RHive library into R
    > library(RHive)
    > rhive.init()
    ## Try to connect to the Hive server
    > rhive.connect("HiveServer_IP")
    ## Execute HQL (Hive query)
    > dt <- rhive.query("select * from emp")
  • 79. Deployment with Revolution R Enterprise
  • 97. Thank you. Mahabubur Rahaman, Sr. Software Engineer, Orion Informatics Ltd, Dhaka, Bangladesh