International Journal of UbiComp (IJU), Vol.4, No.3, July 2013
DOI: 10.5121/iju.2013.4304
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
Sayalee Narkhede1 and Tripti Baraskar2
1Department of Information Technology, MIT-Pune, University of Pune, Pune
sayleenarkhede@gmail.com
2Department of Information Technology, MIT-Pune, University of Pune, Pune
baraskartn@gmail.com
ABSTRACT
In today's Internet world, log file analysis is becoming a necessary task for analyzing customer behaviour in order to improve advertising and sales; for datasets from domains such as the environment, medicine and banking it is equally important to analyze the log data to extract the required knowledge from it. Web mining is the process of discovering knowledge from web data. Log files are generated very fast, at a rate of 1-10 MB/s per machine, so a single data center can generate tens of terabytes of log data in a day. These datasets are huge. In order to analyze such large datasets we need a parallel processing system and a reliable data storage mechanism. A virtual database system is an effective solution for integrating the data, but it becomes inefficient for large datasets. The Hadoop framework provides reliable data storage through the Hadoop Distributed File System and the MapReduce programming model, which is a parallel processing system for large datasets. The Hadoop Distributed File System breaks up input data and sends fractions of the original data to several machines in the Hadoop cluster, each holding blocks of the data. This mechanism allows log data to be processed in parallel using all the machines in the Hadoop cluster and the result to be computed efficiently. The dominant approach provided by Hadoop, "store first, query later", loads the data into the Hadoop Distributed File System and then executes queries written in Pig Latin. This approach reduces the response time as well as the load on the end system. This paper proposes a log analysis system using Hadoop MapReduce which provides accurate results in minimum response time.
KEYWORDS
Hadoop, MapReduce, Log Files, Parallel Processing, Hadoop Distributed File System.
1. INTRODUCTION
As per the need of today's world, everything is going online. Every field has its own way of putting its applications and business online on the Internet. Sitting at home we can do shopping and banking-related work, get weather information, and use many more services. In such a competitive environment, service providers are eager to know whether they are providing the best service in the market, whether people are purchasing their products, whether users find the application interesting and friendly to use, or, in the field of banking, how many customers are looking forward to a particular bank scheme. In a similar way, they also need to know about problems that have occurred, how to resolve them, how to make websites or web applications more interesting, which products people are not purchasing and, in that case, how to improve advertising strategies to attract customers, and what the future marketing plans should be. To answer all of these questions, log files are helpful. Log files contain a list of the actions that occur whenever someone accesses your website or web application. These log files reside on web servers. Each individual request is listed on a separate line in a log file, called a log entry, which is created automatically every time someone makes a request to your web site. The point of a log file is to keep track of what is happening with the web server. Log files are also used to keep track of complex systems, so that when a problem does occur, it is easy to pinpoint and fix.
But there are times when log files are too difficult to read or make sense of, and it is then that log file analysis is necessary. These log files contain a wealth of useful information for service providers; analyzing them can give many insights that help in understanding website traffic patterns, user activity, user interests, etc. [10][11]. Thus, through log file analysis we can obtain the information needed to answer all of the above questions, since the log is the record of people's interaction with websites and applications.
Figure 1. Workflow of the System
1.1. Background
Log files are being generated at a record rate: thousands of terabytes or petabytes of log files are generated by a data center in a day. It is very challenging to store and analyze such large volumes of log files. The problem of analyzing log files is complicated not only because of their volume but also because of their disparate structure. Conventional database solutions are not suitable for analyzing such log files because they are not capable of handling such a large volume of logs efficiently. Andrew Pavlo and Erik Paulson in 2009 [13] compared SQL DBMSs and Hadoop MapReduce and suggested that Hadoop MapReduce gets the task up and running faster and also loads data faster than a DBMS. Traditional DBMSs also cannot handle large datasets. This is where big data technologies come to the rescue [8]. Hadoop MapReduce [6][8][17] is applicable in many areas of big data analysis. Since log files are one type of big data, Hadoop is the best-suited platform for storing log files and running a parallel implementation of a MapReduce [3] program to analyze them. Apache Hadoop is a new way for enterprises to store and analyze data. Hadoop is an open-source project created by Doug Cutting [17] and administered by the Apache Software Foundation. It enables applications to work with thousands of nodes and petabytes of data. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. As described by Tom White [6], a Hadoop cluster has thousands of nodes which store multiple blocks of log files. Hadoop is specially designed to work on large volumes of information by connecting commodity computers to work in parallel. Hadoop breaks log files up into blocks, and these blocks are evenly distributed over thousands of nodes in a Hadoop cluster. It also replicates these blocks over multiple nodes so as to provide features like reliability and fault tolerance. The parallel computation of MapReduce improves performance for large log files by breaking a job into a number of tasks.
1.2. Special Issues
1.2.1. Data Distribution
Performing computation on large volumes of log files has been done before, but what makes Hadoop different is its simplified programming model and its efficient, automatic distribution of data and work across the machines. Condor, for example, does not provide automatic distribution of data; a separate SAN must be managed in addition to the compute cluster.
1.2.2. Isolation of Processes
Each individual record is processed by a task in isolation from the others, which limits the communication overhead between processes in Hadoop. This makes the whole framework more reliable. In MapReduce, Mapper tasks process records in isolation. An individual node failure can be worked around by restarting its tasks on another machine; due to this isolation the other nodes continue to operate as if nothing went wrong.
1.2.3. Type of Data
Log files are plain text files consisting of semi-structured or unstructured records. A traditional RDBMS has a pre-defined schema, and all the log data must be fitted into that schema. For this, the Trans-Log algorithm is used, which converts such unstructured logs into structured ones by transforming a simple text log file into an Oracle table [12]. But again, a traditional RDBMS has a limitation on the size of the data. Hadoop is compatible with any type of data; it works well for simple text files too.
1.2.4. Fault Tolerance
In a Hadoop cluster the problem of data loss is resolved. Blocks of the input file are replicated by a factor of three on multiple machines in the Hadoop cluster [4]. So even if a machine goes down, another machine on which the same block resides takes care of further processing. Thus the failure of some nodes in the cluster does not affect the performance of the system.
1.2.5. Data Locality and Network Bandwidth
Log files are spread across HDFS as blocks, so a compute process running on a node operates on a subset of the files. Which data a node operates upon is chosen by the locality of that node, i.e. it reads from the local system, reducing the strain on network bandwidth and preventing unnecessary network transfers. This property of moving the computation near to the data makes Hadoop different from other systems [4][6]; the resulting data locality is what provides high performance.
2. SYSTEM ARCHITECTURE
The system architecture shown in Figure 2 consists of major components such as web servers, a cloud framework implementing Hadoop storage and the MapReduce programming model, and a user interface.
Figure 2. System Architecture
2.1. Web Servers
This module consists of multiple web servers from which log files are collected. As log files reside on a web server, we need to collect them from these servers, and the collected log files may require pre-processing before being stored in HDFS [4][6]. Pre-processing consists of cleaning log files, removing redundancy, etc. Because we need quality data, pre-processing has to be performed. These servers are thus responsible for providing the relevant log files which are required to be processed.
2.2. Cloud Framework
The cloud consists of a storage module and the MapReduce [3] processing model. Many virtual servers configured with Hadoop store log files in a distributed manner in HDFS [4][6]. By dividing log files into blocks of size 64 MB or above, we can store them on multiple virtual servers in a Hadoop cluster. Workers in MapReduce are assigned Map and Reduce tasks and carry out the Map tasks in parallel. The analysis of log files is thus done in just two phases, Map and Reduce, wherein the Map tasks generate intermediate (key, value) pairs and the Reduce tasks provide a summarized value for each particular key. Pig [5], installed on the Hadoop virtual servers, maps user queries to MapReduce jobs, because working out how to fit log processing into the MapReduce pattern is a challenge [14]. The evaluated results of the log analysis are stored back onto the virtual servers of HDFS.
2.3. User Interface
This module communicates between the user and the HMR system, allowing the user to interact with the system, specify the processing query, evaluate results, and also visualize the results of the log analysis in different forms of graphical reports.
3. IMPLEMENTATION OF HMR LOG PROCESSOR
The HMR log processor is implemented in three phases: log pre-processing, interacting with HDFS, and implementation of the MapReduce programming model.
3.1. Log Pre-processing
In any analytical tool, pre-processing is necessary because a log file may contain noisy and ambiguous data which may affect the result of the analysis. Log pre-processing is an important step for filtering and organizing only the appropriate information before applying the MapReduce algorithm. Pre-processing reduces the size of the log file and also increases the quality of the available data. The purpose of log pre-processing is to improve log quality and increase the accuracy of the results.
3.1.1. Individual Fields in the Log
A log file contains many fields such as IP address, URL, date, hit, age, country, state and city. But as a log file is a simple text file, we need to separate out each field in each log entry. This can be done by using a separator such as a space, ';' or '#'. We have used '#' here to separate the fields in the log entry.
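As an illustration, the separation step can be sketched in Java as below; the class name, the sample entry, and the assumed field order (IP#URL#date#hit#age#country#state#city) are hypothetical and not taken from the actual HMR implementation.

```java
// Minimal sketch of field separation for one '#'-delimited log entry.
// The field order below is an assumption for illustration only.
public class LogEntryParser {

    // Splits a '#'-separated log entry into its individual fields.
    public static String[] parse(String logEntry) {
        return logEntry.split("#");
    }

    public static void main(String[] args) {
        String entry = "10.1.1.5#/loans.html#2013-02-10#1#34#India#Maharashtra#Pune";
        String[] fields = parse(entry);
        System.out.println("IP = " + fields[0] + ", URL = " + fields[1]);
    }
}
```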
3.1.2. Removing Redundant Log Entries
It often happens that a person visits the same page many times. So even if a person has visited a particular page 10 times, the processor should count this as one hit to that page. We can check this using the IP address and URL: if the IP address and URL are the same for 10 log entries, then the other 9 entries must be removed, leaving only one entry for that log. This improves the accuracy of the results of the system, and a possible way to express the check is sketched below.
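The sketch assumes the same '#'-delimited layout as above, with the IP address and URL as the first two fields, and keeps only the first entry for each (IP, URL) pair.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of redundancy removal: entries sharing an IP address and URL
// are counted only once. Field positions are an assumption (IP first, URL second).
public class RedundancyFilter {

    public static List<String> deduplicate(List<String> entries) {
        Set<String> seen = new LinkedHashSet<String>();
        List<String> unique = new ArrayList<String>();
        for (String entry : entries) {
            String[] fields = entry.split("#");
            if (fields.length < 2) {
                continue;                              // skip malformed entries
            }
            String key = fields[0] + "|" + fields[1];  // IP + URL identifies a visit
            if (seen.add(key)) {                       // true only for the first occurrence
                unique.add(entry);
            }
        }
        return unique;
    }
}
```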
3.1.3. Cleaning
Cleaning consists of removing unnecessary log entries from the log file, i.e. removing requests for multimedia files, images and page styles with extensions like .jpg, .gif and .css. These entries are unnecessary for application logs, so we remove them in order to obtain a log file containing only quality logs. This also reduces the size of the log file somewhat.
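A simple cleaning filter could be written as follows; the ignored extensions mirror the ones listed above, and the position of the URL field is again an assumption.

```java
// Minimal sketch of cleaning: drop log entries that request multimedia or
// style resources (.jpg, .gif, .css) rather than actual pages.
public class LogCleaner {

    private static final String[] IGNORED_EXTENSIONS = { ".jpg", ".gif", ".css" };

    // Returns true for a genuine page hit, false for image/style-sheet requests.
    public static boolean isRelevant(String logEntry) {
        String[] fields = logEntry.split("#");
        if (fields.length < 2) {
            return false;                  // malformed entry, drop it
        }
        String url = fields[1].toLowerCase();
        for (String ext : IGNORED_EXTENSIONS) {
            if (url.endsWith(ext)) {
                return false;              // multimedia or page-style request
            }
        }
        return true;
    }
}
```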
3.2. Interacting With HDFS
Figure 3. Hadoop Distributed File System
The Hadoop Distributed File System holds large log files in a redundant way across multiple machines to achieve high availability for parallel processing and durability during failures. It also provides high-throughput access to log files. It is a block-structured file system, as it breaks up log files into small blocks of fixed size. The default block size given by Hadoop is 64 MB, but we can also set a block size of our choice. These blocks are replicated over multiple machines across a Hadoop cluster. The replication factor is 3, so Hadoop replicates each block three times; even on the failure of a node there should be no data loss. Hadoop storage is shown in Figure 3. In Hadoop, log file data is accessed in a write-once, read-many-times manner. HDFS is a powerful companion to Hadoop MapReduce: MapReduce jobs automatically draw their input log files from HDFS when the fs.default.name configuration option is set to point to the NameNode.
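For illustration, a client program can read a stored log file back from HDFS roughly as in the sketch below; the NameNode address hdfs://namenode:9000 and the path /logs/access_log.txt are placeholders, not values from our setup.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of reading a log file stored in HDFS.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");  // point clients to the NameNode

        FileSystem fs = FileSystem.get(conf);                 // handle to HDFS
        Path logFile = new Path("/logs/access_log.txt");      // placeholder path
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(logFile)));
        try {
            System.out.println(reader.readLine());            // print the first log entry
        } finally {
            reader.close();
        }
    }
}
```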
3.3. MapReduce Framework
MapReduce is a simple programming model for parallel processing of large volumes of data. This data could be anything, but the model is specifically designed to process lists of data. The fundamental concept of MapReduce is to transform lists of input data into lists of output data. It often happens that the input data is not in a readable format, and it can be a difficult task to understand large input datasets. In that case, we need a model that can mold the input data lists into readable, understandable output lists. MapReduce does this conversion twice, for the two major tasks, Map and Reduce, by dividing the whole workload into a number of tasks and distributing them over the different machines in the Hadoop cluster. As we know, the logs in the log files are also in the form of lists. A log file consists of thousands of records, i.e. logs, which are in text format. Nowadays business servers generate log files of the order of terabytes in a day. From a business perspective, there is a need to process these log files so that we can have appropriate reports of how our business is going. For this application, the MapReduce implementation in Hadoop is one of the best solutions. MapReduce is divided into two phases: the Map phase and the Reduce phase.
3.3.1. Map Phase
Figure 4. Architecture of Map Phase
The input to MapReduce is the log file; each record in the log file is considered an input to a Map task. The Map function takes a key-value pair as input and produces an intermediate result in the form of key-value pairs. It takes each attribute in the record as a key and maps each value in the record to its key, generating intermediate output as key-value pairs. Map reads each log from the simple text file, breaks each log into a sequence of keys (x1, x2, ..., xn) and emits a value for each key, which is always 1. If a key appears n times among all records then there will be n key-value pairs (x, 1) among its output.
Map: (x1, v1) → [(x2, v2)]
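A Map task of this kind can be sketched with Hadoop's older org.apache.hadoop.mapred API, which this paper refers to (JobConf, OutputCollector, Reporter); the class name and the choice of the URL field as the emitted key are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a Map task: emit (URL, 1) for every '#'-delimited log entry.
public class HitCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String[] fields = value.toString().split("#");  // IP#URL#... (assumed layout)
        if (fields.length > 1) {
            url.set(fields[1]);                         // the requested page
            output.collect(url, ONE);                   // intermediate (x, 1) pair
        }
    }
}
```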
3.3.2. Grouping
After all the Map tasks have completed successfully, the master controller merges the intermediate result files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-value pairs. That is, for each key x, the input to the Reduce task that handles key x is a pair of the form (x, [v1, v2, ..., vn]), where (x, v1), (x, v2), ..., (x, vn) are all the key-value pairs with key x coming from all the Map tasks.
3.3.3. Reduce Phase
A Reduce task takes a key and its list of associated values as input. It combines the values for the input key by reducing the list of values to a single value, which is the count of occurrences of that key in the log file, thus generating output in the form of the key-value pair (x, sum).
Reduce: (x2, [v2]) → (x3, v3)
Figure 5. Architecture of Reduce Phase
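The corresponding Reduce task can be sketched as follows; it simply sums the 1s that the Map tasks emitted for each key, producing the (x, sum) pairs described above.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a Reduce task: count the occurrences of each key in the log file.
public class HitCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();                 // add up the 1s for this key
        }
        output.collect(key, new IntWritable(sum));      // emit (x, sum)
    }
}
```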
4. IN-DEPTH MAPREDUCE DATA FLOW
4.1. Input Data
The input to MapReduce comes from HDFS, where the log files are stored on the processing cluster. By dividing log files into small blocks we can distribute them over the nodes of the Hadoop cluster. The format of input files to MapReduce is arbitrary, but for log files it is line-based, as each line is considered one record, i.e. one log. InputFormat is a class that defines how input files are split and read. FileInputFormat is the abstract class; all other InputFormats that operate on files inherit functionality from this class. When starting a MapReduce job, FileInputFormat reads the log files from a directory and splits them into chunks. By calling the setInputFormat() method of the JobConf object we can set any InputFormat provided by Hadoop. TextInputFormat treats each line in the input file as one record and is best suited for unformatted data; for line-based log files, TextInputFormat is useful. When MapReduce runs, the whole job is broken up into pieces that operate over blocks of the input files. These pieces are the Map tasks, which operate on input blocks or the whole log file. By default a file breaks up into blocks of size 64 MB, but we can choose how the file is split as per our need by setting the mapred.min.split.size parameter. So for large log files it is helpful to improve performance by parallelizing the Map tasks. Because tasks can be scheduled on the different nodes of the cluster where the log files actually reside, it is possible to process log files locally.
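A driver that wires such Map and Reduce tasks together and sets the options mentioned above (the InputFormat via setInputFormat() and the mapred.min.split.size parameter) might look roughly like the sketch below; the job name, the 128 MB split size and the command-line paths are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Sketch of a job driver for the hypothetical mapper and reducer shown earlier.
public class HitCountJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HitCountJob.class);
        conf.setJobName("hmr-log-hit-count");

        conf.setInputFormat(TextInputFormat.class);    // one log entry per line
        conf.setOutputFormat(TextOutputFormat.class);  // "key <TAB> value" output lines

        conf.setMapperClass(HitCountMapper.class);
        conf.setReducerClass(HitCountReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Lower bound on split size in bytes; tune it to control the number of Map tasks.
        conf.set("mapred.min.split.size", "134217728"); // 128 MB, for example

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // log directory in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not exist yet

        JobClient.runJob(conf);
    }
}
```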
Figure 6. MapReduce Workflow
4.2. RecordReader
RecordReader is the class that defines how to access the logs in the log files and converts them into (key, value) pairs readable by the Mapper. TextInputFormat provides a LineRecordReader, which treats each line as a new value. The RecordReader is invoked repeatedly until the whole block is exhausted, and the Map() method of the Mapper is called each time the RecordReader is invoked.
4.3. Mapper
The Mapper performs the Map phase of the MapReduce job. The Map() method emits intermediate output in the form of (key, value) pairs. For each Map task, a new instance of the Mapper is instantiated in a separate Java process. The Map() method receives four parameters: key, value, OutputCollector and Reporter. The collect() method of the OutputCollector object forwards the intermediate (key, value) pairs to the Reducer as input for the Reduce phase. The Map task provides information about its progress to the Reporter object, thereby providing information about the current task, such as its InputSplit and Map task, and also emitting status back to the user.
4.4. Shuffle And Sort
After the first Map tasks have completed, the nodes start moving intermediate output from Map tasks to the Reduce tasks for which it is destined. This process of moving intermediate output from Map tasks to Reduce tasks as input is called shuffling, and it is the only communication step in MapReduce. Before these (key, value) pairs are given as input to the Reducers, Hadoop groups all the values for the same key, no matter where they come from. Such partitioned data is then assigned to the Reduce tasks.
4.5. Reducer
A Reducer instance calls the Reduce() method for each key in the partition assigned to that Reducer. It receives a key and an iterator over all the values associated with that key. The Reduce() method also has parameters, namely the key, the iterator over all values, an OutputCollector and a Reporter, which work in the same manner as for the Map() method.
4.6. Output Data
MapReduce provides (key, value) pairs to the OutputCollector to write them to the output files. OutputFormat defines how these pairs are written to the output files. TextOutputFormat works similarly to TextInputFormat: it writes each (key, value) pair on a separate line of the output file, in "key value" format. The OutputFormats of Hadoop write mainly to files on HDFS.
5. EXPERIMENTAL SETUP
We present here sample results of our proposed work. For the sample test we have taken log files of a banking application server. These log files contain fields like IP address, URL, date, age, hit, city, state and country. We have installed Hadoop on two machines, with three instances of Hadoop on each machine, running Ubuntu 10.10 or above. We have also installed Pig for query processing. Log files are distributed evenly on these nodes. We have created a web application through which the user interacts with the system: the user can distribute log files on the Hadoop cluster, run the MapReduce job on these files, and get the analyzed results in graphical formats like bar charts and pie charts. The analyzed results show total hits for the web application, hits per page, hits per city, hits for each page from each city, hits per quarter of the year, hits for each page during the whole year, etc. Figure 7 gives an idea of the total hits from each city, and Figure 8 shows the total hits per page.
Figure 7. Total hits from each city
Figure 8. Total hits per page
6. CONCLUSIONS
In order to have summarized data results for a particular web application, we need to do log analysis, which helps to improve business strategies as well as to generate statistical reports. The Hadoop-MR log file analysis tool provides graphical reports showing hits for web pages, users' activity, which parts of the website users are interested in, traffic sources, etc. From these reports, business communities can evaluate which parts of the website need to be improved, who the potential customers are, from which geographical region the website is getting maximum hits, and so on, which will help in designing future marketing plans. Log analysis can be done by various methods, but what matters is response time. The Hadoop MapReduce framework provides parallel distributed processing and reliable data storage for large volumes of log files. Firstly, data is stored in a hierarchy on several nodes in a cluster, so that the access time required can be reduced, which saves much of the processing time. Here Hadoop's characteristic of moving computation to the data, rather than moving data to the computation, helps to improve the response time. Secondly, MapReduce works successfully for large datasets, giving efficient results.
REFERENCES
[1] S. Sathya, M. Victor Jose, (2011) "Application of Hadoop MapReduce Technique to Virtual Database System Design", International Conference on Emerging Trends in Electrical and Computer Technology (ICETECT), pp. 892-896.
[2] Yulai Yuan, Yongwei Wu, Xiao Feng, Jing Li, Guangwen Yang, Weimin Zheng, (2010) "VDB-MR: MapReduce-based distributed data integration using virtual database", Future Generation Computer Systems, vol. 26, pp. 1418-1425.
[3] Jeffrey Dean and Sanjay Ghemawat, (2004) "MapReduce: Simplified Data Processing on Large Clusters", Google Research Publication.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, (2010) "The Hadoop Distributed File System", Mass Storage Systems and Technologies (MSST), Sunnyvale, California USA, vol. 10, pp. 1-10.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, (2008) "Pig Latin: a not-so-foreign language for data processing", ACM SIGMOD International Conference on Management of Data, pp. 1099-1110.
[6] Tom White, (2009) "Hadoop: The Definitive Guide", O'Reilly, Sebastopol, California.
[7] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, (2008) "Improving MapReduce performance in heterogeneous environments", OSDI'08: 8th USENIX Symposium on Operating Systems Design and Implementation.
[8] Yogesh Pingle, Vaibhav Kohli, Shruti Kamat, Nimesh Poladia, (2012) "Big Data Processing using Apache Hadoop in Cloud System", National Conference on Emerging Trends in Engineering & Technology.
[9] Cooley R., Srivastava J., Mobasher B., (1997) "Web mining: information and pattern discovery on the World Wide Web", IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567.
[10] Liu Zhijing, Wang Bin, (2003) "Web mining research", International Conference on Computational Intelligence and Multimedia Applications, pp. 84-89.
[11] Yang, Q. and Zhang, H., (2003) "Web-Log Mining for Predictive Caching", IEEE Trans. Knowledge and Data Eng., 15(4), pp. 1050-1053.
[12] P. Nithya, P. Sumathi, (2012) "A Survey on Web Usage Mining: Theory and Applications", International Journal of Computer Technology and Applications, Vol. 3, pp. 1625-1629.
[13] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, (2009) "A Comparison of Approaches to Large-Scale Data Analysis", ACM SIGMOD'09.
[14] Gates et al., (2009) "Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience", VLDB 2009, Section 4.
[15] LI Jing-min, HE Guo-hui, (2010) "Research of Distributed Database System Based on Hadoop", IEEE International Conference on Information Science and Engineering (ICISE), pp. 1417-1420.
[16] T. Hoff, (2008) "How Rackspace Now Uses MapReduce and Hadoop To Query Terabytes of Data".
[17] Apache Hadoop, http://Hadoop.apache.org
Authors
Sayalee Narkhede is an ME student in the IT department at MIT-Pune under the University of Pune. She received her BE degree in Computer Engineering from AISSMS IOIT under the University of Pune. Her research interests include Distributed Systems and Cloud Computing.

Tripti Baraskar is an assistant professor in the IT department of MIT-Pune. She received her ME degree in Communication Controls and Networking from MIT Gwalior and her BE degree from the Oriental Institute of Science and Technology under RGTV. Her research interests include Image Processing.
Weitere ähnliche Inhalte

Was ist angesagt?

Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce cscpconf
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
Hadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An OverviewHadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An Overviewrahulmonikasharma
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Comparison with Traditional databases
Comparison with Traditional databasesComparison with Traditional databases
Comparison with Traditional databasesGowriLatha1
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportAhmad El Tawil
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...eSAT Journals
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPijdms
 
Data Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentData Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentIJERA Editor
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTijwscjournal
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 

Was ist angesagt? (18)

Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
Big data
Big dataBig data
Big data
 
Hadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An OverviewHadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An Overview
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
paper
paperpaper
paper
 
IJET-V3I2P14
IJET-V3I2P14IJET-V3I2P14
IJET-V3I2P14
 
Comparison with Traditional databases
Comparison with Traditional databasesComparison with Traditional databases
Comparison with Traditional databases
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
 
Data Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentData Ware House System in Cloud Environment
Data Ware House System in Cloud Environment
 
E018142329
E018142329E018142329
E018142329
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 

Andere mochten auch

Andere mochten auch (7)

Northern Coalfields Limited Sales & Marketing
Northern Coalfields Limited Sales & MarketingNorthern Coalfields Limited Sales & Marketing
Northern Coalfields Limited Sales & Marketing
 
Smu bba sem 5 bbr summer 2016 assignments
Smu bba sem 5 bbr summer 2016 assignmentsSmu bba sem 5 bbr summer 2016 assignments
Smu bba sem 5 bbr summer 2016 assignments
 
Ncl summer training report
Ncl summer training reportNcl summer training report
Ncl summer training report
 
Summer training report
Summer training reportSummer training report
Summer training report
 
Applied Shariah In Financial Transactions
Applied Shariah In Financial TransactionsApplied Shariah In Financial Transactions
Applied Shariah In Financial Transactions
 
Northern Coalfields Limited NCL
Northern Coalfields Limited NCLNorthern Coalfields Limited NCL
Northern Coalfields Limited NCL
 
Project Of PepsiCo
Project Of PepsiCoProject Of PepsiCo
Project Of PepsiCo
 

Ähnlich wie HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce

hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training reportSarvesh Meena
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCEcsandit
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCEcscpconf
 
A NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATAA NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATAijdms
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoopdbpublications
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkAgnihotriGhosh2
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Labkevinflorian
 

Ähnlich wie HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce (20)

hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
Big Data
Big DataBig Data
Big Data
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Big data
Big dataBig data
Big data
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
IJET-V2I6P25
IJET-V2I6P25IJET-V2I6P25
IJET-V2I6P25
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
A NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATAA NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATA
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Lab
 
Hadoop
HadoopHadoop
Hadoop
 
Presentation1
Presentation1Presentation1
Presentation1
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 

Mehr von ijujournal

Users Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase IIUsers Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase IIijujournal
 
Users Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase IIUsers Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase IIijujournal
 
October 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdfOctober 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdfijujournal
 
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERS
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERSACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERS
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERSijujournal
 
A novel integrated approach for handling anomalies in RFID data
A novel integrated approach for handling anomalies in RFID dataA novel integrated approach for handling anomalies in RFID data
A novel integrated approach for handling anomalies in RFID dataijujournal
 
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...ijujournal
 
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIES
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIESENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIES
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIESijujournal
 
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEijujournal
 
SERVICE DISCOVERY – A SURVEY AND COMPARISON
SERVICE DISCOVERY – A SURVEY AND COMPARISONSERVICE DISCOVERY – A SURVEY AND COMPARISON
SERVICE DISCOVERY – A SURVEY AND COMPARISONijujournal
 
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKS
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKSSIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKS
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKSijujournal
 
International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)ijujournal
 
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...ijujournal
 
A proposed Novel Approach for Sentiment Analysis and Opinion Mining
A proposed Novel Approach for Sentiment Analysis and Opinion MiningA proposed Novel Approach for Sentiment Analysis and Opinion Mining
A proposed Novel Approach for Sentiment Analysis and Opinion Miningijujournal
 
International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)ijujournal
 
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...ijujournal
 
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCS
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCSSECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCS
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCSijujournal
 
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKS
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKSPERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKS
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKSijujournal
 
PERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLSPERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLSijujournal
 
PERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLSPERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLSijujournal
 
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2R
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2RDETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2R
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2Rijujournal
 

Mehr von ijujournal (20)

Users Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase IIUsers Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase II
 
Users Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase IIUsers Approach on Providing Feedback for Smart Home Devices – Phase II
Users Approach on Providing Feedback for Smart Home Devices – Phase II
 
October 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdfOctober 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdf
 
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERS
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERSACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERS
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERS
 
A novel integrated approach for handling anomalies in RFID data
A novel integrated approach for handling anomalies in RFID dataA novel integrated approach for handling anomalies in RFID data
A novel integrated approach for handling anomalies in RFID data
 
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET...
 
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIES
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIESENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIES
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIES
 
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
 
SERVICE DISCOVERY – A SURVEY AND COMPARISON
SERVICE DISCOVERY – A SURVEY AND COMPARISONSERVICE DISCOVERY – A SURVEY AND COMPARISON
SERVICE DISCOVERY – A SURVEY AND COMPARISON
 
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKS
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKSSIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKS
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKS
 
International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)
 
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE...
 
A proposed Novel Approach for Sentiment Analysis and Opinion Mining
A proposed Novel Approach for Sentiment Analysis and Opinion MiningA proposed Novel Approach for Sentiment Analysis and Opinion Mining
A proposed Novel Approach for Sentiment Analysis and Opinion Mining
 
International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)International Journal of Ubiquitous Computing (IJU)
International Journal of Ubiquitous Computing (IJU)
 
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN...
 
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCS
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCSSECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCS
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCS
 
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKS
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKSPERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKS
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKS
 
PERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLSPERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLS
 
PERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLSPERFORMANCE COMPARISON OF OCR TOOLS
PERFORMANCE COMPARISON OF OCR TOOLS
 
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2R
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2RDETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2R
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2R
 

Kürzlich hochgeladen

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Log files are also used to keep track of complex systems, so that when a problem does occur, it is easy to pinpoint and fix. But there are times when log files are too difficult to read or make sense of, and it is then that log file analysis is necessary. These log files hold a great deal of useful information for service providers, and analyzing them can give insights that help in understanding website traffic patterns, user activity, users' interests, etc.[10][11]. Thus, through log file analysis we can get answers to all the above questions, as the log is the record of people's interaction with websites and applications.

Figure 1. Workflow of the System

1.1. Background

Log files are being generated at a record rate. Thousands of terabytes or petabytes of log files are generated by a data center in a day. It is very challenging to store and analyze such large volumes of log files. The problem of analyzing log files is complicated not only because of their volume but also because of the disparate structure of log files. Conventional database solutions are not suitable for analyzing such log files because they are not capable of handling such a large volume of logs efficiently. Andrew Pavlo and Erik Paulson in 2009 [13] compared SQL DBMSs and Hadoop MapReduce and suggested that Hadoop MapReduce tunes up for the task faster and also loads data faster than a DBMS. Traditional DBMSs also cannot handle large datasets. This is where big data technologies come to the rescue[8]. Hadoop MapReduce[6][8][17] is applicable in many areas of big data analysis. As log files are one type of big data, Hadoop is the most suitable platform for storing log files and for the parallel implementation of MapReduce[3] programs to analyze them. Apache Hadoop is a new way for enterprises to store and analyze data. Hadoop is an open-source project created by Doug Cutting[17] and administered by the Apache Software Foundation. It enables applications to work with thousands of nodes and petabytes of data. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. As described by Tom White [6], in a Hadoop cluster there are thousands of nodes which store multiple blocks of log files. Hadoop is specially designed to work on large volumes of information by connecting commodity computers to work in parallel. Hadoop breaks up log files into blocks, and these blocks are evenly distributed over the nodes of a Hadoop cluster. It also replicates these blocks over multiple nodes so as to provide features like reliability and fault tolerance. Parallel computation in MapReduce improves performance for large log files by breaking a job into a number of tasks.

1.2. Special Issues

1.2.1. Data Distribution

Performing computation on large volumes of log files has been done before, but what makes Hadoop different is its simplified programming model and its efficient, automatic distribution of data and work across the machines.
Condor, for example, does not provide automatic distribution of data; a separate SAN must be managed in addition to the compute cluster.

1.2.2. Isolation of Processes

Each individual record is processed by a task in isolation from the others, which limits the communication overhead between processes in Hadoop. This makes the whole framework more reliable. In MapReduce, Mapper tasks process records in isolation. An individual node failure can be worked around by restarting tasks on another machine. Due to this isolation, the other nodes continue to operate as if nothing went wrong.

1.2.3. Type of Data

Log files are plain text files consisting of semi-structured or unstructured records. A traditional RDBMS has a pre-defined schema; all the log data has to fit into that schema only. For this, the Trans-Log algorithm is used, which converts such unstructured logs into structured ones by transforming a simple text log file into an Oracle table[12]. But again, a traditional RDBMS has limitations on the size of data. Hadoop is compatible with any type of data; it works well for simple text files too.

1.2.4. Fault Tolerance

In a Hadoop cluster, the problem of data loss is resolved. Blocks of the input file are replicated by a factor of three on multiple machines in the Hadoop cluster[4]. So even if a machine goes down, another machine where the same block resides will take care of further processing. Thus the failure of some nodes in the cluster does not affect the performance of the system.

1.2.5. Data Locality and Network Bandwidth

Log files are spread across HDFS as blocks, so a compute process running on a node operates on a subset of files. Which data a node operates upon is chosen by the locality of the node, i.e. reading from the local system, reducing the strain on network bandwidth and preventing unnecessary network transfers. This property of moving computation near the data makes Hadoop different from other systems[4][6]. Data locality is achieved by this property of Hadoop, which results in high performance.

2. SYSTEM ARCHITECTURE

The system architecture shown in Figure 2 consists of major components: web servers, a cloud framework implementing Hadoop storage and the MapReduce programming model, and a user interface.

Figure 2. System Architecture
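As an illustration of the fault-tolerance and data-locality settings described in Sections 1.2.4 and 1.2.5, the following is a minimal sketch (not part of the original paper) of how the block size and replication factor might be set programmatically with the classic Hadoop configuration keys; the values shown are simply the defaults the paper assumes (64 MB blocks, replication factor 3).

    import org.apache.hadoop.conf.Configuration;

    public class HdfsStorageSettings {
        public static Configuration logStorageConf() {
            Configuration conf = new Configuration();
            // 64 MB blocks, as assumed in the paper; block size controls how input is split.
            conf.setLong("dfs.block.size", 64 * 1024 * 1024L);
            // Replicate each block three times so a single node failure causes no data loss.
            conf.setInt("dfs.replication", 3);
            return conf;
        }
    }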
2.1. Web Servers

This module consists of multiple web servers from which log files are collected. As log files reside on a web server, we need to collect them from these servers, and the collected log files may require pre-processing before being stored in HDFS[4][6]. Pre-processing consists of cleaning log files, removing redundancy, etc. Because we need quality data, pre-processing has to be performed. So these servers are responsible for fetching the relevant log files which are required to be processed.

2.2. Cloud Framework

The cloud consists of a storage module and the MapReduce[3] processing model. Many virtual servers configured with Hadoop store log files in a distributed manner in HDFS[4][6]. By dividing log files into blocks of size 64 MB or above, we can store them on multiple virtual servers in a Hadoop cluster. Workers in MapReduce are assigned Map and Reduce tasks, and the Map tasks are computed in parallel. So the system analyzes log files in just two phases, Map and Reduce, wherein the Map tasks generate intermediate results as (key, value) pairs and the Reduce task provides the summarized value for a particular key. Pig[5], installed on the Hadoop virtual servers, maps a user query to MapReduce jobs, because working out how to fit log processing into the MapReduce pattern is a challenge[14]. The evaluated results of the log analysis are stored back onto the virtual servers of HDFS.

2.3. User Interface

This module communicates between the user and the HMR system, allowing the user to interact with the system by specifying a processing query, evaluating results, and also visualizing the results of the log analysis as different forms of graphical reports.

3. IMPLEMENTATION OF HMR LOG PROCESSOR

The HMR log processor is implemented in three phases: log pre-processing, interacting with HDFS, and implementation of the MapReduce programming model.

3.1. Log Pre-processing

In any analytical tool, pre-processing is necessary, because a log file may contain noisy and ambiguous data which may affect the result of the analysis process. Log pre-processing is an important step to filter and organize only the appropriate information before applying the MapReduce algorithm. Pre-processing reduces the size of the log file and also increases the quality of the available data. The purpose of log pre-processing is to improve log quality and increase the accuracy of results.

3.1.1. Individual Fields in the Log

A log file contains many fields like IP address, URL, date, hit, age, country, state, city, etc. But as the log file is a simple text file, we need to separate out each field in each log entry. This can be done by using a separator such as a space, ';' or '#'. We have used '#' here to separate the fields in a log entry.

3.1.2. Removing Redundant Log Entries

It often happens that a person visits the same page many times. So even if a person has visited a particular page 10 times, the processor should count this as 1 hit to that page. This can be checked using the IP address and URL: if the IP address and URL are the same for 10 log entries, then the other 9 entries must be removed, leaving only 1 entry of that log. This improves the accuracy of the results of the system.
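The following is a minimal sketch (not the authors' code) of the pre-processing steps just described: splitting each entry on the '#' separator and keeping only one entry per (IP address, URL) pair. The field positions used here (0 = IP address, 1 = URL) are assumptions for illustration only.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class LogPreprocessor {
        /** Keep one entry per (IP address, URL) pair, as in Section 3.1.2. */
        public static List<String> deduplicate(List<String> rawLogEntries) {
            Set<String> seen = new HashSet<String>();
            List<String> cleaned = new ArrayList<String>();
            for (String entry : rawLogEntries) {
                String[] fields = entry.split("#");
                if (fields.length < 2) continue;          // skip malformed entries
                String key = fields[0] + "|" + fields[1]; // assumed: IP address + URL
                if (seen.add(key)) {                      // first time this visitor/page pair is seen
                    cleaned.add(entry);
                }
            }
            return cleaned;
        }
    }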
3.1.3. Cleaning

Cleaning involves removing unnecessary log entries from the log file, i.e. removing requests for multimedia files, images and page styles with extensions like .jpg, .gif and .css. These entries are unnecessary for application log analysis, so we remove them in order to obtain a log file containing only quality logs. This also reduces the size of the log file somewhat.

3.2. Interacting With HDFS

Figure 3. Hadoop Distributed File System

The Hadoop Distributed File System holds large log files in a redundant way across multiple machines to achieve high availability for parallel processing and durability during failures. It also provides high-throughput access to log files. It is a block-structured file system, as it breaks up log files into blocks of fixed size. The default block size in Hadoop is 64 MB, but a block size of our choice can also be set. These blocks are replicated over multiple machines across a Hadoop cluster. The replication factor is 3, so Hadoop replicates each block 3 times; even in the event of a node failure there should be no data loss. Hadoop storage is shown in Figure 3. In Hadoop, log file data is accessed in a write-once, read-many-times manner. HDFS is a powerful companion to Hadoop MapReduce: MapReduce jobs automatically draw their input log files from HDFS when the fs.default.name configuration option is set to point to the NameNode.

3.3. MapReduce Framework

MapReduce is a simple programming model for parallel processing of large volumes of data. This data could be anything, but the model is specifically designed to process lists of data. The fundamental concept of MapReduce is to transform lists of input data into lists of output data. It often happens that the input data is not in a readable format, and understanding large input datasets can be a difficult task. In that case, we need a model that can mold input data lists into readable, understandable output lists. MapReduce does this conversion twice, for the two major tasks, Map and Reduce, by dividing the whole workload into a number of tasks and distributing them over the different machines in the Hadoop cluster. As we know, the logs in the log files are also in the form of lists: a log file consists of thousands of records, i.e. logs, which are in text format. Nowadays business servers generate large volumes of log files, of the order of terabytes in a day. From a business perspective, there is a need to process these log files so that we can have appropriate reports of how the business is going.
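Referring back to the HDFS interaction of Section 3.2, the following is a minimal sketch (an illustration, not the authors' implementation) of how a pre-processed log file might be copied into HDFS with the Hadoop FileSystem API, pointing the client at the NameNode via fs.default.name. The NameNode address and the paths used here are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogUploader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000");   // assumed NameNode location
            FileSystem fs = FileSystem.get(conf);
            // Copy the cleaned log file into HDFS, where the MapReduce job will read it.
            fs.copyFromLocalFile(new Path("/tmp/access_log_cleaned.txt"),
                                 new Path("/logs/access_log_cleaned.txt"));
            fs.close();
        }
    }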
For this application, a MapReduce implementation in Hadoop is one of the best solutions. MapReduce is divided into two phases: the Map phase and the Reduce phase.

3.3.1. Map Phase

Figure 4. Architecture of the Map Phase

The input to MapReduce is the log file, and each record in the log file is considered as input to a Map task. The Map function takes a key-value pair as input and produces an intermediate result in terms of key-value pairs. It takes each attribute in the record as a key and maps each value in the record to its key, generating intermediate output as key-value pairs. Map reads each log from the simple text file, breaks each log into a sequence of keys (x1, x2, ..., xn) and emits a value for each key, which is always 1. If a key appears n times among all records, then there will be n key-value pairs (x, 1) among its output.

Map: (x1, v1) → [(x2, v2)]

3.3.2. Grouping

After all the Map tasks have completed successfully, the master controller merges the intermediate result files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-value pairs. That is, for each key x, the input to the Reduce task that handles key x is a pair of the form (x, [v1, v2, ..., vn]), where (x, v1), (x, v2), ..., (x, vn) are all the key-value pairs with key x coming from all the Map tasks.

3.3.3. Reduce Phase

The Reduce task takes a key and its list of associated values as input. It combines the values for the input key by reducing the list of values to a single value, which is the count of occurrences of that key in the log file, thus generating output in the form of a key-value pair (x, sum).

Reduce: (x2, [v2]) → (x3, v3)
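To make the two phases concrete, the following is a minimal sketch (not the authors' code) of a hit-counting Mapper and Reducer in the classic Hadoop API the paper refers to. It assumes, purely for illustration, that the URL is the second '#'-separated field of each log entry; the Mapper emits (URL, 1) pairs and the Reducer sums them into (URL, total hits).

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class HitCount {
        /** Map phase: emit (URL, 1) for every log entry. */
        public static class HitMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            public void map(LongWritable offset, Text logEntry,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                String[] fields = logEntry.toString().split("#");
                if (fields.length > 1) {
                    output.collect(new Text(fields[1]), ONE);   // fields[1] assumed to be the URL
                }
            }
        }

        /** Reduce phase: sum the 1s for each URL to obtain its total number of hits. */
        public static class HitReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text url, Iterator<IntWritable> counts,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (counts.hasNext()) {
                    sum += counts.next().get();
                }
                output.collect(url, new IntWritable(sum));
            }
        }
    }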
Figure 5. Architecture of the Reduce Phase

4. IN-DEPTH MAPREDUCE DATA FLOW

4.1. Input Data

Input to MapReduce comes from HDFS, where the log files are stored on the processing cluster. By dividing log files into small blocks we can distribute them over the nodes of the Hadoop cluster. The format of the input files to MapReduce is arbitrary, but it is line-based for log files, as each line is considered as one record, i.e. one log. InputFormat is a class that defines how input files are split and how they are read. FileInputFormat is the abstract class from which all other InputFormats that operate on files inherit functionality. When a MapReduce job starts, FileInputFormat reads the log files from the input directory and splits them into chunks. By calling the setInputFormat() method on the JobConf object we can set any InputFormat provided by Hadoop. TextInputFormat treats each line in the input file as one record and is best suited for unformatted data; for line-based log files, TextInputFormat is the most useful. While implementing MapReduce, the whole job is broken up into pieces that operate over blocks of the input files. These pieces are Map tasks, which operate on input blocks or on a whole log file. By default a file is broken up into blocks of size 64 MB, but we can choose the split size as per our need by setting the mapred.min.split.size parameter. So for large log files it is helpful to improve performance by parallelizing the Map tasks. Because tasks can be scheduled on the different nodes of the cluster where the log file blocks actually reside, it is possible to process log files locally.
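The following is a small sketch (an assumption, not taken from the paper) of how the input side described above might be configured with the classic JobConf API: TextInputFormat for line-based logs, a hypothetical HDFS input directory, and the mapred.min.split.size parameter. The HitMapper and HitReducer classes refer to the illustrative sketch given earlier.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class HitCountJob {
        public static JobConf configureInput() {
            JobConf conf = new JobConf(HitCountJob.class);
            conf.setJobName("hmr-hit-count");
            // Treat every line of the log file as one record (Section 4.1).
            conf.setInputFormat(TextInputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path("/logs"));   // hypothetical input directory
            // Ask for splits of at least 64 MB so small fragments are not processed separately.
            conf.setLong("mapred.min.split.size", 64 * 1024 * 1024L);
            conf.setMapperClass(HitCount.HitMapper.class);
            conf.setReducerClass(HitCount.HitReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            return conf;
        }
    }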
Figure 6. MapReduce Workflow

4.2. RecordReader

RecordReader is the class that defines how to access the logs in the log files and converts them into (key, value) pairs readable by the Mapper. TextInputFormat provides LineRecordReader, which treats each line as a new value. The RecordReader is invoked repeatedly until the whole block has been consumed, and the Map() method of the Mapper is called each time the RecordReader is invoked.

4.3. Mapper

The Mapper performs the Map phase of the MapReduce job. The Map() method emits intermediate output in the form of (key, value) pairs. For each Map task, a new instance of the Mapper is instantiated in a separate Java process. The Map() method receives four parameters: key, value, OutputCollector and Reporter. The collect() method of the OutputCollector object forwards intermediate (key, value) pairs to the Reducer as input for the Reduce phase. The Map task provides information about its progress to the Reporter object, thus providing information about the current task such as its InputSplit, and it also emits status back to the user.

4.4. Shuffle and Sort

After the first Map tasks have completed, the nodes start exchanging intermediate output from the Map tasks to the destined Reduce tasks. This process of moving intermediate output from the Map tasks to the Reduce tasks as input is called shuffling, and it is the only communication step in MapReduce. Before presenting these (key, value) pairs as input to the Reducers, Hadoop groups all the values for the same key, no matter where they come from. The partitioned data is then assigned to Reduce tasks.
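As a sketch of how intermediate keys are assigned to Reduce tasks during the shuffle step described above, Hadoop's default behaviour is hash partitioning; the class below (illustrative only, not part of the HMR system) reproduces that idea with the classic Partitioner interface.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    /** Assigns each intermediate key to a Reduce task by hashing it, which is
     *  essentially what Hadoop's default HashPartitioner does. */
    public class HashingLogPartitioner implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) {
            // No configuration is needed for plain hash partitioning.
        }

        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask the sign bit so the partition index is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }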
4.5. Reducer

The Reducer instance calls the Reduce() method for each key in the partition assigned to that Reducer. It receives a key and an iterator over all the values associated with that key. The Reduce() method also has parameters: the key, the iterator over all values, an OutputCollector and a Reporter, which work in a similar manner as for the Map() method.

4.6. Output Data

MapReduce provides (key, value) pairs to the OutputCollector to write them to the output files. OutputFormat defines how these pairs are written to the output files. TextOutputFormat works similarly to TextInputFormat: it writes each (key, value) pair on a separate line of the output file, in "Key Value" format. Hadoop's OutputFormat writes the files mainly to HDFS.

5. EXPERIMENTAL SETUP

We present here sample results of our proposed work. For a sample test we have taken log files of a banking application server. These log files contain fields like IP address, URL, date, age, hit, city, state and country. We have installed Hadoop on two machines, with three instances of Hadoop on each machine running Ubuntu 10.10 or a later version. We have also installed Pig for query processing. Log files are distributed evenly over these nodes. We have created a web application through which the user can interact with the system: distribute log files on the Hadoop cluster, run the MapReduce job on these files and get the analyzed results in graphical formats like bar charts and pie charts. The analyzed results show total hits for the web application, hits per page, hits per city, hits for each page from each city, hits for each quarter of the year, hits for each page during the whole year, etc. Figure 7 gives an idea of the total hits from each city, and Figure 8 shows the total hits per page.

Figure 7. Total hits from each city
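Completing the illustrative driver sketched earlier (again an assumption, not the authors' code), the output side described in Section 4.6 and the actual job submission might look as follows; the output directory is hypothetical, and the files written into it are plain "Key Value" text lines produced by TextOutputFormat.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class HitCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = HitCountJob.configureInput();   // input side from the earlier sketch
            // Write each (URL, total hits) pair as a "Key Value" text line in HDFS.
            conf.setOutputFormat(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(conf, new Path("/logs/output"));  // hypothetical output path
            JobClient.runJob(conf);                        // submit the job and wait for completion
        }
    }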
Figure 8. Total hits per page

6. CONCLUSIONS

In order to have summarized results for a particular web application, we need to do log analysis, which helps to improve business strategies as well as to generate statistical reports. The HMR log file analysis tool provides graphical reports showing hits for web pages, users' activity, which parts of the website users are interested in, traffic sources, etc. From these reports, business communities can evaluate which parts of the website need to be improved, who the potential customers are, from which geographical region the website is getting the maximum hits, etc., which will help in designing future marketing plans. Log analysis can be done by various methods, but what matters is response time. The Hadoop MapReduce framework provides parallel distributed processing and reliable data storage for large volumes of log files. Firstly, data is stored in a hierarchy on several nodes in a cluster so that access time is reduced, which saves much of the processing time; here, Hadoop's characteristic of moving computation to the data rather than moving data to the computation helps to improve response time. Secondly, MapReduce works successfully for large datasets, giving efficient results.

REFERENCES

[1] S. Sathya, Prof. M. Victor Jose, (2011) "Application of Hadoop MapReduce Technique to Virtual Database System Design", International Conference on Emerging Trends in Electrical and Computer Technology (ICETECT), pp. 892-896.
[2] Yulai Yuan, Yongwei Wu, Xiao Feng, Jing Li, Guangwen Yang, Weimin Zheng, (2010) "VDB-MR: MapReduce-based distributed data integration using virtual database", Future Generation Computer Systems, vol. 26, pp. 1418-1425.
[3] Jeffrey Dean and Sanjay Ghemawat, (2004) "MapReduce: Simplified Data Processing on Large Clusters", Google Research Publication.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, (2010) "The Hadoop Distributed File System", Mass Storage Systems and Technologies (MSST), Sunnyvale, California, USA, vol. 10, pp. 1-10.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, (2008) "Pig Latin: a not-so-foreign language for data processing", ACM SIGMOD International Conference on Management of Data, pp. 1099-1110.
[6] Tom White, (2009) "Hadoop: The Definitive Guide", O'Reilly, Sebastopol, California.
[7] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, (2008) "Improving MapReduce performance in heterogeneous environments", OSDI '08: 8th USENIX Symposium on Operating Systems Design and Implementation.
[8] Yogesh Pingle, Vaibhav Kohli, Shruti Kamat, Nimesh Poladia, (2012) "Big Data Processing using Apache Hadoop in Cloud System", National Conference on Emerging Trends in Engineering & Technology.
[9] Cooley R., Srivastava J., Mobasher B., (1997) "Web mining: information and pattern discovery on the World Wide Web", IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567.
[10] Liu Zhijing, Wang Bin, (2003) "Web mining research", International Conference on Computational Intelligence and Multimedia Applications, pp. 84-89.
[11] Yang, Q. and Zhang, H., (2003) "Web-Log Mining for Predictive Caching", IEEE Trans. Knowledge and Data Eng., 15(4), pp. 1050-1053.
[12] P. Nithya, Dr. P. Sumathi, (2012) "A Survey on Web Usage Mining: Theory and Applications", International Journal of Computer Technology and Applications, Vol. 3, pp. 1625-1629.
[13] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, (2009) "A Comparison of Approaches to Large-Scale Data Analysis", ACM SIGMOD '09.
[14] Gates et al., (2009) "Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience", VLDB 2009, Section 4.
[15] Li Jing-min, He Guo-hui, (2010) "Research of Distributed Database System Based on Hadoop", IEEE International Conference on Information Science and Engineering (ICISE), pp. 1417-1420.
[16] T. Hoff, (2008) "How Rackspace Now Uses MapReduce and Hadoop To Query Terabytes of Data".
[17] Apache Hadoop, http://Hadoop.apache.org

Authors

Sayalee Narkhede is an ME student in the IT department at MIT-Pune under the University of Pune. She received her BE degree in Computer Engineering from AISSMS IOIT under the University of Pune. Her research interests include Distributed Systems and Cloud Computing.

Tripti Baraskar is an assistant professor in the IT department of MIT-Pune. She received her ME degree in Communication Controls and Networking from MIT Gwalior and her BE degree from the Oriental Institute of Science and Technology under RGTV. Her research interests include Image Processing.