IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 14, Issue 6 (Sep. - Oct. 2013), PP 89-93
www.iosrjournals.org
Paralyzing Bioinformatics Applications Using Conducive Hadoop
Cluster
Bincy P Andrews1, Binu A2
1(Rajagiri School of Engineering and Technology, Kochi)
2(Rajagiri School of Engineering and Technology, Kochi)
Abstract: Bioinformatics may be defined as the application of computer science to molecular biology in the
form of statistics and analytics. Bioinformatics applications deal with very large amounts of data. Researchers
now face problems analyzing such ultra-large-scale data sets, and these problems will only grow in the coming
years. Moreover, processing, storing and analyzing petabytes of data without much delay is a major challenge.
Most bioinformatics algorithms are sequential, which makes the situation worse and makes data manipulation on
uniprocessor systems impractical. However, most biological problems are parallel in nature, so a practical and
effective approach is to use parallel clusters of workstations. Hadoop can tackle this class of problems with
good performance and scalability, and could form the basis of a parallel computational platform for several
problems in bioinformatics. Normally, Hadoop is deployed over high-performance computing systems, which are
expensive and involve complex deployment scenarios that only big enterprises can manage. Smaller research
organizations, for which cost is an important factor, cannot choose systems with high computational
capabilities for cluster setup. Rocks cluster is a viable solution in such scenarios. Rocks Cluster
Distribution, originally called NPACI Rocks, is a Linux distribution intended for high-performance computing
clusters. This paper implements a cost-effective cluster for parallelizing bioinformatics applications by
deploying Hadoop over a Rocks cluster, and justifies the use of commodity clusters for this purpose. Results
show that parallelizing a bioinformatics application saves considerable time compared to standalone execution
while keeping cost low.
Keywords: Hadoop, Clusters, Rocks Cluster, commodity clusters, cluster environment, big data, bioinformatics
I. INTRODUCTION
Bioinformatics may be defined as the application of computer science to molecular biology in the form
of statistics and analytics. Some of the application areas of bioinformatics include:
• molecular medicine
• personalized medicine
• preventative medicine
• gene therapy
• drug development
• waste cleanup
• climate change studies
• alternative energy sources
• biotechnology
• antibiotic resistance
• forensic analysis of microbes
• bio-weapon creation
• crop improvement
• insect resistance
• veterinary sciences
Most of these applications deal with very large amounts of data. Bioinformatics researchers now face
problems analyzing such ultra-large-scale data sets, and these problems will only grow in the coming years.
Moreover, processing, storing and analyzing petabytes of data without much delay is a major challenge. Most
bioinformatics algorithms are also sequential, which makes the situation worse and makes data manipulation on
uniprocessor systems impractical. However, most biological problems are parallel in nature, so a practical and
effective approach is to use parallel clusters of workstations. The advantages of using parallel clusters
are:
• Save time
• Save money
• Solve larger and more complex problems quickly
• Provide concurrency
• Use non-local resources
Hadoop can tackle this class of problems with good performance and scalability, and could form the basis
of a parallel computational platform for several problems in bioinformatics. The Hadoop platform was designed
to solve problems involving large amounts of data that do not fit neatly into tables because of their complex
and unstructured nature. The Hadoop project and associated software provide a foundation for scaling to
petabyte-scale data warehouses on clusters, providing fault-tolerant parallelized analysis of such data using
a programming style named MapReduce. Hadoop is therefore a promising solution for bioinformatics applications.
Normally, Hadoop is deployed over high-performance computing systems, which are so expensive that only big
enterprises can afford them. Moreover, such deployments are complex, making them infeasible for smaller
organizations to handle. In normal scenarios a cluster setup is strongly influenced by the following factors:
• Cost: systems with high computational power, of the kind normally used to set up a cluster, are very
expensive.
• Power: systems with high computational capability also have high power requirements.
• Temperature: an air-conditioned environment is usually required to maintain a cluster.
These factors pull against each other. If we choose systems with high computational capabilities, cost rises
and temperature requirements must also be taken into consideration. Smaller research organizations, for which
cost is an important factor, therefore cannot choose such systems for cluster setup. This is where Rocks
cluster comes in. With Rocks it is not necessary to satisfy each of these factors: setting up Rocks does not
require high-performance systems, so cost falls considerably, and temperature is not a major constraint. Rocks
Cluster Distribution, originally called NPACI Rocks, is a Linux distribution intended for high-performance
computing clusters. Rocks was initially based on the Red Hat Linux distribution; modern versions are based on
CentOS, with an installer that simplifies mass installation onto many computers. Rocks includes many tools,
such as MPI, that are not part of CentOS but are integral components that make a group of computers into a
cluster. The most important feature of Rocks is that it does not need high-performance computing nodes for
deployment. Moreover, Rocks provides rolls that can be used for bioinformatics applications; their main
advantage is that these tools are installed automatically along with the operating system and need no further
configuration. Rocks cluster has the following features:
• Easy deployment of a cluster
• Hosts can be added and removed easily
• Rocks automatically edits most of the files used to identify its hosts
• Password-less SSH
• No additional login to compute nodes is needed, owing to password-less SSH
• A rich set of rolls that are configured automatically during Rocks cluster installation
This paper implements a cost-effective cluster for parallelizing bioinformatics applications by deploying
Hadoop over a Rocks cluster, and justifies the use of commodity clusters for this purpose. Old, low-cost,
low-performance single-core machines were selected as the compute nodes for the cluster. Cost was also a
factor in choosing both the master node and the Cisco switch. Jobs were submitted from existing systems in the
campus network, which act as user nodes, thereby avoiding additional cost or bandwidth. The low-cost compute
nodes had only low power and power-backup requirements. After cluster setup, the bioinformatics tool fasta was
tested in both standalone and parallel environments. Results show that parallelizing a bioinformatics
application saves considerable time compared to standalone execution. The remaining sections are organized as
follows: Section II provides an overview of related work, Section III presents the detailed system design and
implementation, and Section IV evaluates its performance. Finally, Section V concludes the paper.
II. BACKGROUND
Parallelism can be achieved by either of these methods:
• Develop or adapt existing software to distribute and manage parallel jobs, or
• Modify existing applications to make use of libraries that facilitate distributed programming, such as
MPI, OpenMP, RPC and RMI.
The effort required for the second approach is recurrent, since new versions of the original sequential code
may render the parallelized application obsolete; the parallelized versions may lag behind and lack features of
the latest sequential tool. It is also very difficult to incorporate failure-handling mechanisms in such
systems. The first approach is more cost-effective: [9] [10] [11] are a small sample of Grid-based solutions
for bioinformatics applications that automate the process of transferring or sharing large files, selecting
appropriate application binaries for a variety of environments or existing services, managing the creation and
submission of a large number of jobs to be executed in parallel, and recovering from possible failures.
CloudBLAST also parallelizes a bioinformatics application using the MapReduce framework. Most of these works
deal with the execution of a single file at a time. Processing a large number of files at the same time, which
is often necessary when the amount of data is large, is not addressed. Moreover, most of these works
concentrate on parallelizing bioinformatics applications using BLAST. BLAST (Basic Local Alignment Search
Tool) is the most widely used sequence alignment tool. ClustalW, HMMER, FASTA, Glimmer, etc. are similar tools.
FASTA and BLAST have the same goal: to identify statistically significant sequence similarity that can be used
to infer homology. The FASTA programs offer several advantages over BLAST:
• Rigorous algorithms unavailable in BLAST (Table I). Smith-Waterman (ssearch36), global:global
(ggsearch36), and global:local (glsearch36) programs are available, and these programs can be used with
psiblast PSSM profiles.
• Better translated alignments. fastx36, fasty36, tfastx36, and tfasty36 allow frame-shifts in alignments;
frame-shifts are treated like gap penalties, so alignments tend to be longer in error-prone reads.
• Better statistics. BLAST calculates very accurate statistics for protein:protein alignments, but its
model-based strategy is less robust for translated-DNA:protein and DNA:DNA scores. FASTA uses an empirical
estimation strategy, and now provides both search-based and high-scoring shuffle-based statistics (-z 21).
• More flexible library sequence formats. The FASTA programs can read FASTA, NCBI/formatdb, and
several other sequence formats, and can directly query MySQL and Postgres databases. The programs offer
several strategies for specifying subsets of databases.
• A very efficient threaded implementation. The FASTA programs are fully threaded; both similarity scores
and alignments can be calculated in parallel on multi-core hardware. On multi-core machines, FASTA can
be faster than BLAST while producing better alignments with more accurate statistical estimates.
Considering these advantages, this paper adopts FASTA for parallelizing bioinformatics applications. Moreover,
most parallelization approaches using Hadoop incur high cost and involve complex mechanisms for cluster setup.
To reduce both cost and complexity, a Rocks cluster is used for cluster setup. Rocks is a Linux distribution
that eases cluster deployment.
III. SYSTEM DESIGN AND IMPLEMENTATION
The system components and modules are shown in Fig. 3.1.
3.1 Public network
The public network consists of user nodes used for job submission. Systems already present in the campus
network were chosen as user nodes, thereby avoiding further cost.
3.2 Private network
The private network consists of low-cost, low-performance single-core systems, which are very cheap. They are
connected to the frontend node by means of an Ethernet switch. Once a job is submitted by the master (frontend)
node to these compute nodes, they process it and generate results.
3.3 Frontend node
It is a high-performance multiprocessor system that accepts job requests from user nodes and assigns them to
compute nodes for processing. It also hosts several tools for cluster monitoring (Ganglia), bioinformatics
applications, resource management, etc.
3.4 Ethernet switch
It is a Cisco 300 Series 28-port Ethernet switch. It confines traffic to the private network formed by the
frontend node and compute nodes. Compute nodes in the cluster are identified by means of their DHCP
requests.
Figure 3.1: system components and modules
3.5 Hadoop Distributed File system
The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. It is a
distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides
high-throughput access to application data and is suitable for applications with large data sets. Both input
and output data are stored in HDFS.
3.6 Bioinformatics module
This is the most important module. To parallelize fasta, the MapReduce program needs the following:
• fasta input file handler
• Mapper
• Main program
Since the intermediate map output of fasta needs no further handling, a Reducer is not necessary. The input
to the MapReduce program is a set of fasta files. Initially all input files are uploaded to HDFS. By default,
fasta programs such as fasta36 and ssearch36 are installed automatically during Rocks installation. Since they
are present on both the frontend node and the compute nodes, they need not be uploaded to HDFS. The input file
handler collects all uploaded input files from HDFS to create the key-value pairs that are passed to the map
function. The Mapper mainly creates a Java process to call the fasta program. The map key is the file name, and
the map value is the full HDFS path of the uploaded input file. Each map task downloads its assigned input file
from HDFS and passes it to the fasta program. Finally, the output file is created and uploaded to HDFS.
The entire process is depicted in Figure 3.2. There are n files, F1, F2, …, Fn, each of which contains
n fasta-formatted sequences, s = {s1, s2, …, sn}. These n files are assigned to n workers. The workers follow
the MapReduce paradigm, process their input files and write the corresponding results back to HDFS. Output
files are represented by O1, O2, …, On.
Figure 3.2: processing of MapReduce task
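The map-only flow of Section 3.6 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's Java implementation and not the Hadoop API: the function names (make_pairs, map_task, run_job), the ".out" output naming and the thread pool standing in for the cluster workers are all hypothetical. Access to HDFS and the call to the fasta program are passed in as callables, so the sketch runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def make_pairs(hdfs_paths: List[str]) -> List[Tuple[str, str]]:
    """Input file handler: one (file name, full HDFS path) pair per uploaded file."""
    return [(path.rsplit("/", 1)[-1], path) for path in hdfs_paths]


def map_task(key: str, value: str,
             fetch: Callable[[str], str],
             run_tool: Callable[[str], str],
             store: Callable[[str, str], None]) -> str:
    """Map-only task: download one input, run the tool on it, upload the result."""
    sequences = fetch(value)       # in a real job: copy the file out of HDFS
    result = run_tool(sequences)   # in a real job: spawn the fasta program
    out_name = key + ".out"        # hypothetical output naming scheme
    store(out_name, result)        # in a real job: copy the result back to HDFS
    return out_name


def run_job(hdfs_paths: List[str],
            fetch: Callable[[str], str],
            run_tool: Callable[[str], str],
            store: Callable[[str, str], None],
            workers: int = 4) -> List[str]:
    """Main program: hand each pair to a worker; there is no reduce phase."""
    pairs = make_pairs(hdfs_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(map_task, k, v, fetch, run_tool, store)
                   for k, v in pairs]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # In-memory stand-ins for HDFS and for the fasta program (assumptions).
    fake_hdfs: Dict[str, str] = {
        "/input/F1.fasta": ">s1\nACGT\n>s2\nGGCC\n",
        "/input/F2.fasta": ">s1\nTTAA\n",
    }
    outputs: Dict[str, str] = {}

    def count_records(text: str) -> str:
        return "%d sequences" % text.count(">")

    names = run_job(sorted(fake_hdfs), fake_hdfs.__getitem__,
                    count_records, outputs.__setitem__)
    print(names)
    print(outputs)
```

Because each file is processed independently and its result is written straight back to storage, nothing has to be merged afterwards, which is why the design needs no Reducer.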
IV. PERFORMANCE EVALUATION
The evaluation of system performance was done by system benchmarking and comparing communication
overheads.
4.1 Experimental Setup
The experimental setup consists of a master node connected to twelve compute nodes by means of a Cisco SG
300 switch. Rocks cluster version 6.1 (Emerald Boa) was installed on the master node with the bioinformatics
roll. The MAC addresses of all the compute nodes were detected using the Rocks command "insert-ethers". The
compute nodes are assigned private IP addresses automatically by Rocks, depending on the frontend node's
private IP range. The MAC-to-IP address mapping is also done by Rocks, and the corresponding details are stored
in its internal database. All identified compute nodes are then installed with the Rocks kernel via PXE boot.
Upon completion of cluster setup, the Hadoop file system is deployed. A test was done to compare the time
required to run the bioinformatics tool in parallel and in a standalone environment. The input set is a folder
of fasta-formatted files of 68 KB each. The time taken for execution in the parallel environment was measured
while varying the number of input files, and the results were tabulated. The performance analysis tests were
conducted using the UnixBench and NetPIPE tools.
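The MAC-to-IP bookkeeping described in this section can be pictured with a small Python sketch. It is only an illustration: the function name assign_addresses, the ascending allocation order and the single address reserved for the frontend are assumptions, and Rocks itself may allocate addresses differently. The "compute-0-N" host names follow the usual Rocks naming for compute nodes.

```python
import ipaddress
from itertools import islice
from typing import Dict, List


def assign_addresses(macs: List[str], private_net: str,
                     reserved: int = 1) -> Dict[str, Dict[str, str]]:
    """Map each detected MAC address to a host name and a private IP.

    Assumption: addresses are handed out in ascending order after skipping
    `reserved` host addresses kept for the frontend node.
    """
    network = ipaddress.ip_network(private_net)
    free = list(islice(network.hosts(), reserved, reserved + len(macs)))
    if len(free) < len(macs):
        raise ValueError("private range too small for the detected nodes")
    table: Dict[str, Dict[str, str]] = {}
    for rank, (mac, ip) in enumerate(zip(macs, free)):
        table[mac.lower()] = {"name": f"compute-0-{rank}", "ip": str(ip)}
    return table


if __name__ == "__main__":
    # Made-up MAC addresses and private range, for illustration only.
    detected = ["00:1A:2B:00:00:01", "00:1A:2B:00:00:02"]
    for mac, entry in assign_addresses(detected, "10.1.0.0/16").items():
        print(mac, entry["name"], entry["ip"])
```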
4.2 Experiment Results
Table 4.1 shows the time taken to execute fasta in sequential and in parallel mode. The time taken in the
parallel environment is much less than for sequential execution of fasta. The number of input files was varied
for each execution in the parallel environment.
No. of files    Time taken in sequential mode    Time taken in cluster environment
2               1 min 14 s                       26 s
4               2 min 28 s                       42 s
8               4 min 56 s                       59 s
16              9 min 12 s                       1 min 40 s
32              18 min 24 s                      3 min 20 s
Table 4.1: Time taken
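To make the gain in Table 4.1 easier to read, the speedup (sequential time divided by cluster time) can be computed for each row. The short Python sketch below does this with the timings converted to seconds; the names timings and speedup are introduced here for illustration only.

```python
# Timings transcribed from Table 4.1, converted to seconds:
# number of files -> (sequential time, cluster time)
timings = {
    2: (74, 26),
    4: (148, 42),
    8: (296, 59),
    16: (552, 100),
    32: (1104, 200),
}


def speedup(sequential_s: float, cluster_s: float) -> float:
    """Ratio of sequential execution time to cluster execution time."""
    return sequential_s / cluster_s


if __name__ == "__main__":
    for files, (seq_s, par_s) in timings.items():
        print(f"{files:>2} files: speedup {speedup(seq_s, par_s):.2f}x")
```

On these figures the speedup rises from about 2.8 for 2 files to about 5.5 for 16 and 32 files.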
Thus we can conclude that parallelizing bioinformatics applications is necessary and that the proposed system
is a viable solution. The proposed system also requires only minimal cluster setup cost and minimal energy
consumption, and imposes no special temperature requirement. Thus the proposed system meets its objective
efficiently.
V. CONCLUSION
This paper proposes the implementation of a Hadoop cluster for parallelizing bioinformatics applications and
analyzes its performance in terms of cost-effectiveness. The implementation uses low-cost commodity machines,
which reduces cost to a large extent. Most existing approaches use high-performance systems for cluster setup.
However, high-performance machines are costly, and using them for cluster setup is not feasible in a small
university or research organization. Weighing performance versus cost-effectiveness, the proposed commodity
cluster model for parallelizing bioinformatics applications is a suitable approach for a small research
organization or university.
Acknowledgements
We are greatly indebted to the college management and the faculty members for providing necessary
facilities and hardware along with timely guidance and suggestions for implementing this work.
REFERENCES
[1] Rocks Cluster Installation Memo, Hiroyuki Mishima, DDS, Ph.D., Research Fellow, and Jun Ni, Ph.D., Associate Professor, Medical
Imaging HPC & Informatics Lab, Department of Radiology, College of Medicine, University of Iowa, Iowa City, IA 52242, USA
[2] Rocks Cluster 6.1 (Emerald Boa) User Guide, e-book
[3] Hadoop in Practice, e-book
[4] http://www.hadoop.apache.org
[5] http://www.rocksclusters.org/
[6] http://www.ibm.com/developerworks/library/l-hadoop-1/
[7] http://ankitasblogger.blogspot.in/2011/01/hadoop-cluster-setup.html
[8] http://icanhadoop.blogspot.in/2012/09/configuring-hadoop-is-very-if-you-just.html
[9] Arun Krishnan, “GridBLAST: a Globus-based high-throughput implementation of BLAST in a Grid computing framework,”
Concurrency and Computation, v.17, Issue 13, pp. 1607-1623, 2005, doi:10.1002/cpe.v17:13.
[10] H. Stockinger, M. Pagni, L. Cerutti, L. Falquet, “Grid Approach to Proc of the 2nd IEEE Intl. Conf. on e-Science and Grid
Computing, 2006, doi:10.1109/E-SCIENCE.2006.70.
[11] J. Andrade, M. Andersen, L. Berglund, and J. Odeberg, “Applications of Grid Computing in Genetics and Proteomics,” LNCS, 2007,
doi:10.1007/978-3-540-75755-9.

More Related Content

What's hot

Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...Frederic Desprez
 
Towards a low cost etl system
Towards a low cost etl systemTowards a low cost etl system
Towards a low cost etl systemijdms
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Ola Spjuth
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in BigdataIRJET Journal
 
grid mining
grid mininggrid mining
grid miningARNOLD
 
Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster
Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster
Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster IJECEIAES
 
Cloud colonography distributed medical testbed over cloud
Cloud colonography distributed medical testbed over cloudCloud colonography distributed medical testbed over cloud
Cloud colonography distributed medical testbed over cloudVenkat Projects
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...
Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...
Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...IJSRP Journal
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...Alexander Decker
 
CCCORE: Cloud Container for Collaborative Research
CCCORE: Cloud Container for Collaborative Research CCCORE: Cloud Container for Collaborative Research
CCCORE: Cloud Container for Collaborative Research IJECEIAES
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine
Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query EngineMeasuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine
Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engineparekhnikunj
 
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"Guy K. Kloss
 

What's hot (20)

Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
 
Towards a low cost etl system
Towards a low cost etl systemTowards a low cost etl system
Towards a low cost etl system
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in Bigdata
 
grid mining
grid mininggrid mining
grid mining
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster
Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster
Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster
 
Cloud colonography distributed medical testbed over cloud
Cloud colonography distributed medical testbed over cloudCloud colonography distributed medical testbed over cloud
Cloud colonography distributed medical testbed over cloud
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...
Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...
Application of Lotka-Volterra model to analyse Cloud behavior and optimise re...
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...
 
CCCORE: Cloud Container for Collaborative Research
CCCORE: Cloud Container for Collaborative Research CCCORE: Cloud Container for Collaborative Research
CCCORE: Cloud Container for Collaborative Research
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science Services
 
Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine
Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query EngineMeasuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine
Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine
 
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
 

Viewers also liked

Isolation Of Salmonella Gallinarum From Poultry Droppings In Jos Metropolis, ...
Isolation Of Salmonella Gallinarum From Poultry Droppings In Jos Metropolis, ...Isolation Of Salmonella Gallinarum From Poultry Droppings In Jos Metropolis, ...
Isolation Of Salmonella Gallinarum From Poultry Droppings In Jos Metropolis, ...IOSR Journals
 
Structural, compositional and electrochemical properties of Aluminium-Silicon...
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster

IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 14, Issue 6 (Sep. - Oct. 2013), PP 89-93
www.iosrjournals.org

Bincy P Andrews (1), Binu A (2)
(1) Rajagiri School of Engineering and Technology, Kochi
(2) Rajagiri School of Engineering and Technology, Kochi

Abstract: Bioinformatics may be defined as the application of computer science to molecular biology in the form of statistics and analytics. Bioinformatics applications deal with bulk amounts of data. Researchers now face problems analyzing such ultra-large-scale data sets, a problem that will only grow in the coming years. Moreover, processing, storing and analyzing these petabytes of data without much delay is a big challenge. Most bioinformatics algorithms are sequential, which makes the situation worse. This implies that data manipulation on uniprocessor systems is impractical. However, most biological problems are parallel in nature. Hence a practical and effective approach is to use parallel clusters of workstations. Hadoop can be used to tackle this class of problems with good performance and scalability. This technology could be the basis of a parallel computational platform for several problems in the context of bioinformatics applications. Normally, Hadoop is deployed on high-performance computing systems, which are expensive and involve complex deployment scenarios that only big enterprises can manage. Smaller research organizations, for which cost is an important factor, therefore cannot choose systems with high computational capabilities for cluster setup. A Rocks cluster is a viable solution in such scenarios. Rocks Cluster Distribution, originally called NPACI Rocks, is a Linux distribution intended for high-performance computing clusters. This paper implements a cost-effective cluster for parallelizing bioinformatics applications by deploying Hadoop over a Rocks cluster, and argues for the use of commodity clusters for parallelizing bioinformatics applications by providing the necessary justifications. Results show that parallelizing a bioinformatics application saves much time compared to stand-alone execution, under optimal cost considerations.

Keywords: Hadoop, Clusters, Rocks Cluster, commodity clusters, cluster environment, big data, bioinformatics

I. INTRODUCTION

Bioinformatics may be defined as the application of computer science to molecular biology in the form of statistics and analytics. Some of the application areas of bioinformatics include:
• molecular medicine
• personalized medicine
• preventative medicine
• gene therapy
• drug development
• waste cleanup
• climate change studies
• alternative energy sources
• biotechnology
• antibiotic resistance
• forensic analysis of microbes
• bio-weapon creation
• crop improvement
• insect resistance
• veterinary sciences

Most of these applications deal with bulk amounts of data. Bioinformatics researchers now face problems analyzing such ultra-large-scale data sets, a problem that will only grow in the coming years. Moreover, processing, storing and analyzing these petabytes of data without much delay is a big challenge. In addition, most bioinformatics algorithms are sequential, which makes the situation worse. This implies that data manipulation on uniprocessor systems is impractical. However, most biological problems are parallel in nature. Hence a practical and effective approach is to use parallel clusters of workstations. The advantages of using parallel clusters are:
• Save time
• Save money
• Solve larger and more complex problems quickly
• Provide concurrency
• Use of non-local resources

Hadoop can be used to tackle this class of problems with good performance and scalability. This technology could be the basis of a parallel computational platform for several problems in the context of bioinformatics applications. The Hadoop platform was designed to solve problems where there is a lot of data that does not fit nicely into tables because of its complex and unstructured nature. The Hadoop project and its associated software provide a foundation for scaling to petabyte-scale data warehouses on clusters, providing fault-tolerant parallelized analysis of such data using a programming style named MapReduce. Thus Hadoop is a promising solution for bioinformatics applications.

Normally, Hadoop is deployed on high-performance computing systems, which are so expensive that only big enterprises can afford them. Moreover, such deployment scenarios are complex, which makes them infeasible for smaller organizations to handle. In normal scenarios a cluster setup is strongly influenced by the following factors:
• Cost: systems with enough computational power to set up a cluster are normally very expensive.
• Power: systems with high computational capabilities also have high power requirements.
• Temperature: an air-conditioned environment is usually required to maintain a cluster.

These factors pull against one another. If we choose systems with high computational capabilities, the cost rises and the temperature requirement also has to be taken into consideration. Smaller research organizations, for which cost is an important factor, therefore cannot choose systems with high computational capabilities for cluster setup. This is where the Rocks cluster comes in.
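As a rough illustration of the MapReduce programming style mentioned above, the following Python sketch counts nucleotide occurrences across FASTA records within a single process. It is not Hadoop code and is not taken from the paper: the function names (parse_fasta, map_record, shuffle, reduce_counts, count_bases) and the sample data are hypothetical, chosen only to show the map, group and reduce phases. On a real cluster the framework would run many map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def map_record(record):
    """Map step: emit a (base, 1) pair for every base in one sequence."""
    _header, sequence = record
    return [(base, 1) for base in sequence.upper()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_bases(fasta_text):
    """Run the map, shuffle and reduce phases sequentially (illustration only)."""
    emitted = []
    for record in parse_fasta(fasta_text):
        emitted.extend(map_record(record))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())


# Hypothetical sample input: two short DNA records.
SAMPLE = ">seq1\nACGT\nAC\n>seq2\nGGTA\n"
result = count_bases(SAMPLE)
print(result)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is the property that makes this style attractive for large sequence collections.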
With Rocks it is not necessary to satisfy each of these factors. Setting up Rocks does not require high-performance systems, so the cost is reduced considerably. Temperature is also not a major constraint. Rocks Cluster Distribution, originally called NPACI Rocks, is a Linux distribution intended for high-performance computing clusters. Rocks was initially based on the Red Hat Linux distribution; modern versions of Rocks are based on CentOS, with an installation process that simplifies mass installation onto many computers. Rocks includes many tools, such as MPI, that are not part of CentOS but are integral components for turning a group of computers into a cluster. The most important feature of Rocks is that it does not need high-performance computing nodes for deployment. Moreover, Rocks contains rolls that can be used for bioinformatics applications. Their main advantage is that all these tools are installed automatically during installation of the operating system, and no further configuration of the tools is necessary. A Rocks cluster has the following features:
• Easy deployment of the cluster
• Hosts can easily be added and removed
• Rocks automatically edits most of the files that are used to identify its hosts
• Password-less SSH
• No additional login step is needed on compute nodes, thanks to password-less SSH
• A rich set of rolls that are configured automatically during Rocks cluster installation

This paper implements a cost-effective cluster for parallelizing bioinformatics applications by deploying Hadoop over a Rocks cluster, and argues for the use of commodity clusters for parallelizing bioinformatics applications by providing the necessary justifications. Old, low-cost, low-performance, single-core machines were selected as the compute nodes for the cluster. Cost considerations also governed the choice of the master node and the Cisco switch. Jobs were submitted from existing systems on the campus network, which act as user nodes, thereby avoiding additional cost or bandwidth.
The low-cost compute nodes have only low power and power-backup requirements. After cluster setup, the bioinformatics tool fasta was tested in both standalone and parallel environments. The results show that parallelizing a bioinformatics application saves considerable time compared with standalone execution. The remaining sections are organized as follows: Section II provides an overview of related work, Section III presents the detailed system design and implementation, and Section IV evaluates its performance. Finally, Section V concludes the paper.

II. BACKGROUND
Parallelism can be achieved by either of these methods:
• Develop or adapt existing software to distribute and manage parallel jobs, or
• Modify existing applications to make use of libraries that facilitate distributed programming, such as MPI, OpenMP, RPC and RMI.
The effort required for the second approach is recurrent, since new versions of the original sequential code may render the parallelized application obsolete. The parallelized versions may therefore lag behind and lack features of the latest sequential release. It is also very difficult to incorporate failure-handling mechanisms in such systems. The first approach is cost effective compared to the other: [9] [10] [11] are a small sample of Grid-based solutions for bioinformatics
applications that automate the process of transferring or sharing large files, selecting appropriate application binaries for a variety of environments or existing services, managing the creation and submission of a large number of jobs to be executed in parallel, and recovering from possible failures. CloudBLAST also parallelizes bioinformatics applications using the MapReduce framework. Most of these works deal with the execution of a single file at a time. Processing a large number of files at the same time is not addressed, although it is often necessary when the amount of data is large. Moreover, most of these works concentrate on parallelizing BLAST. BLAST (Basic Local Alignment Search Tool) is the most widely used sequence alignment tool. ClustalW, HMMER, FASTA, Glimmer etc. are similar tools. FASTA and BLAST have the same goal: to identify statistically significant sequence similarity that can be used to infer homology. The FASTA programs offer several advantages over BLAST:
• Rigorous algorithms unavailable in BLAST (Table I). Smith-Waterman (ssearch36), global:global (ggsearch36), and global:local (glsearch36) programs are available, and these programs can be used with psiblast PSSM profiles.
• Better translated alignments. fastx36, fasty36, tfastx36, and tfasty36 allow frame-shifts in alignments; frame-shifts are treated like gap penalties, so alignments tend to be longer in error-prone reads.
• Better statistics. BLAST calculates very accurate statistics for protein:protein alignments, but its model-based strategy is less robust for translated-DNA:protein and DNA:DNA scores. FASTA uses an empirical estimation strategy, and now provides both search-based and high-scoring shuffle-based statistics (-z 21).
• More flexible library sequence formats.
  The FASTA programs can read FASTA, NCBI/formatdb, and several other sequence formats, and can directly query MySQL and Postgres databases. The programs offer several strategies for specifying subsets of databases.
• A very efficient threaded implementation. The FASTA programs are fully threaded; both similarity scores and alignments can be calculated in parallel on multi-core hardware. On multi-core machines, FASTA can be faster than BLAST while producing better alignments with more accurate statistical estimates.
Considering these advantages, this paper adopts FASTA as the application to parallelize. Moreover, most approaches to parallelization using Hadoop incur high cost and involve complex mechanisms for cluster setup. To reduce both cost and complexity, a Rocks cluster is used for cluster setup. Rocks is a Linux distribution that simplifies cluster deployment.

III. SYSTEM DESIGN AND IMPLEMENTATION
The system components and modules are shown in Figure 3.1.

3.1 Public network
The public network consists of user nodes used for job submission. Systems already present in the campus network were chosen as user nodes, thereby avoiding further cost.

3.2 Private network
The private network consists of inexpensive, low-performance, single-core systems. They are connected to the frontend node through an Ethernet switch. Once the master (frontend) node submits a job to these compute nodes, they process it and generate results.

3.3 Frontend node
The frontend node is a high-performance multiprocessor system that accepts job requests from user nodes and assigns them to compute nodes for processing. It also hosts several tools for cluster monitoring (Ganglia), bioinformatics applications, resource management etc.

3.4 Ethernet switch
The switch is a 300 series 28-port Cisco Ethernet switch. It confines traffic to the private network formed by the frontend node and compute nodes.
The identification of compute nodes in the cluster is accomplished by means of DHCP requests.
Figure 3.1: System components and modules

3.5 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. It is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. Both input and output data are stored in HDFS.

3.6 Bioinformatics module
This is the most important module. To parallelize fasta, the MapReduce program needs the following:
• fasta input file handler
• Mapper
• Main program
fasta does not need a Reducer to handle the intermediate map output, so no Reducer is used. The input to the MapReduce program is a set of fasta files, all of which are first uploaded to HDFS. The fasta programs, such as fasta36 and ssearch36, are installed automatically during Rocks installation. Because they are present on the frontend node as well as the compute nodes, they need not be uploaded to HDFS. The input file handler collects all uploaded input files from HDFS and creates the key-value pairs that are passed to the map function. The Mapper mainly creates a Java process to call the fasta program. The Map key is the filename, and the Map value is the full HDFS path of the uploaded input file. Each map task downloads its assigned input file from HDFS and passes it to the fasta program. Finally, the output file is created and uploaded to HDFS. The entire process is depicted in Figure 3.2. There are n files, F1, F2, …, FN, each of which contains n fasta-formatted sequences, s = {s1, s2, …, sn}. These n files, each containing n sequences, are assigned to n workers. The workers follow the MapReduce paradigm, process each input file, and write the corresponding results back to HDFS.
The output files are represented by O1, O2, …, ON.
Figure 3.2: Processing of a MapReduce task
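The map-only flow of Section 3.6 can be sketched as follows. This is a minimal illustration in Python, not the authors' Java implementation. The function names, the library path `/share/db/library.fasta`, and the HDFS directories are hypothetical. The sketch only builds the commands that one map task would run (`hadoop fs -get`, the fasta program in quiet mode, `hadoop fs -put`) and does not execute them.

```python
import posixpath
from typing import List, Tuple


def make_pairs(hdfs_paths: List[str]) -> List[Tuple[str, str]]:
    """Input handler: emit one (filename, full HDFS path) pair per uploaded file."""
    return [(posixpath.basename(p), p) for p in hdfs_paths]


def map_task(key: str, value: str, library: str, out_dir: str,
             program: str = "fasta36") -> List[List[str]]:
    """Return the three steps of one map task as argument lists.

    key     -- input filename (the Map key)
    value   -- full HDFS path of the input file (the Map value)
    library -- sequence library on the node's local disk (hypothetical path)
    out_dir -- HDFS directory that receives the results (hypothetical path)
    """
    local_out = key + ".out"
    return [
        # 1. Download the assigned input file from HDFS.
        ["hadoop", "fs", "-get", value, key],
        # 2. Run the locally installed fasta program without prompts (-q);
        #    the caller is assumed to redirect standard output to local_out.
        [program, "-q", key, library],
        # 3. Upload the result file back to HDFS. No reduce step follows.
        ["hadoop", "fs", "-put", local_out, posixpath.join(out_dir, local_out)],
    ]


if __name__ == "__main__":
    uploads = ["/user/hadoop/input/F1.fasta", "/user/hadoop/input/F2.fasta"]
    for name, path in make_pairs(uploads):
        for step in map_task(name, path, "/share/db/library.fasta",
                             "/user/hadoop/output"):
            print(" ".join(step))
```

Because every input file is processed independently, each pair can be handed to a different worker, which is why the job needs no Reducer.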
IV. PERFORMANCE EVALUATION
System performance was evaluated by system benchmarking and by comparing communication overheads.

4.1 Experimental Setup
The experimental setup consists of a master node connected to twelve compute nodes through a Cisco SG 300 switch. Rocks cluster version 6.1 (Emerald Boa) was installed on the master node with the bioinformatics roll. The MAC addresses of all compute nodes were detected using the Rocks command “insert-ethers”. Rocks automatically assigns private IP addresses to the compute nodes based on the frontend node's private IP range. Rocks also maintains the MAC address to IP address mapping and stores the details in its internal database. All identified compute nodes are then installed with the Rocks kernel via PXE boot. Once cluster setup is complete, the Hadoop file system is deployed. A test was run to compare the time required to run the bioinformatics tool in the parallel and standalone environments. The input set is a folder of fasta-formatted files of 68 KB each. Execution time in the parallel environment was measured while varying the number of input files, and the results were tabulated. The performance analysis tests were conducted using the UnixBench and NetPIPE tools.

4.2 Experiment Results
Table 4.1 shows the time taken to execute fasta in sequential and in parallel mode. Execution in the parallel environment takes much less time than sequential execution of fasta. The number of input files was varied for each execution in the parallel environment.
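The timings reported in Table 4.1 can be turned into speedup figures (sequential time divided by cluster time). The short sketch below does this arithmetic in Python. The speedup values are derived here from the table and are not reported in the paper, and the names `TIMINGS` and `speedup` are illustrative.

```python
from typing import Dict, Tuple

# Timings from Table 4.1 converted to seconds:
# number of files -> (sequential mode, cluster environment)
TIMINGS: Dict[int, Tuple[int, int]] = {
    2: (74, 26),       # 1 min 14 s vs 26 s
    4: (148, 42),      # 2 min 28 s vs 42 s
    8: (296, 59),      # 4 min 56 s vs 59 s
    16: (552, 100),    # 9 min 12 s vs 1 min 40 s
    32: (1104, 200),   # 18 min 24 s vs 3 min 20 s
}


def speedup(sequential_s: float, cluster_s: float) -> float:
    """Speedup = sequential execution time / parallel execution time."""
    return sequential_s / cluster_s


if __name__ == "__main__":
    for files, (seq, par) in sorted(TIMINGS.items()):
        print(f"{files:>2} files: speedup = {speedup(seq, par):.2f}")
```

With these figures the ratio grows from roughly 2.8 for two files to roughly 5.5 for 16 and 32 files on the twelve-node cluster.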
No. of files    Time taken in sequential mode    Time taken in cluster environment
2               1 min 14 s                       26 s
4               2 min 28 s                       42 s
8               4 min 56 s                       59 s
16              9 min 12 s                       1 min 40 s
32              18 min 24 s                      3 min 20 s
Table 4.1: Time taken

Thus we can conclude that parallelizing bioinformatics applications is necessary and that the proposed system is a viable solution. The proposed system also requires only minimal cluster setup cost and minimal energy consumption, with no special temperature requirements. Thus the proposed system meets its objective efficiently.

V. CONCLUSION
This paper proposes the implementation of a Hadoop cluster for parallelizing bioinformatics applications and analyzes its performance in terms of cost effectiveness. The implementation uses low-cost commodity machines, which reduces the cost to a large extent. Most existing approaches rely on high-performance systems for cluster setup. Such machines are expensive, however, and using them for cluster setup is not feasible in a small university or research organization. Considering performance versus cost effectiveness, the proposed commodity cluster model for parallelizing bioinformatics applications is a suitable approach for a small research organization or university.

Acknowledgements
We are greatly indebted to the college management and the faculty members for providing the necessary facilities and hardware, along with timely guidance and suggestions for implementing this work.

REFERENCES
[1] Rocks Cluster Installation Memo, Hiroyuki Mishima, DDS, Ph.D., Research Fellow, and Jun Ni, Ph.D.
Associate Professor, Medical Imaging HPC & Informatics Lab, Department of Radiology, College of Medicine, University of Iowa, Iowa City, IA 52242, USA
[2] Rocks cluster emerald boa 6.1 User Guide (ebook)
[3] Hadoop in Practice (ebook)
[4] http://www.hadoop.apache.org
[5] http://www.rocksclusters.org/
[6] http://www.ibm.com/developerworks/library/l-hadoop-1/
[7] http://ankitasblogger.blogspot.in/2011/01/hadoop-cluster-setup.html
[8] http://icanhadoop.blogspot.in/2012/09/configuring-hadoop-is-very-if-you-just.html
[9] Arun Krishnan, “GridBLAST: a Globus-based high-throughput implementation of BLAST in a Grid computing framework,” Concurrency and Computation, v.17, Issue 13, pp. 1607-1623, 2005, doi:10.1002/cpe.v17:13.
[10] H. Stockinger, M. Pagni, L. Cerutti, L. Falquet, “Grid Approach to Proc of the 2nd IEEE Intl. Conf. on e-Science and Grid Computing, 2006, doi:10.1109/E-SCIENCE.2006.70.
[11] J. Andrade, M. Andersen, L. Berglund, and J. Odeberg, “Applications of Grid Computing in Genetics and Proteomics,” LNCS, 2007, doi:10.1007/978-3-540-75755-9.