Big Data Analysis: SQL Server and Apache Hadoop
Ruben Casas
School of Computing Engineering and Mathematics
University of Brighton
United Kingdom
2014
r.casasrincon1@uni.brighton.ac.uk
Word Count (Excluding References): 3450
Contents
1. Introduction
2. Traditional Relational Databases
2.1. SQL Server Test Data
2.2. Query Results
3. Distributed Computing
4. Available Tools For Big Data Analysis
4.1. MongoDB
4.1.1. Advantages
4.1.2. Weaknesses
4.1.3. Features
4.2. Apache Hadoop
4.2.1. Apache Hadoop YARN
4.2.2. Benefits of YARN
4.2.3. How YARN works
4.2.4. Components of YARN
5. Suitable Solutions
6. Hortonworks Data Platform
7. Apache Ambari
8. Apache Sqoop
9. Apache Hive
10. Comparing The Test Results
11. Conclusion
12. References
Big Data Analysis: SQL Server and Apache Hadoop
1. Introduction
Big data has become a common term nowadays; it is not just a fancy word used by the big internet
companies worrying about the incremental growth of their data centers. The big data problem has
reached medium and small enterprises, giving IT and data-warehouse managers something
to think about.
Big data is not just about size; it refers to difficult data, difficult to analyze and store, and it should
be looked at as a solution and an opportunity rather than a problem. The new tools and benefits
of big data are not exclusive to big companies like Google, Facebook or Amazon. After Google published
its research in 2004 (Dean & Ghemawat, 2008) describing the Google File System and MapReduce,
the ideas and the tools were revealed to the public, and projects like Apache Hadoop and all its
components were born, bringing big data analysis and solutions closer to small and medium enterprises.
The purpose of this project is to evaluate the performance and the suitability of the solution
provided by Apache Hadoop for analyzing data stored in a large database, and to compare
these results with a traditional relational database using SQL Server.
2. Traditional Relational Database Management Systems
The problem of traditional relational databases is not storing large amounts of data; their
problem is analysing the stored data. In an experiment carried out by Adam Jacobs (Jacobs, 2009),
using PostgreSQL to run a simple query on a 1 billion row fake dataset containing the basic
details of the population of the world, he demonstrated that traditional RDBMSs struggle to
scale gracefully to big datasets once the number of rows grows past 1 million. His conclusion raises
questions and concerns about the way RDBMSs approach big datasets and the difficulty of analysing
the stored data. Furthermore, his experiment covers only a hypothetical situation using
structured data, and the biggest weakness of RDBMSs is that they cannot analyse unstructured
data: the great struggle is to store raw data and load it into a traditional database.
I have decided to continue his experiment, comparing SQL Server, one of the most common
RDBMSs used by companies around the world, with one of the leading tools for big data analysis:
Hadoop.
2.1 SQL Server Test Data
To demonstrate the difficulties of the RDBMS, I created two datasets for the test. The first one is a
100 million row database called “population2” (8GB) containing a single table called
“MOCK_DATA” with 6 columns:
Column Name Data Type
Id int
first_name varchar(50)
last_name varchar(50)
Email varchar(50)
Country varchar(20)
ip_address varchar(20)
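For reference, a table with this layout could be created with a statement along the following lines. This is only a sketch; the server name, authentication mode and exact DDL are assumptions, since the schema in this test was produced by the data-generation tooling rather than written by hand:

sqlcmd -S localhost -d Population2 -E -Q "CREATE TABLE MOCK_DATA (id int, first_name varchar(50), last_name varchar(50), email varchar(50), country varchar(20), ip_address varchar(20))"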
The second database is called “population”, with the same meta-data structure, but this time the
table was populated with 1,100 million rows (97GB).
Both databases were created using Microsoft SQL Server 2012. To populate them with
fake data, I used SQL Data Generator, provided by Red Gate (www.red-gate.com), and a standard
server machine: an HP ProLiant ML310e Gen8 v2 server with 8GB RAM and an Intel Xeon E3 (Intel Core i3 class) processor.
The fake-data generation for the first database took around 1 hour and 15 minutes for the 100
million rows, but when I created the second database I found that after the first 100 million rows
the data generation time started to increase gradually, going from 100 million rows per hour
to double that period of time, and taking a total of 23 hours 40 minutes to complete the whole 1,100
million rows. The performance of the machine was seriously reduced, using almost all of the
available processor and RAM resources.
With my test dataset ready, I tried to run the following query in SQL Server Management Studio:
Use Population2
Select first_name, last_name, email, country from MOCK_DATA group by
country, first_name, last_name, email.
After 20 minutes, the query stopped and SQL Server Management Studio showed the following
error message: “System.OutOfMemoryException”. According to Microsoft support, this issue
occurs when SSMS is unable to display query results in the query window due to its 2GB memory
restriction. SSMS is a 32-bit process and imposes an artificial limit on how much text can be
displayed per database field in the results window. This limit is 64KB in “grid” mode and 8KB in “text”
mode; therefore, if the result set is too large and the memory required would surpass the 2GB limit
of the SSMS process, the query stops and the error message is thrown (Microsoft, KB2874903).
The solution for large result sets is to export the results to a file or to use a different tool. Microsoft
suggests the 64-bit sqlcmd tool instead of SSMS to run the SQL queries; this avoids the 2GB restriction
that affects the 32-bit SSMS process.
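As a rough illustration, the same query can be run with sqlcmd and streamed straight to a file instead of the SSMS results grid. This is a minimal sketch only; the server name, Windows authentication (-E), output path and separator are assumptions:

sqlcmd -S localhost -d Population2 -E -W -s "," -o C:\results\population2_results.csv -Q "SELECT first_name, last_name, email, country FROM MOCK_DATA GROUP BY country, first_name, last_name, email"

Because the rows are written to disk as they arrive rather than being rendered in the query window, the 2GB limit of the 32-bit SSMS process no longer applies.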
2.2 Query Results
On the Population2 database, the query took 29 minutes 48 seconds to extract the information and
save it to an unstructured text file (15.3GB).
For the 1 billion row population database, it took 9 hours 22 minutes to process the query and the
output file was 151GB. Note that it took nearly twice as long per 100 million rows to run
this query compared with the Population2 database.
This example represents the real challenge and the truth about the difficulty relational database
models have in analyzing big data. The difficulty is not storing the data but retrieving and analyzing it.
Relational DBMSs are designed for transactional performance; that is,
updating, adding, searching for and retrieving small amounts of data in a large database. The
DBMS executes the actions of transactions in an interleaved fashion to obtain good performance,
but schedules them in such a way as to ensure that conflicting operations are not permitted to
proceed concurrently (Ramakrishnan, 2000).
3. Distributed Computing
We could argue that the solution to this issue is to increase the computing resources; after all,
the ProLiant ML310e Gen8 v2 is not the most powerful machine on the market, and we could get a better
server with more memory and processing power to improve the query time and the performance
when analyzing those large datasets. However, another truth about big data is that it is always
growing, and eventually it will overcome the initial estimations of hardware resources.
The main issue in this case is scalability and performance, and an alternative is to use distributed
computing to separate the data into smaller parts (chunks) across a set of systems, usually small
computers or servers (nodes), which process the data. The main advantage of this approach is
scalability: one more node can simply be added to the cluster, and modifications to
the scripts or applications can be implemented quickly.
4. Available tools for Big Data
The main concern about relational databases is scalability; this model does not work properly in a
distributed environment, because joining tables across the nodes and keeping the consistency
could be almost impossible to achieve due to the structured nature of the model.
In response to the limitations of and concerns about the relational model, developers and industries
have increasingly turned to NoSQL databases. The NoSQL movement probably started inspired by
Google’s BigTable or Amazon’s S3 and SimpleDB, and along with them new tools and new ways of
understanding what databases are and what they can do have been introduced (Bartholomew, 2010).
The question is what tool to use and when, depending on the needs of a growing enterprise. There
are two tools that I reviewed in order to select the best way to analyze big data: MongoDB and
Apache Hadoop.
4.1 MongoDB
MongoDB is an open source product developed and supported by 10gen. It is a scalable, open
source, high performance, document-oriented NoSQL database (10gen, 2014). The extra
features and advantages of MongoDB are:
 Query Language
 Fast Performance
 Horizontal Scalability
4.1.2 The main weaknesses of MongoDB and NoSQL systems in general are:
 No Joins support
 No Complex Transaction support
 No Constraints support
MongoDB stores data in collections of documents encoded in BSON, a binary format very similar to the JSON
format used in JavaScript. Each collection holds different documents (objects). The documents can be compared
to the rows in a relational database, and the collections to the tables. The _id field is mandatory and
is comparable to a primary key. The number of fields within a collection is very flexible
compared to a relational database, and fields can store multiple values.
MongoDB uses a document-oriented query language to query the data in the database; the
following are examples of the structure:
db.employees.find({_id:123}); Find the record with a specific id in the collection “employees”
db.employees.find().sort({ name:1 }) Find all records from the collection “employees”, sorted by name.
4.1.3 MongoDB Features:
 Ad hoc queries
 Supports search by field, range queries and regular expression searches.
 Indexing: any field in a document can be indexed.
 Master/slave replication: a master can perform reads and writes, and the slave copies data
from the master and can only be used for reads or backups (no writes).
 Duplication of data: MongoDB runs over multiple servers to give protection and keep the
system up and running in case of hardware failure.
 Load balancing: automatic load balancing is easy to deploy.
 Scalable: scales horizontally; new machines can be added to a running database.
 File storage: GridFS can be used as a file system, taking advantage of the load balancing and
data replication features.
 Aggregations: simple, limited MapReduce can be used for batch processing of data and
aggregation operations.
 Server-side JavaScript: JavaScript functions can be used in queries.
 Special support for locations: understands longitude and latitude natively.
 Applications: MongoDB can be used for big data applications as well as traditional
applications.
 Object-oriented programming: no conversion is required when using an OO programming
language.
4.2 Apache Hadoop
The leading tool in terms of popularity for big data analysis is the open source project called
Hadoop. This Apache project, written in Java, is a computing environment built on top of a
distributed, clustered file system designed for very large scale data operations (Apache, 2014).
Hadoop was inspired by Google’s distributed file system (GFS) and the MapReduce
programming paradigm. Unlike transactional systems, Hadoop is designed to scan through big,
large-scale datasets and produce its outcome through an easily scalable and distributed
batch processing system. Hadoop is not about processing speed, response times, transactional
speed or real-time storage; it is about the logical distribution of the workloads, scalability and
the analysis perspective.
4.2.1 Apache Hadoop YARN
YARN is a Hadoop project introduced in Hadoop 2.0. The functionality provided by MapReduce used to
cover both cluster resource management and data processing, and there was a need to enable
broader interaction with data beyond MapReduce. With YARN, the resource management and the
processing components are now separated: YARN takes care of the cluster resource management,
and MapReduce performs the data processing via YARN. The YARN-based architecture of Hadoop
2.0 provides a general processing platform which is not constrained to MapReduce. Thus YARN
takes over the resource management capabilities that were part of MapReduce, which can now
focus on data processing. Multiple applications can run on top of Hadoop using YARN,
sharing a common resource management layer.
4.2.2 Benefits of YARN
It enhances the performance of the Hadoop cluster as follows:
Scalability: the processing power in data centers grows with time; more nodes are added to the
cluster as demand increases. YARN resource management focuses exclusively on scheduling,
which is why it manages larger clusters easily.
Backward compatibility: YARN is backward compatible with older versions of MapReduce. Existing
MapReduce applications can run on top of YARN without any changes to the current
processes.
Improved cluster utilization: the resource manager works purely as a scheduler and optimizes
cluster utilization. The scheduler bases the optimization on criteria such as capacity
guarantees, fairness and SLAs.
No named map and reduce slots: this results in better utilization of the cluster resources.
Support for other workloads: data processing can now be performed using programming
models other than MapReduce, such as graph processing and iterative modelling. This results in
an increased return on investment.
MapReduce’s agility: MapReduce is now independent of the resource management, which is
handled by YARN, and MapReduce focuses only on data processing. It is now more agile in terms of
evolving independently of the underlying resource management layer.
4.2.3 How YARN works
The fundamental idea of YARN is to separate the two main responsibilities in a Hadoop system,
the job tracker and the task tracker, into separate entities.
4.2.4 The components in the YARN-based system are:
Global resource management: the Resource Manager and the Node Manager form the basis for
managing applications in a distributed manner. The responsibility of the Resource Manager is to
distribute the available resources to the applications.
Per-application Application Master: the framework-specific entity. On one side it communicates
with the Resource Manager and on the other with the Node Managers. It negotiates resources from
the Resource Manager and works with the Node Managers to execute and monitor the
component tasks. The Resource Manager has a scheduler which is responsible for allocating
resources to the various running applications. It performs allocations according to constraints;
examples of constraints are queue capacities and user limits. The scheduling is performed based
on the resource requirements of the applications.
Per-node slave Node Manager and per-application container: the Node Manager is the per-
machine slave and is responsible for launching the application containers. It monitors the
resource usage of CPU, memory, disk and network, and reports this back to the Resource
Manager. Each Application Master negotiates resource containers from the scheduler, tracks their
status and monitors their progress.
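In practice, applications are submitted to the Resource Manager and run inside containers negotiated by their Application Masters. As a hedged illustration (the examples jar location varies between distributions and is an assumption here), a stock MapReduce example can be submitted through YARN and the running applications listed from the command line:

yarn jar hadoop-mapreduce-examples.jar wordcount /user/hadoop/input /user/hadoop/output
yarn application -list

The first command starts a MapReduce job whose Application Master requests containers from the scheduler; the second asks the Resource Manager for the applications it is currently managing.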
5. Suitable Solutions.
When it comes to deciding which tool will provide the best solution to a specific problem, we have
to take into account that MongoDB and Apache Hadoop are two different approaches to big data
analysis. After studying their characteristics, benefits and components, we can conclude that
Hadoop is used to process data for analytical purposes and is not meant to be used for real-time
processing, whereas MongoDB could be used for real-time processing and could store massive
amounts of data as well; however, its processing is performed on small subsets of the data. In the
relational database world we can compare Hadoop to OLAP and MongoDB to OLTP.
For the purposes of this research, I have chosen Hadoop to analyze the 97GB dataset containing
the basic details of the population. Although the Apache open source Hadoop project can be
downloaded and configured directly from the Apache website, there are many vendors and
Hadoop technologies available which provide a better and easier implementation for our project.
All of them are based on the core Apache project, HDFS and MapReduce, and include
additional applications such as Hive, Pig, HBase, Sqoop, etc. The three main companies providing
distributions for Hadoop-ready ecosystems are Cloudera, MapR and Hortonworks.
6. Hortonworks Data Platform 2.1
I have chosen Hortonworks, which is completely open source. They provide a complete framework
which includes all the useful applications for big data analysis.
Hortonworks Data Platform version 2.1 is a multi-workload platform for data processing across an
array of processing methods, from batch through interactive to real time
(www.hortonworks.com). HDP 2.1 runs on top of the YARN data operating system and integrates
the capabilities that are required for enterprise needs.
The distribution was installed on a Core i3 machine with 6GB RAM, running Oracle VirtualBox. I
have enabled Apache Ambari 1.5.1, which is included in the HDP distribution, to set up and configure
my cluster. Ambari is an open operational framework for provisioning, managing and monitoring
Apache Hadoop clusters.
7. Apache Ambari
The cluster configuration is straightforward thanks to the Ambari web UI. The configuration wizard
provides the essential tools to install, configure and deploy the new hosts within the cluster,
enabling the health checks, service status, resource management and the status of jobs running on the
cluster, which for this test consists of the master node and 2 slave hosts.
8. Apache Sqoop
The cluster is configured and running, ready to test the data processing and the benefits of the
MapReduce technology. As a proof of concept, I uploaded the data from MS SQL Server into the
HDP with a Sqoop script. Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and relational databases. The data is imported from the external structured
datastore directly into HDFS or related systems like Hive and HBase. It can also be used to extract
data from Hadoop and export it to an external RDBMS. It supports Teradata, Netezza, Oracle,
MySQL, Postgres, HSQLDB and MS SQL Server.
Using the Microsoft SQL Server driver for Hadoop, I ran the following script to test the connection
between SQL Server and Hadoop:
sqoop list-databases --connect jdbc:sqlserver://192.168.56.1:1433
--username hadoop --password password
The output is the list of databases available at the moment on the SQL Server, ready to be imported
into HDFS using the following command:
sqoop import --connect
"jdbc:sqlserver://192.168.56.1:1433;database=population2;username=
hadoop;password=hadoop1" --table MOCK_DATA --hive-import
After the execution of the command, Hadoop launches a series of MapReduce jobs, which process
the data and import the table into HDFS and Hive. The execution time was 1,521 seconds to
import the whole Population2 table.
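Sqoop also works in the opposite direction, as mentioned above. As a hedged sketch only (the results table and the HDFS directory are hypothetical placeholders, and the target table must already exist in SQL Server), an export back to the RDBMS would look roughly like this:

sqoop export --connect "jdbc:sqlserver://192.168.56.1:1433;database=population2;username=hadoop;password=hadoop1" --table MOCK_DATA_RESULTS --export-dir /apps/hive/warehouse/mock_data_results --input-fields-terminated-by ','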
9. Apache Hive
Now, with our test data imported into our cluster, we are ready to run the tests and compare the results.
HDP provides a web UI that allows interactive browsing and use of the integrated tools. I have
used Apache Hive, which is the standard for SQL queries in Hadoop. It uses the HiveQL language, a
language very similar to SQL in terms of semantics and features. Beeswax is the UI for Hive included
in HDP 2.1.
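The same statements can also be issued from the Hive command-line client on the cluster. As a minimal sketch, assuming the Sqoop import registered the table as MOCK_DATA in the default Hive database, the imported data can be checked and queried like this:

hive -e "DESCRIBE MOCK_DATA"
hive -e "SELECT country, COUNT(*) FROM MOCK_DATA GROUP BY country" > country_counts.txt

Each invocation is compiled into MapReduce jobs by Hive and scheduled on the cluster through YARN.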
10. Comparing The TestResults
I ran the same query Select first_name, last_name, email, country from
MOCK_DATA group by country, first_name, last_name, email. Using Beewax and
HiveQL and these are the results:
The query took 869 seconds (14 min, 48 seconds) to process the Population2 database when
executing the query with the group by clause. The MapReduce aggregation function running on
HadoopHDP 2.1 clusterwith2 slave nodes,representedaperformance increase of 50% compared
with the SQL Server query.
The second query running on the Population database took 12559 seconds (3, 48 hours) to
complete the aggregationsforthe 1Billionrows dataset, representing a performance increase of
62.16%
The advantage of Hive and Beewax is that the results could be displayed on screen and the data
could be examined and copied directly from the query results windows. In the SQL server
Management Studio case, the results had to be saved to a text file that can only be open with a
powerful machine abletoloadthe file ontomemoryinordertoanalyze it. It’s important to notice
that the performance couldbe improvedevenmore,byaddingmore hosts to the Hadoop cluster.
11. Conclusion
We have shown that RDBMSs are not enough for medium and big data analysis. The scalability and
performance required for these purposes can be achieved by using the big data tools available, such as
Hadoop. The test carried out to demonstrate the weaknesses of the relational model and the
differences in performance between Hadoop and SQL Server is based only on structured data stored in a
relational way. Analyzing unstructured data is the real struggle for RDBMSs. An example of
unstructured big data is the log files generated by the normal interactions of users visiting a
website. To analyze these kinds of datasets using SQL Server, we would have to parse the log files
and convert them into a structured relational table in order to query and analyze them. On the other
hand, Hadoop and the HDP offer a very effective way to analyze log files using HDFS and
HCatalog. HCatalog shares the schema with Pig and Hive within a Hadoop cluster, which allows these tools
to perform actions and queries on the stored data. The conversion takes place within HCatalog,
which provides a relational way to represent unstructured data in the Hadoop Distributed File System.
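To give a flavour of this workflow (a hedged sketch only; the log path, delimiter, field layout and table name are hypothetical and depend on the actual log format), web-server logs copied into HDFS can be exposed to Hive as an external table and queried in place:

hadoop fs -mkdir -p /user/hadoop/weblogs
hadoop fs -put access.log /user/hadoop/weblogs/
hive -e "CREATE EXTERNAL TABLE web_logs (ip STRING, request_time STRING, request STRING, status INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/hadoop/weblogs'; SELECT status, COUNT(*) FROM web_logs GROUP BY status"

No data is copied or converted up front; the schema is applied only when the query runs.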
There are many advantages to the Hadoop implementation for big data analysis; however, it is not
always the best solution for all cases. Hadoop is meant to be used for offline processing
and analysis, not for real-time processing. In that case, the traditional RDBMS tools or
the alternative NoSQL tools such as MongoDB present a better solution in terms of performance
for small and big subsets of data. The alternative for medium and big enterprises worrying about
the incremental growth of their datasets is to evaluate carefully the applications generating the
data in real-time transactions, then enhance the performance of the data centers and transfer the
data to be analyzed to a distributed environment, where big data tools such as Hadoop can
process the data easily and efficiently, extracting the real value from it and improving the decision
making based on the results.
12. References
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters.
Communications of the ACM, 51(1), 107-113.
Jacobs, A. (2009). The pathologies of big data. Communications of the ACM, 52(8), 36-44.
Apache Foundation (2014). What is Hadoop. Retrieved 7 Jun 2014 from http://hadoop.apache.org
Microsoft Knowledge Base (2014). KB2874903. Retrieved 24 Jul 2014 from
http://support2.microsoft.com/kb/2874903
Ramakrishnan, R., & Gehrke, J. (2000). Database management systems. Osborne/McGraw-Hill. (p. 2)
Bartholomew, D. (2010). SQL vs. NoSQL. Linux Journal, 195, 54-59.
MongoDB Inc (2014). MongoDB Overview. Retrieved 30 Jul 2014 from
http://www.mongodb.com/mongodb-overview
Apache Foundation (2014). Apache Hadoop NextGen MapReduce (YARN). Retrieved 30 Jul 2014 from
http://hadoop.apache.org/docs/r2.5.1/hadoop-yarn/hadoop-yarn-site/YARN.html
Hortonworks (2014). Hortonworks Data Platform 2.1. Retrieved 31 Jul 2014 from
http://hortonworks.com/
Apache Foundation (2014). Ambari Overview. Retrieved 31 Jul 2014 from http://ambari.apache.org/
Microsoft (2014). Microsoft SQL Server JDBC Driver. Downloaded from
http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz
Apache Foundation (2014). Apache Sqoop. Retrieved 31 Jul 2014 from http://sqoop.apache.org/
Hortonworks (2013). Import from SQL Server into the HDP using Sqoop. Retrieved 6 Aug 2014 from
http://hortonworks.com/hadoop-tutorial/import-microsoft-sql-server-hortonworks-sandbox-using-sqoop/
Apache Foundation (2014). Apache Hive. Retrieved 31 Jul 2014 from https://hive.apache.org/