Big Data Analysis: SQL Server and Apache Hadoop
Ruben Casas
School of Computing, Engineering and Mathematics
University of Brighton
United Kingdom
2014
r.casasrincon1@uni.brighton.ac.uk
Word Count (Excluding References): 3450
1. Introduction
Big data has become a common term nowadays; it is not just a fancy word used by the big internet companies worrying about the incremental growth of their data centres. The big data problem has reached medium and small enterprises, giving IT and data-warehouse managers something to think about.
Big data is not just about size; it refers to difficult data, difficult to analyse and store, and it should be looked at as a solution and an opportunity rather than a problem. The new tools and benefits of big data are not exclusive to big companies like Google, Facebook or Amazon. After Google published research papers describing the Google File System and MapReduce (Dean & Ghemawat 2008), the techniques were revealed to the public, and projects like Apache Hadoop and all its components were born, bringing big data analysis and solutions closer to small and medium enterprises.
The purpose of this project is to evaluate the performance and suitability of the solution provided by Apache Hadoop for analysing data stored in a large database, and to compare the results with a traditional relational database using SQL Server.
2. Traditional Relational Database Management Systems
The problem with traditional relational databases is not storing large amounts of data; the problem is analysing the stored data. In an experiment carried out by Adam Jacobs (Jacobs 2009), using PostgreSQL to run a simple query on a 1-billion-row fake dataset containing the basic details of the population of the world, he demonstrated that traditional RDBMSs struggle to scale gracefully to big datasets once the number of rows grows past 1 million. His conclusion raises questions and concerns about the way RDBMSs approach big datasets and the difficulty of analysing the stored data. Furthermore, his experiment covers only a hypothetical situation using structured data; the biggest weakness of RDBMSs is that they cannot analyse unstructured data, where the great struggle is to store the raw data and load it into a traditional database at all.
I decided to continue his experiment by comparing SQL Server, one of the most widely used RDBMSs, with one of the leading tools for big data analysis: Hadoop.
2.1 SQL Server Test Data
To demonstrate the difficulties of the RDBMS, I created two datasets for the test. The first one is a 100-million-row database called “population2” (8 GB), containing a single table called “MOCK_DATA” with 6 columns:

Column Name    Data Type
id             int
first_name     varchar(50)
last_name      varchar(50)
email          varchar(50)
country        varchar(20)
ip_address     varchar(20)
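As a minimal sketch of this setup, the schema and the aggregation query used later in the experiment can be reproduced in miniature with Python's built-in sqlite3 module (SQLite stands in for SQL Server here, and the sample rows are invented, not taken from the generated dataset):

```python
import sqlite3

# Miniature, in-memory stand-in for the MOCK_DATA test table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MOCK_DATA (
        id         INTEGER,
        first_name VARCHAR(50),
        last_name  VARCHAR(50),
        email      VARCHAR(50),
        country    VARCHAR(20),
        ip_address VARCHAR(20)
    )
""")
rows = [
    (1, "Ada",  "Lovelace", "ada@example.com",  "UK", "10.0.0.1"),
    (2, "Alan", "Turing",   "alan@example.com", "UK", "10.0.0.2"),
    (3, "Ada",  "Lovelace", "ada@example.com",  "UK", "10.0.0.3"),  # duplicate person
]
conn.executemany("INSERT INTO MOCK_DATA VALUES (?, ?, ?, ?, ?, ?)", rows)

# The same GROUP BY query used in the experiment, on 3 toy rows:
result = conn.execute("""
    SELECT first_name, last_name, email, country
    FROM MOCK_DATA
    GROUP BY country, first_name, last_name, email
""").fetchall()
print(result)  # the two rows with identical details collapse into one group
```

On three rows this is instant, of course; the experiment below shows what happens to the same query shape at 100 million and 1,100 million rows.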
The second database is called “population”, with the same metadata structure, but this time the table was populated with 1,100 million rows (97 GB).
Both databases were created using Microsoft SQL Server 2012. To populate them with the fake data, I used SQL Data Generator, provided by Red Gate (www.red-gate.com), running on a standard server machine: an HP ProLiant ML310e Gen8 v2 server with 8 GB of RAM and an Intel Xeon E3 processor.
Generating the fake data for the first database took around 1 hour and 15 minutes for the 100 million rows. When I created the second database, however, I found that after the first 100 million rows the generation time started to increase gradually, from 100 million rows per hour to double that period of time, taking a total of 23 hours and 40 minutes to complete the whole 1,100 million rows. The performance of the machine was seriously degraded, as the generation used almost all of the available processor and RAM resources.
With my datasets for testing ready, I tried to run the following query in SQL Server Management Studio:
USE Population2
SELECT first_name, last_name, email, country FROM MOCK_DATA
GROUP BY country, first_name, last_name, email
After 20 minutes the query stopped, and SQL Server Management Studio showed the following error message: “System.OutOfMemoryException”. According to Microsoft support, this issue occurs when SSMS is unable to display the query results in the query window because of its 2 GB memory restriction. SSMS is a 32-bit process and imposes an artificial limit on how much text can be displayed per database field in the results window: 64 KB in “grid” mode and 8 KB in “text” mode. If the result set is too large, the memory required may surpass the 2 GB limit of the SSMS process, in which case the query stops and the error message is thrown (Microsoft, KB2874903).
The solution for large datasets is to export the results to a file or to use a different tool. Microsoft suggests the 64-bit sqlcmd tool instead of SSMS to run the SQL queries; this avoids the 2 GB restriction that affects the 32-bit SSMS process.
2.2 Query Results
On the Population2 database, the query took 29 minutes and 48 seconds to extract the information and save it to an unstructured text file (15.3 GB).
For the 1-billion-row Population database, it took 9 hours and 22 minutes to process the query, and the output file was 151 GB. Note that it took nearly twice as long per 100 million rows to run this query compared with the Population2 database.
This example represents the real challenge behind the difficulty relational database models have with big data: the difficulty is not storing the data, it is retrieving and analysing it. Relational DBMSs are designed for transactional performance, that is, updating, adding, searching for and retrieving small amounts of data in a large database. The DBMS executes the actions of transactions in an interleaved fashion to obtain good performance, but schedules them in such a way as to ensure that conflicting operations are not permitted to proceed concurrently (Ramakrishnan & Gehrke, 2000).
3. Distributed Computing
We could argue that the solution to this issue is increasing the computing resources; after all, the ProLiant ML310e Gen8 is not the most powerful machine on the market, and we could get a better server with more memory and processing power to improve the query time and the performance when analysing those large datasets. However, another truth about big data is that it is always growing, and eventually it will overcome the initial estimates of hardware resources.
The main issues in this case are scalability and performance, and an alternative is to use distributed computing to separate the data into smaller parts (chunks) across a set of systems, usually small computers or servers (nodes), which process the data. The main advantages of this approach are scalability, achieved by just adding one more node to the cluster, and the ability to implement modifications to the scripts or applications quickly.
4. Available tools for Big Data
The main concern about relational databases is scalability; this model does not work properly in a distributed environment, because joining tables across the nodes and keeping consistency can be almost impossible to achieve due to the structured nature of the model.
In response to the limitations of, and concerns about, the relational model, developers and industries have increasingly turned to NoSQL databases. The NoSQL movement probably started inspired by Google's Bigtable and by Amazon's S3 and SimpleDB, and along with them new tools and new ways of understanding what databases are and what they can do have been introduced (Bartholomew, 2010).
The question is which tool to use, and when, depending on the needs of a growing enterprise. There are two tools that I reviewed in order to select the best way to analyse big data: MongoDB and Apache Hadoop.
4.1 MongoDB
MongoDB is an open source product developed and supported by 10gen. It is a scalable, open source, high-performance, document-oriented NoSQL database (10gen, 2014). The extra features and advantages of MongoDB are:
Query language
Fast performance
Horizontal scalability
4.1.2 The main weaknesses of MongoDB, and of NoSQL systems in general, are:
No join support
No complex transaction support
No constraint support
MongoDB stores data in collections of documents, encoded as BSON, a format very similar to the JSON used in JavaScript. Each collection contains different documents (objects); documents can be compared to the rows in a relational database, and collections to the tables. The _id field is mandatory and is comparable to a primary key. The number of fields within a collection is very flexible compared with a relational database, and fields can store multiple values.
MongoDB uses a document-oriented query language to query the data in the database; the following are examples of its structure:
db.employees.find({_id: 123}); // find the record with a specific _id in the “employees” collection
db.employees.find().sort({name: 1}); // find all records in “employees”, sorted by name (ascending)
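The semantics of these two queries can be sketched without a running MongoDB instance, using plain Python dictionaries as stand-ins for BSON documents (a toy illustration of the query model, not the real driver API; the sample documents are invented):

```python
# Toy stand-in for a MongoDB collection: a list of dict "documents".
employees = [
    {"_id": 123, "name": "Davis"},
    {"_id": 456, "name": "Adams"},
    {"_id": 789, "name": "Clark"},
]

def find(collection, query=None):
    """Mimic collection.find(): return documents matching every query field."""
    query = query or {}
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

# db.employees.find({_id: 123})
by_id = find(employees, {"_id": 123})

# db.employees.find().sort({name: 1})  (1 means ascending)
by_name = sorted(find(employees), key=lambda doc: doc["name"])
print(by_id, [d["name"] for d in by_name])
```

The point of the sketch is that queries match on document fields directly; there is no schema to declare and no join to plan, which is exactly the trade-off listed above.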
4.1.3 MongoDB Features:
Ad hoc queries: supports search by field, range queries and regular-expression searches.
Indexing: any field in a document can be indexed.
Master/slave replication: a master can perform reads and writes, while a slave copies data from the master and can only be used for reads or backup (no writes).
Duplication of data: MongoDB runs over multiple servers to provide protection and keep the system up and running in case of hardware failure.
Load balancing: automatic load balancing is easy to deploy.
Scalability: scales horizontally; new machines can be added to a running database.
File storage: GridFS can be used as a file system, taking advantage of the load-balancing and data-replication features.
Aggregation: a simple, limited form of MapReduce can be used for batch processing of data and aggregation operations.
Server-side JavaScript: JavaScript functions can be used in queries.
Special support for locations: understands longitude and latitude natively.
Applications: MongoDB can be used for big data applications as well as traditional applications.
Object-oriented programming: no conversion is required when using an OO programming language.
4.2 Apache Hadoop
The leading tool in terms of popularity for big data analysis is the open source project called Hadoop. This Apache project, written in Java, is a computing environment built on top of a distributed, clustered file system designed for very large scale data operations (Apache, 2014).
Hadoop was inspired by Google's distributed file system (GFS) and the MapReduce programming paradigm. Unlike transactional systems, Hadoop is designed to scan through big, large-scale datasets and produce its results through an easily scalable, distributed batch-processing system. Hadoop is not about processing speed, response times, transactional speed or real-time storage; it is about distributing the workload logically, and about scalability and the analysis perspective.
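The MapReduce paradigm itself can be sketched in a few lines of plain Python (a single-process toy standing in for what Hadoop does across many nodes): a map function emits key/value pairs, a shuffle step groups them by key, and a reduce function aggregates each group. The sample rows are invented for illustration.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (country, 1) for each row, as a MapReduce mapper would.
    _id, first, last, email, country, ip = record
    yield (country, 1)

def reduce_phase(key, values):
    # Aggregate all values seen for one key, as a reducer would.
    return (key, sum(values))

rows = [
    (1, "Ada", "Lovelace", "ada@example.com", "UK", "10.0.0.1"),
    (2, "Alan", "Turing", "alan@example.com", "UK", "10.0.0.2"),
    (3, "Grace", "Hopper", "grace@example.com", "US", "10.0.0.3"),
]

# Shuffle: group mapper output by key (the Hadoop framework does this step).
groups = defaultdict(list)
for row in rows:
    for key, value in map_phase(row):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts)  # rows per country
```

Because the map calls are independent and the shuffle partitions work by key, each phase can run on many machines in parallel, which is what makes the model scale by simply adding nodes.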
4.2.1 Apache Hadoop YARN
YARN is a Hadoop project introduced in Hadoop 2.0. Before YARN, MapReduce provided both cluster resource management and data processing, and there was a need to enable broader interaction with the data beyond MapReduce. With YARN, the resource-management and processing components are now separated: YARN takes care of cluster resource management, and MapReduce performs the data processing via YARN. The YARN-based architecture of Hadoop 2.0 provides a general processing platform which is not constrained to MapReduce. Thus YARN takes over the resource-management capabilities that were part of MapReduce, which can now focus on data processing, and multiple applications can run on top of Hadoop via YARN, sharing a common resource management layer.
4.2.2 Benefits of YARN
It enhances the performance of the Hadoop cluster as follows:
Scalability: the processing power in data centres grows with time, and more nodes are added to the cluster as demand increases. YARN's resource management focuses exclusively on scheduling, which is why it manages larger clusters easily.
Backward compatibility: YARN is backward compatible with older versions of MapReduce. Existing MapReduce applications can run on top of YARN without any changes to their current processes.
Improved cluster utilization: the Resource Manager works purely as a scheduler and optimizes cluster utilization. The scheduler bases the optimization on criteria such as capacity guarantees, fairness and SLAs.
No named map and reduce slots: this results in better utilization of the cluster resources.
Support for other workloads: data processing can now be performed using programming models other than MapReduce, such as graph processing and iterative modelling. This results in an increased return on investment.
MapReduce agility: MapReduce is now independent of the resource management, which is handled by YARN, and focuses only on data processing. It is now more agile, able to evolve independently of the underlying resource-management layer.
4.2.3 How YARN works
The fundamental idea of YARN is to split the two main responsibilities of the JobTracker in a Hadoop system, resource management and job scheduling/monitoring, into separate entities.
4.2.4 The components in the YARN-based system are:
Global resource management: the Resource Manager and the Node Managers form the basis for managing applications in a distributed manner. The responsibility of the Resource Manager is to distribute the available resources among the applications. The Resource Manager has a scheduler which is responsible for allocating resources to the various running applications; it performs allocations according to constraints, such as queue capacities and user limits, and schedules based on the resource requirements of the applications.
Per-application Application Master: the framework-specific entity. On one side it communicates with the Resource Manager, and on the other with the Node Managers; it negotiates resources from the Resource Manager and works with the Node Managers to execute and monitor the component tasks.
Per-node slave Node Manager and per-application containers: the Node Manager is the per-machine slave and is responsible for launching the applications' containers. It monitors their resource usage (CPU, memory, disk and network) and reports it back to the Resource Manager. Each Application Master negotiates resource containers from the scheduler, tracks their status and monitors their progress.
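This separation shows up directly in the Hadoop 2.x configuration. A minimal yarn-site.xml sketch (the hostname value here is a placeholder, not taken from the test cluster) points each Node Manager at the Resource Manager and enables the shuffle auxiliary service that lets MapReduce run as one YARN application among others:

```xml
<!-- yarn-site.xml: minimal sketch; "master-node" is a placeholder hostname -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-node</value>
  </property>
  <property>
    <!-- Auxiliary service so MapReduce jobs can shuffle map output under YARN -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```

With mapreduce.framework.name set to yarn in mapred-site.xml, MapReduce then becomes just one of the application frameworks that YARN schedules.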
5. Suitable Solutions
When it comes to deciding which tool will provide the best solution to a specific problem, we have to take into account that MongoDB and Apache Hadoop are two different approaches to big data analysis. After studying their characteristics, benefits and components, we can conclude that Hadoop is used to process data for analytical purposes and is not meant to be used for real-time processing, whereas MongoDB can be used for real-time processing and can also store massive amounts of data; its processing, however, is performed on small subsets of the data. In relational database terms, we can compare Hadoop with OLAP and MongoDB with OLTP.
For the purposes of this research, I have chosen Hadoop to analyse the 97 GB dataset containing the basic details of the population. Although the open source Apache Hadoop project can be downloaded and configured directly from the Apache website, there are many vendors and Hadoop technologies available which provide a better and easier implementation for our project. All of them are based on the core Apache project, HDFS and MapReduce, and include additional applications such as Hive, Pig, HBase, Sqoop, etc. The three main companies providing distributions for ready-to-use Hadoop ecosystems are Cloudera, MapR and Hortonworks.
6. Hortonworks Data Platform 2.1
I have chosen Hortonworks, which is completely open source and provides a complete framework including all the useful applications for big data analysis. Hortonworks Data Platform version 2.1 is a multi-workload platform for data processing across an array of processing methods, from batch through interactive to real time (www.hortonworks.com). HDP 2.1 runs on top of the YARN data operating system and integrates the capabilities required for enterprise needs.
The distribution was installed on a Core i3 machine with 6 GB of RAM, running Oracle VirtualBox. I enabled Apache Ambari 1.5.1, which is included in the HDP distribution, to set up and configure my cluster. Ambari is an open operational framework for provisioning, managing and monitoring Apache Hadoop clusters.
7. Apache Ambari
The cluster configuration is straightforward thanks to the Ambari web UI. The configuration wizard provides the essential tools to install, configure and deploy the new hosts within the cluster, enabling health checks, service status, resource management and the status of jobs running on the cluster, which for this test consists of the master node and 2 slave hosts.
8. Apache Sqoop
With the cluster configured and running, we are ready to test the data processing and the benefits of the MapReduce technology. As a proof of concept, I uploaded the data from MS SQL Server into the HDP with a Sqoop script. Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and relational databases. Data is imported from an external structured datastore directly into HDFS or related systems like Hive and HBase; Sqoop can also be used to extract data from Hadoop and export it to an external RDBMS. It supports Teradata, Netezza, Oracle, MySQL, Postgres, HSQLDB and MS SQL Server.
Using the Microsoft SQL Server driver for Hadoop, I ran the following script to test the connection between the SQL Server and Hadoop:
sqoop list-databases --connect jdbc:sqlserver://192.168.56.1:1433
--username hadoop --password password
The output is the list of databases currently available on the SQL Server and ready to import into HDFS using the following command:
sqoop import --connect
"jdbc:sqlserver://192.168.56.1:1433;database=population2;username=
hadoop;password=hadoop1" --table MOCK_DATA --hive-import
After the execution of the command, Hadoop launches a series of MapReduce jobs which process the data and import the table into HDFS and HBase. The execution time was 1,521 seconds to import the whole Population2 table.
9. Apache Hive
Now, with our testing data imported into our cluster, we are ready to test and compare the results. HDP provides a web UI that allows interactive browsing and use of the integrated tools. I have used Apache Hive, which is the standard for SQL queries in Hadoop; it uses the HiveQL language, very similar to SQL in terms of semantics and features. Beeswax is the UI for Hive included in HDP 2.1.
10. Comparing the Test Results
I ran the same query, SELECT first_name, last_name, email, country FROM MOCK_DATA GROUP BY country, first_name, last_name, email, using Beeswax and HiveQL, and these are the results:
The query took 869 seconds (about 14.5 minutes) to process the Population2 database when executing the query with the GROUP BY clause. The MapReduce aggregation running on the Hadoop HDP 2.1 cluster with 2 slave nodes represented a performance improvement of roughly 50% compared with the SQL Server query.
The second query, running on the Population database, took 12,559 seconds (about 3.5 hours) to complete the aggregations for the 1-billion-row dataset, representing a performance improvement of about 62%.
The advantage of Hive and Beeswax is that the results can be displayed on screen, and the data can be examined and copied directly from the query results window. In the SQL Server Management Studio case, the results had to be saved to a text file that can only be opened on a powerful machine able to load the file into memory in order to analyse it. It is important to note that the performance could be improved even further by adding more hosts to the Hadoop cluster.
11. Conclusion
We have shown that RDBMSs are not enough for medium and big data analysis; the scalability and performance needed for these purposes can be achieved by using the available big data tools, like Hadoop. The test carried out to demonstrate the weaknesses of the relational model and the differences in performance compared with Hadoop is based only on structured data stored in a relational way; analysing unstructured data is the real struggle for RDBMSs. An example of unstructured big data is the log files generated by the normal interactions of users visiting a website. To analyse these kinds of datasets using SQL Server, we would have to parse the log file and convert it into a structured relational table in order to query and analyse it. On the other hand, Hadoop and the HDP offer a very effective way to analyse log files using HDFS and HCatalog, which shares the schema with Pig and Hive within a Hadoop cluster and allows these tools to perform actions and queries on the stored data. The conversion takes place within HCatalog, which provides a relational way to represent unstructured data in the Hadoop Distributed File System.
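To make the contrast concrete, the parsing step that SQL Server would force on us can be sketched in a few lines of Python. The log layout here is the common Apache access-log format and the regular expression is illustrative, not a complete parser:

```python
import re

# Illustrative pattern for a simplified Apache "common log format" line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_line(line):
    """Turn one raw log line into a structured row (dict), or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = '10.0.0.1 - - [06/Aug/2014:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'
row = parse_line(sample)
print(row["ip"], row["path"], row["status"])
```

Every row produced this way would then have to be bulk-loaded into a relational table before SQL Server could query it, whereas the Hadoop tools can project a schema onto the raw files where they already sit in HDFS.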
There are many advantages to the Hadoop implementation for big data analysis; however, it is not always the best solution for all cases. Hadoop is meant to be used for offline transaction processing and analysis, not for real-time processing. In that case, the traditional RDBMS tools, or alternative NoSQL tools such as MongoDB, present a better solution in terms of performance for both small and large subsets of data. The alternative for medium and big enterprises worrying about the incremental growth of their datasets is to evaluate carefully the applications generating the data in real-time transactions, then enhance the performance of the data centres and transfer the data to be analysed to a distributed environment, where big data tools such as Hadoop can process it easily and efficiently, extracting the real value from it and improving decision making based on the results.
References
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Jacobs, A. (2009). The pathologies of big data. Communications of the ACM, 52(8), 36-44.
Apache Foundation (2014). What is Hadoop. Retrieved 7 Jun 2014 from http://hadoop.apache.org
Microsoft Knowledge Base (2014). KB2874903. Retrieved 24 Jul 2014 from http://support2.microsoft.com/kb/2874903
Ramakrishnan, R., & Gehrke, J. (2000). Database Management Systems. Osborne/McGraw-Hill. (p. 2)
Bartholomew, D. (2010). SQL vs. NoSQL. Linux Journal, 195, 54-59.
MongoDB Inc (2014). MongoDB Overview. Retrieved 30 Jul 2014 from http://www.mongodb.com/mongodb-overview
Apache Foundation (2014). Apache Hadoop NextGen MapReduce (YARN). Retrieved 30 Jul 2014 from http://hadoop.apache.org/docs/r2.5.1/hadoop-yarn/hadoop-yarn-site/YARN.html
Hortonworks (2014). Hortonworks Data Platform 2.1. Retrieved 31 Jul 2014 from http://hortonworks.com/
Apache Foundation (2014). Ambari Overview. Retrieved 31 Jul 2014 from http://ambari.apache.org/
Microsoft (2014). Microsoft SQL Server JDBC Driver. Downloaded from http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz
Apache Foundation (2014). Apache Sqoop. Retrieved 31 Jul 2014 from http://sqoop.apache.org/
Hortonworks (2013). Import from SQL Server into the HDP using Sqoop. Retrieved 6 Aug 2014 from http://hortonworks.com/hadoop-tutorial/import-microsoft-sql-server-hortonworks-sandbox-using-sqoop/
Apache Foundation (2014). Apache Hive. Retrieved 31 Jul 2014 from https://hive.apache.org/