5/25/2015
By | Danyal, Baqir and Shoaib
IBA-FCS
INTRODUCTORY DATA ANALYTICS WITH APACHE
SPARK
Table of Contents
Setting Up Network of Oracle VirtualBox
Setup Spark on Windows 7 Standalone Mode
Setup Spark on Linux
Spark Features
Spark Standalone Cluster Mode
Setup Master Node
Starting Cluster Manually
Starting & Connecting Worker to Master
Submitting Applications to Spark
Interactive Analysis with the Spark Shell
Spark Submit
Custom Code Execution & HDFS File Reading
Setting up a network of Oracle VirtualBox:
1. When starting the VM, set the "Attached to" value to Bridged Adapter (so that it can connect to other
VMs on the network that reside on different host machines).
2. Refresh the MAC address by clicking the refresh icon, so that every VM has a different MAC and is
assigned a unique IP.
Figure 1: VM setting to follow when creating the network
3. After applying the above settings and starting the VM, make sure you are able to ping from host to VM
and vice versa.
Setup SPARK on Windows 7 Standalone Mode:
Prerequisites:
 Java 6+
 Scala 2.10
 Python 2.6+
 Spark 1.2.x
 sbt (in case you are building Spark from source)
 Git (if you use the sbt tool)
Environment Variables:
 Set JAVA_HOME and PATH as environment variables.
 Download Scala 2.10 and install it.
 Set SCALA_HOME and add %SCALA_HOME%\bin to the PATH variable in environment variables. To test
whether Scala is installed or not, run the following command.
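The command referred to here appears only as a screenshot in the original; the usual check (an assumption on our part, not the original screenshot) is simply:

scala -version

which should print something like "Scala code runner version 2.10.x" if SCALA_HOME and PATH are set correctly.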
Downloading & Setting up Spark:
 Choose a Spark prebuilt package for Hadoop, e.g. "Pre-built for Hadoop 2.3/2.4 or later". Download and
extract it to any drive, e.g. D:\spark-1.2.1-bin-hadoop2.3
 Set SPARK_HOME and add %SPARK_HOME%\bin to PATH in environment variables
 Download winutils.exe and place it in any location (e.g. D:\winutils\bin\winutils.exe) to avoid
Hadoop errors
 Set HADOOP_HOME = D:\winutils in environment variables
 Now run the command "spark-shell"; you will see the Scala shell.
 For the Spark UI, open http://localhost:4040/ in a browser.
 Press Ctrl+Z to get out of the shell when it has executed successfully.
 To test that the setup was successful, you can run the example shown below:
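The example itself appears only as a screenshot in the original; a common smoke test (our assumption, not necessarily the original's exact example) is to run the bundled SparkPi example from the Spark directory:

bin\run-example.cmd SparkPi 10

If the installation is correct, this prints an approximation of Pi to the console.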
If all goes well, this will execute the sample program and return the result on the console.
And that is how you set up Spark on Windows 7.
Note that spark://IP:7077 is the master URL used to connect workers, while the master web UI is served
at http://IP:8080 (see the cluster section below).
Setup SPARK on Linux:
 Download the Hortonworks Sandbox (HDP 2.2.4), with Spark installed and configured, from the following link:
http://hortonworks.com/hdp/downloads/
 You can find Spark in this directory:
o /usr/hdp/2.2.4.2-2/spark/bin
You are now ready to go, since Hortonworks ships this setup pre-packaged.
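As a quick sanity check (our own suggestion, not a step from the original), you can launch the shell from that directory; the prompt should come up with an sc context available:

cd /usr/hdp/2.2.4.2-2/spark
./bin/spark-shell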
SPARK Features:
 Speed - Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory
computing.
 Ease of use - Write apps quickly in Java, Scala or Python. Spark offers more than 80 high-level operators
that make it easy to build parallel apps.
 Generality - Combines SQL, streaming and complex analytics, powering its stack with high-level tools
including Spark SQL, MLlib (for machine learning), GraphX and Spark Streaming.
 Runs everywhere - Spark runs on Hadoop, in standalone mode, or in the cloud, and can access diverse
data sources including HDFS, Cassandra, HBase and S3.
SPARK Standalone Cluster Mode:
 There are two other Spark deployment modes, YARN and Mesos, but here we will only discuss
standalone mode.
 Spark is already installed in standalone mode on the Hortonworks VM node. Simply acquire a
pre-built version of Spark for any future deployments.
Setup Master Node
1. Edit the /etc/hosts file with the vi editor.
2. Edit the file as in the sketch after this list; here we have two slaves (which must be set up
the same way as this VM, and the same entries must be added to /etc/hosts on the slaves
too).
3. Then change the master machine's hostname to master, and the slave machines' hostnames
to slave1 and slave2 respectively.
4. There is one more additional step, needed so that the launch scripts can find the slave
machines: create a file called conf/slaves in your Spark directory, which must contain the
hostnames of all the machines where you intend to start Spark workers, one per
line. If conf/slaves does not exist, the launch scripts default to a single machine
(localhost), which is useful for testing only.
5. If everything has gone according to plan, your multi-node Spark cluster setup over the
network is up and running; you can verify it by pinging the other machines from every
machine. In our case there were three machines (1 master, 2 slaves).
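The screenshots of the edited files are not reproduced here; the layout below is a sketch with placeholder IP addresses (not the addresses used in the original report). /etc/hosts on every machine would contain entries such as:

192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2

and conf/slaves on the master would simply list the worker hostnames, one per line:

slave1
slave2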
Starting Cluster Manually
 You can start a standalone master server by executing the following:
./sbin/start-master.sh
 Once started, the master will print out its master URL, i.e.
spark://HOST:PORT, which will be used to connect workers to it.
 This URL can also be found on the web UI, whose default URL is http://localhost:8080
 sbin/start-slaves.sh - Starts a slave instance on each machine specified in
the conf/slaves file.
 sbin/start-all.sh - Starts both a master and a number of slaves as described above.
 sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
 sbin/stop-slaves.sh - Stops all slave instances on the machines specified in
the conf/slaves file.
 sbin/stop-all.sh - Stops both the master and the slaves as described above.
Starting & Connecting Worker to Master
 You can start worker(s) and connect them to the master via this command:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://HOST:PORT
 Once you have started a worker, look at the master's web UI (http://localhost:8080 by
default). You should see the new node listed there, along with its number of CPUs and
memory (minus one gigabyte left for the OS).
 Once you have connected the workers to the master successfully (check this by browsing the web UI;
the slave machines will be shown there, as in Figure 2), any task you run from the master will
be distributed to all the available machines in the network to be worked on in parallel.
Figure 2 : Master Web UI
Submitting Applications to Spark
There are two ways to do this.
Interactive Analysis with the Spark Shell
 Pyspark
./bin/pyspark --master spark://IP:port
 Spark-shell
./bin/spark-shell --master spark://IP:port
There are many other parameters that can be passed with the above commands; links to the full
documentation are given at the end of this document. Running the pyspark or spark-shell command opens
an interactive shell where you can write code line by line, pressing Enter after each, for example:
textFile = sc.textFile("README.md")
textFile.count() # Number of items in this RDD
textFile.first() # First item in this RDD
where 'sc' is the SparkContext object, which is made available by Spark when you run either the
pyspark or spark-shell command. Behind the scenes, spark-shell invokes the more
general spark-submit script.
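Continuing in the same pyspark shell, transformations can be chained before an action triggers the actual computation; the snippet below is a small illustrative sketch building on the textFile variable above (not taken from the original report):

linesWithSpark = textFile.filter(lambda line: "Spark" in line)  # lazy transformation
linesWithSpark.count()  # action: triggers the distributed computation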
Spark Submit
Once a user application is bundled, it can be launched using the bin/spark-submit script. This
script takes care of setting up the classpath with Spark and its dependencies, and can support
different cluster managers and deploy modes that Spark supports:
For a simple example, from inside the Spark folder execute something like this:
./bin/spark-submit --master spark://master:7077 k-means.py
where --master specifies where to submit the app, followed by the name of the file to run, written in
Scala, Java or Python.
Template for the spark-submit command:
./bin/spark-submit
--class <main-class>
--master <master-url>
--deploy-mode <deploy-mode>
--conf <key>=<value>
... # other options
<application-jar>
[application-arguments]
Some of the commonly used options are:
 --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
 --master: The master URL for the cluster (e.g. spark://<hostname or IP>:7077)
 --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an
external client (client) (default: client)
 --conf: Arbitrary Spark configuration property in key=value format. For values that contain
spaces wrap “key=value” in quotes (as shown).
 application-jar: Path to a bundled jar including your application and all dependencies. The URL
must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
 application-arguments: Arguments passed to the main method of your main class, if any
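Putting these options together, a full invocation against the standalone master described earlier might look like the following; the SparkPi class is the one named in the option description above, while the jar path, config value and argument are illustrative placeholders, not values from the original report:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  --deploy-mode client \
  --conf "spark.executor.memory=1g" \
  lib/spark-examples*.jar \
  100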
Custom code execution & HDFS File reading
 To read or access a text file inside your Python or Scala programs on the cluster, the file must reside in HDFS;
only then will sc.textFile("file.txt") be able to read it and access the content. See the
commands below as a sample to create an HDFS directory and put a file in it:
 hadoop fs -mkdir hdfs://master/user/hadoop/spark/data/
 to create a directory inside HDFS with the given path URI
 hadoop fs -ls hdfs://master/user/hadoop/spark/data/
 to view the directory contents inside HDFS with the given path URI
 hadoop fs -put home/sample.txt hdfs://master/user/hadoop/spark/data/
 to upload a file onto HDFS from the local Linux directory (home/ in this case)
 hadoop fs -get hdfs://master/user/hadoop/spark/data/sample.txt home/
 to download a file from the HDFS directory to the local Linux directory (home/ in this case)
The sample code below reads a file from an HDFS directory and clusters it with k-means:
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
import time

start_time = time.time()

sc = SparkContext(appName="K means")

# Load and parse the data
data = sc.textFile("hdfs://master/user/hadoop/spark/data/kmeans.csv")
header = data.first()
parsedData = data.filter(lambda x: x != header).map(lambda line: array([float(x) for x in line.split(',')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=100,
                        runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
print("--- %s seconds ---" % (time.time() - start_time))
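To run this script on the cluster, save it to a file and submit it with spark-submit as described above; the filename kmeans_wssse.py is just a placeholder, not the name used in the original report:

./bin/spark-submit --master spark://master:7077 kmeans_wssse.py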
Aggregate Function Benchmarking

Query | Windows | Spark Cluster
0.8M records - Total amount spent on all transactions per customer | 30 secs | 7 secs
10M records - Transaction count | 167 secs | 8 secs
10M records - Total amount sum | 106 secs | 10 secs
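As an illustration of the kind of aggregate query benchmarked above, a minimal PySpark sketch of the per-customer total is shown below; the file path and the assumption that the CSV holds customer_id and amount in its first two columns are ours, not details given in the original:

transactions = sc.textFile("hdfs://master/user/hadoop/spark/data/transactions.csv")
pairs = transactions.map(lambda line: line.split(',')) \
                    .map(lambda fields: (fields[0], float(fields[1])))  # (customer_id, amount)
totals = pairs.reduceByKey(lambda a, b: a + b)  # total amount spent per customer
totals.take(5)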
K-Means Clustering Benchmarking

Cluster config | Windows | Spark Cluster
K=4, iter=100, Python, rows=1048576 | 82 secs | 25 secs
K=4, iter=1000, Python, rows=1048576 | 800 secs | 31 secs
K=4, iter=1000, Scala, rows=1048576 | - | 13 secs
References:
1. https://spark.apache.org/docs/1.3.0/index.html
2. http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
3. https://spark.apache.org/docs/1.3.0/spark-standalone.html
4. https://spark.apache.org/docs/1.3.0/quick-start.html
5. https://spark.apache.org/docs/1.3.0/submitting-applications.html