Spark Application on 
AWS EC2 Cluster 
COEN241 - Cloud Computing 
Group 6 
Chao-Hsuan Shen, Patrick Loomis, John Geevarghese
Explanation 
Outline 
Naive Bayes’ Classifier Algorithm 
Amazon Web Service 
Spark 
RDD 
Demo
Raw Data of ImageNet 
n01484850_17,998 516 98 296 158 720 422 253 623 
n01484850_18,470 507 507 988 502 471 809 128 128 177 771 771 771 771 401 417 
n01491361_23,491 202 939 57 882 937 752 752 677 748 314 314 794 434 314 314 
729 771 104 629 725 771 771 961 771 771 794 314 909 894 551 689 130 684 576 
895 605 865 314 314 44 865 314 446 202 910 910 446 446 961 909 910 960 191 
314 446 446 321 744 752 957 882 
: 
: 
Key Features: 
● 1,000 classes/files 
● 10 million labeled images, depicting 10,000+ object categories 
○ (all done by hand)
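The raw records above appear to follow the pattern `<class id>_<image number>,<space-separated integer codes>`. The sketch below parses one such record under that assumption; the function name `parse_record`, the field names, and the reading of the integers as feature codes to be counted are illustrative and not taken from the slides. It also assumes each record sits on a single line (the long record above is wrapped only by the slide layout).

```python
from collections import Counter


def parse_record(line):
    """Split one raw line into (class_label, image_id, feature_counts).

    Assumed layout (not confirmed by the slides):
        <class id>_<image number>,<code> <code> <code> ...
    """
    key, _, codes = line.strip().partition(",")
    # The class id is everything before the last underscore of the key.
    label, _, image_id = key.rpartition("_")
    # Count how often each integer code occurs in the record.
    counts = Counter(int(tok) for tok in codes.split())
    return label, image_id, counts


label, image_id, counts = parse_record(
    "n01484850_17,998 516 98 296 158 720 422 253 623"
)
```

Counting the codes gives a bag-of-features representation per image, which is the form a multinomial Naive Bayes classifier consumes.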
AWS - Architecture 
[Diagram: one master node, four slave nodes, S3]
AWS - Instance & S3
AWS- Cluster 
● Standalone cluster 
./sbin/start-master.sh 
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT 
● EC2 cluster 
./spark-ec2 -k <key pair> -i <key file path> launch <cluster-name> 
./spark-ec2 -k <key pair> -i <key file path> login <cluster-name> 
./spark-ec2 -k <key pair> -i <key file path> start|stop <cluster-name> 
● On YARN or Mesos 
o Probably a better idea (buggy scripts) 
o For existing distributed data on an existing cluster 
o Amazon Elastic MapReduce
AWS- Cluster set-up roadmap 
1. Check identity (AWS key pair) 
2. Set up security group 
3. Launch slaves 
4. Launch master 
5. Set up keyless SSH (RSA) 
6. Install Apache Mesos (cluster manager) 
7. Install Spark, Shark, HDFS, Tachyon, and Ganglia on the cluster 
8. Run setup.sh on the master node
AWS- CloudWatch 
Master node metrics: CPU, disk read, disk write, network in, network out
AWS- Obstacles Faced 
● Keyless SSH setup 
“Failed to SSH to remote host ec2-54-213-154-133.us-west-2.compute.amazonaws.com.” 
When an instance reports that it is running, it may not be fully up yet. 
● Old instance type vs. new instance type 
o m1.large → worked well 
o m3.large → caused problems 
● S3 vs. HDFS 
o Loading a big file from S3 into the cluster is very slow; after 40 minutes of 
loading data from S3, we gave up 
● IaaS vs. PaaS 
o You update software yourself → e.g., an older version of the AWS CLI
Spark - Scala & Akka 
● Functional Programming + Object-Oriented 
1. every value is an object and every operation is a method-call 
2. first-class functions 
3. a library with efficient immutable data structures 
● Functional programming without being fully OO: 
fn2(*args2, fn1(*args1, fn0(*args0))) 
● Akka actors as the backbone of Spark
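The nested call `fn2(..., fn1(..., fn0(...)))` above is plain function composition. A minimal sketch in Python follows; the helper `compose` and the three small functions are hypothetical names chosen for illustration, not part of Spark, Scala, or Akka.

```python
from functools import reduce


def compose(*fns):
    """Return a function applying fns right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fns)


def tokenize(text):
    # fn0: split a line into tokens
    return text.split()


def to_ints(tokens):
    # fn1: convert each token to an integer
    return [int(t) for t in tokens]


def total(values):
    # fn2: reduce the list to a single number
    return sum(values)


# Equivalent to total(to_ints(tokenize(...))), i.e. fn2(fn1(fn0(...))).
pipeline = compose(total, to_ints, tokenize)
result = pipeline("998 516 98")
```

Each stage takes the previous stage's output as its input and shares no mutable state, which is the style that chained transformations on a dataset rely on.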
Spark- RDD
Why in-memory processing? 
Support batch, streaming, and interactive 
computations in a unified framework. 
Memory throughput matters
Spark- Cache
Spark- JVM vs Tachyon
Spark- Monitoring Cluster
Spark- Monitoring running App
Mahout vs MLlib 
Future 
Mahout had fairly straightforward operations 
● Sequence file → Sparse Matrix → TF-IDF → train model 
Spark MLlib did not provide all of the functions needed to classify 
● Text classification was initially implemented in MLI, but the stable 
release of MLlib did not include any of the functions needed 
● HashingTF, Word2Vec, and a few other functions are being implemented 
in an upcoming stable release - Spark 1.1.0 
If we had more time, we could contribute our own implementation 
of text classification using Multinomial Naive Bayes: 
TextDirectoryLoader → Tokenizer → Vectorizer → MNB training algorithm
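As a rough illustration of the Tokenizer and Vectorizer stages named above, here is a minimal sketch using the hashing trick. It is not Spark's implementation: the function names `tokenize` and `hashing_tf`, the regular expression, the CRC32 hash, and the default of 1,000 features are all assumptions made for this example.

```python
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashing_tf(tokens, num_features=1000):
    """Map each token to a bucket index by hashing and count occurrences.

    CRC32 is used because it is deterministic across runs, unlike
    Python's built-in hash() for strings.
    """
    counts = Counter()
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[index] += 1
    return counts


tokens = tokenize("Spark runs on EC2; Spark caches RDDs.")
vector = hashing_tf(tokens)
```

Hashing avoids building a vocabulary dictionary up front, at the cost of occasional collisions between different tokens.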
Mahout vs MLlib - Ecosystem 
Mahout is moving away from its MapReduce implementation 
and toward a DSL for linear algebraic operations 
● which are automatically optimized and executed in parallel on Spark 
● Unfortunately, NB is still in development for Spark 
Contributions to Spark are rising dramatically: 
                          June 2013    July 2014 
Total contributions       68           255 
Companies contributing    17           50 
Total lines of code       63,000       175,000
Fin
Naive Bayes’ Classifier (1) 
Two things of note: 
1. Conditional Probability 
i. What is the probability that something will happen, given that 
something else has already happened? 
ii. Example: 
2. Bayes’ Rule 
i. Predict an outcome given multiple pieces of evidence 
ii. ‘Uncouple’ the pieces of evidence and treat each piece of 
evidence as independent = naive Bayes 
iii. Multiply by the prior so that we give high probability to more common 
outcomes and low probability to unlikely outcomes. These priors are also 
called base rates, and they are a way to scale our predicted 
probabilities
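The two ideas above can be written out as follows. The symbols are notation assumed here for illustration: c is an outcome (class) and x_1, ..., x_n are the pieces of evidence. The first line is Bayes' rule; the second applies the naive independence assumption so that the joint likelihood factors into one term per piece of evidence.

```latex
\Pr(c \mid x_1, \dots, x_n)
  = \frac{\Pr(c)\,\Pr(x_1, \dots, x_n \mid c)}{\Pr(x_1, \dots, x_n)}

\Pr(c \mid x_1, \dots, x_n)
  \approx \frac{\Pr(c)\,\prod_{i=1}^{n} \Pr(x_i \mid c)}{\Pr(x_1, \dots, x_n)}
```

The factor Pr(c) is the prior (base rate) from item iii, and the denominator is the same for every outcome, so it does not change which outcome scores highest.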
Naive Bayes’ Classifier (2) - 
TF-IDF 
Convert the word counts of a document using the TF-IDF 
transformation before applying the learning algorithm. The 
resulting formula gives a weighted importance for each word 
● f = word frequency 
● D = number of documents 
● df = number of documents containing word under consideration
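The formula itself did not survive on this slide. One common form that uses exactly these three symbols is log(1 + f) × log(D / df); treating it as the formula the slide showed is an assumption. The sketch below computes that form, and the function name `tf_idf` is illustrative.

```python
import math


def tf_idf(f, D, df):
    """Weighted importance of a word, assuming log(1 + f) * log(D / df).

    f  : frequency of the word in the document
    D  : total number of documents
    df : number of documents containing the word (must be > 0)
    """
    if df <= 0 or D <= 0:
        raise ValueError("D and df must be positive")
    return math.log(1 + f) * math.log(D / df)


rare = tf_idf(3, 100, 10)    # word appears in 10 of 100 documents
common = tf_idf(3, 100, 50)  # word appears in 50 of 100 documents
```

With the same in-document frequency, the rarer word gets the larger weight, and a word that occurs in every document gets weight zero.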
Naive Bayes’ Classifier (3) 
Naive Bayes Multinomial 
● Class label denoted by c; n is the size of the vocabulary 
● Class Prior, Pr(c) = # of documents belonging to class c divided by the 
total # of documents 
● Likelihood, Pr(x|c) = Probability of obtaining a document like x in class c 
● Predictor Prior (Evidence), Pr(x) = count of a feature over the total count of 
features 
● MNB assigns a test document x to the class that has the highest 
probability Pr(c | x)
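The definitions above can be turned into a small classifier. This is a minimal sketch, not the implementation used in the project: the names `train_mnb` and `predict`, the toy data, and the add-one (Laplace) smoothing with parameter `alpha` are assumptions. Log probabilities are used to avoid underflow, and the evidence term Pr(x) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, alpha=1.0):
    """Train on a list of (label, Counter of feature counts).

    Returns (log_prior, log_likelihood), both keyed by class label.
    alpha is an assumed smoothing constant (add-one by default).
    """
    class_docs = Counter(label for label, _ in docs)
    feature_totals = defaultdict(Counter)
    vocabulary = set()
    for label, counts in docs:
        feature_totals[label].update(counts)
        vocabulary.update(counts)
    n = len(vocabulary)  # size of the vocabulary

    # Class prior: documents in class c divided by all documents.
    log_prior = {c: math.log(k / len(docs)) for c, k in class_docs.items()}

    # Smoothed per-class feature likelihoods.
    log_likelihood = {}
    for c in class_docs:
        total = sum(feature_totals[c].values())
        log_likelihood[c] = {
            w: math.log((feature_totals[c][w] + alpha) / (total + alpha * n))
            for w in vocabulary
        }
    return log_prior, log_likelihood


def predict(model, counts):
    """Return the class with the highest log Pr(c) + sum of count * log Pr(w | c)."""
    log_prior, log_likelihood = model

    def score(c):
        # Features never seen in training are ignored.
        return log_prior[c] + sum(
            k * log_likelihood[c][w]
            for w, k in counts.items()
            if w in log_likelihood[c]
        )

    return max(log_prior, key=score)


docs = [
    ("shark", Counter({998: 2, 516: 1})),
    ("shark", Counter({998: 1, 98: 1})),
    ("ray", Counter({202: 2, 314: 3})),
]
model = train_mnb(docs)
```

Calling `predict(model, Counter({998: 1}))` scores each class and returns the label with the highest value, which mirrors the last bullet above.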

 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Spark application on ec2 cluster

  • 1. Spark Application on AWS EC2 Cluster
       COEN241 - Cloud Computing, Group 6
       Chao-Hsuan Shen, Patrick Loomis, John Geevarghese
  • 2. Explanation Outline
       ● Naive Bayes’ Classifier Algorithm
       ● Amazon Web Services
       ● Spark
       ● RDD
       ● Demo
  • 3. Raw Data of ImageNet
       n01484850_17,998 516 98 296 158 720 422 253 623
       n01484850_18,470 507 507 988 502 471 809 128 128 177 771 771 771 771 401 417
       n01491361_23,491 202 939 57 882 937 752 752 677 748 314 314 794 434 314 314 729 771 104 629 725 771 771 961 771 771 794 314 909 894 551 689 130 684 576 895 605 865 314 314 44 865 314 446 202 910 910 446 446 961 909 910 960 191 314 446 446 321 744 752 957 882
       :
       :
       Key Features:
       ● 1,000 classes/files
       ● 10 million labeled images, depicting 10,000+ object categories
         ○ (all done by hand)
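To make the record layout above concrete, here is a minimal parsing sketch. It assumes (the slide does not say so explicitly) that the text before the comma is `<class id>_<image index>` and that the space-separated numbers are feature ids whose repetitions act as counts. The function name `parse_record` is hypothetical.

```python
from collections import Counter


def parse_record(line):
    """Split 'classid_index,f1 f2 ...' into (class label, image index, feature counts).

    Assumption: repeated feature ids in a record are occurrence counts.
    """
    key, _, features = line.strip().partition(",")
    class_label, _, image_index = key.rpartition("_")
    counts = Counter(int(token) for token in features.split())
    return class_label, image_index, counts


# Second sample record from the slide.
sample = "n01484850_18,470 507 507 988 502 471 809 128 128 177 771 771 771 771 401 417"
label, image_index, counts = parse_record(sample)
print(label, image_index, counts[771])
```

Storing each record as a sparse count map rather than a dense vector keeps the representation small, which matters when most feature ids never occur in a given image.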
  • 4. AWS - Architecture
       [Diagram: one master node, four slave nodes, S3]
  • 6. AWS - Cluster
       ● Standalone cluster
           ./sbin/start-master.sh
           ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
       ● EC2 cluster
           ./spark-ec2 -k <key pair> -i <key file path> launch <cluster-name>
           ./spark-ec2 -k <key pair> -i <key file path> login <cluster-name>
           ./spark-ec2 -k <key pair> -i <key file path> start|stop <cluster-name>
       ● On YARN and Mesos
           o Probably a better idea (buggy scripts)
           o For existing distributed data on an existing cluster
           o Amazon Elastic MapReduce
  • 7. AWS - Cluster set-up roadmap
       1. Check identity (AWS key pair)
       2. Set up the security group
       3. Launch the slaves
       4. Launch the master
       5. Set up keyless SSH (RSA)
       6. Install Apache Mesos (cluster manager)
       7. Install Spark, Shark, HDFS, Tachyon, and Ganglia on the cluster
       8. Run setup.sh on the master node
  • 8. AWS - CloudWatch
       [Charts for the master node: CPU, Disk Read, Disk Write, Network In, Network Out]
  • 9. AWS - Obstacles Faced
       ● Keyless SSH setup
           “Failed to SSH to remote host ec2-54-213-154-133.us-west-2.compute.amazonaws.com.”
           When an instance reports that it is running, it is not actually fully up yet
       ● Older instance type vs. newer instance type
           o m1.large → worked well
           o m3.large → problems
       ● S3 vs. HDFS
           o Loading a big file from S3 into the cluster is very slow: after 40 minutes of loading data from S3, I quit
       ● IaaS vs. PaaS
           o You update software yourself → older version of the AWS CLI
  • 10. Spark - Scala & Akka
       ● Functional programming + object-oriented
           1. Every value is an object and every operation is a method call
           2. First-class functions
           3. A library with efficient immutable data structures
       ● Functional programming without full OO
           fn2(*args2, fn1(*args1, fn0(*args0)))
       ● Akka actors as the backbone of Spark
  • 12. Why in-memory processing?
       Support batch, streaming, and interactive computations in a unified framework.
       Memory throughput matters.
  • 14. Spark - JVM vs Tachyon
  • 17. Mahout vs MLlib - Future
       Mahout had fairly straightforward operations
       ● Sequence file → Sparse Matrix → TF-IDF → train model
       Spark MLlib did not provide all of the functions needed to classify
       ● Text classification was initially implemented in MLI, but the stable release of MLlib did not include any of the functions needed
       ● HashingTF, Word2Vec, and a few other functions are being implemented in an upcoming stable release, Spark 1.1.0
       If we had more time, we could contribute our own implementation of text classification using multinomial naive Bayes:
       ● TextDirectoryLoader
       ● Tokenizer
       ● Vectorizer
       ● MNB Training Algorithm
  • 18. Mahout vs MLlib - Ecosystem
       Mahout is moving away from its MapReduce implementation and toward a DSL for linear algebraic operations
       ● which are automatically optimized and executed in parallel on Spark
       ● Unfortunately, NB is still in development for Spark
       Contributions to Spark are rising dramatically:

                                  June 2013    July 2014
       Total Contributions        68           255
       Companies Contributing     17           50
       Total lines of code        63,000       175,000
  • 19.
  • 20. Fin
  • 21. Naive Bayes’ Classifier (1)
       Two things of note:
       1. Conditional probability
           i. What is the probability that something will happen, given that something else has already happened?
           ii. Example:
       2. Bayes’ rule
           i. Predict an outcome given multiple pieces of evidence
           ii. ‘Uncouple’ multiple pieces of evidence and treat each piece of evidence as independent = naive Bayes
           iii. Multiply by the prior so that we give high probability to more common outcomes and low probability to unlikely outcomes. Priors are also called base rates, and they are a way to scale our predicted probabilities
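As a rough illustration of the "uncouple the evidence, then multiply by the prior" idea, the sketch below scores each outcome as prior times the product of per-evidence likelihoods, then normalizes. All class names, evidence names, and probabilities are made up for the example and do not come from the slides.

```python
from fractions import Fraction


def naive_bayes_posteriors(priors, likelihoods, evidence):
    """Return P(outcome | evidence), treating each piece of evidence as independent.

    priors:      {outcome: P(outcome)}
    likelihoods: {outcome: {evidence_item: P(evidence_item | outcome)}}
    """
    scores = {}
    for outcome, prior in priors.items():
        score = prior  # the base rate scales everything that follows
        for item in evidence:
            score *= likelihoods[outcome][item]  # naive independence assumption
        scores[outcome] = score
    total = sum(scores.values())  # normalizing constant (the evidence term)
    return {outcome: score / total for outcome, score in scores.items()}


# Hypothetical numbers: class_a is rarer, but both pieces of evidence favor it.
priors = {"class_a": Fraction(1, 4), "class_b": Fraction(3, 4)}
likelihoods = {
    "class_a": {"e1": Fraction(1, 2), "e2": Fraction(2, 5)},
    "class_b": {"e1": Fraction(1, 10), "e2": Fraction(1, 5)},
}
post = naive_bayes_posteriors(priors, likelihoods, ["e1", "e2"])
print(post)
```

Exact fractions are used only so the arithmetic is easy to check by hand; with real data, floating-point log probabilities are the usual choice.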
  • 22. Naive Bayes’ Classifier (2) - TF-IDF
       Convert the word counts of a document using the TF-IDF transformation before applying the learning algorithm. The resulting formula gives a weighted importance of each word
       ● f = word frequency
       ● D = number of documents
       ● df = number of documents containing the word under consideration
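The formula itself did not survive in the transcript. The sketch below assumes one common form of the weighting, f · log(D / df), using the symbols defined on the slide; the authors may have used a different variant (for example, a log-scaled term frequency). The helper names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(f, D, df):
    """Assumed weighting: word frequency times log inverse document frequency."""
    return f * math.log(D / df)


def transform(docs):
    """Re-weight a list of {word: count} documents with tf_idf."""
    D = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # each document counts once per distinct word
    return [{w: tf_idf(f, D, df[w]) for w, f in doc.items()} for doc in docs]


# Two tiny hypothetical documents keyed by feature id.
docs = [Counter({771: 4, 128: 2}), Counter({771: 1, 314: 5})]
weighted = transform(docs)
print(weighted)
```

A word that appears in every document (771 here) gets weight zero under this form, which is the intended effect: it carries no information for telling classes apart.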
  • 23. Naive Bayes’ Classifier (3) - Multinomial Naive Bayes
       ● Class label denoted by c; n is the size of the vocabulary
       ● Class prior, Pr(c) = number of documents belonging to class c divided by the total number of documents
       ● Likelihood, Pr(x|c) = probability of obtaining a document like x in class c
       ● Predictor prior (evidence), Pr(x) = count of a feature over the total count of features
       ● MNB assigns a test document x to the class that has the highest probability Pr(c | x)
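The quantities above can be put together in a small multinomial naive Bayes sketch. This is an illustration, not the authors' implementation: it assumes add-one (Laplace) smoothing, which the slide does not mention, works in log space to avoid underflow, and drops Pr(x) because it is the same for every class. The training data and the vocabulary size of 1,000 are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, n):
    """docs: list of (class label, Counter of feature counts); n: vocabulary size."""
    class_docs = Counter()                 # documents per class, for Pr(c)
    feature_counts = defaultdict(Counter)  # per-class feature totals, for Pr(x|c)
    for label, counts in docs:
        class_docs[label] += 1
        feature_counts[label].update(counts)
    return class_docs, feature_counts, n


def predict(model, x):
    """Return the class c with the highest log Pr(c) + log Pr(x|c)."""
    class_docs, feature_counts, n = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in class_docs:
        score = math.log(class_docs[c] / total_docs)  # class prior
        denom = sum(feature_counts[c].values()) + n   # add-one smoothing (assumed)
        for w, f in x.items():
            score += f * math.log((feature_counts[c][w] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best


training = [
    ("n01484850", Counter({771: 4, 128: 2})),
    ("n01491361", Counter({314: 5, 446: 3})),
]
model = train(training, n=1000)
predicted = predict(model, Counter({314: 2, 771: 1}))
print(predicted)
```

Smoothing keeps a feature that was never seen in a class from forcing that class's probability to zero, which would otherwise happen often with sparse feature counts.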