Big Data Analysis using SparkR
Abstract:
R is a popular statistical programming language for analysis, graphical representation, and reporting,
with a large number of packages for machine learning, text mining, and graph analysis. However,
interactive R runs in a single thread, which limits its processing speed and makes it difficult to handle
and process very large amounts of data. SparkR arose to overcome this limitation: it connects the R shell
to Spark's distributed computation engine and so enables large-scale data analysis from R. Our goal is to
describe R, Spark, and SparkR, and to present SparkR commands for data manipulation and data exploration
using the DataFrame API and related libraries.
Introduction:
Data is growing rapidly. As of 2012, about 2.5 exabytes of data were created each day, and that number
was doubling every 40 months [10]. Many companies generate petabytes of data in a single data set, and
not just from the internet. For instance, it is estimated that Walmart collects more than 2.5 petabytes of
data every hour from its customer transactions. Facebook alone holds 300 petabytes of Hive data, growing
by 600 TB every day [9]. Many tools exist for analyzing data, and R is one of the most popular among data
scientists. It supports structured data processing through data frames and offers a large number of
statistical and visualization packages.
However, analysis in R is limited by the memory available on a single computer and by single-threaded
execution. In this paper we therefore explore SparkR, which processes large amounts of data by taking
advantage of parallel processing across a cluster. Spark provides libraries for SQL querying, distributed
machine learning, and graph analytics, and SparkR exposes part of this functionality to R users.
Background:
R programming
R is a programming language for statistical analysis, graphical representation, and reporting. It was
created by Ross Ihaka and Robert Gentleman at the University of Auckland and is closely related to the S
language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John
Chambers and colleagues. Because it is free, open source, powerful, and highly extensible, R has become
widely used in the data analysis field. It provides a wide variety of statistical packages for linear and
nonlinear modelling, classical statistical tests, time series analysis, classification, clustering, and
more. At the time of writing, more than nine thousand packages are available, supported by a growing
developer community. Data frames and matrices help users handle and operate on data effectively, and
graphical facilities for display on screen or on hardcopy help users analyze data more precisely. R also
provides programming constructs such as conditionals, loops, user-defined recursive functions, and input
and output facilities.
R supports not only numerical computation but also structured data processing through data frames. A data
frame is a tabular data structure made up of columns stored as vectors, which can hold numerical as well
as categorical values. Data frames make it easy to filter, summarize, and sort data. Packages such as
dplyr, data.table, reshape2, readr, tidyr, and lubridate help greatly in data exploration.
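As a small illustration of the filtering, summarizing, and sorting described above, the following sketch
uses only base R and the mtcars data set that ships with R. The column choice and the 100 horsepower
threshold are assumptions made for this example and do not come from the original text.

```r
# Base R only; mtcars is a data frame bundled with R.
cars <- mtcars[, c("mpg", "cyl", "hp")]

# Filtering: keep cars with more than 100 horsepower
powerful <- cars[cars$hp > 100, ]

# Summarizing: mean miles per gallon for each cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = powerful, FUN = mean)

# Sorting: order the summary by descending mean mpg
mpg_by_cyl <- mpg_by_cyl[order(-mpg_by_cyl$mpg), ]
print(mpg_by_cyl)
```

The same three steps reappear later with SparkR functions (filter, groupBy with summarize, and arrange),
where they run on a distributed data frame instead of a local one.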
Apache Spark
Apache Spark is a powerful open-source engine for processing large amounts of data that combines ease of
use with sophisticated analysis tools. It started as a research project at UC Berkeley in the AMPLab,
which focuses on big data analytics. MapReduce is inefficient for multi-pass applications that require
low-latency data sharing across parallel operations. Spark overcomes this shortcoming and has become well
known for its parallel processing. The Spark project first introduced Resilient Distributed Datasets
(RDDs), an API for fault-tolerant computation in a cluster computing environment. Among its major
features, Spark includes libraries such as MLlib for machine learning and GraphX for graph algorithms
such as PageRank, and it can keep data in memory, which speeds up processing and querying across a
cluster. Because these libraries are closely integrated with the core API, Spark enables complex
workflows: for example, SQL queries can be used to pre-process data, and the results can then be analyzed
using advanced machine learning and graph algorithms.
Apache SparkR
SparkR is a lightweight frontend on top of Apache Spark. It was started in the AMPLab to explore
combining the usability of R with the scalability of Spark, and it was first open sourced in January
2014. As of Spark 2.0.2, SparkR provides a distributed data frame implementation that supports data
exploration operations such as selection, filtering, aggregation, and summarization, along with access
to Spark's distributed machine learning library.
Benefits of SparkR Integration:
Because SparkR is tightly integrated with Spark, it inherits many of Spark's benefits:
Data Source API:
SparkR's data source API enables users to load data easily from a variety of big data sources such as
HBase, Cassandra, Hive tables, JSON files, and Parquet files.
Data Frame Optimization:
The SparkR DataFrame is optimized in terms of memory management and code generation. The chart shows the
runtime of a group-by aggregation on 10 million integer pairs on a single machine in R, Python, and
Scala [1]. It shows that SparkR performance is similar to that of Scala and Python.
Scalability over cluster machine:
Queries and operations executed on a SparkR DataFrame are automatically distributed across all the cores
and machines available in the Spark cluster. Terabytes of data can therefore be processed and analyzed on
clusters with thousands of machines in a fraction of the time that a single machine would need.
DataFrame in SparkR:
SparkR can create data frames from various sources, including local R data frames, Hive tables, and JSON,
CSV, and Parquet files.
From local data:
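# load SparkR and start a session
library(SparkR)
sparkR.session()
# create a SparkDataFrame from R's built-in iris data set
df <- as.DataFrame(iris)
# show the first rows (head() returns 6 by default)
head(df)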
From data source:
SparkR supports CSV, JSON, and Parquet files natively. Other formats, such as Avro, can be loaded through
third-party data source connectors.
# start SparkR with the Avro package
sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
# read a JSON file
dataCar <- read.df("car.json", "json")
head(dataCar)
# print the schema inferred from the JSON file
printSchema(dataCar)
# multiple JSON files can also be loaded at once
dataCars <- read.json(c("Car1.json", "Car2.json"))
# read a CSV file; carcsv holds the path to the file (hypothetical path shown here)
carcsv <- "car.csv"
data.car <- read.df(carcsv, "csv", header = "true", inferSchema = "true",
                    na.strings = "NA")
Load from Hive table:
SparkR has built-in Hive support, so a Spark session created with Hive support can access Hive tables.
# start a SparkR session (Hive support is enabled by default when available)
sparkR.session()
# create a table
sql("CREATE TABLE IF NOT EXISTS tb_src (key INT, value STRING)")
# load data into the Hive table
sql("LOAD DATA LOCAL INPATH 'data_kv.txt' INTO TABLE tb_src")
# query using HiveQL; the result is a SparkDataFrame
results <- sql("FROM tb_src SELECT key, value")
# show the first rows of the results data frame
head(results)
Data Exploration:
SparkR supports data exploration operations such as selecting, filtering, grouping, aggregating, and
summarizing.
Selecting:
# the examples below assume df is a SparkDataFrame with age and gender columns
# select the age and gender columns
head(select(df, df$age, df$gender))
# keep rows with age > 20, then select age and gender only
head(select(filter(df, df$age > 20), df$age, df$gender))
Grouping and Aggregating:
# group by age and count the rows in each group
group_by_age <- summarize(groupBy(df, df$age), count = n(df$age))
head(group_by_age)
# sort the groups by descending count
head(arrange(group_by_age, desc(group_by_age$count)))
Data visualization using ggplot2:
ggplot2 works on local R data frames, so results computed in SparkR can be visualized once they have been
collected to the driver.
# install and load ggplot2
install.packages("ggplot2")
library(ggplot2)
# read a CSV file into a SparkDataFrame
data <- read.df("data.csv", source = "csv", header = "true", inferSchema = "true")
# group by gender, compute the average age, and collect the result into a local data frame
summary_by_age <- collect(
  agg(
    groupBy(data, "gender"),
    avg_age = avg(data$age)
  )
)
head(summary_by_age)
# plot the average age per gender; geom_col() uses the y values as bar heights
ggplot(summary_by_age, aes(x = gender, y = avg_age)) + geom_col()
K-Means Model
K-Means is a widely used clustering algorithm that divides data into a chosen number of clusters. The
number of clusters must be selected in advance, after which we can examine how well the data fit into
the resulting clusters.
# Fit a k-means model with spark.kmeans
irisDF <- createDataFrame(iris)
kmeansDF <- irisDF
kmeansTestDF <- irisDF
kmeansModel <- spark.kmeans(kmeansDF, ~ Sepal_Length + Sepal_Width +
Petal_Length + Petal_Width,
k = 5)
# Model summary
summary(kmeansModel)
# Get fitted result from the k-means model
showDF(fitted(kmeansModel))
# Prediction
kmeansPredictions <- predict(kmeansModel, kmeansTestDF)
showDF(kmeansPredictions)
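The choice of cluster number mentioned above can be explored on a small local sample before fitting a
model on the cluster. The following is a minimal sketch that uses base R's kmeans() on the local iris
data rather than spark.kmeans; the range of k values, the seed, and the nstart setting are assumptions
made for illustration.

```r
# Base R only: compare the total within-cluster sum of squares for several k.
set.seed(42)  # k-means starts from random centers, so fix the seed
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

ks <- 2:6
wss <- sapply(ks, function(k) {
  # nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

# A smaller value means tighter clusters; look for the k where the drop levels off
print(data.frame(k = ks, total_within_ss = round(wss, 1)))
```

The k at which the decrease flattens is a common heuristic for a reasonable cluster count, which can then
be passed as the k argument of spark.kmeans.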
Conclusion:
In summary, SparkR brings R-style analysis to Spark and takes advantage of Spark's ability to process and
analyze large amounts of data. Spark offers libraries such as MLlib and GraphX for machine learning and
graph processing. SparkR can also be used for summarization, aggregation, filtering, and visualization to
gain quick insight into data and find patterns in it.
References:
[1]. SparkR: Scaling R Programs with Spark https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[2]. SparkR: Scaling R Programs with Spark https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[3]. Spark Research http://spark.apache.org/research.html
[4]. Announcing SparkR: R on Apache Spark https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
[5]. Do Faster Data Manipulation using These 7 R Packages https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/
[6]. SparkR and Sparking Water https://rpubs.com/wendyu/sparkr
[7]. Exploring geographical data using SparkR and ggplot2 https://www.codementor.io/spark/tutorial/exploratory-geographical-data-using-sparkr-and-ggplot2
[8]. Plot http://skku-skt.github.io/ggplot2.SparkR/plot-types
[9]. Scaling the Facebook data warehouse to 300 PB https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[10]. Big Data: The Management Revolution https://hbr.org/2012/10/big-data-the-management-revolution

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
âž„đŸ” 7737669865 đŸ”â–» Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
âž„đŸ” 7737669865 đŸ”â–» Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...âž„đŸ” 7737669865 đŸ”â–» Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
âž„đŸ” 7737669865 đŸ”â–» Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
âž„đŸ” 7737669865 đŸ”â–» Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
âž„đŸ” 7737669865 đŸ”â–» Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...âž„đŸ” 7737669865 đŸ”â–» Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
âž„đŸ” 7737669865 đŸ”â–» Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Just Call Vip call girls Mysore Escorts ☎9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎9352988975 Two shot with one girl (...
 
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
 
Just Call Vip call girls Erode Escorts ☎9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎9352988975 Two shot with one girl (E...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Just Call Vip call girls Palakkad Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎9352988975 Two shot with one girl...
 
âž„đŸ” 7737669865 đŸ”â–» mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
âž„đŸ” 7737669865 đŸ”â–» mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...âž„đŸ” 7737669865 đŸ”â–» mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
âž„đŸ” 7737669865 đŸ”â–» mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Hsr Layout ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 đŸ„” Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 đŸ„” Book Your One night Stand
 
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 

Big data analysis using spark r published

Big Data Analysis using SparkR

Abstract:
R is a popular statistical programming language for analysis, graphical representation and reporting, with a large number of packages for machine learning, text mining and graph analysis. However, interactive R is single-threaded, which limits its processing speed and makes it difficult to handle very large data sets. SparkR addresses this limitation with a distributed computation engine that enables large-scale data analysis from the R shell. Our goal is to describe R, SparkR and the SparkR commands for data manipulation and data exploration using data frames, the DataFrame API and related libraries.

Introduction:
Data is growing rapidly. As of 2012, about 2.5 exabytes of data were created each day, and that number was doubling every 40 months [10]. Many companies generate petabytes of data in a single data set, and not just from the internet. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. Facebook alone has 300 petabytes of Hive data, growing by 600 TB every day [9]. There are a number of tools for analyzing such data, and R is one of the most popular among data scientists. It supports structured data processing through data frames and offers a large number of statistical and visualization packages.
However, analysis in R is limited to the memory available on a single computer and runs in a single thread. In this paper we therefore explore SparkR, which processes large amounts of data by taking advantage of parallel processing over a cluster. SparkR builds on Spark, whose libraries cover SQL querying, distributed machine learning and graph analytics.

Background:
R programming
R is a programming language for statistical analysis, graphical representation and reporting. It is an implementation of the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. Because it is free, open source, powerful and highly extensible, R has become very popular in the data analysis field. It provides a wide variety of statistical packages for linear and nonlinear modelling, classical statistical tests, time series analysis, classification, clustering and more. At the time of writing it has more than nine thousand packages available, and the developer community supporting it keeps growing. Data frames and matrices make it possible to handle and operate on data effectively. Graphical facilities, with output either on screen or on hard copy, help users analyze data more precisely. R also provides general programming constructs such as conditions, loops, user-defined recursive functions, and input and output facilities.
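The language features listed above (data frames, conditions, loops and user-defined recursive functions) can be illustrated with a few lines of base R. This is only a minimal sketch: the `people` data frame, its values and the helper `factorial_rec` are made up for illustration and do not come from the paper.

```r
# A small local data frame (hypothetical values, for illustration only)
people <- data.frame(
  name   = c("Ann", "Bob", "Cid", "Dee"),
  age    = c(23, 35, 19, 41),
  gender = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Condition: keep only the rows with age above 20
adults <- people[people$age > 20, ]

# Loop: accumulate the total age of the remaining rows
total_age <- 0
for (a in adults$age) {
  total_age <- total_age + a
}

# User-defined recursive function
factorial_rec <- function(n) {
  if (n <= 1) 1 else n * factorial_rec(n - 1)
}

# Summarize: mean age per gender
mean_age_by_gender <- aggregate(age ~ gender, data = adults, FUN = mean)
print(mean_age_by_gender)
```

Everything here runs in a single R process on one machine, which is exactly the limitation that SparkR is meant to lift.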
R provides not only numerical computation but also structured data processing through data frames. A data frame is a tabular data structure made up of columns in vector form, which can hold numerical as well as categorical values. Data frames make it easy to filter, summarize and sort data. Packages such as dplyr, data.table, reshape2, readr, tidyr and lubridate help greatly in data exploration.

Apache Spark
Apache Spark is a powerful open source tool for processing large amounts of data, combining ease of use with sophisticated analysis tools. It started as a research project at UC Berkeley in the AMPLab, which focuses on big data analytics. MapReduce is inefficient for multi-pass applications that require low-latency data sharing across parallel operations. Spark overcomes this shortcoming and has become well known for its parallel operations. The Spark project first introduced Resilient Distributed Datasets (RDDs), an API for fault-tolerant computation in a cluster computing environment. Spark is widely adopted because of some of its major features: it ships with libraries such as MLlib for machine learning and GraphX for graph algorithms such as PageRank, and it can process data in memory, which speeds up processing and querying across the cluster. Since these libraries are closely integrated with the core API, Spark enables complex workflows in which SQL queries are used to pre-process data and the results are then analyzed with advanced machine learning and graph algorithms.

Apache SparkR
SparkR is a light-weight frontend on top of Apache Spark. It was started in the AMPLab to explore combining the usability of R with the scalability of Spark. It was first open sourced in January 2014. As of Spark 2.0.2, SparkR provides a distributed data frame implementation that supports data exploration operations such as selection, filtering, aggregation and summarization, along with more advanced functionality for text mining, machine learning and graph analysis.

Benefits of SparkR Integration:
Because it uses the Spark API, SparkR inherits many benefits from its tight integration with Spark. These are:
Data Source API: SparkR's data source API enables users to easily load data from a variety of big data sources such as HBase, Cassandra, Hive tables, JSON files and Parquet files.
Data Frame Optimization: The SparkR DataFrame is optimized in terms of memory management and code generation. The chart shows the runtime performance of a group-by aggregation on 10 million integer pairs on a single machine in R, Python and Scala [1]. It shows that SparkR performance is similar to that of Scala and Python.
Scalability over cluster machines: Queries and operations executed on a SparkR DataFrame are automatically distributed across all cores and machines available in the Spark cluster, so terabytes of data can be analyzed on a cluster of thousands of machines in a fraction of the time a single machine would need.

DataFrame in SparkR:
SparkR supports creating data frames from various sources such as local R data frames, Hive tables, and JSON, CSV and Parquet files.

From local data:
    # load SparkR and start a session
    library(SparkR)
    sparkR.session()
    # create a SparkDataFrame from the local iris data set
    df <- as.DataFrame(iris)
    # show the first rows (head returns 6 by default)
    head(df)
From data source:
Spark supports CSV, JSON and Parquet files natively; other formats, such as Avro, can be loaded through third-party data source connectors.
    # start SparkR with the avro package
    sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
    # read a JSON file
    dataCar <- read.df("car.json", "json")
    head(dataCar)
    # see the schema inferred from the JSON file
    printSchema(dataCar)
    # we can also load multiple files at once
    dataCars <- read.json(c("Car1.json", "Car2.json"))
    # read a CSV file (carcsv holds the file path; "car.csv" is a placeholder)
    carcsv <- "car.csv"
    data.car <- read.df(carcsv, "csv", header = "true", inferSchema = "true", na.strings = "NA")

Load from Hive table:
SparkR has built-in Hive support, so a Spark session with Hive support enabled can access Hive tables.
    # start a SparkR session (Hive support is enabled by default)
    sparkR.session()
    # create the table
    sql("CREATE TABLE IF NOT EXISTS tb_src (key INT, value STRING)")
    # load data into the Hive table
    sql("LOAD DATA LOCAL INPATH 'data_kv.txt' INTO TABLE tb_src")
    # query using HiveQL
    results <- sql("FROM tb_src SELECT key, value")
    # see the first rows of the results data frame
    head(results)

Data Exploration:
SparkR supports data exploration operations such as filtering, aggregation, grouping, selection and summarization. The examples below assume that df is a SparkDataFrame with age and gender columns.

Selecting:
    # select the age and gender columns
    head(select(df, df$age, df$gender))
    # keep rows with age > 20, then select age and gender only
    head(select(filter(df, df$age > 20), df$age, df$gender))
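The filter-then-select step above can also be written as a SQL query against a temporary view, which may feel more natural to readers coming from Hive. The following is a minimal sketch, assuming Spark and the SparkR package are installed locally; the sample values, the `local[1]` master and the view name `people` are illustrative assumptions rather than part of the original example.

```r
library(SparkR)

# Start a local Spark session (assumes a local Spark installation)
sparkR.session(master = "local[1]")

# Hypothetical sample data with the age and gender columns used above
local_people <- data.frame(
  age    = c(18, 25, 31, 25),
  gender = c("F", "M", "F", "F"),
  stringsAsFactors = FALSE
)
people_df <- createDataFrame(local_people)

# DataFrame API: filter the rows, then select the columns
api_result <- collect(
  select(filter(people_df, people_df$age > 20), "age", "gender")
)

# SQL: register a temporary view and express the same query in SQL
createOrReplaceTempView(people_df, "people")
sql_result <- collect(sql("SELECT age, gender FROM people WHERE age > 20"))

print(api_result)
print(sql_result)

sparkR.session.stop()
```

Both forms go through the same Spark SQL engine, so the choice between them is largely a matter of readability. Row order in the collected results is not guaranteed unless the query sorts explicitly.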
Grouping and Aggregating:
    # group by age and count the rows in each group
    head(summarize(groupBy(df, df$age), count = n(df$age)))
    # sort the groups by count, largest first
    group_by_age <- summarize(groupBy(df, df$age), count = n(df$age))
    head(arrange(group_by_age, desc(group_by_age$count)))

Data visualization using ggplot2:
Aggregated SparkR results can be collected into a local R data frame and visualized with the ggplot2 library.
    # install and load ggplot2
    install.packages("ggplot2")
    library(ggplot2)
    # read the data from a CSV file
    data <- read.df("data.csv", source = "csv", header = "true", inferSchema = "true")
    # group by gender, compute the average age and collect the result locally
    summary_by_age <- collect(agg(groupBy(data, "gender"), avg_age = avg(data$age)))
    head(summary_by_age)
    # bar chart of the average age per gender
    ggplot(summary_by_age, aes(x = gender, y = avg_age)) + geom_bar(stat = "identity")

K-Means Model
K-Means is a widely used clustering algorithm that divides the data into clusters. In K-Means clustering we have to choose the number of clusters and then see how the data fit into them.
    # fit a k-means model with spark.kmeans
    irisDF <- createDataFrame(iris)
    kmeansDF <- irisDF
    kmeansTestDF <- irisDF
    kmeansModel <- spark.kmeans(kmeansDF, ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width, k = 5)
    # model summary
    summary(kmeansModel)
    # get the fitted result from the k-means model
    showDF(fitted(kmeansModel))
    # prediction
    kmeansPredictions <- predict(kmeansModel, kmeansTestDF)
    showDF(kmeansPredictions)

Conclusion:
In summary, SparkR brings R analysis on top of Spark, making it possible to process and analyze large amounts of data from R. It builds on the Spark ecosystem, which includes libraries such as MLlib for machine learning and GraphX for graph processing. SparkR can also be used for summarization, aggregation, filtering and visualization to gain quick insight into data and find patterns in it.

Reference:
[1]. SparkR: Scaling R Programs with Spark https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[2]. SparkR: Scaling R Programs with Spark https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[3]. Spark Research http://spark.apache.org/research.html
[4]. Announcing SparkR: R on Apache Spark https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
[5]. Do Faster Data Manipulation using These 7 R Packages https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/
[6]. SparkR and Sparking Water https://rpubs.com/wendyu/sparkr
[7]. Exploring geographical data using SparkR and ggplot2 https://www.codementor.io/spark/tutorial/exploratory-geographical-data-using-sparkr-and-ggplot2
[8]. Plot http://skku-skt.github.io/ggplot2.SparkR/plot-types
[9]. Scaling the Facebook data warehouse to 300 PB https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[10]. Big Data: The Management Revolution https://hbr.org/2012/10/big-data-the-management-revolution