Spark: Lightning-Fast Cluster Computing Framework!

Nitish Upreti
nzu100@cse.psu.edu
  
	
  
Agenda for the Day

•  How do we mine useful information from massive datasets?
•  Overview of Problems in the Big Data Space.
•  Apache Hadoop and its Limitations.
•  Introduce Apache Spark.
•  Switching Gears: Exploring Spark Internals
•  Discuss an active research area, “BlinkDB”: Queries with Bounded Response Time on very large data.
•  Conclusion
  
Data is KEY for Organizations.

Data Mining is challenging.

Mining is even more challenging with Massive Datasets!
  
Big Data Problem for Amazon

“Grouping books on same topic”

Solution: k-Means Algorithm

Not so simple with large datasets…

How to solve this problem?
  
Back in 2000 Google had a problem…

•  Processing massive amounts of raw data (crawled documents, web request logs)
•  Specialized Hardware (Vertical Scaling) was super expensive.
•  The computation thus needed to be distributed.

Solution: Google’s MapReduce Paradigm by Jeffrey Dean and Sanjay Ghemawat.

Essentially a Distributed Cluster Computing Framework.
  
What are the core ideas behind MapReduce and why does it scale to Massive Datasets?
  
Why is MapReduce so important?

•  Uses commodity hardware for computation.
•  Overcomes commodity hardware limitations: provides a solution for an environment where failures are very frequent.
•  Pushes computation to the data (rather than the other way round).
•  Provides an abstraction to focus on domain logic: a programming / cluster computing model for distributed computing using the functional primitives map and reduce.
  	
  
Simplest Big Data Problem: Counting Words.

Text Mining: Word Count

Map-Reduce Pseudo Code
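The pseudo code on this slide did not survive extraction. As a stand-in, here is a minimal single-process sketch of word count in the map / shuffle / reduce style. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not Hadoop or Google APIs, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return (word, sum(counts))


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["spark is fast", "hadoop is big", "spark is big"])
```

The programmer only writes `map_phase` and `reduce_phase`; grouping by key is the part the framework takes over.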
  	
  
Does the MapReduce Paradigm completely solve the Big Data Problem?

Where do we go from here?

The BIG Picture!

What do we need to process Big Data?
  	
  
Three major Big Data Scenarios

•  Interactive Queries: Enable faster decisions.
Example: Query website logs and diagnose why the website is slow (Apache Pig, Apache Hive, etc.).
•  Sophisticated Batch Data Processing: Enable better decisions.
Example: Trend Analysis, Analytics
•  Streaming Data Processing: Real-time decision making.
Example: Fraud Detection, detecting DDoS attacks.
  
 
After Google’s initial work…

The Open Source Community soon caught up with the Hadoop Ecosystem!

Tools for every Data Analysis scenario…

There are many inherent limitations with the MapReduce ecosystem…
  
MapReduce Limitations

•  Iterative Jobs: Common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must store its data to disk and then reload it.
•  Interactive Analysis: There is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.
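A rough sketch of the iterative-job limitation, assuming a toy `DiskDataset` class that only counts how often it is read. Nothing here is Hadoop or Spark code; the point is that the reloading loop touches "disk" once per iteration while the cached loop touches it once in total, with the same result.

```python
class DiskDataset:
    """Hypothetical stand-in for a dataset on disk; counts every read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_reloading(dataset, step, state, iterations):
    """MapReduce style: each iteration is its own job and reloads the input."""
    for _ in range(iterations):
        state = step(dataset.load(), state)
    return state


def iterate_cached(dataset, step, state, iterations):
    """In-memory style: load the working set once, then reuse it."""
    cached = dataset.load()
    for _ in range(iterations):
        state = step(cached, state)
    return state


def step(values, estimate):
    # Move the estimate halfway toward the mean of the data.
    mean = sum(values) / len(values)
    return estimate + 0.5 * (mean - estimate)


on_disk = DiskDataset([2.0, 4.0, 6.0])
result_reloading = iterate_reloading(on_disk, step, 0.0, 10)
reads_reloading = on_disk.reads

on_disk_again = DiskDataset([2.0, 4.0, 6.0])
result_cached = iterate_cached(on_disk_again, step, 0.0, 10)
reads_cached = on_disk_again.reads
```

The sketch leaves out the write side: a real chain of MapReduce jobs also writes each iteration's output back to disk before the next job starts.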
  
Example 1: Iterative Breadth First Search in Hadoop.

Example 2: Ad-Hoc SQL-like Queries, anyone?

Using Hive and Pig?
  
Spark to the Rescue!

Spark easily outperforms Hadoop in the discussed scenarios by 10X – 100X.

Spark is not simply about performance gain…
  
Introducing Spark

•  Open Source data analytics cluster computing framework.
•  Provides primitives for In-Memory cluster computing that allow the user to load data into the cluster’s memory and query it repeatedly (spills to HDFS only when needed).
•  Allows interactive ad-hoc data exploration. (Supports Pipelining & Lazy Initialization)
•  Unifies batch, streaming and interactive computation.
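Pipelining and laziness can be illustrated with plain Python generators. This is an analogy under stated assumptions, not Spark's API: the stage names (`read_lines`, `only_errors`, `message_of`) and the `log` list are made up for the example. Building the pipeline does no work, and pulling one result streams records through all stages without materializing intermediate datasets.

```python
log = []  # records which stages actually touched which lines


def read_lines(lines):
    for line in lines:
        log.append(("read", line))
        yield line


def only_errors(lines):
    for line in lines:
        if line.startswith("ERROR"):
            log.append(("filter", line))
            yield line


def message_of(lines):
    for line in lines:
        # Keep only the text after the "LEVEL: " prefix.
        yield line.split(": ", 1)[1]


raw = ["INFO: started", "ERROR: disk full", "ERROR: timeout"]

# Defining the pipeline is lazy: no line has been read yet.
pipeline = message_of(only_errors(read_lines(raw)))
work_before_action = len(log)

# Asking for a result pulls records through every stage, one at a time.
first = next(pipeline)
```

After `next(pipeline)` returns, the third input line has still not been read, which is the behaviour the slide's "lazy" remark points at.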
  
A rich Spark Ecosystem

The Scala Programming Language
- Object / Functional
- Runs on the JVM
- Concise Syntax
- Aims to support interactive scripting + development
  	
  
	
  
Working with Spark…

Why is it so Powerful?

Exploring Spark Internals…
  
•  Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
•  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
  
Spark Essentials: Driver Program

Core of Spark: RDDs

•  RDD: Resilient Distributed Datasets are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
•  An RDD is a read-only collection of objects partitioned across a set of machines.
•  RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This allows fault tolerance by logging the transformations and building a dataset lineage that can be used to rebuild a partition if it is lost.
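The lineage idea can be sketched with a toy class. `ToyRDD` and its methods are hypothetical names for illustration and do not reproduce Spark's internals: each derived dataset stores only its parent and the function that produced it, so a lost partition is recomputed from the parent rather than restored from a replica. The sketch assumes the base data stays available and models only per-partition (narrow) transformations.

```python
class ToyRDD:
    """A read-only, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions  # only set for the base dataset
        self.parent = parent       # lineage: the dataset this one derives from
        self.fn = fn               # function applied to one parent partition
        self._cache = {}           # partition index -> materialized list

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions()

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i not in self._cache:
            if self.parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self.fn(self.parent.compute(i))
        return self._cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = evens_squared.collect()
evens_squared.lose_partition(1)   # only partition 1 needs to be rebuilt
after = evens_squared.collect()
```

Because transformations are coarse-grained, the log needed for recovery is just the short chain of functions, not a record of every updated item.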
  
Manipulating RDDs…

Periodic Checkpointing of data for long-running lineage.

Shared Variables in SPARK:
- Broadcast Variables
- Accumulators

Programmers can create two restricted types of shared variables to support two common simple usage patterns…
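A small single-process sketch of the two usage patterns, assuming hypothetical `Broadcast` and `Accumulator` classes that only mimic the restrictions (read-only lookup data, add-only counter). The names `run_task`, `country_names` and `unknown_codes` are invented for the example; in a real cluster each task would run on a different worker.

```python
class Broadcast:
    """Read-only value that would be shipped once to each worker."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value


class Accumulator:
    """Add-only counter: tasks may add to it, the driver reads the total."""

    def __init__(self, initial=0):
        self._total = initial

    def add(self, amount):
        self._total += amount

    @property
    def value(self):
        return self._total


def run_task(partition, lookup, missing):
    """Translate codes via the broadcast table and count unknown codes."""
    out = []
    for code in partition:
        name = lookup.value.get(code)
        if name is None:
            missing.add(1)
        else:
            out.append(name)
    return out


country_names = Broadcast({"us": "United States", "de": "Germany"})
unknown_codes = Accumulator()
partitions = [["us", "xx"], ["de", "us", "yy"]]

results = [run_task(p, country_names, unknown_codes) for p in partitions]
```

The lookup table is the broadcast pattern (large read-only data used by every task) and the unknown-code count is the accumulator pattern (a sum collected across tasks).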
  	
  
	
  
Initial Experiment Results

Logistic Regression: 29GB dataset.

Interactive Data Mining on Wikipedia: 1 TB from disk took 170s.
  
Back to our Question…

Why is SPARK so Powerful?

RDDs Expressivity and Generality

•  RDDs are able to express a diverse set of programming models, as the restrictions have little impact in parallel applications.
•  A lot of these programs naturally apply the same operation to many records, making them a good fit.
•  Previous systems explored specific problems with MapReduce. However, at the core of the problem is the need for a common data sharing abstraction.
•  RDDs capture all major optimizations: keeping specific data in memory, custom partitioning to minimize communication, and recovering from failures effectively.
  
Current Research…

Queries with Bounded Error and Bounded Response Time on Very Large Data.

Approximate Queries…
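To make the idea of an approximate query concrete, here is a minimal sketch that answers an average query from a uniform random sample and reports a normal-approximation error margin. It is not BlinkDB's method; the function name `approximate_mean`, the synthetic `population`, the sample size and the fixed seed are all assumptions for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin); with z=1.96 the interval
    estimate +/- margin is a roughly 95% normal-approximation bound.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin


# Synthetic data: 100,000 values cycling through 0..99.
population = [float(i % 100) for i in range(100_000)]

exact = statistics.fmean(population)                       # scans everything
estimate, margin = approximate_mean(population, 1_000)     # scans 1% of it
```

The trade-off is the one in the slide title: reading fewer rows bounds the response time, and the margin states how much error that costs.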
  
Explore More of BlinkDB

•  Visit: http://blinkdb.org/
•  Blink and It’s Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. In PVLDB 5(12): 1902-1905, 2012, Istanbul, Turkey.
•  BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. In ACM EuroSys 2013, Prague, Czech Republic (Best Paper Award).
  
Questions?

Thank You!
  

9990611130 Find & Book Russian Call Girls In Ghazipur
 
Who Is Emmanuel Katto Uganda? His Career, personal life etc.
Who Is Emmanuel Katto Uganda? His Career, personal life etc.Who Is Emmanuel Katto Uganda? His Career, personal life etc.
Who Is Emmanuel Katto Uganda? His Career, personal life etc.
 
CALL ON ➥8923113531 🔝Call Girls Telibagh Lucknow best Night Fun service 🧣
CALL ON ➥8923113531 🔝Call Girls Telibagh Lucknow best Night Fun service  🧣CALL ON ➥8923113531 🔝Call Girls Telibagh Lucknow best Night Fun service  🧣
CALL ON ➥8923113531 🔝Call Girls Telibagh Lucknow best Night Fun service 🧣
 
08448380779 Call Girls In IIT Women Seeking Men
08448380779 Call Girls In IIT Women Seeking Men08448380779 Call Girls In IIT Women Seeking Men
08448380779 Call Girls In IIT Women Seeking Men
 
08448380779 Call Girls In Karol Bagh Women Seeking Men
08448380779 Call Girls In Karol Bagh Women Seeking Men08448380779 Call Girls In Karol Bagh Women Seeking Men
08448380779 Call Girls In Karol Bagh Women Seeking Men
 
Technical Data | Sig Sauer Easy6 BDX 1-6x24 | Optics Trade
Technical Data | Sig Sauer Easy6 BDX 1-6x24 | Optics TradeTechnical Data | Sig Sauer Easy6 BDX 1-6x24 | Optics Trade
Technical Data | Sig Sauer Easy6 BDX 1-6x24 | Optics Trade
 
08448380779 Call Girls In International Airport Women Seeking Men
08448380779 Call Girls In International Airport Women Seeking Men08448380779 Call Girls In International Airport Women Seeking Men
08448380779 Call Girls In International Airport Women Seeking Men
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual serviceCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service
 
Hire 💕 8617697112 Kasauli Call Girls Service Call Girls Agency
Hire 💕 8617697112 Kasauli Call Girls Service Call Girls AgencyHire 💕 8617697112 Kasauli Call Girls Service Call Girls Agency
Hire 💕 8617697112 Kasauli Call Girls Service Call Girls Agency
 
Asli Kala jadu, Black magic specialist in Pakistan Or Kala jadu expert in Egy...
Asli Kala jadu, Black magic specialist in Pakistan Or Kala jadu expert in Egy...Asli Kala jadu, Black magic specialist in Pakistan Or Kala jadu expert in Egy...
Asli Kala jadu, Black magic specialist in Pakistan Or Kala jadu expert in Egy...
 
TAM Sports_IPL 17 Till Match 37_Celebrity Endorsement _Report.pdf
TAM Sports_IPL 17 Till Match 37_Celebrity Endorsement _Report.pdfTAM Sports_IPL 17 Till Match 37_Celebrity Endorsement _Report.pdf
TAM Sports_IPL 17 Till Match 37_Celebrity Endorsement _Report.pdf
 
Albania Vs Spain Albania is Loaded with Defensive Talent on their Roster.docx
Albania Vs Spain Albania is Loaded with Defensive Talent on their Roster.docxAlbania Vs Spain Albania is Loaded with Defensive Talent on their Roster.docx
Albania Vs Spain Albania is Loaded with Defensive Talent on their Roster.docx
 
UEFA Euro 2024 Squad Check-in Who is Most Favorite.docx
UEFA Euro 2024 Squad Check-in Who is Most Favorite.docxUEFA Euro 2024 Squad Check-in Who is Most Favorite.docx
UEFA Euro 2024 Squad Check-in Who is Most Favorite.docx
 

Spark

  • 1. Spark: Lightning-Fast Cluster Computing Framework! Nitish Upreti nzu100@cse.psu.edu
  • 2. Agenda for the Day • How do we mine useful information from massive datasets? • Overview of Problems in Big Data Space. • Apache Hadoop and its Limitations. • Introduce Apache Spark. • Switching Gears: Exploring Spark Internals. • Discuss an active research area, “BlinkDB”: Queries with Bounded Response Time on very large data. • Conclusion
  • 3. Data is KEY for Organizations.
  • 4. Data Mining is challenging. Mining is even more challenging with Massive Datasets!
  • 5. Big Data Problem for Amazon: “Grouping books on the same topic”
  • 6. Solution: k-Means Algorithm
  • 7. Not so simple with large datasets… How to solve this problem?
  • 8. Back in 2000 Google had a problem… • Processing massive amounts of raw data (crawled documents, web request logs). • Specialized Hardware (Vertical Scaling) was super expensive. • The computation thus needed to be distributed. Solution: Google’s Map Reduce Paradigm by Jeffrey Dean and Sanjay Ghemawat. Essentially a Distributed Cluster Computing Framework.
  • 9. What are the core ideas behind Map Reduce and why does it scale to Massive Datasets?
  • 10. Why is Map Reduce so important? • Using commodity hardware for computation. • Overcome Commodity Hardware limitations: Provides a solution for an environment where failures are very frequent. • Pushing computation to data (unlike the other way round). • Provides abstraction to focus on domain logic: A programming / cluster computing model for distributed computing using the functional primitives map and reduce.
  • 11. Simplest Big Data Problem: Counting Words.
  • 12. Text Mining: Word Count
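The word count problem on the slides above can be sketched in a few lines. This is a minimal single-process illustration of the map, shuffle and reduce steps, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map over each document, then shuffle, then reduce per key."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to mine data"]))
```

The programmer only writes the map and reduce functions; grouping by key and distributing the work is what the framework is expected to provide.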
  • 14. Does the Map Reduce Paradigm completely solve the Big Data Problem? Where do we go from here?
  • 15. The BIG Picture! What do we need to process Big Data?
  • 16. Three major Big Data Scenarios • Interactive Queries: Enable faster decisions. Example: Query website logs and diagnose why the website is slow (Apache Pig, Apache Hive…). • Sophisticated Batch Data Processing: Enable better decisions. Example: Trend Analysis, Analytics. • Streaming Data Processing: Real-time decision making. Example: Fraud Detection, detecting DDoS attacks.
  • 17. After Google’s initial work… The Open Source Community soon caught up with the Hadoop Ecosystem!
  • 18. Tools for every Data Analysis scenario…
  • 19.
  • 20. There are many inherent limitations with the Map Reduce ecosystem…
  • 21. Map Reduce Limitations • Iterative Jobs: Common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must store and then reload data from disk. • Interactive Analysis: There is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.
  • 22. Example 1: Iterative Breadth First Search in Hadoop.
  • 23. Example 2: Ad-hoc SQL-like queries, anyone? Using Hive and Pig?
  • 24. Spark to the Rescue! Spark easily outperforms Hadoop in the discussed scenarios by 10X to 100X.
  • 25. Spark: not simply about Performance gain…
  • 26.
  • 27.
  • 28.
  • 29. Introducing Spark • Open Source data analytics cluster computing framework. • Provides primitives for In-Memory cluster computing that allow users to load data into a cluster’s memory and query it repeatedly (spills to HDFS only when needed). • Allows interactive ad-hoc data exploration. (Supports Pipelining & Lazy Initialization) • Unifies batch, streaming and interactive computation.
  • 30. A rich Spark Ecosystem
  • 31. The Scala Programming language - Object / Functional - Runs on the JVM - Concise Syntax - Aims to support interactive scripting + development
  • 33. Why is it so Powerful?
  • 34. Exploring Spark Internals… • Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010. • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
  • 35. Spark Essentials: Driver Program
  • 36. Core of Spark: RDDs • RDD: Resilient Distributed Datasets are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. • An RDD is a read-only collection of objects partitioned across a set of machines. • RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This allows fault tolerance by logging transformations and building a dataset lineage that can be used to rebuild them if a partition is lost.
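The lineage idea on the RDD slide can be illustrated with a toy model. The sketch below is a hypothetical single-process stand-in, not Spark's API: `ToyRDD` and its methods are invented for this example, and it assumes every transformation maps partition i of the parent to partition i of the child. Each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical in-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions=None, parent=None, fn=None, label="source"):
        self._source = partitions  # only the root dataset holds raw data
        self.parent = parent       # lineage pointer to the parent dataset
        self.fn = fn               # per-partition transformation
        self.label = label
        self.cache = {}            # partition index -> materialized data

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part], label="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)], label="filter")

    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions()

    def compute(self, i):
        """Return partition i, rebuilding it from the lineage if it is missing."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self._source[i])
        else:
            data = self.fn(self.parent.compute(i))
        self.cache[i] = data
        return data

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

    def lineage(self):
        """List the transformations from the source down to this dataset."""
        labels, node = [], self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return list(reversed(labels))


if __name__ == "__main__":
    numbers = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
    even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(even_squares.collect())   # [4, 16, 36]
    even_squares.cache.pop(1)       # simulate losing one partition
    print(even_squares.compute(1))  # rebuilt from lineage: [16, 36]
    print(even_squares.lineage())   # ['source', 'map', 'filter']
```

Nothing is computed until `collect` or `compute` is called, which mirrors the lazy behaviour the slides attribute to Spark; caching every intermediate dataset is a simplification made here for brevity.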
  • 38.
  • 39.
  • 40.
  • 41. Periodic Checkpointing of data for long running lineage.
  • 42. Shared Variables in SPARK: - Broadcast Variables - Accumulators. Programmers can create two restricted types of shared variables to support two common simple usage patterns…
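The two usage patterns can be sketched without a cluster. The classes below are hypothetical toy versions written for this example, not Spark's implementation: `Broadcast` wraps a read-only value that every task may look up, and `Accumulator` only supports adding, with the total meant to be read back by the driver. The helper `process_partition` and its parameters are likewise assumptions.

```python
class Broadcast:
    """Toy broadcast variable: a read-only value shared with every task."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value


class Accumulator:
    """Toy accumulator: tasks may only add; the driver reads the total."""

    def __init__(self, initial=0):
        self._total = initial

    def add(self, amount):
        self._total += amount

    @property
    def value(self):
        return self._total


def process_partition(lines, stop_words, blank_lines):
    """Drop stop words from each line and count blank lines on the side."""
    kept = []
    for line in lines:
        if not line.strip():
            blank_lines.add(1)  # side-channel counter, add-only
            continue
        kept.extend(w for w in line.split() if w not in stop_words.value)
    return kept


if __name__ == "__main__":
    stop_words = Broadcast(frozenset({"the", "a"}))
    blank_lines = Accumulator()
    partitions = [["the quick fox", ""], ["a lazy dog", "  ", "jumps"]]
    words = [w for part in partitions
             for w in process_partition(part, stop_words, blank_lines)]
    print(words, blank_lines.value)
```

Restricting the variables this way is what makes them cheap to distribute: a read-only value can be shipped to each worker once, and add-only updates can be merged in any order.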
  • 43. Initial Experiment Results. Logistic Regression: 29 GB dataset. Interactive Data Mining on Wikipedia: 1 TB from disk took 170s.
  • 44. Back to our Question… Why is SPARK so Powerful?
  • 45. RDDs Expressivity and Generality • RDDs are able to express a diverse set of programming models, as the restrictions have little impact in parallel applications. • A lot of these programs naturally apply the same operation on many records, making them a good fit. • Previous systems explored specific problems with MapReduce. However, at the core of the problem is the need for a common data sharing abstraction. • RDDs capture all major optimizations: keeping specific data in memory, custom partitioning to minimize communication, and recovering from failures effectively.
  • 47.
  • 48. Queries with Bounded Error and Bounded Response Time on Very Large Data.
  • 49.
  • 51. Explore More of BlinkDB • Visit: http://blinkdb.org/ • Blink and It’s Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. In PVLDB 5(12): 1902-1905, 2012, Istanbul, Turkey. • BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. In ACM EuroSys 2013, Prague, Czech Republic (Best Paper Award).
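The statistical idea behind answering a query with a bounded error can be shown with a small sketch. This is not BlinkDB's algorithm, which the papers above describe in terms of precomputed stratified samples; it only assumes plain uniform sampling and a normal approximation, and the function name `approximate_mean` and its parameters are invented for this example.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval when z is 1.96.
    """
    rng = random.Random(seed)                 # fixed seed for repeatability
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


if __name__ == "__main__":
    data = list(range(1, 100001))
    for n in (100, 10000):
        estimate, margin = approximate_mean(data, n)
        print(f"n={n}: mean is about {estimate:.1f} +/- {margin:.1f}")
```

Reading fewer rows gives a faster but looser answer and reading more rows gives a tighter one, which is the trade-off between response time and error that the slide title refers to.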