SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Up and Running with
Achilles Heel for Hadoop
 Hadoop is not fast enough *apparently* for things like ML .
 Need to Read again from disk after each MR job.
{ MR1 => HDFS =>MR2 => HDFS =>MR3 }
 MR , Let’s admit is a bit too complicated.
 The problem with giant codebase.
{Hadoop : 1.7 Million LOC}
{Spark : .35 Million LOC}
Why Spark?
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
A brief History of Spark Timeline
 UC Berkeley : The home of innovation.
 2009 : Started as a simple class project.
 The UCB folks wanted to create a Cluster Management system  Mesos
 They needed something to test on top of Mesos .  Voila Spark
 2010 : Open sourced under BSD licence
 Feb 2014 : Became Apache Top Level project
 Nov 2014 : New world record in Large scale sorting
https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
Spark made wise choices
The Spark Stack
Spark Concepts
 In memory Processing
{Processors : 64 bit ~~ Up to 1 TB RAM}
{Fact: RAM will always be faster than disk}
{Idea : Compress data, do processing }
{Remember : Data is distributed across various machines too}
 Resilient Distributed Datasets
http://www.gridgain.com/in-memory-computing-in-plain-english/
Resilient This is Sparta and we don’t give up on data without a fight.
Distributed A part of data is everywhere.
Dataset Meh!
A bit more on RDD
Basic unit of data in Spark
 RDDs are immutable
// int a=0;
// final int b =0;
 There are two main categories of operations on RDD
a) Transformation => Lazy evaluation.
=> Creates a new RDD from the existing RDD.
b) Actions => Return values
=> Write to disk
Eg : My mom asks me to buy Grocery items
Setting Up
Download “Prebuilt for Hadoop 2.4 and later”
 Build from source with Maven or sbt.
./bin/pyspark
http://spark.apache.org/downloads.html
Talk is Cheap! Show me the code
Pyspark shell is REPL.
 Creating an RDD
a) From data in memory.
b) From File.
c) From another RDD
rdd = sc.parallelize(“ChennaiPy”) // from string
nums = [1,2,3]
rdd_nums = sc.parallelize(nums) // from list
rdd_shakespeare= sc.textFile(“shakespeare.txt”) // from file
Transformations
Less Dramatic than this . But beautiful nevertheless.
 Classic Example 1 : Map
a) Beauty in this case comes from Lambda Expressions .
nums = [1,2,3,4,5,6]
rdd_nums = sc.parallelize(nums) // Creating our RDD
new_rdd = rdd_nums.map(lambda x : x**2) // You’ve got squares
print new_rdd.colect() // Finally some action
Did you say 80 Operations?
http://nbviewer.ipython.org/github/jkthompson/pyspark-
pictures/blob/master/pyspark-pictures.ipynb
Use case : Log Analysis
Demo Time
Thank You 

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 

Andere mochten auch

Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
Edureka!
 
Introduction to Functional Programming with Scala
Introduction to Functional Programming with ScalaIntroduction to Functional Programming with Scala
Introduction to Functional Programming with Scala
pramode_ce
 

Andere mochten auch (10)

Scala 101
Scala 101Scala 101
Scala 101
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Introduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQIntroduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQ
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
A Brief Intro to Scala
A Brief Intro to ScalaA Brief Intro to Scala
A Brief Intro to Scala
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Introduction to Functional Programming with Scala
Introduction to Functional Programming with ScalaIntroduction to Functional Programming with Scala
Introduction to Functional Programming with Scala
 

Ähnlich wie Up and running with pyspark

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 

Ähnlich wie Up and running with pyspark (20)

Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Scaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and MesosScaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and Mesos
 
Beginner Apache Spark Presentation
Beginner Apache Spark PresentationBeginner Apache Spark Presentation
Beginner Apache Spark Presentation
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Data Science
Data ScienceData Science
Data Science
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 

Mehr von Krishna Sangeeth KS

Mehr von Krishna Sangeeth KS (17)

Bringing back Vangogh
Bringing back VangoghBringing back Vangogh
Bringing back Vangogh
 
All things py
All things pyAll things py
All things py
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
Solving graph problems using networkX
Solving graph problems using networkXSolving graph problems using networkX
Solving graph problems using networkX
 
Lvc
LvcLvc
Lvc
 
Round 1 gnosis
Round 1 gnosisRound 1 gnosis
Round 1 gnosis
 
Lvc
LvcLvc
Lvc
 
Anti Counterfeit System using QR codes and Various other applications
Anti Counterfeit System using QR codes and Various other applicationsAnti Counterfeit System using QR codes and Various other applications
Anti Counterfeit System using QR codes and Various other applications
 
Automatic web monitoring & retrieval system
Automatic web monitoring & retrieval systemAutomatic web monitoring & retrieval system
Automatic web monitoring & retrieval system
 
Written round
Written roundWritten round
Written round
 
Visual connect
Visual connectVisual connect
Visual connect
 
Supertheme
SuperthemeSupertheme
Supertheme
 
Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1
 
Dry 2
Dry 2Dry 2
Dry 2
 
Choice round
Choice roundChoice round
Choice round
 
Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1Gnosis quiz 2k10 dry1
Gnosis quiz 2k10 dry1
 
Gnosis Quiz 2k10 Prelims
Gnosis Quiz  2k10 PrelimsGnosis Quiz  2k10 Prelims
Gnosis Quiz 2k10 Prelims
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Up and running with pyspark

  • 2. Achilles Heel for Hadoop  Hadoop is not fast enough *apparently* for things like ML .  Need to Read again from disk after each MR job. { MR1 => HDFS =>MR2 => HDFS =>MR3 }  MR , Let’s admit is a bit too complicated.  The problem with giant codebase. {Hadoop : 1.7 Million LOC} {Spark : .35 Million LOC}
  • 4. A brief History of Spark Timeline  UC Berkeley : The home of innovation.  2009 : Started as a simple class project.  The UCB folks wanted to create a Cluster Management system  Mesos  They needed something to test on top of Mesos .  Voila Spark  2010 : Open sourced under BSD licence  Feb 2014 : Became Apache Top Level project  Nov 2014 : New world record in Large scale sorting https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
  • 5. Spark made wise choices
  • 7. Spark Concepts  In memory Processing {Processors : 64 bit ~~ Up to 1 TB RAM} {Fact: RAM will always be faster than disk} {Idea : Compress data, do processing } {Remember : Data is distributed across various machines too}  Resilient Distributed Datasets http://www.gridgain.com/in-memory-computing-in-plain-english/ Resilient This is Sparta and we don’t give up on data without a fight. Distributed A part of data is everywhere. Dataset Meh!
  • 8. A bit more on RDD Basic unit of data in Spark  RDDs are immutable // int a=0; // final int b =0;  There are two main categories of operations on RDD a) Transformation => Lazy evaluation. => Creates a new RDD from the existing RDD. b) Actions => Return values => Write to disk Eg : My mom asks me to buy Grocery items
  • 9. Setting Up Download “Prebuilt for Hadoop 2.4 and later”  Build from source with Maven or sbt. ./bin/pyspark http://spark.apache.org/downloads.html
  • 10. Talk is Cheap! Show me the code Pyspark shell is REPL.  Creating an RDD a) From data in memory. b) From File. c) From another RDD rdd = sc.parallelize(“ChennaiPy”) // from string nums = [1,2,3] rdd_nums = sc.parallelize(nums) // from list rdd_shakespeare= sc.textFile(“shakespeare.txt”) // from file
  • 11. Transformations Less Dramatic than this . But beautiful nevertheless.  Classic Example 1 : Map a) Beauty in this case comes from Lambda Expressions . nums = [1,2,3,4,5,6] rdd_nums = sc.parallelize(nums) // Creating our RDD new_rdd = rdd_nums.map(lambda x : x**2) // You’ve got squares print new_rdd.colect() // Finally some action
  • 12. Did you say 80 Operations? http://nbviewer.ipython.org/github/jkthompson/pyspark- pictures/blob/master/pyspark-pictures.ipynb
  • 13. Use case : Log Analysis

Hinweis der Redaktion

  1. This is just a test
  2. This is just a test