SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Downloaden Sie, um offline zu lesen
www.KnowBigData.comBig Data and Hadoop
Welcome to
Interview Questions on Apache
Spark [Part 2]
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
reachus@knowbigdata.com
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
www.KnowBigData.comBig Data and Hadoop
ABOUT INSTRUCTOR - SANDEEP GIRI
2014 KnowBigData Founded
2014
Amazon
Built High Throughput Systems for Amazon.com site using
in-house NoSql.
2012
2012 InMobi Built Recommender that churns 200 TB
2011
tBits Global
Founded tBits Global
Built an enterprise grade Document Management System
2006
D.E.Shaw
Built the big data systems before the
term was coined
2002
2002 IIT Roorkee Finished B.Tech.
www.KnowBigData.comBig Data and Hadoop
❏ Expert Instructors
❏ CloudLabs
❏ Lifetime access to LMS
❏ Presentations
❏ Class Recording
❏ Assignments +
Quizzes
❏ Project Work
WELCOME - KNOWBIGDATA
❏ Real Life Project
❏ Course Completion Certificate
❏ 24x7 support
❏ KnowsBigData - Alumni
❏ Jobs
❏ Stay Abreast (Updated
Content, Complimentary
Sessions)
❏ Stay Connected
www.KnowBigData.comBig Data and Hadoop
QUESTION 1
Say I have a huge list of numbers in RDD
(say myrdd). And I wrote the following
code to compute average:
def myAvg(x, y):
return (x+y)/2.0;
avg = myrdd.reduce(myAvg);
www.KnowBigData.comBig Data and Hadoop
QUESTION 1
ANSWER:
The average function is not commutative and associative;
I would simply sum it and then divide by count.
def sum(x, y):
return x+y;
total = myrdd.reduce(sum);
avg = total / myrdd.count();
The only problem with the above code is that the total might become very big thus
over flow.
So, I would rather divide each number by count and then sum in the following way.
cnt = myrdd.count();
def devideByCnd(x):
return x/cnt;
myrdd1 = myrdd.map(devideByCnd);
avg = myrdd.reduce(sum);
www.KnowBigData.comBig Data and Hadoop
QUESTION 2
Say I have a huge list of numbers in a file in
HDFS. Each line has one number.
And I want to compute the square root of
sum of squares of these numbers.
How would you do it?
www.KnowBigData.comBig Data and Hadoop
QUESTION 2
ANSWER:
There could two approaches:
1.
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.
com/user/student/sgiri/mynumbersfile.txt");
def toSqInt(str):
v = int(str);
return v*v;
2.
nums = numsAsText.map(toInt);
total = nums.reduce(sum)
import math;
print math.sqrt(total);
www.KnowBigData.comBig Data and Hadoop
QUESTION 3
Is the following approach correct?
Is the sqrtOfSumOfSq a valid reducer?
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.
com/user/student/sgiri/mynumbersfile.txt");
def toInt(str):
return int(str);
nums = numsAsText.map(toInt);
def sqrtOfSumOfSq(x, y):
return math.sqrt(x*x+y*y);
total = nums.reduce(sum)
import math;
print math.sqrt(total);
www.KnowBigData.comBig Data and Hadoop
QUESTION 3
Is the following approach correct?
Is the sqrtOfSumOfSq a valid reducer?
ANSWER:
Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer.
www.KnowBigData.comBig Data and Hadoop
QUESTION 4
Which approach is better?
Could you compare the pros and cons of
the your approach and my approach?
www.KnowBigData.comBig Data and Hadoop
QUESTION 4
Which approach is better?
Could you compare the pros and cons of the your approach
and my approach?
ANSWER:
You are doing the square and square root as part of reduce code while I am squaring
in map() and summing in reduce in my approach.
My approach will be faster because in your case the reducer code is heavy as it is
calling math.sqrt() and reducer code is generally executed approximately n-1 times.
The only downside of my approach is that there is a huge chance of integer overflow
because I am computing the sum of squares as part of map.
www.KnowBigData.comBig Data and Hadoop
QUESTION 5
How do you do the word count using
Spark?
www.KnowBigData.comBig Data and Hadoop
QUESTION 5
How do you do the word count using Spark?
ANSWER:
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt");
www.KnowBigData.comBig Data and Hadoop
QUESTION 6
In a very huge text file, you want to just
check if a particular keyword exists. How
would you do this using Spark?
www.KnowBigData.comBig Data and Hadoop
QUESTION 6
ANSWER:
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt");
def isFound(line):
if line.find(“mykeyword”) > -1:
return 1;
return 0;
foundBits = lines.map(isFound);
sum = foundBits.reduce(sum);
if sum > 0:
print “FOUND”;
else:
print “NOT FOUND”;
In a very huge text file, you want to just check if a particular
keyword exists. How would you do this using Spark?
www.KnowBigData.comBig Data and Hadoop
QUESTION 7
Can you improve the performance of
this code in previous answer?
www.KnowBigData.comBig Data and Hadoop
QUESTION 7
ANSWER:
Yes. The search is not stopping even after the word we are looking for has been found.
Our map code would keep executing on all the nodes which is very inefficient.
We could utilize accumulators to report whether the word has been found or not and
then stop the job. Something on these line:
import thread, threading
from time import sleep
result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)
Can you improve the performance of this code in previous
answer?
….CONTD
www.KnowBigData.comBig Data and Hadoop
QUESTION 7
def map_func(line):
#introduce delay to emulate the slowness
sleep(1);
if line.find("Adventures") > -1:
accum.add(1);
return 1;
return 0;
def start_job():
global result
try:
sc.setJobGroup("job_to_cancel", "some description")
lines = sc.textFile("hdfs://hadoop1.knowbigdata.
com/user/student/sgiri/wordcount/input/big.txt");
….CONTD
www.KnowBigData.comBig Data and Hadoop
QUESTION 7
result = lines.map(map_func);
result.take(1);
except Exception as e:
result = "Cancelled"
lock.release()
def stop_job():
while accum.value < 3 :
sleep(1);
sc.cancelJobGroup("job_to_cancel")
supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
supress = lock.acquire()
www.KnowBigData.comBig Data and Hadoop
Thank You.
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
reachus@knowbigdata.com
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA

Weitere ähnliche Inhalte

Was ist angesagt?

A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationSameer Tiwari
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Hadoop admin training
Hadoop admin trainingHadoop admin training
Hadoop admin trainingArun Kumar
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training Keylabs
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoopAmbuj Kumar
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideDanairat Thanabodithammachari
 

Was ist angesagt? (20)

A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animation
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop admin training
Hadoop admin trainingHadoop admin training
Hadoop admin training
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
HDFS
HDFSHDFS
HDFS
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
Hadoop - Introduction to Hadoop
Hadoop - Introduction to HadoopHadoop - Introduction to Hadoop
Hadoop - Introduction to Hadoop
 

Andere mochten auch

Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesTodd McGrath
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Hadoop assignment 1
Hadoop assignment 1Hadoop assignment 1
Hadoop assignment 1Pooja Puri
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview QuestionsZaranTech LLC
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Real World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and SparkReal World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and SparkQAware GmbH
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comknowbigdata
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperknowbigdata
 
9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy
9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy
9799078 the-speed-reading-book-tony-buzan-235-pg-enjoyFabio Antonio
 
Orienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsOrienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsKalyan Hadoop
 
Top 10 architect interview questions and answers
Top 10 architect interview questions and answersTop 10 architect interview questions and answers
Top 10 architect interview questions and answersWhitneyHouston012
 
Buzan tony speed memory
Buzan  tony   speed memoryBuzan  tony   speed memory
Buzan tony speed memoryNancy Tran
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 

Andere mochten auch (17)

Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code Examples
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Hadoop assignment 1
Hadoop assignment 1Hadoop assignment 1
Hadoop assignment 1
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Real World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and SparkReal World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and Spark
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.com
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy
9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy
9799078 the-speed-reading-book-tony-buzan-235-pg-enjoy
 
Orienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsOrienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshots
 
Top 10 architect interview questions and answers
Top 10 architect interview questions and answersTop 10 architect interview questions and answers
Top 10 architect interview questions and answers
 
Buzan tony speed memory
Buzan  tony   speed memoryBuzan  tony   speed memory
Buzan tony speed memory
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 

Ähnlich wie Interview questions on Apache spark [part 2]

Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbMongoDB APAC
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBRussell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Redispresentation apac2012
Redispresentation apac2012Redispresentation apac2012
Redispresentation apac2012Ankur Gupta
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"Nicola Ferraro
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabeDataiku
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 

Ähnlich wie Interview questions on Apache spark [part 2] (20)

Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
BigData primer
BigData primerBigData primer
BigData primer
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDB
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Redispresentation apac2012
Redispresentation apac2012Redispresentation apac2012
Redispresentation apac2012
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 

Kürzlich hochgeladen

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfSanaAli374401
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 

Kürzlich hochgeladen (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

Interview questions on Apache spark [part 2]

  • 1. www.KnowBigData.comBig Data and Hadoop Welcome to Interview Questions on Apache Spark [Part 2] +1 419 665 3276 (US) +91 803 959 1464 (IN) reachus@knowbigdata.com Subscribe to our Youtube channel for latest videos - https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
  • 2. www.KnowBigData.comBig Data and Hadoop ABOUT INSTRUCTOR - SANDEEP GIRI 2014 KnowBigData Founded 2014 Amazon Built High Throughput Systems for Amazon.com site using in-house NoSql. 2012 2012 InMobi Built Recommender that churns 200 TB 2011 tBits Global Founded tBits Global Built an enterprise grade Document Management System 2006 D.E.Shaw Built the big data systems before the term was coined 2002 2002 IIT Roorkee Finished B.Tech.
  • 3. www.KnowBigData.comBig Data and Hadoop ❏ Expert Instructors ❏ CloudLabs ❏ Lifetime access to LMS ❏ Presentations ❏ Class Recording ❏ Assignments + Quizzes ❏ Project Work WELCOME - KNOWBIGDATA ❏ Real Life Project ❏ Course Completion Certificate ❏ 24x7 support ❏ KnowsBigData - Alumni ❏ Jobs ❏ Stay Abreast (Updated Content, Complimentary Sessions) ❏ Stay Connected
  • 4. www.KnowBigData.comBig Data and Hadoop QUESTION 1 Say I have a huge list of numbers in RDD (say myrdd). And I wrote the following code to compute average: def myAvg(x, y): return (x+y)/2.0; avg = myrdd.reduce(myAvg);
  • 5. www.KnowBigData.comBig Data and Hadoop QUESTION 1 ANSWER: The average function is not commutative and associative; I would simply sum it and then divide by count. def sum(x, y): return x+y; total = myrdd.reduce(sum); avg = total / myrdd.count(); The only problem with the above code is that the total might become very big thus over flow. So, I would rather divide each number by count and then sum in the following way. cnt = myrdd.count(); def devideByCnd(x): return x/cnt; myrdd1 = myrdd.map(devideByCnd); avg = myrdd.reduce(sum);
  • 6. www.KnowBigData.comBig Data and Hadoop QUESTION 2 Say I have a huge list of numbers in a file in HDFS. Each line has one number. And I want to compute the square root of sum of squares of these numbers. How would you do it?
  • 7. www.KnowBigData.comBig Data and Hadoop QUESTION 2 ANSWER: There could two approaches: 1. numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata. com/user/student/sgiri/mynumbersfile.txt"); def toSqInt(str): v = int(str); return v*v; 2. nums = numsAsText.map(toInt); total = nums.reduce(sum) import math; print math.sqrt(total);
  • 8. www.KnowBigData.comBig Data and Hadoop QUESTION 3 Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer? numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata. com/user/student/sgiri/mynumbersfile.txt"); def toInt(str): return int(str); nums = numsAsText.map(toInt); def sqrtOfSumOfSq(x, y): return math.sqrt(x*x+y*y); total = nums.reduce(sum) import math; print math.sqrt(total);
  • 9. www.KnowBigData.comBig Data and Hadoop QUESTION 3 Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer? ANSWER: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer.
  • 10. www.KnowBigData.comBig Data and Hadoop QUESTION 4 Which approach is better? Could you compare the pros and cons of the your approach and my approach?
  • 11. www.KnowBigData.comBig Data and Hadoop QUESTION 4 Which approach is better? Could you compare the pros and cons of the your approach and my approach? ANSWER: You are doing the square and square root as part of reduce code while I am squaring in map() and summing in reduce in my approach. My approach will be faster because in your case the reducer code is heavy as it is calling math.sqrt() and reducer code is generally executed approximately n-1 times. The only downside of my approach is that there is a huge chance of integer overflow because I am computing the sum of squares as part of map.
  • 12. www.KnowBigData.comBig Data and Hadoop QUESTION 5 How do you do the word count using Spark?
  • 13. www.KnowBigData.comBig Data and Hadoop QUESTION 5 How do you do the word count using Spark? ANSWER: lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt");
  • 14. www.KnowBigData.comBig Data and Hadoop QUESTION 6 In a very huge text file, you want to just check if a particular keyword exists. How would you do this using Spark?
  • 15. www.KnowBigData.comBig Data and Hadoop QUESTION 6 ANSWER: lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt"); def isFound(line): if line.find(“mykeyword”) > -1: return 1; return 0; foundBits = lines.map(isFound); sum = foundBits.reduce(sum); if sum > 0: print “FOUND”; else: print “NOT FOUND”; In a very huge text file, you want to just check if a particular keyword exists. How would you do this using Spark?
  • 16. www.KnowBigData.comBig Data and Hadoop QUESTION 7 Can you improve the performance of this code in previous answer?
  • 17. www.KnowBigData.comBig Data and Hadoop QUESTION 7 ANSWER: Yes. The search is not stopping even after the word we are looking for has been found. Our map code would keep executing on all the nodes which is very inefficient. We could utilize accumulators to report whether the word has been found or not and then stop the job. Something on these line: import thread, threading from time import sleep result = "Not Set" lock = threading.Lock() accum = sc.accumulator(0) Can you improve the performance of this code in previous answer? ….CONTD
  • 18. www.KnowBigData.comBig Data and Hadoop QUESTION 7 def map_func(line): #introduce delay to emulate the slowness sleep(1); if line.find("Adventures") > -1: accum.add(1); return 1; return 0; def start_job(): global result try: sc.setJobGroup("job_to_cancel", "some description") lines = sc.textFile("hdfs://hadoop1.knowbigdata. com/user/student/sgiri/wordcount/input/big.txt"); ….CONTD
  • 19. www.KnowBigData.comBig Data and Hadoop QUESTION 7 result = lines.map(map_func); result.take(1); except Exception as e: result = "Cancelled" lock.release() def stop_job(): while accum.value < 3 : sleep(1); sc.cancelJobGroup("job_to_cancel") supress = lock.acquire() supress = thread.start_new_thread(start_job, tuple()) supress = thread.start_new_thread(stop_job, tuple()) supress = lock.acquire()
  • 20. www.KnowBigData.comBig Data and Hadoop Thank You. +1 419 665 3276 (US) +91 803 959 1464 (IN) reachus@knowbigdata.com Subscribe to our Youtube channel for latest videos - https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA