HADOOP VS SPARK
YOUR PRESENTER – SAMPAT KUMAR BUDANKAYALA
• Sr. Big Data Analyst @ Harman Solutions
• Over 4.5 years of Big Data experience across 15-20 projects.
• Specialist in building Data Lake projects, Data Security, Streaming
Solutions (real-time ingestion), Linear Regression and building
Recommendation Systems.
• Email: sampatbigdata@gmail.com
• Linkedin:
AGENDA
• Around the Globe (Spark and Hadoop)
• Big Data, Big Data Stack, Apache Hadoop, Apache Spark.
• What is Hadoop and what is Spark?
• Spark Vs Hadoop and the combination effect.
• Q & A
Around the Globe:
NEWS:
----------
• Is it Spark ‘vs’ or ‘and’ Hadoop?
• Apache Spark continues to grow beyond Apache Hadoop.
SURVEYS:
--------------
• Big Data, the analysis of large quantities of data to gain new insight, has become a ubiquitous phrase in
recent years. Day by day, data is growing at a staggering rate. One of the efficient technologies for
dealing with Big Data is Hadoop.
• Hadoop uses the MapReduce programming model to process large-volume data jobs.
http://www.ijetae.com/files/Volume4Issue5/IJETAE_0514_15.pdf
• Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an
appetite for more flexible developer tools to support the larger market of 'mid-size' datasets and use
cases that call for real-time processing.
http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-require-1986162.htm
Around the Globe (Cont.):
Big Data, Big Data Stack, Apache Spark and Hadoop
Big Data
----------
• Big data is a term that describes large volumes of data: structured, semi-structured and unstructured.
• But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can
be analyzed for insights that lead to better decisions and strategic business moves.
• The concept gained momentum in the early 2000s when industry analysts articulated the now-mainstream
definition of big data as the three Vs:
Volume – Organizations collect data from a variety of sources, including business transactions, social media and
information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new
technologies (such as Hadoop) have eased the burden.
Velocity – Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and
smart metering are driving the need to deal with torrents of data in near-real time.
Variety – Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text
documents, email, video, audio, stock ticker data and financial transactions.
https://www.zettaset.com/index.php/info-center/what-is-big-data/
Big Data, Big Data Stack, Apache Spark and Hadoop
Big Data Stack
-------------------
Big Data, Big Data Stack, Apache Spark and Hadoop
Apache Hadoop
---------------------
• Hadoop is a framework designed to work with data sets that are much larger than
conventional systems can handle.
• Hadoop distributes this data across a set of machines. The real power of Hadoop comes from its
ability to scale to hundreds or thousands of computers, each containing several processor cores.
• Many big enterprises believe that within a few years more than half of the world’s data will be stored in Hadoop.
• Hadoop mainly consists of:
1. Hadoop Distributed File System (HDFS): a distributed file system that provides storage and fault tolerance.
2. Hadoop MapReduce: a powerful parallel programming model that processes vast quantities of data via
distributed computing across the cluster.
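The MapReduce model named above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce phases, not Hadoop's actual Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["big data big insight", "data at scale"]) counts "big" and "data" twice each. On a real cluster the map and reduce calls would run in parallel on different machines, with HDFS holding the input and output.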
Big Data, Big Data Stack, Apache Spark and Hadoop
Apache Spark
---------------------
• Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to
allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast
iterative access to datasets.
• Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the
Java, Scala, and Python APIs offer a platform for distributed ETL application development.
• Spark is designed for data science and its abstraction makes data science easier. Data scientists commonly use
machine learning – a set of techniques and algorithms that can learn from data.
• According to the Apache Spark project, programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark Vs Hadoop and the combination effect
Performance
-----------------
• Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or
reduce action, so Spark should outperform Hadoop MapReduce.
• Nonetheless, Spark needs a lot of memory. Much like standard DBs, it loads a process into memory and keeps it
there until further notice, for the sake of caching.
• If Spark runs on Hadoop YARN with other resource-demanding services, or if the data is too big to fit entirely in
memory, then Spark could suffer major performance degradation.
• MapReduce, however, kills its processes as soon as a job is done, so it can easily run alongside other services with
minor performance differences.
• Bottom line: Spark performs better when all the data fits in memory, especially on dedicated clusters; Hadoop
MapReduce is designed for data that doesn’t fit in memory, and it can run well alongside other services.
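The difference between persisting to disk after every step and keeping intermediate results in memory can be made concrete with a toy model. The sketch below assumes a hypothetical CountingDisk class that only counts reads and writes; it measures no real I/O and uses neither the Spark nor the Hadoop API.

```python
class CountingDisk:
    """Hypothetical stand-in for a disk: stores lists and counts reads and writes."""

    def __init__(self):
        self.store = {}
        self.reads = 0
        self.writes = 0

    def write(self, key, values):
        self.writes += 1
        self.store[key] = list(values)

    def read(self, key):
        self.reads += 1
        return list(self.store[key])


def run_disk_between_steps(disk, steps):
    """MapReduce-style: each step reads its input from disk and writes its output back."""
    for i, step in enumerate(steps):
        data = disk.read(f"stage{i}")
        disk.write(f"stage{i + 1}", [step(x) for x in data])
    return disk.read(f"stage{len(steps)}")


def run_in_memory(disk, steps):
    """Spark-style: read the input once and keep intermediate results in memory."""
    data = disk.read("stage0")
    for step in steps:
        data = [step(x) for x in data]
    return data
```

Both functions return the same result for the same steps, but the disk-based pipeline touches the disk once per step while the in-memory pipeline reads it only once. That gap is the intuition behind the performance claim above, and also why the in-memory variant needs enough memory to hold the working data.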
Spark Vs Hadoop and the combination effect
Ease of Use:
-----------------
• Spark has comfortable APIs for Java, Scala and Python, and also includes Spark SQL (the successor to Shark) for
the SQL savvy.
• Hadoop MapReduce is written in Java and is infamous for being very difficult to program. Pig makes it easier,
though it requires some time to learn the syntax, and Hive adds SQL compatibility to the plate.
• MapReduce doesn’t have an interactive mode, although Hive includes a command line interface. Projects like
Impala, Presto and Tez want to bring full interactive querying to Hadoop.
• Bottom line: Spark is easier to program and includes an interactive mode; Hadoop MapReduce is more difficult to
program but many tools are available to make it easier.
Spark Vs Hadoop and the combination effect
Cost:
-----------------
• Both Spark and Hadoop MapReduce are open source, but money still needs to be spent on machines and staff.
• Hardware Requirements.
• The memory in the Spark cluster should be at least as large as the amount of data you need to process, because
the data has to fit into the memory for optimal performance. So, if you need to process really Big Data, Hadoop
will definitely be the cheaper option since hard disk space comes at a much lower rate than memory space.
• Furthermore, there is a wide array of Hadoop-as-a-service offerings and Hadoop-based services, which help to skip the
hardware and staffing requirements. In comparison, there are few Spark-as-a-service options and they are all very
new.
• Bottom line: Spark is more cost-effective according to the benchmarks, though staffing could be more costly;
Hadoop MapReduce could be cheaper because more personnel are available and because of Hadoop-as-a-service
offerings.
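The hardware argument above is simple arithmetic: capacity cost is data size times price per GB, and memory costs more per GB than disk. The sketch below spells that out. The function names are made up for this example, and any prices passed in are placeholders chosen by the caller, not real market figures.

```python
def capacity_cost(data_gb, price_per_gb):
    """Cost of provisioning enough capacity to hold data_gb gigabytes."""
    return data_gb * price_per_gb


def compare_storage(data_gb, ram_price_per_gb, disk_price_per_gb):
    """Return the cost of holding the whole data set in memory versus on disk."""
    return {
        "memory": capacity_cost(data_gb, ram_price_per_gb),
        "disk": capacity_cost(data_gb, disk_price_per_gb),
    }
```

With purely hypothetical prices of 8 per GB for memory and 0.25 per GB for disk, compare_storage(1024, 8, 0.25) gives 8192 for memory against 256 for disk. The absolute numbers mean nothing, but the gap widens linearly with data size, which is the point the slide makes.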
Spark Vs Hadoop and the combination effect
Data Processing:
----------------------
• Apache Spark can do more than plain data processing: it can process graphs and use the existing machine-learning
libraries.
• Spark can do real-time processing as well as batch processing.
• Hadoop MapReduce is great for batch processing. If you want a real-time option you’ll need to use another
platform like Storm or Impala, and for graph processing you can use Giraph. MapReduce used to have Apache
Mahout for machine learning, but the elephant riders have ditched it in favor of Spark and H2O.
• Bottom line: Spark is key for real-time data processing; Hadoop MapReduce is key for batch processing.
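To illustrate the batch versus real-time distinction, the sketch below contrasts a batch computation, which needs the complete data set before it produces an answer, with a micro-batch loop that emits an updated result after every small chunk of events. Spark Streaming processes streams as micro-batches, but this is plain Python with made-up function names, not the Spark API.

```python
from itertools import islice


def batch_total(events):
    """Batch: wait for the complete data set, then compute the result once."""
    return sum(events)


def micro_batch_totals(event_stream, batch_size):
    """Streaming via micro-batches: yield an updated running total per small batch."""
    iterator = iter(event_stream)
    running = 0
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        running += sum(chunk)
        yield running
```

For the events 1 to 5 with a batch size of 2, the micro-batch version yields a new total three times as data arrives, and its last value equals the single batch result.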
Spark Vs Hadoop and the combination effect
Failure Tolerance:
----------------------
• Spark has retries per task and speculative execution, just like MapReduce. Nonetheless, because MapReduce
relies on hard drives, if a process crashes in the middle of execution, it can continue where it left off, whereas
Spark may have to recompute lost in-memory results, in the worst case from the beginning. Resuming from disk can
therefore save MapReduce time.
• Bottom line: Spark and Hadoop MapReduce both have good failure tolerance, but Hadoop MapReduce is
slightly more tolerant.
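The recovery argument above can be modelled by counting how many steps a job executes when it crashes once and is retried. This is a deliberately simplified sketch with a hypothetical function name: "checkpointing" stands for a job whose step outputs are persisted to disk, and the non-checkpointed case models the worst case in which nothing survives the crash.

```python
def total_step_executions(num_steps, crash_before_step, checkpointing):
    """Count step executions for a job that crashes once and is then retried.

    The first attempt completes steps 0 .. crash_before_step - 1 and then fails.
    With checkpointing, the retry resumes after the last persisted step;
    without it, the retry starts again from step 0.
    """
    if not 0 <= crash_before_step < num_steps:
        raise ValueError("crash_before_step must be a valid step index")

    executed = crash_before_step  # work done by the first attempt before the crash
    resume_from = crash_before_step if checkpointing else 0
    executed += num_steps - resume_from  # work done by the retry
    return executed
```

For a five-step job that crashes before step 3, the checkpointed run executes 5 steps in total and the restart-from-scratch run executes 8. The later the crash, the more the persisted intermediate results save.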
Security:
------------------
• Spark is a bit bare at the moment when it comes to security.
• Spark can run on YARN and use HDFS, which means that it can also enjoy Kerberos authentication, HDFS file
permissions and encryption between nodes.
• Hadoop MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like
Knox Gateway and Sentry.
• Bottom line: Spark security is still in its infancy; Hadoop MapReduce has more security features and projects.
Practical Demo on Performance and Ease of Using APIs
Reference Links:
https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/

Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing

  • 2. YOUR PRESENTER – SAMPAT KUMAR BUDANKAYALA • Sr. Big Data Analyst @ Harman Solutions • Over 4.5 years of Big Data experience across 15-20 projects. • Specialist in building Data Lake projects, Data Security, Streaming Solutions (real-time ingestion), Linear Regression and building Recommendation Systems. • Email: sampatbigdata@gmail.com • LinkedIn:
  • 3. AGENDA • Around the Globe (Spark and Hadoop) • Big Data, Big Data Stack, Apache Hadoop, Apache Spark. • What is Hadoop and what is Spark? • Spark vs. Hadoop and the combination effect. • Q & A
  • 4. Around the Globe: NEWS: ---------- • Is it Spark ‘vs’ or ‘and’ Hadoop? • Apache Spark is continuing beyond Apache Hadoop. SURVEYS: -------------- • Big Data, the analysis of large quantities of data to gain new insight, has become a ubiquitous phrase in recent years. Day by day, data is growing at a staggering rate. One of the efficient technologies for dealing with Big Data is Hadoop. • Hadoop uses the MapReduce programming model to process large-volume data jobs. http://www.ijetae.com/files/Volume4Issue5/IJETAE_0514_15.pdf • Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of 'mid-size' datasets and use cases that call for real-time processing. http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-require-1986162.htm
  • 6. Big Data, Big Data Stack, Apache Spark and Hadoop Big Data ---------- • Big data is a term that describes a large volume of data: structured, semi-structured and unstructured. • But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. • The concept gained momentum in the early 2000s when industry analysts articulated the now-mainstream definition of big data as the three Vs: Volume – Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem, but new technologies (such as Hadoop) have eased the burden. Velocity – Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Variety – Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions. https://www.zettaset.com/index.php/info-center/what-is-big-data/
  • 7. Big Data, Big Data Stack, Apache Spark and Hadoop Big Data Stack -------------------
  • 8. Big Data, Big Data Stack, Apache Spark and Hadoop Apache Hadoop --------------------- • Hadoop is a framework designed to work with data sets far larger than conventional systems can handle. • Hadoop distributes this data across a set of machines. Its real power comes from its ability to scale to hundreds or thousands of computers, each containing several processor cores. • Many big enterprises believe that within a few years more than half of the world’s data will be stored in Hadoop. • Hadoop mainly consists of: 1. Hadoop Distributed File System (HDFS): a distributed file system that provides storage and fault tolerance. 2. Hadoop MapReduce: a parallel programming model that processes vast quantities of data via distributed computing across the cluster.
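The MapReduce model named on the slide above can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch in plain Python, not Hadoop's actual Java API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample input are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one chunk of a file stored in HDFS.
sample_lines = ["Hadoop stores data in HDFS", "Spark can read data from HDFS"]
counts = word_count(sample_lines)
```

The programmer only writes the mapper and the reducer; distribution, grouping by key and retries are the framework's job, which is what lets the same program scale from one machine to many.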
  • 9. Big Data, Big Data Stack, Apache Spark and Hadoop Apache Spark --------------------- • Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. • Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. • Spark is designed for data science and its abstraction makes data science easier. Data scientists commonly use machine learning, a set of techniques and algorithms that can learn from data. • The Spark project advertises running programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  • 10. Spark vs. Hadoop and the combination effect Performance ----------------- • Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after each map or reduce action, so Spark should outperform Hadoop MapReduce. • Nonetheless, Spark needs a lot of memory. Much like standard databases, it loads a process into memory and keeps it there until further notice, for the sake of caching. • If Spark runs on Hadoop YARN alongside other resource-demanding services, or if the data is too big to fit entirely in memory, Spark's performance can degrade significantly. • MapReduce, however, kills its processes as soon as a job is done, so it can easily run alongside other services with minor performance differences. • Bottom line: Spark performs better when all the data fits in memory, especially on dedicated clusters; Hadoop MapReduce is designed for data that doesn’t fit in memory and it can run well alongside other services.
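The performance argument above comes down to where intermediate results live between processing steps. The sketch below is plain Python on one machine, not Spark or Hadoop code, and every name in it is hypothetical: it runs the same three-step job once by writing each step's output to a file and reading it back (the MapReduce pattern described on the slide) and once by keeping the working set in memory (the Spark pattern), and counts the disk writes each approach incurs.

```python
import json
import os
import tempfile


def increment_all(values):
    """One processing step: stands in for a map or reduce stage."""
    return [v + 1 for v in values]


def run_with_disk_between_steps(values, steps, workdir):
    """MapReduce-style: every step reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "step_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    disk_writes = 1
    for i in range(1, steps + 1):
        with open(path) as f:
            current = json.load(f)
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(increment_all(current), f)
        disk_writes += 1
    with open(path) as f:
        return json.load(f), disk_writes


def run_in_memory(values, steps):
    """Spark-style: the working set stays in memory between steps."""
    current = list(values)
    for _ in range(steps):
        current = increment_all(current)
    return current, 0


data = list(range(5))
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_with_disk_between_steps(data, steps=3, workdir=workdir)
memory_result, memory_writes = run_in_memory(data, steps=3)
```

Both runs produce the same result; the difference is the per-step disk round trip, which grows with the number of iterations. The in-memory version only works as long as `current` actually fits in memory, which is the caveat the slide raises.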
  • 11. Spark vs. Hadoop and the combination effect Ease of Use: ----------------- • Spark has comfortable APIs for Java, Scala and Python, and also includes Spark SQL (the successor to Shark) for the SQL savvy. • Hadoop MapReduce is written in Java and is infamous for being very difficult to program. Pig makes it easier, though it takes some time to learn the syntax, and Hive adds SQL compatibility. • MapReduce doesn’t have an interactive mode, although Hive includes a command line interface. Projects like Impala, Presto and Tez aim to bring full interactive querying to Hadoop. • Bottom line: Spark is easier to program and includes an interactive mode; Hadoop MapReduce is more difficult to program but many tools are available to make it easier.
  • 12. Spark vs. Hadoop and the combination effect Cost: ----------------- • Both Spark and Hadoop MapReduce are open source, but money still needs to be spent on machines and staff. • Hardware requirements: the memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit in memory for optimal performance. So, if you need to process really Big Data, Hadoop will be the cheaper option, since hard disk space costs much less than memory. • Furthermore, there is a wide array of Hadoop-as-a-service offerings and Hadoop-based services, which help to skip the hardware and staffing requirements. In comparison, there are few Spark-as-a-service options and they are all very new. • Bottom line: Spark is more cost-effective according to the benchmarks, though staffing could be more costly; Hadoop MapReduce could be cheaper because more personnel are available and because of Hadoop-as-a-service offerings.
  • 13. Spark vs. Hadoop and the combination effect Data Processing: ---------------------- • Apache Spark can do more than plain data processing: it can process graphs and use the existing machine-learning libraries. • Spark can do real-time processing as well as batch processing. • Hadoop MapReduce is great for batch processing. If you want a real-time option you’ll need to use another platform like Storm or Impala, and for graph processing you can use Giraph. MapReduce used to have Apache Mahout for machine learning, but the elephant riders have ditched it in favor of Spark and H2O. • Bottom line: Spark is key for real-time data processing; Hadoop MapReduce is key for batch processing.
  • 14. Spark vs. Hadoop and the combination effect Failure Tolerance: ---------------------- • Spark has retries per task and speculative execution, just like MapReduce. Nonetheless, because MapReduce relies on hard drives, a process that crashes in the middle of execution can continue where it left off, which can save time, whereas Spark has to recompute lost in-memory data, in the worst case from the beginning. • Bottom line: Spark and Hadoop MapReduce both have good failure tolerance, but Hadoop MapReduce is slightly more tolerant. Security: ------------------ • Spark is a bit bare at the moment when it comes to security. • Spark can run on YARN and use HDFS, which means that it can also enjoy Kerberos authentication, HDFS file permissions and encryption between nodes. • Hadoop MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like Knox Gateway and Sentry. • Bottom line: Spark security is still in its infancy; Hadoop MapReduce has more security features and projects.
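On the failure-tolerance point above: Spark's recovery model is usually described as lineage-based, meaning a dataset records which parent and which transformation produced it, so lost in-memory results can be rebuilt by re-running those steps. The toy class below sketches that idea in plain Python. `LineageDataset` and its methods are hypothetical names for illustration, not Spark's RDD API, and the sketch ignores partitioning and distribution entirely.

```python
class LineageDataset:
    """Toy dataset that remembers its parent and transformation (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # raw input, only set on the root dataset
        self.parent = parent      # dataset this one was derived from
        self.fn = fn              # transformation applied to the parent
        self.cache = None         # in-memory result, may be lost
        self.recomputations = 0   # how many times the result was (re)built

    def map(self, fn):
        """Derive a new dataset lazily; nothing is computed yet."""
        return LineageDataset(parent=self, fn=fn)

    def collect(self):
        """Return the result, rebuilding it from the lineage if the cache is gone."""
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            result = list(self.source)
        else:
            result = [self.fn(x) for x in self.parent.collect()]
        self.recomputations += 1
        self.cache = result
        return result

    def lose_cache(self):
        """Simulate a crash that wipes this dataset's in-memory result."""
        self.cache = None


base = LineageDataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.collect()
plus_one.lose_cache()
doubled.lose_cache()
second = plus_one.collect()  # rebuilt from the surviving ancestor
```

Recovery walks back only as far as the nearest surviving ancestor: here `base` is still cached, so only the two derived datasets are recomputed. If nothing upstream survives, the walk goes all the way back to the source, which is the "start from the beginning" case the slide describes.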
  • 15. Practical Demo on Performance and Ease of Using the APIs