Spark Internals
Training Contents
☛ Course Contents
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
Course Contents
☛ Distributing Data with HDFS (Day 1)
☛ Understanding Hadoop I/O
☛ Spark Introduction
☛ RDDs
☛ RDD Internals:Part-1
☛ RDD Internals:Part-2 (Day 2)
☛ Data ingress and egress
☛ Running on a Cluster
☛ Spark Internals
☛ Advanced Spark Programming
☛ Spark Streaming
☛ Spark SQL (Day 3)
☛ Tuning and Debugging Spark
☛ Kafka Internals
☛ Storm Internals
■ Distributing Data with HDFS
• Interfaces
• Hadoop Filesystems
• The Design of HDFS
■ Parallel Copying with distcp
• Keeping an HDFS Cluster Balanced
• Hadoop Archives
■ Using Hadoop Archives
• Limitations
■ Data Flow
• Anatomy of a File Write
• Anatomy of a File Read
• Coherency Model
■ The Command-Line Interface
• Basic Filesystem Operations
■ The Java Interface
• Querying the Filesystem
• Reading Data Using the FileSystem API
• Directories
• Deleting Data
• Reading Data from a Hadoop URL
• Writing Data
■ Understanding Hadoop I/O
■ Serialization
• Implementing a Custom Writable
• Serialization Frameworks
• The Writable Interface
• Writable Classes
• Avro
■ Data Integrity
• ChecksumFileSystem
• LocalFileSystem
• Data Integrity in HDFS
■ ORC Files
• Large size enables efficient read of columns
• New types (datetime, decimal)
• Encoding specific to the column type
• Default stripe size is 250 MB
• A single file as output of each task
• Split files without scanning for markers
• Bound the amount of memory required for reading or writing.
• Lowers pressure on the NameNode
• Dramatically simplifies integration with Hive
• Break file into sets of rows called a stripe
• Complex types (struct, list, map, union)
• Support for the Hive type model
■ ORC Files:Footer
• Count, min, max, and sum for each column
• Types, number of rows
• Contains list of stripes
■ ORC Files:Index
• Required for skipping rows
• Position in each stream
• Min and max for each column
• Currently every 10,000 rows
• Could include bit field or bloom filter
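The index bullets above (min and max for each column, recorded every 10,000 rows, used for skipping rows) can be illustrated with a minimal sketch. This is not the ORC reader itself: `IndexEntry`, `build_index`, and `row_groups_to_read` are hypothetical names invented for this example, and the column is plain in-memory integers.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_INDEX_ENTRY = 10_000  # stride taken from the outline above


@dataclass
class IndexEntry:
    """Min/max statistics for one run of rows in a single column (hypothetical type)."""
    first_row: int
    minimum: int
    maximum: int


def build_index(values: List[int], stride: int = ROWS_PER_INDEX_ENTRY) -> List[IndexEntry]:
    """Record the min and max of every `stride` rows of a column."""
    entries = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        entries.append(IndexEntry(start, min(chunk), max(chunk)))
    return entries


def row_groups_to_read(index: List[IndexEntry], low: int, high: int) -> List[int]:
    """Return the first-row offsets of row groups that may hold values in [low, high]."""
    return [e.first_row for e in index if e.maximum >= low and e.minimum <= high]


column = list(range(35_000))  # hypothetical column with increasing values
index = build_index(column)
selected = row_groups_to_read(index, 12_000, 12_500)
print(selected)  # [10000]
```

Only the row group starting at row 10,000 overlaps the requested range, so a reader using such an index could skip the other three groups without decoding them.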
■ ORC Files:Postscript
• Contains compression parameters
• Size of compressed footer
■ ORC Files:Data
• Directory of stream locations
• Required for table scan
■ Parquet
• Nested Encoding
• Configurations
• Error recovery
• Extensibility
• Nulls
• File format
• Data Pages
• Motivation
• Unit of parallelization
• Logical Types
• Metadata
• Modules
• Column chunks
• Separating metadata and column data
• Checksumming
• Types
■ File-Based Data Structures
• MapFile
• SequenceFile
■ Compression
• Codecs
• Using Compression in MapReduce
• Compression and Input Splits
■ Spark Introduction
• GraphX
• MLlib
• Spark SQL
• Data Processing Applications
• Spark Streaming
• What Is Apache Spark?
• Data Science Tasks
• Spark Core
• Storage Layers for Spark
• Who Uses Spark, and for What?
• A Unified Stack
• Cluster Managers
■ RDDs
• Lazy Evaluation
• Common Transformations and Actions
• Passing Functions to Spark
• Creating RDDs
• RDD Operations
• Actions
• Transformations
• Scala
• Java
• Persistence
• Python
• Converting Between RDD Types
• RDD Basics
• Basic RDDs
■ RDD Internals:Part-1
• Expressing Existing Programming Models
• Fault Recovery
• Interpreter Integration
• Memory Management
• Implementation
• MapReduce
• RDD Operations in Spark
• Iterative MapReduce
• Console Log Mining
• Google's Pregel
• User Applications Built with Spark
• Behavior with Insufficient Memory
• Support for Checkpointing
• A Fault-Tolerant Abstraction
• Evaluation
• Job Scheduling
• Spark Programming Interface
• Advantages of the RDD Model
• Understanding the Speedup
• Leveraging RDDs for Debugging
• Iterative Machine Learning Applications
• Explaining the Expressivity of RDDs
• Representing RDDs
• Applications Not Suitable for RDDs
■ RDD Internals:Part-2
• Sorting Data
• Operations That Affect Partitioning
• Determining an RDD’s Partitioner
• Grouping Data
• Motivation
• Aggregations
• Data Partitioning (Advanced)
• Actions Available on Pair RDDs
• Joins
• Creating Pair RDDs
• Operations That Benefit from Partitioning
• Transformations on Pair RDDs
• Example: PageRank
• Custom Partitioners
■ Data ingress and egress
• Hadoop Input and Output Formats
• File Formats
• Local/“Regular” FS
• Text Files
• Java Database Connectivity
• Structured Data with Spark SQL
• Elasticsearch
• File Compression
• Apache Hive
• Cassandra
• Object Files
• Comma-Separated Values and Tab-Separated Values
• HBase
• Databases
• Filesystems
• SequenceFiles
• JSON
• HDFS
• Motivation
• JSON
• Amazon S3
■ Running on a Cluster
• Scheduling Within and Between Spark Applications
• Spark Runtime Architecture
• A Scala Spark Application Built with sbt
• Packaging Your Code and Dependencies
• Launching a Program
• A Java Spark Application Built with Maven
• Hadoop YARN
• Deploying Applications with spark-submit
• The Driver
• Standalone Cluster Manager
• Cluster Managers
• Executors
• Amazon EC2
• Cluster Manager
• Dependency Conflicts
• Apache Mesos
• Which Cluster Manager to Use?
■ Spark Internals
■ Spark:YARN Mode
• Resource Manager
• Node Manager
• Workers
• Containers
• Threads
• Task
• Executors
• Application Master
• Multiple Applications
• Tuning Parameters
■ Spark:LocalMode
■ Spark Caching
• With Serialization
• Off-heap
• In Memory
■ Running on a Cluster
• Scheduling Within and Between Spark Applications
• Spark Runtime Architecture
• A Scala Spark Application Built with sbt
• Packaging Your Code and Dependencies
• Launching a Program
• A Java Spark Application Built with Maven
• Hadoop YARN
• Deploying Applications with spark-submit
• The Driver
• Standalone Cluster Manager
• Cluster Managers
• Executors
• Amazon EC2
• Cluster Manager
• Dependency Conflicts
• Apache Mesos
• Which Cluster Manager to Use?
■ Spark Serialization
■ StandAlone Mode
• Task
• Multiple Applications
• Executors
• Tuning Parameters
• Workers
• Threads
■ Advanced Spark Programming
• Working on a Per-Partition Basis
• Optimizing Broadcasts
• Accumulators
• Custom Accumulators
• Accumulators and Fault Tolerance
• Numeric RDD Operations
• Piping to External Programs
• Broadcast Variables
■ Spark Streaming
• Stateless Transformations
• Output Operations
• Checkpointing
• Core Sources
• Receiver Fault Tolerance
• Worker Fault Tolerance
• Stateful Transformations
• Batch and Window Sizes
• Architecture and Abstraction
• Performance Considerations
• Streaming UI
• Driver Fault Tolerance
• Multiple Sources and Cluster Sizing
• Processing Guarantees
• A Simple Example
• Input Sources
• Additional Sources
• Transformations
■ Spark SQL
• User-Defined Functions
• Long-Lived Tables and Queries
• Spark SQL Performance
• Apache Hive
• Loading and Saving Data
• Initializing Spark SQL
• Parquet
• Performance Tuning Options
• SchemaRDDs
• Caching
• JSON
• From RDDs
• Linking with Spark SQL
• Spark SQL UDFs
• Using Spark SQL in Applications
• Basic Query Example
■ Tuning and Debugging Spark
• Driver and Executor Logs
• Memory Management
• Finding Information
• Configuring Spark with SparkConf
• Key Performance Considerations
• Components of Execution: Jobs, Tasks, and Stages
• Spark Web UI
• Hardware Provisioning
• Level of Parallelism
• Serialization Format
■ Kafka Internals
■ Kafka Core Concepts
• brokers
• Topics
• producers
• replicas
• Partitions
• consumers
■ Operating Kafka
• P&S tuning
• monitoring
• deploying
• Architecture
• hardware specs
■ Developing Kafka apps
• serialization
• compression
• testing
• Case Study
• reading from Kafka
• Writing to Kafka
■ Storm Internals
■ Developing Storm apps
• Case Studies
• Bolts and topologies
• P&S tuning
• serialization
• testing
• Kafka integration
■ Storm core concepts
• spouts
• groupings
• tuples
• bolts
• parallelism
• Topologies
■ Operating Storm
• monitoring
• Architecture
• deploying
• hardware specs
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com

More Related Content

What's hot

Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Databricks
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouMetrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Databricks
 

What's hot (20)

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouMetrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
 

Similar to Spark Internals Training | Apache Spark | Spark | Anika Technologies

Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Flink Forward
 

Similar to Spark Internals Training | Apache Spark | Spark | Anika Technologies (20)

The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 

More from Anand Narayanan

Deep learning with_computer_vision
Deep learning with_computer_visionDeep learning with_computer_vision
Deep learning with_computer_vision
Anand Narayanan
 
SynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDFSynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDF
Anand Narayanan
 

More from Anand Narayanan (12)

Scrum Foundation Training by Anika Technologies
Scrum Foundation Training by Anika TechnologiesScrum Foundation Training by Anika Technologies
Scrum Foundation Training by Anika Technologies
 
Agile Essentials Training by Anika Technologies
Agile Essentials Training by Anika TechnologiesAgile Essentials Training by Anika Technologies
Agile Essentials Training by Anika Technologies
 
Smart Staffing using Regression Analysis Model
Smart Staffing using Regression Analysis ModelSmart Staffing using Regression Analysis Model
Smart Staffing using Regression Analysis Model
 
Sentiment analysis using nlp
Sentiment analysis using nlpSentiment analysis using nlp
Sentiment analysis using nlp
 
Deep learning with_computer_vision
Deep learning with_computer_visionDeep learning with_computer_vision
Deep learning with_computer_vision
 
Advanced Elastic Search | Elastic Search | Kibana | Logstash
Advanced Elastic Search | Elastic Search | Kibana | LogstashAdvanced Elastic Search | Elastic Search | Kibana | Logstash
Advanced Elastic Search | Elastic Search | Kibana | Logstash
 
Deep learning internals
Deep learning internalsDeep learning internals
Deep learning internals
 
JVM and Java Performance Tuning | JVM Tuning | Java Performance
JVM and Java Performance Tuning | JVM Tuning | Java PerformanceJVM and Java Performance Tuning | JVM Tuning | Java Performance
JVM and Java Performance Tuning | JVM Tuning | Java Performance
 
Java Concurrency and Performance | Multi Threading | Concurrency | Java Conc...
Java Concurrency and Performance | Multi Threading  | Concurrency | Java Conc...Java Concurrency and Performance | Multi Threading  | Concurrency | Java Conc...
Java Concurrency and Performance | Multi Threading | Concurrency | Java Conc...
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
 
Big Data Analytics and Artifical Intelligence
Big Data Analytics and Artifical IntelligenceBig Data Analytics and Artifical Intelligence
Big Data Analytics and Artifical Intelligence
 
SynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDFSynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDF
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Spark Internals Training | Apache Spark | Spark | Anika Technologies

  • 1. Spark Internals Training Contents page2 ☛ Course Contents Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 2. Spark Internals Course Contents ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ Distributing Data with HDFS Day1 Understanding Hadoop I/O Spark Introduction RDDs RDD Internals:Part-1 RDD Internals:Part-2 Day2 Data ingress and egress Running on a Cluster Spark Internals Advanced Spark Programming Spark Streaming Spark SQL Day3 Tuning and Debugging Spark Kafka Internals Storm Internals Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 3. Spark Internals ■ Distributing Data with HDFS • Interfaces • Hadoop Filesystems • The Design of HDFS ■ Parallel Copying with distcp • Keeping an HDFS Cluster Balanced • Hadoop Archives ■ ■ Using Hadoop Archives • Limitations Data Flow • Anatomy of a File Write • Anatomy of a File Read • Coherency Model ■ ■ The Command-Line Interface • Basic Filesystem Operations The Java Interface • Querying the Filesystem • Reading Data Using the FileSystem API • Directories • Deleting Data • Reading Data from a Hadoop URL • Writing Data ■ Understanding Hadoop I/O ■ Serialization • Implementing a Custom Writable • Serialization Frameworks • The Writable Interface • Writable Classes • Avro Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 4. ■ Data Integrity • ChecksumFileSystem • LocalFileSystem • Data Integrity in HDFS ■ ORC Files • Large size enables efficient read of columns • New types (datetime, decimal) • Encoding specific to the column type • Default stripe size is 250 MB • A single file as output of each task • Split files without scanning for markers • Bound the amount of memory required for reading or writing. • Lowers pressure on the NameNode • Dramatically simplifies integration with Hive • Break file into sets of rows called a stripe • Complex types (struct, list, map, union) • Support for the Hive type model ■ ORC File:Footer • Count, min, max, and sum for each column • Types, number of rows • Contains list of stripes ■ ORC Files:Index • Required for skipping rows • Position in each stream • Min and max for each column • Currently every 10,000 rows • Could include bit field or bloom filter ■ ORC Files:Postscript • Contains compression parameters • Size of compressed footer ■ ORC Files:Data Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 5.
        • Directory of stream locations
        • Required for table scan
      ■ Parquet
        • Nested Encoding
        • Configurations
        • Error recovery
        • Extensibility
        • Nulls
        • File format
        • Data Pages
        • Motivation
        • Unit of parallelization
        • Logical Types
        • Metadata
        • Modules
        • Column chunks
        • Separating metadata and column data
        • Checksumming
        • Types
      ■ File-Based Data Structures
        • MapFile
        • SequenceFile
      ■ Compression
        • Codecs
        • Using Compression in MapReduce
        • Compression and Input Splits
      ■ Spark Introduction
        • GraphX
        • MLlib
        • Spark SQL
        • Data Processing Applications
        • Spark Streaming
  • 6.
        • What Is Apache Spark?
        • Data Science Tasks
        • Spark Core
        • Storage Layers for Spark
        • Who Uses Spark, and for What?
        • A Unified Stack
        • Cluster Managers
      ■ RDDs
        • Lazy Evaluation
        • Common Transformations and Actions
        • Passing Functions to Spark
        • Creating RDDs
        • RDD Operations
        • Actions
        • Transformations
        • Scala
        • Java
        • Persistence
        • Python
        • Converting Between RDD Types
        • RDD Basics
        • Basic RDDs
      ■ RDD Internals:Part-1
        • Expressing Existing Programming Models
        • Fault Recovery
        • Interpreter Integration
        • Memory Management
        • Implementation
        • MapReduce
        • RDD Operations in Spark
        • Iterative MapReduce
        • Console Log Mining
  • 7.
        • Google's Pregel
        • User Applications Built with Spark
        • Behavior with Insufficient Memory
        • Support for Checkpointing
        • A Fault-Tolerant Abstraction
        • Evaluation
        • Job Scheduling
        • Spark Programming Interface
        • Advantages of the RDD Model
        • Understanding the Speedup
        • Leveraging RDDs for Debugging
        • Iterative Machine Learning Applications
        • Explaining the Expressivity of RDDs
        • Representing RDDs
        • Applications Not Suitable for RDDs
      ■ RDD Internals:Part-2
        • Sorting Data
        • Operations That Affect Partitioning
        • Determining an RDD’s Partitioner
        • Grouping Data
        • Motivation
        • Aggregations
        • Data Partitioning (Advanced)
        • Actions Available on Pair RDDs
        • Joins
        • Creating Pair RDDs
        • Operations That Benefit from Partitioning
        • Transformations on Pair RDDs
        • Example: PageRank
        • Custom Partitioners
      ■ Data ingress and egress
        • Hadoop Input and Output Formats
  • 8.
        • File Formats
        • Local/“Regular” FS
        • Text Files
        • Java Database Connectivity
        • Structured Data with Spark SQL
        • Elasticsearch
        • File Compression
        • Apache Hive
        • Cassandra
        • Object Files
        • Comma-Separated Values and Tab-Separated Values
        • HBase
        • Databases
        • Filesystems
        • SequenceFiles
        • JSON
        • HDFS
        • Motivation
        • JSON
        • Amazon S3
      ■ Running on a Cluster
        • Scheduling Within and Between Spark Applications
        • Spark Runtime Architecture
        • A Scala Spark Application Built with sbt
        • Packaging Your Code and Dependencies
        • Launching a Program
        • A Java Spark Application Built with Maven
        • Hadoop YARN
        • Deploying Applications with spark-submit
        • The Driver
        • Standalone Cluster Manager
        • Cluster Managers
  • 9.
        • Executors
        • Amazon EC2
        • Cluster Manager
        • Dependency Conflicts
        • Apache Mesos
        • Which Cluster Manager to Use?
      ■ Spark Internals
      ■ Spark:YARN Mode
        • Resource Manager
        • Node Manager
        • Workers
        • Containers
        • Threads
        • Task
        • Executors
        • Application Master
        • Multiple Applications
        • Tuning Parameters
      ■ Spark:LocalMode
      ■ Spark Caching
        • With Serialization
        • Off-heap
        • In Memory
      ■ Running on a Cluster
        • Scheduling Within and Between Spark Applications
        • Spark Runtime Architecture
        • A Scala Spark Application Built with sbt
        • Packaging Your Code and Dependencies
        • Launching a Program
        • A Java Spark Application Built with Maven
        • Hadoop YARN
  • 10.
        • Deploying Applications with spark-submit
        • The Driver
        • Standalone Cluster Manager
        • Cluster Managers
        • Executors
        • Amazon EC2
        • Cluster Manager
        • Dependency Conflicts
        • Apache Mesos
        • Which Cluster Manager to Use?
      ■ Spark Serialization
      ■ StandAlone Mode
        • Task
        • Multiple Applications
        • Executors
        • Tuning Parameters
        • Workers
        • Threads
      ■ Advanced Spark Programming
        • Working on a Per-Partition Basis
        • Optimizing Broadcasts
        • Accumulators
        • Custom Accumulators
        • Accumulators and Fault Tolerance
        • Numeric RDD Operations
        • Piping to External Programs
        • Broadcast Variables
      ■ Spark Streaming
        • Stateless Transformations
        • Output Operations
        • Checkpointing
        • Core Sources
  • 11.
        • Receiver Fault Tolerance
        • Worker Fault Tolerance
        • Stateful Transformations
        • Batch and Window Sizes
        • Architecture and Abstraction
        • Performance Considerations
        • Streaming UI
        • Driver Fault Tolerance
        • Multiple Sources and Cluster Sizing
        • Processing Guarantees
        • A Simple Example
        • Input Sources
        • Additional Sources
        • Transformations
      ■ Spark SQL
        • User-Defined Functions
        • Long-Lived Tables and Queries
        • Spark SQL Performance
        • Apache Hive
        • Loading and Saving Data
        • Initializing Spark SQL
        • Parquet
        • Performance Tuning Options
        • SchemaRDDs
        • Caching
        • JSON
        • From RDDs
        • Linking with Spark SQL
        • Spark SQL UDFs
        • Using Spark SQL in Applications
        • Basic Query Example
      ■ Tuning and Debugging Spark
  • 12.
        • Driver and Executor Logs
        • Memory Management
        • Finding Information
        • Configuring Spark with SparkConf
        • Key Performance Considerations
        • Components of Execution: Jobs, Tasks, and Stages
        • Spark Web UI
        • Hardware Provisioning
        • Level of Parallelism
        • Serialization Format
      ■ Kafka Internals
      ■ Kafka Core Concepts
        • Brokers
        • Topics
        • Producers
        • Replicas
        • Partitions
        • Consumers
      ■ Operating Kafka
        • P&S tuning
        • Monitoring
        • Deploying
        • Architecture
        • Hardware specs
      ■ Developing Kafka apps
        • Serialization
        • Compression
        • Testing
        • Case Study
        • Reading from Kafka
        • Writing to Kafka
  • 13.
      ■ Storm Internals
      ■ Developing Storm apps
        • Case Studies
        • Bolts and topologies
        • P&S tuning
        • Serialization
        • Testing
        • Kafka integration
      ■ Storm core concepts
        • Spouts
        • Groupings
        • Tuples
        • Bolts
        • Parallelism
        • Topologies
      ■ Operating Storm
        • Monitoring
        • Architecture
        • Deploying
        • Hardware specs