SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Building massive scale,
    fault tolerant,
job processing systems
    with Scala Akka
      framework
     Vignesh Sukumar
        SVCC 2012
About me

• Storage group, Backend Engineering at Box
• Love enterprise software!
• Interested in Big Data and building distributed
  systems in the cloud
About Box

• Leader in enterprise cloud collaboration and
  storage
• Cutting-edge work in backend, frontend,
  platform and engineering services
• A really fun place to work – we have a long
  slide!
Talk outline
• Job processing requirements
• Traditional & new models for job processing

• Akka actors framework
• Achieving and controlling high IO throughput
• Fine-grained fault tolerance
Typical architecture in a cloud storage
             environment
Practical realities

•Storage nodes are usually of varying
configurations (OS, processing power, storage
capacity, etc) mainly because of rapid evolution
in provisioning operations
•Some nodes are more over-worked than the
others (for ex, accepting live uploads)
•Billions of files; petabytes
Job processing requirements

• Iterate over all files (billions, petabyte scale):
  for ex, check consistency of all files

• High throughput

• Fault tolerant

• Secure
Traditional job processing model
Why traditional models fail in cloud
       storage environments
• Not scalable: petabyte scale, billions of files
• Insecure: cannot move files out of storage
  nodes
• No performance control: easy to overwhelm
  any storage node
• No fine grained fault tolerance
Compute on Storage

• Move job computation directly to storage
  nodes
• Utilize abundant CPU on storage nodes
• Metadata store still stays in a highly available
  system like a RDBMS
• Results from operations on a file are
  completely independent
Master – slave architecture
Benefits

• High IO throughput: Direct access; no transfer
  of files over a network
• Secure: files do not leave storage nodes
• Better performance control: compute can
  easily monitor system load and back off
• Better fault tolerance handling: finer grained
  handling of errors
Master node

• Responsible for accepting job submissions and
  splitting them to tasks for slave nodes
• Stateful: keeps durable copy of jobs and tasks
  in Zookeeper
• Horizontally scalable: service can be run on
  multiple nodes
Agent

• Runs directly on the storage nodes on a
  machine-independent JVM container
• Stateless: no task state is maintained
• Monitors system load with back-off
• Reports results directly to master without
  synchronizing with other agents
Implementation with the
  the Scala Akka Actor
       framework
Actors

• Concurrent threads abstraction with no
  shared state
• Exchange messages
• Asynchronous, non-blocking
• Multiple actors can map to a single OS thread
• Parent-children hierarchical relationship
Actors and messages
• Class MyActor extends Actor {
  def receive = {
    case MsgType1 => // do something
  }
}

// instantiation and sending messages
 val actorRef = system.actorOf(Props(new MyActor))
actorRef ! MsgType1
Agent Actor System
Achieving high IO throughput
• Parallel, asynchronous IO through “Futures”
val fileIOResult = Future {
  // issue high latency tasks like file IO
 }
val networkIOResult = Future { // read from network }

Futures.awaitAll(<wait time>, fileIOResult, networkIOResult)
fileIOResult onSuccess { // do something }
networkIOResult onFailure { // retry }
Controlling system throughput

• The problem: agents need to throttle
  themselves as storage nodes serve live traffic

• Adjust number of parallel workers dynamically
  through a monitoring service
Controlling throughput: Examples

•Parallelism parameters can be gotten from a
separate configuration service on a per node
basis
•Some machines can be speeded up and others
slowed down this way
•The configuration can be updated on a cron
schedule to speed up during weekends
Fine grained fault tolerance with
              Supervisors

• Parents of child actors can define specific
  fault-handling strategies for each failure
  scenario in their children
• Components can fail gracefully without
  affecting the entire system
Supervision strategy: Examples


Class TaskActor extends Actor {
  // create child workers
  override val supervisorStrategy = OneForOneStrategy(maxNrOrRetries = 3) {
   case SqlException => Resume // retry the same file
   case FileCorruptionException => Stop // don’t clobber it!
   case IOException => Restart // report and move on
}
Unit testing

• Scalatra test framework: very easy to read!
  TaskActorTest.receive(BadFileMsg) must throw
  FileNotFoundException
• Mocks for network and database calls
val mockHttp = mock[HttpExecutor]
TaskActorTest ! doHttpPost
there was atLeastOne(mockHttp).POST


• Extensive testing of failure injection scenarios
Takeaways
• Keep your architecture simple by modeling
  actor message flow along the same paths as
  parent-child actor hierarchy (i.e., no message
  exchange between peer child actors)
• Design and implement for component failures
• Write unit tests extensively: we did not have
  any fundamental level functionality breakage
• Box Engineering is awesome!

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Kai Wähner
 
Operating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyOperating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyPeter Tröger
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Prometheus
PrometheusPrometheus
Prometheuswyukawa
 
Network Automation (NetDevOps) with Ansible
Network Automation (NetDevOps) with AnsibleNetwork Automation (NetDevOps) with Ansible
Network Automation (NetDevOps) with AnsibleAPNIC
 
The aggregate is dead! Long live the aggregate! - SpringIO.pdf
The aggregate is dead! Long live the aggregate! - SpringIO.pdfThe aggregate is dead! Long live the aggregate! - SpringIO.pdf
The aggregate is dead! Long live the aggregate! - SpringIO.pdfSara Pellegrini
 
Speed up web API with Laravel and Swoole using Docker
Speed up web API with Laravel and Swoole using DockerSpeed up web API with Laravel and Swoole using Docker
Speed up web API with Laravel and Swoole using DockerLaravel Poland MeetUp
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuCeph Community
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...NETWAYS
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Timothy Spann
 
An Introduction to CMake
An Introduction to CMakeAn Introduction to CMake
An Introduction to CMakeICS
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 

Was ist angesagt? (20)

Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Swoole w PHP. Czy to ma sens?
Swoole w PHP. Czy to ma sens?Swoole w PHP. Czy to ma sens?
Swoole w PHP. Czy to ma sens?
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
 
kafka
kafkakafka
kafka
 
Operating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyOperating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - Concurrency
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Prometheus
PrometheusPrometheus
Prometheus
 
Network Automation (NetDevOps) with Ansible
Network Automation (NetDevOps) with AnsibleNetwork Automation (NetDevOps) with Ansible
Network Automation (NetDevOps) with Ansible
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
The aggregate is dead! Long live the aggregate! - SpringIO.pdf
The aggregate is dead! Long live the aggregate! - SpringIO.pdfThe aggregate is dead! Long live the aggregate! - SpringIO.pdf
The aggregate is dead! Long live the aggregate! - SpringIO.pdf
 
Speed up web API with Laravel and Swoole using Docker
Speed up web API with Laravel and Swoole using DockerSpeed up web API with Laravel and Swoole using Docker
Speed up web API with Laravel and Swoole using Docker
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)
 
An Introduction to CMake
An Introduction to CMakeAn Introduction to CMake
An Introduction to CMake
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 

Ähnlich wie Massive scale job processing with Scala Akka framework

Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Ilya Ganelin
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with EpsilonSina Madani
 
Typesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayTypesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayLuka Zakrajšek
 
Indic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvmIndic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvmIndicThreads
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio, Inc.
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez DataWorks Summit
 
Enhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptxEnhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptxRohit Radhakrishnan
 
MongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examplesPeter Lawrey
 
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache TomcatCase Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache TomcatVMware Hyperic
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreDataWorks Summit
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 

Ähnlich wie Massive scale job processing with Scala Akka framework (20)

Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with Epsilon
 
Typesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and PlayTypesafe stack - Scala, Akka and Play
Typesafe stack - Scala, Akka and Play
 
Indic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvmIndic threads pune12-typesafe stack software development on the jvm
Indic threads pune12-typesafe stack software development on the jvm
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
Fastest Servlets in the West
Fastest Servlets in the WestFastest Servlets in the West
Fastest Servlets in the West
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata Services
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez
 
Enhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptxEnhanced Reframework Session_16-07-2022.pptx
Enhanced Reframework Session_16-07-2022.pptx
 
MongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOLMongoDB: How We Did It – Reanimating Identity at AOL
MongoDB: How We Did It – Reanimating Identity at AOL
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examples
 
DataOps with Project Amaterasu
DataOps with Project AmaterasuDataOps with Project Amaterasu
DataOps with Project Amaterasu
 
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache TomcatCase Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
Case Study: Migrating Hyperic from EJB to Spring from JBoss to Apache Tomcat
 
Road Trip To Component
Road Trip To ComponentRoad Trip To Component
Road Trip To Component
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 

Kürzlich hochgeladen

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Kürzlich hochgeladen (20)

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Massive scale job processing with Scala Akka framework

  • 1. Building massive scale, fault tolerant, job processing systems with Scala Akka framework Vignesh Sukumar SVCC 2012
  • 2. About me • Storage group, Backend Engineering at Box • Love enterprise software! • Interested in Big Data and building distributed systems in the cloud
  • 3. About Box • Leader in enterprise cloud collaboration and storage • Cutting-edge work in backend, frontend, platform and engineering services • A really fun place to work – we have a long slide!
  • 4. Talk outline • Job processing requirements • Traditional & new models for job processing • Akka actors framework • Achieving and controlling high IO throughput • Fine-grained fault tolerance
  • 5. Typical architecture in a cloud storage environment
  • 6. Practical realities •Storage nodes are usually of varying configurations (OS, processing power, storage capacity, etc) mainly because of rapid evolution in provisioning operations •Some nodes are more over-worked than the others (for ex, accepting live uploads) •Billions of files; petabytes
  • 7. Job processing requirements • Iterate over all files (billions, petabyte scale): for ex, check consistency of all files • High throughput • Fault tolerant • Secure
  • 9. Why traditional models fail in cloud storage environments • Not scalable: petabyte scale, billions of files • Insecure: cannot move files out of storage nodes • No performance control: easy to overwhelm any storage node • No fine grained fault tolerance
  • 10. Compute on Storage • Move job computation directly to storage nodes • Utilize abundant CPU on storage nodes • Metadata store still stays in a highly available system like a RDBMS • Results from operations on a file are completely independent
  • 11. Master – slave architecture
  • 12. Benefits • High IO throughput: Direct access; no transfer of files over a network • Secure: files do not leave storage nodes • Better performance control: compute can easily monitor system load and back off • Better fault tolerance handling: finer grained handling of errors
  • 13. Master node • Responsible for accepting job submissions and splitting them to tasks for slave nodes • Stateful: keeps durable copy of jobs and tasks in Zookeeper • Horizontally scalable: service can be run on multiple nodes
  • 14. Agent • Runs directly on the storage nodes on a machine-independent JVM container • Stateless: no task state is maintained • Monitors system load with back-off • Reports results directly to master without synchronizing with other agents
  • 15. Implementation with the the Scala Akka Actor framework
  • 16. Actors • Concurrent threads abstraction with no shared state • Exchange messages • Asynchronous, non-blocking • Multiple actors can map to a single OS thread • Parent-children hierarchical relationship
  • 17. Actors and messages • Class MyActor extends Actor { def receive = { case MsgType1 => // do something } } // instantiation and sending messages val actorRef = system.actorOf(Props(new MyActor)) actorRef ! MsgType1
  • 19. Achieving high IO throughput • Parallel, asynchronous IO through “Futures” val fileIOResult = Future { // issue high latency tasks like file IO } val networkIOResult = Future { // read from network } Futures.awaitAll(<wait time>, fileIOResult, networkIOResult) fileIOResult onSuccess { // do something } networkIOResult onFailure { // retry }
  • 20. Controlling system throughput • The problem: agents need to throttle themselves as storage nodes serve live traffic • Adjust number of parallel workers dynamically through a monitoring service
  • 21. Controlling throughput: Examples •Parallelism parameters can be gotten from a separate configuration service on a per node basis •Some machines can be speeded up and others slowed down this way •The configuration can be updated on a cron schedule to speed up during weekends
  • 22. Fine grained fault tolerance with Supervisors • Parents of child actors can define specific fault-handling strategies for each failure scenario in their children • Components can fail gracefully without affecting the entire system
  • 23. Supervision strategy: Examples Class TaskActor extends Actor { // create child workers override val supervisorStrategy = OneForOneStrategy(maxNrOrRetries = 3) { case SqlException => Resume // retry the same file case FileCorruptionException => Stop // don’t clobber it! case IOException => Restart // report and move on }
  • 24. Unit testing • Scalatra test framework: very easy to read! TaskActorTest.receive(BadFileMsg) must throw FileNotFoundException • Mocks for network and database calls val mockHttp = mock[HttpExecutor] TaskActorTest ! doHttpPost there was atLeastOne(mockHttp).POST • Extensive testing of failure injection scenarios
  • 25. Takeaways • Keep your architecture simple by modeling actor message flow along the same paths as parent-child actor hierarchy (i.e., no message exchange between peer child actors) • Design and implement for component failures • Write unit tests extensively: we did not have any fundamental level functionality breakage • Box Engineering is awesome!

Hinweis der Redaktion

  1. 1. Example of a job is to check consistency of all the files: this will involve iterating over every file on all storage nodes, reading file and verifying content integrity.
  2. Scalability: non-performant because of the IO bottleneck in getting files to the application cluster Insecure: application clusters can store the files locally. It’s easy to melt a single a storage node by reading or writing a lot to it Cannot perform fine grained fault tolerance