Systems that never
stop (and Erlang)
      Joe Armstrong
How can we get

10 nines reliability?
SIX LAWS
ONE

ISOLATION
ISOLATION

   10 nines = 99.99999999% availability
   P(fail) = 10^-10
   If P(fail | one computer) = 10^-3 then
       P(fail | four computers) = 10^-12
   Fixed
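The arithmetic behind these figures can be sketched as follows. It assumes the computers fail independently and that the system is down only when all of them are down at once; the slide states neither assumption explicitly, and the symbols p and n are introduced here only for illustration.

```latex
\[
P(\text{all } n \text{ computers fail}) = p^{n},
\qquad
p = 10^{-3},\ n = 4
\;\Rightarrow\;
\left(10^{-3}\right)^{4} = 10^{-12} < 10^{-10}.
\]
```

Under the same assumptions, three computers would give 10^-9, which falls just short of the 10^-10 target.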
TWO

CONCURRENCY
Concurrency

   World is concurrent
   Need at least TWO computers to make a non-stop
    system
   TWO computers are concurrent and distributed
“My first message is that
             concurrency
   is best regarded as a program
        structuring principle”

Structured concurrent programming
             – Tony Hoare
        Redmond, July 2001
THREE

     MUST
DETECT FAILURES
Failure detection
   If you can’t detect a failure you can’t fix it
   Must work across machine boundaries
    the entire machine might fail
   Implies distributed error handling,
    no shared state,
    asynchronous messaging
FOUR

    FAULT
IDENTIFICATION
Failure Identification

   Fault detection is not enough - you must know why
    the failure occurred
   Implies that you have sufficient information for
    post hoc debugging
FIVE

  LIVE
 CODE
UPGRADE
Live code upgrade


   Must upgrade software while it is running
   Want zero down time
SIX

 STABLE
STORAGE
Stable storage

   Must store stuff forever
   No backup necessary - storage just works
   Implies multiple copies, distribution, ...
   Must keep crash reports
HISTORY

  Those who cannot learn from history are
doomed to repeat it.


                        George Santayana
GRAY
As with hardware, the key to software fault-tolerance is to
hierarchically decompose large systems into modules, each module being
a unit of service and a unit of failure. A failure of a module does
not propagate beyond the module.

...

 The process achieves fault containment by sharing no state with
other processes; its only contact with other processes is via messages
carried by a kernel message system

-                                                                 Jim Gray
-                        Why do computers stop and what can be done about it
-                           Technical Report 85.7 - Tandem Computers, 1985
SCHNEIDER
Halt on failure: in the event of an error, a processor
should halt instead of performing a possibly erroneous
operation.

Failure status property: when a processor fails,
other processors in the system must be informed. The
reason for failure must be communicated.

Stable storage property: the storage of a processor
should be partitioned into stable storage (which
survives a processor crash) and volatile storage (which
is lost if a processor crashes).
                                             Schneider
              ACM Computing Surveys 22(4):299-319, 1990
GRAY
   Fault containment through fail-fast software modules.
   Process-pairs to tolerate hardware and transient software faults.
   Transaction mechanisms to provide data and message integrity.
   Transaction mechanisms combined with process-pairs to ease
    exception handling and tolerate software faults.
   Software modularity through processes and messages.
KAY
Folks --

Just a gentle reminder that I took some pains at the last OOPSLA to
try to remind everyone that Smalltalk is not only NOT its syntax or
the class library, it is not even about classes. I'm sorry that I long ago
coined the term "objects" for this topic because it gets many people to
focus on the lesser idea.

The big idea is "messaging" -- that is what the kernal of Smalltalk/
Squeak is all about (and it's something that was never quite completed
in our Xerox PARC phase)....


http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
GRAY
Software modularity through processes
and messages. As with hardware, the key
to software fault-tolerance is to
hierarchically decompose large systems
into modules, each module being a unit of
service and a unit of failure. A failure of a
module does not propagate beyond the
module.
Fail Fast
The process approach to fault isolation advocates that the process
software be fail-fast, it should either function correctly or it
should detect the fault, signal failure and stop operating.

 Processes are made fail-fast by defensive programming. They check
all their inputs, intermediate results and data structures as a matter
of course. If any error is detected, they signal a failure and stop. In
the terminology of [Cristian], fail-fast software has small fault
detection latency.

Gray
Why ...
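In Erlang, the fail-fast style quoted above is often expressed with pattern matching and guards rather than explicit checks. The following is a minimal sketch, not taken from the talk; the module name `failfast_sketch` and the function `area/1` are hypothetical.

```erlang
-module(failfast_sketch).
-export([area/1]).

%% Only well-formed shapes match a clause. Any other input makes the
%% calling process crash with a function_clause error instead of
%% carrying on with a possibly wrong value.
area({square, Side}) when is_number(Side), Side >= 0 ->
    Side * Side;
area({rectangle, W, H}) when is_number(W), is_number(H), W >= 0, H >= 0 ->
    W * H.
```

Calling `failfast_sketch:area({circle, 1})` matches no clause, so the process stops at the point where the bad input arrived, which keeps the fault detection latency small.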
Fail Early
A fault in a software system can cause one or more
errors. The latency time which is the interval between
the existence of the fault and the occurrence of the
error can be very high, which complicates the
backwards analysis of an error ...

For an effective error handling we must detect errors and
failures as early as possible

                                                            Renzel -
                    Error Handling for Business Information Systems,
 Software Design and Management, GmbH & Co. KG, München, 2003
ARMSTRONG
   Processes are the units of error encapsulation. Errors
    occurring in a process will not affect other processes in the
    system. We call this property strong isolation.
   Processes do what they are supposed to do or fail as soon
    as possible.
   Failure and the reason for failure can be detected by
    remote processes.
   Processes share no state, but communicate by message
    passing.

                                                    Armstrong
     Making reliable systems in the presence of software errors
                                      PhD Thesis, KTH, 2003
COMMERCIAL
  BREAK
Joe’s 2nd theorem

   Whatever Joe starts talking about, he will end up
    talking about Erlang
Erlang was
  designed
 to program
fault-tolerant
   systems
[Diagram labels: Concurrent programming, Functional programming, Concurrency-oriented programming, Erlang, Fault tolerance, Multicore]
Erlang
   Very light-weight processes
   Very fast message passing
   Total separation between processes
   Automatic marshalling/demarshalling
   Fast sequential code
   Strict functional code
   Dynamic typing
   Transparent distribution
   Compose sequential AND concurrent code
Properties
   No sharing
   Hot code replacement
   Pure message passing
   No locks
   Lots of computers (= fault tolerant scalable ...)
   Functional programming (no side effects)
What is COP?
    Machine
    Process
    Message

    ➡ Large numbers of processes
    ➡ Complete isolation between processes
    ➡ Location transparency
    ➡ No sharing of data
    ➡ Pure message passing systems
Thread Safety
 Erlang programs are
 automatically thread
 safe if they don't use
 an external resource.
Functional
        If you call the
   same function twice with
     the same arguments
it should return the same value




                                 “jolly good”
                              Joe Armstrong
No Mutable State
   Mutable state needs locks
   No mutable state = no locks = programmer's bliss
Multicore ready
The rise of the cores
      2 cores won't hurt you
      4 cores will hurt a little
      8 cores will hurt a bit
      16 will start hurting
      32 cores will hurt a lot (2009)
      ...
      1 M cores ouch (2019)
          (complete paradigm shift)

      1997 1 Tflop = 850 kW
      2007 1 Tflop = 24 W (factor 35,000)
      2017 1 Tflop = ?
LAWS
ISOLATION
     CONCURRENCY
Pid = spawn(.....)
Pid = spawn(Node, ....)

Pid ! Message

receive
    Pattern1 -> Actions1;
    Pattern2 -> Actions2;
    ...
end
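The primitives on this slide can be combined into a small request/reply process. This is a sketch under stated assumptions, not code from the talk: the module name `echo_sketch`, the message shapes, and the 1-second timeout are all made up for illustration.

```erlang
-module(echo_sketch).
-export([start/0, call/2, loop/0]).

%% Spawn an isolated process running loop/0 and return its Pid.
start() ->
    spawn(?MODULE, loop, []).

%% Send a request tagged with a unique reference, then wait for the
%% reply carrying the same reference (or give up after one second).
call(Pid, Msg) ->
    Ref = make_ref(),
    Pid ! {self(), Ref, Msg},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        timeout
    end.

%% The process owns no shared state; it reacts only to messages.
loop() ->
    receive
        {From, Ref, stop} ->
            From ! {Ref, stopped};
        {From, Ref, Msg} ->
            From ! {Ref, {echo, Msg}},
            loop()
    end.
```

For example, `Pid = echo_sketch:start()` followed by `echo_sketch:call(Pid, hello)` returns `{echo, hello}`. The same `call/2` works unchanged if the process was started on another node with `spawn(Node, echo_sketch, loop, [])`.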
FAULT
IDENTIFICATION
  process_flag(trap_exit, true),   % needed to receive exits as messages
  link(Pid),
       receive
            {'EXIT', Pid, Why} ->
                  ...
       end
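A runnable version of this pattern might look like the sketch below. The module name `link_sketch` and the helper `watch/1` are hypothetical. The code relies on two facts about Erlang links: a linked process receives exit signals as `{'EXIT', Pid, Why}` messages only after it has set the `trap_exit` flag, and `spawn_link/1` creates the process and the link in one atomic step.

```erlang
-module(link_sketch).
-export([watch/1]).

%% Run Fun in a linked child process and report why it terminated.
watch(Fun) ->
    %% Convert incoming exit signals into messages; remember the old setting.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(Fun),
    Result = receive
                 {'EXIT', Pid, Why} -> {exited, Why}
             end,
    %% Restore the caller's original trap_exit setting.
    process_flag(trap_exit, Old),
    Result.
```

With this sketch, `link_sketch:watch(fun() -> exit(boom) end)` returns `{exited, boom}`, and a child that finishes without error is reported as `{exited, normal}`.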
LIVE CODE
            UPGRADE
   Can upgrade code while it's running

   Existing processes continue to use original code, new
    processes run new code - no mixups of namespaces

   Sophisticated roll-forward, roll-back, roll-back-on-error
    functions in OTP libraries

   Properly designed systems can be rolled-forward and
    back with no loss of service. Not easy, but possible
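At the language level, a long-running process usually picks up a new module version by making a fully qualified call to its own loop function. The sketch below illustrates only that mechanism; the module name `upgrade_sketch` and the message names are hypothetical, and the OTP release-handling functions mentioned on the slide are not shown.

```erlang
-module(upgrade_sketch).
-export([start/0, loop/1]).

%% Start a counter process with an initial count of 0.
start() ->
    spawn(?MODULE, loop, [0]).

loop(Count) ->
    receive
        {bump, From} ->
            From ! {count, Count + 1},
            loop(Count + 1);        % local call: stays in the version already running
        upgrade ->
            ?MODULE:loop(Count);    % fully qualified call: enters the newest loaded version
        stop ->
            ok
    end.
```

After a new version of the module has been compiled and loaded, sending `upgrade` makes the process continue in the new code with its state (`Count`) carried over.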
STABLE STORAGE
   Performed in libraries

    mnesia:transaction(
        fun() ->
              [Rec] = mnesia:read({Tab, Key}),   % Tab: table name (assumed)
               mnesia:write(Rec),
               ...
        end)
Projects
   CouchDB
   Amazon SimpleDB
   Mochiweb (Facebook chat)
   Scalaris
   Nitrogen
   Ejabberd (XMPP)
   RabbitMQ (AMQP)
   ....
Companies
   Ericsson
   Amazon
   Tail-f
   Kreditor
   Synapse
   ...
Books
THE END
