The Mathematics of Batch
      Processing



                 Nathan Marz
                    BackType
Motivation

A story about Timmy the software engineer
Timmy @ Big Data Inc.

[Diagram: New Data -> Hadoop Workflow -> Processed Output]
Timmy @ Big Data Inc.
• Business requirement: 15-hour turnaround on processing new data

• Current turnaround is 10 hours

• Plenty of extra capacity!
Timmy @ Big Data Inc.

• Company increases data collection rate
  by 10%

    Surprise! Turnaround time
    explodes to 30 hours!
Timmy @ Big Data Inc.

Fix it ASAP! We’re
losing customers!
Timmy @ Big Data Inc.

        We need 2 times
        more machines!
Timmy @ Big Data Inc.

We don’t even have that
much space in the
datacenter!
Timmy @ Big Data Inc.

[Diagram: Data Center with Rack 1 and Rack 2]
Timmy @ Big Data Inc.

[Diagram: Data Center with Rack 1, Rack 2, and New Rack]
Timmy @ Big Data Inc.
• Turnaround drops to 6 hours!!

              ??      ??
False Assumptions
• Processing 10% more data will take only 10% longer

• Adding 50% more machines will yield only 50% more performance
What is a batch processing
         system?


     while (true) {
       processNewData()
     }
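The loop above can be sketched concretely (a minimal Python driver; `process_new_data` is a hypothetical stand-in for a real Hadoop workflow, not something from the deck):

```python
import time

def run_batch_loop(process_new_data, iterations=3):
    """Minimal batch-processing driver: each run processes whatever
    data accumulated while the previous run was executing."""
    runtimes = []
    for _ in range(iterations):
        start = time.time()
        process_new_data()        # process everything collected so far
        runtimes.append(time.time() - start)
    return runtimes
```

The per-run runtimes this loop records are exactly what the rest of the deck models: each run's duration determines how much new data the next run must process.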
“Hours of Data”
• Assume constant rate of new data



• Measure amount of data in terms of
  hours
Questions to answer
• How does a 10% increase in data cause
  my turnaround time to increase by 200%?

• Why doesn’t the speed of my workflow
  double when I double the number of
  machines?
• How many machines do I need for my
  workflow to perform well and be fault-
  tolerant?
Example
• Workflow that runs in 10 hours


• 10 hours of data processed each run
Example
• Suppose you extend workflow with a
  component that will take 2 hours on 10
  hour dataset


• Workflow runtime may increase by a
  lot more than 2 hours!
Example
• Will it increase by 3 hours?

• Will it increase by 50 hours?

• Will it get longer and longer each
  iteration forever?
Example
• Increased runtime of workflow that operates on 10 hours of data to 12 hours

• Next run, there will be 12 hours of data to process

• Because there is more data, the run will take longer
Example
• Which means next iteration will have
  even more data

• And so on...

    Does the runtime ever stabilize?

                 If so, when?
Math

Runtime = Overhead + (Hours of Data) x (Time to process one hour of data)

T = O + H x P

Runtime for a single run of a workflow
Overhead (O)
• Fixed time in workflow
  – Job startup time
  – Time spent independent of the amount of data
Time to Process One Hour of Data (P)
• How long it takes to process one hour of data, excluding overhead

  • P = 1   -> each hour of data adds one hour to runtime
  • P = 2   -> each hour of data adds two hours to runtime
  • P = 0.5 -> each hour of data adds 30 minutes to runtime
Stable Runtime

T = O + H x P

Stabilizes when:
Runtime (T) = Hours of data processed (H)

T = O + T x P

Solving for T:

T = O / (1 - P)

Stable Runtime = Overhead / (1 - Time to process one hour of data)
Stable Runtime
• Linearly proportional to Overhead (O)

• Non-linear in P: blows up as P approaches 1
  – Diminishing returns from each new machine
Double # machines

• Why doesn’t the speed of my workflow
  double when I double the number of
  machines?
Double # machines

Old runtime = O / (1 - P)

New runtime = O / (1 - P/2)

New runtime / Old runtime = (1 - P) / (1 - P/2)
Double # machines

[Graph: New runtime / Old runtime vs. Time to process one hour of data (P)]
Double # machines
• P = 0.9 (54 minutes / hour of data)
  -> runtime decreases by ~80%

• P = 0.2 (12 minutes / hour of data)
  -> runtime decreases by ~10%
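These numbers follow directly from the ratio (1 - P) / (1 - P/2); a small sketch (assuming, as the derivation does, that doubling machines halves P):

```python
def runtime_ratio_after_doubling(p):
    """New/old stable runtime when the cluster doubles (P -> P/2)."""
    return (1 - p) / (1 - p / 2)

for p in (0.9, 0.5, 0.2):
    drop = 1 - runtime_ratio_after_doubling(p)
    print(f"P = {p}: runtime drops by {drop:.0%}")
```

The higher P is, the more a doubled cluster helps; at low P the overhead dominates and the extra machines are mostly wasted.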
Increase in Data

• Why does a 10% increase in data cause
  my turnaround to increase by 200%?
Increase in Data

Old runtime = O / (1 - P)

New runtime = O / (1 - 1.1*P)

New runtime / Old runtime = (1 - P) / (1 - 1.1*P)
Increase in Data

[Graph: New runtime / Old runtime vs. Time to process one hour of data (P)]
Increase in Data

• Less “extra capacity” -> more dramatic deterioration in performance

• The same effect can be triggered by:
  • an increase in hardware/software failures
  • sharing the cluster with other jobs
Real life example
• How does optimizing out 30% of my
  workflow runtime cause the runtime to
  decrease by 80%?
Real life example
• 30 hour workflow

• Remove bottleneck causing 10 hours of
  overhead

• Runtime dropped to 6 hours
Real life example

30 = O / (1 - P)

6 = (O - 10) / (1 - P)

O = 12.5, P = 0.58
Takeaways
• You should measure the O and P values of your workflow to avoid disasters

• When P is high:
  – Expand the cluster
  – OR: optimize the code that touches data

• When P is low:
  – Optimize overhead (e.g., reduce job startup time)
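Since T = O + H x P is linear in H, O and P can be estimated from a handful of measured runs with an ordinary least-squares fit (a sketch; the sample measurements are made up):

```python
def fit_o_and_p(hours_of_data, runtimes):
    """Least-squares fit of T = O + H * P from measured runs.
    Returns (O, P)."""
    n = len(hours_of_data)
    mean_h = sum(hours_of_data) / n
    mean_t = sum(runtimes) / n
    cov = sum((h - mean_h) * (t - mean_t)
              for h, t in zip(hours_of_data, runtimes))
    var = sum((h - mean_h) ** 2 for h in hours_of_data)
    p = cov / var
    o = mean_t - p * mean_h
    return o, p

# Hypothetical measurements: (hours of data processed, runtime in hours)
o, p = fit_o_and_p([10, 12, 20], [7, 8, 12])
print(o, p)    # -> 2.0 0.5 (these points fit T = 2 + 0.5 * H exactly)
```

Once O and P are known, the stable-runtime formula predicts how the workflow will respond to more data or more machines before the surprise happens.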
Questions?


           Nathan Marz
            BackType

      nathan.marz@gmail.com
        Twitter: @nathanmarz
     http://nathanmarz.com/blog

Weitere ähnliche Inhalte

Andere mochten auch

Congelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertadCongelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertadDiario Elcomahueonline
 
Datos importantes para planificar un evento edith giraldo productora de eve...
Datos importantes para planificar un evento   edith giraldo productora de eve...Datos importantes para planificar un evento   edith giraldo productora de eve...
Datos importantes para planificar un evento edith giraldo productora de eve...Edith Giraldo
 
BNI 10 Minute Presentation from Supply My School
BNI 10 Minute Presentation from Supply My SchoolBNI 10 Minute Presentation from Supply My School
BNI 10 Minute Presentation from Supply My SchoolDavid du Plessis
 
Paradigmas tecnoeconomicos
Paradigmas tecnoeconomicosParadigmas tecnoeconomicos
Paradigmas tecnoeconomicosMARIELIPALENCIA
 
γιορτή της σημαίας
γιορτή της σημαίαςγιορτή της σημαίας
γιορτή της σημαίαςMaria Rokadaki
 
54 Tactics You Can Do Yourself to get REAL customers to follow you
54 Tactics You Can Do Yourself to get REAL customers to follow you54 Tactics You Can Do Yourself to get REAL customers to follow you
54 Tactics You Can Do Yourself to get REAL customers to follow youIntranet Future
 
2012 DuPage Environmental Summit
2012 DuPage Environmental Summit2012 DuPage Environmental Summit
2012 DuPage Environmental SummitNapervilleNCEC
 
Leccion i persona_y_organizacion
Leccion i persona_y_organizacionLeccion i persona_y_organizacion
Leccion i persona_y_organizacionrichard rivera
 

Andere mochten auch (15)

Cartas
CartasCartas
Cartas
 
Congelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertadCongelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertad
 
Datos importantes para planificar un evento edith giraldo productora de eve...
Datos importantes para planificar un evento   edith giraldo productora de eve...Datos importantes para planificar un evento   edith giraldo productora de eve...
Datos importantes para planificar un evento edith giraldo productora de eve...
 
BNI 10 Minute Presentation from Supply My School
BNI 10 Minute Presentation from Supply My SchoolBNI 10 Minute Presentation from Supply My School
BNI 10 Minute Presentation from Supply My School
 
Paradigmas tecnoeconomicos
Paradigmas tecnoeconomicosParadigmas tecnoeconomicos
Paradigmas tecnoeconomicos
 
Res.Talk.Dec2009
Res.Talk.Dec2009Res.Talk.Dec2009
Res.Talk.Dec2009
 
γιορτή της σημαίας
γιορτή της σημαίαςγιορτή της σημαίας
γιορτή της σημαίας
 
Cobertura Aids 2010 Viena
Cobertura Aids 2010 VienaCobertura Aids 2010 Viena
Cobertura Aids 2010 Viena
 
VIH-AIDS 2008.
VIH-AIDS 2008.VIH-AIDS 2008.
VIH-AIDS 2008.
 
54 Tactics You Can Do Yourself to get REAL customers to follow you
54 Tactics You Can Do Yourself to get REAL customers to follow you54 Tactics You Can Do Yourself to get REAL customers to follow you
54 Tactics You Can Do Yourself to get REAL customers to follow you
 
GANGA
GANGAGANGA
GANGA
 
What is PR?
What is PR?What is PR?
What is PR?
 
Aux emferordena
Aux emferordenaAux emferordena
Aux emferordena
 
2012 DuPage Environmental Summit
2012 DuPage Environmental Summit2012 DuPage Environmental Summit
2012 DuPage Environmental Summit
 
Leccion i persona_y_organizacion
Leccion i persona_y_organizacionLeccion i persona_y_organizacion
Leccion i persona_y_organizacion
 

Mehr von nathanmarz

Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineeringnathanmarz
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processingnathanmarz
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easynathanmarz
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineeringnathanmarz
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrongnathanmarz
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itnathanmarz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Become Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackTypeBecome Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackTypenathanmarz
 
The Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data SystemsThe Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data Systemsnathanmarz
 
Clojure at BackType
Clojure at BackTypeClojure at BackType
Clojure at BackTypenathanmarz
 
Cascalog workshop
Cascalog workshopCascalog workshop
Cascalog workshopnathanmarz
 
Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loopnathanmarz
 
Cascalog at Hadoop Day
Cascalog at Hadoop DayCascalog at Hadoop Day
Cascalog at Hadoop Daynathanmarz
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Groupnathanmarz
 

Mehr von nathanmarz (18)

Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineering
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrong
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
 
Storm
StormStorm
Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
ElephantDB
ElephantDBElephantDB
ElephantDB
 
Become Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackTypeBecome Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackType
 
The Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data SystemsThe Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data Systems
 
Clojure at BackType
Clojure at BackTypeClojure at BackType
Clojure at BackType
 
Cascalog workshop
Cascalog workshopCascalog workshop
Cascalog workshop
 
Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loop
 
Cascalog at Hadoop Day
Cascalog at Hadoop DayCascalog at Hadoop Day
Cascalog at Hadoop Day
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Cascalog
CascalogCascalog
Cascalog
 
Cascading
CascadingCascading
Cascading
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Kürzlich hochgeladen (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Mathematics of Batch Processing

  • 1. The Mathematics of Batch Processing Nathan Marz BackType
  • 2. Motivation A story about Timmy the software engineer
  • 3. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 4. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 5. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 6. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 7. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 8. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 9. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 10. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 11. Timmy @ Big Data Inc. New Data Processed Output Hadoop Workflow
  • 12. Timmy @ Big Data Inc. • Business requirement: 15 hour turnaround on processing new data
  • 13. Timmy @ Big Data Inc. • Business requirement: 15 hour turnaround on processing new data • Current turnaround is 10 hours
  • 14. Timmy @ Big Data Inc. • Business requirement: 15 hour turnaround on processing new data • Current turnaround is 10 hours • Plenty of extra capacity!
  • 15. Timmy @ Big Data Inc. • Company increases data collection rate by 10%
  • 16. Timmy @ Big Data Inc. • Company increases data collection rate by 10% Surprise! Turnaround time explodes to 30 hours!
  • 17. Timmy @ Big Data Inc. Fix it ASAP! We’re losing customers!
  • 18. Timmy @ Big Data Inc. We need 2 times more machines!
  • 19. Timmy @ Big Data Inc. We don’t even have that much space in the datacenter!
  • 20. Timmy @ Big Data Inc. Rack 1 Rack 2 Data Center
  • 21. Timmy @ Big Data Inc. Rack 1 Rack 2 New Rack Data Center
  • 22. Timmy @ Big Data Inc. • Turnaround drops to 6 hours!! ?? ??
  • 23. False Assumptions • Will take 10% longer to process 10% more data • 50% more machines only creates 50% more performance
  • 24. What is a batch processing system? while (true) { processNewData() }
  • 25. “Hours of Data” • Assume constant rate of new data • Measure amount of data in terms of hours
  • 26. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%?
  • 27. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%? • Why doesn’t the speed of my workflow double when I double the number of machines?
  • 28. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%? • Why doesn’t the speed of my workflow double when I double the number of machines? • How many machines do I need for my workflow to perform well and be fault- tolerant?
  • 29. Example • Workflow that runs in 10 hours • 10 hours of data processed each run
  • 31. Example • Suppose you extend workflow with a component that will take 2 hours on 10 hour dataset
  • 32. Example • Suppose you extend workflow with a component that will take 2 hours on 10 hour dataset • Workflow runtime may increase by a lot more than 2 hours!
  • 34. Example • Will it increase by 3 hours?
  • 35. Example • Will it increase by 3 hours? • Will it increase by 50 hours?
  • 36. Example • Will it increase by 3 hours? • Will it increase by 50 hours? • Will it get longer and longer each iteration forever?
  • 38. Example • Increased runtime of workflow that operates on 10 hours of data to 12 hours
  • 39. Example • Increased runtime of workflow that operates on 10 hours of data to 12 hours • Next run, there will be 12 hours of data to process
  • 40. Example • Increased runtime of workflow that operates on 10 hours of data to 12 hours • Next run, there will be 12 hours of data to process • Because more data, will take longer to run
  • 41. Example • Which means next iteration will have even more data
  • 42. Example • Which means next iteration will have even more data • And so on...
  • 43. Example • Which means next iteration will have even more data • And so on... Does the runtime ever stabilize?
  • 44. Example • Which means next iteration will have even more data • And so on... Does the runtime ever stabilize? If so, when?
• 45. Math • Runtime for a single run of a workflow: Runtime = Overhead + (Hours of Data) x (Time to process one hour of data)
• 46. Math • Runtime for a single run of a workflow: Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) • T = O + H x P
  • 47. Overhead (O) • Fixed time in workflow – Job startup time – Time spent independent of amount of data Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
  • 48. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
  • 49. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P=1 -> Each hour adds one hour to runtime Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
  • 50. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P=1 -> Each hour adds one hour to runtime • P=2 -> Each hour adds two hours to runtime Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
  • 51. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P=1 -> Each hour adds one hour to runtime • P=2 -> Each hour adds two hours to runtime • P = 0.5 -> Each hour adds 30 minutes to runtime Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
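The model T = O + H x P can be sketched directly in code; the numbers below are illustrative, not from the talk:

```python
def runtime(overhead_hours, hours_of_data, p):
    """T = O + H * P: fixed overhead plus per-hour processing cost."""
    return overhead_hours + hours_of_data * p

# Illustrative values: 1 hour of overhead, 10 hours of data,
# 30 minutes of processing per hour of data (P = 0.5)
print(runtime(1.0, 10.0, 0.5))  # 6.0
```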
• 52. Stable Runtime • Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) • T = O + H x P
• 53. Stable Runtime • Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) • T = O + H x P • Stabilizes when: Runtime (T) = Hours of data processed (H)
• 54. Stable Runtime • Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) • T = O + H x P • Stabilizes when: Runtime (T) = Hours of data processed (H) • T = O + T x P
• 57. Stable Runtime • T = O + T x P • Solving for T: T = O / (1 - P) • Stable Runtime = Overhead / (1 - (Time to process one hour of data))
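The stable point can be checked by iterating the recurrence T = O + T x P and comparing against O / (1 - P); the O and P values below are arbitrary examples:

```python
O, P = 2.0, 0.8       # example: 2 hours overhead, 48 min per hour of data
T = 5.0               # arbitrary starting runtime
for _ in range(200):  # each run processes the data accumulated during the last
    T = O + T * P
stable = O / (1 - P)
print(T, stable)      # both approach 10.0
```

Note the iteration converges only when P < 1; at P >= 1 the runtime grows without bound, which is the "longer and longer each iteration forever" case from the slides.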
  • 58. Stable Runtime • Linearly proportional to Overhead (O) • Non-linearly proportional to P – Diminishing returns on each new machine
  • 59. Double # machines • Why doesn’t the speed of my workflow double when I double the number of machines?
• 60. Double # machines • Old runtime = O / (1 - P)
• 61. Double # machines • Old runtime = O / (1 - P) • New runtime = O / (1 - P/2)
• 62. Double # machines • Old runtime = O / (1 - P) • New runtime = O / (1 - P/2) • New runtime / Old runtime = (1 - P) / (1 - P/2)
• 63. Double # machines • [Graph: New runtime / Old runtime as a function of Time to process one hour of data (P)]
  • 64. Double # machines • P = 0.9 (54 minutes / hour of data) -> Runtime decreases by 80%
  • 65. Double # machines • P = 0.9 (54 minutes / hour of data) -> Runtime decreases by 80% • P = 0.2 (12 minutes / hour of data) -> Runtime decreases by 10%
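Plugging into the ratio (1 - P) / (1 - P/2) reproduces the two figures above (the slides round slightly):

```python
def doubling_speedup(p):
    """Ratio of new runtime to old runtime after doubling machines,
    assuming doubling the cluster halves P but leaves overhead O fixed."""
    return (1 - p) / (1 - p / 2)

print(doubling_speedup(0.9))  # ~0.18 -> runtime drops ~82% (slide says 80%)
print(doubling_speedup(0.2))  # ~0.89 -> runtime drops ~11% (slide says 10%)
```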
  • 66. Increase in Data • Why does a 10% increase in data cause my turnaround to increase by 200%?
• 67. Increase in Data • Old runtime = O / (1 - P)
• 68. Increase in Data • Old runtime = O / (1 - P) • New runtime = O / (1 - 1.1*P)
• 69. Increase in Data • Old runtime = O / (1 - P) • New runtime = O / (1 - 1.1*P) • New runtime / Old runtime = (1 - P) / (1 - 1.1*P)
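The same ratio explains the opening story. A hedged sketch, assuming only that data grows by 10% and nothing else changes:

```python
def data_growth_slowdown(p, growth=1.10):
    """New runtime / old runtime when the data rate grows by `growth`x,
    with overhead O held fixed."""
    return (1 - p) / (1 - growth * p)

print(data_growth_slowdown(0.50))  # ~1.11: 10% more data, ~11% slower
print(data_growth_slowdown(0.87))  # ~3.0: a 10-hour turnaround blows up to ~30
```

As P approaches 1/1.1, the denominator approaches zero and the turnaround diverges, which is why a workflow running close to capacity deteriorates so dramatically.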
• 70. Increase in Data • [Graph: New runtime / Old runtime as a function of Time to process one hour of data (P)]
  • 71. Increase in Data • Less “extra capacity” -> more dramatic deterioration in performance
• 72. Increase in Data • Less “extra capacity” -> more dramatic deterioration in performance • The same effect can also be caused by: • An increase in hardware/software failures • Sharing the cluster with other jobs
  • 73. Real life example • How does optimizing out 30% of my workflow runtime cause the runtime to decrease by 80%?
  • 74. Real life example • 30 hour workflow
  • 75. Real life example • 30 hour workflow • Remove bottleneck causing 10 hours of overhead
  • 76. Real life example • 30 hour workflow • Remove bottleneck causing 10 hours of overhead • Runtime dropped to 6 hours
• 77. Real life example • 30 = O / (1 - P) • 6 = (O - 10) / (1 - P) • Solving: O = 12.5, P ≈ 0.58
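The two equations solve by subtraction: 30(1 - P) = O and 6(1 - P) = O - 10 give 24(1 - P) = 10. A quick check:

```python
# 30 = O / (1 - P)  and  6 = (O - 10) / (1 - P)
# Subtracting the second from the first: 24 = 10 / (1 - P)
one_minus_p = 10 / 24
P = 1 - one_minus_p        # ~0.583
O = 30 * one_minus_p       # 12.5

# Verify against both original equations
print(O / (1 - P))         # ~30.0
print((O - 10) / (1 - P))  # ~6.0
```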
  • 79. Takeaways • You should measure the O and P values of your workflow to avoid disasters
  • 80. Takeaways • You should measure the O and P values of your workflow to avoid disasters • When P is high: – Expand cluster – OR: Optimize code that touches data
• 81. Takeaways • You should measure the O and P values of your workflow to avoid disasters • When P is high: – Expand cluster – OR: Optimize code that touches data • When P is low: – Optimize overhead (e.g., reduce job startup time)
  • 82. Questions? Nathan Marz BackType nathan.marz@gmail.com Twitter: @nathanmarz http://nathanmarz.com/blog
