Fö 4 - 1:  INSTRUCTION PIPELINING (II)

        1. Reducing Pipeline Branch Penalties
        2. Instruction Fetch Units and Instruction Queues
        3. Delayed Branching
        4. Branch Prediction Strategies
        5. Static Branch Prediction
        6. Dynamic Branch Prediction
        7. Branch History Table


Fö 4 - 2:  Reducing Pipeline Branch Penalties

  •  Branch instructions can dramatically affect pipeline performance.
     Control operations (conditional and unconditional branch) are very
     frequent in current programs.

  •  Some statistics:
       - 20% - 35% of the instructions executed are branches (conditional
         and unconditional).
       - ~65% of the branches actually take the branch.
       - Conditional branches are much more frequent than unconditional
         ones (more than two times). More than 50% of conditional branches
         are taken.

  •  It is very important to reduce the penalties produced by branches.




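To see why these numbers matter, here is a rough back-of-the-envelope
calculation; the 25% branch frequency and the 2-cycle average penalty are
assumed values inside the ranges quoted above, not figures from the lecture:

    \[
    \mathit{CPI}_{\mathrm{eff}}
      = \mathit{CPI}_{\mathrm{base}} + f_{\mathrm{branch}} \cdot \bar{p}_{\mathrm{branch}}
      = 1 + 0.25 \cdot 2
      = 1.5
    \]

Under these assumptions an otherwise ideal one-instruction-per-cycle
pipeline slows down by roughly 50%, which is why the techniques on the
following slides are worthwhile.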
Fö 4 - 3:  Instruction Fetch Units and Instruction Queues

  •  Most processors employ sophisticated fetch units that fetch
     instructions before they are needed and store them in a queue.

       Instruction cache → Instruction Fetch Unit → Instruction Queue
                                                  → Rest of the pipeline

  •  The fetch unit also has the ability to recognize branch instructions
     and to generate the target address. Thus, the penalty produced by
     unconditional branches can be drastically reduced: the fetch unit
     computes the target address and continues to fetch instructions from
     that address, which are sent to the queue. The rest of the pipeline
     therefore gets a continuous stream of instructions, without stalling.

  •  The rate at which instructions can be read (from the instruction
     cache) must be sufficiently high to avoid an empty queue.

  •  With conditional branches, penalties cannot be avoided. The branch
     condition, which usually depends on the result of the preceding
     instruction, has to be known in order to determine the following
     instruction.


Fö 4 - 4:  Delayed Branching

These are the pipeline sequences for a conditional branch instruction
(see slide 12, lecture 3).

Branch is taken

  Clock cycle →   1  2  3  4  5  6  7  8  9  10  11  12
  ADD R1,R2       FI DI CO FO EI WO
  BEZ TARGET         FI DI CO FO EI WO
  the target            FI stall stall FI DI CO FO EI WO

  Penalty: 3 cycles

Branch is not taken

  Clock cycle →   1  2  3  4  5  6  7  8  9  10  11  12
  ADD R1,R2       FI DI CO FO EI WO
  BEZ TARGET         FI DI CO FO EI WO
  instr i+1             FI stall stall DI CO FO EI WO

  Penalty: 2 cycles

  •  The idea with delayed branching is to let the CPU do some useful work
     during some of the cycles which are shown above to be stalled.

  •  With delayed branching the CPU always executes the instruction that
     immediately follows the branch and only then alters (if necessary) the
     sequence of execution. The instruction after the branch is said to be
     in the branch delay slot.
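The decoupling idea on slide Fö 4 - 3 can be pictured as a small ring
buffer sitting between the fetch unit and the rest of the pipeline. The C
fragment below is only an illustrative sketch of such a queue (names and
sizes are invented, not taken from the lecture): the fetch unit pushes
instruction words in, the pipeline pops them out, and only an empty queue
forces a stall.

    #include <stdint.h>
    #include <stdbool.h>

    #define IQ_SIZE 8                      /* queue depth, chosen arbitrarily */

    typedef struct {
        uint32_t instr[IQ_SIZE];           /* fetched instruction words       */
        int head, tail, count;             /* ring-buffer bookkeeping         */
    } instr_queue_t;

    /* Fetch-unit side: enqueue a fetched instruction word. */
    static bool iq_push(instr_queue_t *q, uint32_t instr)
    {
        if (q->count == IQ_SIZE) return false;     /* queue full: fetch waits  */
        q->instr[q->tail] = instr;
        q->tail = (q->tail + 1) % IQ_SIZE;
        q->count++;
        return true;
    }

    /* Pipeline side: dequeue the next instruction to decode. */
    static bool iq_pop(instr_queue_t *q, uint32_t *instr)
    {
        if (q->count == 0) return false;           /* empty queue: pipeline stalls */
        *instr = q->instr[q->head];
        q->head = (q->head + 1) % IQ_SIZE;
        q->count--;
        return true;
    }

As long as the fetch unit keeps the queue non-empty, short hiccups on the
fetch side are hidden from the rest of the pipeline.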
Fö 4 - 5:  Delayed Branching (cont'd)

This is what the programmer has written:

        MUL  R3,R4      R3 ← R3*R4
        SUB  #1,R2      R2 ← R2-1
        ADD  R1,R2      R1 ← R1+R2
        BEZ  TAR        branch if zero
        MOVE #10,R1     R1 ← 10
        -------------
  TAR   -------------

  The MUL instruction does not influence any of the instructions which
  follow it up to the branch; it also doesn't influence the outcome of the
  branch. The MOVE instruction should be executed only if the branch is
  not taken.

  •  The compiler (assembler) has to find an instruction which can be
     moved from its original place into the branch delay slot after the
     branch and which will be executed regardless of the outcome of the
     branch.

This is what the compiler (assembler) has produced and what actually will
be executed:

        SUB  #1,R2
        ADD  R1,R2
        BEZ  TAR
        MUL  R3,R4      ← in the delay slot: executed regardless of the condition
        MOVE #10,R1     ← executed only if the branch has not been taken
        -------------
  TAR   -------------


Fö 4 - 6:  Delayed Branching (cont'd)

This is what happens in the pipeline:

Branch is taken

  Clock cycle →   1  2  3  4  5  6  7  8  9  10  11  12
  ADD R1,R2       FI DI CO FO EI WO
  BEZ TAR            FI DI CO FO EI WO
  MUL R3,R4             FI DI CO FO EI WO
  the target               FI stall FI DI CO FO EI WO

  Penalty: 2 cycles

  (At the moment the target is fetched, both the condition (set by ADD)
  and the target address are known.)

Branch is not taken

  Clock cycle →   1  2  3  4  5  6  7  8  9  10  11  12
  ADD R1,R2       FI DI CO FO EI WO
  BEZ TAR            FI DI CO FO EI WO
  MUL R3,R4             FI DI CO FO EI WO
  MOVE #10,R1              FI stall DI CO FO EI WO

  Penalty: 1 cycle

  (At the moment the MOVE resumes, the condition is known and the MOVE can
  go on.)




Fö 4 - 7:  Delayed Branching (cont'd)

  •  What happens if the compiler is not able to find an instruction to be
     moved after the branch, into the branch delay slot?

     In this case a NOP instruction (an instruction that does nothing) has
     to be placed after the branch, and the penalty will be the same as
     without delayed branching.

        MUL  R2,R4      ← now, through R2, this instruction influences the
                          following ones and cannot be moved from its place
        SUB  #1,R2
        ADD  R1,R2
        BEZ  TAR
        NOP
        MOVE #10,R1
        -------------
  TAR   -------------

  •  Some statistics show that for between 60% and 85% of branches,
     sophisticated compilers are able to find an instruction to be moved
     into the branch delay slot.


Fö 4 - 8:  Branch Prediction

  •  In the last example we assumed that the branch will not be taken and
     fetched the instruction following the branch; if the branch was taken,
     the fetched instruction was discarded. As a result, the branch penalty
     was:
       - 1 cycle if the branch is not taken (prediction fulfilled);
       - 2 cycles if the branch is taken (prediction not fulfilled).

  •  Let us consider the opposite prediction: branch taken. For this
     solution the target address has to be computed in advance by an
     instruction fetch unit.

Branch is taken

  Clock cycle →   1  2  3  4  5  6  7  8  9  10  11  12
  ADD R1,R2       FI DI CO FO EI WO
  BEZ TAR            FI DI CO FO EI WO
  MUL R3,R4             FI DI CO FO EI WO
  the target               FI stall DI CO FO EI WO

  Penalty: 1 cycle (prediction fulfilled)

Branch is not taken

  Clock cycle →   1  2  3  4  5  6  7  8  9  10  11  12
  ADD R1,R2       FI DI CO FO EI WO
  BEZ TAR            FI DI CO FO EI WO
  MUL R3,R4             FI DI CO FO EI WO
  MOVE #10,R1              FI stall FI DI CO FO EI WO

  Penalty: 2 cycles (prediction not fulfilled)
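Combining the penalties above with the statistic from slide Fö 4 - 2 that
roughly 65% of branches are taken gives a rough expected penalty per
conditional branch for the two fixed predictions (my own back-of-the-
envelope comparison, not part of the slides):

    \[
    E[p_{\text{predict not taken}}] = 0.35 \cdot 1 + 0.65 \cdot 2 = 1.65 \ \text{cycles}
    \]
    \[
    E[p_{\text{predict taken}}] = 0.65 \cdot 1 + 0.35 \cdot 2 = 1.35 \ \text{cycles}
    \]

Under these assumptions, predicting taken (with the fetch unit supplying
the target address) wins on average, which motivates the prediction
strategies discussed on the following slides.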
Fö 4 - 9:  Branch Prediction (cont'd)

  •  Correct branch prediction is very important and can produce
     substantial performance improvements.

  •  Based on the predicted outcome, the respective instruction can be
     fetched, as well as the instructions following it, and they can be
     placed into the instruction queue (see slide Fö 4 - 3). If, after the
     branch condition is computed, it turns out that the prediction was
     correct, execution continues. On the other hand, if the prediction is
     not fulfilled, the fetched instruction(s) must be discarded and the
     correct instruction must be fetched.

  •  To take full advantage of branch prediction, we can have the
     instructions not only fetched but also begin execution. This is known
     as speculative execution.

  •  Speculative execution means that instructions are executed before the
     processor is certain that they are in the correct execution path. If
     it turns out that the prediction was correct, execution goes on
     without introducing any branch penalty. If, however, the prediction
     is not fulfilled, the instruction(s) started in advance and all their
     associated data must be purged and the state previous to their
     execution restored.

  •  Branch prediction strategies:
       1. Static prediction
       2. Dynamic prediction


Fö 4 - 10:  Static Branch Prediction

  •  Static prediction techniques do not take into consideration execution
     history.

  Static approaches:

  •  Predict never taken (Motorola 68020): assumes that the branch is not
     taken.

  •  Predict always taken: assumes that the branch is taken.

  •  Predict depending on the branch direction (PowerPC 601):
       - predict branch taken for backward branches;
       - predict branch not taken for forward branches.




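The direction-based rule (PowerPC 601 style) is simple enough to express
in a single comparison; the C fragment below is only an illustrative
sketch with invented names, not code from any real processor or manual:

    #include <stdbool.h>
    #include <stdint.h>

    /* Static "backward taken, forward not taken" heuristic: loop-closing
       branches point backwards and are usually taken, while forward
       branches (e.g. error paths) are predicted not taken. */
    static bool predict_taken_static(uint32_t branch_addr, uint32_t target_addr)
    {
        return target_addr <= branch_addr;   /* backward (or self) branch */
    }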
Fö 4 - 11:  Dynamic Branch Prediction

  •  Dynamic prediction techniques improve the accuracy of the prediction
     by recording the history of conditional branches.

  One-Bit Prediction Scheme

  •  One bit is used in order to record whether the last execution
     resulted in the branch being taken or not. The system predicts the
     same behavior as for the last time.

     Shortcoming: when a branch is almost always taken, then when it is
     not taken, we will predict incorrectly twice, rather than once:

             -----------
     LOOP    -----------
             -----------
             BNZ   LOOP
             -----------

       - After the loop has been executed for the first time and left, it
         will be remembered that BNZ has not been taken. Now, when the
         loop is executed again, after the first iteration there will be a
         false prediction; the following predictions are OK until the last
         iteration, when there will be a second false prediction.
       - In this case the result is even worse than with static prediction
         that considers backward branches to be always taken (the
         PowerPC 601 approach).


Fö 4 - 12:  Two-Bit Prediction Scheme

  •  With a two-bit scheme predictions can be made depending on the last
     two instances of execution.

  •  A typical scheme is to change the prediction only if there have been
     two incorrect predictions in a row.

     [State diagram: four states identified by the two bits; 11 and 01
      predict "taken", 10 and 00 predict "not taken"; the state changes on
      each "taken" / "not taken" outcome of the branch.]

             -----------
     LOOP    -----------
             -----------
             BNZ   LOOP
             -----------

     After the first execution of the loop the bits attached to BNZ will
     be 01; now there will always be one false prediction for the loop, at
     its exit.
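To make the difference concrete, the small C program below (my own
illustration, not part of the slides) replays the BNZ loop branch: within
each execution of the loop the branch is taken four times and then not
taken at the exit, and the loop is entered ten times. The two-bit
predictor is modelled here as the common saturating counter, which behaves
like the scheme above on this pattern.

    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        int one_bit = 1;        /* 1 = last outcome was "taken"                 */
        int two_bit = 3;        /* saturating counter 0..3; >= 2 predicts taken */
        int miss1 = 0, miss2 = 0;

        for (int run = 0; run < 10; run++) {         /* the loop is entered 10 times */
            for (int iter = 0; iter < 5; iter++) {   /* 5 iterations per entry       */
                bool taken = (iter < 4);             /* BNZ: taken except at the exit */

                if ((one_bit == 1) != taken) miss1++;    /* one-bit prediction */
                one_bit = taken ? 1 : 0;

                if ((two_bit >= 2) != taken) miss2++;    /* two-bit prediction */
                if (taken  && two_bit < 3) two_bit++;
                if (!taken && two_bit > 0) two_bit--;
            }
        }
        /* Prints "1-bit misses: 19, 2-bit misses: 10": after the first entry
           the one-bit scheme misses twice per loop execution (first iteration
           and exit), the two-bit scheme only once (at the exit). */
        printf("1-bit misses: %d, 2-bit misses: %d\n", miss1, miss2);
        return 0;
    }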
Fö 4 - 13:  Branch History Table

  •  History information can be used not only to predict the outcome of a
     conditional branch but also to avoid recalculating the target
     address. Together with the bits used for prediction, the target
     address can be stored for later use in a branch history table.

     [Figure: the instruction fetch unit looks up the address of the
      branch instruction in a table with the columns "Instr. addr.",
      "Target addr." and "Pred. bits". Depending on whether the branch was
      in the table, and on the prediction bits, the fetch unit decides the
      address from which to fetch next. After the EI stage the prediction
      is checked: if it was wrong, the fetched instructions are discarded
      and fetching restarts from the correct address; the table entry is
      updated, or a new entry is added if the branch was not in the table.]


Fö 4 - 14:  Branch History Table (cont'd)

  Some explanations to the previous figure:

       - Address where to fetch from: if the branch instruction is not in
         the table, the next instruction (address PC+1) is to be fetched.
         If the branch instruction is in the table, first of all a
         prediction based on the prediction bits is made. Depending on the
         prediction, either the next instruction (address PC+1) or the
         instruction at the target address is fetched.
       - Update entry: if the branch instruction has been in the table,
         the respective entry has to be updated to reflect the correct or
         incorrect prediction.
       - Add new entry: if the branch instruction has not been in the
         table, it is added to the table with the corresponding
         information concerning branch outcome and target address. If
         needed, one of the existing table entries is discarded.
         Replacement algorithms similar to those for cache memories are
         used.

  •  Using dynamic branch prediction with history tables, up to 90% of
     predictions can be correct.

  •  Both the Pentium and the PowerPC 620 use speculative execution with
     dynamic branch prediction based on a branch history table.




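As a rough sketch of how such a table could be organised (a direct-mapped
toy version with invented names and sizes, not the structure of any
particular processor), the C fragment below keeps, per entry, the branch
address, the target address and two prediction bits — the three columns of
the figure on slide Fö 4 - 13:

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_SIZE 64                  /* number of entries, chosen arbitrarily */

    typedef struct {
        bool     valid;
        uint32_t branch_addr;            /* "Instr. addr." column */
        uint32_t target_addr;            /* "Target addr." column */
        uint8_t  pred_bits;              /* "Pred. bits", 0..3    */
    } bht_entry_t;

    static bht_entry_t bht[BHT_SIZE];

    /* Lookup at fetch time: returns the address to fetch from next. */
    static uint32_t bht_lookup(uint32_t pc)
    {
        bht_entry_t *e = &bht[pc % BHT_SIZE];
        if (e->valid && e->branch_addr == pc && e->pred_bits >= 2)
            return e->target_addr;       /* predicted taken: fetch the target     */
        return pc + 1;                   /* not in table or predicted not taken:
                                            next sequential instruction (PC+1)    */
    }

    /* Update once the outcome of the branch is known (after EI). */
    static void bht_update(uint32_t pc, bool taken, uint32_t target)
    {
        bht_entry_t *e = &bht[pc % BHT_SIZE];
        if (!e->valid || e->branch_addr != pc) {     /* add new entry (may evict) */
            e->valid = true;
            e->branch_addr = pc;
            e->target_addr = target;
            e->pred_bits = taken ? 2 : 1;
        } else {                                     /* update existing entry     */
            if (taken  && e->pred_bits < 3) e->pred_bits++;
            if (!taken && e->pred_bits > 0) e->pred_bits--;
        }
    }

A real table would be consulted and updated in parallel with the pipeline
stages, and replacement on a conflict would be handled more carefully;
here a new branch simply overwrites whatever entry maps to the same slot,
loosely mirroring the "Add new entry" box of the figure.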
Fö 4 - 15:  Summary

  •  Branch instructions can dramatically affect pipeline performance. It
     is very important to reduce the penalties produced by branches.

  •  Instruction fetch units are able to recognize branch instructions and
     to generate the target address. By fetching at a high rate from the
     instruction cache and keeping the instruction queue loaded, it is
     possible to reduce the penalty for unconditional branches to zero.
     For conditional branches this is not possible, because we have to
     wait for the outcome of the branch decision.

  •  Delayed branching is a compiler-based technique aimed at reducing the
     branch penalty by moving instructions into the branch delay slot.

  •  Efficient reduction of the branch penalty for conditional branches
     needs a clever branch prediction strategy. Static branch prediction
     does not take into consideration execution history. Dynamic branch
     prediction is based on a record of the history of each conditional
     branch.

  •  Branch history tables are used to store both information on the
     outcome of branches and the target address of the respective branch.


Petru Eles, IDA, LiTH
