Superscalar and VLIW Architectures
Parallel processing [2]
Processing instructions in parallel requires
   three major tasks:
1. checking dependencies between
   instructions to determine which
   instructions can be grouped together for
   parallel execution;
2. assigning instructions to the functional
   units on the hardware;
3. determining when instructions are initiated.
Major categories [2]
VLIW – Very Long Instruction Word
EPIC – Explicitly Parallel Instruction Computing
Superscalar Processors [1]

    Superscalar processors are designed to exploit
     more instruction-level parallelism in user
     programs.
    Only independent instructions can be executed
     in parallel without causing a wait state.
    The amount of instruction-level parallelism
     varies widely depending on the type of code
     being executed.
Pipelining in Superscalar Processors [1]
     To fully utilise a superscalar processor
      of degree m, m instructions must be executable
      in parallel. This condition may not hold in every
      clock cycle; when it does not, some of the pipelines
      stall in a wait state.
     In a superscalar processor, a simple
      operation should have a latency of only one cycle,
      as in the base scalar processor.
Superscalar Execution
Superscalar Implementation
   Simultaneously fetch multiple instructions
   Logic to determine true dependencies
    involving register values
   Mechanisms to communicate these values
   Mechanisms to initiate multiple instructions in
    parallel
   Resources for parallel execution of multiple
    instructions
   Mechanisms for committing process state in
    correct order
Some Architectures
   PowerPC 604
    – six independent execution units:
           Branch execution unit
           Load/Store unit
           3 Integer units
           Floating-point unit
    – in-order issue
    – register renaming
   PowerPC 620
    – adds out-of-order issue to the features of the 604
   Pentium
    – three independent execution units:
           2 Integer units
           Floating point unit
    – in-order issue
VLIW
   Very Long Instruction Word (VLIW) architectures execute more than one
    basic instruction at a time.

   These processors contain multiple functional units. They fetch from the
    instruction cache a very long instruction word containing several basic
    instructions and dispatch the entire word for parallel execution.
    Compilers exploit these capabilities by generating code in which
    independent primitive instructions that can execute in parallel are
    grouped together.

   VLIW has been described as a natural successor to RISC (Reduced Instruction
    Set Computing), because it moves complexity from the hardware to the compiler,
    allowing simpler, faster processors.

   VLIW eliminates the complicated instruction scheduling and parallel dispatch
    logic found in most modern microprocessors.
WHY VLIW ?
The key to higher performance in microprocessors for a broad range of
applications is the ability to exploit fine-grain, instruction-level
parallelism.

Some methods for exploiting fine-grain parallelism include:

   Pipelining
   Multiple processors
   Superscalar implementation
   Specifying multiple independent operations per instruction
Architecture Comparison: CISC, RISC & VLIW

CHARACTERISTIC          CISC                         RISC                         VLIW

INSTRUCTION SIZE        Varies                       One size, usually 32 bits    One size

INSTRUCTION FORMAT      Field placement varies       Regular, consistent          Regular, consistent
                                                     placement of fields          placement of fields

INSTRUCTION SEMANTICS   Varies from simple to        Almost always one            Many simple,
                        complex; possibly many       simple operation             independent operations
                        dependent operations
                        per instruction

REGISTERS               Few, sometimes special       Many, general-purpose        Many, general-purpose

MEMORY REFERENCES       Bundled with operations      Not bundled with             Not bundled with
                        in many different types      operations, i.e.,            operations, i.e.,
                        of instructions              load/store architecture      load/store architecture

HARDWARE DESIGN FOCUS   Exploit microcoded           Exploit implementations      Exploit implementations
                        implementations              with one pipeline and        with multiple pipelines,
                                                     no microcode                 no microcode and no
                                                                                  complex dispatch logic

PICTURES OF FIVE        (figure not reproduced)
TYPICAL INSTRUCTIONS
Advantages of VLIW
   VLIW processors rely on the compiler that generates the VLIW code to
    explicitly specify parallelism. Relying on the compiler has advantages.
   VLIW architecture reduces hardware complexity by moving that
    complexity from hardware into software.
What is ILP?

   Instruction-level parallelism (ILP) is a measure of how many of the
    operations in a computer program can be performed simultaneously.
   A system is said to embody ILP if multiple instructions can run on
    it at the same time.
   ILP can have a significant effect on performance, which is critical to
    embedded systems.
   ILP provides a form of power saving by allowing the clock to be slowed.
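One common way to put a number on the ILP of a straight-line block is to divide the instruction count by the length of its longest dependence chain. The sketch below does that under the assumption that every instruction takes one cycle; the function name `ilp_estimate` and the dependence map are hypothetical, chosen for illustration only.

```python
def ilp_estimate(deps):
    """Estimate ILP as (instruction count) / (longest dependence chain).

    deps maps each instruction id to the ids it depends on (an acyclic graph).
    Assumption: every instruction has a latency of one cycle.
    """
    depth = {}

    def chain_length(node):
        # Length of the longest dependence chain that ends at `node`.
        if node not in depth:
            depth[node] = 1 + max((chain_length(p) for p in deps[node]), default=0)
        return depth[node]

    critical_path = max(chain_length(n) for n in deps)
    return len(deps) / critical_path


# Hypothetical block for a = b + c; d = e - f:
# two loads, one arithmetic operation and one store per statement.
deps = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [], 6: [], 7: [5, 6], 8: [7]}
ilp = ilp_estimate(deps)
print(ilp)
```

Here the longest chain is load, arithmetic, store (three instructions), so eight instructions give an estimate of 8/3. Real hardware rarely reaches this bound, since it ignores latencies and resource limits.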
What do we intend to do with ILP?
We use micro-architectural techniques to exploit ILP. These techniques
    include:
   Instruction pipelining, which overlaps the execution of successive
    instructions and depends on CPU caches to keep the pipeline supplied.
   Register renaming, a technique used to avoid unnecessary
    serialization of program operations imposed by the reuse of registers by those
    operations.
   Speculative execution, which reduces pipeline stalls due to control dependencies.
   Branch prediction, which is used to keep the pipeline full.
   Superscalar execution, in which multiple execution units are used to execute
    multiple instructions in parallel.
   Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
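Of the techniques listed, register renaming is perhaps the easiest to show in a few lines. The sketch below is a simplified software model, not a description of any particular processor: the function `rename`, the tuple format and the physical register names `p0`, `p1`, ... are all assumptions made for illustration.

```python
from itertools import count


def rename(instructions):
    """Rename destination registers so that register reuse no longer
    serializes independent operations.

    instructions: list of (dest, [sources]) using architectural names.
    Returns the same list expressed with hypothetical physical registers.
    """
    fresh = count()   # supply of physical register numbers
    mapping = {}      # architectural name -> current physical name
    renamed = []
    for dest, sources in instructions:
        # A source reads whichever physical register currently holds its value;
        # names that were never written (memory operands here) pass through.
        new_sources = [mapping.get(s, s) for s in sources]
        # Every write is given a brand-new physical register.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], new_sources))
    return renamed


program = [
    ("R1", ["b"]),        # R1 = load b
    ("R2", ["R1", "c"]),  # R2 = R1 + c
    ("R1", ["e"]),        # R1 = load e  (reuses R1: anti-dependence on the line above)
    ("R3", ["R1", "f"]),  # R3 = R1 + f
]
renamed = rename(program)
for line in renamed:
    print(line)
```

After renaming, the second load writes `p2` instead of overwriting the register still being read by the first addition, so the two halves of the program no longer have to run in order.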
Algorithms for scheduling

A few of the instruction scheduling algorithms in use are:

   List scheduling
   Trace scheduling
   Software pipelining (modulo scheduling)
List Scheduling
List scheduling, step by step:
1.   Construct a dependence graph of the basic block. (The edges are
     weighted with the latency of the instruction.)
2.   Use the dependence graph to determine which instructions can execute;
     insert them on a list, called the Ready list.
3.   Use the dependence graph and the Ready list to schedule an instruction
     that causes the smallest possible stall; update the Ready list. Repeat
     until all instructions are scheduled.
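The steps above can be sketched as follows for a single-issue machine. This is a minimal illustration rather than a production scheduler: the function name `list_schedule`, the two-cycle load latency and the tie-break on longest remaining path are assumptions, and `None` is used here to mark a stall cycle.

```python
def list_schedule(latency, preds):
    """Greedy list scheduling of one basic block on a single-issue machine.

    latency: instruction id -> cycles until its result can be used
    preds:   instruction id -> ids of the instructions it depends on
    Returns the issue order; None marks a cycle in which nothing can issue.
    """
    def priority(i):
        # Latency-weighted length of the longest path from i to the block end.
        succs = [s for s in preds if i in preds[s]]
        return latency[i] + max((priority(s) for s in succs), default=0)

    finish = {}                 # id -> first cycle in which its result is usable
    remaining = sorted(preds)   # instructions not yet scheduled
    schedule = []
    cycle = 0
    while remaining:
        # Ready list: all predecessors have already been scheduled.
        ready = [i for i in remaining if all(p in finish for p in preds[i])]

        def stall(i):
            operands_ready = max((finish[p] for p in preds[i]), default=0)
            return max(0, operands_ready - cycle)

        # Smallest stall first; ties go to the longest remaining path.
        best = min(ready, key=lambda i: (stall(i), -priority(i), i))
        if stall(best) > 0:
            schedule.append(None)
        else:
            schedule.append(best)
            finish[best] = cycle + latency[best]
            remaining.remove(best)
        cycle += 1
    return schedule


# a = b + c; d = e - f, with loads assumed to take two cycles.
latency = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
preds = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [], 6: [], 7: [5, 6], 8: [7]}
schedule = list_schedule(latency, preds)
print(schedule)
```

Under these assumptions the loads of the second statement fill the slots in which the first addition would otherwise wait, so the resulting order contains no stall cycles. Other stall-free orders exist; which one is produced depends on the tie-breaking rule.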
Code Representation for List Scheduling
      a = b + c
      d = e - f

1.   load R1, b
2.   load R2, c
3.   add R2, R1
4.   store a, R2
5.   load R3, e
6.   load R4, f
7.   sub R3, R4
8.   store d, R3

Dependence graph: 1 and 2 feed 3, which feeds 4; 5 and 6 feed 7, which feeds 8.
Code Representation for List Scheduling
      a = b + c
      d = e - f

Original order     Scheduled order
1. load R1, b      1. load R1, b
2. load R2, c      5. load R3, e
3. add R2, R1      2. load R2, c
4. store a, R2     6. load R4, f
5. load R3, e      3. add R2, R1
6. load R4, f      7. sub R3, R4
7. sub R3, R4      4. store a, R2
8. store d, R3     8. store d, R3

Now we have a schedule that requires no stalls and no NOPs.
Problem and Solution
   Register allocation conflict: use of the same register creates
    anti-dependencies that restrict scheduling.
   Register allocation before scheduling
    – prevents good scheduling
   Scheduling before register allocation
    – spills destroy the schedule
   Solution: schedule abstract assembly, allocate registers, then schedule again.
Trace scheduling

Steps involved in trace scheduling:
   Trace selection
    – Find the most common trace of basic blocks.
   Trace compaction
    – Combine the basic blocks in the trace and schedule them as one block.
    – Create clean-up code in case execution goes off-trace.
   Parallelism across IF branches vs. LOOP branches
   Can provide a speedup if static prediction is accurate
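Trace selection can be illustrated with a very simple greedy rule: start at the entry block and keep following the most likely outgoing edge. This is a deliberately simplified sketch; the function `select_trace`, the block names and the branch probabilities are hypothetical, and practical trace schedulers use more elaborate selection rules.

```python
def select_trace(cfg, entry):
    """Pick one trace through a control-flow graph by following,
    from `entry`, the most probable successor of each block.

    cfg maps a block name to {successor name: estimated probability}.
    """
    trace = [entry]
    block = entry
    while cfg[block]:
        # Follow the most likely outgoing edge.
        nxt = max(cfg[block], key=cfg[block].get)
        if nxt in trace:
            break  # stop at a loop back edge
        trace.append(nxt)
        block = nxt
    return trace


# Hypothetical control-flow graph with profile-based branch probabilities.
cfg = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {"E": 0.8, "F": 0.2},
    "E": {},
    "F": {},
}
trace = select_trace(cfg, "A")
print(trace)
```

Blocks C and F lie off the selected trace, so they are where clean-up code would be needed if execution leaves the trace.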
How Trace Scheduling works
   Look for the higher-priority blocks and trace them.
   After tracing the priority blocks, schedule that trace first and the
    remaining blocks in parallel with it.
   The blocks are traced according to their priority.
   Create large extended basic blocks by duplication, then schedule the
    larger blocks.
   In its final stage, the block diagram shows the parallelism across the
    branches.
[Figures: control-flow block diagrams for each step, not reproduced]
Limitations of Trace Scheduling

   The optimization depends on the traces being the dominant paths
    in the program's control flow.
   Therefore, the following two things should be true:
    – Programs should show skewed behavior in
      the branches taken at run time, for typical mixes of input data.
    – We should have access to this information at compile time.
      Not so easy.
Software Pipelining
   In software pipelining, iterations of a loop in the source program are
    continuously initiated at constant intervals, before the preceding
    iterations complete, thus taking advantage of the parallelism in the data path.
   It is also described as scheduling the operations within an iteration
    such that the iterations can be pipelined to yield optimal throughput.
   The sequence of instructions before the steady state is called the
    PROLOG, and the sequence after the steady state is called the EPILOG.
Software Pipelining Example
• Source code:
for (i = 0; i < n; i++) sum += a[i]

• Loop body in assembly:
r1 = L r0
---;stall
r2 = add r2,r1
r0 = add r0,4

• Unroll loop & allocate registers:
r1 = L r0
---;stall
r2 = add r2,r1
r0 = add r0,12

r4 = L r3
---;stall
r2 = add r2,r4
r3 = add r3,12

r7 = L r6
---;stall
r2 = add r2,r7
r6 = add r6,12

r10 = L r9
---;stall
r2 = add r2,r10
r9 = add r9,12

• Schedule unrolled instructions, exploiting VLIW (or not)
[Figure: the scheduled instructions, divided into PROLOG, the repeating pattern (kernel), and EPILOG]
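The prolog, kernel and epilog structure can be mimicked in a high-level language. The sketch below sums an array with a two-stage pipeline (load, then add), so that each kernel step adds the value loaded in the previous step while fetching the next one. Python runs these statements sequentially, so this only illustrates the shape of the code, not real overlap; the function name `pipelined_sum` is hypothetical.

```python
def pipelined_sum(a):
    """Sum the list `a` using a two-stage software pipeline: load, then add."""
    n = len(a)
    total = 0
    if n == 0:
        return total
    # PROLOG: start the first iteration (load only).
    loaded = a[0]
    # KERNEL: add the value loaded one step earlier while loading
    # the value that the next step will add.
    for i in range(1, n):
        total, loaded = total + loaded, a[i]
    # EPILOG: finish the last iteration (add only).
    total += loaded
    return total


print(pipelined_sum([1, 2, 3, 4]))
```

On a VLIW machine the two operations in the kernel line would sit in the same instruction word, which is what hides the load stall seen in the original loop body.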
Constraints in Software Pipelining

   Recurrence constraints: determined by
    loop-carried data dependencies.
   Resource constraints: determined by
    total resource requirements.
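In modulo scheduling, these two constraints are usually expressed as lower bounds on the initiation interval, the number of cycles between the starts of successive iterations. The sketch below computes both bounds and takes the larger one. The function names, the resource classes `mem` and `alu` and the unit counts are assumptions made for illustration.

```python
from math import ceil


def resource_bound(uses, units):
    """Lower bound from resources: for each resource class,
    operations per iteration divided by available units, rounded up."""
    return max(ceil(uses[r] / units[r]) for r in uses)


def recurrence_bound(cycles):
    """Lower bound from loop-carried dependence cycles.

    cycles: list of (total latency, total iteration distance) per cycle.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=0)


def min_initiation_interval(uses, units, cycles):
    # The loop cannot start iterations faster than either bound allows.
    return max(resource_bound(uses, units), recurrence_bound(cycles))


# Hypothetical machine and loop: sum += a[i] needs one load and two adds
# per iteration; the machine is assumed to have one memory unit and one ALU.
uses = {"mem": 1, "alu": 2}
units = {"mem": 1, "alu": 1}
# Two recurrences, each with latency 1 carried over 1 iteration:
# the running sum and the address increment.
cycles = [(1, 1), (1, 1)]
mii = min_initiation_interval(uses, units, cycles)
print(mii)
```

With a single ALU the resource bound (2) dominates; with two ALUs the bound would drop to 1, at which point the recurrences become the limiting factor.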
Remarks on Software Pipelining
   Innermost loops, loops with large trip counts, and loops without conditionals
    can be software pipelined.
   Code size increases due to the prolog and epilog.
   Code size increases due to unrolling for MVE (Modulo Variable
    Expansion).
   Register allocation strategies for software-pipelined loops.
   Loops with conditionals can be software pipelined if predicated execution
    is supported.
    – Higher resource requirement, but efficient schedule

Lec1 final

  • 1. Superscalar and VLIW Architectures
  • 2. Parallel processing [2] Processing instructions in parallel requires three major tasks: 2. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution; 3. assigning instructions to the functional units on the hardware; 4. determining when instructions are initiated placed together into a single word.
  • 3. Major categories [2]
      • VLIW – Very Long Instruction Word
      • EPIC – Explicitly Parallel Instruction Computing
  • 5. Superscalar Processors [1]
      • Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
      • Only independent instructions can be executed in parallel without causing a wait state.
      • The amount of instruction-level parallelism varies widely depending on the type of code being executed.
  • 6. Pipelining in Superscalar Processors [1]
      • In order to fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This situation may not be true in all clock cycles. In that case, some of the pipelines may be stalling in a wait state.
      • In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.
  • 7.
  • 9. Superscalar Implementation
      • Simultaneously fetch multiple instructions
      • Logic to determine true dependencies involving register values
      • Mechanisms to communicate these values
      • Mechanisms to initiate multiple instructions in parallel
      • Resources for parallel execution of multiple instructions
      • Mechanisms for committing process state in correct order
  • 10. Some Architectures
      • PowerPC 604
          – six independent execution units: branch execution unit, load/store unit, 3 integer units, floating-point unit
          – in-order issue
          – register renaming
      • PowerPC 620
          – provides out-of-order issue in addition to the features of the 604
      • Pentium
          – three independent execution units: 2 integer units, floating-point unit
          – in-order issue
  • 11. VLIW
      • Very Long Instruction Word (VLIW) architectures are used for executing more than one basic instruction at a time.
      • These processors contain multiple functional units; they fetch from the instruction cache a Very Long Instruction Word containing several basic instructions and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers, which generate code that groups together independent primitive instructions executable in parallel.
      • VLIW has been described as a natural successor to RISC (Reduced Instruction Set Computing), because it moves complexity from the hardware to the compiler, allowing simpler, faster processors.
      • VLIW eliminates the complicated instruction scheduling and parallel dispatch that occur in most modern microprocessors.
  • 12. Why VLIW? The key to higher performance in microprocessors for a broad range of applications is the ability to exploit fine-grain, instruction-level parallelism. Some methods for exploiting fine-grain parallelism include:
      • Pipelining
      • Multiple processors
      • Superscalar implementation
      • Specifying multiple independent operations per instruction
  • 13. Architecture Comparison: CISC, RISC & VLIW
      Architecture characteristic | CISC | RISC | VLIW
      Instruction size | Varies | One size, usually 32 bits | One size
      Instruction format | Field placement varies | Regular, consistent placement of fields | Regular, consistent placement of fields
      Instruction semantics | Varies from simple to complex; possibly many dependent operations per instruction | Almost always one simple operation | Many simple, independent operations
      Registers | Few, sometimes special | Many, general-purpose | Many, general-purpose
  • 14. Architecture Comparison: CISC, RISC & VLIW
      Architecture characteristic | CISC | RISC | VLIW
      Memory references | Bundled with operations in many different types of instructions | Not bundled with operations, i.e., load/store architecture | Not bundled with operations, i.e., load/store architecture
      Hardware design focus | Exploit microcoded implementations | Exploit implementations with one pipeline and no microcode | Exploit implementations with multiple pipelines, no microcode and no complex dispatch logic
      [Figure row: pictures of five typical instructions]
  • 15. Advantages of VLIW
      • VLIW processors rely on the compiler that generates the VLIW code to explicitly specify parallelism. Relying on the compiler has advantages.
      • VLIW architecture reduces hardware complexity. VLIW simply moves complexity from hardware into software.
  • 16. What is ILP?
      • Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.
      • A system is said to embody ILP if multiple instructions run on it at the same time.
      • ILP can have a significant effect on performance, which is critical to embedded systems.
      • ILP provides a form of power saving by allowing the clock to be slowed.
  • 17. What do we intend to do with ILP? We use micro-architectural techniques to exploit ILP. The various techniques include:
      • Instruction pipelining, which depends on CPU caches.
      • Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations.
      • Speculative execution, which reduces pipeline stalls due to control dependencies.
      • Branch prediction, which is used to keep the pipeline full.
      • Superscalar execution, in which multiple execution units are used to execute multiple instructions in parallel.
      • Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
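Register renaming, as described in the list above, can be illustrated with a small sketch. It is a simplification under stated assumptions, not a description of any particular processor: instructions are hypothetical `(destination, sources)` pairs, the pool of physical registers `p0, p1, ...` is assumed to be unlimited, and registers never written inside the snippet keep their architectural names.

```python
def rename(instrs):
    """Give every destination a fresh physical register.

    instrs: list of (dest, srcs) pairs using architectural register names.
    Sources are rewritten through the current mapping, so true (RAW)
    dependences are preserved while WAR and WAW conflicts disappear.
    """
    mapping = {}   # architectural register -> latest physical register
    renamed = []
    for counter, (dest, srcs) in enumerate(instrs):
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = f"p{counter}"
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 + r3
    ("r4", ("r1",)),       # r4 = r1        (RAW on r1: must be kept)
    ("r1", ("r5", "r6")),  # r1 = r5 + r6   (WAR and WAW on r1: removable)
]
renamed = rename(program)
```

After renaming, the third instruction writes `p2` instead of reusing `r1`, so it no longer has to wait for the first two.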
  • 18. Algorithms for scheduling. A few of the instruction scheduling algorithms in use are:
      • List scheduling
      • Trace scheduling
      • Software pipelining (modulo scheduling)
  • 19. List Scheduling. List scheduling by steps: 1. Construct a dependence graph of the basic block. (The edges are weighted with the latency of the instruction.) 2. Use the dependence graph to determine instructions that can execute; insert them on a list, called the Ready list. 3. Use the dependence graph and the Ready list to schedule an instruction that causes the smallest possible stall; update the Ready list. Repeat until the Ready list is empty.
  • 20. Code Representation for List Scheduling
      Source: a = b + c; d = e - f
      1. load R1, b
      2. load R2, c
      3. add R2, R1
      4. store a, R2
      5. load R3, e
      6. load R4, f
      7. sub R3, R4
      8. store d, R3
      Dependence graph: 1 -> 3, 2 -> 3, 3 -> 4; 5 -> 7, 6 -> 7, 7 -> 8
  • 21. Code Representation for List Scheduling
      Source: a = b + c; d = e - f
      Dependence graph: 1 -> 3, 2 -> 3, 3 -> 4; 5 -> 7, 6 -> 7, 7 -> 8
      Original order:
      1. load R1, b
      2. load R2, c
      3. add R2, R1
      4. store a, R2
      5. load R3, e
      6. load R4, f
      7. sub R3, R4
      8. store d, R3
      Scheduled order:
      1. load R1, b
      5. load R3, e
      2. load R2, c
      6. load R4, f
      3. add R2, R1
      7. sub R3, R4
      4. store a, R2
      8. store d, R3
      Now we have a schedule that requires no stalls and no NOPs.
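The list scheduling steps can be tried on the eight-instruction example from the slides. The sketch below makes assumptions the slides do not state: a single-issue machine, a load latency of two cycles and one cycle for everything else, and a tie-break by longest remaining path. The names `LATENCY`, `OPCODE`, `PREDS`, `heights`, `list_schedule` and `total_cycles` are all hypothetical.

```python
# Assumed latencies in cycles: a load result is usable two cycles after issue.
LATENCY = {"load": 2, "add": 1, "sub": 1, "store": 1}

# Instruction number -> opcode, numbered as on the slides.
OPCODE = {1: "load", 2: "load", 3: "add", 4: "store",
          5: "load", 6: "load", 7: "sub", 8: "store"}

# Dependence graph: instruction -> instructions whose results it needs.
PREDS = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [], 6: [], 7: [5, 6], 8: [7]}


def heights(opcode, preds, latency):
    """Longest latency-weighted path from each instruction to the block end."""
    succs = {i: [j for j in opcode if i in preds[j]] for i in opcode}
    memo = {}

    def height(i):
        if i not in memo:
            tail = max((height(s) for s in succs[i]), default=0)
            memo[i] = latency[opcode[i]] + tail
        return memo[i]

    return {i: height(i) for i in opcode}


def list_schedule(opcode, preds, latency):
    """Single-issue list scheduling. Returns {instruction: issue cycle}."""
    prio = heights(opcode, preds, latency)
    issue = {}
    cycle = 0
    while len(issue) < len(opcode):
        # Ready list: unscheduled instructions whose predecessors have issued.
        ready = [i for i in opcode
                 if i not in issue and all(p in issue for p in preds[i])]

        def earliest(i):
            # First cycle at which all operands of i are available.
            return max([cycle] + [issue[p] + latency[opcode[p]] for p in preds[i]])

        # Smallest stall first; ties go to the longest remaining path.
        best = min(ready, key=lambda i: (earliest(i), -prio[i], i))
        issue[best] = earliest(best)
        cycle = issue[best] + 1
    return issue


def total_cycles(order, opcode, preds, latency):
    """Cycles needed to issue `order` in sequence, stalling on missing operands."""
    issue = {}
    cycle = 0
    for i in order:
        issue[i] = max([cycle] + [issue[p] + latency[opcode[p]] for p in preds[i]])
        cycle = issue[i] + 1
    return cycle


schedule = list_schedule(OPCODE, PREDS, LATENCY)
order = sorted(schedule, key=schedule.get)
```

Under these assumed latencies the original program order takes 10 cycles because each arithmetic instruction waits on a load, whereas both the order found by the sketch (1, 2, 5, 6, 3, 7, 4, 8) and the order shown on the slide (1, 5, 2, 6, 3, 7, 4, 8) issue in 8 consecutive cycles without stalls.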
  • 22. Problem and Solution
      • Register allocation conflict: use of the same register creates anti-dependences that restrict scheduling.
      • Register allocation before scheduling prevents good scheduling.
      • Scheduling before register allocation: spills destroy the schedule.
      • Solution: schedule abstract assembly, allocate registers, then schedule again.
  • 23. Trace scheduling. Steps involved in trace scheduling:
      • Trace selection
          – Find the most common trace of basic blocks.
      • Trace compaction
          – Combine the basic blocks in the trace and schedule them as one block.
          – Create clean-up code in case execution goes off-trace.
      • Parallelism across IF branches vs. LOOP branches
      • Can provide a speedup if static prediction is accurate
  • 24. How Trace Scheduling works. Look for the higher-priority blocks and trace them as shown below.
  • 25. How Trace Scheduling works. After tracing the priority blocks, schedule them first and the rest in parallel with them.
  • 26. How Trace Scheduling works. We can see the blocks being traced according to their priority.
  • 27. How Trace Scheduling works
      • Create large extended basic blocks by duplication
      • Schedule the larger blocks
      The figure above shows how the extended basic blocks can be created.
  • 28. How Trace Scheduling works. This block diagram in its final stage shows the parallelism across the branches.
  • 29. Limitations of Trace Scheduling
      • Optimization depends on the traces being the dominant paths in the program's control flow.
      • Therefore, the following two things should be true:
          – Programs should show skewed branch behavior at run time for typical mixes of input data.
          – We should have access to this information at compile time. Not so easy.
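The trace selection step, and its reliance on skewed branch behavior, can be illustrated with a greedy sketch. It is a simplification and not the slides' algorithm: traces are grown only forward from a seed block, and the control-flow graph, execution counts and branch probabilities below are invented for illustration. The function name `select_traces` is hypothetical.

```python
def select_traces(succs, freq):
    """Greedy forward trace selection.

    succs: block -> list of (successor, branch probability)
    freq:  block -> profiled execution count (assumed known at compile time)
    """
    visited = set()
    traces = []
    # Seed each trace with the most frequently executed unvisited block.
    for seed in sorted(freq, key=freq.get, reverse=True):
        if seed in visited:
            continue
        trace = [seed]
        visited.add(seed)
        block = seed
        while True:
            # Follow the most likely successor that is not yet in any trace.
            candidates = [(p, s) for s, p in succs.get(block, [])
                          if s not in visited]
            if not candidates:
                break
            _, block = max(candidates)
            trace.append(block)
            visited.add(block)
        traces.append(trace)
    return traces


# Hypothetical if-then-else diamond with a strongly skewed branch in block A.
succs = {"A": [("B", 0.9), ("C", 0.1)], "B": [("D", 1.0)], "C": [("D", 1.0)]}
freq = {"A": 100, "B": 90, "C": 10, "D": 100}
traces = select_traces(succs, freq)
```

With the 90/10 split, the dominant trace is A, B, D and block C is left as an off-trace block that would need clean-up code. If the branch were close to 50/50, the chosen trace would cover only about half of the executions, which is the limitation the slide describes.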
  • 30. Software Pipelining
      • In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete, thus taking advantage of the parallelism in the data path.
      • It is also described as scheduling the operations within an iteration such that the iterations can be pipelined to yield optimal throughput.
      • The sequence of instructions before the steady state is called the PROLOG, and the sequence after the steady state is called the EPILOG.
  • 31. Software Pipelining Example
      • Source code:
          for(i=0;i<n;i++) sum += a[i]
      • Loop body in assembly:
          r1 = L r0
          ---;stall
          r2 = Add r2,r1
          r0 = add r0,4
      • Unroll loop & allocate registers:
          r1 = L r0
          ---;stall
          r2 = Add r2,r1
          r0 = Add r0,12
          r4 = L r3
          ---;stall
          r2 = Add r2,r4
          r3 = add r3,12
          r7 = L r6
          ---;stall
          r2 = Add r2,r7
          r6 = add r6,12
          r10 = L r9
          ---;stall
          r2 = Add r2,r10
          r9 = add r9,12
  • 33. Software Pipelining Example
      • Schedule the unrolled instructions, exploiting VLIW (or not).
      • Identify the repeating pattern (kernel).
      [Figure labels: PROLOG, kernel, EPILOG]
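How the prolog, kernel and epilog arise can be shown with a small sketch for a loop like `sum += a[i]`. The assumptions are mine, not the slides': a new iteration starts every cycle, the load result is consumed two cycles after the load issues, the pointer increment is left out for brevity, and the trip count is at least as large as the largest offset. The function `software_pipeline` and the `ops` encoding are hypothetical.

```python
def software_pipeline(ops, n_iters):
    """Overlap loop iterations, starting one new iteration per cycle.

    ops: list of (name, offset), where offset is the cycle within its own
         iteration at which the operation issues.
    Returns (prolog, kernel, epilog); each is a list of rows, and a row
    lists the operations issued together in one cycle as 'name[iteration]'.
    """
    depth = max(off for _, off in ops)
    assert n_iters >= depth, "sketch assumes at least `depth` iterations"
    rows = []
    for t in range(n_iters + depth):
        rows.append([f"{name}[{t - off}]" for name, off in ops
                     if 0 <= t - off < n_iters])
    prolog = rows[:depth]          # pipeline filling: some operations missing
    kernel = rows[depth:n_iters]   # steady state: every operation present
    epilog = rows[n_iters:]        # pipeline draining
    return prolog, kernel, epilog


# Load issues at cycle 0 of its iteration, the add at cycle 2 (after the stall).
ops = [("load", 0), ("add", 2)]
prolog, kernel, epilog = software_pipeline(ops, 4)
```

For four iterations the prolog issues `load[0]` and `load[1]` alone, the kernel rows pair a load from a later iteration with an add from an earlier one, and the epilog finishes `add[2]` and `add[3]`. The stall slot of each iteration is filled by work from its neighbours.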
  • 34. Constraints in Software Pipelining
      • Recurrence constraints: determined by loop-carried data dependencies.
      • Resource constraints: determined by total resource requirements.
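The two constraints are commonly combined into a lower bound on the initiation interval, the number of cycles between the starts of successive iterations. The terms ResMII, RecMII and MII come from the modulo scheduling literature rather than from these slides, and the resource counts and dependence cycles below are assumed values for a loop like `sum += a[i]`.

```python
from math import ceil


def res_mii(uses, units):
    """Resource bound: each resource needs ceil(uses / units) cycles per iteration."""
    return max(ceil(uses[r] / units[r]) for r in uses)


def rec_mii(cycles):
    """Recurrence bound over loop-carried dependence cycles.

    cycles: list of (total latency around the cycle, iteration distance).
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def min_initiation_interval(uses, units, cycles):
    """The initiation interval can be no smaller than either bound."""
    return max(res_mii(uses, units), rec_mii(cycles))


# Assumed loop body: one load, two ALU operations (accumulate, pointer bump).
uses = {"mem": 1, "alu": 2}
# Assumed recurrences: sum and pointer each depend on the previous iteration
# with latency 1 and distance 1.
cycles = [(1, 1), (1, 1)]

mii_one_alu = min_initiation_interval(uses, {"mem": 1, "alu": 1}, cycles)
mii_two_alus = min_initiation_interval(uses, {"mem": 1, "alu": 2}, cycles)
```

With one ALU the resource constraint dominates and a new iteration can start at best every 2 cycles; with two ALUs both bounds equal 1.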
  • 35. Remarks on Software Pipelining
      • Innermost loops, loops with larger trip counts, and loops without conditionals can be software pipelined.
      • Code size increases due to the prolog and epilog.
      • Code size increases due to unrolling for MVE (Modulo Variable Expansion).
      • Register allocation strategies for software-pipelined loops.
      • Loops with conditionals can be software pipelined if predicated execution is supported.
          – Higher resource requirement, but an efficient schedule