Performance Tuning on the Blackfin Processor




1
Outline

    Introduction
    Building a Framework
    Memory Considerations
    Benchmarks
    Managing Shared Resources
    Interrupt Management
    An Example
    Summary




2
Introduction

    The first level of optimization is provided by the compiler

    The remaining optimization comes from techniques applied at the
    “system” level
     Memory management
     DMA management
     Interrupt management

    The purpose of this presentation is to help you understand
    how some of the system aspects of your application can be
    managed to tune performance




3
One Challenge

    How do we get from here (unlimited memory, unlimited processing,
    file-based, power cord) to here (limited memory, finite processing,
    stream-based, battery)?

                                                  Type of memory required
    Format     Columns   Rows   Total Pixels    for direct implementation

    SQCIF          128     96         12288     L1, L2, off-chip L2
    QCIF           176    144         25344     L1, L2, off-chip L2
    CIF            352    288        101376     L2, off-chip L2
    VGA            640    480        307200     off-chip L2
    D-1 NTSC       720    486        349920     off-chip L2
    4CIF           704    576        405504     off-chip L2
    D-1 PAL        720    576        414720     off-chip L2
    16CIF         1408   1152       1622016     off-chip L2

          All formats require management
             of multiple memory spaces
4
The path to optimization

    [Chart: processor utilization (100% down to 0%) versus optimization
    steps (0 to 9), in three phases, each with its typical techniques:
    compiler optimization (“let compiler optimize”, “exploit architecture
    features”), system optimization (“efficient memory layout”,
    “streamline data flow”), and assembly optimization (“use specialized
    instructions”).]
5
The World Leader in High Performance Signal Processing Solutions




Creating a Framework
The Importance of Creating a “Framework”


    Framework definition: The software infrastructure that moves
    code and data within an embedded system

    Building this early on in your development can pay big
    dividends

    There are three categories of frameworks that we see
    consistently from Blackfin developers




7
Common Frameworks

    Processing on the fly
     Safety-critical applications (e.g., automotive lane departure
     warning system)
     Applications without external memory

    Programming ease overrides performance
     Programmers who need to meet a deadline

    Performance supersedes programming ease
     Algorithms that push the limits of the processor




8
Processing on the fly

     Can’t afford to wait until large video buffers are filled in
     external memory

     Instead, they can be brought into on-chip memory
     immediately and processed line-by-line
      This approach can lead to quick results in decision-based systems

     The processor core can directly access lines of video in on-
     chip memory.
      Software must ensure the active video frame buffer is not
      overwritten until processing on the current frame is complete.




9
Example of “Processing-on-the-fly” Framework

    [Diagram: video in (720 x 480 pixels, 33 msec frame rate, 63 µsec
    line rate) is brought by DMA into L1 memory one line at a time; the
    core has single-cycle access for the collision-avoidance algorithm,
    which raises a collision warning.]

      Data is processed one line at a time instead of waiting for an entire frame to be collected.
10
“Programming Ease Rules”

     Strives to achieve simplest programming model at the
     expense of some performance

     Focus is on time-to-market
      Optimization isn’t as important as time-to-market
       It can always be revisited later
      Provides a nice path for upgrades!

     Easier to develop for both novices and experts




11
Example of “Programming Ease Rules” Framework

    [Diagram: video comes in by DMA to L2 memory; the two cores split
    the data processing; sub-images are exchanged with external memory
    by DMA, and processed data is sent back out by DMA.]

     This programming model best matches time-to-market-focused designs
12
“Performance Rules”


      Bandwidth-efficient
      Strives to attain best performance, even if
      programming model is more complex
       Might include targeted assembly routines
      Every aspect of data flow is carefully planned
      Allows use of less expensive processors because the
      device is “right-sized” for the application
      Caveats
       Might not leave enough room for extra features or
       upgrades
       Harder to reuse




13
Example of “Performance Rules” Framework

    [Diagram: compressed data and video come in by DMA, one line at a
    time, through L2 memory; each core has single-cycle access to its
    own L1 memory; macroblocks move between L1 and external memory by
    DMA.]

             Provides the most efficient use of the external memory bus.
14
Common Attributes of Each Model

     Instruction and data management is best done up front in the
     project


     Small amounts of planning can save headaches later on




15




Features that can be used to improve performance
Now that we have discussed the background…


     We will now review the concepts that will help you tune
     performance
      Using Cache and DMA
      Managing memory
      Managing DMA channels
      Managing interrupts




17




Cache vs. DMA
Cache versus DMA




19




 Cache
Configuring internal instruction memory as cache



                 [Diagram: functions foo1(), foo2(), foo3() in external
                 memory are brought into L1 instruction memory, configured
                 here as a 4-way set-associative cache, where single-cycle
                 performance can be achieved.]

     Instruction cache provides several key benefits to increase
       performance
        Usually provides the highest bandwidth path into the core
         For linear code, the next instructions will be on the way into the core after
         each cache miss
        The most recently used instructions are the least likely to be replaced
        Critical items can be locked in cache, by line
        Instructions in cache can execute in a single cycle
21
Configuring internal data memory as cache


               [Diagram: example of data brought in from a peripheral. A
               peripheral DMA writes volatile data to external memory;
               volatile and static data are then brought into the core
               through the data cache.]

     Data cache also provides a way to increase performance
        Usually provides the highest bandwidth path into the core
          For linear data, the next data elements will be on the way into the core after each cache
          miss
        Write-through option keeps “source” memory up to date
          Data written to source memory every time it is modified
        Write-back option can further improve performance
          Data only written to source memory when data is replaced in cache
        “Volatile” buffers must be managed to ensure coherency between DMA and
        cache

22
Write-back vs. Write-through

     Write-back is usually 10-15% more efficient, but the gain is
     algorithm-dependent
     Write-through is better when coherency between more than one
     resource is required




     Make sure you try both options when all of your peripherals
     are running




23




    DMA
Why Use a DMA Controller?

     The DMA controller runs independently of the core
      The core should only have to set up the DMA and respond to interrupts
     Core processor cycles are available for processing data
     DMA can allow creative data movement and “filtering”
      Saves potential core passes to re-arrange the data




               [Diagram: the core sets up a descriptor list and services
               interrupts; the DMA controller moves data from a peripheral
               data source into a buffer in internal memory, interrupting
               the core when transfers complete.]
25
Using 2D DMA to efficiently move image data




26
DMA and Data Cache Coherence
        Coherency between data cache and buffer filled via DMA transfer must be maintained

        Cache must be “invalidated” to ensure “old” data is not used when processing most
        recent buffer
          Invalidate buffer addresses or actual cache line, depending on size of the buffer

        Interrupts can be used to indicate when it is safe to invalidate buffer for next
        processing interval

        This often provides a simpler programming model (with less of a performance
        increase) than a pure DMA model


                   [Diagram: a peripheral DMA alternately fills volatile
                   buffer0 and buffer1 in external memory; on the
                   new-buffer interrupt, the core invalidates the cache
                   lines associated with that buffer, then processes it
                   through the data cache.]
27
Does code fit into internal memory?

   Instruction partitioning (programming effort increases as you move
   through the steps):

     Yes: map code into internal memory; desired performance is achieved.
     No:  map code to external memory and turn instruction cache on.
          If desired performance is not achieved, lock lines with
          critical code and use L1 SRAM.
          If performance still falls short, use an overlay mechanism.
28
Is the data volatile or static?

   Data partitioning (programming effort increases as you move through
   the steps):

     Static: map to cacheable memory locations; single-cycle access is
     achieved.
     Volatile, buffers fit into internal memory: map to internal memory.
     Volatile, buffers do not fit: map to external memory.
          If DMA is part of the programming model, desired performance
          is achieved.
          Otherwise, turn data cache on:
               Buffer not larger than the cache: invalidate using the
               “invalidate” instruction before each read.
               Buffer larger than the cache: invalidate with direct
               cache-line access before each read.
29
A Combination of Cache and DMA Usually Provides the Best Performance

     [Diagram: the instruction cache (Ways 1-4) and data cache (Ways
     1-2) are filled over high-bandwidth cache fills from off-chip code
     (Func_A through Func_F, Main()), tables, and data; data SRAM
     (Buffer 1, Buffer 2, Stack) is fed by high-speed DMA from
     high-speed peripherals.]

     Once cache is enabled and the DMA controller is configured, the
     programmer can focus on core algorithm development.

     On-chip memory: smaller capacity but lower latency.
     Off-chip memory: greater capacity but larger latency.
30




Memory Considerations
Blackfin Memory Architecture: The Basics

  [Diagram of the memory hierarchy:]

  Core (600MHz)

  L1 instruction and data memories (600MHz): configurable as cache or
  SRAM; single cycle to access; tens of Kbytes

  Unified on-chip L2 (300MHz): several cycles to access; hundreds of
  Kbytes

  External memory, unified off-chip L2 (<133MHz): several system cycles
  to access; hundreds of Mbytes

  The DMA controller moves data between on-chip and off-chip memories.
32
Instruction and Data Fetches
     In a single core clock cycle, the processor can
     perform
      One instruction fetch of 64 bits and either …
      Two 32-bit data fetches or
      One 32-bit data fetch and one 32-bit data store

     The DMA controller can also be running in parallel
     with the core without any “cycle stealing”




33
Partitioning Data – Internal Memory Sub-Banks

       Multiple accesses can be made to different internal sub-banks in the
       same cycle

     [Diagram: in the un-optimized layout, Buffer0, Buffer1, Buffer2, and
     Coefficients are packed so core fetches and DMA transfers collide in
     the same sub-banks; in the optimized layout they are spread across
     sub-banks so every core fetch and DMA transfer lands in a different
     sub-bank.]

             Un-optimized: DMA and core conflict when accessing sub-banks.
             Optimized: core and DMA operate in harmony.

                         Take advantage of the sub-bank architecture!
34
L1 Instruction Memory: 16KB Configurable Bank

             [Diagram: four 4KB sub-banks, with an EAB port for cache-line
             fills and a DCB port for DMA.]

             As 16 KB SRAM:
             • Four 4KB single-ported sub-banks
             • Allows simultaneous core and DMA accesses to different
               sub-banks

             As 16 KB cache:
             • Least-Recently-Used replacement keeps frequently executed
               code in cache
             • 4-way set-associative with arbitrary locking of ways and
               lines
35
L1 Data Memory: 16KB Configurable Bank

             [Diagram: four 4KB sub-banks, with an EAB port for cache-line
             fills and a DCB port for DMA.]

      The block is multi-ported when the accesses are to different
      sub-banks, or when one odd and one even 32-bit word (address bit 2
      differs) is accessed within the same sub-bank.

             When used as SRAM:
             • Allows simultaneous dual-DAG and DMA access

             When used as cache:
             • Each bank is 2-way set-associative
             • No DMA access
             • Allows simultaneous dual-DAG access
36
Partitioning Code and Data – External Memory Banks

        Row activation within an SDRAM consumes multiple system clock
        cycles, but multiple rows can be “active” at the same time:
        one row in each bank.

     [Diagram: with code, Video Frame 0, Video Frame 1, and the Ref Frame
     packed into a single region, row-activation cycles are taken on
     almost every instruction fetch or DMA access; spread across the four
     16MByte banks internal to the SDRAM, row-activation cycles are
     spread across hundreds of accesses.]

                         Take advantage of all four open rows in external
                         memory to save activation cycles!
37




 Benchmarks
Important Benchmarks
     Core accesses to SDRAM take longer than accesses made by the
     DMA controller

      For example, Blackfin Processors with a 16-bit external bus
      behave as follows:
       16-bit core reads take 8 System Clock (SCLK) cycles
       32-bit core reads take 9 System Clock cycles

       16-bit DMA reads take ~1 SCLK cycle
       16-bit DMA writes take ~ 1 SCLK cycle


       Bottom line: Data is most efficiently moved with the DMA
                              controllers!

39




Managing Shared Resources
External Memory Interface: ADSP-BF561

     [Diagram: the core has priority to the external bus unless a DMA is
     urgent*; this is programmable, so that the DMA has a higher priority
     than the core. Core A is higher priority than Core B, and the
     priority is programmable.]

                                                                 * “Urgent” DMA implies data
                                                                will be lost if access isn’t granted
41
Priority of Core Accesses and the DMA Controller
                 at the External Bus


     A bit within the EBIU_AMGCTL register can be used to
     change the priority between core accesses and the DMA
     controller




42
What is an “urgent” condition?

     [Diagram: a peripheral on one side, external memory on the other,
     with the DMA FIFO in between.]

     An urgent condition arises when the DMA FIFO is empty while the
     peripheral is transmitting, or when the DMA FIFO is full while the
     peripheral has a sample to send.
43
Bus Arbitration Guideline Summary: Who wins?
     External memory (in descending priority)
      Locked Core accesses (testset instruction)
      Urgent DMA
      Cache-line fill
      Core accesses
      DMA accesses
        We can set all DMAs to be urgent, which elevates the priority
        of all DMA accesses above core accesses

     L1 memory (in descending priority)
      DMA accesses
      Core accesses




44




Tuning Performance
Managing External Memory

  Transfers in the same direction are more efficient than intermixed
                    accesses in different directions

     [Diagram: core and DMA accesses share the external bus to external
     memory; each change of direction costs a bus turn-around.]

       Group transfers in the same direction to reduce the number of
                              turn-arounds
46
Improving Performance

     DMA Traffic Control improves system performance
     when multiple DMAs are ongoing (typical system)
      Multiple DMA accesses can be done in the same “direction”
       For example, into SDRAM or out of SDRAM
      Makes more efficient use of SDRAM




47
DMA Traffic Control




DEB_Traffic_Period has the biggest impact on improving performance

The correct value is application-dependent. If three or fewer DMA
channels are active at any one time, a large value (15) of
DEB_Traffic_Period is usually better; when more than three channels are
active, a value in the middle of the range (4 to 7) is usually better.
48




Priority of DMAs
Priority of DMA Channels

     Each DMA controller has multiple channels

     If more than one DMA channel tries to access the controller,
     the highest priority channel wins

     The DMA channels are programmable in priority

     When more than one DMA controller is present, the priority of
     arbitration is programmable

     The DMA Queue Manager provided with System Services
     should be used


50
Interrupt Processing




51
Common Mistakes with Interrupts

     Spending too much time in an interrupt service routine
     prevents other critical code from executing
      From an architecture standpoint, interrupts are disabled once an
      interrupt is serviced
        Higher priority interrupts are enabled once the return address (RETI
        register) is saved to the stack

     It is important to understand your application’s real-time
     budget
      How long is the processor spending in each ISR?
      Are interrupts nested?




52
Programmer Options

     Program the most important interrupts as the highest priority

     Use nesting to ensure higher priority interrupts are not locked
     out by lower priority events

     Use the Call-back manager provided with System Services
      Interrupts are serviced quickly and higher priority interrupts are
      not locked out




53




An Example
Video Decoder Budget



      What’s the cycle budget for real-time decoding?
      (Cycle/pel * pixel/frame * frame/sec) < core frequency
      ⇒ Cycle/pel < (core frequency / [pixel/frame * frame/sec])
      ⇒ Leave at least 10% of the MIPS for other things (audio, system,
        transport layers…)
      For D-1 video: 720x480, 30 frame/sec (~10M pel/sec)
      Video Decoder budget only ~50 cycle/pel

      What’s the bandwidth budget?
      ([Encoded bitstream in via DMA] + [Reference frame in via DMA] + [Reconstructed MDMA Out]
        + [ITU-R 656 Out] + [PPI Output]) < System Bus Throughput
      Video Decoder Budget ~130 MB/s




55
Video Decoders

     [Diagram: the input bitstream (1011001001…) is written to a packet
     buffer in external memory, which also holds the reference frames
     and display frames. Signal processing in internal L1 memory
     performs decode, interpolation, motion compensation, and format
     conversion.]
56
Data Placement

     Large buffers go in SDRAM
      • Packet buffer (~3 MB)
      • 4 reference frames (each 720 x 480 x 1.5 bytes)
      • 8 display frames (each 1716 x 525 bytes)

     Small buffers go in L1 memory
      • VLD lookup tables
      • Inverse quantization, DCT, zig-zag, etc.
      • Temporary result buffers
57
Data Movement (SDRAM to L1)
           Reference frames are 2D in SDRAM
           Reference windows are linear in L1

     [Diagram: Y, U, and V reference windows are carved out of the
     frames in SDRAM and brought into linear L1 buffers by 2D-to-1D DMA
     for interpolation.]
58
Data Movement (L1 to SDRAM)
               All temporary results in L1 buffers are stored linearly.
               All data brought in from SDRAM is used linearly.

     [Diagram: a decoded macroblock in L1 is written into its 2D
     position in the SDRAM reference frame.]

      1D-to-2D DMA transfers the decoded macroblock to SDRAM to build
      the reference frame.
59
Bandwidth Used by DMA

            Input Data Stream            1 MB/s
            Reference Data In           30 MB/s
            In-Loop Filter Data In      15 MB/s
            Reference Data Out          15 MB/s
            In-Loop Filter Data Out     15 MB/s
            Video Data Out              27 MB/s

       Calculations are based on 30 frames per second.

          Be careful not to simply add up the bandwidths!
      Shared resources, bus turnaround, and concurrent activity
                     all need to be considered


60
Summary

     The compiler provides the first level of optimization

     There are some straightforward steps you can take up front
     in your development which will save you time

     Blackfin Processors have lots of features that help you
     achieve your desired performance level

     Thank you




61

Performance tuning on the blackfin Processor

  • 1. Performance Tuning on the Blackfin Processor 1
  • 2. Outline Introduction Building a Framework Memory Considerations Benchmarks Managing Shared Resources Interrupt Management An Example Summary 2
  • 3. Introduction The first level of optimization is provided by the compiler The remaining portion of the optimization comes from techniques performed at the “system” level Memory management DMA management Interrupt management The purpose of this presentation is to help you understand how some of the system aspects of your application can be managed to tune performance 3
  • 4. One Challenge How do we get from here Type of Memory Required For Total Direct Format Columns Rows Pixels Implementation SQCIF 128 96 12288 L1, L2, off-chip L2 QCIF 176 144 25344 L1, L2, off-chip L2 Unlimited memory CIF 352 288 101376 L2, off-chip L2 Unlimited processing VGA 640 480 307200 off-chip L2 File-based Power cord D-1 NTSC 720 486 349920 off-chip L2 4CIF 704 576 405504 off-chip L2 D-1 PAL 720 576 414720 off-chip L2 16CIF 1408 1152 1622016 off-chip L2 to here? Limited memory Finite processing Stream-based All formats require management Battery of multiple memory spaces 4
  • 5. The path to optimization Compiler System Assembly Optimization Optimization Optimization 100.00% 90.00% let compiler optimize 80.00% 70.00% Processor Utilization 60.00% 50.00% exploit architecture features 40.00% 30.00% efficient memory layout 20.00% use specialized instructions 10.00% streamline data flow 0.00% 0 1 2 3 4 5 6 7 8 9 Optimization Steps 5
  • 6. The World Leader in High Performance Signal Processing Solutions Creating a Framework
  • 7. The Importance of Creating a “Framework” Framework definition: The software infrastructure that moves code and data within an embedded system Building this early on in your development can pay big dividends There are three categories of frameworks that we see consistently from Blackfin developers 7
  • 8. Common Frameworks Processing on the fly Safety-critical applications (e.g., automotive lane departure warning system) Applications without external memory Programming ease overrides performance Programmers who need to meet a deadline Performance supersedes programming ease Algorithms that push the limits of the processor 8
  • 9. Processing on the fly Can’t afford to wait until large video buffers filled in external memory Instead, they can be brought into on-chip memory immediately and processed line-by-line This approach can lead to quick results in decision-based systems The processor core can directly access lines of video in on- chip memory. Software must ensure the active video frame buffer is not overwritten until processing on the current frame is complete. 9
  • 10. Example of “Processing-on-the-fly” Framework Collision avoidance CORE Single cycle Collision Warning 720 pixels access Video in 480 pixels DMA L1 MEM 33 msec frame rate 63 µsec line rate Data is processed one line at a time instead of waiting for an entire frame to be collected. 10
  • 11. “Programming Ease Rules” Strives to achieve simplest programming model at the expense of some performance Focus is on time-to-market Optimization isn’t as important as time-to-market It can always be revisited later Provides a nice path for upgrades! Easier to develop for both novices and experts 11
  • 12. Example of “Programming Ease Rules” framework Cores split Data CORE CORE processing Processed Data DMA Sub-image External Memory L2 MEM DMA Video in DMA This programming model best matches time-to-market focused designs 12
  • 13. “Performance Rules” Bandwidth-efficient Strives to attain best performance, even if programming model is more complex Might include targeted assembly routines Every aspect of data flow is carefully planned Allows use of less expensive processors because the device is “right-sized” for the application Caveats Might not leave enough room for extra features or upgrades Harder to reuse 13
  • 14. Example of “Performance Rules” Framework CORE CORE Single cycle Single cycle Compressed access access data DMA Macro block External Memory L1 MEM L1 MEM DMA Video in DMA 1 line at a time L2 MEM Provides most efficient use of external memory bus. 14
  • 15. Common Attributes of Each Model Instruction and data management is best done up front in the project Small amounts of planning can save headaches later on 15
  • 16. The World Leader in High Performance Signal Processing Solutions Features that can be used to improve performance
  • 17. Now that we have discussed the background… We will now review the concepts that will help you tune performance Using Cache and DMA Managing memory Managing DMA channels Managing interrupts 17
  • 18. The World Leader in High Performance Signal Processing Solutions Cache vs. DMA
  • 20. The World Leader in High Performance Signal Processing Solutions Cache
  • 21. Configuring internal instruction memory as cache foo1(); Core Way 1 foo2(); foo3(); External memory Example: L1 instruction memory configured as 4-way set-associative cache Instructions are brought into internal memory where single cycle performance can be achieved Instruction cache provides several key benefits to increase performance Usually provides the highest bandwidth path into the core For linear code, the next instructions will be on the way into the core after each cache miss The most recently used instructions are the least likely to be replaced Critical items can be locked in cache, by line Instructions in cache can execute in a single cycle 21
  • 22. Configuring internal data memory as cache Volatile data Peripheral Way 1 DMA Core Static data Cache External memory Example: Data brought in from a peripheral Data cache also provides a way to increase performance Usually provides the highest bandwidth path into the core For linear data, the next data elements will be on the way into the core after each cache miss Write-through option keeps “source” memory up to date Data written to source memory every time it is modified Write-back option can further improve performance Data only written to source memory when data is replaced in cache “Volatile” buffers must be managed to ensure coherency between DMA and cache 22
  • 23. Write-back vs. Write-through Write-back is usually 10-15% more efficient but … It is algorithm dependent Write-through is better when coherency between more than one resource is required Make sure you try both options when all of your peripherals are running 23
  • 24. The World Leader in High Performance Signal Processing Solutions DMA
  • 25. Why Use a DMA Controller? The DMA controller runs independently of the core The core should only have to set up the DMA and respond to interrupts Core processor cycles are available for processing data DMA can allow creative data movement and “filtering” Saves potential core passes to re-arrange the data Interrupt handler DMA Core Descriptor list Peripheral Buffer Data Source Internal memory 25
  • 26. Using 2D DMA to efficiently move image data 26
  • 27. DMA and Data Cache Coherence Coherency between data cache and buffer filled via DMA transfer must be maintained Cache must be “invalidated” to ensure “old” data is not used when processing most recent buffer Invalidate buffer addresses or actual cache line, depending on size of the buffer Interrupts can be used to indicate when it is safe to invalidate buffer for next processing interval This often provides a simpler programming model (with less of a performance increase) than a pure DMA model Volatile buffer0 Data brought in Core Data Cache from a peripheral New Volatile buffer1 buffer Invalidate Process cache lines Interrupt buffer associated with that buffer 27
  • 28. Does code fit into internal memory? Yes No Instruction partitioning Map code Map code to into internal external memory memory Desired performance is Turn Cache on achieved No Is desired performance achieved? Lock lines with critical code Use L1 SRAM Yes Is desired performance achieved? Yes No Use overlay mechanism Programming effort Increases as you move across Desired performance is 28 achieved
  • 29. Is the data volatile or static? Static Volatile Map to cacheable Will the buffers fit into Data partitioning memory internal memory? locations Yes No Single cycle access achieved Map to external memory Is DMA part of the programming model? No Turn data cache on Desired performance is achieved Is buffer larger than cache size? No Yes Invalidate using “invalidate” instruction before read Invalidate with direct cache line access before read Programming effort 29 Increases as you move across
  • 30. A Combination of Cache and DMA Usually Provides the Best Performance Instruction Cache Func_A Way 1 Func_B Once cache is Way 2 enabled and the High bandwidth cache fill Func_C DMA controller is Way 3 configured, the programmer can Way 4 Func_D focus on core algorithm Data Cache Func_E development. Way 1 Func_F Main( ) Way 2 High bandwidth cache fill Tables Data SRAM Data Buffer 1 Buffer 2 Stack High-speed DMA High-speed peripherals On-chip memory: Smaller capacity but lower latency Off-chip memory: Greater capacity but larger latency 30
  • 31. The World Leader in High Performance Signal Processing Solutions Memory Considerations
  • 32. Blackfin Memory Architecture: The Basics Core 600MHz Configurable as Cache or SRAM Single cycle to L1 Instruction L1 Data Memory L1 Data Memory access Memory 600MHz 10’s of Kbytes Unified 300MHz Several cycles to access DMA on-chip L2 100’s of Kbytes On-chip Off-chip Several system cycles to access External Memory External Memory <133MHz Unified off-chip L2 External Memory 100’s of Mbytes Memory 32
  • 33. Instruction and Data Fetches In a single core clock cycle, the processor can perform One instruction fetch of 64 bits and either … Two 32-bit data fetches or One 32-bit data fetch and one 32-bit data store The DMA controller can also be running in parallel with the core without any “cycle stealing” 33
  • 34. Partitioning Data – Internal Memory Sub-Banks Multiple accesses can be made to different internal sub-banks in the same cycle Core fetch DMA Core fetch Buffer0 Buffer0 Buffer1 DMA Buffer2 Buffer1 DMA Coefficients Core fetch DMA DMA Buffer2 Unused Core fetch Coefficients Unused Un-optimized Optimized DMA and core conflict when Core and DMA operate in accessing sub-banks harmony Take advantage of the sub-bank architecture! 34
  • 35. L1 Instruction Memory 16KB Configurable Bank 4KB 4KB sub-bank sub-bank EAB – Cache Line Fill DCB - DMA Instruction 4KB 4KB sub-bank sub-bank 16 KB SRAM 16 KB cache • Four 4KB single-ported • Least Recently Used sub-banks replacement keeps frequently executed code in cache • Allows simultaneous • 4-way set associative with core and DMA accesses arbitrary locking of ways and to different banks lines 35
  • 36. L1 Data Memory 16KB Configurable Bank Block is Multi-ported when: Accessing different sub-bank 4KB 4KB OR sub-bank sub-bank Accessing one odd and one even EAB – Cache Line Fill access (Addr bit 2 different) within the same sub-bank. DCB Data 0 - DMA Data 1 4KB 4KB sub-bank sub-bank • When Used as SRAM • When Used as Cache – Allows simultaneous – Each bank is 2-way dual DAG and DMA set-associative access – No DMA access – Allows simultaneous dual DAG access 36
  • 37. Partitioning Code and Data -- External Memory Banks Row activation within an SDRAM consumes multiple system clock cycles Multiple rows can be “active” at the same time One row in each bank Code Code Video Frame 0 Instruction Instruction Fetch Video Frame 1 Video Frame 0 Fetch Four External Ref Frame External 16MByte DMA DMA Bus Bus Banks Video Frame 1 internal to DMA Unused DMA SDRAM Ref Frame Unused External Memory External Memory Row activation cycles are Row activation cycles are spread taken almost every access across hundreds of accesses Take advantage of all four open rows in external 37 memory to save activation cycles!
  • 38. The World Leader in High Performance Signal Processing Solutions Benchmarks
  • 39. Important Benchmarks Core accesses to SDRAM take longer than accesses made by the DMA controller For example, Blackfin Processors with a 16-bit external bus behave as follows: 16-bit core reads take 8 System Clock (SCLK) cycles 32-bit core reads take 9 System Clock cycles 16-bit DMA reads take ~1 SCLK cycle 16-bit DMA writes take ~ 1 SCLK cycle Bottom line: Data is most efficiently moved with the DMA controllers! 39
  • 40. The World Leader in High Performance Signal Processing Solutions Managing Shared Resources
  • 41. External Memory Interface: ADSP-BF561 Core has priority to external bus You are able to program this so unless that the DMA has a higher DMA is urgent* priority than the core * “Urgent” DMA implies data will be lost if access isn’t granted Core A is higher priority than Core B Programmable priority 41
  • 42. Priority of Core Accesses and the DMA Controller at the External Bus A bit within the EBIU_AMGCTL register can be used to change the priority between core accesses and the DMA controller 42
  • 43. What is an “urgent” condition? Peripheral Peripheral When the DMA FIFO is empty and the Peripheral is transmitting Peripheral Peripheral When the DMA FIFO is full and the FIFO or FIFO Peripheral has a sample to send DMA FIFO DMA FIFO Empty Full External memory External memory 43
  • 44. Bus Arbitration Guideline Summary: Who wins? External memory (in descending priority) Locked Core accesses (testset instruction) Urgent DMA Cache-line fill Core accesses DMA accesses We can set all DMA’s to be urgent, which will elevate the priority of all DMAs above Core accesses L1 memory (in descending priority) DMA accesses Core accesses 44
• 45. Tuning Performance
• 46. Managing External Memory Transfers in the same direction are more efficient than intermixed accesses in different directions. [diagram: core and DMA accesses sharing the external bus to external memory] Group transfers in the same direction to reduce the number of turn-arounds.
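The cost of intermixing directions can be counted directly: every read-to-write or write-to-read transition burns bus turnaround cycles. A minimal sketch (the function and the 'R'/'W' encoding are illustrative):

```c
/* Sketch: count bus-direction turn-arounds in a sequence of external
   accesses, encoded as a string of 'R' (read) and 'W' (write).
   Each change of direction costs idle turnaround cycles on the bus. */
int turnarounds(const char *dirs) {
    if (dirs[0] == '\0') return 0;
    int count = 0;
    for (int i = 1; dirs[i] != '\0'; i++)
        if (dirs[i] != dirs[i - 1])
            count++;
    return count;
}
```

Eight fully interleaved accesses ("RWRWRWRW") pay seven turnarounds; the same eight accesses grouped by direction ("RRRRWWWW") pay one, which is exactly the effect of batching same-direction transfers.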
  • 47. Improving Performance DMA Traffic Control improves system performance when multiple DMAs are ongoing (typical system) Multiple DMA accesses can be done in the same “direction” For example, into SDRAM or out of SDRAM Makes more efficient use of SDRAM 47
• 48. DMA Traffic Control DEB_Traffic_Period has the biggest impact on improving performance. The correct value is application-dependent, but if three or fewer DMA channels are active at any one time, a larger value (15) for DEB_Traffic_Period is usually better. When more than three channels are active, a value closer to the middle of the range (4 to 7) is usually better.
• 49. Priority of DMAs
  • 50. Priority of DMA Channels Each DMA controller has multiple channels If more than one DMA channel tries to access the controller, the highest priority channel wins The DMA channels are programmable in priority When more than one DMA controller is present, the priority of arbitration is programmable The DMA Queue Manager provided with System Services should be used 50
• 52. Common Mistakes with Interrupts Spending too much time in an interrupt service routine prevents other critical code from executing. From an architecture standpoint, interrupts are disabled once an interrupt is serviced. Higher-priority interrupts are enabled once the return address (RETI register) is saved to the stack. It is important to understand your application’s real-time budget: How long is the processor spending in each ISR? Are interrupts nested?
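A quick way to reason about the real-time budget is to compute the fraction of the core each ISR consumes. A minimal sketch, assuming a hypothetical 600 MHz core clock (the clock rate and function name are assumptions, not from the slides):

```c
/* Sketch: fraction of the core consumed by one interrupt source,
   assuming a hypothetical 600 MHz core clock (CCLK). */
#define CCLK_HZ 600e6

double isr_load(double cycles_per_isr, double interrupts_per_sec) {
    return cycles_per_isr * interrupts_per_sec / CCLK_HZ;
}
```

For example, an ISR that takes 1,000 cycles and fires at a 48 kHz audio rate consumes about 8% of the core by itself; summing this over all interrupt sources shows quickly whether the ISRs are crowding out critical foreground code.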
• 53. Programmer Options Program the most important interrupts as the highest priority. Use nesting to ensure higher-priority interrupts are not locked out by lower-priority events. Use the Callback Manager provided with System Services: interrupts are serviced quickly, and higher-priority interrupts are not locked out.
• 54. An Example
• 55. Video Decoder Budget What’s the cycle budget for real-time decoding? (Cycles/pel * pixels/frame * frames/sec) &lt; core frequency ⇒ cycles/pel &lt; core frequency / (pixels/frame * frames/sec). Leave at least 10% of the MIPS for other things (audio, system, transport layers…). For D-1 video (720x480, 30 frames/sec, ~10M pels/sec), the video decoder budget is only ~50 cycles/pel. What’s the bandwidth budget? ([Encoded bitstream in via DMA] + [Reference frame in via DMA] + [Reconstructed MDMA out] + [ITU-R 656 out] + [PPI output]) &lt; system bus throughput. Video decoder budget: ~130 MB/s.
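The cycle-budget formula above can be checked numerically. A minimal sketch, assuming a hypothetical 600 MHz Blackfin core with 10% held in reserve (the clock rate is an assumption; the formula and the 10% reserve are from the slide):

```c
/* Sketch: decoder cycle budget per pel, per the slide's formula:
   cycles/pel < core frequency / (pixels/frame * frames/sec),
   with 10% of the MIPS reserved for audio/system/transport. */
double cycles_per_pel_budget(double core_hz, long pels_per_frame, int fps) {
    return 0.9 * core_hz / ((double)pels_per_frame * fps);
}
```

For D-1 at 720x480 and 30 frames/sec (~10.4M pels/sec), an assumed 600 MHz core leaves roughly 52 cycles/pel, consistent with the ~50 cycles/pel figure quoted above.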
• 56. Video Decoders [diagram: the input bitstream (1011001001…) feeds a packet buffer, reference frames, and display frames held in external memory; the signal processing (decode, interpolation, uncompensate, format conversion) runs out of internal L1 memory]
  • 57. Data Placement Large buffers go in SDRAM •Packet Buffer (~ 3 MB) •4 Reference Frames (each 720x480x1.5 bytes) •8 Display Frames (each 1716x525 bytes) Small Buffers go in L1 memory •VLD Lookup Tables •Inverse Quantization, DCT, zig-zag, etc. •Temporary Result Buffers 57
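The sizes on this slide can be sanity-checked against one 16-MByte SDRAM bank. A minimal sketch (the function names are illustrative; the buffer dimensions are from the slide, with 1.5 bytes/pel reflecting 4:2:0 chroma subsampling):

```c
/* Sketch: external-memory footprint of the decoder's large buffers.
   720x480 @ 1.5 bytes/pel (4:2:0) per reference frame;
   1716x525 bytes per display frame; ~3 MB packet buffer. */
long ref_frame_bytes(void)     { return 720 * 480 * 3 / 2; }
long display_frame_bytes(void) { return 1716 * 525; }

long total_sdram_bytes(void) {
    return 3L * 1024 * 1024          /* packet buffer      */
         + 4 * ref_frame_bytes()     /* 4 reference frames */
         + 8 * display_frame_bytes();/* 8 display frames   */
}
```

The total comes to roughly 12.4 MB, so everything fits in external SDRAM; recalling slide 37, spreading the buffers across the four banks keeps row-activation overhead down.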
• 58. Data Movement (SDRAM to L1) Reference frames are 2D in SDRAM; reference windows are linear in L1. 2D-to-1D DMA is used to bring in reference windows (Y, U, V) for interpolation. [diagram: windows within 2D frames in SDRAM transferred to linear buffers in L1]
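The addressing a 2D-to-1D descriptor performs can be modeled in plain C: the DMA walks a W x H window inside a wider frame using the frame width as the line stride, and deposits the samples contiguously. This is only the addressing arithmetic; actual descriptor and register setup varies by Blackfin part.

```c
/* Sketch: addressing performed by a 2D-to-1D DMA. A rectangular
   win_w x win_h window at (win_x, win_y) inside a frame of width
   frame_width is gathered into a linear output buffer. */
void dma_2d_to_1d(const unsigned char *frame, int frame_width,
                  int win_x, int win_y, int win_w, int win_h,
                  unsigned char *linear_out) {
    for (int row = 0; row < win_h; row++)
        for (int col = 0; col < win_w; col++)
            *linear_out++ = frame[(win_y + row) * frame_width + (win_x + col)];
}
```

The hardware does this gather without core involvement, which is what lets the core interpolate one reference window while the DMA fetches the next.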
• 59. Data Movement (L1 to SDRAM) All temporary results in L1 buffers are stored linearly. All data brought in from SDRAM are used linearly. [diagram: decoded macroblock in L1 transferred to SDRAM] 1D-to-2D DMA transfers the decoded macroblock to SDRAM to build the reference frame.
• 60. Bandwidth Used by DMA Input data stream: 1 MB/s; reference data in: 30 MB/s; in-loop filter data in: 15 MB/s; reference data out: 15 MB/s; in-loop filter data out: 15 MB/s; video data out: 27 MB/s. (Calculations based on 30 frames per second.) Be careful not to simply add up the bandwidths! Shared resources, bus turnaround, and concurrent activity all need to be considered.
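Even though a simple sum understates the true load, it is a useful lower bound to check against the ~130 MB/s budget from slide 55. A minimal sketch (the function name is illustrative; the per-stream figures are from the slide):

```c
/* Sketch: raw sum of the per-stream DMA bandwidths listed above,
   in MB/s. The raw sum is a lower bound only: bus turnarounds,
   arbitration, and concurrent activity add overhead on top. */
int raw_dma_bandwidth_mbs(void) {
    int streams[] = { 1, 30, 15, 15, 15, 27 };  /* MB/s, per the slide */
    int total = 0;
    for (int i = 0; i < (int)(sizeof streams / sizeof streams[0]); i++)
        total += streams[i];
    return total;
}
```

The raw sum is 103 MB/s against a ~130 MB/s budget, so the decoder fits, but the margin is what absorbs the turnaround and arbitration overhead the slide warns about.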
• 61. Summary The compiler provides the first level of optimization. There are some straightforward steps you can take up front in your development that will save you time. Blackfin processors have lots of features that help you achieve your desired performance level. Thank you