Multi-core architectures

Single-core computer

Single-core CPU chip
[Figure: CPU chip with the single core labeled]
Multi-core architectures
• This lecture is about a new trend in
  computer architecture:
  Replicate multiple processor cores on a
  single die.
[Figure: multi-core CPU chip with Core 1, Core 2, Core 3, and Core 4]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)


[Figure: cores 1 to 4 side by side on one chip]
The cores run in parallel
[Figure: thread 1 to thread 4, each running on its own core (cores 1 to 4)]
Within each core, threads are time-sliced
       (just like on a uniprocessor)
[Figure: several threads assigned to each of cores 1 to 4]
Interaction with the Operating System
• OS perceives each core as a separate processor

• OS scheduler maps threads/processes
  to different cores

• Most major operating systems support multi-core today:
  Windows, Linux, Mac OS X, …
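
Since the OS treats each core as a separate processor, a program can ask the OS how many processors it sees. The following is a minimal sketch, assuming a Unix-like system where sysconf() supports _SC_NPROCESSORS_ONLN (true on Linux and Mac OS X, but not required by every system). The function names are made up for illustration.

```c
#include <stdio.h>
#include <unistd.h>

/* Number of processors currently online as seen by the OS, or -1 if
   the query is not supported. On a multi-core machine each core counts
   as one processor; with SMT each hardware thread usually counts too. */
long online_processors(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Prints the count; meant to be called from main(). */
void print_online_processors(void) {
    printf("OS sees %ld processor(s)\n", online_processors());
}
```

On a dual-core machine with two hyper-threads per core, this would typically report 4.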



Why multi-core?
• Difficult to make single-core
  clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – speed of light problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive
    air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift
  towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction
  level
• The processor can re-order, pipeline
  instructions, split them into
  microinstructions, do aggressive branch
  prediction, etc.
• Instruction-level parallelism enabled rapid
  increases in processor speeds over the
  last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
  thread (Web server, database server)
• A computer game can do AI, graphics, and
  physics in three separate threads
• Single-core superscalar processors cannot
  fully exploit TLP
• Multi-core architectures are the next step in
  processor evolution: explicitly exploiting TLP
General context: Multiprocessors
• A multiprocessor is any
  computer with several
  processors

• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data
[Figure: Lemieux cluster, Pittsburgh Supercomputing Center]
Multiprocessor memory types
• Shared memory:
  In this model, there is one (large) common
  shared memory for all processors

• Distributed memory:
  In this model, each processor has its own
  (small) local memory, and its content is
  not replicated anywhere else

A multi-core processor is a special
kind of multiprocessor:
all processors are on the same chip
• Multi-core processors are MIMD:
  Different cores execute different threads
  (Multiple Instructions), operating on different
  parts of memory (Multiple Data).

• Multi-core is a shared memory multiprocessor:
  All cores share the same memory


What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications,
  CAD/CAM
• In general, applications with
  thread-level parallelism
  (as opposed to instruction-level parallelism)
Each of these can run on its own core
More examples
• Editing a photo while recording a TV show
  through a digital video recorder
• Downloading software while running an
  anti-virus program
• “Anything that can be threaded today will
  map efficiently to multi-core”
• BUT: some applications are difficult to
  parallelize
A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed:
  The processor pipeline can get stalled:
  – Waiting for the result of a long floating point
    (or integer) operation
  – Waiting for data to arrive from memory
  Other execution units wait unused
[Figure: processor block diagram with Bus, L2 Cache and Control, L1 D-Cache and D-TLB, Integer and Floating Point units, Schedulers, Uop queues, Rename/Alloc, BTB, Trace Cache, uCode ROM, Decoder, BTB and I-TLB. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
  SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
  on the same core

• Example: if one thread is waiting for a floating
  point operation to complete, another thread can
  use the integer units


Without SMT, only a single thread
    can run at any given time
[Figure: processor block diagram; only Thread 1 (floating point) is running]
Without SMT, only a single thread
    can run at any given time
[Figure: processor block diagram; only Thread 2 (integer operation) is running]
SMT processor: both threads can
       run concurrently
[Figure: processor block diagram; Thread 2 (integer operation) and Thread 1 (floating point) run at the same time]
But: Can’t simultaneously use the
       same functional unit
[Figure: processor block diagram; Thread 1 and Thread 2 both target the integer unit, marked IMPOSSIBLE]
This scenario is impossible with SMT on a single core
(assuming a single integer unit)
SMT is not a “true” parallel processor
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each
  simultaneous thread as a separate
  “virtual processor”
• The chip has only a single copy
  of each resource
• Compare to multi-core:
  each core has its own copy of resources
Multi-core:
  threads can run on separate cores
[Figure: two processor block diagrams side by side, one per core; Thread 1 runs on the first core and Thread 2 on the second]
Multi-core:
  threads can run on separate cores
[Figure: two processor block diagrams side by side, one per core; Thread 3 runs on the first core and Thread 4 on the second]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads:
  2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can
       run concurrently
[Figure: two SMT cores side by side; Threads 1 and 3 run on the first core, Threads 2 and 4 on the second]
Comparison: multi-core vs SMT
• Advantages/disadvantages?




Comparison: multi-core vs SMT
• Multi-core:
  – Since there are several cores,
    each is smaller and not as powerful
    (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level
    parallelism
The memory hierarchy
• If simultaneous multithreading only:
  – all caches shared
• Multi-core chips:
  – L1 caches private
  – L2 caches private in some architectures
    and shared in others
• Memory is always shared


“Fish” machines
• Dual-core
  Intel Xeon processors
• Each core is
  hyper-threaded
• Private L1 caches
• Shared L2 caches
[Figure: CORE0 and CORE1, each with hyper-threads and its own L1 cache, above one shared L2 cache and memory]
Designs with private L2 caches
[Figure, left: CORE0 and CORE1, each with its own L1 cache and L2 cache, above memory]
Both L1 and L2 are private
Examples: AMD Opteron, AMD Athlon, Intel Pentium D
[Figure, right: CORE0 and CORE1, each with its own L1, L2, and L3 cache, above memory]
A design with L3 caches
Example: Intel Itanium 2
Private vs shared caches?
• Advantages/disadvantages?




Private vs shared caches
• Advantages of private:
  – They are closer to the core, so faster access
  – Reduces contention
• Advantages of shared:
  – Threads on different cores can share the
    same cache data
  – More cache space available if a single (or a
    few) high-performance thread runs on the
    system
The cache coherence problem
• Since we have private caches:
  How to keep the data consistent across caches?
• Each core should perceive the memory as a
  monolithic array, shared by all the cores




The cache coherence problem
Suppose variable x initially contains 15213


[Figure: multi-core chip with Cores 1 to 4, each with one or more levels of cache; main memory holds x=15213]
The cache coherence problem
Core 1 reads x


[Figure: Core 1's cache now holds x=15213; main memory holds x=15213]
The cache coherence problem
Core 2 reads x


[Figure: the caches of Core 1 and Core 2 both hold x=15213; main memory holds x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660


[Figure: Core 1's cache holds x=21660, Core 2's cache still holds x=15213; main memory holds x=21660, assuming write-through caches]
The cache coherence problem
Core 2 attempts to read x… gets a stale copy


[Figure: Core 1's cache holds x=21660, Core 2's cache holds the stale x=15213; main memory holds x=21660]
Solutions for cache coherence
• This is a general problem with
  multiprocessors, not limited just to multi-core
• There exist many solution algorithms,
  coherence protocols, etc.

• A simple solution:
  invalidation-based protocol with snooping


Inter-core bus


[Figure: multi-core chip with Cores 1 to 4, each with one or more levels of cache, connected to each other and to main memory by an inter-core bus]
Invalidation protocol with snooping
• Invalidation:
  If a core writes to a data item, all other
  copies of this data item in other caches
  are invalidated
• Snooping:
  All cores continuously “snoop” (monitor)
  the bus connecting the cores.
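
The protocol described above can be sketched as a toy model in C. This is only an illustration under simplifying assumptions: a single variable x, write-through caches, and a loop over the other caches standing in for the bus broadcast. The type and function names (Chip, core_read, core_write) are hypothetical, and cores are numbered from 0 here.

```c
#include <stdbool.h>

#define NUM_CORES 4

/* One cached copy of the single variable x per core (toy model). */
typedef struct {
    bool valid;
    int  value;
} CacheLine;

typedef struct {
    int       memory_x;           /* x in main memory */
    CacheLine cache[NUM_CORES];   /* private cache of each core */
} Chip;

void chip_init(Chip *chip, int x) {
    chip->memory_x = x;
    for (int i = 0; i < NUM_CORES; i++) {
        chip->cache[i].valid = false;
        chip->cache[i].value = 0;
    }
}

/* Read x on `core`: a hit returns the cached copy, a miss loads the
   current value from main memory first. */
int core_read(Chip *chip, int core) {
    CacheLine *line = &chip->cache[core];
    if (!line->valid) {
        line->value = chip->memory_x;
        line->valid = true;
    }
    return line->value;
}

/* Write x on `core`: update the local copy, write through to memory,
   and invalidate every other copy. The loop models the other caches
   snooping the invalidation request on the bus. */
void core_write(Chip *chip, int core, int x) {
    chip->cache[core].value = x;
    chip->cache[core].valid = true;
    chip->memory_x = x;
    for (int i = 0; i < NUM_CORES; i++) {
        if (i != core) {
            chip->cache[i].valid = false;
        }
    }
}
```

With x initialized to 15213, two cores reading x and then one of them writing 21660 leaves the other core's copy invalid, so its next read misses and fetches 21660 rather than the stale value.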


The cache coherence problem
Revisited: Cores 1 and 2 have both read x


[Figure: the caches of Core 1 and Core 2 both hold x=15213; main memory holds x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660


[Figure: Core 1's cache holds x=21660 and Core 1 sends an invalidation request over the inter-core bus; Core 2's copy x=15213 is INVALIDATED; main memory holds x=21660, assuming write-through caches]
The cache coherence problem
After invalidation:


[Figure: only Core 1's cache holds x=21660; main memory holds x=21660]
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new copy.


[Figure: the caches of Core 1 and Core 2 both hold x=21660; main memory holds x=21660]
Alternative to invalidate protocol: update protocol
Core 1 writes x=21660:
[Figure: Core 1 broadcasts the updated value over the inter-core bus; Core 2's cached copy is UPDATED to x=21660; main memory holds x=21660, assuming write-through caches]
Which do you think is better?
  Invalidation or update?




Invalidation vs update
• Multiple writes to the same location
  – invalidation: a request is sent only on the first write
  – update: must broadcast each write
            (which includes the new variable value)


• Invalidation generally performs better:
  it generates less bus traffic
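
The difference in bus traffic can be made concrete with a small count. This sketch assumes one core writes the same location several times in a row while other cores hold copies and do not touch it in between. It counts only coherence messages, not the write-through traffic to memory. The function names are made up for illustration.

```c
/* Coherence messages on the bus when one core writes the same location
   n_writes times in a row (other cores hold copies, no accesses in between). */

/* Invalidation: only the first write has to invalidate the other copies;
   later writes find no other valid copies left. */
unsigned invalidation_bus_messages(unsigned n_writes) {
    return n_writes > 0 ? 1u : 0u;
}

/* Update: every write broadcasts the new value to the other caches. */
unsigned update_bus_messages(unsigned n_writes) {
    return n_writes;
}
```

For 10 consecutive writes this gives 1 message under invalidation and 10 under update.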


Invalidation protocols
• This was just the basic
  invalidation protocol
• More sophisticated protocols
  use extra cache state bits
• MSI, MESI
  (Modified, Exclusive, Shared, Invalid)
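
As an illustration of what the extra state bits encode, here is a sketch of the state transitions of a simplified MSI protocol for a single cache line. It is a textbook-style simplification, not the protocol of any particular processor: write-backs of modified data are not modeled, and the event names are hypothetical.

```c
/* State of one cache line in one core's cache. */
typedef enum { INVALID, SHARED, MODIFIED } MsiState;

/* Events seen by that cache: accesses by its own core, and reads or
   writes by other cores observed (snooped) on the bus. */
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } MsiEvent;

/* Next state of the line after the given event. */
MsiState msi_next(MsiState state, MsiEvent event) {
    switch (event) {
    case LOCAL_READ:
        /* A miss loads the line as SHARED; a hit keeps the state. */
        return state == INVALID ? SHARED : state;
    case LOCAL_WRITE:
        /* Writing requires exclusive ownership of the line. */
        return MODIFIED;
    case BUS_READ:
        /* Another core reads: a modified line is downgraded to SHARED. */
        return state == MODIFIED ? SHARED : state;
    case BUS_WRITE:
        /* Another core writes: our copy becomes stale. */
        return INVALID;
    }
    return state;
}
```

MESI adds an Exclusive state for a clean line held by exactly one cache, so that a later local write needs no bus message.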



Programming for multi-core
• Programmers must use threads or
  processes

• Spread the workload across multiple cores

• Write parallel algorithms

• OS will map threads/processes to cores
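
As a minimal sketch of these points, the following splits the summation of an array across POSIX threads and lets the OS place the threads on cores. The worker count of 4, the SumTask type, and the function names are assumptions made for this example.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumed worker count, e.g. one per core */

typedef struct {
    const int *data;   /* shared input array (read-only) */
    size_t begin;      /* first index of this worker's slice */
    size_t end;        /* one past the last index of the slice */
    long partial;      /* result written by the worker */
} SumTask;

static void *sum_worker(void *arg) {
    SumTask *task = arg;
    long sum = 0;
    for (size_t i = task->begin; i < task->end; i++) {
        sum += task->data[i];
    }
    task->partial = sum;   /* each worker writes only its own task */
    return NULL;
}

/* Sums data[0..n) using up to NUM_THREADS threads. */
long parallel_sum(const int *data, size_t n) {
    pthread_t threads[NUM_THREADS];
    SumTask tasks[NUM_THREADS];
    int created[NUM_THREADS];
    size_t chunk = (n + NUM_THREADS - 1) / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {
        size_t begin = (size_t)i * chunk;
        if (begin > n) begin = n;
        size_t end = begin + chunk;
        if (end > n) end = n;
        tasks[i] = (SumTask){ data, begin, end, 0 };
        created[i] = pthread_create(&threads[i], NULL,
                                    sum_worker, &tasks[i]) == 0;
        if (!created[i]) {
            sum_worker(&tasks[i]);   /* no thread available: run inline */
        }
    }

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        if (created[i]) pthread_join(threads[i], NULL);
        total += tasks[i].partial;
    }
    return total;
}
```

The workers share the input array but write to disjoint SumTask entries, so no locking is needed until the partial results are combined after the joins.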
Thread safety very important
• Pre-emptive context switching:
  context switch can happen AT ANY TIME

• True concurrency, not just uniprocessor
  time-slicing

• Concurrency bugs exposed much faster
  with multi-core

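However: Need to use synchronization
even if only time-slicing on a uniprocessor
 int counter = 0;   /* shared by both threads, not protected by a lock */

 void thread1() {
   int temp1 = counter;     /* read the shared value */
   counter = temp1 + 1;     /* write back: not atomic with the read */
 }

 void thread2() {
   int temp2 = counter;     /* may read before thread1 has written */
   counter = temp2 + 1;
 }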
Need to use synchronization even if only
    time-slicing on a uniprocessor

Interleaving 1 (gives counter=2):
temp1=counter;
counter = temp1 + 1;
temp2=counter;
counter = temp2 + 1;

Interleaving 2 (gives counter=1):
temp1=counter;
temp2=counter;
counter = temp1 + 1;
counter = temp2 + 1;
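
One common way to rule out the bad interleaving is to guard the read-modify-write with a lock. The sketch below uses a POSIX mutex. The helper names (increment_many, run_two_incrementers) are made up for this example, and other mechanisms such as atomic operations would also work.

```c
#include <pthread.h>

static int counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Increments the shared counter n times; n is passed through arg. */
static void *increment_many(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&counter_lock);
        int temp = counter;      /* read ...                     */
        counter = temp + 1;      /* ... and write under the lock */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two threads that each increment n times. Returns the final
   counter value, or -1 if a thread could not be started. */
int run_two_incrementers(int n) {
    pthread_t t1, t2;
    counter = 0;
    if (pthread_create(&t1, NULL, increment_many, &n) != 0) {
        return -1;
    }
    if (pthread_create(&t2, NULL, increment_many, &n) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Because only one thread at a time can be between lock and unlock, no increment is lost, whether the threads are time-sliced on one core or run on separate cores.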
Assigning threads to the cores

• Each thread/process has an affinity mask

• Affinity mask specifies what cores the
  thread is allowed to run on

• Different threads can have different masks

• Affinities are inherited across fork()
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT

  mask bit:    1        1        0        1
  core:      core 3   core 2   core 1   core 0



 • Process/thread is allowed to run on
   cores 0,2,3, but not on core 1
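
Testing a bit of such a mask takes one shift and one AND. This is a small sketch with a hypothetical helper name, assuming bit i of the mask corresponds to core i as in the example above.

```c
#include <limits.h>
#include <stdbool.h>

/* True if bit `core` is set in the affinity mask, i.e. the
   process/thread is allowed to run on that core. */
bool core_allowed(unsigned long mask, unsigned core) {
    if (core >= sizeof(mask) * CHAR_BIT) {
        return false;   /* no such bit in the mask */
    }
    return ((mask >> core) & 1UL) != 0;
}
```

The example mask 1101 is 0xD in hexadecimal: core_allowed(0xD, 1) is false, while cores 0, 2, and 3 are allowed.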
Affinity masks when multi-core and
             SMT combined
• Separate bits for each simultaneous thread
• Example: 4-way multi-core, 2 threads per core
  mask bit:  1   1   0   0   1   0   1   1
  core:      3   3   2   2   1   1   0   0
  thread:    1   0   1   0   1   0   1   0

 • Core 2 can’t run the process
 • Core 1 can only use one simultaneous thread
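
The per-core reading of such a mask can be sketched as follows. The sketch assumes the bit layout of the example, where simultaneous thread t of core c is bit c * threads_per_core + t. Real systems may number their logical processors differently. The function name is hypothetical.

```c
#include <limits.h>

/* Number of simultaneous threads of `core` that the mask allows,
   assuming thread t of core c is bit (c * threads_per_core + t). */
unsigned threads_allowed_on_core(unsigned long mask, unsigned core,
                                 unsigned threads_per_core) {
    unsigned count = 0;
    for (unsigned t = 0; t < threads_per_core; t++) {
        unsigned bit = core * threads_per_core + t;
        if (bit < sizeof(mask) * CHAR_BIT && ((mask >> bit) & 1UL)) {
            count++;
        }
    }
    return count;
}
```

The example mask 11001011 is 0xCB: with 2 threads per core it allows both threads on cores 0 and 3, one thread on core 1, and none on core 2.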
Default Affinities
• Default affinity mask is all 1s:
  all threads can run on all processors

• Then, the OS scheduler decides what
  threads run on what core

• OS scheduler detects skewed workloads,
  migrating threads to less busy processors

Process migration is costly
• Need to restart the execution pipeline
• Cached data is invalidated
• OS scheduler tries to avoid migration as
  much as possible:
  it tends to keep a thread on the same core
• This is called soft affinity



Hard affinities

• The programmer can prescribe her own
  affinities (hard affinities)

• Rule of thumb: use the default scheduler
  unless there is a good reason not to



When to set your own affinities
• Two (or more) threads share data structures in
  memory
   – map them to the same core so that they can share the cache
• Real-time threads:
  Example: a thread running
  a robot controller:
  - must not be context switched,
    or else the robot can go unstable
  - dedicate an entire core just to this thread
[Figure source: Sensable.com]
Kernel scheduler API
#include <sched.h>
int sched_getaffinity(pid_t pid,
  unsigned int len, unsigned long * mask);

Retrieves the current affinity mask of process ‘pid’ and
   stores it into the space pointed to by ‘mask’.
‘len’ is the system word size: sizeof(unsigned long)
Kernel scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid,
   unsigned int len, unsigned long * mask);

Sets the affinity mask of process ‘pid’ to *mask
‘len’ is the system word size: sizeof(unsigned long)

To query affinity of a running process:
[barbic@bonito ~]$ taskset -p 3935
pid 3935's current affinity mask: f
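
Current versions of glibc declare these two calls with a cpu_set_t argument instead of a raw unsigned long, and provide CPU_* macros to manipulate the set. The following is a minimal, Linux-specific sketch under that assumption. The helper names count_allowed_cpus and pin_to_cpu are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes cpu_set_t and CPU_* only with this */
#endif
#include <sched.h>

/* Number of CPUs the calling thread may run on, or -1 on error.
   A pid of 0 means "the caller". */
int count_allowed_cpus(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    return CPU_COUNT(&set);
}

/* Restricts the caller to the single CPU `cpu` (a hard affinity).
   Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return -1;
    }
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

An affinity mask of f, as in the taskset output above, has four bits set, so count_allowed_cpus() would return 4 for that process.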


Windows Task Manager

[Figure: Windows Task Manager screenshot with core 1 and core 2 labeled]
Legal licensing issues
• Will software vendors charge a separate
  license for each core or only a single
  license per chip?

• Microsoft, Red Hat Linux, Suse Linux will
  license their OS per chip, not per core



Conclusion
• Multi-core chips are an
  important new trend in
  computer architecture

• Several new multi-core
  chips are in design phases

• Parallel programming techniques
  are likely to gain importance

Final draft intel core i5 processors architectureFinal draft intel core i5 processors architecture
Final draft intel core i5 processors architecture
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
 
fundamentals of digital communication Unit 5_microprocessor.pdf
fundamentals of digital communication Unit 5_microprocessor.pdffundamentals of digital communication Unit 5_microprocessor.pdf
fundamentals of digital communication Unit 5_microprocessor.pdf
 
multi-core Processor.ppt for IGCSE ICT and Computer Science Students
multi-core Processor.ppt for IGCSE ICT and Computer Science Studentsmulti-core Processor.ppt for IGCSE ICT and Computer Science Students
multi-core Processor.ppt for IGCSE ICT and Computer Science Students
 
Gpgpu intro
Gpgpu introGpgpu intro
Gpgpu intro
 
Extlect04
Extlect04Extlect04
Extlect04
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
 
Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architectures
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 Hardware
 
Gpu archi
Gpu archiGpu archi
Gpu archi
 
4 threads
4 threads4 threads
4 threads
 
Memory
MemoryMemory
Memory
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
 

Mehr von Piyush Mittal

Mehr von Piyush Mittal (20)

Power mock
Power mockPower mock
Power mock
 
Design pattern tutorial
Design pattern tutorialDesign pattern tutorial
Design pattern tutorial
 
Reflection
ReflectionReflection
Reflection
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Intel open mp
Intel open mpIntel open mp
Intel open mp
 
Intro to parallel computing
Intro to parallel computingIntro to parallel computing
Intro to parallel computing
 
Cuda toolkit reference manual
Cuda toolkit reference manualCuda toolkit reference manual
Cuda toolkit reference manual
 
Matrix multiplication using CUDA
Matrix multiplication using CUDAMatrix multiplication using CUDA
Matrix multiplication using CUDA
 
Channel coding
Channel codingChannel coding
Channel coding
 
Basics of Coding Theory
Basics of Coding TheoryBasics of Coding Theory
Basics of Coding Theory
 
Java cheat sheet
Java cheat sheetJava cheat sheet
Java cheat sheet
 
Google app engine cheat sheet
Google app engine cheat sheetGoogle app engine cheat sheet
Google app engine cheat sheet
 
Git cheat sheet
Git cheat sheetGit cheat sheet
Git cheat sheet
 
Vi cheat sheet
Vi cheat sheetVi cheat sheet
Vi cheat sheet
 
Css cheat sheet
Css cheat sheetCss cheat sheet
Css cheat sheet
 
Cpp cheat sheet
Cpp cheat sheetCpp cheat sheet
Cpp cheat sheet
 
Ubuntu cheat sheet
Ubuntu cheat sheetUbuntu cheat sheet
Ubuntu cheat sheet
 
Php cheat sheet
Php cheat sheetPhp cheat sheet
Php cheat sheet
 
oracle 9i cheat sheet
oracle 9i cheat sheetoracle 9i cheat sheet
oracle 9i cheat sheet
 
Open ssh cheet sheat
Open ssh cheet sheatOpen ssh cheet sheat
Open ssh cheet sheat
 

Kürzlich hochgeladen

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Multi core-architecture

  • 3. Single-core CPU chip (Diagram: a CPU chip with the single core labeled)
  • 4. Multi-core architectures • This lecture is about a new trend in computer architecture: replicate multiple processor cores on a single die. (Diagram: multi-core CPU chip with Core 1, Core 2, Core 3, Core 4)
  • 5. Multi-core CPU chip • The cores fit on a single processor socket • Also called CMP (Chip Multi-Processor) (Diagram: cores 1 to 4 on one chip)
  • 6. The cores run in parallel (Diagram: thread 1 to thread 4, one running on each of cores 1 to 4)
  • 7. Within each core, threads are time-sliced (just like on a uniprocessor) (Diagram: several threads on each of cores 1 to 4)
  • 8. Interaction with the Operating System • OS perceives each core as a separate processor • OS scheduler maps threads/processes to different cores • Most major operating systems support multi-core today: Windows, Linux, Mac OS X, …
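Because the OS presents each core as a separate processor, a program can simply ask how many processors are online. The following is a minimal sketch, assuming a Linux/glibc-style system: `_SC_NPROCESSORS_ONLN` is a widely available extension rather than part of POSIX, and `online_cores` is a name chosen here for illustration. On SMT-enabled chips the count normally includes each simultaneous thread as its own logical processor.

```c
#include <unistd.h>

/* Number of processors the OS currently reports as online.
   Returns -1 if the value cannot be determined.
   Assumption: _SC_NPROCESSORS_ONLN is supported (true on Linux/glibc). */
long online_cores(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

Calling `online_cores()` from `main` and printing the result is enough to see how many processors the scheduler has to work with.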
  • 9. Why multi-core? • Difficult to make single-core clock frequencies even higher • Deeply pipelined circuits: – heat problems – speed of light problems – difficult design and verification – large design teams necessary – server farms need expensive air-conditioning • Many new applications are multithreaded • General trend in computer architecture (shift towards more parallelism)
  • 10. Instruction-level parallelism • Parallelism at the machine-instruction level • The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
  • 11. Thread-level parallelism (TLP) • This is parallelism on a coarser scale • A server can serve each client in a separate thread (Web server, database server) • A computer game can do AI, graphics, and physics in three separate threads • Single-core superscalar processors cannot fully exploit TLP • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
  • 12. General context: Multiprocessors • A multiprocessor is any computer with several processors • SIMD – Single instruction, multiple data – Modern graphics cards • MIMD – Multiple instructions, multiple data (Photo: Lemieux cluster, Pittsburgh Supercomputing Center)
  • 13. Multiprocessor memory types • Shared memory: In this model, there is one (large) common shared memory for all processors • Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
  • 14. A multi-core processor is a special kind of multiprocessor: All processors are on the same chip • Multi-core processors are MIMD: Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data). • Multi-core is a shared memory multiprocessor: All cores share the same memory
  • 15. What applications benefit from multi-core? • Database servers • Web servers (Web commerce) • Compilers • Multimedia applications • Scientific applications, CAD/CAM • In general, applications with thread-level parallelism (as opposed to instruction-level parallelism) (Annotation on the list: each can run on its own core)
  • 16. More examples • Editing a photo while recording a TV show through a digital video recorder • Downloading software while running an anti-virus program • “Anything that can be threaded today will map efficiently to multi-core” • BUT: some applications are difficult to parallelize
  • 17. A technique complementary to multi-core: Simultaneous multithreading • Problem addressed: The processor pipeline can get stalled: – Waiting for the result of a long floating point (or integer) operation – Waiting for data to arrive from memory • Other execution units wait unused (Diagram of the processor pipeline: BTB and I-TLB, Decoder, Trace Cache, uCode ROM, BTB, Rename/Alloc, Uop queues, Schedulers, Integer, Floating Point, L1 D-Cache, D-TLB, L2 Cache and Control, Bus. Source: Intel)
  • 18. Simultaneous multithreading (SMT) • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core • Weaving together multiple “threads” on the same core • Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
  • 19. Without SMT, only a single thread can run at any given time (Pipeline diagram: Thread 1, a floating point operation, occupies the core)
  • 20. Without SMT, only a single thread can run at any given time (Pipeline diagram: Thread 2, an integer operation, occupies the core)
  • 21. SMT processor: both threads can run concurrently (Pipeline diagram: Thread 1, floating point, and Thread 2, integer operation, on the same core)
  • 22. But: Can’t simultaneously use the same functional unit (Pipeline diagram: Thread 1 and Thread 2 both on the integer unit, marked IMPOSSIBLE. This scenario is impossible with SMT on a single core, assuming a single integer unit)
  • 23. SMT not a “true” parallel processor • Enables better threading (e.g. up to 30%) • OS and applications perceive each simultaneous thread as a separate “virtual processor” • The chip has only a single copy of each resource • Compare to multi-core: each core has its own copy of resources
  • 24. Multi-core: threads can run on separate cores (Diagram: two cores, each with its own pipeline; Thread 1 runs on one core and Thread 2 on the other)
  • 25. Multi-core: threads can run on separate cores (Diagram: two cores, each with its own pipeline; Thread 3 runs on one core and Thread 4 on the other)
  • 26. Combining Multi-core and SMT • Cores can be SMT-enabled (or not) • The different combinations: – Single-core, non-SMT: standard uniprocessor – Single-core, with SMT – Multi-core, non-SMT – Multi-core, with SMT: our fish machines • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads • Intel calls them “hyper-threads”
  • 27. SMT Dual-core: all four threads can run concurrently (Diagram: two SMT-enabled cores running Threads 1 to 4, two threads per core)
  • 28. Comparison: multi-core vs SMT • Advantages/disadvantages?
  • 29. Comparison: multi-core vs SMT • Multi-core: – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture) – However, great with thread-level parallelism • SMT – Can have one large and fast superscalar core – Great performance on a single thread – Mostly still only exploits instruction-level parallelism
  • 30. The memory hierarchy • If simultaneous multithreading only: – all caches shared • Multi-core chips: – L1 caches private – L2 caches private in some architectures and shared in others • Memory is always shared
  • 31. “Fish” machines • Dual-core Intel Xeon processors • Each core is hyper-threaded • Private L1 caches • Shared L2 caches (Diagram: CORE0 and CORE1, each with hyper-threads and its own L1 cache, above a shared L2 cache and memory)
  • 32. Designs with private L2 caches • Both L1 and L2 are private: CORE0 and CORE1 each have their own L1 and L2 cache above memory. Examples: AMD Opteron, AMD Athlon, Intel Pentium D • A design with L3 caches: CORE0 and CORE1 each have their own L1, L2, and L3 cache above memory. Example: Intel Itanium 2
  • 33. Private vs shared caches? • Advantages/disadvantages?
  • 34. Private vs shared caches • Advantages of private: – They are closer to core, so faster access – Reduces contention • Advantages of shared: – Threads on different cores can share the same cache data – More cache space available if a single (or a few) high-performance thread runs on the system
  • 35. The cache coherence problem • Since we have private caches: How to keep the data consistent across caches? • Each core should perceive the memory as a monolithic array, shared by all the cores
  • 36. The cache coherence problem • Suppose variable x initially contains 15213 (Diagram: multi-core chip with Cores 1 to 4, each with one or more levels of cache; main memory holds x=15213)
  • 37. The cache coherence problem • Core 1 reads x (Diagram: Core 1 cache holds x=15213; main memory x=15213)
  • 38. The cache coherence problem • Core 2 reads x (Diagram: Core 1 cache x=15213, Core 2 cache x=15213; main memory x=15213)
  • 39. The cache coherence problem • Core 1 writes to x, setting it to 21660 (Diagram: Core 1 cache x=21660, Core 2 cache x=15213; main memory x=21660, assuming write-through caches)
  • 40. The cache coherence problem • Core 2 attempts to read x… gets a stale copy (Diagram: Core 1 cache x=21660, Core 2 cache x=15213; main memory x=21660)
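The stale read in slides 36 to 40 can be replayed with a toy model. This is only a sketch under stated assumptions: a single variable x, one private cache entry per core, write-through caches, and no coherence protocol at all. The type and function names (`Machine`, `core_read`, `core_write_incoherent`) are hypothetical, and cores are indexed from 0, so the slides' Core 1 is index 0.

```c
#include <stdbool.h>

#define NUM_CORES 4

/* One cached copy of x per core; 'valid' says whether the core holds a copy. */
typedef struct {
    bool valid;
    int value;
} CacheLine;

typedef struct {
    int memory_x;                /* x in main memory */
    CacheLine cache[NUM_CORES];  /* private cache entry for x, per core */
} Machine;

void machine_init(Machine *m, int initial_x) {
    m->memory_x = initial_x;
    for (int i = 0; i < NUM_CORES; i++) {
        m->cache[i].valid = false;
        m->cache[i].value = 0;
    }
}

/* Read x on a core: hit in the private cache if present, else load from memory. */
int core_read(Machine *m, int core) {
    if (!m->cache[core].valid) {
        m->cache[core].value = m->memory_x;
        m->cache[core].valid = true;
    }
    return m->cache[core].value;
}

/* Write-through write with NO coherence protocol: only the writer's cache and
   main memory are updated; copies in other caches are left untouched. */
void core_write_incoherent(Machine *m, int core, int value) {
    m->cache[core].value = value;
    m->cache[core].valid = true;
    m->memory_x = value;
}
```

Initializing x to 15213, reading it on cores 0 and 1, and then writing 21660 on core 0 leaves main memory at 21660 while a further read on core 1 still hits its cached 15213.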
  • 41. Solutions for cache coherence • This is a general problem with multiprocessors, not limited just to multi-core • There exist many solution algorithms, coherence protocols, etc. • A simple solution: invalidation-based protocol with snooping
  • 42. Inter-core bus (Diagram: multi-core chip with Cores 1 to 4, each with one or more levels of cache, connected by an inter-core bus; main memory below)
  • 43. Invalidation protocol with snooping • Invalidation: If a core writes to a data item, all other copies of this data item in other caches are invalidated • Snooping: All cores continuously “snoop” (monitor) the bus connecting the cores.
  • 44. The cache coherence problem • Revisited: Cores 1 and 2 have both read x (Diagram: Core 1 cache x=15213, Core 2 cache x=15213; main memory x=15213)
  • 45. The cache coherence problem • Core 1 writes to x, setting it to 21660, and sends an invalidation request over the inter-core bus (Diagram: Core 1 cache x=21660; Core 2's copy x=15213 is INVALIDATED; main memory x=21660, assuming write-through caches)
  • 46. The cache coherence problem • After invalidation (Diagram: Core 1 cache x=21660; Core 2 cache holds no copy of x; main memory x=21660)
  • 47. The cache coherence problem • Core 2 reads x. Cache misses, and loads the new copy. (Diagram: Core 1 cache x=21660, Core 2 cache x=21660; main memory x=21660)
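The invalidation sequence in slides 44 to 47 can be sketched the same way. This is an illustrative model, not a description of any real chip: it assumes a single variable x, write-through caches, and a bus on which every other core sees each write. All names (`SnoopMachine`, `snoop_read`, `snoop_write`) are hypothetical, and cores are indexed from 0.

```c
#include <stdbool.h>

#define SNOOP_CORES 4

typedef struct {
    bool valid;
    int value;
} SnoopLine;

typedef struct {
    int memory_x;                  /* x in main memory */
    SnoopLine cache[SNOOP_CORES];  /* private cache entry for x, per core */
    int misses;                    /* reads that had to go to main memory */
} SnoopMachine;

void snoop_init(SnoopMachine *m, int initial_x) {
    m->memory_x = initial_x;
    m->misses = 0;
    for (int i = 0; i < SNOOP_CORES; i++) {
        m->cache[i].valid = false;
        m->cache[i].value = 0;
    }
}

/* Read x: a valid private copy is a hit; otherwise miss and reload from memory. */
int snoop_read(SnoopMachine *m, int core) {
    if (!m->cache[core].valid) {
        m->misses++;
        m->cache[core].value = m->memory_x;
        m->cache[core].valid = true;
    }
    return m->cache[core].value;
}

/* Write x: every other core snoops the invalidation request on the bus and
   drops its copy; the writer updates its cache and (write-through) memory. */
void snoop_write(SnoopMachine *m, int core, int value) {
    for (int i = 0; i < SNOOP_CORES; i++) {
        if (i != core) {
            m->cache[i].valid = false;
        }
    }
    m->cache[core].value = value;
    m->cache[core].valid = true;
    m->memory_x = value;
}
```

After the write on core 0, the next read on core 1 misses and returns 21660 instead of the stale 15213, which is the behavior the protocol is meant to guarantee.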
  • 48. Alternative to invalidate protocol: update protocol • Core 1 writes x=21660 and broadcasts the updated value over the inter-core bus (Diagram: Core 1 cache x=21660; Core 2's copy is UPDATED to x=21660; main memory x=21660, assuming write-through caches)
  • 49. Which do you think is better? Invalidation or update?
  • 50. Invalidation vs update • Multiple writes to the same location – invalidation: only the first time – update: must broadcast each write (which includes new variable value) • Invalidation generally performs better: it generates less bus traffic
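The traffic argument can be made concrete with a small count. This sketch makes strong simplifying assumptions: one core writes the same location several times in a row, other caches hold a copy at the start, no other core reads the location in between, and only coherence broadcasts are counted (not the write-through traffic to memory). The function names are hypothetical.

```c
/* Coherence broadcasts under an invalidation protocol for `writes`
   back-to-back writes by one core: only the first write finds copies to
   invalidate. */
unsigned bus_messages_invalidate(unsigned writes) {
    return writes > 0 ? 1u : 0u;
}

/* Coherence broadcasts under an update protocol: every write must be
   broadcast together with the new value. */
unsigned bus_messages_update(unsigned writes) {
    return writes;
}
```

For 100 consecutive writes this gives 1 broadcast under invalidation and 100 under update. The gap narrows if other cores re-read the location between writes, since each re-read recreates a copy that the next write must invalidate.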
  • 51. Invalidation protocols • This was just the basic invalidation protocol • More sophisticated protocols use extra cache state bits • MSI, MESI (Modified, Exclusive, Shared, Invalid)
  • 52. Programming for multi-core • Programmers must use threads or processes • Spread the workload across multiple cores • Write parallel algorithms • OS will map threads/processes to cores
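As one possible illustration of spreading a workload across cores, the sketch below sums an array with POSIX threads. It is a minimal example, not a tuned implementation: the worker count of 4, the names `SumTask`, `sum_worker`, and `parallel_sum`, and the fallback to the calling thread when a thread cannot be created are all choices made here. Each worker writes only its own partial result, so no locking is needed, and the OS decides which core each thread runs on.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4

typedef struct {
    const int *data;
    size_t begin, end;  /* half-open range [begin, end) handled by this worker */
    long partial;       /* written only by the owning worker */
} SumTask;

static void *sum_worker(void *arg) {
    SumTask *t = arg;
    long s = 0;
    for (size_t i = t->begin; i < t->end; i++) {
        s += t->data[i];
    }
    t->partial = s;
    return NULL;
}

/* Sum data[0..n) using up to NUM_WORKERS threads. */
long parallel_sum(const int *data, size_t n) {
    pthread_t threads[NUM_WORKERS];
    SumTask tasks[NUM_WORKERS];
    int started[NUM_WORKERS];
    size_t chunk = (n + NUM_WORKERS - 1) / NUM_WORKERS;
    long total = 0;

    for (int w = 0; w < NUM_WORKERS; w++) {
        size_t begin = (size_t)w * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n;
        if (end > n) end = n;
        tasks[w] = (SumTask){ data, begin, end, 0 };
        started[w] = (pthread_create(&threads[w], NULL, sum_worker, &tasks[w]) == 0);
        if (!started[w]) {
            sum_worker(&tasks[w]);  /* fall back to the calling thread */
        }
    }
    for (int w = 0; w < NUM_WORKERS; w++) {
        if (started[w]) {
            pthread_join(threads[w], NULL);
        }
        total += tasks[w].partial;
    }
    return total;
}
```

On older toolchains the program may need to be built with `-pthread`.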
  • 53. Thread safety very important • Pre-emptive context switching: context switch can happen AT ANY TIME • True concurrency, not just uniprocessor time-slicing • Concurrency bugs exposed much faster with multi-core
  • 54. However: Need to use synchronization even if only time-slicing on a uniprocessor • int counter=0; • void thread1() { int temp1=counter; counter = temp1 + 1; } • void thread2() { int temp2=counter; counter = temp2 + 1; }
  • 55. Need to use synchronization even if only time-slicing on a uniprocessor • Sequential order: temp1=counter; counter = temp1 + 1; temp2=counter; counter = temp2 + 1; gives counter=2 • Interleaved order: temp1=counter; temp2=counter; counter = temp1 + 1; counter = temp2 + 1; gives counter=1
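One common way to remove the lost update is to make the read-modify-write a critical section. The sketch below uses a POSIX mutex; it is one possible fix, not necessarily the one the lecture had in mind, and the names `increment_many` and `run_two_incrementers` are hypothetical. Each thread repeats the slide's read and write-back steps n times, but only while holding the lock.

```c
#include <pthread.h>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Increment the shared counter n times, holding the lock for each
   read-modify-write so the two steps cannot interleave with another thread. */
static void *increment_many(void *arg) {
    long n = *(const long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&counter_lock);
        long temp = counter;   /* read */
        counter = temp + 1;    /* write back */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run two threads that each increment the counter n times.
   Returns the final counter value, or -1 if a thread could not be started. */
long run_two_incrementers(long n) {
    pthread_t t1, t2;
    counter = 0;
    if (pthread_create(&t1, NULL, increment_many, &n) != 0) {
        return -1;
    }
    if (pthread_create(&t2, NULL, increment_many, &n) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

With the lock in place the result is always 2n. Removing the lock and unlock calls reintroduces the interleaving from slide 55, and on a multi-core machine the final value then typically falls short of 2n.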
  • 56. Assigning threads to the cores • Each thread/process has an affinity mask • Affinity mask specifies what cores the thread is allowed to run on • Different threads can have different masks • Affinities are inherited across fork()
  • 57. Affinity masks are bit vectors • Example: 4-way multi-core, without SMT: mask 1 1 0 1, read left to right as core 3, core 2, core 1, core 0 • Process/thread is allowed to run on cores 0, 2, 3, but not on core 1
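Testing a bit in such a mask is a shift and a bitwise AND. The helper below is a small sketch with a hypothetical name, `core_allowed`; it assumes the slide's layout, in which core 0 is the least significant bit, and a core number smaller than the number of bits in `unsigned long`.

```c
#include <stdbool.h>

/* True if bit `core` of the affinity mask is set, i.e. the thread may run on
   that core. Assumes core < number of bits in unsigned long. */
bool core_allowed(unsigned long mask, unsigned core) {
    return ((mask >> core) & 1UL) != 0;
}
```

The slide's mask 1101 is 0xD in hexadecimal, so `core_allowed(0xD, 1)` is false while cores 0, 2, and 3 are allowed.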
  • 58. Affinity masks when multi-core and SMT combined • Separate bits for each simultaneous thread • Example: 4-way multi-core, 2 threads per core: mask 1 1 0 0 1 0 1 1, read left to right as core 3 (thread 1, thread 0), core 2 (thread 1, thread 0), core 1 (thread 1, thread 0), core 0 (thread 1, thread 0) • Core 2 can’t run the process • Core 1 can only use one simultaneous thread
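The combined layout can be decoded with the same bit tests. This sketch assumes the numbering shown on the slide, where thread 0 of core 0 is the least significant bit and each core owns two adjacent bits; real systems may number logical processors differently. The helper names are hypothetical.

```c
#include <stdbool.h>

#define THREADS_PER_CORE 2u

/* Bit position of a given simultaneous thread on a given core
   (assumed layout: bit = core * THREADS_PER_CORE + hw_thread). */
unsigned smt_bit_index(unsigned core, unsigned hw_thread) {
    return core * THREADS_PER_CORE + hw_thread;
}

/* True if the mask allows the process on that simultaneous thread. */
bool smt_thread_allowed(unsigned long mask, unsigned core, unsigned hw_thread) {
    return ((mask >> smt_bit_index(core, hw_thread)) & 1UL) != 0;
}

/* How many simultaneous threads of `core` the mask allows. */
unsigned smt_usable_threads(unsigned long mask, unsigned core) {
    unsigned count = 0;
    for (unsigned t = 0; t < THREADS_PER_CORE; t++) {
        if (smt_thread_allowed(mask, core, t)) {
            count++;
        }
    }
    return count;
}
```

The slide's mask 11001011 is 0xCB: core 2 has no usable threads, core 1 has one (its thread 1), and cores 0 and 3 have both.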
  • 59. Default Affinities • Default affinity mask is all 1s: all threads can run on all processors • Then, the OS scheduler decides what threads run on what core • OS scheduler detects skewed workloads, migrating threads to less busy processors
  • 60. Process migration is costly • Need to restart the execution pipeline • Cached data is invalidated • OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core • This is called soft affinity
  • 61. Hard affinities • The programmer can prescribe her own affinities (hard affinities) • Rule of thumb: use the default scheduler unless there is a good reason not to
  • 62. When to set your own affinities • Two (or more) threads share data structures in memory – map them to the same core so that they can share the cache • Real-time threads: Example: a thread running a robot controller: – must not be context switched, or else the robot can go unstable – dedicate an entire core just to this thread (Image source: Sensable.com)
  • 63. Kernel scheduler API #include <sched.h> int sched_getaffinity(pid_t pid, unsigned int len, unsigned long * mask); Retrieves the current affinity mask of process ‘pid’ and stores it into the space pointed to by ‘mask’. ‘len’ is the system word size: sizeof(unsigned long)
  • 64. Kernel scheduler API #include <sched.h> int sched_setaffinity(pid_t pid, unsigned int len, unsigned long * mask); Sets the current affinity mask of process ‘pid’ to *mask. ‘len’ is the system word size: sizeof(unsigned long). To query the affinity of a running process: [barbic@bonito ~]$ taskset -p 3935 (output: pid 3935's current affinity mask: f)
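The prototypes on slides 63 and 64 take a raw `unsigned long` mask. Current glibc instead declares both calls with a `size_t cpusetsize` argument and a `cpu_set_t *` mask that is manipulated through the `CPU_ZERO`, `CPU_SET`, `CPU_ISSET`, and `CPU_COUNT` macros. The sketch below uses that interface and assumes Linux with glibc; the helper names are hypothetical. Passing 0 as the pid refers to the calling thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes cpu_set_t and the CPU_* macros in <sched.h> */
#endif
#include <sched.h>

/* Lowest-numbered CPU the calling thread may currently run on, or -1 on error. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
int allowed_cpu_count(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    return CPU_COUNT(&set);
}

/* Restrict the calling thread to a single CPU (a hard affinity).
   Returns 0 on success, -1 on error (errno is set by the system call). */
int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A cautious caller first picks a CPU that is already in its mask, for example `pin_self_to_cpu(first_allowed_cpu())`, because the call fails if the requested CPU is not available to the process.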
  • 65. Windows Task Manager (Screenshot with core 1 and core 2 labeled)
  • 66. Legal licensing issues • Will software vendors charge a separate license per core or only a single license per chip? • Microsoft, Red Hat Linux, Suse Linux will license their OS per chip, not per core
  • 67. Conclusion • Multi-core chips are an important new trend in computer architecture • Several new multi-core chips are in design phases • Parallel programming techniques are likely to gain importance