1
MultiCore Processors
Dr. Smruti Ranjan Sarangi
IIT Delhi
2
Part I - Multicore Architecture
Part II - Multicore Design
Part III - Multicore Examples
3
Part I - Multicore
Architecture
4

Moore's Law and Transistor Scaling

Overview of Multicore Processors

Cache Coherence

Memory Consistency
Outline
5
Moore's Law

Intel's co-founder Gordon Moore predicted in
1965 that the number of transistors on a chip
would double every year
− Reality today: doubles once every 2 years

How many transistors do we have today per
chip?
− Approx 1 billion

2014 – 2 billion

2016 – 4 billion
6
Why do Transistors
Double?

The feature size shrinks by a factor of √2 every 2
years

Currently it is 32 nm

32 nm → 22 nm → 14 nm → ...
Feature Size
source: wikipedia
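The doubling follows from simple geometry. A back-of-the-envelope sketch (the function name and the 1-billion starting count are illustrative assumptions, not from the slides): shrinking the feature size by √2 halves the area of each transistor, so twice as many fit on the same die.

```python
import math

def transistors_after_shrink(count, shrink_factor=math.sqrt(2)):
    """Transistor count after one process shrink, on the same die area.

    Linear dimensions shrink by `shrink_factor`, so area per transistor
    shrinks by shrink_factor**2 = 2, doubling the count.
    """
    return count * shrink_factor ** 2

print(transistors_after_shrink(1_000_000_000))  # ~2 billion after one generation
```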
7
Number of transistors per chip over time
Source: Yaseen et al.
8
Source: Yaseen et al.
9
How to Use the Extra
Transistors

Traditional
− Complicate the processor

Issue width, ROB size, aggressive speculation
− Increase the cache sizes

L1, L2, L3

Modern (post 2005)
− Increase the number of cores per chip
− Increase the number of threads per core
10
What was Wrong with the Traditional
Approach?

Law of diminishing returns
− Performance does not scale well with the increase in CPU
resources
− Delay is proportional to the size of a CPU structure
− Sometimes it is proportional to the square of the size
− Hence, there was no significant difference between a 4-
issue and an 8-issue processor
− Extra cache capacity also had limited benefit because the working
set of a program already fits in the cache

Wire Delay
− It takes tens of cycles for a signal to propagate from one end
of a chip to the other. Wire delays decrease the
advantage of pipelining.
11
Power & Temperature
Source: Intel Microprocessor Report
12
Power &Temperature - II

High power consumption
− Increases cooling costs
− Increases the power bill
Source: O'Reilly Radar
13
Intel's Solution: Have Many
Simple Cores

Some major enhancements to processors
− Pentium – 2-issue in-order pipeline (5 stages)
− Pentium Pro – out-of-order pipeline (7 stages)
− Pentium 4 – aggressive out-of-order pipeline
(18 stages) + trace cache
− Pentium Northwood – 27-stage out-of-order
pipeline
− Pentium M – 12-stage out-of-order pipeline
− Intel Core2 Duo – 2 Pentium M cores
− Intel QuadCore – 4 Pentium M cores
14
Evolution of process nodes (die photos): 45 nm → 22 nm
Source: chipworks.com
15

Moore's Law and Transistor Scaling

Overview of Multicore Processors

Cache Coherence

Memory Consistency
Outline
16
What is a Multicore
Processor?
Core
Cache
Core
Cache
Core
Cache
Core
Cache
L2 Cache

Multiple processing cores, with private caches

Large shared L2 or L3 caches

Complex interconnection network

This is also called a symmetric multicore processor
17
Advantages of
Multicores

The power consumption per core is
limited: it decreases by about 30% per
generation.

The design has lower complexity
− easier to design, debug, and verify

The performance per core increases by
about 30% per generation

The number of cores doubles every
generation.
18
Multicores
Power Complexity
Performance
per Core
Parallelism
19
Issues in Multicores
We have so many cores ...
How to use them?
ANSWER
1) We need to write effective and efficient parallel programs
2) The hardware has to facilitate this endeavor
Parallel processing is the biggest advantage
20
Parallel Programs
Sequential:
for (i=0;i<n;i++)
    c[i] = a[i] * b[i]

Parallel:
parallel_for(i=0;i<n;i++)
    c[i] = a[i] * b[i]
What if ???
parallel_for(i=0;i<n;i++)
c[d[i]] = a[i] * b[i]
Is the loop still parallel?
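It is not: if d contains duplicate entries, two iterations write to the same c[d[i]] and race. A minimal sketch of the check a compiler or runtime would need (the function name is a hypothetical illustration):

```python
def is_parallelizable(d):
    """True if all write targets c[d[i]] are distinct,
    i.e., no two iterations of the loop conflict."""
    return len(set(d)) == len(d)

print(is_parallelizable([0, 1, 2, 3]))  # True: distinct targets, loop is parallel
print(is_parallelizable([0, 1, 1, 3]))  # False: iterations 1 and 2 both write c[1]
```

Since d is generally known only at run time, a compiler cannot prove this statically, which is why such loops defeat automatic parallelization.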
21
1st Challenge: Finding
Parallelism

To run parallel programs efficiently on multi-
cores we need to:
− discover parallelism in programs
− re-organize code to expose more
parallelism

Discovering parallelism : automatic approaches
− Automatic compiler analysis
− Profiling
− Hardware based methods
22
Finding Parallelism- II

Discovering parallelism: manually
− Write explicitly parallel algorithms
− Restructure loops
Sequential:
for (i=0;i<n-1;i++)
    a[i] = a[i] * a[i+1]

Parallel:
a_copy = copy(a)
parallel_for(i=0;i<n-1;i++)
    a[i] = a_copy[i] * a_copy[i+1]
Example
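The restructuring above can be checked directly: the copy removes the read of a[i+1] from the array being written, making every iteration independent while preserving the result. A small Python sketch (function names are illustrative):

```python
def seq_version(a):
    """The original loop: a[i] = a[i] * a[i+1]."""
    a = list(a)
    for i in range(len(a) - 1):
        a[i] = a[i] * a[i + 1]
    return a

def restructured(a):
    """The parallel-friendly version: read only from a snapshot."""
    a = list(a)
    a_copy = list(a)             # snapshot removes the cross-iteration read
    for i in range(len(a) - 1):  # each iteration is now independent
        a[i] = a_copy[i] * a_copy[i + 1]
    return a

print(seq_version([1, 2, 3, 4]))   # [2, 6, 12, 4]
print(restructured([1, 2, 3, 4]))  # same result, but safe for parallel_for
```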
23
Tradeoffs

Compiler Approach
− Not very extensive
− Slow
− Limited utility
Manual approach
− Very difficult
− Much better results
− Broad utility
Given good software level approaches, how can hardware help ?
24
Models of Parallel
Programs

Shared memory
− All the threads see the same view of
memory (same address space)
− Hard to implement but easy to
program
− Default (used by almost all multicores)

Message passing
− Each thread has its private address
space
− Simpler to implement, but harder to program
− Research prototypes
25
Shared Memory
Multicores

What is shared memory?
Core
Cache
Core
Cache
x=19
read x
26
How does it help?

Programmers need not bother about low-level
details.

The model is immune to process/thread
migration

The same data can be accessed/modified
from any core

Programs can be ported across architectures
very easily, often without any modification
27

Moore's Law and Transistor Scaling

Overview of Multicore Processors

Cache Coherence

Memory Consistency
Outline
28
What are the Pitfalls?
Example 1
Is the outcome (r1=1, r2=2) (r3=2, r4=1) feasible?
Thread 1 Thread 2 Thread 3
x = 1
x = 0
r1 = x
r2 = x
r3 = x
r4 = x
x = 2
Does it make sense?
29
Example 1 contd...

Order gets reversed
Thread 1 Thread 2 Thread 3
Inter-core network
x=1  x=2
30
Example 2
When should a write from one processor be visible
to other processors?
Is the outcome (r1 = 0, r3 = 0) feasible?
Thread 1 Thread 2 Thread 3
x = 1
x = 0
r1 = x r3 = x
Does it make sense?
31
Point to Note

Memory accesses can be reordered by
the memory system.

The memory system is like a real world
network.

It suffers from bottlenecks, delays, hot
spots, and sometimes can drop a packet
32
How should
Applications behave?

It should be possible for programmers to
write programs that make sense

Programs should behave the same way
across different architectures
Cache Coherence
Axiom 1: A write is ultimately visible.
Axiom 2: Writes to the same location are seen in the same
order.
33
Example Protocol

The following set of conditions satisfies
Axioms 1 and 2
Claim
Condition 1: Every write completes in a finite amount of time
Condition 2: At any point of time, a given cache block is either being
read by multiple readers, or being written by just one writer
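Condition 2 is the single-writer / multiple-reader invariant that MSI-style coherence protocols maintain per cache block. A minimal sketch of that invariant (the class and its transition rules are simplified assumptions, not a full protocol):

```python
class BlockState:
    """Per-block sharing state: a set of readers, at most one writer."""
    def __init__(self):
        self.readers = set()
        self.writer = None

    def acquire_read(self, core):
        if self.writer is not None and self.writer != core:
            self.writer = None          # invalidate the writer's exclusive copy
        self.readers.add(core)

    def acquire_write(self, core):
        self.readers = {core}           # invalidate all other sharers
        self.writer = core

    def invariant_holds(self):
        # Either no writer (any number of readers), or exactly one writer
        # that is also the only core holding the block.
        return self.writer is None or self.readers == {self.writer}

b = BlockState()
b.acquire_read(0); b.acquire_read(1)    # two concurrent readers: fine
assert b.invariant_holds()
b.acquire_write(2)                      # core 2 writes: others invalidated
assert b.invariant_holds() and b.writer == 2
```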
34
Practical
Implementations

Snoopy Coherence
− Maintain some state for every cache block
− Elaborate protocol for state transitions
− Suitable for CMPs with fewer than 16 cores
− Requires a shared bus

Directory Coherence
− Elaborate state transition logic
− Distributed protocol relying on messages
− Suitable for CMPs with more than 16 cores
35
Implications of Cache
Coherence

It is not possible to drop packets. All the
threads/cores should perceive 100%
reliability

Memory system cannot reorder requests or
responses to the same address

How to enforce cache coherence
− Use Snoopy or Directory protocols
− Reference: Hennessy & Patterson, Part 2
36

Moore's Law and Transistor Scaling

Overview of Multicore Processors

Cache Coherence

Memory Consistency
Outline
37
Reordering Memory
Requests in General

Answer: depends on the memory consistency model ...
Is the outcome (r1=0, r2=0) feasible?
Thread 1 Thread 2
x = 0 , y = 0
y = 1
r2 = x
x = 1
r1 = y
Does it make sense?
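Whether (r1=0, r2=0) "makes sense" can be tested by brute force: under sequential consistency the hardware may only interleave the two threads while keeping each thread's own program order. A sketch that enumerates every such interleaving (thread labels T1/T2 are illustrative):

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving; return (r1, r2)."""
    mem = {'x': 0, 'y': 0}
    regs = {}
    ops = {
        ('T1', 0): lambda: mem.__setitem__('y', 1),   # Thread 1: y = 1
        ('T1', 1): lambda: regs.__setitem__('r2', mem['x']),  # r2 = x
        ('T2', 0): lambda: mem.__setitem__('x', 1),   # Thread 2: x = 1
        ('T2', 1): lambda: regs.__setitem__('r1', mem['y']),  # r1 = y
    }
    for op in schedule:
        ops[op]()
    return regs['r1'], regs['r2']

events = [('T1', 0), ('T1', 1), ('T2', 0), ('T2', 1)]
# Keep only interleavings that preserve each thread's program order.
valid = [p for p in permutations(events)
         if p.index(('T1', 0)) < p.index(('T1', 1))
         and p.index(('T2', 0)) < p.index(('T2', 1))]
outcomes = {run(s) for s in valid}
print(outcomes)                  # (0, 0) never appears under SC
```

So under sequential consistency the answer is no; relaxed models that reorder a write with a later read can produce (0, 0).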
38
Reordering Memory Requests
Sources
Network on Chip1
Write Buffer2
Out-of-order Pipeline3
39
Write Buffer

Write buffers can break W → R ordering
Processor 1: x=1 ; r1=y    Processor 2: y=1 ; r2=x
Each processor's store first sits in its private write buffer (WB)
before draining to the cache subsystem. If both loads execute before
either buffer drains, the outcome is r1=0, r2=0.
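A toy model of this figure (the classes and the single shared dict standing in for the cache subsystem are simplifying assumptions): stores go into a per-core buffer, loads check only the local buffer before the cache, so each core misses the other's buffered store.

```python
class Core:
    def __init__(self, cache):
        self.cache = cache
        self.buffer = {}            # pending stores, invisible to other cores

    def store(self, addr, val):
        self.buffer[addr] = val     # buffered, not yet in the cache

    def load(self, addr):
        # Local buffer is checked first (store forwarding), then the cache.
        return self.buffer.get(addr, self.cache.get(addr, 0))

    def drain(self):
        self.cache.update(self.buffer)
        self.buffer.clear()

cache = {}
p1, p2 = Core(cache), Core(cache)
p1.store('x', 1)                    # sits in p1's write buffer
p2.store('y', 1)                    # sits in p2's write buffer
r1, r2 = p1.load('y'), p2.load('x') # both reads miss the other's buffered store
print(r1, r2)                       # 0 0 — forbidden under SC, allowed under TSO
```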
40
Out of Order Pipeline

A typical out-of-order pipeline can cause
abnormal thread behavior
Processor 1: x=1 ; r1=y    Processor 2: y=1 ; r2=x
Cache Subsystem
Possible outcome: r1=0, r2=0
41
What is allowed and
What is Not?
Same address: memory requests cannot be reordered → Cache Coherence
Different addresses: reordering may be possible → Memory Consistency
42
Memory Consistency
Models
Orderings that can be relaxed: W → R, W → W, R → W, R → R

Sequential Consistency (SC): relaxes none; e.g., MIPS R10000
Total Store Order (TSO): relaxes W → R; e.g., Intel processors, Sun SPARC V9
Partial Store Order (PSO): relaxes W → R and W → W; e.g., SPARC V8
Weak Consistency: relaxes all four; e.g., IBM POWER
Relaxed Consistency: relaxes all four; e.g., research prototypes

All models except SC are called relaxed memory models
43
Sequential Consistency

Definition of sequential consistency
− If we run a parallel program on a
sequentially consistent machine, then the
output is equal to that produced by some
sequential execution.
Memory references from Thread 1 and Thread 2 are interleaved
into a single sequential schedule.
44
Example

Answer
− Sequential Consistency (NO)
− All other models (YES)
Is the outcome (r1=0, r2=0) feasible?
Thread 1 Thread 2
x = 0 , y = 0
y = 1
r2 = x
x = 1
r1 = y
45
Comparison

Sequential consistency is intuitive
− Very low performance
− Hard to implement and verify
− Easy to write programs

Relaxed memory models
− Good performance
− Easier to verify
− Difficult to write correct programs
46
Solution
Program
Instructions
Program
Instructions
Fence
Complete all outstanding memory requests
47
Make our Example
Run Correctly
Gives the correct result irrespective of the
memory consistency model
Thread 1 Thread 2
x = 0 , y = 0
y = 1
Fence
r2 = x
x = 1
Fence
r1 = y
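Extending the toy write-buffer model from the earlier slide (same simplifying assumptions: per-core buffer, one shared dict as the cache), a fence drains the buffer before the following load, so at least one thread must observe the other's store:

```python
class Core:
    def __init__(self, cache):
        self.cache = cache
        self.buffer = {}

    def store(self, addr, val):
        self.buffer[addr] = val

    def fence(self):
        self.cache.update(self.buffer)  # complete all outstanding writes
        self.buffer.clear()

    def load(self, addr):
        return self.buffer.get(addr, self.cache.get(addr, 0))

cache = {}
p1, p2 = Core(cache), Core(cache)
p1.store('y', 1); p1.fence()        # Thread 1: y = 1 ; Fence
p2.store('x', 1); p2.fence()        # Thread 2: x = 1 ; Fence
r2, r1 = p1.load('x'), p2.load('y') # Thread 1: r2 = x ; Thread 2: r1 = y
print(r1, r2)
assert (r1, r2) != (0, 0)           # the fences make (0, 0) impossible
```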
48
Basic Theorem

It is possible to guarantee sequentially
consistent execution of a program by
inserting fences at appropriate points
− Irrespective of the memory model
− Such a program is called a properly
labeled program

Implications:
− Programmers need to be multi-core
aware and write programs properly
− Smart compiler infrastructure
− Libraries to make programming easier
49
Shared Memory:
A Perspective

Implementing a memory system for
multicore processors is a non-trivial task
− Cache coherence (fast, scalable)
− Memory consistency

Library and compiler support

Tradeoff: Performance vs simplicity

Programmers need to write properly
labeled programs
50
Part II - Multicore
Design
51

Multiprocessor Caches

Network on-chip

Non Uniform Caches
Outline
52
Multicore Organization
Processor 1 Processor 2 Processor 3 Processor 4
Caches
Memory Cntrlr 1 Memory Cntrlr 2
Memory Bank Memory Bank
I/O Cntrlr
I/O
Devices
53
Architecture vs
Organization

Architecture
− Shared memory
− Cache coherence
− Memory consistency

Organization
− Caches
− Network on chip (NOC)
− Memory and I/O controllers
54
Basics of Multicore
Caches - Cache Bank
55
Large Caches

Multicores typically have large caches (2-8 MB)

Several cores need to simultaneously access the
cache

We need a cache that is fast and power efficient

DUMB SOLUTION: Have one large cache

Why is the solution dumb?
− Violates the basic rules of cache design
56

Delay is proportional to size

Power is proportional to size
− Can be proportional to size² for very large
caches

Delay is proportional to (#ports)²

Wire delay is a major factor
ABCs of Caches
57

Multiprocessor Caches

Network on-chip

Non Uniform Caches
Outline
58
Smart Solution

Create a network of small caches

Each cache is indexed by unique bits of the
address

Connect the caches using an on-chip network
Router
Cache
Buffer
Link
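Indexing each small cache by unique address bits can be sketched in a few lines (the block size and bank count below are assumed example parameters, not from the slides):

```python
BLOCK_BITS = 6      # assume 64-byte blocks: low 6 bits are the block offset
NUM_BANKS = 8       # assume 8 cache banks

def bank_of(addr):
    """Bank index = address bits just above the block offset."""
    return (addr >> BLOCK_BITS) % NUM_BANKS

print(bank_of(0x0000))  # 0
print(bank_of(0x0040))  # 1: the next 64-byte block maps to the next bank
print(bank_of(0x0200))  # 0: wraps around after 8 consecutive blocks
```

Consecutive blocks thus spread across banks, so several cores can access different banks in parallel.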
Properties of a
Network
• Bisection bandwidth
– The minimum number of links that need to
be cut to divide the network into two equal
parts
• Diameter
– Longest minimum distance between any
pair of nodes
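Both metrics can be checked by brute force on a k × k mesh (the helper functions are illustrative; the known closed forms are diameter = 2(k−1) and bisection bandwidth = k):

```python
from collections import deque

def mesh_links(k):
    """All horizontal and vertical links of a k x k mesh."""
    links = []
    for x in range(k):
        for y in range(k):
            if x + 1 < k: links.append(((x, y), (x + 1, y)))
            if y + 1 < k: links.append(((x, y), (x, y + 1)))
    return links

def diameter(k):
    """Longest shortest path, found by BFS from every node."""
    adj = {}
    for a, b in mesh_links(k):
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    best = 0
    for src in adj:
        dist, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

def bisection(k):
    """Links crossing the cut between the two middle columns."""
    return sum(1 for (a, b) in mesh_links(k)
               if {a[0], b[0]} == {k // 2 - 1, k // 2})

print(diameter(4), bisection(4))   # 6 4, matching 2(k-1) and k for k = 4
```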
Smruti R. Sarangi : SRISHTI Research Group <http://www.cse.iitd.ac.in/~srsarangi/research.html>
60
Network Topology
(a) chain
(b) ring
(c) mesh
61
(d) 2D Torus
(e) Omega Network
Crossbar
• Very flexible interconnect
• Massive area overhead
63
Network Routing

Aims
− Avoid deadlock
− Minimize latency
− Avoid hot-spots

Major types
− Oblivious – fixed policy
− Adaptive – takes into account hot-spots
and congestion
64
Oblivious Routing

X-Y routing
− First move along the X-axis, and then
the Y-axis

Y-X routing
− Reverse of X-Y routing
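Deterministic X-Y routing can be sketched directly (coordinates are (x, y) grid positions; the function name is illustrative):

```python
def xy_route(src, dst):
    """Move along the X axis until the column matches, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:                 # X phase
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # Y phase
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet turns from X to Y at most once, certain turn cycles can never form, which is what makes X-Y routing deadlock-free on a mesh.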
65
Adaptive Routing

West-first
− If X_T ≤ X_S use X-Y routing
− Else, route adaptively

North-last
− If Y_T ≤ Y_S use X-Y routing
− Else, route adaptively

Negative-first
− If (X_T ≤ X_S || Y_T ≤ Y_S) use X-Y routing
− Else, route adaptively
Route from source (X_S, Y_S) to target (X_T, Y_T)
66
Flow Control

Store and forward
− A router stores the entire packet
− Once it has received all the flits, it
sends the packet onward

Wormhole routing
− Flits continue to proceed along
outbound links
− They are not necessarily buffered
Message: head flit, body flits, tail flit
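The latency difference between the two schemes can be seen with a first-order model (one cycle per flit per link, no contention; both formulas are standard first-order approximations under these simplifying assumptions):

```python
def store_and_forward(hops, flits):
    """Each router waits for the whole packet before forwarding it."""
    return hops * flits

def wormhole(hops, flits):
    """The head flit crosses `hops` links; the rest pipeline behind it."""
    return hops + flits - 1

print(store_and_forward(8, 16))  # 128 cycles for a 16-flit packet over 8 hops
print(wormhole(8, 16))           # 23 cycles: latency grows with hops + flits
```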
67
Flow Control - II

Circuit switched
− Resources such as buffers and slots are
pre-allocated
− Low latency, at the cost of possible starvation

Virtual channel
− Allows multiple flows to share a single
channel
− Implemented by having multiple queues at
the routers
68

Multiprocessor Caches

Network on-chip

Non Uniform Caches
Outline
69
Non-Uniform Caches

There are two options for designing a
multiprocessor cache
− Private cache : Each core has its
private cache. To maintain the illusion
of shared memory, we need a cache
coherence protocol.
− Shared cache : One large cache that
is shared by all the cores. Cache
coherence is not required.
70
Private vs Shared
Cache
Private caches: each core has its own cache; the caches are kept coherent
Shared cache: one large cache shared by all the cores
Shared Caches
• Basic approach
– Address based
The bank address is derived from the block address
(the address bits above the block offset)
Static NUCA
• Different banks have different latencies
• Example
– Nearest bank : 8 cycles
– Farthest bank : 40 cycles
• The compiler needs to place frequently
used data in the banks that are
closer to the processor
Dynamic NUCA
Processor 1 Processor 2
Set Address
Block
Address
Set ID
Set
Tag
Dynamic NUCA - II
Processor 1 Processor 2
Tag
Search
Data
Match
Dynamic NUCA - III
Processor 1 Processor 2
Cache lines are promoted on every hit
This scheme ensures that the most frequently accessed blocks
have the lowest access latency
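The promotion-on-hit policy can be sketched as a list of banks ordered by distance, where a hit swaps the block one bank closer to the processor (the single-block-per-bank model is a simplifying assumption):

```python
def access(banks, block):
    """Look up `block`; on a hit, promote it one bank closer.

    `banks` is ordered by distance: index 0 is the nearest (fastest) bank.
    Returns the bank index where the block was found (a proxy for latency).
    """
    i = banks.index(block)
    if i > 0:
        banks[i - 1], banks[i] = banks[i], banks[i - 1]
    return i

banks = ['A', 'B', 'C', 'D']        # 'A' nearest the processor, 'D' farthest
for _ in range(3):
    access(banks, 'D')              # three hits on the far block 'D'
print(banks)                        # ['D', 'A', 'B', 'C'] — 'D' is now nearest
```

This is the same move-toward-the-front idea as LRU promotion, applied across banks instead of within a set.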
76
Part III - Examples
77
AMD Opteron
78
Intel - Pentium 4
source: Intel Technology Docs
79
Intel Prescott
source: chip-architect.com
80
source: chip-architect.com
81
UltraSparc T2
source: wikipedia
82
source: wikipedia
AMD: Bulldozer
83
Intel Sandybridge &
Ivybridge
source: itportal.com
84
Intel Ivybridge

Micro-architecture
− 32 KB L1 data cache + 32 KB L1
instruction cache
− 256 KB coherent L2 cache per core
− Shared L3 cache (2-20 MB)
− 256-bit ring interconnect
− Up to 8 cores
− Roughly 1.1 billion transistors
− Turbo mode - can run above its rated
power limits for up to 20 s
− Built-in graphics processor
85
Revolutionary Features

3D Tri-gate transistors based on FinFETs
Faster operation
Lower threshold voltage and leakage power
Higher drive strength
source: wikipedia
86
Enhanced I/O Support

PCI Express 3.0

RAM: 2800 MT/s

Intel HD Graphics, DirectX 11, OpenGL
3.1, OpenCL 1.1

DDR3L

Configurable thermal limits

Multiple video playbacks possible
87
Security

RdRand instruction
− Generates pseudo-random numbers based
on truly random seeds
− Uses an on-chip source of randomness for
getting the seed

Intel vPro
− Possible to remotely disable processors or
erase hard disks by sending signals over the
internet or 3G
− Can verify identity of users/ environments for
trusted execution
88
Slides can be downloaded at:
http://www.cse.iitd.ac.in/~srsarangi/files/drdo-pres-july18-2012.ppt

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi core
mukul bhardwaj
 
Microprocessor & microcontroller
Microprocessor & microcontroller Microprocessor & microcontroller
Microprocessor & microcontroller
Nitesh Kumar
 
I. Introduction to Microprocessor System.ppt
I. Introduction to Microprocessor System.pptI. Introduction to Microprocessor System.ppt
I. Introduction to Microprocessor System.ppt
HAriesOa1
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
Piyush Mittal
 

Was ist angesagt? (20)

Multiprocessor
MultiprocessorMultiprocessor
Multiprocessor
 
Pentium processor
Pentium processorPentium processor
Pentium processor
 
Cache coherence ppt
Cache coherence pptCache coherence ppt
Cache coherence ppt
 
Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi core
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processing
 
Multi core processors
Multi core processorsMulti core processors
Multi core processors
 
Multicore Processsors
Multicore ProcesssorsMulticore Processsors
Multicore Processsors
 
Microprocessor & microcontroller
Microprocessor & microcontroller Microprocessor & microcontroller
Microprocessor & microcontroller
 
I2C Protocol
I2C ProtocolI2C Protocol
I2C Protocol
 
I. Introduction to Microprocessor System.ppt
I. Introduction to Microprocessor System.pptI. Introduction to Microprocessor System.ppt
I. Introduction to Microprocessor System.ppt
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
Multiprocessor system
Multiprocessor system Multiprocessor system
Multiprocessor system
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
Multi core processors
Multi core processorsMulti core processors
Multi core processors
 
ARM Processors
ARM ProcessorsARM Processors
ARM Processors
 
Flash memory
Flash memoryFlash memory
Flash memory
 
History of microprocessors
History of microprocessorsHistory of microprocessors
History of microprocessors
 
Evolution of microprocessors
Evolution of microprocessorsEvolution of microprocessors
Evolution of microprocessors
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer Architecture
 
Point to point interconnect
Point to point interconnectPoint to point interconnect
Point to point interconnect
 

Ähnlich wie Multicore Processors

Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
Haris456
 
04536342
0453634204536342
04536342
fidan78
 

Ähnlich wie Multicore Processors (20)

module4.ppt
module4.pptmodule4.ppt
module4.ppt
 
module01.ppt
module01.pptmodule01.ppt
module01.ppt
 
Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors ppt
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
 
Multiprocessor.pptx
 Multiprocessor.pptx Multiprocessor.pptx
Multiprocessor.pptx
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
27 multicore
27 multicore27 multicore
27 multicore
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Computer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemComputer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer system
 
Archi arm2
Archi arm2Archi arm2
Archi arm2
 
27 multicore
27 multicore27 multicore
27 multicore
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
 
Architecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPUArchitecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPU
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
04536342
0453634204536342
04536342
 
AVR_Course_Day4 introduction to microcontroller
AVR_Course_Day4 introduction to microcontrollerAVR_Course_Day4 introduction to microcontroller
AVR_Course_Day4 introduction to microcontroller
 
PARALLELISM IN MULTICORE PROCESSORS
PARALLELISM  IN MULTICORE PROCESSORSPARALLELISM  IN MULTICORE PROCESSORS
PARALLELISM IN MULTICORE PROCESSORS
 

Kürzlich hochgeladen

怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证
怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证
怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证
tufbav
 
Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...
Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...
Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...
gajnagarg
 
CHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in Dammam
Abortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in DammamAbortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in Dammam
Abortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in Dammam
ahmedjiabur940
 
Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)
Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)
Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)
amitlee9823
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Naicy mandal
 
➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men 🔝Deoghar🔝 Escorts...
➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men  🔝Deoghar🔝   Escorts...➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men  🔝Deoghar🔝   Escorts...
➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men 🔝Deoghar🔝 Escorts...
amitlee9823
 
Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...
Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...
Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...
amitlee9823
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CHEAP Call Girls in Mayapuri (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Mayapuri  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Mayapuri  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Mayapuri (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men 🔝Muzaffarpur🔝 ...
➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men  🔝Muzaffarpur🔝  ...➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men  🔝Muzaffarpur🔝  ...
➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men 🔝Muzaffarpur🔝 ...
amitlee9823
 

Kürzlich hochgeladen (20)

Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
 
怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证
怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证
怎样办理维多利亚大学毕业证(UVic毕业证书)成绩单留信认证
 
Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...
Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...
Just Call Vip call girls daman Escorts ☎️9352988975 Two shot with one girl (d...
 
CHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Vinay Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
 
Top Rated Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
 
Abortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in Dammam
Abortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in DammamAbortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in Dammam
Abortion Pill for sale in Riyadh ((+918761049707) Get Cytotec in Dammam
 
Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)
Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)
Escorts Service Arekere ☎ 7737669865☎ Book Your One night Stand (Bangalore)
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
 
➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men 🔝Deoghar🔝 Escorts...
➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men  🔝Deoghar🔝   Escorts...➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men  🔝Deoghar🔝   Escorts...
➥🔝 7737669865 🔝▻ Deoghar Call-girls in Women Seeking Men 🔝Deoghar🔝 Escorts...
 
Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...
Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...
Vip Mumbai Call Girls Kalyan Call On 9920725232 With Body to body massage wit...
 
SM-N975F esquematico completo - reparación.pdf
SM-N975F esquematico completo - reparación.pdfSM-N975F esquematico completo - reparación.pdf
SM-N975F esquematico completo - reparación.pdf
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
 
CHEAP Call Girls in Mayapuri (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Mayapuri  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Mayapuri  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Mayapuri (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...
(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...
(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...
 
HLH PPT.ppt very important topic to discuss
HLH PPT.ppt very important topic to discussHLH PPT.ppt very important topic to discuss
HLH PPT.ppt very important topic to discuss
 
➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men 🔝Muzaffarpur🔝 ...
➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men  🔝Muzaffarpur🔝  ...➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men  🔝Muzaffarpur🔝  ...
➥🔝 7737669865 🔝▻ Muzaffarpur Call-girls in Women Seeking Men 🔝Muzaffarpur🔝 ...
 

Multicore Processors

  • 1. 1 MultiCore Processors Dr. Smruti Ranjan Sarangi IIT Delhi
  • 2. 2 Part I - Multicore Architecture Part II - Multicore Design Part III - Multicore Examples
  • 3. 3 Part I - Multicore Architecture
  • 4. 4  Moore's Law and Transistor Scaling  Overview of Multicore Processors  Cache Coherence  Memory Consistency Outline
  • 5. 5 Moore's Law  Intel's co-founder Gordon Moore predicted in 1965 that the number of transistors on chip will double every year − Reality TodayReality Today : Doubles once every 2 years  How many transistors do we have today per chip? − Approx 1 billion  2014 – 2 billion  2016 – 4 billion
  • 6. 6 Why do Transistors Double?  The feature size keeps shrinking by √2 every 2 years  Currently it is 32 nm  32 nm → 22 nm → 18 nm → Feature SizeFeature Size source: wikipedia
  • 7. 7 Number of transistors per chip over time Source: Yaseen et al.
  • 9. 9 How to Use the Extra Transistors  Traditional − Complicate the processor  Issue width, ROB size, aggressive speculation − Increase the cache sizes  L1, L2, L3  Modern (post 2005) − Increase the number of cores per chip − Increase the number of threads per core
  • 10. 10 What was Wrong with the Traditional Approach?  Law of diminishing returns − Performance does not scale well with the increase in CPU resources − Delay is proportional to the size of a CPU structure − Sometimes it is proportional to the square of the size − Hence, there was no significant difference between a 4-issue and an 8-issue processor − Extra cache capacity also had limited benefit because the working set of a program already fits in the cache  Wire Delay − It takes tens of cycles for a signal to propagate from one end of a chip to the other. Wire delays decrease the advantage of pipelining.
  • 11. 11 Power & Temperature Source: Intel Microprocessor Report
  • 12. 12 Power & Temperature - II  High power consumption − Increases cooling costs − Increases the power bill Source: O'Reilly Radar
  • 13. 13 Intel's Solution: Have Many Simple Cores  Some major enhancements to processors − Pentium – 2-issue in-order pipeline (5 stages) − Pentium Pro – Out-of-order pipeline (7 stages) − Pentium 4 – Aggressive out-of-order pipeline (18 stages) + trace cache − Pentium Northwood – 27-stage out-of-order pipeline − Pentium M – 12-stage out-of-order pipeline − Intel Core 2 Duo – 2 Pentium M cores − Intel Quad Core – 4 Pentium M cores
  • 14. 14 Evolution – 45 nm, 22 nm, 22 nm Source: chipworks.com
  • 15. 15  Moore's Law and Transistor Scaling  Overview of Multicore Processors  Cache Coherence  Memory Consistency Outline
  • 16. 16 What is a Multicore Processor? Core Cache Core Cache Core Cache Core Cache L2 Cache  Multiple processing cores, each with private caches  Large shared L2 or L3 caches  Complex interconnection network  This is also called a symmetric multicore processor
  • 17. 17 Advantages of Multicores  The power consumption per core is limited. It decreases by about 30% per generation.  The design has lower complexity − easy to design, debug, and verify  The performance per core increases by about 30% per generation  The number of cores doubles every generation.
  • 19. 19 Issues in Multicores Parallel processing is the biggest advantage. We have so many cores ... How to use them? ANSWER: 1) We need to write effective and efficient parallel programs 2) The hardware has to facilitate this endeavor
  • 20. 20 Parallel Programs for (i=0;i<n;i++) c[i] = a[i] * b[i] parallel_for(i=0;i<n;i++) c[i] = a[i] * b[i] SequentialSequential ParallelParallel What if ??? parallel_for(i=0;i<n;i++) c[d[i]] = a[i] * b[i] Is the loop still parallel?
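The slide's parallel_for is pseudocode; a minimal sketch of the same loop with C++11 threads might look as follows (the strided partition and the thread count are illustrative choices, not part of the slide). Each thread writes a disjoint set of c[i] entries, which is exactly why the c[d[i]] variant is not safely parallel: two iterations may write the same element when d contains duplicates.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Sketch: the slide's parallel_for expressed with std::thread.
// Each thread handles a disjoint (strided) set of indices, so
// there are no conflicting writes to c.
void parallel_multiply(const std::vector<int>& a, const std::vector<int>& b,
                       std::vector<int>& c, unsigned nthreads = 4) {
    std::size_t n = a.size();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; t++)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += nthreads)
                c[i] = a[i] * b[i];
        });
    for (auto& w : workers) w.join();
}
```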
  • 21. 21 1st Challenge: Finding Parallelism  To run parallel programs efficiently on multi- cores we need to: − discover parallelism in programs − re-organize code to expose more parallelism  Discovering parallelism : automatic approaches − Automatic compiler analysis − Profiling − Hardware based methods
  • 22. 22 Finding Parallelism- II  Discovering parallelism: manually − Write explicitly parallel algorithms − Restructure loops for (i=0;i<n-1;i++) a[i] = a[i] * a[i+1] a_copy = copy(a) parallel_for(i=0;i<n-1;i++) a[i] = a_copy[i] * a_copy[i+1] SequentialSequential ParallelParallel Example
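The restructuring on this slide works because the sequential loop reads a[i+1] before iteration i+1 overwrites it; snapshotting the array makes that explicit and leaves every iteration independent. A sketch (sequential here for clarity; a parallel_for would then distribute the loop body):

```cpp
#include <cassert>
#include <vector>

// Sketch of the slide's restructuring: copying a[] removes the
// loop-carried dependence on a[i+1], so the iterations become
// independent and could safely be run by a parallel_for.
std::vector<int> scale_with_copy(std::vector<int> a) {
    std::vector<int> a_copy = a;              // snapshot of the old values
    for (std::size_t i = 0; i + 1 < a.size(); i++)
        a[i] = a_copy[i] * a_copy[i + 1];     // reads only the snapshot
    return a;
}
```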
  • 23. 23 Tradeoffs  Compiler Approach − Not very extensive − Slow − Limited utility  Manual approach − Very difficult − Much better results − Broad utility Given good software-level approaches, how can hardware help?
  • 24. 24 Models of Parallel Programs  Shared memory − All the threads see the same view of memory (same address space) − Hard to implement but easy to program − Default (used by almost all multicores)  Message passing − Each thread has its private address space − Simpler to implement − Research prototypes
  • 25. 25 Shared Memory Multicores  What is shared memory? Core Cache Core Cache x=19 read x
  • 26. 26 How does it help?  Programmers need not bother about low-level details.  The model is immune to process/thread migration  The same data can be accessed/modified from any core  Programs can be ported across architectures very easily, often without any modification
  • 27. 27  Moore's Law and Transistor Scaling  Overview of Multicore Processors  Cache Coherence  Memory Consistency Outline
  • 28. 28 What are the Pitfalls? Example 1 Is the outcome (r1=1, r2=2) (r3=2, r4=1) feasible? Thread 1 Thread 2 Thread 3 x = 1 x = 0 r1 = x r2 = x r3 = x r4 = x x = 2 Does it make sense?
  • 29. 29 Example 1 contd...  The order gets reversed Thread 1 Thread 2 Thread 3 Inter-core network x=1 x=2
  • 30. 30 Example 2 When should a write from one processor be visible to other processors? Is the outcome (r1 = 0, r3 = 0) feasible? Thread 1 Thread 2 Thread 3 x = 1 x = 0 r1 = x r3 = x Does it make sense?
  • 31. 31 Point to Note  Memory accesses can be reordered by the memory system.  The memory system is like a real world network.  It suffers from bottlenecks, delays, hot spots, and sometimes can drop a packet
  • 32. 32 How should Applications behave?  It should be possible for programmers to write programs that make sense  Programs should behave the same way across different architectures Cache Coherence Axiom 1: A write is ultimately visible. Axiom 2: Writes to the same location are seen in the same order
  • 33. 33 Example Protocol Claim: The following set of conditions satisfies Axioms 1 and 2 − Axiom 1 (a write is ultimately visible) ⇒ Condition 1: Every write completes in a finite amount of time − Axiom 2 (writes to the same location are seen in the same order) ⇒ Condition 2: At any point of time, a given cache block is either being read by multiple readers, or being written by just one writer
  • 34. 34 Practical Implementations  Snoopy Coherence − Maintain some state for every cache block − Elaborate protocol for state transitions − Suitable for CMPs with fewer than 16 cores − Requires a shared bus  Directory Coherence − Elaborate state transition logic − Distributed protocol relying on messages − Suitable for CMPs with more than 16 cores
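As an illustration of the per-block state these protocols maintain, here is a deliberately tiny MSI-style transition function. This is a teaching sketch, not the slide's protocol: real snoopy and directory implementations add more states (e.g., MESI/MOESI) and transient cases.

```cpp
#include <cassert>
#include <string>

// Minimal MSI-style state machine for one cache block in one cache.
// States: I (invalid), S (shared, readable), M (modified, writable).
enum State { I, S, M };

State on_event(State s, const std::string& ev) {
    if (ev == "local_read")   return (s == I) ? S : s;  // fetch a shared copy
    if (ev == "local_write")  return M;                 // gain the exclusive copy
    if (ev == "remote_write") return I;                 // another core invalidates us
    if (ev == "remote_read")  return (s == M) ? S : s;  // downgrade when shared
    return s;
}
```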
  • 35. 35 Implications of Cache Coherence  It is not possible to drop packets. All the threads/cores should perceive 100% reliability  The memory system cannot reorder requests or responses to the same address  How to enforce cache coherence − Use Snoopy or Directory protocols − Reference: Hennessy & Patterson, Part 2
  • 36. 36  Moore's Law and Transistor Scaling  Overview of Multicore Processors  Cache Coherence  Memory Consistency Outline
  • 37. 37 Reordering Memory Requests in General  Answer : Depends …. Is the outcome (r1=0, r2=0) feasible? Thread 1 Thread 2 x = 0 , y = 0 y = 1 r2 = x x = 1 r1 = y Does it make sense?
  • 38. 38 Reordering Memory Requests – Sources: 1. Network on Chip 2. Write Buffer 3. Out-of-order Pipeline
  • 39. 39 Write Buffer  Write buffers can break W → R ordering Processor 1 Processor 2 (each with a write buffer) Cache Subsystem x=1 and y=1 wait in the write buffers while r1=y and r2=x both read 0
  • 40. 40 Out-of-Order Pipeline  A typical out-of-order pipeline can cause abnormal thread behavior Processor 1 Processor 2 Cache Subsystem x=1; r1=y y=1; r2=x Outcome: r1=0, r2=0
  • 41. 41 What is Allowed and What is Not? Same address – cannot reorder memory requests (cache coherence) Different addresses – reordering may be possible (memory consistency)
  • 42. 42 Memory Consistency Models W → R, W → W, R → W, R → R Sequential Consistency (SC) E.g., MIPS R10000 Total Store Order (TSO) E.g., Intel procs, Sun SPARC V9 Partial Store Order (PSO) E.g., SPARC V8 Weak Consistency E.g., IBM Power Relaxed Consistency E.g., Research prototypes Relaxed Memory Models
  • 43. 43 Sequential Consistency  Definition of sequential consistency − If we run a parallel program on a sequentially consistent machine, then the output is equal to that produced by some sequential execution. Thread 1 Thread 2 → Sequential Schedule of Memory References
  • 44. 44 Example  Answer − Sequential Consistency (NO) − All other models (YES) Is the outcome (r1=0, r2=0) feasible? Thread 1 Thread 2 x = 0 , y = 0 y = 1 r2 = x x = 1 r1 = y
  • 45. 45 Comparison  Sequential consistency is intuitive − Very low performance − Hard to implement and verify − Easy to write programs  Relaxed memory models − Good performance − Easier to verify − Difficult to write correct programs
  • 47. 47 Make our Example Run Correctly Gives the correct result irrespective of the memory consistency model Thread 1 Thread 2 x = 0 , y = 0 y = 1 Fence r2 = x x = 1 Fence r1 = y
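The fenced program on this slide can be written with C++11 atomics; the default sequentially consistent stores and loads play the role of the fences, so the C++ memory model forbids the outcome (r1 = 0, r2 = 0). This is a sketch in a different notation than the slide, not the original code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns 1 iff the forbidden outcome (r1 == 0 && r2 == 0) occurred.
// With seq_cst atomics (the default ordering), it never can.
int forbidden_outcome_seen() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] { y.store(1); r2 = x.load(); });  // y = 1; Fence; r2 = x
    std::thread t2([&] { x.store(1); r1 = y.load(); });  // x = 1; Fence; r1 = y
    t1.join(); t2.join();
    return (r1 == 0 && r2 == 0);
}
```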
  • 48. 48 Basic Theorem  It is possible to guarantee sequentially consistent execution of a program by inserting fences at appropriate points − Irrespective of the memory model − Such a program is called a properly labeled program  Implications: − Programmers need to be multicore-aware and write programs properly − Smart compiler infrastructure − Libraries to make programming easier
  • 49. 49 Shared Memory: A Perspective  Implementing a memory system for multicore processors is a non-trivial task − Cache coherence (fast, scalable) − Memory consistency  Library and compiler support  Tradeoff: Performance vs simplicity  Programmers need to write properly labeled programs
  • 50. 50 Part II - Multicore Design
  • 52. 52 Multicore Organization Processor 1 Processor 2 Processor 3 Processor 4 Caches Memory Cntrlr 1 Memory Cntrlr 2 Memory Bank Memory Bank I/O Cntrlr I/O Devices
  • 53. 53 Architecture vs Organization  Architecture − Shared memory − Cache coherence − Memory consistency  Organization − Caches − Network on chip (NOC) − Memory and I/O controllers
  • 55. 55 Large Caches  Multicores typically have large caches (2-8 MB)  Several cores need to simultaneously access the cache  We need a cache that is fast and power efficient  DUMB SOLUTION: Have one large cache  Why is the solution dumb? − It violates the basic rules of cache design
  • 56. 56 ABCs of Caches  Delay is proportional to size  Power is proportional to size − Can be proportional to size² for very large caches  Delay is proportional to (#ports)²  Wire delay is a major factor (plot: access latency in cycles vs. cache size)
  • 58. 58 Smart Solution  Create a network of small caches  Each cache is indexed by unique bits of the address  Connect the caches using an on-chip network Router Cache Buffer Link
  • 59. Properties of a Network • Bisection bandwidth – The minimum number of links that need to be cut to divide the network into two equal parts • Diameter – The longest of the minimum distances between any pair of nodes Smruti R. Sarangi : SRISHTI Research Group <http://www.cse.iitd.ac.in/~srsarangi/research.html>
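For a concrete instance of these two metrics, take a k × k 2D mesh (an illustrative assumption; the slide defines the metrics for any topology): the diameter is the Manhattan distance between opposite corners, and a straight cut through the middle severs one link per row.

```cpp
#include <cassert>

// Diameter and bisection bandwidth of a k x k 2D mesh (k even, so
// the bisection splits the network into two equal halves).
int mesh_diameter(int k)  { return 2 * (k - 1); }  // corner to opposite corner
int mesh_bisection(int k) { return k; }            // k links cross the middle cut
```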
  • 61. 61 (d) 2D Torus (e) Omega Network
  • 62. Crossbar • Very flexible interconnect • Massive area overhead
  • 63. 63 Network Routing  Aims − Avoid deadlock − Minimize latency − Avoid hot-spots  Major types − Oblivious – fixed policy − Adaptive – takes into account hot-spots and congestion
  • 64. 64 Oblivious Routing  X-Y routing − First move along the X-axis, and then the Y-axis  Y-X routing − Reverse of X-Y routing
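X-Y routing is simple enough to sketch directly: drain the X offset first, then the Y offset. The coordinate convention below is an assumption for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Deterministic X-Y routing on a 2D mesh: step along the X axis
// until the column matches the target, then along the Y axis.
std::vector<std::pair<int,int>> xy_route(int xs, int ys, int xt, int yt) {
    std::vector<std::pair<int,int>> path{{xs, ys}};
    int x = xs, y = ys;
    while (x != xt) { x += (xt > x) ? 1 : -1; path.push_back({x, y}); }
    while (y != yt) { y += (yt > y) ? 1 : -1; path.push_back({x, y}); }
    return path;
}
```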
  • 65. 65 Adaptive Routing  West-first − If X_T ≤ X_S use X-Y routing − Else, route adaptively  North-last − If Y_T ≤ Y_S use X-Y routing − Else, route adaptively  Negative-first − If (X_T ≤ X_S || Y_T ≤ Y_S) use X-Y routing − Else, route adaptively Route from X_S, Y_S to X_T, Y_T source target
  • 66. 66 Flow Control  Store and forward − A router stores the entire packet − Once it receives all the flits, it sends the packet onward  Wormhole routing − Flits continue to proceed along outbound links − They are not necessarily buffered Message: Head Flit, Flits, Tail Flit
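The difference between the two schemes shows up directly in the textbook zero-load latency formulas. Assuming one flit per cycle per link, a packet of P flits crossing H hops, and no router pipeline delay (standard simplifications, not figures from the slide):

```cpp
#include <cassert>

// Zero-load latency in cycles for a P-flit packet over H hops,
// at 1 flit/cycle/link, ignoring router pipeline delays.
int store_and_forward_latency(int P, int H) { return P * H; }      // full packet per hop
int wormhole_latency(int P, int H)          { return H + P - 1; }  // flits pipelined
```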
  • 67. 67 Flow Control - II  Circuit switched − Resources such as buffers and slots are pre-allocated − Low latency, at the cost of possible starvation  Virtual channel − Allows multiple flows to share a single channel − Implemented by having multiple queues at the routers
  • 69. 69 Non-Uniform Caches  There are two options for designing a multiprocessor cache − Private cache : Each core has its own private cache. To maintain the illusion of shared memory, we need a cache coherence protocol. − Shared cache : One large cache that is shared by all the cores. Cache coherence is not required.
  • 70. 70 Private vs Shared Cache − Private caches: each core has its own cache, kept coherent with the others − Shared cache: one large cache shared by all the cores
  • 71. Shared Caches • Basic approach – Address based Processor 1 Processor 2 Address Block size Bank Address
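The address-based bank selection mentioned here can be sketched in a couple of lines; the 64-byte block size and 8 banks are assumed example values, not figures from the slide.

```cpp
#include <cassert>
#include <cstdint>

// Address-interleaved banking: the bank index comes from the bits
// just above the block offset, so consecutive blocks map to
// consecutive banks (block size and bank count are powers of 2).
unsigned bank_of(std::uint64_t addr, unsigned nbanks = 8, unsigned block = 64) {
    return (addr / block) % nbanks;
}
```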
  • 72. Static NUCA • Different banks have different latencies • Example – Nearest bank : 8 cycles – Farthest bank : 40 cycles • The compiler needs to place frequently used data in the banks that are closer to the processor
  • 73. Dynamic NUCA Processor 1 Processor 2 Set Address Block Address Set ID Set Tag
  • 74. Dynamic NUCA - II Processor 1 Processor 2 Tag Search Data Match
  • 75. Dynamic NUCA - III Processor 1 Processor 2 Cache lines are promoted on every hit. This scheme ensures that the most frequently used blocks have the lowest access latency.
  • 76. 76 Part III - Examples
  • 78. 78 Intel - Pentium 4 source: Intel Techology Docs
  • 84. 84 Intel Ivy Bridge  Micro-architecture − 32 KB L1 data cache + 32 KB L1 instruction cache − 256 KB coherent L2 cache per core − Shared L3 cache (2 - 20 MB) − 256-bit ring interconnect − Up to 8 cores − Roughly 1.1 billion transistors − Turbo mode - can run at an elevated temperature for up to 20 s − Built-in graphics processor
  • 85. 85 Revolutionary Features  3D Tri-gate transistors based on FinFETs − Faster operation − Lower threshold voltage and leakage power − Higher drive strength source: wikipedia
  • 86. 86 Enhanced I/O Support  PCI Express 3.0  RAM: 2800 MT/s  Intel HD Graphics, DirectX 11, OpenGL 3.1, OpenCL 1.1  DDR3L  Configurable thermal limits  Multiple video playbacks possible
  • 87. 87 Security  RdRand instruction − Generates pseudo-random numbers based on truly random seeds − Uses an on-chip source of randomness for the seed  Intel vPro − Possible to remotely disable processors or erase hard disks by sending signals over the internet or 3G − Can verify the identity of users/environments for trusted execution
  • 88. 88 Slides can be downloaded at: http://www.cse.iitd.ac.in/~srsarangi/files/drdo-pres-july18-2012.ppt