Packet Processing &
Cache Coherency – 101: A Primer
By: M Jay
2
Notices and Disclaimers
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability,
fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course
of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided
here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule,
specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from
published specifications. Current characterized errata are available on request.
Intel, the Intel logo, {List of ALL the Intel trademarks in this document} are trademarks of Intel Corporation in the U.S.
and/or other countries.
*Other names and brands may be claimed as the property of others
© Intel Corporation.
3
Agenda
•  Cache Coherency – Is it really needed? – Message Passing Vs Shared Mem
•  Read access & cache - benefits we all know
•  What about Write & Cache?
•  Write Through – Write Back Cache
•  DPDK PMD and Cache Coherency
•  Snoop Protocol
•  NUMA
•  LIFO
•  Dynamic Vs Static
•  DDIO & Cache Size
4
Thread Local Storage – why worry about coherency?
Well ! I need to Share Data !!
6
Why share data?
Why don't developers use the Message Passing Paradigm?
Can we visualize having no shared address space?
7
Why shared data?
Why don't developers use the Message Passing Paradigm?
Scratch – Scratch – Scratch – Scratch (each processor with its own private scratch memory)
What if developers did so?
8
No Need For Coherency Protocol !!
9
No Need of Cache Coherency?
Message Passing – no need of coherency.
Shared Memory Paradigm – hardware has to manage coherency.
10
So, really, what is the root cause of the Cache Coherency requirement?
Where is the Cache Coherency requirement coming from?
Is it software developers' problem of "not doing truly parallel programming"?
Or is it hardware designers' "overdo" problem?
11
Well ! But …
Message Passing needs moving data around… moving data… won't it be a lot of overhead?
Shared Memory means just read / write. No moving data around! Right?
Yeah ! Right ! Bring it on, Shared Memory !
12
Why do you need to share data with another thread?
What Is The Task At Hand?
Receive → Process → Transmit
(rx cost … tx cost)
A Chain is only as strong as …..
Benefits – Eliminating / Hiding Overheads

Eliminating – how?
•  Interrupt / context-switch overhead → Polling
•  Kernel/user transition overhead → User-mode driver
•  Core-to-thread scheduling overhead → Pthread affinity

Eliminating / hiding – how?
•  4K paging overhead → Huge pages
•  PCI bridge I/O overhead → High-throughput bulk-mode I/O calls
•  Inter-core locking overhead → Lockless inter-core communication (see the ring sketch below)
To tackle this challenge, what kinds of devices and latencies do we have at our disposal?
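The "lockless inter-core communication" row above in practice usually means a single-producer / single-consumer ring between pipeline stages. Below is a minimal hedged sketch in C11 – not DPDK's actual rte_ring, which adds bulk operations, per-architecture barriers, and multi-producer/consumer modes:

/* spsc_ring.c - minimal single-producer/single-consumer lockless ring.
 * Illustrative sketch only. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024               /* must be a power of two */

struct spsc_ring {
    _Atomic size_t head;             /* written only by the producer */
    _Atomic size_t tail;             /* written only by the consumer */
    void *slots[RING_SIZE];
};

static bool ring_enqueue(struct spsc_ring *r, void *obj)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;                /* full */
    r->slots[head & (RING_SIZE - 1)] = obj;
    /* release: make the slot visible before publishing the new head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_dequeue(struct spsc_ring *r, void **obj)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return false;                /* empty */
    *obj = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}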
15
PCIe* Connectivity and Core Usage
Using run-to-completion or pipeline software models

Run to Completion Model
•  I/O and Application workload can be handled on a single core
•  I/O can be scaled over multiple cores
(Figure: on Processor 0, physical core 0 runs the Linux* control plane; physical cores 1, 2, 3 and 5 each run Intel® DPDK PMD packet I/O plus their own packet work, flow work, or flow classification for apps A, B, C, doing Rx and Tx on 10 GbE ports attached over PCIe; NUMA pools of caches, queues/rings and buffers sit on each socket.)

Pipeline Model
•  I/O application disperses packets to other cores
•  Application work performed on other cores
(Figure: on Processor 1, one core runs Intel® DPDK PMD packet I/O with hash / RSS-mode distribution, other cores run apps A, B and C, and a further core handles Tx; packets move between cores over rings, and the two sockets are linked by QPI.)

Can handle more I/O on fewer cores with vectorization
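In code, the run-to-completion model above is just a poll loop per lcore. Here is a minimal sketch against the real DPDK burst API – the port and queue numbers and do_packet_work() are placeholders, and all device setup and error handling is omitted:

/* Run-to-completion lcore loop: poll Rx, process, transmit on one core. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void do_packet_work(struct rte_mbuf *m) { (void)m; /* app work here */ }

static int lcore_main(void *arg)
{
    uint16_t port = *(uint16_t *)arg;        /* assumed already configured */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* PMD packet I/O: busy-poll the Rx queue, no interrupts */
        uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++)
            do_packet_work(bufs[i]);
        uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);
        while (nb_tx < nb_rx)                /* free anything not sent */
            rte_pktmbuf_free(bufs[nb_tx++]);
    }
    return 0;
}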
16
Why do you need to share data with another thread?
So tell me… why do you need to share data with another thread?
It is the Pipeline Model that needs sharing! – looks like!!
Let us go with that for now !!
17
How can we map our s/w variables to h/w infrastructure?
19
Individual Memory => For Thread Local Storage?
Shared Memory => For Global Data?

int shared;          /* global: candidate for shared memory        */

void function(void)
{
    int private;     /* automatic: private to the calling thread   */
}
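To make the mapping concrete, here is a minimal hedged sketch in C (the names are invented for illustration): a file-scope variable is one copy shared by all threads and therefore needs coherency and synchronization, while a __thread variable is Thread Local Storage with one copy per thread.

#include <pthread.h>
#include <stdio.h>

int shared_counter;                 /* one copy, visible to all threads:
                                       needs coherency (and locking)     */
__thread int private_counter;       /* thread-local storage: one copy
                                       per thread, no sharing            */

static void *worker(void *arg)
{
    int on_stack = 0;               /* automatic: private to this thread */
    private_counter++;              /* no other thread can see this      */
    __sync_fetch_and_add(&shared_counter, 1);  /* shared: must synchronize */
    (void)on_stack; (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("shared=%d private(main)=%d\n", shared_counter, private_counter);
    return 0;
}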
20
Quiz Time
21
What do you wish for?
Bigger Shared memory or bigger Individual memory?
What about locality?
22
You look at the header once and forward the packet.
Right Away You Sprint to the next packet
So What do you wish for? Bigger which one?
23
You look at the header once and forward the packet.
Right away you sprint to the next packet – not the same packet.
At a fast line rate, you sprint from one packet to another very quickly.
Temporal locality in packet processing? How are we doing? How much locality?
Smaller individual caches with less locality mean more individual cache misses,
so you often end up going out to the far shared cache / memory.
It is as if you don't even have the individual cache, and instead have the slower memory all the time.
So what do you wish for? Bigger which one?
Challenge: What if there is an L1 Cache Miss and an LLC Hit?
Core 0 misses in its L1 cache; the Last Level Cache hit costs ~40 cycles.
With a 40-cycle LLC hit, how will you achieve the Rx budget of 19 cycles?
So what do you wish for? Bigger which one?
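To put rough numbers on that budget (illustrative, assuming 10 GbE and minimum-size 64-byte frames): line rate is 14.88 Mpps, so a new packet arrives about every 67 ns – roughly 200 cycles on a 3 GHz core for all per-packet work. A 19-cycle Rx sub-budget is blown by a single 40-cycle LLC hit, let alone a trip out to memory.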
25
Your Answer ???
L1 Cache With 4-Cycle Latency
Caching benefits on read – excellent!! Right?
On an L1 cache hit, "Read Packet Descriptor" costs only 4 cycles; with 4-cycle latency, achieving the Rx budget of 19 cycles is within reach.
What? Now what? A miss – what about the first read, which may cause a miss?
27
Cache is actually hashing !
Many memory lines compete for the same cache line: the "1st line" of several memory regions all map to the same slot, and the Cache Tag / Directory indicates which one is currently occupying it. Repeated "Read Packet Descriptor" accesses from different buffers can therefore evict each other.
What about locality?
28
Cache and Tag!
The Cache Tag / Directory records which memory line is currently occupying each cache line.
What about locality?
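Concretely, the "hashing" is plain bit selection. Here is a hedged sketch of how a set-associative cache splits an address into offset, set index, and tag; the geometry is an assumed example (32 KiB, 8-way, 64-byte lines), not any particular part's L1:

#include <stdint.h>
#include <stdio.h>

/* Example geometry (assumed): 32 KiB, 8-way, 64 B lines -> 64 sets. */
#define LINE_SIZE  64u
#define NUM_WAYS    8u
#define CACHE_SIZE (32u * 1024u)
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))

int main(void)
{
    uintptr_t addr = 0x7ffe12345678u;              /* arbitrary address */
    uintptr_t line = addr / LINE_SIZE;             /* drop offset bits  */
    unsigned set   = (unsigned)(line % NUM_SETS);  /* index = the "hash" */
    uintptr_t tag  = line / NUM_SETS;              /* stored in the tag
                                                      directory to tell
                                                      lines apart       */
    printf("addr=%#lx set=%u tag=%#lx\n",
           (unsigned long)addr, set, (unsigned long)tag);
    return 0;
}

With this geometry, addresses exactly 4 KiB apart land in the same set – which is why the "1st line" of many packet buffers can compete for the same few cache slots.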
29
Let us look at Write now
30
Where will Data be Coming From?
Write-Through Vs Write-Back
34
Let us look at Write-Through first.
For P2, where will the data be coming from – on a hit? On a miss?
36
So with a Write-Through cache, writes happen at what speed?
What happens if you repeatedly write?
37
Let us look at Write-Back next.
For P2, where will the data be coming from?
If a hit: from the cache. If a miss: from where?
38
At what speed do writes happen in Write-Back?
How do we improve with more and more writes – compared to Write-Through?
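A rough comparison (latencies assumed purely for illustration): writing the same counter N times costs about one line fill plus N × 4-cycle L1 hits with Write-Back, but about N × the full memory latency with Write-Through, since every store goes all the way out to memory. The more you rewrite a line, the more Write-Back wins.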
40
Where Else? Cache To Cache …
So, the data can come from:
1)  its own cache, or
2)  shared memory, or
3)  even from ANY of the other individual caches (Write-Back).

Requesting CPU → which CPUs can offer the data:
P0 → P1 to Pn
P1 → P0 & [P2 to Pn]
P2 → P0, P1 & [P3 to Pn]
… and so on …
Pn → [P0 to Pn-1]
Total paths: N × N ?? ???

Looks like we have the complexity of Message Passing after all.
Remember me? You thought there was no movement of data in "shared memory"?
41
Additional housekeeping “dirty bit” with Write Back
42
That is for Data Side…
What About Control for Coherency?
43
MESI: M – Modified, E – Exclusive, S – Shared, I – Invalid
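A compact way to read MESI is as a per-line state machine. The sketch below is a teaching model only – real coherency adds snoop and bus messages, and recent Intel parts use MESIF-style variants:

/* Simplified MESI state machine for a single cache line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* This core reads the line. */
static mesi_t on_local_read(mesi_t s, int other_caches_have_it)
{
    if (s == INVALID)                     /* miss: fetch the line */
        return other_caches_have_it ? SHARED : EXCLUSIVE;
    return s;                             /* M/E/S read hits stay put */
}

/* This core writes the line. */
static mesi_t on_local_write(mesi_t s)
{
    (void)s;  /* from I or S: first invalidate other copies (RFO);
                 from E: upgrade silently */
    return MODIFIED;
}

/* Another core's read snoop hits our copy. */
static mesi_t on_remote_read(mesi_t s)
{
    /* M must supply the data (cache-to-cache) and write it back */
    return (s == INVALID) ? INVALID : SHARED;
}

/* Another core's write snoop hits our copy. */
static mesi_t on_remote_write(mesi_t s)
{
    (void)s;
    return INVALID;                       /* our copy is stale now */
}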
44
https://www.slideshare.net/sumitmittu/aca2-07-new
45
Write-Through → memory speed
Write-Back → cache speed
Can we go faster and faster…?
L1 Cache With 4-Cycle Latency
But why should I "wait for 4 cycles" in the case of a write?
Post it! POSTED WRITE!!
(The core posts the "Write Packet Descriptor" store into a write buffer and moves on.)
47
How is the complexity now?
The data source is now the posted write buffer too: the posted buffer participates in data sourcing as well as in MESI cache coherency.
48
Shared Memory – Data Sources
•  From the local write buffer
•  From another core's write buffer
•  From the local cache
•  From another core's cache
•  From the shared cache
•  From shared memory
49
And you thought You will never see me again !
50
Coming to Packet Processing & Polled Mode Driver…
51
Shall we see a couple of use cases?
52
Use Case 1
Producer → Software Queue → Consumer
Question: What policy will you design? FIFO? LIFO? Why?
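Why would LIFO win here? The most recently freed buffer is the one most likely to still be hot in a core's cache, so handing it out again avoids misses. DPDK's per-lcore mempool cache behaves this way; below is a minimal free-list sketch in that spirit, not the actual DPDK code:

/* Per-core LIFO free list: the last buffer returned is the first one
 * handed out again, so it is still warm in this core's cache. */
#include <stddef.h>

#define CACHE_DEPTH 64

struct obj_cache {
    unsigned len;
    void *objs[CACHE_DEPTH];      /* treated as a stack: LIFO */
};

static void cache_put(struct obj_cache *c, void *obj)
{
    if (c->len < CACHE_DEPTH)
        c->objs[c->len++] = obj;  /* push: newest (hottest) on top */
    /* else: flush a batch back to the shared pool (omitted) */
}

static void *cache_get(struct obj_cache *c)
{
    if (c->len > 0)
        return c->objs[--c->len]; /* pop: reuse the hottest object */
    return NULL;                  /* refill from the shared pool (omitted) */
}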
53
LRU … MRU …. Where Are You?
54
Few NICs. Many Cores…
55
Question – Statistics Collection
Collective task? or
Individual task?
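The usual answer is to make statistics an individual task: each core updates only its own cache-line-aligned counters, and a control thread sums them lazily. Nothing is shared on the fast path, so there is no coherency ping-pong and no false sharing. A hedged sketch – sizes and names are illustrative:

/* Per-core statistics: each core writes only its own cache line. */
#include <stdint.h>

#define MAX_CORES   128
#define CACHE_LINE  64

struct core_stats {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
} __attribute__((aligned(CACHE_LINE)));   /* one line per core: no false
                                             sharing, no coherency traffic */

static struct core_stats stats[MAX_CORES];

static inline void count_rx(unsigned core_id, uint64_t n)
{
    stats[core_id].rx_pkts += n;          /* plain store, no lock needed */
}

/* Control plane: aggregate lazily, tolerating slightly stale values. */
static uint64_t total_rx(void)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        sum += stats[i].rx_pkts;
    return sum;
}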
56
Which Thread Gets Picked up by whom?
CPU’s Task Priority Register
57
Which Thread Gets Picked up by whom?
CPU's Task Priority Register
58
So Going back to the question
So, Collective task? or
Individual task?
59
With Thread Pinning, we avoid Sharing !
Same lcore for same NIC !
No need to Share !!
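In plain pthreads, the pinning looks like the sketch below (DPDK does the equivalent internally when you pass a core list such as -l 1-4 to rte_eal_init); pin_self_to_cpu() is an invented helper name:

/* Pin the calling thread to one CPU so the same lcore always services
 * the same NIC queue. Sketch; error handling abbreviated. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* After this, the scheduler never migrates us: our working set
     * stays in this core's private caches. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}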
60
With Thread Pinning, we avoid Sharing !
If Sharing is not needed, then why put it in memory?
Why go through shared Memory? Why?
Why not take it directly into Private Cache?
Why not bypass shared memory?
61
Familiar with the Bypass Road?
Why go through congested inner cities?
Why not bypass? Use the Bypass Road !!
62
You say Bypass… We say DDIO:
bypass memory, deliver I/O data directly into the cache.
63
With Polling and Thread Pinning, we avoid Sharing !
Do you? Really?
64
With RSS … back to the question -- responsibility
Collective task? or
Individual task?
65
Well, that is a special-case use case – RSS.
But for RSS, we are good with only Thread Local Storage – no need for shared data.
66
Well, that is a special-case use case – RSS.
Apart from that, we pin 1 core to 1 NIC – so no sharing !!
Is that so?
Really?
67
Classification – Cache Coherency Needed or Not?
68
Depends!!
Depends on What?
http://www.eetimes.com/document.asp?doc_id=1277622
69
Depends on
Static Classification
or
Dynamic Classification?
70
What about the Router Table? Is it a shared resource or a private, per-core resource?
71
What about the Router Table? Is it a shared resource or a private, per-core resource? Collective or individual?
Router Table – is it one table per system?
If so:
Who are all the writers? Who are all the readers?
How many writers? How many readers?
What about a 2-socket or 4-socket system? One table for each socket?
Coherency between the 2 or 4 tables in a multi-socket system?
Collective responsibility or individual responsibility?
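One common answer for such a read-mostly shared table: many readers, few writers, so guard it with a reader-writer lock – production data planes more often use RCU or per-core copies. A hedged sketch; the route-entry layout is invented for illustration:

/* Read-mostly shared routing table: lookups take the read lock,
 * updates take the write lock. Sketch only. */
#include <pthread.h>
#include <stdint.h>

#define MAX_ROUTES 1024

struct route { uint32_t prefix, mask, next_hop; };

static struct route table[MAX_ROUTES];
static unsigned     n_routes;
static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

static int lookup(uint32_t dst, uint32_t *next_hop)  /* many readers */
{
    int found = 0;
    pthread_rwlock_rdlock(&table_lock);
    for (unsigned i = 0; i < n_routes; i++)
        if ((dst & table[i].mask) == table[i].prefix) {
            *next_hop = table[i].next_hop;
            found = 1;
            break;
        }
    pthread_rwlock_unlock(&table_lock);
    return found;
}

static void add_route(struct route r)                /* few writers */
{
    pthread_rwlock_wrlock(&table_lock);
    if (n_routes < MAX_ROUTES)
        table[n_routes++] = r;
    pthread_rwlock_unlock(&table_lock);
}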
72
Multiple Writers – which will benefit?
Write-back cache? Or write-through cache?
73
What if you keep it “Dirty” and DMA control sneaks in?
74
Before we get too far…
75
In the case of siblings (hyper-threads), does each have a private cache of its own?
With siblings, how does Thread Local Storage get mapped?
76
How do Siblings Share Caches – say, L1 and L2 ?