4. Performance vs. Scalability
Performance: Response Time, Resource Use
Scalability: Adding more of a HW resource makes things faster
You can scale without having performance
(ex: HADOOP)
You can perform without having scalability
(ex: In Memory Engines)
5. Our Reasonably Priced Server
• 2 Socket Xeon E3645
• 2 x 6 Cores
• 2.4Ghz
• NUMA enabled, HT off
• 12 GB RAM
• 1 ioDrive2 Duo
• 2.4TB Flash
• 4K formatted
• 64K AUS
• 1 Stripe
• Power Save Off
• Win 2008R2
• SQL 2012
Image Source: DeviantArt
6. Between disk and Memory
[Diagram: four cores, each with its own L1 and L2 cache, sharing one L3 cache. Latency scale from memory to disk: 1ns, 10ns, 100ns, 10us, 100us, 10ms]
7. The “cache out curve”
Data Size
Throughput/thread
Cache Size
Service Time + Wait Time
9. There are several of these curves
Throughput
Touched
Data Size
CPU Cache
TLB
NUMA
Remote
Storage
10. Response time = Service Time + Wait Time
Algorithms
and
Data Structures
“Bottlenecks”
11. • DBA tasks
• Installation of OS and SQL
• Basic Memory Configuration
• Basic Perfmon style monitoring
• Backup/Restore and HA setup
• Basic reading a Query Plan
• Basic understanding of database
structures
• Adding Indexes to tables
• Running a Profiler trace
What you ALREADY know
13. What we Need
• Free tools from
MS
• Windows SDK
• In Win8: The
“ADK”
• Need .NET 4 to
install
14. Where Did the Time Go?
xperf -on Base -f Base.etl
SELECT TOP 100000 *
FROM LINEITEM
INNER JOIN ORDERS
ON O_ORDERKEY = L_ORDERKEY
SQLCMD -E -S. -i "Select.sql"
xperf -stop
18. Quantifying just how stupid XML is
SELECT TOP 1000000 *
FROM ORDERS
JOIN LINEITEM
ON L_ORDERKEY = O_ORDERKEY
FOR XML RAW ('OUTPUT')
xperf -on Base -f Base.etl
With XML
“Native” Format
19. Which CPU cycles are Expensive?
“App” tier
Web Server Licensing
>3K USD Blades
Database Tier
Core Licensing
>10K USD
<XML> ?
20. • What about the time
INSIDE the process?
• What if the EXE won’t
tell us?
Diving even Deeper
21. What is a Debug Symbol?
mov ax,10
mov bx,20
mov cx,3
push ax
push bx
push cx
call <address>
<address>
push bp
mov bp,sp
mov ax,[bp+8]
mov bx,[bp+6]
mov cx,[bp+4]
add ax,bx
div cx
mov dx,ax
ret
Header
doStuff(10,20,3)
…
int doStuff(int a, int b, int c)
{
return (a + b) / c;
}
myProg.exe
Machine Code
<address> = doStuff
Symbol table
myProg.pdb
22. Where do you get PDB files?
_NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
_NT_SYMCACHE_PATH=C:\SymCache
• Public Symbol Server
• Configure Environment
• Dbghelp.dll
23. • Auto Generated by Visual Studio:
Your Own Debug Symbols
24. • Symbols are indexed. Have to add them
Adding and Checking Your Symbols
Cd Bin/x64/Release/
symstore add /f *.pdb /s C:/Symbols /t "MyExe"
• Validate that the Symbols can resolve
Cd Bin/x64/Release/
symchk MyExe.exe /V
25. • Standard Xperf works fine
for your own native code
• BUT: Before Windows
8, stack walking is broken
for x64 .NET
• If you have .NET with 64
bit code, you must NGEN
first:
Got .NET and x64?
Ngen install Bin/x64/Release/MyExe.exe
(ngen lives here: %Windir%\Microsoft.NET\framework64\<Version>\Ngen.exe)
26. • Free tool from MS:
.NET tracing is a pain, get a tool!
• Not to be confused with xperfview
• Same trace API and file format
• Helps set obscure .NET specific trace flags
27. And Finally, You can do Very Cool Things
Did I tell you about interlocked
operations?...
Whiteboard time!
28. • Consider again our LINEITEM table
What is SQL Server REALLY doing?
• How expensive is it to read from that?
• Think ETL code and DW/BI queries
CREATE TABLE LINEITEM (
[L_ORDERKEY] [int] NOT NULL,
[L_PARTKEY] [int] NOT NULL,
[L_SUPPKEY] [int] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] [decimal](15, 2) NOT NULL,
[L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
[L_DISCOUNT] [decimal](15, 2) NOT NULL,
[L_TAX] [decimal](15, 2) NOT NULL,
[L_RETURNFLAG] [char](1) NOT NULL,
[L_LINESTATUS] [char](1) NOT NULL,
[L_SHIPDATE] [date] NOT NULL,
[L_COMMITDATE] [date] NOT NULL,
[L_RECEIPTDATE] [date] NOT NULL,
[L_SHIPINSTRUCT] [char](25) NOT NULL,
[L_SHIPMODE] [char](10) NOT NULL,
[L_COMMENT] [varchar](44) NOT NULL
)
BigSmall
Small
Big
OLTP BI/DW
Simulation ETL
29. SQLCMD – Native code Test
SQLCMD.EXE
Where does the time go?
30. Standard Reading of Data
xperf -on base -stackwalk profile -f stackwalk.etl
SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch"
55 sec
xperf -stop
xperf -merge stackwalk.etl stackwalkmerge.etl
33. An Educated guess about improvements
CREATE TABLE [dbo].[LINEITEM_native](
[L_ORDERKEY] [int] NOT NULL,
[L_PARTKEY] [int] NOT NULL,
[L_SUPPKEY] [int] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] money NOT NULL,
[L_EXTENDEDPRICE] money NOT NULL,
[L_DISCOUNT] money NOT NULL,
[L_TAX] money NOT NULL,
[L_RETURNFLAG] int NOT NULL,
[L_LINESTATUS] int NOT NULL,
[L_SHIPDATE] int NOT NULL,
[L_COMMITDATE] int NOT NULL,
[L_RECEIPTDATE] int NOT NULL,
[L_SHIPINSTRUCT] [char](25) NOT NULL,
[L_SHIPMODE] int NOT NULL,
[L_COMMENT] char(44) NOT NULL
)
CREATE TABLE [dbo].[LINEITEM](
[L_ORDERKEY] [int] NOT NULL,
[L_PARTKEY] [int] NOT NULL,
[L_SUPPKEY] [int] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] [decimal](15, 2) NOT NULL,
[L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
[L_DISCOUNT] [decimal](15, 2) NOT NULL,
[L_TAX] [decimal](15, 2) NOT NULL,
[L_RETURNFLAG] [char](1) NOT NULL,
[L_LINESTATUS] [char](1) NOT NULL,
[L_SHIPDATE] [date] NOT NULL,
[L_COMMITDATE] [date] NOT NULL,
[L_RECEIPTDATE] [date] NOT NULL,
[L_SHIPINSTRUCT] [char](25) NOT NULL,
[L_SHIPMODE] [char](10) NOT NULL,
[L_COMMENT] [varchar](44) NOT NULL,
)
Before After
34. Getting Rid of Useless Work
Additional parameters for SQLCMD:
-a32767 -W -s";" -f437
x1.5
36. Let's try that with Native and Unicode …
x5
37. • SQLNCLI is one of these in disguise
• ODBC
• OLEDB
• Pick good data types
• MONEY over NUMERIC
• UNICODE of data arrives like this
• Native protocols vs. flexibility
Summary Moving Data
38. • Get
• Windows 8 ADK
• Windows 7 SDK
• Set up Symbol Paths
• xperf -on Base
• Standard trace for time, narrow to process
and DLL/EXE
• xperf -on Base -stackwalk Profile
• Get to the call stack, find the offending
function(s)
• Ease of use for .NET: perfview.exe
Summary – Xperf
41. Loop Join
n row B-tree
Log(n) reads
Complexity: O(m * log(n))
m row result
1
43
13
7
3
42. Linked List Tree
Linked List vs. Tree
[Diagram: a linked list with nodes 0 to 8, searched in n steps, next to a binary search tree with root 8, searched in Log2(n) steps]
43. Cluster on O_ORDERKEY Index on O_ORDERKEY
Basic argument for Cluster Indexes
CREATE UNIQUE CLUSTERED INDEX CIX_Key
ON ORDERS_Cluster (O_ORDERKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Cluster
WHERE O_ORDERKEY = 3000000
CREATE UNIQUE INDEX IX_Key
ON ORDERS_Heap (O_ORDERKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Heap
WHERE O_ORDERKEY = 3000000
Table 'ORDERS_Heap'. Scan count 0, logical reads 3
, physical reads 0, read-ahead reads 0
Table 'ORDERS_Cluster'. Scan count 0, logical reads 4
, physical reads 0, read-ahead reads 0
44. Cluster on O_ORDERKEY heap + Index on O_ORDERKEY
But what if we do this a lot?
CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY)
WITH (FILLFACTOR = 100)
CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Heap
WHERE O_CUSTKEY = 47480
SELECT *
FROM ORDERS_Cluster
WHERE O_CUSTKEY = 47480
Table 'ORDERS_Cluster'. Scan count 1
, logical reads 27, physical reads 0
Table 'ORDERS_Heap'. Scan count 1
, logical reads 11, physical reads 0
45. How many LOOP joins/sec/core?
7 Sec
46. What did we just measure?
xperf -on Base -stackwalk profile
About 40%...
47. • The query
language itself
• Why so many
ExecuteStmt?
• …With so much
CPU use?
What is sqllang.dll?
48. A different way to Measure Loops
1 Sec
49. VS.
What does THAT look like?
Takeaway:
The T-SQL language
itself is expensive
50. • Sample from
LINEITEM
• Force loop join with
index seeks
• Do 1.4M seeks
Test: Singleton Row Fetch
51. Singleton seeks – Cost of compression
Compression | Seek (1.4M seeks) | CPU Load
None - Memory | 13 sec | 100% one core
PAGE - Memory | 24 sec | 100% one core
None - I/O | 21 sec | 100% one core
PAGE - I/O | 32 sec | 100% one core

Function | % Weight
CDRecord::LocateColumnInternal | 0.82%
DataAccessWrapper::DecompressColumnValue | 0.47%
SearchInfo::CompareCompressedColumn | 0.28%
PageComprMgr::DecompressColumn | 0.24%
AnchorRecordCache::LocateColumn | 0.18%
ScalarCompression::AddPadding | 0.04%
ScalarCompression::Compare | 0.11%
Additional Runtime of GetNextRowValuesInternal | 0.14%
Total Compression | 2.28%
Total CPU (single core) | 8.33%
Compression % | 27.00%

xperf -on base -stackwalk profile
55. Merge Join
m row result
1
1
2
3
n row result
1
2
3
4
4
43
43
Sorted
Sorted
Complexity: O(m + n)
56. Merge Join – What is Fastest?
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM LINEITEM
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = L_ORDERKEY
…or
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM ORDERS
INNER MERGE JOIN LINEITEM
ON O_ORDERKEY = L_ORDERKEY
59. We can beat SQL Server at this game
SELECT MAX(O_ORDERDATE), MAX(MAX_P)
FROM
(SELECT L_ORDERKEY,MAX(L_PARTKEY) AS MAX_P
FROM LINEITEM
GROUP BY L_ORDERKEY) b
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = b.L_ORDERKEY
60. Hash Join
m row result
1
43
13
7
n row join table
Hash(1)
n row hash table
Complexity: O(m + 2n)
3
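The build-then-probe idea behind the hash join can be sketched in a few lines. This is a minimal illustration in Python, not SQL Server's implementation; the function name `hash_join` and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict

def hash_join(probe_rows, build_rows):
    """Join two lists of (key, payload) tuples on key."""
    # Build phase: one pass over the n-row input fills the hash table.
    table = defaultdict(list)
    for key, payload in build_rows:
        table[key].append(payload)
    # Probe phase: one pass over the m-row input, one lookup per row.
    result = []
    for key, payload in probe_rows:
        for match in table.get(key, []):
            result.append((key, payload, match))
    return result

# Hypothetical sample data, loosely following the keys on the slide.
orders = [(1, "o1"), (3, "o3"), (7, "o7"), (13, "o13"), (43, "o43")]
lineitems = [(1, "a"), (43, "b"), (13, "c"), (7, "d"), (3, "e"), (3, "f")]
joined = hash_join(lineitems, orders)
print(joined)
```

The sketch assumes the hash table fits in memory, so it does not model the spill case.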
61. When Hash Joins hurt you
[Chart: Runtime (seconds, 0 to 30) against Hash Memory (MB, 0 to 400), with a region marked "Spill Zone!"]
67. What LATCH pattern do we see?
GetNextRangeForChildScan
Inside:
TableScanNew
68. • Partition the table by a
“random” value
• Modulo the Key for
example
• Use SQL Server partition
function/scheme
The Fix?…
[Diagram: a hash of the key maps each row to one of 256 partitions, numbered 0 to 255]
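The "modulo the key" idea can be tried out away from the database. The sketch below is only an illustration in Python; `partition_for` and the choice of 10240 sequential keys are assumptions, and in SQL Server the same mapping would live in a partition function.

```python
from collections import Counter

PARTITIONS = 256  # matches the 0..255 range shown on the slide

def partition_for(key, partitions=PARTITIONS):
    """Map a key to a partition number by taking it modulo the partition count."""
    return key % partitions

# Sequential keys, as a monotonically increasing key column would produce them.
keys = range(1, 10241)
spread = Counter(partition_for(k) for k in keys)
print(len(spread), min(spread.values()), max(spread.values()))
```

Sequential keys land evenly on all partitions, so concurrent scans or inserts no longer queue up on the same structure.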
74. Goals:
• Compressed
• Prefetch Friendly
• Cache Resident Code
Example, Column Stores
ID Value
1 Beer
2 Beer
3 Vodka
4 Whiskey
5 Whiskey
6 Vodka
7 Vodka
ID Customer
1 Thomas
2 Thomas
3 Thomas
4 Christian
5 Christian
6 Alexei
7 Alexei
Product Customer
ID Date
1 2011-11-25
2 2011-11-25
3 2011-11-25
4 2011-11-25
5 2011-11-25
6 2011-11-25
7 2011-11-25
Date
ID Sale
1 2 GBP
2 2 GBP
3 10 GBP
4 5 GBP
5 5 GBP
6 10 GBP
7 10 GBP
Sale
75. Compression is Easy
ID Value
1-2 Beer
3 Vodka
4-5 Whiskey
6-7 Vodka
ID Customer
1-3 Thomas
4-5 Christian
6-7 Alexei
Product’ Customer’
ID Date
1-7 2011-11-25
Date’
ID Sale
1-2 2 GBP
3 10 GBP
4-5 5 GBP
6-7 10 GBP
Sale’
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
RL Customer
3 Thomas
2 Christian
2 Alexei
Product’ Customer’
RL Date
7 2011-11-25
Date’
RL Sale
2 2 GBP
1 10 GBP
2 5 GBP
2 10 GBP
Sale’
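The run-length (RL) encoding shown in these tables is simple enough to write down. A minimal Python sketch follows, assuming each column is a plain list of values; `rle_encode` and `rle_decode` are names invented for the illustration.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive equal values into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (run_length, value) pairs back into the original column."""
    return [value for length, value in pairs for _ in range(length)]

# The Product column from the slide.
product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
encoded = rle_encode(product)
print(encoded)
```

Note that Vodka appears twice in the encoded column: only adjacent repeats are merged, which is why sort order matters for the compression ratio.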
77. RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
RL Customer
3 Thomas
2 Christian
2 Alexei
Product’ Customer’
2 steps with Beer
2 steps with Thomas
Beer Thomas
Beer Thomas
SELECT Product, Customer FROM Table
1 step with Vodka
1 step with Thomas
Vodka Thomas
2 steps with Whiskey
2 steps with Christian
Whiskey Christian
Whiskey Christian
2 steps with Vodka
(Note: Repeated value)
2 steps with Alexei
Vodka Alexei
Vodka Alexei
78. Hash Joining with Column Stores
RL Key
2 Beer
1 Vodka
2 Whiskey
2 Vodka
Table
Key Type
Beer Soft
Vodka Strong
Whiskey Strong
Vodka Strong
Dim Product
SELECT …
FROM Table
JOIN DimProduct ON Key
WHERE Type = 'Soft'
1 Compute bloom filter of Keys belonging to 'Soft'
2 Read RL = 2, Beer from Table
3 Compute bloom value of Beer
4 Equal to filter value from 1? Yes. Output two rows (RL=2)
5 Compute bloom value for Vodka
6 Equal to filter value from 1? No. Do nothing
7 Compute bloom value for Whiskey
8 Equal to filter value from 1? No. Do nothing
Can prefetch data (news RLE)
Can calculate match/no
match using only local CPU
cache
Won't work for OLTP!
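The walkthrough above can be approximated in code: build a small bloom filter over the dimension keys that pass the filter, then probe it once per run rather than once per row. This is a toy Python sketch, not the column store engine's algorithm; the 64-bit filter, the two hash positions taken from SHA-256 and the function names are all assumptions.

```python
import hashlib

BITS = 64  # size of the toy bit array (an assumption for the example)

def bloom_bits(key):
    """Return a bit mask with up to two bits set, derived from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return (1 << (digest[0] % BITS)) | (1 << (digest[1] % BITS))

def filtered_join(fact_rle, dim, wanted_type):
    """fact_rle: list of (run_length, key); dim: dict mapping key to type."""
    wanted = {k for k, t in dim.items() if t == wanted_type}
    bloom = 0
    for k in wanted:                       # step 1: build the filter
        bloom |= bloom_bits(k)
    out = []
    for run_length, key in fact_rle:       # one probe per run, not per row
        bits = bloom_bits(key)
        if (bloom & bits) != bits:
            continue                       # definitely not a match
        if key in wanted:                  # exact check removes false positives
            out.extend([key] * run_length)
    return out

table = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]
dim_product = {"Beer": "Soft", "Vodka": "Strong", "Whiskey": "Strong"}
print(filtered_join(table, dim_product, "Soft"))
print(filtered_join(table, dim_product, "Strong"))
```

A bloom filter can report false positives but never false negatives, which is why the sketch keeps an exact check after the cheap probe.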
79. Why is it so hard to get joins right?
n
m
Time
Loop Join
Merge Join
Hash Join
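To see why the choice is hard, it helps to plug numbers into the complexity formulas given on the earlier join slides: O(m * log(n)) for loop, O(m + n) for merge, O(m + 2n) for hash. The Python sketch below is a rough model only; it ignores constant factors, the cost of sorting inputs for a merge join and hash memory, and the function names are made up for the example.

```python
import math

def loop_join_cost(m, n):
    return m * math.log2(n)      # m seeks into an n-row B-tree

def merge_join_cost(m, n):
    return m + n                 # one pass over each sorted input

def hash_join_cost(m, n):
    return m + 2 * n             # build the n-row table, then probe with m rows

def cheapest(m, n):
    costs = {
        "loop": loop_join_cost(m, n),
        "merge": merge_join_cost(m, n),
        "hash": hash_join_cost(m, n),
    }
    return min(costs, key=costs.get)

print(cheapest(10, 1_000_000))         # few probes into a big table
print(cheapest(1_000_000, 1_000_000))  # two large inputs
```

The winner flips as m and n change, so a plan that is right for one row estimate can be badly wrong for another.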
80. Controlling Joins
Desired Join: LOOP
Join Hint: [INNER | LEFT | CROSS | FULL] LOOP JOIN
Query Hint: OPTION (LOOP JOIN)
Desired Join: MERGE
Join Hint: [INNER | LEFT | CROSS | FULL] MERGE JOIN
Query Hint: OPTION (MERGE JOIN)
Desired Join: HASH
Join Hint: [INNER | LEFT | CROSS | FULL] HASH JOIN
Query Hint: OPTION (HASH JOIN)
Desired Join: LOOP with Seek
Join Hint: WITH FORCESEEK or WITH ( INDEX (index = <name>) )
Query Hint: N/A
Note: Join hints force the order of the ENTIRE join tree!
81. What Type of Workload?
BigSmall
Small
Big
DataReturned
Data Touched
OLTP BI/DW
Simulation ETL
82. How to Classify?
OLTP BI/DW
Simulation ETL
Full Scan/sec
Range Scans/sec
Probe Scans/sec
Index Search/sec
Range Scans/sec
Full Scan/sec
Range Scans/sec
Bulk Copy Rows/sec
?
83. There should ALWAYS be a fully
indexed path to the data.
OLTP System Basic Query Pattern
BigSmall
Small
Big
OLTP BI/DW
Simulation ETL
84. 1. Find worst CPU consuming query with
sys.dm_exec_query_stats
2. Add OPTION (LOOP JOIN) to offending
query
3. Check estimated query plan
4. If table spool found: add index to
remedy and GOTO 3
5. Happy? If not, GOTO 1
The Super Quick OLTP Tuning Guide
85. The query will not be (much)
worse than a full scan of a fact
partition
DW/BI System Basic Query Pattern
BigSmall
Small
Big
OLTP BI/DW
Simulation ETL
86. 1. Find offending query
2. Add OPTION (HASH JOIN) to query
3. Do dimension tables have an indexed path
to build hash? If not, add index
4. Do you get a fact table scan and hash
build of all dimensions? If not, check
statistics (especially on facts and skewed)
5. Optimize Fact table scans
1. Partition and partition elimination
2. Column store if you have it
3. Aggregate Views
4. Bitmap index pushdown (statistics!)
5. Composite indexes (last resort!)
The Super Quick DW tuning Guide
87. The expected DW Query Plan
Partial
Aggregate
Fact CSI Scan
Dim Scan
Dim Seek
Batch
Build
Batch
Build
Hash
Join
Hash
Join
HashStream
Aggregate
88. • At least enough RAM to hold the hash
tables of the largest dimension
• De-normalisation helps… a LOT
• Especially for the large/large joins
• Likely: need to scan fast from disk if
RAM is not big enough to hold the fact
• Compression REALLY matters
Things that Follow from desired DW Plan
91. Where EVERY Server wide diagnosis starts
SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM
#ignorewaits)
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
92. • Shows up as waits for PAGEIOLATCH
• You can dig into details with:
Common Problems - PAGEIO
SELECT *
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)
• Can also Xevent your way to it per
query
CREATE EVENT SESSION [TraceIO] ON SERVER
ADD EVENT
sqlserver.file_read_completed(
ACTION (sqlserver.database_id,sqlserver.session_id))
93. • I/O, like memory, is a GLOBAL resource
for the machine
• When does it make sense to partition a
global resource?
• When you deeply know the workload
• When the workload is ALREADY partitioned
• When neither of those are true: DON’T
partition
• If you have NAND/SSD – Why bother?
The general I/O Guidance
97. OLTP
• One big SAME setup
• data files
• Tempdb
• Dedicate
• Transaction log
• DRAM:
• Enough to hold most of
DB
Data Warehouse
• JBOD setup
• Data Files
• 1-2 per LUN
• SAME setup
• Tempdb
• Dedicate
• Transaction Log
• DRAM:
• Enough to hold largest
partition of largest table
Rules of Thumb – Spindle I/O and DRAM
98. • Short Stroking
• Elevator Sort
• Sequential vs.
Random
• Weaving
You can do a bit better… or worse
99. • Intentionally use
lower % of total
space
• Tradeoff:
• Space for Speed
• Test:
• 15K rpm
• SAS spindle
• 300GB
Short Stroking Disks
[Chart: IOPS (150 to 400) against % Capacity Used (0% to 100%)]
100. Full Stroked Short Stroked
Why does Short Stroking Work?
Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the
disk won’t use the full platter either. The result: less head movement
102. Why Chase Sequential I/O?
[Chart: 8K Block Pattern. IOPS (log scale, 1 to 100000), Avg Latency and Max Latency (ms, 0 to 80). Category labels: Sequential Full Stroke Random]
103. • One SATA disk
• Two partitions
• One file on each
• Sequential read on
each file
But all is not well!
File1 File2
104. I/O Weaving in action
[Chart: IOPS (0 to 300) and Avg Latency (ms, 0 to 18) for 64K Random and 64K Dual Sequential]
Source: Michael Anderson
105. Storage Pool and Weaving
DataLog DataLog DataLog
Massive, then Provisioned Pool
Seq
Ran
Seq
Ran
Seq
Ran
RANDOM!
106. The SAN will properly handle Sharing!
Green: Checkpoint, Red: tx/sec, Black: Disk Latency
107. Numbers to Remember - Spindles
Characteristic | Typical Units
Throughput / Bandwidth | 90-125 MB/sec, but ONLY if sequential access!
Operations per Sec | 10K RPM Spindle: 100-130 IOPS; 15K RPM Spindle: 150-180 IOPS. Can get about 2x if short stroking (more later)
Latency | 3-5 ms (compare DRAM: 100ns)
Capacity | 100s of GB to single digit TB
2012 numbers, will change in future
108. • Few hundreds of IOPS
• Faster if short stroked
• Trade latency for speed with elevator
sort
• Sequential is hard to get right
Summary so far.. Single Disk
109. • Wider Stripes neat
• But scale not linear
• Very deep queues
help
• But add latency
• Shared
Components
Why does a big RAID pile not solve this?
111. Before After
Getting rid of Sharing
Switch
HBA HBA HBA HBA
Storage
Port
Storage
Port
Switch
LUN LUN
Cache
Disk
CPU
Switch
HBA HBA HBA HBA
Storage
Port
Storage
Port
Switch
LUN LUN
Cache
Disk
CPU
x2
112. NAND Flash Basics
[Diagram: a floating-gate cell (Control Gate, Oxide Layer, Floating Gate with electrons trapped, on an N-P-N substrate). Labels: NAND Die, Pack, Blocks, and Pages of 4K each]
113. NAND Flash Problems
• Erase Cycles
• Around 100K
• Rebalancing and reclaim/trim
• Voltage measurement
• Gets worse with density
• Changes over time
• Depends on how you program
• Bit Rot
• Must refresh even on read
• SLC easier to manage than MLC
• But much more expensive!
Voltage
00
01
10
11
116. • Only partially diagnosed as waits in
sys.dm_os_wait_stats
• Task Manager gives a bit more
information
• Need: transparency to the deep level
latencies and packets!
Common Problems: ASYNC_NETWORK, OLEDB
117. A common Wait Type
The database is really
slow! The code takes
forever to run!
118. • We may not always have insight into
what is going on at the client…
Xperf Diagnosing the Network
xperf -on latency+network
Summary
Table
122. Short Story on DPC/ISR handling
CPU
Core
Core
L1-L3
Cache
PCI
BUS
IRQ
HALT execution
Fire ISR Routine
if (my interrupt)
{
<Mark Handled>
Queue DPC
}
NIC
Work Done
DPC
<Do work needed>
<Wake Application>
Core can
run other stuff
again
123. It looks like this…
DPC
ISR
124. • Option 1: Use the HW vendor's tool
• Option 2: Use the Interrupt Affinity Policy Tool
from MS
Setting Interrupt Affinity
125. Jumbo Frame and SQL Packets
• Standard Payload
Network (MTU):
• 1500 B
• Jumbo Frames
• 9014 B (MTU)
• Standard SQL
payload
• 4096 B
• Largest
• 32767 B
SELECT session_id, net_packet_size
FROM sys.dm_exec_connections
Server=foo;Packet size=32767
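A quick calculation shows what the packet size setting buys. The Python sketch below counts packets for a hypothetical 100 MB result set; it ignores protocol headers and assumes every packet is filled completely, so treat the numbers as rough.

```python
def packets_needed(result_bytes, packet_size):
    """Packets needed to ship a result set, ignoring header overhead."""
    return -(-result_bytes // packet_size)  # integer ceiling division

result_bytes = 100 * 1024 * 1024  # a hypothetical 100 MB result set
for size in (4096, 32767):
    print(size, packets_needed(result_bytes, size))
```

Fewer packets means fewer round trips through the network stack, and with it fewer interrupts and DPCs on the receiving core.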
127. Core Evolution
Moore’s “Law”:
“The number of transistors per
square inch on integrated
circuits has doubled every
two years since the
integrated circuit was
invented”
128. • Never faster than a single core
• Smaller servers are faster than bigger ones
• Large L2 caches and more clock speed help
• The algorithm dictates speed
• Latency of Wait Time sets upper limit
• Examples from MSSQL land:
• Formula Engine in MSAS
• Transaction Log Writes
• INSERT/UPDATE/DELETE (as we shall see)
Single Threaded
129. VLF files
• When switching to new VLF – it has to be ”formatted” with
8K sync write
• While this happens, transactions are blocked
• Too many VLF = Too much blocking
• Lesson: Preallocate the database log file in big chunks
• Up to 128 Log Buffers per database
• Spawned on demand, will not be released once spawned
• Transactions will wait for LOGBUFFER if no buffer is available
• Think of this like a pipeline of commits waiting…
[Diagram: a log file of VLF(1) to VLF(6), each starting with an 8K block; up to 128 log buffers of <=60K each]
131. • Speed is determined by Latency and
Code Path
• Max Log Write Size: 60K
Zooming to the Log Writer
Log Writer
Async I/O Completion Port
Signal thread which
issued commit
Latency
Writer Queue
132. Long Distance Replication…
Log Entry Log Entry
Network
Log Entry
Send log
Ack Log
Primary Secondary
Write Write
Executive Summary:
The speed of light ( c )
is not fast enough!
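The executive summary can be backed with arithmetic. The sketch below, in Python, assumes a synchronous commit has to wait for one full round trip to the secondary and uses the speed of light in vacuum, which is optimistic since signals in fibre travel slower; the 1000 km distance is a made-up example.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum; real networks are slower

def round_trip_ms(distance_km):
    """Best-case time for a log entry to reach the secondary and the ack to return."""
    return 2 * distance_km * 1000 / SPEED_OF_LIGHT_M_PER_S * 1000

def max_serial_commits_per_sec(distance_km):
    """Upper bound on back-to-back commits when each one waits for a round trip."""
    return 1000 / round_trip_ms(distance_km)

print(round(round_trip_ms(1000), 3))             # milliseconds for 1000 km
print(round(max_serial_commits_per_sec(1000)))   # commits per second
```

Even in this best case, a single stream of dependent commits over 1000 km cannot exceed roughly 150 per second, before any disk or code path latency is added.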
133. • Perfmon will only show millisec
• What if we want microseconds?
Getting to the Real Latency
xperf -on latency
134. It’s in Memory, so it must be fast?
VS.
Latency: 15-30us Latency: <5us
RAM DISK
1.5sec 1.5sec
136. The Effect on UPDATE
Naïve
UPDATE MyBigTable
SET c6 = 43
Parallel
UPDATE MyBigTable
SET c6 = 43
WHERE key
BETWEEN 10**9 * n
AND 10**9 * (n+1) -1CX
Runtime
(smaller is faster)
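The parallel variant works by giving each worker n its own key range. A small Python sketch of how those ranges can be generated follows; `key_range` and `update_statement` are hypothetical helpers, and the range size of 10**9 is taken from the slide.

```python
RANGE_SIZE = 10**9  # keys per worker, as on the slide

def key_range(n, range_size=RANGE_SIZE):
    """Return the inclusive (low, high) key range for worker n."""
    low = range_size * n
    high = range_size * (n + 1) - 1
    return low, high

def update_statement(n):
    """Build the UPDATE text for worker n (illustration only)."""
    low, high = key_range(n)
    return f"UPDATE MyBigTable SET c6 = 43 WHERE key BETWEEN {low} AND {high}"

for n in range(3):
    print(update_statement(n))
```

Because BETWEEN is inclusive on both ends, the `- 1` on the upper bound is what keeps neighbouring workers from touching the same row.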
139. Amdahl’s Law of gated speedup
[Chart: Speedup Factor (1 to 31) against Number of cores (0 to 64), one curve each for P = 100%, 95%, 90% and 80%]
P = Part of program that can be made Parallel
(Note that this may be 0... or 1)
N = Number of CPU cores available
$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$
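Amdahl's Law states that the speedup on N cores is 1 / ((1 - P) + P / N), where P is the parallel fraction. A short Python sketch makes the curves concrete; the function name is an assumption, and 64 cores is chosen to match the right edge of the chart.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the work can run in parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (1.0, 0.95, 0.90, 0.80):
    print(f"P = {p:.0%}: {amdahl_speedup(p, 64):.1f}x on 64 cores")
```

With only 5% serial work, 64 cores deliver about a 15x speedup, which is why removing the last shared bottleneck matters so much.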
141. But those rows have to be stored…
Table A
Table B
Table C
LCK
LCK
LCK
LCK
LCK
LCK
LCK
LCK
Data
File
File
Group
142. It all Starts with Wait Stats
SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT
wait_type FROM #ignorewaits)
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
DBCC PAGE
143. PFS – Hidden Single Page Contention
Data File
GAM/
SGAM
PFS
64MB
PFS PFS
64MB
PFS
64MB
PFS
B B B B
B B B B
B B B B
B B B B
8K
10010010
INSERT TableA …
Allocated bit
145. How many more Files?
[Chart: Runtime (260 to 400) and PAGELATCH_UP waits (log scale, 1 to 10000000) against # Data Files (0 to 48)]
146. • Shared, physical MEMORY structures
can cause bottlenecks (ex: PFS)
• SQL Server must sync too…
• Understanding where structure resides
leads to tuning fix
• Theory of engine!
Concurrency: What we learned so far
147. • Commonly misdiagnosed
• CXPACKET does NOT (always) mean
that your DOP is “too high”
CXPACKET
[Chart: CXPACKET waits / Throughput. CXPACKET Waits (0 to 200,000,000) against Throughput (MB/sec, 10 to 40)]
[Chart: Throughput / DOP. Throughput (MB/sec, 0 to 50) against DOP (1 to 41)]
149. • What happens when you get things like:
LATCH_<x>
PAGELATCH_<x>
Step 1: Dig into:
Diagnosing Latches
SELECT *
FROM sys.dm_os_latch_stats
156. Before
ALTER TABLE HotUpdates
ADD Padding CHAR(5000)
NOT NULL DEFAULT ('X')
After
UPDATE Hack on Small Tables
Page (8K)
ROW
LCK_U
PAGELATCH_EX
CHAR(5000)
Page (8K)
ROW
ROW
ROW
LCK_U
LCK_U
PAGELATCH_EX
157. Test: Updates of pages
Compression Update 1.4M CPU Load
None - Memory 13 sec 100% one core
PAGE - Memory 54 sec 100% one core
None – I/O 17 sec 100% one core
PAGE – I/O 59 sec 100% one core
L_QUANTITY is NOT NULL
i.e. in place UPDATE
160. How long are locks held?
[Chart: Lock Held Cycle Count. CPU KCycles (0 to 600), Avg and StdDev, for PAGE and NONE compression]
161. • Sharing is generally bad for scale (but
may be good for performance)
• PAGELATCH and LATCH diagnosis starts
in sys.dm_os_latch_stats
• CXPACKET
• Only important if throughput drops when
DOP goes up
• If this happens, look for another wait/latch
• Table partitioning can be used to work
around concurrency issues
Summary Concurrency – So Far..
162. The Paul Randal INSERT test
160M rows, executing at concurrency
Commit every 1K:
EASY
tuning?
166. And the Score Is…
[Chart: Time in Seconds (0 to 35000) for newguid(), newsequentialid() and IDENTITY keys]
167. What is going on here???
Min
Min
Min Min
Min
Min
Min
Min Min
Min
HOBT_ROOT
Max
168. Tricks to Work Around this
[Diagram: the key space is split into ranges 0 - 1000, 1001 - 2000, 2001 - 3000 and 3001 - 4000, with one INSERT stream per range]
169. All Cores at 100%
[Chart: Runtime in Seconds (0 to 35000) for newguid(), newsequentialid(), IDENTITY, IDENTITY+Unique, IDENTITY+Unique+Hash8, IDENTITY+Hash24, IDENTITY+Hash48 and SPID+Offset. Annotations: 600K Inserts/sec, 830K Inserts/sec, All Cores at ~100%]
170. • Don’t use Sequential Keys
• Page Splitting isn’t so bad
• Neither are GUID
• Generate keys wisely. Ideally in the app
server
• For “transparent” speedup, consider our
old hash trick
Takeaways, INSERT workload
171. • Minimally Logged
• Single, large
execution
(thousands)
• Unsorted data
• Concurrent Loaders
BULK INSERT Workload
Heap
Bulk Insert
Bulk Insert
172. Measure:
SELECT * FROM
sys.dm_os_latch_stats
Observe waits on
ALLOC_FREESPACE_CACHE
Theory (just read BOL):
“Used to synchronize the access to
a cache of pages with available
space for heaps and binary large
objects (BLOBs). Contention on
latches of this class can occur
when multiple connections try to
insert rows into a heap or BLOB at
the same time. You can reduce
this contention by partitioning the
object.”
When does BULK INSERT scale break?
[Chart: MB/Sec (0 to 250) against Concurrent BULK INSERT (0 to 30), with three regions marked 1, 2 and 3]
173. What is Happening here?
Free Page information (PFS/GAM/SGAM)
HOBT Cache
Fat
Chunks
Alloc
new
pages!
Bulk Insert
ALLOC_FREESPACE_CACHE
This is in DRAM
and L2
174. • Break Up table
by “some key”
• Optional: Switch
out partitions
• Spin up multiple
bulks
• Linear scale
• 3GB/sec
• 16M
LINEITEM/sec
Breaking Through the Bottleneck
425
555
215
200
101
453
666
Area
Bulk Insert
Bulk Insert
Bulk Insert
175. BULK INSERT - Reloaded
• Thomas, you might have gotten 16M
rows/sec at 3GB/sec insert speed
• But this was on heaps, I have a clustered
table
• Alright then, let us hit a cluster index
1-1000
Clustered and partitioned
1001-2000
2001-3000
3001-4000
X Lock
X Lock
X Lock
X Lock
180. • Context Switching is expensive
• Typically 10K or more CPU cycles
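• If you expect the resource to be held
only shortly, why fall asleep?
What is a Spinlock?
// Conceptual only: a real spinlock needs an atomic test-and-set
void spin_acquire(volatile int* s)
{
while (*s == 1) { /* spin until the holder releases */ }
*s = 1;
}
void spin_release(volatile int* s)
{
*s = 0;
}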
181. • Acquire can be very expensive
• SQL Server implements a backoff
mechanism
What is a backoff?
void spin_acquire(volatile int* s)
{
int spins = 0;
while (*s == 1)
{
spins++;
if (spins > threshold)
{
<Sleep and WaitForResource>
}
}
*s = 1;
}
SELECT *
FROM sys.dm_os_spinlock_stats
DBCC SQLPERF(spinlockstats)
Backoff
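The spin-then-back-off pattern can be tried out in a few lines. The Python sketch below is an illustration only and says nothing about how SQL Server implements its spinlocks: a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set, and the class name, threshold and sleep time are assumptions.

```python
import threading
import time

class BackoffSpinLock:
    def __init__(self, spin_threshold=100, sleep_seconds=0.0001):
        self._flag = threading.Lock()   # stands in for an atomic test-and-set
        self.spin_threshold = spin_threshold
        self.sleep_seconds = sleep_seconds
        self.backoffs = 0               # approximate count, updated without a lock

    def acquire(self):
        spins = 0
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins > self.spin_threshold:
                self.backoffs += 1
                time.sleep(self.sleep_seconds)  # back off instead of burning CPU
                spins = 0

    def release(self):
        self._flag.release()

def run_workers(lock, thread_count, iterations):
    """Increment a shared counter under the lock from several threads."""
    counter = {"value": 0}
    def work():
        for _ in range(iterations):
            lock.acquire()
            try:
                counter["value"] += 1
            finally:
                lock.release()
    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]

print(run_workers(BackoffSpinLock(), 4, 1000))
```

The threshold is the tuning knob: spin too briefly and you pay for context switches, spin too long and waiting threads waste CPU cycles.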
183. WRITELOG is I/O – right?
Should be the same as this… or?
No! Because:
184. • Step 1: Copy sqlservr.pdb to the BINN
directory
• Step 2: DBCC TRACEON (3656, -1)
• Step 3: Steal script from:
http://www.microsoft.com/en-us/download/details.aspx?id=26666
Note for 2012, you additionally need:
• sqlmin.pdb, sqllang.pdb, sqldk.pdb
Diagnosing a Spinlock the Cool way!
185. Spinlock Walkthrough – Extended Events Script
--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (
action (package0.callstack)
where type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
set filtering_event_name = 'sqlos.spinlock_backoff',
source_type = 1, source = 'package0.callstack')
with (MAX_MEMORY = 50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Run this section to measure the contention
alter event session spin_lock_backoff on server state = start

--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)

--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count,
cast(target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'

--Clean up the session
alter event session spin_lock_backoff on server state = stop
drop event session spin_lock_backoff on server
187. How to improve a spinlock?
CPU
Core
Core
L1-L3
Cache
CPU
Core
Core
L1-L3
Cache
spin_acquire
Int s
spin_acquire
Int s
spin_acquire
Int s
Transfer cache line
Transfer cache line
CPU CPU
191. Bulking at Concurrency
• What’s that spin?
xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl
SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins_count
DBCC SQLPERF (spinlockstats)
?
192. SOS_OBJECT_STORE at high INSERT
• Observed: This Spin happens when
inserting
• Need: Reduce locking overhead
• Fixes that work well here:
8x
throughput
Bonus
193. • Let's try something really silly:
• Run lots of: EXEC emptyProc
• This should be infinitely scalable, right?
Diagnosing another Spinlock
CREATE PROCEDURE emptyProc
AS
RETURN
199. DECLARE @ParmDef NVARCHAR(500)
DECLARE @sql NVARCHAR(500)
SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
(c1, c2, c3, c4,c5,c6)
VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5
INT, @p6 BIGINT'
DECLARE @constDate DATETIME = '1974-12-22'
DECLARE @i INT
WHILE (1=1) BEGIN
BEGIN TRAN
SET @i = 1
WHILE @i <= 1000 BEGIN
EXEC sys.sp_executesql @sql, @ParmDef
, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
SET @i = @i + 1
END
COMMIT TRAN
END
Consider this Test harness code…
200. Spinning on MUTEX
Diagnosing with the trace flag shows the
offending stack:
CSecurityContext::GetUserTokenFromCache
This is REALLY expensive at scale:
WHILE @i <= 1000 BEGIN
EXEC sys.sp_executesql @sql,
SET @i = @i + 1
END
This initializes a new execution context on
every loop!
201. Fixing the MUTEX spin
• Instead of:
WHILE @i <= 1000 BEGIN
EXEC sys.sp_executesql @sql,
SET @i = @i + 1
END
• Write:
SET @sql = N'
DECLARE @i INT
WHILE (1=1) BEGIN
BEGIN TRAN
SET @i = 1
WHILE @i <= 1000 BEGIN
INSERT INTO dbo_<t>.MyBigTable_<t> WITH
(TABLOCK)
(c1, c2, c3, c4,c5,c6)
VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
SET @i = @i + 1
END
COMMIT TRAN
END'
EXEC sys.sp_executesql @sql, @ParmDef
4x
throughput
Bonus
202. • When all other bottlenecks are
gone, sharing happens in the most
unlikely places
• You can use spinlock Xevents inside SQL
Server
• Remember symbol files in BINN
• Trace flag 3656
• This can also be done in XPERF for non
SQL apps
• Ex: Analysis Services
Concurrency, Spinlock Summary
203. • Control of buffers and NUMA for Xperf
setting
• By default:
• 4MB mem
• Spool to disk at root of C-drive
• Can do buffer/file control:
• -buffersize and -maxbuffers
• -maxfile and -FileMode Circular
Xperf controlling buffers
204. • Round robin between NUMA nodes
• Inside the NUMA: Pick the one that
looks the least busy
• This is NOT a perfect system
How SQL Server assigns threads
206. • All the tuning won't help you if your
model is wrong
• Tuning gets you far, but to really
scale, you need a good data model
• This is what my other courses are about
But does the Data Model Work?
214. What if…
• Push
• Seek first value
page
• UPDATE Reference
Count
• Pop
• Seek last value
page
• UPDATE Reference
Count
Min Max
Msg++
Min Max
Msg--
218. Summing Up Message Queue Hack
• UPDATE
• instead of INSERT/DELETE
• More partitions = More
B-Trees
• Ring buffer using modulo
• Find Sweet spot
concurrency
Editor's notes
For a great introductory course I recommend the Paul Randal course found here: http://www.sqlskills.com/T_ImmersionInternalsDesign.asp
To get a good runtime, we up the count of rows to 1M
Hint: NGEN lives in %Windir%\Microsoft.NET\framework64\<Version>
Doc on NGEN: http://msdn.microsoft.com/en-us/magazine/cc163610.aspx
Get perfview here: http://www.microsoft.com/en-us/download/details.aspx?id=28567
Different data structures have different time complexities that lend themselves to more or less efficient service times.
Concurrency of JOIN even when single threaded
The B+ tree is a data structure that seeks to block fetch large areas of data (typically, but not always 8K) before seeking through the pages in memory. There exist many different ways to lay out the data pages of a B-tree, some of them more friendly to memory prefetch than others. The B-tree also allows you to seek the leaf nodes in a linear manner, without paying the log-proportional price to seek. This allows a logarithmic time to seek individual pages while still allowing linear time to range scan. When the expensive price of fetching a page (I/O) has been paid, the parsing of the page can also be made cheap by making use of the memory structures.
Highlight spill warning
In the course material I have a query that will help you do 1 in this list. If you are curious about ways to optimize the BEST index-only plan, I recommend the book by Dan Tow called “SQL Tuning”.
We will get into WHY the transaction log needs to be dedicated
Elevator sort orders the I/O requests before sending them to the spindle. Depending on the buffering, this ordering can increase IOPS per spindle quite significantly. However, it comes at a cost in increased latency.
Add the spindle illustration here
Hardware vendors have different implementations of RAID. It really depends on the gear you have and there is really only ONE way to get the true, unbiased answer… Which leads us to the next slide
In certain scenarios with shallow B-Trees (BizTalk Spool), row padding can shift the latch contention to the internal structure ACCESS_METHODS_HOBT_VIRTUAL_ROOT
Root splits are expensive, although each one only affects one partition at a time. The problem arises when many transactions cause page splits at once, which is why we suggest that partitioning is better.