Thomas Kejser
thomas@kejser.org
http://blog.kejser.org
@thomaskejser
Super Scaling SQL Server
Diagnosing and Fixing Hard Problems
Thomas Kejser
• Formerly SQLCAT
• Tuning SQL Server since 6.5
• 15+ Years of database experience
• http://blog.kejser.org
• CTO Fusion-io Europe
Image(s): FreeDigitalPhotos.net
Performance vs. Scalability
Response Time
Resource Use
Adding more of a HW resource makes things faster
You can scale without having performance
(ex: HADOOP)
You can perform without having scalability
(ex: In Memory Engines)
Our Reasonably Priced Server
• 2 Socket Xeon E3645
• 2 x 6 Cores
• 2.4Ghz
• NUMA enabled, HT off
• 12 GB RAM
• 1 ioDrive2 Duo
• 2.4TB Flash
• 4K formatted
• 64K AUS
• 1 Stripe
• Power Save Off
• Win 2008R2
• SQL 2012
Image Source: DeviantArt
Between disk and Memory
[Figure: four cores, each with its own L1 and L2 cache, sharing one L3 cache. Latency scale: 1ns, 10ns, 100ns, 10us, 100us, 10ms]
The “cache out curve”
Data Size
Throughput/thread
Cache Size
Service Time + Wait Time
NUMA Nodes
[Figure: two CPUs (NUMA nodes), each with two cores (C), two L2 caches and one L3 cache. A core asks "Can I write?" and the answer costs bus transfers between the CPUs]
There are several of these curves
[Figure: throughput vs. touched data size, with curves for CPU cache, TLB, NUMA remote memory and storage]
Response time = Service Time + Wait Time
Algorithms
and
Data Structures
“Bottlenecks”
• DBA tasks
• Installation of OS and SQL
• Basic Memory Configuration
• Basic Perfmon style monitoring
• Backup/Restore and HA setup
• Basic reading a Query Plan
• Basic understanding of database
structures
• Adding Indexes to tables
• Running a Profiler trace
What you ALREADY know
Below the Surface
What we Need
• Free tools from MS
• Windows SDK
• In Win8: The “ADK”
• Need .NET 4 to install
Where Did the Time Go?
xperf -on Base -f Base.etl
SELECT TOP 100000 *
FROM LINEITEM
INNER JOIN ORDERS
ON O_ORDERKEY = L_ORDERKEY
SQLCMD -E -S. -i "Select.sql"
xperf -stop
BASE profile with xperf
Right Click – Summary Table
What exactly is SQLNCLI?
Quantifying just how stupid XML is
SELECT TOP 1000000 *
FROM ORDERS
JOIN LINEITEM
ON L_ORDERKEY = O_ORDERKEY
FOR XML RAW ('OUTPUT')
xperf -on Base -f Base.etl
With XML
“Native” Format
Which CPU cycles are Expensive?
“App” tier
Web Server Licensing
>3K USD Blades
Database Tier
Core Licensing
>10K USD
<XML> ?
• What about the time
INSIDE the process?
• What if the EXE won’t
tell us?
Diving even Deeper
What is a Debug Symbol?
mov ax,10
mov bx,20
mov cx,3
push ax
push bx
push cx
call <address>
<address>
push bp
mov bp,sp
mov ax,[bp+8]
mov bx,[bp+6]
mov cx,[bp+4]
add ax,bx
div cx
mov dx,ax
ret
Header
doStuff(10,20,3);
…
int doStuff(int a, int b, int c)
{
return (a + b) / c;
}
myProg.exe
Machine Code
<address> = doStuff
Symbol table
myProg.pdb
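A rough sketch of what the symbol table buys us: it maps raw addresses back to function names. The addresses and the names other than doStuff are made up for illustration, and a real .pdb holds far more than this.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
# A real .pdb also holds types and line numbers; this only sketches the lookup.
SYMBOLS = [
    (0x1000, "main"),
    (0x1040, "doStuff"),
    (0x1080, "printResult"),
]

def resolve(address):
    """Return 'name+offset' for the function that contains the address."""
    starts = [start for start, _ in SYMBOLS]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return hex(address)  # below the first symbol: leave unresolved
    start, name = SYMBOLS[i]
    return f"{name}+{hex(address - start)}"

print(resolve(0x1044))  # doStuff+0x4
```

Without the table, a profiler can only report the raw address.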
Where do you get PDB files?
_NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
_NT_SYMCACHE_PATH=C:\SymCache
• Public Symbol Server
• Configure Environment
• Dbghelp.dll
• Auto Generated by Visual Studio:
Your Own Debug Symbols
• Symbols are indexed. Have to add them
Adding and Checking Your Symbols
Cd Bin/x64/Release/
symstore add /f *.pdb /s C:/Symbols /t "MyExe"
• Validate that the Symbols can resolve
Cd Bin/x64/Release/
symchk MyExe.exe /V
• Standard Xperf works fine for your own native code
• BUT: Before Windows 8, stack walking is broken for x64 .NET
• If you have 64-bit .NET code, you must NGEN first:
Got .NET and x64?
Ngen install Bin/x64/Release/MyExe.exe
(ngen lives here: %Windir%\Microsoft.NET\Framework64\<Version>\Ngen.exe)
• Free tool from MS:
.NET tracing is a pain, get a tool!
• Not to be confused with xperfview
• Same trace API and file format
• Helps set obscure .NET specific trace flags
And Finally, You can do Very Cool Things
Did I tell you about interlocked
operations?...
Whiteboard time!
• Consider again our LINEITEM table
What is SQL Server REALLY doing?
• How expensive is it to read from that?
• Think ETL code and DW/BI queries
CREATE TABLE LINEITEM (
[L_ORDERKEY] [int] NOT NULL,
[L_PARTKEY] [int] NOT NULL,
[L_SUPPKEY] [int] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] [decimal](15, 2) NOT NULL,
[L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
[L_DISCOUNT] [decimal](15, 2) NOT NULL,
[L_TAX] [decimal](15, 2) NOT NULL,
[L_RETURNFLAG] [char](1) NOT NULL,
[L_LINESTATUS] [char](1) NOT NULL,
[L_SHIPDATE] [date] NOT NULL,
[L_COMMITDATE] [date] NOT NULL,
[L_RECEIPTDATE] [date] NOT NULL,
[L_SHIPINSTRUCT] [char](25) NOT NULL,
[L_SHIPMODE] [char](10) NOT NULL,
[L_COMMENT] [varchar](44) NOT NULL
)
[Figure: workload quadrant with Small/Big on each axis: OLTP, BI/DW, Simulation, ETL]
SQLCMD – Native code Test
SQLCMD.EXE
Where does the time go?
Standard Reading of Data
xperf -on base -stackwalk profile -f stackwalk.etl
SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch"
55sec
xperf -stop
xperf -merge stackwalk.etl stackwalkmerge.etl
Details of the Time – Padding?
More Details – Conversion Work?
An Educated guess about improvements
CREATE TABLE [dbo].[LINEITEM_native](
[L_ORDERKEY] [int] NOT NULL,
[L_PARTKEY] [int] NOT NULL,
[L_SUPPKEY] [int] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] money NOT NULL,
[L_EXTENDEDPRICE] money NOT NULL,
[L_DISCOUNT] money NOT NULL,
[L_TAX] money NOT NULL,
[L_RETURNFLAG] int NOT NULL,
[L_LINESTATUS] int NOT NULL,
[L_SHIPDATE] int NOT NULL,
[L_COMMITDATE] int NOT NULL,
[L_RECEIPTDATE] int NOT NULL,
[L_SHIPINSTRUCT] [char](25) NOT NULL,
[L_SHIPMODE] int NOT NULL,
[L_COMMENT] char(44) NOT NULL
)
CREATE TABLE [dbo].[LINEITEM](
[L_ORDERKEY] [int] NOT NULL,
[L_PARTKEY] [int] NOT NULL,
[L_SUPPKEY] [int] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] [decimal](15, 2) NOT NULL,
[L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
[L_DISCOUNT] [decimal](15, 2) NOT NULL,
[L_TAX] [decimal](15, 2) NOT NULL,
[L_RETURNFLAG] [char](1) NOT NULL,
[L_LINESTATUS] [char](1) NOT NULL,
[L_SHIPDATE] [date] NOT NULL,
[L_COMMITDATE] [date] NOT NULL,
[L_RECEIPTDATE] [date] NOT NULL,
[L_SHIPINSTRUCT] [char](25) NOT NULL,
[L_SHIPMODE] [char](10) NOT NULL,
[L_COMMENT] [varchar](44) NOT NULL,
)
Before After
Getting Rid of Useless Work
Additional parameters for SQLCMD:
-a32767 -W -s";" -f437
x1.5
Unicode – 10% overhead?
Let's try that with Native and Unicode…
x5
• SQLNCLI is one of these in disguise
• ODBC
• OLEDB
• Pick good data types
• MONEY over NUMERIC
• UNICODE of data arrives like this
• Native protocols vs. flexibility
Summary Moving Data
• Get
• Windows 8 ADK
• Windows 7 SDK
• Set up Symbol Paths
• xperf -on Base
• Standard trace for time, narrow to process and DLL/EXE
• xperf -on Base -stackwalk Profile
• Get to the call stack, find the offending function(s)
• Ease of use for .NET: perfview.exe
Summary – Xperf
Response time = Service Time + Wait Time
Introducing TPC-H
Loop Join
n row B-tree
Log(n) reads
Complexity: O(m * log(n))
m row result
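A minimal sketch of where O(m * log(n)) comes from. A sorted Python list stands in for the n row B-tree (an assumption for illustration; a real B-tree reads pages, not list elements), and each of the m outer rows pays one binary search.

```python
import bisect

def loop_join(outer_keys, inner_keys_sorted):
    """Join m outer keys against a sorted key list, one binary search per key."""
    matches = []
    for key in outer_keys:
        # Each search costs about log2(n) comparisons
        i = bisect.bisect_left(inner_keys_sorted, key)
        if i < len(inner_keys_sorted) and inner_keys_sorted[i] == key:
            matches.append(key)
    return matches

print(loop_join([1, 43, 13, 7, 3], [1, 3, 7, 13, 20, 43]))
```

The outer input does not need to be sorted, which is why the loop join works well for small m.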
Linked List vs. Tree
[Figure: a linked list of n elements (0 to 8) takes up to n steps to search; a balanced tree of the same elements takes Log2(n) steps]
Cluster on O_ORDERKEY Index on O_ORDERKEY
Basic argument for Cluster Indexes
CREATE UNIQUE CLUSTERED INDEX CIX_Key
ON ORDERS_Cluster (O_ORDERKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Cluster
WHERE O_ORDERKEY = 3000000
CREATE UNIQUE INDEX IX_Key
ON ORDERS_Heap (O_ORDERKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Heap
WHERE O_ORDERKEY = 3000000
Table 'ORDERS_Heap'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0
Table 'ORDERS_Cluster'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0
Cluster on O_ORDERKEY heap + Index on O_ORDERKEY
But what if we do this a lot?
CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY)
WITH (FILLFACTOR = 100)
CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Heap
WHERE O_CUSTKEY = 47480
SELECT *
FROM ORDERS_Cluster
WHERE O_CUSTKEY = 47480
Table 'ORDERS_Cluster'. Scan count 1, logical reads 27, physical reads 0
Table 'ORDERS_Heap'. Scan count 1, logical reads 11, physical reads 0
How many LOOP joins/sec/core?
7 Sec
What did we just measure?
xperf -on Base -stackwalk profile
About 40%...
• The query language itself
• Why so many ExecuteStmt?
• …With so much CPU use?
What is sqllang.dll?
A different way to Measure Loops
1 Sec
VS.
What does THAT look like?
Takeaway:
The T-SQL language
itself is expensive
• Sample from LINEITEM
• Force loop join with index seeks
• Do 1.4M seeks
Test: Singleton Row Fetch
Singleton seeks – Cost of compression
Compression Seek (1.4M seeks) CPU Load
None - Memory 13 sec 100% one core
PAGE - Memory 24 sec 100% one core
None – I/O 21 sec 100% one core
PAGE – I/O 32 sec 100% one core
Function % Weight
CDRecord::LocateColumnInternal 0.82%
DataAccessWrapper::DecompressColumnValue 0.47%
SearchInfo::CompareCompressedColumn 0.28%
PageComprMgr::DecompressColumn 0.24%
AnchorRecordCache::LocateColumn 0.18%
ScalarCompression::AddPadding 0.04%
ScalarCompression::Compare 0.11%
Additional Runtime of GetNextRowValuesInternal 0.14%
Total Compression 2.28%
Total CPU (single core) 8.33%
Compression % 27.00%
xperf -on base -stackwalk profile
Modern CPU
[Figure: one CPU with a 4MB L3 cache shared by its cores; each core has a 32KB instruction cache, a 32KB data cache and a 256K unified L2 cache; the CPU connects to the bus]
The B+ Tree
B+ Tree
Hekaton Style “Loop”
Lookup Table
(hash)
Merge Join
[Figure: an m row result and an n row result, both sorted, merged in a single pass]
Complexity: O(m + n)
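A minimal sketch of the merge step, assuming both inputs arrive sorted on the join key. Each input is walked once, which is where O(m + n) comes from. With duplicate keys on both sides the output (and the work) grows beyond m + n, which the sketch handles by multiplying the run lengths.

```python
def merge_join(left_sorted, right_sorted):
    """Join two sorted key lists in one pass over each input."""
    out = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        if left_sorted[i] < right_sorted[j]:
            i += 1
        elif left_sorted[i] > right_sorted[j]:
            j += 1
        else:
            key = left_sorted[i]
            # Find the run of equal keys on each side
            i_end = i
            while i_end < len(left_sorted) and left_sorted[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end] == key:
                j_end += 1
            # Every left row in the run matches every right row in the run
            out.extend([key] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out

print(merge_join([1, 1, 2, 3], [1, 2, 3, 4, 4, 43, 43]))
```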
Merge Join – What is Fastest?
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM LINEITEM
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = L_ORDERKEY
…or
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM ORDERS
INNER MERGE JOIN LINEITEM
ON O_ORDERKEY = L_ORDERKEY
Comparing the Query Plans
Digging in Deeper
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0
, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0
, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0
, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 3265 ms, elapsed time = 3357 ms.
Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0
, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0
, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2469 ms, elapsed time = 2607 ms.
We can beat SQL Server at this game
SELECT MAX(O_ORDERDATE), MAX(MAX_P)
FROM
(SELECT L_ORDERKEY,MAX(L_PARTKEY) AS MAX_P
FROM LINEITEM
GROUP BY L_ORDERKEY) b
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = b.L_ORDERKEY
Hash Join
[Figure: the n row join table is hashed into an n row hash table; each key of the m row result (1, 43, 13, 7, 3) is looked up with Hash(key)]
Complexity: O(m + 2n)
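A minimal sketch of the two phases. One reading of O(m + 2n), offered as an assumption, is that the n rows are touched twice (read, then written into the hash table) while each of the m rows costs one probe. The payload values are made up.

```python
from collections import defaultdict

def hash_join(build_rows, probe_keys):
    """Build a hash table on the n row input, then probe it once per m row."""
    table = defaultdict(list)
    for key, payload in build_rows:      # build phase: read and insert n rows
        table[key].append(payload)
    out = []
    for key in probe_keys:               # probe phase: one lookup per m row
        for payload in table.get(key, []):
            out.append((key, payload))
    return out

print(hash_join([(1, "a"), (3, "b"), (7, "c")], [1, 43, 13, 7, 3]))
```

The hash table has to fit in memory for this cost to hold; once it spills, the picture changes.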
When Hash Joins hurt you
[Figure: runtime (seconds, 0 to 30) vs. hash memory (MB, 0 to 400), with a "Spill Zone!" marked]
Hash Joins Don’t Scale in MSSQL
The Bottleneck Curve
ACCESS_METHODS_DATASET_PARENT:
“Used to synchronize child dataset access
to the parent dataset during parallel
operations.”
Books Online Story…
Image: FreeDigitalPhotos.net
Using XPERF to find documentation
xperf -on base+cswitch+dispatcher -stackwalk profile+readythread+cswitch
Let's dig in…
xperf -on base -stackwalk profile -f stackwalk.etl
What LATCH pattern do we see?
GetNextRangeForChildScan
Inside:
TableScanNew
• Partition the table by a “random” value
• Modulo the Key for example
• Use SQL Server partition function/scheme
The Fix?…
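A small sketch of the modulo idea, assuming integer keys and a hypothetical partition count of 8. Sequential keys end up spread evenly, so concurrent scans or inserts no longer queue up on one structure.

```python
from collections import Counter

PARTITIONS = 8  # hypothetical partition count, chosen for illustration

def partition_of(key, partitions=PARTITIONS):
    """Modulo the key to pick a partition."""
    return key % partitions

# Sequential keys spread evenly over the partitions
spread = Counter(partition_of(k) for k in range(1, 8001))
print(sorted(spread.items()))
```

In SQL Server the same value would typically be a persisted computed column that the partition function is built on.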
[Figure: rows hashed into partitions 0 to 255]
Closer…
…But no Cigar
What is the Problem here?
Anti Scale Patterns
CPU Caches
[Figure: million pages/sec (0 to 1,000) vs. size of accessed memory (0 to 32 MB) for Random Pages, Sequential Pages and Single Page]
Goals:
• Compressed
• Prefetch Friendly
• Cache Resident Code
Example, Column Stores
ID Value
1 Beer
2 Beer
3 Vodka
4 Whiskey
5 Whiskey
6 Vodka
7 Vodka
ID Customer
1 Thomas
2 Thomas
3 Thomas
4 Christian
5 Christian
6 Alexei
7 Alexei
Product Customer
ID Date
1 2011-11-25
2 2011-11-25
3 2011-11-25
4 2011-11-25
5 2011-11-25
6 2011-11-25
7 2011-11-25
Date
ID Sale
1 2 GBP
2 2 GBP
3 10 GBP
4 5 GBP
5 5 GBP
6 10 GBP
7 10 GBP
Sale
Compression is Easy
ID Value
1-2 Beer
3 Vodka
4-5 Whiskey
6-7 Vodka
ID Customer
1-3 Thomas
4-5 Christian
6-7 Alexei
Product’ Customer’
ID Date
1-7 2011-11-25
Date’
ID Sale
1-2 2 GBP
3 10 GBP
4-5 5 GBP
6-7 10 GBP
Sale’
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
RL Customer
3 Thomas
2 Christian
2 Alexei
Product’ Customer’
RL Date
7 2011-11-25
Date’
RL Sale
2 2 GBP
1 10 GBP
2 5 GBP
2 10 GBP
Sale’
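The run-length (RL) tables above can be produced with a few lines. A minimal sketch, using the Product column from the example:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (run length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]

product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
print(rle_encode(product))
```

Note that only consecutive repeats collapse: Vodka appears in two separate runs, so sort order decides how well this compresses.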
Squeezing it even more
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
Product’
RL Value
2 1
1 2
2 3
2 2
Product’
Beer = 1
Vodka = 2
Whiskey = 3
ID Value
1-2 Beer
3-3 Vodka
4-5 Whiskey
6-7 Vodka
Product’
4+4+4+2 = 14B
+ 4+4+5+2 = 15B
+ 4+4+7+2 = 17B
+ 4+4+5+2 = 15B
= 61B
4+4+2 = 10B
+ 4+5+2 = 11B
+ 4+7+2 = 13B
+ 4+5+2 = 11B
= 45B
4+4 = 8B
+ 4+4 = 8B
+ 4+4 = 8B
+ 4+4 = 8B
= 32B
RL Value
2 0b01
1 0b10
2 0b11
2 0b10
Product’
4 = 4B
+ 4 = 4B
+ 4 = 4B
+ 4 = 4B
+ 4 x 2b = 2B
= 18B
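The first three byte counts can be re-derived as below. The sizes are read off the sums above and are assumptions about the slide's accounting: 4 bytes per integer, 2 bytes of overhead per string, and the dictionary itself not counted.

```python
runs = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]

INT_BYTES = 4      # assumed size of an ID, run length or dictionary code
STR_OVERHEAD = 2   # assumed per-string overhead, as in the sums above

def range_encoded_size(runs):
    # start ID + end ID + string + overhead per run
    return sum(INT_BYTES + INT_BYTES + len(v) + STR_OVERHEAD for _, v in runs)

def run_length_size(runs):
    # run length + string + overhead per run
    return sum(INT_BYTES + len(v) + STR_OVERHEAD for _, v in runs)

def dictionary_size(runs):
    # run length + integer dictionary code per run
    return sum(INT_BYTES + INT_BYTES for _ in runs)

print(range_encoded_size(runs), run_length_size(runs), dictionary_size(runs))
# prints: 61 45 32
```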
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
RL Customer
3 Thomas
2 Christian
2 Alexei
Product’ Customer’
2 steps with Beer
2 steps with Thomas
Beer Thomas
Beer Thomas
SELECT Product, Customer FROM Table
1 step with Vodka
1 step with Thomas
Vodka Thomas
2 steps with Whiskey
2 steps with Christian
Whiskey Christian
Whiskey Christian
2 steps with Vodka
(Note: Repeated value)
2 steps with Alexei
Vodka Alexei
Vodka Alexei
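The stepping above can be sketched as walking both run-length columns in lockstep. This is an illustration of the idea only, not how the engine implements it.

```python
def rle_decode(runs):
    """Expand (run length, value) pairs back into one value per row."""
    for length, value in runs:
        for _ in range(length):
            yield value

product = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]
customer = [(3, "Thomas"), (2, "Christian"), (2, "Alexei")]

# SELECT Product, Customer FROM Table
rows = list(zip(rle_decode(product), rle_decode(customer)))
for row in rows:
    print(*row)
```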
Hash Joining with Column Stores
RL Key
2 Beer
1 Vodka
2 Whiskey
2 Vodka
Table
Key Type
Beer Soft
Vodka Strong
Whiskey Strong
Vodka Strong
Dim Product
SELECT …
FROM Table
JOIN DimProduct ON Key
WHERE Type = 'Strong'
1 Compute bloom filter of Keys belonging to 'Strong'
2 Read RL = 2, Beer from Table
3 Compute bloom value of Beer
4 Equal to filter value from 1? No. Do nothing
5 Compute bloom value for Vodka
6 Equal to filter value from 1? Yes. Output one row (RL=1)
7 Compute bloom value for Whiskey
8 Equal to filter value from 1? Yes. Output two rows (RL=2)
Can prefetch data (needs RLE)
Can calculate match/no match using only local CPU cache
Won't work for OLTP!
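A toy version of the bloom filter probe. The filter size and hashing scheme here are made up for illustration and say nothing about what SQL Server actually uses. The point: one test per run, not per row, against a structure small enough to stay in CPU cache.

```python
import hashlib

BITS = 64  # hypothetical filter size

def bloom_bits(key, hashes=2):
    """Return the bit positions a key maps to (hypothetical hashing scheme)."""
    digest = hashlib.sha256(key.encode()).digest()
    return {digest[i] % BITS for i in range(hashes)}

def build_filter(keys):
    bits = set()
    for key in keys:
        bits |= bloom_bits(key)
    return bits

def might_match(bits, key):
    # False means "definitely not in the set"; True may be a false positive
    return bloom_bits(key) <= bits

dim_product = {"Beer": "Soft", "Vodka": "Strong", "Whiskey": "Strong"}
fact_runs = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]

strong = build_filter(k for k, t in dim_product.items() if t == "Strong")
rows_out = sum(rl for rl, key in fact_runs if might_match(strong, key))
print(rows_out)
```

Because a bloom filter can give false positives, the rows that pass still have to be checked by the join itself.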
Why is it so hard to get joins right?
n
m
Time
Loop Join
Merge Join
Hash Join
Desired Join      Join Hint                                         Query Hint
LOOP              [INNER | LEFT | CROSS | FULL] LOOP JOIN           OPTION (LOOP JOIN)
MERGE             [INNER | LEFT | CROSS | FULL] MERGE JOIN          OPTION (MERGE JOIN)
HASH              [INNER | LEFT | CROSS | FULL] HASH JOIN           OPTION (HASH JOIN)
LOOP with Seek    WITH FORCESEEK, WITH ( INDEX (index = <name>) )   N/A
Controlling Joins
Note: Join hints force the order of the ENTIRE join tree!
What Type of Workload?
[Figure: workload quadrant. Axes: Data Touched (Small to Big) and Data Returned (Small to Big). Quadrants: OLTP, BI/DW, Simulation, ETL]
How to Classify?
OLTP BI/DW
Simulation ETL
Full Scan/sec
Range Scans/sec
Probe Scans/sec
Index Search/sec
Range Scans/sec
Full Scan/sec
Range Scans/sec
Bulk Copy Rows/sec
?
There should ALWAYS be a fully
indexed path to the data.
OLTP System Basic Query Pattern
[Figure: workload quadrant with Small/Big on each axis: OLTP, BI/DW, Simulation, ETL]
1. Find worst CPU consuming query with sys.dm_exec_query_stats
2. Add OPTION (LOOP JOIN) to offending query
3. Check estimated query plan
4. If table spool found: add index to remedy and GOTO 3
5. Happy? If not, GOTO 1
The Super Quick OLTP Tuning Guide
The query will not be (much)
worse than a full scan of a fact
partition
DW/BI System Basic Query Pattern
[Figure: workload quadrant with Small/Big on each axis: OLTP, BI/DW, Simulation, ETL]
1. Find offending query
2. Add OPTION (HASH JOIN) to query
3. Do dimension tables have an indexed path to build the hash? If not, add index
4. Do you get a fact table scan and hash build of all dimensions? If not, check statistics (especially on facts and skewed)
5. Optimize Fact table scans
   1. Partition and partition elimination
   2. Column store if you have it
   3. Aggregate Views
   4. Bitmap index pushdown (statistics!)
   5. Composite indexes (last resort!)
The Super Quick DW tuning Guide
The expected DW Query Plan
Partial
Aggregate
Fact CSI Scan
Dim Scan
Dim Seek
Batch
Build
Batch
Build
Hash
Join
Hash
Join
HashStream
Aggregate
• At least enough RAM to hold the hash
tables of the largest dimension
• De-normalisation helps… a LOT
• Especially for the large/large joins
• Likely: need to scan fast from disk if
RAM is not big enough to hold the fact
• Compression REALLY matters
Things that Follow from desired DW Plan
Coffee Break
Response time = Service Time + Wait Time
Where EVERY Server wide diagnosis starts
SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM
#ignorewaits)
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
• Shows up as waits for PAGEIOLATCH
• You can dig into details with:
Common Problems - PAGEIO
SELECT *
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)
• Can also Xevent your way to it per
query
CREATE EVENT SESSION [TraceIO] ON SERVER
ADD EVENT
sqlserver.file_read_completed(
ACTION (sqlserver.database_id,sqlserver.session_id))
• I/O, like memory, is a GLOBAL resource
for the machine
• When does it make sense to partition a
global resource?
• When you deeply know the workload
• When the workload is ALREADY partitioned
• When neither of those are true: DON’T
partition
• If you have NAND/SSD – Why bother?
The general I/O Guidance
A good way to Think of Spindle I/O
JBOD SAME
LUN
Seq.
LUN
Seq.
LUN
Seq.
RAID system
Large LUN
Seq. Seq. Seq.
RANDOM I/O
Stripe vs. Concatenation
RAID 10 RAID 10
Concatenated LUN
RAID 10 RAID 10
Striped LUN
OLTP
• One big SAME setup
• data files
• Tempdb
• Dedicate
• Transaction log
• DRAM:
• Enough to hold most of
DB
Data Warehouse
• JBOD setup
• Data Files
• 1-2 per LUN
• SAME setup
• Tempdb
• Dedicate
• Transaction Log
• DRAM:
• Enough to hold largest
partition of largest table
Rules of Thumb – Spindle I/O and DRAM
• Short Stroking
• Elevator Sort
• Sequential vs.
Random
• Weaving
You can do a bit better… or worse
• Intentionally use
lower % of total
space
• Tradeoff:
• Space for Speed
• Test:
• 15K rpm
• SAS spindle
• 300GB
Short Stroking Disks
[Figure: IOPS (150 to 400) vs. % capacity used (0% to 100%)]
Full Stroked Short Stroked
Why does Short Stroking Work?
Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the disk won’t use the full platter either. The result: less head movement.
Adding Elevator Sorting
[Figure: 8K random I/O. IOPS, average latency and max latency for Full Stroke Random, Outer Short, Inner Short, Elevator Sort and Elevator Short Stroked]
Bat powered
disk!
Why Chase Sequential I/O?
[Figure: 8K block pattern. IOPS (log scale), average latency and max latency (ms) for Sequential vs. Full Stroke Random]
• One SATA disk
• Two partitions
• One file on each
• Sequential read on
each file
But all is not well!
File1 File2
I/O Weaving in action
[Figure: IOPS and average latency (ms) for 64K Random vs. 64K Dual Sequential]
Source: Michael Anderson
Storage Pool and Weaving
DataLog DataLog DataLog
Massive, then Provisioned Pool
Seq
Ran
Seq
Ran
Seq
Ran
RANDOM!
The SAN will properly handle Sharing!
Green: Checkpoint, Red: tx/sec, Black: Disk Latency
Numbers to Remember - Spindles
Characteristic           Typical Units
Throughput / Bandwidth   90-125MB/sec, but ONLY if sequential access!
Operations per Sec       10K RPM Spindle: 100-130 IOPS
                         15K RPM Spindle: 150-180 IOPS
                         Can get about 2x if short stroking (more later)
Latency                  3-5ms (compare DRAM: 100ns)
Capacity                 100s of GB to single digit TB
2012 numbers, will change in future
• Few hundreds of IOPS
• Faster if short stroked
• Trade latency for speed with elevator
sort
• Sequential is hard to get right
Summary so far.. Single Disk
• Wider Stripes neat
• But scale not linear
• Very deep queues
help
• But add latency
• Shared
Components
Why does a big RAID pile not solve this?
RAID Scale?
Your Mileage WILL vary with the hardware
Before After
Getting rid of Sharing
[Figure, drawn once for Before and once for After: HBAs, switches, storage ports, LUNs, cache, disk and CPU; annotation: x2]
NAND Flash Basics
[Figure: NAND cell with control gate, oxide layer and floating gate (electrons trapped); 4K pages, blocks, NAND die, pack]
NAND Flash Problems
• Erase Cycles
• Around 100K
• Rebalancing and reclaim/trim
• Voltage measurement
• Gets worse with density
• Changes over time
• Depends on how you program
• Bit Rot
• Must refresh even on read
• SLC easier to manage than MLC
• But much more expensive!
Voltage
00
01
10
11
Lessons Learned: Try to Avoid Sharing
BAD BETTER BEST
The Network
• Only partially diagnosed as waits in
sys.dm_os_wait_stats
• Task Manager gives a bit more
information
• Need: transparency to the deep level
latencies and packets!
Common Problems: ASYNC_NETWORK, OLEDB
A common Wait Type
The database is really
slow! The code takes
forever to run!
• We may not always have insight into
what is going on at the client…
Xperf Diagnosing the Network
xperf -on latency+network
Summary
Table
Timeline of the network Traffic
ASYNC_NETWORK_IO, the typical issue
Handling network is EXPENSIVE
xperf -on latency
?
Short Story on DPC/ISR handling
CPU
Core
Core
L1-L3
Cache
PCI
BUS
IRQ
HALT execution
Fire ISR Routine
if (my interrupt)
{
<Mark Handled>
Queue DPC
}
NIC
Work Done
DPC
<Do work needed>
<Wake Application>
Core can
run other stuff
again
It looks like this…
DPC
ISR
• Option 1: Use the HW vendor's tool
• Option 2: Use the Interrupt Affinity Policy Tool from MS
Setting Interrupt Affinity
• Standard Payload
Network (MTU):
• 1500 B
• Jumbo Frames
• 9014 B(MTU)
Jumbo Frame and SQL Packets
• Standard SQL
payload
• 4096 B
• Largest
• 32767 B
SELECT session_id, net_packet_size
FROM sys.dm_exec_connections
Server=foo;Packet size=32767
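A back-of-the-envelope sketch of how the SQL packet size and the network MTU interact, using the sizes from this slide. It ignores TCP/IP and TDS header overhead, so treat the results as lower bounds.

```python
import math

def frames_per_packet(sql_packet_bytes, mtu_bytes):
    """Lower bound on network frames per SQL packet (headers ignored)."""
    return math.ceil(sql_packet_bytes / mtu_bytes)

for packet in (4096, 32767):
    for mtu in (1500, 9014):
        print(packet, mtu, frames_per_packet(packet, mtu))
```

A default 4096 B packet needs at least 3 standard frames but fits in one jumbo frame; the largest 32767 B packet needs at least 22 standard frames or 4 jumbo frames.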
Single Threaded
Core Evolution
Moore’s “Law”:
“The number of transistors per
square inch on integrated
circuits has doubled every
two years since the
integrated circuit was
invented”
• Never faster than a single core
• Smaller servers are faster than bigger ones
• Large L2 caches and more clock speed help
• The algorithm dictates speed
• Latency of Wait Time sets upper limit
• Examples from MSSQL land:
• Formula Engine in MSAS
• Transaction Log Writes
• INSERT/UPDATE/DELETE (as we shall see)
Single Threaded
VLF files
• When switching to new VLF – it has to be ”formatted” with
8K sync write
• While this happens, transactions are blocked
• Too many VLF = Too much blocking
• Lesson: Preallocate the database log file in big chunks
• Up to 128 Log Buffers per database
• Spawned on demand, will not be released once spawned
• Transactions will wait for LOGBUFFER if no buffer is available
• Think of this like a pipeline of commits waiting…
VLF(1) VLF(2) VLF(3) VLF(4) VLF(5) VLF(6)8K 8K 8K 8K 8K 8K
<=60K
X 128
Transaction Log Background
Buffer Offset (cache line)
LOGCACHE
ACCESS
Alloc Slot in Buffer
MemCpy Slot
Content
Log Writer
Writer Queue
Async I/O Completion
Port
Slot
1
LOGBUFFER
WRITELOG
LOG
FLUSHQ
Signal thread which
issued commit
T0
Tn
Slot
127
Slot
126
• Speed is determined by Latency and
Code Path
• Max Log Write Size: 60K
Zooming to the Log Writer
Log Writer
Async I/O Completion Port
Signal thread which
issued commit
Latency
Writer Queue
Long Distance Replication…
Log Entry Log Entry
Network
Log Entry
Send log
Ack Log
Primary Secondary
Write Write
Executive Summary:
The speed of light ( c )
is not fast enough!
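To put a number on that: a sketch of the best case for a send + ack round trip, using the speed of light in vacuum. Real networks are slower (fiber, switches, protocol overhead), and the serial-commit bound assumes each commit waits for the previous ack. The 1000 km distance is just an example.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum

def min_round_trip_ms(distance_km):
    """Lower bound on send + ack time; real networks are slower."""
    return 2 * distance_km * 1000 / C_M_PER_S * 1000

def max_serial_commits_per_sec(distance_km):
    """Upper bound when each commit waits for the previous ack."""
    return 1000 / min_round_trip_ms(distance_km)

print(round(min_round_trip_ms(1000), 2), "ms")
print(round(max_serial_commits_per_sec(1000)), "commits/sec")
```

At 1000 km that is roughly 6.7 ms per round trip, so a single stream of dependent commits cannot exceed about 150 per second, no matter how fast the disks are.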
• Perfmon will only show millisec
• What if we want microseconds?
Getting to the Real Latency
xperf -on latency
It’s in Memory, so it must be fast?
VS.
Latency: 15-30us Latency: <5us
RAM DISK
1.5sec 1.5sec
No, Because…
This adds up to one core… it is doing all it can with the CPU it has
The Effect on UPDATE
Naïve
UPDATE MyBigTable
SET c6 = 43
Parallel
UPDATE MyBigTable
SET c6 = 43
WHERE key
BETWEEN 10**9 * n
AND 10**9 * (n+1) - 1
Runtime
(smaller is faster)
Multi Threaded
What is Scalable?
[Figure: throughput (0 to 3000) vs. some hardware resource (0 to 24), with Good, So so and Bad curves; annotation: "We want to live here"]
Amdahl’s Law of gated speedup
[Figure: speedup factor (1 to 31) vs. number of cores (0 to 64)]
P = 100%
P = 95%
P = 90%
P = 80%
P = Part of program that can be made Parallel
(Note that this may be 0... or 1)
N = Number of CPU cores available
Speedup = $\frac{1}{(1 - P) + \frac{P}{N}}$
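Amdahl's law in its usual form is Speedup = 1 / ((1 - P) + P / N). A quick sketch to get a feel for how fast the serial part starts to dominate, using the P values from the chart:

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when fraction p of the program runs in parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (1.0, 0.95, 0.90, 0.80):
    print(f"P = {p:.0%}: {amdahl_speedup(p, 64):.1f}x on 64 cores")
```

Even with 95% of the program parallel, 64 cores give only about a 15x speedup.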
Introducing Contention – Locks
Table A
Table B
Table C
INSERT TableA …
INSERT TableB …
INSERT TableC …
LCK
LCK
LCK
LCK
LCK
LCK
LCK
LCK
Wait Stat: LCK_<X>
But those rows have to be stored…
Table A
Table B
Table C
LCK
LCK
LCK
LCK
LCK
LCK
LCK
LCK
Data
File
File
Group
It all Starts with Wait Stats
SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT
wait_type FROM #ignorewaits)
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
DBCC PAGE
PFS – Hidden Single Page Contention
Data File
GAM/
SGAM
PFS
64MB
PFS PFS
64MB
PFS
64MB
PFS
B B B B
B B B B
B B B B
B B B B
8K
10010010
INSERT TableA …
Allocated bit
Data
File
Data
File
Data
File
More Files
Table A
Table B
Table C
LCK
LCK
LCK
LCK
LCK
LCK
LCK
LCK
Data
File
File
Group • Round Robin
between files
• More files, more
structures
• No affinity
How many more Files?
[Figure: runtime (260 to 400) and PAGELATCH_UP waits (log scale, 1 to 10,000,000) vs. # data files (0 to 48)]
• Shared, physical MEMORY structures
can cause bottlenecks (ex: PFS)
• SQL Server must sync too…
• Understanding where structure resides
leads to tuning fix
• Theory of engine!
Concurrency: What we learned so far
• Commonly misdiagnosed
• CXPACKET does NOT (always) mean
that your DOP is “too high”
CXPACKET
[Figure: CXPACKET waits (0 to 200,000,000) vs. throughput (10 to 40 MB/sec)]
[Figure: throughput (0 to 50 MB/sec) vs. DOP (1 to 41)]
CXPACKET = Issue may be elsewhere…
• What happens when you get things like:
LATCH_<x>
PAGELATCH_<x>
Step 1: Dig into:
Diagnosing Latches
SELECT *
FROM sys.dm_os_latch_stats
Digging into Latches Again…
Zooming into the Ready Thread
Post Fix Pattern
GetNextRangeForChildScan
GetNextRangeForChildScan
GetNextRangeForChildScan
• Before: 6GB/sec
• After: 20GB/sec
• This sometimes works on cluster
indexes too…
…Whiteboard
Speedup with Hash Partition of Heap
UPDATE Hotspot
Page (8K)
ROW
ROW
ROW
LCK_U
LCK_U
PAGELATCH_EX
Before
ALTER TABLE HotUpdates
ADD Padding CHAR(5000)
NOT NULL DEFAULT ('X')
After
UPDATE Hack on Small Tables
Page (8K)
ROW
LCK_U
PAGELATCH_EX
CHAR(5000)
Page (8K)
ROW
ROW
ROW
LCK_U
LCK_U
PAGELATCH_EX
Test: Updates of pages
Compression Update 1.4M CPU Load
None - Memory 13 sec 100% one core
PAGE - Memory 54 sec 100% one core
None – I/O 17 sec 100% one core
PAGE – I/O 59 sec 100% one core
L_QUANTITY is NOT NULL
i.e. in place UPDATE
Function CPU %
qsort 0.86
CDRecord::Resize 0.84
CDRecord::LocateColumnInternal 0.36
perror 0.36
Page::CompactPage 0.36
ObjectMetadata::`scalar deleting destructor' 0.27
SearchInfo::CompareCompressedColumn 0.24
CDRecord::InitVariable 0.19
CDRecord::LocateColumnWithCookie 0.18
memcmp 0.16
PageDictionary::ValueToSymbol 0.16
Record::DecompressRec 0.14
PageComprMgr::DecompressColumn 0.14
CDRecord::InitFixedFromOld 0.1
SOS_MemoryManager::GetAddressInfo64 0.08
AnchorRecordCache::LocateColumn 0.08
CDRecord::GetDataForAllColumns 0.08
ScalarCompression::Compare 0.07
PageComprMgr::CompressColumn 0.07
Record::CreatePageCompressedRecNoCheck 0.06
memset 0.05
PageComprMgr::ExpandPrefix 0.04
PageRef::ModifyColumnsInternal 0.04
Page::ModifyColumns 0.03
DataAccessWrapper::ProcessAndCompressBuffer 0.03
SingleColAccessor::LocateColumn 0.03
CDRecord::BuildLongRegionBulk 0.02
ChecksumSectors 0.02
Page::MCILinearRegress 0.02
DataAccessWrapper::DecompressColumnValue 0.02
SOS_MemoryManager::GetAddressInfo 0.02
CDRecord::FindDiff 0.02
AnchorRecordCache::Init 0.02
PageComprMgr::CombinePrefix 0.01
Total 5.17
UPDATE Compression burners
Out of 8.55 … Approx: 60%
Compression and Locks
Xevent Trace
Lock Acquire/Release
High Res Timer
How long are locks held?
[Figure: lock held cycle count in CPU KCycles (0 to 600), Avg and StdDev, for PAGE vs. NONE compression]
• Sharing is generally bad for scale (but
may be good for performance)
• PAGELATCH and LATCH diagnosis starts
in sys.dm_os_latch_stats
• CXPACKET
• Only important if throughput drops when
DOP goes up
• If this happens, look for another wait/latch
• Table partitioning can be used to work
around concurrency issues
Summary Concurrency – So Far..
The Paul Randal INSERT test
160M rows, executing at concurrency
Commit every 1K:
EASY
tuning?
All is as Expected?
But Page Splits are Bad, right?
= BAD!
= Better!...
WRITELOG gone? Faster?
?
?
sys.dm_os_wait_stats
And the Score Is…
[Figure: time in seconds (0 to 35,000) for newguid(), newsequentialid() and IDENTITY]
What is going on here???
Min
Min
Min Min
Min
Min
Min
Min Min
Min
HOBT_ROOT
Max
Tricks to Work Around this
0 - 1000
1001 - 2000
2001 - 3000
3001 - 4000
INSERT
INSERT
INSERT
INSERT
All Cores at 100%
[Figure: runtime in seconds (0 to 35,000) for newguid(), newsequentialid(), IDENTITY, IDENTITY + Unique, IDENTITY + Unique + Hash8, IDENTITY + Hash24, IDENTITY + Hash48 and SPID + Offset; annotations: 600K inserts/sec, 830K inserts/sec]
All Cores at ~100%
• Don’t use Sequential Keys
• Page Splitting isn’t so bad
• Neither are GUID
• Generate keys wisely. Ideally in the app
server
• For “transparent” speedup, consider our
old hash trick
Takeaways, INSERT workload
• Minimally Logged
• Single, large
execution
(thousands)
• Unsorted data
• Concurrent Loaders
BULK INSERT Workload
Heap
Bulk Insert
Bulk Insert
Measure:
SELECT * FROM
sys.dm_os_latch_stats
Observe waits on
ALLOC_FREESPACE_CACHE
Theory (just read BOL):
“Used to synchronize the access to
a cache of pages with available
space for heaps and binary large
objects (BLOBs). Contention on
latches of this class can occur
when multiple connections try to
insert rows into a heap or BLOB at
the same time. You can reduce
this contention by partitioning the
object.”
When does BULK INSERT scale break?
[Chart: MB/Sec (axis 0-250) against number of Concurrent BULK INSERT streams (0-30); callouts 1, 2, 3]
What is Happening here?
[Diagram: Bulk Insert goes through the HOBT Cache (latch: ALLOC_FREESPACE_CACHE, "Alloc new pages!" in "Fat Chunks") to the Free Page information (PFS/GAM/SGAM). Annotation: This is in DRAM and L2]
• Break up table by “some key”
• Optional: Switch out partitions
• Spin up multiple bulks
• Linear scale
• 3GB/sec
• 16M LINEITEM/sec
Breaking Through the Bottleneck
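The "break up by some key, then spin up multiple bulks" idea can be sketched in Python as follows. This is a toy model under stated assumptions: `split_by_key`, `load_partition` and `parallel_load` are hypothetical names, the modulo split stands in for whatever partitioning key is chosen, and `load_partition` only counts rows where a real loader would issue a BULK INSERT against its own table or partition.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_key(rows, key_of, partitions):
    """Group rows into per-partition batches using key modulo partitions."""
    batches = defaultdict(list)
    for row in rows:
        batches[key_of(row) % partitions].append(row)
    return batches


def load_partition(partition_id, batch):
    """Stand-in for one bulk stream; a real loader would run a BULK INSERT
    against the table or partition that belongs to partition_id."""
    return partition_id, len(batch)


def parallel_load(rows, key_of, partitions=4):
    """Run one loader per partition concurrently and report rows loaded."""
    batches = split_by_key(rows, key_of, partitions)
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        futures = [pool.submit(load_partition, p, b) for p, b in batches.items()]
        return dict(f.result() for f in futures)


rows = [{"orderkey": k, "qty": k % 7} for k in range(1, 101)]
loaded = parallel_load(rows, key_of=lambda r: r["orderkey"], partitions=4)
```

Because no two loaders write to the same target, they never meet on a shared allocation structure, which is the property the slide relies on for linear scale.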
[Diagram: table split by Area (values 425, 555, 215, 200, 101, 453, 666), fed by several concurrent Bulk Insert streams]
BULK INSERT - Reloaded
• Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed
• But this was on heaps, I have a clustered table
• Alright then, let us hit a clustered index
[Diagram: Clustered and partitioned table with key ranges 1-1000, 1001-2000, 2001-3000, 3001-4000, and an X Lock on each range]
Cluster Bulking – It seemed so plausible!
[Diagram: callouts 1, 2, 3]
Cluster Bulking – Stage and Switch
[Diagram: callouts 1, 2, 3]
Coffee Break
SPIN LOCKS
• Context Switching is expensive
• Typically 10K or more CPU cycles
• If you expect the resource to be held only briefly, why fall asleep?
What is a Spinlock?
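#include <stdatomic.h>

/* Spin until we atomically flip the lock word from 0 (free) to 1 (held) */
void spin_acquire(atomic_int* s)
{
    int expected = 0;
    /* A plain read followed by a write would race, so use compare-and-swap */
    while (!atomic_compare_exchange_weak(s, &expected, 1))
        expected = 0; /* a failed CAS stores the current value in expected */
}

void spin_release(atomic_int* s)
{
    atomic_store(s, 0);
}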
• Acquire can be very expensive
• SQL Server implements a backoff mechanism
What is a backoff?
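#include <stdatomic.h>

void spin_acquire(atomic_int* s)
{
    int spins = 0;
    int expected = 0;
    /* Try to atomically flip the lock word from 0 (free) to 1 (held) */
    while (!atomic_compare_exchange_weak(s, &expected, 1))
    {
        expected = 0; /* a failed CAS stores the current value in expected */
        spins++;
        if (spins > threshold) /* threshold: spin budget, defined elsewhere */
        {
            /* Give up the CPU instead of burning more cycles */
            <Sleep and WaitForResource>
            spins = 0;
        }
    }
}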
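The backoff idea above can also be tried out in a few lines of Python. This is a sketch under assumptions, not how SQL Server implements it: `BackoffSpinLock` is a hypothetical class, a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set, and the threshold and sleep time are illustrative values.

```python
import threading
import time


class BackoffSpinLock:
    """Spin on a non-blocking try-acquire; sleep once the spin budget is spent."""

    def __init__(self, threshold=100, sleep_seconds=0.0001):
        self._flag = threading.Lock()   # used only as an atomic test-and-set
        self.threshold = threshold      # illustrative value, not SQL Server's
        self.sleep_seconds = sleep_seconds
        self.backoffs = 0               # approximate: updated without a lock

    def acquire(self):
        spins = 0
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins > self.threshold:
                self.backoffs += 1
                time.sleep(self.sleep_seconds)  # back off instead of burning CPU
                spins = 0

    def release(self):
        self._flag.release()


lock = BackoffSpinLock()
state = {"counter": 0}


def worker(iterations):
    """Increment the shared counter under the spinlock."""
    for _ in range(iterations):
        lock.acquire()
        try:
            state["counter"] += 1
        finally:
            lock.release()


threads = [threading.Thread(target=worker, args=(2000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The `backoffs` counter plays the same role as the backoff figures reported by the spinlock statistics: it counts how often a waiter gave up spinning and went to sleep.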
SELECT *
FROM sys.dm_os_spinlock_stats
DBCC SQLPERF(spinlockstats)
Backoff
Life at 600K INSERT/sec
WRITELOG is I/O – right?
Should be the same as this… or?
No! Because:
• Step 1: Copy sqlservr.pdb to the BINN directory
• Step 2: DBCC TRACEON (3656, -1)
• Step 3: Steal script from: http://www.microsoft.com/en-us/download/details.aspx?id=26666
Note for 2012, you additionally need:
• sqlmin.pdb, sqllang.pdb, sqldk.pdb
Diagnosing a Spinlock the Cool way!
Spinlock Walkthrough – Extended Events Script
--Get the type value for any given spinlock type
select map_value, map_key, name from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')
--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
where type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
set filtering_event_name='sqlos.spinlock_backoff',
source_type=1, source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)
--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start
--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'
--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)
--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'
--Clean up the session
alter event session spin_lock_backoff on server state=stop
drop event session spin_lock_backoff on server
Of course, you can just use 2012…
How to improve a spinlock?
[Diagram: two CPUs, each with Cores and an L1-L3 Cache; every spin_acquire on Int s makes the cache line holding s move between them ("Transfer cache line")]
CoreInfo.Exe – where are my cores?
CoreInfo.exe
Revisiting the TLOG
[Diagram: threads T0..Tn allocate a slot in the log buffer (LOGCACHE ACCESS, Buffer Offset (cache line)) and MemCpy slot content into Slot 1 .. Slot 127 (LOGBUFFER); the Log Writer takes work from the Writer Queue (LOGFLUSHQ), uses an Async I/O Completion Port, and signals the thread which issued the commit (WRITELOG)]
I/O Affinity Mask!
[Chart: SPID + Offset vs. SPID + Affinity (axis 0-250)]
sp_configure 'affinity I/O mask'
Bulking at Concurrency
• What’s that spin?
xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl
SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC
DBCC SQLPERF (spinlockstats)
?
SOS_OBJECT_STORE at high INSERT
• Observed: This Spin happens when inserting
• Need: Reduce locking overhead
• Fixes that work well here:
8x throughput Bonus
• Let’s try something really silly:
• Run lots of: EXEC emptyProc
• This should be infinitely scalable, right?
Diagnosing another Spinlock
CREATE PROCEDURE emptyProc
AS
RETURN
Initial Diagnosis
MUTEX ??? … what Mutex?
Using the Spinlock Script gives us
Some cache
Which one?
Validating the Theory
CREATE PROCEDURE emptyProc0
AS
RETURN
GO
CREATE PROCEDURE emptyProc1
AS
RETURN
GO
…
CREATE PROCEDURE emptyProc31
AS
RETURN
What is the SOS_OBJECT_STORE?
Security Check?
Validating the new “fix”…
-- @table is assumed to be supplied by the caller (it is not declared in this fragment)
DECLARE @ParmDef NVARCHAR(500)
DECLARE @sql NVARCHAR(500)
SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
(c1, c2, c3, c4, c5, c6)
VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
SET @ParmDef = N'@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
DECLARE @constDate DATETIME = '1974-12-22'
DECLARE @i INT
-- Loop forever, committing after every 1000 single-row INSERTs
WHILE (1=1) BEGIN
  BEGIN TRAN
  SET @i = 1
  WHILE @i <= 1000 BEGIN
    EXEC sys.sp_executesql @sql, @ParmDef
    , @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
    SET @i = @i + 1
  END
  COMMIT TRAN
END
Consider this Test harness code…
Spinning on MUTEX
Diagnosing with the trace flag shows the offender on the spinning stack:
CSecurityContext::GetUserTokenFromCache
This is REALLY expensive at scale:
WHILE @i <= 1000 BEGIN
EXEC sys.sp_executesql @sql, …
SET @i = @i + 1
END
It initializes a new execution context on every loop iteration!
Fixing the MUTEX spin
• Instead of:
WHILE @i <= 1000 BEGIN
EXEC sys.sp_executesql @sql, …
SET @i = @i + 1
END
• Write:
SET @sql = N'
DECLARE @i INT
WHILE (1=1) BEGIN
  BEGIN TRAN
  SET @i = 1
  WHILE @i <= 1000 BEGIN
    INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
    (c1, c2, c3, c4, c5, c6)
    VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
    SET @i = @i + 1
  END
  COMMIT TRAN
END'
-- One call now runs the whole loop, so the execution context is set up once
EXEC sys.sp_executesql @sql, @ParmDef
, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
4x throughput Bonus
• When all other bottlenecks are gone, sharing happens in the most unlikely places
• You can use spinlock Xevents inside SQL Server
• Remember symbol files in BINN
• Trace flag 3656
• This can also be done in XPERF for non-SQL apps
• Ex: Analysis Services
Concurrency, Spinlock Summary
• Control of buffers and NUMA through Xperf settings
• By default:
• 4MB mem
• Spool to disk at root of C-drive
• Can do buffer/file control:
• -BufferSize and -MaxBuffers
• -MaxFile and -FileMode Circular
Xperf controlling buffers
• Round robin between NUMA nodes
• Inside the NUMA node: Pick the one that looks the least busy
• This is NOT a perfect system
How SQL Server assigns threads
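The two-step placement described above (round robin across NUMA nodes, then least busy within the node) can be sketched in Python. This is a toy model under assumptions: `Node` and `Assigner` are hypothetical names, and "busy" is reduced to a simple count of assigned tasks, which is not necessarily the load measure SQL Server uses.

```python
from itertools import cycle


class Node:
    """One NUMA node with a task count per scheduler."""

    def __init__(self, node_id, scheduler_count):
        self.node_id = node_id
        # load[i] = number of tasks currently assigned to scheduler i
        self.load = [0] * scheduler_count

    def least_busy(self):
        """Index of the scheduler with the fewest tasks (first one on ties)."""
        return min(range(len(self.load)), key=self.load.__getitem__)


class Assigner:
    def __init__(self, nodes):
        self._nodes = cycle(nodes)  # round robin over NUMA nodes

    def assign(self):
        node = next(self._nodes)
        scheduler = node.least_busy()
        node.load[scheduler] += 1
        return node.node_id, scheduler


nodes = [Node(0, 6), Node(1, 6)]   # 2 sockets x 6 cores, as on the test server
assigner = Assigner(nodes)
nodes[0].load[0] = 5               # pretend scheduler 0 on node 0 is already busy
placements = [assigner.assign() for _ in range(4)]
```

The sketch also shows why the scheme is not perfect: node 0 keeps receiving every second task even though it already carries more load than node 1, because the node is chosen before load is considered.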
Xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
Super Xperf
• All the tuning won’t help you if your model is wrong
• Tuning gets you far, but to really scale, you need a good data model
• This is what my other courses are about
But does the Data Model Work?
&
Problem Statement
[Diagram: Ordered Queue Structure of 300B messages (Msg Msg Msg Msg Msg), with Push at one end and Pop at the other]
The Naïve Approach
• Push
• Seek First Row
• INSERT Row
• Pop
• Seek Last Row
• DELETE/Output
[Diagram: Key column running from Min to Max; Msg rows are added at one end and removed at the other]
Why this doesn’t Scale!
[Diagram: B-tree with HOBT_ROOT and a Virtual Root above leaf pages running from Min to Max, linked by Next/Prev; contention points: LATCH HOBT_VIRTUAL_ROOT, LCK, and PAGELATCH on the B-Tree Root Pages]
Summarising the Problem
• Hot stuff
• Root
• Min page
• Max page
• Intermediate pages
• Alloc/Dealloc
• BUT: We Must have order!
Cooling it down
What if…
• Push
• Seek first value page
• UPDATE Reference Count
• Pop
• Seek last value page
• UPDATE Reference Count
[Diagram: pages from Min to Max, updated in place with Msg++ and Msg--]
Dissipate the Heat
[Diagram: three partitions (Last Digit = 0, Last Digit = 1, Last Digit = 2), each with its own Min (Msg--) and Max (Msg++)]
Eliminating Thread Contention
[Diagram: Ordered Queue Structure (8 7 6 5 4) driven by PushSequence++ and PopSequence++. VERY fast!]
Ring Buffers
[Diagram: Ordered Queue Structure as a ring; PushSequence++ Mod 100 and PopSequence++ Mod 100 select the slot: Slot 8 = Msg 108, Slot 7 = Msg 107, Slot 6 = Msg 106, Slot 5 = Msg 105, Slot 4 = Msg 104]
Summing Up Message Queue Hack
• UPDATE
• instead of INSERT/DELETE
• More partitions = More B-Trees
• Ring buffer using modulo
• Find Sweet spot concurrency
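The ring buffer idea summed up above can be sketched in Python as follows. It is an in-memory model under assumptions, not the table design itself: `RingQueue` is a hypothetical class, the slot list stands in for a fixed set of pre-allocated rows, and the `start` parameter exists only so the example lines up with the slot and message numbers in the diagram (Msg 104 in Slot 4 up to Msg 108 in Slot 8).

```python
class RingQueue:
    """Fixed number of slots; push and pop only advance a sequence number
    and overwrite (UPDATE) a slot, so nothing is inserted or deleted."""

    def __init__(self, slots=100, start=0):
        self.slots = [None] * slots
        self.push_seq = start   # next sequence number to write
        self.pop_seq = start    # next sequence number to read

    def push(self, msg):
        if self.push_seq - self.pop_seq == len(self.slots):
            raise OverflowError("ring buffer full")
        self.slots[self.push_seq % len(self.slots)] = msg
        self.push_seq += 1

    def pop(self):
        if self.pop_seq == self.push_seq:
            raise IndexError("ring buffer empty")
        slot = self.pop_seq % len(self.slots)
        msg, self.slots[slot] = self.slots[slot], None
        self.pop_seq += 1
        return msg


q = RingQueue(slots=100, start=104)
for n in range(104, 109):        # messages 104..108, as in the diagram
    q.push(f"msg-{n}")
first = q.pop()

# Wrap-around on a tiny ring: the fourth push reuses slot 0.
small = RingQueue(slots=3)
for m in ("a", "b", "c"):
    small.push(m)
popped = [small.pop()]
small.push("d")
popped += [small.pop(), small.pop(), small.pop()]
```

Order is preserved by the two sequence numbers alone, so no ordered index has to grow at one end and shrink at the other.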

Master tuning

  • 2. Thomas Kejser • Formerly SQLCAT • Tuning SQL Server since 6.5 • 15+ Years of database experience • http://blog.kejser.org • CTO Fusion-io Europe
  • 4. Performance Scalabilityvs. Response Time Ressource Use Adding more of a HW ressource makes things faster You can scale without having performance (ex: HADOOP) You can perform without having scalability (ex: In Memory Engines)
  • 5. Our Reasonably Priced Server • 2 Socket Xeon E3645 • 2 x 6 Cores • 2.4Ghz • NUMA enabled, HT off • 12 GB RAM • 1 ioDrive2 Duo • 2.4TB Flash • 4K formatted • 64K AUS • 1 Stripe • Power Save Off • Win 2008R2 • SQL 2012 Image Source: DeviantArt
  • 6. Between disk and Memory Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 1ns 10ns 100ns 100us 10ms10us
  • 7. The “cache out curve” Data Size Throughput/thread Cache Size Service Time + Wait Time
  • 9. There are several of these curves Throughput Touched Data Size CPU Cache TLB NUMA Remote Storage
  • 10. Response time = Service Time + Wait Time Algorithms and Data Structures “Bottlenecks”
  • 11. • DBA tasks • Installation of OS and SQL • Basic Memory Configuration • Basic Perfmon style monitoring • Backup/Restore and HA setup • Basic reading a Query Plan • Basic understanding of database structures • Adding Indexes to tables • Running a Profiler trace What you ALREADY know
  • 13. What we Need • Free tools from MS • Windows SDK • In Win8: The “ADK” • Need .NET 4 to install
  • 14. Where Did the Time Go? xperf -on Base -f Base.etl SELECT TOP 100000 * FROM LINEITEM INNER JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY SQLCMD -E -S. -i "Select.sql" xperf -stop
  • 15. BASE profile with xperf Service Time + Wait Time
  • 16. Right Click – Summary Table Service Time + Wait Time
  • 17. What exactly is SQLNCLI? Service Time + Wait Time
  • 18. Quantifying just how stupid XML is SELECT TOP 1000000 * FROM ORDERS JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY FOR XML RAW ('OUTPUT') Xperf –on Base –f Base.etl With XML “Native” Format
  • 19. Which CPU cycles are Expensive? “App” tier Web Server Licensing >3K USD Blades Database Tier Core Licensing >10K USD <XML> ? Service Time + Wait Time
  • 20. • What about the time INSIDE the process? • What if the EXE won’t tell us? Diving even Deeper
  • 21. What is a Debug Symbol? mov ax,10 mov bx,20 mov cx,3 push ax push bx push cx call <address> <address> push bp mov bp,sp mov ax,[bp+8] mov bx,[bp+6] mov cx,[bp+4] add ax,bx div cx mov dx,ax ret HeaderdoStuff(10,20,3) … int doStuff(int a, int b, int c) { return (a + b) / c } myProg.exe Machine Code <address> = doStuff Symbol table myProg.pdb Service Time + Wait Time
  • 22. Where do you get PDB files? • Public Symbol Server • Configure Environment: _NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols and _NT_SYMCACHE_PATH=C:\SymCache • Dbghelp.dll
  • 23. • Auto Generated by Visual Studio: Your Own Debug Symbols Service Time + Wait Time
  • 24. Adding and Checking Your Symbols • Symbols are indexed, so you have to add them: cd Bin/x64/Release/ symstore add /f *.pdb /s C:/Symbols /t "MyExe" • Validate that the symbols can resolve: cd Bin/x64/Release/ symchk MyExe.exe /V
  • 25. Got .NET and x64? • Standard Xperf works fine for your own native code • BUT: Before Windows 8, stack walking is broken for x64 .NET • If you have .NET with 64-bit code, you must NGEN first: ngen install Bin/x64/Release/MyExe.exe (ngen lives here: %Windir%\Microsoft.NET\Framework64\<Version>\ngen.exe)
  • 26. • Free tool from MS: .NET tracing is a pain, get a tool! • Not to be confused with xperfview • Same trace API and file format • Helps set obscure .NET specific trace flags Service Time + Wait Time
  • 27. And Finally, You can do Very Cool Things Did I tell you about interlocked operations?... Whiteboard time! Service Time + Wait Time
  • 28. • Consider again our LINEITEM table What is SQL Server REALLY doing? • How expensive is it to read from that? • Think ETL code and DW/BI queries CREATE TABLE LINEITEM ( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] [decimal](15, 2) NOT NULL, [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL, [L_DISCOUNT] [decimal](15, 2) NOT NULL, [L_TAX] [decimal](15, 2) NOT NULL, [L_RETURNFLAG] [char](1) NOT NULL, [L_LINESTATUS] [char](1) NOT NULL, [L_SHIPDATE] [date] NOT NULL, [L_COMMITDATE] [date] NOT NULL, [L_RECEIPTDATE] [date] NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] [char](10) NOT NULL, [L_COMMENT] [varchar](44) NOT NULL ) BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
  • 29. SQLCMD – Native code Test SQLCMD.EXE Where does the time go? Service Time + Wait Time
  • 30. Standard Reading of Data xperf -on base -stackwalk profile -f stackwalk.etl SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch" (55 sec) xperf -stop xperf -merge stackwalk.etl stackwalkmerge.etl
  • 31. Details of the Time – Padding? Service Time + Wait Time
  • 32. More Details – Conversion Work?
  • 33. An Educated guess about improvements CREATE TABLE [dbo].[LINEITEM_native]( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] money NOT NULL, [L_EXTENDEDPRICE] money NOT NULL, [L_DISCOUNT] money NOT NULL, [L_TAX] money NOT NULL, [L_RETURNFLAG] int NOT NULL, [L_LINESTATUS] int NOT NULL, [L_SHIPDATE] int NOT NULL, [L_COMMITDATE] int NOT NULL, [L_RECEIPTDATE] int NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] int NOT NULL, [L_COMMENT] char(44) NOT NULL ) CREATE TABLE [dbo].[LINEITEM]( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] [decimal](15, 2) NOT NULL, [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL, [L_DISCOUNT] [decimal](15, 2) NOT NULL, [L_TAX] [decimal](15, 2) NOT NULL, [L_RETURNFLAG] [char](1) NOT NULL, [L_LINESTATUS] [char](1) NOT NULL, [L_SHIPDATE] [date] NOT NULL, [L_COMMITDATE] [date] NOT NULL, [L_RECEIPTDATE] [date] NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] [char](10) NOT NULL, [L_COMMENT] [varchar](44) NOT NULL, ) Before After Service Time + Wait Time
  • 34. Getting Rid of Useless Work Additional parameters for SQLCMD: -a32767 -W -s";" -f437 x1.5 Service Time + Wait Time
  • 35. Unicode – 10% overhead? Service Time + Wait Time
  • 36. Lets try that with Native and Unicode … x5 Service Time + Wait Time
  • 37. • SQLNCLI is one of these in disguise • ODBC • OLEDB • Pick good data types • MONEY over NUMERIC • UNICODE of data arrives like this • Native protocols vs. flexibility Summary Moving Data
  • 38. • Get • Windows 8 ADK • Windows 7 SDK • Set up Symbol Paths • Xperf –on Base • Standard trace for time, narrow to process and DLL/EXE • Xperf –on Base –stackwalk Profile • Get to the call stack, find the offending function(s) • Ease of use for .NET: perfview.exe Summary – Xperf Service Time + Wait Time
  • 39. Response time = Service Time + Wait Time
  • 41. Loop Join n row B-tree Log(n) reads Complexity: O(m * log(n)) Service Time + Wait Time m row result 1 43 13 7 3
  • 42. Linked List Tree Linked List vs. Tree Service Time + Wait Time 0 1 2 3 4 5 6 7 8 n 8 134 62 1510 16141197531 Log2(n)
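To make the n versus Log2(n) comparison on this slide concrete, the following Python sketch counts how many keys are inspected when a sorted sequence is walked like a linked list and when the search range is halved at every step like a balanced tree. The function names and the one-million-key data set are illustrative assumptions, not taken from the deck.

```python
def linear_steps(sorted_keys, target):
    """Inspect keys one by one, like following a linked list."""
    steps = 0
    for key in sorted_keys:
        steps += 1
        if key == target:
            break
    return steps


def tree_steps(sorted_keys, target):
    """Halve the search range on every step, like descending a balanced tree."""
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            break
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


keys = list(range(1, 1_000_001))   # hypothetical table of one million keys
worst_case = keys[-1]              # the last key is the worst case for the list walk

print("linked list steps:", linear_steps(keys, worst_case))
print("tree steps:       ", tree_steps(keys, worst_case))
```

The list walk touches every key, while the tree-style search needs at most about twenty steps for a million keys, which is the gap the slide draws.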
  • 43. Cluster on O_ORDERKEY Index on O_ORDERKEY Basic argument for Cluster Indexes Service Time + Wait Time CREATE UNIQUE CLUSTERED INDEX CIX_Key ON ORDERS_Cluster (O_ORDERKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Cluster WHERE O_ORDERKEY = 3000000 CREATE UNIQUE INDEX IX_Key ON ORDERS_Heap (O_ORDERKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Heap WHERE O_ORDERKEY = 3000000 Table 'ORDERS_Heap'. Scan count 0, logical reads 3 , physical reads 0, read-ahead reads 0 Table 'ORDERS_Cluster'. Scan count 0, logical reads 4 , physical reads 0, read-ahead reads 0
  • 44. Cluster on O_ORDERKEY heap + Index on O_ORDERKEY But what if we do this a lot? CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY) WITH (FILLFACTOR = 100) CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Heap WHERE O_CUSTKEY = 47480 SELECT * FROM ORDERS_Cluster WHERE O_CUSTKEY = 47480 Table 'ORDERS_Cluster'. Scan count 1 , logical reads 27, physical reads 0 Table 'ORDERS_Heap'. Scan count 1 , logical reads 11, physical reads 0 Service Time + Wait Time
  • 45. How many LOOP joins/sec/core? 7 Sec Service Time + Wait Time
  • 46. What did we just measure? Xperf –on Base –stackwalk profile About 40%... Service Time + Wait Time
  • 47. • The query language itself • Why so many ExecuteStmt? • …With so much CPU use? What is sqllang.dll? Service Time + Wait Time
  • 48. A different way to Measure Loops 1 Sec Service Time + Wait Time
  • 49. VS. What does THAT look like? Takeaway: The T-SQL language itself is expensive Service Time + Wait Time
  • 50. • Sample from LINEITEM • Force loop join with index seeks • Do 1.4M seeks Test: Singleton Row Fetch
  • 51. Singleton seeks – Cost of compression Compression Seek (1.4M seeks) CPU Load None - Memory 13 sec 100% one core PAGE - Memory 24 sec 100% one core None – I/O 21 sec 100% one core PAGE – I/O 32 sec 100% one core Function % Weight CDRecord::LocateColumnInternal 0.82% DataAccessWrapper::DecompressColumnValue 0.47% SearchInfo::CompareCompressedColumn 0.28% PageComprMgr::DecompressColumn 0.24% AnchorRecordCache::LocateColumn 0.18% ScalarCompression::AddPadding 0.04% ScalarCompression::Compare 0.11% Additional Runtime of GetNextRowValuesInternal 0.14% Total Compression 2.28% Total CPU (single core) 8.33% Compression % 27.00% xperf –on base –stackwalk profile
  • 52. Modern CPU CPU L3 Cache 4MB Inst Cache 32KB Core Data Cache 32KB L2 Uni Cache 256K Inst Cache 32KB Core Data Cache 32KB L2 Uni Cache 256K Bus Service Time + Wait Time
  • 53. The B+ Tree Service Time + Wait Time B+ Tree
  • 54. Hekaton Style “Loop” Lookup Table (hash) Service Time + Wait Time
  • 55. Merge Join m row result 1 1 2 3 n row result 1 2 3 4 4 43 43 Sorted Sorted Complexity: O(m + n) Service Time + Wait Time
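The O(m + n) cost of a merge join comes from advancing two cursors over inputs that are already sorted. Here is a minimal Python sketch of that idea. It assumes the left input has unique keys (like ORDERS on O_ORDERKEY) and the right input may repeat keys (like LINEITEM on L_ORDERKEY); it is an illustration of the algorithm, not SQL Server's implementation.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) tuples, both sorted by key.

    Assumption: keys in `left` are unique; keys in `right` may repeat.
    Each input row is visited at most once, so the cost is O(m + n).
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # left side is behind, advance it
        elif left_key > right_key:
            j += 1                      # right side is behind, advance it
        else:
            out.append((left_key, left[i][1], right[j][1]))
            j += 1                      # keep the left row: more right rows may match
    return out


orders = [(1, "order-1"), (2, "order-2"), (4, "order-4")]
lineitems = [(1, "line-a"), (1, "line-b"), (3, "line-c"), (4, "line-d")]
print(merge_join(orders, lineitems))
```

Because the sketch relies on the left side being unique, it never has to rewind either cursor. When neither side is unique a merge join needs extra bookkeeping to replay duplicates.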
  • 56. Merge Join – What is Fastest? Service Time + Wait Time SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE) FROM LINEITEM INNER MERGE JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY …or SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE) FROM ORDERS INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
  • 57. Comparing the Query Plans Service Time + Wait Time
  • 58. Digging in Deeper Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3265 ms, elapsed time = 3357 ms. Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 2469 ms, elapsed time = 2607 ms. Service Time + Wait Time
  • 59. We can beat SQL Server at this game SELECT MAX(O_ORDERDATE), MAX(MAX_P) FROM (SELECT L_ORDERKEY,MAX(L_PARTKEY) AS MAX_P FROM LINEITEM GROUP BY L_ORDERKEY) b INNER MERGE JOIN ORDERS ON O_ORDERKEY = b.L_ORDERKEY Service Time + Wait Time
  • 60. Hash Join m row result 1 43 13 7 n row join table Hash(1) n row hash table Complexity: O(m + 2n) 3 Service Time + Wait Time
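A hash join has two phases: build a hash table over one input, then probe it once per row of the other input. The Python sketch below shows both phases with a plain dictionary. The row shapes and names are illustrative assumptions, and the sketch keeps the whole hash table in memory, which is exactly the condition that breaks down when a real hash join spills.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Join two lists of (key, payload) tuples with a build and a probe phase."""
    # Build phase: one pass over the (ideally smaller) build input.
    table = defaultdict(list)
    for key, payload in build_rows:
        table[key].append(payload)

    # Probe phase: one pass over the probe input, one hash lookup per row.
    out = []
    for key, payload in probe_rows:
        for match in table.get(key, ()):
            out.append((key, match, payload))
    return out


dim_product = [(1, "Beer"), (2, "Vodka"), (3, "Whiskey")]   # build side
fact_sales = [(2, 10), (1, 2), (2, 10), (7, 99)]            # probe side
print(hash_join(dim_product, fact_sales))
```

Neither input needs to be sorted or indexed, which is why this join suits large scans, but the build side must fit in the memory granted to the query to stay fast.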
  • 61. When Hash Joins hurt you Service Time + Wait Time 0 5 10 15 20 25 30 050100150200250300350400 Hash Memory (MB) Runtime (seconds) Spill Zone!
  • 62. Hash Joins Don’t Scale in MSSQL
  • 64. ACCESS_METHODS_DATASET_PARENT: “Used to synchronize child dataset access to the parent dataset during parallel operations.” Books Online Story… Image: FreeDigitalPhotos.net
  • 65. Using XPERF to find documentation xperf –on base+cswitch+dispatcher –stackwalk profile+readythread+cswitch
  • 66. Lets dig in… xperf -on base -stackwalk profile -f stackwalk.etl
  • 67. What LATCH pattern do we see? GetNextRangeForChildScan Inside: TableScanNew
  • 68. • Partition the table by a “random” value • Modulo the Key for example • Use SQL Server partition function/schema The Fix?… 0 1 2 3 4 5 6 253 254 255 hash
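The "modulo the key" idea can be sketched in a few lines of Python. The sketch maps each key to one of 256 buckets, matching the 0 to 255 range drawn on the slide, and checks how evenly a run of sequential keys spreads out. The function name and the key range are assumptions for illustration. In SQL Server the same value would typically be computed in a column that feeds the partition function.

```python
from collections import Counter

PARTITIONS = 256  # matches the 0..255 buckets on the slide


def partition_for(key, partitions=PARTITIONS):
    """Map an integer key to a partition number using modulo."""
    return key % partitions


# Hypothetical workload: one million sequential keys.
spread = Counter(partition_for(key) for key in range(1, 1_000_001))

print("partitions used:", len(spread))
print("rows in smallest partition:", min(spread.values()))
print("rows in largest partition: ", max(spread.values()))
```

Sequential keys land in consecutive buckets, so concurrent scans or inserts are spread over 256 separate structures instead of contending on one.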
  • 71. What is the Problem here?
  • 73. CPU Caches 0 100 200 300 400 500 600 700 800 900 1,000 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 MillionPages/sec Size of Accessed memory (MB) Random Pages Sequential Pages Single Page Service Time + Wait Time
  • 74. Goals: • Compressed • Prefetch Friendly • Cache Resident Code Example, Column Stores ID Value 1 Beer 2 Beer 3 Vodka 4 Whiskey 5 Whiskey 6 Vodka 7 Vodka ID Customer 1 Thomas 2 Thomas 3 Thomas 4 Christian 5 Christian 6 Alexei 7 Alexei Product Customer ID Date 1 2011-11-25 2 2011-11-25 3 2011-11-25 4 2011-11-25 5 2011-11-25 6 2011-11-25 7 2011-11-25 Date ID Sale 1 2 GBP 2 2 GBP 3 10 GBP 4 5 GBP 5 5 GBP 6 10 GBP 7 10 GBP Sale Service Time + Wait Time
  • 75. Compression is Easy • Step 1, ID ranges: Product’ (ID, Value): 1-2 Beer; 3 Vodka; 4-5 Whiskey; 6-7 Vodka • Customer’ (ID, Customer): 1-3 Thomas; 4-5 Christian; 6-7 Alexei • Date’ (ID, Date): 1-7 2011-11-25 • Sale’ (ID, Sale): 1-2 2 GBP; 3 10 GBP; 4-5 5 GBP; 6-7 10 GBP • Step 2, run lengths (RL): Product’ (RL, Value): 2 Beer; 1 Vodka; 2 Whiskey; 2 Vodka • Customer’ (RL, Customer): 3 Thomas; 2 Christian; 2 Alexei • Date’ (RL, Date): 7 2011-11-25 • Sale’ (RL, Sale): 2 2 GBP; 1 10 GBP; 2 5 GBP; 2 10 GBP
  • 76. Squeezing it even more RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka Product’ RL Value 2 1 1 2 2 3 2 2 Product’ Beer = 1 Vodka = 2 Whiskey = 3 ID Value 1-2 Beer 3-3 Vodka 4-5 Whiskey 6-7 Vodka Product’ 4+4+4+2 = 14B + 4+4+5+2 = 15B + 4+4+7+2 = 17B + 4+4+5+2 = 15B = 61B 4+4+2 = 10B + 4+5+2 = 11B + 4+7+2 = 13B + 4+5+2 = 11B = 45B 4+4 = 8B + 4+4 = 8B + 4+4 = 8B + 4+4 = 8B = 32B RL Value 2 0x01 1 0x10 2 0x11 2 0x10 Product’ 4 = 4B + 4 = 4B + 4 = 4B + 4 = 4B + 4 x 2b = 2B = 18B Service Time + Wait Time
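The two compression steps on these slides, run-length encoding followed by replacing each value with a small dictionary code, can be reproduced with a short Python sketch. The function names are illustrative assumptions. The input is the Product column from the column store example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (run_length, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def dictionary_encode(runs):
    """Replace each distinct value with a small integer code, in order of first use."""
    codes = {}
    encoded = []
    for length, value in runs:
        code = codes.setdefault(value, len(codes) + 1)
        encoded.append((length, code))
    return encoded, codes


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]

runs = run_length_encode(product)
encoded, codes = dictionary_encode(runs)

print(runs)      # (run length, value) pairs
print(encoded)   # (run length, dictionary code) pairs
print(codes)     # value -> code mapping
```

The run lengths always sum to the original row count, which is a quick sanity check on any run-length table.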
  • 77. RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka RL Customer 3 Thomas 2 Christian 2 Alexei Product’ Customer’ 2 steps with Beer 2 steps with Thomas Beer Thomas Beer Thomas SELECT Product, Customer FROM Table 1 step with Vodka 1 step with Thomas Vodka Thomas 2 step with Whiskey 2 step with Christian Whiskey Christian Whiskey Christian 2 step with Vodka (Note: Repeated value) 2 step with Alexei Vodka Alexei Vodka Alexei Service Time + Wait Time
  • 78. Hash Joining with Column Stores RL Key 2 Beer 1 Vodka 2 Whiskey 2 Vodka Table Key Type Beer Soft Vodka Strong Whiskey Strong Vodka Strong Dim Product SELECT … FROM Table JOIN DimProduct ON Key WHERE Type = ‘Strong’ 1 Compute bloom filter of Keys belonging to ‘strong’ 2 Read RL = 2, Beer from Table 3 Compute bloom value of Beer. 4 Equal to filter value from 1? Yes. Output two rows (RL=2) 5 Compute bloom value for Vodka 6 Equal to filter value from 1? No. Do nothing 7 Compute bloom value for Whiskey 8 Equal to filter value from 1? No. Do nothing • Can prefetch data (needs RLE) • Can calculate match/no match using only local CPU cache • Won't work for OLTP!
  • 79. Why is it so hard to get joins right? n m Time Loop Join Merge Join Hash Join Service Time + Wait Time
  • 80. Desired Join Join Hint Query Hint LOOP [INNER | LEFT | CROSS | FULL] LOOP JOIN OPTION (LOOP JOIN) MERGE [INNER | LEFT | CROSS | FULL] MERGE JOIN OPTION (MERGE JOIN) HASH [INNER | LEFT | CROSS | FULL] HASH JOIN OPTION (HASH JOIN) LOOP with Seek WITH FORCESEEK WITH ( INDEX (index = <name>) ) N/A Controlling Joins Note: Join hints force the order of the ENTIRE join tree! Service Time + Wait Time
  • 81. What Type of Workload? BigSmall Small Big DataReturned Data Touched OLTP BI/DW Simulation ETL Service Time + Wait Time
  • 82. How to Classify? OLTP BI/DW Simulation ETL Full Scan/sec Range Scans/sec Probe Scans/sec Index Search/sec Range Scans/sec Full Scan/sec Range Scans/sec Bulk Copy Rows/sec ?
  • 83. There should ALWAYS be a fully indexed path to the data. OLTP System Basic Query Pattern BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
  • 84. 1. Find worst CPU consuming query with sys.dm_exec_query_stats 2. Add OPTION (LOOP JOIN) to offending query 3. Check estimated query plan 4. If table spool found: add index to remedy and GOTO 3 5. Happy? If not, GOTO 1 The Super Quick OLTP Tuning Guide Service Time + Wait Time
  • 85. The query will not be (much) worse than a full scan of a fact partition DW/BI System Basic Query Pattern BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
  • 86. 1. Find offending query 2. Add OPTION (HASH JOIN) to query 3. Do the dimension tables have an indexed path to build the hash? If not, add index 4. Do you get a fact table scan and hash build of all dimensions? If not, check statistics (especially on facts and skewed) 5. Optimize Fact table scans 1. Partition and partition elimination 2. Column store if you have it 3. Aggregate Views 4. Bitmap index pushdown (statistics!) 5. Composite indexes (last resort!) The Super Quick DW tuning Guide
  • 87. The expected DW Query Plan Partial Aggregate Fact CSI Scan Dim Scan Dim Seek Batch Build Batch Build Hash Join Hash Join HashStream Aggregate
  • 88. • At least enough RAM to hold the hash tables of the largest dimension • De-normalisation helps… a LOT • Especially for the large/large joins • Likely: need to scan fast from disk if RAM is not big enough to hold the fact • Compression REALLY matters Things that Follow from desired DW Plan Service Time + Wait Time
  • 90. Response time = Service Time + Wait Time
  • 91. Where EVERY Server wide diagnosis starts SELECT * FROM sys.dm_os_wait_stats WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits) AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC Service Time + Wait Time
  • 92. • Shows up as waits for PAGEIOLATCH • You can dig into details with: Common Problems - PAGEIO Service Time + Wait Time SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) • Can also Xevent your way to it per query CREATE EVENT SESSION [TraceIO] ON SERVER ADD EVENT sqlserver.file_read_completed( ACTION (sqlserver.database_id,sqlserver.session_id))
  • 93. • I/O, like memory, is a GLOBAL resource for the machine • When does it make sense to partition a global resource? • When you deeply know the workload • When the workload is ALREADY partitioned • When neither of those are true: DON’T partition • If you have NAND/SSD – Why bother? The general I/O Guidance Service Time + Wait Time
  • 94. A good way to Think of Spindle I/O
  • 95. JBOD SAME LUN Seq. LUN Seq. LUN Seq. RAID system Large LUN Seq. Seq. Seq. RANDOM I/O Service Time + Wait Time
  • 96. Stripe vs. Concatenation RAID 10 RAID 10 Concatenated LUN RAID 10 RAID 10 Striped LUN Service Time + Wait Time
  • 97. OLTP • One big SAME setup • data files • Tempdb • Dedicate • Transaction log • DRAM: • Enough to hold most of DB Data Warehouse • JBOD setup • Data Files • 1-2 per LUN • SAME setup • Tempdb • Dedicate • Transaction Log • DRAM: • Enough to hold largest partition of largest table Rules of Thumb – Spindle I/O and DRAM Service Time + Wait Time
  • 98. • Short Stroking • Elevator Sort • Sequential vs. Random • Weaving You can do a bit better… or worse Service Time + Wait Time
  • 99. • Intentionally use lower % of total space • Tradeoff: • Space for Speed • Test: • 15K rpm • SAS spindle • 300GB Short Stroking Disks 150 200 250 300 350 400 0% 20% 40% 60% 80% 100% IOPS % Capacity Used Service Time + Wait Time
  • 100. Full Stroked Short Stroked Why does Short Stroking Work? Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the disk won’t use the full platter either. The result: less head movement
  • 101. Adding Elevator Sorting 0 200 400 600 800 1000 1200 0 100 200 300 400 500 600 Full Stroke Random Outer Short Inner Short Elevator Sort Elevator Short Stroked Latency IOPS 8K random I/O IOPS Avg. Latency Max Latency Bat powered disk!
  • 102. Why Chase Sequential I/O? 0 10 20 30 40 50 60 70 80 1 10 100 1000 10000 100000 Sequential Full Stroke Random Latency(ms) Log(IOPS) 8K Block Pattern IOPS Avg Latency Max Latency Service Time + Wait Time
  • 103. • One SATA disk • Two partitions • One file on each • Sequential read on each file But all is not well! File1 File2 Service Time + Wait Time
  • 104. I/O Weaving in action 0 2 4 6 8 10 12 14 16 18 0 50 100 150 200 250 300 64K Random 64K Dual Sequential Latency(ms) IOPS IOPS Avg Latency Source: Michael Anderson Service Time + Wait Time
  • 105. Storage Pool and Weaving DataLog DataLog DataLog Massive, then Provisioned Pool Seq Ran Seq Ran Seq Ran RANDOM! Service Time + Wait Time
  • 106. The SAN will properly handle Sharing! Green: Checkpoint, Red: tx/sec, Black: Disk Latency Service Time + Wait Time
  • 107. Numbers to Remember - Spindles Characteristic Typical Units Throughput / Bandwidth 90-125MB/sec But ONLY if sequential access! Operations per Sec 10K RPM Spindle: 100-130 IOPS 15K RPM Spindle: 150-180 IOPS Can get about 2x if short stroking (more later) Latency 3-5ms (compare DRAM: 100ns) Capacity 100s of GB to single digit TB 2012 numbers, will change in future Service Time + Wait Time
  • 108. • Few hundreds of IOPS • Faster if short stroked • Trade latency for speed with elevator sort • Sequential is hard to get right Summary so far.. Single Disk Service Time + Wait Time
  • 109. • Wider Stripes neat • But scale not linear • Very deep queues help • But add latency • Shared Components Why does a big RAID pile not solve this? Service Time + Wait Time
  • 110. RAID Scale? Your Mileage WILL vary with the hardware
  • 111. Before After Getting rid of Sharing Switch HBA HBA HBA HBA Storage Port Storage Port Switch LUN LUN Cache Disk CPU Switch HBA HBA HBA HBA Storage Port Storage Port Switch LUN LUN Cache Disk CPU x2
  • 112. 4K PN N NAND Flash Basics 112 PN N Oxide Layer Floating Gate Electrons trapped Control Gate NAND Die Pack Blocks 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K PN N PN NPN N PN NPN N PN NPN N Pages
  • 113. NAND Flash Problems • Erase Cycles • Around 100K • Rebalancing and reclaim/trim • Voltage measurement • Gets worse with density • Changes over time • Depends on how you program • Bit Rot • Must refresh even on read • SLC easier to manage than MLC • But much more expensive! 113 Voltage 00 01 10 11
  • 114. Lessons Learned: Try to Avoid Sharing BAD BETTER BEST Service Time + Wait Time
  • 116. • Only partially diagnosed as waits in sys.dm_os_wait_stats • Task Manager gives a bit more information • Need: transparency to the deep level latencies and packets! Common Problems: ASYNC_NETWORK, OLEDB Service Time + Wait Time
  • 117. A common Wait Type The database is really slow! The code takes forever to run! Service Time + Wait Time
  • 118. • We may not always have insight into what is going on at the client… Xperf Diagnosing the Network xperf –on latency+network Summary Table Service Time + Wait Time
  • 119. Timeline of the network Traffic
  • 120. ASYNC_NETWORK_IO, the typical issue Service Time + Wait Time
  • 121. Handling network is EXPENSIVE xperf –on latency ? Service Time + Wait Time
  • 122. Short Story on DPC/ISR handling CPU Core Core L1-L3 Cache PCI BUS IRQ HALT execution Fire ISR Routine if (my interrupt) { <Mark Handled> Queue DPC } NIC Work Done DPC <Do work needed> <Wake Application> Core can run other stuff again Service Time + Wait Time
  • 123. It looks like this… DPC ISR Service Time + Wait Time
  • 124. • Option 1: Use the HW vendors tool • Option 2: Use interrupt Affinity Policy Tool from MS Setting Interrupt Affinity Service Time + Wait Time
  • 125. • Standard Payload Network (MTU): • 1500 B • Jumbo Frames • 9014 B(MTU) Jumbo Frame and SQL Packets • Standard SQL payload • 4096 B • Largest • 32767 B SELECT session_id, net_packet_size FROM sys.dm_exec_connections Server=foo;Packet size=32767 Service Time + Wait Time
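The sizes on this slide translate directly into how many SQL packets and network frames a result set needs. The Python sketch below does that arithmetic. It is a rough estimate under stated assumptions: the 100 MB result set is hypothetical, and protocol header overhead is ignored, so each packet or frame is treated as if its full size carried payload.

```python
def units_needed(payload_bytes, unit_bytes):
    """Number of fixed-size units needed to carry a payload (integer ceiling)."""
    return -(-payload_bytes // unit_bytes)


RESULT_SET_BYTES = 100 * 1024 * 1024   # hypothetical 100 MB result set

# SQL packet sizes from the slide: default 4096 B, largest 32767 B.
for packet_size in (4096, 32767):
    print(f"packet size {packet_size:>5} B -> "
          f"{units_needed(RESULT_SET_BYTES, packet_size):>6} SQL packets")

# Frames needed to carry one largest-size SQL packet, using the slide's sizes.
for frame_size in (1500, 9014):
    print(f"frame size  {frame_size:>5} B -> "
          f"{units_needed(32767, frame_size):>6} frames per 32767 B packet")
```

Fewer, larger units mean fewer round trips through the network stack, which matters because handling each one costs CPU on the server.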
  • 127. Core Evolution Moore’s “Law”: “The number of transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
  • 128. • Never faster than a single core • Smaller servers are faster than bigger ones • Large L2 caches and more clock speed help • The algorithm dictates speed • Latency of Wait Time sets upper limit • Examples from MSSQL land: • Formula Engine in MSAS • Transaction Log Writes • INSERT/UPDATE/DELETE (as we shall see) Single Threaded
  • 129. VLF files • When switching to a new VLF, it has to be “formatted” with an 8K sync write • While this happens, transactions are blocked • Too many VLF = Too much blocking • Lesson: Preallocate the database log file in big chunks • Up to 128 Log Buffers per database • Spawned on demand, will not be released once spawned • Transactions will wait on LOGBUFFER if no buffer is available • Think of this like a pipeline of commits waiting… VLF(1) VLF(2) VLF(3) VLF(4) VLF(5) VLF(6) 8K 8K 8K 8K 8K 8K <=60K X 128
  • 130. Transaction Log Background Buffer Offset (cache line) LOGCACHE ACCESS Alloc Slot in Buffer MemCpy Slot Content Log Writer Writer Queue Async I/O Completion Port Slot 1 LOGBUFFER WRITELOG LOG FLUSHQ Signal thread which issued commit T0 Tn Slot 127 Slot 126
  • 131. • Speed is determined by Latency and Code Path • Max Log Write Size: 60K Zooming to the Log Writer Log Writer Async I/O Completion Port Signal thread which issued commit Latency Writer Queue
  • 132. Long Distance Replication… Log Entry Log Entry Network Log Entry Send log Ack Log Primary Secondary Write Write Executive Summary: The speed of light ( c ) is not fast enough!
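The claim that the speed of light is not fast enough can be checked with simple arithmetic. The Python sketch below computes the minimum round trip time over a given distance and the resulting ceiling on commits per second for a single stream of commits that each wait for the remote acknowledgement. The 1000 km distance and the assumption that light in fiber travels at roughly two thirds of c are illustrative, and switching, disk and software latency are ignored.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_FACTOR = 2 / 3   # assumption: signal speed in fiber is about 2/3 of c


def min_round_trip_ms(distance_km, factor=FIBER_FACTOR):
    """Lower bound on round trip time, from propagation delay alone."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * factor)
    return 2 * one_way_seconds * 1000


def max_serial_commits_per_sec(distance_km):
    """Upper bound on commits/sec if every commit waits one full round trip."""
    return 1000 / min_round_trip_ms(distance_km)


distance = 1000   # hypothetical distance between primary and secondary, in km
print(f"round trip: {min_round_trip_ms(distance):.2f} ms")
print(f"commit ceiling: {max_serial_commits_per_sec(distance):.0f} per second")
```

Even before any real-world overhead, each synchronous commit over that distance costs about as much as a slow disk write.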
  • 133. • Perfmon will only show millisec • What if we want microseconds? Getting to the Real Latency xperf –on latency
  • 134. It’s in Memory, so it must be fast? VS. Latency: 15-30us Latency: <5us RAM DISK 1.5sec 1.5sec
  • 135. No, Because… This adds up to one core… it is doing all it can with the CPU it has
  • 136. The Effect on UPDATE Naïve UPDATE MyBigTable SET c6 = 43 Parallel UPDATE MyBigTable SET c6 = 43 WHERE key BETWEEN 10**9 * n AND 10**9 * (n+1) -1CX Runtime (smaller is faster)
  • 138. What is Scalable? 0 500 1000 1500 2000 2500 3000 0 4 8 12 16 20 24 Throughput Some Hardware Resource Good So so Bad We want to live here
  • 139. Amdahl’s Law of gated speedup: $\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$ • P = Part of program that can be made Parallel (Note that this may be 0... or 1) • N = Number of CPU cores available • [Chart: speedup factor versus number of cores, 0 to 64, for P = 100%, 95%, 90% and 80%]
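The standard form of Amdahl's law, 1 / ((1 - P) + P / N), is easy to evaluate for the P values plotted on this slide. The Python sketch below prints the speedup at 64 cores for each curve. The function name is an illustrative assumption.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for p in (1.00, 0.95, 0.90, 0.80):
    print(f"P = {p:.0%}: speedup on 64 cores = {amdahl_speedup(p, 64):.1f}x")
```

With only 5% serial work the 64-core speedup is already limited to roughly 15x, which is why the serial parts of the engine get so much attention in the rest of the deck.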
  • 140. Introducing Contention – Locks Table A Table B Table C INSERT TableA … INSERT TableB … INSERT TableC … LCK LCK LCK LCK LCK LCK LCK LCK Wait Stat: LCK_<X>
  • 141. But those rows have to be stored… Table A Table B Table C LCK LCK LCK LCK LCK LCK LCK LCK Data File File Group
  • 142. It all Starts with Wait Stats SELECT * FROM sys.dm_os_wait_stats WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits) AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC DBCC PAGE
  • 143. PFS – Hidden Single Page Contention Data File GAM/ SGAM PFS 64MB PFS PFS 64MB PFS 64MB PFS B B B B B B B B B B B B B B B B 8K 10010010 INSERT TableA … Allocated bit
  • 144. Data File Data File Data File More Files Table A Table B Table C LCK LCK LCK LCK LCK LCK LCK LCK Data File File Group • Round Robin between files • More files, more structures • No affinity
  • 145. How many more Files? 1 10 100 1000 10000 100000 1000000 10000000 260 280 300 320 340 360 380 400 0 8 16 24 32 40 48 PAGELATCH Runtime # Data Files Runtime PAGELATCH_UP
  • 146. • Shared, physical MEMORY structures can cause bottlenecks (ex: PFS) • SQL Server must sync too… • Understanding where structure resides leads to tuning fix • Theory of engine! Concurrency: What we learned so far
  • 147. CXPACKET • Commonly misdiagnosed • CXPACKET does NOT (always) mean that your DOP is “too high” • [Charts: CXPACKET waits versus throughput (MB/sec); throughput (MB/sec) versus DOP]
  • 148. CXPACKET = Issue may be elsewhere…
  • 149. • What happens when you get things like: LATCH_<x> PAGELATCH_<x> Step 1: Dig into: Diagnosing Latches SELECT * FROM sys.dm_os_latch_stats Service Time + Wait Time
  • 150. Digging into Latches Again…
  • 151.
  • 152. Zooming into the Ready Thread
  • 154. • Before: 6GB/sec • After: 20GB/sec • This sometimes works on cluster indexes too… …Whiteboard Speedup with Hash Partition of Heap
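  • 156. UPDATE Hack on Small Tables • Before: Page (8K) holds ROW ROW ROW; each row takes its own LCK_U, but all of them share one PAGELATCH_EX • Fix: ALTER TABLE HotUpdates ADD Padding CHAR(5000) NOT NULL DEFAULT ('X') • After: Page (8K) holds one ROW plus the CHAR(5000) padding, so each row has its own LCK_U and its own PAGELATCH_EX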
  • 157. Test: Updates of pages Compression Update 1.4M CPU Load None - Memory 13 sec 100% one core PAGE - Memory 54 sec 100% one core None – I/O 17 sec 100% one core PAGE – I/O 59 sec 100% one core L_QUANTITY is NOT NULL i.e. in place UPDATE
  • 158. Function CPU % qsort 0.86 CDRecord::Resize 0.84 CDRecord::LocateColumnInternal 0.36 perror 0.36 Page::CompactPage 0.36 ObjectMetadata::`scalar deleting destructor' 0.27 SearchInfo::CompareCompressedColumn 0.24 CDRecord::InitVariable 0.19 CDRecord::LocateColumnWithCookie 0.18 memcmp 0.16 PageDictionary::ValueToSymbol 0.16 Record::DecompressRec 0.14 PageComprMgr::DecompressColumn 0.14 CDRecord::InitFixedFromOld 0.1 SOS_MemoryManager::GetAddressInfo64 0.08 AnchorRecordCache::LocateColumn 0.08 CDRecord::GetDataForAllColumns 0.08 ScalarCompression::Compare 0.07 PageComprMgr::CompressColumn 0.07 Record::CreatePageCompressedRecNoCheck 0.06 memset 0.05 PageComprMgr::ExpandPrefix 0.04 PageRef::ModifyColumnsInternal 0.04 Page::ModifyColumns 0.03 DataAccessWrapper::ProcessAndCompressBuffer 0.03 SingleColAccessor::LocateColumn 0.03 CDRecord::BuildLongRegionBulk 0.02 ChecksumSectors 0.02 Page::MCILinearRegress 0.02 DataAccessWrapper::DecompressColumnValue 0.02 SOS_MemoryManager::GetAddressInfo 0.02 CDRecord::FindDiff 0.02 AnchorRecordCache::Init 0.02 PageComprMgr::CombinePrefix 0.01 Total 5.17 UPDATE Compression burners Out of 8.55 … Approx: 60%
  • 159. Compression and Locks Xevent Trace Lock Acquire/Release High Res Timer
  • 160. How long are locks held? 0 100 200 300 400 500 600 PAGE NONE CPU KCycles Lock Held Cycle Count Avg StdDev
  • 161. • Sharing is generally bad for scale (but may be good for performance) • PAGELATCH and LATCH diagnosis starts in sys.dm_os_latch_stats • CXPACKET • Only important if throughput drops when DOP goes up • If this happens, look for another wait/latch • Table partitioning can be used to work around concurrency issues Summary Concurrency – So Far..
  • 162. The Paul Randal INSERT test 160M rows, executing at concurrency Commit every 1K: EASY tuning?
  • 163. All is as Expected?
  • 164. But Page Splits are Bad, right? = BAD! = Better!...
  • 166. And the Score Is… 0 5000 10000 15000 20000 25000 30000 35000 newguid() newsequentialid() IDENTITY Time in Seconds
  • 167. What is going on here??? Min Min Min Min Min Min Min Min Min Min HOBT_ROOT Max
  • 168. Tricks to Work Around this 0 -1000 1001 - 2000 2001 - 3000 3001 - 4000 INSERT INSERT INSERT INSERT
  • 169. All Cores at 100% 0 5000 10000 15000 20000 25000 30000 35000 newguid( ) newsequ entialid() IDENTITY IDENTITY +Unique IDENTITY +Unique +Hash8 IDENTITY +Hash24 IDENTITY +Hash48 SPID+ Offset Seconds Runtime in Seconds 600K Inserts/sec 830K Inserts/sec All Cores at ~100%
  • 170. • Don’t use Sequential Keys • Page Splitting isn’t so bad • Neither are GUID • Generate keys wisely. Ideally in the app server • For “transparent” speedup, consider our old hash trick Takeaways, INSERT workload
  • 171. • Minimally Logged • Single, large execution (thousands) • Unsorted data • Concurrent Loaders BULK INSERT Workload Heap Bulk Insert Bulk Insert
  • 172. When does BULK INSERT scale break? • Measure: SELECT * FROM sys.dm_os_latch_stats • Observe waits on ALLOC_FREESPACE_CACHE • Theory (just read BOL): “Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object.” • [Chart: MB/sec versus number of concurrent BULK INSERT streams]
  • 173. What is Happening here? Free Page information (PFS/GAM/SGAM) HOBT Cache Fat Chunks Alloc new pages! Bulk Insert ALLOC_FREESPACE_CACHE This is in DRAM and L2
  • 174. • Break Up table by “some key” • Optional: Switch out partitions • Spin up multiple bulks • Linear scale • 3GB/sec • 16M LINEITEM/sec Breaking Through the Bottleneck 425 555 215 200 101 453 666 Area Bulk Insert Bulk Insert Bulk Insert
  • 175. BULK INSERT - Reloaded • Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed • But this was on heaps, I have a clustered table • Alright then, let us hit a cluster index 1-1000 Clustered and partitioned 1001-2000 2001-3000 3001-4000 X Lock X Lock X Lock X Lock
  • 176. Cluster Bulking – It seemed so plausible!
  • 177. Cluster Bulking – Stage and Switch
  • 180. What is a Spinlock? • Context switching is expensive • Typically 10K or more CPU cycles • If you expect the resource to be held only briefly, why fall asleep?
      /* Slide pseudocode: the test and the set are not one atomic step here */
      void spin_acquire(int* s) { while (*s == 1) { /* spin */ } *s = 1; }
      void spin_release(int* s) { *s = 0; }
  • 181. What is a backoff? • Acquire can be very expensive • SQL Server implements a backoff mechanism
      spin_acquire(int* s) {
        int spins = 0;
        while (*s == 1) {
          spins++;
          if (spins > threshold) { <Sleep and WaitForResource> }
        }
        *s = 1;
      }
      Backoff statistics: SELECT * FROM sys.dm_os_spinlock_stats or DBCC SQLPERF(spinlockstats)
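The spin-then-back-off pattern can be sketched in runnable form as follows. This is an illustration under stated assumptions, not SQL Server's implementation: `BackoffSpinlock`, `spin_threshold` and `sleep_seconds` are hypothetical names, and a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set that the slide pseudocode glosses over.

```python
import threading
import time


class BackoffSpinlock:
    """Spin on a flag; after too many failed attempts, sleep instead of burning CPU."""

    def __init__(self, spin_threshold=100, sleep_seconds=0.0001):
        # The non-blocking acquire below plays the role of an atomic test-and-set.
        self._flag = threading.Lock()
        self.spin_threshold = spin_threshold
        self.sleep_seconds = sleep_seconds

    def acquire(self):
        spins = 0
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins > self.spin_threshold:
                time.sleep(self.sleep_seconds)  # the backoff: yield the CPU
                spins = 0

    def release(self):
        self._flag.release()


def run_demo(threads=4, increments=1000):
    """Several threads bump one counter under the spinlock; returns the final count."""
    lock = BackoffSpinlock()
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(increments):
            lock.acquire()
            try:
                counter += 1
            finally:
                lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter
```

The threshold is the tuning knob: spin too briefly and you pay for context switches you could have avoided, spin too long and waiting threads burn cores that the lock holder may need.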
  • 182. Life at 600K INSERT/sec
  • 183. WRITELOG is I/O – right? Should be the same as this… or? No! Because:
  • 184. Diagnosing a Spinlock the Cool way! • Step 1: Copy sqlservr.pdb to the BINN directory • Step 2: DBCC TRACEON (3656, -1) • Step 3: Steal script from: http://www.microsoft.com/en-us/download/details.aspx?id=26666 • Note for 2012, you additionally need: sqlmin.pdb, sqllang.pdb, sqldk.pdb
  • 185. Spinlock Walkthrough – Extended Events Script
      -- Get the type value for any given spinlock type
      select map_value, map_key, name
      from sys.dm_xe_map_values
      where map_value IN ('SOS_CACHESTORE')
      -- Create the event session that will capture the callstacks to a bucketizer
      create event session spin_lock_backoff on server
      add event sqlos.spinlock_backoff (
          action (package0.callstack)
          where type = 144 -- SOS_CACHESTORE
      )
      add target package0.asynchronous_bucketizer (
          set filtering_event_name = 'sqlos.spinlock_backoff',
              source_type = 1,
              source = 'package0.callstack')
      with (MAX_MEMORY = 50MB, MEMORY_PARTITION_MODE = PER_NODE)
      -- Run this section to measure the contention
      alter event session spin_lock_backoff on server state = start
      -- Wait to measure the number of backoffs over a 1 minute period
      waitfor delay '00:01:00'
      -- To view the data:
      -- 1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
      -- 2. Enable this trace flag to turn on symbol resolution
      DBCC traceon (3656, -1)
      -- Get the callstacks from the bucketizer target
      select event_session_address, target_name, execution_count, cast(target_data as XML)
      from sys.dm_xe_session_targets xst
      inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
      where xs.name = 'spin_lock_backoff'
      -- Clean up the session
      alter event session spin_lock_backoff on server state = stop
      drop event session spin_lock_backoff on server
  • 186. Of course, you can just use 2012…
  • 187. How to improve a spinlock? (Diagram: two CPUs, each with cores and an L1-L3 cache; cores on both CPUs call spin_acquire on the same int s, and the cache line is transferred between the CPUs.)
  • 188. CoreInfo.exe – where are my cores?
  • 189. Revisiting the TLOG (Diagram labels: threads T0 to Tn; LOGCACHE ACCESS; Buffer Offset (cache line); Alloc Slot in Buffer; MemCpy Slot Content; LOGBUFFER with Slot 1 through Slot 127; Writer Queue; LOG FLUSHQ; Log Writer; Async I/O Completion Port; WRITELOG; Signal thread which issued commit.)
  • 190. I/O Affinity Mask! sp_configure 'affinity I/O mask' (Chart: SPID + Offset vs. SPID + Affinity, axis 0 to 250.)
  • 191. Bulking at Concurrency • What’s that spin?
      xperf -on latency -stackwalk profile
      xperf -d trace.etl
      xperfview trace.etl
      SELECT * FROM sys.dm_os_spinlock_stats ORDER BY spins DESC
      DBCC SQLPERF (spinlockstats)
  • 192. SOS_OBJECT_STORE at high INSERT • Observed: This Spin happens when inserting • Need: Reduce locking overhead • Fixes that work well here: 8x throughput Bonus
  • 193. Diagnosing another Spinlock • Let’s try something really silly: • Run lots of: EXEC emptyProc • This should be infinitely scalable, right?
      CREATE PROCEDURE emptyProc AS RETURN
  • 194. Initial Diagnosis MUTEX ??? … what Mutex?
  • 195. Using the Spinlock Script gives us Some cache Which one?
  • 196. Validating the Theory
      CREATE PROCEDURE emptyProc0 AS RETURN
      GO
      CREATE PROCEDURE emptyProc1 AS RETURN
      GO
      …
      CREATE PROCEDURE emptyProc31 AS RETURN
  • 197. What is the SOS_OBJECT_STORE? Security Check?
  • 198. Validating the new “fix”…
  • 199. Consider this test harness code…
      DECLARE @ParmDef NVARCHAR(500)
      DECLARE @sql NVARCHAR(500)
      -- @table identifies the target table and is set outside this snippet
      SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
      SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
      SET @ParmDef = N'@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
      DECLARE @constDate DATETIME = '1974-12-22'
      DECLARE @i INT
      WHILE (1=1) BEGIN
        BEGIN TRAN
        SET @i = 1
        WHILE @i <= 1000 BEGIN
          EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
          SET @i = @i + 1
        END
        COMMIT TRAN
      END
  • 200. Spinning on MUTEX. Diagnosing with the trace flag shows the offender on the spin stack: CSecurityContext::GetUserTokenFromCache. This is REALLY expensive at scale:
      WHILE @i <= 1000 BEGIN
        EXEC sys.sp_executesql @sql, …
        SET @i = @i + 1
      END
      This initializes a new execution context on every loop!
  • 201. Fixing the MUTEX spin (4x throughput bonus)
      • Instead of:
      WHILE @i <= 1000 BEGIN
        EXEC sys.sp_executesql @sql, …
        SET @i = @i + 1
      END
      • Write:
      SET @sql = N'
        DECLARE @i INT
        WHILE (1=1) BEGIN
          BEGIN TRAN
          SET @i = 1
          WHILE @i <= 1000 BEGIN
            INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
            SET @i = @i + 1
          END
          COMMIT TRAN
        END'
      EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
  • 202. Concurrency, Spinlock Summary • When all other bottlenecks are gone, sharing happens in the most unlikely places • You can use spinlock XEvents inside SQL Server • Remember symbol files in BINN • Trace flag 3656 • This can also be done in Xperf for non-SQL apps • Ex: Analysis Services
  • 203. Xperf controlling buffers • Control of buffers and NUMA for Xperf settings • By default: 4MB mem, spool to disk at root of C-drive • Can do buffer/file control: -BufferSize and -MaxBuffers; -MaxFile and -FileMode Circular
  • 204. How SQL Server assigns threads • Round robin between NUMA nodes • Inside the NUMA node: pick the one that looks the least busy • This is NOT a perfect system
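The two-level assignment rule can be sketched as below. This is a toy model, not SQL Server's scheduler: `make_assigner` is a hypothetical helper, and "least busy" is assumed to mean the lowest task count, which the slide does not define.

```python
def make_assigner(nodes):
    """Return an assign() function over `nodes`, a list of per-node load lists.

    Each inner list holds one task count per scheduler and is updated in place.
    """
    state = {"next_node": 0}

    def assign():
        # Level 1: round robin between NUMA nodes.
        node_id = state["next_node"]
        state["next_node"] = (node_id + 1) % len(nodes)
        # Level 2: inside the node, pick the scheduler with the lowest load.
        loads = nodes[node_id]
        scheduler_id = min(range(len(loads)), key=loads.__getitem__)
        loads[scheduler_id] += 1
        return node_id, scheduler_id

    return assign


# Two NUMA nodes with two schedulers each, all idle at the start.
numa_nodes = [[0, 0], [0, 0]]
assign = make_assigner(numa_nodes)
placements = [assign() for _ in range(4)]
```

The sketch also hints at why the system is not perfect: the round robin ignores how busy a node already is, so a node full of long-running tasks still receives every second new connection.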
  • 205. Super Xperf
      xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular
      REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
  • 206. But does the Data Model Work? • All the tuning won’t help you if your model is wrong • Tuning gets you far, but to really scale, you need a good data model • This is what my other courses are about
  • 208. Problem Statement (Diagram: an ordered queue structure holding Msg entries of 300B each, with Push at one end and Pop at the other.)
  • 209. The Naïve Approach • Push: seek first row, INSERT row • Pop: seek last row, DELETE/OUTPUT key (Diagram: Msg entries between the Min and Max ends of the index.)
  • 210. Why this doesn’t Scale! (Diagram: HOBT_ROOT with the Min and Max pages; the Min label repeats across the figure.)
  • 212. Summarising the Problem • Hot stuff • Root • Min page • Max page • Intermediate pages • Alloc/Dealloc • BUT: We Must have order!
  • 214. What if… • Push: seek first value page, UPDATE reference count (Msg++) • Pop: seek last value page, UPDATE reference count (Msg--)
  • 216. Eliminating Thread Contention (Diagram: ordered queue structure with entries 8, 7, 6, 5, 4; push advances PushSequence++, pop advances PopSequence++.) VERY fast!
  • 217. Ring Buffers (Diagram: ordered queue structure; push uses PushSequence++ mod 100 and pop uses PopSequence++ mod 100. Slot 8: Msg 108; Slot 7: Msg 107; Slot 6: Msg 106; Slot 5: Msg 105; Slot 4: Msg 104.)
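A minimal in-memory sketch of the modulo ring buffer, assuming a fixed number of preallocated slots that are overwritten in place (the analogue of UPDATE instead of INSERT/DELETE). `RingQueue` and its method names are hypothetical; the slides show only the sequence counters and the "mod 100" slot mapping.

```python
class RingQueue:
    """Ordered queue over a fixed set of slots; slot = sequence number mod slot count."""

    def __init__(self, slots=100):
        self.slots = [None] * slots  # preallocated once, then only overwritten
        self.push_seq = 0            # next sequence number to write
        self.pop_seq = 0             # next sequence number to read

    def push(self, msg):
        if self.push_seq - self.pop_seq == len(self.slots):
            raise OverflowError("ring is full")
        self.slots[self.push_seq % len(self.slots)] = msg
        self.push_seq += 1

    def pop(self):
        if self.pop_seq == self.push_seq:
            raise IndexError("ring is empty")
        slot = self.pop_seq % len(self.slots)
        msg, self.slots[slot] = self.slots[slot], None
        self.pop_seq += 1
        return msg


# A tiny ring makes the wrap-around visible.
queue = RingQueue(slots=4)
for message in ["a", "b", "c", "d"]:
    queue.push(message)
first = queue.pop()
queue.push("e")  # sequence 4 wraps into slot 0
rest = [queue.pop() for _ in range(4)]
```

Because pushers and poppers touch different slots most of the time, the only shared state is the pair of sequence counters, which matches the slide's example of message 108 living in slot 8 when the ring has 100 slots.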
  • 218. Summing Up Message Queue Hack • UPDATE instead of INSERT/DELETE • More partitions = more B-trees • Ring buffer using modulo • Find the sweet spot for concurrency

Editor's Notes

  1. For a great introductory course I recommend the Paul Randal course found here: http://www.sqlskills.com/T_ImmersionInternalsDesign.asp
  2. To get a good runtime, we up the count of rows to 1M
  3. Hint: NGEN lives in %Windir%\Microsoft.NET\framework64\<Version>. Doc on NGEN: http://msdn.microsoft.com/en-us/magazine/cc163610.aspx
  4. Get perfview here: http://www.microsoft.com/en-us/download/details.aspx?id=28567
  5. http://msdn.microsoft.com/en-us/library/6t9t5wcf(v=vs.80).aspx
  6. Different data structures have different time complexities that lend themselves to more or less efficient service times.
  7. Concurrency of JOIN even when single threaded
  8. The B+ tree is a data structure that seeks to block fetch large areas of data (typically, but not always, 8K) before seeking through the pages in memory. There are many different ways to lay out the data pages of a B-tree, some of them friendlier to memory prefetch than others. The B-tree also allows you to walk the leaf nodes in a linear manner, without paying the log-proportional price of a seek. This gives logarithmic time to seek individual pages while still allowing linear time to range scan. Once the expensive price of fetching a page (I/O) has been paid, parsing the page can also be made cheap by making use of the memory structures.
  9. Highlight spill warning
  10. In the course material I have a query that will help you do 1 in this list. If you are curious about ways to optimize the BEST index-only plan, I recommend the book by Dan Tow called “SQL Tuning”.
  11. We will get into WHY the transaction log needs to be dedicated
  12. Elevator sorting orders the I/O requests before sending them to the spindle. Depending on the buffering, this ordering can increase IOPS per spindle quite significantly. However, it comes at a cost in increased latency.
  13. Add the spindle illustration here
  14. Hardware vendors have different implementation of RAID. It really depends on the gear you have and there is really only ONE way to get the true, unbiased answer… Which leads us to the next slide
  15. http://blogs.msdn.com/sqlcat/archive/2008/09/18/scaling-heavy-network-traffic-with-windows.aspx and http://msdn.microsoft.com/en-us/library/windows/hardware/gg463378.aspx
  16. The jumbo settings vary by vendor.
  17. http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/07/08/under-the-covers-gam-sgam-and-pfs-pages.aspx
  18. In certain scenarios with shallow B-trees (BizTalk Spool), row padding can shift the latch to the internal structure ACCESS_METHODS_HOBT_VIRTUAL_ROOT.
  19. Root splits are expensive, although each one affects only one partition at a time. The problem arises when many transactions cause page splits, which is why we suggest that partitioning is the better option.