1. Scaling PostgreSQL on SMP Architectures
Doug Tolbert, David Strong, Johney Tsai
{doug.tolbert, david.strong, johney.tsai}@unisys.com
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 1
2. Unisys Cellular Multi-Processor Architecture
⢠Up to 32 CPU sockets
C PU
â In 4- or 8-CPU pods C PU
C PU
â Dual-core supported C PU
Multi-core eventually C PU
T LC
C C PU
â Dynamically partitionable PU C PU
⢠32-bit & 64-bit technologies T LC CPU
C PU C PU
8 GB T LC C PU
⢠Multilayered caches C PU M SU
C PU C PU C PU
C PU C PU C PU
⢠Large cross-bar memories T LC C PU
C PU
â IA32: 64Gb (with PAE) PC I C PU C PU
T LC PC I CP
â EM64T, Opteron, Athlon: 256Tb PC I 8 GB T LCU C PU
â Itanium: 1Pb PC I M SU C PU C PU
PC I C PU
â Theoretical: Multi-Pb PC I T LC C PU
C PU C PU
Processor family dependent DIB PC I
PC I C PU
⢠Up to 96 PCI Slots PC I 8 GB DIB PC I
M SU PC I
â Lower on newer models PC I PC I
PC I PC I
PC I DIB
⢠Linux: SLES, RedHat PC I
PC I
DIB
⢠Windows: Data Center 8 GB PC I
DIB PC I
M SU
PC I
DIB PC I
PC I
PC I
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 2
3. Unisys Open And Secure Integrated Stack
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 3
4. PostgreSQL ASC Lab Engagement
⢠February 2005, Unisys Mission Viejo Facility
â Simon Riggs, Mark Wong (OSDL) participated
â PostgreSQL 8.0.1 measured
â Summary at
http://developer.osdl.org/markw/papers/PostgreSQL%20Visit%20Report%20External%20050513.pdf
⢠Interesting findings
â Locking protocol and granularity
â Context switch âstormsâ (~80K/sec)
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 4
5. PostgreSQL 8.1 Enhancements
⢠A few stand out from a performance perspective
â Improve concurrent access to the shared buffer cache (Tom)
â Allow index scans to use an intermediate in-memory bitmap (Tom)
â Improve performance for partitioned tables (Simon)
⢠Context switches reduced
â From ~80K/sec to ~30K/sec
⢠Other effects only seen in overall throughput improvements
⢠Undertook more complete characterization of 8.1 performance
improvements
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 5
6. TPC-C Benchmark
⢠Transaction Processing Performance Council (TPC)
â Defines suite of benchmarks
â Sole certifier of vendor-submitted benchmark results
â Specs, results, more info: http://www.tpc.org
⢠TPC-C benchmark is a complete order-entry environment
â Industry standard measure of DBMS OLTP performance across vendors
â Reports transactions per minute (tpmC) and cost metrics
â Mix of 5 transactions
⢠New Order (selects, updates, inserts)
⢠Payment (selects, updates, inserts)
⢠Order Status (selects)
⢠Stock Level (selects)
⢠Delivery (selects, updates, deletes), interactive & deferred variants
⢠Data volumes, number of clients scaled by vendorâs tpmC goal
â See spec for details
⢠Current (June 2006) result ranges
â Best performance: 3,210,540 tpmC @ $5.07/tpmC
â Lowest performance: 9,347 tpmC
â Best price/performance: 38,622 tpmC @ $0.99/tpmC
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 6
7. Ah, these are
NOT TPC-C
numbers!
Test Application
⢠DBMS stressor application with well-known characteristics
⢠TPCC4J = âTPC-C For Javaâ
â Unisys-developed from TPC-C version 5.3 (April 2004)
â Runs on 1.4.2 JVM (5.0 JVM works)
â Small application footprint (~72K)
â Very few garbage collection cycles
â Client overhead low (7-9% for ~1000 users)
⢠Deviates from TPC-C in important ways
â Non-audited components and results
â Data entry screens not populated, only SQL statements run
â User think time eliminated
â Reports TPS, not tpmC
⢠No, you canât multiply by 60 to get tpmC!
â No checkpoints during benchmark runs
â Not run for a minimum of 8 hours
â Data volumes not scaled
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 7
8. Test Configuration
Client Driver: Unisys ES3040 PostgreSQL: Unisys ES7000/540 Storage: EMC CX-200 External Disk
4 x 3.0Ghz Intel Xeon MP CPUs (IA32) 32 x 3.0Ghz Intel Xeon MP CPUs (IA32) 8 x 36Gb drives (15K rpm)
Hyper threading enabled Hyper threading disabled RAID 10 configured (striped mirrors)
4Gb RAM 16Gb RAM
Small data volume: 50 TPC-C
1 x 36Gb boot driver (10K rpm) 1 x 36Gb boot driver (10K rpm)
warehouses, ~5.4Gb (4.7Gb data +
1 Gbit NIC (copper) 1 Gbit NIC (copper)
700Mb indexes)
OS: SLES 9 SP3 OS: SLES 9 SP3
No I/O issues in benchmarking
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 8
11. Performance vs. Scalability
⢠Performance and Scalability are not the same thing
â But are closely related
â Must be REALLY careful comparing anecdotal reports
⢠Performance comparable on equal configurations
â In this data set, the only equal configuration is 1 CPU
â 8.1.2 exhibits a small (~8%) improvement over 8.0.7
⢠Scalability measures throughput as resources are added
â 8.1.2 uses incremental resources more effectively than 8.0.7
â At 16 CPUs, 8.1.2 is 5.6 times better than 8.0.7
⢠Fair to conclude that 8.1.2 scales effectively to 16 CPUs
â At least on this application and platform combination
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 11
12. Hardware Architecture
⢠Cache invalidations
â Handled quickly when on-cell ES7000/540 Cell
⢠No visible degradation up to 8 CPUs RAM
â Slower across cells Memory
⢠Cross-cell interactions required, but slower Controller
⢠Responsible for dip at 12 CPUs Shared
Cache
â Processing power unbalanced across cells (8 + 4)
CPU CPU
⢠Recovers by 16 CPUs
CPU CPU
â Processing power balance across cells (8 + 8)
CPU CPU
â Reduces chance of cache misses
CPU CPU
⢠Cross-cell cache invalidations increase quickly when data
structures are poorly aligned with cache line size
⢠Every large-scale computing platform, regardless of vendor,
will have some sort of hardware related behavior that needs to
be considered in performance and scalability work
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 12
13. Software Architecture
⢠Degradation above 16 CPUs
â Continued decline implies systemic origin
â Characteristic of software algorithms or data structures
reaching inherent scalability limits
⢠Causes unclear, but something important is wrong
⢠Possibilities include
â The Usual Suspects: lock mgmt, buffer mgmt, etc.
â Lack of Repeatable Read isolation level
⢠A TPC-C requirement
⢠Elevation to Serializable hurts overall throughput
â See To Do List (a later slide)
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 13
14. Itâs not just TPCC4J!
⢠Another industry standard, transactional Java benchmark
â ES7000 8x
â 8.0: 34 TPS
â 8.1: 544 TPS
⢠Different benchmark structure
⢠Different hardware platform
â No cache issues, CPUs on one cell
⢠Even more dramatic improvement (16x!)
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 14
16. Other Useful Tooling
⢠VTune
â Intel low-level performance monitor
⢠http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
â High frequency functions, by clock ticks HeapTupleSatisfiesSnapshotData??
⢠HeapTupleSatisfiesSnapshotData
⢠LWLockAcquire
⢠XLogInsert
⢠PinBuffer
⢠hash_search
⢠oprofile
â Results pending
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 16
17. To Do List
⢠In-line some high frequency functions
⢠Speed up CRC algorithm by rearranging loop structure
⢠Improve data structure alignment to memory pages and cache
lines
â Possibly dynamically based on processor id and settings
⢠Re-type byte structures to words for newer Intel processors
⢠Investigate efficiency of bitmaps on newer Intel processors
⢠Residual spin-lock implementation issues
⢠Reduce query plan discards (very adventurous!)
⢠Improve statistics infrastructure and real-time monitoring
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 17
18. Conclusions
⢠Scalability dramatically improved in 8.1 release
â Scales to 16 CPUs (on test environment)
â Context switches down
â Supported by multiple benchmarks
⢠Minor single CPU performance improvement
⢠Hardware architecture can affect scalability
â Important consideration for high-capacity, high-throughput
applications
â Systems of all sizes moving to multi-core processor chips
⢠Substantial scalability and performance challenges remain
â TPC-C required feature content missing (repeatable read)
â Algorithm scalability limits exist (especially > 16 CPUs)
PostgreSQL 10th Anniversary Summit, Toronto, July 8-9, 2006 Page 18