An Overview of Flash Storage for Databases
1. An Overview of Flash Storage for Databases
Morgan Tocker
<morgan@percona.com>
Wednesday, March 9, 2011
2. Introduction
[Me] Director of Training. Previously worked at MySQL, Sun
Microsystems.
[Percona] Consulting, Training, Support & Development for MySQL.
★ No vested interest in which hardware I recommend.
✦
[Disclaimer] Some hardware vendors have engaged our services to
evaluate and improve the performance of their products.
3. What this talk is about
★ Flash technologies (NAND, NOR).
★ Server Usage.
✦
Not USB thumb drives.
✦
Not Consumer usage.
★ “For Database” == MySQL.
✦
Should be more or less applicable for all databases.
4. Agenda
★ Introduction.
★ A look at the current market.
★ Applications.
5. Revolutionary
★ Change in technology -
✦
From spinning disk to solid state.
★ No mechanical moving parts.
★ Jump in performance.
★ Requires changes in the Application.
★ Hard not to predict SSDs quickly replacing all hard disks in
the next 5-10 years*
* However, at the moment hard disks are still
becoming cheaper (per unit of capacity) more quickly than SSDs!
6. “Numbers everyone should know”
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
NAND Flash (my estimate) 50,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
See: http://www.linux-mag.com/cache/7589/1.html and Google:
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
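To put the table in proportion, here is a small sketch that divides a few of the figures by each other. The numbers are copied from the slide (the NAND figure is the speaker's own estimate); the dictionary keys and the helper name are illustrative only.

```python
# Latencies in nanoseconds, copied from the table on this slide.
LATENCY_NS = {
    "main_memory_reference": 100,
    "nand_flash_estimate": 50_000,   # the speaker's own estimate
    "disk_seek": 10_000_000,
}

def times_slower(slow: str, fast: str) -> float:
    """Return how many times longer `slow` takes than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

disk_vs_flash = times_slower("disk_seek", "nand_flash_estimate")
flash_vs_memory = times_slower("nand_flash_estimate", "main_memory_reference")
print(f"disk seek vs NAND flash: {disk_vs_flash:.0f}x")
print(f"NAND flash vs main memory: {flash_vs_memory:.0f}x")
```

On these figures a flash access is about 200 times faster than a disk seek, yet still about 500 times slower than a main memory reference.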
7. Physics Behind
★ “Floating Gate Transistors”
✦
Non volatile memory.
★ Two States (one bit) - Single Level Cell (SLC)
✦
Faster, more reliable, expensive.
★ Many States - Multi Level Cell (MLC)
✦
Usually 4 states.
✦
Slower, less reliable, cheaper.
8. Classification
★ NOR
✦
Speeds like memory for reads.
✦
Much, much slower for erase/writing data.
✦
Practical use: storing firmware.
★ NAND
✦
Faster writes.
✦
Only block-level read access (4K).
✦
Idea is to pack as many cells as possible into limited space - to
make it competitive with hard drives.
9. Erasing (NAND)
★ Erase is to set all bits to “1111...”
✦
Erasing process is similar to the “flash” of a camera - this is
where the name FLASH comes from.
✦
Erase is slow, done in batch operations (up to 1MB).
★ Change “1” -> “0” is fast.
★ Change “0” -> “1” is possible only by erase.
✦
1st write: “1111” -> “1110”. Block marked as “written”
✦
2nd write: even “1110” -> “1010” is not possible.
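The write-once-then-erase rule above can be sketched as a toy model. This is an illustration only, not how any real controller is written: the class name, the 4-bit block size, and the choice of four blocks per erase batch are all assumptions.

```python
class NandEraseBatch:
    """Toy model: a group of small blocks that can only be erased together."""

    def __init__(self, blocks: int = 4, bits: int = 4):
        self.erased_value = (1 << bits) - 1           # all ones, e.g. 0b1111
        self.blocks = [self.erased_value] * blocks
        self.written = [False] * blocks

    def program(self, block: int, value: int) -> None:
        """Write a block once after an erase; this only clears bits (1 -> 0)."""
        if self.written[block]:
            raise RuntimeError("block already written; erase first")
        if value & ~self.erased_value:
            raise ValueError("value does not fit in the block")
        self.blocks[block] = value
        self.written[block] = True

    def erase(self) -> None:
        """Slow batch operation: set every block back to all ones."""
        self.blocks = [self.erased_value] * len(self.blocks)
        self.written = [False] * len(self.blocks)


batch = NandEraseBatch()
batch.program(0, 0b1110)          # 1st write: 1111 -> 1110
try:
    batch.program(0, 0b1010)      # 2nd write to the same block is refused
    second_write_ok = True
except RuntimeError:
    second_write_ok = False
batch.erase()                     # everything is back to 1111
```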
10. Erase Challenges
★ Erase is slow
✦
You want to erase many blocks in a single “flash”.
✦
Block Management.
★ [via software] When you write, the card never overwrites the
same block in place.
★ Background process to run garbage collection.
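A minimal sketch of the block management idea: every write goes to a fresh physical block, the old copy is only marked stale, and a separate garbage collection step erases stale blocks in a batch. The class and method names are hypothetical, and a real device also has to relocate live data out of an erase unit before erasing it, which this toy skips.

```python
class TinyFtl:
    """Toy flash translation layer: out-of-place writes plus garbage collection."""

    def __init__(self, physical_blocks: int):
        self.data = [None] * physical_blocks       # contents of each physical block
        self.free = list(range(physical_blocks))   # erased blocks, ready to write
        self.stale = []                            # old copies waiting for an erase
        self.mapping = {}                          # logical block -> physical block

    def write(self, logical: int, payload: bytes) -> int:
        if not self.free:
            self.garbage_collect()
        if not self.free:
            raise RuntimeError("device full")
        physical = self.free.pop(0)
        self.data[physical] = payload
        old = self.mapping.get(logical)
        if old is not None:
            self.stale.append(old)                 # the old copy is never rewritten
        self.mapping[logical] = physical
        return physical

    def read(self, logical: int) -> bytes:
        return self.data[self.mapping[logical]]

    def garbage_collect(self) -> int:
        """Erase all stale blocks in one batch; return how many were reclaimed."""
        reclaimed = len(self.stale)
        for physical in self.stale:
            self.data[physical] = None
        self.free.extend(self.stale)
        self.stale = []
        return reclaimed


ftl = TinyFtl(physical_blocks=4)
first = ftl.write(7, b"v1")
second = ftl.write(7, b"v2")    # same logical block lands on a different physical block
```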
11. Erase Lifecycle
★ SLC ~100K times per cell (may vary).
★ MLC ~10K times per cell (may vary).
★ For many this is a major point of discussion.
✦
How big an issue this is depends a lot on firmware.
✦
Many cells and even distribution of writes (“wear levelling”)
stretch this to a couple of years under heavy workload.
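A back-of-the-envelope estimate of device lifetime, assuming perfectly even wear levelling. The erase-cycle counts come from this slide; the 320 GB capacity, the 2 TB/day write rate, and the write amplification factor of 2 are hypothetical inputs chosen only to illustrate the arithmetic.

```python
def estimated_lifetime_years(capacity_gb: float, erase_cycles: int,
                             writes_gb_per_day: float,
                             write_amplification: float = 1.0) -> float:
    """Years until every cell has used up its erase cycles (ideal wear levelling)."""
    total_writable_gb = capacity_gb * erase_cycles
    physical_gb_per_day = writes_gb_per_day * write_amplification
    return total_writable_gb / physical_gb_per_day / 365

# Hypothetical workload: 2 TB written per day, each write costing 2x internally.
mlc_years = estimated_lifetime_years(320, 10_000, 2_000, write_amplification=2.0)
slc_years = estimated_lifetime_years(320, 100_000, 2_000, write_amplification=2.0)
print(f"MLC: {mlc_years:.1f} years, SLC: {slc_years:.1f} years")
```

With these assumed inputs the MLC card lasts a little over two years, and the SLC card ten times as long.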
12. Write degradation
★ Expected.
✦
The fuller the device, the harder it is to garbage collect.
★ Graph for Fusion-io 320G MLC card:
13. Firmware Really Matters (1)
★ I would expect even less flat performance on a
cheaper, non-enterprise class of hardware.
✦
Come to my talk on Friday.
✦
I will tell you consistency of performance is more important
than anything else.
14. Firmware Really Matters (2)
★ Many revisions of firmware for each vendor.
✦
Important to compare apples-to-apples in any comparisons.
✦
I heard a rumour one large SSD vendor is on their 4th
successful complete ground up implementation ;)
15. Agenda
★ Introduction.
★ A look at the current market.
★ Applications.
16. The current market (1)
★ Fusion-io.
✦
Established player with a large product line.
✦
Enjoyed a near-monopoly for a while as the only PCI card
vendor.
★ Virident.
✦
Previously a MySQL Appliance vendor.
✦
Switched business model in ~2010 to just ship PCI Flash
cards.
✦
Very good, consistent results.
17. The current market (2)
★ Intel/OCZ/other.
✦
Typically aimed at the pro-desktop market.
✦
Does not necessarily offer the same features/promises as the
“enterprise hardware”...
18. You pay more for...
★ Greater amount of over provisioning (more consistent).
★ Internal redundancy (aka RAID).
★ More complex firmware (more consistent).
★ Guarantee of durability (such as a capacitor).
★ Greater life-span (more write cycles).
★ Better Performance (many more IOPS).
21. Fusion-io Overview
★ Fast. Very fast.
✦
Cheaper than disks in terms of $ per IOPS.
★ PCI-E - closest to CPU.
★ Durability.
★ Shares host memory / CPU
★ Most complex part - firmware.
★ Large amount of space reservation for heavy writes.
22. Fusion-io drawbacks
★ Expensive. Let’s say “$6000+” (retail; your price may be
less).
✦
For full performance, requires additional 25% space
reservation.
✦
DRAM is actually probably cheaper per GB.
★ PCI-E is not hot swap.
✦
Also has potential for errors (when host fails, garbage keeps
being sent. Fusion-io handles this well.)
23. Fusion-io durability
★ Cache is located on host system.
★ “Transaction log” to prevent lost data.
✦
Crash recovery.
24. Fusion-io read performance
160GB SLC card
8 threads: 33K IOPS (525MB/sec), 0.28 ms 95% response time
RAID 10 is a Dell PERC 6/i
on 8 x 2.5” 15K RPM SAS disks
25. Fusion-io write performance
★ 8 threads: 20K IOPS (314MB/sec), 0.26 ms 95%
response time.
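The IOPS and throughput figures on the read and write slides can be cross-checked against each other. The sketch below assumes "MB" means MiB and that the quoted IOPS are rounded; the function name is illustrative. The slides do not state the request size used in the benchmark.

```python
def implied_request_kib(iops: float, mib_per_sec: float) -> float:
    """Average request size implied by an IOPS figure and a throughput figure."""
    return mib_per_sec * 1024 / iops

read_kib = implied_request_kib(33_000, 525)    # read slide: 33K IOPS, 525MB/sec
write_kib = implied_request_kib(20_000, 314)   # write slide: 20K IOPS, 314MB/sec
print(f"read: ~{read_kib:.1f} KiB per request, write: ~{write_kib:.1f} KiB per request")
```

Both come out at roughly 16 KiB per request, which would be consistent with a benchmark issuing InnoDB-sized (16 KB default) page reads and writes.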
26. Fusion-io databases
★ Many read / write threads to utilize throughput.
★ “MySQL” is not able to fully use it.
✦
Better in 5.5, MySQL-5.1-plugin, XtraDB.
★ InnoDB IO path “needs work”.
28. Virident
★ PCI interface.
★ Has NAND flash upgrade modules.
★ Good stable results.
★ Advertised 300,000 IOPS in 75:25 (read:write).
29. Virident Options
★ 300G, 400G, 600G, 800G SLC cards.
✦
400G is $13,600
★ (More or less the same price range as Fusion-io).
30. 2010 Benchmarks:
http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/
32. Intel SSDs
★ Were awesome in 2008.
✦
Many accolades, first SSDs that probably made sense for a
lot of pro-desktop users.
★ A couple of iterations of firmware, but mostly Intel
treated customers like mushrooms for 2 years.
✦
No clear advance warning of road map.
✦
Finally a replacement 510 series announced last month.
• These slides don’t cover the 510 series. I have not used it.
33. Intel Overview
★ SATA form factor.
★ Intel X25-M Gen I (50nm) & Gen II (34nm).
✦
MLC
★ Intel X25-E (50nm)
✦
SLC
✦
“Enterprise”.
★ New 510 series - just released last month.
34. X25-E
★ 32G / 64G
★ Throughput: 35K IOPS reads, 3.5K IOPS writes.
★ Latency: 75us reads, 85us writes.
★ 64G - $725
✦
$11/GB
★ Write endurance:
✦
1 petabyte of random writes (32G)
✦
2 petabytes of random writes (64G)
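The price and endurance figures on this slide reduce to two simple ratios. A small sketch, assuming decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes); the function names are illustrative.

```python
GB = 10**9
PB = 10**15

def dollars_per_gb(price_usd: float, capacity_gb: float) -> float:
    return price_usd / capacity_gb

def full_drive_writes(endurance_bytes: int, capacity_bytes: int) -> float:
    """How many times the whole drive can be overwritten within its rated endurance."""
    return endurance_bytes / capacity_bytes

x25e_per_gb = dollars_per_gb(725, 64)
overwrites_32g = full_drive_writes(1 * PB, 32 * GB)
overwrites_64g = full_drive_writes(2 * PB, 64 * GB)
print(f"${x25e_per_gb:.2f}/GB, {overwrites_32g:,.0f} full-drive overwrites")
```

Both capacities work out to the same rated figure of 31,250 full-drive overwrites.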
35. X25-M Gen II
★ 80G / 160G
★ Throughput: 35K IOPS reads, 6.5 / 8.5K IOPS writes.
★ Latency: 65us reads, 85us writes.
★ 160GB - $415
✦
~$3 / GB
★ Write Endurance.
✦
Not mentioned in official specification.
36. X25-E and X25-M
★ Even if “E” is enterprise - power loss means data loss.
✦
Loss of transactions.
★ You can disable write cache, but performance is woeful.
37. X25 Deployments
★ RAID
✦
Software / hardware?
✦
Level 0? 1? 10? 5? 50?
★ Engineering process could be complicated and
expensive.
✦
There are/were ready solutions (Schooner[1], Gear6[2], Cisco
servers).
[1] Changed business model recently.
[2] Went broke.
38. Agenda
★ Introduction.
★ A look at the current market.
★ Applications.
39. MySQL Specific (1)
★ SSD is very good at Random reads.
✦
Not so good at sequential writes!
★ Data files on SSD.
✦
Table files (*.ibd).
✦
Rollback segments (ibdata1).
★ Logs on RAID with BBU.
✦
Binary logs.
✦
Transaction logs.
✦
Double write buffer.
✦
Insert buffer.
✦
Slow log, error log, general log.
See: http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html
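A rough sketch of how part of this split could be expressed in my.cnf, generated here from Python. The mount points /mnt/ssd/mysql and /mnt/raid/mysql are hypothetical, and only the files that have their own path option are covered: the doublewrite buffer and insert buffer from the slide are not handled by this sketch.

```python
import configparser
import io

def build_my_cnf(ssd_dir: str, raid_dir: str) -> str:
    """Render a [mysqld] section placing data files on SSD and logs on RAID."""
    cfg = configparser.ConfigParser()
    cfg["mysqld"] = {
        # Data files on SSD: per-table .ibd files and ibdata1.
        "datadir": ssd_dir,
        "innodb_data_home_dir": ssd_dir,
        "innodb_file_per_table": "1",
        # Logs on the RAID volume with BBU.
        "innodb_log_group_home_dir": f"{raid_dir}/iblogs",
        "log-bin": f"{raid_dir}/mysql-bin",
        "slow_query_log_file": f"{raid_dir}/slow.log",
        "log-error": f"{raid_dir}/error.log",
        "general_log_file": f"{raid_dir}/general.log",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()

my_cnf = build_my_cnf("/mnt/ssd/mysql", "/mnt/raid/mysql")   # hypothetical paths
print(my_cnf)
```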
40. MySQL Specific (2)
★ Buy memory, or buy SSDs?
✦
[Usually] Buy memory when it’s possible.
41. Other Reasons to use Flash (1)
★ Server Consolidation.
✦
Hard drives do ~100-200 IOPS*
✦
Now one card can get 100K (theoretical)!
✦
~2x - 10x reduction in server count in many cases (see Craigslist).
* Assuming no RAID controller performing additional merging.
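The consolidation argument is simple division. A small sketch using the slide's own numbers (100-200 IOPS per hard drive, a theoretical 100K IOPS per card); the function name is illustrative.

```python
import math

def disks_needed(target_iops: int, iops_per_disk: int) -> int:
    """Number of hard drives needed to match a target IOPS figure."""
    return math.ceil(target_iops / iops_per_disk)

fast_disks = disks_needed(100_000, 200)   # optimistic 200 IOPS per drive
slow_disks = disks_needed(100_000, 100)   # pessimistic 100 IOPS per drive
print(f"one 100K IOPS card ~= {fast_disks} to {slow_disks} hard drives")
```

Real workloads rarely reach the theoretical card figure, so the ratio is an upper bound rather than something to expect in production.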
42. Other Reasons to use Flash (2)
★ Power consumption reduction.
✦
“Transactions per watt” incredibly higher.
• See: http://www.percona.com/files/percona-live/jeremy-Craigslist.pptx.pdf
✦
Important for a large number of people. Even if power is
cheap, colo facilities often limit availability per-rack.
43. Other Reasons to use Flash (3)
★ Limit variance / risk of operational issues from cold
starts.
✦
Easy to see something like an advertising network miss
response time goals when aim is 50ms/page.
• Each IO is ~10ms.
• Follow a few secondary keys to a primary key and you miss it.
★ Good for throughput too.
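The cold-start budget can be worked through with integers. The 50 ms page goal and the ~10 ms per disk IO are from this slide; the 50 microsecond flash figure is the speaker's NAND estimate from the "Numbers everyone should know" slide. The function name is illustrative.

```python
def serial_ios_within_budget(budget_us: int, io_us: int) -> int:
    """How many dependent (one after another) IOs fit into a response-time budget."""
    return budget_us // io_us

BUDGET_US = 50_000                                          # 50 ms per page
disk_ios = serial_ios_within_budget(BUDGET_US, 10_000)      # ~10 ms per disk IO
flash_ios = serial_ios_within_budget(BUDGET_US, 50)         # ~50 us per flash IO
print(f"disk: {disk_ios} serial IOs, flash: {flash_ios} serial IOs")
```

On a cold buffer pool, five dependent disk reads already consume the whole budget, while flash leaves room for around a thousand.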
45. Short Term (1)
★ Multi-threaded IO is required to exploit all throughput
offered.
✦
InnoDB Plugin, MySQL 5.5 ready.
✦
Many other databases are not ready.
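A minimal sketch of what "multi-threaded IO" means in practice: several threads issuing random reads against the same file at once, so the device always has requests queued. The 16 KiB request size, the 8 threads, and the function names are assumptions for illustration; `os.pread` is available on Unix only, and on a small temporary file like this one the reads are served from the OS page cache, so this shows the pattern and not a benchmark.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE = 16 * 1024   # assumed request size

def read_page(fd: int, page_no: int) -> bytes:
    # pread takes an explicit offset, so threads can share one file descriptor.
    return os.pread(fd, PAGE, page_no * PAGE)

def parallel_random_reads(path: str, page_count: int, reads: int, threads: int = 8) -> int:
    """Issue `reads` random page reads from `threads` workers; return bytes read."""
    rng = random.Random(0)
    pages = [rng.randrange(page_count) for _ in range(reads)]
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            return sum(len(buf) for buf in pool.map(lambda p: read_page(fd, p), pages))
    finally:
        os.close(fd)

# Demo on a throwaway 1 MiB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(64 * PAGE))
total = parallel_random_reads(tmp.name, page_count=64, reads=256)
os.unlink(tmp.name)
```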
46. Short Term (2)
★ Opportunities for multi-level caches when data exceeds
SSD size.
✦
See Flashcache (Facebook), ZFS L2ARC, Veritas.
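A toy illustration of the multi-level idea: a small RAM tier in front of a larger flash tier in front of slow backing storage. This is not how Flashcache or L2ARC are implemented; the class name, the LRU policy, and the slot counts are assumptions.

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-level LRU cache: RAM first, then flash, then the backing store."""

    def __init__(self, ram_slots: int, ssd_slots: int, backing_read):
        self.ram = OrderedDict()
        self.ssd = OrderedDict()
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.backing_read = backing_read
        self.hits = {"ram": 0, "ssd": 0, "disk": 0}

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.ssd:
            value = self.ssd.pop(key)              # promote back into RAM
            self.hits["ssd"] += 1
        else:
            value = self.backing_read(key)         # slowest path
            self.hits["disk"] += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            old_key, old_value = self.ram.popitem(last=False)
            self.ssd[old_key] = old_value          # demote to the flash tier
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)       # drop the oldest flash entry


cache = TieredCache(ram_slots=1, ssd_slots=2, backing_read=lambda k: f"row-{k}")
cache.get(1)    # disk
cache.get(2)    # disk; key 1 is demoted to the flash tier
cache.get(1)    # served from the flash tier
cache.get(1)    # served from RAM
```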
47. Long Term
★ Decades of hard drive assumptions about random IO
cost need to be unwound.
✦
For example, InnoDB, Oracle, PostgreSQL work like this...
48. Basic Operation (High Level)
Log Files
SELECT * FROM City
WHERE CountryCode='AUS'
Tablespace
Buffer Pool
54. Basic Operation (cont.)
Log Files
UPDATE City SET
name = 'Morgansville'
WHERE name = 'Brisbane'
AND CountryCode='AUS'
Tablespace
Buffer Pool
58. Basic Operation (cont.)
01010
Log Files
UPDATE City SET
name = 'Morgansville'
WHERE name = 'Brisbane'
AND CountryCode='AUS'
Tablespace
Buffer Pool
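A minimal sketch of how the three boxes in the diagram (Log Files, Tablespace, Buffer Pool) typically interact, assuming the usual write-ahead logging pattern. Everything is held in memory and all names are hypothetical; it only illustrates why updates produce sequential log writes now and random tablespace writes later.

```python
class TinyEngine:
    """Toy buffer pool + log files + tablespace, in memory only."""

    def __init__(self, tablespace: dict):
        self.tablespace = dict(tablespace)   # "on disk" pages (random IO)
        self.log_files = []                  # append-only (sequential IO)
        self.buffer_pool = {}                # pages cached in RAM
        self.dirty = set()

    def read(self, page_id):
        if page_id not in self.buffer_pool:  # miss: random read from the tablespace
            self.buffer_pool[page_id] = self.tablespace[page_id]
        return self.buffer_pool[page_id]

    def update(self, page_id, value) -> None:
        self.read(page_id)                         # the page must be in memory
        self.log_files.append((page_id, value))    # record the change in the log
        self.buffer_pool[page_id] = value          # change the in-memory copy
        self.dirty.add(page_id)

    def checkpoint(self) -> int:
        """Later, in the background: write dirty pages back to the tablespace."""
        for page_id in sorted(self.dirty):
            self.tablespace[page_id] = self.buffer_pool[page_id]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed


engine = TinyEngine({"City:Brisbane": "Brisbane"})
engine.update("City:Brisbane", "Morgansville")
on_disk_before = engine.tablespace["City:Brisbane"]   # still the old value
flushed = engine.checkpoint()
on_disk_after = engine.tablespace["City:Brisbane"]
```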
62. Long Term (cont.)
★ Examples of “the database is the log” for MySQL are the
PBXT and RethinkDB storage engines.
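To make "the database is the log" concrete, here is a toy append-only store in which the log is the only copy of the data and an in-memory index points at the latest record for each key. It is a sketch of the general idea only and does not describe the actual design of PBXT or RethinkDB; all names are hypothetical.

```python
class LogStore:
    """Toy append-only key-value store: every write is a sequential append."""

    def __init__(self):
        self.log = []       # append-only list of (key, value) records
        self.index = {}     # key -> position of the latest record in the log

    def put(self, key, value) -> int:
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1
        return self.index[key]

    def get(self, key):
        return self.log[self.index[key]][1]

    def compact(self) -> int:
        """Rewrite only the live records; return how many stale ones were dropped."""
        live_positions = sorted(self.index.values())
        live = [self.log[pos] for pos in live_positions]
        dropped = len(self.log) - len(live)
        self.log = live
        self.index = {key: i for i, (key, _) in enumerate(live)}
        return dropped


store = LogStore()
store.put("Brisbane", "v1")
store.put("Brisbane", "v2")     # the old record stays in the log until compaction
store.put("Sydney", "v1")
latest = store.get("Brisbane")
dropped = store.compact()
```

Reads become random lookups into the log, which is cheap on flash and expensive on spinning disks, which is why this layout suits SSDs.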
63. Storage Hardware also changes
★ Most of us are used to buying RAID controllers and placing
disks below them.
✦
Only a very limited number of RAID controllers understand
SSDs.
✦
RAID controllers are used to optimizing IO for devices
capable of 100-200 IOPS.
✦
If we look at Fusion-io, the devices also do RAID internally
(~RAID4).
64. Technologies to look at
★ More PCI express cards.
✦
Potential to lower barrier to entry - only ~2-3 players,
competition not as hot as it could be (yet).
★ More Enterprise focused MLC.
✦
Better software (firmware) means more wear levelling,
improved performance, etc.
✦
More storage in fewer cells = lower cost.
★ Violin Memory
✦
I am not hands-on familiar with their technology, but they
have some very high end offerings.
✦
Expect more awesome high end offerings (all vendors).
65. Questions
★ Thank you to Confoo for letting me speak about such a
niche topic!
★ If I’m out of time, please feel free to catch me around.