
MemSQL DB Class, Ankur Goyal

What is an in-memory database? Why do they matter? How do you build one? How do people use MemSQL?


  1. 1. 15-415/615 Ankur Goyal 3/17/2016 Based on a lecture given at Carnegie Mellon University. (c) Ankur Goyal
  2. 2. Questions We Will Answer • What is an in-memory database? • Why do they matter? • How do you build one? • How do people use MemSQL? (c) Ankur Goyal
  3. 3. Topics • In-Memory Databases • In-Memory Architecture • MemSQL in the Wild • Q/A (c) Ankur Goyal
  4. 4. Ankur Goyal • CMU SCS (2008-2011), PDL (2010-2011) • Microsoft (2010) • VP of Engineering @ MemSQL (2011-) • I ❤ databases (c) Ankur Goyal
  5. 5. Live Demo (c) Ankur Goyal
  6. 6. What is an in-memory database? (c) Ankur Goyal
  7. 7. In-Memory Databases... • Use memory instead of disk (c) Ankur Goyal
  9. 9. In-Memory Databases... • Use memory instead of disk • Do not (need to) save data on disk (c) Ankur Goyal
  11. 11. In-Memory Databases... • Use memory instead of disk • Do not (need to) save data on disk • Put the whole dataset in memory (c) Ankur Goyal
  13. 13. In-Memory Databases... • Use memory instead of disk • Do not (need to) save data on disk • Put the whole dataset in memory Well, sometimes... (c) Ankur Goyal
  14. 14. Wikipedia says... In-memory databases primarily rely on main-memory for storage. (c) Ankur Goyal
  15. 15. In-Memory Databases • Are durable to disk (and respect ACID) (c) Ankur Goyal
  16. 16. In-Memory Databases • Are durable to disk (and respect ACID) • Can spill on disk or pin data in-memory (and take advantage of it) (c) Ankur Goyal
  17. 17. In-Memory Databases • Are durable to disk (and respect ACID) • Can spill on disk or pin data in-memory (and take advantage of it) • Tradeoffs are suited to systems with lots of memory (c) Ankur Goyal
  18. 18. In-Memory Databases • Are durable to disk (and respect ACID) • Can spill on disk or pin data in-memory (and take advantage of it) • Tradeoffs are suited to systems with lots of memory • Tend to be distributed systems (c) Ankur Goyal
  19. 19. In-Memory Databases • Are durable to disk (and respect ACID) • Can spill on disk or pin data in-memory (and take advantage of it) • Tradeoffs are suited to systems with lots of memory • Tend to be distributed systems • Have a different set of bottlenecks (c) Ankur Goyal
  20. 20. Bold Claim (c) Ankur Goyal
  21. 21. All database workloads will be running on in-memory databases (c) Ankur Goyal
  22. 22. Why? • Memory is getting cheaper (about 40% every year) (c) Ankur Goyal
  23. 23. Why? • Memory is getting cheaper (about 40% every year) • Cache is the new RAM (RAM is the new disk, disk is the new tape, etc) (c) Ankur Goyal
  24. 24. Why? • Memory is getting cheaper (about 40% every year) • Cache is the new RAM (RAM is the new disk, disk is the new tape, etc) • In-memory databases leverage SSD (no random writes) (c) Ankur Goyal
  25. 25. Why? • Memory is getting cheaper (about 40% every year) • Cache is the new RAM (RAM is the new disk, disk is the new tape, etc) • In-memory databases leverage SSD (no random writes) • NVRAM is coming (and could be cheaper than SSD) (c) Ankur Goyal
  26. 26. Why? • Memory is getting cheaper (about 40% every year) • Cache is the new RAM (RAM is the new disk, disk is the new tape, etc) • In-memory databases leverage SSD (no random writes) • NVRAM is coming (and could be cheaper than SSD) In-memory databases are tuned to modern hardware and modern workloads (c) Ankur Goyal
  27. 27. In-Memory Architecture (c) Ankur Goyal
  28. 28. Architecture Topics • In-Memory Storage • Transactions and Concurrency Control • Crash Recovery and Replication • Code Generation • Distributed Execution (c) Ankur Goyal
  29. 29. In-Memory Storage Motivation • Insanely fast random reads & writes (c) Ankur Goyal
  30. 30. In-Memory Storage Motivation • Insanely fast random reads & writes • Atomic writes as granular as a byte (c) Ankur Goyal
  31. 31. In-Memory Storage Motivation • Insanely fast random reads & writes • Atomic writes as granular as a byte • Working space is precious (RAM) (c) Ankur Goyal
  32. 32. In-Memory Storage Motivation • Insanely fast random reads & writes • Atomic writes as granular as a byte • Working space is precious (RAM) • Very different for rowstores and columnstores (c) Ankur Goyal
  33. 33. In-Memory Rowstore • Rowstores have lots of random reads/writes (c) Ankur Goyal
  34. 34. In-Memory Rowstore • Rowstores have lots of random reads/writes • Datasets are usually small < 10 TB (c) Ankur Goyal
  35. 35. In-Memory Rowstore • Rowstores have lots of random reads/writes • Datasets are usually small < 10 TB Solution: keep the whole dataset in memory (c) Ankur Goyal
  36. 36. In-Memory Rowstore • Rowstores have lots of random reads/writes • Datasets are usually small < 10 TB Solution: keep the whole dataset in memory • Use memory-optimized data structures (skip list) (c) Ankur Goyal
  37. 37. What is a Skip List • Invented in 1989 by William Pugh (c) Ankur Goyal
  38. 38. What is a Skip List • Invented in 1989 by William Pugh • Expected O(log(n)) lookup, insert, delete (c) Ankur Goyal
  39. 39. What is a Skip List • Invented in 1989 by William Pugh • Expected O(log(n)) lookup, insert, delete • No pages (c) Ankur Goyal
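
The skip list described on these slides can be sketched in a few lines of C++. This is a minimal, single-threaded illustration of the data structure (SkipNode, SkipList, and MAX_LEVEL are invented names for this sketch, not MemSQL's implementation); it shows the per-node pointer towers and the expected O(log n) search the slides mention, with no pages anywhere.

// Illustrative skip list sketch; not MemSQL's implementation.
#include <climits>
#include <cstdlib>
#include <vector>

const int MAX_LEVEL = 16;

struct SkipNode {
    int key;
    // One forward pointer per level; the "tower" lives with the node,
    // so there are no pages and no page latches.
    std::vector<SkipNode*> forward;
    SkipNode(int k, int level) : key(k), forward(level, nullptr) {}
};

struct SkipList {
    SkipNode head{INT_MIN, MAX_LEVEL};
    int level = 1;

    // Coin-flip tower height gives expected O(log n) search/insert/delete.
    static int randomLevel() {
        int lvl = 1;
        while (lvl < MAX_LEVEL && (std::rand() & 1)) lvl++;
        return lvl;
    }

    SkipNode* search(int key) {
        SkipNode* x = &head;
        for (int i = level - 1; i >= 0; i--)
            while (x->forward[i] && x->forward[i]->key < key)
                x = x->forward[i];
        x = x->forward[0];
        return (x && x->key == key) ? x : nullptr;
    }

    void insert(int key) {
        SkipNode* update[MAX_LEVEL];
        SkipNode* x = &head;
        for (int i = level - 1; i >= 0; i--) {
            while (x->forward[i] && x->forward[i]->key < key)
                x = x->forward[i];
            update[i] = x;   // last node visited at this level
        }
        int lvl = randomLevel();
        for (int i = level; i < lvl; i++) update[i] = &head;
        if (lvl > level) level = lvl;
        SkipNode* n = new SkipNode(key, lvl);
        for (int i = 0; i < lvl; i++) {
            n->forward[i] = update[i]->forward[i];
            update[i]->forward[i] = n;
        }
    }
};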
  40. 40. (c) Ankur Goyal
  41. 41. Common Concerns • Memory overhead (c) Ankur Goyal
  42. 42. (c) Ankur Goyal
  43. 43. Skip List Struct Layout
struct Table_Row {
    int col_a;
    char* col_b;
    …
    Tower* idx_1_ptrs;
    Tower* idx_2_ptrs;
};
(c) Ankur Goyal
  44. 44. Common Concerns • Memory overhead • Scan performance (c) Ankur Goyal
  45. 45. (c) Ankur Goyal
  46. 46. Inefficient Skip List (c) Ankur Goyal
  47. 47. Efficient Skip List (c) Ankur Goyal
  48. 48. Common Concerns • Memory overhead • Scan performance • Reverse Iteration (c) Ankur Goyal
  49. 49. Common Concerns • Memory overhead • Scan performance • Reverse Iteration (HW Assignment) (c) Ankur Goyal
  50. 50. Concurrency Control (c) Ankur Goyal
  51. 51. Concurrency Control • No pages => No latches (c) Ankur Goyal
  52. 52. Concurrency Control • No pages => No latches • Skip list in MemSQL is lock-free (c) Ankur Goyal
  53. 53. Concurrency Control • No pages => No latches • Skip list in MemSQL is lock-free • Every node is a lock-free linked list (c) Ankur Goyal
  54. 54. Concurrency Control • No pages => No latches • Skip list in MemSQL is lock-free • Every node is a lock-free linked list • Row locks are implemented with futexes (4 bytes) (c) Ankur Goyal
  55. 55. Concurrency Control • No pages => No latches • Skip list in MemSQL is lock-free • Every node is a lock-free linked list • Row locks are implemented with futexes (4 bytes) • Read-committed and snapshot isolation (c) Ankur Goyal
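
As a rough illustration of the lock-free technique mentioned above, here is a compare-and-swap insert into a singly linked list in C++. The Node layout and push function are invented for this sketch; the full lock-free skip list and the 4-byte futex-based row locks are more involved and are not shown.

// Illustrative lock-free list insert using CAS; a sketch of the general
// technique, not MemSQL's node layout.
#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k) : key(k), next(nullptr) {}
};

// Insert at the head of a lock-free singly linked list. Many threads can
// call this concurrently without locks or latches: a failed CAS simply
// re-reads the current head and retries.
void push(std::atomic<Node*>& head, int key) {
    Node* n = new Node(key);
    Node* old = head.load(std::memory_order_relaxed);
    do {
        n->next.store(old, std::memory_order_relaxed);
    } while (!head.compare_exchange_weak(old, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}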
  56. 56. In-Memory Columnstore (c) Ankur Goyal
  58. 58. Columnstore Review • Big sequential scans and writes (c) Ankur Goyal
  59. 59. Columnstore Review • Big sequential scans and writes • Huge immutable vectors of data (c) Ankur Goyal
  60. 60. Columnstore Review • Big sequential scans and writes • Huge immutable vectors of data Solution: Cache dataset in memory (c) Ankur Goyal
  61. 61. How do columnstores benefit from in-memory? (c) Ankur Goyal
  62. 62. Have a lock-free skip list handy? (c) Ankur Goyal
  63. 63. Have a lock-free skip list handy? • Keep metadata in-memory • Use sidecar rowstore for fast small-batch writes (c) Ankur Goyal
  64. 64. (c) Ankur Goyal
  65. 65. Columnstore LSM • Log-Structured Merge of sorted runs (c) Ankur Goyal
  66. 66. (c) Ankur Goyal
  67. 67. Columnstore LSM • Log-Structured Merge of sorted runs • Tunable tradeoffs for read/write amplification (c) Ankur Goyal
  68. 68. Columnstore LSM • Log-Structured Merge of sorted runs • Tunable tradeoffs for read/write amplification • Enables fast writes to a sorted columnstore (c) Ankur Goyal
  69. 69. Columnstore LSM • Log-Structured Merge of sorted runs • Tunable tradeoffs for read/write amplification • Enables fast writes to a sorted columnstore • Smallest sorted run is a skip list (c) Ankur Goyal
  70. 70. (c) Ankur Goyal
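
A minimal sketch of the heart of a log-structured merge, assuming the sorted runs are plain sorted vectors of keys: two runs are combined into one larger run. Segment encoding, compression, and delete handling are omitted, and mergeRuns is an illustrative name, not an actual MemSQL routine.

// Sketch of merging two sorted runs, the core step of an LSM.
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> mergeRuns(const std::vector<int>& small_run,
                           const std::vector<int>& large_run) {
    // Every merge rewrites both inputs; choosing when and what to merge is
    // the tunable read/write amplification tradeoff the slides mention.
    std::vector<int> merged;
    merged.reserve(small_run.size() + large_run.size());
    std::merge(small_run.begin(), small_run.end(),
               large_run.begin(), large_run.end(),
               std::back_inserter(merged));
    return merged;
}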
  71. 71. Crash Recovery (c) Ankur Goyal
  72. 72. Durability in an In-Memory System? • Memory is not a reliable medium (yet) (c) Ankur Goyal
  73. 73. Durability in an In-Memory System? • Memory is not a reliable medium (yet) • There is always a hierarchy (c) Ankur Goyal
  74. 74. Durability in an In-Memory System? • Memory is not a reliable medium (yet) • There is always a hierarchy • E.g. EBS -> S3 -> Glacier (c) Ankur Goyal
  75. 75. Durability in an In-Memory System? • Memory is not a reliable medium (yet) • There is always a hierarchy • E.g. EBS -> S3 -> Glacier • To operate at in-memory speed, all disk I/O must be sequential (c) Ankur Goyal
  76. 76. Durability in the Rowstore • Indexes are not materialized on disk (c) Ankur Goyal
  77. 77. Durability in the Rowstore • Indexes are not materialized on disk • Reconstruct indexes on the fly during recovery (c) Ankur Goyal
  78. 78. Durability in the Rowstore • Indexes are not materialized on disk • Reconstruct indexes on the fly during recovery • Only need to log PK data (c) Ankur Goyal
  79. 79. Durability in the Rowstore • Indexes are not materialized on disk • Reconstruct indexes on the fly during recovery • Only need to log PK data • Take full database snapshots periodically (c) Ankur Goyal
  80. 80. Durability in the Rowstore • Indexes are not materialized on disk • Reconstruct indexes on the fly during recovery • Only need to log PK data • Take full database snapshots periodically • Tunable to be sync/async (c) Ankur Goyal
  81. 81. (c) Ankur Goyal
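
A rough sketch of the logging side of this scheme, under assumed names (RedoRecord, logAppend): row data is appended sequentially to a log file, index towers are never written to disk, and the fsync call is what separates the synchronous from the asynchronous durability mode.

// Sequential redo logging sketch (POSIX I/O); names are illustrative.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

struct RedoRecord {
    long txn_id;
    char op;             // 'I' insert, 'U' update, 'D' delete
    char row_data[56];   // primary-key + column data only, no index pointers
};

// Append a record to the log. Calling fsync() before acknowledging the
// commit gives synchronous durability; skipping or batching it gives the
// asynchronous mode mentioned on the slide.
bool logAppend(int log_fd, const RedoRecord& rec, bool sync) {
    if (write(log_fd, &rec, sizeof(rec)) != (ssize_t)sizeof(rec)) return false;
    if (sync && fsync(log_fd) != 0) return false;
    return true;
}

int main() {
    int fd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    RedoRecord rec{42, 'I', {0}};
    std::strncpy(rec.row_data, "pk=1,a=10", sizeof(rec.row_data) - 1);
    logAppend(fd, rec, /*sync=*/true);
    close(fd);
}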
  82. 82. Durability in the Columnstore • Metadata uses ordinary rowstore mechanism (c) Ankur Goyal
  83. 83. Durability in the Columnstore • Metadata uses ordinary rowstore mechanism • Segments are huge (several KB or even MB) (c) Ankur Goyal
  84. 84. Durability in the Columnstore • Metadata uses ordinary rowstore mechanism • Segments are huge (several KB or even MB) • Read/written sequentially (c) Ankur Goyal
  85. 85. Durability in the Columnstore • Metadata uses ordinary rowstore mechanism • Segments are huge (several KB or even MB) • Read/written sequentially • Columnstore segments synchronously written to disk (c) Ankur Goyal
  86. 86. Durability in the Columnstore • Metadata uses ordinary rowstore mechanism • Segments are huge (several KB or even MB) • Read/written sequentially • Columnstore segments synchronously written to disk • Memory-speed writes go to sidecar rowstore (c) Ankur Goyal
  87. 87. Crash Recovery • Replay latest snapshot, and then every log file since (c) Ankur Goyal
  88. 88. Crash Recovery • Replay latest snapshot, and then every log file since • No partially written state on disk, so no undos (c) Ankur Goyal
  89. 89. Crash Recovery • Replay latest snapshot, and then every log file since • No partially written state on disk, so no undos • Columnstore just replays metadata (c) Ankur Goyal
  90. 90. Crash Recovery • Replay latest snapshot, and then every log file since • No partially written state on disk, so no undos • Columnstore just replays metadata • Replication == Continuous replay over the network (c) Ankur Goyal
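
The recovery procedure above can be sketched as a pure replay loop. This is an illustration under assumed names (LogEntry, recover), with std::map standing in for the in-memory index that is rebuilt on the fly; no index state is read from disk and there is no undo phase.

// Recovery-as-replay sketch; names and types are illustrative only.
#include <map>
#include <string>
#include <vector>

struct LogEntry { char op; int pk; std::string row; };

// std::map stands in for the in-memory index rebuilt during replay.
using Table = std::map<int, std::string>;

Table recover(const std::vector<LogEntry>& snapshot_rows,
              const std::vector<std::vector<LogEntry>>& logs_since_snapshot) {
    Table t;
    // 1. Replay the snapshot: it contains only committed row data, so there
    //    is no partially written state to undo.
    for (const LogEntry& e : snapshot_rows)
        t[e.pk] = e.row;
    // 2. Replay every log file written after the snapshot, in order.
    for (const auto& log : logs_since_snapshot)
        for (const LogEntry& e : log) {
            if (e.op == 'D') t.erase(e.pk);
            else             t[e.pk] = e.row;   // insert or update
        }
    return t;
}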
  91. 91. Code Generation (c) Ankur Goyal
  92. 92. import time

class Row(object):
    def __init__(self, a):
        self.a = a

t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print "Answer: %d, Time (s): %g" % (state.agg_sum, (end-start))
(c) Ankur Goyal
  93. 93. #include <cstdio>
#include <cstdint>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows) { loop(s, r); }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }
    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();
    printf("Answer: %lld, Time (s): %g\n",
           state.agg_sum, (end-start) * 1.0 / CLOCKS_PER_SEC);
}
(c) Ankur Goyal
  94. 94. Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
(c) Ankur Goyal
  95. 95. Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
37x difference in execution
(c) Ankur Goyal
  96. 96. Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
37x difference in execution
1.37x even with compilation time
(c) Ankur Goyal
  97. 97. Code Generation • Expression execution (c) Ankur Goyal
  98. 98. Code Generation • Expression execution • Inline scans (c) Ankur Goyal
  99. 99. Code Generation • Expression execution • Inline scans • Need a powerful plan cache (c) Ankur Goyal
  100. 100. Code Generation • Expression execution • Inline scans • Need a powerful plan cache • OLTP vs. data exploration (c) Ankur Goyal
  101. 101. Plancache Example (1) SELECT * FROM users WHERE id = 5 SELECT * FROM users WHERE id = 8 => SELECT * FROM users WHERE id = @ (c) Ankur Goyal
  102. 102. Plancache Example (2) SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7) SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4) => SELECT * FROM users WHERE id IN (@) OR a IN (@) (c) Ankur Goyal
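
A toy version of the parameterization step behind these two examples, assuming a purely textual rewrite of integer literals. A real plancache parameterizes on the parse tree and also collapses IN-lists as in example (2); parameterize is an invented name for illustration only.

// Sketch: replace integer literals with '@' so queries differing only in
// constants map to the same cached plan.
#include <cctype>
#include <string>

std::string parameterize(const std::string& sql) {
    std::string out;
    for (size_t i = 0; i < sql.size(); ) {
        if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
            while (i < sql.size() &&
                   std::isdigit(static_cast<unsigned char>(sql[i]))) i++;
            out += '@';   // "id = 5" and "id = 8" both become "id = @"
        } else {
            out += sql[i++];
        }
    }
    return out;
}

// parameterize("SELECT * FROM users WHERE id = 5")
//   == parameterize("SELECT * FROM users WHERE id = 8")
//   == "SELECT * FROM users WHERE id = @"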
  103. 103. Drill Down Example
SELECT region, SUM(price) FROM sales GROUP BY region
  => SELECT rep, SUM(price) FROM sales WHERE region="northeast" GROUP BY rep;
  => SELECT rep, SUM(price) FROM sales WHERE region=^ GROUP BY rep;

  => SELECT product, SUM(price) FROM sales WHERE region="northwest" GROUP BY product;
  => SELECT product, SUM(price) FROM sales WHERE region=^ GROUP BY product;
(c) Ankur Goyal
  104. 104. Drill Down Example
SELECT region, SUM(price) FROM sales GROUP BY region
  => SELECT rep, SUM(price) FROM sales WHERE region="northeast" GROUP BY rep;
  => SELECT rep, SUM(price) FROM sales WHERE region=^ GROUP BY rep;

  => SELECT product, SUM(price) FROM sales WHERE region="northwest" GROUP BY product;
  => SELECT product, SUM(price) FROM sales WHERE region=^ GROUP BY product;
No plancache match! (c) Ankur Goyal
  105. 105. Let's look at some generated code (c) Ankur Goyal
  106. 106. Expression Snippet
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.00 sec)
(c) Ankur Goyal
  107. 107. Old Code Generation
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

bool overflow = false;
VarCharTemp result1("foo", 3, threadId);
VarCharTemp result2("bar", 3, threadId);
opt<TemporaryImmutableString> result3;
op_Concat(result3, result1, result2, overflow, threadId);
(c) Ankur Goyal
  108. 108. Code Generation is Hard • Old compilers adage: Pick 2 of 3 (c) Ankur Goyal
  109. 109. Code Generation is Hard • Old compilers adage: Pick 2 of 3 • Fast execution time • Fast compile time • Fast development time (c) Ankur Goyal
  110. 110. Code Generation is Hard • Old compilers adage: Pick 2 of 3 • Fast execution time • Fast compile time • Fast development time • E.g. Assembly, C++, Python (c) Ankur Goyal
  111. 111. Code Generation is Hard • Old compilers adage: Pick 2 of 3 • Fast execution time • Fast compile time • Fast development time • E.g. Assembly, C++, Python • JIT compilers turned this on its head (c) Ankur Goyal
  112. 112. MemSQL Compiler Pipeline (c) Ankur Goyal
  113. 113. Expression Snippet (MPL)
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

declare outRow3 <- OutRowInit()
OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2),
                               UpdateCollation(OptString("bar"),2)))
OutRowSend(&outRow3)
(c) Ankur Goyal
  114. 114. MBC Snippet
OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2),
                               UpdateCollation(OptString("bar"),2)))

0x0048 OutRowInit local=&outRow
0x0050 InitString local=&local_2 data=0 i64=3 coll=unspecified
0x0068 UpdateCollation local=&local_2 coll=utf8_general_ci
0x0074 InitString local=&local_3 data=1 i64=3 coll=unspecified
0x008c UpdateCollation local=&local_3 coll=utf8_general_ci
0x0098 Concat local=&local local=&local_2 local=&local_3
0x00a8 OutRowString local=&outRow local=&local target=0x01ac
0x00b8 OptStringFree local=&local
0x00c0 OptStringFree local=&local_3
0x00c8 OptStringFree local=&local_2
0x00d0 InitString local=&local_5 data=2 i64=3 coll=unspecified
0x00e8 UpdateCollation local=&local_5 coll=utf8_general_ci
0x00f4 InitString local=&local_6 data=3 i64=3 coll=unspecified
0x010c UpdateCollation local=&local_6 coll=utf8_general_ci
0x0118 Concat local=&local_4 local=&local_5 local=&local_6
0x0128 OutRowString local=&outRow local=&local_4 target=0x018c
0x0138 OptStringFree local=&local_4
0x0140 OptStringFree local=&local_6
0x0148 OptStringFree local=&local_5
(c) Ankur Goyal
  115. 115. MBC Snippet
0x0048 OutRowInit local=&outRow
0x0050 InitString local=&local_2 data=0 i64=3 coll=unspecified
0x0068 UpdateCollation local=&local_2 coll=utf8_general_ci
0x0074 InitString local=&local_3 data=1 i64=3 coll=unspecified
0x008c UpdateCollation local=&local_3 coll=utf8_general_ci
0x0098 Concat local=&local local=&local_2 local=&local_3
0x00a8 OutRowString local=&outRow local=&local target=0x01ac
0x00b8 OptStringFree local=&local
0x00c0 OptStringFree local=&local_3
0x00c8 OptStringFree local=&local_2
0x00d0 InitString local=&local_5 data=2 i64=3 coll=unspecified
0x00e8 UpdateCollation local=&local_5 coll=utf8_general_ci
0x00f4 InitString local=&local_6 data=3 i64=3 coll=unspecified
(c) Ankur Goyal
  116. 116. Distributed Query Execution (c) Ankur Goyal
  117. 117. (c) Ankur Goyal
  118. 118. First, some terminology (c) Ankur Goyal
  119. 119. (c) Ankur Goyal
  120. 120. (c) Ankur Goyal
  121. 121. (c) Ankur Goyal
  122. 122. Much easier to reason in terms of shipping SQL (c) Ankur Goyal
  123. 123. (c) Ankur Goyal
  124. 124. (c) Ankur Goyal
  125. 125. SELECT supp_nation,
       cust_nation,
       l_year,
       Sum(volume) AS revenue
FROM   (SELECT n1.n_name AS supp_nation,
               n2.n_name AS cust_nation,
               Extract(year FROM l_shipdate) AS l_year,
               l_extendedprice * ( 1 - l_discount ) AS volume
        FROM   supplier,
               lineitem,
               orders,
               customer,
               nation n1,
               nation n2
        WHERE  s_suppkey = l_suppkey
               AND o_orderkey = l_orderkey
               AND c_custkey = o_custkey
               AND s_nationkey = n1.n_nationkey
               AND c_nationkey = n2.n_nationkey
               AND ( ( n1.n_name = 'CANADA'
                       AND n2.n_name = 'UNITED STATES' )
                     OR ( n1.n_name = 'RUSSIA'
                          AND n2.n_name = 'UNITED STATES' ) )
               AND l_shipdate BETWEEN Date('1995-01-01') AND Date('1996-12-31'))
       AS shipping
GROUP  BY supp_nation,
          cust_nation,
          l_year
ORDER  BY supp_nation,
          cust_nation,
          l_year;
(c) Ankur Goyal
  126. 126. Abstractions • Distributed Query Plan created on aggregator (c) Ankur Goyal
  127. 127. Abstractions • Distributed Query Plan created on aggregator • Layers of primitive operations glued together (c) Ankur Goyal
  128. 128. Abstractions • Distributed Query Plan created on aggregator • Layers of primitive operations glued together • Full SQL on leaves • REMOTE tables • RESULT tables (c) Ankur Goyal
  129. 129. Primitives (SQL) • Queries over physical indexes (c) Ankur Goyal
  130. 130. Primitives (SQL) • Queries over physical indexes • Hook into global transactional state (c) Ankur Goyal
  131. 131. Primitives (SQL) • Queries over physical indexes • Hook into global transactional state • Full SQL on a single partition (c) Ankur Goyal
  132. 132. Primitives (SQL) • Queries over physical indexes • Hook into global transactional state • Full SQL on a single partition • Access to rowstores and columnstores (c) Ankur Goyal
  133. 133. Primitives (SQL)
Example query the aggregator can send to the leaf:

SELECT t.a, t.b, SUM(t.price)
FROM t            -- This will scan a physical table on the leaf
WHERE t.c = 1000  -- This will use a local index
GROUP BY t.a, t.b -- This will produce 1 row per group
(c) Ankur Goyal
  134. 134. Primitives (Remote Tables) • Address data across leaves (c) Ankur Goyal
  135. 135. Primitives (Remote Tables) • Address data across leaves • SQL interface + custom shard key (c) Ankur Goyal
  136. 136. Primitives (Remote Tables) • Address data across leaves • SQL interface + custom shard key • Parallel execution primitives • Reshuffling • Merging on group keys • Merging data from joins (e.g. left joins) (c) Ankur Goyal
  137. 137. Primitives (Remote Tables)
SELECT t.a, SUM(s_net.c)
FROM
  -- The row in s where s_net.b = t.a may not
  -- be on the same node as the local t. REMOTE(s)
  -- addresses the table across the cluster.
  t, REMOTE(s) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a
(c) Ankur Goyal
  138. 138. Primitives (Remote Tables)
SELECT t.a, SUM(s_net.c)
FROM
  -- This is a reshuffle operation. It relies on t
  -- being sharded on (t.a) and type(t.a) == type(s.b).
  -- It will only pull rows in s.b that match the
  -- shard key's local values of (t.a).
  t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a
(c) Ankur Goyal
  139. 139. Primitives (Result Tables) • Shared, cached results of SQL queries (c) Ankur Goyal
  140. 140. Primitives (Result Tables) • Shared, cached results of SQL queries • Shares scans/computations across readers (c) Ankur Goyal
  141. 141. Primitives (Result Tables) • Shared, cached results of SQL queries • Shares scans/computations across readers • Supports streaming semantics (c) Ankur Goyal
  142. 142. Primitives (Result Tables) • Shared, cached results of SQL queries • Shares scans/computations across readers • Supports streaming semantics • Technically an optimization (c) Ankur Goyal
  143. 143. Primitives (Result Tables) • Shared, cached results of SQL queries • Shares scans/computations across readers • Supports streaming semantics • Technically an optimization • Similar to an RDD in Spark (c) Ankur Goyal
  144. 144. Primitives (Result Tables)
CREATE RESULT TABLE t_reshuffled AS
SELECT t.a, t.b, SUM(t.price)
FROM t
GROUP BY t.a, t.b
SHARD BY t.a, t.b
(c) Ankur Goyal
  145. 145. Optimizations • Single-machine optimizations (c) Ankur Goyal
  146. 146. Optimizations • Single-machine optimizations • Index selection, Sorting/Grouping (c) Ankur Goyal
  147. 147. Optimizations • Single-machine optimizations • Index selection, Sorting/Grouping • SQL -> SQL rewrites (c) Ankur Goyal
  148. 148. Optimizations • Single-machine optimizations • Index selection, Sorting/Grouping • SQL -> SQL rewrites • Cost-based distributed optimizer (c) Ankur Goyal
  149. 149. Optimizations • Single-machine optimizations • Index selection, Sorting/Grouping • SQL -> SQL rewrites • Cost-based distributed optimizer • Broadcast vs. Reshuffling (c) Ankur Goyal
  150. 150. Optimizations • Single-machine optimizations • Index selection, Sorting/Grouping • SQL -> SQL rewrites • Cost-based distributed optimizer • Broadcast vs. Reshuffling • and many, many more (c) Ankur Goyal
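
As a sketch of the broadcast-vs-reshuffle decision listed above, here is a toy cost model in C++. JoinSide, choose, and the cost formulas are assumptions made up for illustration; the actual cost-based optimizer works over much richer statistics.

// Toy network-cost model for choosing a distributed join strategy.
#include <cstdint>

struct JoinSide { int64_t rows; };

enum class Strategy { Broadcast, Reshuffle };

Strategy choose(JoinSide small_side, JoinSide big_side, int num_leaves) {
    // Broadcast: ship every row of the smaller side to every leaf.
    int64_t broadcast_cost = small_side.rows * num_leaves;
    // Reshuffle: repartition both sides on the join key; each row moves once.
    int64_t reshuffle_cost = small_side.rows + big_side.rows;
    return broadcast_cost < reshuffle_cost ? Strategy::Broadcast
                                           : Strategy::Reshuffle;
}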
  151. 151. MemSQL in the Wild (c) Ankur Goyal
  152. 152. Horizontals and Verticals • Real-time data processing is everywhere (c) Ankur Goyal
  153. 153. Horizontals and Verticals • Real-time data processing is everywhere • Top use-cases: Real-Time Analytics and Large-Scale Applications (c) Ankur Goyal
  154. 154. Horizontals and Verticals • Real-time data processing is everywhere • Top use-cases: Real-Time Analytics and Large-Scale Applications • Top verticals: Financial Services, Webscale, Telco, Federal, Media (c) Ankur Goyal
  155. 155. Real-time Analytics • High volumes of data, processed in real-time (c) Ankur Goyal
  156. 156. Real-time Analytics • High volumes of data, processed in real-time • Fast updates in the rowstore • INSERT ... ON DUPLICATE KEY UPDATE • E.g. 2M update transactions/sec on 10 nodes (c) Ankur Goyal
  157. 157. Real-time Analytics • High volumes of data, processed in real-time • Fast updates in the rowstore • INSERT ... ON DUPLICATE KEY UPDATE • E.g. 2M update transactions/sec on 10 nodes • Fast appends, even one row at a time, in the columnstore • E.g. 1 GB/s on 16 EC2 nodes (c) Ankur Goyal
  158. 158. Real-time Analytics • Converging with mainline analytics (c) Ankur Goyal
  159. 159. Real-time Analytics • Converging with mainline analytics • No compromises (e.g. limited SQL, limited windows) (c) Ankur Goyal
  160. 160. Real-time Analytics • Converging with mainline analytics • No compromises (e.g. limited SQL, limited windows) • Real-time means fast reads as well (c) Ankur Goyal
  161. 161. Real-time Analytics • Converging with mainline analytics • No compromises (e.g. limited SQL, limited windows) • Real-time means fast reads as well • Subsecond queries for dashboards (c) Ankur Goyal
  162. 162. Real-time Analytics • Converging with mainline analytics • No compromises (e.g. limited SQL, limited windows) • Real-time means fast reads as well • Subsecond queries for dashboards • Millisecond queries for applications (c) Ankur Goyal
  163. 163. Large-Scale Applications • Large-scale operational analytics and applications (c) Ankur Goyal
  164. 164. Large-Scale Applications • Large-scale operational analytics and applications • Hundreds of nodes for perf and HA (c) Ankur Goyal
  165. 165. Large-Scale Applications • Large-scale operational analytics and applications • Hundreds of nodes for perf and HA • True "production" workloads (c) Ankur Goyal
  166. 166. Large-Scale Applications • Large-scale operational analytics and applications • Hundreds of nodes for perf and HA • True "production" workloads • Existing OLTP databases lack scalability and SQL perf (c) Ankur Goyal
  167. 167. Large-Scale Applications • Large-scale operational analytics and applications • Hundreds of nodes for perf and HA • True "production" workloads • Existing OLTP databases lack scalability and SQL perf • Existing OLAP databases lack operational features (c) Ankur Goyal
  168. 168. Logos (c) Ankur Goyal
  169. 169. Take-Aways • In-memory Database != All-memory Database (c) Ankur Goyal
  170. 170. Take-Aways • In-memory Database != All-memory Database • In-memory Databases are databases built to modern tradeoffs (c) Ankur Goyal
  171. 171. Take-Aways • In-memory Database != All-memory Database • In-memory Databases are databases built to modern tradeoffs • Old problems with new solutions (c) Ankur Goyal
  172. 172. Take-Aways • In-memory Database != All-memory Database • In-memory Databases are databases built to modern tradeoffs • Old problems with new solutions • Real-time analytics and Large-scale applications == New projects (c) Ankur Goyal
  173. 173. Take-Aways • In-memory Database != All-memory Database • In-memory Databases are databases built to modern tradeoffs • Old problems with new solutions • Real-time analytics and Large-scale applications == New projects • We are hiring and ❤ Waterloo. • Come visit us in SF: email ankur@memsql.com (c) Ankur Goyal
  174. 174. Questions (c) Ankur Goyal
