5. Agenda
● Introduction
● Why a single DBMS is not enough
● What makes a DBMS
● Different flavors of DBMS
● Top picks
6. Why one DBMS is not enough
"If you feel things are not efficient in your code, it is likely that you are suffering from a poor data structure choice/design" ~ Anonymous
7. Why one DBMS is not enough
● Different data structures
● Different access patterns
● Different consistency and durability requirements
● Different scaling needs
● Different budgets
● Theoretical fundamentalism
8. Why one DBMS is not enough
A more concrete example
OLAP vs OLTP
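The contrast can be sketched in a few lines. A minimal example, using an in-memory SQLite database and a hypothetical `sales` table: the OLTP query is a short, indexed point read on one row; the OLAP query is a full-scan aggregation over every row.

```python
import sqlite3

# Hypothetical "sales" table, just to contrast the two access patterns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 5.0)])

# OLTP: short, indexed point read touching a single row.
row = con.execute("SELECT amount FROM sales WHERE id = 2").fetchone()

# OLAP: full-scan aggregation touching every row.
totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

The same schema serves both queries here, but at scale the two workloads pull storage layout, indexing, and buffering in opposite directions, which is exactly why one DBMS is often not enough.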
14. What makes a DBMS: General
● Licensing
● Language support
● OS support
● Community & workforce
● Tools ecosystem
15. ● Data Architecture
○ Logical data model
○ Physical data model
● Standards adherence (where defined)
● Atomicity
● Consistency
● Isolation
● Durability
● Referential integrity
● Transactions
● Locking
● Crash recovery
● Unicode support
What makes a DBMS: Fundamental Features
16. ● Interface / connectors / protocols
● Sequences / auto-incrementals / atomic counters
● Conditional entry updates
● MapReduce
● Compression
● In-memory
● Availability
● Concurrency handling
● Scalability
● Embeddable
● Backups
What makes a DBMS: Fundamental Features cont.
17. ● CRUD
● Union
● Intersect
● JOIN (inner, outer)
● Inner selects
● Merge joins
● Common Table Expressions
● Windowing Functions
● Parallel Query
● Subqueries
● Aggregation
● Derived tables
What makes a DBMS: querying capabilities
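Two of the capabilities listed above, Common Table Expressions and windowing functions, can be demonstrated together. A sketch using Python's `sqlite3` (window functions require SQLite >= 3.25) with a hypothetical table `t`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("a", 2), ("b", 10)])

rows = con.execute("""
    WITH totals AS (                                 -- Common Table Expression
        SELECT grp, SUM(val) AS total FROM t GROUP BY grp
    )
    SELECT grp, total,
           RANK() OVER (ORDER BY total DESC) AS rnk  -- window function
    FROM totals
    ORDER BY rnk
""").fetchall()
```

Whether a given DBMS supports these at all, and how completely, is precisely the kind of checklist item this slide is about.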
18. ● Cursors
● Triggers
● Stored procedures
● Functions
● Views
● Materialized views
● Virtual columns
● UDF
● XML/JSON/YAML support
What makes a DBMS: programmatic capabilities
19. ● Database size (sum of all table sizes)
● Number of Tables
● Individual Table Size
● Variable length column size
● Row width
● Row columns count
● Row count
● Column name
● Blob size
● Char
● Numeric
● Date (min / max)
What makes a DBMS: sizing limits
20. ● B-Tree
● Full text indexing
● Hash
● Bitmap
● Expression
● Partials
● Reverse
● GiST
● GIS indexing
● Composite keys
● Graph support
What makes a DBMS: indexing
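Why stores bother offering both hash and B-Tree indexes can be shown with a toy sketch: a hash index gives O(1) point lookups but no useful ordering, while an ordered index (here a sorted list standing in for a B-Tree) supports range scans.

```python
import bisect

# Hypothetical rows keyed by id.
records = {101: "a", 7: "b", 55: "c", 23: "d"}

# Hash index: O(1) point lookups, no useful key ordering.
hash_index = dict(records)

# Ordered index (B-Tree stand-in): sorted keys support range scans.
ordered_keys = sorted(records)

def range_scan(lo, hi):
    # Find all rows whose key falls in [lo, hi] via binary search.
    i = bisect.bisect_left(ordered_keys, lo)
    j = bisect.bisect_right(ordered_keys, hi)
    return [records[k] for k in ordered_keys[i:j]]
```

Bitmap, expression, partial, GiST and GIS indexes are further variations on the same trade: pay at write time to make a specific read shape cheap.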
22. Partitioning
● Range
● Hash
● Range+hash
● List
● Expression
● Sub-partitioning
Sharding
● By key
● By table
What makes a DBMS: scalability
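Sharding by key can be sketched in a few lines: hash the key, take it modulo the shard count, and route reads and writes to that shard. This is a deliberately naive 4-shard layout; real systems add virtual nodes and rebalancing so that adding a shard does not remap almost every key.

```python
import hashlib

N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]  # stand-ins for 4 separate servers

def shard_for(key):
    # Stable hash -> shard number; md5 used only for deterministic spread.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

for i in range(100):
    put(f"user:{i}", i)
```

Partitioning (range, hash, list, expression) applies the same idea within one server, usually to prune which partition a query must touch.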
23. ● Integer
● Floating point
● Decimal
● String
● Binary
● Date/time
● Boolean
● Set
● Enumeration
● Blob
● Clob
● JSON/XML/YAML (as native types)
What makes a DBMS: supported data types
24. ● Authentication methods
● Access Control Lists
● Pluggable Authentication Modules support
● Encryption at-rest
● Encryption over the wire
● User proxy
What makes a DBMS: security features
25. ● Data organization model: unstructured, semi-structured, structured
● Data model (schema) stability: Static? Stable? Dynamic? Highly dynamic?
● Writes: append-only; append-mostly; updates-only; updates-mostly
● Reads: full scans; range scans; multi-range scans; point reads
● Reads by age: new only; new mostly; old only; old mostly; whole range
● Reads by complexity: simple; related; deeply-nested relations; …
What makes a DBMS: workload
26. ACID vs BASE
● Atomic
● Consistent
● Isolated
● Durable
● Basic Availability
● Soft-state
● Eventual Consistency
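The A in ACID is easy to demonstrate. A minimal sketch with SQLite, whose connection context manager commits on success and rolls back on any exception: a transfer that fails halfway leaves no half-done debit behind.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
con.commit()

try:
    with con:  # transaction: commit on success, rollback on exception
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("crash mid-transfer")  # simulated failure
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
except RuntimeError:
    pass

balances = dict(con.execute("SELECT name, balance FROM accounts"))
# Atomicity: the debit above was rolled back, so "a" still has 100.
```

A BASE system makes the opposite bet: accept the write now, let replicas converge later, and push conflict handling onto the application.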
36. Relational Databases: JOINs
SELECT o.order_id AS order_no,
       CONCAT(c.customer_name, ' (', c.customer_email, ')') AS customer,
       GROUP_CONCAT(i.item_name) AS items,
       SUM(i.item_price) AS total
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN items AS i ON i.item_id = oi.item_id
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY o.order_id, customer
37. Relational Databases: good use cases
● Highly-structured data with complex querying needs
● Projects that need very high data durability and guarantees of database-level
consistency and integrity
● Simple projects with limited data growth and limited amount of entities
● Projects that require PCI DSS, HIPAA or similar security compliance
● Analysis of portions of larger BigData stores
● Projects where duplicated data volumes would be a problem
38. Relational Databases: bad use cases
● Unstructured data
● Deep Hierarchies / Nested -> XML
● Deep recursion
● Ever-growing datasets; Projects that are basically logging data
● Projects recording time-series
● Reporting on massive datasets
39. Relational Databases: bad use cases
● Projects supporting extreme concurrency
● Projects supporting massive data intake
● Queues
● Cache storage
42. ● Well known / mature / extensive documentation
● GPLv2 + commercial license for OEMs, ISVs and VARs
● Client libraries for about every programming language
● Many different engines
● SQL/ACID impose scalability limits
● Asynchronous / Semi-synchronous / Virtually synchronous replication
● Can be AP or CP depending on replication model
Relational Databases: MySQL
43. PROs
● Open source
● Mature and ubiquitous
● ACID
● Choice of AP or CP
● Highly available
● Abundant tooling and expertise
● General purpose; likely good to start anything you want
CONs
● Difficult to shard
● Replication issues
● Not 100% standard compliant
● Storage engines impose limitations
● General purpose; no silver-bullet solutions for scaling!
45. ● Mature / adequate documentation
● PostgreSQL License (similar to BSD/MIT)
● Client libraries for about every programming language
● Highly Standards Compliant
● SQL/ACID impose scalability limits
● Asynchronous / Semi-synchronous replication
● Virtually synchronous replication via 3rd party
● Can be AP or CP depending on replication model
Relational Databases: PostgreSQL
46. PROs
● Open source
● Mature and stable
● ACID
● Lots of advanced features
● Vacuum
CONs
● Difficult to shard
● Operations feel like an afterthought
● Less forgiving
● Vacuum
57. K/V Stores - Good Use Cases
● Lots of data
○ Usually easily horizontally scalable
● Object cache in front of RDBMS
○ Memcached, anyone?
● High concurrency
○ Very simple locking model
● Massive small-data intake
● Simple data access patterns
○ CRUD on PK access
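"CRUD on PK access" is the whole interface. A minimal in-process sketch of what memcached or Redis do over the network (minus expiry, eviction, and replication):

```python
# Hypothetical in-process K/V store: every operation is keyed by primary key.
store = {}

def create(key, value):
    store[key] = value               # PUT / SET

def read(key, default=None):
    return store.get(key, default)   # GET

def update(key, value):
    store[key] = value               # overwrite in place

def delete(key):
    store.pop(key, None)             # DEL

create("user:42", {"name": "Mary"})
update("user:42", {"name": "Mary", "plan": "pro"})
```

Because every operation touches exactly one key, there is nothing to lock beyond that key and nothing that prevents spreading keys across machines, which is where the scalability and concurrency bullets above come from.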
63. K/V Stores - Bad Use Cases
● Durability and consistency*
● Complex data access patterns*
● Non-PK access*
● Operations*
○ Complex systems fail in complex ways
81. Columnar Data Layout
● Row-oriented Read Approach
[Diagram: each memory page holds whole rows, e.g. (10, Smith, Bob, 40000), (12, Jones, Mary, 50000), (11, Johnson, Cathy, 44000); reading one column means stepping through every row]
82. Columnar Data Layout
● Column-oriented Read Approach
[Diagram: each memory page holds one column's values contiguously, e.g. (10, 12, 11, 22), (Smith, Jones, Johnson, …); reading one column touches only that column's pages]
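The two layouts can be sketched side by side. Using the example rows from the slides: computing `SUM(salary)` over the row layout reads every field of every row, while the column layout reads exactly the one array the query needs.

```python
# Same three employee rows in both layouts (hypothetical data from the slides).
rows = [(10, "Smith", "Bob", 40000),
        (12, "Jones", "Mary", 50000),
        (11, "Johnson", "Cathy", 44000)]

# Column-oriented: one contiguous array per attribute.
columns = {
    "id":      [r[0] for r in rows],
    "surname": [r[1] for r in rows],
    "name":    [r[2] for r in rows],
    "salary":  [r[3] for r in rows],
}

# SUM(salary), row layout: walks all 12 fields to use 3 of them.
row_sum = sum(r[3] for r in rows)

# SUM(salary), column layout: reads exactly the 3 fields needed.
col_sum = sum(columns["salary"])
```

On disk the gap is larger than this toy suggests: column pages compress far better (similar values sit together) and irrelevant columns are never fetched at all.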
83. Columnar Databases - Considerations
● Buffering and compression can help to reduce the impact of writes, but they should still be avoided when possible
○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based format
● Covering indexes in row-based stores could provide similar benefits, but only up to a point → index maintenance work can become too expensive
● Column-based stores are self-indexing and more disk-space efficient
● SQL can be used for most column-based stores
87. ● Suitable for read-mostly or read-intensive, large data repositories
● Good for full table / large range reads
● Good for unstructured problems where “good” indexes are hard to forecast
● Good for re-creatable datasets
● Good for structured data
Columnar Database - Good use cases
92. ● Not good for “SELECT *” queries or queries fetching most of the columns
● Not good for writes
● Not good for mixed read/write
● Bad for unstructured data
Columnar Database - Bad use cases
103. Graph Databases - Good Use Cases
● Highly Connected Data
○ Network & IT Operations, Recommendations, Fraud Detection, Social Networking, Identity & Access Management, Geo Routing, Insurance Risk Analysis, Counter Terrorism
● Millions or Billions of Records
○ Relational databases can also solve this problem at a smaller scale
● Re-Creatable Data Set
○ Keep as much as possible outside of the critical path
● Structured Data
○ You cannot graph a relationship unless you can define it
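What "highly connected data" buys you can be sketched with a friends-of-friends query, which is just a breadth-first traversal over an adjacency list; graph databases make exactly this kind of hop-by-hop walk index-free and cheap. The names and edges here are hypothetical.

```python
from collections import deque

# Hypothetical social graph as an adjacency list.
edges = {"ann": ["bob", "cat"], "bob": ["dan"], "cat": [], "dan": ["eve"], "eve": []}

def within_hops(start, max_hops):
    """Everyone reachable from `start` in at most `max_hops` edge traversals."""
    seen, frontier = {start}, deque([(start, 0)])
    reached = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                frontier.append((nxt, depth + 1))
    return reached
```

In a relational store the same query becomes one self-JOIN per hop, which is why "relational databases can also solve this problem at a smaller scale" but fall over as depth and fan-out grow.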
109. Graph Databases - Bad Use Cases
● Unstructured Data
○ You cannot graph a relationship if you cannot define it
● Non-Connected Data
○ Graphiness is important here
● Highly Concurrent RW Workloads
○ Performance breaks down
● Anything in the Critical OLTP Path*
○ I'm not only talking about writes here
● Ever-Growing Data Set
123. PROs
● Solves a very specific (and hard) data problem
● Learning curve not bad for developer usage
● Data analysts’ dream
CONs
● Very little operational expertise for hire
● Little community and virtually no tooling for administration and operations
● Big mismatch in paradigm vs RDBMS; hard to switch for DBAs
● Hard/Expensive to scale horizontally
● Writes are computationally expensive
127. Time Series - Good Use Cases
● Uh … Time Series Data
● Write-mostly (95%+) - Sequential Appends
● Rare updates, rarer still to the distant past
● Deletes occur at the opposite end (the beginning)
● Data does not fit in memory
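The access pattern above, appends at one end and retention-driven deletes at the other, can be sketched with a deque; the one-hour retention window is a hypothetical parameter.

```python
from collections import deque

points = deque()    # (timestamp, value), kept in arrival order
RETENTION = 3600    # hypothetical: keep one hour of data

def append(ts, value):
    # New points only ever go on the right ...
    points.append((ts, value))
    # ... and deletes only ever happen at the oldest (left) end.
    while points and points[0][0] < ts - RETENTION:
        points.popleft()

for t in range(0, 7200, 600):   # two hours of samples, one every 10 minutes
    append(t, t * 0.1)
```

Because writes and deletes each touch only one end of the structure, time-series stores can organize data into time-ordered blocks and drop whole expired blocks at once instead of deleting row by row.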
133. Time Series - Bad Use Cases
● Uh … Not Time Series Data
● Small data
134. Example Time Series Databases
● InfluxDB
● Graphite
● OpenTSDB
● Blueflood
● Prometheus
141. PROs
● Solves a very specific (big) data problem
● Well-defined and finite data access patterns
CONs
● Terrible query semantics
155. Document Stores: MongoDB
● Sharding and replication for dummies!
● Pluggable storage engines for distinct workloads
○ Different locking behaviors
● Excellent compression options with PerconaFT, RocksDB, WiredTiger
● On-disk encryption (Enterprise Advanced)
● In-memory storage engine (Beta)
● Connectors for all major programming languages
● Sharding- and replica-aware connectors
● Geospatial functions
● Aggregation framework
● … a lot more, except being transactional
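The document model and the shape of an aggregation pipeline can be sketched in pure Python. This mimics the flow of MongoDB's $match/$group stages over schemaless documents, not its actual API; the collection and fields are hypothetical.

```python
# Hypothetical document collection: dicts with no enforced schema.
docs = [
    {"_id": 1, "city": "NYC", "qty": 5},
    {"_id": 2, "city": "NYC", "qty": 10},
    {"_id": 3, "city": "SF",  "qty": 2},
]

def match(docs, pred):
    # Analogue of a $match stage: keep documents satisfying a predicate.
    return [d for d in docs if pred(d)]

def group_sum(docs, key, field):
    # Analogue of a $group stage summing `field` per `key` value.
    out = {}
    for d in docs:
        out[d[key]] = out.get(d[key], 0) + d[field]
    return out

by_city = group_sum(match(docs, lambda d: d["qty"] > 1), "city", "qty")
```

Each stage feeds the next, which is what lets the server split pipeline work across shards before merging results.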
170. Document Stores: Couchbase
● MongoDB - more or less
● Global Secondary Indexes (Multi Dimensional Scaling) are exciting: they produce localized secondary indexes for low-latency queries
● Drop-in replacement for Memcache
173. Document Stores: Couchbase > Use Cases
● Internet of Things (direct or indirect receiver/pipeline)
● Mobile data persistence via Couchbase Mobile, e.g. field devices with unstable connections and local/close-proximity ingestion points
● Distributed K/V store
183. Fulltext Search: Elasticsearch
● Lucene based
● RESTful interface - JSON in, JSON out
● Flexible schema
● Automatic sharding and replication (NDB-like)
● Reasonable defaults
● Extension model
● Written in Java; JVM limitations apply, e.g. GC
● ELK - Elasticsearch+Logstash+Kibana
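What "Lucene based" means in practice is an inverted index: a map from term to the set of documents containing it, so a word query is a set lookup and an AND query is a set intersection. A toy sketch with hypothetical documents:

```python
from collections import defaultdict

docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick dog"}

# Build the inverted index: term -> set of document ids (posting set).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search_all(*terms):
    # AND query: intersect the posting sets of every term.
    sets = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []
```

Real Lucene adds analyzers (tokenization, stemming), per-term scoring, and segment merging on top, but the core lookup is this structure.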
191. Fulltext Search: Elasticsearch > Use Cases
● Logs Analysis - ELK Stack, e.g. Netflix
● Full Text search, e.g. GitHub, Wikipedia, StackExchange, etc.
● https://www.elastic.co/use-cases
○ Sentiment analysis
○ Personalized experience
○ etc.
194. ● Lucene based
● Quite cryptic query interface - Innovator’s Dilemma
● Support for SQL-based queries as of 6.1
● Structured schema; data types need to be predefined
● Written in Java; JVM limitations apply, e.g. GC
● Near real-time indexing - DIH
● Rich document handling - PDF, doc[x]
● SolrCloud support for sharding and replication
Fulltext Search: Solr
202. ● Search and Relevancy
○ https://www.percona.com/live/data-performance-conference-2016/sessions/solr-how-index-10-billion-phrases-mysql-and-hbase
● Recommendation Engine
● Spatial Search
Fulltext Search: Solr > Use Cases
205. ● Structured data
● MySQL protocol - SphinxQL
● Durable indexes via binary logs
● Real-time indexes via MySQL queries
● Distributed index for scaling
● No native support for replication; workarounds e.g. via rsync
● Very good documentation
● Fastest full indexing/reindexing [?]
Fulltext Search: Sphinx Search
213. ● Real-time full text + basic geo functions
● The above as a dependency, or with simplified access via SphinxQL or even the Sphinx storage engine for MySQL
Fulltext Search: Sphinx Search > Use Cases