Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

ToroDB: scaling PostgreSQL like MongoDB

3.406 Aufrufe

Veröffentlicht am

NoSQL databases have emerged as a response to some perceived problems in the RDBMSs: agile/dynamic schemas; and transparent, horizontal scaling of the database. The former has been promptly targeted with the introduction of unstructured data types, but scaling a relational databases is still a very hard problem.

As a consequence, all NoSQL databases have been built from scratch: their storage engines, replication techniques, journaling, ACID support (if any). They haven't leveraged the previously existing state-of-the-art of RDBMSs, effectively re-inventing the wheel. Isn't this sub-optimal? Wouldn't it be possible to construct a NoSQL database by layering it on top of a relational database?

Enter ToroDB. ToroDB is an open source project that behaves as a NoSQL database but runs on top of PostgreSQL, one of the most respected and reliable relational databases. ToroDB offers a document (JSON-like) interface, and implements the MongoDB wire protocol, hence being compatible with existing MongoDB drivers and applications. Rather than using PostgreSQL's jsonb data type, ToroDB explored an innovative approach by transforming JSON documents to a fully relational representation, in an automated way. This brings to the table many advantages like lower disk footprint and automatic data-partitioning, leading to significantly faster queries.

As ToroDB speaks the MongoDB protocol, it also implements MongoDB replication and sharding techniques, enabling it to scale and offer HA like Mongo. Being based on PostgreSQL, ToroDB is effectively scaling PostgreSQL much in the same way MongoDB scales.

This presentation describes the architecture, internals and pitfalls of implementing MongoDB replication on ToroDB, and how key PostgreSQL technologies have been leveraged to accomplish this task (such as the use of Logical Decoding to serve idempotent database changes). It also addresses the MongoDB protocol itself, CAP, Jepsen and comments about real performance. It also touches MongoWP, a component of ToroDB built as a separate open source library, that implements the MongoDB protocol and enables development of Mongo-server-like, third-party applications.

Veröffentlicht in: Software

ToroDB: scaling PostgreSQL like MongoDB

  1. 1. ToroDB SCALING PostgreSQL LIKE MongoDB @NoSQLonSQL Álvaro Hernández <aht@torodb.com>
  2. 2. ToroDB @NoSQLonSQL About *8Kdata* ● Research & Development in databases ● Consulting, Training and Support in PostgreSQL ● Founders of PostgreSQL España, 5th largest PUG in the world (>500 members as of today) ● About myself: CTO at 8Kdata: @ahachete http://linkd.in/1jhvzQ3 www.8kdata.com
  3. 3. ToroDB @NoSQLonSQL ToroDB in brief
  4. 4. ToroDB @NoSQLonSQL ToroDB in one slide ● Document-oriented, JSON, NoSQL db ● Open source (AGPL) ● MongoDB compatibility (wire protocol level) ● Uses PostgreSQL as a storage backend
  5. 5. ToroDB @NoSQLonSQL ToroDB storage ● Data is stored in tables. No blobs ● JSON documents are split by hierarchy levels into “subdocuments”, which contain no nested structures. Each subdocument level is stored separately ● Subdocuments are classified by “type”. Each “type” maps to a different table
  6. 6. ToroDB @NoSQLonSQL ToroDB storage (II) ● A “structure” table keeps the subdocument “schema” ● Keys in JSON are mapped to attributes, which retain the original name ● Tables are created dinamically and transparently to match the exact types of the documents
  7. 7. ToroDB @NoSQLonSQL ToroDB storage internals { "name": "ToroDB", "data": { "a": 42, "b": "hello world!" }, "nested": { "j": 42, "deeper": { "a": 21, "b": "hello" } } }
  8. 8. ToroDB @NoSQLonSQL ToroDB storage internals The document is split into the following subdocuments: { "name": "ToroDB", "data": {}, "nested": {} } { "a": 42, "b": "hello world!"} { "j": 42, "deeper": {}} { "a": 21, "b": "hello"}
  9. 9. ToroDB @NoSQLonSQL ToroDB storage internals select * from demo.t_3 ┌─────┬───────┬────────────────────────────┬────────┐ │ did │ index │ _id │ name │ ├─────┼───────┼────────────────────────────┼────────┤ │ 0 │ ¤ │ x5451a07de7032d23a908576d │ ToroDB │ └─────┴───────┴────────────────────────────┴────────┘ select * from demo.t_1 ┌─────┬───────┬────┬──────────────┐ │ did │ index │ a │ b │ ├─────┼───────┼────┼──────────────┤ │ 0 │ ¤ │ 42 │ hello world! │ │ 0 │ 1 │ 21 │ hello │ └─────┴───────┴────┴──────────────┘ select * from demo.t_2 ┌─────┬───────┬────┐ │ did │ index │ j │ ├─────┼───────┼────┤ │ 0 │ ¤ │ 42 │ └─────┴───────┴────┘
  10. 10. ToroDB @NoSQLonSQL ToroDB storage internals select * from demo.structures ┌─────┬────────────────────────────────────────────────────────────────────────────┐ │ sid │ _structure │ ├─────┼────────────────────────────────────────────────────────────────────────────┤ │ 0 │ {"t": 3, "data": {"t": 1}, "nested": {"t": 2, "deeper": {"i": 1, "t": 1}}} │ └─────┴────────────────────────────────────────────────────────────────────────────┘ select * from demo.root; ┌─────┬─────┐ │ did │ sid │ ├─────┼─────┤ │ 0 │ 0 │ └─────┴─────┘
  11. 11. ToroDB @NoSQLonSQL ToroDB storage and I/O savings 29% - 68% storage required, compared to Mongo 2.6
  12. 12. ToroDB @NoSQLonSQL ToroDB: query “by structure” ● ToroDB is effectively partitioning by type ● Structures (schemas, partitioning types) are cached in ToroDB memory ● Queries only scan a subset of the data ● Negative queries are served directly from memory
  13. 13. ToroDB @NoSQLonSQL Scaling ToroDB like MongoDB
  14. 14. ToroDB @NoSQLonSQL Big Data: NoSQL vs SQL vs http://www.networkworld.com/article/2226514/tech-debates/what-s-better-for-your-big-data-application--sql-or-nosql-.html
  15. 15. ToroDB @NoSQLonSQL Scalability?
  16. 16. ToroDB @NoSQLonSQL Ways to scale ● Vertical scalability ➔ Concurrency scalability ➔ Hardware scalability ➔ Query scalability ● Read scalability (replication) ● Write scalability (horizontal, sharding)
  17. 17. ToroDB @NoSQLonSQL Vertical scalability Concurrency scalability ● SQL is usually better (e.g. PostgreSQL): ➔ Finer locking ➔ MVCC ➔ better caching ● NoSQL often needs sharding within the same host to scale
  18. 18. ToroDB @NoSQLonSQL Vertical scalability Hardware scalability ● Scaling with the number of cores? ● Process/threading model? Query scalability ● Use of indexes? Use of more than one? ● Table/collection partitioning? ● ToroDB “by-type” partitioning
  19. 19. ToroDB @NoSQLonSQL Read scalability: replication ● Replicate data to slave nodes, available read-only: scale-out reads ● Both NoSQL and SQL support it ● Binary replication usually faster (e.g. PostgreSQL's Streaming Replication) ● Not free from undesirable phenomena
  20. 20. ToroDB @NoSQLonSQL https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads Dirty and stale reads (“call me maybe”)
  21. 21. ToroDB @NoSQLonSQL MongoDB write acknowledge https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
  22. 22. ToroDB @NoSQLonSQL MongoDB's dirty and stale reads Dirty reads A primary in minority accepts a write that other clients see, but it later steps down, write is rolled back (fixed in 3.2?) Stale reads A primary in minority serves a value that ought to be current, but a newer value was written to the other primary in minority
  23. 23. ToroDB @NoSQLonSQL Write scalability (sharding) ● NoSQL better prepared than SQL ● But many compromises in data modeling (schema design): no FKs ● There are also solutions for SQL: ➔ Shared-disk, limited scalability (RAC) ➔ Sharding (like pg_shard) ➔ PostgreSQL's FDWs
  24. 24. ToroDB @NoSQLonSQL Read scalability (replication) in MongoDB and ToroDB
  25. 25. ToroDB @NoSQLonSQL Replication protocol choice ● ToroDB is based on PostgreSQL ● PostgreSQL has either binary streaming replication (async or sync) or logical replication ● MongoDB has logical replication ● ToroDB uses MongoDB's protocol
  26. 26. ToroDB @NoSQLonSQL MongoDB's replication protocol ● Every change is recorded in JSON documents, idempotent format (collection: local.oplog.rs) ● Slaves pull these documents from master (or other slaves) asynchronously ● Changes are applied and feedback is sent upstream
  27. 27. ToroDB @NoSQLonSQL MongoDB replication implementation https://github.com/stripe/mosql
  28. 28. ToroDB @NoSQLonSQL Announcing ToroDB v0.4 (snap) Supporting MongoDB replication
  29. 29. ToroDB @NoSQLonSQL ToroDB v0.4 ● ToroDB works as a secondary slave of a MongoDB master (or slave) ● Implements the full replication protocol (not an oplog tailable query) ● Replicates from Mongo to a PostgreSQL ● Open source github.com/torodb/torodb (repl branch, version 0.4-SNAPSHOT)
  30. 30. ToroDB @NoSQLonSQL Advantages of ToroDB w/ replication ● Native SQL ● Query “by type” ● Better SQL scaling ● Less concurrency contention ● Better hardware utilization ● No need for ETL from Mongo to PG!
  31. 31. ToroDB @NoSQLonSQL ● NoSQL is trying to get back to SQL ● ToroDB is SQL native! ● Insert with Mongo, query with SQL! ● Powerful PostgreSQL SQL: window functions, recursive queries, hypothetical aggregates, lateral joins, CTEs, etc ToroDB: native SQL
  32. 32. ToroDB @NoSQLonSQL ToroDB: native SQL
  33. 33. ToroDB @NoSQLonSQL db.bds15.insert({ id:5, person: { name: "Alvaro", surname: "Hernandez", contact: { email: "aht@torodb.com", verified: true } } }) db.bds15.insert({ id:6, person: { name: "Name", surname: "Surname", age: 31, contact: { email: "name.surname@gmail.com" } } }) Introducing ToroDB VIEWs
  34. 34. ToroDB @NoSQLonSQL torodb$ select * from bds15.person ; ┌─────┬───────────┬────────┬─────┐ │ did │ surname │ name │ age │ ├─────┼───────────┼────────┼─────┤ │ 0 │ Hernandez │ Alvaro │ ¤ │ │ 1 │ Surname │ Name │ 31 │ └─────┴───────────┴────────┴─────┘ (2 rows) torodb$ select * from bds15."person.contact"; ┌─────┬──────────┬────────────────────────┐ │ did │ verified │ email │ ├─────┼──────────┼────────────────────────┤ │ 0 │ t │ aht@torodb.com │ │ 1 │ ¤ │ name.surname@gmail.com │ └─────┴──────────┴────────────────────────┘ (2 rows) Introducing ToroDB VIEWs
  35. 35. ToroDB @NoSQLonSQL VIEWs, ToroDB from any SQL tool
  36. 36. ToroDB @NoSQLonSQL Enabling real DW on MongoDB: ToroDB + Greenplum
  37. 37. ToroDB @NoSQLonSQL ● GreenPlum was open sourced 6 days ago ● ToroDB already runs on GP! ● Goal: ToroDB v0.4 replicates from a MongoDB master onto Toro-Greenplum and run the reporting using distributed SQL ToroDB on GP: SQL DW for MongoDB
  38. 38. ToroDB @NoSQLonSQL SELECT count( distinct( "reviewerID" ) ) FROM reviews; Queries: which one is easier? db.reviews.aggregate([ { $group: { _id: "reviewerID"} }, { $group: {_id: 1, count: { $sum: 1}} } ])
  39. 39. ToroDB @NoSQLonSQL SELECT "reviewerName", count(*) as reviews FROM reviews GROUP BY "reviewerName" ORDER BY reviews DESC LIMIT 10; Queries: which one is easier? db.reviews.aggregate( [ { $group : { _id : '$reviewerName', r : { $sum : 1 } } }, { $sort : { r : -1 } }, { $limit : 10 } ], {allowDiskUse: true} )
  40. 40. ToroDB @NoSQLonSQL ● Amazon reviews dataset Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR, 2015 ● AWS c4.xlarge (4vCPU, 8GB RAM) 4KIOPS SSD EBS ● 4x shards, 3x config; 4x segments GP ● 83M records, 65GB plain json Benchmark
  41. 41. ToroDB @NoSQLonSQL Disk usage Mongo 3.0, WT, Snappy GP columnar, zlib level 9 table size index size total size 0 10000000000 20000000000 30000000000 40000000000 50000000000 60000000000 70000000000 80000000000 Storage requirements MongoDB vs ToroDB on Greenplum Mongo ToroDB on GP bytes
  42. 42. ToroDB @NoSQLonSQL Query times 3 different queries Q3 on MongoDB: aggregate fails 27,95 74,87 0 0 200 400 600 800 1000 1200 969 1007 0 35 13 31 Query duration (s) MongoDB vs ToroDB on Greenplum MongoDB ToroDB on GP speedup seconds

×