Couchbase and Apache Spark

Slides presented at SDBigData Meetup:
http://www.meetup.com/sdbigdata/events/225691323/

There was a request for more Couchbase use case information and a NoSQL primer, so I added a number of slides that let me speak to those topics right before the presentation.

Couchbase and Apache Spark

  1. Couchbase and Apache Spark: efficient data crunching in a fast moving world
  2. ©2015 Couchbase Inc. Matt Ingenthron
      • Worked on large site scalability problems at previous company
      • memcached contributor
      • Joined Couchbase very early and helped define key parts of the system
  3. A Quick Architectural Introduction to Couchbase
  4. Couchbase is a Document Oriented Database
      • High availability cache
      • Key-value store
      • Document database
      • Embedded database
      • Sync management
      Couchbase can be used in a number of ways. Developers often need a simple distributed hashtable, then grow to need secondary indexing, and are either mobile-first or need to address mobile deployment.
  5. What makes Couchbase unique? Enterprises choose Couchbase for several key advantages:
      • Performance & scalability leader: sub-millisecond latency with high throughput; memory-centric architecture
      • Multi-purpose: cache, key-value store, document database, and local/mobile database in a single platform
      • Simplified administration: easy to deploy & manage; integrated Admin Console, single-click cluster expansion & rebalance
      • Always-on availability (24x365): data replication across nodes, clusters, and data centers
  6. Multi-purpose database supports many uses
      • Tunable built-in cache
          – Consolidated cache and database
          – Tune memory required based on application requirements
      • Flexible schemas with JSON
          – Represent data with varying schemas using JSON on the server or on the device
          – Index and query data with JavaScript views
      • Couchbase Lite
          – Lightweight embedded DB for always-available apps
          – Sync Gateway syncs data seamlessly with Couchbase Server
  7. Couchbase leads in performance and scalability
      • Auto sharding
          – No manual sharding
          – Database manages data movement to scale out, not the user
      • Memory-to-memory XDCR
          – Market’s only memory-to-memory database replication across clusters and geos
          – Provides disaster recovery / data locality
      • Single node type
          – Hugely simplifies management of clusters
          – Easy to scale clusters by adding any number of nodes
  8. Couchbase delivers always-on availability (24x365)
      • High Availability
          – In-memory replication with manual or automatic failover
          – Rack-zone awareness to minimize data unavailability
      • Disaster Recovery
          – Memory-to-memory cross-cluster replication across data centers or geos
          – Active-active topology with bidirectional setup
      • Backup & Restore
          – Full backup or incremental backup with online restore
          – Delta node catch-ups for faster recovery after failures
  9. Simplified administration for exceptional ease of use
      • Online upgrades and operations
          – Online software, hardware and DB upgrades
          – Indexing, compaction, rebalance, backup & restore
      • Built-in enterprise-class admin console
          – Perform all administrative tasks with the click of a button
          – Monitor the status of the system visually at cluster level, database level, server level
      • RESTful APIs
          – All admin operations available via UI, REST APIs or CLI commands
          – Integrate third-party monitoring tools easily using REST
  10. Couchbase Server Architecture: single-node type means easier administration and scaling
      • Single installation
      • Two major components/processes: data manager and cluster manager
      • Data manager:
          – C/C++
          – Layer consolidation of caching and persistence
      • Cluster manager:
          – Erlang/OTP
          – Administration UIs
          – Out-of-band for data requests
  11. Write Operation
      [Diagram: application server, managed cache, replication queue, disk queue, disk; DOC 1]
      • Writes are async by default
      • Application gets acknowledgement when the write is successfully in RAM and can trade off waiting for replication or persistence per write
      • Replication to 1, 2 or 3 other nodes
      • Replication is RAM-based so extremely fast
      • Off-node replication is primary level of HA
      • Disk written to as fast as possible, no waiting
  12. Basic Operation
      [Diagram: Couchbase Servers 1 to 3, each holding active and replica shards]
      • Application has single logical connection to cluster (client object)
      • Data is automatically sharded, resulting in even document data distribution across the cluster
      • Each vbucket replicated 1, 2 or 3 times (“peer-to-peer” replication)
      • Docs are automatically hashed by the client to a shard
      • Cluster map provides location of which server a shard is on
      • Every read/write/update/delete goes to the same node for a given key
      • Strongly consistent data access (“read your own writes”)
      • A single Couchbase node can achieve 100k’s ops/sec so no need to scale reads
  13. Cache Ejection
      [Diagram: application server, managed cache, replication queue, disk queue, disk; DOC 1 to DOC 5]
      • Layer consolidation means read-through and write-through cache
      • Couchbase automatically removes data that has already been persisted from RAM
  14. Cache Miss
      [Diagram: application server, managed cache, replication queue, disk queue, disk; GET DOC 1]
      • Layer consolidation means one single interface for the app to talk to and get its data back as fast as possible
      • Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
  15. Add Nodes to Cluster
      [Diagram: Couchbase Servers 1 to 5 with active and replica shards; READ/WRITE/UPDATE]
      • Application has single logical connection to cluster (client object)
      • Multiple nodes added or removed at once
      • One-click operation
      • Incremental movement of active and replica vbuckets and data
      • Client library updated via cluster map
      • Fully online operation, no downtime or loss of performance
  16. Node Unresponsive / Lost
  17. Fail Over Node
      [Diagram: Couchbase Servers 1 to 5 with active and replica shards]
      • Application has single logical connection to cluster (client object)
      • When a node goes down, some requests will fail
      • Failover is either automatic or manual
      • Client library is automatically updated via cluster map
      • Replicas not recreated to preserve stability
      • Best practice to replace node and rebalance
  18. Demo
  19. What about Hadoop?
  20. Big Data = Operational + Analytic (NoSQL + Hadoop)
      • Operational (NoSQL)
          – Online
          – Web/Mobile/IoT apps
          – Millions of customers/consumers
      • Analytic (Hadoop)
          – Offline
          – Analytics apps
          – Hundreds of business analysts
  21. [Diagram labels: complex event processing, real time repository, perpetual store, analytical DB, business intelligence, monitoring, chat/voice system, batch track, real-time track, dashboard]
  22. [Diagram labels: tracking and collection, analysis and visualization, REST, filter, metrics]
  23. Apache Spark: The Big Picture
  24. Apache Spark … is a fast and general purpose engine for small and large scale data processing …
  25. Components: Spark Core
      • Resilient Distributed Datasets
      • Clustering
      • Execution
  26. Components: Spark SQL
      • Structured through DataFrames
      • Distributed querying with SQL
  27. Components: Spark Streaming
      • Fault-tolerant streaming applications
  28. Components: Spark MLlib
      • Built-in machine learning algorithms
  29. Components: Spark GraphX
      • Graph processing and graph-parallel computations
  30. How does it work? Source: http://spark.apache.org/docs/latest/cluster-overview.html
  31. Spark Benefits
      • Linearly scalable to 1000+ worker nodes
      • Simpler to use than Hadoop MR
      • Only partial recompute on failure
      • For developers and data scientists
          – Machine learning
          – R integration
      • Tight but not mandatory Hadoop integration
          – Sources, Sinks
          – Scheduler
  32. Spark vs Hadoop
      • Spark is RAM bound while Hadoop is mainly HDFS (disk) bound
      • Fully compatible with Hadoop Input/Output
      • Easier to develop against thanks to functional composition
      • Hadoop is certainly more mature, but the Spark ecosystem is growing fast
  33. Couchbase in the Spark Landscape: the perfect storage companion for your Spark applications
      • Transparent generation and persistence of
          – RDDs
          – DataFrames
          – DStreams
      • Spark SQL and N1QL are a natural fit
      • Linearly scale your data and application layer
      • Share data between Spark applications
      Source: http://spark.apache.org/docs/latest/cluster-overview.html
  34. Cluster Communication
      [Diagram: Couchbase Servers 1 to 6, each with cluster manager, managed cache, storage, and data, index and query services; Spark workers]
  35. Ecosystem Flexibility
      [Diagram labels: RDBMS Streams Web APIs DCP KV N1QL Views Batching Data Archive OLTP Data]
  36. Infrastructure Consolidation
  37. The Connector
  38. Couchbase Connector
      • Spark Core
          – Automatic cluster and resource management
          – Creating and persisting RDDs
          – Java APIs in addition to Scala
      • Spark SQL
          – Easy JSON handling and querying
          – Tight N1QL integration
      • Spark Streaming
          – Persisting DStreams
          – DCP source (experimental)
  39. Facts
      • Current version: 1.0.0-beta
      • Code: https://github.com/couchbaselabs/couchbase-spark-connector
      • Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html
  40. Connection Management
  41. Connection Management
  42. Creating RDDs
  43. Persisting RDDs
  44. Spark SQL Integration
  45. Spark Streaming with DCP
  46. What’s next?
  47. Couchbase Connector
      • Learn more:
          – Couchbase and Spark at Couchbase Connect 2015: http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/
      • 1.1.0 plans
          – Upgrade to Spark 1.5
          – Stabilize DCP support
          – Extend, optimize, fix bugs…
      • We need your feedback!
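
Slides 12 and 17 say that the client hashes each document key to a shard (vbucket), finds the owning server in the cluster map, and gets an updated map after a failover in which replicas are not recreated. The following is a minimal Python sketch of that idea, not the Couchbase client itself. The CRC32 bit manipulation, the count of 1024 vbuckets, the round-robin layout, and the names `vbucket_for_key` and `ClusterMap` are assumptions made for illustration.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed shard count, not stated on the slides


def vbucket_for_key(key, num_vbuckets=NUM_VBUCKETS):
    """Map a document key to a vbucket id with a CRC32-based hash (assumed scheme)."""
    crc = zlib.crc32(key.encode("utf-8"))
    return ((crc >> 16) & 0x7FFF) % num_vbuckets


class ClusterMap:
    """Toy cluster map: one ordered chain [active, replica, ...] per vbucket."""

    def __init__(self, nodes, num_vbuckets=NUM_VBUCKETS, replicas=1):
        self.num_vbuckets = num_vbuckets
        # Round-robin placement (an assumption) so that a chain never
        # repeats a node while replicas < len(nodes).
        self.chains = [
            [nodes[(vb + i) % len(nodes)] for i in range(replicas + 1)]
            for vb in range(num_vbuckets)
        ]

    def node_for_key(self, key):
        """Every operation on a given key goes to the active node of its vbucket."""
        return self.chains[vbucket_for_key(key, self.num_vbuckets)][0]

    def fail_over(self, dead_node):
        """Drop the dead node from every chain; the next copy becomes active.

        No new replicas are created here, mirroring "replicas not recreated".
        Assumes at least one replica survives for every vbucket.
        """
        for chain in self.chains:
            if dead_node in chain:
                chain.remove(dead_node)


cluster = ClusterMap(["server1", "server2", "server3"])
owner_before = cluster.node_for_key("user::1001")
cluster.fail_over(owner_before)
owner_after = cluster.node_for_key("user::1001")
```

Because the key-to-vbucket hash never changes, only the map from vbucket to server has to be redistributed when nodes are added, removed, or failed over.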

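Slides 11, 13 and 14 describe a write that is acknowledged once it is in RAM, ejection of already-persisted documents when RAM fills up, and a cache miss that copies a document back from disk. A toy single-node model in Python can make that flow concrete. It is a sketch under stated assumptions, not Couchbase code: the class `ManagedCacheNode`, its method names, the capacity counted in documents, and the simple "recently used" bit are all hypothetical.

```python
from collections import deque


class ManagedCacheNode:
    """Toy model of one node: a RAM cache in front of disk with async persistence."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity   # max documents in RAM (assumed unit)
        self.ram = {}                      # key -> value, the managed cache
        self.disk = {}                     # key -> value, persisted copies
        self.meta = set()                  # all known keys; always kept in RAM
        self.dirty = set()                 # keys in RAM that are not yet on disk
        self.recently_used = set()         # "recently used" bit per key
        self.disk_queue = deque()          # keys waiting to be persisted
        self.replication_queue = deque()   # keys waiting to go to replicas

    def set(self, key, value):
        """Acknowledge once the document is in RAM; disk and replicas follow later."""
        self._make_room(key)
        self.ram[key] = value
        self.meta.add(key)
        self.dirty.add(key)
        self.recently_used.add(key)
        self.disk_queue.append(key)
        self.replication_queue.append(key)  # not drained in this sketch
        return "ack"

    def flush(self):
        """Drain the disk queue; a real server would do this in the background."""
        while self.disk_queue:
            key = self.disk_queue.popleft()
            if key in self.dirty:
                self.disk[key] = self.ram[key]
                self.dirty.discard(key)

    def get(self, key):
        if key not in self.meta:
            return None                    # fast miss, answered from metadata alone
        if key not in self.ram:            # cache miss: copy back from disk
            self._make_room(key)
            self.ram[key] = self.disk[key]
        self.recently_used.add(key)
        return self.ram[key]

    def _make_room(self, incoming_key):
        """Eject one persisted, not recently used document if RAM is full."""
        if incoming_key in self.ram or len(self.ram) < self.ram_capacity:
            return
        clean = [k for k in self.ram if k not in self.dirty]
        if not clean:
            raise MemoryError("RAM holds only unpersisted data; retry after a flush")
        victims = [k for k in clean if k not in self.recently_used]
        if not victims:
            self.recently_used.clear()     # everything was touched: reset the bits
            victims = clean
        victim = victims[0]
        del self.ram[victim]
        self.recently_used.discard(victim)
```

Only documents that are already on disk are ever ejected, so ejection never loses data. When RAM holds nothing but unpersisted writes, this sketch rejects the new write instead of blocking, which is a simplification chosen here.
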
Editor's notes

  • Slide 2 – About Me
  • KEY POINT: COUCHBASE PROVIDES A SET OF MULTI-PURPOSE, CORE CAPABILITIES THAT SUPPORT A BROAD RANGE OF APPLICATIONS AND USE CASES, ALL IN A SINGLE DATA MANAGEMENT PLATFORM.

    Couchbase provides a set of technology capabilities to support a broad range of applications and use cases:

    High Availability Cache: Couchbase provides an integrated managed object cache, so you can start out using Couchbase as a high availability cache on top of your existing relational database. For example, you can use Couchbase as a session store in front of your relational database, if your relational DB is struggling to keep up with the load required for online interactive applications.

    Key-Value Store: Many customers start with Couchbase as a cache and then broaden their usage to other capabilities, like using Couchbase as a Key-Value Store for things like Profile Management.

    Document Database: From there, you can grow into using Couchbase as a Document Database, where you can do more with capabilities like indexing and Cross Data Center Replication.

    Embedded Database: Couchbase also provides an embedded database called Couchbase Lite. It’s a purpose-built database for the device, so you can build applications that are always available and always work, whether offline or online.

    Sync Management: Finally, as part of our solution for mobile applications, we provide Couchbase Sync Gateway, which automatically synchronizes data on the device with Couchbase Server in the cloud so your developer doesn’t have to write code to manage the complex sync process.

    Starting with cache and then expanding to other capabilities is often a good way to learn the technology and get comfortable with Couchbase for a wider set of use cases.



  • Couchbase has emerged as a leading NoSQL provider for a number of reasons:

    Best in performance and scalability
    - We’ve engineered Couchbase from the ground up for high performance and scalability
    - Couchbase is designed to deliver sub-millisecond responsiveness with very high throughput for both reads and writes
    - We consistently outperform competitors like MongoDB and DataStax in multiple independent benchmarks
    - Our performance advantage is driven in large part by our memory-centric architecture, which includes an integrated managed object cache and stream-based replication

    Broad use case support
    - We’re the only NoSQL provider that has consolidated distributed cache, key-value store, and a JSON-based document database in a single platform
    - This means customers can use Couchbase for a much broader range of applications

    Integrated mobile solution
    - We’re the only vendor that provides an end-to-end NoSQL mobile solution, which allows customers to easily build mobile apps that run great online or offline
    - Includes a JSON database embedded on the device, along with a prebuilt syncing tier
    - So apps run great on the device, even without a network connection
    - Data on the device auto-syncs with the backend server when a connection is available

    Simplified administration
    - We’ve designed Couchbase to be exceptionally easy to deploy and manage
    - Features such as an integrated Admin Console and single-click cluster expansion & rebalance dramatically increase admin efficiency




  • Each Couchbase node is exactly the same.

    All nodes are broken down into two components: A data manager (on the left) and a cluster manager (on the right). It’s important to realize that these are separate processes within the system specifically designed so that a node can continue serving its data even in the face of cluster problems like network disruption.

    The data manager is written in C and C++ and is responsible for the object caching layer, the persistence layer and the querying engine. It is based on memcached and so provides a number of benefits:
    - The very low lock contention of memcached allows for extremely high throughput and low latencies, both to a small set of documents (or just one) and across millions of documents
    - Being compatible with the memcached protocol means we are not only a drop-in replacement, but also inherit support for automatic item expiration (TTL) and atomic increment
    - We’ve increased the maximum object size to 20 MB, but still recommend keeping objects much smaller
    - Support for binary objects as well as native support for JSON documents
    - All of the metadata for the documents and their keys is kept in RAM at all times. While this does add a bit of overhead per item, it also allows for extremely fast “miss” speeds, which are critical to the operation of some applications: we don’t have to scan a disk to know when we don’t have some data.

    The cluster manager is based on Erlang/OTP which was developed by Ericsson to deal with managing hundreds or even thousands of distributed telco switches. This component is responsible for configuration, administration, process monitoring, statistics gathering and the UI and REST interface. Note that there is no data manipulation done through this interface.
  • Now, as you fill up memory (click), some data that has already been written to disk will be ejected from RAM to make room for new data. (click)

    Couchbase supports holding much more data than you have RAM available. It’s important to size the RAM capacity appropriately for your working set: the portion of data your application is working with at any given point in time and needs very low latency, high throughput access to. In some applications this is the entire data set, in others it is much smaller. As RAM fills up, we use a “not recently used” algorithm to determine the best data to be ejected from cache.
  • Should a read now come in for one of those documents that has been ejected (click), it is copied back from disk into RAM and sent back to the application. The document then remains in RAM as long as there is space and it is being accessed.
  • KEY POINTS: BIG DATA IS NOT ONE THING – IT’S A COMBINATION OF OPERATIONAL (NOSQL) AND ANALYTICAL DATABASES. YOU NEED BOTH. COUCHBASE PROVIDES THE OPERATIONAL SOLUTION.

    Big data has two major pieces: operational and analytical.

    Operational is about:
    - Real time
    - Online, interactive
    - Customer/consumer facing
    - Processing data at high velocity

    Analytical is about:
    - Offline analytics
    - Often batch oriented
    - Takes time to process
    - Directly touches relatively few users (business analysts)

    These two pieces together form “Big Data”. There’s some overlap:
    - NoSQL can deliver some analytics
    - Hadoop can deliver some operational capability

    But in general each technology is designed for a separate purpose: Couchbase fits on the operational side, Hadoop on the analytics side.
  • The data generated by users is published to Apache Kafka.
    Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop.
    Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  • The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop. It’s combined with additional data and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data as well as meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL and Hadoop to do so.
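
The last note describes a flow in which every event is stored raw, while a filtered and enriched copy goes to a separate store for the front end. The following Python sketch shows only that shape. Plain dictionaries stand in for the two Couchbase Server clusters, and the filter rule, the enrichment field, and all function names are hypothetical; none of this reflects the actual Storm topology.

```python
from collections import Counter


def keep(event):
    """Filter step (hypothetical rule): drop events that carry no user id."""
    return event.get("user") is not None


def enrich(event):
    """Enrichment step (hypothetical): add a derived field to a copy of the event."""
    enriched = dict(event)
    enriched["is_mobile"] = event.get("agent", "").startswith("Mobile")
    return enriched


def run_pipeline(events):
    """Store every event raw; store filtered, enriched events separately."""
    raw_store = {}        # stand-in for the cluster holding raw data
    processed_store = {}  # stand-in for the cluster holding processed data
    metrics = Counter()   # stand-in for the statistical analysis step
    for i, event in enumerate(events):
        raw_store[f"raw::{i}"] = event
        if not keep(event):
            continue
        processed_store[f"evt::{i}"] = enrich(event)
        metrics[event.get("action", "unknown")] += 1
    return raw_store, processed_store, metrics


events = [
    {"user": "u1", "action": "click", "agent": "Mobile Safari"},
    {"user": None, "action": "click", "agent": "curl"},
    {"user": "u2", "action": "view", "agent": "Firefox"},
]
raw, processed, metrics = run_pipeline(events)
```

Keeping the raw store separate means the unfiltered events remain available for a later bulk copy to the analytics side, while the front end reads only the processed store.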
