Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Operational Database With Spark, Matt Ingenthron, Senior Director, Engineering, Couchbase
Spark and Couchbase can be integrated in several ways to augment operational databases with analytics. Key integration points include automatic data sharding and locality, data streaming for replication and Spark Streaming, predicate pushdown using Couchbase global indexes, and flexible schemas. The connector allows accessing Couchbase data in Spark for use cases like recommendations, fraud detection, and machine learning model serving.
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Â
Ăhnlich wie Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Operational Database With Spark, Matt Ingenthron, Senior Director, Engineering, Couchbase
Ăhnlich wie Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Operational Database With Spark, Matt Ingenthron, Senior Director, Engineering, Couchbase (20)
Automating Google Workspace (GWS) & more with Apps Script
Â
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Operational Database With Spark, Matt Ingenthron, Senior Director, Engineering, Couchbase
2. Agenda
Why integrate Spark and NoSQL?
Architectural alignment
Integration âPoints of Interestâ
ï§ Automatic sharding and data locality
ï§ Streams: Data Replication and Spark Streaming
ï§ Predicate pushdown and global indexing
ï§ Flexible schemas and schema inference
See it in action
4. NoSQL + Spark use cases
Operations Analysis
NoSQL
ï§ Recommendations
ï§ Next gen data warehousing
ï§ Predictive analytics
ï§ Fraud detection
ï§ Catalog
ï§ Customer 360 + IOT
ï§ Personalization
ï§ Mobile applications
5. Big Data at a Glance
Couchbase Spark Hadoop
Use cases
âą Operational
âą Web / Mobile
âą Analytics
âą Machine
Learning
âą Analytics
âą Machine
Learning
Processing
mode
âą Online
âą Ad Hoc
âą Ad Hoc
âą Batch
âą Streaming (+/-)
âą Batch
âą Ad Hoc (+/-)
Low latency = < 1 ms ops Seconds Minutes
Performance Highly predictable Variable Variable
Users are
typicallyâŠ
Millions of
customers
100âs of analysts or
data scientists
100âs of analysts or
data scientists
Memory-centric Memory-centric Disk-centric
Big data = 10s of Terabytes Petabytes Petabytes
ANALYTICALOPERATIONAL
6. Use Case: Operationalize Analytics / ML
Hadoop
Examples: recommend content and products, spot fraud or spam
Data scientists train machine learning models
Load results into Couchbase so end users can interact with them online
Machine
Learning
Models
Data
Warehouse
Historical Data
NoSQL
7. Use Case: Operationalize ML
Model
NoSQL
node(e.g.) Training Data
(Observations)
Serving
Predictions
8. Spark connects to everythingâŠ
DCP
KV
N1QL
Views
Adapted from: Databricks â Not Your Fatherâs Database https://www.brighttalk.com/webcast/12891/196891
9. Use Case #2: Data Integration
RDBMSs3hdfs
Elasticsearc
h
Data engineers query data in many systems w/ one language &
runtime
Store results where needed for further use
Late binding of schemas
NoSQL
11. Full Text Search
Search for and fetch
the most relevant
records given a
freeform text string
Key-Value
Directly fetch /
store a particular
record
Query
Specify a set of criteria
to retrieve relevant data
records.
Essential in reporting.
Map-Reduce
Views
Maintain materialized
indexes of data
records, with reduce
functions for
aggregation
Data Streaming
Efficiently, quickly
stream data records to
external systems for
further processing or
integration
12. Hash Partitioned Data
Auto Sharding â Bucket And vBuckets
A bucket is a logical, unique key space
Each bucket has active & replica data sets
ï§ Each data set has 1024 virtual buckets
(vBuckets)
ï§ Each vBucket contains 1/1024th of the data
set
ï§ vBuckets have no fixed physical server
location
Mapping of vBuckets to physical servers is
called the cluster map
Document IDs (keys) always get hashed to
the same vBucket
Couchbase SDKâs lookup the vBucket ï
server mapping
13. N1QL Query
N1QL, pronounced ânickelâ, is a SQL service with extensions specifically for
JSON
ï§ Is stateless execution, howeverâŠ
ï§ Uses Couchbaseâs Global Secondary Indexes.
ï§ These are sorted structures, range partitioned.
ï§ Both can run on any nodes within the cluster. Nodes with differing
services can be added and removed as needed.
14. MapReduce Couchbase Views
A JavaScript based, incremental Map-Reduce
service for incrementally building sorted B+Trees.
ï§ Runs on every node, local to the data on that node, stored locally.
ï§ Automatically merge-sorted at query time.
15. Data Streaming with DCP
A general data streaming service, Database Change Protocol.
ï§ Allows for streaming all data out and continuing, orâŠ
ï§ Stream just what is coming in at the time of connection, orâŠ
ï§ Stream everything out for transfer/takeoverâŠ
17. Key-Value
Direct fetching/storing
of a particular record.
Query
Specifying a set of
criteria to retrieve
relevant data records.
Essential in reporting.
Map-Reduce
Views
Maintain materialized
indexes of data
records, with reduce
functions for
aggregation.
Data Streaming
Efficiently, quickly
stream data records to
external systems for
further processing or
integration.
Full Text Search
Search for, and allow
tuning of the system to
fetch the most relevant
records given a
freeform search string.
19. What happens in Spark Couchbase KV
When 1 Spark node per CB node, the connector will use the cluster map and
push down location hints
ï§ Helpful for situations where processing is intense, like transformation
ï§ Uses pipeline IO optimization
However, not available for N1QL or Views
ï§ Round robin - canât give location hints
ï§ Back end is scatter gather with 1 node responding
21. SparkSQL on
N1QL with Global Secondary Indexes
TableScan
Scan all of the data and return it
PrunedScan
Scan an index that matches only relevant
data to the query at hand.
PrunedFilteredScan
Scan an index that matches only relevant
data to the query at hand.
25. Predicate pushdown
Notes from implementing:
Spark assumes itâs getting
all the data, applies the predicates
Future potential optimizations
Push down all the things!
ï§ Aggregations
ï§ JOINs
Looking at Catalyst engine extensions from SAP
ï§ But, itâs not backward compatible andâŠ
ï§ âŠmany data sources can only push down filters
image courtesy http://allthefreethings.com/about/
27. DCP and Spark Streaming
Many system architectures rely upon streaming from the âoperationalâ data
store to other systems
ï§ Lambda architecture => store everything and
process/reprocess everything based on access
ï§ Command Query Responsibility Segregation - (CQRS)
ï§ Other reactive pattern derived systems and frameworks
28. Documents flow into the
system from outside
Documents are then
streamed down to consumers
In most common cases, flows
memory to memory
DCP and Spark Streaming
Couchbase Node
DCP
Consumer
Other Cluster
Nodes
Spark Cluster
34. Deployment Topology
Couchbase
Spark
Worker
Couchbase
Couchbase
ï§ Many small gets
ï§ Streaming with low
mutation rate
ï§ Ad hoc
Couchbase
Data Node
Data Node
Spark
Worker
Couchbase
Couchbase
ï§ Medium processing
ï§ Predictable
workloads
ï§ Plenty of overhead
on machines
Couchbase
Couchbase
Couchbase
Couchbase
Couchbase
Couchbase
Data Node
Data Node
Spark
Worker
XDCR
ï§ Heaviest processing
ï§ Workload isolation
writes
Hinweis der Redaktion
CHECK THE ORDER
Analysis side â includes various types of machine learning and analytics
Often the data warehousing includes many different sources of data
This is a popular use case for Spark because it provides the ability to go big and serve your predictions to a lot of users
Generally speaking, thereâs a thin layer like node.js that gets JSON from the NoSQL system and feeds it to the user.
Imagine that user is doing something like shopping or listening to music. As they use the App, you need ridiculously low latency because they system has to do something based on that data. But you also want to send that userâs actions back to the Spark system or wherever .
Couchbase advantages include:
Fast: memory-centric, integrated cache, implicit batching from the SDK with async & flat map
Dev Convenience: Native SDKs, automatic cluster management, code your app without reference to infrastructure
Sophisticated: Query using SQL for JSON (N1QL), supports JOINs
This is like federation
TODO: Come up with a better title here.
Goal is to lay out the services that are part of Couchbase and then talk about how they fit in with a spark deployment.
Add lots of notes
TODO: Merge ideas from the next slide (hidden)
TODO: fix the colors
On a given Spark worker node, we can optimize
Internal to the Couchbase JVM Core, we pipeline operations
Amortize the responses over many operations, meaning no cost for successful operations
Efficient scheduling
Future: use the union or differences of indexes for filtering down candidates.
More powerful than traditional relational databases owing to indexing architecture.
Surprise! We can push down predicates with an O(log n) lookup
2i â super awesome, thatâs how awesome we are
Relational DBs do this, and theyâre not expecting this to be good. Instead, go to every node and go full retard
Surprise! We can push down predicates with an O(log n) lookup
2i â super awesome, thatâs how awesome we are
Relational DBs do this, and theyâre not expecting this to be good. Instead, go to every node and go full retard
Spark always applies the predicates
Spark expects it will always get the data and apply the filters
If youâre building a similar system, turn off the flag so Spark doesnât do needlessly re-apply the filters
Optional but cool
PERFORMANCE IMPLICATIONS!!! OMG
KAFKA is most common
Matt could show his mad science demo
Limitation - Only backfill works, canât do from point in time. This is a bug.
Canât currently shard across spark workers since thereâs no way to see this topology
Fully transparent cluster and bucket management, including direct access if needed