REMINDER
Check in on the COLLABORATE
mobile app
Oracle vs. NoSQL
The good, the bad and the ugly
John Kanagaraj
Member of Technical Staff,
PayPal Database Engineering,
An eBay Inc. company
Housekeeping
■  Check the font sizes
▪  Can you read this at the back of the room?
▪  Can you read this at the back of the room?
▪  Just kidding!
■  Silence your Phones!
■  Q & A : Ask as we go along (and I will repeat the question)
▪  Keep it relevant to the slide at hand
▪  I might defer the question to a later slide if I believe it is
addressed later
▪  If it gets too long, I humbly request we deal with it after the
break or after the session
■  It is a long day, so if you nod off it is ok (hopefully no snoring!)
Agenda
■  Big Data – What it is, why should we care
■  NoSQL – What it is, and why do we need it
■  Concepts you need to understand
▪  CAP Theorem (and why it is important)
▪  Unstructured Data
▪  Sharding and Replication
▪  Data Modeling in the brave new world of NoSQL
■  Introduction to some popular NoSQL stores
■  A look into the (immediate) future: Moving forward
Not on the Agenda
■  Not a Tutorial on various NoSQL datastores
■  Not an Installation Guide
■  Not an Administration Manual
■  If you already know the CAP Theorem and NoSQL:
▪  I will be covering the basics (so you know!)
▪  We are all here to share and learn: Maybe I can learn from your
questions/inputs (time and context permitting)
▪  Let’s talk after the talk (or during the break)
Speaker Qualifications
■  Currently Database Engineer @ PayPal
■  Has been working with Oracle Databases and
UNIX for too many years ☺
■  Author and Technical editor
■  Frequent speaker at OOW, IOUG
COLLABORATE and regional OUGs
■  Oracle ACE
■  Contributing Editor, IOUG SELECT Journal
■  Loves to mentor new speakers and authors!
■  http://www.linkedin.com/in/johnkanagaraj
Big Data
Big Data – The Why
■  2.5 quintillion bytes of data are generated every day
▪  (1 quintillion = 10^18 bytes): so that is ~2.3 billion GB
▪  Humans (using devices) as well as Machines (IoT)
—  Location data emitted by your smart phone
—  “Web-scale” Webserver logs and interactions
—  Sensor data emitted by almost every networked device: E.g.
Cars’ fuel/pressure gauges, Personal fitness devices (wearables)
—  Multi-media sources: Security cameras, Face/Plate recognition
—  Data that matters to you: Medical, Scientific, Weather
▪  Lots of value in this data, but mostly untapped
▪  Most of this is never stored: Too big to store, but not too big
to understand ☺
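As a quick sanity check on the scale, the daily byte count can be converted to gigabytes. This sketch assumes a binary gigabyte (2^30 bytes); with decimal gigabytes (10^9 bytes) the result is 2.5 billion GB.

```python
# Convert the daily data volume quoted on the slide into gigabytes.
BYTES_PER_DAY = 2.5 * 10**18   # 2.5 quintillion bytes
GIB = 2**30                    # assumption: binary gigabyte

gb_per_day = BYTES_PER_DAY / GIB
print(f"{gb_per_day / 1e9:.2f} billion GB per day")
```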
Big Data – The Why
■  Plummeting cost of technology
▪  Storage Cost/GB – 1980 : $437,500, 2013 : $0.05
▪  Computing Cost – Moore’s law
▪  Network transportation Cost – WiFi, BLE, etc.
■  What is driving this?
▪  Cheaper to store data than to delete/ignore it
▪  Minimal cost to generate, transport and store
▪  Ubiquity of network, storage and data generation
▪  Accelerating advances in science and technology
▪  Machine learning and intelligence is growing
Source for storage cost: http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
Big Data – The Why
Infographic: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data Characteristics: 4 V’s + 1
■  Volume – Scale at which data is generated
▪  Cannot be stored using traditional methods
▪  Cannot be stored in a monolithic store
■  Variety – Different forms of data
▪  Big Data is usually not structured; structure not known in
advance; structure not controlled by consumer
▪  May not always be in text form (more than just binary)
■  Velocity – Data arrives in a continuous stream
▪  Multiple, varied sources produce data continuously
▪  Peaks and bursts unpredictable
▪  “Always on”: No down time for maintenance or re-orgs
▪  No “Known Users” – unpredictable, unknown patterns/scale
Big Data Characteristics: 4 V’s + 1
■  Veracity – Uncertainty: Data is not always accurate
▪  Multiplicity of sources creates convergence of truth
▪  Eventual consistency (versus immediate consistency)
■  Value – Immediacy and hidden relationships
▪  In many use cases, value of Big Data declines quickly
—  Traffic reports do not matter after 30 minutes
—  Routing resupply trucks is counterproductive after the fact
—  However, some historical value may be derived post the event
▪  Concept of “Near Line” data (neither fully online nor offline)
▪  Easy to miss hidden relationships
—  Most data sets are correlated to other data sets, implicitly or explicitly
—  Not easy to detect due to volume and variety
—  Mine data using various techniques (Data Science)
So how do we store this storm?
■  Big Data impossible to store using RDBMS
▪  Too big, too fast for RDBMS to ingest
▪  RDBMS needs “schema before write”
▪  Unknown structures = “schema during read”
■  So what is limiting RDBMS?
▪  ACID requirement drives “protection” mechanism
▪  Redo and Undo in Oracle provides ACID
▪  “Relational” imposes “schema before write”
▪  Easy to get “small bits”; hard to get “large pieces”
So how do we store this storm?
■  RDBMSs are essentially ACID
▪  Atomic: Transactions fully succeed or fully fail
▪  Consistent: Transactions move the database from one
consistent state to another
▪  Isolated: Transactions cannot interfere with each other
▪  Durable: Committed transactions persist even during failure
■  RDBMS Clusters = “Shared everything” for ACID
■  Atomicity in a distributed database: Two Phase commit
▪  Essential for splitting workload
▪  Reduction in availability though!
■  New concept! BASE (Basically Available, Soft state, Eventual
Consistency)
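The availability cost of two-phase commit can be illustrated with a toy coordinator. This is a minimal sketch, not how any particular database implements it; `Participant` and `two_phase_commit` are hypothetical names used only for illustration.

```python
class Participant:
    """A toy resource manager taking part in a distributed transaction."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if this participant can be reached.
        if self.reachable:
            self.state = "prepared"
        return self.reachable

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Commit only if every participant votes yes in the prepare phase."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    # Any "no" vote (or unreachable node) aborts the whole transaction.
    for p in participants:
        if p.reachable:
            p.rollback()
    return False


ok = two_phase_commit([Participant("db1"), Participant("db2")])
failed = two_phase_commit([Participant("db1"), Participant("db2", reachable=False)])
print(ok, failed)
```

One unreachable participant is enough to block the commit for everyone, which is the reduction in availability the slide points at.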
Confidential and Proprietary
A common scalability inhibitor
■  Heap table with one or more “right growing” indexes
−  Primary Key: Unique index on a NUMBER column
−  Key value generated from an Oracle Sequence (INCREMENT BY 1)
−  I.e. a “monotonically” increasing ID value
−  High rate of insert (> 5000 inserts/second) from multiple sessions
−  Multiple indexes, typically leading with date/time series or mono-valued columns
−  E.g. Oracle E-Business Suite’s FND_CONCURRENT_REQUESTS
■  Here’s the Problem:
−  All INSERTing sessions need one particular index block in CURRent
mode (as well as one particular data block in CURRent mode)
−  Question: Would you use RAC to scale out this particular workload?
A quick deep dive
■  Here’s what happens to accommodate the INSERT
−  Assume the current value of the PK is 100, and the sequence increments by 1
−  Assume we have ‘N’ sessions simultaneously inserting into that table
−  Session 1 needs to update the Index block (add the Index entry for 100)
−  Session 2 wants the same block in CURRent mode (add another entry for
101; needs the same block because the entry fits in the same block)
−  Sessions 3… N also want the same block in CURRent mode at the very
same time (as all sessions will have “nearby” values for the index entry)
−  Block level pins/unpins (+ lots of other work – Redo/Undo) required….
−  Same memory location (SGA buffer for Index block) accessed
−  Smaller but still impactful work for the buffer holding the Data block
−  Rate of work constrained by CPU speed and RAM access speeds
A quick deep dive
■  What if you use RAC to “scale out” this workload?
−  Assume “N” sessions simultaneously inserting from 2 RAC nodes (2xN)
−  In addition to the previously described work, you need to
−  Obtain the Index block from the remote node in CURRent mode
−  Session 1 (Node 1) updates the Index block with value 100
−  Session 2 (Node 2) requests the block in CURRent mode (value 101)
−  LMS processes on both nodes churn CPU co-ordinating messages and
block transfers back and forth on the interconnect
−  Flush redo changes to disk on Node 1 before shipping the CURRent block
to Node 2 (gated by the redo writer (LGWR) response!!!)
−  Sessions block on “gc current <state>” waits during this process
−  CPU, Redo IO, Interconnect, LMS/LMD processes involved
A quick deep dive
■  Some solutions
−  Spread the pain for the right growing index
−  Use Reverse Key Indexes (cons: Range scan not possible)
−  Use Hash partitioned indexes (cons: All partitions probed for Range
scan, Need Partitioning Option, Additional administration)
−  Prefix RAC node # (or some identifier per node) to key
−  Use a modified key: Use Java UUID, Other distinct prefix/suffixes
−  Use Range-Hash Partitioned tables with Time based ID as key
−  E.g. Epoch Time (# of seconds from Jan 1, 1970) + Sequence value for
lower bits
−  Enables Date/Time based partitioning key
−  Unique values allow Local Index to be unique
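The “epoch time plus sequence value in the lower bits” key can be sketched as follows. The 20-bit sequence width and the function names `make_id` and `split_id` are assumptions for illustration; the slides do not specify a bit layout.

```python
import time

SEQ_BITS = 20                      # assumption: low 20 bits hold the sequence value
SEQ_MASK = (1 << SEQ_BITS) - 1


def make_id(epoch_seconds, seq):
    """Pack epoch seconds into the high bits and a sequence value into the low bits."""
    return (epoch_seconds << SEQ_BITS) | (seq & SEQ_MASK)


def split_id(key):
    """Recover (epoch_seconds, seq) from a packed key."""
    return key >> SEQ_BITS, key & SEQ_MASK


now = int(time.time())
key = make_id(now, 42)
print(key, split_id(key))
```

Because the time sits in the high bits, keys sort by creation time, so a date/time partition can be derived from the key itself while the sequence bits keep values unique within a second.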
Relaxing ACID – Skip the Redo/Undo ☺
■  BASE Model
▪  “In partitioned databases, trading some consistency for
availability can lead to dramatic improvements in scalability”
▪  Proposed by Dan Pritchett (eBay) in 2008
▪  ACID is pessimistic; enforces consistency at the end of a
transaction
▪  BASE is optimistic; accepts eventual consistency
▪  Supports partial failure without total failure
■  Enabled new paradigms
▪  New patterns for distributing workload emerge
—  Sharding and Replication
—  Less than perfect (but good enough) consistency
A New Beginning - NoSQL
■  A new dawn emerges…
▪  Brewer proposes CAP theorem (2000)
▪  Google creates BigTable (~ 2006)
▪  Amazon creates Dynamo (~ 2007)
▪  eBay shards over Oracle Databases (2008)
▪  Inspires a new set of alternate data storage
projects
▪  NoSQL databases start appearing…
(~2008 – 2010)
▪  Becomes a buzz word (~ 2011 – 2013)
■  Now we all want “in”…
■  Picture courtesy Kamran Agayev via Twitter
So What is NoSQL?
■  NoSQL – supposed to be “No SQL”, but it is NOT
■  NoSQL – Loosely it is “Not Only SQL” (i.e. NOSQL)
▪  Term coined by Eric Evans (developer at Rackspace)
▪  Adopted by Johan Oskarsson (another developer)
▪  For a meetup of like minds in SF, 2009
▪  Meetup for “open-source, distributed, nonrelational
databases” [Voldemort, Cassandra, CouchDB, MongoDB, etc.]
■  NoSQL does not mean there is no “SQL-Like” interface
▪  Cassandra supports CQL (Cassandra Query Language)
■  NoSQL does NOT always mean Big Data
▪  But Big Data stores are almost always NoSQL based
▪  That is, if you count Hadoop as a NoSQL datastore *
* See: http://wiki.apache.org/hadoop/HadoopIsNot
A small diversion: The Hadoop ecosystem
■  Let’s understand Hadoop vs. the Rest
■  Hadoop – The real Big Data Store
▪  Real Big platform to store data
▪  Store almost anything and everything
▪  Key components of Hadoop:
—  HDFS: A unified file system that combines all storage in the cluster
—  MapReduce: A programming model to handle large data sets
—  An extensible ecosystem: Other components to control, schedule and
manage processing and the cluster
▪  Is NOT a database (although there is HBase)….
▪  But supports SQL-like interface using Hive
▪  Not really meant for Online, Web-site facing implementation
A small diversion: The Hadoop ecosystem
Big Data / NoSQL Landscape
From http://www.bigdata-startups.com/open-source-tools/
Why NoSQL?
■  Impedance Mismatch
▪  Real world data does not naturally possess structure
▪  A “Person” has many variable characteristics
▪  Applications deal with a “person” object
▪  This is then a set of In-memory structures
▪  Relational Databases require structured table/columns
though….
▪  Thus, an “impedance mismatch” between Dev and DBA
▪  Which ORM’s try to bridge (the gap between Dev and DBA)
—  Cultural mismatch: “Agile” (Dev) seems to be “Fragile” (for a DBA)
—  Technical mismatch: “Objects” to “Relational Tables”
—  Storage structure mismatch: “Un-/Semi-structured” to “Structured”
Why NoSQL?
■  Rapid “web-scale” growth for external entities/users
▪  Ability to support viral/burst traffic patterns
■  Most data does not (usually) need immediate consistency
▪  It is OK to lose some data; it is OK not to have ACID
■  Commodity hardware and the Cloud
▪  RDBMS’ don’t run well on clusters (apologies: RAC world)
—  Shared Disk clusters are both a SPOF and expensive!
—  License costs for RDBMS on clusters
—  Failure of one component brings everything down
▪  Clustering cheaper commodity hardware is economical
—  Single or even a small number of failures affect a portion of
workload, not the whole application (due to sharding)
▪  Easier to create a “cloud” with commodity hardware
Why NoSQL?
■  Open patterns
▪  Almost all NoSQL products are open-source
▪  Relatively open learning
—  Meetups; Open seminars run by vendors
—  Lively blogs and passionate contributors
▪  Quick-and-easy installs
▪  Community versions from vendors
▪  Easy to install on for-rent cloud environments
▪  Monitoring/Alerting through open frameworks (Nagios, Ganglia)
■  Enterprise support through vendor
▪  10gen for MongoDB; DataStax for Cassandra; CouchBase
▪  Cloudera, Hortonworks, MapR for Hadoop
■  Large Webscale companies building own NoSQL databases
NoSQL Characteristics
■  “Schema before write” vs. “Schema during read”
▪  Caters to “unstructured” need
▪  Primarily solves Impedance mismatch
▪  Creates its own challenges
■  Modeled by read and write patterns
▪  “customer and orders” together for a customer centric view
▪  “product and orders” for a production/supply-chain centric view
▪  Alternative: Store twice
■  Data modeling driven by physical storage model
■  Read patterns
▪  Secondary indexing (overheads)
▪  Brute-force access via MapReduce jobs
▪  Store multiple, denormalized copies (“disk is cheap”)
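The “store twice” idea can be sketched with two aggregates built from the same orders, one per read pattern. The sample data and variable names are made up for illustration.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "customer": "alice", "product": "tires"},
    {"order_id": 2, "customer": "bob", "product": "pump"},
    {"order_id": 3, "customer": "alice", "product": "pump"},
]

# Two denormalized copies of the same data, each keyed for one read pattern.
orders_by_customer = defaultdict(list)
orders_by_product = defaultdict(list)
for order in orders:
    orders_by_customer[order["customer"]].append(order)
    orders_by_product[order["product"]].append(order)

alice_orders = [o["order_id"] for o in orders_by_customer["alice"]]
pump_orders = [o["order_id"] for o in orders_by_product["pump"]]
print(alice_orders, pump_orders)
```

Each view answers its query with a single key lookup, at the price of writing every order twice.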
NoSQL Characteristics
■  ACID is “relaxed”
▪  A transaction is limited to an aggregate (k-v pair)
▪  Enables distributed, shared-nothing architectures
▪  Ideal for clustered deployments
▪  Optimistic locking
▪  Some loss of data and consistency is expected (and catered to)
■  Write patterns
▪  UPDATEs converted to INSERTs (timestamped/tombstoned)
▪  Time-To-Live (TTL) based DELETE’s/Purges
▪  Compaction based garbage collection
▪  Reduced Write latency due to memory only writes
▪  Transaction logging supported in some NoSQL stores
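The write patterns above (updates as inserts, tombstones, TTL and compaction) can be sketched with a toy append-only store. This is an illustrative model only, not the storage engine of any particular product; the class and method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
import time


class AppendOnlyStore:
    """Toy log-structured store: every write is an insert; reads pick the newest."""

    TOMBSTONE = object()

    def __init__(self):
        self.log = []  # entries are (key, timestamp, value, expires_at)

    def put(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires_at = None if ttl is None else now + ttl
        self.log.append((key, now, value, expires_at))

    def delete(self, key, now=None):
        # A delete is just another insert that marks the key as dead.
        self.put(key, self.TOMBSTONE, now=now)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        versions = [e for e in self.log if e[0] == key]
        if not versions:
            return None
        # reversed() makes the most recently appended entry win timestamp ties.
        _, _, value, expires_at = max(reversed(versions), key=lambda e: e[1])
        if value is self.TOMBSTONE or (expires_at is not None and now >= expires_at):
            return None
        return value

    def compact(self, now=None):
        """Garbage-collect: keep only the newest live version of each key."""
        now = time.time() if now is None else now
        newest = {}
        for entry in self.log:
            if entry[0] not in newest or entry[1] >= newest[entry[0]][1]:
                newest[entry[0]] = entry
        self.log = [e for e in newest.values()
                    if e[2] is not self.TOMBSTONE and (e[3] is None or now < e[3])]


store = AppendOnlyStore()
store.put("k1", "v1", now=1)
store.put("k1", "v2", now=2)            # an "update" is appended as a new version
store.put("k2", "temp", ttl=10, now=2)  # expires at time 12
store.delete("k1", now=3)               # tombstone
store.put("k3", "keep", now=3)

before = len(store.log)
store.compact(now=20)
after = len(store.log)
print(before, after)
```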
Why use an RDBMS then?!
■  ACID may be a hard business requirement
▪  Data loss can never be tolerated
▪  Data inconsistency can never be tolerated (e.g. Money
movement)
■  Complex data models favor RDBMS
▪  Try modeling Oracle EBS in NoSQL J
■  Standardized interface via SQL
▪  Broadly same across all RDBMS
▪  Well understood, skills availability
■  Inter-application integration
▪  Single platform for data created its own ecosystem
■  Cost to change is prohibitive
Introducing the CAP Theorem
■  Eric Brewer’s conjecture at the July 2000 ACM Symposium on Principles of Distributed Computing (PODC)
■  Formalized by Seth Gilbert and Nancy Lynch in 2002
■  Any networked shared-data system can have at most two of
three desirable properties:
▪  At least one Consistent (C) up-to-date copy of the data
▪  high Availability (A) of that data (for both reads and updates)
▪  tolerance to network Partitions (P)
■  Core systemic requirements in a distributed environment
▪  Special symbiotic relationship
▪  Present during design and deployment of applications in a
distributed environment (whether acknowledged or not)
■  Applies well to the distributed NoSQL world
Components of the CAP Theorem
■  (C)onsistency
▪  All clients see the same results from a query, even in the
presence of an update at the same time as the query
■  High (A)vailability
▪  All clients can write or access data, even in the presence of
system failures. Requestors receive acknowledgment of
success or failure
▪  Performance may degrade, but consuming applications are able
to access data even though some parts of the system may not
be operational at the time of a query
■  (P)artition Tolerance
▪  The system returns results regardless of failures in
communication between partitions in the distributed system; i.e.
system property holds true even if there is a network partition
General CAP Theorem
Illustrating the CAP Theorem (adapted)
■  You start a small business: Provide phone reminders/information
■  Customers call with information; You call back/respond to remind
■  Start small: All information written down in your (single) notebook
■  Business grows: Wife is recruited (scale out, PBX shards calls)
■  Inconsistency: Response misses info updated in Wife’s notebook
■  Resolve inconsistency: All notebooks updated when call ends (lock)
■  Wife’s day off: You leave sticky notes (Inconsistent until next day)
■  Wife fights with you: Network Partition (sticky notes thrown away)
■  You have a choice here: CAP Theorem in play – Pick two
▪  (C) Always provide consistent information to clients
▪  (A) Business is always open if at least one of you is present
▪  (P) Business is open even during a loss of communication between the two of you
■  Run around clerk: Eventual consistency and Compaction
Examples of CAP Theorem pairs
■  Consistency and Partition Tolerance (CP): Banking Transaction at an ATM
▪  Data needs to be consistent in the presence of updates
▪  If there is a network failure, dispense cash but limit the transaction amount
▪  Transaction still available, but system property changed due to network partition
■  Consistency and Availability (CA): Database System-of-Record
▪  Data Consistency is key
▪  During a network failure, clients stop writing (no redo), no write availability
▪  Present in Oracle Data Guard’s Maximum protection mode/Single node DB
■  Availability and Partition Tolerance (AP): Shopping cart in Amazon.com
▪  Spread data across multiple partitions to be always available
▪  Reconcile cart at checkout (may result in dual purchases!)
▪  Sacrifices consistency, but works for most cases, most of the time
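The difference between the CP and AP choices during a partition can be sketched with a toy two-replica store. This is a deliberately simplified model under stated assumptions (two replicas, writes only, no reconciliation logic); `TwoNodeStore` is a hypothetical name.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store that behaves as 'CP' or 'AP' during a partition."""

    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally and reconcile later.
            self.a.data[key] = value
            return
        self.a.data[key] = value
        self.b.data[key] = value

    def consistent(self):
        return self.a.data == self.b.data


ap = TwoNodeStore("AP")
ap.partitioned = True
ap.write("cart", ["book"])

cp = TwoNodeStore("CP")
cp.partitioned = True
try:
    cp.write("balance", 100)
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

print(ap.consistent(), cp_accepted, cp.consistent())
```

The AP store stays writable but its replicas disagree until reconciled; the CP store stays consistent by turning the write away.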
CAP Theorem in the Oracle World
■  Application Scalability: Some well-known techniques
▪  Partition workload by function
—  Schema level split: data unrelated to each other is segregated
—  Typically provides headroom for main workload/environment
▪  Distribute transactions
—  For related data that still needs to be viewed together
—  Typically using Database links
—  Typically for master lookups and remote writes
—  Introduces dependencies (more on that soon)
▪  Decouple work asynchronously
—  Use AQ to write tokens or keys to process later
—  Introduces a “delay”: Data not immediately consistent
CAP Theorem in the Oracle World
■  Application Scalability: Some well-known techniques
▪  Offload reads using Active Data Guard (DB 11g and above)
▪  DG copy opened for reads during Real Time Apply
▪  DG allows Redo Data shipping in 3 modes
—  Maximum Protection: Zero loss but dependent on remote redo write
—  Maximum Performance: Remote redo written asynchronously
—  Maximum Availability: Switches to Max Performance mode on remote
redo write failure, operates in Max protection mode otherwise
▪  Offers multiple shades of availability and protection
▪  ADG and “read your writes” pattern
—  Real Time Apply is not equal to “instant” apply
—  Not “immediately consistent” but “eventually consistent”
CAP Theorem in the NoSQL World
■  Realization of CAP enabled NoSQL to “break free”
▪  Opened minds of database developers
■  However, the “2 of 3” rule was somewhat misleading
▪  NoSQL datastores offer options to vary consistency/durability
and availability levels
▪  MongoDB has “Write Concern” – Unacknowledged,
Acknowledged, Journaled, Replica Acknowledged
▪  Cassandra has Write Consistency: From ANY to ALL
■  Reality is a spectrum between C and A in the presence of P
▪  Eventual Consistency is a given
▪  Some data loss is expected
▪  Application code/other techniques will need to cater for this
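The “from ANY to ALL” spectrum can be made concrete by counting how many replica acknowledgements a write waits for. This is a sketch: `required_acks` is a hypothetical helper, the quorum formula (replication factor // 2 + 1) follows Cassandra's documented definition, and ANY is counted as one acknowledgement even though it can be satisfied by a stored hint.

```python
def required_acks(level, replication_factor):
    """Replica acknowledgements a write waits for at a given consistency level."""
    levels = {
        "ANY": 1,   # any node, even one that only stores a hint
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[level]


for level in ("ANY", "ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
```

Fewer acknowledgements favour availability and latency; more acknowledgements favour consistency.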
Sharding and Replication in NoSQL
■  NoSQL datastores: essentially shared-nothing clusters
■  Relaxing ACID allows distributed processing (CAP applies!)
■  Ability to scale out reads/writes is the key
■  Achieved using two techniques: Sharding and Replication
■  Sharding: Divide and Rule
▪  Data is read/written to different servers (“shards”)
▪  Location determined by applying a fixed function on a known key
▪  Different functions: Modulo, Hash, Range, Programmatic
▪  Efficacy of load balancing dependent on function and data
▪  Typically used for Write-scaling (more than Read-scaling)
▪  (Hash partitioned tables/indexes are essentially object level
sharding in Oracle databases to enable write scaling)
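Three of the sharding functions named above (modulo, hash, range) can be sketched as follows. The shard count, split points and function names are assumptions chosen for illustration.

```python
import bisect
import hashlib

SHARDS = 4


def modulo_shard(user_id):
    """Modulo sharding on a numeric key."""
    return user_id % SHARDS


def hash_shard(key):
    """Hash sharding with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARDS


SPLIT_POINTS = ["g", "n", "t"]   # hypothetical ranges: a-f, g-m, n-s, t-z


def range_shard(name):
    """Range sharding on the first letter of the key."""
    return bisect.bisect_right(SPLIT_POINTS, name[0].lower())


print(modulo_shard(1001), hash_shard("user:42"), range_shard("alice"), range_shard("zoe"))
```

Range sharding keeps neighbouring keys together (good for range queries, prone to hot spots), while hash sharding spreads them evenly at the cost of scattering ranges.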
Sharding and Replication in NoSQL
■  Sharding (contd.)
▪  Difficult, if not impossible to change function once implemented
▪  No consistency across shards, or across aggregates
▪  No joins allowed – no cross-shard dependencies
▪  Resilience does not improve (but enables partial availability)
▪  Not to be implemented lightly: Start single if you can
▪  Many NoSQL stores allow auto-sharding (e.g. CouchBase)
■  Replication: Allow multiple copies
▪  Master-Slave model: Simplest, Scales out reads only; Read
resilience; May need to cater for eventual consistency
▪  Peer-to-Peer or Multi-Master model: Scales out reads and
writes, but consistency/conflict resolution is a big problem
■  Can combine Sharding and Replication!
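Why a sharding function is hard to change once implemented can be shown with simple modulo sharding: adding one shard relocates most keys. A minimal sketch, assuming integer keys 0 to 999 and a move from 4 to 5 shards; `moved_keys` is a hypothetical helper.

```python
def moved_keys(keys, old_shards, new_shards):
    """Count keys whose shard changes when the modulo divisor changes."""
    return sum(1 for k in keys if k % old_shards != k % new_shards)


keys = range(1000)
moved = moved_keys(keys, old_shards=4, new_shards=5)
print(f"{moved} of {len(keys)} keys move")   # 800 of 1000 keys move
```

Only keys whose remainder is the same under both divisors stay put, which is 4 out of every 20 here, so 80% of the data has to be relocated.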
The NoSQL Datastore Landscape
■  Generally four types:
▪  Key-Value
▪  Document
▪  Column Family
▪  Graph
■  Not using the relational model, i.e. schema-less
▪  But not without a Data Model!
■  Runs on clusters of commodity hardware
■  Generally Open Source
■  Can be considered as storing/retrieving “aggregates”
▪  a collection of related objects that can be treated as a unit
■  Usually described by “Keys” and “Values” (i.e. K-V pairs)
Key-Value NoSQL stores
■  The most basic of NoSQL stores
■  Simple K-V structure: A “blob” of data (“Value”) indexed and
accessed via a “Key”
■  “Value” part also known as Aggregate
■  Aggregate is a collection of related objects treated as a unit
■  Written/Updated/Read/Consistent as single, smallest unit
■  Typically, aggregate is limited in size (BLOB in Oracle)
■  Typically, expressed in JSON, and sometimes in XML
■  JSON/XML aggregates are self-describing
■  Value is “opaque” in a K-V store, but is simple
■  Scale out with sharding
■  Examples of K-V store: Riak, Oracle NoSQL
Key-Value NoSQL stores
■  Typical Use cases
▪  Shines when you need simple GET/PUT operations
▪  Session state; Tokens – Enables web-scale
▪  User profiles and preferences – Typically latent caching layer
▪  Latency bridge: Support RYOW’s in some cases
■  Anti-patterns
▪  No ad-hoc query patterns - (i.e. need key to access)
▪  Not meant for analytics type workload
▪  When multi-key/multi-operation consistency is required
▪  Set based operations (i.e. related data)
Document NoSQL stores
■  Datastore able to understand and manipulate structures
■  Needs to follow an agreed format
▪  usually JSON, but also BSON, XML and YAML
■  Support for secondary indexes
▪  Needs ability to understand/index K-V pairs in the aggregate
▪  Secondary indexes may throttle write rate
■  Aggregate size usually limited
■  Scale-out again supported via sharding
▪  Some stores support multiple sharding methods (MongoDB)
■  K-V stores sometimes evolve into Document stores
▪  E.g. CouchBase evolution
■  Needs embedding/linking support (size/other limitations)
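A secondary index over a field inside the aggregate can be sketched as an inverted map from field value to document keys. The documents and the `build_index` helper are made up for illustration and do not reflect any product's index implementation.

```python
from collections import defaultdict

documents = {
    "u1": {"name": "John", "city": "San Jose"},
    "u2": {"name": "Mary", "city": "Austin"},
    "u3": {"name": "Ravi", "city": "San Jose"},
}


def build_index(docs, field):
    """Map each value of `field` to the keys of the documents that contain it."""
    index = defaultdict(set)
    for key, doc in docs.items():
        if field in doc:
            index[doc[field]].add(key)
    return index


city_index = build_index(documents, "city")
san_jose = sorted(city_index["San Jose"])
print(san_jose)
```

Every write to a document also has to update each index that covers it, which is why secondary indexes can throttle the write rate.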
Document NoSQL stores
■  Typical Use cases
▪  Of course, any collection of document-type models
▪  Easy-to-start NoSQL projects when moving from RDBMS
▪  Almost any NoSQL use case needing secondary index access
▪  Content and Metadata store: typically multiple keys
▪  Queries using materialized views (CouchBase)
▪  Non-trivial sharding (MongoDB)
▪  Horizontally scaled or Cached reads (MongoDB, CouchBase)
▪  Models requiring simple relationships (Blogs, User modeling)
■  Anti-patterns:
▪  Not a drop-in replacement for RDBMS
▪  Evolving relationships or query patterns
▪  Usually not good for write-heavy workloads
Column Family NoSQL stores
■  Characteristics of CF Stores
▪  Data is mostly organized by sets of columns
▪  Key – Value based access
▪  “Value” consists of sets or ranges of columns
▪  Still unstructured
▪  No joins (except via another keyed table, using MapReduce)
■  Cassandra, HBase, Amazon SimpleDB are prime examples
▪  HDFS on a Hadoop cluster underlies HBase
▪  HBase evolved from Google’s BigTable
▪  Cassandra originated at Facebook
▪  Cassandra also supports CQL (a SQL like language)
Column Family NoSQL stores
■  Typical Use cases
▪  Data is mostly organized by sets of columns (super columns)
▪  Key – Value based access
▪  “Value” consists of sets of columns (but still unstructured)
▪  Lots of repeated sets of values (e.g. Customer transactions)
▪  No joins (except via another keyed table, using MapReduce)
▪  Write-intensive patterns (Internet-of-Things type data)
▪  Rolling expiry patterns such as Time series data
■  Anti-patterns
▪  IMHO Low-latency reads (in comparison to other NoSQL stores)
▪  Need access via secondary or other keys
Graph NoSQL stores
■  Stores Nodes and Edges
■  Provides “Index-free Adjacency”
■  Nodes are entities: People, Accounts, Items, Locations
■  Edges connect Nodes to other Nodes
■  Edges have properties
■  Can mine patterns present in these relationships
■  Supports graph-like queries:
▪  Shortest distance between two locations
▪  Social Graphing: Connecting people
▪  Products that your friends liked
■  Neo4j is a well-known graph database
■  Giraph: An open source graph processing system (FB!)
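A graph-like query such as “shortest distance between two nodes” can be sketched with a breadth-first search over an adjacency list. The tiny social graph below is invented for illustration; a real graph store would run this through its own query language.

```python
from collections import deque

graph = {
    "Ann": ["Bob", "Cal"],
    "Bob": ["Ann", "Dee"],
    "Cal": ["Ann", "Dee"],
    "Dee": ["Bob", "Cal", "Eve"],
    "Eve": ["Dee"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path by edge count, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(graph, "Ann", "Eve")
print(path)
```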
Graph NoSQL stores
■  Typical Use Cases
▪  Social Graphs
▪  Recommendation Engines
▪  Graph traversal use cases
▪  Relationships with defined end-points
▪  Routing and Location based solutions
▪  Account Linking (e.g. for fraud detection; peer risk checking)
■  Anti-patterns
▪  Scale out via sharding typically not supported in some products
▪  Update all/Update most patterns
▪  Dangling end-points
Some more concepts: JSON
■  You need to understand JSON
▪  JavaScript Object Notation
▪  Self describing, English text key-value pairs
▪  In other words, a simpler version of XML
▪  No externally imposed structure (hint: No tab/column mapping!)
{
"id":101,
"first_name":"John",
"second_name":"Kanagaraj",
"residential_address":[{"add1":"20 First St", "city":"San Jose", "state":"CA"}],
"phone":"408-555-9999"
}
▪  Can you spot some optimization here?
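The sample document can be parsed with Python's standard `json` module. The key-shortening step is only one possible answer to the optimization question (key names are repeated in every stored document), not necessarily the one the speaker had in mind; the `SHORT_KEYS` mapping is hypothetical.

```python
import json

text = """
{
  "id": 101,
  "first_name": "John",
  "second_name": "Kanagaraj",
  "residential_address": [{"add1": "20 First St", "city": "San Jose", "state": "CA"}],
  "phone": "408-555-9999"
}
"""

person = json.loads(text)
print(person["first_name"], person["residential_address"][0]["city"])

# Hypothetical shorter key names: the same data, fewer bytes per document.
SHORT_KEYS = {"first_name": "fn", "second_name": "sn", "residential_address": "addr"}
short = {SHORT_KEYS.get(k, k): v for k, v in person.items()}
print(len(json.dumps(person)), len(json.dumps(short)))
```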
Some more concepts: Languages
■  You need to understand JVMs and some Java
▪  Many NoSQL stores use JVM based programs
▪  E.g. Hadoop, Cassandra
▪  Ability to understand JVM’s and their internals is key
▪  JVM’s Garbage Collection needs to be managed
▪  Need to understand/configure JMX (Java Management Extensions)
▪  Most NoSQL stores support Java API’s out of the box
■  Most NoSQL stores support more than just Java
▪  E.g. Python, Ruby, Perl, C/C++, Node.js, Go
▪  Less-well known ones such as Erlang, Haskell, Scala
▪  Need to be able to install and troubleshoot app issues
■  Deploy/Management: Puppet, Nagios, Ganglia, Fab
▪  Frameworks can do more than just NoSQL!
MongoDB: Document datastore
[Diagram: Client connects through MongoS routers to Replica Set 1 and Replica Set 2, each with one MongoD (Master) and two MongoD (Slave) nodes, alongside MongoD config servers]
■  Write scaling: Sharding through MongoS
■  Read scaling via Replica sets
■  Writes to Master Node, reads from Master and Slave nodes (optional)
MongoDB: Data Modeling
RDBMS          MongoDB
Database       Database
Table          Collection
Row            Document
RowID          _id
Index          Index
Join           Embedded Document (DBRef)
Foreign Key    Reference

Sample order:
Order ID: 1001
Customer: John
Order Line Items:
20001 – Tires – 2 x $84 - $168
45320 – Pump – 1 x $54 - $54
Payment Details:
Card: Amex
CC: 3425268768
Exp: 03/17
Total: $222

[Diagram boxes: Order, Customer, Line Items, Financial Instrument, FinTrans Journal]

{ "order_id": "1001", "customer": "John",
  "orderitems": [ {"prodid": "20001", "prodname": "Tires", "Qty": 2, "price": 168},
                  {"prodid": "45320", "prodname": "Pump", "Qty": 1, "price": 54} ],
  "pcard": "Amex", "pcc": "3425268768", "pexp": "03/17", "ord_tot": 222 }
MongoDB: Essentials
■  Stands for “huMONGOus DataBase”
■  Reads and Writes using memory-mapped files
▪  Try and fit working set in memory
▪  Use SSDs for faster I/O
■  Very good index support on identified JSON fields
▪  Allows Key-Value, Range and text search queries
▪  Unique as well as Compound Indexes
▪  Special TTL (Time-to-Live) index to retire data
■  Stores documents in BSON format (Binary JSON)
■  Interact, manage, program through Mongo Shell
■  Many other drivers and interfaces
■  Support for Geospatial data and queries
■  Aggregation Framework and MapReduce support
MongoDB Physical/Memory Mapping
MongoDB: Essentials
■  Query optimizer exposes execution plan
■  Multiple sharding methods:
▪  Range-based sharding: Optimized for range queries
▪  Hash-based sharding: Ensure uniform distribution
▪  Tag-aware sharding: Partitioned by user-specified configuration
■  Write-ahead journaling
▪  Journal commits every 100ms (oplog is capped collection)
■  Configurable Write-availability via Write Concern
▪  Unacknowledged (memory only)
▪  Acknowledgement for specific levels:
—  Write to at least 2 replicas in the same datacenter
—  Write to at least 1 replica in remote datacenter
■  Commercially supported by 10gen (now called MongoDB)
MongoDB: The Not-so-good…
■  Reads block Writes (albeit for very short periods ~ microsecs)
▪  Be careful about aggregation/MapReduce: Intense reads
▪  Read lock yields when read has to go to disk
▪  Read locks can be shared by multiple readers
■  Writes block Reads (Writer-greedy, for very short periods)
■  Locks are at a “database” level
▪  Careful with your data model!
▪  Typically restrict one collection per database if possible
▪  Write to multiple documents will yield periodically
■  Index creation (writes) locks your entire database
■  Replicates to Slaves and locks all slaves in Replicaset
■  Compaction also locks the database
■  Secondaries block on replication writes
CouchBase – Another Document Store
[Diagram: Couchbase Cluster, Multitenant Architecture]
■  Clients read/write from/to Documents (user/application data)
■  Documents are placed in Data Buckets based on bucket partitioning
■  Data Buckets live on Server Nodes
■  Server Nodes (Servers) form a dynamically scalable Couchbase Cluster
CouchBase Single-Node Architecture
[Diagram: single-node architecture]
■  Data Manager
▪  Data access ports: 11210 / 11211
▪  Object-managed Cache
▪  Storage Engine
▪  Query Engine, with Query API on port 8092
■  Cluster Manager (Erlang/OTP)
▪  REST management API/Web UI (http) on port 8091
▪  Admin Console
▪  Replication, Rebalance, Shard State Manager
Couchbase: Background and Use cases
■  Created as a Merge of code and ideas:
▪  Memcached – An excellent memory-only cache
▪  CouchDB – A document store
▪  Now a Persistent Cache
▪  Code in Erlang and C++ (??)
▪  Different ports for both products – now merging
▪  Lots of Memcached deployments exist
▪  These can now be upgraded to Couchbase quickly via the Moxi client
■  Primarily as a Caching solution
▪  Very fast for reads and writes
▪  Some concerns with cross data center replication
▪  IMHO: Not yet suited for read-your-own-writes (RYOW) via secondary key
Cassandra: Column-Family datastore
[Diagram: a Client and six nodes, Node 1 to Node 6]
■  Hash function(Key) => Token
■  Client writes to selected Node as per Token
■  Coordinator Node replicates to other nodes (Timed per Quorum setting)
■  Node acknowledges to coordinator
■  Acknowledgement to client
■  Data written to internal commit log
■  If node goes offline, writes stop
■  When node rejoins, a “hinted handoff” process completes the pending writes + “read repair”
■  Requests can range from ANY to ALL
▪  ANY: Write to commit log on at least 1 node
▪  ALL: Writes complete to memory and commit log on ALL replicas
■  Availability precedes Consistency (AP)
■  Read and Write Paths are separate
Cassandra: Column-Family datastore
■  Writes, in order:
▪  (1) Write:(K1,{C1:V1})
▪  (2) Write:(K1,{C2:V2})
▪  (3) Write:(K2,{C1:V3,C2:V4})
▪  (4) Write:(K1,{C1:V5,C3:V6})
■  [Diagram: the writes are shown in the Memtable (Memory) and in the Commit log (Disk); the Disk side also shows an Index and an SSTable]
■  SSTable contents shown:
▪  K1: C2:V2, C1:V5, C3:V6
▪  K2: C1:V3, C2:V4
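The four writes on this slide can be replayed against a toy model of the write path. This is a sketch, not Cassandra's storage engine: the class name `ColumnFamilyStore`, the point at which the memtable is flushed, and the "later write wins" merge rule are assumptions (a real cluster resolves conflicts using per-column timestamps).

```python
class ColumnFamilyStore:
    """Toy model: commit log plus memtable, with immutable SSTables on 'disk'."""

    def __init__(self):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # row key -> {column: value}, held in memory
        self.sstables = []    # flushed memtables, oldest first

    def write(self, key, columns):
        """Append to the commit log, then apply the write to the memtable."""
        self.commit_log.append((key, dict(columns)))
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        """Freeze the current memtable as a new SSTable and start a fresh one."""
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}

    def read(self, key):
        """Merge the row from every SSTable (oldest first), then the memtable."""
        row = {}
        for table in self.sstables:
            row.update(table.get(key, {}))
        row.update(self.memtable.get(key, {}))
        return row

    def compact(self):
        """Merge all SSTables into one, newer values overwriting older ones."""
        merged = {}
        for table in self.sstables:
            for key, columns in table.items():
                merged.setdefault(key, {}).update(columns)
        self.sstables = [merged] if merged else []


if __name__ == "__main__":
    store = ColumnFamilyStore()
    store.write("K1", {"C1": "V1"})
    store.write("K1", {"C2": "V2"})
    store.flush()  # assumed flush point, chosen only for illustration
    store.write("K2", {"C1": "V3", "C2": "V4"})
    store.write("K1", {"C1": "V5", "C3": "V6"})
    print(store.read("K1"))  # {'C1': 'V5', 'C2': 'V2', 'C3': 'V6'}
```

The write path only appends and updates in memory, while the read path has to merge several sources; that asymmetry is why writes are cheap and reads are comparatively expensive.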
  
Cassandra: Essentials
■  Write Path is simpler; Reads are a little more complex
▪  Merge Memtable (Row/Key cache) and Row Reads from Disk
▪  Uses Bloom Filter to decide which SSTables to skip (false positives are possible)
▪  In-memory caches are stored in Java heap (GC!!!!)
▪  Can return inconsistent data for RYOW (depending on Quorum)
▪  Consistent: (nodes_written + nodes_read) > replication_factor
■  Compaction: Merge SSTables; Expire Tombstoned data (TTL)
■  Data Modeling:
▪  Model your queries – Optimize for reads
▪  Denormalize – Reads: Slow; Writes: Fast; Disk: Cheap
▪  Column families are stored sorted by timestamp
■  CQL: Cassandra Query Language – A familiar interface
■  Maintaining the Cluster: Gossip and Snitch :-)
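The consistency rule on this slide, (nodes_written + nodes_read) > replication_factor, can be checked with a small Python sketch. The function names are hypothetical; `always_overlap` brute-forces every possible choice of written and read replicas to show why the inequality works: when it holds, any read set must share at least one node with any write set.

```python
from itertools import combinations


def is_consistent(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """The rule from the slide: W + R > RF."""
    return nodes_written + nodes_read > replication_factor


def quorum(replication_factor: int) -> int:
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def always_overlap(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """Brute force: does every read set share a node with every write set?"""
    nodes = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, nodes_written)
        for read in combinations(nodes, nodes_read)
    )


if __name__ == "__main__":
    rf = 3
    for w, r in [(1, 1), (quorum(rf), quorum(rf)), (rf, 1)]:
        print(f"W={w} R={r} RF={rf} consistent={is_consistent(w, r, rf)}")
```

With a replication factor of 3, writing and reading at quorum (2 + 2 > 3) gives read-your-own-writes behaviour, while writing to one node and reading from one node (1 + 1 = 2) does not.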
Choosing the right NoSQL database:
ASCII the right question!
■  Is this a site-facing, P1 Application?
■  Is this a BI/Analytics type problem waiting to be solved?
■  Is this Write Intensive or Read Intensive?
■  Is this a Caching problem?
■  Can the application afford some data loss?
■  What about data consistency?
■  What is more important – consistency or availability?
■  How many data centers need to be supported?
■  What are the query patterns? Are they widely varying?
■  How many distinct clusters of data are present, and how are they related?
■  Is my organization ready to support this product?
Generic problems
■  Consistency is and will be a problem in the NoSQL world
■  Data loss will be present - application should cater to this
▪  Consider the cost of workarounds/cost of data loss
■  The world of NoSQL is evolving:
▪  Maturing slowly: Peak -> Sliding into the trough
▪  Too many choices: 150 choices: http://nosql-database.org/
▪  Many picking the wrong product…
—  (and had to change it later: Check my Delicious stream #nosql)
▪  Most NoSQL vendors still VC funded
▪  New Versions/Features every 6 months!
▪  We will learn lessons the hard way…..
Real World problems
■  Need to break out of the RDBMS/ACID world
▪  Imagine a world with no COMMITs, no “Transactions”
▪  Data loss and data inconsistency are inevitable
▪  Data Owners/Architects shy away: FUD, real dangers
■  Everyone wants to become (or is!) a NoSQL expert
▪  Spell NoSQL and earn $$$ :-)
▪  Best way to learn: Create a “Big Data” need and fulfill it
▪  Who makes the decisions?
■  Lack of skills and maturity
▪  Product choice: Knowledge/Experience/Forethought required
▪  Many NoSQL products still basic in functionality
▪  Be prepared to back out of your initial choice
How to get there (from here)?
■  This presentation is just the beginning
■  Lots and lots of reading and experimenting required
■  Recommended Reading:
▪  NoSQL Distilled by Fowler and Sadalage
▪  Seven Databases in Seven Weeks: Redmond and Wilson
▪  Many NoSQL books – browse at Safari Online
■  Lots of links to read – Live links:
▪  Follow me on http://delicious.com/jkanagaraj - Tag #nosql
■  Play with the community versions:
▪  Available from the vendors: No support though
▪  Spin up/use Cloud based VMs – Rackspace or AWS
A warning – And some advice
“Some people, when confronted with a big data
problem, think, I’ll use Hadoop. Now, they have a
big data problem and a big Hadoop cluster”
Dmitry Ryaboy, Engineering Manager, Twitter
▪  Start small
▪  Grow with success
▪  Create your own expertise
▪  It is about the untapped potential in your data
Please fill in the feedback form!
Link up with me on LinkedIn
John Kanagaraj, PayPal, an eBay Inc. Company

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Oracle vs NoSQL – The good, the bad and the ugly

  • 1. REMINDER Check in on the COLLABORATE mobile app Oracle vs. NoSQL The good, the bad and the ugly John Kanagaraj Member of Technical Staff, PayPal Database Engineering, An eBay Inc. company
  • 2. Housekeeping ■  Check the font sizes ▪  Can you read this at the back of the room? ▪  Can you read this at the back of the room? ▪  Just kidding! ■  Silence your Phones! ■  Q & A : Ask as we go along (and I will repeat the question) ▪  Keep it relevant to the slide at hand ▪  I might defer the question to a later slide if I believe it is addressed later ▪  If it gets too long, I humbly request we deal with it after the break or after the session ■  It is a long day, so if you nod off it is ok (hopefully no snoring!)
  • 3. Agenda ■  Big Data – What it is, why should we care ■  NoSQL – What it is, and why do we need it ■  Concepts you need to understand ▪  CAP Theorem (and why it is important) ▪  Unstructured Data ▪  Sharding and Replication ▪  Data Modeling in the brave new world of NoSQL ■  Introduction to some popular NoSQL stores ■  A look into the (immediate) future: Moving forward
  • 4. Not on the Agenda ■  Not a Tutorial on various NoSQL datastores ■  NotAnInstallationGuide ■  NotAnAdministrationManual ■  If you already know the CAP Theorem and NoSQL: ▪  I will be covering the basics (so you know!) ▪  We are all here to share and learn: Maybe I can learn from your questions/inputs (time and context permitting) ▪  Let’s talk after the talk (or during the break)
  • 5. Speaker Qualifications ■  Currently Database Engineer @ PayPal ■  Has been working with Oracle Databases and UNIX for too many years ☺ ■  Author and Technical editor ■  Frequent speaker at OOW, IOUG COLLABORATE and regional OUGs ■  Oracle ACE ■  Contributing Editor, IOUG SELECT Journal ■  Loves to mentor new speakers and authors! ■  http://www.linkedin.com/in/johnkanagaraj
  • 7. Big Data – The Why ■  2.5 quintillion bytes of data are generated every day ▪  (1 quintillion = 10^18 bytes): so that is ~2.3 billion GB ▪  Humans (using devices) as well as Machines (IoT) —  Location data emitted by your smart phone —  “Web-scale” Webserver logs and interactions —  Sensor data emitted by almost every networked device: E.g. Cars’ fuel/pressure gauges, Personal fitness devices (wearables) —  Multi-media sources: Security cameras, Face/Plate recognition —  Data that matters to you: Medical, Scientific, Weather ▪  Lots of value in this data, but mostly untapped ▪  Most of this is never stored: Too big to store, but not too big to understand ☺
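
A quick check of the unit conversion on this slide, as a minimal sketch. The only input is the 2.5 quintillion bytes per day quoted above; the variable names are illustrative.

```python
# Convert the slide's daily data volume into more familiar units.
BYTES_PER_DAY = 2.5e18  # 2.5 quintillion bytes, the figure quoted on the slide

decimal_gb = BYTES_PER_DAY / 10**9   # SI gigabytes (10^9 bytes each)
binary_gib = BYTES_PER_DAY / 2**30   # binary gibibytes (2^30 bytes each)

print(f"{decimal_gb:.2e} GB")    # about 2.5 billion GB
print(f"{binary_gib:.2e} GiB")   # about 2.3 billion GiB
```

Either way the result is on the order of billions of gigabytes per day.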
  • 8. Big Data – The Why ■  Plummeting cost of technology ▪  Storage Cost/GB – 1980 : $437,500, 2013 : $0.05 ▪  Computing Cost – Moore’s law ▪  Network transportation Cost – WiFi, BLE, etc. ■  What is driving this? ▪  Cheaper to store data than to delete/ignore it ▪  Minimal cost to generate, transport and store ▪  Ubiquity of network, storage and data generation ▪  Accelerating advances in science and technology ▪  Machine learning and intelligence is growing Source for storage cost: http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
  • 9. Big Data – The Why Infographic: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  • 10. Big Data Characteristics: 4 V’s + 1 ■  Volume – Scale at which data is generated ▪  Cannot be stored using traditional methods ▪  Cannot be stored in a monolithic store ■  Variety – Different forms of data ▪  Big Data is usually not structured; structure not known in advance; structure not controlled by consumer ▪  May not always be in text form (more than just binary) ■  Velocity – Data arrives in a continuous stream ▪  Multiple, varied sources produce data continuously ▪  Peaks and bursts unpredictable ▪  “Always on”: No down time for maintenance or re-orgs ▪  No “Known Users” – unpredictable, unknown patterns/scale
  • 11. Big Data Characteristics: 4 V’s + 1 ■  Veracity – Uncertainty: Data is not always accurate ▪  Multiplicity of sources creates convergence of truth ▪  Eventual consistency (versus immediate consistency) ■  Value – Immediacy and hidden relationships ▪  In many use cases, value of Big Data declines quickly —  Traffic reports do not matter after 30 minutes —  Routing resupply trucks is counterproductive after the fact —  However, some historical value may be derived after the event ▪  Concept of “Near Line” data (neither fully online nor offline) ▪  Easy to miss hidden relationships —  Most data sets are correlated to other data sets, implicitly or explicitly —  Not easy to detect due to volume and variety —  Mine data using various techniques (Data Science)
  • 12. So how do we store this storm? ■  Big Data impossible to store using RDBMS ▪  Too big, too fast for RDBMS to ingest ▪  RDBMS needs “schema before write” ▪  Unknown structures = “schema during read” ■  So what is limiting RDBMS? ▪  ACID requirement drives “protection” mechanism ▪  Redo and Undo in Oracle provides ACID ▪  “Relational” imposes “schema before write” ▪  Easy to get “small bits”; hard to get “large pieces”
  • 13. So how do we store this storm? ■  RDBMSs are essentially ACID ▪  Atomic: A transaction fully succeeds or fully fails ▪  Consistent: A transaction moves the database from one consistent state to another ▪  Isolated: Transactions cannot interfere with each other ▪  Durable: Committed transactions persist even during failure ■  RDBMS Clusters = “Shared everything” for ACID ■  Atomicity in a distributed database: Two Phase commit ▪  Essential for splitting workload ▪  Reduction in availability though! ■  New concept! BASE (Basically Available, Soft state, Eventual Consistency)
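
The two-phase commit point can be sketched as a toy coordinator. This is a simplified illustration, not any vendor's protocol implementation: the `Participant` class and its `reachable` flag are hypothetical, and real systems also handle timeouts, logging and recovery.

```python
class Participant:
    """One database node taking part in a distributed transaction."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if this node can promise to commit.
        self.state = "prepared" if self.reachable else "failed"
        return self.reachable

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Commit everywhere or nowhere; returns True if the transaction committed."""
    votes = [node.prepare() for node in participants]   # phase 1: prepare
    if all(votes):
        for node in participants:                        # phase 2: commit
            node.commit()
        return True
    for node in participants:                            # phase 2: roll back
        node.rollback()
    return False


healthy = [Participant("node1"), Participant("node2")]
degraded = [Participant("node1"), Participant("node2", reachable=False)]
print(two_phase_commit(healthy))    # True
print(two_phase_commit(degraded))   # False
```

A single unreachable participant is enough to fail the whole transaction, which is the "reduction in availability" the slide warns about.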
  • 14. A common scalability inhibitor (Confidential and Proprietary) ■  Heap table with one or more “right growing” indexes −  Primary Key: Unique index on a NUMBER column −  Key value generated from an Oracle Sequence (NEXTVAL = 1) −  I.e. “monotonically” increasing ID value −  High rate of insert (> 5000 inserts/second) from multiple sessions −  Multiple indexes, typically leading date/time series or mono-valued −  E.g. Oracle E-Business Suite’s FND_CONCURRENT_REQUESTS ■  Here’s the Problem: −  All INSERTing sessions need one particular index block in CURRent mode (as well as one particular data block in CURRent mode) −  Question: Would you use RAC to scale out this particular workload?
  • 15. A quick deep dive ■  Here’s what happens to accommodate the INSERT −  Assume the current value of the PK is 100, and NEXTVAL = 1 −  Assume we have ‘N’ sessions simultaneously inserting into that table −  Session 1 needs to update the Index block (add the Index entry for 100) −  Session 2 wants the same block in CURRent mode (add another entry for 101; needs the same block because the entry fits in the same block) −  Session 3… N also want the same block in CURRent mode at the very same time (as all sessions will have “nearby” values for index entry) −  Block level pins/unpins (+ lots of other work – Redo/Undo) required…. −  Same memory location (SGA buffer for Index block) accessed −  Smaller but still impacting work for buffer for Data block −  Rate of work constrained by CPU speed and RAM access speeds
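
The hot-block effect described on this slide can be shown with a toy model. This is only a sketch: `ENTRIES_PER_LEAF` is an assumed figure and `leaf_block_for` is a hypothetical stand-in for how an ordered index places neighbouring keys, not Oracle's actual block layout.

```python
from collections import Counter

ENTRIES_PER_LEAF = 200  # assumed number of index entries that fit in one leaf block


def leaf_block_for(key):
    """Toy model: keys are stored in order, so neighbouring keys share a leaf block."""
    return key // ENTRIES_PER_LEAF


# Eight sessions each draw the next sequence value (100, 101, ...).
session_keys = [100 + n for n in range(8)]
blocks_touched = Counter(leaf_block_for(k) for k in session_keys)

print(blocks_touched)  # Counter({0: 8})
```

All eight inserts land in the same leaf block, so every session queues up for the same buffer.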
  • 16. A quick deep dive ■  What if you use RAC to “scale out” this workload? −  Assume “N” sessions simultaneously inserting from 2 RAC nodes (2xN) −  In addition to previously described work, you need to −  Obtain the Index block from remote node in CURRent mode −  Session 1 (Node 1) updates Index block with value 100 −  Session 2 (Node 2) requests block in CURRent mode (value 101) −  LMS processes on both nodes churn CPU co-ordinating messages and block transfers back and forth on the interconnect −  Flush redo changes to disk on Node 1 before shipping CURRent block to Node 2 (gated by RedoWriter response!!!) −  Sessions block on “gc current <state>” waits during this process −  CPU, Redo IO, Interconnect, LMS/LMD processes involved
  • 17. A quick deep dive ■  Some solutions −  Spread the pain for the right growing index −  Use Reverse Indexes (cons: Range scan not possible) −  Use Hash partitioned indexes (cons: All partitions probed for Range scan, Need Partitioning Option, Additional administration) −  Prefix RAC node # (or some identifier per node) to key −  Use a modified key: Use Java UUID, Other distinct prefix/suffixes −  Use Range-Hash Partitioned tables with Time based ID as key −  E.g. Epoch Time (# of seconds from Jan 1, 1970) + Sequence value for lower bits −  Enables Date/Time based partitioning key −  Unique values allow Local Index to be unique
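
Three of the ideas above (reversed keys, hash partitioning, and a time-based ID) can be sketched in a few lines. The function names, the 8-byte key width, the 16 partitions and the 20 sequence bits are all assumptions made for illustration; they do not describe how Oracle implements these features.

```python
import hashlib
import time


def reverse_key(key, width=8):
    """Reverse the key's bytes so consecutive values land far apart in key order."""
    return int.from_bytes(key.to_bytes(width, "big")[::-1], "big")


def hash_partition(key, partitions=16):
    """Pick an index partition from a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def time_based_id(sequence_value, epoch_seconds, seq_bits=20):
    """Epoch seconds in the high bits, sequence value in the low bits."""
    return (epoch_seconds << seq_bits) | (sequence_value & ((1 << seq_bits) - 1))


keys = range(100, 108)
print([hex(reverse_key(k)) for k in keys])          # neighbours are now 2**56 apart
print(sorted({hash_partition(k) for k in keys}))    # partitions hit by 8 inserts
print(time_based_id(sequence_value=101, epoch_seconds=int(time.time())))
```

The reversed keys no longer sort next to each other, which is also why a range scan over them stops being useful, as the slide notes.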
  • 18. Relaxing ACID – Skip the Redo/Undo ☺ ■  BASE Model ▪  “In partitioned databases, trading some consistency for availability can lead to dramatic improvements in scalability” ▪  Proposed by Dan Pritchett (eBay) in 2008 ▪  ACID is pessimistic; enforces consistency at the end of a transaction ▪  BASE is optimistic; accepts eventual consistency ▪  Supports partial failure without total failure ■  Enabled new paradigms ▪  New patterns for distributing workload emerge —  Sharding and Replication —  Less than perfect (but good enough) consistency
  • 19. A New Beginning - NoSQL ■  A new dawn emerges… ▪  Brewer proposes CAP theorem (2000) ▪  Google creates BigTable (~ 2006) ▪  Amazon creates Dynamo (~ 2007) ▪  eBay shards over Oracle Databases (2008) ▪  Inspires a new set of alternate data storage projects ▪  NoSQL databases start appearing… (~2008 – 2010) ▪  Becomes a buzz word (~ 2011 – 2013) ■  Now we all want “in”… ■  Picture courtesy Kamran Agayev via Twitter
  • 20. So What is NoSQL? ■  NoSQL – supposed to be “No SQL”, but it is NOT ■  NoSQL – Loosely it is “Not Only SQL” (i.e. NOSQL) ▪  Term coined by Eric Evans (developer at Rackspace) ▪  Adopted by Johan Oskarsson (another developer) ▪  For a meetup of like minds in SF, 2009 ▪  Meetup for “open-source, distributed, nonrelational databases” [Voldemort, Cassandra, CouchDB, MongoDB, etc.] ■  NoSQL does not mean there is no “SQL-Like” interface ▪  Cassandra supports CQL (Cassandra Query Language) ■  NoSQL does NOT always mean Big Data ▪  But Big Data stores are almost always NoSQL based ▪  That is, if you count Hadoop as a NoSQL datastore * * See: http://wiki.apache.org/hadoop/HadoopIsNot
  • 21. A small diversion: The Hadoop ecosystem ■  Let’s understand Hadoop vs. the Rest ■  Hadoop – The real Big Data Store ▪  Real Big platform to store data ▪  Store almost anything and everything ▪  Key components of Hadoop: —  HDFS: A unified file system that combines all storage in the cluster —  MapReduce: A programming model to handle large data sets —  An extensible ecosystem: Other components to control, schedule and manage processing and the cluster ▪  Is NOT a database (although there is HBase)…. ▪  But supports SQL-like interface using Hive ▪  Not really meant for Online, Web-site facing implementation
  • 22. A small diversion: The Hadoop ecosystem
  • 23. Big Data / NoSQL Landscape From http://www.bigdata-startups.com/open-source-tools/
  • 24. Why NoSQL? ■  Impedance Mismatch ▪  Real world data does not naturally possess structure ▪  A “Person” has many variable characteristics ▪  Applications deal with a “person” object ▪  This is then a set of In-memory structures ▪  Relational Databases require structured table/columns though…. ▪  Thus, an “impedance mismatch” between Dev and DBA ▪  Which ORMs try to bridge (the gap between Dev and DBA) —  Cultural mismatch: “Agile” (Dev) seems to be “Fragile” (for a DBA) —  Technical mismatch: “Objects” to “Relational Tables” —  Storage structure mismatch: “Un-/Semi-structured” to “Structured”
  • 25. Why NoSQL? ■  Rapid “web-scale” growth for external entities/users ▪  Ability to support viral/burst traffic patterns ■  Most data does not (usually) need immediate consistency ▪  It is ok to lose some data; It is Ok not to have ACID ■  Commodity hardware and the Cloud ▪  RDBMS’ don’t run well on clusters (apologies: RAC world) —  Shared Disk clusters are both a SPOF and expensive! —  License costs for RDBMS on clusters —  Failure of one component brings everything down ▪  Clustering cheaper commodity hardware is economical —  Single or even a small number of failures affect a portion of workload, not the whole application (due to sharding) ▪  Easier to create a “cloud” with commodity hardware
  • 26. Why NoSQL? ■  Open patterns ▪  Almost all NoSQL products are open-source ▪  Relatively open learning —  Meetups; Open seminars run by vendors —  Lively blogs and passionate contributors ▪  Quick-and-easy installs ▪  Community versions from vendors ▪  Easy to install on for-rent cloud environments ▪  Monitoring/Alerting through open frameworks (Nagios, Ganglia) ■  Enterprise support through vendor ▪  10gen for MongoDB; DataStax for Cassandra; CouchBase ▪  Cloudera, Hortonworks, MapR for Hadoop ■  Large Webscale companies building own NoSQL databases
  • 27. NoSQL Characteristics ■  “Schema before write” vs. “Schema during read” ▪  Caters to “unstructured” need ▪  Primarily solves Impedance mismatch ▪  Creates its own challenges ■  Modeled by read and write patterns ▪  “customer and orders” together for a customer centric view ▪  “product and orders” for a production/supply-chain centric view ▪  Alternative: Store twice ■  Data modeling driven by physical storage model ■  Read patterns ▪  Secondary indexing (overheads) ▪  Brute-force access via MapReduce jobs ▪  Store multiple, denormalized copies (“disk is cheap”)
  • 28. NoSQL Characteristics ■  ACID is “relaxed” ▪  A transaction is limited to an aggregate (k-v pair) ▪  Enables distributed, shared-nothing architectures ▪  Ideal for clustered deployments ▪  Optimistic locking ▪  Some loss of data and consistency is expected (and catered to) ■  Write patterns ▪  UPDATEs converted to INSERTs (timestamped/tombstoned) ▪  Time-To-Live (TTL) based DELETE’s/Purges ▪  Compaction based garbage collection ▪  Reduced Write latency due to memory only writes ▪  Transaction logging supported in some NoSQL stores
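
The write patterns listed here (updates stored as timestamped inserts, tombstones, TTL expiry, and compaction) can be sketched as a toy append-only store. `AppendOnlyStore` is a hypothetical class written for this illustration and does not mirror the internals of any particular NoSQL product.

```python
import time

TOMBSTONE = object()  # marker appended instead of physically deleting a key


class AppendOnlyStore:
    def __init__(self):
        self.log = []  # entries are (key, timestamp, value, expires_at)

    def put(self, key, value, ttl=None, now=None):
        """An UPDATE is just another INSERT with a newer timestamp."""
        now = time.time() if now is None else now
        expires_at = now + ttl if ttl is not None else None
        self.log.append((key, now, value, expires_at))

    def delete(self, key, now=None):
        self.put(key, TOMBSTONE, now=now)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        latest = None
        for k, ts, value, expires_at in self.log:
            if k == key and (latest is None or ts >= latest[0]):
                latest = (ts, value, expires_at)
        if latest is None:
            return None
        _, value, expires_at = latest
        if value is TOMBSTONE or (expires_at is not None and expires_at <= now):
            return None
        return value

    def compact(self, now=None):
        """Garbage-collect superseded versions, tombstones and expired entries."""
        now = time.time() if now is None else now
        newest = {}
        for entry in self.log:
            key, ts = entry[0], entry[1]
            if key not in newest or ts >= newest[key][1]:
                newest[key] = entry
        self.log = [e for e in newest.values()
                    if e[2] is not TOMBSTONE and (e[3] is None or e[3] > now)]


store = AppendOnlyStore()
store.put("cart:1", {"items": 1}, now=1.0)
store.put("cart:1", {"items": 2}, now=2.0)         # update = second insert
store.put("session:9", "token", ttl=30, now=2.0)   # expires at t = 32
print(len(store.log))                 # 3 entries before compaction
store.compact(now=40.0)
print(len(store.log))                 # 1 entry after compaction
print(store.get("cart:1", now=40.0))  # {'items': 2}
```

Writes stay cheap because nothing is modified in place; the cost moves to reads and to the background compaction step.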
  • 29. Why use an RDBMS then?! ■  ACID may be a hard business requirement ▪  Data loss can never be tolerated ▪  Data inconsistency can never be tolerated (e.g. Money movement) ■  Complex data models favor RDBMS ▪  Try modeling Oracle EBS in NoSQL ☺ ■  Standardized interface via SQL ▪  Broadly same across all RDBMS ▪  Well understood, skills availability ■  Inter-application integration ▪  Single platform for data created its own ecosystem ■  Cost to change is prohibitive
  • 30. Introducing the CAP Theorem ■  Eric Brewer’s conjecture at the July 2000 ACM Symposium ■  Formalized by Seth Gilbert and Nancy Lynch in 2002 ■  Any networked shared-data system can have at most two of three desirable properties: ▪  At least one Consistent (C) up-to-date copy of the data ▪  high Availability (A) of that data (for both reads and updates) ▪  tolerance to network Partitions (P) ■  Core systemic requirements in a distributed environment ▪  Special symbiotic relationship ▪  Present during design and deployment of applications in a distributed environment (whether acknowledged or not) ■  Applies well to the distributed NoSQL world
  • 31. Components of the CAP Theorem ■  (C)onsistency ▪  All clients see the same results from a query, even in the presence of an update at the same time as the query ■  High (A)vailability ▪  All clients can write or access data, even in the presence of system failures. Requestors receive acknowledgment of success or failure ▪  Performance may degrade, but consuming applications are able to access data even though some parts of the system may not be operational at the time of a query ■  (P)artition Tolerance ▪  The system returns results regardless of failures in communication between partitions in the distributed system; i.e. system property holds true even if there is a network partition
  • 33. Illustrating the CAP Theorem (adapted) ■  You start a small business: Provide phone reminders/information ■  Customers call with information; You call back/respond to remind ■  Start small: All information written down in your (single) notebook ■  Business grows: Wife is recruited (scale out, PBX shards calls) ■  Inconsistency: Response misses info updated in Wife’s notebook ■  Resolve inconsistency: All notebooks updated when call ends (lock) ■  Wife’s day off: You leave sticky notes (Inconsistent until next day) ■  Wife fights with you: Network Partition (sticky notes thrown away) ■  You have a choice here: CAP Theorem in play – Pick two ▪  (C) Always provide consistent information to clients ▪  (A) Business is always open if at least one of you is present ▪  (P) Business is open even during a loss of communication between 2 ■  Run around clerk: Eventual consistency and Compaction
  • 34. Examples of CAP Theorem pairs ■  Consistency and Partition Tolerance (CP): Banking Transaction at an ATM ▪  Data needs to be consistent in the presence of updates ▪  If there is a network failure, dispense cash but limit the transaction amount ▪  Transaction still available, but system property changed due to network partition ■  Consistency and Availability (CA): Database System-of-Record ▪  Data Consistency is key ▪  During a network failure, clients stop writing (no redo), no write availability ▪  Present in Oracle Data Guard’s Maximum protection mode/Single node DB ■  Availability and Partition Tolerance (AP): Shopping cart in Amazon.com ▪  Spread data across multiple partitions to be always available ▪  Reconcile cart at checkout (may result in dual purchases!) ▪  Sacrifices consistency, but works for most cases, most of the time
  • 35. CAP Theorem in the Oracle World ■  Application Scalability: Some well-known techniques ▪  Partition workload by function —  Schema level split: data unrelated to each other is segregated —  Typically provides headroom for main workload/environment ▪  Distribute transactions —  For related data that still needs to be viewed together —  Typically using Database links —  Typically for master lookups and remote writes —  Introduces dependencies (more on that soon) ▪  Decouple work asynchronously —  Use AQ to write tokens or keys to process later —  Introduces a “delay”: Data not immediately consistent
  • 37. CAP Theorem in the Oracle World ■  Application Scalability: Some well-known techniques ▪  Offload reads using Active Data Guard (DB 11g and above) ▪  DG copy opened for reads during Real Time Apply ▪  DG allows Redo Data shipping in 3 modes —  Maximum Protection: Zero loss but dependent on remote redo write —  Maximum Performance: Remote redo written asynchronously —  Maximum Availability: Switches to Max Performance mode on remote redo write failure, operates in Max protection mode otherwise ▪  Offers multiple shades of availability and protection ▪  ADG and “read your writes” pattern —  RTA apply is not equal to “instant” apply —  Not “immediately consistent” but “eventually consistent”
  • 39. CAP Theorem in the NoSQL World ■  Realization of CAP enabled NoSQL to “break free” ▪  Opened minds of database developers ■  However, the “2 of 3” rule was somewhat misleading ▪  NoSQL datastores offer options to vary consistency/durability and availability levels ▪  MongoDB has “Write Concern” – Unacknowledged, Acknowledged, Journaled, Replica Acknowledged ▪  Cassandra has Write Consistency: From ANY to ALL ■  Reality is a spectrum between C and A in the presence of P ▪  Eventual Consistency is a given ▪  Some data loss is expected ▪  Application code/other techniques will need to cater for this
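
The "spectrum between C and A" can be made concrete with the general quorum-overlap rule used by replicated stores with tunable consistency: a read is guaranteed to see the latest acknowledged write when the write acknowledgements plus the replicas read exceed the replica count. The function below is a sketch of that rule only; its name and parameters are illustrative and it does not call any MongoDB or Cassandra API.

```python
def read_sees_latest_write(replicas, write_acks, read_replicas):
    """True when every read set must overlap every acknowledged write set."""
    return write_acks + read_replicas > replicas


# Three replicas, different consistency choices:
print(read_sees_latest_write(3, write_acks=2, read_replicas=2))  # True  (quorum/quorum)
print(read_sees_latest_write(3, write_acks=1, read_replicas=1))  # False (fast, eventually consistent)
print(read_sees_latest_write(3, write_acks=3, read_replicas=1))  # True  (write to all, read from one)
```

Lower acknowledgement counts buy latency and availability at the price of possibly stale reads, which is the trade-off the application then has to cater for.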
  • 40. Sharding and Replication in NoSQL ■  NoSQL datastores: essentially shared-nothing clusters ■  Relaxing ACID allows distributed processing (CAP applies!) ■  Ability to scale out reads/writes is the key ■  Achieved using two techniques: Sharding and Replication ■  Sharding: Divide and Rule ▪  Data is read/written to different servers (“shards”) ▪  Location determined by applying a fixed function on a known key ▪  Different functions: Modulo, Hash, Range, Programmatic ▪  Efficacy of load balancing dependent on function and data ▪  Typically used for Write-scaling (more than Read-scaling) ▪  (Hash partitioned tables/indexes are essentially object level sharding in Oracle databases to enable write scaling)
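
The modulo, hash and range functions named above can be sketched as follows. The shard names and the range boundaries are hypothetical, chosen only to make the routing visible.

```python
import bisect
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical servers


def modulo_shard(user_id):
    """Modulo: simple, but only as balanced as the numeric keys are."""
    return SHARDS[user_id % len(SHARDS)]


def hash_shard(key):
    """Hash: spreads arbitrary keys evenly, at the cost of losing key order."""
    digest = hashlib.sha256(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]


# Range: first letters below "g" go to shard-0, below "n" to shard-1, and so on.
RANGE_UPPER_BOUNDS = ["g", "n", "t"]


def range_shard(key):
    """Range: keeps neighbouring keys together, so ranges scan a single shard."""
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, key[0].lower())]


print(modulo_shard(10))        # shard-2
print(range_shard("alice"))    # shard-0
print(range_shard("Zoe"))      # shard-3
print(hash_shard("alice"))     # one of the four shards, always the same one
```

Because every reader and writer must apply the same function, changing it later means moving data, which is why the choice is hard to undo.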
  • 41. Sharding and Replication in NoSQL ■  Sharding (contd.) ▪  Difficult, if not impossible to change function once implemented ▪  No consistency across shards, or across aggregates ▪  No joins allowed – no cross-shard dependencies ▪  Resilience does not improve (but enables partial availability) ▪  Not to be implemented lightly: Start single if you can ▪  Many NoSQL stores allow auto-sharding (e.g. CouchBase) ■  Replication: Allow multiple copies ▪  Master-Slave model: Simplest, Scales out reads only; Read resilience; May need to cater for eventual consistency ▪  Peer-to-Peer or Multi-Master model: Scales out reads and writes, but consistency/conflict resolution is a big problem ■  Can combine Sharding and Replication!
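
The master-slave model and its eventual consistency can be sketched with a toy primary and a lagging replica. The `Primary` and `Replica` classes are hypothetical and greatly simplified; real stores ship an operation log over the network and handle failures.

```python
class Primary:
    """Accepts all writes and records them in an operation log."""

    def __init__(self):
        self.data = {}
        self.oplog = []

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))


class Replica:
    """Serves reads and applies the primary's log some time later."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def catch_up(self):
        for key, value in self.primary.oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.oplog)


primary = Primary()
replica = Replica(primary)

primary.write("balance", 100)
print(replica.data.get("balance"))  # None: the replica has not applied the write yet
replica.catch_up()
print(replica.data.get("balance"))  # 100: the replica is now consistent
```

A client that writes to the primary and immediately reads from the replica sees the stale value, which is the case application code has to cater for.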
  • 42. The NoSQL Datastore Landscape ■  Generally four types: ▪  Key-Value ▪  Document ▪  Column Family ▪  Graph ■  Not using the relational model, i.e. schema-less ▪  But not without a Data Model! ■  Runs on clusters of commodity hardware ■  Generally Open Source ■  Can be considered as storing/retrieving “aggregates” ▪  a collection of related objects that can be treated as a unit ■  Usually described by “Keys” and “Values” (i.e. K-V pairs)
  • 43. Key-Value NoSQL stores ■  The most basic of NoSQL stores ■  Simple K-V structure: A “blob” of data (“Value”) indexed and accessed via a “Key” ■  “Value” part also known as Aggregate ■  Aggregate is a collection of related objects treated as a unit ■  Written/Updated/Read/Consistent as single, smallest unit ■  Typically, aggregate is limited in size (BLOB in Oracle) ■  Typically, expressed in JSON, and sometimes in XML ■  JSON/XML aggregates are self-describing ■  Value is “opaque” in a K-V store, but is simple ■  Scale out with sharding ■  Examples of K-V store: Riak, Oracle NoSQL
  • 44. Key-Value NoSQL stores ■  Typical Use cases ▪  Shines when you need simple GET/PUT operations ▪  Session state; Tokens – Enables web-scale ▪  User profiles and preferences – Typically latent caching layer ▪  Latency bridge: Support RYOW’s in some cases ■  Anti-patterns ▪  No ad-hoc query patterns - (i.e. need key to access) ▪  Not meant for analytics type workload ▪  When multi-key/multi-operation consistency is required ▪  Set based operations (i.e. related data)
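
A minimal sketch of the GET/PUT pattern and of the "need key to access" anti-pattern. The in-memory `store` dictionary and the helper names are illustrative stand-ins, not the API of Riak or Oracle NoSQL.

```python
import json

store = {}  # key -> opaque bytes; the store never looks inside the value


def put(key, value):
    store[key] = json.dumps(value).encode("utf-8")


def get(key):
    raw = store.get(key)
    return None if raw is None else json.loads(raw)


def scan_where(field, expected):
    """Anti-pattern: without the key, every value must be decoded and checked."""
    return [k for k, raw in store.items() if json.loads(raw).get(field) == expected]


put("session:42", {"user": "john", "cart_items": 3})
put("session:43", {"user": "mary", "cart_items": 0})
print(get("session:42"))            # direct lookup by key
print(scan_where("user", "mary"))   # full scan over every value
```

Lookups by key touch one entry, while any query on the contents degrades into a scan of the whole store.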
  • 45. Document NoSQL stores ■  Datastore able to understand and manipulate structures ■  Needs to follow an agreed format ▪  usually JSON, but also BSON, XML and YAML ■  Support for secondary indexes ▪  Needs ability to understand/index K-V pairs in the aggregate ▪  Secondary indexes may throttle write rate ■  Aggregate size usually limited ■  Scale-out again supported via sharding ▪  Some stores support multiple sharding methods (MongoDB) ■  K-V stores sometimes evolve into Document stores ▪  E.g. CouchBase evolution ■  Needs embedding/linking support (size/other limitations)
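
Why secondary indexes help reads but cost writes can be sketched with a toy collection. `DocumentCollection` is a hypothetical class for illustration; it assumes indexed field values are hashable and is not modelled on any product's index implementation.

```python
from collections import defaultdict


class DocumentCollection:
    def __init__(self, indexed_fields):
        self.docs = {}
        # One lookup table per indexed field: value -> set of document ids.
        self.indexes = {field: defaultdict(set) for field in indexed_fields}

    def insert(self, doc_id, doc):
        """Every write also maintains each secondary index (the write overhead)."""
        old = self.docs.get(doc_id)
        if old is not None:
            for field, index in self.indexes.items():
                if field in old:
                    index[old[field]].discard(doc_id)
        self.docs[doc_id] = doc
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].add(doc_id)

    def find(self, field, value):
        """Read through the index instead of scanning every document."""
        ids = self.indexes[field].get(value, ())
        return [self.docs[i] for i in sorted(ids)]


people = DocumentCollection(indexed_fields=["city"])
people.insert(1, {"name": "John", "city": "San Jose"})
people.insert(2, {"name": "Mary", "city": "Austin"})
people.insert(2, {"name": "Mary", "city": "San Jose"})  # update moves the index entry
print(people.find("city", "San Jose"))
print(people.find("city", "Austin"))   # []
```

Each additional indexed field adds work to every insert and update, which is the throttling effect mentioned on the slide.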
  • 46. Document NoSQL stores ■  Typical Use cases ▪  Of course, any collection of document-type models ▪  Easy-to-start NoSQL projects when moving from RDBMS ▪  Almost any NoSQL use case needing secondary index access ▪  Content and Metadata store: typically multiple keys ▪  Queries using materialized views (CouchBase) ▪  Non-trivial sharding (MongoDB) ▪  Horizontally scaled or Cached reads (MongoDB, CouchBase) ▪  Models requiring simple relationships (Blogs, User modeling) ■  Anti-patterns: ▪  Not a drop-in replacement for RDBMS ▪  Evolving relationships or query patterns ▪  Usually not good for write-heavy workloads
  • 47. Column Family NoSQL stores ■  Characteristics of CF Stores ▪  Data is mostly organized by sets of columns ▪  Key – Value based access ▪  “Value” consists of sets or ranges of columns ▪  Still unstructured ▪  No joins (except via another keyed table, using MapReduce) ■  Cassandra, HBase, Amazon SimpleDB are prime examples ▪  HDFS on a Hadoop cluster underlies HBase ▪  HBase evolved from Google’s BigTable ▪  Cassandra originated at Facebook ▪  Cassandra also supports CQL (a SQL like language)
  • 48. Column Family NoSQL stores ■  Typical Use cases ▪  Data is mostly organized by sets of columns (super columns) ▪  Key – Value based access ▪  “Value” consists of sets of columns (but still unstructured) ▪  Lots of repeated sets of values (e.g. Customer transactions) ▪  No joins (except via another keyed table, using MapReduce) ▪  Write-intensive patterns (Internet-of-Things type data) ▪  Rolling expiry patterns such as Time series data ■  Anti-patterns ▪  IMHO Low-latency reads (in comparison to other NoSQL stores) ▪  Need access via secondary or other keys
  • 49. Graph NoSQL stores ■  Stores Nodes and Edges ■  Provides “Index-free Adjacency” ■  Nodes are entities: People, Accounts, Items, Locations ■  Edges connect Nodes to other Nodes ■  Edges have properties ■  Can mine patterns present in these relationships ■  Supports graph-like queries: ▪  Shortest distance between two locations ▪  Social Graphing: Connecting people ▪  Products that your friends liked ■  Neo4j is a well-known graph database ■  Giraph: An open source graph processing system (FB!)
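
A graph-like query such as "shortest distance between two nodes" can be sketched with a breadth-first search over an adjacency list. The `friends` data is made up for illustration, and this plain-Python version stands in for what a graph database would do natively.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: returns the path with the fewest edges, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Nodes are people; each edge is a "knows" relationship.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

print(shortest_path(friends, "ann", "eve"))  # ['ann', 'bob', 'dan', 'eve']
```

Each step follows edges directly from the current node, which is the access pattern that index-free adjacency is meant to make cheap.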
  • 50. Graph NoSQL stores ■  Typical Use Cases ▪  Social Graphs ▪  Recommendation Engines ▪  Graph traversal use cases ▪  Relationships with defined end-points ▪  Routing and Location based solutions ▪  Account Linking (e.g. for fraud detection; peer risk checking) ■  Anti-patterns ▪  Scale out via sharding typically not supported in some products ▪  Update all/Update most patterns ▪  Dangling end-points
  • 51. Some more concepts: JSON ■  You need to understand JSON ▪  JavaScript Object Notation ▪  Self describing, English text key-value pairs ▪  In other words, a simpler version of XML ▪  No externally imposed structure (hint: No tab/column mapping!) { "id": 101, "first_name": "John", "second_name": "Kanagaraj", "residential_address": [{"add1": "20 First St", "city": "San Jose", "state": "CA"}], "phone": "408-555-9999" } ▪  Can you spot some optimization here?
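
The sample document on this slide can be read with Python's standard `json` module; the sketch below only shows how the self-describing keys are navigated.

```python
import json

raw = """
{ "id": 101,
  "first_name": "John",
  "second_name": "Kanagaraj",
  "residential_address": [
    {"add1": "20 First St", "city": "San Jose", "state": "CA"}
  ],
  "phone": "408-555-9999" }
"""

person = json.loads(raw)

# Keys travel with the data, so no table/column mapping is needed to read it.
print(person["first_name"], person["second_name"])
print(person["residential_address"][0]["city"])   # the address is a list of objects
print(sorted(person.keys()))
```

Note that `residential_address` is an array, so a person can carry any number of addresses without a schema change.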
  • 52. Some more concepts: Languages ■  You need to understand JVMs and some Java ▪  Many NoSQL stores use JVM based programs ▪  E.g. Hadoop, Cassandra ▪  Ability to understand JVMs and their internals is key ▪  JVM’s Garbage Collection needs to be managed ▪  Need to understand/configure JMX (Java Management Extensions) ▪  Most NoSQL stores support Java APIs out of the box ■  Most NoSQL stores support more than just Java ▪  E.g. Python, Ruby, Perl, C/C++, Node.js, Go ▪  Less-well known ones such as Erlang, Haskell, Scala ▪  Need to be able to install and troubleshoot app issues ■  Deploy/Management: Puppet, Nagios, Ganglia, Fab ▪  Frameworks can do more than just NoSQL!
  • 53. MongoDB: Document datastore [Architecture diagram: Client; MongoS routers; Replica Set 1 and Replica Set 2, each with one MongoD (Master) and two MongoD (Slave) nodes; MongoD config servers] ■  Write scaling: sharding through MongoS ■  Read scaling via replica sets ■  Writes go to the Master node; reads from Master and Slave nodes (optional)
  • 54. MongoDB: Data Modeling ■  RDBMS to MongoDB mapping: Database = Database; Table = Collection; Row = Document; RowID = _id; Index = Index; Join = Embedded Document (DBRef); Foreign Key = Reference ■  Relational entities: Order, Customer, Line Items, Financial Instrument, FinTrans Journal ■  Sample order: Order ID: 1001; Customer: John; Order Line Items: 20001 – Tires – 2 x $84 – $168; 45320 – Pump – 1 x $54 – $54; Payment Details: Card: Amex, CC: 3425268768, Exp: 03/17; Total: $222 ■  As a single document: { "order_id": "1001", "customer": "John", "orderitems": [ {"prodid": "20001", "prodname": "Tires", "Qty": 2, "price": 168}, {"prodid": "45320", "prodname": "Pump", "Qty": 1, "price": 54} ], "pcard": "Amex", "pcc": "3425268768", "pexp": "03/17", "ord_tot": 222 }
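
The embedded order document can be handled as one aggregate, with no joins needed to read an order and its line items. The sketch below uses a plain Python dictionary rather than a MongoDB driver; it assumes, as the sample figures suggest, that each item's `price` holds the line total.

```python
order = {
    "order_id": "1001",
    "customer": "John",
    "orderitems": [
        {"prodid": "20001", "prodname": "Tires", "Qty": 2, "price": 168},
        {"prodid": "45320", "prodname": "Pump", "Qty": 1, "price": 54},
    ],
    "pcard": "Amex",
    "pcc": "3425268768",
    "pexp": "03/17",
    "ord_tot": 222,
}

# Everything needed to validate the order arrives in a single read.
computed_total = sum(item["price"] for item in order["orderitems"])
print(computed_total)                      # 222
print(computed_total == order["ord_tot"])  # True
```

In the relational model the same check would touch the Order and Line Items tables; here the line items are embedded in the order document itself.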
  • 55. MongoDB: Essentials ■  Stands for “huMONGOus DataBase” ■  Reads and Writes using memory-mapped files ▪  Try to fit the working set in memory ▪  Use SSDs for faster I/O ■  Very good index support on identified JSON fields ▪  Allows Key-Value, Range and text search queries ▪  Unique as well as Compound Indexes ▪  Special TTL (Time-to-Live) index to retire data ■  Stores documents in BSON format (Binary JSON) ■  Interact, manage, program through Mongo Shell ■  Many other drivers and interfaces ■  Support for Geospatial data and queries ■  Aggregation Framework and MapReduce support
  • 57. MongoDB: Essentials ■  Query optimizer exposes execution plan ■  Multiple sharding methods: ▪  Range-based sharding: Optimized for range queries ▪  Hash-based sharding: Ensures uniform distribution ▪  Tag-aware sharding: Partitioned by user-specified configuration ■  Write-ahead journaling ▪  Journal commits every 100ms (oplog is a capped collection) ■  Configurable Write-availability via Write Concern ▪  Unacknowledged (memory only) ▪  Acknowledgement for specific levels, e.g. write to at least 2 replicas in the same datacenter, or write to at least 1 replica in a remote datacenter ■  Commercially supported by 10gen (now called MongoDB, Inc.)
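The difference between range-based and hash-based sharding can be sketched in a few lines of Python. This illustrates the general technique only, not MongoDB's implementation: the split points in `RANGE_BOUNDS`, the shard count and the use of MD5 are all assumptions chosen for the example.

```python
import bisect
import hashlib

# Hypothetical split points: shard 0 holds keys < "h",
# shard 1 holds "h" <= key < "p", shard 2 holds the rest.
RANGE_BOUNDS = ["h", "p"]


def range_shard(key):
    """Range-based: neighbouring keys land on the same shard."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


def hash_shard(key, n_shards=3):
    """Hash-based: keys are scattered across shards by their hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


keys = ["user%03d" % i for i in range(100)]
range_placement = {range_shard(k) for k in keys}  # shards touched by a range scan
hash_placement = {hash_shard(k) for k in keys}    # shards touched under hashing
```

A range scan over `user000` to `user099` stays on one shard under range sharding, which is why it suits range queries; the same sequential keys also make that shard a write hot spot, which is the problem hash sharding addresses.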
  • 58. MongoDB: The Not-so-good… ■  Reads block Writes (albeit for very short periods, on the order of microseconds) ▪  Be careful about aggregation/MapReduce: Intense reads ▪  Read lock yields when read has to go to disk ▪  Read locks can be shared by multiple readers ■  Writes block Reads (Writer-greedy, for very short periods) ■  Locks are at a “database” level ▪  Careful with your data model! ▪  Typically restrict to one collection per database if possible ▪  Writes to multiple documents will yield periodically ■  Index creation (writes) locks your entire database ■  Replicates to Slaves and locks all slaves in the Replica Set ■  Compaction also locks the database ■  Secondaries block on replication writes
  • 59. Couchbase – Another Document Store [Diagram: Clients read/write from/to Documents, which live in Data Buckets (user/application data based on bucket partitioning), which live on Server Nodes, that form a dynamically scalable Couchbase Cluster (Multitenant Architecture)]
  • 60. Couchbase Single-Node Architecture [Diagram: a node is split into a Data Manager (data access ports 11210 / 11211, Object-managed Cache, Storage Engine, Query Engine with Query API on port 8092) and a Cluster Manager built on Erlang/OTP (REST management API/Web UI and Admin Console on port 8091 over http; Replication, Rebalance, Shard State Manager)]
  • 61. Couchbase: Background and Use cases ■  Created as a merger of code and ideas: ▪  Memcached – An excellent memory-only cache ▪  CouchDB – A Key-Value store ▪  Now a Persistent Cache ▪  Code in Erlang and C++ (??) ▪  Different ports for both products – now merging ▪  Lots of existing Memcached implementations ▪  These can now upgrade to Couchbase quickly – Moxi client ■  Used primarily as a Caching solution ▪  Very fast for reads and writes ▪  Some concerns with cross data center replication ▪  IMHO - Not yet suited for RYOW (read-your-own-writes) via secondary key
  • 62. Cassandra: Column-Family datastore [Diagram: a Client and six nodes, Node 1 to Node 6] •  Hash function(Key) => Token •  Client writes to selected Node as per Token •  Coordinator Node replicates to other nodes (timed per Quorum setting) •  Node acknowledges to coordinator •  Acknowledgement to client •  Data written to internal commit log •  If node goes offline, writes stop •  When node rejoins, a “hinted handoff” process completes the pending writes + “read repair” •  Requests can range from ANY to ALL ▪  ANY: Write to commit log on at least 1 node ▪  ALL: Writes complete to memory and commit log on ALL replicas •  Availability precedes Consistency (AP) •  Read and Write Paths are separate
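The "Hash function(Key) => Token" step can be sketched as a toy token ring in Python. This is a simplified illustration, not Cassandra's code: the `Ring` class, the one-token-per-node layout, the replication factor of 3 and the choice of MD5 (used only because it ships with the standard library) are assumptions for the example.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: one token per node, replicas on the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node owns the token derived from its own name.
        self.tokens = sorted((self.token(name), name) for name in nodes)

    @staticmethod
    def token(value):
        """Hash function(Key) => Token."""
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")

    def replicas(self, key):
        """Nodes that should hold key: the token's owner plus the next rf-1 nodes."""
        t = self.token(key)
        # First node whose token is >= the key's token; wrap around past the end.
        start = bisect.bisect_left(self.tokens, (t, ""))
        count = len(self.tokens)
        return [self.tokens[(start + i) % count][1] for i in range(self.rf)]


NODES = ["Node %d" % i for i in range(1, 7)]
ring = Ring(NODES)
owners = ring.replicas("K1")
```

Any node can act as coordinator, because every node can compute the same token and therefore the same replica list for a given key.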
  • 63. Cassandra: Column-Family datastore [Diagram: write path across Memory and Disk] ■  Writes, in order: (1) Write:(K1,{C1:V1}) (2) Write:(K1,{C2:V2}) (3) Write:(K2,{C1:V3,C2:V4}) (4) Write:(K1,{C1:V5,C3:V6}) ■  Commit log (Disk), appended in arrival order: K1 C1:V1 | K1 C2:V2 | K2 C1:V3 C2:V4 | K1 C1:V5 C3:V6 ■  Memtable (Memory): K1: C1:V1, C2:V2, then C1:V5, C3:V6 | K2: C1:V3, C2:V4 ■  SSTable with Index (Disk): K1: C2:V2, C1:V5, C3:V6 | K2: C1:V3, C2:V4
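The four writes on this slide can be replayed through a toy model of the write path. The `Node` class below is a hypothetical in-memory sketch: real commit logs and SSTables live on disk, and flush timing, compaction and indexes are left out.

```python
class Node:
    """Toy write path: append to the commit log, update the memtable, flush to an SSTable."""

    def __init__(self):
        self.commit_log = []  # append-only record of every write (on disk in a real system)
        self.memtable = {}    # row key -> {column: value}, held in memory
        self.sstables = []    # immutable snapshots, sorted by row key and column

    def write(self, key, columns):
        # Durability first: the write is logged before the memtable changes.
        self.commit_log.append((key, dict(columns)))
        # Newer column values overwrite older ones in the memtable.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        """Write the memtable out as a sorted, immutable SSTable and start afresh."""
        sstable = [
            (key, dict(sorted(columns.items())))
            for key, columns in sorted(self.memtable.items())
        ]
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log = []  # logged writes are now safely in the SSTable


node = Node()
node.write("K1", {"C1": "V1"})
node.write("K1", {"C2": "V2"})
node.write("K2", {"C1": "V3", "C2": "V4"})
node.write("K1", {"C1": "V5", "C3": "V6"})
log_length_before_flush = len(node.commit_log)
memtable_before_flush = {k: dict(v) for k, v in node.memtable.items()}
node.flush()
```

After the fourth write the memtable holds only the latest value of `C1` for `K1`, while the commit log still holds all four writes; that difference is why writes are cheap and appends are sequential.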
  • 64. Cassandra: Essentials ■  Write Path is simpler; Reads are a little more complex ▪  Merge Memtable (Row/Key cache) and Row Reads from Disk ▪  Uses Bloom Filter to decide which SSTables to skip (false positives are possible) ▪  In-memory caches are stored in Java heap (GC!!!!) ▪  Can return inconsistent data for RYOW (read-your-own-writes), depending on Quorum ▪  Consistent: (nodes_written + nodes_read) > replication_factor ■  Compaction: Merge SSTables; Expire Tombstoned data (TTL) ■  Data Modeling: ▪  Model your queries – Optimize for reads ▪  Denormalize – Reads: Slow; Writes: Fast; Disk: Cheap ▪  Column families are stored sorted by timestamp ■  CQL: Cassandra Query Language – A familiar interface ■  Maintaining the Cluster: Gossip and Snitch :)
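The consistency rule on this slide, (nodes_written + nodes_read) > replication_factor, can be checked by brute force. The Python sketch below uses hypothetical function names and assumes every write has fully reached its chosen replicas before the read starts.

```python
from itertools import combinations


def is_consistent(nodes_written, nodes_read, replication_factor):
    """The slide's rule: read and write sets are guaranteed to overlap."""
    return nodes_written + nodes_read > replication_factor


def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def always_overlap(nodes_written, nodes_read, replication_factor):
    """Try every possible write set and read set; True if all pairs share a replica."""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, nodes_written)
        for read in combinations(replicas, nodes_read)
    )


rf = 3
quorum_both = is_consistent(quorum(rf), quorum(rf), rf)  # QUORUM write + QUORUM read
one_and_one = is_consistent(1, 1, rf)                    # ONE write + ONE read
all_and_one = is_consistent(rf, 1, rf)                   # ALL write + ONE read
```

With a replication factor of 3, QUORUM writes plus QUORUM reads (2 + 2 > 3) always touch at least one replica holding the latest write, while ONE plus ONE (1 + 1 <= 3) can miss it, which is the read-your-own-writes inconsistency the slide warns about.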
  • 65. Choosing the right NoSQL database: ASCII the right question! ■  Is this a site-facing, P1 Application? ■  Is this a BI/Analytics type problem waiting to be solved? ■  Is this Write Intensive or Read Intensive? ■  Is this a Caching problem? ■  Can the application afford some data loss? ■  What about data consistency? ■  What is more important – consistency or availability? ■  How many data centers need to be supported? ■  What are the query patterns? Are they widely varying? ■  How many distinct clusters of data are present, and how are they related? ■  Is my organization ready to support this product?
  • 66. Generic problems ■  Consistency is and will be a problem in the NoSQL world ■  Data loss will be present - application should cater to this ▪  Consider the cost of workarounds/cost of data loss ■  The world of NoSQL is evolving: ▪  Maturing slowly: Past the peak -> Sliding into the trough ▪  Too many choices: 150 choices: http://nosql-database.org/ ▪  Many pick the wrong product and have to change it later (check my Delicious stream #nosql) ▪  Most NoSQL vendors are still VC funded ▪  New Versions/Features every 6 months! ▪  We will learn lessons the hard way…
  • 67. Real World problems ■  Need to break out of the RDBMS/ACID world ▪  Imagine a world with no COMMITs, no “Transactions” ▪  Data loss and Data inconsistency are inevitable ▪  Data Owners/Architects shy away: FUD, Real dangers ■  Everyone wants to become (or is!) a NoSQL expert ▪  Spell NoSQL and earn $$$ :) ▪  Best way to learn: Create a “Big Data” need and fulfill it ▪  Who makes the decisions? ■  Lack of skills and maturity ▪  Product choice: Knowledge/Experience/Forethought required ▪  Many NoSQL products are still basic in functionality ▪  Be prepared to back out of your initial choice
  • 68. How to get there (from here)? ■  This presentation is just the beginning ■  Lots and lots of reading and experimenting required ■  Recommended Reading: ▪  NoSQL Distilled by Fowler and Sadalage ▪  Seven Databases in Seven Weeks by Redmond and Wilson ▪  Many NoSQL books – browse at Safari Online ■  Lots of links to read – Live links: ▪  Follow me on http://delicious.com/jkanagaraj - Tag #nosql ■  Play with the community versions: ▪  Available from the vendors: No support though ▪  Spin up/use Cloud-based VMs – Rackspace or AWS
  • 69. A warning – And some advice “Some people, when confronted with a big data problem, think, I’ll use Hadoop. Now, they have a big data problem and a big Hadoop cluster” Dmitry Ryaboy, Engineering Manager, Twitter ▪  Start small ▪  Grow with success ▪  Create your own expertise ▪  It is about the untapped potential in your data
  • 70. Please fill in the feedback form! Link up with me on LinkedIn. John Kanagaraj, PayPal, an eBay Inc. Company