SlideShare a Scribd company logo
1 of 43
BigData As A Platform
Cassandra and current trends

Matthew F. Dennis // @mdennis
8th Advanced Computing Conference
September 12, 2012
Seoul, South Korea
Why Does BigData Matter To Business?
Effective Use of BigData Leads To Success
And Businesses Have Noticed …
     (according to indeed.com job trends)
Requirements Of A BigData Platform


●   Performance


●   Scalability


●   Availability
Measuring Performance


●   throughput (in ops/sec)



●   latency (for a single request)
Uncommon Cassandra Performance Features

●   No Locking

●   Log Structured Storage Engine

●   Highly Parallel

●   Immutable On Disk Structures

●   On Disk Compression

●   No “Master” Nodes
Cassandra Throughput
Cassandra Latency
                                           (in microseconds)




http://www.slideshare.net/adrianco/cassandra-performance-on-aws
Cassandra + SSD


But can you get such low latency and high
throughput for random reads from disk?


With Cassandra + SSDs, YES!
(SSD latency is usually only ~100 us)
A Random Note About C* and SSDs

     ●   Cassandra can use cheap consumer grade MLC
         SSDs (~$1.00 USD / GB)
     ●   no in-place updates results in far fewer erase
         cycles on the drive which results in the drive
         lasting longer
     ●   Compared to nearly all other databases,
         consumer SSDs last ~10x longer on C*
     ●   to put it another way MLC drives last about as
         long with C* as enterprise SLC drives last with
         most other databases

http://www.anandtech.com/show/5518/a-look-at-enterprise-performance-of-intel-ssds
Why Not On Rotational Disks Too?


●   rotational disks require ~8ms per seek
●   note that this is a HW limitation, an
    absolute upper limit (for that HW)
●   no system can do better than the seek
    time when randomly retrieving data from
    disk (and most do far worse)
What About Writes/Updates?

●   all write I/O in Cassandra is sequential
●   no global write lock
●   no Btrees
●   compare to MySQL, BerkeleyDB, MongoDB,
    Oracle, et cetera which either lock (sometimes
    with a global lock) and/or generate random
    writes for updates
●   locking is not the only way to handle
    concurrency !!!
Larger Than Memory Datasets

●   write performance degrades only
    marginally as the dataset outgrows
    memory; essentially no change in latency
    or throughput

●   read performance degrades gracefully and
    is relative to the percent of data in memory
Most Systems Do Not Behave That Way
Performance Compared to HBase

●   10x better read throughput
●   8x better write throughput
●   8x better read latency
●   10x better write latency
    (even when HBase was running without durability)


    University of Toronto, Canada
    Middleware Systems Research Group, et al
    38th International Conference On Very Large Data Bases
    http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
And We’re Not Even Finished Yet ...


●   native CQL transport layer
●   unified off-heap row and key caches
●   vnodes
●   no read-before-write on secondary indexes
●   native JBOD support
●   straight up profiler driven optimizations
Measuring Scalability


If performance is measured in throughput
and latency, then scalability is the stability
of latency as throughput increases (or the
stability of latency and throughput as “load”
increases); essentially scalability is how
well a system handles growth
Linear Scalability

  If for all values of X, Y, Z, C and N:

latency=X                   latency=X
throughput=Y                throughput=Y
“load”=Z          implies   “load”=CZ
nodes=N                     nodes=CN

 Then: the system is perfectly linearly
       scalable with respect to “load”
Linear Scalability

   If for all values of X, Y, Z, C and N:

latency=X                    latency=X
throughput=Y                 throughput=CY
“load”=Z           implies   “load”=Z
nodes=N                      nodes=CN

Then: the system is perfectly linearly
      scalable with respect to throughput
Cassandra Scalability


“In terms of scalability, there is a clear winner throughout our
experiments. Cassandra achieves the highest throughput for the
maximum number of nodes in all experiments with [nearly linear]
increasing throughput from 1 to 12 nodes.”


University of Toronto, Canada
Middleware Systems Research Group, et al
38th International Conference On Very Large Data Bases
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Cassandra Scalability

                reads: 25%
                scans: 25%
                writes: 50%




University of Toronto, Canada
Middleware Systems Research Group, et al
38th International Conference On Very Large Data Bases
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Cassandra Scalability
                      (client operations per second at RF=3)




http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Requirements Of A BigData Platform

    Performance


    Scalability


●Availability – performance and scalability
are easy if you ignore availability and/or
assume failures never happen
Measuring Availability




availability is measured by the amount of
downtime a system has over a given
period of time
Common Causes of Downtime
             In Large Scale Systems

●   Component Failure (disk)

●   Machine Failure (NIC, cpu, power supply)

●   Rack Failure (router, switch, UPS, AC)

●   Site Failure (power grid, natural disaster, war, coup)
The Common Theme In The Solutions?


            Replication
Legacy Replication




can be HW or SW, replicated or not
  this is not necessarily a SPOF
Legacy Replication




          SPOF                SPOF   SPOF   SPOF




the master/leader is a SPOF
Thoughts On Availability
           (and legacy replication)
“High availability implies that a single fault will not
bring down your system. Not ‘we’ll recover quickly.’”
                                  -- Ben Coverston, DataStax




“The biggest problem with failover is that you’re
almost never using it until it really hurts. Its like
backups that you never test.”
                                   -- Rick Branson, Instagram
Cassandra Replication


cloud provider 0         cloud provider 1




 data center 0             data center 1
More Thoughts On Availability
  (and legacy replication)
Continuous Global Availability ...




… requires more than being able to
recover from faults, it requires
being able to tolerate the faults
without downtime in the first place
if you care about
     continuous global availability
then you must serve reads and writes from
    multiple geographical locations


         there is no alternative
Cassandra As A BigData Platform



Performance


Scalability


Availability
And now back to our trends ...
Public Cassandra Users Early 2011
Public Cassandra Users Mid 2012
DataStax Enterprise
DataStax Enterprise
Q?
Matthew F. Dennis // @mdennis
http://slideshare.net/mattdennis
Thank You!
 Matthew F. Dennis // @mdennis
 http://slideshare.net/mattdennis
A Quick Side Note ...

               Cassandra Replication Follows
                   The Dynamo Model *


      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html




                      Read It!



* Cassandra is not a strict reimplementation of dynamo

More Related Content

What's hot

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
Rick Branson
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
JAXLondon2014
 

What's hot (20)

Cassandra at scale
Cassandra at scaleCassandra at scale
Cassandra at scale
 
Cassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on FireCassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on Fire
 
PostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version PGConf.US 2017 by Ilya KosmodemianskyPostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
 
Building Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache CassandraBuilding Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache Cassandra
 
Redis on NVMe SSD - Zvika Guz, Samsung
 Redis on NVMe SSD - Zvika Guz, Samsung Redis on NVMe SSD - Zvika Guz, Samsung
Redis on NVMe SSD - Zvika Guz, Samsung
 
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?
 
Hybrid Storage Pools (Now with the benefit of hindsight!)
Hybrid Storage Pools (Now with the benefit of hindsight!)Hybrid Storage Pools (Now with the benefit of hindsight!)
Hybrid Storage Pools (Now with the benefit of hindsight!)
 
Why does my choice of storage matter with cassandra?
Why does my choice of storage matter with cassandra?Why does my choice of storage matter with cassandra?
Why does my choice of storage matter with cassandra?
 
CaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value StoreCaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value Store
 
Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogical
 
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
 
Scylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDS
 
PostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverPostgreSQL Scaling And Failover
PostgreSQL Scaling And Failover
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive Analytics
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
 
Ceph as storage for CloudStack
Ceph as storage for CloudStack Ceph as storage for CloudStack
Ceph as storage for CloudStack
 

Viewers also liked

Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
Matthew Dennis
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
Matthew Dennis
 

Viewers also liked (19)

durability, durability, durability
durability, durability, durabilitydurability, durability, durability
durability, durability, durability
 
DZone Cassandra Data Modeling Webinar
DZone Cassandra Data Modeling WebinarDZone Cassandra Data Modeling Webinar
DZone Cassandra Data Modeling Webinar
 
Cassandra, Modeling and Availability at AMUG
Cassandra, Modeling and Availability at AMUGCassandra, Modeling and Availability at AMUG
Cassandra, Modeling and Availability at AMUG
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svcc
 
Cassandra and Hybrid Cloud - Introducing Mache
Cassandra and Hybrid Cloud - Introducing MacheCassandra and Hybrid Cloud - Introducing Mache
Cassandra and Hybrid Cloud - Introducing Mache
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
 
Validation driven change
Validation driven changeValidation driven change
Validation driven change
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache Cassandra
 
Cabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoCabs, Cassandra, and Hailo
Cabs, Cassandra, and Hailo
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
 
Learning Cassandra
Learning CassandraLearning Cassandra
Learning Cassandra
 
Generic VNF configuration management and orchestration
Generic VNF configuration management and orchestrationGeneric VNF configuration management and orchestration
Generic VNF configuration management and orchestration
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systems
 

Similar to BigData as a Platform: Cassandra and Current Trends

How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
Vinay Kumar Chella
 
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Matt Stubbs
 
Storage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewStorage, San And Business Continuity Overview
Storage, San And Business Continuity Overview
Alan McSweeney
 

Similar to BigData as a Platform: Cassandra and Current Trends (20)

Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
Scaling Your Database In The Cloud
Scaling Your Database In The CloudScaling Your Database In The Cloud
Scaling Your Database In The Cloud
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
 
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and Limitations
 
Compare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerCompare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL Server
 
CodeFutures - Scaling Your Database in the Cloud
CodeFutures - Scaling Your Database in the CloudCodeFutures - Scaling Your Database in the Cloud
CodeFutures - Scaling Your Database in the Cloud
 
Cassandra
CassandraCassandra
Cassandra
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflix
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra Cluster
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
 
Why new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases fasterWhy new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases faster
 
NoSQL Database
NoSQL DatabaseNoSQL Database
NoSQL Database
 
Storage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewStorage, San And Business Continuity Overview
Storage, San And Business Continuity Overview
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

BigData as a Platform: Cassandra and Current Trends

  • 1. BigData As A Platform Cassandra and current trends Matthew F. Dennis // @mdennis 8th Advanced Computing Conference September 12, 2012 Seoul, South Korea
  • 2. Why Does BigData Matter To Business?
  • 3. Effective Use of BigData Leads To Success
  • 4. And Businesses Have Noticed … (according to indeed.com job trends)
  • 5. Requirements Of A BigData Platform ● Performance ● Scalability ● Availability
  • 6. Measuring Performance ● throughput (in ops/sec) ● latency (for a single request)
  • 7. Uncommon Cassandra Performance Features ● No Locking ● Log Structured Storage Engine ● Highly Parallel ● Immutable On Disk Structures ● On Disk Compression ● No “Master” Nodes
  • 9. Cassandra Latency (in microseconds) http://www.slideshare.net/adrianco/cassandra-performance-on-aws
  • 10. Cassandra + SSD But can you get such low latency and high throughput for random reads from disk? With Cassandra + SSDs, YES! (SSD latency is usually only ~100 us)
  • 11. A Random Note About C* and SSDs ● Cassandra can use cheap consumer grade MLC SSDs (~$1.00 USD / GB) ● no in-place updates results in far fewer erase cycles on the drive which results in the drive lasting longer ● Compared to nearly all other databases, consumer SSDs last ~10x longer on C* ● to put it another way MLC drives last about as long with C* as enterprise SLC drives last with most other databases http://www.anandtech.com/show/5518/a-look-at-enterprise-performance-of-intel-ssds
  • 12. Why Not On Rotational Disks Too? ● rotational disks require ~8ms per seek ● note that this is a HW limitation, an absolute upper limit (for that HW) ● no system can do better than the seek time when randomly retrieving data from disk (and most do far worse)
  • 13. What About Writes/Updates? ● all write I/O in Cassandra is sequential ● no global write lock ● no Btrees ● compare to MySQL, BerkeleyDB, MongoDB, Oracle, et cetera which either lock (sometimes with a global lock) and/or generate random writes for updates ● locking is not the only way to handle concurrency !!!
  • 14. Larger Than Memory Datasets ● write performance degrades only marginally as the dataset outgrows memory; essentially no change in latency or throughput ● read performance degrades gracefully and is relative to the percent of data in memory
  • 15. Most Systems Do Not Behave That Way
  • 16. Performance Compared to HBase ● 10x better read throughput ● 8x better write throughput ● 8x better read latency ● 10x better write latency (even when HBase was running without durability) University of Toronto, Canada Middleware Systems Research Group, et al 38th International Conference On Very Large Data Bases http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 17. And We’re Not Even Finished Yet ... ● native CQL transport layer ● unified off-heap row and key caches ● vnodes ● no read-before-write on secondary indexes ● native JBOD support ● straight up profiler driven optimizations
  • 18. Measuring Scalability If performance is measured in throughput and latency, then scalability is the stability of latency as throughput increases (or the stability of latency and throughput as “load” increases); essentially scalability is how well a system handles growth
  • 19. Linear Scalability If for all values of X, Y, Z, C and N: latency=X latency=X throughput=Y throughput=Y “load”=Z implies “load”=CZ nodes=N nodes=CN Then: the system is perfectly linearly scalable with respect to “load”
  • 20. Linear Scalability If for all values of X, Y, Z, C and N: latency=X latency=X throughput=Y throughput=CY “load”=Z implies “load”=Z nodes=N nodes=CN Then: the system is perfectly linearly scalable with respect to throughput
  • 21. Cassandra Scalability “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with [nearly linear] increasing throughput from 1 to 12 nodes.” University of Toronto, Canada Middleware Systems Research Group, et al 38th International Conference On Very Large Data Bases http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 22. Cassandra Scalability reads: 25% scans: 25% writes: 50% University of Toronto, Canada Middleware Systems Research Group, et al 38th International Conference On Very Large Data Bases http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 23. Cassandra Scalability (client operations per second at RF=3) http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  • 24. Requirements Of A BigData Platform Performance Scalability ●Availability – performance and scalability are easy if you ignore availability and/or assume failures never happen
  • 25. Measuring Availability availability is measured by the amount of downtime a system has over a given period of time
  • 26. Common Causes of Downtime In Large Scale Systems ● Component Failure (disk) ● Machine Failure (NIC, cpu, power supply) ● Rack Failure (router, switch, UPS, AC) ● Site Failure (power grid, natural disaster, war, coup)
  • 27. The Common Theme In The Solutions? Replication
  • 28. Legacy Replication can be HW or SW, replicated or not this is not necessarily a SPOF
  • 29. Legacy Replication SPOF SPOF SPOF SPOF the master/leader is a SPOF
  • 30. Thoughts On Availability (and legacy replication) “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston, DataStax “The biggest problem with failover is that you’re almost never using it until it really hurts. Its like backups that you never test.” -- Rick Branson, Instagram
  • 31. Cassandra Replication cloud provider 0 cloud provider 1 data center 0 data center 1
  • 32. More Thoughts On Availability (and legacy replication)
  • 33. Continuous Global Availability ... … requires more than being able to recover from faults, it requires being able to tolerate the faults without downtime in the first place
  • 34. if you care about continuous global availability then you must serve reads and writes from multiple geographical locations there is no alternative
  • 35. Cassandra As A BigData Platform Performance Scalability Availability
  • 36. And now back to our trends ...
  • 41. Q? Matthew F. Dennis // @mdennis http://slideshare.net/mattdennis
  • 42. Thank You! Matthew F. Dennis // @mdennis http://slideshare.net/mattdennis
  • 43. A Quick Side Note ... Cassandra Replication Follows The Dynamo Model * http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Read It! * Cassandra is not a strict reimplementation of dynamo