6. SQL Specialized data structures (think B-trees) Shines with complicated queries Focus on fast query & analysis Not necessarily on large datasets
14. Outline History Scaling Replication Model Data Model Tuning Write Path Read Path Client Access Practical Considerations
15. Distributed and Scalable Horizontal! All nodes are identical No master or SPOF Adding nodes is simple Automatic cluster maintenance
16. Outline History Scaling Replication Model Data Model Tuning Write Path Read Path Client Access Practical Considerations
17. Replication Replication factor How many nodes data is replicated on Consistency level Zero, One, Quorum, All Sync or async for writes Reliability of reads Read repair
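The replication and consistency knobs above interact arithmetically: a quorum is a majority of the replication factor, and quorum reads plus quorum writes always overlap on at least one replica. A minimal sketch of that arithmetic (plain Python, illustrative only, not Cassandra's actual API):

```python
# Sketch of how consistency levels map to replica counts for a given
# replication factor (RF). Illustrative only; not Cassandra's API.

def replicas_required(level: str, rf: int) -> int:
    """Number of replicas that must acknowledge an operation."""
    return {
        "ZERO": 0,              # fire and forget
        "ONE": 1,
        "QUORUM": rf // 2 + 1,  # a majority of the replicas
        "ALL": rf,
    }[level]

def overlapping(read_level: str, write_level: str, rf: int) -> bool:
    """Reads see the latest write whenever R + W > RF."""
    return (replicas_required(read_level, rf)
            + replicas_required(write_level, rf)) > rf
```

With RF=3, QUORUM is 2 replicas, so QUORUM reads combined with QUORUM writes overlap, while ONE+ONE does not.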
18. Ring Topology RF=3 Conceptual Ring One token per node Multiple ranges per node [diagram: nodes a, d, g, j on the ring]
19. Ring Topology RF=2 Conceptual Ring One token per node Multiple ranges per node [diagram: nodes a, d, g, j on the ring]
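One way to picture the token ring: each node owns the key range ending at its token, and a key's replicas are that node plus the next RF-1 nodes clockwise. A toy sketch (the node names echo the slide's diagram; the token values and 0..100 token space are invented for illustration):

```python
import bisect
import hashlib

# Toy token ring: a key hashes to a token, the first node with a token
# >= that value owns it, and replicas continue clockwise around the ring.
# Token values and the 0..100 token space are made up for illustration.
RING = [(25, "a"), (50, "d"), (75, "g"), (100, "j")]  # sorted by token
TOKENS = [t for t, _ in RING]

def token_for(key: str) -> int:
    """Hash a key onto the toy 0..100 token space."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 101

def replicas(key: str, rf: int = 3) -> list[str]:
    i = bisect.bisect_left(TOKENS, token_for(key))  # first token >= hash
    return [RING[(i + k) % len(RING)][1] for k in range(rf)]
```

This also shows why a new node's arrival only affects its immediate neighbors: inserting one token splits exactly one existing range.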
20. New Node RF=3 Token assignment Range adjustment Bootstrap Arrival only affects immediate neighbors [diagram: new node m joins nodes a, d, g, j]
21. Ring Partition RF=3 Node dies Available? Hinted handoff Achtung! Plan for this [diagram: nodes a, d, g, j]
22. Outline History Scaling Replication Model Data Model Tuning Write Path Read Path Client Access Practical Considerations
23. Schema-free Sparse-table Flexible column naming You define the sort order A row is not required to have a specific column just because another row does
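A sparse table behaves roughly like a map of maps: each row carries only the columns it was given, and columns come back in a defined sort order. A rough Python analogy (not the Thrift API or the real storage format):

```python
# Rough analogy for the sparse, schema-free data model: a table of rows,
# each row an independently named set of columns returned in sorted order.
# Pure-Python illustration; not Cassandra's actual storage model.
table: dict[str, dict[str, str]] = {}

def insert(row_key: str, column: str, value: str) -> None:
    table.setdefault(row_key, {})[column] = value

def columns(row_key: str) -> list[str]:
    # "You define the sort order": here, plain lexical ordering.
    return sorted(table.get(row_key, {}))

insert("user:1", "name", "alice")
insert("user:1", "email", "a@example.com")
insert("user:2", "last_login", "2010-05-01")  # different columns per row
```

Note that `user:2` never declares, or defaults, the columns `user:1` has: absent columns simply take no space.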
37. Inserting: Writes Commit log for durability Configurable fsync Sequential writes only Memtable – no disk access (no reads or seeks) SSTables are final (become read only) Indexes Bloom filter Raw data Bottom line: FAST!!!
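The write sequence above can be sketched end to end: append to the commit log, update the in-memory memtable, and flush full memtables to immutable sstables. A simplified model (the class, the size-based flush trigger, and the field names are illustrative, not the real implementation):

```python
# Simplified model of the write path: sequential commit-log append,
# in-memory memtable update, flush to an immutable sstable list.
# An illustration only; not Cassandra's actual code.

class ToyStore:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log: list[tuple[str, str]] = []  # durability: append-only
        self.memtable: dict[str, str] = {}           # no disk reads or seeks
        self.sstables: list[dict[str, str]] = []     # flushed, read-only
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # sequential write only
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # The flushed memtable becomes a final (read-only) sstable.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
```

Every step is either an append or an in-memory update, which is why the bottom line is "FAST".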
38. Outline History Scaling Replication Model Data Model Tuning Write Path Read Path Client Access Practical Considerations
39. Querying: Overview You need a key or keys: Single: key=‘a’ Range: key=‘a’ through ‘f’ And columns to retrieve: Slice: cols={bar through kite} By name: key=‘b’ cols={bar, cat, llama} Nothing like SQL “WHERE col=‘faz’” But secondary indices are being worked on (see CASSANDRA-749)
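Both access patterns on the slide reduce to lookups over a row's sorted columns. A sketch of slice and by-name retrieval (the row data is hypothetical; this is plain Python, not the Thrift API):

```python
# Sketch of column retrieval: a slice over a column-name range, or
# fetching specific columns by name. Hypothetical data, not the real API.
row = {"bar": 1, "cat": 2, "kite": 3, "llama": 4}  # columns sorted by name

def slice_columns(row: dict, start: str, finish: str) -> dict:
    """Return columns with start <= name <= finish (cols={bar through kite})."""
    return {c: v for c, v in sorted(row.items()) if start <= c <= finish}

def by_name(row: dict, names: list[str]) -> dict:
    """Return only the named columns (cols={bar, cat, llama})."""
    return {c: row[c] for c in names if c in row}
```

Note what is missing: there is no way to ask for rows *by value*, which is exactly the SQL `WHERE col=‘faz’` gap the slide mentions.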
40. Querying: Reads Practically lock free SSTable proliferation New in 0.6: Row cache (avoid sstable lookup, not write-through) Key cache (avoid index scan)
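The row cache short-circuits the read path: on a hit, the sstables are never consulted at all. A rough sketch of that behavior (illustrative data structures, not the real read path):

```python
# Rough sketch of the row-cache short-circuit (illustrative only):
# on a cache hit, no sstable is touched; on a miss, sstables are
# searched newest-first and the result is cached for next time.
sstables = [{"k1": "v1"}, {"k2": "v2"}]  # flushed tables, newest last
row_cache: dict[str, str] = {}

def read(key: str):
    if key in row_cache:
        return row_cache[key]           # hit: skip sstable lookup entirely
    for sstable in reversed(sstables):  # miss: scan newest-first
        if key in sstable:
            row_cache[key] = sstable[key]
            return sstable[key]
    return None
```

The "not write-through" caveat on the slide means writes do not update this cache; the sketch only populates it on reads.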
41. Outline History Scaling Replication Model Data Model Tuning Write Path Read Path Client Access Practical Considerations
42. Client API (Low Level) Fat Client Live non-storage node Reduced RPC overhead Thrift (12 language bindings!) http://incubator.apache.org/thrift/ No streaming Avro Work in progress
45. Outline History Scaling Replication Model Data Model Tuning Write Path Read Path Client Access Practical Considerations
46. Practical Considerations Partitioner: Random or Order-Preserving Range queries Provisioning: virtual or bare metal Cluster size Data model: think in terms of access patterns Giving up transactions, ad-hoc queries, arbitrary indexes and joins (you may already do this with an RDBMS!)
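The partitioner choice decides whether range queries make sense: an order-preserving partitioner keeps keys in sort order around the ring, so a key range maps to a contiguous ring range; a random (hash) partitioner scatters them. A toy comparison (the token functions are illustrative stand-ins, not the real RandomPartitioner or OrderPreservingPartitioner):

```python
import hashlib

# Toy contrast between partitioner styles. These token functions are
# illustrative stand-ins, not Cassandra's actual partitioners.

def random_token(key: str) -> int:
    """Hash-based token: uniform load, but key order is destroyed."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def order_preserving_token(key: str) -> str:
    """Identity token: key ranges map to ring ranges, enabling range scans."""
    return key

keys = ["apple", "banana", "cherry"]
# Order-preserving: sorted keys yield sorted tokens -> range scans work.
op_sorted = ([order_preserving_token(k) for k in keys]
             == sorted(order_preserving_token(k) for k in keys))
```

The trade-off is load balance: identity tokens inherit any skew in the key distribution, which is why choosing the partitioner up front is a real provisioning decision.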
48. Future Direction Vector clocks (server side conflict resolution) Alter keyspace/column families on a live cluster Compression Multi-tenant features Less memory restrictions
49. Wrapping Up Use Cassandra if you want/need High write throughput Near-linear scalability Automated replication/fault tolerance Can tolerate missing RDBMS features
32-core machines are expensive. Costs go way up when you try to scale these databases. Also: instability.
Terabytes of data, ~1,000,000 ops/second. Schema changes are difficult (if not impossible). Manual sharding takes a lot of effort. Automated sharding + replication is difficult.
100 M users, 25 TB data
Horizontal – commodity hardware, not specialized boxes
Cluster is a logical storage ring. Node placement divides the ring into ranges that represent start/stop points for keys. Automatic or manual token assignment (use another slide for that). Closer together means less responsibility and data.
Token
Bootstrapping
Hinted handoff is not designed for long failures.
RDBMSs focus on consistency, which limits scale.
No multi-key transactions
SSTable proliferation degrades performance.
Distributed. Scalable. Schema-free. Sparse table. Eventually consistent. Tunable (throughput and fault-tolerance).