SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Prepared by,
Pooja G.V
8th sem, CSE
GECH.
• Cassandra is a distributed database from Apache
that is highly scalable and designed to manage
very large amounts of structured data.
•It provides high availability with no single point of
failure.
• Open-source database
management system
(DBMS)
• Several key features of
Cassandra differentiate it
from other similar systems
Continued.....
History of Cassandra
● Cassandra was created to power
the Facebook Inbox Search
● Facebook open-sourced Cassandra in 2008 and
became an Apache Incubator project
● In 2010, Cassandra graduated to a top-level
project, regular update and releases followed
Motivation and Function
● Designed to handle large amount of data across
multiple servers
● There is a lot of unorganized data out there
● Easy to implement and deploy
● Mimics traditional relational database systems, but
with triggers and lightweight transactions
● Raw, simple data structures
General Design Features
Emphasis on performance over analysis
● Still has support for analysis tools such as Hadoop
Organization
● Rows are organized into tables
● First component of a table’s primary key is the partition key
● Rows are clustered by the remaining columns of the key
● Columns may be indexed separately from the primary key
● Tables may be created, dropped, altered at runtime without blocking
queries
Language
● CQL (Cassandra Query Language) introduced, similar to SQL (flattened
learning curve)
Peer to Peer Cluster
• Decentralized design
• No single point of failure
• No bottlenecking
Fault Tolerance/Durability
Failures happen all the time with multiple nodes
● Hardware Failure
● Bugs
● Operator error
● Power Outage, etc.
Solution: Buy cheap, redundant hardware,
replicate, maintain consistency
Fault Tolerance/Durability
• Replication
• Distribution of data to multiple data centers
Performance
• Core architectural designs allow Cassandra to
outperform its competitors
• Very good read and write throughputs
Scalability
Read and write throughput increase linearly as more machines are added
“In terms of scalability, there is a clear winner throughout our experiments.
Cassandra achieves the highest throughput for the maximum number of
nodes…” - University of Toronto
Comparisons
Apache Cassandra Google Big Table Amazon DynamoDB
Storage Type Column Column Key-Value
Best Use Write often, read less Designed for large
scalability
Large database solution
Concurrency Control MVCC Locks ACID
Characteristics High Availability
Partition Tolerance
Persistence
Consistency
High Availability
Partition Tolerance
Persistence
Consistency
High Availability
Cassandra Use Cases
Netflix
• online DVD and Blu-Ray
movie retailer
• Nielsen study showed
that 38% of Americans
use or subscribe to
Netflix
Netflix: Why Cassandra
● Using a central SQL database negatively
impacted scalability and availability
● International Expansion required Multi-
Datacenter solution
● Need for configurable Replication, Consistency,
and Resiliency in the face of failure
● Cassandra on AWS offered high levels of
scalability and availability
Jason Brown,
Senior Software
Engineer at Netflix
Hulu
• a website and a
subscription service
offering on-demand
streaming video media
• ~30 million unique
viewers per month
Hulu: Why Cassandra
• need for Availability
• need for Scalability
• Good Performance
• Nearly Linear Scalability
• Geo-Replication
• Minimal Maintenance Requirements
Andres Rangel,
Senior Software
Engineer at Hulu
Reasons for Choosing Cassandra
• Value availability over
consistency
• Require high write-throughput
• High scalability required
• No single point of failure
CAP Theorem
Cassandra’s Data Model
• Cassandra is a column oriented
NoSQL system
• Column families: sets of key-
value pairs
• A row is a collection of columns
labeled with a name
Key-Value Model
Cassandra Row
• the value of a row is itself a
sequence of key-value pairs
• such nested key-value pairs are
columns
• key = column name
• a row must contain at least 1
column
Example of Columns
Column names storing values
• key: User ID
• column names store
tweet ID values
• values of all column
names are set to “-”
(empty byte array) as
they are not used
• A Key Space is a group of
column families together. It
is only a logical grouping of
column families and
provides an isolated scope
for names
Key Space
Comparing Cassandra (C*) and RDBMS
• with RDBMS, a normalized data model is created
without considering the exact queries
• with C*, the data model is designed for specific
queries
• C*: NO joins, relationships, or foreign keys
Cassandra Query Language - CQL
• creating a keyspace - namespace of tables
CREATE KEYSPACE demo
WITH replication = {‘class’: ’SimpleStrategy’,
replication_factor’: 3};
• to use namespace:
USE demo;
Cassandra Query Language - CQL
• creating tables:
CREATE TABLE users( CREATE TABLE tweets(
email varchar, email varchar,
bio varchar, time_posted timestamp,
birthday timestamp, tweet varchar,
active boolean, PRIMARY KEY (email,
time_posted));
PRIMARY KEY (email));
Cassandra Query Language - CQL
• inserting data
INSERT INTO users (email, bio, birthday, active)
VALUES (‘john.doe@bti360.com’, ‘BT360 Teammate’,
516513600000, true);
Cassandra Query Language - CQL
• querying tables
• SELECT expression reads one or more records from
Cassandra column family and returns a result-set of rows
SELECT * FROM users;
SELECT email FROM users WHERE active = true;
Cassandra: Conclusion
• perfect for time-series data
• high performance
• Decentralization
• nearly linear scalability
• replication support
• no single points of failure
• MapReduce support
Cassandra Advantages
Cassandra Weaknesses
● no referential integrity
● querying options for retrieving data are limited
● sorting data is a design decision
● no support for atomic operations
● first think about queries, then about data model
Cassandra: Points to Consider
● Cassandra is designed as a distributed database
management system
● Cassandra write performance is always excellent, but read
performance depends on write patterns
● having a high-level understanding of some internals is a plus
References
• Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured
storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
• Hewitt, Eben. Cassandra: the definitive guide. O'Reilly Media, 2010.
• http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/a
rchitectureTOC.html
• http://www.slideshare.net/planetcassandra/a-deep-dive-into-understanding-
apache-cassandra
• http://www.slideshare.net/DataStax/evaluating-apache-cassandra-as-a-cloud-
database
• http://planetcassandra.org/functional-use-cases/
• http://marsmedia.info/en/cassandra-pros-cons-and-model.php
• http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-
cassandra
• http://wiki.apache.org/cassandra/CassandraLimitations

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Azure storage
Azure storageAzure storage
Azure storage
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Cassandra
CassandraCassandra
Cassandra
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Survey of High Performance NoSQL Systems
Survey of High Performance NoSQL SystemsSurvey of High Performance NoSQL Systems
Survey of High Performance NoSQL Systems
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Azure redis cache
Azure redis cacheAzure redis cache
Azure redis cache
 

Andere mochten auch

Andere mochten auch (9)

Cassandra devoxx 2010
Cassandra devoxx 2010Cassandra devoxx 2010
Cassandra devoxx 2010
 
Cassandra - Deep Dive ...
Cassandra - Deep Dive ...Cassandra - Deep Dive ...
Cassandra - Deep Dive ...
 
Cassandra - A Decentralized Structured Storage System
Cassandra - A Decentralized Structured Storage SystemCassandra - A Decentralized Structured Storage System
Cassandra - A Decentralized Structured Storage System
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
 
Apache Cassandra 2.0
Apache Cassandra 2.0Apache Cassandra 2.0
Apache Cassandra 2.0
 
Cassandra architecture
Cassandra architectureCassandra architecture
Cassandra architecture
 
Cassandra - Research Paper Overview
Cassandra - Research Paper OverviewCassandra - Research Paper Overview
Cassandra - Research Paper Overview
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
 
Shall we play a game?
Shall we play a game?Shall we play a game?
Shall we play a game?
 

Ähnlich wie Cassandra

Ähnlich wie Cassandra (20)

cassandra_presentation_final
cassandra_presentation_finalcassandra_presentation_final
cassandra_presentation_final
 
Appache Cassandra
Appache Cassandra  Appache Cassandra
Appache Cassandra
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
The No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelThe No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra Model
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introduction
 
Cassandra implementation for collecting data and presenting data
Cassandra implementation for collecting data and presenting dataCassandra implementation for collecting data and presenting data
Cassandra implementation for collecting data and presenting data
 
Presentation of Apache Cassandra
Presentation of Apache Cassandra Presentation of Apache Cassandra
Presentation of Apache Cassandra
 
Big data stores
Big data  storesBig data  stores
Big data stores
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Introduction to NoSQL CassandraDB
Introduction to NoSQL CassandraDBIntroduction to NoSQL CassandraDB
Introduction to NoSQL CassandraDB
 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Management
 
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
Cassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series Modeling
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
NoSQL and MongoDB
NoSQL and MongoDBNoSQL and MongoDB
NoSQL and MongoDB
 
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
Cassandra Architecture FTW
Cassandra Architecture FTWCassandra Architecture FTW
Cassandra Architecture FTW
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Cassandra

  • 1. Prepared by, Pooja G.V 8th sem, CSE GECH.
  • 2.
  • 3. • Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. •It provides high availability with no single point of failure.
  • 4. • Open-source database management system (DBMS) • Several key features of Cassandra differentiate it from other similar systems Continued.....
  • 5. History of Cassandra ● Cassandra was created to power the Facebook Inbox Search ● Facebook open-sourced Cassandra in 2008 and became an Apache Incubator project ● In 2010, Cassandra graduated to a top-level project, regular update and releases followed
  • 6. Motivation and Function ● Designed to handle large amount of data across multiple servers ● There is a lot of unorganized data out there ● Easy to implement and deploy ● Mimics traditional relational database systems, but with triggers and lightweight transactions ● Raw, simple data structures
  • 7. General Design Features Emphasis on performance over analysis ● Still has support for analysis tools such as Hadoop Organization ● Rows are organized into tables ● First component of a table’s primary key is the partition key ● Rows are clustered by the remaining columns of the key ● Columns may be indexed separately from the primary key ● Tables may be created, dropped, altered at runtime without blocking queries Language ● CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve)
  • 8. Peer to Peer Cluster • Decentralized design • No single point of failure • No bottlenecking
  • 9. Fault Tolerance/Durability Failures happen all the time with multiple nodes ● Hardware Failure ● Bugs ● Operator error ● Power Outage, etc. Solution: Buy cheap, redundant hardware, replicate, maintain consistency
  • 10. Fault Tolerance/Durability • Replication • Distribution of data to multiple data centers
  • 11. Performance • Core architectural designs allow Cassandra to outperform its competitors • Very good read and write throughputs
  • 12. Scalability Read and write throughput increase linearly as more machines are added “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes…” - University of Toronto
  • 13. Comparisons Apache Cassandra Google Big Table Amazon DynamoDB Storage Type Column Column Key-Value Best Use Write often, read less Designed for large scalability Large database solution Concurrency Control MVCC Locks ACID Characteristics High Availability Partition Tolerance Persistence Consistency High Availability Partition Tolerance Persistence Consistency High Availability
  • 15. Netflix • online DVD and Blu-Ray movie retailer • Nielsen study showed that 38% of Americans use or subscribe to Netflix
  • 16. Netflix: Why Cassandra ● Using a central SQL database negatively impacted scalability and availability ● International Expansion required Multi- Datacenter solution ● Need for configurable Replication, Consistency, and Resiliency in the face of failure ● Cassandra on AWS offered high levels of scalability and availability Jason Brown, Senior Software Engineer at Netflix
  • 17. Hulu • a website and a subscription service offering on-demand streaming video media • ~30 million unique viewers per month
  • 18. Hulu: Why Cassandra • need for Availability • need for Scalability • Good Performance • Nearly Linear Scalability • Geo-Replication • Minimal Maintenance Requirements Andres Rangel, Senior Software Engineer at Hulu
  • 19. Reasons for Choosing Cassandra • Value availability over consistency • Require high write-throughput • High scalability required • No single point of failure
  • 22. • Cassandra is a column oriented NoSQL system • Column families: sets of key- value pairs • A row is a collection of columns labeled with a name Key-Value Model
  • 23. Cassandra Row • the value of a row is itself a sequence of key-value pairs • such nested key-value pairs are columns • key = column name • a row must contain at least 1 column
  • 25. Column names storing values • key: User ID • column names store tweet ID values • values of all column names are set to “-” (empty byte array) as they are not used
  • 26. • A Key Space is a group of column families together. It is only a logical grouping of column families and provides an isolated scope for names Key Space
  • 27. Comparing Cassandra (C*) and RDBMS • with RDBMS, a normalized data model is created without considering the exact queries • with C*, the data model is designed for specific queries • C*: NO joins, relationships, or foreign keys
  • 28. Cassandra Query Language - CQL • creating a keyspace - namespace of tables CREATE KEYSPACE demo WITH replication = {‘class’: ’SimpleStrategy’, replication_factor’: 3}; • to use namespace: USE demo;
  • 29. Cassandra Query Language - CQL • creating tables: CREATE TABLE users( CREATE TABLE tweets( email varchar, email varchar, bio varchar, time_posted timestamp, birthday timestamp, tweet varchar, active boolean, PRIMARY KEY (email, time_posted)); PRIMARY KEY (email));
  • 30. Cassandra Query Language - CQL • inserting data INSERT INTO users (email, bio, birthday, active) VALUES (‘john.doe@bti360.com’, ‘BT360 Teammate’, 516513600000, true);
  • 31. Cassandra Query Language - CQL • querying tables • SELECT expression reads one or more records from Cassandra column family and returns a result-set of rows SELECT * FROM users; SELECT email FROM users WHERE active = true;
  • 33. • perfect for time-series data • high performance • Decentralization • nearly linear scalability • replication support • no single points of failure • MapReduce support Cassandra Advantages
  • 34. Cassandra Weaknesses ● no referential integrity ● querying options for retrieving data are limited ● sorting data is a design decision ● no support for atomic operations ● first think about queries, then about data model
  • 35. Cassandra: Points to Consider ● Cassandra is designed as a distributed database management system ● Cassandra write performance is always excellent, but read performance depends on write patterns ● having a high-level understanding of some internals is a plus
  • 36. References • Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40. • Hewitt, Eben. Cassandra: the definitive guide. O'Reilly Media, 2010. • http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/a rchitectureTOC.html • http://www.slideshare.net/planetcassandra/a-deep-dive-into-understanding- apache-cassandra • http://www.slideshare.net/DataStax/evaluating-apache-cassandra-as-a-cloud- database • http://planetcassandra.org/functional-use-cases/ • http://marsmedia.info/en/cassandra-pros-cons-and-model.php • http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global- cassandra • http://wiki.apache.org/cassandra/CassandraLimitations

Hinweis der Redaktion

  1. It combines Amazon Dynamo’s fully distributed design with Google Bigtable’s column-oriented data model.
  2. designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failur
  3. offering streaming movies through video game consoles, Apple TV, TiVo and more http://www.usatoday.com/story/life/tv/2013/09/18/netflix-hulu-amazon-nielsen-viewership-data/2831535/
  4. Central Oracle database -> everything in one place, convenient until it fails Schema changes required downtime Cassandra stores 3 local copies, 1 per zone Synchronous access, durable Replicates at destination Global Coverage business agility Local Access better latency fault isolation
  5. Geo-Replication – distribution of data across multiple regions
  6. SimpleStrategy is a replication strategy. There are two: SimpleStrategy and NetworkTopologyStrategy. SimpleStrategy is used for simple data center clusters. It is default. NetworkTopologyStrategy is used for multi-datacenter deployments
  7. Table – when schema is static Column Family – when schema is dynamic
  8. No join or subquery support, and limited support for aggregation. This is by design, to force you to denormalize into partitions that can be efficiently queried from a single replica, instead of having to gather data from across the entire cluster. Ordering is done per-partition, and is specified at table creation time. Again, this is to enforce good application design; sorting thousands or millions of rows can be fast in development, but sorting billions in production is a bad idea.
  9. It’s important to analyze how you are going to query your data. Spending time to design your schema around your query pattern can save a lot of hassle debugging performance issues while also ensuring that you can scale easily. Additionally, having a high-level understanding of some of the internals such has how deletions are implemented, how secondary indices operate, and when to use the row cache can go a long way in designing a strong application built atop Cassandra.