
Trivadis TechEvent 2016: Big Data - Cassandra, why do I need it? by Jan Ott

First Steps of an Oracle-expert in the Big Data World. Everyone speaks about Big Data. But what does it mean? This speech focuses on one animal of the Big Data Zoo - Cassandra and answers the following questions:
- Why another database?
- There is Impala and Spark. Why would I need Cassandra?
- New database - do I need to learn a new language?
- How do I get the data in?
- Can I use SQL?
- Is it part of a distribution, for example Cloudera?

Demos will explain the theory.

Published in: Data & Analytics

  1. 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH Apache Cassandra Big Data: Why do I need Cassandra Jan Ott Trivadis
  2. 2. Agenda 1. Introduction - What is Apache Cassandra? 2. Cassandra Data Model 3. First Steps in the Cassandra World 4. Cassandra Query Language - CQL 5. Summary
  3. 3. What is Apache Cassandra?
  4. 4. The Cassandra Elevator Pitch „Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, row-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.“ Cassandra: The Definitive Guide by Jeff Carpenter and Eben Hewitt
  5. 5. History of Cassandra Bigtable Dynamo
  6. 6. Highlights • Apache Cassandra™ is free • Distributed & Decentralized • Elastically Scalable • Highly Available • Fault Tolerant • Tunable data consistency • CQL language (like SQL)
  7. 7. Elastically Scalable • Capable of comfortably scaling to petabytes • New nodes = linear performance increases • Add and remove nodes online 100'000 txns/sec 200'000 txns/sec 400'000 txns/sec
  8. 8. Write Requests The coordinator sends a write request to all replicas that own the row being written.
  9. 9. Read Requests There are two types of read requests that a coordinator can send to a replica: • A direct read request • A background read repair request The number of replicas contacted by a direct read request is determined by the consistency level specified by the client.
  10. 10. Tunable Data Consistency What do I need? • Writes • Reads • Consistency. Write consistency levels: Any, One/Two/Three, Local_One, Quorum, Local_Quorum, Each_Quorum, All. Read consistency levels: One/Two/Three, Local_One, Quorum, Local_Quorum, Each_Quorum, All. [Figure: ring of six nodes, numbered 1 to 6]
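The choice between these levels can be illustrated with a little arithmetic. The following is a minimal sketch, not taken from the slides: it assumes the usual definition of a quorum as a majority of replicas and the common rule of thumb that reads see the latest write when read replicas plus write replicas exceed the replication factor. The function names `quorum` and `overlaps` are made up for this illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: a strict majority, floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def overlaps(write_replicas: int, read_replicas: int, replication_factor: int) -> bool:
    """True when every read set must share a replica with every write set (R + W > RF)."""
    return write_replicas + read_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                   # replicas that must answer at level Quorum
    print(overlaps(q, q, rf))  # Quorum writes combined with Quorum reads
    print(overlaps(1, 1, rf))  # One writes combined with One reads
```

With a replication factor of 3, Quorum on both sides overlaps, while One on both sides does not, which is the trade-off between latency and consistency that the slide asks about.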
  11. 11. Who is using Cassandra? The largest infrastructure runs over 75,000 Cassandra nodes and stores more than 10 petabytes of data; one of its clusters has over 1,000 nodes.
  12. 12. Cassandra Data Model
  13. 13. Cassandra Data Model • Forget Normalization – DENORMALIZE • Design by Query • No Joins – Denormalize - Model - Materialized Views - Do it on the client side – not recommended • No Referential Integrity - Possible to define but not enforced
  14. 14. How Cassandra stores data • Model brought from Google Bigtable • Row Key and a lot of columns • Column names sorted (UTF8, Int, Timestamp, etc.) [Figure: a row key followed by sorted columns, each holding column name, column value, timestamp and TTL; columns numbered 1 to 2 billion, billions of rows]
  15. 15. Static Column Family – "Skinny Row" CREATE TABLE skinny (rowkey text, c1 text, c2 text, c3 text, PRIMARY KEY (rowkey)); [Figure: partition key rowkey with columns c1, c2, c3; grows up to billions of rows. rowkey-1: c1 = value-c1, c2 = value-c2, c3 = value-c3; rowkey-2: c1 = value-c1, c3 = value-c3; rowkey-3: c1 = value-c1, c2 = value-c2, c3 = value-c3]
  16. 16. Dynamic Column Family – "Wide Row" CREATE TABLE wide (rowkey text, ckey text, c1 text, c2 text, PRIMARY KEY (rowkey, ckey)) WITH CLUSTERING ORDER BY (ckey ASC); [Figure: partition key rowkey, clustering key ckey; billions of rows, cells numbered 1 to 2 billion per row. rowkey-1: ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2, ckey-3:c1, ckey-3:c2; rowkey-2: ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2; rowkey-3: ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2, ckey-3:c1, ckey-3:c2; each cell holds value-c1 or value-c2]
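The wide-row layout may be easier to picture as a map of partitions, each holding its rows sorted by clustering key. The sketch below is only an in-memory toy under that assumption; the class `WideTable` and its methods are hypothetical names for this illustration and say nothing about how Cassandra implements storage on disk.

```python
from collections import defaultdict


class WideTable:
    """Toy model of PRIMARY KEY (rowkey, ckey): partition -> clustering key -> columns."""

    def __init__(self):
        # One dict per partition key; inner keys are clustering keys.
        self._partitions = defaultdict(dict)

    def upsert(self, rowkey, ckey, **columns):
        # Writing to an existing (rowkey, ckey) merges columns instead of failing.
        self._partitions[rowkey].setdefault(ckey, {}).update(columns)

    def read_partition(self, rowkey):
        # Rows come back ordered by clustering key (ascending), like CLUSTERING ORDER BY (ckey ASC).
        rows = self._partitions.get(rowkey, {})
        return [(ckey, rows[ckey]) for ckey in sorted(rows)]


if __name__ == "__main__":
    table = WideTable()
    table.upsert("rowkey-1", "ckey-2", c1="value-c1")
    table.upsert("rowkey-1", "ckey-1", c1="value-c1")
    table.upsert("rowkey-1", "ckey-1", c2="value-c2")  # same key: columns are merged
    for ckey, columns in table.read_partition("rowkey-1"):
        print(ckey, columns)
```

Reading one partition returns all of its rows in clustering order, which is why the query pattern should decide what goes into the partition key and what into the clustering key.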
  17. 17. First Steps
  18. 18. Getting Cassandra Apache Cassandra Distribution • http://cassandra.apache.org/ DataStax Distribution • DataStax Enterprise 5.0 - Sandbox - VM https://academy.datastax.com/downloads/welcome VM with Cassandra (1 node), DataStax DevCenter, DataStax OpsCenter • Oracle Virtual Box http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html • Login – datastax/datastax
  19. 19. DataStax OpsCenter • At-a-Glance Cluster Management • Point-and-Click Provisioning and Administration • Secured Administration • Always On Management and Monitoring • Visual Monitoring and Tuning • Best Practice Advice • Proactive Assistance • Smart Data Protection
  20. 20. DataStax OpsCenter Point-and-Click Provisioning and Administration
  21. 21. DataStax OpsCenter Visual Monitoring and Tuning
  22. 22. CQL – Cassandra Query Language
  23. 23. Introducing CQL • CQL is a reintroduction of schema so that you don't have to read code to understand the data model. • CQL creates a common language so that details of the data model can be easily communicated. • CQL is a best-practices Cassandra interface and hides the messy details.
  24. 24. CQL Language • SQL-like syntax • Data Definition Language – DDL CREATE / ALTER / DROP / … • Data Manipulation Language – DML INSERT, UPDATE, DELETE • Query data with SELECT • Built-in functions and clauses – COUNT, MIN, MAX, SUM, AVG, LIMIT, ... • UDF – User Defined Function / UDA - User Defined Aggregate
  25. 25. CQL Shell for Apache Cassandra cqlsh is the command line utility for executing CQL commands (think of SQL*Plus for Cassandra) CQL3 is default since Cassandra 1.2 $ cqlsh Connected to DataStaxCluster at localhost:9160. [cqlsh 4.1.0 | Cassandra 2.0.5.24 | CQL spec 3.1.1 | Thrift protocol 19.39.0] Use HELP for help. cqlsh>
  26. 26. CQL Shell for Apache Cassandra Execute a script with the -f option: $ cqlsh -f create-table.cql Alternatively pipe scripts into cqlsh: $ cat create-table.cql | cqlsh Source files inside cqlsh: cqlsh> SOURCE '~/cassandra_training/cql/create-table.cql'
  27. 27. Creating a Keyspace Create a keyspace with SimpleStrategy and replication factor option Make the new keyspace the active one cqlsh> CREATE KEYSPACE my_space WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1}; cqlsh> USE my_space; cqlsh:my_space>
  28. 28. Describing a Keyspace Use DESCRIBE KEYSPACE to show the metadata of the keyspace cqlsh> DESCRIBE KEYSPACE my_space; CREATE KEYSPACE my_space WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' }; cqlsh>
  29. 29. Create a Static table Dept Use CREATE TABLE to create a static column family (table) named "dept" cqlsh:my_space> CREATE TABLE dept( deptno int, dname varchar, loc varchar, PRIMARY KEY (deptno)); [Figure: rows keyed by deptno with columns dname and loc: 10 ACCOUNTING NEW YORK; 20 RESEARCH DALLAS; 30 SALES CHICAGO; 40 OPERATIONS BOSTON]
  30. 30. Create a Dynamic table (wide-row) Employee A dynamic table is also created with the CREATE TABLE statement, but using a compound primary key (partition key plus clustering column) cqlsh:training> CREATE TABLE emp( empno int, ename varchar, … deptno int, primary key (deptno, ename)); [Figure: partition 10 holds KING:empno 7839 ..., CLARK:empno 7782 ...; partition 20 holds JONES:empno 7566 ..., SCOTT:empno 7788 ..., FORD:empno 7902 ...]
  31. 31. Truncate / Drop Table Use TRUNCATE to remove all data from a table. Use DROP TABLE to drop the whole table; the operation is irreversible and removes all information within the specified table! • DROP TABLE raises an error if the table does not exist; use IF EXISTS to prevent this (new in 2.0): cqlsh:training> TRUNCATE employee; cqlsh:training> DROP TABLE employee; cqlsh:training> DROP TABLE IF EXISTS employee;
  32. 32. Insert data into Dept • PRIMARY KEY is always required • Insert with same primary key => update cqlsh:training> INSERT INTO dept (deptno, dname, loc) VALUES (10, 'ACCOUNTING', 'NEW YORK');
  33. 33. Retrieving data from Dept table SELECT statement returns rows and columns, just as in SQL It can optionally also have a WHERE clause, an ORDER BY clause and a LIMIT clause cqlsh:training> SELECT deptno, dname FROM dept LIMIT 2; deptno | dname --------+------------ 10 | ACCOUNTING 30 | SALES (2 rows)
  34. 34. Retrieving data from Dept table (II) Restriction on column other than PRIMARY KEY won't work Can be solved with an Index (but be careful, better use de-normalization) cqlsh:my_space> SELECT * FROM dept WHERE loc = 'NEW YORK'; InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING" cqlsh:my_space> CREATE INDEX ON dept(loc); cqlsh:my_space> SELECT * FROM dept WHERE loc = 'NEW YORK'; deptno | dname | loc --------+------------+---------- 10 | ACCOUNTING | NEW YORK
  35. 35. Update data in Dept • WHERE over Primary Key • If Primary Key does not exist => INSERT cqlsh:my_space> UPDATE dept SET loc = 'LOS ANGELES' WHERE deptno = 10;
  36. 36. Cassandra Data Types (Category: CQL data type – description) • String: ascii – US-ASCII character string; text – UTF-8 encoded string, used most of the time for storing string data; varchar – UTF-8 strings; inet – used for storing IP addresses • Numeric: int – 32-bit signed integer; float – 32-bit IEEE-754 floating point; double – 64-bit IEEE-754 floating point; varint – arbitrary-precision integer; bigint – 64-bit number, equivalent to long; decimal – variable-precision decimal; counter – distributed counter value (64-bit long)
  37. 37. Cassandra Data Types (II) (Category: CQL data type – description) • UUIDs: uuid – a UUID in standard UUID format; timeuuid – type 1 UUID only, for storing unique time-based IDs • Collections: list – ordered collection of one or more elements; map – collection of arbitrary key-value pairs; set – unordered collection of one or more unique elements • Miscellaneous: boolean – Boolean (true/false); blob – used for storing binary data written in hexadecimal; timestamp – date/time
  38. 38. Batch operation • COMMIT ? • BEGIN BATCH … APPLY BATCH – execute multiple mutations – single operation BEGIN BATCH INSERT INTO dept (deptno, dname, loc) VALUES (50, 'IT', 'ZURICH'); UPDATE emp SET sal = 9000 WHERE empno = 9000; APPLY BATCH;
  39. 39. Alter Table • ALTER TABLE changes metadata • ALTER TABLE is quick: flexible schema, no changes to existing data cqlsh:my_space> ALTER TABLE dept ADD operational BOOLEAN; cqlsh:my_space> DESCRIBE TABLE dept; CREATE TABLE my_space.dept ( deptno int PRIMARY KEY, dname text, loc text, operational boolean ) WITH...; cqlsh:my_space> SELECT * FROM dept LIMIT 2; deptno | dname | loc | operational --------+------------+-------------+------------- 50 | IT | ZURICH | null 10 | ACCOUNTING | LOS ANGELES | null (2 rows)
  40. 40. Collections CQL3 also supports collections for storing complex data structures • Set {value,…}, List [value,…], Map {key:value,…} cqlsh:training> CREATE TABLE collection_sample( id int PRIMARY KEY, string_set set<text>, string_list list<text>, string_map map<text, text>); cqlsh:training> INSERT INTO collection_sample (id, string_set, string_list, string_map) VALUES (1, {'text1','text2','text1'}, ['text1','text2','text1'], {'key1':'value1'});
  41. 41. Collections (II) cqlsh:training> SELECT * FROM collection_sample; id | string_list | string_map | string_set ----+-----------------------------+--------------------+-------------------- 1 | ['text1', 'text2', 'text1'] | {'key1': 'value1'} | {'text1', 'text2'} (1 rows)
  42. 42. UDF – User Defined Function Sample Code CREATE FUNCTION count_if_true(input boolean) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE java AS 'if (input) return 1; else return 0;'; SELECT door_number, count_if_true(is_open) FROM my_doors; "CREATE OR REPLACE" or "IF NOT EXISTS" syntax possible
  43. 43. UDA – User Defined Aggregate Sample Code CREATE FUNCTION state_count_if_true(total int, input boolean) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE java AS 'if (input) return total+1; else return total;'; CREATE AGGREGATE total_open (boolean) SFUNC state_count_if_true STYPE int INITCOND 0; SELECT door_number, total_open(is_open) FROM my_doors; "CREATE OR REPLACE" or "IF NOT EXISTS" syntax possible
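How SFUNC, STYPE and INITCOND fit together can be mimicked with a plain fold. This is a rough Python analogy rather than Cassandra code: the state starts at INITCOND and the state function is applied once per row. Skipping null inputs so that the state stays unchanged is an assumption made here to model RETURNS NULL ON NULL INPUT on the state function.

```python
from functools import reduce
from typing import Iterable, Optional

INITCOND = 0  # initial state, as in "INITCOND 0"


def state_count_if_true(total: int, value: bool) -> int:
    """State function (SFUNC): add 1 to the running total when the input is true."""
    return total + 1 if value else total


def total_open(values: Iterable[Optional[bool]]) -> int:
    """Fold the state function over all rows, starting from INITCOND."""
    def step(state: int, value: Optional[bool]) -> int:
        # Assumption: a null input leaves the state untouched.
        return state if value is None else state_count_if_true(state, value)

    return reduce(step, values, INITCOND)


if __name__ == "__main__":
    is_open_column = [True, False, True, None]
    print(total_open(is_open_column))
```

The aggregate returns the final state, so the type of the state (STYPE int) is also the type of the result in this example.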
  44. 44. Materialized Views (since Cassandra 3.0) Relieve the pain of manual denormalization cqlsh:training> CREATE MATERIALIZED VIEW employee_by_role AS SELECT role, name, age FROM employee WHERE role IS NOT NULL AND name IS NOT NULL PRIMARY KEY (role, name); cqlsh:training> CREATE TABLE employee_by_role ( role text, name text, age int, PRIMARY KEY (role, name));
  45. 45. Time-to-Live (TTL) on Insert • Write a value with a TTL in seconds (15 s in this example) • After that the value expires and is deleted cqlsh:my_space> UPDATE emp USING ttl 15 SET sal = 8000 WHERE empno = 9000; cqlsh:my_space> SELECT ename, sal, ttl(sal) FROM emp WHERE empno = 9000; ename | sal | ttl(sal) -------+------+---------- null | 8000 | 15 (1 rows)
  46. 46. Summary • Just great :-) • No single point of failure – Ring Model • Distribution over nodes / racks / data centers • Tuneable Consistency • CQL • Spark / Cassandra Integration • CQL limited • Forget 20 years of experience in relational modelling :-( => DENORMALIZE :-)
  47. 47. Jan Ott Senior Consultant jan.ott@trivadis.com
  48. 48. References • Books Cassandra: The Definitive Guide, 2nd Edition • Apache – Cassandra CQL Documentation https://cassandra.apache.org/doc/latest/cql/index.html • DataStax – CQL Documentation http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html • Netflix and Cassandra http://techblog.netflix.com/search/label/Cassandra
