Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Apache Cassandra
Big Data: Why do I need Cassandra
Jan Ott
Trivadis

Agenda
1. Introduction - What is Apache Cassandra?
2. Cassandra Data Model
3. First Steps in the Cassandra World
4. Cassandra Query Language - CQL
5. Summary

The Cassandra Elevator Pitch
„Apache Cassandra is an open source, distributed, decentralized,
elastically scalable, highly available, fault-tolerant, tuneably
consistent, row-oriented database that bases its distribution
design on Amazon’s Dynamo and its data model on Google’s
Bigtable. Created at Facebook, it is now used at some of the
most popular sites on the Web.“
Cassandra: The Definitive Guide by Jeff Carpenter and Eben Hewitt

History of Cassandra
Roots: Google Bigtable (data model) and Amazon Dynamo (distribution design)

Highlights
• Apache Cassandra™ is free
• Distributed & Decentralized
• Elastically scalable
• Highly available
• Fault tolerant
• Tunable data consistency
• CQL language (like SQL)

Elastic Scalability
• Capable of comfortably scaling to petabytes
• Adding nodes increases performance linearly
• Add and remove nodes online
[Figure: throughput doubles from 100'000 to 200'000 to 400'000 txns/sec as nodes are added]

Write Requests
The coordinator sends a write request to all replicas that own the row being written.

Read Requests
There are two types of read requests that a coordinator can send to a replica:
• A direct read request
• A background read repair request
The number of replicas contacted by a direct read request is determined by the
consistency level specified by the client.

Tunable Data Consistency
What do I need?
• Writes
• Reads
• Consistency
Consistency levels for writes
› Any
› One/Two/Three
› Local_One
› Quorum
› Local_Quorum
› Each_Quorum
› All
Consistency levels for reads
› One/Two/Three
› Local_One
› Quorum
› Local_Quorum
› Each_Quorum
› All
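How the read and write levels combine can be shown with a little arithmetic. The sketch below is only an illustration, not Cassandra code: it assumes the usual majority rule for Quorum (floor(RF / 2) + 1 replicas) and the overlap rule R + W > RF. The function names and the replication factor of 3 are made up for the example.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor, chosen only for this illustration
q = quorum(rf)

print(q)                                   # 2
print(reads_see_latest_write(q, q, rf))    # True:  Quorum writes + Quorum reads
print(reads_see_latest_write(1, 1, rf))    # False: One write + One read
print(reads_see_latest_write(rf, 1, rf))   # True:  All writes + One read
```

Lower levels give lower latency and higher availability. Levels that satisfy the overlap rule give reads that always include the most recent acknowledged write.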

Who is using Cassandra?
The largest installation runs over 75,000 Cassandra nodes and stores more than 10
petabytes of data; one of its clusters has over 1,000 nodes.

Cassandra Data Model
• Forget Normalization – DENORMALIZE
• Design by Query
• No Joins – Denormalize
- Model
- Materialized Views
- Do it on the client side – not recommended
• No Referential Integrity
- Possible to define but not enforced
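"Design by Query" can be pictured as doing the join once, at write time. The sketch below uses plain Python dictionaries as a stand-in for tables and reuses the dept/emp sample values that appear later in this deck. The helper name build_emp_by_dname is hypothetical, and nothing here talks to a real cluster.

```python
from collections import defaultdict

# Sample values taken from the dept/emp examples in this deck.
DEPTS = {10: "ACCOUNTING", 20: "RESEARCH"}
EMPS = [
    (7839, "KING", 10),
    (7782, "CLARK", 10),
    (7566, "JONES", 20),
    (7788, "SCOTT", 20),
    (7902, "FORD", 20),
]


def build_emp_by_dname(depts, emps):
    """Write-time 'join': copy dname into each employee entry, keyed by dname."""
    table = defaultdict(list)
    for empno, ename, deptno in emps:
        dname = depts[deptno]
        table[dname].append({"empno": empno, "ename": ename, "deptno": deptno})
    return dict(table)


emp_by_dname = build_emp_by_dname(DEPTS, EMPS)

# The query "all employees of RESEARCH" is now a single key lookup, no join.
print([e["ename"] for e in emp_by_dname["RESEARCH"]])  # ['JONES', 'SCOTT', 'FORD']
```

The price is duplicated data: renaming a department means rewriting every entry that carries the old name.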

How Cassandra stores data
• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names sorted (UTF8, Int, Timestamp, etc.)
[Figure: one row key followed by 1 to 2 billion columns; each column stores a name, a value, a timestamp and a TTL. A table holds billions of rows.]

Static Column Family – "Skinny Row"
CREATE TABLE skinny (rowkey text,
c1 text,
c2 text,
c3 text,
PRIMARY KEY (rowkey));
• rowkey is the partition key
• Grows up to billions of rows
rowkey | c1 | c2 | c3
rowkey-1 | value-c1 | value-c2 | value-c3
rowkey-2 | value-c1 | | value-c3
rowkey-3 | value-c1 | value-c2 | value-c3

Dynamic Column Family – "Wide Row"
CREATE TABLE wide (rowkey text,
ckey text,
c1 text,
c2 text,
PRIMARY KEY (rowkey, ckey)) WITH CLUSTERING ORDER BY (ckey ASC);
• rowkey is the partition key, ckey is the clustering key
• Billions of rows, each with 1 to 2 billion columns
• Each cell is named <clustering key>:<column> and holds its value, e.g. ckey-1:c1 = value-c1
rowkey-1: ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2, ckey-3:c1, ckey-3:c2
rowkey-2: ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2
rowkey-3: ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2, ckey-3:c1, ckey-3:c2
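The wide-row layout can be modelled as a map from partition key to entries that are read back ordered by clustering key. The class below is a toy in-memory model written only for this illustration. WideTable and its methods are hypothetical names, not part of any Cassandra driver.

```python
class WideTable:
    """Toy model: partition key -> {clustering key -> {column: value}}."""

    def __init__(self):
        self._partitions = {}

    def insert(self, rowkey, ckey, **columns):
        # Locate (or create) the partition, then the entry for this clustering key.
        row = self._partitions.setdefault(rowkey, {}).setdefault(ckey, {})
        row.update(columns)

    def select(self, rowkey):
        # Entries of one partition come back sorted by clustering key (ASC).
        partition = self._partitions.get(rowkey, {})
        return [(ckey, partition[ckey]) for ckey in sorted(partition)]


wide = WideTable()
wide.insert("rowkey-1", "ckey-2", c1="value-c1", c2="value-c2")
wide.insert("rowkey-1", "ckey-1", c1="value-c1", c2="value-c2")
wide.insert("rowkey-2", "ckey-1", c1="value-c1")

# Insertion order was ckey-2 then ckey-1; reads are ordered by clustering key.
print([ckey for ckey, _ in wide.select("rowkey-1")])  # ['ckey-1', 'ckey-2']
```

All entries sharing a partition key live together, so a query that names the partition key touches one partition only.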

Getting Cassandra
Apache Cassandra Distribution
• http://cassandra.apache.org/
DataStax Distribution
• DataStax Enterprise 5.0 - Sandbox - VM
https://academy.datastax.com/downloads/welcome
VM with Cassandra (1 node), DataStax DevCenter, DataStax OpsCenter
• Oracle VirtualBox
http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html
• Login – datastax/datastax

DataStax OpsCenter
• At-a-Glance Cluster Management
• Point-and-Click Provisioning and
Administration
• Secured Administration
• Always On Management and Monitoring
• Visual Monitoring and Tuning
• Best Practice Advice
• Proactive Assistance
• Smart Data Protection

DataStax OpsCenter
Point-and-Click Provisioning and Administration

DataStax OpsCenter
Visual Monitoring and Tuning

CQL – Cassandra Query Language

Introducing CQL
• CQL is a reintroduction of schema so that you don't have to read code to
understand the data model.
• CQL creates a common language so that details of the data model can be easily
communicated.
• CQL is a best-practices Cassandra interface and hides the messy details.

CQL Language
• SQL like syntax
• Data Definition Language – DDL
CREATE / ALTER / DROP / …
• Data Manipulation Language – DML
INSERT, UPDATE, DELETE
• Query data with SELECT
• Built-in functions – COUNT, MIN, MAX, SUM, AVG, ... and the LIMIT clause
• UDF – User Defined Function / UDA - User Defined Aggregate

CQL Shell for Apache Cassandra
cqlsh is the command line utility for executing CQL commands (think of SQL*Plus for
Cassandra)
CQL3 is the default since Cassandra 1.2
$ cqlsh
Connected to DataStaxCluster at localhost:9160.
[cqlsh 4.1.0 | Cassandra 2.0.5.24 | CQL spec 3.1.1 | Thrift
protocol 19.39.0]
Use HELP for help.
cqlsh>

CQL Shell for Apache Cassandra
Execute a script with the -f option
$ cqlsh -f create-table.cql
Alternatively, pipe scripts into cqlsh
$ cat create-table.cql | cqlsh
Source files inside cqlsh
cqlsh> SOURCE '~/cassandra_training/cql/create-table.cql'

Creating a Keyspace
Create a keyspace with SimpleStrategy and replication factor option
Make the new keyspace the active one
cqlsh> CREATE KEYSPACE my_space
WITH REPLICATION = {'class':'SimpleStrategy',
'replication_factor':1};
cqlsh> USE my_space;
cqlsh:my_space>

Describing a Keyspace
Use DESCRIBE KEYSPACE to show the metadata of the keyspace
cqlsh> DESCRIBE KEYSPACE my_space;
CREATE KEYSPACE my_space WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
cqlsh>

Create a Static table Dept
Use CREATE TABLE to create a static column family (table) named "dept"
cqlsh:my_space> CREATE TABLE dept(
deptno int,
dname varchar,
loc varchar,
PRIMARY KEY (deptno));
deptno | dname | loc
10 | ACCOUNTING | NEW YORK
20 | RESEARCH | DALLAS
30 | SALES | CHICAGO
40 | OPERATIONS | BOSTON

Create a Dynamic table (wide-row) Employee
A dynamic table is also created with the CREATE TABLE statement, but with a
compound primary key: a partition key (deptno) plus a clustering column (ename)
cqlsh:training> CREATE TABLE emp(
empno int,
ename varchar,
…
deptno int,
PRIMARY KEY (deptno, ename));
deptno 10: KING:empno = 7839 ..., CLARK:empno = 7782 ...
deptno 20: JONES:empno = 7566 ..., SCOTT:empno = 7788 ..., FORD:empno = 7902 ...

Truncate / Drop Table
Use TRUNCATE to remove all data from a table
cqlsh:training> TRUNCATE employee;
Use DROP TABLE to drop the whole table; the operation is irreversible and removes all
information within the specified table!
cqlsh:training> DROP TABLE employee;
• DROP TABLE raises an error if the table does not exist; use IF EXISTS to prevent this (new in 2.0):
cqlsh:training> DROP TABLE IF EXISTS employee;

Insert data into Dept
• PRIMARY KEY is always required
• Insert with same primary key => update
cqlsh:training> INSERT INTO dept (deptno, dname, loc)
VALUES (10, 'ACCOUNTING', 'NEW YORK');

Retrieving data from Dept table
SELECT statement returns rows and columns, just as in SQL
It can optionally also have a WHERE clause, an ORDER BY clause and a LIMIT clause
cqlsh:training> SELECT deptno, dname FROM dept LIMIT 2;
deptno | dname
--------+------------
10 | ACCOUNTING
30 | SALES
(2 rows)

Retrieving data from Dept table (II)
A restriction on a column that is not part of the PRIMARY KEY does not work
Can be solved with a secondary index (but be careful; denormalization is usually the better choice)
cqlsh:my_space> SELECT * FROM dept WHERE loc = 'NEW YORK';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute this
query despite the performance unpredictability, use ALLOW FILTERING"
cqlsh:my_space> CREATE INDEX ON dept(loc);
cqlsh:my_space> SELECT * FROM dept WHERE loc = 'NEW YORK';
deptno | dname | loc
--------+------------+----------
10 | ACCOUNTING | NEW YORK

Update data in Dept
• The WHERE clause must restrict the primary key
• If the primary key does not exist => INSERT
cqlsh:my_space> UPDATE dept SET loc = 'LOS ANGELES'
WHERE deptno = 10;
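INSERT and UPDATE behave alike because both are plain writes of timestamped column values, and the newest timestamp wins per column. The sketch below models that with a dictionary. It is a simplification under stated assumptions: timestamps are passed in explicitly, ties are resolved in favour of the later call (real Cassandra breaks ties differently), and write and read are hypothetical helper names.

```python
def write(table, key, columns, timestamp):
    """Upsert: create the row if needed, then keep the newest value per column."""
    row = table.setdefault(key, {})
    for name, value in columns.items():
        current = row.get(name)
        if current is None or timestamp >= current[1]:
            row[name] = (value, timestamp)


def read(table, key):
    """Return the visible column values of one row (empty dict if absent)."""
    return {name: value for name, (value, _) in table.get(key, {}).items()}


dept = {}
# INSERT INTO dept (deptno, dname, loc) VALUES (10, 'ACCOUNTING', 'NEW YORK')
write(dept, 10, {"dname": "ACCOUNTING", "loc": "NEW YORK"}, timestamp=1)
# UPDATE dept SET loc = 'LOS ANGELES' WHERE deptno = 10
write(dept, 10, {"loc": "LOS ANGELES"}, timestamp=2)
# An UPDATE on a key that does not exist yet simply creates the row.
write(dept, 50, {"loc": "ZURICH"}, timestamp=3)
# A late-arriving write with an older timestamp does not overwrite newer data.
write(dept, 10, {"loc": "BOSTON"}, timestamp=1)

print(read(dept, 10))  # {'dname': 'ACCOUNTING', 'loc': 'LOS ANGELES'}
print(read(dept, 50))  # {'loc': 'ZURICH'}
```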

Cassandra Data Types
Category | CQL Data Type | Description
String | ascii | US-ASCII character string
String | text | UTF-8 encoded string, used most of the time for storing string data
String | varchar | UTF-8 encoded string (alias for text)
String | inet | Used for storing IP addresses
Numeric | int | 32-bit signed integer
Numeric | float | 32-bit IEEE-754 floating point
Numeric | double | 64-bit IEEE-754 floating point
Numeric | varint | Arbitrary-precision integer
Numeric | bigint | 64-bit signed integer, equivalent to long
Numeric | decimal | Variable-precision decimal
Numeric | counter | Distributed counter value (64-bit long)

Cassandra Data Types (II)
Category | CQL Data Type | Description
UUIDs | uuid | A UUID in standard UUID format
UUIDs | timeuuid | Type 1 UUID only, for storing unique time-based IDs
Collections | list | Ordered collection of one or more elements
Collections | map | Collection of arbitrary key-value pairs
Collections | set | Unordered collection of one or more unique elements
Miscellaneous | boolean | Boolean (true/false)
Miscellaneous | blob | Used for storing binary data, written in hexadecimal
Miscellaneous | timestamp | Date and time

Batch operation
• COMMIT? CQL has no COMMIT statement
• BEGIN BATCH … APPLY BATCH executes multiple mutations as a single operation
BEGIN BATCH
INSERT INTO dept (deptno, dname, loc)
VALUES (50, 'IT', 'ZURICH');
UPDATE emp SET sal = 9000
WHERE empno = 9000;
APPLY BATCH;

Alter Table
• ALTER TABLE changes the table metadata
• It is quick: the schema is flexible and existing data is not changed
cqlsh:my_space> ALTER TABLE dept
ADD operational BOOLEAN;
cqlsh:my_space> DESCRIBE TABLE dept;
CREATE TABLE my_space.dept (
deptno int PRIMARY KEY,
dname text,
loc text,
operational boolean
) WITH...;
cqlsh:my_space> SELECT * FROM dept LIMIT 2;
deptno | dname | loc | operational
--------+------------+-------------+-------------
50 | IT | ZURICH | null
10 | ACCOUNTING | LOS ANGELES | null
(2 rows)

Collections
CQL3 also supports collections for storing complex data structures
• Set {value,…}, List [value,…], Map {key:value,…}
cqlsh:training> CREATE TABLE collection_sample(
id int PRIMARY KEY,
string_set set<text>,
string_list list<text>,
string_map map<text, text>);
cqlsh:training> INSERT INTO collection_sample
(id, string_set, string_list, string_map)
VALUES (1,
{'text1','text2','text1'},
['text1','text2','text1'],
{'key1':'value1'});

Collections (II)
cqlsh:training> SELECT * FROM collection_sample;
id | string_list | string_map | string_set
----+-----------------------------+--------------------+--------------------
1 | ['text1', 'text2', 'text1'] | {'key1': 'value1'} | {'text1', 'text2'}
(1 rows)
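The result above shows the difference between the two collection types: the set drops the duplicate 'text1', the list keeps it. The same behaviour can be reproduced with Python's built-in types. This is only an analogy, and the variable names simply mirror the columns of collection_sample.

```python
values = ["text1", "text2", "text1"]

string_set = sorted(set(values))    # duplicates removed; sorted for stable display
string_list = list(values)          # insertion order and duplicates preserved
string_map = {"key1": "value1"}     # key-value pairs, one value per key

print(string_list)  # ['text1', 'text2', 'text1']
print(string_map)   # {'key1': 'value1'}
print(string_set)   # ['text1', 'text2']
```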

UDF – User Defined Function
Sample Code
CREATE FUNCTION count_if_true(input boolean)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java AS 'if (input) return 1; else return 0;';
SELECT door_number, count_if_true(is_open)
FROM my_doors;
"CREATE OR REPLACE" or "IF NOT EXISTS" syntax is possible

UDA – User Defined Aggregate
Sample Code
CREATE FUNCTION state_count_if_true(total int, input boolean)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java AS 'if (input) return total+1; else return total;';
CREATE AGGREGATE total_open (boolean)
SFUNC state_count_if_true
STYPE int
INITCOND 0;
SELECT door_number, total_open(is_open)
FROM my_doors;
"CREATE OR REPLACE" or "IF NOT EXISTS" syntax is possible
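The three parts of the aggregate map onto a simple fold: INITCOND is the starting state, SFUNC is applied once per row, and STYPE is the type of the running state. The Python sketch below mirrors the state function from the sample code. The aggregate helper and the list of is_open values are assumptions made for illustration, and null handling is left out.

```python
def state_count_if_true(total, value):
    """State function (SFUNC): add one to the running total for each True input."""
    return total + 1 if value else total


def aggregate(sfunc, initcond, values):
    """Fold the state function over all rows, starting from INITCOND."""
    state = initcond
    for value in values:
        state = sfunc(state, value)
    return state


# Hypothetical is_open values of four rows in my_doors.
is_open = [True, False, True, True]

total_open = aggregate(state_count_if_true, 0, is_open)
print(total_open)  # 3
```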

Materialized Views
Relieve the pain of manual denormalization (available since Cassandra 3.0)
With a materialized view:
cqlsh:training> CREATE MATERIALIZED VIEW employee_by_role
AS SELECT role, name, age
FROM employee
WHERE role IS NOT NULL AND name IS NOT NULL
PRIMARY KEY (role, name);
Without it, a second table has to be maintained manually:
cqlsh:training> CREATE TABLE employee_by_role (
role text, name text, age int,
PRIMARY KEY (role, name));

Time-to-Live (TTL) on Insert
• Write a value with a TTL in seconds (here 15 s)
• After the TTL has expired, the value is deleted
cqlsh:my_space> UPDATE emp USING ttl 15 SET sal = 8000
WHERE empno = 9000;
cqlsh:my_space> SELECT ename, sal, ttl(sal) FROM emp WHERE empno = 9000;
ename | sal | ttl(sal)
-------+------+----------
null | 8000 | 15
(1 rows)
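The expiry behaviour can be sketched as a value that carries its write time and TTL and becomes invisible once the TTL has elapsed. The class below is a toy model with an explicit clock so that the result is deterministic. TtlCell is a hypothetical name, and real Cassandra removes expired data through tombstones and compaction, which is not modelled here.

```python
class TtlCell:
    """A single column value with a time-to-live, using an explicit clock."""

    def __init__(self, value, ttl_seconds, written_at):
        self.value = value
        self.ttl_seconds = ttl_seconds
        self.written_at = written_at

    def remaining(self, now):
        """Seconds left before expiry, never negative (compare ttl(sal))."""
        return max(0, self.ttl_seconds - (now - self.written_at))

    def read(self, now):
        """Return the value while the TTL is running, otherwise None."""
        return self.value if self.remaining(now) > 0 else None


# UPDATE emp USING TTL 15 SET sal = 8000 ..., written at time 0
sal = TtlCell(8000, ttl_seconds=15, written_at=0)

print(sal.read(now=5), sal.remaining(now=5))    # 8000 10
print(sal.read(now=15), sal.remaining(now=15))  # None 0
```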

Summary
• Just great :-)
• No single point of failure – ring model
• Distribution over nodes / racks / data centers
• Tunable consistency
• CQL
• Spark / Cassandra integration
• CQL is limited
• Forget 20 years of experience in relational modelling :-( => DENORMALIZE :-)

Jan Ott
Senior Consultant
jan.ott@trivadis.com

References
• Books
Cassandra: The Definitive Guide, 2nd Edition
• Apache – Cassandra CQL Documentation
https://cassandra.apache.org/doc/latest/cql/index.html
• DataStax – CQL Documentation
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html
• Netflix and Cassandra
http://techblog.netflix.com/search/label/Cassandra
