First steps of an Oracle expert in the Big Data world. Everyone talks about Big Data, but what does it mean? This talk focuses on one animal of the Big Data zoo, Cassandra, and answers the following questions:
- Why another database?
- There is Impala and Spark. Why would I need Cassandra?
- New database - do I need to learn a new language?
- How do I get the data in?
- Can I use SQL?
- Is it part of a distribution, for example Cloudera?
Demos will explain the theory.
Trivadis TechEvent 2016: Big Data Cassandra, why do I need it? by Jan Ott
1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Apache Cassandra
Big Data: Why do I need Cassandra
Jan Ott
Trivadis
2. Agenda
1. Introduction - What is Apache Cassandra?
2. Cassandra Data Model
3. First Steps in the Cassandra World
4. Cassandra Query Language - CQL
5. Summary
4. The Cassandra Elevator Pitch
"Apache Cassandra is an open source, distributed, decentralized,
elastically scalable, highly available, fault-tolerant, tuneably
consistent, row-oriented database that bases its distribution
design on Amazon’s Dynamo and its data model on Google’s
Bigtable. Created at Facebook, it is now used at some of the
most popular sites on the Web."
Cassandra: The Definitive Guide by Jeff Carpenter and Eben Hewitt
9. Read Requests
There are two types of read requests that a coordinator can send to a replica:
• A direct read request
• A background read repair request
The number of replicas contacted by a direct read request is determined by the
consistency level specified by the client.
10. Tunable Data Consistency
What do I need?
• Writes
• Reads
• Consistency
(Diagram: cluster nodes 1 to 6)
Write consistency levels:
› Any
› One/Two/Three
› Local_One
› Quorum
› Local_Quorum
› Each_Quorum
› All
Read consistency levels:
› One/Two/Three
› Local_One
› Quorum
› Local_Quorum
› Each_Quorum
› All
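How the read and write levels combine can be sketched with a little arithmetic. The snippet below is a plain-Python illustration, not Cassandra code: the function names and the replication factor of 3 are assumptions made for the example. It uses the usual rule that a quorum is a strict majority of the replicas, and that a read is guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: a strict majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when every set of read replicas overlaps every set of written replicas."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor, for illustration only
q = quorum(rf)

print(q)                                   # replicas needed for QUORUM
print(read_sees_latest_write(q, q, rf))    # QUORUM writes + QUORUM reads
print(read_sees_latest_write(1, 1, rf))    # ONE writes + ONE reads
print(read_sees_latest_write(rf, 1, rf))   # ALL writes + ONE reads
```

Lower levels such as One trade this guarantee for lower latency and higher availability, which is what makes the consistency tunable per request.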
11. Who is using Cassandra?
Largest infrastructure: over 75,000 Cassandra nodes storing more than 10
petabytes of data, with one cluster of over 1,000 nodes
13. Cassandra Data Model
• Forget Normalization – DENORMALIZE
• Design by Query
• No Joins – Denormalize
- Model
- Materialized Views
- Do it on the client side – not recommended
• No Referential Integrity
- Possible to define but not enforced
14. How Cassandra stores data
• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names sorted (UTF8, Int, Timestamp, etc.)
(Diagram: one row key pointing to columns 1 … 2 billion; each column holds a column name, column value, timestamp and TTL; billions of rows)
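The storage layout described above can be pictured as a map of maps. The following is a minimal in-memory sketch in Python, not how Cassandra is implemented: the names Cell, put and columns are hypothetical, and the timestamps are small integers chosen for readability. It shows a row key pointing to columns kept in name order, each cell carrying a value, a write timestamp and an optional TTL, with the newer timestamp winning on conflict.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: object
    timestamp: int        # write timestamp
    ttl: Optional[int]    # seconds until expiry, None means no expiry


# row key -> {column name -> Cell}
table: dict = {}


def put(row_key, column, value, timestamp, ttl=None):
    """Write one cell; a write with an older timestamp does not overwrite a newer one."""
    row = table.setdefault(row_key, {})
    current = row.get(column)
    # Ties go to the later call here, which is a simplification.
    if current is None or timestamp >= current.timestamp:
        row[column] = Cell(value, timestamp, ttl)


def columns(row_key):
    """Return the cells of one row, ordered by column name."""
    return sorted(table.get(row_key, {}).items())


put(10, "KING:empno", 7839, timestamp=1)
put(10, "CLARK:empno", 7782, timestamp=2)
put(10, "KING:empno", 1111, timestamp=0)   # older write, ignored

for name, cell in columns(10):
    print(name, cell.value)
```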
23. Introducing CQL
• CQL is a reintroduction of schema so that you don't have to read code to
understand the data model.
• CQL creates a common language so that details of the data model can be easily
communicated.
• CQL is a best-practices Cassandra interface and hides the messy details.
24. CQL Language
• SQL like syntax
• Data Definition Language – DDL
CREATE / ALTER / DROP / …
• Data Manipulation Language – DML
INSERT, UPDATE, DELETE
• Query data with SELECT
• Built-in functions – COUNT, MIN, MAX, SUM, AVG, ... (plus the LIMIT clause)
• UDF – User Defined Function / UDA - User Defined Aggregate
25. CQL Shell for Apache Cassandra
cqlsh is the command-line utility for executing CQL commands (think of SQL*Plus for
Cassandra)
CQL3 is default since Cassandra 1.2
$ cqlsh
Connected to DataStaxCluster at localhost:9160.
[cqlsh 4.1.0 | Cassandra 2.0.5.24 | CQL spec 3.1.1 | Thrift
protocol 19.39.0]
Use HELP for help.
cqlsh>
26. CQL Shell for Apache Cassandra
Execute a script with the -f option
$ cqlsh -f create-table.cql
Alternatively, pipe scripts into cqlsh
$ cat create-table.cql | cqlsh
Source files inside cqlsh
cqlsh> SOURCE '~/cassandra_training/cql/create-table.cql'
27. Creating a Keyspace
Create a keyspace with SimpleStrategy and replication factor option
Make the new keyspace the active one
cqlsh> CREATE KEYSPACE my_space
WITH REPLICATION = {'class':'SimpleStrategy',
'replication_factor':1};
cqlsh> USE my_space;
cqlsh:my_space>
28. Describing a Keyspace
Use DESCRIBE KEYSPACE to show the metadata of the keyspace
cqlsh> DESCRIBE KEYSPACE my_space;
CREATE KEYSPACE my_space WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
cqlsh>
29. Create a Static table Dept
Use CREATE TABLE to create a static column family (table) named "dept"
cqlsh:my_space> CREATE TABLE dept(
deptno int,
dname varchar,
loc varchar,
PRIMARY KEY (deptno));
deptno  dname       loc
10      ACCOUNTING  NEW YORK
20      RESEARCH    DALLAS
30      SALES       CHICAGO
40      OPERATIONS  BOSTON
30. Create a Dynamic table (wide-row) Employee
A dynamic table is also created with the CREATE TABLE statement, but using a
compound primary key (partition key plus clustering column)
cqlsh:training> CREATE TABLE emp(
empno int,
ename varchar,
…
deptno int,
PRIMARY KEY (deptno, ename));
Row key 10: KING:empno = 7839 ... | CLARK:empno = 7782 ...
Row key 20: JONES:empno = 7566 ... | SCOTT:empno = 7788 ... | FORD:empno = 7902 ...
31. Truncate / Drop Table
Use TRUNCATE to truncate the data
Use DROP TABLE to drop the whole table; the operation is irreversible and removes all
information within the specified table!
• DROP TABLE raises an error if the table does not exist; use IF EXISTS to prevent this (new in 2.0):
cqlsh:training> TRUNCATE employee;
cqlsh:training> DROP TABLE employee;
cqlsh:training> DROP TABLE IF EXISTS employee;
32. Insert data into Dept
• PRIMARY KEY is always required
• Insert with same primary key => update
cqlsh:training> INSERT INTO dept (deptno, dname, loc)
VALUES (10, 'ACCOUNTING', 'NEW YORK');
33. Retrieving data from Dept table
SELECT statement returns rows and columns, just as in SQL
It can optionally also have a WHERE clause, an ORDER BY clause and a LIMIT clause
cqlsh:training> SELECT deptno, dname FROM dept LIMIT 2;
deptno | dname
--------+------------
10 | ACCOUNTING
30 | SALES
(2 rows)
34. Retrieving data from Dept table (II)
A restriction on a column other than the PRIMARY KEY won't work
This can be solved with an index (but be careful; denormalization is usually the better choice)
cqlsh:my_space> SELECT * FROM dept WHERE loc = 'NEW YORK';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute this
query despite the performance unpredictability, use ALLOW FILTERING"
cqlsh:my_space> CREATE INDEX ON dept(loc);
cqlsh:my_space> SELECT * FROM dept WHERE loc = 'NEW YORK';
deptno | dname | loc
--------+------------+----------
10 | ACCOUNTING | NEW YORK
35. Update data in Dept
• WHERE over Primary Key
• If Primary Key does not exist => INSERT
cqlsh:my_space> UPDATE dept SET loc = 'LOS ANGELES'
WHERE deptno = 10;
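The "insert with same primary key => update" and "update without existing key => insert" behaviour can be mimicked with a dictionary keyed by the primary key. This is only a plain-Python model of the semantics, not driver code: the upsert helper is hypothetical, and department 60 in GENEVA is an invented row used to show an UPDATE on a missing key.

```python
# Stand-in for the dept table: primary key (deptno) -> row
dept = {}


def upsert(deptno, **columns):
    """Model of both INSERT and UPDATE: write the given columns for this primary key."""
    dept.setdefault(deptno, {}).update(columns)


upsert(10, dname="ACCOUNTING", loc="NEW YORK")   # like the INSERT
upsert(10, loc="LOS ANGELES")                    # like the UPDATE: same key, column overwritten
upsert(60, loc="GENEVA")                         # UPDATE on a key that does not exist creates it

print(dept)
```

Neither call checks first whether the row exists, which is why the two statements behave alike.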
36. Cassandra Data Types
Category  CQL Data Type  Description
String    ascii          US-ASCII character string
          text           UTF-8 encoded string, used most of the time for storing string data
          varchar        UTF-8 string
          inet           Used for storing IP addresses
Numeric   int            32-bit signed integer
          float          32-bit IEEE-754 floating point
          double         64-bit IEEE-754 floating point
          varint         Arbitrary-precision integer
          bigint         64-bit number, equivalent to long
          decimal        Variable-precision decimal
          counter        Distributed counter value (64-bit long)
37. Cassandra Data Types (II)
Category       CQL Data Type  Description
UUIDs          uuid           A UUID in standard UUID format
               timeuuid       Type 1 UUID only, for storing unique time-based IDs
Collections    list           Ordered collection of one or more elements
               map            Collection of arbitrary key-value pairs
               set            Unordered collection of one or more unique elements
Miscellaneous  boolean        Boolean (true/false)
               blob           Used for storing binary data, written in hexadecimal
               timestamp      Date/time
38. Batch operation
• COMMIT ?
• BEGIN BATCH … APPLY BATCH – executes multiple mutations as a single operation
BEGIN BATCH
INSERT INTO dept (deptno, dname, loc)
VALUES (50, 'IT', 'ZURICH');
UPDATE emp SET sal = 9000
WHERE empno = 9000;
APPLY BATCH;
39. Alter Table
• ALTER TABLE
changes metadata
• CQL is quick:
flexible schema, no
changes to existing data
cqlsh:my_space> ALTER TABLE dept
ADD operational BOOLEAN;
cqlsh:my_space> DESCRIBE TABLE dept;
CREATE TABLE my_space.dept (
deptno int PRIMARY KEY,
dname text,
loc text,
operational boolean
) WITH...;
cqlsh:my_space> SELECT * FROM dept LIMIT 2;
deptno | dname | loc | operational
--------+------------+-------------+-------------
50 | IT | ZURICH | null
10 | ACCOUNTING | LOS ANGELES | null
(2 rows)
40. Collections
CQL3 also supports collections for storing complex data structures
• Set {value,…}, List [value,…], Map {key:value,…}
cqlsh:training> CREATE TABLE collection_sample(
id int PRIMARY KEY,
string_set set<text>,
string_list list<text>,
string_map map<text, text>);
cqlsh:training> INSERT INTO collection_sample
(id, string_set, string_list, string_map)
VALUES (1,
{'text1','text2','text1'},
['text1','text2','text1'],
{'key1':'value1'});
41. Collections (II)
cqlsh:training> SELECT * FROM collection_sample;
id | string_list | string_map | string_set
----+-----------------------------+--------------------+--------------------
1 | ['text1', 'text2', 'text1'] | {'key1': 'value1'} | {'text1', 'text2'}
(1 rows)
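The result above shows the difference between the collection types: the list keeps the duplicate 'text1' and the insertion order, while the set stores each value once. The same behaviour can be seen with Python's built-in types; this is an analogy for illustration, not Cassandra code, and sorting the set is done only to mirror the ordered display in the cqlsh output.

```python
values = ["text1", "text2", "text1"]

string_list = list(values)         # ordered, duplicates kept
string_set = sorted(set(values))   # unique values, sorted for display
string_map = {"key1": "value1"}    # key-value pairs

print(string_list)
print(string_set)
print(string_map)
```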
42. UDF – User Defined Function
Sample Code
CREATE FUNCTION count_if_true(input boolean)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java AS 'if (input) return 1; else return 0;';
SELECT door_number, count_if_true(is_open)
FROM my_doors;
"CREATE OR REPLACE" or "IF NOT EXISTS" syntax possible
43. UDA – User Defined Aggregate
Sample Code
CREATE FUNCTION state_count_if_true(total int, input boolean)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java AS 'if (input) return total+1; else return total;';
CREATE AGGREGATE total_open (boolean)
SFUNC state_count_if_true
STYPE int
INITCOND 0;
SELECT door_number, total_open(is_open)
FROM my_doors;
"CREATE OR REPLACE" or "IF NOT EXISTS" syntax possible
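Conceptually, an aggregate defined with SFUNC, STYPE and INITCOND is a fold: the state starts at INITCOND and the state function is applied once per row. The Python sketch below mirrors the slide's state_count_if_true and total_open by name, but it is only a model of the idea; the list of booleans stands in for the is_open column and is invented for the example.

```python
from functools import reduce


def state_count_if_true(total: int, value: bool) -> int:
    """State function: same logic as the Java body on the slide."""
    return total + 1 if value else total


def total_open(values, initcond: int = 0) -> int:
    """Aggregate: fold the state function over all rows, starting at INITCOND."""
    return reduce(state_count_if_true, values, initcond)


is_open = [True, False, True, True]   # hypothetical column values
print(total_open(is_open))
```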
44. Materialized Views
Relieve the pain of manual denormalization
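cqlsh:training> CREATE MATERIALIZED VIEW employee_by_role
AS SELECT role, name, age
FROM employee
WHERE role IS NOT NULL AND name IS NOT NULL
PRIMARY KEY (role, name);
Without a materialized view, the equivalent table has to be created and maintained by hand:
cqlsh:training> CREATE TABLE employee_by_role (
role text, name text, age int,
PRIMARY KEY (role, name));
Materialized views are available as of Cassandra 3.0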
45. Time-to-Live (TTL) on Insert
• Write a value with a TTL in seconds (15 s in this example)
• After that the value is deleted
cqlsh:my_space> UPDATE emp USING ttl 15 SET sal = 8000
WHERE empno = 9000;
cqlsh:my_space> SELECT ename, sal, ttl(sal) FROM emp WHERE empno = 9000;
ename | sal | ttl(sal)
-------+------+----------
null | 8000 | 15
(1 rows)
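The effect of a TTL can be modelled as a value with an expiry time. The class below is a hypothetical plain-Python sketch, not Cassandra's implementation: the clock is passed in explicitly as `now` (in seconds) so that the behaviour is reproducible, and the numbers match the 15-second example above.

```python
class ExpiringCell:
    """A value that disappears once its time-to-live has elapsed."""

    def __init__(self, value, ttl_seconds, written_at):
        self.value = value
        self.expires_at = written_at + ttl_seconds

    def read(self, now):
        """Return (value, remaining TTL), or (None, None) once expired."""
        remaining = self.expires_at - now
        if remaining <= 0:
            return None, None
        return self.value, remaining


sal = ExpiringCell(8000, ttl_seconds=15, written_at=0)

print(sal.read(now=0))    # right after the write
print(sal.read(now=5))    # TTL counts down
print(sal.read(now=15))   # expired
```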
46. Summary
• Just great :-)
• No single point of failure – Ring Model
• Distribution over nodes / racks / data centers
• Tuneable Consistency
• CQL
• Spark / Cassandra Integration
• CQL limited
• Forget 20 years of experience in relational modelling :-( => DENORMALIZE :-)