Everyone wants to build applications that are scalable and highly available. But how do you build a site that’s capable of withstanding the public’s insatiable demand for sharing cat videos, even if your data center gets hit with a nuclear bomb? In this session we’ll take a look at KillrVideo, an open source video sharing application demo (similar to YouTube) built on Apache Cassandra and Microsoft Azure. You’ll get an introduction to Cassandra, a highly available distributed database, including data modeling (and how it differs from the relational world you probably have experience with), querying with CQL, and interacting with Cassandra from your code. We’ll also touch on using Azure Media Services for processing and streaming video content, as well as how to set up a Cassandra cluster in Azure. While the code samples in this session will be in C#, the same APIs are available and the same concepts apply to other languages (like Java and Python). If you’re interested in learning more about NoSQL solutions, Cassandra, or Azure, this talk will get you started. No kittens were harmed in the making of this talk.
Satisfying the Public's Demand for Cat Videos with Cassandra and Azure (from Microsoft Denver Dev Day 2014)
1. Satisfying the Public’s Demand for Cat Videos with Cassandra and Azure
Luke Tillman (@LukeTillman)
Language Evangelist at DataStax
2. Who are you?!
•Evangelist with a focus on the .NET Community
•Long-time Developer
•Recently presented at Cassandra Summit 2014 with Microsoft
•Very Recent Denver Transplant
3. Agenda
•1: What is this KillrVideo thing you speak of?
•2: Cassandra, the really short version
•3: CQL: NoSQL, now with more SQL!
•4: Breaking the Relational Mindset
•5: Putting it all together: Cassandra, Azure, and .NET
5. KillrVideo, a Video Sharing Site
•Think a YouTube competitor
–Users add videos, rate them, comment on them, etc.
–Can search for videos by tag
6. See the Live Demo, Get the Code
•Live demo available at http://www.killrvideo.com
–Written in C#
–Live Demo running in Azure
–Open source: https://github.com/luketillman/killrvideo-csharp
•Interesting use case because of different data modeling challenges and the scale of something like YouTube
–More than 1 billion unique users visit YouTube each month
–100 hours of video are uploaded to YouTube every minute
7. Just How Popular are Cats on the Internet?
http://mashable.com/2013/07/08/cats-bacon-rule-internet/
10. What is Cassandra?
•A Linearly Scaling and Fault Tolerant Distributed Database
•Fully Distributed
–Data spread over many nodes
–All nodes participate in a cluster
–All nodes are equal
–No SPOF (shared nothing)
11. What is Cassandra?
•Linearly Scaling
–Have More Data? Add more nodes.
–Need More Throughput? Add more nodes.
•Fault Tolerant
–Nodes Down != Database Down
–Datacenter Down != Database Down
12. What is Cassandra?
•Fully replicated across multiple DCs
•Clients write local
•Data syncs across WAN
•Replication Factor per DC
[Figure: a Client connected to a cluster with US and Europe datacenters]
13. Cassandra and the CAP Theorem
•The CAP Theorem limits what distributed systems can do
–Consistency
–Availability
–Partition Tolerance
•Limits? “Pick 2 out of 3”
•Cassandra is an AP system that is Eventually Consistent
14. Two knobs control Cassandra fault tolerance
•Replication Factor (server side)
–How many copies of the data should exist?
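[Figure: Client sends Write A to a four-node ring (A, B, C, D) with RF=3; each node also holds replicas of two other ranges (A: C, D; B: A, D; C: A, B; D: B, C)]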
15. Two knobs control Cassandra fault tolerance
•Consistency Level (client side)
–How many replicas do we need to hear from before we acknowledge?
[Figure: the four-node ring shown twice, once for Write A with CL=QUORUM and once for Write A with CL=ONE]
16. Consistency Levels
•Applies to both Reads and Writes (i.e., it is set on each query)
•ONE – one replica from any DC
•LOCAL_ONE – one replica from local DC
•QUORUM – a majority of replicas (floor(RF/2) + 1), counted across all DCs
•LOCAL_QUORUM – a majority of replicas (floor(RF/2) + 1) in the local DC
•ALL – all replicas
•TWO – two replicas from any DC
17. Consistency Level and Availability
•Consistency Level choice affects availability
•For example, QUORUM can tolerate one replica being down and still be available (in RF=3)
[Figure: Read A (CL=QUORUM) on the four-node ring; the replicas that respond return A=2]
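The arithmetic behind consistency levels and availability can be sketched in a few lines. This is a minimal Python illustration (the talk's own samples are C#), not driver code: `required_acks` and `is_available` are hypothetical helper names, and a quorum is assumed to be a strict majority, floor(RF/2) + 1.

```python
def required_acks(level, rf_by_dc, local_dc):
    """Number of replica responses a request at this consistency level waits for."""
    total_rf = sum(rf_by_dc.values())
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level == "TWO":
        return 2
    if level == "QUORUM":
        return total_rf // 2 + 1            # majority across all DCs
    if level == "LOCAL_QUORUM":
        return rf_by_dc[local_dc] // 2 + 1  # majority within the local DC
    if level == "ALL":
        return total_rf
    raise ValueError(f"unknown consistency level: {level}")


def is_available(level, rf_by_dc, local_dc, live_by_dc):
    """True if enough replicas are up to satisfy the consistency level."""
    if level.startswith("LOCAL_"):
        live = live_by_dc[local_dc]         # only local replicas count
    else:
        live = sum(live_by_dc.values())
    return live >= required_acks(level, rf_by_dc, local_dc)
```

With RF=3 in a single datacenter, `required_acks("QUORUM", {"dc1": 3}, "dc1")` is 2, which is why QUORUM stays available with one replica down while ALL does not.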
18. Eventual Consistency
•Cassandra is an AP system that is Eventually Consistent so replicas may disagree
•Column values are timestamped
•In Cassandra, Last Write Wins (LWW)
[Figure: Read A (CL=QUORUM) on the four-node ring; one replica returns A=2 (newer), another returns A=1 (older), and the client receives A=2]
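Last Write Wins can be illustrated with a short Python sketch. It assumes each replica response carries a value plus a write timestamp (microseconds since the epoch here); `Cell` and `resolve` are hypothetical names for illustration, and tie-breaking between equal timestamps is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: object
    write_ts: int  # write timestamp in microseconds since the epoch


def resolve(responses):
    """Last Write Wins: keep the response with the highest write timestamp."""
    return max(responses, key=lambda cell: cell.write_ts)


older = Cell(value=1, write_ts=1_412_156_100_000_000)
newer = Cell(value=2, write_ts=1_412_156_160_000_000)
winner = resolve([older, newer])  # the newer write, value 2
```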
20. Schema Definition (DDL)
•Easy to define tables for storing data
•First part of Primary Key is the Partition Key
CREATE TABLE videos (
videoid uuid,
userid uuid,
name text,
description text,
preview_image_location text,
tags set<text>,
added_date timestamp,
PRIMARY KEY (videoid)
);
21. Partition Key
Partition Key Determines Data Distribution
•Partition Key determines node placement
videoid | name | description | ...
689d56e5- … | Keyboard Cat | Keyboard Cat is the ... | ...
93357d73- … | Nyan Cat | Check out Nyan cat ... | ...
d978b136- … | Original Grumpy Cat | Visit Grumpy Cat’s … | ...
22. Partition Key – Hashing
•The Partition Key is hashed using a consistent hashing function (Murmur3) and the output is used to place the data on a node
•The data is also replicated to RF-1 other nodes
[Figure: the Partition Key videoid 689d56e5- ... (Keyboard Cat) is hashed with Murmur3 to A and placed on the four-node ring with RF=3]
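A simplified Python sketch of this placement: hash the partition key to pick a primary node, then copy the data to the next RF-1 nodes around the ring. Two assumptions are made for illustration only: MD5 from the standard library stands in for Murmur3, and a plain modulo stands in for token ranges. `primary_index` and `replicas` are hypothetical names, and the key string is made up.

```python
import hashlib

NODES = ["A", "B", "C", "D"]  # ring order


def primary_index(partition_key, node_count):
    # Stand-in hash: Cassandra uses Murmur3; MD5 is used here only
    # because it ships with the Python standard library.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


def replicas(partition_key, nodes=NODES, rf=3):
    """Primary node plus the next rf-1 nodes walking around the ring."""
    start = primary_index(partition_key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


placement = replicas("keyboard-cat")  # hypothetical key; three distinct nodes
```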
23. Hashing – Back to Reality
•Back in reality, Murmur3 hashes Partition Keys to large numbers (a 128-bit hash, from which Cassandra’s Murmur3Partitioner derives a 64-bit token)
•Nodes in Cassandra own token ranges (i.e. hash ranges)
[Figure: the Partition Key videoid 689d56e5- ... is hashed with Murmur3 to 0xadb95e99da887a8a4cb474db86eb5769, which falls inside one node's token range on the four-node ring]
Range | Start | End
A | 0xC000000..1 | 0x0000000..0
B | 0x0000000..1 | 0x4000000..0
C | 0x4000000..1 | 0x8000000..0
D | 0x8000000..1 | 0xC000000..0
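Looking up the owner of a token is a search over sorted range boundaries. The Python sketch below assumes the slide's abbreviated boundaries split a 128-bit space into equal quarters, with range A wrapping around past the top of the ring; `owner` is a hypothetical helper, not a Cassandra API.

```python
import bisect

# Inclusive upper bounds of ranges B, C and D (0x4000...0, 0x8000...0, 0xC000...0).
RANGE_ENDS = [0x4 << 124, 0x8 << 124, 0xC << 124]
OWNERS = ["B", "C", "D", "A"]  # anything above the last bound wraps to A


def owner(token):
    """Return the range that owns a 128-bit token."""
    if not 0 <= token < (1 << 128):
        raise ValueError("token must fit in 128 bits")
    if token == 0:
        return "A"  # range A ends at 0x000...0
    return OWNERS[bisect.bisect_left(RANGE_ENDS, token)]


keyboard_cat_owner = owner(0xadb95e99da887a8a4cb474db86eb5769)
```

Under these assumed boundaries, the hash shown on the slide lands in range D.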
24. Clustering Columns
•Second part of Primary Key is Clustering Column(s)
•Clustering columns affect ordering of data (on disk)
•Ascending/Descending order is possible
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
25. Clustering Columns – Wide Rows
•Use of Clustering Columns (and the layout on disk) is where the term “Wide Rows” comes from
[Figure: a single partition, videoid='0fe6a...', with its comments laid out side by side]
commentid='82be1...' (10/1/2014 9:36AM): userid='ac346...', comment='Awesome!'
commentid='765ac...' (9/17/2014 7:55AM): userid='f89d3...', comment='Garbage!'
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
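One way to picture this layout is a map from partition key to a list of rows kept sorted by the clustering column. The Python sketch below is an in-memory analogy only, not how Cassandra stores data on disk: a `datetime` stands in for the timeuuid `commentid`, the ids are made up, and `insert_comment` and `latest_comments` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime

# partition key (videoid) -> rows sorted by clustering key, newest first
partitions = defaultdict(list)


def insert_comment(videoid, comment_time, userid, comment):
    rows = partitions[videoid]
    rows.append((comment_time, userid, comment))
    # Mirrors CLUSTERING ORDER BY (commentid DESC)
    rows.sort(key=lambda row: row[0], reverse=True)


def latest_comments(videoid, limit):
    """Reading the newest comments is a slice from the front of one partition."""
    return partitions[videoid][:limit]


insert_comment("video-1", datetime(2014, 9, 17, 7, 55), "user-b", "Garbage!")
insert_comment("video-1", datetime(2014, 10, 1, 9, 36), "user-a", "Awesome!")
newest = latest_comments("video-1", 1)
```

Because all comments for a video live in one partition and are already ordered, fetching the latest few touches a single node and needs no sort at read time.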
26. Inserts and Updates
•Use INSERT or UPDATE to add and modify data
•Both will overwrite data (no constraints like RDBMS)
•INSERT and UPDATE functionally equivalent
INSERT INTO comments_by_video (videoid, commentid, userid, comment)
VALUES ('0fe6a...', '82be1...', 'ac346...', 'Awesome!');
UPDATE comments_by_video
SET userid = 'ac346...', comment = 'Awesome!'
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
27. TTL and Deletes
•Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE
•Use DELETE statement to remove data
•Can optionally specify columns to remove part of a row
INSERT INTO comments_by_video ( ... )
VALUES ( ... )
USING TTL 86400;
DELETE FROM comments_by_video WHERE videoid = '0fe6a...' AND commentid = '82be1...';
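The upsert, TTL, and delete semantics above can be mimicked with a small in-memory table. This Python sketch is an analogy under stated assumptions: time is passed in explicitly as `now` (in seconds), expiry is tracked per column, and tombstones and replication are not modeled. The `Table` class and its methods are hypothetical.

```python
class Table:
    def __init__(self):
        # (primary_key, column) -> (value, expires_at or None)
        self._cells = {}

    def upsert(self, key, columns, now, ttl=None):
        """INSERT and UPDATE both end up here: existing values are overwritten."""
        expires_at = None if ttl is None else now + ttl
        for name, value in columns.items():
            self._cells[(key, name)] = (value, expires_at)

    def read(self, key, now):
        """Return the live columns of a row, or None if nothing is left."""
        row = {}
        for (k, name), (value, expires_at) in self._cells.items():
            if k == key and (expires_at is None or now < expires_at):
                row[name] = value
        return row or None

    def delete(self, key, columns=None):
        """Remove a whole row, or only the named columns."""
        for k, name in list(self._cells):
            if k == key and (columns is None or name in columns):
                del self._cells[(k, name)]
```

For example, a comment written with `ttl=86400` at `now=0` is still returned by `read` at `now=86399` and is gone at `now=86400`.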
28. Querying
•Use SELECT to get data from your tables
•Always include Partition Key and optionally Clustering Columns
•Can use ORDER BY (on Clustering Columns) and LIMIT
•Use range queries (for example, by date) to slice partitions
SELECT * FROM comments_by_video
WHERE videoid = 'a67cd...'
LIMIT 10;
30. Breaking the Relational Mindset
•How do we data model when we have to query by the Partition Key (and optionally Clustering Columns)?
•Denormalize all the things!
•Disk is cheap now and writes in Cassandra are FAST
•Data modeling is very much query driven
•Many times we end up with a “table per query”
31. Users – The Relational Way
•Single Users table with all user data and an Id Primary Key
•Add an index on email address to allow queries by email
•User logs into site → Find user by email address
•Show basic information about user → Find user by id
32. Users – The Cassandra Way
•User logs into site → Find user by email address
CREATE TABLE user_credentials (
email text,
password text,
userid uuid,
PRIMARY KEY (email)
);
•Show basic information about user → Find user by id
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
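The "table per query" idea can be sketched with two dictionaries, each keyed by the value its query looks up. This is a Python analogy rather than driver code: the dictionaries stand in for the `user_credentials` and `users` tables, the function names are hypothetical, and the password value is assumed to be hashed already.

```python
import uuid

user_credentials = {}  # email  -> {"password": ..., "userid": ...}
users = {}             # userid -> {"firstname": ..., "lastname": ..., "email": ...}


def create_user(email, password, firstname, lastname):
    """Write the same user into both tables, one per query."""
    userid = uuid.uuid4()
    user_credentials[email] = {"password": password, "userid": userid}
    users[userid] = {"firstname": firstname, "lastname": lastname, "email": email}
    return userid


def find_user_by_email(email):
    return user_credentials.get(email)  # used when the user logs in


def find_user_by_id(userid):
    return users.get(userid)            # used to show basic user information
```

Note that `email` is stored twice, so changing it means updating both tables, which is the kind of trade-off to weigh before duplicating data.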
33. Considerations When Duplicating Data
•Can the data change?
•How likely is it to change or how frequently will it change?
•Do I have all the information I need to update duplicates and maintain consistency?
•Just scratching the surface of data modeling examples here
35. KillrVideo on Azure
•Cassandra Cluster (DSE): app data storage (video metadata, comments, users, ratings, etc.)
•Azure Media Services: uploaded video encoding, thumbnail generation, video access URI generation
•Azure Storage: Queues for notifications on encoding job progress, Blob for uploaded video storage
•OpsCenter: provisioning, monitoring, management
•KillrVideo Web App: C# MVC Web Application, Azure Web Role; serves up UI, JSON Endpoints
•KillrVideo Upload Worker: C#, Azure Worker Role; monitors encoding job events, publishes completed uploads
•Web UI: HTML5 / JavaScript (KnockoutJS, jQuery, Bootstrap, etc.)
36. Deploying Cassandra in Azure
•Cassandra is a JVM application and should be deployed on Linux VMs (parity in Windows is coming – 3.0?)
•IOPS are super important (recommend A7 instances for production, A4 for testing and development)
•New SSD instances in Azure look promising
•In-depth documentation and scripts available to help
37. .NET and Cassandra
•Open Source (on GitHub), available via NuGet
•Bootstrap using the Builder and then reuse the ISession object
Cluster cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .Build();
ISession session = cluster.Connect("killrvideo");
38. .NET and Cassandra
•Executing CQL
•Sync and Async API available
// userid is a uuid column, so bind a Guid (someUserId) rather than an integer
var statement = new SimpleStatement("SELECT * FROM users WHERE userid = ?");
statement = statement.Bind(someUserId);
RowSet rows = await session.ExecuteAsync(statement);
39. .NET and Cassandra
•Getting values from a RowSet is easy
•RowSet is a collection of Row (IEnumerable<Row>)
RowSet rows = await session.ExecuteAsync(statement);
foreach (Row row in rows)
{
    var videoId = row.GetValue<Guid>("videoid");
    var addedDate = row.GetValue<DateTimeOffset>("added_date");
    var name = row.GetValue<string>("name");
}
40. .NET and Cassandra
•Mapping results to DTOs: if you like using CQL, try CqlPoco package
•Note: This package may be pulled into the official driver soon.
public class User
{
    public Guid UserId { get; set; }
    public string Name { get; set; }
}

// Get a user by id from Cassandra or null if not found
var user = client.SingleOrDefault<User>(
    "SELECT userid, name FROM users WHERE userid = ?", someUserId);
41. .NET and Cassandra
•Mapping results to DTOs: if you like LINQ, use built-in LINQ provider
[Table("users")]
public class User
{
    [Column("userid"), PartitionKey]
    public Guid UserId { get; set; }

    [Column("name")]
    public string Name { get; set; }
}

var user = session.GetTable<User>()
    .SingleOrDefault(u => u.UserId == someUserId)
    .Execute();
42. Some Tips for .NET and Cassandra
•Look at Prepared Statements in the documentation for an easy performance optimization
•Take advantage of the async API to run queries in parallel
•Don’t write boilerplate mapping code—use LINQ or CqlPoco
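The tip about running queries in parallel comes down to starting every request before waiting on any of them. Here is a minimal Python `asyncio` sketch of that pattern (the talk's samples use the C# async API instead); `fetch_user` is a hypothetical stand-in that sleeps rather than querying Cassandra.

```python
import asyncio


async def fetch_user(userid):
    """Hypothetical stand-in for an async driver call; no database is contacted."""
    await asyncio.sleep(0.01)  # simulated network round trip
    return {"userid": userid}


async def fetch_users(userids):
    # Start all lookups at once, then wait for them together.
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(fetch_user(u) for u in userids))


fetched = asyncio.run(fetch_users([1, 2, 3]))
```

The three simulated lookups overlap, so the total wait is close to one round trip rather than three.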
43. What Next?
•Planet Cassandra: http://planetcassandra.org/
–Windows installer for Cassandra for development
–More information on the drivers
–Resources for Data Modeling
•Guidance and Scripts for deployments on Azure
–https://academy.datastax.com/demos/enterprise-deployment-microsoft-azure-cloud
•KillrVideo Source
–https://github.com/luketillman/killrvideo-csharp
44. Questions?
Follow me on Twitter for updates or to ask questions later: @LukeTillman