Satisfying the Public’s Demand for Cat Videos with Cassandra and Azure 
Luke Tillman (@LukeTillman) 
Language Evangelist at DataStax
Who are you?! 
•Evangelist with a focus on the .NET Community 
•Long-time Developer 
•Recently presented at Cassandra Summit 2014 with Microsoft 
•Very Recent Denver Transplant 
1. What is this KillrVideo thing you speak of? 
2. Cassandra, the really short version 
3. CQL: NoSQL, now with more SQL! 
4. Breaking the Relational Mindset 
5. Putting it all together: Cassandra, Azure, and .NET 
What is this KillrVideo thing you speak of? 
KillrVideo, a Video Sharing Site 
•Think a YouTube competitor 
–Users add videos, rate them, comment on them, etc. 
–Can search for videos by tag 
See the Live Demo, Get the Code 
•Live demo available at http://www.killrvideo.com 
–Written in C# 
–Live Demo running in Azure 
–Open source: https://github.com/luketillman/killrvideo-csharp 
•Interesting use case because of different data modeling challenges and the scale of something like YouTube 
–More than 1 billion unique users visit YouTube each month 
–100 hours of video are uploaded to YouTube every minute 
Just How Popular are Cats on the Internet? 
Source: http://mashable.com/2013/07/08/cats-bacon-rule-internet/
Cassandra, the really short version
What is Cassandra? 
•A Linearly Scaling and Fault Tolerant Distributed Database 
•Fully Distributed 
–Data spread over many nodes 
–All nodes participate in a cluster 
–All nodes are equal 
–No single point of failure (shared-nothing architecture) 
What is Cassandra? 
•Linearly Scaling 
–Have More Data? Add more nodes. 
–Need More Throughput? Add more nodes. 
•Fault Tolerant 
–Nodes Down != Database Down 
–Datacenter Down != Database Down
What is Cassandra? 
•Fully replicated across multiple datacenters (DCs) 
•Clients write to their local DC 
•Data syncs across the WAN 
•Replication Factor is set per DC 
[Diagram: a client and two datacenters, US and Europe]
Cassandra and the CAP Theorem 
•The CAP Theorem limits what distributed systems can do 
–Consistency 
–Availability 
–Partition Tolerance 
•Limits? “Pick 2 out of 3” 
•Cassandra is an AP system that is Eventually Consistent 
Two knobs control Cassandra fault tolerance 
•Replication Factor (server side) 
–How many copies of the data should exist? 
[Diagram: a client writes A to a four-node ring (nodes A, B, C, D) with RF=3]
Two knobs control Cassandra fault tolerance 
•Consistency Level (client side) 
–How many replicas do we need to hear from before we acknowledge? 
[Diagrams: a client writes A to the four-node ring (nodes A, B, C, D), once with CL=QUORUM and once with CL=ONE]
Consistency Levels 
•Applies to both Reads and Writes (i.e. it is set on each query) 
•ONE – one replica from any DC 
•TWO – two replicas from any DC 
•LOCAL_ONE – one replica from local DC 
•QUORUM – a majority of replicas (floor(RF/2) + 1) across all DCs 
•LOCAL_QUORUM – a majority of replicas in the local DC 
•ALL – all replicas 
Consistency Level and Availability 
•Consistency Level choice affects availability 
•For example, QUORUM can tolerate one replica being down and still be available (in RF=3) 
[Diagram: Read A (CL=QUORUM) on the four-node ring (nodes A, B, C, D); the replicas hold A=2]
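The arithmetic behind the RF=3 example can be sketched in a few lines. This is a minimal illustration, not driver code: the function names are made up for this sketch, and it assumes a quorum is a simple majority of the replicas.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests can still succeed."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, can lose {tolerated_failures(rf)} replica(s)")
```

With RF=3 a quorum is 2 replicas, so one replica can be down; with RF=1 nothing can be down.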
Eventual Consistency 
•Cassandra is an AP system that is Eventually Consistent so replicas may disagree 
•Column values are timestamped 
•In Cassandra, Last Write Wins (LWW) 
[Diagram: Read A (CL=QUORUM) on the four-node ring; one replica returns A=2 (newer), another returns A=1 (older), and the client receives A=2]
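Last Write Wins can be pictured as taking the response with the highest write timestamp. The sketch below is a simplified model of that idea, assuming each replica response carries a value and a timestamp; the `Cell` type, the `resolve` function, and the timestamp values are hypothetical and not part of Cassandra or its drivers.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    value: int
    write_timestamp: int  # hypothetical microsecond timestamp attached to the write


def resolve(responses: list[Cell]) -> Cell:
    """Pick the response with the highest write timestamp (last write wins)."""
    return max(responses, key=lambda cell: cell.write_timestamp)


# Two replicas answer a QUORUM read of A; one has not seen the latest write yet.
newer = Cell(value=2, write_timestamp=1_412_156_160_000_000)
older = Cell(value=1, write_timestamp=1_412_156_100_000_000)
print(resolve([older, newer]).value)
```

The order in which replicas respond does not matter here; only the timestamps decide which value is returned.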
CQL: NoSQL, now with more SQL!
Schema Definition (DDL) 
•Easy to define tables for storing data 
•First part of Primary Key is the Partition Key 
CREATE TABLE videos ( 
videoid uuid, 
userid uuid, 
name text, 
description text, 
preview_image_location text, 
tags set<text>, 
added_date timestamp, 
PRIMARY KEY (videoid) 
); 
Partition Key Determines Data Distribution 
•Partition Key determines node placement 
videoid (Partition Key)   name                  description               ...
689d56e5- …               Keyboard Cat          Keyboard Cat is the ...   ...
93357d73- …               Nyan Cat              Check out Nyan cat ...    ...
d978b136- …               Original Grumpy Cat   Visit Grumpy Cat’s …      ...
Partition Key – Hashing 
•The Partition Key is hashed with Murmur3, and the resulting token is used to place the data on a node (consistent hashing) 
•The data is also replicated to RF-1 other nodes 
[Diagram: videoid 689d56e5- ... (Keyboard Cat) hashes to A on the four-node ring (nodes A, B, C, D) with RF=3]
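One way to picture "replicated to RF-1 other nodes" is to walk the ring clockwise from the node that owns the token. The sketch below assumes this simple clockwise placement on a single four-node ring; the `RING` list and `replicas_for` function are illustrative names, and multi-datacenter clusters use a topology-aware strategy instead.

```python
RING = ["A", "B", "C", "D"]  # node order around the ring, as drawn on the slides


def replicas_for(primary: str, replication_factor: int) -> list[str]:
    """Return the primary node plus the next RF-1 nodes clockwise."""
    if not 1 <= replication_factor <= len(RING):
        raise ValueError("replication factor must be between 1 and the node count")
    start = RING.index(primary)
    return [RING[(start + i) % len(RING)] for i in range(replication_factor)]


print(replicas_for("A", 3))
print(replicas_for("D", 3))
```

Data owned by A ends up on A, B, and C, while data owned by D wraps around the ring to D, A, and B.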
Hashing – Back to Reality 
•Back in reality, Partition Keys hash to large integer tokens (drawn here as 128-bit values; Cassandra's Murmur3Partitioner uses 64-bit tokens) 
•Nodes in Cassandra own token ranges (i.e. hash ranges) 
Range   Start           End
A       0xC000000..1    0x0000000..0
B       0x0000000..1    0x4000000..0
C       0x4000000..1    0x8000000..0
D       0x8000000..1    0xC000000..0
Murmur3(videoid 689d56e5- ...) = 0xadb95e99da887a8a4cb474db86eb5769
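Finding the owner of a token is a lookup over sorted range ends. The sketch below uses the illustrative 128-bit ranges from the table on this slide and assumes each range ends inclusively at the listed value, with A's range wrapping around past the top of the token space; `owner`, `RANGE_ENDS`, and the other names are made up for this example.

```python
from bisect import bisect_left

BITS = 128  # illustrative token width, matching the ranges on the slide
QUARTER = 1 << (BITS - 2)

# Each entry is (inclusive end of range, owning node), sorted by end token.
RANGE_ENDS = [(0, "A"), (QUARTER, "B"), (2 * QUARTER, "C"), (3 * QUARTER, "D")]
ENDS = [end for end, _ in RANGE_ENDS]


def owner(token: int) -> str:
    """Find the node whose token range contains the given token."""
    if not 0 <= token < (1 << BITS):
        raise ValueError("token out of range")
    index = bisect_left(ENDS, token)
    # Tokens past the last range end wrap around to the first node (A).
    return RANGE_ENDS[index % len(RANGE_ENDS)][1]


print(owner(0xadb95e99da887a8a4cb474db86eb5769))
```

A binary search keeps the lookup cheap even when a cluster has many more ranges than the four shown here.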
Clustering Columns 
•Second part of Primary Key is Clustering Column(s) 
•Clustering columns affect ordering of data (on disk) 
•Ascending/Descending order is possible 
CREATE TABLE comments_by_video ( 
videoid uuid, 
commentid timeuuid, 
userid uuid, 
comment text, 
PRIMARY KEY (videoid, commentid) 
) WITH CLUSTERING ORDER BY (commentid DESC);
Clustering Columns – Wide Rows 
•Use of Clustering Columns (and the layout on disk) is where the term “Wide Rows” comes from 
videoid='0fe6a...' 
  commentid='82be1...' (10/1/2014 9:36AM): userid='ac346...', comment='Awesome!' 
  commentid='765ac...' (9/17/2014 7:55AM): userid='f89d3...', comment='Garbage!' 
CREATE TABLE comments_by_video ( 
videoid uuid, 
commentid timeuuid, 
userid uuid, 
comment text, 
PRIMARY KEY (videoid, commentid) 
) WITH CLUSTERING ORDER BY (commentid DESC);
Inserts and Updates 
•Use INSERT or UPDATE to add and modify data 
•Both will overwrite data (no constraints like RDBMS) 
•INSERT and UPDATE are functionally equivalent for most purposes (both act as upserts) 
INSERT INTO comments_by_video ( 
videoid, commentid, userid, comment) 
VALUES ( 
'0fe6a...', '82be1...', 'ac346...', 'Awesome!'); 
UPDATE comments_by_video 
SET userid = 'ac346...', comment = 'Awesome!' 
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
TTL and Deletes 
•Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE 
•Use DELETE statement to remove data 
•Can optionally specify columns to remove part of a row 
INSERT INTO comments_by_video ( ... ) 
VALUES ( ... ) 
USING TTL 86400; 
DELETE FROM comments_by_video 
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
Querying 
•Use SELECT to get data from your tables 
•Always include Partition Key and optionally Clustering Columns 
•Can use ORDER BY (on Clustering Columns) and LIMIT 
•Use range queries (for example, by date) to slice partitions 
SELECT * FROM comments_by_video 
WHERE videoid = 'a67cd...' 
LIMIT 10;
Breaking the Relational Mindset
Breaking the Relational Mindset 
•How do we data model when we have to query by the Partition Key (and optionally Clustering Columns)? 
•Denormalize all the things! 
•Disk is cheap now and writes in Cassandra are FAST 
•Data modeling is very much query driven 
•Many times we end up with a “table per query” 
Users – The Relational Way 
•Single Users table with all user data and an Id Primary Key 
•Add an index on email address to allow queries by email 
•User logs into site: find user by email address 
•Show basic information about user: find user by id 
Users – The Cassandra Way 
User logs into site: find user by email address 
CREATE TABLE user_credentials ( 
email text, 
password text, 
userid uuid, 
PRIMARY KEY (email) 
); 
Show basic information about user: find user by id 
CREATE TABLE users ( 
userid uuid, 
firstname text, 
lastname text, 
email text, 
created_date timestamp, 
PRIMARY KEY (userid) 
); 
Considerations When Duplicating Data 
•Can the data change? 
•How likely is it to change or how frequently will it change? 
•Do I have all the information I need to update duplicates and maintain consistency? 
•Just scratching the surface of data modeling examples here 
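To make the cost of duplication concrete, here is a small in-memory sketch of the user_credentials and users tables from the previous slide. It assumes plain dictionaries keyed by each table's partition key stand in for Cassandra; `create_user`, `change_email`, and the sample data are hypothetical names and values used only for illustration.

```python
import uuid

# Two in-memory stand-ins for the tables, each keyed by its partition key.
user_credentials: dict[str, dict] = {}  # keyed by email
users: dict[uuid.UUID, dict] = {}       # keyed by userid


def create_user(email: str, password_hash: str, first: str, last: str) -> uuid.UUID:
    """Write the same user to both tables (one table per query pattern)."""
    userid = uuid.uuid4()
    user_credentials[email] = {"password": password_hash, "userid": userid}
    users[userid] = {"firstname": first, "lastname": last, "email": email}
    return userid


def change_email(userid: uuid.UUID, new_email: str) -> None:
    """Changing duplicated data means touching every copy."""
    old_email = users[userid]["email"]
    # The email is the key of user_credentials, so that row has to move.
    credentials = user_credentials.pop(old_email)
    user_credentials[new_email] = credentials
    users[userid]["email"] = new_email


uid = create_user("cat@example.com", "<hash>", "Keyboard", "Cat")
change_email(uid, "keyboard.cat@example.com")
print(user_credentials["keyboard.cat@example.com"]["userid"] == uid)
```

The email change only works because the users table holds the old email; that is the "do I have all the information I need" question in practice.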
Putting it all together: Cassandra, Azure, and .NET
KillrVideo on Azure 
•Cassandra Cluster (DSE): app data storage (video metadata, comments, users, ratings, etc.) 
•Azure Media Services: uploaded video encoding, thumbnail generation, video access URI generation 
•Azure Storage: Queues for notifications on encoding job progress, Blob for uploaded video storage 
•OpsCenter: provisioning, monitoring, management 
•KillrVideo Web App: C# MVC web application, Azure Web Role; serves up UI, JSON endpoints 
•KillrVideo Upload Worker: C#, Azure Worker Role; monitors encoding job events, publishes completed uploads 
•Web UI: HTML5 / JavaScript (KnockoutJS, jQuery, Bootstrap, etc.) 
Deploying Cassandra in Azure 
•Cassandra is a JVM application and should be deployed on Linux VMs (parity in Windows is coming – 3.0?) 
•IOPS are super important (recommend A7 instances for production, A4 for testing and development) 
•New SSD instances in Azure look promising 
•In-depth documentation and scripts available to help 
.NET and Cassandra 
•Open Source (on GitHub), available via NuGet 
•Bootstrap using the Builder and then reuse the ISession object 
Cluster cluster = Cluster.Builder() 
    .AddContactPoint("127.0.0.1") 
    .Build(); 
ISession session = cluster.Connect("killrvideo"); 
.NET and Cassandra 
•Executing CQL 
•Sync and Async API available 
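// userId is a Guid, matching the uuid type of the userid column 
var statement = new SimpleStatement("SELECT * FROM users WHERE userid = ?"); 
statement = statement.Bind(userId); 
// await requires an enclosing async method 
RowSet rows = await session.ExecuteAsync(statement); 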
.NET and Cassandra 
•Getting values from a RowSet is easy 
•RowSet is a collection of Row (IEnumerable<Row>) 
RowSet rows = await session.ExecuteAsync(statement); 
foreach (Row row in rows) 
{ 
    var videoId = row.GetValue<Guid>("videoid"); 
    var addedDate = row.GetValue<DateTimeOffset>("added_date"); 
    var name = row.GetValue<string>("name"); 
} 
.NET and Cassandra 
•Mapping results to DTOs: if you like using CQL, try CqlPoco package 
•Note: This package may be pulled into the official driver soon. 
public class User 
{ 
    public Guid UserId { get; set; } 
    public string Name { get; set; } 
} 

// Get a user by id from Cassandra or null if not found 
var user = client.SingleOrDefault<User>( 
    "SELECT userid, name FROM users WHERE userid = ?", someUserId); 
.NET and Cassandra 
•Mapping results to DTOs: if you like LINQ, use built-in LINQ provider 
[Table("users")] 
public class User 
{ 
    [Column("userid"), PartitionKey] 
    public Guid UserId { get; set; } 

    [Column("name")] 
    public string Name { get; set; } 
} 

var user = session.GetTable<User>() 
    .SingleOrDefault(u => u.UserId == someUserId) 
    .Execute(); 
Some Tips for .NET and Cassandra 
•Look at Prepared Statements in the documentation for an easy performance optimization 
•Take advantage of the async API to run queries in parallel 
•Don’t write boilerplate mapping code—use LINQ or CqlPoco 
What Next? 
•Planet Cassandra: http://planetcassandra.org/ 
–Windows installer for Cassandra for development 
–More information on the drivers 
–Resources for Data Modeling 
•Guidance and Scripts for deployments on Azure 
–https://academy.datastax.com/demos/enterprise-deployment-microsoft-azure-cloud 
•KillrVideo Source 
–https://github.com/luketillman/killrvideo-csharp 
Questions? 
Follow me on Twitter for updates or to ask questions later: @LukeTillman

Satisfying the Public's Demand for Cat Videos with Cassandra and Azure (from Microsoft Denver Dev Day 2014)

  • 1. Satisfying the Public’s Demand for Cat Videos with Cassandra and Azure Luke Tillman (@LukeTillman) Language Evangelist at DataStax
  • 2. Who are you?! •Evangelist with a focus on the .NET Community •Long-time Developer •Recently presented at Cassandra Summit 2014 with Microsoft •Very Recent Denver Transplant 2
  • 3. 1 What is this KillrVideo thing you speak of? 2 Cassandra, the really short version 3 CQL: NoSQL, now with more SQL! 4 Breaking the Relational Mindset 5 Putting it all together: Cassandra, Azure, and .NET 3
  • 4. What is this KillrVideo thing you speak of? 4
  • 5. KillrVideo, a Video Sharing Site •Think a YouTube competitor –Users add videos, rate them, comment on them, etc. –Can search for videos by tag 5
  • 6. See the Live Demo, Get the Code •Live demo available at http://www.killrvideo.com –Written in C# –Live Demo running in Azure –Open source: https://github.com/luketillman/killrvideo-csharp •Interesting use case because of different data modeling challenges and the scale of something like YouTube –More than 1 billion unique users visit YouTube each month –100 hours of video are uploaded to YouTube every minute 6
  • 7. Just How Popular are Cats on the Internet? 7 http://mashable.com/2013/07/08/cats-bacon-rule-internet/
  • 8. Just How Popular are Cats on the Internet? 8 http://mashable.com/2013/07/08/cats-bacon-rule-internet/
  • 9. Cassandra, the really short version
  • 10. What is Cassandra? •A Linearly Scaling and Fault Tolerant Distributed Database •Fully Distributed –Data spread over many nodes –All nodes participate in a cluster –All nodes are equal –No SPOF (shared nothing) 10
  • 11. What is Cassandra? Linearly Scaling –Have More Data? Add more nodes. –Need More Throughput? Add more nodes. 11 Fault Tolerant –Nodes Down != Database Down –Datacenter Down != Database Down
  • 12. What is Cassandra? •Fully replicated across multiple DCs •Clients write local •Data syncs across WAN •Replication Factor per DC 12 US Europe Client
  • 13. Cassandra and the CAP Theorem •The CAP Theorem limits what distributed systems can do –Consistency –Availability –Partition Tolerance •Limits? “Pick 2 out of 3” •Cassandra is an AP system that is Eventually Consistent 13
  • 14. Two knobs control Cassandra fault tolerance •Replication Factor (server side) –How many copies of the data should exist? 14 Client B AD C AB A CD D BC Write A RF=3
  • 15. Two knobs control Cassandra fault tolerance •Consistency Level (client side) –How many replicas do we need to hear from before we acknowledge? 15 Client B AD C AB A CD D BC Write A CL=QUORUM Client B AD C AB A CD D BC Write A CL=ONE
  • 16. Consistency Levels •Applies to both Reads and Writes (i.e. is set on each query) •ONE – one replica from any DC •LOCAL_ONE – one replica from local DC •QUORUM – 51% of replicas from any DC •LOCAL_QUORUM – 51% of replicas from local DC •ALL – all replicas •TWO 16
  • 17. Consistency Level and Availability •Consistency Level choice affects availability •For example, QUORUM can tolerate one replica being down and still be available (in RF=3) 17 Client B AD C AB A CD D BC A=2 A=2 A=2 Read A (CL=QUORUM)
  • 18. Eventual Consistency •Cassandra is an AP system that is Eventually Consistent so replicas may disagree •Column values are timestamped •In Cassandra, Last Write Wins (LWW) 18 Client B AD C AB A CD D BC Read A (CL=QUORUM) A=2 Newer A=1 Older A=2
  • 19. CQL: NoSQL, now with more SQL!
  • 20. Schema Definition (DDL) •Easy to define tables for storing data •First part of Primary Key is the Partition Key CREATE TABLE videos ( videoid uuid, userid uuid, name text, description text, preview_image_location text, tags set<text>, added_date timestamp, PRIMARY KEY (videoid) ); 20
  • 21. Partition Key Partition Key Determines Data Distribution •Partition Key determines node placement 21 name description ... Keyboard Cat Keyboard Cat is the ... ... Nyan Cat Check out Nyan cat ... ... Original Grumpy Cat Visit Grumpy Cat’s … ... videoid 689d56e5- … 93357d73- … d978b136- …
  • 22. Partition Key – Hashing •The Partition Key is hashed using a consistent hashing function (Murmur 3) and the output is used to place the data on a node •The data is also replicated to RF-1 other nodes 22 Murmur3 videoid: 689d56e5- ... Murmur3: A B AD C AB A CD D BC RF=3 Partition Key name description ... Keyboard Cat Keyboard Cat is the ... ... videoid 689d56e5- ...
  • 23. Hashing – Back to Reality •Back in reality, Partition Keys actually hash to 128 bit numbers •Nodes in Cassandra own token ranges (i.e. hash ranges) 23 B AD C AB A CD D BC Range Start End A 0xC000000..1 0x0000000..0 B 0x0000000..1 0x4000000..0 C 0x4000000..1 0x8000000..0 D 0x8000000..1 0xC000000..0 Murmur3 0xadb95e99da887a8a4cb474db86eb5769 Partition Key videoid 689d56e5- ...
  • 24. Clustering Columns •Second part of Primary Key is Clustering Column(s) •Clustering columns affect ordering of data (on disk) •Ascending/Descending order is possible 24 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 25. Clustering Columns – Wide Rows •Use of Clustering Columns (and the layout on disk) is where the term “Wide Rows” comes from 25 videoid='0fe6a...' userid= 'ac346...' comment= 'Awesome!' commentid='82be1...' (10/1/2014 9:36AM) userid= 'f89d3...' comment= 'Garbage!' commentid='765ac...' (9/17/2014 7:55AM) CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 26. Inserts and Updates •Use INSERT or UPDATE to add and modify data •Both will overwrite data (no constraints like RDBMS) •INSERT and UPDATE functionally equivalent 26 INSERT INTO comments_by_video ( videoid, commentid, userid, comment) VALUES ( '0fe6a...', '82be1...', 'ac346...', 'Awesome!'); UPDATE comments_by_video SET userid = 'ac346...', comment = 'Awesome!' WHERE videoid = '0fe6a...' AND commentid = '82be1...';
  • 27. TTL and Deletes •Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE •Use DELETE statement to remove data •Can optionally specify columns to remove part of a row 27 INSERT INTO comments_by_video ( ... ) VALUES ( ... ) USING TTL 86400; DELETE FROM comments_by_video WHERE videoid = '0fe6a...' AND commentid = '82be1...';
  • 28. Querying
      • Use SELECT to get data from your tables
      • Always include the Partition Key and optionally the Clustering Columns
      • Can use ORDER BY (on Clustering Columns) and LIMIT
      • Use range queries (for example, by date) to slice partitions
        SELECT * FROM comments_by_video
        WHERE videoid = 'a67cd...'
        LIMIT 10;
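The slide shows only a LIMIT query, so here is a hedged sketch of the "slice a partition by date" case using the DataStax C# driver. It assumes the `comments_by_video` table from the earlier slides and an already connected `ISession`. The class and method names (`CommentQueries`, `PrintCommentsSinceSeptemberAsync`) and the cutoff date are made up for illustration. `minTimeuuid` is a built-in CQL function that turns a timestamp into the smallest possible timeuuid for that instant, which lets a timeuuid clustering column be range-queried by date.

```csharp
using System;
using System.Threading.Tasks;
using Cassandra;

public static class CommentQueries
{
    // Hypothetical helper: prints up to 10 comments on one video that were
    // added after an (arbitrary, illustrative) cutoff date.
    public static async Task PrintCommentsSinceSeptemberAsync(ISession session, Guid videoId)
    {
        // The partition key (videoid) is fixed; the range predicate on the
        // clustering column (commentid) slices within that one partition.
        PreparedStatement prepared = await session.PrepareAsync(
            "SELECT commentid, userid, comment FROM comments_by_video " +
            "WHERE videoid = ? " +
            "AND commentid > minTimeuuid('2014-09-01 00:00+0000') " +
            "LIMIT 10");

        RowSet rows = await session.ExecuteAsync(prepared.Bind(videoId));

        // Rows come back in clustering order (commentid DESC), i.e. newest first.
        foreach (Row row in rows)
        {
            var userId = row.GetValue<Guid>("userid");
            var comment = row.GetValue<string>("comment");
            Console.WriteLine("{0}: {1}", userId, comment);
        }
    }
}
```

In a real application the statement would be prepared once and reused rather than prepared on every call.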
  • 30. Breaking the Relational Mindset
      • How do we data model when we have to query by the Partition Key (and optionally Clustering Columns)?
      • Denormalize all the things!
      • Disk is cheap now and writes in Cassandra are FAST
      • Data modeling is very much query driven
      • Many times we end up with a “table per query”
  • 31. Users – The Relational Way
      • Single Users table with all user data and an Id Primary Key
      • Add an index on email address to allow queries by email
      [Diagram: "User logs into site" leads to the query "Find user by email address"; "Show basic information about user" leads to the query "Find user by id".]
  • 32. Users – The Cassandra Way
      • User logs into site: find user by email address
        CREATE TABLE user_credentials (
          email text,
          password text,
          userid uuid,
          PRIMARY KEY (email)
        );
      • Show basic information about user: find user by id
        CREATE TABLE users (
          userid uuid,
          firstname text,
          lastname text,
          email text,
          created_date timestamp,
          PRIMARY KEY (userid)
        );
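The two tables above mean every new user has to be written twice, and each query reads from the table built for it. This is a minimal sketch of that pattern with the DataStax C# driver, assuming the `user_credentials` and `users` tables exactly as defined on the slide. `UserStore` and its methods are hypothetical names, not code from KillrVideo, and the password is assumed to be hashed by the caller. The two inserts go into one logged batch so the duplicated data is written together.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Cassandra;

// Illustrative "table per query" data access class (not from the KillrVideo source).
public sealed class UserStore
{
    private readonly ISession _session;
    private readonly PreparedStatement _findCredentials;
    private readonly PreparedStatement _insertCredentials;
    private readonly PreparedStatement _insertUser;

    public UserStore(ISession session)
    {
        _session = session;
        // Prepare each statement once and reuse it for every call.
        _findCredentials = session.Prepare(
            "SELECT userid FROM user_credentials WHERE email = ?");
        _insertCredentials = session.Prepare(
            "INSERT INTO user_credentials (email, password, userid) VALUES (?, ?, ?)");
        _insertUser = session.Prepare(
            "INSERT INTO users (userid, firstname, lastname, email, created_date) " +
            "VALUES (?, ?, ?, ?, ?)");
    }

    // Login path: query the table whose partition key is the email address.
    public async Task<Guid?> FindUserIdByEmailAsync(string email)
    {
        RowSet rows = await _session.ExecuteAsync(_findCredentials.Bind(email));
        Row row = rows.FirstOrDefault();
        return row == null ? (Guid?)null : row.GetValue<Guid>("userid");
    }

    // Registration path: write the same user into both query tables.
    public Task CreateUserAsync(Guid userId, string firstName, string lastName,
                                string email, string passwordHash)
    {
        var batch = new BatchStatement()
            .Add(_insertCredentials.Bind(email, passwordHash, userId))
            .Add(_insertUser.Bind(userId, firstName, lastName, email, DateTimeOffset.UtcNow));
        return _session.ExecuteAsync(batch);
    }
}
```

The cost of this design shows up when data changes: if a user changes their email address, both tables need updating, and because `email` is the primary key of `user_credentials`, that means deleting the old row and inserting a new one.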
  • 33. Considerations When Duplicating Data
      • Can the data change?
      • How likely is it to change, or how frequently will it change?
      • Do I have all the information I need to update duplicates and maintain consistency?
      • Just scratching the surface of data modeling examples here
  • 34. Putting it all together: Cassandra, Azure, and .NET
  • 35. KillrVideo on Azure
      • Cassandra Cluster (DSE): app data storage (video metadata, comments, users, ratings, etc.)
      • Azure Media Services: uploaded video encoding, thumbnail generation, video access URI generation
      • Azure Storage: Queues (notifications on encoding job progress) and Blob (uploaded video storage)
      • OpsCenter: provisioning, monitoring, management
      • KillrVideo Web App: C# MVC web application, Azure Web Role; serves up UI and JSON endpoints
      • KillrVideo Upload Worker: C#, Azure Worker Role; monitors encoding job events, publishes completed uploads
      • Web UI: HTML5 / JavaScript (KnockoutJS, jQuery, Bootstrap, etc.)
  • 36. Deploying Cassandra in Azure
      • Cassandra is a JVM application and should be deployed on Linux VMs (parity on Windows is coming – 3.0?)
      • IOPS are super important (recommend A7 instances for production, A4 for testing and development)
      • New SSD instances in Azure look promising
      • In-depth documentation and scripts are available to help
  • 37. .NET and Cassandra
      • Open Source (on GitHub), available via NuGet
      • Bootstrap using the Builder and then reuse the ISession object
        Cluster cluster = Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build();
        ISession session = cluster.Connect("killrvideo");
  • 38. .NET and Cassandra
      • Executing CQL
      • Sync and Async API available
        var statement = new SimpleStatement("SELECT * FROM users WHERE userid = ?");
        // userid is a uuid column, so bind a Guid (someUserId) rather than an integer
        statement = statement.Bind(someUserId);
        RowSet rows = await session.ExecuteAsync(statement);
  • 39. .NET and Cassandra
      • Getting values from a RowSet is easy
      • RowSet is a collection of Row (IEnumerable<Row>)
        RowSet rows = await session.ExecuteAsync(statement);
        foreach (Row row in rows)
        {
            var videoId = row.GetValue<Guid>("videoid");
            var addedDate = row.GetValue<DateTimeOffset>("added_date");
            var name = row.GetValue<string>("name");
        }
  • 40. .NET and Cassandra
      • Mapping results to DTOs: if you like using CQL, try the CqlPoco package
      • Note: This package may be pulled into the official driver soon.
        public class User
        {
            public Guid UserId { get; set; }
            public string Name { get; set; }
        }

        // Get a user by id from Cassandra or null if not found
        var user = client.SingleOrDefault<User>(
            "SELECT userid, name FROM users WHERE userid = ?", someUserId);
  • 41. .NET and Cassandra
      • Mapping results to DTOs: if you like LINQ, use the built-in LINQ provider
        [Table("users")]
        public class User
        {
            [Column("userid"), PartitionKey]
            public Guid UserId { get; set; }

            [Column("name")]
            public string Name { get; set; }
        }

        var user = session.GetTable<User>()
            .SingleOrDefault(u => u.UserId == someUserId)
            .Execute();
  • 42. Some Tips for .NET and Cassandra
      • Look at Prepared Statements in the documentation for an easy performance optimization
      • Take advantage of the async API to run queries in parallel
      • Don’t write boilerplate mapping code; use LINQ or CqlPoco
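The first two tips combine naturally: prepare a statement once, then issue many executions concurrently and await them together. Here is a rough sketch with the DataStax C# driver. The `videos` table name is an assumption (the slides show only its `videoid` and `name` columns), and `VideoReader` and `GetNamesAsync` are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Cassandra;

public sealed class VideoReader
{
    private readonly ISession _session;
    private readonly PreparedStatement _getVideo;

    public VideoReader(ISession session)
    {
        _session = session;
        // Prepared once: the server parses the CQL a single time, and each
        // execution only sends the statement id plus the bound values.
        // The "videos" table name is assumed for this example.
        _getVideo = session.Prepare("SELECT videoid, name FROM videos WHERE videoid = ?");
    }

    // Looks up several videos in parallel instead of one after another.
    public async Task<List<string>> GetNamesAsync(IEnumerable<Guid> videoIds)
    {
        // Start all queries without awaiting; ToArray forces them to be issued now.
        Task<RowSet>[] queries = videoIds
            .Select(id => _session.ExecuteAsync(_getVideo.Bind(id)))
            .ToArray();

        // Wait for every query to complete.
        RowSet[] results = await Task.WhenAll(queries);

        var names = new List<string>();
        foreach (RowSet rows in results)
        {
            foreach (Row row in rows)
            {
                names.Add(row.GetValue<string>("name"));
            }
        }
        return names;
    }
}
```

Since each lookup is a single-partition query, running them in parallel spreads the work across the nodes that own those partitions. For very large id lists, some cap on the number of in-flight requests would be sensible.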
  • 43. What Next?
      • Planet Cassandra: http://planetcassandra.org/
          – Windows installer for Cassandra for development
          – More information on the drivers
          – Resources for Data Modeling
      • Guidance and Scripts for deployments on Azure
          – https://academy.datastax.com/demos/enterprise-deployment-microsoft-azure-cloud
      • KillrVideo Source
          – https://github.com/luketillman/killrvideo-csharp
  • 44. Questions? Follow me on Twitter for updates or to ask questions later: @LukeTillman