Ben Coverston
Director of Operations
ben.coverston@datastax.com
Hosted By:
Matthew O’Keefe
MorningStar
  
                 	
  
                 	
  
History
•  Open Sourced by FB in July 2008
•  Apache Incubator March 2009
•  Graduated March 2010
•  Riptano Founded April 2010
•  First Summit August 2010
•  Riptano Changed to DataStax January 2011
  
You Changed Your Name? Why!?
•  Suits
    –  Marketing
    –  Relevancy
    –  Riptano too “Skateboard”
•  The Real Reason?
    –  “The X makes it sound cool.” (Bender Bending Rodriguez, Futurama)
  
Strengths
•  Scalable
•  Reliable
    –  Replication that works
    –  Multi-DC Support
    –  No Single Point of Failure
•  Analytics in the same system as OLTP (with “integrated” Hadoop support)
  
Weaknesses
•  No ACID Transactions
•  Limited Support for (OLTP) ad-hoc queries
•  ..but you lost that when you started to shard your relational system.
  	
  
A Short History of Big Data (Or Why Cassandra)
•  Relational databases scale poorly
•  B-trees are slow
    –  ..and require read before write.
    –  ..hope your dataset fits in memory
  
First Attempt

We just need to buy a bigger box…

We Just Need to Cache Our Data…

Add a few more Databases

A Little Sharding

What About Backup?

Add Another Layer of Abstraction

What do we end up with?
(“The eBay Architecture,” Randy Shoup and Dan Pritchett)
  
BASE
•  BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
    –  Dan Pritchett, NoSQL Pioneer, eBay Engineer
       http://queue.acm.org/detail.cfm?id=1394128
  
Myth
•  Lack of ACID means that I have to give up transactional guarantees and consistency.
•  Paraphrasing: At Netflix we tend to be optimistic. When things don’t quite work out we try again.
    –  Siddharth Anand
•  Achievable
  
Cassandra In Production
•  Netflix: Streaming Bookmarks
•  Digital Reasoning: NLP & Entity Analytics
•  OpenX: largest publisher-side ad network
•  Cloudkick: performance data & aggregation
•  SimpleGeo: location-as-API
•  Ooyala: video analytics and business intelligence
•  ngmoco: massively multiplayer online game worlds
•  Kosmix: social media aggregation
•  Reddit: vote tracking system
•  Twitter: Rainbird, geo data, analytics
•  … lots more
  
Who is investing in Cassandra?
•  DataStax
•  Twitter:
    –  We're investing in Cassandra every day. It'll be with us for a long time and our usage of it will only grow.
•  Rackspace
•  > 100 different individuals have submitted patches to C*
•  You?
  
Durability
•  Write to Commit Log
    –  fsync is cheap (append only)
    –  Latency is only subject to rotational latency
        •  Separate partition (no seeking)
        •  SSD won’t hurt, but it may not help either.
•  Write to memtable
•  Flush memtable to SSTable
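The three steps above can be sketched as a toy model. This is only an illustration, assuming an in-memory list stands in for the append-only commit log and a sorted tuple stands in for an SSTable; `ToyStore`, `write`, `flush` and `flush_threshold` are hypothetical names, not Cassandra's API.

```python
class ToyStore:
    """Toy write path: commit log append, memtable update, flush to a sorted table."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stand-in for the append-only commit log file
        self.memtable = {}     # in-memory map of key -> value
        self.sstables = []     # each flushed table is an immutable, sorted tuple
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # 1. Append to the commit log first, so the write can be replayed after a crash.
        self.commit_log.append((key, value))
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable holds enough keys.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out in key order, then start again with an empty one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []   # flushed entries no longer need replaying


store = ToyStore(flush_threshold=2)
store.write("b", 1)
store.write("a", 2)   # second distinct key triggers a flush
store.write("c", 3)   # stays in the memtable and commit log
```

Nothing in the sketch ever rewrites data in place: the log is appended to, and a flushed table is never modified, which is the property the slide relies on for cheap writes.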
  
Log Structured Storage

Tuneable Consistency
•  One, Quorum, All
•  R + W > N
•  Choose availability vs consistency (latency)
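The R + W > N rule can be checked with a few lines of arithmetic. A minimal sketch, assuming R and W are the number of replicas a read and a write wait for and N is the replication factor; the function names `quorum` and `overlaps` are made up for this example.

```python
def quorum(n):
    """Number of replicas that make up a majority of n."""
    return n // 2 + 1


def overlaps(r, w, n):
    """True when every read set of size r must share a replica with every write set of size w."""
    return r + w > n


N = 3
levels = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}
```

With N = 3, reading and writing at Quorum gives 2 + 2 > 3, so a read always touches at least one replica that saw the latest write. Reading and writing at One gives 1 + 1, which is not greater than 3, so a read may miss it.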
  
The Ring

Adding A Node

Adding A Node (Continued)

Bootstrapping

Consistent Hashing
•  Hash Function: K → T
    –  Let’s call this |k| (hash of k) for our examples
•  Partitioner Determines Location in the Ring
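A small sketch of how a token can be mapped to a position on the ring. It assumes each node owns a single token and that a key belongs to the first node whose token is greater than or equal to |k|, wrapping around at the end; MD5 is used for `token` purely for illustration, and the `Ring` class and `owner` method are hypothetical names.

```python
import bisect
import hashlib


def token(key):
    """|k|: hash a key to an integer token."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, node_tokens):
        # node_tokens maps node name -> the token that node owns.
        self.entries = sorted((t, name) for name, t in node_tokens.items())
        self.tokens = [t for t, _ in self.entries]

    def owner(self, key_token):
        """First node whose token is >= key_token, wrapping past the last node."""
        i = bisect.bisect_left(self.tokens, key_token)
        return self.entries[i % len(self.entries)][1]


# Tiny tokens keep the example readable; real tokens span the whole hash range.
ring = Ring({"A": 0, "B": 100, "C": 200})
```

For example, `ring.owner(150)` lands on C, and `ring.owner(250)` wraps around to A. Adding a node only changes ownership for the keys between the new token and its predecessor.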
  
	
  
Partitioning

Replication
•  Simple Replication Strategy
•  Network Topology Strategy
    –  How many replicas in each datacenter for each keyspace?
    –  Generalization of Rack Aware Strategy
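The simple strategy can be sketched as "take the node that owns the token, then keep walking clockwise". This is an illustration under assumptions: one token per node, no rack or datacenter awareness, and a hypothetical helper name `simple_replicas`. The network topology strategy is not modelled here.

```python
import bisect


def simple_replicas(ring_tokens, key_token, rf):
    """ring_tokens: sorted list of (token, node). Return rf nodes, walking clockwise."""
    tokens = [t for t, _ in ring_tokens]
    start = bisect.bisect_left(tokens, key_token)
    count = min(rf, len(ring_tokens))   # cannot place more replicas than nodes
    return [ring_tokens[(start + i) % len(ring_tokens)][1] for i in range(count)]


ring_tokens = [(0, "A"), (100, "B"), (200, "C"), (300, "D")]
```

With a replication factor of 3, a key with token 150 is stored on C, D and A.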
  
Replication

Coordinators
•  Each Node can be a coordinator
•  Manages writing, read repair.
•  Success depends on per-call CL request
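How the per-call consistency level decides success can be sketched as counting acknowledgements. A minimal model, assuming the coordinator simply compares the number of replicas that answered against the number the requested level needs; `required_acks` and `coordinate_write` are hypothetical names, and timeouts and hints are left out.

```python
def required_acks(consistency_level, replication_factor):
    """How many replica acknowledgements a request at this level needs."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError("unknown consistency level: %r" % (consistency_level,))


def coordinate_write(replica_acked, consistency_level):
    """replica_acked maps replica name -> True if it acknowledged the write."""
    acks = sum(1 for ok in replica_acked.values() if ok)
    return acks >= required_acks(consistency_level, len(replica_acked))


replicas = {"A": True, "B": True, "C": False}   # one of three replicas is down
```

With one replica down, the same write succeeds at One and Quorum but fails at All, which is the availability versus consistency choice made on each call.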
  
Coordinators

Reliability
•  No Single Points of Failure
•  Multiple Datacenters
•  Monitorable
    –  JMX (or whatever plugs into it; lots of counters)
    –  Cacti
    –  Munin
    –  Nagios
  
Expectation of Failure
•  C* is designed to fail
•  No “Clean Shutdown”
•  kill -9, it’s ok.
  
Failure

Failure

Failure

Hinted Handoff

Decommission (RF3)

Repair

Keyspaces and ColumnFamilies
•  Loosely analogous to “Schemas” and “Tables”
  
Inside CFs, columns are dynamic
•  Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”
  
ColumnFamilies
•  Static
    –  Object data
•  Dynamic
    –  Precalculated query results
  
“static” columnfamilies
Users
zznate     Password: *    Name: Nate
driftx     Password: *    Name: Brandon
thobbs     Password: *    Name: Tyler
jbellis    Password: *    Name: Jonathan    Site: riptano.com
  
“dynamic” columnfamilies
Following
zznate     driftx:    thobbs:
driftx
thobbs     zznate:
jbellis    driftx:    mdennis:    pcmanus:    thobbs:    xedin:    zznate:
  
Inserting
•  Really “insert or update”
•  Not a key/value store: update as much of the row as you want
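The “insert or update” behaviour can be sketched as a per-column merge. This is a simplified model, assuming each column carries its own timestamp and the newest timestamp wins; ties are resolved arbitrarily here, and `upsert` is a hypothetical helper, not a client call.

```python
def upsert(row, columns, timestamp):
    """Merge columns into row in place; row maps column name -> (value, timestamp)."""
    for name, value in columns.items():
        current = row.get(name)
        # Keep the existing column only if it is strictly newer than this write.
        if current is None or timestamp >= current[1]:
            row[name] = (value, timestamp)
    return row


row = {}
upsert(row, {"name": "Jonathan", "password": "*"}, timestamp=1)
upsert(row, {"site": "riptano.com"}, timestamp=2)   # adds a column, leaves the others alone
upsert(row, {"name": "stale"}, timestamp=0)         # older write loses to the newer column
```

The second call touches only one column of the row, which is the difference from a plain key/value store where the whole value would be replaced.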
  
Column indexes
•  Name vs range filters
•  “reversed=true”
    –  Special case: forward-scan starting with beginning of row is fastest
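The two filter styles can be sketched over a row whose columns are kept sorted by name. This is an in-memory illustration only; `by_names`, `by_range` and the `reversed_` parameter are hypothetical names chosen for this example.

```python
import bisect


def by_names(row, names):
    """Name filter: fetch specific columns by name."""
    return {n: row[n] for n in names if n in row}


def by_range(row, start=None, count=100, reversed_=False):
    """Range filter: up to count columns in sorted order, beginning at start."""
    names = sorted(row)
    if reversed_:
        # Walk backwards from start (or from the end of the row).
        end = len(names) if start is None else bisect.bisect_right(names, start)
        picked = names[:end][::-1][:count]
    else:
        begin = 0 if start is None else bisect.bisect_left(names, start)
        picked = names[begin:begin + count]
    return [(n, row[n]) for n in picked]


userline = {10: "t1", 20: "t2", 30: "t3", 40: "t4"}
```

A reversed range with a small count is how a "latest N entries" query reads a time-ordered row.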
  
Example: Twissandra
•  http://twissandra.com
  
Tweets
RowKey: 92dbeb50-ed45-11df-a6d0-000c29864c4f
=> (column=body, value=Four score and seven years ago, timestamp=1289446891681799)
=> (column=username, value=alincoln, timestamp=1289446891681799)
-------------------
RowKey: d418a66e-edc5-11df-ae6c-000c29864c4f
=> (column=body, value=Do geese see God?, timestamp=1289501976713199)
=> (column=username, value=pdrome, timestamp=1289501976713199)

Userline
RowKey: ericflo
=> (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)
=> (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)
=> (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)
=> (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)

Userline
zznate     1289847840615: 3f19757a-c89d...    1289847887086: a20fcf52-595c...
driftx
thobbs     1289847887086: a20fcf52-595c...
jbellis    1289847840615: 3f19757a-c89d...    1289847844275: 844e75e2-b546...
  
Timeline
RowKey: ericflo
=> (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)
=> (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)
=> (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)
=> (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)

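Adding a tweet
import time
import uuid

# TWEET, USERLINE, TIMELINE and FOLLOWERS are pycassa-style column family
# handles; uname and useruuid come from the surrounding request.
tweet_id = str(uuid.uuid1())  # uuid1 is assumed; the slide only shows uuid()
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = int(time.time() * 1e6)  # microseconds since the epoch

# Store the tweet itself, keyed by its id.
columns = {'uname': useruuid, 'body': body}
TWEET.insert(tweet_id, columns)

# Index the tweet by timestamp in the author's userline and timeline.
columns = {timestamp: tweet_id}
USERLINE.insert(uname, columns)
TIMELINE.insert(uname, columns)

# Fan out: copy the entry into each follower's timeline.
for follower_uname in FOLLOWERS.get(uname, column_count=5000):
    TIMELINE.insert(follower_uname, columns)
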
Reads
# A user's own tweets, newest first.
timeline = USERLINE.get(uname, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

# One page of the timeline, newest first, starting at the requested column.
start = request.GET.get('start')
limit = NUM_PER_PAGE
timeline = TIMELINE.get(uname, column_start=start,
                        column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

I can has smarter clients?
•  Don't use Thrift directly
•  Higher level clients have a lot of features you want
    –  Knowledge about data types
    –  Connection pooling
    –  Automatic retries
    –  Logging
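Two of the features listed above, automatic retries and logging, can be sketched with a small wrapper. This is not how any particular client implements them; `with_retries`, `flaky_insert` and the choice of `IOError` as the retryable error are assumptions made for the example.

```python
import logging

log = logging.getLogger("toy_client")


def with_retries(operation, attempts=3, retry_on=(IOError,)):
    """Call operation(); on a retryable error, log it and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            log.warning("attempt %d of %d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise   # out of attempts: let the caller see the error


calls = {"n": 0}


def flaky_insert():
    # Fails twice, then succeeds; stands in for a call against an unreachable node.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("node unavailable")
    return "ok"


result = with_retries(flaky_insert)
```

Retrying a write blindly is only safe because inserts are idempotent: writing the same column with the same timestamp twice leaves the row unchanged.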
  
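Raw thrift API: Connecting
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra  # Thrift-generated bindings; module path assumed

# Port 9170 is kept from the slide; Cassandra's default rpc_port is 9160.
def get_client(host='127.0.0.1', port=9170):
    # Open a buffered Thrift connection and wrap it in a Cassandra client.
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client
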
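Raw thrift API: Inserting
import time
from cassandra.ttypes import (Column, ColumnOrSuperColumn,
                              ConsistencyLevel, Mutation)  # module path assumed

data = {'id': useruuid, ...}
timestamp = int(time.time() * 1e6)  # Thrift timestamps are 64-bit integers
columns = [Column(k, v, timestamp)
           for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c))
             for c in columns]
rows = {useruuid: {'User': mutations}}
client.batch_mutate('Twissandra', rows,
                    ConsistencyLevel.ONE)
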
Raw thrift API: Fetching
•  get, get_slice, get_count, multiget_slice, get_range_slices
•  ColumnOrSuperColumn
•  http://wiki.apache.org/cassandra/API
  
	
  
API layers
Layer    Analog
libpq    Thrift
JDBC     Hector
JPA      Kundera
  
Language support
•  Python
    –  pycassa
    –  telephus
•  Ruby
    –  Speed is a negative
•  Java
    –  Hector
•  PHP (soon with less suckage!)
  
Done yet?
•  Still doing 1+N queries per page
•  Solution: Supercolumns
•  Err.. Well maybe…
  
Supercolumns: limitations
•  Requires reading an entire SC (not the entire row) from disk even if you just want one subcolumn
•  No Secondary Indexes
•  It’s just an extra map layer.
•  Probably best to avoid them if you can.
  
	
  
UUIDs
•  Column names should be uuids, not longs, to avoid collisions
•  Version 1 UUIDs can be sorted by time (“TimeUUID”)
•  Any UUID can be sorted by its raw bytes (“LexicalUUID”)
    –  Usually Version 4
    –  Slightly less overhead
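The two orderings can be tried with Python's standard `uuid` module. A small sketch: it sorts version 1 UUIDs by their embedded timestamp and version 4 UUIDs by raw bytes, which mirrors the two comparators named above but does not use Cassandra itself.

```python
import uuid

# Version 1 UUIDs embed a 60-bit timestamp, exposed by Python as the .time attribute.
first = uuid.uuid1()
second = uuid.uuid1()
by_time = sorted([second, first], key=lambda u: u.time)

# Version 4 UUIDs are random; they can still be ordered by their raw 16 bytes.
random_ids = [uuid.uuid4() for _ in range(3)]
by_bytes = sorted(random_ids, key=lambda u: u.bytes)
```

Sorting version 1 UUIDs by raw bytes would not give time order, because the low bits of the timestamp come first in the byte layout. That is why a time-aware comparison is needed for them.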
  
0.7: secondary indexes
•  Obviate need for Userline (but not Timeline)
Lucandra
•  What documents contain term X?
    –  … and term Y?
    –  … or start with Z?
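The three questions above map onto an inverted index. The sketch below is a plain in-memory version, assuming one entry per term holding the ids of the documents that contain it; it illustrates the idea and is not Lucandra's actual layout. `add_document`, `containing` and `starting_with` are hypothetical names.

```python
import bisect

index = {}   # term -> set of document ids


def add_document(doc_id, text):
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)


def containing(term):
    """Documents that contain the term: a single lookup."""
    return index.get(term, set())


def starting_with(prefix):
    """Documents with any term starting with prefix: a range scan over sorted terms."""
    terms = sorted(index)
    i = bisect.bisect_left(terms, prefix)
    docs = set()
    while i < len(terms) and terms[i].startswith(prefix):
        docs |= index[terms[i]]
        i += 1
    return docs


add_document(1, "four score and seven years ago")
add_document(2, "do geese see god")
add_document(3, "seven geese")
```

"Term X and term Y" becomes a set intersection, for example `containing("seven") & containing("geese")`. The prefix query depends on the terms being stored in sorted order.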
  
FAQ: counting
•  UUIDs + batch process
•  Mutex (contrib/mutex or “cages”)
•  Use redis or mysql or memcached
•  column-per-app-server
•  counter API (after 0.7 is out)
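The column-per-app-server option can be sketched as follows. It assumes each application server writes only its own column in a shared counter row and that readers sum the columns; the dict `counter_row` stands in for that row, and `increment` and `total` are hypothetical helpers.

```python
counter_row = {}   # column name = app server id, value = that server's running count


def increment(server_id, amount=1):
    # Only server_id ever writes this column, so writers never race each other.
    counter_row[server_id] = counter_row.get(server_id, 0) + amount


def total():
    # Reading the count means reading the whole row and summing its columns.
    return sum(counter_row.values())


increment("app1")
increment("app2", 5)
increment("app1", 2)
```

The trade-off is that each server must keep track of its own running count, and every read has to fetch one column per server.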
  
Tips
•  Insert instead of check-then-insert
•  Use client-side clock to your advantage
•  Use TTL
•  Wider rows (but not too wide)
•  Start with queries, work backwards
•  Avoid storing extra “timestamp” columns
  

Ben Coverston - The Apache Cassandra Project