Cassandra 1.0
    and the future of big data
    Jonathan Ellis




Tuesday, October 4, 2011
About me

    ✤    Project chair, Apache Cassandra
          ✤    Active since Dec 2008
          ✤    First non-Facebook committer
          ✤    Wrote ~30% of committed patches, reviewed ~40% of the rest
    ✤    Distributed systems background
          ✤    At Mozy, built a multi-petabyte, scalable storage system based on
               Reed-Solomon encoding
    ✤    Founder and CTO, DataStax



About DataStax

    ✤    Founded in April 2010
    ✤    Commercial leader in Apache Cassandra
    ✤    100+ customers
    ✤    30+ employees
    ✤    Home to Apache Cassandra Chair & most committers
    ✤    Headquartered in the San Francisco Bay Area, California
    ✤    Secured $11M in Series B funding in Sep 2011



Job Trends (indeed.com)




“Big Data” trend




Big data




                  Analytics        Realtime
                              ?
                  (Hadoop)        (“NoSQL”)




Some Cassandra users

    ✤    Financial
    ✤    Social Media
    ✤    Advertising
    ✤    Entertainment
    ✤    Energy
    ✤    E-tail
    ✤    Health care
    ✤    Government

Common use cases

    ✤    Time series data
    ✤    Messaging
    ✤    Ad tracking
    ✤    Data mining
    ✤    User activity streams
    ✤    User sessions
    ✤    Anything requiring: Scalable + performant + highly
         available


Why people choose Cassandra

    ✤    Multi-master, multi-DC
    ✤    Linearly scalable
    ✤    Larger-than-memory datasets
    ✤    Best-in-class performance (not just writes!)
    ✤    Fully durable
    ✤    Integrated caching
    ✤    Tuneable consistency



0.7

    ✤    CREATE COLUMN FAMILY
    ✤    Expiring columns (TTL)
    ✤    Secondary (column) indexes
    ✤    Efficient streaming
    ✤    Efficient cross-datacenter writes




0.8

    ✤    CQL
    ✤    Counters
    ✤    Automatic memtable tuning
    ✤    New bulk load interface




1.0

    ✤    Compression
    ✤    Read performance
    ✤    LeveledCompactionStrategy
    ✤    CQL 2.0




Compression

        ✤    Rows-per-block or blocks-per-row




Classic size-tiered compaction




Level-based Compaction

        ✤    SSTables are non-overlapping within a level
        ✤    Bounds the number that can contain a given row




                                                L0: newly flushed

                                               L1: 100 MB

                                               L2: 1000 MB
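
A minimal Python sketch of why non-overlapping levels bound the number of SSTables a read has to consult. The `SSTable` tuple, the sample key ranges, and the `candidates_for` helper are hypothetical illustrations, not Cassandra's implementation; only the non-overlap property comes from the slide.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical stand-in for an SSTable covering keys first..last (inclusive).
SSTable = namedtuple("SSTable", ["name", "first", "last"])

def candidates_for(key, l0, levels):
    """Return the sstables that may contain `key`.

    l0:     freshly flushed sstables, whose key ranges may overlap.
    levels: list of levels (L1, L2, ...); each level is sorted by `first`
            and its sstables do not overlap each other.
    """
    # L0 tables can overlap, so every one of them has to be checked.
    hits = [t for t in l0 if t.first <= key <= t.last]
    for level in levels:
        starts = [t.first for t in level]
        i = bisect_right(starts, key) - 1  # last table starting at or before key
        if i >= 0 and key <= level[i].last:
            hits.append(level[i])  # non-overlap: at most one hit per level
    return hits

# Made-up key ranges for illustration.
l0 = [SSTable("l0-a", "a", "z")]
l1 = [SSTable("l1-a", "a", "f"), SSTable("l1-b", "g", "p"), SSTable("l1-c", "q", "z")]
l2 = [SSTable("l2-a", "a", "c"), SSTable("l2-b", "d", "k"), SSTable("l2-c", "l", "z")]

found = candidates_for("h", l0, [l1, l2])
print([t.name for t in found])
```

With two levels below L0, a key can appear in at most two leveled SSTables plus whatever is still sitting in L0.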


Read performance: maxtimestamp

    ✤    Sort sstables by maximum (client-provided) timestamp
    ✤    Only merge sstables until we have the columns requested
    ✤    Allows pre-merging highly fragmented rows without
         waiting for compaction
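
The idea above can be sketched in Python as follows. This is an illustration under assumptions, not Cassandra's read path: sstables are modeled as plain dicts with a hypothetical `max_ts` field and a `columns` map of name to `(timestamp, value)`, and `read_columns` is an invented helper name.

```python
def read_columns(sstables, wanted):
    """Merge sstables newest-first and stop once no older table can matter.

    sstables: list of {"max_ts": int, "columns": {name: (timestamp, value)}}
    wanted:   column names requested by the client
    Returns (values_by_name, number_of_sstables_read).
    """
    result = {}            # name -> (timestamp, value), newest version seen so far
    pending = set(wanted)  # columns an older sstable could still supply or override
    touched = 0
    for table in sorted(sstables, key=lambda t: t["max_ts"], reverse=True):
        # A found column is final once its timestamp beats everything this
        # sstable (and therefore every later, older one) can contain.
        pending = {c for c in pending
                   if c not in result or result[c][0] <= table["max_ts"]}
        if not pending:
            break
        touched += 1
        for name in pending:
            cell = table["columns"].get(name)
            if cell and (name not in result or cell[0] > result[name][0]):
                result[name] = cell
    return {name: value for name, (_, value) in result.items()}, touched

# Made-up sample data: three sstables holding fragments of one row.
sstables = [
    {"max_ts": 10, "columns": {"state": (10, "NV")}},
    {"max_ts": 30, "columns": {"full_name": (30, "Brandon Sanderson"),
                               "state": (15, "UT")}},
    {"max_ts": 20, "columns": {"full_name": (20, "B. Sanderson")}},
]
values, touched = read_columns(sstables, ["full_name", "state"])
print(values, touched)
```

In this sample the oldest sstable is never read, because both requested columns already carry timestamps newer than anything it can hold.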




Results




CQL


cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;

        KEY | birth_date |         full_name | state |
 bsanderson |       1975 | Brandon Sanderson |    UT |




CQL 2.0

    ✤    ALTER
    ✤    Counter support
    ✤    TTL support
    ✤    SELECT count(*)




Post-1.0 features

    ✤    Ease of use
    ✤    CQL
          ✤    “Native” transport
          ✤    Composite columns
          ✤    Prepared statements
    ✤    Triggers
    ✤    Entity groups
    ✤    Smarter range queries
          ✤    Enables more-efficient analytics
The evolution of Analytics




                           Analytics + Realtime


The evolution of Analytics




                                       replication




                           Analytics                 Realtime



The evolution of Analytics




                           ETL




Big data




                  Analytics    DataStax      Realtime
                  (Hadoop)    Enterprise   (Cassandra)




DataStax Enterprise re-unifies
    realtime and analytics




Data model: Realtime
               LiveStocks
                                   last
                           GOOG   $95.52
                           AAPL   $186.10
                           AMZN   $112.98


                 Portfolios
                                  GOOG      LNKD       P        AMZN    AAPL
                     Portfolio1
                                   80        20       40        100       20


                 StockHist
                                  2011-01-01       2011-01-02     2011-01-03
                           GOOG
                                    $79.85          $75.23            $82.11



Data model: Analytics
               HistLoss
                                   worst_date    loss
                      Portfolio1   2011-07-23   -$34.81
                      Portfolio2   2011-03-11 -$11432.24
                      Portfolio3   2011-05-21 -$1476.93




Data model: Analytics
               10dayreturns
                   ticker      rdate     return
                   GOOG     2011-07-25   $8.23
                   GOOG     2011-07-24   $6.14
                   GOOG     2011-07-23   $7.78
                   AAPL     2011-07-25   $15.32
                   AAPL     2011-07-24   $12.68


              -- Self-join StockHist: pair each price with the price 10 days later
              INSERT OVERWRITE TABLE 10dayreturns
              SELECT a.row_key ticker,
                     b.column_name rdate,
                     b.value - a.value    -- 10-day return
              FROM StockHist a
              JOIN StockHist b
              ON (a.row_key = b.row_key
                  AND date_add(a.column_name,10) = b.column_name);
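
For readers less familiar with self-joins, here is a rough Python equivalent of the query's logic. It is a sketch under assumptions: `stock_hist` is a hypothetical in-memory dict standing in for the StockHist column family, `ten_day_returns` is an invented helper, and the sample dates are made up so that one pair falls exactly 10 days apart.

```python
from datetime import date, timedelta

def ten_day_returns(stock_hist, days=10):
    """Return (ticker, rdate, return) rows, like the self-join on StockHist."""
    rows = []
    for ticker, prices in stock_hist.items():       # a.row_key = b.row_key
        for start, start_price in prices.items():
            end = start + timedelta(days=days)      # date_add(a.column_name, 10)
            if end in prices:                       # inner join: need a match
                rows.append((ticker, end, round(prices[end] - start_price, 2)))
    return sorted(rows)

# Illustrative sample data, not taken from the demo dataset.
stock_hist = {
    "GOOG": {
        date(2011, 1, 1): 79.85,
        date(2011, 1, 2): 75.23,
        date(2011, 1, 11): 82.11,
    }
}
rows = ten_day_returns(stock_hist)
print(rows)
```

Dates without a price exactly 10 days later produce no row, which matches the inner-join behavior of the query.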



Data model: Analytics

                           2011-01-01     2011-01-02   2011-01-03
               GOOG
                            $79.85          $75.23       $82.11




            row_key column_name         value
             GOOG    2011-01-01         $8.23
             GOOG    2011-01-02         $6.14
             GOOG    2011-01-03         $7.78
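
The mapping from one wide row to several (row_key, column_name, value) rows can be sketched in Python. The dict layout and the `flatten` helper are assumptions made for illustration; the sample uses the prices from the wide row shown at the top of the slide.

```python
def flatten(column_family):
    """Yield one (row_key, column_name, value) tuple per column of each row."""
    for row_key, columns in column_family.items():
        for column_name, value in sorted(columns.items()):
            yield (row_key, column_name, value)

# One wide row keyed by ticker, with one column per trading date.
stock_hist = {
    "GOOG": {"2011-01-01": 79.85, "2011-01-02": 75.23, "2011-01-03": 82.11},
}
rows = list(flatten(stock_hist))
for row in rows:
    print(row)
```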




Data model: Analytics
               portfolio_returns
                    portfolio       rdate      preturn
                    Portfolio1   2011-07-25    $118.21
                    Portfolio1   2011-07-24     $60.78
                    Portfolio1   2011-07-23    -$34.81
                    Portfolio2   2011-07-25   $2143.92
                    Portfolio3   2011-07-24    -$10.19


               -- Sum the 10-day returns of every ticker held in each portfolio
               INSERT OVERWRITE TABLE portfolio_returns
               SELECT row_key portfolio,
                      rdate,
                      SUM(b.return)
               FROM portfolios a JOIN 10dayreturns b
               ON (a.column_name = b.ticker)
               GROUP BY row_key, rdate;
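
A rough Python equivalent of this join and aggregation, as a sketch only. The in-memory `portfolios` dict and `returns` list are hypothetical stand-ins for the two tables, and the holdings are a subset of the slide's sample data. Like the query, it sums per-ticker returns and does not weight them by the number of shares held.

```python
from collections import defaultdict

def portfolio_returns(portfolios, returns):
    """Sum 10-day returns per (portfolio, rdate) over the tickers held."""
    totals = defaultdict(float)
    for portfolio, holdings in portfolios.items():
        for ticker in holdings:                 # a.column_name
            for t, rdate, ret in returns:       # rows of 10dayreturns
                if t == ticker:                 # ON a.column_name = b.ticker
                    totals[(portfolio, rdate)] += ret
    return {key: round(total, 2) for key, total in totals.items()}

portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}
returns = [
    ("GOOG", "2011-07-25", 8.23),
    ("GOOG", "2011-07-24", 6.14),
    ("AAPL", "2011-07-25", 15.32),
    ("AAPL", "2011-07-24", 12.68),
]
result = portfolio_returns(portfolios, returns)
print(result)
```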




Data model: Analytics
               HistLoss
                                   worst_date    loss
                      Portfolio1   2011-07-23   -$34.81
                      Portfolio2   2011-03-11 -$11432.24
                      Portfolio3   2011-05-21 -$1476.93



               -- Find each portfolio's worst return, then join back to get its date
               INSERT OVERWRITE TABLE HistLoss
               SELECT a.portfolio, rdate, minp
               FROM (
                 SELECT portfolio, MIN(preturn) AS minp
                 FROM portfolio_returns
                 GROUP BY portfolio
               ) a
               JOIN portfolio_returns b
               ON (a.portfolio = b.portfolio AND a.minp = b.preturn);
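
The two-step shape of this query (aggregate, then join back to recover the date) can be sketched in Python. The `hist_loss` helper and the tuple layout are assumptions for illustration; the sample rows are the ones shown on the portfolio_returns slide.

```python
def hist_loss(portfolio_returns):
    """portfolio_returns: iterable of (portfolio, rdate, preturn) rows."""
    rows = list(portfolio_returns)
    # Inner query: minimum return per portfolio.
    minp = {}
    for portfolio, _, preturn in rows:
        if portfolio not in minp or preturn < minp[portfolio]:
            minp[portfolio] = preturn
    # Join back to find the date(s) on which that minimum occurred.
    return sorted((p, d, r) for p, d, r in rows if r == minp[p])

rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]
worst = hist_loss(rows)
print(worst)
```

As with the join in the query, a portfolio whose minimum return occurs on several dates yields one row per date.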



Portfolio Demo dataflow


     Portfolios               Portfolios
     Historical Prices        Live Prices for today
     Intermediate Results
     Largest loss             Largest loss




Operations

    ✤    “Vanilla” Hadoop
          ✤    8+ services to set up, monitor, back up, and recover
               (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper,
               Region Server,...)
          ✤    Single points of failure
          ✤    Can't separate online and offline processing

    ✤    DataStax Enterprise
          ✤    Single, simplified component
          ✤    Self-organizes based on workload
          ✤    Peer to peer
          ✤    JobTracker failover
          ✤    No additional Cassandra config

OpsCenter




Questions?

    ✤    http://datastax.com/dev/blog
    ✤    jonathan@datastax.com




Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

  • 1. Cassandra 1.0 and the future of big data Jonathan Ellis Tuesday, October 4, 2011
  • 2. About me ✤ Project chair, Apache Cassandra ✤ Active since Dec 2008 ✤ First non-Facebook committer ✤ wrote ~30% of committed patches, reviewed ~40% of the rest ✤ Distributed systems background ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding ✤ Founder and CTO, DataStax Tuesday, October 4, 2011
  • 3. About DataStax ✤ Founded in April 2010 ✤ Commercial leader in Apache Cassandra ✤ 100+ customers ✤ 30+ employees ✤ Home to Apache Cassandra Chair & most committers ✤ Headquartered in San Francisco Bay area, California ✤ Secured $11M in Series B funding in Sep 2011 Tuesday, October 4, 2011
  • 6. Big data Analytics Realtime ? (Hadoop) (“NoSQL”) Tuesday, October 4, 2011
  • 7. Some Cassandra users ✤ Financial ✤ Social Media ✤ Advertising ✤ Entertainment ✤ Energy ✤ E-tail ✤ Health care ✤ Government
  • 8. Common use cases ✤ Time series data ✤ Messaging ✤ Ad tracking ✤ Data mining ✤ User activity streams ✤ User sessions ✤ Anything requiring: Scalable + performant + highly available
  • 9. Why people choose Cassandra ✤ Multi-master, multi-DC ✤ Linearly scalable ✤ Larger-than-memory datasets ✤ Best-in-class performance (not just writes!) ✤ Fully durable ✤ Integrated caching ✤ Tuneable consistency
  • 10. 0.7 ✤ CREATE COLUMN FAMILY ✤ Expiring columns (TTL) ✤ Secondary (column) indexes ✤ Efficient streaming ✤ Efficient cross-datacenter writes
  • 11. 0.8 ✤ CQL ✤ Counters ✤ Automatic memtable tuning ✤ New bulk load interface
  • 12. 1.0 ✤ Compression ✤ Read performance ✤ LeveledCompactionStrategy ✤ CQL 2.0
  • 13. Compression ✤ Rows-per-block or blocks-per-row
  • 15. Level-based Compaction ✤ SSTables are non-overlapping within a level ✤ Bounds the number that can contain a given row (L0: newly flushed; L1: 100 MB; L2: 1000 MB)
  • 16. Read performance: maxtimestamp ✤ Sort sstables by maximum (client-provided) timestamp ✤ Only merge sstables until we have the columns requested ✤ Allows pre-merging highly fragmented rows without waiting for compaction
  • 18. CQL cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970; KEY | birth_date | full_name | state; bsanderson | 1975 | Brandon Sanderson | UT
  • 19. CQL 2.0 ✤ ALTER ✤ Counter support ✤ TTL support ✤ SELECT count(*)
  • 20. Post-1.0 features ✤ Ease Of Use ✤ CQL ✤ “Native” transport ✤ Composite columns ✤ Prepared statements ✤ Triggers ✤ Entity groups ✤ Smarter range queries ✤ Enables more-efficient analytics
  • 21. The evolution of Analytics Analytics + Realtime
  • 22. The evolution of Analytics replication Analytics Realtime
  • 23. The evolution of Analytics ETL
  • 24. Big data: Analytics (Hadoop), DataStax Enterprise, Realtime (Cassandra)
  • 25. DataStax Enterprise re-unifies realtime and analytics
  • 27. Data model: Realtime LiveStocks: last; GOOG | $95.52; AAPL | $186.10; AMZN | $112.98. Portfolios: GOOG | LNKD | P | AMZN | AAPL; Portfolio1 | 80 | 20 | 40 | 100 | 20. StockHist: 2011-01-01 | 2011-01-02 | 2011-01-03; GOOG | $79.85 | $75.23 | $82.11
  • 28. Data model: Analytics HistLoss: worst_date | loss; Portfolio1 | 2011-07-23 | -$34.81; Portfolio2 | 2011-03-11 | -$11432.24; Portfolio3 | 2011-05-21 | -$1476.93
  • 29. Data model: Analytics 10dayreturns: ticker | rdate | return; GOOG | 2011-07-25 | $8.23; GOOG | 2011-07-24 | $6.14; GOOG | 2011-07-23 | $7.78; AAPL | 2011-07-25 | $15.32; AAPL | 2011-07-24 | $12.68. INSERT OVERWRITE TABLE 10dayreturns SELECT a.row_key ticker, b.column_name rdate, b.value - a.value FROM StockHist a JOIN StockHist b ON (a.row_key = b.row_key AND date_add(a.column_name,10) = b.column_name);
  • 30. Data model: Analytics 2011-01-01 | 2011-01-02 | 2011-01-03; GOOG | $79.85 | $75.23 | $82.11. row_key | column_name | value; GOOG | 2011-01-01 | $8.23; GOOG | 2011-01-02 | $6.14; GOOG | 2011-01-03 | $7.78
  • 31. Data model: Analytics portfolio_returns: portfolio | rdate | preturn; Portfolio1 | 2011-07-25 | $118.21; Portfolio1 | 2011-07-24 | $60.78; Portfolio1 | 2011-07-23 | -$34.81; Portfolio2 | 2011-07-25 | $2143.92; Portfolio3 | 2011-07-24 | -$10.19. INSERT OVERWRITE TABLE portfolio_returns SELECT row_key portfolio, rdate, SUM(b.return) FROM portfolios a JOIN 10dayreturns b ON (a.column_name = b.ticker) GROUP BY row_key, rdate;
  • 32. Data model: Analytics HistLoss: worst_date | loss; Portfolio1 | 2011-07-23 | -$34.81; Portfolio2 | 2011-03-11 | -$11432.24; Portfolio3 | 2011-05-21 | -$1476.93. INSERT OVERWRITE TABLE HistLoss SELECT a.portfolio, rdate, minp FROM ( SELECT portfolio, min(preturn) as minp FROM portfolio_returns GROUP BY portfolio ) a JOIN portfolio_returns b ON (a.portfolio = b.portfolio and a.minp = b.preturn);
  • 33. Portfolio Demo dataflow Portfolios Portfolios Historical Prices Live Prices for today Intermediate Results Largest loss Largest loss
  • 34. Operations ✤ “Vanilla” Hadoop ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...) ✤ Single points of failure ✤ Can't separate online and offline processing ✤ DataStax Enterprise ✤ Single, simplified component ✤ Self-organizes based on workload ✤ Peer to peer ✤ JobTracker failover ✤ No additional Cassandra config
  • 36. Questions? ✤ http://datastax.com/dev/blog ✤ jonathan@datastax.com
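
Slide 15 says that SSTables do not overlap within a level, which bounds how many can contain a given row. A minimal Python sketch of why that bound holds is given here. It is not Cassandra's implementation: the `KeyRange` tuples and the `candidates_for_key` helper are hypothetical stand-ins for real SSTable metadata, and keys are compared as plain strings.

```python
from bisect import bisect_right
from typing import List, Tuple

# (first_key, last_key) covered by one SSTable. Hypothetical stand-in for
# real SSTable metadata; keys are compared as plain strings.
KeyRange = Tuple[str, str]


def candidates_for_key(
    key: str, l0: List[KeyRange], levels: List[List[KeyRange]]
) -> List[KeyRange]:
    """Return the SSTable ranges that could contain `key`."""
    # L0 holds newly flushed SSTables, which may overlap, so check them all.
    hits = [r for r in l0 if r[0] <= key <= r[1]]
    for level in levels:
        # Assumption: each level is sorted by first_key and non-overlapping,
        # so a binary search finds the only range that could cover `key`.
        starts = [r[0] for r in level]
        i = bisect_right(starts, key) - 1
        if i >= 0 and key <= level[i][1]:
            hits.append(level[i])
    return hits


l0 = [("a", "z"), ("c", "m")]
l1 = [("a", "f"), ("g", "p"), ("q", "z")]
l2 = [("a", "c"), ("d", "h"), ("i", "n"), ("o", "z")]
found = candidates_for_key("k", l0, [l1, l2])
print(found)
```

Under these assumptions a read touches at most one SSTable per level above L0, however many SSTables each level holds.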
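
The maxtimestamp read path on slide 16 (sort sstables by maximum timestamp, merge only until the requested columns are found) can be sketched roughly as follows. This is an illustration under stated assumptions, not Cassandra's code: each SSTable is modelled as a non-empty dict for a single row, `read_columns` is a hypothetical name, and tombstones are ignored.

```python
from typing import Dict, List, Set, Tuple

# column name -> (timestamp, value); one non-empty dict per SSTable, holding
# the fragment of a single row stored in that SSTable (simplified model).
Columns = Dict[str, Tuple[int, str]]


def table_max_ts(table: Columns) -> int:
    """Largest column timestamp stored in one SSTable."""
    return max(ts for ts, _ in table.values())


def read_columns(sstables: List[Columns], wanted: Set[str]) -> Tuple[Columns, int]:
    """Return the newest version of each wanted column and how many SSTables were read."""
    ordered = sorted(sstables, key=table_max_ts, reverse=True)  # newest first
    result: Columns = {}
    tables_read = 0
    for table in ordered:
        # Stop once every wanted column is present and is newer than anything
        # this SSTable, or any SSTable sorted after it, could hold.
        if result and wanted.issubset(result):
            oldest_found = min(ts for ts, _ in result.values())
            if oldest_found > table_max_ts(table):
                break
        tables_read += 1
        for name in wanted.intersection(table):
            ts, value = table[name]
            if name not in result or ts > result[name][0]:
                result[name] = (ts, value)
    return result, tables_read


sstables = [
    {"name": (10, "old"), "state": (10, "UT")},
    {"name": (30, "new")},
    {"state": (20, "UT"), "email": (25, "x")},
]
one_col, reads_one = read_columns(sstables, {"name"})
two_cols, reads_two = read_columns(sstables, {"name", "state"})
print(one_col, reads_one)
print(two_cols, reads_two)
```

In this example the single-column read stops after one SSTable and the two-column read after two, so the oldest SSTable is never touched.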
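
The three HiveQL statements on slides 29, 31 and 32 form a small pipeline: 10-day returns per ticker, summed returns per portfolio and date, then the worst date per portfolio. The Python sketch below mirrors that logic in memory to make the data flow easier to follow. The function names and the sample prices are invented for illustration and are not the slide data. Like the query on slide 31, it sums per-ticker returns without weighting by the number of shares held.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

Prices = Dict[str, Dict[date, int]]   # ticker -> {day: price}, like StockHist
Returns = Dict[Tuple[str, date], int]  # (ticker or portfolio, rdate) -> return


def ten_day_returns(stock_hist: Prices) -> Returns:
    """Price on rdate minus price 10 days earlier (self-join on date + 10)."""
    out: Returns = {}
    for ticker, prices in stock_hist.items():
        for day, price in prices.items():
            rdate = day + timedelta(days=10)
            if rdate in prices:
                out[(ticker, rdate)] = prices[rdate] - price
    return out


def portfolio_returns(portfolios: Dict[str, Dict[str, int]], returns: Returns) -> Returns:
    """Sum ticker returns per (portfolio, rdate) for tickers the portfolio holds."""
    out: Returns = {}
    for name, holdings in portfolios.items():
        for (ticker, rdate), ret in returns.items():
            if ticker in holdings:
                out[(name, rdate)] = out.get((name, rdate), 0) + ret
    return out


def hist_loss(preturns: Returns) -> Dict[str, Tuple[date, int]]:
    """Worst (lowest) return per portfolio and its date; on ties the first
    date seen is kept, whereas the HiveQL join would return every tied date."""
    worst: Dict[str, Tuple[date, int]] = {}
    for (name, rdate), ret in preturns.items():
        if name not in worst or ret < worst[name][1]:
            worst[name] = (rdate, ret)
    return worst


# Made-up whole-dollar prices, for illustration only.
stock_hist = {
    "GOOG": {date(2011, 7, 13): 100, date(2011, 7, 14): 110,
             date(2011, 7, 23): 90, date(2011, 7, 24): 115},
    "AAPL": {date(2011, 7, 13): 200, date(2011, 7, 23): 205},
}
portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}

returns = ten_day_returns(stock_hist)
preturns = portfolio_returns(portfolios, returns)
losses = hist_loss(preturns)
print(losses)
```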