Runaway complexity in Big Data
And a plan to stop it

Nathan Marz
@nathanmarz
Agenda
• Common sources of complexity in data systems
• Design for a fundamentally better data system
What is a data system?

   A system that manages the storage and
   querying of data
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years encompassing every version of the
   application to ever exist
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years encompassing every version of the
   application to ever exist, every hardware
   failure
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years encompassing every version of the
   application to ever exist, every hardware
   failure, and every human mistake ever
   made
Common sources of complexity

         Lack of human fault-tolerance



         Conflation of data and queries



             Schemas done wrong
Lack of human fault-tolerance
Human fault-tolerance
• Bugs will be deployed to production over the lifetime of a data system
• Operational mistakes will be made
• Humans are part of the overall system, just like your hard disks, CPUs,
  memory, and software
• Must design for human error like you’d design for any other fault
Human fault-tolerance
Examples of human error
• Deploy a bug that increments counters by two instead of by one
• Accidentally delete data from database
• Accidental DOS on important internal service
The worst consequence is
data loss or data corruption
As long as an error doesn’t
 lose or corrupt good data,
you can fix what went wrong
Mutability
• The U and D in CRUD
• A mutable system updates the current state of the world
• Mutable systems inherently lack human fault-tolerance
• Easy to corrupt or lose data
Immutability
•   An immutable system captures a historical record of events
•   Each event happens at a particular time and is always true
Capturing change with mutable data model

Before:
Person   Location
Sally    Philadelphia
Bob      Chicago

After Sally moves to New York:
Person   Location
Sally    New York
Bob      Chicago
Capturing change with immutable data model

Before:
Person   Location       Time
Sally    Philadelphia   1318358351
Bob      Chicago        1327928370

After Sally moves to New York:
Person   Location       Time
Sally    Philadelphia   1318358351
Bob      Chicago        1327928370
Sally    New York       1338469380
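
To make the contrast concrete, here is a minimal sketch in Python (the record fields and helper names are illustrative, not from the talk): the mutable model overwrites the current fact and destroys history, while the immutable model only ever appends timestamped facts and derives the current state from them.

```python
import time

# Mutable model: the update destroys the old fact.
current_location = {"Sally": "Philadelphia", "Bob": "Chicago"}
current_location["Sally"] = "New York"  # Philadelphia is gone forever

# Immutable model: every fact is appended with the time it became true.
location_events = [
    {"person": "Sally", "location": "Philadelphia", "time": 1318358351},
    {"person": "Bob", "location": "Chicago", "time": 1327928370},
]

def record_move(events, person, location):
    """Append a new fact; nothing is ever updated or deleted."""
    events.append({"person": person, "location": location, "time": int(time.time())})

def latest_location(events, person):
    """Derive current state from the full history."""
    facts = [e for e in events if e["person"] == person]
    return max(facts, key=lambda e: e["time"])["location"]

record_move(location_events, "Sally", "New York")
print(latest_location(location_events, "Sally"))  # New York
```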
Immutability greatly restricts the
 range of errors that can cause
  data loss or data corruption
Vastly more human fault-tolerant
Immutability
Other benefits
•   Fundamentally simpler
•   CR instead of CRUD
•   Only write operation is appending new units of data
•   Easy to implement on top of a distributed filesystem
    •   File = list of data records
    •   Append = Add a new file into a directory

Basing a system on mutability is like pouring gasoline on your house (but don’t worry, I checked all the wires carefully to make sure there won’t be any sparks). When someone makes a mistake, who knows what will burn.
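
A rough sketch of the append-only idea, using the local filesystem as a stand-in for a distributed filesystem like HDFS (the directory path and record format are made up for illustration): a dataset is a directory, each file is a list of records, and the only write operation is dropping a new file into that directory.

```python
import json
import os
import uuid

DATA_DIR = "/data/pageviews"  # hypothetical dataset directory (would be an HDFS path)

def append_records(records):
    """The only write operation: add a new immutable file to the dataset directory."""
    os.makedirs(DATA_DIR, exist_ok=True)
    path = os.path.join(DATA_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_all():
    """'All data' is simply every record in every file."""
    for name in os.listdir(DATA_DIR):
        with open(os.path.join(DATA_DIR, name)) as f:
            for line in f:
                yield json.loads(line)
```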
Immutability




       Please watch Rich Hickey’s talks to learn more
       about the enormous benefits of immutability
Conflation of data and queries
Conflation of data and queries
Normalization vs. denormalization


People:
ID   Name     Location ID
1    Sally    3
2    George   1
3    Bob      3

Locations:
Location ID   City        State   Population
1             New York    NY      8.2M
2             San Diego   CA      1.3M
3             Chicago     IL      2.7M

Normalized schema
Join is too expensive, so
      denormalize...
ID   Name     Location ID   City       State
1    Sally    3             Chicago    IL
2    George   1             New York   NY
3    Bob      3             Chicago    IL

Location ID   City        State   Population
1             New York    NY      8.2M
2             San Diego   CA      1.3M
3             Chicago     IL      2.7M

Denormalized schema
Obviously, you prefer all data to be
         fully normalized
But you are forced to denormalize
         for performance
Because the way data is modeled,
stored, and queried is complected
We will come back to how to build
 data systems in which these are
          disassociated
Schemas done wrong
Schemas have a bad rap
Schemas
•   Hard to change
•   Get in the way
•   Add development overhead
•   Require annoying configuration
I know! Use a
schemaless database!
This is an overreaction
Confuses the poor implementation
 of schemas with the value that
        schemas provide
What is a schema exactly?
function(data unit)
That says whether this data is valid
             or not
This is useful
Value of schemas
•   Structural integrity
•   Guarantees on what can and can’t be stored
•   Prevents corruption
Otherwise you’ll detect corruption
      issues at read-time
Potentially long after the corruption
              happened
With little insight into the
circumstances of the corruption
Much better to get an exception
 where the mistake is made,
before it corrupts the database
Saves enormous amounts of time
Why are schemas considered
painful?
• Changing the schema is hard (e.g., adding a column to a table)
• Schema is overly restrictive (e.g., cannot do nested objects)
• Requires translation layers (e.g., ORMs)
• Requires more typing (development overhead)
None of these are fundamentally
 linked with function(data unit)
These are problems in the
implementation of schemas, not in
      schemas themselves
Ideal schema tool
• Data is represented as maps
• Schema tool is a library that helps construct the schema function:
  • Concisely specify required fields and types
  • Insert custom validation logic for fields (e.g., ages are between 0 and 200)
• Built-in support for evolving the schema over time
• Fast and space-efficient serialization/deserialization
• Cross-language

This is easy to use and gets out of your way. I use Apache Thrift, but it lacks the custom validation logic; I think it could be done better with a Clojure-like data-as-maps approach. Given the parameters of a data system (long-lived, ever changing, with mistakes being made), the amount of work it takes to make a schema (not that much) is absolutely worth it.
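
A minimal sketch of "schema = function(data unit)" in Python, assuming data units are plain maps as described above; the field names and the age rule are illustrative only.

```python
def make_schema(required_fields, validators=None):
    """Build a schema: a function that takes a data unit (a map) and says whether it is valid."""
    validators = validators or {}

    def schema(data_unit):
        # Structural integrity: required fields must exist with the right types.
        for field, expected_type in required_fields.items():
            if field not in data_unit or not isinstance(data_unit[field], expected_type):
                return False
        # Custom validation logic per field.
        return all(check(data_unit[field])
                   for field, check in validators.items()
                   if field in data_unit)

    return schema

# Hypothetical person schema: required fields plus custom validation logic.
person_schema = make_schema(
    required_fields={"name": str, "age": int},
    validators={"age": lambda age: 0 <= age <= 200},
)

assert person_schema({"name": "Sally", "age": 30})
assert not person_schema({"name": "Sally", "age": 250})  # rejected at write time, before it corrupts the database
```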
Let’s get provocative
The relational database will
  be a footnote in history
Not because of SQL, restrictive
schemas, or scalability issues
But because of fundamental
flaws in the RDBMS approach
      to managing data
Mutability
Conflating the storage of data
   with how it is queried
Back in the day, these flaws were features, because space was at a premium. The landscape has changed, and this is no longer the constraint it once was. So these properties of mutability and of conflating data and queries are now major, glaring flaws, because there are better ways to design data systems.
“NewSQL” is misguided
Let’s use our ability to cheaply
store massive amounts of data
To do data right
And not inherit the complexities
          of the past
I know! Use a
NoSQL database!




If SQL’s wrong, and NoSQL isn’t SQL, then NoSQL must be right.
NoSQL databases are generally
not a step in the right direction
Some aspects are, but not the
ones that get all the attention
Still based on mutability and
      not general purpose
Let’s see how you design
a data system that
doesn’t suffer from
these complexities




                           Let’s start from scratch
What does a data system do?
Retrieve data that you
 previously stored?
    Put        Get
Not really...
Counterexamples

          Store location information on people

      How many people live in a particular location?

                 Where does Sally live?

         What are the most populous locations?
Counterexamples

             Store pageview information

       How many pageviews on September 2nd?

         How many unique visitors over time?
Counterexamples

        Store transaction history for bank account

         How much money does George have?

     How much money do people spend on housing?
What does a data system do?


    Query = Function(All data)
Sometimes you retrieve what
       you stored
Oftentimes you do transformations,
        aggregations, etc.
Queries as pure functions that take
all data as input is the most general
              formulation
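
As an illustration of the formulation (not production code), here is the pageview example as a literal pure function over all data, computed on the fly; the record fields are hypothetical.

```python
def total_pageviews(all_data, url, start_time, end_time):
    """Query = Function(All data): scan every pageview record for the URL in the time range."""
    return sum(1 for event in all_data
               if event["url"] == url and start_time <= event["time"] < end_time)

all_data = [
    {"url": "/about", "time": 1318358351, "user": "sally"},
    {"url": "/about", "time": 1318358420, "user": "bob"},
    {"url": "/pricing", "time": 1318358500, "user": "sally"},
]
print(total_pageviews(all_data, "/about", 1318358000, 1318359000))  # 2
```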
Example query


 Total number of pageviews to a
 URL over a range of time
Example query




          Implementation
Too slow: “all data” is petabyte-scale
On-the-fly computation

[diagram: All data → Query]
Precomputation

[diagram: All data → Precomputed view → Query]
Example query

[diagram: All data (pageview, pageview, pageview, ...) → Precomputed view (2930) → Query]
Precomputation

[diagram: All data → Precomputed view → Query]
Precomputation

[diagram: All data → Function → Precomputed view → Function → Query]
Data system

[diagram: All data → Function → Precomputed view → Function → Query]

Two problems to solve
Data system

[diagram: All data → Function → Precomputed view → Function → Query]

How to compute views
Data system

[diagram: All data → Function → Precomputed view → Function → Query]

How to compute queries from views
Computing views

[diagram: All data → Function → Precomputed view]
Function that takes in
  all data as input
Batch processing
MapReduce
MapReduce is a framework for
computing arbitrary functions on
        arbitrary data
Expressing those functions


                       Cascalog


                 Scalding
MapReduce precomputation

[diagram: All data → MapReduce workflow → Batch view #1; All data → MapReduce workflow → Batch view #2]
Batch view database

Need a database that...
• Is batch-writable from MapReduce
• Has fast random reads
• Examples: ElephantDB, Voldemort
Batch view database


  No random writes required!
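
A rough sketch of what such a batch-writable, read-only view database does (the file format and class here are made up for illustration, not ElephantDB's actual design): the batch workflow writes the whole view out as sorted key/value records, and serving is binary search over them, so no random writes are ever needed.

```python
import bisect
import json

def write_batch_view(view, path):
    """Batch write: dump the whole view as sorted key/value pairs (done once per batch run)."""
    records = sorted(view.items())
    with open(path, "w") as f:
        json.dump(records, f)

class ReadOnlyView:
    """Serving side: fast random reads via binary search over sorted keys; no random writes."""
    def __init__(self, path):
        with open(path) as f:
            records = json.load(f)
        self.keys = [key for key, _ in records]
        self.values = [value for _, value in records]

    def get(self, key, default=0):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return default

write_batch_view({"/about|2011-10-11T18": 2, "/pricing|2011-10-11T18": 1}, "batch_view.json")
print(ReadOnlyView("batch_view.json").get("/about|2011-10-11T18"))  # 2
```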
Properties

[diagram: All data → Function → Batch view]

Simple (ElephantDB is only a few thousand lines of code)
Properties

[diagram: All data → Function → Batch view]

Scalable
Properties

[diagram: All data → Function → Batch view]

Highly available
Properties

[diagram: All data → Function → Batch view]

Can be heavily optimized (b/c no random writes)
Properties

[diagram: All data → Function → Batch view]

Normalized
Properties

[diagram: All data → Function → Batch view]

“Denormalized”

Not exactly denormalization, because you’re doing more than just retrieving data that you stored (you can do aggregations). You’re able to optimize data storage separately from data modeling, without the complexity typical of denormalization in relational databases. This is because the batch view is a pure function of all data: it’s hard to get out of sync, and if there’s ever a problem (like a bug in your code that computes the wrong batch view) you can recompute. It’s also easy to debug problems, since you have the input that produced the batch view; this is not true in a mutable system based on incremental updates.
So we’re done, right?
Not quite...
• A batch workflow is too slow
• Views are out of date (by just a few hours of data)

[diagram: timeline — data absorbed into batch views | last few hours not yet absorbed | Now]
Properties

[diagram: All data → Function → Batch view]

Eventually consistent
Properties

[diagram: All data → Function → Batch view]

(without the associated complexities)
Properties

[diagram: All data → Function → Batch view]

(such as divergent values, vector clocks, etc.)
What’s left?


   Precompute views for last few
   hours of data
Realtime views

[diagram: New data stream → Stream processor → Realtime view #1, Realtime view #2 (NoSQL databases)]
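
A minimal sketch of the realtime layer, with an in-memory dict standing in for a random-write NoSQL view (a real system would be something like a Storm topology writing to Cassandra, HBase, or Riak): each new event incrementally updates the view for its (url, hour) key.

```python
from collections import defaultdict

realtime_view = defaultdict(int)  # stand-in for a random-write NoSQL database

def process_event(event):
    """Stream processor: incrementally update the realtime view as each new event arrives."""
    key = (event["url"], event["time"] // 3600)
    realtime_view[key] += 1

# New data streaming in after the last batch run
process_event({"url": "/about", "time": 1338469400})
process_event({"url": "/about", "time": 1338469500})
```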
Application queries

[diagram: Batch view + Realtime view → Merge]
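
A sketch of resolving a query by merging the two views; the views here are hypothetical dicts keyed by (url, hour), in line with the earlier sketches.

```python
def query_pageviews(batch_view, realtime_view, url, start_hour, end_hour):
    """Resolve a query by merging the batch view (all data up to the last batch run)
    with the realtime view (only the last few hours of new data)."""
    total = 0
    for hour in range(start_hour, end_hour):
        key = (url, hour)
        total += batch_view.get(key, 0)
        total += realtime_view.get(key, 0)
    return total

# Hypothetical views keyed by (url, hour bucket)
batch_view = {("/about", 366210): 2}
realtime_view = {("/about", 366213): 1}
print(query_pageviews(batch_view, realtime_view, "/about", 366210, 366214))  # 3
```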
Precomputation

[diagram: All data → Precomputed view → Query]
Precomputation

[diagram: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query]

“Lambda Architecture”
Precomputation

[diagram: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query]

Most complex part of the system: the precomputed realtime view
Precomputation

[diagram: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query]

This is where things like vector clocks have to be dealt with if using an eventually consistent NoSQL database. Random-write databases are much more complex.
Precomputation

[diagram: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query]

But the realtime view only represents a few hours of data.
Precomputation

[diagram: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query]

Can continuously discard realtime views, keeping them small. If anything goes wrong, the system auto-corrects.
CAP
Realtime layer decides whether to guarantee C or A
• If it chooses consistency, queries are consistent
• If it chooses availability, queries are eventually consistent

All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer. If anything goes wrong, it’s *auto-corrected*.

CAP is now a choice, as it should be, rather than a complexity burden. Making a mistake w.r.t. eventual consistency *won’t corrupt* your data.
Eventual accuracy


  Sometimes hard to compute
  exact answer in realtime
Eventual accuracy



     Example: unique count
Eventual accuracy


 Can compute exact answer in
 batch layer and approximate
 answer in realtime layer
Though for functions that can be computed exactly in the realtime layer (e.g., counting), you can achieve full accuracy.
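
A sketch of "exact in the batch layer, approximate in the realtime layer" for unique-visitor counts; the linear-counting bitmap below is just a simple stand-in for whatever sketch structure (e.g., HyperLogLog) a real realtime layer would use.

```python
import hashlib
import math

def exact_uniques(all_data):
    """Batch layer: exact distinct-visitor count over all data."""
    return len({event["user"] for event in all_data})

class LinearCounter:
    """Realtime layer: approximate distinct count using a small bitmap (linear counting)."""
    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        self.bits = [False] * num_bits

    def add(self, user):
        h = int(hashlib.md5(user.encode()).hexdigest(), 16)
        self.bits[h % self.num_bits] = True

    def estimate(self):
        zero_fraction = self.bits.count(False) / self.num_bits
        if zero_fraction == 0:
            return self.num_bits  # bitmap saturated; estimate is only a lower bound
        return round(-self.num_bits * math.log(zero_fraction))

counter = LinearCounter()
for user in ["sally", "bob", "sally", "george"]:
    counter.add(user)
print(counter.estimate())  # approximately 3
```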
Eventual accuracy


   Best of both worlds of
   performance and accuracy
Tools

[diagram: All data → Hadoop → Precomputed batch view (ElephantDB, Voldemort) → Query; New data stream (Kafka) → Storm → Precomputed realtime view (Cassandra, Riak, HBase) → Query]

“Lambda Architecture”
Lambda Architecture
• Can discard batch views and realtime views and recreate everything from scratch

• Mistakes corrected via recomputation

• Data storage layer optimized independently from the query resolution layer

What mistakes can be made?
  - Wrote bad data? Remove the data and recompute the views.
  - Bug in the functions that compute a view? Recompute the view.
  - Bug in a query function? Just deploy the fix.
Future
• Abstraction over batch and realtime
• More data structure implementations for batch and realtime views
Learn more




        http://manning.com/marz
Questions?
