Big Data - architectural concerns for the new age




Sunday, 2 December 12
Debasish Ghosh
                            CTO
                        (a Nomura Research Institute group company)




@debasishg on Twitter
code @ http://github.com/debasishg
blog @ Ruminations of a Programmer, http://debasishg.blogspot.com




some numbers ..




Facebook reaches 1 billion active users




some more numbers ..




• Walmart handles 1M transactions per hour
• Google processes 24PB of data per day
• AT&T transfers 30PB of data per day
• 90 trillion emails are sent every year
• World of Warcraft uses 1.3PB of storage

Big Data - the positive feedback cycle
1. new technologies make using big data efficient
2. more adoption of big data
3. generation of more big data

new technologies .. new architectural concerns




new ways to store data
new techniques to retrieve data
new ways to scale reads & writes
transparent to the application


new ways to consume data
new techniques to analyze data
new ways to visualize data
at Web scale



The Database Landscape so far ..
• relational database - the bedrock of enterprise data
• irrespective of application development paradigm
• object-relational-mapping considered to be the panacea for impedance mismatch



blogger, big geek and
                        architectural consultant




                                      “Object Relational Mapping is the
                                         Vietnam of Computer Science”
                                                   - Ted Neward (2006)

RDBMS & Big Data

• once the data volume crosses the limit of a single server, you shard / partition
  • sharding implies a lookup node for the hash code => single point of failure (SPOF)
  • cross shard joins, transactions don’t scale
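
A minimal sketch of hash-based sharding, assuming a hypothetical `shard_for` lookup and MD5 as a stable hash (neither comes from the slides). Every request has to pass through this lookup step, and with a plain modulo scheme a change in the shard count remaps most keys:

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def shard_for(key: str, num_shards: int) -> int:
    """Hypothetical lookup step: hash the key, then take it modulo the shard count."""
    return stable_hash(key) % num_shards

keys = [f"user:{i}" for i in range(10_000)]
before = {k: shard_for(k, 4) for k in keys}   # cluster with 4 shards
after = {k: shard_for(k, 5) for k in keys}    # same keys after adding a 5th shard

# Fraction of keys whose shard assignment changed because of the resize.
moved_fraction = sum(before[k] != after[k] for k in keys) / len(keys)
print(f"{moved_fraction:.0%} of keys moved when growing from 4 to 5 shards")
```

A key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, so about four in five keys move in this setup.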

RDBMS & Big Data
• Cost of distributed transactions
  • synchronization overhead
  • 2 phase commit is a blocking protocol (can block indefinitely)
  • as slow as the slowest DB node + network latency
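
A simplified, in-memory sketch of two phase commit, assuming hypothetical `Participant` and `two_phase_commit` names and no real network or timeouts. It illustrates the two points above: the coordinator waits for every vote, and a coordinator failure after the prepare phase leaves the participants blocked in the prepared state.

```python
from enum import Enum
from typing import List, Optional

class State(Enum):
    INIT = "init"
    PREPARED = "prepared"    # voted yes, holding locks, waiting for the decision
    COMMITTED = "committed"
    ABORTED = "aborted"

class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = State.INIT

    def prepare(self) -> bool:
        """Phase 1: vote. After a yes vote the node must wait for the outcome."""
        if self.can_commit:
            self.state = State.PREPARED
            return True
        self.state = State.ABORTED
        return False

    def finish(self, commit: bool) -> None:
        """Phase 2: apply the coordinator's decision."""
        if self.state is State.PREPARED:
            self.state = State.COMMITTED if commit else State.ABORTED

def two_phase_commit(participants: List[Participant],
                     coordinator_crashes: bool = False) -> Optional[bool]:
    # Phase 1: collect a vote from every participant (as slow as the slowest one).
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    if coordinator_crashes:
        # The decision is never delivered; prepared nodes stay blocked.
        return None
    # Phase 2: broadcast the decision.
    for p in participants:
        p.finish(decision)
    return decision

ok_nodes = [Participant("db1"), Participant("db2")]
committed = two_phase_commit(ok_nodes)

stuck_nodes = [Participant("db1"), Participant("db2")]
outcome = two_phase_commit(stuck_nodes, coordinator_crashes=True)
```

Real implementations add timeouts and recovery logs; this sketch leaves them out to keep the blocking behaviour visible.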


RDBMS & Big Data
• Master/Slave replication
  • synchronous replication => slow
  • asynchronous replication => can lose data
  • writing to master is a bottleneck and SPOF
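
A toy sketch of why asynchronous replication can lose data, assuming hypothetical `Master`, `Replica` and `ship` names and an in-memory log instead of a real replication stream. The master acknowledges a write before the replica has it, so a crash in that window loses the write on failover.

```python
from collections import deque

class Master:
    def __init__(self):
        self.data = {}
        self.replication_log = deque()   # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value           # acknowledged to the client immediately
        self.replication_log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}

def ship(master: Master, replica: Replica) -> None:
    """Drain the master's pending log into the replica."""
    while master.replication_log:
        key, value = master.replication_log.popleft()
        replica.data[key] = value

master, replica = Master(), Replica()
master.write("order:1", "paid")
ship(master, replica)                    # order:1 is now on both nodes
master.write("order:2", "paid")          # acknowledged, but not yet shipped

# If the master dies at this point, failing over to the replica loses order:2.
lost = set(master.data) - set(replica.data)
```

A synchronous variant would call `ship` inside `write` before acknowledging, which removes the loss window at the cost of waiting for the replica on every write.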


Need Distributed Databases
• data is automatically partitioned
• transparent to the application
• add capacity without downtime
• failure tolerant

2 famous papers ..

• Bigtable: A Distributed Storage System for Structured Data, 2006
• Dynamo: Amazon’s Highly Available Key-value Store, 2007




Addressing 2 Approaches

• Bigtable: “how can we build a distributed database on top of GFS?”
• Dynamo: “how can we build a distributed hash table appropriate for the data center?”




Big Data recommendations
• reduce accidental complexity in processing data
• be less rigid (no rigid schema)
• store data in a format closer to the domain model
• hence no universal data model ..

Polyglot Storage
• unfortunately came to be known as NoSQL databases
• document oriented (MongoDB, CouchDB)
• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
• data structure based (Redis)
• graph based (Neo4j)

reduced impedance mismatch
• richer modeling capabilities
• closer to domain model




Asynchronous Replication to RDBMS using Message Oriented Middleware

Hybrid Oracle MongoDB storage over Messaging backbone

A relational database is just another option, not the only option, when the data set is BIG and semantically rich




10 things never to do with a Relational Database
• Search
• Recommendation
• High Frequency Trading
• Product Cataloging
• User group / ACLs
• Log Analysis
• Media Repository
• Email
• Classified ads
• Time Series / Forecasting

Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0


Scalability, Availability ..
• ACID => BASE
• CAP Theorem & Eventual Consistency
• Consistent Hashing
• Vector Clocks
• Hinted Hand-off & Read repair
• Anti-entropy
• Gossip Protocol
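
Of the techniques listed above, vector clocks are easy to show in a few lines. A minimal sketch, assuming clocks are plain dictionaries from node name to counter; the function names `increment`, `descends`, `compare` and `merge` are illustrative and not taken from any particular database:

```python
from typing import Dict

VectorClock = Dict[str, int]

def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if a has seen everything b has (a >= b for every node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: VectorClock, b: VectorClock) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"   # conflicting versions; the application must reconcile

def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Clock for a reconciled value: the per-node maximum of both clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

v1 = increment({}, "node_a")      # first write, handled by node_a
v2 = increment(v1, "node_a")      # a later write on node_a
v3 = increment(v1, "node_b")      # a write on node_b that only saw v1

print(compare(v2, v1))            # a is newer
print(compare(v2, v3))            # concurrent
```

Here `v2` and `v3` both descend from `v1` but not from each other, so neither can silently overwrite the other.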




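CAP Theorem

• Consistency, Availability & Partition Tolerance
• You can guarantee only 2 of these at once in a distributed system; when a network partition occurs, the choice is between consistency and availability
• Eric Brewer postulated this in 2000; Gilbert and Lynch proved it formally in 2002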



ACID => BASE
• Basically Available, Soft state, Eventual consistency
• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.
• It’s ok to use stale data and it’s ok to give approximate answers
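
A small sketch of eventual consistency, assuming two in-memory replicas, logical timestamps and a last-write-wins rule (one possible reconciliation strategy, chosen here only for brevity). The `Replica` class and `anti_entropy` function are hypothetical names. A stale read is served first; a later state exchange brings both replicas to the same value.

```python
from typing import Dict, Optional, Tuple

Versioned = Tuple[int, str]   # (logical timestamp, value)

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.store: Dict[str, Versioned] = {}

    def write(self, key: str, value: str, timestamp: int) -> None:
        """Keep the write only if it is newer than what this replica holds."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        return entry[1] if entry else None

def anti_entropy(a: Replica, b: Replica) -> None:
    """Exchange state so both replicas hold the newest version of each key."""
    for key, (ts, value) in list(a.store.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.store.items()):
        a.write(key, value, ts)

r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", "book", timestamp=1)
r2.write("cart", "book", timestamp=1)
r1.write("cart", "book, pen", timestamp=2)   # r2 has not seen this update yet

stale = r2.read("cart")       # still answers, with the older value
anti_entropy(r1, r2)          # background sync
fresh = r2.read("cart")       # now matches r1
```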


Consistent Hashing
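
A minimal sketch of a consistent hashing ring, assuming MD5 for ring positions and 100 virtual nodes per physical node; the `HashRing` class and the node names are hypothetical. Each key belongs to the first virtual node found clockwise from the key's position, so adding a node only takes over the keys that fall just before its positions.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Position on the ring, derived from a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._positions = []   # sorted ring positions of all virtual nodes
        self._owners = {}      # ring position -> physical node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = ring_hash(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first virtual node."""
        if not self._positions:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._positions, ring_hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]

keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-e")
after = {k: ring.node_for(k) for k in keys}

# Only keys claimed by the new node change owner.
moved = [k for k in keys if before[k] != after[k]]
moved_fraction = len(moved) / len(keys)
```

With five nodes the new one is expected to take roughly a fifth of the keys, whereas hashing modulo the node count would move most of them.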



Big Data in the wild
• Hadoop
  • started as a batch processing engine (HDFS & Map/Reduce)
  • with bigger and bigger data, you need to make it available to users in near real time
  • stream processing, CEP (complex event processing) ..

complementing Map/Reduce in Hadoop
• Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems
• Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Cloudera Impala: real time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing
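
The Map/Reduce batch model that these tools complement can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce steps using word count; it does not use the Hadoop API, and the function names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)

def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data", "big clusters process big data"])
print(counts)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, which is why the whole job behaves as a batch.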



Real time queries in Hadoop
• currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop
• expensive and may need lots of data movement between the database & the Hadoop clusters



.. and the Hadoop ecosystem continues to grow with lots of real time tools being developed actively that are compliant with the current base ..




Shark from UC Berkeley
• a large scale data warehouse system for Spark, compatible with Hive
• supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3



BI and Analytics
• making Big Data available to developers
• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..


Machine Learning
• personalization
• social network analysis
• pattern discovery - click patterns, recommendations, ratings
• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..


Summary
• Big Data will grow bigger - we need to embrace the changes in architecture
• An RDBMS is NOT the panacea - pick the data model that’s closest to your domain
• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware


Summary

• Go for decentralized architectures, avoid SPOFs
• With the big volumes of data, streaming is your friend




Thank You!



http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm

http://www.emich.edu/chhs/about-researchMETHODS.html
http://docs.basho.com/riak/latest/references/appendices/concepts/




Sunday, 2 December 12

More Related Content

What's hot

Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overviewNitesh Ghosh
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Marko Grobelnik
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarPlatfora
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setSoner Altin
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandAndrew Brust
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12mark madsen
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
 
Scality presentation cloud Computing Expo NY 2012 v1.0
Scality presentation cloud Computing Expo NY 2012 v1.0Scality presentation cloud Computing Expo NY 2012 v1.0
Scality presentation cloud Computing Expo NY 2012 v1.0Marc Villemade
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataHaluan Irsad
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core conceptsMaryan Faryna
 
Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...
Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...
Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...Tomas Pariente Lobo
 

What's hot (20)

Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir Webinar
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature set
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-Land
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 
Scality presentation cloud Computing Expo NY 2012 v1.0
Scality presentation cloud Computing Expo NY 2012 v1.0Scality presentation cloud Computing Expo NY 2012 v1.0
Scality presentation cloud Computing Expo NY 2012 v1.0
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
 
Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...
Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...
Big data-public-private-forum--2013 publioc-sector_meeting_spain_big_data_tec...
 

Viewers also liked

Property based Testing - generative data & executable domain rules
Property based Testing - generative data & executable domain rulesProperty based Testing - generative data & executable domain rules
Property based Testing - generative data & executable domain rulesDebasish Ghosh
 
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...1Strategy
 
Domain Modeling with Functions - an algebraic approach
Domain Modeling with Functions - an algebraic approachDomain Modeling with Functions - an algebraic approach
Domain Modeling with Functions - an algebraic approachDebasish Ghosh
 
Functional and Algebraic Domain Modeling
Functional and Algebraic Domain ModelingFunctional and Algebraic Domain Modeling
Functional and Algebraic Domain ModelingDebasish Ghosh
 
DSL - expressive syntax on top of a clean semantic model
DSL - expressive syntax on top of a clean semantic modelDSL - expressive syntax on top of a clean semantic model
DSL - expressive syntax on top of a clean semantic modelDebasish Ghosh
 
Functional and Event Driven - another approach to domain modeling
Functional and Event Driven - another approach to domain modelingFunctional and Event Driven - another approach to domain modeling
Functional and Event Driven - another approach to domain modelingDebasish Ghosh
 
Dependency Injection in Scala - Beyond the Cake Pattern
Dependency Injection in Scala - Beyond the Cake PatternDependency Injection in Scala - Beyond the Cake Pattern
Dependency Injection in Scala - Beyond the Cake PatternDebasish Ghosh
 
From functional to Reactive - patterns in domain modeling
From functional to Reactive - patterns in domain modelingFrom functional to Reactive - patterns in domain modeling
From functional to Reactive - patterns in domain modelingDebasish Ghosh
 
Functional Patterns in Domain Modeling
Functional Patterns in Domain ModelingFunctional Patterns in Domain Modeling
Functional Patterns in Domain ModelingDebasish Ghosh
 
An Algebraic Approach to Functional Domain Modeling
An Algebraic Approach to Functional Domain ModelingAn Algebraic Approach to Functional Domain Modeling
An Algebraic Approach to Functional Domain ModelingDebasish Ghosh
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at TwitterAlex Payne
 
Domain Modeling in a Functional World
Domain Modeling in a Functional WorldDomain Modeling in a Functional World
Domain Modeling in a Functional WorldDebasish Ghosh
 

Viewers also liked (12)

Property based Testing - generative data & executable domain rules
Property based Testing - generative data & executable domain rulesProperty based Testing - generative data & executable domain rules
Property based Testing - generative data & executable domain rules
 
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
 
Domain Modeling with Functions - an algebraic approach
Domain Modeling with Functions - an algebraic approachDomain Modeling with Functions - an algebraic approach
Domain Modeling with Functions - an algebraic approach
 
Functional and Algebraic Domain Modeling
Functional and Algebraic Domain ModelingFunctional and Algebraic Domain Modeling
Functional and Algebraic Domain Modeling
 
DSL - expressive syntax on top of a clean semantic model
DSL - expressive syntax on top of a clean semantic modelDSL - expressive syntax on top of a clean semantic model
DSL - expressive syntax on top of a clean semantic model
 
Functional and Event Driven - another approach to domain modeling
Functional and Event Driven - another approach to domain modelingFunctional and Event Driven - another approach to domain modeling
Functional and Event Driven - another approach to domain modeling
 
Dependency Injection in Scala - Beyond the Cake Pattern
Dependency Injection in Scala - Beyond the Cake PatternDependency Injection in Scala - Beyond the Cake Pattern
Dependency Injection in Scala - Beyond the Cake Pattern
 
From functional to Reactive - patterns in domain modeling
From functional to Reactive - patterns in domain modelingFrom functional to Reactive - patterns in domain modeling
From functional to Reactive - patterns in domain modeling
 
Functional Patterns in Domain Modeling
Functional Patterns in Domain ModelingFunctional Patterns in Domain Modeling
Functional Patterns in Domain Modeling
 
An Algebraic Approach to Functional Domain Modeling
An Algebraic Approach to Functional Domain ModelingAn Algebraic Approach to Functional Domain Modeling
An Algebraic Approach to Functional Domain Modeling
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Domain Modeling in a Functional World
Domain Modeling in a Functional WorldDomain Modeling in a Functional World
Domain Modeling in a Functional World
 

Similar to Big Data - architectural concerns for the new age

MySQL Cluster no PayPal
MySQL Cluster no PayPalMySQL Cluster no PayPal
MySQL Cluster no PayPalMySQL Brasil
 
Morning with MongoDB Paris 2012 - Accueil et Introductions
Morning with MongoDB Paris 2012 - Accueil et IntroductionsMorning with MongoDB Paris 2012 - Accueil et Introductions
Morning with MongoDB Paris 2012 - Accueil et IntroductionsMongoDB
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptChris Richardson
 
A Morning with MongoDB Barcelona: Introduction
A Morning with MongoDB Barcelona: IntroductionA Morning with MongoDB Barcelona: Introduction
A Morning with MongoDB Barcelona: IntroductionMongoDB
 
The Coming Database Revolution
The Coming Database RevolutionThe Coming Database Revolution
The Coming Database RevolutionDATAVERSITY
 
Coming to cassandra from relational world (New)
Coming to cassandra from relational world (New)Coming to cassandra from relational world (New)
Coming to cassandra from relational world (New)Nenad Bozic
 
EDF2012 Simon Riggs - Open Data, Open Database: PostgreSQL
EDF2012  Simon Riggs - Open Data, Open Database: PostgreSQLEDF2012  Simon Riggs - Open Data, Open Database: PostgreSQL
EDF2012 Simon Riggs - Open Data, Open Database: PostgreSQLEuropean Data Forum
 
Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)packetloop
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server ProLynn Langit
 
A Morning with MongoDB Barcelona: Use Cases and Roadmap
A Morning with MongoDB Barcelona: Use Cases and RoadmapA Morning with MongoDB Barcelona: Use Cases and Roadmap
A Morning with MongoDB Barcelona: Use Cases and RoadmapMongoDB
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudKhazret Sapenov
 
Gilbane Boston 2011 big data
Gilbane Boston 2011 big dataGilbane Boston 2011 big data
Gilbane Boston 2011 big dataPeter O'Kelly
 
"What is left to do?", Dublin Core 2012 Keynote
"What is left to do?", Dublin Core 2012 Keynote"What is left to do?", Dublin Core 2012 Keynote
"What is left to do?", Dublin Core 2012 KeynoteDan Brickley
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesshnkr_rmchndrn
 

Similar to Big Data - architectural concerns for the new age (20)

MySQL Cluster no PayPal
MySQL Cluster no PayPalMySQL Cluster no PayPal
MySQL Cluster no PayPal
 
Morning with MongoDB Paris 2012 - Accueil et Introductions
Morning with MongoDB Paris 2012 - Accueil et IntroductionsMorning with MongoDB Paris 2012 - Accueil et Introductions
Morning with MongoDB Paris 2012 - Accueil et Introductions
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
 
A Morning with MongoDB Barcelona: Introduction
A Morning with MongoDB Barcelona: IntroductionA Morning with MongoDB Barcelona: Introduction
A Morning with MongoDB Barcelona: Introduction
 
The Coming Database Revolution
The Coming Database RevolutionThe Coming Database Revolution
The Coming Database Revolution
 
Coming to cassandra from relational world (New)
Coming to cassandra from relational world (New)Coming to cassandra from relational world (New)
Coming to cassandra from relational world (New)
 
Hadoop
HadoopHadoop
Hadoop
 
Grails 2.0 Update
Grails 2.0 UpdateGrails 2.0 Update
Grails 2.0 Update
 
EDF2012 Simon Riggs - Open Data, Open Database: PostgreSQL
EDF2012  Simon Riggs - Open Data, Open Database: PostgreSQLEDF2012  Simon Riggs - Open Data, Open Database: PostgreSQL
EDF2012 Simon Riggs - Open Data, Open Database: PostgreSQL
 
Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)
 
Measure Everything
Measure EverythingMeasure Everything
Measure Everything
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server Pro
 
A Morning with MongoDB Barcelona: Use Cases and Roadmap
A Morning with MongoDB Barcelona: Use Cases and RoadmapA Morning with MongoDB Barcelona: Use Cases and Roadmap
A Morning with MongoDB Barcelona: Use Cases and Roadmap
 
Big data presentation
Big data  presentationBig data  presentation
Big data presentation
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
 
Big data&hadoop
Big data&hadoopBig data&hadoop
Big data&hadoop
 
Gilbane Boston 2011 big data
Gilbane Boston 2011 big dataGilbane Boston 2011 big data
Gilbane Boston 2011 big data
 
"What is left to do?", Dublin Core 2012 Keynote
"What is left to do?", Dublin Core 2012 Keynote"What is left to do?", Dublin Core 2012 Keynote
"What is left to do?", Dublin Core 2012 Keynote
 
Dublin Core: What is left to do?
Dublin Core: What is left to do?Dublin Core: What is left to do?
Dublin Core: What is left to do?
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skies
 

More from Debasish Ghosh

Functional Domain Modeling - The ZIO 2 Way
Functional Domain Modeling - The ZIO 2 WayFunctional Domain Modeling - The ZIO 2 Way
Functional Domain Modeling - The ZIO 2 WayDebasish Ghosh
 
Algebraic Thinking for Evolution of Pure Functional Domain Models
Algebraic Thinking for Evolution of Pure Functional Domain ModelsAlgebraic Thinking for Evolution of Pure Functional Domain Models
Algebraic Thinking for Evolution of Pure Functional Domain ModelsDebasish Ghosh
 
Power of functions in a typed world
Power of functions in a typed worldPower of functions in a typed world
Power of functions in a typed worldDebasish Ghosh
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Functional and Algebraic Domain Modeling
Functional and Algebraic Domain ModelingFunctional and Algebraic Domain Modeling
Functional and Algebraic Domain ModelingDebasish Ghosh
 
Architectural Patterns in Building Modular Domain Models
Architectural Patterns in Building Modular Domain ModelsArchitectural Patterns in Building Modular Domain Models
Architectural Patterns in Building Modular Domain ModelsDebasish Ghosh
 
Mining Functional Patterns
Mining Functional PatternsMining Functional Patterns
Mining Functional PatternsDebasish Ghosh
 

Big Data - architectural concerns for the new age

  • 1. Big Data architectural concerns for the new age Sunday, 2 December 12
  • 2. Debasish Ghosh CTO (a Nomura Research Institute group company)
  • 3. @debasishg on Twitter code @ http://github.com/debasishg blog @ Ruminations of a Programmer http://debasishg.blogspot.com
  • 4. some numbers ..
  • 5. Facebook reaches 1 billion active users
  • 8. some more numbers ..
  • 9. • Walmart handles 1M transactions per hour • Google processes 24PB of data per day • AT&T transfers 30PB of data per day • 90 trillion emails are sent every year • World of Warcraft uses 1.3PB of storage
  • 10. Big Data - the positive feedback cycle • 1: new technologies make using big data efficient • 2: more adoption of big data • 3: generation of more big data
  • 11. new technologies .. new architectural concerns
  • 12. new ways to store data
  • 13. new techniques to retrieve data
  • 14. new ways to scale reads & writes
  • 15. transparent to the application
  • 16. new ways to consume data
  • 17. new techniques to analyze data
  • 18. new ways to visualize data
  • 19. at Web scale
  • 20. The Database Landscape so far .. • relational database - the bedrock of enterprise data • irrespective of application development paradigm • object-relational-mapping considered to be the panacea for impedance mismatch
  • 21. “Object Relational Mapping is the Vietnam of Computer Science” - Ted Neward (2006), blogger, big geek and architectural consultant
  • 22. RDBMS & Big Data • once the data volume crosses the limit of a single server, you shard / partition • sharding implies a lookup node for the hash code => SPOF • cross shard joins, transactions don’t scale
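A minimal sketch of the hash-based shard lookup described on this slide, assuming simple modulo placement. The `shard_for` helper and the `user:N` keys are hypothetical, not from the talk. It also shows why naive resharding is painful: with uniform hashes, about 80% of keys are expected to change shard when going from 4 to 5 shards.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical key set, used only to measure how many keys are remapped.
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys change shard when growing from 4 to 5 shards")
```

Every request has to pass through this lookup, which is why a single node hosting it becomes a single point of failure.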
  • 23. RDBMS & Big Data • Cost of distributed transactions • synchronization overhead • 2 phase commit is a blocking protocol (can block indefinitely) • as slow as the slowest DB node + network latency
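The cost argument can be illustrated with a toy two-phase commit coordinator. This is a sketch under simplifying assumptions: the `Participant` class and its `latency_ms` field are hypothetical, latencies are plain numbers rather than real network calls, and coordinator failure (the case that can block indefinitely) is not modelled.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    latency_ms: float          # hypothetical round-trip time to this node
    will_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1 vote: the node promises to commit, or refuses.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit: bool) -> None:
        # Phase 2: apply the coordinator's global decision.
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant (no short-circuiting).
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the decision to everyone.
    for p in participants:
        p.finish(decision)
    # Each phase waits for the slowest node, so elapsed time is bounded by it.
    slowest = max(p.latency_ms for p in participants)
    return decision, 2 * slowest

nodes = [Participant("db1", 5), Participant("db2", 8), Participant("db3", 120)]
print(two_phase_commit(nodes))
```

One slow or unreachable node dominates the whole transaction, and a single "no" vote aborts it on every node.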
  • 24. RDBMS & Big Data • Master/Slave replication • synchronous replication => slow • asynchronous replication => can lose data • writing to master is a bottleneck and SPOF
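The trade-off between the two replication modes can be sketched with in-memory logs. The `Master` and `Replica` classes below are hypothetical stand-ins, not any database's API: a synchronous master acknowledges only after the replica has the record, while an asynchronous one acknowledges first and ships later, so a crash in between loses acknowledged writes.

```python
class Replica:
    def __init__(self):
        self.log = []

class Master:
    def __init__(self, replica, synchronous):
        self.log = []
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []    # acknowledged but not yet shipped (async mode only)

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            self.replica.log.append(record)   # wait for the replica, then ack
        else:
            self.pending.append(record)       # ack right away, ship later
        return "ack"

    def ship_pending(self):
        # Background shipping step for the asynchronous mode.
        self.replica.log.extend(self.pending)
        self.pending.clear()

def lost_on_master_crash(master):
    """Acknowledged writes the replica never received."""
    return [r for r in master.log if r not in master.replica.log]
```

For example, an asynchronous master that has taken two writes and crashes before `ship_pending()` runs leaves both records missing on the replica, while the synchronous variant loses nothing at the price of waiting on every write.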
  • 25. Need Distributed Databases • data is automatically partitioned • transparent to the application • add capacity without downtime • failure tolerant
  • 26. 2 famous papers .. • Bigtable: A Distributed Storage System for Structured Data, 2006 • Dynamo: Amazon’s Highly Available Key-value Store, 2007
  • 27. Addressing 2 Approaches • Bigtable: “how can we build a distributed database on top of GFS ?” • Dynamo: “how can we build a distributed hash table appropriate for data center ?”
  • 28. Big Data recommendations • reduce accidental complexity in processing data • be less rigid (no rigid schema) • store data in a format closer to the domain model • hence no universal data model ..
  • 29. Polyglot Storage • unfortunately came to be known as NoSQL databases • document oriented (MongoDB, CouchDB) • key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort) • data structure based (Redis) • graph based (Neo4j)
  • 30. • reduced impedance mismatch • richer modeling capabilities • closer to domain model
  • 31. Asynchronous Replication to RDBMS using Message Oriented Middleware
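One way to picture this pattern is with standard-library stand-ins. In the sketch below everything is an assumption for illustration: `queue.Queue` plays the message broker, a dict plays the primary (document) store, an in-memory SQLite database plays the relational replica, and the `orders` table is hypothetical.

```python
import json
import queue
import sqlite3

bus = queue.Queue()                   # stand-in for a message broker
documents = {}                        # stand-in for the primary document store
rdbms = sqlite3.connect(":memory:")   # stand-in for the relational replica
rdbms.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT, total REAL)")

def save_order(order):
    """Write path: store the document, publish a message, return immediately."""
    documents[order["id"]] = order
    bus.put(json.dumps(order))

def drain_to_rdbms():
    """Consumer side: apply every queued message to the relational table."""
    applied = 0
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        order = json.loads(message)
        rdbms.execute(
            "INSERT OR REPLACE INTO orders (id, customer, total) VALUES (?, ?, ?)",
            (order["id"], order["customer"], order["total"]),
        )
        applied += 1
    rdbms.commit()
    return applied
```

The writer never waits for the relational side; the table simply lags until the consumer drains the queue, which is the usual price of asynchronous replication.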
  • 32. Hybrid Oracle MongoDB storage over Messaging backbone
  • 33. Relational Database is just another option, not the only option when data set is BIG and semantically rich
  • 34. 10 things never to do with a Relational Database • Search • Media Repository • Recommendation • Email • High Frequency Trading • Classified ads • Product Cataloging • Time Series / Forecasting • User group / ACLs • Log Analysis Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
  • 35. Scalability, Availability .. • ACID => BASE • CAP Theorem & Eventual Consistency • Consistent Hashing • Hinted Hand-off & Read repair • Anti-entropy • Gossip Protocol • Vector Clocks
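Of the techniques listed on this slide, consistent hashing is the easiest to show in a few lines. The sketch below is a minimal ring with virtual nodes; the `HashRing` class, the node names and the choice of MD5 are assumptions for illustration, not the scheme of any particular database.

```python
import bisect
import hashlib

def _point(value: str) -> int:
    """Position of a string on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []      # sorted (point, node) pairs
        self._points = []    # just the points, for bisect
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns many virtual points to even out the load.
        for i in range(self.vnodes):
            self._ring.append((_point(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # First virtual point clockwise from the key, wrapping around the ring.
        idx = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.node_for(k) for k in keys}
ring.add("node-e")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved after adding a fifth node")
```

Only the keys that now fall just before one of the new node's points are reassigned, and they all go to the new node, so capacity can be added without reshuffling most of the data.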
  • 36. CAP Theorem • Consistency, Availability & Partition Tolerance • A distributed system can guarantee only 2 of these 3 at the same time • Eric Brewer postulated this back in 2000
  • 37. ACID => BASE • Basically Available, Soft state, Eventual consistency • Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state. • It’s OK to use stale data and it’s OK to give approximate answers
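When replicas are allowed to diverge temporarily, something has to decide whether two versions are ordered or in conflict. Vector clocks, mentioned on the scalability slide, are one common mechanism; the following is a minimal sketch with hypothetical helper names, representing a clock as a dict of per-node counters.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"   # neither saw the other: a conflict to reconcile

def merge(a, b):
    """Clock for a reconciled value: element-wise maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

v1 = increment({}, "n1")
v2 = increment(v1, "n2")   # update accepted by replica n2
v3 = increment(v1, "n3")   # independent update accepted by replica n3
print(compare(v2, v1), "|", compare(v2, v3))
```

Here `v2` supersedes `v1`, while `v2` and `v3` are concurrent siblings that the application or a read-repair step has to reconcile before writing back a merged clock.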
  • 39. Big Data in the wild • Hadoop • started as a batch processing engine (HDFS & Map/Reduce) • with bigger and bigger data, you need to make them available to users at near real time • stream processing, CEP ..
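For readers new to the Map/Reduce model mentioned here, the classic word count can be sketched in plain Python. This runs in a single process and only mirrors the shape of the computation (map, shuffle by key, reduce); it does not use Hadoop's API, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "Big Data big"]))
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread across many machines, which is what makes the model suit batch processing over very large files.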
  • 40. Complementing Map/Reduce in Hadoop • Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems • Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs • Cloudera Impala: real time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing
  • 41. Real time queries in Hadoop • currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop • expensive and may need lots of data movement between the database & the Hadoop clusters
  • 42. .. and the Hadoop ecosystem continues to grow with lots of real time tools being developed actively that are compliant with the current base ..
  • 43. Shark from UC Berkeley • a large scale data warehouse system for Spark, compatible with Hive • supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3
  • 44. BI and Analytics • making Big Data available to developers • API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps) • analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
  • 45. Machine Learning • personalization • social network analysis • pattern discovery - click patterns, recommendations, ratings • apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
  • 46. Summary • Big Data will grow bigger - we need to embrace the changes in architecture • An RDBMS is NOT the panacea - pick the data model that’s closest to your domain • It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware
  • 47. Summary • Go for decentralized architectures, avoid SPOFs • With the big volumes of data, streaming is your friend
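The streaming point can be made concrete with a single-pass aggregate that never buffers its input. This is a small sketch with a hypothetical `running_stats` generator: it keeps a count, an incrementally updated mean and a maximum in constant memory, so it works the same on a three-element list or an unbounded feed.

```python
def running_stats(stream):
    """Yield (count, mean, maximum) after each element, using O(1) memory."""
    count, mean, maximum = 0, 0.0, float("-inf")
    for value in stream:
        count += 1
        mean += (value - mean) / count    # incremental mean update
        maximum = max(maximum, value)
        yield count, mean, maximum

for snapshot in running_stats([2, 4, 6]):
    print(snapshot)
```

Each snapshot is available as soon as its element arrives, which is the property that makes streaming attractive when the full data set is too large to hold or to move.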
  • 48. Thank You!