Introduction to Big Data
and NoSQL
SQL Azure Saturday
April 21, 2012
                Don Demsak
                Advisory Solutions Architect
                EMC Consulting
                www.donxml.com




Meet Don

• Advisory Solutions Architect
   – EMC Consulting
      • Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email – don@donxml.com
• SlideShare - http://www.slideshare.net/dondemsak




The era of Big Data


How did we get here?
• Expensive
   – Processors
   – Disk space
   – Memory
   – Operating Systems
   – Software
   – Programmers
• Monoculture
   – Limit CPU cycles
   – Limit disk space
   – Limit memory
   – Limited OS Development
   – Limited Software
   – Programmers
      • Mono-lingual
      • Mono-persistence




Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
  –   Atomicity
  –   Consistency
  –   Isolation
  –   Durability




How we scale RDBMS implementations




1st Step – Build a relational database




                  Database




2nd Step – Table Partitioning

                  p1 p2 p3




                  Database




3rd Step – Database Partitioning

   Browser      Web Tier   B/L Tier   Database
  Customer #1




    Browser     Web Tier   B/L Tier   Database
  Customer #2




    Browser     Web Tier   B/L Tier   Database
  Customer #3
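
A minimal sketch of the routing step behind database partitioning. The `shardFor` helper and the database names are hypothetical, not from the slides; the point is only that each customer ID maps to exactly one database, so the web and business-logic tiers talk to that one alone.

```javascript
// Hypothetical shard list; the names are placeholders, not real servers.
const shards = ["db-customers-0", "db-customers-1", "db-customers-2"];

// Map a customer ID to one shard. Plain modulo keeps the example short;
// note that adding a shard later would remap most customers.
function shardFor(customerId) {
  return shards[customerId % shards.length];
}

console.log(shardFor(1)); // db-customers-1
console.log(shardFor(2)); // db-customers-2
console.log(shardFor(3)); // db-customers-0
```

A lookup table from customer to database would work just as well and avoids the remapping problem, at the cost of one more thing to store.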




4th Step – Move to the cloud?

   Browser      Web Tier   B/L Tier   SQL Azure
                                      Federation
  Customer #1



                                      SQL Azure
    Browser     Web Tier   B/L Tier   Federation

  Customer #2



                                      SQL Azure
    Browser     Web Tier   B/L Tier   Federation

  Customer #3




There have to be other ways


Polyglot Persistence


Polyglot Programmer


Where Did NoSQL Originate?
• 1998 - Carlo Strozzi
  – NoSQL project - lightweight open-source relational DB
    with no SQL interface
• 2009 - Eric Evans & Johan Oskarsson of Last.fm
  wanted to organize an event to discuss open-source
  distributed databases




NoSQL (loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) don't guarantee ACID




Atlanta 2009
• No:sql(east) conference
   – select fun, profit from real_world where relational=false
• Billed as “conference of no-rel datastores”




Types Of NoSQL Data Stores




5 Groups of Data Models
  Relational


  Document


  Key Value


  Graph


  Column Family



Document Store
• Apache Jackrabbit
• CouchDB
• MongoDB
• SimpleDB
• XML Databases
  – MarkLogic Server
  – eXist.




Document?
• Okay think of a web page...
  – Relational model requires a column per tag
  – Lots of empty columns
  – Wasted space
• Document model just stores the pages as is
  – Saves on space
  – Very flexible.
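
To make the contrast concrete, here is a small sketch with made-up page data (the field names are assumptions, not from the talk). The relational-style rows must carry every possible column, so most cells are null; the document form stores only the fields each page actually has.

```javascript
// Relational style: every row has every column, used or not.
const columns = ["title", "author", "video", "map", "comments"];
const rows = [
  { title: "Home", author: null, video: null, map: null, comments: null },
  { title: "Contact", author: null, video: null, map: "hq.png", comments: null },
];

// Document style: each page keeps only the fields it needs.
const documents = [
  { title: "Home" },
  { title: "Contact", map: "hq.png" },
];

// Count the empty cells the fixed schema forces us to store.
const emptyCells = rows.reduce(
  (n, row) => n + columns.filter((c) => row[c] === null).length,
  0
);
// Count the fields the documents actually store.
const storedFields = documents.reduce((n, d) => n + Object.keys(d).length, 0);

console.log(emptyCells);   // 7
console.log(storedFields); // 3
```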




Graph Storage
• AllegroGraph
• Core Data
• Neo4j
• DEX
• FlockDB
• Microsoft Trinity (research project)
   – http://research.microsoft.com/en-us/projects/trinity/




What's a graph?
• Graph consists of
  – Nodes ('stations' of the graph)
  – Edges (lines between them)
• FlockDB
  – Created by the Twitter folks
  – Nodes = Users
  – Edges = Nature of relationship between nodes.
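
As a rough illustration of nodes and edges, the sketch below models users as nodes and "follows" relationships as edges. The data and the `followersOf` helper are invented for the example; this is plain JavaScript, not FlockDB's API.

```javascript
// Nodes are users; each edge records the nature of the relationship.
const nodes = ["ann", "bob", "cat"];
const edges = [
  { from: "ann", to: "bob", type: "follows" },
  { from: "cat", to: "bob", type: "follows" },
  { from: "bob", to: "ann", type: "follows" },
];

// Who follows the given user? Scan the edges that point at that node.
function followersOf(user) {
  return edges
    .filter((e) => e.type === "follows" && e.to === user)
    .map((e) => e.from);
}

console.log(followersOf("bob")); // [ 'ann', 'cat' ]
```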




Key/Value Stores
• On disk
• Cache in RAM
• Eventually Consistent
  – Weak Definition
     • “If no updates occur for a period, eventually all updates will
       propagate through the system and all replicas will be consistent”
  – Strong Definition
     • “for a given update and a given replica eventually either the
       update reaches the replica or the replica retires”

• Ordered
  – Distributed Hash Table allows lexicographical processing
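
A rough sketch of why key ordering matters, using a hypothetical in-memory `OrderedStore` (the class and the key naming scheme are assumptions). Because keys are kept in lexicographic order, everything sharing a prefix comes back together and already sorted.

```javascript
// Toy ordered key/value store: a Map for lookups plus a sorted key list.
class OrderedStore {
  constructor() {
    this.data = new Map();
    this.keys = [];
  }

  put(key, value) {
    if (!this.data.has(key)) {
      this.keys.push(key);
      this.keys.sort(); // default sort is lexicographic for strings
    }
    this.data.set(key, value);
  }

  get(key) {
    return this.data.get(key);
  }

  // Return [key, value] pairs whose key starts with the prefix, in key order.
  // A real store would binary-search the range; filter keeps this short.
  scan(prefix) {
    return this.keys
      .filter((k) => k.startsWith(prefix))
      .map((k) => [k, this.data.get(k)]);
  }
}

const store = new OrderedStore();
store.put("user:bob", 2);
store.put("order:1001", 96);
store.put("user:ann", 1);
console.log(store.scan("user:")); // [ [ 'user:ann', 1 ], [ 'user:bob', 2 ] ]
```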



Key/Value Examples
• Azure AppFabric Cache
• Memcached
• VMware vFabric GemFire




Object Databases
• Db4o
• GemStone/S
• InterSystems Caché
• Objectivity/DB
• ZODB




Tabular
• BigTable
• Mnesia
• HBase
• Hypertable
• Azure Table Storage
• SQL Server 2012




Azure Table Storage Demo




Big Data




Big Data Definition
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks




Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)



Real World Example
• Twitter
   – The challenges
      • Needs to store many graphs
         – Who you are following
         – Who's following you
         – Who you receive phone notifications from, etc.
      • Delivering a tweet requires rapid paging of followers
      • Heavy write load as followers are added and removed
      • Set arithmetic for @mentions (intersection of users)



What did they try?
• Started with Relational Databases
• Tried Key-Value storage of denormalized lists
• Did it work?
   – Nope
      • Either good at
         – Handling the write load
         – Or paging large amounts of data
         – But not both



What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to arrive out of order
   – Or be processed more than once
• Failures should result in redundant work
   – Not lost work!




The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
  – List of all edges in a graph
     • Key is the edge; value is a set of the node end points

• Optimized for fast read and write
• Optimized for page-able set arithmetic.
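
"Page-able set arithmetic" can be pictured with the sketch below: two adjacency lists held as sorted arrays, intersected with a linear merge, then returned a page at a time. The data, `intersect`, and `page` are all invented for illustration and do not reflect FlockDB's real interface.

```javascript
// Adjacency lists kept as sorted arrays of user IDs (made-up data).
const followersOfAnn = [2, 5, 8, 13, 21];
const followersOfBob = [3, 5, 8, 21, 34];

// Intersect two sorted lists with a linear merge.
function intersect(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

// Return one page of results plus a cursor for the next page (null at the end).
function page(list, cursor, size) {
  const items = list.slice(cursor, cursor + size);
  const next = cursor + size < list.length ? cursor + size : null;
  return { items, next };
}

const both = intersect(followersOfAnn, followersOfBob);
console.log(both);             // [ 5, 8, 21 ]
console.log(page(both, 0, 2)); // { items: [ 5, 8 ], next: 2 }
console.log(page(both, 2, 2)); // { items: [ 21 ], next: null }
```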




How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  – All queries can be answered by a single partition
• Write operations are idempotent
  – Can be applied multiple times without changing the
    result
• And commutative
  – Changing the order of operands doesn't change the
    result.
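
One way to get writes that are both idempotent and commutative is to stamp each write with a timestamp and let the newest one win. The sketch below assumes that approach and assumes timestamps are unique per edge; `applyWrite` and the data are hypothetical, and this is not FlockDB's actual code.

```javascript
// Apply one write to the edge table. A write only wins if it is newer
// than what is stored, so replays and reordering cannot change the outcome.
// Assumes timestamps are unique per edge.
function applyWrite(table, write) {
  const key = `${write.from}->${write.to}`;
  const current = table.get(key);
  if (!current || write.ts > current.ts) {
    table.set(key, { state: write.op, ts: write.ts });
  }
  return table;
}

const writes = [
  { from: "ann", to: "bob", op: "add", ts: 1 },
  { from: "ann", to: "bob", op: "remove", ts: 2 },
  { from: "ann", to: "bob", op: "add", ts: 3 },
];

// Delivered in order, once each.
const inOrder = writes.reduce(applyWrite, new Map());

// Delivered in reverse, with every write arriving twice.
const shuffled = [...writes].reverse().flatMap((w) => [w, w]);
const outOfOrder = shuffled.reduce(applyWrite, new Map());

console.log(inOrder.get("ann->bob"));    // { state: 'add', ts: 3 }
console.log(outOfOrder.get("ann->bob")); // { state: 'add', ts: 3 }
```

Both runs end in the same state, which is what lets a failed job be retried blindly: the worst case is redundant work.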



Working With Big Data




ACID
• Atomicity
   – All or Nothing
• Consistency
   – Valid according to all defined rules
• Isolation
   – No transaction should be able to interfere with another
     transaction
• Durability
   – Once a transaction has been committed, it will remain
     so, even in the event of power loss, crashes, or errors
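
A toy illustration of the "all or nothing" and "valid according to all defined rules" bullets. The `transaction` and `transfer` helpers are assumptions for the example: work happens on a copy, and the copy is kept only if no rule is broken. Isolation and durability are out of scope for a sketch this small.

```javascript
// Run a function against a copy of the state; keep the copy only if
// the function finishes without throwing (all or nothing).
function transaction(state, fn) {
  const draft = { ...state };
  try {
    fn(draft);
    return draft; // commit
  } catch (err) {
    return state; // roll back: the original is untouched
  }
}

// Build a transfer operation; the rule enforced is "no negative balance".
function transfer(from, to, amount) {
  return (accounts) => {
    accounts[from] -= amount;
    if (accounts[from] < 0) throw new Error("insufficient funds");
    accounts[to] += amount;
  };
}

const start = { ann: 100, bob: 20 };
const ok = transaction(start, transfer("ann", "bob", 30));
const failed = transaction(ok, transfer("bob", "ann", 500));

console.log(ok);     // { ann: 70, bob: 50 }
console.log(failed); // { ann: 70, bob: 50 }
```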


BASE
• Basically Available
   – High availability but not always consistent
• Soft state
   – Background cleanup mechanism
• Eventual consistency
   – Given a sufficiently long period of time over which no
     changes are sent, all updates can be expected to
     propagate eventually through the system and all the
     replicas will be consistent.
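
Eventual consistency can be sketched with three in-memory replicas and a background pass that copies the newest version everywhere. The `write` and `antiEntropy` functions and the version counter are assumptions made for the example, not any product's mechanism.

```javascript
// Three replicas; each holds a value and the version it was written at.
const replicas = [
  { value: "old", version: 1 },
  { value: "old", version: 1 },
  { value: "old", version: 1 },
];

// A write lands on one replica only; the others are briefly stale.
function write(replica, value) {
  replica.value = value;
  replica.version += 1;
}

// Background cleanup: copy the newest version to every replica.
function antiEntropy(all) {
  const newest = all.reduce((a, b) => (b.version > a.version ? b : a));
  const { value, version } = newest;
  for (const r of all) {
    r.value = value;
    r.version = version;
  }
}

write(replicas[0], "new");
console.log(replicas.map((r) => r.value)); // [ 'new', 'old', 'old' ]

antiEntropy(replicas);
console.log(replicas.map((r) => r.value)); // [ 'new', 'new', 'new' ]
```

Between the write and the cleanup pass, a reader may see either value; that window is the "soft state" the slide refers to.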




Traditional (relational) Approach


[Diagram: Transactional Data Store -> Extract -> Transform -> Load -> Data Warehouse]
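
The extract, transform, load flow can be sketched in a few lines. Everything here is a stand-in: `extract` returns hard-coded rows instead of querying a transactional store, and the "warehouse" is just an array.

```javascript
// Extract: rows as they might come from a transactional store (made up).
function extract() {
  return [
    { orderId: 1001, customer: "Ann", qty: 2, unitPrice: 48 },
    { orderId: 1001, customer: "Ann", qty: 1, unitPrice: 56 },
    { orderId: 1002, customer: "Bob", qty: 1, unitPrice: 24 },
  ];
}

// Transform: reshape line items into one total per order.
function transform(rows) {
  const totals = new Map();
  for (const r of rows) {
    const t = totals.get(r.orderId) ||
      { orderId: r.orderId, customer: r.customer, total: 0 };
    t.total += r.qty * r.unitPrice;
    totals.set(r.orderId, t);
  }
  return [...totals.values()];
}

// Load: append to the warehouse table (an array standing in for one).
function load(target, facts) {
  target.push(...facts);
  return target;
}

const warehouse = load([], transform(extract()));
console.log(warehouse);
// [ { orderId: 1001, customer: 'Ann', total: 152 },
//   { orderId: 1002, customer: 'Bob', total: 24 } ]
```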




Big Data Approach
• MapReduce Pattern/Framework
  – an Input Reader
  – Map Function – To transform to a common shape
    (format)
  – a partition function
  – a compare function
  – Reduce Function
  – an Output Writer
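
The six parts listed above can be wired together in a single process to see how they interact. This is a teaching sketch under obvious simplifications (no distribution, no fault tolerance); the `mapReduce` function and its option names are made up for the example and are not any framework's API.

```javascript
// Tiny single-process MapReduce with the six parts named on the slide.
function mapReduce(job, partitions) {
  const { inputReader, map, partition, compare, reduce, outputWriter } = job;

  // Map phase: each record may emit any number of [key, value] pairs,
  // and the partition function picks the bucket for each key.
  const buckets = Array.from({ length: partitions }, () => []);
  for (const record of inputReader()) {
    for (const [key, value] of map(record)) {
      buckets[partition(key, partitions)].push([key, value]);
    }
  }

  // Reduce phase: sort each bucket by key, group equal keys, reduce each group.
  for (const bucket of buckets) {
    bucket.sort((a, b) => compare(a[0], b[0]));
    let i = 0;
    while (i < bucket.length) {
      const key = bucket[i][0];
      const values = [];
      while (i < bucket.length && compare(bucket[i][0], key) === 0) {
        values.push(bucket[i][1]);
        i++;
      }
      outputWriter(key, reduce(key, values));
    }
  }
}

// Count tags across documents.
const counts = {};
mapReduce(
  {
    inputReader: () => [{ tags: ["nosql", "azure"] }, { tags: ["nosql"] }],
    map: (doc) => doc.tags.map((tag) => [tag, 1]),
    partition: (key, n) => key.length % n,
    compare: (a, b) => (a < b ? -1 : a > b ? 1 : 0),
    reduce: (key, values) => values.reduce((sum, v) => sum + v, 0),
    outputWriter: (key, result) => { counts[key] = result; },
  },
  2
);
console.log(counts); // { azure: 1, nosql: 2 }
```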




MongoDB Example

> // map function
> m = function(){
...    this.tags.forEach(
...        function(z){
...            emit( z , { count : 1 } );
...        }
...    );
...};

> // reduce function
> r = function( key , values ){
...    var total = 0;
...    for ( var i=0; i<values.length; i++ )
...        total += values[i].count;
...    return { count : total };
...};

> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" } );




MongoDB Demo




Big Data on Azure
• Azure Table Storage
  – Azure Service Bus
• SQL Azure Federations
• MongoDB on Azure
  – http://www.mongodb.org/display/DOCS/MongoDB+on+Azure

• Hadoop on Azure
  – https://www.hadooponazure.com/




Using Azure for Computing


[Diagram: a Client talks to a Master (Job/Task Scheduler, with its own Data), which hands work to three Workers, each with its own Data]




Moving to Event Based Architecture
[Diagram: Web Roles push requests (Req) onto a Queue; Worker Roles pull from it. Monitor queue length against user's expectations.]
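
The queue-centred design can be sketched with an array standing in for the queue. `webRole`, `workerRole`, and `workersNeeded` are illustrative names, and `maxBacklogPerWorker` is an assumed tuning knob rather than an Azure setting.

```javascript
// Web roles enqueue requests; worker roles dequeue and process them.
const queue = [];
const processed = [];

function webRole(request) {
  queue.push(request);
}

function workerRole() {
  const request = queue.shift();
  if (request !== undefined) processed.push(request.toUpperCase());
}

// Decide how many workers are needed to keep the backlog under a target.
function workersNeeded(queueLength, maxBacklogPerWorker) {
  return Math.max(1, Math.ceil(queueLength / maxBacklogPerWorker));
}

["req-a", "req-b", "req-c", "req-d", "req-e"].forEach(webRole);
console.log(workersNeeded(queue.length, 2)); // 3

workerRole();
workerRole();
console.log(processed);    // [ 'REQ-A', 'REQ-B' ]
console.log(queue.length); // 3
```

Because web roles never wait on workers, the queue length becomes the one number to watch when deciding whether to add or remove worker instances.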




Aggregate Stores




Visualizing Aggregates

  ID: 1001
  Customer: Ann

  Line Items
    32411234        2    $48   $96
    707423234       1    $56   $56
    125145          1    $24   $24

  Payment Details
    Card: AmEx
    CC#: 12343
    Expiration: 07/2015

[Diagram: the aggregate above spans four relational tables: Orders, Customers, Order Lines, Credit Cards]




Visualizing Aggregates

  ID: 1001
  Customer: Ann

  Line Items
    32411234        2    $48   $96
    707423234       1    $56   $56
    125145          1    $24   $24

  Payment Details
    Card: AmEx
    CC#: 12343
    Expiration: 07/2015

  {
    "SalesOrdersView": {
      ID: 1001,
      Customer: Ann,
      LineItems: []
      ……………..
    }
  }
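
As a sketch of how such an aggregate might be assembled, the code below folds rows from separate relational-style tables into one document per order. The values come from the slide where given; the column names, the `toAggregate` helper, and the shape of `LineItems` and `PaymentDetails` are assumptions, since the slide elides them.

```javascript
// Rows as they would sit in separate relational tables.
const orders = [{ id: 1001, customer: "Ann" }];
const orderLines = [
  { orderId: 1001, productId: 32411234, qty: 2, price: 48 },
  { orderId: 1001, productId: 707423234, qty: 1, price: 56 },
  { orderId: 1001, productId: 125145, qty: 1, price: 24 },
];
const creditCards = [{ orderId: 1001, card: "AmEx", expiration: "07/2015" }];

// Fold the related rows into one document per order.
function toAggregate(order) {
  const card = creditCards.find((c) => c.orderId === order.id);
  return {
    SalesOrdersView: {
      ID: order.id,
      Customer: order.customer,
      LineItems: orderLines
        .filter((l) => l.orderId === order.id)
        .map(({ productId, qty, price }) => ({
          productId,
          qty,
          price,
          total: qty * price,
        })),
      PaymentDetails: { Card: card.card, Expiration: card.expiration },
    },
  };
}

const doc = toAggregate(orders[0]);
console.log(JSON.stringify(doc, null, 2));
```

Reading the order back is then a single lookup by ID instead of a join across four tables.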




MongoDB on Azure Demo




Next Steps
• Learn a NoSQL product
  – Great place to start – AppFabric Cache, Azure Table
    Storage, MongoDB
• Pick a new programming language to learn
  – Not Java or C#/VB
  – Node.js, JavaScript, F#




THANK YOU




Intro to Big Data and NoSQL

  • 1. Introduction to Big Data and NoSQL SQL Azure Saturday April, 21, 2012 Don Demsak Advisory Solutions Architect EMC Consulting www.donxml.com 1
  • 2. Meet Don • Advisory Solutions Architect – EMC Consulting • Application Architecture, Development & Design • DonXml.com, Twitter: donxml • Email – don@donxml.com • SlideShare - http://www.slideshare.net/dondemsak 2
  • 3. The era of Big Data 3
  • 4. How did we get here? • Expensive • Monoculture – Processors – Limit CPU cycles – Disk space – Limit disk space – Memory – Limit memory – Operating Systems – Limited OS – Software Development – Programmers – Limited Software – Programmers • Mono-lingual • Mono-persistence 4
  • 5. Typical RDBMS Implementations • Fixed table schemas • Small but frequent reads/writes • Large batch transactions • Focus on ACID – Atomicity – Consistency – Isolation – Durability 5
  • 6. How we scale RDBMS implementations 6
  • 7. 1st Step – Build a relational database Database 7
  • 8. 2nd Step – Table Partitioning p1 p2 p3 Database 8
  • 9. 3rd Step – Database Partitioning Browser Web Tier B/L Tier Database Customer #1 Browser Web Tier B/L Tier Database Customer #2 Browser Web Tier B/L Tier Database Customer #3 9
  • 10. 4th Step – Move to the cloud? Browser Web Tier B/L Tier SQL Azure Federation Customer #1 SQL Azure Browser Web Tier B/L Tier Federation Customer #2 SQL Azure Browser Web Tier B/L Tier Federation Customer #3 10
  • 11. There has to be other ways 11
  • 14. 14
  • 15. Where Did NoSQL Originate? • 1998 - Carlo Strozzi – NoSQL project - lightweight open-source relational DB with no SQL interface • 2009 - Eric Evans & Johan Oskarsson of Last.fm wanted to organize an event to discuss open- source distributed databases 15
  • 16. NoSQL (loose) Definition • (often) Open source • Non-relational • Distributed • (often) don‟t guarantee ACID 16
  • 17. Atlanta 2009 • No:sql(east) conference – select fun, profit from real_world where relational=false • Billed as “conference of no-rel datastores” 17
  • 18. Types Of NoSQL Data Stores 18
  • 19. 5 Groups of Data Models Relational Document Key Value Graph Column Family 19
  • 20. Document Store • Apache Jackrabbit • CouchDB • MongoDB • SimpleDB • XML Databases – MarkLogic Server – eXist. 20
  • 21. Document? • Okay think of a web page... – Relational model requires column/tag – Lots of empty columns – Wasted space • Document model just stores the pages as is – Saves on space – Very flexible. 21
  • 22. Graph Storage • AllegroGraph • Core Data • Neo4j • DEX • FlockDB • Microsoft Trinity (research project) – http://research.microsoft.com/en-us/projects/trinity/ 22
  • 23. What‟s a graph? • Graph consists of – Node („stations‟ of the graph) – Edges (lines between them) • FlockDB – Created by the Twitter folks – Nodes = Users – Edges = Nature of relationship between nodes. 23
  • 24. Key/Value Stores • On disk • Cache in Ram • Eventually Consistent – Weak Definition • “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent” – Strong Definition • “for a given update and a given replica eventually either the update reaches the replica or the replica retires” • Ordered – Distributed Hash Table allows lexicographical processing 24
  • 25. Key/Value Examples • Azure AppFabric Cache • Memcache-d • VMWare vFabric GemFire 25
  • 26. Object Databases • Db4o • GemStone/S • InterSystems Caché • Objectivity/DB • ZODB 26
  • 27. Tabular • BigTable • Mnesia • Hbase • Hypertable • Azure Table Storage • SQL Server 2012 27
  • 29. Big Data 29
  • 30. Big Data Definition • Volumes & volumes of data • Unstructured • Semi-structured • Not suited for Relational Databases • Often utilizes MapReduce frameworks 30
  • 31. Big Data Examples • Cassandra • Hadoop • Greenplum • Azure Storage • EMC Atmos • Amazon S3 • SQL Azure (with Federations support) 31
  • 32. Real World Example • Twitter – The challenges • Needs to store many graphs  Who you are following  Who‟s following you  Who you receive phone notifications from etc • To deliver a tweet requires rapid paging of followers • Heavy write load as followers are added and removed • Set arithmetic for @mentions (intersection of users). 32
  • 33. What did they try? • Started with Relational Databases • Tried Key-Value storage of denormalized lists • Did it work? – Nope • Either good at  Handling the write load  Or paging large amounts of data  But not both 33
  • 34. What did they need? • Simplest possible thing that would work • Allow for horizontal partitioning • Allow write operations to • Arrive out of order – Or be processed more than once – Failures should result in redundant work • Not lost work! 34
  • 35. The Result was FlockDB • Stores graph data • Not optimized for graph traversal operations • Optimized for large adjacency lists – List of all edges in a graph • Key is the edge value a set of the node end points • Optimized for fast read and write • Optimized for page-able set arithmetic. 35
  • 36. How Does it Work? • Stores graphs as sets of edges between nodes • Data is partitioned by node – All queries can be answered by a single partition • Write operations are idempotent – Can be applied multiple times without changing the result • And commutative – Changing the order of operands doesn‟t change the result. 36
  • 37. Working With Big Data
  • 38. ACID • Atomicity – All or Nothing • Consistency – Valid according to all defined rules • Isolation – No transaction should be able to interfere with another transaction • Durability – Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
  • 39. BASE • Basically Available – High availability but not always consistent • Soft state – Background cleanup mechanism • Eventual consistency – Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.
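As a rough illustration of the BASE slide, the sketch below models two replicas that accept writes independently and converge after a background synchronization pass. The `write` and `synchronize` functions and the per-key version numbers are hypothetical, not taken from any particular product, and versions are assumed to be unique per key.

```javascript
// Each replica maps key -> { value, version }. Versions are assumed to be
// unique per key (hypothetical; for example, issued by a single writer).
function write(replica, key, value, version) {
  const current = replica.get(key);
  if (!current || version > current.version) {
    replica.set(key, { value, version });
  }
}

// Background "cleanup": exchange entries in both directions, newest wins.
function synchronize(a, b) {
  for (const [key, entry] of a) write(b, key, entry.value, entry.version);
  for (const [key, entry] of b) write(a, key, entry.value, entry.version);
}

const replicaA = new Map();
const replicaB = new Map();

write(replicaA, "profile:ann", "v1", 1);
synchronize(replicaA, replicaB);

// A new write reaches replica A only: the system stays available,
// but replica B briefly serves the old value (soft state).
write(replicaA, "profile:ann", "v2", 2);
const staleRead = replicaB.get("profile:ann").value;

// Once changes stop and synchronization runs, the replicas agree.
synchronize(replicaA, replicaB);
const convergedRead = replicaB.get("profile:ann").value;
console.log(staleRead, convergedRead); // "v1" then "v2"
```

The stale read in the middle is the trade-off BASE accepts in exchange for availability.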
  • 40. Traditional (relational) Approach • Transactional Data Store → Extract → Transform → Load → Data Warehouse
  • 41. Big Data Approach • MapReduce Pattern/Framework – An Input Reader – A Map Function – To transform to a common shape (format) – A Partition Function – A Compare Function – A Reduce Function – An Output Writer
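The six pieces listed on slide 41 can be wired together in a few lines. The following is a single-process sketch under stated assumptions: the `mapReduce` function, its option names, and the sample `things` data are hypothetical, and a real framework such as Hadoop would distribute the partitions across machines.

```javascript
// Minimal in-memory MapReduce pipeline. All names here are illustrative.
function mapReduce({ inputReader, map, partition, compare, reduce, outputWriter, partitions }) {
  // One bucket per partition; each bucket groups emitted values by key.
  const buckets = Array.from({ length: partitions }, () => new Map());

  // Map phase: each record may emit any number of [key, value] pairs.
  for (const record of inputReader()) {
    for (const [key, value] of map(record)) {
      const bucket = buckets[partition(key, partitions)];
      if (!bucket.has(key)) bucket.set(key, []);
      bucket.get(key).push(value);
    }
  }

  // Reduce phase: within each partition, visit keys in compare order.
  for (const bucket of buckets) {
    const keys = [...bucket.keys()].sort(compare);
    for (const key of keys) outputWriter(key, reduce(key, bucket.get(key)));
  }
}

// Sample input: documents with a tags array.
const things = [
  { tags: ["dog", "cat"] },
  { tags: ["cat"] },
  { tags: ["mouse", "cat", "dog"] },
];

const output = {};
mapReduce({
  inputReader: () => things,
  map: (doc) => doc.tags.map((tag) => [tag, { count: 1 }]),
  // Toy hash: sum of character codes modulo the partition count.
  partition: (key, n) => [...key].reduce((h, c) => h + c.charCodeAt(0), 0) % n,
  compare: (a, b) => a.localeCompare(b),
  reduce: (key, values) => ({ count: values.reduce((sum, v) => sum + v.count, 0) }),
  outputWriter: (key, result) => { output[key] = result; },
  partitions: 2,
});
console.log(output); // counts: cat 3, dog 2, mouse 1
```

The partition function decides which reducer sees a key, so every value for a given key lands in the same bucket; that is what lets the reduce step run independently per partition.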
  • 42. MongoDB Example
      // map function: emit { count : 1 } once for every tag on a document
      m = function(){
          this.tags.forEach(
              function(z){
                  emit( z , { count : 1 } );
              }
          );
      };
      // reduce function: sum the counts emitted for one tag
      r = function( key , values ){
          var total = 0;
          for ( var i=0; i<values.length; i++ )
              total += values[i].count;
          return { count : total };
      };
      // execute: run over the "things" collection, write results to "myoutput"
      res = db.things.mapReduce(m, r, { out : "myoutput" } );
  • 44. Big Data on Azure • Azure Table Storage – Azure Service Bus • SQL Azure Federations • MongoDB on Azure – http://www.mongodb.org/display/DOCS/MongoDB+on+Azure • Hadoop on Azure – https://www.hadooponazure.com/
  • 45. Using Azure for Computing • [Diagram labels: Client, Master, Job/Task Scheduler, Worker (×3), Data (×4)]
  • 46. Moving to Event Based Architecture • [Diagram labels: Web Roles, Req, Queue, Worker Roles] • Monitor queue length against user’s expectations
  • 48. Visualizing Aggregates • Order – ID: 1001 – Customer: Ann • Line Items – 32411234 2 $48 $96 – 707423234 1 $56 $56 – 125145 1 $24 $24 • Payment Details – Card: AmEx – CC#: 12343 – Expiration: 07/2015 • Tables: Orders, Customers, Order Lines, Credit Cards
  • 49. Visualizing Aggregates • Order – ID: 1001 – Customer: Ann • Line Items – 32411234 2 $48 $96 – 707423234 1 $56 $56 – 125145 1 $24 $24 • Payment Details – Card: AmEx – CC#: 12343 – Expiration: 07/2015 • Document: { “SalesOrdersView”: { ID: 1001, Customer: Ann, LineItems: [] … } }
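To make the aggregate on slides 48 and 49 concrete, here is one way the order could look as a single document. The field names (`lineItems`, `paymentDetails`, and so on) and the reading of the line-item columns as product ID, quantity, unit price, and line total are assumptions; the slide elides the full document. The second line total is taken to be $56, which is also an assumption where the slide text is unclear.

```javascript
// The whole order, with its line items and payment details, is one value.
// Field names are hypothetical; the values come from the slide.
const salesOrder = {
  id: 1001,
  customer: "Ann",
  lineItems: [
    { productId: 32411234, quantity: 2, unitPrice: 48, lineTotal: 96 },
    { productId: 707423234, quantity: 1, unitPrice: 56, lineTotal: 56 },
    { productId: 125145, quantity: 1, unitPrice: 24, lineTotal: 24 },
  ],
  paymentDetails: { card: "AmEx", ccNumber: "12343", expiration: "07/2015" },
};

// A Map stands in for an aggregate-oriented store: one key, one serialized value.
const store = new Map();
store.set(String(salesOrder.id), JSON.stringify(salesOrder));

// One lookup returns everything needed to display or total the order;
// no joins across Orders, Order Lines, and Credit Cards tables are required.
const loaded = JSON.parse(store.get("1001"));
const orderTotal = loaded.lineItems.reduce((sum, item) => sum + item.lineTotal, 0);
console.log(loaded.customer, orderTotal); // Ann 176
```

Since the aggregate is read and written as a unit, it is also a natural unit to place on a single node when the data is sharded.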
  • 50. MongoDB on Azure Demo
  • 51. Next Steps • Learn a NoSQL product – Great places to start: AppFabric Cache, Azure Table Storage, MongoDB • Pick a new programming language to learn – Not Java or C#/VB – Node.js, JavaScript, F#
  • 52. THANK YOU

Editor's Notes

  1. There are at least four groups of data model: key-value, document, column-family, and graph. Looking at this list, there's a big similarity between the first three: all have a fundamental unit of storage which is a rich structure of closely related data. For key-value stores it's the value, for document stores it's the document, and for column-family stores it's the column family. In DDD terms, this group of data is an aggregate.

     A Graph Database stores data structured in the Nodes and Relationships of a graph.

     Column Family (BigTable-style) databases are an evolution of key-value, using "families" to allow grouping of rows.

     The rise of NoSQL databases has been driven primarily by the desire to store data effectively on large clusters, such as the setups used by Google and Amazon. Relational databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster. Aggregates make natural units for distribution strategies such as sharding, since you have a large clump of data that you expect to be accessed together.

     The Relational Model

     The relational model provides for the storage of records that are made up of tuples. Records are stored in tables. Tables are defined by a schema, which determines what columns are in the table. Columns have a name and a type. All records within a table fit that table's definition. SQL is a query language designed to operate over tables. SQL provides syntax for finding records that meet criteria, as well as for relating records in one table to another via joins; a join finds a record in one table based on its relationship to a record in another table.

     Records can be created (inserted) or deleted. Fields within a record can be updated individually.

     Implementations of the relational model usually provide transactions, which provide a means to make modifications spanning multiple records atomically.

     In terms of what programming languages provide, tables are like arrays or lists of records or structures. For high performance access, tables can be indexed in various ways using B-trees or hash maps.

     Key-Value Stores

     Key-value stores provide access to a value based on a key.

     The key-value pair can be created (inserted), or deleted. The value associated with a key may be updated.

     Key-value stores don't usually provide transactions.

     In terms of what programming languages provide, key-value stores resemble hash tables; these have many names: HashMap (Java), hash (Perl), dict (Python), associative array (PHP), boost::unordered_map<...> (C++).

     Key-value stores provide one implicit index on the key itself.

     A key-value store may not sound like the most useful thing, but a lot of information can be stored in the value. It is quite common for the value to be an XML document, a JSON object, or some other serialized form. The key point here is that the storage engine is not aware of the internal structure of the value. It is up to the client application to interpret the value and manage its contents. The value can only be written as a whole; if the client is storing a JSON object, and only wants to update one field, the entire value must be fetched, the new value substituted, and then the entire value must be written back.

     The inability to fetch data by anything other than one key may appear limited, but there are workarounds. If the application requires a secondary index, the application can maintain one itself. To do this, the application manages a second collection of key-value pairs where the key is the value of another field in the first collection, and the value is the primary key in the first collection. Because there are no transactions that can be used to make sure that the secondary index is kept synchronized with the original collection, any application that does this would be wise to have a periodic syncing process to clean up after any partial changes that occur due to application crashes, bugs, or errors.

     Document Stores

     Document stores provide access to structured data, but unlike the relational model, there may not be a schema that is enforced. In essence, the application stores bags of key-value pairs. In order to operate in this environment, the application adopts some conventions about how to deal with differing bags it may retrieve, or it may take advantage of the storage engine's ability to put different documents in different collections, which the application will use to manage its data.

     Unlike a relational store, document stores usually support nested structures. For example, for document stores that support XML or JSON documents, the value of a field may be something that looks like another document. Document stores can also support array or list-valued keys.

     Unlike a key-value store, document stores are aware of the internal structure of the document. This allows the storage engine to support secondary indexes directly, allowing for efficient queries on any field. The ability to support nested document storage leads to query languages that can be used to search for items nested inside others; XQuery is one example of this. MongoDB supports some similar functionality by allowing the specification of JSON field paths in queries.

     Column Stores

     Column stores are like relational stores, except that they flip the data around. Instead of storing records, column stores store all the values for a column together in a stream. An index provides a means to get column values for any particular record.

     Map-reduce implementations such as Hadoop are most efficient if they can stream in their data. Column stores work particularly well for that. As a result, stores like HBase and Hypertable are often used as non-relational data warehouses to feed map-reduce for analytics.

     A relational-style column scalar may not be the most useful for analytics, so users often store more complex structures in columns. This manifests directly in Cassandra, which introduces the notion of "column families," which get treated as a "super-column."

     Column-oriented stores support retrieving records, but this requires fetching the column values from their individual columns and re-assembling the record.

     Graph Databases

     Graph databases store vertices and the edges between them. Some support adding annotations to the vertices and/or edges. This can be used to model things like social graphs (people are represented by vertices, and their relationships are the edges), or real-world objects (components are represented by vertices, and their connectedness is represented by edges). The content on IMDB is tied together by a graph: movies are related to the actors in them, and actors are related to the movies they star in, forming a large complex graph.

     The access and query languages for graph databases are the most different of the set of those discussed here. Graph database query languages are generally about finding paths in the graph based on either endpoints, or constraints on attributes of the paths between endpoints; one example is SPARQL.
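Returning to the key-value notes above, the application-managed secondary index and the periodic syncing process can be sketched as follows. Two in-memory `Map` objects stand in for the key-value collections, and the `putUser`, `findByEmail`, `updateField`, and `rebuildIndex` names are hypothetical; a real store would be accessed over the network and could fail between the two writes.

```javascript
const users = new Map();        // primary collection: userId -> serialized JSON
const usersByEmail = new Map(); // secondary index: email -> userId

// Store the whole value, then maintain the index entry in a second,
// separate write. There is no transaction joining the two writes.
function putUser(id, user) {
  const previous = users.has(id) ? JSON.parse(users.get(id)) : null;
  users.set(id, JSON.stringify(user));
  if (previous && previous.email !== user.email) {
    usersByEmail.delete(previous.email);
  }
  usersByEmail.set(user.email, id);
}

// Look up by the secondary field: index first, then the primary key.
function findByEmail(email) {
  const id = usersByEmail.get(email);
  const raw = id === undefined ? undefined : users.get(id);
  return raw === undefined ? null : JSON.parse(raw);
}

// Updating one field means fetching, changing, and rewriting the whole value.
function updateField(id, field, value) {
  const user = JSON.parse(users.get(id));
  user[field] = value;
  putUser(id, user);
}

// Periodic syncing process: rebuild the index from the primary collection
// to clean up after crashes or bugs that left the two out of step.
function rebuildIndex() {
  usersByEmail.clear();
  for (const [id, raw] of users) usersByEmail.set(JSON.parse(raw).email, id);
}

putUser("u1", { name: "Ann", email: "ann@example.com" });
updateField("u1", "email", "ann@example.org");
console.log(findByEmail("ann@example.com")); // null
console.log(findByEmail("ann@example.org").name); // Ann
```

A document store removes the need for this bookkeeping because the engine understands the value's structure and can index its fields directly.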
  2. Need to go into the EMC offerings