Introduction to Big Data and NoSQL
NJ SQL Server User Group
May 15, 2012

Melissa Demsak, SQL Architect, Realogy – www.sqldiva.com
Don Demsak, Advisory Solutions Architect, EMC Consulting – www.donxml.com
Meet Melissa
• SQL Architect
   – Realogy
• SqlDiva, Twitter: sqldiva
• Email – melissa@sqldiva.com
Meet Don
• Advisory Solutions Architect
   – EMC Consulting
      • Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email – don@donxml.com
• SlideShare - http://www.slideshare.net/dondemsak
The era of Big Data
How did we get here?
• Expensive resources led to a culture of limitations:
  o Processors – limit CPU cycles
  o Disk space – limit disk space
  o Memory – limit memory
  o Operating systems – limited OS development
  o Software – limited software
  o Programmers – one language, one persistence store
Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
  o   Atomicity
  o   Consistency
  o   Isolation
  o   Durability
How we scale RDBMS implementations

1st Step – Build a relational database
(diagram: a single relational database)
2nd Step – Table partitioning
(diagram: one relational database with table partitions p1, p2, p3)
3rd Step – Database partitioning
(diagram: each customer gets its own stack – Browser → Web Tier → B/L Tier → Relational Database – shown for Customer #1, #2, and #3)
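
A minimal sketch of the routing piece this step implies: the business-logic tier looks up which database serves a given customer before running a query. The connection strings and lookup rule here are illustrative assumptions, not from the slides.

// Hypothetical per-customer shard map (one database per customer).
var shards = {
  "customer1": "Server=sql01;Database=CustomerDb1",
  "customer2": "Server=sql02;Database=CustomerDb2",
  "customer3": "Server=sql03;Database=CustomerDb3"
};

// The B/L tier resolves the connection string per request.
function getConnectionString(customerId) {
  var conn = shards[customerId];
  if (!conn) { throw new Error("Unknown customer: " + customerId); }
  return conn;
}

console.log(getConnectionString("customer2")); // Server=sql02;Database=CustomerDb2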
4th Step – Move to the cloud?
(diagram: the same per-customer stacks – Browser → Web Tier → B/L Tier – now backed by SQL Azure Federation for Customer #1, #2, and #3)
Problems created by too much data
• Where to store
• How to store
• How to process
• Organization, searching, and
  metadata
• How to manage access
• How to copy, move, and backup
• Lifecycle
Polyglot Programmer

Polyglot Persistence (how to store)
• Atlanta 2009 – No:sql(east) conference
     select fun, profit from real_world
     where relational=false
• Billed as “conference of no-rel datastores”

(loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) Does not guarantee ACID
Types Of NoSQL Data Stores
5 Groups of Data Models
• Relational
• Document
• Key Value
• Graph
• Column Family
Document?
• Think of a web page...
  o The relational model requires a column per tag/field
  o Lots of empty columns
  o Wasted space and processing time

• The document model just stores the page as is
  o Saves on space
  o Very flexible

• Document Databases
  o Apache Jackrabbit
  o CouchDB
  o MongoDB
  o SimpleDB
  o XML Databases
     • MarkLogic Server
     • eXist
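
To make “stores the page as is” concrete, here is a small MongoDB shell sketch; the collection and field names are illustrative, not from the deck.

> // Two documents with different shapes can live in the same collection –
> // no columns have to be declared up front.
> db.pages.insert({ url : "/home",  title : "Home",  tags : ["intro"] })
> db.pages.insert({ url : "/about", title : "About", author : "melissa" })

> // Query by any field; documents that lack the field are simply not matched.
> db.pages.find({ tags : "intro" })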
Key/Value Stores
• Simple index on the key
• Value can be any serialized form of data
• Lots of different implementations
   o Eventually consistent
       • “If no updates occur for a period, eventually all updates will propagate
         through the system and all replicas will be consistent”
   o Cached in RAM
   o Cached on disk
   o Distributed hash tables

• Examples
   o Azure AppFabric Cache
   o Memcached
   o VMware vFabric GemFire
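
A toy in-process sketch of the model: values are opaque blobs looked up by a single key. Real stores such as Memcached add networking, expiry, and distribution on top of the same get/set/delete idea; the key and value below are made up.

// Toy key/value store – the store never looks inside the value.
var store = {};

function set(key, value) { store[key] = JSON.stringify(value); }
function get(key)        { return store[key] ? JSON.parse(store[key]) : null; }
function del(key)        { delete store[key]; }

set("session:42", { user: "don", cart: ["book", "pen"] });
console.log(get("session:42").cart.length); // 2
del("session:42");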
Graph?
• A graph consists of
   o Nodes (the ‘stations’ of the graph)
   o Edges (the lines between them)

• Graph Stores
   o AllegroGraph
   o Core Data
   o Neo4j
   o DEX
   o FlockDB
       • Created by the Twitter folks
       • Nodes = users
       • Edges = nature of the relationship between nodes
   o Microsoft Trinity (research project)
       • http://research.microsoft.com/en-us/projects/trinity/
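
A small sketch of the node/edge shape using the Twitter-style “follows” relationship mentioned above; the user names are invented.

// Nodes are users; edges are directed "follows" relationships.
var edges = [
  { from: "alice", to: "bob"   },
  { from: "alice", to: "carol" },
  { from: "bob",   to: "carol" }
];

// Who does alice follow?
var following = edges
  .filter(function (e) { return e.from === "alice"; })
  .map(function (e) { return e.to; });

console.log(following); // [ 'bob', 'carol' ]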
Column Family?
• Lots of variants
   o  Object Stores
       • Db4o
       • GemStone/S
       • InterSystems Caché
       • Objectivity/DB
       • ZODB
   o Tabular
       • BigTable
       • Mnesia
       • HBase
       • Hypertable
       • Azure Table Storage
   o Column-oriented
       • Greenplum
       • Microsoft SQL Server 2012
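
For the BigTable-style (tabular) variant, a rough sketch of the shape – row key, then column family, then column – with invented data; object stores and column-oriented engines organize data differently.

// Row key -> column family -> column -> value.
// Rows in the same store may carry different columns per family.
var users = {
  "user:1": {
    profile:  { name: "Melissa", twitter: "sqldiva" },
    activity: { lastLogin: "2012-05-15" }
  },
  "user:2": {
    profile:  { name: "Don" }            // no "activity" family written yet
  }
};

console.log(users["user:1"].profile.twitter); // "sqldiva"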
Okay, Got It. Now Let’s Compare Some Real-World Scenarios
You Need Constant Consistency
• You’re dealing with financial transactions
• You’re dealing with medical records
• You’re dealing with bonded goods
• Best you use an RDBMS
You Need Horizontal Scalability
• You’re working across defined time zones
• You’re aggregating large quantities of data
• You’re maintaining a chat server (Facebook chat)
• Use Column Family storage
Frequently Written, Rarely Read
• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it’s only read when the report is run
• Use Key-Value storage
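
A sketch of that write-heavy counter pattern: one increment per page view, read only when the report runs. The page names are illustrative.

// Frequently written: one increment per page view.
var counters = {};
function hit(page) { counters[page] = (counters[page] || 0) + 1; }

// Rarely read: only when the report runs.
function report() { return counters; }

hit("/home"); hit("/home"); hit("/pricing");
console.log(report()); // { '/home': 2, '/pricing': 1 }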
Here Today, Gone Tomorrow
• Transient data like...
    o Web sessions
    o Locks
    o Short-term stats
       • Shopping cart contents

• Use Key-Value storage
Where to store
• RAM
   o Fast
   o Expensive
   o Volatile

• Local Disk
   o SSD – super fast
   o Fast spinning disks (7200+ RPM)
   o High bandwidth possible
   o Persistent

• Parallel File System
   o HDFS (Hadoop)
   o Auto-replicated for parallel, decentralized I/O

• SAN
   o Storage Area Network
   o Fully managed
   o Expensive

• Cloud
   o Amazon
   o Box.Net
   o DropBox
Big Data
Big Data Definition
• Volume – beyond what traditional environments can handle
• Velocity – need decisions fast
• Variety – many formats
Additional Big Data Concepts
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)?
Real World Example
• Twitter
  o The challenges
     • Needs to store many graphs
        – Who you are following
        – Who’s following you
        – Who you receive phone notifications from, etc.
     • Delivering a tweet requires rapid paging of followers
     • Heavy write load as followers are added and removed
     • Set arithmetic for @mentions (intersection of users)
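
One way to read the “intersection of users” bullet: a tweet that mentions another user is shown only to people who follow both accounts, which is a set intersection over follower lists. The names and the delivery rule below are illustrative assumptions, not Twitter’s documented behavior.

// Followers of the author and of the mentioned user (invented data).
var followersOfAuthor    = ["alice", "bob", "carol"];
var followersOfMentioned = ["carol", "dave"];

// Audience = users present in both follower sets.
var audience = followersOfAuthor.filter(function (u) {
  return followersOfMentioned.indexOf(u) !== -1;
});

console.log(audience); // [ 'carol' ]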
What did they try?
• Started with Relational Databases
• Tried Key-Value storage of denormalized lists
• Did it work?
   o Nope
      • Either good at
         – Handling the write load
         – Or paging large amounts of data
         – But not both
What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to
  o Arrive out of order
  o Or be processed more than once
• Failures should result in redundant work
  o Not lost work!
The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
  o List of all edges in a graph
      • Key is the edge; value is a set of the node end points

• Optimized for fast reads and writes
• Optimized for page-able set arithmetic
How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  o All queries can be answered by a single partition

• Write operations are idempotent
  o Can be applied multiple times without changing the result

• And commutative
  o Changing the order of operations doesn’t change the result
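
A rough sketch of what idempotent writes look like when edges live in a set: replaying the same “add edge” is harmless, and two different adds land in the same state whichever arrives first. Real systems typically order deletes against adds with timestamps; that part is not shown, and all names are invented.

// node -> set of follower ids, stored as an object used as a set
var followers = {};

function addEdge(node, follower) {
  followers[node] = followers[node] || {};
  followers[node][follower] = true;          // adding twice changes nothing (idempotent)
}

addEdge("don", "melissa");
addEdge("don", "melissa");                   // replayed write – same state
addEdge("don", "alice");                     // order of the two adds does not matter

console.log(Object.keys(followers["don"]).sort()); // [ 'alice', 'melissa' ]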
How to Process Big Data
ACID
• Atomicity
  o All or Nothing

• Consistency
  o Valid according to all defined rules

• Isolation
  o No transaction should be able to interfere with another transaction

• Durability
  o Once a transaction has been committed, it will remain so, even in
    the event of power loss, crashes, or errors
BASE
• Basically Available
  o High availability but not always consistent

• Soft state
  o Background cleanup mechanism

• Eventual consistency
  o Given a sufficiently long period of time over which no changes are
    sent, all updates can be expected to propagate eventually through
    the system and all the replicas will be consistent
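
A toy sketch of the “basically available / soft state / eventually consistent” behavior: a write is acknowledged by one replica immediately and propagated to the second in the background, so reads may briefly disagree and then converge. The replica and key names are made up.

// Two replicas plus a queue of pending background propagation.
var replicaA = {}, replicaB = {};
var pending  = [];

function write(key, value) {
  replicaA[key] = value;                 // acknowledged right away (basically available)
  pending.push([key, value]);            // soft state: work left for a background pass
}

function propagate() {                   // background cleanup / anti-entropy pass
  pending.forEach(function (p) { replicaB[p[0]] = p[1]; });
  pending = [];
}

write("profile:don", "v2");
console.log(replicaB["profile:don"]);    // undefined – stale read before propagation
propagate();
console.log(replicaB["profile:don"]);    // "v2" – replicas have converged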
Traditional (relational) Approach
(diagram: ETL – Extract from the transactional data store, Transform, Load into the data warehouse)
Big Data Approach
• MapReduce Pattern/Framework
  o An input reader
  o A map function – to transform to a common shape (format)
  o A partition function
  o A compare function
  o A reduce function
  o An output writer
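
A plain-JavaScript walk-through of those stages on a toy word count, kept deliberately small; real frameworks run the same steps distributed across many machines. The input data is invented.

// Input reader: produce the records to process.
var input = ["big data", "no sql", "big data"];

// Map: emit (key, value) pairs.
var pairs = [];
input.forEach(function (line) {
  line.split(" ").forEach(function (word) { pairs.push([word, 1]); });
});

// Partition + compare: group the pairs by key (stands in for sort/shuffle).
var groups = {};
pairs.forEach(function (p) {
  (groups[p[0]] = groups[p[0]] || []).push(p[1]);
});

// Reduce: fold each group of values to a single value.
var output = {};
Object.keys(groups).forEach(function (key) {
  output[key] = groups[key].reduce(function (a, b) { return a + b; }, 0);
});

// Output writer: here we just print the result.
console.log(output); // { big: 2, data: 2, no: 1, sql: 1 }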
MongoDB Example

> // map function
> m = function(){
...    this.tags.forEach(
...        function(z){
...            emit( z , { count : 1 } );
...        }
...    );
...};

> // reduce function
> r = function( key , values ){
...    var total = 0;
...    for ( var i=0; i<values.length; i++ )
...        total += values[i].count;
...    return { count : total };
...};

> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" } );
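
The example assumes documents in the things collection carry a tags array. A sketch of seeding the collection and reading the output collection (the sample data and exact result shape are illustrative):

> db.things.insert({ name : "slide deck", tags : ["nosql", "bigdata"] })
> db.things.insert({ name : "demo app",   tags : ["nosql"] })

> res = db.things.mapReduce(m, r, { out : "myoutput" } );
> db.myoutput.find()
{ "_id" : "bigdata", "value" : { "count" : 1 } }
{ "_id" : "nosql", "value" : { "count" : 2 } }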
What is Hadoop?
• A scalable fault-tolerant grid operating system for
  data storage and processing
• Its scalability comes from the marriage of:
  o HDFS: Self-Healing High-Bandwidth Clustered Storage
  o MapReduce: Fault-Tolerant Distributed Processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers
  and additions like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
Hadoop Core Components
• Store: HDFS – self-healing, high-bandwidth clustered storage
• Process: Map/Reduce – fault-tolerant distributed processing
HDFS: Hadoop Distributed File System
• Block size = 64 MB
• Replication factor = 3
• Cost/GB is a few ¢/month vs $/month
Hadoop Map/Reduce
Hadoop Job Architecture
(diagram: clients submit jobs to a central Resource Manager, which starts an App Master and Containers on the Node Managers; arrows show job submission, node status, resource requests, and MapReduce status)
Microsoft embraces Hadoop
• Good for enterprises & developers
• Great for end users!
HADOOP [Azure and Enterprise]
(diagram: programming surfaces – Java OM, Streaming OM, HiveQL, PigLatin, .NET/C#/F#, (T)SQL – sit on top of an “ocean of data”: NOSQL [unstructured, semi-structured, structured] and ETL over HDFS, “a seamless ocean of information processing and analytics”, fed by EIS/ERP, RDBMS, file system, OData [RSS], and Azure Storage sources)
Hive Plug-in for Excel
THANK YOU
Editor's Notes

  1. At least four groups of data model: key-value, document, column-family, and graph. Looking at this list, there's a big similarity between the first three - all have a fundamental unit of storage which is a rich structure of closely related data: for key-value stores it's the value, for document stores it's the document, and for column-family stores it's the column family. In DDD terms, this group of data is an aggregate. A graph database stores data structured in the nodes and relationships of a graph. Column family (BigTable-style) databases are an evolution of key-value, using "families" to allow grouping of rows. The rise of NoSQL databases has been driven primarily by the desire to store data effectively on large clusters - such as the setups used by Google and Amazon. Relational databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster. Aggregates make natural units for distribution strategies such as sharding, since you have a large clump of data that you expect to be accessed together.

     The Relational Model
     The relational model provides for the storage of records that are made up of tuples. Records are stored in tables. Tables are defined by a schema, which determines what columns are in the table. Columns have a name and a type. All records within a table fit that table's definition. SQL is a query language designed to operate over tables. SQL provides syntax for finding records that meet criteria, as well as for relating records in one table to another via joins; a join finds a record in one table based on its relationship to a record in another table. Records can be created (inserted) or deleted. Fields within a record can be updated individually. Implementations of the relational model usually provide transactions, which provide a means to make modifications spanning multiple records atomically. In terms of what programming languages provide, tables are like arrays or lists of records or structures. For high performance access, tables can be indexed in various ways using b-trees or hash maps.

     Key-Value Stores
     Key-value stores provide access to a value based on a key. The key-value pair can be created (inserted) or deleted. The value associated with a key may be updated. Key-value stores don't usually provide transactions. In terms of what programming languages provide, key-value stores resemble hash tables; these have many names: HashMap (Java), hash (Perl), dict (Python), associative array (PHP), boost::unordered_map<...> (C++). Key-value stores provide one implicit index on the key itself. A key-value store may not sound like the most useful thing, but a lot of information can be stored in the value. It is quite common for the value to be an XML document, a JSON object, or some other serialized form. The key point here is that the storage engine is not aware of the internal structure of the value. It is up to the client application to interpret the value and manage its contents. The value can only be written as a whole; if the client is storing a JSON object, and only wants to update one field, the entire value must be fetched, the new value substituted, and then the entire value must be written back. The inability to fetch data by anything other than one key may appear limited, but there are workarounds. If the application requires a secondary index, the application can maintain one itself. To do this, the application manages a second collection of key-value pairs where the key is the value of another field in the first collection, and the value is the primary key in the first collection. Because there are no transactions that can be used to make sure that the secondary index is kept synchronized with the original collection, any application that does this would be wise to have a periodic syncing process to clean up after any partial changes that occur due to application crashes, bugs, or errors.

     Document Stores
     Document stores provide access to structured data, but unlike the relational model, there may not be a schema that is enforced. In essence, the application stores bags of key-value pairs. In order to operate in this environment, the application adopts some conventions about how to deal with differing bags it may retrieve, or it may take advantage of the storage engine's ability to put different documents in different collections, which the application will use to manage its data. Unlike a relational store, document stores usually support nested structures. For example, for document stores that support XML or JSON documents, the value of a field may be something that looks like another document. Document stores can also support array or list-valued keys. Unlike a key-value store, document stores are aware of the internal structure of the document. This allows the storage engine to support secondary indexes directly, allowing for efficient queries on any field. The ability to support nested document storage leads to query languages that can be used to search for items nested inside others; XQuery is one example of this. MongoDB supports some similar functionality by allowing the specification of JSON field paths in queries.

     Column Stores
     Column stores are like relational stores, except that they flip the data around. Instead of storing records, column stores store all the values for a column together in a stream. An index provides a means to get column values for any particular record. Map-reduce implementations such as Hadoop are most efficient if they can stream in their data. Column stores work particularly well for that. As a result, stores like HBase and Hypertable are often used as non-relational data warehouses to feed map-reduce for analytics. A relational-style column scalar may not be the most useful for analytics, so users often store more complex structures in columns. This manifests directly in Cassandra, which introduces the notion of "column families," which get treated as a "super-column." Column-oriented stores support retrieving records, but this requires fetching the column values from their individual columns and re-assembling the record.

     Graph Databases
     Graph databases store vertices and the edges between them. Some support adding annotations to the vertices and/or edges. This can be used to model things like social graphs (people are represented by vertices, and their relationships are the edges), or real-world objects (components are represented by vertices, and their connectedness is represented by edges). The content on IMDB is tied together by a graph: movies are related to the actors in them, and actors are related to the movies they star in, forming a large complex graph. The access and query languages for graph databases are the most different of the set of those discussed here. Graph database query languages are generally about finding paths in the graph based on either endpoints, or constraints on attributes of the paths between endpoints; one example is SPARQL.
  2. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node is eight cores with 16 GB RAM and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.