SQL and NoSQL in the Context of SQL Server: Understanding Data Models and Scaling Techniques
1. SQL and NoSQL
in the Context of SQL Server
Michael Rys
Program Manager, Microsoft Corp.
@SQLServerMike
2. Key Session Takeaways
Scaling your Business is important
What are the NoSQL paradigms
You can use NoSQL Paradigms with SQL
Server and SQL Azure
We are working on moving the paradigms
into SQL Server
3. The Web 2.0 Business Architecture
Online Business Application
Attract Individual Consumers:
- Provide interesting service
- Provide mobility
- Provide social
Monetize Individual:
- Upsell service
- VIP
- Speed
- Extra Capabilities
Monetize the Social:
- Improve individual experience
- Re-sell Aggregate Data (e.g., Advertisers)
4. Social Networking: the Business Problem
100s of millions of users
10s of millions of users
concurrently
Terabytes to petabytes of
data
Structured and unstructured
Required (eventual) data
consistency across users
E.g. show your updated state
in your friends’ profile pages
5. Solution
Shard/Partition user data across
hundreds to thousands of SQL
Databases
Propagate data changes using
reliable, async Message Service
No Global Transactions! They hinder scale
and availability!
Provide a caching layer for
performance
Also used for
Clean-up state (e.g. on account close)
Deploy business logic (stored procedures)
6. Example Architecture (MySpace.com)
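[Diagram: a Web Tier in front of a Data Tier of databases sharded by userId range (1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, 5001-6000). "I change my status" (userId=1024): "My DB gets updated", then Async Messages pass through the Service Dispatcher to the other databases. Transactions are labeled TX1 to TX5.]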
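The flow in this diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MySpace's implementation: the in-memory `deque` stands in for the reliable async message service, and the names `shard_for`, `Shard`, `change_status` and `dispatch` are hypothetical. Only the userId ranges and userId=1024 come from the slide.

```python
import bisect
from collections import deque

# Upper bound of each userId range, taken from the diagram (1-1000, 1001-2000, ...).
SHARD_UPPER_BOUNDS = [1000, 2000, 3000, 4000, 5000, 6000]


def shard_for(user_id):
    """Return the index of the shard whose userId range contains user_id."""
    if not 1 <= user_id <= SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"userId {user_id} is outside every shard range")
    return bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)


class Shard:
    """Hypothetical stand-in for one database in the data tier."""

    def __init__(self):
        self.status = {}         # userId -> own status (rows owned by this shard)
        self.friend_status = {}  # userId -> last status propagated to this shard


def change_status(shards, queue, user_id, friends, text):
    """One local write on the user's own shard, then one message per target shard."""
    shards[shard_for(user_id)].status[user_id] = text
    targets = {shard_for(friend) for friend in friends}
    for target in sorted(targets):
        queue.append((target, user_id, text))  # no global transaction involved


def dispatch(shards, queue):
    """Stand-in for the dispatcher: each delivery is its own local update."""
    while queue:
        target, user_id, text = queue.popleft()
        shards[target].friend_status[user_id] = text


shards = [Shard() for _ in SHARD_UPPER_BOUNDS]
queue = deque()
change_status(shards, queue, 1024, friends=[17, 2500, 2600, 5500],
              text="at the conference")
pending = len(queue)   # messages waiting: friends' shards are still stale here
dispatch(shards, queue)
```

Between `change_status` and `dispatch` the friends' shards still show the old state, which is the eventual consistency the slides accept in exchange for avoiding global transactions.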
7. Many Large Scale Customers using Similar Patterns
Patterns
Sharding and reliable messaging
Sharding and fan-out query layer
Caching layer
Customer Examples
Social Networking: Facebook, MySpace, etc
Online electronic stores (cannot give names)
Travel reservation systems (e.g. Choice International)
MSN Casual Gaming
etc.
8. Lessons Learned from these Scenarios
Require high availability
Be able to scale out
Functional and Data Partitioning Architecture
Provide scale-out processing
Be able to deal with failures
Be able to quickly grow and change
Elastic scale
Flexible, open schema
Multi-version schema support
Move better support for these patterns into the Data
Platform!
9. What is NoSQL about?
NoSQL = operational and developer agility at low CapEx and OpEx!
Low Cost
Free Software and Support
Scale CapEx cost below customer growth rate
Web friendly developer model and tool chain, Easy to use
Processing Paradigms
High Availability
Data and Processing Scale-out
Performance
Tunable/Eventual Consistency
Data Model Paradigms
Data first: Flexible Schema
Low-impedance mismatch between programming and data model
From devices, over OLTP Web 2.0 applications to BigData Analytics
10. Data Models
Data Model | Example Stores
Simple Key-Value Pairs | Memcache, Redis, Dynamo, Voldemort, LevelDB, Azure Caching
Wide Sparse Column Sets | HyperTable, Big Table, Cassandra, HBASE, Hyperbase, Amazon DynamoDB, Windows Azure Tables, SQL Server/Azure Sparse columns
BLOBs | Amazon S3, Oracle Berkeley NoSQL, Windows Azure Blob Store, SQL Server RBS/FileTable
JSON Documents | MongoDB, CouchBase, Riak, RavenDB
Graph | Neo4J, GraphDB, HypergraphDB, Stig, Intellidimension
Objects and XML Documents | Versant, Oracle Berkeley NoSQL, MarkLogic, existDB, EMC HiveDB, SQL Server/Azure, Oracle, IBM DB2
Extended Relational | Oracle, EMC SQLFire, IBM DB2, MySQL, Postgres, SQL Server/Azure/Parallel DW
11. Operational Agility
You want:
Availability of service (scalability)
Global consistency
Network Partition Tolerance
You can only get 2 of 3 (CAP Theorem)
In Brave New World:
Online businesses need availability
It is distributed, because it is big
thus Network Partitioning is unavoidable
Hence global consistency must be relaxed
→ BASE vs ACID
12. BASE vs ACID Consistency
ACID:
Atomicity, Consistency, Isolation, Durability
Full Serializability provides all 4
Distributed transactions providing all 4 limit
service availability, throughput and scalability
BASE: Basically Available, Soft state, Eventual
consistency
Relaxes ACID properties to increase
availability, throughput and scalability
Replica consistency:
Impacts recoverability
Cross-node consistency:
Impacts globally consistent view of the world
[Diagram: two Primary nodes, each with Replicas]
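Replica consistency under BASE can be illustrated with a toy primary and replica. This is a minimal sketch assuming asynchronous log shipping; the classes `Primary` and `Replica` and the method `catch_up` are hypothetical names and do not correspond to any SQL Server or SQL Azure API.

```python
class Primary:
    """Accepts writes and keeps an ordered log of them."""

    def __init__(self):
        self.data = {}
        self.log = []  # committed writes, in commit order

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Applies the primary's log later, so reads may be stale in between."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def catch_up(self, primary, max_entries=None):
        """Apply up to max_entries pending log entries (all of them if None)."""
        end = len(primary.log)
        if max_entries is not None:
            end = min(end, self.applied + max_entries)
        for key, value in primary.log[self.applied:end]:
            self.data[key] = value
        self.applied = end


primary = Primary()
replica = Replica()
primary.write("status:1024", "hello")
primary.write("status:1024", "at the conference")

stale_read = replica.data.get("status:1024")    # replica has seen nothing yet
replica.catch_up(primary, max_entries=1)
partial_read = replica.data.get("status:1024")  # an older committed value
replica.catch_up(primary)
final_read = replica.data.get("status:1024")    # converged with the primary
```

The replica never shows a value that was not committed on the primary; it only shows it late. How much lag an application can tolerate is the "tunable" part of tunable consistency.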
13. Operational Agility
Performance and Scale
Automate management lifecycle (or fail)
Simple deployment lifecycle
No DB or OS Admin telling me what to do
14. Developer Agility
Code First and revise quickly
Application-model first (before database)
Flexible open data models
You don’t know exactly what you are looking for
Lower Pain of adoption and maintenance
No DB or OS Admin telling me what to do
15. NoSQL and BigData: Two sides of the same coin
BigData:
Origin: large unstructured data processing
(sensor data, scientific research, web stream analysis)
Analytics focused (“new” OLAP, Map-Reduce, Hadoop)
Scale-out data and processing paradigm at low cost
NoSQL:
Origin: developing agile, scalable web applications
Realtime customer transaction focused (“new” OLTP)
Scale-out data and processing paradigm with flexible
data model at low cost
Both use many of the same paradigms
16. The Web 2.0 Business Architecture
Online Business Application
Attract Individual Consumers:
- Provide interesting service
- Provide mobility
- Provide social
Monetize Individual:
- Upsell service
- VIP
- Speed
- Extra Capabilities
Monetize the Social:
- Improve individual experience
- Re-sell Aggregate Data (e.g., Advertisers)
17. Scale-Out Data PLATFORM Architecture
OLTP Workloads
- Highly Available
- High Scale
- High Flexibility
- mostly touching 1 to low number of shards
Traditional OLAP Workloads
- known schema
- Data warehouse, “Star joins”
Dynamic OLAP Workloads
- 3Vs (Volume, Velocity, Variety)
- Exploratory
- Scale-out queries, often using eventually consistent scale-out frameworks like Hadoop
[Diagram: Primary Shards, each with Readable Replicas; further labels: Copy, Query]
18. What does SQL Server provide today?
Scale-programming models
Service Broker provides:
Functional, service-oriented architecture
Scale out on demand
Async reliable messaging provides for true eventual consistency
SQL Azure Federations provides Sharding support
Distributed Queries
SQL Server Parallel Data Warehouse
Programmer Agility
XML, XQuery for XML documents
FileTable for documents (but what is equivalent solution in the cloud?)
Open Schema: Sparse Columns and column sets (but still schema first)
CLR extensibility, but
No indexing, bad cost-models
Difficult to deploy (and DB Admins often do not allow it!)
Failure Resilience
SQL Azure has local automatic HA, self-healing
Rich Services
Semantic Extraction and Similarity Search in SQL Server 2012
DB/OS Admin “interference”
SQL Azure: Self-maintaining and Self-provisioning
19. Introducing SQL Azure Federations
Provides Data Partitioning/Sharding
at the Data Platform
Enables applications to build elastic
scale-out applications
Provides non-blocking SPLIT/DROP for
shards (MERGE to come later)
Auto-connect to right shard based on
sharding key value
Provides SPLIT resilient query mode
20. SQL Azure Federation Concepts
Federation
Represents the data being sharded
Federation Root
Database that logically houses federations, contains federation meta data
Federation Key
Value that determines the routing of a piece of data (defines a Federation Distribution)
Federation Member (aka Shard)
Physical container for a set of federated tables of a specific key range and reference tables
Atomic Unit (AU)
All rows with the same federation key value: always together!
Federated Table
Table that contains only atomic units for the member’s key range
Reference Table
Non-sharded table
[Diagram: Sharded Application -> Connection Gateway -> Azure DB with Federation Root (Federation Directories, Federation Users, Federation Distributions, …) -> Federation “Orders_Fed” (Federation Key: CustomerID) with three members:
Member: PK [min, 100) with AUs PK=5, PK=25, PK=35
Member: PK [100, 488) with AUs PK=105, PK=235, PK=365
Member: PK [488, max) with AUs PK=555, PK=2545, PK=3565]
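The routing and SPLIT concepts above can be modeled in memory as follows. This is a toy sketch, not the SQL Azure Federations API: the class `Federation` and its methods `member_for`, `insert` and `split` are hypothetical, and the split point 250 is an arbitrary choice. The member boundaries (100, 488) and the key values are the ones shown in the diagram.

```python
import bisect


class Federation:
    """Toy model: members are contiguous half-open key ranges [low, high)."""

    def __init__(self, boundaries):
        # Split points between members; [100, 488] gives three members:
        # [min, 100), [100, 488), [488, max).
        self.boundaries = sorted(boundaries)
        # One dict per member: federation key -> rows of that atomic unit.
        self.members = [{} for _ in range(len(self.boundaries) + 1)]

    def member_for(self, key):
        """Route a federation key value to the index of its member."""
        return bisect.bisect_right(self.boundaries, key)

    def insert(self, key, row):
        """All rows with the same key land in the same atomic unit."""
        self.members[self.member_for(key)].setdefault(key, []).append(row)

    def split(self, at):
        """Split the member containing `at` into [low, at) and [at, high)."""
        if at in self.boundaries:
            raise ValueError(f"{at} is already a member boundary")
        idx = self.member_for(at)
        old = self.members[idx]
        low = {k: v for k, v in old.items() if k < at}
        high = {k: v for k, v in old.items() if k >= at}
        self.boundaries.insert(idx, at)
        self.members[idx:idx + 1] = [low, high]


fed = Federation([100, 488])
for key in (5, 25, 35, 105, 235, 365, 555, 2545, 3565):
    fed.insert(key, {"CustomerID": key, "order": "example"})

fed.split(250)  # the middle member becomes [100, 250) and [250, 488)
```

Atomic units are never divided by a split: every row for a given key moves to the same new member. The real platform performs the split online and without blocking, which this in-memory model does not attempt to show.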
22. SQL Azure: A Not Only SQL Data Platform
SQL Azure adds support for NoSQL paradigms in the data platform:
No CapEx, Low OpEx (which should/will be even lower)
High-Availability (each DB has two replicas)
Sharding support with federations:
Data platform provides online SPLIT/DROP
Filtered connection to provide split resilient programming model
Flexible Data Models:
XML support
Sparse columns/Column sets
More to come in the future…
More scale and tunable HA (to support OLTP/OLAP model)
Taking Federations further (orthogonality, merge, fanout)
Integration with Hadoop eco-system
More data-first (data-driven columnsets, JSON)
23. Call to Action
Download the Presentation from:
http://www.slideshare.net/MichaelRys/presentations
Fill out SQL Azure Federation Survey:
http://connect.microsoft.com/BusinessPlatform/Survey/Survey.aspx?SurveyID=13625
24. Related Content
Related Whitepapers and Presentations:
CACM: Scalable SQL: http://cacm.acm.org/magazines/2011/6/108663-scalable-sql
NoSQL and the Windows Azure Platform:
http://download.microsoft.com/download/9/E/9/9E9F240D-0EB6-472E-B4DE-6D9FCBB505DD/Windows%20Azure%20No%20SQL%20White%20Paper.pdf
SQL Federation blog: http://blogs.msdn.com/b/cbiyikoglu/archive/2011/03/03/nosql-genes-in-sql-azure-federations.aspx
Windows Gaming Experience Case Study:
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000008310
NoSQL Presentations: http://www.slideshare.net/MichaelRys/presentations
Contact me:
mrys@microsoft.com
@SQLServerMike
http://sqlblog.com/blogs/michael_rys/default.aspx
Editor's Notes
Example MySpace architecture:
Service Dispatcher: coordination point between all SQL Servers
- Centralizes route management
- Avoids routes explosion
- Load-balanced across 30 SQL Servers
- Messages are sent randomly to these
- Enables multicast/broadcast functionality
- Supports destination lists and wildcards, e.g. [DB1, DB3, DB4], DB%
- 18,000 ~2k msgs/sec per dispatcher SQL Server
MyDB sends a message with my status change and a target list specifying the DBs that store my friends’ data.
The Service Dispatcher forwards the message to these DBs.
Each DB processes the message, updating my status in a partitioned table.
Example MSN Casual Gaming:
- ~2 Million users at launch
- ~86 Million services requests/day
- 135 Windows Azure Data Services Hosting VMs
- ca. 18K connections in Connection Pools, this could grow with traffic
- Ca. 1200 SQL Azure requests/second spread across all partitions during peak load
- ~90% reads vs 10% writes (this varies per storage type)
- ~200 bytes of storage per user
- ~20% of database storage is currently used, but expect this to grow
- Sharded over 400 SQL Azure Databases
Require high availability
Be able to scale out:
- Functional and Data Partitioning Architecture
Provide scale-out processing:
- Function shipping
- Fanout and Map/Reduce processing
Be able to deal with failures:
- Quorum
- Retries
- Eventual Consistency (similar to Read-consistent Snapshot Isolation)
Be able to quickly grow and change:
- Elastic scale
- Flexible, open schema
- Multi-version schema support
Move better support for these patterns into the Data Platform!
Note: Big-sized companies invest resources in building these platforms instead of using existing relational platforms!
Low Cost
- Free Open Source Stores, Community Support
- Scale cost below customer growth rate
Web friendly developer model and tool chain, Easy to use
Processing Paradigms
- High Availability (scalable Replication, Fast Failover, DR/GeoDR, tunable latency)
- Scale-out (Sharding, Map-Reduce, Elasticity)
- Performance (tuned for workloads, Caching, co-located compute with partitioned state)
- Tunable/Eventual Consistency
Data Model Paradigms
- Data first: Flexible Schema
- Low-impedance mismatch between programming and data model:
  - Key-Documents and Objects (BLOBS, JSON, XML, POJO)
  - Key-Wide Sparse Column Sets
  - Graphs (e.g., RDF)
Performance and Scale:
- Map/Reduce Patterns
- Eventual consistency (trade-off due to CAP)
- Sharding
- Caching
Automate management Lifecycle:
- Elastic Scale on demand (no need to pay for resources until needed)
- Automatic Fail-over
- Scalable Schema version rollout
- Perf troubleshooting
- Auto alerting
- Auto loadbalancing
- Auto resourcing (e.g., auto splits based on policies)
- Declarative policy-based management
Code First and revise quickly
- Working software over comprehensive documentation
- Responding to change over following a plan
Application-model first (before database)
- Dictates the data model and queries
Flexible data models
- No a priori modeling: Data first, schema later/Open Schema
- Key/Value stores
- Reduced impedance mismatch: JSON, XML, YAML
You don’t know exactly what you are looking for
- Map/Reduce for adhoc analysis
- Provide Search across all your data instead of just query
Lower Pain of adoption and maintenance
- From code to deployment & “monetization” of data, services, apps and tenants
- Rich Services out of the Box
- Data and services mashup
- Easy troubleshooting of deployed apps
No DB or OS Admin telling me what to do
Sharded GamesInfo table using SQL Azure Federations
Use a C# library that implements a Map/Reduce processor on top of SQL Azure Federations
Mapper and Reducer are specified using SQL
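The fan-out Map/Reduce pattern over shards can be sketched as follows. This is an assumption-laden illustration and not the C# library mentioned in the note: the mapper and reducer here are plain Python functions rather than SQL, the shards are in-memory lists, and the row layout `(user_id, game, score)` is invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards of a GamesInfo-like table; each row is (user_id, game, score).
SHARDS = [
    [(1, "chess", 10), (2, "solitaire", 7)],
    [(105, "chess", 4)],
    [(555, "solitaire", 3), (556, "chess", 1)],
]


def mapper(rows):
    """Runs once per shard: partial sum of scores per game."""
    partial = Counter()
    for _user_id, game, score in rows:
        partial[game] += score
    return partial


def reducer(partials):
    """Runs once, centrally: merges the per-shard partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def fan_out(shards):
    """Send the mapper to every shard in parallel, then reduce the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return reducer(pool.map(mapper, shards))


totals = fan_out(SHARDS)
```

Only the small per-shard partial results travel to the reducer, which is the point of shipping the function to the data instead of pulling all rows to one place.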