1. Scaling Data On Public Clouds
Liran Zelkha, Founder
Liran.zelkha@scalebase.com
2. About Us
• ScaleBase is a new startup targeting the
database-as-a-service market (DBaaS)
• We offer unlimited database scalability and
availability using our Database Load Balancer
• We currently run in beta mode – contact me if
you want to join
3. Problem Of Data
• Flickr just hit 5B pictures
• Facebook > 0.5B users
• Farmville has more monthly players than the
population of France
4. Monday's Keynote
• More data
• More users
• More complex actions
• Shorter response times
5. Scalability Pain
[Chart: infrastructure cost ($) over time. Curves: predicted demand, actual demand, traditional hardware, automated virtualization. Annotations: large capital expenditure, opportunity cost, "You just lost customers".]
http://media.amazonwebservices.com/pdf/IBMWebinarDeck_Final.pdf
6. CAP vs. ACID
• CAP = Consistency, Availability, Partition
Tolerance
• ACID = Atomicity, Consistency, Isolation,
Durability
• Atomicity – a chain of actions is treated as one
inseparable action
• Isolation – consistent query snapshots, reads
across writes; the SQL standard defines 4 isolation levels
7. ScaleBase
Database Scaling In A Box
• The first truly elastic,
fault tolerant SQL
based data layer
• Enables linear scaling
of any SQL based
database
[Diagram labels: Applications, Legacy clients, ScaleBase, Database instances]
11. Brewer's (CAP) Theorem
• It is impossible for a distributed computer
system to simultaneously provide all three of
the following guarantees:
– Consistency (all nodes see the same data at the
same time)
– Availability (node failures do not prevent survivors
from continuing to operate)
– Partition Tolerance (the system continues to
operate despite arbitrary message loss)
http://en.wikipedia.org/wiki/CAP_theorem
13. Reading More On CAP
• This is an excellent read, and some of my
samples are from this blog
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
15. Databases And CAP
• ACID – Consistency
• Availability – tons of solutions, most of them
not cloud oriented
– Oracle RAC
– MySQL Proxy
– Etc.
– Replication based solutions can solve at least read
availability and scalability (see Azure SQL)
17. So Where Is The Problem?
• Scaling problems (usually write but also read)
• Schema change problems
• BigData problems
18. Scaling Up
• Issues with scaling up when the dataset is just
too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database
solutions
• Known as ‘scaling out’ or ‘horizontal scaling’
• Different approaches include:
– Master-slave
– Sharding
19. Scaling RDBMS – Master/Slave
• Master-Slave
– All writes are written to the master. All reads
performed against the replicated slave databases
– Critical reads may be incorrect as writes may not
have been propagated down
– Large data sets can pose problems as master
needs to duplicate data to slaves
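The stale-read problem above can be seen in a toy model. This is only a sketch: ReplicatedStore and its methods are made-up names, the "databases" are in-memory maps, and replicate() stands in for asynchronous replication catching up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of master/slave replication (not a real database driver).
class ReplicatedStore {
    private final Map<String, String> master = new HashMap<>();
    private final List<Map<String, String>> replicas = new ArrayList<>();
    private int nextReplica = 0;

    ReplicatedStore(int replicaCount) {
        if (replicaCount < 1) throw new IllegalArgumentException("need at least one replica");
        for (int i = 0; i < replicaCount; i++) replicas.add(new HashMap<>());
    }

    // All writes go to the master only.
    void write(String key, String value) { master.put(key, value); }

    // Reads are spread round-robin over the replicas and may be stale.
    String read(String key) {
        Map<String, String> replica = replicas.get(nextReplica);
        nextReplica = (nextReplica + 1) % replicas.size();
        return replica.get(key);
    }

    // A critical read bypasses the replicas and asks the master.
    String readFromMaster(String key) { return master.get(key); }

    // Stands in for asynchronous replication catching up.
    void replicate() {
        for (Map<String, String> replica : replicas) {
            replica.clear();
            replica.putAll(master);
        }
    }

    public static void main(String[] args) {
        ReplicatedStore store = new ReplicatedStore(2);
        store.write("user:1", "Steven");
        System.out.println(store.read("user:1"));           // replica has not caught up yet
        System.out.println(store.readFromMaster("user:1")); // master already has the write
        store.replicate();
        System.out.println(store.read("user:1"));           // replica now has it too
    }
}
```

Sending critical reads to the master avoids the stale answer, at the price of putting read load back on the one machine that also takes every write.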
20. Scaling RDBMS - Sharding
• Partition or sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-
aware
– Can no longer have relationships/joins across
partitions
– Loss of referential integrity across shards
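What "partition-aware" means for application code, as a minimal sketch assuming hash sharding on a numeric key. ShardRouter and the JDBC URLs are hypothetical, not taken from any product.

```java
import java.util.List;

// Hypothetical application-side router: picks the shard that owns a key.
class ShardRouter {
    private final List<String> shardUrls;

    ShardRouter(List<String> shardUrls) {
        if (shardUrls.isEmpty()) throw new IllegalArgumentException("need at least one shard");
        this.shardUrls = List.copyOf(shardUrls);
    }

    // Math.floorMod keeps the index non-negative even for negative keys.
    int shardIndex(long key) {
        return (int) Math.floorMod(key, (long) shardUrls.size());
    }

    // The application must call this before every query to find the right database.
    String urlFor(long key) {
        return shardUrls.get(shardIndex(key));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of(
                "jdbc:mysql://shard0/app",   // illustrative URLs
                "jdbc:mysql://shard1/app",
                "jdbc:mysql://shard2/app"));
        System.out.println(router.urlFor(104));
    }
}
```

Every data access path in the application has to go through a lookup like this, which is why sharding by hand is not transparent.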
21. Other ways to scale RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– This involves de-normalizing data
• In-memory databases
23. NoSQL
• A term used to designate databases which
differ from classic relational databases in
some way. These data stores may not require
fixed table schemas, and usually
avoid join operations and typically scale
horizontally. Academics and papers typically
refer to these databases as structured storage,
a term which would include classic relational
databases as a subset.
http://en.wikipedia.org/wiki/NoSQL
24. NoSQL Types
• Key/Value
– A big hash table
– Examples: Voldemort, Amazon Dynamo
• Big Table
– Big table, column families
– Examples: HBase, Cassandra
• Document based
– Collections of collections
– Examples: CouchDB, MongoDB
• Each solves a different problem
26. Pros/Cons
• Pros:
– Performance
– BigData
– Most solutions are open source
– Data is replicated to nodes and is therefore fault-tolerant (partitioning)
– Don't require a schema
– Can scale up and down
• Cons:
– Code change
– Limited framework support
– Not ACID
– Eco system (BI, Backup)
– There is always a database at the backend
– Some APIs are just too simple
27. Amazon S3 Code Sample
AWSAuthConnection conn = new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey, secure, server, format);
Response response = conn.createBucket(bucketName, location, null);
final String text = "this is a test";
response = conn.put(bucketName, key, new S3Object(text.getBytes(), null), null);
28. Cassandra Code Sample
CassandraClient cl = pool.getClient();
KeySpace ks = cl.getKeySpace("Keyspace1");
// insert value
ColumnPath cp = new ColumnPath("Standard1", null,
    "testInsertAndGetAndRemove".getBytes("utf-8"));
for (int i = 0; i < 100; i++) {
    ks.insert("testInsertAndGetAndRemove_" + i, cp,
        ("testInsertAndGetAndRemove_value_" + i).getBytes("utf-8"));
}
// get value
for (int i = 0; i < 100; i++) {
    Column col = ks.getColumn("testInsertAndGetAndRemove_" + i, cp);
    String value = new String(col.getValue(), "utf-8");
}
// remove value
for (int i = 0; i < 100; i++) {
    ks.remove("testInsertAndGetAndRemove_" + i, cp);
}
29. Cassandra Code Sample – Cont’
try {
    ks.remove("testInsertAndGetAndRemove_not_exist", cp);
} catch (Exception e) {
    fail("removing a non-existent row should not throw exceptions");
}
// get already removed value
for (int i = 0; i < 100; i++) {
    try {
        Column col = ks.getColumn("testInsertAndGetAndRemove_" + i, cp);
        fail("the value should already have been deleted");
    } catch (NotFoundException e) {
        // expected: the row was removed above
    } catch (Exception e) {
        fail("threw another exception, should be NotFoundException: " + e.toString());
    }
}
pool.releaseClient(cl);
pool.close();
30. Cassandra Statistics
• Facebook Search
• MySQL > 50 GB Data
– Writes Average : ~300 ms
– Reads Average : ~350 ms
• Rewritten with Cassandra > 50 GB Data
– Writes Average : 0.12 ms
– Reads Average : 15 ms
31. MongoDB
Mongo m = new Mongo();
DB db = m.getDB( "mydb" );
Set<String> colls = db.getCollectionNames();
for (String s : colls) {
System.out.println(s);
}
32. MongoDB – Cont’
DBCollection coll = db.getCollection("testCollection"); // collection name is illustrative
BasicDBObject doc = new BasicDBObject();
doc.put("name", "MongoDB");
doc.put("type", "database");
doc.put("count", 1);
BasicDBObject info = new BasicDBObject();
info.put("x", 203);
info.put("y", 102);
doc.put("info", info);
coll.insert(doc);
34. Data SLA
• There is no golden hammer
– See http://sourcemaking.com/antipatterns/golden-hammer
• Choose your tool wisely, based on what you need
• Usually
– Start with RDBMS (shortest time to market, which is what we
really care about)
– When scale issues occur – start moving to NoSQL
based on your needs
• You can get Data Scalability in the cloud – just
think before you code!!!
36. How ScaleBase Works
• ScaleBase takes an application
database and splits its data
across multiple, separate
instances (a technique called
sharding)
• Queries and DML are either:
– Directed to the correct instance, or
– Executed simultaneously across
several instances
• Results are aggregated and
returned to the original
application
[Diagram labels: Application, ScaleBase, Database instances]
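The two execution paths described on this slide can be sketched with a toy proxy. This is not ScaleBase's implementation: ShardingProxy is a made-up name, each "instance" is an in-memory map, and the split rule (key modulo the number of instances) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model of a sharding proxy; each "instance" is an in-memory map of id -> name.
class ShardingProxy {
    private final List<Map<Integer, String>> instances;

    ShardingProxy(List<Map<Integer, String>> instances) { this.instances = instances; }

    // Assumed split rule: key modulo the number of instances.
    private Map<Integer, String> instanceFor(int id) {
        return instances.get(Math.floorMod(id, instances.size()));
    }

    // DML on a single key is directed to the one instance that owns it.
    void insert(int id, String name) { instanceFor(id).put(id, name); }

    // A lookup by split key is also directed to a single instance.
    String findById(int id) { return instanceFor(id).get(id); }

    // A query that does not filter on the split key runs on every instance in
    // parallel, and the partial results are aggregated into one sorted list.
    List<String> findByName(Predicate<String> filter) {
        return instances.parallelStream()
                .flatMap(shard -> shard.values().stream().filter(filter))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<Integer, String>> instances = new ArrayList<>();
        for (int i = 0; i < 3; i++) instances.add(new HashMap<>());
        ShardingProxy proxy = new ShardingProxy(instances);
        proxy.insert(100, "King");
        proxy.insert(101, "Kochhar");
        proxy.insert(104, "Ernst");
        System.out.println(proxy.findById(104));
        System.out.println(proxy.findByName(name -> name.startsWith("K")));
    }
}
```

The caller never sees which instance answered, which is the point of putting the routing in a proxy instead of in the application.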
37. Example
Original table:
ID   First name  Last name
100  Steven      King
101  Neena       Kochhar
102  Lex         De Haan
103  Alexander   Hunold
104  Bruce       Ernst
105  David       Austin
106  Valli       Pataballa

Split across three database instances:
ID   First name  Last name
102  Lex         De Haan
105  David       Austin

ID   First name  Last name
100  Steven      King
103  Alexander   Hunold
106  Valli       Pataballa

ID   First name  Last name
101  Neena       Kochhar
104  Bruce       Ernst
38. ScaleBase Supports
• 3 table types
– Master
– Global
– Split
• Splitting according to Hash, List, Range
• Rebalance, addition and removal of machines
• Instance replication backup: Shadow and Master
• Full consistent 2-Phase Commit
• Joins, Foreign Keys, Subqueries
• DML, DDL, Batch updates, Prepared Statements
• Aggregations, Group By, Order By, Auto Numbering,
Timestamps
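The three split methods named above (Hash, List, Range) can be sketched as three small policies. All names here are hypothetical and this is not ScaleBase's API; it only shows how each method maps a split key to an instance number.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash, list and range splitting.
class SplitPolicies {
    interface SplitPolicy<K> {
        int instanceFor(K key);
    }

    // Hash: spread keys evenly by hash code.
    static final class HashSplit implements SplitPolicy<Object> {
        private final int instances;
        HashSplit(int instances) { this.instances = instances; }
        public int instanceFor(Object key) {
            return Math.floorMod(key.hashCode(), instances);
        }
    }

    // List: each key value is explicitly assigned to an instance.
    static final class ListSplit implements SplitPolicy<String> {
        private final Map<String, Integer> assignments;
        ListSplit(Map<String, Integer> assignments) { this.assignments = assignments; }
        public int instanceFor(String key) {
            Integer instance = assignments.get(key);
            if (instance == null) throw new IllegalArgumentException("no instance listed for " + key);
            return instance;
        }
    }

    // Range: maps the inclusive lower bound of each range to its instance.
    static final class RangeSplit implements SplitPolicy<Integer> {
        private final TreeMap<Integer, Integer> lowerBounds;
        RangeSplit(TreeMap<Integer, Integer> lowerBounds) { this.lowerBounds = lowerBounds; }
        public int instanceFor(Integer key) {
            Map.Entry<Integer, Integer> range = lowerBounds.floorEntry(key);
            if (range == null) throw new IllegalArgumentException("key below first range: " + key);
            return range.getValue();
        }
    }

    public static void main(String[] args) {
        System.out.println(new HashSplit(3).instanceFor(104));
        System.out.println(new ListSplit(Map.of("BRAZIL", 0, "FRANCE", 1)).instanceFor("BRAZIL"));
        TreeMap<Integer, Integer> bounds = new TreeMap<>();
        bounds.put(0, 0);      // ids 0..999 on instance 0
        bounds.put(1000, 1);   // ids 1000 and up on instance 1
        System.out.println(new RangeSplit(bounds).instanceFor(1500));
    }
}
```

Hash spreads load evenly but makes rebalancing move many keys; list and range keep related keys together, which helps range queries but can create hot instances.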
39. Sample Code
SELECT site_owner_id, count(*) FROM google.user_clicks
WHERE country = 'BRAZIL'
GROUP BY site_owner_id
• site_owner_id is the split key
• Perform the query on all DBs
• Simple aggregation of results
• No Code Change
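Why the aggregation is "simple" here, as a sketch: if site_owner_id is the split key, every row for a given owner presumably lives on one instance, so the per-instance GROUP BY rows never overlap and can just be put together. SplitKeyGroupBy is a made-up name and the maps stand in for per-instance result sets.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitKeyGroupBy {
    // Each map is one instance's answer to the GROUP BY: site_owner_id -> count(*).
    static Map<Integer, Long> combine(List<Map<Integer, Long>> perInstance) {
        Map<Integer, Long> combined = new LinkedHashMap<>();
        for (Map<Integer, Long> partial : perInstance) {
            for (Map.Entry<Integer, Long> row : partial.entrySet()) {
                // A split key lives on exactly one instance, so a repeat would
                // indicate a routing bug rather than something to add up.
                if (combined.put(row.getKey(), row.getValue()) != null) {
                    throw new IllegalStateException("site_owner_id on two instances: " + row.getKey());
                }
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        // Illustrative per-instance results, not real data.
        System.out.println(combine(List.of(Map.of(1, 10L), Map.of(2, 5L))));
    }
}
```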
40. Sample Code
SELECT country, count(*) FROM google.user_clicks
GROUP BY country
• Perform the query on all DBs
• Aggregation of the aggregations
• No Code Change
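"Aggregation of the aggregations" as a sketch: country is not the split key, so the same country can come back from several instances and the partial counts have to be summed. CrossShardGroupBy is a made-up name and the numbers are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CrossShardGroupBy {
    // Each map is one instance's answer: country -> count(*) on that instance.
    static Map<String, Long> mergeCounts(List<Map<String, Long>> perInstance) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : perInstance) {
            // merge() inserts the count or adds it to the running total.
            partial.forEach((country, count) -> totals.merge(country, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(mergeCounts(List.of(
                Map.of("BRAZIL", 3L, "FRANCE", 1L),
                Map.of("BRAZIL", 2L))));
    }
}
```

Summing works for count and sum; an average cannot be merged this way and needs each instance to return its sum and count separately.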
41. Sample Code
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO emp VALUES(?,?,?,?,?)");
for (int i = 0; i < 10; i++) {
pstmt.setInt(1, 300 + i);
pstmt.setString(2, "Something" + i);
pstmt.setDate(3, new Date(System.currentTimeMillis()));
pstmt.setInt(4, i);
pstmt.setLong(5, i);
pstmt.addBatch();
}
int[] result = pstmt.executeBatch();
• Is split key dynamic or static?
• Each command is added to the correct DB,
execution is on all relevant DBs
• No Code Change
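How one JDBC batch could be broken into per-instance batches, as a sketch. It assumes the first column is the split key and that rows are hashed by key modulo the instance count; BatchSplitter is a made-up name, not a ScaleBase class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BatchSplitter {
    // Groups the split-key values of a batch by the instance that owns them,
    // keeping the original order inside each per-instance batch.
    static Map<Integer, List<Integer>> route(List<Integer> splitKeys, int instances) {
        Map<Integer, List<Integer>> batches = new TreeMap<>();
        for (int key : splitKeys) {
            int target = Math.floorMod(key, instances);
            batches.computeIfAbsent(target, t -> new ArrayList<>()).add(key);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) keys.add(300 + i);   // same ids as the slide's loop
        // Only instances that received rows appear in the result, so
        // execution happens on the relevant instances only.
        System.out.println(route(keys, 4));
    }
}
```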
42. ScaleBase Solution
• Elastic SQL database scaling and high-availability
solution
– Complete
– Transparent
– Super scalable
– Out of the box
– Non-intrusive
– Flexible
– Manageable
43. With ScaleBase
• Pay much less for hardware and Database licenses
• Get more power, better data spreading and better
availability
• Real linear scalability
• Go for real grid, cloud and virtualization
• ScaleBase:
– Is NOT an RDBMS. It works with any secure, highly available,
archivable RDBMS (Oracle, DB2, MySQL, any!)
– Does NOT require schema structure modifications
– Does NOT require additional sophisticated hardware
44. Moving To ScaleBase
• Implementing ScaleBase is done in minutes
• Just direct your application to your ScaleBase
instance
• Target ScaleBase to your original database and
target database instances
• ScaleBase will automatically migrate your schema
and data to the new servers
• After all data is transferred ScaleBase will start
working with target database instances, giving
you linear scalability – with no down time!
45. Where ScaleBase Fits In
• Cloud databases
– Use SQL databases in the cloud, and get Scale Out
features and high availability
• High scale applications
– Use your existing code, without migrating to
NoSQL, and get unlimited SQL scalability
47. Public Cloud
• Scenario:
– A startup developing a complex iPhone application
running on EC2
– Requires a highly scalable option for an SQL-based database
• Problem:
– Amazon RDS offers only a Scale Up option, no Scale Out
• Solution:
– Customer switched their Amazon RDS-based
application to ScaleBase on EC2
– Gained transparent, linear, scaling
– Running 4 RDS instances behind ScaleBase
48. Private Cloud
• Scenario:
– A company selling devices that ‘ping’ home every 5 minutes
– 8 digit number of devices sold
• Problem:
– Evaluated MySQL, Oracle - no single machine can support
the required number of devices
– Clustering options too expensive, limited scalability
• Solution:
– Customer moved to ScaleBase with no code changes
– Gained linear scaling. Runs 4 MySQL databases behind
ScaleBase