Amazon Aurora is a fully managed relational database service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. With Aurora, we've completely reimagined how databases are built for the cloud, providing you higher performance, availability, and durability than previously possible. In this session, we dive deep into the architectural details of Aurora with MySQL compatibility, and we review recent innovations, such as parallel query, backtrack, serverless, and multi-master. We also share best practices for utilizing the power of relational databases at cloud scale.
[Slide 12: MySQL vs. Aurora I/O profile. Left, MYSQL WITH REPLICA: each transaction issues five types of write (1 binlog, 2 data, 3 double-write, 4 log, 5 FRM files) to mirrored Amazon Elastic Block Store (EBS) volumes in AZ 1 and AZ 2, with async replication to the replica instance and backups to Amazon S3. Right, AMAZON AURORA: a primary instance in AZ 1 and replica instances in AZ 2 and AZ 3 issue only 4/6-quorum distributed log writes across the three AZs, with continuous backup to Amazon S3. MySQL I/O profile for a 30-min Sysbench run: 0.78MM transactions, 7.4 I/Os per transaction. Aurora I/O profile for the same run: 27MM transactions (35x MORE), 0.95 I/Os per transaction (7.7x LESS).]
[Slide 25: Existing multi-master solutions. Three designs: (1) a distributed lock manager over shared storage, with the full SQL/transactions/caching/logging stack on every node: heavyweight, pessimistic synchronization and negative scaling (e.g., Oracle RAC, DB2 PureScale, Sybase); (2) global ordering with read-write sets, where a global ordering unit sequences transactions T1, T2, T3 … T100 and becomes a scaling bottleneck (e.g., Galera, TangoDB, FaunaDB); (3) a Paxos leader with 2PC per data range (ranges #1 through #5, each with a leader across nodes M1/M2/M3): a heavyweight consensus protocol that suffers hot partitions and struggles with cross-partition queries (e.g., Spanner, CockroachDB, Ignite).]
[Slide 32: Aurora Multi-Master Global Reads - How? A client issues a read to N1. N1 sends a hello to N2 and N3, which reply with timestamps T2 and T3; N1 waits for replication over the shared distributed storage volume to catch up to T2 and T3, then performs the read at vector clock T = (T1, T2, T3). Globally consistent results; no waits on the write path; latency is added ONLY to globally consistent reads; configurable per session.]
(2 min) HELLO everyone
It’s my PRIVILEGE to be with ALL of you, and a WARM welcome to DAY 1 at re:Invent 2018.
You know, OVER 7 years ago, WHEN we started building AURORA, WE had a simple mission: we WANTED ANY person, ANYWHERE in the world, to be able to RUN and MANAGE databases, AND all they would need is their BUSINESS APPLICATION. They wouldn’t need to worry about PROVISIONING, wouldn’t need HIGHLY skilled operators MANAGING their databases, and wouldn’t have to make TRADEOFFS between PERFORMANCE, AVAILABILITY, DURABILITY, and COST. AND by doing so, we BELIEVE we can enable EVERYONE in the world to RE-IMAGINE databases in the cloud.
Hi, my name is Kamal Gupta, and I am a senior engineering manager at AWS.
TODAY, in the NEXT hour, WE are going to SHOW you not only the NEW INNOVATIONS in Aurora but also HOW we did it: how my team is adding new capabilities like multi-master, parallel query, serverless, and global databases to the Aurora offering.
Joining me today, I am EXCITED to have Sirish Chandrasekaran, principal product manager at AWS
Let’s dive deep into Aurora MySQL.
(1 min 15 sec) Our vision for Amazon relational database service is to offer you choices and recommendations so that you can decide what's best for your application.
On the one hand, AWS offers open-source engines for customers who like their simplicity and cost-effectiveness, but the problem is that they lack the enterprise-grade performance and reliability our customers need for mission-critical applications.
We also offer old-guard commercial database engines for customers who need enterprise-grade performance and reliability, even though they are quite expensive, with lock-in and punitive licensing terms.
One of the earliest pieces of feedback we got from our customers was to build something that combines the best of both worlds. And so we created Aurora for you.
(20 sec) With Aurora, you NO longer have to make those trade-offs. It provides commercial-grade performance, durability, and availability with the simplicity and cost-effectiveness of open-source solutions.
And it’s delivered to you as a managed service.
(20 sec) Here are some of the customers who have been using Aurora: Airbnb, Zynga, Hulu, Ancestry, Nasdaq. Some big names.
As you can see, Aurora continues to be the fastest-growing service in AWS history.
(20 seconds) So, with that intro, I will first talk about Performance, and then Availability and Manageability.
(20 sec) You know, when databases first came out, they looked something like this: a monolithic architecture in a single box.
With local storage, we were trading availability and durability for better performance.
(20 seconds) Over time, we decoupled storage from compute, which allowed us to scale, customize, and manage each layer independently, but the monolithic stack itself remained the same.
(30 seconds) And then we added more such boxes. As you can see, it’s the same SQL stack everywhere. Nothing changed!
Moreover, we need heavyweight distributed consensus for data replication, and it performs poorly because of multiple phases, multiple rounds, sync points, and so on.
(2.5 minutes) With Aurora, we made two big contributions:
We pushed the log applicator down to the storage => that allows us to construct pages from the logs themselves. This is really cool because we don’t have to write full pages anymore. So unlike traditional databases, which write both logs and pages, we just have to write logs. This means significantly less network I/O and fundamentally less work on the engine: you don’t need checkpointing anymore, and you don’t need page flushing or cache eviction either.
Instead of heavyweight distributed consensus for data replication, we use a 4/6 write quorum and local tracking. The reason we can avoid distributed consensus is that we exploit the monotonically increasing Log Sequence Numbers (LSNs) generated by the master, which order the writes. So the storage nodes (SNs) just accept the writes; there is no voting involved.
We are going to see both of these in action. As a result, YOU get significantly better write performance.
YOU get read scale-out, because the replicas share the same storage with the master.
YOU get AZ+1 failure tolerance => Aurora stores 6 copies, two copies per AZ. Even if, on top of background radiation, an entire AZ goes down, Aurora can handle it. No problem.
YOU get instant database redo recovery, because we don’t have to explicitly do anything at startup other than some math to find the point at which we crashed.
Overall: You NO longer have to make a trade-off between performance, availability, and durability.
(2 minutes) Let’s see how the log applicator works in action.
Here, we are running 4 transactions with a master and a replica. We have the storage at the bottom, with each log record replicated six ways.
Let’s say we commit a transaction T1. As you can see, all SNs and the replica received the changes. So if we try to read, both master and replica will get the page with the orange transaction.
Now, let’s say we commit T2/T3/T4. Note that the SN has already coalesced purple but left the blue and green log records on the side, because it can’t apply them yet: the replica’s clock is still at purple.
And if both master and replica try to read, they will get the right image. For the master, storage will apply the log records kept on the side, blue and green, on the fly.
And at some point, the replica’s clock will advance to green, and we can garbage-collect the remaining log records by coalescing the changes.
Hopefully you can see how we can construct pages from the logs themselves.
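To make this concrete, here is a toy sketch of the log-applicator idea; the names and record layout are invented for illustration, not Aurora's actual code. The storage node accepts only redo records and materializes a page on read by applying pending records on the fly, coalescing once every reader's clock has passed them:

```python
from dataclasses import dataclass, field

@dataclass
class LogRecord:
    lsn: int        # monotonically increasing sequence number from the master
    offset: int     # byte offset this record modifies within the page
    data: bytes     # new bytes at that offset

@dataclass
class PageStore:
    page: bytearray                               # last coalesced page image
    pending: list = field(default_factory=list)   # redo records not yet folded in

    def append(self, rec: LogRecord) -> None:
        self.pending.append(rec)   # accept the redo record; never a full-page write

    def read(self, read_lsn: int) -> bytes:
        # Materialize the page as of read_lsn by applying pending records on the fly.
        img = bytearray(self.page)
        for rec in sorted(self.pending, key=lambda r: r.lsn):
            if rec.lsn <= read_lsn:
                img[rec.offset:rec.offset + len(rec.data)] = rec.data
        return bytes(img)

    def coalesce(self, min_reader_lsn: int) -> None:
        # Once every reader's clock has passed min_reader_lsn, fold those
        # records into the base image and garbage-collect them.
        self.page = bytearray(self.read(min_reader_lsn))
        self.pending = [r for r in self.pending if r.lsn > min_reader_lsn]

store = PageStore(bytearray(8))
store.append(LogRecord(lsn=1, offset=0, data=b"T1"))  # the orange transaction
store.append(LogRecord(lsn=2, offset=4, data=b"T2"))  # blue: replica clock not there yet
print(store.read(read_lsn=1))     # view at LSN 1: only T1 applied
store.coalesce(min_reader_lsn=1)  # replica clock advanced past LSN 1
print(store.read(read_lsn=2))     # both applied
```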
(1 minute) Let’s see what benefits we got out of this.
Here is how the I/O profile looks for Aurora and MySQL.
1. On the left, we have MySQL on EBS. The thing to note is that it has to replicate all kinds of data. With Aurora, on the right, we ONLY have to replicate log records. As a result, we do 7.7x less I/O and 35x more work, despite the 6x amplification from keeping six copies.
2. The other thing to note is that steps 1, 3, and 4 on the left are synchronous and lead to jitter, but Aurora uses a 4/6 quorum, which is much more resilient to tail latency. We will see in a second why that matters for your applications.
(20 seconds) And so we ran a Sysbench workload on Aurora and MySQL, and we got an order of magnitude more writes and 2.5x more reads compared to stock MySQL running on EBS.
(10 seconds)
Here is another example, with bulk load plus indexing. Again, Aurora is 2.5x faster.
(1 minute) Let’s talk about read scale-out for your OLTP reads or analytics queries.
Here, on the left, we have MySQL’s native binlog-based replication, typically used for replication in the MySQL community. And on the right, we have Aurora physical replication.
- Unlike MySQL, which has to transfer full rows or statements, Aurora only transfers log records (nothing but the delta changes). And those are compressed, too.
- Unlike MySQL binlog replication, Aurora doesn’t need to write anything on the replica: no extra write I/O or storage involved.
- Also note that Aurora only needs to update pages in-cache, so there is no read I/O either. In fact, we only transfer what’s in the replica’s cache; we filter it out on the master itself. Even better.
(30 seconds)
You can see the comparison of a binlog replica with an Aurora physical replica. Both run on the same Aurora instance, same software, same hardware, to keep it an apples-to-apples comparison.
The binlog graph on the left is in seconds, and lag spiked to 5 minutes within the first 10 minutes under heavy load.
On the right we have the Aurora physical replica. As you can see, it consistently stays under 20ms for hours and hours under the same load.
(2 minutes) Let’s see how we do the write quorum and local tracking in action.
We have the same setup, with 4 transactions and storage at the bottom. There is a quorum tracker on the right. There are 4 waiting transactions, and none of them has committed yet.
In a traditional database, we keep the WAL sequential: buffer the writes and flush them sequentially. As soon as a write is flushed, we consider that transaction committed and ack back to the client. Aurora instead issues the writes to storage immediately, in parallel, and uses a write tracker to ack in the right order, only once everything up to that point has been flushed. Otherwise we would break write-ahead logging.
As you can see, there is no distributed consensus like Paxos or Raft, with multiple phases or sync points, in the storage. It’s all quorum and local tracking by each individual SN, because we leverage the sequencing from the head node.
Now, I didn’t talk about reads here, but we use a different tracking mechanism there too, instead of relying on any sort of consensus. Refer to the Aurora SIGMOD papers for details if you are interested.
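To make the write path concrete, here is a hedged sketch of the 4/6 quorum check plus local commit tracking described above; it is illustrative Python, not Aurora's implementation. Quorums may complete out of order, but the tracker acks commits strictly in LSN order, preserving write-ahead logging:

```python
COPIES = 6
WRITE_QUORUM = 4

def replicate(lsn: int) -> int:
    # Simulate sending the log record to all 6 copies in parallel; here one
    # copy is slow or down, so 5 of 6 ack. 5 >= 4, so the write is durable.
    return 5

class CommitTracker:
    """Acks commits in LSN order once a contiguous prefix is durable."""
    def __init__(self) -> None:
        self.durable = set()   # LSNs that reached write quorum
        self.acked_upto = 0    # highest LSN acked to clients so far

    def on_quorum(self, lsn: int) -> None:
        self.durable.add(lsn)
        # Advance over the contiguous durable prefix; a quorum that completes
        # out of order simply waits here, so acks never skip ahead.
        while self.acked_upto + 1 in self.durable:
            self.acked_upto += 1
            print(f"ack commit for LSN {self.acked_upto}")

tracker = CommitTracker()
for lsn in (2, 1, 4, 3):                # quorums can complete out of order
    if replicate(lsn) >= WRITE_QUORUM:  # local check; no Paxos rounds, no voting
        tracker.on_quorum(lsn)
```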
(1 minute 30 sec) So, we looked at Sysbench response times under heavy load. With Aurora we used 10K connections; with MySQL we used only 500 connections, because it starts thrashing beyond that and gets even worse.
As expected, the response times for Aurora are not only lower but also show much less variation. More precisely, based on the standard deviations of the two data sets, Amazon Aurora is more than 200x more consistent than MySQL. Also, the average response time is about 25x lower. Please note that Aurora is pushing 45x more throughput in this example.
You might wonder what’s going on with the spikes in MySQL. What you see here is the impact of database checkpoints. During a checkpoint, MySQL will do a lot of writes, which will slow down user transactions, hence the variability in the MySQL response times.
Three reasons why Aurora is so much better: 1) lightweight consensus, 2) the ability to flush out of order, and 3) no checkpointing, because we construct pages from the logs themselves.
(1 minute) Here are some examples of the software innovations we made to give you a world-class database.
Let’s take a look at the thread model. MySQL, on the left, follows a thread-per-connection model; clearly, that doesn’t scale with connections. Aurora instead uses a thread pool with epoll and a latch-free task queue, which allows it to scale much better with connections.
Here is another example of innovation: when you push more writes, you get more contention in the system, and if we simply locked the whole lock table like MySQL does, a lot of our other effort would go in vain. Instead, Aurora allows concurrent access to any given lock chain.
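For intuition, here is a minimal sketch of that threading pattern as a toy echo server; the port, pool size, and one-shot request handling are simplifications chosen for brevity, not how the Aurora engine is written. One epoll-style event loop multiplexes all connections and hands ready sockets to a small fixed pool through a shared queue:

```python
import queue
import selectors
import socket
import threading

NUM_WORKERS = 8        # small fixed pool, independent of the connection count
tasks = queue.Queue()  # stand-in for the latch-free task queue

def worker() -> None:
    # Each worker handles one ready request (one-shot for brevity; a real
    # server would hand the socket back to the event loop afterwards).
    while True:
        conn = tasks.get()
        try:
            data = conn.recv(4096)
            if data:
                conn.sendall(data)   # the per-request "work" (echo here)
        finally:
            conn.close()
            tasks.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

sel = selectors.DefaultSelector()    # epoll on Linux
listener = socket.socket()
listener.bind(("127.0.0.1", 9999))   # illustrative address
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

while True:
    # One event loop multiplexes ALL connections; ready sockets are handed to
    # the fixed pool instead of dedicating a thread to every connection.
    for key, _ in sel.select():
        if key.fileobj is listener:
            conn, _ = listener.accept()
            conn.setblocking(True)         # workers use plain blocking I/O
            sel.register(conn, selectors.EVENT_READ)
        else:
            sel.unregister(key.fileobj)    # a worker owns it now
            tasks.put(key.fileobj)
```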
(20 seconds)
Besides the software improvements, the hardware also improved.
Combined, you can see Aurora getting better and better. Aurora now delivers 200K writes and a whopping 700K reads per second on a single R4.16XL instance, and it’s getting better every day!
(20 seconds) A lot of customers come to me asking how to tune Aurora.
Well, Aurora automatically pre-tunes and auto-tunes the various parameters for different hardware configurations for you.
Unless you are doing something really peculiar, you will just get the best performance out of the box.
(1 minute): Now, THERE are a few parameters in MySQL, like innodb_flush_log_at_trx_commit, innodb_log_file_size, or sync_binlog, that allow for better write performance, but there is usually a tradeoff. Here is one such example, with innodb_log_file_size: you can get better performance in MySQL, but it also increases the recovery time if there is a database failure.
The reason is that this parameter fundamentally delays the checkpoint. By increasing the log size, you accumulate more redo log, and when recovering from a crash, MySQL has to replay more of it in a single thread. The more log there is to replay, the longer recovery takes.
With Amazon Aurora, there are no checkpoints, so it doesn’t even matter. In essence, you don’t have to make trade-offs between performance and availability or durability with Aurora.
(45 sec) Okay, let’s talk about multi-master. Before we jump in, some quick background on the space.
- We first had SQL running on one node, but that was hard to scale.
- To scale, we manually sharded, but that was very hard to manage as partitions become hot or schema changes are needed across partitions.
- Then, to simplify, we built NoSQL systems, but they fundamentally lack transaction support, and it is very hard for our customers to build apps on eventually consistent systems. The customers I talk to love the transaction model; it is very easy to reason about.
- With Aurora single-master, we addressed most of this, but a few gaps remained.
- With multi-master, we are addressing most of those gaps by adding write scalability and database write availability for our customers.
(2 min) Let’s take a look at how some of the existing multi-master solutions work:
1) First, we have the shared-disk model with the caches fused together. The challenge with these systems is that they use pessimistic locking. The other challenge is that they require high cache-coherence traffic, on a per-lock basis, so you either need expensive interconnects between the nodes (typically placed together in a small room in the datacenter) or you suffer from hot blocks ping-ponging across the nodes.
2) Then there are systems that use the read-write-set technique. Basically, as part of a transaction, you first read all the objects and later modify them based on the values you read. At commit time, if anyone else has changed any of those objects since you read them, you simply abort the transaction; otherwise you commit. If all nodes process all transactions in one particular order, they are guaranteed to independently come to the same decision. But coming up with this global order ends up becoming the bottleneck in these systems.
3) And finally, there are NoSQL-style systems with query processing on top, where the data is range-partitioned. They elect a Paxos leader within each partition, but if there is skew in the access pattern, which is quite typical, you end up with hot partitions. For example, if you partition by date-time, all inserts land in the last range. Also, they typically use a heavyweight consensus protocol for commits.
(2.5 min) With Aurora multi-master, there is no pessimistic locking, no explicit global ordering, and no global commit coordination. The architecture is based on three techniques:
First, it uses optimistic conflict resolution (OCR) in the storage. To understand this better, let’s say the orange master runs T1 and the blue master runs T2. If T1 and T2 modify different pages, there is no conflict and hence no sync required. However, if T1 and T2 both touch P2, then one of them wins and the other has to retry, based on the quorum. As you can see, that doesn’t require any heavyweight consensus protocol. Again, we rely on quorum and local tracking, with partitioned, monotonic LSN sequencing from the individual database nodes to order the writes.
Since the logging layer is pushed down, Aurora decouples the transaction layer from the logging layer, and this allows Aurora to separate physical conflicts on pages from logical conflicts between transactions. Transaction conflicts are handled through MVCC, physical conflicts through OCR. Moreover, there is no direct coupling across the storage partitions or the database nodes in the cluster.
Microservices architecture: independent, minimal, resilient services run in the cluster to handle the async coordination that is needed, and any of them temporarily going down does NOT impact the whole cluster.
Net-net: Aurora only coordinates when it has to. Let’s see this in action.
(1 min)
Let’s say we have two clients, C1 and C2, talking to the blue and orange masters respectively.
Let’s start with the simple case, where the two clients write to two different tables.
Both clients start transactions, BT1 and OT1, on their respective master nodes.
They both issue an update, but to two different tables.
And they can both commit; no explicit sync required.
(1 min)
Same setup.
Let’s say the two clients want to write to the same entry in the same table.
Again, both clients start transactions BT1 and OT1 on their respective master nodes.
They both issue an update to the same table, modifying the same entry.
And when they both try to commit, one of them wins and the other one loses.
(1 min)
Now, it’s possible for two transactions to conflict even though there is no physical conflict. Let’s see that.
Again, same setup, and the two clients want to write to the same row in the same table.
Again, both clients start transactions BT1 and OT1 on their respective master nodes.
C1 sends an update and gets a quorum. The changes get replicated to the orange master.
Now, if C2 updates the same row, storage is totally okay with it, because the changes were made on top of the latest image.
But we detect that conflict in the database itself, through MVCC, and roll back the transaction. No distributed locking needed.
And of course C1 commits successfully.
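Here is a tiny sketch of the page-level optimistic conflict check from the walkthrough above; it shows a single storage node, whereas Aurora actually decides the winner across the 4/6 quorum, and the names are invented for illustration:

```python
class StoragePage:
    """One storage node's view of a page, with an optimistic conflict check."""
    def __init__(self) -> None:
        self.current_lsn = 0

    def try_write(self, base_lsn: int, new_lsn: int) -> bool:
        # Accept the write only if it was made on top of the latest image;
        # otherwise the writing master lost the race and must retry.
        if base_lsn != self.current_lsn:
            return False
        self.current_lsn = new_lsn
        return True

p2 = StoragePage()
# Orange master's T1 and blue master's T2 both modify P2, both based on LSN 0.
assert p2.try_write(base_lsn=0, new_lsn=1)        # T1 wins
assert not p2.try_write(base_lsn=0, new_lsn=2)    # T2 loses: physical conflict
assert p2.try_write(base_lsn=1, new_lsn=2)        # T2 retries on the latest image
```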
(1 min) We ran Sysbench on a multi-master cluster.
As we scaled up the cluster at the 5-minute mark, throughput went up from 14K to 27K. At t=10, we added 2 more nodes to the cluster, and you can see throughput went up to 48K.
At t=15, one of the machines went down and aggregate throughput came down to 38K; it went back to 48K when the affected node came back up at t=16.
This is really cool!
(1 min) Switching gears: what about reads? How do reads work in multi-master? More precisely, how do we offer linearizability?
Let me illustrate the problem with an example.
John and Bob are friends. One day, John proposed to Sara, so he updates his status to let everyone know that he proposed to Sara.
Bob, who was checking his updates, saw that John finally proposed, but then sees that John’s status is still single, and figures it probably didn’t work out.
However, if Bob had done a global read, he would have found that John is engaged to Sara, and would have immediately called John to congratulate him.
Local reads let you read your own changes, but if you want to read all the changes in the cluster, you need global reads.
(1 min 15 sec)
Say we have 3 nodes: N1, N2, and N3.
C1 issues a request to N1.
N1 then sends a hello request to N2 and N3. N2 and N3 respond with the timestamps T2 and T3, respectively, at which they saw the hello request from N1.
N1 then waits for replication to catch up to T2 and T3 from N2 and N3, respectively. Once it’s caught up, it performs the read and returns the results. This is a very simplified view; there is quite a bit of engineering and complexity involved in making it work in practice.
As you can see, there is no wait on the write path. This adds latency only to the reads that need global consistency. And it’s configurable per session.
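Here is a much-simplified sketch of that global read protocol; the class and method names are invented, and as noted above the real system involves far more engineering:

```python
import time

class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.clock = 0      # latest timestamp this node has generated
        self.applied = {}   # peer name -> timestamp replicated locally so far

    def hello(self) -> int:
        # A peer asks "where are you now?"; answer with a fresh timestamp.
        self.clock += 1
        return self.clock

    def global_read(self, peers, do_read):
        # Build the vector clock T = (T1, T2, T3): own clock plus peer answers.
        vector = {self.name: self.clock}
        vector.update({p.name: p.hello() for p in peers})
        # Wait, on this read only, until replication catches up to every entry;
        # writers elsewhere keep going unimpeded.
        while any(self.applied.get(n, 0) < t
                  for n, t in vector.items() if n != self.name):
            time.sleep(0.001)
        return do_read(vector)  # perform the read at vector clock T

n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n2.clock, n3.clock = 5, 7
# Pretend background replication has already delivered the peers' changes:
n1.applied = {"N2": 6, "N3": 8}
print(n1.global_read([n2, n3], do_read=lambda t: f"read at {t}"))
```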
(30 seconds)
Net-net, Aurora achieves linear write scaling with OCR.
Continuous availability through the microservices architecture.
Enterprise-grade durability through 6 copies, 2 per AZ. Plus, we continuously back up your data to highly durable S3 in addition.
And it supports the indexes, constraints, triggers, procedures, and functions you need for your relational database application.
(45 sec) So much for OLTP; let’s talk about your OLAP queries.
Here are some of the optimizations we did for OLAP...
Batched scans: the idea is to scan tuples in batches from the InnoDB buffer pool, to avoid latching and traversing pages again and again, and to enable JIT optimizations. Mainly for in-memory workloads.
Hash joins: improve equi-join performance. Build a hash table on one side and scan through the other to probe it (see the sketch below). There is a lot of complexity around skew and duplicates, where you have to minimize the number of passes, not to mention deciding when to choose a hash join over other join operators like index joins or nested-loop joins.
Asynchronous key prefetch: prefetches pages into memory for index joins using BKA; quite useful for non-equi joins or some equi-joins (if one side is small and the big side has a high-cardinality index on the join column). For out-of-cache workloads.
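To illustrate the hash-join bullet above, here is a minimal in-memory sketch: build a hash table on the smaller side, probe with the other side, with build-side duplicates kept in buckets. A real implementation adds partitioning and spilling to bound memory, which is where the complexity around skew and passes comes from:

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = defaultdict(list)
    for row in build_rows:                 # build phase
        table[row[build_key]].append(row)
    for row in probe_rows:                 # probe phase
        for match in table.get(row[probe_key], ()):  # buckets handle duplicates
            yield {**match, **row}

customers = [{"cust_id": 1, "name": "ann"}, {"cust_id": 2, "name": "bob"}]
orders = [{"cust_id": 1, "order": "A"}, {"cust_id": 2, "order": "B"}]
print(list(hash_join(customers, orders, "cust_id", "cust_id")))
```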
(15 seconds) We ran a TPC-H-like workload.
Here you can see the performance improvement from all those optimizations. As you can see, roughly half the queries are >2x better, with a peak speedup of roughly 18x.
(1 min)
Push processing down to thousands of storage nodes.
Moving processing closer to the data reduces network traffic.
It also reduces buffer pool pollution. Why does that matter? We will see in a second.
(1 min)
On the left, a request is sent to the SN, including the pages, the page LSNs, and the function to evaluate. In return we get two streams back: clean and dirty. The clean stream is the set of records that have not been modified since the query started; it is sent to the aggregator to merge with the partial clean streams from the other SNs. The dirty stream goes through an MVCC converter to get the right version; we then apply the function and feed the result back into the aggregator. The combined result is sent back to the client and to the next step in the query execution.
We already push down predicates, projections, and more. This is an active area of work; we can do much more and exploit storage in unique ways that were not possible before.
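Here is an illustrative sketch of that clean/dirty split; the record layout and function names are invented. The storage node applies the pushed-down predicate only to rows unchanged since the query's start LSN, and returns later-modified rows raw for MVCC resolution at the head node:

```python
def storage_node_scan(rows, query_start_lsn, predicate):
    clean, dirty = [], []
    for row in rows:
        if row["lsn"] <= query_start_lsn:
            if predicate(row):       # push-down: filter at the storage node
                clean.append(row)
        else:
            dirty.append(row)        # head node must pick the right version
    return clean, dirty

def head_node(clean, dirty, predicate, mvcc_resolve):
    # Resolve dirty rows to the version visible to this query, filter them,
    # and merge with the pre-filtered clean stream from the storage nodes.
    resolved = (mvcc_resolve(r) for r in dirty)
    return clean + [r for r in resolved if predicate(r)]

rows = [
    {"id": 1, "v": 10, "lsn": 90},   # old and filtered out at the storage node
    {"id": 3, "v": 80, "lsn": 50},   # old and matching: goes in the clean stream
    {"id": 2, "v": 99, "lsn": 120},  # modified after the query began: dirty
]
pred = lambda r: r["v"] > 50
clean, dirty = storage_node_scan(rows, query_start_lsn=100, predicate=pred)
print(head_node(clean, dirty, pred, mvcc_resolve=lambda r: {**r, "v": 60}))
```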
There are more challenges involved, for example:
How we scan the list of pages to process without holding the latches for long.
How we do flow control.
How we run each request in a secure container on the storage.
How we seamlessly handle failures of storage nodes.
There will be a parallel query chalk talk on Thursday if you are interested in learning more details.
DO NOT USE
===========
Some of the obvious next work items: aggregation pushdown, subquery unnesting with (inner HJ + aggregation), native semi-join support in HJ.
(30 sec) This is on top of the previous improvements.
As you can see, some queries are two orders of magnitude better, and several queries are an order of magnitude better.
Now, to be clear, this is no AWS Redshift, but it’s clearly a good option if you are doing some lightweight analytics.
(30 sec) Processing closer to the data significantly reduces the data transfer between the head node (HN) and the storage nodes (SNs). As a result, there is significantly less impact on OLTP performance, thanks to the reduction in buffer pool pollution and network traffic.
We use a 150GB data set just to show the impact, because the 8XL buffer pool is around 150GB. So if we bring in pages for OLAP queries, they will evict pages needed by the OLTP queries.
Let’s talk about availability.
Why? Even in the presence of background radiation, where nodes…
But how? Even if we lose 3 copies, we still have at least 1 up-to-date copy left: with a 4/6 write quorum, any 3 surviving copies include at least one that acknowledged the write.
If we instead used a 2-of-3 quorum and 2 copies were lost, we would lose data.
We continuously back up your data to S3.
How does it work?
Aurora divides the database into 10GB segments. We basically take snapshots of those segments and stream the delta redo logs to S3. On restore, we fetch those snapshots and apply the delta log stream on top, in parallel.
This all happens on the storage tier; it has no performance impact on the database nodes.
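Here is a toy model of that segment-level scheme; the data structures are invented for illustration. Each segment restores independently from its snapshot plus its own delta logs, so the work parallelizes trivially:

```python
from concurrent.futures import ThreadPoolExecutor

def restore_segment(segment):
    pages = dict(segment["snapshot"])   # start from the segment snapshot
    for rec in segment["delta_logs"]:   # replay only this segment's deltas
        pages[rec["page_id"]] = rec["image"]
    return pages

def restore(segments):
    # Each 10GB segment restores independently, so we fan out in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(restore_segment, segments))

segments = [
    {"snapshot": {1: "v0"}, "delta_logs": [{"page_id": 1, "image": "v1"}]},
    {"snapshot": {7: "v0"}, "delta_logs": []},
]
print(restore(segments))
```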
Now, there are times when we accidentally delete a table or forget the WHERE clause in a DELETE statement.
Backtrack lets you quickly get the data back without fully restoring from backups. It’s a relatively quick operation: a couple of minutes versus hours.
In this example, you first backtracked to t1 and let it run some transactions…
Note that you can actually backtrack back and forth to find the right point; it’s not a destructive operation.
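For reference, backtracking a cluster is a single API call; this is a hedged example using boto3 (the cluster identifier is a placeholder, and the cluster must have been created with a backtrack window enabled):

```python
import boto3
from datetime import datetime, timedelta, timezone

rds = boto3.client("rds")

# Rewind the cluster 10 minutes; not destructive, so you can backtrack
# forward again if you overshoot.
response = rds.backtrack_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",  # placeholder name
    BacktrackTo=datetime.now(timezone.utc) - timedelta(minutes=10),
)
print(response["Status"])
```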
In traditional databases, we have to replay the logs since the last checkpoint, in a single thread.
With Aurora, these redo logs are already being applied to each segment in parallel, ASYNCHRONOUSLY. At startup, we don’t have to do anything other than some math to find the point at the time of the crash.
You can have up to 15 replicas.
Because the replicas share the storage with the master, there is no loss of data.
You can define the failover order.
Let’s see what multi-master availability looks like in practice.
Two primary use cases:
DR.
Enhanced data locality, bringing data closer to your customers’ applications in different regions.
You continue to get high throughput.
Lag across regions stays under a second, even at peak throughput, which is quite impressive.
For DR, you can switch your apps to a different region in under a minute; that’s basically your RTO. The RPO is under a few seconds.
How did we do it?
- With a multi-tenant, distributed replication fleet that attaches itself as a replica on one side and as a writer on the other, with compressed physical replication between the two fleets.
Here we compared logical and physical replication across regions using Sysbench. For logical replication, we used multi-threaded apply with 64 parallel workers.
As we ramped the workload past 25K QPS, the logical replica was unable to keep up, with lag rising consistently. The physical lag, however, stayed under a second even at peak throughput.
DO NOT USE
==========
The above tests demonstrate that even under a heavy write workload, a global database is able to maintain low-latency physical replication. Unlike logical binlog replication, changes on the master do not need to be executed on the replicas; the changes are physically replicated. This means that within milliseconds, committed DML and DDL changes from the writer are replicated globally to the regions you have selected, all while ensuring the data is durable in 3 availability zones in each region of your global database cluster. Global databases’ physical replication also removes any dependency on binary logs; all replication is handled natively by the Aurora storage layer.
Notes:
It is important to note there are many different binlog replica configurations; the following tests were done using a sustained write-only workload.
Tests:
Write-only Sysbench workload.
Stepped workload: every 600 seconds the workload was ramped up.
WL used: https://d0.awsstatic.com/product-marketing/Aurora/RDS_Aurora_Performance_Assessment_Benchmarking_v1-2.pdf
Binlog tests:
-- RDS MySQL 5.7.23
-- Single AZ, 30k IOPS, r4.16xlarge
-- Master: us-east-1, Replica: us-west-2
- Multi-threaded slave apply on the replica: 64 parallel slave worker threads, LOGICAL_CLOCK
- TPS step rate: 800,900,1000,2000,3000,4000,5000,6000
Desc: Up to 25K QPS we had a lag of < 5 seconds; afterwards the replica was unable to keep up, with lag rising consistently.
Global Databases
- Aurora MySQL, one writer and one cross-region reader
- Writer: r4.16xlarge, US-EAST-1; Reader: r4.16xlarge, US-EAST-2
- TPS step rate: 1000,2000,4000,8000,16000,24000,28000
- Desc: ~200ms lag up to 150K QPS; at >170K QPS we were seeing a lag of ~500ms.
Let’s talk about manageability.
(1 min)
- We announced Performance Insights support for Aurora MySQL earlier this year.
- It’s a single place for you to monitor and root-cause load issues. It allows you to group by waits, SQL, users, and hosts, over time or over any of those metrics.
- For example, the problem could be high CPU, lock waits, or I/O latency. You can then take action by tuning the SQL statements or adding more resources.
- It’s done in such a seamless way that it doesn’t impact the performance of your database.
(30 seconds) With Aurora, you don’t have to manage storage; it automatically grows for you.
You don’t have to manage read replicas either; we can auto-scale them for you based on your workload. And custom reader endpoints let you separate, for example, your analytics replicas from your OLTP replicas.
(30 seconds)
- With Serverless, we can now manage the writer instance for you: it can automatically scale up and down (including going to zero) depending on the load.
- So if you have a dev/test workload, or a sporadic, cyclic, or unpredictable workload, Serverless may be a great option for you.
- It’s a great way to save cost, as you only pay for the time you use.
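For reference, here is a hedged boto3 example of creating a Serverless cluster with a scaling range and auto-pause; the identifier, credentials, and capacity values are placeholders, so check the current documentation for supported engine versions and options:

```python
import boto3

rds = boto3.client("rds")
rds.create_db_cluster(
    DBClusterIdentifier="my-serverless-cluster",  # placeholder name
    Engine="aurora",                              # Aurora MySQL-compatible
    EngineMode="serverless",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",        # placeholder credential
    ScalingConfiguration={
        "MinCapacity": 2,                # Aurora capacity units (ACUs)
        "MaxCapacity": 16,
        "AutoPause": True,               # scale to zero when idle
        "SecondsUntilAutoPause": 300,
    },
)
```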
(2 min)
To understand how, let’s first look at the different layers:
- There is a multi-tenant, distributed proxy layer in the front, and your database connections are spread across that fleet, so there is no single point of failure.
- Then we keep a warm pool of instances of different sizes on the side, to scale up or down quickly.
- Finally, we also have a monitoring service, running on the left, that monitors your database instances and takes action as needed.
How seamless scaling works:
We first attach the new instance as a replica.
Then we ask the database to find a safe scaling point with no active transactions.
Once it finds such a point, it starts looping all the incoming traffic back to the proxy fleet, along with the coordinates of the replica instance.
The proxy then reads the payload, redirects the network streams from the old host to the new host, and sends any new traffic to the new host.
Finally, the proxy sends a close message to the old machine.
So you get no broken connections and no app impact.
(30 seconds)
We ran a quick simulation and you can see for yourself.
(20 seconds)
We are also announcing a web-service Data API for your Lambda apps on top of Serverless: you can simply send us an HTTP request and don’t need to worry about connection pooling at all.
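As a hedged illustration, here is what calling the Data API looks like with boto3’s rds-data client (the ARNs, database, and table are placeholders; the cluster must have the Data API enabled):

```python
import boto3

client = boto3.client("rds-data")
result = client.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-serverless-cluster",
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:mydb-creds",
    database="mydb",
    sql="SELECT id, name FROM users WHERE id = :id",
    parameters=[{"name": "id", "value": {"longValue": 42}}],
)
print(result["records"])  # no connection pooling to manage
```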
(10 seconds)
If you’d like to know more about multi-master, Serverless, or parallel query in Aurora MySQL, we have the following sessions coming up.
Thank you all and have a great rest of your conference.