Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Feb 8, 2017
Varun Talwar, product manager on Google's gRPC project, discusses the fundamentals and specs of gRPC inside a Google-scale microservices architecture.
Google confidential │ Do not
distribute
Bringing learnings from Googley microservices with gRPC
Microservices Summit
Varun Talwar
Contents
1. Context: Why are we here?
2. Learnings from Stubby experience
a. HTTP/JSON doesn't cut it
b. Establish a Lingua Franca
c. Design for fault tolerance and control: Sync/Async, Deadlines, Cancellations, Flow control
d. Flying blind without stats
e. Diagnosing with tracing
f. Load Balancing is critical
3. gRPC
a. Cross platform matters!
b. Performance and Standards matter: HTTP/2
c. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
d. Usability matters!
Key learnings
1. HTTP/JSON doesn't cut it!
2. Establish a lingua franca
3. Design for fault tolerance and provide control knobs
4. Don't fly blind: Service Analytics
5. Diagnosing problems: Tracing
6. Load Balancing is critical
HTTP/JSON doesn’t cut it!
1. WWW, browser growth - bled into services
2. Stateless
3. Text on the wire
4. Loose contracts
5. TCP connection per request
6. Noun-based
7. Harder API evolution
8. Think compute, network on cloud platforms
Establish a lingua franca
1. Protocol Buffers - Since 2003.
2. Start with IDL
3. Have a language agnostic way of agreeing on data semantics
4. Code Gen in various languages
5. Forward and Backward compatibility
6. API Evolution
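The forward and backward compatibility point can be illustrated with a minimal sketch of tag-based encoding, the idea Protocol Buffers relies on. This is not the protobuf library: it handles only varint fields, and the field numbers, the integer values, and the `wind_speed` field are hypothetical choices made for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields: dict) -> bytes:
    """fields maps field number -> non-negative int (wire type 0 only)."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        out += encode_varint(number << 3 | 0)  # key = (field number << 3) | wire type
        out += encode_varint(value)
    return bytes(out)


def decode_message(data: bytes, known: dict) -> dict:
    """known maps field number -> name; unknown field numbers are skipped."""
    pos, message = 0, {}
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        value, pos = decode_varint(data, pos)
        if key >> 3 in known:
            message[known[key >> 3]] = value
    return message


# Hypothetical schema versions: v2 adds field 3.
V1_FIELDS = {1: "degrees", 2: "units"}
V2_FIELDS = {1: "degrees", 2: "units", 3: "wind_speed"}

new_payload = encode_message({1: 21, 2: 1, 3: 14})  # written by a v2 sender
old_payload = encode_message({1: 21, 2: 1})         # written by a v1 sender
```

An old reader decoding `new_payload` with `V1_FIELDS` simply skips field 3, and a new reader decoding `old_payload` with `V2_FIELDS` sees the field as absent. Because fields are identified by number rather than by position or name, a schema can grow without breaking deployed peers.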
Google Cloud Platform
Service Definition (weather.proto)
syntax = "proto3";

service Weather {
  rpc GetCurrent(WeatherRequest) returns (WeatherResponse);
}

message WeatherRequest {
  Coordinates coordinates = 1;

  message Coordinates {
    fixed64 latitude = 1;
    fixed64 longitude = 2;
  }
}

message WeatherResponse {
  Temperature temperature = 1;
  float humidity = 2;
}

message Temperature {
  float degrees = 1;
  Units units = 2;

  enum Units {
    FAHRENHEIT = 0;
    CELSIUS = 1;
    KELVIN = 2;
  }
}
Design for fault tolerance and control
● Sync and Async APIs
● Need fault tolerance: Deadlines, Cancellations
● Control Knobs: Flow control, Service Config, Metadata
gRPC Deadlines
First-class feature in gRPC.
A deadline is an absolute point in time.
The deadline tells the server how long the client is willing to wait for an answer.
The RPC fails with the DEADLINE_EXCEEDED status code when the deadline is reached.
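Deadline Propagation
Diagram: a gateway calls withDeadlineAfter(200, MILLISECONDS) and the same absolute deadline travels with the request from service to service.
Gateway: Now = 1476600000000, Deadline = 1476600000200
Next service: Now = 1476600000040, Deadline = 1476600000200
Next service: Now = 1476600000150, Deadline = 1476600000200
Next service: Now = 1476600000230, Deadline = 1476600000200 (deadline already passed)
Durations labelled in the diagram: 90 ms, 40 ms, 20 ms, 20 ms, 60 ms
Result: DEADLINE_EXCEEDED at each of the four hops.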
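A rough simulation of deadline propagation, using the timestamps from the slide. It is a sketch with a fake clock, not the gRPC API: `FakeClock`, `call_chain`, and the per-hop delays (derived from the differences between the Now values) are assumptions made for illustration.

```python
class DeadlineExceeded(Exception):
    """Stands in for the DEADLINE_EXCEEDED status code."""


class FakeClock:
    """Deterministic clock in milliseconds, so the example is repeatable."""

    def __init__(self, now_ms):
        self.now_ms = now_ms

    def advance(self, ms):
        self.now_ms += ms


def with_deadline_after(clock, timeout_ms):
    """Turn a relative timeout into an absolute deadline."""
    return clock.now_ms + timeout_ms


def call_chain(clock, deadline_ms, work_ms_per_hop):
    """Each hop checks the shared absolute deadline before doing its work."""
    log = []
    for hop, work_ms in enumerate(work_ms_per_hop):
        remaining = deadline_ms - clock.now_ms
        log.append((hop, clock.now_ms, remaining))
        if remaining <= 0:
            raise DeadlineExceeded(log)
        clock.advance(work_ms)  # time spent before the next hop is reached
    return log


clock = FakeClock(1476600000000)
deadline = with_deadline_after(clock, 200)
try:
    call_chain(clock, deadline, [40, 110, 80, 10])
    outcome = "OK"
except DeadlineExceeded as exc:
    outcome = "DEADLINE_EXCEEDED"
    trace = exc.args[0]
```

Because the deadline is an absolute timestamp rather than a per-hop timeout, every service can compute the time remaining locally and stop work that can no longer be useful to the caller.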
Cancellation?
Deadlines are expected.
What about unpredictable cancellations?
• User cancelled request.
• Caller is not interested in the result any more.
• etc.
Cancellation?
Diagram: the gateway (GW) fans out to downstream services, all marked Busy and each with an Active RPC.
Cancellation
Automatically propagated.
RPC fails with CANCELLED status code.
Cancellation status can be accessed by the receiver.
Server (receiver) always knows if RPC is valid!
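Automatic propagation of cancellation can be pictured as a tree of call contexts, where cancelling a parent cancels every downstream call. The sketch below is an illustration only; `CallContext` and `handle` are hypothetical names, not gRPC classes.

```python
class CallContext:
    """One node per RPC; child contexts are the RPCs made on its behalf."""

    def __init__(self, parent=None):
        self.cancelled = False
        self.children = []
        if parent is not None:
            parent.children.append(self)
            self.cancelled = parent.cancelled  # inherit an earlier cancel

    def cancel(self):
        if self.cancelled:
            return
        self.cancelled = True
        for child in self.children:
            child.cancel()  # propagate downstream


def handle(context, steps):
    """A receiver that checks whether the RPC is still valid before each step."""
    done = 0
    for _ in range(steps):
        if context.cancelled:
            return ("CANCELLED", done)
        done += 1
    return ("OK", done)


gateway = CallContext()
service_a = CallContext(parent=gateway)
service_b = CallContext(parent=service_a)

before = handle(service_b, steps=5)
gateway.cancel()  # e.g. the user abandoned the request
after = handle(service_b, steps=5)
```

The check inside `handle` is what lets a busy server stop spending resources on work nobody is waiting for.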
BiDi Streaming - Slow Client
Diagram: a fast server streams responses to a slow client after a request. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.
BiDi Streaming - Slow Server
Diagram: a fast client streams requests to a slow server, which returns a response. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.
Flow-Control
Flow control helps to balance computing power and network capacity between client and server.
gRPC supports both client- and server-side flow control.
Photo taken by Andrey Borisenko.
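A minimal sketch of credit-based flow control between a fast sender and a slow receiver. HTTP/2 flow control counts bytes per stream and per connection; this simplified model counts whole messages, and the `Receiver` and `stream` names are assumptions made for illustration.

```python
from collections import deque


class Receiver:
    """Slow consumer that advertises a fixed window of buffered messages."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()

    def receive(self, message):
        assert len(self.buffer) < self.window, "sender ignored flow control"
        self.buffer.append(message)

    def consume_one(self):
        """Process one buffered message and hand one credit back."""
        self.buffer.popleft()
        return 1


def stream(messages, receiver):
    """Send as fast as credits allow; return (sent, peak buffered messages)."""
    credits = receiver.window
    pending = deque(messages)
    sent = peak = 0
    while pending or receiver.buffer:
        while pending and credits > 0:  # fast sender, limited by credits
            receiver.receive(pending.popleft())
            credits -= 1
            sent += 1
        peak = max(peak, len(receiver.buffer))
        credits += receiver.consume_one()  # slow receiver frees one slot
    return sent, peak


sent, peak = stream(range(10), Receiver(window=3))
```

All ten messages are delivered, but the receiver never holds more than its window, which is how a slow peer avoids being overwhelmed by a fast one.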
Service Config
Policies through which the server tells clients what they should do
Can specify deadlines, LB policy, and payload size per method of a service
Loved by SREs, since it gives them more control
Discovery via DNS
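As an illustration of per-method policy, the sketch below parses a config document modeled on the gRPC service config JSON format and looks up the entry for a method. The values are made up, `GetForecast` is a hypothetical method, and `method_config` is a helper written for this example rather than a gRPC API.

```python
import json

SERVICE_CONFIG = json.loads("""
{
  "loadBalancingConfig": [{"round_robin": {}}],
  "methodConfig": [
    {
      "name": [{"service": "Weather", "method": "GetCurrent"}],
      "timeout": "0.2s",
      "maxRequestMessageBytes": 4096,
      "maxResponseMessageBytes": 65536
    },
    {
      "name": [{"service": "Weather"}],
      "timeout": "1s"
    }
  ]
}
""")


def method_config(config, service, method):
    """Prefer an exact service/method match, then a service-wide entry."""
    service_wide = None
    for entry in config.get("methodConfig", []):
        for name in entry["name"]:
            if name.get("service") != service:
                continue
            if name.get("method") == method:
                return entry
            if "method" not in name:
                service_wide = entry
    return service_wide or {}


current = method_config(SERVICE_CONFIG, "Weather", "GetCurrent")
forecast = method_config(SERVICE_CONFIG, "Weather", "GetForecast")
```

Keeping these knobs in data that the service owner publishes means deadlines and size limits can change without touching client code.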
Metadata helps in the exchange of useful information
Metadata Exchange - Common cross-cutting concerns like authentication or tracing rely on the exchange of data that is not part of the declared interface of a service. Deployments rely on their ability to evolve these features at a different rate to the individual APIs exposed by services.
Don’t fly blind: Stats
● What is the mean latency per RPC?
● How many RPCs per hour for a service?
● Errors in the last minute/hour?
● How many bytes sent? How many connections to my server?
Data collection by arbitrary metadata is useful
● Any service’s resource usage and performance stats in real time by (almost) any arbitrary metadata
1. Service X can monitor CPU usage in their jobs, broken down by the name of the invoked RPC and the mdb user who sent it.
2. Social can monitor the RPC latency of shared bigtable jobs when responding to their requests, broken down by whether the request originated from a user on web/Android/iOS.
3. Gmail can collect usage on servers, broken down by POP/IMAP/web/Android/iOS. The layer propagates Gmail's metadata down to every service, even if the request was made by an intermediary job that Gmail doesn't own.
● The stats layer exports data to varz and streamz, and provides stats to many monitoring systems and dashboards
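Breaking stats down by arbitrary metadata can be sketched as a recorder that stores tags with every sample and groups on whichever keys are asked for. This is not the Stubby or gRPC stats layer; `StatsRecorder`, the `client` tag, and the sample values are hypothetical.

```python
from collections import defaultdict


class StatsRecorder:
    """Keeps raw samples so they can be grouped by any tag afterwards."""

    def __init__(self):
        self._samples = []

    def record(self, method, latency_ms, **tags):
        self._samples.append((method, latency_ms, tags))

    def mean_latency_by(self, *keys):
        """Mean latency per distinct combination of the requested keys."""
        totals = defaultdict(lambda: [0.0, 0])
        for method, latency_ms, tags in self._samples:
            row = {"method": method, **tags}
            group = tuple(row.get(key) for key in keys)
            totals[group][0] += latency_ms
            totals[group][1] += 1
        return {group: total / count for group, (total, count) in totals.items()}


stats = StatsRecorder()
stats.record("GetCurrent", 10, client="android")
stats.record("GetCurrent", 30, client="android")
stats.record("GetCurrent", 50, client="web")

by_client = stats.mean_latency_by("method", "client")
overall = stats.mean_latency_by("method")
```

The same samples answer both "mean latency per RPC" and "mean latency per RPC per client platform", which is the kind of breakdown the examples above describe.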
Diagnosing problems: Tracing
● 1 in 10K requests takes very long. It's an ad query :-) I need to find out why.
● Take a sample and store it in a database; this helps identify requests in the sample that took a similar amount of time.
● I didn't get a response from the service. What happened? Which link in the service dependency graph got stuck? Stitch a trace and figure it out.
● Where is the time going in a trace? Hotspot analysis.
● What are all the dependencies of a service?
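Stitching a trace and finding the hotspot can be sketched with spans that carry a trace ID and a parent span ID. The `Span` fields, the service names, and the durations below are assumptions for illustration, not the format of any particular tracing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: int


def stitch(spans, trace_id):
    """Group one trace's spans into a parent -> children map."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)
    return children


def slowest_path(children, parent_id=None):
    """Follow the slowest child at each level, starting from the root span."""
    path = []
    while parent_id in children:
        slowest = max(children[parent_id], key=lambda span: span.duration_ms)
        path.append(slowest.service)
        parent_id = slowest.span_id
    return path


spans = [
    Span("t1", "a", None, "gateway", 230),
    Span("t1", "b", "a", "auth", 20),
    Span("t1", "c", "a", "weather", 190),
    Span("t1", "d", "c", "storage", 170),
    Span("t2", "x", None, "gateway", 15),  # a different, fast request
]
tree = stitch(spans, "t1")
hot_path = slowest_path(tree)
dependencies = sorted({span.service for kids in tree.values() for span in kids})
```

`hot_path` answers "where is the time going" for this request, and `dependencies` answers "what does this request depend on".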
Load Balancing is important!
Iteration 1: Stubby Balancer
Iteration 2: Client side load balancing
Iteration 3: Hybrid
Iteration 4: gRPC-lb
gRPC LB
Next gen of load balancing
● Current client support is intentionally dumb (simplicity).
○ Pick first available - Avoid connection establishment latency
○ Round-robin-over-list - Lists not sets → ability to represent weights
● For anything more advanced, move the burden to an external "LB Controller", a regular gRPC server, and rely on a client-side implementation of the so-called gRPC LB policy.
Diagram (client, LB Controller, backends):
1) Control RPC from the client to the LB Controller
2) address-list returned to the client
3) RR over addresses of address-list, from the client to the backends
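The two simple client-side policies can be sketched in a few lines. The addresses and the `is_connected` callback are hypothetical; the point is that repeating an address in the list acts as a weight, which is why the list is not deduplicated into a set.

```python
import itertools
from collections import Counter


def pick_first(addresses, is_connected):
    """Use the first address that is reachable; skip the rest."""
    for address in addresses:
        if is_connected(address):
            return address
    raise RuntimeError("UNAVAILABLE: no reachable backend")


def round_robin(address_list):
    """Cycle over the list as given, duplicates included."""
    return itertools.cycle(address_list)


# Address list as an LB controller might return it (made-up values).
address_list = ["10.0.0.1:50051", "10.0.0.1:50051", "10.0.0.2:50051"]

picker = round_robin(address_list)
picks = Counter(next(picker) for _ in range(6))
first = pick_first(address_list, lambda address: address.startswith("10.0.0.2"))
```

Over six picks the first backend receives four RPCs and the second receives two, so a controller can steer traffic just by shaping the list it hands out.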
In summary, what did we learn?
● Contracts should be strict
● A common language helps
● Common understanding for deadlines, cancellations, flow control
● A common stats/tracing framework is essential for monitoring and debugging
● A common framework enables uniform policy application for control and LB
A single point of integration for logging, monitoring, tracing, service discovery and load balancing makes lives much easier!
Where is the project today?
Open source on GitHub for C, C++, Java, Node.js, Python, Ruby, Go, C#, PHP, Objective-C
gRPC core, gRPC Java, gRPC Go
● 1.0 with stable APIs
● Well documented with an active community
● Reliable with continuous running tests on GCE
○ Deployable in your environment
● Measured with an open performance dashboard
○ Deployable in your environment
● Well adopted inside and outside Google
More lessons
1. Cross language & Cross platform matters!
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters!
gRPC Principles & Requirements
Coverage & Simplicity
The stack should be available on every popular development platform and easy for someone to build for their platform of choice. It should be viable on CPU & memory limited devices.
http://www.grpc.io/blog/principles
gRPC Speaks Your Language
Service definitions and client libraries
● Java
● Go
● C/C++
● C#
● Node.js
● PHP
● Ruby
● Python
● Objective-C
Platforms supported
● MacOS
● Linux
● Windows
● Android
● iOS
HTTP/2 in One Slide
• Single TCP connection.
• No Head-of-line blocking.
• Binary framing layer.
• Request –> Stream.
• Header Compression.
Protocol stack, top to bottom:
Application (HTTP/2), including Binary Framing
Session (TLS) [optional]
Transport (TCP)
Network (IP)
HTTP/1.x request:
POST /upload HTTP/1.1
Host: www.javaday.org.ua
Content-Type: application/json
Content-Length: 27
{"msg": "Welcome to 2016!"}
HTTP/2 equivalent: the request line and headers travel in a HEADERS frame and the body in a DATA frame.
Binary Framing
HTTP/2 breaks down the HTTP protocol communication into an exchange of binary-encoded frames, which are then mapped to messages that belong to a stream, and all of which are multiplexed within a single TCP connection.
Diagram: Stream 1, Stream 2, ... Stream N multiplexed over one TCP connection.
Request (HEADERS frame): :method: GET, :path: /kyiv, :version: HTTP/2, :scheme: https
Response (HEADERS frame): :status: 200, :version: HTTP/2, :server: nginx/1.10.1, ...
Response (DATA frame): <payload>
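To make binary framing concrete, here is a small sketch that packs and unpacks the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, one reserved bit plus a 31-bit stream identifier). It treats the payload as opaque bytes; a real HEADERS frame carries an HPACK-encoded header block, which is not shown. The helper names are assumptions made for this example.

```python
import struct

FRAME_TYPES = {"DATA": 0x0, "HEADERS": 0x1}
END_STREAM = 0x1   # flag: last frame the sender will send on this stream
END_HEADERS = 0x4  # flag: header block is complete


def pack_frame(frame_type, flags, stream_id, payload):
    """Prefix payload with the 9-byte HTTP/2 frame header."""
    header = struct.pack(">I", len(payload))[1:]  # keep the low 24 bits
    header += struct.pack(
        ">BBI", FRAME_TYPES[frame_type], flags, stream_id & 0x7FFFFFFF
    )
    return header + payload


def unpack_frame(data):
    """Return (length, type, flags, stream_id, payload) for one frame."""
    length = int.from_bytes(data[0:3], "big")
    frame_type, flags, stream_id = struct.unpack(">BBI", data[3:9])
    return length, frame_type, flags, stream_id & 0x7FFFFFFF, data[9:9 + length]


frame = pack_frame("DATA", END_STREAM, 1, b"hello")
parsed = unpack_frame(frame)
```

Because every frame names its stream, frames from many requests can be interleaved on one TCP connection and reassembled per stream at the other end.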
gRPC Service Definitions
Unary
Unary RPCs where the client sends a single request to the server and gets a single response back, just like a normal function call.
Server streaming
The client sends a request to the server and gets a stream to read a sequence of messages back. The client reads from the returned stream until there are no more messages.
Client streaming
The client sends a sequence of messages to the server using a provided stream. Once the client has finished writing the messages, it waits for the server to read them and return its response.
BiDi streaming
Both sides send a sequence of messages using a read-write stream. The two streams operate independently. The order of messages in each stream is preserved.
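The four call shapes can be mimicked with plain Python functions and generators, which is a reasonable mental model even though it leaves out the network entirely. The handler names and payloads are hypothetical, and no gRPC library is involved.

```python
def get_current(request):
    """Unary: one request in, one response out."""
    return {"city": request, "degrees": 21}


def watch(request, updates=3):
    """Server streaming: one request in, a stream of responses out."""
    for i in range(updates):
        yield {"city": request, "degrees": 21 + i}


def average(request_iterator):
    """Client streaming: a stream of requests in, one response out."""
    readings = list(request_iterator)
    return sum(readings) / len(readings)


def echo_doubled(request_iterator):
    """BiDi streaming: responses are produced while requests keep arriving."""
    for reading in request_iterator:
        yield reading * 2


unary = get_current("kyiv")
server_stream = list(watch("kyiv"))
client_stream = average(iter([20, 22, 24]))
bidi = list(echo_doubled(iter([1, 2, 3])))
```

In each streaming case the order of messages within a stream is preserved, matching the guarantee described above.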
BiDi Streaming Use-Cases
Messaging applications.
Games / multiplayer tournaments.
Moving objects.
Sport results.
Stock market quotes.
Smart home devices.
You name it!
Performance
● Open Performance Benchmark and Dashboard
● Benchmarks run in GCE VMs per Pull Request for regression testing.
● gRPC users can run these in their environments.
● Good performance across languages:
○ Java Throughput: 500 K RPCs/Sec and 1.3 M Streaming messages/Sec on 32 core VMs
○ Java Latency: ~320 us for unary ping-pong (netperf 120us)
○ C++ Throughput: ~1.3 M RPCs/Sec and 3 M Streaming Messages/Sec on 32 core VMs.
gRPC Principles & Requirements
Pluggable
Large distributed systems need security, health-checking, load-balancing and failover, monitoring, tracing, logging, and so on. Implementations should provide extension points to allow for plugging in these features and, where useful, default implementations.
http://www.grpc.io/blog/principles
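One common shape for such extension points is an interceptor chain, where each interceptor wraps the next handler. The sketch below is a generic illustration, not the gRPC interceptor API of any language; every name in it is hypothetical.

```python
def logging_interceptor(log):
    """Records the start and end of every call."""
    def intercept(method, request, metadata, proceed):
        log.append(f"start {method}")
        response = proceed(method, request, metadata)
        log.append(f"end {method}")
        return response
    return intercept


def auth_interceptor(token):
    """Adds a bearer token to the outgoing metadata."""
    def intercept(method, request, metadata, proceed):
        extra = [("authorization", f"Bearer {token}")]
        return proceed(method, request, metadata + extra)
    return intercept


def build_call(handler, interceptors):
    """Wrap handler so the first interceptor in the list runs outermost."""
    call = handler
    for interceptor in reversed(interceptors):
        def wrapped(method, request, metadata, _next=call, _icpt=interceptor):
            return _icpt(method, request, metadata, _next)
        call = wrapped
    return call


def handler(method, request, metadata):
    """Stand-in for the real transport: echoes what it was given."""
    return {"method": method, "metadata": dict(metadata)}


log = []
call = build_call(handler, [logging_interceptor(log), auth_interceptor("t0k3n")])
response = call("/Weather/GetCurrent", {}, [])
```

Logging and auth are added without touching the handler or each other, which is the property the "Pluggable" principle asks for.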
Coming soon!
1. Server reflection
2. Health Checking
3. Automatic retries
4. Streaming compression
5. Mechanism to do caching
6. Binary Logging
a. Debugging, auditing though costly
7. Unit Testing support
a. Automated mock testing
b. Don't need to bring up all dependent services just to test
8. Web support
Some early adopters
Microservices: in data centres
Streaming telemetry from network devices
Client Server communication/Internal APIs
Mobile Apps
Why gRPC?
Multi-language: 9 languages
Open: Open source and growing community
Strict Service contracts: Define and enforce contracts, backward compatible
Performant: 1m+ QPS - unary, 3m+ streaming (dashboard)
Pluggable design: Auth, Transport, IDL, LB
Efficiency on wire: 2-3X gains
Streaming APIs: Large payloads, speech, logs
Standard compliant: HTTP/2
Easy to use: Single line installation
The Fallacies of Distributed Computing
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous
https://blogs.oracle.com/jag/resource/Fallacies.html
How is gRPC Used?
Direct RPCs: Microservices (On Prem, GCP, Other Cloud)
RPCs to access APIs: Google APIs, Your APIs
Mobile/Web RPCs: Your Mobile/Web Apps
What are the benefits?
Developers: Ease of use, Performance, Versioning, Programming model
Operators: Uniform Monitoring, Debugging/Tracing, Cross platform/language
Architects/Managers: Defined Contracts, Single uniform framework for control, Visibility
gRPC Principles & Requirements
Layered
Key facets of the stack must be able to evolve independently. A revision to the wire-format should not disrupt application layer bindings.
http://www.grpc.io/blog/principles
Layered Architecture
Diagram, client side (top to bottom): RPC Client-Side App; Stub, Future Stub, Blocking Stub; Channel; ClientCall; NameResolver and LoadBalancer (pluggable load balancing and service discovery); Tran #1, Tran #2, ... Tran #N
HTTP/2 between client and server
Diagram, server side (bottom to top): Transport; ServerCall; ServerCall handler; Service Definition (extends generated definition); RPC Server-side Apps
Takeaways
HTTP/2 is a high performance production-ready multiplexed bidirectional protocol.
gRPC (http://grpc.io):
• HTTP/2 transport based, open source, general purpose, standards-based, feature-rich RPC framework.
• Bidirectional streaming over one single TCP connection.
• Netty transport provides asynchronous and non-blocking I/O.
• Deadline and cancellation propagation.
• Client- and server-side flow-control.
• Layered, pluggable and extensible.
• Supports 10 programming languages.
• Built-in testing support.
• Production-ready (current version is 1.0.1) and growing ecosystem.
Metadata and Auth
● Protocol Structure
○ Request → <Call Spec> <Header Metadata> <Messages>*
○ Response → <Header Metadata> <Messages>* <Trailing Metadata> <Status>
● Generic mechanism for attaching metadata to requests and responses
● Commonly used to attach "bearer tokens" to requests for Auth
○ OAuth2 access tokens
○ JWT, e.g. OpenID Connect ID tokens
● Session state for specific Auth mechanisms is encapsulated in an Auth-credentials object
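The request structure can be sketched as a call spec, header metadata carrying a bearer token, and a sequence of length-prefixed messages (a 1-byte compressed flag followed by a 4-byte big-endian length, as in the gRPC over HTTP/2 wire format). The dictionary layout, the path, and the token are assumptions for illustration; real metadata travels as HTTP/2 headers.

```python
import struct


def frame_message(payload: bytes, compressed: bool = False) -> bytes:
    """Length-prefix one message: compressed flag, then 4-byte length."""
    return struct.pack(">BI", int(compressed), len(payload)) + payload


def read_messages(data: bytes):
    """Yield (compressed, payload) for each length-prefixed message."""
    pos = 0
    while pos < len(data):
        compressed, length = struct.unpack(">BI", data[pos:pos + 5])
        pos += 5
        yield bool(compressed), data[pos:pos + length]
        pos += length


def build_request(path, token, messages):
    """Request -> <Call Spec> <Header Metadata> <Messages>*"""
    return {
        "call_spec": {":method": "POST", ":path": path},
        "header_metadata": [("authorization", f"Bearer {token}")],
        "messages": b"".join(frame_message(m) for m in messages),
    }


request = build_request("/Weather/GetCurrent", "example-token", [b"one", b"three"])
decoded = list(read_messages(request["messages"]))
```

The token rides in metadata rather than in the message body, so auth can change without any change to the service's declared interface.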