"Netflix is actually a log generating application that just happens to stream movies"
Building a service/Microservice is itself easy. Scaling it in the cloud is not that hard either, but operating, maintaining and iterating a large-scale production service is not simply a linear extension of those steps. As Cockcroft points out, telemetry and monitoring are the most important aspect of building Microservices.
We discuss 5 patterns that any serious Microservice should have:
- Canary (an endpoint reporting health of underlying dependencies)
- IO monitor (measuring all calls from Microservice to external dependencies)
- A circuit breaker
- An ActivityId-Propagator
- An exception and short-timeout retry policy (a minimal sketch follows this list)
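The last pattern can be captured in a few lines. Below is a minimal sketch, not any particular library: retry an operation on exception, bounding each attempt with a short timeout. The attempt count, timeout and helper names are illustrative assumptions.

Java (sketch)

import java.util.concurrent.*;

public class ShortTimeoutRetry {
    // Run `op` up to `attempts` times; abandon each try after `timeoutMs`.
    public static <T> T call(Callable<T> op, int attempts, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int i = 0; i < attempts; i++) {
                Future<T> f = pool.submit(op);
                try {
                    return f.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    f.cancel(true); // short timeout: fail this try fast, then retry
                    last = e;
                }
            }
            throw last;
        } finally {
            pool.shutdownNow();
        }
    }
}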
3. /// ASOS in numbers
2016 Turnover → £1.5 bln
Active Customers → 12 M
New Products/wk → 4 k
Unique Visits/mo → 123 M
Page Views/day → 95 M
Platform Teams → 40
Azure Data Centres → 5
5. @aliostad
/// why microservices
> Scaling people is not the solution
> Decentralising decision centres => Agility
> Frequent deployment => Agility
> Reduced complexity of each ms (Divide/Conquer) => Agility
> Cost of mistakes and bad decisions smaller ...
6. @aliostad
/// anecdote
Often you can measure your success in
implementing Microservice Architecture not
by the number of services you build, but by
the number you decommission.
7. @aliostad
/// microservices vs soa
                                  SOA                        Microservices
Main Goal                         Architectural Decoupling   Business Agility
Set out to solve                  Architectural Coupling     Scaling People, Frequent Deployment
Audience                          Mainly Architecture        Business (Everyone)
Law                               Conway's                   Reverse Conway's
Impact on Organisation Structure  Minimal                    Huge
Service Cardinality               Usually up to a dozen      >40 (Commonly >100)
When to do                        Always                     Teams > ~5**

** Debatable. There are articles and discussions on this very topic.
8. @aliostad
/// microservice challenges
> Very difficult to build a complete mental picture of the solution
> When things go wrong, you need to know where before why
> Potentially increased latency
> Performance outliers can be intractable to diagnose
> A complete mind-shift requiring a new operating model
15. @aliostad
/// Backup requests with cross-server cancellation
BRCSC Paper, Google, 2012
https://static.googleusercontent.com/media/research.google.com/en//people/jeff/Berkeley-Latency-Mar2012.pdf
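For flavour, here is a minimal Java sketch of the paper's idea, not Google's implementation: send the request to one replica, fire a backup to a second replica only if the first has not answered within a short delay, and take whichever responds first. callReplica and all the timings are illustrative assumptions.

Java (sketch)

import java.util.concurrent.*;

public class BackupRequest {
    static final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(2);

    // Hypothetical remote call to replica `n`; stands in for a real RPC.
    static String callReplica(int n) {
        try { Thread.sleep(n == 0 ? 300 : 50); } // simulate a slow primary
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "response from replica " + n;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> primary =
                CompletableFuture.supplyAsync(() -> callReplica(0));

        // Fire the backup only if the primary is still pending after a short
        // delay (e.g. ~95th-percentile latency), so most requests never
        // generate extra load.
        CompletableFuture<String> backup = new CompletableFuture<>();
        scheduler.schedule(() -> {
            if (!primary.isDone()) backup.complete(callReplica(1));
        }, 100, TimeUnit.MILLISECONDS);

        // Take whichever replica answers first and cancel the other; a real
        // implementation would also tell the losing server to stop working.
        System.out.println(CompletableFuture.anyOf(primary, backup).get());
        primary.cancel(true);
        backup.cancel(true);
        scheduler.shutdown();
    }
}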
19. @aliostad
/// Blame Game
“If there is a single place where you can play the blame game, instead of collective responsibility, it is in Microservices troubleshooting”
20. @aliostad
/// Did you say IO??
[Diagram: a Microservice calling out to a DB, an API and a Cache]
Measure... every time your code goes out of your process.
21. @aliostad
/// Recording Methods
> Explicitly by calling record()
> Asking the library to record a closure
> Aspect-oriented
Java (spf4j)

private static final MeasurementRecorder recorder =
    RecorderFactory.createScalableCountingRecorder(forWhat, unitOfMeasurement, sampleTimeMillis);
...
recorder.record(measurement);

.NET (PerfIt)

var ins = new SimpleInstrumentor(new InstrumentationInfo()
{
    Counters = CounterTypes.StandardCounters,
    Description = "test",
    InstanceName = "Test instance",
    CategoryName = TestCategory
});
ins.Instrument(() => Thread.Sleep(100), "test...");

Java and .NET (aspect-oriented)

@PerformanceMonitor(warnThresholdMillis = 1, errorThresholdMillis = 100,
                    recorderSource = RecorderSourceInstance.Rs5m.class)

[PerfItFilter("PerfItTests", InstanceName = "Test")]
public string Get()
{
    return Guid.NewGuid().ToString();
}
22. @aliostad
/// Publishing Methods
> Local file (various to logstash)
> TCP and HTTP (many, to zipkin, influxdb)
> UDP (statsd, collectd to graphite, logstash)
> Raising Kernel-level event (Windows ETW)
> Local communication (statsd)
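To make the UDP route concrete, here is a minimal Java sketch publishing one timer metric in the statsd line protocol ("name:value|type"); the metric name, the localhost target and the conventional port 8125 are assumptions.

Java (sketch)

import java.net.*;
import java.nio.charset.StandardCharsets;

public class StatsdPublisher {
    public static void main(String[] args) throws Exception {
        // "|ms" marks a timer; "|c" would be a counter, "|g" a gauge.
        byte[] payload = "orders.api.latency_ms:42|ms"
                .getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("127.0.0.1"), 8125));
        }
        // Fire-and-forget: UDP never blocks or fails the request path,
        // which is why it suits hot paths.
    }
}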
26. @aliostad
/// Sampling
“The first production version of Dapper used a uniform sampling
probability for all processes at Google, averaging one sampled trace for
every 1024 candidates… [however] we are in the process of deploying
an adaptive sampling scheme that is parameterized not by a uniform
sampling probability, but by a desired rate of sampled traces per unit
time.”
Dapper Paper
Zipkin samples in the collector using a strategy pattern: an implementation of the CollectorSampler abstract class.
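Not Zipkin's actual CollectorSampler, but a minimal sketch of the same idea: derive a deterministic keep/drop verdict from the trace id, so every node makes the same decision for a given trace.

Java (sketch)

public class RateSampler {
    private final long keepBelow; // slice of the id space we keep

    // rate = 0.001f keeps roughly one trace per thousand candidates
    public RateSampler(float rate) {
        this.keepBelow = (long) (rate * 10_000);
    }

    public boolean isSampled(long traceId) {
        long t = traceId == Long.MIN_VALUE ? 0 : Math.abs(traceId);
        return t % 10_000 < keepBelow; // same verdict everywhere for this id
    }
}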
28. @aliostad
/// tri-state
> Closed: traffic flows normally
> Open: traffic does not flow
> Half-open: the circuit breaker tests the waters again
[State diagram: Closed →(failure)→ Open →(wait timeout)→ Half-open; a successful test closes the circuit, a failed test re-opens it]
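The three states map directly onto a small state machine. Below is a minimal sketch; the failure threshold, the wait timeout and the single-probe half-open policy are assumptions, not any particular library's behaviour.

Java (sketch)

import java.util.function.Supplier;

public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;
    private final int maxFailures = 5;          // failures before tripping open
    private final long waitTimeoutMs = 30_000;  // how long to stay open

    public synchronized <T> T call(Supplier<T> op, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= waitTimeoutMs) {
                state = State.HALF_OPEN;   // wait timeout elapsed: test the waters
            } else {
                return fallback.get();     // open: traffic does not flow
            }
        }
        try {
            T result = op.get();
            state = State.CLOSED;          // success (re)closes the circuit
            failures = 0;
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= maxFailures) {
                state = State.OPEN;        // trip open and start the clock
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}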
30. @aliostad
/// Fallback
> Custom: e.g. serve content from a local cache (status 206)
> Silent: return null/no-data/empty (status 200/204)
> Fail-fast: Customer experience is important (status 5xx)
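The "custom" option, using the breaker sketched above; productApi and localCache are purely illustrative names.

Java (sketch)

// Serve possibly-stale content from a local cache when the call fails or
// the circuit is open: a 206-style degraded answer instead of an error.
String product = breaker.call(
        () -> productApi.get(id),        // normal path
        () -> localCache.getStale(id));  // custom fallback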
32. @aliostad
/// ActivityId
> Every customer request matters
> Every request is unique
> Every request creates a chain (or tree) of calls/events
> Activities are correlated
> You need an ActivityId (or CorrelationId) to link calls/events
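A minimal sketch of propagation over HTTP using Java's built-in client; the header name X-Correlation-ID is a common convention, not a standard.

Java (sketch)

import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class ActivityIdPropagator {
    // Reuse the caller's id if one arrived on the incoming request;
    // otherwise this service is the root of a new call chain.
    public static String ensureActivityId(String incomingHeader) {
        return incomingHeader != null ? incomingHeader : UUID.randomUUID().toString();
    }

    // Stamp the same id on every outgoing call, so logs and events from
    // all services in the chain can be joined on one key.
    public static HttpRequest outgoing(String url, String activityId) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("X-Correlation-ID", activityId)
                .build();
    }
}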
38. @aliostad
/// Health Endpoints
Ping returns a success code when invoked.
Canary returns connectivity status and latency for the service and its dependencies.
“… none of them invoke any application code”
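A sketch of what a canary handler might do; the Dependency interface and its check() method are assumptions for illustration, and, per the quote above, no application code runs.

Java (sketch)

import java.util.*;

interface Dependency {
    String name();
    void check() throws Exception; // e.g. open a connection, run SELECT 1
}

class CanaryEndpoint {
    private final List<Dependency> dependencies;

    CanaryEndpoint(List<Dependency> dependencies) {
        this.dependencies = dependencies;
    }

    // Report connectivity and latency per dependency, without invoking
    // any application code.
    Map<String, String> get() {
        Map<String, String> report = new LinkedHashMap<>();
        for (Dependency d : dependencies) {
            long start = System.nanoTime();
            String status;
            try {
                d.check();
                status = "OK";
            } catch (Exception e) {
                status = "FAIL: " + e.getMessage();
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            report.put(d.name(), status + " (" + ms + " ms)");
        }
        return report;
    }
}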
47. @aliostad
/// Wrap-up
> If you have more than ~5 teams, consider Microservices
> Logging/Monitoring/Alerting: single most important asset
> Use ActivityId Propagator to correlate (consider zipkin)
> Cloud is a jungle™. Without retry/timeout you won’t survive
> Monitor and measure all calls to external services (blame game)
> Protect your systems with circuit-breakers (and isolation)
> Canary helps you detect connectivity from customer view
48. @aliostad
Thomas Wood: Daisy Picture
Thomas Au: Thermometer Picture
Torbakhopper: Cables Picture
Dam Picture - Japan
Hsiung: Lights Picture
Health Endpoint in API Design