How we sleep well at night using Hystrix at Finn.no

Hystrix- What did we learn?
JavaZone
September 2015
Hystrix cristata
Audun Fauchald Strand &
Henning Spjelkavik

public int lookup(MapPoint p ) {
return altitude(p);
}
Example

public int lookup(MapPoint p) {
return new LookupCommand(p).execute();
}
private class LookupCommand extends HystrixCommand<Integer> {
final MapPoint p;
LookupCommand(MapPoint p) {
super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
this.p = p;
}
protected Integer run() throws Exception {
return altitude(p);
}
protected Integer getFallback() {
return -1;
}
}
Example

Audun Fauchald Strand
@audunstrand
Henning Spjelkavik
@spjelkavik

Agenda
Why?
Tolerance for failure - How?
How to create a Hystrix Command
Monitoring and Dashboard
Examples from finn
What did we learn

Map calls User over the network
What can possibly go wrong?

Map calls User
1. Connection refused
2. Slow answer
3. Veery slow answer (=never)
4. The result causes an exception in
the client library

Map calls User
1. Connection refused => < 2 ms
2. Slow answer => 5 s
3. Veery slow answer => timeout
4. Exception in the client library => depends

Map calls User
Fails quickly

Map calls User
May kill both
the server and
the client

Map calls User
Let’s assume:
Thread per request
Response time - 4 s
Map has 60 req/s.
Fan-out to User is 2 => 120 req/s
240 / 480 threads blocking

mobilewebN has 130 req/s
Let’s assume:
Thread per request
RandomApp has 130 req/s.
Fan-out to service is 2 => 260 req/s
520 / 1040 threads blocking
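
The thread counts on the two slides above follow from Little's law: threads held = request rate × time each request holds a thread. A minimal sketch of that arithmetic; the class and method names are made up for illustration, and the 4 s response time for the second app is an assumption carried over from the Map/User slide.

```java
// Little's law sketch: threads in flight = arrival rate x time each request holds a thread.
final class BlockedThreads {
    private BlockedThreads() {}

    /** Threads held at steady state for a given request rate and response time. */
    static long blocked(long requestsPerSecond, long responseTimeSeconds) {
        return requestsPerSecond * responseTimeSeconds;
    }

    public static void main(String[] args) {
        long responseTime = 4; // seconds; assumed for both apps
        System.out.println("Map inbound threads:  " + blocked(60, responseTime));
        System.out.println("Map -> User calls:    " + blocked(60 * 2, responseTime));
        System.out.println("App inbound threads:  " + blocked(130, responseTime));
        System.out.println("App -> service calls: " + blocked(130 * 2, responseTime));
    }
}
```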

What happens in an app with 500 blocking threads?
Not much. Besides waiting. CPU is idle.
If maximum-threads == 500
=> no more connections are allowed
And what about 1040 occupied threads?

And where is the user after 8 s?
At Youtube, Facebook or searching for cute
kittens.

The problem we try to solve
An application with 30 dependent services - with 99.99%
uptime for each service
99.99%^30 = 99.7% uptime
0.3% of 1 billion requests = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
98%^30 = 54% uptime
Downtime: 99.99% = 8 s per day; 99.7% = 4 min per day
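
The uptime figures can be reproduced by multiplying the per-dependency availabilities. A small sketch, assuming the dependencies fail independently and that a request needs all of them; the class and method names are illustrative only.

```java
// Compound availability sketch: a request succeeds only if every dependency is up.
final class CompoundAvailability {
    private CompoundAvailability() {}

    /** Availability of a request that needs all `dependencies`, each up `perDependency` of the time. */
    static double combined(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    /** Expected downtime per day, in seconds, for a given availability. */
    static double downtimeSecondsPerDay(double availability) {
        return (1.0 - availability) * 24 * 60 * 60;
    }

    public static void main(String[] args) {
        double all30 = combined(0.9999, 30);
        System.out.printf("30 deps at 99.99%%: %.2f%% uptime%n", all30 * 100);
        System.out.printf("30 deps at 98%%:    %.1f%% uptime%n", combined(0.98, 30) * 100);
        System.out.printf("downtime per day:  %.0f s%n", downtimeSecondsPerDay(all30));
    }
}
```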

Agenda
Why?
Tolerance for failure - How?
Examples from finn
One step further

Control over latency and failure from dependencies
Stop cascading failures in a complex distributed system.
Fail fast and rapidly recover.
Fallback and gracefully degrade when possible.
Enable near real-time monitoring, alerting
What is Hystrix for?

Fail fast - don’t let the user wait!
Circuit breaker - don’t bother, it’s
already down
Fallback - can you give a sensible
default, show stale data?
Bulkhead - protect yourself
against cascading failure
Principles
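
To make "fail fast", "circuit breaker" and "fallback" concrete, here is a deliberately simplified breaker in plain Java. It is not Hystrix's implementation (Hystrix trips on error percentage over a rolling window); the class name, the consecutive-failure threshold and the injected clock are all assumptions made for this sketch.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified breaker: opens after N consecutive failures,
// then lets one trial call through once the open period has passed.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the behaviour can be tested
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // don't bother, it's already down
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get(); // sensible default or stale data
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();
        }
    }
}
```

While the circuit is open the dependency is not called at all, so the caller gets its fallback immediately and the failing service gets room to recover.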

How?
Avoid any single dep from using up all threads
Shedding load and failing fast instead of queueing
Providing fallbacks wherever feasible
Using isolation techniques (such as bulkhead, swimlane,
and circuit breaker patterns) to limit the impact of any one
dependency.

Two different ways of isolation
Semaphore
“at most 5 concurrent calls”
only for CPU-intensive, local calls
Thread pool (dedicated couriers)
the call to the underlying service is handled by a pool
overhead is usually not problematic
default approach
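
The difference between the two isolation styles can be sketched with plain `java.util.concurrent`. This is an illustration of the idea only, not how Hystrix is implemented; `IsolationSketch` and its method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class IsolationSketch {
    private IsolationSketch() {}

    /** Semaphore isolation: runs on the caller's thread, rejects when all permits are taken. */
    static <T> T withSemaphore(Semaphore permits, Supplier<T> call, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // shed load instead of queueing
        }
        try {
            return call.get(); // no timeout possible: the caller's own thread is busy
        } finally {
            permits.release();
        }
    }

    /** Thread-pool isolation: runs on a dedicated pool, the caller stops waiting after a timeout. */
    static <T> T withThreadPool(ExecutorService pool, Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(call);
        } catch (RejectedExecutionException e) {
            return fallback; // pool is saturated or shut down
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }
}
```

The thread-pool variant costs a hand-off between threads, but it is the only one of the two that lets the caller walk away from a slow network call, which is why it suits remote dependencies.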

Dependencies
Depends on
rxjava
archaius (& commons-configuration)
FINN uses Constretto for configuration
management, hence:
https://github.com/finn-no/archaius-constretto

Dependencies
There are useful addons:
hystrix-metrics-event-stream - json/http
stream
hystrix-codahale-metrics-publisher (currently
io.dropwizard.metrics)
(Follows the recent trend of really splitting up the dependencies - include only what you need)

Default properties
Quite sensible, “fail fast”
Do your own calculations of
number of concurrent requests
timeouts (99.8th percentile)
...by looking at your current performance
(latency) per request and adding a little buffer

threads = requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
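
A worked version of the sizing rule, as a sketch. The example numbers (30 requests per second, 200 ms at the 99th percentile, 4 threads of breathing room) are illustrative assumptions, not FINN's figures; latency is passed in milliseconds to keep the arithmetic exact.

```java
// Thread pool sizing sketch: peak rate x p99 latency, plus breathing room.
final class PoolSizing {
    private PoolSizing() {}

    static int threads(int peakRequestsPerSecond, int p99LatencyMillis, int breathingRoom) {
        // Threads that are busy at the same time when the dependency is healthy.
        int busy = (int) Math.ceil(peakRequestsPerSecond * p99LatencyMillis / 1000.0);
        return busy + breathingRoom;
    }

    public static void main(String[] args) {
        // 30 req/s x 0.2 s = 6 busy threads, plus 4 spare.
        System.out.println(threads(30, 200, 4));
    }
}
```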

Hystrix - part of NetflixOSS
Netflix OSS
Hystrix - resilience
Ribbon - remote calls
Feign - Rest client
Eureka - Service discovery
Archaius - Configuration
Karyon - Starting point

Agenda
Why?
Tolerance for failure
Examples from finn
What did we learn

A command class wrapping the “risky”
operation.
- must implement run()
- might implement getFallback()
Since version 1.4, an Observable implementation
(HystrixObservableCommand) is also available

public int lookup(MapPoint p) {
return altitude(p);
}
AltitudeSearch - before

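public int lookup(MapPoint p) {
return new LookupCommand(p).execute();
}
private class LookupCommand extends HystrixCommand<Integer> {
final MapPoint p;
LookupCommand(MapPoint p) {
super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
this.p = p;
}
protected Integer run() throws Exception {
return altitude(p);
}
}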
AltitudeSearch - after

FAQ
Does that mean I
have to write a
command for (almost)
every remote
operation in my
application?

Why is it so intrusive?
But Why?

Hystrix-Javanica
@HystrixCommand(
fallbackMethod = "defaultUser",
ignoreExceptions =
{BadRequestException.class})
public User getUserById(String id) {
}
private User defaultUser(String id) {
}

Concurrency - The client decides
T = c.execute() synchronous
Future<T> = c.queue() asynchronous
Observable<T> = c.observe() reactive streams

Agenda
Why?
Tolerance for failure
Metrics, Monitoring and Dashboard
Examples from finn
What did we learn

Metrics
Circuit breaker open?
Calls per second
Execution time?
Median, 90th, 95th and
99th percentile
Status of thread pool?
Number of clients in
cluster

Publishing the metrics
Servo - Netflix metrics library
CodaHale/Yammer/dropwizard - metrics
HystrixPlugins.getInstance().
registerMetricsPublisher(HystrixMetricsPublisher impl)

Dashboard toolset
hystrix-metrics-event-stream
out of the box: servlet
we use embedded jetty for thrift services
turbine-web
aggregates metrics-event-stream into clusters
hystrix-dashboard
graphical interface

Examples from Finn - Code
Altitudesearch
Fetch Several Profiles using collapsing
Operations

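public int lookup(MapPoint p) {
return new LookupCommand(p).execute();
}
private class LookupCommand extends HystrixCommand<Integer> {
final MapPoint p;
LookupCommand(MapPoint p) {
super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
this.p = p;
}
protected Integer run() throws Exception {
return altitude(p);
}
protected Integer getFallback() {
return -1;
}
}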
AltitudeSearch

Migrating a library
Create commands
Wrap commands with
existing services
Backwards compatible
No flexibility

Fetch a map point
Fetch Several Profiles using
collapsing
Operations

Request Collapsing
Fetch one profile takes 10ms
Lots of concurrent requests
Better to fetch multiple profiles

Request Collapsing - why
decouples client model from server interface
reduces network overhead
client container/thread batches requests

Request Collapsing
create two commands
Collapser
one new() per client request
BatchCommand
one new() per server request

Request Collapsing
Integrate two commands in two methods
createCommand()
Create a BatchCommand from a list of
single commands
mapResponseToRequests()
Map the list response to single responses
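
The division of labour between the two commands can be sketched without Hystrix. In this hypothetical stand-in, `submit()` plays the role of the per-request Collapser and `flush()` plays the role of the BatchCommand plus `mapResponseToRequests()`. Hystrix flushes on a timer window; here the flush is explicit so the sketch stays small, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical collapser: many single requests in, one batch call out.
final class BatchCollapser<K, V> {
    private final Function<Set<K>, Map<K, V>> batchCall;
    private final Function<K, V> defaultValue;
    private final Map<K, List<CompletableFuture<V>>> pending = new LinkedHashMap<>();

    BatchCollapser(Function<Set<K>, Map<K, V>> batchCall, Function<K, V> defaultValue) {
        this.batchCall = batchCall;
        this.defaultValue = defaultValue;
    }

    /** One call per client request: only records the argument. */
    synchronized CompletableFuture<V> submit(K key) {
        CompletableFuture<V> future = new CompletableFuture<>();
        pending.computeIfAbsent(key, k -> new ArrayList<>()).add(future);
        return future;
    }

    /** One call per server request: sends the batch, then maps the response back. */
    void flush() {
        Map<K, List<CompletableFuture<V>>> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return;
            }
            batch = new LinkedHashMap<>(pending);
            pending.clear();
        }
        final Map<K, V> response;
        try {
            response = batchCall.apply(batch.keySet());
        } catch (RuntimeException e) {
            // Fail every waiting request rather than leaving it hanging.
            batch.values().forEach(fs -> fs.forEach(f -> f.completeExceptionally(e)));
            return;
        }
        batch.forEach((key, futures) -> {
            // Missing entries get a default value instead of an error.
            V value = response.containsKey(key) ? response.get(key) : defaultValue.apply(key);
            futures.forEach(f -> f.complete(value));
        });
    }
}
```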

Create Collapser
public Collapser(Query query) {
this.query = query;
}

Create BatchCommand
return new BatchCommand(collapsedRequests, client);

create BatchCommand
@Override
protected HystrixCommand<Map<Query,Profile>>
createCommand(Collection<Request> collapsedRequests) {
return new BatchCommand(collapsedRequests, client);
}

mapResponseToRequests
@Override
protected void mapResponseToRequests(
Map<Query,Profile> batchResponse,
Collection<Request> collapsedRequests) {
collapsedRequests.stream().forEach(
c -> c.setResponse(batchResponse.getOrDefault(
c.getArgument(),
new ImmutableProfile(id)
)));
}

mapResponseToRequests
@Override
protected void mapResponseToRequests(
Map<Query,Profile> batchResponse,
Collection<Request> collapsedRequests) {
collapsedRequests.stream().forEach(
c -> c.setResponse(batchResponse.getOrDefault(
c.getArgument(),
new ImmutableProfile(id) // graceful degradation: default when the batch has no entry
)));
}

Request Collapsing - experiences
Each individual request will be slower for the
client, is that ok?
10 ms operation into 100 ms window
Max 110 ms for client
Average 60 ms
Read documentation first!!
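
The 110 ms and 60 ms figures come from adding the wait inside the collapsing window to the operation itself. A sketch of that arithmetic, assuming requests arrive evenly spread over the window so the average wait is half of it; the names are illustrative.

```java
// Latency added by request collapsing: wait for the window to close, then run the batch.
final class CollapseLatency {
    private CollapseLatency() {}

    /** A request arriving right as the window opens waits the full window. */
    static long worstCaseMillis(long windowMillis, long operationMillis) {
        return windowMillis + operationMillis;
    }

    /** With evenly spread arrivals, the average wait is half the window. */
    static long averageMillis(long windowMillis, long operationMillis) {
        return windowMillis / 2 + operationMillis;
    }

    public static void main(String[] args) {
        System.out.println("max:     " + worstCaseMillis(100, 10) + " ms");
        System.out.println("average: " + averageMillis(100, 10) + " ms");
    }
}
```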

Fetch a map point
Fetch Several Profiles using collapsing
Operations

Example from Finn - Operations
[2015-06-31T13:37:00,485]
[ERROR] Forwarding to error page from request
due to exception
[AdCommand short-circuited and no fallback available.]
com.netflix.hystrix.exception.HystrixRuntimeException:
RecommendMoreLikeThisCommand short-circuited and no fallback available.
at com.netflix.hystrix.AbstractCommand$16.call
(AbstractCommand.java:811)

Error happens in production
Operations gets paged with lots of error
messages in logs
They read the logs
Lots of [ERROR]
They restart the application

Learnings - operations
Error messages mean different things with
Hystrix
What they say, not where they occur
Built in error recovery with circuit breaker
Operations reads logs, not hystrix dashboard
Lots of unnecessary restarts

Experiences from Finn
Hystrix belongs
client-side

Nested Hystrix
commands are ok

Graceful degradation is
a big change in mindset
Little use of proper
fallback-values

Tried putting hystrix in
low-level http client
without great success.

Server side errors are
detected clientside

Not all exceptions are
errors.

RxJava needs a full
rewrite… Still useful
without!

Experiences from FINN
Hystrix standardises things we did before:
Nitty gritty http-client stuff
Timeouts
Connection pools
Tuning thread pools
Dashboards
Metrics

Wrap up
Should you start using Hystrix?
- Bulkhead and circuit-breaker - explicit timeout and error
handling is useful
- Dashboards
Further reading
Ben Christensen, GOTO Aarhus 2013 - https://www.youtube.com/watch?v=_t06LRX0DV0
Updated for QConSF2014: https://qconsf.com/system/files/presentation-slides/ReactiveProgrammingWithRx-QConSF-2014.pdf
Thanks for listening!
audun.fauchald.strand@finn.no & henning.spjelkavik@finn.no
