
How we sleep well at night using Hystrix at Finn.no

Experiences using Hystrix at FINN.no. Presented at JavaZone 2015 in Oslo.

Published in: Technology


  1. 1. Hystrix- What did we learn? JavaZone September 2015 Hystrix cristata Audun Fauchald Strand & Henning Spjelkavik
  2. 2. public int lookup(MapPoint p ) { return altitude(p); } Example
  3. 3. public int lookup(MapPoint p ) { return new LookupCommand(p).execute(); } private class LookupCommand extends HystrixCommand<Integer> { final MapPoint p; LookupCommand(MapPoint p) { super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude")); this.p = p; } protected Integer run() throws Exception { return altitude(p); } protected Integer getFallback() { return -1; } } Example
  4. 4. Audun Fauchald Strand @audunstrand Henning Spjelkavik @spjelkavik
  5. 5. Agenda Why? Tolerance for failure - How? How to create a Hystrix Command Monitoring and Dashboard Examples from finn What did we learn
  6. 6. Agenda Why? Tolerance for failure - How? How to create a Hystrix Command Monitoring and Dashboard Examples from finn What did we learn
  7. 7. Service A calls Service B
  8. 8. Map calls User over the network What can possibly go wrong?
  9. 9. Map calls User What can possibly go wrong? 1. Connection refused 2. Slow answer 3. Veery slow answer (=never) 4. The result causes an exception in the client library
  10. 10. Map calls User What can possibly go wrong? 1. Connection refused => < 2 ms 2. Slow answer => 5 s 3. Veery slow answer => timeout 4. The result causes an exception in the client library => depends
  11. 11. Map calls User What can possibly go wrong? 1. Connection refused => < 2 ms 2. Slow answer => 5 s 3. Veery slow answer => timeout 4. The result causes an exception in the client library => depends Fails quickly
  12. 12. Map calls User What can possibly go wrong? 1. Connection refused => < 2 ms 2. Slow answer => 5 s 3. Veery slow answer => timeout 4. The result causes an exception in the client library => depends May kill both the server and the client
  13. 13. Map calls User Let’s assume: one thread per request, response time 4 s. Map has 60 req/s. Fan-out to User is 2 => 120 req/s. 240 / 480 threads blocking
  14. 14. mobilewebN has 130 req/s Let’s assume: one thread per request. RandomApp has 130 req/s. Fan-out to service is 2 => 260 req/s. 520 / 1040 threads blocking
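
The thread counts on the two slides above follow from multiplying the request rate by the time each request holds a thread. A minimal sketch of that arithmetic, assuming the 4 s response time stated on slide 13 also applies to slide 14; the class and method names are illustrative and not from the talk:

```java
final class BlockedThreads {
    /** Threads busy at steady state = arrival rate (req/s) x seconds each request holds a thread. */
    static long blockedThreads(double requestsPerSecond, double responseTimeSeconds) {
        return Math.round(requestsPerSecond * responseTimeSeconds);
    }

    public static void main(String[] args) {
        double responseTimeSeconds = 4.0; // from slide 13
        int fanOut = 2;                   // downstream calls per incoming request
        for (int incoming : new int[] {60, 130}) {
            int outgoing = incoming * fanOut;
            System.out.printf("%d req/s in, %d req/s out: %d / %d threads blocking%n",
                    incoming, outgoing,
                    blockedThreads(incoming, responseTimeSeconds),
                    blockedThreads(outgoing, responseTimeSeconds));
        }
    }
}
```

One reading of "240 / 480" is therefore: 240 threads tied up at the incoming rate and 480 at the fanned-out rate.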
  15. 15. What happens in an app with 500 blocking threads? Not much. Besides waiting. CPU is idle. If maximum-threads == 500 => no more connections are allowed And what about 1040 occupied threads?
  16. 16. And where is the user after 8 s? At Youtube, Facebook or searching for cute kittens.
  17. 17. The problem we try to solve An application with 30 dependent services, each with 99.99% uptime: 99.99%^30 = 99.7% uptime. 0.3% of 1 billion requests = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. 98%^30 = 54% uptime. 99.99% = 8 sec per day; 99.7% = 4 min per day.
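
The figures on this slide can be reproduced by multiplying the per-service availabilities, assuming failures are independent and every request needs all 30 services. A small sketch; the helper names are illustrative:

```java
final class CompoundUptime {
    /** Availability of a request that needs every one of n independent services. */
    static double combined(double perServiceAvailability, int services) {
        return Math.pow(perServiceAvailability, services);
    }

    /** Expected downtime in seconds over a 24-hour day. */
    static double downtimeSecondsPerDay(double availability) {
        return (1.0 - availability) * 24 * 60 * 60;
    }

    public static void main(String[] args) {
        double good = combined(0.9999, 30); // roughly 0.997
        double poor = combined(0.98, 30);   // roughly 0.545
        System.out.printf("30 x 99.99%% -> %.2f%% uptime, %.0f s down per day%n",
                good * 100, downtimeSecondsPerDay(good));
        System.out.printf("30 x 98%%    -> %.1f%% uptime%n", poor * 100);
        System.out.printf("Failures per 1e9 requests at 99.7%%: %.0f%n", 1e9 * (1 - 0.997));
    }
}
```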
  18. 18. Agenda Why? Tolerance for failure - How? How to create a Hystrix Command Monitoring and Dashboard Examples from finn One step further
  19. 19. Control over latency and failure from dependencies Stop cascading failures in a complex distributed system. Fail fast and rapidly recover. Fallback and gracefully degrade when possible. Enable near real-time monitoring, alerting What is Hystrix for?
  20. 20. Fail fast - don’t let the user wait! Circuit breaker - don’t bother, it’s already down Fallback - can you give a sensible default, show stale data? Bulkhead - protect yourself against cascading failure Principles
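
To make the circuit-breaker and fallback principles concrete, here is a deliberately simplified sketch in plain Java. It is not Hystrix's implementation: Hystrix trips on an error percentage over a rolling window, whereas this toy version counts consecutive failures. The class name, threshold, and injected clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;  // consecutive failures before the circuit opens
    private final long openMillis;       // how long to short-circuit before trying again
    private final LongSupplier clock;    // injected so the behaviour can be tested without sleeping
    private int consecutiveFailures = 0;
    private long openedAt = -1;          // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    /** Runs the risky call unless the circuit is open; returns the fallback on failure or short-circuit. */
    <T> T call(Supplier<T> risky, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // don't bother, it's already down
        }
        try {
            T result = risky.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // (re)open the circuit
        }
    }
}
```

Once the open period has elapsed, the next call is let through as a trial: a success closes the circuit, a failure reopens it immediately.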
  21. 21. How? Preventing any single dependency from using up all threads Shedding load and failing fast instead of queueing Providing fallbacks wherever feasible Using isolation techniques (such as bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any one dependency.
  22. 22. Two different ways of isolation Semaphore “at most 5 concurrent calls” only for CPU-intensive, local calls Thread pool (dedicated couriers) the call to the underlying service is handled by a pool overhead is usually not problematic default approach
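
The difference between the two isolation styles can be sketched with `java.util.concurrent` alone. This is an illustration of the ideas, not Hystrix's code; the class names are made up, and the pool deliberately has no queue so that excess calls are rejected rather than queued.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Semaphore isolation: the work runs on the caller's thread, at most N calls at a time. */
final class SemaphoreIsolation {
    private final Semaphore permits;

    SemaphoreIsolation(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // limit reached: reject immediately instead of waiting
        }
        try {
            return work.get();     // no timeout possible: the caller's thread is doing the work
        } finally {
            permits.release();
        }
    }
}

/** Thread-pool isolation: a dedicated pool does the work, so the caller can give up after a timeout. */
final class ThreadPoolIsolation {
    private final ExecutorService pool;

    ThreadPoolIsolation(int threads) {
        // SynchronousQueue holds no tasks: when all threads are busy, submit() is rejected.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());
    }

    <T> T call(Callable<T> work, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback;       // pool saturated: shed load
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);   // interrupt the worker if it is still running
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The semaphore variant is cheap but cannot time out a call that hangs, which matches the slide's advice to reserve it for local, CPU-bound work.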
  23. 23. Recommended book: Release it!
  24. 24. Dependencies Depends on rxjava archaius (& commons-configuration) FINN uses Constretto for configuration management, hence: https://github.com/finn-no/archaius-constretto
  25. 25. Dependencies There are useful addons: hystrix-metrics-event-stream - json/http stream hystrix-codahale-metrics-publisher (currently io.dropwizard.metrics) (Follows the recent trend of really splitting up the dependencies - include only what you need)
  26. 26. Default properties Quite sensible, “fail fast” Do your own calculations of the number of concurrent requests and of timeouts (99.8th percentile) ...by looking at your current performance (latency) per request and adding a little buffer
  27. 27. threads = requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
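
The sizing rule on this slide is simple enough to write down as code. A sketch with made-up example numbers (30 req/s, 200 ms, 4 spare threads); the method name and parameters are illustrative, and latency is taken in milliseconds only to keep the arithmetic exact:

```java
final class PoolSizing {
    /**
     * threads = peak requests per second (when healthy)
     *           x 99th percentile latency in seconds
     *           + breathing room
     */
    static int threads(int peakRequestsPerSecond, long p99LatencyMillis, int breathingRoom) {
        int busyAtPeak = (int) Math.ceil(peakRequestsPerSecond * p99LatencyMillis / 1000.0);
        return busyAtPeak + breathingRoom;
    }

    public static void main(String[] args) {
        // Hypothetical dependency: 30 req/s at peak, p99 latency of 200 ms, 4 threads of headroom.
        System.out.println(threads(30, 200, 4));
    }
}
```

With these example inputs, 30 × 0.2 s gives 6 busy threads at peak, and the headroom brings the pool to 10.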
  28. 28. Hystrix - part of NetflixOSS Netflix OSS Hystrix - resilience Ribbon - remote calls Feign - Rest client Eureka - Service discovery Archaius - Configuration Karyon - Starting point
  29. 29. Hystrix at FINN.no
  30. 30. Agenda Why? Tolerance for failure How to create a Hystrix Command Monitoring and Dashboard Examples from finn What did we learn
  31. 31. How to create a Hystrix Command A command class wrapping the “risky” operation. - must implement run() - might implement getFallback() Since version 1.4 an Observable implementation is also available
  32. 32. public int lookup(MapPoint p ) { return altitude(p); } AltitudeSearch - before
  33. 33. public int lookup(MapPoint p ) { return new LookupCommand(p).execute(); } private class LookupCommand extends HystrixCommand<Integer> { final MapPoint p; LookupCommand(MapPoint p) { super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude")); this.p = p; } protected Integer run() throws Exception { return altitude(p); } } AltitudeSearch - after
  34. 34. FAQ Does that mean I have to write a command for (almost) every remote operation in my application?
  35. 35. FAQ YES! YES!
  36. 36. Why is it so intrusive? But Why?
  37. 37. Hystrix-Javanica @HystrixCommand( fallbackMethod = "defaultUser", ignoreExceptions = {BadRequestException.class}) public User getUserById(String id) { } private User defaultUser(String id) { }
  38. 38. Concurrency - The client decides T = c.execute() synchronous Future<T> = c.queue() asynchronous Observable<T> = c.observable() reactive streams
  39. 39. Runtime behaviour
  40. 40. Runtime behaviour
  41. 41. Runtime behaviour
  42. 42. Runtime behaviour
  43. 43. Runtime behaviour
  44. 44. Runtime behaviour
  45. 45. Runtime behaviour
  46. 46. Runtime behaviour
  47. 47. Runtime behaviour
  48. 48. Runtime behaviour
  49. 49. Runtime behaviour
  50. 50. Runtime behaviour
  51. 51. Runtime behaviour
  52. 52. Runtime behaviour
  53. 53. Agenda Why? Tolerance for failure How to create a Hystrix Command Metrics, Monitoring and Dashboard Examples from finn What did we learn
  54. 54. Metrics Circuit breaker open? Calls per second Execution time? Median, 90th, 95th and 99th percentile Status of thread pool? Number of clients in cluster
  55. 55. Publishing the metrics Servo - Netflix metrics library CodaHale/Yammer/dropwizard - metrics HystrixPlugins. registerMetricsPublisher(HystrixMetricsPublisher impl)
  56. 56. Dashboard toolset hystrix-metrics-event-stream out of the box: servlet we use embedded jetty for thrift services turbine-web aggregates metrics-event-stream into clusters hystrix-dashboard graphical interface
  57. 57. Dashboard
  58. 58. More Details
  59. 59. Thread Pools
  60. 60. Details
  61. 61. Agenda Why? Tolerance for failure How to create a Hystrix Command Monitoring and Dashboard Examples from finn What did we learn
  62. 62. Examples from Finn - Code Altitudesearch Fetch Several Profiles using collapsing Operations
  63. 63. public int lookup(MapPoint p ) { return new LookupCommand(p).execute(); } private class LookupCommand extends HystrixCommand<Integer> { final MapPoint p; LookupCommand(MapPoint p) { super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude")); this.p = p; } protected Integer run() throws Exception { return altitude(p); } protected Integer getFallback() { return -1; } } AltitudeSearch
  64. 64. Migrating a library Create commands Wrap commands with existing services Backwards compatible No flexibility
  65. 65. Examples from Finn - Code Fetch a map point Fetch Several Profiles using collapsing Operations
  66. 66. Request Collapsing Fetch one profile takes 10ms Lots of concurrent requests Better to fetch multiple profiles
  67. 67. Request Collapsing - why decouples client model from server interface reduces network overhead client container/thread batches requests
  68. 68. Request Collapsing create two commands Collapser one new() per client request BatchCommand one new() per server request
  69. 69. Request Collapsing Integrate two commands in two methods createCommand() Create a BatchCommand from a list of single commands mapResponseToRequests() Map the list response to single responses
  70. 70. Create Collapser public Collapser(Query query) { this.query = query;
  71. 71. Create BatchCommand return new BatchCommand(collapsedRequests, client);
  72. 72. create BatchCommand @Override protected HystrixCommand<Map<Query,Profile>> createCommand(Collection<Request> collapsedRequests) { return new BatchCommand(collapsedRequests, client); }
  73. 73. mapResponseToRequests @Override protected void mapResponseToRequests( Map<Query,Profile> batchResponse, Collection<Request> collapsedRequests) { collapsedRequests.stream().forEach( c -> c.setResponse(batchResponse.getOrDefault( c.getArgument(), new ImmutableProfile(id)))); }
  74. 74. mapResponseToRequests @Override protected void mapResponseToRequests( Map<Query,Profile> batchResponse, Collection<Request> collapsedRequests) { collapsedRequests.stream().forEach( c -> c.setResponse(batchResponse.getOrDefault( c.getArgument(), new ImmutableProfile(id)))); }
  75. 75. mapResponseToRequests @Override protected void mapResponseToRequests( Map<Query,Profile> batchResponse, Collection<Request> collapsedRequests) { collapsedRequests.stream().forEach( c -> c.setResponse(batchResponse.getOrDefault( c.getArgument(), new ImmutableProfile(id)))); } Graceful degradation
  76. 76. Request Collapsing - experiences Each individual request will be slower for the client, is that ok? 10 ms operation into 100 ms window Max 110 ms for client Average 60 ms Read documentation first!!
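
The latency numbers on this slide follow from how long a request waits for its batch. A sketch of the arithmetic, assuming requests arrive uniformly within the collapsing window and the batch call takes as long as a single call; the names are illustrative:

```java
final class CollapsingLatency {
    /** Worst case: the request arrives just as a window opens and waits for all of it. */
    static double worstCaseMillis(double windowMillis, double batchCallMillis) {
        return windowMillis + batchCallMillis;
    }

    /** Average case: a uniformly arriving request waits half the window on average. */
    static double averageMillis(double windowMillis, double batchCallMillis) {
        return windowMillis / 2.0 + batchCallMillis;
    }

    public static void main(String[] args) {
        double window = 100; // ms, from the slide
        double call = 10;    // ms, from the slide
        System.out.printf("max %.0f ms, average %.0f ms%n",
                worstCaseMillis(window, call), averageMillis(window, call));
    }
}
```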
  77. 77. Examples from Finn - Code Fetch a map point Fetch Several Profiles using collapsing Operations
  78. 78. Example from Finn - Operations [2015-06-31T13:37:00,485] [ERROR] Forwarding to error page from request due to exception [AdCommand short-circuited and no fallback available.] com.netflix.hystrix.exception.HystrixRuntimeException: RecommendMoreLikeThisCommand short-circuited and no fallback available. at com.netflix.hystrix.AbstractCommand$16.call (AbstractCommand.java:811)
  79. 79. Error happens in production Operations gets paged with lots of error messages in logs They read the logs Lots of [ERROR] They restart the application
  80. 80. Learnings - operations Error messages mean different things with Hystrix What they say, not where they occur Built-in error recovery with circuit breaker Operations reads logs, not the Hystrix dashboard Lots of unnecessary restarts
  81. 81. Conclusions What did we learn
  82. 82. Experiences from Finn Hystrix belongs client-side
  83. 83. Experiences from Finn Nested Hystrix commands are ok
  84. 84. Experiences from Finn Graceful degradation is a big change in mindset Little use of proper fallback-values
  85. 85. Experiences from Finn Tried putting hystrix in low-level http client without great success.
  86. 86. Experiences from Finn Server side errors are detected clientside
  87. 87. Experiences from Finn Not all exceptions are errors.
  88. 88. Experiences from Finn RxJava needs a full rewrite… Still useful without!
  89. 89. Experiences from FINN Hystrix standardises things we did before: Nitty gritty http-client stuff Timeouts Connection pools Tuning thread pools Dashboards Metrics
  90. 90. Wrap up Should you start using Hystrix? - Bulkhead and circuit-breaker - explicit timeout and error handling is useful - Dashboards Further reading Ben Christensen, GOTO Aarhus 2013 - https://www.youtube.com/watch?v=_t06LRX0DV0 Updated for QConSF2014; https://qconsf.com/system/files/presentation-slides/ReactiveProgrammingWithRx-QConSF-2014.pdf Thanks for listening! audun.fauchald.strand@finn.no & henning.spjelkavik@finn.no
  91. 91. Questions?
