
RedisConf18 - 2,000 Instances and Beyond



Breakout Session



  1. Daniel Hochman, Engineer. 2,000 Instances and Beyond
  2. Agenda
     - From the ground up
     - Provisioning
     - Clustering
     - Maintaining high availability
     - Handling system failures
     - Observability
     - Load testing
     - Roadmap
  3. Previously: "Case study: scaling a geospatial index. Operating Redis on the Lyft platform" (RedisConf 2017)
  4. By the numbers
     - 2017: 50 clusters, 750 instances, 15M QPS peak, Twemproxy
     - 2018: 64 clusters (+14), 2,000 instances (+1,250), 25M QPS peak (+10M), Envoy
     Migrated the entire Redis infrastructure.
  5. Consistency?
     - Lyft runs with no replication: no AOF, no RDB
     - "Best-effort": no consistency guarantees
     - If an instance is lost, its data is gone
     The real-time nature of the service means most data is dynamic and refreshed often.
  6. From the ground up
  7. Provisioning clusters
     - Every Redis cluster is an EC2 autoscaling group
     - Each service defines and deploys its own cluster

     asg.present:
       - name: locationsredis
       - image: ubuntu16_base
       - launch_config:
         - cloud_init: |
             #!/bin/bash
             NAME=locationsredis
             SERVICE=redis
             curl s3://provision.sh | sh
       - instance_type: c5.large
       - min_size: 60
       - max_size: 60
  8. Provisioning instances
     - Central provisioning templates
     - Include and override

     include /etc/lyft/redis/redis-defaults.conf

     # overrides
     bind 0.0.0.0
     save ""
     port {{ port }}
     maxmemory-policy {{ get(maxmemory_policy, 'allkeys-lru') }}
     {% if environment == 'production' %}
     rename-command KEYS ""
     rename-command CONFIG CAREFULCONFIG
     {% endif %}
  9. Twemproxy (deprecated)
     - Also known as Nutcracker
     - Unmaintained; development moved to a closed-source replacement
     - No active healthcheck
     - No hot restart (config changes cause downtime)
     - Difficult to extend (e.g. to integrate with instance discovery)
  10. Envoy Proxy
      - Open-source, built for edge and mesh networking
      - Observability: stats, stats, stats
      - Dynamic configuration
      - Pluggable architecture, out-of-process
      - Thriving ecosystem: Redis, DynamoDB, MongoDB codecs
  11. Discovery
      - Instances register with POST /members/locationsredis; the proxy layer reads the host list with GET /members/locationsredis
      - Membership is eventually consistent

      locationsredis:
        - 10.0.0.1:6379, 40s ago
        - 10.0.0.2:6379, 23s ago
        ...
        - 10.0.0.9:6379, 12s ago
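The discovery flow above can be modeled as a TTL-based registry. This is an illustrative sketch only: the `Membership` class, its API, and the 60-second TTL are assumptions, not Lyft's implementation.

```python
import time

class Membership:
    # Illustrative TTL-based registry: instances re-register periodically,
    # and entries not refreshed within the TTL drop out on read.
    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._seen = {}  # addr -> last registration time

    def register(self, addr, now=None):
        # Analogous to POST /members/<cluster> in the slides.
        self._seen[addr] = time.time() if now is None else now

    def members(self, now=None):
        # Analogous to GET /members/<cluster>; stale entries are filtered
        # lazily, which is why membership is only eventually consistent.
        now = time.time() if now is None else now
        return sorted(a for a, t in self._seen.items() if now - t <= self.ttl)
```

An instance that stops re-registering simply ages out, so no explicit deregistration path is needed.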
  12. Active healthchecking
      Send a command periodically and check for a healthy response:

      > PING
      "PONG"
      > EXISTS _maintenance_
      (integer) 0
      > SET _maintenance_ true
      OK
      > EXISTS _maintenance_
      (integer) 1

      healthcheck:
        unhealthy_threshold: 3
        healthy_threshold: 2
        interval: 5s
        interval_jitter: 0.1s
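The threshold semantics in that config can be sketched as a small state machine: 3 consecutive failed checks mark a backend unhealthy, 2 consecutive passing checks mark it healthy again. This is a hedged model of the behavior, not Envoy's actual code.

```python
class HealthState:
    # Tracks one backend's health using the thresholds from the slide's
    # Envoy config (unhealthy_threshold: 3, healthy_threshold: 2).
    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results disagreeing with current state

    def record(self, check_passed):
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state; reset streak
            return
        self._streak += 1
        needed = (self.healthy_threshold if not self.healthy
                  else self.unhealthy_threshold)
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
```

A single check would pass only if PING returns PONG and the `_maintenance_` key is absent, which is how the SET shown above drains a host without killing it.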
  13. Passive healthchecking
      Monitor the success and failure of real operations and eject outliers:

      outlier_detection:
        consecutive_failure: 30
        success_rate_stdev: 1
        interval: 3s
        base_ejection_time: 3s

      Panic routing thresholds ensure that we don't eject everything.
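The success-rate criterion (`success_rate_stdev: 1`) can be illustrated as: eject any host whose success rate falls more than one standard deviation below the cluster mean. A minimal sketch, not Envoy's exact statistics:

```python
import statistics

def outliers(success_rates, stdev_factor=1.0):
    # success_rates: host -> fraction of successful operations in the
    # last interval. Hosts more than `stdev_factor` standard deviations
    # below the mean are candidates for ejection.
    rates = list(success_rates.values())
    threshold = statistics.mean(rates) - stdev_factor * statistics.pstdev(rates)
    return [h for h, r in success_rates.items() if r < threshold]
```

This is also why a panic threshold matters: if every host degrades at once, the mean drops with them, but a correlated failure could still eject most of the fleet without a floor on how many hosts must stay routable.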
  14. Consistent hashing
      cluster:
        name: locationsredis
        lb_policy: ring_hash

      Ketama algorithm
      - Initialization: hash each server n times to an integer, e.g. hash(10.0.0.1_1) = 15
      - Request: 1) hash the key to an integer, e.g. GET lyft ➝ hash(lyft) = 10; 2) search for the ring range containing that value
      - Larger n: better distribution, but longer ring initialization and longer search time
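The ketama scheme described above can be sketched in a few lines of Python. MD5 and the replica count are illustrative choices; Envoy's actual hash function and ring construction differ in detail.

```python
import bisect
import hashlib

def _hash(s):
    # Map a string to a point on the ring (MD5 here is illustrative).
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

def build_ring(hosts, replicas=100):
    # Initialization: hash each server `replicas` times onto the ring,
    # e.g. hash("10.0.0.1:6379_1"). Larger n gives better distribution
    # at the cost of build and search time.
    return sorted((_hash(f"{h}_{i}"), h) for h in hosts for i in range(replicas))

def pick_host(ring, key):
    # Request path: hash the key, then binary-search for the first ring
    # point clockwise of it, wrapping at the end of the ring.
    points = [p for p, _ in ring]
    idx = bisect.bisect(points, _hash(key)) % len(ring)
    return ring[idx][1]
```

Because each server contributes its own ring points independently, removing one server re-maps only the keys it owned, which is the property the later "recovering from failure" slide relies on.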
  15. Partitioning
      To the application, the proxy (localhost:6379) looks like a single instance of Redis. Client commands SET msg hello, INCR comm, and MGET lyft hello are fanned out across backends as SET msg hello, GET hello, INCR comm, and GET lyft, returning OK, 1, nil.
  16. Unsupported commands
      Any command with multiple keys is generally unsupported, since the keys may hash to different backends. Example: SUNION key1 key2
      Solution: "hash tagging" designates a portion of the key for hashing: SUNION {key}1 {key}2
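Hash tagging works because only the tagged substring feeds the hash, so `{key}1` and `{key}2` are guaranteed to land on the same backend. A sketch of the routing rule (modulo sharding stands in for the real consistent-hash ring, and the shard count is arbitrary):

```python
import hashlib

def routing_key(key):
    # If the key contains a non-empty {...} hash tag, only that portion
    # is hashed; otherwise the whole key is used.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # ignore an empty tag like "{}"
            return key[start + 1:end]
    return key

def shard(key, n=16):
    # Simplified: hash the routing key and pick one of n shards.
    h = int(hashlib.md5(routing_key(key).encode()).hexdigest(), 16)
    return h % n
```

With this rule, multi-key commands become safe as long as every key carries the same tag.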
  17. Maintaining High Availability
  18. Recovering from failure
      - When an instance is lost, rebuild the ring
      - When a new instance takes its place, rebuild the ring
      Consistent hashing re-allocates only a portion of the keyspace.
  19. More rebuilding
      - When an instance is lost, rebuild the ring
      - When a new instance takes its place, rebuild the ring
      - When an active healthcheck fails, rebuild the ring
      - When outlier detection ejects a host, rebuild the ring
      Optimization required!
  20. Consistent hashing: Maglev
      Maglev hashing algorithm
      - 10x faster ring build
      - 5x faster selection
      - Less variance between hosts
      - Slightly higher key redistribution on membership change
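A minimal sketch of Maglev's lookup-table build, assuming MD5-derived offset/skip permutations; the published algorithm and Envoy's implementation differ in hash choice and default table size.

```python
import hashlib

def _h(s, seed):
    # Illustrative hash; Maglev does not mandate MD5.
    return int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)

def maglev_table(hosts, size=65537):
    # `size` must be prime so each host's (offset, skip) walk visits
    # every slot exactly once.
    offsets = [_h(h, "offset") % size for h in hosts]
    skips = [_h(h, "skip") % (size - 1) + 1 for h in hosts]
    nxt = [0] * len(hosts)
    table = [None] * size
    filled = 0
    while filled < size:
        for i in range(len(hosts)):
            # Hosts take turns claiming the next empty slot in their own
            # permutation, so slot counts differ by at most one ("less
            # variance between hosts").
            while True:
                slot = (offsets[i] + nxt[i] * skips[i]) % size
                nxt[i] += 1
                if table[slot] is None:
                    table[slot] = hosts[i]
                    filled += 1
                    break
            if filled == size:
                break
    return table

def pick(table, key):
    # Selection is one hash plus an array index, versus a binary search
    # over ring points in ketama.
    return table[_h(key, "key") % len(table)]
```

The trade-off on the slide follows from the structure: the flat table makes build and lookup fast and balanced, but when membership changes, rebuilding it can move slightly more keys than a ketama ring would.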
  21. Fault injection
      Now
      - Chaos Monkey
      - Envoy HTTP fault injection (latency, error)
      TODO
      - TCP
      - Redis-specific: target certain commands (openfip / redfi)
  22. Stats
      Mix of stats from Envoy and Redis:
      - Per-backend RPS
      - Command RPS
      - CPU, memory, network
      - Hit rate
      - Key count
      - Connection count

      {% macro redis_cluster_stats(redis_cluster_name, alarm_thresholds) %}
  23. redis-look
      $ redis-look-monitor.py -n 2 --estimate-throughput
      ^C
      32072 commands in 2.54 seconds (12605.22 cmd/s)

      * top by key
        count  avg/s    %     key
        136    53.45    0.4   count:1033422222177010026
        136    53.45    0.4   count:1004894103322111029
      * top by command
        count  avg/s    %     command
        8198   3222.05  25.6  GET
        6746   2651.37  21.0  ZREMRANGEBYSCORE
      * top by command and key
        count  avg/s    %     command and key
        115    45.20    0.4   GET healthcheck
        115    45.20    0.4   GET params
      * top by est. throughput
        est. bytes  count  throughput  throughput/s  key
        1MB         72     72MB        32MB          attr:1004893923555550610
        434B        99     42.0K       16.5K         attr:1004897644432010001

      The throughput cost of large keys is real. redis-cli --bigkeys can identify large keys, but it samples and gives no frequency information. danielhochman / redis-look
  24. Serialization
      Benefits of a smaller format:
      - Lower memory consumption and I/O
      - Lower network I/O
      - Lower serialization cost
      Example payload sizes: 1012 bytes (original), 708 bytes (70%), 190 bytes (18%)
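The slide doesn't name the formats behind those byte counts. As an illustration of how much encoding choice alone changes payload size, here is a hypothetical location record (the schema is invented for this example) serialized three ways:

```python
import json
import struct

# Hypothetical record; the talk does not show the actual schema or the
# formats behind the 1012/708/190-byte figures.
record = {"lat": 37.7749, "lng": -122.4194, "ts": 1524000000, "driver_id": 12345}

verbose = json.dumps(record, indent=2).encode()               # pretty-printed JSON
compact = json.dumps(record, separators=(",", ":")).encode()  # same data, no whitespace
packed = struct.pack("<ddIq", record["lat"], record["lng"],
                     record["ts"], record["driver_id"])       # fixed binary layout, 28 bytes
```

The fixed binary layout drops field names entirely, which is where most of the savings come from; schema-based formats like protobuf achieve the same effect while staying evolvable.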
  25. Load testing
      - Injecting extra bytes
      - Oplog replay at higher speed (difficult)
      - Simulated rides: practical load test in production; tests business logic and infrastructure; weekly cadence
      (Chart: RPS over time, real vs. simulated vs. total system, during a load test)
  26. Spectre
      - First week of January: 25%+ performance loss
      - Identified required migrations with load testing
      - Migrated half of the fleet from C4 to C5; migration completed in 3 days
      - 20% performance gain week over week
      (Chart: Redis CPU time, week before Spectre vs. Spectre week vs. after C5 migration)
  27. Roadmap
      - Envoy has feature parity with Nutcracker (except hash tagging)
      - Documentation on a minimal configuration for Envoy as a Redis proxy
      - Replication
      - Request and response dumping (i.e. an oplog)
  28. Q&A
      - Thanks!
      - @danielhochman on GitHub and Twitter
      - Participate in Envoy open source: envoyproxy / envoy
      - Lyft is hiring. Talk to me or visit https://www.lyft.com/jobs.
