Daniel Hochman, Engineer
1,000 2,000 Instances and Beyond
Agenda
- From the ground up
- Provisioning
- Clustering
- Maintaining high availability
- Handling system failures
- Observability
- Load testing
- Roadmap
Case study: scaling a geospatial index
Operating Redis on the Lyft platform
RedisConf 2017
By the numbers
2017: 50 clusters, 750 instances, 15M QPS peak, Twemproxy
2018: 64 clusters (+14), 2,000 instances (+1,250), 25M QPS peak (+10M), Envoy
Migrated entire Redis infrastructure.
Consistency?
- Lyft runs with no replication
- No AOF, no RDB
- "Best-effort"
- No consistency guarantees
- If an instance is lost, data is gone
Real-time nature of service means most data is dynamic and refreshed often.
From the ground up
Provisioning clusters
- Every Redis cluster is an EC2 autoscaling group
- Each service defines and deploys its own cluster
asg.present:
  - name: locationsredis
  - image: ubuntu16_base
  - launch_config:
    - cloud_init:
        #!/bin/bash
        NAME=locationsredis
        SERVICE=redis
        curl s3://provision.sh | sh
  - instance_type: c5.large
  - min_size: 60
  - max_size: 60
Provisioning instances
- Central provisioning templates
- Include and override
include /etc/lyft/redis/redis-defaults.conf
# overrides
bind 0.0.0.0
save ""
port {{ port }}
maxmemory-policy {{ get(maxmemory_policy, 'allkeys-lru') }}
{% if environment == 'production' %}
rename-command KEYS ""
rename-command CONFIG CAREFULCONFIG
{% endif %}
Twemproxy (deprecated)
- Also known as Nutcracker
- Unmaintained, replaced with closed-source
- No active healthcheck
- No hot restart (config changes cause downtime)
- Difficult to extend (e.g. to integrate with instance discovery)
[Figure: commit activity graph]
Envoy Proxy
- Open-source
- Built for edge and mesh networking
- Observability: stats, stats, stats
- Dynamic configuration
- Pluggable architecture
- Out-of-process
- Thriving ecosystem
- Redis, DynamoDB, MongoDB codecs
Discovery
Discovery service:
  GET /members/locationsredis
  POST /members/locationsredis
Membership is eventually consistent.
[Diagram: hosts and the discovery service, with intervals labeled 30s and 60s]
locationsredis:
  - 10.0.0.1:6379, 40s ago
  - 10.0.0.2:6379, 23s ago
  ...
  - 10.0.0.9:6379, 12s ago
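The "N s ago" timestamps in the membership list suggest a heartbeat registry. A minimal Python sketch, assuming hosts re-register periodically and that entries older than a TTL are dropped. The class name, method names, and the 60-second TTL are hypothetical and are not Lyft's discovery API.

```python
import time


class MembershipRegistry:
    """Toy heartbeat registry: hosts expire when they stop checking in."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # hypothetical expiry window
        self.clock = clock
        self.last_seen = {}  # (service, address) -> time of last heartbeat

    def post_member(self, service, address):
        """Roughly what POST /members/<service> would record."""
        self.last_seen[(service, address)] = self.clock()

    def get_members(self, service):
        """Roughly what GET /members/<service> would return."""
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self.last_seen.items()
            if svc == service and now - seen <= self.ttl_seconds
        )
```

Because a lost host only disappears once its last heartbeat ages out, readers of the list can briefly see stale members, which is one way membership ends up eventually consistent.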
Active healthchecking
> PING
"PONG"
> EXISTS _maintenance_
(integer) 0
> SET _maintenance_ true
OK
> EXISTS _maintenance_
(integer) 1
Send a command periodically to check for a healthy response.
healthcheck:
  unhealthy_threshold: 3
  healthy_threshold: 2
  interval: 5s
  interval_jitter: 0.1s
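How the two thresholds interact can be sketched as a small state machine. This is a simplified Python illustration, assuming a host flips state only after the configured number of consecutive contradicting results; Envoy's real behavior has additional cases. The class and function names are made up for this example.

```python
class HealthState:
    """Tracks one host's health using consecutive-result thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, passed):
        """Feed one healthcheck result; return the resulting health state."""
        if passed == self.healthy:
            self.streak = 0  # result agrees with the current state
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


def check_passed(ping_reply, maintenance_exists):
    """A host passes if it answers PING and the maintenance key is absent."""
    return ping_reply == "PONG" and maintenance_exists == 0
```

With the values above, a single failed check never removes a host; three in a row do, and two consecutive passes bring it back.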
Passive healthchecking
Monitor the success and failure of operations and eject outliers.
outlier_detection:
  consecutive_failure: 30
  success_rate_stdev: 1
  interval: 3s
  base_ejection_time: 3s
Panic routing thresholds ensure that we don't eject everything.
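The idea behind a panic threshold can be sketched in a few lines of Python. This is an illustration under assumptions, not Envoy's implementation: the function name and the 0.5 threshold are hypothetical.

```python
def routable_hosts(hosts, ejected, panic_threshold=0.5):
    """Return the hosts eligible for traffic.

    hosts: list of host addresses; ejected: set of currently ejected hosts.
    panic_threshold is a hypothetical value chosen for this sketch.
    """
    if not hosts:
        return []
    healthy = [h for h in hosts if h not in ejected]
    if len(healthy) / len(hosts) < panic_threshold:
        # Too many ejections: ignore health status and route to everyone,
        # rather than piling all traffic onto a few survivors.
        return list(hosts)
    return healthy
```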
Consistent hashing
cluster:
  name: locationsredis
  lb_policy: ring_hash
Ketama algorithm
Initialization: Hash each server n times to an integer
  e.g. hash(10.0.0.1_1) = 15
Request:
  1. Hash a key to an integer
     e.g. GET lyft ➝ hash(lyft) = 10
  2. Search for the range that contains the key
Larger n?
- Better distribution
- Longer ring initialization
- Longer search time
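The initialization and request steps can be sketched in Python as follows. This is a minimal ring in the spirit of Ketama, not the exact Ketama or Envoy implementation: MD5 truncated to 64 bits and the "server_i" naming are assumptions for this sketch, and the ring must be built with at least one server.

```python
import bisect
import hashlib


def hash_to_int(value):
    """Map a string to a 64-bit integer using MD5 (an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, n=100):
        # Initialization: hash each server n times to an integer.
        ring = sorted(
            (hash_to_int(f"{server}_{i}"), server)
            for server in servers
            for i in range(n)
        )
        self.points = [point for point, _ in ring]
        self.owners = [server for _, server in ring]

    def lookup(self, key):
        # Request: hash the key, then binary-search for the first ring point
        # at or after it, wrapping around to the start of the ring.
        index = bisect.bisect_left(self.points, hash_to_int(key))
        return self.owners[index % len(self.owners)]
```

A larger n adds more points per server, which evens out the distribution but makes the sort at initialization and the binary search at lookup operate on a longer list.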
[Figure: hash ring with points labeled 1 and 15]
Partitioning
Application to proxy (localhost:6379):
  SET msg hello
  INCR comm
  MGET lyft hello
Proxy to Redis instances:
  SET msg hello
  GET hello
  INCR comm
  GET lyft
Responses shown: OK, 1, nil
To the application, the proxy looks like a single instance of Redis.
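A toy Python model of this behavior, assuming each backend is a plain dict and keys are placed by a simple hash-mod rather than a hash ring. The class and function names are hypothetical; the point is only that a multi-key MGET is split into per-key GETs and reassembled in request order.

```python
import hashlib


def pick_backend(key, backends):
    """Choose a backend for a key (simple hash-mod here, not a hash ring)."""
    digest = hashlib.md5(key.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


class ToyProxy:
    """Looks like one Redis to the caller; each backend is a plain dict."""

    def __init__(self, backend_count=3):
        self.backends = [{} for _ in range(backend_count)]

    def set(self, key, value):
        pick_backend(key, self.backends)[key] = value
        return "OK"

    def get(self, key):
        return pick_backend(key, self.backends).get(key)

    def incr(self, key):
        store = pick_backend(key, self.backends)
        store[key] = int(store.get(key, 0)) + 1
        return store[key]

    def mget(self, *keys):
        # Split into one GET per key, then reassemble in request order.
        return [self.get(key) for key in keys]
```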
Unsupported commands
Any command with multiple keys is generally unsupported.
Example:
SUNION key1 key2
Solution:
"Hash tagging" designates a portion of the key for hashing.
SUNION {key}1 {key}2
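Extracting the hashed portion of a key can be sketched as below. This follows the brace convention used by Redis Cluster (first "{" and the first "}" after it, ignored when empty); other proxies may configure the tag characters differently, and the function name is made up for this example.

```python
def hash_portion(key):
    """Return the part of the key to hash: the text inside the first {...}.

    Falls back to the whole key when there is no non-empty tag.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # require at least one character between braces
            return key[start + 1:end]
    return key
```

Since {key}1 and {key}2 both hash on "key", they land on the same instance and SUNION can be served by that instance alone.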
Maintaining high availability
Recovering from failure
When an instance is lost, rebuild the ring
When a new instance takes its place, rebuild the ring
[Figure: hash ring at t0, t1, and t2]
Consistent hashing only re-allocates a portion of the keyspace.
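The claim can be checked with a small Python experiment that compares a hash ring against naive hash-mod placement when one of ten instances is lost. The hash function, addresses, key names, and counts are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def h(value):
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(servers, n=100):
    ring = sorted((h(f"{s}_{i}"), s) for s in servers for i in range(n))
    return [p for p, _ in ring], [s for _, s in ring]


def ring_owner(ring, key):
    points, owners = ring
    return owners[bisect.bisect_left(points, h(key)) % len(owners)]


def modulo_owner(servers, key):
    return servers[h(key) % len(servers)]


def moved_fraction(owner_before, owner_after, keys):
    moved = sum(1 for k in keys if owner_before(k) != owner_after(k))
    return moved / len(keys)


servers = [f"10.0.0.{i}:6379" for i in range(1, 11)]
survivors = servers[:-1]  # the last instance is lost
keys = [f"key:{i}" for i in range(5000)]

ring_before, ring_after = build_ring(servers), build_ring(survivors)
ring_moved = moved_fraction(
    lambda k: ring_owner(ring_before, k),
    lambda k: ring_owner(ring_after, k),
    keys,
)
modulo_moved = moved_fraction(
    lambda k: modulo_owner(servers, k),
    lambda k: modulo_owner(survivors, k),
    keys,
)
print(f"ring: {ring_moved:.0%} of keys moved, modulo: {modulo_moved:.0%}")
```

On the ring, only keys that belonged to the lost instance change owner; every other key still finds the same next point, so its placement is untouched.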
More rebuilding
When an instance is lost, rebuild the ring
When a new instance takes its place, rebuild the ring
When active healthcheck fails, rebuild the ring
When outlier detection ejects a host, rebuild the ring
Optimization required!
Consistent hashing
Maglev hashing algorithm
- 10x faster ring build
- 5x faster selection
- Less variance between hosts
- Slightly higher key redistribution on membership change
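A compact Python sketch of Maglev-style table construction, following the table-population procedure from the published Maglev design. It is not Envoy's implementation: the MD5-based hashes, the seed strings, and the small table size of 251 are assumptions for this example.

```python
import hashlib


def _h(value, seed):
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def build_maglev_table(backends, table_size=251):
    """Build a Maglev lookup table; table_size must be prime."""
    if not backends:
        raise ValueError("at least one backend is required")
    # Each backend gets its own permutation of table slots, defined by an
    # offset and a skip; a prime table size makes every skip visit all slots.
    offsets = [_h(b, "offset") % table_size for b in backends]
    skips = [_h(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_index = [0] * len(backends)
    table = [None] * table_size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_index[i] * skips[i]) % table_size
            while table[slot] is not None:
                next_index[i] += 1
                slot = (offsets[i] + next_index[i] * skips[i]) % table_size
            table[slot] = backend
            next_index[i] += 1
            filled += 1
            if filled == table_size:
                return table


def maglev_lookup(table, key):
    # Selection is a single hash and an array index, with no search.
    return table[_h(key, "key") % len(table)]
```

Because backends claim slots in strict turns, their slot counts differ by at most one, which is where the low variance between hosts comes from. Selection avoids the binary search a ring needs.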
Fault injection
Now
- Chaos Monkey
- Envoy HTTP fault injection
  - Latency
  - Error
TODO
- TCP
- Redis-specific
- Target certain commands
openfip / redfi
Stats
Mix of stats from Envoy and Redis
- Per-backend RPS
- Command RPS
- CPU
- Memory
- Network
- Hit rate
- Key count
- Connection count
{% macro redis_cluster_stats(redis_cluster_name, alarm_thresholds) %}
redis-look
$ redis-look-monitor.py -n 2 --estimate-throughput
^C 32072 commands in 2.54 seconds (12605.22 cmd/s)

* top by key
  count    avg/s     %  key
    136    53.45   0.4  count:1033422222177010026
    136    53.45   0.4  count:1004894103322111029

* top by command
  count    avg/s     %  command
   8198  3222.05  25.6  GET
   6746  2651.37  21.0  ZREMRANGEBYSCORE

* top by command and key
  count    avg/s     %  command and key
    115    45.20   0.4  GET healthcheck
    115    45.20   0.4  GET params

* top by est. throughput
  est. bytes  count  throughput  throughput/s  key
         1MB     72        72MB          32MB  attr:1004893923555550610
        434B     99       42.0K         16.5K  attr:1004897644432010001
Throughput cost of large keys is real.
redis-cli --bigkeys can identify large keys, but it is sampled and reports no access frequency.
danielhochman / redis-look
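The kind of counting behind a "top by command" and "top by key" report can be sketched in Python over Redis MONITOR output lines. This is an illustration only and not redis-look's code; the function name is hypothetical, and it assumes the first argument of each command is the key.

```python
import shlex
from collections import Counter


def summarize_monitor(lines):
    """Count commands and keys from Redis MONITOR-style output lines."""
    by_command, by_key = Counter(), Counter()
    for line in lines:
        # Format: <timestamp> [<db> <client>] "<command>" "<arg>" ...
        _, _, rest = line.partition("] ")
        parts = shlex.split(rest)
        if not parts:
            continue  # skip lines that carry no command
        by_command[parts[0].upper()] += 1
        if len(parts) > 1:
            by_key[parts[1]] += 1  # assumes the first argument is the key
    return by_command, by_key
```

Calling most_common() on either counter gives the ranked view; dividing counts by the capture duration gives a per-second rate.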
Serialization
Benefits of smaller format
- Lower memory consumption, I/O
- Lower network I/O
- Lower serialization cost
[Chart: payload size by format: 1012 bytes (original), 708 bytes (70%), 190 bytes (18%)]
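The size effect of a more compact format can be illustrated in Python by encoding the same record as JSON and as a fixed binary layout. The slides do not name the formats or fields involved, so the record, field names, and layout below are hypothetical.

```python
import json
import struct

# Hypothetical record; the slide does not name its fields or formats.
record = {
    "driver_id": 1004893923555550610,
    "lat": 37.7749,
    "lng": -122.4194,
    "ts": 1515000000,
}

as_json = json.dumps(record).encode()

# Fixed binary layout: unsigned 64-bit id, two doubles, unsigned 32-bit timestamp.
LAYOUT = struct.Struct("<QddI")
as_binary = LAYOUT.pack(record["driver_id"], record["lat"], record["lng"], record["ts"])

# Decoding recovers the same values, with field names supplied by the reader.
decoded = dict(zip(("driver_id", "lat", "lng", "ts"), LAYOUT.unpack(as_binary)))

print(len(as_json), "bytes as JSON;", len(as_binary), "bytes as packed binary")
```

The binary form drops field names and text-encoded digits, which is where most of the savings come from; the cost is that writer and reader must agree on the layout out of band.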
Load testing
- Injecting extra bytes
- Oplog replay at higher speed (difficult)
- Simulated Rides
- Practical load test in production
- Test business logic and infrastructure
- Weekly cadence
[Figure: System during load test. RPS over time for real, simulated, and total traffic]
Spectre
- First week of January
- 25%+ performance loss
- Identified required migrations with load testing
- Migrated half of fleet from C4 to C5
- Migration completed in 3 days
- 20% performance gain
[Figure: Week-over-week Redis CPU over time, showing the week before Spectre, Spectre week, and the migration to C5]
Roadmap
- Envoy has feature parity with Nutcracker (except hash tagging)
- Documentation on minimal configuration for Envoy as Redis proxy
- Replication
- Request and response dumping (i.e. oplog)
Q&A
- Thanks!
- @danielhochman on GitHub and Twitter
- Participate in Envoy open source! envoyproxy / envoy
- Lyft is hiring. Talk to me or visit https://www.lyft.com/jobs.

RedisConf18 - 2,000 Instances and Beyond
