Open Source Redis is not only the fastest NoSQL database but also the most popular among the new wave of databases running in containers. This talk introduces the data structures used to speed up applications and solve the everyday use cases that are driving Redis' popularity.
Who?
Me: Adi Foulger, Pro Geek @ Redis Labs, the open source home and provider of enterprise Redis
About Redis Labs:
• 5300+ paying customers, 35k+ free customers
• 118k Redis databases managed
• Withstood several datacenter outages with no loss of data for customers who chose HA options
• Two main products: Redis Cloud, Redis Labs Enterprise Cluster
What?
Redis is an open source (BSD licensed),
in-memory data structure store,
used as a database, cache and message broker
and much more!
REmote DIctionary Server
• Data Structure Database
• An overly simplistic definition:
CREATE TABLE redis (
k VARCHAR(512MB) NOT NULL,
v VARCHAR(512MB),
PRIMARY KEY (k)
);
• 8ish data types, 180+ commands, blazing fast
• Created by @antirez (a.k.a Salvatore Sanfilippo)
• v1.0 August 9th, 2009 … v3.2 May 6th, 2016
• Source: https://github.com/antirez/redis
• Website: http://redis.io
Redis: A Data Structure Database
Data structures are used by developers like "Lego" building blocks, saving them much coding effort and time:
Strings, Hashes, Lists, Sets, Sorted Sets, Bitmaps, HyperLogLogs, Geospatial indexes
Why? Because It Is Fun!
• Simplicity: rich functionality, great flexibility
• Performance: easily serves 100Ks of ops/sec
• Lightweight: ~2MB footprint
• Production proven (name dropping)
Redis Makes You Think
• about how data is stored
• about how data is accessed
• about efficiency
• about performance
• about the network
• …
• Redis is a database construction kit
• Beware of Maslow's "golden hammer"/law of the instrument: "If all you have is a hammer, everything looks like a nail"
Key Points About Key Names
• Key names are "limited" to 512MB (as are the values, btw)
• To conserve RAM & CPU, try to avoid using unnecessarily_longish_names_for_your_redis_keys because they are more expensive to store and compare (unlike an RDBMS's column names, key names are saved for each key-value pair)
• On the other hand, don't be too stringent (e.g. 'u:<uid>:r')
• Although not mandatory, the convention is to use colons (':') to separate the parts of the key's name
• Your schema is your keys' names so keep them in order
STRINGs
• Are the most basic data type
• Are binary-safe
• Are used for storing:
Strings (duh) – APPEND, GETRANGE, SETRANGE, STRLEN
Integers – INCR, INCRBY, DECR, DECRBY
Floats – INCRBYFLOAT
Bits – SETBIT, GETBIT, BITPOS, BITCOUNT, BITOP
http://xkcd.com/171/
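As a sketch of how these commands behave, here is a tiny in-memory stand-in (a hypothetical `MiniStrings` class, not the real redis-py client) that mimics the semantics of SET, APPEND, INCRBY and STRLEN:

```python
# Hypothetical stand-in mimicking Redis STRING command semantics.
class MiniStrings:
    def __init__(self):
        self.store = {}

    def set(self, k, v):
        self.store[k] = str(v)

    def append(self, k, v):
        # APPEND: concatenate to the existing value (empty string if missing),
        # returning the new length, as the real command does
        self.store[k] = self.store.get(k, '') + v
        return len(self.store[k])

    def incrby(self, k, n=1):
        # INCRBY: interpret the stored string as an integer and add n;
        # a missing key starts from 0
        val = int(self.store.get(k, '0')) + n
        self.store[k] = str(val)
        return val

    def strlen(self, k):
        return len(self.store.get(k, ''))

r = MiniStrings()
r.set('greeting', 'Hello')
print(r.append('greeting', ', Redis'))   # -> 12
print(r.incrby('hits', 5))               # -> 5
```

Note how the same value can be treated as text or as a number depending on the command, which is exactly what makes STRINGs usable as counters.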
Pattern: Caching Calls to the DB
• Motivation: quick responses, reduce load on DBMS
• How: keep the statement's results using the Redis STRING data type
def get_results(sql):
    key = hashlib.md5(sql.encode()).hexdigest()
    result = redis.get(key)
    if result is None:
        result = db.execute(sql)
        redis.set(key, result)
        # or use redis.setex to set a TTL for the key
    return result
The HASH Data Type
• Acts as a Redis-within-Redis: contains key-value pairs
• Have their own commands: HINCRBY, HINCRBYFLOAT, HLEN,
HKEYS, HVALS…
• Usually used for aggregation, i.e. keeping related data together
for easy fetching/updating (remember that Redis is not a
relational database). Example:
Using separate keys Using hash aggregation
user:1:id 1 user:1 id 1
user:1:fname Foo fname Foo
user:1:lname Bar lname Bar
user:1:email foo@acme.com email foo@acme.com
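To make the comparison concrete, here is a sketch of the two layouts using plain Python dicts as a stand-in for the keyspace (with redis-py, the hash layout would be written with `hset` and fetched with a single `hgetall`):

```python
# Layout 1: one STRING key per field -> four GET round trips per user
separate = {
    'user:1:id': '1',
    'user:1:fname': 'Foo',
    'user:1:lname': 'Bar',
    'user:1:email': 'foo@acme.com',
}

# Layout 2: a single HASH key aggregating all fields -> one HGETALL
aggregated = {
    'user:1': {'id': '1', 'fname': 'Foo', 'lname': 'Bar',
               'email': 'foo@acme.com'},
}

# Reassembling the user from separate keys needs string surgery...
user_via_strings = {k.rsplit(':', 1)[-1]: v for k, v in separate.items()}
# ...while the hash layout hands back the whole record at once
user_via_hash = aggregated['user:1']
print(user_via_strings == user_via_hash)   # -> True
```

Both layouts hold the same data; the hash keeps related fields under one key name, which is the aggregation benefit the table above illustrates.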
Pattern: Avoiding Calls to the DB
• Motivation: server-side storage and sharing of transient data that doesn't need a
full-fledged RDBMS, e.g. sessions and shopping carts
• How: depending on the case, use STRING or HASH to store data in Redis
def add_to_cart(session, product, quantity):
    if quantity > 0:
        redis.hset('cart:' + session, product, quantity)
    else:
        redis.hdel('cart:' + session, product)
    redis.expire('cart:' + session, CART_TIMEOUT)

def get_cart_contents(session):
    return redis.hgetall('cart:' + session)
Pattern: Counting Things
• Motivation: statistics, real-time analytics, dashboards, throttling
• How #1: use the *INCR commands
• How #2: use a little bit of BIT*
def user_log_login(uid):
    joined = redis.hget('user:' + uid, 'joined')
    d0 = datetime.strptime(joined, '%Y-%m-%d').date()
    d1 = date.today()
    delta = (d1 - d0).days
    redis.setbit('user:' + uid + ':logins', delta, 1)

def user_logins_count(uid):
    return redis.bitcount('user:' + uid + ':logins', 0, -1)
De-normalization
• Non-relational: no foreign keys, no referential integrity constraints
• Thus, data normalization isn't practical (mostly)
• Be prepared to have duplicated data, e.g.:
> HSET user:1 country Mordor
> HSET user:2 country Mordor
…
• Tradeoff:
• Processing Complexity ↔ Data Volume
LISTs
• Lists of strings sorted by insertion order
• Usually have a head and a tail
• Top n, bottom n and constant-length list operations, as well as passing items from one list to another, are extremely popular and extremely fast
Pattern: Lists of Items
• Motivation: keeping track of a sequence, e.g. last viewed profiles
• How: use Redis' LIST data type
def view_product(uid, product):
    redis.lpush('user:' + uid + ':viewed', product)
    redis.ltrim('user:' + uid + ':viewed', 0, 9)
…
def get_last_viewed_products(uid):
    return redis.lrange('user:' + uid + ':viewed', 0, -1)
Pattern: Queues
• Motivation: a producer-consumer use case, asynchronous job
management, e.g. processing photo uploads
def enqueue(queue, item):
    redis.lpush(queue, item)

def dequeue(queue):
    return redis.rpop(queue)
    # or use brpop for a blocking pop
SETs
• Unordered collections of strings
• ADD, REMOVE or TEST for membership in O(1)
• Unions, intersections and differences are computed very, very fast
Pattern: Searching
• Motivation: finding keys in the database, for example all the users
• How #1: use a LIST to store key names
• How #2: the *SCAN commands
def do_something_with_all_users():
    cursor = 0
    while True:
        cursor, data = redis.scan(cursor, match='user:*')
        do_something(data)
        if cursor == 0:
            break
Pattern: Indexing
• Motivation: Redis doesn't have indices; you need to maintain them yourself
• How: the SET data type (a collection of unordered unique
members)
def update_country_idx(country, uid):
    redis.sadd('country:' + country, uid)

def get_users_in_country(country):
    return redis.smembers('country:' + country)
Pattern: Relationships
• Motivation: Redis doesn't have foreign keys; you need to maintain them yourself
> SADD user:1:friends 3 4 5 // Foo is social and makes friends
> SCARD user:1:friends // How many friends does Foo have?
> SINTER user:1:friends user:2:friends // Common friends
> SDIFF user:1:friends user:2:friends // Exclusive friends
> SUNION user:1:friends user:2:friends // All the friends
ZSETs (Sorted Sets)
Are just like SETs:
• Members are unique
• ZADD, ZCARD, ZINCRBY, …
ZSET members have a score that's used for sorting
• ZCOUNT, ZRANGE, ZRANGEBYSCORE
When the scores are identical, members are sorted alphabetically
Lexicographical ranges are also supported:
• ZLEXCOUNT, ZRANGEBYLEX
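A minimal sketch of the ordering rule above, using a plain dict as a stand-in for one sorted set (not the real client; `zadd` and `zrange` here only mimic the command semantics, with `zadd` taking the score before the member as ZADD does):

```python
zset = {}   # member -> score; stand-in for a single sorted set

def zadd(score, member):
    # ZADD: adding an existing member just updates its score
    zset[member] = score

def zrange(start, stop):
    # ZRANGE: ascending by score; equal scores fall back to
    # lexicographic ordering of the members
    ordered = sorted(zset, key=lambda m: (zset[m], m))
    stop = len(ordered) if stop == -1 else stop + 1
    return ordered[start:stop]

zadd(10, 'alice')
zadd(5, 'bob')
zadd(10, 'carol')
print(zrange(0, -1))   # -> ['bob', 'alice', 'carol']
```

'alice' and 'carol' share score 10, so they come back in alphabetical order, which is the tie-breaking rule the slide describes.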
Pattern: Sorting
Motivation: anything that needs to be sorted
How: ZSETs
> ZADD friends_count 3 1 1 2 999 3    (scores are friends counts, members are uids)
> ZREVRANGE friends_count 0 -1
1) "3"
2) "1"
3) "2"
The SORT Command
• A command that sorts LISTs, SETs and SORTED SETs
• SORT's syntax is the most complex (comparatively), but SQLers should feel right at home with it:
• SORT key [BY pattern] [LIMIT offset count]
[GET pattern [GET pattern ...]]
[ASC|DESC] [ALPHA]
[STORE destination]
• SORT is also expensive in terms of complexity O(N+M*log(M))
• BTW, SORT is perhaps the only ad-hoc-like command in Redis
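As an illustration of what `SORT key BY pattern GET pattern` computes, here is a Python sketch over a hypothetical stand-in keyspace (the key names `mylist`, `weight_*` and `data_*` are made up for the example; this is not the real command, just its logic):

```python
# Stand-in keyspace: a list plus per-element weight and data keys
keyspace = {
    'mylist': ['1', '3', '2'],
    'weight_1': 30, 'weight_2': 10, 'weight_3': 20,
    'data_1': 'a', 'data_2': 'b', 'data_3': 'c',
}

def sort_by_get(list_key, by, get):
    # BY: order the list's elements by an external key, substituting
    # '*' in the pattern with each element (like SQL's ORDER BY a join)
    items = sorted(keyspace[list_key],
                   key=lambda e: keyspace[by.replace('*', e)])
    # GET: for each sorted element, return the value of another
    # patterned key instead of the element itself (like a SELECT list)
    return [keyspace[get.replace('*', e)] for e in items]

print(sort_by_get('mylist', 'weight_*', 'data_*'))   # -> ['b', 'c', 'a']
```

Element '2' has the lowest weight (10), so its data comes first, which is why SORT feels like a tiny ORDER BY + JOIN for SQL users.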
Pattern: Counting Unique Items
• How #1: SADD items and SCARD for the count
• Problem: more unique items means more RAM
• How #2: the HyperLogLog data structure
> PFADD counter item1 item2 item3 …
HLL is a probabilistic data structure that counts (PFCOUNT) unique items
Sacrifices accuracy: standard error of 0.81%
Gains: constant complexity and memory – 12KB per counter
Bonus: HLLs are merge-able with PFMERGE
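To give a feel for why the memory cost is constant, here is a toy probabilistic counter in the spirit of HyperLogLog (vastly simplified: real HLL uses 16384 6-bit registers in 12KB plus a harmonic-mean estimator, both of which this sketch omits):

```python
import hashlib

M = 64                       # toy number of registers (real HLL: 16384)
registers = [0] * M          # fixed memory, no matter how many items arrive

def pf_add(item):
    # Hash the item to a 64-bit integer (deterministic, like PFADD)
    h = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], 'big')
    idx = h % M              # low bits pick a register
    rest = h // M
    # rank = position of the first 1-bit in the remaining bits;
    # long runs of leading zeros hint at many distinct items
    rank = 1
    while rest > 0 and rest % 2 == 0:
        rank += 1
        rest //= 2
    registers[idx] = max(registers[idx], rank)

for i in range(1000):
    pf_add('item-%d' % i)

print(len(registers))        # -> 64 : memory stayed constant
```

Adding 1,000 items (or a billion) never grows the structure; only the per-register maxima change, which is the constant-memory tradeoff the slide describes.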
Is Redis ACID? (mostly) Yes!
• Redis is (mostly) single threaded, hence every operation is
o Atomic
o Isolated
• WATCH/MULTI/EXEC allow something like transactions (no rollbacks)
• Server-side Lua scripts ("stored procedures") also behave like
transactions
• Durability is configurable and is a tradeoff between efficiency and
safety
• Consistency can be strengthened using the WAIT command
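A sketch of the optimistic check-and-set pattern behind WATCH/MULTI/EXEC, simulated against a plain dict (a stand-in, not the real client): EXEC aborts when the watched key has changed since WATCH, and the caller retries.

```python
store = {'balance': 100}
version = {'balance': 0}     # bumped on every write; stands in for WATCH's dirty flag

def watch(key):
    # WATCH: remember the key's state at the start of the transaction
    return version[key]

def exec_multi(key, seen, commands):
    # EXEC: abort (return None) if the watched key changed after WATCH,
    # otherwise run all queued commands atomically
    if version[key] != seen:
        return None
    for op in commands:
        op()
    version[key] += 1
    return True

def withdraw(amount):
    while True:                          # optimistic retry loop
        seen = watch('balance')
        balance = store['balance']
        if balance < amount:
            return False                 # business check between WATCH and EXEC
        queued = [lambda b=balance - amount: store.update(balance=b)]
        if exec_multi('balance', seen, queued):
            return True

print(withdraw(30), store['balance'])    # -> True 70
```

There is no rollback, matching the slide: a failed EXEC simply means nothing ran, so the loop re-reads and tries again.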
Wait, There's More!
• There are additional commands that we didn't cover
• Expiration and eviction policies
• Publish/Subscribe
• Data persistency and durability
• Server-side scripting with Lua
• Master-Slave(s) replication
• High Availability
• Clustering
Before
• Redis is ubiquitous for fast data, fits lots of cases (Swiss™ Army
knife)
• Some use cases need special care
• Open source has its own agenda
After
• Core still fits lots of cases
• Module extensions for special cases
• A new community-driven ecosystem
• “Give power to users to go faster”
What are they?
• Dynamically (server-)loaded libraries
• Future-compatible
• Will be (mostly) written in C
• (Almost) As fast as the core
• Planned for public release Q3 2016
What can I do with them?
• Process: where the data is at
• Compose: call core & other modules
• Extend: new structures, commands
It’s got layers!
• Operational: admin, memory, disk,
replication, arguments, replies…
• High-level: client-like access to core and
modules’ commands
• Low-level: (almost) native access to core data structures' memory
Spark Operation w/o Redis
[Diagram: Data Source → (1) Read to RDD → (2) Deserialization → (3) Processing → (4) Serialization → (5) Write to RDD → (6) Data Sink, feeding Analytics & BI]
Spark Operation with Redis
[Diagram: Data Source → Spark-Redis connector → (1) read filtered/sorted data → Processing → (2) write filtered/sorted data → Serving Layer → Analytics & BI, with Spark SQL & Data Frame on top]
Accelerating Spark Time-Series with Redis
Redis is faster by up to 100 times compared to HDFS and over 45 times compared to Tachyon or Spark process memory
Hadoop is Turbo Charged with Redis
Goal:
• Accelerate Hadoop operation by orders of magnitude:
  ‒ Phase 1 – use Redis as a caching solution for HBase (Hadoop's default database) and HDFS (Hadoop Distributed File System)
  ‒ Phase 2 – completely replace HBase with Redis
Milestone:
• Demonstrated Hadoop acceleration by HBase caching with Redis
Real-life scenario: ~500% acceleration
AOF rewrites until the disk is about to be full, then take a snapshot and start over. Automatically – no complex configuration needed.
Spark, the new in-memory distributed data processing framework, represents the next generation of big data analytics tools. Spark isn't a database; whenever it needs to process data, it has to read it unfiltered/unordered from the source, translate it into its internal data structures – RDDs (resilient distributed datasets) or Tungsten – and execute multiple deserialization and serialization steps before processing it. The same applies when Spark needs to save the data to a data sink.
With Redis' data structures exposed to Spark, its operations are tremendously simplified and accelerated. With Redis Labs' Spark-Redis connector, Spark can read only the data it needs for its processing directly from Redis, avoiding copying, serializing and deserializing it. The same offloading applies when Spark needs to store its processing results back in Redis. And when Redis is used as a serving layer for Spark, it can offload and accelerate Spark even further: whenever an SQL query arrives at Spark (via Spark SQL), it is first examined against the data in Redis, and if the data is found there, it is read directly from Redis and the entire Spark processing cycle is avoided.
The acceleration provided by Redis is easily demonstrated by this benchmark, performed on time-series data. When Redis sorted sets are used to store time-series data (in this case, stock prices for 1,024 stocks over the last 30 years), Spark queries execute 100 times faster compared to Spark using HDFS and 45 times faster compared to Spark using Tachyon or just in-process memory.