### Sampling characteristics of Redis sets

This website relies on random sampling to work properly, and for that I rely heavily on Redis - I pretty much make a bunch of sets in Redis, and union them and intersect them and pop from them all the time. I love this and I love Redis. However, throughout the development of this project I kept wondering just how random SPOP and SRANDMEMBER really are. They didn't feel random to me!

I have some background in statistics, so I know that true randomness, or even deterministic computer pseudorandomness, generally does not feel random to people: people expect random samples to repeat less often than they actually do. So I didn't really trust my instincts here. But I was curious enough to look into it, and it turns out that for sets of the size I'm working with, mostly in the low thousands, Redis is extremely not random - to the point that I can't trust it at all, and I rewrote all my code to avoid Redis sampling. So here's what Redis sampling actually looks like with sets of the size I'm dealing with.

#### What I did

All I did was draw random samples using Python's random library and an equivalent number of samples with the matching Redis commands. For populations of 10, 100, 1000, and 10000, I took a population-size sample 1000 times and made a histogram of the counts for each method. Code below if you're interested (yeah, I know it could be better and only needs one loop, but a) I'm not much of a programmer and b) I think it's more readable this way):

```python
import random

def python_choice(popsize, numiter):
    sampledict = {}
    # random.sample() no longer accepts sets (Python 3.11+), so use a list
    poplist = list(range(1, popsize + 1))
    for i in range(numiter):
        for j in range(popsize):
            sample = random.choice(poplist)
            sampledict.setdefault(sample, 0)
            sampledict[sample] += 1
    return sampledict.values()
```

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis server

def redis_choice(popsize, numiter):
    sampledict = {}
    r.delete('popset')
    pipe = r.pipeline()
    for i in range(1, popsize + 1):
        pipe.sadd('popset', i)
    pipe.execute()
    for i in range(numiter):
        pipe = r.pipeline()
        for j in range(popsize):
            pipe.srandmember('popset')
        samples = pipe.execute()
        for sample in samples:
            sampledict.setdefault(sample, 0)
            sampledict[sample] += 1
    return sampledict.values()
```

All of these counts should be approximately normally distributed with a mean of 1000, meaning most items in the population should have been sampled about 1000 times. That's definitely what I got with Python, but Redis was way off. The distribution was multimodal, with some population members getting sampled rarely and some getting sampled extremely frequently. This is what I would expect if Redis did two rounds of sampling (first picking a bucket, then picking a member of that bucket) while the buckets held wildly different numbers of elements. From what's on the Redis website, that is roughly what it does. Here are the histograms:

For each population size, the Python and Redis histograms are plotted on the same graph: the x-axis is the number of times a given member of the population was sampled, and the y-axis is the number of members that were sampled that many times. The mean for every graph should be 1000. I didn't make any graphs for sets larger than 10000 because I don't need them, but I'd be curious whether the behavior improves with larger sets. I would imagine it does if the buckets get bigger, but not if only the number of buckets increases.

As you can see, for populations of 10 and 100 Redis is basically equivalent to Python. But at 1000 and over it is really bad - unacceptably bad if you are relying on it. I could accept a larger variance, but the fact that some members get sampled very infrequently while others get sampled all the time is no good.

Here is what Redis says about the sampling algorithms used on the description page for SRANDMEMBER:

> The distribution of the returned elements is far from perfect when the number of elements in the set is small, this is due to the fact that we used an approximated random element function that does not really guarantees good distribution. The algorithm used, that is implemented inside dict.c, samples the hash table buckets to find a non-empty one. Once a non empty bucket is found, since we use chaining in our hash table implementation, the number of elements inside the bucket is checked and a random element is selected. This means that if you have two non-empty buckets in the entire hash table, and one has three elements while one has just one, the element that is alone in its bucket will be returned with much higher probability.
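To put numbers on that two-bucket example, here's a toy simulation of the behavior described. The bucket layout is invented for illustration; this is not Redis's actual code, just the sampling scheme the docs describe:

```python
import random
from collections import Counter

# Toy model of the described behavior: pick a non-empty bucket uniformly
# at random, then pick a random element within that bucket.
buckets = [["a", "b", "c"], ["d"]]  # the two-bucket example from the docs

trials = 100_000
counts = Counter()
for _ in range(trials):
    bucket = random.choice(buckets)       # each bucket is equally likely
    counts[random.choice(bucket)] += 1    # then uniform within the bucket

# "d" is alone in its bucket, so it comes up about half the time,
# while "a", "b", and "c" each come up about 1/6 of the time.
for member, n in sorted(counts.items()):
    print(member, round(n / trials, 2))
```

A uniform sampler would give every member 1/4; the member that sits alone in its bucket gets roughly triple that.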

And for SPOP:

> Note that this command is not suitable when you need a guaranteed uniform distribution of the returned elements. For more information about the algorithms used for SPOP, look up both the Knuth sampling and Floyd sampling algorithms.
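For the curious, here's a minimal sketch of Robert Floyd's sampling algorithm, one of the two the docs point to. This is my own illustration, not the dict.c implementation. Note that Floyd's trick itself produces every k-element subset with equal probability, which suggests the non-uniformity I observed comes from how elements are located in the hash table rather than from this step:

```python
import random

def floyd_sample(n, k):
    """Return a set of k distinct integers drawn uniformly from 1..n."""
    chosen = set()
    for j in range(n - k + 1, n + 1):
        t = random.randint(1, j)
        # If t was already picked, take j itself instead; this is Floyd's
        # trick, and it keeps every k-subset equally likely.
        chosen.add(j if t in chosen else t)
    return chosen
```

It does only k random draws and never rejects a sample, which is why it's attractive for popping multiple elements at once.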

I should have taken those descriptions more seriously, but I also think the two pages should carry the same, more detailed warning, because if the underlying issue can't be fixed, the behavior really doesn't match what's on the tin. It could be fixed if the underlying buckets all held about the same number of members, or if buckets were sampled with probability proportional to the number of items they contain; I don't know whether either is practical. The upshot for me is that I rewrote everything to not use Redis sampling, but if you rely on random sampling in Redis you should look into it! If you have any questions please feel free to email me at caleb@theshfl.com.
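To illustrate the size-weighted idea, here's a toy sketch: choose a bucket with probability proportional to its size, then an element uniformly within it. The bucket layout is invented for illustration, and this is just the idea, not a patch for Redis:

```python
import random
from collections import Counter

# Weighted-bucket idea: a bucket's chance of being picked is proportional
# to how many elements it holds, which makes every element equally likely.
buckets = [["a", "b", "c"], ["d"]]
weights = [len(b) for b in buckets]

trials = 100_000
counts = Counter()
for _ in range(trials):
    bucket = random.choices(buckets, weights=weights)[0]
    counts[random.choice(bucket)] += 1

# Every member should now land near 1/4.
for member, n in sorted(counts.items()):
    print(member, round(n / trials, 2))
```

The catch for a real hash table is that picking a bucket by weight requires knowing the bucket sizes up front, which is presumably why the uniform-bucket shortcut was used in the first place.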

Also if you want the R code for my absolutely beautiful histograms, here it is (made with some help):

```r
ggplot(data = counts) +
  geom_histogram(aes(x = Redis, y = ..count.., fill = "Redis"), binwidth = 5, alpha = 0.6) +
  geom_histogram(aes(x = Python, y = ..count.., fill = "Python"), binwidth = 5, alpha = 0.6) +
  scale_fill_manual(values = c("Redis" = "darkorange3", "Python" = "deepskyblue4"), name = "Sampler") +
  scale_x_continuous(name = "Times sampled", breaks = seq(0, 2000, 200)) +
  scale_y_continuous(name = "ID count") +
  ggtitle("ID sample count\n10000 samples from population of 10000\n1000 iterations") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.line = element_line(size = 1, colour = "black"),
        panel.grid.major = element_line(colour = "#d3d3d3"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(colour = "black", size = 9),
        axis.text.y = element_text(colour = "black", size = 9))
```

And hey, while you're here check out my website!

#### PS

I linked to this article on the Redis subreddit and antirez responded — it looks like this issue will be fixed for sure in Redis 6, and potentially backported to Redis 5.