# xkcd’s random generator

Here is the wrap-up of the coding challenge I proposed a few days ago, which gathered some interesting feedback, both in the comments and on reddit.

## Randomness

First of all, as many commenters pointed out, you can’t really test for randomness, so we’ll have to settle for a looser criterion, such as even distribution. Randomness usually implies even distribution but not the other way around. On the face of it, even distribution can seem like a very weak assertion (for example, the series “[1,2,3],[2,3,1],[3,1,2],[1,2,3],…” is evenly distributed but obviously not random) but if you can ascertain it with a data set that was generated from a fair random source, then you can probably make a good claim that your algorithm is reasonably unbiased.

## Shuffling

Let’s get the first question out of the way. The Fisher-Yates shuffle is very simple to explain: swap the last element with the element at a random index between 0 and n-1, then swap the next-to-last element with the element at a random index between 0 and n-2, and continue down the array. Here is the (simplified) Java library version, Collections.shuffle():

```java
public static void shuffle(List<?> list, Random rnd) {
    int size = list.size();
    for (int i = size; i > 1; i--)
        swap(list, i - 1, rnd.nextInt(i));
}
```

Note: this is actually *not* the original Fisher-Yates algorithm, which runs in O(n^2), but Durstenfeld’s improved version, which runs in linear time. For some reason, most people still call it the Fisher-Yates algorithm, probably because of its resemblance to the original.
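For reference, the original pencil-and-paper procedure repeatedly strikes a random remaining element off the list and appends it to the output; with an array-backed list, each removal shifts the tail, hence the quadratic cost. A sketch (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class OriginalFisherYates {
    // Pick a random remaining element, strike it out, append it to the result.
    // List.remove(int) shifts the tail left, so each pass is O(n): O(n^2) overall.
    public static <T> List<T> shuffle(List<T> input, Random rnd) {
        List<T> remaining = new ArrayList<>(input);
        List<T> result = new ArrayList<>(input.size());
        while (!remaining.isEmpty()) {
            result.add(remaining.remove(rnd.nextInt(remaining.size())));
        }
        return result;
    }
}
```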

Before I can show in how many ways you can actually get this wrong, we need to find out how we can measure the distribution of these results, which was the heart of the second question.

## Simple distribution measurements

Again, there are many ways to go about this and all of them have various compromises in time and space (mostly in space, actually). One of the most intuitive ways of reasoning about this is to maintain a matrix where each element [i,j] records how many times the digit i appeared at index j. If we call N the size of the array (for example 5) and RUNS the number of times we will invoke our function to get a representative sample (for example one million), we will generate one million arrays, so five million numbers. Since the matrix is NxN = 25, we would expect each element of the matrix to be approximately 5,000,000 / 25 = 200,000.
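The bookkeeping for this matrix can be sketched as follows (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CountMatrix {
    // mat[v][j] = how many times value v landed at index j over `runs` shuffles.
    public static long[][] build(int n, int runs, Random rnd) {
        long[][] mat = new long[n][n];
        for (int run = 0; run < runs; run++) {
            List<Integer> list = new ArrayList<>();
            for (int k = 0; k < n; k++) list.add(k);
            Collections.shuffle(list, rnd);
            for (int j = 0; j < n; j++) mat[list.get(j)][j]++;
        }
        return mat;
    }
}
```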

I will be using this approach in the rest of this article, but for the sake of completeness, let’s spend a short time considering other options.

You might be thinking that such a matrix doesn’t contain enough information. For example, while it will tell you that “2” appeared 201,000 times in position “3”, it doesn’t tell you if there was any pattern to these appearances or whether these appearances were coupled with other numbers in the array. Whatever additional data you want to measure, you will probably end up having to extend your matrix with either constant space or maybe N (e.g. turn it into a 3D matrix).

Another thought might be that instead of counting the occurrences of these numbers, we could calculate the average of the number that each index has contained. And if you have some education in statistics, you know that once you start getting into the idea of computing averages, you can’t avoid looking at standard deviations as well.
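One way to compute these (a sketch; names are mine; the spread is measured around the overall average (n-1)/2, which is consistent with both tables in this article):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class IndexStats {
    // For each index j, accumulate the sum and sum of squares of the values
    // seen there; deviation is measured around the overall average (n-1)/2.
    public static double[][] stats(int n, int runs, Random rnd) {
        double[] sum = new double[n], sumSq = new double[n];
        for (int run = 0; run < runs; run++) {
            List<Integer> list = new ArrayList<>();
            for (int k = 0; k < n; k++) list.add(k);
            Collections.shuffle(list, rnd);
            for (int j = 0; j < n; j++) {
                int v = list.get(j);
                sum[j] += v;
                sumSq[j] += (double) v * v;
            }
        }
        double mu = (n - 1) / 2.0;  // overall average of the values 0..n-1
        double[] avg = new double[n], stdev = new double[n];
        for (int j = 0; j < n; j++) {
            avg[j] = sum[j] / runs;
            // E[(x - mu)^2] = E[x^2] - 2*mu*E[x] + mu^2
            stdev[j] = Math.sqrt(sumSq[j] / runs - 2 * mu * avg[j] + mu * mu);
        }
        return new double[][] { avg, stdev };
    }
}
```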

Here is what we get for an evenly distributed algorithm:

```
Averages: [1.9988, 1.9992, 2.0023, 2.0011, 1.9983]
Stdev:    [1.4131, 1.4145, 1.4145, 1.4147, 1.4140]
```

The average is 2 since the elements range from 0 to 4 inclusive. The standard deviation is the square root of 2, because the variance of a uniform pick from {0, 1, 2, 3, 4} is (5^2 - 1)/12 = 2.

Here are the results for the identity function, which doesn’t distribute at all:

```
Averages: [0.0, 1.0, 2.0, 3.0, 4.0]
Stdev:    [2.0, 1.0, 0.0, 1.0, 2.0]
```

Unsurprisingly, the average of the numbers appearing in position 0 is 0, and the average in position 4 is 4. The elements in positions 0 and 4 are 2 away from the overall average of 2, and so on.

So we can get some approximate information about how evenly distributed our function is with a simple average and standard deviation, but if we want to dig further, we need to go back to our count matrix.

## The count matrix

Here is how the identity function spreads on the count matrix:

```
 1000000      0      0      0      0
0 1000000      0      0      0
0      0 1000000      0      0
0      0      0 1000000      0
0      0      0      0 1000000
```

And here is a typical run for Collections.shuffle():

```
2000455 2000514 2000895 1998349 1999787
2001463 1999130 1999621 1999253 2000533
1997862 2000217 2001450 2000810 1999661
1999117 1999870 1998689 2000809 2001515
2001103 2000269 1999345 2000779 1998504
```

This looks good but it’s a bit difficult to read, so instead, let’s count how many of these numbers differ from the expected value by, say, 1%:

```java
int failures = 0;
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        float delta = Math.abs((RUNS / N) - mat[i][j]);
        // delta expressed as a percentage of all N * RUNS numbers generated
        float percent = delta * 100f / N / RUNS;
        if (percent > 0.01f) {
            failures++;
        }
    }
}
```

The outcome:

```
Failures: 0/25
```

We would expect the number of failures to increase as we reduce the number of runs:

```
Runs: 10000000, failures: 0/25
Runs: 1000000, failures: 0/25
Runs: 100000, failures: 15/25
Runs: 10000, failures: 24/25
Runs: 1000, failures: 24/25
```

## Doing it wrong

Alright, so Collections.shuffle() seems to score well on this scale. Now let’s turn our attention to the many ways you can get this algorithm wrong. Let’s make a copy of it:

```java
for (int i = N; i > 1; i--) {
    Collections.swap(result, i-1, rnd.nextInt(N));
}
```

Let’s reset RUNS to one million and try again:

```
Runs: 1000000, failures: 17/25
```

Oops. Running it again and again yields similar results. What happened?

Okay, I’ll admit it, I introduced a bug when copying the code, can you spot it?

```java
Collections.swap(result, i-1, rnd.nextInt(N));  // incorrect
Collections.swap(result, i-1, rnd.nextInt(i));  // correct
```

It looks like selecting the random swap index across the entire array totally wrecks the distribution, while selecting it only among the elements we haven’t reached yet in our loop makes everything right. I’ll leave the interpretation of this behavior as an exercise to the reader.
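For readers who want to verify their answer: the broken loop makes N-1 independent draws from nextInt(N), so for N = 3 it has 3^2 = 9 equally likely execution paths, which cannot map evenly onto 3! = 6 permutations. A brute-force census (a sketch; the class name is mine) makes the bias visible:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BiasedShuffleCensus {
    // Enumerate the 3^2 = 9 equally likely paths of the broken loop
    // (two iterations, each drawing from nextInt(3)) and count how often
    // each of the 3! = 6 permutations comes out.
    public static Map<String, Integer> census() {
        Map<String, Integer> counts = new HashMap<>();
        for (int a = 0; a < 3; a++) {
            for (int b = 0; b < 3; b++) {
                int[] arr = {0, 1, 2};
                swap(arr, 2, a); // i = 3: swap(i-1, rnd.nextInt(N))
                swap(arr, 1, b); // i = 2: swap(i-1, rnd.nextInt(N)) -- the bug
                counts.merge(Arrays.toString(arr), 1, Integer::sum);
            }
        }
        return counts;
    }

    static void swap(int[] arr, int i, int j) {
        int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
    }
}
```

Since 9 is not a multiple of 6, some permutations necessarily come out twice and others only once; the correct loop has exactly 3 × 2 = 6 paths, one per permutation.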

Here is another way to get this wrong.

When you look at the fundamentals of the algorithm, you may wonder if you really need to go through the entire array to perform the swaps. After all, once you reach the halfway point, it’s very likely that the second half of the array has already been swapped randomly, so why not stop there? Well, let’s try (I am also increasing the size of the array to 10):

```java
for (int i = N; i > N / 2; i--) {
    Collections.swap(result, i-1, rnd.nextInt(i));
}
```

```
Runs: 1000000, failures: 25/100
```

Again, it looks like we messed up the algorithm. Let’s take a look at the distribution matrix:

```
 499916      0      0      0      0  99893 100776  99757  99744  99914
0 499489      0      0      0 100013 100040 100580  99643 100235
0      0 499929      0      0  99719  99753 100135 100363 100101
0      0      0 500268      0  99903 100346  99934 100079  99470
0      0      0      0 500255  99669  99766 100426 100040  99844
100364 100323  99846  99878  99909 100102  99518  99608  99934 100518
100392  99868  99843 100204  99662 100333  99544  99582 100313 100259
99946 100439 100138  99926 100406  99785 100144  99898  99773  99545
99906  99622  99748 100211  99641 100624 100149 100188 100003  99908
99476 100259 100496  99513 100127  99959  99964  99892 100108 100206
```

Our modified algorithm has introduced a strong bias in the first half of the array (the half we decided wasn’t worth selecting from): number 0 has a 50% chance of being found in position 0, and the same holds for 1 through 4. From position 5 to 9, the probabilities drop to the expected 10%.

Let’s see what happens if we go a bit further than the halfway point and shuffle the top three quarters of the array:

```java
for (int i = N; i > N / 4; i--) {
    Collections.swap(result, i-1, rnd.nextInt(i));
}
```
```
 199836      0 100150 100374 100292 100139  99861  99665  99861  99822
0 200711 100183  99753 100317  99945  99761  99403  99879 100048
100283  99860  99936  99915  99693 100480  99786 100128  99848 100071
100057  99592 100152  99785 100017  99507  99550 100557 100564 100219
99933 100225  99808 100193  99844 100017 100212  99573 100462  99733
99776  99789  99895  99753  99953  99973 100124 100873  99518 100346
99921 100068 100020 100316  99589  99728 100461 100206  99900  99791
100169  99896  99941  99880  99922 100094 100168  99947  99754 100229
100217  99964  99812 100066 100413 100209  99975  99593 100216  99535
99808  99895 100103  99965  99960  99908 100102 100055  99998 100206
Runs: 1000000, failures: 4/100
```

Better, but not perfect: the bias is now confined to the first two positions, where values 0 and 1 appear with double the expected probability and never trade places, which accounts for the four failing cells.

Here is another way to mess it up. Let’s restore our loop to swap the entire array:

```java
for (int i = N; i > 1; i--) {
    Collections.swap(result, i-1, new Random().nextInt(i));
}
```

Let’s run it:

```
 137244  44545  91552  87602 100969 122467 100299 125866 101490  87966
71720 138821  92466 107337 103311 123042  99610  73350 103791  86552
129376 113794 114303  98828 102760  55589 100312 102734  95621  86683
71216 112960 111268 115538 102947  99968 100271  98012 100026  87794
94519  94294  94709  94851  93888 103046  99956 100257 101460 123020
95533  96387  95852  96858  96351  96101  99570  99488  96360 127500
99901  99926  99890  99421  99929 100409  99989  99744 100489 100302
100233  99852  99969  99940  99872  99922 100016 100154 100071  99971
99725  99387  99817  99841 100002  99500 100061 100811 100863  99993
100533 100034 100174  99784  99971  99956  99916  99584  99829 100219
Runs: 1000000, failures: 49/100
```

Wow, that’s another massive failure. What did I do wrong this time?

```java
private Random rnd = new Random();

// ...

Collections.swap(result, i-1, rnd.nextInt(i));          // correct
Collections.swap(result, i-1, new Random().nextInt(i)); // incorrect
```

Why would creating a new Random object trigger such a bug while reusing the same Random object through our loops provides a good distribution?

If you ever deal with random generation in Java (and probably other languages), you absolutely need to be aware of the way java.util.Random is implemented. The problem here is that new objects are seeded with the current time, so creating objects within a loop pretty much guarantees that many of them will be created within a timeframe smaller than the clock granularity, which means they share the same seed and therefore produce identical sequences of random numbers. You address this either by reusing the same Random object (consider warming it up, too) or, better, by using java.lang.Math#random (or, even better, java.security.SecureRandom).
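The failure mode is easy to reproduce deliberately: two Random instances built from the same seed emit exactly the same sequence, which is what happens implicitly when instances are created faster than the clock ticks (a sketch; names are mine):

```java
import java.util.Random;

public class SameSeedDemo {
    // Two generators seeded identically are indistinguishable: this is
    // what the new-Random-in-a-tight-loop bug amounts to.
    public static boolean sameSequence(long seed, int draws) {
        Random r1 = new Random(seed);
        Random r2 = new Random(seed);
        for (int k = 0; k < draws; k++) {
            if (r1.nextInt(100) != r2.nextInt(100)) return false;
        }
        return true;
    }
}
```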

I’ll show one more common mistake, which might actually not always be one:

```java
for (int i = N; i > 1; i--) {
    Collections.swap(result, i-1, rnd.nextInt(i - 1));  // was rnd.nextInt(i)
}
```

The difference here is that the random index is picked between 0 and i-2 instead of between 0 and i-1, so an element can never be swapped with itself. Here is the distribution:

```
      0 249684 250515 249406 250395
250334      0 249828 249915 249923
249569 249612      0 250188 250631
249973 250505 250471      0 249051
250124 250199 249186 250491      0
Runs: 1000000, failures: 25/25
```

We observe that no number i can ever be found in position i, which totally fails our even-distribution test but possesses a few interesting qualities. This variant is actually known as Sattolo’s algorithm, and it’s useful because it always generates a permutation consisting of a single cycle (it has a section of its own in the Fisher-Yates entry on Wikipedia).
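To see the single-cycle property concretely, here is a sketch (names are mine) that runs Sattolo’s variant and verifies the output is one cycle by following indices from position 0:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Sattolo {
    public static List<Integer> shuffle(int n, Random rnd) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < n; k++) result.add(k);
        for (int i = n; i > 1; i--) {
            Collections.swap(result, i - 1, rnd.nextInt(i - 1)); // never swaps with itself
        }
        return result;
    }

    // Follow p[0] -> p[p[0]] -> ...; a single n-cycle returns to 0 in exactly n steps.
    public static boolean isSingleCycle(List<Integer> p) {
        int pos = 0, steps = 0;
        do {
            pos = p.get(pos);
            steps++;
        } while (pos != 0 && steps <= p.size());
        return steps == p.size();
    }
}
```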

This is getting much longer than I anticipated, so I’ll just leave you with a final thought in the form of a simple problem, which touches on a question that Lawrence Kesteloot mentioned in his comment on the original entry: I’m giving you an unbiased random source that returns either 0 or 1, and I want you to use it to generate a random list of the letters ‘a’, ‘b’, ‘c’, ‘d’ and ‘e’. Then, extend the idea to shuffle a deck of 52 cards.
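If you want to check your answer against a known technique: one standard unbiased approach is rejection sampling. Assemble just enough fair bits to cover the range and retry whenever the value falls outside it (a sketch; names are mine):

```java
import java.util.function.IntSupplier;

public class CoinToRange {
    // Draw ceil(log2(n)) fair bits; retry if the result is >= n.
    // Every accepted value is equally likely, so there is no bias.
    public static int nextInt(int n, IntSupplier coin) {
        int bits = 32 - Integer.numberOfLeadingZeros(n - 1); // bits needed to represent n-1
        while (true) {
            int value = 0;
            for (int k = 0; k < bits; k++) {
                value = (value << 1) | coin.getAsInt();
            }
            if (value < n) return value;
        }
    }
}
```

For a 52-card deck, the same nextInt plugs straight into the shuffle loop; the price of unbiasedness is that the number of coin flips is only bounded in expectation, not in the worst case.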

Update: discussion on reddit