Archive for category Coding challenge

Non-coding challenge: a power story

Since the next coding challenge will be a bit brutal, I thought we had all earned a little break, so here is a short non-coding challenge. I’m pretty sure you won’t find the answer on the net, since I just made it up…

Ann was enjoying this Sunday morning in her San Francisco apartment by sipping a nice cup of coffee and browsing the web on her laptop and her home wifi connection when the entire apartment went dark. Her husband entered the room:

“I might have tripped a fuse with my computer, I’ll go check the breaker box.”

“Don’t bother, it’s a general outage, the neighbors lost power too.”

“Oh? How do you know? Did you talk to them?”

“No.”

“Did you make a phone call or receive one?”

“No.”

“Did you talk to anyone?”

“Only you.”

“Did you see something outside that tipped you off?”

“No.”

How did Ann know that the outage was general and not just their apartment?

Update 1: I will hold the comments in the moderation queue for a little while to avoid spoilers, but I’ll publish them all shortly.

Update 2: I posted the solution.

Various ways to get randomness wrong


xkcd’s random generator

Here is the wrap up of the coding challenge I proposed a few days ago, and which gathered some interesting feedback, both in the comments and also on reddit.

Randomness

First of all, as many commenters pointed out, you can’t really test for randomness, so we’ll have to settle for a looser criterion, such as even distribution. Randomness usually implies even distribution but not the other way around. On the face of it, even distribution can seem like a very weak assertion (for example, the series “[1,2,3],[2,3,1],[3,1,2],[1,2,3],…” is evenly distributed but obviously not random) but if you can ascertain it with a data set that was generated from a fair random source, then you can probably make a good claim that your algorithm is reasonably unbiased.

Shuffling

Let’s get the first question out of the way. The Fisher-Yates shuffle is very simple to describe: swap the last element with an element at a random index between 0 and n-1, then swap the next-to-last element with an element at a random index between 0 and n-2, and so on. Here is the (simplified) Java library version, Collections.shuffle():

public static void shuffle(List<?> list, Random rnd) {
    int size = list.size();
    for (int i=size; i>1; i--)
        swap(list, i-1, rnd.nextInt(i));
}


Note: this is actually *not* the original Fisher-Yates algorithm, which runs in O(n^2), but an improved variant (due to Richard Durstenfeld and popularized by Knuth) which runs in linear time. For some reason, most people still call it the Fisher-Yates algorithm, probably because of its resemblance to the original.

Before I can show in how many ways you can actually get this wrong, we need to find out how we can measure the distribution of these results, which was the heart of the second question.

Simple distribution measurements

Again, there are many ways to go about this and all of them have various compromises in time and space (mostly in space, actually). One of the most intuitive ways of reasoning about this is to maintain a matrix where each element [i,j] records how many times the digit i appeared at index j. If we call N the size of the array (for example 5) and RUNS the number of times we will invoke our function to get a representative sample (for example one million), we will generate one million arrays, so five million numbers. Since the matrix is NxN = 25, we would expect each element of the matrix to be approximately 5,000,000 / 25 = 200,000.
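In code, this counting scheme looks something like the following (a sketch of mine, not from the original post; the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CountMatrix {
    static final int N = 5;
    static final int RUNS = 1_000_000;

    // mat[i][j] counts how many times value i ended up at index j
    // over RUNS shuffles of the array [0, 1, ..., N-1].
    static long[][] build(Random rnd) {
        long[][] mat = new long[N][N];
        List<Integer> list = new ArrayList<>();
        for (int k = 0; k < N; k++) list.add(k);
        for (int run = 0; run < RUNS; run++) {
            Collections.shuffle(list, rnd);
            for (int j = 0; j < N; j++) mat[list.get(j)][j]++;
        }
        return mat;
    }

    public static void main(String[] args) {
        // Each cell should come out close to RUNS / N = 200,000.
        for (long[] row : build(new Random())) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```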

I will be using this approach in the rest of this article, but for the sake of completeness, let’s spend a short time considering other options.

You might be thinking that such a matrix doesn’t contain enough information. For example, while it will tell you that “2” appeared 201,000 times in position “3”, it doesn’t tell you whether there was any pattern to these appearances or whether they were coupled with other numbers in the array. Whatever additional data you want to measure, you will probably end up extending the matrix by either a constant amount of extra space or another factor of N (e.g. turning it into a 3D matrix).

Another thought might be that instead of counting the occurrences of these numbers, we could calculate the average of the number that each index has contained. And if you have some education in statistics, you know that once you start getting into the idea of computing averages, you can’t avoid looking at standard deviations as well.

Here is what we get for an evenly distributed algorithm:

Averages:[1.9988, 1.9992, 2.0023, 2.0011, 1.9983]
Stdev:   [1.4131, 1.4145, 1.4145, 1.4147, 1.4140]

The average is 2 since the elements range from 0 to 4 inclusive. The standard deviation is the square root of 2 (about 1.414), which is exactly what we expect here: the variance of a uniform distribution over n consecutive integers is (n^2 - 1)/12, or 2 for n = 5.

Here are the results for the identity function, which doesn’t distribute at all:

Averages:[0.0, 1.0, 2.0, 3.0, 4.0]
Stdev:   [2.0, 1.0, 0.0, 1.0, 2.0]

Unsurprisingly, the average of the numbers appearing in position 0 is 0, and the average in position 4 is 4. The elements in positions 0 and 4 are 2 away from the overall average of 2, etc…
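For completeness, here is one way these figures could be computed (my own sketch, not from the original post; note that the deviation here is taken around the overall expected mean (N-1)/2 = 2, which is what matches the identity figures above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class AvgStdev {
    static final int N = 5;
    static final int RUNS = 1_000_000;

    // Returns { averages, stdevs } per index; the deviation is measured
    // around the overall expected mean (N - 1) / 2.
    static double[][] measure(boolean shuffle) {
        Random rnd = new Random();
        double mean = (N - 1) / 2.0;  // 2 for N = 5
        double[] avg = new double[N], sq = new double[N];
        List<Integer> list = new ArrayList<>();
        for (int k = 0; k < N; k++) list.add(k);
        for (int run = 0; run < RUNS; run++) {
            if (shuffle) Collections.shuffle(list, rnd);  // identity otherwise
            for (int j = 0; j < N; j++) {
                int v = list.get(j);
                avg[j] += v;
                sq[j] += (v - mean) * (v - mean);
            }
        }
        double[] stdev = new double[N];
        for (int j = 0; j < N; j++) {
            avg[j] /= RUNS;
            stdev[j] = Math.sqrt(sq[j] / RUNS);
        }
        return new double[][] { avg, stdev };
    }

    public static void main(String[] args) {
        System.out.println("shuffled: " + Arrays.deepToString(measure(true)));
        System.out.println("identity: " + Arrays.deepToString(measure(false)));
    }
}
```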

So we can get some approximate information about how evenly distributed our function is with a simple average and standard deviation, but if we want to dig further, we need to go back to our count matrix.

The count matrix

Here is how the identity function spreads on the count matrix:

 1000000      0      0      0      0
      0 1000000      0      0      0
      0      0 1000000      0      0
      0      0      0 1000000      0
      0      0      0      0 1000000

And here is a typical run for Collections.shuffle():

2000455 2000514 2000895 1998349 1999787
2001463 1999130 1999621 1999253 2000533
1997862 2000217 2001450 2000810 1999661
1999117 1999870 1998689 2000809 2001515
2001103 2000269 1999345 2000779 1998504

This looks good but it’s a bit difficult to read, so instead, let’s count how many of these numbers differ from the expected value by more than a small tolerance:

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    // Each cell is expected to be close to RUNS / N
    float delta = Math.abs((RUNS / N) - mat[i][j]);
    // Deviation expressed as a percentage of the total sample size (N * RUNS)
    float percent = delta * 100f / N / RUNS;
    if (percent > 0.01f) {
      failures++;
    }
  }
}

The outcome:

Failures: 0/25

We would expect the number of failures to increase as we reduce the number of runs:

Runs: 10000000, failures: 0/25
Runs: 1000000, failures: 0/25
Runs: 100000, failures: 15/25
Runs: 10000, failures: 24/25
Runs: 1000, failures: 24/25

Doing it wrong

Alright, so Collections.shuffle() seems to score well on this scale. Now let’s turn our attention to the many ways you can get this algorithm wrong. Let’s make a copy of it:

for (int i = N; i > 1; i--) {
  Collections.swap(result, i-1, rnd.nextInt(N));
}

Let’s reset RUNS to one million and try again:

Runs: 1000000, failures: 17/25

Oops. Running it again and again yields similar results. What happened?

Okay, I’ll admit it, I introduced a bug when copying the code, can you spot it?

  Collections.swap(result, i-1, rnd.nextInt(N));  // incorrect
  Collections.swap(result, i-1, rnd.nextInt(i));  // correct

It looks like selecting the random swap index from the entire array totally wrecks the distribution, while selecting it from only those elements we haven’t reached yet in the loop makes everything right. I’ll leave the interpretation of this behavior as an exercise for the reader.
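Without giving the interpretation away entirely, here is a hint in code (a sketch of mine, not from the original post): enumerate every possible sequence of random draws for a tiny array. The broken version draws from n values at each of its n - 1 swaps, i.e. n^(n-1) equally likely outcomes; for n = 3 that is 9 outcomes spread over 6 permutations, and 9 is not divisible by 6.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class ExhaustiveBias {
    // Run the shuffle loop on [0, 1, 2] with a fixed pair of "random" draws.
    static String shuffle(int x, int y) {
        int[] a = {0, 1, 2};
        int[] draws = {x, y};
        int d = 0;
        for (int i = a.length; i > 1; i--) {
            int j = draws[d++];
            int t = a[i - 1]; a[i - 1] = a[j]; a[j] = t;
        }
        return Arrays.toString(a);
    }

    // Count how often each permutation appears across all draw sequences.
    static Map<String, Integer> count(boolean biased) {
        Map<String, Integer> counts = new TreeMap<>();
        int n = 3;
        for (int x = 0; x < n; x++)
            for (int y = 0; y < (biased ? n : n - 1); y++)  // nextInt(N) vs nextInt(i)
                counts.merge(shuffle(x, y), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("nextInt(N): " + count(true));   // 9 outcomes over 6 permutations
        System.out.println("nextInt(i): " + count(false));  // 6 outcomes, one per permutation
    }
}
```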

Here is another way to get this wrong.

When you look at the fundamentals of the algorithm, you may wonder if you really need to go through the entire array to perform the swaps. After all, once you reach the halfway point of the array, it’s very likely that the second half has already been swapped around, so why not stop there? Well, let’s try (I am also increasing the size of the array to 10):

Runs: 1000000, failures: 25/100

Again, it looks like we messed up the algorithm. Let’s take a look at the distribution matrix:

    for (int i = N; i > N / 2; i--) {
      Collections.swap(result, i-1, rnd.nextInt(i));
    }
 499916      0      0      0      0  99893 100776  99757  99744  99914
      0 499489      0      0      0 100013 100040 100580  99643 100235
      0      0 499929      0      0  99719  99753 100135 100363 100101
      0      0      0 500268      0  99903 100346  99934 100079  99470
      0      0      0      0 500255  99669  99766 100426 100040  99844
 100364 100323  99846  99878  99909 100102  99518  99608  99934 100518
 100392  99868  99843 100204  99662 100333  99544  99582 100313 100259
  99946 100439 100138  99926 100406  99785 100144  99898  99773  99545
  99906  99622  99748 100211  99641 100624 100149 100188 100003  99908
  99476 100259 100496  99513 100127  99959  99964  99892 100108 100206

Our modified algorithm has introduced a strong bias in the first half of the array (the half we decided wasn’t worth selecting from): number 0 has a 50% chance of being found in position 0, and likewise for 1 through 4. From 5 to 9, the probability drops to the expected 10%.

Let’s see what happens if we go a bit further than the half point, say 3/4 of the array:

    for (int i = N; i > N / 4; i--) {
      Collections.swap(result, i-1, rnd.nextInt(i));
    }
 199836      0 100150 100374 100292 100139  99861  99665  99861  99822
      0 200711 100183  99753 100317  99945  99761  99403  99879 100048
 100283  99860  99936  99915  99693 100480  99786 100128  99848 100071
 100057  99592 100152  99785 100017  99507  99550 100557 100564 100219
  99933 100225  99808 100193  99844 100017 100212  99573 100462  99733
  99776  99789  99895  99753  99953  99973 100124 100873  99518 100346
  99921 100068 100020 100316  99589  99728 100461 100206  99900  99791
 100169  99896  99941  99880  99922 100094 100168  99947  99754 100229
 100217  99964  99812 100066 100413 100209  99975  99593 100216  99535
  99808  99895 100103  99965  99960  99908 100102 100055  99998 100206
Runs: 1000000, failures: 4/100

Here is another way to mess it up. Let’s restore our loop to swap the entire array:

    for (int i = N; i > 1; i--) {
      Collections.swap(result, i-1, new Random().nextInt(i));
    }

Let’s run it:

 137244  44545  91552  87602 100969 122467 100299 125866 101490  87966
  71720 138821  92466 107337 103311 123042  99610  73350 103791  86552
 129376 113794 114303  98828 102760  55589 100312 102734  95621  86683
  71216 112960 111268 115538 102947  99968 100271  98012 100026  87794
  94519  94294  94709  94851  93888 103046  99956 100257 101460 123020
  95533  96387  95852  96858  96351  96101  99570  99488  96360 127500
  99901  99926  99890  99421  99929 100409  99989  99744 100489 100302
 100233  99852  99969  99940  99872  99922 100016 100154 100071  99971
  99725  99387  99817  99841 100002  99500 100061 100811 100863  99993
 100533 100034 100174  99784  99971  99956  99916  99584  99829 100219
Runs: 1000000, failures: 49/100

Wow, that’s another massive failure. What did I do wrong this time?

    private Random rnd = new Random();

// ...

    Collections.swap(result, i-1, rnd.nextInt(i));  // correct
    Collections.swap(result, i-1, new Random().nextInt(i)); // incorrect

Why would creating a new Random object trigger such a bug while reusing the same Random object through our loops provides a good distribution?

If you ever deal with random generation in Java (and probably other languages), you absolutely need to be aware of the way java.util.Random is implemented. The problem here is that new objects are seeded with the current time, so creating objects within a loop pretty much guarantees that many of them will be created within a timeframe that’s smaller than the clock granularity, which means that a lot of the random numbers generated in succession will be identical. You address this either by reusing the same Random object (consider warming it up, too) or, better, by using java.lang.Math#random (or even better: java.security.SecureRandom).
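The root cause is easy to demonstrate directly: two Random objects that end up with the same seed produce exactly the same stream of numbers (a small sketch of mine):

```java
import java.util.Random;

public class SameSeed {
    public static void main(String[] args) {
        // Two generators with identical seeds yield identical streams --
        // which is exactly what happens when Random objects are created
        // in a loop faster than the clock ticks.
        Random a = new Random(12345);
        Random b = new Random(12345);
        for (int k = 0; k < 5; k++) {
            System.out.println(a.nextInt(10) + " == " + b.nextInt(10));
        }
    }
}
```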

I’ll show one more common mistake, which might actually not always be one:

    for (int i = N; i > 1; i--) {
      Collections.swap(result, i-1, rnd.nextInt(i - 1));  // was i
    }

The difference here is that the random index is picked between 0 and i-1, instead of between 0 and i. Here is the distribution:

      0 249684 250515 249406 250395
 250334      0 249828 249915 249923
 249569 249612      0 250188 250631
 249973 250505 250471      0 249051
 250124 250199 249186 250491      0
Runs: 1000000, failures: 25/25

We observe that no number i can ever be found in position i, which, while totally failing our even distribution test, possesses a few interesting qualities. This approach is actually known as Sattolo’s algorithm, and it’s useful because it always generates a permutation consisting of a single cycle (it has a section of its own in the Fisher-Yates entry on Wikipedia).
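The single-cycle property can be checked empirically (a sketch of mine, with my own helper names):

```java
import java.util.Arrays;
import java.util.Random;

public class Sattolo {
    // Sattolo's variant of the shuffle: drawing from 0..i-2 instead of
    // 0..i-1 produces a uniformly random permutation made of one single cycle.
    static int[] sattolo(int n, Random rnd) {
        int[] a = new int[n];
        for (int k = 0; k < n; k++) a[k] = k;
        for (int i = n; i > 1; i--) {
            int j = rnd.nextInt(i - 1);  // note: i - 1, not i
            int t = a[i - 1]; a[i - 1] = a[j]; a[j] = t;
        }
        return a;
    }

    // Follow the permutation starting from 0; a single cycle visits all n elements.
    static boolean isSingleCycle(int[] p) {
        int seen = 0, k = 0;
        do { k = p[k]; seen++; } while (k != 0);
        return seen == p.length;
    }

    public static void main(String[] args) {
        int[] p = sattolo(5, new Random());
        System.out.println(Arrays.toString(p) + " single cycle: " + isSingleCycle(p));
    }
}
```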

This is getting much longer than I anticipated, so I’ll just leave you with a final thought in the form of a simple problem, which touches on a question that Lawrence Kesteloot mentioned in his comment on the original entry: I’m giving you an unbiased random source that returns either 0 or 1, and I want you to use it to generate a random list of the letters ‘a’, ‘b’, ‘c’, ‘d’ and ‘e’. Then, extend the idea to shuffle a deck of 52 cards.

Update: discussion on reddit

A new coding challenge

It’s been a while since I offered a coding challenge, so let’s remedy this.

This one comes from this week’s Car Talk’s puzzler. Here is the short version of it:

A farmer has a forty-pound stone that he uses to weigh food for his animals on a scale. One day, he loans the stone to his neighbor. A few days later, the neighbor returns the stone and apologizes to the farmer because the stone is now broken into four pieces. The farmer responds: “Please don’t apologize, you have actually made my life easier, because with these four pieces I can now measure any weight between one and forty pounds.”

The question is: how much do these four individual pieces weigh?

I’m adding a few clarifications, which are not necessarily what the Car Talk guys meant (I don’t have the solution and I haven’t looked at any discussion on the subject):

  • The four weights are integers.
  • The weights we can measure between one and forty pounds are in one-pound increments.
  • Each weight is measured in one session (otherwise, you could measure forty pounds with a one-pound stone by performing forty measurements).
  • If there is no solution (the farmer might be bad at math), show the four weights that allow you to measure the most weights between one and forty pounds.

The challenge? Write a brute force solution to this puzzler in any language that you want. I don’t care if your solution is slow, fast, ugly or elegant as long as it works, and I suspect that certain languages will be very good for this puzzler.

Bonus questions:

  • Your solution should explain how exactly you use the stone pieces to measure each weight.
  • Is there a formal way to solve this problem without using brute force?

Coding challenge

Update: I posted a wrap-up here.

Here is an interesting coding challenge: write a counter function that counts from 1 to max but only returns numbers whose digits don’t repeat.

For example, part of the output would be:

  • 8, 9, 10, 12 (11 is not valid)
  • 98, 102, 103 (99, 100 and 101 are not valid)
  • 5432, 5436, 5437 (5433, 5434 and 5435 are not valid)

Also:

  • Display the biggest jump (in the sequences above, it’s 4: 98 -> 102)
  • Display the total count of numbers
  • Give these two values for max=10000

Post your solution in the comments or on your blog. All languages and all approaches (brute force or not) are welcome.

I’m curious to see if somebody can come up with a clever solution (I don’t have one)…