Coding challenge: partial results semantics

Consider the following simple signature:

  List<Person> searchPeople(String query);

This method takes a search string (e.g. “Bill”) and returns a list of Person objects that match it. This includes people with “Bill” as their first name or as their last name, and maybe even nickname matches (someone named “William” would match).

However, you have millions of people in your database, which means that this call can potentially return tens of thousands of people and can be quite time-consuming. But your caller cannot wait forever: they want to cap the amount of time spent on the search, e.g. five seconds.

The function itself doesn’t know about this limit; it just does as much as it can until it gets interrupted by its caller once the time has run out. You can imagine that the caller invokes this function in a separate thread and then calls get() on the resulting Future with a timeout of five seconds.

The nice thing about the signature above is that it’s referentially transparent, which offers a lot of nice properties. However, it’s also binary: either it returns everything that matches the search or it gets interrupted before it can finish and the caller gets zero results.

The challenge is to write this function so that when it gets interrupted by the timeout, it still returns whatever it has found so far.

The solution is trivial using mutable structures, so bonus points if you can implement it with immutable data. Any language welcome, and I suggest you use pastebin or a similar service to share your code, since the comment system is not very good at formatting code.
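
For what it’s worth, here is one direction a solution could take, sketched in Java. This is only a sketch, and Person, database() and matches() are placeholders of my own: the worker publishes an immutable snapshot of its results after each match, and when the timeout fires, the caller cancels the task and returns the last published snapshot.

  import java.util.*;
  import java.util.concurrent.*;
  import java.util.concurrent.atomic.AtomicReference;

  public class PartialSearch {
    static class Person {}

    // Latest immutable snapshot of the results found so far.
    private final AtomicReference<List<Person>> snapshot =
        new AtomicReference<List<Person>>(Collections.<Person>emptyList());

    public List<Person> searchPeople(String query) {
      List<Person> found = new ArrayList<Person>();
      for (Person p : database()) {
        if (Thread.currentThread().isInterrupted()) break; // the caller gave up
        if (matches(p, query)) {
          found.add(p);
          // Publish an immutable copy so the caller never sees a half-built list.
          snapshot.set(Collections.unmodifiableList(new ArrayList<Person>(found)));
        }
      }
      return snapshot.get();
    }

    public List<Person> searchWithTimeout(final String query, long timeoutMs)
        throws InterruptedException, ExecutionException {
      ExecutorService executor = Executors.newSingleThreadExecutor();
      try {
        Future<List<Person>> future = executor.submit(new Callable<List<Person>>() {
          @Override public List<Person> call() { return searchPeople(query); }
        });
        try {
          return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
          future.cancel(true);   // interrupt the search...
          return snapshot.get(); // ...and return whatever it found so far
        }
      } finally {
        executor.shutdownNow();
      }
    }

    // Placeholders for the parts the challenge leaves abstract.
    private Iterable<Person> database() { return Collections.emptyList(); }
    private boolean matches(Person p, String query) { return false; }
  }

Note that searchPeople keeps the original signature; the ArrayList is purely local, and everything the caller can observe is immutable. Making the accumulation itself immutable is left as the bonus part of the challenge.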

More about language popularity

Hot on the heels of my previous post about language popularity, the latest numbers from the TIOBE index are out.

The top five languages are C, Java, Objective-C, C++ and Visual Basic. Every other language beyond that has less than 4% mind share. The next JVM languages are Scala (#35) and Groovy (#48). Clojure didn’t make it in the top 50.

The pitfalls of Test-Driven Development

A few days ago, David Heinemeier Hansson posted a very negative article on Test-Driven Development (TDD) which generated quite a bit of noise. This prompted Kent Beck to respond with a Facebook post which I found fairly weak because it failed to address most of the points that David made in his blog post.

I have never been convinced by TDD myself and I have expressed my opinions on the subject repeatedly in the past (here and here for example) so I can’t say I’m unhappy to see this false idol finally being questioned seriously.

I actually started voicing my opinion on the subject in my book in 2007, so I thought I’d reproduce the text from this book here for context (with a few changes).

The Pitfalls of Test-Driven Development

I basically have two objections to Test-Driven Development (TDD).

  1. It promotes microdesign over macrodesign.
  2. It’s hard to apply in practice.

Let’s go over these points one by one.

TDD Promotes Microdesign over Macrodesign

Imagine that you ask a famous builder and architect to construct a skyscraper. After a month, that person comes back to you and says:

“The first floor is done. It looks gorgeous; all the apartments are in perfect, livable condition. The bathrooms have marble floors and beautiful mirrors, the hallways are carpeted and decorated with the best art.”

“However,” the builder adds, “I just realized that the walls I built won’t be able to support a second floor, so I need to take everything down and rebuild with stronger walls. Once I’m done, I guarantee that the first two floors will look great.”

This is what some of the premises of Test-Driven Development encourage, aggravated by the mantra “Do the simplest thing that could possibly work,” which I often hear from Extreme Programming proponents. It’s a nice thought, but one that tends to lead to very myopic designs and, worst of all, to a lot of churn, as you constantly revisit and refactor your initial choices so they can encompass the next milestone, the one you purposefully ignored because you were too busy applying another widespread principle known as “You aren’t going to need it” (YAGNI).

Focusing exclusively on Test-Driven Development tends to make programmers disregard the practice of large- or medium-scale design, just because it is no longer “the simplest thing that could possibly work”. Sometimes it does pay off to include provisions in your code for future work and extensions, such as empty or lightweight classes, listeners, hooks, or factories, even though at the moment you are, for example, using only one implementation of a certain interface.

Another factor to take into consideration is whether the code you are writing is for a closed application (a client or a Web application) or a library (to be used by developers or included in a framework). Obviously, developers of the latter type of software have a much higher incentive to empower their users as much as possible, or their library will probably never gain any acceptance because it doesn’t give users enough extensibility. Test-Driven Development cripples library development because its principles are at odds with the very concept of designing libraries: thinking of things that users are going to need.

Software is a very iterative process, and throwing away entire portions of code is not only common but encouraged. When I start working on an idea from scratch, I fully expect to throw out and completely rewrite the first version of my code, if not the first two. With that in mind, why bother writing tests for this temporary code? I much prefer writing the code without any tests while my understanding of the problem evolves and matures, and only when I reach what I consider the first decent implementation of the idea is it time to write tests.

At any rate, test-driven developers and pragmatist testers are trying to achieve the same goal: write the best tests possible. Ideally, whenever you write tests, you want to make sure that these tests will remain valid no matter how the code underneath changes. Identifying such tests is difficult, though, and the ability to do so probably comes only with experience, so consider this a warning against testing silver bullets.

Yes, Test-Driven Development can lead to more robust software, but it can also lead to needless churn and a tendency to over-refactor that can negatively impact your software, your design, and your deadlines.

TDD Is Hard to Apply

Test-Driven Development reading material that I have seen over the years tends to focus on very simple problems:

  • A scorecard for bowling
  • A simple container (Stack or List)
  • A Money class
  • A templating system

TDD works wonders on these examples, and the articles describing this practice usually do a good job of showing why and how.

What these articles don’t do, though, is help programmers dealing with very complex code bases perform Test-Driven Development. In the real world, programmers deal with code bases comprising millions of lines of code. They also have to work with source code that not only was never designed to be tested in the first place but also interacts with legacy systems (often not written in Java), user interfaces, graphics, code that outputs to all kinds of hardware devices, processes running under very stringent real-time, memory, network, or performance constraints, faulty hardware, and so on.

Notice that none of the examples from the TDD reading materials falls into any of these categories, and because I have yet to see a concrete illustration of how to use Test-Driven Development to test a back-end system interacting with a 20-year-old mainframe validating credit card transactions, I certainly share the perplexity of developers who like the idea of Test-Driven Development but can’t find any reasonable way to apply it to their day jobs.

TestNG itself is a very good candidate for Test-Driven Development: It doesn’t have any graphics, it provides a rich programmatic API that makes it easy to probe in various ways, and its output is highly deterministic and very easy to query. On top of that, it’s an open source project that is not subject to any deadlines except for the whims of its developers.

Despite all these qualities, I estimate that less than 5% of the tests validating TestNG were written in a TDD fashion, for the simple reason that code written with TDD was not necessarily of higher quality than code delivered “tests last.” It was also not clear at all that code produced with TDD ended up being better designed.

No matter what TDD advocates keep saying, code produced this way is not intrinsically better than traditionally tested code. And looking back, it actually was a little harder to produce, if only because of the friction created by dealing with code that didn’t compile and tests that didn’t pass for quite a while.

Extracting the Good from Test-Driven Development

The goal of any testing practice is to produce tests. Even though I am firmly convinced that code produced with TDD is not necessarily better than code produced the traditional way, it is still much better than code produced without any tests. And this is the number one lesson I’d like everybody to keep in mind: how you create your tests is much less important than writing tests in the first place.

Another good quality of Test-Driven Development is that it forces you to think of the exit criteria that your code has to meet before you even start coding. I certainly applaud this focus on concrete results, and I encourage any professional developer to do the same. I simply argue that there are other ways to phrase these criteria than writing tests first, and sometimes even a simple text file with a list of goals is a very decent way to get started. Just make sure that, by the time you are done with an initial version, you have written tests for every single item on your list.

Don’t test first, test smart.


Update: Discussion on reddit

Language popularity on GitHub

RedMonk just published their latest survey of GitHub’s most popular languages, and given GitHub’s continuous growth in popularity, the results are worth a look.

Here are the results at a glance:

  • JavaScript is seeing consistent, serious growth.
  • Ruby is in sharp decline.
  • Python is showing a decline as well, although not as severe as Ruby’s.
  • Java is showing some growth, and it’s also the only JVM language in RedMonk’s top 12.

I’m going to go out on a limb and predict that Python is being replaced by Go. I don’t have a lot of information to back up this prediction except that most of the positive articles I read about Go are written by Python developers, and a lot of them say that they are now actively migrating their code base from Python to Go. I don’t see as much enthusiasm for Go from developers using statically typed languages, probably because of Go’s antiquated type system (which is still a big step up from Python, obviously).

It’s interesting to see Java continue to grow twenty years after its introduction. I think this constant growth is fueled by the language’s remarkable versatility and its ability to adapt to new technologies but also driven by a series of constant popularity boosts such as Android five years ago and Java 8 this year.

I’m surprised not to see Groovy in this top 12, since it’s usually acknowledged as the second most popular language on the JVM, and I expected its popularity to grow thanks to Gradle, but apparently this is not enough to crack RedMonk’s top 12.

Update: Discussion on reddit

A bit I can’t unsee

This is the binary definition of ET’s sprite in Atari’s infamous game:


ETWalkSprite_A0
.byte $FE ; |XXXXXXX.|
.byte $FF ; |XXXXXXXX|
.byte $C3 ; |XX....XX|
.byte $0F ; |....XXXX|
.byte $FF ; |XXXXXXXX|
.byte $3F ; |..XXXXXX|
.byte $2B ; |..X.X.XX|
.byte $E7 ; |XXX..XXX|
.byte $00 ; |........|

I am actually less disturbed by the 6502 listing than by that extra bit between ET’s legs (which you can see in the screenshot above).

Coding challenge: fast width/height of an image

I am trying to find out the width and height of an image (given its URL) while transferring as few bytes as possible. This coding challenge has two separate parts:

  • Finding the cheapest way to retrieve width and height.
  • Verifying that the solution is indeed cheap.

The first part is not that interesting since it’s a fairly well documented problem. A bit too well documented, actually, since I found several solutions:

This last solution looks pretty efficient from a network perspective but I’m worried it might leave out a few corner cases in image handling. The first two solutions are probably battle tested and very robust but I have no idea how effective they are from a network standpoint (their Javadoc doesn’t describe the specifics of how they operate).
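
For reference, here is roughly what a header-reading approach could look like, restricted to PNG (a sketch under assumptions of my own: the server honors the Range header, and JPEG and GIF would each need their own parsing). A PNG stores its width and height as big-endian integers at offsets 16 and 20, right after the signature and the IHDR chunk header, so 24 bytes are enough:

  import java.io.DataInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class PngSize {
    /** @return {width, height}, transferring at most 24 bytes of body. */
    public static int[] size(String url) throws IOException {
      HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
      // Ask for just the first 24 bytes; servers are free to ignore this.
      c.setRequestProperty("Range", "bytes=0-23");
      InputStream in = c.getInputStream();
      try {
        byte[] header = new byte[24];
        new DataInputStream(in).readFully(header);
        // A PNG signature starts with 0x89 'P' 'N' 'G'.
        if ((header[0] & 0xFF) != 0x89 || header[1] != 'P') {
          throw new IOException("Not a PNG: " + url);
        }
        return new int[] { readInt(header, 16), readInt(header, 20) };
      } finally {
        in.close();
      }
    }

    private static int readInt(byte[] b, int o) {
      return ((b[o] & 0xFF) << 24) | ((b[o + 1] & 0xFF) << 16)
          | ((b[o + 2] & 0xFF) << 8) | (b[o + 3] & 0xFF);
    }
  }

As for the second part of the challenge, wrapping the InputStream in a byte-counting stream (or watching the connection through a proxy) would be one way to measure what an implementation actually transfers.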

So… Do you have a better solution? And whether you do or don’t, how would you assess how many bytes these implementations transfer?

Chrome and passwords

An innocent blog post demonstrating that it’s trivial for anyone to see passwords in the clear in Chrome is slowly building up into a whole scandal. The Chrome team was quick to respond, but unfortunately that answer simply poured more oil on a fire that was already burning quite brightly.

Here are a few predictions:

  • As we speak, the Chrome team is having an emergency meeting to discuss the situation.
  • Within a few days, they will announce that they are going to fix the problem.
  • In a couple of weeks, the next version of Chrome will no longer display passwords in the clear without additional action from the user.
  • In the months that follow, the other browsers will fall in line and implement a similar fix (Update: Safari already prompts you for your machine password before showing you its passwords, thanks Fabrizio for pointing this out).

I’m guessing Justin Schuh is probably regretting answering so hastily (you know you’re in hot water when Sir Tim Berners-Lee calls your answer “disappointing”), and this might be a good illustration of being immersed in a domain for so many years that you start missing obvious things. We’ve all been guilty of this at some point: our own expertise reinforces a flawed premise, and that very expertise prevents us from questioning it.

The general idea is that once someone has physical access to a resource you are trying to secure, that resource is pretty much gone, so you shouldn’t spend too much time trying to address that scenario.

A common response to this is that once someone gains access to your computer, they can still log into your favorite web sites without knowing your passwords, but this is a straw man: the exposure of your passwords in clear text can be devastating because this knowledge can then be used on many more web sites. If you give me a password cookie, I can only log into so many web sites in a few minutes. If you give me your clear text password, I can spend hours back home trying to log into various web sites at my leisure.

The danger with the “physical access is not worth addressing” premise is that it’s tempting to do absolutely nothing to address this case (which is what most browsers do; Chrome is not the only culprit here) instead of doing at least the minimum. For example, maybe you have some cash in your home that you save for emergencies and you are wondering where to hide that money. Just because a burglar inside your house can steal whatever they want at that point doesn’t mean you shouldn’t at least make it a bit harder for them. Surely, hiding that cash inside socks in a drawer is a bit more secure than leaving it in plain sight on the kitchen table.

This added bit of security can be crucial for two reasons:

  • It raises the technical bar for the thief. With the current situation, someone who sits down in front of your unlocked workstation can know your passwords within ten seconds and just a few clicks. Hash and hide these passwords, and suddenly reading them is no longer something a large part of the population can do.
  • Time is also a factor. When someone gains access to your computer, they might only have a few minutes and anything you can do to delay them (such as installing a hash decrypter) is that much time they don’t have to steal something else.

Whatever happens, I think we will all come out of this situation with more secure browsers.

Email tips

Email tips which I thought were common knowledge.

Avoiding embarrassing forwards

Problem: You receive an email (say from a customer), you pass it along internally with a few comments and… you accidentally copy the author of the email with details they were not meant to read.

Solution: Instead of hitting “Reply to all”, adding your team in Cc, and removing the customer (the part a lot of people forget), forward the whole email and then add your comments. It’s much easier to avoid mistakes when you have to populate an empty To: box than when you have to edit an existing list of To’s, Cc’s and Bcc’s.

Don’t send too soon

Problem: writing an email that’s not ready to be sent yet.

You need to write a detailed email which is not supposed to be sent just yet (for example because you are describing something you’re still working on). How do you make sure that you don’t send the email until the time is right?

Solution: Write it and save it as a draft but don’t enter anything in the To: box. Do this last.

Say no to ugly emails

Problem: You want your emails to look nice.

My emails routinely contain code or shell output, and as much as I love Gmail, its formatting abilities are pathetic, both from a user interaction standpoint (why do I need three clicks to indent text to the right?) and from a theme standpoint. For example:

In order to get this ugly rendition of code, I had to indent the snippet manually and then change its font. And it still looks terrible.

Solution: Use Markdown Here, a Chrome extension which not only allows you to format your code in Markdown but also uses some pretty nice CSS.

You only need a few backticks and, optionally, a language identifier:

```java
public class SlidingWindowMap {
  public SlidingWindowMap(Set<String> keys, int maxCount, long periodMs) {
    // ...
  }
 
  /**
   * @return a key that has been used less than `maxCount` times during the
   * past `periodMs` milliseconds or null if no such key exists.
   */
  public String getNextKey() {
    // Your brilliant solution goes here
  }
 
}
```

Click the Markdown button and voilà:

Your turn now: what email tricks do you use that you think very few people know about?

Bringing down servers, one character at a time

Last week, I mentioned a bug that I caused, and a few people asked me for details, so here is what happened.

The code in question parses a calendar and processes meetings. Before showing the buggy code, here is what a shortened version of the fix ended up looking like:

while (maxDays < 7 && (eventCount < 20) || attendeeCount < 40) {
    // process meetings
    // increment maxDays whenever we cross a day boundary
    // possibly increment eventCount or attendeeCount based on the meeting
}

The bug went unnoticed in production for a few days until suddenly the servers started pinning their CPUs and, shortly thereafter, depleting the memory of each instance. A restart was the only remedy.

This code loops through someone’s calendar and stops once we’ve gathered enough information. The maxDays variable is what I call a “safety valve”, a condition that guarantees the code doesn’t run amok in case its normal processing flow never finishes. And yet, the crash seemed to indicate that the code was doing exactly that.

With this introduction out of the way, here is the line that caused the bug:

Actually... you've already seen it, it's the code I pasted above. Did you spot the problem?

That's right, a simple set of parentheses in the wrong place. Because they are surrounding the wrong expression, they completely defeat the purpose of the safety valve, which only triggers for people with empty calendars (explaining why it took some time to trigger that bug).

Two characters to bring down an entire server...

Can we learn anything from this bug, besides the fact that testing needs to cover edge cases? A static analysis tool can’t be of much help here since we’re completely in the realm of runtime behavior, but is there some way an automated tool could have warned me about the flaw in this termination condition?

Branching logic

How does the following code make you feel:

  runSomeLogic(account, person, true, false);

?

Me? Annoyed.

I wish we could do away with this kind of code completely, but I’ve come to the conclusion that such calls are unavoidable, for a simple reason:

  private void complexStuff(boolean someCondition) {
    // ... gnarly code here...
    if (someCondition) {
      doA();
    } else {
      doB();
    }
    // ... more gnarly code...
  }

Accepting branching parameters in the signature is a good way to minimize the amount of duplicated code you will write. However, this doesn’t mean you need to expose your callers to this implementation detail, or at least, it means you can expose it in a safer way than naked booleans.

There is a whole spectrum of ways to handle such cases, and the example above sits at one end of it. Let’s see if we can enumerate the alternatives.

Let’s start with the lazy way, which doesn’t require any code change:

  runSomeLogic(account, person,
        true /* use memcache */,
        false /* no security check */);

This gives me a better sense of what the code is doing, but it’s still error-prone since comments are easily forgotten. Can we get the compiler to help us?

  /** Never call this, call the overloaded helper methods */
  private void runSomeLogic(Account account, Person person, boolean useMemcache,
      boolean securityCheck) { ... }

  public void runSomeLogicNoCacheSecurity(Account account, Person person) {
    runSomeLogic(account, person, false, true);
  }

  public void runSomeLogicCacheSecurity(Account account, Person person) {
    runSomeLogic(account, person, true, true);
  }

  // ... you get the idea

A bit better, and the compiler will keep us honest, but you can see the combinatorial explosion here: each new boolean doubles the number of methods, so two booleans need four helpers and a third would push that to eight. Not really scalable. One way to keep things manageable is to forbid more than one boolean parameter, but this is really just kicking the can down the road.

Maybe booleans are the problem. How about something more descriptive?

  enum CacheInfo {
    USE,
    DONT_USE
  }

  enum SecurityInfo {
    USE,
    DONT_USE
  }

  runSomeLogic(account, person, CacheInfo.USE, SecurityInfo.DONT_USE);

This looks somewhat better: it’s both enforced by the compiler and readable, although maybe we can improve on it by replacing the multitude of tiny enums with a couple of umbrella ones:

  enum Use {
    MEMCACHE,
    SECURITY
  }

  enum DontUse {
    MEMCACHE,
    SECURITY
  }

  runSomeLogic(account, person, Use.MEMCACHE, DontUse.SECURITY);

The cost is two uber enums that everybody needs to import, but the code reads more fluently. The downside of this approach is that we’ve weakened the static guarantees considerably: it’s now possible to pass a Use.MEMCACHE where a security configuration setting is expected.
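
To make that weakness concrete, the following call compiles just as happily with the two settings swapped:

  // A security setting in the cache slot and vice versa: nothing here
  // for the compiler to complain about.
  runSomeLogic(account, person, Use.SECURITY, DontUse.MEMCACHE);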

Can you think of other ways to approach this problem? (all languages welcome)