Scala 2.9 introduced parallel collections, which mirror most of the existing collections with a parallel version. Collections that have been parallelized this way have received a new method called par which magically parallelize certain operations on this collection.

For example, here is a sequential version:

scala> (1 to 5) foreach println
1
2
3
4
5

And the parallel version (notice the extra par keyword):

scala> (1 to 5).par foreach println
1
4
3
2
5

Obviously, the ordering will change each time you run the parallel version.

This piqued my curiosity and I decided to dig a bit further, starting with investigating what exactly is happening behind the scenes.

First of all, the parallel collections are based on an article called “A Generic Parallel Collection Framework”, by Martin Odersky, Aleksander Prokopec et al, which I highly recommend. It’s a very interesting analysis of how to decompose the concepts of parallelism, concurrency, collections, ranges and iterators and assemble them in a generic manner.

Sadly, this article ended up being the only highlight of my research in this area, because the more I dug into the Scala parallel collections, the more disappointed I became. By now, I am struggling to find a good use case for parallel collections, and I’m hoping that this article will generate some positive responses about their use.

Here are some of the problems that I found with the parallel collections, starting with the one I think is the most important one.

Lack of control

My first reaction when I saw the output above was to try to verify that indeed, threads were being spawned, and then find out how many of them, how I can control the size of the thread pool, etc…

I came up pretty much empty on all counts, and if I have missed a piece of the documentation that explains this, I would love to see it, but browsing the sources of ParSeq and other classes produced no useful result.

This is a big deal, and probably the worst problem with this framework. The loop above generated a parallel range of five entries, did it generate five threads? What happens if I try with 1000? 100000? The answer: it works for all these values, which makes me think that the loop is not allocating one thread per value. So it’s using a thread pool. But again: what size? Is that size configurable? How about other characteristics of that thread pool?

Digging deeper, what are the saturation and rejection policies? If the pool contains ten threads, what happens when it receives an eleventh value? It probably blocks, but can this be configured? Can the dispatch strategy be configured? Maybe I’m feeding operations of diverse durations and I want to make sure that the expensive operations don’t starve the faster ones, how can I do this?

This absence of configuration is a big blow to the parallel framework, and it relegates its usage to the simplest cases, where it will most likely not bring much speed gain compared to sequential execution.

Silver bullet illusion

Over the past months, I have seen quite a few users pop in the #scala channel and complain that parallel collections are not working. Taking a closer look at their code, it usually becomes quickly obvious that their algorithm is not parallelizable, and either 1) they didn’t realize it or 2) they were aware of that fact but they got the impression that par would magically take care of it.

Here is a quick example:

scala> Set(1,2,3,4,5) mkString(" ")
res149: String = 5 1 2 3 4

scala> Set(1,2,3,4,5).par mkString(" ")
res149: String = 5 1 2 3 4

You can run the par version over and over, the result will remain the same. This is confusing. Note that I used a Set this time, which indicates that I don’t care for the ordering of my collection. Calling mkString on the sequential version of my set reflects this. With this in mind, I would hope that calling mkString on the parallel version of my set would randomize its output, but that’s not what’s happening: I’m getting the same result as the sequential version, over and over.

It should be obvious that not all operations on collections can be parallelized (e.g. folds) but it looks like creating a string out of a set should be, and it’s not. I’m not going to go too far down here because the explanation is a mix of implementation details and theoretical considerations (the catamorphic nature of folds, set, the Scala inheritance hierarchy and the mkString specification), but the key point here is that the parallelization of collections can lead to non-intuitive results.

Bloat

I think the decision to retrofit the existing collections with the par operation was a mistake. Parallel operations come with a set of constraints that are not widely applicable to sequential collections, which leads to the situation that not all the collections support par (e.g. there is no ParTraversable) and more importantly, it imposes a burden on everyone, including people who don’t care for this functionality.

In doing this, Scala violates what I consider a fairly important rule for programming languages and API’s in general: you shouldn’t pay for what you don’t use. Not only do the parallel collections add a few megs to a jar file that’s already fairly big, but they probably introduce a great deal of complexity that is going to impact the maintainers of the collections (both sequential and parallel). It looks like anyone who will want to make modifications to the sequential collections will have to make sure their code is not breaking the parallel collections, and vice versa.

Unproven gains

Scala 2.9 is still very recent so it’s no surprise that we don’t really have any quantitative feedback of real world gains, but I’ll make a prediction today that the courageous developers who will decide to embrace the parallel collections wholeheartedly across their code base will see very little gains. In my experience, inner loops are hardly ever the bottleneck in large code bases, and I’d even go further in suspecting that spawning threads for elements of a loop could have adverse effects (context switching, memory thrashing, cache misses) for loops that iterate on very few elements or that are executing very fast operations already. I’m mostly speculating here, I haven’t run any measurements, so I could be completely wrong.

Remedies

Because of all these problems, I am a bit underwhelmed by the usefulness of the parallel collection framework overall, maybe someone who has a more extensive experience with it can chime in to share the benefits they reaped from it.

I have a couple of suggestions that I think might be a better path for this kind of initiative:

  • Split up the parallel and sequential collections, remove par and make sure that both hierarchies can be evolved independently of each other.
  • Provide a nice Scala wrapper around the Executor framework. Executors have everything that anyone interested in low level parallelism can dream of: configurable thread pool sizes, and even thread pools themselves, thread factories, saturation and rejection policies, lifecycle hooks, etc… You could write a Scala wrapper around this framework in a few hundreds of lines and it would be much more useful than what is currently possible with par.

Thoughts?

Update: Discussion on reddit.