# Random Scala Tip #568: Beware of Leaking Iterators

5 min read

The Tip

The Scala collections library defines an Iterator trait. As you might expect, a typical Iterator's implementation relies on mutable state.

And so for the today's tip:

Beware of leaking Iterator instances outside the scope you initialized it in.

But What's So Special About Iterator?

Glad you asked.

We're all functional adults here, we're avoiding mutable state anyways. And even when we don't, obviously we won't be leaking any mutable state outside some tightly guarded scope. So why am I singling out Iterator specifically?

The reason is that Iterator can be sneaky. It's returned in places you might not expect it, and it superficially behaves like any other (immutable) collection1.

Let me demonstrate. Suppose you're working on this code2:

// 1
def process(datas: List[UserData]): Unit =
datas
.map(enrich) // 2
.foreach(storeToDB) // 3
// 4
def enrich(data: UserData): EnrichedUserData = ???
// 5
def storeToDB(data: EnrichedUserData): Unit = ???

What you have here is a nice, lightly functional, data processing pipeline. The reason why we all love using functional collections for data processing. In detail:

  • The main process function (1) takes a bunch of UserData values
  • It then enriches each one (2) using the enrich function (4)
  • Lastly, it stores each enriched entry into the database (3) using the storeToDB procedure (5)

You can run this code with some sample inputs and get the following output:

Stored ID: 1
Stored ID: 2
Stored ID: 3
Stored ID: 4
...
Stored ID: 1000

Nothing fancy, I'm sure that most Scala programmers are familiar with such code.

But then requirements change. We're told that the database is getting overloaded with all the storeToDB requests. We scratch our head a bit and decide to do some batching.

Easy enough:

def process(datas: List[UserData]): Unit =
datas
.map(enrich)
.grouped(100) // 1
.foreach(storeToDB)
def enrich(data: UserData): EnrichedUserData = ???
// 2
def storeToDB(batch: List[EnrichedUserData]): Unit = ???

With a minor tweak to the code we managed to adapt to the new requirements3. By calling grouped (1) on our list, we split it in batches. We can now send each batch to the modified storeToDB procedure (2) to be stored as a whole batch.

Who doesn't love code that's so adaptable to fluctuating requirements? That's functional modularity at its best.

And you can run it again and see how each batch gets stored:

Stored IDs: 1, 2, 3...
Stored IDs: 101, 102, 103...
Stored IDs: 201, 202, 203...
Stored IDs: 301, 302, 303...
Stored IDs: 401, 402, 403...
...
Stored IDs: 901, 902, 903...

So far, we have no issues. But as always, requirements change yet again. This time we are required to add some monitoring to the batches that we produce. For monitoring we have the following procedure:

def monitor(batch: List[EnrichedUserData]): Unit = ???

It takes a batch and side-effectfully does some monitoring. Another requirement is that we don't want to delay our main flow due to monitoring, so we should run it in a "fire and forget" Future4.

That's all easy enough5:

def process(datas: List[UserData]): Unit =
val batches = // 1
datas
.map(enrich)
.grouped(100)
Future(batches.foreach(monitor)) // 2
batches.foreach(storeToDB) // 3
  • We now assign our batches to a variable (1) so that we can use it twice
  • We run monitoring in a separate Future (2)
  • And we do the data storing just like in the previous snippet (3)

Although the code got a bit uglier, it's not that terrible, so we move on with our day.

Until the production bug hits us in the face...

If we try to run this code like we did before we might get the following output:

Monitored: 100 items
Monitored: 100 items
...
Monitored: 100 items
Monitored: 1 items
Stored IDs: 1, 101, 102...

It seems that we monitored most of what we need to monitor, but we only stored one batch of mismatched IDs.

Or worse yet:

Exception in thread "main" java.lang.NullPointerException: Cannot invoke "scala.collection.IterableOnce.knownSize()" because "prefix" is null
at scala.collection.immutable.List.prependedAll(List.scala:148)
at scala.collection.immutable.List$.from(List.scala:685)
...
Monitored: 100 items
Monitored: 100 items
Monitored: 100 items
...
Monitored: 100 items
Monitored: 20 items

We only monitored some of the batches, but we did no storing at all, instead ending up with an exception.

What's going on?

The Case of the Sneaky Iterator

What just happened is that unbeknownst to us the grouped method returns an Iterator[List[UserData]]. Now, whether or not this is the correct choice of a return type for this method is above my pay grade. But it's there in the standard library (along with some other Iterator-returning methods like sliding) and so we have to deal with it.

The problem here is how invisible it is. Due to type-inference we didn't have to specify the type of batches, so we missed it there. And since Iterator being part of the collections library has a foreach method, there was nothing to draw our attention to the fact that we moved from a regular immutable collection to a mutable Iterator.

As we are sharing the batches value between two flows we get the weird and racy behavior that we just observed. Both flows are competing to consume the Itertor, and the results are mixed.

And this is why I singled out Iterator: it can sneak mutable state where you don't expect it, resulting in code that leaks Iterators into unexpected places.

Possible objection: "but I'm doing pure FP". Indeed, doing pure FP is great. But this means there's all the more reason to be careful with the standard collections, lest you inject an impure value into your pure computation and wreak havoc on your referentially transparent code.

Possible workaround: I might even go as far as adding a linter rule (for, e.g., WartRemover) to the build to forbid the usage of these somewhat dangerous, Iterator-returning, methods.

That's all for now, stay Iterator-safe.

If you enjoyed this, reach out for a workshop on Functional Programming, where I teach how to apply Functional Programming to improve code quality in any language.

Subscribe to the newsletter
Get new posts by email:

Footnotes

  1. Yet another instance of "overloading considered evil".

  2. The full example code is available on GitHub.

  3. This is a great use-case for a streaming library. Unfortunately not all of us get to live in the enlightened world of streaming.

  4. A terrible idea in general, but sometimes you just do what they tell you to do.

  5. Yes, I know the code looks a bit contrived, but it's "based on a true story".


Article Series
1
2
3

Comments