# Random Scala Tip #568: Beware of Leaking Iterators
The Tip
The Scala collections library defines an Iterator
trait. As you might expect, a typical Iterator
's implementation relies on mutable state.
And so for the today's tip:
Beware of leaking
Iterator
instances outside the scope you initialized it in.
But What's So Special About Iterator
?
Glad you asked.
We're all functional adults here, we're avoiding mutable state anyways. And even when we don't, obviously we won't be leaking any mutable state outside some tightly guarded scope. So why am I singling out Iterator
specifically?
The reason is that Iterator
can be sneaky. It's returned in places you might not expect it, and it superficially behaves like any other (immutable) collection1.
Let me demonstrate. Suppose you're working on this code2:
// 1def process(datas: List[UserData]): Unit = datas .map(enrich) // 2 .foreach(storeToDB) // 3
// 4def enrich(data: UserData): EnrichedUserData = ???
// 5def storeToDB(data: EnrichedUserData): Unit = ???
What you have here is a nice, lightly functional, data processing pipeline. The reason why we all love using functional collections for data processing. In detail:
- The main
process
function (1) takes a bunch ofUserData
values - It then enriches each one (2) using the
enrich
function (4) - Lastly, it stores each enriched entry into the database (3) using the
storeToDB
procedure (5)
You can run this code with some sample inputs and get the following output:
Stored ID: 1Stored ID: 2Stored ID: 3Stored ID: 4...Stored ID: 1000
Nothing fancy, I'm sure that most Scala programmers are familiar with such code.
But then requirements change. We're told that the database is getting overloaded with all the storeToDB
requests. We scratch our head a bit and decide to do some batching.
Easy enough:
def process(datas: List[UserData]): Unit = datas .map(enrich) .grouped(100) // 1 .foreach(storeToDB)
def enrich(data: UserData): EnrichedUserData = ???
// 2def storeToDB(batch: List[EnrichedUserData]): Unit = ???
With a minor tweak to the code we managed to adapt to the new requirements3. By calling grouped
(1) on our list, we split it in batches. We can now send each batch to the modified storeToDB
procedure (2) to be stored as a whole batch.
Who doesn't love code that's so adaptable to fluctuating requirements? That's functional modularity at its best.
And you can run it again and see how each batch gets stored:
Stored IDs: 1, 2, 3...Stored IDs: 101, 102, 103...Stored IDs: 201, 202, 203...Stored IDs: 301, 302, 303...Stored IDs: 401, 402, 403......Stored IDs: 901, 902, 903...
So far, we have no issues. But as always, requirements change yet again. This time we are required to add some monitoring to the batches that we produce. For monitoring we have the following procedure:
def monitor(batch: List[EnrichedUserData]): Unit = ???
It takes a batch and side-effectfully does some monitoring. Another requirement is that we don't want to delay our main flow due to monitoring, so we should run it in a "fire and forget" Future
4.
That's all easy enough5:
def process(datas: List[UserData]): Unit = val batches = // 1 datas .map(enrich) .grouped(100)
Future(batches.foreach(monitor)) // 2 batches.foreach(storeToDB) // 3
- We now assign our batches to a variable (1) so that we can use it twice
- We run monitoring in a separate
Future
(2) - And we do the data storing just like in the previous snippet (3)
Although the code got a bit uglier, it's not that terrible, so we move on with our day.
Until the production bug hits us in the face...
If we try to run this code like we did before we might get the following output:
Monitored: 100 itemsMonitored: 100 items...Monitored: 100 itemsMonitored: 1 itemsStored IDs: 1, 101, 102...
It seems that we monitored most of what we need to monitor, but we only stored one batch of mismatched IDs.
Or worse yet:
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "scala.collection.IterableOnce.knownSize()" because "prefix" is null at scala.collection.immutable.List.prependedAll(List.scala:148) at scala.collection.immutable.List$.from(List.scala:685) ...Monitored: 100 itemsMonitored: 100 itemsMonitored: 100 items...Monitored: 100 itemsMonitored: 20 items
We only monitored some of the batches, but we did no storing at all, instead ending up with an exception.
What's going on?
The Case of the Sneaky Iterator
What just happened is that unbeknownst to us the grouped
method returns an Iterator[List[UserData]]
. Now, whether or not this is the correct choice of a return type for this method is above my pay grade. But it's there in the standard library (along with some other Iterator
-returning methods like sliding
) and so we have to deal with it.
The problem here is how invisible it is. Due to type-inference we didn't have to specify the type of batches
, so we missed it there. And since Iterator
being part of the collections library has a foreach
method, there was nothing to draw our attention to the fact that we moved from a regular immutable collection to a mutable Iterator
.
As we are sharing the batches
value between two flows we get the weird and racy behavior that we just observed. Both flows are competing to consume the Itertor
, and the results are mixed.
And this is why I singled out Iterator
: it can sneak mutable state where you don't expect it, resulting in code that leaks Iterator
s into unexpected places.
Possible objection: "but I'm doing pure FP". Indeed, doing pure FP is great. But this means there's all the more reason to be careful with the standard collections, lest you inject an impure value into your pure computation and wreak havoc on your referentially transparent code.
Possible workaround: I might even go as far as adding a linter rule (for, e.g., WartRemover
) to the build to forbid the usage of these somewhat dangerous, Iterator
-returning, methods.
That's all for now, stay Iterator
-safe.
Footnotes
-
Yet another instance of "overloading considered evil". ↩
-
This is a great use-case for a streaming library. Unfortunately not all of us get to live in the enlightened world of streaming. ↩
-
A terrible idea in general, but sometimes you just do what they tell you to do. ↩
-
Yes, I know the code looks a bit contrived, but it's "based on a true story". ↩