Primary Unit – Help, zip is eating my iterator’s items

Help, zip is eating my iterator’s items

Tuesday, June 18 2019

Python’s zip function, which knits together two iterables, is indispensable for me. It works like this:

list_one = [1, 2, 3]
list_two = ["a", "b", "c"]
for pair in zip(list_one, list_two):
    print(pair)
# (1, 'a')
# (2, 'b')
# (3, 'c')

If the two iterables differ in length, zip halts after the shortest is exhausted. If we add an additional element to one of the lists above, we get the same results:

list_one = [1, 2, 3]
list_two = ["a", "b", "c", "d"]  # Note extra item, "d"
for pair in zip(list_one, list_two):
    print(pair)
# (1, 'a')
# (2, 'b')
# (3, 'c')

But the actual mechanics of this surprised me. Today I was working on the “chunked” problem from Python Morsels (which is great and you should totally try out if you write Python), and was left scratching my head after elements of my iterable started disappearing.

The basic problem for chunked is this: given some iterable, return its elements in count-length lists. Trey likes you to think in terms of “any iterable” so you can’t depend on list-like behaviour, such as being able to index into the iterable or check its length without consuming it.

It’s safer to assume you get one traversal. So, my solution starts like this, creating an iterator from the iterable.

def chunked(iterable, count):
    iterator = iter(iterable)
    ...

Then (eliding the scaffolding) I build up a new count-length chunk using zip in a comprehension:

temp = [item for item, _ in zip(iterator, range(count))]

Here I use the “earliest finish” behaviour of zip paired with range — the amount of numbers in the range (count-many of them) determines how many items I fetch from the iterator.

Let’s give this a try, using your imagination to flesh out the rest of chunked:

for chunk in chunked(iterable=range(10), count=4):
    print(chunk)
# [0, 1, 2, 3]
# [5, 6, 7, 8]

Er, hm. Not what I was expecting, which was:

# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]

Somehow, the program is consuming an extra item from iterator each time I create a chunk. But that list comprehension is the only place where I touch iterator. What gives?

Well, how does zip know when to terminate? If you take a look in the documentation, you’ll see a handy code sample that is “equivalent to” the implementation of zip. There we see that zip builds up a list of results by taking an item from each of the given iterables, but if any of those iterables are finished, it just returns — and discards the result list!

So what happens with zip(longer, shorter) is that it takes from longer, stashes the item, discovers shorter is exhausted, and discards the item from longer. And that’s what happens to the missing numbers in the example above.

This situation arises because I’m zipping the same iterable repeatedly, until it’s empty, and because the iterator is the first argument to zip. This small change works fine:

# Old, broken
temp = [item for item, _ in zip(iterator, range(count))]
# New, fixed
temp = [item for _, item in zip(range(count), iterator)]

In the new version, zip discovers that the iterator over the range is exhausted first, before it takes an item from iterator, so no items are ever discarded.

So, is this OK? Really, really not! This is super-fragile. It’s not obvious that switching the arguments will break the code. And really it just looks wrong, because surely the ignored tuple element (assigned to the underscore) should come after the item that we care about?

Thankfully, the itertools module has what we need (as always!). The reason I originally used the list comprehension-zip-range combo is because you can’t slice every iterable. For example:

(x**2 for x in range(10))[:4]
# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <ipython-input-2-17f2a627cc7c> in <module>
# ----> 1 (x**2 for x in range(10))[:4]
#
# TypeError: 'generator' object is not subscriptable

But you can with islice:

list(islice((x**2 for x in range(10)), 4))
# [0, 1, 4, 9]

And this works great with iterators where you care about the current state:

to_10_sq = (x**2 for x in range(10))
list(islice(to_10_sq, 4))
# [0, 1, 4, 9]
list(islice(to_10_sq, 4))
# [16, 25, 36, 49]
list(islice(to_10_sq, 4))
# [64, 81]

Which leads us to the most straightforward way of building up those chunks.

chunk = list(islice(iterator, count))

(The chunks have to be “concrete” sequences as the problem requires some length-checking for one of the bonus parts, hence the list call.)

Thanks for reading. If I have some key messages, they’re these:

Python is lovely, but it’s not magic!
itertools might have solved your iteration problem already.
Check out Python Morsels. The problems are short, fun, and a nice way to improve your Python skills.