Primary Unit – Language is hard, strings are great

Thursday, December 1 2016

Alex Chan wrote a short post today about an oddity in Python’s .lower() string method, which converts a string to lowercase.

In short, the problem is this:

>>> print('\u0130',
...       len('\u0130'))
İ 1
>>> print('\u0130'.lower(),
...       len('\u0130'.lower()))
i̇  2

(Print is used because I was having trouble with the closing quote disappearing in the second case and didn’t want to mislead you over the output — this is a hint as to the problem! That said, I inserted an extra space to get the second case to display properly for me, which you may or may not see.)

So you have an upper-case dotted i (İ, U+0130), used in a few alphabets but mainly Turkish, which has length 1 in Python, while its lowercase form has length 2.

Why’s this happening? The first answer is that the length of a Python string is the number of Unicode code points, not the number of perceived characters. The lowercased result is a Latin small letter i (U+0069) followed by a combining dot above (U+0307).
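One way to see the two code points is to ask the standard library's unicodedata module for their names. This is a minimal sketch, assuming Python 3; the variable names are only illustrative.

```python
import unicodedata

lowered = '\u0130'.lower()

# One (code point, official Unicode name) pair per element of the string.
code_points = [(f'U+{ord(ch):04X}', unicodedata.name(ch)) for ch in lowered]

for code_point, name in code_points:
    print(code_point, name)
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE
```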

Swift is a language that seems to handle the split between characters and code points very well:

1> print("\u{0130}",
         "\u{0130}".unicodeScalars.count,
         "\u{0130}".characters.count)
İ 1 1

2> print("\u{0130}".lowercased(),
         "\u{0130}".lowercased().unicodeScalars.count,
         "\u{0130}".lowercased().characters.count)
i̇  2 1

(Extra space again!)

Here, both strings consist of a single character but a different number of code points. I’ve barely used Swift, so I don’t know if it has any Unicode gotchas, but this seems to be the right way to handle it, and something I’d like in Python (4?).
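Python's standard library has no direct equivalent of Swift's character count, but a rough approximation is to skip combining marks when counting. The helper name approx_character_count below is hypothetical, and the approach is an assumption-laden sketch: it is not full grapheme cluster segmentation and will miscount things like emoji sequences.

```python
import unicodedata

def approx_character_count(text):
    """Count code points that are not combining marks.

    A rough stand-in for counting perceived characters: it ignores
    emoji sequences, Hangul jamo and other cases that full grapheme
    cluster segmentation handles.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

upper = '\u0130'
lower = upper.lower()

# len() counts code points; the helper approximates perceived characters.
print(len(upper), approx_character_count(upper))  # 1 1
print(len(lower), approx_character_count(lower))  # 2 1
```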

So that’s the first answer. The second is that İ is a special case in Unicode: literally the second entry in the Unicode special casing data file (SpecialCasing.txt).

In the Turkish alphabet, the lowercase of İ is i, the standard Latin small letter i. But if that’s what you got from .lower(), a subsequent call to .upper() would give you a totally different letter:

İ -> i -> I     # Wrong at the end

In the Turkish alphabet, I is the capital form of ı, a small dotless i. So a round trip would destroy the original character. That’s why Unicode decided to lowercase İ to i̇, a Latin small letter i followed by a combining dot above. It seems to be a sequence that exists only to allow round-tripping İ:

İ -> i̇ -> İ    # Sort-of wrong in the middle

The second round trip is, yes, incorrect in the middle: Turkish-specific casing functions would lowercase İ to a plain i. There are two arguments to be made here: it’s a practical decision based on i -> I being the most common mapping in languages using Latin script; or it shows how Latin script-centric computing is.
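The default round trip İ -> i̇ -> İ can be checked in Python 3. In this sketch the result of .upper() is not the single code point U+0130 but I followed by the combining dot, which is canonically equivalent to it; normalizing to NFC gives back the original.

```python
import unicodedata

original = '\u0130'  # İ
round_tripped = original.lower().upper()

# Code point for code point, the result differs from the original...
print([f'U+{ord(ch):04X}' for ch in round_tripped])  # ['U+0049', 'U+0307']
print(round_tripped == original)                      # False

# ...but it is canonically equivalent, so NFC normalization recomposes it.
print(unicodedata.normalize('NFC', round_tripped) == original)  # True
```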

The way to handle Turkish casing properly would be locale-specific case transformations, which was the conclusion of a Python bug discussion about this very issue. As mentioned at the end of that thread, you’ll want to look at PyICU if you have to deal with these kinds of differences.
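To make the idea of a locale-specific transformation concrete, here is a hand-rolled sketch of Turkish casing using only the standard library. It is not PyICU and not a substitute for it: turkish_lower and turkish_upper are hypothetical helpers that only special-case the dotted and dotless i, and they assume the input really is Turkish text.

```python
import unicodedata

def turkish_lower(text):
    """Lowercase with the Turkish dotted/dotless i rules (hypothetical helper)."""
    # Compose I + combining dot above into İ so both spellings are treated
    # alike. Note that NFC also composes any other decomposed characters.
    text = unicodedata.normalize('NFC', text)
    # Map the two special capitals first, then let str.lower() do the rest.
    return text.replace('\u0130', 'i').replace('I', '\u0131').lower()

def turkish_upper(text):
    """Uppercase with the Turkish dotted/dotless i rules (hypothetical helper)."""
    # Map the two special small letters first, then let str.upper() do the rest.
    return text.replace('i', '\u0130').replace('\u0131', 'I').upper()

print(turkish_lower('\u0130'), len(turkish_lower('\u0130')))  # i 1
print(turkish_upper('i'))                                     # İ
print(turkish_upper(turkish_lower('D\u0130YARBAKIR')))        # DİYARBAKIR
```

With these rules the lowercase of İ has length 1 again, and the round trip preserves both kinds of i, at the cost of being wrong for text in any other language.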

So, where does that leave us? First, it should be a caution that certain language properties that we take for granted may not be universal.

In Alex’s case, the assumption was that a mixed- or upper-case string has a lower-case transformation that is of the same length. (Although, as we’ve seen, if .lower() did what would be ideal for Turkish alphabet users then it would be the same length.)
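The same-length assumption can be tested directly by scanning every code point for a lowercase form that is not exactly one code point long. This is a sketch assuming Python 3; the exact contents of the resulting list depend on the Unicode version your Python ships with, so the code only checks that İ is in it. Uppercasing breaks the assumption too, with ß as a familiar example.

```python
import sys

# Every code point whose lowercase form is not exactly one code point long.
length_changing = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if len(chr(cp).lower()) != 1
]

print('\u0130' in length_changing)              # True
print('\u00df'.upper(), len('\u00df'.upper()))  # SS 2
```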

That point extends further once you learn that many scripts are unicameral and have no case distinction at all.

The second is that the representation of a string does not necessarily match the perceived length of a string. Swift exposing both characters and unicodeScalars makes that plain: characters are what you expect, Unicode scalars are how those characters are stored. And just look to Python 2’s str type for yet another example — a bag of bytes that may or may not be text.
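Python 3 separates text from bytes, which makes it easy to see how far the stored size can drift from the one character a reader perceives. A small sketch, with the two encodings chosen only as examples:

```python
lowered = '\u0130'.lower()  # one perceived character: i with a dot above

sizes = {
    'code points': len(lowered),
    'UTF-8 bytes': len(lowered.encode('utf-8')),
    'UTF-16 bytes': len(lowered.encode('utf-16-le')),
}
print(sizes)
# {'code points': 2, 'UTF-8 bytes': 3, 'UTF-16 bytes': 4}
```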