Escape to the country

Every year I leave London for Christmas and go back to Leicester. Usually I get Christmas Eve off, as there’s no Christmas Day paper, and get out of the city a day ahead of many others.

No such luck this year as Christmas Eve is on the Saturday, so I’m going back with everyone else after work on the Friday night. Right, time to book the train tickets.

An edited screenshot showing return tickets all costing £59.50.

(This screenshot is heavily edited to condense the information, but you can see the tickets for yourself.)

Hmm. That’s a bit dear. What about singles… oh. “From £71.” Really?

Yes, really. Here’s a selection of single tickets for trains leaving St Pancras the night before Christmas Eve, along with the ticket cost in euros for reasons that will become clear:

Single train tickets (selected) leaving St Pancras for Leicester, Friday December 23 2016
Depart  Arrive  Cost (£)  Cost (€)
17:57   19:00   £79.50    €95.00
18:25   19:40   £55.00    €65.50
18:30   19:52   £71.00    €84.50
19:15   20:23   £58.50    €70.00
19:32   20:53   £42.50    €50.50
19:55   21:01   £55.00    €65.50
20:15   21:22   £55.00    €65.50
20:26   21:29   £42.50    €50.50
20:55   22:02   £42.50    €50.50
21:00   22:26   £28.50    €34.00
21:30   23:15   £58.50    €70.00
22:00   23:44   £21.00    €25.00
22:25   23:57   £17.50    €21.00
23:15   01:50   £14.50    €17.50

Looking at these prices, there’s an obvious tailing off as it gets later. But there’s a wild fluctuation even between adjacent services.

Why does it cost £30 more to travel at 21:30 instead of 21:00, on a service that calls at the same stops (not shown) but takes 19 minutes longer?
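For anyone who wants to check that sum, here is a minimal Python sketch. The helper name journey_minutes is mine, and the times and fares are simply copied from the table above.

```python
from datetime import datetime, timedelta


def journey_minutes(depart: str, arrive: str) -> int:
    """Minutes between two HH:MM times, allowing for arrival after midnight."""
    fmt = "%H:%M"
    start = datetime.strptime(depart, fmt)
    end = datetime.strptime(arrive, fmt)
    if end < start:  # e.g. the 23:15, which arrives at 01:50 the next day
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)


# (depart, arrive, fare in pounds), copied from the table above.
nine_pm = ("21:00", "22:26", 28.50)
half_nine = ("21:30", "23:15", 58.50)

extra_minutes = journey_minutes(*half_nine[:2]) - journey_minutes(*nine_pm[:2])
extra_pounds = half_nine[2] - nine_pm[2]
print(extra_minutes, extra_pounds)  # 19 30.0
```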

Why does it cost £79.50 to travel at 17:57? £79.50! That’s £6.40 more than the weekly dole for over-25s. 11 hours’ work at the minimum wage for over-25s. This is mad.

What’s the situation abroad? Let’s pick on the Netherlands, going from Amsterdam to Eindhoven as the journey times are similar.

Single train tickets leaving Amsterdam for Eindhoven, Friday December 23 2016
Depart  Arrive  Cost (£)  Cost (€)
19:55   21:16   £16.00    €19.20
20:10   21:28   £16.00    €19.20
20:25   21:46   £16.00    €19.20
20:40   21:58   £16.00    €19.20
20:55   22:16   £16.00    €19.20

It’s Friday night, the night before Christmas Eve, thousands are leaving the capital to spend Christmas with their families. You plan to leave any time from just before 20:00 to just after 21:00. How much do you pay for your ticket?

In England, about £50. The Netherlands, £16 — the same as you’d pay tomorrow, or the day after, the same as you’d pay in the morning peak.

Leaving aside the much higher price in England — a function of our railways being run by billionaire bigots — the huge variation in pricing says something important about the way our country’s railways are thought of.

Train journeys are not an essential service to be provided to all, but a scarce commodity to be sold for the most profit. There’s nothing inherently different about these services: they travel the same distance along the same tracks and serve similar stations along the way.

But there’s more demand at certain times, so the price goes up — to make the most amount of money out of people able to pay it or who absolutely must travel at those times.

In contrast, to use our Netherlands example, a journey costs a certain amount of money, determined approximately by the distance travelled. It’s a service, not a commodity. If the train’s full then the train’s full, get the next one in 15 minutes.

This nonsense can be laid at the door of privatisation. Britain now has the highest rail fares in Europe, and they are rising much faster than real wages, yet subsidies to private operators are four times what was paid to British Rail before it was broken up. And there has been little investment, as private operators extract profits, so while passenger numbers are up since 1994, rail’s modal share (the proportion of people travelling by train) hasn’t budged. Action for Rail have a good report on the “four big myths” of privatisation.

But it looks like I’m having a late one on Friday December 23.

Language is hard, strings are great

Alex Chan wrote a short post today dealing with an oddity with Python’s .lower() string method, which converts a string to all lower case.

In short, the problem is this:

python:
>>> print('\u0130',
...       len('\u0130'))
İ 1
>>> print('\u0130'.lower(),
...       len('\u0130'.lower()))
i̇  2

(Print is used because I was having trouble with the closing quote disappearing in the second case and didn’t want to mislead you over the output — this is a hint as to the problem! That said, I inserted an extra space to get the second case to display properly for me, which you may or may not see.)

So you have an upper-case dotted i (İ), used in a few alphabets but mainly Turkish, which is of length 1 in Python, but whose lowercase form is of length 2.

Why’s this happening? The first answer is that the length of Python strings is the number of Unicode code points — not the number of perceived characters. In the lower case we have a small latin i with a combining dot afterwards.
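If the combining dot is hard to see on screen, the standard library’s unicodedata module will name each code point for you. This is just a quick check using nothing beyond the string from the example above.

```python
import unicodedata

lowered = "\u0130".lower()

# Print each code point in the lower-cased string with its Unicode name.
for ch in lowered:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE
```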

Swift is a language that seems to handle the characters or code points split very well:

swift:
1> print("\u{0130}",
         "\u{0130}".unicodeScalars.count,
         "\u{0130}".characters.count)
İ 1 1

2> print("\u{0130}".lowercased(),
         "\u{0130}".lowercased().unicodeScalars.count,
         "\u{0130}".lowercased().characters.count)
i̇  2 1

(Extra space again!)

Here, both consist of a single character but different numbers of code points. I’ve barely used Swift so I don’t know if it has any Unicode gotchas, but this seems to be the right way to handle it and something I’d like in Python (4?).

So that’s the first answer. The second is that İ is a special case in Unicode: literally the second entry in the Unicode special cases document.

In the Turkish alphabet, a lowercase İ is i — the standard latin small letter i. But if that’s what you got from .lower() you’d end up with a totally different letter if you were to then call .upper():

İ -> i -> I     # Wrong at the end

In the Turkish alphabet, I is the capital form of ı — a small dot-less i. So a round trip would destroy the original character. That’s why in Unicode the decision was made to turn it into i̇, a small latin i with an additional dot above. It seems to be a character that only exists to allow for round-tripping İ:

İ -> i̇ -> İ    # Sort-of wrong in the middle

The latter is, yes, incorrect for Turkish. Turkish-specific casing functions would handle this differently. There are two arguments to be made here: it’s a practical decision based on i -> I being the most common mapping in languages using Latin script; or it shows how Latin script-centric computing is.
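Both round trips can be checked in Python. One detail the arrows gloss over: as far as I can tell, the upper-cased result is the decomposed pair I + combining dot rather than the single İ code point, and it takes NFC normalisation to get back to exactly where you started. A small sketch:

```python
import unicodedata

dotted_capital = "\u0130"  # İ

# The "Turkish" lower case of İ would be a plain i, but upper-casing
# a plain i gives a plain I: the dot is lost.
assert "i".upper() == "I"

# Python's default mapping: İ becomes i followed by U+0307.
lowered = dotted_capital.lower()
assert lowered == "i\u0307"

# Upper-casing again gives I followed by U+0307, the decomposed form of İ,
round_trip = lowered.upper()
assert round_trip == "I\u0307"

# which NFC normalisation composes back into the single code point.
assert unicodedata.normalize("NFC", round_trip) == dotted_capital
```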

The way to handle this would be locale-specific case transformations, the conclusion of a Python bug discussion about this very issue. As is mentioned at the end of that thread, you’ll want to look at PyICU if you have to deal with these kinds of differences.
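To give a flavour of what a locale-specific transformation has to do, here is a deliberately naive pure-Python sketch. The function names and the two-letter mapping are my own assumptions for illustration; this is not PyICU’s API, it ignores already-decomposed input, and real locale rules cover more than these two letters.

```python
# Hypothetical helpers: swap the two Turkish i pairs by hand, then let
# Python's default casing deal with everything else.
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})  # İ -> i, I -> ı
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I


def turkish_lower(text: str) -> str:
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text: str) -> str:
    return text.translate(_TR_UPPER).upper()


print(turkish_lower("\u0130STANBUL"))  # istanbul
print(turkish_upper("istanbul"))       # İSTANBUL
```

With this mapping the round trip preserves both the letter and the length, which is exactly what the default .lower() can’t promise.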

So, where does that leave us? First, it should be a caution that certain language properties that we take for granted may not be universal.

In Alex’s case, the assumption was that a mixed- or upper-case string has a lower-case transformation that is of the same length. (Although, as we’ve seen, if .lower() did what would be ideal for Turkish alphabet users then it would be the same length.)

That point can be expanded out when you find out that many scripts are unicameral and don’t have case distinctions.

The second is that the representation of a string does not necessarily match the perceived length of a string. Swift exposing both characters and unicodeScalars makes that plain: characters are what you expect, Unicode scalars are how those characters are stored. And just look to Python 2’s str type for yet another example — a bag of bytes that may or may not be text.
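The three views of the same string (bytes, code points, perceived characters) can be put side by side in Python. The standard library has no proper grapheme segmentation, so approx_characters below is my own rough stand-in that simply skips combining marks; it is good enough for this example but is not a full implementation of Unicode’s text segmentation rules.

```python
import unicodedata


def approx_characters(text: str) -> int:
    """Count code points that are not combining marks (a rough approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


for s in ("\u0130", "\u0130".lower()):
    # UTF-8 bytes, code points, approximate perceived characters
    print(len(s.encode("utf-8")), len(s), approx_characters(s))
# 2 1 1
# 3 2 1
```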

Trump

At work, we tend to avoid US politics. I think partly that’s a reaction to the obsession of the rest of the British media with the US as the imperial centre and their tendency to see commonalities with our own (political) culture that aren’t really there.

But, I’m not at work yet, so why not? The Trump victory is being portrayed as an upset here but I think it’s unhelpful to see it as massively unexpected (though I did think the result would come out the other way).

That the polls were close was a real sign of trouble given Trump’s extreme statements. They, and the eventual result, show just how weak a candidate Hillary Clinton was. Head-to-head polls in May put Clinton 3 percentage points ahead of Trump on average, but her left-wing primary challenger Bernie Sanders 10 points ahead. There’s no way to say if that lead would have held up: Clinton once held a similar lead that evaporated.

Whatever you think of Trump, a filthy-rich racist demagogue, he at the very least engaged with some of the real concerns of voters. Clinton’s ideological position meant that she was incapable of doing so. She couldn’t talk seriously — or be taken seriously — about, say, jobs and industry because her position is one of support for the capitalist forces that are putting people out of work and wrecking entire sectors.

It’s not that Trump is truly different on that score, but his campaign has used the tactics of the far right of taking real concerns and appearing to address them using hatred against marginalised groups. There are clear historical parallels to be drawn but I won’t here.

Clinton couldn’t do that. She was clearly the Establishment candidate, repeating in the way that they have for decades that you can’t challenge the supremacy of the market but, trust me, I’ll manage it better than the other guy. At this point, people have rejected that forcefully and publicly.

But the Democrats didn’t acknowledge that’s where their challenge lay. If they had, Clinton would not have been their choice coming out of the primaries. Sanders occupied some of the same ground as Trump but his answers were honest, decent ones instead of Trump’s hate.

After complaining of people drawing dodgy parallels between Britain and the US, I’ll make one to close. If the Labour Party hadn’t chosen a solid left platform with the election of Jeremy Corbyn in 2015 I would not have been surprised if we saw a similar far-right force gain quickly in popularity. While we’ve got our usual collection of nasty types, that hasn’t happened.

But there needs to be a recognition that modern liberalism, in its two differing forms on both sides of the Atlantic, does not and cannot address people’s real concerns as it is often complicit in making matters worse, and that if you don’t come up with a serious left-wing challenge then the fascists have everything to gain.

Blind data

The Guardian’s Blind Date column has been going for over seven and a half years now, but I always struggled to read it. There was something missing — I didn’t just want to peer into these people’s lives and be left feeling bad for them if things hadn’t gone well.

Well the missing thing was The Guyliner’s sort-of reviews, which are brilliant. I only found out about his blog recently and binged a bit on them.

One thing I find interesting is the way the daters’ scores for each other, which are meant to be out of 10, are stuck in a limited range between 7 and 9. (7 being “a gentleman’s one”.)

In a recent entry (which mentions a letter to the Guardian about the limited range of scores used) the two seem to get on really well and want to see each other again, but the scores are 7.5 and 8.

To get a bit of perspective on the scoring I went through all of the Blind Date columns from January 31 2009 through October 31 2016. The Guardian’s API makes this easy, although what wasn’t immediately obvious is that you can use subsection paths (such as Blind Date at lifeandstyle/series/blind-date) as an alternative to an imprecise search for the same articles. Use the interactive explorer to see for yourself.

I used a bit of Python to grab all the articles, save them to disk and pull out two things: the score each dater gave their opposite number and whether or not they wanted to see them again.

The data needed cleaning up by hand, usually to parse whether a person wanted to see their opposite number again. This often required a bit of judgement on my part, so it’s not perfect. “Just as friends” counts as a No: only seeming romantic interest gets a Yes. I excluded people who in whatever way refused to answer the scoring question. (This includes “The food was a 10” etc.) I was left with 637 individual responses.
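To show why a first automated pass still needs checking by hand, here is a sketch of the kind of score extraction involved. The regex and the function name extract_score are assumptions for illustration, not the code I actually ran.

```python
import re
from typing import Optional

# A whole number from 0 to 10, optionally followed by a decimal part.
SCORE_RE = re.compile(r"\b(?:10|\d)(?:\.\d+)?\b")


def extract_score(answer: str) -> Optional[float]:
    """Return the first number that looks like a score, or None if there isn't one."""
    match = SCORE_RE.search(answer)
    return float(match.group()) if match else None


print(extract_score("That seems rather ungentlemanly, but since you insist, 7."))  # 7.0
print(extract_score("All I'll say is \"top marks\""))  # None
# A false positive that only a human would catch: the score is for the food.
print(extract_score("The food was a 10"))  # 10.0
```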

I want to stress two things: the scores are the scorer’s judgement on their date and don’t reflect mutual agreement; answers to the “Would you meet again?” question might be swayed by their partner’s reaction. So, for example, a person might rate their date a 9 but say No to the latter question if their date didn’t seem interested. I wouldn’t worry too much about this for our purposes, but I’m also not claiming this is rigorous work.

So, how frequently do the scores come up?

A bar chart showing the distribution of scores in the Guardian’s Blind Date column.

Dominating the scoring are 8 and 7, with 9 a distant third. 6 and 10 get a look in but only that.

Very few people award less than a 6 — in fact, you’re more likely to get a half-point score between 7 and 9 than a 5.

Overlaid on the grey total bars are red bars, which are daters who would like to meet their partner again. The way 8 dominates the scoring, it’s not surprising that there are more Yes answers to the “meet again” question for 8-awarding daters than any other score.

But what happens when we look at how likely a person is to want to see their date again for the score given?

A bar chart showing percentage of people giving a certain score who would like to meet their date again.

Because no-one’s ever awarded less than a 6 and also wanted to see their date again, I’ve limited this plot to scores 6–10. It’s seriously unlikely that someone who awards a 6 wants to see their date again; at 7 it’s not hugely better at about one in four.
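The percentage for each score is just the number of Yes answers divided by the number of times that score was awarded. A minimal sketch of the calculation, using a handful of made-up responses rather than the real data:

```python
from collections import Counter

# Made-up (score, wants_to_meet_again) pairs, purely to show the calculation.
responses = [
    (6, False),
    (7, False), (7, False), (7, True), (7, False),
    (8, True), (8, False),
    (9, True),
]

totals = Counter(score for score, _ in responses)
yeses = Counter(score for score, yes in responses if yes)

# Percentage of people awarding each score who said Yes to meeting again.
yes_rate = {score: 100 * yeses[score] / totals[score] for score in sorted(totals)}
print(yes_rate)  # {6: 0.0, 7: 25.0, 8: 50.0, 9: 100.0}
```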

7.5 is an interesting score. I was initially tempted to round half-scores but I’m glad I didn’t (though I did round silly scores like 8.9 to the nearest integer). If someone awards a 7.5 they are much more likely to want to see their date again than a straight 7, at just under half the time, but still noticeably less than the rate for 8.
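That rounding rule (leave whole and half scores alone, round anything else to the nearest integer) is short enough to sketch. The function name normalise_score is hypothetical and this is a reconstruction of the rule as described, not my original code.

```python
def normalise_score(score: float) -> float:
    """Keep whole and half scores; round anything else to the nearest integer."""
    score = float(score)
    if (score * 2).is_integer():  # 7, 7.5, 8, 8.5 and so on
        return score
    return float(round(score))


print([normalise_score(s) for s in (7, 7.5, 8.9, 6.1, 7.4, 7.75)])
# [7.0, 7.5, 9.0, 6.0, 7.0, 8.0]
```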

The same can’t be said about 8.5, though, which really is a cautious 9. Someone who gives an 8.5 or 9 is pretty likely to want to see their date again.

More so actually than 10, but I’ve got a theory here: 10 is the refuge for a certain group of people who had a good time but didn’t feel anything for their date. Given the relative rarity of 10, I think it’s enough to bring down the Yes percentage to beneath that for 8.5 & 9.

(We can ignore the 100% Yes rate for 9.5, a score which has only been awarded twice.)

Lessons, then. The real scoring range is 6 to 10, but within that there are only real differences in the fundamental question — Would you meet again? — up to 8.5, after which things level off.

That’s it. I did pick up a few scoring bugbears while doing this, though:

  • “Cute” scores. 6.1, 7.4, 7.75 (twice!). These come up not often but enough that people really should resist.
  • Not answering the scoring question. My favourite was:

    What is this, a baking competition? All I’ll say is “top marks”

    If you can award a numerical score to a cake you can award one to a date.

  • People who say they won’t answer the scoring question because they’re above it but who actually do answer the scoring question:

    That seems rather ungentlemanly, but since you insist, 7.

    The date? 7. Jo? I wouldn't be so vulgar…

    It’s men that do this. Please stop.

And lastly, our favourite fairly common sort of cop-out answer: spark.

Across the 357 articles I downloaded (a handful of which aren’t actual Blind Date columns), spark is mentioned 45 separate times — about once every eight articles, which is much less frequently than I expected.

Padding regex groups in Name Mangler

I’m a big fan of Name Mangler by Many Tricks, an interactive file renaming application for everything from simple operations to really quite complex and powerful ones, with a comfortable and straightforward interface.

I don’t use it particularly often but it’s nice to have in the toolbox for things that might otherwise be frustrating.

One of my most common tasks for Name Mangler is converting the filename convention used internally at work for naming page files to a more general format we use for our external partners.

Here’s what they look like:

# Internal
1_Front_221016.pdf
# External
MS_2016_10_22_001.pdf

Normally this is done automatically with a scheduled script, but occasionally that script fails (at a different stage) and it has to be done by hand. Now obviously this involves a regular expression, and the page number group (at the start of the internal name, end of the external) is zero-padded so it’s three digits long.

So, in Name Mangler’s advanced renaming syntax that becomes:

[pad
    "$1"
    to "-3"
    with "0"
]

right?

Well, no. What happens is interesting. The literal string $1 is zero-padded until it’s three characters long: 0$1 (one extra zero). Only after that is the regex replacement made, so page #1 becomes 01: 1 with one zero on the front.
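The order-of-operations problem is easy to reproduce in Python. This is only an analogy (a made-up pad function and a plain string replace standing in for the regex substitution), not a description of how Name Mangler is implemented.

```python
def pad(text: str, width: int = 3, fill: str = "0") -> str:
    """Left-pad text with fill until it is width characters long."""
    return text.rjust(width, fill)


page = "1"

# Padding the placeholder first: "$1" is already two characters long,
# so it only gains one zero before the page number is substituted in.
padded_too_early = pad("$1").replace("$1", page)

# Substituting first, then padding the actual page number.
padded_after = pad("$1".replace("$1", page))

print(padded_too_early, padded_after)  # 01 001
```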

To Many Tricks’s great credit, they responded to the support ticket I raised with example code in less than a day, along with an explanation of what’s happening by the developer.

The trick is, instead of providing the group name as the argument to pad, to perform the regex search in-line:

[pad
    [findRegularExpression
        "^(\d+)_.*$"
        in <name>
        replace with "$1"
    ]
    to "-3"
    with "0"
]

This means that when pad gets its arguments, it’s exactly what you want to pad.

By necessity this uses regexes twice: once for parsing the date and constructing most of the name, and once for finding and padding the page number. Perhaps the reason why I missed this approach by myself is that in the Python code, the regex search is performed once and the groups placed in this format string:

MS_{date:%Y}_{date:%m}_{date:%d}_{page:03}.pdf

Which takes care of the padding with no fuss. (At this point, the date has been parsed for completeness’s sake, hence the strftime codes.)
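For the curious, a cut-down sketch of that Python approach might look like the following. The format string is the one above; the regex, the group names and the function name external_name are assumptions for illustration, based only on the two example file names, and the real script does more than this.

```python
import re
from datetime import datetime

# Page number at the start, six-digit DDMMYY date before the extension.
INTERNAL = re.compile(r"^(?P<page>\d+)_.*_(?P<date>\d{6})\.pdf$")
EXTERNAL = "MS_{date:%Y}_{date:%m}_{date:%d}_{page:03}.pdf"


def external_name(internal: str) -> str:
    """Convert an internal page file name to the external convention."""
    match = INTERNAL.match(internal)
    if match is None:
        raise ValueError(f"unrecognised page file name: {internal!r}")
    date = datetime.strptime(match["date"], "%d%m%y")
    # page must be an int for the 03 format spec to zero-pad it.
    return EXTERNAL.format(date=date, page=int(match["page"]))


print(external_name("1_Front_221016.pdf"))  # MS_2016_10_22_001.pdf
```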

That might lead to a question about why I don’t just use Python to do this. And the answer is that, once you’ve got a recipe that works for you, Name Mangler is painless and flexible. Really, check it out.