One of my responsibilities at work is to provide a list of people who our printers should call if there’s ever a problem with the edition. Usually that’s the chief sub, or whoever is covering her.

I also prepare the rota for the journalistic staff, which I use as the source of information for the responsibility list.

This job has largely escaped automation. I do have a Python script that prints a nice template report for the week ahead, complete with BBEdit placeholders, but working out whose name should be attached to each edition is just done by reading the rota across and deleting names from the template list until you’re down to one.
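
For illustration, here’s a minimal sketch of what the template-printing part might look like (this isn’t the real script; the <#…#> placeholder names and the assumption of no Saturday production day are mine):

# A rough sketch only: print a template report for the week ahead with
# BBEdit-style <#...#> placeholders still to be filled in.
from datetime import date, timedelta

def week_template(start):
    lines = []
    for offset in range(7):
        day = start + timedelta(days=offset)
        if day.isoweekday() == 6:    # assumed: no Saturday production day
            continue
        lines.append(f'{day:%a %b %d}    <#pagination#>    <#name#>')
    return '\n'.join(lines)

print(week_template(date(2018, 5, 6)))    # week starting Sunday May 6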

However, I’ve found that things of this nature, if not automated, get put off, forgotten, or done wrong. This task, because it’s not actually vital to anything, is no exception, particularly when I’m pulled into jobs that actually are vital.

The report looks a little like this, so you get the idea:

Tue May 08    16pp    Alice Jones
Wed May 09    16pp    Bob Smith
Thu May 10    16pp    Rob Wells

And so on, with the pagination in the middle column.

The pagination is consistent (16 in the week, 24 on the weekend) with occasional larger editions. It can either be predicted with total certainty or none at all, as the large editions vary considerably with advertising and feature articles.

The responsibility can’t be predicted because we don’t work fixed patterns (we don’t have enough staff to do so). However, it can be done in advance once the newsroom rota is completed.

So let’s forget the pagination and just focus on pulling together a list of every production day in the completed period and who is the chief sub.

Our newsroom rota is just a spreadsheet, which is actually the best tool I’ve found so far for handling a couple dozen people with intricate job-cover links between them. (The rota used to be laid out in InDesign, which, no matter what you think about spreadsheets or InDesign, was much more difficult.)

It looks a bit like this (the real spreadsheet has proper formatting and so on):

Sun 6/5 Mon 7/5 Tue 8/5 Wed 9/5 Thu 10/5 Fri 11/5 Sat 12/5 Lieu add Lieu tot
Rob Wells Off Sport Ch Sub Sport 10
Alice Jones Off 0.25 4.5

There’s a fair amount of information: names, dates, days off, cover responsibilities, new and accrued TOIL. It’s entirely designed for humans, not computers (and it takes the humans a little while until they’re able to read it).

A lot is implicit. If we assume in this example that Alice is the chief sub, she is performing that role on her usual working days (the empty cells). It is only marked for people who have to cover someone else’s job.

This table is not something that you can just chuck into a computer program; it needs cleaning up first.

Thankfully, R (and the Tidyverse particularly) is a great environment in which to wrangle your data, and to do so fairly quickly. All the code below was pulled together in about 30 minutes total (with a good 10 minutes of reading documentation and fixing errors in the original source data). Writing this post has taken much longer.

In our example below we’re going to have four workers who each cover the chief sub at different times. Here we’re going to make “Dan Taylor” the chief sub. Congratulations, Dan!

First we’ll pull in our libraries.

library(tidyverse)
library(lubridate)
library(stringr)

Then we’ll read in the data, which is saved in a TSV file after copying and pasting from the spreadsheet into a text document. We’ll select only the production days and the unnamed first column (named X1 on import), excluding Saturdays and the TOIL columns.

wide <- read_tsv('chsub.tsv') %>%
    select(matches('^(Mon|Tue|Wed|Thu|Fri|Sun) |X1')) %>%
    rename(name = X1)

Then we’ll use a tidyr function, gather(), to transform our wide format into a tall one by selecting the date columns. It’s easier to get a feel for gather() by looking at the output.

tidy <- wide %>%
    gather(matches('^(Mon|Tue|Wed|Thu|Fri|Sun) '),
           key = date,
           value = status)
head(tidy)
## # A tibble: 6 x 3
##   name           date     status
##   <chr>          <chr>    <chr>
## 1 Alice Jones    Sun 29/4 Off
## 2 Bob Smith      Sun 29/4 Sick
## 3 Carol Williams Sun 29/4 Booked
## 4 Dan Taylor     Sun 29/4 <NA>
## 5 Alice Jones    Mon 30/4 <NA>
## 6 Bob Smith      Mon 30/4 Off

We now have a row for each person for each day, along with their “status” for the day.

But Dan doesn’t have his chief sub days marked, as it would be nearly every day. Let’s split out Dan’s rows and replace the empty cells with Ch Sub, the same status string used by everyone else. Then we’ll combine the filled-out Dan rows with all the non-Dan rows from the original data frame.

dan_replaced <- tidy %>%
    filter(name == 'Dan Taylor') %>%
    replace_na(list(status = 'Ch Sub'))

all <- tidy %>%
    filter(name != 'Dan Taylor') %>%
    rbind(dan_replaced)

tail(all)
## # A tibble: 6 x 3
##   name       date      status
##   <chr>      <chr>     <chr>
## 1 Dan Taylor Sun 30/12 Ch Sub
## 2 Dan Taylor Mon 31/12 Ch Sub
## 3 Dan Taylor Tue 1/1   Ch Sub
## 4 Dan Taylor Wed 2/1   Ch Sub
## 5 Dan Taylor Thu 3/1   Ch Sub
## 6 Dan Taylor Fri 4/1   Ch Sub

Great. But poor Dan, he’s working every day over New Year 2018-2019. In reality, I haven’t got that far with the rota, just up to October. We’ll convert all those dates now, and filter out all the newly missing entries where the month was outside our range.

dated <- all %>%
    mutate(
        date = dmy(str_c(str_extract(date, '\\d+/[4-9]'), '/2018'))
    ) %>%
    filter(!is.na(date))

Let’s get only the chief sub-related rows and sort them by date.

chsub <- dated %>%
    filter(str_detect(status, 'Ch Sub')) %>%
    arrange(date) %>%
    select(date, chief_sub = name)
head(chsub)
## # A tibble: 6 x 2
##   date       chief_sub
##   <date>     <chr>
## 1 2018-04-29 Dan Taylor
## 2 2018-04-30 Dan Taylor
## 3 2018-05-01 Dan Taylor
## 4 2018-05-02 Bob Smith
## 5 2018-05-03 Dan Taylor
## 6 2018-05-04 Carol Williams

Exactly what we want. Now time for a bit of formatting to make this giant list somewhat acceptable for other people. This is also where my knowledge of R runs out.

formatted <- str_c(
    format(chsub$date, '%a %Y-%m-%d'),
           chsub$chief_sub,
           sep = '  ')

fd <- file('output.txt')
writeLines(formatted, fd)
close(fd)

So we’ll switch to Python, printing a blank line between each production week (of six days).

with open('output.txt') as f:
    for idx, line in enumerate(f.readlines()):
        if idx % 6 == 0:
            print()
        print(line, end='')
Sun 2018-08-19  Carol Williams
Mon 2018-08-20  Carol Williams
Tue 2018-08-21  Alice Jones
Wed 2018-08-22  Carol Williams
Thu 2018-08-23  Carol Williams
Fri 2018-08-24  Carol Williams

Sun 2018-08-26  Carol Williams
Mon 2018-08-27  Carol Williams
Tue 2018-08-28  Carol Williams
Wed 2018-08-29  Dan Taylor
Thu 2018-08-30  Dan Taylor
Fri 2018-08-31  Dan Taylor

Perfect. And ready for whenever I get time to update the rota again.

I’ve just published an update to my recent Tube travel post, fixing a few small mistakes and a bigger one (an error in a station name that nonetheless didn’t affect the plot involved), and adding an update to the last section that goes a bit deeper into the fare and duration difference between the two periods.

I didn’t fix the mistake in the title, as I felt it was too late, but of course it’s 3⅔ years, not 3.75, since September 2014.

In that post I mentioned how pleasant it is working in RStudio in an R Markdown document. It really is, and I find the R Markdown way of mixing prose and code much more natural and fluid than Jupyter notebooks, which I like the idea of but whose block-based approach I find a bit awkward.

The biggest problem with R Markdown was fitting it into my, admittedly arcane, blogging system. To do so, I’ve cooked up a short Python script to transform the Markdown output from RStudio and knitr.

Right now, I’ve settled on this set of output options in the YAML front-matter:

md_document:
    variant: markdown_strict+fenced_code_blocks
    preserve_yaml: true
    fig_width: 7.5
    fig_height: 5
    dev: svg
    pandoc_args: [
        "--wrap", "preserve"
    ]

Now, I don’t actually use fenced (~~~~) code blocks in Markdown; instead I use regular indented code blocks with a header line (python:) at the top. But I include that extension in the Markdown variant to make it easier to transform code blocks later.

But why? Well, if your output just uses indented code blocks, it’s difficult to tell which of those are your R code and which are the R code’s output. Fencing the blocks makes it easier to insert empty comments after each block, keeping code and output separate.
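
To make that concrete, here’s a rough sketch of the kind of transformation I mean, under some assumptions about the knitr output (the fence syntax, the header-line format and the exact form of the empty comment are guesses rather than the real script):

import re
import sys

# Match opening/closing fences such as ``` r or ~~~~
FENCE = re.compile(r'^(`{3,}|~{3,})\s*(\w*)\s*$')

def transform(lines):
    in_block = False
    for line in lines:
        m = FENCE.match(line)
        if m and not in_block:
            in_block = True
            if m.group(2):
                # Header line at the top of the indented block, e.g. "r:"
                yield '    ' + m.group(2) + ':'
        elif m and in_block:
            in_block = False
            yield ''
            yield '<!-- -->'    # the "empty comment" separating code from output
            yield ''
        elif in_block:
            yield '    ' + line    # four-space indent for the block body
        else:
            yield line

for out_line in transform(sys.stdin.read().splitlines()):
    print(out_line)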

The YAML front matter is preserved because I use something similar in my own posts; it gets passed through to my blogging system without a problem, with unknown settings ignored. (I do remove the quoting that the template file includes around strings.)

The other important option above is supplying the --wrap argument to Pandoc, preserving the line breaks as they are in the source file instead of breaking them. By default Pandoc hard-wraps the lines, which I’d be fine with, except that it hard wraps the alt text for images (plots).

That makes the alt text more difficult to pick out later, and picking it out is necessary because I always use HTML to include images in my posts (so I can set classes, allow for full-width images and so on).

I say more difficult because I’m working line-wise. It would be possible to apply a regex to the joined lines and make the transformation, but then again I don’t hard-wrap my own posts, so the wrapping isn’t something I care about keeping.
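
For what it’s worth, the image rewrite itself is simple when the whole Markdown image sits on one line; something like this (the class name and attributes here are illustrative, not my blogging system’s actual markup):

import re

# Pandoc inline image syntax on a single line: ![alt text](path)
IMAGE = re.compile(r'^!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)\)\s*$')

def rewrite_image(line):
    m = IMAGE.match(line)
    if not m:
        return line
    # Swap in an HTML img tag so classes etc can be set
    return f'<img class="plot" src="{m.group("src")}" alt="{m.group("alt")}">'

print(rewrite_image('![A plot of weekly spending](figures/weekly.svg)'))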

The option I would like to use is to keep my Markdown reference links intact, instead of having Pandoc put everything inline. But this makes the images into reference links, making rewriting more difficult again.

So, I knit the document together from R Studio, then apply the script, and pipe the output into the for-real .md file. This is the one that gets checked into the Mercurial repository, fed into my blog generator and ultimately published.

I could probably get away with doing less, or handling things differently — such as allowing for fenced code on the generator’s side.

But I want the transformed output to resemble as closely as possible something that I’d written in BBEdit because I actually attach some importance to the contents of the Markdown files outside their use as raw material with which to create HTML.

They should be able to tell the post’s story without needing to be processed further, to interpret R code or make the raw source readable. I’m not quite at the point of having totally pure, completely readable plain text files (note those dummy comments mentioned above) but I want to be as close as I can.

A couple of years ago, shortly after I moved house, I wrote a post analysing my Tube travel. It was my first real attempt to do that kind of analysis, and the first time I’d done anything with Matplotlib of any level of complexity.

I thought I’d have another crack at it now, looking at the changes in my travel patterns since the move, and also changing from Python and Matplotlib to R and ggplot2.

Why now? There’s no great immediate reason, though for a time I was thinking about giving up my Oyster card in favour of a contactless bank card. You don’t have the option to be emailed CSV journey history files with a bank card, and the main advantage of weekly capping wouldn’t affect me, so I’ll be sticking with the Oyster card for the moment.

But, as I noted in the introduction to the previous post, my travel habits have changed considerably. Before, I would commute by train twice a day, whereas now I’m within a short cycle ride of work. I’m expecting this to have a significant effect on what we observe below.

And why the switch in environment? Python is still the language that fits my brain the best, but Matplotlib feels like hard work. R is a pretty odd language in many ways, but the ggplot2 way of building plots makes a great deal of sense to me, and has allowed me to play with plots quickly in ways that I feel wouldn’t be available if I were trying to contort myself to fit Matplotlib’s preferences. I freely admit that I don’t have a great deal of experience with Matplotlib, so it’s entirely possible that’s why I find it a struggle, but that barrier just isn’t there with ggplot2.

I’m writing this post in RStudio in an R Markdown document, but it’s actually my second go at this. The first was invaluable in getting myself acquainted with the process and playing around with ideas, but it kind of spiralled out of control so it’s not presentable. Hopefully this is something approaching readable.

Setup

To start with we’re going to load some libraries to make our life easier. The Tidyverse wraps up several helpful packages; lubridate has some handy date-handling functions; stringr is helpful for, er, strings; patchwork allows you to easily combine plots into one figure; ggalt provides an extra geom (geom_encircle()) that we’ll use in a bit. Forgive me for not making clear where functions come from below as, like Swift, R imports into the global namespace.

Not shown is my customised ggplot2 theme, which you can find if you look at the original .Rmd source file.

library(tidyverse)
library(lubridate)
library(stringr)
library(patchwork)
library(ggalt)

# Moving average function from https://stackoverflow.com/a/4862334/1845155
mav <- function(x, n) {
    stats::filter(x, rep(1/n, n), sides = 1)
}

Data import

I keep all the CSV files as received, just dating the filenames with the date I got them. (Sorry, I won’t be sharing the data.) Let’s load all the files:

oyster_filenames <- dir(
    '~/Documents/Oyster card/Journey history CSVs/',
    pattern = '*.csv',
    full.names = TRUE)

There are 109 CSV files that we need to open, load, and combine.

oyster_data <- oyster_filenames %>%
    map(~ read_csv(., skip = 1)) %>%
    reduce(rbind)

Here we’re piping oyster_filenames through map, where we use an R formula to supply arguments to read_csv to skip the header line in each file. Finally we reduce the 109 data frames by binding them by row.

Poking around the data

We can take a look at the data to get an idea of its structure:

head(oyster_data)
## # A tibble: 6 x 8
##   Date   `Start Time` `End Time` `Journey/Action`    Charge Credit Balance
##   <chr>  <time>       <time>     <chr>                <dbl> <chr>    <dbl>
## 1 31-Oc… 23:22        23:50      North Greenwich to…    1.5 <NA>     26.0 
## 2 31-Oc… 18:39        18:59      Woolwich Arsenal D…    1.6 <NA>     27.6 
## 3 31-Oc… 18:39           NA      Auto top-up, Woolw…   NA   20       29.2 
## 4 31-Oc… 17:10        17:37      Stratford to Woolw…    1.6 <NA>      9.15
## 5 31-Oc… 16:26        16:53      Woolwich Arsenal D…    1.6 <NA>     10.8 
## 6 30-Oc… 22:03        22:39      Pudding Mill Lane …    1.5 <NA>     12.4 
## # ... with 1 more variable: Note <chr>

It’s clearly in need of a clean-up. The journey history file appears to be a record of every action involving the card. It’s interesting to note that the Oyster card isn’t just a “key” to pass through the ticket barriers, but a core part of how the account is managed (note that having an online account is entirely optional).

Actions taken “outside” of the card need to be “picked up” by the card by tapping on an Oyster card reader. Here we can see a balance increase being collected, mixed in with the journey details. (Funnily enough, TfL accidentally cancelled my automatic top-up a couple of months ago, but that was never applied to my account as I didn’t use the card before the action expired.)

But we’re only interested in rail journeys, one station to another, with a start and finish time.

Let’s see if the notes field can give us any guidance on what we may need to exclude.

oyster_data %>%
    filter(!is.na(Note)) %>%
    count(Note, sort = TRUE)
## # A tibble: 5 x 2
##   Note                                                                   n
##   <chr>                                                              <int>
## 1 The fare for this journey was capped as you reached the daily cha…    18
## 2 We are not able to show where you touched out during this journey      6
## 3 This incomplete journey has been updated to show the <station> yo…     1
## 4 We are not able to show where you touched in during this journey       1
## 5 You have not been charged for this journey as it is viewed as a c…     1

OK, not much here, but there are some troublesome rail journeys missing either a starting or finishing station. The “incomplete journey” line also hints at something to be aware of:

oyster_data %>%
    filter(str_detect(Note, 'This incomplete journey')) %>%
    select(`Journey/Action`) %>%
    first()
## [1] "Woolwich Arsenal DLR to <Blackheath [National Rail]>"

Note the angle brackets surrounding the substituted station. We’ll come back to this later.

A missing start or finish time is a giveaway for oddities, which overlaps somewhat but not completely with Journey/Action fields that don’t match the pattern of {station} to {station}. Let’s fish those out and have a look at the abbreviated descriptions:

stations_regex <- '^<?([^>]+)>? to <?([^>]+)>?$'

oyster_data %>%
    filter(
        is.na(`Start Time`) |
        is.na(`End Time`) |
        !str_detect(`Journey/Action`, stations_regex)) %>%
    mutate(abbr = str_extract(`Journey/Action`, '^[^,]+')) %>%
    count(abbr, sort = TRUE)
## # A tibble: 11 x 2
##    abbr                                              n
##    <chr>                                         <int>
##  1 Auto top-up                                      84
##  2 Bus journey                                      26
##  3 Automated Refund                                  7
##  4 Woolwich Arsenal DLR to [No touch-out]            3
##  5 Oyster helpline refund                            2
##  6 Unknown transaction                               2
##  7 [No touch-in] to Woolwich Arsenal DLR             1
##  8 Entered and exited Woolwich Arsenal DLR           1
##  9 Monument to [No touch-out]                        1
## 10 Stratford International DLR to [No touch-out]     1
## 11 Stratford to [No touch-out]                       1

Tidying the data

All these should be filtered out of the data for analysis. (The two unknown transactions appear to be two halves of my old commute. Strange.)

rail_journeys <- oyster_data %>%
    # Note the !() below to invert the earlier filter
    filter(!(
        is.na(`Start Time`) |
        is.na(`End Time`) |
        !str_detect(`Journey/Action`, stations_regex)))

That leaves us with 993 rail journeys to have a look at.

But there’s more tidying-up to do:

  • Journey dates and times are stored separately. Finish times may be after midnight (and so on the day after the date they’re associated with).
  • Start and finish stations need to be separated. (And don’t forget that set of angle brackets.)
  • All money-related fields should be dropped except for “charge” (the journey fare).

Let’s have a crack at it, proceeding in that order:

tidy_journeys <- rail_journeys %>%
    mutate(
        start = dmy_hms(
            str_c(Date, `Start Time`, sep=' '),
            tz = 'Europe/London'),
        end = dmy_hms(
            str_c(Date, `End Time`, sep=' '),
            tz = 'Europe/London') +
            # Add an extra day if the journey ends “earlier” than the start
            days(1 * (`End Time` < `Start Time`)),
        # Let’s add a duration to make our lives easier
        duration = end - start,

        enter = str_match(`Journey/Action`, stations_regex)[,2],
        exit = str_match(`Journey/Action`, stations_regex)[,3]
    ) %>%
    select(
        start, end, duration,
        enter, exit,
        fare = Charge
    ) %>%
    # Sorting solely to correct the slightly odd example output
    arrange(start)
head(tidy_journeys)
## # A tibble: 6 x 6
##   start               end                 duration enter    exit      fare
##   <dttm>              <dttm>              <time>   <chr>    <chr>    <dbl>
## 1 2014-09-06 13:14:00 2014-09-06 13:42:00 28       Woolwic… Stratfo…   1.5
## 2 2014-09-06 13:59:00 2014-09-06 14:08:00 9        Stratfo… Hackney…   1.5
## 3 2014-09-06 20:47:00 2014-09-06 21:02:00 15       Hackney… Highbur…   1.5
## 4 2014-09-06 23:22:00 2014-09-07 00:10:00 48       Highbur… Woolwic…   2.7
## 5 2014-09-07 10:00:00 2014-09-07 10:30:00 30       Woolwic… Pudding…   1.5
## 6 2014-09-07 20:43:00 2014-09-07 21:15:00 32       Pudding… Woolwic…   1.5

Great. The duration variable isn’t strictly necessary but it’ll make things a tad clearer later on.

Weekly totals

For a start, let’s try to remake the first plot from my previous post, of weekly spending with a moving average.

Looking back, it’s not tremendously helpful, but it’s a starting point. (In addition, while that plot is labelled as showing a six-week average, the code computes an eight-week average, and a quick count of the points preceding the average line confirms it.)

But there’s a problem with the data: they record journeys made, not the absence of any journeys (obviously). If we’re to accurately plot weekly spending, we need to include weeks where no journeys were made and no money spent.

First, let’s make a data frame containing every ISO week from the earliest journey in our data to the most recent.

blank_weeks <- seq(min(tidy_journeys$start),
    max(tidy_journeys$end),
    by = '1 week') %>%
    tibble(
        start = .,
        week = format(., '%G-W%V')
    )
head(blank_weeks)
## # A tibble: 6 x 2
##   start               week    
##   <dttm>              <chr>   
## 1 2014-09-06 13:14:00 2014-W36
## 2 2014-09-13 13:14:00 2014-W37
## 3 2014-09-20 13:14:00 2014-W38
## 4 2014-09-27 13:14:00 2014-W39
## 5 2014-10-04 13:14:00 2014-W40
## 6 2014-10-11 13:14:00 2014-W41

The format string uses the ISO week year (%G) and the ISO week number (%V), which may differ from what you might intuitively expect (December 31 2018, for instance, falls in 2019-W01). I’ve included a somewhat arbitrary start time, as it’s a bit easier to plot and label datetimes rather than the year-week strings.

Now we need to summarise our actual journey data, collecting the total fare for each ISO week. We’ll use group_by() and summarise() — two tools that took me a few tries to get a handle on. Here summarise() works group-wise based on the result of group_by(); you don’t have to pass the group into the summarise() call, just specify the value you want summarised and how.

real_week_totals <- tidy_journeys %>%
    group_by(week = format(start, '%G-W%V')) %>%
    summarise(total = sum(fare))

That done, we can use an SQL-like join operation to take every week in our giant list and match it against the week summaries from our real data. The join leaves missing values (NA) in the total column for weeks where no journeys were made (and so weren’t present in the data to summarise) so we replace them with zero.

complete_week_totals <- left_join(blank_weeks,
                                  real_week_totals,
                                  by = 'week') %>%
    replace_na(list(total = 0))
tail(complete_week_totals)
## # A tibble: 6 x 3
##   start               week     total
##   <dttm>              <chr>    <dbl>
## 1 2018-03-17 12:14:00 2018-W11   0  
## 2 2018-03-24 12:14:00 2018-W12   0  
## 3 2018-03-31 13:14:00 2018-W13  21.1
## 4 2018-04-07 13:14:00 2018-W14   9.5
## 5 2018-04-14 13:14:00 2018-W15   0  
## 6 2018-04-21 13:14:00 2018-W16   7.8

With this summary frame assembled, we can now plot the totals. I’m also going to mark roughly when I moved house so we can try to see if there’s any particular shift.

house_move <- as.POSIXct('2016-08-01')
pound_scale <- scales::dollar_format(prefix = '£')

weeks_for_avg <- 8

ggplot(data = complete_week_totals,
       mapping = aes(x = start, y = total)) +
    geom_vline(
        xintercept = house_move,
        colour = rjw_grey,
        alpha = 0.75) +
    geom_point(
        colour = rjw_blue,
        size = 0.75) +
    geom_line(
        mapping = aes(y = mav(complete_week_totals$total,
                              weeks_for_avg)),
        colour = rjw_red) +

    labs(
        title = str_glue(
            'Weekly transport spending and {weeks_for_avg}',
            '-week moving average'),
        subtitle = (
            'September 2014 to May 2018, vertical bar marks house move'),
        x = NULL, y = NULL) +

    scale_x_datetime(
        date_breaks = '6 months',
        date_labels = '%b ’%y') +
    scale_y_continuous(
        labels = pound_scale)

A plot showing my weekly Oyster card spending, September 2014 to May 2018

It’s clear that there is a difference after the house move. But I’m not sure this plot is the best way to show it. (Nor the best way to show anything.)

That said, the code for this plot is a pretty great example of what I like about ggplot2: you create a plot, add geoms to it, customise the labels and scales, piece by piece until you’re happy. It’s fairly straightforward to discover things (especially with RStudio’s completion), and you change things by adding on top of the basics instead of hunting around in the properties of figures or axes or whatever.

Cumulative spending

The first plot showed a change in my average weekly spending. What does that look like when we plot the cumulative spending over this period?

ggplot(data = tidy_journeys,
       mapping = aes(x = start,
                     y = cumsum(fare),
                     colour = start > house_move)) +
    geom_line(
        size = 1) +

    labs(
        title = 'Cumulative Oyster card spending',
        subtitle = 'September 2014 to May 2018',
        x = NULL, y = NULL,
        colour = 'House move') +
    scale_y_continuous(
        labels = pound_scale,
        breaks = c(0, 500, 1000, 1400, 1650)) +
    scale_color_brewer(
        labels = c('Before', 'After'),
        palette = 'Set2') +
    theme(
        legend.position = 'bottom')

A plot showing my cumulative Oyster card spending, September 2014 to May 2018

The difference in slope is quite clear; at one point I fitted a linear smoother to the two periods but it overlapped so tightly with the data that it was difficult to read either. I’ve also monkeyed around with the y-axis breaks to highlight the difference; what before took three to six months to spend has taken about 21 months since the house move.

Zero-spending weeks

One thing that shows up in the first plot, and likely underlies the drop in average spending, is the number of weeks where I don’t travel using my Oyster card at all. Let’s pull together a one-dimensional plot showing just that.

ggplot(complete_week_totals,
       aes(x = start,
           y = 1,
           fill = total == 0)) +
    geom_col(
        width = 60 * 60 * 24 * 7) +  # datetime col width handled as seconds
    geom_vline(
        xintercept = house_move,
        colour = rjw_red) +

    scale_fill_manual(
        values = c(str_c(rjw_grey, '20'), rjw_grey),
        labels = c('Some', 'None')) +
    scale_x_datetime(
        limits = c(min(complete_week_totals$start),
                   max(complete_week_totals$start)),
        expand = c(0, 0)) +
    scale_y_continuous(
        breaks = NULL) +
    labs(
        title = 'Weeks with zero Oyster card spending',
        subtitle = 'September 2014 to May 2018, red line marks house move',
        x = NULL, y = NULL,
        fill = 'Spending') +
    theme(
        legend.position = 'bottom')

A plot showing weeks where I made no journeys using my Oyster card

The change here after I moved house is stark, nearly an inversion of the previous pattern of zero/no-zero spending weeks. (Almost looks like a barcode!)

My apologies for the thin lines between columns, which are an SVG artefact. The inspiration for this was a plot of games/non-games in the App Store top charts that Dr Drang included at the bottom of one of his posts which, for the life of me, I can’t find now.

Changes in journey properties

So it’s clear that I travel less on the Tube network, and that I spend less. But what has happened to the sort of journeys that I make? Are they longer? Shorter? Less expensive? More?

Let’s have a look at how the average fare and average journey duration change over time.

n_journey_avg <- 10

common_vline <- geom_vline(xintercept = house_move,
                           colour = rjw_red)
common_point <- geom_point(size = .5)

fares_over_time <- ggplot(tidy_journeys,
                          aes(x = start,
                              y = mav(fare, n_journey_avg))) +
    scale_x_datetime(
        labels = NULL) +
    scale_y_continuous(
        labels = pound_scale) +
    labs(
        y = 'Fare',
        title = 'More expensive, shorter journeys',
        subtitle = str_glue('{n_journey_avg}-journey average, ',
                            'vertical line marks house move'))

duration_over_time <- ggplot(tidy_journeys,
                             aes(x = start,
                                 y = mav(duration, n_journey_avg))) +
    scale_y_continuous() +
    labs(
        y = 'Duration (mins)')

(fares_over_time / duration_over_time) &  # Patchwork is magic
    common_vline &
    common_point &
    labs(x = NULL)

A plot of average fares and journey durations over time

Journeys taken after the house move appear to be shorter and more expensive. How distinct is this? What is driving the averages? I have a hunch so let me rush on ahead with this plot.

commute_stations <- c('Woolwich Arsenal DLR', 'Stratford International DLR',
                      'Stratford', 'Pudding Mill Lane DLR')

commute_journeys <- tidy_journeys %>%
    filter(
        enter %in% commute_stations,
        exit %in% commute_stations)

high_speed_journeys <- tidy_journeys %>%
    filter(
        str_detect(enter, 'HS1'),
        str_detect(exit, 'HS1'))

ggplot(tidy_journeys,
       aes(x = fare,
           y = duration,
           colour = start > house_move)) +
    geom_jitter(
        width = 0.05,  # 5p
        height = 0.5,  # 30 seconds
        alpha = 0.5) +
    geom_encircle(
        data = commute_journeys,
        size = 1.5) +
    geom_encircle(
        data = high_speed_journeys,
        size = 1.5) +

    scale_color_brewer(
        palette = 'Set2',
        labels = c('Before', 'After')) +
    scale_x_continuous(
        labels = pound_scale) +
    scale_y_continuous(
        limits = c(0, 80)) +
    labs(
        title = 'Pre- and post-move averages driven by two groups',
        subtitle = str_c('Old commute and high-speed journeys circled,',
                         ' positions not exact'),
        x = 'Fare',
        y = 'Duration (mins)',
        colour = 'House move')

A plot of journey fare against distance, grouped by whether the journeys were before or after I moved house

We can see in the lower central section that there’s some overlap. Remember also that there are far fewer post-move journeys, so it’s not surprising that earlier ones dominate this plot. (I added jitter to the points to make things a little easier to see — geom_jitter() is a wrapper around geom_point().)

But what is crucial to understanding the averages are the two rough groups circled: journeys between stations that I used for my old commute (on the left in green), and journeys involving travel on the High Speed 1 (HS1) rail line (on the right in orange).

My old commute was low-cost, each way either £1.50 or £1 (with an off-peak railcard discount, applied for part of the pre-move period). There are a lot of these journeys (nearly 500). It was a fairly predictable 30ish-minute journey.

On the other hand, trips involving the HS1 line are expensive and very short. A single off-peak fare is currently £3.90 and peak £5.60, while the journey time between Stratford International and St Pancras is just seven minutes, with a bit of waiting inside the gateline.

But is that it?

Does that theory of the two extreme groups really explain the difference? Let’s filter out the two groups from our journey data.

journeys_without_extremes <- tidy_journeys %>%
    anti_join(commute_journeys) %>%
    anti_join(high_speed_journeys)

Let’s look at how the journey durations compare:

ggplot(journeys_without_extremes,
       aes(x = duration,
           fill = start > house_move)) +
    geom_histogram(
        binwidth = 5,
        closed = 'left',
        colour = 'black',
        size = 0.15,
        position = 'identity') +
    scale_x_continuous(
        breaks = seq(0, 70, 10),
        limits = c(0, 70)) +
    scale_fill_brewer(
        palette = 'Set2',
        labels = c('Before', 'After')) +
    labs(
        title = 'Post-move journeys still shorter',
        subtitle = 'Commute and HS1 journeys excluded, bars overlap',
        x = 'Duration (mins)',
        y = 'Number of journeys',
        fill = 'House move')

A histogram showing journey durations having excluded known extremes, with post-move journeys generally shorter

And the fares:

ggplot(journeys_without_extremes,
       aes(x = fare,
           fill = start > house_move)) +
    geom_histogram(
        binwidth = 0.5,
        closed = 'left',
        colour = 'black',
        size = 0.15,
        position = 'identity') +
    scale_x_continuous(
        labels = pound_scale) +
    scale_fill_brewer(
        palette = 'Set2',
        labels = c('Before', 'After')) +

    labs(
        title = 'Post-move journeys generally more expensive',
        subtitle = 'Commute and HS1 journeys excluded, bars overlap',
        x = 'Fare',
        y = 'Number of journeys',
        fill = 'House move')

A histogram showing journey fares having excluded known extremes, with post-move fares generally more expensive

While it’s much clearer for duration than cost now, post-move journeys are still generally shorter and more expensive.

At this point, I’ve reached the limits of how far I’m able to take this with visualisation. One possible route would be to look at the distance between stations (in miles), how many of the stations used are in which fare zone, and the number of fare zones crossed. I don’t have station/fare-zone data readily to hand so we’ll leave that here.

But I’ll end with an intuitive answer. Durations are shorter because from Woolwich it takes additional time to get into the main Tube network from the DLR, and particularly to central stations, whereas now I’m not far from a Central Line station, which gets me into zone 1 fairly quickly.

Fares are higher because I’ve transferred classes of journeys to cycling — not just my commute to work but shopping and leisure. I’d reckon that the remaining journeys are more likely to involve travel into and within central London, and maybe more likely to be at peak times.

Last thoughts

If you made it this far, well done, and thanks for reading. There’s a lot of R code in this post, probably too much. But there are two reasons for that: as a reference for myself, and to show that there’s not any magic going on behind the curtain, and very little hard work. (In my code at least, there’s plenty of both in the libraries!)

Working in R with ggplot2 and the other packages really is a pleasure; it doesn’t take very long to grasp how the different tools fit together into nice, composable pieces, and to assemble them in ways that produce something that matches what you have pictured in your mind.

Barcodes can be pretty mystifying from the outside, if all you’ve got to go on is a set of lines and numbers, or even magic incantations for the software that produces them.

Despite working at a place where we produce a product with a new barcode every day, for years I didn’t understand how they were made up.

But they’re fairly straightforward, and once you know how they work it’s quite simple to produce them reliably. That’s important because getting a barcode wrong can cause real problems.

Barcode problems

In the case we’ll look at here, daily newspapers, an incorrect barcode means serious headaches for your wholesalers and retailers, and you’ll likely and entirely understandably face a penalty charge for them having to handle your broken product.

I know because, well, I’ve been there. In our case at the Star there were two main causes of incorrect barcodes, both down to people choosing:

  1. the wrong issue number or sequence variant;
  2. the wrong barcode file.

We’ll talk about the terminology shortly, but we can see how easily problem number one can occur by looking at the interface of standard barcode-producing software:

A screenshot of the interface of Agamik BarCoder, a good barcode-producing application

Now, Agamik BarCoder is a nice piece of software and is very versatile. If you need to make a barcode it’s worth a look.

But look again at that interface — it’s not intuitive what you need to do to increment from the previous day’s barcode, the settings for which are saved in the application. It’s very easy to put in the wrong details, or accidentally reuse yesterday’s details.

Second, it produces files with names such as ISSN 03071758_23_09 — a completely logical name, but the similarity between the names, and the fact that you have to manually place the file on your page, makes it easy to choose the wrong barcode, whose name will likely differ by only one digit from the previous day’s.

That isn’t helped by Adobe InDesign opening the last-used folder by default when you place an image. At least once, I’ve made the barcode first thing in the morning and accidentally placed the previous day’s barcode file.

One of the suggestions we had after we printed a paper with the wrong barcode was to have the barcode checked twice before the page is sent to the printers. This is an entirely sensible suggestion, but I know from experience that — however well-intentioned — “check x twice” is a rule that will be broken when you’re under pressure and short-staffed.

It’s far more important to have a reliable production process so that whatever makes it through to the proofreading stage is certain to be correct, or as close as possible.

We can understand this by looking at the hierarchy of hazard controls, which is useful far outside occupational health and safety:

An illustration of the hierarchy of controls, to reduce industry hazards, which has at the top (most effective) the elimination of hazards, followed by substitution, engineering controls, administrative controls and then finally (and least effective) personal protective equipment.

“Check twice” is clearly an administrative control — changing the way people work while leaving the hazard in place. An engineering control in our case might be to have software check the barcode when the page is about to be sent to the printers (something we do on PDF export by inspecting the filename). We want to aim still higher up the hierarchy, eliminating or substituting the hazard.

But to reach that point we need to understand the components of a barcode.

Barcode components

Barcodes are used all over the place, so it’s understandable that some terms are opaque. But picking a specific case — daily newspaper barcodes here — it’s quite easy to break down what they mean and why they’re important.

The information here comes from the barcoding guidance published by the Professional Publishers Association and Association of Newspaper and Magazine Wholesalers. It’s a very clear document and if you’re involved in using barcodes for newspapers or magazines you should read it. (Really, do read it, as while I’ll try to bring newspaper barcodes “to life” below, there’s a lot of information in there that I won’t cover — such as best practice for sizing.)

Let’s start off by examining a typical newspaper EAN-13+2 barcode, using the terms that you’ll find in the PPA-ANMW guidance:

A diagram showing the components of a British newspaper barcode, using the EAN-13+2 format.

You’ll see at first that it’s clearly made up of two components: the largest is a typical EAN-13 barcode with a smaller EAN-2 on the right.

Reading left-to-right, we have the GS1 prefix to the barcode number, which is always 977 for the ISSN numbers assigned to newspapers and magazines.

Next come the first seven digits of the publication’s ISSN — the eighth digit isn’t included because it is a check digit, made redundant by the EAN-13’s own check digit.

That check digit follows a two-digit sequence variant, which encodes some information about the periodical. On the right, above the EAN-2, is the issue number. This is used in different ways depending on the publication’s frequency.

Lastly there is a chevron, used to guard some whitespace on the right-hand side to ensure the barcode reader has enough room to scan properly. (The leading 9 performs the same function on the left.) This is optional.

In practice

Now let’s look at a real barcode, see which elements we have to think about, and how they fit together.

A diagram showing an annotated barcode as used by the Morning Star newspaper.

Now let’s start with the elements that were present on the basic ISSN barcode.

ISSN number

Your newspaper’s ISSN appears after the 977 prefix. The Morning Star’s ISSN is 0307-1758, but the 8 at the end of that is a check digit, used to detect errors in the preceding seven digits. It’s dropped as unnecessary because the 13th digit of the EAN-13 is a check digit for all 12 preceding digits, so only the first seven digits of the ISSN appear in the barcode.
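
If you’re curious, the EAN-13 check digit is just a weighted sum of the 12 preceding digits. A quick illustration (this isn’t code from the barcode tool described later; the digits below reuse the 0307175 ISSN stem and a sequence variant of 23):

# Standard EAN-13 check digit: weight the 12 leading digits alternately
# 1 and 3 from the left, then top the sum up to the next multiple of 10.
def ean13_check_digit(twelve_digits):
    total = sum(int(d) * (3 if i % 2 else 1)
                for i, d in enumerate(twelve_digits))
    return (10 - total % 10) % 10

# 977 prefix + first seven ISSN digits + a two-digit sequence variant
print(ean13_check_digit('977030717523'))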

Sequence variant

For daily newspapers the sequence variant provides two pieces of information.

The first digit is a price code, which indicates to retailers what price they should charge. The code is dependent on the publication — you can’t tell from the price code alone what price a newspaper will be. For the Star, we currently use price codes 2 (£1) and 4 (£1.50).

The second digit is the ISO weekday number. Monday is ISO weekday 1, through to Sunday as 7.

So by looking at the sequence variant in this barcode, we can tell that it’s the paper for Wednesday (ISO weekday 3) and sells at whatever price code 2 corresponds to in the retailer’s point-of-sale system.

When you introduce a new price, typically you use the next unused price code. We recently increased the price of our Saturday paper from £1.20 (price code 3) to £1.50 (price code 4).
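
In code, the sequence variant is trivial to build. A minimal sketch (the helper name is mine, not the actual program’s):

from datetime import date

# Sequence variant = price code digit followed by the ISO weekday digit
def sequence_variant(edition_date, price_code):
    return f'{price_code}{edition_date.isoweekday()}'

# Wednesday May 9 2018 at price code 2 gives '23'
print(sequence_variant(date(2018, 5, 9), 2))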

Issue number

The issue number appears above the EAN-2 supplemental barcode. For daily newspapers this corresponds to the ISO week containing the edition. Note that this may differ from, say, the week number in your diary. New ISO weeks begin on Monday.

Modern versions of strftime accept the %V format, which will return a zero-padded ISO week number. In Python the date and datetime classes have an .isocalendar() method which returns a 3-tuple of ISO week-numbering year, ISO week number and ISO weekday number.
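
For example, a paper dated Wednesday February 28 2018 falls in ISO week 9:

from datetime import date

# isocalendar() returns (ISO year, ISO week, ISO weekday);
# in Python 3.9+ this is a named tuple, but unpacking works everywhere
iso_year, iso_week, iso_weekday = date(2018, 2, 28).isocalendar()
print(f'{iso_week:02}')    # '09': the zero-padded issue number for that week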

Header strap

The line printed above the barcode is technically not part of the barcode itself, and different publications do different things. It’s common not to print anything, and for years we didn’t either, but I think it’s quite useful to print related information here to help whoever has to check the barcode before the page is sent for printing.

Note that in this example, all the information printed above the barcode is referred to in the barcode itself (except the year). I use this space to “decode” the barcode digits for human readers.

This was suggested to me by our printers (Trinity Mirror), who do something similar with their own titles.

Light margin indicator

Eagle-eyed readers will spot that the chevron used to guard whitespace for the barcode scanner is missing from the right-hand side. The PPA-ANMW guidance does urge that you include the chevron, but its absence as such won’t cause scanning problems.

It’s straightforward to guarantee enough space around the barcode by carefully placing it in the first place. Our back-page template reserves a space for the barcode, along with some legally required information, which is big enough to make the chevron unnecessary. You can see this in the image below:

An annotated photo of the barcode on a printed copy of the Morning Star, noting the space reserved around the barcode.

The main block of text on the left of the barcode doesn’t change. The date below it does, but it’s been tested so that even the longest dates provide enough space. (The longest date in consideration being the edition of Saturday/Sunday December 31-January 1 2022-2023.)

The superimposed purple lines show where the margins would appear in Adobe InDesign, with the barcode in the bottom-right corner. This section is ruled off above to prevent the encroachment of page elements, with the understanding that page items must end on the baseline grid line above the rule (which itself sits on the grid).

A photograph of a older style of Morning Star barcode, showing page elements in close proximity. A photograph of an even older style of Morning Star barcode, showing page elements in close proximity and a light margin chevron.

(As you can see from the smaller photos, this wasn’t always the case. The barcode often had page elements very close by, and did not have its own clear space. At this point, the barcode was also produced at a smaller size to fit within one of the page’s six columns.)

The “inside margin” on the right-hand side of the page (remember that the back page is in fact the left-hand page of a folded spread) provides an additional light margin. However, note that you still need an adequate distance from the fold itself:

“it is recommended that the symbol should not be printed closer than 10 mm from any cut or folded edge” (PPA-ANMW)

Our inside margin is 9mm, with the edge of the EAN-2 symbol roughly 1.5mm further in, for a total 10.5mm. While it appears that there’s bags of space, we’re still only just within the recommendations.

You might want to put the barcode on the outer edge of the back (the left-hand side) as the margin there is deeper (15mm in our case), but I would be very cautious about doing so. I’ve seen enough mishandled papers with bits torn off that I prefer the safety of the inside of the sheet.

You can see similar considerations at work when you look at how other papers place their barcodes. This example of the Sunday Mirror is quite similar to the Morning Star above, but rotated to make use of the more abundant vertical space:

A photograph of a barcode on the back of the Sunday Mirror, rotated so that it is placed sideways on the page.

(You can also see the use of a strap above the barcode, with the title name (SM, Sunday Mirror) and date (210517). I’m not sure what LO means, but it could mean London, if this is used as a way of identifying batches from different print sites.)

A photo of the barcode on The Times newspaper. A photo of the barcode on the Financial Times newspaper.

The Times and Financial Times also take this approach of cordoning off a space. Neither uses a header strap (not unusual), though I am confused by the placement of the chevron in the FT’s barcode. It should be outside the symbol area to reserve the space, though a lack of space is certainly not an issue here.

Dedicating some space for the barcode is important because it means that there won’t be any compromises made day-to-day. You’ll want to take into account the recommended size and magnification factors in the PPA-ANMW guidance if adjusting page templates.

One of the changes we made was to abandon the reduced-size barcode (to fit within a page column), which then meant that something else was needed to fill out two columns to justify the space. But — as seen in the examples from other papers — it might be that having some amount of additional blank space around the barcode is an easy sell anyway.

Automation

Where these considerations really come in is when you automate the creation and setting of the barcode, because they can be thought about once, agreed and then left untouched as the system ticks along.

This gets us to the substitution level of the hierarchy of controls — we’re looking to do away with the hazard of human error in barcode creation, but ultimately we replace it with another hazard, ensuring that an automated system works correctly. We’ll return to this hazard briefly after taking a look at the automation program itself.

The code is available on GitHub. I won’t be including large chunks of it because it’s all fairly nuts and bolts stuff (and this post is long enough already!).

The structure is fairly straightforward. Like a lot of my more recent automation projects at work, it has an AppleScript user interface which passes arguments to a Python command-line program, which either performs some action itself or returns a value for use in the AppleScript program.

In this case, the Python program computes the correct sequence variant (price and weekday) and issue number (ISO week) — along with a human-readable header — and embeds them in a PostScript program that uses the brilliant BWIPP barcode library.

This PostScript is processed into a PDF file by Ghostscript, and the path to this barcode PDF is handed back to the AppleScript program so that it can embed it in a labelled frame in InDesign. (To embed files in an InDesign document you’ll need the unlink verb. Yes, I thought that meant “delete the link” at first as well.)
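
To give a flavour of the glue, here’s a heavily simplified sketch of the Python side (the real code is in the repository; the function name, file paths, page size and header text here are assumptions):

import subprocess
from datetime import date

# The PostScript template shown later in this post, with {} placeholders
PS_TEMPLATE = open('barcode_template.ps').read()

def build_barcode(edition_date, price_code, header, out_pdf='/tmp/barcode.pdf'):
    ps = PS_TEMPLATE.format(
        bwipp_location='barcode.ps',    # cut-down BWIPP resource file
        issn='0307175',                 # first seven digits of the ISSN
        seq=int(f'{price_code}{edition_date.isoweekday()}'),
        week=edition_date.isocalendar()[1],
        header=header,
    )
    # Ghostscript trims the PostScript and converts it to a PDF
    subprocess.run(
        ['gs', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pdfwrite',
         '-dDEVICEWIDTHPOINTS=115', '-dDEVICEHEIGHTPOINTS=100',  # assumed size
         '-o', out_pdf, '-'],
        input=ps.encode(), check=True)
    return out_pdf    # path handed back to AppleScript for InDesign to embed

# e.g. build_barcode(date(2018, 5, 9), price_code=2, header='Wednesday May 9 2018')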

Here’s a diagram to show the flow through the program (forgive the graphics, I’m learning how to use OmniGraffle):

A diagram showing the flow of action through the ms-barcode application. An AppleScript UI takes input, Python organises the creation of the barcode (using BWIPP and Ghostscript) and then returns the barcode PDF file path to AppleScript, which then embeds it in an Adobe InDesign file.

Cloc tells me that the main Python file has a whopping 104 lines of code, and there are 264 lines of code in the related unit tests.

Really all of the heavy lifting is done by BWIPP, a cut-down version of which is included in the ms-barcode repository (just ISSN, EAN-13 and EAN-2). The entirety of my “own” PostScript is this (where the parts in braces are Python string formatting targets):

%!PS
({bwipp_location}) run

11 5 moveto ({issn} {seq:02} {week:02}) (includetext height=1.07)
  /issn /uk.co.terryburton.bwipp findresource exec

% Print header line(s)
/Courier findfont
9 scalefont
setfont

newpath
11 86 moveto
({header}) show

showpage

The bits that you may need to fiddle with, if you want to produce a different-sized barcode, are the initial location the ISSN symbol is drawn at (line 4) and height=1.07 on the same line.

You’d also want to adjust the size specified to Ghostscript, which is used to trim the resulting image — the arguments are -dDEVICEWIDTHPOINTS, and -dDEVICEHEIGHTPOINTS.

I don’t know enough about PostScript (or Ghostscript) to give good general guidance about getting the right size. My advice would be to start with what I have and make small adjustments until you’re heading in the right direction (which is exactly how I settled on the arguments currently in use).

What I would emphasise is that if you have trouble with the existing Python modules that wrap BWIPP, it’s not difficult to use the PostScript directly yourself. Really, look back at the 16 lines of PostScript above — that’s it.

Wrapping up

By automating in this way, we now have a method where the person responsible for the back page simply clicks an icon in their dock, presses return when asked if they want the barcode for tomorrow, and everything else is taken care of.

Going back to our earlier discussion of hazards, I think we’ve reached the substitution stage rather than the elimination stage.

We have eliminated human error in choosing the components of the barcode, but we’ve done it by substituting code to make that decision. That’s still a good trade, because that code can be tested to ensure it does the right thing.

And then, you can go back to not worrying about barcodes.

Last night, I treated myself to seeing the Colin Currie Group at Kings Place, and they were absolutely sensational.

It was quite the programme, with six pieces of Steve Reich’s music, and closed with a performance of Quartet, composed by Reich for Currie’s group.

I first heard Quartet on May 24 2016, when it was played with Mallet Quartet and Music for 18 Musicians, and quite honestly was desperate to hear it again. It’s magical, and I left last night tapping out bits of it on my leg on the way to the train station.

It wasn’t online, apart from a short clip, but last month a video of the group performing Quartet in full was posted to YouTube, and Nonesuch are releasing a recording of Pulse & Quartet (Pulse by ICE, Quartet by CCG) at the start of February.

The group are also releasing a recording of Drumming, which was as amazing to watch as it was to hear — I was sat right in line with the drums, and the effect of all four players drumming at once was stunning.

Last night was also the first time I’d heard New York Counterpoint (for clarinet) and Vermont Counterpoint (for flute) live, and they really opened my eyes to pieces I’d sort-of ignored before.

Really incredible stuff, all of it, start to finish. Reich’s compositions are brilliant, and I’m very thankful to be able to see a group dedicated to performing his work (for the third time now!).