Since I moved the blog off Tumblr, I’ve tried to make it reasonably quick. Reasonably because it’s come in waves — waves of me saying: “Oh, that’s fine” and then deciding that whatever it is isn’t fine and needs to go.

Tumblr used to whack in about 1MB of extra guff for its own purposes. If you’re posting endless gifs then that’s not something you’ll notice, but when you’re mostly dealing in text it’s pretty obvious.

When I redesigned the site almost four years ago I remember chafing at this, but it wasn’t until I settled on my own site generator that I had the chance to really whittle things down.

Much of that was lopping off unneeded stuff such as jQuery. It did succeed in getting the size down — to about 150–200kB for an average post. But I’ve recently made a few changes to speed things up that I wanted to talk about.

Much of this has come about after reading Jacques Mattheij’s The Fastest Blog in the World and Dan Luu’s two excellent posts Speeding up this blog by 25x-50x and Most of the web really sucks if you have a slow connection. (And, well, of course, Maciej. You should really read that one.)


For ages this site used Rooney served by Typekit as its main font. I love Rooney, it’s great. But using web fonts always means sending lots of data.

Despite thinning down the included characters (Typekit allows you to choose language support) and forgoing the use of double emphasis (<em><strong>), serving up three weights of Rooney still clocked in at over 100kB.

I’m looking at Rooney now and it is gorgeous, but there’s no way I could or can justify it — the fonts collectively would usually outweigh anything else on a page. So it went, in favour of Trebuchet MS, which I’ve long had a soft spot for.


Not related to the bytes served up but switching my registrar’s (Hover’s) name servers for Cloudflare helped cut about 100ms in DNS response times (from about 150ms).

You can host your DNS with Cloudflare for free without using any of their other caching services (I don’t), and Cloudflare is consistently one of the fastest DNS hosts in the world.

Syntax highlighting

Up until now I’d been using highlight.js to colour code snippets, and I’d been very happy with it. It’s a nice library that’s easy to work with, and it’s easy to download a customised version for your own use.

But I’d been handing out a 10kB JavaScript library to everyone who visited — whether there was code to be highlighted or not. Had I included more than a handful of languages in my library it would have been even more.

My first decision was to use my blog generator’s extensions mechanism to mark every post or page that included code and would need highlighting — so if you visited a page without code, you didn’t receive the JavaScript library.
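As a rough illustration of that per-page decision, here is a minimal sketch. It is not the generator’s actual extension code: the function names, the script path and the idea of scanning the Markdown source for code blocks are all assumptions made for the example.

```python
import re

# Hypothetical script tag; the real path used on the site is not given here.
HIGHLIGHT_SCRIPT = '<script src="/js/highlight.js"></script>'

# Assumption: code is written either as a fenced block (three backticks or
# three tildes at the start of a line) or as an indented block (four spaces
# or a tab) that follows a blank line. This is a heuristic, not a parser.
FENCE = re.compile(r'^(`{3}|~{3})', re.MULTILINE)
INDENTED = re.compile(r'(?:\A|\n[ \t]*\n)(?: {4}|\t)\S')


def needs_highlighting(markdown_source: str) -> bool:
    """Return True if the post appears to contain a code block."""
    return bool(FENCE.search(markdown_source) or INDENTED.search(markdown_source))


def scripts_for(markdown_source: str) -> list:
    """Return the script tags a page should include."""
    return [HIGHLIGHT_SCRIPT] if needs_highlighting(markdown_source) else []
```

A page of plain prose gets an empty list, so no highlighting script is sent. The indented-block check can give false positives (an indented list continuation, say), which only costs an unneeded download.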

But really that wasn’t enough for me. A few things annoyed me:

  • Syntax highlighting had to be performed on every view on the client device at readers’ expense.
  • I could only highlight those languages I’d included in my library.
  • The library included all of my selected languages no matter what was on the page.

This wasn’t ideal.

The Markdown module I use has support for syntax highlighting, but there were some deficiencies with it that had led me to pick the client-side highlighter in the first place, several years ago.

It wasn’t difficult to fix that, however. Taking inspiration from Alex Chan, I modified the included Codehilite extension to match my requirements, which were to handle a “natural-looking” language line and line numbers in the Markdown source. You can see the source online, but it’s pretty rough and I need to tidy it up. (It also uses Pygments’s inline line numbers, instead of the table approach which I’ve seen out of alignment on occasion.)

The tradeoff here was serving a slightly inflated HTML file but a much-reduced JavaScript file (I kept a barebones script of a tenth of the size to show the plain source) — but the increase in HTML size is much smaller than the original JavaScript was.

In all, I’ve gone from a baseline of roughly 160kB to 10-15kB per image-free post. It’s not the fastest blog in the world, especially if you’re far away from Linode’s London datacentre, but it should be pretty nippy.

What else?

There are some things which I’ve rejected so far.

  • Inlining JavaScript and CSS.

    This could make the page render faster, at the expense of transferring data that would otherwise be cached when viewing other pages. (But, assuming the blog is like most others, most will only visit a single page.)

    But I feel a bit icky about munging separate resources together. In mitigation, the site is served over HTTP/2 (which most of the world supports), where inlining is considered an anti-pattern.

  • Sack off the traditional front page.

    Yeah, I felt this one acutely recently after posting all those dodgy but massive Tube heat maps, sitting towards the bottom of the front page and inflating its size.

    Dan Luu has his archives as the front page, which is svelte but extreme for my tastes. We’ll see about this one.

  • Ditch your CSS.

    Yeah, I know. (Well.) But I like pretty things. And it’s only about 2.5kB.

To recount from last time, our backup strategy was a mess. The tools were solid but used in such a way that the common case (fast file restoration) was likely to fail, even if the rare case (complete disk failure) was covered.

That alone should have made me act sooner than I did. But ultimately the common case was so rare that the pain it caused wasn’t sufficiently motivating. Add to that an already set plan to make the change when the server hardware was replaced, a reluctance to spend money that delayed the hardware change, and a near-total lack of time. So it didn’t happen for about 2.5 years after I got the job.

What finally prompted me to overhaul the backups was a nerve-wracking hour late last year when the Mac Mini server became unresponsive and would fail to start up with its external drives attached.

Detaching the drives, booting and then reattaching the drives got us back to normal. (This had happened before; I think it might have been connected to the third-party RAID driver we were using, but I don’t know that for sure.)

As a precaution I checked the server’s backups. None had completed in five days. Understand that there was no monitoring; checking required logging in to the machine and looking at the Arq interface and SuperDuper’s schedule.

The Arq agent had hung and I believe SuperDuper had too, just frozen in the middle of a scheduled copy. As I noted before, these are both solid pieces of software so I’m very much blaming the machine itself and the way it was set up.

Shortly after, I ordered a replacement Mac Mini (with a 1TB Fusion drive that feels nearly as fast as my SSD at home) and a bunch of new external drives.


Previously we had been using various 3.5″ drives in various enclosures.

I don’t know what the professional advice is about using internal drives in enclosures versus external drives, but my preference is for “real” external drives, mostly because they’re better designed when it comes to using several of them, they seem to be better ventilated and quieter, and they’re less hassle.

All of our existing drives were several years old, the newest 2.5 years (and I managed to brick that one through impatience — a separate story), so they all needed replacing.

I bought four 4TB USB3 Toshiba drives, which run at 7,200 RPM using drives apparently manufactured by someone else. That review says they’re noisy, but I’ve had all four next to my desk (in an office) since the start of the year, with at least one reading constantly (more on that later), and they’re OK. I might feel differently if I was using one in my quiet bedroom at home, but it’s hard to tell.

Funnily enough, you can no longer buy these drives from the retailer where we got them. But from memory they were about £100 each, maybe a bit less.

More recently I bought a 1TB USB3 Toshiba Canvio 2.5″ external drive to serve as a SuperDuper clone. (Not to match the 4TB Toshiba drives, but because it was well reviewed and cheaper than others.)

In sum, here’s our stock of drives:

  • 1 in the Mac Mini. (You can’t buy dual-drive Minis anymore.)
  • 2 Time Machine drives.
  • 1 nightly SuperDuper clone.
  • 1 for the 2002-2016 archives.
  • 1 for the post-2016 archives.


One of the biggest changes is one that’s basically invisible. Beforehand we had loads of drives, not all of which I mentioned last time.

  • 2 in the Mac Mini server.
  • 2 for daily/3-hourly clones (each drive containing two partitions).
  • 2 for the 2002-2011 archives.
  • 2 for the post-2011 archives.

All of them were in RAID 1, where each drive contains a copy of the data. The idea behind this is that one drive can fail and you can keep going.

We were using a third-party RAID driver by a long-standing vendor. I won’t name them because I didn’t get on with the product, but I don’t want to slight it unfairly.

The RAID software frequently cried wolf — including on a new external drive that lasted 2.5 years until I broke it accidentally — so I got to a point where I just didn’t believe any of its warnings.

The worst was a pop-up once a day warning that one of the drives inside the Mini was likely to fail within the next few weeks. I got that message for over two years. I’d only infrequently log in to the Mini, so I’d have to click through weeks of pop-ups. The drive still hasn’t died.

As I’ve made clear, the machine itself was a pain to work with and that may have been the root problem. But I won’t be using the RAID driver again.

That experience didn’t encourage me to carry over the RAID setup, but the main factor in deciding against using RAID was cost. As we openly admit in the paper, our place isn’t flush with cash so we try to make the best use of what we’ve got.

The demise of the dual-drive Mini means that the internal drive isn’t mirrored, and that would have been the obvious candidate. So if I were to buy extra drives for a RAID mirror, it might be for the external archive drives. They are backed up but if either drive were to snuff it most of their contents would be unavailable for a period.

But buying mirror drives means that money isn’t available for other, more pressing needs.

Backup strategy

OK, with that out of the way, let’s talk about what we are doing now. In short:

  • Time Machine backups every 20 minutes, rotated between two drives.
  • Nightly SuperDuper clones to a dedicated external drive.
  • Arq backups once an hour.
  • “Continuous” Backblaze backups.

Time Machine

I’m a big fan of Time Machine; I’ve been using it for years and very rarely had problems. But Time Machine does have problems. I wouldn’t recommend using Time Machine by itself, particularly not if you’ve just got a single external disk.

There are a few reasons for using two drives for Time Machine:

  • Keep backup history if one drive dies.
  • Perform backups frequently without unduly stressing a drive.
  • Potentially a better chance of withstanding a Time Machine problem. (Perhaps? Fortunately this hasn’t happened to me yet.)

Because of the nature of our work in the newsroom, a lot can change with a page or an article in a very short span of time, yet as we approach deadline there may not be any time to recreate lost work. So I run Time Machine every 20 minutes via a script.

It started off life as a shell script by Nathan Grigg, which I was first attracted to because at home my Time Machine drives sit next to my iMac — so between backups I wanted them unmounted and not making any noise. I’ve since recreated it in Python and expanded its logging.

Here’s the version I use at home:

 1 #!/usr/local/bin/python3
 2 """Automatically start TimeMachine backups, with rotation"""
 4 from enum import Enum
 5 import logging
 6 from pathlib import Path
 7 import subprocess
 9 logging.basicConfig(
10   level=logging.INFO,
11   style='{',
12   format='{asctime}  {levelname}  {message}',
13   datefmt='%Y-%m-%d %H:%M'
14 )
16 TM_DRIVE_NAMES = [
17   'HG',
18   'Orson-A',
19   'Orson-B',
20   ]
22 DiskutilAction = Enum('DiskutilAction', 'mount unmount')
25 def _diskutil_interface(drive_name: str, action: DiskutilAction) -> bool:
26   """Run diskutil through subprocess interface
28   This is abstracted out because otherwise the two (un)mounting functions
29   would duplicate much of their code.
31   Returns True if the return code is 0 (success), False on failure
32   """
33   args = ['diskutil', 'quiet', action.name] + [drive_name]
34   return subprocess.run(args).returncode == 0
37 def mount_drive(drive_name):
38   """Try to mount drive using diskutil and return status code"""
39   return _diskutil_interface(drive_name, DiskutilAction.mount)
42 def unmount_drive(drive_name):
43   """Try to unmount drive using diskutil and return status code"""
44   return _diskutil_interface(drive_name, DiskutilAction.unmount)
47 def begin_backup():
48   """Back up using tmutil and return backup summary"""
49   args = ['tmutil', 'startbackup', '--auto', '--rotation', '--block']
50   result = subprocess.run(
51     args,
52     stdout=subprocess.PIPE,
53     stderr=subprocess.PIPE
54     )
55   if result.returncode == 0:
56     return (result.returncode, result.stdout.decode('utf-8'))
57   else:
58     return (result.returncode, result.stderr.decode('utf-8'))
61 def main():
62   drives_to_eject = []
64   for drive_name in TM_DRIVE_NAMES:
65     if Path('/Volumes', drive_name).exists():
66       continue
67     elif mount_drive(drive_name):
68       drives_to_eject.append(drive_name)
69     else:
70       logging.warning(f'Failed to mount {drive_name}')
72   logging.info('Beginning backup')
73   return_code, log_messages = begin_backup()
74   log_func = logging.info if return_code == 0 else logging.warning
75   for line in log_messages.splitlines():
76     log_func(line)
77   logging.info('Backup finished')
79   for drive_name in drives_to_eject:
80     if not unmount_drive(drive_name):
81       logging.warning(f'Failed to unmount {drive_name}, trying again…')
82       if not unmount_drive(drive_name):
83         logging.warning(
84           f'Failed to unmount {drive_name} on second attempt')
86 if __name__ == '__main__':
87   main()

The basic idea is the same: mount the backup drives if they’re not already, perform the backup using tmutil, and eject the drives that were mounted by the script afterwards. There’s nothing tricky in the script, except maybe the enum on line 22, which is there to avoid passing bare strings around and risking a typo.

The arguments to tmutil on line 49 get Time Machine to behave as if it were running its ordinary, automatic hourly backups. The tmutil online man page is out of date but does contain information on those options.

The version at work is the same except that it contains some code to skip backups overnight:

TIME_LIMITS = (8, 23)

def check_time(time, limits=TIME_LIMITS):
  """Return True if backup should proceed based on time of day"""
  early_limit, late_limit = limits
  return early_limit <= time <= late_limit

# And called like so:

This was written pretty late at night so it’s not the best, but it does work.
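The call itself isn’t shown above. One plausible way to wire it in is sketched here, assuming the hour of the day is what gets passed as `time`; the `should_back_up` helper is hypothetical and not part of the original script.

```python
from datetime import datetime

TIME_LIMITS = (8, 23)


def check_time(time, limits=TIME_LIMITS):
    """Return True if backup should proceed based on time of day"""
    early_limit, late_limit = limits
    return early_limit <= time <= late_limit


def should_back_up(now=None):
    """Hypothetical wrapper: test the current hour against the limits."""
    if now is None:
        now = datetime.now()
    return check_time(now.hour)
```

Because the comparison is inclusive at both ends, a run at any point during the 23:00 hour still goes ahead, so with these limits backups are skipped only from midnight until 07:59.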


SuperDuper is great, if you use it as it’s meant to be used. It runs after everyone’s gone home each evening.

I don’t look forward to booting from the 2.5″ clone drive but an external SSD was too expensive. In any case, should the Mini’s internal drive die the emergency plan is just to set up file sharing on one of our two SSD-equipped iMacs, copy over as many files as needed, and finish off the paper.

We’ve done this before when our internal network went down (a dead rack switch) and when a fire (at a nearby premises) forced us to produce the paper from a staff member’s kitchen. If such a thing happens again and the Mini is still functioning, I’d move that too as it’s not dog-slow like the old one.

In an ideal world we’d have a spare Mini ready to go, but that’s money we don’t have.


The Mini’s internal drive is backed up to S3. The infrequent access storage class is used to keep the costs down, but using S3 over Glacier means we can restore files quickly.

The current year’s archive and last year’s are also kept in S3, because if the archive drives die we’re more likely to need more recent editions.

All of the archives are kept in Glacier, but the modern storage class rather than the legacy vaults. This includes the years that are kept in S3, so that we don’t suddenly have to upload 350GB of stuff as we remove them from the more expensive storage.
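The split between the two storage classes boils down to a simple rule. A sketch, assuming archives are organised by year; the function and the destination labels are hypothetical, since Arq is set up through its interface rather than in code.

```python
def archive_destinations(archive_year: int, current_year: int) -> list:
    """Return where a given year's archive is kept under the rule above.

    Every year goes to Glacier; this year's and last year's archives are
    additionally kept in S3 so they can be restored quickly.
    """
    destinations = ['glacier']
    if current_year - 1 <= archive_year <= current_year:
        destinations.append('s3')
    return destinations
```

When a year ages out of the two-year window it only has to be removed from S3, because its Glacier copy already exists.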

I’m slowly removing the legacy Glacier vaults, but it takes forever. You first have to empty the vault, then wait some time so you can actually remove the vault from the list in the management console. I use Leeroy Brun’s glacier-vault-remove tool, to which I had to make a minor change for it to work with Python 3.6 but it was straightforward to get running. (I think it was a missing str.encode() or something. Yes, I know, I should submit a pull request. But it was a one-second job.)

Depending on the size of your vaults, be prepared to have the script run for a long time — this is because of the way that Glacier works (slowly). It took a couple of days for me to remove a 1.5TB-ish vault, followed by a little wait (because the vault had been “recently accessed” — to create the inventory for deletion).

Arq’s ability to set hours during which backups should be paused is great — I could easily ensure our internet connection was fully available during the busiest part of the day.

These are set per-destination (good!) but if one destination is mid-backup and paused, your backups to other destinations won’t run. Consider this case:

  • Our archive is being backed up to Glacier.
  • Archive backup is paused between 3pm and 7pm.
  • Backups of the server’s internal drive won’t run during 3pm to 7pm because the archive backup is “ongoing” — even if it is paused.

What that meant in practice is that the internal drive wasn’t being backed up to S3 hourly during the (weeks-long) archive backup, unless “Stop backup” was pressed during the pause interval to allow other destinations to back up.

Ideally, paused backups would be put to one side during that window of time and other destinations allowed to run.


Our initial 3.5TB Backblaze (Star affiliate link) backup is still ongoing, at about 50GB a day over one of our two 17Mbps-up fibre lines. In all it’ll have taken a couple of months, compared to a couple of weeks for Arq to S3 & Glacier.
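For a sense of scale, those figures can be turned into rough arithmetic. This is a sketch, assuming decimal units (1TB = 1,000GB) and a steady upload rate; the function names are made up for the example.

```python
def days_to_upload(total_gb: float, gb_per_day: float) -> float:
    """Days needed to upload total_gb at a steady gb_per_day."""
    return total_gb / gb_per_day


def max_gb_per_day(uplink_mbps: float) -> float:
    """Theoretical ceiling for a line, ignoring overhead and other traffic.

    Megabits per second -> megabytes per second -> megabytes per day -> GB.
    """
    return uplink_mbps / 8 * 86_400 / 1_000


initial_backup_days = days_to_upload(3_500, 50)
line_ceiling_gb = max_gb_per_day(17)
```

At about 50GB a day the 3.5TB upload works out at 70 days, which squares with “a couple of months”. A fully used 17Mbps line could in theory carry roughly 184GB a day, so Backblaze is using well under a third of it.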

One thing I’ve noticed from personal use at home is that Arq has no qualms about using your entire bandwidth, whereas Backblaze does seem to try not to clog your whole connection. That’s fine at home, but at work with more than one line it matters less.

I was hesitant to write this post up before we’re all set with Backblaze, but it’s got about a week to run still.

I’m interested in seeing what “continuous” backups means in practice, but the scheduling help page says: “For most computers, this results in roughly 1 backup per hour.” The important thing for me is having an additional regular remote backup separate to Arq (mostly paranoia).


Were money no object, I have some ideas for improvements.

First would be to add Google Cloud Storage as an Arq backup destination, replicating our S3-IA/Glacier split with Google’s Nearline/Coldline storage. It would “just” give us an additional Arq backup, not dependent on AWS.

A big reason why I added Backblaze is to have a remote backup that relied on neither AWS nor Arq. (That’s really not to say that I don’t trust Arq, it’s an eggs-in-one-basket thing.)

Next I’d add an additional Time Machine drive (for a total of three). Running every 20 minutes across two drives means that each drive takes three backups every two hours, a load 50% higher than the ordinary one backup per hour. Adding a third drive would mean that each drive is backed up to only once per hour. And it would also mean that backup history would still be stored on two drives should one fail.
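The load figures can be checked with a few lines of arithmetic. A sketch, assuming backups rotate strictly between the drives so that each takes an equal share; the function name is hypothetical.

```python
from fractions import Fraction


def backups_per_drive_per_hour(interval_minutes: int, drive_count: int) -> Fraction:
    """Average number of backups each drive receives per hour."""
    backups_per_hour = Fraction(60, interval_minutes)
    return backups_per_hour / drive_count


ordinary = backups_per_drive_per_hour(60, 1)      # stock hourly Time Machine
two_drives = backups_per_drive_per_hour(20, 2)    # the current setup
three_drives = backups_per_drive_per_hour(20, 3)  # with a third drive added
```

Two drives at 20-minute intervals come out at 3/2 backups per drive per hour, which is the 50% increase over the ordinary rate; three drives bring it back to exactly one.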

An external SSD to hold a SuperDuper clone is tempting, for the speed and because it might mean that we could continue to use the Mini as usual until we could get the internal drive replaced. (Though that might well be a bad idea.) Our server contains nowhere near its 1TB capacity, so we might get away with a smaller SSD.

And, maybe, instead of extra drives for RAID mirrors (as discussed above), a network-attached storage (NAS) device. But I have no experience with them nor any idea about how best to make use of one. And they seem to be a bit of a black box to me, and a major goal for our setup (Mac Mini, external drives, GUI backup programs) is to be reasonably understandable to any member of production staff — after all, we’re all journalists, not IT workers.

It’s that time again, time to ponder Leicester’s fortunes now that we’re well within the final third of the season. Our plucky Foxes have 11 games to go. Can they escape to something-well-short-of-victory or will the wrenching despair of Claudio Ranieri’s sacking be too crushing to overcome?

To set the stage for our pretty pictures, let’s remind ourselves of that abyss of guilt and shame.

“My heartfelt thanks to everybody at the club, all the players, the staff, everybody who was there and was part of what we achieved.

“But mostly to the supporters. You took me into your hearts from day one and loved me. I love you too.

“No-one can ever take away what we together have achieved, and I hope you think about it and smile every day the way I always will.”

I well up every time I read that. Heartbreaking.

I’ve ditched the “points after x games” chart from our previous two outings because it’s not much help anymore. (Also because of the existential sadness.)

A chart showing Leicester City’s points over time in the 2014-15 to 2016-17 seasons as of 2017-03-17

Er. Hm. I was going to mark when Claudio was sacked but, c’mon, it’s that upturn for game 26 when we beat Liverpool 3-1. That was literally a couple days after he was sacked. The next week it was 3-1 against Hull (yeah, OK).

Otherwise, the current season looks roughly like the disastrous/heroic 2014-15, but notably without the long nearly point-less spell in the first half and so the recovery looks like it’s come much earlier.

For the first half of the season Leicester were clinging to the safety trend line, as shown very clearly below:

A chart showing Leicester City’s distance from the safety trend, 2014-15 to 2016-17 seasons as of 2017-03-17

The most worrying part is that sharp decline at the start of the second half — certainly bringing back memories of two years ago. There really isn’t a huge amount in it, but that cushion of a few points could mean a lot.

It’s worth revisiting the reason why I focus so much on that safety trend, given my first such post:

The dashed line represents steady progress towards the magical 40 points to avoid relegation. (West Ham landed in the drop zone with 42 points in 2002–3, but that was unusual. Most of the time you need even fewer points; last season it was 38.)
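One way to read the dashed line is as a straight line from 0 points at the start of the season to 40 points after the final game. A sketch of the calculation behind the distance chart, assuming that linear shape and a 38-game season; the function names are illustrative and this is not the code that drew the charts.

```python
SAFETY_POINTS = 40  # the "magical 40 points"
SEASON_GAMES = 38   # games per Premier League season


def safety_trend(games_played: int) -> float:
    """Points a team on the dashed line would have after games_played."""
    return SAFETY_POINTS * games_played / SEASON_GAMES


def distance_from_safety(points: int, games_played: int) -> float:
    """Positive means above the safety trend, negative means below it."""
    return points - safety_trend(games_played)
```

By this measure Leicester’s 41 points at the end of 2014-15 left them one point above the line after 38 games.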

This season is a bit more competitive than last; the bottom-placed team currently (Sunderland, 19) has more points than the bottom-placed team did at the end of last term (Aston Villa, 17).

But clawing your way up out of the drop zone is bloody hard, and we mustn’t forget just how bloody close the end of 2014-15 was: Leicester finished with 41 points, just six more than 18th-placed Hull and with three other sides between the two.

That said, the fight is not necessarily with the other low-placed teams, the fight is with whoever you’ve got facing you on the pitch. And then we’re back to the importance of that safety line: snatching as many points wherever you can get them, however (within reason) you can get them.

At work, Roger Domeneghetti asked how much the Foxes owe Claudio. At the time I was well in a post-Claudio daze, but ultimately he’s right. The Foxes owe Claudio a lot, a huge amount, but they don’t owe him relegation. You can “What if?” the possibilities if he’d stayed, but before those two consecutive wins Leicester certainly looked headed for the drop, and staying up is still far from certain.

I’ll always love you Claudio, but few things last forever.

When I moved from south-east London to east London last year, one of the things I was looking forward to was fairly quick access to CS2, a substantially-segregated route from Stratford to Aldgate.

To get to CS2, I go from my home in Leyton to the northern half of the Olympic Park via Temple Mills Lane and Honour Lea Avenue. From there it’s a nice, traffic-free cycle into the southern half of the park.

But at that point you need to make a decision where you need to go because there’s no traffic-free or segregated way of reaching CS2 on Stratford High Street from the park. (We built a massive naff-off Velodrome but won’t give people a safe way to get to it from the main road. It’s ludicrous.)

I’ve generally gone to the roundabout at the junction of Montfichet and Warton Roads, then down Warton to Stratford High Street, and then turn right at the crossroads into the segregated CS2.

But Warton is a horrible traffic sewer, being one of the few ways to get from Stratford proper to the Westfield shopping centre. Newham have recently installed a series of raised tables and raised crossings in an attempt to slow the traffic, an example of tackling the symptoms rather than the cause. It’s not pleasant to cycle and a couple of years ago the junction with Stratford High Street was ranked as one of the most dangerous for people on bikes.

Coming the other way, eastbound, things have been made worse by the Weston Homes development along Stratford High Street, with sections of the segregated track regularly closed. This forces people on bikes to join a lane of fast-moving traffic.

Take, for example, the surprise cones captured by @SW19cam:

Surprise cones on CS2

The sadly typical “Cyclists sod off” snapped by @RossiTheBossi, complete with van sitting in the track:

Cyclists Dismount

And then there’s Lucia Quenya’s video of the mess:

I now try to avoid this section of CS2 when heading home and — spurred on by Diamond Geezer — I think I’ve settled on a largely traffic-free route that avoids Warton Road and the Weston Homes works. Here’s a modified screenshot from Google Maps:

A map of the lower section of the Olympic Park and Stratford High Street

Basically, heading east, you take the first left after Bow roundabout (into Cook’s Road), follow the road round, and then turn left after Pudding Mill Lane DLR station. Turn right at the lights, then go up the dropped kerb immediately after the bridge. That puts you in the big chunk of empty ground to the south of the Squiggle and you can go on your merry way.

Heading west it’s the same until you reach the junction of Stratford High Street, but at that point dismount and head for the pedestrian crossing to your right. You can then join the advanced stop box heading west.

The route shown in the map above isn’t direct by any means, and you may encounter the odd large vehicle between Loop Road and Stratford High Street — whether construction or industrial. It’s also likely to be impassable on West Ham home match days.

It’s not a great trade-off, and it’s crackers that this is even necessary if you want to avoid motors so close to the Olympic Park (#Legacy) and the CS2 extension — the first CSH section that was substantially segregated.

But that’s the situation we’re in.

At work I’m responsible for our office file server, which is mostly used for storing the InDesign pages and related files we need to produce the paper every day.

At the end of 2016 we replaced the server, which was a 2010 or 2011 Mac Mini, with a “current” Mac Mini — the 2014 release. That wasn’t ideal and part of the reason why I hadn’t replaced it sooner was because I was waiting for Apple to at least refresh the specs.

In any case, we got to a point where the machine needed swapping. The final straw for me was being unable to trust that the backups would run reliably.

Ever since I got my current post (in mid-2014) I’d been meaning to overhaul our backup strategy but it was always a case of “I’ll do it when we replace the server.” It’s worth sketching out what we were doing before moving on to what caused concern and what I’ve replaced it with now.

In this, I’m going to include RAID because the redundancy side of things also changes. For context, our working day is roughly 10am-7pm Sunday–Friday:

  • Two 500GB drives in the Mini itself, in RAID 1 (mirrored).
  • SuperDuper! copies (newer files) to two drives (in RAID 1) at — if I remember correctly — 9am, 12pm, 3pm, 5pm, 9pm, Sunday–Friday.
  • Daily SuperDuper copies (newer files) to two drives (in RAID 1) at about 9pm, Sunday–Friday.
  • Weekly SuperDuper erase-and-copy on Saturday to both the three-ish-hourly mirrors and the daily mirrors.
  • Arq backups to (old-style) Glacier vaults, once a day.

The SuperDuper copies were done to a set of USB 2 disk enclosures, containing ordinary 3.5″ drives.

You’ll have noticed some fuzziness in my recollections just there, for an important reason: the machine’s performance was so poor everything bled into each other. It was totally competent serving files over AFP but trying to work on the machine itself was a nightmare.

And this is the point where it gets concerning. It wasn’t rare to log in to the server to find a SuperDuper copy having run for 12 hours and still going, or that Arq had hung and needed its agent restarted. Both are solid pieces of software (though it was an older version of Arq, v3 I think) so the blame does lie with the machine.

Which is odd, really, because its only tasks were:

  • Serve files on the local network over AFP (to fewer than 25 users).
  • Run OS X Server to handle user log-in (for access to files only).
  • Host our internal (unloved and little-used) wiki.
  • Perform backups (of its own drives).

Hopefully you can spot the problems — both the ones inherent in the strategy and the ones that result from poor performance:

  • Local backups were relatively infrequent and could result in hours of data loss.

    This is a bit dicey when our business relies on getting all pages to the printers by the early evening. Each page takes about an hour of work to create.

  • Local backups may not complete in a timely fashion.

    This exacerbates the point above, and the cause wasn’t readily apparent.

  • Local backups, as copies, don’t allow for the recovery of older versions of a file.

    So if you accidentally erase everything off a page, save it, and it’s backed up, you can’t recover the previous state unless it was picked up by the previous daily backup. (That’s an unlikely event, since most pages are created, finished and published between daily backups.)

  • Remote backups, because they were stored in Glacier, would take several hours even to reach the point where files could begin to be restored.

    Even if they were hourly (not our case), if you lose a page an hour before deadline and it’s not backed up locally, your only option is to recreate the page because there’s no chance you’ll get it out of Glacier fast enough.

Let’s walk through perhaps the most likely example of data loss during our working day, when a user accidentally deletes a file (taking an InDesign page as our file):

Because of the way deletes work over AFP & SMB, the deleted file is removed immediately, in contrast to a local delete requiring two steps to complete (usually): the file is moved to the trash, the trash is emptied.

First, the local backups are checked. In the lucky case, a backup has run recently (and completed in good time!); otherwise you could face up to three hours’ data loss, which could be all of the work on that page.
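The “up to three hours” figure follows from the copy schedule listed earlier. A sketch, assuming the schedule as recalled (9am, 12pm, 3pm, 5pm, 9pm), a 10am-7pm working day, copies that finish instantly and a first copy that runs before the day starts; the function name is hypothetical.

```python
def worst_case_loss_hours(backup_hours, day_start, day_end):
    """Longest time since the last copy at any moment of the working day.

    Hours are on a 24-hour clock. Each copy covers the stretch until the
    next one (or midnight); only stretches overlapping the working day count.
    """
    hours = sorted(backup_hours)
    worst = 0
    for last, following in zip(hours, hours[1:] + [24]):
        window_end = min(following, day_end)
        if window_end > max(last, day_start):
            worst = max(worst, window_end - last)
    return worst


old_schedule = [9, 12, 15, 17, 21]
old_worst_case = worst_case_loss_hours(old_schedule, day_start=10, day_end=19)
```

For the old schedule this gives three hours, reached just before the 12pm and 3pm copies. In practice it was worse, because copies that ran for hours pushed the real gap out further.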

If the page was created on a previous day (uncommon) then you’d also check the daily local backups, hoping that the page was substantially complete at that point.

Then you’d check the remote backups, hoping that (like above) the page was substantially complete at the point that the backup ran and that you have the three-to-five hours needed for the restore to begin.

It’s perhaps clear that the chances of recovering your work are not good. It gets even worse when you consider the next common case: that a user modifies a complete page, destroying the original work, and that modified version replaces the original on the more-frequent local backups.

(When I was very new at the Star I did something similar. I was asked to work on a features page for the following day. I opened up the corresponding page for edition to save a copy but for whatever reason didn’t — and began to work on the “live” page for edition. When this was noticed — close to deadline! — I was thankfully able to undo my way back to the original state, as I hadn’t closed InDesign since opening it. It was an incredibly close call that still makes me feel queasy when I think about it.)

So while we did have a strategy that included both local and remote backups, it was effectively useless. That everything locally was in RAID 1 — meaning that if one drive of a mirror fails, you don’t lose data — I think just shows that when it was set up, the core purpose of backing up our data was misunderstood.

We had copies of our business-critical data, but constructed in such a way that protection against drive failure was far, far more important than the much more common case of restoring fast-changing files. In this system, you could have five drives die and still have a copy of our server data locally, or all six die (perhaps in a fire) and be able to recover remotely.

To sum up, it was reasonable protection for a rare event but little protection for a common event.

It is, of course, possible to have both. In my next post I’ll go into what’s changed, why, and what that means for protecting our data in both the common and rare cases.