Diamond Geezer asks: “Why do we never end up in the middle?”

It’s unfair to pick on him, but I will because he posted on a day when my annoyance at centrist liberals has well and truly peaked.

First off, the “centre ground” is a concept that is entirely relative. When Jeremy Corbyn campaigned to become and was elected leader of the Labour Party in 2015, he managed to shift the centre ground — the Tories very quickly ditched a plan to bomb the Syrian government.

The centre ground is inherently unstable because it only exists relative to the two dominant forces either side. At our present moment that’s a fairly right-wing Conservative Party and a reasonably social democratic Labour Party. Any “centrist” must define themselves in opposition to their closest opponents on the left and right.

Ultimately if you do that it means you have no principles, nothing that anchors you on the left-right axis. In reality — as much as we joke about spineless politicians — few define their positions in this way and instead the “centre” in various countries is the home to a party that has “right-wing” economic policies and “left-wing” social policies. In Britain that would be the Liberal Democrats, despite Tim Farron’s recent attempts to win over the homophobes.

Left and right are in scare quotes above because this shows the point at which the left-right axis breaks down.

Ultimately the idea of centrism is bankrupt. Politics is a clash of interests. The ideas of the “centre ground,” of the “national interest,” are rubbish. Howard Zinn put it best in his People’s History of the United States:

Nations are not communities and never have been. The history of any country, presented as the history of a family, conceals the fierce conflicts of interest (sometimes exploding, often repressed) between conquerors and conquered, masters and slaves, capitalists and workers, dominators and dominated in race and sex.

As a socialist, to use our compromised axis, the boss class sits on the right and the workers on the left. Given that the boss class is but a tiny sliver of the population, what credibility does a “centrist” party have, one that pretends to balance the desires of the exploited and the exploiters?

It’s this “refreshing centrism” that irks me the most, as it is always right-wing economic policies paired with some ameliorating factor — support for gay marriage, say — to assuage the liberals.

But if you’re gay, does being able to officially consecrate your relationship make up for the fact that you spend half your wages on rent?

This has run on, so let’s talk about Emmanuel Macron. The Guardian loves him, noting (without the expected contradicting clause) that it “is tempting … to conclude that European liberal values have successfully rallied to stop another lurch to the racist right.”

And so Macron, an explicit neoliberal, is raised up having defeated (we’ll see) the fascist Marine Le Pen.

The celebration is of liberal values, embodied by Macron. But Macron’s liberal values go a long way to explain the surge in support for France’s fascist National Front, as Cole Stangler shows. His liberal values are likely to increase “unemployment, inequality and poverty” through his right-wing economic policies — along the lines of the French law that bears his name (loi Macron) and hacked away at workers’ rights.

The assault on workers’ rights and public services has been ongoing for nearly 40 years yet liberals and centrists deride the term that describes our current phase: neoliberalism.

The refusal to recognise this trend puts us in a position where the Guardian celebrates the likely victory of Macron, cheering his defeat of the fascists in blissful ignorance. But his political current is the reason why we have ended up with the fascists contesting the second round of the French presidential election (again).

Faced with falling employment and living standards for four decades and (generally) abandoned by the organised left, people have turned to those who promise to take action to improve their material conditions.

Yet Macron’s policies will just exacerbate these problems. This isn’t the end of the fascist challenge in France; should Macron win and pursue his neoliberal programme we could well be in the same situation in five years’ time.

(Unless, potentially, the French left organises a strong anti-fascist campaign like that waged in Britain from the 1970s to the present time, in which the fascists have more or less been suffocated.)

This isn’t a “bold break with the past,” it is the continuation of the rule of the boss class with a fresh coat of paint.

At work we deal a lot with PDFs, both press quality and low-quality for viewing on screen. Over time I’ve automated a fair amount of the creation for both types, but one thing that I haven’t yet done is automate file-size reductions for the low-quality PDFs.

(We still use InDesign CS4 at work, so bear in mind that some or all of the below may not apply to more recent versions.)

It’s interesting to look at exactly what makes the files large enough to require slimming down in the first place. All our low-quality PDFs are exported from InDesign with the built-in “Smallest file size” preset, yet a single tabloid-sized, image-sparse page usually comes in at around 700kB.

A low-quality image of a Morning Star arts page.

Let’s take Tuesday’s arts page as our example. It’s pretty basic: two small images and a medium-sized one, two drop shadows, one transparency and a fair amount of text. (That line of undermatter in the lead article was corrected before we went to print.)

But exporting using InDesign’s lowest-quality PDF preset creates a 715kB file. The images are small and rendered at a low DPI, so they’re not inflating the file.

Thankfully you can have a poke around PDF files with your favourite text editor (BBEdit, obviously). You’ll find a lot of “garbage” text, which I imagine is chunks of binary data, but there’s plenty of plain text you can read. The big chunks tend to be metadata. Here’s part of the first metadata block in the PDF file for the arts page:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP […]">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
     Blah blah blah exif data etc 
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>

That’s the none-too-exciting block for one of the images, a Photoshop file. There are two more like this, roughly 50–100 lines each. Then we hit a chunk that describes the InDesign file itself, with this giveaway line:

<xmp:CreatorTool>Adobe InDesign CS4 (6.0.6)</xmp:CreatorTool>

So what, right? InDesign includes some document and image metadata when it exports a PDF. Sure, yeah. I mean, the metadata blocks for the images weren’t too long, and this is just about their container.

Except this InDesign metadata block is 53,895 lines long in a file that’s 86,585 lines long. 574,543 characters of the document’s 714,626 — 80% of the file.

I think it’s safe to say we’ve found our culprit. But what’s going on in those 54,000 lines? Well, mostly this:

<xmpMM:History>
   <rdf:Seq>
      <rdf:li rdf:parseType="Resource">
         <stEvt:action>created</stEvt:action>
         <stEvt:instanceID>xmp.iid:[… hex ID …]</stEvt:instanceID>
         <stEvt:when>2012-05-22T12:55:27+01:00</stEvt:when>
         <stEvt:softwareAgent>Adobe InDesign 6.0</stEvt:softwareAgent>
      </rdf:li>
      <rdf:li rdf:parseType="Resource">
         <stEvt:action>saved</stEvt:action>
         <stEvt:instanceID>xmp.iid:[… hex ID …]</stEvt:instanceID>
         <stEvt:when>2012-05-22T12:55:54+01:00</stEvt:when>
         <stEvt:softwareAgent>Adobe InDesign 6.0</stEvt:softwareAgent>
         <stEvt:changed>/</stEvt:changed>
      </rdf:li>
    <!--  1,287 more list items  -->
   </rdf:Seq>
</xmpMM:History>

It’s effectively a record of every time the document was saved. But if you look at the stEvt:when tag you’ll notice the first items are from 2012 — when our “master” InDesign file from which we derive our edition files was first created. So, the whole record of that master file is included in every InDesign file we use, and the PDFs we create from them.
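
If you want to check this on a PDF of your own, here’s a rough sketch (not part of our workflow, just for poking around) that scans the raw bytes for xmpmeta packets and reports how much of the file they account for:

# Rough sketch: measure how much of a PDF is XMP metadata by scanning the
# raw bytes for the xmpmeta open/close tags. This is fine for the
# uncompressed metadata streams InDesign writes, but it's no PDF parser.
from pathlib import Path

OPEN_TAG = b'<x:xmpmeta'
CLOSE_TAG = b'</x:xmpmeta>'


def xmp_share(pdf_path):
    """Return (metadata_bytes, total_bytes) for the given PDF."""
    data = Path(pdf_path).read_bytes()
    metadata_bytes = 0
    start = data.find(OPEN_TAG)
    while start != -1:
        end = data.find(CLOSE_TAG, start)
        if end == -1:
            break
        metadata_bytes += (end + len(CLOSE_TAG)) - start
        start = data.find(OPEN_TAG, end)
    return metadata_bytes, len(data)


meta, total = xmp_share('11_Books_180417.pdf')
print(f'{meta:,} of {total:,} bytes ({meta / total:.0%}) is XMP metadata')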

Can we remove this metadata from InDesign? You can see it in File ▸ File Info… ▸ Advanced, select it and press the rubbish bin icon. Save, quit, reopen and… it’s still there.

Thankfully Acrobat can remove this stuff from your final PDF, by going through the “PDF Optimizer” or “Save Optimized PDF” or whatever menu item it’s hiding under these days. (In the “Audit Space Usage” window it corresponds to the “Document Overhead”.)

Unfortunately Acrobat’s AppleScript support has always been poor — I’ve no idea what it’s like now, remember we’re talking CS4 — and I’ve no time nor desire to dive into Adobe’s JavaScript interface. So while you can (and we do) automate the PDF exports, you can’t slim these files down automatically with Acrobat.

Our solution at work has been to cut the cruft from the PDF using Acrobat when we use it to combine our separate page PDFs by hand. But ultimately I want to automate the whole process of exporting the PDFs, stitching them together in order, and reducing the file size.

After using ghostscript for our automatic barcode creation, I twigged that it would be useful for processing the PDFs after creation, and sure enough you can use it to slim down PDFs. Here’s an example command:

gs -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/screen \
   -dCompatibilityLevel=1.5 \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile="11_Books_180417-smaller.pdf" \
   "11_Books_180417.pdf"

Most of that is ghostscript boilerplate (it’s not exactly the friendliest tool to use), but the important option is -dPDFSETTINGS=/screen which, according to one page of the sprawling docs, is a predefined Adobe Distiller setting.

Using it on our 715kB example spits out a 123kB PDF that is visually identical apart from mangling the drop shadows (which I think can be solved by changing the transparency-flattening settings when the PDF is exported from InDesign).
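
The next step is to wire this into the export automation. We’re not there yet, but a wrapper could look something like this (the exported-pages folder and the -smaller output naming are invented for illustration):

# Sketch of batch-shrinking exported page PDFs with ghostscript.
import subprocess
from pathlib import Path


def shrink_pdf(source, destination):
    """Run ghostscript with the /screen preset; return True on success."""
    args = [
        'gs', '-sDEVICE=pdfwrite',
        '-dPDFSETTINGS=/screen',
        '-dCompatibilityLevel=1.5',
        '-dNOPAUSE', '-dQUIET', '-dBATCH',
        '-sOutputFile=' + str(destination),
        str(source),
    ]
    return subprocess.run(args).returncode == 0


# Shrink every exported page, writing "-smaller" copies alongside.
for pdf in sorted(Path('exported-pages').glob('*.pdf')):
    if pdf.stem.endswith('-smaller'):
        continue  # skip files we've already produced
    smaller = pdf.with_name(pdf.stem + '-smaller.pdf')
    if not shrink_pdf(pdf, smaller):
        print('ghostscript failed on', pdf.name)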

In my previous post about page speed, I mentioned that I’d written my own site generator. I’m not quite ready to talk specifically about it — I want to write some documentation first — and really I doubt that anyone but me should be using it.

But, having set up publishing to Amazon S3 today, I wanted to write up how I publish this blog to multiple places so that it’ll be around whatever (within reason) might happen.

Majestic’s configuration files are set up in such a way that you have a default settings file in a directory — settings.json — and you can specify others that make adjustments to that.

In my case the main settings file contains the configuration for publishing to my own server (hosted at Linode) — not the nitty gritty of how to get it on to the server, but what the URLs, site title, etc should be. (It’s online if you want to have a nose around.)

Then I have two extra JSON files: robjwells.github.io.json and s3.robjwells.com.json, which contain the customisations for publishing for those domains. Here’s the config for GitHub in full:

{
  "site": {
    "url": "https://robjwells.github.io",
    "title": "Primary Unit mirror on GitHub",
    "description": "A mirror of https://www.robjwells.com hosted on GitHub"
  },

  "paths": {
    "output root": "gh-pages"
  }
}

Setting site.url is important because of the way my templates render article links (though my markdown source contains only relative links that work anywhere). And paths.output root just specifies the build directory where the HTML files get written.
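
To give a rough idea of how the layering behaves (this isn’t Majestic’s actual code, just an illustration): the defaults in settings.json are loaded first and then the extra file you specify is overlaid on top, section by section.

# Illustration only: load settings.json, then overlay an extra settings
# file on top of it.
import json
from pathlib import Path


def load_settings(directory, overrides=None):
    """Load default settings and merge in an optional overrides file."""
    settings = json.loads((directory / 'settings.json').read_text())
    if overrides is not None:
        extra = json.loads((directory / overrides).read_text())
        for section, values in extra.items():
            settings.setdefault(section, {}).update(values)
    return settings


settings = load_settings(Path('.'), 'robjwells.github.io.json')
print(settings['site']['url'])    # https://robjwells.github.io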

All the moving parts are contained in a makefile which can build all three of my destinations. Here it is in full:

NOW = $(shell date +'%Y-%m-%d %H:%M')
DISTID = $(shell cat cloudfront-distribution-id)


define upload-robjwells
rsync -zv -e ssh www.robjwells.com.conf \
    rick@deckard:/srv/www/www.robjwells.com/
rsync -azv --delete -e ssh site/ \
    rick@deckard:/srv/www/www.robjwells.com/html/
endef


define upload-github
cd gh-pages ; git add . ; git commit -m "$(NOW)" ; git push
endef


define upload-aws
aws s3 sync s3 s3://s3.robjwells.com --delete
aws cloudfront create-invalidation \
    --distribution-id="$(DISTID)" --paths=/index.html
endef


all: robjwells github aws

force-all: force-robjwells force-github force-aws

robjwells:
  majestic
  $(upload-robjwells)

force-robjwells:
  majestic --force-write
  $(upload-robjwells)

github:
  majestic --settings=robjwells.github.io.json
  $(upload-github)

force-github:
  majestic --settings=robjwells.github.io.json --force-write
  $(upload-github)

aws:
  majestic --settings=s3.robjwells.com.json
  $(upload-aws)

force-aws:
  majestic --settings=s3.robjwells.com.json --force-write
  $(upload-aws)

(The force-* options rebuild the whole site, not just files which have changed.)

And, really, that’s all it takes to publish to multiple hosts (once you’re set up at each one, of course).

My own server is just a vanilla rsync command, with an extra one because I keep my Nginx server config locally too.

For GitHub pages the gh-pages folder is a git repository, so make github regenerates the site into that folder, commits the changes with a timestamp as the message, and pushes the changes to GitHub. (It’s all on the same line with semicolons because the cd into the directory doesn’t hold across lines in the makefile.) Because the GitHub repository is set up to publish, the rest is sorted out on their end.

And for S3 I just use the official AWS tool (brew install awscli if you’re on macOS) — the CloudFront line is because I use it to speed up the S3 version and I want to make sure an updated front page is available reasonably quickly, if not anything else.

There’s a bit of overhead in setting all of these up, but once you do, keeping each host updated doesn’t have to be any more work. For me it’s just a make all away.

Since I moved the blog off Tumblr, I’ve tried to make it reasonably quick. Reasonably because it’s come in waves — waves of me saying: “Oh, that’s fine” and then deciding that whatever it is isn’t fine and needs to go.

Tumblr used to whack in about 1MB of extra guff for its own purposes. If you’re posting endless gifs then that’s not something you’ll notice, but when you’re mostly dealing in text it’s pretty obvious.

When I redesigned the site almost four years ago I remember chafing at this, but it wasn’t until I settled on my own site generator that I had the chance to really whittle things down.

Much of that was lopping off unneeded stuff such as jQuery. It did succeed in getting the size down — to about 150–200kB for an average post. But I’ve recently made a few changes to speed things up that I wanted to talk about.

Much of this has come about after reading Jacques Mattheij’s The Fastest Blog in the World and Dan Luu’s two excellent posts Speeding up this blog by 25x-50x and Most of the web really sucks if you have a slow connection. (And, well, of course, Maciej. You should really read that one.)

Fonts

For ages this site used Rooney served by Typekit as its main font. I love Rooney, it’s great. But using web fonts always means sending lots of data.

Despite thinning down the included characters (Typekit allows you to choose language support) and forgoing the use of double emphasis (<em><strong>), serving up three weights of Rooney still clocked in at over 100kB.

I’m looking at Rooney now and it is gorgeous, but there’s no way I could or can justify it — the fonts collectively would usually outweigh anything else on a page. So it went, in favour of Trebuchet MS, which I’ve long had a soft spot for.

DNS

It’s not related to the bytes served up, but switching from my registrar’s (Hover’s) name servers to Cloudflare’s cut DNS response times by about 100ms (from roughly 150ms).

You can host your DNS with Cloudflare for free without using any of their other caching services (I don’t), and Cloudflare is consistently one of the fastest DNS hosts in the world.

Syntax highlighting

Up until now I’d been using highlight.js to colour code snippets, and I’d been very happy with it. It’s a nice library that’s easy to work with, and it’s easy to download a customised build for your own use.

But I’d been handing out a 10kB JavaScript library to everyone who visited — whether there was code to be highlighted or not. Had I included more than a handful of languages in my library it would have been even more.

My first decision was to use my blog generator’s extensions mechanism to mark every post or page that included code and would need highlighting — so if you visited a page without code, you didn’t receive the JavaScript library.

But really that wasn’t enough for me. A few things annoyed me:

  • Syntax highlighting had to be performed on every view on the client device at readers’ expense.
  • I could only highlight those languages I’d included in my library.
  • The library included all of my selected languages no matter what was on the page.

This wasn’t ideal.

The Markdown module I use has support for syntax highlighting, but there were some deficiencies with it that had led me to pick the client-side highlighter in the first place, several years ago.

It wasn’t difficult to fix that, however. Taking inspiration from Alex Chan, I modified the included Codehilite extension to match my requirements, which were to handle a “natural-looking” language line and line numbers in the Markdown source. You can see the source online, but it’s pretty rough and I need to tidy it up. (It also uses Pygments’s inline line numbers, instead of the table approach which I’ve seen out of alignment on occasion.)
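
My extension is really a thin layer over Pygments. The underlying call that produces highlighted HTML with inline line numbers is roughly this (a minimal sketch, not my extension’s actual code):

# Minimal sketch of build-time highlighting with Pygments, using inline
# line numbers rather than the table layout.
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

code = 'print("Hello, world")'
formatter = HtmlFormatter(linenos='inline')
print(highlight(code, get_lexer_by_name('python'), formatter))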

The tradeoff here was serving a slightly inflated HTML file but a much-reduced JavaScript file (I kept a barebones script of a tenth of the size to show the plain source) — but the increase in HTML size is much smaller than the original JavaScript was.

In all, I’ve gone from a baseline of roughly 160kB to 10–15kB per image-free post. It’s not the fastest blog in the world, especially if you’re far away from Linode’s London datacentre, but it should be pretty nippy.

What else?

There are some things which I’ve rejected so far.

  • Inlining JavaScript and CSS.

    This could make the page render faster, at the expense of transferring data that would otherwise be cached for use on other pages. (Though, assuming this blog is like most others, most visitors will only ever view a single page.)

    But I feel a bit icky about munging separate resources together, and in mitigation the site is served over HTTP/2 (which most browsers now support), under which inlining is considered an anti-pattern.

  • Sack off the traditional front page.

    Yeah, I felt this one acutely recently after posting all those dodgy but massive Tube heat maps, sitting towards the bottom of the front page and inflating its size.

    Dan Luu has his archives as the front page, which is svelte but extreme for my tastes. We’ll see about this one.

  • Ditch your CSS.

    Yeah, I know. (Well.) But I like pretty things. And it’s only about 2.5kB.

To recount from last time, our backup strategy was a mess. The tools were solid but used in such a way that the common case (fast file restoration) was likely to fail, even if the rare case (complete disk failure) was covered.

That alone should have made me act sooner than I did. But ultimately that “common” case came up so rarely that the pain it caused wasn’t sufficiently motivating. Add to that a plan already in place to make the change when the server hardware was replaced, a reluctance to spend money that kept delaying that replacement, and a near-total lack of time. So it didn’t happen for about 2.5 years after I got the job.

What finally prompted me to overhaul the backups was a nerve-wracking hour late last year when the Mac Mini server became unresponsive and would fail to start up with its external drives attached.

Detaching the drives, booting and then reattaching the drives got us back to normal. (This had happened before; I think it might have been connected to the third-party RAID driver we were using, but I don’t know that for sure.)

As a precaution I checked the server’s backups. None had completed in five days. Understand that there was no monitoring; checking required logging in to the machine and looking at the Arq interface and SuperDuper’s schedule.

The Arq agent had hung and I believe SuperDuper had too, just frozen in the middle of a scheduled copy. As I noted before, these are both solid pieces of software so I’m very much blaming the machine itself and the way it was set up.

Shortly afterwards I ordered a replacement Mac Mini (with a 1TB Fusion drive that feels nearly as fast as my SSD at home) and a bunch of new external drives.

Hardware

Previously we had been using various 3.5″ drives in various enclosures.

I don’t know what the professional advice is on putting bare internal drives in enclosures versus buying ready-made external drives, but my preference is for “real” external drives: they’re better designed when it comes to using several of them together, they seem to be better ventilated and quieter, and they’re less hassle.

All of our existing drives were several years old, the newest being 2.5 years old (and I managed to brick that one through impatience — a separate story), so they all needed replacing.

I bought four 4TB USB3 Toshiba drives, which run at 7,200 RPM and apparently contain mechanisms manufactured by someone else. The review I read said they’re noisy, but I’ve had all four next to my desk (in an office) since the start of the year, with at least one reading constantly (more on that later), and they’re OK. I might feel differently if I were using one in my quiet bedroom at home, but it’s hard to tell.

Funnily enough, you can no longer buy these drives from the retailer where we got them. But from memory they were about £100 each, maybe a bit less.

More recently I bought a 1TB USB3 Toshiba Canvio 2.5″ external drive to serve as a SuperDuper clone. (Not to match the 4TB Toshiba drives, but because it was well reviewed and cheaper than others.)

In sum, here’s our stock of drives:

  • 1 in the Mac Mini. (You can’t buy dual-drive Minis anymore.)
  • 2 Time Machine drives.
  • 1 nightly SuperDuper clone.
  • 1 for the 2002-2016 archives.
  • 1 for the post-2016 archives.

Redundancy

One of the biggest changes is one that’s basically invisible. Beforehand we had loads of drives, not all of which I mentioned last time.

  • 2 in the Mac Mini server.
  • 2 for daily/3-hourly clones (each drive containing two partitions).
  • 2 for the 2002-2011 archives.
  • 2 for the post-2011 archives.

All of them were in RAID 1, where each drive contains a copy of the data. The idea behind this is that one drive can fail and you can keep going.

We were using a third-party RAID driver by a long-standing vendor. I won’t name them because I didn’t get on with the product, but I don’t want to slight it unfairly.

The RAID software frequently cried wolf — including on a new external drive that lasted 2.5 years until I broke it accidentally — so I got to a point where I just didn’t believe any of its warnings.

The worst was a pop-up once a day warning that one of the drives inside the Mini was likely to fail within the next few weeks. I got that message for over two years. I’d only infrequently log in to the Mini, so I’d have to click through weeks of pop-ups. The drive still hasn’t died.

As I’ve made clear, the machine itself was a pain to work with and that may have been the root problem. But I won’t be using the RAID driver again.

That experience didn’t encourage me to carry over the RAID setup, but the main factor in deciding against using RAID was cost. As we openly admit in the paper, our place isn’t flush with cash so we try to make the best use of what we’ve got.

The demise of the dual-drive Mini means that the internal drive isn’t mirrored, and that would have been the obvious candidate. So if I were to buy extra drives for a RAID mirror, it might be for the external archive drives. They are backed up, but if either drive were to snuff it, most of its contents would be unavailable for a while.

But buying mirror drives means that money isn’t available for other, more pressing needs.

Backup strategy

OK, with that out of the way, let’s talk about what we are doing now. In short:

  • Time Machine backups every 20 minutes, rotated between two drives.
  • Nightly SuperDuper clones to a dedicated external drive.
  • Arq backups once an hour.
  • “Continuous” Backblaze backups.

Time Machine

I’m a big fan of Time Machine; I’ve been using it for years and have very rarely had trouble. But Time Machine does have its problems, and I wouldn’t recommend using it by itself, particularly not if you’ve just got a single external disk.

There are a few reasons for using two drives for Time Machine:

  • Keep backup history if one drive dies.
  • Perform backups frequently without unduly stressing a drive.
  • Potentially a better chance of withstanding a Time Machine problem. (Perhaps? Fortunately this hasn’t happened to me yet.)

Because of the nature of our work in the newsroom, a lot can change with a page or an article in a very short span of time, yet as we approach deadline there may not be any time to recreate lost work. So I run Time Machine every 20 minutes via a script.

It started off life as a shell script by Nathan Grigg, which I was first attracted to because at home my Time Machine drives sit next to my iMac — so between backups I wanted them unmounted and not making any noise. I’ve since recreated it in Python and expanded its logging.

Here’s the version I use at home:

 1 #!/usr/local/bin/python3
 2 """Automatically start TimeMachine backups, with rotation"""
 3 
 4 from enum import Enum
 5 import logging
 6 from pathlib import Path
 7 import subprocess
 8 
 9 logging.basicConfig(
10   level=logging.INFO,
11   style='{',
12   format='{asctime}  {levelname}  {message}',
13   datefmt='%Y-%m-%d %H:%M'
14 )
15 
16 TM_DRIVE_NAMES = [
17   'HG',
18   'Orson-A',
19   'Orson-B',
20   ]
21 
22 DiskutilAction = Enum('DiskutilAction', 'mount unmount')
23 
24 
25 def _diskutil_interface(drive_name: str, action: DiskutilAction) -> bool:
26   """Run diskutil through subprocess interface
27 
28   This is abstracted out because otherwise the two (un)mounting functions
29   would duplicate much of their code.
30 
31   Returns True if the return code is 0 (success), False on failure
32   """
33   args = ['diskutil', 'quiet', action.name] + [drive_name]
34   return subprocess.run(args).returncode == 0
35 
36 
37 def mount_drive(drive_name):
38   """Try to mount drive using diskutil and return status code"""
39   return _diskutil_interface(drive_name, DiskutilAction.mount)
40 
41 
42 def unmount_drive(drive_name):
43   """Try to unmount drive using diskutil and return status code"""
44   return _diskutil_interface(drive_name, DiskutilAction.unmount)
45 
46 
47 def begin_backup():
48   """Back up using tmutil and return backup summary"""
49   args = ['tmutil', 'startbackup', '--auto', '--rotation', '--block']
50   result = subprocess.run(
51     args,
52     stdout=subprocess.PIPE,
53     stderr=subprocess.PIPE
54     )
55   if result.returncode == 0:
56     return (result.returncode, result.stdout.decode('utf-8'))
57   else:
58     return (result.returncode, result.stderr.decode('utf-8'))
59 
60 
61 def main():
62   drives_to_eject = []
63 
64   for drive_name in TM_DRIVE_NAMES:
65     if Path('/Volumes', drive_name).exists():
66       continue
67     elif mount_drive(drive_name):
68       drives_to_eject.append(drive_name)
69     else:
70       logging.warning(f'Failed to mount {drive_name}')
71 
72   logging.info('Beginning backup')
73   return_code, log_messages = begin_backup()
74   log_func = logging.info if return_code == 0 else logging.warning
75   for line in log_messages.splitlines():
76     log_func(line)
77   logging.info('Backup finished')
78 
79   for drive_name in drives_to_eject:
80     if not unmount_drive(drive_name):
81       logging.warning(f'Failed to unmount {drive_name}, trying again…')
82       if not unmount_drive(drive_name):
83         logging.warning(
84           f'Failed to unmount {drive_name} on second attempt')
85 
86 if __name__ == '__main__':
87   main()

The basic idea is the same: mount the backup drives if they’re not already mounted, perform the backup using tmutil, and afterwards eject the drives that the script mounted. There’s nothing tricky in the script, except maybe the enum on line 22, which is there so I’m not passing around bare strings and risking a typo.

The arguments to tmutil on line 49 get Time Machine to behave as if it were running its ordinary, automatic hourly backups. The tmutil online man page is out of date but does contain information on those options.

The version at work is the same except that it contains some code to skip backups overnight:

from datetime import datetime

TIME_LIMITS = (8, 23)

def check_time(time, limits=TIME_LIMITS):
  """Return True if backup should proceed based on time of day"""
  early_limit, late_limit = limits
  return early_limit <= time <= late_limit

# And called like so:
check_time(datetime.now().hour)

This was written pretty late at night so it’s not the best, but it does work.

SuperDuper!

SuperDuper is great, if you use it as it’s meant to be used. It runs after everyone’s gone home each evening.

I don’t look forward to booting from the 2.5″ clone drive but an external SSD was too expensive. In any case, should the Mini’s internal drive die the emergency plan is just to set up file sharing on one of our two SSD-equipped iMacs, copy over as many files as needed, and finish off the paper.

We’ve done this before when our internal network went down (a dead rack switch) and when a fire (at a nearby premises) forced us to produce the paper from a staff member’s kitchen. If such a thing happens again and the Mini is still functioning, I’d move that too as it’s not dog-slow like the old one.

In an ideal world we’d have a spare Mini ready to go, but that’s money we don’t have.

Arq

The Mini’s internal drive is backed up to S3. The infrequent access storage class is used to keep the costs down, but using S3 over Glacier means we can restore files quickly.

The current year’s archive and last year’s are also kept in S3, because if the archive drives die we’re more likely to need more recent editions.

All of the archives are kept in Glacier, but the modern storage class rather than the legacy vaults. This includes the years that are kept in S3, so that we don’t suddenly have to upload 350GB of stuff as we remove them from the more expensive storage.

I’m slowly removing the legacy Glacier vaults, but it takes forever. You first have to empty the vault, then wait some time before you can actually remove the vault from the list in the management console. I use Leeroy Brun’s glacier-vault-remove tool, to which I had to make a minor change for it to work with Python 3.6, but it was straightforward to get running. (I think it was a missing str.encode() or something. Yes, I know, I should submit a pull request. But it was a one-second job.)

Depending on the size of your vaults, be prepared for the script to run for a long time — that’s just the way Glacier works (slowly). It took a couple of days for me to remove a 1.5TB-ish vault, followed by a further wait before the empty vault could actually be deleted, because it counted as “recently accessed” after the script retrieved the inventory needed for deletion.

Arq’s ability to set hours during which backups should be paused is great — I could easily ensure our internet connection was fully available during the busiest part of the day.

These are set per-destination (good!) but if one destination is mid-backup and paused, your backups to other destinations won’t run. Consider this case:

  • Our archive is being backed up to Glacier.
  • Archive backup is paused between 3pm and 7pm.
  • Backups of the server’s internal drive won’t run during 3pm to 7pm because the archive backup is “ongoing” — even if it is paused.

What that meant in practice is that the internal drive wasn’t being backed up to S3 hourly during the (weeks-long) archive backup, unless I pressed “Stop backup” during the pause interval to let the other destinations back up.

Ideally, paused backups would be put to one side during that window of time and other destinations allowed to run.

Backblaze

Our initial 3.5TB Backblaze (Star affiliate link) backup is still ongoing, at about 50GB a day over one of our two 17Mbps-up fibre lines. In all it’ll have taken a couple of months, compared to a couple of weeks for Arq to back up to S3 and Glacier.

One thing I’ve noticed from personal use at home is that Arq has no qualms about using your entire bandwidth, whereas Backblaze seems to try not to clog your whole connection. That restraint is welcome at home; at work, with more than one line, it matters less.

I was hesitant to write this post up before we’re all set with Backblaze, but it’s got about a week to run still.

I’m interested in seeing what “continuous” backups means in practice, but the scheduling help page says: “For most computers, this results in roughly 1 backup per hour.” The important thing for me is having an additional regular remote backup separate to Arq (mostly paranoia).

Enhancements?

Were money no object, I have some ideas for improvements.

First would be to add Google Cloud Storage as an Arq backup destination, replicating our S3-IA/Glacier split with Google’s Nearline/Coldline storage. It would “just” give us an additional Arq backup, not dependent on AWS.

A big reason why I added Backblaze is to have a remote backup that neither relied on AWS nor Arq. (That’s really not to say that I don’t trust Arq, it’s an eggs-in-one-basket thing.)

Next I’d add an additional Time Machine drive (for a total of three). Running backups every 20 minutes rotated across two drives means each drive is written to 1.5 times an hour, 50% more than Time Machine’s ordinary one backup per hour. A third drive would bring that back down to once per hour per drive, and it would also mean that backup history is still stored on two drives should one of them fail.

An external SSD to hold a SuperDuper clone is tempting, for the speed and because it might mean that we could continue to use the Mini as usual until we could get the internal drive replaced. (Though that might well be a bad idea.) Our server contains nowhere near its 1TB capacity, so we might get away with a smaller SSD.

And, maybe, instead of extra drives for RAID mirrors (as discussed above), a network-attached storage (NAS) device. But I have no experience with them, nor any idea about how best to make use of one; they seem a bit of a black box to me. A major goal for our setup (Mac Mini, external drives, GUI backup programs) is that it be reasonably understandable to any member of production staff — after all, we’re all journalists, not IT workers.