I quite enjoy turning out little plots for posts on here. Admittedly I’m not great at it, but I like to have a go.
However, matplotlib really is not my favourite. It feels like there’s a lot of boilerplate to write and a lot of work to do before you can make something reasonably approaching what you had envisioned in your head.
So I thought I’d give R a try, and learn some things about visualisation along the way with Kieran Healy’s data visualisation course notes, which was fun.
But mostly in this post I wanted to show how ludicrously straightforward using ggplot2 can be compared with what you have to do in matplotlib. Let’s pick on my plot of train ticket prices from just before Christmas.
The Python code for that is quite long so I’m not going to include it, but it is available to view online. I’m not being completely fair because part of that involves getting the data into shape, and I’m sure there’s things I could’ve done to cut out a few lines.
That said, it took me a while to figure out exactly how to go about doing the plot in matplotlib, exactly how to, say, parse the dates and label the axis.
There was some of that with R and ggplot2, but that was mostly me looking things up in the documentation as I’ve not used them much. Otherwise it was pretty straightforward to figure out how to build up the plot.
Anyway, here’s the plot:
And here’s the code that produced it:
library(ggplot2)

# Read in and convert string times to datetimes
trains <- read.csv('collected.csv')
trains$Time <- as.POSIXct(trains$Time, format = '%Y-%m-%dT%H:%M:%S')

# Get the data onto the plot
p <- ggplot(trains, aes(x = Time, y = Cost))

# 'Reveal' the data with points and show the
# East Mids price trend with a smoother
completed <- p + geom_point(aes(color = Operator)) +
  geom_smooth(data = subset(trains, Operator == 'East Midlands Trains'),
              aes(group = Operator, color = Operator),
              method = 'loess', se = FALSE,
              size = 0.75, show.legend = FALSE) +

  # Let's adjust the scales
  scale_x_datetime(date_breaks = '1 hour',
                   date_labels = '%H:%M') +
  scale_y_continuous(limits = c(0, 100),
                     breaks = seq(10, 100, 10),
                     expand = c(0, 0)) +

  # Set some labels and adjust the look
  labs(title = paste('Cost of single train tickets',
                     'leaving European\ncapital cities',
                     'on Friday December 23 2016'),
       y = 'Ticket cost (€)',
       color = 'Train operator') +
  theme_bw(base_family = 'Trebuchet MS') +
  theme(plot.title = element_text(hjust = 0.5))

ggsave('plot.svg', plot = completed, device = 'svg',
       width = 8, height = 4, units = 'in')
I’m still figuring things out with R and ggplot so I’m not exactly blazing through. (I still haven’t figured out how to export transparent SVGs without editing them by hand.)
But I love the way that plots are built up out of individual pieces, which makes far more sense to me than trying to wrangle matplotlib’s figures and axes.
I added support for JSON Feed to my homemade static site generator Majestic today, and thought I’d note it because, funnily enough, the two implementations mentioned by John Gruber (by Niclas Darville and Jason McIntosh) used the approach I’d taken for generating my RSS feed, which I wanted to avoid this time.
Basically all three of those define a document template and pass in the posts and other required bits, and you’re done. I’m really not knocking this — again, I do this with the RSS feed and it validates fine. It’s all good.
But I ended up templating my RSS feed like this because I looked at the feedgenerator module and ran away. Majestic was my first Python project of any real size and I wanted to keep things as straightforward as I could. While it looks (with hindsight) reasonably OK in use, it doesn’t have any documentation, has been pulled out of Django, and has funky class names (Rss201rev2Feed) that didn’t fill me with confidence that I could implement an RSS feed quickly.
I was using Jinja templating for the site and, since HTML and XML are cousins, just did that. But you can probably tell that I didn’t really know what I was doing (still don’t!) with escaping, as any field that might contain non-ASCII characters is wrapped in
But hey, it works. Feed’s valid.
With JSON, everything just feels much more obvious. In Python you hand off basic types to the built-in json module and you get back a string, all the encoding taken care of. And if I make a mistake Python will complain at me, instead of just dumping out a file of questionable worth.
I think this is what all the people complaining on the Hacker News thread missed. Working in JSON is comfortable and familiar — the tools are good and you get told when something goes wrong. Working with XML can be unclear and a bit of a pain, and creating an invalid document is a risk.
So my super-duper advanced JSON Feed implementation is… constructing a dict, adding things to it and passing it off to the json module that I use all the time. Taken care of. The code’s so boring I’m not even going to include it here (but it’s online to view).
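For a flavour of how little is involved, here is a minimal sketch of the dict-then-json approach. It is not Majestic’s actual code: the function name, the shape of the post dictionaries and the example.com URLs are all assumptions made for illustration. The top-level and item keys are the ones defined by version 1 of the JSON Feed spec.

```python
import json
from datetime import datetime, timezone


def build_json_feed(title, home_page_url, feed_url, posts):
    """Return a JSON Feed (version 1) document as a string.

    `posts` is a hypothetical list of dicts with 'url', 'title',
    'html' and 'date' keys, not Majestic's real post objects.
    """
    feed = {
        "version": "https://jsonfeed.org/version/1",
        "title": title,
        "home_page_url": home_page_url,
        "feed_url": feed_url,
        "items": [],
    }
    for post in posts:
        feed["items"].append({
            "id": post["url"],
            "url": post["url"],
            "title": post["title"],
            "content_html": post["html"],
            "date_published": post["date"].isoformat(),
        })
    # The json module handles quoting and escaping, and raises a
    # TypeError if it is handed something it cannot serialise.
    return json.dumps(feed, indent=2, ensure_ascii=False)


# Made-up example data
example_posts = [
    {
        "url": "https://example.com/2017/05/json-feed/",
        "title": "Café notes & “smart” quotes",
        "html": "<p>No manual escaping needed.</p>",
        "date": datetime(2017, 5, 20, 12, 0, tzinfo=timezone.utc),
    },
]
feed_text = build_json_feed(
    "Example blog",
    "https://example.com/",
    "https://example.com/feed.json",
    example_posts,
)
```

Non-ASCII characters and ampersands in the title need no special treatment, and passing in an unsupported type (a set, say) fails loudly instead of writing out a broken file.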
Diamond Geezer asks: “Why do we never end up in the middle?”
It’s unfair to pick on him, but I will because he posted on a day when my annoyance at centrist liberals has well and truly peaked.
First off, the “centre ground” is a concept that is entirely relative. When Jeremy Corbyn campaigned to become and was elected leader of the Labour Party in 2015, he managed to shift the centre ground — the Tories very quickly ditched a plan to bomb the Syrian government.
The centre ground is inherently unstable because it only exists relative to the two dominant forces either side. At our present moment that’s a fairly right-wing Conservative Party and a reasonably social democratic Labour Party. Any “centrist” must define themselves in opposition to their closest opponents on the left and right.
Ultimately if you do that it means you have no principles, nothing that anchors you on the left-right axis. In reality — as much as we joke about spineless politicians — few define their positions in this way and instead the “centre” in various countries is the home to a party that has “right-wing” economic policies and “left-wing” social policies. In Britain that would be the Liberal Democrats, despite Tim Farron’s recent attempts to win over the homophobes.
Left and right are in scare quotes above because this shows the point at which the left-right axis breaks down.
Ultimately the idea of centrism is bankrupt. Politics is a clash of interests. The ideas of the “centre ground,” of the “national interest,” are rubbish. Howard Zinn put it best in his People’s History of the United States:
Nations are not communities and never have been. The history of any country, presented as the history of a family, conceals the fierce conflicts of interest (sometimes exploding, often repressed) between conquerors and conquered, masters and slaves, capitalists and workers, dominators and dominated in race and sex.
As a socialist, to use our compromised axis, I see the boss class sitting on the right and the workers on the left. Given that the boss class is but a tiny sliver of the population, what credibility does a “centrist” party have, one that pretends to balance the desires of the exploited and the exploiters?
It is this “refreshing centrism” that irks me the most, as it is always right-wing economic policies paired with some ameliorating factor — support for gay marriage, say — to assuage the liberals.
But if you’re gay, does being able to officially consecrate your relationship make up for the fact that you spend half your wages on rent?
This has run on, so let’s talk about Emmanuel Macron. The Guardian loves him, noting (without the expected contradicting clause) that it “is tempting … to conclude that European liberal values have successfully rallied to stop another lurch to the racist right.”
And so Macron, an explicit neoliberal, is raised up having defeated (we’ll see) the fascist Marine Le Pen.
The celebration is of liberal values, embodied by Macron. But Macron’s liberal values go a long way to explain the surge in support for France’s fascist National Front, as Cole Stangler shows. His liberal values are likely to increase “unemployment, inequality and poverty” through his right-wing economic policies — along the lines of the French law that bears his name (loi Macron) and hacked away at workers’ rights.
The assault on workers’ rights and public services has been ongoing for nearly 40 years yet liberals and centrists deride the term that describes our current phase: neoliberalism.
The refusal to recognise this trend puts us in a position where the Guardian celebrates the likely victory of Macron, cheering his defeat of the fascists in blissful ignorance. But his political current is the reason why we have ended up with the fascists contesting the second round of the French presidential election (again).
Faced with falling employment and living standards for four decades and (generally) abandoned by the organised left, people have turned to those who promise to take action to improve their material conditions.
Yet Macron’s policies will just exacerbate these problems. This isn’t the end of the fascist challenge in France; should Macron win and pursue his neoliberal programme we could well be in the same situation in five years’ time.
(Unless, potentially, the French left organises a strong anti-fascist campaign like that waged in Britain from the 1970s to the present time, in which the fascists have more or less been suffocated.)
This isn’t a “bold break with the past,” it is the continuation of the rule of the boss class with a fresh coat of paint.
At work we deal a lot with PDFs, both press quality and low-quality for viewing on screen. Over time I’ve automated a fair amount of the creation for both types, but one thing that I haven’t yet done is automate file-size reductions for the low-quality PDFs.
(We still use InDesign CS4 at work, so bear in mind that some or all of the below may not apply to more recent versions.)
It’s interesting to look at exactly what is making the files large enough to require slimming down in the first place. All our low-quality PDFs are exported from InDesign with the built-in “Smallest file size” preset, but the sizes are usually around 700kB for single tabloid-sized, image-sparse pages.
Let’s take Tuesday’s arts page as our example. It’s pretty basic: two small images and a medium-sized one, two drop shadows, one transparency and a fair amount of text. (That line of undermatter in the lead article was corrected before we went to print.)
But exporting using InDesign’s lowest-quality PDF preset creates a 715kB file. The images are small and rendered at a low DPI, so they’re not inflating the file.
Thankfully you can have a poke around PDF files with your favourite text editor (BBEdit, obviously). You’ll find a lot of “garbage” text, which I imagine is chunks of binary data, but there’s plenty of plain text you can read. The big chunks tend to be metadata. Here’s part of the first metadata block in the PDF file for the arts page:
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP […]">
… Blah blah blah exif data etc …
Which is the none-too-exciting block for one of the images, a Photoshop file. There’s two more like this, roughly 50–100 lines each. Then we hit a chunk which describes the InDesign file itself, with this giveaway line:
<xmp:CreatorTool>Adobe InDesign CS4 (6.0.6)</xmp:CreatorTool>
So what, right? InDesign includes some document and image metadata when it exports a PDF. Sure, yeah. I mean, the metadata blocks for the images weren’t too long, and this is just about their container.
Except this InDesign metadata block is 53,895 lines long in a file that’s 86,585 lines long. 574,543 characters of the document’s 714,626 — 80% of the file.
I think it’s safe to say we’ve found our culprit. But what’s going on in those 54,000 lines? Well, mostly this:
<stEvt:instanceID>xmp.iid:[… hex ID …]</stEvt:instanceID>
<stEvt:softwareAgent>Adobe InDesign 6.0</stEvt:softwareAgent>
<stEvt:instanceID>xmp.iid:[… hex ID …]</stEvt:instanceID>
<stEvt:softwareAgent>Adobe InDesign 6.0</stEvt:softwareAgent>
<!-- 1,287 more list items -->
It’s effectively a record of every time the document was saved. But if you look at the stEvt:when tag you’ll notice the first items are from 2012 — when our “master” InDesign file from which we derive our edition files was first created. So, the whole record of that master file is included in every InDesign file we use, and the PDFs we create from them.
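If you would rather not count lines in a text editor, the share of a file taken up by its metadata can be estimated with a few lines of Python. This is a rough sketch, assuming the XMP packets are stored uncompressed (as they were in the file I looked at) and measuring bytes rather than characters. The function names are my own.

```python
import re

# Matches one XMP packet, from its opening tag to its closing tag
XMP_BLOCK = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def xmp_block_sizes(pdf_bytes):
    """Return the size in bytes of each XMP block, in file order."""
    return [len(match.group(0)) for match in XMP_BLOCK.finditer(pdf_bytes)]


def largest_block_share(pdf_bytes):
    """Return the largest XMP block's size as a fraction of the file."""
    sizes = xmp_block_sizes(pdf_bytes)
    if not sizes:
        return 0.0
    return max(sizes) / len(pdf_bytes)


# Usage with a real file (the file name here is hypothetical):
#     with open("arts-page.pdf", "rb") as pdf:
#         print(largest_block_share(pdf.read()))

# The figures quoted above work out like this:
share = 574_543 / 714_626   # just over 0.80
```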
Can we remove this metadata from InDesign? You can see it in
, select it and press the rubbish bin icon. Save, quit, reopen and… it’s still there.
Thankfully Acrobat can remove this stuff from your final PDF, by going through the “PDF Optimizer” or “Save Optimized PDF” or whatever menu item it’s hiding under these days. (In the “Audit Space Usage” window it corresponds to the “Document Overhead”.)
Our solution at work has been to cut the cruft from the PDF using Acrobat when we use it to combine our separate page PDFs by hand. But ultimately I want to automate the whole process of exporting the PDFs, stitching them together in order, and reducing the file size.
After using ghostscript for our automatic barcode creation, I twigged that it would be useful for processing the PDFs after creation, and sure enough you can use it to slim down PDFs. Here’s an example command:
gs -sDEVICE=pdfwrite \
   -dNOPAUSE -dQUIET -dBATCH \
   -dPDFSETTINGS=/screen \
   -sOutputFile=output.pdf input.pdf
Most of that is ghostscript boilerplate (it’s not exactly the friendliest tool to use), but the important option is -dPDFSETTINGS=/screen which, according to one page of the sprawling docs, is a predefined Adobe Distiller setting.
Using it on our 715kB example spits out a 123kB PDF that is visually identical apart from mangling the drop shadows (which I think can be solved by changing the transparency flattening settings when the PDF is exported from InDesign).
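Since the end goal is automation, here is a sketch of how that ghostscript call might be wrapped in Python. The function names and file names are hypothetical, and it assumes gs is on your PATH. The flags are the ones discussed above plus -sOutputFile, which tells ghostscript where to write the result.

```python
import subprocess


def ghostscript_command(source, destination, preset="/screen"):
    """Build the argument list for shrinking a PDF with ghostscript."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-dPDFSETTINGS={preset}",
        f"-sOutputFile={destination}",
        source,
    ]


def shrink_pdf(source, destination):
    """Run ghostscript, raising CalledProcessError if it fails."""
    subprocess.run(ghostscript_command(source, destination), check=True)


# Hypothetical file names, just to show the resulting command
command = ghostscript_command("arts-page.pdf", "arts-page-small.pdf")
```

Building the argument list separately from running it makes it easy to check the command before letting it loose on a folder of pages.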
In my previous post about page speed, I mentioned that I’d written my own site generator. I’m not quite ready to talk specifically about it — I want to write some documentation first — and really I doubt that anyone but me should be using it.
But, having set up publishing to Amazon S3 today, I wanted to write up how I publish this blog to multiple places so that it’ll be around whatever (within reason) might happen.
Majestic’s configuration files are set up in such a way that you have a default settings file in a directory (settings.json) and you can specify others that make adjustments to that.
In my case the main settings file contains the configuration for publishing to my own server (hosted at Linode) — not the nitty gritty of how to get it on to the server, but what the URLs, site title, etc should be. (It’s online if you want to have a nose around.)
Then I have two extra JSON files, robjwells.github.io.json and s3.robjwells.com.json, which contain the customisations for publishing for those domains. Here’s the config for GitHub in full:
"title": "Primary Unit mirror on GitHub",
"description": "A mirror of https://www.robjwells.com hosted on GitHub"
"output root": "gh-pages"
site.url is important because of the way my templates render article links (though my markdown source contains only relative links that work anywhere). And paths.output root just specifies the build directory where the HTML files get written.
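The idea of a default file plus small override files can be sketched as a recursive dictionary merge. This is an illustration of the general technique, not Majestic’s actual implementation. The function name and every value below (the example.com URLs and the default output directory) are assumptions; only the site and paths section names come from the settings described above.

```python
def merge_settings(defaults, overrides):
    """Return a new dict with `overrides` layered on top of `defaults`.

    Nested dicts are merged key by key, and anything else is replaced.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-ins for settings.json and a per-host override file
default_settings = {
    "site": {
        "url": "https://example.com",
        "title": "Example blog",
        "description": "The main site",
    },
    "paths": {"output root": "site"},
}
mirror_overrides = {
    "site": {
        "url": "https://mirror.example.com",
        "title": "Example blog mirror",
    },
    "paths": {"output root": "gh-pages"},
}
mirror_settings = merge_settings(default_settings, mirror_overrides)
```

Anything the override file leaves out (the description here) falls through from the defaults, which is why the per-host files can stay so short.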
All the moving parts are contained in a makefile which can build all three of my destinations. Here it is in full:
NOW = $(shell date +'%Y-%m-%d %H:%M')
DISTID = $(shell cat cloudfront-distribution-id)
rsync -zv -e ssh www.robjwells.com.conf
rsync -azv --delete -e ssh site/
cd gh-pages ; git add . ; git commit -m "$(NOW)" ; git push
aws s3 sync s3 s3://s3.robjwells.com --delete
aws cloudfront create-invalidation
all: robjwells github aws
force-all: force-robjwells force-github force-aws
majestic --settings=robjwells.github.io.json --force-write
majestic --settings=s3.robjwells.com.json --force-write
(The force-* options rebuild the whole site, not just files which have changed.)
And, really, that’s all it takes to publish to multiple hosts (once you’re set up at each one, of course).
My own server is just a vanilla rsync command, with an extra one because I keep my Nginx server config locally too.
For GitHub pages the gh-pages folder is a git repository, so make github regenerates the site into that folder, commits the changes with a timestamp as the message, and pushes the changes to GitHub. (It’s all on the same line with semicolons because the cd into the directory doesn’t hold across lines in the makefile.) Because the GitHub repository is set up to publish, the rest is sorted out on their end.
And for S3 I just use the official AWS tool (brew install awscli if you’re on macOS). The CloudFront line is there because I use CloudFront to speed up the S3 version and I want to make sure an updated front page is available reasonably quickly, if not anything else.
There’s a bit of overhead setting all of these up but once you do it doesn’t have to be any more work to keep each host updated. For me it’s just a make all away.