I’ve written previously about how the archives of my blog were less full than they should be – that, between domain changes, server/CMS moves, and times when I simply didn’t care, there were potentially hundreds of posts missing from the early years in particular.
Back up your crap, people – including your blog.
For the last couple of years I’ve had an on-off project to restore as much of this personal history as possible. Every so often I’d go ferreting through old hard disks, or exploring the Internet Archive’s Wayback Machine for old content I could salvage. At first I had limited success, turning up only a handful of posts. Of those, I was fussy and only restored the “worthwhile” posts – usually longer posts about big events, or technical in nature.
This last weekend though, I revised my stance on this. If I was going to recreate my blogging history, I couldn’t – shouldn’t – just cherry-pick. I should include as much as I could possibly recover: the good, the bad, the plain inane. Anything less would feel a bit dishonest, and undermine the raison d’etre of the whole endeavour: saving the past.
The only exceptions would be posts so incomplete due to missing assets (mainly images) that the body text made no sense, or posts that were completely unintelligible outside the context of the original blog – entries about downtime, for example. Also excluded was my personal pet peeve – posts “apologising” for the time between updates[1]!
A Brief Synopsis of the “How”:
To bring the past kicking and screaming into the present, I dove back into the Wayback Machine, going as far back on my first domain as I could. From there I worked as methodically as I could, starting with the oldest posts and moving forwards, post-by-post. The basic process was:
- Copy the post text and title to the WordPress new post screen
- Adjust the post date to match the original
- Where possible, match the original publishing time. Where this wasn’t available, approximate based on context (mentions of morning/afternoon/evening, number of other posts that day, etc)
- Check any links in the post (see below)
- Add any recovered assets – which was rare
- Turn off WordPress social sharing
I started on the Friday afternoon, and manually “imported” around 50 posts in the first batch.
Turning off social sharing was done so I didn’t flood my Twitter followers with a whole load of links to the posts – some over a decade old. One thing I didn’t anticipate though, and which I had zero control over, was WordPress emailing the old posts to those who had subscribed to email notifications. It wasn’t until a friend IM’d me about her full inbox that I realised what was happening – so if you found your mail filled with notifications as a result of this exercise, I apologise!
To get around this, I ended up creating a new, private WordPress blog to perform the initial manual process, so I could later export a file to import into this blog.
Between Saturday, Sunday, and Monday evenings, I tracked down and copied over a further 125 or so posts. Due to the vagaries of the Wayback Machine, not every post could be recovered. Generally speaking, it was reliable in having a copy of the first page of an archive section, but no further pages. Sometimes I could access “permalink” pages for the other posts, but this was really hit-or-miss. A lot of the time the page the WBM had “saved” was a 404 page from one of my many blog reorganisations over the years, or in other cases, it would have maybe one post out of eight.
I made a rule not to change the original posts in any way – no fixing typos or correcting something I was wrong about. The only thing I would do was mark a missing asset with an “Editor’s Note” of some sort, when appropriate. The only content I did have to consider changing was the links.
Dealing with Links
One thing I had to consider was what to do about links which might have changed or disappeared over time. When copying from the WBM, links had already been rewritten to point to a (potentially non-existent) WBM archive page, but if the original still existed, I wanted to point to that instead. In the end I had to check pretty much every link by hand – if the original existed, I would point to that page; if not, I would take a chance with the Wayback Machine. In some cases the page still existed, but had different content or a different purpose to the original. I dealt with these on a case-by-case basis.
For internal links, I pointed to an imported version, if it existed, or removed it if there was none and context allowed.
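I did all of this by hand, but the external-link check described above could be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function names (`original_url`, `is_live`, `choose_link`) are made up for the example, and it assumes the rewritten links follow the usual `web.archive.org/web/<timestamp>/<original URL>` pattern.

```python
import re
from http.client import HTTPException
from typing import Callable, Optional
from urllib.request import Request, urlopen

# Assumed shape of a rewritten Wayback Machine link:
#   https://web.archive.org/web/20050312083000/http://example.com/page
# optionally with a two-letter modifier such as "id_" after the timestamp.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/(?P<original>.+)$"
)


def original_url(link: str) -> Optional[str]:
    """Return the original URL wrapped inside a Wayback link, or None."""
    match = WAYBACK_RE.match(link)
    return match.group("original") if match else None


def is_live(url: str, timeout: float = 10.0) -> bool:
    """Rough liveness check: does a HEAD request succeed without an error?"""
    try:
        request = Request(url, method="HEAD",
                          headers={"User-Agent": "link-check-sketch"})
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, HTTPException):
        # Covers HTTP errors (404 etc.), DNS failures, timeouts, bad URLs.
        return False


def choose_link(link: str, check: Callable[[str], bool] = is_live) -> str:
    """Prefer the original page if it still exists, else keep the archive link."""
    original = original_url(link)
    if original is None:
        return link  # not a Wayback link, so leave it alone
    return original if check(original) else link
```

A check like this only tells you that something still answers at the old address. It can’t tell you whether the page still has the same content or purpose, so those cases would still need looking at by eye.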
In total, I imported around 175 previously “lost” blog entries, covering 2002-2006, with the majority from 2005. These years have gone from having a handful of entries to having dozens. Overall, this has grown the archives by roughly 50% – a not insubstantial amount!
At some point I will go back and appropriately tag them all, but that’s a lower priority job for another time.
2007-2010 were years when my writing output dropped a lot, so while I will look for missing entries from this period, I don’t expect to find many at all.
Side Note: History Repeats
I discovered, in the process of doing all this, that I had gone through the same exercise before, roughly 10 years ago!
Over the last few days, I’ve been working on the archives of my old site; cleaning and recategorising them. Today, I have added them to the archives of Pixel Meadow.
These additions represent everything that was left of ChrisMcLeod.Net. Over the course of its life many changes occured and data was lost – so these additions don’t represent everything that I’ve written there over the years.
You would think I might have learned from this mistake back then, but obviously not! Fingers crossed it’s finally sunk in.
[1] Though only where they had no other content to the post.