Warning: what follows, besides discussing 9/11, is also kind of a nerdy/geeky/technical discussion about how web pages link to each other and an idea for how to make the links between pages, especially pages that may disappear some day, work better. Maybe.
Today is Patriot Day, a “national day of service and remembrance”. Because it’s also the 14th anniversary of 9/11, I ended up reviewing my collection of 9/11 “stuff” – something I started on 9/11 and continued collecting for a few days after those events. It helped to process things a little bit, I think.
Recently, Dave Winer has been discussing, among other things, the “future-safety” of the internet:
The concern is that the record we’re creating is fragile and ephemeral, so that to historians of the future, the period of innovation where we moved our intellectual presence from physical to electronic media will be a blank spot, with almost none of it persisting.
While reviewing my collection, I realized a possible reason his piece has been percolating in the back of my mind was, in fact, this same collection. Why? Take a look – the images I’m hosting myself, since it’s not a big deal (bandwidth-wise or effort-wise). The links to other sites? That’s where it falls apart.
Some of the sites are still there – and one or two of the links I had still work. That’s awesome – someone thought ahead, or took the time, when they re-did the website, to make sure that the old content was still accessible.
Other links go to sites that still work, but the “layout” of the website – its URLs and/or URIs (Uniform Resource Identifiers) – has changed and no one took the time to make sure the old content was still easily accessible. For a couple of those, I was able to find the article on the site at its new address, so I updated the link.
Then there are two other cases left to deal with: the website is gone, or the link that I have uses a URL click-tracker service that is no more. In the case of the website being gone, I can try to use the Internet Archive (or “Wayback Machine”) to try to find the article and then figure out what to do – I could link to the archive’s version, but I decided to take that snapshot and copy it to my own server – I can’t necessarily rely on the archive to be there forever, can I? Maybe, maybe not.
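Here's a rough sketch of how the "check the Wayback Machine" step could be scripted. It assumes the Internet Archive's availability lookup at archive.org/wayback/available and the JSON shape it returns (an "archived_snapshots" object with an optional "closest" entry); the function names are made up for illustration, and the sample response in the usage note is hand-written, not real archive data.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the Internet Archive's availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the lookup URL for a page, optionally near a YYYYMMDD timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived copy's URL from a lookup response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

To use it, fetch the address from availability_query() with something like urllib.request.urlopen and pass the body to closest_snapshot(). A response with an empty "archived_snapshots" object means no copy was found, which is the case where the content may really be gone.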
In the case of the URL tracker, well, that’s going to mean some work. I can try to see if the article is available by title, but my search just now for “World reacts to calamity” returned lots of results, but none of them seem to be on the C|Net website – which is apparently either where I got the link in the first place, or where the article was hosted. That’s not helpful at all.
So what can be done? For starters, encourage the discussion. I went to Winer’s site and posted a comment:
I’ve been mulling this all over, and then realized why today. On 9/11/01, I was collecting links of things relevant to what was going on, but I only had links to the pages. I went back to my collection today, and a lot of the stuff is gone – possibly forever? I went and used the internet archive where I could for some things just now, but a lot of the content seems to be lost – especially due to click-tracking links used at the time. If only I knew then what I know now…..
Dave was quick to reply:
Yes. Today is a very good day to be thinking about that. I should write a blog post. Thanks for pointing this out.
And then he wrote a quick little piece about it: A good day to think about web history. And he has the EXACT same problem: links from that day on his own site just aren’t working.
I’ve tried to sound the alarms. Every day we lose more of the history of the web. Every day is an opportunity to act to make sure we don’t lose more of it. And we should be putting systems into place to be more sure we don’t lose future history.
There’s a solution in there somewhere, that’s for sure. For one thing, you have Google, which indexes every page ever if it’s allowed to. But that’s only part of the equation – finding the data. But how? What are we going to look for? And, more importantly, where are we going to look? If a server goes offline, that data is gone unless it’s in the archive (which isn’t free) or someone decides to mirror it (also not free). But how to make it easy to find? With some content, when you’re searching by title for example, you might find multiple sites with similarly titled articles – then you have to sort the wheat from the chaff.
Is there a better way? Maybe. Off the top of my head, we need to do a little more on the backend. But what?
Mark the pages somehow with a UUID (Universally Unique Identifier). For example, it could be an SHA1 hash of data from the page – maybe the hostname as the first part, then the time, date, and article title:
Future proof websites?
That gets turned into: d820eab50a74ad6c0c08566b210454848a573dcf-29b6082b508b593c8de53988ef3d2b14b327664b. What do we do with that? Ideally, it’s auto-generated and then put into the META data of this web page. Then, when you link to my page, the browser pulls that out of the META data (if it’s available) and adds that to the link – so instead of the first link below, you’d get the second:
<a href="http://agerstein.net/2015/09/11/future-proof-websites/">Future Proof Websites?</a>
<a href="http://agerstein.net/2015/09/11/future-proof-websites/" webprint="d820eab50a74ad6c0c08566b210454848a573dcf-29b6082b508b593c8de53988ef3d2b14b327664b">Future Proof Websites?</a>
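A minimal sketch of how a webprint like that could be generated, assuming two SHA1 hashes joined by a hyphen: one of the hostname, one of the date plus the article title. The exact input format, the "|" separator, and the function names are my assumptions here, so this is not expected to reproduce the hash shown above.

```python
import hashlib
import html


def sha1_hex(text):
    """Return the 40-character hex SHA1 digest of a string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_webprint(hostname, published, title):
    """Join a hash of the hostname and a hash of date + title with a hyphen."""
    host_part = sha1_hex(hostname.lower())  # same for every page on a site
    page_part = sha1_hex(f"{published}|{title}")  # "|" separator is an assumption
    return f"{host_part}-{page_part}"


def link_with_webprint(url, text, webprint):
    """Build an anchor tag carrying the (hypothetical) webprint attribute."""
    return (
        f'<a href="{html.escape(url, quote=True)}" '
        f'webprint="{webprint}">{html.escape(text)}</a>'
    )


webprint = make_webprint("agerstein.net", "2015-09-11", "Future Proof Websites?")
print(link_with_webprint(
    "http://agerstein.net/2015/09/11/future-proof-websites/",
    "Future Proof Websites?",
    webprint,
))
```

Lowercasing the hostname before hashing means the first half stays the same no matter how the hostname is capitalized in a link.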
If you copy/paste the link for an email, or to put on Facebook/Twitter/your blog/whatever, it copies that “webprint” into the link – and if the content goes down for some reason – maybe I die and my website goes away – then a search for the webprint would make it easier to find cached/mirrored copies of the data, since the ID would theoretically go along in the cache/mirror as part of the META data in the pages.
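The lookup side might work something like this sketch: pull the webprint out of each cached or mirrored page's META data and keep the ones that match. The meta tag name "webprint", the function names, and the dictionary of cached pages are all hypothetical, since no such standard exists yet.

```python
from html.parser import HTMLParser


class WebprintExtractor(HTMLParser):
    """Remember the content of the first <meta name="webprint"> tag seen."""

    def __init__(self):
        super().__init__()
        self.webprint = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.webprint is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "webprint":
            self.webprint = attr_map.get("content")


def extract_webprint(page_html):
    """Return the webprint stored in a page's META data, or None."""
    parser = WebprintExtractor()
    parser.feed(page_html)
    return parser.webprint


def find_copies(webprint, cached_pages):
    """List the locations whose cached HTML carries the given webprint.

    cached_pages maps a location (URL or file path) to that page's HTML.
    """
    return [
        location
        for location, page_html in cached_pages.items()
        if extract_webprint(page_html) == webprint
    ]
```

A search engine or archive would presumably index the webprint directly instead of scanning pages one by one, but the matching rule would be the same.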
Clearly we would need to use something better/longer than what I have here, since it’s only 81 characters long. That seems like a lot, but we’re in the process of running out of IPv4 addresses and moving everything over to IPv6 – and we didn’t think we’d run out of IPv4 addresses for quite some time back when I got into the computer game.
By having the hostname be the first part of the hash, we reduce the odds of a clash – you could, theoretically, have the same second hash as another site, but what are the odds that they would have the same first hash? Essentially zero, unless the site stole your name somehow.
All of this is moot, however, without some longevity built into the hosting. One of Winer’s bigger concerns is that sites like Facebook/Twitter/etc seem to have different rules about what counts as a “post” – Tweets don’t have titles, nor do Facebook status updates. But could they? Should they? Things like this mean it’s not as easy to just move your data from one hosting solution to another. You can pull your content from Twitter, but you can’t exactly upload it to Facebook and have it work. You can pull your data from Facebook, but there seems to be so much info available to you – like what advertisements you’ve clicked – that I think you might suffer from overload trying to figure out what to move.
I agree that there should be a standard for this data – and you, as an author/content provider/social media user, should be able to take the data from one service to another. And it should be easy – like just download from one service, suspend your account there, then upload to another and keep going, deleting your prior account when you feel comfortable. But that’s not how things are set up right now. Silos, it would seem, are another part of the problem. But there’s a way around that – host it yourself. But then we get to the rub there: what if you die? What if the web server dies? How do you perpetuate your online self after you pass on?