Will broken links harm academic research?

This piece stems from a Twitter conversation which I’ve pieced together for completeness.


Recently at the University of Greenwich, we’ve been working on reducing our huge backlog of broken links.

For sites as large as universities, these can be caused by a range of factors. Some can be built into templates, meaning the same breakage appears on hundreds of pages. Some can be caused by changes to internal linking systems. Some breakages can just be put down to poor maintenance.
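One way to tell a template-level breakage from a one-off is to group a broken-link report by the link target and count how many pages each one appears on. The snippet below is only a rough sketch of that idea: the `report` data and the function name `group_broken_links` are made up for illustration and do not reflect any tooling we actually use.

```python
from collections import defaultdict


def group_broken_links(report):
    """Group (page, broken_url) pairs by broken URL, most widespread first."""
    pages_by_target = defaultdict(set)
    for page, broken_url in report:
        pages_by_target[broken_url].add(page)
    # Sort by number of affected pages (descending), then by URL for stable output.
    return sorted(pages_by_target.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical crawl output: (page the link sits on, URL that returned a 404).
report = [
    ("/study/a", "/old-prospectus"),
    ("/study/b", "/old-prospectus"),
    ("/study/c", "/old-prospectus"),
    ("/staff/x", "https://publisher.example/paper-1"),
]

ranked = group_broken_links(report)
for target, pages in ranked:
    print(f"{target}: broken on {len(pages)} page(s)")
```

A target that is broken on many pages at once is a good candidate for a template fix, while a target broken on a single page is more likely to need individual attention.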

Most links are within our control to fix, but some sources of links are not. Most of my university’s 404 breakage list now comes from our library’s Academic Literature Archive (GALA), which feeds into our staff webpages. This means that links to academic publications and citations are breaking.

Some of these links rely on a Digital Object Identifier (DOI), a unique code for digital publications. I’ve always had my doubts about the permanence that DOIs are assumed to have. One university describes them as ‘like a Social Security number’, as if they cannot break. But they can, and they do. And what about links that are not DOIs?

“Decay of the modern web”

Looking into this further led me to a fascinating piece by Jonathan Zittrain, which explains how broken links (link rot) are becoming a weakness in scholarship and a threat to the permanence of knowledge.

Zittrain cites a striking example. Of a sample of 3.5 million scholarly articles in STEM disciplines, the percentage of links that drifted from their originally intended source increased from 20% in 2014 to 75% in 2016. That’s a staggering rise in such a short time.

“People tend to overlook the decay of the modern web, when in fact these numbers are extraordinary,” he says. “They represent a comprehensive breakdown in the chain of custody for facts.

“Some information sticks around when it shouldn’t, while other information vanishes when it should remain.”

It goes without saying that some of these now-broken links pointed to ephemeral material – news or commentary that has simply been discarded by its author or publisher. But how will we know how important something is until we discover that the citation has gone and the Wayback Machine returns no results?

England's Fortress: Perspectives on Thomas, 3rd Lord Fairfax (Ashgate, 2014)

I’ve seen evidence of this shift first-hand. My first major publication was a chapter in a volume called England’s Fortress, published by Ashgate in 2014. Within a year of this, Ashgate had been taken over by Taylor & Francis and joined Routledge. The digital references shift; the merry-go-round begins.

Will anyone volunteer to tidy that mess up at Greenwich, then? It’s not just a messy one-off job. As this shows, keeping academic citations up to date takes regular maintenance.

End of the book?

In 2012, a conversation between Umberto Eco and Jean-Claude Carrière was published under the title This is Not the End of the Book. It makes several arguments, from continuity to longevity, for why printed texts must continue in the face of digital opposition.

But much has changed in a decade, and Zittrain’s article provides an interesting postscript to this debate. After all, publication is a commercial enterprise, and what makes for good sense does not always make for good business.

Zittrain worries that we’re now moving resources away from print permanently while this problem of insecure digital infrastructure continues to grow.

“Surely anything truly worth keeping for the ages would still be published as a book or an article in a scholarly journal, making it accessible to today’s libraries, and preservable in the same way as before? Alas, no,” he says.

“The incentives for creating paper counterparts, and storing them in the traditional ways, declined slowly at first and have since plummeted… Both publisher and consumer – and libraries that act in the long term on behalf of their consumer patrons – see digital as the primary vehicle for access, and paper copies are deprecated.”

Shakespeare 400 First Folio, Senate House Library

Power shift to publishers

This mention of libraries got me thinking again. As I see it, the academy has two issues to acknowledge and deal with.

The first is readiness. Are we ready to leave print behind? Systems to organise books began millennia ago and have developed into structures of great continuity. Dewey Decimal and Library of Congress, two of the main classification systems used in university libraries, were devised in the nineteenth century and are still used today. You can walk into almost any university library and know roughly where to find something.

No such structure exists for the web, even though we’re placing increasingly heavy reliance on it for bibliographical purposes. Search engines are a start, but then you might hit institutional paywalls, Shibboleth authentication, and other layers, some of which require users to navigate in via their ‘home’ library first. It’s a surprisingly messy replacement for a physical turnstile into a building.

We’re currently using DOIs, which makes sense, but a DOI is a tool, not a system. When the system doesn’t work, the tools become pointless. Publishers should (and do) use 301 redirects, but these rarely go deep enough to match each moved source with its new home, and they shouldn’t be relied upon as a long-term solution. Moreover, because a lot of resources are behind paywalls, we’re slow to notice faults when they arise. How many people will notice that something is lost before the moment they need it, when it’s already too late?
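To make the redirect problem concrete, here is a minimal sketch of how one might trace where a DOI-style link actually ends up. Everything in it is an assumption for illustration: the function names, the `.example` URLs and the `FAKE_WEB` lookup table are invented, and the `fetch` argument stands in for a real HTTP client so the example runs without touching the network.

```python
from urllib.parse import urljoin, urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(url, fetch, max_hops=10):
    """Follow redirects from url and return (chain_of_urls, final_status).

    fetch is any callable that takes a URL and returns
    (status_code, location_header_or_None).
    final_status is None if a loop or too many hops is detected.
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return chain, status
        url = urljoin(url, location)  # Location headers may be relative
        if url in chain:
            return chain + [url], None  # redirect loop
        chain.append(url)
    return chain, None  # gave up after max_hops


def ends_at_site_root(chain):
    """True if a redirected link landed on a bare homepage, not a specific page."""
    return len(chain) > 1 and urlparse(chain[-1]).path in ("", "/")


# Hypothetical stand-in for the web: URL -> (status, Location header).
FAKE_WEB = {
    "https://doi.example/10.1234/abc": (302, "https://old-publisher.example/book/abc"),
    "https://old-publisher.example/book/abc": (301, "https://new-publisher.example/"),
    "https://new-publisher.example/": (200, None),
    "https://doi.example/10.1234/xyz": (302, "https://old-publisher.example/book/xyz"),
}


def fake_fetch(url):
    return FAKE_WEB.get(url, (404, None))


for start in ("https://doi.example/10.1234/abc", "https://doi.example/10.1234/xyz"):
    chain, status = trace_redirects(start, fake_fetch)
    print(" -> ".join(chain), "| status:", status, "| homepage only:", ends_at_site_root(chain))
```

The first made-up link "works" in the sense that it returns a page, but the redirect only goes as far as the new publisher's homepage, so the cited source is still effectively lost. The second breaks outright. A check that only looks for 404s would miss the first case, which is why redirects alone are not much of a safety net.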

The second issue to deal with is a shift of power. The move from print to digital signals a shift of power from libraries to publishers. When a library buys a book or a print journal, that library is responsible for how its users find it. It can use signage, a classification system, a catalogue, or a combination of the three. But when publishers independent of libraries are responsible for digital links, these can change overnight – and huge numbers of assets will be affected.

Imagine if publishers were able to pluck books and print journals off shelves at any time and put them somewhere else – sometimes without realising they are doing it. It’s not a malicious activity, but it can cause a lot of disruption.

Book shelving / stacks in a library. Photo by Olga Lioncat on Pexels.com

What can we do?

It’s a great shame neither Eco nor Carrière is still with us (Eco passing in 2016, Carrière earlier this year), but I’d like to see more scholars in the book history field taking up this debate.

WorldCat is possibly the best and most reliable attempt to stay up to date that I’ve seen so far, but I’d be surprised if it can sustain its place and the maintenance behind it.

I wonder if copyright libraries need to be given the resource to become a source of truth and get publishers to provide audits of digital published materials. Perhaps more of that happens than I realise, but how do we then allow other providers, such as universities, with different library cataloguing systems, to synchronise that information?

Ultimately, if we’re going to start relying entirely on digital assets, I think the least we need is a permanent coalition of computer scientists and digital humanities experts to keep an ongoing system of governance. Otherwise, it won’t be just a ‘breakdown’ caused by link breakage – it will become a void.

With thanks to Professor Andrew King, University of Greenwich – an ever-thoughtful sounding board.


Digital project lead. Web content specialist.