I’ve been working with and thinking about broken link checkers a lot lately. Here are some thoughts on what features link checkers should have, open questions, and what to do with broken links once you find them.
Features that should be standard, but aren’t
- Link checkers should ignore share links.
- Share links are pervasive on blogs and news sites. They show up as links on a page, but I don’t consider them “real” links. Their only job is to submit the page to a sharing service. Identify them and skip them.
- Here is an ignorelist to get you started: https://gist.github.com/cagrimmett/00200b47a9f5948d7906be154e7abd78
- Link checkers should ignore robots.txt on the sites they check.
- Twitter blocks most broken link checkers. Every Twitter URL on every site I’ve checked comes back as broken, and upon further investigation they are almost always working.
- Link checkers should follow redirects.
- Redirects are valid, functioning links!
- Link checkers should have robust support for services like YouTube and Vimeo.
- Videos that are missing on those services still return 200 status codes when checked, along with a message in the HTML that the video is not found. Most broken link checkers report those links as working, which is a false negative.
- Should we ignore comments? Most commenting systems allow links and also tend to link back to the comment author’s site. Those tend to be a significant source of broken links. Should those be left alone because they aren’t the site’s content, but rather user-generated content?
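Here is a rough sketch of how a few of the features above could fit together in Python. It is only an illustration: the share-link patterns are a small sample I wrote for this example (not the contents of the gist), the function names are hypothetical, and the "not found" marker text is something you would have to collect yourself for each video service.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

# Sample (host, path prefix) pairs for common share endpoints.
# Assumption: illustrative only, not copied from the gist above.
SHARE_PATTERNS = (
    ("twitter.com", "/intent/tweet"),
    ("twitter.com", "/share"),
    ("facebook.com", "/sharer"),
    ("linkedin.com", "/sharearticle"),
)


def is_share_link(url):
    """Return True if the URL looks like a 'share this page' endpoint."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.lower()
    return any(host == h and path.startswith(p) for h, p in SHARE_PATTERNS)


def fetch(url, timeout=10):
    """Fetch a URL and return (final status or None, body text).

    urllib follows redirects by default, so a 301 or 302 that ends in a
    200 is reported as 200.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read(200_000).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
    except urllib.error.URLError:
        return None, ""


def classify(status, body="", soft_404_markers=()):
    """Classify a checked link as 'ok' or 'broken'.

    soft_404_markers are caller-supplied phrases that signal a missing
    video on a page that still returns 200.
    """
    if status is None or status >= 400:
        return "broken"
    lowered = body.lower()
    if any(marker.lower() in lowered for marker in soft_404_markers):
        return "broken"
    return "ok"
```

A checker built on this would skip any URL where `is_share_link(url)` is true, then pass the result of `fetch(url)` to `classify` along with whatever marker phrases apply to that host.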
What to do with the broken links once you find them?
- I’m against changing historical content in a damaging way, so I do not support changing links.
- I do support appending additional helpful information, such as a link to a working version of the broken link on the Wayback Machine in a format like this:
Old link text with broken link (archive.org link)
- The Wayback Availability JSON API is a quick way to find if said links exist on the Wayback Machine.
- Ideally this would be built into the broken link checker, but if you only have the broken links in a CSV, here is a handy PHP script to check the Wayback Availability JSON API.
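If you would rather not use PHP, the same lookup can be sketched in Python. This is a minimal illustration, not a port of the script above. It assumes the broken URLs sit in the first column of the CSV and that you want the annotation as HTML; the function names are hypothetical. The endpoint and the `archived_snapshots` / `closest` response fields are the ones the Wayback Availability JSON API uses.

```python
import csv
import html
import io
import json
import urllib.request
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"


def availability_url(broken_url, timestamp=None):
    """Build the API request URL; timestamp (YYYYMMDD...) is optional."""
    params = {"url": broken_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Pull the closest available snapshot URL out of a parsed response."""
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    return closest.get("url") if closest.get("available") else None


def lookup(broken_url, timeout=10):
    """Query the API over the network and return a snapshot URL or None."""
    with urllib.request.urlopen(availability_url(broken_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


def urls_from_csv(text):
    """Assumption: the broken URL is in the first column of each row."""
    return [row[0] for row in csv.reader(io.StringIO(text)) if row and row[0]]


def annotate(link_text, broken_url, snapshot_url):
    """Keep the original link and append an archive.org link after it."""
    return '<a href="{}">{}</a> (<a href="{}">archive.org link</a>)'.format(
        html.escape(broken_url), html.escape(link_text), html.escape(snapshot_url)
    )
```

To use it, read the CSV, call `lookup` on each URL from `urls_from_csv`, and only annotate the links where a snapshot comes back. Links with no snapshot are left alone.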