Checklinks 1.0.1

...Yet Another HTML Link Checker




Actually, this has major features I couldn't find in other link-checkers, which is why I wrote it. In a nutshell: It's fast, and it supports SSI (SHTML files), the latest Web standards, directory aliases, and other server options.

March 26, 2000-- Version 1.0.1 is released, finally. It's a long-overdue bug fix release, nothing more. There are no new features. Here's a list of the most important bug fixes. For archival purposes, here's the old version 1.0.

Download Checklinks (gzipp'ed), or view the text of the script.

A link-checker checks the validity of all HTML links on a Web site. Most link-checkers can start at one or more "seed" HTML files and recursively test all URLs found at that site ("recursively" means that when it finds an HTML file, it searches that file for links, and so on). This program doesn't follow URLs at other sites, but it does check that they exist.

Among its unique (?) features, at least among free link checkers:

Other not-so-unique but useful features-- the user can:

Other options could easily be added, if there is demand.

It's free, it's in Perl 5, it's got decent comments, it was written with Apache in mind.

In fairness, it doesn't support all features of SSI, like conditionals, variables, and so forth. It does support the <!--#include...--> directive, which is enough for many situations.

Really, the world needs only one good link-checker. Someone should either take this one and add more features to it, or copy features from this into other link-checkers. Here's a list of other free link-checkers for Unix.


Installation and Documentation

Not much to it, really. Again, you need Perl 5. To install the script:

  1. Place the file somewhere in your path and make it executable (see the example after these steps). I name mine cl.
  2. For best results, edit the script to set $LOCAL_HOST and $DOCUMENT_ROOT.
  3. If desired, configure other options as commented in the source code. Otherwise, the default settings usually work fine.
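
For example, assuming the downloaded file is named checklinks.gz and ~/bin is in your path (adjust both to your own setup), steps 1 and 2 might look like:

gunzip checklinks.gz
cp checklinks ~/bin/cl
chmod +x ~/bin/cl
vi ~/bin/cl     # set $LOCAL_HOST and $DOCUMENT_ROOT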

Running it

Run the script from the command line; pass it a list of seed files or URLs. By default, it will follow all links on that server, and report all URLs with non-200 response codes. To control which URLs get checked, build inclusion and exclusion lists with the -I and -X options. To control which results get reported, build similar lists with the -i and -x options. See below for other options.

My most common use, which checks all URLs starting with the current index.html, is simply

cl .
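
You can also name one or more start files or URLs explicitly; the paths and hostname below are only placeholders:

cl /home/www/docs/index.html
cl http://www.example.com/products/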

For simple but sufficient instructions, run "cl -?". The output of this is:

To recursively check all HTML links on the local site, enter:
    cl  <options>  [ start-file | start-URL ] ...

Options include:
  -v                    Generate full (verbose) report, including full
                          referral information and detailed SSI error
                          reporting.
  -I <include-pattern>  Only check URLs matching <include-pattern>.
  -X <exclude-pattern>  Don't check URLs matching <exclude-pattern>.
  -i <include-status>   Only report URLs whose response code starts with
                          the pattern <include-status>.
  -x <exclude-status>   Don't report URLs whose response code starts with
                          the pattern <exclude-status>.  Default is to
                          exclude "200" responses only (i.e. "-x 200").
  -d <max-depth>        Traverse links no deeper than <max-depth>.
  -f                    "File mode"-- only read files from the filesystem;
                          do not go through the HTTP server at all.  This
                          will skip all URLs that point to CGI scripts.
  -h                    "HTTP mode"-- use HTTP to check ALL URLs, even
                          if they could be read from the filesystem.
                          Incompatible with "-f" option.
  -c <config-file>      Read appropriate configuration parameters from
                          <config-file>, typically srm.conf.  Use '-' to
                          read from STDIN.  If a directory is named, use
                          "srm.conf" in that directory.
  -q                    Print current configuration parameters.
  -?                    Print this help message.
  --                    End command-line option processing.

Don't stack options like "-vf"; use "-v -f" instead.

For -I, -X, -i, and -x:
  Values are interpreted as Perl 5 regular expressions.
  Use multiple options to build a list (e.g. "-I include1 -I include2").
  Use a value of '' to clear a list (e.g. -x '' means "report all responses",
    "-x '' -x 401" means "report all but 401 responses").
  As a special case, an empty -I or -i list implies no include-restrictions.
  If an item is in both the include and exclude list, it is excluded.
  Note that -I and -X restrict which URLs are traversed into, so they may
    prune large areas of your Web structure.

The output goes to STDOUT, so you may want to redirect it to a file. You may want to run it in the background, too.
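
For example, a typical long run might look like this (the report filename is only a suggestion):

cl . > linkreport.txt &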

Besides the standard HTTP response codes, this program generates some non-standard response codes for certain situations. They are:

If you absolutely must get results exactly as a user would see them, use the -h option. The program is reasonably bug-free, but there are slight differences between reading files directly and going through HTTP. The biggest difference is authentication: this program doesn't enforce it the way a server does, so it will read files that would otherwise be protected by access.conf. Also, this program doesn't mimic all server options.
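
For example, to force every check through the HTTP server, starting from the current directory's default file:

cl -h .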

Let me know if any incompatibility or missing feature keeps you from using this program.

More specific details, if needed:

If a start-file parameter points to a directory, then the default file (e.g. index.html) will be used.

With neither the -f nor the -h option, the program will load any URLs it can from the filesystem, and go through the HTTP server for those it can't. This is the normal mode.
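
For example, to stay entirely on the filesystem (and therefore skip any URLs that point to CGI scripts):

cl -f .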

The lists for -I, -X, -i, and -x match an item if it matches any pattern in the list, not all of them. For example, "-I include1 -I include2" will match any URL with either the string "include1" or the string "include2" in it; the URL doesn't need to match both.
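
As an illustration (the patterns are only examples), the following checks only URLs containing "/docs/" or "/manuals/", skips anything ending in ".cgi", and reports every response code except 401:

cl -I /docs/ -I /manuals/ -X '\.cgi$' -x '' -x 401 .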


Changes in 1.0.1



© 1998, 2000 James Marshall (comments encouraged)

Last Modified: March 26, 2000
http://www.jmarshall.com/tools/cl/