Checklinks 1.0.1

...Yet Another HTML Link Checker




Actually, this has major features I couldn't find in other link-checkers, which is why I wrote it. In a nutshell: It's fast, and it supports SSI (SHTML files), the latest Web standards, directory aliases, and other server options.

March 26, 2000-- Version 1.0.1 is released, finally. It's a long-overdue bug fix release, nothing more. There are no new features. Here's a list of the most important bug fixes. For archival purposes, here's the old version 1.0.

Download Checklinks (gzipp'ed), or view the text of the script.

A link-checker checks the validity of all HTML links on a Web site. Most link-checkers can start at one or more "seed" HTML files and recursively test all URLs found at that site ("recursively" means that when it finds an HTML file, it searches that file for links, and so on). This program doesn't follow URLs at other sites, but it does check that they exist.

Among its unique (?) features, at least among free link checkers:

Other not-so-unique but useful features-- the user can:

Other options could easily be added, if there is demand.

It's free, it's in Perl 5, it's got decent comments, it was written with Apache in mind.

In fairness, it doesn't support all features of SSI, like conditionals, variables, and so forth. It does support the <!--#include...--> directive, which is enough for many situations.

Really, the world needs only one good link-checker. Someone should either take this one and add more features to it, or copy features from this into other link-checkers. Here's a list of other free link-checkers for Unix.


Installation and Documentation

Not much to it, really. Again, you need Perl 5. To install the script:

  1. Place the file somewhere in your path and make it executable (see the example after these steps). I name mine cl.
  2. For best results, edit the script to set $LOCAL_HOST and $DOCUMENT_ROOT.
  3. If desired, configure other options as commented in the source code. Otherwise, the default settings usually work fine.
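
For example, assuming the downloaded file is named checklinks.gz and ~/bin is in your path (adjust both to your own setup), steps 1 and 2 might look like:

gunzip checklinks.gz
cp checklinks ~/bin/cl
chmod +x ~/bin/cl
vi ~/bin/cl     # set $LOCAL_HOST and $DOCUMENT_ROOT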

Running it

Run the script from the command line; pass it a list of seed files or URLs. By default, it will follow all links on that server, and report all URLs with non-200 response codes. To control which URLs get checked, build inclusion and exclusion lists with the -I and -X options. To control which results get reported, build similar lists with the -i and -x options. See below for other options.

My most common use, which checks all URLs starting with the current index.html, is simply

cl .
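
You can also name one or more start files or URLs explicitly; the paths and hostname below are only placeholders:

cl /home/www/docs/index.html
cl http://www.example.com/products/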

For simple but sufficient instructions, run "cl -?". The output of this is:

To recursively check all HTML links on the local site, enter:
    cl  <options>  [ start-file | start-URL ] ...

Options include:
  -v                    Generate full (verbose) report, including full
                          referral information and detailed SSI error
                          reporting.
  -I <include-pattern>  Only check URLs matching <include-pattern>.
  -X <exclude-pattern>  Don't check URLs matching <exclude-pattern>.
  -i <include-status>   Only report URLs whose response code starts with
                          the pattern <include-status>.
  -x <exclude-status>   Don't report URLs whose response code starts with
                          the pattern <exclude-status>.  Default is to
                          exclude "200" responses only (i.e. "-x 200").
  -d <max-depth>        Traverse links no deeper than <max-depth>.
  -f                    "File mode"-- only read files from the filesystem;
                          do not go through the HTTP server at all.  This
                          will skip all URLs that point to CGI scripts.
  -h                    "HTTP mode"-- use HTTP to check ALL URLs, even
                          if they could be read from the filesystem.
                          Incompatible with "-f" option.
  -c <config-file>      Read appropriate configuration parameters from
                          <config-file>, typically srm.conf.  Use '-' to
                          read from STDIN.  If a directory is named, use
                          "srm.conf" in that directory.
  -q                    Print current configuration parameters.
  -?                    Print this help message.
  --                    End command-line option processing.

Don't stack options like "-vf"; use "-v -f" instead.

For -I, -X, -i, and -x:
  Values are interpreted as Perl 5 regular expressions.
  Use multiple options to build a list (e.g. "-I include1 -I include2").
  Use a value of '' to clear a list (e.g. -x '' means "report all responses",
    "-x '' -x 401" means "report all but 401 responses").
  As a special case, an empty -I or -i list implies no include-restrictions.
  If an item is in both the include and exclude list, it is excluded.
  Note that -I and -X restrict which URLs are traversed into, so they may
    prune large areas of your Web structure.

The output goes to STDOUT, so you may want to redirect it to a file. You may want to run it in the background, too.
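
For example, a typical long run might look like this (the report filename is only a suggestion):

cl . > linkreport.txt &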

Besides the standard HTTP response codes, this program generates some non-standard response codes for certain situations. They are:

If you absolutely must get results exactly as a user would see them, use the -h option. The program is reasonably bug-free, but there are slight differences between reading files directly and going through HTTP. The biggest difference is authentication: this program doesn't enforce it the way a server does, so it will read files that would otherwise be protected by access.conf. Also, this program doesn't mimic all server options.
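
For example, to force every check through the HTTP server, starting from the current directory's default file:

cl -h .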

Let me know if any incompatibility or missing feature keeps you from using this program.

More specific details, if needed:

If a start-file parameter points to a directory, then the default file (e.g. index.html) will be used.

With neither the -f nor the -h option, the program will load any URLs it can from the filesystem, and go through the HTTP server for those it can't. This is the normal mode.
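
For example, to stay entirely on the filesystem (and therefore skip any URLs that point to CGI scripts):

cl -f .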

The lists for -I, -X, -i, and -x match an item if it matches any pattern in the list, not all of them. For example, "-I include1 -I include2" will match any URL with either the string "include1" or the string "include2" in it; the URL doesn't need to match both.
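
As an illustration (the patterns are only examples), the following checks only URLs containing "/docs/" or "/manuals/", skips anything ending in ".cgi", and reports every response code except 401:

cl -I /docs/ -I /manuals/ -X '\.cgi$' -x '' -x 401 .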


Changes in 1.0.1



© 1998, 2000 James Marshall (comments encouraged)

Last Modified: March 26, 2000
http://www.jmarshall.com/tools/cl/