Validating an entire site

Periodically you may want to make sure your entire website validates. This can be a hassle if your site is big. In this article we introduce a few Python scripts that will help us do mass validation from a list of links. We will also modify the W3C validator to work the way we want.

Recently I had a situation where I wanted to validate a large collection of pages. A customer has a rather large site, and more than 100 editors are involved in content creation. Since it is possible to create invalid HTML in their CMS tool, there is a need to check the site regularly. This code could also be the foundation for a continuous testing framework in a GUI development project. In short, it includes:

  1. Installing the W3C validator on a local machine.
  2. Modifying the validator to output XML.
  3. Writing a script to collect validator result for all URLs in a text file.
  4. Writing a script to crawl a web site and log all page URLs (which will be used as input to the script above).

The modified validator code and the Python scripts are available at the end of this article.

Installing the W3C validator

If you have a Mac with OS X this is an easy step. Following the excellent installation instructions on the Apple developer site I had the validator up and running on my iBook within ten minutes. Tip: If you have trouble installing the Bundle::W3C::Validator, try to install it after a reboot. It worked on my second attempt.

Modifying the validator

After installing the W3C validator and looking at the Perl code (yack!) I was surprised to find that XML output is already available! Appending ‘&output=xml’ to the query string will make the validator output the validation result as XML. It is even possible to get the result in EARL by using ‘n3’ for the output parameter!
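As a quick illustration, here is a minimal Python sketch of requesting an XML report from a locally installed validator. The validator URL below is an assumption; adjust the path to match your own install:

```python
# Request an XML validation report from a local W3C validator install.
# VALIDATOR is an assumed path -- change it to match your setup.
from urllib.parse import urlencode
from urllib.request import urlopen

VALIDATOR = "http://localhost/w3c-validator/check"  # assumed install path

def build_query(page_url, fmt="xml"):
    """Build the validator request URL; fmt can also be 'n3' for EARL output."""
    return VALIDATOR + "?" + urlencode({"uri": page_url, "output": fmt})

def validate(page_url):
    """Fetch the validation report for page_url as a string."""
    with urlopen(build_query(page_url)) as response:
        return response.read().decode("utf-8")
```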

This saved me some time. However, if the validator tries to validate a page where no encoding has been specified, it will produce an error in HTML format. For my purposes this was not good enough, so I have modified the error handling to output XML as well. I also wanted to check whether pages use Dublin Core for metadata, so I have included a check for the DC schema identifier.
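The DC check itself is simple: a page that uses Dublin Core conventionally declares the schema with a link element such as rel="schema.DC" in its head. A hypothetical Python version of the same test (my actual modification lives in the Perl check script) might look like:

```python
# Hypothetical sketch of the Dublin Core check: look for the schema
# declaration (rel="schema.DC") in the page's markup.
import re

def uses_dublin_core(html):
    """True if the page declares the Dublin Core schema."""
    return re.search(r'rel=["\']schema\.DC["\']', html, re.IGNORECASE) is not None
```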

My modifications are commented with “PKR” in the modified check script.

Some additional scripts

So, all I needed now was some simple scripts to call my local validator, parse the XML result and output data to a CSV file from which I can create statistics. The first script, massvalidate.py, takes two arguments:

  1. The name of a text file containing a list of URLs to validate. One URL per line please.
  2. The filename of the result CSV file.

The script will log information about the number of validation errors, the doctype used, web server software and the presence of DC metadata.
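The core loop of massvalidate.py can be sketched roughly like this. The validator URL and the <error> element name are assumptions; adapt them to the XML your validator actually emits, and extend the CSV row with doctype, server and DC columns as needed:

```python
# Rough sketch of the massvalidate.py flow: read URLs from a text file,
# fetch each XML report, count errors, and write one CSV row per URL.
# VALIDATOR and the "error" element name are assumptions.
import csv
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

VALIDATOR = "http://localhost/w3c-validator/check"  # assumed install path

def error_count(xml_report):
    """Count error elements anywhere in the XML report."""
    root = ET.fromstring(xml_report)
    return len(root.findall(".//error"))

def mass_validate(url_file, csv_file):
    """Validate every URL listed in url_file and log results to csv_file."""
    with open(url_file) as links, open(csv_file, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "errors"])
        for line in links:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            query = urlencode({"uri": url, "output": "xml"})
            with urlopen(f"{VALIDATOR}?{query}") as response:
                report = response.read().decode("utf-8")
            writer.writerow([url, error_count(report)])
```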

The second script, crawlsite.py, crawls a web site and extracts all unique URLs. The URL of the web site is passed as the first argument to the script (e.g. ‘python crawlsite.py http://127.0.0.1’). You can use this script to crawl your site and then feed the result to the massvalidate.py script.
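The crawling idea can be sketched with only the standard library. The real crawlsite.py uses the htmldata library instead; this simplified version only follows <a href> links on the same host:

```python
# Simplified sketch of the crawlsite.py idea: breadth-first crawl of a
# single host, collecting every unique page URL encountered.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url):
    """Return the set of unique same-host URLs reachable from start_url."""
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue:
        url = queue.pop(0)
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]  # drop fragments
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```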

Interesting statistics

My first trial run was to validate the start page of the 1020 public web sites listed at sverige.se. Sverige.se is the online gateway to Sweden’s public sector. The result: only 63 sites use valid HTML. See more statistics about doctypes used.

The next step

By adding some simple metadata to your pages (if you are using Dublin Core it may already be there) it would be trivial to have the script email the responsible developer/editor with any validation errors that were found. Please submit other ideas for development and I will add them here.

References

Please note that these scripts were written in haste. If you find errors or have any suggestions, please send them to me and I will update them as soon as possible.

  1. Installing the W3C HTML Validator on Mac OS X
  2. The W3C markup validation service
  3. The modified validator “check” script. Please replace the original “check” file with this one.
  4. crawlsite.py crawls a web site and lists links to validate (view crawlsite.py in html format). This script uses the htmldata library by Connelly Barnes. Please make sure it is available in the same folder.
  5. massvalidate.py. This script uses a list of links, feeds them to the validator and logs the result in a CSV file (view massvalidate.py in html format).
  6. The Web Design Group’s validator. This one uses a different engine compared to the W3C validator but has built-in support for mass validation.

Comments are closed for this article.

Peter Krantz, peter.krantz@giraffe.gmail.com (remove giraffe).