Validating an entire site

Periodically you may want to make sure your entire website validates. This can be a hassle if your site is big. In this article we introduce a few Python scripts that help with mass validation from a list of links. We will also modify the W3C validator to work the way we want.

Recently I had a situation where I wanted to validate a large collection of pages. A customer has a rather large site, and more than 100 editors are involved in content creation. Since it is possible to create invalid HTML in their CMS tool, the site needs to be checked regularly. This code could also be the foundation for a continuous testing framework in a GUI development project. In short, the process includes:

  1. Installing the W3C validator on a local machine.
  2. Modifying the validator to output XML.
  3. Writing a script to collect validation results for all URLs listed in a text file.
  4. Writing a script to crawl a web site and log all page URLs (which will be used as input to the script above).

The modified validator code and the Python scripts are available at the end of this article.

Installing the W3C validator

If you have a Mac with OS X this is an easy step. Following the excellent installation instructions on the Apple developer site, I had the validator up and running on my iBook within ten minutes. Tip: if you have trouble installing Bundle::W3C::Validator, try installing it after a reboot. It worked on my second attempt.

Modifying the validator

After installing the W3C validator and looking at the Perl code (yack!) I was surprised to find that XML output is already available! Appending ‘&output=xml’ to the query string will make the validator output the validation result as XML. It is even possible to get the result in EARL by using ‘n3’ for the output parameter!

This saved me some time. However, if the validator tries to validate a page where the encoding hasn’t been specified, it will produce an error in HTML format. For my purposes this was not good enough, so I have modified the error handling to output XML as well. I also wanted to check whether the page was using Dublin Core for metadata, so I have included a check for the DC schema identifier.

My modifications are commented with “PKR” in the modified check script.
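To see what the XML output looks like, you can call the validator from Python. The following is a minimal sketch (Python 3); the /w3c-validator/check path is an assumption and should be adjusted to wherever your installation serves the check script.

  # Fetch the XML validation report for a page from a locally installed
  # W3C validator. The path below is an assumption about your installation.
  import urllib.parse
  import urllib.request

  VALIDATOR = "http://127.0.0.1/w3c-validator/check"

  def validate(url):
      """Return the validator's XML report for the given page URL."""
      query = urllib.parse.urlencode({"uri": url, "output": "xml"})
      with urllib.request.urlopen(VALIDATOR + "?" + query) as response:
          return response.read().decode("utf-8", "replace")

  if __name__ == "__main__":
      print(validate("http://127.0.0.1/"))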

Some additional scripts

So, all I needed now was some simple scripts to call my local validator, parse the XML result and output data to a CSV file from which I can create statistics. The first script, massvalidate.py, takes two arguments:

  1. The name of a text file containing a list of URLs to validate. One URL per line please.
  2. The filename of the result CSV file.

The script will log information about the number of validation errors, the doctype used, web server software and the presence of DC metadata.
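The core loop is straightforward: read one URL per line, fetch the XML report from the local validator, count the error messages and append a row to the CSV file. Below is a minimal sketch of that loop (Python 3). The validator path and the ‘msg’ element name are assumptions about your installation’s XML output; the real massvalidate.py also records the doctype, server software and DC metadata.

  # Sketch of the massvalidate idea: URLs in, CSV summary out.
  import csv
  import sys
  import urllib.parse
  import urllib.request
  import xml.dom.minidom

  VALIDATOR = "http://127.0.0.1/w3c-validator/check"  # adjust to your install

  def validate(url):
      query = urllib.parse.urlencode({"uri": url, "output": "xml"})
      with urllib.request.urlopen(VALIDATOR + "?" + query) as response:
          return response.read()

  def main(url_file, csv_file):
      with open(url_file) as urls, open(csv_file, "w", newline="") as out:
          writer = csv.writer(out)
          writer.writerow(["url", "errors"])
          for line in urls:
              url = line.strip()
              if not url:
                  continue
              report = xml.dom.minidom.parseString(validate(url))
              # "msg" as the element name for a validation message is an
              # assumption; inspect your validator's XML output and adjust.
              errors = len(report.getElementsByTagName("msg"))
              writer.writerow([url, errors])

  if __name__ == "__main__":
      main(sys.argv[1], sys.argv[2])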

The second script, crawlsite.py, crawls a web site and extracts all unique URLs. The URL of the web site is passed as the first argument to the script (e.g. ‘python crawlsite.py http://127.0.0.1’). You can use this script to crawl your site and then feed the result to the massvalidate.py script.
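The crawling itself boils down to fetching a page, collecting the links that point to the same site, and repeating until no new URLs turn up. The sketch below (Python 3) uses only the standard library instead of the htmldata module that crawlsite.py relies on, so it is an approximation of the idea rather than the script itself.

  # Rough sketch of a site crawler: collect unique same-site URLs.
  import sys
  import urllib.parse
  import urllib.request
  from html.parser import HTMLParser

  class LinkCollector(HTMLParser):
      """Collect href values from anchor tags."""
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(start_url):
      seen, queue = set(), [start_url]
      site = urllib.parse.urlparse(start_url).netloc
      while queue:
          url = queue.pop()
          if url in seen:
              continue
          seen.add(url)
          try:
              with urllib.request.urlopen(url) as response:
                  html = response.read().decode("utf-8", "replace")
          except Exception:
              continue  # skip pages that fail to load
          parser = LinkCollector()
          parser.feed(html)
          for link in parser.links:
              absolute = urllib.parse.urljoin(url, link).split("#")[0]
              if urllib.parse.urlparse(absolute).netloc == site:
                  queue.append(absolute)
      return sorted(seen)

  if __name__ == "__main__":
      for url in crawl(sys.argv[1]):
          print(url)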

Interesting statistics

My first trial run was to validate the start page of each of the 1020 public web sites listed at sverige.se. Sverige.se is the online gateway to Sweden’s public sector. The result: only 63 sites use valid HTML. See more statistics about doctypes used.

The next step

By adding some simple metadata to your pages (if you are using Dublin Core it may already be there) it would be trivial to have the script email the responsible developer/editor with any validation errors that were found. Please submit other ideas for development and I will add them here.
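As a rough illustration of the idea, suppose each page carried a meta tag such as <meta name="DC.creator" content="editor@example.com">; the rows produced by massvalidate.py could then be turned into notification mails. Everything in the sketch below (the addresses, the SMTP host and the notify() helper) is hypothetical and not part of the published scripts.

  # Hypothetical sketch: mail an editor about validation errors on a page.
  import smtplib
  from email.message import EmailMessage

  def notify(editor_address, page_url, error_count, smtp_host="localhost"):
      msg = EmailMessage()
      msg["Subject"] = "Validation errors on %s" % page_url
      msg["From"] = "validator@example.com"   # placeholder sender address
      msg["To"] = editor_address
      msg.set_content(
          "The page %s currently has %d validation error(s).\n"
          "Please have a look." % (page_url, error_count)
      )
      with smtplib.SMTP(smtp_host) as server:
          server.send_message(msg)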

References

Please note that these scripts were written in haste. If you find errors or have any suggestions, please send them to me and I will update them as soon as possible.

  1. Installing the W3C HTML Validator on Mac OS X
  2. The W3C markup validation service
  3. The modified validator “check” script. Please replace the original “check” file with this one.
  4. crawlsite.py crawls a web site and lists links to validate (view crawlsite.py in html format). This script uses the htmldata library by Connelly Barnes. Please make sure it is available in the same folder.
  5. massvalidate.py. This script uses a list of links, feeds them to the validator and logs the result in a CSV file (view massvalidate.py in html format).
  6. The Web Design Group’s validator. This one uses a different engine than the W3C validator but has built-in support for mass validation.

Comments

  1. Tom says at 2005-04-11 02:04:

    I use an RSS feed to validate my front page. Since my posts normally appear at least once, there’s a good chance I get to clean up the errors.

    I’ve also just found out that you can get Yahoo search results in RSS format. So if you know for sure that Yahoo has indexed all of your site, you could use the “site:http://www.standards-schmandards.com/” query string to pull up a list of all the pages. I’m sure it would be possible to read this RSS file and output it as one URL per line.

    Unfortunately the output is limited to the first set of results, so you would have to build a little application using the Yahoo developer tools. I believe there are some Python examples; how effective it would be in comparison to your script I really don’t know.

  2. Pete says at 2005-04-11 07:04:

    Tom, Ben Hammersley’s RSS validation version seems to be an easy way of doing it. However, there are a number of problems:

    • It fails for a number of URLs e.g. this link.
    • It cannot validate local sites such as an intranet.
    • Yahoo does not report all pages when using the “site:” parameter even if the site is indexed. E.g. try finding this site.
  3. Tom says at 2005-04-11 16:04:

    Hmm, perhaps Ben could shed some light on why it’s not working for some URLs. I guess an intranet would throw a spanner in the works somewhat.

    Oh, I think you have to use “site:www.standards-schmandards.com” instead of “site:http://www.standards-schmandards.com”.

  4. Roger Johansson says at 2005-05-07 17:05:

    Neat. Thanks for compiling this info. I got it working after a bit of fiddling around – the “check” script in my installation of the W3C validator seems to be a different version than the script you modified. I had the validator installed already, so I suppose it could be an older version… not sure.

  5. Small Paul says at 2005-05-11 00:05:

    Great stuff – being on a PowerBook, I expect I’ll be finding this very useful in future.

    I had the same issue at work recently. We solved it pretty quickly, as a developer had built an XML validator using a stock .NET XML validation control. It took him about 90 minutes to whip it into reasonable shape, which was pretty cool.

  6. Jens Meiert says at 2005-05-11 11:05:

    You can also use the “Validate entire site” option of the (web-based) WDG HTML validator (which is enabled by default if you try UITest.com’s Site Check).

  7. Tim Swan says at 2005-05-13 17:05:

    Where are you putting the crawlsite and massvalidate scripts? I’ve got everything working up to this point, but I’m a newbie.

  8. Roger Johansson says at 2005-05-15 09:05:

    Tim: You can put them anywhere, as long as the htmldata library is in the same folder as crawlsite.py. You also need to rename the htmldata script to htmldata.py. I put everything in the validator/htdocs folder.

  9. Pete says at 2005-10-12 12:10:

    I will leave comment nine here as an interesting specimen of comment spam. Does he really think that people will be interested in following links after statements like “Now I already year of the free person with weighty incom!”?

  10. helen says at 2005-11-09 20:11:

    Ben Hammersley’s RSS validation version is great. We always use it and it has never failed. I would recommend it.

  11. Scrambled says at 2005-11-20 02:11:

    I’ve just downloaded the W3C Validator and noticed the check script is completely different from your version. Any tips on the mods needed?

  12. Scrambled says at 2005-11-21 01:11:

    OK, an update: I have it sort of working, but nothing happens when I run either script. crawlsite does crawl the site (the DOS box shows all the links; I’m running Windows) but it doesn’t write to the output file I specified, and massvalidate runs for about a second and then nothing?

Peter Krantz, peter.krantz@giraffe.gmail.com (remove giraffe).