Need utility to validate website links - prefer open source

rodsfree

[H]ard|Gawd
Joined
Dec 13, 2004
Messages
1,417
Guys,
My site is getting a little too big for me to validate all the links by hand, and I need a way to automate it.

I want a utility that I can run on my working copy of the site, before I upload it to the server.

Any suggestions? What are you using?


Thanks in advance!

 
By "validate the links", do you mean run an HTML validator on each of the things that are linked to, or do you want to verify that each of the links points somewhere that won't give you a 404?
 
Type "link checker" (without the quotes, of course) into Google. I got quite a few links that might be worth looking into, but I'm too lazy to go through them for you :p ... hope that helps.
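For what it's worth, one open-source hit you'll probably run into is LinkChecker; assuming it's installed, pointing it at your working copy or the live site is a one-liner (the paths here are just placeholders):
Code:
# check a local working copy
linkchecker ./public_html/index.html

# or crawl the live site
linkchecker http://YOURSITE.com/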
 
Just whip out a little script to HEAD $URL and check the status code: 200 = good, !200 = bad.
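Something like this gets you the status code by itself (assuming curl is installed; the URL is just a placeholder):
Code:
# HEAD the URL, throw away the headers, print only the status code
curl --head --silent --output /dev/null --write-out "%{http_code}\n" "http://example.com/"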
 
Use at your own risk. ;) Change the obvious variables in caps.
Code:
#!/bin/bash

# For every href in every .html file: resolve it to an absolute URL,
# HEAD it with curl, and print the status code next to the URL.
# The grep -v at the end hides the 200s so only suspect links show up.
find YOURPATH -name \*.html | xargs grep href | perl -ne \
'chomp;
# grep output looks like "path/file.html:...line...", so field 0 is
# the file; the rest is the matched line (URLs contain colons too).
@match=split(/:/);
$match[0]=~s/[^\/]*$//;              # keep only the directory of the file
$foo=join(":",@match[1..$#match]);   # rejoin all remaining fields, not just 1 and -1
$foo=~s/^.*href="//;                 # crude: grabs one href per line
$foo=~s/".*$//;
if(substr($foo,0,6) ne "mailto")
{
    if(substr($foo,0,4) eq "http") { $url=$foo; }
    else {
        if(substr($foo,0,1) eq "/") {
            $url="http://YOURSITE.com".$foo;                 # site-absolute link
        } else {
            $url="http://YOURSITE.com/".$match[0].$foo;      # relative link
        }
    }
    $status=`curl --head --silent "$url" | head -1`;
    $status=~s/^.*?([0-9]{3}).*$/$1/;   # pull NNN out of "HTTP/1.x NNN ..."
    chomp($status);
    print $status." -- ".$url."\n";
}' | grep -v "^200"

Output:
Code:
405 -- http://us.imdb.com/Title?0066473
405 -- http://us.imdb.com/Title?0120885
302 -- http://www.dictionary.com/cgi-bin/dict.pl?term=apotropaic
302 -- http://www.dictionary.com/cgi-bin/dict.pl?term=boustrophedonically
...
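Note that not every non-200 is actually dead: the 302s above are just redirects, and the 405s mean the server refuses HEAD requests outright. If you want redirects followed to their final destination, swapping the curl line for something like this should do it (a sketch, not something I've tested against every server):
Code:
# follow redirects and report only the final status code
$status=`curl --head --silent --location --output /dev/null --write-out "%{http_code}" "$url"`;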
 
HorsePunchKid said:
By "validate the links", do you mean run an HTML validator on each of the things that are linked to, or do you want to verify that each of the links points somewhere that won't give you a 404?

Just trying to avoid the dead-link 404 errors...

I found this: http://www.drk.com.ar/spider.php

I'm going to give it a try...

HPKid, thanks for the code! But I'm on Server 2003 instead of Linux.



 
rodsfree said:
HPKid, thanks for the code! But I'm on Server 2003 instead of Linux.
I'm on Windows XP, but you don't see me not using Cygwin. ;)

That code was only half-serious, anyway. Unfortunately it's not trivial to write something that does this properly, even for flat HTML files: you have to resolve relative paths, honor <base> tags, and handle links that span lines or sit in attributes other than href. This thing you linked to looks pretty decent; I may try it out myself.
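If you do go the Cygwin route, the script above should run as-is once bash, perl, curl, and the usual findutils/grep packages are installed from Cygwin's setup. A quick sanity check for the prerequisites (just a sketch):
Code:
# verify everything the script calls is on the PATH
for t in perl curl find grep xargs head; do which $t >/dev/null || echo "missing: $t"; done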
 