Bash script to check a website for broken links (404 errors)

I needed to check a website because some of its links point to pages that don't exist, so opening them gives you a “404”.

So I created a script to do this check.

I will explain after the code.

echo "Checking broken links on $1"
echo "Extracting all links in the main page..."
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
echo "Identifying lines that contain sub-pages and writing them to dirs.log..."
cat weblinks.log | grep "index.html" > dirs.log
echo " Extracting URLs..."
awk -F " " '{print $3}' dirs.log > dirs2.log
sed s/"URL:"/""/g dirs2.log > dirs3.log
sort dirs3.log | uniq > dirs4.log
echo " checking every sub-page..."
while read -r line; do echo "processing $line"; fn=$(date +%s).log; echo "$line" >> summary.log; wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"; echo "$fn done"; tail -n 15 "$fn" >> summary.log; done < dirs4.log > process_results.log
echo "Done"
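
Since everything in the script depends on $1, a small guard at the top can stop it early when no URL is given. This is only a minimal sketch: the function name require_url and the script name check_links.sh in the usage message are hypothetical, not part of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script):
# return an error when the URL argument is missing or empty.
require_url() {
  if [ -z "${1:-}" ]; then
    # check_links.sh is an assumed file name, used only for the message.
    echo "Usage: check_links.sh <website-url>" >&2
    return 1
  fi
}

# At the top of the script one could then write:
#   require_url "$1" || exit 1
```

With this in place, running the script without arguments would print the usage message instead of calling wget with an empty URL.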
  • First, you need to pass the website URL as the first argument ($1)
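  • wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
    • this line checks the main page ($1): it only checks that pages exist instead of saving them (--spider), follows links recursively (-r) down to 1 level of depth (-l 1), also follows links to other hosts (-H), waits 2 seconds between requests (-w 2), and writes a brief log (-nv) to the weblinks.log file (-o)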
  • cat weblinks.log | grep "index.html" > dirs.log
    • this line extracts all the sub-pages (identified by index.html in the previous file) and writes them to another file (dirs.log)
  • awk -F " " '{print $3}' dirs.log > dirs2.log
    sed s/"URL:"/""/g dirs2.log > dirs3.log
    • These two lines extract only the URL from each line: awk prints the third space-separated field, and sed removes the "URL:" prefix from it
  • sort dirs3.log | uniq > dirs4.log
    • This will sort the list and remove duplicates.
  • while read -r line; do echo "processing $line"; fn=$(date +%s).log; echo "$line" >> summary.log; wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"; echo "$fn done"; tail -n 15 "$fn" >> summary.log; done < dirs4.log > process_results.log
    • This line, the most complex in this script, does several things:
      • first, it reads every line of the file containing the URLs (while read -r line)
      • sets up a timestamp as the name of a log file for each URL (fn=$(date +%s).log)
      • appends the URL to the final log file (echo "$line" >> summary.log)
      • wget … runs a second scan on each sub-page
      • appends the last 15 lines of that URL's log to the summary file (tail -n 15 "$fn" >> summary.log; this step can be improved)
      • and keeps a parallel log file with the loop's own messages (> process_results.log)
  • Finally, you get the main file summary.log, containing the URLs and their broken links, and a log file, process_results.log
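
The step that copies the last 15 lines is the one marked as improvable. One possible refinement is to keep only the broken URLs themselves. The sketch below assumes that the wget log ends with a "Found N broken links." line followed by one URL per line; the function name broken_links is hypothetical, and the sample log is a hand-written stand-in, not captured wget output.

```shell
#!/usr/bin/env bash
# Print only the URLs listed after the "Found N broken link(s)." line.
# Assumption: the log ends with that summary line, then one URL per line.
broken_links() {
  awk '/^Found [0-9]+ broken links?\./ { found = 1; next }
       found && /^https?:\/\// { print }' "$1"
}

# Hand-written sample log (hypothetical content, for illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
(earlier log lines omitted)
Found 2 broken links.

http://example.com/missing.html
http://example.com/old/page.html
EOF

# Prints the two URLs from the sample, one per line.
broken_links "$sample"
rm -f "$sample"
```

Inside the loop, something like broken_links "$fn" >> summary.log could then take the place of the tail step, provided the real logs follow the assumed layout.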