9.5 Extracting Links from Web Pages Using HTML::LinkExtor

We have seen in Section 7.4.2.2 how we can extract links from a Web page fetched over the network by parsing the page. Extracting links and fetching the pages they point to, recursively or in some other fashion, is useful for purposes such as building an index of a Web site to facilitate searching. Hyperlinks that appear on a Web page point to resources of various kinds: HTML pages, XML pages, graphic files, audio or video clips, CGI programs, etc. Writing our own code to extract links of a particular type, or to extract all links from the various HTML attributes in which they can appear, is, in general, time consuming. The Perl module HTML::LinkExtor can be used for this purpose.

The HTML::LinkExtor module can be used to parse HTML pages fairly easily for extracting links. Below, we present a program that finds the so-called dead or stale links in a Web site. It starts with a URL given to it and traverses the site recursively, and examines each traversed page in order to collect stale or dead URLs.

 Program 9.16

#!/usr/bin/perl
#linkExtract5.pl

use strict;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;  #Needed to absolutize a URL, if necessary

$" = "\n";
my $MAXURLCOUNT = 100;  #The max no of unique URLs to look at
my $COUNT = 0;
my $ua; 
my %TRAVERSED; #Keeps track of all traversed URLs
#Note the / at the end is needed 
my $STARTURL = "http://www.cs.uccs.edu/~kalita/";
#Used to see if a link is inside the domain represented by $BASEURL or outside.  
my $BASEURL = "http://www.cs.uccs.edu/"; 
my $MIMEEXTS = "(s?html?|xml|asp|pl|css|jpg|gif|pdf)";
my ($domain) = ($BASEURL =~ m#http://(.+?)/?$#);
my $RECORDFILE = "ERRORS.$domain.txt";

#Need a UserAgent to be the client
$ua = LWP::UserAgent->new;
open OUT, ">$RECORDFILE" or die "Cannot open $RECORDFILE: $!";
extractLinks ($STARTURL, $BASEURL);
close OUT;

#####################
sub  extractLinks{
    my ($url, $containingURL)  = @_;
    return if $TRAVERSED{$url};  #Already examined; nothing more to do
    $TRAVERSED{$url}++;
    $COUNT = $COUNT + 1; 
    if ($COUNT > $MAXURLCOUNT){
         exit 0;
    }
    print "Looking at an in-domain URL #$COUNT: $url\n";

    #Make the parser. Can give 0, 1 or 2 args. The first is  an optional sub to
    #process the urls. the second is used to absolutize any relative URLs 
    #that may occur.
    #Not giving any args here. Absolutization is done separately.  
    my $p = HTML::LinkExtor -> new ();
    #Request document using HTTP
    my $res = $ua->request(HTTP::Request->new(GET => $url));
    if (!($res -> is_success)){
     print OUT 
     "Stale URL:  $url\nContaining URL:  $containingURL\nHTTP Message: ",
               $res->message, "\n\n";
     return; 
    }
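    #We have the contents of the file now
    my $file = $res -> content;
    #Parse the page; the parser collects the links internally
    $p -> parse ($file);
    $p -> eof;  #Tell the parser there is no more input
    #links() returns a list of array references, one per link found
    my @links = $p->links();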
    my ($aLinkRef,  %linkHash, @linkArray);
    foreach $aLinkRef (@links){
        my ($tag, $attr, $theUrl) = @$aLinkRef;
        #Absolutizing done below doesn't work if a directory-level 
        #URL doesn't have a / at the end. 
        if ($url !~ /[.]$MIMEEXTS$/ and  $url !~ m@/$@)
                {$url = $url . "/"}; 
        my $theURI = new URI ($theUrl);
 
        $theURI = $theURI->abs($url);
        $linkHash{$theURI}++;
    }
    @linkArray = sort (keys %linkHash);
    my $newURL;
    foreach $newURL (@linkArray){
        next if $TRAVERSED{$newURL};
        #\Q...\E makes the dots in $BASEURL match literally
        if ($newURL =~ m/^\Q$BASEURL\E/){
            extractLinks ($newURL, $url);
        }
        else {
            checkLink ($newURL, $url);
        }
    }
}

##################
sub checkLink{
    my ($url, $containingURL) = @_;
    return if $TRAVERSED{$url};  #Already examined; nothing more to do
    $TRAVERSED{$url}++;
    $COUNT = $COUNT + 1; 
    if ($COUNT > $MAXURLCOUNT){
         exit 0;
    }
    print "Looking at an out-of-domain URL #$COUNT: $url\n";
    #Request document using HTTP
    my $res = $ua->request(HTTP::Request->new(GET => $url));

    #$res-> code gives the response code
    #$res-> is_success is a boolean
    #$res -> message returns the error message
    if (!($res -> is_success)){
        print OUT 
            "Stale URL:  $url\nContaining URL:  $containingURL\nHTTP Message: ",
           $res->message,  "\n\n";
        return; 
    }
}

The program uses the LWP::UserAgent, HTML::LinkExtor and URI modules. The value of $MAXURLCOUNT specifies the maximum number of unique URLs the program examines. The program has two URLs, $STARTURL and $BASEURL. $STARTURL specifies the URL where the recursive examination of URLs starts. $BASEURL is the base URL for $STARTURL. Clearly, the base URL can be easily obtained by parsing $STARTURL and keeping the portion of the URL from the beginning till the end of the server name. The program culls out the domain name from $BASEURL and uses it to name the record file. The program looks for stale URLs and records the ones it finds in the file $RECORDFILE. We define a stale URL as one that cannot be retrieved from a Web server during the default timeout period of the user agent. Most frequently, a stale URL corresponds to a page that no longer exists. The recursive extraction of links is started by a call to the extractLinks subroutine. The call is given below.

 

extractLinks ($STARTURL, $BASEURL);

 

The subroutine extractLinks takes two arguments: the URL to examine, and the URL of the page in which that URL was found. For the initial call, the second argument is simply the base URL. The subroutine is recursive. The program terminates when the count of unique URLs examined exceeds $MAXURLCOUNT, a global variable. The subroutine starts by checking whether the URL $url, to be fetched and analyzed, has already been examined and hence occurs as a key in the global hash %TRAVERSED. Using a hash to keep track of examined URLs, we can easily ensure that the URLs examined by the program are unique. If the URL is not a key in %TRAVERSED, our program enters it in %TRAVERSED before analyzing it. We examine at most $MAXURLCOUNT URLs.
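The bookkeeping just described can be tried in isolation. The following is a minimal sketch, not part of Program 9.16: the names should_visit, $MAX_URL_COUNT, $count and %traversed, the small limit of 3, and the example.com URLs are all chosen only for illustration. Unlike the program, the sketch returns a false value instead of terminating when the limit is reached.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MAX_URL_COUNT = 3;   # hypothetical small limit for the demonstration
my $count = 0;           # number of unique URLs accepted so far
my %traversed;           # keys are the URLs already examined

# Returns 1 if the URL should be examined now, 0 otherwise.
sub should_visit {
    my ($url) = @_;
    return 0 if $traversed{$url};           # seen before: skip it
    return 0 if $count >= $MAX_URL_COUNT;   # limit reached: skip it
    $traversed{$url}++;                     # record the URL before analyzing it
    $count++;
    return 1;
}

# Illustrative URLs; the third one repeats the first.
my @candidates = (
    'http://example.com/',
    'http://example.com/a',
    'http://example.com/',
    'http://example.com/b',
    'http://example.com/c',
);

my @visited = grep { should_visit($_) } @candidates;
print "$_\n" for @visited;
```

The repeated URL is rejected by the hash lookup, and the last URL is rejected because the limit has been reached.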

The subroutine associates a parser with the scalar variable $p. The parser is created by the following statement of code.

 

    my $p = HTML::LinkExtor -> new ();

 

The constructor new for HTML::LinkExtor can take zero, one or two arguments. The first argument, if given, is a reference to a so-called callback subroutine. If a callback subroutine is provided, it is called once automatically for each link found. If a callback subroutine is not provided, as is the case here, the links are collected internally and can be retrieved by calling the links method of the parser. The second argument, if given, is used to absolutize any relative URLs that are obtained from the analyzed Web pages. We do not specify either of the two arguments in our call. When necessary, we absolutize relative URLs on our own, as discussed later.
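For comparison, the two-argument form of the constructor can be sketched as follows. This is an illustration only and is not used in Program 9.16. The HTML fragment, the base URL http://www.example.com/dir/ and the names $callback and @found are assumptions made for the example. The callback receives the tag name followed by attribute/URL pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# A small page held in a string, so that no network access is needed.
my $html = <<'HTML';
<html><body>
<a href="papers/index.html">Papers</a>
<img src="/images/logo.gif">
<a href="http://www.example.org/">Elsewhere</a>
</body></html>
HTML

my @found;
# Called once for each tag that carries links.
my $callback = sub {
    my ($tag, %attrs) = @_;
    push @found, "$tag $_ $attrs{$_}" for sort keys %attrs;
};

# Hypothetical base URL against which relative links are absolutized.
my $base = 'http://www.example.com/dir/';
my $p = HTML::LinkExtor->new($callback, $base);
$p->parse($html);
$p->eof;   # signal the end of the document

print "$_\n" for @found;
```

Because a base URL is supplied, the relative links papers/index.html and /images/logo.gif reach the callback in absolute form.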

A GET request is created for the Web page addressed by $url and the request is sent to the Web server. The response from the Web server is called $res. If the GET request is unsuccessful, the URL is recorded as stale or dead and the subroutine returns. If the success code associated with $res indicates that the GET request is successful, the subroutine stores the content of the response in the variable $file. Next, the subroutine calls the parse method of the HTML::LinkExtor class on the parser object $p using the argument $file.

 

    $p -> parse ($file);

 

parse is a method of the HTML::Parser class that we do not discuss here. The class HTML::LinkExtor is a subclass of HTML::Parser and hence, inherits the parse method from HTML::Parser. The URLs specified in the HTML file are extracted by the following call.

 

    my @links = $p->links();

 

The HTML::LinkExtor parser $p stores the links internally. The links() method returns these links in a list. Here, the list is called @links. Each element of @links is a reference to an array that describes one link.

The first foreach loop that follows gets each reference to a link and then processes it. A link stored by the HTML::LinkExtor parser contains three parts: a tag, an attribute, and the URL. For example, if in the HTML file, the tag is

 

<A HREF="http://www.cs.uccs.edu/~kalita">

 

the tag is A, the attribute is HREF and the URL is http://www.cs.uccs.edu/~kalita. If the HTML file contains

 

<IMG SRC="http://www.cs.uccs.edu/~kalita/jk1.jpg">

 

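the tag is IMG, the attribute is SRC and the URL is http://www.cs.uccs.edu/~kalita/jk1.jpg. HTML::LinkExtor reports tag and attribute names in lowercase (a, href, img, src), whatever their case in the page.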
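The structure of the stored links can be inspected with a short script. The sketch below feeds the two tags from the examples above to a parser created without arguments and unpacks each array reference returned by links(). The variable names $html and @summary are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# The two example tags, held in a string so that no page is fetched.
my $html = '<A HREF="http://www.cs.uccs.edu/~kalita">Home</A>'
         . '<IMG SRC="http://www.cs.uccs.edu/~kalita/jk1.jpg">';

my $p = HTML::LinkExtor->new();   # no callback: links are stored internally
$p->parse($html);
$p->eof;

my @summary;
foreach my $link ($p->links) {
    # Each element is an array reference: tag, attribute, URL.
    my ($tag, $attr, $url) = @$link;
    push @summary, "$tag $attr $url";
}
print "$_\n" for @summary;
```

Since no base URL is given to the constructor, the URLs come back exactly as written in the page.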

Before we collect the URLs specified in a page, we need to absolutize any relative URLs that may occur in the page. A relative URL is one that does not start with the protocol string such as http or ftp. Thus, if we have a relative URL garden-of-the-gods.jpg, and it occurs in the page with absolute URL http://www.cs.uccs.edu/~kalita, it is absolutized as http://www.cs.uccs.edu/~kalita/garden-of-the-gods.jpg. To absolutize a URL, we use the abs method in the URI module. The absolutization steps are given below.

 

        #Absolutizing done below doesn't work if a directory-level
        #URL doesn't have a / at the end.
        if ($url !~ /[.]$MIMEEXTS$/ and  $url !~ m@/$@)
                {$url = $url . "/"};
        my $theURI = new URI ($theUrl);
        $theURI = $theURI->abs($url);

 

The abs() method of the URI class resolves a relative URL according to the standard rules, which treat the last path segment of the containing URL as a file name unless the URL ends with a /. Consequently, it does not return the intended absolute URL if the URL with respect to which absolutization is performed names a directory but does not end with a /. To take care of this situation, we use the global variable $MIMEEXTS that gives a non-exhaustive list of extensions used by non-directory files, in the form of a regular expression. If the URL $url being analyzed does not end with one of the file extensions listed in $MIMEEXTS, we assume that $url refers to a directory. If it is a directory and does not have a trailing /, we append a / at the end. The next step in absolutization is to take the current individual URL $theUrl found in the page with address $url, and create a URI object, $theURI, out of it. To finish the absolutization process, the abs method of the URI class is called on $theURI with the containing URL $url as an argument. If a URL is already in absolute form, it does not change. Otherwise, it is transformed syntactically to make it absolute.
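The effect of the trailing / can be seen in a short experiment. The sketch below absolutizes the relative URL from the earlier example against the containing URL written with and without the final /, and also shows that an absolute URL passes through unchanged. The variable names are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $relative = 'garden-of-the-gods.jpg';

# Containing URL without a trailing /: "~kalita" is treated as a file name
# and is replaced by the relative URL.
my $no_slash = URI->new($relative)->abs('http://www.cs.uccs.edu/~kalita');

# Containing URL with a trailing /: "~kalita/" is treated as a directory.
my $with_slash = URI->new($relative)->abs('http://www.cs.uccs.edu/~kalita/');

# A URL that is already absolute is returned unchanged.
my $absolute = URI->new('http://www.assam.org/')
                  ->abs('http://www.cs.uccs.edu/~kalita/');

print "$no_slash\n$with_slash\n$absolute\n";
```

Only the second call produces the URL we intend, which is why the program appends a / to directory-level URLs before calling abs.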

The first foreach loop goes over every element of @links and takes the URL out of the current link data structure and stores it in %linkHash. We use a hash so that there is no duplication of the URLs collected. That is, if a URL occurs several times in a Web page, we register it only once. A new array called @linkArray is created containing the unique URLs that are collected, i.e., the keys of %linkHash.

The extractLinks subroutine has a second foreach loop. This loop iterates over every unique and unexamined URL collected. If the URL happens to be in the domain we are exploring, the subroutine extractLinks is called recursively. Such in-domain links are examined for staleness in extractLinks. If the URL is from another domain, the subroutine checkLink is called to see if the URL is stale or dead. URLs pointing to locations outside the site of interest are not explored recursively.
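The in-domain test amounts to a prefix match against $BASEURL. The following is a minimal sketch of that decision; the helper name in_domain and the third test URL, http://wwwXcs.uccs.edu/fake, are made up for illustration. The sketch quotes $BASEURL with \Q...\E so that each dot matches only a literal dot. Without the quoting, the made-up URL would be accepted as in-domain because an unquoted dot matches any character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $BASEURL = 'http://www.cs.uccs.edu/';

# Hypothetical helper: returns 1 if $url starts with $BASEURL, 0 otherwise.
sub in_domain {
    my ($url) = @_;
    return $url =~ m/^\Q$BASEURL\E/ ? 1 : 0;
}

foreach my $url ('http://www.cs.uccs.edu/~kalita/cultural.html',
                 'http://www.amnesty.org',
                 'http://wwwXcs.uccs.edu/fake') {   # made-up look-alike host
    my $action = in_domain($url) ? 'explore recursively' : 'check once';
    print "$url: $action\n";
}
```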

The checkLink subroutine is called with two arguments: $url, the link to check for staleness, and the URL of the page in which $url was mentioned. The subroutine makes a GET request to retrieve $url. If this request is not successful, the subroutine records in $RECORDFILE the URL that could not be retrieved and is thus stale, the URL of the page that contains a reference to it, and the message returned with the response. The program prints the URLs it examines on the screen. A partial printout of the screen is given below. Lines have been broken to fit the printed page.

 

Looking at an in-domain URL #1: http://www.cs.uccs.edu/~kalita/
Looking at an out-of-domain URL #2: http://www.assam.org
Looking at an out-of-domain URL #3: http://www.assamcompany.com
Looking at an out-of-domain URL #4: http://www.autoindia.com
Looking at an in-domain URL #5:
     http://www.cs.uccs.edu/cgi-bin/jkkalita/counter.pl?counter=kalita-index-page
Looking at an in-domain URL #6:
     http://www.cs.uccs.edu/~kalita/accesswatch/accesswatch-1.32/index.html
Looking at an in-domain URL #7: http://www.cs.uccs.edu/cgi-bin/jkkalita/counter.pl
Looking at an in-domain URL #8: http://www.cs.uccs.edu/~kalita
Looking at an in-domain URL #9: http://www.cs.uccs.edu/~kalita/college.gif
Looking at an in-domain URL #10: http://www.cs.uccs.edu/~kalita/cultural.html
Looking at an out-of-domain URL #11: http://www.amnesty.org

 

The program stores the list of stale URLs in the file ERRORS.$domain.txt. In this specific case, $domain has the value www.cs.uccs.edu. Therefore, the record file is called ERRORS.www.cs.uccs.edu.txt. A part of the content of this file after the program is run is given below.

 

Stale URL:  http://www.acsu.buffalo.edu/~talukdar/assam/humanrightsassam.html
Containing URL:  http://www.cs.uccs.edu/~kalita/assam/human-rights-violations.html
HTTP Message: Not Found

Stale URL: 
   http://www.cs.uccs.edu/cgi-bin/jkkalita/access_counter.pl.old?counter=human-rights
Containing URL:  http://www.cs.uccs.edu/~kalita/assam/human-rights-violations.html
HTTP Message: Not Found

Stale URL: 
    http://193.135.156.15/tbs/doc.nsf/c12561460043cb8a4125611e00445ea9/
    f2261dd9e000fbe4802565090051a509?OpenDocument
Containing URL:  http://www.cs.uccs.edu/~kalita/assam/human-rights/ajit-bhuyan.html
HTTP Message: Can't connect to 193.135.156.15:80 (Timeout)

Stale URL:  http://www.hri.ca/partners/sahrdc/armed/toc.shtml
Containing URL:  http://www.cs.uccs.edu/~kalita/assam/human-rights/ajit-bhuyan.html#soe/
HTTP Message: Object Not Found

Stale URL:  http://www.hri.ca/partners/sahrdc/india/detention.shtml
Containing URL:  http://www.cs.uccs.edu/~kalita/assam/human-rights/ajit-bhuyan.html#soe/
HTTP Message: Object Not Found

Stale URL:  http://vag.vrml.org/
Containing URL:  http://www.cs.uccs.edu/~kalita/2000-CS301-roster.html
HTTP Message: Can't connect to vag.vrml.org:80 (Bad hostname 'vag.vrml.org')

 

We note that some of the URLs are recorded as stale because they cannot be accessed within the default timeout period of the user agent $ua. It is possible to increase the timeout period by using the timeout method of the LWP::UserAgent class.
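A minimal sketch of changing the timeout follows. The value of 300 seconds is an arbitrary choice for illustration and is not a recommendation. The timeout method returns the current setting when called without an argument and sets a new one when given a number of seconds. The LWP::UserAgent documentation gives the default as 180 seconds.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $default = $ua->timeout;   # timeout in seconds before any change
$ua->timeout(300);            # hypothetical value: wait up to 5 minutes

print "Default timeout: $default seconds\n";
print "New timeout: ", $ua->timeout, " seconds\n";
```

A longer timeout reduces the number of slow servers reported as stale, at the cost of a longer run when many servers do not respond.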