7.4.2.3 A Simple Web Crawler
7.4.2.3 A Simple Web Crawler
In this section, we discuss a simple attempt at writing a program that crawls the World Wide Web. The program starts from a specified URL, fetches it, parses it, culls the URLs that are in this file, and then traverses these URLs in turn. It traverses text URLs only. We have broken the program into three packages or modules: URLConf.pm, Tcp.pm, Crawl.pm, and the main program findURLs.pl. We start by presenting the main program.
Program 7.28
#!/usr/bin/perl -w #file findURLs.pl use strict; use diagnostics -verbose; use Crawl; my ($server, $file, $port); my $url = "http://www.cs.uccs.edu/~kalita"; Crawl::getURLsRecursively ($url);
The main program uses a package called Crawl that we discuss a little later. In Perl, a package is stored in a file with .pm extension. A package can be stored in the same directory as where the script that uses it, or in an operating system-specific location (or, site location) so that all programs by all users can access it. Here, we assume the file Crawl.pm is in the same directory as the main file findURLs.pl. The statement
use Crawl;
loads the package for us, and makes all its identifiers, i.e., variables, subroutines, and objects, available without qualification. In the main program, we specify a URL where we start crawling. Then, we make the call
Crawl::getURLsRecursively ($url);
to start the crawling process. getURLsRecursively is a subroutine defined in the Crawl package. The package name qualification Crawl:: is not necessary. We use the qualifier to make explicit where the subroutine is defined. Even if we use a fully qualified name, we must use the use directive so that the package’s names are available.
Before we discuss the contents of the Crawl package, we discuss two simple packages that Crawl uses: URLConf, and Tcp.
The URLConf package defines a few global variables. These variables are declared only once and used by all the other packages. It is a simple package where we do not make any attempts at exporting variable names.
Program 7.29
#!/usr/bin/perl
#file URLConf.pm
package URLConf;
#number of URLs we want to collect
$FILE_NO_LIMIT = 10000;
#The file where we store the URLs collected
$TRAVERSED_URL_FILE = "traversedURLs.txt";
#Specify patterns whose presence in a URL will disqualify it,
#as regular expressions
$EXCLUDEPATTERNS = "(cgi-bin|mailto:|cgi\?|#.*)";
#Specify all MIME extensions whose presence at the end of a file name
#causes the file to be considered a simple file and not a directory
$MIMEEXTS = "(html?|xml|gif|jpg|jpeg|css|xsl|asp)";
#Specify the MIME extensions of files that we want to fetch and analyze
$INCLUDEDEXTS = "(html?|xml)";
#The domains in which we are collecting URLs
#Write multiple domains separated by | as in
#If you want to cover all domains, put ".*" as the domain name.
$DOMAINS_COVERED = "uccs.edu|colorado.edu|cudenver.edu|uchsc.edu|cusys.edu";
#We need to escape the . in a name like uccs.edu
if ($DOMAINS_COVERED ne ".*"){
$DOMAINS_COVERED =~ tr/./\./;
}
1;
$FILE_NO_LIMIT specifies the maximum number of URLs the crawler will fetch. It may fetch a fewer number if it cannot find the maximum number of URLs requested. $TRAVERSED_URL_FILE specifies the name of the file where the fetched URLs will be stored, one per line. $DOMAINS_COVERED specifies the domains which will be explored. In other words, when the program finds a URL outside these domains, it will silently reject the URL. The variable $EXCLUDEDPATTERNS gives a regular
expression that specifies the URLs that should be rejected. For example, in this case, URLs that should be rejected are ones that have the substring cgi-bin, cgi? , mailto:, etc., in them. Such URLs will not be explored. Also, URLs containing # will not crawled. Some of the URLs that the program will access are directories and others are simple files. The variable $MIMEEXTENSIONS specifies a regular expression listing a set of file extensions. File names ending with these extensions are considered by the program as simple files. Others are considered directories.
The second package that Crawl.pm uses is Tcp.pm. The package’s contents are a minor variation of the program defined in section 7.4.2.2. There are two subroutines: openTCPandFetch and parseURL. If
openTCPandFetch cannot create a socket to a certain server, it does not die, but simply warns and carries on with the next URL. It sends the GET method to the HTTP server as usual. It sends a couple more headers, saying it accepts all kinds of file MIME (Multi-purpose Internet Mail Extensions) types. The program also identifies itself to the server as an agent, or the so-called user agent, called UCCSAgent, version 1.0. If the reply that comes back from
the sever does not contain the string
200 OK
on line 1, saying the GET request was successful, the subroutine returns empty handed. Otherwise, it returns the contents of the file fetched. The parseURL subroutine is exactly like the one discussed in Section 7.4.2.2.
Finally, we discuss the package Crawl.pm. It contains two fairly long subroutines: findURLs and
getURLsRecursively. getURLsRecursively is the one called from the main program
findURLs.pl.
Program 7.30
#!/usr/bin/perl -w
#file Crawl.pm
package Crawl;
use strict;
use diagnostics -verbose;
use Tcp;
use URLConf;
#Given a URL, fetch the document and find all text URLs in it.
sub findURLs{
my ($pushedURL, $fileNew);
my ($href);
my (%thisFileURLHash);
my ($directoryNew);
my ($url) = @_;
my ($server, $port, $completeFilePath) = Tcp::parseURL ($url);
$completeFilePath =~ m@(.*/)(.*)@;
my ($directory, $file) = ($1, $2);
#We were getting files or directory names with spaces in them.
#This was trouble.
if ($directory =~ /\s+/ or $file =~ /\s+/){return;}
my @HTMLText = Tcp::openTCPandFetch ($server, $port, $completeFilePath);
chomp @HTMLText;
my $text = join " ", @HTMLText;
my @hrefs = ($text =~ /$URLConf::TRAVERSED_URL_FILE");
RECURSIVE_LOOP:
while (my @openURLKeys = keys %openURLs){
$aURL = shift (@openURLKeys);
if (exists ($traversedURLs{$aURL})) { next RECURSIVE_LOOP;}
$traversedURLs {$aURL}++;
delete $openURLs {$aURL};
print TRAVERSED_URLS "$aURL\n";
$noURLsTraversed++;
print "Just traversed URL No $noURLsTraversed: $aURL\n\n";
if ($noURLsTraversed >= $URLConf::FILE_NO_LIMIT){
last RECURSIVE_LOOP;
}
@thisURLsSons = findURLs ($aURL);
foreach $key (@thisURLsSons){
if (exists ($openURLs{$key}) or exists ($traversedURLs{$key})){
next;}
$openURLs{$key}++;
}
}#end RECURSIVE_LOOP
close (TRAVERSED_URLS);
print "\n\n*******Statistics********\n";
print "Number of URLs recorded = $noURLsTraversed\n";
}
1;
The subroutine findURLs is given a URL to traverse to. It parses the URL by calling Tcp::parseURL, i.e., the parseURL subroutine in the Tcp package. Given the URL, the parser subroutine returns a server name, a port, and a file name with complete path. The file path is then split into a directory name and a file name. Some people use blank spaces in file or directory names, and such names are summarily disqualified from further consideration. The next few lines obtains
all the URLs specified in an <A tag using the href attribute.
my @HTMLText = Tcp::openTCPandFetch ($server, $port, $completeFilePath);
chomp @HTMLText;
my $text = join " ", @HTMLText;
my @hrefs = ($text =~ /<A\s+href\s*=\s*"([^"]+)"/ig);
@hrefs = grep !/$URLConf::EXCLUDEPATTERNS/, @hrefs;
After obtaining the URLs by calling Tcp::openTCPandFetch, the subroutine discards URLs with any offending patterns as specified in the variable $EXCLUDEDPATTERNS in the URLConf package.
Next, the subroutine goes into a foreach loop where it examines every URL that is specified in the file. A URL can be absolute or relative. It handles absolute URLs that start with http:// first and then later the relative URLs. For an absolute URL, it parses the URL by calling Tcp::parseURL, and continues with a URL only if it is one of the domains expressly indicated. If it is in one of the indicated domains, it enters the URL as a key in the hash %thisFileURLHash
that contains all the URLs that have been obtained as useful from the file under consideration.
If it is a relative URL, it can take several forms: it can start with no . or / (e.g., abc.html, /abc/def.html), it can start with a single dot followed by a slash (./, e.g. ./abc.html), or it can start with ../
(e.g., ../abc/def/ghi.html), ../.., etc. We take care of these possibilities using an if statement with several elseif clauses. Our goal is not be exhaustive, but take care of the most common cases. In each case, we parse the URL, obtain its components, and then make the URL absolute based on the URL of the page where it occurs and what form the URL takes. The newly composed absolute URL is then entered as a key in the hash
%thisFileURLHash. Finally, below the foreach loop, the subroutine returns all the keys of this hash to the program that called it.
The second subroutine in the Crawl.pm package is called getURLsRecursively. It starts with a starting URL given to it, fetches URLs recursively from this starting URL till the number of URLs specified by the variable $URLConf::NO_FILES_TRAVERSED have been fetched and recorded in the file specified by $URLConf::TRAVERSED_URL_FILE.
getURLsRecursively uses two main data structures: %traversedURLs and %openURLs. The keys of the hash %traversedURLs are the unique URLs that have been traversed to by the subroutine and then recorded. The reason we use a hash to store traversed URLs is to ensure that only unique URLs are traveled to and recorded in the file. The hash %traversedURLs has the URL key, and the number of times the URL has been opened.
In reality, we are not interested in the number of times a URL has been traversed, just the list of unique URLs traversed.
The subroutine also uses the hash table %openURLs. This hash stores the URLs that are being considered for visiting at a certain time. To start with, %openURLs contains just the starting URL $startURL. Once again, we are using a hash and not a list to store the “open” URLs because we are not interested in having any duplicate entries in the open data structure at any time. Since hash keys are unique, using a hash ensures that the entries in the open data structure, here, the hash
%openURLs, are all unique. The value stored corresponding to a URL in %openURLs is not really used anywhere for any purpose.
The subroutine looks to see if the file specified by $URLConf::TRAVERSED_URL_FILE exists. If the file exists, it is deleted so that we can start recording traversed URLs fresh in every run of getURLsRecursively.
Next, the subroutine gets into a while loop labeled as RECURSIVE_LOOP. It is not really a recursive subroutine; it is iterative. In the conditional of this while loop, we obtain the keys in %openURLs and store them in @openURLKeys. It gets the first URL in @openURLKeys and records it as traversed in %traversedURLs
by increasing its traversal count. It then deletes this URL from the %openURLs hash because it is no longer “open” or yet to-be-traversed. Then, the line
@thisURLsSons = findURLs ($aURL);
obtains all the URLs specified in the HTML text of this URL.
There is now a foreach loop where we go through every URL in the current file. If a relevant URL does not exist as a key in %openURLs or in %traversedURLs, it is entered as a key in %openURLs so that it will be traversed later. This is in a sense, the agenda of URLs to travel to in the future.
The number of URLs traversed is next incremented. If the number of URLs traversed crosses the limit specified by the variable $URLCOnf::FILE_NO_LIMIT, the subroutine’s work is done and it gets out of the while loop labeled RECURSIVE_LOOP.
After the while loop is executed, the keys of %traversedURLs are printed to the file specified in the variable $URLConf::TRAVERSED_URL_FILE.
Note, since the open and traversed data structures are hashes, the URLs are not being traversed in any specific order. The order depends on the way hashes are ordered.
