9.1.1 Fetching Documents from the Web: Simple "Web Crawling"


There are many sites on the Web where useful documents are stored so that interested individuals can download them. For example, the site http://www.cis.upenn.edu/~techreports/abstracts01.html at the University of Pennsylvania contains abstracts of all technical reports published by the Computer Science Department in the year 2001. Most of these abstracts have links to PostScript or PDF files that are downloadable from the Web. The following program fetches the HTML page corresponding to the URL given above, obtains the links to the technical reports, and then downloads the technical reports to local files.

 Program 9.2

#!/usr/bin/perl
#file getFiles.pl
use LWP::Simple;

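my ($ftpURL, $ftpContents, $ftpFile);
my ($htmlURL, $htmlContents, @ftpHrefs, $includedExts);
$includedExts = "ps[.]gz|pdf|ps";
$htmlURL = "http://www.cis.upenn.edu/~techreports/abstracts01.html";

# get returns undef on failure; it does not set $!
$htmlContents = LWP::Simple::get ($htmlURL) or
       die "Couldn't fetch $htmlURL using LWP::Simple::get";

# NOTE: the listing was damaged between the href match and the open call.
# Those lines are reconstructed from the description below (an assumption).
@ftpHrefs = ($htmlContents =~ /<a\s+href\s*=\s*"([^"]+)"/gi);
@ftpHrefs = grep {/[.]($includedExts)$/i} @ftpHrefs;

foreach $ftpURL (@ftpHrefs){
    ($ftpFile) = ($ftpURL =~ m{([^/]+)$});
    print "Fetching URL: $ftpURL\n";
    $ftpContents = LWP::Simple::get ($ftpURL);
    next unless defined $ftpContents;
    print "Saving to local file:  $ftpFile\n";
    open (FTPFILE, ">$ftpFile") or die "Couldn't open $ftpFile: $!";
    binmode FTPFILE;
    print FTPFILE $ftpContents;
    close FTPFILE;
}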

Although the identifiers have the term FTP in their names, the program does not actually use the FTP protocol. The HTML file containing the abstracts is first fetched using LWP::Simple::get. The URLs that appear in href attributes are then culled from the HTML file, and only those that end with one of the required extensions are kept. The required extensions in this program are ps, pdf, and ps.gz; that is, the files are in Adobe PostScript or PDF format. The PostScript files may have been compressed using the GNU zip command (i.e., gzip), in which case they have the gz extension. We know that these are the relevant extensions from examining the HTML page. Our goal is to download all the technical reports to our local machine so that we can peruse them either on the computer screen or after printing them.
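The culling and filtering steps described above can be tried on a small piece of HTML without touching the network. The following is only a sketch: the sample HTML string and the helper name extract_links are made up for illustration, and a regular expression is a rough substitute for a real HTML parser (it assumes that href values are enclosed in double quotes).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the href values in $html whose names end
# with one of the extensions in $exts (a regex alternation).
sub extract_links {
    my ($html, $exts) = @_;
    # Assumes href values are enclosed in double quotes.
    my @hrefs = ($html =~ /<a\s[^>]*?href\s*=\s*"([^"]+)"/gi);
    return grep { /[.](?:$exts)$/i } @hrefs;
}

# Made-up sample page; only the first two links have a wanted extension.
my $html = <<'END_HTML';
<a href="http://www.cis.upenn.edu/~rtg/papers/MS-CIS-01-01.ps.gz">PostScript</a>
<a href="http://db.cis.upenn.edu/DL/ubql.pdf">PDF</a>
<a href="http://www.cis.upenn.edu/index.html">Home</a>
END_HTML

my @links = extract_links($html, "ps[.]gz|pdf|ps");
print "$_\n" foreach @links;
```

Run on the sample string, the sketch prints the two report URLs and drops the link to index.html.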

The program goes into a foreach loop where it obtains the name of the actual file from the URL. The name of the file is the component in the URL after the last /. Then, it obtains the contents of the file by making the following call.

 

    $ftpContents = LWP::Simple::get ($ftpURL);
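The file-name step, and what happens to the fetched contents afterwards, might look like the following sketch. The helper names local_name and save_contents are hypothetical. LWP::Simple::get returns undef when a fetch fails, so the result is worth checking before anything is written. The use of binmode is an assumption on our part, made because compressed PostScript and PDF files are binary data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the local file name is whatever follows the last /.
sub local_name {
    my ($url) = @_;
    my ($name) = ($url =~ m{([^/]+)$});
    return $name;    # undef if the URL ends with a slash
}

# Hypothetical helper: write $contents to $file byte for byte.
sub save_contents {
    my ($file, $contents) = @_;
    open(my $out, '>', $file) or die "Couldn't open $file: $!";
    binmode $out;    # no newline translation for .gz and .pdf data
    print $out $contents;
    close $out or die "Couldn't close $file: $!";
}

my $ftpURL = "http://www.cis.upenn.edu/~rtg/papers/MS-CIS-01-01.ps.gz";
print "Local file name: ", local_name($ftpURL), "\n";

# In the download loop, the write would be guarded like this:
#   my $ftpContents = LWP::Simple::get($ftpURL);
#   save_contents(local_name($ftpURL), $ftpContents) if defined $ftpContents;
```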

 

The contents of the downloaded file are then written to the local file. The download is performed by issuing a GET request to the HTTP server, which is what the get function does. Part of what this program prints on the screen during a sample run is given below.

 

Fetching URL: http://www.cis.upenn.edu/~rtg/papers/MS-CIS-01-01.ps.gz
Saving to local file:  MS-CIS-01-01.ps.gz
Fetching URL: http://www.cis.upenn.edu/~lwzhao/papers/cognitive.pdf
Saving to local file:  cognitive.pdf
Fetching URL: http://www.cis.upenn.edu/~sotiris/papers/strongman.ps
Saving to local file:  strongman.ps
Fetching URL: http://www.cis.upenn.edu/~sotiris/papers/subos.ps
Saving to local file:  subos.ps
Fetching URL: http://www.cis.upenn.edu/~rtg/papers/MS-CIS-01-07.ps.gz
Saving to local file:  MS-CIS-01-07.ps.gz
Fetching URL: http://www.seas.upenn.edu/~sachinc/report.ps
Saving to local file:  report.ps
Fetching URL: http://www.cis.upenn.edu/~pengs/publications/fc_tech_report.ps.gz
Saving to local file:  fc_tech_report.ps.gz
Fetching URL: http://db.cis.upenn.edu/DL/ubql.pdf
Saving to local file:  ubql.pdf

 

In this sample run, the program downloads eight technical reports to the local machine. The download uses the HTTP protocol, not the FTP protocol. FTP could also have been used, but we do not discuss that here. In this specific case, all the URLs that refer to the downloaded files happen to be absolute, so we do not have to massage them before we start downloading. If the URLs were relative, it would be necessary to make them absolute first.
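If a page did contain relative links, one common way to make them absolute is the URI module, which LWP itself depends on. The sketch below is an illustration and not part of the program above: the hrefs in it are made up, and the helper name absolutize is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $href against the URL of the page it came from.
sub absolutize {
    my ($href, $pageURL) = @_;
    return URI->new_abs($href, $pageURL)->as_string;
}

my $pageURL = "http://www.cis.upenn.edu/~techreports/abstracts01.html";

# Made-up hrefs: one relative to the page, one relative to the server root,
# and one that is already absolute.
foreach my $href ("papers/report.ps", "/~rtg/papers/x.ps.gz",
                  "http://db.cis.upenn.edu/DL/ubql.pdf") {
    print absolutize($href, $pageURL), "\n";
}
```

Resolving against the URL of the page, rather than gluing a fixed prefix onto each href, handles page-relative, root-relative, and already absolute links in the same way.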

A variation of the program given above can be used to download Perl modules from the http://www.cpan.org site. The following program looks at the Web page that lists the recently contributed Perl modules and downloads those that satisfy a certain regular expression.

 Program 9.3

#!/usr/bin/perl
#file PerlModules.pl
use LWP::Simple;

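my ($ftpURL, $ftpContents, $ftpFile);
my ($htmlURL, $htmlContents, @ftpHrefs, $includedExts, $baseURL);
my ($searchPattern) = @ARGV;
die "Usage: $0 pattern\n" unless defined $searchPattern;

$includedExts = "tar[.]gz";
$htmlURL = "http://www.cpan.org/RECENT.html";
$baseURL = "http://www.cpan.org";

# get returns undef on failure; it does not set $!
$htmlContents = LWP::Simple::get ($htmlURL) or
       die "Couldn't fetch $htmlURL using LWP::Simple::get";

# NOTE: the listing was damaged between the href match and the open call.
# Those lines are reconstructed from the description below (an assumption).
@ftpHrefs = ($htmlContents =~ /<a\s+href\s*=\s*"([^"]+)"/gi);
@ftpHrefs = grep {/[.]($includedExts)$/ and /$searchPattern/} @ftpHrefs;

foreach my $href (@ftpHrefs){
    # Assumes the hrefs are relative to $baseURL, as the sample run suggests.
    $ftpURL = "$baseURL/$href";
    ($ftpFile) = ($ftpURL =~ m{([^/]+)$});
    print "Fetching URL: $ftpURL\n";
    $ftpContents = LWP::Simple::get ($ftpURL);
    next unless defined $ftpContents;
    print "Saving to local file:  $ftpFile\n";
    open (FTPFILE, ">$ftpFile") or die "Couldn't open $ftpFile: $!";
    binmode FTPFILE;
    print FTPFILE $ftpContents;
    close FTPFILE;
}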

The program is called with a command-line argument that acts as a regular expression. It obtains the Web page at http://www.cpan.org/RECENT.html and parses the page to obtain all URLs that end with the tar.gz extension. Then, it keeps those URLs that match the regular expression given as the command-line argument. Finally, it downloads the tarred sources of the recent Perl modules that satisfy the search criterion. An interaction with this program is given below.

 

%getPerlModules.pl "XML"
Fetching URL: http://www.cpan.org/authors/id/A/AN/ANDREIN/DBIx-XMLMessage-0.04.tar.gz
Saving to local file:  DBIx-XMLMessage-0.04.tar.gz
Fetching URL: http://www.cpan.org/authors/id/K/KR/KRAEHE/XML-Handler-YAWriter-0.18.tar.gz
Saving to local file:  XML-Handler-YAWriter-0.18.tar.gz
Fetching URL: http://www.cpan.org/authors/id/K/KR/KRAEHE/XML-Handler-YAWriter-0.19.tar.gz
Saving to local file:  XML-Handler-YAWriter-0.19.tar.gz
Fetching URL: http://www.cpan.org/modules/by-category/07_Database_Interfaces/DBIx/DBIx-XMLMessage-0.04.tar.gz
Saving to local file:  DBIx-XMLMessage-0.04.tar.gz
Fetching URL: http://www.cpan.org/modules/by-category/11_String_Lang_Text_Proc/XML/XML-Handler-YAWriter-0.19.tar.gz
Saving to local file:  XML-Handler-YAWriter-0.19.tar.gz
Fetching URL: http://www.cpan.org/modules/by-module/DBIx/DBIx-XMLMessage-0.04.tar.gz
Saving to local file:  DBIx-XMLMessage-0.04.tar.gz
Fetching URL: http://www.cpan.org/modules/by-module/XML/XML-Handler-YAWriter-0.19.tar.gz
Saving to local file:  XML-Handler-YAWriter-0.19.tar.gz

pikespeak[98]: ls

 

The program is called with "XML" as the command-line argument. Hence, the program finds those Perl modules that have been recently contributed to the http://www.cpan.org site and whose URLs contain the character string XML, and then downloads them to the local machine. These are compressed tar files containing the source programs for these modules. One can then go about installing these modules on the local computer if so desired.
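In the sample run above, DBIx-XMLMessage-0.04.tar.gz and XML-Handler-YAWriter-0.19.tar.gz are each fetched three times, because the same file is listed under several directories and every copy is saved under the same local name. A possible refinement, sketched below with a hypothetical helper select_urls, applies the search pattern and then keeps only the first URL for each local file name. The URL list is copied from the sample run so that the sketch runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep URLs that match $pattern, skipping any URL whose
# local file name (the part after the last /) has already been selected.
sub select_urls {
    my ($pattern, @urls) = @_;
    my $re = qr/$pattern/;          # compile the command-line pattern once
    my (%seen, @selected);
    foreach my $url (@urls) {
        next unless $url =~ $re;
        my ($file) = ($url =~ m{([^/]+)$});
        next if !defined($file) || $seen{$file}++;
        push @selected, $url;
    }
    return @selected;
}

my @urls = (
    "http://www.cpan.org/authors/id/A/AN/ANDREIN/DBIx-XMLMessage-0.04.tar.gz",
    "http://www.cpan.org/authors/id/K/KR/KRAEHE/XML-Handler-YAWriter-0.18.tar.gz",
    "http://www.cpan.org/authors/id/K/KR/KRAEHE/XML-Handler-YAWriter-0.19.tar.gz",
    "http://www.cpan.org/modules/by-module/DBIx/DBIx-XMLMessage-0.04.tar.gz",
    "http://www.cpan.org/modules/by-module/XML/XML-Handler-YAWriter-0.19.tar.gz",
);

print "$_\n" foreach select_urls("XML", @urls);
```

With this list and the pattern XML, only the three URLs under authors/id are kept, so each tar file would be fetched once.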