9.1 Fetching a Web Page: Module LWP::Simple
A large set of Perl modules is bundled together as the Library for (World Wide)
Web Programming, known as libwww-perl or LWP for short. It is distinct from
libnet, a separate collection of modules for protocols such as FTP and SMTP.
One of the simplest modules is LWP::Simple. As the name suggests, it is a
simplified, function-based interface to LWP, not a subclass of it. If the goal
at hand is to fetch a Web page and do something with it, this package is all
that we need. The following program fetches a Web page and saves it to a local
file. It also prints every URL that appears in the page as the target of a
hyperlink, that is, in an <A href=...> construct in HTML.
Program 9.1
#!/usr/bin/perl
#file fetchHeadURL.pl
use LWP::Simple;
use strict;
#URL to fetch
my $url = "http://www.assam.org";
my $localFile = "assamorg.html";
my @head = LWP::Simple::head ($url);
exit 0 unless (@head);
my ($contentType, $documentLength,
$modificationTime, $expires, $server) = @head;
print "$url is a file of type $contentType\n";
print "$url is $documentLength bytes long\n";
print "$url was last modified on ", scalar (localtime($modificationTime)), "\n";
print "$url expires on $expires\n" if ($expires);
print "$url is served by $server\n";
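#get gets the URL as a string; it returns undef if the fetch fails
my $content= LWP::Simple::get ($url);
die "Could not fetch $url\n" unless (defined $content);
#extract the hrefed URLs (pattern reconstructed along the lines of Section 4.8.3.2)
my @hrefs = ($content =~ /<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)/gi);
print "All hrefs in the page = \n";
print join ("\n", @hrefs), "\n";
#save the page to the local file
open (FILE, ">$localFile") or die "Cannot write $localFile: $!\n";
print FILE $content;
close FILE;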
The program is given a URL to fetch and the name of a local file in which to
save the fetched page. First, it fetches the head of the page. Called in list
context, the LWP::Simple::head function returns a list with five elements, or
an empty list if the request fails, in which case the program exits. The
elements are extracted from the list and printed. The first element gives the
MIME type of the content of the Web page. The content type is sent by the Web
server to a client or a browser in the header so that the client can decide how
to display the content. In this example, the content type is text/html,
signifying that it is a text file written using the syntax of the HTML
language. The second element gives the length of the page in bytes. The third
element gives the time when the page was last modified, as the number of
seconds since the so-called epoch. The epoch is system-specific. For Unix, it
is 00:00:00 UTC on January 1, 1970. Subtracting this value from the current
time gives the age of the page. The fourth element, the expiry date, can be
empty for a Web page. The fifth element identifies the server software, often
along with the operating system and the modules it runs. Next, the program gets
the actual content of the URL by using the LWP::Simple::get function. While
head obtains the header from the Web server, get obtains the actual Web page.
Each takes a URL as an argument. These two functions automatically set up
TCP-based socket connections to the designated Web servers, send the HTTP
commands as needed, capture the data that come back to the client, and then
save them in appropriate data structures. Because connections need to be set up
and communications need to take place, these functions usually take some time
to return.
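
Because either call can fail, it may help to check the results before using them. The following is a minimal sketch, not part of Program 9.1, of one way to do so. It assumes the getstore and is_success functions that LWP::Simple exports alongside get and head. The helper names fetch_and_save and age_in_days are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;   # exports get, head, getstore, is_success, ...

# Hypothetical helper: fetch $url into $file and report success.
# getstore returns the HTTP status code of the response.
sub fetch_and_save {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    return is_success($status) ? 1 : 0;
}

# Hypothetical helper: age in whole days of a page whose modification
# time is given in seconds since the epoch.
sub age_in_days {
    my ($modificationTime, $now) = @_;
    $now = time unless defined $now;
    return int(($now - $modificationTime) / 86400);
}

# Run only when a URL and a file name are given on the command line.
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    my @head = head($url);            # empty list if the request fails
    if (@head && defined $head[2]) {
        print "$url was modified about ",
              age_in_days($head[2]), " days ago\n";
    }
    if (fetch_and_save($url, $file)) {
        print "Saved $url to $file\n";
    } else {
        warn "Could not fetch $url\n";
    }
}
```

Run with a URL and a file name as arguments, the sketch reports the approximate age of the page and saves it. With get, the corresponding check is to test the returned string with defined, since get returns undef when the fetch fails.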
The program culls out all hrefed URLs from the page. This part of the program
is based on the discussion in Section 4.8.3.2. It also prints out the contents
of the URL to a file. Such a program can be the basis for a sophisticated
crawler or a search engine program.
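
The href extraction step can be tried in isolation, without any network access. The sketch below is an illustration under stated assumptions, not the pattern from Section 4.8.3.2. The regular expression is one plausible choice, the sample page is made up (it reuses a few URLs from the run shown in this section), and the helper names extract_hrefs and absolute_http_links are hypothetical. The second helper shows the sort of filtering a crawler might apply, keeping only the absolute http links that can be fetched directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the value of every quoted href attribute
# found inside an <A ...> tag, in document order.
sub extract_hrefs {
    my ($html) = @_;
    my @found = ($html =~ /<a\s[^>]*?href\s*=\s*["']([^"']*)["']/gi);
    return @found;
}

# Hypothetical helper: keep only absolute http/https URLs, which are
# the ones a crawler could fetch directly.
sub absolute_http_links {
    return grep { m{^https?://}i } @_;
}

# Made-up sample page reusing a few URLs from the run shown in the text.
my $sample = <<'HTML';
<html><body>
<A href="http://www.assamcompany.com">Company</A>
<a class="nav" href='chat/livechat.htm'>Chat</a>
<A HREF="mailto:webmaster@assam.org">Mail</A>
</body></html>
HTML

my @hrefs = extract_hrefs($sample);
print "All hrefs in the page =\n", join("\n", @hrefs), "\n";
print "Fetchable links =\n", join("\n", absolute_http_links(@hrefs)), "\n";
```

A regular expression of this kind is only a rough tool for HTML. It misses unquoted attribute values and can be confused by markup inside comments, so a real crawler would more likely rely on an HTML parsing module.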
A partial output of one run of the program is given below.
http://www.assam.org is a file of type text/html
http://www.assam.org is 25141 bytes long
http://www.assam.org was last modified on Thu Jan 25 02:07:54 2001
http://www.assam.org is served by Apache/1.3.14 (Unix) (Red-Hat/Linux) ApacheJServ/1.1.2 mod_ssl/2.7.1 OpenSSL/0.9.6 PHP/4.0.2 mod_perl/1.21
All hrefs in the page =
http://www.assamcompany.com
mailto:kalita@pikespeak.uccs.edu,webmaster@assam.org
http://assamcompany.com/netourism/start.html
http://mail.bigmailbox.com/users/assamorg/signup.cgi
http://mail.bigmailbox.com/users/assamorg/forgotpassword.cgi
chat/livechat.htm
http://assam.org/assam/AssamBulletinBoard/
http://www.assam.org/assam/individuals/
http://www.wunderground.com/global/stations/42314.html
http://www.wunderground.com/global/stations/42410.html
The output continues with more such URLs.
