9.2.2 Obtaining a URL From a Web Server

The next program first issues an HTTP HEAD request to obtain information about a URL. Only if the HEAD request succeeds does the program go on to fetch the actual file. The program follows.

 Program 9.6

#!/usr/bin/perl
#file fetchHeadURL1.pl

use LWP::UserAgent;
use URI;
use strict;
$" = "\n\t";

my ($url, $uri, $ua);
my ($localFile);
my ($headerRequest, $headerResponse);
my ($contentRequest, $contentResponse, $headers, $content, @hrefs);

$url = "http://www.cs.uccs.edu/~kalita";
$uri = URI->new($url);
$ua = LWP::UserAgent->new();

$localFile = "assamorg.html";

$headerRequest = HTTP::Request->new(HEAD=>$uri);
$headerResponse = $ua->request ($headerRequest);
unless ($headerResponse->is_success){
    print $headerResponse->error_as_HTML;
    exit 1;
}

$contentRequest = HTTP::Request->new(GET=>$uri);
$contentResponse = $ua->request ($contentRequest);
$headers = $contentResponse->headers;
$content = $contentResponse->content;

printf "%-15s %-30s\n", "Header Name", "Header Value";
print  "-" x 60, "\n";
$headers->scan (\&headerScanner);
print  "-" x 60, "\n";

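#collect the href values of all anchor tags (assumes double-quoted values)
@hrefs = ($content =~ /<a\b[^>]*?\bhref\s*=\s*"([^"]*)"/gi);
print "All hrefs in the page =\n\t@hrefs\n";

#save a local copy of the page
open (FILE, ">$localFile") or die "Cannot open $localFile: $!";
print FILE $content;
close FILE;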

#callback subroutine to process header entries
sub headerScanner{
   my ($headerName, $headerValue) = @_;
   printf "%-15s %-30s\n", $headerName, $headerValue;
}

Like the previous program, this program is given a URL, which it converts to a URI object. It creates an LWP::UserAgent object and sends an HTTP HEAD request for the URI using the request method of that object. If the HEAD request is unsuccessful, the program prints the error and exits. If it is successful, the program builds a GET request, sends it, and captures the response. The following lines of code do that.

 

$contentRequest = HTTP::Request->new(GET=>$uri);
$contentResponse = $ua->request ($contentRequest);
$headers = $contentResponse->headers;
$content = $contentResponse->content;

 

The GET method requests a URI on the server. As usual, the request is sent to the server by using the request method of the LWP::UserAgent object. The response that comes back from the server is stored in $contentResponse; the user agent always returns it as an HTTP::Response object. We call the headers method to capture the header part of the response. Unlike the previous program, this time we also call the content method to capture the content in the variable $content. In the previous program, we sent only a HEAD request, so the content part of the response was empty. This time, the content part contains the actual text of the requested file. The headers method returns an object of type HTTP::Headers, whereas the content method returns the actual content of the requested URL, which can be textual or binary.
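The division of an HTTP::Response object into a status, a header part, and a content part can be examined without touching the network. The following is a minimal sketch that builds a response by hand; the status, the header values, and the body are made up for illustration and do not come from a real server.

```perl
#!/usr/bin/perl
# Sketch: inspect the parts of a hand-built HTTP::Response (no network needed).
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response; a real one would come back from $ua->request(...).
my $response = HTTP::Response->new(
    200, "OK",
    [ "Content-Type" => "text/html", "Content-Length" => 28 ],
    "<html><body>Hi</body></html>",
);

my $headers = $response->headers;    # an HTTP::Headers object
my $content = $response->content;    # the body as a string

print "Success?     ", ($response->is_success ? "yes" : "no"), "\n";
print "Headers are: ", ref($headers), "\n";
print "Type:        ", $headers->header("Content-Type"), "\n";
print "Body length: ", length($content), "\n";
```

A response obtained from a real server can be inspected with exactly the same method calls.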

Before we look at the requested URI’s contents, we look at the HTTP headers that came back with the file. HTTP headers always accompany a response from a server. We call the scan method of the HTTP::Headers class, which invokes our callback subroutine headerScanner once for each header line, to print the headers one by one. Next, we look at the actual content of the file. Since the requested file is small, we handle it directly as a string. The program uses a regular expression to cull all URLs that appear in href attributes of the fetched file and prints them on the screen. The program also writes the contents of the fetched file to a local file, thus mirroring the Web page locally.
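The callback mechanism behind scan can be tried in isolation. The sketch below assumes a small, hand-built HTTP::Headers object whose field values are hypothetical; it uses an anonymous subroutine as the callback and also records each field in a hash, which is our own addition for illustration.

```perl
#!/usr/bin/perl
# Sketch: HTTP::Headers->scan calls a subroutine once per header field.
use strict;
use warnings;
use HTTP::Headers;

# Hypothetical header set, loosely modeled on a typical server response.
my $headers = HTTP::Headers->new(
    "Content-Type"   => "text/html",
    "Content-Length" => 4225,
    "Server"         => "Apache",
);

my %seen;    # header name => value, filled in by the callback

# scan passes (name, value) to the callback for every header field.
$headers->scan(sub {
    my ($name, $value) = @_;
    $seen{$name} = $value;
    printf "%-15s %-30s\n", $name, $value;
});

print scalar(keys %seen), " header fields scanned\n";
```

A named subroutine passed by reference, as in \&headerScanner, works the same way; the anonymous form simply keeps the callback next to the call.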

The output of the program is given below.

 

Header Name     Header Value
------------------------------------------------------------
Connection      close
Date            Wed, 04 Apr 2001 20:03:52 GMT
Accept-Ranges   bytes
Server          Apache/1.3.14 (Unix)  (Red-Hat/Linux) PHP/3.0.18 mod_perl/1.23
Content-Length  4225
Content-Type    text/html
ETag            "650328-1081-3ab64734"
Last-Modified   Mon, 19 Mar 2001 17:51:48 GMT
Client-Date     Wed, 04 Apr 2001 13:07:28 GMT
Client-Peer     128.198.162.68:80
Link            <default.css>; rel="stylesheet"; type="text/css"
Title           Jugal Kalita
------------------------------------------------------------
All hrefs in the page =
 xml/index.xml
 teaching-philosophy.pdf
 schedule.html
 research.html
 http://www.shillong.com
 http://www.autoindia.com
 http://www.indiashipping.com
 http://www.assam.org
 http://www.assamcompany.com
 cultural.html
 http://www.upenn.edu/index.html
 http://www.usask.ca
 http://www.rahul.net/kgpnet/iit/iit.html
 http://www.uccs.edu
 http://www.cs.uccs.edu/~kalita/accesswatch/accesswatch-1.32/index.html

 

It is not really necessary to issue a HEAD request to gather information about a page before issuing a GET request to fetch it. One can simply issue the GET request and skip the HEAD request. A HEAD request is useful only when we want to find out certain information about the file, such as its last modification date, before deciding whether to fetch it. This can help when creating a local mirror of a large Web site: only files modified since the last mirroring need to be fetched, which can greatly reduce the amount of fetching. Fetching the head and fetching the whole file each require one exchange over the network. But the head carries far less information than the whole file, so both the amount of data transmitted and the computation at the server and client ends are reduced.
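The mirroring decision described above can be sketched as a small test on the Last-Modified header of a HEAD response. The helper name needsRefresh is hypothetical and not part of LWP, and the policy of refetching whenever the server supplies no date is our assumption. The HEAD response is built by hand, using the Last-Modified value from the output shown earlier, so that the sketch runs without a network connection.

```perl
#!/usr/bin/perl
# Sketch: decide from a HEAD response whether a mirrored copy is stale.
use strict;
use warnings;
use HTTP::Response;
use HTTP::Date qw(str2time);

# needsRefresh is a hypothetical helper, not part of LWP.
# $headResponse: HTTP::Response from a HEAD request
# $localTime:    modification time of the local copy (epoch seconds),
#                or undef if there is no local copy yet
sub needsRefresh {
    my ($headResponse, $localTime) = @_;
    return 1 unless defined $localTime;          # nothing mirrored yet
    my $remoteTime = $headResponse->headers->last_modified;
    return 1 unless defined $remoteTime;         # no date from server: refetch
    return $remoteTime > $localTime ? 1 : 0;
}

# Hand-built HEAD response with a Last-Modified header.
my $head = HTTP::Response->new(
    200, "OK", [ "Last-Modified" => "Mon, 19 Mar 2001 17:51:48 GMT" ]);

my $older = str2time("Sun, 18 Mar 2001 00:00:00 GMT");  # copy predates change
my $newer = str2time("Tue, 20 Mar 2001 00:00:00 GMT");  # copy postdates change

print "stale copy:  ", needsRefresh($head, $older), "\n";
print "fresh copy:  ", needsRefresh($head, $newer), "\n";
print "no copy yet: ", needsRefresh($head, undef), "\n";
```

In a real mirroring program, the local time would typically come from the file itself, for example (stat $localFile)[9], and the GET request would be issued only when the helper returns 1.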

The following program fetches a URL without fetching the head information first. The program also culls out all the hrefed URLs specified in the file, just like the previous program.

 Program 9.7

#!/usr/bin/perl
#file fetchURL3.pl

use LWP::UserAgent;
use URI;
use strict;
$" = "\n\t";

my ($url, $uri, $ua);
my ($localFile);
my ($contentRequest, $contentResponse, $headers, $content, @hrefs);

$localFile = "assamorg.html";
$url = "http://www.cs.uccs.edu/~kalita";
$uri = URI->new($url);
$ua = LWP::UserAgent->new();

$contentRequest = HTTP::Request->new(GET=>$uri);
$contentResponse = $ua->request ($contentRequest);
if ($contentResponse->is_success){
    $content = $contentResponse->content;
}
else{
    print $contentResponse->error_as_HTML;
    exit 1;
}

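#collect the href values of all anchor tags (assumes double-quoted values)
@hrefs = ($content =~ /<a\b[^>]*?\bhref\s*=\s*"([^"]*)"/gi);
print "All hrefs in the page =\n\t@hrefs\n";

#save a local copy of the page
open (FILE, ">$localFile") or die "Cannot open $localFile: $!";
print FILE $content;
close FILE;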

The program sends out a GET request using the request method of the user agent. When the response comes back, it uses the is_success method of the HTTP::Response object to see whether the server processed the request successfully. If the response contains an error, the program prints it and exits. Otherwise, it takes the content from the response and culls out all the hrefed URLs, as before. It also writes the fetched file to a local mirror file.
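The culling of hrefed URLs relies on a pattern match with the g modifier, which in list context returns every captured substring. The following is a minimal sketch on a made-up HTML string, so it needs no network access. The pattern is an assumption for illustration: it handles only href values in double quotes and is not a general HTML parser.

```perl
#!/usr/bin/perl
# Sketch: collect href values from an HTML string with a global match.
use strict;
use warnings;

$" = "\n\t";    # list separator used when an array is interpolated

# Hypothetical page content; a real program would use $response->content.
my $content = <<'HTML';
<a href="research.html">Research</a>
<A HREF="http://www.assam.org">Assam</A>
<a class="x" href="schedule.html">Schedule</a>
HTML

# In list context, a /g match returns every captured group.
# The pattern is an assumption: it handles only double-quoted href values.
my @hrefs = ($content =~ /<a\b[^>]*?\bhref\s*=\s*"([^"]*)"/gi);

print "All hrefs in the page =\n\t@hrefs\n";
```

Because $" is set to a newline followed by a tab, interpolating @hrefs in the print statement places each URL on its own indented line. For pages with unquoted or single-quoted attributes, a parser module such as HTML::LinkExtor is likely to be more reliable than a regular expression.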