7.4.2.2 Fetching a Web Page

In this section, we discuss a program that acts as a simple HTTP client, fetching a Web page from a specified Web server. Here, we write the client program only. The companion server program already exists: it is whichever standard server, such as the Apache Web server, the site we want to communicate with employs. A Web browser like Netscape or Microsoft Internet Explorer is a very sophisticated HTTP client. Fetching textual Web pages from servers is a very small part of all that a sophisticated browser can do. A simple textual browser like Lynx is closer to what we do below in that it can deal with only textual pages.

The program given below has a main section toward the end. There it calls a subroutine called getURL that parses an absolute URL and, if the URL has the right syntax, fetches the file it denotes by contacting the Web server specified in the URL. The program is given next, with discussion following it.

 Program 7.27

#!/usr/bin/perl -w
#file fetchURL1.pl
use strict;
use IO::Socket;

#Open a TCP connection and get the URL as specified in the
#input list: (server-name, port-number, file-address)
sub openTCPConnection{
    my ($server, $port, $fileAddr) = @_;
    my $socket = IO::Socket::INET->new(
                        PeerAddr => $server,
                        PeerPort => $port,
                        Proto => 'tcp'); 
    die "Couldn't connect: $!" unless $socket;
    print "Created a TCP socket with $server\n";
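    #HTTP/1.1 requires a Host header; "Connection: close" asks the server
    #to close the connection after replying, so that the read below ends
    print $socket "GET $fileAddr HTTP/1.1\r\n",
                  "Host: $server\r\n",
                  "Connection: close\r\n\r\n";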

    #Read all the lines of the remote file to @remoteFile
    my @remoteFile = <$socket>;
    close ($socket);
    return @remoteFile;
}

#parse the URL and return a list: (server-name, port-number, file-address)
sub parseURL{
    my ($url) = @_;
    my ($server, $port, $file);
    
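    #The path is optional, so that a URL ending at the server name
    #or at a bare "/" still matches
    $url =~ m#^http://([^/:]+)(:(\d+))?(/.*)?#i
        or die "Not a well-formed http URL: $url\n";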
    $server = $1;
       #If port is not provided, default to 80
    ($port = $3) || ($port = 80);
       #$4 is the file name;  if $4 is empty, "/" is the file name
    ($file = $4) ||  ($file = "/");
    return ($server, $port, $file);
}

#Given a URL, get the file. Assume HTTP protocol
sub getURL{
    my ($url) = @_;
    my ($server, $port, $file) = parseURL ($url);
      #open the TCP connection and return the fetched file
    openTCPConnection ($server,  $port, $file);
}


#main
my $url = "http://www.uccs.edu/";
print getURL ($url);

The program has three subroutines: openTCPConnection, parseURL and getURL. The main program calls getURL with a URL as argument. We make the assumption that the URL is absolute, i.e., starts with http://, and is well-formed.

The getURL subroutine calls parseURL on the URL passed to it. parseURL returns three components of the URL: the HTTP server name, the port number, and the file. Once these three components are obtained, getURL calls openTCPConnection with these components as arguments. openTCPConnection returns a list containing the lines of the server's response: the HTTP status line and headers, followed by the textual lines of the file on the server corresponding to the requested URL. getURL simply returns what openTCPConnection returns.

The parseURL subroutine gets a well-formed URL as argument. It performs pattern matching to extract the server name, the port number if specified, and the file name that follows. If the port number is not provided, it defaults to 80. If the file name is empty after parsing, the file name is returned as / by default.
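The parsing step described above can be sketched on its own, without any network access. The subroutine name parseURLSketch is hypothetical and is not part of Program 7.27; the sketch assumes that the path part of the URL is optional and that a URL that does not start with http:// should produce an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Program 7.27): split an absolute
# http:// URL into (server, port, file); return an empty list otherwise.
sub parseURLSketch {
    my ($url) = @_;
    # $1 = server name, $2 = optional port digits,
    # $3 = optional file part, starting at the first "/"
    return () unless $url =~ m#^http://([^/:]+)(?::(\d+))?(/.*)?\z#i;
    my ($server, $port, $file) = ($1, $2, $3);
    $port = 80  unless defined $port;    # default HTTP port
    $file = "/" unless defined $file;    # default file
    return ($server, $port, $file);
}

# Print the three components of a few sample URLs
for my $url ("http://www.uccs.edu/", "http://example.com:8080/a/b.html") {
    my ($server, $port, $file) = parseURLSketch($url);
    print "$url -> server=$server port=$port file=$file\n";
}
```

For http://www.uccs.edu/ the sketch yields the server www.uccs.edu, the port 80 and the file /; for the second URL it yields example.com, 8080 and /a/b.html.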

The openTCPConnection subroutine takes three arguments: server name, port number and file address. It opens a TCP socket on the local machine at a port that the program does not specify; the operating system chooses one. The other end of the socket is at the HTTP server, at the port obtained from the URL. If the creation of the socket is successful, the subroutine prints the HTTP GET method on the socket with the required arguments: the file name on the server, and the HTTP protocol version. HTTP is the protocol, or language, used by a Web server and a Web client to talk to each other. HTTP allows several methods or commands, one of them being GET.
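The text of such a request can be sketched as a small string-building helper. The name buildGetRequest is hypothetical. The Host and Connection headers are assumptions about what a present-day HTTP/1.1 server expects, beyond the request line described above; the line endings follow the HTTP convention of a carriage return followed by a line feed.

```perl
use strict;
use warnings;

# Hypothetical helper: build the text of a minimal HTTP/1.1 GET request.
sub buildGetRequest {
    my ($server, $fileAddr) = @_;
    my $CRLF = "\015\012";    # carriage return + line feed
    return "GET $fileAddr HTTP/1.1$CRLF"     # method, file, protocol version
         . "Host: $server$CRLF"              # assumed: names the server
         . "Connection: close$CRLF"          # assumed: close after the reply
         . $CRLF;                            # blank line ends the request
}

print buildGetRequest("www.uccs.edu", "/");
```

Printing the returned string on a connected socket, instead of on standard output, would send the request to the server.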

The GET request must be followed by a blank line, because HTTP requires a blank line to mark the end of the request before the server works on it. Strictly speaking, HTTP specifies that each line end with a carriage return and a line feed, so the request ends with \r\n\r\n; many servers also accept two bare newlines, \n\n. The GET method instructs the HTTP server to fetch a file. The HTTP server simply tries to obtain this file from its file structure, and if the file can be found, returns it using the socket to the requesting client. At the other end of the socket, that is, in openTCPConnection, the program reads what arrives at its end. This reading is done using the following line of the program.

my @remoteFile = <$socket>;

This is a read operation in list or array context. Such a read operation reads all the lines that arrive at the socket, until the server closes the connection, and stores them as elements of the list @remoteFile. For the exchange to work correctly, output on the socket must be unbuffered, i.e., the request must be sent to the server as soon as it is printed, instead of waiting in a buffer until a certain amount of material has accumulated; otherwise the program would wait for a reply to a request that was never sent. In current versions of Perl, sockets created with IO::Socket have autoflush turned on automatically. In older versions of Perl, this had to be done explicitly using select or autoflush. openTCPConnection simply returns the list containing all the lines received in response to the request for the URL.
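The behavior of a read in list context can be tried without a network. In the sketch below, an in-memory filehandle stands in for the socket, and the reply text is made up for illustration; it is not the output of any real server.

```perl
use strict;
use warnings;

# Made-up stand-in for a server reply: status line, one header,
# a blank line, and two lines of content
my $reply = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\n</html>\n";

# An in-memory filehandle plays the role of the socket here
open(my $fh, '<', \$reply) or die "Cannot open in-memory handle: $!";
my @lines = <$fh>;    # list context: read every line up to end-of-file
close($fh);

print scalar(@lines), " lines read\n";    # prints "5 lines read"
```

With a real socket, end-of-file is reached only when the server closes the connection, so the same statement waits until then.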

The main part of the program simply fetches a URL, and prints the fetched file on the terminal window. Of course, it prints it as text, and it is not formatted using the HTML tags in the text, as one would expect in a regular Web client or browser.
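Because the list read from the socket holds everything the server sends, a client that wants only the page has to skip the status line and headers that precede it. A minimal sketch follows, assuming that the first blank line separates the headers from the page; the name splitResponse is hypothetical and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: split the lines of an HTTP response into
# header lines and body lines at the first blank line.
sub splitResponse {
    my @lines = @_;
    my (@header, @body);
    my $inBody = 0;
    for my $line (@lines) {
        if ($inBody) {
            push @body, $line;              # everything after the blank line
        } elsif ($line =~ /^\r?\n\z/) {
            $inBody = 1;                    # first blank line ends the header
        } else {
            push @header, $line;            # status line and headers
        }
    }
    return (\@header, \@body);
}

my ($head, $body) = splitResponse(
    "HTTP/1.1 200 OK\r\n", "Content-Type: text/html\r\n", "\r\n",
    "<html>\n", "</html>\n");
print @$body;    # prints only the two lines of the page
```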