9.4 Handling Redirected Web Pages

  A Web client program sometimes finds that when it requests a page from a Web server, the server responds that the page has moved, or something similar. Web servers can redirect a request for one page to another. To find out whether this happens in a specific case, one has to look at what comes back from the Web server when the program submits a form and waits for the response. Let us take one specific example. Our goal is to find the price of a book at the www.fatbrain.com site given the book's ISBN. We start at the www.fatbrain.com site and follow the links to a form that allows an advanced search for books. The form can be found in the page with the URL

http://www1.fatbrain.com/search/AdvancedSearch.asp. The form is shown in Figure 9.23.

 

Figure 9.23:  A Search Form at www.fatbrain.com

We look at the HTML source for this page and see that the ACTION attribute for the form specifies the URL for the program that handles the form data as

http://www1.fatbrain.com/asp/Search/SearchResults.asp.

Also, the form uses the POST method for transmission of data from the client to the server. Based on this knowledge, we write the following program.

 Program 9.14

#!/usr/bin/perl
#file fatISBN.notwork.pl
use strict;
use HTTP::Response;
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common;
use LWP::Debug qw(+);

my ($cookie_jar, $ua, $response,$url); 
my ($content, $req);
my $searchFor = "0534934056";
$cookie_jar = HTTP::Cookies->new(file=>"cookies.dat");
$ua = LWP::UserAgent->new();
$ua->agent ("Mozilla/5.0");
$ua->timeout(600);

$req = POST "http://www1.fatbrain.com/asp/Search/SearchResults.asp",
       [SearchFunction => "reg", 
        VM => "c",
        RegAction => "t",
        ISBN => $searchFor];
$cookie_jar->add_cookie_header($req);
$response = $ua->request($req);
$cookie_jar->extract_cookies($response);
$content = $response->content();
print "****content = $content\n";

This program seems adequate for the purpose at hand. Note that we load the LWP::Debug module and import the + symbol so that the HTTP interactions are traced on the screen. When we run this program, it unexpectedly reports that the page has moved permanently. The HTTP interactions and the program’s output to the screen are given below. Some lines have been broken into two so that they fit within the page margins.

 

 

LWP::UserAgent::new: ()

HTTP::Cookies::add_cookie_header: Checking www1.fatbrain.com for cookies

HTTP::Cookies::add_cookie_header: Checking .fatbrain.com for cookies

LWP::UserAgent::request: ()

LWP::UserAgent::simple_request:

                     POST http://www1.fatbrain.com/asp/Search/SearchResults.asp

LWP::UserAgent::_need_proxy: Not proxied

LWP::Protocol::http::request: ()

LWP::Protocol::http::request: POST /asp/Search/SearchResults.asp HTTP/1.0

Host: www1.fatbrain.com

User-Agent: Mozilla/5.0

Content-Length: 51

Content-Type: application/x-www-form-urlencoded

 

LWP::Protocol::http::request: POST /asp/Search/SearchResults.asp HTTP/1.0

Host: www1.fatbrain.com

User-Agent: Mozilla/5.0

Content-Length: 51

Content-Type: application/x-www-form-urlencoded

 

LWP::Protocol::http::request: reading response

LWP::Protocol::http::request: HTTP/1.1 301 Moved

Server: Microsoft-IIS/4.0

Date: Thu, 17 May 2001 22:21:18 GMT

Location: http://www1.fatbrain.com/search/SearchResults.asp?

Connection: Keep-Alive

Content-Length: 0

Content-Type: text/html

Cache-control: private

 

LWP::Protocol::http::request: HTTP/1.1 301 Moved

LWP::UserAgent::request: Simple response: Moved Permanently

****content =
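
The redirect visible in the trace above can also be detected in the program itself instead of by reading the debug output. The following is a minimal sketch, not part of the original programs: the helper name describe_response is hypothetical, and the response object is built by hand from the status line and Location header shown in the trace so that the sketch runs without a network connection. In a real program the object would come from $ua->request($req).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: summarize what kind of answer came back.
sub describe_response {
    my ($response) = @_;
    if ($response->is_redirect) {
        # A 3xx status; the new address, if any, is in the Location header.
        my $target = $response->header('Location');
        $target = '(no Location header)' unless defined $target;
        return "redirect (" . $response->code . ") to $target";
    }
    return "success (" . $response->code . ")" if $response->is_success;
    return "failure: " . $response->status_line;
}

# ASSUMPTION: a stand-in for the server's answer, copied from the trace above.
my $response = HTTP::Response->new(
    301, 'Moved',
    [ Location => 'http://www1.fatbrain.com/search/SearchResults.asp?' ],
);

print describe_response($response), "\n";
```

Checking is_redirect before using the content explains the empty ****content = line: a 301 response carries a new address, not a page.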

 

If we look carefully toward the bottom of this HTTP interaction, we see that the page we requested, i.e., the program that is supposed to service the form request, has Moved Permanently: the status code is 301 and the content is empty. In other words, the Web server has redirected the request to another URL. To find out where the redirection takes us, we go to the Web browser and perform an ISBN search. Assume we fill in the ISBN box with the valid ISBN 0534934056 and then click on the Search Now submit button. When the response comes back from the server, we look at the Location box of the browser and see that the URL that is visible is not

http://www1.fatbrain.com/asp/Search/SearchResults.asp, but is

http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0534934056&vm=. Thus, we see that the form request has been redirected from the first URL to the second. The second URL contains a question mark followed by a query string, indicating that the request for the second URL uses the GET method. The query string has one attribute, theisbn, with the ISBN as its value. The second attribute, vm, has no associated value and thus does not seem useful to the current endeavor. Therefore, we write a program that mimics the behavior of the browser and sends a GET request to the second URL, with the ISBN as an argument. This request is also given any cookies that came back in response to the POST request made to the first URL. Finally, when the response to the second HTTP request comes back, we parse the page and obtain the price of the book. The complete program, which takes care of the redirection, is given below.

 Program 9.15

#!/usr/bin/perl
#file fatISBN.pl
use strict;
use HTTP::Response;
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common;

##############fatbrain.com starts##########################
my ($cookie_jar, $ua, $response, $sessid,$url); 

my ($content, $req);
my $searchFor = "0534934056";
$cookie_jar = HTTP::Cookies->new(file=>"cookies.dat");
$ua = LWP::UserAgent->new();
$ua->agent ("Mozilla/5.0");
$ua->timeout(600);

$req = POST "http://www1.fatbrain.com/asp/Search/SearchResults.asp",
       [SearchFunction => "reg", 
        VM => "c",
        RegAction => "t",
        ISBN => $searchFor];
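# First request: the POST is answered with a 301, but any cookies it sets are kept.
$cookie_jar->add_cookie_header($req);
$response = $ua->request($req);
$cookie_jar->extract_cookies($response);
$content = $response->content();
# Second request: GET the page the browser ended up at, with the ISBN in the query string.
$url =
"http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=$searchFor";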

$req = HTTP::Request->new(GET=>$url);
$cookie_jar->add_cookie_header($req);
$response = $ua->request($req);
$cookie_jar->extract_cookies($response);
$content = $response->content();
#print $content;

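#Make the assumption that name, author, etc., are known for the book
my ($price) =  ($content =~ m#Online Price:.+?\$(.+?)<#si);
#Guard against a page layout in which the pattern does not match
print defined $price ? "$price\n" : "Price not found\n";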
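
The pattern used to pull the price out of the page can be tried on its own. The sketch below applies the same regular expression to a small HTML fragment. The fragment is invented for illustration and is only an assumption about what the page might look like; the helper name extract_price is likewise hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from Program 9.15.
# Returns the text between the first "$" after "Online Price:" and the
# next "<", or undef when the pattern does not match.
sub extract_price {
    my ($html) = @_;
    my ($price) = ($html =~ m#Online Price:.+?\$(.+?)<#si);
    return $price;
}

# ASSUMPTION: an invented fragment, not the real fatbrain.com markup.
# Single quotes keep Perl from interpolating "$127".
my $sample = '<td><b>Online Price:</b></td>'
           . '<td><font color="red">$127.95</font></td>';

my $price = extract_price($sample);
print defined $price ? "$price\n" : "Price not found\n";
```

The s modifier lets the dot match newlines, so the label and the amount may sit on different lines of the page, and the i modifier ignores case in the label. Because both quantifiers are non-greedy, the match stops at the first dollar sign after the label and at the first tag after the amount.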

If we use the LWP::Debug module and import the + symbol in this program as well, we see HTTP communication beyond what we presented earlier. The additional communication is given below. We have broken some lines so that they fit in the available space.

 

HTTP::Cookies::add_cookie_header: Checking www1.fatbrain.com for cookies

HTTP::Cookies::add_cookie_header: Checking .fatbrain.com for cookies

LWP::UserAgent::request: ()

LWP::UserAgent::simple_request:

   GET http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0534934056

LWP::UserAgent::_need_proxy: Not proxied

LWP::Protocol::http::request: ()

LWP::Protocol::http::request:

    GET /asp/bookinfo/bookinfo.asp?theisbn=0534934056 HTTP/1.0

Host: www1.fatbrain.com

User-Agent: Mozilla/5.0

 

LWP::Protocol::http::request: reading response

LWP::Protocol::http::request: HTTP/1.1 200 OK

Server: Microsoft-IIS/4.0

Date: Thu, 17 May 2001 23:06:05 GMT

Content-Type: text/html

Set-Cookie: Jar=BID=0478B379EF7D2A06;

       expires=Mon, 01-Jan-2024 05:00:00 GMT; domain=.fatbrain.com; path=/

Cache-control: private

 

LWP::Protocol::http::request: HTTP/1.1 200 OK

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 411 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 1105 bytes

LWP::Protocol::collect: read 1459 bytes

LWP::Protocol::collect: read 1502 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 1199 bytes

LWP::Protocol::collect: read 1504 bytes

LWP::Protocol::collect: read 1460 bytes

LWP::Protocol::collect: read 95 bytes

LWP::UserAgent::request: Simple response: OK

HTTP::Cookies::extract_cookies: Set cookie Jar => BID=0478B379EF7D2A06

 

This interaction shows that the second HTTP request is successful and comes back with a response code of 200 and a response string of OK. The program prints the price of the book we are looking for as 127.95 dollars. Commenting out the use LWP::Debug line at the top of the program suppresses the HTTP interaction trace, so that only the price is printed.
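
The traces in this section rely on LWP::Debug. In later libwww-perl releases that module is deprecated and no longer produces a trace, and the handler mechanism of LWP::UserAgent can be used for a similar purpose. The following is a rough sketch under stated assumptions: the www.example.com URLs are placeholders, and a second handler answers every request with a canned 301 response so that the sketch runs without a network connection. In a real program that canned handler would be left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);
use HTTP::Response;

my @trace;    # one line per request sent and response received

my $ua = LWP::UserAgent->new;

# Record each request before it is sent.
$ua->add_handler(request_prepare => sub {
    my ($request) = @_;
    push @trace, "--> " . $request->method . " " . $request->uri;
    return;
});

# ASSUMPTION (offline demo only): returning a response from a request_send
# handler stops LWP from contacting the network. Remove for real use.
$ua->add_handler(request_send => sub {
    return HTTP::Response->new(
        301, "Moved", [ Location => "http://www.example.com/new" ],
    );
});

# Record the status line of each response.
$ua->add_handler(response_done => sub {
    my ($response) = @_;
    push @trace, "<-- " . $response->status_line;
    return;
});

# Placeholder URL and form field, standing in for the search form.
my $response = $ua->request(
    POST "http://www.example.com/old", [ ISBN => "0534934056" ],
);

print STDERR "$_\n" for @trace;
```

Removing the three add_handler calls silences the trace, which plays the same role as commenting out the use LWP::Debug line.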

This program shows that to write successful Web client programs, it is sometimes necessary to perform careful detective work, monitor the HTTP interactions closely, and then solve the problems that arise in a straightforward program by mimicking the behavior of a real browser and its counterpart Web server. Determining what a Web browser and a Web server do in such problematic situations is crucial in writing a successful Web client program. The manner in which a client such as Netscape Communicator works in various situations is publicly available. However, finding the appropriate action for the situation at hand among the mountains of information available at sites such as www.netscape.com and www.mozilla.com is usually difficult. There is a mailing list to discuss problems that arise when programming with the LWP modules. One can subscribe to this mailing list by writing to the address libwww-subscribe@perl.org. One can post a problem, a response or an experience by writing to libwww@perl.org. This mailing list is read by many experienced LWP programmers, including the authors of the various modules. This is the best place to get information and have one’s vexing questions answered.
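
Part of the browser behavior mimicked by hand in this section can also be delegated to LWP::UserAgent. By default it follows redirects only for GET and HEAD requests, which is why the POST in Program 9.14 stopped at the 301 response. The sketch below is an alternative under stated assumptions, not the method used in Program 9.15: it attaches a cookie jar to the user agent so that cookies are handled automatically, and it adds POST to the list of methods that may be redirected. The helper name fetch_form is hypothetical, and the request is sent only when a URL is given on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0");
$ua->timeout(600);

# With a jar attached, the user agent adds and extracts cookies itself.
# ASSUMPTION: an in-memory jar; pass file => "cookies.dat" to keep them on disk.
$ua->cookie_jar(HTTP::Cookies->new);

# Allow redirects to be followed for POST requests as well.
push @{ $ua->requests_redirectable }, 'POST';

# Hypothetical helper: submit a form and report where the request ended up.
sub fetch_form {
    my ($url, $fields) = @_;
    my $response = $ua->request(POST $url, $fields);
    my @hops = map { $_->status_line } $response->redirects;
    print STDERR "redirected via: @hops\n" if @hops;
    print STDERR "final URL: ", $response->request->uri, "\n";
    return $response;
}

# Only touch the network when a URL is supplied, e.g.
#   perl follow.pl http://www.example.com/search 0534934056
if (my $url = shift @ARGV) {
    my $isbn = shift(@ARGV) || "0534934056";
    my $response = fetch_form($url, [ ISBN => $isbn ]);
    print $response->status_line, "\n";
}
```

Whether following the server's redirect lands on the same page that the browser displayed is not something the traces in this section confirm, and how a redirected POST is reissued depends on the status code and the LWP release, so the HTTP interactions should still be inspected after such a change.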