9.4 Handling Redirected Web Pages
A Web client program sometimes requests a page from a Web server only to have the server respond that the page has moved, or something similar. Web servers can redirect requests from one URL to another. To find out whether this happens in a specific case, one has to examine the response that comes back from the Web server when the program submits a form. Let us take one specific example. Our goal is to find the price of a book at the www.fatbrain.com site given the book's ISBN. We start by going to the www.fatbrain.com site and following the links to a form that allows advanced search for books. The form can be found in the page with the URL
http://www1.fatbrain.com/search/AdvancedSearch.asp. The form is shown in Figure 9.23.
Figure 9.23: A Search Form at www.fatbrain.com
We look at the HTML source for this page and see that the ACTION attribute for the form specifies the URL for the program that handles the form data as
http://www1.fatbrain.com/asp/Search/SearchResults.asp.
Also, the form uses the POST method for transmission of data from the client to the server. Based on this knowledge, we write the following program.
Program 9.14
#!/usr/bin/perl
#file fatISBN.notwork.pl
use strict;
use HTTP::Response;
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common;
use LWP::Debug qw(+);
my ($cookie_jar, $ua, $response,$url);
my ($content, $req);
my $searchFor = "0534934056";
$cookie_jar = HTTP::Cookies->new(file=>"cookies.dat");
$ua = LWP::UserAgent->new();
$ua->agent ("Mozilla/5.0");
$ua->timeout(600);
$req = POST "http://www1.fatbrain.com/asp/Search/SearchResults.asp",
    [SearchFunction => "reg",
     VM => "c",
     RegAction => "t",
     ISBN => $searchFor];
$cookie_jar->add_cookie_header($req);
$response = $ua->request($req);
$cookie_jar->extract_cookies($response);
$content = $response->content();
print "****content = $content\n";
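Before running Program 9.14, we can inspect the request it builds without contacting the server. The following is a minimal sketch, assuming only that the HTTP::Request::Common module is installed; the variable name $body is ours and is used for illustration only. It constructs the same POST request and prints the headers and the encoded form data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build the same POST request as Program 9.14, but do not send it.
my $req = POST "http://www1.fatbrain.com/asp/Search/SearchResults.asp",
    [SearchFunction => "reg",
     VM             => "c",
     RegAction      => "t",
     ISBN           => "0534934056"];

# Passing an array reference keeps the form fields in the order given.
my $body = $req->content;

print "Method:         ", $req->method, "\n";
print "Content-Type:   ", $req->header('Content-Type'), "\n";
print "Content-Length: ", $req->header('Content-Length'), "\n";
print "Body:           $body\n";
```

The printed Content-Type and Content-Length can be compared with the corresponding request headers that LWP::Debug reports when the program is actually run.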
Program 9.14 seems adequate for the purpose at hand. Note that we use the LWP::Debug module and import the + symbol so that the HTTP interactions taking place are shown on the screen. When we run this program, we get an unexpected response saying that the page has moved permanently. The HTTP interactions and the program’s output to the screen are given below. Some lines have been broken in two so that they fit within the margins of the page.
LWP::UserAgent::new: ()
HTTP::Cookies::add_cookie_header: Checking www1.fatbrain.com for cookies
HTTP::Cookies::add_cookie_header: Checking .fatbrain.com for cookies
LWP::UserAgent::request: ()
LWP::UserAgent::simple_request:
POST http://www1.fatbrain.com/asp/Search/SearchResults.asp
LWP::UserAgent::_need_proxy: Not proxied
LWP::Protocol::http::request: ()
LWP::Protocol::http::request: POST /asp/Search/SearchResults.asp HTTP/1.0
Host: www1.fatbrain.com
User-Agent: Mozilla/5.0
Content-Length: 51
Content-Type: application/x-www-form-urlencoded
LWP::Protocol::http::request: POST /asp/Search/SearchResults.asp HTTP/1.0
Host: www1.fatbrain.com
User-Agent: Mozilla/5.0
Content-Length: 51
Content-Type: application/x-www-form-urlencoded
LWP::Protocol::http::request: reading response
LWP::Protocol::http::request: HTTP/1.1 301 Moved
Server: Microsoft-IIS/4.0
Date: Thu, 17 May 2001 22:21:18 GMT
Location: http://www1.fatbrain.com/search/SearchResults.asp?
Connection: Keep-Alive
Content-Length: 0
Content-Type: text/html
Cache-control: private
LWP::Protocol::http::request: HTTP/1.1 301 Moved
LWP::UserAgent::request: Simple response: Moved Permanently
****content =
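The status code and the Location header seen in the trace above can also be checked from within a program instead of being read off the screen. The sketch below is an illustration under stated assumptions: it does not contact the server, but recreates by hand a response resembling the one in the trace, and the variable name $target is ours. It uses the is_redirect method of HTTP::Response and resolves the Location value against the request URL with the URI module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use URI;

# Recreate, offline, a response like the one in the trace (no network access).
my $req = HTTP::Request->new(
    POST => "http://www1.fatbrain.com/asp/Search/SearchResults.asp");
my $response = HTTP::Response->new(
    301, "Moved",
    [Location => "http://www1.fatbrain.com/search/SearchResults.asp?"]);
$response->request($req);

my $target;
if ($response->is_redirect) {
    # The Location value may be relative, so resolve it against the request URL.
    $target = URI->new_abs($response->header('Location'), $req->uri);
    print "Redirected (", $response->code, ") to $target\n";
}
else {
    print "No redirect: ", $response->status_line, "\n";
}
```

In a real program, $response would be the value returned by $ua->request($req), and the same test could be applied to it directly.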
If we look carefully toward the bottom of the HTTP interaction for Program 9.14, we see that the page we requested, i.e., the program that is supposed to service the form request, has Moved Permanently: the server returned status code 301 along with a Location header. In other words, the Web server has redirected the request to another URL. By default, LWP::UserAgent does not follow a redirect automatically when the original request is a POST, so the program ends up with empty content. To find out where the redirection takes us, we go to the Web browser and perform an ISBN search. Assume we fill in the ISBN box with the valid ISBN 0534934056 and then click on the Search Now submit button. When the response comes back from the server, we look at the Location box of the browser and see that the URL that is visible is not
http://www1.fatbrain.com/asp/Search/SearchResults.asp, but is
http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0534934056&vm=. Thus, we see that the form request has been redirected from the first URL to the second. The second URL has a question mark followed by name-value pairs, indicating that the data reaches the program at the second URL by the GET method. There is one parameter, theisbn, with the ISBN as its value. The second parameter, vm, has no associated value and thus does not seem useful to the current endeavor. Therefore, we write a program that mimics the behavior of the browser and sends a GET request to the second URL. This new request is given the ISBN as an argument. In addition, this request is given any cookies that came in response to the POST request made to the first URL. Finally, when the response comes back from the second HTTP request, we parse the page and obtain the price of the book. The complete program where we take care of redirection is given below.
Program 9.15
#!/usr/bin/perl
#file fatISBN.pl
use strict;
use HTTP::Response;
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common;
##############fatbrain.com starts##########################
my ($cookie_jar, $ua, $response, $sessid,$url);
my ($content, $req);
my $searchFor = "0534934056";
$cookie_jar = HTTP::Cookies->new(file=>"cookies.dat");
$ua = LWP::UserAgent->new();
$ua->agent ("Mozilla/5.0");
$ua->timeout(600);
$req = POST "http://www1.fatbrain.com/asp/Search/SearchResults.asp",
    [SearchFunction => "reg",
     VM => "c",
     RegAction => "t",
     ISBN => $searchFor];
$cookie_jar->add_cookie_header($req);
$response = $ua->request($req);
$cookie_jar->extract_cookies($response);
$content = $response->content();
$url =
    "http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=$searchFor";
$req = HTTP::Request->new(GET=>$url);
$cookie_jar->add_cookie_header($req);
$response = $ua->request($req);
$cookie_jar->extract_cookies($response);
$content = $response->content();
#print $content;
#Make the assumption that name, author, etc., are known for the book
my ($price) = ($content =~ m#Online Price:.+?\$(.+?)<#si);
#Print the price only if the pattern matched
print "$price\n" if defined $price;
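Program 9.15 builds the GET URL by interpolating the ISBN into a string. As an alternative, the query string can be assembled with the URI module, which escapes parameter values where necessary. The following is a minimal sketch of that alternative, not part of the program above; the variable name $uri is ours. It only constructs the request and prints it, without sending anything.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;
use HTTP::Request;

my $searchFor = "0534934056";

# Start from the URL without a query string, then add the parameter.
my $uri = URI->new("http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp");
$uri->query_form(theisbn => $searchFor);

# The resulting URI object can be passed directly to HTTP::Request.
my $req = HTTP::Request->new(GET => $uri);
print $req->method, " ", $req->uri, "\n";
```

For a value such as an ISBN made up only of digits, the result is the same URL as the one built by string interpolation; the difference matters when a value contains spaces or other reserved characters.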
If we use the LWP::Debug module and import the + symbol, we see HTTP communication beyond what we presented earlier. The additional communication is given below. We have broken some lines so that they can be printed in the available space.
HTTP::Cookies::add_cookie_header: Checking www1.fatbrain.com for cookies
HTTP::Cookies::add_cookie_header: Checking .fatbrain.com for cookies
LWP::UserAgent::request: ()
LWP::UserAgent::simple_request:
GET http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0534934056
LWP::UserAgent::_need_proxy: Not proxied
LWP::Protocol::http::request: ()
LWP::Protocol::http::request:
GET /asp/bookinfo/bookinfo.asp?theisbn=0534934056 HTTP/1.0
Host: www1.fatbrain.com
User-Agent: Mozilla/5.0
LWP::Protocol::http::request: reading response
LWP::Protocol::http::request: HTTP/1.1 200 OK
Server: Microsoft-IIS/4.0
Date: Thu, 17 May 2001 23:06:05 GMT
Content-Type: text/html
Set-Cookie: Jar=BID=0478B379EF7D2A06;
expires=Mon, 01-Jan-2024 05:00:00 GMT; domain=.fatbrain.com; path=/
Cache-control: private
LWP::Protocol::http::request: HTTP/1.1 200 OK
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 411 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 1105 bytes
LWP::Protocol::collect: read 1459 bytes
LWP::Protocol::collect: read 1502 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 1199 bytes
LWP::Protocol::collect: read 1504 bytes
LWP::Protocol::collect: read 1460 bytes
LWP::Protocol::collect: read 95 bytes
LWP::UserAgent::request: Simple response: OK
HTTP::Cookies::extract_cookies: Set cookie Jar => BID=0478B379EF7D2A06
This interaction shows that the second HTTP request succeeds and comes back with a response code of 200 and a response string of OK. The program prints the price of the book we are looking for as 127.95 dollars. Commenting out the use LWP::Debug line at the top of the program suppresses the HTTP interaction trace, so the program prints just the price.
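The pattern used to pull out the price can be tried on its own, away from the network. The sketch below applies the same regular expression as Program 9.15 to a small HTML fragment. The fragment is hypothetical: the actual markup of the fatbrain.com page is not reproduced in this section, so the tags shown here are an assumption made only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fragment; the real page markup is not shown in the text.
my $content = <<'HTML';
<tr><td><b>Online Price:</b></td>
<td><font color="red">$127.95</font></td></tr>
HTML

# /s lets . match newlines; /i ignores case. The first .+? skips any
# markup between the label and the dollar sign, and the capture stops
# at the next tag.
my ($price) = ($content =~ m#Online Price:.+?\$(.+?)<#si);

if (defined $price) {
    print "Price: $price\n";
}
else {
    print "Price not found\n";
}
```

Because the pattern depends on the label Online Price: and on a dollar sign appearing before the next tag, it would need to be revised whenever the layout of the page changes.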
This program shows that to write successful Web client programs, it is sometimes necessary to perform careful detective work, cautiously monitor the HTTP interactions, and then solve the problems that arise in a straightforward program by mimicking the behavior of a real browser and its counterpart Web server. Determining what a Web browser and a Web server do in such problematic situations is crucial in writing a successful Web client program. The manner in which a client such as Netscape Communicator works in various situations is publicly documented. However, finding the appropriate action for the situation at hand among the mountains of information available at sites such as www.netscape.com and www.mozilla.com is usually difficult.
There is a mailing list for discussing problems that arise when programming with the LWP modules. One can subscribe to this mailing list by writing to the address libwww-subscribe@perl.org. One can post a problem, a response, or an experience by writing to libwww@perl.org. This mailing list is read by many experienced LWP programmers, including the authors of the various modules. This is the best place to get information and have one’s vexing questions answered.
