4.8.3.2.3 Capturing All URLs From A Web Page

   We now write a program that takes a Web page, i.e., an HTML file, and extracts all the URLs from it. In a Web page, a URL is specified inside an anchor tag, which begins with <a> and ends with </a>. The tag name and its attributes, such as href, can be written in either lower case or upper case. Here is an example of the use of the anchor tag.

 

<a href="http://www.assam.org:80/orgs/asa/index.html">Assam Society </a>

    was established in 1973.

 

Note that the example has been broken into two lines by hand. The URL follows href= and is enclosed in double quotes. The following program extracts all URLs that appear in <a> tags in an HTML file.

 Program 4.34

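#!/usr/bin/perl
use strict;

# Read every line of every file named on the command line.
my @HTMLText = <ARGV>;
chomp @HTMLText;

# Join the lines into one string so that a tag split across lines still matches.
my $text = join " ", @HTMLText;
my @lines  = ($text =~ /<A\s+href\s*=\s*"([^"]+)"/ig);

# Output step reconstructed from the sample output: sorted, one URL per line.
print join("\n", sort @lines), "\n";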

An HTML file has free-form syntax: there can be any number of blank spaces or newlines between any two words. For example, in the <a> tag, there can be any amount of space or any number of newlines between <a and href. To avoid the complications that arise from this free-form nature, we use the diamond (<>) read operation in an array or list context. When we read from a filehandle in list context, all lines of the corresponding file are read at once. For example, when we write

 

@fileContents = <FILEHANDLE>;

 

@fileContents will contain all lines read from FILEHANDLE, each as an element. We can then manipulate the list containing the lines of the file and extract the URLs.
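
As a small illustration of the difference between scalar and list context, the following sketch reads from an in-memory filehandle so that it needs no external file. The variable names and the sample HTML are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: an anchor tag split across two lines.
my $html = "<a\n   href=\"index.html\">Home</a>\nwas here.\n";

# Open a read-only filehandle on the string instead of a real file.
open(my $fh, '<', \$html) or die "Cannot open in-memory file: $!";

# Scalar context: <$fh> returns only the next line.
my $firstLine = <$fh>;

# List context: <$fh> returns all remaining lines at once.
my @restOfFile = <$fh>;
close($fh);

print "First line: $firstLine";
print "Remaining lines: ", scalar(@restOfFile), "\n";
```

Each element of @restOfFile still ends with its newline character, which is why a program that joins the lines usually calls chomp on the list first.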

The program given above can be called with a list of HTML files. Suppose our call is the following.

 

%findURLs.pl index.html index1.html

 

Here, % is the Unix prompt, and index.html and index1.html are two HTML files. The statement

 

my @HTMLText = <ARGV>;

 

in the program reads both files and makes their lines available in the list @HTMLText. Next, the line of code

 

chomp @HTMLText;

 

removes the newline character from the end of each element of @HTMLText. Then, we take all these chomped lines and join them, separated by single spaces, into one big string called $text. Depending on how many files are given as command-line arguments and how large they are, $text can become very large, so this may not be the best approach in a real environment. In such situations, we should modify the program to read one file at a time instead of all the files together. There may be other, better solutions.
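
One way to read one file at a time might look like the following sketch. It is only an illustration, not part of Program 4.34: the subroutine names extractURLs and urlsInFile are invented here, and the regular expression is the one used in the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every href value found in anchor tags within one string of HTML.
# (Hypothetical helper; the pattern is the one from Program 4.34.)
sub extractURLs {
    my ($text) = @_;
    return ($text =~ /<A\s+href\s*=\s*"([^"]+)"/ig);
}

# Read a single file, join its lines, and return the URLs it contains.
sub urlsInFile {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my @HTMLText = <$fh>;
    close($fh);
    chomp @HTMLText;
    return extractURLs(join " ", @HTMLText);
}

# Process the files named on the command line one at a time, so that
# only one file's text is held in memory at any moment.
foreach my $file (@ARGV) {
    print "$_\n" foreach urlsInFile($file);
}
```

With this arrangement, the memory needed depends on the largest single file rather than on the total size of all the files.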

Finally, the line of code

 

my @lines  = ($text =~ /<A\s+href\s*=\s*"([^"]+)"/ig);

 

searches through the string $text and picks out all URLs. A URL follows href and the = sign, and there may be whitespace around the equal sign. After this whitespace comes a double quote; everything from there up to the next double quote is the URL string. We use the i modifier to indicate that the case of letters does not matter in HTML tags and their attributes. We use the g modifier because we want to pluck out all substrings that match the regular expression for a URL. The output of the program looks like the following.

aboutus.htm
contact.htm
cultural.html
feedback.htm
http://www.cs.uccs.edu/cgi-bin/kalita/hello.pl
http://www.cs.uccs.edu/~kalita/accesswatch/accesswatch-1.32/index.html
http://www.rahul.net/kgpnet/iit/iit.html
http://www.uccs.edu
http://www.upenn.edu/index.html
http://www.usask.ca
images/ascol.gif
links.htm
mailto:webmaster@assamcompany.com
research.html
schedule.html
search.htm
whosting.htm
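
To see how the pattern discussed above behaves on tags written in different styles, a short sketch such as the following can be run on its own. The test string is invented for illustration; only its first URL comes from the example earlier in this section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test text: mixed case, extra whitespace around "=",
# and a tag whose href is not the first attribute.
my $text = '<a href="http://www.assam.org:80/orgs/asa/index.html">Assam Society</a> '
         . '<A   HREF = "contact.htm">Contact</A> '
         . '<a class="nav" href="skipped.htm">Skipped</a>';

# In list context with the g modifier, the match returns the text
# captured by the parentheses for every match in the string.
my @urls = ($text =~ /<A\s+href\s*=\s*"([^"]+)"/ig);

print "$_\n" foreach @urls;
```

The first two URLs are printed. The third tag is not reported, because the pattern expects href to come immediately after <a and any whitespace; a tag with another attribute before href does not match the pattern as written.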