4.6.1 Extracting Components From A URL

 Assume we are looking at the source of an HTML file displayed by a Web browser. Suppose we have the following lines in the HTML file.

 

<a href="http://www.assam.org:80/orgs/asa/index.html"> Assam Society</a> was founded in 1973.

<a href="http://www.assam.org:80/orgs/asa/index.xml"> Assam Society</a> was founded in 1973.

<a href="http://www.assam.org/orgs/asa/index.html"> Assam Society </a>was founded in 1973.

<a href="http://www.assam.org/orgs/asa/"> Assam Society </a> was founded in 1973.

<a href="http://www.assam.org/orgs/asa"> Assam Society </a> was founded in 1973.

<a href="http://www.assam.org"> Assam Society </a> was founded in 1973.

 

The file shows the various ways in which URLs can be written. We assume that there is at most one URL per line. Our goal is to read such a file, extract the URL, if any, from each line, and then extract parts of the URL such as the server name, the HTTP port number, the path, the name of the file, if any, and the type of the file. We will write several programs to achieve this goal.

The first program simply reads lines from an HTML file and extracts URLs.

 Program 4.17

#!/usr/bin/perl
#file extractURL0.pl
while (<>){
    if ($_ =~ m@http://([^"]+)"@){
        $url = $1;
        print "URL = ", $url, "\n";
    }
}

The while loop reads lines from the file one by one. The only programming construct inside the while loop is an if statement. The conditional of the if statement does pattern matching against the line that was read into $_. In the conditional, there is one set of matched parentheses enclosing a subpattern. If the conditional is not satisfied, the line is ignored. If the conditional is satisfied by a particular line read, the part of the target string $_ that matches the subpattern inside the parentheses (i.e., ([^"]+)) is remembered by Perl. Since there is only one instance of remembering in the regular expression under consideration, the remembered substring is stored in the special variable with the name $1. Later we assign to the variable $url the value of $1. Therefore, the program prints the following.

 

URL = www.assam.org:80/orgs/asa/index.html

URL = www.assam.org:80/orgs/asa/index.xml

URL = www.assam.org/orgs/asa/index.html

URL = www.assam.org/orgs/asa/

URL = www.assam.org/orgs/asa

URL = www.assam.org

 

Any time we use parentheses to group a subpattern, the remembering effect is triggered. If we have several parenthesized subpatterns or subexpressions in a regular expression, the number of special variables created for the purpose of remembering is equal to the number of parenthesized pairs. The variables are numbered $1, $2, $3, etc. $1 stores the substring of the target string that matches the first parenthesized subexpression, $2 stores the substring that matches the second parenthesized subexpression, etc.
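As a small illustration of this numbering, the following sketch uses three parenthesized subexpressions on one of the sample lines. The helper name split_url_head is hypothetical and is not one of the numbered programs in this section; it also assumes that the URL carries an explicit port number.

```perl
#!/usr/bin/perl
# Sketch only: split_url_head is a hypothetical helper used for illustration.
use strict;
use warnings;

sub split_url_head {
    my ($text) = @_;
    # Three groups, numbered by position: $1 protocol, $2 server, $3 port.
    if ($text =~ m@(\w+)://([^:/]+):(\d+)@) {
        return ($1, $2, $3);
    }
    return;    # empty list when the pattern does not match
}

my ($protocol, $server, $port) =
    split_url_head('<a href="http://www.assam.org:80/orgs/asa/index.html">');
print "Protocol = $protocol\nServer = $server\nPort = $port\n";
```

For the sample line shown, $1, $2 and $3 hold http, www.assam.org and 80, respectively.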

After we have extracted the URL, we now want to extract two substrings out of it: the server name together with the path, and the file name. The program given below accomplishes this goal.

 Program 4.18

#!/usr/bin/perl
#file extractURL111.pl
while (<>){
    $url = $fileName = $serverPath = "";
    if ($_ =~ m@http://([^"]+)"@){
        $url = $1;

        $url = $url . "/"
             if (($url !~ m@/$@) and  ($url !~ m@[.](html|xml)$@));
        print "URL = ", $url, "\n";

        $url =~ m@^(.+)/([^/]+\.(html|xml))?$@;

        $fileName = $2;
        $serverPath = $1;
    }
    print "Server and path = $serverPath\nFile = $fileName\n\n";
      
}

In this program, as before, we have a while loop that reads lines one by one from one or more input files. In the loop, we start by setting each of the variables $url, $fileName and $serverPath to the empty string to ensure that there are no residual values from a previous iteration of the loop. Next, we have an if block, the conditional of which captures, in the special numbered variable $1, the URL, if any, in the current line. The URL is then stored in the variable $url, just as in the previous program. The statement that follows is repeated below.

 

$url = $url . "/" if (($url !~ m@/$@) and  ($url !~ m@[.](html|xml)$@));

 

This statement checks whether the URL does not end with a / and, in addition, does not end with either of the two file extensions .html or .xml. If both conditions hold, the program concludes that the URL is a directory name without a trailing /, and it appends a / to the URL. In our example, this happens in the case of the last two URLs in the data file being parsed. This preparatory step of pre-processing the URL makes it easier to extract the components we need. The extraction is done in the following statement.

 

$url =~ m@^(.+)/([^/]+\.(html|xml))?$@;

 

The regular expression on the right-hand side is anchored with ^ so that matching against the target string $url starts at the beginning. It is also anchored to the end of the target string by the anchor $. Thus, the regular expression must match $url starting at the beginning and continuing to the end. This simply means that the regular expression must match $url completely.

There are three parenthesized subexpressions in the full regular expression. The first parenthesized subexpression is (.+), the second is ([^/]+\.(html|xml)), and the third, which is nested inside the second, is (html|xml). Since there are three parenthesized subexpressions, three special variables, $1, $2 and $3, are set by a successful pattern match. The value of $3 is ignored in this program after the pattern matching is done.

The first parenthesized subexpression uses the multiplier +. The second parenthesized subexpression, ([^/]+\.(html|xml)), also has the multiplier + inside it. The second parenthesized subexpression is optional because it is followed by the multiplier ?. Thus, the pattern match against $url can succeed even when the URL has no file name such as index.html or index.xml, as seen in some of the example URLs. In our data file, the URLs for which this optional second subexpression matches nothing are the following.

 

URL = www.assam.org/orgs/asa/

URL = www.assam.org/orgs/asa

URL = www.assam.org

 

Note that in the case of the last two URLs above, a trailing / is appended by the program step discussed earlier. In all three cases, the second parenthesized subexpression does not match any substring in the URL. Thus, $2 is left undefined after the match, and it prints as the empty string. The third parenthesized subexpression, (html|xml), which is nested inside the second, also cannot match any part of the target string, and hence $3 is undefined as well. In these cases, the + multiplier in the first parenthesized subexpression consumes all characters in the modified URL except the trailing /.
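When an optional parenthesized subexpression takes no part in a successful match, Perl leaves the corresponding numbered variable undefined, and an undefined value prints as the empty string (with a warning if warnings are enabled). The following sketch, which reuses the regular expression of Program 4.18 inside a hypothetical helper named file_part, tests for this case explicitly.

```perl
#!/usr/bin/perl
# Sketch only: file_part is a hypothetical helper used for illustration.
use strict;
use warnings;

sub file_part {
    my ($url) = @_;
    if ($url =~ m@^(.+)/([^/]+\.(html|xml))?$@) {
        # $2 is undefined when the optional group matched nothing.
        return defined $2 ? $2 : "";
    }
    return "";
}

my $without_file = file_part("www.assam.org/orgs/asa/");
my $with_file    = file_part("www.assam.org/orgs/asa/index.xml");
print "File = [$without_file]\n";    # File = []
print "File = [$with_file]\n";       # File = [index.xml]
```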

Now, consider the first three URLs of the data file, repeated below.

 

URL = www.assam.org:80/orgs/asa/index.html

URL = www.assam.org:80/orgs/asa/index.xml

URL = www.assam.org/orgs/asa/index.html

 

Each one of these has a file name at the end, it being index.html, index.xml and index.html, respectively. In matching against these targets, all three subexpressions match substrings in the URLs. Thus, $2 and $3, in addition to $1, are assigned non-empty values after the pattern match is performed. But what values do $1, $2, and $3 get in these cases?

As we know, the first parenthesized subexpression has a multiplier +, and so does the second parenthesized subexpression. The first multiplier consumes as much of the target string $url as possible. The first subexpression (.+) theoretically can consume the whole string because . matches every character except \n and there is no \n in the $url string. After matching the first parenthesized subexpression, we must find a / in the target string. Then, matching the second parenthesized subexpression begins. The second parenthesized subexpression starts by looking for a sequence of one or more characters, each one of which is not /, followed by the period and a file extension: html or xml. This means that the first parenthesized subexpression matches the target URL all the way up to the last /, of course, excluding the /. Thus, $1 gets the value

 

www.assam.org:80/orgs/asa

www.assam.org/orgs/asa

www.assam.org/orgs/asa

 

respectively, in the three cases under discussion. Next, in each case, the / is matched. The second parenthesized subexpression matches

 

index.html

index.xml

index.html

 

respectively, for the three URLs.

In general, Perl’s pattern-matching multipliers such as * and + are left greedy. If there are several multipliers in a regular expression, the ones on the left consume as much as possible of the target string. However, their greed is limited by the requirement that the ones that follow must be able to consume at least some of the input, if they are to match as specified. So, after the first multiplier has consumed all it can, the regular expression engine tries to match the other parts of the expression that follow. If there is a failure in matching what follows, Perl backtracks and tries to match the first multiplier again by consuming a little less than it did the first time. In the case of the three URLs under consideration, the first subexpression (.+) cannot consume everything because there are other multipliers and other non-parenthesized and non-multiplied parts of the regular expression that follow. We do not go into the details of how it all works, except to say that the left-most multipliers consume as much as they can, but they are willing to consume a little less if the multipliers that follow would otherwise have nothing to consume.
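The effect of greed and backtracking can be seen in isolation with a simpler pattern. In the sketch below, the hypothetical helper split_at_last_slash uses two greedy (.+) subexpressions separated by a /. The first one initially takes the whole string and then gives back characters until a / and at least one further character remain for the rest of the pattern.

```perl
#!/usr/bin/perl
# Sketch only: split_at_last_slash is a hypothetical helper.
use strict;
use warnings;

sub split_at_last_slash {
    my ($text) = @_;
    # The left-most (.+) is greedy, so the / that separates the two
    # groups ends up being the last / in the string.
    return $text =~ m@^(.+)/(.+)$@ ? ($1, $2) : ();
}

my ($left, $right) = split_at_last_slash("www.assam.org/orgs/asa/index.html");
print "Left  = $left\n";     # Left  = www.assam.org/orgs/asa
print "Right = $right\n";    # Right = index.html
```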

The result of running Program 4.18 on the data file shown earlier is the following.

URL = www.assam.org:80/orgs/asa/index.html
Server and path = www.assam.org:80/orgs/asa
File = index.html

URL = www.assam.org:80/orgs/asa/index.xml
Server and path = www.assam.org:80/orgs/asa
File = index.xml

URL = www.assam.org/orgs/asa/index.html
Server and path = www.assam.org/orgs/asa
File = index.html

URL = www.assam.org/orgs/asa/
Server and path = www.assam.org/orgs/asa
File = 

URL = www.assam.org/orgs/asa/
Server and path = www.assam.org/orgs/asa
File = 

URL = www.assam.org/
Server and path = www.assam.org
File = 

Suppose that extracting these two parts, as the previous program does, does not satisfy our ultimate objective. Now, we want to extract the name of the server (which is www.assam.org), the HTTP port number (80, in the first two cases), the path, and the file name separately. The following code does this for us.

 Program 4.19

#!/usr/bin/perl
#file extractURL2.pl
while (<>){
    $url = $machine = $port = $path = $file = "";

    if ($_ =~ m@http://([^"]+)"@){
        $url = $1;
        print "URL = ", $url, "\n";
        $url = $url . "/" if (($url !~ m@/$@) and  ($url !~ m@[.](html|xml)$@));

        $url =~ m@^([^:/]+)(:(\d+))?/(.+?)([^/]+[.](html|xml))?$@;
        $machine = $1;
        $port = $3;
        $path = $4;
        $file = $5;
        print "Machine = $machine\nPort = $port\nPath = $path\nFile = $file\n\n";
    }
}

This program is quite similar to the previous one. The first few lines of code are just like what we had earlier. The difference comes when we extract components from the variable $url. The pattern matching statement is given below.

 

$url =~ m@^([^:/]+)(:(\d+))?/(.+?)([^/]+[.](html|xml))?$@;

 

Here, there are six parenthesized subexpressions. The part in the URL before the first / is the machine or server name followed optionally by a colon and a port number, e.g., :80. The extraction till the first / is done using the following part of the regular expression.

 

([^:/]+)(:(\d+))?

 

The matching starts at the beginning of the URL, and continues till a /, the first /. First, it looks for one or more characters belonging to the negative character class [^:/], i.e., one or more characters that are not : or /. This means that the first parenthesized subexpression, ([^:/]+) matches the server name, and this value is remembered in the variable $1.

Next, we look for the optional port number in the URL. The second parenthesized subexpression, (:(\d+))?, tells us that the specification of the port number is optional. If a port number is provided (say, 80), the first set of parentheses matches :80. However, : is not a part of the port number; it is a separator required by the syntax. That is why we have the second set of parentheses (the inside set) in (:(\d+))?. The second set matches the port number (i.e., 80). The port number does not have to be just two digits; it can be longer. This approach to capturing the port number means that the substring that matches the first set of parentheses in (:(\d+))? is not really useful. Useful or not, :80 is the value of $2. Since it is not useful, we do not use $2 anywhere in the assignment statements that follow.
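As an aside, Perl also provides non-capturing parentheses, written (?:...), which group a subpattern without remembering what it matches. The sketch below is an alternative to the approach used in Program 4.19, not the approach taken there. With it, the port number lands in $2 and no variable is spent on :80. The helper name host_and_port is hypothetical.

```perl
#!/usr/bin/perl
# Sketch only: host_and_port is a hypothetical helper that shows
# non-capturing grouping; Program 4.19 uses capturing parentheses instead.
use strict;
use warnings;

sub host_and_port {
    my ($text) = @_;
    # (?: ... ) groups the colon and digits without creating a variable.
    if ($text =~ m@^([^:/]+)(?::(\d+))?/@) {
        return ($1, defined $2 ? $2 : "");
    }
    return;    # empty list when the pattern does not match
}

my ($host, $port) = host_and_port("www.assam.org:80/orgs/asa/");
print "Machine = $host\nPort = $port\n";
```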

There is another point that needs to be discussed. A multiplier such as + or * is a maximal multiplier in that it causes Perl to consume as much of the text as possible. This maximal or all-consuming behavior of a multiplier can be changed by placing a ? after the multiplier. In such a case, the multiplier is forced to consume the fewest characters that still allow the overall match to succeed. Such usage is called a minimal multiplier. We see such a use in the parenthesized subpattern (.+?). This subpattern matches the path part of the URL in this case.
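The difference between the maximal and the minimal form can be seen by applying both to the same string. The two helpers in the following sketch are hypothetical names chosen for illustration; each returns whatever its single parenthesized subexpression captured before a /.

```perl
#!/usr/bin/perl
# Sketch only: both helper names are hypothetical.
use strict;
use warnings;

sub up_to_slash_greedy {
    my ($text) = @_;
    return $text =~ m@^(.+)/@ ? $1 : "";     # maximal: stops at the last /
}

sub up_to_slash_minimal {
    my ($text) = @_;
    return $text =~ m@^(.+?)/@ ? $1 : "";    # minimal: stops at the first /
}

my $greedy  = up_to_slash_greedy("orgs/asa/index.html");
my $minimal = up_to_slash_minimal("orgs/asa/index.html");
print "Maximal = $greedy\n";     # Maximal = orgs/asa
print "Minimal = $minimal\n";    # Minimal = orgs
```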

When we run Program 4.19 on the same file as before, the output is the following.

URL = www.assam.org:80/orgs/asa/index.html
Machine = www.assam.org
Port = 80
Path = orgs/asa/
File = index.html

URL = www.assam.org:80/orgs/asa/index.xml
Machine = www.assam.org
Port = 80
Path = orgs/asa/
File = index.xml

URL = www.assam.org/orgs/asa/index.html
Machine = www.assam.org
Port =
Path = orgs/asa/
File = index.html

URL = www.assam.org/orgs/asa/
Machine = www.assam.org
Port =
Path = orgs/asa/
File =

URL = www.assam.org/orgs/asa
Machine = www.assam.org
Port =
Path = orgs/asa/
File =

URL = www.assam.org
Machine = www.assam.org
Port =
Path =
File =
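One subtlety about the last URL is worth noting. After the trailing / is appended, the target is www.assam.org/, and (.+?) needs at least one character after that /, so the pattern in Program 4.19 does not match. Perl's numbered variables keep their values from the last successful match, which here is the match in the if conditional, so $1 still holds www.assam.org and the others are undefined. The output shown is therefore correct, but only because the earlier match happened to capture the same string. The sketch below is one possible variation under that reading: it uses (.*?) so that an empty path is allowed, and it checks whether the match succeeded. The helper name split_url is hypothetical.

```perl
#!/usr/bin/perl
# Sketch only: split_url is a hypothetical helper; (.*?) replaces the
# (.+?) of Program 4.19 so that an empty path can still match.
use strict;
use warnings;

sub split_url {
    my ($url) = @_;
    if ($url =~ m@^([^:/]+)(:(\d+))?/(.*?)([^/]+[.](html|xml))?$@) {
        # Undefined captures are replaced by empty strings.
        return map { defined $_ ? $_ : "" } ($1, $3, $4, $5);
    }
    return;    # empty list signals that the pattern did not match
}

for my $url ("www.assam.org:80/orgs/asa/index.html", "www.assam.org/") {
    my ($machine, $port, $path, $file) = split_url($url);
    print "Machine = $machine\nPort = $port\nPath = $path\nFile = $file\n\n";
}
```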

In summary, the special variables created by Perl are numbered $1, $2, and so on. This is one case where Perl starts numbering from 1 instead of 0. The numbering follows the order in which the left parentheses occur. As usual, parentheses are allowed inside parentheses. In such a case, the outer pair, whose left parenthesis comes first, corresponds to the lower-numbered special variable.
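The numbering of nested parentheses can be checked with a short sketch. The helper name nested_captures is hypothetical. Its pattern has an outer pair that encloses two inner pairs, so the outer pair is $1 and the inner pairs are $2 and $3, in the order of their left parentheses.

```perl
#!/usr/bin/perl
# Sketch only: nested_captures is a hypothetical helper.
use strict;
use warnings;

sub nested_captures {
    my ($name) = @_;
    # Left parentheses in order: whole name, base name, extension.
    return $name =~ m@^((\w+)[.](html|xml))$@ ? ($1, $2, $3) : ();
}

my ($whole, $base, $type) = nested_captures("index.html");
print "\$1 = $whole\n\$2 = $base\n\$3 = $type\n";
```

For index.html, $1 is index.html, $2 is index, and $3 is html.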

When Perl remembers matched substrings in terms of special numbered scalars, Perl’s =~ operator also returns a list containing all the remembered substrings in sequence if it is used in a list or array context. In the following program, the array context is triggered because what is returned by =~ is used to set the value of an array.

 Program 4.20

#!/usr/bin/perl
#extractURL31.pl

while (<>){
    my $url;
    if ($_ =~ m@([\w]+://[^"]+)@i){
        $url = $1;
        print "url  = $url\n";

        $url = $url . "/" 
                if (($url !~ m@/$@) and  ($url !~ m@[.](html|xml)$@));

        @allParts = 
          ($url =~  
               m@(\w+)://([^:/]+)(:(\d+))?((/[^/]+)*)/([^/]+[.](html|xml))?@);

        print "Protocol = $allParts[0]\n";
        print "Machine = $allParts[1]\n";
        print "Port = $allParts[3]\n";
        print "Path = $allParts[4]\n";
        print "File = $allParts[6]\n";
        print "File type = $allParts[7]\n\n";
    }
}

The parentheses around the =~ expression in the assignment to @allParts are optional. The =~ operator binds its two operands more tightly than the assignment operator =. Here, there are eight pairs of parentheses in the regular expression. Therefore, eight substrings are remembered by Perl in the special variables $1 through $8. In list context, Perl’s match operation returns a list that contains the values of the eight substrings in order. We have put parentheses around the =~ expression to make things clear.

This program prints the following.

url  = http://www.assam.org:80/orgs/asa/index.html
Protocol = http
Machine = www.assam.org
Port = 80
Path = /orgs/asa
File = index.html
File type = html

url  = http://www.assam.org:80/orgs/asa/index.xml
Protocol = http
Machine = www.assam.org
Port = 80
Path = /orgs/asa
File = index.xml
File type = xml

url  = http://www.assam.org/orgs/asa/index.html
Protocol = http
Machine = www.assam.org
Port = 
Path = /orgs/asa
File = index.html
File type = html

url  = http://www.assam.org/orgs/asa/
Protocol = http
Machine = www.assam.org
Port = 
Path = /orgs/asa
File = 
File type = 

url  = http://www.assam.org/orgs/asa
Protocol = http
Machine = www.assam.org
Port = 
Path = /orgs/asa
File = 
File type = 

url  = http://www.assam.org
Protocol = http
Machine = www.assam.org
Port = 
Path = 
File = 
File type = 
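The role of context in Program 4.20 can be seen by assigning the result of the same match once to a scalar and once to an array. The following sketch uses two hypothetical helpers and a shortened pattern with only two parenthesized subexpressions. The array assignment also leaves out the optional parentheses around the =~ expression.

```perl
#!/usr/bin/perl
# Sketch only: both helper names are hypothetical.
use strict;
use warnings;

sub match_in_scalar_context {
    my ($url) = @_;
    my $matched = ($url =~ m@(\w+)://([^:/]+)@);    # true or false only
    return $matched;
}

sub match_in_list_context {
    my ($url) = @_;
    my @parts = $url =~ m@(\w+)://([^:/]+)@;        # the remembered substrings
    return @parts;
}

my $url   = "http://www.assam.org:80/orgs/asa/index.html";
my $flag  = match_in_scalar_context($url);
my @parts = match_in_list_context($url);
print "Scalar context: $flag\n";      # Scalar context: 1
print "List context: @parts\n";       # List context: http www.assam.org
```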

If we change the code slightly so that we have a list of scalar variables on the left side of the assignment, it still works exactly the same way. Since the left side of the assignment is a list of scalars, the match is evaluated in list context and the scalars are assigned the captured substrings in order.

 Program 4.21

#!/usr/bin/perl
#extractURL41.pl

while (<>){
    my $url;
    if ($_ =~ m@([\w]+://[^"]+)@i){
        $url = $1;
        print "url  = $url\n";

        $url = $url . "/" 
              if (($url !~ m@/$@) and  ($url !~ m@[.](html|xml)$@));

        my ($protocol, $machine, undef, $port, 
            $path, undef, $file, $fileType) =
        ($url 
          =~  m@(\w+)://([^:/]+)(:(\d+))?((/[^/]+)*)/([^/]+[.](html|xml))?@);
         
        print "Protocol = $protocol\n";
        print "Machine = $machine\nPort = $port\n";
        print "Path = $path\nFile = $file\n";
        print "File type = $fileType\n\n";
    }
}

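We put undef at index position two of the list since we do not want to assign the third element of the list returned by =~ to anything. This third element happens to be :80 when a port number is given. The second undef, at index position five, similarly skips the sixth element, which holds the last component of the path matched by (/[^/]+).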
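The same list assignment can be packaged as a reusable subroutine. The following is a minimal sketch under a few assumptions: the helper name parse_url and the hash keys are hypothetical, undefined captures are replaced by empty strings so that printing them does not trigger warnings, and an empty list is returned when the pattern does not match.

```perl
#!/usr/bin/perl
# Sketch only: parse_url and its hash keys are hypothetical; the regular
# expression is the one used in Program 4.21.
use strict;
use warnings;

sub parse_url {
    my ($url) = @_;
    $url .= "/" if $url !~ m@/$@ and $url !~ m@[.](html|xml)$@;

    my @parts =
        ($url =~ m@(\w+)://([^:/]+)(:(\d+))?((/[^/]+)*)/([^/]+[.](html|xml))?@);
    return unless @parts;    # no match: return an empty list

    # undef placeholders skip :80 and the last path component.
    my ($protocol, $machine, undef, $port, $path, undef, $file, $type) =
        map { defined $_ ? $_ : "" } @parts;

    return (protocol => $protocol, machine => $machine, port => $port,
            path => $path, file => $file, type => $type);
}

my %parts = parse_url("http://www.assam.org:80/orgs/asa/index.xml");
print "$_ = $parts{$_}\n" for qw(protocol machine port path file type);
```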