4.8.3.1 Counting Word Frequencies: Again

Let us now write another version of the program that counts word frequencies in a set of files. This time we do not use the split function as we did before; instead, we extract the words directly with a pattern match. Be warned that the first version of the program, presented next, does not work correctly.

 Program 4.29

#!/usr/bin/perl
if ($ARGV[0] =~ m/^-tex$/){
    shift @ARGV;
    @ARGV = grep (/\.tex$/, @ARGV);
}    

while ($text = <>){
    @words =  ($text =~ /([a-zA-Z\-]+)/);
    foreach $word (@words){
        $wordCount{$word}++;
    }
}

#printing the words in alphabetical order
print "*" x 60, "\n";
print "Printing the words alphabetically\n";
print "*" x 60, "\n";

foreach $word (sort keys (%wordCount)){
        printf "%-20s %d\n", $word, $wordCount{$word};
}

This program changes the code inside the while loop from the previous version. In the previous version, the while loop looked like the following.

 

while ($text = <>){
    @words = split (/\W*\s+\W*/, $text);
    @words = grep (/^[a-zA-Z\-]+$/, @words);
    foreach $word (@words){
        $wordCount{$word}++;
    }
}

Now, the code inside the while loop has changed to the following.


while ($text = <>){
    @words =  ($text =~ /([a-zA-Z\-]+)/);
    foreach $word (@words){
        $wordCount{$word}++;
    }
}

Everything else in the program is the same as before. In the new program, we read a line of text into the scalar $text and then perform a pattern match on it. Because the pattern is enclosed in parentheses, the match remembers the substring that matched, and because the result is assigned to an array, that substring is placed in @words.

When we run this program, we find that the word frequencies are much smaller than those printed by the previous version. A careful look at the while loop shows why. Since the pattern match on $text uses no options or modifiers, it stops after the first successful match, so only the first word of each line is captured in @words. In other words, only the first word of each line is counted when we compute the frequencies of words in the files. This is definitely wrong and results in gross undercounting. The first few frequencies printed by the program on the same files as before are given below.

A                    10
AFF                  2
Across               1
Additionally         2
AdvP                 1
After                1
All                  2
Almost               1
Also                 1
Among                2
An                   2
Analyzing            2
And                  7
Another              1
As                   5

We see that many words that occurred earlier are missing and, for the words that remain, the frequencies are undercounted.
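
The undercounting is easy to reproduce on a single string. The following minimal sketch applies the same match, without any modifier, to a made-up sample line. The variable $line and its contents are illustrative assumptions and are not taken from the input files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line; not from the input files used in the text.
my $line = "the quick brown fox jumps over the lazy dog";

# Without the g modifier, a match in list context returns only the
# captures of the first successful match: here, a single word.
my @words = ($line =~ /([a-zA-Z\-]+)/);

print scalar(@words), " word captured: @words\n";   # 1 word captured: the
```

Although the line contains nine words, only the first one reaches @words, which is why every count in the listing above is too low.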

Our objective is to extract all the words from each line of text, not just the first one. We can accomplish this with a very small change to the program: we specify the g (global match) modifier on the pattern match operator. It tells Perl to change its usual behavior of matching only once per target string and to match as many times as possible instead. If a line of text has ten words, Perl matches all of them. Moreover, with the g modifier the match returns a list containing all the matched words when it is used in a context where a list is expected. This is such a situation because the value returned by =~ is assigned to an array variable. As a result, @words contains all the words in the line of text from the input file and not just the first word. Hence, the program counts the frequencies of all words in the files given as arguments. The text of the program is given below.

Program 4.30


#!/usr/bin/perl
if ($ARGV[0] =~ m/^-tex$/){
    shift @ARGV;
    @ARGV = grep (/\.tex$/, @ARGV);
}    

while ($text = <>){
    @words =  ($text =~ /([a-zA-Z\-]+)/g);
    foreach $word (@words){
        $wordCount{$word}++;
    }
}

#printing the words in alphabetical order
print "*" x 60, "\n";
print "Printing the words alphabetically\n";
print "*" x 60, "\n";

foreach $word (sort keys (%wordCount)){
        printf "%-20s %d\n", $word, $wordCount{$word};
}

The line of code that extracts the words from a line of text is given below.


@words =  ($text =~ /([a-zA-Z\-]+)/g);

This line has the g modifier appended to the specification of the regular expression. This modifier does the trick for us. The target string is $text. The pattern match operation


$text =~ /([a-zA-Z\-]+)/g

succeeds every time the pattern is found in the target string. Moreover, each time the pattern matches, the matching substring is remembered by Perl because the pattern is enclosed in parentheses. So, if there are ten words in the target string, all ten are captured. The match returns the list of these words, and this list is assigned to @words.
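
The effect of the g modifier can be checked on a single string before running the full program. The following is a minimal sketch under the same assumptions as the program above; the sample line in $line is made up for illustration and does not come from the input files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line; not from the input files used in the text.
my $line = "the quick brown fox jumps over the lazy dog";

# With the g modifier, a match in list context returns every
# substring captured by the parentheses, one per successful match.
my @words = ($line =~ /([a-zA-Z\-]+)/g);

# Count the words exactly as the while loop of the program does.
my %wordCount;
foreach my $word (@words) {
    $wordCount{$word}++;
}

print scalar(@words), " words captured\n";   # 9 words captured

# Print the counts alphabetically, in the same format as the program.
foreach my $word (sort keys %wordCount) {
    printf "%-20s %d\n", $word, $wordCount{$word};
}
```

All nine words of the sample line are captured, and the word "the", which occurs twice, is counted twice.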