4.7.3 Counting Word Frequency

  The next program counts the frequency of every word in a set of files. It counts only words made up entirely of alphabetic characters and hyphens. We first give the program, then explain it and show sample output.

 Program 4.25

#!/usr/bin/perl
if ($ARGV[0] =~ m/^-tex$/){
    shift @ARGV;
    @ARGV = grep (/\.tex$/, @ARGV);
}    

while ($text = <>){
    @words = split (/\W*\s+\W*/, $text);
    @words = grep (/^[a-zA-Z\-]+$/, @words);
    foreach $word (@words){
        $wordCount{$word}++;
    }
}

#printing the words in alphabetical order
print "*" x 60, "\n";
print "Printing the words alphabetically\n";
print "*" x 60, "\n";

foreach $word (sort keys (%wordCount)){
        printf "%-20s %d\n", $word, $wordCount{$word};
}

This program, too, distinguishes uppercase from lowercase letters. Therefore, It and it are counted as different words.
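The effect of case sensitivity on the counts can be seen in a minimal sketch. The word list below is hypothetical and stands in for words read from a file; the hash name %wordCount and the printf format are taken from Program 4.25.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %wordCount;

# Hypothetical sample words; "It" and "it" differ only in case.
foreach my $word (qw(It it It is it)) {
    $wordCount{$word}++;    # hash keys are case-sensitive
}

# Prints It (2), is (1), it (2), in that order.
foreach my $word (sort keys %wordCount) {
    printf "%-20s %d\n", $word, $wordCount{$word};
}
```

Perl's default sort compares strings by character code, so, assuming no locale settings are in effect, capitalized words come before lowercase ones. This is consistent with the sample output in this section, which begins with capitalized entries.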

First, we look at the arguments given in the call. If the first argument is -tex, we treat it as a Unix-style switch that restricts the program to files written in the TeX/LaTeX format. Such files have the .tex extension in their names. When the switch is present, we remove it from the argument list and use the grep function to keep only the file names that end in .tex.
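The handling of the switch can be tried on its own. The following is a small sketch, not part of Program 4.25: the array @args and the file names in it are hypothetical and stand in for @ARGV as the shell would fill it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical argument list, standing in for @ARGV.
my @args = ('-tex', 'intro.tex', 'notes.txt', 'ch1.tex', 'tex.bak');

# Same test as in Program 4.25: is the first argument the -tex switch?
if (@args && $args[0] =~ m/^-tex$/) {
    shift @args;                        # discard the switch itself
    @args = grep { /\.tex$/ } @args;    # keep only names ending in .tex
}

print join(' ', @args), "\n";           # prints: intro.tex ch1.tex
```

The name tex.bak is dropped because the pattern is anchored with $ and so requires .tex at the end of the name, not just somewhere inside it.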

Next, inside the while loop, we read each line of each file in turn into the scalar variable $text. We break $text apart with split, using as the separator one or more whitespace characters surrounded by zero or more non-word characters. This separates the words in the current line and strips punctuation that lies next to whitespace. Then, we use the grep function to keep only those words that consist entirely of alphabetic characters and hyphens. We go through each remaining word and increment its frequency count. All this is done for every remaining element of @ARGV, i.e., every file name that was given on the command line and survived the filtering. Finally, in the last foreach loop, we sort the words and print each one with its frequency.
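To see what the split and grep steps do to a single line, here is a short sketch that applies the two patterns from Program 4.25 to a made-up line of text. The sample sentence is an assumption for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input line, including the trailing newline a file read would give.
my $text = "The so-called (state-of-the-art) parser, built in 1999, works well.\n";

# Separator: whitespace with any adjacent non-word characters attached.
my @words = split(/\W*\s+\W*/, $text);

# Keep only tokens made up entirely of letters and hyphens; this drops "1999".
@words = grep { /^[a-zA-Z\-]+$/ } @words;

# Prints: The so-called state-of-the-art parser built in works well
print join(' ', @words), "\n";
```

Hyphens inside a word survive because a hyphen with no whitespace next to it cannot be part of a separator. The parentheses, commas, and final period are removed because each of them touches whitespace (the final period touches the newline).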

Let us store this script in a file called wordcountTeX.pl. Now, if we make a call such as

 

% wordcountTeX.pl -tex *

 

we get output like the following. Only part of the output is shown.

A                    24
AI                   5
ASAM                 2
Above                1
Abstracts            1
According            2
Achieving            1
Across               3
Action               9
Action-n             2
Actions              1
Additional           1
Additionally         2
Adverbs              1
After                1
Ag                   20
Aleksander           1
Align                4
Aligned              1
All                  2
Allen                1
Almost               1
Also                 3
Alspector            1
Although             7
Altogether           1
Among                5
An                   5
Analysis             1
Analyzing            2
And                  11
Another              2
Approximate          2
As                   7
Asam                 5
Aspect               1
Assuming             3
At                   4