10.2.1 Using a tied Hash to Store Word Frequencies

   We now discuss a program that reads all files in a directory, splits their contents into “words”, and computes the frequency of each word across all the files read. We assume that the files have a .tex extension, indicating that they are text files written in TeX/LaTeX format. Once again, we use an NDBM file in this example. The word frequencies are kept in a hash that is tied to an NDBM file, so every update to the hash is written through to the NDBM file.

 Program 10.3

#!/usr/bin/perl5.6.0
#file makeWordFreq.pl
#Takes the diamond input, splits the files into words
#and computes the frequency of each word and stores
#the frequencies in a DBM file. We assume that the words
#are not hyphenated.

use NDBM_File;  #Uses a package that lets us create what
                #are called NDBM database files.
use Fcntl;

tie %WORDS, "NDBM_File", "words", O_RDWR|O_CREAT, 0644
    or die "Cannot tie to NDBM file words: $!";

#Keep only the file names that end in .tex
@ARGV = grep /[.]tex$/, @ARGV;
while (my $line = <>){
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    foreach my $word (split /\W*\s+\W*/, $line){
        $WORDS{$word}++;
    }
}
untie(%WORDS);

The program is called with * as the sole argument, which the shell expands to the names of all files in the current directory. Only files with the .tex extension are considered. Each line of these files is read, stripped of leading and trailing whitespace, and broken up into individual words. The hash %WORDS stores the number of times each word occurs.
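To see what the split pattern treats as a word, the following is a minimal sketch that applies the same trimming and the same pattern as Program 10.3 to a single made-up line. The helper words_in_line and the ordinary in-memory hash %count are hypothetical names introduced only for this illustration; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Program 10.3): trim one line and split it
# into words using the same pattern as the original program.
sub words_in_line {
    my ($line) = @_;
    $line =~ s/^\s+//;                  # remove leading whitespace
    $line =~ s/\s+$//;                  # remove trailing whitespace
    return split /\W*\s+\W*/, $line;    # separator: whitespace plus adjacent punctuation
}

# An ordinary hash stands in for the tied hash %WORDS.
my %count;
$count{$_}++ for words_in_line("  The cell, the amino acid (and the model).  ");

print "$_\t$count{$_}\n" for sort keys %count;
```

Because the separator must contain at least one whitespace character, punctuation next to a space is discarded, but punctuation at the very start or end of a line stays attached to its word: here the last key is "model)." rather than "model". The counts are also case sensitive, so "The" and "the" are stored under separate keys.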

Once the word frequencies have been recorded in the hash, we untie the hash variable, severing its link to the NDBM file. For testing purposes, a second program ties a hash to the same NDBM file, words, and prints the contents of the hash, i.e., the contents of the NDBM file, to show the word frequencies. This second program is Program 10.4.
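Before turning to the second program, the write, untie, and re-tie cycle can be sketched in one script. This is a sketch under stated assumptions, not the programs of this section: it uses SDBM_File, which is bundled with Perl, as a stand-in for NDBM_File (the tie call takes the same arguments), and it writes to a temporary directory so that no file named words is disturbed. The variable names are illustrative.

```perl
use strict;
use warnings;
use Fcntl;                        # supplies O_RDWR and O_CREAT
use SDBM_File;                    # assumption: stand-in for NDBM_File
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1); # scratch directory, removed at exit
my $base = "$dir/words";          # base name of the database files

# First tie: create the database and store a few counts.
my %counts;
tie(%counts, 'SDBM_File', $base, O_RDWR|O_CREAT, 0644)
    or die "Cannot tie to $base: $!";
$counts{$_}++ for qw(the cell the amino the);
untie %counts;                    # sever the link; the data stay on disk

# Second tie: reopen the same database and read the counts back.
my %again;
tie(%again, 'SDBM_File', $base, O_RDWR, 0644)
    or die "Cannot re-tie to $base: $!";
my %snapshot = %again;            # copy into an ordinary hash
untie %again;

print "$_ => $snapshot{$_}\n" for sort keys %snapshot;
```

The counts survive the untie because they live in the database files, not in the hash variable; the second hash can have any name as long as it is tied to the same base file name.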

 Program 10.4

#!/usr/bin/perl5.6.0
#file readWordFreq.pl

use NDBM_File;  #Uses a package that lets us create what
                #are called NDBM database files.
use Fcntl;

#The program reads the words from the
#DBM file and prints them in descending order
#of their frequency counts.

tie %WORDS, "NDBM_File", "words", O_RDWR, 0644
    or die "Cannot tie to NDBM file words: $!";
foreach my $word (sort {$WORDS{$b} <=> $WORDS{$a}} keys %WORDS)
{ 
    printf "%20s\t%3d\n", $word, $WORDS{$word};
}
untie (%WORDS);

The program ties the hash %WORDS to the NDBM file words. In the hash, the words are the keys and their numbers of occurrences are the values. The program sorts the keys of %WORDS in descending order of frequency and prints two-column output to the standard output, i.e., the screen.

A small part of the output is given below.

 

                 the    426
                  to    246
                  of    246
                 and    150
                   a    144
                cell    108
                  we     96
                  in     90
               amino     84
                  is     84
                acid     84
                  be     72
               model     60
                 The     54
                  We     54
                  as     48
                  bf     48
                need     48
                that     42
               \item     42
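
In the output above, to and of share the count 246. The comparison block in Program 10.4 returns 0 for such a pair, so the sort itself expresses no preference between them. The following is a minimal sketch, assuming an alphabetical tie-break is wanted; this tie-break is not part of the original program. An ordinary hash %freq, filled with four counts taken from the output above, stands in for the tied hash.

```perl
use strict;
use warnings;

# Four entries copied from the sample output; %freq is an illustrative
# stand-in for the tied hash %WORDS.
my %freq = (the => 426, to => 246, of => 246, and => 150);

# Descending by count; words with equal counts fall back to alphabetical order.
my @ordered = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

printf "%20s\t%3d\n", $_, $freq{$_} for @ordered;
```

With the secondary comparison, of is always listed before to, whatever order keys returns them in.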