10.2.1 Using a tied Hash to Store Word Frequencies
We now discuss a program that reads all the files in a directory, splits their contents into “words”, and computes the frequency of each word across all the files read. We assume that the files have the .tex extension, indicating that they are text files written in the TeX/LaTeX format. Once again, we use an NDBM file in this example. The word frequencies are kept in a hash that is tied to an NDBM file, so every update to the hash is actually written out to the NDBM file.
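To make the idea of a tied hash concrete before the full program, the following is a minimal sketch of a store, untie, re-tie, and read-back cycle. It is an illustration only and rests on several assumptions: it uses SDBM_File instead of NDBM_File because SDBM_File ships with Perl, it writes to a temporary directory created with File::Temp, and the database name demo and the variables %count, %again, and $seen are hypothetical.

```perl
#!/usr/bin/perl
# Sketch: values stored through a tied hash survive untie and re-tie.
# Assumption: SDBM_File stands in for NDBM_File; both accept the same
# tie arguments (file name, open flags, permission mode).
use strict;
use warnings;
use Fcntl;                          # supplies O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $base = "$dir/demo";             # hypothetical database name

my %count;
tie(%count, 'SDBM_File', $base, O_RDWR|O_CREAT, 0644)
    or die "Cannot tie $base: $!";
$count{cell}++ for 1 .. 3;          # each increment is written to the file
untie %count;                       # sever the link; the data stays on disk

my %again;
tie(%again, 'SDBM_File', $base, O_RDWR, 0644)
    or die "Cannot re-tie $base: $!";
my $seen = $again{cell};            # read back what was stored earlier
untie %again;

print "cell occurred $seen times\n";
```

The second hash never sees the first one; the count it reads comes entirely from the database file on disk, which is the property the word-frequency programs rely on.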
Program 10.3
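#!/usr/bin/perl5.6.0
#file makeWordFreq.pl
#Takes the diamond input, splits the files into words,
#computes the frequency of each word, and stores
#the frequencies in a DBM file. We assume that the words
#are not hyphenated.
use NDBM_File; #Uses a package that lets us create what
               #are called NDBM database files.
use Fcntl;     #Supplies the O_RDWR and O_CREAT constants.
tie(%WORDS, "NDBM_File", "words", O_RDWR|O_CREAT, 0644)
    or die "Cannot tie to NDBM file words: $!";
#Keep only the file names that end in .tex
@ARGV = grep /[.]tex$/, @ARGV;
die "No .tex files given\n" unless @ARGV; #Otherwise <> would read standard input
while (my $line = <>){
    $line =~ s/^\s+//;  #Remove initial spaces
    $line =~ s/\s+$//;  #Remove terminal spaces
    foreach my $word (split /\W*\s+\W*/, $line){
        next if $word eq ""; #split can return an empty first field
        $WORDS{$word}++;
    }
}
untie(%WORDS);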
The program is called with * as the sole argument, which the shell expands to the names of all files in the current directory. Only the files with the .tex extension are considered. Each line of such a file is read, stripped of initial and terminal spaces, and broken up into individual words. The hash %WORDS stores the number of times each word occurs.
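The trimming and splitting steps can be tried on a single line without any DBM file. The following is a minimal sketch that applies the same split pattern to an ordinary in-memory hash; the helper count_words, the hash %freq, and the sample sentence are hypothetical and serve only as illustration.

```perl
#!/usr/bin/perl
# Sketch: trim a line, split it into words, and count them in a plain hash.
use strict;
use warnings;

# count_words is a hypothetical helper, not part of the program in the text.
sub count_words {
    my ($line, $freq) = @_;
    $line =~ s/^\s+//;              # remove initial spaces
    $line =~ s/\s+$//;              # remove terminal spaces
    foreach my $word (split /\W*\s+\W*/, $line) {
        next if $word eq '';        # split can return an empty first field
        $freq->{$word}++;
    }
    return $freq;
}

my %freq;
count_words("  The cell, the model; the cell.  ", \%freq);
print "$_ => $freq{$_}\n" for sort keys %freq;
```

On this sample line, the and The are counted separately because the hash keys are case sensitive, and the final cell. keeps its period because the pattern only removes punctuation that is adjacent to white space.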
Once the word frequencies have been stored in the hash, we untie the hash variable, which severs its link to the NDBM file. For testing purposes, we have a second program that ties a hash to the same NDBM file, words. We print the contents of the hash, i.e., the contents of the NDBM file, to see the word frequencies. The second program is given below.
Program 10.4
#!/usr/bin/perl5.6.0
#file readWordFreq.pl
use NDBM_File; #Uses a package that lets us create what
               #are called NDBM database files.
use Fcntl;
#The program reads the words from the
#DBM file and prints them out, sorted by
#decreasing frequency count.
tie(%WORDS, "NDBM_File", "words", O_RDWR, 0644)
    or die "Cannot tie to NDBM file words: $!";
foreach my $word (sort {$WORDS{$b} <=> $WORDS{$a}} keys %WORDS)
{
    printf "%20s\t%3d\n", $word, $WORDS{$word};
}
untie(%WORDS);
The program ties the hash %WORDS to the NDBM file words. In the hash, the words are the keys and their numbers of occurrences are the values. The program sorts the keys of %WORDS in descending order of frequency and prints two-column output to the standard output, i.e., the screen.
A small part of the output is given below.
the 426
to 246
of 246
and 150
a 144
cell 108
we 96
in 90
amino 84
is 84
acid 84
be 72
model 60
The 54
We 54
as 48
bf 48
need 48
that 42
\item 42
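The descending sort used by the second program can be tried on an ordinary hash. The following is a minimal sketch that uses four counts taken from the output above; the alphabetical tie-break (or $a cmp $b) is an addition for illustration and is not part of the program in the text, and the names %freq and @ranked are hypothetical.

```perl
#!/usr/bin/perl
# Sketch: order hash keys by descending value, breaking ties alphabetically.
use strict;
use warnings;

# Four counts copied from the sample output.
my %freq = (the => 426, to => 246, of => 246, and => 150);

# $b before $a gives descending numeric order; when two counts are
# equal, <=> returns 0 and the "or" falls through to the string compare.
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

printf "%20s\t%3d\n", $_, $freq{$_} for @ranked;
```

Without the tie-break, words with equal counts, such as to and of, appear in whatever order the hash happens to return them, so their relative order can differ from run to run.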
