6.11.5 Recursively Traversing a File Hierarchy


Earlier in this chapter we saw in detail how to deal with a file hierarchy by writing recursive functions. Perl has a module called File::Find that provides the ability to traverse a file hierarchy recursively. This module makes it quite easy to write the programs that were written from first principles earlier. The module provides a function called find that traverses a file hierarchy and performs an action on each element of the hierarchy. The module has another function, finddepth, that traverses the same nodes but processes the contents of a directory before the directory itself (a post-order, depth-first traversal). For some applications, it is important that contained files and directories are processed before the containing directory. find and finddepth take the same arguments: a reference to a function that is called with no arguments, and a top-level directory (more than one directory may be given). Both find and finddepth call the function given as the first argument on every element, i.e., file or directory, in the file hierarchy underneath the top-level directory.
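The difference in visiting order can be seen with a small experiment. The following is a minimal sketch, assuming a Unix-like system; the temporary layout a/b/file.txt is made up for the illustration and is not part of the programs in this section.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a small throwaway hierarchy: $top/a/b/file.txt
my $top = tempdir(CLEANUP => 1);
make_path("$top/a/b");
open(my $fh, '>', "$top/a/b/file.txt") or die "Cannot create file: $!";
close $fh;

# Record the order in which each function visits the nodes.
my (@find_order, @finddepth_order);
find(sub { push @find_order, $File::Find::name }, $top);
finddepth(sub { push @finddepth_order, $File::Find::name }, $top);

print "find:\n",      map { "  $_\n" } @find_order;
print "finddepth:\n", map { "  $_\n" } @finddepth_order;
```

With find, the top-level directory is listed first; with finddepth, it is listed last, after everything it contains.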

Thus, find or finddepth does the traversal of the hierarchy, and the processing of the current node during traversal simply involves executing the function passed as the first argument to find or finddepth. We can call this function a node-processing function. We can do whatever we want, simple or complex, in this function. By changing the definition of this function, we can perform a recursive listing of the files and directories, find the cumulative size of all files and directories in the hierarchy, or find the oldest file. The functions for such tasks are fairly simple. Inside such a node-processing function, three variables are available. They are $File::Find::name, $File::Find::dir, and $_. $File::Find::name contains the complete path name of the current file being processed. $File::Find::dir contains the name of the current (sub-)directory being processed. $_ is the name of the file within the current directory. Thus, this package assigns a value to the commonly used special variable $_, and one should keep this in mind when programming. $File::Find::name is actually $File::Find::dir/$_. By default, find also makes $File::Find::dir the current working directory while the function runs, so file tests on $_ always work, whereas file tests on $File::Find::name work reliably only if the top-level directory was given as an absolute path. Our node-processing functions use one or more of these variables to achieve the goal at hand.
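To make the three variables concrete, here is a minimal sketch that records them for every node of a throwaway hierarchy. The layout sub/note.txt and the hash keys name, dir, base and size are assumptions chosen for the illustration.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout for the illustration: $top/sub/note.txt (5 bytes)
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "Cannot create directory: $!";
open(my $fh, '>', "$top/sub/note.txt") or die "Cannot create file: $!";
print $fh "hello";
close $fh;

my @seen;
find(sub {
    # find has already changed into $File::Find::dir, so a file test
    # on $_ works here without building a full path.
    push @seen, {
        name => $File::Find::name,
        dir  => $File::Find::dir,
        base => $_,
        size => scalar(-s $_),
    };
}, $top);

for my $node (@seen) {
    printf "%-10s dir=%s\n           name=%s\n",
        $node->{base}, $node->{dir}, $node->{name};
}
```

For the top-level directory itself, $_ is "." and $File::Find::name is simply the directory that was passed in.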

First, we write a module called FileR.pm that contains several functions that can be used to traverse file hierarchies in various manners. Next, we present a simple program that shows how the subroutines defined in FileR.pm are used. The module definition follows.

 Program 6.24

#file FileR.pm
package FileR;

use File::Find;
use File::Path; 
use strict;
our (@ISA,  @EXPORT);  #package globals, so that Exporter can see them

#Export subroutines
use Exporter;
@ISA = ('Exporter');
@EXPORT = qw (lsdirR lsdirTR lsdirBR sizeR rmdirR findOldest);

my (@ALL_FILES, @ALL_TEXT_FILES, @ALL_BINARY_FILES);
my $SIZE;
my ($OLDEST_FILE, $OLDEST_AGE);

#recursive listing 
sub lsdirR{
    my ($topDir) = @_;
    @ALL_FILES = ();
    &File::Find::find (\&listAllFiles, $topDir);
    return @ALL_FILES;
  }
sub listAllFiles{
  @ALL_FILES = (@ALL_FILES, $File::Find::name);
}

#recursive listing of text files
sub lsdirTR{
    my ($topDir) = @_;
    @ALL_TEXT_FILES = ();
    &File::Find::find (\&listTextFiles, $topDir);
    return @ALL_TEXT_FILES;
  }
sub listTextFiles{
  my $file = $File::Find::name;
  @ALL_TEXT_FILES = (@ALL_TEXT_FILES, $file) if (-T $file);
}

#recursive listing of binary files
sub lsdirBR{
    my ($topDir) = @_;
    @ALL_BINARY_FILES = ();
    File::Find::find (\&listBinaryFiles, $topDir);
    return @ALL_BINARY_FILES;
  }
sub listBinaryFiles{
  my $file = $File::Find::name;
  @ALL_BINARY_FILES = (@ALL_BINARY_FILES, $file) if (-B $file);
}

#computing cumulative size
sub sizeR{
    my ($topDir) = @_;
    $SIZE = 0;  #start from zero on every call
    File::Find::find (\&calculateSize, $topDir);
    return $SIZE;
  }
sub calculateSize {
  my $file = $File::Find::name;
  $SIZE = (-s $file) + $SIZE;
}

#recursive removal of contents of a directory
sub rmdirR{
    my ($topDir) = @_;
    print "Are you sure you want to delete $topDir recursively? (Y|N) ";
    my $response = <STDIN>;
    exit unless ($response =~ /^y$/i);
    File::Find::finddepth (\&removeRecursively, $topDir);
  }

sub removeRecursively{
  my $file = $File::Find::name;
  print "Removing file $file\n";
  if (-d $file){
     rmdir $file or warn "Cannot delete directory $file: $!";
   }
   else{
     unlink $file or warn "Cannot delete file $file: $!";
   }
}

#finding the oldest file and its age
sub findOldest{
  print "In findOldest...\n"; 
  my ($topDir) = @_;
  $OLDEST_FILE = "";
  $OLDEST_AGE = 0;

  File::Find::find (\&findOldestFile, $topDir);
  return ($OLDEST_FILE, $OLDEST_AGE);
}
sub findOldestFile{
  my $file = $File::Find::name;
  my $age = -M $file;
  if ($OLDEST_AGE < $age){
       $OLDEST_AGE = $age;
       $OLDEST_FILE = $file;
     }
}

1;

The module starts by declaring its name, FileR. It uses the modules File::Find and File::Path. It also uses the strict module, which makes sure that all variables are declared before first use. We declare two variables as package globals. They must be declared with our rather than my, because a my variable is lexical and cannot be seen by the Exporter module.

our (@ISA,  @EXPORT);

These two variables are needed to be able to export names from the FileR package.

For the module FileR to make names of variables and subroutines visible, and hence usable by programs outside, we need to export them. Exporting is done by placing names or identifiers in the @EXPORT list. @EXPORT is a global variable specific to a package. We use the package called Exporter to help with the exporting of names from the package. The Exporter package looks for global variables in the package being defined to determine what it exports and how. @ISA is also a per-package global variable, which Perl consults when it looks up inherited methods. When a program says use FileR; to use variables and functions defined inside the FileR package, Perl automatically calls a method FileR->import(). In our module definition for FileR, we have not written such a method called import. However, there is such a method in the Exporter package that can be used generally, and that is why we are using this package. The content of @ISA makes sure that we inherit the import method from the Exporter package. The statement

 

@ISA = ('Exporter');   

 

says that the current package or class is a subclass of the Exporter package or class. A subclass inherits from its superior classes. In Perl, to inherit from a package, we need to place the superior package's name in the global variable @ISA. Perl looks at the packages listed in @ISA to find any method definitions that are not in the current package. The @EXPORT array in the package FileR specifies all the identifiers that are imported by default into another package. When another package imports this module FileR, the variables and functions listed in the @EXPORT array become available in the importing package without qualification with package names. That is, the names in @EXPORT are aliased into the importing package.
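The exporting mechanism can be tried out in a single file. The following is a minimal sketch; the package Greeter and its subroutines greet and hidden are hypothetical names used only for this illustration. Because the package is defined inline rather than loaded with use, the sketch calls import explicitly.

```perl
use strict;
use warnings;

# A hypothetical module defined inline so that the sketch is one file.
# Wrapping it in BEGIN makes the assignments happen at compile time, as
# they would if the package were loaded from Greeter.pm with "use".
BEGIN {
    package Greeter;
    use Exporter;
    our @ISA    = ('Exporter');   # inherit import() from Exporter
    our @EXPORT = qw(greet);      # names exported by default
    sub greet  { my ($who) = @_; return "hello, $who" }
    sub hidden { return "still callable as Greeter::hidden" }
}

# "use Greeter;" would make this call for us.
BEGIN { Greeter->import }

print greet("world"), "\n";       # imported, so no package prefix
print Greeter::hidden(), "\n";    # not exported, needs the prefix

# Check whether the non-exported name leaked into package main.
my $imported_hidden = main->can('hidden') ? 1 : 0;
print "hidden imported into main? $imported_hidden\n";
```

Only greet is aliased into the calling package; hidden stays reachable through its fully qualified name.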

The package defines six functions that can be called from outside the package. They are lsdirR, lsdirTR, lsdirBR, sizeR, rmdirR and findOldest. These are all specified in the @EXPORT variable. The package also has several variables that are shared by the subroutines in the package: @ALL_FILES, @ALL_TEXT_FILES, @ALL_BINARY_FILES, $SIZE, $OLDEST_FILE, and $OLDEST_AGE. They are declared with my at the top of the file, so they are visible throughout the file but not from the outside.

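Not all subroutines in the package are imported into a program that uses the module. Only the ones in the @EXPORT array can be called there by their short names. The others are auxiliary subroutines; they could still be reached with a fully qualified name such as FileR::listAllFiles, but they are not meant to be called from the outside. Let us look at the subroutines one by one.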

The first exported subroutine is lsdirR, which takes a directory name as an argument and obtains a recursive listing of the files. First, it sets the variable @ALL_FILES to the empty list. This is not necessary on the first call, but it ensures that no values are left over from a previous call. It then calls the function find in the File::Find module. The call is given below.

 

&File::Find::find (\&listAllFiles, $topDir);

 

We do not have to qualify the name of the function because it is automatically imported into the current package, but we do so just to make clear where it comes from. The & before the qualified function name is also optional. The first argument of find is a reference to another function that takes no arguments. The second argument is the name of a directory that is to be processed recursively. The function whose reference is passed is listAllFiles, defined in this module but not exported. $topDir specifies the top-level directory for recursive processing. The subroutine listAllFiles is automatically called on each file and directory in the file hierarchy starting from $topDir. Inside this subroutine, which is sometimes called the wanted subroutine, the complete path name of the current file is available in the scalar $File::Find::name. In the subroutine listAllFiles, the name of the current file is simply appended to the current list of files in @ALL_FILES. The calling subroutine lsdirR returns the value in the variable @ALL_FILES after the last call to listAllFiles is over. In other words, the names of all files and directories in the file hierarchy are returned.

The next subroutine exported is lsdirTR. This subroutine returns the list of all text files in the file hierarchy. It does so by using an auxiliary, non-exported function listTextFiles, which applies the -T file test to each node. The list becomes available in the variable @ALL_TEXT_FILES after the last call to the auxiliary node-processing function.

The third exported subroutine is lsdirBR that returns the names of all binary files in the file hierarchy. It uses the auxiliary subroutine called listBinaryFiles to obtain this list. The list is available in the global variable @ALL_BINARY_FILES.

The fourth subroutine is sizeR, which returns the cumulative size of all directories and files in the file hierarchy. It does so by calling the internal function calculateSize. This non-exported subroutine is called automatically for every file and directory in the hierarchy. In each call, the size of the current file or directory is found by using the file operator -s, which returns the size in bytes, and is added to the current value of the variable $SIZE.
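The same accumulation can be written without a shared variable by letting the node-processing function be a closure. This is a minimal sketch under stated assumptions, not the FileR implementation: tree_size is a hypothetical helper, and unlike sizeR it counts plain files only so that the total does not depend on how the file system reports directory sizes.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# tree_size is a hypothetical helper, not one of the FileR.pm subroutines.
sub tree_size {
    my ($top) = @_;
    my $total = 0;                  # fresh counter on every call
    find(sub {
        return unless -f $_;        # plain files only (an assumption)
        $total += (-s _) // 0;      # "_" reuses the stat done by -f
    }, $top);
    return $total;
}

# Throwaway data: two files holding 5 and 7 bytes.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "Cannot create directory: $!";
my %content = ("$top/one.txt" => "hello", "$top/sub/two.txt" => "goodbye");
for my $path (keys %content) {
    open(my $fh, '>', $path) or die "Cannot create $path: $!";
    print $fh $content{$path};
    close $fh;
}

my $bytes = tree_size($top);
print "Cumulative size = $bytes bytes\n";
```

Because $total is created inside tree_size, repeated calls cannot interfere with one another.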

The fifth exported subroutine is rmdirR, which removes the contents of a file hierarchy recursively. It first asks the user to confirm that the file hierarchy is to be removed recursively. Once confirmed, it calls the auxiliary subroutine removeRecursively to remove each file and directory in the file hierarchy. One has to be a little careful, though, in removing files and directories. A directory must be completely empty before it can be removed. Thus, all contained files and directories must be removed before any directory is removed. This is accomplished by calling File::Find::finddepth instead of File::Find::find. The finddepth function processes the contents of a directory before it processes the directory itself. Here the directory is the root of a tree, and all contained files and directories are nodes underneath the root. Thus, the deleting process removes all internal files and directories before removing the directory itself. This ensures that all directories are deleted recursively without any hitch. Simple files are removed using unlink whereas directories are removed using rmdir. It must be noted here that the rmtree function in the File::Path package also removes the contents of a directory structure recursively. What we have here is a possible implementation of File::Path::rmtree.
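For comparison, the File::Path route looks roughly as follows. This is a minimal sketch that uses remove_tree, the newer interface in File::Path alongside rmtree; the directory names are made up, and everything happens inside a temporary directory so that nothing of value is deleted.

```perl
use strict;
use warnings;
use File::Path qw(make_path remove_tree);
use File::Temp qw(tempdir);

# Build a throwaway hierarchy to delete: $scratch/a/b/c/file.txt
my $scratch = tempdir(CLEANUP => 1);
my $victim  = "$scratch/a";
make_path("$victim/b/c");
open(my $fh, '>', "$victim/b/c/file.txt") or die "Cannot create file: $!";
close $fh;

# Collect failures instead of letting remove_tree report them itself.
remove_tree($victim, { error => \my $errors });
for my $diag (@{ $errors || [] }) {
    my ($file, $message) = %$diag;
    warn "Problem removing $file: $message\n";
}

my $removed = !-e $victim;
print $removed ? "removed $victim\n" : "$victim still exists\n";
```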

The sixth exported function is findOldest, which finds the oldest file or directory in a file hierarchy. It returns two scalars: the name of the oldest file and its age in days, as computed by the -M file operator. The auxiliary subroutine findOldestFile is called on every node and records the oldest one seen so far. These two subroutines share the variables $OLDEST_FILE and $OLDEST_AGE.
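The search for the oldest node can also be tried in isolation. The following is a minimal sketch, assuming a system where utime can set modification times; oldest_under is a hypothetical helper, and the file names new.txt and old.txt are made up. One file is back-dated by ten days so that the result is predictable.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# oldest_under is a hypothetical helper, not one of the FileR.pm subroutines.
sub oldest_under {
    my ($top) = @_;
    my ($oldest, $oldest_age);
    find(sub {
        my $age = -M $_;                # days since last modification
        return unless defined $age;     # skip nodes that cannot be examined
        if (!defined $oldest_age || $age > $oldest_age) {
            ($oldest, $oldest_age) = ($File::Find::name, $age);
        }
    }, $top);
    return ($oldest, $oldest_age);
}

# Two empty files; old.txt is back-dated by ten days.
my $top = tempdir(CLEANUP => 1);
for my $name (qw(new.txt old.txt)) {
    open(my $fh, '>', "$top/$name") or die "Cannot create $name: $!";
    close $fh;
}
my $ten_days_ago = time - 10 * 24 * 60 * 60;
utime($ten_days_ago, $ten_days_ago, "$top/old.txt")
    or die "Cannot set times: $!";

my ($file, $age) = oldest_under($top);
printf "oldest file = %s (%.2f days)\n", $file, $age;
```

Note that -M measures age relative to the time the script started, not the time of the test.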

A program that calls all the exported functions is given below.

 Program 6.25

#!/usr/bin/perl
#file fileFuns.pl

use FileR;
use Cwd;
use strict;

$" = "\n";

my ($dir, @fileListing, $oldestFile, $oldestAge);
$dir = $ARGV[0] or $dir = qq{.};

#Print a listing of all files under the directory; recursive
#There is a problem with permissions while listing files..
print "dir = $dir\n";
@fileListing = FileR::lsdirR ("$dir");
print "*" x 50, "\n";
print "File list: \n";
print "@fileListing\n";

#Print a listing of all text files in the directory; recursive
@fileListing = FileR::lsdirTR ("$dir");
print "*" x 50, "\n";
print "Text file list: \n";
print "@fileListing\n";

#print a listing of all binary files in the directory; recursive
@fileListing = FileR::lsdirBR ("$dir");
print "*" x 50, "\n";
print "Binary File list: \n";
print "@fileListing\n";

#find the cumulative size of all files in the directory; recursive
my $size = FileR::sizeR ("$dir");
print "*" x 50, "\n";
print "Cumulative size of files = $size kilobytes\n";

#find the oldest file in the directory and its age; recursive
($oldestFile, $oldestAge) = &FileR::findOldest ("$dir");
print "oldest file = $oldestFile\n";
printf  "oldest age = %5.2f days\n", $oldestAge;

#Finally, ask for the name of a directory and delete it, recursively
print "*" x 50, "\n";
print "Name a directory to delete recursively: ";
my $delDir = <STDIN>;
chomp $delDir;
#If the directory name is not absolute, absolutize it; 
#    works in Unix, recursive
if ($delDir !~  m@^/@){
     $delDir = (Cwd::cwd()) . "/$delDir";
     print "delDir = $delDir\n";
   }
&FileR::rmdirR ("$delDir");

This program calls each one of the six exported functions one by one. It expects to get a directory name as a command-line argument, and if the command-line argument is missing, it uses the current directory. At the very end, it asks for the name of a directory, either absolute or relative, to remove recursively. It confirms the name entered by asking the user, and if confirmed, removes the contents of the directory recursively. A sample interaction with this program is given below.

 

%fileFuns.pl /home/kalita/perl/file
dir = /home/kalita/perl/file
**************************************************
File list:
/home/kalita/perl/file
/home/kalita/perl/file/fileCopy.plx
/home/kalita/perl/file/filecopytest.pl
/home/kalita/perl/file/dircopytest.pl
/home/kalita/perl/file/mkdir1.pl
/home/kalita/perl/file/checkPath.plx
/home/kalita/perl/file/filecopy.pl
/home/kalita/perl/file/mkdir.pl
/home/kalita/perl/file/basename.pl
/home/kalita/perl/file/FileR.pm
/home/kalita/perl/file/fileparse.pl
/home/kalita/perl/file/FileR1.pm
/home/kalita/perl/file/myMakePath.pl
/home/kalita/perl/file/myMkpath.pl
/home/kalita/perl/file/myMkpath1.pl
/home/kalita/perl/file/fileFuns.pl
/home/kalita/perl/file/filecopy1.pl
/home/kalita/perl/file/rmtree.pl
/home/kalita/perl/file/jk1.jpg
/home/kalita/perl/file/a
/home/kalita/perl/file/a/b
/home/kalita/perl/file/a/b/c
/home/kalita/perl/file/a/b/c/mkdir1.pl
/home/kalita/perl/file/a/b/c/jk1.jpg
**************************************************
Text file list:
/home/kalita/perl/file/fileCopy.plx
/home/kalita/perl/file/filecopytest.pl
/home/kalita/perl/file/dircopytest.pl
/home/kalita/perl/file/mkdir1.pl
/home/kalita/perl/file/checkPath.plx
/home/kalita/perl/file/filecopy.pl
/home/kalita/perl/file/mkdir.pl
/home/kalita/perl/file/basename.pl
/home/kalita/perl/file/FileR.pm
/home/kalita/perl/file/fileparse.pl
/home/kalita/perl/file/FileR1.pm
/home/kalita/perl/file/myMakePath.pl
/home/kalita/perl/file/myMkpath.pl
/home/kalita/perl/file/myMkpath1.pl
/home/kalita/perl/file/fileFuns.pl
/home/kalita/perl/file/filecopy1.pl
/home/kalita/perl/file/rmtree.pl
/home/kalita/perl/file/a/b/c/mkdir1.pl
**************************************************
Binary File list:
/home/kalita/perl/file
/home/kalita/perl/file/jk1.jpg
/home/kalita/perl/file/a
/home/kalita/perl/file/a/b
/home/kalita/perl/file/a/b/c
/home/kalita/perl/file/a/b/c/jk1.jpg
**************************************************
Cumulative size of files = 229099 kilobytes
In findOldest...
oldest file = /home/kalita/perl/file/fileCopy.plx
oldest age = 240.06 days
**************************************************
Name a directory to delete recursively: a
delDir = /home/kalita/perl/file/a
Are you sure you want to delete /home/kalita/perl/file/a recursively? (Y|N) Y
Removing file /home/kalita/perl/file/a/b/c/mkdir1.pl
Removing file /home/kalita/perl/file/a/b/c/jk1.jpg
Removing file /home/kalita/perl/file/a/b/c
Removing file /home/kalita/perl/file/a/b
Removing file /home/kalita/perl/file/a

 

First, the program gives a complete recursive listing of the files in the hierarchy. Next, it prints the recursive list of text files and then the recursive list of binary files. Directories and graphic files are considered binary files. The program determines that the cumulative size of all files and directories in the hierarchy is 229099 and that the oldest file is /home/kalita/perl/file/fileCopy.plx and that this file is 240 days old. Note that the -s operator reports sizes in bytes, so the size printed is a byte count even though the program labels it as kilobytes. The program then prompts for the name of a directory, and the user responds with the name a. It deletes this sub-directory recursively. It prints the names of files being removed, in removal order.

Finally, a word of caution about using the File::Find package. Experience has shown that the find and finddepth functions can be quite slow, especially when dealing with large directory structures, say ones containing many hundreds of files or more. For example, when run on a directory such as /, the top-level directory in Linux and Mac OS X Server, on fairly fast machines, it took many hours for the listings to start printing on the screen. This is not acceptable. So, it may be better to use a hand-written subroutine, especially a non-recursive implementation of depth-first or breadth-first search, for faster processing. In addition, when running in Unix, the find and finddepth functions seem to have permission problems with files and directories in unexpected places. The permission problems do not occur with the programs discussed earlier in the chapter.
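A non-recursive traversal of the kind suggested above can be sketched with an explicit stack. This is only an illustration, assuming a Unix-like system; list_tree is a hypothetical helper, symbolic links are deliberately not followed, and no claim is made here about how its speed compares with File::Find.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# list_tree is a hypothetical helper; it walks the hierarchy with an
# explicit stack instead of recursion or File::Find.
sub list_tree {
    my ($top) = @_;
    my @pending = ($top);       # nodes still to be visited
    my @found;
    while (@pending) {
        my $path = pop @pending;    # pop: depth-first; shift: breadth-first
        push @found, $path;
        next if -l $path;           # do not follow symbolic links
        next unless -d $path;
        my $dh;
        unless (opendir($dh, $path)) {
            warn "Cannot read $path: $!\n";
            next;
        }
        push @pending,
            map  { $path . '/' . $_ }
            grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;
    }
    return @found;
}

# Throwaway hierarchy: $top/a/b/file.txt
my $top = tempdir(CLEANUP => 1);
make_path("$top/a/b");
open(my $fh, '>', "$top/a/b/file.txt") or die "Cannot create file: $!";
close $fh;

my @listing = list_tree($top);
print "$_\n" for @listing;
```

Unreadable directories are reported with a warning and skipped, so a permission problem in one place does not stop the rest of the traversal.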