6.12 Archiving Directories and Files: Archive::Tar Module
Unix provides a tool called tar to make so-called “tape archives” of directories and the files they contain. Versions of tar are available on other operating systems too. The name “tape archive” is now somewhat outdated because few people still use magnetic tape, and tar is rarely used only for writing to tape. Usually, a tar archive is made when a set of directories and files that constitute a reasonably sized software project must be transferred from one machine to another. A complex software project may consist of tens or even hundreds of files, organized neatly into one or more directories. It is cumbersome to transfer the files and directories one by one from a source machine to a destination machine. Although there are FTP programs that can transfer directories recursively, doing so is still time-consuming.
When one transfers a complex academic or commercial project between two machines, one should do it as neatly and efficiently as possible. The programmers of a complex academic project may want to place the project’s code on the Internet so that interested individuals can download it easily for free. A company may want to place the software constituting a product at a secure Internet site so that those who have paid can download the software using a given password or by other secure means. In such situations, it is convenient to produce one file out of all the relevant directories and files that constitute the project. Such a comprehensive file is called a tar file if we use the tar facility to produce it. In common parlance, it is often called a tar ball. Unix’s tar tool takes various arguments to produce the archive, extract files from the archive, or view the files contained in the tar ball. Tarring and untarring tools are available on other platforms as well, including Windows machines and Macintoshes, quite frequently with a graphical user interface.
First, we present a program that tars the files of a simple project. All the project files are under a single top level directory. The top level directory itself holds several files of interest. In addition, below the top level directory, the project has two useful sub-directories; there may be other sub-directories, which are ignored when the project is archived. The two relevant sub-directories are HTML and perl. The perl directory contains three sub-directories: cgi-bin, code and modules. In each sub-directory, there are files that are either Perl programs or HTML files. There may be other types of files also. When one works on a project, it is quite likely that one produces temporary files, saves duplicate copies of important files, or creates incremental versions of files for testing purposes. Many text editors also automatically keep copies of older versions of files. For example, the Emacs editor, quite popular on Unix and related platforms and also available for the PC and the Macintosh, usually keeps the previous version of a file under the same name with ~ appended. Thus, the source directories of a project may be cluttered. So, when the project is archived, we want a tarring program that is a bit smart in that it archives only selected directories and files.
In the project under discussion, the cgi-bin and Perl code files have the extension .plx. These are Perl programs. Perl programs are not required to have any specific extension. The usual extension is .pl, but the authors of this project decided to use the somewhat unconventional .plx instead. The HTML files have either the .html or the .htm extension. The Perl modules have the extension .pm. Only the specified sub-directories under the top level directory, and only files with the specified extensions, are tarred.
The Perl tar package is called Archive::Tar. First, a new, empty archive object is produced by making a call to the new class method. Then, one adds files to the archive by calling the add_files method on the object. The list of files to add is given as the argument to the add_files method. If the added files are in directories and sub-directories, the paths to these files have to be provided. In this program, the paths are provided in relative form. The program, called archive.pl, is given below. The program also uses what is called POD (Plain Old Documentation) to provide comments. POD comments can span multiple lines, and they can be extracted from the program’s text file to produce HTML or textual documentation. POD documentation is discussed in Section 1.15.
Program 6.27
#!/usr/bin/perl
=head1 NAME
script archive.pl
=head1 SYNOPSIS
Makes a tar ball of the top level files and specified sub-directories
associated with the APTracker project. The default tar ball name is
APTracker.tar.
=head1 UPDATE HISTORY
07/24/2000: Written, Jugal Kalita, recurses from first principles
03/09/2001: Updated, Jugal Kalita, added loop
03/24/2001: Updated, Jugal Kalita, uses File::Find
=head1 DIRECTORIES TARRED
=over 4
=item the top-level directory of the distribution
=item html directory
=item perl/cgi-bin directory
=item perl/code directory
=item perl/modules directory
=back
=cut
use strict;
use File::Find;
use Archive::Tar;
use Cwd;
$" = "\n"; #Separator for printing lists in double-quoted strings
my @ALL_FILES_TO_TAR = ();
my $tar;
my ($cgibinSrcDir, $cgibinExtensions, $htmlSrcDir, $htmlExtensions);
my ($perlCodeSrcDir, $perlCodeExtensions);
my ($perlModuleSrcDir, $perlModuleExtensions);
my ($tarFileName, $sourceTopDir, $excludeExtensions);
$cgibinSrcDir = "perl/cgi-bin";
$cgibinExtensions = "plx";
$htmlSrcDir = "HTML";
$htmlExtensions = "html|htm";
$perlCodeSrcDir = "perl/code";
$perlCodeExtensions = "plx";
$perlModuleSrcDir = "perl/modules";
$perlModuleExtensions = "pm";
$excludeExtensions = "(.*~)";
$sourceTopDir = "/home/kalita/perl/tar/ap.dev";
my @allSrcDirs = ($htmlSrcDir, $perlCodeSrcDir, $cgibinSrcDir,
$perlModuleSrcDir);
my @allExtensions = ($htmlExtensions, $perlCodeExtensions,
$cgibinExtensions, $perlModuleExtensions);
if ($#allExtensions != $#allSrcDirs){
print "Please provide extensions for files in all source directories\n";
exit 1;
}
print "\nWhat is the name of the tar file you want to create\n";
print "(Use \"APTracker.tar\" as default)?";
$tarFileName = <STDIN>;
chomp ($tarFileName);
if (!($tarFileName)){
$tarFileName = "APTracker.tar";
}
print "tar file name = $tarFileName\n";
if (-e $tarFileName){
unlink $tarFileName or
die "Cannot reinitialize by deleting existing tar file: $!"
}
#Start tarring
$tar = Archive::Tar->new();
print "Tarring files into $tarFileName...\n\n";
#Tar files at the top level
my $oldDir = cwd ();
chdir ($sourceTopDir) or die "Cannot chdir to $sourceTopDir: $!";
my $currDir = cwd();
opendir (DIR, $currDir) or die "Cannot open $currDir: $!";
my @files = readdir DIR or die "cannot read $currDir: $!";
@files = grep !/^[.]{1,2}$/, @files;
@files = grep !/[.]$excludeExtensions$/, @files;
@files = grep {if (-d $_) {0} else {1}} @files;
print "\nTarring top-level files...\n@files\n\n";
$tar -> add_files (@files);
#Tar all the specified sub-directories
my $i;
for ($i = 0; $i <= $#allSrcDirs; $i++){
&tarDir ($tar, $allSrcDirs[$i], $allExtensions[$i]);
}
print "Current working directory is: " . cwd () . "\n";
$tar -> write ("$oldDir/$tarFileName");
####subroutine to tar a directory's contents
sub tarDir{
my ($tarArchive, $sourceDir, $extensionRegex) = @_;
@ALL_FILES_TO_TAR = ();
File::Find::find (\&listAllFilesToTar, $sourceDir);
print "+++++ALL_FILES_TO_TAR = @ALL_FILES_TO_TAR\n";
my @sourceFiles;
#Group the alternatives and anchor them at the end of the name
@sourceFiles = grep /\.(?:$extensionRegex)$/, @ALL_FILES_TO_TAR;
@sourceFiles = grep !/[.]$excludeExtensions$/, @sourceFiles;
print "Tarring source files...\n@sourceFiles\n\n";
$tarArchive -> add_files (@sourceFiles);
}
sub listAllFilesToTar{
@ALL_FILES_TO_TAR = (@ALL_FILES_TO_TAR, $File::Find::name);
}
The program starts with POD comments and then declares a number of variables. Values are assigned to variables to specify the location of relevant files. The HTML files are stored in the HTML sub-directory below the top level directory. The Perl files are stored in the directory perl, under three sub-directories: cgi-bin, code and modules. This is because, when a Perl project is installed, cgi-bin, module and regular code files usually have to be stored in different locations. cgi-bin files usually need to be stored in one or more specific directories that depend on the operating system and the Web server. Perl modules written by users for general use also need to be stored in certain system-specified directories, usually called site lib directories, so that all Perl programs can find them when needed. Regular Perl code can be stored in any directory. In practice, each one of these sub-directories can have embedded directories down to several levels of containment. The program also requires a specification of acceptable extensions for files in every sub-directory below the top level. It also specifies extensions to exclude. The extensions are specified in terms of regular expressions.
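Because an extension list such as html|htm is interpolated into a larger pattern, it is safest to wrap the alternatives in a group and anchor them at the end of the name. The following is a minimal sketch of such a filter. The subroutine name select_by_extension and the sample file names are made up for illustration and are not part of the project.

```perl
use strict;
use warnings;

# Hypothetical helper: keep names whose extension matches $extensions
# (an alternation such as "html|htm") and drop names that match
# $exclude (such as "~" for editor backups).
sub select_by_extension {
    my ($extensions, $exclude, @names) = @_;
    my $keep = qr/\.(?:$extensions)\z/;   # group, then anchor at the end
    my $drop = qr/(?:$exclude)\z/;
    return grep { $_ =~ $keep && $_ !~ $drop } @names;
}

my @names  = ('index.html', 'old.htm', 'index.html~',
              'htm_notes.txt', 'WS_FTP.LOG');
my @picked = select_by_extension('html|htm', '~', @names);
print "$_\n" for @picked;
```

Without the group, a pattern built as \.html|htm would also accept any name that merely contains the letters htm, such as htm_notes.txt.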
The program asks for the name of the tar file to create. If a name is not given, the default name used is APTracker.tar for the project. If a file with that name already exists, it is deleted, or unlinked, so that the program starts from a clean slate. The deletion is a precaution: the archive is assembled in memory by add_files, and in current versions of Archive::Tar the write method replaces the contents of the named file rather than appending to them.
The program creates a new, empty archive object by calling new and stores a reference to it in $tar. Nothing is written to disk at this point.
$tar = Archive::Tar->new();
The program then chdirs to the top-level directory where the files to be archived exist. At the top level, the program obtains the list of all files and directories other than . and .. (. is the current directory and .. is the parent directory). These two entries show up in every directory listing in Unix. They are not added to the list of files to be archived. The program also removes the names of files with unacceptable extensions. In this program, the only unacceptable names are those that end with ~. These are older versions of files edited with Emacs. At this time, the program also removes all directories at the top level from the list. As a result, the program does not archive any directory that is not specifically added to the tar archive later in the program. The simple file names at the top level are then added to the tar archive by the following command.
$tar -> add_files (@files);
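The filtering of the top-level listing can be tried in isolation. The sketch below is not the author's code: it builds a throwaway directory with File::Temp, uses a lexical directory handle, and applies the same three filters. The file names are invented for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Build a throwaway directory so the sketch does not touch real files.
my $dir = tempdir(CLEANUP => 1);
for my $name ('README', 'install.plx', 'install.plx~') {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}
mkdir "$dir/perl" or die "Cannot create $dir/perl: $!";

# Read the directory through a lexical handle.
opendir my $dh, $dir or die "Cannot open $dir: $!";
my @entries = readdir $dh;
closedir $dh;

# Drop . and .., editor backups, and anything that is a directory.
my @files = grep { !-d "$dir/$_" }
            grep { !/~\z/ }
            grep { !/^\.{1,2}\z/ } @entries;
@files = sort @files;

print "$_\n" for @files;
```

Testing each entry as "$dir/$_" avoids the need to chdir into the directory before using the -d file test.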
In the rest of the program, those sub-directories under the top level that are to be archived are specified.
The list of all sub-directories to archive below the top level directory is available in the variable called @allSrcDirs. The corresponding list of acceptable extensions is available in the variable @allExtensions. The program loops over all sub-directories in @allSrcDirs and calls the subroutine tarDir on each. This loop is shown below.
for ($i = 0; $i <= $#allSrcDirs; $i++){
&tarDir ($tar, $allSrcDirs[$i], $allExtensions[$i]);
}
Finally, all files added to the tar archive referenced by $tar are actually written to the archive file by the following statement.
$tar -> write ("$oldDir/$tarFileName");
At this point, all the files added to be tarred are put together using the tar format and are available in the single archive file named by the scalar variable $tarFileName, in the directory from which archive.pl was started (saved in $oldDir before the chdir). A tar file is not a plain text file: it is a sequence of 512-byte blocks in which each archived file is preceded by a header block that records its name, its size and other details, so that file boundaries and locations in the archive’s hierarchy can be recovered. It is customary to use the .tar extension for a tar file.
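The layout of a tar file can be inspected directly. In the tar format, every member starts with a 512-byte header block whose first 100 bytes hold the member's name, padded with NUL characters. The sketch below writes a one-member archive and reads that header back as raw bytes. The archive name demo.tar and the member hello.txt are hypothetical.

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);
my $tarFile = "$dir/demo.tar";        # hypothetical archive name

# Build a one-member archive in memory and write it out.
my $tar = Archive::Tar->new();
$tar->add_data('hello.txt', "Hello, tar!\n")
    or die "Cannot add data: " . $tar->error;
$tar->write($tarFile) or die "Cannot write $tarFile: " . $tar->error;

# Read the first header block back as raw bytes.
open my $fh, '<', $tarFile or die "Cannot open $tarFile: $!";
binmode $fh;
read($fh, my $header, 512) == 512 or die "Short read on $tarFile";
close $fh;

# The member name is the first 100 bytes, padded with NULs.
my ($memberName) = unpack 'Z100', $header;
my $size         = -s $tarFile;

print "first member: $memberName\n";
print "archive size: $size bytes\n";
```

The size printed is a multiple of 512 because both the headers and the file contents are padded to whole blocks.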
The tarDir subroutine takes three arguments: a tar archive object, a source directory, and a regular expression specifying acceptable extensions. It obtains a recursive list of everything under the source directory by making the following call.
File::Find::find (\&listAllFilesToTar, $sourceDir);
The call uses an auxiliary function, listAllFilesToTar, which File::Find::find invokes once for every file and directory it visits, to build the recursive listing. The program then picks out the files with acceptable extensions and removes the files with excluded extensions. Finally, the list of files is added to the tar object using the statement given below.
$tarArchive -> add_files (@sourceFiles);
This subroutine does not write the files to the archive. Writing to the archive is done in the main program.
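The recursive listing can also be collected without a global array by passing File::Find::find a closure. The following is a sketch under that assumption, not a replacement for tarDir. The helper name files_with_extension and the sample tree are invented for illustration.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical helper: return every plain file below $top whose name
# ends in one of the extensions in $extensions (e.g. "plx" or "html|htm").
sub files_with_extension {
    my ($top, $extensions) = @_;
    my @found;
    find(sub {
        # Inside the callback, $_ is the base name and
        # $File::Find::name is the path starting from $top.
        push @found, $File::Find::name
            if -f $_ && /\.(?:$extensions)\z/;
    }, $top);
    return sort @found;
}

# Throwaway tree for the demonstration.
my $top = tempdir(CLEANUP => 1);
make_path("$top/perl/code/old");
for my $name ('perl/code/apget.plx', 'perl/code/cookies1.txt',
              'perl/code/old/apget.plx', 'perl/code/apget.plx~') {
    open my $fh, '>', "$top/$name" or die "Cannot create $top/$name: $!";
    close $fh;
}

my @plx = files_with_extension("$top/perl", 'plx');
print "$_\n" for @plx;
```

The -f test keeps directories out of the result, so a directory whose name happens to end in an accepted extension is not handed to add_files.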
A recursive listing of the top level directory before the program is run is given below.
.:
HTML/ NewSense.tar archive.pl archive.plx~ perl/
LOGFILE README archive.plx install.plx
./HTML:
WS_FTP.LOG basic_config.html databaseServer.html operatingSystem.html
XML.html configure.html newsSource.html
./perl:
cgi-bin/ code/ modules/
./perl/cgi-bin:
WS_FTP.LOG basic_configure.plx configure.plx
./perl/code:
apget.plx cookies1.txt cookies2.txt cookies3.txt
./perl/modules:
AP_DB.pm AP_Time.pm AP_XML.pm AP_globals.pm
This listing is given in a Linux generated format. It shows the name of every directory in the project on a line, and the files contained in each directory following the name of the directory.
After the program is run, we can examine the contents of the tar archive. In Unix, tar tvf does this. In other systems, this may be done using a graphical user interface. When we examine the content of the tar archive by typing
%tar tvf APTracker.tar
we see the following.
README
archive.plx
install.plx
perl/cgi-bin/basic_configure.plx
perl/cgi-bin/configure.plx
HTML/configure.html
HTML/newsSource.html
HTML/databaseServer.html
HTML/operatingSystem.html
HTML/XML.html
HTML/basic_config.html
perl/code/apget.plx
perl/modules/AP_DB.pm
perl/modules/AP_globals.pm
perl/modules/AP_Time.pm
perl/modules/AP_XML.pm
The names of files are given relative to the top level directory. The listing clearly shows that the program produces a “clean” tar archive with only the files that we specifically required it to contain and none of the files that we excluded.
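The contents of an archive can also be listed from Perl with the list_files method of Archive::Tar. The sketch below first creates a small stand-in archive so that it is self-contained. The archive name sample.tar and its two members are hypothetical.

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Temp qw(tempdir);

# Create a small stand-in archive so the sketch is self-contained.
my $dir     = tempdir(CLEANUP => 1);
my $tarFile = "$dir/sample.tar";     # hypothetical name
my $out     = Archive::Tar->new();
$out->add_data('README', "read me\n");
$out->add_data('perl/modules/AP_DB.pm', "1;\n");
$out->write($tarFile) or die "Cannot write $tarFile: " . $out->error;

# Read the archive back and list its members.
my $in = Archive::Tar->new();
$in->read($tarFile) or die "Cannot read $tarFile: " . $in->error;
my @members = $in->list_files();
print "$_\n" for @members;
```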
Once we have a tar archive, the archive can be put on the Internet for downloading, FTPed to the destination, sent in an e-mail message, or transferred to the destination on a floppy, zip disk, CD or tape. To extract the files and directories from the tar file at the destination and get the directory structure back, we can either use the command tar xvf in Unix or use a GUI-based tar tool. Untarring can also be done using a Perl program. If we write an untar program in Perl, we can make the program store files in the appropriate locations as directed by the operating system, by the Web server, or by the Perl system. There are other ways to install files, but a Perl installer is adequate unless some programs must be compiled, using a language like C, before the files are distributed to their places. Having some of the programs of a project written in a language like C is not uncommon, though.
The following program install.pl takes the tar archive produced earlier and extracts all the files and places them in the right locations in the target system. The recipient edits the module called AP_Conf.pm to specify the target locations and then runs install.pl to install the programs. The content of the configuration file AP_Conf.pm is given below.
Program 6.28
package AP_Conf;
use strict;
our ($webServerAbsPath, $cgiAbsPath, $htmlAbsPath);
our ($perlSourceInstallAbsPath, $perlSiteModuleAbsPath);
our ($distributionTarFileName, $tmpDir);
$webServerAbsPath = "/home/kalita/www/Apache";
$cgiAbsPath = "/home/kalita/www/Apache/cgi-bin/aptracker";
$htmlAbsPath = "/home/kalita/www/Apache/htdocs/newsense";
$perlSourceInstallAbsPath = "/home/kalita/www/data/ap";
$perlSiteModuleAbsPath = "/home/kalita/www/bin/Perl/site/lib";
$distributionTarFileName = "APTracker.tar";
$tmpDir = "/home/kalita/.tmp";
1;
The values of the variables in this file need to be edited by hand to suit the current operating system and the current system configuration. The paths given above are made up for purposes of illustration only. For example, on a machine with Red Hat Linux, files related to the Web server are usually in the directory /home/httpd. The cgi-bin files are usually at /home/httpd/cgi-bin/, the HTML files are at /home/httpd/html, and the Perl site modules are usually at a location such as /usr/lib/perl5/site_perl/5.6.0 or /usr/local/lib/perl5/site_perl/5.6.0/. Note that the installation program install.pl needs to be run as root on a Unix machine if we want the files to be written to such system locations. On a Windows machine or on Mac OS (before OS X), usually anyone can run the program, even with sensitive system paths, because the security infrastructure is lax on such systems. The installation program also writes a detailed log file of whatever it was able to do. Information is also printed to the screen as the untarring proceeds.
First, the variables are declared and values assigned to them. The log file is opened. The main program gets several pieces of information from the configuration module. The program follows.
Program 6.29
#!/usr/bin/perl
=head1 NAME
script install.pl
=head1 SYNOPSIS
Untars the tar ball of a Perl project called the APTracker project.
The tar file's name comes from the configuration module AP_Conf.pm.
=head1 UPDATE HISTORY
=for text
07/24/2000: Written, Jugal Kalita
03/09/2001: Updated, Jugal Kalita
=cut
use strict;
use Archive::Tar;
use File::NCopy;
use File::Path;
use Cwd;
use AP_Conf;
$" = "\n"; #Separator for printing lists in double-quoted strings
sub printlog;
########Main program###########
my ($webServerRoot);
my $tar;
my ($cgibinSrcDir, $cgibinDestDir, $htmlSrcDir, $htmlDestDir);
my ($perlCodeSrcDir, $perlCodeDestDir);
my ($perlModuleSrcDir, $perlModuleDestDir);
my ($tarFileName);
my ($tmpDir, $logFile, $logHandle);
my $filePermissions;
$filePermissions = "0777";
$cgibinSrcDir = "perl/cgi-bin";
$htmlSrcDir = "HTML";
$perlCodeSrcDir = "perl/code";
$perlModuleSrcDir = "perl/modules";
$logFile = "LOGFILE";
open (LOG, ">$logFile") or warn "Cannot write log file $logFile: $!";
$logHandle = *LOG;
$webServerRoot = "$AP_Conf::webServerAbsPath";
printlog $logHandle, "Using $webServerRoot as Web server root directory\n";
$cgibinDestDir = "$AP_Conf::cgiAbsPath";
printlog $logHandle, "Using $cgibinDestDir as cgi-bin path\n";
$htmlDestDir = "$AP_Conf::htmlAbsPath";
printlog $logHandle, "Using $htmlDestDir as HTML document path\n";
$perlCodeDestDir = "$AP_Conf::perlSourceInstallAbsPath";
printlog $logHandle, "Using $perlCodeDestDir as path to Perl installation\n";
$perlModuleDestDir = "$AP_Conf::perlSiteModuleAbsPath";
printlog $logHandle, "Using $perlModuleDestDir as path to site Perl modules\n";
$tmpDir = $AP_Conf::tmpDir;
printlog $logHandle, "Using $tmpDir as path to temporary directory\n";
$tarFileName = "$AP_Conf::distributionTarFileName";
my $oldDir = cwd ();
#Testing; Need to take care of permission for newly created directory in Unix
if (!(-e $tmpDir)){
#mkpath takes the path, a verbose flag and a numeric (octal) mode
File::Path::mkpath ($tmpDir, 0, oct ($filePermissions))
or die "Cannot create directory $tmpDir: $!";
printlog $logHandle,
"Created directory $tmpDir with permissions $filePermissions\n";
}
#my $oldDir = cwd ();
chdir $tmpDir or die "Cannot change directory to $tmpDir: $!";
printlog $logHandle, "Untarring $tarFileName...\n";
#Start untarring
$tar = Archive::Tar->new();
$tar-> read ("$oldDir/$tarFileName")
or die "Cannot read tar file $tarFileName: $!";
my @files = $tar->list_files();
$tar -> extract (@files);
chdir ($oldDir);
my @allSrcDirs = ($htmlSrcDir, $cgibinSrcDir, $perlModuleSrcDir,
$perlCodeSrcDir);
my @allDestDirs = ($htmlDestDir, $cgibinDestDir, $perlModuleDestDir,
$perlCodeDestDir);
my $i;
for ($i = 0; $i <= $#allSrcDirs; $i++){
&copyDir ($logHandle, $allSrcDirs[$i],
$allDestDirs[$i], $filePermissions, $tmpDir);
}
##NEED to CLEAN UP TMP FILE, i.e., delete everything in it and itself
File::Path::rmtree ($tmpDir);
close LOG;
##########subroutine to copy a sourceDir to a dest recursively
sub copyDir{
my ($logHandle, $srcSubDir, $dest, $filePermissions, $tmpDir ) = @_;
my $copyAgent;
my $src = "$tmpDir/$srcSubDir";
printlog $logHandle,
"Trying to copy directory $src to directory $dest...\n";
return () unless ($src);
die "There is no $src directory from which to copy file: $!"
unless (-e $src);
if (-e $dest){
printlog ($logHandle, "$dest exists\n");
}else{
printlog ($logHandle, "$dest doesn't exist. Creating it...\n");
File::Path::mkpath ($dest, 0, oct ($filePermissions))
or die "Cannot create $dest: $!";
}
#recursive copying
$copyAgent = File::NCopy->new(recursive=>1);
$copyAgent->copy ("$src", "$dest");
}
#subroutine for printing to STDOUT and a log handle
sub printlog{
my ($logHandle, $message) = @_;
print STDOUT $message;
print $logHandle $message;
}
The program gets several pieces of information from the configuration module. It prints information on the screen and to the log file about these locations. The program also needs a temporary directory which is created if it does not exist already.
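Creating and later removing a scratch directory can be sketched with the make_path and remove_tree functions of File::Path, which are the newer counterparts of mkpath and rmtree. The path used below is hypothetical and lives under a File::Temp directory so that nothing outside it is touched. The mode is given as an octal number, not as a string.

```perl
use strict;
use warnings;
use File::Path qw(make_path remove_tree);
use File::Temp qw(tempdir);

# Work under a scratch area; $tmpDir is a hypothetical nested path.
my $base   = tempdir(CLEANUP => 1);
my $tmpDir = "$base/install/tmp";

# The mode is a number written in octal (0755), not the string "0755".
unless (-d $tmpDir) {
    make_path($tmpDir, { mode => 0755 });
    -d $tmpDir or die "Cannot create directory $tmpDir";
}
my $created = (-d $tmpDir) ? 1 : 0;
print "created $tmpDir\n" if $created;

# Remove the directory and everything below it when done.
remove_tree($tmpDir);
my $removed = (-d $tmpDir) ? 0 : 1;
print "removed $tmpDir\n" if $removed;
```

The mode actually applied to a new directory is still subject to the process umask.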
First, a new Archive::Tar object is created. Then, the tar archive is read into this object by executing the following statements.
$tar = Archive::Tar->new();
$tar-> read ("$oldDir/$tarFileName") or
die "Cannot read tar file $tarFileName: $!";
The list of archived files and directories is obtained by issuing the following command.
my @files = $tar->list_files();
The files in the list are then extracted by calling the extract method on $tar with @files as its argument. The extraction takes place in the temporary directory $tmpDir, to which the program chdired earlier. The extraction process recreates the directory structure and restores the contained files in their individual form.
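The read, list and extract sequence can be tried on a small archive. The sketch below is self-contained: it writes a stand-in archive with one nested member, changes into a scratch directory, extracts there, and changes back. All names in it are hypothetical.

```perl
use strict;
use warnings;
use Archive::Tar;
use Cwd;
use File::Temp qw(tempdir);

my $scratch = tempdir(CLEANUP => 1);
my $tarFile = "$scratch/sample.tar";     # hypothetical archive
my $tmpDir  = "$scratch/unpacked";       # hypothetical target directory
mkdir $tmpDir or die "Cannot create $tmpDir: $!";

# Stand-in archive with one nested member.
my $out = Archive::Tar->new();
$out->add_data('perl/code/apget.plx', "1;\n");
$out->write($tarFile) or die "Cannot write $tarFile: " . $out->error;

# Members are extracted relative to the current directory, so
# change into the target first and return afterwards.
my $oldDir = cwd();
chdir $tmpDir or die "Cannot chdir to $tmpDir: $!";
my $in = Archive::Tar->new();
$in->read($tarFile) or die "Cannot read $tarFile: " . $in->error;
my @done = $in->extract($in->list_files());
chdir $oldDir or die "Cannot chdir back to $oldDir: $!";
die "Nothing was extracted" unless @done;

my $extracted = "$tmpDir/perl/code/apget.plx";
print "extracted $extracted\n" if -f $extracted;
```

The intermediate directories perl and perl/code are created by the extraction itself.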
The program goes through all the source directories and copies them to the appropriate destination directories. The destination directories are obtained from the AP_Conf module discussed earlier. To copy a directory recursively, the program uses the copyDir subroutine. After the directories have been copied to their rightful locations, the program deletes the temporary directory $tmpDir recursively by calling the File::Path::rmtree function.
The copyDir subroutine creates the destination directory by calling File::Path::mkpath if necessary. It then creates a new File::NCopy object, $copyAgent, with the recursive option set, and calls the object’s copy method to copy the source directory and everything below it to the destination.
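File::NCopy is not part of the standard Perl distribution. Where it is not installed, a recursive copy can be approximated with core modules only. The sketch below swaps in File::Find, File::Copy and File::Path in place of File::NCopy. The helper copy_tree is hypothetical, and it copies the contents of the source directory into the destination, which may differ from how File::NCopy places the copied directory.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Find;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: copy the tree under $src into $dest,
# creating directories as needed.
sub copy_tree {
    my ($src, $dest) = @_;
    find({
        no_chdir => 1,                   # $_ holds the full path
        wanted   => sub {
            my $rel    = File::Spec->abs2rel($_, $src);
            my $target = $rel eq '.' ? $dest : "$dest/$rel";
            if (-d $_) {
                make_path($target) unless -d $target;
            } else {
                copy($_, $target) or die "Cannot copy $_ to $target: $!";
            }
        },
    }, $src);
}

# Throwaway source tree for the demonstration.
my $scratch = tempdir(CLEANUP => 1);
make_path("$scratch/src/modules");
open my $fh, '>', "$scratch/src/modules/AP_DB.pm"
    or die "Cannot create file: $!";
print {$fh} "1;\n";
close $fh;

copy_tree("$scratch/src", "$scratch/dest");
print "copied\n" if -f "$scratch/dest/modules/AP_DB.pm";
```

File::Find visits a directory before the files inside it, so each target directory exists by the time its files are copied.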
