4.10 Substituting A Pattern: The s/// Operator


Quite frequently, we need to make substitutions in a piece of text. We may perform substitutions on one string, a whole file, all files in a directory, or all files spread out over many directories. A substitution may correct a spelling error, reflect changes in the names of individuals or organizations, or pre-process text to facilitate more complex text processing, among other uses. When a person gets married, the person's name may change. When a company is bought by or merged with another company, its name may change. When a person gets promoted, his or her designation may change. If such changes need to be reflected on one page or on many pages, it is not advisable to make them manually. A program performs such tasks more reliably than a human and does not miss any occurrence that the pattern describes.

The substitution operation is performed using the s operator, which takes two arguments, a pattern and a substitution string, as shown below.

s/pattern/substitutionString/

It looks for pattern in a string and replaces the first part of the string that matches pattern with substitutionString.

The following program takes a string and replaces the words Assam Company by Assam Company of America, Inc., because the company has been formally incorporated and the new name reflects this.

 Program 4.40

#!/usr/bin/perl
#assam1.pl

use strict;
my $string = "Assam Company was established in 1997.\n";
$string =~ s/Assam\s+Company/Assam Company of America, Inc.,/;  
print $string;

The output of the program is given below.


Assam Company of America, Inc., was established in 1997.

Now, suppose we want to make this change in a file. We want to change every occurrence of the words Assam Company to Assam Company of America, Inc. In the original file, the name Assam Company is used in several places. The file after substitution is given below. It is called index.html.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <title>Assam Company of America, Inc.,</title>

  </head>

 

  <body>

    <h1>Assam Company of America, Inc.,</h1>

    Assam Company of America, Inc., was established in 1997. It has been operational

    without a name for more than a year at that time. Assam Company of America, Inc.,

    specializes in intelligent information processing and

    Internet security.

    <hr>

    <address><a href="mailto:kalita@pikespeak.uccs.edu">J Kalita</a></address>

<!-- Created: Sat Jul 21 10:11:47 MDT 2001 -->

<!-- hhmts start -->

Last modified: Sat Jul 21 10:17:07 MDT 2001

<!-- hhmts end -->

  </body>

</html>

 

The program that performs the replacement everywhere is given below.

Program 4.41

#!/usr/bin/perl
#assam2.pl

use strict;
undef $/;
my $file = $ARGV[0];
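open (IN, "$file") or die "Cannot open $file: $!";
open OUT, ">$file.new" or die "Cannot open $file.new: $!";
my $fileContents = <IN>;   # $/ is undef, so this reads the whole file
close IN;
$fileContents =~ s/Assam\s+Company/Assam Company of America, Inc.,/g;
print  OUT $fileContents;
close OUT;                 # flush the new file before renaming it
unlink $file;
rename $file.".new", $file;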
chmod 0644, $file;

The program reads the whole file in one read operation because the string we are looking for, Assam Company, consists of two words that may be split across two lines. We read the whole file into one string called $fileContents; this takes a single read operation because we set $/ to undef, as discussed earlier in the chapter. We then use the global modifier g with the s operation so that every occurrence is replaced.
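To see why the whole file is read at once, consider a name that is split across a line break. The following is a minimal sketch that uses a made-up two-line string in place of a real file. Applying the substitution one line at a time misses the name, whereas applying it to the whole string finds it, because \s+ also matches the newline. The variable names are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical text in which the name is broken across two lines.
my $text = "This page describes Assam\nCompany and its products.\n";

# Line-by-line processing: neither line contains the complete name.
my $lineByLine = '';
foreach my $line (split /^/, $text) {
    $line =~ s/Assam\s+Company/Assam Company of America, Inc.,/g;
    $lineByLine .= $line;
}

# Whole-string processing: \s+ matches the newline between the two words.
(my $wholeString = $text) =~ s/Assam\s+Company/Assam Company of America, Inc.,/g;

print $lineByLine;    # unchanged
print $wholeString;   # name replaced; the line break inside it is gone
```

Note that the replacement swallows the line break that separated the two words, so the corrected text ends up on one line.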

Once the file has been read into a string, we perform the substitution globally. We write out the changed contents to a file that has the same name as the original, but with the suffix .new. So, if the program assam2.pl is called with the command-line argument index.html, a file index.html.new is created. We then delete the old file index.html, and rename the file index.html.new to index.html. Since HTML files must be readable by everyone to be viewable on the Web, we make it so using the chmod command. Changing the access mode is necessary only on a Unix machine.
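The read, substitute, write, and rename steps can also be written with lexical filehandles and a check on every file operation. The sketch below is one possible arrangement, not the program from the text: it creates its own throwaway file in a temporary directory (the file name and contents are made up) so that it can be run without touching real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a throwaway file so the sketch can be run safely (hypothetical data).
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/index.html";
open my $setup, '>', $file or die "Cannot create $file: $!";
print $setup "<h1>Assam Company</h1>\nAssam\nCompany was established in 1997.\n";
close $setup or die "Cannot close $file: $!";

# Read the whole file into one string; local $/ undefines it only inside do {}.
open my $in, '<', $file or die "Cannot open $file: $!";
my $fileContents = do { local $/; <$in> };
close $in;

my $count = ($fileContents =~ s/Assam\s+Company/Assam Company of America, Inc.,/g);

# Write the new version next to the original, then put it in the original's place.
open my $out, '>', "$file.new" or die "Cannot open $file.new: $!";
print $out $fileContents;
close $out or die "Cannot close $file.new: $!";
rename "$file.new", $file or die "Cannot rename $file.new: $!";
chmod 0644, $file;

print "Made $count substitutions\n";
```

Here the unlink step is left out on the assumption that rename replaces an existing target, as it does on Unix-like systems. Checking the return values means that a failed write stops the program before the original file is disturbed.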

Note that the need to write a new file, delete the old file, and rename the new file can be obviated by editing the file in-place. Perl allows editing in-place if we give a value to the special variable $^I. Perl automatically does the necessary work to keep an old version of the file, naming it by appending the value of $^I to the name of the file being changed. The updated file is stored under the original file name. In-place editing works only when the input is read with the diamond operator <> with nothing inside it, or with the special filehandle ARGV; an empty <> means <ARGV>. So, even if there is only one file, we need to use <> or <ARGV>. The ARGV filehandle iterates through every file named in the special variable @ARGV. Therefore, we can assign to @ARGV inside a program to get the in-place editing effect if we want. The program is given below.

 Program 4.42

#!/usr/bin/perl
#assam3.pl

use strict;
$^I = ".old";
undef $/;
while (<>){
      $_ =~ s/Assam\s+Company/Assam Company of America, Inc.,/g;  
      print $_; 
} 

We must use the <> or <ARGV> filehandle for in-place editing to work. Reading each chunk into the default variable $_ and printing it with print, as done here, is the simplest way to use it. If we make the following call

 

%assam3.pl index.html

 

the old file is stored as index.html.old and the new, modified file is stored as index.html. If the value of $^I is set to the empty string, no copy of the old file is kept. However, this is not advised because if something goes wrong during input-output, the original file as well as the new file may be lost. If $^I is undefined, in-place editing is turned off altogether.
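As an illustration of assigning to @ARGV inside a program, the sketch below sets $^I and @ARGV with local inside a block, so that both are restored when the block ends. It works on a throwaway file created in a temporary directory; the file name, its contents, and the helper slurp are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create a small file to edit (hypothetical data in a temporary directory).
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/index.html";
open my $setup, '>', $file or die "Cannot create $file: $!";
print $setup "Assam Company was established in 1997.\n";
close $setup or die "Cannot close $file: $!";

{
    # local restores the previous values of $^I and @ARGV at the end of the block.
    local ($^I, @ARGV) = ('.old', $file);
    while (<>) {
        s/Assam\s+Company/Assam Company of America, Inc.,/g;
        print;    # written to the new version of the file, not to the screen
    }
}

# Hypothetical helper that returns the whole contents of a file.
sub slurp {
    my ($name) = @_;
    open my $fh, '<', $name or die "Cannot open $name: $!";
    local $/;
    my $contents = <$fh>;
    return $contents;
}

my $newContents = slurp($file);
my $oldContents = slurp("$file.old");
print "new: $newContents";
print "old: $oldContents";
```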

The program given immediately above does the substitution in all files whose names are given as command-line arguments. Suppose that while typing the files, we had mixed cases because we were not careful. That is, in some places, we wrote ASSAM COMPANY or Assam COMPANY or ASSAm COmpany, etc. We want these cases also to be substituted. We can do so simply by using the i option with the s operation in addition to the g option. The modified program is given below.

 Program 4.43

#!/usr/bin/perl
#assam4.pl

use strict;
$^I = ".old";
undef $/;
while (<>){
      $_ =~ s/Assam\s+Company/Assam Company of America, Inc.,/gi;  
      print $_; 
} 
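The effect of the i option can be seen on a single string. The sketch below uses a made-up sentence that contains only wrongly cased variants of the name and counts the substitutions made with and without i.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical text containing only wrongly cased variants of the name.
my $text = "ASSAM COMPANY sells tea. Assam  COMPANY ships it. ASSAm COmpany bills for it.\n";

# Without /i none of the variants match, so nothing is replaced.
my $exactCase      = $text;
my $exactCaseCount = ($exactCase =~ s/Assam\s+Company/Assam Company of America, Inc.,/g) || 0;

# With /i all three variants are replaced.
my $anyCase      = $text;
my $anyCaseCount = ($anyCase =~ s/Assam\s+Company/Assam Company of America, Inc.,/gi) || 0;

print "Without i: $exactCaseCount substitutions\n";
print "With i:    $anyCaseCount substitutions\n";
print $anyCase;
```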

Although the use of $^I makes a program tight and small, it is tricky to use correctly. The restrictions are the following.

1.  One has to use it with <> or <ARGV>.

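2.  It is safest to use the $_ variable that is set by <> and to print it with a plain print. If the input is read into another variable, as in ($myVar = <>), that variable must be printed explicitly; a plain print would print $_, which is then empty, and the contents of the original file would be lost.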

3.  It has to be used with a while loop.

4.  One has to print $_ without a filehandle.

If these restrictions are not all followed exactly, the original file's contents are very likely to be lost. If there are many files to be read and processed, all of them may be reduced to empty files, which could be disastrous. If there are several filehandles, or a while loop cannot be used, or using $_ is not convenient or transparent, performing in-place editing may be difficult. Programmers commonly lose the contents of files, and sometimes of whole directories, in this manner.

Thus, the use of $^I is not recommended by the author of this book. It makes a program short and endows it with some elegance, but the associated dangers make its use perilous. If one insists on using it, one should first copy all relevant files to a temporary directory, write the program, and experiment with it on the copies several times to make sure no data is lost before using it on the actual data. In other words, one should back up the files and make sure everything works first.
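The advice about experimenting on copies can be followed from within Perl itself. The following is a rough sketch, assuming the standard File::Copy, File::Temp, and File::Basename modules; the "real" file here is itself a made-up file in a temporary directory, so nothing of value is touched. The program copies the file to a scratch directory, runs the in-place substitution on the copy only, and refuses to continue if the copy comes out empty.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Stand-in for the real data: one small file (hypothetical name and contents).
my $dataDir  = tempdir(CLEANUP => 1);
my $original = "$dataDir/index.html";
open my $setup, '>', $original or die "Cannot create $original: $!";
print $setup "Assam Company was established in 1997.\n";
close $setup or die "Cannot close $original: $!";

# Copy the file to a scratch directory and experiment there.
my $scratchDir = tempdir(CLEANUP => 1);
my $trial      = "$scratchDir/" . basename($original);
copy($original, $trial) or die "Cannot copy $original: $!";

{
    local ($^I, @ARGV) = ('.old', $trial);
    while (<>) {
        s/Assam\s+Company/Assam Company of America, Inc.,/gi;
        print;
    }
}

# -s gives the size in bytes, or a false value for an empty file.
my $trialSize    = -s $trial;
my $originalSize = -s $original;
die "Trial copy is empty; do not run this on the real files\n" unless $trialSize;
print "Trial copy: $trialSize bytes; original untouched: $originalSize bytes\n";
```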

A program that performs substitution can be used to make an HTML file look consistent. For example, we may want all our HTML tags to be written in uppercase so that they stand out. It is possible that they are not so to begin with because various individuals produced the pages using raw HTML or by using various tools. The following program makes an initial attempt at doing so. We do not use the $^I variable here.

 Program 4.44

#!/usr/bin/perl
#file htmlTags2.pl
use strict ;
my $backExt = ".old"; 

my (@HTMLTags, $HTMLTag, $uppedHTMLTag, $textLine, $file, $outFile);
@HTMLTags  = qw (html head title br h1 h2 h3 p a img table); 
print "\@ARGV = @ARGV\n"; 

foreach $file (@ARGV){
  open IN, $file or die "Cannot open $file: $!";
  $outFile = $file . $backExt;
  open OUT, ">$outFile" or die "Cannot open $outFile: $!";
  while ($textLine = <IN>){
      foreach $HTMLTag (@HTMLTags){
           # handle <tag>, <tag attribute...>, and </tag> in turn
           $textLine =~ s/<($HTMLTag)>/"<" . uc($1) . ">"/gie;
           $textLine =~ s/<($HTMLTag)\s+/"<" . uc($1)  . " "/gei;
           $textLine =~ s#</($HTMLTag)># "</" . uc($1) . ">"#gei;
      }
      print OUT $textLine;
  }
  close IN;
  close OUT;
  unlink $file;
  rename $outFile, $file;
  chmod 0644, $file;
}

Here we list some HTML tags we are interested in. We use the function uc to get the upper-cased version of the tag. The call to the uc function can be made in the call to the s operator itself. We use the s///e modifier to evaluate the substitution string. That is, instead of using the substitution string literally, we can consider it as an expression to be evaluated.

Consider one of the uses of the substitution operator.

           $textLine =~ s/<($HTMLTag)>/"<" . uc($1) . ">"/gie;

Here, the pattern is given between the first two delimiters (/). The substitution string is given between the second and the third delimiters. It is possible to use other delimiters, just as with the pattern match operator m//; the third substitution in the program uses # as the delimiter. Just like in regular expression matching, when we have a parenthesized sub-expression, the part of the string that matches it is stored temporarily in the special variable $1. This variable can be used in the substitution expression. Here, the substitution expression is

"<" . uc($1) . ">"

This expression performs string concatenation twice. The built-in function uc takes $1 as an argument. e is given as a modifier or option to the s operator, causing Perl to evaluate the expression to obtain the actual substitution string. The s operation causes the HTML tag to be upper-cased. Sometimes an HTML tag is followed immediately by >, and at other times it is not, particularly when the tag accepts attribute values; the first two uses of the s operator handle these two cases. The third use of the s operator in this program substitutes the closing HTML tags. Because the e modifier causes the substitution string to be evaluated, it is possible for the programmer to write a subroutine that is invoked if necessary, performing complex processing.
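As an illustration of calling a subroutine from the substitution expression, the sketch below upper-cases opening and closing tags in a single pass. It is an alternative arrangement and not the program from the text: the subroutine fixTag, the hash %isKnownTag, and the sample line are made up for the example, and braces are used as delimiters so that the / in closing tags needs no escaping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The tags we want to upper-case, stored in a hash for quick lookup.
my %isKnownTag = map { $_ => 1 } qw(html head title br h1 h2 h3 p a img table);

# Hypothetical helper: upper-case the tag name only if it is in our list.
sub fixTag {
    my ($slash, $name) = @_;
    return "<$slash" . ($isKnownTag{lc $name} ? uc $name : $name);
}

my $textLine = qq{<h1>News</h1><p class="x">See <a href="a.html">this</a> <em>now</em></p>\n};

# $1 is "/" for a closing tag and empty otherwise; $2 is the tag name.
$textLine =~ s{<(/?)(\w+)}{fixTag($1, $2)}ge;

print $textLine;
# <H1>News</H1><P class="x">See <A href="a.html">this</A> <em>now</em></P>
```

Because the decision is made inside the subroutine, tags that are not in the list, such as em here, pass through unchanged.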

The s operator accepts the s and m modifiers as well, with the meanings discussed in the context of the match operator m//. The s modifier changes the meaning of the . character in a pattern, allowing it to match \n as well as any other character. The m modifier changes the meaning of the ^ and $ anchors, allowing them to match at the beginning and end, respectively, of every line in a multi-line string. The reader is referred to Section 4.8 for detailed discussions of these two modifiers.
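A short sketch may make the two modifiers concrete. The strings below are made up for the example: the first pair of substitutions shows the effect of m on ^, and the second pair shows the effect of s on the . character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# With /m, ^ matches at the start of every line, so each line gets a prefix.
(my $quoted = $text) =~ s/^/> /gm;

# Without /m, ^ matches only at the very start of the string.
(my $quotedOnce = $text) =~ s/^/> /g;

my $page = "<p>kept</p><!-- a comment\nspanning two lines --><p>also kept</p>\n";

# With /s, . also matches \n, so .*? can run across the line break.
(my $noComments = $page) =~ s/<!--.*?-->//gs;

# Without /s, . stops at \n, so the two-line comment is not matched.
(my $unchanged = $page) =~ s/<!--.*?-->//g;

print $quoted, $quotedOnce, $noComments, $unchanged;
```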

The following program contains a sequence of unrelated uses of the s operator. Each use is explained in the comments that accompany it.

 Program 4.45

#!/usr/bin/perl
#simplePatterns1.pl

#-------------------
$string = "Mr. Clinton goes to Washington.\nMr. Gore goes\nwith him.\n";

# /g automatically works with multi-line strings
$string =~ s/Mr\./Mister/g;
#Mister Clinton goes to Washington.
#Mister Gore goes
#with him.
print "$string";

$count = ($string =~ s/Mister/Mr\./g);
#Changed 2 times
print "Changed $count times\n";

$string  =~ s/(.)/\1\1/g;
#MMrr..  CClliinnttoonn  ggooeess  ttoo  WWaasshhiinnggttoonn..
#MMrr..  GGoorree  ggooeess
#wwiitthh  hhiimm..
#Note: \n is not duplicated
print "$string";

$string = "Mr. Clinton goes to Washington.\nMr. Gore goes\nwith him.\n";
$string  =~ s/(.)/\1\1/gs;
#MMrr..  CClliinnttoonn  ggooeess  ttoo  WWaasshhiinnggttoonn..
#
#MMrr..  GGoorree  ggooeess
#
#wwiitthh  hhiimm..
#
#Note: \n is also duplicated because we have used /s modifier
print "$string";

$string = "Mr. Clinton goes to Washington.\nMr. Gore goes\nwith him.\n";
$string  =~ s/(.)/\1\1/gm;
#Because we don't use /s modifier, . doesn't match \n
#MMrr..  CClliinnttoonn  ggooeess  ttoo  WWaasshhiinnggttoonn..
#MMrr..  GGoorree  ggooeess
#wwiitthh  hhiimm..
print "$string";

$_ = "I will pay \$10000 for the car.\n";
s/(\d+)/$1 * 2/e;
#I will pay $20000 for the car.
print;
s/(\w+)/if ($1 eq "car") {$1 x 2} else {$1}/ge;
#I will pay $20000 for the carcar.
print;

$_ = "I will pay ten thousand dollars for the car\n";
s/\w+/sprintf ("%-9s", $&)/ge;
#I         will      pay       ten       thousand  dollars   for       the       car
print;

sub repeat{
    my ($string) = @_;
    return ($string x 2);
}

$_ = "I love Dawson Creek\n";
s/\w+/&repeat ($&)/eg;
#II lovelove DawsonDawson CreekCreek
print;

#------------
s/(\w+)\1/$1 $1/g;
#I I love love Dawson Dawson Creek Creek
print;

#-------
#trim white space
$_ = "It  is  very nice and   sunny   day.\n";

s/(\w+)\s+/$1 /g;
#It is very nice and sunny day.
print;

The g modifier automatically works with multi-line strings. The substitution operation returns the number of substitutions performed in a scalar context. When parts of the pattern are parenthesized, the substrings in the target string that match are remembered. We can use special variables that are numeric, such as $1, $2, etc., to refer to them. In the substitution string we can also use \1, \2, etc., to refer to the values of $1, $2, etc., although $1 and $2 are the preferred forms there; inside the pattern itself, \1 refers back to the text matched by the first parenthesized group. For the period (.) to match \n, we need to use the s modifier. The program shows several uses of the e modifier. In pattern matching as well as substitution, the part of the target string that matches a pattern successfully is stored in another special variable called $&. We use this variable in some of the examples given in the program.
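The points in this summary can be checked with a few one-line substitutions. The following sketch uses made-up strings and variable names; each substitution is applied to a copy so that the original string remains available.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Mr. Clinton goes to Washington.\nMr. Gore goes\nwith him.\n";

# In scalar context s/// returns the number of substitutions made.
my $count = ($string =~ s/Mr\./Mister/g);

# When nothing matches, the return value is false and the string is untouched.
my $changed = ($string =~ s/Senator/Sen./g) ? "yes" : "no";

# In the substitution string, $1 and $2 refer to the captured groups.
my $names = "Clinton Bill; Gore Al";
(my $swapped = $names) =~ s/(\w+) (\w+)/$2 $1/g;

# Inside the pattern, \1 refers back to what the first group matched.
my $stutter = "the the car is is red";
(my $fixed = $stutter) =~ s/\b(\w+) \1\b/$1/g;

# $& holds the whole matched text; /e lets us pass it to uc.
my $quiet = "stop now";
(my $loud = $quiet) =~ s/\w+/uc($&)/ge;

print "$count substitutions, changed again: $changed\n";
print "$swapped\n$fixed\n$loud\n";
```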