4.8.5 The m//s Modifier

Another modifier, s, is useful in pattern matching when the target string contains embedded newline characters. Just as the m modifier changes only the meaning of the anchors ^ and $, the s modifier changes only one thing: what the dot (.) matches inside the string.
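
The division of labor between the two modifiers can be checked on a small string. The following is only an illustrative sketch; the string and the variable names are assumptions made for this example and do not come from the programs in this chapter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hypothetical two-line target string.
my $target = "one\ntwo\n";

# s does not touch the anchors: ^ still matches only at the start of the string.
my $caretWithS = ($target =~ m/^two/s) ? 1 : 0;
# m changes the anchors: ^ may now match right after an embedded \n.
my $caretWithM = ($target =~ m/^two/m) ? 1 : 0;
# m does not touch the dot: it still refuses to match \n.
my $dotWithM = ($target =~ m/one.two/m) ? 1 : 0;
# s changes the dot: it now matches \n as well.
my $dotWithS = ($target =~ m/one.two/s) ? 1 : 0;

print "^two with s: $caretWithS\n";     # prints 0
print "^two with m: $caretWithM\n";     # prints 1
print "one.two with m: $dotWithM\n";    # prints 0
print "one.two with s: $dotWithS\n";    # prints 1
```

The two modifiers are independent, so they can be combined (m/.../ms) when a pattern needs both behaviors.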

Usually, inside a regular expression, the dot matches any character except \n. This can be a drawback when the target string has embedded \n's. For example, we may be looking for a common typo in which the same word is typed twice in a row, and we may want to write a program that removes one of the two instances. Sometimes the two instances are on the same line of text, but sometimes the first occurrence is at the end of one line and the second occurrence is on the next line, with intervening newlines and other space characters.

To address a situation like this, it is sometimes necessary to force the dot (.) to match \n in multi-line strings. Of course, one can force . to match \n in single-line strings also, but doing so serves no purpose. In pattern matching, we can force the dot character to match \n if we use the s modifier. Here is a program that shows the use of this modifier. This program finds the first word in a multi-line string that is repeated later in the string. It also finds the number of intervening words between the two occurrences.

 Program 4.36

#!/usr/bin/perl
#file repeatedWords2.pl

$textString = "We have revised the entire paper to reduce the background knowledge that a\n"; 
$textString .= "reader would require.   We now  describe  the basic aspects of the MP with\n";
$textString .= "minimal background assumed.  There are many linguistic papers, but none\n";
$textString .= "in computational linguistics as far as we know,  that deal with the\n";
$textString .= "MP. Of course, pure linguistic papers such as\n";
$textString .= '\cite{Chomsky95,Merlo95,Zwart94} assume quite ';
$textString .= "extensive GB background, but we have\n";
$textString .= "spent a large amount of time  making  the paper accessible to a reader\n";
$textString .= "who is somewhat familiar with modern linguistics. We also\n";
$textString .= "have included\n";
$textString .= "extensive references with page numbers to assist the  reader\n";
$textString .= "interested in more details.\n";

#print $textString;

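# $1 is the repeated word; $2 is the text between its two occurrences.
# The s modifier lets the dot in (.*?\b)? match the embedded \n characters.
($repeatedWord, $separatingString) = $textString =~ m/\b(\w+?)\b(.*?\b)?\1\b/s;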

@separatingWords = split (/\s+/, $separatingString);
print 'Separating Words =' . "@separatingWords\n"; 

print "Repeated Word = $repeatedWord\n";
print "Number of intervening words = $#separatingWords\n";

The output of the program is given below. The first line of output has been broken up into two lines for printing.

Separating Words = have revised the entire paper to reduce the background knowledge 
               that a reader would require.
Repeated Word = We
Number of intervening words = 15
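
The count on the last line depends on a detail of split that is worth checking separately. $#separatingWords is the last index of the array, not its size. Because the separating string begins with white space, split with the pattern /\s+/ returns an empty first element, so the last index happens to equal the number of words. The sketch below illustrates this on a short string; the string and variable names are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hypothetical separating string with leading and trailing white space.
my $separating = " have revised\nthe   paper.   ";

# Splitting on /\s+/ keeps an empty leading field and drops trailing empty ones.
my @byPattern = split(/\s+/, $separating);
# Splitting on the special string ' ' skips leading white space first.
my @bySpace = split(' ', $separating);

my $lastIndex    = $#byPattern;
my $patternCount = scalar(@byPattern);
my $spaceCount   = scalar(@bySpace);

print "Last index after pattern split: $lastIndex\n";        # prints 4
print "Element count after pattern split: $patternCount\n";  # prints 5
print "Element count after ' ' split: $spaceCount\n";        # prints 4
```

If the separating string did not start with white space, the last index would be one less than the number of words, so counting the elements of a split on ' ' is the more robust choice.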

The word We is the first word of $textString. There is no other instance of We in the first line of text, but the second line of text has another occurrence of We. \1 matches this second occurrence of We. \1 is a backreference: inside a regular expression, it stands for whatever the first pair of parentheses captured, which is the value that the special variable $1 holds after the match. The distance between the two occurrences is fifteen words. The intervening substring between the two occurrences is captured by (.*?\b)? in the regular expression. By using the s modifier, we force the dot character to match the intervening \n also. The ? immediately after the multiplier * asks Perl to perform a minimal match. Without this ?, Perl will print the following output.

Repeated Word = We
Number of intervening words = 84

This is the distance between the first occurrence of We and the last occurrence of We in the string. Without the ? after the multiplier *, Perl performs a maximal match and consumes as much of the string as possible for the multiplier. With the ?, it consumes as little as possible.
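
The difference between the two kinds of match is easier to see on a string with only three occurrences of the word. The following is a minimal sketch under that assumption; the string and the variable names are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hypothetical three-line string in which We occurs three times.
my $text = "We go.\nWe stay.\nWe leave.\n";

# Minimal match: .*? stops at the nearest following We.
my ($minimal) = $text =~ m/\bWe\b(.*?)\bWe\b/s;
# Maximal match: .* runs to the end, then backs up to the last We.
my ($maximal) = $text =~ m/\bWe\b(.*)\bWe\b/s;

# split ' ' ignores leading white space, so these are true word counts.
my @minimalWords = split(' ', $minimal);
my @maximalWords = split(' ', $maximal);

print "Words skipped by minimal match: " . scalar(@minimalWords) . "\n";  # prints 1
print "Words skipped by maximal match: " . scalar(@maximalWords) . "\n";  # prints 3
```

In both patterns the s modifier is what allows the dot to cross the line boundaries; only the ? after the multiplier decides which later occurrence of We ends the match.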