4.8.3.2.2 Capturing Pairs of Words From a Multi-Line String

When we want pattern matching to work in more sophisticated ways over multi-line strings, we need to use modifiers. Any of the three match modifiers m, g, and s can be useful when matching over multiple lines, but each one changes the behavior of the m// operator differently. We start with the use of the g modifier in multi-line matching.

Given a string with embedded newlines, suppose we want to pick out each of the words from all the lines of text. It can be done simply by using one line of pattern-matching code.

 Program 4.32

#!/usr/bin/perl
$" = "\t";
$friends = "Tommy\tWashington\nChad\tSanFrancisco\nJeffP";
$friends .= "\tBoulder\nJeffC\tColoradoSprings\n";

@allWords = ($friends =~ /(\w+)/g);
print "All words = @allWords\n\n";

We use the pattern match operator with the g modifier. The target string $friends has multiple lines. Without the g modifier, the pattern (\w+) captures only the first substring that it matches in the target string. Because we use the g modifier in list context, Perl returns every substring that matches. The fact that the target string has multiple lines is immaterial when we use the g modifier. As a result, this program prints out the following.


All words = Tommy       Washington      Chad    SanFrancisco    JeffP
                     Boulder JeffC   ColoradoSprings

The output has been broken into two lines by hand.

We use the special variable $" in this program. It holds what is called the list item separator. When a list is interpolated inside a double-quoted string, its elements are joined using the current value of $". The default value is a single space; here we set it to a tab.
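The effect of $" can be seen in isolation with a small sketch. The list @towns and the variables $with_space and $with_dash are made-up names used only for this illustration; they are not part of Program 4.32. We localize $" inside a block so that the change does not leak into the rest of the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# @towns is a hypothetical list used only for this illustration.
my @towns = ("Washington", "Boulder", "ColoradoSprings");

# With the default separator, the elements are joined by single spaces.
my $with_space = "@towns";

# Change the separator only inside this block by localizing $".
my $with_dash;
{
    local $" = "-";
    $with_dash = "@towns";
}

print "$with_space\n";   # elements joined by spaces
print "$with_dash\n";    # elements joined by dashes
```

Using local restores the previous value of $" automatically when the block ends, which is usually safer than assigning to it globally as Program 4.32 does.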

Suppose now we want to start with the same multi-line string and construct an associative array or hash that has the name of a friend as the key and his hometown as the value. The following program achieves this for us.

 Program 4.33

#!/usr/bin/perl

$friends = "Tommy\tWashington\nChad\tSanFrancisco\nJeffP";
$friends .= "\tBoulder\nJeffC\tColoradoSprings\n";

@allFriends = ($friends =~ /(\w+\s*\w+)/g);

foreach $friend (@allFriends){
   ($name, $hometown) = 
      $friend =~ /(\w+)\s*(\w+)/;
   $friends{$name} = $hometown;
}

print "Friend\tHometown\n".("-" x 20)."\n";
foreach (keys %friends){
    print $_, "\t", $friends{$_}, "\n";
}

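The first pattern match operation

@allFriends = ($friends =~ /(\w+\s*\w+)/g);

picks out each name-hometown pair from the string $friends. Because each line holds exactly one pair, every element of @allFriends is in effect one line of the string without its trailing newline. The loop then splits each element into a name and a hometown and stores them in the hash %friends, whose keys are the names and whose values are the hometowns. Note that the scalar $friends and the hash %friends are two distinct variables in Perl, even though they share a name.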
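As a variation on the two-step approach, the hash can also be built in a single match. This is only a sketch of an alternative, not the method used in Program 4.33: it relies on the fact that a pattern with two capture groups and the g modifier, evaluated in list context, returns a flat list of all captures, which Perl accepts as key-value pairs when assigned to a hash. The hash name %hometown is our own choice, and we use \s+ rather than \s* so that a name and a hometown must be separated by at least one whitespace character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $friends = "Tommy\tWashington\nChad\tSanFrancisco\nJeffP";
$friends .= "\tBoulder\nJeffC\tColoradoSprings\n";

# Two capture groups with /g in list context yield
# (name1, town1, name2, town2, ...), which initializes the hash directly.
# %hometown is a hypothetical name chosen for this sketch.
my %hometown = $friends =~ /(\w+)\s+(\w+)/g;

print "Friend\tHometown\n" . ("-" x 20) . "\n";
foreach my $name (sort keys %hometown) {
    print "$name\t$hometown{$name}\n";
}
```

The keys are sorted here only to make the output order predictable; Perl does not guarantee any particular order for keys on a hash.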

Now, suppose we change the target string $friends so that there can be embedded \n’s between a name and a hometown. That is, a hometown still follows a name, but there can be any number of intervening newlines and other space characters anywhere in the string. For example, assume $friends is assigned the value given below.

 

$friends = "\nTommy\n\tWashington\t\n\n\nChad\nSanFrancisco\nJeffP\n\t";
$friends .= "\n\nBoulder\nJeffC\tColoradoSprings\n";

 

Does the program given above for extracting name-hometown pairs still work? Actually, it does. The intervening newlines between the two elements of a pair do not matter, because \s matches any whitespace character, including the newline, so \s* simply skips over them. In situations like this, it is not necessary to use either of the so-called multi-line modifiers s or m. The m modifier changes the behavior of the anchors ^ and $, and the s modifier changes the behavior of the dot character; our pattern uses none of these.
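The claim can be checked with a short sketch that runs the same pattern over the messier string twice, once with only the g modifier and once with m and s added. The array names @plain and @with_ms and the variable $same are ours, introduced only for this comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $friends = "\nTommy\n\tWashington\t\n\n\nChad\nSanFrancisco\nJeffP\n\t";
$friends .= "\n\nBoulder\nJeffC\tColoradoSprings\n";

# Only /g: \s* already matches newlines, so each pair is found
# even when its two words sit on different lines.
my @plain = $friends =~ /(\w+\s*\w+)/g;

# Adding /m and /s changes nothing here, because the pattern
# contains no ^, no $, and no dot.
my @with_ms = $friends =~ /(\w+\s*\w+)/gms;

my $same = ("@plain" eq "@with_ms") ? "identical" : "different";
print scalar(@plain), " pairs found; result with /m and /s is $same\n";
```

Each captured element still contains the whitespace that separated the two words, which is why the second pattern in Program 4.33 is needed to split a pair into its name and hometown.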

As mentioned earlier, it is neither syntactically nor semantically wrong to match against a multi-line target string without using any modifiers. Therefore, we can write

@allFriends = ($friends =~ /(\w+\s*\w+)/);

in the program given above. In that case, however, the match stops after its first success, so @allFriends contains only one element: the first pair, Tommy and Washington.
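The difference between the two forms can be sketched side by side. The array names @first_only and @all_pairs are hypothetical, chosen for this illustration; the target string is the one from Program 4.33.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $friends = "Tommy\tWashington\nChad\tSanFrancisco\nJeffP";
$friends .= "\tBoulder\nJeffC\tColoradoSprings\n";

# Without /g, a list-context match returns the captures of the
# first successful match only.
my @first_only = $friends =~ /(\w+\s*\w+)/;

# With /g, it returns the captures of every successful match.
my @all_pairs = $friends =~ /(\w+\s*\w+)/g;

print "Without g: ", scalar(@first_only), " element(s)\n";
print "With g:    ", scalar(@all_pairs), " element(s)\n";
```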