Session 8

Regular Expressions (continued)

Substitution and Transliteration

Using $string_variable =~ s/search_pattern/replace_string/
the first match of search_pattern in $string_variable is replaced by replace_string.
s/search_pattern/replace_string/g
stands for global replacement, i.e. every occurrence is replaced, not only the first.
s/search_pattern/replace_string/i
stands for ignore case. "gi" stands for global replacement, ignore case.

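(Optional: There is also something called "transliteration", which replaces single characters, not strings.
$string_variable =~ tr/search_characters/replacement_characters/
Most of the regular expression special characters have no special meaning in transliteration, but "-" can be used for ranges, as in tr/a-z/A-Z/, which converts lower case letters to upper case. With an empty replacement list, tr/a-z// changes nothing and only counts the lower case letters; with the d modifier, tr/a-z//d deletes them.)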
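
The following is a minimal sketch of the three uses of tr mentioned above: counting, deleting and converting. The sample string and the variable names ($count, $deleted, $upper) are made up for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $text = "Alice was tired";

# Empty replacement list, no modifier: nothing is changed,
# tr just returns how many characters matched.
my $unchanged = $text;
my $count = ($unchanged =~ tr/a-z//);

# With the d modifier the matched characters are deleted.
my $deleted = $text;
$deleted =~ tr/a-z//d;

# With a replacement list each character is mapped to its counterpart.
my $upper = $text;
$upper =~ tr/a-z/A-Z/;

print "$count\n$deleted\n$upper\n";
```

Note that tr works on character lists, not patterns, so tr/a-z//d only removes lower case letters; the upper case "A" and the spaces remain.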

Examples:

1) s/[Ll][Oo][Nn][Dd][Oo][Nn]/London/g
replaces LOndoN or loNDON etc by London. This is equivalent to s/london/London/gi.
2) s/Alice/Mary/
replaces the first occurrence of Alice in a line by Mary. Use s/Alice/Mary/g to replace every occurrence.
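
To see the effect of the g and i modifiers side by side, here is a small sketch. The sample line and the variable names are assumptions chosen for illustration; each substitution works on its own copy of the line.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $line = "Alice met Alice and alice.";

# No modifier: only the first match is replaced.
(my $first = $line) =~ s/Alice/Mary/;

# g: every case-sensitive match is replaced.
(my $all = $line) =~ s/Alice/Mary/g;

# gi: every match is replaced, regardless of case.
(my $anycase = $line) =~ s/alice/Mary/gi;

print "$first\n$all\n$anycase\n";
```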

Use the following script for today's exercises:

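#!/usr/local/bin/perl
#
# Regular expressions
#
# reading a file (stop with an error message if it cannot be opened):
open(ALICE, "alice.txt") or die "Cannot open alice.txt: $!";
@lines = <ALICE>;
close(ALICE);

# searching the file content line by line:
foreach $line (@lines){
    $line =~ s/T/t/g;   # replace every upper case T by lower case t
    print $line;
} # end of foreach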

Exercises

Using the alice.txt file:
1a) Replace all upper case A by lower case a.
1b) Replace the word "Alice" by "ALICE".
1c) Delete all words with more than 3 characters.
1d) Print two blank space characters after the "." at the end of a sentence. (Optional: Don't do this if the "." is the last character in a line.)
1e) Replace single quotes (' or `) by double quotes.

Non-greedy Multipliers

By default the multipliers * and + are "greedy", which means they match as many characters as possible. For example, /(\b.+\b)/ matches everything from the start of the first word to the end of the last word in a line. A question mark behind a multiplier forces it to be non-greedy, i.e. to match as few characters as possible. Therefore /(\b.+?\b)/ matches only the first word in a line.
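
The difference can be checked with a short sketch like the one below. The sample strings and variable names are invented for illustration; the second pair of patterns shows the same effect with quote characters as delimiters.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $line = "The cat sat.";

# Greedy: .+ runs as far as it can, up to the last word boundary.
my ($greedy) = $line =~ /(\b.+\b)/;

# Non-greedy: .+? stops at the first word boundary it reaches.
my ($lazy) = $line =~ /(\b.+?\b)/;

my $quoted = "she said 'yes' and then 'no'";
my ($long)  = $quoted =~ /'(.+)'/;    # from the first quote to the last quote
my ($short) = $quoted =~ /'(.+?)'/;   # from the first quote to the next quote

print "$greedy\n$lazy\n$long\n$short\n";
```

The greedy version swallows the text between the two quoted words, which is usually not what is wanted when a line contains several delimited items.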

Exercise

2) Write a replace statement that deletes all HTML markup from a file. You need non-greedy multipliers because otherwise the text between tags may be deleted in a line that contains several tags.

Remembering Patterns

Patterns within parentheses are remembered. Using \1, \2, etc. they can be referred to within the same search pattern, and using $1, $2, etc. they can be referred to in the replace string or in later statements such as print.

Examples

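/(t.*e)/;
print "$1";
prints the longest string in the line that starts with "t" and ends with "e", i.e. from the first "t" to the last "e", because .* is greedy.

s/(t.*e)/:$1:/g;
places a ":" in front of and behind each string t...e. Because .* is greedy, this is normally one long string per line.

/(...)\1/
matches a three character string that is immediately repeated, e.g. "murmur".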

s/^(.)(.*)(.)$/$3$2$1/
switches the first and last character of a line.
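
The examples above can be tried out on fixed strings before running them on alice.txt. This is only a sketch; the sample strings and variable names are assumptions for illustration. $1 is read inside an if so that it is only used after a successful match.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $line = "sit at the table now";

# $1 holds what the parentheses matched: first "t" to last "e".
my $found = "";
if ($line =~ /(t.*e)/) {
    $found = $1;
}

# $1 in the replace string: wrap the match in colons.
(my $marked = $line) =~ s/(t.*e)/:$1:/g;

# \1 inside the pattern: the same three characters must follow again.
my $repeat = "";
if ("murmur" =~ /(...)\1/) {
    $repeat = $1;
}

# Three groups: swap the first and the last character.
(my $swapped = "abcd") =~ s/^(.)(.*)(.)$/$3$2$1/;

print "$found\n$marked\n$repeat\n$swapped\n";
```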

Exercises

3a) Try the examples from above.
3b) Insert a newline character after each punctuation mark (.,!). If you chomp each line (or the array of lines) before inserting the newline characters you can print each sentence in one line.
3c) Print double characters within parentheses "()". For example, replace "arrived" by "a(rr)ived".

More special variables:

$` contains the string before the pattern
$& contains the pattern that is matched
$' contains the string after the pattern

For example:
$line="The cat that sat on the mat.";
$line =~ /c.t/;
$` contains "The "
$& contains "cat"
$' contains " that sat on the mat."
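
As a runnable sketch of the example above (the variable names $before, $match and $after are made up for illustration), the three special variables can be copied right after the match, since the next successful match overwrites them:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $line = "The cat that sat on the mat.";
my ($before, $match, $after) = ("", "", "");

if ($line =~ /c.t/) {
    # Copy the special variables at once; they change with the next match.
    ($before, $match, $after) = ($`, $&, $');
}

print join("|", $before, $match, $after), "\n";
```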

Split and Join

Using @personal = split(/:/, $line);
a string such as $line = "Caine:Michael:Actor:14, Leafy Drive";
can be split into an array such as @personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
Other examples:
@chars = split(//, $word);
@words = split(/ /, $sentence);
@sentences = split(/\./, $paragraph);

"join" does the opposite of "split":
$bigstring = join(":",@personal);
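
A short sketch of split and join working together, using the example line from above. The other sample strings and variable names are assumptions for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $line = "Caine:Michael:Actor:14, Leafy Drive";

# split cuts the string at every ":"; join puts it back together.
my @personal = split(/:/, $line);
my $rejoined = join(":", @personal);

# An empty pattern splits a string into single characters.
my @chars = split(//, "cat");

# / / splits at every single space, so two spaces give an empty field;
# the special pattern ' ' splits on any run of whitespace instead.
my @single = split(/ /, "a  b");
my @runs   = split(' ', "a  b");

print join("\n", @personal), "\n";
print "$rejoined\n";
print scalar(@chars), " ", scalar(@single), " ", scalar(@runs), "\n";
```

When splitting text into words, the ' ' form is often the more convenient choice because real text tends to contain double spaces and tabs.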

Exercises

4) Read the alice.txt file into an array. Chomp it. Using "join" concatenate it into one string. Then split it into words (or sentences) and print it one word (sentence) per line.

5) Write a script that takes an HTML source file as input and prints it so that a newline follows only "closing tags", i.e. tags that are of the form </...>.

6) Optional: Parsing web pages:
If a CGI script downloads a page from the web, it will retrieve the HTML source code. Look at several HTML pages on the web. Think about the following questions: How would you extract information from them? How would you store that information in an array? How would a script search for words (or regular expressions) in web pages? Apart from the fact that you don't know yet how a CGI script can download pages from the web, have you learned enough so far to write such a script?