Week 5

Note:

This week covers most of what there is to know about regular expressions, which is usually more than any single project needs. You don't need to do all of the exercises listed below, but try to get an overview of the kinds of things one can do with regular expressions.

Don't skip exercises 7, 10 and 11. They are important for processing HTML files.

1 Regular Expressions

A regular expression is a pattern that is searched for in a string. Unix shell wildcards, as in "ls *.*", are similar to regular expressions, but the syntax of regular expressions is more elaborate. Several Unix programs (grep, sed, awk, ed, vi, emacs) use regular expressions, and many modern programming languages (such as Java) also support them.

$line =~ /the /
searches for an occurrence of the four-character sequence "the " in the string stored in $line. If "!~" were used instead of "=~", the match would succeed only for strings that do not contain the four-character sequence "the ".
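
The two operators can be tried out without a file. The following is a minimal sketch, assuming three made-up sample lines in place of alice.txt; the variable names @with and @without are chosen for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the scripts in this handout read alice.txt instead.
my @lines = (
    "Alice followed the rabbit.\n",
    "She fell down a hole.\n",
    "There was the key.\n",
);

# =~ succeeds when the pattern occurs somewhere in the string.
my @with = grep { $_ =~ /the / } @lines;

# !~ succeeds when the pattern occurs nowhere in the string.
my @without = grep { $_ !~ /the / } @lines;

print "contains 'the ': $_" for @with;
print "lacks 'the ':    $_" for @without;
```

Note that "There" in the third line does not count as a match by itself, because the search is case-sensitive; the line is found because of "the key".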

1.1 Example:

#!/usr/local/bin/perl
#
# Regular expressions
#
# reading a file:
open(ALICE, "alice.txt") or die "Cannot open alice.txt: $!";
@lines = <ALICE>;
close(ALICE);

# searching the file content line by line:
foreach $line (@lines) {
    if ($line =~ /the /) {
        print $line;
    } # end of if
} # end of foreach

1.2 Exercises

Unix versus DOS/Windows: For the exercises you can use the alice.txt file. It is best to save this file by copying and pasting it into a Unix editor. If you save the file under DOS/Windows, each line ends with "\r\n" instead of "\n". On Unix, chomp then removes only the "\n", so you need two chop commands instead of one chomp command if you want to remove both characters.
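
As an alternative to counting chop commands, a single substitution (introduced in section 2) can remove either kind of line ending. This is only a sketch, assuming three made-up lines: one with a DOS/Windows ending, one with a Unix ending, and one without any ending.

```perl
use strict;
use warnings;

# Hypothetical lines with different endings.
my @lines = ("first line\r\n", "second line\n", "last line");

for my $line (@lines) {
    # Remove an optional carriage return followed by a newline
    # at the very end of the string; other lines stay unchanged.
    $line =~ s/\r?\n\z//;
}

print join("|", @lines), "\n";
```

Unlike two chop commands, this does not cut off real characters when a line has only "\n" or no line ending at all.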

See the list of special characters at the end of this handout.

1) Retrieve all lines from alice.txt that do not contain /the /. Retrieve all lines that contain "the" in lower or upper case letters (hint: appending an "i" to the expression means "ignore case": /the /i).

2) a) Retrieve lines that contain a three-character string consisting of "t", then any character, then "e", as in "the dog", "tree", "not ever".
b) Retrieve lines with a three letter word that starts with t and ends with e.
c) Retrieve lines that contain a word of any length that starts with t and ends with e. Modify this so that the word has at least three characters.
d) Retrieve lines that start with a. Retrieve lines that start with a and end with n.
e) Retrieve blank lines. Think of at least two ways of doing this.
f) Retrieve lines that have two consecutive o's.
g) Retrieve lines that do not contain the blank space character.
h) Retrieve lines that contain more than one blank space character.

3) For the following regular expressions write a script that lets a user input a string. The string is then compared to a regular expression (using =~). A message is printed to the screen if the match was successful. (Alternatively you could keep using the alice.txt file and simply add a couple of lines to it that contain these patterns.)

Match the following patterns:

a) an odd digit followed by an even digit (e.g. 12 or 74)
b) a letter followed by a non-letter followed by a number
c) a word that starts with an upper case letter
d) the word "yes" in any combination of upper and lower case letters
e) the word "the" one or more times
f) a date in the form of one or two digits, a dot, one or two digits, a dot, two digits
g) a punctuation mark

4) What is the difference between the following expressions?

a) abc* and (abc)*
b) !/yes/ and /[^y][^e][^s]/
c) [A-Z][a-z]* and [A-Z][a-z]+

5) Write a script that asks users for their name, address and phone number. Test each input for accuracy, for example, there should be no letters in a phone number. A phone number should have a certain length. An address should have a certain format, etc. Ask the user to repeat the input in case your script identifies it as incorrect.

1.3 Optional: using the $_ variable

An alternative to the search script from above uses the $_ variable for the current line. It reads the file line by line (as opposed to storing it in an array). In this case the negation would be expressed as !/the /.

#!/usr/local/bin/perl
#
# Regular expressions
#
open(ALICE, "alice.txt") or die "Cannot open alice.txt: $!";
while (<ALICE>) {
    if (/the /) {
        print $_;
    } # end of if
} # end of while
close(ALICE);

2 Substitution and Transliteration

Using $string_variable =~ s/search_pattern/replace_string/
a search_pattern in $string_variable can be replaced by a replace_string.
s/search_pattern/replace_string/g
stands for global replacement, i.e. not only the first occurrence is replaced.
s/search_pattern/replace_string/i
stands for ignore case. "gi" stands for global replacement, ignore case.

2.1 Examples

a) s/[Ll][Oo][Nn][Dd][Oo][Nn]/London/g
replaces LOndoN or loNDON etc. by London. This is equivalent to s/london/London/gi.
b) s/Alice/Mary/
replaces the first occurrence of Alice in the string by Mary. To replace every occurrence, use s/Alice/Mary/g.
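
The effect of the g and i options can be compared side by side. The following is a minimal sketch on a made-up sentence; it copies the string before each substitution so that the original stays unchanged.

```perl
use strict;
use warnings;

# Hypothetical sample sentence.
my $text = "the cat and The dog and the bird";

# Copy $text, then substitute in the copy.
(my $first  = $text) =~ s/the/a/;    # no option: first match only
(my $all    = $text) =~ s/the/a/g;   # g: every match, case-sensitive
(my $all_ci = $text) =~ s/the/a/gi;  # gi: every match, ignoring case

print "$first\n";    # a cat and The dog and the bird
print "$all\n";      # a cat and The dog and a bird
print "$all_ci\n";   # a cat and a dog and a bird
```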

2.2 Optional: Transliteration

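There is also an operation called "transliteration", which replaces single characters rather than strings.
$string_variable =~ tr/search_characters/replacement_characters/
Most of the regular expression special characters are not valid for transliteration, but "-" can be used to give a range, as in tr/a-z/A-Z/, which converts lower case letters to upper case. With an empty replacement list and the d option, as in tr/a-z//d, all lower case letters are deleted. Without the d option, tr/a-z// changes nothing and only counts the matching characters.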
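
The three typical uses of tr (mapping, deleting, counting) can be sketched as follows, assuming a made-up sample string. Note that deleting requires the d option; with an empty replacement list and no d, tr only counts.

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $text = "Hello World";

# Map each lower case letter to the corresponding upper case letter.
(my $upper = $text) =~ tr/a-z/A-Z/;

# d option with an empty replacement list: delete lower case letters.
(my $no_lower = $text) =~ tr/a-z//d;

# No d option and an empty replacement list: nothing changes,
# tr just returns the number of matching characters.
my $count = ($text =~ tr/a-z//);

print "$upper\n";      # HELLO WORLD
print "$no_lower\n";   # H W
print "$count\n";      # 8
```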

2.3 The script to be used for the exercises below

#!/usr/local/bin/perl
#
# Regular expressions
#
# reading a file:
open(ALICE, "alice.txt") or die "Cannot open alice.txt: $!";
@lines = <ALICE>;
close(ALICE);

# replacing in the file content line by line:
foreach $line (@lines) {
    $line =~ s/T/t/g;
    print $line;
} # end of foreach

2.4 Exercises

6) Using the alice.txt file:
a) Replace all upper case A by lower case a.
b) Replace the word "Alice" by "ALICE".
c) Delete all words with more than 3 characters.
d) Print two blank space characters after the "." at the end of a sentence. (Optional: Don't do this if the "." is the last character in a line.)
e) Replace single quotes (' or `) by double quotes.

2.5 Non-greedy Multipliers

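By default the multipliers * and + are "greedy", which means they match as many characters as possible. For example, /(\b.+\b)/ captures everything from the first word boundary in a line to the last one, so it matches any line that contains at least one word character. A question mark after a multiplier makes it non-greedy, so that it matches as few characters as possible. Therefore /(\b.+?\b)/ captures only the first word in a line.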
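
The difference is easiest to see when a delimiter occurs several times in one line. The following is a minimal sketch, assuming a made-up sentence with two quoted words; quotation marks are used here only as an illustration of delimiters.

```perl
use strict;
use warnings;

# Hypothetical sample line with two quoted words.
my $line = 'She said "yes" and then "no" again.';

# Greedy: .+ runs on to the last quotation mark in the line.
my ($greedy) = $line =~ /"(.+)"/;

# Non-greedy: .+? stops at the next quotation mark.
my ($lazy) = $line =~ /"(.+?)"/;

# With the g option in list context, every non-greedy match is returned.
my @all = $line =~ /"(.+?)"/g;

print "greedy: $greedy\n";   # yes" and then "no
print "lazy:   $lazy\n";     # yes
print "all:    @all\n";      # yes no
```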

2.6 Exercise

7) Write a replace statement that deletes all HTML markup from a file. You need non-greedy multipliers because otherwise the text between tags may be deleted in a line that contains several tags.

2.7 Remembering Patterns

The parts of a string that are matched by patterns within parentheses are remembered. Using \1, \2, etc. they can be referred to within the same regular expression (or search expression), and using $1, $2, etc. they can be referred to in a print statement (or replace string).

2.8 Examples

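if (/(t.*e)/) { print "$1"; }
prints the longest string in the line that starts with "t" and ends with "e". The if makes sure that $1 is used only after a successful match.

s/(t.*e)/:$1:/g;
places a ":" in front of and behind each string t...e. Because .* is greedy, this is usually one long string per line.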

/(...)\1/
matches a three character string that is repeated.

s/^(.)(.*)(.)$/$3$2$1/
switches the first and last character of a line.
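
The two ways of referring to a remembered pattern can be sketched on made-up strings as follows: \1 is used inside the search expression to find a repeated word, and $1 and $2 are used in the replace string to swap two fields. The variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# \1 inside the pattern: find a word that is immediately repeated.
my $line = "It was the the best of times";
my $dup  = "";
if ($line =~ /\b(\w+) \1\b/) {
    $dup = $1;   # $1 is available after the successful match
}

# $1 and $2 in the replace string: swap surname and first name.
my $name = "Caine, Michael";
(my $swapped = $name) =~ s/^(\w+), (\w+)$/$2 $1/;

print "repeated word: $dup\n";   # the
print "swapped: $swapped\n";     # Michael Caine
```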

2.9 Exercises

8) Insert a newline character after each punctuation mark (.,!). If you chomp each line (or the array of lines) before inserting the newline characters you can print each sentence in one line.
9) Print double characters within parentheses "()". For example, replace "arrived" by "a(rr)ived".

2.10 Optional: Special variables:

$`	contains the string before the pattern
$&	contains the pattern that is matched
$'	contains the string after the pattern

For example:
$line="The cat that sat on the mat.";
$line =~ /c.t/;
$` contains "The "
$& contains "cat"
$' contains " that sat on the mat."
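
The example above can be run as a small script. This sketch copies the three special variables into ordinary variables right after the match, since they are overwritten by the next successful match; the names $before, $match and $after are chosen for illustration.

```perl
use strict;
use warnings;

my $line = "The cat that sat on the mat.";
my ($before, $match, $after) = ("", "", "");

if ($line =~ /c.t/) {
    # Save the special variables before another match changes them.
    ($before, $match, $after) = ($`, $&, $');
}

print "<$before|$match|$after>\n";   # <The |cat| that sat on the mat.>
```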

2.11 Split and Join

Using @personal = split(/:/, $line);
a string such as $line = "Caine:Michael:Actor:14, Leafy Drive";
can be split into an array such as @personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
Other examples:
@chars = split(//, $word);
@words = split(/ /, $sentence);
@sentences = split(/\./, $paragraph);

"join" does the opposite of "split":
$bigstring = join(":",@personal);
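
The following is a minimal sketch of split and join on the record from above, plus one point to watch: a pattern of exactly one blank, / /, produces an empty element when two blanks follow each other, whereas /\s+/ treats a run of whitespace as one separator. The sample sentence is made up.

```perl
use strict;
use warnings;

my $line     = "Caine:Michael:Actor:14, Leafy Drive";
my @personal = split(/:/, $line);      # four fields
my $rejoined = join(":", @personal);   # the original string again

# An empty pattern splits into single characters.
my @chars = split(//, "cat");

# Hypothetical sentence with two blanks between the words.
my $sentence = "one  two";
my @by_blank = split(/ /, $sentence);     # ("one", "", "two")
my @by_space = split(/\s+/, $sentence);   # ("one", "two")

print scalar(@personal), " fields, last: $personal[3]\n";
print "chars: ", join("-", @chars), "\n";
print "by blank: ", scalar(@by_blank), ", by whitespace: ", scalar(@by_space), "\n";
```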

2.12 Exercises

10) Read the alice.txt file into an array. Chomp it. Using "join" concatenate it into one string. Then split it into words (or sentences) and print it one word (sentence) per line.

11) Write a script that takes an HTML source file as input and prints it so that a newline follows only "closing tags", i.e. tags that are of the form </...>.

12) Optional: Parsing web pages
If a CGI script downloads a page from the web, it will retrieve the HTML source code. Look at several HTML pages on the web. Think about the following questions: How would you extract information from them? How would you store that information in an array? How would a script search for words (or regular expressions) in web pages? Apart from the fact that you don't know yet how a CGI script can download pages from the web, have you learned enough so far that you could write such a script?

Special Characters

.	Any single character except a newline
^	The beginning of the line or string
$	The end of the line or string (Use "\r$" instead of "$" for end of line in DOS/Windows.)
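*	Zero or more of the preceding character (or parenthesized group)
+	One or more of the preceding character (or parenthesized group)
?	Zero or one of the preceding character (or parenthesized group)
{5,10}	Five to ten times the preceding character (or parenthesized group)
	for example: * equals {0,}; + equals {1,}
	? equals {0,1}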
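
The brace notation can be sketched on a few made-up strings. The counts are written without blanks inside the braces, e.g. {1,}, because older versions of Perl do not treat braces that contain blanks as a multiplier.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @candidates = ("12", "123", "12345", "123456");

# {3,5}: three to five of the preceding item, here the digit class \d.
my @three_to_five = grep { /^\d{3,5}$/ } @candidates;

# + and {1,} mean the same thing.
my $plus   = ("aaa" =~ /^a+$/)    ? 1 : 0;
my $braces = ("aaa" =~ /^a{1,}$/) ? 1 : 0;

print "three to five digits: @three_to_five\n";   # 123 12345
print "a+ matched: $plus, a{1,} matched: $braces\n";
```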

More special characters

[qjk]	Either q or j or k
[^qjk]	Neither q nor j nor k
[a-z]	Anything from a to z inclusive
[^a-z]	Any character that is not a lower case letter
[a-zA-Z]	Any letter
[a-z]+	Any non-zero sequence of lower case letters
jelly|cream	Either jelly or cream
(eg|le)gs	Either eggs or legs
(da)+	Either da or dada or dadada or...
\n	A newline
\t	A tab
\w	Any alphanumeric (word) character.
	The same as [a-zA-Z0-9_]
\W	Any non-word character.
	The same as [^a-zA-Z0-9_]
\d	Any digit. The same as [0-9]
\D	Any non-digit. The same as [^0-9]
\s	Any whitespace character: space,
	tab, newline, etc
\S	Any non-whitespace character
\b	A word boundary, outside [] only
\B	No word boundary

Escapes for special characters

\|	Vertical bar
\[	An open square bracket
\)	A closing parenthesis
\*	An asterisk
\^	A caret symbol
\/	A slash
\\	A backslash
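
A short sketch of how escaping changes the meaning of a special character, using made-up strings: in Perl a plain | separates alternatives, while a preceding backslash turns a special character into an ordinary one.

```perl
use strict;
use warnings;

# A plain | means "either ... or".
my $dessert = "jelly or cream";
my @found   = $dessert =~ /(jelly|cream)/g;

# \* matches a literal asterisk instead of acting as a multiplier.
my $sum      = "3*4 = 12";
my $has_star = ($sum =~ /3\*4/) ? 1 : 0;

# \/ matches a slash, and \\ in the replace string stands for one backslash.
my $path = "/usr/local/bin";
(my $dos = $path) =~ s/\//\\/g;

print "found: @found\n";          # jelly cream
print "has star: $has_star\n";    # 1
print "$dos\n";                   # \usr\local\bin
```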