Week 5

1 Regular Expressions

A regular expression is a pattern that is searched for in a string. Unix shell wildcards such as "ls *.*" are similar to regular expressions, but the syntax of regular expressions is more elaborate. Several Unix programs (grep, sed, awk, ed, vi, emacs) use regular expressions, and many modern programming languages (such as Java) also support them.

PHP has offered two different types of regular expressions: POSIX and Perl-compatible ones. The POSIX functions (ereg and related) were removed in PHP 7, so only the Perl-compatible preg_ functions are used here. Below are a few example scripts which illustrate how to open a file in PHP, how to use Perl-compatible regular expressions and how to use replacement and split/implode.

The list of special characters summarises characters that have special meanings or need to be escaped in regular expressions.
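As a small illustration of what "special meaning" and "escaped" mean in practice, the sketch below contrasts an unescaped and an escaped full stop. The sample text and the variable names are made up for this example; preg_quote is the standard PHP function for escaping special characters.

```php
<?php
// Hypothetical input; the price string is only an illustration.
$text = "The ticket costs 3x50 today";

// Unescaped, "." is a special character that matches any single character,
// so this pattern also finds "3x50".
$loose = preg_match('/3.50/', $text);

// Escaped with a backslash, "\." matches only a literal full stop.
$strict = preg_match('/3\.50/', $text);

// preg_quote() adds the backslashes; "/" is passed as the delimiter in use.
$quoted = preg_quote('$3.50?', '/');

echo "Unescaped pattern matched: ", $loose, "\n";
echo "Escaped pattern matched: ", $strict, "\n";
echo "Quoted string: ", $quoted, "\n";
```

Single-quoted PHP strings are used for the patterns so that the backslashes reach the regular expression engine unchanged.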

1.1 Opening a file and printing it line by line

<html><head>
<title>Displaying a file line by line</title>
</head><body>
<hr><h1>Results:</h1><hr><p>
<?php
$lines = file('alice.txt');
foreach ($lines as $line_num => $line) {
echo "Line ",$line_num,": ",$line, "<br>\n";
}
?>
<p><hr></body></html>

1.2 Using regular expressions

<html><head>
<title>Searching</title> </head><body>
<hr><h1>Results:</h1><hr><p>
<?php
$lines = file('alice.txt');
foreach ($lines as $line_num => $line) {
if (preg_match("/the /i", $line, $matches)) {
echo "<br> Line ", $line_num, " matches: ",$matches[0],"<br>";
echo $line;
}
}
?>
<p><hr></body></html>

Appending an "i" after the closing delimiter of the expression means "ignore case": /the /i.
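The effect of the "i" modifier can be checked without a file. The following is a minimal sketch in which three made-up lines stand in for the contents of alice.txt; the array names are assumptions for this example only.

```php
<?php
// Hypothetical sample lines standing in for the contents of alice.txt.
$lines = array("The Rabbit ran.", "Down the hole.", "Alice followed.");

$sensitive = array();
$insensitive = array();
foreach ($lines as $line_num => $line) {
    // Without the modifier only lower-case "the " matches.
    if (preg_match("/the /", $line)) {
        $sensitive[] = $line_num;
    }
    // With the i modifier "The ", "THE " and so on match as well.
    if (preg_match("/the /i", $line)) {
        $insensitive[] = $line_num;
    }
}

echo "Case-sensitive matches on lines: ", implode(", ", $sensitive), "\n";
echo "Case-insensitive matches on lines: ", implode(", ", $insensitive), "\n";
```

Note that the line numbers produced by the array index start at 0, as in the scripts above.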

The PHP manual page for preg_match provides more details on regular expressions and examples of preg_match.

1.3 Exercises

For the exercises you can use the alice.txt file. (Note for MS Windows users: It is best to save this file by copying and pasting it into a Unix editor. If you save the file under DOS/Windows, each line ends with "\r\n" instead of "\n", which could be a problem.)
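If the file has already been saved with DOS/Windows line endings, one possible workaround is to convert them inside the script. The sketch below is only an illustration: the sample text and variable names are made up, and it assumes the whole file content is held in one string.

```php
<?php
// Hypothetical text saved with DOS/Windows line endings.
$text = "Alice was beginning\r\nto get very tired\r\n";

// Replace each "\r\n" pair by a single "\n".
$unix = preg_replace('/\r\n/', "\n", $text);

// Split into lines; PREG_SPLIT_NO_EMPTY drops the empty piece after the last "\n".
$lines = preg_split('/\n/', $unix, -1, PREG_SPLIT_NO_EMPTY);

foreach ($lines as $line_num => $line) {
    echo "Line ", $line_num, ": ", $line, "\n";
}
```

Without the conversion, a pattern that is anchored to the end of a line with "$" can fail unexpectedly, because the "\r" still sits between the last visible character and the "\n".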

1) Retrieve all lines from alice.txt that do not contain /the /. Retrieve all lines that contain "the" with lower or upper case letters.

2) a) Retrieve lines that contain a word of any length that starts with t and ends with e. Modify this so that the word has at least three characters.
b) Retrieve lines that start with a. Retrieve lines that start with a and end with n. Hint: You need to specify the beginning of the line, "a", any number of any characters in the middle, "n", end of line.
c) Retrieve blank lines. Think of at least two ways of doing this.
d) Retrieve lines that contain a word that starts with an upper case letter.

3) What is the difference between the following expressions?

a) abc* and (abc)*
b) !preg_match("/yes/"...) and /[^y][^e][^s]/
c) [A-Z][a-z]* and [A-Z][a-z]+

2 Replacement

<html><head>
<title>Replacing</title> </head><body>
<hr><h1>Results:</h1><hr><p>
<?php
$lines = file('alice.txt');
foreach ($lines as $line_num => $line) {
$line = preg_replace("/T/", 't', $line);
echo $line, "<br>";
}
?>
<p><hr></body></html>

(Manual page for preg_replace).

2.1 Examples

a) ("/[Ll][Oo][Nn][Dd][Oo][Nn]/",'London')
replaces LOndoN or loNDON etc by London.
b) ("/Alice/",'Mary')
replaces every occurrence of Alice by Mary.
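The two argument lists above can be tried out as complete preg_replace calls. This is a minimal sketch on a made-up sentence; the variable names are assumptions, and the /london/i variant is shown only as an alternative way of writing pattern a).

```php
<?php
// Hypothetical sample sentence; the handout applies these to lines of alice.txt.
$line = "Alice went from LOndoN to loNDON.";

// a) Normalise any capitalisation of "london" to "London".
$a = preg_replace("/[Ll][Oo][Nn][Dd][Oo][Nn]/", 'London', $line);

// The same effect, written with the "ignore case" modifier instead.
$a_alt = preg_replace("/london/i", 'London', $line);

// b) Replace every occurrence of "Alice" by "Mary".
$b = preg_replace("/Alice/", 'Mary', $a);

echo $a, "\n";
echo $b, "\n";
```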

2.2 Exercises

4) Using the alice.txt file:
a) Replace all upper case A by lower case a.
b) Delete all words with more than 3 characters.

3 Non-greedy Multipliers and Patterns

By default the multipliers * and + are "greedy", which means they match as many characters as possible. For example, /(\b.+\b)/ matches everything from the start of the first word to the end of the last word in a line. A question mark after a multiplier makes it non-greedy, so that it matches as few characters as possible. Therefore /(\b.+?\b)/ matches only the first word in a line.
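The difference can be made visible by running both patterns on the same line. The sketch below uses two made-up sample strings; the second pair of patterns, with quotation marks as delimiters, is an additional illustration that is not part of the handout.

```php
<?php
// Hypothetical sample line; any line of alice.txt would do.
$line = "Alice was tired.";

// Greedy: .+ runs as far as it can, so the match ends at the last word boundary.
preg_match('/(\b.+\b)/', $line, $greedy);

// Non-greedy: .+? stops at the first word boundary it reaches.
preg_match('/(\b.+?\b)/', $line, $lazy);

// The same contrast with quotation marks as delimiters.
$quote = 'She said "yes" and then "no".';
preg_match('/".+"/', $quote, $greedy_q);
preg_match('/".+?"/', $quote, $lazy_q);

echo "Greedy:     ", $greedy[1], "\n";
echo "Non-greedy: ", $lazy[1], "\n";
echo "Greedy:     ", $greedy_q[0], "\n";
echo "Non-greedy: ", $lazy_q[0], "\n";
```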

3.1 Exercise

5) Write a replace statement that deletes all HTML markup from a file. You need non-greedy multipliers because otherwise the text between tags may be deleted in a line that contains several tags.

3.2 Remembering Patterns

Patterns within parentheses are remembered. Using \1, \2, etc., they can be referred to within the same regular expression (the search expression) or in the replacement string. They can also be echoed using the array provided ("$matches" in the example below).

3.3 Examples

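if (preg_match("/(t.*e)/", $line, $matches)) {
echo $matches[1];
}
echoes the longest string in the line that starts with "t" and ends with "e". The test avoids a warning about an undefined array element when the line does not match.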

("/(t.*e)/",':\1:',$line)
places a ":" in front of and behind each string t...e.

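"/(...)\\1/"
matches a three-character string that is immediately repeated. (Inside a double-quoted PHP string the backslash must be doubled, because "\1" on its own would be read as an octal character escape. In a single-quoted string '/(...)\1/' is sufficient.)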

("/^(.)(.*)(.)$/",'\3\2\1',$line)
switches the first and last character of a line.
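The examples in this section can be combined into one runnable sketch. The sample strings and variable names are made up for illustration, and single-quoted patterns are used so that \1 does not need a doubled backslash.

```php
<?php
// Hypothetical sample line.
$line = "the white rabbit went to the tea table";

// \1 in the replacement refers to the text captured by the first group.
// .* is greedy, so the group spans from the first "t" to the last "e".
$marked = preg_replace('/(t.*e)/', ':\1:', $line);

// \1 inside the pattern: three characters followed by the same three again.
preg_match('/(...)\1/', "a soft murmur", $repeat);

// Three groups, written back in reverse order: first and last character swap.
$swapped = preg_replace('/^(.)(.*)(.)$/', '\3\2\1', "Alice");

echo $marked, "\n";
echo $repeat[0], " repeats ", $repeat[1], "\n";
echo $swapped, "\n";
```

Because the whole sample line starts with "t" and ends with "e", the greedy group in the first call captures the entire line.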

3.4 Exercise

6) Print double characters within parentheses "()". For example, replace "arrived" by "a(rr)ived".

4 Split and Implode

Using $personal = preg_split("/:/", $line);
a string such as $line = "Caine:Michael:Actor:14, Leafy Drive";
can be split into an array such as $personal = array ("Caine", "Michael", "Actor", "14, Leafy Drive");
Other examples:
$chars = preg_split("//", $word, -1, PREG_SPLIT_NO_EMPTY); // without the flag the array starts and ends with an empty string
$words = preg_split("/ /", $sentence);
$sentences = preg_split("/\./", $paragraph);

"implode" does the opposite of "preg_split":
$bigstring = implode(":",$personal);

(Manual page for preg_split. Manual page for implode.)
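The following sketch runs the split and implode calls from this section on small sample values and checks that implode restores the original string. The sample word and sentence are made up for illustration.

```php
<?php
$line = "Caine:Michael:Actor:14, Leafy Drive";

// Split at every colon.
$personal = preg_split("/:/", $line);

// Join the pieces again with a colon between them.
$bigstring = implode(":", $personal);

// An empty pattern splits between all characters; the flag removes
// the empty strings that would otherwise appear at both ends.
$chars = preg_split("//", "cat", -1, PREG_SPLIT_NO_EMPTY);

// Split a sentence into words at single spaces.
$words = preg_split("/ /", "Down the rabbit hole");

print_r($personal);
echo $bigstring, "\n";
echo implode("-", $chars), "\n";
echo count($words), " words\n";
```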

4.1 Exercises

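7) Read the alice.txt file into an array. "Chomp" it, i.e. remove the trailing newline from each line (PHP has no chomp function; rtrim can be used instead). Using "implode", concatenate it into one string. Then split it into words (or sentences) and print it one word (sentence) per line.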

8) Write a script that takes an HTML source file as input and prints it so that a newline follows only "closing tags", i.e. tags that are of the form </...>.