Regular Expressions (continued)

1) Non-greedy Multipliers

By default, the multipliers * and + are "greedy", which means they match as many characters as possible. For example, r"\b.+\b" and r"\b.*\b" both match from the first word boundary in a line to the last one, which in most lines is nearly the whole line. A question mark after a multiplier makes it non-greedy, so that it matches as few characters as possible. Therefore r"\b.+?\b" matches the first word in a line, whereas r"\b.*?\b" matches an empty string.
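
As a quick check of these four patterns, the sketch below applies them to one sample string. The string "Hello, world!" is an assumption chosen for illustration; it is not taken from the course files.

```python
import re

line = "Hello, world!"   # sample text, chosen only for illustration

# greedy: runs from the first word boundary to the last one
greedy_plus = re.search(r"\b.+\b", line).group()
greedy_star = re.search(r"\b.*\b", line).group()

# non-greedy: stops at the earliest point where the pattern can succeed
lazy_plus = re.search(r"\b.+?\b", line).group()
lazy_star = re.search(r"\b.*?\b", line).group()

print(greedy_plus)
print(greedy_star)
print(lazy_plus)
print(repr(lazy_star))
```

Both greedy patterns give Hello, world (the final "!" is left out because no word boundary follows it), the non-greedy + gives Hello, and the non-greedy * gives the empty string.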

Exercise

1.1 Write a regular expression that finds HTML tags in a file and prints them. (You can use the first lines of the source code of this web page to test it.)

2) Remembering Patterns

Sometimes it is useful to select different substrings from a string that is matched by a regular expression. For example, the expression r"\d\d?\.\d\d?\.\d\d" matches a date written with periods, such as 12.25.99. To select the month, day, and year from this date format, parentheses are used as shown in the following example:

#!/usr/bin/env python
import re

date = input("Please enter a date in the format mm.dd.yy ")

# each pair of parentheses remembers the part of the match inside it
keyword = re.compile(r"(\d\d?)\.(\d\d?)\.(\d\d)")

result = keyword.search(date)
if result:
    # groups are numbered from 1, left to right
    print("Month:", result.group(1))
    print("Day:", result.group(2))
    print("Year:", result.group(3))
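
The same groups can also be inspected without typing a date in. The following sketch uses a fixed sample sentence (an assumption for illustration) and shows that group(0) returns the whole match, while groups() returns all remembered parts at once.

```python
import re

keyword = re.compile(r"(\d\d?)\.(\d\d?)\.(\d\d)")

# sample sentence, chosen only for illustration
result = keyword.search("The party is on 12.25.99 at noon.")

whole = result.group(0)    # the complete match
parts = result.groups()    # a tuple holding groups 1, 2 and 3
month, day, year = parts

print(whole)
print(parts)
print(month, day, year)
```

Here whole is "12.25.99" and parts is the tuple ('12', '25', '99'). Note that every group is a string, even when it contains only digits.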

Exercise

2.1 Continue with the previous exercise but print the type of every HTML tag your script finds, such as html, body, title, a, br.

2.2 Optional: Print all lines in the alice.txt file so that the first and the last character in each line are switched.

Optional material/exercise:

2.3 Parentheses can also be used to match repeated substrings within one regular expression. In this case, the groups are denoted by \1, \2, \3. For example, r"(.)\1" matches any character that occurs twice in a row. Note that this is different from r"..", which means any two (possibly different) characters. Exercise: Print all lines in the alice.txt file that contain two double characters.
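
As a small illustration of the \1 notation (it does not solve the exercise), the sketch below searches two sample words. The words are assumptions chosen for this example.

```python
import re

doubled = re.compile(r"(.)\1")   # any character followed by itself

# sample words, chosen only for illustration
first = doubled.search("balloon")
none_found = doubled.search("cat")

print(first.group())     # the whole match: the doubled character
print(first.group(1))    # the remembered single character
print(none_found)        # None, because no character repeats
```

For "balloon" the whole match is "ll" and group 1 is "l"; for "cat" the search returns None.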

3) Substitution

Instead of just printing the results of a search, Python can also replace them, similar to search/replace in a word processor. The following script replaces every "t" with "T" using the sub method; "sub" stands for "substitution". Note that the search pattern is a regular expression, but the replacement is an ordinary string: in the example, r"t" is a regular expression and "T" is a string.

#!/usr/bin/env python
import re

# open the file and read it into a list of lines
with open("alice.txt", "r") as infile:
    text = infile.readlines()

# compiling the regular expression:
keyword = re.compile(r"t")

# searching the file content line by line:
for line in text:
    # each line still ends in "\n", so print must not add another one
    print(keyword.sub("T", line), end="")
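
The script above uses a pattern that is a single letter, but the pattern given to sub can be any regular expression. The following sketch replaces every run of digits in a sample string (an assumption for illustration) with "#". The optional count argument of sub limits how many replacements are made.

```python
import re

digits = re.compile(r"\d+")

# sample text, chosen only for illustration
line = "Chapter 12 starts on page 345."

all_replaced = digits.sub("#", line)            # replace every run of digits
first_only = digits.sub("#", line, count=1)     # replace only the first run

print(all_replaced)
print(first_only)
```

The first call gives "Chapter # starts on page #." and the second gives "Chapter # starts on page 345.".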

Exercises

Using the alice.txt file:
3.1 Replace all upper case A by lower case a.
3.2 Delete all words with more than 3 characters. Hint: deleting means replacing with nothing.
3.3 Print two blank space characters after the "." at the end of a sentence. (Optional: Don't do this if the "." is the last character in a line.)
3.4 Replace single quotes (' or `) with double quotes.

3.5 Modify your program from exercise 1.1, so that it deletes all HTML markup.

4) Split and Join

Because the scripts so far apply a regular expression to one line of the file at a time, it is sometimes necessary to reformat a file before searching so that each line does not contain too many items. The following example shows how the alice.txt file can be reformatted using split and join so that the line breaks occur where the file had the punctuation marks "." or ",".

#!/usr/bin/env python
import re

# open the file and read it into a list of lines
with open("alice.txt", "r") as infile:
    text = infile.readlines()

# join all of the lines together using " " as glue
bigstring = " ".join(text)

# replace each newline, together with the white space around it, by one space
keyword = re.compile(r"\s*\n\s*")
bigstring = keyword.sub(" ", bigstring)

# split bigstring where "." or "," occurs
keyword = re.compile(r"[\.,]\s*")
text = keyword.split(bigstring)

for line in text:
    print(line)
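
To see what each step does without needing alice.txt, the sketch below runs the same steps on three short lines. The sample lines are an assumption made for illustration.

```python
import re

# sample lines standing in for the contents of a file (illustration only)
text = ["Alice was tired,\n", "very tired. She sat\n", "down.\n"]

# join all of the lines together using " " as glue
bigstring = " ".join(text)

# replace each newline and the white space around it by a single space
bigstring = re.sub(r"\s*\n\s*", " ", bigstring)

# split where "." or "," occurs, dropping the white space that follows
parts = re.split(r"[\.,]\s*", bigstring)

for part in parts:
    print(repr(part))
```

The resulting list is ['Alice was tired', 'very tired', 'She sat down', '']. The last element is an empty string because the text ends with a "." and split also returns whatever follows the final separator, even if that is nothing.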

Exercises

4.1 Modify the example so that it splits the file into words instead of sentence parts.

4.2 Write a script that takes an HTML source file as input and prints it so that a newline follows only "closing tags", i.e. tags that are of the form </...>.

4.3 Optional: Parsing web pages:
If a CGI script downloads a page from the web, it will retrieve the HTML source code. Look at several HTML pages on the web. Think about the following questions: How would you extract information from them? How would you store that information in an array? How would a script search for words (or regular expressions) in web pages? Apart from the fact that you do not yet know how a CGI script can download pages from the web, have you learned enough so far to write such a script?