1) Python and Networking

Protocols

Bottom layers: physical network, IP, TCP
Top layers: FTP, Telnet, HTTP

Types of connection: connection-oriented (e.g. TCP, where a connection is set up before any data flows) vs. packet-oriented, i.e. connectionless (e.g. UDP, where each datagram travels on its own)
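
To make the distinction concrete, here is a minimal sketch (not part of the original example; the host and ports are placeholders) that opens a connection-oriented TCP connection and sends a connectionless UDP datagram using Python's socket module:

#!/usr/bin/env python

import socket

# connection-oriented: TCP sets up a connection before any data flows
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("www.example.com", 80))      # placeholder host and port
tcp.send("HEAD / HTTP/1.0\r\nHost: www.example.com\r\n\r\n")
print tcp.recv(1024)                      # first chunk of the server's reply
tcp.close()

# packet-oriented (connectionless): UDP just sends a single datagram,
# with no connection setup and no delivery guarantee
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto("hello", ("127.0.0.1", 9999))  # placeholder address and port
udp.close()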

Downloading a page from the web

#!/usr/bin/env python

import urllib
import sys

url = "http://www.slis.indiana.edu/"

##################### connect to remote page #################
try:
    remote = urllib.urlopen(url)
except IOError, e:              # urllib.urlopen raises IOError on failure
    print "cannot open URL:", e
    sys.exit(1)

#################### read the content of the remote page ####
content = remote.read()
print remote.info()
print content

Comments:
1) "remote" supports the same commands as a file handle. "remote.readlines()" reads the content line by line into a list. "remote.read()" reads all of the lines into a single string. Unless a for loop is used to print the content, remote.read() is usually more useful.

2) remote.info() returns the MIME headers the server sent, as a message object rather than a plain string; the second snippet below pulls out individual headers by name.
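
For illustration, the two ways of reading the response look like this (a response can only be read once, so pick one or the other):

lines = remote.readlines()   # a list of strings, one per line
for line in lines:
    print line,              # trailing comma: each line already ends in a newline

# ... or, instead (not after readlines() -- the data is gone by then):
# content = remote.read()    # the whole page as a single string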
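
And the message object that remote.info() returns lets you look up individual headers by name, for example:

headers = remote.info()                     # a message object, not a plain string
print headers.gettype()                     # content type, e.g. "text/html"
print headers.getheader("Content-Length")   # one header; None if the server did not send it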

Exercises

1.1 Try the script using pages you know. What happens if the hostname is incorrect or if the file does not exist?

1.2 Create a form that asks the user to submit a URL. Write a CGI script that opens the URL, reads the content and displays it to the user. (There are all kinds of applications for such scripts, for example, they can serve as meta-search engines or could filter content, such as images or advertisements, out from a web page.)

1.3 Create a form that asks the user to submit a URL and a search term. Write a CGI script that opens the URL and displays all the lines that contain the search term.

1.4 (Optional) Write a script that determines whether a page is a 404 File Not Found message. Note that urllib.urlopen does not raise an error for a 404; the server returns its error page as ordinary content, so you have to inspect the text yourself using remote.read() and regular expressions.

2) Parsing HTML

Web crawlers (see pp. 654-656 in the Core Python Programming book) navigate the web by following links. The following example shows how to extract and display the links of a page.

#!/usr/bin/env python

import urllib
import sys
import htmllib
import formatter

url = "http://www.slis.indiana.edu/"

##################### connect to remote page #################
try:
    remote = urllib.urlopen(url)
except IOError, e:              # urllib.urlopen raises IOError on failure
    print "cannot open URL:", e
    sys.exit(1)

#################### read the content of the remote page ####
content = remote.read()

#################### parse the HTML of the page #############
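# NullFormatter discards the rendered text; we only need the parse events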
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(content)
parser.close()

#################### get the links (anchors) ##################
links = parser.anchorlist
for eachlink in links:
    print eachlink
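
Note that anchorlist contains the links exactly as they appear in the page, so many of them are relative (e.g. "research/index.html"). Before a crawler can follow them, they have to be resolved against the URL of the page itself; a minimal sketch of that step (appended to the script above) using the standard urlparse module:

#################### resolve relative links ##################
import urlparse

for eachlink in parser.anchorlist:
    # urljoin leaves absolute URLs alone and resolves relative ones
    print urlparse.urljoin(url, eachlink)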

Exercises

2.1 Modify your script from 1.3 so that it searches only the links of a document, not the document itself.

2.2 (Optional) You can combine the example above with 1.4 to check for broken links. To simplify matters, only check the links that actually start with http://.