CO32037 Coursework

For the coursework you are asked to build an on-line search engine for marked-up text. The text to be searched consists of three files of the Brown corpus, which can be downloaded: file1, file2, file3. (The Brown corpus is a collection of newspaper and other texts used by linguists and AI researchers. The mark-up format is XML-like. You can ignore all mark-up apart from <S> and </S>, which indicate sentences, and <P> and </P>, which indicate paragraphs. You should download the three files and save them in your CGI directory. You should not manually edit the files but instead have your Perl script parse them.)
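The parsing step described above can be sketched as follows. This is a minimal illustration written in Python for brevity; the coursework itself must be implemented in Perl. The function name extract_units and the sample string are hypothetical, and the sketch assumes that sentence and paragraph tags are properly closed and that words are separated by whitespace in the source files, which should be checked against the real corpus files.

```python
import re

def extract_units(text, scope="sentence"):
    """Return the plain text of every sentence or paragraph in `text`.

    Assumption: units are delimited by <S>...</S> or <P>...</P> and every
    other tag can simply be dropped.
    """
    tag = "S" if scope == "sentence" else "P"
    # Allow optional attributes inside the opening tag, just in case.
    unit_re = re.compile(r"<{0}\b[^>]*>(.*?)</{0}>".format(tag),
                         re.DOTALL | re.IGNORECASE)
    units = []
    for match in unit_re.finditer(text):
        plain = re.sub(r"<[^>]+>", "", match.group(1))  # strip remaining tags
        plain = " ".join(plain.split())                 # normalise whitespace
        if plain:
            units.append(plain)
    return units

# Hypothetical sample; the real files may use different inner tags.
sample = "<P>\n<S><W>The</W> <W>jury</W> <W>said</W>.</S>\n<S>It <W>was</W> fair.</S>\n</P>"
sentences = extract_units(sample, "sentence")
paragraphs = extract_units(sample, "paragraph")
```

Extracting the units once and searching the resulting list keeps the scope choice (sentence or paragraph) separate from the matching logic.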

The coursework is to be done in teams. Check here for some tips on teamwork.

Components of your coursework

Your search engine should consist of two search screens.

1) The first screen contains

  • Three textboxes called word1, word2 and word3 for the user to type in search terms. (If a user types more than one word into one box, the input will be interpreted as a phrase search.)
  • A drop down menu called scope that has two choices sentence and paragraph to allow searches for words that occur in the same sentence or the same paragraph.
  • Two drop down menus called boolean1, boolean2 with choices AND, OR, NOT. These drop down menus apply to the second and third textbox, respectively.

The CGI script that processes the input from this form should be called basicsearch.
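One possible reading of the first search screen is sketched below, in Python for illustration only (the coursework must be written in Perl). The function names are hypothetical. The sketch assumes that the operators are applied left to right, that NOT means "and not", that matching is case-insensitive on whole words, and that empty textboxes are ignored; none of these details is fixed by the specification above.

```python
import re

def contains(unit, term):
    """True if the word or multi-word phrase `term` occurs in `unit`."""
    words = term.split()
    if not words:
        return False
    # Words of a phrase must be adjacent, separated only by whitespace.
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, unit, re.IGNORECASE) is not None

def combine(left, op, right):
    """Combine two truth values with AND, OR or NOT (read as AND NOT)."""
    if op == "AND":
        return left and right
    if op == "OR":
        return left or right
    if op == "NOT":
        return left and not right
    raise ValueError("unknown operator: " + op)

def basic_match(unit, word1, boolean1, word2, boolean2, word3):
    """Evaluate ((word1 boolean1 word2) boolean2 word3) against one unit."""
    result = contains(unit, word1)
    if word2.strip():
        result = combine(result, boolean1, contains(unit, word2))
    if word3.strip():
        result = combine(result, boolean2, contains(unit, word3))
    return result

unit = "The grand jury said the election was fair."
hit = basic_match(unit, "grand jury", "AND", "election", "NOT", "fraud")
```

Rejecting unknown operator values, as combine does here, is also a simple form of input validation for the boolean1 and boolean2 fields.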

2) The second search screen contains

  • One textbox called word for the user input. In this case the user can type in as many terms as desired. A plus sign (+) in front of a term indicates a required term (i.e., Boolean AND). A minus sign (-) in front of a term indicates a term that must not occur (i.e., Boolean NOT). All other terms are connected via Boolean OR. A phrase is indicated by enclosing terms in double quotes ("").
  • A drop down menu called scope that has two choices sentence and paragraph to allow searches for words that occur in the same sentence or the same paragraph.

The CGI script that processes the input from this form should be called expertsearch. (If you are using the same script for both searches, put a copy under both names.)
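The query syntax of the second search screen could be split into its three groups of terms roughly as follows. This is a sketch in Python for illustration only (the coursework must be written in Perl); the function name parse_expert_query is hypothetical. It assumes that a sign may also precede a quoted phrase and it does not try to repair unbalanced quotes.

```python
import re

# An optional sign, followed by either a quoted phrase or a run of
# non-whitespace characters.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_expert_query(query):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for sign, phrase, word in TOKEN.findall(query):
        term = " ".join((phrase or word).split())  # normalise inner spaces
        if not term:
            continue  # e.g. an empty pair of quotes
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional

required, excluded, optional = parse_expert_query(
    '+jury -fraud election "grand jury"')
```

How the three lists are then combined when both required and optional terms are present is left open by the specification, so that decision should be documented by the team.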

It is important that you name your form items and the CGI scripts exactly as described here because a Perl script will be used to check some of the features of your application automatically. Points will be subtracted if this automatic check fails.

3) The results page for the searches should show the sentences or paragraphs (depending on which scope was selected) that contain the terms. The text should be displayed without the original mark-up. The search terms should be highlighted in a different color.
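Highlighting on the results page might look like the sketch below, written in Python for illustration only (the coursework must be written in Perl). The function name highlight, the span-based styling and the default color are assumptions. The sketch expects text whose mark-up has already been removed and escapes it before output, so that characters such as & and < in the corpus or in the search terms cannot break the HTML of the page.

```python
import html
import re

def highlight(unit, terms, color="red"):
    """Return `unit` as HTML with every occurrence of a term colored."""
    patterns = [r"\s+".join(re.escape(w) for w in t.split())
                for t in terms if t.split()]
    if not patterns:
        return html.escape(unit)
    # Try longer patterns first so a phrase wins over its first word.
    patterns.sort(key=len, reverse=True)
    regex = re.compile(r"\b(" + "|".join(patterns) + r")\b", re.IGNORECASE)
    # With one capturing group, split() alternates plain text and matches.
    pieces = regex.split(unit)
    out = []
    for index, piece in enumerate(pieces):
        text = html.escape(piece)
        if index % 2 == 1:  # odd positions hold the matched terms
            text = '<span style="color: {0}">{1}</span>'.format(color, text)
        out.append(text)
    return "".join(out)

line = highlight("The jury said A & B.", ["jury"])
```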

4) Both scripts should set a cookie. The cookie is used to check whether a search is different from the previous one. If a search is exactly the same as the previous one, the results page should display "Same search as last time. Try something else!" at the top of the screen.
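The repeated-search check could be implemented along the following lines. This sketch is in Python for illustration only (the coursework must be written in Perl). The cookie name lastsearch and the idea of storing a hash of the form fields, rather than the raw query, are assumptions; in a CGI script the incoming cookie string is available in the HTTP_COOKIE environment variable.

```python
import hashlib
import json
from http.cookies import SimpleCookie

COOKIE_NAME = "lastsearch"  # hypothetical name, not given in the specification

def search_fingerprint(params):
    """Hash the form fields so the cookie value needs no quoting."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_repeat(cookie_header, params):
    """Return (is_repeat, header_line) for the current search.

    `cookie_header` is the raw cookie string sent by the browser and
    `header_line` is the Set-Cookie line to print before the page body.
    """
    fingerprint = search_fingerprint(params)
    previous = SimpleCookie()
    previous.load(cookie_header or "")
    is_repeat = (COOKIE_NAME in previous
                 and previous[COOKIE_NAME].value == fingerprint)
    fresh = SimpleCookie()
    fresh[COOKIE_NAME] = fingerprint
    return is_repeat, fresh.output(header="Set-Cookie:")

params = {"word": "+jury -fraud", "scope": "sentence"}
first_repeat, header_line = check_repeat("", params)
# Simulate the browser sending the cookie back on the next request.
sent_back = header_line.split(": ", 1)[1]
second_repeat, _ = check_repeat(sent_back, params)
```

When is_repeat is true, the results page would start with the message "Same search as last time. Try something else!".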

In addition to the search pages, your application should also have

  • 5) A "credits page" which explains the contribution of each team member. If you used any resources for the development of your application other than the materials from the lecture notes and exercises in the practicals, you must list these sources. This includes any HTML, JavaScript or Perl code which you may have downloaded from the web.
  • 6) A page with some on-line documentation, which includes a schematic diagram of your application and describes in broad terms how your application works. This documentation should be written in a technical style, similar to commercial product documentation. You should not post this page on the web before the deadline.

Documents to be handed in

At the final deadline you will submit
  • the URL of your application and
  • the names and matric numbers of your team members on a webpage which will be linked here.

In addition you will hand in

  • a printout of the source code of your application (only the Perl/CGI code, not the HTML pages). On this printout you should highlight with a pen the security subroutine of your application and any other pieces of the code that implement security measures;
  • a printout of your on-line documentation.

Safeguarding your work

It is your responsibility to ensure that your files are read-protected from others. You should not leave any printouts of your code on campus, not even in the rubbish bins. You should make regular backup copies of your code, for example, by storing the files on a floppy via your I-drive.


If you submit your coursework late (between 1 and 7 days after the deadline), the mark will be capped at 40%. Coursework submitted more than 7 days after the deadline will be marked as "fail".

Normally, all team members will receive the same mark for the coursework. A student will receive a lower mark than the other team members in two cases: if the credits page of the coursework submission shows that the student did not contribute at all or contributed only minor details, or if it is brought to the module leader's attention during the semester that the student repeatedly misses team meetings and does not respond to email. (A student with mitigating circumstances that prevent attending meetings or responding to email should contact the module leader as soon as possible.)