He who refuses to do arithmetic is doomed to talk nonsense -- John McCarthy

Sundry Issues in Perl CGI Programming

Perl Programming "Style"

At this point in the subject, many of you are working on your assignments. Here are some issues which you should think about:

DO NOT Use Gratuitous Indentation

In the laboratory sessions we get to see how students "lay out" their programs (whether in Perl or any other language) and one of the things we commonly see is code at the "top level" of a program indented from the left margin. Please do not do this.

You should only use indentation to indicate program structure. You should always indicate, using indentation, statements which are "inside" the block of a control structure such as "if", "while" or a subroutine definition. The "standard" Perl indentation increment is 4 spaces.

Learn and Use Perl Idioms

It is important, for program readability, that you learn and use standard layouts for your code -- the "Perl way of doing things."

  • The best way to learn this stuff is to read lots of Perl code that's been written by experts.

  • The Perl convention is to use lowercase for ordinary variables, subroutines and filenames, UPPERCASE for constants and globals and mixed case for module names. Words within a variable name are separated by underscores.

  • Opening curly braces in Perl are on the same line as the keyword, and the closing curly brace of a multi-line BLOCK should line up with the keyword that started the construct.
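As a small illustration of the conventions above, here is a sketch of a short program laid out in this style. The subroutine, the variable names and the word list are invented for the example.

```perl
use strict;
use warnings;

# Constants and globals in UPPERCASE.
my $MIN_LENGTH = 5;

# Ordinary variables and subroutines in lowercase, words joined by
# underscores. The opening brace sits on the keyword's line and each
# closing brace lines up with the keyword that started the construct.
sub count_long_words {
    my ($min_length, @words) = @_;
    my $long_count = 0;
    foreach my $word (@words) {
        if (length($word) >= $min_length) {
            $long_count++;
        }
    }
    return $long_count;
}

# Top-level code starts at the left margin.
my @word_list = ("perl", "cgi", "sendmail", "idiom");
my $long_words = count_long_words($MIN_LENGTH, @word_list);
print "Long words: $long_words\n";
```

Only the statements inside the subroutine, the loop and the "if" are indented, each by one further step of 4 spaces.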

Read "man perlstyle"

The definitive reference to good coding style. Also on the web at http://perldoc.perl.org/perlstyle.html

Perl Variables and "strict"

It's possible, permissible and probably recommended practice in Perl programming to declare variables before their first use, in much the same way as you will have done in stricter languages such as Java, C and C++. Use the my function, as in:

my $form_name;
my $form_age = 0;

Note that if more than one variable is given for each my, the list must be enclosed in parentheses. The my function can be used to declare private variables in Perl subroutines -- variables whose scope is the subroutine. See also local (see man perlsub for details), which takes a different approach: it gives a global variable a temporary value for the duration of the enclosing block.

You should probably also include, near the top of your program:

use strict;

This will pick up any (presumably accidental) attempt to use undeclared variables in your program. For this subject we do not insist on pre-declared variables -- in most CGIs the type and/or use of variables is pretty obvious from their context. However, if you ever have to write Perl code in the Real World™, it's not a bad habit to get into.
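The following sketch pulls these points together: single and list declarations with my, a private variable inside a subroutine, and what "use strict" does with a misspelt name. The names and values are invented, and the string eval is there only so that the failure can be shown without stopping the program.

```perl
use strict;
use warnings;

my $form_name;                                        # declared, value undef
my $form_age = 0;                                     # declared and initialised
my ($first_name, $last_name) = ("Ada", "Lovelace");   # a list needs parentheses

sub full_name {
    my ($first, $last) = @_;    # private to this subroutine
    my $joined = "$first $last";
    return $joined;
}

my $name = full_name($first_name, $last_name);
print "$name\n";

# With "use strict" in force, a misspelt (and so undeclared) variable is a
# compile-time error. Compiling the offending line inside a string eval
# lets us catch the error and carry on.
my $ok = eval 'use strict; $form_nmae = 1; 1';
print "strict caught: $@" unless $ok;
```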

Issues in Server-side Programming Style

The following are some issues for consideration when designing medium-to-large scale CGI systems.

Separate HTML Generation from Algorithms

You should attempt to separate those sections of your programs which generate "static" HTML from those where the output is being "dynamically" generated in response to data. The easiest way to do this is to use lots of subroutines (or even modules), even where there's a single "flow-of-execution" through your program. In the Java world they call this separating HTML generation from "Business Logic™" :-).
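One possible way of arranging this is sketched below. The subroutine names and the "order total" calculation are invented for the example; the point is only that each subroutine does one kind of job.

```perl
use strict;
use warnings;

# "Static" HTML: fixed text only, no decisions.
sub page_header {
    my ($title) = @_;
    return "<html><head><title>$title</title></head><body>\n<h1>$title</h1>\n";
}

sub page_footer {
    return "</body></html>\n";
}

# The algorithm: works on data and knows nothing about HTML.
sub order_total {
    my (@prices) = @_;
    my $total = 0;
    $total += $_ foreach @prices;
    return $total;
}

# "Dynamic" HTML: turns a computed result into markup.
sub total_paragraph {
    my ($total) = @_;
    return sprintf("<p>Your total is \$%.2f</p>\n", $total);
}

my $total = order_total(9.50, 20.25, 0.25);
my $page  = page_header("Order summary") . total_paragraph($total) . page_footer();
print "Content-type: text/html\n\n", $page;
```

With this layout the page design can be changed without touching order_total(), and the calculation can be tested from the command line without looking at any HTML.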

Separate Configuration from Code

Setup and/or configuration should be managed by setting global variables, usually in a separate "config.pl" file which is included using Perl's require function.
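Here is a minimal sketch of the idea. To keep the example self-contained it writes a small config.pl into a temporary directory before loading it; in a real system config.pl would simply be a file kept alongside the program. The variable names and values are invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a hypothetical config.pl so that this demonstration can run anywhere.
my $config_dir  = tempdir(CLEANUP => 1);
my $config_file = "$config_dir/config.pl";
open my $out, '>', $config_file or die "Can't write $config_file: $!";
print {$out} <<'END_CONFIG';
# config.pl -- site-specific settings (illustrative names and values)
our $SITE_NAME = "Feedback System";
our $DATA_DIR  = "/tmp/feedback-data";
1;   # a required file must finish with a true value
END_CONFIG
close $out or die "Can't close $config_file: $!";

# The main program declares the globals it expects, then pulls in the file.
our ($SITE_NAME, $DATA_DIR);
require $config_file;

print "Running $SITE_NAME with data in $DATA_DIR\n";
```

Moving the system to another machine then means editing config.pl only.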

Separate Storage from Algorithm Code

Where a system stores and reads data from disk, it makes sense to put the code for this in a separate subroutine/module. For example, a simple system may use plain (flat) textfiles, but you might want the option of using a DBMS later -- this would be very difficult if the data management code was spread throughout the program code.
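A sketch of what this might look like with a flat textfile follows. The record layout (one "name|message" line per record) and the subroutine names are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Storage layer: the only code which knows that records live in a flat
# textfile, one "name|message" record per line.
sub save_records {
    my ($file, @records) = @_;
    open my $out, '>', $file or die "Can't write $file: $!";
    foreach my $record (@records) {
        print {$out} "$record->{name}|$record->{message}\n";
    }
    close $out or die "Can't close $file: $!";
}

sub load_records {
    my ($file) = @_;
    open my $in, '<', $file or die "Can't read $file: $!";
    my @records;
    while (my $line = <$in>) {
        chomp $line;
        my ($name, $message) = split /\|/, $line, 2;
        push @records, { name => $name, message => $message };
    }
    close $in;
    return @records;
}

# Algorithm code: works only with the list of records.
my (undef, $data_file) = tempfile(UNLINK => 1);
save_records($data_file,
    { name => "alice", message => "Nice site" },
    { name => "bob",   message => "Broken link on page 2" },
);
my @records = load_records($data_file);
print scalar(@records), " records; first is from $records[0]{name}\n";
```

If the data later moves into a DBMS, only save_records() and load_records() need rewriting.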

Always use relative URLs

Where your system refers to other elements on the same server (eg. images), relative URLs allow your system to be easily moved from a test location on the server to a production area.

Testing vis-a-vis Production Code

Simple CGI.pm-based Perl programs can easily be tested from the command line, as we have seen. In general, you must first ensure your program runs correctly at the command line before you "install" it as a working CGI.

One important distinction between command line testing of Perl programs and production use relates to the -w option. When a program is run from the command line, this produces all sorts of warnings, very useful to the developer who bothers to read them. The warnings are written to the program's STDERR stream, so they don't get in the way of ordinary (STDOUT) output. If the program is run as a CGI with the -w option still in place, the STDERR stream is normally redirected to the web server's error log, and that's where all of those warnings end up -- clogging the logs. Unless this is what you want, you should remove the -w option before running your Perl programs as CGIs.

The same is true of Perl's die function. If a CGI program executes this, the (human) web user simply sees a "500 Internal Server Error" message, while the full error message goes to the system logfiles. To show the error message in the browser as well, add the following line after the "use CGI" line of your program:

use CGI::Carp qw( fatalsToBrowser );

This won't stop the error being logged in the server logfiles, and it will actually add some extra information about the error. You can alternatively (or additionally) log errors to a private log file instead of the system logs: see "perldoc CGI::Carp" for more information. Another alternative is to use a "custom written" error handler instead of die, maybe one which tells the web user all about what the error was -- as if they want to know...:-)
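A "custom written" error handler could be as simple as the sketch below. The subroutine names and the wording of the page are invented, and nothing here depends on CGI::Carp.

```perl
use strict;
use warnings;

# Build a complete CGI response containing a polite error page.
sub error_page {
    my ($public_message) = @_;
    return "Content-type: text/html\n\n"
         . "<html><head><title>Sorry</title></head><body>\n"
         . "<h1>Something went wrong</h1>\n"
         . "<p>$public_message</p>\n"
         . "</body></html>\n";
}

# Hypothetical replacement for die: the full detail goes to STDERR (and so
# to the logs), while the browser gets only the public message.
sub fatal_error {
    my ($public_message, $private_detail) = @_;
    print STDERR "[", scalar(localtime), "] $0: $private_detail\n";
    print error_page($public_message);
    exit 0;    # the error has been reported, so finish normally
}

# In a CGI it would be used in place of die, for example:
#
#   open(my $in, '<', $data_file)
#       or fatal_error("Please try again later.", "Can't open $data_file: $!");

my $page = error_page("We could not read our data file. Please try again later.");
print $page;
```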

Digression: Generating Email from CGI Programs

Writing CGI programs which generate email messages is surprisingly complex and can create significant security vulnerabilities. Issues include:

  • Virtually all Perl programs generate email by calling an external program. This means that the Unix shell is potentially involved in processing the command-line parameters passed to the external program -- always a potential security hole if the address is not carefully checked for validity.

  • Any system which allows a remote user to generate an email message can potentially be used by Bad Guys to generate unsolicited email messages which could be sent to third parties -- ie. spam -- such systems should be very carefully written!

  • When a user types their own email address into a form, there's no way of knowing if it's their email address or someone else's, or even completely fictitious. If you're intending to use the supplied address for any important purpose, your system should be built to send a test message to the address, perhaps containing a PIN or a password which can subsequently be used to validate the address via a web/CGI interface. There are some interesting session management issues here -- perhaps see assignment 2...
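The PIN idea in the last point might be sketched as follows. Everything here is an assumption made for illustration: the subroutine names, the six-digit format, and the in-memory hash, which a real system would replace with storage that survives between CGI requests.

```perl
use strict;
use warnings;

# Pending confirmations, keyed by address.
my %pending_pin;

# Issue a six-digit PIN for an address. Perl's built-in rand() is used only
# to keep the sketch short; it is not suitable for security-critical secrets.
sub issue_pin {
    my ($address) = @_;
    my $pin = sprintf("%06d", int(rand(1_000_000)));
    $pending_pin{$address} = $pin;
    return $pin;    # this is what would be emailed to $address
}

# Check a PIN typed into the follow-up form. A correct PIN is used up,
# so it cannot be replayed.
sub confirm_pin {
    my ($address, $pin) = @_;
    my $expected = $pending_pin{$address};
    return 0 unless defined($expected) && $pin eq $expected;
    delete $pending_pin{$address};
    return 1;
}

my $pin = issue_pin('someone@example.com');
print "Wrong PIN accepted? ", confirm_pin('someone@example.com', "x$pin"), "\n";
print "Right PIN accepted? ", confirm_pin('someone@example.com', $pin), "\n";
```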

Validating Email Addresses

We briefly touched on this topic way back in tutorial 3. It's actually a surprisingly complex issue. In fact, valid email addresses can be much more complex than the standard "user@domain.name.tld" format that we're accustomed to seeing, and a regular expression which checks an address for conformance with the official (RFC 5322) standard is very complicated (see section 3.4 of the RFC).

Fortunately, perhaps, the world is a much simpler place than it was when RFC 822 was written. Nowadays we can probably safely limit email addresses to the "usual" format given above: that is, we can use a regular expression such as the one shown in the following Perl code snippet:

$email = $q->param("email");

if( $email =~ /^([\w\-._]+\@[\w\-.]+)$/ ) {
    print "Valid email address: $1\n";
} else {
    print "Invalid email address\n";
}

This could easily be extended to check (for example) that the "Top Level Domain" (ie. what's after the last dot in the address) is a two, three or four letter sequence. Be careful with tightening the RE though -- it's already too strict to accept the full range of valid email addresses. There is, also, a serious privacy issue in requesting email addresses -- do you need the information? If not, it's important that your programs are not fascist about email addresses!
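One way of packaging the check as a subroutine, with the Top Level Domain test added, is sketched below. The name validate_address and its return convention (the matched address, or undef) are assumptions made for this example.

```perl
use strict;
use warnings;

# Return the address if it has the "usual" user@domain.name.tld shape with
# a two to four letter Top Level Domain, or undef otherwise. The returned
# value comes from the capture $1, so it is also untainted under -T.
sub validate_address {
    my ($email) = @_;
    return undef unless defined $email;
    if ($email =~ /^([\w\-.]+\@(?:[\w\-]+\.)+[A-Za-z]{2,4})$/) {
        return $1;
    }
    return undef;
}

foreach my $candidate ('fred@example.com',
                       'fred@example',
                       'fred@example.com; rm -rf /',
                       'curator@example.museum') {
    my $clean = validate_address($candidate);
    if (defined $clean) {
        print "Valid email address: $clean\n";
    } else {
        print "Invalid email address: $candidate\n";
    }
}
```

The last candidate is rejected only because its Top Level Domain has six letters, which shows how easily a tightened RE turns away perfectly good addresses.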

Delivering Email Using Sendmail

The Unix sendmail program is the de-facto standard for delivering email generated by other programs. It handles most of the complexity in mail delivery and is very simple to use (by other programs, not by humans). Here's an example of a CGI program which sends a message (entered in a form) to the webmaster:

#!/usr/bin/perl -T
use CGI;

# Taint mode (-T) refuses to run external programs until the
# environment has been made safe.
$ENV{PATH} = "/usr/bin:/usr/sbin";
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

$q = new CGI;

$message = $q->param("message");
$email = $q->param("email");
&validate_address($email); # See above

open MAILER, "|/usr/sbin/sendmail -t"
              or die "Can't open sendmail: $!";
print MAILER <<"EOM";
To: webmaster\@mysite.com
Reply-to: $email
Subject: Website feedback

$message
EOM
close MAILER or die "Can't close sendmail: $!";

print $q->redirect("~me/thanks.html");

Note that we open the sendmail program with a Unix pipe -- that is, sendmail reads the message on its STDIN stream. The -t command line option tells sendmail to scan the message for valid To:, Cc: and Bcc: headers. It's possible to put these on the command line, however this potentially exposes them to the Unix shell, so it's considered better to use the -t option. Note also the use of the redirect() method to send the user to a "Thank You" page.
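Because sendmail -t takes its recipients from the message headers, any value copied into a header must not be able to smuggle in extra header lines. The sketch below shows one way to guard against that, and also uses the list form of open, which runs sendmail directly without involving a shell. The subroutine names are invented, and send_message() is defined but not called, since it needs a working sendmail at the assumed path.

```perl
use strict;
use warnings;

# Header values must stay on one line, so replace any CR or LF characters.
sub header_safe {
    my ($value) = @_;
    $value =~ s/[\r\n]+/ /g;
    return $value;
}

# Build the complete message text: headers, a blank line, then the body.
sub build_message {
    my ($reply_to, $body) = @_;
    return "To: webmaster\@mysite.com\n"
         . "Reply-to: " . header_safe($reply_to) . "\n"
         . "Subject: Website feedback\n"
         . "\n"
         . $body . "\n";
}

# Hand the message to sendmail on its STDIN, with no shell in between.
sub send_message {
    my ($message) = @_;
    open my $mailer, '|-', '/usr/sbin/sendmail', '-t'
        or die "Can't open sendmail: $!";
    print {$mailer} $message;
    close $mailer or die "Can't close sendmail: $!";
}

# A hostile "address" which tries to add a Bcc: header of its own.
my $message = build_message("fred\@example.com\nBcc: victim\@example.org",
                            "Great site!");
print $message;
```

In the printed message the attempted Bcc: ends up as harmless text on the Reply-to line. A proper address check should still be applied first; this is a second line of defence.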

More Complex Mail Delivery

There are a variety of Perl modules which can assist in sending email messages from a Perl program. Why are they needed?

  • You may be using a system which doesn't have sendmail (or any close equivalent), such as a Windows machine. In this case, your CGI will have to use the SMTP (Mail Delivery) protocol directly -- however, you don't want to have to code this yourself! Mail::Mailer is the solution to this problem.

  • You may wish to send email to a very large number of recipients. It's obviously possible to iterate through a list of addresses, running code such as that given in the previous slide, however this would be a very resource-intensive approach. A better solution is to use an existing module which has a clean (simple) interface and which has been carefully tuned for maximum efficiency. An example is the Mail::Bulkmail module. This is a very powerful package, handling not only large mailouts but also HTML attachments, complex "on-the-fly" mail merging and lots more besides.

  • You want to send a message containing an attachment -- for example, a JPEG image, but basically some arbitrary MIME-type. One way to do it is via an external Unix program such as mutt which will do the hard work for you:

    $mutt = "/usr/local/bin/mutt";
    $recipient = "bloggs\@here.com";
    $file = "mypic.jpeg";
    
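    # mutt reads only the message body from STDIN, so the subject is
    # given with -s; "--" separates the attachment from the recipient.
    open MAILER, "|$mutt -s \"Here's your photo\" -a $file -- $recipient"
                  or die "Can't open mutt: $!";
    print MAILER <<"EOM";
    It's a JPEG attachment to this email.
    EOM
    close MAILER or die "Can't close mutt: $!";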
    

    The -a command line option adds an attachment to the email message. This approach is considered to be very dangerous because the $recipient is potentially exposed to the Unix shell -- for example, if a Bad Guy somehow created a $recipient containing something like: "j.sing@latrobe.edu.au ; cd / ; rm -rf" then you could, under some circumstances, be in some bother... The lesson: make sure it's properly untainted!

    Other options include the Mail::Sender and MIME::Lite modules.
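When an external program has to be run, the shell can be kept out of the picture altogether by using the list form of open. The sketch below demonstrates the effect with a stand-in command (a one-line Perl program which prints its arguments) in place of mutt, so that it can run anywhere; the same form could be used with $mutt and its options as separate list elements.

```perl
use strict;
use warnings;

# The hostile recipient from the example above.
my $recipient = 'j.sing@latrobe.edu.au ; cd / ; rm -rf';

# Stand-in for mutt: prints each argument it receives inside brackets.
# $^X is the path of the running perl; "--" ends perl's own options.
my @command = ($^X, '-e', 'print "[", join("][", @ARGV), "]"', '--');

# List form of open: each element reaches the program as one argument
# and no shell ever sees the string.
open my $child, '-|', @command, $recipient
    or die "Can't run command: $!";
my $seen = <$child>;
close $child;

print "Program received: $seen\n";
```

The whole string arrives as a single argument, so the "; cd / ; rm -rf" part is never executed. It is still not a valid address, of course, so checking and untainting remain necessary.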

Revealing Email Addresses to Clients

By the way: you should probably consider it very bad practice to actually include an email address in an HTML form, as in:

<form action="http://some.domain/mycgi.cgi" method="post">
<textarea name="message" cols="60" rows="40" wrap="hard">
Enter your message here
</textarea>
<input type="hidden" name="addr" value="support@somewhere.com">
<input type="submit" value="Send message">
</form>

Exercise: what's wrong with this? Suggest several better ways of doing it.

Maintaining Your Programs using Version Control

Whenever you are building a system which proceeds through multiple versions, it's worth considering automating version (or revision) control. The simplest system for this is RCS, the Revision Control System. To use RCS, simply create a directory called RCS (note uppercase) off the directory where you are developing your programs. In general, this will NOT be the same as the "production" directory -- where your programs actually run.

To "place a file under RCS", use the command "ci myfile.pl". This "checks in" your program by copying it to the RCS directory, adding some RCS information to the file and deleting the original. To get the file back (eg. to work on it some more, thus creating a new version), use "co myfile.pl" to "check out" the file from RCS. Your production code will usually be the latest "checked out" version. When you then "check in" your modified code, RCS will automatically update a version number for the file.

More on Version Control

If more than one person is working on a project, the files should be checked out using "co -l myfile.pl". The -l option "locks" the file so that no one else can check it out whilst you're working on it.

RCS has lots of very clever features. For example, version numbers usually start at 1.1, with the second number incremented with each new "check in", as in 1.2, then 1.3, etc. You can update this sequence to a new "release" by "ci -r2 myfile.pl", which will make the new version 2.1. You can also "fork" the development by doing (eg.) "ci -r1.3.1 myfile.pl", which will create a new version 1.3.1.1 of the file.

A program to be brought under the control of RCS should always have a "comment" string containing the text string $Id$ somewhere near the top of the file. RCS will automatically substitute this with version and other useful information.
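In a Perl program the keyword normally sits in a comment near the top. It can also be placed in a string, so that the program can report its own version. In the sketch below the string imitates what an expanded keyword looks like; the file name, date and author in it are invented.

```perl
use strict;
use warnings;

# $Id$
#
# After a check-in and check-out, RCS expands the keyword above in place.
# The string below imitates an expanded keyword (invented values).
my $rcs_id = '$Id: myfile.pl,v 1.3 2010/05/01 12:00:00 me Exp $';

# Pull the revision number out of the expanded string.
my ($revision) = $rcs_id =~ /^\$Id: \S+ (\d+(?:\.\d+)+) /;
$revision = "unknown" unless defined $revision;

print "This is revision $revision\n";
```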

For projects where the developers are working on other systems (eg. their own desktop machines), instead of a single system as assumed by RCS, the Concurrent Versions System (CVS) provides a greatly enhanced version control facility. CVS is, however, considerably more complex than RCS. Newer alternatives to CVS include Subversion (SVN) and Git.