A Beginner's Introduction to Perl Web Programming
By chromatic
September 5, 2008
So far, this series has talked about Perl as a language for mangling numbers,
strings, and files -- the original purpose of the language. The earlier installments
(A Beginner's Introduction to Perl 5.10, A Beginner's Introduction to Files and
Strings with Perl 5.10, and A Beginner's Introduction to Perl Regular Expressions)
covered flow control, math and string operations, and files. Now it's time to talk
about what Perl does on the Web. This installment discusses CGI programming with Perl.
What is CGI?
The Web uses a client-server model: your browser (the client) makes requests of a
Web server. Most of these are simple requests for documents or images, which the
server delivers to the browser for display.
Sometimes you want the server to do more than just dump the contents of a file.
You'd like to do something with a server-side program -- whether that "something" is
reading and sending e-mail, looking up a phone number in a database, or ordering a
copy of Perl Best Practices for your favorite techie. This means the browser must
be able to send information (an e-mail address, a name to look up, shipping
information for a book) to the server, and the server must be able to use that
information and return the results to the user.
The standard for communication between a user's Web browser and a server-side
program running on the Web server is called CGI, or Common Gateway Interface. All
popular web server software supports it. To get the most out of this article, you will
need to have a server that supports CGI. This may be a server running on your
desktop machine or an account with your ISP (though probably not a free Web-page
service). If you don't know whether you have CGI capabilities, ask your ISP or a local
sysadmin how to set things up.
Notice that I haven't described how CGI works; that's because you don't need to
know. The standard Perl module CGI handles the protocol for you. This module
shipped with the core Perl distribution through Perl 5.20; for newer versions of Perl, install it from CPAN.
Telling your CGI program that you want to use the CGI module is as simple as:
use CGI;
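If you're curious what the module is hiding, here is a rough sketch of the GET half of the protocol done by hand. It assumes the web server puts the form data in the QUERY_STRING environment variable; parse_query() is a made-up helper for illustration, not part of the CGI module, and the sample query string is invented so that the sketch also runs from the command line.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper (not part of the CGI module): decode a URL-encoded
# query string into a hash of name => [ values ].
sub parse_query {
    my ($query) = @_;
    my %form;

    for my $pair ( split /[&;]/, $query // '' ) {
        my ( $name, $value ) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value //= '';

        for ( $name, $value ) {
            tr/+/ /;                              # "+" stands for a space
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # "%XX" stands for one byte
        }
        push @{ $form{$name} }, $value;
    }
    return \%form;
}

# For a GET request the web server passes the form data in QUERY_STRING.
# The fallback string is sample data so the sketch also runs from a shell.
my $form = parse_query( $ENV{QUERY_STRING} // 'favcolor=burnt+sienna&top=ham' );

# A CGI response is a header block, a blank line, and then the body.
print "Content-Type: text/plain\r\n\r\n";
for my $name ( sort keys %$form ) {
    say "$name: ", join ', ', @{ $form->{$name} };
}
```

The real module also handles POST bodies, file uploads, and character sets, which is why the rest of this article leaves the job to it.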
CGI versus Everything Else
You may have heard that "CGI is slow" or "Perl is slow" for web programming. (A similar
assertion is "Perl doesn't scale".) While CGI technically describes how server-side languages
can send and receive information to and from clients, people often mean that the execution
model associated with standalone CGI programs can be slow. Traditionally, a web server
launches a new process to handle CGI requests. This often means loading Perl and
recompiling the program for each incoming request.
Though this may take fractions of a second, if you have hundreds of thousands of
requests a day (or hundreds of requests within the span of a few minutes), you may
notice that the overhead of launching new processes is significant. Other execution
models exist, from embedding Perl in the web server (mod_perl) to running your
Perl program as a persistent application and talking to it through another protocol
(FastCGI).
CGI programming is still worth your time learning for two reasons. First,
understanding the web's model of client-server programming and the way Perl fits
into the model is important to all models of web programming with Perl. Second,
persistence or acceleration models can be more complex in some ways -- and it's
unlikely that your first few server-side Perl programs will need the advanced features of
the other execution models.
A Real CGI Program
It's time to write your first real CGI program. Instead of doing something complex,
how about something that will simply echo back whatever you throw at it? Call this
program backatcha.cgi:
    #!/usr/bin/perl -T
    use 5.010;
    use CGI;
    use strict;
    use warnings;

    my $q = CGI->new();
    say $q->header(), $q->start_html();
    say "<h1>Parameters</h1>";

    for my $param ($q->param()) {
        my $safe_param = $q->escapeHTML($param);
        say "<p><strong>$safe_param</strong>: ";

        for my $value ($q->param($param)) {
            say $q->escapeHTML($value);
        }
        say '</p>';
    }

    say $q->end_html();
Some of this syntax may look new to you: in particular, the arrow operator (->).
When used here, it represents a method call on an object. Object-oriented
programming can be a deep subject, but using objects and methods is relatively
simple.
An object (contained in $q in this example, and returned from CGI->new()) is a self-
contained bundle of data and behavior. Think of it like a black box, or a little chunk of
a program. You communicate with that object by sending it messages with the ->
operator. Messages work a lot like functions: they have names, they can take
arguments, and they can return values. (In fact, their definitions look almost identical
to Perl functions. They have two subtle differences, which is why they have a
different name: methods. Calling a method and sending a message are basically the
same thing.) Thus:
$q->header()
... sends the header() message to the CGI object in $q, which performs some
behavior and returns a string. (In this case, a valid HTTP header per the CGI
protocol.) Later in the program, the $q->param() and $q->param( $param )
messages appear. By now, you should be able to guess at what they return, even if
you don't know how they work or why.
If you've paid close attention, you may have noticed that CGI->new() follows the
same form. In this case, it calls the new() method on something referred to by CGI,
which returns a CGI object. This explanation is deliberately vague, because there's a
little more to it than that, but for now all you need to know is that you can send
messages to $q named as methods in the CGI documentation.
If you've never used HTML, the pair of <strong> and </strong> tags mean "begin
strong emphasis" and "end strong emphasis", respectively. (A good paper reference
to HTML is O'Reilly's HTML & XHTML: The Definitive Guide, and online, I like the
Web Design Group.)
One method you may not have seen in other tutorials is escapeHTML(). There are a
lot of subtleties to why this is necessary; for now it's enough to say that displaying
anything which comes from a client directly to the screen without escaping,
validation, or other scrubbing represents a very real security hole in your application.
If you start now by thinking that all incoming data needs careful thought and analysis,
you will prevent many unpleasant surprises later.
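To make the risk concrete, here is a small sketch of what escaping does. The escape_html() sub below is a hypothetical stand-in written for this illustration, not the CGI module's own code; it simply replaces the characters that HTML treats specially with their character entities.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical stand-in for CGI's escapeHTML(), shown only to illustrate the idea.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# What a hostile visitor might type into a form field.
my $hostile = '<script>alert("gotcha")</script>';

# Prints: &lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;
say escape_html($hostile);
```

The browser displays the escaped text as plain characters instead of running it as a script.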
Install this program on your server and do a test run. Here's where the real test
starts; understanding how to set up a CGI program on your server can be frustrating.
Here's a short list of the requirements:
• Place the program where your Web server will recognize it as a CGI program.
This may be a special cgi-bin directory. Alternately (or even additionally), make
sure the program's filename ends in .pl or .cgi. If you don't know where to place
the program, your ISP or sysadmin should.
• Make sure the web server can run the program. If you are using a Unix system,
you may have to give the Web server user read and execute permission for the
program. It's easiest to give these permissions to everybody by using chmod
755 filename.
• Make a note of the program's URL (which will probably be something like
http://server name/cgi-bin/backatcha.cgi) and go to that URL in your browser.
(Take a guess what you should do if you don't know what the URL of the program is. Hint: It
involves the words "ask," "your" and "ISP.")
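If you'd rather check the permissions from Perl than from the shell, something like the following sketch may help. It assumes a Unix-like system and, so that it can run anywhere, works on a throwaway temporary file standing in for backatcha.cgi; point $program at your real file instead.

```perl
use 5.010;
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file stands in for backatcha.cgi in this sketch.
my ( $fh, $program ) = tempfile( SUFFIX => '.cgi', UNLINK => 1 );
print {$fh} "#!/usr/bin/perl -T\n";
close $fh;

# Same effect as "chmod 755 filename" at the shell prompt.
chmod 0755, $program or die "Cannot chmod $program: $!";

# Read the permission bits back to confirm the change took effect.
my $mode = ( stat $program )[2] & 07777;
printf "%s has mode %04o\n", $program, $mode;
say "Executable by the current user" if -x $program;
```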
If this works, you will see in your browser only the word "Parameters". Don't worry,
this is what is supposed to happen. The backatcha.cgi program throws back what
you throw at it, and you haven't thrown anything at it yet. It'll show more in a
moment.
If it didn't work, you probably saw either an error message or the source code of the
program. These problems are common, and you need to learn how to solve them.
Uh-Oh!
If you saw an error message, your Web server had a problem running the CGI
program. This may be a problem with the program or the file permissions.
First, are you sure the program has the correct file permissions? Did you set the file
permissions on your program to 755? If not, do it now. (Windows Web servers will
have a different way of doing this.) Try it again; if you see a nearly blank page with
only the word "Parameters" now, you're good.
Second, are you sure the program actually works? (Don't worry, it happens to the
best of us.) Change the use CGI line in the program to read:
use CGI '-debug';
Now run the program from the command line. You should see:
(offline mode: enter name=value pairs on standard input)
This message indicates that you're testing the program. You can now press Ctrl-D to
tell the program to continue running without telling it any form items.
If Perl reports any errors in the program, you can fix them now.
(The -debug option is incredibly useful. Use it whenever you have problems with a
CGI program. Ignore it at your peril.)
The other common problem is that you're seeing the source code of your program,
not the result of running your program. There are two simple problems that can
cause this.
First, are you sure you're going through your Web server? If you use your browser's
"load local file" option (to look at something like /etc/httpd/cgi-bin/backatcha.cgi
instead of something like http://localhost/cgi-bin/backatcha.cgi), you aren't even
touching the Web server! Your browser is doing what you "wanted" to do: loading the
contents of a local file and displaying them.
Second, are you sure the Web server knows it's a CGI program? Most web servers
have a special way of designating a file as a CGI program, whether it's a special cgi-
bin directory, the .cgi or .pl extension on a file, or something else. Unless you live up
to these expectations, the Web server will think the program is a text file, and serve
up your program's source code in plaintext form. Ask your ISP for help.
CGI programs are unruly beasts at the best of times; don't worry if it takes a bit of
work to make them run properly.
If you're still having problems with errors, consult your server's error log. On Unix-like
systems, with Apache httpd, look for a file called error_log.
If you don't have access to this file (or can't find it), add one more line to the start of
your program:
use CGI::Carp 'fatalsToBrowser';
This module, which ships alongside CGI, sends fatal error messages to the client as well as to the error log, so
that they'll appear in your web browser where you can read them. As you might
expect, this is suboptimal behavior when running a serious, public-facing application.
It's fine for debugging -- just be sure to remove it when your application goes live.
Making the Form Talk Back
At this point, you should have a working copy of backatcha.cgi spitting out nearly-
blank pages. Want it to tell you something? Save this HTML code to a file:
    <form action="putyourURLhere" method="GET">
      <p>What is your favorite color?
      <input name="favcolor" /></p>
      <input type="submit" value="Send form" />
    </form>
Be sure to replace putyourURLhere with the actual URL of your copy of
backatcha.cgi!
This is a simple form. It will show a text box where you can enter your favorite color
and a "submit" button that sends your information to the server. Load this form in
your browser and submit a favorite color. You should see this returned from the
server:
favcolor: green
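Because the form uses the GET method, the browser sends the form item as part of the URL. The following sketch builds such a URL by hand. The url_encode() sub is a hypothetical helper, the base URL is only a sample, and the sketch assumes plain ASCII input. Browsers usually write a space in form data as "+", while this sketch uses "%20"; CGI programs accept both.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except unreserved characters.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9._~-])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# Sample values; replace the URL with the location of your own program.
my $base   = 'http://localhost/cgi-bin/backatcha.cgi';
my %fields = ( favcolor => 'burnt sienna' );

# Join name=value pairs with "&" to form the query string.
my $query = join '&',
    map { url_encode($_) . '=' . url_encode( $fields{$_} ) } sort keys %fields;

# Prints: http://localhost/cgi-bin/backatcha.cgi?favcolor=burnt%20sienna
say "$base?$query";
```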
CGI Methods
The CGI module provides several methods to CGI objects, as mentioned earlier.
What are these methods?
The first one, header(), produces the necessary HTTP headers before the program
can display HTML output. Try taking this line out; you'll get an error from the Web
server when you try to run it. This is another common source of bugs!
The start_html() method is there for convenience. It returns a simple HTML header
for you. You can pass parameters to it by using a hash, like this:
print $q->start_html( -title => "My document" );
(The end_html() method is similar, but outputs the footers for your page.)
Finally, the most important CGI method is param(). Call it with the name of a form
item, and you'll get a list of all the values of that form item. (If you ask for a scalar,
you'll only get the first value, no matter how many there are in the list.)
my $name = $q->escapeHTML( $q->param('firstname') );
say "<p>Hi, $name!</p>";
If you call param() without giving it the name of a form item, it will return a list of all
the form items that are available. This form of param() is the core of the backatcha
program:
    for my $param ($q->param()) {
        for my $value ($q->param($param)) {
            say $q->escapeHTML($value);
        }
    }
Remember, a single form item can have more than one value. You might encounter
code like this on the Web site of a pizza place that takes orders over the Web:
    <p>Pick your toppings!<br />
    <input type="checkbox" name="top" value="pepperoni" /> Pepperoni <br />
    <input type="checkbox" name="top" value="mushrooms" /> Mushrooms <br />
    <input type="checkbox" name="top" value="ham" /> Ham <br />
    </p>
Someone who wants all three toppings would submit a form where the form item top
has three values: pepperoni, mushrooms, and ham. The server-side code might
include:
    say "<p>You asked for the following pizza toppings: ";
    for my $top ($q->param( 'top' )) {
        say $q->escapeHTML($top), '. ';
    }
    say "</p>";
Here's something to watch out for. Take another look at the pizza-topping HTML
code. Try pasting that little fragment into the backatcha form, just above the <input
type="submit"...> tag. Enter a favorite color, and check all three toppings. You'll see
this:
favcolor: burnt sienna
top: pepperoni mushrooms ham
Why did this happen? When you call $q->param('name') in list context, you get back
a list of all of the values for that form item. (Why list context? Because the call
supplies the list for the for loop.) The loop prints each value in turn with nothing
but whitespace between them. This could be a bug in the
backatcha.cgi program, but it's easy to fix by using join() to separate the item values:
    say "<p><strong>$safe_param</strong>: ",
        join(', ', map { $q->escapeHTML( $_ ) } $q->param($param)),
        "</p>";
... or call $q->param() in a scalar context first to get only the first value:
    my $value = $q->param($param);
    say "<p><strong>$safe_param</strong>: ", $q->escapeHTML($value), "</p>";
Always keep in mind that form items can have more than one value!
Okay, I lied about the list form being easy. Your eyes may have crossed as you
wonder what exactly that map block does, and why I made you read it. This is
actually a great time to discuss a very clever and useful part of Perl.
Remember how that code exists to handle a list of values? I explained earlier that
the param() method returns a list of values when you want a list of values, and a
single value when you want a single value. This notion of context is pervasive in
Perl. It may sound like a strange notion, but think of it linguistically in terms of
noun-verb number agreement. That is, it's obvious what's wrong with this sentence: "Perl
are a nice language!" The subject, Perl, is singular and so the verb, to be, should
also be singular. Getting to know Perl and its contexts means understanding which
contexts are list contexts (plural) and which contexts are scalar contexts (singular).
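You can watch context at work with a sub of your own. In this sketch, param_sketch() is a made-up imitation of param() that uses Perl's built-in wantarray to find out which context its caller wants; the form data is hard-coded sample data.

```perl
use 5.010;
use strict;
use warnings;

# Sample form data; a real program would get this from the browser.
my %form = ( top => [ 'pepperoni', 'mushrooms', 'ham' ] );

# Hypothetical imitation of param(): the answer depends on the caller's context.
sub param_sketch {
    my ($name) = @_;
    my @values = @{ $form{$name} // [] };
    return wantarray ? @values : $values[0];
}

my @all_toppings  = param_sketch('top');    # list context: every value
my $first_topping = param_sketch('top');    # scalar context: first value only

say "List:   @all_toppings";     # List:   pepperoni mushrooms ham
say "Scalar: $first_topping";    # Scalar: pepperoni
```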
What about that map though? Think of it as a device for transforming one list into
another, sort of a pipeline. You can drop it in anywhere you have a list to perform the
transformation. It's equivalent in behavior to:
    my @params = $q->param( $param );
    my @escaped_params;

    for my $p (@params) {
        push @escaped_params, $q->escapeHTML( $p );
    }

    say "<p><strong>$safe_param</strong>: ", join(', ', @escaped_params), "</p>";
... but it's significantly shorter. You can safely ignore the details of how it works for a
few minutes.
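When you are ready for those details, a sketch with plain strings may help. Inside the block, map sets $_ to each element in turn, and whatever the block returns goes into the new list. The topping names are only sample data.

```perl
use 5.010;
use strict;
use warnings;

my @toppings = ( 'pepperoni', 'mushrooms', 'ham' );

# Each element passes through the block; the results form a new list.
my @shouted = map { uc $_ } @toppings;
my @lengths = map { length $_ } @toppings;

say join ', ', @shouted;    # PEPPERONI, MUSHROOMS, HAM
say join ', ', @lengths;    # 9, 9, 3

# The original list is untouched.
say join ', ', @toppings;   # pepperoni, mushrooms, ham
```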
Your Second Program
Now you know how to build a CGI program, thanks to a simple example. How about
something useful? The previous article showed how to build a pretty good HTTP
log analyzer. Why not Web enable it? This will allow you to look at your usage
figures from anywhere you can get to a browser.
Before starting on the revisions, decide what to do with the analyzer. First, instead of
showing all of the reports generated at once, show only those the user selects.
Second, let the user choose whether each report shows the entire list of items, or the
top 10, 20 or 50 sorted by access count.
The user interface can be a simple form:
    <form action="/cgi-bin/http-report.pl" method="post">
      <p>Select the reports you want to see:</p>
      <p><input type="checkbox" name="report" value="url" />URLs requested<br />
      <input type="checkbox" name="report" value="status" />Status codes<br />
      <input type="checkbox" name="report" value="hour" />Requests by hour<br />
      <input type="checkbox" name="report" value="type" />File types</p>
      <p><select name="number">
        <option value="ALL">Show all</option>
        <option value="10">Show top 10</option>
        <option value="20">Show top 20</option>
        <option value="50">Show top 50</option>
      </select></p>
      <input type="submit" value="Show report" />
    </form>
(Remember that you may need to change the URL!)
This HTML page contains two different types of form item. One is
a series of checkbox widgets, which set values for the form item report. The other is
a single drop-down list which will assign a single value to number: either ALL, 10, 20
or 50.
Take a look at the original HTTP log analyzer. Start with two simple changes. First,
the original program gets the filename of the usage log from a command-line
argument:
# We will use a command line argument to determine the log filename.
my $logfile = shift;
This obviously can't work, because the Web server won't allow anyone to enter a
command line for a CGI program! Instead, hard-code the value of $logfile. I've used
/var/log/httpd/access_log as a sample value.
my $logfile = '/var/log/httpd/access_log';
Second, make sure that you output all the necessary headers to the web server
before printing anything else:
my $q = CGI->new();
say $q->header();
say $q->start_html( -title => "HTTP Log report" );
Now look at the report() sub from the original program. It has one problem, relative to
the new goals: it outputs all the reports instead of only the ones the user
selected. It's time to rewrite report() so that it will cycle through all the values of the
report form item and show the appropriate report for each.
    sub report {
        my $q = shift;

        for my $type ( $q->param('report') ) {
            my @report_args;

            given ($type) {
                when ('url')    { @report_args = ( "URL requests",          %url_requests ) }
                when ('status') { @report_args = ( "Status code requests",  %status_requests ) }
                when ('hour')   { @report_args = ( "Requests by hour",      %hour_requests ) }
                when ('type')   { @report_args = ( "Requests by file type", %type_requests ) }
            }

            report_section( $q, @report_args );
        }
    }
You probably haven't seen given/when before. It works like you might expect from
reading the code out loud. Given a variable or expression, when it's a specific value,
perform the associated action. When the report type is url, produce the "URL
requests" section of the report.
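One caution: given/when has been marked experimental since Perl 5.18 and deprecated in later releases, so newer versions of Perl will warn about it. If that bothers you, a hash lookup can do the same job. The sketch below is one possible alternative, not the code from the log analyzer; the request counts are made-up sample data and report_args_for() is a hypothetical helper.

```perl
use 5.010;
use strict;
use warnings;

# Made-up sample counts standing in for the hashes the log analyzer builds.
my %url_requests    = ( '/index.html' => 12, '/about.html' => 3 );
my %status_requests = ( 200 => 14, 404 => 1 );

# Map each permitted report type to its title and data.
my %report_for = (
    url    => [ 'URL requests',         \%url_requests ],
    status => [ 'Status code requests', \%status_requests ],
);

# Hypothetical helper: return ( title, key => count, ... ) or an empty list.
sub report_args_for {
    my ($type) = @_;
    my $entry = $report_for{$type} or return;
    my ( $title, $counts ) = @$entry;
    return ( $title, %$counts );
}

for my $type (qw( url status bogus )) {
    my @report_args = report_args_for($type);
    say "$type: ", @report_args ? $report_args[0] : '(ignored)';
}
```

As with the given/when version, a report type that isn't in the table produces no arguments at all.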
Finally, rewrite the report_section() sub to output HTML instead of plain text.
    sub report_section {
        my ( $q, $header, %type ) = @_;

        # Nothing to print for an unrecognized report type
        return unless defined $header;

        my @type_keys;

        # Are we sorting by the KEY, or by the NUMBER of accesses?
        if ( $q->param('number') eq 'ALL' ) {
            @type_keys = sort keys %type;
        }
        else {
            my $number = $q->param( 'number' );
            @type_keys = sort { $type{$b} <=> $type{$a} } keys %type;

            # truncate the list if we have too many results
            splice @type_keys, $number if @type_keys > $number;
        }

        # Begin an HTML table
        say "<table>";

        # Print a table row containing a header for the table
        say '<tr><th colspan="2">', $header, '</th></tr>';

        # Print a table row containing each item and its value;
        # the keys come from the log, so escape them before display
        for my $key (@type_keys) {
            say "<tr><td>", $q->escapeHTML($key), "</td><td>", $type{$key},
                "</td></tr>";
        }

        # Finish the table
        say "</table>";
    }
Sorting
Perl allows you to sort lists with the sort keyword. By default, the sort will happen
alphanumerically: numbers before letters, uppercase before lowercase. This is
sufficient 99 percent of the time. The other 1 percent of the time, you can write a
custom sorting routine for Perl to use.
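Here is a short sketch of the default behavior, using a made-up list. Notice that 10 sorts before 9, because the default sort compares the items as strings, character by character.

```perl
use 5.010;
use strict;
use warnings;

# Sample data mixing numbers, uppercase, and lowercase.
my @items = qw( banana Apple 10 9 apple );

# With no sorting routine, sort compares the items as strings.
my @sorted = sort @items;

say "@sorted";    # 10 9 Apple apple banana
```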
This sorting routine is just like a small sub. In it, you compare two special variables,
$a and $b, and return one of three values depending on how you want them to show
up in the list. Returning -1 means "$a should come before $b in the sorted list," 1
means "$b should come before $a in the sorted list" and 0 means "they're equal, so I
don't care which comes first." Perl will run this routine to compare each pair of items
in your list and produce the sorted result.
For example, if you have a hash called %type, here's how you might sort its keys in
descending order of their values in the hash.
    sort {
        return 1 if $type{$b} > $type{$a};
        return -1 if $type{$b} < $type{$a};
        return 0;
    } keys %type;
In fact, numeric sorting happens so often, Perl gives you a convenient shorthand for
it: the <=> (spaceship) operator. This operator will perform the above comparison
between two values for you and return the appropriate value. That means you can
rewrite that test as:
    sort { $type{$b} <=> $type{$a} } keys %type
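To see the spaceship operator in a complete program, here is a sketch that picks the two busiest file types. The counts in %type are invented for the example.

```perl
use 5.010;
use strict;
use warnings;

# Made-up access counts per file type.
my %type = ( html => 40, png => 25, css => 10, js => 5 );

# Highest count first: $b on the left makes the order descending.
my @by_count = sort { $type{$b} <=> $type{$a} } keys %type;

# Keep only the top two entries.
my @top_two = @by_count[ 0 .. 1 ];

say "$_: $type{$_}" for @top_two;
# html: 40
# png: 25
```

Swap $a and $b in the block to sort from the lowest count to the highest instead.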
You can also compare strings with sort. The lt and gt operators are the string
equivalents of < and >, and cmp will perform the same test as <=>. (Remember,
string comparisons will sort numbers before letters and uppercase before
lowercase.)
For example, you have a list of names and phone numbers in the format "John Doe
555-1212." You want to sort this list by the person's last name, and sort by first name
when the last names are the same. This is a job made for cmp!
    my @sorted = sort {
        my ($left_surname)  = ($a =~ / (\w+)/);
        my ($right_surname) = ($b =~ / (\w+)/);

        # Last names are the same, sort on first name
        if ($left_surname eq $right_surname) {
            my ($left_first)  = ($a =~ /^(\w+)/);
            my ($right_first) = ($b =~ /^(\w+)/);
            return $left_first cmp $right_first;
        } else {
            return $left_surname cmp $right_surname;
        }
    } @phone_numbers;

    say $_ for @sorted;
If you look closely at the regexp assignment lines, you'll see list context. Where? The
parentheses around the variable name are not just there for decoration; they group a
single scalar into a one-element list, which is sufficient to provide list context on the
right-hand side of the assignment.
In scalar context (without the parentheses), the regular expression returns only
whether the match succeeded. In list context (as written), it returns the captured values. Thus
this is the Perl idiom for performing a regexp match and capture and assignment in a
single line.
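A two-line experiment shows the difference. The phone book entry below is sample data in the format described earlier.

```perl
use 5.010;
use strict;
use warnings;

my $entry = 'John Doe 555-1212';

# Parentheses on the left give list context: you get the captured text.
my ($surname) = ( $entry =~ / (\w+)/ );

# No parentheses on the left gives scalar context: you get true or false.
my $matched = ( $entry =~ / (\w+)/ );

say "List context captured:   $surname";    # Doe
say "Scalar context returned: $matched";    # 1
```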
Trust No One
Now that you know how CGI programs can do what you want, you need to make
sure they won't do what you don't want. This is harder than it looks, because you
can't trust anyone to do what you expect.
Here's a simple example: You want to make sure the HTTP log analyzer will never
show more than 50 items per report, because it takes too long to send larger reports
to the user. The easy thing to do would be to eliminate the "ALL" line from the HTML
form, so that the only remaining options are 10, 20, and 50. It would be very easy --
and wrong.
Download the source code for the HTTP analyzer with security enhancements.
You saw that you can modify HTML forms when you pasted the pizza-topping
sample code into the backatcha page. You can also use the URL to pass form items
to a program -- try going to
http://example.com/backatcha.cgi?itemsource=URL&typedby=you in your browser.
Obviously, if someone can do this
with the backatcha program, they can also do it with your log analyzer and stick any
value for number in that they want: "ALL" or "25000", or "four score and seven years
ago."
Your form doesn't allow this, you say. Who cares? People will write custom HTML
forms to exploit weaknesses in your programs, or will just pass bad form items to
your program directly. You cannot trust anything users or their browsers tell you.
They might not even use a browser at all -- anything which can speak HTTP can
contact your program, regardless of whether it's even ever seen your form before (or
cares what your form allows and disallows).
Eliminate these problems by knowing what you expect from the user, and
disallowing everything else. Whatever you do not expressly permit is totally
forbidden. Secure CGI programs consider everything guilty until it is made innocent.
For example, you want to limit the size of reports from the HTTP log analyzer. You
decide that means the number form item must have a value that is between 10 and
50. Verify it like:
    # Make sure that the "number" form item has a reasonable value
    my ($number) = ($q->param('number') =~ /(\d+)/);

    # No digits at all ("ALL", "redrum"): fall back to the smallest report
    $number //= 10;

    if ($number < 10) {
        $number = 10;
    } elsif ($number > 50) {
        $number = 50;
    }
Of course, you also have to change the report_section() sub so it uses the $number
variable. Now, whether your user tries to tell your log analyzer that the value of
number is "10," "200," "432023," "ALL" or "redrum," your program will restrict it to a
reasonable value.
You don't need to do anything with report, because it only acts when one of its
values is something expected. If the user tries to enter something other than the
expressly permitted values ("url," "status," "hour" or "type"), the code just ignores it.
Do note that report_section is a little smarter to avoid printing nothing when there's
nothing to print. If the user entered an invalid value, report will call report_section
with only the CGI object $q, and the latter sub will return early, without printing
anything.
Use this sort of logic everywhere you know what the user should enter. You might
use s/\D//g to remove non-numeric characters from items that should be numbers
(and then test to make sure what's left is within your range of allowable numbers!), or
/^\w+$/ to make sure that the user entered a single word.
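Those two checks might look like this when wrapped in subs. This is a sketch under assumptions: clean_number() and is_single_word() are hypothetical names, and falling back to the minimum when no digits remain is one possible policy among several.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: strip non-digits, then force the result into a range.
sub clean_number {
    my ( $input, $min, $max ) = @_;
    ( my $digits = $input // '' ) =~ s/\D//g;

    return $min unless length $digits;    # nothing numeric was entered
    return $min if $digits < $min;
    return $max if $digits > $max;
    return $digits + 0;
}

# Hypothetical helper: true only for one run of word characters.
sub is_single_word {
    my ($input) = @_;
    return defined $input && $input =~ /^\w+$/ ? 1 : 0;
}

say clean_number( $_, 10, 50 ) for '20', '432023', 'ALL', 'redrum';
say is_single_word($_) ? "ok: $_" : "rejected: $_" for 'status', 'rm -rf';
```

Every caller gets back a value it can use without further checking, which is the point of validating early.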
All of this has two significant benefits. First, you simplify your error-handling code,
because you make sure as early in your program as possible that you're working
with valid data. Second, you increase security by reducing the number of
"impossible" values that might help an attacker compromise your system or mess
with other users of your Web server.
Don't just take my word for it, though. The CGI Security FAQ has more information
about safe CGI programming in Perl than you ever thought could possibly exist,
including a section listing some security holes in real CGI programs.
Play Around!
You should now know enough about CGI programming to write a useful Web
application. (Oh, and you learned a little bit more about sorting and comparison.)
Now for some assignments:
• Write the quintessential CGI program: a guestbook. Users enter their name, e-mail
address and a short message. Append these to an HTML file for all to see.
Be careful! Never trust the user! A good beginning precaution is to disallow all
HTML by either removing < and > characters from all of the user's information or
replacing them with the &lt; and &gt; character entities. The escapeHTML method
in the CGI module is very good for this.
Use substr(), too, to cut anything the user enters down to a reasonable size.
Asking for a "short" message will do nothing to prevent the user dumping a 500k
file into the message field!
• Write a program that plays tic-tac-toe against the user. Be sure that the computer
AI is in a sub so it can be easily upgraded. (You'll probably need to study HTML a
bit to see how to output the tic-tac-toe board.)
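As a starting point for the guestbook precautions, here is a sketch of a sub that trims and escapes one field. Everything in it is an assumption made for illustration: the 500-character limit, the clean_entry() name, and the escape_html() stand-in (in a real CGI program you would call the CGI module's escapeHTML method instead).

```perl
use 5.010;
use strict;
use warnings;

# Assumed limit for this sketch; pick whatever suits your guestbook.
use constant MAX_MESSAGE_LENGTH => 500;

# Hypothetical stand-in for the CGI module's escapeHTML method.
sub escape_html {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: cut the input down to size, then disarm any HTML.
sub clean_entry {
    my ($text) = @_;
    $text //= '';
    $text = substr $text, 0, MAX_MESSAGE_LENGTH;
    return escape_html($text);
}

say clean_entry('<b>Nice site!</b>');    # &lt;b&gt;Nice site!&lt;/b&gt;
say length clean_entry( 'x' x 500_000 ); # 500
```

Truncating before escaping keeps the cut from landing in the middle of a character entity.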