Accessing the Web with Perl

Installing libwww-perl

If you are using Cygwin rather than working on a Linux or Unix box, you'll probably need to download several modules onto your machine before you can do much useful work with Perl on the internet. Several of these modules need the gcc and g++ compilers (for C and C++) in order to be installed. Hopefully, when you installed Cygwin you asked the installer to provide these compilers. If not, you'll have to re-run the Cygwin setup.exe file. (If you've deleted the setup.exe file, you can obtain it again at http://www.cygwin.com/. Just click on the setup button on that page.)

Recall that setup.exe is the Cygwin installer/updater. If you haven't already installed Cygwin it acts as an installer; otherwise it acts as an updater, downloading only those components not already on your system. In the dialog box that results from running the installer, navigate to the Devel item. Click on that and you will see a great many more items. Make sure you select the g++ item (not the "src" entry, just the executable). That's it! Click on the Next button and follow the directions. The resulting download can take quite a long while, so you definitely need a broadband connection.

Okay, now that you've got the GNU compilers installed, you're ready to install the libwww-perl modules from CPAN. (We discussed how to install Perl modules in an earlier lecture.) When you examine the README file for libwww-perl, you'll see that it indicates that it is dependent on several other modules:

You will find out which of these are not yet installed in your Perl when you invoke the Makefile. Before installing libwww you'll have to first install the missing modules. When that is done you can re-run the libwww-perl Makefile and then complete the installation. Here is a session transcript for the installation of libwww. During installation you might get an error message something like: "unable to forceunlink ... HEAD". If that happens, you can probably work around the problem by first creating a file named HEAD in C:\cygwin\bin (assuming you installed Cygwin into the default location of C:\cygwin). You can quickly create such a file from within Cygwin with the command touch /usr/bin/HEAD. (Recall that pathnames in Cygwin differ slightly from their DOS equivalents.) After you've done this, try the make install command again and everything should work just fine.
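A quick way to see which modules your Perl can already load is a small check script. The following is only a sketch: the module list is an assumption drawn from the classes used later in this lecture (it is not the dependency list from the README), and module_installed is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
sub module_installed {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Assumed list: the modules this lecture goes on to use.
for my $module (qw(LWP::UserAgent HTTP::Request HTTP::Response)) {
    printf "%-20s %s\n", $module,
        module_installed($module) ? 'installed' : 'MISSING';
}
```

Any module reported as MISSING would need to be installed from CPAN before the later examples can run.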

Here is part of the installation session of these modules on my machine.

Using LWP

We are primarily interested in using the LWP module that comes with libwww. The basic structure of the process is straightforward, as discussed in the LWP documentation:

...Communication always take place through these steps: First a request object is created and configured. This object is then passed to a server and we get a response object in return that we can examine. A request is always independent of any previous requests, i.e. the service is stateless. The same simple model is used for any kind of service we want to access.

For example, if we want to fetch a document from a remote file server, then we send it a request that contains a name for that document and the response will contain the document itself. If we access a search engine, then the content of the request will contain the query parameters and the response will contain the query result. If we want to send a mail message to somebody then we send a request object which contains our message to the mail server and the response object will contain an acknowledgment that tells us that the message has been accepted and will be forwarded to the recipient(s).

As you might expect, libwww provides an HTTP::Request class and an HTTP::Response class. What is not clear from the above statement is that objects of these classes are accessed via an intermediary, called a User Agent. The agent takes care of many of the details of interfacing with various servers on the internet, including security concerns, proxies, etc. The user creates an LWP::UserAgent, and uses methods of that object to send requests to web servers, and then to access the responses.
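As an illustration of the agent's role, here is a minimal sketch of configuring an LWP::UserAgent before any request is sent. The helper name make_agent and the 10-second timeout are assumptions chosen for the example; env_proxy simply tells the agent to pick up proxy settings from environment variables such as http_proxy, if any are set.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent with a few common settings (values are assumptions).
sub make_agent {
    my $ua = LWP::UserAgent->new(
        agent   => 'MyApp/0.1',   # name sent in the User-Agent header
        timeout => 10,            # seconds to wait for a server
    );
    $ua->env_proxy;               # use proxy settings from the environment
    return $ua;
}

my $ua = make_agent();
print "Agent:   ", $ua->agent,   "\n";
print "Timeout: ", $ua->timeout, " seconds\n";
```

Nothing is sent over the network here; the agent is only prepared for later requests.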

The use of the agent is clear in this sample session taken from the same Perl documentation page:

  # Create a user agent object
  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;
  # The trailing space makes LWP append its own default agent string
  $ua->agent("MyApp/0.1 ");

  # Create a request
  my $req = HTTP::Request->new(POST => 'http://search.cpan.org/search');
  $req->content_type('application/x-www-form-urlencoded');
  $req->content('query=libwww-perl&mode=dist');

  # Pass request to the user agent and get a response back
  my $res = $ua->request($req);

  # Check the outcome of the response
  if ($res->is_success) {
      print $res->content;
  }
  else {
      print $res->status_line, "\n";
  }

Once the libwww modules are installed you can execute the above code. If all goes well, the program will print the HTML code that is proffered by the web server for the URL http://search.cpan.org/search?query=libwww-perl&mode=dist. You might try comparing that output against the HTML for that page in your browser. (In Firefox, use the Page Source item in the View menu.)
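Since that URL carries the parameters in its query string, the same search could presumably also be issued as a GET request. The sketch below builds such a request with the URI module; build_search_request is a hypothetical helper, and whether the server treats GET and POST identically is an assumption.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request;

# Build a GET request whose URL carries the given name/value pairs.
sub build_search_request {
    my @params = @_;                       # a list keeps the pairs in order
    my $uri = URI->new('http://search.cpan.org/search');
    $uri->query_form(@params);             # appends ?name=value&name=value
    return HTTP::Request->new(GET => $uri);
}

my $req = build_search_request(query => 'libwww-perl', mode => 'dist');
print $req->method, " ", $req->uri, "\n";
# prints: GET http://search.cpan.org/search?query=libwww-perl&mode=dist
```

The resulting request object can be passed to $ua->request in exactly the same way as the POST request in the sample session.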

Request Objects

The web page at the URL http://search.cpan.org/search is a form: it contains three elements that the user can interact with. The elements are identified in this screenshot of the page.

Here is the pertinent HTML code that defines this form:

<form method="get" action="/search" name="f" class="searchbox">
<input type="text" name="query" value="" size="35">
<br>in <select name="mode">
<option value="all">All</option>
<option value="module" >Modules</option>
<option value="dist" >Distributions</option>
<option value="author" >Authors</option>
</select>&nbsp;<input type="submit" value="CPAN Search">
</form>

The first (upper) element of the form is a textbox, which has been named "query" (notice the HTML code name="query"). The left (and lower) element is a select box named "mode". The third element is the "submit" button, defined in the last line of the HTML code above. On the web page it is labelled "CPAN Search".

In normal interactive use, the user fills out the textbox, makes a selection from the select box and then clicks on the submit button. This causes the browser to submit a new query to the web server in the form of a slightly more complicated URL: instead of http://search.cpan.org/search, it will be http://search.cpan.org/search?query=libwww-perl&mode=dist. Notice the string starting with "?" that has been appended to the original URL. The resulting page looks like this:

So, we want to automate this process--we want to have Perl access this search engine on our behalf, autonomously. To do this, we have to determine the names of the elements of the form in the web page we will be accessing. We can do that by visiting the web page normally, via our browser, and then viewing the HTML source to locate the form definition. There, we have to identify the names given to each of the elements of interest to us. Then we construct a Request object that provides the appropriate value for each of those elements. This is what we were doing in this section of the Perl code we've been looking at:

  # Create a request
  my $req = HTTP::Request->new(POST => 'http://search.cpan.org/search');
  $req->content_type('application/x-www-form-urlencoded');
  $req->content('query=libwww-perl&mode=dist');

If you're really curious about all the possibilities for a Request object, see the documentation. The last line is the interesting part: we specify the value of each element with XXX=YYY pairs, where XXX is the name of the element and YYY is the value we want to assign to it. The pairs are separated by ampersands: "&".
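Writing the XXX=YYY string by hand gets awkward once values contain spaces or other special characters. As a sketch of one alternative, the POST function exported by HTTP::Request::Common can assemble the same kind of request from a list of name/value pairs and takes care of the encoding. The helper name build_form_request is an assumption made for this example.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build a form-encoded POST request from name/value pairs.
sub build_form_request {
    my ($url, @fields) = @_;
    return POST $url, [@fields];   # an array reference keeps the field order
}

my $req = build_form_request(
    'http://search.cpan.org/search',
    query => 'libwww-perl',
    mode  => 'dist',
);

my $type = $req->content_type;
print "$type\n";              # application/x-www-form-urlencoded
print $req->content, "\n";    # query=libwww-perl&mode=dist
```

The content type and content come out the same as in the hand-built request, so the two approaches should be interchangeable for simple forms.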

The Response Object

Okay, so once we've formed our query in the form of a request object, we can submit it to the web server via a User Agent. Here's the code from our example that does that:

  my $res = $ua->request($req);

The result is a Response object. The Response class supports a great number of methods, but our sample code makes use of only the three most commonly used ones: is_success, content and status_line. The is_success method returns true if the query was answered without trouble (which is not to say you necessarily got what you wanted!). The content method returns, as a string, the document returned by the web server. The status_line method returns a single human-readable string containing the status code and message (for example, "404 Not Found"), which describes what kind of error occurred, if one did occur.
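These three methods can be tried out without contacting any server by constructing Response objects by hand. The sketch below does that; the describe helper and the two sample responses are made up purely for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Summarize a response using is_success, content and status_line.
sub describe {
    my ($res) = @_;
    return $res->is_success
        ? 'OK: ' . length($res->content) . ' bytes'
        : 'Failed: ' . $res->status_line;
}

# Hand-built responses standing in for what a server might return.
my $ok = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], '<html></html>'
);
my $missing = HTTP::Response->new(404, 'Not Found');

print describe($ok),      "\n";   # OK: 13 bytes
print describe($missing), "\n";   # Failed: 404 Not Found
```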

The curious reader will want to examine the full documentation for the Response class. There is a lot more information returned by web servers than is examined in our sample code.

If you plan to write code that will query a web server many times, then you are writing a web "robot". Web robots need to conform to certain standards of behavior, or else they obstruct human interactions with their target website(s). For example, a robot shouldn't pull web pages too frequently: say, not more than one per minute. The standards of behavior for web robots are outlined at http://www.robotstxt.org/. If you want to try writing a web robot with Perl, you should probably use the LWP::RobotUA class, which is a subclass of LWP::UserAgent that consults each site's robots.txt rules and enforces a delay between requests to the same server.
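A robot agent is set up much like an ordinary user agent. The following is a minimal sketch: the robot name and the contact address are placeholders, and make_robot is a hypothetical helper. The delay is given in minutes, so the value 1 matches the one-request-per-minute guideline mentioned above.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Build a polite robot agent (name and address are placeholders).
sub make_robot {
    my $ua = LWP::RobotUA->new(
        agent => 'my-robot/0.1',
        from  => 'me@example.com',   # contact address for site owners
    );
    $ua->delay(1);                   # minimum minutes between requests
    return $ua;
}

my $robot = make_robot();
print ref($robot), " waits at least ", $robot->delay,
      " minute(s) between requests\n";
```

Requests are then sent with $robot->request($req), just as with the plain user agent.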

If you plan on interacting with a lot of forms, possibly on different servers, you might want to look into the HTTP::Request::Form module, which is designed as a front-end to HTTP::Request. It simplifies the processing of forms in a user agent by filling out fields, querying fields, selections and buttons and pressing buttons. It uses the HTML::TreeBuilder module to generate a parse tree of the HTML document, and then uses the resulting tree to access the form's parts, which are extracted with extract_links. It then creates its own internal representation of forms and uses them to generate a request object to process the form application. The class is quite robust, allowing the user to determine the types and names of all the form elements (radio buttons, selection boxes, text boxes, etc.), and to set the values of those elements in a subsequent request.

The following program does the same thing as our original program, but uses HTTP::Request::Form:

use strict;
use warnings;
use URI::URL;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Request::Common;
use HTTP::Request::Form;
use HTML::TreeBuilder 3.0;

# Create a user agent object
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");

# Fetch the page that contains the form
my $url = url 'http://search.cpan.org/search';
my $res = $ua->request(GET $url);

# Parse the returned HTML into a tree
my $tree = HTML::TreeBuilder->new;
$tree->parse($res->content);
$tree->eof();

# Locate the form elements in the parsed page
my @forms = $tree->find_by_tag_name('FORM');
die "What, no forms in $url?" unless @forms;

my $f = HTTP::Request::Form->new($forms[0], $url);
my @fields = $f->fields();
print "The fields in the form are: @fields \n";
$f->field("query", "libwww-perl");
$f->field("mode", "dist");

# Now that our fields are all set, we submit the request
$res = $ua->request($f->press("submit"));
# Check the outcome of the response
if ($res->is_success) {
    print $res->content;
}
else {
    print $res->status_line, "\n";
}

As you can see, the code is a bit lengthier. The advantage of Form is that you can often use it to autonomously determine how to access and use forms. This would be particularly useful if you are writing a web crawler or spider--a program that traverses web pages, looking for new links and pages. Using Form, the crawler could locate forms having buttons labelled "search", and submit queries to those forms.
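The form-discovery step that such a crawler would need can be sketched with HTML::TreeBuilder alone. The code below is an illustration under assumptions, not part of HTTP::Request::Form: find_forms_by_button is a hypothetical helper, and the HTML is the search form shown earlier, embedded as a string so that no network access is needed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# For each form that has a submit button whose label matches $pattern,
# return a reference to the list of its named fields.
sub find_forms_by_button {
    my ($html, $pattern) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @found;
    for my $form ($tree->find_by_tag_name('form')) {
        my @buttons = $form->look_down(_tag => 'input', type => 'submit');
        next unless grep { ($_->attr('value') // '') =~ $pattern } @buttons;
        my @names = grep { defined }
                    map  { $_->attr('name') }
                    $form->look_down(_tag => qr/^(?:input|select|textarea)$/);
        push @found, \@names;
    }
    $tree->delete;    # release the parse tree
    return @found;
}

my $html = <<'HTML';
<form method="get" action="/search" name="f" class="searchbox">
<input type="text" name="query" value="" size="35">
<br>in <select name="mode">
<option value="all">All</option>
<option value="dist">Distributions</option>
</select>&nbsp;<input type="submit" value="CPAN Search">
</form>
HTML

for my $fields (find_forms_by_button($html, qr/search/i)) {
    print "Search form with fields: @$fields\n";
}
# prints: Search form with fields: query mode
```

A crawler could feed the discovered field names to HTTP::Request::Form, or to a hand-built request, in order to submit a query.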