ZOIS *
Technical Notes
ZOIS Technical Note TN-2010-02-17
Author and Audience
The job-detail pages of the various Unofficial Jobcentre sites allow the user to 'click through' to the official site and its description of the vacancy in question. Although at first sight this may seem trivial, it in fact proved quite challenging. This TN will interest PHP developers who are confronted with similarly gratuitously state-based web-sites and who would like to allow their users to 'click through' to an official page.
The reader is assumed to be familiar with programming
techniques, particularly in PHP. Written by Martin
Sullivan[au], ZOIS Limited,
Cockermouth.
Abstract
A mechanism is presented, in PHP, which allows gratuitously
stateful third-party pages to be displayed programmatically. These
pages can then be bookmarked or their links shared.
Introduction
A number of Technical Demonstrations have been produced[nj, cj] that allow a more-or-less complete scrape of the Jobseekers Direct database to be interrogated via a simple web-based interface. Originally this was intended as a replacement for the closed Cockermouth Jobcentre Plus office, but it later evolved into a national system that allows people to examine the postings recently made by their own local centre. Both of these systems present vacancy details from a cache, with the option of clicking on a link to reach the original posting, which is served dynamically by the Jobseekers Direct web-site[jd], where such things are officially kept.
One would think that a simple HTTP GET request would be sufficient
to retrieve what is essentially a self-contained query; however, the
Jobseekers Direct site makes no such provision. It requires that the
correct navigation steps are performed beforehand and that cookies and
session-ids are correctly set. While this is convoluted, it is a good
deal simpler than the original Jobcentre Plus vacancy search
web-site. Many web-sites now gratuitously demand state information,
presumably to enhance consumer experiences by making educated guesses
about what visitors would like to buy, or at least see advertised. A
mechanism is presented here that automates this task and effectively
allows the user to perform a simple GET.
Materials and Platform
The examples are coded in PHP throughout, although other
server-side scripting languages could be used. The major component
required was the preg_* family of functions, which provide Perl-like
Regular Expression string manipulation[pr]. These are now
part of the standard distribution. The system also required the HTTP
extension[ht], which is not a standard part of the PHP
distribution and must be installed from the PECL system[pe]. As ever, the primary development environment was Emacs
running on Linux.
Method
The PECL extension for HTTP must be downloaded and installed. This is documented elsewhere, but may be accomplished using a distribution-based package manager such as Ubuntu's Synaptic. Root-level privileges are required to do this.
The guts of the code is a function which enacts an HTTP POST. This
effectively fills in and submits an on-screen form using the slightly
complicated, and not idempotent, POST method. The form contains only a
single field that is visible to the user, the $reference,
but it also has a number of hidden fields whose values must be acquired
and sent back unchanged. These 'hidden' form variables,
__VIEWSTATE and __EVENTTARGET, are opaque and are
supplied by the server in an initial interaction. For maximum
flexibility the actual URL for the form-action is also obtained
automatically.
$result = http_parse_message (
  http_post_fields ($url . "/" . $target,
    array ('__VIEWSTATE' => $viewstate,
           '__EVENTTARGET' => $eventtarget,
           'uctlHCCDialogue:txtUserInput' => $reference,
           'uctlHCCDialogue:btnSearch' => "submit"),
    array (),
    $options));
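To make the shape of the request body explicit, the following minimal sketch builds the same field array and URL-encodes it with the core PHP function http_build_query. The helper name build_search_fields and the sample values are hypothetical; only the four field names are taken from the call above.

```php
<?php
// Hypothetical helper: assemble the form fields sent by the POST above.
// Only the four field names come from the note; the function name and
// the sample values further down are illustrative assumptions.
function build_search_fields($viewstate, $eventtarget, $reference)
{
    return array(
        '__VIEWSTATE'                  => $viewstate,   // opaque, issued by the server
        '__EVENTTARGET'                => $eventtarget, // opaque, issued by the server
        'uctlHCCDialogue:txtUserInput' => $reference,   // the only user-visible field
        'uctlHCCDialogue:btnSearch'    => 'submit',
    );
}

// Invented sample values, standing in for data scraped from the real form.
$fields = build_search_fields('dDwtMTIzNDU2Nzg5', '', 'ABC/12345');

// http_build_query() yields the application/x-www-form-urlencoded body
// that a browser would send when the form is submitted.
$body = http_build_query($fields);
echo $body, "\n";
```

Printing the encoded body in this way can be a convenient check that the field names, including the colons, survive encoding as expected before any request is sent.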
As already outlined, the POST requires that information be gathered in advance and to do this, the following dialogue must be entered into.
$url = "http://jobseekers.direct.gov.uk";
$opts = array(
  'cookiesession' => TRUE,
  'redirect' => 3,
  'timeout' => 360
);
$c = http_parse_message (http_get ($url, $opts) )->body;
The content, $c, can then be parsed using
preg_match to obtain the variables that are of interest. As
an example:
if (preg_match ("/<form.*?action=\"(.*?)\" id/", $c, $match)) {
  $action = $match[1];
} else {
  mydie ("No 'action' in the Search page form");
} // else
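The hidden fields can be pulled out in the same way. The sketch below is an illustration only: the helper name hidden_field and the sample markup are invented, and the pattern assumes double-quoted attributes appearing in the order type, name, value. A page laid out differently would need a different pattern or a proper HTML parser.

```php
<?php
// Hypothetical helper: return the value of a named hidden <input>, or
// null when no such field is found. Assumes double-quoted attributes in
// the order type, name, value; this is an assumption about the markup.
function hidden_field($html, $name)
{
    $pattern = '/<input[^>]*type="hidden"[^>]*name="' . preg_quote($name, '/')
             . '"[^>]*value="([^"]*)"/i';
    if (preg_match($pattern, $html, $match)) {
        // Attribute values are HTML-escaped in the page source.
        return html_entity_decode($match[1], ENT_QUOTES);
    }
    return null;
}

// Invented sample markup, standing in for the fetched page body $c.
$c = '<form name="frm" method="post" action="search.aspx" id="frm">'
   . '<input type="hidden" name="__EVENTTARGET" value="" />'
   . '<input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5" />'
   . '</form>';

$viewstate   = hidden_field($c, '__VIEWSTATE');
$eventtarget = hidden_field($c, '__EVENTTARGET');
```

Returning null for a missing field, rather than an empty string, keeps "field absent" distinguishable from "field present but empty", which matters here because __EVENTTARGET may legitimately be empty.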
Once the http_post_fields call has been made, a result
is returned. The server code in this instance uses the "Redirect after
Post" pattern, also known as Post/Redirect/Get (PRG)[rp], to keep the
interaction effectively idempotent in the face of possible browser
intransigence. Some big words there: reloading a page isn't supposed to
automatically resubmit POSTed data, but some browsers do it
anyway. PRG is not required in this instance, but we get it anyway,
possibly because of libraries or cut-and-paste programming. The code
to deal with this is thus:
if ($result->responseCode == 302) {
  $c = http_parse_message (
    http_get ($url .
      $result->headers['Location'], $opts))->body;
} else {
  mydie ("No expected redirect location");
} // else
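The redirect handling above simply appends the Location header to $url, which works when the server sends a path. As a hedged sketch, a small helper could also cope with a server that sends an absolute URL. The name resolve_location and the sample paths are hypothetical, and only these two simple cases are handled; it is not a full relative-URL resolver.

```php
<?php
// Hypothetical helper: turn a Location header value into an absolute URL.
// Handles two cases only: a value that is already absolute, and a path
// that is treated as relative to the site root (an assumption).
function resolve_location($base, $location)
{
    if (preg_match('#^https?://#i', $location)) {
        return $location; // already absolute, use as-is
    }
    // Join with exactly one slash between the base and the path.
    return rtrim($base, '/') . '/' . ltrim($location, '/');
}

$url = "http://jobseekers.direct.gov.uk";

// Invented example values for the Location header.
$relative = resolve_location($url, '/detail.aspx?id=1');
$absolute = resolve_location($url, 'http://example.org/elsewhere');
```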
The HTTP GET returns the required page. In theory, all that
remains is to display it (via an
echo). Unfortunately the page still depends on a number of
additional elements that need to be fetched at the same
time: Cascading Style Sheets (CSS), various graphics and other links.
They are referenced by URLs which are relative to the
expected URL of the page, but since it is being displayed via another
page they will be wrong. It is therefore necessary to fix the base-URL
expectation by injecting a BASE HTML tag. This was done immediately
prior to presentation:
echo preg_replace ('/<HEAD>/i', '<HEAD><base href="' . $url . '/">', $c);
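A slightly more defensive variant of the same idea is sketched below. It is an assumption-laden illustration, not the code used on the sites: the helper name inject_base and the sample page are invented. It tolerates attributes on the HEAD tag, touches only the first match, and escapes the URL before placing it in the attribute.

```php
<?php
// Hypothetical helper: insert a <base> element directly after the opening
// <head> tag so that relative links resolve against the original site.
function inject_base($html, $base)
{
    $href = htmlspecialchars(rtrim($base, '/') . '/', ENT_QUOTES);
    $tag  = '<base href="' . $href . '">';
    // A callback avoids having to escape "$" or "\" in the replacement
    // text; the limit of 1 means only the first <head> tag is touched.
    return preg_replace_callback(
        '/<head\b[^>]*>/i',
        function ($m) use ($tag) { return $m[0] . $tag; },
        $html,
        1
    );
}

// Invented sample page, standing in for the fetched body $c.
$page = '<html><HEAD profile="x"><title>Vacancy</title></HEAD><body></body></html>';
$out  = inject_base($page, 'http://jobseekers.direct.gov.uk');
echo $out, "\n";
```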
It seems to be a common failing amongst modern web-sites that
needless complication prevents the user from doing simple
things. In this instance a user may like to do something simple and
non-stateful but is required to go through a set number of procedures
which cannot be bookmarked. This complexity can be hidden and the
resultant URL bookmarked, but only by performing the expected activity
programmatically on behalf of the user.
References
References found in this section, and in particular the HTML links, were correct at the time of writing (2010-03-11).
$Date: 2010/03/13 10:03:12 $