'Click-through' on Stateful Third Party Web-sites Using PHP
ZOIS Technical Note TN-2010-02-17
Author and Audience
The jobs-detail part of the various Unofficial Jobcentre sites allows the user to 'click through' to the official site and its description of the vacancy in question. Although at first sight this may seem trivial, it in fact proved quite challenging. This TN will interest PHP developers who are confronted with similarly gratuitously state-based web-sites and who would like to allow their users to 'click through' to an official page.
The reader is assumed to be familiar with programming
techniques, particularly in PHP. Written by Martin
Sullivan[au], ZOIS Limited,
Cockermouth.
Abstract
A mechanism is presented, in PHP, which will allow gratuitously
stateful third-party pages to be displayed programmatically. These
pages can then be bookmarked or their link shared.
Introduction
A number of Technical Demonstrations have been produced[nj, cj] that allow a more-or-less complete scrape of the Jobseekers Direct database to be interrogated via a simple web-based interface. Originally this was intended as a replacement for the closed Cockermouth Jobcentre Plus office, but it later evolved into a national system that would allow people to examine the postings recently made by their own local centre. Both these systems present vacancy details from a cache, with the possibility of clicking on a link to connect to the original posting found dynamically on the Jobseekers Direct web-site[jd], which is where such things are officially kept.
One would think that a simple HTTP GET command would be sufficient
to retrieve what is essentially a self-contained query; however, the
Jobseekers Direct site makes no such provision. It requires that the
correct navigation steps are performed beforehand and that cookies and
session-ids are correctly set. While this is convoluted, it is a good
deal simpler than the original Jobcentre Plus vacancy search
web-site. Many web-sites now gratuitously demand state information,
presumably to enhance consumer experiences by making educated guesses
about what they would like to buy, or at least see advertised. A
mechanism is presented here that automates this task and effectively
allows the user to perform a simple GET.
Materials and Platform
The examples are coded in PHP throughout, although other
server-side scripting languages could be used. The major component
required is the preg_* family of functions, which provide Perl-like Regular
Expression string manipulation[pr]. These are now
part of the standard distribution. The system also requires the HTTP
extension[ht], which is not a standard part of the PHP
distribution and must be installed from the PECL system[pe]. As ever, the primary development environment was Emacs
running on Linux.
Method
The PECL extension for HTTP must be downloaded and installed. This is documented elsewhere, but may be accomplished using a distribution-based package manager such as Ubuntu's Synaptic. Root-level privileges are required to do this.
Changes at Jobseekers Direct mean that this code has been modified. See Updates.
The guts of the code is a function which enacts an HTTP POST. This
effectively fills in and submits an onscreen form using the slightly
more complicated POST method which, unlike GET, is not guaranteed to be idempotent. The form contains only a single
field that is visible to the user, the $reference, but it
also has a number of hidden fields which must be acquired and presented
back to the server. These 'hidden' form variables, __VIEWSTATE
and the formerly, but now no longer, required __EVENTTARGET,
are opaque and are presented by the server in an initial interaction. For
maximum flexibility the actual URL for the form-action is also obtained
automatically.
$result = http_parse_message (
    http_post_fields ($url . "/" . $target,
        array (
            'tsmGlobal_HiddenField' => "",
            '__VIEWSTATE' => $viewstate,
            'txtSubject' => $reference,
            'txtLocation' => "",
            'ddlDistance' => "4",
            'btnSearch' => "Search",
            'ddlHours' => "70",
            'ddlType' => "0",
            'ddlAge' => "0"),
        array (), $opts));
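To make the shape of the submission concrete, the sketch below assembles the same field list in a function and encodes it with the standard http_build_query, which should approximate the form-encoded body sent by the POST above. The function name search_fields and the sample values are illustrative assumptions and are not part of the original script.

```php
<?php
// Hypothetical helper: collect the form fields used in the POST above.
function search_fields($viewstate, $reference) {
    return array(
        'tsmGlobal_HiddenField' => "",
        '__VIEWSTATE'           => $viewstate,  // opaque value handed out by the server
        'txtSubject'            => $reference,  // the only field the user would type
        'txtLocation'           => "",
        'ddlDistance'           => "4",
        'btnSearch'             => "Search",
        'ddlHours'              => "70",
        'ddlType'               => "0",
        'ddlAge'                => "0",
    );
}

// Sample values only; a real __VIEWSTATE is much longer.
$body = http_build_query(search_fields("/wEPDw==", "ABC/12345"));
```

Keeping the field list in one place should make it easier to adjust when the remote form changes.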
As already outlined, the POST requires that information be gathered in advance. To do this, the following dialogue must first be entered into. The redirect option is worth noting: its default, 0, means that no redirects are followed.
$url = "http://jobseekers.direct.gov.uk";
$opts = array(
    'cookiesession' => TRUE,
    'redirect' => 3,
    'timeout' => 360
);
$c = http_parse_message (http_get ($url, $opts))->body;
The content, $c, can then be parsed using
preg_match to obtain variables that are of interest. As
an example:
if (preg_match ("/<form.*?action=\"(.*?)\" id/", $c, $match)) {
    $action = $match[1];
} else {
    mydie ("No 'action' in the Search page form");
} // else
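The same approach extends to the hidden fields such as __VIEWSTATE. The following is a minimal sketch, not the original code: the helper name hidden_field is hypothetical, the sample markup is made up for illustration, and it assumes the name attribute appears before the value attribute inside the tag.

```php
<?php
// Hypothetical helper: return the value of a named <input> field, or null
// if the page does not contain it. Assumes name="..." precedes value="...".
function hidden_field($html, $name) {
    $pattern = '/<input[^>]*\bname="' . preg_quote($name, '/')
             . '"[^>]*\bvalue="([^"]*)"/i';
    if (preg_match($pattern, $html, $match)) {
        // Attribute values are HTML-escaped in the page source.
        return html_entity_decode($match[1], ENT_QUOTES);
    }
    return null;
}

// Made-up sample markup standing in for the fetched page content $c.
$sample = '<form method="post" action="Search.aspx" id="form1">'
        . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTk=" />'
        . '</form>';
$viewstate = hidden_field($sample, '__VIEWSTATE');
```

Returning null rather than dying leaves the caller free to decide whether a missing field is fatal.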
Once the http_post_fields call has been made, a result
is returned. The server code in this instance uses the "Post/Redirect/Get"
(PRG) pattern, also known as "Redirect After Post"[rp], so that the POST
is not inadvertently repeated in the face of possible Browser
intransigence. Big words, but the point is simple: reloading a page isn't
supposed to automatically resubmit existing data, but some Browsers do it
anyway. PRG is not required in this instance, but we get it anyway,
possibly because of libraries or cut-and-paste programming. The code
to deal with this is thus:
if ($result->responseCode == 302) {
    $c = http_parse_message (
        http_get ($url .
            $result->headers['Location'], $opts))->body;
} else {
    mydie ("No expected redirect location");
} // else
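The fragment above assumes that the Location header holds a path beginning with a slash. Should the server ever send an absolute URL instead, simple concatenation would produce a broken address. A minimal sketch of a more defensive approach follows; the helper name resolve_location is hypothetical, and it assumes that $base is a scheme and host only, as $url is in this note.

```php
<?php
// Hypothetical helper: turn a Location header value into an absolute URL.
// Assumes $base is scheme and host only, e.g. "http://example.org".
function resolve_location($base, $location) {
    // Already absolute (starts with a scheme): use it unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }
    // Root-relative path: append it to the scheme and host.
    if (substr($location, 0, 1) === '/') {
        return rtrim($base, '/') . $location;
    }
    // Anything else is treated as relative to the site root (a simplification).
    return rtrim($base, '/') . '/' . $location;
}
```

The result of resolve_location ($url, $result->headers['Location']) could then be passed to http_get in place of the concatenated string.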
The HTTP GET returns the required page. In theory, all that is
left to do is to display it (via an
echo). Unfortunately the page still requires a number of
additional elements that need to be fetched at the same
time: Cascading Style Sheets (CSS), various graphics and other links
are demanded. They are referenced by URLs which are relative to the
expected URL of the page, but since it's being displayed via another
page they will be wrong. It is therefore necessary to fix the base-URL
expectation by injecting a BASE HTML tag. This was done immediately
prior to presentation:
echo preg_replace ('/<HEAD>/i', '<HEAD><base href="' . $url . '/">', $c);
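The pattern above only matches a HEAD tag that carries no attributes. A slightly more tolerant variant is sketched below; the function name inject_base is hypothetical, and the choice to escape the URL and to replace only the first match is an assumption about what is wanted rather than a description of the original script.

```php
<?php
// Hypothetical helper: insert a <base> tag directly after the opening
// <head ...> tag so that relative URLs resolve against $baseUrl.
function inject_base($html, $baseUrl) {
    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '/">';
    // Matches <head> with or without attributes, but not <header>.
    // A callback is used so that no character in $tag is interpreted
    // as a back-reference; the limit of 1 replaces the first match only.
    return preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function ($m) use ($tag) { return $m[0] . $tag; },
        $html,
        1
    );
}
```

Usage would then be echo inject_base ($c, $url); in place of the preg_replace call.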
Discussion
It seems to be a common failing amongst modern web-sites that disingenuous complication prevents the user from doing simple things. In this instance a user may like to do something simple and non-stateful, but is required to go through a set number of procedures which cannot be bookmarked. This complexity can be hidden and the resultant URL bookmarked, but only by performing the expected activity programmatically on behalf of the user.
As with other Technical Notes, feedback is actively solicited. The
author may be contacted via the e-mail address found on his public
biography page[au]. Should something require
changing or enhancing then the fact will be acknowledged with
attribution in an Updates section.
Updates
Feedback has suggested that the following needed to be changed after this TN had been published:
- Enhanced Robustness
- The original system gathered some session information from the
Jobseekers Direct web-site to allow a further query to be made on it.
Indeed, the point of the work was to do just that. The session information
is cacheable, however, and is now preserved on a multi-visit basis and
shared between servers using the database. When this information is deemed
to have expired it is re-acquired and re-stored. The goal is to reduce
traffic between this system and Jobseekers Direct, which can be overloaded
and slow.
In addition, network timeouts and failures are now handled more gracefully, with an invitation to view the local JCP Mirror cache for a particular vacancy. The script is being heavily used by third parties, so failures need to be better addressed and less baffling to the casual user. 2011-01-07
- Changes to Jobseekers Direct
- This system relies on the Jobseekers Direct web-site, which
changes from time to time. Modifications have been made to the
http_post_fields example, above, to reflect the last change. 2010-12-13
- Syntax Error
- A minor syntax error crept in to the options example fragment during preparation. This has now been noted and fixed. 2010-09-19
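The session caching described under Enhanced Robustness might be sketched as follows. This is an illustration only: the function name cached_session is hypothetical, a plain array stands in for the shared database, and the expiry rule (a fixed lifetime in seconds) is an assumption, since the note does not say how expiry is decided.

```php
<?php
// Hypothetical cache: $store stands in for the shared database table.
// $acquire is whatever fetches fresh session data from the remote site.
function cached_session(array &$store, $key, $ttl, $now, $acquire) {
    if (isset($store[$key]) && $now - $store[$key]['fetched'] < $ttl) {
        return $store[$key]['value'];   // still fresh: no remote traffic
    }
    $value = $acquire();                // absent or expired: re-acquire
    $store[$key] = array('value' => $value, 'fetched' => $now);
    return $value;                      // re-stored for later visits
}
```

Passing the current time in as a parameter, rather than calling time() inside, keeps the expiry logic easy to exercise without waiting.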
References
References found in this section, and in particular the HTML links, were correct at time of writing (2010-03-11).
- [au]. Martin Sullivan:
- http://www.zois.co.uk/people/martin_sullivan
- [nj]. The Unofficial National Jobcentre Plus Mirror:
- http://home.zois.co.uk/jcpnational.html
- [cj]. The Unofficial Cockermouth Jobcentre Plus Mirror:
- http://home.zois.co.uk/jcp.html
- [jd]. Jobseekers Direct:
- http://jobseekers.direct.gov.uk
- [pr]. Regular Expressions (Perl-Compatible):
- http://www.php.net/manual/en/book.pcre.php
- [ht]. PHP HTTP Extension:
- http://www.php.net/manual/en/book.http.php
- [pe]. What is PECL?
- http://pecl.php.net
- [rp]. Jouravlev M (2004) "Redirect After Post"
- http://www.theserverside.com/tt/articles/article.tss?l=RedirectAfterPost
~Z~