'Click-through' on Stateful Third Party Web-sites Using PHP

ZOIS Technical Note TN-2010-02-17

Author and Audience

The jobs-detail part of the various Unofficial Jobcentre sites allows the user to 'click through' to the official site and its description of the vacancy in question. Although at first sight this may seem trivial, it in fact proved quite challenging. This TN will interest PHP developers who are confronted with similar, gratuitously state-based web-sites and who would like to allow their users to 'click through' to an official page.

The reader is assumed to be familiar with programming techniques, particularly in PHP. Written by Martin Sullivan[au], ZOIS Limited, Cockermouth.

Abstract

A mechanism is presented, in PHP, which allows gratuitously stateful third-party pages to be displayed programmatically. These pages can then be bookmarked or their link shared.

Introduction

A number of Technical Demonstrations have been produced[nj, cj] that allow a more-or-less complete scrape of the Jobseekers Direct database to be interrogated via a simple web-based interface. Originally this was intended as a replacement for the closed Cockermouth Jobcentre Plus office, but it later evolved into a national system that would allow people to examine the postings recently made by their own local centre. Both these systems present vacancy details from a cache, with the possibility of clicking on a link to connect to the original posting, found dynamically on the Jobseekers Direct web-site[jd], which is where such things are officially kept.

One would think that a simple HTTP GET command would be sufficient to retrieve what is essentially a self-contained query; however, the Jobseekers Direct site makes no such provision. It requires that the correct navigation steps are performed beforehand and that cookies and session-ids are correctly set. While this is convoluted, it is a good deal simpler than the original Jobcentre Plus vacancy search web-site. Many web-sites now gratuitously demand state information, presumably to enhance consumer experiences by making educated guesses about what visitors would like to buy, or at least see advertised. A mechanism is presented here that automates this task and effectively allows the user to perform a simple GET.

Materials and Platform

The examples are coded in PHP throughout, although other server-side scripting languages could be used. The major component required was the family of preg_* functions, which provide Perl-like Regular Expression string manipulation[pr]. These are now part of the standard distribution. The system also required the HTTP extension[ht], which is not a standard part of the PHP distribution and must be installed from the PECL system[pe]. As ever, the primary development environment was Emacs running on Linux.

Method

The PECL extension for HTTP must be downloaded and installed. This is documented elsewhere, but may be accomplished using a distribution-based package manager such as Ubuntu's Synaptic. Root-level privileges are required to do this.

Changes at Jobseekers Direct mean that this code has been modified. See Updates.

The guts of the code is a function which enacts an HTTP POST. This effectively fills in and submits an on-screen form using the slightly complicated POST method (which, unlike GET, is not idempotent). The form contains only a single field that is visible to the user, the $reference, but it also has a number of hidden fields whose values must be acquired and sent back with the submission. The 'hidden' form variables __VIEWSTATE and, formerly but now no longer required, __EVENTTARGET, are opaque and are presented by the server in an initial interaction. For maximum flexibility the actual URL for the form-action is also obtained automatically.

// Submit the search form: one visible field plus the hidden ones.
$result = http_parse_message (
    http_post_fields ($url . "/" . $target,
        array (
            'tsmGlobal_HiddenField' => "",
            '__VIEWSTATE' => $viewstate,
            'txtSubject' => $reference,
            'txtLocation' => "",
            'ddlDistance' => "4",
            'btnSearch' => "Search",
            'ddlHours' => "70",
            'ddlType' => "0",
            'ddlAge' => "0"),
        array (), // no file uploads
        $opts));
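
Where the PECL HTTP extension is not available, a similar form POST can be made with PHP's built-in HTTP stream wrapper. The following is only a sketch of that substitute technique, not the code used on the live system: build_post_options is a hypothetical helper, and cookie handling (which the cookiesession option takes care of in the original) would have to be added by hand.

```php
<?php
// Hypothetical helper: builds stream-context options for a form POST.
// It performs no network traffic itself.
function build_post_options(array $fields, $timeout = 360)
{
    return array(
        'http' => array(
            'method'          => 'POST',
            'header'          => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content'         => http_build_query($fields), // URL-encode the form fields
            'timeout'         => $timeout,
            'follow_location' => 0,    // leave any 302 for the caller to handle
            'ignore_errors'   => true, // return the body even on a non-2xx status
        ),
    );
}

// Usage (a real network request, so left commented out here):
// $context = stream_context_create(build_post_options(array(
//     '__VIEWSTATE' => $viewstate,
//     'txtSubject'  => $reference,
// )));
// $body = file_get_contents($url . '/' . $target, false, $context);
// $http_response_header then holds the status line and response headers.
```

Redirect following is switched off in the sketch so that the caller can inspect the redirect itself, as the original does with the responseCode.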

As already outlined, the POST requires that information be gathered in advance. To do this, the following dialogue must first be entered into. Note that the number of redirects to follow is set explicitly, because the default is 0.

$url = "http://jobseekers.direct.gov.uk";
$opts = array (
    'cookiesession' => TRUE,
    'redirect' => 3,
    'timeout' => 360
);

// Fetch the search page; its body carries the form details needed later.
$c = http_parse_message (http_get ($url, $opts))->body;

The content, $c, can then be parsed using preg_match to obtain the variables that are of interest. As an example:

if (preg_match ("/<form.*?action=\"(.*?)\" id/", $c, $match)) {
    $action = $match[1];
} else {
    mydie ("No 'action' in the Search page form");
} // else
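
The hidden fields, such as __VIEWSTATE, can be picked out in much the same way. The following is a minimal sketch rather than the code used on the live system: hidden_field is a hypothetical helper, the sample markup is made up, and the pattern assumes that the name attribute precedes the value attribute within the tag.

```php
<?php
// Hypothetical helper: returns the value of a named <input>, or null.
// Assumes name="..." comes before value="..." inside the tag; other
// layouts would need a different pattern or a real HTML parser.
function hidden_field($html, $name)
{
    $pattern = '/<input\b[^>]*\bname="' . preg_quote($name, '/')
             . '"[^>]*\bvalue="([^"]*)"/i';
    if (preg_match($pattern, $html, $match)) {
        // Attribute values are HTML-escaped in the page source.
        return html_entity_decode($match[1], ENT_QUOTES);
    }
    return null;
}

// Made-up sample markup for illustration only.
$sample = '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK" />';
$viewstate = hidden_field($sample, '__VIEWSTATE');
```

Returning null rather than dying leaves the caller free to decide whether a missing field is fatal.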

Once the http_post_fields call has been made, a result is returned. The server code in this instance uses the "Redirect after Post, Get" (RPG) pattern[rp], more widely known as Post/Redirect/Get, to protect the non-idempotent POST interaction against possible Browser intransigence. Some big words there: reloading a page isn't supposed to resubmit POSTed data automatically, but some Browsers do it anyway. RPG is not required in this instance, but we get it anyway, possibly because of libraries or cut-and-paste programming. The code to deal with this is thus:

if ($result->responseCode == 302) {
    $c = http_parse_message (
	http_get ($url .
	    $result->headers['Location'], $opts))->body;
} else {
    mydie ("No expected redirect location");
} // else
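
The concatenation above assumes that the Location header holds a path beginning with a slash. A server may equally send an absolute URL, so a slightly more defensive version might look like the following sketch. The helper name resolve_location and the example paths are hypothetical, and only the two common cases are handled.

```php
<?php
// Hypothetical helper: turns a Location header value into an absolute URL.
// Handles only an absolute http(s) URL or a path on the same host; it is
// not a full RFC 3986 reference resolver.
function resolve_location($base, $location)
{
    if (preg_match('#^https?://#i', $location)) {
        return $location; // already absolute, use as-is
    }
    // Join base and path with exactly one slash between them.
    return rtrim($base, '/') . '/' . ltrim($location, '/');
}

// Made-up path for illustration only.
$next = resolve_location('http://jobseekers.direct.gov.uk', '/page.aspx?id=1');
```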

The HTTP GET returns the required page. In theory, all that is left to do is display it (via an echo). Unfortunately the page still requires a number of additional elements that need to be fetched at the same time: Cascading Style Sheets (CSS), various graphics and other links. They are referenced by URLs which are relative to the expected URL of the page, but since it is being displayed via another page they will be wrong. It is therefore necessary to fix the base-URL expectation by injecting a BASE HTML tag. This was done immediately prior to presentation:

echo preg_replace ('/<HEAD>/i', '<HEAD><base href="' . $url . '/">', $c);
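
The pattern above matches only a bare HEAD tag. Should the page ever emit a HEAD tag with attributes, a slightly more tolerant variant could be used. This is a sketch only: inject_base is a hypothetical helper, and the sample page is made up.

```php
<?php
// Hypothetical helper: inserts a <base> tag straight after the opening
// <head> tag, whether or not that tag carries attributes.
function inject_base($html, $base)
{
    $href = htmlspecialchars(rtrim($base, '/') . '/', ENT_QUOTES);
    $tag  = '<base href="' . $href . '">';
    // A callback is used so that nothing in the URL is treated as a
    // back-reference; the limit of 1 touches only the first <head>.
    return preg_replace_callback(
        '/<head\b[^>]*>/i',
        function ($m) use ($tag) {
            return $m[0] . $tag;
        },
        $html,
        1
    );
}

// Made-up sample page for illustration only.
$page = '<html><head id="Head1"><title>Vacancy</title></head><body></body></html>';
$out  = inject_base($page, 'http://jobseekers.direct.gov.uk');
```

The word boundary after "head" keeps the pattern from matching a header element by mistake.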

Discussion

It seems to be a common failing amongst modern web-sites that disingenuous complication prevents the user from doing simple things. In this instance a user may like to do something simple and non-stateful, but is required to go through a set number of procedures which cannot be bookmarked. This complexity can be hidden and the resultant URL bookmarked, but only by performing the expected activity programmatically on behalf of the user.

As with other Technical Notes, feedback is actively solicited. The author may be contacted via the e-mail address found on his public biography page[au]. Should something require changing or enhancing then the fact will be acknowledged with attribution in an Updates section.

Updates

Feedback has suggested that the following needed to be changed after this TN had been published:

Enhanced Robustness
The original system gathered some session information from the Jobseekers Direct web-site to allow a further query to be made on it. Indeed, the point of the work was to do just that. The session information is cacheable, however, and is now preserved on a multi-visit basis and shared between servers using the database. When this information is deemed to have expired it is re-acquired and re-stored. The goal is to reduce traffic between this system and Jobseekers Direct, which can be overloaded and slow.

In addition, network timeouts and failures are now handled more gracefully, with an invitation to view the local JCP Mirror cache for a particular vacancy. The script is being heavily used by third parties, and failures need to be better addressed and less baffling to the casual user. 2011-01-07
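
The expire-and-re-acquire logic for the cached session information might be sketched as follows. This is an illustration under stated assumptions, not the production code: session_info is a hypothetical helper, the $store array stands in for the shared database, and $acquire stands in for whatever fetches fresh details from Jobseekers Direct.

```php
<?php
// Hypothetical sketch: $store stands in for the shared database record,
// $acquire for the code that fetches fresh session details from the site.
function session_info(array &$store, $acquire, $ttl, $now)
{
    if (isset($store['info']) && $now - $store['fetched'] < $ttl) {
        return $store['info'];                    // still fresh: no remote traffic
    }
    $store['info']    = call_user_func($acquire); // expired or absent: re-acquire
    $store['fetched'] = $now;                     // and re-store with a new timestamp
    return $store['info'];
}

$store = array();
$calls = 0;
$acquire = function () use (&$calls) {
    $calls++;                                     // count the simulated remote fetches
    return array('viewstate' => 'opaque-' . $calls);
};

$first  = session_info($store, $acquire, 600, 1000); // nothing cached: fetches
$second = session_info($store, $acquire, 600, 1300); // 300s old: served from cache
$third  = session_info($store, $acquire, 600, 1700); // 700s old: fetched again
```

Passing the current time in as a parameter keeps the expiry rule easy to exercise without waiting for a real clock.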

Changes to Jobseekers Direct
This system relies on the Jobseekers Direct web-site, which changes from time to time. Modifications have been made to the http_post_fields example, above, to reflect the last change. 2010-12-13

Syntax Error
A minor syntax error crept into the options example fragment during preparation. This has now been noted and fixed. 2010-09-19

References

References found in this section, and in particular the HTML links, were correct at the time of writing (2010-03-11).

[au]. Martin Sullivan:
http://www.zois.co.uk/people/martin_sullivan
[nj]. The Unofficial National Jobcentre Plus Mirror:
http://home.zois.co.uk/jcpnational.html
[cj]. The Unofficial Cockermouth Jobcentre Plus Mirror:
http://home.zois.co.uk/jcp.html
[jd]. Jobseekers Direct:
http://jobseekers.direct.gov.uk
[pr]. Regular Expressions (Perl-Compatible):
http://www.php.net/manual/en/book.pcre.php
[ht]. PHP HTTP Extension:
http://www.php.net/manual/en/book.http.php
[pe]. What is PECL?
http://pecl.php.net
[rp]. Jouravel M (2004) "Redirect After Post"
http://www.theserverside.com/tt/articles/article.tss?l=RedirectAfterPost

~Z~


Date: 2011-01-07

