Scraping Jobseekers Direct
ZOIS Technical Note TN-2010-03-13.
Author and Audience
The Jobseekers Direct web-site[jd]
is scraped so that its content can be put to better use. The reader is
assumed to be familiar with programming techniques, particularly in Perl.
Written by Martin Sullivan[au], ZOIS Limited,
Cockermouth.
Abstract
The Jobseekers Direct web-site is scraped in its entirety to
provide feed-stock for a variety of systems that will deliver timely,
local and relevant vacancy notification. This Technical Note describes
the strategy employed and the technology used, and includes fragments of
source-code. The current scraping system is available from the author
at no cost, as are nightly feeds of the entire results of about a
week's worth of scraping.
Introduction
Jobcentre Plus is a UK government inspired set of offices where job vacancies are advertised and the unemployed may access services both to claim benefits and to aid their return to gainful employment. This has, over time, been augmented by a fairly useless web-site which has now undergone a revamp and been renamed "Jobseekers Direct".
While this web-site has its uses it fails to address the traditional needs of looking for a job relatively close to home. It seems to revel in displaying jobs from far, far, away, often with posting dates some considerable time in the past. Generally it has been found that postings by a particular Jobcentre Office are a better geographical guide than trying to use Postcodes in the official web-search form.
Starting with the area around Cockermouth, where ZOIS is based, a scraping system was developed against the original Jobcentre Plus web-site, at first purely for internal use. This was eventually published as "The Unofficial Cockermouth Jobcentre Plus Mirror"[cj] and left as a Technological Demonstrator. Although never documented fully, this system scraped the vacancies that the Jobcentre Plus web-site returned for the area 50 miles around Cockermouth in response to a Postcode. The jobs returned were stored in a database by their reference, which was taken to be unique across the Jobcentre Plus web-site. This reference was then used to identify the truly local, live vacancies and display them on a web-site. The database was also used to ensure that vacancy details were not needlessly re-downloaded once they had been seen. This included a large number of vacancies that were remote from Cockermouth, but were nevertheless delivered by Jobcentre Plus as though they were local and relevant.
Prompted by some pro-bono work, done for a client in Birmingham, a national version was envisaged. This was work in progress when the Jobcentre Plus web-search site closed, to be replaced by the "Jobseekers Direct" one (in March, 2010). The scraper was thus recoded to use this new site which, like the old one, continued to deliver geographically and temporally irrelevant jobs in response to simple queries, albeit more reliably.
In response to the relatively poor delivery of jobs tied to a
particular area, a strategy was developed that involved the guessing of
unseen vacancy references. These were then used to speculatively probe
for a real vacancy and to back-fill on vacancies that may have been
missed. This guessing was done on a Jobcentre office-by-office basis.
Materials and Platform
The bulk of the code is written in perl(1) (using version 5.8.8). The system uses the LWP::UserAgent and DBI Perl modules, the latter with its DBD::Pg driver, all of which can be retrieved from CPAN[cp]. The database management software is PostgreSQL[pg]. Perl and PostgreSQL are available on Linux and various BSD platforms. This work was done on an Ubuntu Linux release; other UNIX-like operating systems may require some additional porting work. Non-UNIX operating systems such as Microsoft Windows remain unexplored.
To obtain the required URLs and page-flows, a web-browser which allows
inspection of the page source and the HTTP traffic is required. The
author used Firefox (with the Firebug plug-in) and Lynx. If the reader
desires to do similar work then these tools are invaluable. Tcpdump(8) was
also used to inspect raw HTTP streams in TCP packets. The
-A ASCII dump option was particularly useful in this
respect.
Method
Each posting made through Jobseekers Direct actually comes from an older, non-web based system called the Labour Market System (LMS). Vacancies in LMS are identified by a three-letter code for the posting office and a serial number. It is considered safe to assume that the office code is unique and the serial number always increments. These properties are used to identify the particular Jobcentre which is responsible for the vacancy in question. By a laborious sequence of searches and partially automated scrapes it was possible to identify the actual Jobcentre Office from this code. It was also possible to identify each office's Postcode and address, although nearly all of them use a centralised call-centre as a telephone end-point. Some Jobcentres appear to have closed during this process and thus their address may be the same as that of the office that has replaced them. Some offices seem to be virtual, in that it is not clear what geographical location they represent. Fortunately, however, since it is the premise that this work is based on, most offices have a physical location and deal with vacancies from their tightly defined area. In this respect the office code is a better indication of a job's location than the Postcode. Job Postcodes seem to be haphazard in their entry and often reflect a distant head office or agency.
The scrape works broadly by selecting each active Jobcentre Plus office, finding its Postcode, then searching for all vacancies found within the maximum 15 mile radius using the Jobseekers Direct web-search. This yields a wide selection of vacancies in summary form which are then inspected to see if they have been noted before. If not, then the details of the vacancy are downloaded, parsed and stored. The act of searching sets up cookies and session identification information to allow synthesised queries based on 'guessed' job-references as though they had been presented in the summary page. The Jobcentre's code and the maximum reference are then used to probe for new postings that have not been notified in the summary. Failures are not noted and, currently, a maximum of three such failures are allowed for each office. This strategy allows the scraper to 'probe' for new vacancies, but not excessively so. Once this has been completed a further scan of the database is done to find 'missing' vacancies. Again a synthesised reference is used to probe for these jobs, as though they had been presented on the summary page. Failures in this instance are noted. Vacancies are apparently frequently withdrawn before they make the web-site.
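The reference structure described above can be illustrated with a minimal sketch. It assumes that a reference is written as the three-character office code, a slash and the serial number, which is inferred from the substitution patterns used elsewhere in this note. The split_reference name and the "ABC" office code are hypothetical.

```perl
use strict;
use warnings;

# Split an LMS-style reference such as "ABC/12345" into its parts.
# The "three characters, slash, digits" layout is an assumption.
sub split_reference {
    my ($reference) = @_;
    return unless defined ($reference)
        && $reference =~ m{^(\w{3})/(\d+)$};
    return ($1, $2 + 0);    # office code, numeric serial
}

my ($office_code, $serial) = split_reference ("ABC/12345");
print "$office_code $serial\n";
```

Keeping the serial as a number makes the later "maximum reference" and gap arithmetic straightforward.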
It is anticipated that new Jobcentre offices will have new codes and will be caught incidentally in the initial query. As vacancies in the tracking database are withdrawn after about 28 days it is expected that newly closed Jobcentre offices would quietly disappear.
Each 'active' Jobcentre office is scanned once and there are currently (March, 2010) in excess of 700 of them. Each office is treated separately, with a new web-session being set up in each case. The complete scan takes several hours; it starts at 18h00 each evening and runs only once. About two Jobcentre office scans per night are reported with server errors (HTTP code 500). They are not retested; the next night's scan is expected to re-scan them.
What follows are these processes examined in greater detail. Firstly, get your offices:
my $postcodes_select = $dbh->prepare_cached (q{
select distinct o.office_code, o.Postcode
from jcp_office o, jcp j
where j.office_code = o.office_code
and o.Postcode is not null
});
The observant will note that there has to be a Postcode. One of the chores of maintaining the site is to find the Postcode of new or only-recently observed offices. The list of offices is then used to acquire jobs, on a per-office basis:
$postcodes_select->execute ()
or die "DBI::execute: $DBI::errstr";
$postcodes_select->bind_columns (undef, \$office_code, \$postcode);
SCRAPE: while ($postcodes_select->fetch ()) {
eval {
&acquire_jobs ($postcode, $office_code);
};
if ($@) {
print "$office_code ($postcode) has had an error ";
print "(reported elsewhere). However, we'll soldier on ... \n";
next SCRAPE;
} # if
} # while
Acquiring the jobs is a matter, initially, of emulating the user typing in a generalised query, "all jobs" for the desired Postcode. The following preliminary set up is required:
my $ua = LWP::UserAgent->new ();
$ua->cookie_jar({});
$ua->timeout (360); # double it
$ua->default_headers->push_header ('Accept' =>
"text/html,application/xhtml" .
"+xml,application/xml");
$ua->default_headers->push_header ('Accept-Language' => "en-gb,en");
$ua->default_headers->push_header ('Accept-Charset' => "utf-8");
$ua->default_headers->push_header ('Keep-Alive' => "300");
$ua->default_headers->push_header ('Proxy-Connection' => "keep-alive");
push @{ $ua->requests_redirectable }, 'POST'; # squiffy redirects
# after POST.
Most of this is self-explanatory. The next step is to set up the query form page. This is a fairly simple looking GET, but along the way LWP will set up the necessary session identification information that will allow further interaction with the Jobseekers Direct web-site:
$doc = $ua->get ($url);          # always returns a response object
die $doc->status_line unless $doc->is_success;
The form to request jobs is extracted from the resultant page. As an example:
unless ($doc->content =~ /<form.*?action="(.*?)" id/) {
&save_content ($doc->content);
die "Server Error: Front Page, no action";
} # unless
The 'Eventtarget' information is no longer required on this site, but has been left in the TN as an example. 2011-06-17
With the relevant content for the form, including the all important 'viewstate' and 'eventtarget' information, the HTTP POST is then sent:
$doc = $ua->post ($url . $action,
    {
        '__VIEWSTATE' => $viewstate,
        '__EVENTTARGET' => $eventtarget,
        'uctlHCCDialogue:txtUserInput' => "all jobs " .
            $postcode,
        'uctlHCCDialogue:btnSearch' => "submit"
    });
# post() always returns a response object, so test it explicitly
die $doc->status_line unless $doc->is_success;
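The note does not show how the hidden form fields are obtained. A minimal sketch of one way to pull them out of the query page follows. The hidden_field name, the sample markup and the attribute order (name before value) are all assumptions, and the real page may differ.

```perl
use strict;
use warnings;

# Return the value of a hidden <input> with the given name, or undef.
# Assumes the name attribute precedes the value attribute.
sub hidden_field {
    my ($html, $name) = @_;
    return $html =~ /<input\b[^>]*?\bname="\Q$name\E"[^>]*?\bvalue="([^"]*)"/is
        ? $1 : undef;
}

# A made-up fragment standing in for the real query page.
my $html = <<'HTML';
<form name="Form1" method="post" action="Search.aspx" id="Form1">
<input type="hidden" name="__EVENTTARGET" value="" />
<input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
</form>
HTML

my $viewstate   = hidden_field ($html, '__VIEWSTATE');
my $eventtarget = hidden_field ($html, '__EVENTTARGET');
print "viewstate=$viewstate eventtarget='$eventtarget'\n";
```

The `[^>]*?` parts keep each match inside a single tag, so a name found in one input is never paired with a value from another.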
This returns a set of summary pages. Each page is scanned for
jobs that we've not seen before and if they've not been seen then
they're downloaded using the fetch_parse_store
subroutine.
do {
my @jobs = ($page =~ /<a\s+
id="dgResultList__ctl\d+_hplJobTitle"
.*?
href="
(
.*?
)
">
/xgs);
foreach my $job (@jobs) {
&fetch_parse_store ($ua, $url, $job, 0)
unless &seen_already ($job);
$last_job = $job; # preserve this for future use ...
} # foreach
$page = &next_page ($ua, $url, $page);
} until (!$page);
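To show what the pattern above captures, here is a small sketch run against a made-up summary fragment. The href layout, the sample job titles and the %seen hash (a stand-in for the database look-up done by seen_already) are assumptions for illustration only.

```perl
use strict;
use warnings;

# Made-up summary page fragment; the href layout is an assumption.
my $page = <<'HTML';
<a id="dgResultList__ctl2_hplJobTitle" class="t" href="detail.aspx?sid=1&amp;j=ABC/1001">Cleaner</a>
<a id="dgResultList__ctl3_hplJobTitle" class="t" href="detail.aspx?sid=1&amp;j=ABC/1003">Driver</a>
HTML

# Same pattern as the scraper: capture the href of each job-title link.
my @jobs = ($page =~ /<a\s+
                      id="dgResultList__ctl\d+_hplJobTitle"
                      .*?
                      href="
                      (
                       .*?
                      )
                      ">
                     /xgs);

# Hash stand-in for seen_already: reference => already stored.
my %seen = ('ABC/1001' => 1);
foreach my $job (@jobs) {
    my ($reference) = $job =~ m{j=(\w\w\w/\d+)};
    next if $seen{$reference};
    print "would fetch $job\n";
}
```

With the `g` modifier in list context the match returns every captured href, one element per job link on the page.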
When all this has been done a start is made to speculatively probe
forward with references that may exist but have not been notified. The
crucial function fetch_parse_store is described later in
this Technical Note.
Firstly the maximum reference for the particular Jobcentre Office is found and new references are created for that particular office, assuming that the reference is a concatenation of the office's code and a perpetually incrementing serial number:
# Probe into the future
my $failures = 0;
$max_ref =~ s/$office_code\///;
my $ix = $max_ref;
while ($failures < 3) {
$ix++;
my $reference = $office_code . "/" . $ix;
$last_job =~ s/j=\w\w\w\/\d+/j=$reference/;
$failures += &fetch_parse_store ($ua, $url, $last_job, 1)
unless &seen_already ($last_job);
} # while
Although it is not expected to be present, we check that the vacancy has not already been noted, since the operation may be performed in parallel at some stage in the future. There is an allowance for three failures, which need not be consecutive, at this probing stage. It is felt that more would be excessive and the Jobseekers Direct servers are busy enough as it is. The reference is presented to the Jobseekers Direct web-site as if it had been noted in a search summary page, by modifying a real existing query. Session information is retained, but it is assumed that only this Jobcentre office can be probed this way.
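The failure budget can be tried out without touching the network. The sketch below replaces fetch_parse_store with a stub, probe, that returns 1 on failure and 0 on success, which is the convention the loop above appears to rely on. The %exists table, the probe_forward name and the references are invented for illustration.

```perl
use strict;
use warnings;

# Pretend server: only these references exist. Purely illustrative data.
my %exists = map { $_ => 1 } qw(ABC/1004 ABC/1006);

# Stand-in for fetch_parse_store: 1 on failure, 0 on success.
sub probe {
    my ($reference) = @_;
    return $exists{$reference} ? 0 : 1;
}

# Probe forward from the highest known serial until the failure budget
# is used up; failures accumulate and are never reset.
sub probe_forward {
    my ($office_code, $max_serial, $budget) = @_;
    my ($failures, $ix, @found) = (0, $max_serial);
    while ($failures < $budget) {
        $ix++;
        my $reference = "$office_code/$ix";
        if (probe ($reference)) { $failures++ }
        else                    { push @found, $reference }
    }
    return @found;
}

print join (", ", probe_forward ("ABC", 1003, 3)), "\n";
```

Because the count is cumulative, a run of successes does not extend the budget: the loop always stops after the third miss for that office.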
In the normal course of events, both the returned search summary pages and the back-fill searches will return jobs that may no longer be present when the detail is asked for. In those instances the job reference is marked as being found, but no longer live. The job is thus "already seen" and no longer asked for. In this forward probe phase, however, the failures are not noted as such, for they may require repeating at some stage in the future.
Finally, when the probing phase is finished the database is examined and any gaps filled in. Caution has to be exercised in the selection of the minimum reference, as Jobseekers Direct are in the habit of re-posting some very old vacancies. If these were taken at face value, it would prompt the re-downloading of years of expired vacancies. As a sanity check, no more than 500 vacancies will be retrieved per office in this phase. The number is arbitrary and considered sufficiently large. This is the current query to select the minimum reference for consideration:
my $min_reference_select = $dbh->prepare_cached (q{
select min (reference)
from jcp
where office_code = ?
and live
and added >= date('yesterday')});
Only those vacancies with a recent Jobcentre posting-date ('added') are considered. Then:
$max_ref =~ s/$office_code\///;
$min_ref =~ s/$office_code\///;
if ($min_ref + 500 < $max_ref) {
print "$office_code ($min_ref -> $max_ref) ";
print "Unfeasibly large gap!\n";
return;
} # if
for ($ix = $min_ref + 1; $ix < $max_ref; $ix++) {
my $reference = $office_code . "/" . $ix;
$last_job =~ s/j=\w\w\w\/\d+/j=$reference/;
&fetch_parse_store ($ua, $url, $last_job, 0)
unless &seen_already ($last_job);
} # for
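The back-fill logic can be sketched on its own with a hash standing in for the database. The missing_serials name, the sample serial numbers and the hash of seen serials are hypothetical. The limit argument mirrors the 500-vacancy sanity check.

```perl
use strict;
use warnings;

# List the serials strictly between $min and $max that are not in the
# set of already-seen serials. Returns an empty list when the span
# exceeds $limit, mirroring the "unfeasibly large gap" check.
sub missing_serials {
    my ($min, $max, $seen, $limit) = @_;
    return () if $min + $limit < $max;
    return grep { !$seen->{$_} } ($min + 1 .. $max - 1);
}

my %seen = map { $_ => 1 } (1001, 1002, 1005);   # illustrative data
my @gaps = missing_serials (1000, 1006, \%seen, 500);
print "@gaps\n";
```

Each serial returned would then be combined with the office code and probed as though it had appeared on a summary page.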
Central to much of this work is fetch_parse_store. The
subroutine is responsible for retrieving the details of a vacancy,
decomposing it and storing it in the database.
The HTTP interaction is, at first glance, a relatively simple GET:
$doc = $ua->get ($url . $job);   # always returns a response object
die $doc->status_line unless $doc->is_success;
However, the URL and associated cookies contain tracking information and there must be a valid corresponding session on the server. It is simply not possible to construct a URL and retrieve a valid vacancy from Jobseekers Direct (nor from its predecessor). This has, incidentally, resulted in some elaborate click-through technology on the web interface[ct]. So the subterfuge of making a generalised query to get the information is entirely necessary.
Once obtained the resultant page is sanity checked. Negative results are noted and should the reference no longer be valid it is noted as such in the database too.
The page is decomposed by a number of Perl Regular expressions, in
particular using a generic search encapsulated in the
get_thing subroutine.
sub get_thing {
my ($page, $thing) = @_;
if ($page =~ /<h6.*?>\s*$thing\s*<\/h6>
.*?
<p.*?>
(
.*?
)
<\/p>
/sx) {
return ($1);
} else {
return (undef);
} # else
} # get_thing
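As a usage sketch, get_thing is repeated here so that the block runs on its own and is applied to a made-up detail-page fragment. The heading names and values are hypothetical and not taken from the real site.

```perl
use strict;
use warnings;

sub get_thing {            # as defined in this note
    my ($page, $thing) = @_;
    if ($page =~ /<h6.*?>\s*$thing\s*<\/h6>
                  .*?
                  <p.*?>
                  (
                   .*?
                  )
                  <\/p>
                 /sx) {
        return ($1);
    } else {
        return (undef);
    }
}

# Made-up detail-page fragment; headings and values are hypothetical.
my $page = <<'HTML';
<h6 class="k">Wage</h6>
<p class="v">Meets national minimum wage</p>
<h6 class="k">Location</h6>
<p class="v">COCKERMOUTH, CUMBRIA</p>
HTML

print get_thing ($page, "Location"), "\n";
```

One point to watch: because the pattern uses the `x` modifier, any white space inside `$thing` is ignored when it is interpolated. A heading containing spaces would probably need quoting, for example with `\Q$thing\E`.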
Prior to insertion in the database the extracted data is cleaned up with respect to embedded tags and HTML-isms. The data is then put in the database using a fairly standard SQL insert, which is automatically committed.
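The clean-up step is not shown in this note. A minimal sketch of the kind of tidying meant might look like the following. The clean_text name and the small entity table are assumptions, and a module such as HTML::Entities covers far more cases.

```perl
use strict;
use warnings;

# Strip tags and decode a handful of common entities. This small table
# is illustrative only.
my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');

sub clean_text {
    my ($text) = @_;
    $text =~ s/<br\s*\/?>/\n/gi;        # line breaks first
    $text =~ s/<[^>]+>//g;               # then any other tag
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    $text =~ s/[ \t]+/ /g;               # collapse runs of blanks
    $text =~ s/^\s+|\s+$//g;             # trim both ends
    return $text;
}

print clean_text ("  <b>Fish &amp; Chip</b> shop<br />Part&nbsp;time "), "\n";
```

Entities are decoded only after the tags have been removed, so an escaped angle bracket in the text is not mistaken for markup.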
It is the nature of things on these scraped web-sites that change
is both inevitable and frequent. The source code to the scraping
script is therefore under constant review. Indeed it has recently been
practically rewritten as the Jobcentre vacancies were moved from their
previous home to their new direct.gov.uk home. And indeed various
noises have been made that it will move again to a more friendly site,
reminiscent of 'Facebook' or other social sites. Since the complete
source code can't be guaranteed to work at any given time, it has
been decided not to put all of the scripts involved on any public
web-site. Should folk want to do similar things to this scrape then
please contact the author, who would be happy to assist.
Discussion
The author has gone to some lengths to scrape the Jobseekers Direct web-site for all the jobs posted to this government sponsored system. Jobseekers Direct is now responsible for Jobcentre Plus's vacancy notification presence on the web, covering England, Scotland and Wales. Northern Ireland appears to have recently been given a new, separate system[ni]. Thus, Northern Irish job vacancies can be considered the target of future work.
Both the scraping system and the Jobseekers Direct services have, by the time this Technical Note was written, settled down. The system scrapes approximately 3,500 to 4,500 new vacancies per evening on a typical week-day, at time of writing. It is interesting to speculate on the nature of these postings. Some of these vacancies are posted multiple times over several different Jobcentres. When this occurs the assumption is that the vacancy is for the same job-role, but in several different areas, but it does seem to inflate the total. Also noted were several job vacancies which were part-time, self-employed and required that the applicant undergo training which they would pay for. While legal it was surprising to see such vacancies appearing on a government web-site. Elsewhere, vacancies were noted for such things as Professorships at prestigious Universities and well paid City Lawyer type jobs. Extremely well paid City Lawyer type jobs. These are vacancies one would not expect to be filled by the person who casually browses the Jobseekers Direct web-site or uses one of the Jobcentres' machines. It is therefore assumed that such vacancies appeared as a matter of record to pass some kind of employment legislative step and that there is no real interest in applicants from the local region.
Once all this data is gathered it should be disseminated. It should
be disseminated in a manner which is easier to use and more relevant
to various parts of the population. In the author's own experience, the
Jobseekers Direct web-site provides poorly localised searches and
frequently returns vacancies which are totally irrelevant and sometimes
quite old. At time of writing this seems to be an area of active
criticism in the national press and media[ff]. A web-based front end
focusing on searches based on a local Jobcentre has thus been produced,
based on earlier Cockermouth specific work[nj, cj]. An entire week's
worth of scrapings has been put, as raw data in Comma Separated Values
form, on the ZOIS FTP site. The author will be happy to produce
automated e-mails of such data in whole or in part to interested
parties too.
The author has, more or less, left himself hostage to his good
intentions with this work, and will maintain it, open-endedly, until
it becomes irrelevant. Over time the scrapings will accumulate
considerable amounts of historical data. Institutions looking to
examine the Social Sciences, Economic History and Geographical aspects
of this historical database can contact the author in due course.
As with other Technical Notes, feedback is actively solicited. The
author may be contacted via the e-mail address found on his public
biography page[au]. Should something require
changing or enhancing then the fact will be acknowledged with
attribution in an Update section.
Updates
Feedback has suggested that the following needed to be changed
after this TN had been published:
errstr had been misspelt. This has now been
fixed. 2010-09-22
References
References found in this section, and in particular the HTML links were correct at time of writing (2010-03-11).
- [au]. Martin Sullivan:
- http://www.zois.co.uk/people/martin_sullivan
- [nj]. The Unofficial National Jobcentre Plus Mirror:
- http://home.zois.co.uk/jcpnational.html
- [cj]. The Unofficial Cockermouth Jobcentre Plus Mirror:
- http://home.zois.co.uk/jcp.html
- [jd]. Jobseekers Direct:
- http://jobseekers.direct.gov.uk
- [ni]. JobCentre Online (NI):
- http://www.jobcentreonline.com
- [ff]. BBC News: "Job centres 'failing customers'":
- http://news.bbc.co.uk/1/low/business/8580519.stm
- [ct]. 'Click-through' on Stateful Third Party Web-sites Using PHP:
- http://www.zois.co.uk/tn/tn-2010-02-17.html
- [pg]. PostgreSQL:
- http://www.postgresql.org
- [cp]. CPAN:
- http://www.cpan.org
~Z~