ZOIS Technical Note TN-2007-11-01.
Author and Audience
This TN is intended for persons interested in the automatic retrieval of
bills and similar information from Utility web-sites. The program
presented works with the BT (British Telecommunications
plc)[1] billing system, and may be used as an
example in similar work. Familiarity with UNIX systems and Perl programming is
assumed, as well as an elementary knowledge of the Hyper-Text Transfer
Protocol (HTTP) and procmail(1). Written by Martin
Sullivan[2], ZOIS Limited, Cockermouth.
Abstract
A somewhat lengthy discussion of a Perl script (given its length)
which automates the collection of e-mail-notified information such as
Bills. The specific example is used for the retrieval of BT Retail
bills.
Introduction
Utilities are increasingly using their web-sites to disseminate bills. This, coupled with Direct Debit (where the Utility in question is sanctioned to draw directly on a customer's account), is proving increasingly popular.
For a variety of reasons privacy-enhanced e-mail systems have not become popular. Utilities therefore feel compelled not to send Billing information (which many consider confidential) using this medium. Instead, when the billing round is completed, they send an e-mail inviting the customer to log on to their web-site with a user-id/password pair and retrieve the bill manually. The link is usually secured by an SSL-based encryption system, but is open to abuse. Various sites do this with varying degrees of ease (to the customer), but BT's effort is arguably among the least easy. BT have decided to force their customers to use these systems by imposing a charge on their existing paper-based postal system.
It was thus decided to automate the collection of the Bill using
Perl, LWP (specifically LWP::UserAgent) and Procmail.
Materials and Platform
The bulk of the code is written in perl(1) (using version 5.8.8). The system uses the LWP::UserAgent and Mail::Internet Perl modules, both of which can be retrieved from CPAN[3]. Procmail(1) is used to trigger the running of the Perl script, and in addition the script uses pdftops(1) to re-render the Bill, presented in Adobe's PDF[4], into PostScript for printing (a PostScript-compatible printing system is naturally assumed). Pdftops is part of the xpdf(1) suite and is required because the standard pdf2ps(1) program cannot interpret BT's PDF. Perl, Procmail and Xpdf are all available on Linux and various BSD platforms. This work was done on a Red Hat Fedora Core 7 release. Other UNIX-like operating systems may require some additional porting work, and non-UNIX operating systems such as Microsoft Windows remain unexplored.
To obtain the required URLs and page-flows, a web-browser which allows inspection of the page source (HTML) is required. The author used Firefox and Lynx. Should the reader need to modify the code in response to bugs or changing circumstances, these tools are invaluable.
Tcpdump(8) has been used in the past to inspect raw HTTP
streams in TCP packets on similar work. In this instance, however,
most of the data is encrypted and the technique was of limited value.
Method
The code for the btbill.pl script is found in the
downloads[5] area. It may need adapting to the
environment of the site you are running on. You are reminded of the
source-code caveats[6] and
copyrights[7].
Procmail
The initial processing is done with procmail(1) and the following rule suffices.
# Bill notification from BT. Keep the e-mail and start a process to
# fetch the bill from BT's web-site.
:0 c
* ^From:.*ebilling@bt\.com
| perl ${HOME}/procmail/btbill.pl
When a bill is 'ready' an e-mail is received from "ebilling@bt.com"; it contains a URL which starts the process. The URL appears to contain a unique ID for this Bill. Following this link leads, after a number of automatic redirects, to an authentication page where, under the protection of SSL, the customer is invited to log in. The login page may have a user and password already filled in for the customer (based on a persistent cookie, rather than identification in the e-mail). As an aside, the use of a persistent cookie to retain authentication information is not recommended.
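The first step of this flow, pulling the starting URL out of the notification e-mail, might be sketched as follows. This is a minimal illustration and not the code of btbill.pl: the function name extract_bill_url, the sample message and its URL are all hypothetical, and the layout of the real message may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first https URL found in the message
# text, or undef if there is none. The pattern is an assumption about
# how the link appears in the HTML body of the notification.
sub extract_bill_url {
    my ($message) = @_;
    # Stop at whitespace, quotes or angle brackets, which delimit URLs
    # in both plain-text and HTML bodies.
    if ($message =~ m{(https://[^\s"'<>]+)}) {
        my $url = $1;
        $url =~ s/&amp;/&/g;    # undo HTML escaping of '&' in attribute values
        return $url;
    }
    return;
}

# Made-up sample message; the host and query string are placeholders.
my $sample = <<'END';
From: ebilling@bt.com
Content-type: text/html

<html><body><a href="https://www.example.com/bill?id=12345&amp;v=1">View bill</a></body></html>
END

my $url = extract_bill_url($sample);
print defined $url ? "$url\n" : "no URL found\n";
```

When run from procmail the message arrives on standard input, so the sample would be replaced by something like `my $message = do { local $/; <STDIN> };`.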
The authentication (apparently using CA's SiteMinder product[8]) proceeds as an HTTP POST, with a redirect to a 'target' URL which, after a number of automatic redirects, leads to a page where the customer is presented with some billing information, but not the actual bill. These URLs appear to use BEA's WebLogic Portal system, although the portal itself does not appear to serve the relevant pages. The large, complicated URLs used in this stage of the process contain the ID originally presented in the e-mail.
There are two links on the page presented immediately after authentication: one to prepare a 'summary' bill and one to prepare a 'full' bill. As far as one can tell, call details are only given in the 'full' bill, and this is the one that was selected. Rather than being held to hand, pre-prepared in a batch process, the bill is produced on demand and stored in a cache. This necessarily takes a little time and probably requires direct interaction with BT's customer database and billing systems in real-time. During this phase of the interaction the customer is asked to be patient and the page is re-displayed several times (using HTML's META REFRESH tag). A page is eventually displayed which invites the customer to download their bill. Following this link downloads the bill in Adobe's PDF, requiring additional software to view or print it. Authorization to produce the bill and download the resulting PDF depends upon an authentication session-cookie provided by the authentication system. This cookie has a very limited life-time and logging out does not seem to be required. The latter part of this process appears to involve the use of Struts[9], leading to characteristic '.do' URLs.
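The "please be patient" phase can be handled by recognising the META REFRESH tag and re-requesting the page until it no longer appears. The following is a sketch under stated assumptions rather than the logic of btbill.pl: the helper names parse_meta_refresh and wait_for_final_page, the canned pages and their URLs are hypothetical, the content attribute is assumed to be double-quoted, and relative URLs are not resolved against the page address.

```perl
use strict;
use warnings;

# Hypothetical helper: find a META REFRESH tag and return (delay, url).
# Returns an empty list when the page has no such tag. The url is undef
# when the tag only asks for the same page to be reloaded.
sub parse_meta_refresh {
    my ($html) = @_;
    # Isolate the <meta ...> tag that mentions http-equiv="refresh".
    my ($tag) = $html =~ m{(<meta\b[^>]*http-equiv\s*=\s*["']?refresh\b[^>]*>)}i
        or return;
    # The content attribute has the form "N" or "N; url=TARGET".
    my ($content) = $tag =~ m{content\s*=\s*"([^"]*)"}i
        or return;
    my ($delay, $url) = $content =~ m{^\s*(\d+)\s*(?:;\s*url\s*=\s*(\S+))?}i
        or return;
    return ($delay, $url);
}

# Poll until a page without a refresh tag arrives. $fetch is a code
# reference returning the HTML for a URL; in real use it would wrap an
# LWP::UserAgent request. Returns undef after $max_tries attempts.
sub wait_for_final_page {
    my ($fetch, $url, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        my $html = $fetch->($url);
        my ($delay, $next) = parse_meta_refresh($html);
        return $html unless defined $delay;    # no refresh: page is final
        sleep $delay;
        $url = $next if defined $next;
    }
    return;
}

# Canned pages standing in for the web-site (delays of 0 keep the demo quick).
my @pages = (
    '<html><head><meta http-equiv="refresh" content="0; url=/wait.do"></head></html>',
    '<html><head><meta http-equiv="refresh" content="0"></head></html>',
    '<html><body><a href="/bill.pdf">Download your bill</a></body></html>',
);
my $final = wait_for_final_page(sub { shift @pages }, '/start.do', 5);
print defined $final ? "final page reached\n" : "gave up waiting\n";
```

Passing the fetch step in as a code reference keeps the polling logic testable without network access; a cap on the number of attempts avoids waiting forever if the site never produces the bill.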
To enable this billing system the customer has to sign up for a login ID, providing a valid billing account code and a host of demographic data that BT already has (presumably of use in fraud investigations, for this information does not appear to be checked immediately). The login page also has facilities for forgotten passwords and so forth.
The script not only saves the resulting PDF bill, but prints it out
too. To do this the pdftops(1) program is required, because the
standard script, pdf2ps(1), does not understand the variety of
PDF involved. Pdftops is part of the xpdf(1) PDF viewing suite
and is readily available on a number of UNIX platforms.
Discussion
Should the reader desire to automate a similar dialogue, it is necessary to use the source-viewing option of the chosen web-browser, saving each page, noting where it appeared in the dialogue and then inspecting it for link information. Some of this information has to be inferred indirectly, such as the use of redirection after a POST and how cookies are exploited. The ability to read ECMAScript (quondam JavaScript) is sometimes necessary on such exercises too, but not in this case. When constructing similar automation it is advisable to use an automated user-agent such as LWP::UserAgent (in Perl; equivalents may be available in other languages) to do the cookie-handling and redirection that such large commercial dynamic sites go in for. Finally, the resulting HTML pages could be parsed with the Perl module Template::Extract or similar, but in these examples native Perl regular expressions were used.
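A user-agent set up for this kind of dialogue might look like the sketch below. It is an illustration under stated assumptions, not an extract from btbill.pl: the helper names make_agent and login are hypothetical, and the form field names USER and PASSWORD are placeholders that must be replaced by whatever the real login page source shows. LWP::UserAgent must be installed from CPAN, and HTTPS requests need its SSL support module as well.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that keeps cookies in memory and follows the
# redirect issued after a POST (by default LWP follows redirects only
# for GET and HEAD requests).
sub make_agent {
    my $ua = LWP::UserAgent->new(
        cookie_jar => {},    # in-memory HTTP::Cookies jar for session cookies
        timeout    => 60,
    );
    push @{ $ua->requests_redirectable }, 'POST';
    return $ua;
}

# Hypothetical login step: the URL and the form field names are
# placeholders to be read from the real login page source.
sub login {
    my ($ua, $login_url, $user, $password) = @_;
    my $response = $ua->post($login_url, {
        USER     => $user,
        PASSWORD => $password,
    });
    die 'Login failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return $response;    # the page reached after any redirects
}

# No request is made here; this only shows the redirect policy in force.
my $ua = make_agent();
print join(', ', @{ $ua->requests_redirectable }), "\n";
```

Because the same agent object carries the cookie jar, every later request made through it presents the session cookie obtained at login, which is what the bill-preparation and download steps depend on.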
The fact that the bill is delivered in PDF means that it can, if
desired, be manipulated further. A typical project may be to extract
and store the individual billing records in a database, suggesting
Perl modules CAM::PDF, Template::Extract and DBI (all available
through CPAN).
Diagnostics
With such a complicated system, using several companies' technology, apparently written by a cast of thousands and with a large dynamic component, it is inevitable that this script-automation will be broken by casual "improvements". In such instances limited diagnostic output will be found in the user's Mail/from file (if running via procmail). Should the script get as far as trying but failing to retrieve the PDF, the page it failed on will be found in the file /tmp/btbill.html.
The actual e-mail is retained for the user. In the system deployed
here at ZOIS it is actually processed further, for it
contains only HTML and is frequently read on text-only systems (such
as PDAs). Sending such HTML-only e-mails may be considered
anti-social, but is in line with the design philosophy of the rest of
the system. The following procmail rule may be considered useful:
:0
* ^Content-type:[ ]*text/html
{
  :0 hcf
  | formail -I "Content-type" -X ""; echo

  :0 bfw
  | (lynx -stdin -dump | sed -e 's/^ //'; \
     echo; echo "[[HTML-only e-mail rendered by lynx]]"; \
    )
}
As can be seen from the page-flows and the necessary HTML inspection, acquiring a telephone bill from BT's web-site involves a fair amount of automation. The script can be used as a template for further automation, both on BT's web-site and on those of other Utilities that may be tempted to use similar techniques. The script necessarily contains a password for authentication, and due diligence should be exercised with this. Just as a human user might be, the automated-login component of this script may be subject to an attempt to obtain login details by redirection to a malicious web-site, a technique known as 'phishing'. In an attempt to avoid this the script looks for, and will only use, a known BT web-site.
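A check of this kind might be sketched as below. This is not the test used in btbill.pl, whose details are not given here: the function name is_trusted_url and the single trusted host www.bt.com are assumptions made for illustration, and the list should hold the hosts actually observed in the page-flow.

```perl
use strict;
use warnings;

# Hosts the script is willing to send credentials to. The single entry
# is an assumption for illustration; use the hosts actually observed.
my %trusted_host = map { $_ => 1 } qw(www.bt.com);

# Return 1 only for https URLs whose host is in the trusted set.
sub is_trusted_url {
    my ($url) = @_;
    # Capture the authority part: everything between "https://" and
    # the next "/", "?" or "#".
    my ($authority) = $url =~ m{^https://([^/?#]+)}i or return 0;
    # Reject userinfo tricks such as https://www.bt.com@evil.example/
    return 0 if $authority =~ /@/;
    (my $host = lc $authority) =~ s/:\d+\z//;    # drop an explicit port
    return $trusted_host{$host} ? 1 : 0;
}

for my $url (
    'https://www.bt.com/login',
    'http://www.bt.com/login',
    'https://www.bt.com.evil.example/login',
    'https://www.bt.com@evil.example/login',
) {
    printf "%-45s %s\n", $url, is_trusted_url($url) ? 'ok' : 'REJECTED';
}
```

Comparing the whole host name against an exact list, rather than searching the URL for the string "bt.com", is what defeats look-alike addresses such as the last two in the demonstration. The check is best applied to every URL the script is about to send credentials or cookies to, including those reached by redirection.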
It is realised that the BT web-site is a somewhat dynamic beast and in the medium term it is unlikely that the bill-retrieval system will remain amenable to the current script. At such times we may try and fix this script (posting an update to this TN), but at best the script should be considered a template for further work.
The somewhat convoluted nature of BT's bill-distribution process
suggests that better solutions exist, involving either secure e-mail or, if
web-sites are involved, pre-prepared HTML-encoded bills delivered
directly after authentication. It is recognised that this current
system is far from satisfactory, represents a step back from automated
standards for commercial interaction between third-parties and is
probably the dumbest way of interacting with a Billing System that the
author has been involved with.
References
References found in this section, and in particular the HTML links, were correct at the time of writing (2007-11-01).
ZOIS's Copyright statement:
$Date: 2008/02/07 10:14:16 $