Too Busy For Words - the PaulWay Blog

Wed 9th Jul, 2014

New web server, same old content.

Over the last couple of years I've been implementing bits of my website in Django. These started on my home server, and recently I moved them to a test domain on the new server my web host provided. Then they advised me that their old hardware was failing and they'd really like to move my domain off it and onto the new server.

So I took backups, and copied files, and wrote new code, and converted old Django 1.2 code which worked in Django 1.4 up to the new standards of Django 1.6. Much of the site has been 404'ing for the last couple of days as I fix problems here and there. It's still a work in progress, especially fixing the issues with URL compatibility - trying to make sure URLs that worked in the old site, in one Perl-based CGI system, work in the new site implemented in Django with a changed database structure.

Still, so far so good. My thanks once again to Daniel and Neill at Ace Hosting for their help and support.

Last updated: | path: tech / web | permanent link to this entry

Sun 8th May, 2011

A pack of improvements

I (finally) decided to make three changes to my battery pack calculator. One is to lift the maximum weight of the pack up to 500kg - this allows more realistic pack sizes for large vehicles such as utes and vans. The second is a check box to only show available cells - cells that have at least one online store selling them. This cuts out most of the unobtainable cells - cells (like Kokam) that are price-on-application are usually fairly expensive anyway and this at least gives you a ball-park figure for what the more common cells can do for you.

The final one is to allow you to sort the packs by some of the fields displayed. The critical ones are the cell type, pack weight, pack amp-hour rating, price, watt-hours stored and maximum amp delivery. After a bit of help from Khisanth in ##javascript to get the scripting working - basically, don't have a submit button named (or with an ID of) 'submit' or you override the form's natural submit() function - it works now.

I hope people find these improvements useful - I certainly am!

Last updated: | path: tech / web | permanent link to this entry

Thu 2nd Jul, 2009

Look Mum, no bugs!

I recently encountered a bug in RhythmBox where, if you rename a directory, it thinks that all the files in the old directory have disappeared and there's a whole bunch of new files. You lose all the metadata - and for me that was hours of ratings as I worked my way through my time-shiftings of the chillout stream of Digitally Imported. Worse, if RhythmBox was running during the rename, when you try to play one of those files that has 'gone missing' it will just say "output error"; when you restart it because (naturally) you think it's borked its codecs or something, it then removes all those previous entries (giving you no chance to fix the problem if you'd just renamed the directory in error).

I decided to try to be good, so I found the GNOME bugzilla and tried to search for "directory", or "rhythmbox", or anything. Every time it would spend a lot of time waiting and then just finish with a blank page. Deciding that their Bugzilla was hosed, I went and got a Launchpad account and logged it there. Then, in a fit of "but I might have just got something wrong", I went back to the Bugzilla and tried to drill down instead of typing in a keyword.

Lo and behold, when I looked for bugs relating to "Rhythmbox", it turned up in the search bar as product:rhythmbox. Sure enough, if I typed in product:rhythmbox summary:directory then it came up with bugs that mentioned 'directory' in their summary line. If you don't get one of those keywords right, it just returns the blank screen as a mute way of saying "I don't know how to deal with your search terms".

So it would seem that the GNOME bugzilla has hit that classic problem: developer blindness. The developers all know how to use it, and therefore they don't believe anyone could possibly use it any differently. This extends to asserting that anyone using it wrong is "obviously" not worth listening to, and therefore the blank page serves as a neat way of excluding anyone who doesn't know the 'right' way to log a bug. And then they wonder why they get called iconoclastic, exclusive and annoying...

Sadly, the fix is easy. If you can't find any search terms you recognise, at least warn the user. Better still, assume that all terms that aren't tagged appropriately search the summary line. But maybe they're all waiting for a patch or something...

Last updated: | path: tech / web | permanent link to this entry

Tue 24th Mar, 2009

The intangible smell of dodgy

My Dell Inspiron 6400 has been a great laptop and is still doing pretty much everything I want three years after I bought it. I fully expect that it will keep on doing this for many years to come. Its battery, however, is gradually dying - now at 39% of its former capacity, according to the GNOME power widget. So I went searching for a new battery.

I came across the page http://www.aubatteries.com.au/laptop-batteries/dell-inspiron-6400.htm, which I refuse to link to directly. It looks good to start with, but as you study the actual text you notice two things. Firstly, it looks like no one fluent in Australian English ever wrote it - while I don't mind the occasional bit of Chinglish, this seems more likely to have been fed through a cheap translator program. Secondly, it seems obvious that the words "Dell Inspiron 6400 laptop" have been dropped into a template without much concern for their context. Neither of these inspires confidence.

I was briefly tempted to write to the site contact and mention this, but as I looked at some of the other search results it became increasingly obvious that this was one of a number of very similar sites, all designed a bit differently but using the same text and offering the same prices. This set off a few more of my dodginess detectors and I decided to look elsewhere.

Last updated: | path: tech / web | permanent link to this entry

Tue 25th Nov, 2008

Random randomness

The SECRET_KEY setting in Django is used as a 'salt' in (one would hope) all hash calculations. When a new project is created, a piece of code generates a new random key for that site. I'd seen a couple of these and noted, in passing, that they seemed to have an unusually high amount of punctuation characters. But I didn't give it much thought.

Recently I had to generate a new one, and found a couple of recipes quite quickly. The routine (in Python) is:

from random import choice
print ''.join([choice('abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)') for i in range(50)])
(Aside: note how Python's idea that line breaks have grammatical meaning in the source code has meant that one-liners are now back in style? Wasn't this supposed to be the readable language? Weren't one-liners supposed to be a backward construction used in stupid languages? Is that the sound of a thousand Pythonistas hurriedly explaining that, yes, you can actually break that compound up into several lines, either on brackets or with \ characters or by partial construction? Oh, what a pity.)

Anyway. A friend of mine and I noted that it seemed a little odd that the upper case characters weren't included in the string. Maybe, we reasoned, there was some reason that they didn't include these characters (and the punctuation that isn't on the numeric keys). But, looking through the actual changeset that combined all the various salts and secrets into one thing, and looking at where the secret key was used in the code, it seems that it's always fed into the md5 hasher. This takes bytes, basically, so there was no reason to limit it to any particular character subset.
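To illustrate the point, here's a minimal sketch (using hashlib, the modern home of the md5 module, rather than Django's actual hashing code): the digest is exactly as happy with upper case letters and odd punctuation as with the restricted character set.

import hashlib

for key in ('abc123!@#', 'ABC}|~"\'\\123'):
    # md5 operates on raw bytes, so any character at all is fine in the key
    print hashlib.md5(key + 'value being protected').hexdigest()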

So my preferred snippet would be:

from random import choice
s = [chr(i) for i in range(32,39) + range(40,127)]
print ''.join([choice(s) for i in range(50)])
So you can at least read your secret key, and it doesn't include the single quote character (ASCII 39) that would terminate the string early. The update to the original functionality is in ticket 9687, so let's see what the Django admins make of it.

Last updated: | path: tech / web | permanent link to this entry

Fri 7th Nov, 2008

Don't Repeat Yourself - Next Generation

I've become a big fan of Django, a web framework that has a nice blend of Python, good design and flexibility. The template system might not appeal to the people who like to write code inside templates, but to me it forces programmers to put the code where it belongs - in the views (i.e. the controllers, to non-Djangoistas) or models. I love the whole philosophy of "Don't Repeat Yourself" in Django - that configuration should exist in one place and it should be easy to refer to that rather than having to write the same thing somewhere else. The admin system is nice, you can make it do AJAX without much trouble, and it behaves with WSGI so you can run a site in Django without it being too slow.

The one thing I've found myself struggling with in the various web pages I've designed is how to do the sort of general 'side bar menu' and 'pages list' - showing you a list of which applications (as Django calls them) are available and highlighting which you're currently in - without hard coding the templates. Not only do you have to override the base template in each application to get its page list to display correctly, but when you add a new application you then have to go through all your other base templates and add the new application in. This smacks to me of repeating oneself, so I decided that there had to be a better way.

Django's settings has an INSTALLED_APPS tuple listing all the installed applications. However, a friend pointed out that some things listed therein aren't actually to be displayed. Furthermore, the relationship between the application name and how you want it displayed is not obvious - likewise the URL you want to go to for the application. And I didn't want a separate list maintained somewhere that listed what applications needed to be displayed (Don't Repeat Yourself). I'm also not a hard-core Django hacker, so there may be some much better way of doing this that I haven't yet discovered. So my solution is a little complicated but basically goes like this:

First, you do actually need some settings for your 'shown' applications that are different from the 'silent' ones. For me this looks like:

SILENT_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
)

SHOWN_APPS = (
    ('portal', {
        'display_name'  : 'Info',
        'url_name'      : 'index',
    }),
    ('portal.kb', {
        'display_name'  : 'KB',
        'url_name'      : 'kb_index',
    }),
    ('portal.provision', {
        'display_name'  : 'Provision',
        'url_name'      : 'provision_index',
    }),
)

INSTALLED_APPS = SILENT_APPS + tuple(map(lambda x: x[0], SHOWN_APPS))
We build the INSTALLED_APPS tuple that Django expects out of the silent and shown apps, although I imagine a few Python purists are wishing me dead for the map lambda construct. My excellent defence is a good grounding in functional programming. When my site supports Python 3000 and its pythonisations of these kinds of concepts, I'll rewrite it.
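(For the curious, a generator expression does the same job without the map/lambda - this is just the idiomatic equivalent of the line above, not necessarily what I'd switch to:

INSTALLED_APPS = SILENT_APPS + tuple(app for app, opts in SHOWN_APPS)

)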

So SHOWN_APPS is a tuple of tuples containing application paths and dictionaries with their parameters. In particular, each shown application can have a display_name and a url_name. The latter relates to a named URL in the URLs definition, so you then need to make sure that your index pages are listed in your application's urls.py file as:

    url(r'^$', 'kb.views.vIndex', name = 'kb_index'),
Note the 'name' parameter there, and the use of the url() constructor function.

You then need a 'context processor' to set up the information that can go to your template. This is a piece of code that gets called before the template gets compiled - it takes the request and returns a dictionary which is added to the context going to the template. At the moment mine is the file app_name_context.py:

from django.conf import settings
from django.core.urlresolvers import reverse

def app_names(request):
    """
        Get the current application name and the list of all
        installed applications.
    """
    dict = {}
    app_list = []
    project_name = None
    for app, info in settings.SHOWN_APPS:
        if '.' in app:
            name = app.split('.')[1] # remove project name
        else:
            name = app
            project_name = name
        app_data = {
            'name'  : name,
        }
        # Display name - override or title from name
        if 'display_name' in info:
            app_data['display_name'] = info['display_name']
        else:
            app_data['display_name'] = name.title()
        # URL name - override or derive from name
        if 'url_name' in info:
            app_data['url'] = reverse(info['url_name'])
        else:
            app_data['url'] = reverse(name + '_index')
        app_list.append(app_data)
    dict['app_names'] = app_list
    app_name = request.META['PATH_INFO'].split('/')[1]
    if app_name == '':
        app_name = project_name
    dict['this_app'] = app_name
    return dict
Note the use of reverse. This takes a URL name and returns the actual defined URL for that name. This locks in with the named URL in the urls.py snippet. This is the Don't Repeat Yourself principle once again: you've already defined how that URL looks in your urls.py, and you just look it up from there. Seriously, if you're not using reverse and get_absolute_url() in your Django templates, stop now and go and fix your code.
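The same principle works on the model side: a get_absolute_url method can be built on reverse too. This is only a sketch - the KBArticle model, the 'kb_article' URL name and the article_id parameter here are made up for the example, not taken from the portal code:

from django.core.urlresolvers import reverse
from django.db import models

class KBArticle(models.Model):
    title = models.CharField(max_length=200)

    def get_absolute_url(self):
        # Look the URL up from urls.py rather than hard-coding the path here
        return reverse('kb_article', kwargs={'article_id': self.id})

Templates can then just say {{ article.get_absolute_url }} and never hard-code a path.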

We also try to do the Django thing of not needing to override behaviour that is already more or less correct. So we get display names that are title-cased from their application name, and URL names which are the application name with '_index' appended. You now need to include this context processor in the list of template context processors that are called for every page. You do this with the TEMPLATE_CONTEXT_PROCESSORS setting; unfortunately, if you leave that setting out (as it is by default) you get a default set of four very useful context processors, and overriding the setting replaces that default, so you have to list those four explicitly as well as your own. So in your settings.py file you need to further add:

TEMPLATE_CONTEXT_PROCESSORS = (
    "django.core.context_processors.auth",
    "django.core.context_processors.debug",
    "django.core.context_processors.i18n",
    "django.core.context_processors.media",
    "portal.app_name_context.app_names",
)
The most inconvenient part of the whole lot is that you now have to use a specific subclass of the Context class in every template you render in order to get these context processors working. You need to do this anyway if you're writing a site that uses permissions, so there is good justification for doing it. For every render_to_response call you make, you now have to add a third argument - a RequestContext object. These calls will now look like:

    return render_to_response('template_file.html', {
        # dictionary of stuff to pass to the template
    }, context_instance=RequestContext(request))
The last line is the one that's essentially new.

Finally, you have to get your template to show it! This looks like:

<ul>{% for app in app_names %}
<li><a class="{% ifequal app.name this_app %}menu_selected{% else %}menu{% endifequal %}"
 href="{{ app.url }}">{{ app.display_name }}</a></li>
{% endfor %}</ul>
With the appropriate amount of CSS styling, you now get a list of applications with the current one highlighted, and whenever you add an application this list will automatically change to include it. Yes, of course, the solution may be more complicated in the short term - but the long term benefits quite make up for it in my opinion. And (again in my opinion) we haven't done anything too outrageous along the way.

Last updated: | path: tech / web | permanent link to this entry

Tue 28th Oct, 2008

Cue high speed tape recorder sound effect

For the more technically minded, here is a brief synopsis of my criticisms of the "Clean Feed" 'initiative', sent in a letter to Senator Stephen Conroy:

  1. 1% false positive rate is way too high to be usable.
  2. Filtering that makes the connection 75% slower is too slow to be usable, and the faster filters have a higher false positive rate.
  3. It only blocks standard web traffic, not file sharing, chat or other protocols.
  4. If you filter HTTPS, you cripple the financial system of internet shopping, banking, and personal information (e.g. tax returns).
  5. If the Government ignores who's requesting filtered content, then those wishing to circumvent it can keep on looking with no punishment. If the Government does record who requests filtered content, then even ASIO will have a hard time searching through the mountain of false positives.
  6. We already have filtering solutions for those that want it, at no cost.
  7. Mandatory filtering leads to state-run censorship and gives an in for the Big Media Corporations to 'protect their assets' by blocking anything they like.
  8. The whole thing is morally indefensible: it doesn't prevent a majority of online abuse such as chat bullying or file trading, and it relies on the tired old 'think of the children' argument which is beneath contempt.
  9. People who assume that their children are safe under such a system, and therefore do not use other protection mechanisms such as watching their children or providing appropriate support, are being lulled into a false sense of security.
Instead, the Government should either put the money toward the National Broadband Network programme, or run their own ISP with the clean feed technology to compete with the regular ISPs.

Regards, Paul.

I urge every Australian to write to Senator Conroy and/or their local Member of Parliament on this issue - it is one we cannot afford to be complacent about!

Last updated: | path: tech / web | permanent link to this entry

Tue 14th Oct, 2008

Hacking the LCA registration process for fun and, er, fun

With the way tickets went for Linux Conf AU this past year, and having been paid today, I decided to get my registration in early. Once again I noted they had continued the fine tradition of having a random silly message per registrant. Once again I decided to hack it to make it say what I wanted it to say.

Needless to say, they raised the bar this year. Up until 2007 it was just a hidden field in the form. In 2008 they added a checksum - this delayed me a good five minutes while I worked out how they'd generated it. This year they've upped the ante, including both a different checksum and adding a salt to it. Another five minutes' playing with bash revealed the exact combination of timestamp, delimiter, and phrase necessary to get a correct checksum. I am also made of cheese.

Naturally, don't bother emailing me to find out how I did it; the fun is in the discovery!

Last updated: | path: tech / web | permanent link to this entry

Wed 20th Aug, 2008

Error Message Hell

If there's one thing anyone that works with computers hates, it's an error message that is misleading or vague. "Syntax Error", "Bad Command Or File Name", "General Protection Fault", and so forth have haunted us for ages; kernel panics, strange reboots, devices that just don't seem to be recognised by the system, and programs mysteriously disappearing likewise. The trend has been to give people more information, and preferably a way to understand what they need to do to fix the problem.

I blog this because I've just been struggling with a problem in Django for the last day or so, and after much experimentation I've finally discovered what the error really means. Django, being written in Python, of course comes with huge backtraces, verbose error messages, and neat formatting of all the data in the hopes that it will give you more to work with when solving your problem. Unfortunately, this error message was both wrong - in that the error it was complaining about was not actually correct - and misleading - in that the real cause of the error was something else entirely.

Django has a urls.py file which defines a set of regular expressions for URLs, and the appropriate action to take when receiving each one. So you can set up r'/poll/(?P<poll_id>\d+)' as a URL, and it will call the associated view's method and pass the parameter poll_id to be whatever the URL contained. In the spirit of Don't Repeat Yourself, you can also name this URL, for example:

url(r'/poll/(?P<poll_id>\d+)', 'view_poll', name = 'poll_view_one')

And then in your templates you can say:

<a href="{% url poll_view_one poll_id=poll.id %}">{{ poll.name }}</a>

Django will then find the URL with that name, feed the poll ID in at the appropriate place in the expression, and there you are - you don't have to go rewriting all your links when your site structure changes. This, to me, is a great idea.

The problem was that Django was reporting "Reverse for 'portal.address_new_in_street' not found." when that URL was clearly listed in a clearly working urls.py file. Finally, I started playing around with the expression, experimenting with what would and wouldn't match. In this case, the pattern was:

new/in/(?P<suburb_id>\d+)/(?P<street_name>[A-Za-z .'-]+)

When I changed this to:

new/in/(?P<suburb_id>.+)/(?P<street_name>.+)

It suddenly came good. And then I discovered that the thing being fed into 'suburb_id' was not a number, but a string. So what that error message really means is "The pattern you tried to use didn't match because of format differences between the parameters and the regular expression." Maybe that behaviour exists so you can have several patterns with the same name, with the first one that actually matches being used. Either way, I'll remember this; and hopefully someone else trying to figure out this problem won't butt their head against a wall for a day like I did.

Last updated: | path: tech / web | permanent link to this entry

Tue 29th Jul, 2008

Django 101

At work I've started working on a portal written in Python using the Django framework. And I have to say I'm pretty impressed. Django does large quantities of magic to make the model data accessible, the templating language is pretty spiffy (it's about on a par with ClearSilver, which I'm more familiar with - each has bits that the other doesn't do), and the views and URL mapping handling is nice too. I can see this as being a very attractive platform to get into in the future - I'm already considering writing my Set Dance Music Database in it just to see what it can do.

So how do I feel as a Perl programmer writing Python? Pretty good too. There are obvious differences, and traps for new players, but the fact that I can dive into something and fairly quickly be fixing bugs and implementing new features is pretty nice too. Overall, I think that once you get beyond the relatively trivial details of the structure of the code and how variables work and so on, what really makes languages strong is their libraries and interfaces. This to me is where Perl stands out with its overwhelmingly successful CPAN; Python, while slightly less organised from what I've seen so far, still has a similar level of power.

About the only criticism I have is the way command line option processing is implemented - Python has tried one way (getopt) which clearly thinks just like a C programmer, and another (optparse) which is more object oriented but is hugely cumbersome to use in its attempt to be flexible. Neither of these hold a candle to Perl's Getopt::Long module.

Last updated: | path: tech / web | permanent link to this entry

Sun 15th Jun, 2008

Common code in ClearSilver 001

I've been using ClearSilver as a template language for my CGI websites in earnest for about half a year now. I decided to rewrite my Set Dance Music Database in it and it's generally been a good thing. Initially, though, I had two problems: it was hard to know exactly what data had been put into the HDF object, and it was a pain to debug template rendering problems by having to upload them to the server (surprisingly, but I think justifiably, I don't run Apache and PostgreSQL on my laptop so as to have a 'production' environment at home).

I solved this problem rather neatly by getting my code to write out the HDF object to a file, rsync'ing that file back to my own machine, and then testing the template locally.

I knew that ClearSilver's Perl library had a 'readFile' method to slurp an HDF file directly into the HDF object, and a quick check of the C library said that it had an equivalent 'writeFile' call. So happily I found that they'd also provided this call in Perl. My 'site library' module provided the $hdf object and a Render function which took a template name; it was relatively simple to write to a file derived from the template name. That way I had a one-to-one correspondence between template file and data file.

Then I can run ClearSilver's cstest program to test the template - it takes two parameters, the template file and the HDF file. You either get the page rendered, or a backtrace to where the syntax error in your template occurred. I can also browse through the HDF file - which is just a text file - to work out what data is being sent to the template, which solves the problem of "why isn't that data being shown" fairly quickly.

Another possibility I haven't explored is to run a test suite against the entire site using standard HDF files each time I do a change to make sure there aren't any regressions before uploading.

Hopefully I've piqued a few people's interest in ClearSilver, because I'm going to be talking more about it in upcoming posts.

Last updated: | path: tech / web | permanent link to this entry

Tue 18th Mar, 2008

Standard Observations

Simon Rumble mentioned Joel Spolsky's post on web standards and it really is an excellent read. The fundamental point is that as a standard grows, testing any arbitrary device's compliance with it grows harder. Given that, for rendering HTML, not only do we have a couple of 'official' standards: HTML 4, XHTML, etc., but we also have a number of 'defacto' standards - IE 5, IE 5.5, IE 6, IE 7, Firefox, Opera, etc. etc. etc ad nauseam. For a long time, Microsoft has banked on their desktop monopoly to lever their own defacto standards onto us, but I think they never intended it to be because of bugs in their own software. And now the chickens are coming home to roost, and they're stuck with either being bug-for-bug compatible with their own software (i.e. making it more expensive to produce) or breaking all those old web pages (i.e. making it much more unpopular).

I wonder if there was anyone on the Microsoft Internet Explorer development team around the time they were producing 5.0 who was saying, "No, we can't ship this until it complies with the standard; that way we know we'll have less work to do in the future." If so, I feel doubly sorry for you: you've been proved right, but you're still stuck.

However, this is not a new problem to us software engineers. We've invented various test-based coding methodologies that ensure that the software probably obeys the standard, or at least can be proven to obey some standard (as opposed to being random). We've also seen the nifty XSLT macro that takes the OpenFormula specification and produces an OpenDocument Spreadsheet that tests the formula - I can't find any live links to it but I saved a copy and put it here. So it shouldn't actually be that hard to go through and implement, if not all, then a good portion of the HTML standard as rigorous tests and then use browser scripting to test its actual output. Tell me that someone isn't doing this already.

But the problem isn't really with making software obey the standard - although obviously Microsoft has had some problem with that in the past, and therefore I don't feel we can trust them in the future. The problem is that those pieces of broken software have formed a defacto standard that isn't mapped by a document. In fact, they form several inconsistent and conflicting standards. If you want another problem, it's that people writing web site code to detect browser type in the past have written something like:

if ($browser eq 'IE') {
    if ($version <= 5.0) {
        write_IE_5_0_HTML();
    } elsif ($version <= 5.5) {
        write_IE_5_5_HTML();
    } else {
        write_IE_HTML();
    }
    ...
}
When IE 7 came along and broke new stuff, they added:
    } elsif ($version <= 6.0) {
        write_IE_6_0_HTML();
It doesn't take much of a genius to work out that you can't just assume that the current version is the last version of IE, or that new versions of IE are going to be bug-for-bug compatible with the last one. So really the people writing the websites are to blame.

Joel doesn't identify Microsoft's correct response in this situation. The reason for this is that we're all small coders reading Joel's blog and we just don't have the power of Microsoft. It should be relatively easy for them to write a program that goes out and checks web sites to see whether they render correctly in IE 8, and then they should work together with the web site owners whose web sites don't render correctly to fix this. Microsoft does a big publicity campaign about how it's cleaning up the web to make sure it's all standard compliant for its new standards-compliant browser, they call it a big win, everyone goes back to work without an extra headache. Instead, they're carrying on like it's not their fault that the problem exists in the first place.

Microsoft's talking big about how it's this nice friendly corporate citizen that plays nice these days - let's see it start fixing up some of its past mistakes.

Last updated: | path: tech / web | permanent link to this entry

Tue 29th Jan, 2008

Finding Sets Made Easy

I can't believe I only just thought of it. My Set Dancing Music Database has its sets and CDs referenced on the URL line by their internal database IDs. While this is unique and easy to link to, it looks pretty useless if you're sending the link to someone. I realised this when writing my post on my experiences at Naughton's Hotel: I wanted to link to my page on the South Galway Reel Set and thought "how dull is that?"

Suddenly I realised that I should do what wikis and most other good content management systems have done for ages - make URLs which reference things by name rather than number and let the software work it out in the background. Take the name for the set, flatten it into lower case and replace spaces with underscores; it would also be easily reversible. CDs might be a bit more challenging but there are only one or two that have a repeated name, and I'd have to handle such conflicts anyway at some point.
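The flattening itself is only a couple of lines - a rough sketch, with made-up function names rather than the site's actual code:

def set_name_to_slug(name):
    # 'South Galway Reel Set' -> 'south_galway_reel_set'
    return name.lower().replace(' ', '_')

def slug_to_set_name(slug):
    # Reverse the flattening; title-casing gets close enough to look the
    # set up case-insensitively in the database.
    return slug.replace('_', ' ').title()

A URL like /sets/south_galway_reel_set then just needs its slug turned back into a name (or matched against a pre-flattened column) before the usual lookup.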

That combined with my planned rewrite of the site to use some sane HTML templating language - my current choice is ClearSilver - so that it's not all ugly HTML-in-the-code has given me another project for a good week or so of coding. Pity I'm at LCA and have to absorb all those other great ideas...

Last updated: | path: tech / web | permanent link to this entry

Tue 20th Nov, 2007

Wiki Documentulation

In the process of writing up the new manual for LMMS, I've been asked by the lead developer to be able to render the entire manual as one large document. This he will feed into a custom C++ program written to take MediaWiki markup and turn it into TeX markup, for further processing into a PDF. Presumably he sees a big market for a big chunk of printed document as opposed to distributing the HTML of the manual in some appropriately browsable format, and doesn't mind reinventing the wheel - his C++ program implements a good deal of Perl's string processing capabilities in order to step through the lines byte-by-byte and do something very similar to regular expressions. Although I might be mistaken in this opinion - I don't read C++ very well.

I had originally considered writing a Perl LWP [1] program that performed a request to edit the page, with my credentials, but I figured that was a ghastly kludge and would cause some sort of modern day wiki-equivalent of upsetting the bonk/oif ratio (even though MediaWiki obviously doesn't try to track who's editing what document when). But then I discovered MediaWiki's Special:Export page and realised I could hack it together with this.

The question, however, really comes down to: how does one go about taking a manual written in something like MediaWiki and producing some more static, less infrastructure-dependent, page or set of pages that contains the documentation while still preserving its links and cross-referencing? What tools are there for converting Wiki manuals into other formats? I know that toby has written the one I mentioned above; the author of this ghastly piece of giving-Perl-a-bad-name obviously thought it was useful enough to have another in the same vein. CPAN even has a library specifically for wikitext conversion.

This requires more research.

[1] - There's something very odd about using a PHP script on phpman.info to get the manual of a Perl module. But it's the first one I found. And it's better than search.cpan.org, which requires you to know the author name in order to list the documentation of the module. I want something with a URL like http://search.cpan.org/modules/LWP.

Last updated: | path: tech / web | permanent link to this entry

Fri 9th Nov, 2007

Perl, Ajax and the learning experience - part 001

AJAX, as something I might use regularly on web pages, is still unknown territory to me - a person who's still not entirely au fait with CSS and who still uses Perl's CGI module to write scripts from scratch. I understand the whole technology behind AJAX - call a server-side function and do something with the result when it comes back later - but I lacked a toolkit that could make it relatively easy for me to use. Then I discovered CGI::Ajax and a light began to dawn.

Of course, there were still obstacles. CGI::Ajax's natural way of doing things is for you to feed all your HTML in and have it check for the javascript call and handle it, or mangle the script headers to include the javascript, and spit out the result by itself. All of my scripts are written so that the HTML is output progressively by print statements. This may be primitive to some and alien to others, but I'm not going to start rewriting all my scripts to pass gigantic strings of HTML around. So I started probing.

Internally, CGI::Ajax's build_html function basically does:

if ($cgi->param('fname')) {
    print $ajax->handle_request;
} else {
    # Add the <script> tags into your HTML here
}
For me this equates to:

if ($cgi->param('fname')) {
    print $ajax->handle_request;
} else {
    print $cgi->header,
        $cgi->start_html( -script => $ajax->show_javascript ),
        # Output your HTML here
        ;
}
I had to make one change to the CGI::Ajax module, which I duly made up as a patch and sent upstream: both CGI's start_html -script handler and CGI::Ajax's show_javascript method put your javascript in a <script> tag and then a CDATA tag to protect it against being read as XML. I added an option to the show_javascript method so that you say:

        $cgi->start_html( -script => $ajax->show_javascript({'no-script-tags' => 1}) ),
and it doesn't output a second set of tags for you.

So, a few little tricks to using this module if you're not going to do things exactly the way it expects. But it can be done, and that will probably mean, for the most of us, that we don't have to extensively rewrite our scripts in order to get started into AJAX. And I can see the limitations of the CGI::Ajax module already, chief amongst them that it generates all the Javascript on the fly and puts it into every page, thus not allowing browsers to cache a javascript file. I'm going to have a further poke around and see if I can write a method for CGI::Ajax that allows you to place all the standard 'behind-the-scenes' Javascript it writes into a common file, thus cutting down on the page size and generate/transmit time. This really should only have to be done once per time you install or upgrade the CGI::Ajax module.

Now to find something actually useful to do with Ajax. The main trap to avoid, IMO, is to cause the page's URL to not display what you expect after the Javascript has been at work. For instance, if your AJAX is updating product details, then you want the URL to follow the product's page. It should always be possible to bookmark a page and come back to that exact page - if nothing else it makes it easier for people to find your pages in search engines.

Last updated: | path: tech / web | permanent link to this entry

Wed 11th Jul, 2007

Accessing the Deep Web

IP Australia has an interesting post about the "Deep Web" - those documents which are available on the internet but only by typing in a search query on the relevant website.

On reading their article I get the impression that they think that this is both a hitherto-unknown phenomenon and one which is still baffling web developers. This puzzles me, as even a relative neophyte such as myself knows how to make these documents available to search engines: indexes. All you need is a linked-to page somewhere which then lists all of the documents available. This page doesn't have to be as obvious as my Set Dance Music Database index - it can be tucked away in a 'site map' page somewhere so that it doesn't confuse too many people into thinking that that's the correct way to get access to their documents. However, don't try to hide it so that only search engines can see it, or you'll fall afoul of the regular 'link-farming' detection and elimination mechanisms most modern search engines employ.

Of course, being a traditionalist (as you can see from both the content and design of the Set Dance Music Database) I tend to think that lists are still useful, at least if kept small. And I do need to put in some mechanisms for searching on the SDMDB, as well as a few other drill-down methods. So giving people just a search form may not cater to all the methods they employ when finding content. Wikis realised this years ago - people like interlinking. And given that these 'deep web' documents are still accessible via a simple URL, if you really need to you can assist the search engines by creating your own index page to their documents, by basically scripting up a search on their website that then puts the links into your index, avoiding listing duplicates.

So the real question is: why are the owners of these web sites not doing this? We may just need to suggest it to them if they haven't thought of it themselves. The benefits of having their documents listed on Google are many - what downsides are there? I'm sure the various criticisms of such indexing are mainly due to organisational bias and narrow-mindedness, and can either be solved or routed around.

There are two variants of this that annoy me. One is the various websites where the only way to get to what you want is by clicking - no direct link is ever provided and your entire navigation is all done through javascript, flash or unspeakable black magic. These people are making it purposefully hard for you to get straight to what you want, either because they want to show you a bunch of advertising on the way or because they want to know exactly what you're up to on their site for some insidious purpose. There is already one Irish music CD store online that I've basically had to completely ignore (except for cross-checking with material on other sites) because there is no way for me to refer people directly to a CD. I refuse outright to give instructions such as "go to http://example.com and type in the words 'Tulla Ceili Band' in the search box", because that's not good navigation.

The other type of annoyance I find ties in with this: it is the practice of making a hidden index, or a privileged level of access, available to search engines that normal people don't see. I've seen a few computing and engineering websites do this, and Experts Exchange is particularly annoying for it: you can google your query and see an excerpt from the page with the question but when you go there you find out that access to the answers requires membership and/or payment. This, as far as I'm concerned, is just a blatant money-grabbing exercise and should be anathema. Either your results are free to access, or they're not - search engines should not be privileged in that respect.

Last updated: | path: tech / web | permanent link to this entry

Tue 6th Mar, 2007

Wiki defacement

To: abuse@ttnet.net.tr
From: paulway@mabula.net
Subject: Defacement of our wiki page by your user dsl.dynamic81213236104.ttnet.net.tr

Dear people,

On Wednesday the 28th of February, a user from your address dsl.dynamic81213236104.ttnet.net.tr made two edits to our Wiki. You can see the page as changed at http://mabula.net/rugbypilg/index.cgi?action=browse&id=HomePage&revision=18, including the above address as the editor. Your client is obviously defacing our and other sites like it, which is probably against your terms of service. In addition, they are too lame to be on the internet. Please take them off it so that they do not do any further damage to themselves and others.

We have reversed their changes and our site is back to normal.

Yours sincerely,

Paul Wayper

Last updated: | path: tech / web | permanent link to this entry

Fri 16th Feb, 2007

Comment spam eradication, attempt 2

Dave's Web Of Lies allows people to submit new lies, a facility that is of course abused by comment spammers. These cretins seem to not notice the complete absence of any linkback generation and the proscription of any text including the magic phrase http://. Like most spammers, they don't care if 100% of their effort is blocked somewhere, because it won't be blocked somewhere else. And there's no penalty for them brutalising a server: their botnets are just trawling away spamming continuously, leaving the spammers free to exploit new markets. It is vital to understand these two factors when considering how to avoid and, ultimately, eradicate spam.

For a while now, I've done a certain amount of checking that the lie submitted meets certain sanity guidelines that also filter out a lot of comment spam. In each case, the user is greeted with a helpful yet not prescriptive error message: for instance, when the lie contains an exclamation point the user is told "Your lie is too enthusiastic". (We take lying seriously at Dave's Web Of Lies.) This should be enough for a person to read and deduce what they need to do to get a genuine lie submitted, but not enough for a spammer to work out quickly which characters to remove for their submission to get anywhere. Of course, this runs straight into the first factor above: spammers don't care how many of their messages get blocked, so long as one message gets through somehow.

This still left me with a healthy chunk of spam to wade through and mark as rejected. This also fills up my database (albeit slowly), and I object to this on principle. So I implemented a suggestion from someone's blog: include a hidden field called "website" that, when filled in, indicates that the submission is from a spammer (since it's ordinarily impossible for a real person to type any text into the field). Then we silently discard anything submitted with that field filled in. No false positives? Sounds good to me.

Initial indications, however, were that it was having no effect. I changed the field from being hidden to having the style property "display: none", which causes any modern browser to not display it, but since this was in the stylesheet a spammer would have no real indication just by scraping the submit page that this field was not, in fact, used. This, alas, also had no effect. I surmised that this was probably because the form previously had no 'website' field and spammers were merely remembering what forms to fill in where, rather than re-scraping the form (though I have no evidence for this). Pity.

So my next step was to note that a lot of the remaining spam had a distinctive form. The 'lie' would be some random comment congratulating me on such an informative and helpful web site, the 'liar' would be a single word name, and there was a random character or two tacked on the lie to make it unlikely to be exactly the same as any previous submission. So I hand-crafted a 'badstarts.txt' file and, on lie submission, I read through this file and silently ignore the lie if it starts with a bad phrase. Since almost all of these are crafted to be such that no sane or reasonable lie could also start with the same words, this reduces the number of false positives - important (in my opinion) when we don't tell people whether their submission has succeeded or failed.
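In outline it's just a prefix check - the site itself is Perl, but the idea is small enough to sketch in a few lines of Python (badstarts.txt is the real file name from above; the function name is made up):

def starts_badly(lie, badstarts_path='badstarts.txt'):
    """Return True if the submitted lie opens with a known spam phrase."""
    lie = lie.strip().lower()
    with open(badstarts_path) as f:
        for phrase in f:
            phrase = phrase.strip().lower()
            if phrase and lie.startswith(phrase):
                return True
    return False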

Sure enough, now we started getting rejected spams. The file now contains about 36 different phrases. I don't have any statistics on how many got through versus how many got blocked, but that's just a matter of time... And I'm probably reinventing some wheel somewhere, but it's a simple thing and I didn't want to use a larger, more complex but generalised solution.

I'd be willing to share the list with people, but I won't post the link in case spammers find it.

I really want to avoid a captcha system on the Web Of Lies. I like keeping Dave's original simplistic design, even if there are better, all-text designs that I could (or perhaps should) be using.

Last updated: | path: tech / web | permanent link to this entry

Mon 29th Jan, 2007

Domain Search Squatters Must Die episode #001

It looks like the SpinServer people that I mentioned nigh on nine months ago have disappeared. That I can cope with - a pity, because I liked their designs, but businesses come and go.

What INFURIATES me beyond measure is the way the people who run the domain registries then cash in on any business's past success by installing a copy-cat templated redirector site that earns them a bit of money from the hapless people who mistake it for the real thing. They're getting good too: it was so well laid out it took me several moments to work out that there was nothing actually useful on the site. Previous attempts I've seen have been pretty much just a bunch of prepackaged searches on the keywords in your previous site listed down the page, with a generic picture of a woman holding a mouse or going windsurfing (or for the more extreme sites going windsurfing holding a mouse). Now it's getting nasty.

It's not good enough that these domain registrars take money for something they've been proven to lose, 'mistakenly' swap to another person, revoke without the slightest authority and fraudulently bill for, and which costs them nothing to generate. Then they have to leech off the popularity of any site that goes under, not only scamming a quick few thousand bucks in the process but confusing anyone who wanted just a simple page saying "this company is no longer doing business". There must be something preventing this from happening in real life - businesses registering the name of a competitor as soon as they'd closed, buying up the office space and setting up a new branch. Except that there'd be some dodgy marketing exec handing them money for every person who wandered in and asked "Is this where I get my car repaired?". This sounds criminal to me.

Last updated: | path: tech / web | permanent link to this entry

Mon 18th Sep, 2006

They just don't care, do they?

As I've mentioned before, I run the large and well-designed site known as Dave's Web Of Lies. Amongst its thousands of features is the ability to submit new lies to the database; naturally they are intensely scrutinised for any speck of truth beforehand. Now, the site's name might seem to give the game away, but those industrious linkback spammers obviously don't have time for such niceties as checking whether their handiwork has had any effect, or is even meaningful. My favourite 'comments' left in the submission form so far have been:

Looking for information and found it in this great site... - Jimpson.

Thank you for your site. I have found here much useful information... - Jesus.

The irony is that I don't know whether to include them because they are, indeed, genuine lies. But, on principle, I reject them. It's not as if liars get linkbacks on DWOL anyway...

Last updated: | path: tech / web | permanent link to this entry

Thu 31st Aug, 2006

Pandora opens my box

I see the National Library of Australia is now scanning my home photo gallery with a spider taken from the archive.org people. The project is called Pandora and the crawler site says that they're doing some kind of archiving for Australian pages. Well, that's certainly true of mine. But searching for the term 'Linux' on the main page produces "Linux at the Parkes Observatory", "Linux Australia", AusCERT's page, "Learning Linux" on www.active.org.au (which Pandora tells me is currently restricted for some reason), and then we go onto international sites. So I don't know what that's all about...

Last updated: | path: tech / web | permanent link to this entry

Fri 4th Aug, 2006

Get your nearest mirror here!

It just occurred to me, as I fired up my VMware copy of Ubuntu and searched its universe repositories - and searched my local RPM mirrors on Fedora Core - for packages of "dar", the Disk Archiver of which I am enamoured, that surely there are local Ubuntu mirrors I can use here on the ANU campus (I'm doing this from work). I've already found the local mirrors of the various RPM repositories that I use: http://mirror.aarnet.edu.au, http://mirror.optus.net, http://mirror.pacific.net.au/, http://public.www.planetmirror.com/, and others.

I know other people on campus use Ubuntu. I know about http://debian.anu.edu.au, although I haven't configured my Ubuntu installation to use it as a source. I personally think it makes the Internet a better place to get your new and updated packages from the closest mirror you can. If your ISP has a mirror, then definitely use that, because it almost certainly won't count against your monthly download quota.

So imagine if there was a system whereby users could submit and update yum and apt-get configurations based on IP ranges. Then a simple package would be able to look up which configuration would apply to their IP address, and it would automatically be installed. Instantly you'd get the fastest, most cost-effective mirrors available. You could probably do the lookup as a DNS query, too. It'd even save bandwidth for the regular mirrors and encourage ISPs to set up mirrors and maintain the configurations, knowing that this bandwidth saving would be instantly felt in their network rather than relying on their customers to find their mirrors and know how to customise their configurations to suit.

Hmmmm.... Need to think about this.

Last updated: | path: tech / web | permanent link to this entry

Wed 12th Jul, 2006

Too much time, too little gain?

My 'home' home page - http://tangram.dnsalias.net/~paulway/ - has, for a while now, had the appearance of an old greenscreen monitor playing a text adventure. Since it's more or less just a method for me to gather up a few bits and pieces that I can't be bothered putting up on my regular page - http://www.mabula.net - I'm not really worried about creating a work of art.

But, the temptation to carry things too far has always been strong within me. So, of course, the flashing cursor at the end of the page wasn't good enough on its own: I had to have an appropriate command come up when you hovered over the link. After a fair amount of javascript abuse, and reading of tutorials, I finally got it working; I even got it so that the initial text (which has to be there for the javascript to work) doesn't get displayed when the document loads.

Score one for pointless javascript!

Last updated: | path: tech / web | permanent link to this entry

Fri 30th Jun, 2006

Google Checkout For Lies

I run Dave's Web Of Lies (pluggity plug), an internet database of lies that I like to think of as the only place on the internet where you really know the truth of what you're seeing. One of the features that I've been working on in the background is the Lie Of The Day By Email system, where you can subscribe to get a steaming fresh lie delivered to your email address every day. The site as I inherited it from Dave Hancock will always be free, but for extras like this I feel justified in making a bit of money to pay for hosting, etc.

At the moment the site is more or less working. You can subscribe anew, or existing subscribers can log in, and see their details and change things as necessary. Email addresses will be confirmed before actually allowing a change. The only thing that doesn't exist yet is the ability to take money for the subscription. Enter the stumbling block.

Up until now I've been intending to use PayPal to collect money, but the key thing holding me back is the paucity of actual working code to do the whole online verification of payment thing. Maybe I'm more than usually thick, but I find the offerings on PayPal's website and on CPAN to be hard to understand - they seem to dive too quickly into the technical details and leave out how to actually integrate the modules into your workflow. I'd be more than glad to hear from anyone with experience programming websites in Perl using PayPal, especially PayPal IPN. I obviously need time to persevere.

Now Google is offering their Checkout facility, and I'm wondering what their API is going to look like. Is it going to make it any easier to integrate into my website than PayPal? Is the convenience of single-sign-on billing going to be useful to me? Should I wait and see?

Last updated: | path: tech / web | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.


© 2004-2023 Paul Wayper