Too Busy For Words - the PaulWay Blog

Perl threaded database query processing

In my work I've recently had to implement several pieces of code which follow this basic pattern:

Retrieve data from the database
Process data
Store data somewhere.

Because of Perl DBI's habit (on the systems I've used) of grabbing all the data from the database into memory before actually giving it to the caller, and because that data can often get large enough to get my process swapping or killed, what this usually turns into is:

Get a list of 'grouping' items (e.g. days, months, IP addresses, etc.)
For each item in that group:
1. Retrieve data from the database for that item.
2. Process data
3. Store data somewhere.

This runs into an unfortunate problem when the database server you're talking to takes a noticeable time to process your query - the whole thing slows down hugely. A typical slowdown I've seen is in the order of 500% - and both the database and the client processors are mostly idle during that time, as each query has to be individually fetched, processed, dumped back to the client, and then processed. It suffers the same problem if the time to process each group of data is significant - by the time you've got back to fetching the next group, the database has gone off and done other things and needs to get its disk heads back in the right place for your data.

These days we have processors capable of doing multiple things at the same time, and so it would be nice if the client could be processing rows at the same time as it's also requesting more data from the database. This is where Perl's threads and Thread::Queue libraries come in. It seems to me to be a generalisable task, so I'm sharing my first attempt at doing this in a generalisable way here. My main subroutine is:

######################################
sub Thread_Process {
######################################
    # We take one query which returns a list of items, a query which
    # returns other rows based on each of those items, and a function
    # which processes those rows.  We then run the processor function
    # in parallel to the fetching process to utilise the connection
    # to the database and keep the local processor active.
    # Requirements:
    # The item query must be executed and ready to return rows.  It
    #   can return any number of fields.
    # The rows query must be ready to be executed, and will be
    #   executed with the row_args_ref and then the items from each
    #   row in the item query in turn (as arrays).
    # The function takes as its last argument the Thread::Queue object
    #   that data will be passed through.  It must know exactly how
    #   many items it will take from each row, and that should match
    #   the number of items returned in the query.  For reasons as yet
    #   unclear, we can't pass references of any kind on the queue,
    #   so we pass the fields in each row as single scalars.  Any
    #   arguments that it needs should be given in fn_args_ref.  It
    #   should exit on receiving an undef.
    my ($items_qry, $rows_qry, $row_args_ref, $fn_ref, $fn_args_aref) = @_;
    my ($items_aref) = $items_qry->fetchall_arrayref;
    unless (ref $items_aref eq 'ARRAY') {
        carp "Warning: got no rows from item query\n";
        return 0;
    }
    
    my $queue = Thread::Queue->new();
    my $thread = threads->create($fn_ref, @$fn_args_aref, $queue);
    foreach my $item_aref (@$items_aref) {
        $rows_qry->execute(@$row_args_ref, @$item_aref);
        while (my $aref = $rows_qry->fetchrow_arrayref) {
            $queue->enqueue(@$aref);
        }
    }
    $queue->enqueue(undef);
    $thread->join();
    return scalar @$items_aref;
}

A sample caller function would be:

sub Send_mail_to_everyone {
    my ($mail_handler, $template, $start_date, $end_date);
    my $servers_qry = $dbh->prepare(
        'select distinct mail_server from addresses'
       .' where birth_date between ? and ? and current = true',
    );
    my $args_ref = [$start_date, $end_date];
    $servers->execute(@$args_ref);
    my $email_qry = $dbh->prepare(
        'select user_name, server, full_name, birth_date'
       .' from addresses'
       .' where birth_date between ? and ? and server = ?'
       .' and current = true'
    );
    my $mailer_sub = sub {
        my ($queue) = @_;
        while (defined my $user_name = $queue->dequeue) {
            my $server = $queue->dequeue;
            my $full_name = $queue->dequeue;
            my $birth_date = $queue->dequeue;
            my $email_body = sprintf $template
                , $username, $server, $full_name, $birth_date;
            $mail_handler->send("$user_name@$server", body => $email_body);
        }
    };
    # Here most of the work gets done.
    Thread_Process($servers_qry, $email_qry, $args_ref, $mailer_sub, []);
}

Of course, this is somewhat of a contrived example, and gives you little in the way of feedback or error handling. But it's an example of how to use the Thread_Process subroutine. The mailer subroutine gets its $mail_hander and $template from being within the Send_mail_to_everyone routine.

There are two problems I've discovered so far. The first is that trying to do any kind of database operation within the subroutine doesn't work, because the database handle needs to be cloned. On the systems I've tested this on, unfortunately, $dbh->clone seems to be a no-op and the DBI engine complains that a different thread is using the database handle. I've tried passing $dbh->clone to the handler function, and doing the clone inside the handler function, but they change nothing.

More annoying is the fact that the memory used by the process continues to rise even if the number of outstanding rows is constant or dropping. I haven't traced this down, and haven't really the time now, but it seems to be related to the Thread::Queue object - I've tested variations of my handler routine that reuse existing memory rather than doing undef @array and push @array, $data in the handler, and this changes little.

What I don't know yet is whether either to package this up as a Perl module and start being a maintainer or whether it's too trivial or not generalised enough to be useful for anyone but me.

Last updated: | path: tech / perl | permanent link to this entry

Mon 19th Nov, 2007

What if we had no walls at all?

Incidentally, it struck me that my use of CGI::Ajax referred to in my previous post is an example of where Perl's loose object-orientation works with the programmer, not against it. I'm sure some anal-retentive, secretive paranoid that writes in Java or C++ would have made most of those methods private, and thus would have forced a major re-engineering of my code to fit in with their own personal way of doing things. Perl naturally tends toward letting you see all the methods, not just those publically declared, but to me this is a general argument for letting people have a bit more control of your object than you think is good for it.

Last updated: | path: tech / perl | permanent link to this entry

Fri 1st Sep, 2006

Rocket and Jifty

Work on my Rocket module continues slowly. Slowly, because I've discovered yet another project - a database of music, instructions and notes on Irish Set Dances. It's an extrapolation of the spreadsheet I had that showed which sets were on which CDs and whether I had instructions for them; now it's fully relational so you can have multiple CDs and multiple instruction sources for each set, as well as recording the other tracks on each CD so that if you need waltzes (for example) you know which CDs they're on. Now that I've got the database structure set up, I've started using Rocket again to do the basic CGI work, so I've got back into working on the module again.

However, a friend mentioned Jifty, something he calls "Perl On Rails". A quick look at the Jifty website (which uses a wiki called Wifty - guess what it's based on) shows that, yes, indeed, it does have a lot of similarities to Ruby On Rails - Model/View/Controller structure, centralised config, templates with overrides - without the hassle of learning a new language that's irritatingly similar to one I already know and without a name that's an irritating simile for a long length of steel used to support things. I'm installing it on my home server. My does it need a lot of Perl packages, though...

Last updated: | path: tech / perl | permanent link to this entry

Tue 8th Aug, 2006

Rocket module part 001

To borrow the words of Edna Mode, a new project that I can achieve consumes me as only superhero work can, darling. My main project is the Repository Configurator, which is working well so far. But the key component that I've been finally inspired to write, and that, in laying dormant in my mind for a couple of months, has had the various problems unconsciously solved, is a thing I call Rocket.

Ruby on Rails, by contrast, is a rapid web application framework. It's designed to easily deploy in a site and gradually you implement overrides to the various defaults - display, authentication, data validation and editing being the obvious tasks. From the little I've seen it looks like a great idea. It just means I'd have to learn Ruby - I haven't even got beyond simple Python, I'm not going to start learning another language - and you still have to work within its framework system.

Rocket has the same basic objective - get a website that's based on the contents of a database up and running quickly. But it takes a different tack, and doesn't try to be a whole framework; it just provides (or will provide) a couple of things: print a table or query, print an editor, validate and update the data from the editor. By a flexible system of parameter overrides, which can be set up when you create the Rocket object, when you call PrintTable, or in the database itself in a supplemental table called Booster (either as global, or table-specific, settings), you can control its behaviour. But the Rocket still does most of the work for you - finding the data, handling errors, displaying it neatly.

(My further plan is to have a separate but parallel system that will read your CGI code, find calls to Rocket, and replace them with neatly labelled and commented code that does what Rocket does. This then allows you to further customise the process if you've gone beyond Rocket's capabilities, while saving you the copy-and-paste (or, from my past, trial and error) process of doing the same "Display the data" and "Edit the data" processes. Optimally, it would produce hard, optimised code in the places where Rocket has to be flexible and therefore slower. But that's a whole further project...)

Whenever I have this much hubris, I start worrying, though; it's usually at this point that someone says "But doesn't WellKnown::PerlModule do exactly that?". But, honestly, the HTML::Template and other templating modules I've seen take a third approach, which is to just simply differentiate between presentation and layout. That's a noble goal, and I should look at using that in my code generator. But it doesn't solve the problem of getting a table of data up there quickly and easily - you still have to copy and paste a bunch of lines to get the data and throw it at the template...

So: Who wants to be stuck on Rails, when you can fly with a Rocket? :-)

Last updated: | path: tech / perl | permanent link to this entry

Mon 17th Jul, 2006

CLUG Perl Programmers SIG = waste of time

A quick note about last Thursday's SIG meeting of the CLUG. This month was a special Perlmongers theme, given that we're trying to work with the Canberra chapter of the Australian Perl Mongers. So it was a pity that not only does no-one there other than me show any interest in Perl, but one guy (who I won't name) actively calls Perl "executable line noise". Yeah, thanks. That'll really make people enthusiastic about talking about Perl.

What I thought was ironic was that said Perl detractor has been working on some cool stuff - Microsoft's PDF-alternative format, and some other stuff in KDE that was more than nominally neat but has now slipped off the bottom of my stack. Now, I know that I tend to spend a lot of time telling people about my opinions on stuff - this blog may be a prime example. I do also try my best (I think) to be interested in other people's projects and interests, so I don't think I present a one-sided write-only approach. (As always, please email me at paulway@mabula.net, anonymously if you want, if you want to correct me on this :-) It's a little depressing, though, when no-one wants to talk about the topic of the night but wants their own ideas to be heard and applauded.

(Not that I hold anything against the person in question. I just wanted to talk gnarly Perl stuff, that's all :-)

Last updated: | path: tech / perl | permanent link to this entry

Wed 12th Jul, 2006

Shorter isn't always better!

I'm reading lines in a format I invented that uses a simple run-length compression on zeros: a line can be something like '1 4 2 8 5=7 3 6' and that means 'the values 1, 4, 2, 8, five zeros, 7, 3 and 6'. My code was:

foreach $count (split(/ /, $_)) {
    if ($count =~ /(\d+)=(\d+)/) {
        push @counts, (0)x$1, $2;
        $csubnum += $1 + 1,
    } elsif ($count =~ /(\d+)/) {
        push @counts, $count;
        $csubnum++;
    } elsif ($count =~ /\s*/) {
        next;
    } else {
        warn "bad format in WAIF $vers line: $_ (part $count)\n";
    }
}

"No, wait!" I thought. "I can do all that in a mapping:"

@counts = map (
    {$_ =~ /(\d+)=(\d+)/ ? ((0) x $1, $2) : $_}
    (split / /, $_)
);
$csubnum += scalar @counts;

Testing, though, proved another thing. Reading a file with a reasonable number of zeros per line (lines like '114 0 3 6=3 27=3 3=1 10=3 79=1 8=1 0 1 4=3 16=1 0 1 43=7 15=12 36=16 27=2' are typical) took 21 seconds with the foreach code and 28 seconds with the map code. So that's an extra 33% time per file read. I can only assume this is because Perl is now juggling a bunch of different arrays in-place rather than just appending via push. Still, it's an interesting observation - the regexp gets tested on every value in both versions, so it's definitely not that...

Last updated: | path: tech / perl | permanent link to this entry

Tue 11th Jul, 2006

Editing databases in Perl CGI revisited

I seem to end up often writing what must be a standard style of editor for database fields in Perl CGI: a table with the fields of the database in columns and the rows in, er, rows, and a "Submit" button at the bottom which then compiles the form data into a structure and passes it to a backend which updates the database. There are two minor but persistently annoying problems with this as I'm doing it, and some day I'm going to have to scratch the itch completely and solve them.

The first is that Perl's CGI module, for all its usefulness, makes the form data as supplied by the CGI POST take priority over what you supply in the field. If there's a field called "foo", and you've just POSTed "bar" to it, but the database update didn't work and thus the field should contain its previous value "wig", the the field as displayed says "bar" unless you supply the "override" option to each and every field. Presto, the user thinks the update worked and is surprised when they use the data somewhere else and it's not actually updated as they thought. Having to supply the "override" flag to every field means that you have to pass the CGI field method an anonymous hash with the parameters set up by name, which (in my petty mind) is not as neat as passing it an array with known parameter order.

(Although the solution to this, to work with the user rather than against them, is to highlight the fields that have errors and the rows that haven't been saved. Field priority therefore means that what the user entered is still in the field, but it's obvious that that row hasn't been saved. Or make it obvious that none of the updates have occurred, which is a little harsher but may make more sense from a data integrity point of view.)

The second is more philosophical : what happens if you're adding new rows? Firstly, it's possible to create a bunch of new empty fields for new data, but how many and how does the program keep the data in them separate from real rows? Secondly, if the user submits the form but the row isn't valid, where does the row data go? With the above "fields over parameters" policy it's easy - the form data persists and doesn't need to be plucked out and re-entered by the CGI code. But we've already established that this persistence causes other problems, and it seems really kludgy to rely on persistence for some rows and not for others.

What I need to invent is a set of generic form handler routines that collect the form data, perform some inbuilt and some arbitrary checks against it, submit it to the backend module, get the updated data and display the whole thing, plus any extra blank rows. It will handle adding new rows, deleting existing rows, and allow for a full range of validation field display (i.e. each field in each row could potentially have a different error code recorded against it and it would all be displayed nicely).

I haven't started this project, though, because of the feeling that this wheel has already been re-invented so many times...

Last updated: | path: tech / perl | permanent link to this entry

Thu 18th May, 2006

Recycling bytes the easy way

On Tuesday I rebuilt a machine for a friend on Tuesday. Windows 2000 had decided that something in its boot process was sufficiently adgered that the drive would stop responding after a bit and would lock up the machine. Funnily enough, booting into the System Rescue CD was able to see the entire drive and copy all of the information off it with no lock-ups or obvious problems. Trying to boot into the Windows 'Rescue' tools on the 2000 install CD locked it up as well, which I assume was because it tried to access the drive in a funny way that the drive was no longer having a bar of (and which Linux didn't do, at least in its System Rescue CD configuration).

So, one trip to Harris Technology later, we have a new, faster, 80GB drive (and an ADSL modem for the upcoming move into the modern world of broadband for the Armstrong house). While Rob gets on with his actual work, I install Windows, install all the motherboard drivers and add-ons that Windows natively doesn't find, install all the Service Packs, and do so in only three reboots. Finally we copy everything back with the help of Explore2FS (because the only room on my 80GB portable drive was in the ext2 partition). Then, because half of this is client data that Rob's never quite got around to putting on the Raid-1 80GB SATA drive pair that is supposed to be his work drive, I have to shred the contents of this directory on my portable drive.

If only shred came with a -r option.

So, one small Perl script later, we have a mechanism that will span across multiple directories, shredding every file that it can find and then removing the directory after itself. I did a bit more work on it this morning to fix its remaining bug: not being able to cope with names with spaces in them because of the way Shell.pm throws parameters around (i.e. it creates a small shell script and passes the parameters into that, which then gets reinterpreted for spaces).

And then I found wipe.

But, hey, who needs a reason for reinventing the wheel?

(I also plugged the drive into the USB2 connector rather than the USB connector and increased its speed by an order of magnitude...)

Last updated: | path: tech / perl | permanent link to this entry

Thu 23rd Mar, 2006

Shuffling a list - the power of factorials.

I needed to shuffle a list of DNA sequences for a work program. Having been suitably chastised by Damian Conway's excellent entry in maddog's ticket picking mini-competition, I decided that in future I'd do more searching on CPAN before re-inventing the wheel for the sixteenth time.

I feel so stupid. I downloaded Algorithm::Numeric::Shuffle, written by abigail, as I think it was the first entry that turned up in a search on CPAN. I should have looked by category under List::Util, being a utility function on lists. But anyway. abigail pointed out in the POD that the Perl rand() routine is a linear congruential generator - in other words, from a given random seed the numbers will always follow a predictable pattern. The algorithm works by going through the entire list and swapping the Nth member with another random member. Thus, the actual final ordering is dependent only on the initial RNG seed, which is a 32-bit number.

(The List::Util::shuffle routine suffers from this similarly, despite using a somewhat more bizarre method of giving a unique random ordering to the list.)

Now, 13! is 6227020800, and 2³² is 4294967296, so there are more combinations for a 13-element list than a 32-bit number. Which means, if you think about it, that there are some combinations that you will never get from the shuffle routine. It's just simple information theory, like compressing data: there's no random seed that will cause that ordering of elements to come out.

Here's where once again I forgot to learn the basic rule of my programming existence: What You Want Has Probably Already Been Invented. I wrote an email to abigail suggesting that, if the user of the shuffle routine wanted a truly random shuffle, they could pass in a sub reference to their own rand() function, or one from e.g. MCrypt or OpenSSL::Rand, that does at least have more bits of random seed (or, for preference, draws on a 'true' source of randomness.) I even started writing up a revision of abigail's code to do this.

Then I asked on the #perl channel on irc.freenode.net whether the inbuilt rand() function could be passed as a sub reference (it can't apparently). Someone asked why I wanted to do such a thing. I told them the story. I spent some time convincing them that the statistics said it was possible, and that someone might actually care about being statistically correct. Then someone else said "why don't you just override the rand() function for that module?".

Why not indeed! Perl doesn't make package namespaces sacred, so it's perfectly possible to put a sub rand(!) definition in my code after a package List::Util declaration which calls the rand() function I want. Of course, I've yet to work out how to do that, and the random picks of the sequences are good enough for testing so far, so it's a bit of a moot point. But I hate it when I should have known better all along.

I'm somewhat glad that my email to abigail bounced...

Last updated: | path: tech / perl | permanent link to this entry

Too Busy For Words - the PaulWay Blog

Thu 9th Oct, 2008

Perl threaded database query processing

Mon 19th Nov, 2007

What if we had no walls at all?

Fri 1st Sep, 2006

Rocket and Jifty

Tue 8th Aug, 2006

Rocket module part 001

Mon 17th Jul, 2006

CLUG Perl Programmers SIG = waste of time

Wed 12th Jul, 2006

Shorter isn't always better!

Tue 11th Jul, 2006

Editing databases in Perl CGI revisited

Thu 18th May, 2006

Recycling bytes the easy way

Thu 23rd Mar, 2006

Shuffling a list - the power of factorials.

At last, I've worked out why <pre> formatting in Perl CGI is wrong!