Too Busy For Words - the PaulWay Blog

Fri 1st Sep, 2006

A new approach to the problem

I've revisited a program I wrote for work. The basic idea is to read a set of DNA sequences and produce a set of files listing all the distinct 'subsequences' (jargon: oligonucleotides) of various lengths, and how many of each we saw. I presented this idea at a Programmers SIG a while back and the reply came back: write them out and use sort(1) and uniq(1) -c. In my vanity I hadn't thought of this, and was instead making up great complex data structures and slab memory allocators and all sorts of junk. Then I was diverted to work on other projects; when I came back all this looked like so much gibberish.

So I wrote a simple test and showed that, for an average set of virus sequences, the files range from about 24MB (six-letter words, called 'six-mers' in molecular biologists' jargon) to 77MB (21-mers). Sure enough, sort and uniq produce what I want in a not inefficient manner. But I'd like to run this as a single standalone executable, much as that's against the Unix piping ethos. For efficiency reasons I generate all the files simultaneously, and the thought of forking off fifteen separate sort | uniq -c pipes makes me shudder. There must be a better way, I think.
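For concreteness, this is roughly what one of those pipes would look like from C via popen(3) - a sketch only, with an illustrative word and output filename, not the program's actual code:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* One "sort | uniq -c" pipeline per word length: write each word
         * as a line and let the external tools do the counting. */
        FILE *pipe = popen("sort | uniq -c > six-mers.counts", "w");
        if (!pipe) {
            perror("popen");
            return EXIT_FAILURE;
        }

        /* ... for each six-mer found in the sequences ... */
        fprintf(pipe, "%s\n", "ACGTAC");

        if (pclose(pipe) == -1)
            perror("pclose");
        return EXIT_SUCCESS;
    }

Multiply that by one pipeline per word length, all open at once, and you can see where the shuddering comes from.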

The first obvious improvement is to keep the lists in memory and use some kind of in-memory sort function. The files contain about three and a half million words apiece, so it would be possible to fill a 21MB array with six-mers (you don't need to store the null at the end) and sort it with qsort(3). There's a setting in talloc to allow grabbing chunks of memory greater than 64MB in size, so doing even the 21-mers (77MB in size) would be possible using this method.
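As a minimal sketch of that approach - assuming the six-mers are packed back-to-back in one flat array with no terminating nulls - the sort and the counting pass look something like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MER_LEN 6   /* six-mers: 3.5 million of them is about 21MB */

    /* Compare two fixed-length words packed with no terminating null. */
    static int cmp_mer(const void *a, const void *b)
    {
        return memcmp(a, b, MER_LEN);
    }

    /* Sort 'count' six-mers packed contiguously in 'mers', then print each
     * distinct word with its frequency - the in-memory equivalent of
     * sort | uniq -c. */
    static void count_mers(char *mers, size_t count)
    {
        qsort(mers, count, MER_LEN, cmp_mer);

        size_t i = 0;
        while (i < count) {
            size_t run = 1;
            while (i + run < count &&
                   memcmp(mers + i * MER_LEN,
                          mers + (i + run) * MER_LEN, MER_LEN) == 0)
                run++;
            printf("%7zu %.*s\n", run, MER_LEN, mers + i * MER_LEN);
            i += run;
        }
    }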

The problem, though, is that the way I generate the words is to produce all the different lengths simultaneously, in one pass through the sequences. Storing all of them in arrays at once would require 812MB (roughly), and that doesn't seem much better than my previous dodgy tree implementation.

Then I realised: all the six-mers are just the prefixes of all the seven-mers, plus the few six-mers that didn't fit into a seven-mer (those right at the end of a sequence). This applies right up the scale. So I could generate an array containing the longest string (null-terminated) available at each location, up to the maximum length required, and sort that. If I did that with fixed-length 32-byte strings (more than enough for anything we've been doing so far) it would use 112MB or so. The sorted array then contains an ordered list of all the strings of every length between the minimum and maximum we're using. To extract all the N-mers, ignore all strings of length less than N and take the first N characters of the remaining strings. They're still in order, so counting their frequency is a simple matter. You could even do this for every N in parallel, in one pass across the array (which would also help cache read performance).
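A sketch of that extraction, assuming the array holds fixed 32-byte slots, each zero-padded after the null-terminated word, and already sorted with a whole-slot memcmp comparator:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define REC_LEN 32   /* one fixed slot per position; plenty for 21-mers */

    /* Whole-slot comparator for qsort(3); the zero padding means shorter
     * words still sort consistently. */
    static int cmp_rec(const void *a, const void *b)
    {
        return memcmp(a, b, REC_LEN);
    }

    /* After sorting, count the N-mers for one value of n: skip slots whose
     * word is shorter than n, and treat slots as equal if their first n
     * characters match. The slots are already in order, so equal prefixes
     * are adjacent and the N-mers come out in order too. */
    static void count_nmers(const char *recs, size_t count, size_t n)
    {
        const char *prev = NULL;
        size_t run = 0;

        for (size_t i = 0; i < count; i++) {
            const char *rec = recs + i * REC_LEN;
            if (strnlen(rec, REC_LEN) < n)
                continue;                    /* too short to hold an n-mer */
            if (prev && memcmp(prev, rec, n) == 0) {
                run++;
            } else {
                if (prev)
                    printf("%7zu %.*s\n", run, (int)n, prev);
                prev = rec;
                run = 1;
            }
        }
        if (prev)
            printf("%7zu %.*s\n", run, (int)n, prev);
    }

Running count_nmers once per N over the same sorted array (or interleaving all the Ns in one pass) gives every output file from a single sort.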

Consider, for a moment, though, what happens if you can't allocate huge arrays like that. You have to break the single array into smaller, more manageable arrays, sort each in turn, and do a merge-condense to recombine the blocks. That lends itself to parallelisation: each slave is given a contiguous chunk of the sequence (i.e. one with no non-sequence characters in it - like being given a paragraph), breaks it up into words, sorts them, and then returns the data to the master (either in single messages, or better yet in chunks) for merging.

But think the sort process through: most common sort routines recursively descend through successively smaller chunks of the file, sorting each and then combining the sorted bits back together into larger sorted chunks. Couldn't that process also combine elements that are the same? So we might sort records of a string and a count (e.g. 30-byte strings and a two-byte count), starting each count at one; as each merge finds duplicated strings they get combined and their counts added together. This would also compact the data storage as we go, which suggests it might be good to read each sequence into a separate block, sort and compact it independently (giving a certain granularity to the process), and then do the final merge N-way rather than two-way. If that's too complex, make each merge two-way and just do as many two-way merges as necessary to get all the data into one sorted array, which would by then also be compacted and contain the counts anyway.
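The merge-condense step itself is simple enough; here's a sketch, assuming records of a fixed 30-byte word plus a count (widened here to 32 bits so the sketch needn't worry about overflowing the two-byte count mentioned above):

    #include <stdint.h>
    #include <string.h>

    #define WORD_LEN 30

    /* One record: a fixed-length word and how many times it's been seen. */
    struct rec {
        char     word[WORD_LEN];
        uint32_t count;
    };

    /* Merge two sorted, already-condensed blocks into 'out' (which must
     * have room for na + nb records). Equal words are combined and their
     * counts added, so the result is both sorted and condensed. Returns
     * the number of records written. */
    static size_t merge_condense(const struct rec *a, size_t na,
                                 const struct rec *b, size_t nb,
                                 struct rec *out)
    {
        size_t i = 0, j = 0, n = 0;

        while (i < na || j < nb) {
            int c;
            if (i == na)       c = 1;     /* only b left */
            else if (j == nb)  c = -1;    /* only a left */
            else               c = memcmp(a[i].word, b[j].word, WORD_LEN);

            if (c < 0)       out[n++] = a[i++];
            else if (c > 0)  out[n++] = b[j++];
            else {                        /* same word in both: combine */
                out[n] = a[i++];
                out[n].count += b[j++].count;
                n++;
            }
        }
        return n;
    }

Repeat that pairwise (or N-way) across the per-sequence blocks and the final array comes out sorted, compacted, and already counted.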

The best part of this, AFAICS, is that it's also neatly parallelisable. I can even write an application which, if it finds a PVM machine set up, distributes itself neatly amongst the hosts, but which otherwise still operates in 'single thread' mode and does all the work itself. Which, in turn, means that as long as I keep in mind that at some stage I could distribute the workload, I won't end up having to substantially rewrite the program (as was looking likely with the last method).

So, you see, I do do things other than rant and rave sometimes.

path: tech / c



