Too Busy For Words - the PaulWay Blog

String List Fixed, Head Screwed On

I think file formats are the bane of programmers everywhere. Certainly a lot of my troubles with my current program stem from one particularly bogus format - the CLUSTAL 'aligned' format (extension .ALN).

Normal nucleotide and protein sequences have a structure like this (this is FASTA, for example, although PIR is almost identical and GENBANK, while more verbose, has a similar structure):

>Sequence 1 name
AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA
CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT
>Sequence 2 name
CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG
GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGT

I.e. a sequence name with a specific prefix, then everything from there until the next name line is part of that sequence. Clustal is a program for displaying 'aligned' sequences - where special 'zero-length' gaps are inserted into sequences to make them 'line up'. The 'lining up' process is a black art unto itself, but we won't worry about that here. Clustal displays the sequences as rows and the bases aligned in columns, and I shudder to think what hideous perversion caused the programmers to think that this was a more efficient structure to retrieve their data in:

CLUSTALW 1.8.2 format header

Sequence_1_name         AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA
Sequence_2_name         CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG

Sequence_1_name         CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT
Sequence_2_name         GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGT

Even with a memory limit much smaller than the size of the file, I can think of at least two ways to quickly and easily retrieve the data for any 'rectangular area' of sequence from a FASTA or PIR file. You might need to keep a list of byte offsets where each sequence's data starts in the file, but unless you're doing something really evil (i.e. calculating the offset for a particular sequence derived from the line length and the header length alone) that's what you'd have to do anyway. So you've done nothing good at the expense of sequence name readability (did I mention that it's traditional in ALN files to smash the name down to the first word, so you get lots of sequences with names like GI:37592756 and no clue about what they actually are...). Goody.

For me, it means that I have to read the entire file into memory, appending text to each sequence in turn (and I make no assumptions about whether the sequence order changes per 'block', either...). It means that my sequences are stored as a linked list of strings - i.e. the Stringlist structure has one pointer to a char, and one pointer to the next Stringlist. To make appending slightly less stupid, we keep a pointer to the first and the last Stringlist structures in the list, and therefore appending doesn't require us to traverse the list every time. That was a big speed-up from my stupid first implementation right there...

The only problem is that this method has fallen foul of the talloc block overhead. Each string added therefore incurs the talloc overhead for the linked list structure and the string copy - the former being only sixteen bytes in size (on my 64-bit machine). Not only that, but in order to use it as one string we have to allocate a new string with enough space to hold the entire thing, copy each of the list strings in turn, and then free the entire list structure (which is still handled neatly by talloc but does still mean a complete list traversal, effectively).

The solution was obvious when I looked at it in hindsight: each stringlist structure now contains an array of 1024 chars, into which strings are agglomerated. When we run out of characters in the last block we simply create a new block and put the string into it. So, OK, each string in the list may not be 100% full, but it doesn't really need to be - we're already getting a huge saving in talloc overhead, list traversal and string copying anyway.

The 'stringification' process above now doesn't reorganise the linked list. But that's OK, since we pretty much only use the stringified sequence once anyway, and the caller was getting a talloc_strdup'ed copy of the stringifed sequence anyway. So in a way the code just got simpler... And that can only be good. (And we're probably about to discard the sequence from its storage structure anyway...)

Now to work out the finer points of the Rainer structure. Block agglomeration is good!

Last updated: | path: tech / c | permanent link to this entry

Too Busy For Words - the PaulWay Blog

Wed 31st May, 2006

String List Fixed, Head Screwed On