Normal nucleotide and protein sequences have a structure like this (this is FASTA, for example, although PIR is almost identical and GENBANK, while more verbose, has a similar structure):
>Sequence 1 name AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT >Sequence 2 name CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGTI.e. a sequence name with a specific prefix, then everything from there until the next name line is part of that sequence. Clustal is a program for displaying 'aligned' sequences - where special 'zero-length' gaps are inserted into sequences to make them 'line up'. The 'lining up' process is a black art unto itself, but we won't worry about that here. Clustal displays the sequences as rows and the bases aligned in columns, and I shudder to think what hideous perversion caused the programmers to think that this was a more efficient structure to retrieve their data in:
CLUSTALW 1.8.2 format header Sequence_1_name AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA Sequence_2_name CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG Sequence_1_name CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT Sequence_2_name GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGTEven with a memory limit much smaller than the size of the file, I can think of at least two ways to quickly and easily retrieve the data for any 'rectangular area' of sequence from a FASTA or PIR file. You might need to keep a list of byte offsets where each sequence's data starts in the file, but unless you're doing something really evil (i.e. calculating the offset for a particular sequence derived from the line length and the header length alone) that's what you'd have to do anyway. So you've done nothing good at the expense of sequence name readability (did I mention that it's traditional in ALN files to smash the name down to the first word, so you get lots of sequences with names like GI:37592756 and no clue about what they actually are...). Goody.
For me, it means that I have to read the entire file into memory, appending text to each sequence in turn (and I make no assumptions about whether the sequence order changes per 'block', either...). It means that my sequences are stored as a linked list of strings - i.e. the Stringlist structure has one pointer to a char, and one pointer to the next Stringlist. To make appending slightly less stupid, we keep a pointer to the first and the last Stringlist structures in the list, and therefore appending doesn't require us to traverse the list every time. That was a big speed-up from my stupid first implementation right there...
The only problem is that this method has fallen foul of the talloc block overhead. Each string added therefore incurs the talloc overhead for the linked list structure and the string copy - the former being only sixteen bytes in size (on my 64-bit machine). Not only that, but in order to use it as one string we have to allocate a new string with enough space to hold the entire thing, copy each of the list strings in turn, and then free the entire list structure (which is still handled neatly by talloc but does still mean a complete list traversal, effectively).
The solution was obvious when I looked at it in hindsight: each stringlist structure now contains an array of 1024 chars, into which strings are agglomerated. When we run out of characters in the last block we simply create a new block and put the string into it. So, OK, each string in the list may not be 100% full, but it doesn't really need to be - we're already getting a huge saving in talloc overhead, list traversal and string copying anyway.
The 'stringification' process above now doesn't reorganise the linked list. But that's OK, since we pretty much only use the stringified sequence once anyway, and the caller was getting a talloc_strdup'ed copy of the stringifed sequence anyway. So in a way the code just got simpler... And that can only be good. (And we're probably about to discard the sequence from its storage structure anyway...)
Now to work out the finer points of the Rainer structure. Block agglomeration is good!
Last updated: | path: tech / c | permanent link to this entry
All posts licensed under the CC-BY-NC license. Author Paul Wayper.
Main index
/ tbfw/
- © 2004-2023
Paul Wayper
Valid HTML5