Too Busy For Words - the PaulWay Blog

Fri 26th May, 2006

There's Nothing Like Learning Experiences...

And I'm beginning to fear that my sequence counter with the tree structure shows no learning experience happening. I chatted with the knowledgeable Rainer Klein last night and he and I were exploring different ways of storing the data. He pointed out the options I hadn't considered, like suffix trees and hash tables - options I'd discarded because when I'd originally explored this problem, using Perl, hash tables and trees were running out of memory much faster than the current C incarnation.

Of course, Perl's hash functions are designed for any string; knowing that I'm only going to be dealing with A, C, G, and T (or, when using amino acids, A-Z minus B, J, O, U, X, and Z) does make things simpler. For nucleotides (the former case) I can pack four bases into one byte for an instant hash. Or, using the same four bytes, have an array of 256 pointers to further structures. Using this 'four-ply' structure saves me both creation, talloc overheads, and traversal time. And if I think about it a bit there's probably a clever way to do some 32-bit multiply, shift or masking operation that takes four chars known to be of ACGT and turns it into the correct lookup (I discovered, conveniently, that those (base>>1)&3 yields A = 00, C = 01, G = 11, and T = 10, and works for lower case too) in one swell foop.

My 21-long subsequence count had 2.8 million unique sequences. I realised after talking to Rainer that if I stored them as a massive array of string pointers I'd gave 22 bytes per string plus 8 for the pointer (on 64-bit architecture) for 40 bytes per string, or 114MB. Much less than 1.5GB, eh? Of course, the search lookup time on 2.8 million array elements would be prohibitive, but it made me realise that my one-branch-per-character structure was inefficient. Rainer also made me realise that most of my strings after length 14 (at least for this data) were unique. So after that length, why not just store the remaining string? Or use a more generic suffix tree system, where adding ACCT to a tree with an ACGT suffix produces one AC node with two branches, CT and GT. This would again save on lookup time, talloc overhead and traversal time.

And, heaven forefend, I could even do some research and lookup papers on suffix trees.

Resolution: I need to consider many new ways of doing thing rather than just trying the first thing that pops into my head.

Last updated: | path: tech / c | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.


Main index / tbfw/ - © 2004-2023 Paul Wayper
Valid HTML5 Valid CSS!