Too Busy For Words - the PaulWay Blog

Large Memory Structures in C

I wrote up a simple program that calculated how much memory would be used by my data structures under various length and fill ratio parameters. When I go past length 12, it looks like the previous 'Rainer' structure I was working on - like a tree, but with four bases per node rather than one (still keeping individual level counts) - suddenly gets to around 2^24 nodes, which (at about 2.5K per node) comprehensively doesn't fit in memory.

Length 12 has 1 + 256 + 65536 nodes, for a total of around 171MB in structure. That provides direct access to 16.7 million unique subsequences, much more than my usual number of around 3 million at length 12 or above. The main problem is to construct a suffix tree from there without chewing up lots of nodes (and hence adding lots of talloc overhead). If I was just counting only the subsequences of a particular length, the whole storage thing would be more of a moot point, but I can't get away from the feeling that to have to rescan the sequence or reload the file to calculate each different length is a wasteful process.

Now to find why taking a linked list of string fragments and mangling them into a complete single string is taking so long (it's only copying about 29K/sec)...

Last updated: | path: tech / c | permanent link to this entry

Too Busy For Words - the PaulWay Blog

Tue 30th May, 2006

Large Memory Structures in C