Length 12 has 1 + 256 + 65536 nodes, for a total of around 171MB in structure. That provides direct access to 16.7 million unique subsequences, much more than my usual number of around 3 million at length 12 or above. The main problem is to construct a suffix tree from there without chewing up lots of nodes (and hence adding lots of talloc overhead). If I was just counting only the subsequences of a particular length, the whole storage thing would be more of a moot point, but I can't get away from the feeling that to have to rescan the sequence or reload the file to calculate each different length is a wasteful process.
Now to find why taking a linked list of string fragments and mangling them into a complete single string is taking so long (it's only copying about 29K/sec)...
Last updated: | path: tech / c | permanent link to this entry
All posts licensed under the CC-BY-NC license. Author Paul Wayper.
Main index
/ tbfw/
- © 2004-2023
Paul Wayper
Valid HTML5