Too Busy For Words - the PaulWay Blog

Wed 15th Jan, 2014

Ignorable compression

On the way home from LCA, on a whim, I started adding support for LZO compression to CFile in Perth.

This turned out to have unexpected complications: while liblzo supports the wide variety of compression methods all grouped together as "LZO", it does not actually create '.lzo' files. That's because '.lzo' files also have a special header, added checksums, and file content listings, a bit like a tar file. All of that is handled within the 'lzop' program - there is no external library for reading or writing .lzo files in the way that zlib handles .gz files.
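To make the distinction concrete, here's roughly what talking to liblzo directly looks like - a sketch only, not CFile code, with the output buffer sized to liblzo's documented worst case:

#include <stdio.h>
#include <string.h>
#include <lzo/lzo1x.h>

int main(void)
{
    unsigned char in[4096];
    unsigned char out[4096 + 4096 / 16 + 64 + 3];  /* worst-case growth */
    unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
    lzo_uint in_len = sizeof(in), out_len;

    if (lzo_init() != LZO_E_OK)
        return 1;
    memset(in, 'x', sizeof(in));
    if (lzo1x_1_compress(in, in_len, out, &out_len, wrkmem) != LZO_E_OK)
        return 1;
    /* 'out' now holds a raw LZO block: no magic number, no checksums,
     * no file table - lzop adds all of those itself. */
    printf("%lu -> %lu bytes\n", (unsigned long)in_len,
           (unsigned long)out_len);
    return 0;
}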

Now, I see three options here:

Yeah, I'm going for option one there.

LZO is a special case: it does a reasonable job of compression - not quite as good as standard gzip - but its memory requirements for compression can be minuscule and its decompression speed is very fast. It might work well for compression inside the file system, and it is commonly used in consoles and embedded computers for reading compressed data. But for most common situations, even on mobile phones, I imagine gzip is still reasonably quick and produces smaller compressed output.

Now to put all the LZO work in a separate git branch and leave it as a warning to others.


Thu 17th Nov, 2011

Adding Cans to C

I've been meaning to copy some of my personal C libraries to CCAN, Rusty Russell's C Code Archive Network. It's not yet quite as comprehensive as he would like, I suspect, but it's certainly a good project. And I think, bold as this may be, that I have something to offer it even if I'm not a full-time C programmer.

The thing that's scared me off is the whole "meeting someone else's standards" thing. So after Rusty's talk at OSDC this year, and finding out that 'ccanlint' can prompt you with what you need to do to make a good package, I decided to give it a go. And after hitting a few minor problems understanding exactly what I needed to do to get it working, I decided to write the process down here for other people.

  1. Check out the Git repository:
    git clone git://git.ozlabs.org/~ccan/ccan ; cd ccan
  2. Make everything:
    make
  3. Make the thing that isn't included in everything that you really need:
    make tools/ccanlint/ccanlint
  4. Create a directory for your work and put the already-written stuff in there:
    mkdir ccan/cfile; cd ccan/cfile; cp ~/coding/pwlib-c/cfile/cfile* .
  5. Save yourself some typing: export CCT=../../tools; export CCL=$CCT/ccanlint/ccanlint
  6. Generate a standard configuration file for the project:
    $CCT/configurator/configurator > config.h
  7. Check what you've got so far:
    $CCL
    ccanlint will ask you if you want to create a _info file. This is the file which declares the metadata about a CCAN module; it's actually a small C program (see the example after this list). Let ccanlint generate it.
  8. Check what you've got so far:
    $CCL
    Keep repeating this step, fixing what ccanlint tells you is wrong, until you get a module in some form.
  9. Submit the module for inclusion in CCAN:
    I haven't done this yet.
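For reference, the generated _info file ends up looking something like this (the summary text and the talloc dependency here are illustrative, not the final module's):

/**
 * cfile - compressed file read/write library
 *
 * This code lets you read and write compressed files using the same
 * calls as for ordinary files.
 *
 * License: LGPL
 * Author: Paul Wayper <paulway@mabula.net>
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if (argc != 2)
        return 1;
    /* ccanlint runs this with "depends" to get the dependency list. */
    if (strcmp(argv[1], "depends") == 0) {
        printf("ccan/talloc\n");
        return 0;
    }
    return 1;
}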


Mon 26th Nov, 2007

Saves some typing?

I had an idea on Friday for a utility that fills a little niche I hit regularly. The particular example was wanting to save the iptables configuration after a couple of updates. This goes (on standard Red Hat boxes) in /etc/sysconfig/iptables, and I keep copies named /etc/sysconfig/iptables.yyyymmdd (where yyyymmdd is the current year, month and day) in case a change breaks something and I need to go back to a previous version. Other people use revision control systems like Mercurial for this, with their ability to track edits to a file that isn't in a pre-set directory. I may be old-fashioned here, but this method does me fine. Normally, in order to roll the configuration over to the new version, you would do:

mv /etc/sysconfig/iptables /etc/sysconfig/iptables.yyyymmdd
iptables-save > /etc/sysconfig/iptables

But what if you'd already done one edit today? Then you'd use a name like /etc/sysconfig/iptables.yyyymmdd.inc, where inc is an incrementing number, and you want that number to keep incrementing until it finds a 'free' name. The usual convention for log files is to roll each file down: /etc/sysconfig/iptables.yyyymmdd becomes /etc/sysconfig/iptables.yyyymmdd.1, /etc/sysconfig/iptables.yyyymmdd.1 becomes /etc/sysconfig/iptables.yyyymmdd.2, and so forth. I usually end up putting the latest revision at the end of the sequence rather than the earliest.

Now, of course, it would be relatively simple to do that renaming automatically given the base file name. Cafuego coded up a Bash one-liner in half an hour or so, and Debian already has the savelog utility to do just this (a fact I found out much later, not being a Debian user). However, that only really does half the job. We still end up with:

savelog /etc/sysconfig/iptables
iptables-save > /etc/sysconfig/iptables

That's one repetition of that annoying path too many, with its hostile tab-unfriendly sysconfig directory, for my taste. I realised that what I wanted was something like:

iptables-save | roll /etc/sysconfig/iptables

that would both roll the old file out of the way and then 'redirect' standard input to the target file. Again, a relatively short piece of work in Perl or Bash. But do you really want to call up all that framework just to roll one file over? I resolved to learn a bit more and do it in C. Not only that, but I'd forswear my usual use of the talloc library and do it as raw as possible.
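The core of the idea is small enough to sketch here - this is the shape of the thing rather than the actual roll.c, it only does the move-to-last method, and the error handling is minimal:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char rolled[4096];
    int n = 1, c;
    FILE *out;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    /* Find the first free name of the form file.N... */
    do {
        snprintf(rolled, sizeof(rolled), "%s.%d", argv[1], n++);
    } while (access(rolled, F_OK) == 0);
    /* ...move the current file there (it may not exist yet)... */
    rename(argv[1], rolled);
    /* ...and copy standard input into a fresh file of the old name. */
    out = fopen(argv[1], "w");
    if (!out) {
        perror(argv[1]);
        return 1;
    }
    while ((c = getchar()) != EOF)
        fputc(c, out);
    fclose(out);
    return 0;
}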

It took a day, but by the end of it I had coded up the first working version. I presented it to the gallery on the #linux-aus IRC channel on Freenode, and Cafuego pointed out that I'd only implemented the all-move-down method, not the move-to-last method. A bit more work that night added that. A bit more work with valgrind found a couple of memory leaks and odd overwrites. More work today put the command-line option processing in place and corrected the move-to-last method to not only work but be more efficient in the process.

So, now I release to the wider Linux and Unix community the roll command. You can find the source code at http://mabula.net/code/roll.c and check it out via Subversion through svn://tangram.dnsalias.net/roll/trunk. Comments, criticisms and suggestions as always are welcomed via paulway@mabula.net. Of course, the irony is that I could have written that mv /etc/sysconfig/iptables /etc/sysconfig/iptables.20071123 command by now...


Wed 11th Oct, 2006

Come up and see my library.

I'm now quite experienced in using Doxygen to add documentation to code. I've written pretty much all of the documentation for the CFile and progress libraries, and pwlib is pretty small so it'll be easier to do. Now: how do I get everyone to have a look at it?
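(For anyone who hasn't seen Doxygen markup: you write specially marked comments above each declaration and Doxygen does the rest. A made-up CFile-flavoured example - not necessarily the library's real signature:)

/** \brief Open a file, transparently handling compression.
 *
 *  The file type is determined from the extension, so a .gz file is
 *  read through zlib and a plain file through stdio.
 *
 *  \param name  path of the file to open
 *  \param mode  fopen(3)-style mode string
 *  \return a CFile handle, or NULL on failure
 */
cfile *cfile_open(const char *name, const char *mode);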

Well, step 1: put it on my own web site. That's relatively easy: those sons of fun at Ace Hosting have Subversion installed, so I just check the source code out in a directory underneath my public HTML directory. I also added a snippet of .htaccess to prevent people looking in the .svn directories thus created, more to stop undue confusion and anxiety than any real security threat. Step 2 was a little more work: building the Doxygen HTML documentation. I'll have to get Ace Hosting to install Doxygen, but in the meantime I compiled it and ran it from my own bin directory. LaTeX is installed, so Doxygen can automatically generate the formulas as .png files and link them in. Hooray!

The third step is to make this available as more than just a directory index. This probably means either kludging some Doxygen documentation together to contain it, or writing a script that scans through the directories inside my /code directory and creates a nice piece of HTML from a template, listing all the source files and linking to the Doxygen documentation directly. I'll probably just write a templater; who needs another reason to reimplement the wheel? Although - maybe there's some clever way that Doxygen does this already - I should look into its recursive options...

Anyway, for now, you can see what's done so far in the /code directory. So far only the CFile and progress libraries are documented, but have a browse anyway. Genuflections, derisive laughter, offers of money, peanuts, etc. can be sent here. Of course, if you want to check out the Subversion code and play with it that way, you can retrieve it from svn://tangram.dnsalias.net/pwlib-c/.


Uncompressed bzip2 file sizes the easy way

Caching. It's all about caching. If you've got some result that took time to calculate, save it for later use.

In this case, I'm referring to the uncompressed file size of a bzip2 file. A gzip file is easy: the last four bytes are the uncompressed size as a little-endian unsigned integer (modulo 2^32). But Julian Seward, in his wisdom, didn't put such information in the bzip2 format. So the only way to determine the uncompressed size of a bzip2 file is to actually decompress it. In CFile I do this by opening a pipe to 'bzcat filename | wc -c', which is probably inefficient or insecure in some way, but I reckon it does the job better than me rewriting a buffered reader in CFile for the purpose. It means that if you're reading a bzip2 file, you have a potentially long delay if you want to know the file size beforehand. (If you don't care how long the file is before reading it, you won't incur this time overhead.)

So: how do we cache this value for later? In a filesystem extended user attribute, that's how! There's just one minor problem: if you rewrite an existing file, fopen() and its logical descendants will truncate and rewrite the file rather than deleting the inode and creating a new one. That means the extended attribute stays where it was, and now refers to the uncompressed size of the wrong file. To solve this, we write a timestamp in another attribute recording when the size was calculated - if the file's modification time is later than this, the file size attribute is out of date.

(Of course, if the file system doesn't support user extended attributes, then we bork out nicely and calculate the file size from scratch again.)
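In sketch form, using the Linux xattr calls (the attribute names here are illustrative, not necessarily the ones CFile uses):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/xattr.h>

/* Return the cached uncompressed size, or -1 if absent or stale. */
long cached_size(const char *path)
{
    char buf[32];
    struct stat st;
    ssize_t len;

    if (stat(path, &st) != 0)
        return -1;
    len = getxattr(path, "user.cfile_size_time", buf, sizeof(buf) - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0';
    if (st.st_mtime > (time_t)strtol(buf, NULL, 10))
        return -1;   /* file modified since the size was cached */
    len = getxattr(path, "user.cfile_size", buf, sizeof(buf) - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0';
    return strtol(buf, NULL, 10);
}

/* Store the size, plus the time at which it was calculated. */
void store_size(const char *path, long size)
{
    char buf[32];
    int n;

    n = snprintf(buf, sizeof(buf), "%ld", size);
    setxattr(path, "user.cfile_size", buf, n, 0);
    n = snprintf(buf, sizeof(buf), "%ld", (long)time(NULL));
    setxattr(path, "user.cfile_size_time", buf, n, 0);
}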

Of course, you need user extended attributes enabled on your file system. I thought this would already be the case in most modern Linux kernels, but no! You have to explicitly turn it on by putting user_xattr in the options section of the file system's mount line in /etc/fstab first. Fortunately, you can do mount -n -o remount / to remount your root file system with the new options - as is so often the case with Linux, you don't have to reboot to set low-level operating system parameters! Yay us! Ext2, Ext3, XFS and ReiserFS all support extended user attributes, too. Once you've done this, you can test it by setting an attribute with something like setfattr -n user.foo -v bar file and listing attributes with getfattr -d file. You have to use the 'user.' prefix to specify that you want the user attribute namespace.

So, now to write the actual code!


Mon 9th Oct, 2006

talloc and require

BTW, I found a gotcha in recent versions of the talloc library. To work on all those platforms that don't have all the standard and not-so-standard utility functions that talloc and Samba are built against, Samba and talloc now have a 'replace' library that implements them all. So in order to build talloc, you now have to check out the replace library too:

svn co svn://svnanon.samba.org/samba/branches/SAMBA_4_0/source/lib/replace

Then, in the new replace directory, do ./autogen.sh, ./configure, and make. You can do a make install to install the replace library, but talloc still seems to want direct access to the replace.o and snprintf.o files in the replace directory, so you then have to go back to talloc, run ./autogen.sh and ./configure, and link in the replace object files:

ln -s ../replace/replace.o
ln -s ../replace/snprintf.o

Then, and only then, can you run make in the talloc directory to build the new talloc library. Oh, and if your code sets a destructor function, you'll have to change the type of the argument it takes, as destructors are now correctly typed (rather than taking a void *).
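In other words, where a destructor used to take a void *, it's now declared against your own type - something like this sketch (the struct and names are mine, for illustration):

#include <stdio.h>
#include <talloc.h>

struct foo {
    int fd;
};

/* The destructor now takes a struct foo *, not a void *. */
static int foo_destructor(struct foo *f)
{
    printf("cleaning up foo with fd %d\n", f->fd);
    return 0;   /* 0 allows the free; -1 refuses it */
}

int main(void)
{
    void *ctx = talloc_new(NULL);
    struct foo *f = talloc(ctx, struct foo);

    f->fd = 42;
    talloc_set_destructor(f, foo_destructor);
    talloc_free(ctx);   /* the destructor fires here */
    return 0;
}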


CFile now does bzip2

Over the last couple of days I've actually been getting to do a fair bit of actual coding. It started with me adding the basics of bzip2 compression support to my CFile library. Then I decided to redo the Subversion repository for the libraries I've written so far (cfile, progress and pwlib) as separate directories, rather than the more standard but somewhat restrictive trunk/branches/tags structure that the Subversion book recommends.

Today I've added the rest of the bzip2 support, namely being able to read lines from bzip2 files. It involved copying an implementation of fgets that I found in stdio.c and implementing my own fgetc for bzip2 using array buffers. The hard part is detecting EOF, because it seems that the BZ2_bzerror routine doesn't actually return BZ_STREAM_END when the stream is at an end; it just returns BZ_OK. But BZ2_bzread will return 0 bytes if you're at the end of the file, so I detect that and return EOF accordingly.
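The buffered fgetc ends up looking something like this sketch (the structure is illustrative, not CFile's actual internals):

#include <stdio.h>
#include <bzlib.h>

#define BZ_BUFFER_SIZE 4096

struct bz_reader {
    BZFILE *bz;
    char buf[BZ_BUFFER_SIZE];
    int pos, len;
};

int bz_fgetc(struct bz_reader *f)
{
    if (f->pos >= f->len) {
        /* Refill the buffer; BZ2_bzread returning 0 (or less) is
         * the only reliable end-of-file indication. */
        f->len = BZ2_bzread(f->bz, f->buf, BZ_BUFFER_SIZE);
        f->pos = 0;
        if (f->len <= 0)
            return EOF;
    }
    return (unsigned char)f->buf[f->pos++];
}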

This also gave me the impetus to detect EOF correctly in the rest of the code, something that I hadn't implemented properly before. I'm still not sure I'm obeying whatever ANSI or POSIX guidelines there are on this subject, but the test-cat program I've written reports no differences between the original uncompressed file and the same file fed through CFile, so I'm assuming I'm doing something right.

My experience, given that I can be reading 1.5GB uncompressed sequence files, is that compressing the inputs and outputs saves not only space but time (to read and write the files). I noticed the other day that the GIMP also allows you to read and write files with .bz2 or .gz extensions as if they were the uncompressed images. Hopefully the CFile library will give that functionality to people who only want a dependency on talloc, rather than on half of the GIMP, an external compression program, and enough spare filesystem space for the uncompressed file...


Wed 4th Oct, 2006

Working with other people's code

I've long been a convert to the talloc library for memory allocation, after Tridge's talk mentioned it at LCA 2005. Granted, it's not a panacea for memory problems - you can still have memory leaking out your program's ears, but at least it's not completely unfreeable. And because it's typed and it does all the things you're supposed to do with memory, like zero the memory immediately after it's allocated and nullify the pointer immediately after it's freed, it does help you avoid the worst excesses of pointer mishandling.

For my own sake, I've written a little library that allows you to treat a gzip-compressed file (with the .gz extension) as being functionally identical to a normal file - you use the same library of functions to read or write to it (although not at the same time) regardless of whether it's compressed or not. When you open the file, the file type is determined automatically and the correct lower-level functions are used transparently. It also provides a useful 'getline' routine, which reallocates your string buffer (using talloc_realloc) on the fly to make enough space for whatever line you're reading.

Now, I've just embarked on a project to read and write information in a simple dictionary section/key/value format - much like a .ini file. So I've started looking at a parser library rather than reinventing this particular wheel (I know, I know, shock horror, has Paul actually learned to do some research?). But it doesn't use talloc and it definitely hasn't heard of my CFile library.

It seems likely that someone's going to suggest an object-oriented language like C++ here, so that I can extend the class. But as far as I can see this doesn't actually solve the problem. I don't want to add functionality to it (yet); I want to replace functionality that it relies on with other functions of my own choosing. That means Nicolas would have had to write his library with a file and memory abstraction class, so that I could extend those classes and have iniparser use the new functionality with no recompilation or reprogramming. What if he'd thought to abstract the file functions, but not the memory functions (after all, who expects not to use malloc in C)? I'd still be looking at rewriting part of his code. And, since I'm writing in C, it's all a bit of wishful thinking anyway.
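The nearest thing C itself offers is the zlib approach of taking a table of function pointers. If the library author had written against something like this entirely hypothetical table (iniparser has no such interface), a CFile- and talloc-backed table would drop straight in:

#include <stdio.h>
#include <stdlib.h>

struct io_ops {
    void *(*open)(const char *name, const char *mode);
    char *(*gets)(void *handle, char *buf, int len);
    int   (*close)(void *handle);
    void *(*alloc)(size_t size);
    void  (*free)(void *ptr);
};

/* Default implementations: plain stdio and malloc. */
static void *std_open(const char *name, const char *mode)
{
    return fopen(name, mode);
}
static char *std_gets(void *h, char *buf, int len)
{
    return fgets(buf, len, h);
}
static int std_close(void *h)
{
    return fclose(h);
}
static const struct io_ops stdio_ops = {
    std_open, std_gets, std_close, malloc, free
};

/* A parser written against the table never names fopen or malloc,
 * e.g. parse_ini("config.ini", &stdio_ops). */
void parse_ini(const char *name, const struct io_ops *io)
{
    void *f = io->open(name, "r");
    char *line = io->alloc(1024);

    while (f && line && io->gets(f, line, 1024)) {
        /* ...handle [section] and key = value lines here... */
    }
    io->free(line);
    if (f)
        io->close(f);
}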

So is there any way to do this? Do I have to modify Nicolas's code? I can search through the Samba codebase, because I'm sure they've implemented a .ini file reader in there, but I want to write the file too and maybe they haven't done that. And they won't be using my CFile library.

So is there a solution? Do we have to go to a new, revolutionary programming language, where somehow you not only override these and perhaps other even more basic functions, but do so knowing what other functions you're overriding and what guarantees you're providing to the code 'above'? Does such a thing exist?

Because you can bet Linus Torvalds' hair that there's no library so good, so all-encompassingly correct, that everyone will use it. talloc, for all its plusses, is not ideal in an environment where every byte may count; or at least there will be some people who will fight to the death to avoid using it. My CFile library is perhaps no-one's idea of good (or someone else would have done it way earlier, and I haven't seen one yet), but I reckon it solves a problem that's worth solving. It would be good to have ways of using these libraries without having to rework other people's code - otherwise the temptation is to just write everything yourself and save the trouble.


Fri 1st Sep, 2006

A new approach to the problem

I've revisited a program of mine that I've written for work. The basic idea is to read a set of DNA sequences and produce different files which list all the different 'subsequences' (jargon: oligonucleotides) of different lengths, and how many of each we saw. I presented this idea at a Programmers SIG a while back and the reply came back: write them out and use sort(1) and uniq(1) with its -c option. In my vanity I hadn't thought of this, and was instead making up great complex data structures and slab memory allocators and all sorts of junk. Then I was diverted to other projects; when I came back, all this looked like so much gibberish.

So I wrote a simple test and showed that, for an average set of virus sequences, I get files ranging between about 24MB (six-letter words, called 'six-mers' in molecular biologists' jargon) and 77MB (21-mers). Sure enough, sort and uniq produce what I want in a not inefficient manner. But I'd like to run this as a single standalone executable, much as that's against the Unix piping ethos. For efficiency reasons I generate all the files simultaneously, and the thought of forking off fifteen separate sort | uniq -c pipes makes me shudder. There must be a better way, I think.

The first obvious improvement is to keep the lists in memory and use some kind of in-memory sort function. The files contain about three and a half million words apiece, so it would be possible using qsort(3) to fill a 21MB array with six-mers (since you wouldn't have to store the null at the end). There's a setting in talloc to allow grabbing chunks of memory greater than 64MB in size, so doing even the 21-mers (77MB in size) would be possible using this method.

The problem, though, is that the way I generate the sequences is to generate all the different ranges simultaneously - doing it in one pass through the sequences. Storing all of them in arrays simultaneously would require 812MB (roughly), and this seems to not be much better than my previous dodgy tree implementation.

Then I realised: all the six-mers are just the prefixes of all the seven-mers, plus all the six-mers that didn't fit in seven-mers. This process applies right up the scale. So I could generate an array which contained the longest strings (null-terminated) available at each location (up to the maximum length required), and sort that. If I did that with fixed-length 32-byte strings (more than enough for all the things we've been doing so far) you'd use 112MB or so. That now contains an ordered list of all the strings of lengths between the minimum and maximum we're using. In order to extract all the N-mers, ignore all strings of length less than N, and take the first N characters of the remaining strings. They're still in order, so counting their frequency is a simple matter. You could even do this in parallel, in one pass across the array (which would increase cache read performance).
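A sketch of the extraction step, assuming the fixed-width NUL-terminated records have already been through qsort(3) (the 32-byte width is the one from above):

#include <stdio.h>
#include <string.h>

#define WIDTH 32   /* fixed record width, longer than the longest word */

/* Print each distinct N-mer and its count.  Records shorter than n
 * are skipped; because the array is sorted, equal n-byte prefixes
 * among the remaining records are contiguous. */
void count_nmers(const char *recs, size_t n_recs, size_t n)
{
    const char *prev = NULL;
    unsigned long count = 0;

    for (size_t i = 0; i < n_recs; i++) {
        const char *s = recs + i * WIDTH;
        if (strlen(s) < n)
            continue;       /* too short to contain an N-mer */
        if (prev && strncmp(prev, s, n) == 0) {
            count++;
        } else {
            if (prev)
                printf("%.*s\t%lu\n", (int)n, prev, count);
            prev = s;
            count = 1;
        }
    }
    if (prev)
        printf("%.*s\t%lu\n", (int)n, prev, count);
}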

Consider, for a moment, though, if you can't allocate huge arrays like that. So you have to break the single array into smaller, more manageable arrays, sort each in turn, and do a merge-condense to recombine each block. Which lends itself to parallelisation: each slave is given a contiguous chunk of the sequence (i.e. one with no non-sequence characters in it - like being given a paragraph), and breaks it up into words, sorts them, and then returns the data to the master (either in single messages, or better yet in chunks) for merging.

But think the sort process through: most common sort routines recursively descend through successively smaller chunks of the data, sorting each and then combining the sorted bits back into larger sorted chunks. Could this process not also combine elements that are the same? We might sort records of a string and a count (e.g. 30-byte strings and a two-byte count), starting each count at one; as each sort finds duplicated strings, they are combined and their counts added together. This would also compact the data storage as we go, which suggests it might be good to read each sequence into a separate block that is sorted and compacted independently (giving a certain granularity to the process), with the final merge happening N-way rather than two-way. If that's too complex, make each merge two-way, but do as many two-way merges as necessary to get all the data into one sorted array, which would then be compacted and contain the counts anyway.

The best part of this, AFAICS, is that it's also neatly parallelisable. I can even write an application which, if it finds a PVM machine set up, can distribute itself amongst the hosts neatly, but which can still operate in 'single thread' mode and do all the work itself if it doesn't. Which, in turn, means that as long as I keep in mind that at some stage I could distribute the workload, I won't end up having to substantially rewrite the program (as it was looking with the last method).

So, you see, I do do things other than rant and rave sometimes.


Fri 21st Jul, 2006

Indenting considered harmful

Or: If it compiles, it must be correct.

Michael Ellerman takes issue with my example of bad coding practice. It would seem that we both ultimately agree that indenting code is no guarantee that it does what you want - except in Python. Because his example, to wit:

if (x > y)
	exch(x, y);
	if (x)
		printf("hello!");
else
	x = 0;
is another example of coders assuming that indenting does what they mean. This code, according to Kernighan and Ritchie on page 56, will be compiled to mean:

if (x > y) {
    exch(x, y);
}
if (x) {
    printf("hello!");
} else {
    x = 0;
}
Which, though it will compile correctly and will not break if exch() is not defined as a do { } while (0) macro, is not, I'll wager, what Michael meant by his example (which is, I believe, that the else clause attaches to the first if). So I'm going to take his example as another proof of the worth of using braces, even if you think you'll only ever have one line of code in the 'then' case.

The other point, which I left implicit in my post but which Michael has pointed out, is that compilation failure in this situation is caused by bad coding practice that is not necessarily related to the line you just added. If the code was:

if (x > y)
    exch(x,y);
else
    exch(y,z);
And I tried to add the line exch(z,x); after exch(x,y);, then my addition would suddenly cause compilation to fail. But my statement is perfectly legitimate, and spending time looking at the macro of exch() isn't going to elucidate the problem. The problem is that the whole structure assumes that the then and else clauses are single statements; using the do { } while (0) hack doesn't solve this - it doesn't make the bad coding style good. If the if clause is complex, or the lines are more verbose, or the maintenance programmer isn't a language lawyer or is just simply lazy, then they may hit the compilation error and dike their code out to a separate statement:

if (x > y)
    exch(x,y);
else
    exch(y,z);
if (x > y)
    exch(z,x);
Which suddenly means a lot more hassle when the tested expression changes or exch changes meaning or the new value of y is suddenly less than x (because it was exchanged with z in the else clause, before). No Good Will Come Of It.

Maybe I should download the kernel source, remove some of these do { } while (0); hacks, fix up the relevant usages so they're properly braced (heh), and make myself an instant top ten kernel patch submitter... :-)


Thu 20th Jul, 2006

Kernel programming considered harmful

I happened across the Kernel Newbies Wiki, a sensible attempt to encourage people new to kernel programming to get started, even if they're not going to be diving into hardcore device driver writing immediately. Reading the FAQ, I found a reference to the question "Why do so many #defines in kernel files use a do { <code> } while(0) construct?" This piqued my interest.

The basic reason is that if you have a define like:

#define exch(x,y) { int t; t = x; x = y; y = t; }
And you use it in code such as:

if (x > y) 
    exch(x,y);
else
    x = 0;
Then this gets expanded to:

if (x > y)
    { int t; t = x; x = y; y = t; }; /* end of if statement here */
else                                 /* dangling else */
    x = 0;
Now, yes, this is bad. But to me it's bad because the programmer has fallen for C's lazy statement syntax and assumed that the 'then' clause of the if statement is always going to be a single statement. This type of construct fails regularly because if you have to add some other statement to that if/then/else statement, like:

if (x > y) 
    exch(x,y);
    call_foobar(x,y);
else
    x = 0;
Then you've fallen into the same trap. Once again, the exch() call completes the if statement, the call_foobar() call would be executed unconditionally, and the now-dangling else won't compile at all. Indenting in this case is worse than a sham: it actively deceives the programmer into thinking that the logic will work when it won't. Of course, if the programmer had initially written:

if (x > y) {
    exch(x,y);
} else {
    x = 0;
}
Then it would all make sense: the #define would work, and the extra statements added in would work too. What's a couple of extra characters and an extra brace line, compared to many hours hunting down an elusive bug? Lazy programming is good in certain circumstances, but it doesn't excuse bad style or the hubris of thinking you'll never have to revise or debug this piece of code.
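For completeness, the kernel construct the FAQ was asking about writes the macro body as a do { } while (0) block, which acts as a single statement and swallows the trailing semicolon:

#define exch(x,y) do { int t; t = x; x = y; y = t; } while (0)

/* Now the unbraced if/else at least compiles and means what it
 * appears to mean: */
if (x > y)
    exch(x,y);
else
    x = 0;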

I know I'll have already put some people offside by commenting on this coding practice. But if and when I ever have to write something for the kernel I'll be using a sensible amount of braces, thank you.


Tue 6th Jun, 2006

Calculating a correct average, thrashed to death

Regarding this 'signed integer average of two large numbers' problem, I would have thought that a neater solution would be:

low + (high - low) / 2

Works for signed and unsigned numbers, and also works when you have low negative and high positive (i.e. your range spans the negative integers), provided the width high - low itself fits in the type. Of course, spanning negatives isn't ever going to occur with array indices, but it's still a handy property for other uses.
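A quick demonstration - the obvious (low + high) / 2 would overflow here, so it's left as a comment:

#include <stdio.h>

int main(void)
{
    int low = 2000000000, high = 2100000000;

    /* (low + high) / 2 would overflow a 32-bit int. */
    printf("%d\n", low + (high - low) / 2);   /* prints 2050000000 */
    return 0;
}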


Wed 31st May, 2006

String List Fixed, Head Screwed On

I think file formats are the bane of programmers everywhere. Certainly a lot of my troubles with my current program stem from one particularly bogus format - the CLUSTAL 'aligned' format (extension .ALN).

Normal nucleotide and protein sequences have a structure like this (this is FASTA, for example, although PIR is almost identical and GENBANK, while more verbose, has a similar structure):

>Sequence 1 name
AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA
CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT
>Sequence 2 name
CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG
GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGT
I.e. a sequence name with a specific prefix, then everything from there until the next name line is part of that sequence. Clustal is a program for displaying 'aligned' sequences - where special 'zero-length' gaps are inserted into sequences to make them 'line up'. The 'lining up' process is a black art unto itself, but we won't worry about that here. Clustal displays the sequences as rows and the bases aligned in columns, and I shudder to think what hideous perversion caused the programmers to think that this was a more efficient structure to retrieve their data in:
CLUSTALW 1.8.2 format header

Sequence_1_name         AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA
Sequence_2_name         CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG

Sequence_1_name         CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT
Sequence_2_name         GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGT
Even with a memory limit much smaller than the size of the file, I can think of at least two ways to quickly and easily retrieve the data for any 'rectangular area' of sequence from a FASTA or PIR file. You might need to keep a list of byte offsets where each sequence's data starts in the file, but unless you're doing something really evil (i.e. calculating the offset for a particular sequence from the line length and the header length alone) that's what you'd have to do anyway. So the format has gained nothing, at the expense of sequence name readability (did I mention that it's traditional in ALN files to smash the name down to the first word, so you get lots of sequences with names like GI:37592756 and no clue about what they actually are...). Goody.

For me, it means that I have to read the entire file into memory, appending text to each sequence in turn (and I make no assumptions about whether the sequence order changes per 'block', either...). My sequences are stored as a linked list of strings - i.e. the Stringlist structure has one pointer to a char and one pointer to the next Stringlist. To make appending slightly less stupid, we keep pointers to both the first and last Stringlist structures in the list, so appending doesn't require traversing the list every time. That was a big speed-up over my stupid first implementation right there...

The only problem is that this method has fallen foul of the talloc block overhead. Each string added incurs the talloc overhead twice: once for the linked list structure and once for the string copy - the former being only sixteen bytes in size (on my 64-bit machine). Not only that, but in order to use the result as one string we have to allocate a new string with enough space to hold the entire thing, copy each of the list strings in turn, and then free the entire list structure (which is handled neatly by talloc, but still means, effectively, a complete list traversal).

The solution was obvious in hindsight: each Stringlist structure now contains an array of 1024 chars, into which strings are agglomerated. When we run out of characters in the last block, we simply create a new block and put the string into that. So, OK, each block in the list may not be 100% full, but it doesn't really need to be - we're already getting a huge saving in talloc overhead, list traversal and string copying.
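In sketch form, with illustrative field names, and assuming no single string is longer than a block:

#include <string.h>
#include <talloc.h>

#define SL_BLOCK 1024

struct stringlist {
    struct stringlist *next;
    int used;
    char text[SL_BLOCK];
};

struct stringlist_head {
    struct stringlist *first, *last;
};

/* Agglomerate a string into the last block, only allocating a new
 * block (and paying the talloc overhead) when it won't fit.
 * Assumes len < SL_BLOCK. */
void sl_append(TALLOC_CTX *ctx, struct stringlist_head *h, const char *s)
{
    int len = (int)strlen(s);
    struct stringlist *b = h->last;

    if (!b || b->used + len >= SL_BLOCK) {
        b = talloc_zero(ctx, struct stringlist);
        if (h->last)
            h->last->next = b;
        else
            h->first = b;
        h->last = b;
    }
    memcpy(b->text + b->used, s, len);
    b->used += len;
}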

The 'stringification' process above now doesn't reorganise the linked list. But that's OK, since we pretty much only use the stringified sequence once, and the caller was getting a talloc_strdup'ed copy of it anyway. So in a way the code just got simpler... and that can only be good. (And we're probably about to discard the sequence from its storage structure anyway...)

Now to work out the finer points of the Rainer structure. Block agglomeration is good!


Tue 30th May, 2006

Large Memory Structures in C

I wrote up a simple program that calculated how much memory would be used by my data structures under various length and fill ratio parameters. When I go past length 12, it looks like the previous 'Rainer' structure I was working on - like a tree, but with four bases per node rather than one (still keeping individual level counts) - suddenly gets to around 2^24 nodes, which (at about 2.5K per node) comprehensively doesn't fit in memory.

Length 12 has 1 + 256 + 65536 nodes, for a total of around 171MB in structure. That provides direct access to 16.7 million unique subsequences, much more than my usual number of around 3 million at length 12 or above. The main problem is to construct a suffix tree from there without chewing up lots of nodes (and hence adding lots of talloc overhead). If I was just counting only the subsequences of a particular length, the whole storage thing would be more of a moot point, but I can't get away from the feeling that to have to rescan the sequence or reload the file to calculate each different length is a wasteful process.

Now to find why taking a linked list of string fragments and mangling them into a complete single string is taking so long (it's only copying about 29K/sec)...


Fri 26th May, 2006

There's Nothing Like Learning Experiences...

And I'm beginning to fear that my sequence counter with the tree structure shows no learning experience happening. I chatted with the knowledgeable Rainer Klein last night, and he and I explored different ways of storing the data. He pointed out options I hadn't considered, like suffix trees and hash tables - options I'd discarded because, when I originally explored this problem in Perl, hash tables and trees were running out of memory much faster than the current C incarnation.

Of course, Perl's hash functions are designed for any string; knowing that I'm only going to be dealing with A, C, G, and T (or, when using amino acids, A-Z minus B, J, O, U, X, and Z) does make things simpler. For nucleotides (the former case) I can pack four bases into one byte for an instant hash. Or, using the same four bytes, have an array of 256 pointers to further structures. Using this 'four-ply' structure saves me creation time, talloc overhead, and traversal time. And if I think about it a bit, there's probably a clever way to do some 32-bit multiply, shift or masking operation that takes four chars known to be ACGT and turns them into the correct lookup in one swell foop (I discovered, conveniently, that (base >> 1) & 3 yields A = 00, C = 01, G = 11, and T = 10, and works for lower case too).
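That packing trick, in sketch form:

/* Pack four bases into one byte using (base >> 1) & 3, giving a
 * direct index into a 256-entry array.  Assumes the input really
 * is ACGT (or acgt). */
unsigned char pack4(const char *seq)
{
    unsigned char b = 0;
    int i;

    for (i = 0; i < 4; i++)
        b = (unsigned char)((b << 2) | ((seq[i] >> 1) & 3));
    return b;
}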

My 21-long subsequence count had 2.8 million unique sequences. I realised after talking to Rainer that if I stored them as a massive array of string pointers I'd have 22 bytes per string plus 8 for the pointer (on a 64-bit architecture) for 40 bytes per string, or 114MB. Much less than 1.5GB, eh? Of course, the search lookup time on 2.8 million array elements would be prohibitive, but it made me realise that my one-branch-per-character structure was inefficient. Rainer also made me realise that most of my strings after length 14 (at least for this data) were unique. So after that length, why not just store the remaining string? Or use a more generic suffix tree system, where adding ACCT to a tree with an ACGT suffix produces one AC node with two branches, CT and GT. This would again save on lookup time, talloc overhead and traversal time.

And, heaven forefend, I could even do some research and look up papers on suffix trees.

Resolution: I need to consider many new ways of doing things rather than just trying the first thing that pops into my head.


Tue 23rd May, 2006

Talloc, Don't Leave Me!

Since Tridge's talk in 2005 about the wonders of (among other things) talloc, and having Rusty give me a rundown of his own "use talloc or live in sin" talk from LCA 2006 (which, unfortunately, I was going to be elsewhere for), I've been using talloc in my C programming for work. While those still into endless pointer maintenance and doing everything by hand might scoff at something that almost-but-not-quite does garbage collection for you, it's been invaluable in saving me from writing and debugging a lot of tedious memory clean-up code.

Unfortunately, I have a minor hassle. I'm creating a tree/array hybrid structure - each node has four pointers to subnodes (because I'm dealing with nucleotide bases - A, C, G and T), and rather than have a single root I allocate an array of nodes at the 'minimum tree level'. The structure isn't very complex - it's basically six pointers and an unsigned long count - and even with eight-byte pointers it takes about 56 bytes in memory.

So imagine my chagrin when I discover that the talloc structure that's invisibly maintained behind all that is over twice the size of my tree structure.

This is a problem because, even with 4GB in the machine, the tree structure can easily take up 1GB. Even with a relatively small sample of virus sequences - only 4MB or so of uncompressed data - the tree out to length 21 is 29 million nodes, or 1,506MB. Triple that usage and we move to The Valley Of Eternal Disk Paging. It's a small comfort that the internal tally of memory used directly by the structures is exactly what it should be for the number of nodes actually allocated. And I'm aiming for much larger sequence sizes too - the shortest human gene is about 12MB in size. You can forget about trying to page that in, out and all about.

Of course, I need to consider moving beyond the bounds of RAM and into using disk - intelligently, not by creating a massive swap file. I can't expect 4GB to always be there, or for my requirements to always fit within the constraints of my memory limitations. So I need to do some research into alternative techniques.

The main avenue I'm working on is a 'slab' allocator, where 'slab' here means 'a chunk of 65536 structures'. Instead of each node having a pointer to its branches, it has an unsigned short slab number and an unsigned short position in that slab. Looking that up then requires a couple of relatively quick array lookups (depending on your implementation). Sure, it's not going to be as fast as direct pointer linking, but the talloc overhead goes way down, because suddenly you only have an overhead per slab rather than per node.
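A sketch of what that looks like (the names and layout are illustrative):

#include <talloc.h>

#define SLAB_NODES 65536

struct node {
    unsigned short branch_slab[4];   /* which slab each branch is in */
    unsigned short branch_pos[4];    /* and where in that slab */
    unsigned long count;
};

struct slab_pool {
    struct node **slabs;
    unsigned int n_slabs;
    unsigned int used_in_last;
};

/* Turn a (slab, position) reference back into a node pointer. */
static struct node *node_at(struct slab_pool *p,
                            unsigned short slab, unsigned short pos)
{
    return &p->slabs[slab][pos];
}

/* Hand out the next free node, adding a whole slab at a time, so the
 * talloc overhead is paid once per 65536 nodes rather than per node. */
static struct node *node_new(TALLOC_CTX *ctx, struct slab_pool *p,
                             unsigned short *slab, unsigned short *pos)
{
    if (p->n_slabs == 0 || p->used_in_last == SLAB_NODES) {
        p->slabs = talloc_realloc(ctx, p->slabs, struct node *,
                                  p->n_slabs + 1);
        p->slabs[p->n_slabs++] = talloc_array(ctx, struct node, SLAB_NODES);
        p->used_in_last = 0;
    }
    *slab = (unsigned short)(p->n_slabs - 1);
    *pos = (unsigned short)p->used_in_last++;
    return node_at(p, *slab, *pos);
}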

And I think I need to find out how mmap works...


Tue 16th May, 2006

Well, I am shocked!

Mainly, I'm shocked to see that nesting functions is not allowed in plain C. It seemed so... so harmless! Of course, after reading up a bit about closures and the upward and downward funarg problems, I can see that it's a bit more complex than I thought. But, really, when Pascal can do nested functions, is it that hard for C to do them? If you avoid the upward funarg problem (by never allowing a function to return a function), you avoid a whole class of problems by never needing to use a heap for argument and function execution record information, and therefore avoid having to do garbage collection, which I agree would be anathema in C.

I'm also a little shocked, Ian, that your 'trusted source' on whether nested functions are good or not is someone who just says "They're evil!" without a bit of justification or explanation. OK, they're evil in the current way gcc implements them, because they change behaviour depending on their contents and they break on architectures that don't support execution from stack. OK, they're evil because they're a perversion of the C standard. But I don't really like people just saying "They're evil!"; there are (I feel) plenty of good reasons why they should exist, so if I say "They're wonderful!" does it balance out?

I'm a tiny bit shocked, Cameron, that you could recommend C++ despite it not offering nested functions or closures either. What it does, in what seems to me to be a typical C++ "Let's adger the rules sufficiently to allow us to do what we want and who cares how difficult it is for newbies to understand" way, is allow you to overload the operator() function in order to have a sort of pseudo function object (because overloading something to do something else that you didn't quite expect is absolutely traditional and part of the C++ way, as far as I can see). This method doesn't even offer actual closure (i.e. encapsulating the environment around the function), though, which is all I really want in this instance. So C++ isn't really a solution to my problems, sadly.

I'm glad that my little example, which wasn't meant to be a criticism of C or gcc, has turned into two good articles which have further enlightened me (and hopefully a few others). And please don't take my comments on C++ to heart, anyone - I find it difficult to understand and overly complex, but I'm sure it's second nature to others. BTW, I did actually deduce that the problem was probably due to the fact that the Opteron architecture didn't allow execution of the code block for the passed function, but I didn't think that it would be on the stack. But this does explain how running under valgrind makes it work again - valgrind simply catches the creation of the function on the stack and puts it back in non-stack (i.e. executable) memory again.

Computers are weird things, eh?


Fri 12th May, 2006

The hidden trampoline in gcc

I found out today that if you pass a local function (a) as a parameter to another function (b) from an outer calling function (c), then gcc makes that local function a trampoline that's resolved at runtime, because AFAICS the function is built on the stack. (To reiterate: if from (c) I call function (b), which takes a function, and (c) passes (a) to (b), then the function pointer for (a) points onto the stack.) That's amusing, but if for some reason function (a) refers to variables local to (c), then the trampoline pointer goes haywire and ends up generating a segfault - at least on my Opteron work machine.

If you prefer, have a look at the code to see what I'm talking about.
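The linked example isn't reproduced here, but a reconstruction of its shape looks like this - nested functions are a GNU C extension, and the reference to 'calls' is what forces gcc to build the trampoline on the stack:

#include <stdio.h>

static void repeat(int n, void (*fn)(void))
{
    int i;
    for (i = 0; i < n; i++)
        fn();
}

int main(void)
{
    int calls = 0;          /* local state for the callback */

    void counter(void)      /* nested function: GNU C extension */
    {
        calls++;            /* refers to the caller's local variable */
    }

    repeat(5, counter);     /* may segfault on non-executable stacks */
    printf("called %d times\n", calls);
    return 0;
}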

I tried to trace this down, but found that valgrind silently fooled with this trampoline, turning it back into a pointer which was no longer off the end of the trolley. So when you run it under valgrind, it works perfectly. Excellent for debugging, and so practical too! gdb leaves the trampoline as is, which means that the jump to the local function fails under gdb without telling you much more.

At this point, everyone who codes in C is saying "well, don't do that then! Pass a global function." The reason why not is: local state. I want that local function (a) to have state across its use inside the calling function (c); (b) shouldn't have to know about this state, and I shouldn't have to pass an arbitrary void pointer in calls to (b) to store it (because if one (c)-like routine has different state for its internal (a)-function than another (c)-like routine, you can't just make (b) take a pointer to an integer - your second (c)-like function might need a character string as state, or a complex structure, or whatever).

A quick chat on the ##C channel on irc.freenode.net confirmed that other people see the segfault behaviour, so it's not just me or my version of gcc. It also confirmed that everyone thought I was mad - a situation that is familiar to me. Most people said "pass the state around in an arbitrary pointer", which I'm not keen on, or "pass the state around in global variables", which I'm even less keen on, or "it's totally non-portable", which both puzzles me and completely fails to answer the question. So that's been fun.

A more sophisticated example might be a search or sort routine: you pass an array to bsearch() or qsort(), as well as the bounds to search and a function which compares two elements. Inside this comparator you want to see how many times it's been called, but only for the most recent top-level call to bsearch(). It makes perfect sense for this call count to be a local variable of the caller, and not state in the function passed to bsearch().

Unfortunately I didn't find all this out until after the CLUG Programmers SIG meeting, otherwise I would have brought it up as a question there and given better minds than mine the opportunity to tell me how stupid and backward I'm being in wanting something so manifestly non-C-like as portability and opaque state...



All posts licensed under the CC-BY-NC license. Author Paul Wayper.


© 2004-2023 Paul Wayper