Too Busy For Words - the PaulWay Blog

Wed 11th Oct, 2006

Uncompressed bzip2 file sizes the easy way

Caching. It's all about caching. If you've got some result that took some time to calculate, save it for later use.

In this case, I'm referring to the uncompressed file size of a bzip2 file. A gzip file is easy - it's the last four bytes of the file, stored as an unsigned little-endian 32-bit integer (so strictly it's the size modulo 2^32, and it wraps around for anything of 4GB or more). But Julian Seward, in his wisdom, didn't put such information in the bzip2 format. So the only way to determine the uncompressed size of a bzip2 file is to actually decompress it. In CFile I do this by opening a pipe to 'bzcat filename | wc -c', which is probably inefficient or insecure in some way, but I reckon does the job better than me rewriting a buffered reader in CFile for the purpose. It means that if you're reading a bzip2 file, you have a potentially long delay if you want to know the file size beforehand. (If you don't care how long the file is before reading it, then you won't incur this time overhead.)
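To make the gzip shortcut concrete, here is a minimal sketch of reading that trailing size field. The function name gzip_isize is made up for this example and is not part of CFile; it also assumes the file contains a single gzip member, since concatenated members each carry their own trailer.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not CFile's code: read the last four bytes of a
 * gzip file as an unsigned little-endian 32-bit integer.
 * Returns 0 and stores the value in *size on success, -1 on any error.
 * The stored value is the uncompressed size modulo 2^32.
 */
int gzip_isize(const char *path, uint32_t *size)
{
    unsigned char b[4];
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;

    /* Seek to four bytes before the end; this fails on files shorter than that. */
    if (fseek(f, -4L, SEEK_END) != 0 || fread(b, 1, sizeof b, f) != sizeof b) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Assemble the bytes by hand so the result doesn't depend on host endianness. */
    *size = (uint32_t)b[0]
          | ((uint32_t)b[1] << 8)
          | ((uint32_t)b[2] << 16)
          | ((uint32_t)b[3] << 24);
    return 0;
}
```

There is no equivalent for bzip2, which is why the pipe to bzcat is needed there.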

So: how do we cache this value for later? In a filesystem extended user attribute, that's how! There's just one minor problem: if you rewrite an existing file, then fopen() and its logical descendants will truncate and rewrite the file, rather than deleting the inode and then creating a new file. Which means that the extended attribute stays where it was, and now holds the uncompressed size of the old contents. To solve this, we write a timestamp in another attribute which records the time that the size was calculated - if the file's modification time is later than this, then the file size attribute is out of date.
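A rough sketch of that scheme, using the Linux calls in <sys/xattr.h>, might look like the following. It isn't CFile's code: the attribute names, the function names and the choice to store both values as decimal text are all assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <time.h>

/* Hypothetical attribute names, chosen for this example only. */
#define SIZE_ATTR  "user.cfile.uncompressed_size"
#define STAMP_ATTR "user.cfile.size_calculated_at"

/* Read an attribute holding a decimal number. Returns 0 on success. */
static int read_number_attr(const char *path, const char *name, long long *out)
{
    char buf[32];
    char *end;
    ssize_t len = getxattr(path, name, buf, sizeof buf - 1);

    if (len <= 0)
        return -1;              /* missing, empty, too long or unsupported */
    buf[len] = '\0';
    *out = strtoll(buf, &end, 10);
    return (end == buf || *end != '\0') ? -1 : 0;
}

/* Write a number as decimal text into an attribute. Returns 0 on success. */
static int write_number_attr(const char *path, const char *name, long long value)
{
    char buf[32];
    int len = snprintf(buf, sizeof buf, "%lld", value);

    return setxattr(path, name, buf, (size_t)len, 0);
}

/* Save the calculated size together with the time it was calculated. */
int store_cached_size(const char *path, long long size)
{
    if (write_number_attr(path, SIZE_ATTR, size) != 0)
        return -1;
    return write_number_attr(path, STAMP_ATTR, (long long)time(NULL));
}

/* Return the cached size, or -1 if there is no cache or it is out of date. */
long long lookup_cached_size(const char *path)
{
    struct stat st;
    long long size, stamp;

    if (stat(path, &st) != 0)
        return -1;
    if (read_number_attr(path, STAMP_ATTR, &stamp) != 0)
        return -1;
    if ((long long)st.st_mtime > stamp)
        return -1;              /* file modified after the size was calculated */
    if (read_number_attr(path, SIZE_ATTR, &size) != 0)
        return -1;
    return size;
}
```

A caller would try lookup_cached_size() first and, on -1, fall back to decompressing and then call store_cached_size(). Setting an attribute changes the file's ctime but not its mtime, so writing the cache doesn't invalidate it. With whole-second timestamps, a rewrite in the same second as the calculation would go unnoticed.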

(If the file system doesn't support user extended attributes, then we bork out nicely and calculate the file size from scratch again.)

Of course, you need user extended attributes on your file system. I thought this would be already available in most modern Linux kernels, but no! You have to explicitly turn it on by putting user_xattr in the options section of the file system's mount line in /etc/fstab first. Fortunately, you can do mount -n -o remount / to remount your root file system with the new options - as is so often the case with Linux, you don't have to reboot to set low-level operating system parameters! Yay us! Ext2, Ext3, XFS and ReiserFS all support extended user attributes, too. Once you've done this, you can test it with something like setfattr -n user.foo -v bar file to set an attribute and getfattr -d file to list the file's attributes. You have to use 'user.' as a prefix to specify you want the user attribute namespace. (The older attr tool works in the user namespace by default and adds the prefix itself, so there it's attr -s foo -V bar file and attr -l file.)
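The same test can be made from C, which is also roughly how a program can decide whether to bork out and recalculate. This is only a sketch under assumptions: the function name and the probe attribute name are invented for the example, and it relies on the Linux setxattr() and removexattr() calls from <sys/xattr.h>.

```c
#include <errno.h>
#include <sys/xattr.h>

/* Hypothetical attribute name used only to probe for support. */
#define PROBE_ATTR "user.cfile.probe"

/*
 * Returns 1 if a user attribute can be set on the file at path,
 * 0 if the file system reports that attributes are not supported
 * (for example, mounted without user_xattr), and -1 on any other
 * error such as a missing file or lack of permission.
 */
int user_xattr_supported(const char *path)
{
    if (setxattr(path, PROBE_ATTR, "1", 1, 0) == 0) {
        /* Clean up so the probe leaves nothing behind. */
        removexattr(path, PROBE_ATTR);
        return 1;
    }
    if (errno == ENOTSUP)
        return 0;
    return -1;
}
```

A result of 0 is the cue to skip the cache entirely and calculate the size the slow way.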

So, now to write the actual code!



All posts licensed under the CC-BY-NC license. Author Paul Wayper.


© 2004-2023 Paul Wayper