Too Busy For Words - the PaulWay Blog

Wed 13th Sep, 2006

Heap merge sort idea

Imagine you have a number of sorted lists that you want to merge together into a single sorted list. You're going to output this single sorted list (somewhere), and the input lists could be quite large, so you don't want to reserve double the memory to put them all together and sort them. So you're looking at an n-way merge, which is a fairly tedious algorithm: basically you have to scan the head of all the lists for the lowest element (i.e. n comparisons) and remove that element, and do this for every element in the lists. Oh, and you've also got to remember which lists have run out of elements, so you have further tests to do each time. Very quickly we're looking at a lot of excess work to get those list merged.

Now, I've been thinking recently about binary heaps. A heap is like a very relaxed binary tree: each node has two children, but the only condition on the children (the 'heap property') is that the value in the parent has to be less than the value of any of its children. I say 'less', but that's one type of heap known as a min-heap - the other is a max-heap, where the parents are greater than any of their children. Read up on the binary heap page linked above for more information.

To return to the problem: we want a better way of finding the smallest element of n lists. I thought: 'heaps'! Put the first elements of all the lists into an array and make it a heap. Then pull the first element off (it's the smallest), put the next element from that list into its place, and re-establish the heap condition. In order to know which list to pull the element from, each element in the heap contains the list numbers as well as the last element of that list. When you run out of one list, it'll be the first element in the heap - swap the last element of the heap with the first, truncate the heap by one element, and re-establish the 'heap property'.

The problem with heapsort is that you're swapping the last element of the heap with the first - i.e. one that's much lower in the heap for the highest. In this heap-merge method, the new element is much closer to the value of the root element, so my intuition is that you're going to do a lot fewer comparisons and swaps than if you'd had to do an equivalent heapsort, whatever that might be.

I did an experiment by hand to merge four lists of eight elements each; for 32 elements I did 23 swaps. So it's less than O(n) performance for that small sample, but I doubt it'd ever be O(log n). We've also had to expend at worst O(n log n) time sorting the lists, so that's the limiting order of complexity. But the main reason I'm interested in being able to sort a number of lists is that this leads to a parallelised version - each slave keeps its own sorted list and then sends them piece by piece to the master, which merges the lists back together with minimal extra processing and memory required.

So: test code, here we come!

Last updated: | path: tech | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.


Main index / tbfw/ - © 2004-2023 Paul Wayper
Valid HTML5 Valid CSS!