In this post I we will explain the algorithm which is implemented in heapq module. This is a very useful code if you want to have a list which has to be sorted, even if you add or pop out objects from it. Let us brifly ilustrate how it works. For this we will use the following function which prints a list in a tree form:
import math from cStringIO import StringIO def show_tree(tree, total_width=50, fill=' '): """Pretty-print a tree.""" output = StringIO() last_row = -1 for i, n in enumerate(tree): if i: row = int(math.floor(math.log(i+1, 2))) else: row = 0 if row != last_row: output.write('\n') columns = 2**row col_width = int(math.floor((total_width * 1.0) / columns)) output.write(str(n).center(col_width, fill)) last_row = row print output.getvalue() print '-' * total_width print return
>>>show_tree(range(1,16)) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 --------------------------------------------------
>>> import heapq >>> h=[6,5,4,3,2,1,0] >>> heapq.heapify(h) >>> h [0, 2, 1, 3, 5, 6, 4]
The first element is the smallest element of the list. To see how the order looks like let us draw the tree representation:
>>> show_tree(h) 0 2 1 3 5 6 4 ------------------------------------
The clue is that every parent is smaller than its childrens. When you pop the smallest element with the heapq.heappop() function it will also keep the order of the list. The most important thing is that the sorting algorithm is fast, and does not require to sort whole list as the function sort() would do. So let’s try:
>>> heapq.heappop(h) 0 >>> show_tree(h) 1 2 4 3 5 6 ------------------------------------
Now at the tree root there is 1 and still every parent is smaller than its children. How the tree was changed: after 0 was poped out the two elements in next row will want to jump to the root, the smaller one wins so 1 lands in the root, now the place of 1 is free and 6 and 4 play for its place, since 4 is smaller it takes the place of 1.
You can also insert elements into the heap data structure. The function heapq.heappush() will push an element onto the heap and sort it so that the heap order will be preserved:
>>> heapq.peahpush(1) >>> show_tree(h) 1 2 1 3 5 6 4 ------------------------------------
The algorithm adds the element to the end of the list and then promotes it untill it finds its place. There is also heapq.heappushpop() function which is a fast version of heapq.heappush() followed by heapq.heappop(), and heapq.heapreplace() which pops and returns the current smallest value and then adds a new item to the heap (it is also more efficient that heapq.heappop() followed by heapq.heappush()). Other functions that are included in the module:
- heapq.nlargest() - find the n-th largest element in an iterable object,
- heapq.nsmallest() - find the n-th smallest element in an iterable object,
- heapq.merge() - merge multiple sorted iterables into one sorted iterable object: it returns a generator, does not pull all the data into memory at once, and assumes that all the input interators are already sorted (an expert from its __doc__ string).
Let us come back to the binary tree.
As you will see in a moment binary trees have something to do with binary operations. First of all let us observe that at level 0 there is 2**0 elements, at level 1 – 2**1, at level n there are 2**n elements. All these numbers have a nice representation in binary form: 0b1, 0b10, 0b100, and so on ... . Let as draw the binary tree once again but this time let us label nodes with integers in the binary representation:
Now you can see that the binary shift operation (or integer division by 2) finds a parent of a node. For example the parent of 12=0b1100 is 6=0b110 also it is a parent of 13=0b1101. To find a child of a node it is enough to shift in the other direction (multiply by 2) or shift and add 1. Since Python uses 0-base indexing this two operations will look a little bit different when we use it to index nodes of a tree. The parent of node with index n will be (n-1)>>1: when we indexed the nodes starting from 1 it was just n>>1 hence now it is (n+1)>>1-1 and this is equal to (n-1)>>1: (n+1)>>1-1=(n+1)//2-1=(n-1)//2=(n-1)>>1. Child nodes of the node n are now: 2*(n+1)-1=2*n+1 (the left one) and 2*(n+1)-1+1=2*n+2 for the right one.
There are two basic function which are used to implement the heap heapq._siftdown() and heapq._siftup(). These function are the two basic blocks to implement all the other functions in the heapq module.
def _siftdown(heap, startpos, pos): newitem = heap[pos] # Follow the path to the root, moving parents down until finding a place # newitem fits. while pos > startpos: parentpos = (pos - 1) >> 1 parent = heap[parentpos] if cmp_lt(newitem, parent): heap[pos] = parent pos = parentpos continue break heap[pos] = newitem
The first argument of heapq._siftdown(): heap is a list. In first step it takes the element heap[pos] (which is equal to newitem) finds its parent (parentpos). Then the function :func:heapq.cmp_lt` is used to compare the value at ``pos with the value at its parent node parentpos. It returns True if the value at pos (which is newitem) is smaller than its parent.
def cmp_lt(x, y): # Use __lt__ if available; otherwise, try __le__. # In Py3.x, only __lt__ will be called. return (x < y) if hasattr(x, '__lt__') else (not y <= x)
If the value at pos is smaller than its parent the parent is moved down (this is why the function is called _siftdown). Now we move to the parent position and start over again: we check if the newitem is smaller than the parent if it is the parent is moved down. The process stops either if the parent is smaller than newitem in which case we put the newitem as its child. The while loop guards that we will not go beyond the startpos, which actually might be the ending position of the algorithm.
Let us ilustrate this with an example:
>>> h = [6, 2, 4, 3, 5, 1] >>> show_tree(h) 6 2 4 3 5 1 ------------------------------------ >>> heapq._siftdown(h,0,len(h)-1) 1 2 4 3 5 6 ------------------------------------
Here the while loop will act twice, in slow motion we get:
>>> h = [6, 2, 4, 3, 5, 1] >>> heapq._siftdown(h,2,len(h)-1) >>> show_tree(h) 6 2 1 3 5 4 ------------------------------------
Since 1 was smaller than its parent 4 the parent is moved down. Since now 1 (which is our newitem all the time) is smaller than its parent it will be moved up (or its parent will be shifted down):
>>> heapq._siftdown(h,0,2) >>> show_tree(h) 1 2 4 3 5 6 ------------------------------------
and we get the final result.
def _siftup(heap, pos): endpos = len(heap) startpos = pos newitem = heap[pos] # Bubble up the smaller child until hitting a leaf. childpos = 2*pos + 1 # leftmost child position while childpos < endpos: # Set childpos to index of smaller child. rightpos = childpos + 1 if rightpos < endpos and not cmp_lt(heap[childpos], heap[rightpos]): childpos = rightpos # Move the smaller child up. heap[pos] = heap[childpos] pos = childpos childpos = 2*pos + 1 # The leaf at pos is empty now. Put newitem there, and bubble it up # to its final resting place (by sifting its parents down). heap[pos] = newitem _siftdown(heap, startpos, pos)
At the very begining of this function two objects are rememberd: startpos which is the position at which the algorithm will start and newitem the value at the starting position. The will be used at the very end. This function goes down to the children. The childpos variable is the position of a child node. Then we enter the while loop in which first we check if we keep the position of the smaller of the two children. Then we promote the smaller children to the parent pos and we start over again but not at the child position childpos (which we moved up). So this implements the winnning algorithm that we have noted before (smaller wins and it move up the tree, then we go to the free spot and look which children should win this place and so on ...). After the while loop completes, which is when we hit a leaf of the tree, we take the newitem and add it at as a leaf. Now we use the heapq._siftdown() to place it at the right spot below the startpos.
To ilustrate how it works let us take:
>>> h=[3,2,1] >>> show_tree(h) 3 2 1 ------------------------------------ >>> heapq._siftup(h,0) >>> show_tree(h) 1 2 3 ------------------------------------
So it sorted the heap, and if the list is sorted the order will be preserved:
>>> heapq._siftup(h,0) >>> show_tree(h) 1 2 3 ------------------------------------
def heapify(x): """Transform list into a heap, in-place, in O(len(x)) time.""" n = len(x) for i in reversed(xrange(n//2)): _siftup(x, i)
The implementation of heapq.heapify() which sorts a list with respect to the heap order looks quite simple. It just runs the heapq._siftup() function for every inner node (a node which is not a leaf) starting from the last node up to the 0-th node. This will first sort every triangle with two (or one) leaves and its parent, then move one level up and sort the parent and grandparents of leaves, and so on. At the end we will get a list sorted by the heap order. For example in the following tree:
The for loop will only go through the nodes: v7 to v1. First it will sort the node v7 and its children, and v6 and its children, ... . When it will come to v3 its children will be already sorted now it will take the value v3 out and the smaller of v6 and v7 will be promoted, the v3 will be added as a leaf and then moved to the right position. This is how the heapq._siftup() function works. And it will assure that the the sub tree v3 and its children will be sorted. In the same way v2 and its children will be sorted. To prove that any list will be sorted in this way it is enough to consider a tree t of the form:
where t1 and t2 are already sorted tries. It should be clear (well, maybe after a minute or two) that when we run heapq._siftup(t,0) the whole tree t will get sorted.
Now let us look at the heapq.heappush() function. Well it is even simpler:
def heappush(heap, item): """Push item onto heap, maintaining the heap invariant.""" heap.append(item) _siftdown(heap, 0, len(heap)-1)
Indeed it is enough to append the elemnt as the last leaf and then run heapq._siftdown(). It will promote the appended element to the right place: just under the first its parent which holds a smaller element.
def heappop(heap): """Pop the smallest item off the heap, maintaining the heap invariant.""" lastelt = heap.pop() # raises appropriate IndexError if heap is empty if heap: returnitem = heap heap = lastelt _siftup(heap, 0) else: returnitem = lastelt return returnitem
As you see there are two cases: the heap has is empty after poping out an element, in which case there is nothing to do. If the heap is still not empty, we put the last element lastelt (from the very last leaf of the tree), and then we put it in the root of the tree and we run heapq._siftup(). Since both trees under the left and right root element are sorted, the tree will be sorted after just one application of heapq._siftup(). Finally the function returns the returnitem which is the smallest element of the heap that came as an input.
In all the examples we were using integers as values of heap lists, but actually you can use any objects which have comparisions methods defined.
You could think that just by changing the function heapq.cmp_lt() you could reverse the order in a heap. This is not the case, since actually, if the module _heapq is present it is imported and the functions heapq.heapify(), heapq.heappop(), heapq.heappush(), heapq.heappushpop(), heapq.heapreplace(), heapq.nlargest() and heapq.nsmallest() are overwritten. The _heapq module is written in C – hance it is faster. It implements the same algorithm as the one described above. If you want to reverse the order of a heap you can redefine the comparision methods of your objects.