Tower of Hanoi


Page: 1 2 3 4 of 4	Yellow Pig's Index

The Tower of Hanoi Puzzle

Now we analyze the recursive program. For each invocation of towerrec, we make one comparison n == 1 and one increment, ++move. In reality, they require five machine instructions. The processor first copies the value of n or move from memory to a register. The processor then performs the comparison or increment operation. And finally, the processor copies the new value of move back into memory.

The comparison n == 1 is true in 2^n-1 cases. In those cases, we just print the move. In the other 2^n-1 + 1 cases, we do a few things. We compute n-1. Then for each recursive call, we push onto the runtime stack the return address and the four arguments; then we do a jump; then pop those five values; then do another jump. The processor cannot actually copy the values of from, to and temp from one set of locations on the stack to another. The processor has to copy them from the stack to the registers, then back onto the stack. On the other hand, popping all three off the stack requires just one instruction that moves the stack pointer 12 bytes. This is messy. Let us streamline the n != 1 case into pseudo machine language. We assume in this code fragment that, for procedure calls, the caller pushes and pops the arguments and the return address onto and from the stack. Keep in mind that we have already accounted for the ++move operation.

compute n-1 (1 op)
put n-1 in memory (1 op)
copy from, temp and to to registers (3 ops)
push return address (1 op)
push n-1, from, to and temp (4 ops)
jump to beginning of towerrec (1 op)
pop 4 arguments (1 op)
pop return address (1 op)
jump to the instruction addressed by the return address (1 op)
copy from, temp, to, n, moves again to registers
push return address
push move, n, from, to
call printf
pop 4 arguments
pop return address
jump to the instruction addressed by the return address
copy from, temp, to, n-1 again to registers (4 ops)
push return address (1 op)
push n-1, temp, from, to (4 ops)
jump to beginning of towerrec (1 op)
pop 4 arguments (1 op)
pop return address (1 op)
jump to the instruction addressed by the return address (1 op)

We do not count steps 10 to 16 because we do not count them for the non-recursive program either. There is real cost in calling a procedure and restoring the values of the registers afterwards. But we ignore it for printf.

At the end, we have 22 memory operations, 1 arithmetic operation and 4 jumps. Let us assume that our processor cache is big enough for our needs so that each memory operation is as fast an arithmetic operation. Then, our total is 23 · (2^n-1 + 1) + 5 · 2^n-1, which is roughly 14 · 2ⁿ operations. And we also have 5 jumps for each of the non-base cases for a total of 5 · (2^n-1 + 1) or 2.5 · 2ⁿ branch instructions. (We have 4 jumps within the n != 1 case plus 1 jump into the case. We do not count the jumps in the return statements because they have been accounted for by the caller).

Let us add everything up. Our recursive procedure requires (14 + 2.5) · 2ⁿ = 16.5 · 2ⁿ instructions while the short non-recursive procedure requires (15 + 3) · 2ⁿ = 18 · 2ⁿ instructions. Ignoring the effects of cache misses from jump and memory instructions, we see that the recursive towerrec is more efficient.

Note that when the modulus is not a power of 2, the integer remainder operation requires several clock cycles. Vector arithmetic can reduce the average to 1 clock cycle only if the operation is applied to a lengthy vector. So in reality, our non-recursive program is a little slower than 18 · 2ⁿ. However, it can be optimized a little more. We can for example do something like this:

  if (c & 65535 == 0) zeros += 16;

Another way is to load c into a floating point register and use bitwise operations to extract the exponent. Some processors cannot copy values between integer and floating-point registers. So this method of obtaining the number of zeros (log base 2) of a number may require a few memory operations. We do not analyze these variations in detail here.

What if the number of discs is so big that we cannot cache the entire stack? Let us do some quick calculations. Each recursive call requires pushing the return address and 4 arguments, totaling 20 bytes, onto the stack. Therefore, we must have more than 1600 discs to fill up a 32K cache. (If you compile towerrec program, you might implicitly push 24 or more bytes per recursive call because the subroutine call semantics of the instruction set and/or the compiler dictate that the frame pointer also be pushed. The frame pointer is unnecessary in our recursive program because we have no local variables and we know that exactly four integer arguments are passed every time). To solve a 1600-disc problem non-recursively, we would need to implement 1600-bit arithmetic. So in fact, the recursive program is the only practical solution for very large Tower of Hanoi puzzles.

We do want to point out that indeed, the non-recursive program uses less memory. And more importantly for the Tower of Hanoi puzzle, it can compute any move without needing to compute the preceeding moves.

References