Now we analyze the recursive program. For each invocation of
towerrec, we make one comparison n == 1 and one
increment, ++move. In reality, they require five machine
instructions. The processor first copies the value of n or
move from memory to a register. The processor then performs
the comparison or increment operation. And finally, the processor
copies the new value of move back into memory.
The comparison n == 1 is true in 2^{n1} cases.
In those cases, we just print the move. In the other
2^{n1} + 1 cases, we do a few things. We compute
n1. Then for each recursive call, we push onto the runtime
stack the return address and the four arguments; then we do a jump;
then pop those five values; then do another jump. The processor
cannot actually copy the values of from, to and
temp from one set of locations on the stack to another. The
processor has to copy them from the stack to the registers, then back
onto the stack. On the other hand, popping all three off the stack
requires just one instruction that moves the stack pointer 12
bytes. This is messy. Let us streamline the n != 1 case into
pseudo machine language. We assume in this code fragment that, for
procedure calls, the caller pushes and pops the arguments and
the return address onto and from the stack. Keep in mind that we
have already accounted for the ++move operation.
 compute n1 (1 op)
 put n1 in memory (1 op)
 copy from, temp and to to registers (3 ops)
 push return address (1 op)
 push n1, from, to and temp (4 ops)
 jump to beginning of towerrec (1 op)
 pop 4 arguments (1 op)
 pop return address (1 op)
 jump to the instruction addressed by the return address
(1 op)
 copy from, temp, to, n, moves
again to registers
 push return address
 push move, n, from, to
 call printf
 pop 4 arguments
 pop return address
 jump to the instruction addressed by the return address
 copy from, temp, to, n1 again
to registers (4 ops)
 push return address (1 op)
 push n1, temp, from, to (4 ops)
 jump to beginning of towerrec (1 op)
 pop 4 arguments (1 op)
 pop return address (1 op)
 jump to the instruction addressed by the return address
(1 op)
We do not count steps 10 to 16 because we do not count
them for the nonrecursive program either. There is real cost in
calling a procedure and restoring the values of the registers
afterwards. But we ignore it for printf.
At the end, we have 22 memory operations, 1 arithmetic
operation and 4 jumps. Let us assume that our processor cache
is big enough for our needs so that each memory operation is as fast
an arithmetic operation. Then, our total is 23 ·
(2^{n1} + 1) + 5 · 2^{n1}, which is
roughly 14 · 2^{n} operations. And we also have
5 jumps for each of the nonbase cases for a total of 5
· (2^{n1} + 1) or 2.5 · 2^{n}
branch instructions. (We have 4 jumps within the n != 1
case plus 1 jump into the case. We do not count the jumps
in the return statements because they have been accounted for by the
caller).
Let us add everything up. Our recursive procedure requires (14 +
2.5) · 2^{n} = 16.5 · 2^{n}
instructions while the short nonrecursive procedure requires (15 +
3) · 2^{n} = 18 · 2^{n} instructions.
Ignoring the effects of cache misses from jump and memory
instructions, we see that the recursive towerrec is more
efficient.
Note that when the modulus is not a power of 2, the
integer remainder operation requires several clock cycles.
Vector arithmetic can reduce the average to 1 clock cycle only if the
operation is applied to a lengthy vector. So in reality, our
nonrecursive program is a little slower than 18 ·
2^{n}. However, it can be optimized a little more. We
can for example do something like this:
if (c & 65535 == 0) zeros += 16;
Another way is to load c into a floating point register
and use bitwise operations to extract the exponent. Some processors
cannot copy values between integer and floatingpoint registers. So
this method of obtaining the number of zeros (log base 2) of a number
may require a few memory operations. We do not analyze these
variations in detail here.
What if the number of discs is so big that we cannot cache the entire
stack? Let us do some quick calculations. Each recursive call
requires pushing the return address and 4 arguments, totaling 20
bytes, onto the stack. Therefore, we must have more than 1600 discs
to fill up a 32K cache. (If you compile towerrec program, you
might implicitly push 24 or more bytes per recursive call because the
subroutine call semantics of the instruction set and/or the compiler
dictate that the frame pointer also be pushed. The frame
pointer is unnecessary in our recursive program because we have no
local variables and we know that exactly four integer arguments are
passed every time). To solve a 1600disc problem nonrecursively, we
would need to implement 1600bit arithmetic. So in fact, the
recursive program is the only practical solution for very large Tower
of Hanoi puzzles.
We do want to point out that indeed, the nonrecursive program uses
less memory. And more importantly for the Tower of Hanoi puzzle, it
can compute any move without needing to compute the preceeding moves.
