Now we analyze the recursive program. For each invocation of
towerrec, we make one comparison n == 1 and one
increment, ++move. In reality, they require five machine
instructions. The processor first copies the value of n or
move from memory to a register. The processor then performs
the comparison or increment operation. And finally, the processor
copies the new value of move back into memory.
The comparison n == 1 is true in 2n-1 cases.
In those cases, we just print the move. In the other
2n-1 + 1 cases, we do a few things. We compute
n-1. Then for each recursive call, we push onto the runtime
stack the return address and the four arguments; then we do a jump;
then pop those five values; then do another jump. The processor
cannot actually copy the values of from, to and
temp from one set of locations on the stack to another. The
processor has to copy them from the stack to the registers, then back
onto the stack. On the other hand, popping all three off the stack
requires just one instruction that moves the stack pointer 12
bytes. This is messy. Let us streamline the n != 1 case into
pseudo machine language. We assume in this code fragment that, for
procedure calls, the caller pushes and pops the arguments and
the return address onto and from the stack. Keep in mind that we
have already accounted for the ++move operation.
- compute n-1 (1 op)
- put n-1 in memory (1 op)
- copy from, temp and to to registers (3 ops)
- push return address (1 op)
- push n-1, from, to and temp (4 ops)
- jump to beginning of towerrec (1 op)
- pop 4 arguments (1 op)
- pop return address (1 op)
- jump to the instruction addressed by the return address
(1 op)
- copy from, temp, to, n, moves
again to registers
- push return address
- push move, n, from, to
- call printf
- pop 4 arguments
- pop return address
- jump to the instruction addressed by the return address
- copy from, temp, to, n-1 again
to registers (4 ops)
- push return address (1 op)
- push n-1, temp, from, to (4 ops)
- jump to beginning of towerrec (1 op)
- pop 4 arguments (1 op)
- pop return address (1 op)
- jump to the instruction addressed by the return address
(1 op)
We do not count steps 10 to 16 because we do not count
them for the non-recursive program either. There is real cost in
calling a procedure and restoring the values of the registers
afterwards. But we ignore it for printf.
At the end, we have 22 memory operations, 1 arithmetic
operation and 4 jumps. Let us assume that our processor cache
is big enough for our needs so that each memory operation is as fast
an arithmetic operation. Then, our total is 23 ·
(2n-1 + 1) + 5 · 2n-1, which is
roughly 14 · 2n operations. And we also have
5 jumps for each of the non-base cases for a total of 5
· (2n-1 + 1) or 2.5 · 2n
branch instructions. (We have 4 jumps within the n != 1
case plus 1 jump into the case. We do not count the jumps
in the return statements because they have been accounted for by the
caller).
Let us add everything up. Our recursive procedure requires (14 +
2.5) · 2n = 16.5 · 2n
instructions while the short non-recursive procedure requires (15 +
3) · 2n = 18 · 2n instructions.
Ignoring the effects of cache misses from jump and memory
instructions, we see that the recursive towerrec is more
efficient.
Note that when the modulus is not a power of 2, the
integer remainder operation requires several clock cycles.
Vector arithmetic can reduce the average to 1 clock cycle only if the
operation is applied to a lengthy vector. So in reality, our
non-recursive program is a little slower than 18 ·
2n. However, it can be optimized a little more. We
can for example do something like this:
if (c & 65535 == 0) zeros += 16;
Another way is to load c into a floating point register
and use bitwise operations to extract the exponent. Some processors
cannot copy values between integer and floating-point registers. So
this method of obtaining the number of zeros (log base 2) of a number
may require a few memory operations. We do not analyze these
variations in detail here.
What if the number of discs is so big that we cannot cache the entire
stack? Let us do some quick calculations. Each recursive call
requires pushing the return address and 4 arguments, totaling 20
bytes, onto the stack. Therefore, we must have more than 1600 discs
to fill up a 32K cache. (If you compile towerrec program, you
might implicitly push 24 or more bytes per recursive call because the
subroutine call semantics of the instruction set and/or the compiler
dictate that the frame pointer also be pushed. The frame
pointer is unnecessary in our recursive program because we have no
local variables and we know that exactly four integer arguments are
passed every time). To solve a 1600-disc problem non-recursively, we
would need to implement 1600-bit arithmetic. So in fact, the
recursive program is the only practical solution for very large Tower
of Hanoi puzzles.
We do want to point out that indeed, the non-recursive program uses
less memory. And more importantly for the Tower of Hanoi puzzle, it
can compute any move without needing to compute the preceeding moves.
|