\documentstyle[]{article}

\begin{document}
\begin{center}
Unrolling loops:  An Optimizing Phase of the MIT Scheme 8.0 Compiler\\
by Andy Stark\\
UROP Summer 1995\\
\end{center}

\section{Introduction}
The goal of unrolling loops is to decrease the amount of time spent on
loop overhead and to increase the potential for common subexpression
elimination.  Unrolling a loop essentially involves making a copy of
the loop and substituting it into the code, so that the original loop
is done twice in one pass through the unrolled loop.

The goal of this project was to find and unroll loops at the middle
end level of the MIT Scheme compiler.  The grammar of this level of
the compiler, KMP Scheme, is a fairly high level language, which makes
the task of parsing the program a fairly straightforward one.

How many times to unroll a loop is an uncertain question.  The speedup
gained by unrolling loops must be balanced against the increase in
size of compiled code.  In addition, overzealous loop unrolling may
lead to a decrease in the machine's cache hit rate.

\section{Assumptions}
The only loops that were unfolded were those consisting of only
procedures that tail call into each other.  This avoided the need to
do a complicated dataflow analysis in order to follow the paths of
continuations.  In addition, loops which made stack or heap closures
were not unfolded, as these operations were assumed to be too costly
to copy.

\section{Methodology}
A call graph was built containing a node for each lambda in the
program and an edge for each tail call.  This graph was traversed with
a depth first search in order to find simple cycles.

Because an edge was made for every tail call, it was possible to have
multiple cycles consisting of the same nodes.  Due to sidelining
(which will be explained in the following pages), it was not possible
to unroll each of these cycles.  Instead, it was necessary to choose
one path through a given set of nodes.  This was accomplished by
directing the traversal of the depth first search, so that the cycle
it would find first would be the one most preferable to unroll.

At the stage of the compiler at which this phase was introduced, the
only possible branching was due to if statements.  Thus an attempt to
estimate the likely result of the predicate of each if statements was
made, and the depth first search was directed to traverse the more
likely branch first.

This was accomplished by deeming several common predicate procedures
likely to be true and others likely to be false.  Some predicate
procedures were not possible to predict in general, and even those
predictions that were made might have been completely wrong.
Therefore, a mechanism for the user to indicate which branch of an if
statement was more likely was introduced.  In addition, branches which
resulted in calls to errors were noted to be unlikely to occur.

Once all the cycles in the graph were found, they were ranked by the
depth that they occurred in the call graph.  Assuming that the entry
point to the graph was the correct one, this was a reasonable
heuristic for estimating which of the cycles represented inner loops
in the program.  This determination was necessary because procedures
can only be unrolled as part of a single loop; if two loops shared a
common procedure, only one could be unrolled.  In that case, it would
be preferable to unroll whichever loop was the inner loop, because
programs spend the most time in their inner loops, so there would be
the most potential for speedup there.

Once a loop was chosen to be unrolled, all code which was not on the
direct path of the loop was moved to a sideline procedure.  This
enabled the compiler to copy the procedures in the loop without having
to copy code that was not part of the loop.  This sidelining process
was the reason that a procedure could only be part of a single
unrolled loop; other code that may have been part of a different loop
was moved out of the procedure into a sideline procedure.  Note,
however, that if this code was simple enough, the simplifier phase of
the compiler would later move it back into the original code; i.e.,
the simplifier might substitute a sideline procedure for its call
sites.

Next, the procedures of the loop were copied.  The call sites were
modified so that the copied procedures would be called within the
loop.  Since each copy would have only one call site, it would later
be substituted for that call site by the simplifier.

The number of copies made of the loop was determined by the size of
the loop.  The compiler made an estimation of the number of
instructions in the loop, and the loop was copied as many times as
possible while remaining under a fixed total number of instructions.

\section{Results}
Consider an example program which finds the sum of all the floating
point numbers in a vector.

\pagebreak
\begin{verbatim}
(define (vector-sum vector1)
  (let sum-loop ((i 0)
                 (len (vector-length vector1))
                 (sum 0.))
    (if (fix:< i len)
        (sum-loop (fix:1+ i)
                  len
                  (flo:+ sum (vector-ref vector1 i))))))
\end{verbatim}

Due to Scheme's floating point number system, the floating point
number sum must be boxed and unboxed each time around the loop.  Now
consider unrolling the loop once.  This would result in the body of
the loop being duplicated, so that two numbers would be added to the
sum each time around the loop.  At the second addition, sum would
already have been unboxed and so would not need to be unboxed again.
Thus the total number of times to unbox sum would be cut in half.

Indeed, when this code was repeated multiple times, the following data
was obtained:

\begin{tabular}{l}
\\
Old Version\\
Best time : 3.11 seconds\\
Average time : 3.13 seconds\\
Size of compiled code block : 45 words\\
\\
With Loop Unrolling\\
Best time : 1.33 seconds\\
Average time : 1.35 seconds\\
Size of compiled code block : 66 words\\
\\
Speed up :\\
3.11 / 1.33 = 2.34 times\\
3.13 / 1.35 = 2.32 times\\
\\
Size increase :\\
66 / 45 = 1.5 times\\
\\
\end{tabular}

The phase was tested on the standard Gabriel Lisp benchmark suite
with the following results.  The first column is the running time
without loop unrolling, and the second column is the running time with
loop unrolling.  The third column is what percent the second is of the
first.

\begin{tabular}{lccc}
boyer&      1.64&   1.09&  0.666\\
browse&     1.19&   1.09&  0.914\\
conform&    0.95&   0.90&  0.948\\
cpstak&     0.57&   0.57&  1.007\\
ctak&      10.80&  10.53&  0.975\\
dderiv&     0.93&   0.92&  0.982\\
deriv&      0.72&   0.73&  1.007\\
destruct&   0.97&   0.84&  0.869\\
div&        0.97&   0.89&  0.913\\
fcomp&      0.92&   0.93&  1.019\\
fib&        0.84&   0.83&  0.999\\
matmul1&    1.00&   0.95&  0.942\\
matmul2&    1.49&   1.48&  0.994\\
peval&      0.72&   0.72&  0.999\\
puzzle&     0.70&   0.60&  0.868\\
tak&        0.40&   0.40&  0.998\\
takl&       0.47&   0.48&  1.006\\
traverse&   3.54&   3.50&  0.988\\
triangle&   5.49&   5.50&  1.002\\
wttree&     0.66&   0.67&  1.002\\
~a.mean&       -&      -&  0.955\\
~g.mean&       -&      -&  0.951\\
\multicolumn{4}{l}{Increase in size of compiled files : 9.38\%}\\
\\
\end{tabular}

The loop unroller obtained as much as a 33\% speedup in one of the
benchmarks, but had no effect on several others.  The entire benchmark
was sped up by 5.1\%.  It should be noted that some of these
benchmarks, as would be true in many programs, contained no loops
which could be unrolled.  The cost of this speedup was a 9.38\%
increase in the size of the compiled files.

The following is a comparison of the loop unroller's performance with
an RTL level optimizer designed to move instructions out of inner
loops, written by Jason A. Wilson\footnote{Wilson, Jason A.  An RTL
Loop Optimizer With Performance Measurements and Analysis.  Spring
1995.}.

\begin{tabular}{|l|c|c|}
\hline
\multicolumn{3}{|c|}{String copy}\\
\hline
&               Loop unroller&          RTL optimizer\\
Speedup in&&\\
---Best times&        1.39 times&                    1.55 times\\
---Average times&     1.39 times&                    1.55 times\\
Size increase&       1.61 times&                    unknown\\
\hline
\multicolumn{3}{|c|}{Dot Product}\\
\hline
&               Loop unroller&          RTL optimizer\\
Speedup in&&\\
---Best times&        1.20 times&              1.08 times\\
---Average times&     1.18 times&              1.09 times\\
Size increase&       1.78 times&              unknown\\
\hline
\end{tabular}

\section{Conclusion}
Unrolling loops is a fairly easily implemented optimization that can
yield a considerable speedup in certain applications.  The speedup
must always be balanced against the increase in size of compiled
programs.

\section{Future work}
As described above, there is a continual tradeoff between speedup and
size increase in compiled programs.  This tradeoff has not been
investigated thoroughly; it is unclear what the optimal amount of
unrolling would be, if it exists.  It would be reasonable to give the
user the ability to specify how much unrolling should be done so that
the optimization can be tailored to the user's specific needs.

As is apparent from the Gabriel benchmarks, loop unrolling can be very
useful in some applications while not useful in others.  Loops which
are very rarely used may be unrolled by the compiler, leading to an
increase in the size of the compiled code but no speedup.  Some loops
offer more potential for common subexpression elimination than others,
and so would benefit from more unrolling.  Ideally, the compiler would
be able to determine these cases, but in practice, it may be left up
to the user to note this information.  The entire optimization might
be something the user turns on and off depending on how much time is
expected to be spent in loops.

An even better implementation would allow the user to specify
specifically which loops to unroll or not to unroll.  This would give
the user complete control over the compiler's behavior, and would also
eliminate the need for the compiler to predict the branching of if
statements, since those predictions were made entirely for the purpose
of deciding which loops to unroll.

\end{document}