This chapter outlines techniques for tuning the performance of your R8000 and R10000 applications. It contains five sections:
The first section provides an architectural overview of the R8000 and R10000. This will serve as background for the software pipelining discussion.
The second section presents the compiler optimization technique of software pipelining, which is crucial to getting optimal performance on the R8000. It shows you how to read your software pipelined code and how to understand what it does.
The third section uses matrix multiplies as a case study on loop unrolling.
The fourth section describes how the IVDEP directive can be used in Fortran to gain performance.
The final section describes how using vector intrinsic functions can improve the performance of your program.
Table 6-1 illustrates the main architectural features of the R8000 and R10000 microprocessors. Both execute the MIPS IV instruction set, and both can issue up to four instructions per cycle, of which two can be integer instructions and two can be floating point instructions. The R8000 can execute up to two madd instructions per cycle, while the R10000 can execute only one per cycle. The R8000 can issue two memory instructions per cycle, while the R10000 can issue one per cycle. On the other hand, the R10000 operates at a much faster clock frequency and is capable of out-of-order execution. This makes it a better candidate for executing programs that were not explicitly tuned for it.
Table 6-1. Architectural Features of the R8000 and R10000
| Feature | R8000 | R10000 |
|---|---|---|
| ISA | MIPS IV | MIPS IV |
| Frequency | 90 MHz | 200 MHz |
| Peak MFLOPS | 360 | 400 |
| Total Number of Instructions per cycle | 4 | 4 |
| Number of Integer Instructions per cycle | 2 | 2 |
| Number of Floating Point Instructions per cycle | 2 | 2 |
| Number of Multiply-Add Instructions per cycle | 2 | 1 |
| Number of Load/Store Instructions per cycle | 2 | 1 |
| Out-of-order Instruction execution | No | Yes |
The purpose of this section is to give you an overview of software pipelining and to answer these questions:
Why software pipelining delivers better performance
How software pipelined code looks
How to diagnose what happened when things go wrong
To introduce the topic of software pipelining, let's consider the simple DAXPY loop (double precision a times x plus y) shown below.
DO i = 1, n
   v(i) = v(i) + X * w(i)
ENDDO
On the MIPS IV architecture, this can be coded as two load instructions followed by a madd instruction and a store. See Figure 6-1.
This simple schedule achieves the desired result in five cycles. Since the R8000 architecture allows up to two memory operations and two floating point operations in the same cycle, this simple example uses only one tenth of the R8000's peak megaflops. There is also a delay of three cycles before the result of the madd can be stored.
0: ldc1   ldc1   madd
1:
2:
3:
4: sdc1
Unrolling the loop by four improves the schedule to one quarter of the R8000's peak megaflops.
0: ldc1   ldc1   madd
1: ldc1   ldc1   madd
2: ldc1   ldc1   madd
3: ldc1   ldc1   madd
4: sdc1
5: sdc1
6: sdc1
7: sdc1
But this schedule does not take advantage of the R8000's ability to do two stores in one cycle. The best schedule that could be achieved looks like the following:
0: ldc1   ldc1
1: ldc1   ldc1   madd   madd
2: sdc1   sdc1
It uses 1/3 of the R8000's peak megaflops. But there is still a problem with the madd-sdc1 interlock delay. Software pipelining addresses this problem.
Software pipelining allows you to mix operations from different loop iterations in each iteration of the hardware loop. Thus the store instructions (which would interlock) in the above example could be used to store the results of different iterations. This can look like the following:
L1:
0: t1 = ldc1   t2 = ldc1   t3 = madd t1 X t2
1: t4 = ldc1   t5 = ldc1   t6 = madd t4 X t5
2: sdc1 t7     sdc1 t8     beq DONE
3: t1 = ldc1   t2 = ldc1   t7 = madd t1 X t2
4: t4 = ldc1   t5 = ldc1   t8 = madd t4 X t5
5: sdc1 t3     sdc1 t6     bne L1
DONE:
The stores in this loop are storing the madd results from previous iterations. But, in general, you could mix any operations from any number of different iterations. Also, note that every loop replication completes two loop iterations in 3 cycles.
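The same transformation can be sketched by hand in C. This is an illustration of the technique, not compiler output: each trip through the loop computes two new madd results and stores the two results carried over from the previous trip. The function name and the assumption that n is even and at least 2 are ours.

```c
#include <stddef.h>

/* Sketch of a hand software-pipelined daxpy: y = y + a*x, n even, n >= 2. */
void daxpy_swp(double a, double *x, double *y, size_t n) {
    /* windup: compute the first two results but do not store them yet */
    double r0 = y[0] + a * x[0];
    double r1 = y[1] + a * x[1];
    size_t i;
    for (i = 2; i < n; i += 2) {
        double t0 = y[i]     + a * x[i];      /* madds for this trip   */
        double t1 = y[i + 1] + a * x[i + 1];
        y[i - 2] = r0;                        /* stores from last trip */
        y[i - 1] = r1;
        r0 = t0;
        r1 = t1;
    }
    /* winddown: store the final pair */
    y[n - 2] = r0;
    y[n - 1] = r1;
}
```

Because the stores in the loop body refer to values computed on the previous trip, they no longer wait on the madds issued in the same trip, which is exactly what the schedule above exploits.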
In order to properly prepare for entry into such a loop, a windup section of code is added. The windup section sets up registers for the first stores in the main loop. In order to exit the loop properly, a winddown section is added. The winddown section performs the final stores. Any preparation of registers needed for the winddown is done in the compensation section. The winddown section also prevents speculative operations.
windup:
0: t1 = ldc1   t2 = ldc1   t7 = madd t1 X t2
1: t4 = ldc1   t5 = ldc1   t8 = madd t4 X t5
L1:
0: t1 = ldc1   t2 = ldc1   t3 = madd t1 X t2
1: t4 = ldc1   t5 = ldc1   t6 = madd t4 X t5
2: sdc1 t7     sdc1 t8     beq compensation1
3: t1 = ldc1   t2 = ldc1   t7 = madd t1 X t2
4: t4 = ldc1   t5 = ldc1   t8 = madd t4 X t5
5: sdc1 t3     sdc1 t6     bne L1
winddown:
0: sdc1 t7     sdc1 t8     br ALLDONE
compensation1:
0: t7 = t3     t8 = t6
1: br winddown
ALLDONE:
Our example loop always does loads from at least four iterations ahead, so if speculative operations are not allowed, the loop must not be entered when the trip count is less than 4. The following generalizes our example into a map of a software pipelined loop:
/* Precondition for unroll by 2 */
do i = 1, n mod 2
original loop
enddo
if ( n-i < 4 ) goto simple_loop
windup:
...
/* fp - fence post */
fp = fp - peel_amount
...
swp replication 0:
...
if ( i == fp ) goto compensation 0
...
swp replication n - 1:
...
if ( i != fp ) goto swp replication 0
compensation n-1:
winddown:
...
goto done
compensation 0:
/* Move registers to set up winddown */
rx = ry
...
goto winddown
...
compensation n - 2:
...
/* Move registers to set up winddown */
rx = ry
goto winddown
simple_loop:
do i = i,n
original_loop
enddo
done:
In practice, software pipelining can make a huge difference; sample benchmarks see more than 100% improvement when compiled with software pipelining. This makes it clear that in order to get the best performance on the R8000 (-mips4), you should compile your application with software pipelining enabled (-O3 -r8000).
Scheduling for the R10000 (-r10000) is somewhat different. First of all, since the R10000 can execute instructions out-of-order, static scheduling techniques are not as critical to its performance. On the other hand, the R10000 supports the prefetch (pref) instruction. This instruction is used to load data into the caches before it is needed, reducing memory delays for programs that access memory predictably. In the schedule below, the compiler generates prefetch instructions.
Since the R10000 can execute only one memory instruction and only one madd instruction per cycle, there are two open slots available for integer instructions. The R10000 has a delay of three cycles before the results of the madd can be stored. It also has delays of three cycles and one cycle before the madd can use the result of a load for multiplication and addition, respectively.
The following schedule shows four replications from a daxpy inner loop scheduled for the R10000. Notice how the operands of the madd instruction are loaded well ahead of the actual execution of the instruction itself. Notice also that the final replication contains two prefetch instructions. Use of the prefetch instruction is enabled by default with the -r10000 flag. The other replications each have a hole where the first prefetch instruction is placed. Had prefetch been turned off through the -LNO:prefetch=0 option, each replication could have been scheduled in three cycles.
L1:
0: t0 = ldc1
1:
2: t7 = ldc1
3: sdc1 t2
4: t2 = madd t4 X t5   beq compensation0
0: t4 = ldc1
1:
2: t3 = ldc1
3: sdc1 t2
4: t2 = madd t0 X t1   beq compensation1
0: t0 = ldc1
1:
2: t5 = ldc1
3: sdc1 t2
4: t2 = madd t4 X t7   beq compensation2
0: t4 = ldc1
1: pref
2: t1 = ldc1
3: sdc1 t2
4: t2 = madd t0 X t3   bne L1   pref
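A hand-written C analogue may make the prefetching concrete. This is only a sketch: with -r10000 the MIPSpro compiler inserts pref instructions itself and no source change is needed; __builtin_prefetch is a GCC/Clang intrinsic assumed here purely for illustration, and the lookahead distance of 16 elements is likewise an arbitrary assumption.

```c
/* Sketch: daxpy with explicit prefetch hints, mimicking what the compiler
   does automatically.  The hints request cache lines a few iterations ahead
   of their use so the loads in those iterations hit in cache. */
void daxpy_prefetch(double a, double *x, double *y, int n) {
    for (int i = 0; i < n; i++) {
        if (i + 16 < n) {                       /* stay inside the arrays */
            __builtin_prefetch(&x[i + 16], 0, 0);  /* 0 = for reading */
            __builtin_prefetch(&y[i + 16], 1, 0);  /* 1 = for writing */
        }
        y[i] = y[i] + a * x[i];
    }
}
```

The hints change timing only, never results, so the routine computes the same values as the plain loop.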
The proper way to look at the assembly code generated by software pipelining is to use the -S compiler switch. This is vastly superior to using the disassembler (dis) because the -S switch adds annotations to the assembly code that identify the sections described above.
The annotations also provide useful statistics about the software pipelining process as well as reasons why certain code did not pipeline. To get a summary of these annotations do the following:
%f77 -64 -S -O3 -mips4 foo.f

This creates an annotated .s file.

%grep '#<swp' foo.s
#<swpf> is printed for loops that failed to software pipeline. #<swps> is printed for statistics and other information about the loops that did software pipeline.
%cat test.f
program test
real*8 a, x(100000), y(100000)
do i = 1, 2000
call daxpy(3.7d0, x, y, 100000)
enddo
stop
end
subroutine daxpy(a, x, y, nn)
real*8 a, x(*), y(*)
do i = 1, nn, 1
y(i) = y(i) + a * x(i)
enddo
return
end
%f77 -64 -r8000 -mips4 -O3 -S test.f
%grep swps test.s
#<swps>
#<swps> Pipelined loop line 12 steady state
#<swps>
#<swps> 25 estimated iterations before pipelining
#<swps> 4 unrollings before pipelining
#<swps> 6 cycles per 4 iterations
#<swps> 8 flops ( 33% of peak) (madds count as 2)
#<swps> 4 flops ( 33% of peak) (madds count as 1)
#<swps> 4 madds ( 33% of peak)
#<swps> 12 mem refs (100% of peak)
#<swps> 3 integer ops ( 25% of peak)
#<swps> 19 instructions( 79% of peak)
#<swps> 2 short trip threshold
#<swps> 7 integer registers used
#<swps> 14 float registers used
#<swps>
This example was compiled with scheduling for the R8000. It shows that the inner loop starting at line 12 was software pipelined. The loop was unrolled four times before pipelining. It used 6 cycles for every four loop iterations and calculated the statistics as follows:
If each madd counts as two floating point operations, the R8000 can do four floating point operations per cycle (two madds), so its peak for this loop is 24. Eight floating point operations are 8/24 or 33% of peak. The figure for madds is calculated likewise.
If each madd counts as one floating point operation, the R8000 can do two floating point operations per cycle, so its peak for this loop is 12. Four floating point operations are 4/12 or 33% of peak.
The R8000 can do two memory operations per cycle, so its peak for this loop is 12. Twelve memory references are 12/12 or 100% of peak.
The R8000 can do two integer operations per cycle, so its peak for this loop is 12. Three integer operations are 3/12 or 25% of peak.
The R8000 can do four instructions per cycle, so its peak for this loop is 24. Nineteen instructions are 19/24 or 79% of peak. The statistics also point out that loops of fewer than 2 iterations would not go through the software pipeline replication area but would be executed in the simple_loop section shown above, and that a total of seven integer and fourteen floating point registers were used in generating the code.
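The arithmetic behind these percentages can be captured in a few lines of C (a sketch; the helper name is ours):

```c
/* Sketch: how the swps percentages are derived.  The peak for a resource is
   (units per cycle) * (cycles in the loop); the reported figure is
   ops / peak, in integer percent. */
int pct_of_peak(int ops, int units_per_cycle, int cycles) {
    return 100 * ops / (units_per_cycle * cycles);
}

/* For the 6-cycle R8000 loop above:
   pct_of_peak( 8, 4, 6)  ->  33  (flops, madds counted as 2)
   pct_of_peak(12, 2, 6)  -> 100  (memory references)
   pct_of_peak( 3, 2, 6)  ->  25  (integer ops)
   pct_of_peak(19, 4, 6)  ->  79  (instructions) */
```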
Had the example been compiled with scheduling for the R10000, the following results would have been obtained.
%f77 -64 -r10000 -mips4 -O3 -S test.f
%grep swps test.s
#<swps>
#<swps> Pipelined loop line 12 steady state
#<swps>
#<swps> 25 estimated iterations before pipelining
#<swps> 4 unrollings before pipelining
#<swps> 14 cycles per 4 iterations
#<swps> 8 flops ( 28% of peak) (madds count as 2)
#<swps> 4 flops ( 14% of peak) (madds count as 1)
#<swps> 4 madds ( 28% of peak)
#<swps> 12 mem refs ( 85% of peak)
#<swps> 3 integer ops ( 10% of peak)
#<swps> 19 instructions( 33% of peak)
#<swps> 2 short trip threshold
#<swps> 7 integer registers used
#<swps> 15 float registers used
#<swps>
The statistics are tailored to the R10000 architectural characteristics. They show that the inner loop starting at line 12 was unrolled four times before being pipelined. It used 14 cycles for every four loop iterations and the percentages were calculated as follows:
The R10000 can do two floating point operations per cycle (one multiply and one add), so its floating point peak for this loop is 28. If each madd instruction counts as a multiply and an add, the eight flops in this loop are 8/28 or 28% of peak.
If each madd counts as one floating point instruction, the R10000 can do two floating point operations per cycle (one multiply and one add), so its peak for this loop is 28. Four floating point operations (four madds) are 4/28 or 14% of peak.
The R10000 can do one madd operation per cycle, so its peak for this loop is 14. Four madd operations are 4/14 or 28% of peak.
The R10000 can do one memory operation per cycle, so its peak for this loop is 14. Twelve memory references are 12/14 or 85% of peak.
Note: Prefetch operations are sometimes not needed in every replication. The statistics will miss them if they are not in replication 0, and they will then understate the number of memory references per cycle while overstating the number of cycles per iteration.
The R10000 can do two integer operations per cycle, so its peak for this loop is 28. Three integer operations are 3/28 or 10% of peak.
The R10000 can do four instructions per cycle, so its peak for this loop is 56. Nineteen instructions are 19/56 or 33% of peak. The statistics also point out that loops of fewer than 2 iterations would not go through the software pipeline replication area but would be executed in the simple_loop section shown above, and that a total of seven integer and fifteen floating point registers were used in generating the code.
When you don't get much improvement in your application's performance after compiling with software pipelining, you should ask the following questions and consider some possible answers:
Did it software pipeline at all?
Software pipelining works only on inner loops. What is more, inner loops with subroutine calls or complicated conditional branches do not software pipeline.
How well did it pipeline?
Look at statistics in the .s file.
What can go wrong in code that was software pipelined?
Your generated code may not have the operations you expected.
Think about how you would hand code it.
What operations did it need?
Look at the loop in the .s file.
Is it very different? Why?
Sometimes this is human error. (Improper code, or typo.)
Sometimes this is a compiler error.
Perhaps the compiler didn't schedule tightly. This can happen when there are unhandled recurrence divides (divides, in general, are a problem) or register allocation problems (running out of registers).
Matrix multiplication illustrates some of the issues in compiling for the R8000 and R10000. Consider this simple implementation:
do j = 1,n
  do i = 1,m
    do k = 1,p
      c(i,j) = c(i,j) - a(i,k)*b(k,j)
    enddo
  enddo
enddo
As mentioned before, the R8000 is capable of issuing two madds and two memory references per cycle. This simple version of matrix multiply requires 2 loads in the inner loop and one madd. Thus at best, it can run at half of peak speed. Note, though, that the same locations are loaded on multiple iterations. By unrolling the loops, we can eliminate some of these redundant loads. Consider for example, unrolling the outer loop by 2.
do j = 1,n,2
  do i = 1,m
    do k = 1,p
      c(i,j) = c(i,j) - a(i,k)*b(k,j)
      c(i,j+1) = c(i,j+1) - a(i,k)*b(k,j+1)
    enddo
  enddo
enddo
We now have 3 loads for two madds. Further unrolling can bring the ratio down further. On the other hand, heavily unrolled loops require many registers. Unrolling too much can lead to register spills. The loop nest optimizer (LNO) tries to balance these trade-offs.
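The effect of the unrolling on the load/madd ratio can be seen in a C sketch. The array size N and the assumption that the unrolled dimension is even are ours; C's row-major layout differs from Fortran's, but the reuse argument is the same.

```c
#define N 4  /* small fixed size, for illustration only */

/* Sketch: matrix multiply with the j loop unrolled by 2.  In the unrolled
   body, a[i][k] is loaded once and used by two multiply-adds, so the inner
   loop does 3 loads for 2 madds instead of 2 loads for 1 madd.  N even. */
void matmul_unroll2(double c[N][N], double a[N][N], double b[N][N]) {
    for (int j = 0; j < N; j += 2)
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double aik = a[i][k];          /* one load, two uses */
                c[i][j]     -= aik * b[k][j];
                c[i][j + 1] -= aik * b[k][j + 1];
            }
}
```

Further unrolling of i and j extends the same idea, at the cost of more live values and therefore more registers.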
Below is a good compromise for the matrix multiply example on the R8000. It is generated by LNO using the -O3 and -r8000 flags. The listing file is generated using the -FLIST:=ON option.
%f77 -64 -O3 -r8000 -FLIST:=ON mmul.f
%cat mmul.w2f.f
C ***********************************************************
C Fortran file translated from WHIRL Mon Mar 22 09:31:25 1999
C ***********************************************************
PROGRAM MAIN
IMPLICIT NONE
C
C **** Variables and functions ****
C
REAL*8 a(100_8, 100_8)
REAL*8 b(100_8, 100_8)
REAL*8 c(100_8, 100_8)
INTEGER*4 j
INTEGER*4 i
INTEGER*4 k
C
C **** Temporary variables ****
C
REAL*8 mi0
REAL*8 mi1
REAL*8 mi2
REAL*8 mi3
REAL*8 mi4
REAL*8 mi5
REAL*8 mi6
REAL*8 mi7
REAL*8 mi8
REAL*8 mi9
REAL*8 mi10
REAL*8 mi11
REAL*8 c_1
REAL*8 c_
REAL*8 c_0
REAL*8 c_2
REAL*8 c_3
REAL*8 c_4
REAL*8 c_5
REAL*8 c_6
REAL*8 c_7
REAL*8 c_8
REAL*8 c_9
REAL*8 c_10
INTEGER*4 i0
REAL*8 mi12
REAL*8 mi13
REAL*8 mi14
REAL*8 mi15
INTEGER*4 k0
REAL*8 c_11
REAL*8 c_12
REAL*8 c_13
REAL*8 c_14
C
C **** statements ****
C
DO j = 1, 8, 3
DO i = 1, 20, 4
mi0 = c(i, j)
mi1 = c((i + 3), (j + 2))
mi2 = c((i + 3), (j + 1))
mi3 = c(i, (j + 1))
mi4 = c((i + 3), j)
mi5 = c((i + 2), (j + 2))
mi6 = c((i + 2), (j + 1))
mi7 = c(i, (j + 2))
mi8 = c((i + 2), j)
mi9 = c((i + 1), (j + 2))
mi10 = c((i + 1), (j + 1))
mi11 = c((i + 1), j)
DO k = 1, 20, 1
c_1 = mi0
mi0 = (c_1 -(a(i, k) * b(k, j)))
c_ = mi3
mi3 = (c_ -(a(i, k) * b(k, (j + 1))))
c_0 = mi7
mi7 = (c_0 -(a(i, k) * b(k, (j + 2))))
c_2 = mi11
mi11 = (c_2 -(a((i + 1), k) * b(k, j)))
c_3 = mi10
mi10 = (c_3 -(a((i + 1), k) * b(k, (j + 1))))
c_4 = mi9
mi9 = (c_4 -(a((i + 1), k) * b(k, (j + 2))))
c_5 = mi8
mi8 = (c_5 -(a((i + 2), k) * b(k, j)))
c_6 = mi6
mi6 = (c_6 -(a((i + 2), k) * b(k, (j + 1))))
c_7 = mi5
mi5 = (c_7 -(a((i + 2), k) * b(k, (j + 2))))
c_8 = mi4
mi4 = (c_8 -(a((i + 3), k) * b(k, j)))
c_9 = mi2
mi2 = (c_9 -(a((i + 3), k) * b(k, (j + 1))))
c_10 = mi1
mi1 = (c_10 -(a((i + 3), k) * b(k, (j + 2))))
END DO
c((i + 1), j) = mi11
c((i + 1), (j + 1)) = mi10
c((i + 1), (j + 2)) = mi9
c((i + 2), j) = mi8
c(i, (j + 2)) = mi7
c((i + 2), (j + 1)) = mi6
c((i + 2), (j + 2)) = mi5
c((i + 3), j) = mi4
c(i, (j + 1)) = mi3
c((i + 3), (j + 1)) = mi2
c((i + 3), (j + 2)) = mi1
c(i, j) = mi0
END DO
END DO
DO i0 = 1, 20, 4
mi12 = c(i0, 10_8)
mi13 = c((i0 + 1), 10_8)
mi14 = c((i0 + 3), 10_8)
mi15 = c((i0 + 2), 10_8)
DO k0 = 1, 20, 1
c_11 = mi12
mi12 = (c_11 -(a(i0, k0) * b(k0, 10_8)))
c_12 = mi13
mi13 = (c_12 -(a((i0 + 1), k0) * b(k0, 10_8)))
c_13 = mi15
mi15 = (c_13 -(a((i0 + 2), k0) * b(k0, 10_8)))
c_14 = mi14
mi14 = (c_14 -(a((i0 + 3), k0) * b(k0, 10_8)))
END DO
c((i0 + 2), 10_8) = mi15
c((i0 + 3), 10_8) = mi14
c((i0 + 1), 10_8) = mi13
c(i0, 10_8) = mi12
END DO
WRITE(6, '(F18.10)') c(9_8, 8_8)
STOP
END ! MAIN
The outermost loop is unrolled by three as suggested above and the second loop is unrolled by a factor of four. Note that we have not unrolled the inner loop. The code generation phase of the compiler back end can effectively unroll inner loops automatically to eliminate redundant loads.
In optimizing for the R10000, LNO also unrolls the outermost loop by three, but unrolls the second loop by 5. LNO can also automatically tile the code to improve cache behavior. You can use the -LNO: option group flags to describe your cache characteristics to LNO. For more information, please consult the MIPSpro Compiling, Debugging and Performance Tuning Guide.
%f77 -64 -O3 -r10000 -FLIST:=ON mmul.f
%cat mmul.w2f.f
C ***********************************************************
C Fortran file translated from WHIRL Mon Mar 22 09:41:39 1999
C ***********************************************************
PROGRAM MAIN
IMPLICIT NONE
C
C **** Variables and functions ****
C
REAL*8 a(100_8, 100_8)
REAL*8 b(100_8, 100_8)
REAL*8 c(100_8, 100_8)
INTEGER*4 j
INTEGER*4 i
INTEGER*4 k
C
C **** Temporary variables ****
C
REAL*8 mi0
REAL*8 mi1
REAL*8 mi2
REAL*8 mi3
REAL*8 mi4
REAL*8 mi5
REAL*8 mi6
REAL*8 mi7
REAL*8 mi8
REAL*8 mi9
REAL*8 mi10
REAL*8 mi11
REAL*8 mi12
REAL*8 mi13
REAL*8 mi14
INTEGER*4 i0
REAL*8 mi15
REAL*8 mi16
REAL*8 mi17
REAL*8 mi18
REAL*8 mi19
INTEGER*4 k0
REAL*8 mi20
REAL*8 mi21
REAL*8 mi22
REAL*8 mi23
REAL*8 mi24
C
C **** statements ****
C
DO j = 1, 8, 3
DO i = 1, 20, 5
mi0 = c(i, j)
mi1 = c((i + 4), (j + 2))
mi2 = c((i + 4), (j + 1))
mi3 = c(i, (j + 1))
mi4 = c((i + 4), j)
mi5 = c((i + 3), (j + 2))
mi6 = c((i + 3), (j + 1))
mi7 = c(i, (j + 2))
mi8 = c((i + 3), j)
mi9 = c((i + 2), (j + 2))
mi10 = c((i + 2), (j + 1))
mi11 = c((i + 1), j)
mi12 = c((i + 2), j)
mi13 = c((i + 1), (j + 2))
mi14 = c((i + 1), (j + 1))
DO k = 1, 20, 1
mi0 = (mi0 -(a(i, k) * b(k, j)))
mi3 = (mi3 -(a(i, k) * b(k, (j + 1))))
mi7 = (mi7 -(a(i, k) * b(k, (j + 2))))
mi11 = (mi11 -(a((i + 1), k) * b(k, j)))
mi14 = (mi14 -(a((i + 1), k) * b(k, (j + 1))))
mi13 = (mi13 -(a((i + 1), k) * b(k, (j + 2))))
mi12 = (mi12 -(a((i + 2), k) * b(k, j)))
mi10 = (mi10 -(a((i + 2), k) * b(k, (j + 1))))
mi9 = (mi9 -(a((i + 2), k) * b(k, (j + 2))))
mi8 = (mi8 -(a((i + 3), k) * b(k, j)))
mi6 = (mi6 -(a((i + 3), k) * b(k, (j + 1))))
mi5 = (mi5 -(a((i + 3), k) * b(k, (j + 2))))
mi4 = (mi4 -(a((i + 4), k) * b(k, j)))
mi2 = (mi2 -(a((i + 4), k) * b(k, (j + 1))))
mi1 = (mi1 -(a((i + 4), k) * b(k, (j + 2))))
END DO
c((i + 1), (j + 1)) = mi14
c((i + 1), (j + 2)) = mi13
c((i + 2), j) = mi12
c((i + 1), j) = mi11
c((i + 2), (j + 1)) = mi10
c((i + 2), (j + 2)) = mi9
c((i + 3), j) = mi8
c(i, (j + 2)) = mi7
c((i + 3), (j + 1)) = mi6
c((i + 3), (j + 2)) = mi5
c((i + 4), j) = mi4
c(i, (j + 1)) = mi3
c((i + 4), (j + 1)) = mi2
c((i + 4), (j + 2)) = mi1
c(i, j) = mi0
END DO
END DO
DO i0 = 1, 20, 5
IF(IAND(i0, 31) .EQ. 16) THEN
mi15 = c(i0, 10_8)
mi16 = c((i0 + 1), 10_8)
mi17 = c((i0 + 4), 10_8)
mi18 = c((i0 + 2), 10_8)
mi19 = c((i0 + 3), 10_8)
DO k0 = 1, 20, 1
mi15 = (mi15 -(a(i0, k0) * b(k0, 10_8)))
mi16 = (mi16 -(a((i0 + 1), k0) * b(k0, 10_8)))
mi18 = (mi18 -(a((i0 + 2), k0) * b(k0, 10_8)))
mi19 = (mi19 -(a((i0 + 3), k0) * b(k0, 10_8)))
mi17 = (mi17 -(a((i0 + 4), k0) * b(k0, 10_8)))
END DO
c((i0 + 3), 10_8) = mi19
c((i0 + 2), 10_8) = mi18
c((i0 + 4), 10_8) = mi17
c((i0 + 1), 10_8) = mi16
c(i0, 10_8) = mi15
ELSE
mi20 = c(i0, 10_8)
mi21 = c((i0 + 1), 10_8)
mi22 = c((i0 + 4), 10_8)
mi23 = c((i0 + 2), 10_8)
mi24 = c((i0 + 3), 10_8)
DO k0 = 1, 20, 1
mi20 = (mi20 -(a(i0, k0) * b(k0, 10_8)))
mi21 = (mi21 -(a((i0 + 1), k0) * b(k0, 10_8)))
mi23 = (mi23 -(a((i0 + 2), k0) * b(k0, 10_8)))
mi24 = (mi24 -(a((i0 + 3), k0) * b(k0, 10_8)))
mi22 = (mi22 -(a((i0 + 4), k0) * b(k0, 10_8)))
END DO
c((i0 + 3), 10_8) = mi24
c((i0 + 2), 10_8) = mi23
c((i0 + 4), 10_8) = mi22
c((i0 + 1), 10_8) = mi21
c(i0, 10_8) = mi20
ENDIF
END DO
WRITE(6, '(F18.10)') c(9_8, 8_8)
STOP
END ! MAIN
The IVDEP (Ignore Vector Dependencies) directive originated in Cray Fortran. It is a Fortran directive or C pragma that tells the compiler to be less strict when deciding whether it can exploit parallelism between loop iterations. By default, the compilers do the safe thing: they try to prove that there is no possible conflict between two memory references. If they can prove this, then it is safe for one of the references to pass the other.
In particular, to overlap the calculation of two consecutive iterations, you need to be able to perform the load from iteration i+1 before the store from iteration i.
Now suppose you have a loop like:
do i = 1, n
  a(l(i)) = a(l(i)) + ...
enddo
The compiler has no way to know that
&a(l(i)) != &a(l(i+1))
without knowing something about the vector l. For example, if every element of l is 5, then
&a(l(i)) == &a(l(i+1))
for all values of i.
But you sometimes know something the compiler doesn't. Perhaps in the example above, l is a permutation vector and all its elements are unique. You'd like a way to tell the compiler to be less conservative. The IVDEP directive is a way to accomplish this.
Placed above a loop, the statement:
cdir$ ivdep
tells the compiler to assume there are no dependencies in the code.
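The same hint can be expressed in C. The sketch below marks the permutation-vector update from the earlier example; the `#pragma GCC ivdep` spelling is GCC's, used here so the sketch is self-contained, and the MIPSpro C pragma spelling is assumed to differ. With or without the pragma, the results are identical; the pragma only licenses the compiler to overlap iterations.

```c
/* Sketch: the gather/scatter update a(l(i)) = a(l(i)) + v(i) in C.  If l is
   known to be a permutation (every l[i] unique), the iterations are
   independent, but the compiler cannot prove it, so we assert it. */
void scatter_add(double *a, const int *l, const double *v, int n) {
#pragma GCC ivdep  /* promise: no two l[i] alias the same a element */
    for (int i = 0; i < n; i++)
        a[l[i]] = a[l[i]] + v[i];
}
```

If the promise is false (some l[i] repeats), the pragma can make the compiler generate wrong answers, which is exactly the danger the note below warns about.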
The MIPSpro v7.x compilers provide support for three different interpretations of the IVDEP directive, because there is no clearly established standard. Under the default interpretation, given two memory references where at least one is loop variant, the compiler will ignore any loop-carried dependences between the two references. Some examples:
do i = 1,n
  b(k) = b(k) + a(i)
enddo
Use of IVDEP will not break the dependence since b(k) is not loop variant.
do i = 1,n
  a(i) = a(i-1) + 3.
enddo
Use of IVDEP does break the dependence but the compiler warns the user that it's breaking an obvious dependence.
do i = 1,n
  a(b(i)) = a(b(i)) + 3.
enddo
Use of IVDEP does break the dependence.
do i = 1,n
  a(i) = b(i)
  c(i) = a(i) + 3.
enddo
Use of IVDEP does not break the dependence on a(i) since it is within an iteration.
The second interpretation of IVDEP is used if you use the -OPT:cray_ivdep=TRUE command line option. Under this interpretation, the compiler uses Cray semantics. It breaks all lexically backwards dependences. Some examples:
do i = 1,n
  a(i) = a(i-1) + 3.
enddo
Use of IVDEP does break the dependence but the compiler warns the user that it's breaking an obvious dependence.
do i = 1,n
  a(i) = a(i+1) + 3.
enddo
Use of IVDEP does not break the dependence since the dependence is from the load to the store, and the load comes lexically before the store.
The third interpretation of IVDEP is used if you use the -OPT:liberal_ivdep=TRUE command line option. Under this interpretation, the compiler will break all dependences.
Note: IVDEP IS DANGEROUS! If your code really isn't free of vector dependences, you may be telling the compiler to perform an illegal transformation, which will cause your program to get wrong answers. But IVDEP is also powerful, and you may very well find yourself in a position where you need to use it. You just have to be very careful when you do.
The MIPSpro 64-bit compilers support both single and double precision versions of the following vector intrinsic functions: asin(), acos(), atan(), cos(), exp(), log(), sin(), tan(), sqrt(). In C they are declared as follows:
/* single precision vector routines */
extern __vacosf( float *x, float *y, int count, int stridex, int stridey );
extern __vasinf( float *x, float *y, int count, int stridex, int stridey );
extern __vatanf( float *x, float *y, int count, int stridex, int stridey );
extern __vcosf( float *x, float *y, int count, int stridex, int stridey );
extern __vexpf( float *x, float *y, int count, int stridex, int stridey );
extern __vlogf( float *x, float *y, int count, int stridex, int stridey );
extern __vsinf( float *x, float *y, int count, int stridex, int stridey );
extern __vtanf( float *x, float *y, int count, int stridex, int stridey );

/* double precision vector routines */
extern __vacos( double *x, double *y, int count, int stridex, int stridey );
extern __vasin( double *x, double *y, int count, int stridex, int stridey );
extern __vatan( double *x, double *y, int count, int stridex, int stridey );
extern __vcos( double *x, double *y, int count, int stridex, int stridey );
extern __vexp( double *x, double *y, int count, int stridex, int stridey );
extern __vlog( double *x, double *y, int count, int stridex, int stridey );
extern __vsin( double *x, double *y, int count, int stridex, int stridey );
extern __vtan( double *x, double *y, int count, int stridex, int stridey );
The variables x and y are assumed to be pointers to non-overlapping arrays. Each routine is functionally equivalent to the following pseudo-code fragment:
do i = 0, count-1
y[i*stridey] = func(x[i*stridex])
enddo |
where func() is the scalar version of the vector intrinsic.
The vector intrinsics are optimized and software pipelined to take advantage of the R8000's performance features. Throughput is several times greater than that of repeatedly calling the corresponding scalar function though the result may not necessarily agree to the last bit. For further information about accuracy and restrictions of the vector intrinsics please refer to your IRIX Compiler_dev Release Notes.
All of the vector intrinsics can be called explicitly from your program. In Fortran, the following example invokes the vector intrinsic vexpf().
real*4 x(N), y(N)
integer*8 N, i
i = 1
call vexpf$( %val(%loc(x(1))),
& %val(%loc(y(1))),
& %val(N),%val(i),%val(i))
end
The compiler at optimization level -O3 also recognizes the use of scalar versions of these intrinsics on array elements inside loops and turns them into calls to the vector versions. If you need to turn this feature off to improve the accuracy of your results, add -LNO:vint=OFF to your compilation command line or switch to a lower optimization level.