Low-level coding decisions
We have about half a dozen methods of performing sparse matrix-dense vector multiplication (see Working Note 2). Many have parameters, so there are many variations. Which of these methods is the most efficient varies with the matrix and the machine.
Furthermore, each method can be coded in a variety of ways. The core of these computations is usually a very simple loop (or collection of loops). The main way in which the coding can vary is in the use of array subscripting versus pointer arithmetic with explicit dereferencing.
Aside from wanting to find the most efficient code for each algorithm, we also want to reduce the number of different codes we need to deal with when choosing the best method for a given matrix.
In this note, we review alternate codings for most of our methods, explaining why we chose the specific coding we did. The methods themselves are described in a separate working note.
The most basic methods are unrolling (the standard "compressed-sparse rows" code, and that code with its loops unrolled) and unfolding, in which the entire matrix multiplication is coded as a straight-line sequence of floating-point operations. The unfolding specializer has two arguments, b_r and b_c, giving the blocking factors. If the block size (more precisely, b_c) is greater than n (a case we denote ∞), the generated code has exactly one assignment per row; otherwise, the matrix is divided into blocks, and each block has b_r assignments, each accounting for b_c columns. We ran tests using square blocks with b_r = b_c = 10, 20, ..., 200, rectangular blocks ∞ x 10, ..., ∞ x 200, and ∞ x ∞. (In practice, we have observed that the latter is always at or near the optimal time, but we have not yet determined that we can eliminate the other block sizes.) Unfolding produces code that has no auxiliary data structures. Here is a segment of the code produced by this method:
w[0] += 1037038.99259*v[0] + 259259.059259*v[1] + -1185186.25185*v[9]
      + -296296.651852*v[10] + 148148.148148*v[18] + 37037.1481481*v[19];
w[1] += 259259.059259*v[0] + 296297.540741*v[1] + 259259.059259*v[2]
      + -18518.5518519*v[3] + -296296.651852*v[9] + -148148.548148*v[10]
      + -296296.651852*v[11] + -74073.962963*v[12] + 37037.1481481*v[18]
      + -148148.214815*v[19] + 37037.1481482*v[20] + 92592.5703704*v[21];
The floating-point numbers in the above code are the non-zero elements of the matrix. (The examples in this note are drawn from the matrix fidap005, which is the smallest matrix in our standard test suite, with 297 non-zeroes.)
An alternative is to place the values of the matrix in an array and include references to that array instead of the values themselves. This leads to code like this:
w[0] += mvalues1[0] * v[0] + mvalues1[1] * v[1] + mvalues1[2] * v[9]
      + mvalues1[3] * v[10] + mvalues1[4] * v[18] + mvalues1[5] * v[19];
w[1] += mvalues1[6] * v[0] + mvalues1[7] * v[1] + mvalues1[8] * v[2]
      + mvalues1[9] * v[3] + mvalues1[10] * v[9] + mvalues1[11] * v[10]
      + mvalues1[12] * v[11] + mvalues1[13] * v[12] + mvalues1[14] * v[18]
      + mvalues1[15] * v[19] + mvalues1[16] * v[20] + mvalues1[17] * v[21];
We would not expect this second variant to be as fast as the first, since it replaces compile-time constants with loads from the mvalues1 array.
Thus we have 41 versions of unfolding code for each matrix and, with 17 matrices, 697 timings on each machine. (The method for taking timings is described in Working Note 1.) The headings for this table mean "alternative 1 had the better time," "alternative 2 had the better time" (i.e., 697 minus the first column), "alternative 1 was at least 10% better than alternative 2," and similarly for alternative 2. The results are:
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 618 | 79 | 239 | 4 |
pl | 511 | 186 | 214 | 4 |
loome1 | 578 | 119 | 191 | 25 |
loome2 | 609 | 88 | 228 | 3 |
loome3 | 533 | 164 | 187 | 66 |
i2pc4 | 592 | 105 | 219 | 22 |
Although it is surprising how often the second alternative is more efficient, the first alternative is the clear winner. We have therefore eliminated the second version and use only the first in our research.
The basic CSR method is described in the working note on the algorithms. Here is the CSR code unrolled by a factor of five:
nz = 0;
for (row = 0; row < 27; row++) {
    nz = rows1[row] - nz;
    ww = w[row];
    while (nz >= 5) {
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        nz = nz-5;
    }
    while (nz > 0) {
        ww += v[*col++] * *(mvals++);
        nz--;
    }
    w[row] = ww;
}
It uses address arithmetic and dereferencing. (Subscripting is used for v, but that cannot be avoided. Or, rather, we could write "*(v+*col++)", but we would not expect that to be more efficient than subscripting; the cases where we are using address arithmetic are those where doing so potentially saves some calculations.)
We are only interested in the inner loop. Here is what it looks like with subscripting:
k = rows1[i];
kk = rows1[i+1];
for (; k <= kk-5; k += 5) {
    ww += mvalues1[k + 0] * v[cols1[k + 0]]
        + mvalues1[k + 1] * v[cols1[k + 1]]
        + mvalues1[k + 2] * v[cols1[k + 2]]
        + mvalues1[k + 3] * v[cols1[k + 3]]
        + mvalues1[k + 4] * v[cols1[k + 4]];
}
Aside from the use of subscripting, another difference is that this code consists of a single assignment statement, while the first one had five. The reason is that an expression that applies ++ to the same pointer more than once modifies that pointer repeatedly with no intervening sequence point, which is undefined behavior in C; the increments therefore have to be divided into separate statements.
We also coded a very slight variant on this:
k = rows1[i];
kk = rows1[i+1];
for (; k <= kk-5; k += 5) {
    mv = mvalues1+k;
    c = cols1+k;
    ww += mv[0] * v[c[0]] + mv[1] * v[c[1]] + mv[2] * v[c[2]]
        + mv[3] * v[c[3]] + mv[4] * v[c[4]];
}
The results are given in the following four tables. We use 10 unrolling factors (1 to 10) for each of the 17 matrices, giving 170 timings on each machine for each alternative coding. The first table shows how often each alternative was the fastest of the three.
Machine | Alt 1 wins | Alt 2 wins | Alt 3 wins |
chicago | 26 | 131 | 13 |
pl | 13 | 87 | 70 |
loome1 | 21 | 68 | 80 |
loome2 | 6 | 70 | 93 |
loome3 | 4 | 104 | 62 |
i2pc4 | 9 | 80 | 80 |
The next three tables give head-to-head comparisons between each pair of alternatives.
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 26 | 144 | 1 | 32 |
pl | 13 | 157 | 0 | 60 |
loome1 | 25 | 145 | 13 | 43 |
loome2 | 14 | 156 | 2 | 35 |
loome3 | 17 | 153 | 0 | 104 |
i2pc4 | 11 | 159 | 0 | 37 |
Machine | Alt 1 wins | Alt 3 wins | Alt 1 wins big | Alt 3 wins big |
chicago | 70 | 100 | 2 | 23 |
pl | 13 | 157 | 0 | 60 |
loome1 | 41 | 129 | 26 | 49 |
loome2 | 20 | 150 | 6 | 51 |
loome3 | 29 | 142 | 18 | 103 |
i2pc4 | 23 | 147 | 6 | 50 |
Machine | Alt 2 wins | Alt 3 wins | Alt 2 wins big | Alt 3 wins big |
chicago | 154 | 16 | 1 | 1 |
pl | 91 | 79 | 0 | 0 |
loome1 | 79 | 91 | 30 | 15 |
loome2 | 73 | 97 | 8 | 13 |
loome3 | 107 | 63 | 19 | 9 |
i2pc4 | 84 | 86 | 10 | 10 |
Clearly, version 1 is not competitive. Versions 2 and 3 differ by a surprising amount. However, version 2 has the clear edge: it beats version 3 in 58% of their head-to-head comparisons, and it wins big 68 times to version 3's 48. So we have chosen version 2 for all timings given elsewhere.
Our variants of CSRbyNZ differ similarly: the first uses subscripting, the second address arithmetic with explicit dereferencing, and the third base pointers with subscripting. Here is one of the loops from the first alternative:
for (i=0; i<6; i++) {
    row = rows1[a1];
    a1++;
    w[row] += mvalues1[b1+0] * v[cols1[b1+0]] + mvalues1[b1+1] * v[cols1[b1+1]]
            + mvalues1[b1+2] * v[cols1[b1+2]] + mvalues1[b1+3] * v[cols1[b1+3]]
            + mvalues1[b1+4] * v[cols1[b1+4]] + mvalues1[b1+5] * v[cols1[b1+5]]
            + mvalues1[b1+6] * v[cols1[b1+6]] + mvalues1[b1+7] * v[cols1[b1+7]]
            + mvalues1[b1+8] * v[cols1[b1+8]] + mvalues1[b1+9] * v[cols1[b1+9]]
            + mvalues1[b1+10] * v[cols1[b1+10]] + mvalues1[b1+11] * v[cols1[b1+11]];
    b1 += 12;
}
And here is the second alternative:
for (i=0; i<6; i++) {
    row = rows1[a1];
    a1++;
    ww = 0.0;
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    w[row] += ww;
}
and the third:
for (i=0; i<6; i++) {
    row = rows1[a1];
    a1++;
    mvals = mvalues1+b1;
    c = cols1+b1;
    w[row] += mvals[0] * v[c[0]] + mvals[1] * v[c[1]] + mvals[2] * v[c[2]]
            + mvals[3] * v[c[3]] + mvals[4] * v[c[4]] + mvals[5] * v[c[5]]
            + mvals[6] * v[c[6]] + mvals[7] * v[c[7]] + mvals[8] * v[c[8]]
            + mvals[9] * v[c[9]] + mvals[10] * v[c[10]] + mvals[11] * v[c[11]]
            + mvals[12] * v[c[12]] + mvals[13] * v[c[13]] + mvals[14] * v[c[14]];
    b1 += 15;
}
CSRbyNZ has no parameters, so there are just 17 timings for each alternative on each machine. The first table shows that there is no clear winner, though alternative 2 has a slight edge.
Machine | Alt 1 wins | Alt 2 wins | Alt 3 wins |
chicago | 8 | 9 | 0 |
pl | 8 | 1 | 8 |
loome1 | 3 | 10 | 4 |
loome2 | 0 | 17 | 0 |
loome3 | 8 | 4 | 5 |
i2pc4 | 3 | 13 | 1 |
The next three tables give head-to-head comparisons between each pair of alternatives. Again, "wins big" means "beats the other time by at least 10%."
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 8 | 9 | 0 | 0 |
pl | 16 | 1 | 0 | 0 |
loome1 | 6 | 11 | 1 | 4 |
loome2 | 0 | 17 | 0 | 2 |
loome3 | 10 | 7 | 0 | 1 |
i2pc4 | 4 | 13 | 0 | 0 |
Machine | Alt 1 wins | Alt 3 wins | Alt 1 wins big | Alt 3 wins big |
chicago | 17 | 0 | 5 | 0 |
pl | 8 | 9 | 0 | 0 |
loome1 | 7 | 10 | 2 | 4 |
loome2 | 5 | 12 | 0 | 0 |
loome3 | 10 | 7 | 0 | 0 |
i2pc4 | 13 | 4 | 0 | 0 |
Machine | Alt 2 wins | Alt 3 wins | Alt 2 wins big | Alt 3 wins big |
chicago | 17 | 0 | 3 | 0 |
pl | 1 | 16 | 0 | 0 |
loome1 | 10 | 7 | 4 | 3 |
loome2 | 17 | 0 | 0 | 0 |
loome3 | 4 | 13 | 1 | 0 |
i2pc4 | 14 | 3 | 0 | 0 |
None of the three versions is clearly better than the others. The main observation is that they rarely differ by very much. In particular, out of a total of 102 timings for each alternative, alternative 2 is at least 10% slower than another version only four times. (The corresponding numbers for alternatives 1 and 3 are 11 and 15, respectively.) Thus, we feel safe in choosing alternative 2, and will use it in all future experiments.
Examples of the alternative codings of the stencil specializer are:
int stencil1_2[3] = {2, 4, 6};
for (i=0; i<3; i++) {
    row = stencil1_2[i];
    w[row] += v[row-1] * *(mvals1+0) + v[row+0] * *(mvals1+1)
            + v[row+1] * *(mvals1+2) + v[row+8] * *(mvals1+3)
            + v[row+9] * *(mvals1+4) + v[row+10] * *(mvals1+5)
            + v[row+17] * *(mvals1+6) + v[row+18] * *(mvals1+7)
            + v[row+19] * *(mvals1+8);
    mvals1 = mvals1 + 9;
}
and
int stencil1_2[3] = {2, 4, 6};
for (i=0; i<3; i++) {
    row = stencil1_2[i];
    ww = 0.0;
    vv = v+row;
    ww += vv[-1] * (*mvals1++);
    ww += vv[0] * (*mvals1++);
    ww += vv[1] * (*mvals1++);
    ww += vv[8] * (*mvals1++);
    ww += vv[9] * (*mvals1++);
    ww += vv[10] * (*mvals1++);
    ww += vv[17] * (*mvals1++);
    ww += vv[18] * (*mvals1++);
    ww += vv[19] * (*mvals1++);
    w[row] += ww;
}
and
int stencil1_2[3] = {2, 4, 6};
for (i=0; i<3; i++) {
    row = stencil1_2[i];
    vv = v+row;
    w[row] += vv[-1] * mvals1[0] + vv[0] * mvals1[1] + vv[1] * mvals1[2]
            + vv[8] * mvals1[3] + vv[9] * mvals1[4] + vv[10] * mvals1[5]
            + vv[17] * mvals1[6] + vv[18] * mvals1[7] + vv[19] * mvals1[8];
    mvals1 = mvals1 + 9;
}
This specializer has no parameters, so there are only 17 timings per machine.
Machine | Alt 1 wins | Alt 2 wins | Alt 3 wins |
chicago | 3 | 11 | 3 |
pl | 1 | 4 | 12 |
loome1 | 2 | 10 | 5 |
loome2 | 1 | 2 | 14 |
loome3 | 1 | 3 | 13 |
i2pc4 | 1 | 4 | 12 |
We can fairly conclude that the first coding is the worst. The next three tables give head-to-head comparisons between each pair of alternatives.
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 3 | 14 | 0 | 0 |
pl | 3 | 14 | 0 | 0 |
loome1 | 3 | 14 | 1 | 11 |
loome2 | 4 | 13 | 1 | 10 |
loome3 | 2 | 15 | 0 | 9 |
i2pc4 | 2 | 15 | 0 | 11 |
Machine | Alt 1 wins | Alt 3 wins | Alt 1 wins big | Alt 3 wins big |
chicago | 6 | 11 | 0 | 0 |
pl | 1 | 16 | 0 | 0 |
loome1 | 5 | 12 | 5 | 8 |
loome2 | 1 | 16 | 1 | 11 |
loome3 | 2 | 15 | 0 | 11 |
i2pc4 | 2 | 15 | 0 | 11 |
Machine | Alt 2 wins | Alt 3 wins | Alt 2 wins big | Alt 3 wins big |
chicago | 12 | 5 | 0 | 0 |
pl | 4 | 13 | 0 | 0 |
loome1 | 11 | 6 | 5 | 1 |
loome2 | 3 | 14 | 1 | 1 |
loome3 | 3 | 14 | 1 | 0 |
i2pc4 | 4 | 13 | 0 | 0 |
As with unrolling, two of the versions are competitive, although in this case neither is a big winner very often. Version 2 wins big more often, but almost all of those wins are on loome1, the most "volatile" of our machines. Overall, version 3 beats version 2 65 times out of 102, so we declare version 3 the winner. This is the version we will use in further experiments.
We did not code different versions of OSKI, but we did compare our version with the OSKI code in the library from Berkeley. We just wanted to make sure that we were not inadvertently coding this method in a way that would put it at a disadvantage.
For the record, here is our code for OSKI with 2x2 blocking (row iterates over block rows, so its range is n/2; this code is for fidap005, with n = 27):
for (row = 0; row < 13; row++) {
    nz = rows[row];
    while (nz > 0) {
        r = row * 2;
        c = *col++ * 2;
        ww = v[c]*(*mvals++);
        ww += v[c+1]*(*mvals++);
        w[r] += ww;
        r = r+1;
        ww = v[c]*(*mvals++);
        ww += v[c+1]*(*mvals++);
        w[r] += ww;
        nz--;
    }
}
We ran the original OSKI code on two machines, loome1 and loome2.
Original OSKI has some one-time overhead from two sources: the time to examine the matrix and determine the best block size; and, if the chosen block size is anything other than 1x1, the time to restructure the data. In our case, this work is done during generation, and therefore not counted toward run time. To neutralize this advantage, we ran original OSKI for a single iteration and then for 10,001 iterations, and subtracted the former from the latter to obtain the timings. We ran our OSKI with block sizes 1x1, 1x2, 2x1, and 2x2, and chose the best. We do not know what block size original OSKI chose; all we know is that it chose whatever it calculated would be best. Finally, one last difference: we compiled original OSKI with gcc, and our code, as usual, with clang.
The results are as follows: On loome1, our times were better on every matrix; on loome2, our times were better on 13 out of 17 matrices.
We think it is reasonable to conclude that our coding of OSKI is fair.
(Actually, we ran original OSKI also on loome3 and i2pc4, but since we did not install it on those machines, but simply used the version installed on loome1, the results are not quite legitimate. OSKI uses install-time tuning to determine - in conjunction with properties of the matrix - which block size to use, so that a version installed on a different machine may not choose the best block size for all matrices. In any case, the comparative timings on loome3 and i2pc4 were similar to those on loome1 and loome2.)
Last updated on Mon Aug 20 15:02:08 CDT 2012