Low-level coding decisions


Revision: 7/6/12

We have about half a dozen methods of performing sparse matrix-dense vector multiplication (see Working Note 2). Many of them have parameters, so there are many variations in all. Which method is most efficient varies from matrix to matrix and from machine to machine.

Furthermore, each method can be coded in a variety of ways. The core of these computations is usually a very simple loop (or collection of loops), and the main way in which the codings vary is in the use of array subscripting versus pointer arithmetic with explicit dereferencing.

Aside from wanting to find the most efficient code for each algorithm, we also want to reduce the number of different codes we need to deal with when choosing the best method for a given matrix.

In this note, we review alternate codings for most of our methods, explaining why we chose the specific coding we did. The methods themselves are described in a separate working note.

Unfolding

The most basic methods are unrolling (the standard "compressed-sparse rows" code, with and without unrolled loops) and unfolding, in which the entire matrix multiplication is coded as a straight-line sequence of floating-point operations. The unfolding specializer has two arguments, b_r and b_c, giving the blocking factors. If the block size --- or, more accurately, b_c --- is greater than n (a case we denote ∞), the generated code has exactly one assignment per row; otherwise, the matrix is divided into blocks and each block produces b_r assignments, each accounting for b_c columns. We ran tests using square blocks with b_r = b_c = 10, 20, ..., 200, rectangular blocks ∞ x 10, ..., ∞ x 200, and ∞ x ∞. (In practice, we have observed that the last is always at or near the optimal time, but we have not yet determined whether we can eliminate the other block sizes.) Unfolding produces code that has no auxiliary data structures. Here is a segment of the code produced by this method:

   w[0] += 1037038.99259*v[0] + 259259.059259*v[1]
        + -1185186.25185*v[9] + -296296.651852*v[10]
        + 148148.148148*v[18] + 37037.1481481*v[19];
   w[1] += 259259.059259*v[0] + 296297.540741*v[1]
        + 259259.059259*v[2] + -18518.5518519*v[3]
        + -296296.651852*v[9] + -148148.548148*v[10]
        + -296296.651852*v[11] + -74073.962963*v[12]
        + 37037.1481481*v[18] + -148148.214815*v[19]
        + 37037.1481482*v[20] + 92592.5703704*v[21];

The floating-point numbers in the above code are the non-zero elements of the matrix. (The examples in this note are drawn from the matrix fidap005, which is the smallest matrix in our standard test suite, with 297 non-zeroes.)
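
For concreteness, here is a minimal sketch of how such straight-line code might be generated for the ∞ x ∞ case, starting from the matrix in CSR form. (The function unfold_inf and its argument names are ours, for illustration only; they are not the actual interface of our specializer.)

   #include <stdio.h>

   /* Emit one "w[i] += ..." assignment per non-empty row, reading
      the matrix from standard CSR arrays (rowptr, cols, vals). */
   void unfold_inf(int n, const int *rowptr, const int *cols,
                   const double *vals, FILE *out)
   {
      for (int i = 0; i < n; i++) {
         if (rowptr[i] == rowptr[i+1]) continue;   /* skip empty rows */
         fprintf(out, "   w[%d] +=", i);
         for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            fprintf(out, "%s %.12g*v[%d]",
                    k == rowptr[i] ? "" : "\n        +", vals[k], cols[k]);
         fprintf(out, ";\n");
      }
   }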

An alternative is to place the values of the matrix in an array and refer to that array instead of including the values themselves. That leads to the following code:

   w[0] += mvalues1[0] * v[0] + mvalues1[1] * v[1]
        + mvalues1[2] * v[9] + mvalues1[3] * v[10]
        + mvalues1[4] * v[18] + mvalues1[5] * v[19];
   w[1] += mvalues1[6] * v[0] + mvalues1[7] * v[1]
        + mvalues1[8] * v[2] + mvalues1[9] * v[3]
        + mvalues1[10] * v[9] + mvalues1[11] * v[10]
        + mvalues1[12] * v[11] + mvalues1[13] * v[12]
        + mvalues1[14] * v[18] + mvalues1[15] * v[19]
        + mvalues1[16] * v[20] + mvalues1[17] * v[21];

We would not expect this second variant to be as fast as the first, since each term now requires an extra array reference to fetch the matrix value.

Thus, we have 41 versions of unfolding code for each matrix (20 square blockings, 20 rectangular ones, and ∞ x ∞), and, with 17 matrices, 697 timings on each machine for each alternative. (The method for taking timings is described in Working Note 1.) In the table below, the columns give the number of timings on which alternative 1 beat alternative 2, the number on which alternative 2 beat alternative 1 (i.e., 697 minus the first column), the number on which alternative 1 was at least 10% better ("wins big"), and the same for alternative 2. The results are:

Machine    Alt 1 wins   Alt 2 wins   Alt 1 wins big   Alt 2 wins big
chicago       618            79           239                 4
pl            511           186           214                 4
loome1        578           119           191                25
loome2        609            88           228                 3
loome3        533           164           187                66
i2pc4         592           105           219                22
Timings for two versions of unfolding

Although it is surprising how often the second alternative is more efficient, the first alternative is the clear winner. Thus, we have eliminated the second version and use only the first in our research.

Unrolling

The basic CSR method is described in the working note on the algorithms. Here is the CSR code unrolled by a factor of five:

   nz = 0;
   for (row = 0; row < 27; row++) {
      nz = rows1[row] - nz;   /* nz is zero here, so this is the row's non-zero count */
      ww = w[row];
      while (nz >= 5) {
         ww += v[*col++] * *(mvals++);
         ww += v[*col++] * *(mvals++);
         ww += v[*col++] * *(mvals++);
         ww += v[*col++] * *(mvals++);
         ww += v[*col++] * *(mvals++);
         nz = nz-5;
      }
      while (nz > 0) {
         ww += v[*col++] * *(mvals++);
         nz--;
      }
      w[row] = ww;
   }

It uses address arithmetic and dereferencing. (Subscripting is used for v, but that cannot be avoided; or rather, we could write "*(v+*col++)", but we would not expect that to be more efficient than subscripting. The places where we use address arithmetic are those where doing so potentially saves some calculation.)
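
In C the two forms are synonymous by definition (v[i] means *(v+i)), so any difference between them would come only from the code the compiler generates:

   ww += v[*col++] * *(mvals++);       /* subscripted access to v           */
   ww += *(v + *col++) * *(mvals++);   /* the identical access, spelled out */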

We are only interested in the inner loop. Here is what it looks like with subscripting:

      k = rows1[i];
      kk = rows1[i+1];
      for ( ; k <= kk-5; k += 5) {
         ww += mvalues1[k + 0] * v[cols1[k + 0]]
             + mvalues1[k + 1] * v[cols1[k + 1]]
             + mvalues1[k + 2] * v[cols1[k + 2]]
             + mvalues1[k + 3] * v[cols1[k + 3]]
             + mvalues1[k + 4] * v[cols1[k + 4]];
      }

Aside from the use of subscripting, another difference is that this code consists of a single assignment statement, while the first one had five. The reason is that an expression containing multiple side-effecting operations on the same object (here, the repeated ++ on col and mvals) has undefined behavior in C, so the pointer version must be divided into separate statements.
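
To illustrate, folding two of the pointer terms into a single expression, as in the first line below, compiles but increments col and mvals twice with no intervening sequence point; splitting it into separate statements makes the increments well-defined:

   /* Undefined behavior: col and mvals are each modified twice
      within a single expression. */
   ww += v[*col++] * *(mvals++) + v[*col++] * *(mvals++);

   /* Well-defined: each statement boundary is a sequence point. */
   ww += v[*col++] * *(mvals++);
   ww += v[*col++] * *(mvals++);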

We also coded a very slight variant on this:

      k = rows1[i];
      kk = rows1[i+1];
      for ( ; k <= kk-5; k += 5) {
         mv = mvalues1+k;
         c = cols1+k;
         ww += mv[0] * v[c[0]] + mv[1] * v[c[1]]
             + mv[2] * v[c[2]] + mv[3] * v[c[3]]
             + mv[4] * v[c[4]];
      }

The results are given in the following four tables. We use 10 unrolling factors (1 to 10) on each of the 17 matrices, for a total of 170 timings on each machine for each alternative coding. The first table shows how often each alternative was the fastest of the three.

Machine    Alt 1 wins   Alt 2 wins   Alt 3 wins
chicago        26           131           13
pl             13            87           70
loome1         21            68           80
loome2          6            70           93
loome3          4           104           62
i2pc4           9            80           80
Comparison of three versions of unrolling

The next three tables give head-to-head comparisons between each pair of alternatives.

Machine    Alt 1 wins   Alt 2 wins   Alt 1 wins big   Alt 2 wins big
chicago        26           144             1                32
pl             13           157             0                60
loome1         25           145            13                43
loome2         14           156             2                35
loome3         17           153             0               104
i2pc4          11           159             0                37
Timings for versions 1 and 2 of unrolling

Machine    Alt 1 wins   Alt 3 wins   Alt 1 wins big   Alt 3 wins big
chicago        70           100             2                23
pl             13           157             0                60
loome1         41           129            26                49
loome2         20           150             6                51
loome3         29           142            18               103
i2pc4          23           147             6                50
Timings for versions 1 and 3 of unrolling

Machine    Alt 2 wins   Alt 3 wins   Alt 2 wins big   Alt 3 wins big
chicago       154            16             1                 1
pl             91            79             0                 0
loome1         79            91            30                15
loome2         73            97             8                13
loome3        107            63            19                 9
i2pc4          84            86            10                10
Timings for versions 2 and 3 of unrolling

Clearly, version 1 is not competitive. Versions 2 and 3 differ by a surprising amount, but version 2 has the clear edge: in the head-to-head comparison with version 3 it wins 58% of the time (588 of the 1020 timings), and it wins big 68 times to version 3's 48. So we have chosen version 2 for all timings given elsewhere.

CSRbyNZ

Our variants of CSRbyNZ differ in the same ways as the unrolling variants: the first uses subscripting, the second address arithmetic and explicit dereferencing, and the third per-iteration pointers with subscripting. (CSRbyNZ groups rows by their number of non-zeroes and generates one loop per group; the first and second examples below are from a group of six rows with twelve non-zeroes each, the third from a group with fifteen.) Here is one of the loops from the first alternative:

   for (i=0; i<6; i++) {
      row = rows1[a1];
      a1++;
      w[row] += mvalues1[b1+0] * v[cols1[b1+0]]
              + mvalues1[b1+1] * v[cols1[b1+1]]
              + mvalues1[b1+2] * v[cols1[b1+2]]
              + mvalues1[b1+3] * v[cols1[b1+3]]
              + mvalues1[b1+4] * v[cols1[b1+4]]
              + mvalues1[b1+5] * v[cols1[b1+5]]
              + mvalues1[b1+6] * v[cols1[b1+6]]
              + mvalues1[b1+7] * v[cols1[b1+7]]
              + mvalues1[b1+8] * v[cols1[b1+8]]
              + mvalues1[b1+9] * v[cols1[b1+9]]
              + mvalues1[b1+10] * v[cols1[b1+10]]
              + mvalues1[b1+11] * v[cols1[b1+11]];
      b1 += 12;
   }

And here is the second alternative:

   for (i=0; i<6; i++) {
      row = rows1[a1];
      a1++;
      ww = 0.0;
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      ww += *mval1++ * v[*colptr1++];
      w[row] += ww;
   }

and the third:

   for (i=0; i<6; i++) {
      row = rows1[a1];
      a1++;
      mvals = mvalues1+b1;
      c = cols1+b1;
      w[row] += mvals[0] * v[c[0]] + mvals[1] * v[c[1]]
              + mvals[2] * v[c[2]] + mvals[3] * v[c[3]]
              + mvals[4] * v[c[4]] + mvals[5] * v[c[5]]
              + mvals[6] * v[c[6]] + mvals[7] * v[c[7]]
              + mvals[8] * v[c[8]] + mvals[9] * v[c[9]]
              + mvals[10] * v[c[10]] + mvals[11] * v[c[11]]
              + mvals[12] * v[c[12]] + mvals[13] * v[c[13]]
              + mvals[14] * v[c[14]];
      b1 += 15;
   }

CSRbyNZ has no parameters, so there are just 17 timings for each alternative on each machine. The first table shows that there is no clear winner, though alternative 2 has a slight edge.

Machine    Alt 1 wins   Alt 2 wins   Alt 3 wins
chicago         8             9            0
pl              8             1            8
loome1          3            10            4
loome2          0            17            0
loome3          8             4            5
i2pc4           3            13            1
Comparison of three versions of CSRbyNZ

The next three tables give head-to-head comparisons between each pair of alternatives. Again, "wins big" means "beats the other time by at least 10%."

Machine    Alt 1 wins   Alt 2 wins   Alt 1 wins big   Alt 2 wins big
chicago         8             9             0                 0
pl             16             1             0                 0
loome1          6            11             1                 4
loome2          0            17             0                 2
loome3         10             7             0                 1
i2pc4           4            13             0                 0
Timings for versions 1 and 2 of CSRbyNZ

Machine    Alt 1 wins   Alt 3 wins   Alt 1 wins big   Alt 3 wins big
chicago        17             0             5                 0
pl              8             9             0                 0
loome1          7            10             2                 4
loome2          5            12             0                 0
loome3         10             7             0                 0
i2pc4          13             4             0                 0
Timings for versions 1 and 3 of CSRbyNZ

Machine    Alt 2 wins   Alt 3 wins   Alt 2 wins big   Alt 3 wins big
chicago        17             0             3                 0
pl              1            16             0                 0
loome1         10             7             4                 3
loome2         17             0             0                 0
loome3          4            13             1                 0
i2pc4          14             3             0                 0
Timings for versions 2 and 3 of CSRbyNZ

None of the three versions is clearly better than the others. The main observation is that they rarely differ by very much. In particular, out of a total of 102 timings for each alternative, alternative 2 is at least 10% slower than another version only four times. (The corresponding numbers for alternatives 1 and 3 are 11 and 15, respectively.) Thus, we feel safe in choosing alternative 2, and we will use it in all future experiments.

Stencil

Examples of the alternative codings of this specializer follow. (In each, the rows listed in stencil1_2 share the same pattern of column offsets relative to the row index, so the column indices need not be stored explicitly.)

   int stencil1_2[3] = {2, 4, 6};
   for (i=0; i<3; i++) {
      row = stencil1_2[i];
      w[row] += v[row-1] * *(mvals1+0) + v[row+0] * *(mvals1+1)
              + v[row+1] * *(mvals1+2) + v[row+8] * *(mvals1+3)
              + v[row+9] * *(mvals1+4) + v[row+10] * *(mvals1+5)
              + v[row+17] * *(mvals1+6) + v[row+18] * *(mvals1+7)
              + v[row+19] * *(mvals1+8);
      mvals1 = mvals1 + 9;
   }

and

   int stencil1_2[3] = {2, 4, 6};
   for (i=0; i<3; i++) {
      row = stencil1_2[i];
      ww = 0.0;
      vv = v+row;
      ww += vv[-1] * (*mvals1++);
      ww += vv[0] * (*mvals1++);
      ww += vv[1] * (*mvals1++);
      ww += vv[8] * (*mvals1++);
      ww += vv[9] * (*mvals1++);
      ww += vv[10] * (*mvals1++);
      ww += vv[17] * (*mvals1++);
      ww += vv[18] * (*mvals1++);
      ww += vv[19] * (*mvals1++);
      w[row] += ww;
   }

and

   int stencil1_2[3] = {2, 4, 6};
   for (i=0; i<3; i++) {
      row = stencil1_2[i];
      vv = v+row;
      w[row] += vv[-1] * mvals1[0] + vv[0] * mvals1[1]
              + vv[1] * mvals1[2] + vv[8] * mvals1[3]
              + vv[9] * mvals1[4] + vv[10] * mvals1[5]
              + vv[17] * mvals1[6] + vv[18] * mvals1[7]
              + vv[19] * mvals1[8];
      mvals1 = mvals1 + 9;
   }

This specializer has no parameters, so there are only 17 timings per machine.

Machine    Alt 1 wins   Alt 2 wins   Alt 3 wins
chicago         3            11            3
pl              1             4           12
loome1          2            10            5
loome2          1             2           14
loome3          1             3           13
i2pc4           1             4           12
Comparison of three versions of stencil

We can fairly conclude that the first coding is the worst. The next three tables give head-to-head comparisons between each pair of alternatives.

Machine    Alt 1 wins   Alt 2 wins   Alt 1 wins big   Alt 2 wins big
chicago         3            14             0                 0
pl              3            14             0                 0
loome1          3            14             1                11
loome2          4            13             1                10
loome3          2            15             0                 9
i2pc4           2            15             0                11
Timings for versions 1 and 2 of stencil

Machine    Alt 1 wins   Alt 3 wins   Alt 1 wins big   Alt 3 wins big
chicago         6            11             0                 0
pl              1            16             0                 0
loome1          5            12             5                 8
loome2          1            16             1                11
loome3          2            15             0                11
i2pc4           2            15             0                11
Timings for versions 1 and 3 of stencil

Machine    Alt 2 wins   Alt 3 wins   Alt 2 wins big   Alt 3 wins big
chicago        12             5             0                 0
pl              4            13             0                 0
loome1         11             6             5                 1
loome2          3            14             1                 1
loome3          3            14             1                 0
i2pc4           4            13             0                 0
Timings for versions 2 and 3 of stencil

As with unrolling, two of the versions are competitive, although in this case neither one is a big winner very often. Version 2 wins big more often than version 3, but almost entirely on loome1, which is the most "volatile" of our machines. Overall, version 3 beats version 2 65 times out of 102, so we are declaring version 3 the winner. This is the version we will use in further experiments.

OSKI

We did not code different versions of OSKI, but we did compare our version with the OSKI code in the library from Berkeley. We just wanted to make sure that we were not inadvertently coding this method in a way that would put it at a disadvantage. For the record, here is our code for OSKI with 2x2 blocking (row iterates over block rows, so its bound is n/2; this code is for fidap005, with n = 27):

   for (row = 0; row < 13; row++) {
      nz = rows[row];              /* number of 2x2 blocks in this block row */
      while (nz > 0) {
         r = row * 2;              /* first row of the block */
         c = *col++ * 2;           /* first column of the block */
         ww = v[c]*(*mvals++);     /* top row of the block */
         ww += v[c+1]*(*mvals++);
         w[r] += ww;
         r = r+1;
         ww = v[c]*(*mvals++);     /* bottom row of the block */
         ww += v[c+1]*(*mvals++);
         w[r] += ww;
         nz--;
      }
   }

We ran the original OSKI code on two machines, loome1 and loome2.

Original OSKI has some start-up overhead from two sources: the time to examine the matrix and determine the best block size; and, if the chosen block size is anything other than 1x1, the time to restructure the data. In our case, this work is done during generation, and therefore not counted toward run time. To neutralize this advantage, we ran original OSKI for a single iteration and then for 10,001 iterations, and subtracted the former time from the latter. We ran our OSKI with block sizes 1x1, 1x2, 2x1, and 2x2, and chose the best. We do not know what block size original OSKI chose; all we know is that it chose whatever it calculated would be best. Finally, one last difference: we compiled original OSKI with gcc, and our code, as usual, with clang.
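
In other words (time_spmv below is a hypothetical helper that runs and times the given number of multiplications; it is not part of OSKI):

   extern double time_spmv(int iterations);  /* hypothetical timing helper */

   /* The setup cost appears in both measurements, so differencing
      cancels it, leaving 10,000 iterations of pure multiplication. */
   double t1     = time_spmv(1);       /* setup + 1 iteration       */
   double t10001 = time_spmv(10001);   /* setup + 10,001 iterations */
   double per_iteration = (t10001 - t1) / 10000.0;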

The results are as follows: On loome1, our times were better on every matrix; on loome2, our times were better on 13 out of 17 matrices.

We think it is reasonable to conclude that our coding of OSKI is fair.

(We also ran original OSKI on loome3 and i2pc4. However, we did not install it on those machines; we simply used the version installed on loome1, so the results are not quite legitimate. OSKI uses install-time tuning to determine, in conjunction with properties of the matrix, which block size to use, so a version installed on one machine and run on another may not choose the best block size for every matrix. In any case, the comparative timings on loome3 and i2pc4 were similar to those on loome1 and loome2.)

Last updated on Mon Aug 20 15:02:08 CDT 2012