Low-level coding decisions
We have about half a dozen methods of performing sparse matrix-dense vector multiplication (see Working Note 2). Many have parameters, so there are many variations. Which of these methods is the most efficient varies with the matrix and the machine.
Furthermore, each method can be coded in a variety of ways. The core of these computations is usually a very simple loop (or collection of loops). The main way in which the coding can vary is in the use of array subscripting versus pointer arithmetic with explicit dereferencing.
Aside from wanting to find the most efficient code for each algorithm, we also want to reduce the number of different codes we need to deal with when choosing the best method for a given matrix.
In this note, we review alternate codings for most of our methods, explaining why we chose the specific coding we did. The methods themselves are described in a separate working note.
The most basic methods are unrolling (the standard "compressed-sparse rows" code, and that code with its loops unrolled) and unfolding, in which the entire matrix multiplication is coded as a straight-line sequence of floating-point operations. The unfolding specializer has two arguments, b_r and b_c, giving the blocking factors. If the block size (more precisely, b_c) is greater than n (a case we denote ∞), the generated code has exactly one assignment per row; otherwise, the matrix is divided into blocks, and each block has b_r assignments, each accounting for b_c columns. We ran tests using square blocks with b_r = b_c = 10, 20, ..., 200, rectangular blocks ∞ x 10, ..., ∞ x 200, and ∞ x ∞. (In practice, we have observed that the latter is always at or near the optimal time, but we have not yet determined that we can eliminate the other block sizes.) Unfolding produces code that has no auxiliary data structures. Here is a segment of the code produced by this method:
w[0] += 1037038.99259*v[0] + 259259.059259*v[1] + -1185186.25185*v[9]
      + -296296.651852*v[10] + 148148.148148*v[18] + 37037.1481481*v[19];
w[1] += 259259.059259*v[0] + 296297.540741*v[1] + 259259.059259*v[2]
      + -18518.5518519*v[3] + -296296.651852*v[9] + -148148.548148*v[10]
      + -296296.651852*v[11] + -74073.962963*v[12] + 37037.1481481*v[18]
      + -148148.214815*v[19] + 37037.1481482*v[20] + 92592.5703704*v[21];
The floating-point numbers in the above code are the non-zero elements of the matrix. (The examples in this note are drawn from the matrix fidap005, which is the smallest matrix in our standard test suite, with 297 non-zeroes.)
An alternative is to place the values of the matrix in an array and include references to that array instead of the values themselves. This leads to code like this:
w[0] += mvalues1[0] * v[0] + mvalues1[1] * v[1] + mvalues1[2] * v[9]
      + mvalues1[3] * v[10] + mvalues1[4] * v[18] + mvalues1[5] * v[19];
w[1] += mvalues1[6] * v[0] + mvalues1[7] * v[1] + mvalues1[8] * v[2]
      + mvalues1[9] * v[3] + mvalues1[10] * v[9] + mvalues1[11] * v[10]
      + mvalues1[12] * v[11] + mvalues1[13] * v[12] + mvalues1[14] * v[18]
      + mvalues1[15] * v[19] + mvalues1[16] * v[20] + mvalues1[17] * v[21];
We would not expect this second variant to be as fast as the first, since it replaces compile-time constants with loads from the mvalues1 array.
Thus we have 41 versions of unfolding code for each matrix and, with 17 matrices, 697 timings on each machine. (The method for taking timings is described in Working Note 1.) The headings for this table mean "alternative 1 had the better time," "alternative 2 had the better time" (i.e., 697 minus the first column), "alternative 1 was at least 10% better than alternative 2," and similarly for alternative 2. The results are:
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 618 | 79 | 239 | 4 |
pl | 511 | 186 | 214 | 4 |
loome1 | 578 | 119 | 191 | 25 |
loome2 | 609 | 88 | 228 | 3 |
loome3 | 533 | 164 | 187 | 66 |
i2pc4 | 592 | 105 | 219 | 22 |
Although it is surprising how often the second alternative is more efficient, the first alternative is the clear winner. We have therefore eliminated the second version and use only the first in our research.
The basic CSR method is described in the working note on the algorithms. Here is the CSR code unrolled by a factor of five:
nz = 0;
for (row = 0; row < 27; row++) {
    nz = rows1[row] - nz;
    ww = w[row];
    while (nz >= 5) {
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        ww += v[*col++] * *(mvals++);
        nz = nz-5;
    }
    while (nz > 0) {
        ww += v[*col++] * *(mvals++);
        nz--;
    }
    w[row] = ww;
}
It uses address arithmetic and dereferencing. (Subscripting is used for v, but that cannot be avoided. Or, rather, we could write "*(v+*col++)", but we would not expect that to be more efficient than subscripting; the cases where we are using address arithmetic are those where doing so potentially saves some calculations.)
We are only interested in the inner loop. Here is what it looks like with subscripting:
k = rows1[i];
kk = rows1[i+1];
for (; k <= kk-5; k += 5) {
    ww += mvalues1[k + 0] * v[cols1[k + 0]]
        + mvalues1[k + 1] * v[cols1[k + 1]]
        + mvalues1[k + 2] * v[cols1[k + 2]]
        + mvalues1[k + 3] * v[cols1[k + 3]]
        + mvalues1[k + 4] * v[cols1[k + 4]];
}
Aside from the use of subscripting, another difference is that this code consists of a single assignment statement, while the first one had five. The reason is that an expression that applies ++ to the same pointer more than once modifies that pointer repeatedly with no intervening sequence point, which is undefined behavior in C; the increments therefore have to be divided into separate statements.
We also coded a very slight variant on this:
k = rows1[i];
kk = rows1[i+1];
for (; k <= kk-5; k += 5) {
    mv = mvalues1+k;
    c = cols1+k;
    ww += mv[0] * v[c[0]] + mv[1] * v[c[1]] + mv[2] * v[c[2]]
        + mv[3] * v[c[3]] + mv[4] * v[c[4]];
}
The results are given in the following four tables. We use 10 unrolling factors (1 to 10) for each of the 17 matrices, giving 170 timings on each machine for each alternative coding. The first table shows how often each alternative was the fastest of the three.
Machine | Alt 1 wins | Alt 2 wins | Alt 3 wins |
chicago | 26 | 131 | 13 |
pl | 13 | 87 | 70 |
loome1 | 21 | 68 | 80 |
loome2 | 6 | 70 | 93 |
loome3 | 4 | 104 | 62 |
i2pc4 | 9 | 80 | 80 |
The next three tables give head-to-head comparisons between each pair of alternatives.
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 26 | 144 | 1 | 32 |
pl | 13 | 157 | 0 | 60 |
loome1 | 25 | 145 | 13 | 43 |
loome2 | 14 | 156 | 2 | 35 |
loome3 | 17 | 153 | 0 | 104 |
i2pc4 | 11 | 159 | 0 | 37 |
Machine | Alt 1 wins | Alt 3 wins | Alt 1 wins big | Alt 3 wins big |
chicago | 70 | 100 | 2 | 23 |
pl | 13 | 157 | 0 | 60 |
loome1 | 41 | 129 | 26 | 49 |
loome2 | 20 | 150 | 6 | 51 |
loome3 | 29 | 142 | 18 | 103 |
i2pc4 | 23 | 147 | 6 | 50 |
Machine | Alt 2 wins | Alt 3 wins | Alt 2 wins big | Alt 3 wins big |
chicago | 154 | 16 | 1 | 1 |
pl | 91 | 79 | 0 | 0 |
loome1 | 79 | 91 | 30 | 15 |
loome2 | 73 | 97 | 8 | 13 |
loome3 | 107 | 63 | 19 | 9 |
i2pc4 | 84 | 86 | 10 | 10 |
Clearly, version 1 is not competitive. Versions 2 and 3 differ by a surprising amount. However, version 2 has the clear edge: it beats version 3 in 58% of their head-to-head comparisons, and it wins big 68 times to version 3's 48. So we have chosen version 2 for all timings given elsewhere.
Our variants of CSRbyNZ differ similarly: the first uses subscripting, the second address arithmetic with explicit dereferencing, and the third base pointers with subscripting. Here is one of the loops from the first alternative:
for (i=0; i<6; i++) {
    row = rows1[a1];
    a1++;
    w[row] += mvalues1[b1+0] * v[cols1[b1+0]] + mvalues1[b1+1] * v[cols1[b1+1]]
            + mvalues1[b1+2] * v[cols1[b1+2]] + mvalues1[b1+3] * v[cols1[b1+3]]
            + mvalues1[b1+4] * v[cols1[b1+4]] + mvalues1[b1+5] * v[cols1[b1+5]]
            + mvalues1[b1+6] * v[cols1[b1+6]] + mvalues1[b1+7] * v[cols1[b1+7]]
            + mvalues1[b1+8] * v[cols1[b1+8]] + mvalues1[b1+9] * v[cols1[b1+9]]
            + mvalues1[b1+10] * v[cols1[b1+10]] + mvalues1[b1+11] * v[cols1[b1+11]];
    b1 += 12;
}
And here is the second alternative:
for (i=0; i<6; i++) {
    row = rows1[a1];
    a1++;
    ww = 0.0;
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    ww += *mval1++ * v[*colptr1++];
    w[row] += ww;
}
and the third:
for (i=0; i<6; i++) {
    row = rows1[a1];
    a1++;
    mvals = mvalues1+b1;
    c = cols1+b1;
    w[row] += mvals[0] * v[c[0]] + mvals[1] * v[c[1]] + mvals[2] * v[c[2]]
            + mvals[3] * v[c[3]] + mvals[4] * v[c[4]] + mvals[5] * v[c[5]]
            + mvals[6] * v[c[6]] + mvals[7] * v[c[7]] + mvals[8] * v[c[8]]
            + mvals[9] * v[c[9]] + mvals[10] * v[c[10]] + mvals[11] * v[c[11]]
            + mvals[12] * v[c[12]] + mvals[13] * v[c[13]] + mvals[14] * v[c[14]];
    b1 += 15;
}
CSRbyNZ has no parameters, so there are just 17 timings for each alternative on each machine. The first table shows that there is no clear winner, though alternative 2 has a slight edge.
Machine | Alt 1 wins | Alt 2 wins | Alt 3 wins |
chicago | 8 | 9 | 0 |
pl | 8 | 1 | 8 |
loome1 | 3 | 10 | 4 |
loome2 | 0 | 17 | 0 |
loome3 | 8 | 4 | 5 |
i2pc4 | 3 | 13 | 1 |
The next three tables give head-to-head comparisons between each pair of alternatives. Again, "wins big" means "beats the other time by at least 10%."
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 8 | 9 | 0 | 0 |
pl | 16 | 1 | 0 | 0 |
loome1 | 6 | 11 | 1 | 4 |
loome2 | 0 | 17 | 0 | 2 |
loome3 | 10 | 7 | 0 | 1 |
i2pc4 | 4 | 13 | 0 | 0 |
Machine | Alt 1 wins | Alt 3 wins | Alt 1 wins big | Alt 3 wins big |
chicago | 17 | 0 | 5 | 0 |
pl | 8 | 9 | 0 | 0 |
loome1 | 7 | 10 | 2 | 4 |
loome2 | 5 | 12 | 0 | 0 |
loome3 | 10 | 7 | 0 | 0 |
i2pc4 | 13 | 4 | 0 | 0 |
Machine | Alt 2 wins | Alt 3 wins | Alt 2 wins big | Alt 3 wins big |
chicago | 17 | 0 | 3 | 0 |
pl | 1 | 16 | 0 | 0 |
loome1 | 10 | 7 | 4 | 3 |
loome2 | 17 | 0 | 0 | 0 |
loome3 | 4 | 13 | 1 | 0 |
i2pc4 | 14 | 3 | 0 | 0 |
None of the three versions is clearly better than the others. The main observation is that they rarely differ by very much. In particular, out of a total of 102 timings for each alternative, alternative 2 is at least 10% slower than another version only four times. (The corresponding numbers for alternatives 1 and 3 are 11 and 15, respectively.) Thus, we feel safe in choosing alternative 2, and will use it in all future experiments.
Examples of the alternative codings of the stencil specializer are:
int stencil1_2[3] = {2, 4, 6};
for (i=0; i<3; i++) {
    row = stencil1_2[i];
    w[row] += v[row-1] * *(mvals1+0) + v[row+0] * *(mvals1+1)
            + v[row+1] * *(mvals1+2) + v[row+8] * *(mvals1+3)
            + v[row+9] * *(mvals1+4) + v[row+10] * *(mvals1+5)
            + v[row+17] * *(mvals1+6) + v[row+18] * *(mvals1+7)
            + v[row+19] * *(mvals1+8);
    mvals1 = mvals1 + 9;
}
and
int stencil1_2[3] = {2, 4, 6};
for (i=0; i<3; i++) {
    row = stencil1_2[i];
    ww = 0.0;
    vv = v+row;
    ww += vv[-1] * (*mvals1++);
    ww += vv[0] * (*mvals1++);
    ww += vv[1] * (*mvals1++);
    ww += vv[8] * (*mvals1++);
    ww += vv[9] * (*mvals1++);
    ww += vv[10] * (*mvals1++);
    ww += vv[17] * (*mvals1++);
    ww += vv[18] * (*mvals1++);
    ww += vv[19] * (*mvals1++);
    w[row] += ww;
}
and
int stencil1_2[3] = {2, 4, 6};
for (i=0; i<3; i++) {
    row = stencil1_2[i];
    vv = v+row;
    w[row] += vv[-1] * mvals1[0] + vv[0] * mvals1[1] + vv[1] * mvals1[2]
            + vv[8] * mvals1[3] + vv[9] * mvals1[4] + vv[10] * mvals1[5]
            + vv[17] * mvals1[6] + vv[18] * mvals1[7] + vv[19] * mvals1[8];
    mvals1 = mvals1 + 9;
}
This specializer has no parameters, so there are only 17 timings per machine.
Machine | Alt 1 wins | Alt 2 wins | Alt 3 wins |
chicago | 3 | 11 | 3 |
pl | 1 | 4 | 12 |
loome1 | 2 | 10 | 5 |
loome2 | 1 | 2 | 14 |
loome3 | 1 | 3 | 13 |
i2pc4 | 1 | 4 | 12 |
We can fairly conclude that the first coding is the worst. The next three tables give head-to-head comparisons between each pair of alternatives.
Machine | Alt 1 wins | Alt 2 wins | Alt 1 wins big | Alt 2 wins big |
chicago | 3 | 14 | 0 | 0 |
pl | 3 | 14 | 0 | 0 |
loome1 | 3 | 14 | 1 | 11 |
loome2 | 4 | 13 | 1 | 10 |
loome3 | 2 | 15 | 0 | 9 |
i2pc4 | 2 | 15 | 0 | 11 |
Machine | Alt 1 wins | Alt 3 wins | Alt 1 wins big | Alt 3 wins big |
chicago | 6 | 11 | 0 | 0 |
pl | 1 | 16 | 0 | 0 |
loome1 | 5 | 12 | 5 | 8 |
loome2 | 1 | 16 | 1 | 11 |
loome3 | 2 | 15 | 0 | 11 |
i2pc4 | 2 | 15 | 0 | 11 |
Machine | Alt 2 wins | Alt 3 wins | Alt 2 wins big | Alt 3 wins big |
chicago | 12 | 5 | 0 | 0 |
pl | 4 | 13 | 0 | 0 |
loome1 | 11 | 6 | 5 | 1 |
loome2 | 3 | 14 | 1 | 1 |
loome3 | 3 | 14 | 1 | 0 |
i2pc4 | 4 | 13 | 0 | 0 |
As with unrolling, two of the versions are competitive, although in this case neither is a big winner very often. Version 2 wins big more often, but almost all of those wins are on loome1, the most "volatile" of our machines. Overall, version 3 beats version 2 65 times out of 102, so we declare version 3 the winner. This is the version we will use in further experiments.
We did not code different versions of OSKI, but we did compare our version with the OSKI code in the library from Berkeley. We just wanted to make sure that we were not inadvertently coding this method in a way that would put it at a disadvantage.
For the record, here is our code for OSKI with 2x2 blocking (row iterates over block rows, so its range is n/2; this code is for fidap005, with n = 27):
for (row = 0; row < 13; row++) {
    nz = rows[row];
    while (nz > 0) {
        r = row * 2;
        c = *col++ * 2;
        ww = v[c]*(*mvals++);
        ww += v[c+1]*(*mvals++);
        w[r] += ww;
        r = r+1;
        ww = v[c]*(*mvals++);
        ww += v[c+1]*(*mvals++);
        w[r] += ww;
        nz--;
    }
}
We ran the original OSKI code on two machines, loome1 and loome2.
Original OSKI has some one-time overhead from two sources: the time to examine the matrix and determine the best block size; and, if the chosen block size is anything other than 1x1, the time to restructure the data. In our case, this work is done during generation, and therefore not counted toward run time. To neutralize this advantage, we ran original OSKI for a single iteration and then for 10,001 iterations, and subtracted the former from the latter to obtain the timings. We ran our OSKI with block sizes 1x1, 1x2, 2x1, and 2x2, and chose the best. We do not know what block size original OSKI chose; all we know is that it chose whatever it calculated would be best. Finally, one last difference: we compiled original OSKI with gcc, and our code, as usual, with clang.
The results are as follows: On loome1, our times were better on every matrix; on loome2, our times were better on 13 out of 17 matrices.
We think it is reasonable to conclude that our coding of OSKI is fair.
(Actually, we ran original OSKI also on loome3 and i2pc4, but since we did not install it on those machines, but simply used the version installed on loome1, the results are not quite legitimate. OSKI uses install-time tuning to determine - in conjunction with properties of the matrix - which block size to use, so that a version installed on a different machine may not choose the best block size for all matrices. In any case, the comparative timings on loome3 and i2pc4 were similar to those on loome1 and loome2.)
Last updated on Mon Aug 20 15:02:08 CDT 2012