Timings and experimental set-up



All timings in this study were made on a collection of six machines, using clang as the compiler, and testing on a set of 17 matrices chosen from the Matrix Market. Times for matrix multiplication are obtained by performing 10,000 multiplications, repeating that measurement five times, and selecting the minimum of the five times. The machines and matrices are described on this page, and our decision to use minimum times (rather than the median) is discussed. The files containing timings are at the bottom of this page.
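In outline, the measurement procedure looks like the following sketch (Python for illustration; the generated multiplication routines are actually C code compiled with clang, and mult_by_m here stands for a call into such a routine):

```python
import time

def representative_time(mult_by_m, repeats=5, calls=10000):
    """Time `calls` invocations of the multiplication routine,
    repeat that measurement `repeats` times, and report the minimum."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(calls):
            mult_by_m()  # one matrix multiplication (e.g., a ctypes call into generated C)
        times.append(time.perf_counter() - start)
    return min(times)  # the minimum of the five measurements is what we report
```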

Machines

Six computers were used to produce timings for each of the “Specializers” developed in this study. The main characteristics of these computers are described below. The information was obtained from each computer's /proc/cpuinfo and /proc/meminfo files. (Note that the cpu MHz field reports the clock rate at the time of reading; on machines with dynamic frequency scaling it can be well below the processor's nominal frequency, as for loome2 and loome3.)
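For reference, the fields shown in the table can be extracted with a few lines of Python (a sketch, Linux only; the helper name is ours):

```python
def proc_field(path, key):
    """Return the value of the first line in `path` that starts with `key`."""
    with open(path) as f:
        for line in f:
            if line.startswith(key):
                return line.split(":", 1)[1].strip()

print("Processor: ", proc_field("/proc/cpuinfo", "model name"))
print("cpu MHz:   ", proc_field("/proc/cpuinfo", "cpu MHz"))
print("cache size:", proc_field("/proc/cpuinfo", "cache size"))
print("MemTotal:  ", proc_field("/proc/meminfo", "MemTotal"))
```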

Computer | Processor                                  | cpu (MHz) | CPU cache size (KB) | MemTotal (KB) | OS
---------|--------------------------------------------|-----------|---------------------|---------------|-----------------------------
loome1   | Intel(R) Xeon(R) CPU E5640 @ 2.67GHz       | 2660.081  | 12288               | 12291944      | Linux CentOS release 5.8
loome2   | Intel(R) Core(TM) i7 CPU 880 @ 3.07GHz     | 1200.000  | 8192                | 8132912       | Linux CentOS release 5.8
loome3   | Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz    | 1600.000  | 6144                | 8063880       | Linux CentOS release 5.8
i2pc4    | Intel(R) Xeon(R) CPU L7555 @ 1.87GHz       | 1861.989  | 24576               | 65955912      | Scientific Linux release 6.3
chicago  | Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz | 2400.000  | 3072                | 2097152       | Mac OS X Lion 10.7.4
pl       | Intel(R) Xeon(R) CPU E5405 @ 2.00GHz       | 1994.768  | 6144                | 2057296       | Ubuntu Linux 10.04 LTS

Matrices

A set of 17 matrices was selected for testing the specializers. The matrices were obtained from the Matrix Market site (http://math.nist.gov/MatrixMarket/), and their main characteristics are summarized below. All information and images are from the Matrix Market, except for the non-zero counts: the Matrix Market representation (.mtx) sometimes includes explicit zeros (we don't know why), but we report only actual non-zeros.
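As an illustration of that counting, here is one way to count actual non-zeros in an .mtx file, using SciPy (a sketch; the file name is hypothetical):

```python
from scipy.io import mmread

# mmread parses the Matrix Market format; explicit zeros stored in the
# file appear as entries whose value is 0, so we filter them out.
m = mmread("add32.mtx").tocoo()        # hypothetical local copy of the matrix
actual_nnz = int((m.data != 0).sum())  # count only entries with non-zero values
print(actual_nnz)
```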

Name     | Size        | Non-zeros
---------|-------------|----------
add32    | 4960x4960   | 19848
cavity02 | 317x317     | 5923
cavity23 | 4562x4562   | 131735
fidap001 | 216x216     | 4339
fidap005 | 27x27       | 279
fidap031 | 3909x3909   | 91165
fidap037 | 3565x3565   | 67591
memplus  | 17758x17758 | 99147
mhd3200a | 3200x3200   | 68026
mhd4800a | 4800x4800   | 102252
nnc666   | 666x666     | 4032
orsreg_1 | 2205x2205   | 14133
pde900   | 900x900     | 4380
rdb2048  | 2048x2048   | 12032
saylr4   | 3564x3564   | 22316
sherman5 | 3312x3312   | 20793
utm5940  | 5940x5940   | 83842

(The structure plot for each matrix, available on the Matrix Market site, is omitted here.)

Selecting a Representative Time for the Specializers - Minimum vs. Median

Specializers are compared based on the time they take to perform multiplication with a given matrix. It is important to report the right time for a given specializer, which is not as simple as it sounds: as is well known, timings for the exact same computation, on the same machine, with the same inputs, can vary "randomly." Our approach was to run each generated program five times (10,000 multiplications each time) and select the most representative time. But which of the five times should be considered the most representative? Two answers were considered: the minimum time and the median time. We use the minimum, for reasons we now explain.

Although the median is more often used in statistical reports, we would argue that in this case, as a matter of principle, the minimum time is more meaningful. A computer can perform a given computation in some minimal amount of time; if a run takes longer, it is because some event not associated with the computation slowed it down. Our goal is to get the "cleanest" timing we can - the one contaminated by the fewest external events - and that is the minimum.

But there is another issue, the one that explains why the median is used more often in other kinds of statistics: reproducibility, or predictability. In many situations, the extremes of a set of samples are considered the most likely to be anomalous, the result of some random and unpredictable event. Thus it is common to discount the highest and lowest measurements; taking the median is one way to do that. The argument of the previous paragraph, however, suggests that this may not hold for the kind of measurements we are taking. The question here is a practical one: which approach gives us the "smoothest" results, i.e., the results least likely to vary wildly and unpredictably?

Concretely, the question is this: if we perform our experiment in the way outlined above - five runs of 10,000 multiplications - on the same code multiple times, which statistic is more consistent across those trials: the minimum or the median?

To settle this question, seven specializers and three matrices were randomly selected, and a program was generated for each specializer-matrix combination. Each generated program was used to produce twenty sets of five timing results; that is, we ran our usual measurement procedure twenty times. Then the coefficient of variation of the 20 minimum times was compared with the coefficient of variation of the 20 median times. The tests were conducted on four computers: loome1, loome2, loome3, and i2pc4.

The coefficient of variation is a normalized measure of the dispersion of a distribution, defined as the ratio of the standard deviation to the mean of the sample: CV = σ/μ.
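In code, the comparison for one specializer-matrix-computer combination amounts to the following (a sketch; `trials` would hold the twenty sets of five timings):

```python
import statistics

def coeff_of_variation(xs):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

def compare_min_vs_median(trials):
    """Given twenty runs of five timings each, compare the stability
    of the per-run minimum against the per-run median."""
    mins = [min(run) for run in trials]
    medians = [statistics.median(run) for run in trials]
    return coeff_of_variation(mins), coeff_of_variation(medians)
```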

Table 1 shows the coefficient of variation of minimum and median time values for each of the selected specializers and matrices over the four computers used for the tests.

Analyzing the results shown in Table 1, we make the following observations:

  • On all machines except loome1, the coefficients of variation are small: always less than 3%, almost always less than 1%. That is, the timings were quite uniform, so the choice of median or minimum would make little difference either way.

  • loome1 is the most unpredictable machine. On this machine, the minimum produced less variation among the twenty runs in every case, often dramatically less.

  • On loome2 and loome3, minimum produced less variation than median in 41 out of 42 trials.

  • Only on i2pc4 was the median competitive with the minimum, producing lower variation ten times (compared to eight for the minimum, with three ties). However, i2pc4 is, in general, our most predictable machine; the coefficients of variation of minimum and median times are so small - less than 1% in all cases - that the choice simply does not matter there.

The above discussion leads us to conclude that the minimum time obtained from a set of runs of a given test (specializer-matrix-computer) is the most appropriate time to report. Thus, all timings we give on our web pages and in papers (including the ones just below) are obtained by the method described in the first paragraph of this page.




CSV files

We will add CSV files as we do more experiments. For now, we have only the timings for the basic methods. (We have many, many more timings, of course, but mostly they don't seem very meaningful.) Each file is a series of lines of the form: machine, matrix, method, time (in microseconds for 10,000 calls to multByM).
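For example, a file can be loaded into a lookup table with a few lines of Python (a sketch; the file name is hypothetical):

```python
import csv

# Each line is: machine, matrix, method, time
# (time in microseconds for 10,000 calls to multByM).
timings = {}
with open("basic_methods.csv") as f:  # hypothetical file name
    for machine, matrix, method, time_us in csv.reader(f):
        timings[(machine.strip(), matrix.strip(), method.strip())] = float(time_us)
```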

• CSV file containing timings of all basic methods on all (six) machines. These are:
  • CSRbyNZ.
  • OSKI_1_2, 2_1, and 2_2. Technically, this is a mixed method: if the blocks don't fit exactly, there are leftover elements in the bottom row and/or rightmost column; these are handled by unfolding.
  • CSR_1, ..., CSR_10.
  • Unfolding. In terms of the specialize.py generator, this is "unfolding, 1000000, 1000000", i.e., unfolding with exactly one assignment per row.
  • stencil.
  • diagonal. Again, this is a mixed method. In terms of specialize.py, this is obtained with the argument [split_by_dense_band, 1000000, 1000000, 80, [diagonal], [unrolling, 2]]. That is, every diagonal that is at least 80% dense is handled by diagonal, and the rest are handled by CSR_2. Note that we do not have timings for loome1 for this method.
  • MKL. Again, we do not have timings for loome1 for this method.

Last updated on Mon Aug 13 10:22:31 CDT 2012