Documentation for specialize.py

Documentation structure

User-level documentation
Technical documentation (brief)
To-do list

User manual

Overview

specialize.py is a Python script that generates a matrix-vector multiplier for a given matrix using a given method. To use it, first create a folder matrices (which will be used to store the raw contents of matrices). Invoke the script like this:

   python specialize.py matrix method-specification

This first downloads the given matrix from the Matrix Market into the matrices folder (if it is not yet there). It then creates a folder with a set of .c files that together define a function

   void multByM (double v[], double w[])

where v is the input vector and w the output vector.

The name of the folder is derived from the method specification, and the code is generated according to that specification. The .c files include one or more auxiliary function definitions and a file of declarations. The latter, named generated_decls.c, includes mainly array declarations; it also includes declarations of n (the size of the matrix, which can only be square), and a char array matrix, giving the name of the matrix. (For the unfolding specializer, all data is built into the code, so the declarations file declares only n and matrix.)

Method specifications are explained in detail below.

specialize.py runs only in Python 2.6 or higher. (Actually, I've only tested it in version 2.6.1.)

Inputs and outputs

Matrices

The folder in which specialize.py is run must contain a folder called matrices containing matrices in either "mtx" or "mm" format, or both. If a matrix is named on the command line and is not in matrices, specialize will attempt to fetch it from the Matrix Market. "mtx" is the native Matrix Market format, but specialize.py prefers the data in "mm" format. If it needs to fetch the matrix, it will store it in both formats, and if it happens to be in the folder only in mtx format, it will convert it and store the mm format. (Warning: Some matrices in MM are not full matrices but only templates, that is, they give only the indices where non-zeros would go; the input routine will, obviously, report an error on these cases.)

mtx and mm format are very similar. Both begin with a line containing "m n nz" (in our case, m always equals n, since we only handle square matrices), and the remaining lines have the form "r c v", where v = M[r,c]. The differences are: (1) mtx format uses indexing from 1, while mm uses indexing from zero. (2) mtx files can begin with comment lines, which start with %; mm files do not have these. (3) mtx files sometimes contain zero entries (so that the nz given on the first line is not actually the number of non-zeroes, but instead the number of entries given in the file; I do not know why zeroes are ever included, but they sometimes are; mm files contain only non-zero values. (4) mm files are sorted in row-major order.

Command line arguments

The script is invoked as:

    python specialize.py matrix method-specification

where matrix is the name of the matrix and method-specification is a specification describing how the code for this matrix should be generated. In the simplest case, it is just the name of a specializer, as in:

    python specialize.py fidap005 stencil

However, specifications are usually more complicated. Here is a specification that says to extract a band of width 21 and implement multiplication by that part of the matrix using stencil, and the rest using unfolding (with blocking factor of, in effect, infinity):

    python specialize.py fidap031 "[split_by_band, 10, 10, [stencil], [unfolding, 100000, 100000]]"

Here, split_by_band is a function that partitions a matrix into two parts (in-band and out-of-band); such a function is called a splitter. stencil and unfolding are methods of generating code, which are applied here to the two parts of fidap031; these are called specializers.

Specifications are explained in detail below.

Generated files

The output is a folder with a set of C files (with .c extension) containing functions and declarations, plus one non-C file containing data calculated during generation. The name of the folder is obtained from the matrix name and the specification, so that it uniquely identifies a particular specialization of a particular matrix (and can be quite long). The code files have names generated.c (the file containing multByM) and generated_i.c; the declarations file is generated_decls.c. All need to be compiled; the result is a definition of multByM(double *v, double *w) and any auxiliary functions and declarations needed by it.

The decls file includes definitions of variables n and matrix (giving the name of the matrix as a character array). These may be of use to clients.

The file of data is called info.csv, and contains data in csv form, which depends upon the method specification. Note that each matrix that appears (i.e. the original matrix and any matrix produced by a splitter) has a distinct name; splitters create names of the returned matrices by adding _1 or _2 to the name of the matrix being split. For each matrix, the first line of info.csv is the filename, and its n (which in fact is the same for all matrices), and its nz (the number of non-zeros, which is different for each partition). Furthermore, each row of data for the matrix begins with the matrix name. After that, the data depend upon the specializer. (Warning: this list is not complete; the info.csv files are written to be self-explanatory, so they should be consulted.)

Unfolding: This one does not generate any additional info.

Unrolling: Lines give the unrolling factor (as given on the command line), the number of elements of the matrix that are handled in the blocked loop, and the number handled in the 'overflow" loop; and the numbers of rows that fit perfectly into the blocked loop, the number of rows that don't, and the number of empty rows.

CSRbyNZ: For each row non-zero count, the number of rows that have that non-zero count.

Stencil: For each stencil, the number of rows that have that stencil, the length of the stencil, and the stencil itself.

(Technical note: At present, only specializers, and not splitters, can add info to a matrix; this doesn't seem to be a problem.)

Specialization specifications

The main idea of this script is that multiple specifications can be used on one matrix, by partitioning the array and assigning different specializers to the different partitions. The allows complete freedom in choosing secondary specializers, which was the initial goal of the script. The script includes several functions to generate code ("specializers"), and several functions to partition matrices ("splitters"). Splitters always divide matrices into two pieces (it is always possible that one of the pieces is empty); in the future, we may want to generalize this.

The command-line arguments to the script are, as already noted, a matrix name and a specification. Abstractly, a specification is a tree whose internal nodes are labelled by splitters (and their arguments) and whose leaf nodes are labelled by specializers (and their arguments). Concretely, the specification is a Python list of one of these two forms:

[specializer, ... arguments ... ], or
[splitter, ... arguments ..., spec1, spec2], where spec1 and spec2 are themselves specifications, indicating how to generate code for each of the partitions produced by the splitter.
[repeat, r, specification], or

If the only argument is a specializer (no splitter), and it has no arguments, the brackets can be omitted. (The third case is special; it is explained below, in the section on specializers.) We have seen two examples above. Here are those and two more:

python specialize.py fidap005 stencil: - use stencil for fidap005
python specialize.py fidap005 "[stencil]": - same as the previous one
python specialize.py add32 "[unrolling, 5]": - use CSR with an unrolling factor of 5
python specialize.py fidap031 "[split_by_band, 10, 10, [stencil], [unfolding, 100000, 100000]]": - banded stencil, with band [10, 10], and unfolding for the out-of-band elements
python specialize.py fidap031 "[split_by_block, 3, 4, [OSKI, 3, 4], [unfolding, 100000, 100000]]": - extract the part of the matrix that divides evenly into 3-by-4 blocks, and use OSKI for that part; treat the remaining few elements by unfolding.

Note that the specification has to be quoted (except in the rare cases when it is a single word).

Specializers and splitters

The following are the current specializers. Most will reliably generated code for any square matrix; OSKI, row-segment, and diagonal can only be used after specific splitters, to ensure the matrices have the required properties. Working Paper 2 describes the methods in detail. Working Paper 3, Low-level Coding Variants, discusses how they are implemented; in most cases, several versions of the specializer was written and one chosen based on experiments.

stencil - Use stencil code; no arguments
unfolding - Fully unfold code ("per-row blocking"). Arguments give block size (b_r and b_c).
CSRbyNZ - No arguments.
unrolling - CSR code, unrolled; argument is unrolling factor (1 for plain CSR).
OSKI - Use OSKI, with block sizes (b_r and b_c) as arguments.
diagonal - Elements are stored by diagonals (filling with zeroes as necessary), so that an entire diagonal can be multiplied by v and added to w at once, allowing for vectorization. Can be used only after applying split_by_band (N.B. not split_by_band2) or split_by_dense_band.
none - Generate no code. Obviously, this does not produce a correct multiplication routine. It is included for experimental purposes, so that a specializer can be timed on an appropriate partition without having to handle the other partition. For example, we can see exactly how much speed-up we get from diagonal on the elements it handles, ignoring the time it would take to handle the other elements.
row_segment - This specializer works only on a banded matrix; its purpose is to allow for the compiler to create vectorized code for the main loop. row_segment can be applied only to a matrix that has been split by split_by_band or split_by_band2. The idea is to treat the in-band values as a rectangular matrix [..., [M[r,r-lo], M[r,r-lo+1], ..., M[r,r], ... M[r,r+hi]], ...], and store those values as such, including zeroes. (So, this will be an n x width rectangle, where width = hi+lo+1.) The main loop can then iterate over rows, uniformly performing a dot-product of exactly width elements with the corresponding elements from the input vector; if written correctly, this should be easily vectorizable by a compiler. Note that, if we take a full band, starting from row zero, there are, in effect, negative indices, representing phantom columns (-lo, -lo+1, ...), which need to be handled somehow; there are also correspondingly columns n, n+1, ..., n+hi, which do not exist. To avoid having to deal with these, use split_by_band2, which includes only rows lo through n-hi-1, thereby avoiding the phantom columns (at the cost of leaving more elements to be handled by the secondary specializer). (N.B. We are not currently using row_segment in our experiments. Because it uses a dot product, it is not vectorizable on all machines, and will certainly provide no benefit if it is not vectorized; and even when vectorized, the fill factor is likely to limit its utility sharply. In short, we do not believe it offers any benefits.)

Earlier we mentioned that a method specification can begin with repeat followed by a number (say, r). The effect is to generate the code given by the specification, and then duplicate it r times in multByM. The result is a multiplication routine that is wrong; in fact, every value in the output vector will be exactly r times what it should be. The purpose of this feature is to allow experimentation on the effect of varying code sizes; by duplicating the code for a method, we ensure that two versions with different values of r will differ only in the code size, and not in any other way (e.g. memory access patterns). Of course, when doing these experiments, one must remember to divide the time by r.

The following are the current splitters. Each produces a pair of matrices [mat1, mat2] that partition the elements of the argument matrix; we describe only the contents of mat1; mat2 contains all the other elements.

split_by_count - Argument is a number k. mat1 contains the first k elements (where the elements are given in row-major order.
split_by_band - Arguments are the bottom and top of the band, lo and hi. mat1 is the matrix containing the in-band elements, i.e. all elements [r,c,v] where r-lo <= c <= r+hi.
split_by_band2 - Same arguments as split_by_band, but mat1 contains only values from rows that contain a complete band. That is, if the arguments are lo and hi, then mat1 includes only values that are in band and come from rows between lo and n-1-hi. (The idea of this is that, when using the diagonal specializer, split_by_band ends up creating "phantom" columns of zeroes, so to code the main loop tightly, we need to create a larger vector v' and copy v into it before doing the multiplication, which may not be worth the effort.
split_by_band2 avoids the phantom columns and therefore eliminates the copying operation.)
split_by_block - Arguments are blocking factors (b_r and b_c). If p_r is the largest number not greater than n that is divisible by b_r, and p_c is the largest number not greater than n that is divisible by b_c, then mat1 is all the elements [r, c, v] where r < p_r and c < p_c (where we use < because indexing is from zero).
split_by_stencil - First argument is integer k This splitter lists the stencils in descending order of their "coverage" - the number of elements in rows that have this stencil, or the stencil's length times its popularity. It then selects enough stencils so that the size of the code - which is the sum of the sizes of the stencils (not counting their popularity) - does not exceed k elements, and includes those rows in mat1. The second argument is integer minpop, which is the minimum popularity of a stencil for it to be included in the first partition; the popularity of a stencil is the number of times the loop for that stencil will be iterated, and it may not be efficient to iterate a loop a small number of times; this parameter allows those elements that would be handled by low-iteration loops to instead be handled by a secondary specializer. Note that this splitter always selects entire rows, so that mat1 and mat2 will contain empty rows.
split_by_rownz - Arguments are k and minpop. This is similar to split_by_stencil, the only difference being that this splitter starts by listing the matrix in descending order of the coverage of the nz counts; that is, for every m, the product of m and the number of rows whose nz count is m. Again, k refers to code size, which is related to nz count but not popularity, and minpop is used so that elements that would have been handled in low-iteration loops can instead be handled by a secondary specializer.
split_by_dense_band Argument includes the band (as in split_by_band and a percentage (integer in [0..100]). Matrix1 includes elements in those bands, as long the bands are at least as dense as that percentage. (split_by_band lo hi is the same as split_by_dense_band lo hi 0, so the former is not really needed.)

The splitters given above do not attempt to split the matrix intelligently. There is no reason that we could not write a splitter that would attempt to divide the matrix "optimally" for some pair of specializers.

Technical documentation

The code contains a lot of comments. Here I'll just write about what specializers and splitters do, and how to write them. Naturally, the code itself is the best guide.

One "gotcha": If you write a new splitter or specializer, you must add it to the function evalspec; just follow the obvious model there.

Matrices

Matrices are represented as Python dictionaries. Initially - that is, when first read in from the '.mm' file - they have these entries:

name - the name of the matrix
data - the non-zero entries of the matrix, as a list of triples , in row-major order.
n - again, we only have square matrices
nz - the non-zero count

When splitters are called, as explained below, they return new matrices; those matrices have the same n, but different names and (usually) different data and nz counts.

After specializers are called, they add one or more (but usually all) of the following:

decls - global variable definitions - i.e. declarations with initializations - such as the "mvalues" array. Since a specializer can be used more than once in compiling a matrix, there can be several of these declarations, with slightly different names. These declarations are in the form of abstract syntax. (See explanation of specializers below.)
externs - global variable declarations to be added as externs in every code file. These are in the form of a list of strings.
locals - variable declarations to be included in every function. Local variables should not be declared in generated code, but should only be initialized; when writing these declarations into functions, repeated declarations are ignored, so only one declaration of each local is emitted. These are in the form of a list of strings.
code - The actual code for this matrix, in the form of abstract syntax.
info - a list of lists, each containing simple data (strings, ints, floats), which provide information about the array. These are emitted into file info.csv.
stencils - stencils of this matrix. This is not used by the top level (i.e. has no effect on the generated files), but is used internally; see the next section.
rowsbynz - similarly.

Splitters

A splitter takes a matrix, and possibly some additional arguments, and returns a pair of matrices which together represent a partition of the original matrix (i.e. each element in the original matrix occurs in exactly one of the returned matrices). The two matrices have different names, and usually different non-zero counts (although it is legal for a partition to return a pair with the original matrix and an empty matrix), but the same n.

Aside from producing a partition, the only rule about splitters is that they must have names beginning split_.

Two auxiliary functions useful for writing splitters are:

split_matrix_by_pred(matrix, pred): given a predicate on elements (triples
), split the matrix into the elements that satisfy the predicate and those that don't. For example, here is the complete definition of split_by_band:
```
      def split_by_band (matrix, lo, hi):
         return split_matrix_by_pred(matrix,
                  lambda e: e[0]-lo <= e[1] and e[1] <= e[0]+hi)
```
split_matrix_by_rows(matrix, rows): given a list of row numbers, split the matrix into all the elements in those rows and all the elements in the other rows. The rows list must be in ascending order. In principle, this is the same as split_matrix_by_pred(matrix, lambda r: r[0] in rows), but is more efficient.

Also, if you want to write a splitter that is based on the stencils or row-non-zero counts of the matrix, you can call groupRowsByStencil(matrix) or groupRowsByNZ(matrix). Both return the original matrix, with an additional field ('stencils' or 'rowsbynz'); the format of the those fields is explained in the code.

Specializers

Specializers take a matrix and possibly other arguments, add fields to the matrix giving the generated code, and return the modified matrix. At the top level, the fields of the various matrices (if the original was split) are combined and output into the files described above in the section "Generated files".

The fields that can be added are listed above. We elaborate a bit on each here. Two points to keep in mind are that (a) generated code may be divided up into functions and files (this is not done by the specializers, but in a separate pass), and (b) it is possible that the same specializer may be used twice in a specialization tree, so care must be taken to avoid name collisions. A global variable in the script, called stage is incremented for every specializer, and can be used to make names unique.

decls - The global variable definitions - mostly array declarations with initializers - that are needed by your specializer. All are placed into file declared_decls.c. Because of problem (b) above, you should use stage to make these names unique (e.g. define rowsnm to be 'rows'+str(stage), and then use rowsnm where you would otherwise use 'rows/). These definitions are given in abstract syntax, explained below. (Note that the code emitter also adds definitions of n and matrix (giving the matrix name) to the decls file, so you should not add those.)
externs - This one is simple. If you have created globals, you should add their declarations as a list of strings. These will be added to the start of each generated code file.
locals - these are variable declarations to be included in every function. In general, it is not necessary to use unique names for these. Here are the rules: (1) Do not declare these in your code, but do initialize them; (2) Add their declarations to this field as strings. Because of (b), it is possible that the same local variable declaration will appear in different matrices, but the code emitter will only output one of the declarations. (This is based on simple string matching, so make sure that if you use a local variable with the same name as in another specializer, the declarations match exactly.)
code - The actual code for this matrix, in the form of abstract syntax, explained below.
info - a list of lists, each containing simple data (strings, ints, floats), which provide information about the array. These are emitted into file info.csv. You should add these by calling add_info, i.e.
```
        add_info(matrix, list-of-data)
```
This will add the info to the info already added. The list can be a list of "flat" lists, representing rows in a csv file, or just a flat list, representing a single row. (When the top level writes these to info.csv, it will add the name of the current matrix to each row, so that info about the matrices at each stage can be distinguished.)

Abstract syntax

Specializer create code using a simple form of abstract syntax. The purpose of this is to make it easier to divide the code into functions and files. The abstract syntax operators are give in the script (search for AST operations). Many are used rarely or not at all. Here are the important ones:

Stmt - Just include a statement as a string. This should be used only for simple statements, not sequences. The second argument is the number of operations in that statement; I have used this as an approximate count of the floating-point operations only; it is not critical that this be exact, but it should be greater than zero if the statement does any floating-point operations. The purpose of this is to give a rough count of the size of the code, so that the top level can split it into functions and files appropriately.
Exp - Similar to Stmt, but for expressions.
Arraydecl and Vardecl - just what the name implies. The decls field of a matrix will normally consists of a sequence of these.
Seq and Block - These both take a list of statements. The difference is important: statements in a sequence can be split across functions, but statements in a block should be kept together. N.B. blocks should not be nested. (This is actually due to a simple limitation in the implementation, but it's priority is low enough that I probably won't fix it.)

To-do list

Deal with "first +=" problem.: To allow for multiple methods to be applied (using splitters), we follow this convention: the output vector w is initialized to zeros, and all multiplication routines operate by adding to the elements of w; that is, every assignment generated by a specializer has the form w[i] += .... The problem is that this may mean using "+=" more than necessary, since obviously we don't need to use "+=" the first time we assign to w[i]. Using "+=" means doing one more fetch and add than would be needed if we just used "=". For example, if an entire matrix is treated using unfolding, with infinite block size, then every row gets assigned to exactly once, so we can write "=" instead of "+="; on the other hand, if we use unrolling (i.e. standard CSR), the "+=" is unavoidable. The question is whether we can eliminate "+=" when it is actually unnecessary. This seems to be somewhat complicated, but it might improve performance, albeit only slightly.
Not sure how to filter for zeros in mtx files.: Currently just testing "float(input)<>0.0"; seems to work.