specialize.py is a Python script that generates a matrix-vector
multiplier for a given matrix using a given method.
To use it, first create a folder matrices
(which will be used to store the raw contents of matrices).
Invoke the script like this:
python specialize.py matrix method-specification
This first downloads the given matrix from the Matrix Market
into the
matrices
folder (if it is not yet there).
It then
creates a folder with a set of .c files that together define a
function
void multByM (double v[], double w[])where
v
is the input vector and
w
the output vector.
The name of the folder is derived from the method specification,
and the code is generated according to that specification.
The .c files include
one or more auxiliary function definitions
and a file of declarations.
The latter,
named generated_decls.c
, includes
mainly array declarations;
it also includes
declarations of n
(the size of the matrix,
which can only be square),
and a char array matrix
, giving the name of the matrix.
(For the unfolding specializer, all data is built into the
code, so the declarations file declares only
n
and matrix
.)
Method specifications are explained in detail below.
specialize.py runs only in Python 2.6 or higher. (Actually, I've only tested it in version 2.6.1.)
The folder in which specialize.py is run must contain a
folder called matrices
containing matrices in either
"mtx" or "mm" format, or both.
If a matrix is named on the command line and is not in matrices
,
specialize will attempt to fetch it from the Matrix Market.
"mtx" is the native Matrix Market format, but specialize.py prefers
the data in "mm" format.
If it needs to fetch the matrix,
it will store it in both formats, and if it happens to be in the
folder only in mtx format, it will convert it and store the mm format.
(Warning: Some matrices in MM are not full matrices but only
templates, that is, they give only the indices where non-zeros would go;
the input routine will, obviously, report an error on these cases.)
mtx and mm format are very similar. Both begin with a line containing "m n nz" (in our case, m always equals n, since we only handle square matrices), and the remaining lines have the form "r c v", where v = M[r,c]. The differences are: (1) mtx format uses indexing from 1, while mm uses indexing from zero. (2) mtx files can begin with comment lines, which start with %; mm files do not have these. (3) mtx files sometimes contain zero entries (so that the nz given on the first line is not actually the number of non-zeroes, but instead the number of entries given in the file; I do not know why zeroes are ever included, but they sometimes are; mm files contain only non-zero values. (4) mm files are sorted in row-major order.
The script is invoked as:
python specialize.py matrix method-specificationwhere
matrix
is the name of the matrix and
method-specification
is a specification describing how the
code for this matrix should be generated.
In the simplest case, it is just the name of a specializer, as in:
python specialize.py fidap005 stencilHowever, specifications are usually more complicated. Here is a specification that says to extract a band of width 21 and implement multiplication by that part of the matrix using stencil, and the rest using unfolding (with blocking factor of, in effect, infinity):
python specialize.py fidap031 "[split_by_band, 10, 10, [stencil], [unfolding, 100000, 100000]]"Here,
split_by_band
is a function that partitions a
matrix into two parts (in-band and out-of-band);
such a function is called a splitter.
stencil
and unfolding
are methods of
generating code, which are applied here to the two parts of
fidap031
;
these are called specializers.
Specifications are explained in detail below.
The output is a folder with a
set of C files (with .c
extension)
containing functions and declarations,
plus one non-C file containing data calculated during generation.
The name of the folder
is obtained from the matrix name and
the specification,
so that it uniquely identifies a particular specialization of a
particular matrix (and can be quite long).
The code files have names generated.c
(the file containing multByM
)
and generated_i.c
;
the declarations file is generated_decls.c
.
All need to be compiled; the result is a definition of
multByM(double *v, double *w)
and any auxiliary
functions and declarations needed by it.
The decls file includes definitions of variables n
and matrix
(giving the name of the matrix as a
character array).
These may be of use to clients.
The file of data is called info.csv
,
and contains data in csv form, which depends upon the
method specification.
Note that each matrix that appears (i.e. the original matrix and
any matrix produced by a splitter) has a distinct name;
splitters create names of the returned matrices by adding
_1
or _2
to the name of the matrix being split.
For each matrix, the first line of info.csv
is the filename,
and its n
(which in fact is the same for all matrices),
and its nz
(the number of non-zeros, which is different for each partition).
Furthermore, each row of data for the matrix begins with the matrix name.
After that, the data depend upon the specializer.
(Warning: this list is not complete; the info.csv
files are written to be self-explanatory, so they
should be consulted.)
(Technical note: At present, only specializers, and not splitters, can add info to a matrix; this doesn't seem to be a problem.)
The main idea of this script is that multiple specifications can be used on one matrix, by partitioning the array and assigning different specializers to the different partitions. The allows complete freedom in choosing secondary specializers, which was the initial goal of the script. The script includes several functions to generate code ("specializers"), and several functions to partition matrices ("splitters"). Splitters always divide matrices into two pieces (it is always possible that one of the pieces is empty); in the future, we may want to generalize this.
The command-line arguments to the script are, as already noted, a matrix name and a specification. Abstractly, a specification is a tree whose internal nodes are labelled by splitters (and their arguments) and whose leaf nodes are labelled by specializers (and their arguments). Concretely, the specification is a Python list of one of these two forms:
[specializer, ... arguments ... ]
, or[splitter, ... arguments ..., spec1, spec2]
, where
spec1
and spec2
are themselves specifications,
indicating how to generate code for each of the partitions produced
by the splitter.
[repeat, r, specification]
, or python specialize.py fidap005 stencil
python specialize.py fidap005 "[stencil]"
python specialize.py add32 "[unrolling, 5]"
python specialize.py fidap031 "[split_by_band, 10, 10, [stencil], [unfolding, 100000, 100000]]"
python specialize.py fidap031 "[split_by_block, 3, 4, [OSKI, 3, 4], [unfolding, 100000, 100000]]"
The following are the current specializers.
Most will reliably generated code for any square matrix;
OSKI
,
row-segment
,
and diagonal
can only be used after specific splitters,
to ensure the matrices have the required properties.
Working Paper 2 describes the methods in detail.
Working Paper 3, Low-level Coding Variants,
discusses how they are implemented;
in most cases, several versions of the specializer was written
and one chosen based on experiments.
stencil
- Use stencil code; no argumentsunfolding
- Fully unfold code ("per-row blocking").
Arguments give block size (b_r and b_c).
CSRbyNZ
- No arguments.
unrolling
- CSR code, unrolled; argument is unrolling
factor (1 for plain CSR).
OSKI
- Use OSKI, with block sizes (b_r and b_c) as arguments.
diagonal
-
Elements are stored by diagonals (filling with zeroes as necessary),
so that an entire diagonal can be multiplied by v and added to w
at once, allowing for vectorization.
Can be used only after applying split_by_band
(N.B. not
split_by_band2
) or
split_by_dense_band
.
none
-
Generate no code.
Obviously, this does not produce a correct multiplication routine.
It is included for experimental purposes, so that a specializer
can be timed on an appropriate partition without having to
handle the other partition.
For example, we can see exactly how much speed-up we get
from diagonal
on the elements it handles,
ignoring the time it would take to handle the other elements.
row_segment
- This specializer works only on a banded
matrix; its purpose is to allow for the compiler to create vectorized
code for the main loop. row_segment
can be applied only to
a matrix that has been split by split_by_band
or
split_by_band2
.
The idea is to treat the in-band values as
a rectangular matrix [..., [M[r,r-lo], M[r,r-lo+1], ..., M[r,r], ...
M[r,r+hi]], ...], and store those values as such, including zeroes.
(So, this will be an n x width rectangle, where width = hi+lo+1.)
The main loop can then iterate over rows, uniformly performing a
dot-product of exactly width elements with the corresponding elements
from the input vector; if written correctly, this should be easily
vectorizable by a compiler.
Note that, if we take a full band, starting from row zero, there are,
in effect, negative indices, representing phantom columns (-lo, -lo+1, ...),
which need to be handled somehow; there are also correspondingly columns
n, n+1, ..., n+hi, which do not exist.
To avoid having to deal with these, use
split_by_band2
, which includes only rows lo through n-hi-1,
thereby avoiding the phantom columns (at the cost of leaving more elements
to be handled by the secondary specializer).
(N.B. We are not currently using row_segment
in
our experiments.
Because it uses a dot product, it is not vectorizable on all
machines, and will certainly provide no benefit if it is not
vectorized;
and even when vectorized, the fill factor is likely to limit its
utility sharply.
In short, we do not believe it offers any benefits.)
Earlier we mentioned that a method specification can begin
with repeat
followed by a number (say, r).
The effect is to generate the code given by the specification,
and then duplicate it r
times in multByM
.
The result is a multiplication routine that is wrong;
in fact, every value in the output vector will be exactly
r times what it should be.
The purpose of this feature is to allow experimentation
on the effect of varying code sizes;
by duplicating the code for a method, we ensure that two
versions with different values of r will differ
only in the code size, and not in any other way
(e.g. memory access patterns).
Of course, when doing these experiments, one must remember
to divide the time by r.
The following are the current splitters. Each produces a pair of matrices [mat1, mat2] that partition the elements of the argument matrix; we describe only the contents of mat1; mat2 contains all the other elements.
split_by_count
- Argument is a number k.
mat1 contains the first k elements (where the elements are
given in row-major order.
split_by_band
- Arguments are the bottom and top of the band, lo and hi.
mat1 is the matrix containing the in-band elements, i.e. all elements
[r,c,v] where r-lo <= c <= r+hi.
split_by_band2
- Same arguments as split_by_band
, but mat1 contains only
values from rows that contain a complete band.
That is, if the arguments are lo and hi, then mat1 includes only values
that are in band and come from rows between lo and n-1-hi.
(The idea of this is that, when using the diagonal
specializer, split_by_band ends up creating "phantom" columns of
zeroes, so to code the main loop tightly, we need to create a larger
vector v'
and copy v
into it before doing
the multiplication, which may not be worth the effort.
split_by_band2
avoids the phantom columns and
therefore eliminates the copying operation.)
split_by_block
- Arguments are blocking factors (b_r and b_c).
If p_r is the largest number not greater than n that is divisible by b_r,
and p_c is the largest number not greater than n that is divisible by b_c,
then mat1 is all the elements [r, c, v] where r < p_r and c < p_c
(where we use < because indexing is from zero).
split_by_stencil
- First argument is integer k
This splitter lists the stencils in descending order of their "coverage" - the number
of elements in rows that have this stencil, or the stencil's length times
its popularity.
It then selects enough stencils so that the size of the code - which is
the sum of the sizes of the stencils (not counting their popularity) -
does not exceed k elements, and includes those rows in mat1.
The second argument is integer minpop, which is the minimum
popularity of a stencil for it to be included in the first partition;
the popularity of a stencil is the number of times the loop for that
stencil will be iterated, and it may not be efficient to iterate a loop
a small number of times;
this parameter allows those elements that would be handled by low-iteration
loops to instead be handled by a secondary specializer.
Note that this splitter always selects entire rows, so that mat1 and mat2
will contain empty rows.
split_by_rownz
-
Arguments are k and minpop.
This is similar to split_by_stencil
, the only difference being
that this splitter starts by listing the matrix in descending order of
the coverage of the nz counts; that is, for every m, the product of
m and the number of rows whose nz count is m.
Again, k refers to code size, which is related to nz count
but not popularity, and minpop is used so that elements that
would have been handled in low-iteration loops can instead be handled
by a secondary specializer.
split_by_dense_band
Argument includes the band (as in split_by_band
and a percentage (integer in [0..100]).
Matrix1 includes elements in those bands, as long the bands
are at least as dense as that percentage.
(split_by_band lo hi
is the same as
split_by_dense_band lo hi 0
, so the former is not really
needed.)
The splitters given above do not attempt to split the matrix intelligently. There is no reason that we could not write a splitter that would attempt to divide the matrix "optimally" for some pair of specializers.
The code contains a lot of comments. Here I'll just write about what specializers and splitters do, and how to write them. Naturally, the code itself is the best guide.
One "gotcha": If you write a new splitter or specializer, you
must add it to the function evalspec
; just follow
the obvious model there.
Matrices are represented as Python dictionaries. Initially - that is, when first read in from the '.mm' file - they have these entries:
name
- the name of the matrix
data
- the non-zero entries of the matrix,
as a list of triples n
- again, we only have square matrices
nz
- the non-zero count
When splitters are called, as explained below, they return
new matrices; those matrices have the same n
,
but different names and (usually) different data and nz counts.
After specializers are called, they add one or more (but usually all) of the following:
decls
- global variable definitions -
i.e. declarations with initializations -
such as the "mvalues" array.
Since a specializer can be used more than once in
compiling a matrix, there can be several of these
declarations, with slightly different names.
These declarations are in the form of abstract syntax.
(See explanation of specializers below.)
externs
- global variable declarations to
be added as externs in every code file.
These are in the form of a list of strings.
locals
- variable declarations to be
included in every function.
Local variables should not be declared in generated
code, but should only be initialized; when writing
these declarations into functions, repeated declarations
are ignored, so only one declaration of each local
is emitted.
These are in the form of a list of strings.
code
- The actual code for this matrix,
in the form of abstract syntax.
info
- a list of lists, each containing
simple data (strings, ints, floats), which provide
information about the array.
These are emitted into file info.csv
.
stencils
- stencils of this matrix.
This is not used by the top level (i.e. has no effect
on the generated files), but is used internally;
see the next section.
rowsbynz
- similarly.
A splitter takes a matrix, and possibly some additional arguments, and returns a pair of matrices which together represent a partition of the original matrix (i.e. each element in the original matrix occurs in exactly one of the returned matrices). The two matrices have different names, and usually different non-zero counts (although it is legal for a partition to return a pair with the original matrix and an empty matrix), but the same n.
Aside from producing a partition, the only rule about
splitters is that they must have names beginning
split_
.
Two auxiliary functions useful for writing splitters are:
split_matrix_by_pred(matrix, pred)
: given a predicate
on elements (triples split_by_band
:
def split_by_band (matrix, lo, hi): return split_matrix_by_pred(matrix, lambda e: e[0]-lo <= e[1] and e[1] <= e[0]+hi)
split_matrix_by_rows(matrix, rows)
: given a
list of row numbers, split the matrix into all the elements
in those rows and all the elements in the other rows.
The rows list must be in ascending order.
In principle, this is the same as split_matrix_by_pred(matrix,
lambda r: r[0] in rows), but is more efficient.
Also, if you want to write a splitter that is based on the stencils
or row-non-zero counts of the matrix, you can call
groupRowsByStencil(matrix) or groupRowsByNZ(matrix).
Both return the original matrix, with an additional field
('stencils'
or 'rowsbynz'
);
the format of the those fields is explained in the code.
Specializers take a matrix and possibly other arguments, add fields to the matrix giving the generated code, and return the modified matrix. At the top level, the fields of the various matrices (if the original was split) are combined and output into the files described above in the section "Generated files".
The fields that can be added are listed above.
We elaborate a bit on each here.
Two points to keep in mind are that (a) generated code may be
divided up into functions and files (this is not done by the
specializers, but in a separate pass), and (b) it is possible that
the same specializer may be used twice in a specialization tree,
so care must be taken to avoid name collisions.
A global variable in the script, called stage
is
incremented for every specializer, and can be used to make names
unique.
decls
- The global variable definitions -
mostly array declarations with initializers - that are
needed by your specializer.
All are placed into file declared_decls.c
.
Because of problem (b) above, you should use stage
to make these names unique (e.g. define rowsnm
to be 'rows'+str(stage)
, and then use rowsnm
where you would otherwise use 'rows/
).
These definitions are given in abstract syntax, explained below.
(Note that the code emitter also adds definitions of n
and matrix
(giving the matrix name) to the decls file,
so you should not add those.)
externs
- This one is simple. If you have
created globals, you should add their declarations as a list of strings.
These will be added to the start of each generated code file.
locals
- these are variable declarations to be
included in every function.
In general, it is not necessary to use unique names for these.
Here are the rules: (1) Do not declare these in your
code, but do initialize them; (2) Add their declarations
to this field as strings.
Because of (b), it is possible that the same local variable declaration
will appear in different matrices, but the code emitter will only
output one of the declarations. (This is based on simple string matching,
so make sure that if you use a local variable with the same name as in
another specializer, the declarations match exactly.)
code
- The actual code for this matrix,
in the form of abstract syntax, explained below.
info
- a list of lists, each containing
simple data (strings, ints, floats), which provide
information about the array.
These are emitted into file info.csv
.
You should add these by calling add_info
, i.e.
add_info(matrix, list-of-data)This will add the info to the info already added. The list can be a list of "flat" lists, representing rows in a csv file, or just a flat list, representing a single row. (When the top level writes these to
info.csv
, it
will add the name of the current matrix to each row, so that
info about the matrices at each stage can be distinguished.)
Specializer create code using a simple form of abstract syntax.
The purpose of this is to make it easier to divide the code into
functions and files.
The abstract syntax operators are give in the script (search for
AST operations
).
Many are used rarely or not at all.
Here are the important ones:
Stmt
- Just include a statement as a string.
This should be used only for simple statements, not sequences.
The second argument is the number of operations in that statement;
I have used this as an approximate count of the floating-point operations
only; it is not critical that this be exact, but it should be greater than
zero if the statement does any floating-point operations.
The purpose of this is to give a rough count of the size of the code, so
that the top level can split it into functions and files appropriately.
Exp
- Similar to Stmt
, but for expressions.
Arraydecl
and Vardecl
- just what the name
implies.
The decls
field of a matrix will normally consists of a
sequence of these.
Seq
and Block
- These both take a list of
statements.
The difference is important: statements in a sequence can be split
across functions, but statements in a block should be kept together.
N.B. blocks should not be nested. (This is actually due to
a simple limitation in the implementation, but it's priority is low
enough that I probably won't fix it.)
w
is initialized to zeros,
and all multiplication routines operate by
adding to the elements of w
;
that is, every assignment generated by a specializer
has the form w[i] += ...
.
The problem is that this may mean using
"+=" more than necessary,
since obviously we don't need to use "+=" the
first time we assign to
w[i]
.
Using "+=" means doing one more fetch and add than
would be needed if we just used "=".
For example, if an entire matrix is treated using unfolding, with
infinite block size, then every row gets assigned to exactly once,
so we can write "=" instead of "+=";
on the other hand, if we use unrolling (i.e. standard CSR),
the "+=" is unavoidable.
The question is whether we can eliminate "+=" when it is
actually unnecessary.
This seems to be somewhat complicated, but it might improve performance,
albeit only slightly.