So Intel has MKL for the CPU; NVIDIA has cuBLAS and cuSPARSE for CUDA; and there are various CPU implementations of BLAS and LAPACK that serve as jumping-off points for developing applications that use linear algebra.
I'd like to build an application that uses C++ AMP to provide the linear algebra routines underlying my application's functionality. The impetus is the expected performance improvement over using the CPU alone (NVIDIA's documentation indicates that cuBLAS provides at least an order-of-magnitude improvement over MKL for a wide variety of operations on matrices of around 4K rows/columns), along with the hardware abstraction that AMP provides.
To do the same with AMP, I apparently need to first construct at least a subset of those underlying libraries myself. For me, this is a non-trivial exercise.
Even with a modest understanding of linear algebra, and limiting my research to matrix multiplication alone, it took me a full day to survey the current state of the art: the candidate algorithms, the underlying architecture, big-O performance expectations, relative numerical stability, and so on, before I had a reasonably good idea of which algorithms to implement and compare empirically.
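For concreteness, here is a minimal sketch of the kind of tiled AMP kernel I mean. The tile width of 16, the single-precision floats, and the assumption that the matrices are square with N a multiple of the tile size are all simplifications of mine, not a proposal for the library's actual interface:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

static const int TILE = 16;

// Sketch: C = A * B for N x N matrices, N assumed to be a multiple of TILE.
void matmul_tiled(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& c, int n)
{
    array_view<const float, 2> av(n, n, a);
    array_view<const float, 2> bv(n, n, b);
    array_view<float, 2> cv(n, n, c);
    cv.discard_data(); // no need to copy c's initial contents to the GPU

    parallel_for_each(cv.extent.tile<TILE, TILE>(),
        [=](tiled_index<TILE, TILE> idx) restrict(amp)
    {
        int row = idx.local[0];
        int col = idx.local[1];
        float sum = 0.0f;

        for (int i = 0; i < n; i += TILE)
        {
            // Stage one TILE x TILE block of each operand in fast
            // tile_static memory, then let the whole tile consume it.
            tile_static float at[TILE][TILE];
            tile_static float bt[TILE][TILE];
            at[row][col] = av(idx.global[0], col + i);
            bt[row][col] = bv(row + i, idx.global[1]);
            idx.barrier.wait();

            for (int k = 0; k < TILE; k++)
                sum += at[row][k] * bt[k][col];
            idx.barrier.wait();
        }
        cv[idx.global] = sum;
    });
    cv.synchronize();
}
```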
Add implementation, optimization, and testing for each of the necessary routines, and this becomes a substantial investment of time.
Is anyone else already working on this?
Is anyone aware of a synopsis of algorithms suitable for parallelization on a GPU? Of special interest is how large problems are partitioned, and how algorithms might be modified to accommodate that partitioning (for example, Morton ordering for improved memory locality).
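By way of illustration, computing a Morton (Z-order) index is just a matter of interleaving the bits of the row and column indices, so elements that are close in 2-D stay close in linear memory. This sketch assumes 16-bit indices and is purely illustrative:

```cpp
#include <cstdint>

// Interleave the bits of x and y: x's bits land in the even positions,
// y's in the odd positions, yielding the Morton index of element (x, y).
uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t mx = x, my = y;
    // Spread each 16-bit value across 32 bits, one gap bit per data bit.
    mx = (mx | (mx << 8)) & 0x00FF00FF;
    mx = (mx | (mx << 4)) & 0x0F0F0F0F;
    mx = (mx | (mx << 2)) & 0x33333333;
    mx = (mx | (mx << 1)) & 0x55555555;
    my = (my | (my << 8)) & 0x00FF00FF;
    my = (my | (my << 4)) & 0x0F0F0F0F;
    my = (my | (my << 2)) & 0x33333333;
    my = (my | (my << 1)) & 0x55555555;
    return mx | (my << 1);
}
```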
Any other suggestions on getting to the point where I have an AMP BLAS/LAPACK implementation?
Ken Miller