Hi,
I'm trying to learn amp, so I put together a simple (stupid) routine that will sort 4 arrays of floats in parallel.
As a test, I wrote the same code in three ways:
1. Straight up C++, no parallelism
2. Using parallel_for_each without amp
3. Using parallel_for_each restrict(amp)
In this case, the parallel version runs ~4 times faster than the non-parallel one (as would be expected, since I have a quad-core CPU), but the AMP version runs MUCH slower (several orders of magnitude).
I am running 64-bit Windows 7 with an AMD FirePro accelerator. I have run some of the AMP samples and have seen the expected performance improvements.
Can you look at my code and tell me what I am doing wrong? (BTW, I understand the sort implemented is "not optimal"; the intent was to learn how to use AMP.)
void SuperSortAMP(float *Array1, int len1, float *Array2, int len2, float *Array3, int len3, float *Array4, int len4)
{
    float *combArray = new float[len1*4];
    for(int i = 0; i<len1; i++)
    {
        combArray[i+0*len1] = Array1[i];
        combArray[i+1*len1] = Array2[i];
        combArray[i+2*len1] = Array3[i];
        combArray[i+3*len1] = Array4[i];
    }
    int lenArray[] = { len1, len2, len3, len4};
    array_view<float,1> a(4*len1, combArray);
    array_view<int,1> l(4, lenArray);
    parallel_for_each(l.extent, [=](index<1> idx) restrict(amp)
    {
        bool bSorted = false;
        while(!bSorted)
        {
            bSorted=true;
            for(int i = 1; i<l[idx]; i++)
            {
                int iCur = idx[0]*l[idx]+i;
                int iPrior = iCur-1;
                if(a[iCur]<a[iPrior])
                {
                    float f = a[iCur];
                    a[iCur]=a[iPrior];
                    a[iPrior] = f;
                    bSorted=false;
                }
            }
        }
    });
    a.synchronize();
    for(int i = 0; i<len1; i++)
    {
        Array1[i] = a[0*len1+i];
        Array2[i] = a[1*len1+i];
        Array3[i] = a[2*len1+i];
        Array4[i] = a[3*len1+i];
    }
}
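For comparison, here is a minimal sketch of what variant 1 (straight C++, no parallelism) could look like, assuming it uses the same bubble sort as the AMP kernel. The names BubbleSort and SuperSortSerial are illustrative only and are not taken from the original code.

```cpp
#include <utility>

// Bubble sort one array in place: repeat passes until a pass makes no swaps.
// This mirrors the loop inside the AMP kernel, applied to a single array.
void BubbleSort(float *arr, int len)
{
    bool bSorted = false;
    while (!bSorted)
    {
        bSorted = true;
        for (int i = 1; i < len; i++)
        {
            if (arr[i] < arr[i - 1])
            {
                std::swap(arr[i], arr[i - 1]);
                bSorted = false;
            }
        }
    }
}

// Hypothetical serial baseline: sort the four arrays one after another.
// Each array is sorted using its own length, so the lengths may differ.
void SuperSortSerial(float *Array1, int len1, float *Array2, int len2,
                     float *Array3, int len3, float *Array4, int len4)
{
    BubbleSort(Array1, len1);
    BubbleSort(Array2, len2);
    BubbleSort(Array3, len3);
    BubbleSort(Array4, len4);
}
```

Because each array is sorted in place, this version needs no combined buffer and no copy back, unlike the AMP version above.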