Channel: Parallel Computing in C++ and Native Code forum

Iterating on the GPU with C++ AMP


Hi, I'm going to code a GPU-based version of Dijkstra search, and since I have just started learning C++ AMP my knowledge is quite limited.

I guess there is a simple answer, but here goes:

				//Create GPU array views of the data
				array_view<const size_t, 1> V(n_nodes, v);  //All vertices (nodes)
				array_view<const size_t, 2> E(n_nodes, nn, e); //All edges
				array_view<const size_t, 2> W(n_nodes, nn, w); //Weights
				
				array_view<int, 1> M(n_nodes, m); //Mask array
				array_view<float, 1> C(n_nodes, c); //Cost array
				array_view<float, 1> U(n_nodes, u); //Updating cost array
				
				// Run the following code on the GPU
				// ---------------------------------
				//while M not Empty do (or while there are still some M[idx] == 1)
				// {
					parallel_for_each(V.extent, [V, E, W, M, C, U] (index<1> idx) restrict(amp)
					{
						if (M[idx] == 1) 
						{
							//Update cost in C and U
							//Update mask array M, setting several more M[newIdx] = 1; think of it as a wavefront propagating in all directions through the graph
						}
					});

					//Wait for all threads to finish updating M.
					// I do not want to copy back to CPU between iterations!
				// }
				// ------------- (end GPU code)

As you can see from the code above, what I want to accomplish is to iterate on the GPU, performing some sort of while loop (see the comments in the code). In each iteration I want to perform a parallel execution over all vertices (V) in my graph, then wait until all threads have finished updating the mask array (M), and then perform the next iteration.

So my question is: how can I perform the iterations on the GPU? I don't want to copy the mask array (M) between the GPU and CPU for every iteration; I just want to use the updated mask array on the GPU in the next iteration. How can I perform something like the while loop indicated in the code comments above?
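For reference, here is a rough sketch of the kind of loop I have in mind (untested; run_search and everything else outside the AMP types are placeholders): keep the while loop on the CPU, keep the graph data in concurrency::array so it stays on the accelerator, and copy back only a one-element "done" flag per iteration instead of the whole mask.

#include <amp.h>
#include <vector>
using namespace concurrency;

void run_search(accelerator_view av, array<int, 1>& M,   // mask, lives on the GPU
                array<float, 1>& C, array<float, 1>& U, int n_nodes)
{
    array<int, 1> done(1, av);          // also lives on the GPU
    std::vector<int> done_host(1, 0);

    while (!done_host[0])
    {
        // Assume we are finished until some thread proves otherwise.
        parallel_for_each(av, extent<1>(1), [&done](index<1>) restrict(amp)
        {
            done[0] = 1;
        });

        parallel_for_each(av, extent<1>(n_nodes),
            [&M, &C, &U, &done](index<1> idx) restrict(amp)
        {
            if (M[idx] == 1)
            {
                M[idx] = 0;
                // ... relax edges, update C and U, set M[neighbour] = 1 ...
                done[0] = 0;            // benign race: every writer stores the same value
            }
        });

        // Only a few bytes come back per iteration, not the whole mask.
        copy(done, done_host.begin());
    }
}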


Visual Studio 2010 profile Fortran


Hello,

Could you please let me know whether or not it is possible to profile Fortran code with the VS2010 Premium edition?

Thanks,

Ngcc

ACE_SOCK_ACCEPTOR error elimination..


Hi, I am writing an application that works over TCP. The application accepts connections from clients and receives data. I am using the ACE-6.0.1 library, and I wrote the following code to accept connections:

ACE_SOCK_ACCEPTOR  acceptor_ ;

but it's giving me the error "ACE_SOCK_ACCEPTOR is undefined". Why does this error occur, and how can I resolve it?

Thanks in advance.


Please forward it.
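In case it helps anyone answer, here is roughly what I have, reduced to a sketch (untested; the port number and run_server are made up). My guess is that I'm just missing the header that declares the acceptor; as far as I can tell, the class for a plain passive TCP socket is spelled ACE_SOCK_Acceptor, while the all-caps ACE_SOCK_ACCEPTOR is a macro meant for the acceptor templates.

#include "ace/SOCK_Acceptor.h"   // declares ACE_SOCK_Acceptor
#include "ace/SOCK_Stream.h"
#include "ace/INET_Addr.h"

int run_server()
{
    ACE_INET_Addr listen_addr(8080);          // port is just an example
    ACE_SOCK_Acceptor acceptor_;              // note the mixed-case class name

    if (acceptor_.open(listen_addr) == -1)
        return -1;

    ACE_SOCK_Stream peer;
    if (acceptor_.accept(peer) == -1)         // blocks until a client connects
        return -1;

    char buf[4096];
    ssize_t n = peer.recv(buf, sizeof buf);   // receive data from the client
    peer.close();
    acceptor_.close();
    return (int) n;
}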

Parallel processing multithreading pthread


I need urgent help with threads. The goal is that separateMask takes an image and separates the different contours, and for each contour in the image it spawns a handleObject thread, so every iteration of the for loop creates one thread. However, the object index needs to be passed to each thread, and only the last value of objectIndex gets through. This is because separateMask keeps looping and replacing the value of obj.objindx, so all threads end up seeing the last value of obj.objindx. Is there any way to pass each objectIndex value to its own handleObject thread? The code runs fine if we uncomment the pthread_join(tid[objectIndex],NULL);, but then the program is no longer parallel.

void separateMask(IplImage *maskImg)
{
    for (r = contours; r != NULL; r = r->h_next) {
        cvSet(objectMaskImg, cvScalarAll(0), NULL);
        CvScalar externalColor = cvScalarAll(0xff);
        CvScalar holeColor = cvScalarAll(0x00);
        int maxLevel = -1;
        int thickness = CV_FILLED;
        int lineType = 8; /* 8-connected */
        cvDrawContours(objectMaskImg, r, externalColor, holeColor, maxLevel, thickness, lineType, cvPoint(0,0));

        obj.objectMaskImg1[objectIndex] = (IplImage *) malloc(sizeof(IplImage)); /* leaked: overwritten on the next line */
        obj.objectMaskImg1[objectIndex] = objectMaskImg;
        obj.objindx = objectIndex;
        obj.intensityOut1 = intensityOut;
        obj.tasOut1 = tasOut;
        /* every thread receives a pointer to the same shared 'obj' */
        pthread_create(&tid[objectIndex], NULL, handleObject, (void *) &obj);
        //pthread_join(tid[objectIndex], NULL);
        printf("objectindx %d\n", obj.objindx);
        objectIndex++;
    }
    // cvReleaseImage(&objectMaskImg);
    // cvReleaseMemStorage(&storage);
    printf("Exiting separateMask\n");
}


void* handleObject(void *arg)
{
    int i, j;
    handle *hndl;
    hndl = (handle *) malloc(sizeof(handle)); /* leaked: overwritten on the next line */
    hndl = (handle *) arg;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* note: a fresh, local mutex per thread */
    pthread_mutex_lock(&lock);
    IplImage *pImg;
    float statistics_ratio[3][9];
    pthread_t tid3;
    tas3 tas2;
    pImg = cvLoadImage("image.tif", CV_LOAD_IMAGE_ANYCOLOR | CV_LOAD_IMAGE_ANYDEPTH);
    if (pImg == NULL) {
        fprintf(stderr, "Fail to load image %s\n", "tiff file");
        return NULL;
    }
    tas2.pImg1 = pImg;
    printf("tst%d\n", hndl->objindx);
    tas2.x = hndl->objindx;
    tas2.objectMaskImg1 = hndl->objectMaskImg1[tas2.x];
    tas2.statistics_ratio[3][9] = statistics_ratio[3][9];
    double mean = average_intensity(pImg, tas2.objectMaskImg1);
    int total = total_white(pImg, tas2.objectMaskImg1);
    pthread_mutex_unlock(&lock);

    printf("Exiting handleObject thread_id %lu\n\n", (unsigned long) pthread_self());
    return NULL;
}
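To make the question concrete, this is the kind of change I'm after (a rough sketch with a cut-down handle struct and a made-up spawn_for_contour helper, not the full program): give every thread its own heap-allocated argument instead of sharing one obj, and free it inside the thread.

#include <pthread.h>
#include <stdlib.h>

typedef struct handle {
    int objindx;
    /* ... the other per-object fields ... */
} handle;

static void *handleObject(void *arg)
{
    handle *hndl = (handle *) arg;       /* private to this thread */
    /* ... process contour hndl->objindx ... */
    free(hndl);
    return NULL;
}

/* inside the contour loop of separateMask: */
static void spawn_for_contour(pthread_t *tid, int objectIndex)
{
    handle *perThread = (handle *) malloc(sizeof *perThread);
    perThread->objindx = objectIndex;    /* a snapshot, not a shared variable */
    pthread_create(&tid[objectIndex], NULL, handleObject, perThread);
}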

Multi-threading AMP


This is not exactly a question; maybe we can just start discussing it and find a good pattern together.

I tried to multi-thread kernel submission. It failed horribly: the implementation quickly blows up with corrupt reference counters, or perhaps with accesses to objects that reference counting has already destroyed. There is not much information from which I could build a mental model of what AMP actually is in terms of threading.

The documentation says that accelerator_views are thread-safe, but not concurrency-safe, while the accelerator itself is. From reading, I assume an accelerator_view represents something in the spirit of a DX11 device context. But it's not clear to me whether an accelerator_view per thread (and not deferred) would allow the concurrent submission of parallel_for_eachs to a single accelerator. I'm also not sure whether wrapping AMP code [with array_view construction and all] into a parallel_for lambda is even sane.

I did a very thorough job in my testing: each [concurrency-runtime] thread [spawned from a parallel_for] basically has its own view(s) and its own arrays, and doesn't intersect in any way with the other threads. Still, I can't get it not to blow up.

Here is the schema of how I tested it.

...
// one resource per thread (pb == thread-id)
static vector<accelerator_view> accs;
static vector<array<double, 1> *> arrs;
...
// run a kernel on the view assigned to the thread
double executeAMP(int pb)
{
    // the accelerator_views ended up in there with push_back(),
    // which prevents the default constructor from being used (it doesn't exist)
    accelerator_view &acc = (Repository::accs[pb]);
    array<double, 1> arr = *(Repository::arrs[pb]);
    ...
    extent<2> ee(3, 9);
    tiled_extent<3, 9> te(ee);
    ...
    parallel_for_each(acc, te, [=, &arr](tiled_index<3, 9> th) restrict(amp)
    {
        ...
    });
    ...
    return something_read_back_via_synchronize_buffer;
}
...
// submit X threads concurrently
parallel_for(0, p, 1, [&](int pb)
{
    pb %= concurrent;
    robin[pb].lock();
    reslt[pb] = executeAMP(pb);
    robin[pb].unlock();
});
...

What would be a working and safe pattern for getting concurrent AMP threads to execute? Or is there simply no way to use multiple [immediate] views, and it needs to be done with a deferred approach?

Maybe we will get a few interesting ideas, perhaps from the DX11 side, since I'm sure a bit of experience with this situation (multiple immediate contexts) has accumulated there.
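For discussion, here is the shape of the next thing I want to try (a sketch only; I'm not claiming this is the blessed pattern, and everything outside the AMP/PPL types is made up): create one accelerator_view per worker up front with accelerator::create_view, bind each array to its own view, and never share views or arrays between workers.

#include <amp.h>
#include <ppl.h>
#include <vector>
using namespace concurrency;

void submit_concurrently(int workers)
{
    accelerator acc;                                 // default accelerator
    std::vector<accelerator_view> views;
    std::vector<array<double, 1>> buffers;
    views.reserve(workers);
    buffers.reserve(workers);
    for (int i = 0; i < workers; ++i)
    {
        views.push_back(acc.create_view());          // one view per worker
        buffers.emplace_back(1024, views.back());    // data bound to that view
    }

    parallel_for(0, workers, [&](int w)
    {
        array<double, 1>& data = buffers[w];
        parallel_for_each(views[w], data.extent,
            [&data](index<1> idx) restrict(amp)
        {
            data[idx] = idx[0] * 2.0;                // placeholder kernel
        });
        views[w].wait();                             // synchronize this view only
    });
}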

C++ AMP: Detailed perf metrics?


I come from a CUDA and OpenCL background, where I have used tools like the NVIDIA Visual Profiler and AMD's profiling utility, which show GPU code metrics such as coalesced loads, the percentage of time the ALU is busy, the percentage of time the fetch unit is stalled, and so on. Are there ways to collect such detailed metrics from AMP code?

CPU do nothing while GPU is actually working


Hi, let me first say this: thank you, MS team, for always replying to my questions so kindly.

Now I have a new question.

I just started using the Concurrency Visualizer yesterday.

It shows me which threads are working at what time.

I looked at the CPU and GPU thread lanes, and realized that the CPU does nothing while the GPU is working.

It looks like the CPU just waits until the GPU finishes its job.

I thought parallel_for_each was ASYNCHRONOUS, but it seems it isn't.

I thought the CPU would move on to its next job immediately once the parallel_for_each was scheduled.

Why is the CPU just waiting?

thanks. 
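To be concrete, here is a minimal sketch of what I expected to be able to do (do_other_cpu_work is just a placeholder): the launch itself returns after queuing the kernel, and the CPU only has to wait where the results are actually needed.

#include <amp.h>
#include <vector>
using namespace concurrency;

static void do_other_cpu_work() { /* placeholder for independent CPU work */ }

void overlap_example()
{
    std::vector<float> data(1 << 20, 1.0f);
    array_view<float, 1> av_data((int) data.size(), data);

    parallel_for_each(av_data.extent, [=](index<1> idx) restrict(amp)
    {
        av_data[idx] *= 2.0f;        // GPU work
    });
    // parallel_for_each returns once the kernel is queued...

    do_other_cpu_work();             // ...so this should overlap with the GPU

    av_data.synchronize();           // this is where I expect the CPU to wait
}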

PC freezes while running C++ AMP code


When running C++ AMP code that takes more than a few seconds, my whole PC freezes; I'm not even able to move the mouse cursor. Other tasks keep going, for example I can still hear my music, but the whole screen is blocked until the execution of the AMP code finishes, at which point everything comes back to normal. The accelerator I'm running on is an HD7770 with the latest drivers, the OS is Windows 7 64-bit, and I'm on the RC version of Visual Studio.

Is this normal behaviour, maybe caused by too high a load on the GPU, or is it some other issue? Has anyone else encountered something like this, or does anyone have a suggestion on how to alleviate it?

Thanks and regards


Unbounded buffer (VS10)- shows unprocessed messages in debug view


I use an unbounded_buffer like this in VS10:

bool accepted = asend(this->workQueue, work); //returns true

On MS Server 2003 R2 SP2, every time I add a message to the workQueue I can see the count of unprocessed messages for the unbounded_buffer growing in the debug view. Can I use PPL on that platform? I should mention that my implementation works on later platforms, and I don't know why I am getting this issue (asend returns true).

Thanks for your help !



PS: the way I observed the issue was that I was adding messages to the buffer while consuming them from another agent (thread), but the dequeue() method never returned the newly added work.
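For completeness, this is the shape of what I'm doing, boiled down to a minimal sketch (simplified; Work is a stand-in for my real message type, and producer and consumer normally live in different agents): one side asend()s into the buffer and the other side blocks in receive().

#include <agents.h>
#include <iostream>
using namespace concurrency;

struct Work { int id; };

int main()
{
    unbounded_buffer<Work> workQueue;

    // producer side
    for (int i = 0; i < 10; ++i)
    {
        Work work;
        work.id = i;
        bool accepted = asend(workQueue, work);   // returns true for me
        std::cout << "queued " << i << " accepted=" << accepted << std::endl;
    }

    // consumer side (normally running in another agent/thread)
    for (int i = 0; i < 10; ++i)
    {
        Work w = receive(workQueue);              // blocks until a message arrives
        std::cout << "received " << w.id << std::endl;
    }
    return 0;
}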

PPL - VS10 compiler - what are the supported platforms ?


I ask this because I had issues successfully using unbounded_buffer on Windows Server 2003 SP2 (but it worked under newer platforms).

That code didn't show any error from the point of view of the unbounded_buffer API at runtime (queuing was confirmed, but I couldn't successfully read any queued items with the dequeue method). Could anyone please tell me whether you have successfully used any PPL code (including unbounded_buffer and agents) on Windows Server 2003 (and/or XP SP3)?

Thanks in advance.

-Ghita

RC and full double_precision error


Hi,

The following code worked perfectly with the Beta, but since I switched to RC, I get this error:

Concurrency::parallel_for_each uses features (full double_precision) unsupported by the selected accelerator.

As I don't need double precision, I only use float and fast_math. I still use the same hardware as before (an NVIDIA GeForce GTX 550 Ti and an AMD Radeon HD 6450).

The class Rect only contains int and ValuesRGB only float.

Have I missed some important changes between Beta and RC?

				ValuesRGB CompensationInternal::CalcAverages(const accelerator_view& av, const Rect field,
					ImageDataAMP<const unsigned int> img)
				{
					ValuesRGB outAvgBrightness;
					const int size = img.height * img.width;
					const int roiWidth = field.right - field.left;
					const int roiHeight = field.bottom - field.top;		

					const unsigned int roiSize = roiWidth * roiHeight;

					float* avg = new float[roiWidth * 3];
					array_view<float,1> avgBrightness( 3*roiWidth, avg);

					avgBrightness.discard_data();

					//std::cout << "roiWidth=" << roiWidth << std::endl; 


					index<1> origin(0);
					extent<1> e(roiWidth);
					try
					{
						// Run code on the GPU
						parallel_for_each(av, e, [=] (index<1> idx)  restrict(amp)
						{
							index<1> x = idx + field.left;
							const int index = idx[0] * 3;
							avgBrightness(0+ index) = 0;
							avgBrightness(1+ index) = 0;
							avgBrightness(2+ index) = 0;
							for (int y = field.top; y <= field.bottom; y++)
							{
								avgBrightness(0+ index) += (float)img.GetPixelR(x[0], y);
								avgBrightness(1+ index) += (float)img.GetPixelG(x[0], y);
								avgBrightness(2+ index) += (float)img.GetPixelB(x[0], y);
							}
						});
						// Copy data from GPU to CPU
						avgBrightness.synchronize();
					}
					catch(accelerator_view_removed& ex)
					{
						CompensationInternal::HandleRemoved(ex);
					}
					catch(runtime_exception& ex)
					{
						CompensationInternal::HandleRuntime(ex);
					}	

					for (int x = 0; x < roiWidth; x++)
					{
						const int index = x * 3;						

						outAvgBrightness.R +=avg[index + 0];
						outAvgBrightness.G +=avg[index + 1];
						outAvgBrightness.B +=avg[index + 2];
					}				

					outAvgBrightness.R /= roiSize;
					outAvgBrightness.G /= roiSize;
					outAvgBrightness.B /= roiSize;						
					delete[] avg;
					return outAvgBrightness;
				}	
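In case it matters, this is how I plan to check what my accelerators actually report (a sketch; I'm not claiming this explains the error, only that the message says the kernel's double-precision requirement is being checked against the selected accelerator):

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

void print_double_support()
{
    std::vector<accelerator> accs = accelerator::get_all();
    for (size_t i = 0; i < accs.size(); ++i)
    {
        std::wcout << accs[i].description
                   << L"  full double: "    << accs[i].supports_double_precision
                   << L"  limited double: " << accs[i].supports_limited_double_precision
                   << std::endl;
    }
}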

Want to make a bigint. Carry flag? int32*int32 -> int64?


Hello. I am interested in making a bigint that can run on the gpu.

In order to do this, I need to handle overflow conditions, like the + operator overflowing one bit into the carry flag on a CPU, and the * operator having a result that is twice the size of the operands.

Is there any way to do this with C++AMP? I believe CUDA has a few intrinsics for things like this ( http://www.mpi-inf.mpg.de/~emeliyan/cuda-compiler/ ).

The C++ AMP team has done an amazing job allowing standard C++ to be used, and standard C++ does not include any sort of carry flag or int32*int32->int64. So I suspect the answer will be "no" and that this is an uphill battle. But it sure would be handy.

Thanks!
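In case it helps frame the question, here is the portable fallback I have in mind (a sketch, not tuned; add_carry and mul_32x32_to_64 are my own names): an add-with-carry test and a 32x32->64 multiply built from 16-bit halves, using only 32-bit unsigned arithmetic.

#include <amp.h>

unsigned int add_carry(unsigned int a, unsigned int b, unsigned int& carry) restrict(amp, cpu)
{
    unsigned int s = a + b;          // wraps modulo 2^32
    carry = (s < a) ? 1u : 0u;       // detect the wrap
    return s;
}

void mul_32x32_to_64(unsigned int a, unsigned int b,
                     unsigned int& hi, unsigned int& lo) restrict(amp, cpu)
{
    unsigned int a_lo = a & 0xFFFFu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xFFFFu, b_hi = b >> 16;

    unsigned int p0 = a_lo * b_lo;   // contributes to bits  0..31
    unsigned int p1 = a_lo * b_hi;   // contributes to bits 16..47
    unsigned int p2 = a_hi * b_lo;   // contributes to bits 16..47
    unsigned int p3 = a_hi * b_hi;   // contributes to bits 32..63

    unsigned int mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);  // cannot overflow 32 bits
    lo = (p0 & 0xFFFFu) | (mid << 16);
    hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}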

Behavior difference between runtime 2010 vs 2008


Hi,

I have a big multithreaded process that until now has run fine on many different platforms. It's a .exe built with VS 2008 and thus uses that runtime.

I migrated the entire solution to VS 2010 as-is and built it.

We now randomly encounter many unexplainable problems. Many, but not all, of them occur the first time the function concerned is executed.

Does someone have an idea about this?

Many thanks.


JPL

Interop with Direct3D11 textures in C++ AMP: sample project


Hello,

I was reading the very useful post at:

http://blogs.msdn.com/b/nativeconcurrency/archive/2012/07/02/interop-with-direct3d-textures-in-c-amp.aspx

Is there a sample Visual Studio project that could be made available for this? I'm still new to DX programming, so this would be tremendously helpful for people like me who want to get started by playing around with a working project.

Sample projects have been made available for other blog posts and they are extremely helpful. Any chance of adding one for this one? Nothing fancy requested, just a working project that has the described code in action.

Thank you!

Julien

HelloWorldCSharpWinRT won't compile


Hello,

I am trying to play with winRT cpp amp demo found here:

http://blogs.msdn.com/b/pfxteam/archive/2011/11/11/how-to-use-c-amp-from-c-using-winrt.aspx

When I try to compile I get this error:

 

1>C:\Users\julie_000\Documents\Visual Studio 2012\Projects\HelloWorldCSharpWinRT\HelloWorldCSharpWinRT\Common\LayoutAwarePage.cs(142,63,142,99): error CS0246: The type or namespace name 'ApplicationViewStateChangedEventArgs' could not be found (are you missing a using directive or an assembly reference?)

The Hello World examples work just fine, as does the C# desktop example. I am running on Windows 8 Release Preview (build 8400). Am I doing something wrong, or does the sample need to be updated?

Thanks!

Julien


AMP: shared texture


Why are graphics::texture objects not created as shareable D3D textures?

Currently, if I want to display a graphics::texture object on screen, I first have to copy it over to a texture on that device.

auto tex = texture<norm4, 2>(source, d3d_acc_view);

However if the following was true:

CComQIPtr<IDXGIResource> res = get_texture(source);
DXGI_USAGE usage;
res->GetUsage(&usage);
assert((usage & DXGI_USAGE_SHARED) != 0);

Then the extra copy would be unnecessary, or am I missing something?
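The workaround I have in mind looks roughly like this (a sketch, untested; the device pointer, format, and bind flags are my own assumptions): create the D3D texture myself with the shared-resource flag and wrap it for AMP with make_texture, instead of letting AMP create the texture.

#include <amp.h>
#include <amp_graphics.h>
#include <d3d11.h>
#include <atlbase.h>
using namespace concurrency;
using namespace concurrency::graphics;

texture<norm4, 2> make_shared_texture(ID3D11Device* device,
                                      const accelerator_view& d3d_acc_view,
                                      int width, int height)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = width;
    desc.Height = height;
    desc.MipLevels = 1;
    desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    desc.MiscFlags = D3D11_RESOURCE_MISC_SHARED;      // make it shareable

    CComPtr<ID3D11Texture2D> d3dTex;
    device->CreateTexture2D(&desc, nullptr, &d3dTex);

    // Wrap the existing D3D texture so AMP kernels can write into it;
    // the same resource could then be opened on the display device.
    return direct3d::make_texture<norm4, 2>(d3d_acc_view, d3dTex);
}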


C++ AMP: "Failed to query D3D marker event status."


Hi Folks,

I have another problem with a TDR event. This code computes the LU decomposition of a tridiagonal 40x40 matrix using a naive implementation of Doolittle's algorithm. The code works fine in Debug mode on an NVIDIA GTX 470, but fails in Release mode on the third parallel_for_each, on the first iteration of i (= 0). The template is instantiated with float.

template <typename _type>
void LU_DecompositionC(accelerator_view & acc, Matrix<_type> & AC, Matrix<_type> & L, Matrix<_type> & U)
{
    std::cout << "Starting parallel C ... ";
    Matrix<_type> A(AC);

    int N = A.rows;

    array_view<_type, 2> a(N, N, A.Data());
    array_view<_type, 2> l(N, N, L.Data());
    array_view<_type, 2> u(N, N, U.Data());
    mycache2(acc, a);
    l.discard_data();
    u.discard_data();

    Counter counter;
    counter.Start();

    for (int i = 0; i < A.rows; ++i)
    {
        //cout << "i = " << i << "\n";
        //cout.flush();
        parallel_for_each(acc, 1, [=](int j) restrict(amp)
        {
            l(i,i) = 1;
        });
        acc.wait();
        parallel_for_each(acc, A.cols, [=](int j) restrict(amp)
        {
            _type sum = 0;
            for (int k = 0; k < i; ++k)
            {
                sum += l(i,k) * u(k,j);
            }
            u(i, j) = a(i, j) - sum;
        });
        acc.wait();
        //cout << "A.rows - (i+1) " << (A.rows - (i+1)) << "\n";
        //cout.flush();
        if (A.rows - i - 1 <= 0)
            continue;
        parallel_for_each(acc, A.rows - i - 1, [=](int j) restrict(amp)
        {
            int jj = j + (i + 1);
            _type sum = 0;
            for (int k = 0; k < i; ++k)
            {
                sum += l(jj,k) * u(k,i);
            }
            l(jj,i) = (a(jj, i) - sum) / u(i,i);
        });
        acc.wait();
    }
    std::cout << counter.Stop() << " ms.\n";
}

Any ideas why, and a work around?

Ken Domino
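For what it's worth, here is what I plan to try next (a sketch; pick_compute_view is my own helper, and I'm not sure it addresses the marker-event failure): run on an accelerator that isn't driving a display when one is available, and catch accelerator_view_removed as a hint that the kernel hit the Windows TDR timeout in Release mode.

#include <amp.h>
#include <vector>
using namespace concurrency;

accelerator_view pick_compute_view()
{
    std::vector<accelerator> all = accelerator::get_all();
    for (size_t i = 0; i < all.size(); ++i)
    {
        // Prefer a real (non-emulated) device that is not attached to a display,
        // so a long-running kernel doesn't freeze the desktop.
        if (!all[i].is_emulated && !all[i].has_display)
            return all[i].default_view;
    }
    return accelerator().default_view;   // fall back to the default device
}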

parallel invoke & windows forms controls

I'm using managed C++/CLI (not pure C++).
I'm using TPL, not PPL.
The following code works fine.
System::Threading::Tasks::Parallel::Invoke(ActionArray)
However, when the Action has anything to do with the Windows Forms or their controls (labels, charts, etc.), it crashes.
The following crashes, too.
System::Threading::Tasks::TaskFactory myFactory;
System::Threading::Tasks::Task::WaitAll(myFactory.StartNew(Action));
If the parallel subroutine updates or does anything with the Windows Forms, or with the controls, labels, or charts, it crashes.
Am I not supposed to have those things in the parallel subroutine?
I would like to open & run multiple windows forms simultaneously.
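My current guess at the pattern needed (a sketch with hypothetical form and control names, MyForm and label1): controls may only be touched from the thread that created them, so the parallel task has to marshal the UI update back with Control::Invoke.

void MyForm::UpdateLabelSafe(System::String^ text)
{
    if (this->label1->InvokeRequired)
    {
        // Called from a worker task: re-run this method on the UI thread.
        this->label1->Invoke(
            gcnew System::Action<System::String^>(this, &MyForm::UpdateLabelSafe),
            gcnew cli::array<System::Object^> { text });
    }
    else
    {
        this->label1->Text = text;   // safe: we are on the UI thread here
    }
}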

AMP: array vs texture?


I haven't really understood the pros and cons of concurrency::array vs concurrency::graphics::texture, and was hoping someone could explain.

I'm building an application for image processing, and I'm not sure which type I should use to store and process my data.


AMP: copy array to texture?


How can I copy a concurrency::array to a concurrency::graphics::texture without first mapping the array to host memory? Given that the array is backed by a Direct3D buffer, such a copy should be quite fast, shouldn't it?

I haven't found any concurrency::copy overload that does this.



