Channel: Parallel Computing in C++ and Native Code forum

agents: message block send to all targets?


From what I've understood, the existing message blocks (e.g. unbounded_buffer) always propagate a message to the first target that accepts it. Is there a good way to "broadcast" messages to all targets?

e.g.

unbounded_buffer<int> source;
unbounded_buffer<int> target1;
unbounded_buffer<int> target2;

source.link_target(&target1);
source.link_target(&target2);

send(source, 1);

std::cout << receive(target1);
std::cout << receive(target2);

// desired output: 11
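
A possible approach (a sketch, not from the original thread): overwrite_buffer, unlike unbounded_buffer, offers a copy of each message to every linked target, so it behaves as a broadcast. Note that it holds only the single most recent message rather than a queue:

overwrite_buffer<int> source;
unbounded_buffer<int> target1;
unbounded_buffer<int> target2;

source.link_target(&target1);
source.link_target(&target2);

send(source, 1); // each linked target receives its own copy

std::cout << receive(target1);
std::cout << receive(target2);

// output: 11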


CPU fallback on C++ AMP?


Hello all,

I'm looking into using AMP from my .NET class library (via P/Invoke). Am I correct in my assumption that if something goes wrong with a GPU kernel launch, the AMP code will fall back to the CPU? If so, is there a way to restrict execution of the AMP kernel to the GPU only and, if that isn't possible at runtime, fall back to the C# code in my class library? Thanks!

-L
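
A minimal sketch of one way to detect up front whether a real GPU is available, so the .NET side can choose between the AMP kernel and its own C# fallback (the exported function name is made up for illustration):

#include <amp.h>

// Returns true if a non-emulated GPU accelerator exists; is_emulated
// filters out the reference rasterizer and the CPU accelerator.
extern "C" __declspec(dllexport) bool __stdcall HasHardwareGpu()
{
    for (const auto& acc : concurrency::accelerator::get_all())
    {
        if (!acc.is_emulated)
            return true;
    }
    return false;
}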

Indexing into different sized arrays in AMP parallel_for_each


I have been attempting to write a convolution kernel in AMP to process the depth data from a Kinect, and I've been experiencing some problems.  I want the result from the kernel to simply be smaller than the original image, but I don't seem to be able to index into both images with a two-dimensional index.  To diagnose this problem I've been trying to write a simple operation that copies a portion of the original image to the results array.

The CPU implementation is as follows:

void SimpleOperationCPU(int* result, int* image, int width, int height)
{
	int size = 2 * radius + 1; // 'radius' is a file-scope constant defined elsewhere in the original code

	for (int r = 0; r <= height - size; r++)
	{
		for (int c = 0; c <= width - size; c++)
		{
			result[r * (width - 2 * radius) + c] = image[(r) * (width) + c];
		}
	}
} 

Using strictly one-dimensional extents I get the result I wanted (if a bit slower than I'd like):

void SimpleOperationSINGLE(int* result, int* image, int width, int height)
{
	int size = 2 * radius + 1;
	extent<1> imageExtent(width * height);
	array_view<const int, 1> imageAMP(imageExtent, image);
	extent<1> resultExtent((width - 2) * (height - 2 * radius));
	array_view<int, 1> resultAMP(resultExtent, result);
	
	int myWidth = width;

	parallel_for_each(resultExtent,
		[=](index<1> index) restrict(amp)
	{
		resultAMP[index] = imageAMP[index[0] + 2 * fast_math::floor(index[0] / myWidth)];
	});
	resultAMP.synchronize();
}

[success image excluded due to image number restrictions]

However, if I try to index into the arrays with a two-dimensional extent, it gives me an off-by-one error:

void SimpleOperation(int* result, int* image, int width, int height)
{
	// Setup for AMP stuff
	extent<2> imageExtent(width, height);
	extent<2> resultExtent((width - 2 *  radius), (height - 2 * radius));

	array_view<const int, 2> imageAMP(imageExtent, image);
	array_view<int, 2> resultAMP(resultExtent, result);
	resultAMP.discard_data();

	parallel_for_each(resultExtent,
		[=](index<2> idx) restrict(amp)
	{
		resultAMP(idx[0], idx[1]) = imageAMP(idx[0], idx[1]);
	});
	resultAMP.synchronize();
}

And when I try to solve this with a mixed solution, it seems to cut off the edge:

void SimpleOperationMIXED(int* result, int* image, int width, int height)
{
	int myOtherWidth = width - 2 *  radius;

	// Setup for AMP stuff
	extent<1> imageExtent(width * height);
	extent<1> resultExtent((width - 2 *  radius) * (height - 2 * radius));

	array_view<const int, 1> imageAMP(imageExtent, image);
	array_view<int, 1> resultAMP(resultExtent, result);
	resultAMP.discard_data();

	int myWidth = width;

	extent<2> e2(width - 2 * radius, height - 2 * radius);

	parallel_for_each(e2,
		[=](index<2> idx) restrict(amp)
	{
		resultAMP(idx[0] * myOtherWidth + idx[1]) = imageAMP(idx[0] * myWidth + idx[1]);
	});
	resultAMP.synchronize();
}

Do you think you could shed some light on these issues?
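
A hedged guess at the underlying issue: extent<2> and index<2> in C++ AMP treat dimension 0 as the most significant (row) dimension, so an image is conventionally described as (height, width), whereas the snippets above pass (width, height). A sketch of the copy with row-major ordering, using the same file-scope radius constant:

void SimpleOperationRowMajor(int* result, int* image, int width, int height)
{
	extent<2> imageExtent(height, width);   // (rows, columns)
	extent<2> resultExtent(height - 2 * radius, width - 2 * radius);

	array_view<const int, 2> imageAMP(imageExtent, image);
	array_view<int, 2> resultAMP(resultExtent, result);
	resultAMP.discard_data();

	parallel_for_each(resultExtent,
		[=](index<2> idx) restrict(amp)
	{
		// idx[0] is the row and idx[1] the column in both views,
		// so the same coordinates address both images.
		resultAMP[idx] = imageAMP[idx];
	});
	resultAMP.synchronize();
}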

array_view back to old value when using p-f-e


Hi,
I have hit the wall again. Please help.

Please look at the following code:

float data[1];
array_view<float,1> dm(1,data);
data[0]=2;//dm[0]=2 at this point 100% sure about it.
data[0]+=2;//dm[0]=4 at this point
printf(" dm[0]=%f ",dm[0]);//dm[0]=4 at this point<---of course.

This is fine. Then look at the following:

float data[1];
array_view<float,1> dm(1,data);
data[0]=2;//dm[0]=2 at this point 100% sure about it.
parallel_for_each(dm.extent,[=](index<1> idx) restrict(amp)
{
	float f=dm[0];//dummy, it doesn't matter; just to use dm in the p-f-e block
});
data[0]+=2;//dm[0]=4 at this point
printf(" dm[0]=%f ",dm[0]);//dm[0]=2 at this point<---this is wired for me.
//obviously p-f-e is the key. 

when I use "dm"(the array_view) in "p-f-e", then synchronize(synchronize happen at first use)  it ,then weird thing(for me) happen.

Can anyone explain why this happens, and/or how to avoid it?

thank you.
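
A hedged explanation: after the parallel_for_each, the runtime caches a copy of the data on the accelerator, and writing through the raw pointer bypasses the array_view's coherence tracking; the later read through dm then copies the stale device data back over the host memory. Assuming that is what happens, a sketch of two ways around it:

float data[1];
array_view<float,1> dm(1,data);
data[0]=2;
parallel_for_each(dm.extent,[=](index<1> idx) restrict(amp)
{
	float f=dm[0];
});

// Option 1: write through the view, so the runtime tracks the change.
dm[0]+=2;

// Option 2: keep writing through the raw pointer, but tell the runtime
// the underlying memory was modified outside the view:
//data[0]+=2;
//dm.refresh();

printf(" dm[0]=%f ",dm[0]);// now prints 4.000000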

Register spilling with C++ AMP and GPUs


Hi,

I ported a complicated hydrological model to OpenCL with good results. However, there was some register spilling due to the size of the model and the inability to break it up easily.

C++ AMP has caught my eye, but I'm having some issues getting it to work with the model.
Compiling fails with the error message: "C3568: sum of registers exceeds the limit of 4096 when compiling the call graph for the concurrency::parallel_for_each. Please simplify your program".

Is there a way to increase that limit?


C++ AMP, custom CPU fallback - PPL?

Hi all,

Newbie to C++ AMP here...

I am currently working on a largish bunch of OpenCL prototype code that should go into product in 6-10 months. The code is executed on GPU or CPU depending on customer hardware (workstation -> usually GPU, virtualized server -> CPU).

Since we are a Windows-only company, I would like to port the code to C++ AMP if possible (the switch to VS11 is getting started anyway, the C++ integration is wonderful, and I hope for better driver compatibility/stability than with OpenCL).

From the recent post http://social.msdn.microsoft.com/Forums/en-GB/parallelcppnative/thread/71716f1a-c1e1-40bf-ac2a-4bc5cc718b05 and the blog post http://www.danielmoth.com/Blog/Running-C-AMP-Kernels-On-The-CPU.aspx I gather that we cannot expect CPU acceleration using C++ AMP under Windows 7, at least not in the near future? Is this official by now?

The lack of CPU fallback for Windows 7 is problematic, since I expect many of our customers will stay on Windows 7 for at least 2-3 years. Does anyone have a hunch about when we will see a DirectCompute-capable WARP on Windows 7 (if ever)? (This question may be better directed to a WARP forum, if anyone would like to point me there.)

My options right now seem to be either to stay with OpenCL for a year or two and hope that the future task of porting everything written in the meantime will not be beyond all hope, or to go with C++ AMP right away and write a custom wrapper that enables a CPU fallback to PPL (for a fairly limited subset of C++ AMP, of course). It seems likely other people have been working in this direction...

If I cannot keep the code execution path more or less identical on GPU and CPU, I will stay with OpenCL, so I would like to execute the same lambdas using both C++ AMP and PPL. The obstacles are obviously how memory is represented (array(_view) vs. vector) and the capture clause (I need to be able to capture by reference when using PPL). Is it possible (and supported?) to use cpu_accelerator for memory management only, and to use memory allocated with this accelerator when executing the lambda with PPL? I think that would make for a very thin wrapper... :-) If so, is memory management using cpu_accelerator significantly slower than normal?

I would be grateful for any related suggestions on how to best build a wrapper that enables a CPU-fallback to PPL.

Cheers,

T

Where do I submit feature requests?


Where do I submit feature requests?

I have some feature requests (and will probably have more) and would like to know the best place to submit them.

e.g.

  • staging texture
  • copy-with-pitch overload for 2D texture copies: copy(const void * _Src, unsigned int _Src_byte_size, unsigned int _Src_byte_pitch, concurrency::graphics::texture<_Value_type, 2>& _Dst)
  • Use parallel_for in _Copy_data_on_host for large data copies to reduce latency.
  • array <-> texture copy

When would I wait for a texture/array upload?


In the context of the following method:

concurrency::completion_future copy_async(const void * _Src, unsigned int _Src_byte_size, texture<_Value_type, _Rank>& _Dst)

What does it mean to wait on the returned future? I guess it simply waits for the host-to-device DMA transfer to finish. My question, though, is when would you be interested in waiting for that transfer to finish? I don't quite see how waiting or not waiting makes any difference on the host side, since the transfer is scheduled on the accelerator_view and automatically synchronized when the texture is used on the accelerator.
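
One plausible reason to wait, sketched under the assumption that the source buffer is read asynchronously: the future tells you when the source host memory may be safely reused or freed. Here width, height, and tex are assumed to be in scope, and fill_staging is a made-up helper:

std::vector<float> staging(width * height);
fill_staging(staging); // hypothetical helper that produces the texels

// Using the copy_async overload quoted above.
auto f = copy_async(staging.data(),
	static_cast<unsigned int>(staging.size() * sizeof(float)), tex);

// ... unrelated host work ...

f.wait(); // after this, 'staging' may be overwritten or destroyed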




Copy between textures always goes through host?


Looking at the implementation of graphics::texture, I notice that the copy is always performed by downloading the source data to the host and then uploading it to the destination on the device.

This seems rather inefficient. I understand (from a previous question) that optimizing this when src and dst are on different accelerator_views can be somewhat complicated, but I was hoping that a copy where src and dst are on the same view could be made a lot faster.

i.e. instead of:

    std::vector<unsigned char> _Host_buffer(_Size);
    _Copy_async_impl(_Src, reinterpret_cast<void *>(_Host_buffer.data()), _Size)._Get();
    _Copy_async_impl(reinterpret_cast<void *>(_Host_buffer.data()), _Size, _Dest)._Get();

do something like:

	if(_Src.accelerator_view != _Dest.accelerator_view)
	{
		std::vector<unsigned char> _Host_buffer(_Size);
		_Copy_async_impl(_Src, reinterpret_cast<void *>(_Host_buffer.data()), _Size)._Get(); // Should we really wait here? Isn't a .then(/*...*/) better?
		_Copy_async_impl(reinterpret_cast<void *>(_Host_buffer.data()), _Size, _Dest)._Get();
	}
	else
	{
		// Device-to-device copy here, without the host round-trip.
	}

In my case I wanted to create a helper method:

void fast_copy_helper(/*...*/)
{
	if(src.accelerator_view != dst.accelerator_view)
		graphics::copy(src, dst);
	else
	{
		CComPtr<IUnknown> d3d_src_unknown;
		d3d_src_unknown.Attach(get_texture(src));
		CComQIPtr<ID3D11Resource> d3d_src = d3d_src_unknown;

		CComPtr<IUnknown> d3d_dest_unknown;
		d3d_dest_unknown.Attach(get_texture(dst));
		CComQIPtr<ID3D11Resource> d3d_dest = d3d_dest_unknown;

		CComPtr<IUnknown> d3d_device_unknown;
		d3d_device_unknown.Attach(concurrency::direct3d::get_device(src.accelerator_view));
		CComQIPtr<ID3D11Device> d3d_device = d3d_device_unknown;

		_ASSERTE(concurrency::direct3d::get_device(src.accelerator_view) == concurrency::direct3d::get_device(dst.accelerator_view));

		// Schedule CopyResource call on accelerator_view?
	}
}

I got stuck here, as I haven't yet figured out how to schedule Direct3D 11 interop (immediate context) calls onto the accelerator_view thread.
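
A hedged sketch of the device-to-device branch, assuming that draining the accelerator_view with wait() before touching the immediate context gives sufficient ordering (the runtime does not appear to expose a way to enqueue raw D3D11 calls into its own command stream):

	// Drain AMP commands already queued on this view; the immediate
	// context is not thread-safe, so this must not race with other users.
	src.accelerator_view.wait();

	CComPtr<ID3D11DeviceContext> ctx;
	d3d_device->GetImmediateContext(&ctx);
	ctx->CopyResource(d3d_dest, d3d_src); // stays on the device, no host round-trip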



What is the accuracy of short vector types?


What is the accuracy of the short vector types? E.g. for unorm_4, is that 8 bits per scalar element, or 32 bits during calculations?

Convert from float to unorm without clamping?

I'm using fast_math::floor on a value that I know will be in the range [0.0, 1.0] and want to convert it to unorm. Can I do that without wasting computation on clamping?

Deadlock

I'm experiencing a deadlock when running the following code in debug mode (it works fine in release mode).

concurrency::graphics::unorm quantize(concurrency::graphics::unorm value, float range) restrict(cpu, amp)
{
	return concurrency::graphics::unorm(concurrency::fast_math::floor(value * range) / range);
}

concurrency::graphics::unorm_4 quantize(concurrency::graphics::unorm_4 value, float range) restrict(cpu, amp)
{
	return concurrency::graphics::unorm_4(quantize(value.r, range), quantize(value.g, range), quantize(value.b, range), 1.0);
}

template<typename T>
T quantization_error(T value, float range) restrict(cpu, amp)
{
	return value - quantize(value, range);
}

template<typename T>
concurrency::graphics::texture<T, 2> quantize_impl(const concurrency::graphics::texture<T, 2>& src, unsigned int dst_bits_per_element)
{
	if(src.bits_per_scalar_element <= dst_bits_per_element)
		return src;

	concurrency::graphics::texture<T, 2> dst(src.extent, (((dst_bits_per_element + 7)/8)*8));
	auto dst_range = static_cast<float>((1 << dst_bits_per_element) - 1);
	auto write = concurrency::graphics::writeonly_texture_view<T, 2>(dst);

	// Floyd–Steinberg Dithering (http://en.wikipedia.org/wiki/Floyd%E2%80%93Steinberg_dithering)
	concurrency::parallel_for_each(write.extent, [&, write, dst_range](concurrency::index<2> idx) restrict(amp)
	{
		auto val = quantize(src[idx], dst_range);
		val += quantization_error<T>(src[idx + concurrency::index<2>(0, 1)], dst_range) * T(7.0f/16.0f);

		// If the following lines are uncommented it deadlocks?
		//val += quantization_error<T>(src[idx + concurrency::index<2>(1, -1)], dst_range) * T(3.0f/16.0f);
		//val += quantization_error<T>(src[idx + concurrency::index<2>(1,  0)], dst_range) * T(5.0f/16.0f);
		//val += quantization_error<T>(src[idx + concurrency::index<2>(1,  1)], dst_range) * T(1.0f/16.0f);

		write.set(idx, T(val));
	});
	return dst;
}




C++ AMP, general 1D custom CPU fallback to PPL


Hi all,

(moving specific question from originating discussion "C++ AMP, custom CPU fallback - PPL?")

Is it possible to use PPL (for executing lambdas) together with cpu_accelerator (for memory management) to achieve a general 1D custom CPU fallback in the way illustrated below?

Modified Hello World:

#include <iostream>
#include <amp.h>

using namespace concurrency;
using std::wcout;
using std::endl;

template< typename Kernel_type >
void pfe( const Concurrency::extent< 1 >& e, const Kernel_type& kernel )
{
  if ( accelerator().device_path == accelerator::cpu_accelerator ) {
    wcout << "PFE: Using CPU" << endl;
    auto pplKernel = [&kernel]( int i ) {
                       index< 1 > idx( i );
                       kernel( idx );
                     };
    parallel_for( 0, int( e.size() ), pplKernel ); // TODO: Avoid PPL scheduling overhead
  }
  else {
    wcout << "PFE: Using GPU" << endl;
    parallel_for_each( e, kernel );
  }
}

int main()
{
  bool useCPU = true; // Toggle to use CPU/GPU
  if ( useCPU ) {
    accelerator::set_default( accelerator::cpu_accelerator );
  }

  int v[11] = { 'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c' };
  array_view< int > av( 11, v );
  pfe( av.extent, [ = ]( index< 1 > idx ) restrict ( amp, cpu ) {
                    av[idx] += 1;
                  } );

  wcout << "default_accelerator description = " << accelerator().description << endl;
  wcout << "default_accelerator device_path = " << accelerator().device_path << endl;
  for ( unsigned int i = 0; i < av.extent.size(); i++ ) {
    wcout << static_cast< char >( av( i ) );
  }
  wcout << endl;
}

Whether this is a useful and working fallback depends on positive answers to the following questions:

  1. Is the memory management done by cpu_accelerator (on, e.g., arrays) much slower than native STL?
    Comment: It should be fast enough to enable non-debug use in the case of staging arrays: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/10/cpu-accelerator-in-c-amp.aspx.
  2. Is there a significant performance penalty for accessing memory on the "normal" heap through array(_view) using the index class (as opposed to vector::operator[]) when using PPL?
    Comment: Most of the kernels I am interested in run an internal for-loop, so I can always side-step this by extracting the data pointer and an int index using array_view::data() and index::operator[] before the loop.
  3. Is there a performance penalty associated with wrapping a lambda in another lambda? It should just be inlined, right?
  4. Is the usage of cpu_accelerator in the code above officially supported? Will it work also for more complex constructs, e.g., using array? If not, any workarounds?

I am grateful for any advice on how to implement a custom CPU fallback while keeping code size/duplication to a minimum.

If the above does not work, I believe the next leanest fallback will be to wrap handles to vector and array in my own custom container (to facilitate memory management in the host code) and use independent lambdas for AMP and PPL (Daniel has already noted: "For snippets of code that are truly identical, factor them out into their own restrict(cpu,amp) functions so you can call them from either of the two entry point lambdas (from the CPU and AMP paths).").

Cheers,

T

(Edit: Minor changes to code sample)

Microsoft Visual C++ 2005 Express Edition


I have a problem registering Microsoft Visual C++ 2005 Express Edition.
Please send me a link to register, because the link in the registration dialog box of Visual C++ isn't working!

How to show a specific one of several XPS documents?


I want to show one of the XPS documents, selected by an index. I want to do it in the following way:

 

.......

String ^strFileName = gcnew String(lpFileName);
XpsDocument ^xpsDoc = gcnew XpsDocument(strFileName, FileAccess::Read);

FixedDocumentSequence ^docSeq = xpsDoc->GetFixedDocumentSequence();
IDocumentPaginatorSource ^document = docSeq;
DocumentReferenceCollection ^coll = docSeq->References;
DocumentReference ^docRef = coll[nIndex];

FixedDocument ^fd = docRef->GetDocument(true);

AvPageHost::hostedPage->XpsDocumentViewer->Document = fd;

.......

but ,the "coll" count value always 1,and don't change !

Does anybody know how to relate the index to a document?
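
A hedged guess: many XPS files contain a single FixedDocument with many pages, in which case References holds exactly one entry no matter how many pages there are. Indexing the pages of that one document may be what is wanted:

FixedDocument ^fd = docSeq->References[0]->GetDocument(true);
PageContent ^page = fd->Pages[nIndex];          // select a page by index
FixedPage ^fixedPage = page->GetPageRoot(true); // load the page content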


How to watch reference variables using CUDA Parallel NSight debugger in Visual Studio 2010?


I can properly watch any variable or pointer, but I cannot do the same for reference variables when debugging with NSight. I have already tried converting the reference to a pointer in the Watch and Immediate windows, but to no avail. I don't want to rewrite my code using pointers. I'm using Visual Studio 2010 SP1 on a 64-bit Windows 7 machine, CUDA 4.0, and Parallel NSight 2.0, and my code is compiled for Win32.

Example: __device__ void function(int& parameter)

I cannot debug "parameter"; it is unreachable by the debugger.

Queue up work to be done in I/O callbacks on the same thread

Hi. I use WinHTTP. For a given connection, I need to queue up work to be done on the same thread-pool thread, so that I don't have issues synchronizing callbacks called for the same connection. How would you suggest doing this using PPL?
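
A hedged sketch of one PPL approach (not from the original thread): chain continuations onto a per-connection task, so work items for the same connection run strictly one after another. This serializes the work but does not pin it to one particular thread-pool thread, which may or may not be acceptable for WinHTTP callbacks:

#include <ppltasks.h>
#include <mutex>

struct connection_work_queue
{
	connection_work_queue() : tail(concurrency::create_task([]{})) {}

	// 'work' is a no-argument callable returning void.
	template<typename F>
	void enqueue(F&& work)
	{
		std::lock_guard<std::mutex> guard(lock);
		tail = tail.then(std::forward<F>(work)); // runs after the previous item
	}

private:
	concurrency::task<void> tail; // last work item queued for this connection
	std::mutex lock;              // guards 'tail' against concurrent enqueues
};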

Store array in graphics memory for later use


I am working on a matching algorithm that searches through 10,000 vectors for the closest match. Each vector consists of 100 floats.

Can I somehow store the vectors in the graphics memory at the beginning of my program so that I don't have to pass them along for each query?

extern "C" __declspec ( dllexport ) void _stdcall QueryDb(float* vResults, const float* vDb, const float* vQuery)
{
    array_view<const float,2> dataView(10000, 100, &vDb[0]);

    array_view<const float, 1> queryVector(_length, &vQuery[0]);

    array_view<float> results(_n, &vResults[0]);
    results.discard_data();

    // Run code on the GPU
    parallel_for_each(results.extent, [=] (index<1> idx) restrict(amp)
    {		
	//Perform calculations
	results(idx) = calculated_result;
    });

    // Copy data from GPU to CPU
    results.synchronize();
}
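
A hedged sketch of one way to keep the database resident in accelerator memory between calls: copy it once into a concurrency::array (which lives on the accelerator, unlike an array_view over host data) and reuse it across queries. The function name is made up for illustration:

#include <amp.h>
#include <memory>

static std::unique_ptr<concurrency::array<float, 2>> g_db; // lives on the accelerator

extern "C" __declspec ( dllexport ) void _stdcall UploadDb(const float* vDb)
{
    // One-time host-to-device copy of the 10000 x 100 database.
    g_db.reset(new concurrency::array<float, 2>(10000, 100, vDb));
}

// Later kernels capture *g_db by reference in the parallel_for_each, so
// only the query vector and the results cross the bus on each call.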


C++ AMP Double precision capable cards


Is there any currently available card that has full double precision capability?

I tested on a GT 650M and an HD 4000, and both claim support for "limited double precision": addition and multiplication work, but not, say, division. Is there a card that does full double precision? Is this a driver issue?
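
For reference, a minimal sketch that queries the two double-precision flags C++ AMP exposes per accelerator, which distinguish full from limited support:

#include <amp.h>
#include <iostream>

int main()
{
	for (const auto& acc : concurrency::accelerator::get_all())
	{
		std::wcout << acc.description
			<< L"  full double precision: " << acc.supports_double_precision
			<< L"  limited double precision: " << acc.supports_limited_double_precision
			<< std::endl;
	}
}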

Can I create a C++/CLI test function to test OpenCL functions based on an AMD GPU in VS 2010?


I am currently using VS2010 to create some kernels based on an AMD GPU.
But when I create a C++/CLI test project to do the unit testing, I get the error "the agent process was stopped while the test was running".
The reason I'm sending this message is just to find out whether I can create a C++/CLI test function to test OpenCL functions based on an AMD GPU in VS2010. I am new to C++/CLI.

Thanks a lot!
