Channel: Parallel Computing in C++ and Native Code forum
Viewing all 856 articles

Using MFC project with MPI


Hi all,

I want to build an application using MPI, and for the UI I chose an MFC project. But when I try to configure the linking and debugging settings for MPI, I get compilation errors. What I don't understand is that the same configuration works fine in a console project. So I'm wondering: are there options I need to change in an MFC project, or does MFC simply not support MPI?

Regards,

Toufik


Couldn't create more than 64 OpenMP threads in a test application


Hi,

I recently ran a test with a simple OpenMP-based application, and OpenMP could not create more than 64 threads.

There is also a related discussion at http://software.intel.com/en-us/forums/showthread.php?t=103375 if you are interested.

Here is the code for the test:

#include <omp.h>
#include <stdio.h>

int main( void )
{
     int iShowNumOfThreads = 1;

     omp_set_num_threads( 1024 );

     #pragma omp parallel num_threads( 1024 )
     {
          if( iShowNumOfThreads == 1 )
          {
               iShowNumOfThreads = 0;
               printf( "Number of threads created: %d\n", ( int )omp_get_num_threads() );
          }

          for( int i = 0; i < 16777216; i++ )
          {
               double dA = ( 2 * 4 * 8 * 16 );
          }
     }

     printf( "Done\n" );
     return 0;
}

How can I create as many OpenMP threads as possible, for example more than 32,768?

Best regards,
Sergey

Is serial code faster than the same code moved into threads?


I have run both simple and very complicated experiments in which I take serial code and move it into one or more threads. Execution time roughly doubles after that. I would like to know the scientific explanation. As a result, two processors (cores) are not enough to increase the efficiency of serial code, because we pay a roughly two-times penalty (serial code moved into threads) plus some synchronization time (via events) in some cases.

In my experiments I used the QuickWin library and Digital Fortran to create the multi-threading.

Or am I perhaps mistaken?

C++ AMP Random Number Generator


Hi Sir,

I want to use my RandomNumber() function inside parallel_for_each().

When I build, I get the error below. How do I resolve this problem?

error C3930: 'RandomNumber' : no overloaded function has restriction specifiers that are compatible with the ambient context 'step_size::<lambda_23f843edc800319054a5b250f2700838>::operator ()'

#include <amp.h>
#include <time.h>

using namespace concurrency;

accelerator pick_accelerator()
{
    accelerator chosen_one;
    // query all accelerators and pick one based on your criteria
    for (accelerator acc : accelerator::get_all())
        if (!acc.has_display)
            chosen_one = acc;
    // return default or new one
    return chosen_one;
}

#define MBIG  1000000000   // this is rand3() from Numerical Recipes in C
#define MSEED 161803398    // initiation is separated from the random number
#define MZ    0            // generator for speed
#define FAC   1.0E-9

static int  inext, inextp;
static long ma[56];

void InitRandNumGen(int idum)
{
    long mj, mk;
    int i, ii, k;

    if (idum == -1) {               // use system time for seed
        time_t ltime;
        time(&ltime);
        idum = (int)(ltime);
        idum /= 1000;
    }

    mj = MSEED - (idum < 0 ? -idum : idum);
    mj %= MBIG;
    ma[55] = mj;
    mk = 1;
    for (i = 1; i <= 54; i++) {
        ii = (21 * i) % 55;
        ma[ii] = mk;
        mk = mj - mk;
        if (mk < MZ) mk += MBIG;
        mj = ma[ii];
    }
    for (k = 1; k <= 4; k++)
        for (i = 1; i <= 55; i++) {
            ma[i] -= ma[1 + (i + 30) % 55];
            if (ma[i] < MZ) ma[i] += MBIG;
        }
    inext = 0;
    inextp = 31;
}

inline double RandomNumber()
{
    long mj;

    do {
        if (++inext == 56) inext = 1;
        if (++inextp == 56) inextp = 1;
        mj = ma[inext] - ma[inextp];
        if (mj < MZ) mj += MBIG;
        ma[inext] = mj;
    } while (!mj);

    return (double)(mj * FAC);
}

#undef MBIG
#undef MSEED
#undef MZ
#undef FAC

double step_size()
{
    accelerator acc = pick_accelerator();

    double RNG;
    static const int rank = 1;
    extent<rank> e(1);
    array<double, rank> rand_out_data(e);
    parallel_for_each(e, [=, &rand_out_data](index<1> idx) restrict(amp)
    {
        rand_out_data[idx] = RandomNumber();
    });
    copy(rand_out_data, RNG);
    return RNG;
}

My Best Regards

Huseyin Ozgur Kazanci

Akdeniz University ANTALYA TURKEY

Use of task to prepare a XmlDocument from a file with WinRT library functions.


I was wondering if there is a better way to write a function that reads an XML file, turns it into an XmlDocument object, and then executes a user-specified function on that XmlDocument. Not feeling too expert, I did come up with one solution, but I would like an expert opinion on it, and to know whether there is a better way.

void FileIo::getXML( std::function< void ( Windows::Data::Xml::Dom::XmlDocument ^ passedDoc ) >  myXmlDocHandler )
{
	using namespace Windows::Storage;
	using namespace Windows::Storage::Pickers;
	using namespace Windows::Data::Xml::Dom;
	using namespace concurrency;

	auto picker= ref new FileOpenPicker();
   	picker->FileTypeFilter->Append(".xml");	// Required 
	task< StorageFile^ > getFileTask ( picker->PickSingleFileAsync() ); 

	getFileTask.then([ myXmlDocHandler ]( StorageFile^ xmlFile ) 
	{
		if ( xmlFile != nullptr) 
		{ 
			auto doc= ref new XmlDocument();	
			task< XmlDocument ^ > getDocTask ( doc->LoadFromFileAsync( xmlFile ) ); 
			getDocTask.then( [ myXmlDocHandler ] ( XmlDocument^ doc ) 
			{	
				myXmlDocHandler( doc );
			});
		}
	});

}


//--------------------------------------

// Calling mechanism

	auto lambda = []( Windows::Data::Xml::Dom::XmlDocument ^ xmlDoc ) 
			{
				// Now go process the XML file as you like
				Platform::String ^ nodeName= xmlDoc->NodeName;
			};

	FileIo::getXML(	lambda );


The full context is at:

http://social.msdn.microsoft.com/Forums/en-US/winappswithnativecode/thread/66df8021-6f3e-429c-83b8-392ddc737194

Iterating on the GPU with C++ AMP


Hi, I'm going to code a GPU-based version of Dijkstra search, and since I have just started learning C++ AMP my knowledge is quite limited.

I guess there is a simple answer, but here goes:

				//Create GPU array views of the data
				array_view<const size_t, 1> V(n_nodes, v);  //All vertices (nodes)
				array_view<const size_t, 2> E(n_nodes, nn, e); //All edges
				array_view<const size_t, 2> W(n_nodes, nn, w); //Weights
				
				array_view<int, 1> M(n_nodes, m); //Mask array
				array_view<float, 1> C(n_nodes, c); //Cost array
				array_view<float, 1> U(n_nodes, u); //Updating cost array
				
				// Run the following code on the GPU
				// ---------------------------------
				//while M not Empty do (or while there are still some M[idx] == 1)
				// {
					parallel_for_each(V.extent, [V, E, W, M, C, U] (index<1> idx) restrict(amp)
					{
						if (M[idx] == 1) 
						{
							//Update cost in C and U
							//Update mask array M, setting several more M[newIdx] = 1; //Think of it as a wavefront propagating in all direction through the graph
						}
					});

					//Wait for all threads to finish updating M.
					// I do not want to copy back to CPU between iterations!
				// }
				// ------------- (end GPU code)

As you can see from the code above, what I want to accomplish is to iterate on the GPU, performing a kind of while loop (see the comments in the code). In each iteration I want to perform a parallel execution over all vertices (V) in my graph, then wait until all threads have finished updating the mask array (M), and then perform the next iteration.

So my question is: how can I perform the iterations on the GPU? I don't want to copy the mask array (M) between the GPU and CPU between iterations; I just want to use the updated mask array on the GPU in the next iteration. How can I implement something like the while loop indicated in the code comments above?

Translating some common CUDA constructs to C++ AMP, help


Hello, AMP forum!

My first post here, and some thought play. I've been fiddling with different kinds of CUDA code (I'm not really a CUDA programmer, just a beginner toying with the basics) and now I'd like to try to translate some structures to C++ AMP. I'm a bit unsure about some issues (maybe I'm just too sleepy), even after reading the very helpful "C++ AMP for the CUDA programmer" (and other posts here, the blog, and the C++ AMP book); perhaps someone can nudge my thinking onto the right track.

In any event, I have distilled all of my issues into one piece of pseudo-ish code, followed by a few questions. The calling code has the following definitions in addition to the usual <<<>>> ones:

struct pair { float x1; float x2; };
struct triplet { int t1; int t2; int t3; };

struct triplet zero;
struct triplet out;

cudaMemcpyToSymbol("triplet", &zero, sizeof(triplet), 0, cudaMemcpyHostToDevice);

cudaMemcpyFromSymbol(&out, "triplet", sizeof(struct triplet), 0, cudaMemcpyDeviceToHost);
 

and inside the <<<>>> launch the following function is called:

__launch_bounds__(1024, 16) __global__ void func(pair* pairs, int number_of_pairs, int* indexer, int in_counter)
{
    register int id = threadIdx.x + blockIdx.x * blockDim.x;
    register int pack = blockDim.x * gridDim.x;
    struct triplet out_triplet;
    register int counter = in_counter;
    __shared__ pair shared_pairs[2048];
    __shared__ triplet shared_triplets[1024];

    // Copy pairs to shared pairs.
    for(register int i = threadIdx.x; i < number_of_pairs; i += blockDim.x)
    {
        shared_pairs[i] = pairs[indexer[i]];
    }
    __syncthreads();

    for(register int j = 0; j < counter; j++)
    {
        register int temp_id = id + j * pack;
        pairs[temp_id] = //... some calculation
    }
    __syncthreads();

    // 512 threads.
    if (threadIdx.x < 256)
        if (shared_triplets[threadIdx.x] //... Some code.
    __syncthreads();

    // 256 threads.
    if (threadIdx.x < 128)
        if (shared_triplets[threadIdx.x] //... Some code.
    // And so on until one thread is left.
}

  1. As the data already exists on the host, I think I should put the pairs array into an array_view<pair, some_big_number>. Is this correct? Also, is it true that I don't need to copy the elements to shared memory in any specific way?
  2. According to the aforementioned "C++ AMP for the CUDA programmer", it looks like id would correspond to tiled_index.global[0] and pack to ext[0]. But this confuses me: how should I construct the parallel_for_each loop here, and how do I use "ext[0]"? It seems to revolve around tile_static storage, but I can't figure out how. Maybe I'm just too tired, but could someone give me a nudge here? I hope the snippets I provided are enough; if needed, I can hunt down some real code too.
  3. How do I copy data back from C++ AMP to the out variable, as with cudaMemcpyFromSymbol?





Calling AMP from a managed class library


Hello all,

I'm considering adding some AMP functionality to my managed class library. I found this walkthrough, which looks very helpful: http://blogs.msdn.com/b/pfxteam/archive/2011/09/21/10214538.aspx

My question is the following: is the Win32 DLL containing the AMP code somehow "incorporated" into the managed assembly, so that I only have to distribute my (single) .NET DLL, or will I need to distribute both the .NET DLL and the Win32 DLL? I will have many Win32 DLL files for the AMP code and would prefer not to distribute dozens of unmanaged DLLs with my .NET class library, if possible. Thanks in advance.

-L


Saving a float as an int in AMP


I have been trying to compute a few max/min values that can change in multiple threads. I want to use atomic_fetch_min and atomic_fetch_max, but the values I want to save are floats. I would be fine with saving them as ints, but part of my calculation requires a divide, and I can't find a way to save the result as an int. Here is the code I am having trouble with:

float xMax = (maxDepth - depth) * worldXConst / (width);
float depthSlope = (maxDepth - depth) / xMax;
for (int i = idx[1]; i < fast_math::fmin(idx[1] + xMax, width); i++)
{
	float distance = xMax - (i - idx[1]);
	int max = (int)fast_math::ceilf( depthSlope * (distance + 1));
	int min = (int)fast_math::floorf( depthSlope * distance);
	if(unknown == imageAMP(idx[0], i))
	{ 
		atomic_fetch_min( &minAMP(idx[0], i), min);
		atomic_fetch_max( &maxAMP(idx[0], i), max);
	} 
}

Is there anything I can do about this?

Multi-threading blocks of code in relation to one another


I’m writing a program using managed Visual C++ to extract data from an IT system and store it in a Microsoft Access database. It’s all working fine, but performance is slow, and I was hoping to multi-thread the program. I know the mechanics of how to multi-thread, and the program lends itself well to being cut into chunks, but I need some form of batch control over the order in which these chunks are processed.

The initial extraction is for a number of objects each with their own type and each type is stored in a separate table, table A, B and C.  The extraction extracts objects from their associated tables, for example object A1, A2 & A3 (from table A), B1, B2 & B3 (from the table B) and C1, C2 & C3 (from the table C).  The next extraction pulls the relationship between these objects from another set of tables for example A1-B1, A2-B2 & A3-B3 from a table D and B1-C1, B2-C2 & B3-C3 from a table E.  Obviously it makes sense to extract all the objects before you try to build the relationship between them.

My program currently does this in a fairly linear manner:

  1. Extract table A
  2. Load it into Access
  3. Extract table B
  4. Load it into Access
  5. Extract table C
  6. Load it into Access
  7. Extract table D
  8. Load it into Access
  9. Extract table E
  10. Load it into Access.

What I would like to do is something like this:

  1. Extract table A and B in parallel.  (The source system will support this but I don’t want to kill it by extracting all the tables at the same time)
  2. As soon as either one is finished:
    1. Extract table C
    2. Start the load into Access for the completed job.
  3. As A, B or C finish continue sequentially loading the Access database.
  4. Once A, B & C are finished start the extractions for table D and E irrespective of whether the Access loads have been completed.
  5. Once all the Access loads have finished for table A, B & C wait for the extractions for table D & E to finished and load them as they become available.

Essentially I want to be able to create three queues of work to be completed and be able to process items off the one queue in parallel and off the other two sequentially. 

  • The first queue I would populate as part of the programme with a list of the tables to be extracted (or references to the blocks of code that would do this) and the number that can be extracted at a time in parallel. 
  • The second would be populated by the first programmatically and triggered to run when data becomes available.
  • The third would also be populated by the first programmatically and triggered to run when data is available and the second has completed.

I realise that this is not a question that can be answered in five lines but if someone could point me at some documentation that I could work out the answer from I would be much obliged.


How to write a mutex in C++ AMP to avoid "write after write hazard"?


Greetings!

The question is: do I need a mutex or some other construct, or is the potential write-after-write hazard reported by the debugger warranted? My situation is that I have an output variable in a C++ AMP parallel_for_each, defined as follows:

some_struct out = some_struct();
array_view<some_struct, 1> out_view(1, &out);

parallel_for_each(something.extent.tile<1>(), [=](tiled_index<1>t_idx) //This just shows how the tiled_index is constructed to clarify the subsequent code snippets.

Inside parallel_for_each I have a tile_static array like this

tile_static some_struct some_values[50];

which gets filled in some loops inside the parallel_for_each. At the very end, after all the loops and a barrier, I assign the first value from the tile_static some_values to out_view[0] as follows:

t_idx.barrier.wait();
if(t_idx.local[0] == 0)
{
    if(some_values[t_idx.local[0]].some_value == out_view[0].some_value)
    {
        out_view[0] = some_values[t_idx.local[0]];
    }
}); //End of parallel_for_each.


and here I get a potential write-after-write warning. It looks like it is not real, since after this assignment the parallel_for_each has run its course, and a barrier has been applied before it. Or is there something I'm missing? If I try to move the barrier past the if(t_idx.local[0] == 0) check, the compiler warns: "C3561: tile barrier operation found in control flow that is not tile-uniform when compiling the call graph for the concurrency::parallel_for_each at".


Sudet ulvovat -- karavaani kulkee ("The wolves howl -- the caravan moves on")


Does parallel_for_each(view.extent.tile(), [=](tiled_index t_idx) restrict(amp) give the maximum number of threads or just one?


Hi!

Just to check whether I have understood this correctly or completely backwards: in the following situation, does the code below use just one thread, or does it give each floats_view element its own thread? Moreover, assuming the threads would be applying binary operations to all of the elements, how should the tile and tiled_index be constructed? In a set-up like this, the following works quite nicely; but if I change the tile and tiled_index parameters, the code either crashes outright or becomes slower. (I'm assuming the crash happens because the vector isn't necessarily a power of two in length and that trips up my code, but that's not important for this question.)

//This vector isn't necessarily a power of two in length.
vector<float> float_vector(10);
iota(float_vector.begin(), float_vector.end(), 0);

array_view<const float, 1> floats_view(float_vector.size(), float_vector);


parallel_for_each(floats_view.extent.tile<1>(), [=](tiled_index<1> t_idx) restrict(amp)
{
    //[...]
});




Difference between std::thread and std::async


Hi

I would like to know the difference between std::thread and std::async.

Does std::async use a new thread to complete the task?

What are the limitations of std::async compared to std::thread?

In which scenarios should std::thread be used rather than std::async?

What could be the reason for an inconsistent program abort that seems to happen only under the debugger and GPU software emulator in C++ AMP?


Hi, again!

I found a curious problem:

  1. No crash: When I run or debug C++ AMP program with auto setting and a real GPU hardware.
  2. No crash: When I run without debugger (Debugger Type: GPU Only) and accelerator is GPU software emulator.
  3. Crash: When I run with debugger (Debugger Type: GPU Only) and accelerator is GPU software emulator.

I have run my program multiple times, and it crashes at the very beginning under the conditions described in (3); otherwise it runs to completion.

The crash manifests itself with the message "... 11.0\vc\include\vector | Line 159 | Expression vector + iterator out of range.", where the bars are line feeds in my quote. After that I have to start a new instance of Visual Studio 2012 RC to inspect the stack. Indeed, I see bogus ranges, and I'm rather sure the bogus values originate from write-after-write hazards, regarding my previous question. Though I'm a bit suspicious, as I now do two atomic_exchange operations (as per the linked question, I don't write out to a struct to send a value from a GPU accelerator back to the host, but exchange its values atomically) and both of the indices are bogus. This explanation is, I think, connected to this matter.

But: should or could the program's runtime behaviour really be this inconsistent?


Sudet ulvovat -- karavaani kulkee ("The wolves howl -- the caravan moves on")



Continue Loop on Another Task


Consider the following code:

std::istream& in = /*...*/;

for(int n = 0; n < /*...*/; ++n)
{
	std::vector<int> data(8192);
	in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
	
	process_data(data);
}

I could parallelize it to something like:

std::istream& in = /*...*/;

std::vector<concurrency::task<void>> tasks;

for(int n = 0; n < /*...*/; ++n)
{
	std::vector<int> data(8192);
	in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
	
	tasks.emplace_back(std::bind(&process_data, std::move(data)));
}

concurrency::when_all(tasks.begin(), tasks.end()).wait();

However, the problem with this code is that at the point where I start the processing tasks, all the data is already hot in the cache, and therefore I would like to perform the processing on the same thread as the data reading. I'm unsure how to achieve this...

std::istream& in = /*...*/;

for(/* ??? */)
{
	std::vector<int> data(8192);
	in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
	
	/* Start a task that begins the next loop? */
	
	process_data(data);
}

Partial Memory Uploads?

$
0
0

Let's say I create an array.

concurrency::array<int> ar(1024);

I then fill it with data.

int count = 0;
concurrency::array_view<int> ar_av(ar); // view over the array, so it can be filled from the CPU
for(int n = 0; n < /* ... */; ++n)
{
    ar_av[count++] = /*...*/;    // post-increment so index 0 is used
}

Now let's say that count < ar.extent.size(), so I am not using the entire array ar.

If I do the following:

concurrency::parallel_for_each(concurrency::extent<1>(count), [&](const concurrency::index<1>& idx)
{
     /* ... */
});

Is there a way to hint that not the entire array "ar" needs to be uploaded to the accelerator?


Low Latency Decoding and Small Array Size


I'm working on a video decoder which has to decode each frame as fast as possible with the lowest possible latency; however, I'm having trouble achieving this with C++ AMP. To reduce latency as much as possible I send data to the GPU for processing as soon as possible, i.e. I send DCT blocks to the GPU as soon as they are available.

My code looks something like this:

	concurrency::parallel_for(0, nb_macro_block_rows, 1, [&](int row)
	{
		for(int col = 0; col < nb_macro_block_colums; ++col)
		{
			for(int i = 0; i < nb_dct_per_macro_block; ++i)
			{
				array<int> stage_dct_coeffs(64, accelerator(accelerator::cpu_accelerator).default_view, accelerator(accelerator::default_accelerator).default_view);
				decode_dct_coeffs(stage_dct_coeffs, bit_reader.rows[row]);
								
				result[row].push_back(task<array<int>>(std::bind([](const array<int>& stage_dct_coeffs) -> array<int>
				{
					/* C++ AMP IDCT Code */
				}, std::move(stage_dct_coeffs)))); // Use std::bind in order to avoid copying stage array.
			}
		}
	});

I am a bit worried about whether it is a good idea to work with such small arrays.

  • Allocating such small 64-element stage arrays is VERY slow. Any advice regarding that? Should I pre-allocate and manage a pool of arrays myself? Would using different accelerator views for each parallel_for calling context improve allocation performance?
  • What are the performance characteristics of small blocks with regard to host <-> device transfers and GPU processing? And if there is a difference, what is a good minimum block size to use?


Concurrency - Help with understanding Profiling Results

$
0
0

I've just converted some of my code to C++ AMP and my application now runs a lot slower than before (52 ms per frame compared to 20 ms per frame before).

I have tried to use the VS2012 Profiler to analyze what is happening, however I am not making much sense of it. The task scheduler seems to have a huge overhead here.

Anyone with any insight?

My initial guess is that this is due to the C++ AMP memory transfers between host <-> device (copy_async(/*...*/).then(/*...*/)) and what I am seeing is simply the task scheduler waiting/spinning for new work?




declaring array / array_view inside GPU code


Hi, I'm pretty new at this, and I searched the forums/blogs and couldn't find a solution, so if my problem is really basic or already answered, I apologise. I'll outline what I'm trying to do, because each time I try something new it doesn't work for a different reason.

The problem is that I have a function inside a parallel_for_each loop which takes an array_view<float> SR and modifies it. For example:

const int TrajTotal = 1;  //this will be large
const int L = 6;

...

concurrency::extent<1> Trajectories(TrajTotal);

//data to be modified
vector<float> ParaVector(L, 0.0f);							
array_view<float> Parameters(L, ParaVector);
Parameters.discard_data();


parallel_for_each( 
        Trajectories, 
        [=, &IL](index<1> idx) restrict(amp)
    {
...

RK4(Parameters, dN);
...
    }
);

...

void RK4(array_view<float> &SR, float dt) restrict(amp){

array_view<float> K1(SR);  

for(int l = 0; l < L; l++){
K1[l] = F(SR[0], l)*dt;
SR[l] = SR[l] + K1[l];
}

}

I want RK4 to modify the contents of Parameters, so I should pass by reference. Like this it doesn't compile. For the line 

RK4(Parameters, dN);

it throws up the error:

1>Main Code.cpp(57): error C2664: 'RK4' : cannot convert parameter 1 from 'const Concurrency::array_view<_Value_type>' to 'Concurrency::array_view<_Value_type> &'

Somehow Parameters is a const array_view? If I change &SR to SR in the function argument it does compile, but then assigning a value to K1 wipes the value of SR, so it forgets its previous value (even though I am now passing by value?).

Sorry again if this is a simple problem (it seems like one), but I would appreciate any help/advice. It is probably just me having no knowledge about array/array_view.

Many Thanks


