Channel: Parallel Computing in C++ and Native Code forum

How to most efficiently initialize a large tile static 1D array in 2D tile parallel_for_each?


Hi!

I've been wondering: what is the most efficient way to initialize a large 1D tile_static array when the compute domain is 2D? One way would be to restrict the work to a single thread, as in the following snippet, but that probably has more severe performance implications than necessary.

The (somewhat contrived) example code:

static const int TS = 16;
static const int BLOCKS = 1024 / TS;

auto domain = concurrency::extent<2>(BLOCKS * TS, BLOCKS * TS).tile<TS, TS>();
parallel_for_each(domain.pad(), [=](concurrency::tiled_index<TS, TS> t_idx) restrict(amp)
{
    tile_static int out_values[1024];
    if(t_idx.local[0] == 0 && t_idx.local[1] == 0)
    {
        for(int k = 0; k < 1024; ++k)
        {
            out_values[k] = 0;
        }
    }
    t_idx.barrier.wait();
});
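For comparison, a sketch of the cooperative variant I have in mind, where all TS*TS threads in the tile share the initialization in strides (the flattened local index is my own construction, not verified as optimal):

parallel_for_each(domain.pad(), [=](concurrency::tiled_index<TS, TS> t_idx) restrict(amp)
{
    tile_static int out_values[1024];
    // flattened local thread id in [0, TS*TS)
    const int flat = t_idx.local[0] * TS + t_idx.local[1];
    // each thread zeroes every (TS*TS)-th element
    for(int k = flat; k < 1024; k += TS * TS)
    {
        out_values[k] = 0;
    }
    t_idx.barrier.wait();
});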

C++ AMP - Dynamic Parallelism


I've just read about the new Dynamic Parallelism feature of the latest Nvidia cards. This seems like quite a powerful feature, and I'm curious about how it might fit into future C++ AMP versions and what the semantics might look like.

Would it simply consist of nested invocation of restrict(amp) parallel_for_each? Or would it require entirely different semantics?
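For instance, would something like this be the natural shape? (Purely hypothetical syntax; C++ AMP does not support this today.)

// Hypothetical: a kernel spawning child work directly on the accelerator.
parallel_for_each(outer_extent, [=](concurrency::index<1> i) restrict(amp)
{
    parallel_for_each(inner_extent, [=](concurrency::index<1> j) restrict(amp)
    {
        // child kernel body
    });
});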

Concurrency::task_group leaks memory


Hi.

I have an app that uses the concurrency::task_group class to execute periodic tasks every few seconds.
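Roughly, the pattern looks like this (a minimal sketch with assumed names; the real code schedules the work from a timer callback):

#include <ppl.h>

concurrency::task_group tasks; // long-lived

void on_timer_tick() // fires every few seconds
{
    tasks.run([] { /* periodic work */ });
    tasks.wait(); // wait for this round of tasks to finish
}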

I noticed that Private Bytes performance counter of my app is increasing over time.

After inspecting my app using LeakDiag tool I got the following call stack for the leaked memory:

<STACK numallocs="013716" size="0120" totalsize="01645920">
<STACKSTATS>
<SIZESTAT size="0120" numallocs="013716"/>
<HEAPSTAT handle="830000" numallocs="013716"/>
</STACKSTATS>
<FRAME num="0" dll="MSVCR100.dll" function ="malloc" offset="0x4B" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\malloc.c" line="89" addr="0x73540269" />
<FRAME num="1" dll="MSVCR100.dll" function ="operator new" offset="0x1F" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\new.cpp" line="59" addr="0x7354233B" />
<FRAME num="2" dll="MSVCR100.dll" function ="Concurrency__details___TaskCollection___Alias" offset="0x55" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\taskcollection.cpp" line="846" addr="0x73585486" />
<FRAME num="3" dll="MSVCR100.dll" function ="Concurrency__details___TaskCollection___Schedule" offset="0x36" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\taskcollection.cpp" line="1009" addr="0x7358563F" />
<FRAME num="4" dll="gvipcserver.dll" function ="GVSI__Actor__Enqueue" offset="0xA0" filename="c:\users\draganm\gvsi\tfs_release2\server\streamingserver\src\gvipcserver\actor.cpp" line="25" addr="0x4156D0" />
<FRAME num="5" dll="gvipcserver.dll" function ="GVSI__SessionController__RemoteSessionsReceived" offset="0x9A" filename="c:\users\draganm\gvsi\tfs_release2\server\streamingserver\src\gvipcserver\sessioncontroller.cpp" line="250" addr="0x471D9A" />
<FRAME num="6" dll="gvipcserver.dll" function ="std__tr1___Impl_no_alloc0&lt;std__tr1___Callable_obj&lt;GVSI__`anonymous namespace&apos;__&lt;lambda0&gt;,0&gt;,void&gt;___Do_call" offset="0x28" filename="c:\program files (x86)\microsoft visual studio 10.0\vc\include\xxfunction" line="66" addr="0x44B348" />
<FRAME num="7" dll="MSVCR100.dll" function ="Concurrency__details___UnrealizedChore___UnstructuredChoreWrapper" offset="0xF4" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\chores.cpp" line="204" addr="0x73575EC8" />
<FRAME num="8" dll="MSVCR100.dll" function ="Concurrency__details__InternalContextBase__ExecuteChoreInline" offset="0x50" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\internalcontextbase.cpp" line="1384" addr="0x73579D41" />
<FRAME num="9" dll="MSVCR100.dll" function ="Concurrency__details__FreeThreadProxy__Dispatch" offset="0x48" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\freethreadproxy.cpp" line="161" addr="0x73579291" />
<FRAME num="10" dll="MSVCR100.dll" function ="Concurrency__details__ThreadProxy__ThreadProxyMain" offset="0x22" filename="f:\dd\vctools\crt_bld\self_x86\crt\src\threadproxy.cpp" line="164" addr="0x7358671E" />
<FRAME num="11" dll="kernel32.dll" function ="BaseThreadInitThunk" offset="0x12" filename="" line="" addr="0x75B8339A" />
<FRAME num="12" dll="ntdll.dll" function ="RtlInitializeExceptionChain" offset="0x63" filename="" line="" addr="0x77C99EF2" />
<FRAME num="13" dll="ntdll.dll" function ="RtlInitializeExceptionChain" offset="0x36" filename="" line="" addr="0x77C99EC5" />
<STACKID>1F36D708</STACKID>
</STACK>

It seems that the Concurrency::details::TaskCollection::Alias function is allocating the leaked memory.

Does anyone know why this is happening and how to solve it?

Regards,

Dragan

C++ AMP: Do GPUs implement protected memory?


Hi all,

Do GPUs (those supported by C++ AMP) implement some kind of memory protection scheme? I.e., do I need to worry about other GPGPU/graphics applications inadvertently modifying the memory my C++ AMP computations are operating on?

(Background: Medical application development, strict safety regulations, and iterative algorithms where an error might be very hard to detect -- both at runtime by the algorithm itself as well as by the operator examining the final result)

(Just to be clear, I am not asking about anything ECC related)

Cheers,

T

16 bit float SIMD optimization?


Given the following:

		// coeffs is array<int>, idct is float[8][8]
		tile_static float dct[8][8];

		dct[y][x] = coeffs[t_idx.global];

		t_idx.barrier.wait_with_tile_static_memory_fence();

		tile_static float row_sums[8][8];

		precise_float row_sum = 0.0;
		row_sum += dct[y][0] * idct[x][0];
		row_sum += dct[y][1] * idct[x][1];
		row_sum += dct[y][2] * idct[x][2];
		row_sum += dct[y][3] * idct[x][3];
		row_sum += dct[y][4] * idct[x][4];
		row_sum += dct[y][5] * idct[x][5];
		row_sum += dct[y][6] * idct[x][6];
		row_sum += dct[y][7] * idct[x][7];

		row_sums[x][y] = row_sum;

		t_idx.barrier.wait_with_tile_static_memory_fence();

My calculations only need 16 bit accuracy. Is there any way I can make use of this?

I guess what I'd like to achieve is something like:

		precise_float row_sum = 0.0;
		row_sum += dot(float_4(dct[y][0], dct[y][1], dct[y][2], dct[y][3]), float_4(idct[x][0], idct[x][1], idct[x][2], idct[x][3]));
		row_sum += dot(float_4(dct[y][4], dct[y][5], dct[y][6], dct[y][7]), float_4(idct[x][4], idct[x][5], idct[x][6], idct[x][7]));

		// float_4 as 4x 16 bit?

Is this possible to achieve using array (with some form of type casting?), or do I have to first copy the data to a 16 bit four-component texture?
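If the texture route is the answer, I assume it would look something like this (a sketch; my assumption is that the bits_per_scalar_element constructor parameter is the right knob):

#include <amp.h>
#include <amp_graphics.h>
using namespace concurrency;
using namespace concurrency::graphics;

// Sketch: a four-component texture held at 16 bits per scalar element.
texture<float_4, 2> tex16(extent<2>(8, 8), 16u);
// The data would then be copied in (e.g. with concurrency::graphics::copy).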

Also, just a quick question, why is float_x in the graphics namespace?


constant memory vs shared memory


I've noticed some strange behavior I was hoping to get some clarity on. It seems that shared memory (i.e. tile_static) is faster than constant memory (i.e. capture by value).

e.g. The following is faster:

		tile_static float dct[8][8];
		tile_static float local_idct[8][8];

		dct[y][x] = coeffs[t_idx.global];
		local_idct[y][x] = idct[y][x];

		t_idx.barrier.wait_with_tile_static_memory_fence();
				
		tile_static float row_sums[8][8];

		precise_float row_sum = 0.0;
		row_sum += dct[y][0] * local_idct[x][0];  
		row_sum += dct[y][1] * local_idct[x][1];  
		row_sum += dct[y][2] * local_idct[x][2];  
		row_sum += dct[y][3] * local_idct[x][3];  
		row_sum += dct[y][4] * local_idct[x][4];  
		row_sum += dct[y][5] * local_idct[x][5];  
		row_sum += dct[y][6] * local_idct[x][6];  
		row_sum += dct[y][7] * local_idct[x][7];  

		row_sums[x][y] = row_sum;

than:

		tile_static float dct[8][8];

		dct[y][x] = coeffs[t_idx.global];

		t_idx.barrier.wait_with_tile_static_memory_fence();
				
		tile_static float row_sums[8][8];

		precise_float row_sum = 0.0;
		row_sum += dct[y][0] * idct[x][0];  
		row_sum += dct[y][1] * idct[x][1];  
		row_sum += dct[y][2] * idct[x][2];  
		row_sum += dct[y][3] * idct[x][3];  
		row_sum += dct[y][4] * idct[x][4];  
		row_sum += dct[y][5] * idct[x][5];  
		row_sum += dct[y][6] * idct[x][6];  
		row_sum += dct[y][7] * idct[x][7];  

		row_sums[x][y] = row_sum;

I am a bit curious about this: how come constant memory is slower than shared memory? I thought constant memory was cached.

8 bit vs 16 bit textures inside kernels


When working with 8 or 16 bit textures inside a C++ AMP kernel, at what accuracy are calculations performed?

e.g.

texture<float_4, 1> tex_8bit(/*...*/); // 8 bits per scalar element

parallel_for_each(/*...*/, [&](index<1> idx) restrict(amp)
{
    float_4 test = tex_8bit[idx] * tex_8bit[idx + 1]; // 4x 8 bit?
});

texture<float_4, 1> tex_16bit(/*...*/); // 16 bits per scalar element

parallel_for_each(/*...*/, [&](index<1> idx) restrict(amp)
{
    float_4 test = tex_16bit[idx] * tex_16bit[idx + 1]; // 4x 16 bit?
});


A nudge with two-dimensional matrix reduction across all tiles


Greetings (again)!

This is related to my previous question, with which Dragon89 was already rather helpful. Basically, I have code that runs correctly on a CPU. I transformed it for the GPU, but it isn't that efficient; a lot of that has to do with how I do some matrix operations and especially with reducing the results. This post is a bit lengthy as I provide a lot of code, but please bear with me here, as this exercise contains many separate things I'm trying to wrap my head around. :)

I have some doubts with the following issues:

  • Should I have reduction scratchpads (tile_static arrays) per tile, or just one used across all tiles? The context is in the following code snippet. This is my main problem: in the snippet I have a data race I haven't figured out how to handle efficiently. It's related to the fact that the processing happens in 2D.
  • How could I utilize tile_static memory and threads more efficiently when processing a 2D matrix (see the code that follows)? This is somewhat secondary.
struct best_pair
{
    int best1;
    int best2;

    best_pair() restrict(cpu, amp) : best1(0), best2(0) {}
    best_pair(int b1, int b2) restrict(cpu, amp) : best1(b1), best2(b2) {}
};

inline void max_change(best_pair& a, best_pair const& b) restrict(amp)
{
    a = (a.best1 > b.best1) ? a : b;
}


//This function processes a huge matrix and should return a pair of numbers as a result of the processing.
//Acquiring the numbers requires a reduction step,
//which is a bit of a problem for me to do efficiently now, see comments (also for the purpose of the arguments)...
best_pair amp_exercise(std::vector<int> matrix, std::vector<int> processing_indices, int matrix_dimension)
{
    concurrency::array_view<int, 1> processing_indices_view(matrix_dimension, processing_indices);

    static const int TS = 32;
    static const int BLOCKS = 1024 / TS;

    //Here the idea would be to collect either the best value per tile or just one across all tiles.
    std::array<best_pair, TS> out_pairs;
    concurrency::array_view<best_pair, 1> out(out_pairs.size(), out_pairs);

    //The matrix is in row major order. Note that it doesn't need to be read out.
    const concurrency::array_view<const int, 2> matrix_view(matrix_dimension, matrix_dimension, matrix);
    auto domain = concurrency::extent<2>(BLOCKS * TS, BLOCKS * TS).tile<TS, TS>();
    parallel_for_each(domain.pad(), [=](concurrency::tiled_index<TS, TS> t_idx) restrict(amp)
    {
        best_pair best = best_pair(99999, 9999);

        //Initialize this array. I'm not sure if it'd be better
        //if this were per tile or across all tiles. That is, should
        //the size be the size of all tiles or the size of the
        //matrix dimension given as a parameter?
        //In any event,
        //Dragon89 provided some ideas for this, so the following
        //initialization routine is just a "place holder"...
        tile_static best_pair best_pairs[TS];
        if(t_idx.local[0] == 0 && t_idx.local[1] == 0)
        {
            for(int k = 0; k < TS; ++k)
            {
                best_pairs[k] = best;
            }
        }

        t_idx.barrier.wait();

        //There probably is a more efficient method
        //to process through the matrix.
        //Note that the "upper-triangular" processing is actually
        //only used to generate indices together with
        //processing_indices_view, which is also used
        //outside of this kernel. That is, the access pattern
        //to the matrix itself isn't upper-triangular.
        for(int i = 0; i < matrix_dimension - 1; ++i)
        {
            for(int j = i + 1; j < matrix_dimension - 1; ++j)
            {
                int a = processing_indices_view[i];
                int b = processing_indices_view[i + 1];
                int c = processing_indices_view[j];
                int d = processing_indices_view[j + 1];

                int a1 = matrix_view[a][b];
                int a2 = matrix_view[a][c];
                //Etc. calculations with the combinations of the matrix
                //entries obtained from the above code.
                //The result is stored and processed as follows
                //for illustration.

                //Now, here lies the main rub. This will
                //generate a data race, for which I haven't found an
                //efficient method to handle. This is linked to the
                //question of how I should gather the "best result" of all
                //these operations. This probably would need
                //some kind of indexing with the thread id, but how to
                //ensure the index stays within tile_static bounds?
                int result = a1 + a2;
                if(result > best.best1)
                {
                    best.best1 = a1;
                    best.best2 = a2;
                    best_pairs[t_idx.local[0]] = best;
                }
            }
        }
        t_idx.barrier.wait();

        //Apart from the data race in the previous if clause,
        //this looks like it works all right if the
        //scratchpad is made big enough. Though, still,
        //should the reduction be per tile or across
        //all tiles, so that the result from this
        //parallel_for_each is only one best_pair
        //or an array-ful of them?
        //Reduction tiled_3 from http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/08/parallel-reduction-using-c-amp.aspx.
        for(int s = TS / 2; s > 0; s /= 2)
        {
            if(t_idx.local[0] < s)
            {
                max_change(best_pairs[t_idx.local[0]], best_pairs[t_idx.local[0] + s]);
            }

            t_idx.barrier.wait();
        }

        //Store the tile result in the global memory.
        if(t_idx.local[0] == 0)
        {
            auto i = t_idx.tile[0];
            out[i] = best_pairs[0];
        }
    });

    out.synchronize();
    auto best = *std::max_element(out_pairs.begin(), out_pairs.end(),
        [](best_pair const& a, best_pair const& b) { return a.best1 < b.best1; });

    return best;
}



Using find with concurrent_unordered_map

C++ AMP - Oversubscribing or Cooperative Blocking?


How does C++ AMP handle blocking? 

Running the Concurrency Visualizer, I noticed that I have a lot of extra worker threads running: practically one extra worker thread (which is "synchronizing") for every active parallel_for_each, wait, or copy call, which in my specific case is around 32 extra worker threads.

Is this by design? Or am I possibly doing something wrong?

error C2275: 'FILE' : illegal use of this type as an expression


My program is below; it is an MPI program, and it shows this error: error C2275: 'FILE' : illegal use of this type as an expression.

Please help me!


int main(int argc, char *argv[])
{
    int i, j;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    startT = MPI_Wtime();

    if(id == MASTER)
    {
        A = (int *)malloc((r) * sizeof(int));
        B = (int *)malloc((s) * sizeof(int));
        C = (int *)malloc((r + s) * sizeof(int));

        FILE *finA;
        FILE *finB;
        finA = fopen("c:/a/resultA.txt", "r");
        finB = fopen("c:/a/resultB.txt", "r");

        for(i = 0; i < r; i++){
            fscanf(finA, "%d", &A[i]);
        }

        for(i = 0; i < s; i++){
            fscanf(finB, "%d", &B[i]);
        }

        MPI_Bcast(&A, r, MPI_INT, 1, MPI_COMM_WORLD);
        MPI_Bcast(&B, s, MPI_INT, 1, MPI_COMM_WORLD);
        parallel_merge(A, B);
    }

    stopT = MPI_Wtime();

    fclose(fin);
    getch();

    MPI_Finalize();
}
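(From searching around, I gather that MSVC compiles .c files under C89 rules, where all declarations in a block must come before the first statement; declaring FILE *finA after the malloc calls would then be what triggers C2275. A minimal sketch of the rearrangement that seems to compile, illustrative only:)

#include <stdio.h>

int main(void)
{
    FILE *finA;                              /* declarations first under C89 */
    int value = 0;

    finA = fopen("c:/a/resultA.txt", "r");   /* statements afterwards */
    if(finA != NULL)
    {
        fscanf(finA, "%d", &value);
        fclose(finA);
    }
    return 0;
}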

array_view variable causes stack overflow (amp) when using a for() loop inside parallel_for_each()


Hi.

I am porting the marching cubes CUDA sample to C++ AMP.

My problem is a stack overflow (amp) exception with the code below.

The problem occurs in the debug configuration only; release is OK.

The source is very long and complex, so I have reduced it to what I guess is the problem code.

The problem occurs when I use 'numVerts' in the for loop.

'numVerts' is a value from 'numVertsTable' (an array_view).

When I change the code to 'for(uint j=0; j<100; j+=3)', it works fine.

I don't know why an array_view variable causes a stack overflow in a for loop.

How can I avoid it?

Any comment would be helpful.

array_view<const uint, 1> d_numVertsTable(256, numVertsTable);

parallel_for_each(e.tile<1, 1, NTHREADS>(),
    [=](tiled_index<1, 1, NTHREADS> t_idx) restrict(amp)
{
    uint numVerts = d_numVertsTable[cubeindex];

    for(uint j = 0; j < numVerts; j += 3)
    {
        //...
    }
});


Pack two short into a single int


I want to pack two signed shorts into an int in order to reduce a memory bottleneck when passing data between two kernels.

I've read the "C++ AMP: It's got character, but no char!" post and based my implementation on that.

template <typename T>
int read_s16(T& arr, int idx) restrict(amp, cpu)
{
	// Select the low or high 16-bit half of the int slot, then
	// sign-extend it via the arithmetic left/right shift pair.
	return static_cast<int>((arr[idx/2] >> ((idx % 2) * 16)) << 16) >> 16;
}
	
template<typename T>
void write_s16(T& arr, int idx, int val) restrict(amp)
{
	// NOTE: arr is zero initialized
	concurrency::atomic_fetch_or(&arr[idx/2], (val & 0xFFFF) << ((idx % 2) * 16));
}
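For illustration, a usage sketch of the helpers (the 'packed' array, its size, and the index math are hypothetical):

const int num_pairs = 64;                        // illustrative size
std::vector<int> zeros(num_pairs, 0);
concurrency::array<int, 1> packed(num_pairs, zeros.begin()); // zero-filled, as write_s16 requires

concurrency::parallel_for_each(packed.extent, [&packed](concurrency::index<1> i) restrict(amp)
{
    write_s16(packed, 2 * i[0],     -123);       // low 16 bits of int slot i
    write_s16(packed, 2 * i[0] + 1,  456);       // high 16 bits of int slot i
});

// A later kernel would read them back with, e.g.:
//   int lo = read_s16(packed, 2 * i[0]);        // -123
//   int hi = read_s16(packed, 2 * i[0] + 1);    // 456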

However, it is rather complicated; I would like to suggest that something like __half2float be added to C++ AMP.


Is serial code quicker than the same code transformed to threads?


I have run both simple and very complicated experiments in which I take serial code and put it into one thread (or several threads). Execution time roughly doubles after that. I'd like to know the scientific explanation. As a result, two processors (cores) are not enough to increase the efficiency of serial code, because we get a twofold delay (serial code to threads) plus, in some cases, some time for synchronization (by events).

In my experiments I used the QuickWin library and Digital Fortran to create the multi-threading.

Or maybe I am mistaken?

C++ AMP: CPU fallback, WARP, Windows7, DirectX 11.1


Hi all,

I was trying to figure out the requirements for using the warp accelerator and in general trying to understand the software stack[*] C++ AMP is built upon when I ran across http://msdn.microsoft.com/en-us/library/gg615082.aspx:

What is WARP?

WARP is a high speed, fully conformant software rasterizer. It is a component of the DirectX graphics technology that was introduced by the Direct3D 11 runtime.

From information on http://www.danielmoth.com/Blog/Running-C-AMP-Kernels-On-The-CPU.aspx and this forum, I assumed WARP was integrated into the OS and thus not (easily) upgradable, which would explain why the WARP accelerator was only to be available on Windows 8. The above instead indicates that it is part of DirectX. Windows 8 will be released with DirectX 11.1, while Windows 7 currently only supports DirectX 11.0. I guess (?) you are not at liberty to say whether DirectX 11.1 will be made available for Windows 7, so let me ask a slightly modified question:

If DirectX 11.1 was to be released for Windows 7, would that mean that the warp accelerator would be available?

Cheers,

T

*Is there a diagram of the Windows/Visual C++/DirectX software stack below/around C++ AMP available somewhere?


Incorporating AMP from a managed DLL built using "AnyCPU"?


I have a large managed library that is built using the "AnyCPU" specifier and targeting .NET v4 Client Profile. I would like to add a GPU branch to the library and am investigating using AMP for this purpose. I saw the following example on the AMP blog:

http://blogs.msdn.com/b/pfxteam/archive/2011/09/21/10214538.aspx

I'm a little confused, however... do I need to specify "x86" for the managed DLL in order to use AMP? If so, does this mean that it is not possible to build a 64-bit managed library that can P/Invoke into an AMP function? Lastly, do I need to target v4.5 of .NET to use AMP even if I'm not using any v4.5 features in the managed (non-AMP) portion of my library? Thank you in advance,
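For reference, the native side I have in mind follows the walkthrough's shape, roughly like this (a sketch with illustrative names, not my actual code):

#include <amp.h>

// Illustrative native export that the managed library would P/Invoke into.
extern "C" __declspec(dllexport) void square_array(float* arr, int n)
{
    concurrency::array_view<float, 1> av(n, arr);
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> i) restrict(amp)
    {
        av[i] = av[i] * av[i];
    });
    av.synchronize();
}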

-L

ReadLinesAsync( ) produces access violation


I have written a function which asynchronously opens and reads a text file and writes the lines into a vector. I am receiving an access violation. Why?

Windows::Foundation::Collections::IVector<Platform::String^>^ FileHandler::readFileReturnVectorWithLines(Platform::String^ filename)
{
	Windows::Foundation::Collections::IVector<Platform::String^>^ vector_of_lines;
	Windows::ApplicationModel::Package^ package = Windows::ApplicationModel::Package::Current;
	Windows::Storage::StorageFolder^ installedLocation = package->InstalledLocation;
	create_task(installedLocation->GetFileAsync(filename)).then([&vector_of_lines](Windows::Storage::StorageFile^ file)
	{
		return FileIO::ReadLinesAsync(file);
	}).then([&vector_of_lines](Windows::Foundation::Collections::IVector<Platform::String^>^ lines)
	{
		// The following line produces an access violation!
		vector_of_lines = lines;
	});
	return vector_of_lines;
}
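A sketch of the alternative I am considering: return the task itself, so nothing touches the lines before the continuation has run (the Async-suffixed function name is my own):

// Sketch: return the task instead of the vector; the caller decides when to wait.
concurrency::task<Windows::Foundation::Collections::IVector<Platform::String^>^>
FileHandler::readFileReturnVectorWithLinesAsync(Platform::String^ filename)
{
	auto installedLocation = Windows::ApplicationModel::Package::Current->InstalledLocation;
	return create_task(installedLocation->GetFileAsync(filename))
		.then([](Windows::Storage::StorageFile^ file)
	{
		// The returned IAsyncOperation is unwrapped to task<IVector<String^>^>.
		return FileIO::ReadLinesAsync(file);
	});
}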

Calling AMP from a managed class library


Hello all,

I'm considering adding some AMP functionality to my managed class library. I saw this walkthrough, which looks very helpful: http://blogs.msdn.com/b/pfxteam/archive/2011/09/21/10214538.aspx

My question is the following: is the Win32 dll containing the AMP code somehow "incorporated" into the managed assembly so that I only have to distribute my (single) .NET dll, or will I need to distribute both the .NET dll AND the Win32 dll? I will have many Win32 dll files for the AMP code and would prefer not to have to distribute dozens of unmanaged dlls with my .NET class library, if possible. Thanks in advance.

-L

Assigning kernels to specific accelerators in AMP


Hello all, does anyone know how to assign a kernel to run on a particular accelerator when programming with AMP?
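To make the question concrete: is the parallel_for_each overload that takes an accelerator_view the intended mechanism? Something like this (a sketch; the device choice is illustrative):

#include <amp.h>
#include <vector>
using namespace concurrency;

// Sketch: run a kernel on an explicitly chosen accelerator by passing
// its default_view to parallel_for_each.
std::vector<accelerator> accs = accelerator::get_all();
accelerator chosen = accs.front(); // illustrative; assumes at least one device

extent<1> ext(1024);
parallel_for_each(chosen.default_view, ext, [=](index<1> i) restrict(amp)
{
    // kernel body
});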

-L

Is "accelerator::get_all()" deterministic, and what is its overhead?


Hello all, two quick questions:

1) Is the ordering of the accelerator objects in the vector returned from "accelerator::get_all()" deterministic? In other words, assuming someone doesn't physically remove a device from their computer during a running process, will this function call always return the same accelerators in the same order regardless of how many times it is invoked?

2) What sort of overhead is involved in calling "accelerator::get_all()"? I ask because I see Daniel Moth's example of targeting a specific accelerator involves acquiring all the accelerators and choosing one to target. Is the latency associated with this pre-processing step anything to worry about?
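For context, the pattern I'm referring to looks roughly like this (a sketch; the selection criterion is illustrative):

#include <amp.h>
#include <algorithm>
#include <vector>
using namespace concurrency;

// Sketch: enumerate accelerators once, pick one, and cache the choice
// so get_all() isn't called on every kernel launch.
std::vector<accelerator> accs = accelerator::get_all();
auto it = std::find_if(accs.begin(), accs.end(),
    [](const accelerator& a) { return !a.is_emulated; }); // illustrative criterion
accelerator chosen = (it != accs.end()) ? *it
                                        : accelerator(accelerator::default_accelerator);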

-L

