Hello, what is the difference between fast_math::pow and fast_math::powf? They have identical descriptions in the documentation.
-LKeene
I have an operation to perform on an array that has to be implemented in two passes. I don't want to copy the buffer from the GPU to the host after the first pass, since it is needed for the second pass. If I don't call "array_view.synchronize()" after the first pass, will this bypass the buffer copy and leave the data on the GPU for the second pass, and will the two "parallel_for_each" kernels be invoked synchronously? For example:
array_view<float, 2> DataBuffer(bufferExtent, &dataBuffer[0]);
// Perform first pass processing:
parallel_for_each(someAccelerator.default_view, bufferExtent,[=](index<2> bufferIndex) restrict(amp)
{
.
.
.
});
// Do not call "DataBuffer.synchronize()" i.e. leave data on GPU and perform 2nd pass processing.
// Will this wait for the above "parallel_for_each" to complete before beginning?:
// 2nd pass processing:
parallel_for_each(someAccelerator.default_view, bufferExtent,[=](index<2> bufferIndex) restrict(amp)
{
.
.
.
});
// Processing complete. Copy data back to host:
DataBuffer.synchronize();
-Lkeene
Hi All,
I want to use parallel_for with combinable class.
I understand that I can limit the number of threads running IN PARALLEL with a SchedulerPolicy using the MaxConcurrency key. This works as expected. However, parallel_for uses more threads than defined by MaxConcurrency, and as a consequence the combinable object needs more nodes.
Since my tasks consume a lot of memory, I have to limit the number of threads involved in parallel_for.
Is there any chance to do this?
Thanks and regards,
Klaus
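For illustration, a minimal sketch (placeholder values, not the code from the post) of the combination described above: a SchedulerPolicy capped with the MaxConcurrency key, a parallel_for, and a combinable used for per-thread accumulation:

// Sketch: limiting scheduler concurrency and combining per-thread partial sums.
#include <ppl.h>
#include <concrt.h>
#include <functional>
#include <iostream>

int main()
{
    using namespace concurrency;

    // Cap the number of virtual processors the default scheduler may use.
    SchedulerPolicy policy(1, MaxConcurrency, 4);
    CurrentScheduler::Create(policy);

    combinable<long long> partialSums;

    // parallel_for may still touch more worker threads than MaxConcurrency
    // over the lifetime of the loop, which is what produces the extra
    // combinable "nodes" described above.
    parallel_for(0, 1000000, [&](int i)
    {
        partialSums.local() += i;
    });

    std::cout << "sum = " << partialSums.combine(std::plus<long long>()) << std::endl;

    CurrentScheduler::Detach();
    return 0;
}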
Hello all. For my image processing software I need to be able to pass in byte and unsigned short arrays into AMP kernels (from C#). Thanks to some help from the people on this forum I've got the byte portion working well but now I'm stuck on the unsigned short functionality. To operate on bytes in AMP I'm using the code from the Steve Deitz essay here: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/01/17/c-amp-it-s-got-character-but-no-char.aspx
I just need to read/write my byte and unsigned short arrays. In Steve's unsigned char example, he first reinterprets the input unsigned char array as an unsigned int array:
array_view<unsigned int> d_data((size+3)/4, reinterpret_cast<unsigned int*>(data.data()));
Then he provides two helper functions to read/write a byte at a given location in the array:
// Read: idx >> 2 selects the 32-bit word, (idx & 0x3) << 3 is the bit offset
// (0, 8, 16 or 24) of the byte within that word.
template <typename T>
unsigned int read_uchar(T& arr, int idx) restrict(amp)
{
    return (arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3))) >> ((idx & 0x3) << 3);
}

// Write: the first xor clears the target byte (x ^ (x & mask) == x & ~mask),
// the second xor writes the new value into the now-zeroed byte.
template <typename T>
void write_uchar(T& arr, int idx, unsigned int val) restrict(amp)
{
    atomic_fetch_xor(&arr[idx >> 2], arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3)));
    atomic_fetch_xor(&arr[idx >> 2], (val & 0xFF) << ((idx & 0x3) << 3));
}
Even with his diagram I'm having a heck of a time figuring out how to modify these bitwise operations for unsigned short. Could someone please show how the cast and the two helper functions can be altered to allow read/write capability for 16-bit unsigned ints? Thank you very much in advance.
-LKeene
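Not from Steve's article, but here is how the same packing scheme might be adapted for 16-bit values, assuming two unsigned shorts are packed per 32-bit word (word index idx >> 1, shift of 0 or 16 bits). The same caveat about the two-step atomic write to a shared word applies as in the uchar version:

// Sketch: pack two unsigned shorts per unsigned int (an assumption, not from the article).
// 'size' is the number of 16-bit elements; the buffer is assumed to be padded
// to a whole number of 32-bit words.
array_view<unsigned int> d_data((size + 1) / 2, reinterpret_cast<unsigned int*>(data.data()));

// Read the 16-bit value at element index 'idx'.
template <typename T>
unsigned int read_ushort(T& arr, int idx) restrict(amp)
{
    return (arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4))) >> ((idx & 0x1) << 4);
}

// Write the 16-bit value 'val' at element index 'idx'.
template <typename T>
void write_ushort(T& arr, int idx, unsigned int val) restrict(amp)
{
    // Clear the target half-word, then xor in the new value
    // (same two-step atomic trick as the uchar version).
    atomic_fetch_xor(&arr[idx >> 1], arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4)));
    atomic_fetch_xor(&arr[idx >> 1], (val & 0xFFFF) << ((idx & 0x1) << 4));
}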
Hi,
I want to use a call block to process data that arrives as input (through callbacks) from multiple threads. I also have the requirement that parts of the input data must be processed by the same thread every time, in order to serialize it. Think of it as data arriving for different TCP sessions: I want the data belonging to the same TCP session to be processed in FIFO order.
If I have multiple concurrent TCP sessions, how can I express this with a call target so that the data is processed in parallel, with the restriction that data from a given TCP session is processed on the same thread? Is there a way to use call to achieve that?
As far as I know, call only guarantees processing in the order the data was sent to the target. I wonder how this block can help me if it may dequeue two pieces of data and process them in parallel on different threads (PPL tasks that could run on two different threads). How does the ordering mentioned in the MSDN documentation (http://msdn.microsoft.com/en-us/library/dd504833.aspx) relate to this behavior?
Any thoughts/corrections (regarding the assumptions I made about agents here) are appreciated.
-Ghita
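One arrangement worth sketching (an assumption on my part, not something the MSDN page prescribes): a separate call<T> block per TCP session, so messages for one session are handled one at a time in the order they were sent, while different sessions can proceed concurrently. Note that this serializes per session but does not pin a session to one OS thread:

// Sketch: one call block per session, created on demand.
#include <agents.h>
#include <concrt.h>
#include <unordered_map>
#include <memory>
#include <string>
#include <iostream>

using namespace concurrency;

struct Packet
{
    int sessionId;
    std::string payload;
};

class SessionDispatcher
{
    critical_section m_lock; // protects the map; callbacks arrive on many threads
    std::unordered_map<int, std::shared_ptr<call<Packet>>> m_sessions;

public:
    void Dispatch(const Packet& p)
    {
        std::shared_ptr<call<Packet>> block;
        {
            critical_section::scoped_lock guard(m_lock);
            auto& slot = m_sessions[p.sessionId];
            if (!slot)
            {
                slot = std::make_shared<call<Packet>>([](Packet pkt)
                {
                    // Per-session processing: packets arrive here in FIFO order
                    // for this session.
                    std::cout << "session " << pkt.sessionId << ": " << pkt.payload << std::endl;
                });
            }
            block = slot;
        }
        asend(*block, p); // hand the packet to that session's block
    }
};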
Hi everyone,
My application sets up a global hook. In the hook proc, it retrieves an IAccessible object, and an MSHTML object if it is a web application. Everything works well when the code for retrieving the IAccessible and MSHTML objects runs in the hook proc, but when I create another thread to retrieve the IAccessible and MSHTML objects, it always hangs. I use Visual Studio 2010 and a Win32 thread.
Code for retrieving IAccessible:
Get the current cursor position and then invoke ::AccessibleObjectFromPoint.
Code for creating a thread:
CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL).
Thanks,
Victory
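For context, a minimal sketch (placeholder code, not the original) of the worker-thread version described above. One detail such a sketch makes explicit is that IAccessible/MSHTML are COM objects, so a thread created with CreateThread has to initialize COM itself before calling ::AccessibleObjectFromPoint; COM apartment and marshaling rules are worth checking when a call like this hangs off the hook thread:

// Sketch of the worker thread described above.
#include <windows.h>
#include <oleacc.h>
#pragma comment(lib, "oleacc.lib")

DWORD WINAPI ThreadProc(LPVOID /*param*/)
{
    // Each thread that uses COM must initialize it.
    HRESULT hr = CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
    if (FAILED(hr))
        return 1;

    POINT pt = {};
    GetCursorPos(&pt);

    IAccessible* pAcc = NULL;
    VARIANT varChild;
    VariantInit(&varChild);

    hr = AccessibleObjectFromPoint(pt, &pAcc, &varChild);
    if (SUCCEEDED(hr) && pAcc != NULL)
    {
        // ... use the IAccessible object here ...
        pAcc->Release();
    }
    VariantClear(&varChild);

    CoUninitialize();
    return 0;
}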
I'd like to ask when (and if) a WDDM 1.2 driver will be available for Windows 7. I'm concerned because I'd like to use C++ AMP for computations, but I need full double precision. I'm sure that my card has that capability; the only thing that stands in the way is the driver.
DanielMoth mentioned in this discussion that the driver should be available by the time C++ AMP ships, but I couldn't find any date for when that should happen.
Thanks, and sorry if this is a rather vendor-specific question.
Can anyone help me with how a project on volume rendering can be done using MPI and OpenMP?
Also, can we do that in Visual C++?
Thanks in advance.
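Hybrid MPI + OpenMP code can be built and run with Visual C++ (for example with MS-MPI and the /openmp compiler option). Below is a minimal sketch of the structure such a project usually has; the slab size and the per-voxel work are placeholders, not real volume-rendering code:

// Sketch: each MPI rank owns a slab of the volume, OpenMP parallelizes within the rank.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Placeholder for this rank's slab of the volume.
    std::vector<float> slab(1024 * 1024, 0.0f);

    // Shared-memory parallelism inside the rank.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(slab.size()); ++i)
    {
        slab[i] = static_cast<float>(rank) + 0.5f; // stand-in for real ray/voxel work
    }

    printf("rank %d of %d finished its slab using up to %d OpenMP threads\n",
           rank, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}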
Hi, I'm trying to get a very rough estimate of how many hundreds of CPU cycles one thread context switch costs (on average) on my system. This is for 64-bit Windows Server 2008 R2, and everything that I'm interested in happens in kernel space. I tried googling it, and MSDN suggests that depending on the platform, the load, and the OS, the cost can vary from 2,000 to 8,000 CPU cycles. That's a good start, but that's a 4x difference in cost, and I want to at least figure out which part of the range my workload and platform fall in (say 2,000-5,000 or 5,000-8,000).
So what I did was install xperf (Windows Performance Toolkit) and Intel VTune Amplifier XE. xperf reports ~340,000 context switches per second, while VTune attributes ~1,000,000,000 unhalted CPU cycles to the ntoskrnl module. So if I just divide one by the other I get ~3,000 cycles. This workload utilizes ~90% of the CPU and there are ~3,000 active threads.
Another workload that I have utilizes only ~40% of the CPU, there are just tens of threads active, and xperf reports 45,000 context switches while ntoskrnl consumes ~320,000,000 CPU cycles per second. In this case the cost of a context switch comes out to ~7,000 cycles.
So, questions:
Thanks!
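Just to make the arithmetic above easy to re-run, a trivial snippet using the counter values quoted in the post (these are the quoted numbers, not fresh measurements):

// Back-of-the-envelope check of the two context-switch cost estimates above.
#include <cstdio>

int main()
{
    // Workload 1: ~90% CPU, ~3,000 active threads.
    const double kernelCycles1 = 1.0e9;     // unhalted cycles/s attributed to ntoskrnl (VTune)
    const double switches1     = 340000.0;  // context switches/s (xperf)

    // Workload 2: ~40% CPU, tens of active threads.
    const double kernelCycles2 = 3.2e8;
    const double switches2     = 45000.0;

    std::printf("workload 1: ~%.0f cycles per switch\n", kernelCycles1 / switches1); // ~2941
    std::printf("workload 2: ~%.0f cycles per switch\n", kernelCycles2 / switches2); // ~7111
    return 0;
}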
I am having a persistent memory-related problem when using AMP. Below is the piece of code that causes my program to crash. The code is quite simple: I am adding numbers in a particular way. It seems that the crash is caused by the statement "entr2d[ix*ny_loc+iy]=sum;". More precisely, I think the crash occurs because the variable sum is corrupted. When I set it to 1 (i.e., when I uncomment "// sum=1.0f;"), the code does not crash. The variable "sum" is computed by adding values obtained from the array "res". The code is very straightforward. I have been trying to find the cause of the crash for the last two days, but have failed so far. Is it possible that the bug is inside AMP somehow?
Any help would be greatly appreciated.
Thanks,
Alexander
float subr(int nx, int ny, int nz, int n1, float dx, float dy, float dz, vector<float> res_vec, float la, float mu)
{
    float entr = 0.f, dist2, temp, dx2, dy2, dz2, mu_loc, la_loc;
    int ixl, ixr, iyl, iyr, izl, izr, n1_loc, nx_loc, ny_loc, nz_loc;

    dx2 = dx * dx;
    dy2 = dy * dy;
    dz2 = dz * dz;

    printf("nx=%i, ny=%i, nz=%i, n1=%i \n", nx, ny, nz, n1);

    n1_loc = n1;
    nx_loc = nx;
    ny_loc = ny;
    nz_loc = nz;
    mu_loc = mu;
    la_loc = la;

    // Define the accelerator
    accelerator_view myAv = accelerator().create_view(queuing_mode_immediate);

    // Copy data to GPU
    array_view<const float, 3> res(nx, ny, nz, res_vec);
    vector<float> entr_vec(nx * ny, 0.f);
    array_view<float, 1> entr2d(nx * ny, entr_vec);

    concurrency::extent<2> ext(nx, ny);

    parallel_for_each(myAv, ext, [=](index<2> idx) restrict(amp)
    {
        int ix = idx[0];
        int iy = idx[1];
        int ixl = max(ix - n1_loc, 0);
        int ixr = min(ix + n1_loc + 1, nx_loc);
        int iyl = max(iy - n1_loc, 0);
        int iyr = min(iy + n1_loc + 1, ny_loc);
        float sum = 0.f;

        for (int iz = 0; iz < nz_loc; ++iz)
        {
            int izl = max(iz - n1_loc, 0);
            int izr = min(iz + n1_loc + 1, nz_loc);
            for (int iz1 = izl; iz1 < izr; ++iz1) {
                for (int iy1 = iyl; iy1 < iyr; ++iy1) {
                    for (int ix1 = ixl; ix1 < ixr; ++ix1) {
                        float temp = res[ix][iy][iz];
                        temp = temp - res[ix1][iy1][iz1];
                        sum = sum + temp;
                    }
                }
            }
        }
        // sum=1.0f;
        entr2d[ix * ny_loc + iy] = sum;
    });

    entr2d.synchronize();

    for (int ix = 0; ix < nx; ++ix) {
        for (int iy = 0; iy < ny; ++iy) {
            entr = entr + entr2d[ix * ny_loc + iy];
        }
    }

    return entr / float(nx * ny * nz);
}
Hi
I want to catch WM_CLOSE for the main window, but WM_CLOSE also fires when I close any dialog box.
How can I detect WM_CLOSE and make sure it was generated by the main window closing?
Thanks
sahil
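A minimal sketch of one way to tell the two cases apart (placeholder names, assuming a plain Win32 window procedure): WM_CLOSE is delivered to the window being closed, so comparing the HWND it arrives on against the saved main-window handle identifies a main-window close:

// Sketch: only treat WM_CLOSE as "application closing" for the main window's HWND.
#include <windows.h>

HWND g_hMainWnd = NULL; // set when the main window is created

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_CLOSE:
        if (hWnd == g_hMainWnd)
        {
            // The main window is closing: do shutdown work here.
            DestroyWindow(hWnd);
            return 0;
        }
        // A dialog or child window is closing: let default handling run.
        break;

    case WM_DESTROY:
        if (hWnd == g_hMainWnd)
        {
            PostQuitMessage(0);
            return 0;
        }
        break;
    }
    return DefWindowProc(hWnd, msg, wParam, lParam);
}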
My code collection has some rather extreme programs that can saturate a Cray for years, so obviously performance matters.
Here is some code I have for dealing with locks etc. so that a shared resource can be used.
Notice the support for a wide range of rigs.
#ifndef _UNI_SMP_LIB
#define _UNI_SMP_LIB

#if defined(_WIN32) || defined(_WIN64)
#include <process.h>
# define pthread_attr_t HANDLE
# define pthread_t      HANDLE
# define thread_t       HANDLE
# define tfork(t,f,p)   (pthread_t)_beginthreadex(0,0,(unsigned int (__stdcall *)(void *))(f),(void *)(p),0,0);

#if (defined (_M_ALPHA) && !defined(NT_INTEREX))
# ifdef __cplusplus
extern "C" __int64 __asm (char *, ...);
extern "C" void __MB (void);
# pragma intrinsic (__asm)
# pragma intrinsic (__MB)
# endif
typedef volatile int lock_t[1];
# define LockInit(v) ((v)[0] = 0)
# define LockFree(v) ((v)[0] = 0)
# define UnLock(v)   (__MB(), (v)[0] = 0)
__inline void Lock (volatile int *hPtr)
{
    __asm ("lp: ldl_l v0,(a0);"
           "    xor v0,1,v0;"
           "    beq v0,lp;"
           "    stl_c v0,(a0);"
           "    beq v0,lp;"
           "    mb;", hPtr);
}

#elif (defined (_M_IX86) && !defined(NT_INTEREX))
typedef volatile int lock_t[1];
# define LockInit(v) ((v)[0] = 0)
# define LockFree(v) ((v)[0] = 0)
# define UnLock(v)   ((v)[0] = 0)
# if (_MSC_VER > 1200)
#  ifdef __cplusplus
extern "C" long _InterlockedExchange (long*, long);
extern "C" void __cdecl _ReadWriteBarrier(void);
#  else
extern long _InterlockedExchange (long*, long);
extern void _ReadWriteBarrier(void);
#  endif
#  pragma intrinsic (_InterlockedExchange)
#  pragma intrinsic (_ReadWriteBarrier)
__forceinline void Lock (volatile int *hPtr)
{
    int iValue;
    volatile int *hPtrTmp;

    hPtrTmp = hPtr; // Workaround for vc7 beta1 bug
    iValue = 1;
    _ReadWriteBarrier();
    for (;;) {
        iValue = _InterlockedExchange ((long*) hPtrTmp, iValue);
        if (0 == iValue)
            return;
        while (*hPtrTmp)
            ; // Do nothing
    }
    _ReadWriteBarrier();
}
# else
__inline void Lock (volatile int *hPtr)
{
    __asm {
        mov ecx, hPtr
    la: mov eax, 1
        xchg eax, [ecx]
        test eax, eax
        jz end
    lb: mov eax, [ecx]
        test eax, eax
        jz la
        jmp lb
    end:
    }
}
# endif

#elif ((defined (_M_IA64) || defined(_M_AMD64)) && !defined(NT_INTEREX))
# include <windows.h>
extern "C" void __cdecl _ReadWriteBarrier(void);
# pragma intrinsic (_InterlockedExchange)
# pragma intrinsic (_ReadWriteBarrier)
typedef volatile LONG lock_t[1];
# define LockInit(v) ((v)[0] = 0)
# define LockFree(v) ((v)[0] = 0)
# define UnLock(v)   ((v)[0] = 0)
__forceinline void Lock (volatile LONG *hPtr)
{
    int iValue;

    _ReadWriteBarrier();
    for (;;) {
        iValue = _InterlockedExchange ((LPLONG) hPtr, 1);
        if (0 == iValue)
            return;
        while (*hPtr)
            ; // Do nothing
    }
    _ReadWriteBarrier();
}

#else  /* NT non-Alpha/Intel, without assembler Lock() */
# define lock_t      volatile int
# define LockInit(v) ((v) = 0)
# define LockFree(v) ((v) = 0)
# define Lock(v)     do { \
    while(InterlockedExchange((LPLONG)&(v),1) != 0); \
  } while (0)
# define UnLock(v)   ((v) = 0)
#endif /* architecture check */

#else  /* not NT, assume SMP using POSIX threads (LINUX, etc) */
#include <pthread.h>

#if defined(MUTEX)
# define Lock(v)     pthread_mutex_lock(&v)
# define LockInit(v) pthread_mutex_init(&v,0)
# define LockFree(v) pthread_mutex_destroy(&v)
# define UnLock(v)   pthread_mutex_unlock(&v)
# define lock_t      pthread_mutex_t
#elif defined(ALPHA)
# include <machine/builtins.h>
# define lock_t      volatile long
# define LockInit(v) ((v) = 0)
# define LockFree(v) ((v) = 0)
# define Lock(v)     __LOCK_LONG(&(v))
# define UnLock(v)   __UNLOCK_LONG(&(v))
#else  /* POSIX, but not using MUTEXes */
# define exchange(adr,reg) \
  ({ volatile int _ret; \
     asm volatile ("xchgw %0,%1" \
       : "=q" (_ret), "=m" (*(adr))   /* Output %0,%1 */ \
       : "m" (*(adr)), "0" (reg));    /* Input (%2),%0 */ \
     _ret; \
  })
# define LockInit(p) (p=0)
# define LockFree(p) (p=0)
# define UnLock(p)   (exchange(&p,0))
# define Lock(p)     while(exchange(&p,1)) while(p)
# define lock_t      volatile int
#endif /* MUTEX */

#if defined(CLONE)
# define tfork(t,f,p) { \
    char *m=malloc(0x100010); \
    if (m <= 0) printf("malloc() failed\n"); \
    (void) clone(f, \
                 m+0x100000, \
                 CLONE_VM+CLONE_FILES+17, \
                 (void*) p); }
#else
# define tfork(t,f,p) pthread_create(&t,&pthread_attr,f,(void*) p)
#endif

#endif /* NT or POSIX */

#else
# define LockInit(p)
# define LockFree(p)
# define Lock(p)
# define UnLock(p)
#endif
This is very well-tuned code.
I'm trying to multi-thread my application. It reads data from a SAP system by doing reads from multiple tables, and it is this process that I was hoping to multi-thread. There is an include file that handles the SAP connection:
public ref class SAPSystem
{
public:
    SAPLogonCtrl::SAPLogonControlClass ^SAPLogon;
    SAPLogonCtrl::Connection ^SAPConnection;
    SAPFunctionsOCX::SAPFunctionsClass ^SAPFunctions;

public:
    ~SAPSystem()
    {
        ...
    }

    SAPSystem()
    {
        ...
    }

    bool Logon()
    {
        ...
    }

    void Logoff()
    {
        ...
    }

    bool RFC_READ_TABLE(...)
    {
        ...
    }
This include controls (amongst other things) the logging onto and off the SAP system as well as the extraction of data from a SAP table via the RFC_READ_TABLE function module (which is a SAP function module which has been RFC enabled to allow third party software to access the SAP tables).
All good and well so far.
On my form I declared my SAPSystem connection as a global variable with:
public ref class Form1 : public System::Windows::Forms::Form
{
    SAPSystem^ SAPsystem;
    .
    .
    .
I set up my program to do the following when the extraction button is pushed (I've omitted a lot of the logic and tried to emphasize the program flow):
private: System::Void butExt_Click(System::Object^ sender, System::EventArgs^ e)
{
    ThreadStart^ threadStart;
    Thread^ newThread1;
    Thread^ newThread2;
    WorkList_List^ worklist;

    SAPsystem = gcnew SAPSystem;

    //Logon to the SAP system
    if (SAPsystem->Logon() == true)
    {
        //Build a list of tables to be extracted
        threadStart = gcnew ThreadStart(this, &MultiThreadingTest::Form1::ThreadClassMethod1);
        worklist->AddRow(threadStart);
        threadStart = gcnew ThreadStart(this, &MultiThreadingTest::Form1::ThreadClassMethod2);
        worklist->AddRow(threadStart, Checkbox, true);

        //Loop through the worklist kicking off jobs as threads become available
        while (worklistindex != -1 || ThreadAllFinished == false)
        {
            if (newThread1->IsAlive == false && worklistindex != -1)
            {
                newThread1 = gcnew Thread(worklist->WindowsTable[ThreadCounter]->name);
                newThread1->Start();
            }
            else if (newThread2->IsAlive == false && worklistindex != -1)
            {
                newThread2 = gcnew Thread(worklist->WindowsTable[ThreadCounter]->name);
                newThread2->Start();
            }
        }
        SAPsystem->Logoff();
    }
}
On my form I then set up two thread methods to access different tables within the SAP system, something like this:
public: void ThreadClassMethod1()
{
    SAPSystem->RFC_READ_TABLE(<<First table’s parameters>>)
}

public: void ThreadClassMethod2()
{
    SAPSystem->RFC_READ_TABLE(<<Second table’s parameters>>)
}
All this works great with the two threads being kicked off, but the whole shooting match comes to a shuddering halt when both threads stop at the first line in the RFC_READ_TABLE routine. I suspected that it may be that they were both trying to access a single instance of a routine in an object that was declared in another thread (I'm a numpty, so I may be using the wrong nomenclature here; please forgive me). I tried having each ThreadClassMethod create its own SAPSystem (using a silent logon, i.e. capturing the system and user details on the first logon and using them to create a connection to the SAP system in the worker threads), but that didn't help either.
Can anyone help as to what I’m doing wrong and how to fix it?
Background
I am just starting to use C++ AMP for a project and I have purchased a dedicated (no screen attached) GPU (a Sapphire 7970 with 6 GB of memory).
Question
accelerator.dedicated_memory returns 2,008,344 (KB) when I would expect it to return close to 6 GB. Any ideas on the large difference?
For information
- Running on Windows 7 64bit
- It is picking up the correct GPU (all the other information returned matches, and the GPU used for the display has 1 GB)
- I have checked and the GPU is running the latest drivers
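For reference, a small sketch that prints what the runtime reports for every accelerator it can see (dedicated_memory is reported in kilobytes); comparing all the entries can help confirm which adapter the value belongs to. The properties used here are the standard accelerator members:

// Enumerate all C++ AMP accelerators and print their reported memory.
#include <amp.h>
#include <iostream>

int main()
{
    using namespace concurrency;

    for (const accelerator& acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L"\n  device path:      " << acc.device_path
                   << L"\n  dedicated_memory: " << acc.dedicated_memory << L" KB"
                   << L"\n  has_display:      " << (acc.has_display ? L"yes" : L"no")
                   << L"\n  emulated:         " << (acc.is_emulated ? L"yes" : L"no")
                   << std::endl;
    }
    return 0;
}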
Hi.
Can anyone tell me about this?
Can we use C++ AMP and multithreading at the same time?
By multithreading I mean CPU-only multitasking, using multiple cores of the CPU.
And does C++ AMP plus multithreading give better performance than C++ AMP alone?
Thanks for reading.
Best regards.
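For illustration only, a sketch of what combining the two can look like: ordinary CPU threads (std::thread), each submitting its own C++ AMP kernel. Whether this outperforms a single-threaded C++ AMP program depends entirely on how much CPU-side work there is to overlap with the GPU; the sizes and the kernel here are placeholders:

// Sketch: CPU threads and C++ AMP used together.
#include <amp.h>
#include <thread>
#include <vector>
#include <iostream>

void square_on_gpu(std::vector<float>& v)
{
    using namespace concurrency;
    array_view<float, 1> av(static_cast<int>(v.size()), v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] = av[idx] * av[idx];
    });
    av.synchronize();
}

int main()
{
    std::vector<float> a(1 << 20, 2.0f);
    std::vector<float> b(1 << 20, 3.0f);

    // Two CPU threads, each dispatching its own C++ AMP kernel.
    std::thread t1(square_on_gpu, std::ref(a));
    std::thread t2(square_on_gpu, std::ref(b));
    t1.join();
    t2.join();

    std::cout << a[0] << " " << b[0] << std::endl; // prints 4 and 9
    return 0;
}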
Hello all. I'm trying to reduce code size and duplication in my AMP code and am wondering if the following design pattern will work:
array_view<const float, 2> DataBuffer(dataExtent, &dataBuffer[0]);
array_view<float, 2> IntermediateBuffer(intermediateExtent, &intermediateBuffer[0]);
IntermediateBuffer.discard_data();
// Invoke first AMP kernel:
parallel_for_each(...) restrict(amp)
{
.
.
Generate/store results in "IntermediateBuffer".
.
.
.});
// With "IntermediateBuffer" still on the GPU, check some condition and process, storing the
// results to another buffer:
if(...)
{
array_view<float, 2> FinalOutputBuffer1(finalOutputExtent, &finalOutputBuffer1[0]);
Function1(FinalOutputBuffer1);
}
else
{
array_view<int, 2> FinalOutputBuffer2(finalOutputExtent, &finalOutputBuffer2[0]);
Function2(FinalOutputBuffer2);
}
// Don't bother retrieving the IntermediateBuffer results:
IntermediateBuffer.discard_data();
Where:
void Function1(array_view<float, 2> finalArrayView)
{
// Invoke AMP kernel that uses "IntermediateBuffer" from first kernel to generate and store
// results to "FinalOutputBuffer1" array view. This assumes "IntermediateBuffer" will still
// be resident on GPU:
parallel_for_each(...) restrict(amp)
{
.
.
Use results in "IntermediateBuffer" to generate/store final result in "finalArrayView"
.
.
});
finalArrayView.synchronize();
}
void Function2(array_view<int, 2> finalArrayView)
{
// Invoke AMP kernel that uses "IntermediateBuffer" from first kernel to generate and store
// results to "FinalOutputBuffer2" array view. This assumes "IntermediateBuffer" will still
// be resident on GPU:
parallel_for_each(...) restrict(amp)
{
.
.
Use results in "IntermediateBuffer" to generate/store final result in "finalArrayView"
.
.
});
finalArrayView.synchronize();
}
The above relies on the assumption that "IntermediateBuffer" will stay on the GPU for use by whichever function is invoked after the first kernel. It also assumes that I can pass an array_view to a function and have an AMP kernel within that function use this array_view. Is this correct? Thank you in advance.
-L
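For what it's worth, a compilable sketch of the pattern being asked about: array_views passed by value into a helper function that launches its own parallel_for_each. The names, sizes, and the scaling kernel are placeholders, not the original buffers:

// Sketch: passing array_views into a helper that launches its own kernel.
#include <amp.h>
#include <vector>

using namespace concurrency;

// array_view is a cheap handle, so passing it by value is the normal idiom.
void ScaleInto(array_view<const float, 2> intermediate,
               array_view<float, 2> finalOut,
               float factor)
{
    parallel_for_each(finalOut.extent, [=](index<2> idx) restrict(amp)
    {
        finalOut[idx] = intermediate[idx] * factor;
    });
    finalOut.synchronize();
}

int main()
{
    const int rows = 256, cols = 256;
    std::vector<float> intermediateData(rows * cols, 1.0f);
    std::vector<float> finalData(rows * cols, 0.0f);

    array_view<const float, 2> intermediate(rows, cols, intermediateData);
    array_view<float, 2> finalOut(rows, cols, finalData);
    finalOut.discard_data();

    // The data referenced by 'intermediate' should stay resident on the
    // accelerator between kernels as long as the host doesn't touch it.
    ScaleInto(intermediate, finalOut, 2.0f);
    return 0;
}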
Hi,
I discovered a strange behavior with the C++ compiler (VS 2008, SP1) with OpenMP enabled.
When I use an OpenMP section, I use the clause "default(none)" to avoid writing to or using variables I don't intend to. And to avoid adding variables to the shared clause, I declare my read-only variables "const".
It works great except for references. See the sample below (I made this sample as small as possible and replaced the real code with sample code):
#include <stdio.h>

struct CMyClass
{
    float m_fValue;
};

void TestRef(const CMyClass & const i_Center)
{
    #pragma omp parallel default(none) num_threads(4)
    {
        printf("%f", i_Center.m_fValue); // Error C3052 !!
    }
}

void TestPtr(const CMyClass *const i_pCenter)
{
    #pragma omp parallel default(none) num_threads(4)
    {
        printf("%f", i_pCenter->m_fValue); // No Error with const pointer
    }
}

int main()
{
    CMyClass object;
    TestRef(object);
    TestPtr(&object);
    return 0;
}
When the program is compiled, I get this error:
error C3052: 'i_Center' : variable doesn't appear in a data-sharing clause under a default(none) clause
In the sample you can see I added a const after the reference ("const CMyClass& const") to check whether the error would go away, but it doesn't (and this const is useless for a reference anyway...).
So is this error a compiler bug?
Or did I make a mistake in the declaration of "i_Center"? (If it's an error in my code, where do I need to put the const to remove the error?)
I know that I can solve the error by adding "shared(i_Center)", but that's not my goal, since I don't want to modify this variable in the threads and I want it to be read-only.
regards,
François.
Hi
I wrote a test program which only copies a sequence of data from host memory to GPU memory. To improve the transfer speed, I've tried GPU memory warm-up and staging arrays. But the results are still not satisfying: transferring 8 MB takes nearly 80 ms, about 800 Mb/s.
What's more suspicious is that after applying staging arrays the transfer speed isn't improved much; it's just more stable across test iterations.
Platform: Intel Ivybridge, i5 3450 with HD2500
My test code is posted here:
#include <amp.h>
#include <windows.h>
#include <iostream>
#include <algorithm>   // std::generate
#include <cstdio>      // printf
#include <tchar.h>     // _tmain / _TCHAR

#define RUN_TIMES 10

using namespace concurrency;

void WarmupDeviceData(array<int, 2> &data)
{
    parallel_for_each(data.extent, [&](index<2> idx) restrict(amp)
    {
        data[idx] = 0xBADDF00D;
    });
}

int _tmain(int argc, _TCHAR* argv[])
{
    accelerator default_device;
    accelerator cpuAcc = accelerator(accelerator::cpu_accelerator);

    int width = 3264;
    int height = 2448;
    int data_len = width * height;

    // Staging array: accessible on the CPU, associated with the device view.
    array<int, 2> stagingArray(height, width, cpuAcc.default_view, default_device.default_view);
    int i = 0;
    std::generate(stagingArray.data(), stagingArray.data() + data_len, [&i]() { return i++; });

    array<int, 2> deviceArray(height, width, default_device.default_view);
    WarmupDeviceData(deviceArray);

    LARGE_INTEGER startTimeQPC = {0};
    LARGE_INTEGER endTimeQPC = {0};
    LARGE_INTEGER frequency = {0};
    QueryPerformanceFrequency(&frequency);
    double acc_duration = 0;

    WarmupDeviceData(deviceArray);

    for (int i = 0; i < RUN_TIMES; i++)
    {
        QueryPerformanceCounter(&startTimeQPC);
        copy(stagingArray, deviceArray);
        QueryPerformanceCounter(&endTimeQPC);

        double duration = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;
        acc_duration += duration;
        printf("Copy in time: %.1f ms\n", duration);
    }
    printf("Mean: %.1f ms\n", acc_duration / RUN_TIMES);
    return 0;
}
Thanks very much~
Hello,
I have developed a DLL using AMP. Now I have a simple exe to test my DLL. Here is my simple exe source code:
#include"AteAMP.h"
int main()
{
AteNgateAmpMainMatcher amp_matcher;
amp_matcher.ampMatchMain();
system("pause");
return 0;
}
When I run this simple program, it fails with the following error:
Exercise Solutions.exe Application Error
The application was unable to start correctly (0xc000a200). Click OK to close the application
I am using a Windows 8 computer with VS 2012.