Hello, what is the difference between fast_math::pow and fast_math::powf? They have identical descriptions in the documentation.
-LKeene
I have an operation to perform on an array that has to be implemented in two passes. I don't want to copy the buffer from the GPU to the host after the first pass, since it is needed for the second pass. If I don't call "array_view.synchronize()" after the first pass, will this bypass the buffer copy and leave the data on the GPU for the second pass, and will the two "parallel_for_each" kernels be invoked synchronously? For example:
array_view<float, 2> DataBuffer(bufferExtent, &dataBuffer[0]);
// Perform first pass processing:
parallel_for_each(someAccelerator.default_view, bufferExtent,[=](index<2> bufferIndex) restrict(amp)
{
.
.
.
});
// Do not call "DataBuffer.synchronize()" i.e. leave data on GPU and perform 2nd pass processing.
// Will this wait for the above "parallel_for_each" to complete before beginning?:
// 2nd pass processing:
parallel_for_each(someAccelerator.default_view, bufferExtent,[=](index<2> bufferIndex) restrict(amp)
{
.
.
.
});
// Processing complete. Copy data back to host:
DataBuffer.synchronize();
-Lkeene
Hi All,
I want to use parallel_for with combinable class.
I understand that I can limit the number of threads running IN PARALLEL with a SchedulerPolicy using the MaxConcurrency key. This works as expected. However, parallel_for uses more threads than defined by MaxConcurrency, and as a consequence the combinable object needs more nodes.
Since my tasks consume a lot of memory, I have to limit the number of threads involved in parallel_for.
Is there any chance to do this?
Thanks and regards,
Klaus
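For illustration, a minimal sketch (placeholder values, not the code from the post) of the combination described above: a SchedulerPolicy capped with the MaxConcurrency key, a parallel_for, and a combinable used for per-thread accumulation:

// Sketch: limiting scheduler concurrency and combining per-thread partial sums.
#include <ppl.h>
#include <concrt.h>
#include <functional>
#include <iostream>

int main()
{
    using namespace concurrency;

    // Cap the number of virtual processors the default scheduler may use.
    SchedulerPolicy policy(1, MaxConcurrency, 4);
    CurrentScheduler::Create(policy);

    combinable<long long> partialSums;

    // parallel_for may still touch more worker threads than MaxConcurrency
    // over the lifetime of the loop, which is what produces the extra
    // combinable "nodes" described above.
    parallel_for(0, 1000000, [&](int i)
    {
        partialSums.local() += i;
    });

    std::cout << "sum = " << partialSums.combine(std::plus<long long>()) << std::endl;

    CurrentScheduler::Detach();
    return 0;
}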
Hello all. For my image processing software I need to be able to pass in byte and unsigned short arrays into AMP kernels (from C#). Thanks to some help from the people on this forum I've got the byte portion working well but now I'm stuck on the unsigned short functionality. To operate on bytes in AMP I'm using the code from the Steve Deitz essay here: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/01/17/c-amp-it-s-got-character-but-no-char.aspx
I just need to read/write my byte and unsigned short arrays. In Steve's unsigned char example, he first reinterprets the input unsigned char array as an unsigned int array:
array_view<unsigned int> d_data((size+3)/4, reinterpret_cast<unsigned int*>(data.data()));
Then he provides two helper functions to read/write a byte at a given location in the array:
// Read: idx >> 2 selects the 32-bit word, (idx & 0x3) << 3 is the bit offset
// (0, 8, 16 or 24) of the byte within that word.
template <typename T>
unsigned int read_uchar(T& arr, int idx) restrict(amp)
{
    return (arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3))) >> ((idx & 0x3) << 3);
}

// Write: the first xor clears the target byte (x ^ (x & mask) == x & ~mask),
// the second xor writes the new value into the now-zeroed byte.
template <typename T>
void write_uchar(T& arr, int idx, unsigned int val) restrict(amp)
{
    atomic_fetch_xor(&arr[idx >> 2], arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3)));
    atomic_fetch_xor(&arr[idx >> 2], (val & 0xFF) << ((idx & 0x3) << 3));
}
Even with his diagram I'm having a heck of a time figuring out how to modify these bitwise operations for unsigned short. Could someone please show how the cast and the two helper functions can be altered to allow read/write capability for 16-bit unsigned ints? Thank you very much in advance.
-LKeene
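Not from Steve's article, but here is how the same packing scheme might be adapted for 16-bit values, assuming two unsigned shorts are packed per 32-bit word (word index idx >> 1, shift of 0 or 16 bits). The same caveat about the two-step atomic write to a shared word applies as in the uchar version:

// Sketch: pack two unsigned shorts per unsigned int (an assumption, not from the article).
// 'size' is the number of 16-bit elements; the buffer is assumed to be padded
// to a whole number of 32-bit words.
array_view<unsigned int> d_data((size + 1) / 2, reinterpret_cast<unsigned int*>(data.data()));

// Read the 16-bit value at element index 'idx'.
template <typename T>
unsigned int read_ushort(T& arr, int idx) restrict(amp)
{
    return (arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4))) >> ((idx & 0x1) << 4);
}

// Write the 16-bit value 'val' at element index 'idx'.
template <typename T>
void write_ushort(T& arr, int idx, unsigned int val) restrict(amp)
{
    // Clear the target half-word, then xor in the new value
    // (same two-step atomic trick as the uchar version).
    atomic_fetch_xor(&arr[idx >> 1], arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4)));
    atomic_fetch_xor(&arr[idx >> 1], (val & 0xFFFF) << ((idx & 0x1) << 4));
}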
Hi,
I want to use a call block to process data that arrives as input (through callbacks) from multiple threads. I also have the requirement that parts of the input data must be processed by the same thread every time, in order to serialize it. Think of it as data arriving for different TCP sessions: I want the data belonging to the same TCP session to be processed in FIFO order.
If I have multiple concurrent TCP sessions, how can I express this with a call target so that the data is processed in parallel, with the restriction that data from a given TCP session is processed on the same thread? Is there a way to use call to achieve that?
As far as I know, call only guarantees processing in the order the data was sent to the target. I wonder how this block can help me if it may dequeue two pieces of data and process them in parallel on different threads (PPL tasks that could run on two different threads). How does the ordering mentioned in the MSDN documentation (http://msdn.microsoft.com/en-us/library/dd504833.aspx) relate to this behavior?
Any thoughts/corrections (regarding the assumptions I made about agents here) are appreciated.
-Ghita
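One arrangement worth sketching (an assumption on my part, not something the MSDN page prescribes): a separate call<T> block per TCP session, so messages for one session are handled one at a time in the order they were sent, while different sessions can proceed concurrently. Note that this serializes per session but does not pin a session to one OS thread:

// Sketch: one call block per session, created on demand.
#include <agents.h>
#include <concrt.h>
#include <unordered_map>
#include <memory>
#include <string>
#include <iostream>

using namespace concurrency;

struct Packet
{
    int sessionId;
    std::string payload;
};

class SessionDispatcher
{
    critical_section m_lock; // protects the map; callbacks arrive on many threads
    std::unordered_map<int, std::shared_ptr<call<Packet>>> m_sessions;

public:
    void Dispatch(const Packet& p)
    {
        std::shared_ptr<call<Packet>> block;
        {
            critical_section::scoped_lock guard(m_lock);
            auto& slot = m_sessions[p.sessionId];
            if (!slot)
            {
                slot = std::make_shared<call<Packet>>([](Packet pkt)
                {
                    // Per-session processing: packets arrive here in FIFO order
                    // for this session.
                    std::cout << "session " << pkt.sessionId << ": " << pkt.payload << std::endl;
                });
            }
            block = slot;
        }
        asend(*block, p); // hand the packet to that session's block
    }
};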
Hi everyone,
My application sets up a global hook. In the hook proc, it retrieves an IAccessible object, and an MSHTML object if it is a web application. Everything works well when the code for retrieving the IAccessible and MSHTML objects runs in the hook proc, but when I create another thread to retrieve the IAccessible and MSHTML objects, it always hangs. I use Visual Studio 2010 and a Win32 thread.
Code for retrieving IAccessible:
Get the current cursor position and then invoke ::AccessibleObjectFromPoint.
Code for creating a thread:
CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL).
Thanks,
Victory
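For context, a minimal sketch (placeholder code, not the original) of the worker-thread version described above. One detail such a sketch makes explicit is that IAccessible/MSHTML are COM objects, so a thread created with CreateThread has to initialize COM itself before calling ::AccessibleObjectFromPoint; COM apartment and marshaling rules are worth checking when a call like this hangs off the hook thread:

// Sketch of the worker thread described above.
#include <windows.h>
#include <oleacc.h>
#pragma comment(lib, "oleacc.lib")

DWORD WINAPI ThreadProc(LPVOID /*param*/)
{
    // Each thread that uses COM must initialize it.
    HRESULT hr = CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
    if (FAILED(hr))
        return 1;

    POINT pt = {};
    GetCursorPos(&pt);

    IAccessible* pAcc = NULL;
    VARIANT varChild;
    VariantInit(&varChild);

    hr = AccessibleObjectFromPoint(pt, &pAcc, &varChild);
    if (SUCCEEDED(hr) && pAcc != NULL)
    {
        // ... use the IAccessible object here ...
        pAcc->Release();
    }
    VariantClear(&varChild);

    CoUninitialize();
    return 0;
}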
I'd like to ask when (and if) a WDDM 1.2 driver will be available for Windows 7. I'm concerned because I'd like to use C++ AMP for computations, but I need full double precision. I'm sure that my card has that capability; the only thing that stands in the way is the driver.
DanielMoth mentioned in this discussion that the driver should be available by the time C++ AMP ships, but I couldn't find any date for when that should happen.
Thanks, and sorry if this is a rather vendor-specific question.
Can anyone help me with how a project on volume rendering can be done using MPI and OpenMP?
Also, can we do that in Visual C++?
Thanks in advance.
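Hybrid MPI + OpenMP code can be built and run with Visual C++ (for example with MS-MPI and the /openmp compiler option). Below is a minimal sketch of the structure such a project usually has; the slab size and the per-voxel work are placeholders, not real volume-rendering code:

// Sketch: each MPI rank owns a slab of the volume, OpenMP parallelizes within the rank.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Placeholder for this rank's slab of the volume.
    std::vector<float> slab(1024 * 1024, 0.0f);

    // Shared-memory parallelism inside the rank.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(slab.size()); ++i)
    {
        slab[i] = static_cast<float>(rank) + 0.5f; // stand-in for real ray/voxel work
    }

    printf("rank %d of %d finished its slab using up to %d OpenMP threads\n",
           rank, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}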
Hi, I'm trying to get a very rough estimate of how many hundreds of CPU cycles one thread context switch costs (on average) on my system. This is for 64-bit Windows Server 2008 R2, and everything that I'm interested in happens in kernel space. I tried googling it, and MSDN suggests that depending on the platform, the load, and the OS, the cost can vary from 2,000 to 8,000 CPU cycles. That's a good start, but that's a 4x difference in cost, and I want to at least figure out which part of the range my workload and platform fall in (say 2,000-5,000 or 5,000-8,000).
So what I did was install xperf (Windows Performance Toolkit) and Intel VTune Amplifier XE. xperf reports ~340,000 context switches per second, while VTune attributes ~1,000,000,000 unhalted CPU cycles to the ntoskrnl module. So if I just divide one by the other I get ~3,000 cycles. This workload utilizes ~90% of the CPU and there are ~3,000 active threads.
Another workload that I have utilizes only ~40% of the CPU, there are just tens of threads active, and xperf reports 45,000 context switches while ntoskrnl consumes ~320,000,000 CPU cycles per second. In this case the cost of a context switch comes out to ~7,000 cycles.
So, questions:
Thanks!
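Just to make the arithmetic above easy to re-run, a trivial snippet using the counter values quoted in the post (these are the quoted numbers, not fresh measurements):

// Back-of-the-envelope check of the two context-switch cost estimates above.
#include <cstdio>

int main()
{
    // Workload 1: ~90% CPU, ~3,000 active threads.
    const double kernelCycles1 = 1.0e9;     // unhalted cycles/s attributed to ntoskrnl (VTune)
    const double switches1     = 340000.0;  // context switches/s (xperf)

    // Workload 2: ~40% CPU, tens of active threads.
    const double kernelCycles2 = 3.2e8;
    const double switches2     = 45000.0;

    std::printf("workload 1: ~%.0f cycles per switch\n", kernelCycles1 / switches1); // ~2941
    std::printf("workload 2: ~%.0f cycles per switch\n", kernelCycles2 / switches2); // ~7111
    return 0;
}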
I am having a persistent memory-related problem when using AMP. Below is the piece of code that causes my program to crash. The code is quite simple: I am adding numbers in a particular way. It seems that the crash is caused by the statement "entr2d[ix*ny_loc+iy]=sum;". More precisely, I think the crash occurs because the variable sum is corrupted. When I set it to 1 (i.e., when I uncomment "// sum=1.0f;"), the code does not crash. The variable "sum" is computed by adding values obtained from the array "res". The code is very straightforward. I have been trying to find the cause of the crash for the last two days, but have failed so far. Is it possible that the bug is inside AMP somehow?
Any help would be greatly appreciated.
Thanks,
Alexander
float subr(int nx, int ny, int nz, int n1, float dx, float dy, float dz, vector<float> res_vec, float la, float mu)
{
    float entr = 0.f, dist2, temp, dx2, dy2, dz2, mu_loc, la_loc;
    int ixl, ixr, iyl, iyr, izl, izr, n1_loc, nx_loc, ny_loc, nz_loc;

    dx2 = dx * dx;
    dy2 = dy * dy;
    dz2 = dz * dz;

    printf("nx=%i, ny=%i, nz=%i, n1=%i \n", nx, ny, nz, n1);

    n1_loc = n1;
    nx_loc = nx;
    ny_loc = ny;
    nz_loc = nz;
    mu_loc = mu;
    la_loc = la;

    // Define the accelerator
    accelerator_view myAv = accelerator().create_view(queuing_mode_immediate);

    // Copy data to GPU
    array_view<const float, 3> res(nx, ny, nz, res_vec);
    vector<float> entr_vec(nx * ny, 0.f);
    array_view<float, 1> entr2d(nx * ny, entr_vec);

    concurrency::extent<2> ext(nx, ny);

    parallel_for_each(myAv, ext, [=](index<2> idx) restrict(amp)
    {
        int ix = idx[0];
        int iy = idx[1];
        int ixl = max(ix - n1_loc, 0);
        int ixr = min(ix + n1_loc + 1, nx_loc);
        int iyl = max(iy - n1_loc, 0);
        int iyr = min(iy + n1_loc + 1, ny_loc);
        float sum = 0.f;

        for (int iz = 0; iz < nz_loc; ++iz)
        {
            int izl = max(iz - n1_loc, 0);
            int izr = min(iz + n1_loc + 1, nz_loc);
            for (int iz1 = izl; iz1 < izr; ++iz1) {
                for (int iy1 = iyl; iy1 < iyr; ++iy1) {
                    for (int ix1 = ixl; ix1 < ixr; ++ix1) {
                        float temp = res[ix][iy][iz];
                        temp = temp - res[ix1][iy1][iz1];
                        sum = sum + temp;
                    }
                }
            }
        }
        // sum=1.0f;
        entr2d[ix * ny_loc + iy] = sum;
    });

    entr2d.synchronize();

    for (int ix = 0; ix < nx; ++ix) {
        for (int iy = 0; iy < ny; ++iy) {
            entr = entr + entr2d[ix * ny_loc + iy];
        }
    }

    return entr / float(nx * ny * nz);
}
Hi
I want to catch WM_CLOSE for the main window, but WM_CLOSE also fires when I close any dialog box.
How can I detect WM_CLOSE and make sure it was generated by the main window closing?
Thanks
sahil
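A minimal sketch of one way to tell the two cases apart (placeholder names, assuming a plain Win32 window procedure): WM_CLOSE is delivered to the window being closed, so comparing the HWND it arrives on against the saved main-window handle identifies a main-window close:

// Sketch: only treat WM_CLOSE as "application closing" for the main window's HWND.
#include <windows.h>

HWND g_hMainWnd = NULL; // set when the main window is created

LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_CLOSE:
        if (hWnd == g_hMainWnd)
        {
            // The main window is closing: do shutdown work here.
            DestroyWindow(hWnd);
            return 0;
        }
        // A dialog or child window is closing: let default handling run.
        break;

    case WM_DESTROY:
        if (hWnd == g_hMainWnd)
        {
            PostQuitMessage(0);
            return 0;
        }
        break;
    }
    return DefWindowProc(hWnd, msg, wParam, lParam);
}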
My code collection has some rather extreme programs that can saturate a Cray for years, so obviously performance matters.
Here is some code I have for dealing with locks etc. so that a shared resource can be used.
Notice the support for a wide range of rigs.
#ifndef _UNI_SMP_LIB
#define _UNI_SMP_LIB

#if defined(_WIN32) || defined(_WIN64)
#include <process.h>
# define pthread_attr_t HANDLE
# define pthread_t      HANDLE
# define thread_t       HANDLE
# define tfork(t,f,p)   (pthread_t)_beginthreadex(0,0,(unsigned int (__stdcall *)(void *))(f),(void *)(p),0,0);

#if (defined (_M_ALPHA) && !defined(NT_INTEREX))
# ifdef __cplusplus
extern "C" __int64 __asm (char *, ...);
extern "C" void __MB (void);
# pragma intrinsic (__asm)
# pragma intrinsic (__MB)
# endif
typedef volatile int lock_t[1];
# define LockInit(v) ((v)[0] = 0)
# define LockFree(v) ((v)[0] = 0)
# define UnLock(v)   (__MB(), (v)[0] = 0)
__inline void Lock (volatile int *hPtr)
{
    __asm ("lp: ldl_l v0,(a0);"
           "    xor v0,1,v0;"
           "    beq v0,lp;"
           "    stl_c v0,(a0);"
           "    beq v0,lp;"
           "    mb;", hPtr);
}

#elif (defined (_M_IX86) && !defined(NT_INTEREX))
typedef volatile int lock_t[1];
# define LockInit(v) ((v)[0] = 0)
# define LockFree(v) ((v)[0] = 0)
# define UnLock(v)   ((v)[0] = 0)
# if (_MSC_VER > 1200)
#  ifdef __cplusplus
extern "C" long _InterlockedExchange (long*, long);
extern "C" void __cdecl _ReadWriteBarrier(void);
#  else
extern long _InterlockedExchange (long*, long);
extern void _ReadWriteBarrier(void);
#  endif
#  pragma intrinsic (_InterlockedExchange)
#  pragma intrinsic (_ReadWriteBarrier)
__forceinline void Lock (volatile int *hPtr)
{
    int iValue;
    volatile int *hPtrTmp;

    hPtrTmp = hPtr; // Workaround for vc7 beta1 bug
    iValue = 1;
    _ReadWriteBarrier();
    for (;;) {
        iValue = _InterlockedExchange ((long*) hPtrTmp, iValue);
        if (0 == iValue)
            return;
        while (*hPtrTmp)
            ; // Do nothing
    }
    _ReadWriteBarrier();
}
# else
__inline void Lock (volatile int *hPtr)
{
    __asm {
        mov ecx, hPtr
    la: mov eax, 1
        xchg eax, [ecx]
        test eax, eax
        jz end
    lb: mov eax, [ecx]
        test eax, eax
        jz la
        jmp lb
    end:
    }
}
# endif

#elif ((defined (_M_IA64) || defined(_M_AMD64)) && !defined(NT_INTEREX))
# include <windows.h>
extern "C" void __cdecl _ReadWriteBarrier(void);
# pragma intrinsic (_InterlockedExchange)
# pragma intrinsic (_ReadWriteBarrier)
typedef volatile LONG lock_t[1];
# define LockInit(v) ((v)[0] = 0)
# define LockFree(v) ((v)[0] = 0)
# define UnLock(v)   ((v)[0] = 0)
__forceinline void Lock (volatile LONG *hPtr)
{
    int iValue;

    _ReadWriteBarrier();
    for (;;) {
        iValue = _InterlockedExchange ((LPLONG) hPtr, 1);
        if (0 == iValue)
            return;
        while (*hPtr)
            ; // Do nothing
    }
    _ReadWriteBarrier();
}

#else  /* NT non-Alpha/Intel, without assembler Lock() */
# define lock_t      volatile int
# define LockInit(v) ((v) = 0)
# define LockFree(v) ((v) = 0)
# define Lock(v)     do { \
    while(InterlockedExchange((LPLONG)&(v),1) != 0); \
  } while (0)
# define UnLock(v)   ((v) = 0)
#endif /* architecture check */

#else  /* not NT, assume SMP using POSIX threads (LINUX, etc) */
#include <pthread.h>

#if defined(MUTEX)
# define Lock(v)     pthread_mutex_lock(&v)
# define LockInit(v) pthread_mutex_init(&v,0)
# define LockFree(v) pthread_mutex_destroy(&v)
# define UnLock(v)   pthread_mutex_unlock(&v)
# define lock_t      pthread_mutex_t
#elif defined(ALPHA)
# include <machine/builtins.h>
# define lock_t      volatile long
# define LockInit(v) ((v) = 0)
# define LockFree(v) ((v) = 0)
# define Lock(v)     __LOCK_LONG(&(v))
# define UnLock(v)   __UNLOCK_LONG(&(v))
#else  /* POSIX, but not using MUTEXes */
# define exchange(adr,reg) \
  ({ volatile int _ret; \
     asm volatile ("xchgw %0,%1" \
       : "=q" (_ret), "=m" (*(adr))   /* Output %0,%1 */ \
       : "m" (*(adr)), "0" (reg));    /* Input (%2),%0 */ \
     _ret; \
  })
# define LockInit(p) (p=0)
# define LockFree(p) (p=0)
# define UnLock(p)   (exchange(&p,0))
# define Lock(p)     while(exchange(&p,1)) while(p)
# define lock_t      volatile int
#endif /* MUTEX */

#if defined(CLONE)
# define tfork(t,f,p) { \
    char *m=malloc(0x100010); \
    if (m <= 0) printf("malloc() failed\n"); \
    (void) clone(f, \
                 m+0x100000, \
                 CLONE_VM+CLONE_FILES+17, \
                 (void*) p); }
#else
# define tfork(t,f,p) pthread_create(&t,&pthread_attr,f,(void*) p)
#endif

#endif /* NT or POSIX */

#else
# define LockInit(p)
# define LockFree(p)
# define Lock(p)
# define UnLock(p)
#endif
This is very well-tuned code.
I'm trying to multi-thread my application. It reads data from a SAP system by doing reads from multiple tables, and it is this process that I was hoping to multi-thread. There is an include file that handles the SAP connection:
public ref class SAPSystem
{
public:
    SAPLogonCtrl::SAPLogonControlClass ^SAPLogon;
    SAPLogonCtrl::Connection ^SAPConnection;
    SAPFunctionsOCX::SAPFunctionsClass ^SAPFunctions;

public:
    ~SAPSystem()
    {
        ...
    }

    SAPSystem()
    {
        ...
    }

    bool Logon()
    {
        ...
    }

    void Logoff()
    {
        ...
    }

    bool RFC_READ_TABLE(...)
    {
        ...
    }
This include controls (amongst other things) the logging onto and off the SAP system as well as the extraction of data from a SAP table via the RFC_READ_TABLE function module (which is a SAP function module which has been RFC enabled to allow third party software to access the SAP tables).
All good and well so far.
On my form I declared my SAPSystem connection as a global variable with:
public ref class Form1 : public System::Windows::Forms::Form
{
    SAPSystem^ SAPsystem;
    .
    .
    .
I set up my program to do the following when the extraction button is pushed (I've omitted a lot of the logic and tried to emphasize the program flow):
private: System::Void butExt_Click(System::Object^ sender, System::EventArgs^ e)
{
    ThreadStart^ threadStart;
    Thread^ newThread1;
    Thread^ newThread2;
    WorkList_List^ worklist;

    SAPsystem = gcnew SAPSystem;

    //Logon to the SAP system
    if (SAPsystem->Logon() == true)
    {
        //Build a list of tables to be extracted
        threadStart = gcnew ThreadStart(this, &MultiThreadingTest::Form1::ThreadClassMethod1);
        worklist->AddRow(threadStart);
        threadStart = gcnew ThreadStart(this, &MultiThreadingTest::Form1::ThreadClassMethod2);
        worklist->AddRow(threadStart, Checkbox, true);

        //Loop through the worklist kicking off jobs as threads become available
        while (worklistindex != -1 || ThreadAllFinished == false)
        {
            if (newThread1->IsAlive == false && worklistindex != -1)
            {
                newThread1 = gcnew Thread(worklist->WindowsTable[ThreadCounter]->name);
                newThread1->Start();
            }
            else if (newThread2->IsAlive == false && worklistindex != -1)
            {
                newThread2 = gcnew Thread(worklist->WindowsTable[ThreadCounter]->name);
                newThread2->Start();
            }
        }
        SAPsystem->Logoff();
    }
}
On my form I then set up two thread methods to access different tables within the SAP system, something like this:
public: void ThreadClassMethod1()
{
    SAPSystem->RFC_READ_TABLE(<<First table’s parameters>>)
}

public: void ThreadClassMethod2()
{
    SAPSystem->RFC_READ_TABLE(<<Second table’s parameters>>)
}
All this works great with the two threads being kicked off, but the whole shooting match comes to a shuddering halt when both threads stop at the first line in the RFC_READ_TABLE routine. I suspected that it may be that they were both trying to access a single instance of a routine in an object that was declared in another thread (I'm a numpty, so I may be using the wrong nomenclature here; please forgive me). I tried having each ThreadClassMethod create its own SAPSystem (using a silent logon, i.e. capturing the system and user details on the first logon and using them to create a connection to the SAP system in the worker threads), but that didn't help either.
Can anyone help as to what I’m doing wrong and how to fix it?
Background
I am just starting to use C++ AMP for a project and I have purchased a dedicated (no screen attached) GPU (a Sapphire 7970 with 6 GB of memory).
Question
accelerator.dedicated_memory returns 2,008,344 (KB) when I would expect it to return close to 6 GB. Any ideas on the large difference?
For information
- Running on Windows 7 64bit
- It is picking up the correct GPU (all the other information returned matches, and the GPU used for the display has 1 GB)
- I have checked and the GPU is running the latest drivers
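For reference, a small sketch that prints what the runtime reports for every accelerator it can see (dedicated_memory is reported in kilobytes); comparing all the entries can help confirm which adapter the value belongs to. The properties used here are the standard accelerator members:

// Enumerate all C++ AMP accelerators and print their reported memory.
#include <amp.h>
#include <iostream>

int main()
{
    using namespace concurrency;

    for (const accelerator& acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L"\n  device path:      " << acc.device_path
                   << L"\n  dedicated_memory: " << acc.dedicated_memory << L" KB"
                   << L"\n  has_display:      " << (acc.has_display ? L"yes" : L"no")
                   << L"\n  emulated:         " << (acc.is_emulated ? L"yes" : L"no")
                   << std::endl;
    }
    return 0;
}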
Hi.
Can anyone tell me about this?
Can we use C++ AMP and multithreading at the same time?
By multithreading I mean CPU-only multitasking, using multiple cores of the CPU.
And does C++ AMP plus multithreading give better performance than C++ AMP alone?
Thanks for reading.
Best regards.
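For illustration only, a sketch of what combining the two can look like: ordinary CPU threads (std::thread), each submitting its own C++ AMP kernel. Whether this outperforms a single-threaded C++ AMP program depends entirely on how much CPU-side work there is to overlap with the GPU; the sizes and the kernel here are placeholders:

// Sketch: CPU threads and C++ AMP used together.
#include <amp.h>
#include <thread>
#include <vector>
#include <iostream>

void square_on_gpu(std::vector<float>& v)
{
    using namespace concurrency;
    array_view<float, 1> av(static_cast<int>(v.size()), v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] = av[idx] * av[idx];
    });
    av.synchronize();
}

int main()
{
    std::vector<float> a(1 << 20, 2.0f);
    std::vector<float> b(1 << 20, 3.0f);

    // Two CPU threads, each dispatching its own C++ AMP kernel.
    std::thread t1(square_on_gpu, std::ref(a));
    std::thread t2(square_on_gpu, std::ref(b));
    t1.join();
    t2.join();

    std::cout << a[0] << " " << b[0] << std::endl; // prints 4 and 9
    return 0;
}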
Hello all. I'm trying to reduce code size and duplication in my AMP code and am wondering if the following design pattern will work:
array_view<const float, 2> DataBuffer(dataExtent, &dataBuffer[0]);
array_view<float, 2> IntermediateBuffer(intermediateExtent, &intermediateBuffer[0]);
IntermediateBuffer.discard_data();
// Invoke first AMP kernel:
parallel_for_each(...) restrict(amp)
{
.
.
Generate/store results in "IntermediateBuffer".
.
.
.});
// With "IntermediateBuffer" still on the GPU, check some condition and process, storing the
// results to another buffer:
if(...)
{
array_view<float, 2> FinalOutputBuffer1(finalOutputExtent, &finalOutputBuffer1[0]);
Function1(FinalOutputBuffer1);
}
else
{
array_view<int, 2> FinalOutputBuffer2(finalOutputExtent, &finalOutputBuffer2[0]);
Function2(FinalOutputBuffer2);
}
// Don't bother retrieving the IntermediateBuffer results:
IntermediateBuffer.discard_data();
Where:
void Function1(array_view<float, 2> finalArrayView)
{
// Invoke AMP kernel that uses "IntermediateBuffer" from first kernel to generate and store
// results to "FinalOutputBuffer1" array view. This assumes "IntermediateBuffer" will still
// be resident on GPU:
parallel_for_each(...) restrict(amp)
{
.
.
Use results in "IntermediateBuffer" to generate/store final result in "finalArrayView"
.
.
});
finalArrayView.synchronize();
}
void Function2(array_view<int, 2> finalArrayView)
{
// Invoke AMP kernel that uses "IntermediateBuffer" from first kernel to generate and store
// results to "FinalOutputBuffer2" array view. This assumes "IntermediateBuffer" will still
// be resident on GPU:
parallel_for_each(...) restrict(amp)
{
.
.
Use results in "IntermediateBuffer" to generate/store final result in "finalArrayView"
.
.
});
finalArrayView.synchronize();
}
The above relies on the assumption that "IntermediateBuffer" will stay on the GPU for use by whichever function is invoked after the first kernel. It also assumes that I can pass an array_view to a function and have an AMP kernel within that function use this array_view. Is this correct? Thank you in advance.
-L
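For what it's worth, a compilable sketch of the pattern being asked about: array_views passed by value into a helper function that launches its own parallel_for_each. The names, sizes, and the scaling kernel are placeholders, not the original buffers:

// Sketch: passing array_views into a helper that launches its own kernel.
#include <amp.h>
#include <vector>

using namespace concurrency;

// array_view is a cheap handle, so passing it by value is the normal idiom.
void ScaleInto(array_view<const float, 2> intermediate,
               array_view<float, 2> finalOut,
               float factor)
{
    parallel_for_each(finalOut.extent, [=](index<2> idx) restrict(amp)
    {
        finalOut[idx] = intermediate[idx] * factor;
    });
    finalOut.synchronize();
}

int main()
{
    const int rows = 256, cols = 256;
    std::vector<float> intermediateData(rows * cols, 1.0f);
    std::vector<float> finalData(rows * cols, 0.0f);

    array_view<const float, 2> intermediate(rows, cols, intermediateData);
    array_view<float, 2> finalOut(rows, cols, finalData);
    finalOut.discard_data();

    // The data referenced by 'intermediate' should stay resident on the
    // accelerator between kernels as long as the host doesn't touch it.
    ScaleInto(intermediate, finalOut, 2.0f);
    return 0;
}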
Hi,
I discovered a strange behavior with the C++ compiler (VS 2008, SP1) with OpenMP enabled.
When I use an OpenMP section, I use the clause "default(none)" to avoid writing to or using variables I don't intend to. And to avoid adding variables to the shared clause, I declare my read-only variables "const".
It works great except for references. See the sample below (I made this sample as small as possible and replaced the real code with sample code):
#include <stdio.h>

struct CMyClass
{
    float m_fValue;
};

void TestRef(const CMyClass & const i_Center)
{
    #pragma omp parallel default(none) num_threads(4)
    {
        printf("%f", i_Center.m_fValue); // Error C3052 !!
    }
}

void TestPtr(const CMyClass *const i_pCenter)
{
    #pragma omp parallel default(none) num_threads(4)
    {
        printf("%f", i_pCenter->m_fValue); // No Error with const pointer
    }
}

int main()
{
    CMyClass object;
    TestRef(object);
    TestPtr(&object);
    return 0;
}
When the program is compiled, I get this error:
error C3052: 'i_Center' : variable doesn't appear in a data-sharing clause under a default(none) clause
In the sample you can see I added a const after the reference ("const CMyClass& const") to check whether the error would go away, but it doesn't (and this const is useless for a reference anyway...).
So is this error a compiler bug?
Or did I make a mistake in the declaration of "i_Center"? (If it's an error in my code, where do I need to put the const to remove the error?)
I know that I can solve the error by adding "shared(i_Center)", but that's not my goal, since I don't want to modify this variable in the threads and I want it to be read-only.
regards,
François.
Hi
I wrote a test program which only copies a sequence of data from host memory to GPU memory. To improve the transfer speed, I've tried GPU memory warm-up and staging arrays. But the results are still not satisfying: transferring 8 MB takes nearly 80 ms, about 800 Mb/s.
What's more suspicious is that after applying staging arrays the transfer speed isn't improved much; it's just more stable across test iterations.
Platform: Intel Ivybridge, i5 3450 with HD2500
My test code is posted here:
#include <amp.h>
#include <windows.h>
#include <iostream>
#include <algorithm>   // std::generate
#include <cstdio>      // printf
#include <tchar.h>     // _tmain / _TCHAR

#define RUN_TIMES 10

using namespace concurrency;

void WarmupDeviceData(array<int, 2> &data)
{
    parallel_for_each(data.extent, [&](index<2> idx) restrict(amp)
    {
        data[idx] = 0xBADDF00D;
    });
}

int _tmain(int argc, _TCHAR* argv[])
{
    accelerator default_device;
    accelerator cpuAcc = accelerator(accelerator::cpu_accelerator);

    int width = 3264;
    int height = 2448;
    int data_len = width * height;

    // Staging array: accessible on the CPU, associated with the device view.
    array<int, 2> stagingArray(height, width, cpuAcc.default_view, default_device.default_view);
    int i = 0;
    std::generate(stagingArray.data(), stagingArray.data() + data_len, [&i]() { return i++; });

    array<int, 2> deviceArray(height, width, default_device.default_view);
    WarmupDeviceData(deviceArray);

    LARGE_INTEGER startTimeQPC = {0};
    LARGE_INTEGER endTimeQPC = {0};
    LARGE_INTEGER frequency = {0};
    QueryPerformanceFrequency(&frequency);
    double acc_duration = 0;

    WarmupDeviceData(deviceArray);

    for (int i = 0; i < RUN_TIMES; i++)
    {
        QueryPerformanceCounter(&startTimeQPC);
        copy(stagingArray, deviceArray);
        QueryPerformanceCounter(&endTimeQPC);

        double duration = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;
        acc_duration += duration;
        printf("Copy in time: %.1f ms\n", duration);
    }
    printf("Mean: %.1f ms\n", acc_duration / RUN_TIMES);
    return 0;
}
Thanks very much~
Hello,
I have developed a DLL using AMP. Now I have a simple exe to test my DLL. Here is my simple exe source code:
#include"AteAMP.h"
int main()
{
AteNgateAmpMainMatcher amp_matcher;
amp_matcher.ampMatchMain();
system("pause");
return 0;
}
When I run this simple program, it fails with the following error:
Exercise Solutions.exe Application Error
The application was unable to start correctly (0xc000a200). Click OK to close the application
I am using a Windows 8 computer with VS 2012.