Recently we did some research on image processing using GPGPU, and we tried both OpenCL and C++ AMP. We found that AMP performs worse than OpenCL, taking more time for the same workload, and that the bottleneck is the memory transfer between the host and the GPU. So we wrote a test that uses AMP or OpenCL just to copy the values of one vector to another. The test consists of three phases: copying data from the host to the GPU, execution on the GPU, and copying data back from the GPU to the host, and we recorded the time spent in each phase:
Platform: Ivy Bridge, Intel Core i5-3450 with HD Graphics 2500
Data length: 3264*2448*sizeof(int) bytes
OpenCL config: global_size = 3264*2448, local_size = 1; the copies are done with clEnqueueWriteBuffer and clEnqueueReadBuffer (see the first sketch below).
AMP config: an untiled array<int, 1> a(3264*2448); the copies are done with copy() (see the second sketch below).
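For reference, the OpenCL side of the test looks roughly like the sketch below. It is only an outline of what we run: error checking, platform/device setup and the timing calls around each phase are left out, and the kernel and function names are just placeholders.

```cpp
// Rough outline of the OpenCL test (not our exact code): error checking,
// platform/device setup and the timing calls around each phase are omitted.
#include <CL/cl.h>
#include <vector>

// Trivial kernel that copies one vector to another.
static const char* kSource = R"CLC(
__kernel void copy_vec(__global const int* src, __global int* dst) {
    size_t i = get_global_id(0);
    dst[i] = src[i];
}
)CLC";

void run_opencl_copy_test(cl_context ctx, cl_command_queue queue, cl_device_id dev) {
    const size_t n = 3264 * 2448;
    std::vector<int> host_src(n, 1), host_dst(n);
    cl_int err = CL_SUCCESS;

    cl_mem src = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(int), nullptr, &err);
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(int), nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "copy_vec", &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &src);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);

    // Phase 1: copy from host to GPU (timed).
    clEnqueueWriteBuffer(queue, src, CL_TRUE, 0, n * sizeof(int),
                         host_src.data(), 0, nullptr, nullptr);

    // Phase 2: execution on the GPU, global_size = 3264*2448, local_size = 1 (timed).
    size_t global_size = n, local_size = 1;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, &local_size,
                           0, nullptr, nullptr);
    clFinish(queue);

    // Phase 3: copy back from GPU to host (timed).
    clEnqueueReadBuffer(queue, dst, CL_TRUE, 0, n * sizeof(int),
                        host_dst.data(), 0, nullptr, nullptr);

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseMemObject(src);
    clReleaseMemObject(dst);
}
```

The C++ AMP side is roughly the following, again a simplified outline with the timing calls left out and the function name chosen just for illustration:

```cpp
// Rough outline of the C++ AMP test (not our exact code): timing calls omitted.
#include <amp.h>
#include <vector>
using namespace concurrency;

void run_amp_copy_test() {
    const int n = 3264 * 2448;
    std::vector<int> host_src(n, 1), host_dst(n);

    // GPU-side buffers, no tiling anywhere.
    array<int, 1> src(n);
    array<int, 1> dst(n);

    // Phase 1: copy from host to GPU (timed).
    copy(host_src.begin(), host_src.end(), src);

    // Phase 2: execution on the GPU, untiled parallel_for_each (timed).
    parallel_for_each(src.extent, [&src, &dst](index<1> idx) restrict(amp) {
        dst[idx] = src[idx];
    });
    src.accelerator_view.wait();   // make sure the kernel has finished before stopping the timer

    // Phase 3: copy back from GPU to host (timed).
    copy(dst, host_dst.begin());
}
```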
|        | Copy to GPU | Execution in GPU | Copy back to CPU |
| OpenCL | 7.45 ms     | 100.57 ms        | 7.68 ms          |
| AMP    | 98.64 ms    | 39.4 ms          | 89.11 ms         |
As the numbers show, AMP spends much more time on memory transfer between the host and the GPU, but less time on GPU execution. Our guess is that AMP copies the data into a buffer that sits closer to the GPU than the one OpenCL uses, so during kernel execution OpenCL spends much more time fetching the data from its buffer.
However, increasing the local_size in OpenCL can help shorten the GPU execution time:
| local_size              | 1         | 4        | 8        | 16       | 32       | 64       |
| GPU execution in OpenCL | 100.57 ms | 35.34 ms | 18.13 ms | 10.37 ms | 10.15 ms | 10.21 ms |
The time spent in the memory copies stays almost the same, since it has nothing to do with local_size.
But in AMP the bottleneck is the memory transfer, not the GPU execution, and changing the tile size doesn't help (in fact, it increased the GPU execution time in our test; see the sketch below). So once the local_size in OpenCL is increased, OpenCL outperforms AMP by a wide margin.
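Here is roughly what the tiled variant we tried looks like (tile size 64 is only an example that evenly divides the extent; the function and variable names are placeholders). With this variant the copy times stay the same and the kernel time actually went up on our machine:

```cpp
// Tiled variant of the AMP copy kernel (tile size 64 shown; 3264*2448 is
// evenly divisible by 64). Names are just illustrative.
#include <amp.h>
using namespace concurrency;

void copy_tiled(const array<int, 1>& src, array<int, 1>& dst) {
    static const int tile_size = 64;
    parallel_for_each(src.extent.tile<tile_size>(),
        [&src, &dst](tiled_index<tile_size> tidx) restrict(amp) {
            dst[tidx.global] = src[tidx.global];
        });
}
```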
Has anyone come across similar issues? What was your solution? Any advice is appreciated.
Thanks a lot.