Channel: Parallel Computing in C++ and Native Code forum
Viewing all 856 articles

How to marshal std::wstring returned from accelerator::get_description()?


I'm using C++ AMP from my C# application via a simple DLL I'm writing and the .NET PInvoke mechanism. I'd like to expose some of the accelerator properties to my users so that those with more than one GPU can intelligently target a particular one. Doing so requires the device_path and description properties of the accelerator object, both of which are std::wstring. I've spent an entire (hair-pulling) day trying to get a std::wstring out of C++ and into C# (and vice versa) with no luck at all. In fact, the internet seems littered with the corpses of those who came before me and tried the same thing. Why does this need to be so difficult? Why wasn't a more "marshal-friendly" type chosen to represent these data members? Can one of the AMP designers please show us how this can be done? Or better yet, how to marshal an entire accelerator object, if that's possible? Thank you in advance,
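One pattern that has worked for others (a sketch, not an official answer; the exported function name and the ask-for-size convention are my own assumptions): never let std::wstring cross the DLL boundary at all. Export a flat C function that copies the description into a caller-allocated wchar_t buffer. The get_description stand-in below replaces the real accelerator call so the sketch is self-contained; on Windows you would call the real C++ AMP API and add __declspec(dllexport):

```cpp
#include <cwchar>
#include <string>

// Stand-in for accelerator::get_description() (assumption, for illustration).
static std::wstring get_description() { return L"Hypothetical GPU device"; }

// Flat C export: copies the description into a caller-allocated buffer.
// Returns the number of characters copied (excluding the terminator), or
// the required capacity when called with a null buffer.
extern "C" int GetAcceleratorDescription(wchar_t* buffer, int capacity)
{
    std::wstring desc = get_description();
    if (buffer == nullptr || capacity <= 0)
        return (int)desc.size() + 1;            // caller asks for the size
    std::size_t n = desc.size();
    if (n > (std::size_t)(capacity - 1))
        n = (std::size_t)(capacity - 1);        // truncate to fit
    std::wmemcpy(buffer, desc.c_str(), n);
    buffer[n] = L'\0';
    return (int)n;
}
```

On the C# side the matching declaration would be along the lines of `[DllImport("MyDll", CharSet = CharSet.Unicode)] static extern int GetAcceleratorDescription(StringBuilder buffer, int capacity);` with a pre-sized StringBuilder; the same shape works for device_path.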

-L





PPL: What is the equivalent of #pragma omp atomic?


Hi all,

What is the equivalent of the OpenMP [*] "#pragma omp atomic" construct in a PPL parallel_for lambda?

int* a;
int i;
int x;
[...]
#pragma omp atomic
a[i] += x;

I looked into std::atomic<T>, but copying all the data to and from an atomic<T> array is far too expensive, and so is critical_section::scoped_lock. Always storing the data in an atomic<T> array is not an option either (it adds overhead when atomicity is not needed, and it would require modifying a lot of code).

I can accept an "ugly" solution based on casting and/or assembler and/or intrinsics if that is possible; I only need it to work on x64, VS2012 and Windows 7 or later.
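For what it's worth, the closest moral equivalent I know of is to view the existing int in place as an atomic and use fetch_add. This is a sketch relying on an assumption: that std::atomic<int> is layout-compatible with int, which MSVC on x64 satisfies but the C++ standard does not guarantee:

```cpp
#include <atomic>

// Guard the layout assumption at compile time.
static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "layout assumption: atomic<int> must overlay int");

// Emulates "#pragma omp atomic" on a plain int, without storing the whole
// array as std::atomic<int>. ASSUMPTION: std::atomic<int> has the same
// size/layout as int (true on MSVC x64 and mainstream ABIs). On MSVC you
// could alternatively call the _InterlockedExchangeAdd intrinsic directly.
inline void atomic_add(int* p, int x)
{
    reinterpret_cast<std::atomic<int>*>(p)
        ->fetch_add(x, std::memory_order_relaxed);
}
```

Inside a parallel_for body, `atomic_add(&a[i], x);` would then take the place of the OpenMP construct.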

Cheers,

T


*Moving away from OpenMP, mostly because the MS implementation leaks in my use case (see http://openmp.org/forum/viewtopic.php?f=3&t=907;
the http://connect.microsoft.com link has been broken for a few weeks).


More PInvoke trouble...this time Boolean[]


Hello fellow "AMPlifiers", my final PInvoke hurdle involves getting a 1-D Boolean[] from my C# code into AMP, and this gets ugly. I can successfully marshal it into the plain C++ bool type, but when I tried to use the array in a GPU kernel, AMP gave a compile-time error saying it needs value types whose size is a multiple of the size of an int. Consulting the Microsoft docs on marshaling booleans here:

http://msdn.microsoft.com/en-us/library/ms182206(v=vs.80).aspx

I tried marshaling it as "UnmanagedType.Bool", which is supposed to be a 4-byte type, and changed the DLL function parameter from bool* to int*. Now I get an unspecified runtime error. Does anyone know how to get an array of type Boolean into AMP? Here's what I've got (I'm trying this in 64-bit mode; once it's working I'll compile a second version for 32-bit processes):

AMP Dll function:

extern "C" __declspec(dllexport) void __stdcall BooleanTest_x64(int* boolArray, int length)
{
    array_view<int, 1> boolArrayView(length, &boolArray[0]);

    parallel_for_each(boolArrayView.extent, [=](index<1> idx) restrict(amp)
    {
        if (boolArrayView[idx])
        {
            boolArrayView[idx] = 0; // Make false.
        }
    });
}

C# declaration:

[DllImport("MyDll", CallingConvention = CallingConvention.StdCall)]
extern unsafe static void BooleanTest_x64(int* boolArray, int len);

Invoke from C# code:

Boolean[] TestArray = new Boolean[3];
TestArray[0] = true;
TestArray[1] = false;
TestArray[2] = false;

unsafe
{
    fixed (Boolean* boolArrayPtr = &TestArray[0])
    {
        BooleanTest_x64(boolArrayPtr, 3);
    }
}
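A workaround that sidesteps the marshaling question entirely (a sketch; widen and narrow are hypothetical helper names): since C++ AMP only accepts element types whose size is a multiple of 4 bytes, convert the bool array to a 32-bit int array on the native side before building the array_view, and convert back afterwards:

```cpp
#include <cstdint>
#include <vector>

// Widen a 1-byte bool array to int32_t so it can back an array_view<int, 1>.
std::vector<int32_t> widen(const bool* src, int n)
{
    std::vector<int32_t> out(n);
    for (int i = 0; i < n; ++i)
        out[i] = src[i] ? 1 : 0;
    return out;
}

// Copy the (possibly modified) int results back into the bool array.
void narrow(const std::vector<int32_t>& src, bool* dst)
{
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = src[i] != 0;
}
```

The exported function can then keep an int* signature toward C#, or accept bool* and widen internally before touching AMP.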

Distributing processing to unevenly powerful processing elements and gathering results


Greetings!

Let’s assume I have a function with intensive nested loops. In addition to the CPU cores, it can also be executed on some processing element with distinctly more processing power, say a distributed multi-core machine or a GPU. What I would like to do is schedule the function on each CPU core only once (the weaker processing elements in this case) and, while they are processing, keep scheduling additional work on the stronger processor until all the CPU cores have finished. From the results only a minimum or maximum element is selected – or something to that effect.

I was wondering if there’s a preferred pattern for a situation like this? Going through the Concurrency Runtime examples, it looks as if the Image Processing Network example could be a good starting point.

My initial thinking is/was that I could use a shared concurrent_vector into which the tasks write their results for post-processing (e.g. selecting with std::max), launch a bunch of tasks, and keep a note in some variable or lock when all the CPU tasks have finished, in the meantime re-adding work on the more powerful processing element. Then I came across the combinable object and got thinking whether there's a more elegant way of achieving this (it feels like a case of map-reduce in some parallel language). Hence my question: does anyone know of a pattern like this, or is there a semi-snippet, blog post, or something similar that I haven’t managed to find?

I'll try to keep this updated as I write some code in the coming week. As an added note, it looks like the auto-vectorizer is able to vectorize the cases where there are two inner loops, one nested, but the CPU could still be utilized better by using more cores, I think.
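The shape I'd sketch for this (plain std::thread standing in for PPL tasks; process_chunk and the chunk counts are made-up placeholders) is a shared atomic chunk counter: each weak worker takes one chunk, the strong worker keeps pulling chunks until none remain, and all results fold into a single max under a lock, much like the concurrent_vector/combinable idea above:

```cpp
#include <algorithm>
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

int process_chunk(int chunk) { return chunk * chunk; } // stand-in workload

// Returns the maximum result over numChunks chunks, processed by numWeak
// one-shot workers plus one greedy "strong" worker that drains the rest.
int run(int numChunks, int numWeak)
{
    std::atomic<int> next{0}; // shared work queue: index of the next chunk
    std::mutex m;
    int best = 0;             // reduction target (max over all results)

    auto work = [&](bool greedy) {
        int c;
        while ((c = next.fetch_add(1)) < numChunks) {
            int r = process_chunk(c);
            std::lock_guard<std::mutex> g(m);
            best = std::max(best, r);
            if (!greedy) break; // weak workers take exactly one chunk
        }
    };

    std::vector<std::thread> weak;
    for (int i = 0; i < numWeak; ++i)
        weak.emplace_back(work, false);
    work(true);               // strong worker runs until the chunks run out
    for (auto& t : weak) t.join();
    return best;
}
```

In the real version the strong worker would stop re-queuing once all CPU tasks report done, rather than draining everything itself.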


Sudet ulvovat -- karavaani kulkee


Why is this AMP code crashing the video driver?


Hi folks.

I thought I had my first amp function working pretty well until I tried a very simple little optimization experiment which is causing the video driver to crash. The background: I'm writing the AMP code in a native Dll that I'm calling from my C# class library via PInvoke. I'm passing into amp a pair of 1-D arrays of type "Single" as well as a few ints and floats.

Here's the Dll declaration as it appears in C#:

[DllImport("MyDll", CallingConvention = CallingConvention.StdCall)]
extern static void AmpFunc([MarshalAs(UnmanagedType.LPArray)] [In, Out] Single[] dataBuffer,
                                         [In] Int32 dataBufferRows, [In] Int32 dataBufferColumns,
                                         [MarshalAs(UnmanagedType.LPArray)] [In, Out] Single[] dataSource,
                                         [In] Int32 dataSourceRows, [In] Int32 dataSourceColumns,
                                         [In] Int32 roiX, [In] Int32 roiY, [In] Int32 roiWidth, [In] Int32, etc);

My AMP function is as follows:

extern "C" __declspec(dllexport) void __stdcall AmpFunc(float* dataBuffer, int DataBufferRows,
                                                        int DataBufferColumns, float* dataSource,
                                                        int DataSourceRows, etc...)
{
array_view<const float, 1> DataBuffer(DataBufferRows*DataBufferColumns, &dataBuffer[0]);
array_view<float, 1> DataSource(DataSourceRows*DataSourceColumns, &dataSource[0]);

int TotalNumThreads = RoiWidth*RoiHeight;

parallel_for_each(extent<1>(TotalNumThreads), [=](index<1> bufferIndex) restrict(amp)
{
    // Compute this threads' location:
    int RoiRow = bufferIndex[0] / RoiWidth;
    int RoiColumn = bufferIndex[0] - (RoiRow * RoiWidth);
    .
    .
    int Offset = (RoiRow - KernelRadius) * DataBufferColumns + RoiColumn; // You get the idea

    for(int x = 0; x < KernelRows; x++)
    {
        bufferIndex[0] = Offset;
        for(int y = 0; y < KernelColumns; y++)
        {
        .
        .
        Do operations in the array region surrounding this thread
        .
        .
        bufferIndex[0]++;
        }

    Offset += DataBufferColumns; // move to next row.
    }
.
.
});

DataSource.synchronize();
}

My AMP kernel runs very well and produces the correct result. However, when I realized each thread was computing some values unnecessarily in the nested loop I decided to alter the code to accept a small lookup table (a 1-D array of length KernelRows * KernelColumns, usually 8x8) that the threads can use. They will only read from this array, not write to it. This addition is what's causing the video driver to crash. I've altered the C# declaration with the following addition:

[DllImport("MyDll", CallingConvention = CallingConvention.StdCall)]
extern static void AmpFunc([MarshalAs(UnmanagedType.LPArray)] [In, Out] Single[] dataBuffer,
                                         [In] Int32 dataBufferRows, [In] Int32 dataBufferColumns,
                                         [MarshalAs(UnmanagedType.LPArray)] [In, Out] Single[] dataSource,
                                         [In] Int32 dataSourceRows, [In] Int32 dataSourceColumns,
                                         [MarshalAs(UnmanagedType.LPArray)] [In, Out] Single[] euclideanDistances,
                                         [In] Int32 roiX, [In] Int32 roiY, [In] Int32 roiWidth, [In] Int32, etc);

And my AMP function is also modified as:

extern "C" __declspec(dllexport) void __stdcall AmpFunc(float* dataBuffer, int DataBufferRows,
                                                        int DataBufferColumns, float* dataSource,
                                                        int DataSourceRows,
                                                        float* euclideanDistances, etc...)
{   
array_view<const float, 1> DataBuffer(DataBufferRows*DataBufferColumns, &dataBuffer[0]);
array_view<float, 1> DataSource(DataSourceRows*DataSourceColumns, &dataSource[0]);
array_view<const float, 1> EuclideanDistances(KernelRows*KernelColumns, &euclideanDistances[0]);

int TotalNumThreads = RoiWidth*RoiHeight;

parallel_for_each(extent<1>(TotalNumThreads), [=](index<1> bufferIndex) restrict(amp)
{
    int RoiRow = bufferIndex[0] / RoiWidth;
    int RoiColumn = bufferIndex[0] - (RoiRow * RoiWidth);
    .
    .
    int Offset = (RoiRow - KernelRadius) * DataBufferColumns + RoiColumn; // You get the idea

    index<1> euclidIndex(0);
    for(int x = 0; x < KernelRows; x++)
    {
        bufferIndex[0] = Offset;
        for(int y = 0; y < KernelColumns; y++)
        {
        .
        .
        Do operations in the array region surrounding this thread
         Access the lookup table sequentially in some equation as:
        "..... * EuclideanDistances[euclidIndex]));"
        .
        bufferIndex[0]++;
        euclidIndex[0]++;
        }

    Offset += DataBufferColumns; // move to next row.
    }
.
.
});

DataSource.synchronize();
}

I've verified that the lookup table array is the proper size and type, not null, etc. Everything checks out. However, as soon as I try to access it through the "EuclideanDistances" array_view, it crashes the video driver, even if I change the code in the loop to access only its first element on every iteration. If I comment out the access line where I have "... * EuclideanDistances[euclidIndex];", the code runs without trouble (but obviously doesn't produce the correct result). Does anyone have any ideas what's going on here?

-L




OpenMP compiler bug, only x64 debug non optimized


Hi everyone,

I tried to port some parallel SSE code to OpenMP (parallelization was previously done using the Win32 thread API). The unit tests work fine for the Win32 debug and release configurations, and x64 release works fine too, but x64 debug fails randomly. During debugging I found that the problem can also be reproduced without using SSE.

The bug can be reproduced under the following conditions:

  • OpenMP for loop
  • Instance of a class is passed by value to a function
  • Class size > 8 bytes (i.e. larger than a pointer, but why?)
  • Class has no copy constructor, so compiler generates its own one

If I pass by reference, implement the copy constructor, or switch to single-threaded, everything works fine.

I tried the code on VS2005, VS2010 and VS2012, always with the same result. Does anyone have an idea what's wrong?

Here is example code:

#include <iostream>
#include <omp.h>

#define SIZE 4 // code works with 1 and 2

struct S
{
    //S()
    //{
    //}
    //S( const S &other )
    //{
    //    for ( int i = 0; i < SIZE; ++i )
    //    {
    //        data[ i ] = other.data[ i ];
    //    }
    //}
    int data[ SIZE ];
};

S foo( S s )
//S foo( const S &s )
{
    return s;
}

int main( int argc, char **argv )
{
    #pragma omp parallel for
    for ( int i = 0; i < 100000; ++i )
    {
        S dummy;
        for ( int j = 0; j < SIZE; ++j )
        {
            dummy.data[ j ] = i;
        }

        dummy = foo( dummy );

        for ( int j = 0; j < SIZE; ++j )
        {
            if ( dummy.data[ j ] != i )
            {
                std::cout << ".";
                break;
            }
        }
    }

    return 0;
}

Thanks,

Markus


C++ AMP and Direct3D VSync?


I'm using Direct3D to present the result of my C++ AMP computation and have noticed that the C++ AMP calculations are a lot slower when using vertical sync (on a different Direct3D device than the one C++ AMP uses).

i.e. the following is slow:

scoped_blocking_signal blocking_signal;
this->swap_chain->Present(1, 0);

While either of the following is faster:

scoped_blocking_signal blocking_signal;
this->swap_chain->Present(1, 0);
Sleep(20);

this->swap_chain->Present(0, 0);

I have checked CPU usage and the vertical sync is not busy waiting. I'm using Nvidia cards.

Any ideas as to what might be causing this?

I'm currently on Windows 7, but once I've upgraded to Windows 8 I hope to use the following code, which hopefully doesn't suffer from this problem:

HRESULT res = this->swap_chain->Present(1, DXGI_PRESENT_DO_NOT_WAIT);
while(res == DXGI_ERROR_WAS_STILL_DRAWING)
{
	Context::Yield();
	res = this->swap_chain->Present(1, DXGI_PRESENT_DO_NOT_WAIT);
}




How to pass individual value arguments to my AMP kernel?


I've been watching Daniel Moth's screencasts (and reading some AMP documentation), but so far I can only find examples where big chunks of data (i.e. arrays) are passed to the GPU for processing. What if, in addition to the chunky arrays, I also need to pass individual values? Do I just do the following? (I'm asking because I've tried it and it seems to work, but in some configurations I get funny results that I'm having trouble tracking down.)

//
// array_view declarations here , assume one is called "av1"
//

// Also need this:
float SomeValue = ...;

// GPU kernel:
parallel_for_each(extent<1>(NumThreads), [=](index<1> idx) restrict(amp)
{
    // Assign the value:  
    av1[idx] = SomeValue * 5.0f; // Is "SomeValue" visible to all my threads?
    .
    .
    .
});
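As far as I understand it, yes: scalars captured by value in the [=] lambda are copied once at kernel launch and are visible, read-only, to every thread. The same capture semantics can be seen with an ordinary C++ lambda (a CPU-side sketch; apply_kernel is a made-up stand-in for parallel_for_each):

```cpp
#include <vector>

// CPU-side stand-in for a parallel_for_each launch: each "thread" idx runs
// the kernel, which captures SomeValue by value, just as a restrict(amp)
// lambda would.
std::vector<float> apply_kernel(int numThreads, float SomeValue)
{
    std::vector<float> av1(numThreads);
    auto kernel = [=, &av1](int idx) { av1[idx] = SomeValue * 5.0f; };
    for (int idx = 0; idx < numThreads; ++idx)
        kernel(idx);
    return av1;
}
```

One gotcha worth checking when chasing "funny results": the captured value is fixed at the point of capture, so changing SomeValue after issuing the parallel_for_each cannot affect the threads it dispatched.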







I want drop packet by C++


Currently I capture packets from the network card in C++ using Winsock.
Now I want to drop packets, in C++, based on the destination IP address.
Please help me.

My code (capturing packets):

#include "stdafx.h"
#include <conio.h>
#include <string>
#include <cstring>
#include <stdio.h>
#include <iostream>
#define MAX_PACKET_SIZE 65525
#include <winsock2.h>
#include <mstcpip.h>
#include <ws2tcpip.h>

using namespace std;

typedef struct iphdr1
{
    unsigned char  VerIHL;          // Version and IP Header Length
    unsigned char  Tos;
    unsigned short Total_len;
    unsigned short ID;
    unsigned short Flags_and_Frags; // Flags 3 bits and Fragment offset 13 bits
    unsigned char  TTL;
    unsigned char  Protocol;
    unsigned short Checksum;
    unsigned long  SrcIP;
    unsigned long  DstIP;
    //unsigned long Options_and_Padding;
} IpHeader1;

typedef struct port
{
    unsigned short SrcPort;
    unsigned short DstPort;
} TcpUdpPort;

void ProcessPacket(char* Buffer, int Size)
{
    IpHeader1 *iphdr1;
    TcpUdpPort *port;
    struct sockaddr_in SockAddr;
    unsigned short iphdrlen;

    iphdr1 = (IpHeader1 *)Buffer;

    // IHL is the low nibble of VerIHL, counted in 32-bit words (typically 20 bytes).
    iphdrlen = (iphdr1->VerIHL & 0x0F) * 4;

    memset(&SockAddr, 0, sizeof(SockAddr));
    SockAddr.sin_addr.s_addr = iphdr1->SrcIP;
    printf("Packet From: %s ", inet_ntoa(SockAddr.sin_addr));
    memset(&SockAddr, 0, sizeof(SockAddr));
    SockAddr.sin_addr.s_addr = iphdr1->DstIP;
    printf("To: %s ", inet_ntoa(SockAddr.sin_addr));

    switch (iphdr1->Protocol)
    {
    case 1:
        printf("Protocol: ICMP ");
        break;
    case 2:
        printf("Protocol: IGMP ");
        break;
    case 6:
        printf("Protocol: TCP ");
        if (Size > iphdrlen)
        {
            port = (TcpUdpPort *)(Buffer + iphdrlen);
            printf("From Port: %i To Port: %i ", ntohs(port->SrcPort), ntohs(port->DstPort));
        }
        break;
    case 17:
        printf("Protocol: UDP ");
        if (Size > iphdrlen)
        {
            port = (TcpUdpPort *)(Buffer + iphdrlen);
            printf("From Port: %i To Port: %i ", ntohs(port->SrcPort), ntohs(port->DstPort));
        }
        break;
    default:
        printf("Protocol: %i ", iphdr1->Protocol);
    }

    printf("\n");
}

void StartSniffing(SOCKET Sock)
{
    char *RecvBuffer = (char *)malloc(MAX_PACKET_SIZE + 1);
    int BytesRecv, FromLen;
    struct sockaddr_in From;

    if (RecvBuffer == NULL)
    {
        printf("malloc() failed.\n");
        exit(-1);
    }

    FromLen = sizeof(From);

    do
    {
        memset(RecvBuffer, 0, MAX_PACKET_SIZE + 1);
        memset(&From, 0, sizeof(From));

        BytesRecv = recvfrom(Sock, RecvBuffer, MAX_PACKET_SIZE, 0, (sockaddr *)&From, &FromLen);
        printf("BytesRecv: %i ", BytesRecv);
        if (BytesRecv > 0)
        {
            ProcessPacket(RecvBuffer, BytesRecv);
        }
        else
        {
            printf("recvfrom() failed.\n");
        }
    } while (BytesRecv > 0);

    free(RecvBuffer);
}
///////////////////////////////////////////////
char* GetLocalAddress()
{
    static char hostname[256]; // was an empty string literal, which gethostname() cannot write into
    struct hostent *remoteHost;
    struct in_addr addr;

    if (gethostname(hostname, sizeof(hostname)) != 0)
        return NULL;

    remoteHost = gethostbyname(hostname);
    if (remoteHost == NULL)
    {
        printf("host not found\n");
        return NULL;
    }

    addr.s_addr = *(u_long *)remoteHost->h_addr_list[0];
    return inet_ntoa(addr);
}
////////////////////////////////////////////////
void main()
{
    WSAData wsaData;
    SOCKET Sock;
    struct sockaddr_in SockAddr;
    DWORD BytesReturned;
    int I = 1;

    try
    {
        if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
        {
            printf("WSAStartup() failed.\n");
            exit(-1);
        }

        Sock = socket(AF_INET, SOCK_RAW, IPPROTO_IP);
        if (Sock == INVALID_SOCKET)
        {
            printf("socket() failed.\n");
            exit(-1);
        }

        memset(&SockAddr, 0, sizeof(SockAddr));
        //SockAddr.sin_addr.s_addr = inet_addr(BIND2IP);
        SockAddr.sin_addr.s_addr = inet_addr(GetLocalAddress());
        SockAddr.sin_family = AF_INET;
        SockAddr.sin_port = 0;

        if (bind(Sock, (sockaddr *)&SockAddr, sizeof(SockAddr)) == SOCKET_ERROR)
        {
            printf("bind(%s) failed.\n", GetLocalAddress());
            exit(-1);
        }

        if (WSAIoctl(Sock, SIO_RCVALL, &I, sizeof(I), NULL, NULL, &BytesReturned, NULL, NULL) == SOCKET_ERROR)
        {
            printf("WSAIoctl() failed.\n");
            exit(-1);
        }

        StartSniffing(Sock);
    }
    catch (...)
    {
        printf("CRASH\n");
    }

    closesocket(Sock);
    WSACleanup();
    getch();
}
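A note on the "drop" part: a SIO_RCVALL raw socket only observes copies of packets; actually blocking traffic on Windows requires a filtering driver (a WFP callout or NDIS filter). What this program can do is skip processing of packets whose destination matches a block list before calling ProcessPacket. A self-contained sketch (should_drop and the offset-16 helper are my own additions; DstIP sits at byte offset 16 of the IPv4 header, in network byte order):

```cpp
#include <cstdint>
#include <cstring>

// Extract the destination IP (network byte order) from a raw IPv4 header.
// In the IpHeader1 layout above, DstIP is the field at byte offset 16.
uint32_t dst_ip(const unsigned char* ipHeader)
{
    uint32_t ip;
    std::memcpy(&ip, ipHeader + 16, sizeof(ip));
    return ip;
}

// True if this packet should be ignored; call before ProcessPacket, e.g.
//   if (should_drop((unsigned char*)RecvBuffer, blockedIp)) continue;
bool should_drop(const unsigned char* ipHeader, uint32_t blockedDstIP)
{
    return dst_ip(ipHeader) == blockedDstIP;
}
```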

///////////////////////////////////////////////////////////////////////////////////

AMP reports different device memory capacity


Hello. I'm testing AMP on an NVidia GTX460. In the NVidia control panel -> device properties section the memory is reported as 1023 MB GDDR5. However, when I query the accelerator.dedicated_memory property it is reported as 1001472 KB (978 MB). What accounts for this difference?

-L

How to pass Character array and Byte array from Unmanaged C++ to C++.NET using wrapper?


Hello Guys,

I am doing a project in C++.NET and C++, connecting the two systems using IPC written in C++, with a wrapper class bridging managed to unmanaged and unmanaged to managed. I can now receive data from the remote system into the IPC layer and the wrapper, but when I pass the array to the C++.NET application, only array[0] comes through. I'm new to C++.NET and need help with this. When I assign a data type for the array in delegates, the compiler says I should explicitly specify __gc or __nogc for arrays; but when I do, it reports that __gc can be applied only to pointers, arrays or unions. Please give me suggestions.

How do I limit a native c++ application to a maximum of 3 instances for the same user?

Can't really elaborate more than the title. I have a native C++ Win32 application (compiled with the MS SDK) and I want users to be able to run multiple instances of it, but a maximum of 3. It also has to cope with one of the instances terminating unexpectedly or crashing: a bad termination must not leave the user able to run only 2, which rules out the use of semaphores. I thought about mutexes: each process creates a mutex with the same name, and since the open handle is closed when a process terminates, this would be ideal. However, I can't find any way to determine how many open handles a mutex has; if such a function exists, this would be easy.

Many thanks for any help.
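One workable shape is a fixed set of "slot" locks rather than a counter: on Windows you would try CreateMutex plus a zero-timeout WaitForSingleObject on three names like "MyApp_Slot0..2" and run only if some slot is acquired, since the OS abandons a dead process's mutexes automatically. The sketch below (an assumption, shown with POSIX flock because it demonstrates the same crash-safe property portably; the path and names are made up) claims one of N slots:

```cpp
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <unistd.h>

// Try to claim one of maxInstances lock-file slots. flock() locks are
// released by the kernel when the owning process exits or crashes, so a
// dead instance can never leak a slot. Returns the fd holding the lock
// (keep it open for the process lifetime), or -1 if all slots are taken.
int claim_slot(const std::string& base, int maxInstances)
{
    for (int i = 0; i < maxInstances; ++i) {
        std::string path = base + "." + std::to_string(i);
        int fd = open(path.c_str(), O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            continue;
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return fd;   // slot claimed
        close(fd);       // slot busy; try the next one
    }
    return -1;           // already at the instance limit
}
```

The design point is that slot ownership is a kernel-held lock, not a stored count, so there is nothing to "repair" after a crash.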

No support for Byte buffers in AMP?


I've come across information suggesting there is no support for Byte arrays in AMP. Is this true, and if so, what is the reason behind this (significant) restriction? For those of us using AMP for image processing this makes things very clunky indeed.

-L
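The usual workaround people describe (a sketch; pack_bytes and unpack_byte are hypothetical helpers) is to view the image as 32-bit words, packing four 8-bit pixels per element and unpacking with shifts and masks inside the kernel:

```cpp
#include <cstdint>
#include <vector>

// Pack 4 bytes per 32-bit word (little-endian order) so byte image data
// can back an array_view<uint32_t, 1>. n is padded up to a multiple of 4.
std::vector<uint32_t> pack_bytes(const uint8_t* src, int n)
{
    std::vector<uint32_t> out((n + 3) / 4, 0);
    for (int i = 0; i < n; ++i)
        out[i / 4] |= uint32_t(src[i]) << (8 * (i % 4));
    return out;
}

// Recover byte i from the packed words; the same shift-and-mask expression
// is valid inside a restrict(amp) kernel.
inline uint8_t unpack_byte(const uint32_t* packed, int i)
{
    return uint8_t((packed[i / 4] >> (8 * (i % 4))) & 0xFF);
}
```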

Multithread Error on 64-bit machine


Hi,

I have to read 3D files using a 3D rendering lib. When I read the files on the main thread (only one thread), it takes a long time (nearly 40 sec) to read them all.

So I implemented a multithreaded reader, which reduced the reading time by about 50%.

Now my problem is that when I run this on a Windows 7 Ultimate 64-bit machine, the application sometimes crashes while reading.

My development environment is a Windows XP SP3 32-bit machine, using the .NET 3.5 SP1 framework.

I did this in C++. Here is my code for reference.

// Globals

struct ModelObj
{
    int id;
};

ModelObj obj[NUM_OF_MODELS];

Model model; // object of the model-handling class

HANDLE T[NUM_OF_MODELS];
CWinThread *pThread[NUM_OF_MODELS];

UINT ThreadProc1(LPVOID lpvoid)
{
    ModelObj *temp = (ModelObj*)lpvoid;

    model.modelForMarker[temp->id] = osgDB::readNodeFile(model.fileNames[temp->id]);

    if (model.modelForMarker[temp->id])
    {
        model.mtForMarker[temp->id]->addChild(model.modelForMarker[temp->id].get());
    }

    return 0;
}

and in main:

.........

for (int i = 0; i < NUM_OF_MODELS; i++)
{
    obj[i].id = i;
}

strcpy(model.fileNames[0], "osg_01.osg");
strcpy(model.fileNames[1], "osg_02.OSG");
strcpy(model.fileNames[2], "osg_03.OSG");

DWORD ThreadId[NUM_OF_MODELS];
for (int i = 0; i < NUM_OF_MODELS; i++)
{
    model.mtForMarker[i] = new osg::MatrixTransform;
    model.modelSwitch->addChild(model.mtForMarker[i].get());
    model.mtForMarker[i]->addChild(model.sound_root.get());
    model.mtForMarker[i]->setUpdateCallback(model.soundCB.get());

    //pThread[i] = new CWinThread;
    std::cout << "Thread: " << i << std::endl;
    pThread[i] = AfxBeginThread(ThreadProc1, (LPVOID)&obj[i]);
    pThread[i]->m_bAutoDelete = FALSE;
}

for (int j = 0; j < NUM_OF_MODELS; j++)
{
    T[j] = pThread[j]->m_hThread;
}

::WaitForMultipleObjects(NUM_OF_MODELS, T, TRUE, INFINITE);

for (int j = 0; j < NUM_OF_MODELS; j++)
{
    delete pThread[j];
}

The 3D rendering lib is multithread-compatible.

Is there anything wrong with my code?

Can you please help me understand why I am getting the error on the 64-bit machine?
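For comparison, here is the same fan-out/join shape with std::thread (a portable sketch; LoadModel is a stand-in for osgDB::readNodeFile). One property worth preserving from it: each thread writes only to its own result slot, so no thread touches shared state while the others run:

```cpp
#include <string>
#include <thread>
#include <vector>

// Stand-in for the real file-loading call (assumption, for illustration).
std::string LoadModel(const std::string& name) { return "loaded:" + name; }

// Load every file on its own thread, each writing only to results[i],
// then join -- the equivalent of WaitForMultipleObjects(..., TRUE, INFINITE).
std::vector<std::string> load_all(const std::vector<std::string>& files)
{
    std::vector<std::string> results(files.size());
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < files.size(); ++i)
        threads.emplace_back([&results, &files, i] {
            results[i] = LoadModel(files[i]);
        });
    for (auto& t : threads)
        t.join();
    return results;
}
```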

How to bypass GPU -> Host buffer copy in AMP?


Hello, I'm copying a couple of arrays to my GPU kernel for use by the threads within the kernel. They are only used to store intermediate results during a certain calculation and there is no need to copy the contents of these arrays back to the host once the AMP kernel has completed. Is there an optimization hint for this scenario? Thank you in advance.

-L


Number of array and array_view in one parallel_for_each


Hi there.

I read somewhere on this forum that the total number of available UAVs for a DirectX 11.0 device is limited to 8, and for a DirectX 11.1 device it is limited to 64; I have an AMD Radeon HD 6970 which supports DirectX 11.0. I tested how many read-write array_view<float, 2> I can access inside a parallel_for_each (if I want to go with the structure of arrays approach I will need at least 12) and the result was up to 10.

So I was wondering: is there any way to increase the number of UAVs? Should I use CSSetUnorderedAccessViews to increase the UAVs available to C++ AMP?

Thanks for your help.

What do you want in the next version of C++ AMP? – we are listening


Visual Studio 2012 includes the first release of the C++ AMP technology and hopefully by now you have had a chance to learn about and even better try your hands at it. We would like you to know that this first release is just a beginning and our team is actively planning new features and improvements for the next version of C++ AMP.

If you have a feature request or suggestion for the next version of C++ AMP, we would love to hear about it. Accompanying details of what scenarios would be enabled by your suggestion, or how the suggested feature would simplify your use of C++ AMP, would be very useful and are encouraged. Several of you have already been sharing feedback with us on our MSDN forum – it has been duly noted and we are sincerely thankful.

While we cannot guarantee all feature suggestions and requests will be fulfilled, we promise to sincerely listen to and include your feedback and suggestions in our planning process for the next version of C++ AMP. So if there is a feature or piece of functionality in your mind that you wish C++ AMP had, please let us know by responding below on this thread. 

Looking forward to your responses.

C++ AMP Team


Amit K Agarwal


OpenMP threadprivate directive is not working.


Hi,

The OpenMP implementation in Visual Studio 2010 has a serious bug that does not allow external variables to be threadprivate. The following code does not compile:

file.c

-----

#include <omp.h>

/* Declaration of external variable. */

extern int My_Var;

#pragma omp threadprivate (My_Var)

void MyFunc(void) { My_Var = 1;}

---------

The error message is:

error C3053:
'My_Var' : 'threadprivate' is only valid for global or static data items.

Clearly the compiler is confused about linkage attribute of the variable.

It is very common to reference global variables from multiple files, so this problem makes the threadprivate directive unusable in any realistic program.

I wonder if there are tricks to handle this bug. Unfortunately in my case this problem prevents me to use OMP completely.
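One trick that may help (a sketch, not a verified fix for every setup; get_my_var/set_my_var are made-up accessor names): keep the threadprivate pragma in the translation unit that defines the variable, and let every other file go through accessor functions, so no other TU ever contains the extern declaration that trips C3053:

```cpp
// my_var.cpp -- the ONLY file that names the variable directly.
int My_Var = 0;
#pragma omp threadprivate(My_Var) // harmlessly ignored when OpenMP is off

// Accessors compiled in this same TU; other files call these instead of
// writing "extern int My_Var", so they never hit error C3053.
int  get_my_var()      { return My_Var; }
void set_my_var(int v) { My_Var = v; }
```

Other translation units then declare only the two function prototypes in a shared header, and each OpenMP thread still sees its own copy through them.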

Thank you,

Alex





Is there any way to narrow down performance difference of CUDA and AMP?


Hi, I'm migrating a CUDA marching cubes algorithm to C++ AMP.

There are some performance problems, so a 1:1 conversion of the code is not appropriate.

The overall performance result is 2 ms for CUDA versus 340 ms for AMP on one marching-cubes pass (128*128*128 volume).

I've attached one phase of the source code, classify voxels, in both the AMP and CUDA versions.

The time taken by this phase is 20 ms for AMP (and close to 0 for CUDA?).

Of course, I believe it can be optimized, but I don't know how.

I guess the cause of the slowdown is accessing the array_view.

CUDA uses cudaBindTexture to access the volume data.

In AMP I couldn't find a similar facility, so I just used array_view.

I've confirmed by testing that accessing d_volume (the array_view over the volume data) causes the slowdown.

This is AMP Code.

inline unsigned int sampleVolumeIndex(const uint3& p, const uint3& gridSize) restrict(amp)
{
    return (p.z*(gridSize.x+1)*(gridSize.y+1)) + (p.y*(gridSize.x+1)) + p.x;
}

// launch_classifyVoxel
parallel_for_each(d_voxelVerts.extent.tile<1, 1, TILE_SIZE>(),
                  [=](tiled_index<1, 1, TILE_SIZE> t_idx) restrict(amp)
{
    unsigned int i = (t_idx.global[1] * d_voxelVerts.extent[2]) + t_idx.global[2];
    uint3 gridPos = calcGridPos(i, gridsizeShift, gridsizeMask);

    float field[8];
    field[0] = d_volume[sampleVolumeIndex(gridPos, gridsize)];
    field[1] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 0, 0), gridsize)];
    field[2] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 1, 0), gridsize)];
    field[3] = d_volume[sampleVolumeIndex(gridPos + uint3(0, 1, 0), gridsize)];
    field[4] = d_volume[sampleVolumeIndex(gridPos + uint3(0, 0, 1), gridsize)];
    field[5] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 0, 1), gridsize)];
    field[6] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 1, 1), gridsize)];
    field[7] = d_volume[sampleVolumeIndex(gridPos + uint3(0, 1, 1), gridsize)];

    // calculate flag indicating if each vertex is inside or outside isosurface
    unsigned int cubeindex;
    cubeindex  = unsigned int(field[0] < ref_iso_value);
    cubeindex += unsigned int(field[1] < ref_iso_value) * 2;
    cubeindex += unsigned int(field[2] < ref_iso_value) * 4;
    cubeindex += unsigned int(field[3] < ref_iso_value) * 8;
    cubeindex += unsigned int(field[4] < ref_iso_value) * 16;
    cubeindex += unsigned int(field[5] < ref_iso_value) * 32;
    cubeindex += unsigned int(field[6] < ref_iso_value) * 64;
    cubeindex += unsigned int(field[7] < ref_iso_value) * 128;

    // read number of vertices from texture
    unsigned int numVerts = d_numVertsTable[cubeindex];

    if (i < numVoxels)
    {
        d_voxelVerts[t_idx.global] = numVerts;
        d_voxelOccupied[t_idx.global] = (numVerts > 0);
    }
});
}

This is CUDA code.

// sample volume data set at a point
__device__
float sampleVolume(uchar *data, uint3 p, uint3 gridSize)
{
    p.x = min(p.x, gridSize.x - 1);
    p.y = min(p.y, gridSize.y - 1);
    p.z = min(p.z, gridSize.z - 1);

    uint i = (p.z*gridSize.x*gridSize.y) + (p.y*gridSize.x) + p.x;
    //return (float) data[i] / 255.0f;
    return tex1Dfetch(volumeTex, i);
}


__global__ void
classifyVoxel(uint* voxelVerts, uint *voxelOccupied, uchar *volume, uint3 gridSize, uint3 gridSizeShift, uint3 gridSizeMask, uint numVoxels, float3 voxelSize, float isoValue)
{
    uint blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
    uint i = __mul24(blockId, blockDim.x) + threadIdx.x;

    uint3 gridPos = calcGridPos(i, gridSizeShift, gridSizeMask);

    // read field values at neighbouring grid vertices
    float field[8];
    field[0] = sampleVolume(volume, gridPos, gridSize);
    field[1] = sampleVolume(volume, gridPos + make_uint3(1, 0, 0), gridSize);
    field[2] = sampleVolume(volume, gridPos + make_uint3(1, 1, 0), gridSize);
    field[3] = sampleVolume(volume, gridPos + make_uint3(0, 1, 0), gridSize);
    field[4] = sampleVolume(volume, gridPos + make_uint3(0, 0, 1), gridSize);
    field[5] = sampleVolume(volume, gridPos + make_uint3(1, 0, 1), gridSize);
    field[6] = sampleVolume(volume, gridPos + make_uint3(1, 1, 1), gridSize);
    field[7] = sampleVolume(volume, gridPos + make_uint3(0, 1, 1), gridSize);

    // calculate flag indicating if each vertex is inside or outside isosurface
    uint cubeindex;
    cubeindex =  uint(field[0] < isoValue); 
    cubeindex += uint(field[1] < isoValue)*2; 
    cubeindex += uint(field[2] < isoValue)*4; 
    cubeindex += uint(field[3] < isoValue)*8; 
    cubeindex += uint(field[4] < isoValue)*16; 
    cubeindex += uint(field[5] < isoValue)*32; 
    cubeindex += uint(field[6] < isoValue)*64; 
    cubeindex += uint(field[7] < isoValue)*128;

    // read number of vertices from texture
    uint numVerts = tex1Dfetch(numVertsTex, cubeindex);

    if (i < numVoxels) {
        voxelVerts[i] = numVerts;
        voxelOccupied[i] = (numVerts > 0);
    }
}

findWindowEx index issue


Hi

I have a Windows application (3rd-party) with many dialog boxes shown as tabs. These dialog boxes have no control IDs.

I need to get the dialog box handles in order to search their child controls. The problem is that when I use the code below:

IntPtr Menu1 = FindWindowEx(parentWindow, IntPtr.Zero, TAB_CLASS, null);
IntPtr Menu2 = FindWindowEx(parentWindow, Menu1, TAB_CLASS, null);
IntPtr Menu3 = FindWindowEx(parentWindow, Menu2, TAB_CLASS, null);

the order changes depending on which menu is selected, and I get the wrong handle.

How can I find these dialog boxes in a fixed order?


sahil


