So Intel has MKL for the CPU; NVIDIA has cuBLAS and cuSPARSE for CUDA; and there are various CPU implementations of BLAS and LAPACK that serve as jumping-off points for developing applications that use linear algebra.
I'd like to build an application that uses C++ AMP to provide the linear algebra routines underlying my application's functionality. The impetus is the expected performance improvement over using the CPU alone (NVIDIA's documentation indicates that cuBLAS provides at least an order-of-magnitude improvement over MKL for a wide variety of operations on matrices of around 4K rows/columns), along with the hardware abstraction that AMP provides.
To do the same with AMP, I apparently need to first construct at least a subset of those underlying libraries myself. For me, this is a non-trivial exercise.
Even with a modest understanding of linear algebra, and limiting my research to matrix multiplication alone, it took me a full day to survey the current state of the art: the candidate algorithms, the underlying architecture, big-O performance expectations, relative numerical stability, and so on, before I had a reasonably good idea of which algorithms to implement and compare empirically.
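For concreteness, here is a minimal sketch of the kind of tiled AMP kernel I mean. The tile width of 16, the single-precision floats, and the assumption that the matrices are square with N a multiple of the tile size are all simplifications of mine, not a proposal for the library's actual interface:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

static const int TILE = 16;

// Sketch: C = A * B for N x N matrices, N assumed to be a multiple of TILE.
void matmul_tiled(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& c, int n)
{
    array_view<const float, 2> av(n, n, a);
    array_view<const float, 2> bv(n, n, b);
    array_view<float, 2> cv(n, n, c);
    cv.discard_data(); // no need to copy c's initial contents to the GPU

    parallel_for_each(cv.extent.tile<TILE, TILE>(),
        [=](tiled_index<TILE, TILE> idx) restrict(amp)
    {
        int row = idx.local[0];
        int col = idx.local[1];
        float sum = 0.0f;

        for (int i = 0; i < n; i += TILE)
        {
            // Stage one TILE x TILE block of each operand in fast
            // tile_static memory, then let the whole tile consume it.
            tile_static float at[TILE][TILE];
            tile_static float bt[TILE][TILE];
            at[row][col] = av(idx.global[0], col + i);
            bt[row][col] = bv(row + i, idx.global[1]);
            idx.barrier.wait();

            for (int k = 0; k < TILE; k++)
                sum += at[row][k] * bt[k][col];
            idx.barrier.wait();
        }
        cv[idx.global] = sum;
    });
    cv.synchronize();
}
```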
Add implementation, optimization, and testing for each of the necessary routines, and this becomes a substantial investment of time.
Is anyone else already working on this?
Is anyone aware of a synopsis of algorithms suitable for parallelization on a GPU? Of special interest is how large problems are partitioned, and how algorithms might be modified to accommodate that partitioning (for example, Morton ordering for improved memory locality).
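By way of illustration, computing a Morton (Z-order) index is just a matter of interleaving the bits of the row and column indices, so elements that are close in 2-D stay close in linear memory. This sketch assumes 16-bit indices and is purely illustrative:

```cpp
#include <cstdint>

// Interleave the bits of x and y: x's bits land in the even positions,
// y's in the odd positions, yielding the Morton index of element (x, y).
uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t mx = x, my = y;
    // Spread each 16-bit value across 32 bits, one gap bit per data bit.
    mx = (mx | (mx << 8)) & 0x00FF00FF;
    mx = (mx | (mx << 4)) & 0x0F0F0F0F;
    mx = (mx | (mx << 2)) & 0x33333333;
    mx = (mx | (mx << 1)) & 0x55555555;
    my = (my | (my << 8)) & 0x00FF00FF;
    my = (my | (my << 4)) & 0x0F0F0F0F;
    my = (my | (my << 2)) & 0x33333333;
    my = (my | (my << 1)) & 0x55555555;
    return mx | (my << 1);
}
```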
Any other suggestions on getting to the point where I have an AMP BLAS/LAPACK implementation?
Ken Miller