Channel: Parallel Computing in C++ and Native Code forum
Viewing all 856 articles
Browse latest View live

A problem with "error C3564: reading uninitialized value when compiling the call graph for the concurrency::parallel_for_each"


Hi!

I have code like this

struct spin_lock
{
	int spin_;

	spin_lock(): spin_(0) {};
	
	void enter() restrict(amp)
	{
		while(concurrency::atomic_compare_exchange(&spin_, 0, 1) != 0);
	}

	void exit() restrict(amp)
	{
		concurrency::atomic_exchange(&spin_, 0);
	}
};

And I initialize and use it as follows

spin_lock spinner;
array_view<spin_lock, 1> spin_lock(1, &spinner);

parallel_for_each(...)
{
    //[...]
    spin_lock[0].enter();
    //[...]
    spin_lock[0].exit();
    //[...]
});

And the compiler issues "error C3564: reading uninitialized value when compiling the call graph for the concurrency::parallel_for_each". It looks like I can use freestanding functions and an integer variable instead, but how could I use a struct? I don't discard data when copying the spin_lock view to the GPU, and I would have thought that since the struct is created on the host, i.e. on the CPU, the spin_ member would be zero-initialized even if the constructor weren't called on the GPU and the struct were only copied bitwise over the bus.
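For what it's worth, the same spin-lock shape works on the CPU with std::atomic; this is only a host-side analogue I wrote to illustrate the enter/exit pattern (it is not AMP code, and the names are mine), but it may help separate the locking logic from the C3564 capture issue:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side analogue of the AMP spin lock, using std::atomic instead of
// concurrency::atomic_compare_exchange. Sketch only; AMP-side semantics differ.
struct cpu_spin_lock
{
    std::atomic<int> spin_{0};

    void enter()
    {
        int expected = 0;
        // Spin until we swap 0 -> 1.
        while (!spin_.compare_exchange_weak(expected, 1))
            expected = 0;
    }

    void exit() { spin_.store(0); }
};

// Increment a plain int from several threads, guarded by the spin lock.
int locked_count(int threads, int iters)
{
    cpu_spin_lock lock;
    int count = 0; // protected by lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
            {
                lock.enter();
                ++count;
                lock.exit();
            }
        });
    for (auto& th : pool)
        th.join();
    return count;
}
```

If the lock works, the counter ends up exactly at threads * iters.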


Sudet ulvovat -- karavaani kulkee



PPL and Memory Mapped Files?


I have a function that basically takes a pointer to data as its argument. This function is executed on the ConcRT task scheduler using PPL.

The problem I have is that this pointer might come from a memory-mapped file, so accessing this memory can block the task scheduler thread, which is bad since the UMS scheduler is deprecated, unless the default scheduler is mmap-aware?

My question is how should I handle this?

Should I touch parts of the data before using it, so that the file pages are loaded into memory and accesses won't block (though I guess there is no guarantee they will not be swapped out)? What is the best way to do this? I'd like to avoid an unnecessary memory copy if possible.

I.e., is the following good enough?

void foo(char* mmap_data, int size)
{
	__assume(size < 256 * 1000);

	concurrency::Context::Oversubscribe(true);

	// Touch one byte per page so the pages are resident before processing.
	volatile char dummy;
	for (int n = 0; n < size; n += page_size)	// page_size: system page size, e.g. 4096
		dummy = mmap_data[n];	// note: mmap_data[n], not *mmap_data[n]

	concurrency::Context::Oversubscribe(false);

	/* Do processing of mmap_data */
}
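A slightly more reusable shape of the same page-touching idea (the helper name and default are mine, not an existing API; and as noted above, touching pages is no guarantee they stay resident):

```cpp
#include <cassert>
#include <cstddef>

// Touch one byte per page so the backing pages are faulted in up front.
// page_size is passed in rather than assumed; 4096 is typical on x86.
inline void touch_pages(const char* data, std::size_t size,
                        std::size_t page_size = 4096)
{
    volatile char sink = 0;
    for (std::size_t n = 0; n < size; n += page_size)
        sink = data[n]; // volatile sink prevents the read being optimized away
    (void)sink;
}
```

Calling touch_pages(ptr, len) before handing ptr to the PPL task keeps the blocking page faults off the scheduler thread, under the same caveats.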





Concurrency::Context::Oversubscribe Overhead?

Do you have any indication of how large the overhead of Concurrency::Context::Oversubscribe is? I.e., how fine-grained should its usage optimally be?
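I have no numbers myself, but whatever the granularity, one way to keep the enter/exit calls balanced is a small RAII guard. Sketched generically below (my own helper, not a ConcRT type); the ConcRT-specific version would pass lambdas calling Concurrency::Context::Oversubscribe(true) and Oversubscribe(false):

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Generic scope guard: runs one callable immediately and another on scope
// exit, so the pair can never get out of balance on early returns.
class scope_guard
{
    std::function<void()> exit_;
public:
    scope_guard(std::function<void()> enter, std::function<void()> exit)
        : exit_(std::move(exit))
    {
        enter();
    }
    ~scope_guard() { exit_(); }
    scope_guard(const scope_guard&) = delete;
    scope_guard& operator=(const scope_guard&) = delete;
};
```

Usage would be `scope_guard g([]{ Concurrency::Context::Oversubscribe(true); }, []{ Concurrency::Context::Oversubscribe(false); });` around the blocking region (hypothetical usage, untested against ConcRT).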

Why does this modified Math Library example not work?


Hello,

This is probably a newbie question:

In the code snippet below I slightly modified the example from the Math Library for C++ AMP article (http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/08/math-library-for-c-amp.aspx).

First I saved the two values a and b in a two-dimensional array with one row and two columns. Second, I changed the iteration to use an index<2>, and instead of calling pow(a,b) in the lambda function I read the values from my array and call pow() with those values. As in the original example, I store the result in av(0). Since there is only one row, the parallel_for_each is only run once.

I also stepped through the lambda function and saw that A and B are correctly read from the array_view.

But av(0) is returning 1.0, which is incorrect.

What am I doing wrong?

Thanks for your help,

- Bernd

void main()
{
    float a = 2.2f, b = 3.5f;
    float result = pow(a,b); // ≈ 15.8
    std::vector<float> v(1);
    array_view<float> av(1,v);

    std::vector<float> inRaw(2);
    inRaw[0] = a;
    inRaw[1] = b;
    // one row, two columns
    extent<2> twoDim(1, 2);
    array_view<float,2> in(twoDim,inRaw);

    parallel_for_each(twoDim, [=](index<2> idx) restrict(amp)
    {
        const int row = idx[0];
        const int col = idx[1];

        float A = in(row,col); // 2.2f
        float B = in(row,col+1); // 3.5f
        av(0) = precise_math::pow(A,B);
    });

    _ASSERT(av(0) == result); // barks, because av(0) is 1.0
}
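A quick host-side sanity check of the reference value (plain std::pow, nothing AMP-specific): 2.2 raised to 3.5 is roughly 15.8, so the CPU-side `result` is the right thing for the assert to compare against, and the 1.0 coming back from the GPU is clearly wrong.

```cpp
#include <cassert>
#include <cmath>

// Host-side reference for the AMP computation above: pow(2.2, 3.5) ~= 15.8.
float expected_pow()
{
    return std::pow(2.2f, 3.5f);
}
```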

Good parallel programming book


Hi.

I'm very interested in learning parallel programming, but where I study there's no course that covers it.

I decided to learn it on my own. Can you recommend a good introductory book? Take into account that I have absolutely no experience in it. I'm taking it up because a professor suggested it and I'm looking for new challenges.

I mainly use the MATLAB language, but I also know my way around C.

Thanks in advance!

Any help will be highly appreciated.

Jaló.

How to create a wrapper for ActiveX control to run it in a browser?


I've got an ActiveX OCX which depends on a lot of other DLLs and OCXes that are registered in Windows. I'm able to use this ActiveX by loading it locally into a VC++ client program, but when I try to load the ActiveX into a browser, it loads and promptly crashes.

I tried FireBreath, but its creator says that's not what FireBreath is meant for.

I won't be using this ActiveX over the internet. I just need to have all the files on the HDD, load the ActiveX into a local instance of IE, and invoke the interface of the ActiveX using JavaScript. It's an ATL ActiveX, so is there any wizard I can use to create a wrapper that I can call from JavaScript?

P.S.: I checked this site for why it's crashing, but my project is a multithreaded DLL (though there's nothing in the registry about its threading), so it shouldn't be crashing.

I mentioned "it crashes" because debugging it only took me to some AFX classes and then to assembly code. The control loads, and when I hover the mouse pointer over it, control goes to the destructor of my ActiveX class. If there's any useful info I could provide here, I'd like to know what I can check/mention. I'm new to ActiveX.

array_view memory restrictions


Hello,

I'm trying to use C++ AMP in my native C++ convolutional neural network library. But I don't quite understand how you can copy something like a vector<vector<vector<int_2>>> connections; to an array_view (in GPU memory) when the number of elements in connections is not fixed in every dimension. All the C++ AMP samples I've explored so far have a rather simple data structure. I was also wondering whether it is possible with C++ AMP to create your memory structures directly on the GPU without copying an array over from the CPU side.
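Since array_view needs a dense container with a fixed extent, a common workaround (not AMP-specific; the type and function names below are mine) is to flatten the jagged vectors into one flat value array plus an offsets array, CSR-style, and copy those two arrays to the GPU:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flatten a jagged vector-of-vectors into flat values + row offsets
// (CSR-style): offsets[i]..offsets[i+1] delimits row i inside values.
struct flattened
{
    std::vector<int> values;
    std::vector<std::size_t> offsets;
};

flattened flatten(const std::vector<std::vector<int>>& jagged)
{
    flattened out;
    out.offsets.push_back(0);
    for (const auto& row : jagged)
    {
        out.values.insert(out.values.end(), row.begin(), row.end());
        out.offsets.push_back(out.values.size());
    }
    return out;
}
```

On the GPU side the kernel would then index values[offsets[i] + j] instead of connections[i][j]; the same trick nests for a third level.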

(I hope this makes some sense.)

thanks,

Filip  


DirectCompute with optiX ( CUDA )


Hi. Right now I'm trying to implement a radar. I use a ray-tracing collision detection algorithm since the environment is huge. My boss asked me to implement it using OptiX, NVIDIA's ray tracing engine. But the objects in the environment are not static, so I have to use DirectX, and that brings up another problem: performance.

Is it possible to use the OptiX output in a compute shader without sending the OptiX output to the CPU? If it is possible, how can I do it?

thanks in advance.


What tile size to use when extent is not a multiple of 32 or 64?


Hi,

I came across a post at http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/26/warp-or-wavefront-of-gpu-threads.aspx. In that post, it was recommended that the tile size be a multiple of 32 or 64. But what if I have a 100 x 100 matrix? It cannot be divided evenly into tiles whose size is a multiple of 32 or 64. What tile size would you recommend in this case, and why?

I used the Dump statistics to Output Window button on the GPU Threads window to see what tile size C++ AMP chooses in simple mode; it's 16 x 16 x 1. So my second question is: how does 100 x 100 fit into this tile size?
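For reference, what tiled_extent::pad() effectively does (as I understand the documented behaviour; the arithmetic below is my own) is round the compute domain up to the next tile multiple, so 100 x 100 with 16 x 16 tiles becomes a 112 x 112 padded domain, and the out-of-range threads must be guarded inside the kernel:

```cpp
#include <cassert>

// Round an extent up to the next multiple of the tile size, which is what
// tiled_extent::pad() does per dimension (my reading of the docs).
int padded(int e, int tile)
{
    return ((e + tile - 1) / tile) * tile;
}
```

So a 100 x 100 domain padded to 16 x 16 tiles launches 112 * 112 threads, of which 12544 - 10000 = 2544 do no useful work and need an `if (idx within extent)` guard.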

Thanks/Allen

How to interrogate accelerator information in greater detail than is obtained via accelerator::get_all()?


Hi!

Is there an API or some sufficiently straightforward way to interrogate the underlying hardware in greater detail than via the accelerator::get_all() method? One thing I can think of is getting the device path and then using it with the Windows registry, some other API (DXGI?), or perhaps a database.

I'd be interested in core configurations, memory frequencies etc.


Sudet ulvovat -- karavaani kulkee


A few more checks on explicitly choosing number of processing threads and tiling


Hi!

A bad topic line, but I just want to make sure I've understood this correctly: regardless of the size of the input (e.g. obtained from an external source; it could be, say, 50 elements long), if I have a function with roughly the following signature

void function1(concurrency::array_view<const TYPE_X, 1> const& x_view)

can I explicitly process with 1024 threads, with all of the threads using the same tile_static memory, using the following?

auto domain = concurrency::extent<1>(1024).tile<1024>();
parallel_for_each(chosen_accelerator.default_view, domain.pad(), [=](tiled_index<1024> t_idx) restrict(amp)
{
    tile_static TYPE_X values[1024];

    //Initialization of the values array to some value by all the threads simultaneously.
    for(int k = t_idx.local[0]; k < 1024; k += t_idx.tile_dim0)
    {
        values[k] = 0;
    }
    t_idx.barrier.wait();   
});

I had a previous question along the same lines, to which Zhu already provided a good answer, and I read the links he included. This case is sufficiently different that I think it warrants another question; it's easy to understand this incorrectly...

Of course any pointers that are even tangential to my main questions are welcome.


Sudet ulvovat -- karavaani kulkee

Should this program compile?


int foo() restrict(amp) {
    return 111;
}

int foo() restrict(cpu) {
    return 999;
}

void* get_address() {
    return static_cast<void*>(foo);
}

Error 1 error C2440: 'static_cast' : cannot convert from 'overloaded-function' to 'void *'


VC++2012: Fatal Error C1001 since Release Candidate


Hi,

While recompiling a DLL which makes heavy use of PPL and C++ AMP, a fatal error C1001 now triggers in the VC++ 2012 Release Candidate:

1>S:\SRC_vs11\abc\def.cpp : fatal error C1001: An internal error has occurred in the compiler. 1> (compiler file 'f:\dd\vctools\compiler\utc\src\p2\main.c', line 211)

1>LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage
1>  Version 11.00.50522.1
1>  ExceptionCode            = C0000005
1>  ExceptionFlags           = 00000000
1>  ExceptionAddress         = 527483EE (52670000) "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\c2.dll"
1>  NumberParameters         = 00000002
1>  ExceptionInformation[ 0] = 00000000
1>  ExceptionInformation[ 1] = 00000034
1>  CONTEXT:
1>  Eax = 00000008  Esp = 0038E470
1>  Ebx = 00000000  Ebp = 0038E48C
1>  Ecx = 007D9E88  Esi = 00000000
1>  Edx = 07233140  Edi = 07233140
1>  Eip = 527483EE  EFlags = 00010202
1>  SegCs = 00000023  SegDs = 0000002B
1>  SegSs = 0000002B  SegEs = 0000002B
1>  SegFs = 00000053  SegGs = 0000002B
1>  Dr0 = 00000000  Dr3 = 00000000
1>  Dr1 = 00000000  Dr6 = 00000000
1>  Dr2 = 00000000  Dr7 = 00000000


(in Debug or Release configurations, Win32 or x64 targets)

What kind of information should I send for diagnosis?

How to poll an std::future to see if its result is available?


In the code snippet below I'm attempting to manually associate a promise set on one thread with a future checked on another. I then attempt to poll the future to see if it is ready, but the loop never terminates: the call to future::wait_for() always returns future_status::deferred. If I remove the wait_for and instead loop for some fixed number of iterations with a sleep_for in the while loop, I can get() the future as expected and the function exits normally.

Am I misunderstanding the correct usage of wait_for()? Is there any other way to poll for a future becoming available without blocking on a get() call?

#include <chrono>
#include <thread>
#include <future>
#include <iostream>

using namespace std;
using namespace std::chrono;

void DoSomeWork(int numIters, promise<int>& ret)
{
    int it = 0;
    for ( ; it < numIters; ++it)
    {
        cout << "Thread " << this_thread::get_id() << " working..." << endl;
        this_thread::sleep_for(milliseconds(1000));
    }
    ret.set_value(it);
}

void TestPromise()
{
    int its = 10;
    promise<int> aPromise;
    thread aThread(DoSomeWork, its, ref(aPromise));
    auto aFuture = aPromise.get_future();
    while (aFuture.wait_for(milliseconds(100)) != future_status::ready)
    {
        cout << "Thread " << this_thread::get_id() << " waiting..." << endl;
    }
    cout << "Work iterations: " << aFuture.get();
    aThread.join();
}

int main()
{
    TestPromise();
    return 0;
}
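For comparison, the same poll loop behaves as expected when the future comes from std::async with an explicit std::launch::async policy (a sketch to isolate the polling pattern; whether the deferred status seen with a promise-backed future is a library quirk in that compiler version, I can't say):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Poll a future produced by std::async(std::launch::async, ...) until ready.
int poll_result()
{
    auto fut = std::async(std::launch::async, []
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return 42;
    });

    // With launch::async the status should eventually become ready,
    // and should never report future_status::deferred.
    while (fut.wait_for(std::chrono::milliseconds(10)) !=
           std::future_status::ready)
    {
        // ...do other work while waiting...
    }
    return fut.get();
}
```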



Are entire structs captured by (AMP) lambdas or only the used members?


Hi all,

Are entire structs captured by (AMP) lambdas, or only the members that are used? E.g., if float3::x is used by a lambda, do the y and z members also consume constant memory?
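A quick host-side experiment with plain C++ lambdas (whether the AMP compiler does better dead-member elimination when building constant buffers, I can't say) suggests that naming s.x inside a lambda captures all of s by value, while copying the member to a local first captures only the one float:

```cpp
struct Float3 { float x, y, z; };

// sizeof comparison: referring to the member through the struct drags the
// whole struct into the closure; a local copy captures just one float.
bool whole_struct_is_bigger()
{
    Float3 s{1.0f, 2.0f, 3.0f};
    float x = s.x;
    auto byStruct = [s] { return s.x; }; // closure holds a Float3
    auto byValue  = [x] { return x;   }; // closure holds a float
    return sizeof(byStruct) > sizeof(byValue);
}
```

So if only x is needed, copying it into a local float before the lambda is a cheap way to keep the capture small.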

Cheers,

T


Address space problem

How can I point to an address space?
Please see my "addressing problem" post.

copy_async and Lifetime Management


So I have noticed one potential problem with copy_async: how to manage lifetimes in an elegant way.

Consider the following:

array<int> a1;
array<int> a2;

copy_async(a1, a2).then([=]
{
	// Work on a2
});

From my understanding, there are two problems here: 

1. a1 must stay alive during the copy_async.

2. Somehow a2 must be accessed in the then lambda.

I have not found any elegant solution to this. The workaround I have found so far is something along the lines:

array<int> a1;
array<int> a2;

auto a1_ptr = std::make_shared<array<int>>(std::move(a1));
auto a2_ptr = std::make_shared<array<int>>(std::move(a2));

copy_async(*a1_ptr, *a2_ptr).then([a1_ptr, a2_ptr]() mutable // mutable, so the captured shared_ptr can be reset
{
	a1_ptr = nullptr; // release the source once the copy has completed
	auto a2 = std::move(*a2_ptr);
});

To get around this I would like to propose the following:

1. copy_async should increment the reference count of the underlying array object (_Object_Ptr or something along those lines) and manage the life-time of the source object. 

2. The destination should be taken as an r-value reference and then passed to the continuation.

Which would make the following correct:

array<int> a1;
array<int> a2;

copy_async(a1, std::move(a2)).then([](array<int>& a2) // copy_async should return a unique_task/unique_completion/std::unique_future since a2 shouldn't be copied.
{
	// Work on a2, no need to worry about lifetime management.
});
Comments?


Help with tiling


I've just converted one of my C++ AMP algorithms to use tiling, and I got a nice 80% speed-up, great!

However, I'm unsure whether I'm doing it correctly (most efficiently), since my code looks quite different from the various examples I have seen, and would therefore appreciate some feedback.

Basically my scenario is that I have a one-dimensional array filled with consecutive 8x8 DCT matrices laid out in linear order, on which I need to perform an IDCT (basically matrix multiplication). So matrix 1 is array[0..63], matrix 2 is array[64..127], etc.

My current code looks as follows. Note that TS = 64 = 8 * 8 is the only tile size that works for this implementation (unlike other examples I have seen, where the algorithm works regardless of tile size).

array_view<float> coeffs = /*...*/;
array_view<float> dest = /*...*/; // output, same extent as coeffs
float idct[8][8] = /*...*/;

static const int TS = 64;

concurrency::parallel_for_each(coeffs.extent.tile<TS>(), [=](const concurrency::tiled_index<TS>& t_idx) restrict(amp)
{
	using namespace concurrency;

	auto local_idx = t_idx.global % 64;

	auto row = (t_idx.local % 8)[0];
	auto col = (t_idx.local / 8)[0];

	tile_static float local_coeffs[8][8];

	local_coeffs[row][col] = coeffs[t_idx.global];

	t_idx.barrier.wait();

	float tmp = 0.0;
	for (int i = 0; i < 8; ++i)
	{
		float tmp2 = 0.0;

		tmp2 += local_coeffs[0][i] * idct[row][0];  
		tmp2 += local_coeffs[1][i] * idct[row][1];  
		tmp2 += local_coeffs[2][i] * idct[row][2];  
		tmp2 += local_coeffs[3][i] * idct[row][3];  
		tmp2 += local_coeffs[4][i] * idct[row][4];  
		tmp2 += local_coeffs[5][i] * idct[row][5];  
		tmp2 += local_coeffs[6][i] * idct[row][6];  
		tmp2 += local_coeffs[7][i] * idct[row][7]; 
		
		tmp += idct[col][i] * tmp2;
	}
	dest[t_idx.global] = tmp;	
});
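When validating a tiled kernel like this, it helps to keep a straightforward CPU reference around; below is my sketch of the same per-block arithmetic as the loop above (out[r][c] = sum over i of idct[c][i] * sum over j of in[j][i] * idct[r][j]), which can be diffed against the GPU output block by block:

```cpp
#include <cassert>

// CPU reference for one 8x8 block, mirroring the AMP kernel's arithmetic:
// out[r][c] = sum_i idct[c][i] * (sum_j in[j][i] * idct[r][j])
void idct_block_ref(const float in[8][8], const float idct[8][8],
                    float out[8][8])
{
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
        {
            float tmp = 0.0f;
            for (int i = 0; i < 8; ++i)
            {
                float tmp2 = 0.0f;
                for (int j = 0; j < 8; ++j)
                    tmp2 += in[j][i] * idct[r][j];
                tmp += idct[c][i] * tmp2;
            }
            out[r][c] = tmp;
        }
}
```

With the identity matrix as the transform, the output block should equal the input block exactly, which makes a cheap self-test.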

Calling task::get from UI thread


Hei,

So I need to wait for some operation to happen. Here is the reason: my app crashes and the debugger doesn't want to show where (or let me rephrase: it crashes on someone else's computer and I need to figure out why). I need to add some logging. I can't log messages asynchronously, because my app will crash before they have a chance to be flushed, so I open and close the file for each write. I don't care about unresponsiveness; it's not going to be delivered to the Store. I need something hackish but working. Is there a way to write to a file with the StorageFile API and block execution until it is done?
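For a plain desktop process the crude version of this is standard fstream: open in append mode, write, flush, close on every call. I can't vouch for blocking on the WinRT StorageFile API from the UI thread, so this is only the standard-C++ fallback sketch (function name is mine):

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Crash-tolerant-ish logging: open/append/flush/close on every call, so the
// line reaches the OS even if the process dies immediately afterwards.
bool log_line(const std::string& path, const std::string& msg)
{
    std::ofstream out(path, std::ios::app);
    if (!out)
        return false;
    out << msg << '\n';
    out.flush();
    return static_cast<bool>(out);
}
```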

Thanks!

C++ AMP Disassembly?


Is there a way to see what kind of code is generated by C++ AMP?

In this particular case I would like to know whether things like "int x", "x / 8" and "x % 8" are properly optimized into simple bitwise operations, or if I have to do this manually.
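For non-negative x the identities themselves are easy to check on the CPU side (whether the AMP compiler performs this strength reduction I can't confirm; when in doubt, writing the bitwise form by hand costs little):

```cpp
#include <cassert>

// For non-negative x and power-of-two divisors, division and modulo reduce
// to shifts and masks: x / 8 == x >> 3 and x % 8 == x & 7.
bool strength_reduction_holds(int x)
{
    return (x / 8 == (x >> 3)) && (x % 8 == (x & 7));
}
```

Note the equivalence breaks for negative x (C++ integer division truncates toward zero, while the shift rounds toward negative infinity), which is one reason compilers are conservative about signed operands.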


