Allocating memory for kernels

Hi,

a few quick questions:

1] Is it possible to dynamically allocate memory in a kernel? (I suppose the answer is no, and that only a constant amount of memory, fixed at compile time, can be allocated.)

2] Suppose you have to write a kernel that scans a gray-level image and stores the coordinates of every pixel whose value is > 128. You don’t know in advance how many pixels will meet that condition. So you have to allocate a buffer of W*H*sizeof(pixel), where W is the width of the image, H is the height and pixel is a POD struct:

struct pixel
{
    int x, y;
};

Is this the only way to pass an output buffer to the kernel?

3] Suppose that you pass to the kernel a buffer object of size K*sizeof(pixel), where K is a parameter decided at run time (it can be < W*H). Is there a way to say to the kernel: ok, you have to stop now since we do not have enough memory left?

Thank you.

1] Is it possible to dynamically allocate memory in a kernel?

No, it’s not possible.

2] Suppose you have to write a kernel that scans a gray-level image and stores the coordinates of every pixel whose value is > 128. You don’t know in advance how many pixels will meet that condition. So you have to allocate a buffer of W*H*sizeof(pixel), where W is the width of the image, H is the height and pixel is a POD struct:

struct pixel
{
    int x, y;
};

Is this the only way to pass an output buffer to the kernel?

You got the general idea right. You can use a parallel prefix sum to efficiently compute this.
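In case it helps, here is a minimal sketch of that scan-and-scatter idea in OpenCL C (assuming OpenCL is what you are using; the kernel and argument names are made up for the example, and the exclusive prefix sum over the flags buffer is assumed to be done by a separate pass that is not shown):

typedef struct { int x, y; } pixel;

/* Pass 1: each work-item writes a 0/1 flag for its pixel. */
__kernel void mark_pixels(__global const uchar *image,
                          __global int *flags,
                          int width, int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;
    int i = y * width + x;
    flags[i] = (image[i] > 128) ? 1 : 0;
}

/* An exclusive prefix sum over flags (separate pass, not shown) produces
 * offsets[], where offsets[i] is the output slot for pixel i when flags[i] == 1. */

/* Pass 2: scatter the coordinates of the matching pixels into a dense buffer. */
__kernel void scatter_pixels(__global const int *flags,
                             __global const int *offsets,
                             __global pixel *out,
                             int width, int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;
    int i = y * width + x;
    if (flags[i]) {
        out[offsets[i]].x = x;
        out[offsets[i]].y = y;
    }
}

The last flag plus the last offset also give you the total number of matching pixels, so the output buffer ends up densely packed with no empty slots.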

3] Suppose that you pass to the kernel a buffer object of size K*sizeof(pixel), where K is a parameter decided at run time (it can be < W*H). Is there a way to say to the kernel: ok, you have to stop now since we do not have enough memory left?

Sure, you can pass an argument to the kernel specifying the size of that buffer.
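For example (again just a sketch, assuming OpenCL; the names are invented for the example), the kernel can take the capacity K as an argument, reserve an output slot with an atomic increment, and skip the write once the buffer is full; reading the counter back afterwards also tells you how much memory would actually have been needed:

typedef struct { int x, y; } pixel;

__kernel void collect_pixels(__global const uchar *image,
                             __global pixel *out,
                             __global int *count,   /* initialised to 0 by the host */
                             int width, int height,
                             int capacity)          /* K, decided at run time */
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;

    if (image[y * width + x] > 128) {
        int slot = atomic_inc(count);   /* reserve an output slot */
        if (slot < capacity) {          /* stop writing once the buffer is full */
            out[slot].x = x;
            out[slot].y = y;
        }
        /* if slot >= capacity the write is skipped; the final value of *count
         * still tells the host how many pixels actually matched */
    }
}

Note that atomic_inc on a global int is built in from OpenCL 1.1 on; on 1.0 it needs the cl_khr_global_int32_base_atomics extension.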

Ok, to sum up: you allocate enough memory on the host (for example, if the maximum number of objects you expect to find is M, you allocate a buffer that can hold M objects) and run the GPU algorithm. You could end up with a buffer containing empty elements ([O1, X, X, O2, X, O3…]), so you then run a compaction algorithm (prefix sum) to compact the buffer, so that the empty slots only begin after the last object. Is this right?

Ok, I see that the same patterns in parallel algorithms keep popping up everywhere. A lot of things that are “natural” to do on a single CPU become less natural when switching to parallel algorithms.

Is there something like a bible of patterns for dealing with parallel algorithms?
Or better, a book that shows how to write parallel versions of the most common image processing algorithms (the most interesting ones, at least)? There is a lot of information on the net, and it is not easy to decide what to read first. For example, Mark Harris wrote a presentation entitled “Data Parallel Algorithm on the GPUs”. I think it is worth reading, since it presents “the building blocks”, but it is only a brief overview.

Thanks!

Ok, to sum up: you allocate enough memory on the host (for example, if the maximum number of objects you expect to find is M, you allocate a buffer that can hold M objects) and run the GPU algorithm. You could end up with a buffer containing empty elements ([O1, X, X, O2, X, O3…]), so you then run a compaction algorithm (prefix sum) to compact the buffer, so that the empty slots only begin after the last object. Is this right?

That’s right!
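For completeness, a rough host-side sketch of the “allocate for M objects up front” part (error handling omitted; this assumes the hypothetical collect_pixels kernel from above, which packs its output densely so the explicit compaction pass is not needed in this particular variant, plus an already created context, queue, kernel, image buffer, and the image dimensions width/height as cl_int):

cl_int zero = 0, err;
cl_int M = 4096;   /* upper bound on the number of matches you are willing to store */
cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                M * sizeof(cl_int2), NULL, &err);
cl_mem count_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  sizeof(cl_int), &zero, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &image_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &count_buf);
clSetKernelArg(kernel, 3, sizeof(cl_int), &width);
clSetKernelArg(kernel, 4, sizeof(cl_int), &height);
clSetKernelArg(kernel, 5, sizeof(cl_int), &M);

size_t global[2] = { width, height };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

/* read back how many pixels actually matched */
cl_int found;
clEnqueueReadBuffer(queue, count_buf, CL_TRUE, 0, sizeof(cl_int), &found, 0, NULL, NULL);
/* only the first min(found, M) entries of out_buf are valid, already densely packed */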

Ok, I see that the same patterns in parallel algorithms keep popping up everywhere. A lot of things that are “natural” to do on a single CPU become less natural when switching to parallel algorithms.

Yeah, parallelism is an additional layer of difficulty on top of whatever you are trying to solve.

Is there something like a bible of patterns for dealing with parallel algorithms?

If there is, I haven’t found it yet. The work by Mark Harris and the earlier work by Daniel Hillis are the best I’ve found so far.