Can a kernel function call another kernel function?

Hi all,

I have two algorithms. The first processes an image and computes a set of points of interest (e.g. corners). The second takes an image and a set of interest points as input parameters and does some computation (for every point it returns an array of floats).

The two algorithms are unrelated, in the sense that I do not want to hard-code anything (that is, I want to be able to swap the interest point detector and the second algorithm easily, at run-time if needed).

I see two possibilities.

Define two different kernels, call them A (the first algorithm) and B (the second), and do the following:

1. prepare kernel A
2. call A
3. get the results from A
4. prepare kernel B
5. call B (passing it the results of A)
6. get the results from B

My only question about this approach concerns step 3: I do not want to transfer memory from the device to the host, so I would avoid enqueueing a copy command and simply leave the results of A on the device, since B will be executed there. Is this OK? (I am sure the answer is a big YES, but I am really just starting with OpenCL…)
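For reference, the host-side sequence I have in mind looks roughly like this (a sketch only, with no error checking; `ctx`, `queue`, `kernelA`, `kernelB`, `image`, `descriptors` and the various sizes are assumed to exist already):

```c
cl_int err;

/* Buffer that will hold A's output; it never leaves the device. */
cl_mem points = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                               max_points * sizeof(cl_float2), NULL, &err);

/* Run A: writes the interest points into `points`. */
clSetKernelArg(kernelA, 0, sizeof(cl_mem), &image);
clSetKernelArg(kernelA, 1, sizeof(cl_mem), &points);
clEnqueueNDRangeKernel(queue, kernelA, 2, NULL, global_A, NULL, 0, NULL, NULL);

/* Run B: reads `points` directly; no clEnqueueReadBuffer in between. */
clSetKernelArg(kernelB, 0, sizeof(cl_mem), &image);
clSetKernelArg(kernelB, 1, sizeof(cl_mem), &points);
clSetKernelArg(kernelB, 2, sizeof(cl_mem), &descriptors);
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, global_B, NULL, 0, NULL, NULL);

/* Only the final result is copied back to the host. */
clEnqueueReadBuffer(queue, descriptors, CL_TRUE, 0, desc_bytes,
                    host_descriptors, 0, NULL, NULL);
```

With an in-order queue, B is guaranteed to see A's results; with an out-of-order queue the two enqueues would need to be ordered with events.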

The second approach would be different: I would have a kernel C that calls A and, for every point found by A, calls B. I can't see how this could be made to work.
Any ideas? I am almost sure that the first approach is the correct one, but I would like to know whether I can call other kernels (and not just local functions) from a given kernel, and how to do it.

Thanks

My only question about this approach concerns step 3: I do not want to transfer memory from the device to the host, so I would avoid enqueueing a copy command and simply leave the results of A on the device, since B will be executed there. Is this OK? (I am sure the answer is a big YES, but I am really just starting with OpenCL…)

Data transfers from host to device and vice versa are essentially controlled by the application. If you enqueue a read, write or map command then there will be a data transfer of some sort. If you don’t then data will stay on the device.

With the caveat that I’m not familiar with your code, it seems like what you are attempting is very doable.

The second approach would be different: I would have a kernel C that calls A and, for every point found by A, calls B.

That won’t work the way you want. A kernel can call another kernel, but it will have the same semantics as a kernel calling any other function.

It’s beneficial to keep in mind the distinction between a kernel (that is, a function) and an NDRange (that is, a collection of work-items). A kernel can call another kernel, but it will not change the shape or size of the NDRange that was enqueued. The number of work-items and work-groups will remain the same.

In contrast, an NDRange cannot spawn a new NDRange. That’s what people often mean when they talk about “a kernel calling another kernel”.
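To illustrate the function-call semantics in OpenCL C (hypothetical kernels):

```c
__kernel void A(__global const uchar *img, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = (float)img[gid] * 0.5f;
}

__kernel void B(__global const uchar *img, __global float *out)
{
    /* Legal: A is invoked like an ordinary function, inside this
     * work-item. No new work-items or NDRange are created. */
    A(img, out);
}
```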

Hi David,

thank you for your reply. I have another question; it is not really on-topic, but I will take this opportunity to ask you anyway.

I was thinking about connected component labeling. I already have an algorithm, written in C++, that processes a binary image and finds all the connected components. It not only tells you how many connected components there are in the image but, for every CC, returns its contour (and the label map, though I usually use only the list of components). The algorithm scans each pixel of the image and, if it is a contour pixel, follows the contour until it is closed, stores it in a list of contours, and moves on to the next pixel (unless it was already encountered, in which case it skips it and continues with the next, and so on).

I am not interested in the possible ways of implementing it; what I would like to understand is what kind of output parameters such a kernel would return.

Clearly not a list of connected components, since only contiguous buffers of memory (1D, 2D or 3D) can be used. Moreover, the algorithm knows neither how many connected components it will find nor the length of each contour until it has finished scanning.

One last note (and it is a very important point): the software I am writing should make it easy to change the implementation used (OpenCL or plain C++), even at run-time. This requires interfaces, which is fine, but then the memory layout becomes important, since I want to do as little work as possible (ideally avoiding the cost of converting one data structure to another, especially when the plain C++ implementation is used).

Any idea about memory design for this type of problem?

Thanks

Ps.
It is possible that this type of algorithm is not well suited to being ported to the GPU, but I can't find any reasonable argument to support that claim (I mean, if even sorting can be done with OpenCL…).

I am not interested in the possible ways of implementing it; what I would like to understand is what kind of output parameters such a kernel would return.

The reason I didn’t reply in the previous thread is that the answer is probably going to be rather complicated, and I’m not certain what the best way to solve that problem would be in the first place. I hope you understand that we are answering these questions in our free time and may skip some of them.

One last note (and it is a very important point): the software I am writing should make it easy to change the implementation used (OpenCL or plain C++), even at run-time.

If I were you I would implement these algorithms only once; it will be a lot less work for you that way. Even if the client’s PC doesn’t always have an OpenCL-capable GPU, it will always have a CPU that you can use to run the code. Both AMD and Intel support CPUs in their OpenCL implementations.
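For instance, falling back to a CPU device is just a matter of which device type you request from the platform; a sketch with no error handling, assuming the first platform is the one you want:

```c
cl_platform_id platform;
cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);

/* Prefer a GPU; fall back to a CPU device (e.g. the AMD or Intel runtime). */
if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL)
        != CL_SUCCESS)
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
```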

Hi David,

If I were you I would implement these algorithms only once; it will be a lot less work for you that way. Even if the client’s PC doesn’t always have an OpenCL-capable GPU, it will always have a CPU that you can use to run the code. Both AMD and Intel support CPUs in their OpenCL implementations.

I don’t know. The software library should work on different systems:

-windows
-linux
-mac
-mobile platforms (mainly iOS, Android and Symbian)

I think there are no implementations for mobile platforms, and there are no open-source OpenCL SDKs for desktop OSes either (Intel’s is CPU-only, and I have played with it a bit; AMD and NVIDIA have GPU implementations for their own cards only).

My idea was to “prepare” the library for a future OpenCL implementation, so that users of the library are not forced to change their code (just plug in the new library and everything still works). Unfortunately, I don’t have a deep understanding of OpenCL. I came up with the simple idea of:

  • creating a context interface IContext (subclassed by IContextCPU, IContextOpenCL)

  • creating a context factory (just specify CTX_CPU, CTX_OPENCL and get the right context object)

  • creating an IMemoryObject interface to handle memory buffer management
    (that is, cl_mem objects with OpenCL and a simple implementation when OpenCL is not there).

I came up with an “implicit” way of defining algorithms that could potentially be implemented in OpenCL:

  • use a basic interface and subclass it with the C++/OpenCL implementations;
    for example: class IConnectedComponent {…};

  • use IMemoryObject for input/output parameters:

    virtual int ComputeCC(IMemoryObject *binary_image) = 0;

  • a factory to get the proper implementation of this type of algorithm
    (the factory builds the correct object according to the implementation selected).

The problem that I described to you before lies in this method:

virtual IMemoryObject *GetCC() = 0;

The object returned should contain the connected components. But how should that memory be laid out? And should I use that layout (which could be a bit cumbersome) everywhere in the library, or should I convert it to a more programmer-friendly data structure?

Sorry for my long posts!