Is it possible to call one kernel in another?

Dear all,

I am trying to write code for image processing.

I want to write one kernel and call it from another kernel. Is it possible?

I know that we can call ordinary functions from a kernel; I would like to call another kernel as well.

Please help me in this regard.

Here is the situation I am in now.

//creatematrix.cl
float ** createMat(int iRows, int iColumns, float Initvalue)
{
	float **pMatrix;

	pMatrix = (float**)malloc(sizeof(float*) * iRows);
	for (int i = 0; i < iRows; i++)
	{
		pMatrix[i] = (float*)malloc(sizeof(float) * iColumns);
	}

	for (int i = 0; i < iRows; i++)
	{
		for (int j = 0; j < iColumns; j++)
		{
			pMatrix[i][j] = Initvalue;
		}
	}

	return pMatrix;
}

// Convolution.cl
__kernel float* IMConvolution(const __global float * pInput,
                              __constant float * pFilter,
                              const int nInWidth,
                              const int nFilterWidth,
                              const int nWidth)
{
	float *pOutput;

	//const int nWidth = get_global_size(0);

	const int xOut = get_global_id(0);
	const int yOut = get_global_id(1);

	const int xInTopLeft = xOut;
	const int yInTopLeft = yOut;

	float sum = 0;
	for (int r = 0; r < nFilterWidth; r++)
	{
		const int idxFtmp = r * nFilterWidth;

		const int yIn = yInTopLeft + r;
		const int idxIntmp = yIn * nInWidth + xInTopLeft;

		for (int c = 0; c < nFilterWidth; c++)
		{
			const int idxF  = idxFtmp  + c;
			const int idxIn = idxIntmp + c;
			sum += pFilter[idxF] * pInput[idxIn];
		}
	} //for (int r = 0...

	/*if (sum > 1)
		sum = 1;
	if (sum < -1)
		sum = -1;*/

	const int idxOut = yOut * nWidth + xOut;
	pOutput[idxOut] = sum;

	return(pOutput);
}

// CNNonGPU.cl
#include "Convolution.cl"
#include "creatematrix.cl"

__kernel void CNNonGPU(const __global float * pInput,
                       __constant float * tempA,
                       __constant float * tempB,
                       __global float * pOutput,
                       const int nInWidth,
                       const int nInHeight,
                       const int nFilterWidth,
                       const int iterations)
{
	const int nWidth = get_global_size(0);

	float *TBimg, *Yimg, *extYimg, *TAimg;
	float **tempimg;
	int Elements = 0;
	int inElements = 0;
	float *sum;

	Elements = nWidth * nWidth;
	inElements = nInWidth * nInHeight;

	TBimg = new float [Elements];
	TAimg = new float [Elements];
	sum = new float [Elements];

	TBimg = IMConvolution(pInput, tempB, nInWidth, nFilterWidth, nWidth);

	extYimg = pInput;

	for (int i = 0; i < iterations; i++)
	{
		Yimg = extYimg;

		TAimg = IMConvolution(Yimg, tempA, nInWidth, nFilterWidth, nWidth);

		for (int j = 0; j < Elements; j++)
		{
			sum[j] = TBimg[j] + TAimg[j] + bias;
		}

		tempimg = createMat(nInWidth, nInHeight, 0);

		int ind = 0;

		for (int i = 1; i < iHeightExtended-1; i++)
		{
			for (int j = 1; j < iWidthExtended-1; j++)
			{
				tempimg[i][j] = 0.5 * (abs(sum[ind] + 1) - abs(sum[ind] - 1));
				ind++;
			}
		}

		ind = 0;
		for (int j = 0; j < nInHeight; j++)
		{
			for (int i = 0; i < nInWidth; i++)
			{
				extYimg[ind] = tempimg[i][j];
				ind++;
			}
		}
	}

	ind = 0;
	for (int i = 1; i < iHeightExtended-1; i++)
	{
		for (int j = 1; j < iWidthExtended-1; j++)
		{
			pOutput[ind] = tempimg[i][j];
			ind++;
		}
	}
}

Here in CNNonGPU I want to use the convolution kernel many times; sometimes it may be 1000 times or even more.

I also want to use the createMat function once. Is it possible to do so?

I tried to execute the above by calling CNNonGPU from main() on the CPU, but I get an error at clBuildProgram (the error is CL_BUILD_PROGRAM_FAILURE).

Thanks in advance.

You cannot compile CNNonGPU.cl with clBuildProgram() because it’s not written in OpenCL C.

The good thing is that I don’t see anything in CNNonGPU() that would make sense running as OpenCL. Why don’t you just write CNNonGPU() in regular C/C++ and replace the calls to IMConvolution() with calls to clSetKernelArg() and clEnqueueNDRangeKernel()?

I already implemented CNNonGPU in C and IMConvolution as a kernel. If I do it like this I don't get the full benefit of the GPU, because after this execution the GPU is only 3 or 4 times faster than the CPU (e.g., if the GPU takes 10 s, the CPU takes 30 s). Hopefully we can achieve more than that if we put everything in one kernel.

This is the situation because I need to call IMConvolution many times (e.g., 1000 or more). In those cases there are many data transfers between host and device, and that consumes time. I want to minimize that time as well, so I am searching for an alternative.

Thanks in advance.

Thanks in advance.

I see. The problem is that you can’t execute an NDRange from within a kernel. That is not supported in OpenCL.

However, this doesn’t mean that you can’t do more work in the GPU. You could keep IMConvolution as it is today and then create a new kernel that does this part of the computation (let’s call it IMAccumulate):


for (int j = 0; j < Elements; j++)
{
    sum[j] = TBimg[j] + TAimg[j] + bias;
}

Then there’s no need to do any data transfers. You would enqueue IMConvolution, then IMAccumulate, and repeat as many times as necessary.

Does that sound good?

Here TBimg is calculated once using IMConvolution, but TAimg is calculated 'N' times using IMConvolution.

The accumulation is done after TAimg is calculated. Even if we keep the accumulation in a separate kernel, what benefit do we get?

Sorry, I didn’t look at the code very carefully and without indentation it’s a bit difficult to follow.

Here TBimg is calculated once using IMConvolution, but TAimg is calculated 'N' times using IMConvolution.

Do you mean that TBimg is computed once, while TAimg is computed multiple times?

It still seems like the computation after IMConvolution, including the accumulation and the other stuff, could be done in a separate kernel without having to transfer data from the device to the host.

In other words, you don’t need to make CNNonGPU() into a kernel in order to improve performance. You could keep CNNonGPU() as a regular function that enqueues multiple kernels, each one doing part of the work.

Yes. TAimg is computed multiple times and accumulated every time.

If we write the accumulation and other stuff in a separate kernel, can we call it from the IMConvolution kernel?

I didn't understand "a regular function that enqueues multiple kernels". Can you give me a small example to make it a bit clearer?

If we write the accumulation and other stuff in a separate kernel, can we call it from the IMConvolution kernel?

A kernel function “A” can call another kernel function “B”. However, the work-items that execute kernel “B” are simply the same work-items from “A” that called into “B”. You can’t enqueue a new NDRange from within a kernel.
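For example (a sketch with made-up names): the called function is compiled into the caller, so each work-item of A simply executes B's body itself:

```c
// B is an ordinary device function here; calling it from A does not
// launch a new NDRange, it just runs inline in the calling work-item.
void B(__global float *data, int idx)
{
    data[idx] *= 2.0f;
}

__kernel void A(__global float *data)
{
    int gid = get_global_id(0);
    B(data, gid);   // executed by this same work-item, no new enqueue
}
```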

I didn't understand "a regular function that enqueues multiple kernels". Can you give me a small example to make it a bit clearer?

Sure. All I was saying is that you can implement CNNonGPU() as a C function (running on the host CPU) that calls clEnqueueNDRangeKernel() several times: one time for IMConvolution(), another for IMAccumulation(), etc.

It would look something like this:



// nongpu.c
void CNNonGPU()
{
    ...
    // Compute value of TBImg
    clSetKernelArg(convolutionKernel, 0, sizeof(cl_mem), &TBImg);
    clEnqueueNDRangeKernel(..., convolutionKernel, ...);

    for (i = 0; i < iterations; ++i)
    {
        // Compute a new version of TAImg
        ...
        clSetKernelArg(convolutionKernel, 0, sizeof(cl_mem), &TAImg);
        clEnqueueNDRangeKernel(..., convolutionKernel, ...);

        // Accumulate TAImg with TBImg
        clSetKernelArg(accumulationKernel, 0, sizeof(cl_mem), &TAImg);
        clSetKernelArg(accumulationKernel, 1, sizeof(cl_mem), &TBImg);
        clSetKernelArg(accumulationKernel, 2, sizeof(cl_mem), &accumImg); // output
        clEnqueueNDRangeKernel(..., accumulationKernel, ...);

        // Compute difference of absolute values here
        clSetKernelArg(absKernel, 0, sizeof(cl_mem), &accumImg);
        clSetKernelArg(absKernel, 1, sizeof(cl_mem), &tempImg);
        clEnqueueNDRangeKernel(..., absKernel, ...);
    }
}


Thanks for your reply…

The questions I have after seeing your code are:

  1. The kernel is launched ‘N’ times, i.e., we have to launch the kernel every time? (Will there not be data transfer between the host and device every time because of that?)

  2. If we launch the kernel every time, how can we retrieve the data?

    i.e., before the accumulation kernel is launched, the TAimg work must be completed. Can we use clFinish() to ensure that, or do all the kernels run in parallel?

Can't we write this entire functionality on the GPU like the following?

__kernel CNNonGPU()
{
	const int nWidth = get_global_size(0);

	const int xOutB = get_global_id(0);
	const int yOutB = get_global_id(1);

	const int xInTopLeftB = xOutB;
	const int yInTopLeftB = yOutB;

	float temp1 = 0;
	for (int r = 0; r < nFilterWidth; r++)
	{
		const int idxFtmpB = r * nFilterWidth;

		const int yInB = yInTopLeftB + r;
		const int idxIntmpB = yInB * nInWidth + xInTopLeftB;

		for (int c = 0; c < nFilterWidth; c++)
		{
			const int idxFB  = idxFtmpB  + c;
			const int idxInB = idxIntmpB + c;
			temp1 += tempB[idxFB] * pInput[idxInB];
		}
	} //for (int r = 0...

	const int idxOutB = yOutB * nWidth + xOutB;

	TBimg[idxOutB] = temp1;

	float *Yimg, *Ytempimg;
	Ytempimg = pInput;

	for (int i = 0; i < iterations; i++)
	{
		Yimg = Ytempimg;

		const int xOutA = get_global_id(0);
		const int yOutA = get_global_id(1);

		const int xInTopLeftA = xOutA;
		const int yInTopLeftA = yOutA;

		float temp2 = 0;
		for (int r = 0; r < nFilterWidth; r++)
		{
			const int idxFtmpA = r * nFilterWidth;

			const int yInA = yInTopLeftA + r;
			const int idxIntmpA = yInA * nInWidth + xInTopLeftA;

			for (int c = 0; c < nFilterWidth; c++)
			{
				const int idxFA  = idxFtmpA  + c;
				const int idxInA = idxIntmpA + c;
				temp2 += tempA[idxFA] * Yimg[idxInA];
			}
		} //for (int r = 0...

		const int idxOutA = yOutA * nWidth + xOutA;

		TAimg[idxOutA] = temp2;
		sum[idxOutA] = TAimg[idxOutA] + TBimg[idxOutA] + bias;

		if (i == 0)
			tempimg[idxOutA] = sum[idxOutA];
		else
			tempimg[idxOutA] = 0.5 * (abs(sum[idxOutA] + 1) - abs(sum[idxOutA] - 1));

		Ytempimg[idxOutA] = tempimg[idxOutA];
	}
}

Here it is something like putting the convolution once before the iteration and then again putting the same convolution inside the loop. Like:

__kernel
{
	Convolution for TBimg

	for()
		Convolution for TAimg
}

To my knowledge a kernel is a simple functional execution block, so we should not write it like this. Am I right?

  1. The kernel is launched ‘N’ times, i.e., we have to launch the kernel every time? (Will there not be data transfer between the host and device every time because of that?)

No, there will not be any transfers.

  2. If we launch the kernel every time, how can we retrieve the data?

You can retrieve the data at any time with clEnqueueReadBuffer().

Before the accumulation kernel is launched, the TAimg work must be completed. Can we use clFinish() to ensure that, or do all the kernels run in parallel?

It’s not necessary to call clFinish(). Commands in the default, in-order command queue are executed in the same order they are enqueued.

Can't we write this entire functionality on the GPU like the following?

Maybe. I don’t understand the algorithm you are trying to implement well enough. From what I can see, your code would need some work-group barriers in the middle of the kernel in order to work (see section 6.11.8 of the specification). It will be easier for you to implement it as a sequence of calls to clEnqueueNDRangeKernel().

Thanks for your support and quick replies. With all this knowledge I will try to implement my algorithm effectively.
:smiley: