Image2d max size

Hello,

I am tying to use image2d mem object to perform operations on pixels, with YUV images. For testing, I juste use a uchar array, that I copy into image2d object.

It works well with small arrays.
The problem is that I cannot use arrays with dimension bigger than 12864 or 64128 (8192 bytes), which is poor since I need to work with HD images x).

Here I create my image2d:

	cl_image_format image_format;
	
	image_format.image_channel_data_type=CL_UNSIGNED_INT8;
	image_format.image_channel_order=CL_RGBA;	
 
	//Create OpenCL Image
	g_inputImage=  clCreateImage2D (g_context,
						CL_MEM_READ_ONLY,
						&image_format,
						(size_t)stride/4,		//RGBA! 4 bytes per pixel
						(size_t)arrayNrows,
						0,
						NULL,
						&ret);

Here I copy the array into the image object:

//Parameters for clEnqueueWriteImage
size_t origin[3]={0, 0, 0};
size_t region[3]={stride/4,arrayNrows,1}; //RGBA! 4 bytes per pixel

err = clEnqueueWriteImage (g_cmd_queue,
g_inputImage,
CL_TRUE,
origin,
region,
stride,
0,
inputArray,
0,
NULL,
NULL);

I take into account the fact that I’m using RGBA, whereas my input array simulates a 1 component image (1byte/pixel). Here the “stride” is just the width of my image, in bytes. Then the RGBA picture created has 4 times less pixels than the “real” picture, but it’s not a problem for my test.

The program stops after clEnqueueNDRangeKernel returns 1 instead of CL_SUCCESS. As 1 does not bring any information about the error, I do not understand why it crashes. The kernel is not executed at all.

Here is how I run the kernel:

            // set work-item dimensions
			size_t global_work_size[2];
			global_work_size[1] = (size_t) stride/4;	// using 4 element vectors!
             global_work_size[0] = (size_t) arrayNrows;///2;	//number of quad items in input array
			 

			 size_t local_work_size[2];
			 local_work_size[1] = (size_t) stride/4;	
             local_work_size[0] = (size_t) 1;	
			 
			 nd=2; // execute kernel (2D)

         if ((err = clEnqueueNDRangeKernel(g_cmd_queue, g_kernel, nd, NULL, global_work_size, local_work_size, 0, NULL, NULL) != CL_SUCCESS))
         {
                printf("ERROR: Failed to execute kernel
");
                return false;
         }

Here is the output, with some information about the CL_DEVICE, and the memory:

No command line arguments specified, using default values.
Initializing OpenCL runtime...
Reading file 'ker4_FLADIntra_sum_c.cl' (size 3145 bytes)
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS= 3
CL_DEVICE_MAX_WORK_ITEM_SIZES= 1024, 1024, 1024
CL_DEVICE_MAX_WORK_GROUP_SIZE= 1024
CL_DEVICE_ADDRESS_BITS= 64
CL_DEVICE_IMAGE2D_MAX_WIDTH= 8192
CL_DEVICE_IMAGE2D_MAX_HEIGHT= 8192
CL_DEVICE_IMAGE_SUPPORT= 1
CL_DEVICE_LOCAL_MEM_SIZE= 32768
CL_KERNEL_WORK_GROUP_SIZE= 1024
Input size is 16384 items
Executing OpenCL kernel...
MEM OBJECT INFO: 
MEM_SIZE= 16384
IMAGE INFO: 
CL_IMAGE_ELEMENT_SIZE= 4
CL_IMAGE_ROW_PITCH= 128
CL_IMAGE_WIDTH= 32
ERROR: Failed to execute kernel

In that test I was trying to run it with a 128*128 array (16KByte ). As I cut the data into 128 workgroups, it should not be a problem of data but I am certainly missing something.

Thank you for your help,
Chris

Since you’ve only quoted bits of your code, it’s impossible to tell what you’re doing.

Your explanations don’t really make sense either: write image needs to have exactly the same data format, if you have a greyscale input, it wont be converted to rgba, it will be taken as rgba data. I know you just say ‘it has 1/4 as many pixels’, but that still doesn’t make sense; either just use a single-channel image, or create your data as rgba 4-bytes, or integers. In either case the kernel can remain the same.

I suspect it’s some sort of problem with your work sizes. Either don’t pass the local work size and let the runtime choose it (and be aware it may choose very bad values) or choose something that suits the kernel/problem. Usually it should be some good hardware multiple like 64 or 128, separate from the size of the global problem.

If you’re using images, use a 2d work-group size like 16x16, doing a whole column is very inefficient. And make sure the global work size is an even multiple of the same in each dimension. You have to do the range checking inside the kernel.

How do you know your clEnqueue* returns 1? Your code doesn’t print err?

Thank you for your answer.

Here is the rest of the “important” part of the code. I removed my localsize to let the run-time setting up by itself, but the issue is not fixed.

Here I creat my input data array, which is a 1 dimension array:

void generateInput(cl_uchar* inputArray, size_t arrayNrows, size_t arrayNcollumns)
{
	for(size_t j = 0; j < arrayNrows; j++)
	{
		for (size_t i = 0; i < arrayNcollumns; i++)
		{
			inputArray[j*arrayNcollumns + i] = (cl_int)j*arrayNcollumns+i;
			//printf("%d ",inputArray[j*arrayNcollumns + i]);
		}
		//printf("
");
	}
}

Function running the kernel:

bool ExecuteSortKernel(cl_uchar* inputArray, cl_int arrayNrows, cl_int stride, cl_uint sortAscending, cl_int* p_flad_sum)
{
    cl_int err = CL_SUCCESS;
    cl_int numStages = 0;
	cl_int sum_tmp=0;
	cl_event event1;
	cl_uint nd;
	cl_int* tab_FLAD_sum = (cl_int*) malloc(sizeof(cl_int)*arrayNrows);
	cl_int ret;
	//cl_uchar** inputArray2D;

	/*
	inputArray2D=(cl_uchar**)malloc(sizeof(cl_uchar*)*arrayNrows);
	for(int i=0; i<arrayNrows; i++)
	{
		inputArray2D[i]=(cl_uchar*)malloc(sizeof(cl_uchar)*stride);
		for(int j=0; j<stride; j++)
			inputArray2D[i][j]=inputArray[i*stride + j];
	}
	*/

	cl_image_format image_format;
	image_format.image_channel_data_type=CL_UNSIGNED_INT8;
	image_format.image_channel_order=CL_RGBA;	//FIXME: CHECK THIS
	//CL_UNSIGNED_INT32


	cl_int arraySize=arrayNrows*stride;	//Make the procress easier, taking the square into account

	//Create OpenCL Image
	g_inputImage=  clCreateImage2D (g_context,
						CL_MEM_READ_ONLY,
						&image_format,
						(size_t)stride/4,		//RGBA! 4 bytes per pixel
						(size_t)arrayNrows,
						0,
						NULL,
						&ret);

    //create OpenCL buffer using input array memory
	 g_localmem = clCreateBuffer(g_context, CL_MEM_READ_WRITE, sizeof(cl_uchar) * arraySize, NULL, NULL);
    //g_inputBuffer = clCreateBuffer(g_context, CL_MEM_READ_ONLY, sizeof(cl_uchar) * arraySize, NULL, NULL);
	g_flad_sum = clCreateBuffer(g_context, CL_MEM_READ_WRITE, sizeof(cl_int) * arrayNrows, NULL, NULL);

    if (g_inputImage == (cl_mem)0)
    {
        printf("ERROR: Failed to create input data Buffer
");
        return false;
    }
	

	//Parameters for clEnqueueWriteImage
	size_t origin[3]={0, 0, 0};	
	size_t region[3]={stride/4,arrayNrows,1};	//RGBA! 4 bytes per pixel
	
	
	err = clEnqueueWriteImage (g_cmd_queue,
 						g_inputImage,
 						CL_TRUE,
						origin,
						region,
 						stride,
 						0,
 						inputArray,
 						0,
 						NULL,
 						NULL);

	/*--------GET_MEM_OBJECT_INFO-------------------------*/
//...
/*-----------------------------------------------------*/
/*--------GET_IMAGE_INFO-------------------------------*/		
//...
/*-----------------------------------------------------*/

	//err = clEnqueueWriteBuffer( g_cmd_queue, g_inputBuffer, CL_TRUE, 0, sizeof(cl_uchar) * arraySize, inputArray, 0, NULL, NULL );
	//err = clEnqueueWriteBuffer( g_cmd_queue, g_flad_sum, CL_TRUE, 0, sizeof(cl_int) , p_flad_sum, 0, NULL, NULL );

	err  = clSetKernelArg(g_kernel, 0, sizeof(cl_int), (void *) &stride);
	//err |= clSetKernelArg(g_kernel, 1, sizeof(cl_mem), (void *) &g_inputBuffer);
    err |= clSetKernelArg(g_kernel, 1, sizeof(cl_mem), (void *) &g_inputImage);
    err |= clSetKernelArg(g_kernel, 2, sizeof(cl_mem), (void *) &g_flad_sum);
	err |= clSetKernelArg(g_kernel, 3, sizeof(cl_int) * arraySize, (void *) NULL);
    if (err != CL_SUCCESS)
    {
        printf("ERROR: Failed to set input kernel arguments
");
        return false;
    }


            // set work-item dimensions
			size_t global_work_size[2];
			global_work_size[1] = (size_t) stride/4;	
             global_work_size[0] = (size_t) arrayNrows/2;	//sepeare odd and even rows //number of quad items in input array
			 			 
			 nd=2; // execute kernel (2D)

         if ((err = clEnqueueNDRangeKernel(g_cmd_queue, g_kernel, nd, NULL, global_work_size, NULL, 0, NULL, NULL) != CL_SUCCESS))
         {
                printf("ERROR: Failed to execute kernel
");
                return false;
         }
	
		 
	err = clFinish(g_cmd_queue);

	//err = clEnqueueReadBuffer( g_cmd_queue, g_inputBuffer, CL_TRUE, 0, sizeof(cl_int) * arraySize, inputArray, 0, NULL, NULL );
	err = clEnqueueReadBuffer( g_cmd_queue, g_flad_sum, CL_TRUE, 0, sizeof(cl_int)*arrayNrows, tab_FLAD_sum, 0, NULL, NULL );

	/*-------------------HOST REDUCTION**---------------*/

	for(int i=0; i<arrayNrows; i++)
		*p_flad_sum+=tab_FLAD_sum[i];


    return true;
}

The kernel, working with image2d:

__kernel void ker4_FLADIntra_sum_c (
const int stride, 
read_only image2d_t pix,
__global uint* p_flad_sum,
__local uchar4* localmem)
{
const sampler_t samplerA = CLK_NORMALIZED_COORDS_FALSE |
			CLK_ADDRESS_CLAMP |
			CLK_FILTER_NEAREST;

int g_i= get_global_id(1);
int g_j = get_global_id(0);
int l_i= get_local_id(1);
int l_j = get_local_id(0);

int gid= get_group_id(0);

int localSize= get_local_size(0)*get_local_size(1);
int globalSize= get_global_size(0)*get_global_size(1);

uint4 tmp;

//Load data and perform FLAD
//Sweep only even rows
//localmem[l_i]=abs(pix[2*g_j*stride + g_i] - pix[(2*g_j+1)*stride + g_i]);

localmem[l_i]=convert_uchar4(abs( read_imagei(pix, samplerA, (int2)(g_i, 2*g_j)) - read_imagei(pix, samplerA, (int2)(g_i, 2*g_j+1)) ));
barrier(CLK_LOCAL_MEM_FENCE);

// repeat reduction in local memory
for(int s = localSize/2; s > 1; s >>= 1)
{
	if(l_i< s)
	localmem[l_i] += localmem[l_i+ s];
	

	// keep barrier outside conditional
	barrier(CLK_LOCAL_MEM_FENCE);
}

// write result to global memory
if (l_i== 0)
{
	tmp=convert_uint4(localmem[0] + localmem[1]);
	p_flad_sum[gid] =tmp.x + tmp.y + tmp.z + tmp.w;
}

	 
}

I understand what you mean, but my device only propose single-channel with CL_FLOAT, which is not that good either, is it? I tried to run with this configuration, but it does not change the problem. It’s not sensful I agree.

I also tried the same code, using Buffers “clCreateBuffer” instead of image2d, and the problem is exactly the same, crashing beyond 128*64 size of data.

I know it returns 1 thanks to the debugger. However, I have the same error code when I put wrong work-size.

Problem solved.

I made a big mistake with the size of the local memory, which was bigger than work group max size in the case of array size over 128*64.