Register and local mem problems

Hello !

I am working on an optical flow algorithm. I developed a working 2D version, and now I am having trouble with the 3D version. I work with blocks of 8*8*8 pixels, since I have CL_DEVICE_MAX_WORK_GROUP_SIZE: 512.

My kernel is silently skipped, without any warning or error, when I try to allocate too much local memory or when the code requires too many registers.

  1. Local memory:

I have the following CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte, I suppose it means I can have a maximum of 4 blocks of float (4 bytes ?) of 8+2 pixels in my local mem ? (8 is the local size + 2 for the overlapping edges)
4*4*10*10*10 = 16 000 bytes: OK ?
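The estimate above can be checked with a few lines of arithmetic. This is only a sketch: `tile_bytes` and its parameter names are made up for illustration, and it assumes a cubic tile with a one-pixel halo on each side and 4-byte floats.

```c
#include <assert.h>

/* Bytes needed to hold `fields` float volumes of a cubic tile whose side is
   `core` pixels plus `halo` extra pixels on each side.
   Hypothetical helper, not part of the original kernel. */
static unsigned long tile_bytes(unsigned fields, unsigned core, unsigned halo)
{
    assert(fields > 0 && core > 0);
    unsigned long side = core + 2UL * halo;   /* 8 + 2*1 = 10 */
    return (unsigned long)fields * sizeof(float) * side * side * side;
}
```

With the numbers from the post, `tile_bytes(4, 8, 1)` is 16 000 bytes against 16 * 1024 = 16 384 bytes of local memory, so it fits with 384 bytes to spare. A three-component tile, as indexed by `pLocal[3*idx(...)]`, would need `tile_bytes(3, 8, 1)` = 12 000 bytes. The build log in this thread also reports a few bytes of smem per kernel (for example "12+16 bytes smem"), so the spare room is probably not entirely free.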

  2. Number of registers

This is my maximum number of registers per block:
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192
Does it mean: number_of_register * number_of_thread_per_block <= 8192 ?
In this case, with 512 threads per block, I can use at most 16 registers per thread, which is really limiting.
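If the limit really is registers per thread times threads per block, the trade-off can be worked out both ways. The sketch below assumes exactly that rule and ignores any register allocation granularity the hardware may apply, so real limits can be lower. The function names are made up for illustration.

```c
#include <assert.h>

/* Registers each thread may use for a given work-group size,
   assuming regs_per_thread * threads <= regs_per_block. */
static unsigned max_regs_for_threads(unsigned regs_per_block, unsigned threads)
{
    assert(threads > 0);
    return regs_per_block / threads;
}

/* Largest work-group size that fits when every thread needs
   `regs_per_thread` registers, under the same assumption. */
static unsigned max_threads_for_regs(unsigned regs_per_block, unsigned regs_per_thread)
{
    assert(regs_per_thread > 0);
    return regs_per_block / regs_per_thread;
}
```

With 8192 registers per block, 512 threads leave 16 registers each, while a kernel that needs 23 registers would fit at most 356 threads by this estimate. A smaller work-group such as 8*8*4 (256 threads) would leave 32 registers per thread. Whether a smaller work-group suits the algorithm is a separate question.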

I guess this is a limit per kernel and not for the program overall?
If so, one solution would be to split the kernels into smaller ones. Unfortunately that is hardly possible with my code.

If I compile with the option “-cl-nv-maxrregcount=16”, the results are wrong (in some cases I get undefined numbers), probably because I go below the minimum number of registers this particular algorithm needs. http://forums.nvidia.com/index.php?showtopic=193492

With no restriction (no -cl-nv-maxrregcount) the algorithm requires 23 registers.

So what can I really do ?

One thing which requires some registers is the initialization of the local mem to 0.

For now I have this code :


inline int idx(int i, int j, int k, int size)
{
	return ((k*size*size)+(i*size)+j);
}

inline void initP(__local float* pLocal, int li, int lj, int lk, int lSize)
{	
	pLocal[3*idx(li,lj,lk,lSize)+0] =
	pLocal[3*idx(li,lj,lk,lSize)+1] =
	pLocal[3*idx(li,lj,lk,lSize)+2] = 0;

	//if(li-1 == 0 || lj-1 == 0 || lk-1 == 0 || li+1 == lSize-1 || lj+1 == lSize-1 || lk+1 == lSize-1)
	//{
	pLocal[3*idx(li-1,lj,lk,lSize)+0] = pLocal[3*idx(li-1,lj,lk,lSize)+1] = pLocal[3*idx(li-1,lj,lk,lSize)+2] = 0;
	pLocal[3*idx(li+1,lj,lk,lSize)+0] = pLocal[3*idx(li+1,lj,lk,lSize)+1] = pLocal[3*idx(li+1,lj,lk,lSize)+2] = 0;
	pLocal[3*idx(li,lj-1,lk,lSize)+0] = pLocal[3*idx(li,lj-1,lk,lSize)+1] = pLocal[3*idx(li,lj-1,lk,lSize)+2] = 0;
	pLocal[3*idx(li,lj+1,lk,lSize)+0] = pLocal[3*idx(li,lj+1,lk,lSize)+1] = pLocal[3*idx(li,lj+1,lk,lSize)+2] = 0;
	pLocal[3*idx(li,lj,lk-1,lSize)+0] = pLocal[3*idx(li,lj,lk-1,lSize)+1] = pLocal[3*idx(li,lj,lk-1,lSize)+2] = 0;
	pLocal[3*idx(li,lj,lk+1,lSize)+0] = pLocal[3*idx(li,lj,lk+1,lSize)+1] = pLocal[3*idx(li,lj,lk+1,lSize)+2] = 0;
	pLocal[3*idx(li+1,lj-1,lk,lSize)+0] = pLocal[3*idx(li+1,lj-1,lk,lSize)+1] = pLocal[3*idx(li+1,lj-1,lk,lSize)+2] = 0;
	pLocal[3*idx(li-1,lj+1,lk,lSize)+0] = pLocal[3*idx(li-1,lj+1,lk,lSize)+1] = pLocal[3*idx(li-1,lj+1,lk,lSize)+2] = 0;
	pLocal[3*idx(li+1,lj,lk-1,lSize)+0] = pLocal[3*idx(li+1,lj,lk-1,lSize)+1] = pLocal[3*idx(li+1,lj,lk-1,lSize)+2] = 0;
	pLocal[3*idx(li-1,lj,lk+1,lSize)+0] = pLocal[3*idx(li-1,lj,lk+1,lSize)+1] = pLocal[3*idx(li-1,lj,lk+1,lSize)+2] = 0;
	pLocal[3*idx(li,lj-1,lk+1,lSize)+0] = pLocal[3*idx(li,lj-1,lk+1,lSize)+1] = pLocal[3*idx(li,lj-1,lk+1,lSize)+2] = 0;
	pLocal[3*idx(li,lj+1,lk-1,lSize)+0] = pLocal[3*idx(li,lj+1,lk-1,lSize)+1] = pLocal[3*idx(li,lj+1,lk-1,lSize)+2] = 0;
	//}
}

Would the commented-out if statement change anything ? I could also wrap each line in its own if() statement; would that change anything (other than increasing the number of registers) ?

Is there a way to initialize this local mem to 0 automatically ?
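One common alternative to per-neighbour stores is to have the work-group clear the whole tile cooperatively, with each work-item zeroing every lsize-th element. The sketch below emulates a single work-item's share in plain C so it can be run on the host. `zero_strided`, `lid` and `lsize` are assumed names; in a kernel, `lid` would be the flattened local id and `tile` the `__local` buffer. Whether this lowers the register count for this particular kernel is not something the sketch can show.

```c
#include <assert.h>
#include <stddef.h>

/* One work-item's share of a cooperative clear of `n` floats.
   Work-item `lid` out of `lsize` touches elements lid, lid + lsize, ... */
static void zero_strided(float *tile, size_t n, size_t lid, size_t lsize)
{
    assert(lsize > 0 && lid < lsize);
    for (size_t i = lid; i < n; i += lsize)
        tile[i] = 0.0f;
    /* In OpenCL C, a barrier(CLK_LOCAL_MEM_FENCE) would follow this loop
       before any work-item reads the tile. */
}
```

For a three-component 10*10*10 tile (3000 floats) and 512 work-items, each work-item performs at most 6 stores, and calling the function once for every `lid` from 0 to 511 clears the whole tile, borders included.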

  3. Build Log

I would also like to know what the info returned by clGetProgramBuildInfo(cpProgram, device, CL_PROGRAM_BUILD_LOG, 4096, logTxt, NULL) means.

I got something like this:

Build Log:

: Considering profile ‘compute_11’ for gpu=‘sm_11’ in ‘cuModuleLoadDataEx_4’
: Retrieving binary for ‘cuModuleLoadDataEx_4’, for gpu=‘sm_11’, usage mode=’ --verbose --maxrregcount 30 ’
: Considering profile ‘compute_11’ for gpu=‘sm_11’ in ‘cuModuleLoadDataEx_4’
: Control flags for ‘cuModuleLoadDataEx_4’ disable search path
: Ptx binary found for ‘cuModuleLoadDataEx_4’, architecture=‘compute_11’
: Ptx compilation for ‘cuModuleLoadDataEx_4’, for gpu=‘sm_11’, ocg options=’ --verbose --maxrregcount 30 ’
ptxas info : Compiling entry function ‘opticalFlow’ for ‘sm_11’
ptxas info : Used 15 registers, 40+16 bytes smem, 199 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘rof’ for ‘sm_11’
ptxas info : Used 23 registers, 12+16 bytes smem, 199 bytes cmem[0], 16 bytes cmem[1]
ptxas info : Compiling entry function ‘warp’ for ‘sm_11’
ptxas info : Used 8 registers, 16+16 bytes smem, 199 bytes cmem[0], 4 bytes cmem[1]

What is sm_11 ?

And what are smem, cmem[0] and cmem[1] ? What are the limits I should not exceed for each mem ?

Thanks,
Arthur

Any idea Mr David Garcia :wink: ?

Hi Arthur,

I didn’t reply earlier because your questions are by and large specific to nVidia.

I’ll give it a shot anyway :slight_smile:

I have the following CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte, I suppose it means I can have a maximum of 4 blocks of float (4 bytes ?) of 8+2 pixels in my local mem ? (8 is the local size + 2 for the overlapping edges)
4*4*10*10*10 = 16 000 bytes: OK ?

Yeah, floats are 32-bit since OpenCL C requires them to follow the IEEE single precision representation. Your estimation of local memory usage therefore looks right.

One thing which requires some registers is the initialization of the local mem to 0.

Ah, and you are running into trouble because you need to initialize border data as well, right? What you can do is allocate an extra buffer object of size 16KB “pGlobalZeroedMemory”, initialize it with zeroes using a new utility kernel, then afterwards in your main kernel call this to reset your local memory:


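// Start an asynchronous copy of 4000 zeroed floats (16 000 bytes) from global to local memory.
event_t zeroed = async_work_group_copy(pLocal, pGlobalZeroedMemory, 4000, 0);
// ... do other stuff here if desired ...
// Wait for the copy to finish before reading pLocal.
wait_group_events(1, &zeroed);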

Ok !

Yes, I must initialize border data as well.
I tried this global-to-local memory copy, and unfortunately it did not decrease the number of registers (I currently have 23 and it must go down to 16). Also, cmem[1] increased from 24 to 32.

Thanks anyway !

I just had the following answer from a professional:

“The available number of registers is always a huge problem, most times a simple splitting of the algorithm into multiple kernels is the fastest approach (depending on the CC of the GPU).”