Hello !
I am working on an optical flow algorithm. I developed a working 2D version, and now I am having trouble with the 3D version. I work with blocks of 8x8x8 pixels since I have CL_DEVICE_MAX_WORK_GROUP_SIZE: 512.
My kernel is silently skipped, without any warning or error, when I try to allocate too much local memory or when the code requires too many registers.
- Local memory:
I have CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte. I suppose this means I can have a maximum of 4 blocks of float (4 bytes?) of 8+2 pixels per side in my local memory? (8 is the local size, +2 for the overlapping edges.)
4 x 4 x 10 x 10 x 10 = 16 000 bytes: OK?
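To make this arithmetic explicit, here is a minimal sketch in plain C (host side, not kernel code). The function names and the `halo` parameter are my own, purely for illustration; it assumes 4-byte floats and a one-element border on each side of the block.

```c
#include <stddef.h>

/* Bytes needed for `buffers` float tiles of (block + 2*halo)^3 elements.
   Names are illustrative and do not come from the original kernel. */
size_t local_tile_bytes(size_t buffers, size_t block, size_t halo)
{
    size_t edge = block + 2 * halo;   /* 8 + 2*1 = 10 for the case above */
    return buffers * sizeof(float) * edge * edge * edge;
}

/* Returns 1 if the tiles fit in `local_mem_size` bytes, 0 otherwise. */
int fits_in_local(size_t buffers, size_t block, size_t halo,
                  size_t local_mem_size)
{
    return local_tile_bytes(buffers, block, halo) <= local_mem_size;
}
```

With 4 buffers, an 8-wide block and a halo of 1, this gives 16 000 bytes, which is below 16 384 (16 KByte) but leaves only 384 bytes of headroom. A buffer that stores 3 floats per element counts as 3 of these buffers (12 000 bytes).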
- Number of registers
This is my maximum number of registers per block:
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192
Does it mean: number_of_registers * number_of_threads_per_block <= 8192?
In that case each work-item can use at most 16 registers (8192 / 512), which is really limiting.
I guess this is a limit per kernel and not for the program as a whole?
If so, one solution would be to divide the kernels into small ones. Unfortunately it is hardly possible with my code.
If I compile with the option "-cl-nv-maxrregcount=16", the results are wrong (in some cases I get undefined numbers), probably because I am below the number of registers this particular algorithm needs. http://forums.nvidia.com/index.php?showtopic=193492
With no restriction (no -cl-nv-maxrregcount) the algorithm requires 23 registers.
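Assuming the simple product rule from my question holds (registers per work-item times work-items per block must not exceed the per-block limit), the two numbers can be worked out as in the sketch below. This is plain C for the arithmetic only; the function names are hypothetical, and the hardware may round register allocation up to some granularity, so the real limits could be lower.

```c
/* Most registers each work-item may use so that a work-group of
   `group_size` items still fits in `regs_per_block` registers. */
unsigned max_regs_per_item(unsigned regs_per_block, unsigned group_size)
{
    return regs_per_block / group_size;
}

/* Largest work-group size for a kernel that needs `regs_per_item`
   registers per work-item (integer division rounds down). */
unsigned max_group_size(unsigned regs_per_block, unsigned regs_per_item)
{
    return regs_per_block / regs_per_item;
}
```

For 8192 registers and 512 work-items this gives 16 registers each. Conversely, at 23 registers per work-item the largest group would be 356 work-items, so under this assumption a smaller block such as 8x8x4 (256 work-items) would fit while 8x8x8 would not.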
So what can I really do?
One thing that requires some registers is the initialization of the local memory to 0.
For now I have this code:
inline int idx(int i, int j, int k, int size)
{
    return ((k*size*size)+(i*size)+j);
}
inline void initP(__local float* pLocal, int li, int lj, int lk, int lSize)
{
    pLocal[3*idx(li,lj,lk,lSize)+0] =
    pLocal[3*idx(li,lj,lk,lSize)+1] =
    pLocal[3*idx(li,lj,lk,lSize)+2] = 0;
    //if(li-1 == 0 || lj-1 == 0 || lk-1 == 0 || li+1 == lSize-1 || lj+1 == lSize-1 || lk+1 == lSize-1)
    //{
    pLocal[3*idx(li-1,lj,lk,lSize)+0] = pLocal[3*idx(li-1,lj,lk,lSize)+1] = pLocal[3*idx(li-1,lj,lk,lSize)+2] = 0;
    pLocal[3*idx(li+1,lj,lk,lSize)+0] = pLocal[3*idx(li+1,lj,lk,lSize)+1] = pLocal[3*idx(li+1,lj,lk,lSize)+2] = 0;
    pLocal[3*idx(li,lj-1,lk,lSize)+0] = pLocal[3*idx(li,lj-1,lk,lSize)+1] = pLocal[3*idx(li,lj-1,lk,lSize)+2] = 0;
    pLocal[3*idx(li,lj+1,lk,lSize)+0] = pLocal[3*idx(li,lj+1,lk,lSize)+1] = pLocal[3*idx(li,lj+1,lk,lSize)+2] = 0;
    pLocal[3*idx(li,lj,lk-1,lSize)+0] = pLocal[3*idx(li,lj,lk-1,lSize)+1] = pLocal[3*idx(li,lj,lk-1,lSize)+2] = 0;
    pLocal[3*idx(li,lj,lk+1,lSize)+0] = pLocal[3*idx(li,lj,lk+1,lSize)+1] = pLocal[3*idx(li,lj,lk+1,lSize)+2] = 0;
    pLocal[3*idx(li+1,lj-1,lk,lSize)+0] = pLocal[3*idx(li+1,lj-1,lk,lSize)+1] = pLocal[3*idx(li+1,lj-1,lk,lSize)+2] = 0;
    pLocal[3*idx(li-1,lj+1,lk,lSize)+0] = pLocal[3*idx(li-1,lj+1,lk,lSize)+1] = pLocal[3*idx(li-1,lj+1,lk,lSize)+2] = 0;
    pLocal[3*idx(li+1,lj,lk-1,lSize)+0] = pLocal[3*idx(li+1,lj,lk-1,lSize)+1] = pLocal[3*idx(li+1,lj,lk-1,lSize)+2] = 0;
    pLocal[3*idx(li-1,lj,lk+1,lSize)+0] = pLocal[3*idx(li-1,lj,lk+1,lSize)+1] = pLocal[3*idx(li-1,lj,lk+1,lSize)+2] = 0;
    pLocal[3*idx(li,lj-1,lk+1,lSize)+0] = pLocal[3*idx(li,lj-1,lk+1,lSize)+1] = pLocal[3*idx(li,lj-1,lk+1,lSize)+2] = 0;
    pLocal[3*idx(li,lj+1,lk-1,lSize)+0] = pLocal[3*idx(li,lj+1,lk-1,lSize)+1] = pLocal[3*idx(li,lj+1,lk-1,lSize)+2] = 0;
    //}
}
Would the commented-out if statement change anything? I could also guard each line with its own if() statement; would that change anything (other than increasing the number of registers)?
Is there a way to initialize this local memory to 0 automatically?
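As far as I know, OpenCL does not zero __local memory on its own, so the work-items have to clear it themselves. One possible alternative to the per-neighbour stores above is a strided loop in which every work-item clears a share of the whole tile. The sketch below is plain C so that the coverage can be checked on the host; the function names are hypothetical. In a kernel, `buf` would be a __local pointer, `lid` would be the flattened local id built from get_local_id(), and a barrier(CLK_LOCAL_MEM_FENCE) would be needed before any work-item reads the tile.

```c
#include <stddef.h>

/* Work done by ONE work-item: zero every group_size-th element,
   starting at its own flattened local id. */
void zero_strided(float *buf, size_t n, size_t lid, size_t group_size)
{
    for (size_t i = lid; i < n; i += group_size)
        buf[i] = 0.0f;
}

/* Host-side emulation of the whole work-group, for checking only:
   runs the per-work-item loop once for each local id in turn. */
void zero_tile_emulated(float *buf, size_t n, size_t group_size)
{
    for (size_t lid = 0; lid < group_size; ++lid)
        zero_strided(buf, n, lid, group_size);
}
```

For a 10x10x10 tile with 3 floats per element (3000 floats) and 512 work-items, each work-item writes at most 6 values, and the index arithmetic reduces to a single loop counter instead of 13 separate idx() computations.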
- Build Log
I would also like to know how to read the information returned by clGetProgramBuildInfo(cpProgram, device, CL_PROGRAM_BUILD_LOG, 4096, logTxt, NULL);.
I get something like this:
Build Log:
: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'
: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode=' --verbose --maxrregcount 30 '
: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'
: Control flags for 'cuModuleLoadDataEx_4' disable search path
: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'
: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options=' --verbose --maxrregcount 30 '
ptxas info : Compiling entry function 'opticalFlow' for 'sm_11'
ptxas info : Used 15 registers, 40+16 bytes smem, 199 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function 'rof' for 'sm_11'
ptxas info : Used 23 registers, 12+16 bytes smem, 199 bytes cmem[0], 16 bytes cmem[1]
ptxas info : Compiling entry function 'warp' for 'sm_11'
ptxas info : Used 8 registers, 16+16 bytes smem, 199 bytes cmem[0], 4 bytes cmem[1]
What is sm_11?
And what are smem, cmem[0] and cmem[1]? What limits should I not exceed for each of these memories?
Thanks,
Arthur