Matrix multiplication with a large global work size and a very small local work-group size

My system supports all the constraints that the Khronos OpenCL 1.2 specification places on execution. For a matrix multiplication, I am varying the local work-group size from 2 to 32 and the global work size from 512 to 4096. 4096 is the largest global work size I can use because of my system's maximum global memory, and 32 is the largest local work-group size because my device allows at most 1024 work-items per group. However, with a global work size of 2048 and a local work size of 2, and with a global work size of 4096 and local work sizes of 2 and 4, and only in those cases, clEnqueueReadBuffer fails with CL_OUT_OF_RESOURCES. I can't track down the problem from what I have observed. Can anyone help me with this situation? Is there a limit such as a "maximum number of work-groups supported by a kernel", or is something else going on?

Sometimes I see CL_OUT_OF_RESOURCES returned from blocking calls when a previously enqueued kernel has done an out-of-bounds access. So perhaps your kernel code has a bug? I know it's a strange error code to report for that, but that's been my experience.
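
Since your kernel isn't posted, here is only a rough sketch in plain C of the checks I would run on the host before blaming the device. Everything in it is my own assumption, not your code: a square 2-D NDRange, a naive kernel in which work-item (row, col) writes c[row * n + col], and the hypothetical names groups_per_dim, round_up, unguarded_launch_in_bounds and guarded_matmul_src. The kernel string shows the usual bounds guard, and the helpers emulate the index arithmetic so you can see when a launch would run past an n x n buffer.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical guarded kernel (OpenCL C source kept as a string).
 * The early return keeps padded work-items away from the buffers. */
const char *const guarded_matmul_src =
    "__kernel void matmul(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c,\n"
    "                     const int n)\n"
    "{\n"
    "    int row = get_global_id(1);\n"
    "    int col = get_global_id(0);\n"
    "    if (row >= n || col >= n) return; /* bounds guard */\n"
    "    float acc = 0.0f;\n"
    "    for (int k = 0; k < n; ++k)\n"
    "        acc += a[row * n + k] * b[k * n + col];\n"
    "    c[row * n + col] = acc;\n"
    "}\n";

/* Work-groups along one dimension. OpenCL 1.2 requires the global size
 * to be a multiple of the local size, so return 0 when it is not. */
size_t groups_per_dim(size_t global, size_t local)
{
    if (local == 0 || global % local != 0)
        return 0;
    return global / local;
}

/* Smallest multiple of `local` that is >= n (local must be non-zero).
 * Useful for padding the global size when n is not a multiple of local. */
size_t round_up(size_t n, size_t local)
{
    return ((n + local - 1) / local) * local;
}

/* For a kernel WITHOUT the guard: does the last work-item of a
 * global x global launch still index inside an n x n buffer? */
bool unguarded_launch_in_bounds(size_t global, size_t n)
{
    if (global == 0)
        return true;
    size_t last = (global - 1) * n + (global - 1); /* highest index written */
    return last < n * n;
}
```

For example, groups_per_dim(2048, 2) and groups_per_dim(4096, 4) both give 1024 work-groups per dimension, while groups_per_dim(4096, 2) gives 2048. If the matrix size is not a multiple of the local size (say n = 1000 with a local size of 32), round_up pads the global size to 1024, and unguarded_launch_in_bounds(1024, 1000) reports that a kernel without the guard would write past the end of the buffer. Also check the value returned by clEnqueueNDRangeKernel itself, and call clFinish right after it, so that a failure is reported at the kernel launch rather than at the later read.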