CL_MEM_OBJECT_ALLOCATION_FAILURE upon clEnqueueNDRangeKernel

I have 5 kernels that process a finite amount of data. Multiple cl_mem objects are created; some are used by a single kernel and some are shared across kernels. I keep getting CL_MEM_OBJECT_ALLOCATION_FAILURE when enqueuing the 3rd kernel. However, when I reduce the data, I get the error when enqueuing the 4th kernel instead (the 3rd kernel enqueue works fine). None of the clCreateBuffer calls returns an error.

I suspected a memory issue. For the first (larger) data set, almost 42 MB of global memory (cl_mem objects) had been allocated before the 3rd kernel enqueue failed. For the second (smaller) data set, only 1.48 MB of global memory had been allocated before the 4th kernel enqueue failed. My device capability queries report CL_DEVICE_MAX_MEM_ALLOC_SIZE as 256 MByte and CL_DEVICE_GLOBAL_MEM_SIZE as 1024 MByte, so I am allocating far less than these limits.

Fearing a problem in the kernel code, I commented out the entire kernel body, leaving only the parameters, and I still get the same error. So I am completely lost. The callback notification function registered on the context (in clCreateContext) didn’t provide any additional details. Is there any way to find out which memory object allocation failed, and why?

Thanks in advance…

I am running OpenCL 1.1. These are the device details:

Device Details

CL_DEVICE_NAME: GeForce GTX 460
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 340.62
CL_DEVICE_VERSION: OpenCL 1.1 CUDA
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1
CL_DEVICE_TYPE: GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 7
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1350 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 47KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8

You are seeing the effect of so-called delayed (or lazy) allocation, a very common technique used by many implementations.

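Basically, when an OpenCL buffer is created, it is not physically allocated on the device right away. Instead, allocation is delayed until the buffer is first used, for example when a kernel that takes it as an argument is enqueued. This is why clCreateBuffer reports no error and why the failure shows up at a different kernel enqueue depending on the array size.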
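
To make the mechanism concrete, here is a toy model in plain C. It is only a sketch of the idea, not how any real driver works: `toy_device`, `toy_buffer`, `toy_create` and `toy_enqueue` are made-up names, and the capacity is an arbitrary number chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names belong to the OpenCL API. */
typedef struct {
    size_t capacity;   /* bytes the pretend device can hold */
    size_t resident;   /* bytes physically allocated so far */
} toy_device;

typedef struct {
    size_t size;       /* requested size in bytes */
    int    resident;   /* 0 until the buffer is first used */
} toy_buffer;

/* Analogue of clCreateBuffer: records the request, allocates nothing,
 * and therefore cannot fail in this model. */
static toy_buffer toy_create(size_t size)
{
    toy_buffer b = { size, 0 };
    return b;
}

/* Analogue of enqueuing a kernel that takes `bufs` as arguments.
 * Every buffer that is not yet resident is allocated now.
 * Returns 0 on success, or -1 at the point where a real runtime
 * might report CL_MEM_OBJECT_ALLOCATION_FAILURE. */
static int toy_enqueue(toy_device *dev, toy_buffer **bufs, size_t n)
{
    assert(dev != NULL && bufs != NULL);
    for (size_t i = 0; i < n; ++i) {
        toy_buffer *b = bufs[i];
        if (b->resident)
            continue;                      /* allocated by an earlier kernel */
        if (b->size > dev->capacity - dev->resident)
            return -1;                     /* no room left, nothing is evicted */
        dev->resident += b->size;
        b->resident = 1;
    }
    return 0;
}
```

With a capacity of 100 and buffers of 60, 30 and 40 bytes, all three creations succeed, and the failure only surfaces at the enqueue that first touches the third buffer. With smaller buffers the same failure moves to a later enqueue, which matches the pattern described in the question.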

Lazy allocation is allowed by the standard and is not a problem per se. The problem is that many implementations do not support the other side, buffer eviction (i.e. moving a buffer off the device when it’s not in use). To work around this, you will have to manage your buffers manually, releasing them when they are no longer in use.
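
As a rough sketch of what manual management could look like, the plain C helper below works out after which kernel each buffer can be released, and what the peak resident size would be with and without early release. The `buffer_plan` record and the function names are hypothetical bookkeeping, not OpenCL API. The only real call referred to is clReleaseMemObject, and whether a given driver actually returns device memory at that point is an assumption you would need to verify.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_KERNELS 5   /* five kernels, as in the question */

/* Hypothetical bookkeeping record; not part of the OpenCL API. */
typedef struct {
    size_t size;                  /* buffer size in bytes */
    int    used_by[NUM_KERNELS];  /* 1 if kernel k takes this buffer as an argument */
} buffer_plan;

/* Index of the first kernel that uses the buffer, or -1 if none does. */
static int first_use(const buffer_plan *b)
{
    assert(b != NULL);
    for (int k = 0; k < NUM_KERNELS; ++k)
        if (b->used_by[k])
            return k;
    return -1;
}

/* Index of the last kernel that uses the buffer, or -1 if none does.
 * Once the host no longer needs the data, this is the point after
 * which clReleaseMemObject could be called on the buffer. */
static int last_use(const buffer_plan *b)
{
    assert(b != NULL);
    int last = -1;
    for (int k = 0; k < NUM_KERNELS; ++k)
        if (b->used_by[k])
            last = k;
    return last;
}

/* Peak number of bytes resident over the kernel sequence, assuming a
 * buffer is allocated at its first use and never evicted. If
 * release_early is nonzero, each buffer is released right after its
 * last use; otherwise every buffer stays resident until the end. */
static size_t peak_resident(const buffer_plan *bufs, size_t n, int release_early)
{
    size_t resident = 0, peak = 0;
    for (int k = 0; k < NUM_KERNELS; ++k) {
        for (size_t i = 0; i < n; ++i)
            if (first_use(&bufs[i]) == k)
                resident += bufs[i].size;
        if (resident > peak)
            peak = resident;
        if (release_early)
            for (size_t i = 0; i < n; ++i)
                if (last_use(&bufs[i]) == k)
                    resident -= bufs[i].size;
    }
    return peak;
}
```

For example, with four buffers of 10, 20, 30 and 40 bytes used by kernels {0,1}, {1}, {2,3} and {4} respectively, the peak is 100 bytes if nothing is released and 40 bytes if each buffer is released after its last use.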