clSetKernelArg performance

I run simulations, so I basically loop over the same kernels again and again. Many of those kernels never change their argument values. For now I set all the arguments before every clEnqueueNDRangeKernel call, over and over again.

For argument values that never change, can I set the arguments only once, at the start?

What bothers me is the “Notes” section of the clSetKernelArg specification, which is a little ambiguous to me, in particular this sentence:

Users may not rely on a kernel object to retain objects specified as argument values to the kernel.

Btw how high is clSetKernelArg’s overhead anyway?
Thanks

I think you can set the arguments only once at the start, because argument values are copied and the specification says:

The argument value specified is the value used by all API calls that enqueue kernel (clEnqueueNDRangeKernel and clEnqueueTask) until the argument value is changed by a call to clSetKernelArg for kernel

You can even release OpenCL objects referenced by the arguments once the commands that use them have been enqueued, because although the beginning of the specification says:

Reference Count: The life span of an OpenCL object is determined by its reference count—an internal count of the number of references to the object. When you create an object in OpenCL, its reference count is set to one. Subsequent calls to the appropriate retain API (such as clRetainContext, clRetainCommandQueue) increment the reference count. Calls to the appropriate release API (such as clReleaseContext, clReleaseCommandQueue) decrement the reference count. After the reference count reaches zero, the object’s resources are deallocated by OpenCL.

there are many special cases:

cl_int clReleaseMemObject (cl_mem memobj) decrements the memobj reference count. After the memobj reference count becomes zero and commands queued for execution on a command-queue(s) that use memobj have finished, the memory object is deleted. …

cl_int clReleaseCommandQueue (cl_command_queue command_queue) decrements the command_queue reference count. After the command_queue reference count becomes zero and all commands queued to command_queue have finished (e.g., kernel executions, memory object updates, etc.), the command-queue is deleted.

cl_int clReleaseSampler (cl_sampler sampler) decrements the sampler reference count. The sampler object is deleted after the reference count becomes zero and commands queued for execution on a command-queue(s) that use sampler have finished.

cl_int clReleaseProgram (cl_program program) decrements the program reference count. The program object is deleted after all kernel objects associated with program have been deleted and the program reference count becomes zero.

cl_int clReleaseKernel (cl_kernel kernel) decrements the kernel reference count. The kernel object is deleted once the number of instances that are retained to kernel become zero and the kernel object is no longer needed by any enqueued commands that use kernel.

cl_int clReleaseEvent (cl_event event) decrements the event reference count. The event object is deleted once the reference count becomes zero, the specific command identified by this event has completed (or terminated) and there are no commands in the command-queues of a context that require a wait for this event to complete.
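To make the clReleaseMemObject wording quoted above concrete, here is a small toy model in plain C. It is not OpenCL code and does not describe how any driver is implemented: toy_mem and all toy_* functions are hypothetical names invented for this sketch. It only assumes the rule as quoted, namely that the object is deleted once its reference count is zero and no enqueued command still uses it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a memory object. This is NOT a real cl_mem. */
typedef struct {
    int  ref_count;    /* set to 1 on creation, as the quoted text describes */
    int  pending_uses; /* enqueued commands that still use the object */
    bool deleted;
} toy_mem;

/* Creating an object sets its reference count to one. */
static toy_mem toy_create(void)
{
    toy_mem m = {1, 0, false};
    return m;
}

/* Delete only when nobody holds a reference AND no command needs the object. */
static void toy_maybe_delete(toy_mem *m)
{
    if (m->ref_count == 0 && m->pending_uses == 0)
        m->deleted = true;
}

/* Models the release call: decrement, then check whether deletion is due. */
static void toy_release(toy_mem *m)
{
    assert(m->ref_count > 0);
    m->ref_count--;
    toy_maybe_delete(m);
}

/* Models enqueuing a command that uses the object. */
static void toy_command_enqueued(toy_mem *m)
{
    assert(!m->deleted); /* enqueuing with a deleted object is the bug to avoid */
    m->pending_uses++;
}

/* Models the completion of such a command. */
static void toy_command_finished(toy_mem *m)
{
    assert(m->pending_uses > 0);
    m->pending_uses--;
    toy_maybe_delete(m);
}
```

In this model, calling toy_release after toy_command_enqueued keeps the object alive until toy_command_finished runs. Calling toy_release before anything is enqueued deletes the object immediately, because merely being set as a kernel argument does not count as a use. That matches the note that users may not rely on a kernel object to retain objects specified as argument values.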

OK, so global (and constant) memory is persistent across kernel invocations.

But is this also the case for simple arguments (those that are copied to private memory)?

Every clEnqueueNDRangeKernel or clEnqueueTask call resets all kernel arguments to the values copied by the most recent clSetKernelArg call.
But this applies only to arguments.
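The behaviour described in this thread can be sketched with a toy model in plain C. It is not OpenCL code: toy_kernel, toy_command and the toy_* functions are hypothetical names made up for illustration, and the fixed sizes are arbitrary. The sketch assumes only what the quoted specification text says: the argument value is copied when it is set, and every enqueue uses the last value set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ARGS  4   /* arbitrary limits for the sketch */
#define TOY_ARG_BYTES 16

/* Toy stand-in for a kernel object: it owns a private copy of each argument. */
typedef struct {
    unsigned char args[TOY_MAX_ARGS][TOY_ARG_BYTES];
    size_t        sizes[TOY_MAX_ARGS];
} toy_kernel;

/* Toy stand-in for an enqueued command: a snapshot of the arguments. */
typedef struct {
    toy_kernel snapshot;
} toy_command;

/* Mimics clSetKernelArg: the bytes are copied, so the caller's variable
   may change or go out of scope afterwards. */
static void toy_set_arg(toy_kernel *k, size_t index, size_t size, const void *value)
{
    assert(index < TOY_MAX_ARGS && size <= TOY_ARG_BYTES);
    memcpy(k->args[index], value, size);
    k->sizes[index] = size;
}

/* Mimics an enqueue call: captures the argument values as they are now. */
static toy_command toy_enqueue(const toy_kernel *k)
{
    toy_command c;
    c.snapshot = *k;
    return c;
}

/* Reads a float argument back from an enqueued command. */
static float toy_command_float_arg(const toy_command *c, size_t index)
{
    float v;
    assert(index < TOY_MAX_ARGS && c->snapshot.sizes[index] == sizeof v);
    memcpy(&v, c->snapshot.args[index], sizeof v);
    return v;
}
```

In this model, setting a float argument once and calling toy_enqueue many times gives every command the same value, even if the host variable is overwritten in between. Calling toy_set_arg again affects only commands enqueued afterwards. This is why, in the simulation loop from the question, only the arguments that actually change need to be set again.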