About clSetEventCallback()

Hi,

I have a question regarding clSetEventCallback() and the way callbacks work.
I ran the following code.
=============================================================================

volatile int global_var = 0;  // volatile so the spin loop re-reads it on every iteration

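// OpenCL event callbacks receive the event, its status, and the user data pointer.
void CL_CALLBACK call_back(cl_event event, cl_int status, void * args)
{
    global_var = 1;
}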

begin_comp_copy_time = rtclock();
  clEnqueueNDRangeKernel(...,&compute_event);  // enqueue some computation
  clEnqueueReadBuffer(data_queue, CL_FALSE,....,1, &compute_event, &read_event);  // perform some read (non-blocking)
  clSetEventCallback(read_event,..., call_back, args); // set a callback to run once the read finishes.
  clFinish(compute_queue);
  clFinish(data_queue);   // at this point all commands on all queues have completed.
total_comp_copy_time = rtclock() - begin_comp_copy_time;

begin_callback_time = rtclock();
    while(!global_var); // spin until the global_var is set to 1
total_callback_time = rtclock() - begin_callback_time;

==============================================================================

When I print the timings, I see that total_comp_copy_time is just 3ms, but total_callback_time is a whopping 17ms.
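The rtclock() helper used for these measurements is not shown in the post. For reference, here is a minimal sketch of what such a helper could look like, assuming it returns seconds as a double from a monotonic clock. Both the implementation and the elapsed_ms() helper are illustrative assumptions, not the poster's actual code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the rtclock() used in the post:
 * returns seconds from a monotonic clock as a double. */
double rtclock(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);  /* CLOCK_MONOTONIC is expected to be available */
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: elapsed milliseconds between two rtclock() readings. */
double elapsed_ms(double begin, double end)
{
    return (end - begin) * 1e3;
}
```

A monotonic clock is used here so that wall-clock adjustments cannot skew short intervals such as the 3ms and 17ms figures above.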

Now, if I remove the “while(!global_var)” check, the time is just a little over 3ms, but then there is no synchronization when this code is run inside a loop.

I am not able to pin down the reason for this. (I suspect the high time is due to a process scheduling issue.)
Can anyone explain precisely what causes this high timing value?
On which processor does the callback execute? Is it the same processor as the one that registered the callback?
If the processor is spinning in the while loop, is call_back waiting for the process to get scheduled out? Is this the reason for the 17ms?

Thanks
-Thejas

I have reduced the code further to the smallest piece that still causes this
problem. There is no need to run a kernel at all.

======================================
volatile int global_var = 0;  // volatile so the spin loop re-reads it on every iteration

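// OpenCL event callbacks receive the event, its status, and the user data pointer.
void CL_CALLBACK call_back(cl_event event, cl_int status, void * args)
{
    global_var = 1;
}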

int main()
{

clEnqueueReadBuffer(data_queue, CL_FALSE,…,0, 0, &read_event); // perform some non-blocking read

clSetEventCallback(read_event,..., call_back, args); // set a callback to run once the read finishes.

begin_callback_time = rtclock();
while(!global_var); // spin until the global_var is set to 1
total_callback_time = rtclock() - begin_callback_time;

}

=================================================

The above code is enough to hit a timing of 20ms, though the actual data transfer takes only 2ms. So I see this issue only when I try to synchronize on the callback (with the “while” loop).
I want to synchronize because I want to run this code in a loop, and I don't want to
start the next iteration before this iteration’s data has been copied out.

Any thoughts on what might be the reason?

This is not defined by the spec. The only guarantee is that the callback will eventually be executed after the command completes.
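Since the spec leaves open which thread runs the callback and when, one way to wait for it without burning a core in a spin loop is a mutex plus condition variable. The following is a minimal sketch under stated assumptions: the completion_t type and all function names are hypothetical, and a plain pthread stands in for whatever runtime thread would invoke the callback. No OpenCL calls are made.

```c
#include <assert.h>
#include <pthread.h>

/* Completion state shared between the waiting thread and the callback. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
} completion_t;

void completion_init(completion_t *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

/* What the callback body would do: mark completion and wake the waiter. */
void completion_signal(completion_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Block (without spinning) until completion_signal() has run. */
void completion_wait(completion_t *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Stand-in for the runtime thread that would invoke the callback. */
void *fake_runtime_thread(void *arg)
{
    completion_signal((completion_t *)arg);
    return NULL;
}

/* Runs one wait/signal round trip; returns the final value of done. */
int demo_round_trip(void)
{
    completion_t c;
    pthread_t t;
    completion_init(&c);
    int rc = pthread_create(&t, NULL, fake_runtime_thread, &c);
    assert(rc == 0);
    (void)rc;
    completion_wait(&c);
    pthread_join(t, NULL);
    return c.done;
}
```

In an OpenCL program, the user data pointer passed when registering the callback could carry the completion_t, and the callback would simply call completion_signal() on it. Whether this changes the observed latency is not something this sketch demonstrates; it only removes the busy-wait.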

Best thing to do is to profile your code and find out!

Note that your code does not include a flush and the read is non-blocking, so the runtime might never submit the command to the device.

Thanks for the reply!

>> Best thing to do is to profile your code and find out!

I profiled the code and saw that the callback and the thread that registered the callback are
running on two different processors. So I guess the cause is not a scheduling issue.
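With the callback and the waiting thread on different processors, a plain int flag is a data race in C11 terms, and the compiler is free to hoist the load out of the loop. If a spin wait is kept, an atomic flag with release/acquire ordering is one way to make it well defined. This is a minimal sketch, assuming a pthread as a stand-in for the callback thread; the names transfer_done, payload, fake_callback_thread and spin_until_done are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_int transfer_done;  /* static storage: starts at 0 */
int payload;               /* plain data published before the flag */

/* Stand-in for the callback running on another processor. */
void *fake_callback_thread(void *arg)
{
    (void)arg;
    payload = 42;  /* written before the release store below */
    atomic_store_explicit(&transfer_done, 1, memory_order_release);
    return NULL;
}

/* Spin until the flag is set; returns the payload seen afterwards. */
int spin_until_done(void)
{
    pthread_t t;
    atomic_store_explicit(&transfer_done, 0, memory_order_relaxed);
    int rc = pthread_create(&t, NULL, fake_callback_thread, NULL);
    assert(rc == 0);
    (void)rc;
    while (!atomic_load_explicit(&transfer_done, memory_order_acquire))
        ;  /* busy-wait; the acquire load pairs with the release store */
    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

The release store guarantees that everything written before it (here, payload) is visible to the thread that observes the flag through the acquire load. It makes the loop correct, but it does not by itself explain or remove the delay before the callback is invoked.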

>> Note that your code does not include a flush and the read is non-blocking, so the
>> runtime might never submit the command to the device

I later added a clFlush() too, but the result is still the same as before.