I run my code on the CPU device and everything works, but when I run it on the GPU device I get
error -36, which according to cl.h corresponds to CL_INVALID_COMMAND_QUEUE.
This is the piece of code that has the problem:
When I run it (with global_size=64, local_size=1) on the CPU it works (it completes all 10 rounds), but on the GPU I get:
Round 0…
Round 1…
ERROR: clEnqueueNDRangeKernel, error code -36
I suspected a synchronization problem, so I added clFinish(command_queue), but it still does not work.
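To make the output easier to read while debugging, the numeric code could be printed together with its name and the round in which it occurred. This is only a minimal sketch: the numeric values are the standard ones from cl.h (only a few are listed), while cl_error_name and check_cl are hypothetical helper names that are not part of OpenCL or of my code.

```c
#include <stdio.h>

/* Hypothetical helper: map a few standard cl.h error codes to their names.
   Codes that are not listed fall through to a generic label. */
static const char *cl_error_name(int err)
{
    switch (err) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -52: return "CL_INVALID_KERNEL_ARGS";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "UNKNOWN_CL_ERROR";
    }
}

/* Hypothetical helper: report a failed call on stderr.
   Returns 1 if err is CL_SUCCESS (0), otherwise prints a message and returns 0. */
static int check_cl(int err, const char *call, int round)
{
    if (err == 0)
        return 1;
    fprintf(stderr, "ERROR: %s in round %d, error code %d (%s)\n",
            call, round, err, cl_error_name(err));
    return 0;
}
```

The idea would be to pass the return value of every OpenCL call in the loop (not only clEnqueueNDRangeKernel) through check_cl, so that the first call that fails is the one that gets reported.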
What is different between the CPU and the GPU when executing a kernel?
I mean, when I run the kernel on the GPU with global_size=64 and local_size=1 and then run it with the same
parameters (global_size=64 and local_size=1) on the CPU, what is different except “command_queue”?
I was thinking that when I don’t group the work items (local_size=1), there is no difference between running on the CPU and on the GPU, so I should get the same result from both (both CPU and GPU run the same kernel).
When I comment out “clFinish(command_queue)”, the while loop finishes correctly (I can see Round 0 …
through Round 10 … in the output), but after the while loop I get the same error, “error code -36”, and this time
it comes from “clEnqueueReadBuffer”.
I’m pretty sure that “command_queue” is the problem because:
1. I commented out the entire body of the kernel and still get the same error, so the problem can’t be in the
kernel.
2. I passed NULL for local_size (the work-group size) so that OpenCL chooses it itself, and the error remains.
3. I passed NULL for event_execute and nothing changed.
4. The only argument that changes between running on the CPU (which works correctly) and on the GPU (which
fails) is command_queue; the other arguments are the same for CPU and GPU.
So the only error-prone argument is “command_queue”.
But I have no idea what else I can do, because I cannot look inside the command queue and I
don’t know how to debug it.
Please help.
I found something:
When we run a kernel with global_size=m and local_size=1 on the GPU, I think OpenCL spreads the kernel
across m “compute units”, each of which gets only one work item. On the CPU the scenario is different:
I think the only option for running the kernel on the CPU is local_size=1 (1 is the only value that can be assigned to local_size). Maybe with this constraint
we force the CPU to run serially?
Am I correct?
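To make the arithmetic behind this guess concrete: in the OpenCL execution model, a one-dimensional NDRange is split into global_size / local_size work-groups, so global_size=64 with local_size=1 gives 64 work-groups of one work item each. The sketch below only computes that count. It assumes the OpenCL 1.x rule that global_size must be a multiple of local_size, num_work_groups is a hypothetical helper name, and it says nothing about how a device places those groups on its compute units.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-groups for a 1-D NDRange.
   Assumes the OpenCL 1.x rule that global_size must be a multiple of
   local_size; returns 0 when that rule (or local_size > 0) is violated. */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}
```

With this, num_work_groups(64, 1) is 64 and num_work_groups(64, 16) is 4. The largest local_size a device accepts for a given kernel can be queried with clGetKernelWorkGroupInfo and CL_KERNEL_WORK_GROUP_SIZE, which would show whether 1 really is the only allowed value on the CPU device.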