Is this undefined behavior?

I have a single kernel (embarrassingly parallel data access, such as A[i] = A[i] + 5) and a single buffer for a read+write operation.

To shorten total latency, I did these:

  • Break the kernel into 64 smaller kernels, each with a global offset and a smaller range.
  • Break the buffer copies into 64 smaller copies, each with a pointer offset and a smaller range.
  • Overlap all of them freely across "many queues" or in a "duplicated staircase overlap", using event-based scheduling (sketched below).
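Roughly what that looks like (simplified sketch of the idea; names like runPartitioned, queues and hostData are made up, and error checks are omitted):

#include <CL/cl.hpp>
#include <vector>

// One in-order queue per piece; events chain write -> kernel -> read inside
// each piece, and the 64 pieces overlap each other freely on one shared buffer.
void runPartitioned(std::vector<cl::CommandQueue> &queues, cl::Kernel &kernel,
	cl::Buffer &buffer, float *hostData, size_t totalElems)
{
	const int PIECES = 64;
	const size_t pieceElems = totalElems / PIECES; // assume it divides evenly

	for (int p = 0; p < PIECES; p++)
	{
		const size_t elemOffset = p * pieceElems;
		const size_t byteOffset = elemOffset * sizeof(float);
		const size_t byteSize = pieceElems * sizeof(float);

		cl::Event wrote, ran;
		std::vector<cl::Event> afterWrite, afterKernel;

		// host -> device, only this piece's region of the shared buffer
		queues[p].enqueueWriteBuffer(buffer, CL_FALSE, byteOffset, byteSize,
			hostData + elemOffset, NULL, &wrote);
		afterWrite.push_back(wrote);

		// same kernel, different global offset + smaller range
		queues[p].enqueueNDRangeKernel(kernel, cl::NDRange(elemOffset),
			cl::NDRange(pieceElems), cl::NullRange, &afterWrite, &ran);
		afterKernel.push_back(ran);

		// device -> host, again only this piece's region
		queues[p].enqueueReadBuffer(buffer, CL_FALSE, byteOffset, byteSize,
			hostData + elemOffset, &afterKernel, NULL);
		queues[p].flush();
	}

	for (int p = 0; p < PIECES; p++)
		queues[p].finish();
}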

Is this undefined behavior?

If yes, is it only on the kernel side?

If not, is there a way to hide the latency of a single (host-to-device buffer copy) + (kernel) + (device-to-host buffer copy) operation without using extra buffers?

I tried this on these devices:

  • HD7870

  • R7-240

  • RX-550

  • FX8150

  • HD400

  • N3070

and it worked without returning any OpenCL error code other than CL_SUCCESS and without any garbage data, for various algorithms such as ray tracing and image processing. Maybe it worked only because both AMD and Intel implement it as a simple compute-unit-to-VRAM buffer copy? I know OpenCL uses a relaxed memory hierarchy, but does that apply even to embarrassingly parallel regions? I don't use any atomic functions in these kernels, since I wouldn't expect those to work, but simple c = a + b logic seems to be working fine.

Regards.

I forgot to mention: by "overlapping" I mean overlapping in time, not in data. Just a note for newcomers like me.

Maybe using N contexts could achieve the same pipelining, but AMD hardware wouldn't let me create more than 5 contexts per device. I'm partitioning a kernel into 64 pieces in a single context and it boosts performance by a good percentage, but I can't think of any other "kernel-level" pipelining (similar to the usual multi-kernel pipelining, but for the same kernel and the same buffer, partitioned).

I have another question:

If

  • at first, command-queue-1 flows to command-queue-2 with an event
  • then command-queue-2 synchronizes with something like a “wait barrier” or “clFinish()”

does command-queue-1 synchronize too? I hope so, because calling clFinish on all queues is slower and defeats the advantage of using multiple command queues on my PC (very slow CPU, so I use a single-queue finish to sync the 15 other queues, for 50% more performance).
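For clarity, this is the pattern I mean (sketch with made-up names; queue1, queue2 and kernelA stand in for my real objects):

#include <CL/cl.hpp>
#include <vector>

// queue1's kernel hands its completion event to queue2 through a marker,
// then the host blocks only on queue2.
void syncThroughQueue2(cl::CommandQueue &queue1, cl::CommandQueue &queue2,
	cl::Kernel &kernelA, size_t globalSize)
{
	cl::Event fromQueue1;
	queue1.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(globalSize),
		cl::NullRange, NULL, &fromQueue1);
	queue1.flush();

	std::vector<cl::Event> waitList;
	waitList.push_back(fromQueue1);
	queue2.enqueueMarkerWithWaitList(&waitList, NULL); // queue2 now depends on queue1's kernel
	queue2.finish(); // does this also make queue1's kernel (and its buffer) "finished"?
}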

Another version:

If I wait on a marker (with an event or a callback), does it synchronize the commands before that marker in the same queue?

Regards.

I've forgotten another thing to mention in the first question: the N kernels were actually the same cl::Kernel instance, running on the same buffer (parameter) with different offset+range. Should I convert it to an N-instances version?

You're good if the writes do not overlap, but you force the runtime to perform multiple additional memory barriers. If it does actually improve performance, then good for you, but it may be worth trying sub-buffers (perhaps on a single queue) instead of explicit offsets.
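Something along these lines, for instance (untested sketch; makePiece is a hypothetical helper, and the region origin has to respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN or clCreateSubBuffer fails with CL_MISALIGNED_SUB_BUFFER_OFFSET):

#include <CL/cl.hpp>

// Carve one piece out of the big buffer as a sub-buffer. The kernel then takes
// the sub-buffer as its argument and indexes from 0 within its own piece,
// instead of using an explicit offset.
cl::Buffer makePiece(cl::Buffer &whole, size_t pieceIndex, size_t pieceElems)
{
	cl_buffer_region region;
	region.origin = pieceIndex * pieceElems * sizeof(float); // must respect the alignment rule
	region.size = pieceElems * sizeof(float);

	cl_int err = CL_SUCCESS;
	return whole.createSubBuffer(CL_MEM_READ_WRITE,
		CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
}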

If I wait on a marker (with an event or a callback), does it synchronize the commands before that marker in the same queue?

clWaitForEvents is a synchronization point, so yes.

Thank you. Do you mean I should lift the bug warning I've written here Bugs · tugrul512bit/Cekirdekler Wiki · GitHub and here Pipelining · tugrul512bit/Cekirdekler Wiki · GitHub?

How would I use a sub-buffer with an effective pointer offset? (I mean, I use the exact same kernel code in all parts, so write-once-run-everywhere style, unified-address-like coding becomes possible.) Because each work-item writes to its own addresses/elements, wouldn't a small sub-buffer go out of bounds?

I'd keep it in case you add multi-device support somewhere in the future: writing into the same buffer from different devices is positively bad.

There is already multi-GPU support, but each device works on a different region of the host array. All buffers are duplicated per device because they live in distinct contexts.

They don’t overlap writes. They overlap reads though.

Copying or mapping uses different offsets for each device, but the buffer-creation pointers (CL_MEM_USE_HOST_PTR) are the same (because the working ranges of the devices change at each iteration, they need to be able to work on any range at a different time, for load balancing).

They overlap reads though.

Do you mean you have a read-only buffer each kernel uses (this is valid), or do you mean you make multiple reads from different sections of the buffer you write into? The latter is a data race even in the case of a single kernel.

Either it is read-only and accessed at any address, or it is read/write only within a narrow range per device (the same range as the host-device copy).

Like this:

gpu1 reads + writes element 1
gpu2 reads + writes element 2

or

both read the whole array, but only read

When they read the same locations, it is intended not for load balancing but for broadcasting host data to all devices as full arrays.

Even if the user enables full-array writes from device to host, only one device writes to it (another GPU writes to another host array; they all work at the same time).

The project is for other OpenCL developers to write OpenCL kernel code in C#, so I don't take any responsibility for the kernel part. It's their option to invoke undefined behavior in a kernel :smiley: I'm just concerned about multiple kernels accessing the same buffer concurrently (for a single GPU), but in unique, non-overlapping regions. For multi-GPU, everything is already duplicated per device, so only the host-side pointers could be a problem (mapping-unmapping), maybe.

Also, I've been using this function to sync many queues without problems:


	__declspec(dllexport)
	void waitN(OpenClCommandQueue ** hCommandQueueArray, OpenClCommandQueue * hCommandQueueToSync, const int n)
	{
		// Put a marker on each of the n queues and collect their completion events.
		std::vector<cl::Event> evtVect;
		evtVect.reserve(n); // keep the vector stable while handing out &evtVect[i]
		for (int i = 0; i < n; i++)
		{
			evtVect.push_back(cl::Event());
			hCommandQueueArray[i]->commandQueue.enqueueMarkerWithWaitList(NULL, &evtVect[i]);
			hCommandQueueArray[i]->commandQueue.flush();
		}
		// Make the "sync" queue wait for all n markers, then block the host on that queue only.
		hCommandQueueToSync->commandQueue.enqueueMarkerWithWaitList(&evtVect, NULL);
		hCommandQueueToSync->commandQueue.finish();
	}

Does the last marker with a wait list ensure memory consistency for all queues after the single finish() command (so that the next batch of kernels can see the true buffer values and the host can read the latest bits)?

Should I convert these to an enqueueBarrierWithWaitList version? There aren't any commands enqueued after the markers, so they are last, but is just waiting on the markers enough?

It is not very explicitly documented which one is enough to satisfy queue-level, multiple-queue-level, and multiple-device-level (within the same context, of course) memory consistency. I'm sure the kernels finish before these, but what about the buffer states? When do the latest bits become ready for a single queue, for multiple queues, or for a multiple-device context?
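For reference, the barrier version I'm asking about would only change the last two calls of waitN above (untested sketch, same wrapper types):

	// Same as waitN, but the "sync" queue gets a barrier instead of a marker,
	// so any command enqueued on it afterwards also waits for all n markers.
	hCommandQueueToSync->commandQueue.enqueueBarrierWithWaitList(&evtVect, NULL);
	hCommandQueueToSync->commandQueue.finish();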

EnqueueMarker is deprecated, I believe.

after the single finish() command?

Finish itself is a synchronization point.

Basically, if some command involves waiting on events, it is automatically a synchronization point for every device in the context. At the queue level with an in-order queue you don't need to worry at all: every subsequent command sees the results of the previous command, with the exception of non-blocking mapping and reading/writing.
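For example (sketch, made-up names): a later device command on the same in-order queue sees the kernel's result automatically, but a non-blocking read still needs an explicit wait before the host touches the memory:

#include <CL/cl.hpp>

// The read is enqueued after the kernel on the same in-order queue, so it reads
// the kernel's output; the host, however, must wait on the read's event (or use
// a blocking read / clFinish) before using hostPtr.
void readAfterKernel(cl::CommandQueue &queue, cl::Kernel &kernel,
	cl::Buffer &buffer, float *hostPtr, size_t n)
{
	cl::Event readDone;
	queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
	queue.enqueueReadBuffer(buffer, CL_FALSE /* non-blocking */, 0,
		n * sizeof(float), hostPtr, NULL, &readDone);
	readDone.wait(); // without this, reading hostPtr now would be a race
}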

So a synchronization point synchronizes all queues and everything? (I mean, the state of all buffers and all kernels at that "moment".)

If I don't use events, other queues may still not be complete and their buffer usage may be garbage, but at the moment of the "finish" they are still synched, am I right?

No, you can only count on the memory objects associated with those synch points (i.e. the buffers used by a launched kernel) to be in a valid state after a wait. That holds for every queue and device in the context, though. If you don't use events, you have to use clFinish at some point, which will ensure that every memory object touched by any command on that queue is synched.

Example:


RunKernel(buffer_a)
RunKernel(buffer_b, event_t)
clWaitForEvents(event_t)
//At this point buffer_b is in a valid state, but it might turn out that the first kernel is still running

No, you can only count on the memory objects associated with those synch points to be in a valid state after a wait.

Thank you very much. I use the in-order type for all queues, so it's OK to sync on a single queue, provided that before that sync point, events ensure the other queues are done computing, if I understand correctly.