Reuse device buffer across kernel batches

I’m looking for advice on an implementation where I batch kernel runs on a device and want to reuse certain read-only buffers across those runs.

Each batch consists of 1,000,000 kernel threads. I need to batch them because I pass in an array of structs, where each struct holds the values its kernel thread writes to. Without batching, it would require 39 GB of device memory.

So in my host loop I build an array of structs, “clmodels”, for 1,000,000 items and pass it to the kernel like this:


// 1,000,000 clmodels batched.

cl::Buffer d_clmodels=cl::Buffer(context, CL_MEM_READ_WRITE, h_clmodels.size()*sizeof(clmodel_t));
cl::Buffer d_clvar=cl::Buffer(context, CL_MEM_READ_ONLY, sizeof(clvar_t));

queue.enqueueWriteBuffer(d_clmodels, CL_TRUE, 0, h_clmodels.size()*sizeof(clmodel_t), &h_clmodels[0]);
queue.enqueueWriteBuffer(d_clvar, CL_TRUE, 0, sizeof(clvar_t), &h_clvar);

cl::Kernel kernel(program_, "compute", &err);

kernel.setArg(0, d_clmodels);
kernel.setArg(1, (unsigned int)h_clmodels.size());
kernel.setArg(2, d_clvar);

cl::NDRange localSize(64);
cl::NDRange globalSize((int)(ceil(h_clmodels.size()/(double)64)*64));

cl::Event event;
queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,
            globalSize,
            localSize,
            NULL,
            &event);
event.wait();

queue.enqueueReadBuffer(d_clmodels, CL_TRUE, 0, h_clmodels.size()*sizeof(clmodel_t), &h_clmodels[0]);

// Loop for next 1,000,000 batch.
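
A side note on the global size in the snippet above: it is just the item count rounded up to a multiple of the work-group size (64). A minimal sketch of the same computation in integer arithmetic, which avoids the round trip through double; `round_up_to_multiple` is a hypothetical helper name, not part of OpenCL:

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of `multiple` that is >= n.
// Assumes multiple > 0 and that n + multiple does not overflow.
constexpr std::size_t round_up_to_multiple(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}
```

With a batch of 1,000,000 items and a local size of 64 nothing is padded, since 1,000,000 is already a multiple of 64. For other batch sizes the padded work-items need a bounds check in the kernel, which is presumably what the count passed as argument 1 is for.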

The additional read-only buffer, “d_clvar”, is the one I want to reuse. It contains a struct of variables that the main host program reads in once and never changes again.

So my question is: how can I create the d_clvar buffer so that I can reuse it across my batched host loops without calling enqueueWriteBuffer, and hence making an expensive device memory copy, every time? Basically, I want to write it to device memory once and use it for each new kernel run.

Have you actually tried running it? Because it should just work™.

Thanks for the response.

The code actually runs. That’s not the issue.

The issue is that I want to reduce unnecessary memory copies to hopefully speed things up. There is actually more than d_clvar being passed; what I pasted is just a sample. I also have a read-only 2D array (which I access in a 1D fashion) that needs to be pushed to device memory but never changes. It doesn’t make sense to copy it for each batch, but I’m doing it this way at the moment as I’m having issues doing otherwise.

If I try to move the buffer creation and copying outside of the loop and pass the buffer to the kernel for each batch, I get memory errors.

Can anyone provide any advice on keeping “objects” in device memory so that they can be accessed by multiple calls of the same kernel (in a loop)? I.e. write-once-read-many :).

I’m fairly new to OpenCL, so I’d appreciate any advice on this matter.

As I said before: this must work. Once you have your copy on the device, you can reuse the memory object for as many kernel calls as you like.
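
To make the control flow concrete, here is a minimal sketch of the write-once-read-many loop. It is not the original poster's code: the OpenCL objects are replaced by a hypothetical `CountingDevice` stand-in that only counts copies, so the structure can be checked without a GPU, and `run_batches` is likewise a made-up name. The comments say where the real calls from the question would go.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the context/queue/kernel in the question.
// It only counts operations; a real version would wrap the OpenCL calls.
struct CountingDevice {
    int readonly_uploads = 0;         // writes of the read-only data (d_clvar)
    int batch_uploads = 0;            // writes of per-batch data (d_clmodels)
    int kernel_launches = 0;
    std::size_t items_processed = 0;

    void upload_readonly() { ++readonly_uploads; }
    void upload_batch(std::size_t /*count*/) { ++batch_uploads; }
    void launch(std::size_t count) { ++kernel_launches; items_processed += count; }
    void download_batch(std::size_t /*count*/) {}
};

// Processes total_items in batches of at most batch_size items.
template <class Device>
void run_batches(Device& dev, std::size_t total_items, std::size_t batch_size) {
    if (batch_size == 0) return;  // avoid an endless loop

    // Once, before the loop. With the question's names this is where
    //   cl::Buffer d_clvar(context, CL_MEM_READ_ONLY, sizeof(clvar_t));
    //   queue.enqueueWriteBuffer(d_clvar, CL_TRUE, 0, sizeof(clvar_t), &h_clvar);
    // would go. d_clvar has to stay in scope for the whole loop.
    dev.upload_readonly();

    for (std::size_t offset = 0; offset < total_items; offset += batch_size) {
        const std::size_t count = std::min(batch_size, total_items - offset);

        // Per batch: write d_clmodels, set the kernel args (passing the same
        // d_clvar each time), enqueue the kernel, then read d_clmodels back.
        dev.upload_batch(count);
        dev.launch(count);
        dev.download_batch(count);
    }
}
```

For example, running 2,500,000 items in batches of 1,000,000 performs three batch uploads and three kernel launches, but only one upload of the read-only data. The same applies to any other read-only input, such as the flattened 2D array mentioned above.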

It doesn’t make sense to copy it for each batch, but I’m doing it this way at the moment as I’m having issues doing otherwise. […] If I try to move the buffer creation and copying outside of the loop and pass the buffer to the kernel for each batch, I get memory errors.

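You should be more explicit: what exactly are the memory errors, and what does the code look like once the buffer is moved outside the loop?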

You’re absolutely right. It does just work. I don’t know what I was doing before!