Just seeking some advice for an implementation where I'm batching kernel runs on a device, and there are particular read-only buffers I would like to reuse across those runs.
Each batch consists of 1,000,000 kernel threads. I need to batch because I process an array of structs where each struct holds the values that one kernel thread writes to. Without batching, the full array would require 39 GB of device memory.
So in my host loop I build an array of structs "h_clmodels" with 1,000,000 items and launch the kernel like this:
// 1,000,000 clmodels per batch.
cl::Buffer d_clmodels(context, CL_MEM_READ_WRITE, h_clmodels.size() * sizeof(clmodel_t));
cl::Buffer d_clvar(context, CL_MEM_READ_ONLY, sizeof(clvar_t));

queue.enqueueWriteBuffer(d_clmodels, CL_TRUE, 0, h_clmodels.size() * sizeof(clmodel_t), h_clmodels.data());
queue.enqueueWriteBuffer(d_clvar, CL_TRUE, 0, sizeof(clvar_t), &h_clvar);

cl::Kernel kernel(program_, "compute", &err);
kernel.setArg(0, d_clmodels);
kernel.setArg(1, (unsigned int)h_clmodels.size());
kernel.setArg(2, d_clvar);

cl::NDRange localSize(64);
// Round the global size up to a multiple of the work-group size.
cl::NDRange globalSize((size_t)(ceil(h_clmodels.size() / 64.0) * 64));

cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize, NULL, &event);
event.wait();

queue.enqueueReadBuffer(d_clmodels, CL_TRUE, 0, h_clmodels.size() * sizeof(clmodel_t), h_clmodels.data());
// Loop for the next 1,000,000-item batch.
That additional read-only buffer, "d_clvar", is the one I want to reuse. It contains a struct of variables that the main host program reads in once and never changes again.
So my question: how can I create the d_clvar buffer so that I can reuse it across the batched host loops without calling enqueueWriteBuffer, and hence paying for an expensive host-to-device copy, on every iteration? Basically, I want to write it into device memory once and use it for each new kernel run.
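Conceptually, the structure I'm after is something like the following sketch (reusing the names from the code above; the loop body is abbreviated). Since OpenCL buffer objects persist until they are released, I'm assuming the one-time copy via CL_MEM_COPY_HOST_PTR plus a single setArg would be enough:

```cpp
// Sketch: create and fill the read-only buffer ONCE, before the batch loop.
// CL_MEM_COPY_HOST_PTR copies h_clvar into the buffer at creation time,
// so no separate enqueueWriteBuffer is needed for it.
cl::Buffer d_clvar(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                   sizeof(clvar_t), &h_clvar);

cl::Kernel kernel(program_, "compute", &err);
kernel.setArg(2, d_clvar);  // bind once; kernel argument bindings persist

for (/* each batch of 1,000,000 models */;;) {
    // ... rebuild h_clmodels, write d_clmodels, set args 0 and 1,
    //     enqueue the kernel, read the results back ...
    // d_clvar is never written again; each run just reads it.
}
```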