bandwidth test

I want to test my OpenCL memory bandwidth.
I work on a nVidia gt280, so my kernel should be able to write or read global memory at a maximum of 118 GB/s.

  • I tried with the simplest kernel:
__kernel void main(__global float * array)
{
   array[ get_global_id(0) ] = 123.321f;
}
  • I work on an array of 16 777 216 floats, with a non-host-memory buffer.
    - Each thread writes one float, so each warp (32 threads) writes 128 bytes (that’s the best case for nVidia GPUs of compute capability 1.3).
  • I use the OpenCL profiler.

=> My kernel executes in 0.955 ms, so the bandwidth is 65.462 GB/s.

What is the problem with my experiment? :?

In fact it’s each half-warp that should access a coalesced 128-byte segment.
So here each half-warp uses a 64-byte block. But that should not be a problem.

Maybe it’s only 118 GB/s going in both directions, and in only one direction it’s half that?

The bandwidth-testing example in the NVidia OpenCL best practices guide goes in both directions. http://www.nvidia.com/content/cudazone/ … sGuide.pdf

Try that to see what you get.

In fact I use a gtx275 (the gt280 has 240 GB/s of bandwidth).

You’re right, with this sample I get 99 GB/s of bandwidth:

...
cl::PlatformList           platforms;
cl::DeviceList             devices;
cl::Device &               device       (cl::GetPlatformList(&platforms).front().GetDeviceList(cl::ALL_DEVICE_TYPE, &devices).front());
cl::Context                context      (device);
cl::CommandQueue           commandQueue (context, device, cl::CommandQueue::IN_ORDER_EXECUTION, cl::CommandQueue::ENABLE_PROFILING);
cl::Program const          program      (context, cl::util::GetFileSource("sample4.cl"));
cl::Kernel                 kernel       (program, "main");
cl::Buffer                 buffer1      (context, cl::Mem::WRITE_ONLY, ARRAY_SIZE * sizeof(float));
cl::Buffer                 buffer2      (context, cl::Mem::READ_ONLY,  ARRAY_SIZE * sizeof(float));
cl::Kernel::Args const     args         (buffer1, buffer2);
cl::Kernel::WorkSize const workSize     ( glm::size3(ARRAY_SIZE, 1, 1) );

cl::Event event = kernel.EnqueueNDRange(commandQueue, args, workSize);

event.Wait();

std::cout << cl::util::BandWidth(2 * ARRAY_SIZE * sizeof(float), cl::Event::ProfilingInformation(event)) << std::endl;
...

__kernel
void main(__global float * array1, __global float const * array2)
{
	array1[ get_global_id(0) ] = array2[ get_global_id(0) ];
}

ps: I changed my API a bit due to our discussion about the problem with returns :smiley:

I don’t know if we can say the gtx275 has a 118 GB/s bandwidth… :?
Perhaps it’s an OpenCL implementation limitation? Has anyone tested the same sample with CUDA?

Two suggestions:

  • Run your kernel once to warm up the card, then average your results over a few dozen or a few hundred runs. You can easily get very strange results from both the first kernel execution and from any single execution.
  • Use the vload commands to load a larger chunk of data at once. On some architectures this can make a big difference, on others it may not.

Also, the Nvidia OpenCL drivers are apparently not as mature as the CUDA ones (not surprising given that they are far newer) so you probably won’t get the same performance in some areas.