Different execution times in OpenCL programs

mustang · February 24, 2012, 12:07pm

Hi again,
I have written a simple sparse matrix-vector product, and it works, but the first execution takes much longer than the following ones. For example, with a dimension of 4993 the first execution takes 500 milliseconds and the following ones take between 70 and 90 milliseconds!! By the way, I measure time with gettimeofday(), starting after the matrix and vector have been read from a file. The idea is to compare this version to a CPU version; both versions read the data from a file in the same way, so that time is irrelevant.
One more thing: these results are from a 1201n netbook with a 9400M GPU, but the code was also tested on a much better machine and it showed a big difference there too.
I guess this may be caused by my code, but I'm not sure. Any help would be appreciated!!
Thanks in advance!!

Pablo

notzed · February 24, 2012, 1:41pm

This is pretty much expected behaviour for any micro-benchmark on any platform. The details might change but in general the first run may incur (significant) additional overheads.

Which is why, unless you’re interested in the startup time itself, you should always do a few dummy runs first to let the system warm up, and then do more than one run of the test too (to average over/hide system interference, although this is less important now that computers are so fast/multi-cored). This way you only measure the steady-state performance, which is more in line with how a real application will end up using it.

The first time might be slower because of operating system or driver allocation of resources, which don’t need to occur on subsequent runs. These all of course depend on the os and the drivers themselves.

mustang · February 24, 2012, 4:10pm

Thanks!! So I guess the real time is not the first one but the following ones!! That's good! But what if I had a program that takes a lot of time? Should I execute a smaller program before executing the main program? Or in that case does the lost time not grow, and therefore not affect the final result as much as it does in this case?
Another question: I'm trying to implement a conjugate gradient method with a sparse matrix, but so far the CPU version is much faster than the GPU version (on the 1201n, CPU: approx. 2 secs and GPU: approx. 11 secs) with dimension 4993 and approx. 34000 non-zero items.
And I have, for example, the kernel that does the matrix-vector multiplication called a few times in the same loop:

/* some code */
for (int k = 0; k < dim - 4000; k++) {
    // rho = r*r;

    /* some code */

    // p = r + beta*p
    // betap = beta*p

    // clear all dim elements of betap
    for (i = 0; i < dim; i++) {
        betap[i] = 0;
    }
    // buffer sizes are in bytes: dim floats for a vector, one float for the scalar
    p_mem_obj = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
            dim * sizeof(float), p, &ret);
    betap_mem_obj = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
            dim * sizeof(float), betap, &ret);
    beta_mem_obj = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
            sizeof(float), beta, &ret);

    ret = clSetKernelArg(kernel4, 0, sizeof(cl_mem), (void *)&p_mem_obj);
    ret = clSetKernelArg(kernel4, 1, sizeof(cl_mem), (void *)&betap_mem_obj);
    ret = clSetKernelArg(kernel4, 2, sizeof(cl_mem), (void *)&beta_mem_obj);
    // assumes the kernel takes dim as a plain int, so the size is sizeof(int)
    ret = clSetKernelArg(kernel4, 3, sizeof(int), (void *)&dim);

    global_item_size[0] = cantGlobalItems;
    local_item_size[0] = cantLocalItems;

    ret = clEnqueueNDRangeKernel(command_queue, kernel4, 1, NULL,
            global_item_size, local_item_size, 0, NULL, NULL);

    /* some code */

    // x = x + alfa*p
    // alfap = alfa*p
    for (i = 0; i < dim; i++) {
        alfap[i] = 0;
    }
    alfap_mem_obj = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
            dim * sizeof(float), alfap, &ret);
    alfa_mem_obj = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
            sizeof(float), alfa, &ret);

    ret = clSetKernelArg(kernel4, 0, sizeof(cl_mem), (void *)&p_mem_obj);
    ret = clSetKernelArg(kernel4, 1, sizeof(cl_mem), (void *)&alfap_mem_obj);
    ret = clSetKernelArg(kernel4, 2, sizeof(cl_mem), (void *)&alfa_mem_obj);
    ret = clSetKernelArg(kernel4, 3, sizeof(int), (void *)&dim);

    global_item_size[0] = cantGlobalItems;
    local_item_size[0] = cantLocalItems;

    ret = clEnqueueNDRangeKernel(command_queue, kernel4, 1, NULL,
            global_item_size, local_item_size, 0, NULL, NULL);

    /* some code */

}

/* some code */

So, as it is the same kernel which is called for the different cases, I have to call clSetKernelArg inside the loop, and I guess that may be very bad for performance. Should I create two different kernels that do the same thing but with different names, so I can set the arguments outside the loop, or does that not take too much time?
Sorry for asking so many things but I'm still a bit confused!!
Thanks again!!

Pablo

notzed · February 24, 2012, 11:56pm

Well, it’s anybody’s guess, but an educated opinion would be that the overhead is basically fixed, and so would be a smaller proportion of the total in the end.

If you’re timing overall execution, e.g. a single execution of a programme, the startup time can’t be excluded of course.

I was just talking about micro-benchmarks, i.e. testing individual routines, or the case where you have a long-running GUI application or server which will end up running the same problem many times during its lifetime. In these cases the start-up time may not be an issue at all.

clSetKernelArg is no issue for performance; all it’s doing is setting a simple (pointer) value into a table (probably an array). When you enqueue the task the values in the table are copied to the work queue, but even that isn’t very expensive. You could save re-setting the values which don’t change, but in reality you won’t even be able to measure the difference.

But you definitely shouldn’t be creating any buffers inside loops: creating buffers is going to be expensive if you’re doing it thousands of times. Create them once outside the loop, and then just use them as required. You then need to use clEnqueueWriteBuffer if you want to initialise them from the CPU side … but unless the initialisation requires a whole lot of I/O, you should do the initialisation on the GPU side as well, along with any processing of the intermediate values. Any loop like this should avoid any CPU synchronisation at all, if possible.

I use the analogy that the work queue is like a shopping list; even if you live right next door to a shop in walking distance: if you write down one item at a time, go buy the one item, return, write down the next item, etc, it’ll be orders of magnitude slower than just filling a basket with a list of all items and only going to the shop once.

I can’t tell from your code paste if you’re reading results back on the cpu in the ‘some code’ sections, but if you are it’s definitely the undesirable scenario I list above. It could be hundreds of times slower than the hardware is capable of.

mustang · February 25, 2012, 6:29am

Thanks for answering!!!
I guess I'm forced to read some results back to the CPU, because I have to do, for example, a few scalar divisions (a float divided by another float, giving a float) which are then used to perform other operations.
The only thing I managed to do outside the loop is declare cl_mem p_mem_obj;, for example. I tried to put some of the clSetKernelArg calls outside the loop but it never worked. I guess there are two reasons. One: the operations that are done on the CPU have to be “re-enqueued”, as they are not done on the cl_mem but directly on the variables.
Two: I use the same kernel in the same loop for different values. For example, I sum vectors r and p and then x and p (the names are a bit different), so until I enqueue the first I cannot set the arguments of the second because, as it is the same kernel, the arguments would be overwritten.

Pablo

notzed · February 25, 2012, 8:20pm

A GPU is fine at scalar operations too; in fact, you should be doing all calculations on the GPU if possible.

But I can’t really suggest more without seeing the code.

Obviously if the arguments change you need to keep those in the loop. And I don’t understand why you’re even worrying about setting arguments: setting arguments is fast.

mustang · February 26, 2012, 5:27am

Thanks for the reply!!!
I'm worried about setting arguments and creating the buffers inside the loop because I implemented the conjugate gradient method with a sparse matrix on both GPU and CPU, and the CPU is much faster!! Even with a 19713x19713 matrix, although it has approx. 130000 non-zero values. As I had read about the enormous differences in favour of the GPU, I thought it would be faster than the CPU at these sizes!! And I was trying to improve my code everywhere I can, at least on the host side. Anyway, I managed to move all the clSetKernelArg and clCreateBuffer calls outside of the loop and I didn't get much better performance; what is worse is that in some cases it even got slower!!!
I know there is room for a lot of improvement, but my knowledge of OpenCL is too limited to know where I should use local memory, for example. Searching the web I found different strategies (not about the use of memory), for example computing a*x+y in one kernel, with a being a scalar and x and y vectors, but it didn't improve the performance, at least not much!!

mustang · March 1, 2012, 4:47pm

Hi,
I know this is off topic, but I prefer not to open another thread. My question is whether it is possible that I can't handle a 19713x19713 matrix (it is not sparse). It throws a segmentation fault (“violación de segmento” in Spanish; I don't really know how to translate it). By the way, I have a CPU version that loads the matrix in the same way and it works!!

Pablo

notzed · March 1, 2012, 7:27pm

Although it could be an opencl implementation bug, it’s more likely to be code bugs: e.g. going off the end of the array, or not checking every return value for errors.

19713 * 19713 * sizeof(float) = about 1.45 GiB (1.55 GB) of memory, so you probably failed to allocate some buffer somewhere. Even if your GPU has more memory than that, you may not be able to allocate it in one block, or some might be used by the system.

mustang · March 2, 2012, 5:40am

Thanks. I guess it is not that I forgot to allocate a buffer, because I have used the same program with smaller matrices and it seemed to work fine!! And I have centralized the places where I change the value of the matrix dimension, so there is not too much risk of that. But I don't know what the maximum buffer size on NVIDIA is; I'll try to find out!!

Pablo