Low performance of OpenCL application

Hi

I have a question about how to get better performance out of my OpenCL application. The size of the computation is quite big - something like 10 million computations are needed.

I’m not sure if I’m using the OpenCL API correctly, because my GPU application is not any faster than the CPU one. Of course it’s not a rule that the GPU version will be 100x faster than the CPU one, but please take a look at my current approach to the problem:

The problem needs to run a lot of computations, a lot of work-items - something like 10 million.
I set global_work_size to 640,
local_work_size to 320.

After every run of clEnqueueNDRangeKernel() I read the results with clEnqueueReadBuffer() (blocking set to CL_TRUE) to check whether my problem is already solved.

The final performance is still very poor. I haven’t done any measurements, but I can see it’s just not fast enough. If I missed some basic information, just tell me. If code is required for analysis, tell me which part.

PS. I’m computing on an NVIDIA Quadro NVS 140M (laptop)

Whenever you want to improve the performance of a piece of code, the first thing you need to do is measure where the time is being spent. There’s no performance tuning without performance measurement. Your hardware vendor surely has a nice profiling tool that you can use to find out how long each operation in your program takes.
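
If you can’t get a profiler running, OpenCL’s built-in event profiling is a decent substitute. Here is a minimal sketch, assuming your command queue was created with CL_QUEUE_PROFILING_ENABLE (the variable names are made up):

/* Time a single kernel launch with event profiling. */
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_work_size, &local_work_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);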

Generally speaking, you want to minimize the number of synchronization points. By this I mean places where the GPU is waiting for the CPU to do something, or vice versa. Blocking memory reads are one example of a synchronization point.

From what you describe, you are currently running a tiny amount of work (640 work-items), then reading back data, then running some test on the data (“check if my problem is already solved”), then running a small amount of work again.

Why not execute bigger NDRanges? Also, why check if the solution was found so many times? It would probably be cheaper to do the check with a kernel instead of using the CPU.
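
For example, the check could itself be a trivial kernel that raises a flag in global memory. This is just a sketch - I’m inventing the “result” layout here:

// Test "is the problem solved?" on the GPU instead of reading
// everything back to the host. The result layout is made up.
__kernel void check_solved(__global const uchar *result,
                           __global uint *solved)
{
    if (result[get_global_id(0)] != 0)  // whatever "solved" means for your data
        solved[0] = 1;                  // one-way flag; concurrent writes all store 1
}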

Why not execute bigger NDRanges?

To compute its own result, every work-item needs, among other things, three arrays: two of them roughly ~1500 bytes each (the size is not fixed) and one of 256 bytes (this array is always 256 bytes long). So, to provide space for global_work_size work-items, I allocate these arrays in advance with clCreateBuffer() and pass the pointers as kernel arguments. Every array is global_work_size times bigger than it would need to be for one computation, because global_work_size computations are performed “at the same time”.

Inside the kernel I compute an offset for every work-item so that it has its own independent space (the 3 arrays I mentioned) where temporary results (needed to compute the final result) can be stored.
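
Roughly like this (a simplified sketch - MAX_LEN and the buffer names stand in for my real ones):

// Each work-item takes its own slice of the big buffers that were
// allocated with clCreateBuffer() on the host.
__kernel void compute(__global uchar *keystreams,  // global_work_size * MAX_LEN bytes
                      __global uchar *plaintexts,  // global_work_size * MAX_LEN bytes
                      __global uchar *sboxes)      // global_work_size * 256 bytes
{
    size_t gid = get_global_id(0);
    __global uchar *keystream = keystreams + gid * MAX_LEN;
    __global uchar *plaintext = plaintexts + gid * MAX_LEN;
    __global uchar *S         = sboxes     + gid * 256;
    // ... temporary results for this work-item go into these slices ...
}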

There is also a little piece of code in the kernel which checks whether the problem was solved. If it was, some values are copied to a small array which was also created with clCreateBuffer() and passed as a kernel argument. After running clEnqueueNDRangeKernel() I read this small array to check if something is in it (if this array is filled with data, that is the final result I’m looking for).

Of course I can also provide simplified code, so it will be easier to spot any bigger coding mistakes and technical issues with memory management.

It would be great if there were any way to improve my approach, because right now it doesn’t perform so well and I’m probably not taking any advantage of running my code on the GPU.

I insist that the first thing you need to do is run a performance profiler. Anything I say will just be guessing based on what you’ve described.

To compute its own result, every work-item needs, among other things, three arrays: two of them roughly ~1500 bytes each (the size is not fixed) and one of 256 bytes (this array is always 256 bytes long).

So each work-item needs roughly 3.3 KB of scratch data. Let’s say that we want to continue with this simple approach. Let’s also say that you allocate a 256 MB buffer for this purpose. This means you could run NDRanges with up to about 80,000 work-items in them (256 MB / 3.3 KB ≈ 80,000).

After running clEnqueueNDRangeKernel() I read this small array to check if something is in it (if this array is filled with data, that is the final result I’m looking for).

Why not read that small array inside another kernel? What you want to avoid is this:

  1. Enqueue small NDRange.
  2. Blocking read.
  3. If solution not found, go to step 1.

Instead, you can do something like this:

  1. Initialize an integer in global memory with the value zero. Let’s call this the “found solution” variable. When your kernel has found a solution, it will set that variable to one.
  2. Enqueue a large NDRange that will run the main kernel. First of all, this kernel checks whether the “found solution” variable is zero. If it’s not zero, all work-items return immediately.
  3. Repeat step 2 a number of times. You could do step 2 only once, but we have to limit the size of each NDRange due to the memory requirements mentioned above.
  4. Read the small buffer with the “found solution” variable and stop if it’s not zero.

I’m just guessing here how your algorithm works. Hopefully this will be close enough to give you some ideas.
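
In host code the whole thing could look roughly like this. This is only a sketch: Queue, Kernel, SolvedBuffer and NUM_BATCHES are placeholder names, I’m assuming an in-order queue, and I’m assuming argument 0 of the kernel tells each batch where its part of the search space starts:

/* Many large NDRanges, ONE blocking read at the very end. */
cl_uint zero = 0;
clEnqueueWriteBuffer(Queue, SolvedBuffer, CL_FALSE, 0,
                     sizeof(cl_uint), &zero, 0, NULL, NULL);

size_t global = 80000;  /* roughly what the 256 MB buffer allows */
for (cl_uint batch = 0; batch < NUM_BATCHES; batch++) {
    cl_uint base = batch * (cl_uint)global;  /* where this batch starts */
    clSetKernelArg(Kernel, 0, sizeof(base), &base);
    /* the kernel returns immediately once the flag is already set */
    clEnqueueNDRangeKernel(Queue, Kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
}

cl_uint solved = 0;
clEnqueueReadBuffer(Queue, SolvedBuffer, CL_TRUE, 0,  /* the single blocking read */
                    sizeof(cl_uint), &solved, 0, NULL, NULL);
if (solved) { /* read the small result buffer */ }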

I changed my kernel and now the arrays are in private memory. The complexity of the kernel decreased and the value given by clGetKernelWorkGroupInfo() jumped from 320 to 448 - so I think that’s good.
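
For reference, this is (more or less) the query I’m using:

// Ask how many work-items per work-group the compiled kernel supports.
size_t wgSize = 0;
clGetKernelWorkGroupInfo(Kernel, Device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(wgSize), &wgSize, NULL);
printf("CL_KERNEL_WORK_GROUP_SIZE: %u\n", (unsigned)wgSize);  // 448 now, was 320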

Once again info about my GPU:

OpenCL SW Info:

 CL_PLATFORM_NAME:      NVIDIA CUDA
 CL_PLATFORM_VERSION:   OpenCL 1.0 CUDA 4.0.1
 OpenCL SDK Revision:   7027912


OpenCL Device Info:

 1 devices found supporting OpenCL:

 ---------------------------------
 Device Quadro NVS 140M
 ---------------------------------
  CL_DEVICE_NAME:                       Quadro NVS 140M
  CL_DEVICE_VENDOR:                     NVIDIA Corporation
  CL_DRIVER_VERSION:                    275.33
  CL_DEVICE_VERSION:                    OpenCL 1.0 CUDA
  CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU
  CL_DEVICE_MAX_COMPUTE_UNITS:          2
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3
  CL_DEVICE_MAX_WORK_ITEM_SIZES:        512 / 512 / 64
  CL_DEVICE_MAX_WORK_GROUP_SIZE:        512
  CL_DEVICE_MAX_CLOCK_FREQUENCY:        800 MHz
  CL_DEVICE_ADDRESS_BITS:               32
  CL_DEVICE_MAX_MEM_ALLOC_SIZE:         128 MByte
  CL_DEVICE_GLOBAL_MEM_SIZE:            82 MByte
  CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no
  CL_DEVICE_LOCAL_MEM_TYPE:             local
  CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte
  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte
  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE
  CL_DEVICE_IMAGE_SUPPORT:              1
  CL_DEVICE_MAX_READ_IMAGE_ARGS:        128
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8
  CL_DEVICE_SINGLE_FP_CONFIG:           INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

  CL_DEVICE_IMAGE <dim>                 2D_MAX_WIDTH     4096
                                        2D_MAX_HEIGHT    32768
                                        3D_MAX_WIDTH     2048
                                        3D_MAX_HEIGHT    2048
                                        3D_MAX_DEPTH     2048

  CL_DEVICE_EXTENSIONS:                 cl_khr_byte_addressable_store
                                        cl_khr_icd
                                        cl_khr_gl_sharing
                                        cl_nv_d3d9_sharing
                                        cl_nv_d3d10_sharing
                                        cl_khr_d3d10_sharing
                                        cl_nv_d3d11_sharing
                                        cl_nv_compiler_options
                                        cl_nv_device_attribute_query
                                        cl_nv_pragma_unroll
                                        cl_khr_global_int32_base_atomics
                                        cl_khr_global_int32_extended_atomics


  CL_DEVICE_COMPUTE_CAPABILITY_NV:      1.1
  NUMBER OF MULTIPROCESSORS:            2
  NUMBER OF CUDA CORES:                 16
  CL_DEVICE_REGISTERS_PER_BLOCK_NV:     8192
  CL_DEVICE_WARP_SIZE_NV:               32
  CL_DEVICE_GPU_OVERLAP_NV:             CL_TRUE
  CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:     CL_TRUE
  CL_DEVICE_INTEGRATED_MEMORY_NV:       CL_FALSE
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0

And what I need in my kernel:
S[256]
keystream[1512]
plaintext[1512] (the value 1512 can be a little bit different - it depends on the WiFi packet length)

Here is my kernel code:

// Note: N, IV_SIZE and CRC_SIZE, as well as the helpers SwapElements()
// and Crc32(), are defined elsewhere in my program source. Typical
// values are shown here so the snippet is self-contained.
#define N        256   // RC4 state size
#define IV_SIZE  3     // WEP IV length
#define CRC_SIZE 4     // CRC-32 checksum length

__kernel void crash(int messageLength,
                    int rootKeySize,
                    __global uchar *iv,
                    __global uchar *cipher,
                    __global uchar *passwordfound,
                    __global uint *solved)
{
    // Early exit: another work-item has already found the key.
    if (solved[0] == 1) return;

    // Build this work-item's candidate key from its global id
    // (base-25 decomposition added to a base character).
    int length = 5;
    char secretKey[5];
    for (int k = 0; k < length; k++) {
        secretKey[k] = 'a';  // base character ('ch[0]' in my full code)
    }

    int tmpi = get_global_id(0);
    int rest;
    for (int j = 0; j < length; j++) {
        rest = tmpi % 25;
        tmpi = tmpi / 25;
        secretKey[j] = secretKey[j] + rest;
    }

    // RC4-KSA: scramble the state with IV + secret key.
    uchar S[256];
    int i, j;
    for (i = 0; i < N; i++) {
        S[i] = i;
    }
    j = 0;
    for (i = 0; i < N; i++) {
        j = (j + S[i] + ((i % rootKeySize < IV_SIZE)
                             ? iv[i % rootKeySize]
                             : secretKey[i % rootKeySize - IV_SIZE])) % N;
        SwapElements(S, i, j);
    }

    // RC4-PRGA: generate the keystream.
    uchar keystream[1512];
    i = 0;
    j = 0;
    for (int repetition = 0; repetition < messageLength; repetition++) {
        i = (i + 1) % N;
        j = (j + S[i]) % N;
        SwapElements(S, i, j);
        keystream[repetition] = S[(S[i] + S[j]) % N];
    }

    // Compute plaintext from cipher.
    uchar plaintext[1512];
    for (i = 0; i < messageLength; i++) {
        plaintext[i] = cipher[i] ^ keystream[i];
    }

    // The key is valid if the CRC-32 of the decrypted body matches the
    // checksum stored at the end of the packet.
    bool keyValid = true;
    uchar hash[4];
    Crc32(plaintext, 0, messageLength - CRC_SIZE, hash);
    for (int i = 0; i < CRC_SIZE; i++) {
        keyValid &= hash[i] == plaintext[messageLength - 1 - i];
    }
    if (keyValid) {
        solved[0] = 1;
        for (int z = 0; z < length; z++) passwordfound[z] = secretKey[z];
    }
}

And now, could you please explain how I can manage to do my computations without a blocking read? This part is still not clear to me.

And now the host code.

Buffers:


// cipherBody and iv arrays were somewhere initialized and filled with data.  
CipherBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(uchar)*(size - HEADER_SIZE), cipherBody, &errcode);
assert(errcode==CL_SUCCESS);

IVBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(uchar)*IV_SIZE, iv, &errcode);
assert(errcode==CL_SUCCESS);

// Allocate output memory on GPU (password is a final result)
PassBuffer = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(uchar)*PASSLENGTH, NULL, &errcode);
assert(errcode==CL_SUCCESS);

// Solved - indicator you suggested
uint *solved = new uint[1];
solved[0]=0;

cl_mem solvedBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(uint), solved, &errcode);
assert(errcode==CL_SUCCESS);

Setting arguments for kernel:

clSetKernelArg(OpenCLVectorAdd, 0, sizeof(int), &siz); 
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(int), &ROOTKEYLENGTH); 
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&IVBuffer); 
clSetKernelArg(OpenCLVectorAdd, 3, sizeof(cl_mem), (void*)&CipherBuffer); 
clSetKernelArg(OpenCLVectorAdd, 4, sizeof(cl_mem), (void*)&PassBuffer);  
clSetKernelArg(OpenCLVectorAdd, 5, sizeof(cl_mem), (void*)&solvedBuffer);

Running the kernel while avoiding the blocking read - how? The work-size parameter should now be a really big size_t value. 1512 + 1512 + 256 + (some other small arrays and variables, something like 30 bytes) = approx. 3.3 KB per work-item, let’s say. How much memory can I allocate? 82 MB (CL_DEVICE_GLOBAL_MEM_SIZE)?

I thought you couldn’t put that data in private memory and that was why you were using global memory. I must have been thinking of somebody else.

Running the kernel while avoiding the blocking read - how? The work-size parameter should now be a really big size_t value. 1512 + 1512 + 256 + (some other small arrays and variables, something like 30 bytes) = approx. 3.3 KB per work-item, let’s say. How much memory can I allocate? 82 MB (CL_DEVICE_GLOBAL_MEM_SIZE)?

Sorry, I don’t understand. You seem to be saying several different things at once. Since you are now using private memory for all those temporary variables, you should be able to use very large global work sizes, independently of how much global memory you have – because private memory is managed internally by the OpenCL driver.

BTW, I hope I’m not breaking any local laws by helping you to write a wifi password cracker :?

:smiley: No, you’re not breaking any laws ;p Or am I? ;p The purpose of the code is to learn OpenCL :slight_smile:

I successfully ran the kernel with global_work_size=20000. To run all my computations I executed clEnqueueNDRangeKernel() many times in a loop, and after the loop I placed clEnqueueReadBuffer() with blocking set to CL_TRUE, like you suggested in some other thread. The result was calculated faster than on the CPU.

I don’t know how big the improvement can be in comparison to the CPU version, but running my application on a laptop with a better graphics card than mine gives even better results, so it’s a good indicator.

I forgot to add that setting a bigger global_work_size (like 30k or 40k) causes my GPU to reset, or both screens (the laptop screen and the external screen) go dark and I have to restart my computer.