
Thread: Are there any OpenCL tools to deal with multiple different GPUs?

  1. #1
    Junior Member
    Join Date
    Nov 2018
    Posts
    6

    Are there any OpenCL tools to deal with multiple different GPUs?

    Otherwise I have to judge the capabilities of every GPU and send a different amount of work to each of them for one huge computation.

    That seems like an awful task.

    Are there any OpenCL tools that can help me use all OpenCL devices to complete a huge computing job?

    Or any suggestions?

    Thanks in advance.

  2. #2
    Senior Member
    Join Date
    Dec 2011
    Posts
    258
    Divide your work up into smaller bites and feed them to the GPUs at the rate they can eat them. In more detail: use OpenCL events with each clEnqueueNDRangeKernel call. Enqueue 3 jobs to each GPU. As jobs finish (detect this using the events), queue up another to that GPU. The faster GPUs will get through more jobs; the slower ones will get through fewer.
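    For example, a rough host-side sketch of that pattern (untested; one context and one in-order command queue per GPU are assumed to exist already, and enqueue_job() / device_slot / JOBS_IN_FLIGHT are just names made up for this sketch, not part of the OpenCL API):

        /* Keep three jobs in flight per GPU, each tracked by a cl_event.
         * A separate cl_kernel object per device is assumed, so setting
         * arguments for one device doesn't race with launches on another. */
        #include <CL/cl.h>

        #define JOBS_IN_FLIGHT 3

        typedef struct {
            cl_command_queue queue;                    /* one queue per device */
            cl_event         inflight[JOBS_IN_FLIGHT]; /* events of queued jobs */
        } device_slot;

        /* Hypothetical helper: bind chunk number `job` as a kernel argument
         * (the data buffers are assumed to be set elsewhere) and launch it. */
        static cl_event enqueue_job(cl_command_queue q, cl_kernel k, cl_uint job)
        {
            size_t global = 1024 * 1024;   /* work-items per chunk (example size) */
            cl_event evt;
            clSetKernelArg(k, 0, sizeof(cl_uint), &job);
            clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &evt);
            clFlush(q);                    /* make sure the job is submitted */
            return evt;
        }

        /* Prime every device with JOBS_IN_FLIGHT jobs so no queue starts empty;
         * returns the index of the next chunk that still needs scheduling. */
        static cl_uint prime_devices(device_slot *dev, cl_uint num_devices,
                                     cl_kernel *kernels, cl_uint next_job)
        {
            for (cl_uint d = 0; d < num_devices; ++d)
                for (int i = 0; i < JOBS_IN_FLIGHT; ++i)
                    dev[d].inflight[i] = enqueue_job(dev[d].queue, kernels[d], next_job++);
            return next_job;
        }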

  3. #3
    Junior Member
    Join Date
    Nov 2018
    Posts
    6
    Thank you, Dithermaster.
    If I divide my work into small pieces, will it take longer than running it as one big job?
    I can't always divide my data block into pieces, though I can always split the computation.
    BTW, I think what you told me is to send a new job to a GPU just after one of its three jobs finishes. Is that right?
    Yours, Chen.

  4. #4
    Senior Member
    Join Date
    Dec 2011
    Posts
    258
    As long as the jobs aren't too small, it won't run slower. Jobs are subdivided by the runtime into what the hardware can execute at once, so beyond that size larger jobs run at the same speed (ignoring the inefficiency of the last bits of a job not filling the device). The reason I had you queue up three to each device is so the work queue never goes empty. An empty work queue means an idle device, which means you're not going as fast as you could. So, yes, after the first job finishes, queue up another. Then when the second job finishes, queue up another. Keep the pipeline moving.
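    Continuing the sketch from post #2 (same made-up device_slot, enqueue_job() and JOBS_IN_FLIGHT), the refill loop could poll each in-flight event without blocking and hand the next chunk to whichever device has a free slot, so a slow GPU never holds up a fast one:

        /* Continues the sketch from post #2. Keeps every device's pipeline
         * full until all chunks are handed out, then drains what is left in
         * flight. A real implementation would sleep or use event callbacks
         * instead of spinning in a busy loop. */
        static void feed_until_done(device_slot *dev, cl_uint num_devices,
                                    cl_kernel *kernels,
                                    cl_uint next_job, cl_uint total_jobs)
        {
            while (next_job < total_jobs) {
                for (cl_uint d = 0; d < num_devices; ++d) {
                    for (int i = 0; i < JOBS_IN_FLIGHT; ++i) {
                        cl_int status;
                        clGetEventInfo(dev[d].inflight[i],
                                       CL_EVENT_COMMAND_EXECUTION_STATUS,
                                       sizeof(status), &status, NULL);
                        if (status == CL_COMPLETE && next_job < total_jobs) {
                            /* this slot finished: release its event and refill it */
                            clReleaseEvent(dev[d].inflight[i]);
                            dev[d].inflight[i] =
                                enqueue_job(dev[d].queue, kernels[d], next_job++);
                        }
                    }
                }
            }
            for (cl_uint d = 0; d < num_devices; ++d)
                clWaitForEvents(JOBS_IN_FLIGHT, dev[d].inflight);
        }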

  5. #5
    Junior Member
    Join Date
    Nov 2018
    Posts
    6
    Thank you so much, Dithermaster.
    I have tested what you said, and it all works!
    During my testing, I thought of some problems:
    1) When is the data object really transferred between main memory and the OpenCL devices?
    At clCreateBuffer? clSetKernelArg? Or clEnqueueNDRangeKernel?
    2) How do I deal with the GPU that is used by the OS? If it is overloaded, the OS likes to stop you!
    Is there any way to find out which GPU is used by the OS? I don't want to touch that one.
    3) If I have a two-layer loop, and the outer layer is big enough to parallelize, which one is better:
    a) make a one-dimensional clEnqueueNDRangeKernel, and handle the inner layer with a for statement inside the kernel, or
    b) just make a two-dimensional clEnqueueNDRangeKernel?
    I was told that for, if, and while statements will badly slow down a kernel, but in my case the loop lets me avoid the synchronization problem for the sum.
    I am sorry for so many questions.
    Thanks again and again.

  6. #6
    Senior Member
    Join Date
    Dec 2011
    Posts
    258
    > 1) When is the data object really transferred between main memory and the OpenCL devices?

    During clEnqueueRead/Write or clEnqueueMap/Unmap operations.
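    A minimal sketch of where the copies actually happen (the context, queue, and kernel are assumed to exist already; the buffer names and sizes are made up for illustration):

        /* Data movement is explicit: clEnqueueWriteBuffer copies host -> device,
         * clEnqueueReadBuffer copies device -> host. Creating the buffer and
         * setting it as a kernel argument do not, by themselves, move the data. */
        #include <CL/cl.h>

        void run_chunk(cl_context ctx, cl_command_queue q, cl_kernel k,
                       const float *host_in, float *host_out, size_t n)
        {
            cl_int err;
            cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
            cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

            /* host -> device copy happens here */
            clEnqueueWriteBuffer(q, in, CL_FALSE, 0, n * sizeof(float), host_in, 0, NULL, NULL);

            clSetKernelArg(k, 0, sizeof(cl_mem), &in);   /* just passes the handle */
            clSetKernelArg(k, 1, sizeof(cl_mem), &out);

            size_t global = n;
            clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

            /* device -> host copy happens here; the blocking read also waits
             * for the kernel to finish on this in-order queue */
            clEnqueueReadBuffer(q, out, CL_TRUE, 0, n * sizeof(float), host_out, 0, NULL, NULL);

            clReleaseMemObject(in);
            clReleaseMemObject(out);
        }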

    > 2) How do I deal with the GPU that is used by the OS? If it is overloaded, the OS likes to stop you!
    > Is there any way to find out which GPU is used by the OS? I don't want to touch that one.

    Use OpenGL's "get" queries to identify the GPU being used by the OS. It's not perfect, but it often works.

    > 3) If I have a two-layer loop, and the outer layer is big enough to parallelize, which one is better:
    > a) make a one-dimensional clEnqueueNDRangeKernel, and handle the inner layer with a for statement inside the kernel, or
    > b) just make a two-dimensional clEnqueueNDRangeKernel?
    > I was told that for, if, and while statements will badly slow down a kernel, but in my case the loop lets me avoid the synchronization problem for the sum.

    Implement both, try them, and measure the results.
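    For reference, the two variants could look roughly like this in OpenCL C (the kernel names and the sum-of-squares computation are just examples, not your actual code):

        // Variant (a): 1-D NDRange. One work-item per row; the inner dimension
        // is a plain for loop, and each work-item keeps its own private
        // accumulator, so no synchronization is needed for the per-row sum.
        __kernel void row_sum_of_squares_1d(__global const float *data,
                                            __global float *row_sums,
                                            const uint cols)
        {
            size_t row = get_global_id(0);
            float acc = 0.0f;
            for (uint j = 0; j < cols; ++j) {
                float x = data[row * cols + j];
                acc += x * x;
            }
            row_sums[row] = acc;
        }

        // Variant (b): 2-D NDRange. One work-item per (row, col) element; each
        // one writes its own contribution, and a separate reduction pass (or
        // atomics) is then needed to combine the contributions per row.
        __kernel void element_squares_2d(__global const float *data,
                                         __global float *contrib,
                                         const uint cols)
        {
            size_t row = get_global_id(0);
            size_t col = get_global_id(1);
            float x = data[row * cols + col];
            contrib[row * cols + col] = x * x;
        }

    Variant (a) is launched with a 1-D clEnqueueNDRangeKernel over the rows, variant (b) with a 2-D one over rows x columns plus the extra reduction step, which is exactly the synchronization cost you mentioned, so measuring both on your real data is the way to decide.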

  7. #7
    Junior Member
    Join Date
    Nov 2018
    Posts
    6
    Thank you, Dithermaster, for the detailed suggestions.
    I will try both versions for the third one and measure them.
    I would just like you to confirm two things:
    1) clSetKernelArg also transfers the data object to the device, right?
    2) Does OpenGL have a function like "get device"? I use OpenGL (version 1.1) very often, but I don't know of such a function.
    Thank you again.

  8. #8
    Senior Member
    Join Date
    Dec 2011
    Posts
    258
    > 1) clSetKernelArg also transfers the data object to the device, right?
    No, it doesn't cause any data transfer. It's just used to pass the cl_mem handle to the kernel.
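    To illustrate (a sketch with made-up names; the context, queue, and kernel are assumed to exist): you can re-point a kernel argument at different buffers and launch repeatedly, and still no bytes move until an explicit transfer is enqueued.

        /* clSetKernelArg only records which cl_mem handle the kernel will use. */
        #include <CL/cl.h>

        void relaunch_with_two_buffers(cl_context ctx, cl_command_queue queue,
                                       cl_kernel kernel, size_t n, float *host_out)
        {
            cl_int err;
            size_t bytes = n * sizeof(float);
            cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
            cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

            clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);   /* stores the handle */
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

            clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufB);   /* re-point, no copy */
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

            /* the only host <-> device copy here is this explicit blocking read */
            clEnqueueReadBuffer(queue, bufB, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);

            clReleaseMemObject(bufA);
            clReleaseMemObject(bufB);
        }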

    > 2) Does OpenGL have a function like "get device"? I use OpenGL (version 1.1) very often, but I don't know of such a function.
    Look at glGetString(GL_VENDOR). Check for substrings such as AMD, ATI, NVIDIA, or Intel, and compare them to the similar strings in your OpenCL platform and device info. We use it before trying CL/GL interop; you could use it to avoid the primary GPU.
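    Something along these lines (a sketch, not tested everywhere; it assumes a current GL context already exists, and the GL header path and ordering vary by platform, e.g. OpenGL/gl.h on macOS, windows.h before GL/gl.h on Windows):

        /* Crude check: does this OpenCL device's vendor string share a keyword
         * with the vendor reported by the current OpenGL context (i.e. the GPU
         * driving the display)? Vendor strings differ between APIs (AMD may
         * report "ATI Technologies Inc." in GL but "Advanced Micro Devices,
         * Inc." in CL), so a real check may need more aliases. */
        #include <CL/cl.h>
        #include <GL/gl.h>
        #include <string.h>

        int device_matches_gl_vendor(cl_device_id dev)
        {
            const char *gl_vendor = (const char *)glGetString(GL_VENDOR);
            char cl_vendor[256] = {0};
            clGetDeviceInfo(dev, CL_DEVICE_VENDOR, sizeof(cl_vendor), cl_vendor, NULL);
            if (!gl_vendor) return 0;

            static const char *keys[] = { "AMD", "ATI", "NVIDIA", "Intel" };
            for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); ++i)
                if (strstr(gl_vendor, keys[i]) && strstr(cl_vendor, keys[i]))
                    return 1;
            return 0;
        }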

  9. #9
    Junior Member
    Join Date
    Nov 2018
    Posts
    6
    OK, Dithermaster!
    I think I am ready for my work after your guidance.
    You have made my thinking much clearer and more practical.
    Thank you so much for your professional help.
    Yours, Chen.
