
Thread: clCreateContextFromType and clCreateCommandQueue consumes 1GB host memory per device

  1. #1
    Junior Member
    Join Date
    Aug 2015
    Posts
    21

    clCreateContextFromType and clCreateCommandQueue consumes 1GB host memory per device

    Because NVIDIA does not support parallel kernel execution across multiple queues inside a single context, I had to create multiple threads and create one context per thread. Now a new issue has emerged: the following two lines inside the for loop became extremely slow and memory-hungry:

    Code :
        cl_context_properties cps[3]={CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};
        cl_context_properties* cprops=(platform==NULL)?NULL:cps;
        cl_command_queue_properties prop = CL_QUEUE_PROFILING_ENABLE;
     
        for(i=0;i<workdev;i++){
             OCL_ASSERT(((mcxcontext[i]=clCreateContextFromType(cprops,CL_DEVICE_TYPE_ALL,NULL,NULL,&status),status)));
             OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext[i],devices[i],prop,&status),status)));
             ...
        }

    full code can be browsed at https://github.com/fangq/mcxcl/blob/....cpp#L323-L324

    I found that for every device, clCreateContextFromType consumes about 500-600 MB of host memory, and clCreateCommandQueue consumes another 300-400 MB. So, for an empty context and queue, the host needs about 1 GB of memory and 2-3 seconds per device just to create these objects. This feels really strange. I don't know if this is really needed; why does an empty queue/context consume so much memory?

    This is not the end of it: if I ask my CL kernel to run on more than 8 devices (my server has 11 GTX 1080Ti GPUs), memory use grows to around 8 GB and then the program crashes at the above two lines of code. My host has 16 GB of memory, so I am not sure why it crashes (I was able to use all 11 devices when running CUDA).

    I am wondering if anyone can offer an explanation for this behavior. Is this expected? Is there a way I can reduce the overhead? I tried changing CL_QUEUE_PROFILING_ENABLE to CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, but it did not help.
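    One thing I may try next is restricting each context to a single device: clCreateContextFromType with CL_DEVICE_TYPE_ALL makes every context span all devices on the platform, so with N contexts the runtime may be allocating per-device state N*N times. Below is an untested sketch of that idea using clCreateContext with an explicit one-device list; variable names follow the snippet above, and whether this actually lowers the per-context footprint on the NVIDIA runtime is only an assumption:

    ```c
    #include <stdio.h>
    #ifdef __APPLE__
    #include <OpenCL/opencl.h>
    #else
    #include <CL/cl.h>
    #endif

    #define MAXDEV 16

    int main(void){
        cl_platform_id platform;
        cl_uint nplat=0, workdev=0, i;
        cl_device_id devices[MAXDEV];
        cl_context mcxcontext[MAXDEV];
        cl_command_queue mcxqueue[MAXDEV];
        cl_int status;

        if(clGetPlatformIDs(1,&platform,&nplat)!=CL_SUCCESS || nplat==0){
            printf("no OpenCL platform available\n");
            return 0;
        }
        if(clGetDeviceIDs(platform,CL_DEVICE_TYPE_ALL,MAXDEV,devices,&workdev)!=CL_SUCCESS || workdev==0){
            printf("no OpenCL device available\n");
            return 0;
        }

        cl_context_properties cps[3]={CL_CONTEXT_PLATFORM,(cl_context_properties)platform,0};

        for(i=0;i<workdev;i++){
            /* single-device context: only devices[i] is listed, so the runtime
               hopefully does not allocate state for the other GPUs */
            mcxcontext[i]=clCreateContext(cps,1,devices+i,NULL,NULL,&status);
            if(status!=CL_SUCCESS){ fprintf(stderr,"clCreateContext failed: %d\n",status); return 1; }
            mcxqueue[i]=clCreateCommandQueue(mcxcontext[i],devices[i],CL_QUEUE_PROFILING_ENABLE,&status);
            if(status!=CL_SUCCESS){ fprintf(stderr,"clCreateCommandQueue failed: %d\n",status); return 1; }
        }
        printf("created %u single-device contexts\n",workdev);

        for(i=0;i<workdev;i++){
            clReleaseCommandQueue(mcxqueue[i]);
            clReleaseContext(mcxcontext[i]);
        }
        return 0;
    }
    ```
    
    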

    thanks
    Last edited by fangqq; 10-25-2017 at 05:49 PM.

  2. #2
    If anyone wants to reproduce this, you can use the following commands:

    Code :
    git clone https://github.com/fangq/mcxcl.git
    cd mcxcl
    git checkout nvidiaomp
    cd src
    make clean all
    cd ../example/benchmark
    ./run_benchmark1.sh -G 1

    The above command launches the kernel on the 1st CL device (use mcxcl/bin/mcxcl -L to list the devices). If you have multiple GPUs, you can pass -G a 0/1 mask (for example, -G 1101 uses the 1st, 2nd and 4th GPUs). You can run top in the meantime to watch the memory allocation.
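    To make the monitoring a bit more hands-off than watching top, something like the following (assuming the process is named mcxcl, as bin/mcxcl above suggests) can log the resident and virtual sizes once per second while the benchmark runs:

    ```shell
    #!/bin/sh
    # Poll resident (RSS) and virtual (VSZ) memory, in KB, of any running
    # "mcxcl" process once per second; exits as soon as no such process is left.
    while pgrep -x mcxcl > /dev/null; do
        ps -o rss=,vsz=,comm= -C mcxcl
        sleep 1
    done
    ```
    
    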

  3. #3
    Can anyone comment on this?

    Such behavior is quite unusual and really limits scalability; I can't believe it was designed this way.
