Info about device Query

Hi folks,
I am using the NVIDIA OpenCL SDK and I have two graphics cards on my system, a GTX 580 and a GT 240…
I ran the device query program, which basically does:
clGetDeviceIDs(…);
clCreateContext(…);
clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(cBuffer), &cBuffer, NULL);
printf(" Device %s\n", cBuffer);

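(the other fields in the output below come from similar clGetDeviceInfo calls, roughly like the following; the variable names here are just placeholders, not my exact code)

cl_uint computeUnits;
size_t maxWorkGroupSize;
cl_ulong localMemSize;
clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, NULL);
clGetDeviceInfo(devices[i], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
clGetDeviceInfo(devices[i], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL);
printf(" CL_DEVICE_MAX_COMPUTE_UNITS: %u\n", computeUnits);
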
Now this is showing me the following output… (shown just for one of the cards)
CL_DEVICE_NAME: GeForce GTX 580
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 304.88
CL_DEVICE_VERSION: OpenCL 1.1 CUDA
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 16
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1544 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 383 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1535 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_fp64

CL_DEVICE_COMPUTE_CAPABILITY_NV: 2.0
NUMBER OF MULTIPROCESSORS: 16
NUMBER OF CUDA CORES: 512
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 32768
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1

Now I am confused about what the number of cores and the number of multiprocessors mean, and how the local and global sizes relate to the usual terminology, such as the number of work groups on a device, the number of processing elements in each work group, and the total local memory available to each work group…

It will be really great if anyone can explain the complete output with respect to processing elements, work groups, compute units, the local work size of each work group and the complete global size…
I am a newbie and really confused about how the technical terms map to the real output…
IT’S URGENT…

THANKS
PIYUSH

Here’s some of my understanding of these concepts. I’m also new to OpenCL and not familiar with NVIDIA GPUs, so anybody please correct me if I am wrong.

  1. Every work item executes your kernel.
  2. A work group is composed of a bunch of such work items (the local work size decides how many work items there are in one work group; see the small launch sketch after this list).
  3. max_work_group_size, 1024 in your case, means that there can be at most 1024 work items in one work group.
  4. max_work_item_sizes, (1024, 1024, 64), means that your work group's size can be (1024, 1, 1), (1, 1024, 1), or (256, 2, 2), etc. But it cannot be (1, 1, 1024), since the third dimension has a limit of 64.
  5. Work items in one work group share the local memory.
  6. The above are all abstract concepts, while a compute unit is a real hardware component of your GPU. One work group can only execute on one compute unit, but a compute unit can handle several work groups.
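
To make points 2-4 a bit more concrete, a kernel launch passes exactly these two sizes. This is only a rough sketch (queue, kernel and the buffers are assumed to exist already; it is not your actual code):

size_t globalSize[2] = {1024, 1024}; /* total work items over the whole problem (global work size) */
size_t localSize[2]  = {16, 16};     /* 16*16 = 256 work items per work group, <= max_work_group_size (1024) */
/* the runtime then creates (1024/16)*(1024/16) = 4096 work groups from these two numbers */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, localSize, 0, NULL, NULL);
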
I’m not sure whether these are the answers you are looking for. Good luck.

Thanks for the prompt reply… it helped me somewhat.
Still have a few small doubts.
Thanks again leoamuro

This is quite a complex matter.

An NVIDIA multiprocessor is the hardware unit which corresponds to an OpenCL compute unit. Each multiprocessor can independently run concurrent threads.

In NVIDIA hardware, threads are grouped into warps. A warp contains 32 threads which are executed simultaneously (as long as there is no branching in fact, but that’s another story).

A multiprocessor can keep up to 48 warps running at the same time (for a compute capability 2.0 device). So with 16 multiprocessors in a GTX 580, this gives a theoretical maximum of 16*48*32 = 24,576 threads running concurrently.

However, a thread makes computations, accesses memory, and all this severely limits the number of threads really working at a time. A CUDA core is more or less an ALU that makes integer and floating-point operations. The GTX 580 has 512/16 = 32 CUDA cores per multiprocessor. So even if it can run 48*32 = 1,536 concurrent threads, it can only make 32 multiplications per clock cycle.

A logical OpenCL work-group is split into a (hardware) block of warps. For instance, if you have a work-group of size 16*16=256, it will be split into a block of 256/32=8 warps.

When you execute a kernel, the hardware tries to run the maximum number of blocks for your global work size. Each multiprocessor can handle a maximum of 8 blocks simultaneously. As a result, if a work-group has a size lower than 1536/8 = 192, the 8-block limit is reached before the 1,536-thread limit and GPU occupancy will be lower than 100%. For instance, a work-group of size 128 with the maximum of 8 blocks will run only 8*128 = 1,024 threads, for an occupancy of 1024/1536 = 67%.
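
The same arithmetic in code form (the 1,536 and 8 limits are just the compute capability 2.0 values quoted above; nothing here is queried from the driver):

int maxResidentThreads = 1536;   /* max resident threads per multiprocessor (compute capability 2.0) */
int maxResidentBlocks  = 8;      /* max resident blocks per multiprocessor (compute capability 2.0) */
int workGroupSize      = 128;    /* the example work-group size from above */
int blocks = maxResidentThreads / workGroupSize;             /* 12 blocks would be needed to fill the multiprocessor */
if (blocks > maxResidentBlocks) blocks = maxResidentBlocks;  /* but only 8 can be resident at once */
float occupancy = (float)(blocks * workGroupSize) / maxResidentThreads;  /* 8*128/1536 = 0.67 */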

Each multiprocessor also has its own local memory (48 KB for the GTX 580). This local memory is shared among the blocks running on the multiprocessor. As a result, this can also limit the number of blocks running at a time if a block uses a lot of local memory.
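
If you want to see how much local memory your kernel actually uses (and hence how many blocks can fit), there is a query for that too. A rough sketch, where kernel and device are whatever handles you already have:

cl_ulong kernelLocalMem = 0, deviceLocalMem = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE, sizeof(kernelLocalMem), &kernelLocalMem, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(deviceLocalMem), &deviceLocalMem, NULL);
/* very roughly, local memory limits the blocks per multiprocessor to deviceLocalMem / kernelLocalMem,
   on top of the thread and block limits discussed above */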

Thanks, it cleared the whole thing up…