When tuning the local work size for optimum performance, one of the parameters that must be taken into account is the number of work-items (WI) and work-groups (WG) that can be managed by a compute unit (CU) at a given time.
This information is currently not exposed by the clGetDeviceInfo() API call, and I propose its inclusion in the next revision of the standard (something like CL_DEVICE_MAX_WORK_GROUPS_PER_CU and CL_DEVICE_MAX_WORK_ITEMS_PER_CU).
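To make the use case concrete, here is a minimal sketch in C of how the two proposed values could feed into local work size tuning. Everything in it is an assumption for illustration: the `cu_limits` struct and the `resident_work_items` helper are hypothetical names, the two fields stand in for what the proposed queries would return, and the model deliberately ignores other limits such as registers and local memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CU limits: the values that the proposed
 * CL_DEVICE_MAX_WORK_ITEMS_PER_CU and CL_DEVICE_MAX_WORK_GROUPS_PER_CU
 * queries would return. Neither query exists in OpenCL today. */
typedef struct {
    size_t max_work_items_per_cu;
    size_t max_work_groups_per_cu;
} cu_limits;

/* Estimate how many work-items can be resident on one CU for a given
 * local work size. Simplified model (an assumption): only the two limits
 * above are considered, not register or local memory usage. */
static size_t resident_work_items(cu_limits lim, size_t local_size)
{
    assert(lim.max_work_items_per_cu > 0 && lim.max_work_groups_per_cu > 0);

    /* A work-group that is empty or larger than the CU limit cannot run. */
    if (local_size == 0 || local_size > lim.max_work_items_per_cu)
        return 0;

    /* Work-groups that fit by work-item count alone... */
    size_t groups = lim.max_work_items_per_cu / local_size;

    /* ...capped by the number of work-groups the CU can manage. */
    if (groups > lim.max_work_groups_per_cu)
        groups = lim.max_work_groups_per_cu;

    return groups * local_size;
}
```

With made-up limits of 2048 work-items and 16 work-groups per CU, a local size of 64 would keep only 1024 work-items resident under this model, because the work-group limit is hit first, while a local size of 128 would reach the full 2048. Without the two values, the application has no portable way to find that crossover.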