Hi everybody,
I'm doing matrix multiplication with OpenCL.
I split the multiplication across several work-groups,
and each work-group then adds its partial result into global memory.
The code below is the final step, which sums the partial results into the final result:
for (k = 0; k < group_num; ++k)
{
    region = (group_id + k) % group_num;
    l = local_id;
    while (l < matrix_size)
    {
        c_mat[(8*region+0)*matrix_size+l] += local_output_matrix[(8*region+0)*matrix_size+l];
        c_mat[(8*region+1)*matrix_size+l] += local_output_matrix[(8*region+1)*matrix_size+l];
        c_mat[(8*region+2)*matrix_size+l] += local_output_matrix[(8*region+2)*matrix_size+l];
        c_mat[(8*region+3)*matrix_size+l] += local_output_matrix[(8*region+3)*matrix_size+l];
        c_mat[(8*region+4)*matrix_size+l] += local_output_matrix[(8*region+4)*matrix_size+l];
        c_mat[(8*region+5)*matrix_size+l] += local_output_matrix[(8*region+5)*matrix_size+l];
        c_mat[(8*region+6)*matrix_size+l] += local_output_matrix[(8*region+6)*matrix_size+l];
        c_mat[(8*region+7)*matrix_size+l] += local_output_matrix[(8*region+7)*matrix_size+l];
        l = l + group_size;
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
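For comparison, here is a minimal sequential C sketch of what this step is meant to compute for a single work-group. It is only a host-side reference, not the kernel itself. The function name accumulate_group and the parameter partial (standing in for local_output_matrix) are made up for illustration, and it assumes that each group holds a full matrix_size x matrix_size partial result and that group_num * 8 == matrix_size.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS_PER_REGION 8 /* matches the 8 unrolled rows in the kernel */

/* Sequentially add one work-group's partial result into c_mat.
 * Assumptions (hypothetical, for illustration only):
 *   - `partial` is a full matrix_size x matrix_size array,
 *   - group_num * ROWS_PER_REGION == matrix_size. */
void accumulate_group(float *c_mat, const float *partial,
                      int matrix_size, int group_num, int group_id)
{
    assert(group_num > 0);
    assert(group_num * ROWS_PER_REGION == matrix_size);

    for (int k = 0; k < group_num; ++k) {
        /* Same rotated start as the kernel: group g begins at region g. */
        int region = (group_id + k) % group_num;
        for (int r = 0; r < ROWS_PER_REGION; ++r) {
            size_t row = (size_t)(ROWS_PER_REGION * region + r);
            for (int l = 0; l < matrix_size; ++l) {
                size_t idx = row * (size_t)matrix_size + (size_t)l;
                c_mat[idx] += partial[idx];
            }
        }
    }
}
```

Calling accumulate_group once for each group_id from 0 to group_num-1 should give the totals I expect c_mat to hold at the end.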
When the matrix size is 64, the kernel works,
but when the size is increased to 128,
the kernel fails with the message: fatal: si_isa_DS_WRITE_B32_impl: invalid address.
However, if I replace the update with either
c_mat[(8*region+0)*matrix_size+l] += const;
or
temp += local_output_matrix[(8*region+7)*matrix_size+l];
the kernel runs, although the answer is then obviously wrong.
Has anybody run into this fatal error before?
Thanks for your help.