Issue in OpenCL Kernel function

I am new to OpenCL and I am trying to write kernel code for the following matrix operation:

A is a 2x2 matrix:

A = [1 2]  ----> row1
    [3 4]  ----> row2

I need to compute:

1) s1 = transpose(row1) X row1

2) s2 = transpose(row2) X row2

3) Sum = s1+s2
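To make the expected result concrete, here is a minimal CPU-only reference in Java for the steps above, assuming transpose(row) X row means the outer product of a row with itself. The class and method names (OuterProductSumReference, sumOfOuterProducts) are only illustrative and are not part of my OpenCL program.

```java
import java.util.Arrays;

class OuterProductSumReference {

    // Adds transpose(row) X row (an n x n outer product) for every row of a.
    static float[][] sumOfOuterProducts(float[][] a) {
        int n = a[0].length;
        float[][] sum = new float[n][n];
        for (float[] row : a) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    sum[i][j] += row[i] * row[j];
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        float[][] a = { {1f, 2f}, {3f, 4f} };
        // s1 = [[1, 2], [2, 4]], s2 = [[9, 12], [12, 16]]
        System.out.println(Arrays.deepToString(sumOfOuterProducts(a)));
        // prints [[10.0, 14.0], [14.0, 20.0]]
    }
}
```

So for the matrix A above, the result I want from the kernel is the 2x2 matrix [10 14; 14 20].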

I wrote kernel code that works at the row level (i.e., it computes transpose(row1) X row1), so it only handles the first row.

Code below:

private static String programSource1 =
    "__kernel" +
    " void matrixMul(__global float* A, __global float* C, int rowLength)" +
    "{" +
    "int row = get_global_id(1);" +
    "int col = get_global_id(0);" +
    "C[row*rowLength+col] = A[col] * A[row];" +
    "}";

How do I use parallelism to compute this for every row and accumulate the final sum within the kernel function?
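One direction that might work, sketched here under assumptions and not run on a device: keep one work-item per output element, as in my current kernel, and let each work-item loop over all rows so that the sum is accumulated privately with no cross-work-item reduction. The kernel name sumOuterProducts, the extra numRows argument, and the KernelSketch class are hypothetical; the emulate method is only a CPU stand-in for launching a rowLength x rowLength global range, used to check the indexing.

```java
class KernelSketch {

    // Hypothetical kernel: work-item (row, col) computes one element of Sum.
    static final String KERNEL_SOURCE =
        "__kernel void sumOuterProducts(__global const float* A," +
        "                               __global float* C," +
        "                               int numRows, int rowLength)" +
        "{" +
        "    int row = get_global_id(1);" +
        "    int col = get_global_id(0);" +
        "    float acc = 0.0f;" +
        "    for (int r = 0; r < numRows; r++) {" +
        "        acc += A[r*rowLength + row] * A[r*rowLength + col];" +
        "    }" +
        "    C[row*rowLength + col] = acc;" +
        "}";

    // CPU emulation of the same per-work-item logic over a flat row-major A.
    static float[] emulate(float[] a, int numRows, int rowLength) {
        float[] c = new float[rowLength * rowLength];
        for (int row = 0; row < rowLength; row++) {        // get_global_id(1)
            for (int col = 0; col < rowLength; col++) {    // get_global_id(0)
                float acc = 0.0f;
                for (int r = 0; r < numRows; r++) {
                    acc += a[r * rowLength + row] * a[r * rowLength + col];
                }
                c[row * rowLength + col] = acc;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] c = emulate(new float[] {1f, 2f, 3f, 4f}, 2, 2);
        System.out.println(java.util.Arrays.toString(c));
        // prints [10.0, 14.0, 14.0, 20.0]
    }
}
```

I am not sure whether looping over rows inside each work-item like this is the right way to use the parallelism, or whether the row sums should instead be reduced across work-items.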

Please help me in this regard.

Regards

Rohit Sarewar