My GPU is a Qualcomm Adreno 320.

I am writing a program that transfers a large amount of data. A serious problem I have run into is that storing data to memory is too slow.
To keep things simple, I wrote a test kernel that reads a 1 MB array from global memory and then writes it to another array,
so that I can test the speed of loading and storing.

My host code:
Get Platform -> Get Device -> Create Context -> Create Command Queue -> Create Buffer ->
Create Program -> Build Program -> Create Kernel -> Set Kernel Arg -> Enqueue NDRange Kernel
-> clFinish(cmdQueue)

// My input data (in) is a 1024*1024 matrix.
buffer_a = clCreateBuffer(context,
                          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                          sizeof(cl_float2) * buffer_num,
                          in,
                          &err);
getErr(err, "ERR in clCreateBuffer->buffer_a");
buffer_b = clCreateBuffer(context,
                          CL_MEM_READ_WRITE,
                          sizeof(cl_float2) * buffer_num,
                          NULL,
                          &err);
getErr(err, "ERR in clCreateBuffer->result_buffer");

size_t globalWorkSize[2] = {1024, 1024};
size_t localWorkSize[2] = {128, 2};
cl_event rangeEvt;
err = clEnqueueNDRangeKernel(cmdQueue,
                             kernel,
                             2,
                             NULL,
                             globalWorkSize,
                             localWorkSize,
                             0,
                             NULL,
                             &rangeEvt);
// Check the enqueue result before waiting: rangeEvt is only valid on success.
getErr(err, "ERR in clEnqueueNDRangeKernel");
waitForEventAndRelease(&rangeEvt, 1);

My kernel:
__kernel void READandWrite (const int width,
                            __global float2* in,
                            __global float2* out)
{
    float2 table;
    table = in[get_global_id(0) + get_global_id(1) * width];
    out[get_global_id(0) + get_global_id(1) * width] = table;
}

In my tests:
1. I keep all the code in the kernel, and it takes about 3-4 ms.
2. I remove the read from array in and change the kernel to
__kernel void READandWrite (const int width,
                            __global float2* in,
                            __global float2* out)
{
    float2 table;
    table = (float2)(1.1f, 1.1f);
    out[get_global_id(0) + get_global_id(1) * width] = table;
}
It still takes about 3-4 ms.
3. I remove the write to array out and change the kernel to
__kernel void READandWrite (const int width,
                            __global float2* in,
                            __global float2* out)
{
    float2 table;
    table = in[get_global_id(0) + get_global_id(1) * width];
}
Now it takes about 1.5 ms.

4. I delete all the code in the kernel:
__kernel void READandWrite (const int width,
                            __global float2* in,
                            __global float2* out)
{
}
It also takes about 1.5 ms.
According to these results, reading is very quick, but writing adds more than 1.5 ms.
I also tried using "vload2()" and "vstore2()", and the result is the same. Why is storing
data so slow, and how can I improve it so that it is comparable to loading? I would be grateful if
anyone could help me figure this out.