How to use OpenCL to process a video sequence frame by frame?

Hello everyone,
I am an OpenCL learner, and I want to process YUV data (format 4:1:1, planar) with the full-search motion estimation algorithm. The test video is CITY.yuv (size: 352 x 288).
I know how to process one frame or image in one command queue, but I don’t know how to process a frame sequence one frame at a time with OpenCL.
Currently I use one for-loop to handle all the frames in the video. The code is as follows:

// --------------------------------------------------------
// Core sequence… copy input data to GPU, compute, copy results back
//suppose there are 100 frames.
for(num=0;num<1000;num++)
{
int i,j;
//Read and Copy pixel data into frame_original
fseek(fp_cur,XX*YY*3/2*(num+1),SEEK_SET);
if(fread(current_frame[0],XX*YY,1,fp_cur)==0)break;
memcpy(frame_original, current_frame[0], width * height * sizeof(cl_uchar));
//Read and Copy pixel data into frame_ref
fseek(fp_ref,XX*YY*3/2*(num+0),SEEK_SET);
if(fread(ref_frame[0],XX*YY,1,fp_ref)==0)break;
memcpy(frame_ref, ref_frame[0], width * height * sizeof(cl_uchar));

// Asynchronous write of data to GPU device
ciErrNum = clEnqueueWriteBuffer(cqCommandQue, cm_original , CL_TRUE, 0, width*height*sizeof(cl_uchar), frame_original, 0, NULL, &H2D);
shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);

ciErrNum = clEnqueueWriteBuffer(cqCommandQue, cm_ref , CL_FALSE, 0, width*height*sizeof(cl_uchar), frame_ref, 0, NULL, NULL);
shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);

//Create the kernel
ckKernel = clCreateKernel(cpProgram, "motion_estimation", &ciErrNum);
shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);
// set the global size for each block in frame.
global[0]= width/block_size;
global[1]= height/block_size;

// set the kernel arguments
int n = 0;
printf("clSetKernelArg...\n\n");
ciErrNum = clSetKernelArg(ckKernel, n++, sizeof(cl_mem), (void*)&cm_original);
ciErrNum |= clSetKernelArg(ckKernel, n++, sizeof(cl_mem), (void*)&cm_ref);
ciErrNum |= clSetKernelArg(ckKernel, n++, sizeof(cl_int), (void*)&cm_mv_output);

// Launch kernel
ciErrNum = clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 2, NULL,global,NULL,0, NULL, &ceEvent);
shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);

// wait for the command to finish
clFinish(cqCommandQue);

// Read back results and check accumulated
ciErrNum = clEnqueueReadBuffer(cqCommandQue, cm_mv_output, CL_TRUE, 0, X*Y*sizeof(MV), MV_output, 0, NULL, &GPUDone);
shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);
// Release event
ciErrNum = clReleaseEvent(ceEvent);
shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);
ceEvent = 0;
Cleanup (EXIT_SUCCESS);

}

I know there must be a better way to process the YUV data. If you have any ideas or opinions, please let me know.
Thanks for your help.

Each iteration of the loop does two things near the end:

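a) Calls clFinish(), which is not necessary: assuming the default in-order command queue, the blocking read that follows already waits for the kernel to finish.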
b) Makes a blocking call to clEnqueueReadBuffer().

This means that each iteration of the loop submits a very small amount of work (one CIF-resolution frame) to the GPU and then waits until it’s done, so the GPU sits idle while the host reads and uploads the next frame. That is not an efficient way to submit work to a GPU.

There are several things you could try, such as double- or triple-buffering, or submitting multiple frames’ worth of data in each clEnqueueNDRangeKernel call, given that CIF is so small.
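
As a rough illustration of the batching idea, the host-side sketch below packs the luma planes of several consecutive frames into one contiguous buffer and derives a 3D global work size, so that a single clEnqueueNDRangeKernel call could cover the whole batch. The helper names (luma_offset, pack_luma_batch, batch_global_size) are hypothetical and not from your code. The 3/2 bytes-per-pixel frame size is taken from the fseek calls in your loop. The kernel would also need to be changed to pick its frame with a third index such as get_global_id(2), which is an assumption about how you would adapt it.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: byte offset of frame n's luma (Y) plane in a planar
   YUV file that stores 3/2 bytes per pixel, as the fseek calls assume. */
static long luma_offset(int width, int height, int n)
{
    return (long)width * height * 3 / 2 * n;
}

/* Hypothetical helper: read the Y planes of `count` consecutive frames,
   starting at frame `first`, back to back into `dst`.
   Returns the number of frames actually read (fewer at end of file). */
static int pack_luma_batch(FILE *fp, int width, int height,
                           int first, int count, unsigned char *dst)
{
    size_t plane = (size_t)width * height;
    int i;
    for (i = 0; i < count; i++) {
        if (fseek(fp, luma_offset(width, height, first + i), SEEK_SET) != 0)
            break;
        if (fread(dst + (size_t)i * plane, plane, 1, fp) != 1)
            break;
    }
    return i;
}

/* Hypothetical helper: one work-item per block per frame in the batch. */
static void batch_global_size(int width, int height, int block_size,
                              int frames, size_t global[3])
{
    global[0] = (size_t)(width / block_size);
    global[1] = (size_t)(height / block_size);
    global[2] = (size_t)frames;
}
```

The packed buffer could then be uploaded with one clEnqueueWriteBuffer call and processed by one kernel launch with work dimension 3, which cuts the number of host/GPU round trips by the batch size. Since each frame is matched against its predecessor, a batch of N current frames needs N+1 luma planes in total.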

Also, do not call clCreateKernel() in each iteration. Create the kernel outside of the loop and reuse it.

Thanks for your advice.
I will try it soon.