I want to chain together multiple kernels. Is it better to call one kernel from within another kernel, or to enqueue each kernel from the host?

Pseudo code below:
Kernel calling kernel
Code :
__kernel void vsubtract(__global float *a, __global float *b, __global float *c,
                        const unsigned int count, unsigned int red)
{
    int i = get_global_id(0);
    if (i < count) {
        a[i] = b[i] - a[i];   // difference
        c[i] = a[i];          // keep a copy of the difference
        a[i] = a[i] * a[i];   // squared difference
    }
    // call reduction kernel
    reduction(a, count, red);
}

or host calling kernels
Code :
vsubtract(cl::EnqueueArgs(queue, cl::NDRange(count), cl::NDRange(local)), d_a, d_b, d_c, count, red);
queue.enqueueReadBuffer(d_a, CL_TRUE, 0, sizeof(float) * LENGTH, &vector_a[0]);
reduction(cl::EnqueueArgs(queue, cl::NDRange(count), cl::NDRange(local)), d_a, count, red);
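
Either variant should produce the same numbers, so a plain host-side reference is handy for checking the result. Below is a minimal C++ sketch of that check, not the actual device code. The names vsubtract_ref and reduction_ref are hypothetical, and it assumes the reduction is a simple sum, which the pseudo code above does not state.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the vsubtract pseudo code:
// c keeps the plain difference, a ends up holding the squared difference.
void vsubtract_ref(std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] - a[i];
        c[i] = a[i];
        a[i] = a[i] * a[i];
    }
}

// ASSUMPTION: "reduction" means summing the first count elements.
float reduction_ref(const std::vector<float>& a, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count && i < a.size(); ++i) {
        sum += a[i];
    }
    return sum;
}
```

Comparing the value from reduction_ref with the device result, within a small tolerance because the summation order differs, gives a quick check that the chaining did not change the answer.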

I assume it would be faster to have the kernel call the other kernel, since that avoids the additional data transfer between the host and the device.

Are there any issues I need to be aware of if I have kernels calling kernels?