How to get better vectorized performance for XOR and AND on ulong

Hi,

Problem 1:
I’ve tried a couple of tricks to get vectorized performance for XOR and AND operations on ulong data in OpenCL.
None of them resulted in good performance. One technique, for example, is breaking the ulong data into 8 uchar values and
then performing the XOR (with ^) on those, but the result performs worse.
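
To make the comparison concrete, here is a minimal host-side C sketch (not the actual OpenCL kernel) of the two variants: XOR on whole 64-bit words, and the byte-splitting trick described above. The function names xor_words and xor_bytes are hypothetical, and uint64_t stands in for OpenCL's ulong. The byte-wise version needs eight XORs plus unpacking and repacking to produce what a single 64-bit XOR already gives, which may be part of why it performs worse.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR two buffers one 64-bit word at a time
 * (uint64_t here; this would be ulong in OpenCL C). */
void xor_words(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] ^ b[i];
}

/* The byte-splitting trick: same result as a ^ b, but computed as
 * eight separate 8-bit XORs with unpack and repack steps around them. */
uint64_t xor_bytes(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int k = 0; k < 8; k++) {
        uint8_t ab = (uint8_t)(a >> (8 * k));   /* byte k of a */
        uint8_t bb = (uint8_t)(b >> (8 * k));   /* byte k of b */
        r |= (uint64_t)(uint8_t)(ab ^ bb) << (8 * k);
    }
    return r;
}
```

If the goal is wider operations, the usual direction is the opposite one: process several ulong values per operation (for example with the ulong2 or ulong4 vector types) rather than splitting one ulong into smaller pieces. Whether that helps depends on the device and compiler.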

Problem 2:
The second problem: I wanted to accelerate the following code using uchar8, but saw no improvement:

// An assembly-language version of the code below performs 20% better
int len = 0;
ulong v = …
// Counts the zero bytes at the low end of v; assumes v != 0,
// otherwise the loop never terminates
while ((v & UCHAR_MAX) == 0) { // UCHAR_MAX is 255, CHAR_BIT is 8
    v >>= CHAR_BIT;
    len += 1;
}

My approach is given below:

int len = 0 ;
ulong v = … // Some 64-bit data
ulong8 u8cmax = (ulong8)(UCHAR_MAX); // byte mask (255) in every lane, matching the scalar loop
if ((v & UCHAR_MAX) == 0) {
    ulong uvt[8];
    uvt[0] = v;
    int i = 1;
    while (i < 8) {
        uvt[i] = uvt[i-1] >> CHAR_BIT; // uvt[i] is v shifted right by i bytes
        i++;
    }
    ulong8 uv8 = (ulong8)(uvt[0], uvt[1], uvt[2], uvt[3], uvt[4], uvt[5], uvt[6], uvt[7]);
    ulong8 uc = uv8 & u8cmax; // Vectorized AND of 8 ulong data
    ulong uv[8] = {uc.s0, uc.s1, uc.s2, uc.s3, uc.s4, uc.s5, uc.s6, uc.s7};
    i = 0;
    while (i < 8 && uv[i] == 0) { // stop at the first non-zero lane without reading past uv[7]
        len += 1;
        i++;
    }
}
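
For comparison, here is a minimal host-side C sketch of a different way to get the same count, based on the observation that the loop counts trailing zero bytes, which is the number of trailing zero bits divided by 8. This is a swapped-in technique, not the vectorized approach above. The function names zero_bytes_loop and zero_bytes_bits are hypothetical, uint64_t stands in for ulong, and returning 8 for v == 0 is an assumption (the original loop does not terminate in that case). In OpenCL C the hand-written bit count could probably be replaced by the popcount built-in (OpenCL C 1.2 and later) or the whole computation by ctz (OpenCL C 2.0 and later), if the target supports them.

```c
#include <stdint.h>

/* Reference: the byte-at-a-time loop, with a guard for v == 0. */
int zero_bytes_loop(uint64_t v)
{
    int len = 0;
    if (v == 0)
        return 8;               /* assumption: all 8 bytes count as zero */
    while ((v & 0xFF) == 0) {   /* 0xFF is UCHAR_MAX */
        v >>= 8;                /* 8 is CHAR_BIT */
        len += 1;
    }
    return len;
}

/* Alternative: count trailing zero bits, then divide by 8. */
int zero_bytes_bits(uint64_t v)
{
    if (v == 0)
        return 8;               /* same assumption as above */
    /* v & (~v + 1) isolates the lowest set bit; subtracting 1
     * leaves ones in exactly the trailing-zero positions. */
    uint64_t below = (v & (~v + 1)) - 1;
    int bits = 0;
    while (below) {             /* portable population count */
        below &= below - 1;     /* clear the lowest set bit */
        bits++;
    }
    return bits >> 3;           /* bits / 8 = whole zero bytes */
}
```

Both functions should agree for every input, so the loop version can serve as a check when trying the bit-counting version on a particular device.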

Can someone shed light on the above? I appreciate…

Thanks,
Syed Hussain