Which type of memory should I use?

I need a huge amount of memory for a single work-item, and I have a question connected with this. Which type of memory should I use?
a) when I need 256 KB of memory
b) when I need 1 MB of memory

Theoretically, __private memory is faster than __global memory, but is that always true?
When a small amount of __private memory is used, it is backed by registers, am I right?
Is __private memory still faster than __global when I allocate a very large array?
It is said that __private memory is unlimited, but is that really true?
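
To make it concrete, this is the kind of thing I mean (a hypothetical sketch, not my real code; the size matches case a) above):

```c
// Hypothetical kernel with a large per-work-item __private array.
// 65536 floats = 256 KB per work-item (case a); case b would be 262144 floats.
__kernel void big_private(__global const float *in, __global float *out)
{
    __private float scratch[65536];   // does this stay fast, or spill?

    size_t gid = get_global_id(0);
    for (int i = 0; i < 65536; ++i)
        scratch[i] = in[gid] * (float)i;

    float acc = 0.0f;
    for (int i = 0; i < 65536; ++i)
        acc += scratch[i];
    out[gid] = acc;
}
```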

Usually, the decision to use private or global memory is driven by how the memory is being used. Data written to private memory is visible only within a single work-item, and can be written only by that work-item. Data written to global memory is visible across the whole device and can be accessed by the host.

On the devices I’ve used, private variables are allocated in registers until the registers run out. Then they spill to something slower and farther away, possibly even the same memory space from which global memory is allocated. At that point performance would be similar to global memory, except for one thing: in global memory you may be able to control the data layout to make access more efficient (e.g. structure of arrays), in a way that you can’t for private memory.

I don’t see anything in the spec that says whether there are limits on private memory, but no resource is truly unlimited. You might get a CL_OUT_OF_RESOURCES error or something similar if you try to allocate too much.
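
As a rough sketch of what I mean (exact behavior is compiler- and device-specific, and the sizes here are just illustrative):

```c
// A handful of __private scalars like these will almost always live in registers:
__kernel void small_private(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    float a = in[gid];     // __private scalar, typically a register
    float b = a * 2.0f;    // likewise
    out[gid] = a + b;
}

// A large __private array like this will very likely exceed the register
// budget and spill to slower, off-chip memory on most devices:
__kernel void spilled_private(__global const float *in, __global float *out)
{
    __private float scratch[16384];   // ~64 KB per work-item
    size_t gid = get_global_id(0);
    for (int i = 0; i < 16384; ++i)
        scratch[i] = in[gid] + (float)i;
    out[gid] = scratch[16383];
}
```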

Thanks kuzne, could you tell me more, or give me some links, about controlling the layout to make it more efficient?

The basic idea is that you want neighboring work-items to access neighboring data, if possible. Here’s a good StackOverflow link that explains this:

The StackOverflow discussion happens to be CUDA-focused, but the same general idea applies to OpenCL as well. For some kernels, a coalesced memory-access pattern is very natural, but for others it may take some restructuring of the data. For example, if you have an array of structures and you want neighboring work-items to work on neighboring elements of the array, you’ll end up with strided memory accesses when those work-items execute in a SIMD group. But if the data can be expressed as a structure of arrays, the accesses can be coalesced much better. In the end, not every algorithm can get its accesses to coalesce, and you may decide to live with the scattered accesses; it may or may not matter, depending on the workload.
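
Here’s a minimal sketch of that AoS-vs-SoA difference in OpenCL C (the types and kernel names are made up for illustration):

```c
// Array of structures: work-item gid reads pts[gid].x/.y/.z, so neighboring
// work-items touch addresses 12 bytes apart -> strided, poorly coalesced.
typedef struct { float x, y, z; } Point;

__kernel void aos_length(__global const Point *pts, __global float *len)
{
    size_t gid = get_global_id(0);
    Point p = pts[gid];
    len[gid] = sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
}

// Structure of arrays: neighboring work-items read consecutive floats from
// each array, so the accesses within a SIMD group can be coalesced.
__kernel void soa_length(__global const float *xs,
                         __global const float *ys,
                         __global const float *zs,
                         __global float *len)
{
    size_t gid = get_global_id(0);
    len[gid] = sqrt(xs[gid] * xs[gid] + ys[gid] * ys[gid] + zs[gid] * zs[gid]);
}
```

Same math in both kernels; only the memory layout differs.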