Thoughts on optimizing my software rasterizer

I am currently writing a software rasterizer/renderer that uses OpenCL as the engine for the fragment shading stage.
I eventually plan on moving as much as is practical to OpenCL.
In this list's opinion, given the current limitations of OpenCL with threads, and also the host<->GPU communication overhead,
what would be the best practical strategy for optimizing my scenario?

I know that modifying command queues is not thread safe (I tried it;>).

Right now the thread hierarchy looks like this:

(CPUthread0:TransformGeometry) … (CPUthread63:TransformGeometry) (using a thread pool)
/
thread safe (but not locked) screen-space per material per screen tile post transform buckets
/
(CPUthread0:RasterizePt1) … (CPUthread15:RasterizePt1) (using the same thread pool)
/ (SCAN CONVERT TRIANGLES INTO PRE-SHADED FRAGMENTS)
thread safe (but not locked) tile-space per material per screen tile preshaded-fragment buffers
/
(Locked OpenCL Device: Fragment Shading) // CPU THREADS SERIALIZED HERE (Most time spent per frame is also here)
/ (SHADE FRAGMENTS)
thread safe (but not locked) tile-space per material per screen tile postshaded-fragment A-Buffers
/
(CPUthread0:RasterizePt3) … (CPUthread15:RasterizePt3) (using the same thread pool, actually the same workqueue job as RasterizePt1)
/ (ZSort, A-Buffer Composite and AntiAlias Resolve TileBuffer to FrameBuffer)
DONE
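
To make the serialization point concrete, here is a minimal C++ sketch of the per-tile job structure described above. It is not my actual code: `Tile`, `rasterize_pt1`, `shade_on_device`, `rasterize_pt3` and `run_frame` are hypothetical names, the shading stage is a plain CPU loop standing in for the OpenCL enqueue, and the single mutex stands in for the locked OpenCL device.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for one screen tile's fragment buffers.
struct Tile {
    std::vector<float> preshaded;   // filled by RasterizePt1
    std::vector<float> postshaded;  // filled by the shading stage
    float resolved = 0.0f;          // written by RasterizePt3
};

// Stand-in for scan conversion: emits `frags` pre-shaded fragments.
void rasterize_pt1(Tile& t, std::size_t tile_index, std::size_t frags) {
    t.preshaded.assign(frags, static_cast<float>(tile_index));
}

// One lock around the whole "device": every worker serializes here.
std::mutex g_device_mutex;

// Stand-in for the OpenCL fragment shading stage (assumption: a real
// version would enqueue a kernel and read the results back here).
void shade_on_device(Tile& t) {
    std::lock_guard<std::mutex> lock(g_device_mutex);
    t.postshaded.resize(t.preshaded.size());
    for (std::size_t i = 0; i < t.preshaded.size(); ++i)
        t.postshaded[i] = t.preshaded[i] * 2.0f + 1.0f;
}

// Stand-in for sort/composite/resolve: just sums the shaded fragments.
void rasterize_pt3(Tile& t) {
    assert(t.postshaded.size() == t.preshaded.size());
    float sum = 0.0f;
    for (float f : t.postshaded) sum += f;
    t.resolved = sum;
}

// Runs one frame: each worker pulls tile indices from a shared counter
// and runs Pt1, the locked shading stage, and Pt3 in the same job.
std::vector<float> run_frame(std::size_t tile_count,
                             std::size_t worker_count,
                             std::size_t frags_per_tile) {
    std::vector<Tile> tiles(tile_count);
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= tile_count) break;
            rasterize_pt1(tiles[i], i, frags_per_tile);
            shade_on_device(tiles[i]);  // CPU threads serialize here
            rasterize_pt3(tiles[i]);
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < worker_count; ++w) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::vector<float> out;
    for (const Tile& t : tiles) out.push_back(t.resolved);
    return out;
}
```

Calling something like `run_frame(8, 4, 16)` runs eight tiles across four workers. Pt1 and Pt3 run in parallel per tile, but only one worker at a time gets through `shade_on_device`, which is the stage where most of my frame time goes.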

If it matters, I am not currently concerned with all hardware platforms, just mine. I will be at some point, but I am not there yet…
I am using dual Xeon E5520s and a GeForce GTX 260 Core 216.

You can see some performance tables at
http://www.tweakoz.com/michael/wordpress/?page_id=464

Thanks,

mtm