Using non-square rectangular blocking for a matrix multiplication kernel

I have been working with a kernel that does matrix multiplication.

The kernel is very much like the common matrix multiplication examples (I can't post a URL to it yet).

It uses 16 x 16 block sizes. I have read that one could use rectangular block sizes, but that always seems to be "an exercise left to the reader".

When I try them I routinely get error code -5, so I know I am accessing something I shouldn't.

I assume I am not quite understanding how I am accessing the LOCAL (shared) memory. I am also not sure whether the block shape applies only to the output matrix or to either or both of the input matrices as well.
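To make the question concrete, here is a rough sketch of one way the indexing could work, written as plain C on the CPU rather than as a kernel so it is easy to step through. Everything in it is an assumption for illustration and not taken from the example kernel: the names `TM`, `TN`, `TK`, `matmul_rect_tiled` and `rect_tiled_matches_naive` are hypothetical, storage is assumed row-major, and the matrix dimensions are assumed to be exact multiples of the tile sizes. In this sketch the block shape (TM x TN) describes the output tile only. The staged input tiles are TM x TK from A and TK x TN from B, so neither matches the block shape, and the loads use a flattened local id instead of the (row, column) id directly.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tile sizes, chosen only for illustration. */
#define TM 4  /* rows of C computed per block */
#define TN 8  /* columns of C computed per block */
#define TK 2  /* depth of the K slice staged per step */

/* C (M x N) = A (M x K) * B (K x N), all row-major.
 * Assumes M, N, K are exact multiples of TM, TN, TK. */
static void matmul_rect_tiled(const float *A, const float *B, float *C,
                              int M, int N, int K)
{
    float tileA[TM][TK];  /* stands in for local memory: TM x TK */
    float tileB[TK][TN];  /* stands in for local memory: TK x TN */
    float acc[TM][TN];    /* one accumulator per emulated work-item */
    const int group_size = TM * TN;

    assert(M % TM == 0 && N % TN == 0 && K % TK == 0);

    for (int by = 0; by < M / TM; ++by) {
        for (int bx = 0; bx < N / TN; ++bx) {
            for (int ty = 0; ty < TM; ++ty)
                for (int tx = 0; tx < TN; ++tx)
                    acc[ty][tx] = 0.0f;

            for (int k0 = 0; k0 < K; k0 += TK) {
                /* Load phase: each emulated work-item (flattened id lid)
                 * copies elements lid, lid + group_size, ... of each tile.
                 * The tile shapes differ from the TM x TN block shape, so
                 * indexing a tile directly by (ty, tx) would run out of
                 * bounds whenever TK is smaller than TM or TN. */
                for (int lid = 0; lid < group_size; ++lid) {
                    for (int i = lid; i < TM * TK; i += group_size)
                        tileA[i / TK][i % TK] =
                            A[(by * TM + i / TK) * K + k0 + i % TK];
                    for (int i = lid; i < TK * TN; i += group_size)
                        tileB[i / TN][i % TN] =
                            B[(k0 + i / TN) * N + bx * TN + i % TN];
                }
                /* A real kernel would need a barrier here. */
                for (int ty = 0; ty < TM; ++ty)
                    for (int tx = 0; tx < TN; ++tx)
                        for (int k = 0; k < TK; ++k)
                            acc[ty][tx] += tileA[ty][k] * tileB[k][tx];
                /* ...and another barrier before the tiles are reused. */
            }

            /* Each emulated work-item writes its single output element. */
            for (int ty = 0; ty < TM; ++ty)
                for (int tx = 0; tx < TN; ++tx)
                    C[(by * TM + ty) * N + bx * TN + tx] = acc[ty][tx];
        }
    }
}

/* Compares the tiled result with a naive triple loop on small integer
 * data (exact in float); returns 1 if every element matches. */
static int rect_tiled_matches_naive(int M, int N, int K)
{
    float *A = malloc(sizeof *A * M * K);
    float *B = malloc(sizeof *B * K * N);
    float *C = malloc(sizeof *C * M * N);
    int ok = 1;

    if (!A || !B || !C) { free(A); free(B); free(C); return 0; }
    for (int i = 0; i < M * K; ++i) A[i] = (float)(i % 7 - 3);
    for (int i = 0; i < K * N; ++i) B[i] = (float)(i % 5 - 2);

    matmul_rect_tiled(A, B, C, M, N, K);

    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k)
                ref += A[r * K + k] * B[k * N + c];
            if (ref != C[r * N + c]) ok = 0;
        }

    free(A); free(B); free(C);
    return ok;
}
```

Calling `rect_tiled_matches_naive(8, 16, 6)` from a small `main` runs the tiled version against the naive one. In an actual kernel the two outer block loops would become the work-group ids and the `ty`/`tx` loops would become the local ids.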

Can someone point me to a reference that might help me, or an example of a matrix multiplication that does in fact use rectangular blocking?

Thanks.