need performance advice

I was wondering if anyone on here could point me in the right direction/provide insight for a post process stage for a VR renderer that’s responsible for reconstructing the half res region in the radial density mask.

see this image:


This is stemming from a post I made yesterday about using a pre-made stencil to cull 2x2 quads so they don’t have to run a fragment shader:

in the reconstruction filter I decide if the pixel is in the central eye region or in the checkerboard region. if in the central eye region i sample the corresponding pixel from the input(which is render target of the forward pass using the stencil)
if in the checker board region i determine if its in the unrendered 2x2 spot or a rendered 2x2 spot and do the appropriate bilinear sampling according to slide at 20:20 (second option on the right) in Vlachos’ GDC talk here:GDC 2016: Advanced VR Rendering Performance by Alex Vlachos - YouTube
5 texture reads for the rendered pixels in the checkerboard ring and 4 reads for unrendered pixels in the checkboard ring

doing the stencil mask saves a good amount but doing the stencil then hole fill is slightly worse than not doing stencil and hole fill. Doing the hold fill adds about 1.3ms. I’m no expert but I feel like this is a bit sluggish. The cards is gtx960m at a render target res of 1792x1120

the input descriptor is shader read only optimal layout and the image is a combined image sampler doing VK_FILTER_NEAREST.

Also, I assume fragments running a shader program get grouped up in some way (is it the same as cuda warps?32?) and must march in lockstep through the branches, i.e. if one thread takes the branch the others must wait until it finishes before they can continue.
If the thread groups are 2x2 group of pixels then they must march in lockstep through something like this (if they were in a rendered 2x2 group in the checkboard region)
if (pixel 1) fetch 5 samples, output color
else if (pixel 2) fetch 5 samples, output color
else if (pixel 3) fetch 5 samples, output color
else if (pixel 4) fetch 5 samples, output color

if the thread group is larger than 2x2 section of pixels then its something like this if they fall in the checker region:
if( in rendered checker 2x2) {
if (pixel 1) fetch 5 samples, output color
else if (pixel 2) fetch 5 samples, output color
else if (pixel 3) fetch 5 samples, output color
else if (pixel 4) fetch 5 samples, output color

} else if (in unrendered checker 2x2) {
if (pixel 1) fetch 4 samples, output color
else if (pixel 2) fetch 4 samples, output color
else if (pixel 3) fetch 4 samples, output color
else if (pixel 4) fetch 4 samples, output color
}

So I guess worse case threads are waiting around for 45 texture fetches and bilinear weighting? Maybe use the 2x2 pixel quad ID (1,2,3,4) to come up with formula to sample correctly without the need for branching. Maybe blindly prefetch the samples that it might need (with out knowing its case yet) as far up the shader as possible? will a SPIR-V compiler try to do any of this?