A pile of technical GPU questions, sorry :)

Firstly, if anyone knows of any decent resources for learning the kind of details I'm asking about, please share them. I'm happy to do a lot of reading.

Here’s a handful of GPU questions I’m having trouble finding answers to. They’re not specific to Vulkan, but I don’t see a better category to use. I have a Vulkan wrapper that I’m tweaking and some of the questions here will help with the design.
I thought it’d be easier to ask these in bulk rather than multiple questions. I hope that’s OK.

For the following questions, I'm really asking about standard practice in immediate-mode rendering GPUs (Nvidia/ATI/maybe Intel).

1. Terminology
Is there a standard terminology for GPU shading components yet? What’s the best way to refer to:
[ul]
[li]The element responsible for a single texel output (eg CUDA core). (= Lane? Unit?)
[/li][li]The block of elements (above) whose instructions are performed together (SIMD). (= Core?)
[/li][li]The component responsible for managing tasks and cores. (= Thread dispatcher?)
[/li][/ul]
I will use lane and core for the rest of this uberquestion.

2. Memory addressing
Is GPU access to graphics memory ever virtual (ie, via page tables)? Can the driver/GPU choose to move resources to different parts of physical memory (eg, to avoid contention when running multiple applications)?
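To clarify what I mean by "virtual": a tiny C sketch of page-table translation. The page size, table layout, and names are all made up for illustration; the only point is that a driver could relocate a page physically without any GPU-visible address changing.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Invented layout: virtual page index -> physical page base address. */
typedef struct {
    uint64_t phys_base[16];
} PageTable;

static uint64_t translate(const PageTable *pt, uint64_t virt)
{
    uint64_t page   = virt / PAGE_SIZE;
    uint64_t offset = virt % PAGE_SIZE;
    return pt->phys_base[page] + offset;
}

/* The driver "migrates" a page by copying it and patching one entry;
 * every virtual address held by in-flight work stays valid. */
static void migrate_page(PageTable *pt, uint64_t page, uint64_t new_phys)
{
    pt->phys_base[page] = new_phys;
}
```

That relocation-under-the-hood behaviour is what I'm asking whether real GPUs/drivers actually do.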

3. Per-primitive user data
GPUs don’t support user-supplied per-primitive (or per-tessellation-patch, etc.) data (eg, per-triangle colors/normals) yet, right? Is there any technical reason why? Implicit per-primitive data is already delivered to the cores (interpolation constants and flat values). This seems to be a common request, and working around it wastes data.
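By "wasted" I mean the usual workaround: replicate each triangle's attributes into all three of its vertices and flag them flat in the shader. A toy C sketch of the replication cost (all names invented, not any real API):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float r, g, b; } Color;

/* Copy each per-triangle color into all three of that triangle's
 * vertices. Returns the number of per-vertex slots consumed: 3x the
 * per-triangle data, i.e. two redundant copies per triangle. */
static size_t replicate_flat(const Color *per_tri, size_t tri_count,
                             Color *per_vertex)
{
    for (size_t t = 0; t < tri_count; ++t)
        for (size_t v = 0; v < 3; ++v)
            per_vertex[3 * t + v] = per_tri[t];
    return 3 * tri_count;
}
```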

4. ROP texel ordering
How is order preserved when sending finished texels to the ROPs (render-output units)? Where/how do out-of-order texels queue when a previous primitive hasn’t been fully processed by the ROPs?
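To pin down the invariant I'm asking about, here's a toy C scoreboard model: primitives finish shading out of order, but only retire to the ROP in submission order. The structure is entirely invented, not a claim about real hardware; I want to know what the real mechanism and queuing storage look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PRIMS 64

typedef struct {
    bool   done[MAX_PRIMS]; /* per-primitive "fully shaded" flag */
    size_t retire_next;     /* oldest primitive not yet sent to the ROP */
} Scoreboard;

static void mark_done(Scoreboard *sb, size_t prim) { sb->done[prim] = true; }

/* Release primitives to the ROP strictly in submission order; a finished
 * but out-of-order primitive just waits. Returns how many retired. */
static size_t retire_in_order(Scoreboard *sb, size_t prim_count)
{
    size_t released = 0;
    while (sb->retire_next < prim_count && sb->done[sb->retire_next]) {
        ++sb->retire_next;
        ++released;
    }
    return released;
}
```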

5. TMUs and cores
Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded) or are they grouped somehow? Is there a texture-request queue or is there some other scheduling method?
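Here's a toy C model of the "shared request queue" alternative I have in mind: any lane pushes a texture request into one FIFO, and any free TMU pops from it (a full queue would stall the lane). Purely illustrative, with invented names; I don't know what real schedulers do, which is the question.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 32

typedef struct { int lane; int texcoord; } TexRequest;

/* Single shared FIFO between all lanes and all TMUs (hypothetical). */
typedef struct {
    TexRequest buf[QCAP];
    size_t head, tail; /* monotonically increasing; indices taken mod QCAP */
} TexQueue;

static int push_request(TexQueue *q, TexRequest r)
{
    if (q->tail - q->head == QCAP) return 0; /* queue full: lane stalls */
    q->buf[q->tail++ % QCAP] = r;
    return 1;
}

static int pop_request(TexQueue *q, TexRequest *out)
{
    if (q->head == q->tail) return 0;        /* nothing for the TMU */
    *out = q->buf[q->head++ % QCAP];
    return 1;
}
```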

6. Identical texture metadata
For two textures with identical metadata in the same memory heap, is switching a TMU between them necessarily any more complex than simply changing the TMU’s texture pointer offset (ignoring resulting cache misses)?

7. Data “families”

There seem to be many data “families” available to core lanes:
[LIST=A]
[li]Per-lane:
[/li][ol]
[li]Private lane variables. (Read/Write).
[/li][li]Lane location/index (differentiating lanes within a core). (Read-only).
[/li][li]Derivatives (per pair/quad?). (Read/Write(ish)).
[/li][/ol]
[li]Per-core (read-only):
[/li][ol]
[li]Per-primitive(or patch, etc) constant data. Interpolation constants etc.
[/li][li]Draw-call-constant data (uniforms, descriptor set data).
[/li][/ol]
[li]RAM-based stuff (TMU, buffer array data, input attachments, counters, etc).
[/li][/LIST]
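For A3, my mental model of derivatives is differencing values across a 2x2 quad's lanes, something like this C sketch. The lane layout is an assumption on my part, and real vendor wiring may differ:

```c
#include <assert.h>

/* Assumed quad lane layout:
 *   0 1
 *   2 3
 * Coarse dFdx = horizontal neighbor difference,
 * coarse dFdy = vertical neighbor difference. */
static float dfdx_coarse(const float quad[4]) { return quad[1] - quad[0]; }
static float dfdy_coarse(const float quad[4]) { return quad[2] - quad[0]; }
```

This is why I listed derivatives as per-lane but only "Read/Write(ish)": each lane's value depends on its quad neighbors.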

Are B1 and B2 stored in the same area? Are they stored per-core or shared between cores somehow? They’re often identical between many cores, but IIUC other cores can be performing different tasks.

How does the task-manager/thread-dispatch write B1/B2? In bulk / all-at-once, or granularly? Are these writes significant performance-wise? (kinda technical but related to a shader-design issue I have).

Thanks for all help / info.