SSE Questions

We do heavy number-crunchery at work and are looking into trying to use SSE to get better throughput. However, finding people/sources that can answer my basic questions about SSE is difficult without investing 20+ hours to figure things out. So I turn to folks here in hopes they can answer a few relatively (I think) simple questions:

  1. SSE registers (regardless of which flavor) all seem to be 128 bits wide, and there are 8 of them. We do very simple operations (N-body force problems: take N particles with x, y, z components and find the distance between N_i and N_j). My initial understanding was that I should be able to get at least 8 simultaneous operations on floats (8 registers with 4 floats per register). However, after hiring a CS student who worked on it for a week (before quitting), it looks like 4x is the total speedup possible, meaning a SIMD instruction only ever operates across two registers. Is this right? Is the only level of control I have issuing one binary instruction on two registers in a single clock cycle?

  2. If the above is not true (and I can expect more than a 4x speedup on simple arithmetic operations on floats), how do I go about getting said speedup? I’ve tried reordering my loops for the Intel C compiler to promote multiple directions of data aggregation, to no avail. Is it something I can’t do without coding in assembly?

Any pointers to overviews that are techy enough to actually get real, high-performance code written (i.e. not Wikipedia level) are greatly appreciated. Maybe one day I’ll be able to go through Intel’s documentation to figure this stuff out, but right now I simply don’t have the time to do so.


Wikipedia says there are actually 16 XMM registers on modern hardware (that’s in 64-bit mode; 32-bit code still only sees 8). I’d recommend looking at the SSE intrinsics docs for your compiler. Here’s the VS one.
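On the 4x question: the register count doesn't multiply throughput. Each packed single-precision instruction works on the 4 float lanes of its operands, so 4 results per instruction is where the "4x" comes from; the extra registers just give you room to keep more operands live. You don't need assembly to get at this, since the intrinsics map more or less one-to-one onto instructions. Here's a rough sketch of the distance step, not a tuned kernel. The function name `distances_from` is made up, and it assumes you store x, y and z in separate arrays (structure-of-arrays) with N padded to a multiple of 4, which may not match your current layout.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128 and the _mm_*_ps family */

/* Hypothetical helper: distance from particle i to every particle j,
   computed four j's per iteration.
   Assumptions: separate x/y/z arrays (structure-of-arrays), n % 4 == 0. */
void distances_from(const float *x, const float *y, const float *z,
                    size_t n, size_t i, float *out)
{
    assert(n % 4 == 0);

    /* Broadcast particle i's coordinates into all four lanes. */
    __m128 xi = _mm_set1_ps(x[i]);
    __m128 yi = _mm_set1_ps(y[i]);
    __m128 zi = _mm_set1_ps(z[i]);

    for (size_t j = 0; j < n; j += 4) {
        /* Unaligned loads, so the arrays need no special alignment. */
        __m128 dx = _mm_sub_ps(_mm_loadu_ps(x + j), xi);
        __m128 dy = _mm_sub_ps(_mm_loadu_ps(y + j), yi);
        __m128 dz = _mm_sub_ps(_mm_loadu_ps(z + j), zi);

        /* r^2 = dx^2 + dy^2 + dz^2, four pairs at once. */
        __m128 r2 = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)),
            _mm_mul_ps(dz, dz));

        _mm_storeu_ps(out + j, _mm_sqrt_ps(r2));
    }
}
```

Call it once per i with an output buffer of n floats. If your data is currently interleaved as xyzxyz..., you'd have to shuffle it into this layout first, and that shuffling tends to eat a good part of the gain.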