In general, we can divide the vector statements in a vectorized loop
into related groups ("rgroups") and say that for each rgroup there is
some nS such that the rgroup operates on nS values from one scalar
iteration followed by nS values from the next. That is, if VF is the
vectorization factor of the loop, the rgroup operates on a sequence:
(1,1) (1,2) ... (1,nS) (2,1) ... (2,nS) ... (VF,1) ... (VF,nS)
where (i,j) represents a scalar value with index j in a scalar
iteration with index i.
[ We use the term "rgroup" to emphasise that this grouping isn't
necessarily the same as the grouping of statements used elsewhere.
For example, if we implement a group of scalar loads using gather
loads, we'll use a separate gather load for each scalar load, and
thus each gather load will belong to its own rgroup. ]
In general this sequence will occupy nV vectors concatenated
together. If these vectors have nL lanes each, the total number
of scalar values N is given by:
N = nS * VF = nV * nL
None of nS, VF, nV and nL are required to be a power of 2. nS and nV
are compile-time constants but VF and nL can be variable (if the target
supports variable-length vectors).
In classical vectorization, each iteration of the vector loop would
handle exactly VF iterations of the original scalar loop. However,
in vector loops that are able to operate on partial vectors, a
particular iteration of the vector loop might handle fewer than VF
iterations of the scalar loop. The vector lanes that correspond to
iterations of the scalar loop are said to be "active" and the other
lanes are said to be "inactive".
In such vector loops, many rgroups need to be controlled to ensure
that they have no effect for the inactive lanes. Conceptually, each
such rgroup needs a sequence of booleans in the same order as above,
but with each (i,j) replaced by a boolean that indicates whether
iteration i is active. This sequence occupies nV vector controls
that again have nL lanes each. Thus the control sequence as a whole
consists of VF independent booleans that are each repeated nS times.
Taking mask-based approach as a partially-populated vectors example.
We make the simplifying assumption that if a sequence of nV masks is
suitable for one (nS,nL) pair, we can reuse it for (nS/2,nL/2) by
VIEW_CONVERTing it. This holds for all current targets that support
fully-masked loops. For example, suppose the scalar loop is:
float *f;
double *d;
for (int i = 0; i < n; ++i)
{
f[i * 2 + 0] += 1.0f;
f[i * 2 + 1] += 2.0f;
d[i] += 3.0;
}
and suppose that vectors have 256 bits. The vectorized f accesses
will belong to one rgroup and the vectorized d access to another:
f rgroup: nS = 2, nV = 1, nL = 8
d rgroup: nS = 1, nV = 1, nL = 4
VF = 4
[ In this simple example the rgroups do correspond to the normal
SLP grouping scheme. ]
If only the first three lanes are active, the masks we need are:
f rgroup: 1 1 | 1 1 | 1 1 | 0 0
d rgroup: 1 | 1 | 1 | 0
Here we can use a mask calculated for f's rgroup for d's, but not
vice versa.
Thus for each value of nV, it is enough to provide nV masks, with the
mask being calculated based on the highest nL (or, equivalently, based
on the highest nS) required by any rgroup with that nV. We therefore
represent the entire collection of masks as a two-level table, with the
first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
the second being indexed by the mask index 0 <= i < nV.
The controls (like masks or lengths) needed by rgroups with nV vectors,
according to the description above.