GCC Middle and Back End API Reference
vect_optimize_slp_pass Class Reference
Collaboration diagram for vect_optimize_slp_pass:

Public Member Functions

 vect_optimize_slp_pass (vec_info *vinfo)
 
void run ()
 

Private Member Functions

struct loopcontaining_loop (slp_tree)
 
bool is_cfg_latch_edge (graph_edge *)
 
void build_vertices (hash_set< slp_tree > &, slp_tree)
 
void build_vertices ()
 
void build_graph ()
 
void create_partitions ()
 
template<typename T >
void for_each_partition_edge (unsigned int, T)
 
bool is_compatible_layout (slp_tree, unsigned int)
 
int change_layout_cost (slp_tree, unsigned int, unsigned int)
 
slpg_partition_layout_costspartition_layout_costs (unsigned int, unsigned int)
 
void change_vec_perm_layout (slp_tree, lane_permutation_t &, int, unsigned int)
 
int internal_node_cost (slp_tree, int, unsigned int)
 
void start_choosing_layouts ()
 
slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int, unsigned int, unsigned int)
 
slpg_layout_cost total_in_cost (unsigned int)
 
slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int)
 
slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int)
 
void forward_pass ()
 
void backward_pass ()
 
slp_tree get_result_with_layout (slp_tree, unsigned int)
 
void materialize ()
 
void remove_redundant_permutations ()
 
void decide_masked_load_lanes ()
 
void dump ()
 

Private Attributes

vec_infom_vinfo
 
bool m_optimize_size
 
graphm_slpg = nullptr
 
auto_vec< slpg_vertexm_vertices
 
auto_vec< int > m_leafs
 
auto_vec< vec< unsigned > > m_perms
 
auto_vec< slpg_partition_infom_partitions
 
auto_vec< unsigned int > m_partitioned_nodes
 
auto_vec< slpg_partition_layout_costsm_partition_layout_costs
 
auto_vec< slp_treem_node_layouts
 

Detailed Description

This class tries to optimize the layout of vectors in order to avoid
unnecessary shuffling.  At the moment, the set of possible layouts are
restricted to bijective permutations.

The goal of the pass depends on whether we're optimizing for size or
for speed.  When optimizing for size, the goal is to reduce the overall
number of layout changes (including layout changes implied by things
like load permutations).  When optimizing for speed, the goal is to
reduce the maximum latency attributable to layout changes on any
non-cyclical path through the data flow graph.

For example, when optimizing a loop nest for speed, we will prefer
to make layout changes outside of a loop rather than inside of a loop,
and will prefer to make layout changes in parallel rather than serially,
even if that increases the overall number of layout changes.

The high-level procedure is:

(1) Build a graph in which edges go from uses (parents) to definitions
    (children).

(2) Divide the graph into a dag of strongly-connected components (SCCs).

(3) When optimizing for speed, partition the nodes in each SCC based
    on their containing cfg loop.  When optimizing for size, treat
    each SCC as a single partition.

    This gives us a dag of partitions.  The goal is now to assign a
    layout to each partition.

(4) Construct a set of vector layouts that are worth considering.
    Record which nodes must keep their current layout.

(5) Perform a forward walk over the partition dag (from loads to stores)
    accumulating the "forward" cost of using each layout.  When visiting
    each partition, assign a tentative choice of layout to the partition
    and use that choice when calculating the cost of using a different
    layout in successor partitions.

(6) Perform a backward walk over the partition dag (from stores to loads),
    accumulating the "backward" cost of using each layout.  When visiting
    each partition, make a final choice of layout for that partition based
    on the accumulated forward costs (from (5)) and backward costs
    (from (6)).

(7) Apply the chosen layouts to the SLP graph.

For example, consider the SLP statements:

S1:      a_1 = load
    loop:
S2:      a_2 = PHI<a_1, a_3>
S3:      b_1 = load
S4:      a_3 = a_2 + b_1
    exit:
S5:      a_4 = PHI<a_3>
S6:      store a_4

S2 and S4 form an SCC and are part of the same loop.  Every other
statement is in a singleton SCC.  In this example there is a one-to-one
mapping between SCCs and partitions and the partition dag looks like this;

     S1     S3
      \     /
       S2+S4
         |
        S5
         |
        S6

S2, S3 and S4 will have a higher execution frequency than the other
statements, so when optimizing for speed, the goal is to avoid any
layout changes:

- within S3
- within S2+S4
- on the S3->S2+S4 edge

For example, if S3 was originally a reversing load, the goal of the
pass is to make it an unreversed load and change the layout on the
S1->S2+S4 and S2+S4->S5 edges to compensate.  (Changing the layout
on S1->S2+S4 and S5->S6 would also be acceptable.)

The difference between SCCs and partitions becomes important if we
add an outer loop:

S1:      a_1 = ...
    loop1:
S2:      a_2 = PHI<a_1, a_6>
S3:      b_1 = load
S4:      a_3 = a_2 + b_1
    loop2:
S5:      a_4 = PHI<a_3, a_5>
S6:      c_1 = load
S7:      a_5 = a_4 + c_1
    exit2:
S8:      a_6 = PHI<a_5>
S9:      store a_6
    exit1:

Here, S2, S4, S5, S7 and S8 form a single SCC.  However, when optimizing
for speed, we usually do not want restrictions in the outer loop to "infect"
the decision for the inner loop.  For example, if an outer-loop node
in the SCC contains a statement with a fixed layout, that should not
prevent the inner loop from using a different layout.  Conversely,
the inner loop should not dictate a layout to the outer loop: if the
outer loop does a lot of computation, then it may not be efficient to
do all of that computation in the inner loop's preferred layout.

So when optimizing for speed, we partition the SCC into S2+S4+S8 (outer)
and S5+S7 (inner).  We also try to arrange partitions so that:

- the partition for an outer loop comes before the partition for
  an inner loop

- if a sibling loop A dominates a sibling loop B, A's partition
  comes before B's

This gives the following partition dag for the example above:

     S1        S3
      \        /
       S2+S4+S8   S6
        |   \\    /
        |    S5+S7
        |
       S9

There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and
one for a reversal of the edge S7->S8.

The backward walk picks a layout for S5+S7 before S2+S4+S8.  The choice
for S2+S4+S8 therefore has to balance the cost of using the outer loop's
preferred layout against the cost of changing the layout on entry to the
inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed).

Although this works well when optimizing for speed, it has the downside
when optimizing for size that the choice of layout for S5+S7 is completely
independent of S9, which lessens the chance of reducing the overall number
of permutations.  We therefore do not partition SCCs when optimizing
for size.

To give a concrete example of the difference between optimizing
for size and speed, consider:

a[0] = (b[1] << c[3]) - d[1];
a[1] = (b[0] << c[2]) - d[0];
a[2] = (b[3] << c[1]) - d[3];
a[3] = (b[2] << c[0]) - d[2];

There are three different layouts here: one for a, one for b and d,
and one for c.  When optimizing for speed it is better to permute each
of b, c and d into the order required by a, since those permutations
happen in parallel.  But when optimizing for size, it is better to:

- permute c into the same order as b
- do the arithmetic
- permute the result into the order required by a

This gives 2 permutations rather than 3.   

Constructor & Destructor Documentation

◆ vect_optimize_slp_pass()

vect_optimize_slp_pass::vect_optimize_slp_pass ( vec_info * vinfo)
inline

Member Function Documentation

◆ backward_cost()

slpg_layout_cost vect_optimize_slp_pass::backward_cost ( graph_edge * ud,
unsigned int to_node_i,
unsigned int from_layout_i )
private
UD represents a use-def link between TO_NODE_I and a node in an earlier
partition; TO_NODE_I could be the definition node or the use node.
The node at the other end of the link wants to use layout FROM_LAYOUT_I;
return the cost of any necessary fix-ups on edge UD, or
slpg_layout_cost::impossible () if the choice cannot be made.

At this point, TO_NODE_I's partition has a fixed choice of layout.   

References slpg_layout_cost::add_serial_cost(), graph_edge::dest, edge_layout_cost(), gcc_assert, slpg_layout_cost::impossible(), slpg_partition_info::in_degree, internal_node_cost(), slpg_partition_info::layout, m_optimize_size, m_partitions, m_vertices, partition_layout_costs(), SLP_TREE_CODE, slpg_layout_cost::split(), and graph_edge::src.

Referenced by backward_pass().

◆ backward_pass()

void vect_optimize_slp_pass::backward_pass ( )
private

◆ build_graph()

void vect_optimize_slp_pass::build_graph ( )
private
Build the graph.  Mark edges that correspond to cfg loop latch edges with
a nonnull data field.   

References add_edge(), build_vertices(), graph_edge::data, is_cfg_latch_edge(), m_optimize_size, m_slpg, m_vertices, new_graph(), and SLP_TREE_CHILDREN.

Referenced by run().

◆ build_vertices() [1/2]

void vect_optimize_slp_pass::build_vertices ( )
private
Fill the vertices and leafs vector with all nodes in the SLP graph.   

References build_vertices(), FOR_EACH_VEC_ELT, i, m_leafs, m_vertices, m_vinfo, SLP_INSTANCE_TREE, vec_info::slp_instances, and visited.

Referenced by build_graph(), build_vertices(), and build_vertices().

◆ build_vertices() [2/2]

void vect_optimize_slp_pass::build_vertices ( hash_set< slp_tree > & visited,
slp_tree node )
private
Fill the vertices and leafs vector with all nodes in the SLP graph.
Also record whether we should optimize anything for speed rather
than size.   

References build_vertices(), FOR_EACH_VEC_ELT, gimple_bb(), i, m_leafs, m_optimize_size, m_vertices, optimize_bb_for_speed_p(), SLP_TREE_CHILDREN, SLP_TREE_REPRESENTATIVE, vect_orig_stmt(), _slp_tree::vertex, and visited.

◆ change_layout_cost()

int vect_optimize_slp_pass::change_layout_cost ( slp_tree node,
unsigned int from_layout_i,
unsigned int to_layout_i )
private
Return the cost (in arbtirary units) of going from layout FROM_LAYOUT_I
to layout TO_LAYOUT_I for a node like NODE.  Return -1 if either of the
layouts is incompatible with NODE or if the change is not possible for
some other reason.

The properties taken from NODE include the number of lanes and the
vector type.  The actual operation doesn't matter.   

References count, i, is_compatible_layout(), m_perms, m_vinfo, MAX, SLP_TREE_LANES, vect_slp_permute(), and vectorizable_slp_permutation_1().

Referenced by edge_layout_cost().

◆ change_vec_perm_layout()

void vect_optimize_slp_pass::change_vec_perm_layout ( slp_tree node,
lane_permutation_t & perm,
int in_layout_i,
unsigned int out_layout_i )
private
Change PERM in one of two ways:

- if IN_LAYOUT_I < 0, accept input operand I in the layout that has been
  chosen for child I of NODE.

- if IN_LAYOUT >= 0, accept all inputs operands with that layout.

In both cases, arrange for the output to have layout OUT_LAYOUT_I   

References m_partitions, m_perms, m_vertices, SLP_TREE_CHILDREN, vect_slp_permute(), and _slp_tree::vertex.

Referenced by internal_node_cost(), and materialize().

◆ containing_loop()

struct loop * vect_optimize_slp_pass::containing_loop ( slp_tree node)
private

◆ create_partitions()

◆ decide_masked_load_lanes()

◆ dump()

◆ edge_layout_cost()

slpg_layout_cost vect_optimize_slp_pass::edge_layout_cost ( graph_edge * ud,
unsigned int node1_i,
unsigned int layout1_i,
unsigned int layout2_i )
private
Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
and layout LAYOUT2_I on cross-partition use-to-def edge UD.  Return
slpg_layout_cost::impossible () if the change isn't possible.   

References change_layout_cost(), graph_edge::dest, slpg_layout_cost::impossible(), m_optimize_size, m_vertices, slpg_layout_cost::split(), and graph_edge::src.

Referenced by backward_cost(), backward_pass(), forward_cost(), and forward_pass().

◆ for_each_partition_edge()

template<typename T >
void vect_optimize_slp_pass::for_each_partition_edge ( unsigned int node_i,
T fn )
private
Look for edges from earlier partitions into node NODE_I and edges from
node NODE_I into later partitions.  Call:

   FN (ud, other_node_i)

for each such use-to-def edge ud, where other_node_i is the node at the
other end of the edge.   

References m_slpg, m_vertices, vertex::pred, graph_edge::pred_next, vertex::succ, graph_edge::succ_next, and graph::vertices.

Referenced by backward_pass(), dump(), forward_pass(), start_choosing_layouts(), and total_in_cost().

◆ forward_cost()

slpg_layout_cost vect_optimize_slp_pass::forward_cost ( graph_edge * ud,
unsigned int from_node_i,
unsigned int to_layout_i )
private
UD represents a use-def link between FROM_NODE_I and a node in a later
partition; FROM_NODE_I could be the definition node or the use node.
The node at the other end of the link wants to use layout TO_LAYOUT_I.
Return the cost of any necessary fix-ups on edge UD, or return
slpg_layout_cost::impossible () if the change isn't possible.

At this point, FROM_NODE_I's partition has chosen the cheapest
layout based on the information available so far, but this choice
is only provisional.   

References slpg_layout_cost::add_serial_cost(), edge_layout_cost(), gcc_assert, slpg_layout_cost::impossible(), slpg_layout_cost::is_better_than(), slpg_layout_cost::is_possible(), slpg_partition_info::layout, m_optimize_size, m_partitions, m_vertices, slpg_partition_info::out_degree, partition_layout_costs(), and slpg_layout_cost::split().

Referenced by forward_pass().

◆ forward_pass()

void vect_optimize_slp_pass::forward_pass ( )
private
Make a forward pass through the partitions, accumulating input costs.
Make a tentative (provisional) choice of layout for each partition,
ensuring that this choice still allows later partitions to keep
their original layout.   

References add_cost(), slpg_layout_cost::add_serial_cost(), edge_layout_cost(), for_each_partition_edge(), forward_cost(), gcc_assert, slpg_layout_cost::impossible(), internal_node_cost(), slpg_layout_cost::is_better_than(), is_compatible_layout(), slpg_layout_cost::is_possible(), m_optimize_size, m_partitioned_nodes, m_partitions, m_perms, m_vertices, partition_layout_costs(), SLP_TREE_CODE, total_in_cost(), and _slp_tree::vertex.

Referenced by run().

◆ get_result_with_layout()

slp_tree vect_optimize_slp_pass::get_result_with_layout ( slp_tree node,
unsigned int to_layout_i )
private

◆ internal_node_cost()

int vect_optimize_slp_pass::internal_node_cost ( slp_tree node,
int in_layout_i,
unsigned int out_layout_i )
private
Check whether the target allows NODE to be rearranged so that the node's
output has layout OUT_LAYOUT_I.  Return the cost of the change if so,
in the same arbitrary units as for change_layout_cost.  Return -1 otherwise.

If NODE is a VEC_PERM_EXPR and IN_LAYOUT_I < 0, also check whether
NODE can adapt to the layout changes that have (perhaps provisionally)
been chosen for NODE's children, so that no extra permutations are
needed on either the input or the output of NODE.

If NODE is a VEC_PERM_EXPR and IN_LAYOUT_I >= 0, instead assume
that all inputs will be forced into layout IN_LAYOUT_I beforehand.

IN_LAYOUT_I has no meaning for other types of node.

Keeping the node as-is is always valid.  If the target doesn't appear
to support the node as-is, but might realistically support other layouts,
then layout 0 instead has the cost of a worst-case permutation.  On the
one hand, this ensures that every node has at least one valid layout,
avoiding what would otherwise be an awkward special case.  On the other,
it still encourages the pass to change an invalid pre-existing layout
choice into a valid one.   

References change_vec_perm_layout(), count, DR_GROUP_FIRST_ELEMENT, DR_GROUP_SIZE, DR_IS_READ, dyn_cast(), is_compatible_layout(), LOOP_VINFO_VECT_FACTOR, m_perms, m_vinfo, SLP_TREE_CHILDREN, SLP_TREE_CODE, SLP_TREE_LANE_PERMUTATION, SLP_TREE_LANES, SLP_TREE_LOAD_PERMUTATION, SLP_TREE_REPRESENTATIVE, STMT_VINFO_DATA_REF, STMT_VINFO_GROUPED_ACCESS, vect_slp_permute(), vect_transform_slp_perm_load_1(), vectorizable_slp_permutation_1(), and vNULL.

Referenced by backward_cost(), and forward_pass().

◆ is_cfg_latch_edge()

bool vect_optimize_slp_pass::is_cfg_latch_edge ( graph_edge * ud)
private
Return true if UD (an edge from a use to a definition) is associated
with a loop latch edge in the cfg.   

References bb_loop_header_p(), containing_loop(), graph_edge::dest, gimple_bb(), is_a(), m_vertices, SLP_TREE_CODE, SLP_TREE_DEF_TYPE, SLP_TREE_REPRESENTATIVE, graph_edge::src, vect_internal_def, and vect_orig_stmt().

Referenced by build_graph().

◆ is_compatible_layout()

bool vect_optimize_slp_pass::is_compatible_layout ( slp_tree node,
unsigned int layout_i )
private
Return true if layout LAYOUT_I is compatible with the number of SLP lanes
that NODE would operate on.  This test is independent of NODE's actual
operation.   

References m_perms, and SLP_TREE_LANES.

Referenced by change_layout_cost(), forward_pass(), and internal_node_cost().

◆ materialize()

◆ partition_layout_costs()

slpg_partition_layout_costs & vect_optimize_slp_pass::partition_layout_costs ( unsigned int partition_i,
unsigned int layout_i )
inlineprivate
Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.   

References m_partition_layout_costs, and m_perms.

Referenced by backward_cost(), backward_pass(), dump(), forward_cost(), forward_pass(), and total_in_cost().

◆ remove_redundant_permutations()

void vect_optimize_slp_pass::remove_redundant_permutations ( )
private
Elide load permutations that are not necessary.  Such permutations might
be pre-existing, rather than created by the layout optimizations.   

References as_a(), DR_GROUP_FIRST_ELEMENT, DR_GROUP_GAP, DR_GROUP_NEXT_ELEMENT, DR_GROUP_SIZE, FOR_EACH_VEC_ELT, is_a(), known_eq, LOOP_VINFO_VECT_FACTOR, m_leafs, m_vertices, m_vinfo, NULL, SLP_TREE_LANES, SLP_TREE_LOAD_PERMUTATION, SLP_TREE_SCALAR_STMTS, and STMT_VINFO_GROUPED_ACCESS.

Referenced by materialize(), and run().

◆ run()

void vect_optimize_slp_pass::run ( )

◆ start_choosing_layouts()

◆ total_in_cost()

slpg_layout_cost vect_optimize_slp_pass::total_in_cost ( unsigned int node_i)
private
Return the incoming costs for node NODE_I, assuming that each input keeps
its current (provisional) choice of layout.  The inputs do not necessarily
have the same layout as each other.   

References add_cost(), slpg_layout_cost::add_parallel_cost(), slpg_layout_cost::add_serial_cost(), for_each_partition_edge(), m_partitions, m_vertices, partition_layout_costs(), and slpg_layout_cost::split().

Referenced by forward_pass().

Field Documentation

◆ m_leafs

auto_vec<int> vect_optimize_slp_pass::m_leafs
private

◆ m_node_layouts

auto_vec<slp_tree> vect_optimize_slp_pass::m_node_layouts
private

◆ m_optimize_size

◆ m_partition_layout_costs

auto_vec<slpg_partition_layout_costs> vect_optimize_slp_pass::m_partition_layout_costs
private

◆ m_partitioned_nodes

auto_vec<unsigned int> vect_optimize_slp_pass::m_partitioned_nodes
private

◆ m_partitions

◆ m_perms

◆ m_slpg

graph* vect_optimize_slp_pass::m_slpg = nullptr
private

◆ m_vertices

◆ m_vinfo


The documentation for this class was generated from the following file: