GCC Middle and Back End API Reference
|
Public Member Functions | |
vect_optimize_slp_pass (vec_info *vinfo) | |
void | run () |
Private Attributes | |
vec_info * | m_vinfo |
bool | m_optimize_size |
graph * | m_slpg = nullptr |
auto_vec< slpg_vertex > | m_vertices |
auto_vec< int > | m_leafs |
auto_vec< vec< unsigned > > | m_perms |
auto_vec< slpg_partition_info > | m_partitions |
auto_vec< unsigned int > | m_partitioned_nodes |
auto_vec< slpg_partition_layout_costs > | m_partition_layout_costs |
auto_vec< slp_tree > | m_node_layouts |
This class tries to optimize the layout of vectors in order to avoid unnecessary shuffling. At the moment, the set of possible layouts are restricted to bijective permutations. The goal of the pass depends on whether we're optimizing for size or for speed. When optimizing for size, the goal is to reduce the overall number of layout changes (including layout changes implied by things like load permutations). When optimizing for speed, the goal is to reduce the maximum latency attributable to layout changes on any non-cyclical path through the data flow graph. For example, when optimizing a loop nest for speed, we will prefer to make layout changes outside of a loop rather than inside of a loop, and will prefer to make layout changes in parallel rather than serially, even if that increases the overall number of layout changes. The high-level procedure is: (1) Build a graph in which edges go from uses (parents) to definitions (children). (2) Divide the graph into a dag of strongly-connected components (SCCs). (3) When optimizing for speed, partition the nodes in each SCC based on their containing cfg loop. When optimizing for size, treat each SCC as a single partition. This gives us a dag of partitions. The goal is now to assign a layout to each partition. (4) Construct a set of vector layouts that are worth considering. Record which nodes must keep their current layout. (5) Perform a forward walk over the partition dag (from loads to stores) accumulating the "forward" cost of using each layout. When visiting each partition, assign a tentative choice of layout to the partition and use that choice when calculating the cost of using a different layout in successor partitions. (6) Perform a backward walk over the partition dag (from stores to loads), accumulating the "backward" cost of using each layout. When visiting each partition, make a final choice of layout for that partition based on the accumulated forward costs (from (5)) and backward costs (from (6)). (7) Apply the chosen layouts to the SLP graph. For example, consider the SLP statements: S1: a_1 = load loop: S2: a_2 = PHI<a_1, a_3> S3: b_1 = load S4: a_3 = a_2 + b_1 exit: S5: a_4 = PHI<a_3> S6: store a_4 S2 and S4 form an SCC and are part of the same loop. Every other statement is in a singleton SCC. In this example there is a one-to-one mapping between SCCs and partitions and the partition dag looks like this; S1 S3 \ / S2+S4 | S5 | S6 S2, S3 and S4 will have a higher execution frequency than the other statements, so when optimizing for speed, the goal is to avoid any layout changes: - within S3 - within S2+S4 - on the S3->S2+S4 edge For example, if S3 was originally a reversing load, the goal of the pass is to make it an unreversed load and change the layout on the S1->S2+S4 and S2+S4->S5 edges to compensate. (Changing the layout on S1->S2+S4 and S5->S6 would also be acceptable.) The difference between SCCs and partitions becomes important if we add an outer loop: S1: a_1 = ... loop1: S2: a_2 = PHI<a_1, a_6> S3: b_1 = load S4: a_3 = a_2 + b_1 loop2: S5: a_4 = PHI<a_3, a_5> S6: c_1 = load S7: a_5 = a_4 + c_1 exit2: S8: a_6 = PHI<a_5> S9: store a_6 exit1: Here, S2, S4, S5, S7 and S8 form a single SCC. However, when optimizing for speed, we usually do not want restrictions in the outer loop to "infect" the decision for the inner loop. For example, if an outer-loop node in the SCC contains a statement with a fixed layout, that should not prevent the inner loop from using a different layout. Conversely, the inner loop should not dictate a layout to the outer loop: if the outer loop does a lot of computation, then it may not be efficient to do all of that computation in the inner loop's preferred layout. So when optimizing for speed, we partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner). We also try to arrange partitions so that: - the partition for an outer loop comes before the partition for an inner loop - if a sibling loop A dominates a sibling loop B, A's partition comes before B's This gives the following partition dag for the example above: S1 S3 \ / S2+S4+S8 S6 | \\ / | S5+S7 | S9 There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and one for a reversal of the edge S7->S8. The backward walk picks a layout for S5+S7 before S2+S4+S8. The choice for S2+S4+S8 therefore has to balance the cost of using the outer loop's preferred layout against the cost of changing the layout on entry to the inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed). Although this works well when optimizing for speed, it has the downside when optimizing for size that the choice of layout for S5+S7 is completely independent of S9, which lessens the chance of reducing the overall number of permutations. We therefore do not partition SCCs when optimizing for size. To give a concrete example of the difference between optimizing for size and speed, consider: a[0] = (b[1] << c[3]) - d[1]; a[1] = (b[0] << c[2]) - d[0]; a[2] = (b[3] << c[1]) - d[3]; a[3] = (b[2] << c[0]) - d[2]; There are three different layouts here: one for a, one for b and d, and one for c. When optimizing for speed it is better to permute each of b, c and d into the order required by a, since those permutations happen in parallel. But when optimizing for size, it is better to: - permute c into the same order as b - do the arithmetic - permute the result into the order required by a This gives 2 permutations rather than 3.
|
inline |
|
private |
UD represents a use-def link between TO_NODE_I and a node in an earlier partition; TO_NODE_I could be the definition node or the use node. The node at the other end of the link wants to use layout FROM_LAYOUT_I; return the cost of any necessary fix-ups on edge UD, or slpg_layout_cost::impossible () if the choice cannot be made. At this point, TO_NODE_I's partition has a fixed choice of layout.
References slpg_layout_cost::add_serial_cost(), graph_edge::dest, edge_layout_cost(), gcc_assert, slpg_layout_cost::impossible(), slpg_partition_info::in_degree, internal_node_cost(), slpg_partition_info::layout, m_optimize_size, m_partitions, m_vertices, partition_layout_costs(), SLP_TREE_CODE, slpg_layout_cost::split(), and graph_edge::src.
Referenced by backward_pass().
|
private |
Make a backward pass through the partitions, accumulating output costs. Make a final choice of layout for each partition.
References add_cost(), slpg_layout_cost::add_serial_cost(), backward_cost(), edge_layout_cost(), for_each_partition_edge(), gcc_assert, slpg_layout_cost::impossible(), slpg_layout_cost::is_better_than(), slpg_layout_cost::is_possible(), m_optimize_size, m_partitioned_nodes, m_partitions, m_perms, m_vertices, and partition_layout_costs().
Referenced by run().
|
private |
Build the graph. Mark edges that correspond to cfg loop latch edges with a nonnull data field.
References add_edge(), build_vertices(), graph_edge::data, is_cfg_latch_edge(), m_optimize_size, m_slpg, m_vertices, new_graph(), and SLP_TREE_CHILDREN.
Referenced by run().
|
private |
Fill the vertices and leafs vector with all nodes in the SLP graph.
References build_vertices(), FOR_EACH_VEC_ELT, i, m_leafs, m_vertices, m_vinfo, SLP_INSTANCE_TREE, vec_info::slp_instances, and visited.
Referenced by build_graph(), build_vertices(), and build_vertices().
|
private |
Fill the vertices and leafs vector with all nodes in the SLP graph. Also record whether we should optimize anything for speed rather than size.
References build_vertices(), FOR_EACH_VEC_ELT, gimple_bb(), i, m_leafs, m_optimize_size, m_vertices, optimize_bb_for_speed_p(), SLP_TREE_CHILDREN, SLP_TREE_REPRESENTATIVE, vect_orig_stmt(), _slp_tree::vertex, and visited.
|
private |
Return the cost (in arbtirary units) of going from layout FROM_LAYOUT_I to layout TO_LAYOUT_I for a node like NODE. Return -1 if either of the layouts is incompatible with NODE or if the change is not possible for some other reason. The properties taken from NODE include the number of lanes and the vector type. The actual operation doesn't matter.
References count, i, is_compatible_layout(), m_perms, m_vinfo, MAX, SLP_TREE_LANES, vect_slp_permute(), and vectorizable_slp_permutation_1().
Referenced by edge_layout_cost().
|
private |
Change PERM in one of two ways: - if IN_LAYOUT_I < 0, accept input operand I in the layout that has been chosen for child I of NODE. - if IN_LAYOUT >= 0, accept all inputs operands with that layout. In both cases, arrange for the output to have layout OUT_LAYOUT_I
References m_partitions, m_perms, m_vertices, SLP_TREE_CHILDREN, vect_slp_permute(), and _slp_tree::vertex.
Referenced by internal_node_cost(), and materialize().
Return the cfg loop that contains NODE.
References cfun, ENTRY_BLOCK_PTR_FOR_FN, gimple_bb(), basic_block_def::loop_father, SLP_TREE_REPRESENTATIVE, and vect_orig_stmt().
Referenced by create_partitions(), and is_cfg_latch_edge().
|
private |
Create the node partitions.
References vertex::component, containing_loop(), hash_map< KeyId, Value, Traits >::empty(), gcc_assert, hash_map< KeyId, Value, Traits >::get_or_insert(), graphds_dfs(), graphds_scc(), m_leafs, m_optimize_size, m_partitioned_nodes, m_partitions, m_slpg, m_vertices, NULL, skip_cfg_latch_edges(), SLP_TREE_DEF_TYPE, SLP_TREE_VEC_DEFS, vect_constant_def, vect_external_def, and graph::vertices.
Referenced by run().
|
private |
Masked load lanes discovery.
References compare_step_with_zero(), DR_GROUP_FIRST_ELEMENT, DR_GROUP_SIZE, gcc_assert, gimple_call_internal_p(), is_a(), _slp_tree::ldst_lanes, m_slpg, m_vertices, m_vinfo, vertex::pred, graph_edge::pred_next, SLP_TREE_CHILDREN, SLP_TREE_CODE, SLP_TREE_DEF_TYPE, SLP_TREE_LANE_PERMUTATION, SLP_TREE_LANES, SLP_TREE_REF_COUNT, SLP_TREE_REPRESENTATIVE, SLP_TREE_VECTYPE, STMT_VINFO_GROUPED_ACCESS, STMT_VINFO_SLP_VECT_ONLY, STMT_VINFO_STMT, STMT_VINFO_STRIDED_P, vect_free_slp_tree(), vect_internal_def, vect_load_lanes_supported(), _slp_tree::vertex, and graph::vertices.
Referenced by run().
|
private |
Print the partition graph and layout information to the dump file.
References slpg_layout_cost::add_serial_cost(), slpg_layout_cost::depth, dump_printf(), dump_printf_loc(), for_each_partition_edge(), m_partitioned_nodes, m_partitions, m_perms, m_vertices, MSG_NOTE, partition_layout_costs(), print_edge(), SLP_TREE_CODE, SLP_TREE_REPRESENTATIVE, TEMPLATE, sreal::to_double(), slpg_layout_cost::total, and vect_location.
Referenced by run().
|
private |
Return the cost of switching between layout LAYOUT1_I (at node NODE1_I) and layout LAYOUT2_I on cross-partition use-to-def edge UD. Return slpg_layout_cost::impossible () if the change isn't possible.
References change_layout_cost(), graph_edge::dest, slpg_layout_cost::impossible(), m_optimize_size, m_vertices, slpg_layout_cost::split(), and graph_edge::src.
Referenced by backward_cost(), backward_pass(), forward_cost(), and forward_pass().
|
private |
Look for edges from earlier partitions into node NODE_I and edges from node NODE_I into later partitions. Call: FN (ud, other_node_i) for each such use-to-def edge ud, where other_node_i is the node at the other end of the edge.
References m_slpg, m_vertices, vertex::pred, graph_edge::pred_next, vertex::succ, graph_edge::succ_next, and graph::vertices.
Referenced by backward_pass(), dump(), forward_pass(), start_choosing_layouts(), and total_in_cost().
|
private |
UD represents a use-def link between FROM_NODE_I and a node in a later partition; FROM_NODE_I could be the definition node or the use node. The node at the other end of the link wants to use layout TO_LAYOUT_I. Return the cost of any necessary fix-ups on edge UD, or return slpg_layout_cost::impossible () if the change isn't possible. At this point, FROM_NODE_I's partition has chosen the cheapest layout based on the information available so far, but this choice is only provisional.
References slpg_layout_cost::add_serial_cost(), edge_layout_cost(), gcc_assert, slpg_layout_cost::impossible(), slpg_layout_cost::is_better_than(), slpg_layout_cost::is_possible(), slpg_partition_info::layout, m_optimize_size, m_partitions, m_vertices, slpg_partition_info::out_degree, partition_layout_costs(), and slpg_layout_cost::split().
Referenced by forward_pass().
|
private |
Make a forward pass through the partitions, accumulating input costs. Make a tentative (provisional) choice of layout for each partition, ensuring that this choice still allows later partitions to keep their original layout.
References add_cost(), slpg_layout_cost::add_serial_cost(), edge_layout_cost(), for_each_partition_edge(), forward_cost(), gcc_assert, slpg_layout_cost::impossible(), internal_node_cost(), slpg_layout_cost::is_better_than(), is_compatible_layout(), slpg_layout_cost::is_possible(), m_optimize_size, m_partitioned_nodes, m_partitions, m_perms, m_vertices, partition_layout_costs(), SLP_TREE_CODE, total_in_cost(), and _slp_tree::vertex.
Referenced by run().
|
private |
Return a node that applies layout TO_LAYOUT_I to the original form of NODE. NODE already has the layout that was selected for its partition.
References dump_enabled_p(), dump_printf_loc(), is_empty(), m_node_layouts, m_partitions, m_perms, m_vertices, m_vinfo, MSG_NOTE, SLP_TREE_CHILDREN, SLP_TREE_CODE, SLP_TREE_DEF_TYPE, SLP_TREE_LANE_PERMUTATION, SLP_TREE_LANES, SLP_TREE_REPRESENTATIVE, SLP_TREE_SCALAR_OPS, SLP_TREE_SCALAR_STMTS, SLP_TREE_VEC_DEFS, SLP_TREE_VECTYPE, vect_constant_def, vect_create_new_slp_node(), vect_external_def, vect_location, vect_slp_permute(), vect_slp_tree_uniform_p(), vectorizable_slp_permutation_1(), and _slp_tree::vertex.
Referenced by materialize().
|
private |
Check whether the target allows NODE to be rearranged so that the node's output has layout OUT_LAYOUT_I. Return the cost of the change if so, in the same arbitrary units as for change_layout_cost. Return -1 otherwise. If NODE is a VEC_PERM_EXPR and IN_LAYOUT_I < 0, also check whether NODE can adapt to the layout changes that have (perhaps provisionally) been chosen for NODE's children, so that no extra permutations are needed on either the input or the output of NODE. If NODE is a VEC_PERM_EXPR and IN_LAYOUT_I >= 0, instead assume that all inputs will be forced into layout IN_LAYOUT_I beforehand. IN_LAYOUT_I has no meaning for other types of node. Keeping the node as-is is always valid. If the target doesn't appear to support the node as-is, but might realistically support other layouts, then layout 0 instead has the cost of a worst-case permutation. On the one hand, this ensures that every node has at least one valid layout, avoiding what would otherwise be an awkward special case. On the other, it still encourages the pass to change an invalid pre-existing layout choice into a valid one.
References change_vec_perm_layout(), count, DR_GROUP_FIRST_ELEMENT, DR_GROUP_SIZE, DR_IS_READ, dyn_cast(), is_compatible_layout(), LOOP_VINFO_VECT_FACTOR, m_perms, m_vinfo, SLP_TREE_CHILDREN, SLP_TREE_CODE, SLP_TREE_LANE_PERMUTATION, SLP_TREE_LANES, SLP_TREE_LOAD_PERMUTATION, SLP_TREE_REPRESENTATIVE, STMT_VINFO_DATA_REF, STMT_VINFO_GROUPED_ACCESS, vect_slp_permute(), vect_transform_slp_perm_load_1(), vectorizable_slp_permutation_1(), and vNULL.
Referenced by backward_cost(), and forward_pass().
|
private |
Return true if UD (an edge from a use to a definition) is associated with a loop latch edge in the cfg.
References bb_loop_header_p(), containing_loop(), graph_edge::dest, gimple_bb(), is_a(), m_vertices, SLP_TREE_CODE, SLP_TREE_DEF_TYPE, SLP_TREE_REPRESENTATIVE, graph_edge::src, vect_internal_def, and vect_orig_stmt().
Referenced by build_graph().
Return true if layout LAYOUT_I is compatible with the number of SLP lanes that NODE would operate on. This test is independent of NODE's actual operation.
References m_perms, and SLP_TREE_LANES.
Referenced by change_layout_cost(), forward_pass(), and internal_node_cost().
|
private |
Apply the chosen vector layouts to the SLP graph.
References bitmap_bit_p, bitmap_clear(), bitmap_set_bit, change_vec_perm_layout(), dump_enabled_p(), dump_printf_loc(), FOR_EACH_VEC_ELT, gcc_assert, get_result_with_layout(), m_node_layouts, m_partition_layout_costs, m_partitioned_nodes, m_partitions, m_perms, m_vertices, m_vinfo, MSG_NOTE, _slp_tree::refcnt, remove_redundant_permutations(), SLP_TREE_CHILDREN, SLP_TREE_CODE, SLP_TREE_LANE_PERMUTATION, SLP_TREE_LOAD_PERMUTATION, SLP_TREE_SCALAR_STMTS, vect_free_slp_tree(), vect_location, vect_slp_permute(), and vectorizable_slp_permutation_1().
Referenced by run().
|
inlineprivate |
Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.
References m_partition_layout_costs, and m_perms.
Referenced by backward_cost(), backward_pass(), dump(), forward_cost(), forward_pass(), and total_in_cost().
|
private |
Elide load permutations that are not necessary. Such permutations might be pre-existing, rather than created by the layout optimizations.
References as_a(), DR_GROUP_FIRST_ELEMENT, DR_GROUP_GAP, DR_GROUP_NEXT_ELEMENT, DR_GROUP_SIZE, FOR_EACH_VEC_ELT, is_a(), known_eq, LOOP_VINFO_VECT_FACTOR, m_leafs, m_vertices, m_vinfo, NULL, SLP_TREE_LANES, SLP_TREE_LOAD_PERMUTATION, SLP_TREE_SCALAR_STMTS, and STMT_VINFO_GROUPED_ACCESS.
Referenced by materialize(), and run().
void vect_optimize_slp_pass::run | ( | ) |
Main entry point for the SLP graph optimization pass.
References backward_pass(), build_graph(), create_partitions(), decide_masked_load_lanes(), dump(), dump_enabled_p(), forward_pass(), free_graph(), m_perms, m_slpg, materialize(), remove_redundant_permutations(), and start_choosing_layouts().
Referenced by vect_optimize_slp().
|
private |
Decide which element layouts we should consider using. Calculate the weights associated with inserting layout changes on partition edges. Also mark partitions that cannot change layout, by setting their layout to zero.
References bitmap_bit_p, bitmap_clear(), bitmap_set_bit, DR_GROUP_FIRST_ELEMENT, DR_GROUP_SIZE, for_each_partition_edge(), gcc_assert, hash_map< KeyId, Value, Traits >::get(), hash_map< KeyId, Value, Traits >::get_or_insert(), gimple_call_combined_fn(), gimple_get_lhs(), info_for_reduction(), is_gimple_call(), m_partition_layout_costs, m_partitioned_nodes, m_partitions, m_perms, m_slpg, m_vertices, m_vinfo, MAX, MIN, needs_fold_left_reduction_p(), slp_inst_kind_ctor, slp_inst_kind_reduc_chain, SLP_INSTANCE_KIND, SLP_INSTANCE_TREE, vec_info::slp_instances, SLP_TREE_CHILDREN, SLP_TREE_CODE, SLP_TREE_LANE_PERMUTATION, SLP_TREE_LANES, SLP_TREE_LOAD_PERMUTATION, SLP_TREE_REPRESENTATIVE, SLP_TREE_VECTYPE, graph_edge::src, STMT_VINFO_DATA_REF, STMT_VINFO_GROUPED_ACCESS, STMT_VINFO_REDUC_CODE, STMT_VINFO_STMT, vertex::succ, TREE_TYPE, TYPE_VECTOR_SUBPARTS(), vect_slp_node_weight(), graph::vertices, and vNULL.
Referenced by run().
|
private |
Return the incoming costs for node NODE_I, assuming that each input keeps its current (provisional) choice of layout. The inputs do not necessarily have the same layout as each other.
References add_cost(), slpg_layout_cost::add_parallel_cost(), slpg_layout_cost::add_serial_cost(), for_each_partition_edge(), m_partitions, m_vertices, partition_layout_costs(), and slpg_layout_cost::split().
Referenced by forward_pass().
|
private |
Referenced by build_vertices(), build_vertices(), create_partitions(), and remove_redundant_permutations().
Referenced by get_result_with_layout(), and materialize().
|
private |
Referenced by backward_cost(), backward_pass(), build_graph(), build_vertices(), create_partitions(), edge_layout_cost(), forward_cost(), and forward_pass().
|
private |
Referenced by materialize(), partition_layout_costs(), and start_choosing_layouts().
|
private |
Referenced by backward_pass(), create_partitions(), dump(), forward_pass(), materialize(), and start_choosing_layouts().
|
private |
|
private |
Referenced by build_graph(), create_partitions(), decide_masked_load_lanes(), for_each_partition_edge(), run(), and start_choosing_layouts().
|
private |
Referenced by backward_cost(), backward_pass(), build_graph(), build_vertices(), build_vertices(), change_vec_perm_layout(), create_partitions(), decide_masked_load_lanes(), dump(), edge_layout_cost(), for_each_partition_edge(), forward_cost(), forward_pass(), get_result_with_layout(), is_cfg_latch_edge(), materialize(), remove_redundant_permutations(), start_choosing_layouts(), and total_in_cost().
|
private |