| Age | Commit message | Author |
|
|
|
- Many stylistic nits.
- Removed blkmerge().
- Some minor bug fixes.
- GCM reassoc is now "sink": a pass that
  moves trivial ops into their target block
  with the same goal of reducing register
  pressure, but starting from instructions
  that benefit from having their inputs
  close.
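A hedged IL sketch of the kind of motion sink performs
(example mine, not from the patch):
function w $f(w %c, l %base) {
@start
	%addr =l add %base, 8     # before sink: %addr is live
	jnz %c, @yes, @no         # across the branch
@yes
	%v =w loadw %addr
	ret %v
@no
	ret 0
}
Sink would move the add into @yes, next to the load that
consumes it, so %addr no longer ties up a register on the
@no path.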
|
|
More or less as proposed in its ninth iteration, with the
addition of gcmmove() functionality to restore coherent
local schedules.
Changes since RFC 8:
Features:
- generalization of phi 1/0 detection
- collapse linear jmp chains before GVN; simplifies if-graph
detection used in 0/non-0 value inference and if-elim...
- infer 0/non-0 values from dominating blk jnz; eliminates
redundant cmp eq/ne 0 and associated jnz/blocks, for example
redundant null pointer checks (hare codebase likes this)
- remove (emergent) empty if-then-else graphlets between GVN and
GCM; improves GCM instruction placement, particularly cmps.
- merge %addr =l add %addr1, N sequences - reduces tmp count,
register pressure.
- squash consecutive associative ops with constant args, e.g.
  t1 = add t, N ... t2 = add t1, M -> t2 = add t, N+M
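A hedged IL sketch of the last two items (example mine):
%a1 =l add %a0, 8
%a2 =l add %a1, 16
%v =w loadw %a2
# after merging/squashing the chain:
%a2 =l add %a0, 24
%v =w loadw %a2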
Bug Fixes:
- remove "cmp eq/ne of non-identical RCon's " in copyref().
RCon's are not guaranteed to be dedup'ed, and symbols can
alias.
Codebase:
- moved some stuff into cfg.c including blkmerge()
- some refactoring in gvn.c
- simplification of reassoc.c - always reassoc all cmp ops
and Kl add %t, N. Better on coremark, smaller codebase.
- minor simplification of movins() - use vins
Testing - standard QBE, cproc, hare, harec, coremark
[still have Rust build issues with latest roland]
Benchmark
- coremark is ~15%+ faster than master
- hare "HARETEST_INCLUDE='slow' make check" ~8% faster
(crypto::sha1::sha1_1gb is biggest obvious win - ~25% faster)
Changes since RFC 7:
Bug fixes:
- remove isbad4gcm() in GVN/GCM - it is unsound due to different state
at GVN vs GCM time; replace with "reassociation" pass after GCM
- fix intra-blk use-before-def after GCM
- prevent GVN from deduping trapping instructions because GCM
  will not move them
- remove cmp eq/ne identical arg copy detection for floating point, it
is not valid for NaN
- fix cges/cged flagged as commutative in ops.h instead of cnes/cned
respectively; just a typo
Minor features:
- copy detection handles cmp le/lt/ge/gt with identical args
- treat (integer) div/rem by non-zero constant as non-trapping
- eliminate add N/sub N pairs in copy detection (sketch after this list)
- maintain accurate tmp use in GVN; not strictly necessary but enables
interim global state sanity checking
- "reassociation" of trivial constant offset load/store addresses, and
cmp ops with point-of-use in pass after GCM
- normalise commutative op arg order - e.g. op con, tmp -> op tmp, con
to simplify copy detection and GVN instruction dedup
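Hedged IL sketches of the add N/sub N and arg-order items
(examples mine):
# add N / sub N pair: %z is detected as a copy of %x
%y =w add %x, 8
%z =w sub %y, 8
# commutative arg normalisation: op con, tmp -> op tmp, con
%r =w add 10, %x
# becomes
%r =w add %x, 10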
Codebase:
- split out core copy detection and constant folding (back) out into
copy.c, fold.c respectively; gvn.c was getting monolithic
- generic support for instruction moving in ins.c - used by GCM and
reassoc
- new reassociation pass in reassoc.c
- other minor clean-up/refactor
Changes since RFC 6:
- More ext elimination in GVN by examination of def and use bit widths
- elimination of redundant and-masks by bit-width examination
  (sketch below)
- Incorporation of Song's patch
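A hedged IL sketch of the two width-based eliminations above
(example mine):
%b =w and %a, 255   # def leaves only the low 8 bits set
%c =w extub %b      # redundant ext: %b is already zero-extended
%d =w and %c, 255   # redundant mask: it cannot change the value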
Changes since RFC 5:
- avoidance of "bad" candidates for GVN/GCM - trivial address offset
calculations, and comparisons
- more copy detection mostly around boolean values
- allow elimination of unused load, alloc, trapping instructions
- detection of trivial boolean v ? 1 : 0 phi patterns (sketch below)
- bug fix for (removal of) "chg" optimisation in ins recreation - it
  was missing removal of unused instructions in some cases
- ifelim() between GVN and GCM; deeper nopunused()
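A hedged IL sketch of the trivial boolean phi pattern (example
mine; the rewrite shown is my assumption of the obvious one):
@start
	jnz %v, @t, @f
@t
	jmp @join
@f
	jmp @join
@join
	%b =w phi @t 1, @f 0
# %b is just "%v != 0", so it can be computed directly:
	%b =w cnew %v, 0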
|
|
Crashing loads of uninitialized memory
proved to be a problem when implementing
unions using qbe. This patch introduces
a new UNDEF Ref to represent data that is
known to be uninitialized. Optimization
passes can make use of it to eliminate
some code. In the last compilation stages,
UNDEF is treated as the constant 0xdeaddead.
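One hypothetical C-level source of such loads (example mine):
copying a partially initialized union forces the frontend to
load bytes that were never stored.

union u { char c; long l; };

union u
make(void)
{
	union u x;
	x.c = 'a';   /* only 1 of 8 bytes is written */
	return x;    /* the copy-out loads all 8 */
}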
|
|
The copy elimination pass is not
complete. This patch improves
things a bit, but I think we still
have quite a bit of incompleteness.
We now consistently mark phis with
all arguments identical as copies.
Previously, they were inconsistently
eliminated by phisimpl(). An example
where they were not eliminated is
the following:
@blk2
%a = phi @blk0 %x, @blk1 %x
jnz ?, @blk3, @blk4
@blk3
%b = copy %x
@blk4
%c = phi @blk2 %a, @blk3 %b
In this example, neither %c nor %a
were marked as copies of %x because,
when phisimpl() is called, the copy
information for %b is not available.
The incompleteness is still present
and can be observed by modifying
the example above so that %a takes
a copy of %x through a back-edge.
Then, phisimpl()'s lack of copy
information about %b will prevent
optimization.
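For concreteness, a variant with such a back-edge
(reconstructed from the description above):
@blk2
%a = phi @blk0 %x, @blk3 %b
jnz ?, @blk3, @blk4
@blk3
%b = copy %x
jmp @blk2
@blk4
%c = copy %a
Here @blk3 -> @blk2 is a back-edge; when phisimpl() processes
%a, the copy information for %b is still missing, so %a is
never recognised as a copy of %x.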
|
|
udiv %x, 1 == %x, and for each of sub, or, xor, sar, shr, and shl,
<op> %x, 0 == %x.
|
|
C99 6.5.2.5p6:
> If the compound literal occurs outside the body of a function,
> the object has static storage duration; otherwise, it has automatic
> storage duration associated with the enclosing block.
So, we can't use the address of a compound literal here. Instead,
just set p to NULL, and make the loop conditional on p being non-NULL.
Remarks from Quentin:
I made a cosmetic change to Michael's
original patch and merely moved the
literal to toplevel.
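A minimal, hypothetical illustration of the problem and of both
fixes (names mine, not the actual QBE code):

#include <stddef.h>

struct lit { struct lit *next; };

struct lit *
broken(void)
{
	/* the literal's storage ends with the enclosing block
	 * (C99 6.5.2.5p6), so the returned address dangles */
	struct lit *p = &(struct lit){0};
	return p;
}

void
fixed(void)
{
	struct lit *p = NULL;       /* Michael's fix: no literal */
	for (; p; p = p->next)      /* loop runs only when p != NULL */
		;
}

/* Quentin's variant: at file scope the literal has static
 * storage duration, so its address is a valid constant */
static struct lit *q = &(struct lit){0};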
|
|
The code used to see add 0, 10 as
a copy of 0.
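In IL terms (reading mine):
%x =w add 0, 10
# copy detection matched the identity element on the wrong
# argument: %x is a copy of 10, not of 0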
|
|
When lowering pointer arithmetic, it is
natural for a C frontend to generate
those instructions.
|
|
The sparse data-flow analysis used for
copy elimination before this patch
could sometimes diverge. The core
reason for this behavior is that the
visitphi() function was not monotonic
in the following copy-of lattice:
        top       (represented as the temp itself)
       / | \
      x  y  z ...
       \ | /
        bot       (represented as R)
This monotonicity defect could be
fixed by reverting 2f41ff03, but
then the pass would end up missing
some redundant phis.
This patch re-implements the pass
from scratch using a different
approach. The new algorithm should
get rid of all redundant copies. On
the other hand, it can run slower
than the monotonic sparse data-flow
analysis because, in the worst case,
an instruction in a phi cluster can
be visited as many times as there
are phis in the input program.
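As a hypothetical illustration of the lattice (sketch mine,
not QBE's code), a monotone meet over copy-of values would
look like this; the pre-patch visitphi() did not behave like
one:

/* uses Ref, req() and R from QBE's all.h */
Ref
meet(Ref a, Ref b, Ref top)
{
	if (req(a, top))
		return b;    /* meet(top, x) = x */
	if (req(b, top))
		return a;
	if (req(a, b))
		return a;    /* meet(x, x) = x */
	return R;            /* two distinct copies: bottom */
}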
Thanks to Michael Forney for reviewing
and testing the new pass.
|
|
Compiler warned about comparison between signed and unsigned values.
|
|
|
|
|
|
|
|
The pass was not doing anything incorrect, but
it missed some opportunities to optimize. On
a copy-heavy example I observed that, in the
output of the pass, a phi of the following shape
remained:
%a =w phi @A %c, @B %a
Originally the phi for %a was:
%a =w phi @A %b, @B %a
Since %b was discovered a copy of %c, %a should
have been eliminated and replaced with %c.
I think the problem is that knowledge about the
first argument of the phi %a changes as the
algorithm progresses; a more detailed walk-
through follows.
In the first round of the algorithm, %a is
discovered to be a copy of its first argument
%b.
phi(%b, %a) -> %b
In the second round, %a is computed as the phi
of %c (since the first argument changed) and %b
(the result of the first iteration); in our
lattice, the glb of those two is bottom.
phi(%c, %b) -> %a (Bottom)
Finally, there is a third round in which we
compute %a as the phi of %a and %c, which again,
gives bottom.
phi(%c, %a) -> %a (Bottom)
The bug is not tied to a phi of a copy. For
example, if the first argument is speculated
to be a copy of 0 and this knowledge is then
retracted, we will compute in sequence:
phi(0, %a) -> 0
phi(%b, 0) -> %a (Bottom)
phi(%b, %a) -> %a (Bottom)
The way I fixed this is by ignoring arguments
of the phi that were discovered to be copies of
the phi node itself. This will make the last
rounds above do the correct thing.
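A hedged C sketch of that fix (copyof() is a hypothetical
helper returning the current copy-of value of a Ref; Phi, Ref,
req() and R are QBE's):

/* compute the copy-of value of a phi, skipping arguments
 * that are themselves known copies of the phi */
Ref
phicopy(Phi *p)
{
	Ref r, a;
	uint n;

	r = R;
	for (n=0; n<p->narg; n++) {
		a = copyof(p->arg[n]);  /* hypothetical helper */
		if (req(a, p->to))
			continue;       /* copy of the phi itself */
		if (req(r, R) || req(r, a))
			r = a;
		else
			return p->to;   /* two values: bottom */
	}
	return r;
}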
|
|