| Age | Commit message (Collapse) | Author |
|
It is possible that GVN removes
some dead blocks; this could lead
to odd - but probably harmless -
phi args appearing in the IL.
This patch cleans things up during
fillcfg().
|
|
Useful for ifopt to match more
often. Empty blocks are fused
and conditional jumps on empty
blocks with the same successor
(and no phis in the successor)
are collapsed.
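A minimal sketch of the collapse rule in C, using a made-up, simplified Blk struct (the real QBE Blk has many more fields); this is an illustration of the rule described above, not QBE's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down block representation. */
typedef struct Blk Blk;
struct Blk {
	int nins;     /* instruction count; 0 means empty */
	int nphi;     /* phi count in this block */
	Blk *s1, *s2; /* successors; s2 == NULL for a plain jmp */
};

/* If both arms of a jnz lead (possibly through empty blocks)
 * to the same successor and that successor has no phis, the
 * conditional jump can become a plain jmp. */
static int
collapsejnz(Blk *b)
{
	Blk *t1 = b->s1, *t2 = b->s2;

	if (!t2)
		return 0; /* not a conditional jump */
	/* look through empty blocks */
	if (t1->nins == 0 && t1->nphi == 0 && t1->s1 && !t1->s2)
		t1 = t1->s1;
	if (t2->nins == 0 && t2->nphi == 0 && t2->s1 && !t2->s2)
		t2 = t2->s1;
	if (t1 != t2 || t1->nphi != 0)
		return 0; /* arms differ, or phis would need fixing */
	b->s1 = t1;
	b->s2 = NULL; /* jnz becomes jmp */
	return 1;
}
```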
|
|
|
|
Replacement of tiny conditional jump graphlets with
conditional move instructions.
Currently enabled only for x86. Arm64 support using cselXX
will be essentially identical.
Adds (internal) frontend sel0/sel1 ops with flag-specific
backend xselXX following jnz implementation pattern.
Testing: standard QBE, cproc, harec, hare, roland
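For illustration, this is the kind of tiny conditional graphlet the pass targets: a branch whose only job is to pick between two values. With the sel0/sel1 ops the backend can emit a compare plus cmovne instead of a conditional jump (plain C here, not QBE IL):

```c
/* A select pattern: one compare, one conditional move,
 * no branch needed in the emitted x86. */
static long
select_max(long a, long b)
{
	/* if (a < b) r = b; else r = a; */
	return a < b ? b : a;
}
```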
|
|
|
|
|
|
In case we need to spill to accommodate
the jump argument, piggyback the
reloads from slots to regalloc so that
they can be correctly inserted on edges.
|
|
Apple's assembler actually hard
crashed on overflows.
|
|
|
|
|
|
Thanks to Luke Graham for reporting
and fixing this issue.
|
|
|
|
|
|
Signed-off-by: Horst H. von Brand <[email protected]>
|
|
While at it, extract most duplicated code across targets into a
function.
|
|
If $bin is set in the environment, use it instead of using `qbe` from
the source tree. The same for $binref. This supports the following use
cases:
- I have a qbe package installed, and I want to test my local changes
with the installed package as a reference:
$ binref=/usr/bin/qbe ./tools/test.sh all
- I want to test the installed qbe against new tests that I have
written, to reproduce a bug:
$ bin=/usr/bin/qbe ./tools/test.sh test/newtest.ssa
In Debian, we also run tests against the installed package when
dependencies change, etc. We will also run on several architectures
where the necessary cross compilers might not be available. So make
tests that cannot be run because of a missing compiler exit with 77,
signaling to Debian's autopkgtest that the test is skipped.
|
|
When developing on an arm64 machine, it's useful to be able to test the
x86_64 target.
|
|
On Apple platforms x18 is not guaranteed
to be preserved across context switches.
So we now use IP1 as scratch register.
En passant, one dubious use of IP0 in
arm64/emit.c fixarg() was transitioned
to IP1. I believe the previous code could
clobber a user value if IP0 was live.
|
|
|
|
- Many stylistic nits.
- Removed blkmerge().
- Some minor bug fixes.
- GCM reassoc is now "sink"; a pass that
moves trivial ops into their target block
with the same goal of reducing register
pressure, but starting from instructions
that benefit from having their inputs
close.
|
|
|
|
More or less as proposed in its ninth iteration, with the
addition of gcmmove() to restore coherent
local schedules.
Changes since RFC 8:
Features:
- generalization of phi 1/0 detection
- collapse linear jmp chains before GVN; simplifies if-graph
detection used in 0/non-0 value inference and if-elim...
- infer 0/non-0 values from dominating blk jnz; eliminates
redundant cmp eq/ne 0 and associated jnz/blocks, for example
redundant null pointer checks (hare codebase likes this)
- remove (emergent) empty if-then-else graphlets between GVN and
GCM; improves GCM instruction placement, particularly cmps.
- merge %addr =l add %addr1, N sequences - reduces tmp count,
register pressure.
- squash consecutive associative ops with constant args, e.g.
t1 = add t, N ... t2 = add t1, M -> t2 = add t, N+M
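The squash of consecutive associative constant ops can be sketched in C on a toy instruction type (this is an illustration of the rule, not QBE's Ins or its pass code):

```c
#include <assert.h>

/* Toy add-immediate instruction: to = add arg, imm */
typedef struct {
	int to, arg; /* temp numbers */
	long imm;
} Ins;

/* rewrite  t1 = add t, N; t2 = add t1, M  into  t2 = add t, N+M
 * when the use's argument is the def's result */
static void
squashadd(Ins *def, Ins *use)
{
	if (use->arg == def->to) {
		use->arg = def->arg;
		use->imm += def->imm;
	}
}
```

One fewer live temp, and address arithmetic chains collapse to a single offset.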
Bug Fixes:
- remove "cmp eq/ne of non-identical RCon's " in copyref().
RCon's are not guaranteed to be dedup'ed, and symbols can
alias.
Codebase:
- moved some stuff into cfg.c including blkmerge()
- some refactoring in gvn.c
- simplification of reassoc.c - always reassoc all cmp ops
and Kl add %t, N. Better on coremark, smaller codebase.
- minor simplification of movins() - use vins
Testing - standard QBE, cproc, hare, harec, coremark
[still have Rust build issues with latest roland]
Benchmark
- coremark is ~15%+ faster than master
- hare "HARETEST_INCLUDE='slow' make check" ~8% faster
(crypto::sha1::sha1_1gb is biggest obvious win - ~25% faster)
Changes since RFC 7:
Bug fixes:
- remove isbad4gcm() in GVN/GCM - it is unsound due to different state
at GVN vs GCM time; replace with "reassociation" pass after GCM
- fix intra-blk use-before-def after GCM
- prevent GVN from deduping trapping instructions because GCM will
not move them
- remove cmp eq/ne identical arg copy detection for floating point, it
is not valid for NaN
- fix cges/cged flagged as commutative in ops.h instead of cnes/cned
respectively; just a typo
Minor features:
- copy detection handles cmp le/lt/ge/gt with identical args
- treat (integer) div/rem by non-zero constant as non-trapping
- eliminate add N/sub N pairs in copy detection
- maintain accurate tmp use in GVN; not strictly necessary but enables
interim global state sanity checking
- "reassociation" of trivial constant offset load/store addresses, and
cmp ops with point-of-use in pass after GCM
- normalise commutative op arg order - e.g. op con, tmp -> op tmp, con
to simplify copy detection and GVN instruction dedup
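The commutative arg normalisation above can be sketched with a toy operand type (hypothetical Ref struct, only illustrating the swap rule):

```c
#include <assert.h>

/* Toy operand: either a constant or a temp. */
typedef struct {
	int iscon;
	long val;
} Ref;

/* Put the constant second (op con, tmp -> op tmp, con) so
 * dedup and copy detection only check one argument order. */
static void
normcomm(Ref *a0, Ref *a1)
{
	if (a0->iscon && !a1->iscon) {
		Ref t = *a0;
		*a0 = *a1;
		*a1 = t;
	}
}
```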
Codebase:
- split out core copy detection and constant folding (back) out into
copy.c, fold.c respectively; gvn.c was getting monolithic
- generic support for instruction moving in ins.c - used by GCM and
reassoc
- new reassociation pass in reassoc.c
- other minor clean-up/refactor
Changes since RFC 6:
- More ext elimination in GVN by examination of def and use bit width
- elimination of redundant and mask by bit width examination
- Incorporation of Song's patch
Changes since RFC 5:
- avoidance of "bad" candidates for GVN/GCM - trivial address offset
calculations, and comparisons
- more copy detection mostly around boolean values
- allow elimination of unused load, alloc, trapping instructions
- detection of trivial boolean v ? 1 : 0 phi patterns
- bug fix for (removal of) "chg" optimisation in ins recreation - it
was missing removal of unused instructions in some cases
- ifelim() between GVN and GCM; deeper nopunused()
|
|
Remove edgedel() calls from fillrpo().
Call new prunephis() from fillpreds().
[Curiously this never seems to do anything even though edgedel()
is no longer called from fillrpo()]
One remaining fillpreds() call in parse.c typecheck - seems
like it will still work the same.
Defensive: fillcfg() combines fillrpo() and fillpreds(). A problem
after simpljmp() appears to be caused by fillrpo() still doing
edgedel(), which should now be covered by fillpreds().
Comment out edgedel() in fillrpo(): fillcfg() no longer asserts
after simpljmp(), but prunephis() seemingly never triggers.
Make fillrpo() static; remove edgedel() from fillrpo().
Replace fillrpo() and/or fillpreds() with fillcfg().
|
|
Now that b->pred is a vector we can remove the counting pass.
|
|
Essentially use post-order as id, then reverse to rpo.
Avoids needing f->nblk initially; slightly simpler logic.
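The approach can be sketched in C on a tiny array-based CFG (made-up representation; QBE's fillrpo() works on Blk pointers): do a post-order DFS, then reverse the order. Unreached blocks simply never get an id, so the block count is not needed up front.

```c
#include <assert.h>

enum { MAXBLK = 16 };

/* Tiny CFG: succ[i][0..1] are successor ids, -1 if absent. */
static int succ[MAXBLK][2];
static int seen[MAXBLK];
static int post[MAXBLK], npost;

static void
dfs(int n)
{
	if (n < 0 || seen[n])
		return;
	seen[n] = 1;
	dfs(succ[n][0]);
	dfs(succ[n][1]);
	post[npost++] = n; /* post-order position */
}

/* Fill rpo[] with blocks in reverse post-order;
 * returns the number of reachable blocks. */
static int
rporder(int start, int *rpo)
{
	int i;

	npost = 0;
	dfs(start);
	for (i = 0; i < npost; i++)
		rpo[i] = post[npost - 1 - i];
	return npost;
}
```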
|
|
Removes last re-allocation of b->ins.
|
|
Always used this way and factors setting b->nins.
Makes b->ins vector contract more obvious.
|
|
Scratching an itch - avoid unnecessary re-allocation in idup()
which is called often in the optimisation chain.
Blk::ins is reallocated in xxx_abi() - needs further fiddling.
|
|
Scratching an itch - avoid unnecessary re-allocation in fillpreds()
which is called often in the optimisation chain.
|
|
Scratching an itch - avoid unnecessary re-allocation in fillrpo()
which is called multiple times in the optimisation chain.
|
|
|
|
- dynamic allocations could generate
bad 'and' instructions (for the
and with -16 in salloc()).
- symbols used in w context would
generate adrp and add instructions
on wN registers while they seem to
only work on xN registers.
Thanks to Rosie for reporting them.
|
|
When rbp is not necessary to compile
a leaf function, we skip saving and
restoring it.
|
|
Clang incorrectly optimizes this negation with -O2 and causes QBE to
emit 0 in place of INT64_MIN.
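For context: `-x` overflows when `x == INT64_MIN`, and signed overflow is undefined behaviour in C, which is what licenses the surprising -O2 rewrite. A well-defined way to negate (a sketch of the usual workaround, not necessarily QBE's exact fix) is to go through unsigned arithmetic, which wraps:

```c
#include <assert.h>
#include <stdint.h>

/* Negate without signed overflow: unsigned subtraction wraps
 * modulo 2^64, and the conversion back yields the intended bit
 * pattern on two's-complement targets. */
static int64_t
safeneg(int64_t x)
{
	return (int64_t)(0u - (uint64_t)x);
}
```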
|
|
Functions are now aligned on 16-byte
boundaries. This mimics gcc and should
help reduce the maximum perf impact of
cosmetic code changes. Previously, any
change in the output of qbe could have
far reaching implications on alignment.
Thanks to Roland Paterson-Jones for
pointing out the variability issue.
|
|
This was cute to do, but it is
largely inconsequential, as shown
by the rough timings below:
benchmarking mul8_lea
3.9 ticks ± 0.88 (min: 3)
benchmarking mul8_imul
3.3 ticks ± 0.27 (min: 3)
benchmarking div8_udiv
6.5 ticks ± 0.52 (min: 6)
benchmarking div8_shr
3.3 ticks ± 0.34 (min: 3)
|
|
Additionally, the strength-reduction
for small powers of two is handled
by amd64/emit.c now.
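The strength reductions in question, shown as plain C for illustration: a multiply by 2^k becomes a left shift and an unsigned divide by 2^k a right shift, which the emitter can now do late.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t mul8(uint64_t x) { return x << 3; } /* x * 8 */
static uint64_t div8(uint64_t x) { return x >> 3; } /* x / 8 (unsigned) */
```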
|
|
|
|
|
|
|
|
|
|
|
|
Hopefully the right time now!
|
|
Passes the "standard" test suite.
(cproc bootstrap, hare[c] make test, roland units, linpack/coremark run)
However the linpack benchmark is now notably slower. Coremark is ~2% faster.
As noticed before, linpack timing is dubious, and maybe my cheap (AMD) laptop
prefers mul to shl.
|
|
|
|
In this branch we only need that br[b->loop].b
is defined. This is the case if b->loop >= n.
|
|
When applying a custom set of CFLAGS under clang that does not include
-std=c99, asm is treated as a keyword and as such cannot be used as an
identifier. This prevents the issue by renaming the offending variables.
|
|
Comparisons return a 1-bit value; in theory
we could add a Wu1 width for them, but I did
not bother and just used Wub. This simply
means that if a frontend generates an extsb
of a comparison result (silly), we will not
generate good code.
|
|
|
|
Quotes are used on Apple target
variants to flag that we must
not add the _ symbol prefix.
|