| Age | Commit message | Author |
|
|
|
Add SSE2 implementations
|
|
This enables pretty printing via `{:#?}`. The normal style for `{:?}` is
kept exactly the same.
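A minimal sketch of how one `Debug` impl can serve both styles, assuming a hypothetical `Digest` newtype rather than the crate's actual type. Building the output with `Formatter::debug_tuple` keeps `{:?}` on one line and lets `{:#?}` expand across several lines:

```rust
use std::fmt;

/// Hypothetical stand-in for a hash output type (not the crate's real type).
pub struct Digest(pub [u8; 4]);

impl fmt::Debug for Digest {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Render the bytes as lowercase hex.
        let hex: String = self.0.iter().map(|b| format!("{:02x}", b)).collect();
        // `debug_tuple` honors the `#` flag, so `{:#?}` pretty prints
        // while `{:?}` stays on a single line.
        f.debug_tuple("Digest").field(&hex).finish()
    }
}
```

With this impl, `format!("{:?}", Digest([0x0a, 0x0b, 0xff, 0x00]))` yields `Digest("0a0bff00")`, and the `{:#?}` form puts the field on its own indented line.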
|
|
Wire up basic functions and features for SSE2 support using the SSE4.1 version
as a basis without implementing the SSE2 instructions yet.
* Cargo.toml: add no_sse2 feature
* benches/bench.rs: wire SSE2 benchmarks
* build.rs: add SSE2 rust intrinsics and assembly builds
* c/Makefile.testing: add SSE2 C and assembly targets
* c/README.md: add SSE2 to C build instructions
* c/blake3_c_rust_bindings/build.rs: add SSE2 C rust binding builds
* c/blake3_c_rust_bindings/src/lib.rs: add SSE2 C rust bindings
* c/blake3_dispatch.c: add SSE2 C dispatch
* c/blake3_impl.h: add SSE2 C function prototypes
* c/blake3_sse2.c: add SSE2 C intrinsic file starting with SSE4.1 version
* c/blake3_sse2_x86-64_{unix.S,windows_gnu.S,windows_msvc.asm}: add SSE2
assembly files starting with SSE4.1 version
* src/ffi_sse2.rs: add rust implementation using SSE2 C rust bindings
* src/lib.rs: add SSE2 rust intrinsics and SSE2 C rust binding module
configurations
* src/platform.rs: add SSE2 rust platform detection and dispatch
* src/rust_sse2.rs: add SSE2 rust intrinsic file starting with SSE4.1 version
* tools/instruction_set_support/src/main.rs: add SSE2 feature detection
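The runtime detection and dispatch wired up above could look roughly like the sketch below. The `Platform` enum, the `detect` function, and the `allow_sse2` parameter (standing in for a `no_sse2`-style opt-out) are hypothetical names for illustration, not the contents of src/platform.rs:

```rust
/// Hypothetical platform enum; the real crate's names may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Platform {
    Portable,
    Sse2,
    Sse41,
}

/// Pick the best available implementation at runtime.
/// `allow_sse2` is an assumed stand-in for a `no_sse2`-style opt-out.
pub fn detect(allow_sse2: bool) -> Platform {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Prefer the newer instruction set when the CPU reports it.
        if std::is_x86_feature_detected!("sse4.1") {
            return Platform::Sse41;
        }
        if allow_sse2 && std::is_x86_feature_detected!("sse2") {
            return Platform::Sse2;
        }
    }
    // On non-x86 targets the block above is compiled out.
    let _ = allow_sse2;
    Platform::Portable
}
```

On x86-64, SSE2 is part of the baseline, so `detect(true)` should never return `Platform::Portable` there.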
|
|
It looks like I originally made this mistake when I was copying code
from the baokeshed prototype (a274a9b0faa444dd842a0584483eae6e97dbf21e),
and then it got replicated into the C implementation later.
|
|
There are two scenarios where compiling AVX-512 C or assembly code might
not work:
1. There might not be a C compiler installed at all. Most commonly this
is either in cross-compiling situations, or with the Windows GNU
target.
2. The installed C compiler might not support e.g. -mavx512f, because
it's too old.
In both of these cases, print a relevant warning, and then automatically
fall back to using the pure Rust intrinsics build.
Note that this only affects x86 targets. Other targets always use pure
Rust, unless the "neon" feature is enabled.
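The fallback decision described above can be sketched as a small pure function. This assumes the two probe results (compiler present, AVX-512 flags accepted) are obtained elsewhere in the build script. `Backend`, `choose_backend`, and `cargo_warning` are hypothetical names; only the `cargo:warning=` directive is actual Cargo build-script syntax:

```rust
/// Which implementation the build ends up using (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    COrAssembly,
    PureRustIntrinsics,
}

/// Decide the backend from two probe results, returning a warning
/// message whenever we fall back to pure Rust intrinsics.
pub fn choose_backend(
    compiler_found: bool,
    supports_avx512_flags: bool,
) -> (Backend, Option<String>) {
    if !compiler_found {
        // Scenario 1: no C compiler at all (e.g. cross-compiling).
        let msg = "no C compiler found; using pure Rust intrinsics";
        return (Backend::PureRustIntrinsics, Some(msg.to_string()));
    }
    if !supports_avx512_flags {
        // Scenario 2: the compiler is too old for e.g. -mavx512f.
        let msg = "C compiler lacks AVX-512 support; using pure Rust intrinsics";
        return (Backend::PureRustIntrinsics, Some(msg.to_string()));
    }
    (Backend::COrAssembly, None)
}

/// Format a message as a Cargo build-script warning line.
pub fn cargo_warning(msg: &str) -> String {
    format!("cargo:warning={}", msg)
}
```

A build script would print the result of `cargo_warning` to stdout so that Cargo surfaces it to the user.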
|
|
The biggest change here is that assembly implementations are enabled by
default.
Added features:
- "pure" (Pure Rust, with no C or assembly implementations.)
Removed features:
- "c" (Now basically the default.)
Renamed features:
- "c_prefer_intrinsics" -> "prefer_intrinsics"
- "c_neon" -> "neon"
Unchanged:
- "rayon"
- "std" (Still the only feature on by default.)
|
|
Suggested by @zaynetro:
https://github.com/BLAKE3-team/BLAKE3/pull/24#issuecomment-594369061
|
|
If the total number of chunks hashed so far is e.g. 1, and update() is
called with e.g. 8 more chunks, we can't compress all 8 together. We
have to break the input up, to make sure that that 1 lone chunk CV gets
merged with its proper sibling, and that in general the correct layout
of the tree is preserved. What we should do is hash 1-2-4-1 chunks of
input, using increasing powers of 2 (with some cleanup at the end). What
we were doing was 2-2-2-2 chunks. This was the result of a mistaken
optimization that got us stuck with an always-odd number of chunks so
far.
Fixes https://github.com/BLAKE3-team/BLAKE3/issues/69.
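The 1-2-4-1 split can be reproduced with a short sketch. It assumes the rule is "take the largest power-of-2 group of chunks that fits in the remaining input and evenly divides the number of chunks hashed so far". The function name `subtree_sizes` is hypothetical, and the real update() handles further details (such as holding back the final chunk) that are omitted here:

```rust
/// Split `remaining` new chunks into power-of-2 groups that keep the
/// tree layout intact, given `chunks_so_far` chunks already hashed.
/// Simplified illustration only, not the crate's actual update() logic.
pub fn subtree_sizes(mut chunks_so_far: u64, mut remaining: u64) -> Vec<u64> {
    let mut sizes = Vec::new();
    while remaining > 0 {
        // Largest power of 2 that is <= remaining.
        let mut size = 1u64 << (63 - remaining.leading_zeros());
        // Shrink until the group size evenly divides the chunk count so
        // far, so the group starts on a subtree boundary of its own size.
        while chunks_so_far % size != 0 {
            size /= 2;
        }
        sizes.push(size);
        chunks_so_far += size;
        remaining -= size;
    }
    sizes
}
```

Starting from 1 chunk with 8 more to hash, this produces groups of 1, 2, 4, and 1 chunks. Starting from 0 chunks, all 8 are compressed together.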
|
|
This is a new interface that allows the caller to provide a
multi-threading implementation. It's defined in terms of a new `Join`
trait, for which we provide two implementations, `SerialJoin` and
`RayonJoin`. This lets the caller control when multi-threading is used,
rather than the previous all-or-nothing design of the "rayon" feature.
Although existing callers should keep working, this is a compatibility
break, because callers who were relying on automatic multi-threading
before will now be single-threaded. Thus the next release of this crate
will need to be version 0.2.
See https://github.com/BLAKE3-team/BLAKE3/issues/25 and
https://github.com/BLAKE3-team/BLAKE3/issues/54.
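A rough sketch of the shape such an interface can take. The trait signature below is an assumption modeled on the usual fork-join style and may not match the crate exactly. `ThreadJoin` is a hypothetical stand-in for `RayonJoin` that uses scoped standard-library threads so the example needs no external dependency, and `sum` is an illustrative caller:

```rust
/// Assumed shape of the trait: run two closures, possibly in parallel.
pub trait Join {
    fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
    where
        A: FnOnce() -> RA + Send,
        B: FnOnce() -> RB + Send,
        RA: Send,
        RB: Send;
}

/// Runs both closures on the calling thread, one after the other.
pub enum SerialJoin {}

impl Join for SerialJoin {
    fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
    where
        A: FnOnce() -> RA + Send,
        B: FnOnce() -> RB + Send,
        RA: Send,
        RB: Send,
    {
        (oper_a(), oper_b())
    }
}

/// Hypothetical stand-in for `RayonJoin`, built on scoped std threads.
pub enum ThreadJoin {}

impl Join for ThreadJoin {
    fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
    where
        A: FnOnce() -> RA + Send,
        B: FnOnce() -> RB + Send,
        RA: Send,
        RB: Send,
    {
        std::thread::scope(|s| {
            // Run the second closure on a new thread, the first one here.
            let handle_b = s.spawn(oper_b);
            let ra = oper_a();
            let rb = handle_b.join().expect("joined closure panicked");
            (ra, rb)
        })
    }
}

/// Illustrative caller: the `J` parameter decides whether the
/// recursion runs serially or in parallel.
pub fn sum<J: Join>(data: &[u64]) -> u64 {
    if data.len() <= 1024 {
        return data.iter().sum();
    }
    let (left, right) = data.split_at(data.len() / 2);
    let (a, b) = J::join(|| sum::<J>(left), || sum::<J>(right));
    a + b
}
```

A caller picks the threading strategy at the call site, for example `sum::<SerialJoin>(&data)` versus `sum::<ThreadJoin>(&data)`, instead of having it fixed by a crate feature.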
|
|
Closes https://github.com/BLAKE3-team/BLAKE3/issues/41.
|
|
Because compress_subtree_to_parent_node effectively cuts its input in
half, we can give it an input that's twice as big, without violating the
CV stack invariant.
|
|
For the Read and Write traits, this also allows the compiler to see that
the return value is always Ok, allowing it to remove the Err case from
the caller as dead code.
|
|
The generic constant_time_eq has several branches on the slice length,
which are not necessary when the slice length is known. However, the
optimizer is not allowed to look into the core of constant_time_eq, so
these branches cannot be elided.
Use instead a fixed-size variant of constant_time_eq, which has no
branches since the length is known.
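To illustrate why a fixed-size variant needs no length branches, here is a simplified sketch of a 32-byte comparison. It is not the `constant_time_eq` crate's code: it omits the optimizer barriers a production version relies on, and it should not be treated as a vetted constant-time implementation:

```rust
/// Compare two 32-byte arrays without branching on their contents.
/// Simplified illustration; lacks the optimizer barriers a real
/// constant-time comparison would use.
pub fn constant_time_eq_32(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    // The length is fixed by the array type, so there is no length
    // check and the loop bound is a compile-time constant.
    for i in 0..32 {
        // Accumulate any differing bits instead of returning early.
        diff |= a[i] ^ b[i];
    }
    diff == 0
}
```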
|
|
The previous version of this API called for a key of exactly 256 bits.
That's good for optimal performance, but it would mean losing the
use-with-other-algorithms property for applications whose input keys are
a different size. There's no way for an abstraction over the previous
version to provide reliable domain separation for the "extract" step.
|
|
This is simpler than sometimes incrementing by CHUNK_LEN and other times
incrementing by BLOCK_LEN.
|
|
Smaller chunk sizes are a big benefit for parallelism at shorter input
lengths, and recent benchmarks show that this reduction has a relatively
small cost in terms of peak throughput. It's also a nice round number.
|
|
This lets us enable it by default in b3sum.
|
|
The portable implementation was getting slowed down by converting back
and forth between words and bytes.
I made the corresponding change on the C side first
(https://github.com/veorq/BLAKE3-c/commit/12a37be8b50922a358c016ba07f46816a3da4a31),
and as part of this commit I'm re-vendoring the C code. I'm also
exposing a small FFI interface to C so that blake3_neon.c can link
against portable.rs rather than blake3_portable.c, see c_neon.rs.
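For context on the word/byte conversions mentioned above: BLAKE3 operates on little-endian 32-bit words, so one way to avoid repeated round trips is to convert once at the boundaries and keep intermediate values as words. The helper names below are hypothetical and only show what such boundary conversions look like; they are not taken from portable.rs:

```rust
/// Convert a 64-byte block into 16 little-endian words (hypothetical helper).
pub fn words_from_le_bytes_64(block: &[u8; 64]) -> [u32; 16] {
    let mut words = [0u32; 16];
    for (i, chunk) in block.chunks_exact(4).enumerate() {
        words[i] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    words
}

/// Convert 8 words back into 32 little-endian bytes (hypothetical helper).
pub fn le_bytes_from_words_32(words: &[u32; 8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, w) in words.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&w.to_le_bytes());
    }
    out
}
```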
|
|
This would fire (incorrectly) on platforms where MAX_SIMD_DEGREE=1.
|
|
I'm about to add C integration for AVX-512 and NEON, and this better
matches what the C code is doing.
|
|
|