aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorJack O'Connor <[email protected]>2024-01-21 19:43:07 -0800
committerJack O'Connor <[email protected]>2024-01-21 19:43:07 -0800
commit1a6c1e2037c3b0221e57bcba9e02b7bb5f29f067 (patch)
tree3fb63ed3d728567c8562bf39b3c3fadd886723aa
parent1ca383ba9b1aa2c2508b15d89f3048be10cd9def (diff)
guts readme updates
-rw-r--r--rust/guts/readme.md90
1 files changed, 54 insertions, 36 deletions
diff --git a/rust/guts/readme.md b/rust/guts/readme.md
index a1adbf1..4957816 100644
--- a/rust/guts/readme.md
+++ b/rust/guts/readme.md
@@ -2,51 +2,65 @@
## Introduction
-This crate contains low-level, high-performance, platform-specific
-implementations of the BLAKE3 compression function. This API is complicated and
-unsafe, and this crate will never have a stable release. For the standard
-BLAKE3 hash function, see the [`blake3`](https://crates.io/crates/blake3)
-crate, which depends on this one.
-
-The most important ingredient in a high-performance implementation of BLAKE3 is
-parallelism. The BLAKE3 tree structure lets us hash different parts of the tree
-in parallel, and modern computers have a _lot_ of parallelism to offer.
-Sometimes that means using multiple threads running on multiple cores, but
-multithreading isn't appropriate for all applications, and it's not the usual
-default for library APIs. More commonly, BLAKE3 implementations use SIMD
-instructions ("Single Instruction Multiple Data") to improve the performance of
-a single thread. When we do use multithreading, the performance benefits
-multiply.
-
-The tricky thing about SIMD is that each instruction set works differently.
-Instead of writing portable code once and letting the compiler do most of the
-optimization work, we need to write platform-specific implementations, and
-sometimes more than one per platform. We maintain *four* different
-implementations on x86 alone (targeting SSE2, SSE4.1, AVX2, and AVX-512), in
-addition to ARM NEON and the RISC-V vector extensions. In the future we might
-add ARM SVE2.
-
-All of that means a lot of duplicated logic and maintenance. So while the main
-goal of this API is high performance, it's also important to keep the API as
-small and simple as possible. Higher level details like the "CV stack", input
-buffering, and multithreading are handled by portable code in the main `blake3`
-crate. These are just building blocks.
-
-## The private API
-
-This is the API that each platform reimplements. It's completely `unsafe`,
-inputs and outputs are allowed to alias, and bounds checking is the caller's
-responsibility.
+This [`blake3_guts`](https://crates.io/crates/blake3_guts) sub-crate contains
+low-level, high-performance, platform-specific implementations of the BLAKE3
+compression function. This API is complicated and unsafe, and this crate will
+never have a stable release. Most callers should instead use the
+[`blake3`](https://crates.io/crates/blake3) crate, which will eventually depend
+on this one internally.
+
+The code you see here (as of January 2024) is an early stage of a large planned
+refactor. The motivation for this refactor is a couple of missing features in
+both the Rust and C implementations:
+
+- The output side
+ ([`OutputReader`](https://docs.rs/blake3/latest/blake3/struct.OutputReader.html)
+ in Rust) doesn't take advantage of the most important SIMD optimizations that
+ compute multiple blocks in parallel. This blocks any project that wants to
+ use the BLAKE3 XOF as a stream cipher
+ ([[1]](https://github.com/oconnor663/bessie),
+ [[2]](https://github.com/oconnor663/blake3_aead)).
+- Low-level callers like [Bao](https://github.com/oconnor663/bao) that need
+ interior nodes of the tree also don't get those SIMD optimizations. They have
+ to use a slow, minimalistic, unstable, doc-hidden module [(also called
+ `guts`)](https://github.com/BLAKE3-team/BLAKE3/blob/master/src/guts.rs).
+
+The difficulty with adding those features is that they require changes to all
+of our optimized assembly and C intrinsics code. That's a couple dozen
+different files that are large, platform-specific, difficult to understand, and
+full of duplicated code. The higher-level Rust and C implementations of BLAKE3
+both depend on these files and will need to coordinate changes.
+
+At the same time, it won't be long before we add support for more platforms:
+
+- RISCV vector extensions
+- ARM SVE
+- WebAssembly SIMD
+
+It's important to get this refactor done before new platforms make it even
+harder to do.
+
+## The private guts API
+
+This is the API that each platform reimplements, so we want it to be as simple
+as possible apart from the high-performance work it needs to do. It's
+completely `unsafe`, and inputs and outputs are raw pointers that are allowed
+to alias (this matters for `hash_parents`, see below).
- `degree`
- `compress`
+ - The single compression function, for short inputs and odd-length tails.
- `hash_chunks`
- `hash_parents`
- `xof`
- `xof_xor`
+ - As `xof` but XOR'ing the result into the output buffer.
- `universal_hash`
+ - This is a new construction specifically to support
+ [BLAKE3-AEAD](https://github.com/oconnor663/blake3_aead). Some
+ implementations might just stub it out with portable code.
-## The public API
+## The public guts API
This is the API that this crate exposes to callers, i.e. to the main `blake3`
crate. It's a thin, portable layer on top of the private API above. The Rust
@@ -56,7 +70,11 @@ version of this API is memory-safe.
- `compress`
- `hash_chunks`
- `hash_parents`
+ - This handles most levels of the tree, where we keep hashing SIMD_DEGREE
+ parents at a time.
- `reduce_parents`
+ - This uses the same `hash_parents` private API, but it handles the top
+ levels of the tree where we reduce in-place to the root parent node.
- `xof`
- `xof_xor`
- `universal_hash`