path: root/src/kernel.rs
Age  Commit message  Author
2022-04-09  kernel_3d_16 and xof functions  [kernel]  Jack O'Connor
2022-03-26  xor_xof variants for the 2d kernel  Jack O'Connor
2022-03-20  blake3_avx512_xof_stream_4  Jack O'Connor
2022-03-20  blake3_avx2_xof_stream_2  Jack O'Connor
2022-03-20  blake3_avx512_xof_stream_2  Jack O'Connor
2022-03-20  initial xof_stream functions  Jack O'Connor
2022-03-16  rename kernel_1 to kernel2d_1 and add degree args  Jack O'Connor
2022-03-15  generate blake3_{avx512,sse41,sse2}_compress with asm.py  Jack O'Connor
2022-03-11  replace tail calls with jumps  Jack O'Connor
2022-03-11  blake3_avx512_chunks_8 and blake3_avx512_parents_8  Jack O'Connor
2022-03-09  blake3_avx512_xof_xor_16  Jack O'Connor
2022-03-09  test unaligned writes  Jack O'Connor
2022-03-09  broadcast the block length and domain flags inside blake3_avx512_kernel_16  Jack O'Connor
            blake3_avx512_xof_stream_16 was also incorrectly hardcoding a block
            length of 64. The block length parameter is the *input* block length,
            which is independent of the output block length. (The output block
            length is not a compression function parameter.)
2022-03-09  move third row initialization into blake3_avx512_kernel_16  Jack O'Connor
2022-03-09  interleave the write ops in blake3_avx512_xor_stream_16  Jack O'Connor
            This seems to give a small but consistent performance boost.
2022-03-09  blake3_avx512_xof_stream_16  Jack O'Connor
2022-03-08  split the left and right child CVs for blake3_avx512_parents_16  Jack O'Connor
            There's no reason to force the caller to allocate them together.
2022-03-08  blake3_avx512_parents_16  Jack O'Connor
2022-03-08  use a memory argument for vpbroadcastd  Jack O'Connor
2022-03-08  describe the transposition in comments  Jack O'Connor
2022-03-08  now using only 3 scratch zmm registers  Jack O'Connor
2022-03-08  interleave the first pass -- good performance  Jack O'Connor
2022-03-08  try it with 4 times as many loads  Jack O'Connor
2022-03-08  add a benchmark  Jack O'Connor
2022-03-08  blake3_avx512_chunks_16  Jack O'Connor
2022-03-08  unroll the block loop and load the key  Jack O'Connor
2022-03-08  correct the last two transposition passes  Jack O'Connor
2022-03-08  nonzero message  Jack O'Connor
2022-03-08  start working on a refactored assembly implementation  Jack O'Connor
            The main goal is to eventually have extended outputs benefit from
            the same SIMD optimizations as inputs. To make this easier, I want
            to factor out a shared "kernel" routine that can be shared among
            several different interfaces:

            - compressing chunks
            - compressing parents
            - producing XOF output
            - xor'ing XOF output

            The timing here partly coincides with Rust stabilizing inline asm.
            That's certainly not necessary for any of this to work, but it
            gives me the confidence to try this without needing to master the
            rules of three different calling conventions.