| Age | Commit message | Author | |
|---|---|---|---|
| 2022-04-09 | kernel_3d_16 and xof functions | Jack O'Connor | |
| 2022-03-26 | xor_xof variants for the 2d kernel | Jack O'Connor | |
| 2022-03-20 | blake3_avx512_xof_stream_4 | Jack O'Connor | |
| 2022-03-20 | blake3_avx2_xof_stream_2 | Jack O'Connor | |
| 2022-03-20 | blake3_avx512_xof_stream_2 | Jack O'Connor | |
| 2022-03-20 | initial xof_stream functions | Jack O'Connor | |
| 2022-03-16 | rename kernel_1 to kernel2d_1 and add degree args | Jack O'Connor | |
| 2022-03-15 | generate blake3_{avx512,sse41,sse2}_compress with asm.py | Jack O'Connor | |
| 2022-03-11 | replace tail calls with jumps | Jack O'Connor | |
| 2022-03-11 | blake3_avx512_chunks_8 and blake3_avx512_parents_8 | Jack O'Connor | |
| 2022-03-09 | blake3_avx512_xof_xor_16 | Jack O'Connor | |
| 2022-03-09 | test unaligned writes | Jack O'Connor | |
| 2022-03-09 | broadcast the block length and domain flags inside blake3_avx512_kernel_16 | Jack O'Connor | |
| blake3_avx512_xof_stream_16 was also incorrectly hardcoding a block length of 64. The block length parameter is the *input* block length, which is independent of the output block length. (The output block length is not a compression function parameter.) | |||
| 2022-03-09 | move third row initialization into blake3_avx512_kernel_16 | Jack O'Connor | |
| 2022-03-09 | interleave the write ops in blake3_avx512_xor_stream_16 | Jack O'Connor | |
| This seems to give a small but consistent performance boost. | |||
| 2022-03-09 | blake3_avx512_xof_stream_16 | Jack O'Connor | |
| 2022-03-08 | split the left and right child CVs for blake3_avx512_parents_16 | Jack O'Connor | |
| There's no reason to force the caller to allocate them together. | |||
| 2022-03-08 | blake3_avx512_parents_16 | Jack O'Connor | |
| 2022-03-08 | use a memory argument for vpbroadcastd | Jack O'Connor | |
| 2022-03-08 | describe the transposition in comments | Jack O'Connor | |
| 2022-03-08 | now using only 3 scratch zmm registers | Jack O'Connor | |
| 2022-03-08 | interleave the first pass -- good performance | Jack O'Connor | |
| 2022-03-08 | try it with 4 times as many loads | Jack O'Connor | |
| 2022-03-08 | add a benchmark | Jack O'Connor | |
| 2022-03-08 | blake3_avx512_chunks_16 | Jack O'Connor | |
| 2022-03-08 | unroll the block loop and load the key | Jack O'Connor | |
| 2022-03-08 | correct the last two transposition passes | Jack O'Connor | |
| 2022-03-08 | nonzero message | Jack O'Connor | |
| 2022-03-08 | start working on a refactored assembly implementation | Jack O'Connor | |
| The main goal is to eventually have extended outputs benefit from the same SIMD optimizations as inputs. To make this easier, I want to factor out a "kernel" routine that can be shared among several different interfaces: compressing chunks, compressing parents, producing XOF output, and xor'ing XOF output. The timing here partly coincides with Rust stabilizing inline asm. That's certainly not necessary for any of this to work, but it gives me the confidence to try this without needing to master the rules of three different calling conventions. | |||
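
The shared-kernel idea in the first commit message, and the later note that the block length parameter describes the *input* block, can be illustrated with a small scalar sketch. This is not the assembly from these commits: it is plain Python following the published BLAKE3 compression function, and the names `kernel`, `compress_cv`, `parent_cv`, `xof_block`, and `xof_xor` are hypothetical, chosen only to mirror the four interfaces listed above.

```python
import struct

# Constants from the public BLAKE3 specification.
IV = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]
MSG_PERMUTATION = [2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8]
CHUNK_START, CHUNK_END, PARENT, ROOT = 1, 2, 4, 8
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def g(s, a, b, c, d, mx, my):
    """The BLAKE3 quarter-round, mixing two message words into the state."""
    s[a] = (s[a] + s[b] + mx) & MASK
    s[d] = rotr(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK
    s[b] = rotr(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b] + my) & MASK
    s[d] = rotr(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK
    s[b] = rotr(s[b] ^ s[c], 7)


def kernel(cv, block, counter, block_len, flags):
    """Hypothetical shared kernel: one compression, returning all 16 words.

    block_len is the number of meaningful *input* bytes in `block`;
    the result is always 16 words regardless of its value.
    """
    m = list(struct.unpack("<16I", block))
    s = list(cv) + IV[:4] + [counter & MASK, (counter >> 32) & MASK,
                             block_len, flags]
    for _ in range(7):
        g(s, 0, 4, 8, 12, m[0], m[1])    # columns
        g(s, 1, 5, 9, 13, m[2], m[3])
        g(s, 2, 6, 10, 14, m[4], m[5])
        g(s, 3, 7, 11, 15, m[6], m[7])
        g(s, 0, 5, 10, 15, m[8], m[9])   # diagonals
        g(s, 1, 6, 11, 12, m[10], m[11])
        g(s, 2, 7, 8, 13, m[12], m[13])
        g(s, 3, 4, 9, 14, m[14], m[15])
        m = [m[i] for i in MSG_PERMUTATION]
    for i in range(8):
        s[i] ^= s[i + 8]
        s[i + 8] ^= cv[i]
    return s


def compress_cv(cv, block, counter, block_len, flags):
    """Chunk-style interface: keep only the 8-word chaining value."""
    return kernel(cv, block, counter, block_len, flags)[:8]


def parent_cv(left_cv, right_cv, key):
    """Parent-style interface: the two child CVs are separate arguments."""
    block = struct.pack("<16I", *left_cv, *right_cv)
    return kernel(key, block, 0, 64, PARENT)[:8]


def xof_block(cv, block, counter, block_len, flags):
    """XOF-style interface: all 16 words, i.e. 64 output bytes."""
    return struct.pack("<16I", *kernel(cv, block, counter, block_len,
                                       flags | ROOT))


def xof_xor(cv, block, counter, block_len, flags, buf):
    """XOF-xor interface: xor the 64 output bytes into a caller buffer."""
    stream = xof_block(cv, block, counter, block_len, flags)
    return bytes(a ^ b for a, b in zip(buf, stream))


# Root output for an empty message: the input block length is 0,
# yet the output block is still 64 bytes.
out = xof_block(IV, bytes(64), 0, 0, CHUNK_START | CHUNK_END)
assert len(out) == 64
```

In this sketch the four interfaces differ only in which flags they set and how much of the kernel's result they keep, which is the kind of sharing the commit message describes. The real routines operate on many blocks at once in SIMD registers, which this scalar version does not attempt to show.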
