github.com/BLAKE3-team/BLAKE3 - BLAKE3 is a cryptographic hash function

Age	Commit message	Author
2022-04-09	kernel_3d_16 and xof functions	Jack O'Connor

2022-03-26	xor_xof variants for the 2d kernel	Jack O'Connor

2022-03-20	blake3_avx512_xof_stream_4	Jack O'Connor

2022-03-20	blake3_avx2_xof_stream_2	Jack O'Connor

2022-03-20	blake3_avx512_xof_stream_2	Jack O'Connor

2022-03-20	initial xof_stream functions	Jack O'Connor

2022-03-16	rename kernel_1 to kernel2d_1 and add degree args	Jack O'Connor

2022-03-15	generate blake3_{avx512,sse41,sse2}_compress with asm.py	Jack O'Connor

2022-03-11	replace tail calls with jumps	Jack O'Connor

2022-03-11	blake3_avx512_chunks_8 and blake3_avx512_parents_8	Jack O'Connor

2022-03-09	blake3_avx512_xof_xor_16	Jack O'Connor

2022-03-09	test unaligned writes	Jack O'Connor

2022-03-09	broadcast the block length and domain flags inside blake3_avx512_kernel_16	Jack O'Connor
	blake3_avx512_xof_stream_16 was also incorrectly hardcoding a block length of 64. The block length parameter is the input block length, which is independent of the output block length. (The output block length is not a compression function parameter.)
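The distinction in the note above can be sketched in plain Python. This is a minimal illustration that follows the structure of the public BLAKE3 reference implementation, not the assembly in this branch. The helper name `xof_block` is hypothetical. The point it shows: `block_len` is a compression input describing how many message bytes the block holds, while every output block is 64 bytes regardless of that value.

```python
import struct

# Constants as in the BLAKE3 reference implementation.
IV = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]
MSG_PERMUTATION = [2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8]
CHUNK_START, CHUNK_END, ROOT = 1, 2, 8
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def g(s, a, b, c, d, mx, my):
    """The quarter-round mixing function."""
    s[a] = (s[a] + s[b] + mx) & MASK
    s[d] = rotr(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK
    s[b] = rotr(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b] + my) & MASK
    s[d] = rotr(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK
    s[b] = rotr(s[b] ^ s[c], 7)


def compress(cv, block_words, counter, block_len, flags):
    """Return the full 16-word output state of one compression."""
    # block_len occupies state word 14: it is an *input* parameter.
    s = cv[:] + IV[:4] + [counter & MASK, (counter >> 32) & MASK,
                          block_len, flags]
    m = block_words[:]
    for _ in range(7):
        g(s, 0, 4, 8, 12, m[0], m[1])    # columns
        g(s, 1, 5, 9, 13, m[2], m[3])
        g(s, 2, 6, 10, 14, m[4], m[5])
        g(s, 3, 7, 11, 15, m[6], m[7])
        g(s, 0, 5, 10, 15, m[8], m[9])   # diagonals
        g(s, 1, 6, 11, 12, m[10], m[11])
        g(s, 2, 7, 8, 13, m[12], m[13])
        g(s, 3, 4, 9, 14, m[14], m[15])
        m = [m[i] for i in MSG_PERMUTATION]
    for i in range(8):
        s[i] ^= s[i + 8]
        s[i + 8] ^= cv[i]
    return s


def xof_block(cv, block_words, block_len, flags, out_counter):
    """Hypothetical helper: one 64-byte block of root (XOF) output.

    block_len is passed through unchanged from the final input block;
    only the counter varies from one output block to the next.
    """
    words = compress(cv, block_words, out_counter, block_len, flags | ROOT)
    return struct.pack("<16I", *words)


# Empty input: the final (and only) input block holds 0 bytes.
out0 = xof_block(IV, [0] * 16, 0, CHUNK_START | CHUNK_END, 0)
out1 = xof_block(IV, [0] * 16, 0, CHUNK_START | CHUNK_END, 1)
print(len(out0), len(out1), out0[:32].hex())
```

Here `block_len` is 0 for both calls, yet each call yields a full 64-byte output block. The first 32 bytes of `out0` are the BLAKE3 hash of the empty input. Hardcoding 64 in place of `block_len` would change every output byte.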
2022-03-09	move third row initialization into blake3_avx512_kernel_16	Jack O'Connor

2022-03-09	interleave the write ops in blake3_avx512_xor_stream_16	Jack O'Connor
	This seems to give a small but consistent performance boost.
2022-03-09	blake3_avx512_xof_stream_16	Jack O'Connor

2022-03-08	split the left and right child CVs for blake3_avx512_parents_16	Jack O'Connor
	There's no reason to force the caller to allocate them together.
2022-03-08	blake3_avx512_parents_16	Jack O'Connor

2022-03-08	use a memory argument for vpbroadcastd	Jack O'Connor

2022-03-08	describe the transposition in comments	Jack O'Connor

2022-03-08	now using only 3 scratch zmm registers	Jack O'Connor

2022-03-08	interleave the first pass -- good performance	Jack O'Connor

2022-03-08	try it with 4 times as many loads	Jack O'Connor

2022-03-08	add a benchmark	Jack O'Connor

2022-03-08	blake3_avx512_chunks_16	Jack O'Connor

2022-03-08	unroll the block loop and load the key	Jack O'Connor

2022-03-08	correct the last two transposition passes	Jack O'Connor

2022-03-08	nonzero message	Jack O'Connor

2022-03-08	start working on a refactored assembly implementation	Jack O'Connor
	The main goal is to eventually have extended outputs benefit from the same SIMD optimizations as inputs. To make this easier, I want to factor out a shared "kernel" routine that can be shared among several different interfaces:
	- compressing chunks
	- compressing parents
	- producing XOF output
	- xor'ing XOF output
	The timing here partly coincides with Rust stabilizing inline asm. That's certainly not necessary for any of this to work, but it gives me the confidence to try this without needing to master the rules of three different calling conventions.