| author | Jack O'Connor <[email protected]> | 2020-05-16 13:29:10 -0400 |
|---|---|---|
| committer | Jack O'Connor <[email protected]> | 2020-05-16 13:29:10 -0400 |
| commit | cd436251b61eded574f1a19c24674ea71eacd955 (patch) | |
| tree | 7fab76bd7cf2c2d9cbee50f2b86de23cea1cbae6 | |
| parent | e1f3043e76597ea160346d242628c89597ae2198 (diff) | |
some more clarifications in the --check docs
| -rw-r--r-- | b3sum/what_does_check_do.md | 34 |
1 files changed, 18 insertions, 16 deletions
diff --git a/b3sum/what_does_check_do.md b/b3sum/what_does_check_do.md
index 1f7bc80..57c8eaf 100644
--- a/b3sum/what_does_check_do.md
+++ b/b3sum/what_does_check_do.md
@@ -55,19 +55,19 @@ and very similar output for failure.
 
 Since the checkfile format (the regular output format of `b3sum`) is
 newline-separated text, we need to worry about what happens when a filepath
-contains a newline, or worse. Suppose we create a file named `abc[newline]def`
-(7 characters). One way to create such a file is with a Python one-liner like
+contains a newline, or worse. Suppose we create a file named `x[newline]x`
+(3 characters). One way to create such a file is with a Python one-liner like
 this:
 
 ```python
->>> open("abc\ndef", "w")
+>>> open("x\nx", "w")
 ```
 
 Here's what happens when we hash that file with `b3sum`:
 
 ```bash
-$ b3sum abc*
-\af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 abc\ndef
+$ b3sum x*
+\af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 x\nx
 ```
 
 Notice two things. First, `b3sum` puts a single `\` character at the front of
@@ -117,7 +117,7 @@ However, tragically, we *can* create a file with that byte in its name (on
 Linux at least, though not usually on macOS):
 
 ```python
->>> open(b"def\xFFghi", "w")
+>>> open(b"y\xFFy", "w")
 ```
 
 So some filepaths aren't representable in Unicode at all. Our plan to "convert
@@ -125,8 +125,8 @@ platform-specific bytes into some consistent Unicode encoding" isn't going to
 work for everything. What does `b3sum` do with the file above?
 
 ```bash
-$ b3sum def*
-af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 def�ghi
+$ b3sum y*
+af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 y�y
 ```
 
 That � in there is a "Unicode replacement character". When we run into
@@ -137,19 +137,21 @@ see a replacement character. Together with a few more details covered in the
 next section, this gives us an important set of properties:
 
 1. Any file can be hashed locally.
-2. Any file with a valid Unicode name can be checked.
-3. Checkfiles are always valid UTF-8.
-4. Checkfiles are portable between Unix and Windows.
+2. Any file with a valid Unicode name not containing the � character can be
+   checked.
+3. Checking ambiguous or unrepresentable filepaths always fails.
+4. Checkfiles are always valid UTF-8.
+5. Checkfiles are portable between Unix and Windows.
 
 ## Formal Rules
 
 1. When hashing, filepaths are represented in a platform-specific encoding,
-   which can accommodate any filepath on the current platform. (In Rust, this
-   is `OsStr`/`OsString`.)
+   which can accommodate any filepath on the current platform. In Rust, this is
+   `OsStr`/`OsString`.
 2. In output, filepaths are first converted to UTF-8. Any non-Unicode segments
-   are replaced with Unicode replacement characters. (In Rust, this is
-   `OsStr::to_string_lossy`.)
-3. Then, if a filepath contains a backslash (U+005C) or a newline (U+000A),
+   are replaced with Unicode replacement characters (U+FFFD). In Rust, this is
+   `OsStr::to_string_lossy`.
+3. Then, if a filepath contains any backslashes (U+005C) or newlines (U+000A),
    these characters are escaped as `\\` and `\n` respectively.
 4. Finally, any output line containing an escape sequence is prefixed with a
    single backslash.
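
For readers who want to try the output rules 2 through 4 from the "Formal Rules" list in the diff above, here is a minimal Python sketch. It is not `b3sum`'s implementation (which is in Rust). The names `render_filepath` and `is_checkable` are hypothetical. Python's `errors="replace"` decoding is assumed to be a reasonable stand-in for `OsStr::to_string_lossy` on Unix-style byte paths. Windows paths are not modeled.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # Unicode replacement character (U+FFFD)


def render_filepath(raw: bytes) -> Tuple[str, str]:
    """Return (line_prefix, escaped_path) for a raw Unix filepath.

    Hypothetical helper illustrating rules 2-4; not part of b3sum.
    """
    # Rule 2: convert to Unicode text, replacing undecodable bytes with U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Rule 3: escape backslashes first, then newlines, so the backslash
    # introduced by the newline escape is not escaped a second time.
    escaped = text.replace("\\", "\\\\").replace("\n", "\\n")
    # Rule 4: a line containing any escape sequence gets one leading backslash.
    prefix = "\\" if escaped != text else ""
    return prefix, escaped


def is_checkable(raw: bytes) -> bool:
    """A path whose rendered form contains U+FFFD cannot be checked."""
    return REPLACEMENT not in raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for name in (b"x\nx", b"y\xffy", b"plain.txt"):
        prefix, path = render_filepath(name)
        print(repr(name), "->", prefix + "<hash> " + path, is_checkable(name))
```

The two filenames from the diff behave as the document describes: `x[newline]x` is escaped and its line gets a leading backslash, while `y\xFFy` is rendered with a replacement character and reported as not checkable.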
