Commit 90a3dcb
doc: Strengthen and adjust
The `copy-royal` algorithm maintains the patterns and "shape" of
text sufficiently to keep diffs the same (in the vast majority of
cases). It is used in `internal-tools` to help prepare test cases
with what is important and relevant to a regression test of diff
behavior, rather than the exact original repository content in a
tree that has been found to trigger a bug. It avoids needless
verbatim reproduction, while preserving aspects that are useful and
necessary for testing. It keeps the focus on patterns, preventing
irrelevant details of code in a tree that triggered a bug from
being confused with the logic of gitoxide itself, and makes it less
likely to be touched inadvertently in efforts to fix bugs or
improve style (which, in test data, would cause subtle breakage).
Although these benefits are substantial and we intend to continue
using copy-royal in the preparation of test cases as needed if or
when regressions arise, some of the guidance and rationale we had
given for its use was inaccurate or misleading. Most importantly,
copy-royal cannot be used in practice to redact sensitive
information: if you have a repository whose contents should not be
made public, then it is not safe to share the output of copy-royal
run on that repository either.
Copy-royal is implemented (roughly speaking) by mapping alphabetic
characters down to ten letters. This removes some information, at
least in principle: that is, if it were given totally random
letters as input, then it would be impossible to reverse it to get
those letters back. Even on input that is much more structured and
predictable, such as real-world input, it obfuscates it, making it
look garbled and nonsensical. However, even when one intuitively
feels that it has destroyed information, it is possible to reverse
it in many cases, and possibly even in all practical cases.
The reason is that, in real world source code and natural language,
some sequences of letters are overwhelmingly more likely to occur
than others, both in general and (especially) contextually given
what surrounding text is present. The information that is removed
by mapping into ten letters could often be reconstructed by:
1. Building a grammar of possible inputs, which can be done in a
simple manner by translating the copy-royal output one wishes to
reverse into a regular expression in which every symbol in the
copy-royal output becomes a character class of characters that
map to it. In effect, for every output of the copy-royal
algorithm, there is a regex that matches the possible inputs.
2. Predicting, stepwise, what code or text is likely to have arisen
that matches that grammar. In principle this could be done with
a variety of techniques or even manually. But one fruitful
approach would be to use an autoregressive large language model,
and apply constrained decoding[1] to sample only logits
consistent with the regex. Small experiments carried out so far
suggest[2] this to be a workable technique when combined with
beam search[3]. (This technique does not require the specific
text or code being reconstructed to have existed when the model
was trained.)
Accordingly, this modifies the documentation of copy-royal to avoid
claiming that the input of copy-royal cannot be recovered, or
anything that recommends or may appear to recommend the use of
copy-royal to redact sensitive information. It also clarifies and
adjusts the explanation of when it makes sense to use copy-royal,
and describes some of its benefits that do not rely on the
assumption that it is infeasible (or even difficult) to reverse.
In the comment documenting `BlameCopyRoyal`, which is among those
edited in the above ways, this also edits its top line to make
clear more generally how `BlameCopyRoyal` relates to `git blame`.
[1]: https://github.com/Saibo-creator/Awesome-LLM-Constrained-Decoding
[2]: See link(s) in #2180
[3]: https://en.wikipedia.org/wiki/Beam_search
Co-authored-by: Sebastian Thiel <sebastian.thiel@icloud.com>copy-royal usage guidance and caveats1 parent 442f800 commit 90a3dcb
1 file changed
+25
-16
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
16 | 16 | | |
17 | 17 | | |
18 | 18 | | |
19 | | - | |
| 19 | + | |
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | | - | |
| 23 | + | |
| 24 | + | |
24 | 25 | | |
25 | | - | |
26 | | - | |
27 | | - | |
28 | | - | |
29 | | - | |
30 | | - | |
31 | | - | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
32 | 29 | | |
33 | 30 | | |
34 | 31 | | |
| |||
59 | 56 | | |
60 | 57 | | |
61 | 58 | | |
62 | | - | |
63 | | - | |
| 59 | + | |
| 60 | + | |
64 | 61 | | |
65 | 62 | | |
66 | 63 | | |
67 | | - | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
68 | 76 | | |
69 | | - | |
70 | | - | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
71 | 80 | | |
72 | 81 | | |
73 | 82 | | |
| |||
93 | 102 | | |
94 | 103 | | |
95 | 104 | | |
96 | | - | |
97 | | - | |
| 105 | + | |
| 106 | + | |
98 | 107 | | |
99 | 108 | | |
100 | 109 | | |
| |||
0 commit comments