-
Notifications
You must be signed in to change notification settings - Fork 1.3k
backup+util: Prevent too long backup file names with hashing + resolve file #3806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I think it would be nice to include part of the path (maybe like last 10 characters or something) so the name gives at least some clue about the origin in case someone looks at the backups manually. |
|
I check the The paths will be stored in base64 at |
|
I thought about using |
|
I read the issue and reread the first comment. I totally ignored the "long path" part. I thought the issue was related only to UTF-8. |
internal/util/util.go
Outdated
| if _, err := os.Stat(md5sum); err == nil { | ||
| return md5sum | ||
| runes := []rune(filepath.Base(path)) | ||
| truncBaseName := string(runes[len(runes)-Min(len(runes), 16):]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the last 16 characters, not the first 16 characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested last N characters so the file extension is kept, but in this implementation the md5 hash is after the truncated name. I would probably swap the parts (<md5>_<last16> instead of <last16>_<md5>).
|
Human-readability and (especially) decodability are indeed solid arguments again this new approach. If the user accidentally discovers that there is a backup of some file saved in So, we need an encoding that would be:
The original pre-#3273 encoding (i.e. simply replacing slashes and colons with So, an example encoding satisfying all 4 (i.e. similar to the original one but unambiguous):
(I picked BTW the original encoding was escaping both |
If we desire escaping to be always unnecessary, this may not be a valid reason to avoid picking specific characters, since
Micro cannot automatically use the same backups as described in the former use case, since the FS structure and path syntax is too different between Unix and Windows. However I agree with the latter reason of providing the ability to store the same config directory in either OSs, since this will prevent errors on Windows. |
|
I didn't research or read #3273 due to the amount of time it will take, but I came up with 2 other methods on letting the user know the original path. Both may not be reliable, but try to be cross-platform and meet the conditions @dmaluka mentioned.
Not really related, but I hope there's a tool to search comments in a PR like #3273. Edit: The reason I came up with the 2 methods above, is to completely avoid any path length limitation which can be encountered while using other programs. |
From my point of view 3 isn't that important as long as 1, 2 & 4 are fulfilled. 3 results out of 2 in case there is a mechanism to easily do that.
...might work, while...
...doesn't due to the expansion by [BACKUP_DIR]+[MAX_POSSIBLE_PATH_LENGTH]. When we are at it with "compression", what about DEFLATEing the target path? |
The person looking at the file names will most likely be completely unaware that there's a command to decode them, and even if they know about it they wouldn't know which one(s) to decode. It's a lot of extra friction just to handle a rare edge case. |
|
I am quite certain that I might end up with backups that I don't recognize, and I'm unsure how to recover them without using Another idea is to use a hash file similar to those used in GitHub releases. If a |
Isn't it reflected with the following block? Lines 56 to 66 in d7e43d4
Long paths aren't rare...they're quite usual in build environments heavily nested with generated file names.
It is human readable, but is it interpretable without checking the documentation or code? It can't be used for c&p without modification either. The additional lookup file could solve this, but needs some more logic to be kept in sync. |
I'm not insisting on
Ah, indeed. |
Sure. It is not hard to guess that |
|
What is not quite clear to me: the original bug report in #3794 says:
@yuhoocom are you sure this is accurate? From my brief googling it looks like with LongPathsEnabled the maximum path length is 32767 characters. It doesn't sound plausible that the URL encoding caused reducing this limit by 300 times. Even in the worst case (the path consisting entirely of 4-byte unicode characters) it would only reduce it by 12 times, right? |
Maximum file path length may be increased to 32767 but maximum file name length is typically still only 255. According to Microsoft Learn article on maximum path length limitation:
|
|
Aaa, got it... we are talking about the length of the backup file name, which encodes a file path, but itself is a file name, not a file path. So |
|
Exactly and...
...results in... vim (OK): micro (NOK): Exchanging one character with two is already something we should prevent, since it unnecessarily extends the resulting file name. That was the reason why I came up with something like shorting resp. encoding the path/name, but I agree that it is not trivial to recover. So which options do we have?
|
|
So, seems like the only options to satisfy everyone are the following (or their variations):
Now, regarding option 2: is there a compression algorithm with a compression ratio of 128 (32768 / 256)? :) |
|
I don't think so, since 100(%) is already done by the |
A lossless algorithm that reduces the size of every possible input is mathematically impossible (pigeonhole principle), the compression ratio is irrelevant. I would suggest keeping the old method (URL encoding) as-is for the happy path, but when the result is longer than
The advantage of this approach is that the behavior doesn't change at all in most cases, and even when the path is long the filename will be mostly human readable. If we use the full Footnotes
|
|
Ok, maybe that's the way to go. Small notes:
Or the original path?
Using ...That being said,
We already deprecated one encoding once (while providing backward compatibility support for it), we can do that again. :) And I'm not very comfortable with keeping using URL encoding, since, as we've realized, it is pretty pointless to use. We really only need to encode the file separators ( |
I guess it doesn't make a difference, either one should work.
Wouldn't paths normally start with something like
I think URL encoding is simple enough to encode/decode that it's not a big deal, and it could prevent bugs under some (admittedly unlikely) circumstances. If the backups are stored on a different file system than the file you're editing it could have different restrictions. On FAT32 filenames can't use Footnotes
|
Good point. Ok, let's keep using URL encoding then. |
The intention was to truncate to the length of
In case we store a LUT we can point to the (target) In general: |
|
Yes, I think so too. |
|
Wouldn't the LUT need to be read everytime before doing backup, due to the possibility that other Micro instances have written to it? It's the reason I suggested to write the path in a separate file (maybe in
Sorry, may I ask what do you specifically mean by collecting backup files at startup and syncing them with the LUT? |
I'll answer with your suggestion...
...because it is even better to do it just before writing. Anyway we have to check all available backups present in the backup directory and need to check if all of them are available in the LUT and unnecessary entries must be deleted. It is quite a lot of new logic, which is only needed in the scenarios where someone needs to take care of these backup files manually. Footnotes
|
Most probably we don't need to...assuming the presence of a backup implies, that this backup was created by
Might be something to think about. |
bc7419e to
f6a4f7c
Compare
|
I pushed the first draft for tests. |
|
I haven't followed the recent discussions, but it sounded like the complexity and overhead (and fragility) of maintaining a LUT file is not worth it, i.e. just |
It isn't enough, LUT is provided to let users or programs determine the complete target path of a backup. |
|
Sorry, please ignore the coment I just linked, since it might be referring to an earlier suggestion with less information in the backup filename. |
No we have at least a proof of concept. What I'm still struggling with is if we need to truncate the full "happy path" when it exceeds the filename limit?
What do you think? |
I'm not sure what you mean by "happy path" here, but I think @Andriamanitra by "happy path" meant precisely the case when the URL-encoded path does not exceed the filename limit. What else could it mean? ...Anyway, why won't we explore other options? For example, we could store the full path inside the backup file itself. Or, probably better, we could store it in a separate file per backup file. For example, in And thus avoid scalability issues of having a single LUT file for all backups. And, in line with what @Andriamanitra suggested, we don't need to do that in the happy case when the encoded path doesn't exceed the filename limit, so in most of real cases the behavior wouldn't change from the existing behavior. |
Or speaking of names, indeed, like @JoeKar noted, why need to stuff a part of the path into the backup file name if the full path is stored in a separate file anyway. So in such case the backup file name could be e.g. just |
Sounds like the best option so far. I'll throw the LUT away... |
f6a4f7c to
2663b91
Compare
DetermineEscapePath() with precedence and rename it to DeterminePath()2663b91 to
e31e593
Compare
internal/util/util.go
Outdated
| name := filepath.Base(path) | ||
| if len(name) > fileNameLengthLimit { | ||
| dir := filepath.Dir(path) | ||
| path = filepath.Join(dir, HashStringMd5(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the hash, without any extension?
Can't we just adjust DetermineEscapePath() to use fileNameLengthLimit - len(".micro-backup") instead of fileNameLengthLimit instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. No need to process this twice.
Only benefit was, to have a bit longer filename...now we hash earlier, even the escaped path doesn't exceed the max filename limit.
e31e593 to
a92b5f0
Compare
f3daf64 to
eb0bb07
Compare
|
|
||
| newPath := b.Path != filename | ||
| if newPath { | ||
| b.RemoveBackup() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
...in case the escaped path exceeds the file name length limit
… file Since full escaped backup paths can become longer than the maximum filename size and hashed filenames cannot be restored it is helpful to have a lookup file for the user to resolve the hashed path.
eb0bb07 to
bab3907
Compare
This will solve the problem unable to save backups with very long paths due to URL escaped UTF8 characters.
If we think SHA256 is the better option over MD5 (which it should be for new functions) then we simply need to rename it to SHA256.
Fixes #3794