String Encoding UTF vs Latin1 and incompatibility between Soulseek.NET and Nicotine #777
Replies: 6 comments 16 replies
-
I did originally switch to sending messages encoded as latin1 by default, and I think I might have suggested this to @jpdillingham back then too. The idea was improved compatibility with Soulseek NS: nicotine-plus/nicotine-plus#760 Eventually I figured out that there were downsides with this approach, as mentioned in the linked commit: nicotine-plus/nicotine-plus@1328753 I continued using latin1 by default for search strings, since there were no apparent side effects, and Soulseek NS clients sent more search results. It seems like I also enabled this for FolderContentsRequest messages later, since Soulseek NS clients were sending empty responses in some cases: nicotine-plus/nicotine-plus@574ff03 It also seems like I summarized my changes in a comment a while ago: #438 (comment) |
Beta Was this translation helpful? Give feedback.
-
Thank you very much. Using UTF for transfers and falling back to Latin1 on certain failure messages seems to me the most robust solution. I do want to make a note that for Soulseek.NET when sending a TransferRequest with Latin1 encoded filename to a Nicotine client in the failure case outlined above, that a timeout occurs rather than a TransferRejectedException. This is due to the fact that the QueueFailed response from the remote client will have the unintended filename (that resulted in UTF8 decoding a Latin1 encoded string) and therefore it will not match any download or WaitKey. Here: I don't know an elegant fix for this other than maybe checking the received filename both against the transfer Filename and the result of the transfer Filename Latin1 encoded and UTF8 decoded, but I just thought I would make note of it. |
Beta Was this translation helpful? Give feedback.
-
Another proposed solution (again, FWIW), which is elegant in the sense that it is transparent to the remote client: When receiving a filename, make note of how you decode it (requires a boolean per filename). When sending the filename back to the remote client (for transfer request, folder contents) encode it in the same way you decoded it. In that sense the remote client will get the same byte sequence back and they will know what to do with it, it doesn't matter if we decoded it using a code page no one has ever heard of, as long as we encode it back using that same code page resulting in the same byte sequence the remote client will understand it. Because really the fundamental problem is a client decoding one way and encoding a different way resulting in a different byte sequence (in the problem case in the OP it was decoding in UTF8 and encoding in Latin1). In my testing as long as you send the remote client the same filename byte sequence they sent you, they will be able to understand the request. |
Beta Was this translation helpful? Give feedback.
-
I made some changes in #778 and want to check my thinking. This PR is more about enablement and changing the default behavior; the fix for the root problem (ambiguity in filename encoding) will come later. In this PR, I'm:
Still to do:
In a subsequent PR:
|
Beta Was this translation helpful? Give feedback.
-
Which version of Nicotine+ did you test with, and which OS and file system? I tried your example file names, downloaded them with slskd from a Nicotine+ client running the latest version, but couldn't reproduce the issue. |
Beta Was this translation helpful? Give feedback.
-
I have put a lot of thought and testing into this in the last few days, with most of those changes being visible in the history of #778. I have:
These changes ensure that modern clients are maximally compatible by defaulting to UTF-8, and break compatibility for some files served by Soulseek NS clients. I have put a lot of thought and investigation into what it will take to detect encoding and retain that information along with search and browse responses, and allow clients to pass it back in to ensure correct behavior, and frankly it just isn't something I'm interested in doing at this point. To do this I would need to pass encoding along with strings all over the library, refactor a ton of tested and hardened code, break a large chunk of my ~2100 unit tests, and burn what would likely be several weeks of time that could otherwise be spent developing features for slskd. I'm not saying I'll never do this, and if someone wanted to take on the work I'd be happy to review a proposal, but my thinking at this time is that it's simply not worth the disruption or the time. |
Beta Was this translation helpful? Give feedback.
-
When requesting a file using Soulseek.NET from a client using Nicotine there is a filename encoding issue that can occur causing the transfer to fail.
For background - Soulseek.NET encodes strings in Latin1 by default and only if the encoding fails encodes strings in UTF8. Soulseek.NET decodes strings it receives from UTF8 and only if the decoding fails uses Latin1. Nicotine encodes strings using UTF8 (other than when doing a FileSearch, RoomSearch, or FolderContentsRequest) and decodes strings using UTF8 and only if that fails uses Latin1.
This means that if a Soulseek.NET client does a TransferRequest to a Nicotine client with a filename that is has a valid encoding in Latin1 that can be decoded in UTF8 but where the resulting string is different depending on the encoding, it will fail.
For example -
Good Case - I request file "Ü.jpg", Soulseek.NET encodes it using Latin1 (which succeeds as it is valid Latin1), Nicotine decodes it using UTF8 but it is not valid UTF8 and so it uses Latin1 as a fallback. The Nicotine client sees "Ü.jpg" and can send the file.
Bad Case - I request file "eÌe.jpg", Soulseek.NET encodes it using Latin1 (which succeeds as it is valid Latin1), Nicotine decodes it using UTF8 and it is valid UTF8, however interpreted as UTF8 it becomes "ée.jpg" and the transfer fails as the user does not have said file.
So basically, my question is, is there a downside to always encoding the request in UTF8? Concerns re SoulseekNS or others? Tagging @mathiascode as they likely have useful info with change made here: nicotine-plus/nicotine-plus@1328753
Beta Was this translation helpful? Give feedback.
All reactions