UTF8 Support? #23

Dino804 · 2020-04-30T21:56:49Z

I am still quite new to programming on this architecture, but I have a suggestion:
ANSI is quite nice, but it does not support special characters, umlauts, or other diacritics.
It would be great for other languages if you could implement UTF8 (maybe instead of ANSI?), please.
Thank you very much for this great, genius project, for keeping the server code up to date, and for working on a new database.

tvdijen · 2020-04-30T23:02:48Z

I think you’re mixing up ANSI and ASCII

lisdude · 2020-05-01T00:22:09Z

There's been some work done on this in the UTF-8 branch (https://github.com/lisdude/toaststunt/tree/utf8), but that hasn't been touched in a while, so it might take a bit of effort to merge all of the new changes in. It works well enough (or did) on a test MOO, but it's never been used in production. You're welcome to try it, though!

Going forward, I think it would make sense to start using std::string for everything instead of putting more work into these patches. It's a fairly huge change, though, so that could take some time to get going. I'm not positive which direction I'll go just yet.

Either way, this is definitely on the todo list!

elena-v2 · 2020-08-05T20:32:38Z

std::string won't help much in that matter, unfortunately. Even with C++20, there's not really anything resembling utf8 support in any way that's useful for MOO. You'll still either have to roll your own solutions to those problems à la the patch from 2002, or use something like ICU or utf8proc.

There are a number of things that the patch doesn't properly tackle, like input sanitization. As-is, that particular patch lets you slap down more or less any character from any encoding without a care in the world, which isn't the best.

As far as libraries go, utf8proc is far lighter weight and serviceable enough, while ICU is an overwrought behemoth that uses utf16 natively but is slowly getting some utf8 support added in. We used the 2002 patch for a really long time, but we ended up starting the transition to ICU a while back.

After all that's said and done, an unfortunate fact to keep in mind is that most MUD clients either don't support utf8 at all or severely halfass it.

NathanTech7713 · 2021-12-08T14:08:59Z

I wanted to weigh in on this one as I had a few questions of my own.

As @elena-v2 mentioned, part of the problem with the utf8 patch is that it lets you slap down any characters, so you have to think about serious checking so you don't end up with a character on your mud called bob😄😄😄😄.

I was wondering if either @elena-v2 or @lisdude could give some pointers on where to look in terms of networking and such.

For instance, I'd imagine half the problem of implementing utf8 is that characters can be more than one byte; the pound sign '£', for example, is two bytes.
I'm assuming this would cause issues with builtins like index, which would return wrong numbers until taught to behave with utf8, but it led me to wonder about ISO 8859-1 (Latin-1), which is single byte.
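
To make the byte-versus-character point concrete, here is a minimal sketch, not taken from the server source. The names `utf8_length` and `utf8_byte_offset` are hypothetical, and both functions assume the input is already valid UTF-8. They show how a 1-based character index, as MOO uses for strings, differs from a byte offset once '£' is in the string.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Convert a 1-based code point index into a 0-based byte offset.
// Returns std::string::npos if the string has fewer code points.
std::size_t utf8_byte_offset(const std::string &s, std::size_t index)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (++seen == index)
                return i;
        }
    }
    return std::string::npos;
}
```

With the string "£5" encoded as UTF-8 (`"\xC2\xA3" "5"`), `size()` reports 3 bytes while `utf8_length` reports 2 characters, and the second character starts at byte offset 2. Both functions scan linearly, which is where the performance concern with indexing comes from.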

I was looking at network.cc and the big hold-up here seems to be isgraph().

How badly would a reimplementation of isgraph to support the characters of ISO 8859-1 (Latin-1) break everything else, especially memory-wise, which I know nothing about?

Nathan

NathanTech7713 · 2021-12-08T14:12:32Z

As an addendum to my above, Windows-1252 seems to have taken over from the above mentioned encoding, source:
https://en.wikipedia.org/wiki/Windows-1252

As an additional edit, I just tried changing this line in network.cc:

if (isgraph(c) || c == ' ' || c == '\t')

To:

if ((c>=20 && c<=247) || c == ' ' || c == '\t')

It booted up fine and nicely printed £ and other characters like the ñ in Español, but I'm not sure of the further implications.
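
For comparison, here is a hedged sketch of what a Latin-1-aware check might look like. The ranges come from the ISO 8859-1 layout, not from network.cc, and `is_latin1_printable` is a hypothetical name. One thing it highlights: the decimal bounds 20 and 247 in the line above also let through control codes 20-31 (including ESC, 27), DEL (127), and the C1 range 128-159, while rejecting 248-255 (ø through ÿ).

```cpp
// Hypothetical helper, not taken from network.cc.
// Accepts the ISO 8859-1 graphic ranges: 0x20-0x7E (ASCII, including
// space) and 0xA0-0xFF. Rejects 0x00-0x1F, 0x7F (DEL), and the C1
// control range 0x80-0x9F.
bool is_latin1_printable(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) || c >= 0xA0;
}
```

The condition would then read something like `is_latin1_printable(c) || c == '\t'`. Taking an `unsigned char` matters because bytes above 0x7F are negative when `char` is signed. Windows-1252 additionally assigns printable characters to most of 0x80-0x9F, so a server targeting that code page would need a wider range.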

elena-v2 · 2021-12-08T16:10:08Z

Actually, the issue I mentioned isn't someone with a name composed entirely of valid utf8 graphemes. That's fine. From the server-development side of things, I personally don't care about that at all. (Mind you, I very much care about that in games that I work on, but that's not what's being discussed here.) If MOO administrators want to care about that, that's their own business. The issue arises when someone inputs something that's not valid utf8 in the first place - the current patch doesn't check for that.
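
As an illustration of what "valid utf8" means at the byte level, here is a minimal validation sketch. It is not code from the patch or from any fork; `is_valid_utf8` is a hypothetical name, and the rules it applies are the standard well-formedness ones (correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string &s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += len;
    }
    return true;
}
```

A check like this would accept "pokémon" sent as UTF-8 (`"pok\xC3\xA9mon"`) and reject the same word sent as Latin-1 (`"pok\xE9mon"`), because 0xE9 announces a three-byte sequence that never arrives.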

I honestly can't see any problems with supporting Latin-1 from the MOO side of things - it's basically just an even lazier version of that utf8 patch that's been floating around for two decades. From the UX side, though, the huge elephant in the room is that you absolutely cannot guarantee that a player's locale is going to match yours - you're going to run into players with mojibake issues a lot sooner than you might otherwise anticipate. And funnily enough, people erroneously using Latin-1 and its ilk is actually one of the big concerns when it comes to invalid utf8 strings! The server has to correct for that, otherwise you're going to have a bad time.

Anyways, string indexing isn't a huge deal on the whole; keeping it reasonably performant is the much larger issue. The modifications I made to add ICU support to our fork of MOO eventually got to where it was nearly the same speed in most scenarios, but I never got around to proper input sanitization/correction (i.e. detecting what codepage the user is on and shuffling stuff forward and back as need be) - some real-life health issues cropped up and it ended up falling by the wayside. Another consideration that I addressed on our fork was letting strings match/index in less strict grapheme terms. For example, something with 'ß' would count as both 'ß' and 'ss'.
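
To illustrate the looser matching idea in the simplest possible form, here is a sketch that folds 'ß' to "ss" before comparing. This is not the ICU-based code from the fork, which isn't shown in this thread. `fold_sharp_s` and `loose_equal` are hypothetical names, the input is assumed to be valid UTF-8, and only this one character is handled.

```cpp
#include <cstddef>
#include <string>

// Replace every UTF-8 encoded 'ß' (U+00DF, bytes 0xC3 0x9F) with "ss".
// In valid UTF-8 this byte pair can only ever mean U+00DF.
std::string fold_sharp_s(const std::string &s)
{
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() &&
            static_cast<unsigned char>(s[i]) == 0xC3 &&
            static_cast<unsigned char>(s[i + 1]) == 0x9F) {
            out += "ss";
            ++i;  // skip the second byte of the pair
        } else {
            out += s[i];
        }
    }
    return out;
}

// Compare two strings after folding, so "Straße" matches "Strasse".
bool loose_equal(const std::string &a, const std::string &b)
{
    return fold_sharp_s(a) == fold_sharp_s(b);
}
```

A real implementation would more likely lean on a Unicode library's folding rules than hand-roll one character at a time, but the shape is the same: normalize both sides, then compare.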

NathanTech7713 · 2021-12-08T16:55:29Z

@elena-v2 thanks for the detailed answer. I admit I didn't even realise one could type invalid latin1 strings. Do you have any articles I could read on that as a reference?

elena-v2 · 2021-12-12T05:02:41Z

I think wires got crossed somewhere - invalid latin-1 isn't the issue. That's not 'really' a thing as far as the computer's concerned. The core issue at hand is mojibake - where the codepages don't match between systems so different users end up seeing wildly varying things. When I mentioned invalid strings I explicitly meant invalid utf8.

As an example, someone inputting 'pokémon' on a latin-1/win-1252 system will result in an invalid/malformed string on a utf8 system, 'pokИmon' on a KOI-8 system, or 'pok駑on' on a Shift JIS system if you don't do anything to ensure that the encoding that you're sending out is what the user's client expects. Similarly, if you type out 'pokémon' on a utf8 system, one would get 'pokÃ©mon', 'pokц╘mon', and 'pokﾃｩmon', respectively.
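
The 'pokÃ©mon' case can be reproduced in a few lines. The sketch below converts ISO 8859-1 bytes to UTF-8; `latin1_to_utf8` is a hypothetical name, and it assumes plain Latin-1 (Windows-1252 differs in the 0x80-0x9F range and would need a lookup table for those bytes).

```cpp
#include <string>

// Re-encode ISO 8859-1 bytes as UTF-8. Each Latin-1 byte maps to the code
// point with the same value, so bytes >= 0x80 become two-byte sequences.
std::string latin1_to_utf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation
        }
    }
    return out;
}
```

Given Latin-1 input, 'é' (0xE9) becomes the bytes 0xC3 0xA9, which is the correct UTF-8 form. Applying the same conversion to text that is already UTF-8 treats 0xC3 and 0xA9 as the separate Latin-1 characters 'Ã' and '©' and encodes each again, which is how 'pokÃ©mon' ends up on screen.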

Really, at minimum you want to detect what codepage the user is using so you can convert their input such that the encoding is consistent on the MOO side, and optimally you would want to transliterate to nearest-match graphemes from the MOO to the user so they have the best/most consistent experience possible. Once you have users other than yourself, the code page they're using on their system with their client is out of your control, and you will absolutely run into people who are running a different locale than you rather quickly, even with people who only speak English and are even in the same country as you.
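
One rough heuristic for the "detect and convert" step, sketched under stated assumptions: text containing Latin-1 high bytes is rarely valid UTF-8, so the server can keep any byte run that looks like a UTF-8 sequence and treat everything else as Latin-1. The names `utf8_seq_len` and `normalize_to_utf8` are hypothetical, and this is not how any existing patch or fork does it. It also cannot tell Latin-1 apart from KOI-8, Shift JIS, or any other legacy encoding.

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8-shaped sequence starting at s[i], or 0 if the bytes
// there do not form one. Structural check only (lead byte range plus
// continuation bytes); it does not reject every overlong or surrogate form.
static std::size_t utf8_seq_len(const std::string &s, std::size_t i)
{
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (b0 < 0x80) len = 1;
    else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF) len = 3;
    else if (b0 >= 0xF0 && b0 <= 0xF4) len = 4;
    else return 0;
    if (len > s.size() - i) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
            return 0;
    }
    return len;
}

// Keep UTF-8-shaped sequences as they are; re-encode any other byte as if
// it were ISO 8859-1 (an assumption about the sender, not a detection).
std::string normalize_to_utf8(const std::string &in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t len = utf8_seq_len(in, i);
        if (len > 0) {
            out.append(in, i, len);
            i += len;
        } else {
            // Only bytes >= 0x80 reach this branch.
            const unsigned char c = static_cast<unsigned char>(in[i]);
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
            ++i;
        }
    }
    return out;
}
```

With this, 'pokémon' arrives in the same stored form whether the client sent it as UTF-8 or as Latin-1. Output is the harder half: the server still has to know what each client expects before sending text back.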

In short, a simple modification like that will absolutely work and won't actually break things from a system perspective but it has a lot of caveats and pitfalls from the user experience side of things that you should be aware of before moving forward with using high ASCII graphemes. I could prioritize finishing up the changes to support utf8 that I was working on if there's any actual interest - right now it works but I don't feel it's quite ready for prime time just yet... and it would add a pretty heavy dependency in the form of ICU4C, though I will admit that could be made optional by making unicode support optional.

Apologies for the long reply, but this is something I have quite a lot of experience with... and even at its most basic level, i18n is by no means a simple subject. In any case, I hope it proved to be helpful, even if it might have been discouraging to hear.

NathanTech7713 · 2021-12-12T16:24:14Z

That would definitely be something I'd be really interested in, as it's something I get a lot of push for from the users on a semi-regular basis. I admit, having read through your explanation, it's far more complex than I originally thought, even just to support single byte characters, which is a setback in some regards. I mean, how would you even go about detecting a person's encoding without asking? That comparatively simple problem was enough to halt my brain in its tracks.
