How are bytestrings encoded? #134

Open
unhammer opened this issue Feb 10, 2024 · 2 comments

@unhammer

λ> let bø = [fmt|ø|] :: ByteString; in T.putStrLn $ "ø vs " <> [fmt|{T.decodeUtf8With T.lenientDecode bø}|]
ø vs �

λ> ([fmt|ø|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")


λ> ([fmt|{T.pack "ø"}|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")

("ø" in UTF-8 is 0xC3 0xB8 which is 195 184)

Does fmt encode bytestrings using some other encoding?


My current workaround is to restrict it to Text:

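{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE TemplateHaskell #-}
module PyF.Foo where

import Data.Text qualified as T
import Language.Haskell.TH.Quote
import PyF
import PyF.Internal.QQ (Config (..))

-- | Like fmt, but the formatted result is always passed through T.pack,
-- so the quasi-quoter can only produce Text.
t :: QuasiQuoter
t =
  mkFormatter
    "t"
    ( fmtConfig
        { postProcess = \e -> [|T.pack $(e)|]
        }
    )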
@guibou
Owner

guibou commented Apr 22, 2024

Hey, that's super interesting. Thank you for the report and sorry for the delay (I kinda missed the notification).

I had a look and, to be honest, it is still unclear to me.

The code [fmt|ø|] is actually spliced as fromString "ø". It is as simple as it can be and behaves exactly like a string literal under OverloadedStrings.

Note that if you do not use OverloadedStrings, you are forced to do the conversion manually, and there won't be any problem if you go through Text. However, if you use Data.ByteString.Char8.pack, it returns \248. I was unable to find documentation on which encoding Data.ByteString.Char8.pack uses.
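
For reference, the bytestring documentation states that the Data.ByteString.Char8 functions truncate every Char to 8 bits, and it warns that fromString for ByteString truncates in the same way. That would explain the \248: "ø" is U+00F8, so it becomes the single byte 0xF8 instead of its two-byte UTF-8 encoding. A minimal sketch to compare the conversions, assuming the bytestring and text packages are available (the helper names char8Bytes, fromStringBytes and utf8Bytes are made up for illustration):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.String (fromString)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Bytes produced by Data.ByteString.Char8.pack:
-- each Char is truncated to the low 8 bits of its code point.
char8Bytes :: String -> [Word8]
char8Bytes = BS.unpack . BC.pack

-- Bytes produced by the IsString instance of ByteString,
-- which is what a string literal uses under OverloadedStrings.
fromStringBytes :: String -> [Word8]
fromStringBytes s = BS.unpack (fromString s)

-- Bytes produced by going through Text and encoding as UTF-8.
utf8Bytes :: String -> [Word8]
utf8Bytes = BS.unpack . TE.encodeUtf8 . T.pack
```

Loaded in GHCi, char8Bytes "ø" and fromStringBytes "ø" both give [248], while utf8Bytes "ø" gives [195,184], matching the numbers in the report above. Anything above U+00FF loses information: char8Bytes "€" gives [172], whereas utf8Bytes "€" gives [226,130,172].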

Your workaround is great.

All of that being said, I really wonder whether PyF should work with ByteString at all. It is certainly convenient, but encoding issues clearly arise and we never really know how to deal with them.

Maybe it would be convenient to introduce monomorphised quasi-quoters with precise encoding semantics, such as fmtAsUtf8ByteString.

I'd be happy to have your input on that. In particular, in what context did you try to use PyF to generate a ByteString that may contain UTF-8 characters?

@unhammer
Author

unhammer commented Apr 23, 2024

Maybe it would be convenient to introduce monomorphised quasi-quoters with precise encoding semantics, such as fmtAsUtf8ByteString.

That might be useful, though I'm not sure how to read that name (is the "input" assumed to be UTF-8, or the output?).
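
One possible reading is that the input is an ordinary Unicode String, as produced by the formatter, and only the output is UTF-8 encoded. A rough sketch of that reading, modelled on the Text workaround above: the module name Utf8Fmt and the helper utf8FromString are hypothetical, and fmtAsUtf8ByteString is only the name proposed in this thread, not something PyF is known to export. It assumes mkFormatter, fmtConfig and postProcess behave as they do in the workaround.

```haskell
{-# LANGUAGE TemplateHaskell #-}
-- Hypothetical module; only mkFormatter, fmtConfig and postProcess
-- are taken from the workaround earlier in this thread.
module Utf8Fmt where

import Data.ByteString (ByteString)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Language.Haskell.TH.Quote (QuasiQuoter)
import PyF
import PyF.Internal.QQ (Config (..))

-- Hypothetical helper: encode the formatted String as UTF-8 bytes.
utf8FromString :: String -> ByteString
utf8FromString = TE.encodeUtf8 . T.pack

-- Sketch of a monomorphic quoter whose result is always a
-- UTF-8 encoded strict ByteString.
fmtAsUtf8ByteString :: QuasiQuoter
fmtAsUtf8ByteString =
  mkFormatter
    "fmtAsUtf8ByteString"
    ( fmtConfig
        { postProcess = \e -> [|utf8FromString $(e)|]
        }
    )
```

Because of the Template Haskell stage restriction, the quoter has to be used from a different module than the one defining it. Under these assumptions, [fmtAsUtf8ByteString|ø|] should produce the bytes 195 184 rather than 248.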

I think it would at least be good to have a note about this in the docs, perhaps saying that one should avoid plain fmt when producing bytestrings that may contain non-ASCII characters.

I'd be happy to have your input on that. In particular, in what context did you try to use PyF to generate a ByteString that may contain UTF-8 characters?

It's been some months, so I can't recall exactly, but I may have been appending show of an Int to a bytestring to create another bytestring (as a key for RocksDB).
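
For illustration only, that kind of key construction can be sketched without fmt. The function names mkKey and mkKeyShow and the "user:" prefix are hypothetical, not the original code. The numeric part is plain ASCII digits (plus a possible minus sign), so truncation cannot corrupt it; only a non-ASCII prefix written as a literal would hit the issue above.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper: append the decimal rendering of an Int to an
-- existing ByteString prefix, using a Builder.
mkKey :: ByteString -> Int -> ByteString
mkKey prefix n =
  BL.toStrict (BB.toLazyByteString (BB.byteString prefix <> BB.intDec n))

-- The same key via show. Char8.pack is safe here only because the
-- output of show for an Int is pure ASCII.
mkKeyShow :: ByteString -> Int -> ByteString
mkKeyShow prefix n = prefix <> BC.pack (show n)
```

Both functions leave the prefix bytes untouched, so a prefix that is already UTF-8 encoded stays valid UTF-8.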
