How are bytestrings encoded? #134

unhammer · 2024-02-10T09:26:09Z

λ> let bø = [fmt|ø|] :: ByteString; in T.putStrLn $ "ø vs " <> [fmt|{T.decodeUtf8With T.lenientDecode bø}|]
ø vs �

λ> ([fmt|ø|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")


λ> ([fmt|{T.pack "ø"}|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")

("ø" in UTF-8 is 0xC3 0xB8 which is 195 184)

Does fmt encode bytestrings using some other encoding?

My current workaround is to restrict it to text

{-# LANGUAGE TemplateHaskell #-}
module PyF.Foo where
import Language.Haskell.TH.Quote
import Data.Text qualified as T
import PyF
import PyF.Internal.QQ (Config (..))
t :: QuasiQuoter
t =
  mkFormatter
    "t"
    ( fmtConfig
        { postProcess = \e -> [|T.pack $(e)|]
        }
    )

The text was updated successfully, but these errors were encountered:

guibou · 2024-04-22T17:55:24Z

Hey, that's super interesting. Thank you for the report and sorry for the delay (I kinda missed the notification).

I had a look and to be honest it is unclear for me.

The code [fmt|ø|] is actually spliced as fromString "ø", as simple as it can be, it really behaves like a "string literal" in the context of OverloadedString.

Note that if you do not use OverloadedString, you are forced to manually do the conversion and there won't be any problem if you pass by text. However, if you use Data.ByteString.Char8.pack, it will return \248. I'm unable to find a documentation on what is the encoding used by Data.ByteString.Char8.pack.

Your workaround is great.

All of that being said, I'm really wondering if PyF should really work with ByteString. For sure it is convenient, but encoding issues are obvious and we never really know how to deal with them.

Maybe it could be convenient to introduce monomorphised quasi quoters with precise encoding semantic, such as fmtAsUtf8ByteString.

I'll be happy to have your input on that, and especially, in which context did you tried to use pyf to generate ByteString which may contain utf8 chars?

unhammer · 2024-04-23T07:54:15Z

Maybe it could be convenient to introduce monomorphised quasi quoters with precise encoding semantic, such as fmtAsUtf8ByteString.

That might be useful, though I'm not sure how to read that (is the "input" assumed to be utf8 or the output?).

I think at least it would be good to have some kind of note about this in the docs, perhaps that one should avoid using plain fmt when producing bytestrings which may contain non-ASCII.

I'll be happy to have your input on that, and especially, in which context did you tried to use pyf to generate ByteString which may contain utf8 chars?

It's been some months so I can't recall exactly, but it may have been that I was appending a bytestring with show Int to create another bytestring (as a key for rocksdb)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How are bytestrings encoded? #134

How are bytestrings encoded? #134

unhammer commented Feb 10, 2024

guibou commented Apr 22, 2024

unhammer commented Apr 23, 2024 •

edited

Loading

How are bytestrings encoded? #134

How are bytestrings encoded? #134

Comments

unhammer commented Feb 10, 2024

guibou commented Apr 22, 2024

unhammer commented Apr 23, 2024 • edited Loading

unhammer commented Apr 23, 2024 •

edited

Loading