Special characters in GEF-file raise UnicodeDecodeError #6
Comments
I am not sure whether a GEF file with Dutch characters would work. For reading GEF files we use the fields described in the "Geotechnical exchange format for cpt-data".
Hi Martijn,

The CUR standard clearly states that a GEF file should consist only of characters from the ASCII character set (128 characters). The GEF file is parsed as UTF-8, the most widely used encoding on the web, which covers the characters of all languages. For compatibility reasons, the original 128 ASCII characters map to the same bytes in UTF-8.

Your GEF file is probably encoded in cp1252 (ANSI), an extension of ASCII that adds extra characters used in Western European languages. Unfortunately, these special characters map to different bytes in UTF-8 and cp1252, because cp1252 is a single-byte encoding and UTF-8 is a multi-byte encoding. In fact, the byte for ö in windows-1252 (0xf6) is not a valid byte in UTF-8 at all. That is what causes the problem; otherwise you would just get the wrong character instead of an error.

An easy fix is to open the GEF file in Notepad (Kladblok) and save it as UTF-8. The GEF file will then probably parse correctly, including the ö.

Another fix to try (in Python) is to decode the file as UTF-8 and, if that fails, catch the error, decode the file as cp1252, and re-encode it as UTF-8:

    with open('file.gef', 'rb') as fp:
        raw = fp.read()  # read the bytes once; a second fp.read() would return b''
    try:
        file_as_string = raw.decode('utf-8')
        # everything is alright, send the file to GEOLIB+
    except UnicodeDecodeError:
        # File is probably cp1252 with special characters, convert to UTF-8
        file_as_string = raw.decode('cp1252')
        file_as_bytes_utf_8 = file_as_string.encode('utf-8')
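The byte-level explanation above can be checked directly in Python. This is a minimal sketch using only built-in codecs; the variable names and the sample header line are chosen for illustration and do not come from GEOLib-Plus.

```python
# Illustrative check of how "ö" is represented in cp1252 versus UTF-8.
o_umlaut = "\u00f6"  # the character ö

cp1252_bytes = o_umlaut.encode("cp1252")  # one byte: f6
utf8_bytes = o_umlaut.encode("utf-8")     # two bytes: c3 b6
print(cp1252_bytes.hex(), utf8_bytes.hex())

# The lone byte 0xF6 is never valid in UTF-8, so decoding raises an error
# instead of silently producing a wrong character.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Pure ASCII content is byte-identical in ASCII, cp1252, and UTF-8.
ascii_line = "#GEFID= 1, 1, 0"  # sample header line, for illustration only
same_ascii = (
    ascii_line.encode("ascii")
    == ascii_line.encode("cp1252")
    == ascii_line.encode("utf-8")
)
```

This is why a file that sticks to ASCII parses fine under any of these encodings, while a single ö saved as cp1252 is enough to trigger the UnicodeDecodeError.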
Hi Maarten, Thanks for the detailed explanation! The funny part is that the #DATAFORMAT header of the GEF file says it's ASCII-encoded, as specified in the standard, even though it's clearly not 😄 I remember trying to change the file encoding, but I failed back then and switched to a different approach for the project that didn't involve this code. Somehow I currently cannot reproduce the error I initially got, even though I'm parsing the same GEF file, which is ANSI-encoded and contains the ö character. If I encounter the same problem another time, I'll try your solutions!
I encounter the same problem with GEF files created with the software of A.P. van den Berg. These may contain the degree sign (as in degrees Celsius) and the character ö in the Dutch word for coordinate. When I use chardet to determine the encoding, it points to ISO-8859-1. The code snippet above does not work in my case. I rewrote it to the following to get it working properly:
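On the chardet result: ISO-8859-1 and cp1252 assign the same characters to bytes 0xA0 through 0xFF, which covers both ° (0xB0) and ö (0xF6), so either guess decodes these particular characters identically. The sketch below only illustrates that point and is not the commenter's rewritten snippet; it uses Python's built-in codecs, and the variable names are made up for the example.

```python
# Bytes 0xA0-0xFF decode to the same characters in ISO-8859-1 and cp1252.
high_bytes = bytes(range(0xA0, 0x100))
same_upper_range = high_bytes.decode("iso-8859-1") == high_bytes.decode("cp1252")

degree = b"\xb0".decode("cp1252")    # degree sign
o_umlaut = b"\xf6".decode("cp1252")  # o with diaeresis

# The two encodings differ only in 0x80-0x9F: cp1252 puts printable
# characters there (0x80 is the euro sign), ISO-8859-1 has control codes.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin1 = b"\x80".decode("iso-8859-1")
```

For GEF files whose only non-ASCII content is ° and ö, falling back to cp1252 or to ISO-8859-1 should therefore give the same text.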
If the GEF file cannot be read with the default encoding (UTF-8), it will fall back to the cp1252 encoding. This helps to accept GEF files as commonly produced by Dutch suppliers. For a fuller description of the problem, see: Deltares/GEOLib-Plus#6 (comment)
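The fallback described here could look roughly like the following. This is a minimal sketch, not the actual GEOLib-Plus change: `read_gef_text` and its `encodings` parameter are hypothetical names, and the final `errors="replace"` step is an added assumption to cover the few bytes cp1252 leaves undefined.

```python
from pathlib import Path
import tempfile


def read_gef_text(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: decode a GEF file, trying each encoding in order."""
    raw = Path(path).read_bytes()  # read the bytes once, then retry decoding
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumption: as a last resort, replace undecodable bytes rather than fail.
    return raw.decode("cp1252", errors="replace")


# Usage: a cp1252-encoded file containing "coördinatensysteem".
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.gef"
    sample.write_bytes("co\u00f6rdinatensysteem".encode("cp1252"))
    text = read_gef_text(sample)

    sample_utf8 = Path(tmp) / "sample_utf8.gef"
    sample_utf8.write_bytes("co\u00f6rdinatensysteem".encode("utf-8"))
    text_utf8 = read_gef_text(sample_utf8)
```

Trying UTF-8 first matters: valid UTF-8 would also decode under cp1252 without an error, but as the wrong characters, so the stricter encoding has to come first.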
Dutch GEF-files may contain special characters, for example the umlaut in the word "coördinatensysteem". This raises the UnicodeDecodeError below when parsing the file, which traces back to codecs.py. Replacing the "ö" with a regular "o" solves the issue.