Feature: Handling Non-UTF-8 Encodings #287

Open
iMMIQ opened this issue Jun 25, 2024 · 1 comment
iMMIQ commented Jun 25, 2024

Background:

Currently, when svlint encounters a file that is not UTF-8 encoded, it reports an error and the whole program exits, even when that file is only one entry in a file list. I would like svlint to detect the encoding automatically and read such files. Even if a file's encoding cannot be handled, the main program should not terminate.
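For reference, a minimal sketch of the underlying failure, using only the standard library: `std::fs::read_to_string` returns an `io::Error` of kind `InvalidData` when the file's bytes are not valid UTF-8. The helper `read_error_kind` and the temporary file name are hypothetical and exist only for this illustration; they are not part of svlint.

```rust
use std::fs;
use std::io::ErrorKind;

/// Write `bytes` to a temporary file, read it back with `read_to_string`,
/// and return the kind of error produced, if any.
/// Hypothetical helper for illustration only.
fn read_error_kind(bytes: &[u8]) -> Option<ErrorKind> {
    let path = std::env::temp_dir()
        .join(format!("non_utf8_demo_{}.sv", std::process::id()));
    fs::write(&path, bytes).expect("failed to write temporary file");
    let result = fs::read_to_string(&path);
    // Best-effort cleanup; a failure to delete is not interesting here.
    let _ = fs::remove_file(&path);
    result.err().map(|e| e.kind())
}

fn main() {
    // 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
    let latin1 = b"// caf\xE9\nmodule m; endmodule\n";
    println!("Latin-1 file: {:?}", read_error_kind(latin1));
    println!("ASCII file:   {:?}", read_error_kind(b"module m; endmodule\n"));
}
```

When such an error is propagated with `?`, as in the current code, it ends the whole run rather than just skipping the one file.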

Proposed Solutions:

  1. Replace the read_to_string call in src/main.rs with the following code:

    ~/code/svlint$ git diff
    diff --git a/Cargo.toml b/Cargo.toml
    index 2097796..08efc02 100644
    --- a/Cargo.toml
    +++ b/Cargo.toml
    @@ -41,6 +41,8 @@ sv-parser               = "0.13.3"
    term                    = "0.7"
    toml                    = "0.8"
    sv-filelist-parser      = "0.1.3"
    +chardetng               = "0.1.17"
    +encoding_rs             = "0.8.34"
    
    [build-dependencies]
    regex   = "1"
    diff --git a/src/main.rs b/src/main.rs
    index 70bda82..a13200a 100644
    --- a/src/main.rs
    +++ b/src/main.rs
    @@ -2,8 +2,9 @@ use anyhow::{Context, Error};
    use clap::{Parser, CommandFactory};
    use clap_complete;
    use enquote;
    +use chardetng::EncodingDetector;
    use std::collections::HashMap;
    -use std::fs::{read_to_string, File, OpenOptions};
    +use std::fs::{File, OpenOptions};
    use std::io::{Read, Write};
    use std::path::{Path, PathBuf};
    use std::{env, process};
    @@ -275,7 +276,16 @@ pub fn run_opt_config(printer: &mut Printer, opt: &Opt, config: Config) -> Resul
                // by textrules to reset their internal state.
                let _ = linter.textrules_check(TextRuleEvent::StartOfFile, &path, &0);
    
    -            let text: String = read_to_string(&path)?;
    +            let mut file = File::open(&path)?;
    +            let mut buffer = Vec::new();
    +            
    +            file.read_to_end(&mut buffer)?;
    +            let mut detector = EncodingDetector::new();
    +            detector.feed(&buffer, true);
    +            let decoded = detector.guess(None, true).decode(&buffer).0;
    +            
    +            let text = decoded.into_owned();
                let mut beg: usize = 0;
    
                // Iterate over lines in the file, applying each textrule to each

    This change might slow down file reading slightly, which I consider acceptable.

  2. Provide a runtime option, such as --guess-encoding. When the option is set, use the code above; otherwise, continue using read_to_string.

  3. If read_to_string fails because a file is not valid UTF-8, do not abort with an error. Instead, treat it as a rule violation (the source file must be saved in UTF-8 encoding) and report the file path. Without the path, it is difficult to locate the problematic file in a complex nested file list.
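A rough sketch of how options 2 and 3 could fit together, under stated assumptions: `ReadOutcome` and `classify` are hypothetical names, the `guess_encoding` argument mirrors the suggested `--guess-encoding` option, and `String::from_utf8_lossy` is only a standard-library stand-in for the chardetng-based decoding in the diff above (it replaces invalid bytes rather than detecting the real encoding). The point is the control flow: a non-UTF-8 file yields a reportable outcome with its path instead of ending the run.

```rust
/// Outcome of decoding one source file.
/// Hypothetical type for illustration; not part of svlint.
#[derive(Debug, PartialEq)]
enum ReadOutcome {
    /// The bytes were valid UTF-8.
    Utf8(String),
    /// The bytes were not valid UTF-8 but were decoded anyway (option 2).
    Guessed(String),
    /// The bytes were not valid UTF-8; report as a violation (option 3).
    NotUtf8 { valid_up_to: usize },
}

/// Classify the raw bytes of a file without ever aborting.
fn classify(bytes: Vec<u8>, guess_encoding: bool) -> ReadOutcome {
    match String::from_utf8(bytes) {
        Ok(text) => ReadOutcome::Utf8(text),
        Err(err) if guess_encoding => {
            // Stand-in for encoding detection: lossy decoding, not chardetng.
            let text = String::from_utf8_lossy(err.as_bytes()).into_owned();
            ReadOutcome::Guessed(text)
        }
        Err(err) => ReadOutcome::NotUtf8 {
            // Offset of the first byte that is not valid UTF-8.
            valid_up_to: err.utf8_error().valid_up_to(),
        },
    }
}

fn main() {
    // In-memory stand-ins for the entries of a file list (paths are made up).
    let files: Vec<(&str, Vec<u8>)> = vec![
        ("rtl/good.sv", b"module good; endmodule\n".to_vec()),
        ("rtl/latin1.sv", b"// caf\xE9\nmodule bad; endmodule\n".to_vec()),
    ];
    for (path, bytes) in files {
        match classify(bytes, false) {
            ReadOutcome::Utf8(text) | ReadOutcome::Guessed(text) => {
                println!("{}: linting {} bytes", path, text.len());
            }
            ReadOutcome::NotUtf8 { valid_up_to } => {
                // Report the path and keep going with the next file.
                println!("{}: not valid UTF-8 (first invalid byte at offset {})",
                         path, valid_up_to);
            }
        }
    }
}
```

Keeping the strict UTF-8 path as the default means files that are already valid pay no detection cost, and the violation report always carries the file path.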

@DaveMcEwan
Contributor

Sounds sensible to me. @iMMIQ, would you open a PR with some test cases? I guess your tests should be similar to the filelist tests.
