
[C++] An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data. #31992

Open
asfimport opened this issue May 24, 2022 · 8 comments


@asfimport (Collaborator)

Hi all,

When I read a Parquet file with Arrow like the following:

auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
arrow::Status status = _reader->GetRecordBatchReader(
    {_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
status = _rb_batch->ReadNext(&_batch);

status is not OK and I get the following error:

IOError: Corrupt snappy compressed data.

When I comment out this statement:

_reader->set_use_threads(true);

the program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with _reader->set_use_threads(true); reading a single column does not trigger it.

The test Parquet file was created with pyarrow. It has a single row group with 3,000,000 records and 20 columns, of both int and string types.

You can create a test Parquet file using the attached Python script.

In my case, I read the columns at indices 0, 1, 2, 3, 4, 5, and 6.
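
The Python attachment referenced above is not included in this migrated issue. As a rough stand-in only (not the reporter's script), a file of the same shape - one row group, 3,000,000 rows, a mix of int and string columns, snappy compression - could be written with the Arrow C++ API; the column names and value patterns below are illustrative assumptions:

#include <memory>
#include <string>
#include <vector>

#include "arrow/api.h"
#include "arrow/io/api.h"
#include "parquet/arrow/writer.h"
#include "parquet/properties.h"

arrow::Status WriteTestFile(const std::string& path) {
  constexpr int64_t kNumRows = 3000000;
  std::vector<std::shared_ptr<arrow::Field>> fields;
  std::vector<std::shared_ptr<arrow::Array>> columns;
  for (int col = 0; col < 20; ++col) {
    std::shared_ptr<arrow::Array> array;
    if (col % 2 == 0) {
      // Even columns: int64 values.
      arrow::Int64Builder builder;
      for (int64_t i = 0; i < kNumRows; ++i) {
        ARROW_RETURN_NOT_OK(builder.Append(i));
      }
      ARROW_RETURN_NOT_OK(builder.Finish(&array));
      fields.push_back(arrow::field("int_col_" + std::to_string(col), arrow::int64()));
    } else {
      // Odd columns: string values.
      arrow::StringBuilder builder;
      for (int64_t i = 0; i < kNumRows; ++i) {
        std::string value = "value_" + std::to_string(i);
        ARROW_RETURN_NOT_OK(builder.Append(value.data(), static_cast<int32_t>(value.size())));
      }
      ARROW_RETURN_NOT_OK(builder.Finish(&array));
      fields.push_back(arrow::field("str_col_" + std::to_string(col), arrow::utf8()));
    }
    columns.push_back(array);
  }
  auto table = arrow::Table::Make(arrow::schema(fields), columns, kNumRows);
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  auto props = parquet::WriterProperties::Builder()
                   .compression(parquet::Compression::SNAPPY)
                   ->build();
  // A chunk size equal to the row count keeps everything in a single row group.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    kNumRows, props);
}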

The file is read using C++ with Arrow 7.0.0 and snappy 1.1.8.

The file is written using Python 3.8 with pyarrow 7.0.0.

Looking forward to your reply

Thank you!

@pitrou 

@westonpace  

Environment: C++, Arrow 7.0.0, snappy 1.1.8, Arrow 8.0.0, pyarrow 7.0.0, Ubuntu 9.4.0, Python 3.8

Reporter: yurikoomiga

Original Issue Attachments:

Externally tracked issue: #13186

Note: This issue was originally created as ARROW-16642. Please see the migration documentation for further details.

@asfimport (Collaborator, Author)

Weston Pace / @westonpace:
You might need to provide a few more details on how you are reading the parquet file. I used the python script you provided to create a file /home/pace/test.parquet which I then tested with this script:


#include <iostream>

#include "arrow/filesystem/api.h"
#include "arrow/record_batch.h"

#include "parquet/api/reader.h"
#include "parquet/arrow/reader.h"

int main() {
  auto fs = std::make_unique<arrow::fs::LocalFileSystem>();
  auto input_file = fs->OpenInputFile("/home/pace/test.parquet").ValueOrDie();

  std::unique_ptr<parquet::arrow::FileReader> file_reader;
  arrow::Status st = parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(),
      parquet::ParquetFileReader::Open(input_file), &file_reader);
  if (!st.ok()) {
    std::cerr << "Error making file reader: " << st << std::endl;
    return -1;
  }
  std::vector<int> parquet_column_ids = {0, 1, 2, 3, 4, 5, 6};
  std::cout << "The file has " << file_reader->num_row_groups() << " row groups"
            << std::endl;
  for (int row_group_idx = 0; row_group_idx < file_reader->num_row_groups();
       row_group_idx++) {
    std::cout << "Reading row group: " << row_group_idx << std::endl;
    std::shared_ptr<arrow::RecordBatchReader> record_batch_reader;
    st = file_reader->GetRecordBatchReader({row_group_idx}, parquet_column_ids,
                                           &record_batch_reader);
    file_reader->set_batch_size(65536);
    file_reader->set_use_threads(true);
    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      st = record_batch_reader->ReadNext(&batch);
      if (st.ok()) {
        if (!batch) {
          // Reached the end of the row group
          break;
        }
        std::cout << "  Read in record batch with " << batch->num_rows()
                  << " rows" << std::endl;
      } else {
        std::cerr << "Error encountered reading record batch: " << st
                  << std::endl;
        return -2;
      }
    }
  }
}

I did not get any errors and got the expected output:


The file has 1 row groups
Reading row group: 0
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 50880 rows

Does my test program work in your environment?

@4ertus2 commented Dec 25, 2024

SnappyCodec::Decompress(), called from SerializedPageReader::DecompressIfNeeded(), fails if input_len == 0.
This happens when all values in a column are NULL.
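
For reference, the same failure can be reproduced at the codec level without any Parquet file. The following is only an illustrative sketch of the call SerializedPageReader ends up making, not code from the issue:

#include <cstdint>
#include <iostream>

#include "arrow/util/compression.h"

int main() {
  // Create the Snappy codec through the generic factory.
  auto codec = arrow::util::Codec::Create(arrow::Compression::SNAPPY).ValueOrDie();

  uint8_t input = 0;  // never read: input_len is 0
  uint8_t output[16];
  // Zero-length input, as produced by the page described above.
  // Without the patch below this fails with
  // "IOError: Corrupt snappy compressed data."
  auto result = codec->Decompress(/*input_len=*/0, &input,
                                  /*output_buffer_len=*/sizeof(output), output);
  if (!result.ok()) {
    std::cerr << result.status() << std::endl;
  } else {
    std::cout << "decompressed " << *result << " bytes" << std::endl;
  }
  return 0;
}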

@4ertus2 commented Dec 25, 2024

diff --git a/cpp/src/arrow/util/compression_snappy.cc b/cpp/src/arrow/util/compression_snappy.cc
index 731fdfd13..b862c6a24 100644
--- a/cpp/src/arrow/util/compression_snappy.cc
+++ b/cpp/src/arrow/util/compression_snappy.cc
@@ -43,6 +43,9 @@ class SnappyCodec : public Codec {
  public:
   Result<int64_t> Decompress(int64_t input_len, const uint8_t* input,
                              int64_t output_buffer_len, uint8_t* output_buffer) override {
+    if (!input_len) {
+      return 0;
+    }
     size_t decompressed_size;
     if (!snappy::GetUncompressedLength(reinterpret_cast<const char*>(input),
                                        static_cast<size_t>(input_len),

@kou (Member) commented Dec 29, 2024

Could you open a pull request with a test?
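
For illustration, a regression test along these lines (the test name and placement are assumptions, loosely following the existing codec tests) could look roughly like:

#include <gtest/gtest.h>

#include "arrow/testing/gtest_util.h"
#include "arrow/util/compression.h"

namespace arrow {
namespace util {

TEST(TestSnappyCodec, DecompressZeroLengthInput) {
  ASSERT_OK_AND_ASSIGN(auto codec, Codec::Create(Compression::SNAPPY));
  uint8_t buffer[8];
  // With the fix, a zero-length compressed input decompresses to zero bytes
  // instead of failing with "Corrupt snappy compressed data".
  ASSERT_OK_AND_ASSIGN(int64_t decompressed_size,
                       codec->Decompress(/*input_len=*/0, buffer,
                                         /*output_buffer_len=*/sizeof(buffer), buffer));
  ASSERT_EQ(decompressed_size, 0);
}

}  // namespace util
}  // namespace arrow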

@wgtmac (Member) commented Dec 30, 2024

IIRC, levels are compressed together with the values. If all values are NULL, the page must still have definition levels encoded and compressed. In any case, the compressed length should not be 0. The fix itself looks reasonable to me.
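
As a side note, a suspect file's column chunks can be inspected through the Parquet metadata API to see whether a column is entirely NULL and whether it is snappy-compressed. The sketch below is illustrative only, and the file name is a placeholder:

#include <iostream>

#include "parquet/api/reader.h"

int main() {
  // "suspect.parquet" is a placeholder path.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("suspect.parquet");
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    auto row_group = metadata->RowGroup(rg);
    for (int col = 0; col < row_group->num_columns(); ++col) {
      auto chunk = row_group->ColumnChunk(col);
      std::cout << "row group " << rg << ", column " << col << ": snappy="
                << (chunk->compression() == parquet::Compression::SNAPPY ? "yes" : "no")
                << ", num_values=" << chunk->num_values()
                << ", total_compressed_size=" << chunk->total_compressed_size();
      if (chunk->is_stats_set() && chunk->statistics()->HasNullCount()) {
        std::cout << ", null_count=" << chunk->statistics()->null_count();
      }
      std::cout << std::endl;
    }
  }
  return 0;
}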

@4ertus2 commented Jan 2, 2025

> Could you open a pull request with a test?

Would it be enough if I put the buggy Parquet file here instead? :)

snappy_bug.parquet.gz

The file was made by the Java library:

Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)

@wgtmac (Member) commented Jan 3, 2025

@4ertus2 Do you mind opening a pull request against https://github.com/apache/parquet-testing to add this file?

@4ertus2 commented Jan 9, 2025
