
[C++] An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data. #31992

Open
asfimport opened this issue May 24, 2022 · 8 comments


@asfimport (Collaborator)

Hi all,

When I read a Parquet file with Arrow like the following:

auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
arrow::Status status = _reader->GetRecordBatchReader(
    {_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
status = _rb_batch->ReadNext(&_batch);

status is not OK and I get the following error:

IOError: Corrupt snappy compressed data.

When I comment out this statement:

_reader->set_use_threads(true);

the program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with _reader->set_use_threads(true); reading a single column does not trigger it.

The test Parquet file was created with pyarrow. It has a single row group with 3,000,000 records and 20 columns, of both int and string types.

You can create a test Parquet file using the attached Python script.

In my case, I read the columns at indices 0, 1, 2, 3, 4, 5, and 6.
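
The Python attachment referenced above is not included in this migrated issue. As a rough stand-in only (not the reporter's script), a file of the same shape - one row group, 3,000,000 rows, a mix of int and string columns, snappy compression - could be written with the Arrow C++ API; the column names and value patterns below are illustrative assumptions:

#include <memory>
#include <string>
#include <vector>

#include "arrow/api.h"
#include "arrow/io/api.h"
#include "parquet/arrow/writer.h"
#include "parquet/properties.h"

arrow::Status WriteTestFile(const std::string& path) {
  constexpr int64_t kNumRows = 3000000;
  std::vector<std::shared_ptr<arrow::Field>> fields;
  std::vector<std::shared_ptr<arrow::Array>> columns;
  for (int col = 0; col < 20; ++col) {
    std::shared_ptr<arrow::Array> array;
    if (col % 2 == 0) {
      // Even columns: int64 values.
      arrow::Int64Builder builder;
      for (int64_t i = 0; i < kNumRows; ++i) {
        ARROW_RETURN_NOT_OK(builder.Append(i));
      }
      ARROW_RETURN_NOT_OK(builder.Finish(&array));
      fields.push_back(arrow::field("int_col_" + std::to_string(col), arrow::int64()));
    } else {
      // Odd columns: string values.
      arrow::StringBuilder builder;
      for (int64_t i = 0; i < kNumRows; ++i) {
        std::string value = "value_" + std::to_string(i);
        ARROW_RETURN_NOT_OK(builder.Append(value.data(), static_cast<int32_t>(value.size())));
      }
      ARROW_RETURN_NOT_OK(builder.Finish(&array));
      fields.push_back(arrow::field("str_col_" + std::to_string(col), arrow::utf8()));
    }
    columns.push_back(array);
  }
  auto table = arrow::Table::Make(arrow::schema(fields), columns, kNumRows);
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  auto props = parquet::WriterProperties::Builder()
                   .compression(parquet::Compression::SNAPPY)
                   ->build();
  // A chunk size equal to the row count keeps everything in a single row group.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    kNumRows, props);
}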

The file is read using C++ with Arrow 7.0.0 and snappy 1.1.8.

The file is written using Python 3.8 with pyarrow 7.0.0.

Looking forward to your reply

Thank you!

@pitrou 

@westonpace  

Environment: C++, Arrow 7.0.0, snappy 1.1.8, Arrow 8.0.0, pyarrow 7.0.0, Ubuntu 9.4.0, Python 3.8

Reporter: yurikoomiga

Original Issue Attachments:

Externally tracked issue: #13186

Note: This issue was originally created as ARROW-16642. Please see the migration documentation for further details.

@asfimport (Collaborator, Author)

Weston Pace / @westonpace:
You might need to provide a few more details on how you are reading the parquet file. I used the python script you provided to create a file /home/pace/test.parquet which I then tested with this script:


#include <iostream>

#include "arrow/filesystem/api.h"
#include "arrow/record_batch.h"

#include "parquet/api/reader.h"
#include "parquet/arrow/reader.h"

int main() {
  auto fs = std::make_unique<arrow::fs::LocalFileSystem>();
  auto input_file = fs->OpenInputFile("/home/pace/test.parquet").ValueOrDie();

  std::unique_ptr<parquet::arrow::FileReader> file_reader;
  arrow::Status st = parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(),
      parquet::ParquetFileReader::Open(input_file), &file_reader);
  if (!st.ok()) {
    std::cerr << "Error making file reader: " << st << std::endl;
    return -1;
  }
  std::vector<int> parquet_column_ids = {0, 1, 2, 3, 4, 5, 6};
  std::cout << "The file has " << file_reader->num_row_groups() << " row groups"
            << std::endl;
  for (int row_group_idx = 0; row_group_idx < file_reader->num_row_groups();
       row_group_idx++) {
    std::cout << "Reading row group: " << row_group_idx << std::endl;
    std::shared_ptr<arrow::RecordBatchReader> record_batch_reader;
    st = file_reader->GetRecordBatchReader({row_group_idx}, parquet_column_ids,
                                           &record_batch_reader);
    file_reader->set_batch_size(65536);
    file_reader->set_use_threads(true);
    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      st = record_batch_reader->ReadNext(&batch);
      if (st.ok()) {
        if (!batch) {
          // Reached the end of the row group
          break;
        }
        std::cout << "  Read in record batch with " << batch->num_rows()
                  << " rows" << std::endl;
      } else {
        std::cerr << "Error encountered reading record batch: " << st
                  << std::endl;
        return -2;
      }
    }
  }
}

I did not get any errors and got the expected output:


The file has 1 row groups
Reading row group: 0
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 50880 rows

Does my test program work in your environment?

@4ertus2 commented Dec 25, 2024

SnappyCodec::Decompress(), called from SerializedPageReader::DecompressIfNeeded(), fails if input_len == 0.
This happens when all values in a column are NULL.
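
For reference, the same failure can be reproduced at the codec level without any Parquet file. The following is only an illustrative sketch of the call SerializedPageReader ends up making, not code from the issue:

#include <cstdint>
#include <iostream>

#include "arrow/util/compression.h"

int main() {
  // Create the Snappy codec through the generic factory.
  auto codec = arrow::util::Codec::Create(arrow::Compression::SNAPPY).ValueOrDie();

  uint8_t input = 0;  // never read: input_len is 0
  uint8_t output[16];
  // Zero-length input, as produced by the page described above.
  // Without the patch below this fails with
  // "IOError: Corrupt snappy compressed data."
  auto result = codec->Decompress(/*input_len=*/0, &input,
                                  /*output_buffer_len=*/sizeof(output), output);
  if (!result.ok()) {
    std::cerr << result.status() << std::endl;
  } else {
    std::cout << "decompressed " << *result << " bytes" << std::endl;
  }
  return 0;
}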

@4ertus2 commented Dec 25, 2024

diff --git a/cpp/src/arrow/util/compression_snappy.cc b/cpp/src/arrow/util/compression_snappy.cc
index 731fdfd13..b862c6a24 100644
--- a/cpp/src/arrow/util/compression_snappy.cc
+++ b/cpp/src/arrow/util/compression_snappy.cc
@@ -43,6 +43,9 @@ class SnappyCodec : public Codec {
  public:
   Result<int64_t> Decompress(int64_t input_len, const uint8_t* input,
                              int64_t output_buffer_len, uint8_t* output_buffer) override {
+    if (!input_len) {
+      return 0;
+    }
     size_t decompressed_size;
     if (!snappy::GetUncompressedLength(reinterpret_cast<const char*>(input),
                                        static_cast<size_t>(input_len),

@kou (Member) commented Dec 29, 2024

Could you open a pull request with a test?
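
For illustration, a regression test along these lines (the test name and placement are assumptions, loosely following the existing codec tests) could look roughly like:

#include <gtest/gtest.h>

#include "arrow/testing/gtest_util.h"
#include "arrow/util/compression.h"

namespace arrow {
namespace util {

TEST(TestSnappyCodec, DecompressZeroLengthInput) {
  ASSERT_OK_AND_ASSIGN(auto codec, Codec::Create(Compression::SNAPPY));
  uint8_t buffer[8];
  // With the fix, a zero-length compressed input decompresses to zero bytes
  // instead of failing with "Corrupt snappy compressed data".
  ASSERT_OK_AND_ASSIGN(int64_t decompressed_size,
                       codec->Decompress(/*input_len=*/0, buffer,
                                         /*output_buffer_len=*/sizeof(buffer), buffer));
  ASSERT_EQ(decompressed_size, 0);
}

}  // namespace util
}  // namespace arrow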

@wgtmac (Member) commented Dec 30, 2024

IIRC, levels are compressed together with the values. If all values are NULL, the page must still have definition levels encoded and compressed. In any case, the compressed length should not be 0. The fix itself looks reasonable to me.
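
As a side note, a suspect file's column chunks can be inspected through the Parquet metadata API to see whether a column is entirely NULL and whether it is snappy-compressed. The sketch below is illustrative only, and the file name is a placeholder:

#include <iostream>

#include "parquet/api/reader.h"

int main() {
  // "suspect.parquet" is a placeholder path.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("suspect.parquet");
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    auto row_group = metadata->RowGroup(rg);
    for (int col = 0; col < row_group->num_columns(); ++col) {
      auto chunk = row_group->ColumnChunk(col);
      std::cout << "row group " << rg << ", column " << col << ": snappy="
                << (chunk->compression() == parquet::Compression::SNAPPY ? "yes" : "no")
                << ", num_values=" << chunk->num_values()
                << ", total_compressed_size=" << chunk->total_compressed_size();
      if (chunk->is_stats_set() && chunk->statistics()->HasNullCount()) {
        std::cout << ", null_count=" << chunk->statistics()->null_count();
      }
      std::cout << std::endl;
    }
  }
  return 0;
}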

@4ertus2 commented Jan 2, 2025

> Could you open a pull request with a test?

Would it be enough if I put the buggy Parquet file here instead? :)

snappy_bug.parquet.gz

The file was made by the Java library:

Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)

@wgtmac (Member) commented Jan 3, 2025

@4ertus2 Do you mind opening a pull request against https://github.com/apache/parquet-testing to add this file?

@4ertus2 commented Jan 9, 2025
