Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] st_read fails to read an Excel file from an S3 path #182

Closed
tboddyspargo opened this issue Nov 16, 2023 · 2 comments · Fixed by #187
Closed

[BUG] st_read fails to read an Excel file from an S3 path #182

tboddyspargo opened this issue Nov 16, 2023 · 2 comments · Fixed by #187

Comments

@tboddyspargo
Copy link

tboddyspargo commented Nov 16, 2023

I've been trying to use st_read by passing the same style of file paths that I would use with read_csv_auto: s3://bucket_name/path/file.extension. However, This doesn't seem to work. I'm currently using duckdb 0.9.2, but I have observed the same failures in 0.9.0 and 0.9.1.

The following will fail:

import duckdb

conn = duckdb.connect(database=':memory:')
conn.execute('SET s3_endpoint="localhost:8333";')
conn.execute('SET s3_url_style="path";')
conn.execute('SET s3_use_ssl=false;')
conn.execute('INSTALL spatial; LOAD spatial;')
r = conn.execute("SELECT * FROM st_read('s3://appdata/global/sba_sample.xlsx', layer='sba_sample')")

print(r.fetchdf())

In this case, http://localhost:8333 is a local seaweedfs container compliant with the S3 API (and sba_sample.xlsx exists at that path).

The exception that gets raised is:

Traceback (most recent call last):
  File "tmp/excel_example.py", line 8, in <module>
    r = conn.execute("SELECT * FROM st_read('s3://appdata/global/sba_sample.xlsx', layer='sba_sample')")
duckdb.duckdb.IOException: IO Error: GDAL Error (4): s3://appdata/global/sba_sample.xlsx: Undefined error: 0
@Maxxen
Copy link
Member

Maxxen commented Nov 20, 2023

Hi! Thanks for reporting this issue. This is indeed a regression. Im working on a fix, but in the meantime you should be able to circumvent this by using the GDAL s3 filesystem instead, e.g. by prefixing the path with /vsis3/

@Maxxen
Copy link
Member

Maxxen commented Nov 22, 2023

Hi! I think this has been fixed in #187.

Note: for XLSX in particular using DuckDB's filesystem with httpfs has a slower initial load time than /vsis3/ or /vsicurl/. Im aware of the issue and will look into improving it more in the future.

In a couple of hours (when the CI finishes) you should be able to install the extension with this fix for 0.9.2 by running:

FORCE INSTALL spatial FROM 'http://nightly-extensions.duckdb.org';

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants