Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLAlchemy getIndexes() returns columns from partitions metadata table, provoking warning messages on reflection #518

Open
1 task done
metadaddy opened this issue Jan 15, 2025 · 0 comments

Comments

@metadaddy
Copy link
Member

Expected behavior

  • SQLAlchemy reflection on an Iceberg table should not emit warnings.
  • partition should not be returned as an index on an Iceberg table.

Actual behavior

  • When performing reflection on an Iceberg table, SQLAlchemy's Inspector._reflect_indexes() method displays a series of warnings, one for each of the columns in the Iceberg table's $partitions metadata table:

    /Users/ppatterson/src/trino-indexes/test.py:8: SAWarning: index key 'partition' was not located in columns for table 'drivestats'
    metadata.reflect(bind=engine)
    /Users/ppatterson/src/trino-indexes/test.py:8: SAWarning: index key 'record_count' was not located in columns for table 'drivestats'
    metadata.reflect(bind=engine)
    /Users/ppatterson/src/trino-indexes/test.py:8: SAWarning: index key 'file_count' was not located in columns for table 'drivestats'
    metadata.reflect(bind=engine)
    /Users/ppatterson/src/trino-indexes/test.py:8: SAWarning: index key 'total_size' was not located in columns for table 'drivestats'
    metadata.reflect(bind=engine)
    /Users/ppatterson/src/trino-indexes/test.py:8: SAWarning: index key 'data' was not located in columns for table 'drivestats'
    metadata.reflect(bind=engine)
    
  • After reflection, the table object for the Iceberg table contains an index named partition with no columns.

The above problems are caused by _get_partitions() (called by TrinoDialect.get_indexes()), which assumes that, if SELECT * FROM {schema}."{table_name}$partitions" succeeds, then the table is a Hive table.

This assumption is not correct - the query also succeeds for Iceberg tables, but the returned columns are not indexes on the Iceberg table. The client should probably query system.metadata.catalogs to get the connector_name for the catalog and check that it is "hive".

Steps To Reproduce

The following test app causes the warnings to be shown. pip install trino[sqlalchemy] before running it.

from sqlalchemy import create_engine, MetaData

# "iceberg" is an Iceberg catalog containing a single namespace, "drivestats",
# which contains a single table, "drivestats".
engine = create_engine('trino://admin@localhost:8080/iceberg/drivestats')

metadata = MetaData()

print('Calling reflect()')
metadata.reflect(engine)

print('\nListing tables:')
# What tables can we see?
for t in metadata.sorted_tables:
    print(f'Table name: {t.name}')
    # What indexes are on the table?
    for idx in metadata.tables[t.name].indexes:
        print(f'\tIndex name: {idx.name}')
        # What are the index columns?
        for c in idx.columns:
            print(f'\t\tColumn name: {c}')

Expected output:

Calling reflect()

Listing tables:
Table name: drivestats

Current output:

[Note that the warnings are sent to stderr, so they typically appear out of sequence with the app output, which is sent to stdout.]

Calling reflect()

Listing tables:
Table name: drivestats
	Index name: partition
/Users/ppatterson/src/trino-indexes/test.py:10: SAWarning: index key 'partition' was not located in columns for table 'drivestats'
  metadata.reflect(bind=engine)
/Users/ppatterson/src/trino-indexes/test.py:10: SAWarning: index key 'record_count' was not located in columns for table 'drivestats'
  metadata.reflect(bind=engine)
/Users/ppatterson/src/trino-indexes/test.py:10: SAWarning: index key 'file_count' was not located in columns for table 'drivestats'
  metadata.reflect(bind=engine)
/Users/ppatterson/src/trino-indexes/test.py:10: SAWarning: index key 'total_size' was not located in columns for table 'drivestats'
  metadata.reflect(bind=engine)
/Users/ppatterson/src/trino-indexes/test.py:10: SAWarning: index key 'data' was not located in columns for table 'drivestats'
  metadata.reflect(bind=engine)

Log output

n/a

Operating System

RHEL 9.5 (Plow) in Docker

Trino Python client version

0.332.0

Trino Server version

467

Python version

Python 3.12.2

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant