Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query caching issues - query returning results different to Athena GUI #41

Open
RobinL opened this issue Feb 1, 2021 · 3 comments
Open

Comments

@RobinL
Copy link
Member

RobinL commented Feb 1, 2021

I've run into an issue a few times where I run a query usin read_sql_query and get completely different results frmo the Athena GUI.

Restarting kernel and re-running fixes it.

I think this is likely to be a result of caching behaviour - possibly from awswrangler.

I haven't yet been able to create a fully reproducible example. The query I was running this morning was:

sql = """
select *
from data_linking_temp.nomis_csv_qa
limit 10
"""

df = pydb.read_sql_query(sql)

Note: This isn't simply a result of different rows being returned (which might be expected due to the distributed nature of the query). The query was returning the wrong columns and the wrong number of rows

@isichei
Copy link
Contributor

isichei commented Feb 1, 2021

Interesting, it must be the ctas_approach from wrangler but I dno if we might be causing it by setting a temp database?

@RobinL
Copy link
Member Author

RobinL commented Feb 12, 2021

Some more notes:

I was running into this issue again today. This time I was getting the right columns but in the wrong order.

Again, running the query generated by pydbtools in Athena gave the right result:

select c.postcode, d.pc_nospace, d.count_distinct_person, c.codes from mojap_de_temp_1613136638555168000.pc_with_distinct_persons as d left join mojap_de_temp_1613136638555168000.pc_with_codes as c on d.pc_nospace = c.pc_nospace 
order by d.count_distinct_person desc;

Further inspection suggests I was getting all the right data but in the wrong order. (using ctas_approach =True)

This makes sense if using the arrow reader, because presumably it reads the parquet files into pandas in any order.

I also ran the query directly in wranger.athena.read_sql_query and got the same result i.e. this doesn't seem to be an issue with the manipulations pydbtools does.

@RobinL
Copy link
Member Author

RobinL commented Feb 12, 2021

A good start to further troubleshoot this would be to put some logging in to wrangler at the DEBUG level which outputs
1.. the SQL statement executed and
2. the args which are submitted to AWS wrangler

@RobinL RobinL mentioned this issue Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants