Query caching issues - query returning results different to Athena GUI #41

RobinL · 2021-02-01T09:25:10Z

I've run into an issue a few times where I run a query usin read_sql_query and get completely different results frmo the Athena GUI.

Restarting kernel and re-running fixes it.

I think this is likely to be a result of caching behaviour - possibly from awswrangler.

I haven't yet been able to create a fully reproducible example. The query I was running this morning was:

sql = """
select *
from data_linking_temp.nomis_csv_qa
limit 10
"""

df = pydb.read_sql_query(sql)

Note: This isn't simply a result of different rows being returned (which might be expected due to the distributed nature of the query). The query was returning the wrong columns and the wrong number of rows

The text was updated successfully, but these errors were encountered:

isichei · 2021-02-01T09:44:20Z

Interesting, it must be the ctas_approach from wrangler but I dno if we might be causing it by setting a temp database?

RobinL · 2021-02-12T15:10:43Z

Some more notes:

I was running into this issue again today. This time I was getting the right columns but in the wrong order.

Again, running the query generated by pydbtools in Athena gave the right result:

select c.postcode, d.pc_nospace, d.count_distinct_person, c.codes from mojap_de_temp_1613136638555168000.pc_with_distinct_persons as d left join mojap_de_temp_1613136638555168000.pc_with_codes as c on d.pc_nospace = c.pc_nospace 
order by d.count_distinct_person desc;

Further inspection suggests I was getting all the right data but in the wrong order. (using ctas_approach =True)

This makes sense if using the arrow reader, because presumably it reads the parquet files into pandas in any order.

I also ran the query directly in wranger.athena.read_sql_query and got the same result i.e. this doesn't seem to be an issue with the manipulations pydbtools does.

RobinL · 2021-02-12T15:11:59Z

A good start to further troubleshoot this would be to put some logging in to wrangler at the DEBUG level which outputs
1.. the SQL statement executed and
2. the args which are submitted to AWS wrangler

RobinL mentioned this issue Feb 12, 2021

Add logging #43

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query caching issues - query returning results different to Athena GUI #41

Query caching issues - query returning results different to Athena GUI #41

RobinL commented Feb 1, 2021 •

edited

Loading

isichei commented Feb 1, 2021

RobinL commented Feb 12, 2021

RobinL commented Feb 12, 2021

Query caching issues - query returning results different to Athena GUI #41

Query caching issues - query returning results different to Athena GUI #41

Comments

RobinL commented Feb 1, 2021 • edited Loading

isichei commented Feb 1, 2021

RobinL commented Feb 12, 2021

RobinL commented Feb 12, 2021

RobinL commented Feb 1, 2021 •

edited

Loading