[BUG] tests/test_builders.py::test_builder_build Flaky #292

Closed
charles-turner-1 opened this issue Dec 4, 2024 · 9 comments · Fixed by #293
Labels
bug Something isn't working

Comments

@charles-turner-1
Collaborator

charles-turner-1 commented Dec 4, 2024

Describe the bug

We were getting some weird test behaviour in #291 - and after a bit of digging, I noticed we are getting a CI failure on main.

I couldn't figure out exactly why, because I couldn't reproduce the failure locally, so I deleted and recloned the repo, which reproduced the failure. Running the test suite a couple more times then resulted in it passing, with no apparent changes to the repo's file structure.

To Reproduce

  1. Start with a fresh install - either
$ git clone https://github.com/ACCESS-NRI/access-nri-intake-catalog/
$ conda env create -f environment-3.13.yml --name access-nri-intake 
$ pip install -e .

or

$ git clone https://github.com/ACCESS-NRI/access-nri-intake-catalog/ 
$ conda env create -f environment-3.13.yml --name access-nri-intake
$ pip install .
  2. Run pytest tests/test_builders.py::test_builder_build
  3. You should get
>       assert len(builder.df) == num_valid_assets
E       AssertionError: assert 26 == 27
  4. Repeat step 2 twice. The failure should occur again on the second run, but not the third.

Additional info

On Python 3.12 (change the environment file in step 1), it took four runs, starting with

E       AssertionError: assert 25 == 27

and then incrementing to 26, 26, and then finally 27.

It looks to me like we might have some test pollution going on here, but I'm not quite sure what's causing it.

Potentially related to #270 and #276?

@marc-white
Collaborator

Woah that's a lot of failures, particularly in the 3.13 test...

@marc-white
Collaborator

I just pulled out a clean version of main, and it passed pytest just fine for me. That's using Python 3.11.10.

@charles-turner-1
Collaborator Author

I still get the same issue with Python 3.11.0 (conda default for Python 3.11).

@charles-turner-1
Collaborator Author

From the failing CI run:

=========================== short test summary info ============================
FAILED tests/test_builders.py::test_builder_build[basedirs4-Mom6Builder-kwargs4-27-27-15] - AssertionError: assert 25 == 27
 +  where 25 = len(                                 filename  ...   realm\n0                   19000101.ice_daily.nc  ...  seaIce\n1       ...cean_daily_rho2_2005_360.nc  ...   ocean\n26     20051101.ocean_daily_z_2005_360.nc  ...   ocean\n\n[25 rows x 13 columns])
 +    where                                  filename  ...   realm\n0                   19000101.ice_daily.nc  ...  seaIce\n1       ...cean_daily_rho2_2005_360.nc  ...   ocean\n26     20051101.ocean_daily_z_2005_360.nc  ...   ocean\n\n[25 rows x 13 columns] = Mom6Builder(paths=['/home/runner/work/access-nri-intake-catalog/access-nri-intake-catalog/tests/data/mom6'], storage_o...ry.nc', '*ocean.stats.nc', '*Vertical_coordinate.nc'], include_patterns=['*.nc'], joblib_parallel_kwargs={'n_jobs': 4}).df
FAILED tests/test_builders.py::test_builder_parser[mom6/output001/19010101.ice_month.nc-Mom6Builder-seaIce-None-XXXXXXXX_ice_month] - KeyError: 'realm'
FAILED tests/test_builders.py::test_builder_parser[mom6/output001/19010101.ocean_month_rho2.nc-Mom6Builder-ocean-None-XXXXXXXX_ocean_month_rho2] - KeyError: 'realm'
FAILED tests/test_builders.py::test_builder_parser[mom6/output000/19000101.ocean_month_z.nc-Mom6Builder-ocean-None-XXXXXXXX_ocean_month_z] - KeyError: 'realm'
FAILED tests/test_builders.py::test_builder_parser[mom6/output001/19010101.ocean_month.nc-Mom6Builder-ocean-None-XXXXXXXX_ocean_month] - KeyError: 'realm'
FAILED tests/test_builders.py::test_builder_parser[mom6/output001/19010101.ocean_scalar_month.nc-Mom6Builder-ocean-None-XXXXXXXX_ocean_scalar_month] - KeyError: 'realm'
FAILED tests/test_builders.py::test_Mom6Builder_parser_bad_realm[mom6/output001/19010101.ice_month.nc] - assert 'ParserError' in 'Traceback (most recent call last):\n  File "/home/runner/miniconda3/envs/access-nri-intake-test/lib/python3.11/site-p...me.pyx", line 421, in cftime._cftime.cast_to_int\nOverflowError: time values outside range of 64 bit signed integers\n'
FAILED tests/test_builders.py::test_Mom6Builder_parser_bad_realm[mom6/output001/19010101.ocean_month_rho2.nc] - assert 'ParserError' in 'Traceback (most recent call last):\n  File "/home/runner/miniconda3/envs/access-nri-intake-test/lib/python3.11/site-p...me.pyx", line 421, in cftime._cftime.cast_to_int\nOverflowError: time values outside range of 64 bit signed integers\n'
FAILED tests/test_builders.py::test_Mom6Builder_parser_bad_realm[mom6/output000/19000101.ocean_month_z.nc] - assert 'ParserError' in 'Traceback (most recent call last):\n  File "/home/runner/miniconda3/envs/access-nri-intake-test/lib/python3.11/site-p...me.pyx", line 421, in cftime._cftime.cast_to_int\nOverflowError: time values outside range of 64 bit signed integers\n'
FAILED tests/test_builders.py::test_Mom6Builder_parser_bad_realm[mom6/output001/19010101.ocean_month.nc] - assert 'ParserError' in 'Traceback (most recent call last):\n  File "/home/runner/miniconda3/envs/access-nri-intake-test/lib/python3.11/site-p...me.pyx", line 421, in cftime._cftime.cast_to_int\nOverflowError: time values outside range of 64 bit signed integers\n'
FAILED tests/test_builders.py::test_Mom6Builder_parser_bad_realm[mom6/output001/19010101.ocean_scalar_month.nc] - assert 'ParserError' in 'Traceback (most recent call last):\n  File "/home/runner/miniconda3/envs/access-nri-intake-test/lib/python3.11/site-p...me.pyx", line 421, in cftime._cftime.cast_to_int\nOverflowError: time values outside range of 64 bit signed integers\n'
FAILED tests/test_builders.py::test_parse_access_ncfile[Mom6Builder-mom6/output000/19000101.ice_daily.nc-expected19-False] - OverflowError: time values outside range of 64 bit signed integers
==== 12 failed, 337 passed, 102 skipped, 7 xfailed, 103 warnings in 17.95s =====

All failures seem to relate to the Mom6Builder, but I'm not convinced that the issue is necessarily the Mom6Builder itself - all these tests were passing when it was merged into main.

The overflows all seem to originate from .pyx files (Cython files) in the cftime library.
From logs:

> List installed packages
cftime                    1.6.4           py313ha014f3b_1    conda-forge
...
xarray                    2024.11.0          pyhd8ed1ab_0    conda-forge
  • cftime 1.6.4 was released June 7th, and there haven't been any updates to it on conda-forge in over 3 months.
  • An issue mentioned in the xarray changelog for version 2024.10 describes similar problems.

I'm about to test whether reverting to xarray <= 2024.10 has any effect (N.B. xarray 2024.10 broke a lot of the tests for intake-esm, despite no mention of breaking changes in the changelog, so it's plausible it's affecting us too).

@charles-turner-1
Collaborator Author

charles-turner-1 commented Dec 4, 2024

TODO:

  • test_builder_build : Check whether test_builder_build is failing due to ecgtools.Builder.clean_dataframe() - seems to be the most likely cause here. Potentially indicates problematic assets in test data?
  • test_builder_parser : Determine how we are getting a KeyError in some instances but not others - presumably this indicates instability in the results of Builder.parser() (a quick determinism check is sketched below).
  • test_Mom6Builder_parser_bad_realm : ParserError in Traceback - is this just an error handling issue with an upstream dependency changing their error type?
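For the second item, one quick way to probe whether the parser output is stable might be something like the following. This is only a sketch: the import path, the classmethod-style Mom6Builder.parser(file) call, the dict-like return value, and the test-data path are all assumptions based on the failing test IDs above.

import json
from access_nri_intake.source.builders import Mom6Builder

# Parse the same asset repeatedly and check the results agree
file = "tests/data/mom6/output001/19010101.ice_month.nc"
results = [Mom6Builder.parser(file) for _ in range(5)]

print(all(r == results[0] for r in results))  # False would point at parser instability
print(json.dumps(results[0], indent=2, default=str))  # the CI KeyError suggests 'realm' can go missing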

@charles-turner-1
Collaborator Author

Also noting some empty (0 byte) files in the test data - a potential source of instability?

Modified files: the data were rebuilt using the script from tests/data/mom6/README without modifications, yet git reports that the binaries differ.

❯ eza --long --header --git tests/data/mom6/output00{0,1}/*.nc
Permissions Size User     Date Modified Git Name
.rw-r--r--   26k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ice_daily.nc
.rw-r--r--   23k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ice_month.nc
.rw-r--r--  175k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_annual.nc
.rw-r--r--   40k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_annual_rho2.nc
.rw-r--r--   78k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_annual_z.nc
.rw-r--r--   42k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_daily.nc
.rw-r--r--  166k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_month.nc
.rw-r--r--   34k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_month_rho2.nc
.rw-r--r--   48k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_month_z.nc
.rw-r--r--   30k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_scalar_annual.nc
.rw-r--r--   30k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_scalar_month.nc
.rw-r--r--   43k u1166368  5 Dec 12:04   -M tests/data/mom6/output000/19000101.ocean_static.nc
.rw-r--r--   26k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ice_daily.nc
.rw-r--r--   23k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ice_month.nc
.rw-r--r--  175k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_annual.nc
.rw-r--r--   40k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_annual_rho2.nc
.rw-r--r--   78k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_annual_z.nc
.rw-r--r--   42k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_daily.nc
.rw-r--r--  166k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_month.nc
.rw-r--r--   34k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_month_rho2.nc
.rw-r--r--   48k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_month_z.nc
.rw-r--r--   30k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_scalar_annual.nc
.rw-r--r--   30k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_scalar_month.nc
.rw-r--r--   43k u1166368  5 Dec 12:04   -M tests/data/mom6/output001/19010101.ocean_static.nc
.rw-r--r--     0 u1166368  5 Dec 07:56   -- tests/data/mom6/output000/ocean.stats.nc
.rw-r--r--     0 u1166368  5 Dec 07:56   -- tests/data/mom6/output001/ocean.stats.nc
.rw-r--r--     0 u1166368  5 Dec 07:56   -- tests/data/mom6/output000/sea_ice_geometry.nc
.rw-r--r--     0 u1166368  5 Dec 07:56   -- tests/data/mom6/output001/sea_ice_geometry.nc
.rw-r--r--     0 u1166368  5 Dec 07:56   -- tests/data/mom6/output000/Vertical_coordinate.nc
.rw-r--r--     0 u1166368  5 Dec 07:56   -- tests/data/mom6/output001/Vertical_coordinate.nc

@charles-turner-1
Collaborator Author

I discovered a bug in src/access_nri_intake/source/utils.py::get_timeinfo, which leads to a number of errors and may be the source of some of these issues.

The offending snippet of code is

    def _todate(t):
        return cftime.num2date(t, time_var.units, calendar=time_var.calendar)

...

        has_bounds = hasattr(time_var, "bounds") and time_var.bounds in ds.variables
        if has_bounds:
            bounds_var = ds.variables[time_var.bounds]
            ts = _todate(bounds_var[0, 0])
            te = _todate(bounds_var[-1, 1])
        else:
            ts = _todate(time_var[0])
            te = _todate(time_var[-1])

In some of the test datasets for the Mom6Builder, time bounds are present in the dataset but are empty and have been filled with a fill value - in this case, 1e20.

This causes the _todate(t) function to fail with an OverflowError.

How much of the weird behaviour this resolves remains to be seen.
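For reference, one possible mitigation would be to skip masked/fill values before handing them to cftime. This is a minimal sketch of the idea, not the actual fix in #293: the 1e20 fill-value check is an assumption based on the test data described above, and time_var comes from the enclosing scope as in the original snippet.

import numpy as np
import cftime

FILL_VALUE = 1.0e20  # fill value observed in the empty time bounds of the test data

def _todate(t):
    # Skip masked entries and bare fill values, which otherwise overflow
    # when cftime converts them to integer microseconds
    if np.ma.is_masked(t) or np.isclose(float(t), FILL_VALUE):
        return None
    return cftime.num2date(t, time_var.units, calendar=time_var.calendar)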

@charles-turner-1 charles-turner-1 self-assigned this Dec 5, 2024
@charles-turner-1
Collaborator Author

I can cause additional tests to crash by making the change

        with xr.open_dataset(
            file,
            chunks={},
            decode_cf=False,
            decode_times=False,
            decode_coords=False,
        ) as ds:

to

        with xr.load_dataset(
            file,
            chunks={},
            decode_cf=False,
            decode_times=False,
            decode_coords=False,
        ) as ds:

here. Changing open_dataset to load_dataset forces xarray to eagerly evaluate/load the dataset, so any overflow/datatype errors are realised immediately.

My current working assumption is that a nontrivial number of the test data files probably have some sort of issue with time and/or time bounds due to the process used to create them - but this is unlikely to extend to the real data.

This should be relatively straightforward to fix as a bug, but may require changes to large amounts of test data...

@charles-turner-1
Copy link
Collaborator Author

Subsequent investigations:

  • All bugs/test failures were caused by this function and the error handling wrapping it:
def _todate(t):
    return cftime.num2date(t, time_var.units, calendar=time_var.calendar)
  • cftime.num2date is a Cython function, and that seems to be where the integer overflow arises (a minimal reproduction is sketched below).
  • My fix avoids this issue but does not address the root cause - we should look further into cftime.num2date and how to handle the OverflowErrors it raises, because it's probably causing related issues elsewhere.
  • MOM6 models required a different time parser because of the way time info is encoded. Presumably this is general to GFDL models and will require some caution moving forwards.
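For completeness, the overflow is easy to reproduce directly with a value on the scale of the 1e20 fill value noted earlier (the units and calendar below are purely illustrative, not taken from the test data):

import cftime

# A time value on the order of the 1e20 fill value overflows when cftime
# converts it to integer microseconds internally
cftime.num2date(1.0e20, "days since 1900-01-01", calendar="noleap")
# -> OverflowError: time values outside range of 64 bit signed integers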
