
Test streamflow with multiple forecasts #1

Draft · wants to merge 1 commit into main
Conversation

@jameshalgren

@jmunroe (ping @karnesh)
I re-tooled this slightly to download the streamflow. Streamflow is a vector dataset, so it doesn't plot as nicely with the hvplot function.

With the example, I'm trying to download multiple forecast traces for a single day. There should be 432 data values (24 forecast/reference times × 18 forecast hours per forecast).
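A quick sanity check on that count:

# 24 forecast/reference times (t00z-t23z) x 18 forecast hours (f001-f018)
expected_values = 24 * 18  # = 432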

Questions (we're working on these and would welcome insights or ideas -- @martindurant, if you feel like chiming in, please do):

  1. Notice that the plot doesn't show anything overlapping -- the times, I think, are stomping on each other. We probably need to define multiple reference times. How do we get the zarr indexing to recognize the additional dimension, so that the dataset reflects the additional cardinality of forecast time vs. valid time?
  2. (Much lower priority... almost a philosophical question.) How do we plot this geographically?

@martindurant

Would it be reasonable to have ["time", "reference_time"] as the axes to concat on? These are both apparently coordinates in the inputs, so it should work without any further config. I understand these should both populate to have a 2d time coordinate for all the variables. On making nice geo visual output, others know much better how to do this nicely!

@jmunroe (Owner) commented Aug 26, 2022

> Would it be reasonable to have ["time", "reference_time"] as the axes to concat on? These are both apparently coordinates in the inputs, so it should work without any further config. I understand these should both populate to have a 2d time coordinate for all the variables. On making nice geo visual output, others know much better how to do this nicely!

I don't think so... this is a forecast model, so my understanding is that reference_time is when each forecast "starts" and time is the time each forecast value applies to. There is another aspect of timing which might also be important for some use cases: lead time (the number of hours between the reference_time and the time of a particular forecast value). This is not included in this example, since we have explicitly selected 'f001' -- the first hour of each of these 18-hour forecasts.
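For concreteness, a minimal sketch of how the three timing concepts relate (hypothetical values, matching the nwm.t02z...f002 header shown later in the thread):

import pandas as pd

reference_time = pd.Timestamp("2022-08-25T02:00")  # when the forecast "starts"
valid_time = pd.Timestamp("2022-08-25T04:00")      # the time a value applies to
lead_time = valid_time - reference_time            # 2 hours -> the f002 file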

@martindurant

Probably I don't understand the data structure :|
Are you saying that we need to derive the real value of the data, or are there 2d "time" and "lead time" coordinates? Basically, what do you want the output dataset's coordinates to look like?

@jmunroe (Owner) commented Aug 26, 2022

For this particular example, just the one time coordinate. (My comment refers to the more generalized case.)

@martindurant

In that case I'm not sure how the multiple things in your example are "stomping" on each other without another coordinate variable.

@jameshalgren (Author) commented Aug 26, 2022

The following code produces an example list, shown below, that helps to explain the data structure, at least conceptually. I'm still digging into the actual netCDF headers -- I have found the file names to be a far more reliable index than what is encoded in the files. You both probably understand what I'm explaining here, and I beg pardon if it seems pedantic. Hopefully, it will help me ask a more useful question at the tail end.

var = "channel_rt"  # streamflow lives in the channel_rt files
flist = []
for day in range(22, 23):  # TODO: replace this with a formatted date output range
    for i in range(4):  # reference times t00z-t03z
        for fc in range(1, 4):  # forecast hours f001-f003
            flist.append(f'gcs://national-water-model/nwm.202208{day:02d}/short_range/nwm.t{i:02d}z.short_range.{var}.f{fc:03d}.conus.nc')

produces

"flist": ["gcs://national-water-model/nwm.20220822/short_range/nwm.t00z.short_range.channel_rt.f001.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t00z.short_range.channel_rt.f002.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t00z.short_range.channel_rt.f003.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t01z.short_range.channel_rt.f001.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t01z.short_range.channel_rt.f002.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t01z.short_range.channel_rt.f003.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t02z.short_range.channel_rt.f001.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t02z.short_range.channel_rt.f002.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t02z.short_range.channel_rt.f003.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t03z.short_range.channel_rt.f001.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t03z.short_range.channel_rt.f002.conus.nc",
  "gcs://national-water-model/nwm.20220822/short_range/nwm.t03z.short_range.channel_rt.f003.conus.nc"
]
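An untested sketch of the TODO above, deriving the same names from a real date range (pandas assumed):

import pandas as pd

days = pd.date_range("2022-08-22", "2022-08-22", freq="D")
flist = [
    f"gcs://national-water-model/nwm.{day:%Y%m%d}/short_range/"
    f"nwm.t{i:02d}z.short_range.channel_rt.f{fc:03d}.conus.nc"
    for day in days
    for i in range(4)      # reference times t00z-t03z
    for fc in range(1, 4)  # forecast hours f001-f003
]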

Visually, these should overlap something like the following:

t00z |f001--f002--f003--
      t01z |f001--f002--f003--
            t02z |f001--f002--f003--
                  t03z |f001--f002--f003--

The output from xarray shows the same thing. First, the full output from a single dataset:

>>> xr.open_dataset("nwm.t02z.short_range.channel_rt.f002.conus.nc")
<xarray.Dataset>
Dimensions:         (time: 1, reference_time: 1, feature_id: 2776738)
Coordinates:
  * time            (time) datetime64[ns] 2022-08-25T04:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T02:00:00
  * feature_id      (feature_id) int32 101 179 181 ... 1180001803 1180001804
Data variables:
    crs             |S1 ...
    streamflow      (feature_id) float64 ...
    nudge           (feature_id) float64 ...
    velocity        (feature_id) float64 ...
    qSfcLatRunoff   (feature_id) float64 ...
    qBucket         (feature_id) float64 ...
    qBtmVertRunoff  (feature_id) float64 ...
Attributes: (12/19)
    TITLE:                      OUTPUT FROM NWM v2.2
    featureType:                timeSeries
    proj4:                      +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
    model_initialization_time:  2022-08-25_02:00:00
    station_dimension:          feature_id
    model_output_valid_time:    2022-08-25_04:00:00
    ...                         ...
    model_configuration:        short_range
    dev_OVRTSWCRT:              1
    dev_NOAH_TIMESTEP:          3600
    dev_channel_only:           0
    dev_channelBucket_only:     0
    dev:                        dev_ prefix indicates development/internal me...
>>> 

Then the manually truncated output, showing just the time headers of a few more files (I probably should have scripted the output so I didn't have to wear out my keyboard with a hammered-out finger macro...):

>>> xr.open_dataset("nwm.t01z.short_range.channel_rt.f001.conus.nc")
  * time            (time) datetime64[ns] 2022-08-25T02:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T01:00:00
>>> xr.open_dataset("nwm.t02z.short_range.channel_rt.f001.conus.nc")
  * time            (time) datetime64[ns] 2022-08-25T03:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T02:00:00
>>> xr.open_dataset("nwm.t03z.short_range.channel_rt.f001.conus.nc")
  * time            (time) datetime64[ns] 2022-08-25T04:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T03:00:00
>>> xr.open_dataset("nwm.t04z.short_range.channel_rt.f001.conus.nc")
  * time            (time) datetime64[ns] 2022-08-25T05:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T04:00:00
>>> xr.open_dataset("nwm.t01z.short_range.channel_rt.f002.conus.nc")
  * time            (time) datetime64[ns] 2022-08-25T03:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T01:00:00
>>> xr.open_dataset("nwm.t02z.short_range.channel_rt.f002.conus.nc")
  * time            (time) datetime64[ns] 2022-08-25T04:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-08-25T02:00:00
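(The finger macro above could be scripted with something like this untested sketch, assuming the files have been downloaded locally:)

import xarray as xr

files = [
    f"nwm.t{i:02d}z.short_range.channel_rt.f{fc:03d}.conus.nc"
    for i in range(1, 5)   # reference times t01z-t04z
    for fc in range(1, 3)  # forecast hours f001-f002
]
for fn in files:
    with xr.open_dataset(fn) as ds:
        print(fn, ds.time.values[0], ds.reference_time.values[0])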

@jmunroe created the first example by concatenating the f001 output from a series of forecasts on a particular day (i.e., t00z-t23z). The idea, I think, was to take the "best" forecast for each output time -- we assume that the nearest hour will be the most accurate. (The model actually emits a separate "analysis_assim" dataset that is supposed to represent the best-available estimate of the system state at each given time, and that might be a better single trace to plot, but that's easy enough to extract.)

More complex queries (and sort of the direction I'm going with this exploration) involve questions about how forecast accuracy progresses through time. This is connected to what you mean by "lead time", I think, @martindurant. I tried a couple of different variations on adding reference_time to the input of the MultiZarrToZarr command, e.g.,

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol='gcs',
    remote_options={'anon': True},
    concat_dims=['reference_time', 'time'],
    identical_dims=['COMID'],
)

But the result was always a dataset output with multiple time values and only one reference_time, like this:

[screenshot: "Frustratingly, only one reference time"]

So, the questions we need to answer seem to be:

  1. Mechanistically (as in, "please teach me"), how do we get the reference time to be properly represented as a dimension, so that multiple forecasts can be held in one file (if that's not an antipattern in itself)?
  2. How do we then build an interface for querying/slicing across those indices? (Sketched below.) For example: for a given moment in time, how did the forecast improve as we got closer and closer? That would be a vertical slice through the diagram above. We might also fix a particular lead time -- say, 6 hours out -- and, across a range of forecasts, consider the average performance relative to the value actually observed 6 hours later; that would be more like the original example notebook in this PR, and a diagonal line on the diagram. Finally, we might look at a single forecast, or even the envelope of all forecasts, to get a sense of the aggregate behavior of the model relative to observations -- more like horizontal slices.
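A hedged sketch of those three slices, assuming a combined dataset ds with the 2-D (time, reference_time) streamflow layout that @martindurant builds below:

import numpy as np

# "vertical" slice: every forecast valid at one moment, across issue times
valid = ds.streamflow.sel(time="2022-08-22T06:00")

# "diagonal" slice: fix the lead time; subtracting the two 1-D coordinates
# broadcasts to a 2-D (time, reference_time) array of timedeltas
lead = ds["time"] - ds["reference_time"]
six_hours_out = ds.streamflow.where(lead == np.timedelta64(6, "h"))

# "horizontal" slice: the full trace of a single forecast cycle
one_forecast = ds.streamflow.sel(reference_time="2022-08-22T00:00")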

@jameshalgren (Author) commented Aug 26, 2022

> For this particular example, just the one time coordinate. (My comment refers to the more generalized case.)

In the notebook attached to the PR, I've changed the logic so that we grab multiple values from each forecast; i.e., I'm not using the best_hour variable. Technically, when concatenating the way the notebook was first built, there should have been multiple reference times as well -- there would just be a 1:1 match between times and reference times, since only one value from each forecast was used.

@martindurant commented Aug 26, 2022

With current kerchunk main, this seems to work for me:

import fsspec
import kerchunk.combine
import kerchunk.hdf
import xarray as xr

ofs = fsspec.open_files(flist)
out = []
for of in ofs:
    with of as f:
        out.append(kerchunk.hdf.SingleHdf5ToZarr(f, of.path).translate())
mzz = kerchunk.combine.MultiZarrToZarr(out, remote_protocol="gcs", concat_dims=["time", "reference_time"])
tot = mzz.translate()
ds = xr.open_zarr("reference://", storage_options={"fo": tot, "remote_protocol": "gcs"})
ds

<xarray.Dataset>
Dimensions:         (time: 6, reference_time: 4, feature_id: 2776738)
Coordinates:
  * feature_id      (feature_id) float64 101.0 179.0 181.0 ... 1.18e+09 1.18e+09
  * reference_time  (reference_time) datetime64[ns] 2022-08-22 ... 2022-08-22...
  * time            (time) datetime64[ns] 2022-08-22T01:00:00 ... 2022-08-22T...
Data variables:
    crs             (time, reference_time) object dask.array<chunksize=(1, 1), meta=np.ndarray>
    nudge           (time, reference_time, feature_id) float64 dask.array<chunksize=(1, 1, 2776738), meta=np.ndarray>
    qBtmVertRunoff  (time, reference_time, feature_id) float64 dask.array<chunksize=(1, 1, 2776738), meta=np.ndarray>
    qBucket         (time, reference_time, feature_id) float64 dask.array<chunksize=(1, 1, 2776738), meta=np.ndarray>
    qSfcLatRunoff   (time, reference_time, feature_id) float64 dask.array<chunksize=(1, 1, 2776738), meta=np.ndarray>
    streamflow      (time, reference_time, feature_id) float64 dask.array<chunksize=(1, 1, 2776738), meta=np.ndarray>
    velocity        (time, reference_time, feature_id) float64 dask.array<chunksize=(1, 1, 2776738), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.6
    NWM_version_number:         v2.2
    TITLE:                      OUTPUT FROM NWM v2.2
    cdm_datatype:               Station
    code_version:               v5.2.0-beta2
    dev:                        dev_ prefix indicates development/internal me...
    ...                         ...
    model_output_type:          channel_rt
    model_output_valid_time:    2022-08-22_01:00:00
    model_total_valid_times:    18
    proj4:                      +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
    station_dimension:          feature_id
    stream_order_output:        1

noting the multiple values for time and reference_time.

By the way, the correct kwarg for gcs is token="anon", but I don't think that's in your way. Also, CRS is certainly independent of either version of time, and there are probably other small details.
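For instance (a hedged sketch -- remote_options here is the reference filesystem's pass-through to gcsfs):

ds = xr.open_zarr(
    "reference://",
    storage_options={
        "fo": tot,
        "remote_protocol": "gcs",
        "remote_options": {"token": "anon"},  # instead of {"anon": True}
    },
)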

@martindurant

(edit: added ofs = fsspec.open_files(flist) to the start of the snippet above)

