Test streamflow with multiple forecasts #1
base: main
Conversation
Would it be reasonable to have ["time", "reference_time"] as the axes to concat on? These are both apparently coordinates in the inputs, so it should work without any further config. I understand these should both populate, giving a 2d time coordinate for all the variables. As for making nice geo visual output, others know much better than I how to do this! |
I don't think so... this is a forecast model, so my understanding is the |
Probably I don't understand the data structure :| |
For this particular example, just the one time coordinate. (My comment refers to the more generalized case) |
In that case I'm not sure how, in your example, multiple things are "stomping" without another coordinate variable. |
The following code produces an example list, shown below, that helps to explain the data structure, at least conceptually. I'm working on digging into the actual netcdf headers -- the file names are a far more reliable index than what is encoded into the file, I have found. You both probably understand what I'm explaining here, and I beg pardon if it seems pedantic. Hopefully, it will help me ask a more useful question at the tail end.

var = 'channel_rt'  # not defined in the original snippet; inferred from the listing below
flist = []
for day in range(22, 23):  # TODO: replace this with a formatted date output range
    for i in range(4):
        for fc in range(1, 4):
            flist.append(f'gcs://national-water-model/nwm.202208{day:02d}/short_range/nwm.t{i:02d}z.short_range.{var}.f{fc:03d}.conus.nc')

produces

"flist": ["gcs://national-water-model/nwm.20220822/short_range/nwm.t00z.short_range.channel_rt.f001.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t00z.short_range.channel_rt.f002.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t00z.short_range.channel_rt.f003.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t01z.short_range.channel_rt.f001.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t01z.short_range.channel_rt.f002.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t01z.short_range.channel_rt.f003.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t02z.short_range.channel_rt.f001.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t02z.short_range.channel_rt.f002.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t02z.short_range.channel_rt.f003.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t03z.short_range.channel_rt.f001.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t03z.short_range.channel_rt.f002.conus.nc",
"gcs://national-water-model/nwm.20220822/short_range/nwm.t03z.short_range.channel_rt.f003.conus.nc"
]

Visually, these should overlap something like the following:

t00z |f001--f002--f003--
t01z |      f001--f002--f003--
t02z |            f001--f002--f003--
t03z |                  f001--f002--f003--

The output from xarray shows the same thing. First, the full output from a single dataset:

>>> xr.open_dataset("nwm.t02z.short_range.channel_rt.f002.conus.nc")
<xarray.Dataset>
Dimensions: (time: 1, reference_time: 1, feature_id: 2776738)
Coordinates:
* time (time) datetime64[ns] 2022-08-25T04:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T02:00:00
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
Data variables:
crs |S1 ...
streamflow (feature_id) float64 ...
nudge (feature_id) float64 ...
velocity (feature_id) float64 ...
qSfcLatRunoff (feature_id) float64 ...
qBucket (feature_id) float64 ...
qBtmVertRunoff (feature_id) float64 ...
Attributes: (12/19)
TITLE: OUTPUT FROM NWM v2.2
featureType: timeSeries
proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
model_initialization_time: 2022-08-25_02:00:00
station_dimension: feature_id
model_output_valid_time: 2022-08-25_04:00:00
... ...
model_configuration: short_range
dev_OVRTSWCRT: 1
dev_NOAH_TIMESTEP: 3600
dev_channel_only: 0
dev_channelBucket_only: 0
dev: dev_ prefix indicates development/internal me...
Then the manually truncated output showing just the headers of a few more files (I probably should have scripted the output so I don't have to wear out my keyboard with a hammered-out finger macro...):

>>> xr.open_dataset("nwm.t01z.short_range.channel_rt.f001.conus.nc")
* time (time) datetime64[ns] 2022-08-25T02:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T01:00:00
>>> xr.open_dataset("nwm.t02z.short_range.channel_rt.f001.conus.nc")
* time (time) datetime64[ns] 2022-08-25T03:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T02:00:00
>>> xr.open_dataset("nwm.t03z.short_range.channel_rt.f001.conus.nc")
* time (time) datetime64[ns] 2022-08-25T04:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T03:00:00
>>> xr.open_dataset("nwm.t04z.short_range.channel_rt.f001.conus.nc")
* time (time) datetime64[ns] 2022-08-25T05:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T04:00:00
>>> xr.open_dataset("nwm.t01z.short_range.channel_rt.f002.conus.nc")
* time (time) datetime64[ns] 2022-08-25T03:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T01:00:00
>>> xr.open_dataset("nwm.t02z.short_range.channel_rt.f002.conus.nc")
* time (time) datetime64[ns] 2022-08-25T04:00:00
* reference_time (reference_time) datetime64[ns] 2022-08-25T02:00:00

@jmunroe created the first example by concatenating the f001 output from a series of forecasts on a particular day (i.e., t00z-t23z). The idea, I think, was to take the "best" forecast of each output -- we assume that the nearest hour will be the most accurate. (The model actually emits a separate "analysis_assim" dataset that is supposed to represent the best-available estimate of the system state for each given forecast time, and that might be a better single-trace value to plot, but that's easy enough to extract.) More complex queries (and sort of the direction I'm going with this exploration) involve questions about the progress of the forecast accuracy through time. This is connected to what you mean by "lead time", I think, @martindurant. I tried a couple of different variations on throwing something like the following at it:

mzz = MultiZarrToZarr(json_list,
    remote_protocol='gcs',
    remote_options={'anon': True},
    concat_dims=['reference_time', 'time'],
    identical_dims=['COMID'],
)

But the result was always a dataset output with multiple So, the questions we need to answer seem to be:
|
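A purely illustrative aside on the layout discussed above: the toy sketch below builds a small reference_time x valid-time cube by hand (synthetic dummy values, not NWM data), with NaN marking combinations for which no file exists. All names and values in it are made up for the example.

# Illustrative only: a hand-built toy cube showing the intended 2-D layout.
# reference_time = forecast issue time, time = valid time; values are dummies.
import numpy as np
import pandas as pd
import xarray as xr

reference_time = pd.date_range("2022-08-25 00:00", periods=4, freq="1h")
time = pd.date_range("2022-08-25 01:00", periods=6, freq="1h")
flow = np.full((len(reference_time), len(time)), np.nan)
for i, ref in enumerate(reference_time):
    for j, t in enumerate(time):
        lead = (t - ref) / pd.Timedelta("1h")
        if 1 <= lead <= 3:        # short_range-style: f001..f003 from each reference time
            flow[i, j] = lead     # dummy value; real data would be streamflow
toy = xr.Dataset(
    {"streamflow": (("reference_time", "time"), flow)},
    coords={"reference_time": reference_time, "time": time},
)
print(toy["streamflow"].to_pandas())  # rows: reference_time, columns: valid time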
In the notebook attached to the PR, I've changed the logic so that we are grabbing multiple values from each forecast; i.e., I'm not using the |
With current main kerchunk, this seems to work for me:

import fsspec
import kerchunk.combine
import kerchunk.hdf
import xarray as xr

ofs = fsspec.open_files(flist)
out = []
for of in ofs:
    with of as f:
        out.append(kerchunk.hdf.SingleHdf5ToZarr(f, of.path).translate())
mzz = kerchunk.combine.MultiZarrToZarr(out, remote_protocol="gcs", concat_dims=["time", "reference_time"])
tot = mzz.translate()
ds = xr.open_zarr("reference://", storage_options={"fo": tot, "remote_protocol": "gcs"})
ds

Note the multiple values for time and reference_time. By the way, the correct kwarg for gcs is |
(edit: added |
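A follow-on sketch, not from the thread: assuming the combine above leaves streamflow with dims (time, reference_time, feature_id), and with ds being the dataset opened in the snippet above, lead time and a single shortest-lead ("best available") trace per valid time might be recovered along these lines.

# Sketch only; assumes ds is the combined dataset from the previous snippet and
# that every valid time has at least one forecast issued before it in the archive.
import numpy as np

# Broadcasting the two 1-D coords gives a 2-D (time, reference_time) lead time in hours.
lead_h = (ds["time"] - ds["reference_time"]) / np.timedelta64(1, "h")
# For each valid time, find the reference_time with the shortest positive lead.
best_ref = lead_h.where(lead_h > 0).idxmin("reference_time")
# Vectorized selection: one streamflow value per (time, feature_id).
best = ds["streamflow"].sel(reference_time=best_ref)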
@jmunroe
(ping @karnesh)
I re-tooled this slightly to download the streamflow. Streamflow is a vector dataset, so it doesn't plot as nicely with the hvplot function.
With the example, I'm trying to download the multiple forecast traces for a single day. There should be 432 data values (24 forecast/reference times, 18 forecast hours per reference time).
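For concreteness, here is a sketch (not in the notebook) of the full single-day file list this implies, reusing the gcs path pattern from the earlier comment; the date and variable are assumed purely for illustration.

# 24 reference times (t00z..t23z) x 18 forecast hours (f001..f018) = 432 files,
# one streamflow value per feature per file. Date and variable are assumed.
var = "channel_rt"
day = "20220822"
flist = [
    f"gcs://national-water-model/nwm.{day}/short_range/"
    f"nwm.t{ref:02d}z.short_range.{var}.f{fc:03d}.conus.nc"
    for ref in range(24)
    for fc in range(1, 19)
]
assert len(flist) == 24 * 18 == 432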
Questions (we're working on these and would welcome insights or ideas -- @martindurant, if you feel like chiming in, please feel free.)