-
Notifications
You must be signed in to change notification settings - Fork 179
ODC EP 011 Enhance hyperspectral support
The goal of this EP is to efficiently support hyperspectral data sources in the open data cube. To achieve that we'll need to
- Generalize data model to allow for arbitrary extra dimensions on output data variables
- Extend metadata support for describing data sources in hierarchical data formats like hdf, netcdf and zarr
- Implement data loading based on native format libraries rather than always going via
rasterio
andGDAL
- Support IO concurrency without Dask
- Implement arbitrary chunking along temporal dimension when using Dask
- Right now temporal chunking always starts with 1 chunk per time slice followed by optional re-chunking
- This results in large Dask graph and can confuse Dask scheduler
- Loading data divided into
4x4
spatial chunks and one single temporal chunk covering 1000 time steps should produce Dask graph with only 16 chunks, not the current 16,016
The changes to the ODC are in the core data load and re-projection pipeline and is proposed that this EP be included in ODC v2.0 to allow for the extensive and potentially breaking changes required.
The EP will also be tackled in two stages:
- A reduced scope with single input format and output dimension ordering - this will allow exploration of the ODC load pipeline changes required without the complication of generalisation.
- Generalisation to additional file formats and output dimension ordering.
There is considerable uncertainty regarding approaches to full generalisation and the use-cases to be supported by ODC. The Reduced Scope will provide some understanding of this uncertainty and allow for better scoping of the second stage.
Below are the details first for the Full Scope, noting the considerable uncertainty of some aspects, and then the Reduced Scope stage (the former is required for context).
The full scope proposed has considerable uncertainty but is the result of analysis of a range of existing and expected Hyperspectral data and analysis requirements along with additional data cube-like datasets (eg. Climate and Oceans).
- Output data variables must have
y,x
dimensions next to each other and in that order-
b,y,x
OK -
y,x,b
OK -
m,y,x,b
OK -
y,b,x
BAD
-
- Extra dimensions are assumed to be uniform across datasets of the same product
- Example:
visual
band has dimensionsy,x,color
with coordinatecolor=["r","g","b","a"]
- Example:
classifier_prob
band has dimensionsm,y,x,c
, with coordinatesm=["small", "large"],c=["water", "pasture", "sand", ...]
- Limited Supported:
wtemp
band with dimensionsdepth,y,x
withdepth=[100, 201.3, ...]
coordinate being different for each dataset- Allow describing data like this, but do not support general case loading of data of that kind as this requires developing rules for unification of data along extra dimensions
- Loading of a single dataset should be fine
- Example:
- Allow assembling of multi-spectral data variable from a bunch of files split along the extra dimension
- Example: assemble
visual
band from 3 separate tiff images - Example: assemble
reflectance
band from 3 netcdf files with100, 100, 87
bands each. - Multiple sources must have the same
y,x
dimensions across all parts
- Example: assemble
- Allow addressing any slice of the
ndarray
variable stored in hdf-like storage format- Currently one can only select a single slice of the data variable stored on disk:
ds.dv[idx:int]
- Currently one can only select a single slice of the data variable stored on disk:
- We keep fundamental assumption that single
Dataset
captures information about one single moment in time only- Multiple datasets can reference different temporal sections of the same data store
In datacube, dimensions of the output data variable must be known ahead of time, before we get to read any of the data files. As a result we need to capture not just metadata about what extra dimensions are present in the data, but also the exact coordinates contained within a data store pointed to by the dataset.
Define output dimensions and coordinates for data variables with extra dimensions
name: sample_prod
metadata_type: eo3
measurements:
- name: visual
dims: "y.x.c"
coords:
c: ["r", "g", "b", "a"]
dtype: int8
...
- name: landsat
dims: "wl.y.x"
coords:
wl_lo:
dim: wl
data: [0.45, 0.52, 0.63, ...]
wl:
dim: wl
data: [0.48, 0.56, 0.66, ...]
wl_hi:
dim: wl
data: [0.52, 0.60, 0.69, ...]
dtype: uint16
...
Above defines
-
visual
band with an extra color dimension on the right. Color dimension contains red, green, blue and alpha channels in that order. -
landsat
band presents landsat reflectance bands as a single data array, with wavelength dimensionwl
on the left and three coordinates that record low, mid and hi points of the spectra captured by the band at a given index.
This covers the case when extra dimension is uniform across all datasets of the product. An individual dataset might be missing some of the measurements, but when data is present it was captured at the wavelength specified by the product document. Just like in xarray
we can define multiple coordinates for any given dimension.
To support data sources with varying sampling along extra dimensions, coords
parameter from the product can be included in the same format on dataset document.
Support assembling single data variables from multiple files split along extra dimension.
# Dataset definition
id: ...
...
measurements:
visual:
path: visual.webp # all 4 bands from one file
landsat:
- path: B1.tif
- path: B2.tif
- path: B3.tif
band: 1
...
Instead of specifying a single data source with .path[,.band,.layer]
, supply a list of data files
each encoding a single plane of pixels. Those 2d+ pixel planes will be stacked along the extra
dimension of the data variable. For simplicity of notation this syntax can only work with 3d data
variables only. Higher dimension sources will have to be loaded from a single data store that
supports ndarray storage natively.
Support arbitrary slices into hdf-like sources. Right now one can only specify
-
.layer
-- data variable name within netcdf file -
.band
-- single integer slice into leading dimension (typicallytime
), this was used to support "stacked netcdf" data products in GA.
We should be able to extract arbitrary section of the array variable. Suggest extending .layer
to include slice information.
Example: you have zarr data file results-202306.zarr
with data variable PROB
containing probabilities from some classifier for 20 different classes, computed across 3 different model configurations, with 10 runs for each model configuration.
Dimensions are arranged like so:
- model configuration
- data experiment number
- y
- x
- class
You need to expose a subset of classes for the first data experiment of the first model. Let's say we want to expose first 10 classes only: ds["PROB"][0,0,:,:,0:10]
. Product definition will look similar to this:
# Product Definition
measurements:
- name: prob
dims: y.x.klass
coords:
klass: ["water", "bare", "sand", "urban", ...]
We define prob
data variable with an extra dimension klass
. Dataset document will then contain something like this:
# Dataset definition
id: ...
...
measurements:
prob:
path: "results-202306.zarr"
layer: "PROB[0,0,:,:,0:10]"
We extend layer
to support a form NAME[slice]
, where slice
has to be a slice into the data variable NAME
. Slices for y,x
coordinates must be :
, and slices into un-modeled dimensions must be a single integer. Basically, ds[NAME][slice].shape
must match expected variable shape as defined by the product document.
If you wanted to also include classifiers 17 and 18, you would add two more classes to the klass
dimension in the product definition document and rewrite dataset document this way:
# Dataset definition
id: ...
...
measurements:
prob:
- path: "results-202306.zarr"
layer: "PROB[0,0,:,:,0:10]"
- path: "results-202306.zarr"
layer: "PROB[0,0,:,:,17:19]"
-
Measurement
- changes to support extra dimensions and coords info
- changes to support multiple sources
- parse and expose slice component of the
.layer
-
Product
schema and construction ofMeasurement
objects -
Dataset
construction ofMeasurement
model from multiple sources -
ExtraDimensions
supporty.x.C
in addition to currentC.y.x
-
BandInfo
capture slice component of the.layer
- Support assembling variables from multiple sources
- Generalize loading code to allow multiple extra dimensions on the left and on the right of spatial dimensions
- At the very least we should support not just
C.y.x
(current state), but alsoy.x.C
case (visual band) - Instead of
prepend_dim: Optional[int]
supply desired data variable dimensions. e.g.('x', 'y', 'c')
- Current
prepend_dim=1
will be equivalent to('C', 'y', 'x')
- Current
- Generalize
extra_dim_index: Optional[int]
to be of typeUnion[None, int, Tuple[int,...]]
- At the very least we should support not just
- Generalize pixel fusing to support extra dimensions in arbitrary positions
Don't just construct Dask graph with time=1
followed by .rechunk(time=desired_t_chunks)
, instead
process as many temporal slices as user requested within a single Dask sub-task.
This is particularly important for per pixel time series workflows. By manually forcing Dask to load many temporal slices as part of one work unit we significantly simplify Dask scheduling decision making and avoid the need for re-chunking. Without re-chunking step we reduce peak memory usage, allowing for larger spatial footprint of any given chunk, this in turn improves efficiency of the IO step and reduces number of chunks further.
Reading hdf-like sources with high number of non-spatial dimensions with rasterio/GDAL can be very slow for certain chunking configurations. GDAL and/or rasterio interface does not allow for efficient reading of multiple bands at once. When data store has multiple bands per chunk, GDAL will have to process same chunk multiple times extracting different section each time. Even assuming perfect caching of the raw data, the cost of repeated decompression and data movement of those chunks adds up to a significant slow down compared to access of the same data with native libraries.
We need to implement data loaders geared towards direct interaction with popular data reading libraries: zarr
, netcdf
, hdf5
. To allow for format specific optimizations, extension point needs to be fairly high up the call stack, roughly at the level of Datacube.load_data
. Common code does db query, grouping of datasets along temporal dimension, figuring out desired geobox from the user input, normalization of common parameters like resampling
, these are then passed on to data format specific code to construct Dask graph or to load data directly. We can later implement "MultiFormat" loader that delegates data loading of a set of bands to a format specific loader then re-assembles that into one xarray.Dataset
.
A while back work has been started to introduce a new reader driver abstraction. Main goal at the time was to move away from completely stateless loader interface (basically URL in, pixels out), to an interface that introduces loading context and delayed IO operations. Context can be helpful for implementing a cache of open file handles. Switch to delayed IO allows for local concurrency when loading pixels directly into RAM as opposed to using Dask. Remnants of that work are still present in the datacube library: datacube.drivers.rio._reader.py
and read_time_slice_v2
in datacube.storage._read.py
.
It's worth finishing off that work. Caching of open file handles is particularly important for accessing netcdf sources, especially over network. This new Reader interface will need to be modified slightly to support proposed changes for extra dimensions. Instead of the current assumption that every read generates a single 2d plane of pixels, reader should be able to return an Nd block of pixels with extra non-spatial dimensions as described in the product definition document for that data variable.
Allowing for IO concurrency without using Dask can be extremely helpful for certain types of batch processing workflows that are RAM limited. Anything that operates on per-pixel time series can be tricky to implement efficiently in Dask as this requires an expensive and fragile re-chunking step. But when operating in the traditional contiguous memory mode with parallel IO, we can load all the data quickly first and then process large in memory datacube using local concurrency with things like OpenMP or simple multi-threading. What's more, the size of the datacube can be close to the total RAM capacity, something that is not possible with Dask.
In the general case the changes are expected to support significant variation in access patterns, particularly in the manner in which IO is performed (native IO) and dask graphs are generated (improve time, spatial and spectral chunking options).
The intial goal will be to support an ~30x increase in the wavelength bands (from multi-spectral 10 to hyper-spectral 300 bands) with similar ease of use experienced with large scale multi-spectral processing. This is expected future requirement will be in the order of 1000x as planned Hyperspectral satellites include increases in spatial and temporal resolution (eg. 5m daily with 300 bands).
- EMIT
- EnMAP
- Prisma
- DESIS
- COGS
- Netcdf4 & HDF5
- Zarr - also tracking work on geozarr
- eo3 extensions as proposed above
- STAC HSI - there are a couple of variations on the STAC representation to be explored.
Minimum desired outcome:
- Can index into ‘datacube’ and ’dc.load’ EMIT data without format conversion
- Hyper-spectral/visual bands are presented as a single data variable with dimensions time,y,x,wavelength
- Augment product metadata to support ‘y,x,B’ dimension order and not just ‘B,y,x’
- Define extra dimension name and location relative to ’y,x’ plane: t,B,y,x’ and ‘t,y,x,B
- Expose extra dimension information through data model classes, not just name but order also
- Finish implementation of ‘dc.load’ that uses new Reader Driver interface to delegate data-loading to custom code
- This is basically resurrecting ‘xr_load’ code
- Implement Dask version of that with proper temporal chunking and allowing for chunking along extra dimension also
- Implement Reader IO Driver based on ‘xarray.open_dataset’, possibly with ‘kerchunk‘ enabled backend.
Datacube already has ‘extra_dimensions’ field that contains a set of extra dimensions and corresponding coordinate values. Any measurement that has populated ‘.extra_dim = B’ is assumed to have ‘B,y,x’ dimensions on disk, but we need to also support ‘y,x,B’ case. Instead of ‘.extra_dim = B’ we can specify ‘.dims = [B, y, x]’ which then allows us to specify not just presence/absence of the extra dimension, but also capture the desired order relative to y,x, e.g. dims = [y, x, B].
Example EO3 metadata extensions:
name: sample_prod
metadata_type: eo3
extra_dimensions:
# Already defined by datacube spec allows for single coordinate per dimension
- name: c
dtype: str
values: ["r", "g", "b", "a"]
- name: wl
dtype: float32
values: [0.48, 0.56, 0.66, ...]
measurements:
- name: visual ## Can not be defined with current spec
#extra_dim: c ## Assumes c, y, x order, so no good
dims: [y, x, c] ## [NEW] Captures order relative to y,x
dtype: int8
...
- name: landsat ## Can be specified with the current schema
extra_dim: wl
dims: [wl, y, x] ## [NEW] equivalent representation in the proposed schema
dtype: uint16
...
Above defines:
- Visual band with an extra color dimension on the right. Color dimension contains red, green, blue and alpha channels in that order.
- Landsat band presents landsat reflectance bands as a single data array, with wavelength dimension wl on the left.
This covers the case when extra dimension is uniform across all datasets of the product. An individual dataset might be missing some of the measurements, but when data is present it was captured at the wavelength specified by the product document. Just like in xarray we can define multiple coordinates for any given dimension.
This schema allows for arbitrary number of extra dimensions. Data loading code will only support single extra dimension to begin with. Implementation should give us more insight into what it will take to generalize this further.
A while back work has been started to introduce a new reader driver abstraction. Main goal at the time was to move away from completely stateless loader interface (basically URL in, pixels out), to an interface that introduces loading context and delayed IO operations. Context can be helpful for implementing a cache of open file handles. Switch to delayed IO allows for local concurrency when loading pixels directly into RAM as opposed to using Dask. Remnants of that work are still present in the datacube library: datacube.drivers.rio._reader.py and read_time_slice_v2 in datacube.storage._read.py.
This work needs to be completed to enable alternative IO reader backends. This new Reader interface will need to be modified slightly to support proposed changes for extra dimensions. Instead of the current assumption that every read generates a single 2d plane of pixels, reader should be able to return an Nd block of pixels with extra non-spatial dimensions as described in the product definition document for that data variable. Slicing into the extra dimensions should be supported to enable Dask chunking along the extra dimension.
This is also a good opportunity to implement proper temporal chunking when constructing Dask graph of dc.load task. Currently every time slice generates a new node in the output Dask graph, we then apply .rechunk to return final array with the desired temporal chunking. This approach significantly complicates Dask scheduling and leads to much higher peak memory usage.
Reading HDF-like data through rasterio and GDAL is possible, but can be much less efficient than using native libraries (zarr,hdf,netcdf) directly. Implementing new IO driver should be relatively straightforward. At high level it needs to support:
- Open
- Compute and report GeoBox and other metadata of the underlying data
- Read some slice of the data
- Robert Woodcock
- Kirill Kouzoubov
- odc-geo
- eo3 refactor
- datacube-zarr
- datacube-core
- https://github.com/stac-extensions/hsi - STAC HSI extension
- hyperspectral-notebooks repo demonstrating ODC-like examples of working with hyperspectral data.
Welcome to the Open Data Cube