Adding support for detailed and structured data types #214
Comments
This would give us 5 levels on which we can organize our data within a single run
For any given data, it should always be possible to move the structure up or down a level of organization. For example, take the simple case of a motor and a point detector running a step scan where we want to do multiple sweeps. At the run level we could do one run per sweep or one run with a stream per sweep. Both are allowed within the model, but they have different access patterns (because databroker is built around the concept of access to a run, the searches are currently built around the start document, and the uids / scan_ids are per-start). So if you are always going to do 10 sweeps and you are always going to want to pull up all ten together, then maybe naming your streams

If we now look at the case of one sweep per run, we still have some choices in how to structure the streams. The model allows you to have one stream per point (I personally think it is a bad idea, but do not know how to encode "Tom thinks it is a bad idea" in a schema ;)) or one stream with all of the points in it. The model also allows you to have 2 streams: one for the detector and one for the motor. You could then at analysis time "know" that you need to zip these two streams together (again I think this is a bad idea), or you could put them in the same stream and let the fact that they are in the same event tell you that they should be zipped together.

If we adopt this proposal then we have one more option: we could have a field for the motor and a field for the detector in events in the stream, or we could have one field that uses a structured data type to sticky-tape them together.

The common theme of all of these is how much "pre-aggregation" we are letting the data structures do for us. In all of the cases above, we can make it "work", but some of the choices are going to be more painful, both in terms of programming against them and in terms of performance.
This pain can show up both in terms of having to do too much "zipping" in analysis code or too much "pulling apart".

Another issue we need to think about is how to handle row-major vs column-major in the case where the shape is 1D. I am relatively sure that in the case of 1D (variable length), this description works just as well for dataframes (and other notionally columnar data structures) as it does for record arrays; however, the Python-side data structures that represent these things are not the same and not particularly interchangeable. I also see very strong arguments for providing the option of either. In the case of the detector that prompted this discussion, iirc, we have an hdf5 file that has 3 datasets, which is columnar data. Taking that and re-packing it into a row-major data structure before sending it back to the user (who is likely to transpose it back) seems daft. On the other hand, a different detector I have worked with has a native data layout of c-structs that pack (energy, time, position). Reading all of that in only to transpose it to columnar before sending it back to the user seems equally daft.
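To make the row-major versus column-major distinction concrete, here is a minimal NumPy sketch. The field names (energy, time, position) follow the c-struct detector mentioned above; the values and the variable names are invented for illustration and do not come from any real handler.

```python
import numpy as np

# Row-major ("array of structs"): one record per hit, the way a c-struct
# detector would write it. Values are made up.
record_dtype = np.dtype([("energy", "f4"), ("time", "u4"), ("position", "f4")])
records = np.array(
    [(1.5, 10, 0.25), (2.5, 20, 0.50), (3.5, 30, 0.75)],
    dtype=record_dtype,
)

# Column-major ("struct of arrays"): one contiguous 1-D array per field,
# which is how three separate hdf5 datasets would arrive.
columns = {name: np.ascontiguousarray(records[name]) for name in record_dtype.names}

# Going back means re-packing every column into a fresh record array.
repacked = np.empty(len(records), dtype=record_dtype)
for name, values in columns.items():
    repacked[name] = values
```

Either direction costs a full copy of the data, which is the "daft" round trip when the user immediately wants the original layout back.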
Attn @thomascobb, @callumforrester, @lsammut, @clintonroy (whom it will not let me assign).
Another thought that I missed earlier: embracing the variable length and extra structure makes awkward array look a lot more promising (and pushes us towards data patterns that our friends in the HEP world also have).
My sense from the Pilot call was that folks see this as a positive change and a natural extension of what we have. Specific points:
Can anyone explain what downsides there are to this approach? I think NumPy structured arrays are cool, but I'm skeptical that Bluesky needs them.
Further to a discussion from a few months ago with @tacaswell and @danielballan, does this help to batch large amounts of asynchronously captured data into events or would you still use event paging for that?
@callumforrester If your large batches can be written as a block that can then be described as a structured array: yes. You would still also have events, and events can still be packed into pages. This now lands you on a data structure that fits very badly into either a data frame (they really, really want the type of the values in the columns to be simple scalars) or an xarray (which wants to think of data as a regular cube with labeled axes). As mentioned above, I think the escape hatch here is https://awkward-array.readthedocs.io/en/latest/ (out of the HEP community), which handles this case extremely well (they have lots of aggregated-by-bin data with a variable-length fastest axis). A case where this would work well is fly scans where some hardware system is coordinating (x, y, t) and triggering a camera. You could have an event with 2 fields

{
"image": {
"dtype": "array",
"detailed_dtype": "u4",
"shape": [128, 2028, 1024],
"external": "FILESTORE:",
},
"the_data_from_hardware": {
"dtype": "array",
"detailed_dtype": [["x", "f4"], ["y", "f4"], ["time", "u4"]],
"shape": [128,],
"external": "FILESTORE:",
}
}

and then "number of rows" events. As @untzag points out, option 4 is not so bad in this case: you promote each of "x", "y", and "t" to a top-level field (which in some cases may be better from a data-access-for-analysis point of view!). However, there are some cases where it is problematic from both an implementation and a conceptual standpoint. From an implementation point of view this means that instead of having 1 resource per run and 1 datum per event we'll have at least 3 datum per event (and 3 trips through the filler machinery). We have found that this can be a major performance bottleneck. From a conceptual point of view, let's look at either the motivating case here (an in-detector feature detector for locating single-photon spots so we get

{
"x": {
"dtype": "number",
"detailed_dtype": "f4",
"shape": [],
},
"found_centroids_1": {
"dtype": "array",
"detailed_dtype": [["x", "f4"], ["y", "f4"], ["intensity", "u2"]],
"shape": [-1,],
"external": "FILESTORE:",
}
}

That is "at every point we measure the position of the x-stage and the (variable) number of photon hits on the camera". This makes a point about giving us another nested namespace (which de-conflicts the

Adopting structured arrays also lets us punt a bit longer on sorting out how to reform Resource/Datum to be able to fill more than one field / one event at a time.
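As a rough illustration of how a consumer might interpret a data key like the centroid one above, the sketch below turns the list-of-pairs detailed_dtype into a NumPy dtype and checks an array against the declared shape, treating -1 as "any length". The helper names to_numpy_dtype and matches are hypothetical and not part of event-model or databroker.

```python
import numpy as np

# Data key copied from the centroid example above ("external" omitted).
data_key = {
    "dtype": "array",
    "detailed_dtype": [["x", "f4"], ["y", "f4"], ["intensity", "u2"]],
    "shape": [-1],
}


def to_numpy_dtype(detailed_dtype):
    """Turn a JSON-style detailed_dtype into a numpy dtype (hypothetical helper)."""
    if isinstance(detailed_dtype, str):
        return np.dtype(detailed_dtype)
    # JSON has no tuples, so each [name, type] pair becomes a tuple for numpy.
    return np.dtype([tuple(field) for field in detailed_dtype])


def matches(data_key, array):
    """Check an array against a data key, treating -1 as 'any length'."""
    expected = data_key["shape"]
    if array.dtype != to_numpy_dtype(data_key["detailed_dtype"]):
        return False
    if array.ndim != len(expected):
        return False
    return all(e == -1 or e == n for e, n in zip(expected, array.shape))


# Two events with different numbers of photon hits satisfy the same key.
hit_dtype = to_numpy_dtype(data_key["detailed_dtype"])
few_hits = np.zeros(2, dtype=hit_dtype)
many_hits = np.zeros(7, dtype=hit_dtype)
```

The point of the sketch is that one descriptor entry covers every event in the stream even though the number of hits varies from event to event.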
This is the right attitude! We have made it this far without them. To some degree my publicly verbose comments on this are as much about convincing myself this is a good idea as anyone else ;)

Getting a bit further ahead, looking at https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_numpy.html#ak-from-numpy they have a bool to pick between putting a special layout machinery on top (to be aware of the columns in the dtype) or just blindly treating it like a numpy array. They have come to regret that decision and in 2.0 are going to always force

This also goes to a discussion I had with @danielballan about adding a

We want to keep them distinct from DataFrames because once you go to a DataFrame you are locking yourself into 1 primary axis / index and lots of assumptions about by-block columnar access. In the case of a record array we want to be able to access N-d chunks. It happens that all of the examples that we have gone through in this thread are 1-D at the outer array level, but that may not always be the case (think a 2D fly scan with a 2D array of (x, y, I) in each event and then running an outer step scan around that (think XANES mapping or fluorescence tomography (or XANES tomography for the very patient))).
Thanks for your further explanation @tacaswell. It seems like you reject option 4 (expand each column in the table to a field on the top-level event) because it constrains Bluesky to one set of keys per event. The nested "event field"/"array field" structure proposed in option 2 is preferable because it retains useful information about the relationship between arrays. That makes sense to me. If the complexity is truly present in the data we shouldn't try to hide it by forcing a simpler data structure. @ksunden and I have solved similar problems by inserting extra axes such that the broadcast rules tell us all that we need to know about array relationships. However, that only works for well-behaved data; I think Bluesky looks to support more complex cases, including fully asynchronous & unstructured stuff. Can we still "make the easy things easy" after this change? My personal focus is on creating an orchestration and analysis layer that "just works" for the very simple experiments we undertake. I worry that some of the live visualization and processing will not be compatible with structured arrays. Will
(a small, rough thought) When combining data from multiple devices into a single event document, Bluesky currently prepends the device name to create a totally flat namespace. The information about which device sourced the data is stored in object keys [1]. I remember not liking this approach when I first tried to access data in Bluesky. Anyway, perhaps it's useful to compare and contrast this behavior with the behavior proposed here, since both seem to boil down to trying to preserve useful structure without adding complexity. [1] https://blueskyproject.io/bluesky/event_descriptors.html#object-keys
That is a very good question. I'm optimistic we can find some way to spell (optionally) digging down, but it is not obvious to me yet.
Yes, this should be a mostly opt-in functionality. The place where it is not optional is that if you have document consumers that only look at dtype then there is a chance that something with a structured array will make it through and explode. However, this is also currently true if an array of unusual type (strings or objects) were to make it through so the addition of the detailed dtype allows you to nope out of trying to handle data you do not know how to handle / expect (very much like LiveTable drops anything that is not a scalar on the floor).
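A minimal sketch of that kind of "nope out" check, assuming (as in the examples elsewhere in this thread) that a structured type shows up as a list-valued detailed_dtype while plain types are strings. The function names are hypothetical; this is not how LiveTable or any existing consumer is implemented.

```python
def is_structured(data_key):
    """True when the data key advertises a structured (multi-field) dtype."""
    return isinstance(data_key.get("detailed_dtype"), list)


def fields_to_process(data_keys):
    """Names of the fields this consumer is willing to handle."""
    # Anything structured is dropped on the floor rather than risking a crash.
    return [name for name, key in data_keys.items() if not is_structured(key)]


data_keys = {
    "x": {"dtype": "number", "detailed_dtype": "f4", "shape": []},
    "found_centroids_1": {
        "dtype": "array",
        "detailed_dtype": [["x", "f4"], ["y", "f4"], ["intensity", "u2"]],
        "shape": [-1],
    },
    # A key written before detailed_dtype existed still passes through.
    "legacy_det": {"dtype": "number", "shape": []},
}
```

A consumer that only inspects "dtype" would see "array" for found_centroids_1 and could not tell it apart from a plain numeric array, which is the failure mode described above.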
The prepending happens in the ophyd object, rather than in the RunEngine (the dual use of 'bluesky' is extra confusing here). We did not want to force any naming schemes on the keys, but also knew that we needed them to be unique when multiple devices were read. Grabbing the

The object-keys mapping exists in the descriptor so that you can reconstruct what was in any given device without relying on the heuristics of the names (you probably should have systematic names, but that is for the humans, not the computers). The object-keys mapping is also a way out of my "two sets of variable-length arrays" above, at the cost of saying that the ambiguous sets cannot come from the same device. However, that feels a bit wrong to me as it is adding extra constraints on what the shapes of the values of the readings are.
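For readers unfamiliar with the object-keys mapping, the following sketch shows how the flat event namespace can be regrouped per device. The descriptor is a trimmed-down, made-up example (real descriptors carry more entries), and group_by_device is a hypothetical helper.

```python
# A trimmed-down, hypothetical descriptor: two devices read into one event.
descriptor = {
    "data_keys": {
        "motor": {"dtype": "number", "shape": []},
        "motor_user_setpoint": {"dtype": "number", "shape": []},
        "det": {"dtype": "number", "shape": []},
    },
    "object_keys": {
        "motor": ["motor", "motor_user_setpoint"],
        "det": ["det"],
    },
}

# The event itself only has the flat, device-prefixed keys.
event_data = {"motor": 1.0, "motor_user_setpoint": 1.0, "det": 42.0}


def group_by_device(descriptor, data):
    """Rebuild per-device readings without parsing the key names."""
    return {
        device: {key: data[key] for key in keys}
        for device, keys in descriptor["object_keys"].items()
    }


per_device = group_by_device(descriptor, event_data)
```

The grouping relies only on the mapping, never on the naming convention, which is the distinction drawn above between names for humans and structure for computers.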
So, while trying to write a JSON schema to limit the detailed dtype, it turns out that opening the door to numpy structured datatypes opens the door to infinitely deep structured data:

np.dtype(
[
(
"a",
[
("b", "u1"),
("c", "f4"),
],
),
("d", "c16"),
]
)

This is a datatype with 2 fields

I think I am comfortable saying "you get one (1) level of structure in event model, if you think you need more let's talk" as a) I really do not want to open the door to infinitely deep structures and b) it looks like awkward does not either.
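One way to enforce a "one level only" rule on the Python side would be to measure how deeply a dtype nests named fields. The sketch below is only an illustration of that idea; structure_depth is a hypothetical helper, not something event-model provides.

```python
import numpy as np


def structure_depth(dtype):
    """Count how many levels of named fields a numpy dtype nests."""
    dtype = np.dtype(dtype)
    if dtype.subdtype is not None:
        # Sub-array dtypes such as ("f4", (2,)) add shape, not named structure.
        return structure_depth(dtype.subdtype[0])
    if dtype.names is None:
        return 0
    return 1 + max(
        (structure_depth(dtype.fields[name][0]) for name in dtype.names),
        default=0,
    )


flat = np.dtype([("x", "f4"), ("y", "f4"), ("intensity", "u2")])
# The nested dtype from the example above.
nested = np.dtype([("a", [("b", "u1"), ("c", "f4")]), ("d", "c16")])
```

Under the proposed rule, a dtype with depth 0 or 1 would be accepted and anything deeper would be rejected.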
In a bit more digging, motivated by the assumption that this is a Problem that Someone has solved, it turns out numpy (https://numpy.org/doc/stable/reference/arrays.interface.html#type-description-examples) and cpython (https://www.python.org/dev/peps/pep-3118/ / https://docs.python.org/3/library/struct.html) have both solved this problem. There is some overlap between the two, but they are not identical. There is code at the numpy C level to generate the pep3118-compatible string and a private Python function to build a dtype from a pep3118 spec. While there is no public API for converting between the two, you can use public machinery to go between them:

In [63]: np.ones([0], np.dtype([('a', float), ('b', int)]))
Out[63]: array([], dtype=[('a', '<f8'), ('b', '<i8')])
In [64]: memoryview(np.ones([0], np.dtype([('a', float), ('b', int)]))).format
Out[64]: 'T{d:a:l:b:}'
In [65]: np.array(memoryview(np.ones([0], np.dtype([('a', float), ('b', int)])))).dtype
Out[65]: dtype([('a', '<f8'), ('b', '<i8')])

With numpy dtypes it is possible to define dtypes with overlapping or out-of-order fields; however, this is not describable with either the

It appears that the data-api coalition has not taken on structured data yet: https://data-apis.org/array-api/latest/API_specification/data_types.html

I think the numpy spelling is easier to read for humans, is better documented outside of our projects (the

The pro I see for the pep3118-style string is that we can get away with only 1 string. I think the options for spelling this are:
Despite the added verbosity, I think that option (2) is the best due to the type stability (thinking a bit ahead to wanting to consume this in c++/js/java). One thing that we cannot directly encode using any of these schemes is "subdtype", which is a way for a dtype to control the last dimensions of an array. However, I do not think that this is actually a problem, because when you make an array with a dtype that is a subdtype, the shape of the resulting array absorbs the extra dimensions and the array reports its dtype as the base dtype of the subdtype:

In [116]: np.zeros((2, 2), np.dtype('3i')).shape
Out[116]: (2, 2, 3)
In [123]: np.zeros((2, 2), np.dtype('3i')).dtype
Out[123]: dtype('int32')
In [124]: np.zeros((2, 2), np.dtype('3i')) == np.zeros((2, 2, 3), 'i')
Out[124]:
array([[[ True, True, True],
[ True, True, True]],
[[ True, True, True],
[ True, True, True]]])
In [125]: np.zeros((2, 2), np.dtype('3i')).dtype == np.zeros((2, 2, 3), 'i').dtype
Out[125]: True
In [126]: np.zeros((2, 2), np.dtype('3i')).shape == np.zeros((2, 2, 3), 'i').shape
Out[126]: True
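On the type-stability point, a list-of-pairs spelling like the one used in the data key examples in this thread survives a trip through JSON with only a tuple/list conversion on each side. The sketch below uses the public dtype.descr attribute; it is an illustration under that assumption and is not necessarily identical to any of the options listed above.

```python
import json

import numpy as np

dtype = np.dtype([("x", "f4"), ("y", "f4"), ("intensity", "u2")])

# dtype.descr yields (name, type-string) tuples; JSON only has lists.
as_json = json.dumps([list(field) for field in dtype.descr])

# Reading it back: numpy insists on tuples, so convert each pair again.
restored = np.dtype([tuple(field) for field in json.loads(as_json)])
```

Every entry in the serialized form is a list of strings, so a consumer written in another language never has to branch on the JSON type of an element.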
I think all the flyscan use cases I have can be solved without structured data, but I guess it would be useful to have this as an escape hatch. For example, consider a 512x512 detector that produces data at 10kHz, then a PandABox that produces X, Y, T for each of those events. For the detector, I would produce a single event page once a second of shape (~10000, 512, 512). It's approximately 10000 frames as different detectors have different readout rates, so I'd rather not wait for exactly 10000 frames. For the PandABox, it produces its data in a row-major format, so I could either produce an event page of shape (~10000, 3) in native format (which would need your structured data changes), or unpack into 3 event pages of shape (~10000,). We currently intend to do the latter as it maps better to an HDF file. I would be inclined to keep on doing that, as it means that if we produce X and Y on a different PandABox to T, it would be transparent to Analysis as they would be 3 different streams.
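The two options for the PandABox data can be sketched with NumPy as follows. The block of ten rows is made-up stand-in data (not real PandABox output), and the lowercase field names x, y, t are chosen here for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Stand-in for a row-major block of (X, Y, T) rows; ten invented rows
# instead of ~10000 to keep the sketch small.
block = np.arange(30, dtype="f8").reshape(10, 3)

# Option A: keep the native layout and describe it with a structured dtype.
xyt_dtype = np.dtype([("x", "f8"), ("y", "f8"), ("t", "f8")])
structured = rfn.unstructured_to_structured(block, dtype=xyt_dtype)

# Option B: unpack into three independent 1-D columns, one per event page.
x, y, t = (np.ascontiguousarray(block[:, i]) for i in range(3))
```

Option A yields a single array of shape (10,) whose elements carry three named fields; option B yields three arrays of shape (10,) that can be stored and streamed independently.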
Is this feature still under active development? I would be very interested in having support for structured data types.
It looks like the formal specification languished in #215, but we are in fact using
Thanks Daniel, I'll conform to the spec in #215 for now.
At one of the beamlines at NSLS-II we have ended up with a handler that is returning a data-frame instead of an array. For the data that it is loading this is quite natural (an area detector plugin that does peak-finding/centroiding on the images for single-photon counting); however, the core of the problem here is that as written this handler is not consistent with the descriptor and is currently un-describable.
The Document Model promises that if you look at the descriptor you will know what the name, type and shape of the data in each event will be (e.g. "there is a field 'x' and it is integers", "there is a field called 'img' and it is a [5, 1028, 962] array"). Within the vocabulary that we have in the descriptor we cannot say "within each event you find a field called 'foo' that is a table that has the columns ...". This is in part because one of the key assumptions we made when developing the document model is that the Event is the lowest level that is allowed to have any structure other than being a homogeneous array.
The current handler is "working" because we previously did not actually enforce that the descriptor was not lying to us (the latest round of databroker work + tiled + dask is finally making use of the descriptor and we are discovering all of the places where we had shape mismatches). Given the possibly very wide-ranging impacts, the mildly existential scope, and the obvious importance of this, we should think carefully about it and make sure we get it right. I can see a couple of possible ways out of this:
After some internal discussion at NSLS-II we are leaning towards option 2. I think the steps here are:
-1 in the shape means "unknown dimension".
In this proposal the data key for a field that is a table would look something like:
which says "Each event in the event stream has a field called 'centroids' which is an array of unknown length containing elements that are two 32-bit floats with the names ('x', 'y') and a 64-bit float with the name 'intensity'".
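A data key matching that description might be spelled as in the sketch below. This is a reconstruction under assumptions: the detailed_dtype key name and list-of-pairs spelling are borrowed from the examples in the comments on this issue, and the centroid values are invented.

```python
import numpy as np

# Hypothetical data key for the 'centroids' field described above.
data_keys = {
    "centroids": {
        "dtype": "array",
        "detailed_dtype": [["x", "f4"], ["y", "f4"], ["intensity", "f8"]],
        "shape": [-1],
    }
}

# The same description expressed as a numpy dtype.
centroid_dtype = np.dtype(
    [tuple(field) for field in data_keys["centroids"]["detailed_dtype"]]
)

# One event's worth of data: three centroids with made-up values.
centroids = np.array(
    [(1.0, 2.0, 10.0), (3.0, 4.0, 20.0), (5.0, 6.0, 30.0)],
    dtype=centroid_dtype,
)
```

Each element is 16 bytes (4 + 4 + 8), and a different event in the same stream could hold any other number of centroids.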
If we assume the color axis is last in a color time series then we could say
which says "Each event in this event stream contains a field called 'color_video' that is a 1024 by 926 array and each element of the array is an RGB tuple of unsigned 8-bit integers".
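As a sketch of what such a color_video field could look like in NumPy: the field names r, g, b and the detailed_dtype spelling are assumptions made for this illustration (the spelling is borrowed from the comments on this issue).

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical data key: a 1024 by 926 frame whose elements are RGB triples.
data_keys = {
    "color_video": {
        "dtype": "array",
        "detailed_dtype": [["r", "u1"], ["g", "u1"], ["b", "u1"]],
        "shape": [1024, 926],
    }
}

rgb_dtype = np.dtype(
    [tuple(field) for field in data_keys["color_video"]["detailed_dtype"]]
)
frame = np.zeros((1024, 926), dtype=rgb_dtype)

# The equivalent plain array has the color axis last.
plain = rfn.structured_to_unstructured(frame)
```

The structured form keeps the declared shape two-dimensional, while the plain form needs a third axis of length 3 whose meaning must be documented elsewhere.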
numpy also lets you put arrays inside of your structured types so an alternate way of spelling the first case is
which says "Each event in the event stream has a field called 'centroids' which is an array of unknown length containing elements that are a 2-tuple of a 2-tuple of 32-bit floats with the name 'position' and a 64-bit float with the name 'intensity'". I think in this case the first one is better (because the (x, y) vs (row, col) issue haunts my dreams), but it is worth noting this sort of thing is possible.
[edited to remove half-finished thoughts that will become a new post...the ctrl-enter vs enter for new line vs post in gh vs slack vs gmail vs ... is annoying]
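In NumPy terms, the alternate spelling uses the three-element field form (name, base type, sub-array shape). The sketch below shows that form with invented values; how it would be written in the descriptor is not shown because the original example is not available here.

```python
import numpy as np

# numpy's three-element field form: (name, base type, sub-array shape).
nested_dtype = np.dtype([("position", "f4", (2,)), ("intensity", "f8")])

# Three centroids with made-up values.
centroids = np.zeros(3, dtype=nested_dtype)
centroids["position"] = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
centroids["intensity"] = [10.0, 20.0, 30.0]

# Pulling out the field gives back a plain (N, 2) float32 array,
# with no record of whether the pair means (x, y) or (row, col).
positions = centroids["position"]
```

That loss of axis meaning is the reason given above for preferring the spelling with explicitly named 'x' and 'y' fields.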