Semantic Segmentation messages #71
I agree that having additional metadata would be useful, but is that worth the disadvantages I've outlined in #63 (comment)? |
@mintar totally agree. Looking at the proposal, I don't know how to massage all that content into a I suppose the Between parameters and a Maybe we use a 2 channel image format, 1 for the class and a second for the confidence? Does the CV tools work for 2 channel images or would we need to add in a 3rd for visualization purposes. I suppose the 3rd channel could be |
msg/SemanticSegmentation.msg
Outdated
|
||
# the confidence of the inference of each pixel | ||
# between 0-100% | ||
uint8[] confidence |
Why not use the full range 0-255 here?
That's not standard for the output of AI libraries, they usually give out a probability value. It would be a bit unnatural to convert it to 0-255 for use (e.g. I want to use predictions that are at least 80% confident). I'd argue this should actually be a float (?) from 0-1.
It could be a float, but does it really make a difference to have a confidence of, say, 60% (uint8) or 60.5% (float), knowing that this would increase the space used by this array by a factor of 4?
Especially on an array field that is expected to contain a lot of elements. On a 640x360 mask, changing this to a float32 would mean an increase of 640x360x3 bytes, almost 700kB per message with respect to uint8.
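The size estimate above can be checked with a few lines of arithmetic. This is only a sketch of the raw payload size for the 640x360 mask mentioned in the comment; it ignores any serialization or header overhead, and the variable names are illustrative.

```python
# Raw payload size of a per-pixel confidence array for a 640x360 mask.
WIDTH, HEIGHT = 640, 360
BYTES_UINT8, BYTES_FLOAT32 = 1, 4

pixels = WIDTH * HEIGHT
size_uint8 = pixels * BYTES_UINT8
size_float32 = pixels * BYTES_FLOAT32
extra_bytes = size_float32 - size_uint8  # cost of switching uint8 -> float32

print(f"uint8:   {size_uint8} bytes")
print(f"float32: {size_float32} bytes")
print(f"extra:   {extra_bytes} bytes per message")
```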
You should make sure to explain these are uints not floats, otherwise people will just put their 0-1 percentages from TF or something into it and it'll always render as 0, since it needs to be multiplied by 100 and then cast to an int (in the comment above).
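The pitfall described in this comment can be illustrated as follows. This is a minimal sketch, assuming the field keeps the 0-100 `uint8` convention from the proposed message; `probability_to_percent` and `naive_cast` are hypothetical helper names, not part of any library.

```python
def probability_to_percent(p: float) -> int:
    """Map a probability in [0, 1] to an integer percentage in [0, 100]."""
    clamped = min(max(p, 0.0), 1.0)
    return int(round(clamped * 100))


def naive_cast(p: float) -> int:
    """What happens if a 0-1 probability is cast to an integer directly."""
    return int(p)


probs = [0.0, 0.42, 0.8, 0.99, 1.0]
percent = [probability_to_percent(p) for p in probs]
naive = [naive_cast(p) for p in probs]

# Thresholding stays natural with the percentage convention.
confident = [pct >= 80 for pct in percent]

print(percent)
print(naive)
print(confident)
```

With the direct cast, every probability below 1.0 collapses to 0, which is why the field comment needs to spell out the scaling.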
msg/SemanticClass.msg
Outdated
|
||
# Integer value corresponding to the value of pixels belonging | ||
# to a given class in a segmentation mask | ||
uint16 class_id |
Why is this a `uint16` when the `data` field in `SemanticSegmentation.msg` is only `uint8`?
Limiting the number of classes to 255 is probably bad, but doubling the message size for the segmentation image by making it a uint16 is equally bad. Anyhow, the two should be consistent.
On second thought, it's probably better to leave it as uint16. If we use a `sensor_msgs/Image` as the segmentation image format, we can choose between mono8 and mono16 encodings as appropriate there.
Can't speak to that exactly without his comment, but there can easily be more than 255 class values in segmentation algorithms.
My bad, they must be consistent. I will use `uint16` for the final version of the PR.
msg/SemanticClass.msg
Outdated
@@ -0,0 +1,10 @@ | |||
# A key value pair that maps an integer class_id to a string class label. | |||
# The class_id should be interpreted as the pixel value corresponding to a |
There's trailing whitespace in this line which makes the tests fail. Also both files are missing a newline at the end of the file.
@SteveMacenski wrote:
I don't think the class_map would be that massive. Let's be generous and assume 20 characters per class name, and let's assume we have 255 classes. Then the whole array would be 255 * (20 + 1) = 5355 bytes. A 640x480 uint8 image has 307200 bytes, so the class_map would be around 1.7% of that. On the other hand, 5kb overhead is not nothing, so I'd like to hear better ideas. BTW, how to handle the mapping from int to class string is also an open issue with the existing messages ( I like the idea of publishing all the metadata on a "semantic_info" topic, similar to the camera_info topic. If we're really worried about the overhead of the class_map, perhaps we don't publish the semantic_info in sync with each frame, but instead only whenever it changes on a latched topic?
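The overhead estimate for the `class_map` can be reproduced like this. It is a back-of-the-envelope sketch using the numbers from the comment; reading the `+ 1` as one byte for the class id is an assumption, and real serialized sizes would also include string length prefixes.

```python
# Numbers taken from the comment above; the constant names are illustrative.
NUM_CLASSES = 255
CHARS_PER_NAME = 20
BYTES_PER_ID = 1  # assumption: the "+ 1" in the estimate is the id byte

class_map_bytes = NUM_CLASSES * (CHARS_PER_NAME + BYTES_PER_ID)
image_bytes = 640 * 480  # one uint8 segmentation image
overhead = class_map_bytes / image_bytes

print(f"class_map: {class_map_bytes} bytes")
print(f"image:     {image_bytes} bytes")
print(f"overhead:  {overhead:.1%}")
```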
Here's a list of all OpenCV image encodings: sensor_msgs/image_encodings.h You can have 2-channel images, like Alternatively, the 2 images (semantic segmentation and confidence) could be published on 2 different topics. That would have the following advantages:
|
Have a
Maybe then this is more generic than just a
That is interesting. I wonder about the run-time impacts of that since anything about 1-2MB in size has a hard time publishing without zero-copy or composition. I'm not sure if it's better to have one 2MB message or two 1MB messages going into the same node. I suspect 1 message is better from some internal Samsung projects I'm helping with, but I don't have the data specifically to support that at the moment. |
I agree with the idea of using
This message seems to also work for object detection models using the other messages in vision_msgs. With that in mind, when things are settled I can reformat this PR to include these messages |
I think uint16 is a good bet for the moment and the current state of AI. If in another decade our magic AIs can know millions of things, we can always increase it. |
Should we settle for two distinct images for the mask and confidence? Or should we use a single dual-channel image and define a convention to specify which channel should contain what? This could be overcome by adding a metadata field to the The only drawback of using two images seems to be the potential impact of publishing 2 data-intensive messages instead of one. That would also force the message consumer to synchronize them (which can be done with a message filter, but drifts from the ROS standard implementation). This approach seems to be more flexible and neither relies on assumed conventions nor requires extra configuration parameters on the consumers. Let me know what you think so I can fix the PR and adjust the work on the segmentation plugin |
I'd prefer two separate image topics instead of a two-channel image. |
I prefer a single 2-channel image over 2 separate images 😆 Maybe we need to ask @tfoote what his thoughts are. He usually brings up very good points when I propose interface definitions. |
One thing that speaks for two separate images is that you can have separate data types. You wrote yourself that it's possible to have more than 255 classes. In that case, the segmentation image would need to be a uint16. If you have a single 2 channel image, that would mean the confidence image must also be uint16. With separate images, you could always use uint8 for the confidence image and uint8 or uint16 for the segmentation image, depending on the number of classes that you have. |
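To make the data-type argument above concrete, here is a rough payload comparison. It is a sketch under stated assumptions: a 640x480 resolution (borrowed from an earlier comment in this thread), more than 255 classes so the mask needs `uint16`, and message headers ignored. `image_bytes` is a hypothetical helper.

```python
def image_bytes(width: int, height: int, channels: int, bytes_per_channel: int) -> int:
    """Raw pixel payload of an image, ignoring headers and padding."""
    return width * height * channels * bytes_per_channel


W, H = 640, 480

# Option 1: one 2-channel image; both channels must share the uint16 type.
two_channel_uint16 = image_bytes(W, H, channels=2, bytes_per_channel=2)

# Option 2: separate images; uint16 mask plus uint8 confidence.
mask_uint16 = image_bytes(W, H, channels=1, bytes_per_channel=2)
confidence_uint8 = image_bytes(W, H, channels=1, bytes_per_channel=1)
separate_total = mask_uint16 + confidence_uint8

print(f"single 2-channel uint16 image: {two_channel_uint16} bytes")
print(f"separate mask + confidence:    {separate_total} bytes")
print(f"mask only (no confidence):     {mask_uint16} bytes")
```

A subscriber that only needs the mask receives half of what the 2-channel image would cost it, and one that needs both still saves a quarter.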
The option behind door number three is that we actually don't include confidence information at all, and make as part of the Interface API that any pixel below a certain threshold should be set to |
We can't do -1 with an unsigned int, but in semantic segmentation images "0" usually means "unknown class". The confidence image is still nice extra info to have. |
Another advantage of two separate topics: Not all subscribers may be interested in the confidence values, so they don't even need to subscribe to that topic. |
Well I suppose we could do both; publish the confidence image and also set the precedence that |
The primary reason for choosing to send one multi channel image versus 2 separate images would be if they generally would be used usefully separately. The simplest examples are stereo images. If you want just a monocular camera stream, you don't want to incur the overhead of subscribing to the full stereo image just to ignore the second half of the stream. On the flip side there is a small amount of overhead related to synchronizing incoming images, and they have to have enough embedded content to be useful on their own (aka they both have full headers). We have tools to make it easier, but the extra overhead is why we don't send everything decoupled.
I would disagree and say that publishing the synchronized image messages is the ROS standard. We do it for stereo images, compressed images, and camera_info. And if you're publishing two images instead of one, they should be half the size each as they will have half the data each (or some ratio depending on the datatypes). The other major concern in this sort of discussion is encapsulation. You want to make sure that messages generally can stand on their own without knowing the context that they were broadcast in, such that if you get a rosbag of some random data it will have everything that you need to process it later. |
@SteveMacenski wrote:
I've checked what other semantic segmentation / panoptic segmentation datasets are using, and most of them seem to use 255 for the "UNLABELED" class (bdd100k semantic, COCO-Stuff), although some use 0 (bdd100k panoptic, CityScapes). So I agree we should probably use 255 here. The only trouble is that "-1" means 255 if the semantic segmentation image encoding is @tfoote wrote:
If we include the UNLABELED class (which we absolutely should), then the segmentation image is very useful on its own. I think this probably covers at least 80% of use cases; most clients probably won't care about the confidence image. So that would speak for having two separate image topics. |
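The UNLABELED convention and the thresholding idea discussed above could be combined roughly as follows. This is a sketch, assuming 255 is the UNLABELED value and confidences are 0-100 percentages; nested lists stand in for image data, and `apply_threshold` is a hypothetical helper.

```python
UNLABELED = 255  # assumption: the value favored in the discussion above


def apply_threshold(mask, confidence, min_percent):
    """Return a copy of mask where low-confidence pixels become UNLABELED."""
    return [
        [cls if conf >= min_percent else UNLABELED for cls, conf in zip(mask_row, conf_row)]
        for mask_row, conf_row in zip(mask, confidence)
    ]


mask = [[1, 2], [3, 1]]
confidence = [[90, 40], [85, 79]]

filtered = apply_threshold(mask, confidence, min_percent=80)
print(filtered)
```

A publisher that does this before publishing lets clients use the segmentation image on its own, while the confidence image remains available for those who want it.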
I'd like to bring up though that we're now talking about having to synchronize 3 topics potentially (vision_info, segmentation_mask, segmentation_mask_confidences) for a "typical and full" implementation. That seems to be approaching excessiveness, even though I can live with the individual technical points that lead to that outcome. I'm not sure I agree that they would both be useful separately. The segmentation mask without confidences could be interesting, but the confidences without the segmentation mask are not, unlike the stereo example where either image feed could be individually interesting without the other. |
If we go with the latched topic for the vision messages, we only need to synchronize 2 topics, and only those clients that are interested in the confidences. You're right that the confidence image is not interesting on its own, but Tully's argument still holds: you save 50% data in a lot of use cases by having 2 topics. |
There would need to be a level of documentation required for that workflow -- e.g. it would not be obvious to anyone just looking at the I'd think that, to be analogous to existing similar things, we'd want to publish with the masks/detections as well, regardless of QoS. Those that want to grab it 1 time to use forever can do so, but then others that want it synchronized for whatever reason can, just like |
Publishing vision_info in sync is fine with me! |
To sum up, the consensus seems to be using 2
I will refactor this PR to only include the new Additionally according to @SteveMacenski suggestion some documentation should be added describing how:
Does that sound good @SteveMacenski @mintar ? Also, please let me know what would be the best place to add the documentation |
Sounds good! Comments on how to interpret the data fields (for example, the UNLABELED class) should go into the message comments of that field. More extensive documentation about the workflows should go into the README, IMO. |
Agree, or maybe add in a |
Hi @mintar, I just realized there is already a |
Huh, I forgot about the existing |
README.md
Outdated
@@ -47,6 +47,14 @@ http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D) | |||
in your code, as the message's header should match the header of the source | |||
data. | |||
|
|||
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are: | |||
|
|||
1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile. |
It should be `inference_info` when talking about the topic (as opposed to the message type `InferenceInfo`).
README.md
Outdated
|
||
2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically. | ||
|
||
3. The `InferenceInfo` message can be synchronized with the `Image` messages that containing the segmentation mask and the confidence values of the inference of each pixel. This can be achieved using ROS's `message_filters`. The same [snippet](http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D) mentioned above can work as an example. |
Isn't this incompatible with mentioning the latched topic above? If we use the time synchronizers from `message_filters`, we need to publish an `InferenceInfo` message for every image with the exact same time stamp (just like the `camera_info` topic). The section about latched publishers above sounds like we only publish a message once (or whenever the contents change), analogous to static TFs or maps.
I'm fine with both, BTW.
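To show why exact-time synchronization needs one info message per image, here is a toy pairing queue. It is only an illustration of the idea and not how `message_filters` is implemented; `ExactStampPairer`, the integer stamps, and the string payloads are all hypothetical.

```python
from collections import OrderedDict


class ExactStampPairer:
    """Pairs 'image' and 'info' messages that carry the exact same stamp."""

    def __init__(self, queue_size=10):
        self._pending = {"image": OrderedDict(), "info": OrderedDict()}
        self._queue_size = queue_size

    def add(self, kind, stamp, msg):
        """Store a message; return (image, info) once both stamps match."""
        other = "info" if kind == "image" else "image"
        match = self._pending[other].pop(stamp, None)
        if match is not None:
            return (msg, match) if kind == "image" else (match, msg)
        queue = self._pending[kind]
        queue[stamp] = msg
        if len(queue) > self._queue_size:
            queue.popitem(last=False)  # drop the oldest unmatched message
        return None


pairer = ExactStampPairer()
first = pairer.add("image", 100, "mask@100")
second = pairer.add("info", 100, "info@100")
# An info message published once (latched style) never matches later images.
third = pairer.add("image", 200, "mask@200")
print(first, second, third)
```

With a latched publisher that sends the info only once, every image after the first stays unmatched, which is the incompatibility this comment points out.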
Agreed. `InferenceInfo` should be one of these two:
- A message truly like `CameraInfo`: it is not expected to change. Caching it locally on other nodes should be fine. Latched publishers should be fine.
- It can't be assumed to be the same for each published `Image`. If that's the case, we need to remove all mention of latching and caching, and require that people subscribe to both the `Image` and the `InferenceInfo`. But if they're always being published in pairs, you'll need to remind me why we aren't combining them into one message.
I think it can be both ways, that's why I want to place both options in the readme to describe what would be the common ways of using the message. The third option I propose also assumes the message does not change but does not require the use of a latched topic. I think that it is worth referencing at least the first two but let me know if you think we should settle for only one of them
msg/InferenceInfo.msg
Outdated
# constant) over time, and so it is wasteful to send it with each individual | ||
# result. By listening to these messages, subscribers will receive | ||
# the context in which published vision messages are to be interpreted. | ||
# Each vision pipeline should publish its VisionInfo messages to its own topic, |
VisionInfo -> InferenceInfo
I like the interface that we've settled on here. In terms of naming, if Either way, I'm going to suggest some documentation improvements for clarity. |
README.md
Outdated
@@ -47,6 +47,14 @@ http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D) | |||
in your code, as the message's header should match the header of the source | |||
data. | |||
|
|||
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are: |
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are: | |
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows systems to use standard ROS tools for image processing, and allows choosing the most compact image encoding appropriate for the task. To transmit the metadata associated with the vision pipeline, you should use the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message. This message works the same as `/sensor_msgs/CameraInfo` or [`/vision_msgs/VisionInfo`](msg/VisionInfo.msg):
README.md
Outdated
@@ -47,6 +47,14 @@ http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D) | |||
in your code, as the message's header should match the header of the source | |||
data. | |||
|
|||
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are: | |||
|
|||
1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile. |
1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile. | |
1. Publish `InferenceInfo` to a topic. The topic should be at the same namespace level as the associated image. That is, if your image is published at `/my_segmentation_node/image`, the `InferenceInfo` should be published at `/my_segmentation_node/inference_info`. Use a latched publisher for `InferenceInfo`, so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2, this can be achieved using a `transient local` QoS profile.
README.md
Outdated
|
||
1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile. | ||
|
||
2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically. |
2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically. | |
2. A subscribing node may receive an `InferenceInfo` message, store it locally, then cancel its subscription. |
I don't understand why we need to assume that the publisher publishes periodically - that's the point of making it latched
It's my understanding that a latched topic uses the transient local QoS. In contrast, this second method does not require the usage of any specific QoS; that's why messages should be published periodically to allow new nodes to get the messages.
Makes sense
This message is not only meant for segmentation but for object detection as well, or in general to any vision pipeline that returns id based classifications that you later may want to convert back to human readable string class names. |
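The id-to-name conversion mentioned here amounts to a lookup against cached metadata. The sketch below uses a plain dict as a stand-in for whatever the metadata message ends up carrying; the class names, ids, and the `label_for` helper are all hypothetical.

```python
# Hypothetical class map, cached once by the subscriber.
class_map = {0: "road", 1: "person", 2: "car", 255: "UNLABELED"}


def label_for(class_id: int) -> str:
    """Translate a class id from a mask or detection into a readable name."""
    return class_map.get(class_id, f"unknown({class_id})")


mask_row = [0, 0, 1, 255, 7]
labels = [label_for(class_id) for class_id in mask_row]
print(labels)
```

The same lookup works whether the ids come from segmentation pixels or from object detection results.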
I agree, I'm not a fan of |
@pepisg is this still relevant? Think it would be great to get this merged to bring more obj. det./semantics capabilities to ROS2/Nav2 |
agreed! |
List of outstanding items I see before this can merge:
Anything else? |
Done. left it
We seemed to agree on there being two possible setups: a) latched topic; b) node subscribes to topic, gets one message and then unsubscribes. There are several comments on the PR regarding this, which I tried to address in the last commit. Please let me know if you have any further comments on this section of the documentation.
I guess I integrated all your change requests, let me know if there is anything else I should change / include |
@Kukanani are you happy with these? |
Great, ship it!
In this PR I aim to add a semantic segmentation message. I'm aware of the discussion going on in #63; however, I think using images as masks may leave behind important metadata like the `class_map` (the name of the class corresponding to each ID). Also, using several channels in the image for sending the mask and the confidence of each pixel could lead to varying implementations that may not be compatible with standard functionality relying on semantic segmentation, like the in-progress semantic segmentation costmap plugin PR in nav2.