
Semantic Labeling #203

Draft · wants to merge 40 commits into ros2-devel
Conversation

@sriramk117 (Contributor) commented Nov 9, 2024

Description

Overview

This PR implements an ADA feeding perception node that enables semantic labeling. Semantic labeling provides perception-side infrastructure for several user interface advances, including: (1) natural language prompting for ADA users, (2) auto-feeding features that let the robot keep feeding a specific food item on the plate, without further user prompting, until it is no longer there, and (3) taking user preferences (i.e., how they want to eat their meal) into account during bite selection.

ROS2 Interfaces

First, the node creates an action server (GenerateCaption) that takes in a list of string labels describing the food items on the plate (for example, ['meat', 'cucumber']) and then runs GPT-4o to compile these string labels into a visually descriptive sentence query for GroundingDINO (for example, "Food items including strips of grilled meat and seasoned cucumber spears arranged on a light gray plate.").

The node also creates an action server (SegmentAllItems) that runs a pipeline consisting of the vision language model GroundingDINO and the segmentation model EfficientSAM/SAM to output a mask, paired with a semantic label, for each of the detected food items on the plate.
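
For concreteness, here is a hedged sketch of how goals for these two actions might be constructed in Python. The field names input_labels and caption come from this PR; the values and the plain-script form are illustrative.

```python
# Illustrative only: constructing goals for the two action servers described
# above, assuming the ada_feeding_msgs action definitions added in this PR.
from ada_feeding_msgs.action import GenerateCaption, SegmentAllItems

caption_goal = GenerateCaption.Goal()
caption_goal.input_labels = ["meat", "cucumber"]

segment_goal = SegmentAllItems.Goal()
segment_goal.caption = (
    "Food items including strips of grilled meat and seasoned cucumber "
    "spears arranged on a light gray plate."
)
```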

Implementation Details

The perception node uses three foundation models to achieve semantic labeling: GPT-4o, GroundingDINO, and EfficientSAM/SegmentAnything.

From experimenting with prompting techniques on GroundingDINO and its feature extraction functionality, we determined that prompting GroundingDINO with a sentence that contains descriptive qualifiers results in significantly better bounding box detection. Therefore, we query GPT-4o with the latest above-plate image and the user-provided list of strings labeling the food items on the plate, and it generates a visually descriptive sentence that incorporates those labels to describe the image. This sentence is then used as the input prompt for GroundingDINO.

See the current system and user queries for GPT-4o below:

System Query:

You are a prompt engineer that is assigned to describe items
in an image you have been queried with so that a vision language model
can take in your prompt as a query and use it for classification
tasks. You respond in string format and do not provide any explanation
for your responses.

User Query:

Your objective is to generate a sentence prompt that describes the food
items on a plate in an image. 
You are given an image of a plate with food items and a list of the food items 
on the plate.
Please compile the inputs from the list into a sentence prompt that effectively
lists the food items on the plate.
Add qualifiers to the prompt to better visually describe the food for the VLM
to identify. Don't add any irrelevant qualifiers.

Here is the input list of food items to compile into a string: {labels_list}

Here are some sample responses that convey how you should format your responses:

Food items including grapes, strawberries, blueberries, melon chunks, and 
carrots on a small, blue plate.

Food items including strips of grilled meat and seasoned cucumber 
spears arranged on a light gray plate.

Food items including baked chicken pieces, black olives, bell pepper slices, 
and artichoke on a plate.
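
Below is a minimal sketch of how such a query could be issued with the openai and python-dotenv packages added to requirements.txt in this PR. The prompt plumbing, image path, and encoding are assumptions; only the model name, the {labels_list} placeholder, and the .env-based API key storage come from this PR and its review discussion.

```python
# Hedged sketch: compiling user-provided labels into a GroundingDINO caption
# with GPT-4o. Assumes OPENAI_API_KEY is stored in a .env file.
import base64

from dotenv import load_dotenv
from openai import OpenAI

SYSTEM_QUERY = "..."  # the system query shown above
USER_QUERY = "..."    # the user query shown above, containing a {labels_list} placeholder

load_dotenv()  # loads OPENAI_API_KEY from a .env file into the environment
client = OpenAI()

# Encode the latest above-plate RGB image (placeholder path) for the vision input.
with open("above_plate.jpg", "rb") as image_file:
    image_b64 = base64.b64encode(image_file.read()).decode("utf-8")

labels_list = ["blueberries", "carrots", "strawberries", "green grapes"]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_QUERY},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": USER_QUERY.format(labels_list=labels_list)},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        },
    ],
)
caption = response.choices[0].message.content
```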

GroundingDINO is a vision language model that detects objects in an image (through bounding boxes) given natural language prompts. We use GroundingDINO as it pairs semantic labels from the natural language prompt with each of the detected bounding boxes.

Then, we use the segmentation model EfficientSAM/SAM to extract pixel-wise masks of each of the detected objects given their bounding boxes.
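
To make the handoff concrete, here is a minimal sketch of the detection-then-segmentation flow, assuming the standard GroundingDINO and segment_anything Python APIs and vanilla SAM rather than EfficientSAM. Paths, checkpoints, and thresholds are placeholders, not this PR's actual configuration.

```python
# Hedged sketch: GroundingDINO boxes -> SAM masks, each paired with a phrase.
import torch
from groundingdino.util import box_ops
from groundingdino.util.inference import load_image, load_model, predict
from segment_anything import SamPredictor, sam_model_registry

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("above_plate.jpg")

caption = ("Food items including blueberries, carrots, strawberries, "
           "and green grapes on a blue plate.")
boxes, logits, phrases = predict(
    model=dino, image=image, caption=caption,
    box_threshold=0.35, text_threshold=0.25,
)

# GroundingDINO returns normalized cxcywh boxes; convert to pixel xyxy for SAM.
h, w, _ = image_source.shape
boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# One mask per detected box, paired with the phrase GroundingDINO matched to it.
for box, phrase in zip(boxes_xyxy.numpy(), phrases):
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    print(phrase, masks[0].shape, float(scores[0]))
```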

Here, we share a couple of images of GroundingDINO's bounding box detection performance on various above-plate photos:

[Three screenshots of GroundingDINO bounding box detections on above-plate images, captured 2024-11-09]

Set Up Steps

Follow these steps to install the necessary dependencies:

  • Open a new terminal window and go into your ADA workspace.
  • Go into the ada_feeding directory: cd src/ada_feeding
  • Run pip install -r requirements.txt

Testing Procedure

Test Setup

Before running any isolated tests, run each of the following commands in a separate terminal window. Also, ensure that source install/setup.bash is called in each window that will run ROS commands.

  • Launch the web app:
    • First, open three separate terminal tabs/windows and go into the correct folder in each window: cd src/feeding_web_interface/feedingwebapp
    • Run the following listed commands in the separate terminal tabs/windows:
      • Launch the web app server: npm run start
      • Set up the WebRTC server: node --env-file=.env server.js
      • Start the robot browser: node start_robot_browser.js
  • Launch the perception nodes: ros2 launch ada_feeding_perception ada_feeding_perception.launch.py
  • Launch the dummy nodes: ros2 launch feeding_web_app_ros2_test feeding_web_app_dummy_nodes_launch.xml run_food_detection:=false run_face_detection:=false
  • Launch the nano bridge sender: ros2 launch nano_bridge sender.launch.xml
  • Launch the nano bridge receiver: ros2 launch nano_bridge receiver.launch.xml

Below, see isolated tests for each of the two action servers that enable semantic labeling (the SegmentAllItems action and the GenerateCaption action).

Test GPT-4o Action:

  • Open a separate terminal window. To send action goals to the GenerateCaption action server, use this command: ros2 action send_goal /GenerateCaption ada_feeding_msgs/action/GenerateCaption "input_labels: ['placeholder']"
    • In place of the placeholder, you would have a list of strings that concisely label the different food items to be eaten during the meal. Here is an example send goal ROS command you could use for this action: ros2 action send_goal /GenerateCaption ada_feeding_msgs/action/GenerateCaption "input_labels: ['blueberries', 'carrots', 'strawberries', 'green grapes']"
  • If the action succeeds, you will find the result returned as a string called caption.

Test GroundingDINO + SegmentAnything vision pipeline:

  • Open a separate terminal window. To send action goals to the SegmentAllItems action server, use this command: ros2 action send_goal /SegmentAllItems ada_feeding_msgs/action/SegmentAllItems "caption: Some input sentence query for the pipeline that visually describes the food items on the plate."
    • Here is an example of a sample action goal you could send which includes an example input caption: ros2 action send_goal /SegmentAllItems ada_feeding_msgs/action/SegmentAllItems "caption: Food items on a blue plate including blueberries, carrots, strawberries, and green grapes."
  • If the action succeeds, it will return a list of Mask messages (containing information like the region of interest and a binary mask of the segmented item) named detected_items and a list of strings named item_labels
    • Each semantic label of the item_labels list corresponds to the Mask message at the same index of the detected_items list

I believe the most useful way to test the performance of semantic labeling is by merging these isolated tests together. First, send a goal to the GenerateCaption action server and get the resulting caption. Then send this caption as input to the SegmentAllItems action. This more closely mirrors how the web interface sends goals to each of the action servers and fetches results to display during the bite selection stage; a sketch of the combined flow is shown below.
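
For reference, here is a hedged sketch of that combined flow as a small rclpy client. The action names and message fields come from this PR; the client script itself is illustrative and not part of the PR.

```python
# Hypothetical test client chaining GenerateCaption -> SegmentAllItems,
# roughly the way the web app would during bite selection.
import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node

from ada_feeding_msgs.action import GenerateCaption, SegmentAllItems


def call_action(node: Node, client: ActionClient, goal_msg):
    """Send a goal and block until the result is available."""
    client.wait_for_server()
    send_future = client.send_goal_async(goal_msg)
    rclpy.spin_until_future_complete(node, send_future)
    result_future = send_future.result().get_result_async()
    rclpy.spin_until_future_complete(node, result_future)
    return result_future.result().result


def main():
    rclpy.init()
    node = Node("semantic_labeling_test_client")
    caption_client = ActionClient(node, GenerateCaption, "/GenerateCaption")
    segment_client = ActionClient(node, SegmentAllItems, "/SegmentAllItems")

    # Step 1: compile the user's labels into a descriptive caption with GPT-4o.
    caption_goal = GenerateCaption.Goal()
    caption_goal.input_labels = ["blueberries", "carrots", "strawberries", "green grapes"]
    caption_result = call_action(node, caption_client, caption_goal)

    # Step 2: feed the caption to the GroundingDINO + SAM pipeline.
    segment_goal = SegmentAllItems.Goal()
    segment_goal.caption = caption_result.caption
    segment_result = call_action(node, segment_client, segment_goal)

    # Each label is paired with the mask at the same index.
    for label, mask in zip(segment_result.item_labels, segment_result.detected_items):
        node.get_logger().info(f"{label}: confidence {mask.confidence:.2f}")

    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```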

Testing Semantic Labeling on Different Images

An important part of testing is experimenting with various above-plate images that contain different food items and analyzing the performance of these perception nodes on them. To change the image used by the feeding web app dummy nodes (and hence, the one that gets passed as input to each of the action servers), follow these steps:

  • Open the file feeding_web_app_ros2_test/launch/feeding_web_app_dummy_nodes_launch.xml in your favorite IDE:
  • Pick an rgb image and its corresponding depth image from the folder feeding_web_app_ros2_test/data
  • Change the image file names in the rgb_path and depth_path launch arguments accordingly
  • Then follow the isolated tests detailed above to test semantic labeling on these images

Before opening a pull request

  • Format your code using black formatter python3 -m black .
  • Run your code through pylint and address all warnings/errors. The only warnings that are acceptable to leave unaddressed are TODOs that should be handled in a future PR. From the top-level ada_feeding directory, run: pylint --recursive=y --rcfile=.pylintrc ..

Before Merging

  • Squash & Merge

sriramk117 and others added 30 commits August 6, 2024 21:06
@sriramk117 sriramk117 self-assigned this Nov 9, 2024
@sriramk117 sriramk117 changed the title Sriramk/semantic labeling Semantic Labeling Nov 12, 2024
@jjaime2 left a comment:

Looking very good overall. Could likely use some cleanup, but the core functionality seems to make sense. I would also check for pylint warnings if you haven't, since I noticed many macros in the old code that this PR was based on.

Comment on lines +5 to +7
# Environment Variables file
.env


I don't see a .env file added in this PR, but I'm guessing this was more for personal use. I'd recommend omitting this change unless it's relevant for the functionality of the PR.


I noticed more references to a .env file later in the code; where exactly does this come into play?

Contributor Author:

I'm adding environment variable functionality to our codebase so we can privately store API keys without exposing them publicly on GitHub. In this particular case, it is for accessing the PRL OpenAI API key to invoke GPT-4o.

Contributor Author:

I'm assuming this may come in handy later on as well if we power perception w/ foundation models in the future.

Comment on lines +4 to +5
# The list of input semantic labels for the food items on the plate
string caption

The comment seems misleading; I suspect this was an old comment for item_labels.

@@ -2,5 +2,8 @@ pyrealsense2
overrides
sounddevice
scikit-spatial
openai
python-dotenv

Related to another comment on .gitignore. Are we using .env files for something?

ada_feeding_perception/config/republisher.yaml (outdated; resolved)
boxes_xyxy[phrase].append([x0, y0, x1, y1])

# Measure the elapsed time running GroundingDINO on the image prompt
inference_time = int(round((time.time() - inference_time) * 1000))

Is this meant to be logged?

Contributor:

Also, why are you multiplying by 1000?

In general, I find it more readable to specify the unit of measurement after the variable name itself, e.g., in this case maybe it should be inference_time_ms given the factor of 1000?
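
A tiny sketch of the suggested naming (the separate start variable is assumed; the diff reuses one variable for both):

```python
import time

start_time = time.time()
# ... run GroundingDINO on the image prompt ...
inference_time_ms = int(round((time.time() - start_time) * 1000))  # elapsed time in milliseconds
```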

Comment on lines +976 to +977
center_x = (bbox[0] + bbox[2]) // 2
center_y = (bbox[1] + bbox[3]) // 2

For this and other instances of bbox, does it have any properties like xmin, xmax, ymin, ymax to make this easily understandable?

Contributor:

You could consider using a namedtuple for this.
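
A sketch of the namedtuple suggestion; the field names and placeholder values are assumptions, not this PR's actual code:

```python
from collections import namedtuple

# Named fields make the xyxy convention explicit at each use site.
BoundingBox = namedtuple("BoundingBox", ["xmin", "ymin", "xmax", "ymax"])

bbox = BoundingBox(xmin=120, ymin=80, xmax=260, ymax=210)  # placeholder values
center_x = (bbox.xmin + bbox.xmax) // 2
center_y = (bbox.ymin + bbox.ymax) // 2
```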

mask_msg.average_depth = median_depth_mm / 1000.0
mask_msg.item_id = item_id
mask_msg.object_id = object_id
mask_msg.confidence = float(score)

Old code for SegmentFromPoint didn't require a type-cast to float; it's fine if this is needed, I just wanted to check since I noticed the discrepancy.

Contributor:

Also, these small changes are precisely why it's best not to copy-paste code (e.g., if casting to float improves the code here, we should ideally percolate it back to SegmentFromPoint). So this is one more of the functions that should be in a helper.

Another thing you can consider is making SegmentAllItems inherit from the SegmentFromPoint class, and only override some functions.

result: The result message containing masks for all food items detected in the image
paired with semantic labels.
"""
self._node.get_logger().info("Received a new goal!")

Should mention the goal_handle here

Comment on lines 1244 to 1245
self._node.get_logger().info("Goal not cancelled.")
self._node.get_logger().info("VIsion pipeline completed successfully.")

Also known as "Goal succeeded", also typo in "VIsion"

---
# A sentence caption compiling the semantic labels used as a query for
# GroundingDINO to perform bounding box detections.
string caption
Contributor:

Nit: add newlines at the end of files. (I know not all files have it, but in general it is a best practice so we should enforce it on new/modified files)

ada_feeding_perception/config/republisher.yaml (outdated; resolved)
# A boolean to determine whether to visualize the bounding box predictions
# made by GroundingDINO
viz_groundingdino: false

Contributor:

Nit: newline.

I think your empty line at the end of the file has some whitespace, which is why GitHub doesn't recognize it as the empty line at the end of the file.

Contributor:

Nit: does this belong in model or config? This seems more like configuration. I believe what is downloaded to model are the actual files storing model weights.

# Third-party imports
import cv2
import time
import random
Contributor:

Also, some of these, like time and random, are "standard imports." pylint should help take care of some of these issues. Eventually, I want to set up pre-commit on this repo so reformatting and some level of linting happen automatically.

if self.viz_groundingdino:
self.visualize_groundingdino_results(image, bbox_predictions)

# Collect the top contender mask for each food item label detected by
Contributor:

I'd suggest having a parameter to the function (set by default to 1) that controls how many of the top masks you use per food item
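
A hypothetical helper illustrating this suggestion; the data layout (a label mapped to a list of scored masks) and all names are assumptions, not this PR's actual code:

```python
def collect_top_masks(masks_per_label, num_masks_per_item=1):
    """Keep the num_masks_per_item highest-scoring masks for each food label."""
    return {
        label: sorted(masks, key=lambda m: m["score"], reverse=True)[:num_masks_per_item]
        for label, masks in masks_per_label.items()
    }

# Example with placeholder scores:
top_masks = collect_top_masks(
    {"grapes": [{"score": 0.91}, {"score": 0.73}], "carrots": [{"score": 0.88}]},
    num_masks_per_item=1,
)
```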

result.item_labels = item_labels

# Measure the elapsed time running GroundingDINO on the image prompt
inference_time = int(round((time.time() - inference_time) * 1000))
Contributor:

Same comment as above re. ms vs sec and adding units to variable names

caption = goal_handle.request.caption

# Create a rate object to control the rate of the vision pipeline
rate = self._node.create_rate(self.rate_hz)
Contributor:

I recently learned that rates live forever in a node unless explicitly deleted by destroy_rate (see comment here). So I'd recommend adding that at the end of this function, else the rate will keep taking up callback resources even after the action has finished.
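
A minimal sketch of the suggested cleanup, mirroring the surrounding diff (variable names are taken from this diff; the try/finally placement is an assumption):

```python
rate = self._node.create_rate(self.rate_hz)
try:
    while (
        rclpy.ok()
        and not goal_handle.is_cancel_requested
        and not vision_pipeline_task.done()
    ):
        rate.sleep()
finally:
    # Rates persist in the node until explicitly destroyed, so release it here.
    self._node.destroy_rate(rate)
```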

while (
rclpy.ok()
and not goal_handle.is_cancel_requested
and not vision_pipeline_task.done()
Contributor:

Are there cases where the vision pipeline could hang? If so, I'd recommend adding a timeout to the action (maybe in the action message itself) to be robust to that

result.status = result.STATUS_CANCELLED

# Clear the active goal
with self.active_goal_request_lock:
Contributor:

Why not do this in the cleanup function itself? (both here and below)
