Researchers from Stanford Introduce RT-Sketch: Elevating Visual Imitation Learning Through Hand-Drawn Sketches as Goal Specifications

Researchers have introduced hand-drawn sketches as a previously unexplored modality for specifying goals in visual imitation learning. Sketches strike a balance between the ambiguity of natural language and the over-specification of images, and users can draw them quickly to convey a task objective. The research proposes RT-Sketch, a goal-conditioned manipulation policy that takes a hand-drawn sketch of the desired scene as input and generates the corresponding actions. Trained on paired trajectories and synthetic sketches, RT-Sketch performs robustly across a range of manipulation tasks and outperforms language-based agents in scenarios with ambiguous goals or visual distractions.

The study reviews existing approaches to goal-conditioned imitation learning, focusing on conventional goal representations such as natural language and images. It covers previous research that relies on language or images for goal conditioning, as well as multimodal approaches that combine both, and it highlights the limitations of these representations to motivate sketches as a more abstract yet still precise alternative. It also surveys ongoing work on converting images to sketches and discusses how image-to-sketch conversion can be used for hindsight relabeling: the terminal image of each demonstration is converted into a sketch that then serves as the goal for that demonstration.
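The hindsight relabeling idea can be illustrated with a minimal sketch. The helper names below (`toy_image_to_sketch`, `relabel_with_sketch_goal`) are hypothetical, and the edge-threshold converter is only a stand-in for a learned image-to-sketch network; it is not the method used in the study.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image, values in [0, 1]


def toy_image_to_sketch(image: Image, threshold: float = 0.2) -> Image:
    """Stand-in for a learned image-to-sketch network (an assumption).

    Marks a pixel as a stroke when it differs strongly from its right
    or lower neighbor, which gives a crude line drawing.
    """
    height, width = len(image), len(image[0])
    sketch = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            right = abs(image[r][c] - image[r][c + 1]) if c + 1 < width else 0.0
            down = abs(image[r][c] - image[r + 1][c]) if r + 1 < height else 0.0
            if max(right, down) > threshold:
                sketch[r][c] = 1.0
    return sketch


def relabel_with_sketch_goal(
    trajectory: List[Tuple[Image, str]],
    to_sketch: Callable[[Image], Image],
) -> List[Tuple[Image, Image, str]]:
    """Attach a sketch of the final observation to every step as its goal."""
    if not trajectory:
        return []
    final_observation = trajectory[-1][0]
    goal_sketch = to_sketch(final_observation)
    return [(obs, goal_sketch, action) for obs, action in trajectory]


# A two-step demonstration: an empty scene, then an object on the right.
demo = [
    ([[0.0, 0.0], [0.0, 0.0]], "reach"),
    ([[0.0, 1.0], [0.0, 1.0]], "place"),
]
relabeled = relabel_with_sketch_goal(demo, toy_image_to_sketch)
```

Every step in `relabeled` now carries the same goal sketch, so a goal-conditioned policy can be trained on (observation, goal, action) triples without anyone drawing a sketch by hand for each demonstration.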

The work points out the drawbacks of both conventional options: natural language commands can be imprecise, while goal images tend to be overly detailed and hard to generalize from. It proposes hand-drawn sketches as a promising alternative for specifying goals in visual imitation learning, since they are more specific than language and help disambiguate task-relevant objects. Sketches are also easy for users to provide and can be integrated into existing policy architectures. The resulting goal-conditioned policy, RT-Sketch, takes a hand-drawn sketch of the desired scene as input and produces the corresponding actions.

RT-Sketch is a manipulation policy that takes hand-drawn scene sketches as input and is trained on a dataset of paired trajectories and synthetic goal sketches. It modifies the original RT-1 policy by removing the FiLM language tokenization and instead concatenating the goal image or sketch with the image history as input to EfficientNet. Training uses behavioral cloning, minimizing the negative log-likelihood of the demonstrated actions given the observations and the sketch goal. An image-to-sketch generation network augments the RT-1 dataset with goal sketches for RT-Sketch training. The study evaluates how well RT-Sketch handles sketches of varying detail, including free-hand, line, and colorized representations.
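The two ingredients described above, concatenating the goal with an observation and training with a behavioral cloning loss, can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual RT-Sketch code: it assumes the goal is stacked along the channel axis and that actions are discretized into bins per dimension so the loss is a categorical negative log-likelihood. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Pixel = List[float]        # channel values for one pixel
Image = List[List[Pixel]]  # height x width x channels


def concat_channels(frame: Image, goal: Image) -> Image:
    """Stack the goal's channels after the frame's channels, pixel by pixel."""
    same_shape = len(frame) == len(goal) and all(
        len(frame_row) == len(goal_row) for frame_row, goal_row in zip(frame, goal)
    )
    if not same_shape:
        raise ValueError("frame and goal must share height and width")
    return [
        [frame_px + goal_px for frame_px, goal_px in zip(frame_row, goal_row)]
        for frame_row, goal_row in zip(frame, goal)
    ]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one action dimension."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def action_nll(
    logits_per_dim: Sequence[Sequence[float]],
    target_bins: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the demonstrated action bins."""
    total = 0.0
    for logits, target in zip(logits_per_dim, target_bins):
        total -= log_softmax(logits)[target]
    return total / len(target_bins)


# One RGB pixel of observation plus one RGB pixel of goal sketch: 6 channels.
stacked = concat_channels([[[0.1, 0.2, 0.3]]], [[[1.0, 1.0, 1.0]]])

# A policy that is confident in the demonstrated bin has a lower loss
# than one that spreads probability uniformly.
uniform_loss = action_nll([[0.0, 0.0, 0.0, 0.0]], [2])
confident_loss = action_nll([[0.0, 0.0, 5.0, 0.0]], [2])
```

Minimizing `action_nll` over the relabeled dataset is equivalent to maximizing the likelihood of the demonstrated actions, which is the behavioral cloning objective.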

The study shows that, in simple scenarios, RT-Sketch performs on par with agents conditioned on images or language. Its ability to reach goals specified only by hand-drawn sketches is especially noteworthy. RT-Sketch is also more robust than language-conditioned agents when goals are ambiguous or the scene contains visual distractions. The assessment measures spatial precision using pixel-wise distance, and semantic and spatial alignment using human ratings on a 7-point Likert scale. The study also acknowledges limitations: generalization to sketches drawn by different users still needs to be tested, and the policy occasionally executes the wrong skill.
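The two evaluation measures can be made concrete with a small sketch. The exact definition of pixel-wise distance is not given here, so the code assumes one plausible reading: the root-mean-square distance, in pixels, between where objects ended up and where the goal placed them. The function names and sample values are hypothetical, not results from the study.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in pixels


def centroid_rmse(achieved: Sequence[Point], goal: Sequence[Point]) -> float:
    """Root-mean-square pixel distance between achieved and goal positions."""
    if not goal or len(achieved) != len(goal):
        raise ValueError("need the same, non-zero number of points")
    squared = [
        (ax - gx) ** 2 + (ay - gy) ** 2
        for (ax, ay), (gx, gy) in zip(achieved, goal)
    ]
    return math.sqrt(mean(squared))


def mean_likert(ratings: List[int], scale_max: int = 7) -> float:
    """Average human rating after checking each one is on the scale."""
    if any(r < 1 or r > scale_max for r in ratings):
        raise ValueError("rating outside the Likert scale")
    return mean(ratings)


# Illustrative values only.
single_object_error = centroid_rmse([(3, 4)], [(0, 0)])
average_rating = mean_likert([7, 5, 6])
```

A lower `centroid_rmse` means objects landed closer to their intended positions, while a higher mean rating means human raters judged the final scene to match the goal more closely.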

In conclusion, RT-Sketch, a goal-conditioned manipulation policy that uses hand-drawn sketches, performs comparably to established language-conditioned and goal-image-conditioned policies across various manipulation tasks. It is more resilient to visual distractions and goal ambiguity. Its versatility shows in its ability to interpret sketches of varying specificity, from simple line drawings to intricate, colored depictions. Future research may extend hand-drawn illustrations to more structured representations, such as schematics or diagrams, for assembly tasks.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



Hello, my name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.

