Can Machine Learning Teach Robots to Understand Us Better? This Microsoft Research Paper Introduces Language Feedback Models for Advanced Imitation Learning
The challenges in developing instruction-following agents in grounded environments include sample efficiency and generalizability. These agents must learn effectively from only a few demonstrations while performing successfully in new environments with novel instructions after training. Techniques like reinforcement learning and imitation learning are commonly used, but their reliance on trial and error or expert guidance often demands numerous environment interactions or costly expert demonstrations.
In language-grounded instruction following, an agent receives an instruction and partial observations of the environment and takes actions accordingly. In reinforcement learning, the agent learns from environment rewards; in imitation learning, it mimics expert actions. Behavioral cloning is an offline form of imitation learning: it trains the policy on a fixed set of expert demonstrations rather than querying the expert online, which helps with long-horizon tasks in grounded environments. Recent studies demonstrate that pretrained large language models (LLMs) display sample-efficient learning via prompting and in-context learning across textual and grounded tasks, including robotic control. Nonetheless, existing LLM-based methods for instruction following in grounded scenarios depend on the LLM online during inference, which is impractical and expensive.
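To make the setup concrete, here is a minimal behavioral-cloning sketch in the instruction-following setting. The demonstrations, action space, and bag-of-words encoder are hypothetical placeholders for illustration, not details from the paper.

```python
# Minimal behavioral cloning: train a policy to reproduce expert actions
# given (instruction, observation) pairs. All data below is a toy placeholder.
import torch
import torch.nn as nn

# Each demonstration pairs an (instruction, observation) context with the
# expert's action index.
demos = [
    ("pick up the red block", "you see: red block, blue bowl", 0),
    ("open the drawer",       "you see: closed drawer",        1),
]

vocab = {w: i for i, w in enumerate(sorted({w for t, o, _ in demos for w in (t + " " + o).split()}))}
n_actions = 2

def encode(text):
    # Bag-of-words encoding; a real agent would use a pretrained LM encoder.
    x = torch.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            x[vocab[w]] += 1.0
    return x

policy = nn.Sequential(nn.Linear(len(vocab), 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    for instr, obs, action in demos:
        logits = policy(encode(instr + " " + obs))
        loss = loss_fn(logits.unsqueeze(0), torch.tensor([action]))
        opt.zero_grad()
        loss.backward()
        opt.step()
```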
Researchers from Microsoft Research and the University of Waterloo have proposed Language Feedback Models (LFMs) for policy improvement in instruction following. LFMs leverage LLMs to provide feedback on agent behavior in grounded environments, identifying which actions are desirable. By distilling this feedback into a compact LFM, the technique enables sample-efficient and cost-effective policy improvement without continuous reliance on LLMs. LFMs generalize to new environments and offer interpretable feedback, allowing humans to validate the imitation data.
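A hedged sketch of how such a compact feedback model might be queried at policy-improvement time is shown below. The prompt format and the yes/no output convention are illustrative assumptions, not the paper's exact protocol.

```python
# Query a distilled feedback model: given (instruction, observation, action),
# decide whether the action is desirable. Prompt wording is an assumption.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-large")  # roughly the 770M scale
lfm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def is_desirable(instruction: str, observation: str, action: str) -> bool:
    prompt = (
        f"Instruction: {instruction}\n"
        f"Observation: {observation}\n"
        f"Action taken: {action}\n"
        "Does this action make progress toward the instruction? Answer yes or no."
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lfm.generate(ids, max_new_tokens=3)
    return tok.decode(out[0], skip_special_tokens=True).strip().lower().startswith("yes")
```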
The proposed method introduces LFMs to enhance policy learning for instruction following. An LLM is used to identify productive behavior from a base policy, and this signal drives batched imitation learning for policy improvement. By distilling world knowledge from LLMs into compact LFMs, the approach achieves sample-efficient and generalizable policy enhancement without continuous online interactions with expensive LLMs during deployment. Instead of querying the LLM at each step, the procedure collects LLM feedback in batches over long horizons, yielding a cost-effective language feedback model.
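The batching idea can be sketched as follows; the environment interface and the `query_llm` helper are hypothetical stand-ins for whatever APIs an implementation would use, under the assumption that one LLM call labels a whole trajectory at once.

```python
# Batched feedback collection: instead of querying the LLM at every step,
# roll out the base policy, then ask the LLM once per episode which steps
# were productive. The labeled steps become imitation-learning data.
def collect_feedback_batched(env, policy, query_llm, n_episodes=8):
    labeled = []
    for _ in range(n_episodes):
        instruction, obs = env.reset()          # hypothetical env API
        steps = []
        done = False
        while not done:
            action = policy.act(instruction, obs)
            steps.append((obs, action))
            obs, done = env.step(action)
        # One LLM call covers the whole trajectory, amortizing the cost
        # over the long horizon.
        prompt = f"Instruction: {instruction}\n" + "\n".join(
            f"Step {i}: saw '{o}', did '{a}'" for i, (o, a) in enumerate(steps)
        ) + "\nList the step numbers that made progress."
        productive = query_llm(prompt)           # e.g. returns {0, 2, 5}
        labeled += [(instruction, o, a) for i, (o, a) in enumerate(steps) if i in productive]
    return labeled  # desirable (instruction, observation, action) triples
```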
For the experiments, the researchers used GPT-4 as the LLM for action prediction and feedback, and fine-tuned the 770M-parameter FLAN-T5 to obtain the policy and feedback models. Using LLM-derived labels, LFMs identify productive behavior and improve policies without continual LLM interactions. LFMs outperform direct LLM usage, generalize to new environments, and provide interpretable feedback, offering a cost-effective means of policy improvement that also fosters user trust. Overall, LFMs significantly improve policy performance, demonstrating their efficacy in grounded instruction following.
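As a rough illustration of the distillation step, the loop below fine-tunes a FLAN-T5 checkpoint on (prompt, verdict) pairs. The checkpoint name, prompt format, and training data are assumptions made for the sketch; `google/flan-t5-large` is approximately the 770M scale mentioned, but the paper's exact training setup may differ.

```python
# Distill LLM feedback into a compact FLAN-T5 feedback model by fine-tuning
# on (prompt, verdict) pairs. The single data example is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Each example: a feedback prompt and the teacher LLM's yes/no verdict.
data = [
    ("Instruction: open the drawer\nAction taken: walk to drawer\n"
     "Does this make progress? Answer yes or no.", "yes"),
]

model.train()
for prompt, verdict in data:
    batch = tok(prompt, return_tensors="pt")
    labels = tok(verdict, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss   # seq2seq cross-entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
```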
In conclusion, researchers from Microsoft Research and the University of Waterloo have proposed Language Feedback Models. LFMs excel at identifying desirable behavior for imitation learning across various benchmarks, surpassing baseline methods and LLM-based expert imitation learning without continual LLM usage. LFMs generalize well, offering significant policy adaptation gains in new environments, and they provide detailed, human-interpretable feedback that fosters trust in imitation data. Future research could explore leveraging detailed LFMs for RL reward modeling and creating trustworthy policies with human verification.
Check out the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.