GLIDE-RL: Grounded Language Instruction through DEmonstration in RL

AIR has been working on human-AI collaborative systems since its inception in 2017. A critical capability in the development of such complex systems is natural language communication. What if you could give a task to a robot or AI assistant in natural language instead of through limited, structured interaction channels? Or what if you could do both, combining simple and unambiguous goal setting with richer, more nuanced natural language instructions or constraints?

We believe this is a natural evolution of human-AI collaboration. This work, to be presented at AAMAS 2024, is a key step towards that vision: it focuses on training efficient reinforcement learning (RL) agents to follow natural language instructions.

Advances such as curriculum learning, continual learning, and language models have all independently contributed to the effective training of grounded agents in various environments. Leveraging these developments, and building upon our previous work on curriculum learning in teacher-student settings, we present a novel algorithm, Grounded Language Instruction through DEmonstration in RL (GLIDE-RL), that introduces a teacher-instructor-student curriculum learning framework. This three-agent setting lets an RL agent learn to follow natural language instructions and to generalize to new tasks and even to novel language instructions.

As depicted in the figure, the GLIDE-RL algorithm has three independently functioning parts: the teacher, the instructor, and the student. (1) The teacher acts in the environment but, instead of being rewarded directly for its own behavior, is rewarded based on the performance of the student. (2) The instructor observes the teacher's actions, describes them as events, and converts those events into instructions. (3) The student is a goal-conditioned agent that strives to reach the goals set by the teacher, as described and instructed by the instructor.
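To make the interplay concrete, here is a schematic sketch of one GLIDE-RL training cycle. This is not the paper's implementation; the agent classes and method names (`run_episode`, `to_instruction`, `update`) are hypothetical placeholders meant only to show how the three roles fit together.

```python
# Schematic of one GLIDE-RL training cycle; all class and method names
# below are illustrative placeholders, not the actual implementation.
def glide_rl_cycle(env, teacher, instructor, student):
    # 1. The teacher acts in the environment and triggers a sequence of events.
    events = teacher.run_episode(env)                 # e.g. ["A red key is picked", ...]

    # 2. The instructor describes each event and turns it into an instruction.
    instructions = [instructor.to_instruction(e) for e in events]

    # 3. The student, conditioned on those instructions, tries to trigger
    #    the same events in the same order.
    success = student.run_episode(env, goals=instructions)

    # 4. Rewards are adversarial: the student is rewarded for succeeding,
    #    the teacher for setting goals the student cannot yet reach.
    student.update(reward=1.0 if success else -1.0)
    teacher.update(reward=-1.0 if success else 1.0)
```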

Walkthrough of the Algorithm

The objective of the teacher agent is to set goals for the student agent that aren't too easy to achieve. It needs to set increasingly harder goals (i.e., a curriculum) for the student agent, and it adapts its behavior (actions in the environment) based on the student's performance in the previous episode. At each time step of its episode, the teacher sees a partial observation of the environment (in our BabyAI experiments, a 7x7 grid) and executes an action according to its policy.

Side note: The BabyAI research platform was introduced in 2018 to support investigations into grounded language learning with humans in the loop. The platform comprises an extensible suite of 19 levels of increasing difficulty. We use the 'BossLevel' environment, the most complex in the suite, for all our experiments. The environment consists of 9 rooms, each of size 6x6. There are four types of objects (ball, box, key, and door), and each object has a color: red, green, blue, grey, purple, or yellow. At the beginning of every episode, the agent and all objects are spawned in random positions.
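For readers who want to poke at the environment themselves, a minimal sketch of loading BossLevel is shown below. It assumes the Minigrid package, which now ships the BabyAI levels under Gymnasium; the original babyai package exposes the same level ID through classic Gym, so the imports may differ depending on your setup.

```python
# Minimal sketch of instantiating the BossLevel environment, assuming the
# Minigrid package; the original babyai package registers the same level ID.
import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the BabyAI-* environments)

env = gym.make("BabyAI-BossLevel-v0")
obs, info = env.reset(seed=0)
print(obs["mission"])       # the level's built-in instruction, e.g. "pick up the red key"
print(obs["image"].shape)   # (7, 7, 3): the agent's partial, egocentric view of the grid
```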

Along its trajectory, the teacher can trigger multiple 'events' (e.g., picking up an object, opening a door). These events are described in natural language by the instructor agent (e.g., "A red key is picked") and then converted into an instruction (e.g., "pick up the red key") that the student agent can be conditioned and trained on. We also use an LLM (ChatGPT-3.5) to generate multiple synonymous instructions (e.g., "Lift the maroon key", "Grab the crimson key") for each instruction, so that the student agent can learn to generalize to unseen language instructions. After the teacher's episode finishes, the student agent begins its own episode, conditioned on these natural language goals, and tries to trigger the same events in the exact same sequence. If the student reaches a goal (triggers the event), it is rewarded and the teacher is penalized; if it fails, the student is penalized and the teacher is rewarded. The teacher is thus incentivized to trigger events that are incrementally harder for the student to reproduce.
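The paraphrasing step could be implemented along the following lines. This is an illustrative sketch using the OpenAI chat completions API; the exact prompt, parsing, and helper name are assumptions for the sake of example, not what was used in the experiments.

```python
# Illustrative sketch of generating synonymous instructions with an LLM.
# The prompt, parsing, and helper name are assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def paraphrase(instruction: str, n: int = 50) -> list[str]:
    prompt = (
        f"Rewrite the instruction '{instruction}' in {n} different ways, "
        "one short synonymous instruction per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

synonyms = paraphrase("pick up the red key")
# e.g. ["Lift the maroon key", "Grab the crimson key", ...]
```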

Demonstration

The animation above shows a trained agent following instructions provided interactively by a human operator in the BabyAI environment. As the environment is fully randomized, the agent often has to explore its surroundings in order to fulfill an instruction.

Results

A student's episode is considered successful if it triggers all the events that the teacher triggered, in the exact same sequence. To enable better generalization of the student agent, we convert each event description into an instruction and generate 50 synonymous instructions for it. Ten percent of these instructions are never used in training; they are held out as a test set to measure generalization.
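As a rough illustration of this split (the helper below is hypothetical), holding out 10 percent of each event's synonyms might look like this:

```python
# Hypothetical sketch: hold out 10% of each event's synonymous instructions
# as a test set; the remaining 90% are used during training.
import random

def split_synonyms(synonyms, holdout_frac=0.1, seed=0):
    shuffled = list(synonyms)            # e.g. the 50 paraphrases of one instruction
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train_goals, test_goals = split_synonyms(
    ["Lift the maroon key", "Grab the crimson key",
     "Pick the red key up", "Take the scarlet key"]
)
```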

To convert these natural language instructions into language embeddings, we use an off-the-shelf language model (all-distilroberta-v1).
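As a quick sketch, encoding goals with this checkpoint through the sentence-transformers library (where it is published) looks like the following; the batch of goals is just an example.

```python
# Sketch: embedding instruction goals with the off-the-shelf model named above.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-distilroberta-v1")
goals = ["pick up the red key", "grab the crimson key"]
goal_embeddings = encoder.encode(goals)   # numpy array of shape (2, 768)
```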

Our main goal is to test whether the student can succeed on new instructions that are synonymous with the ones seen during training.

Sidenote: while synonymous instructions are seen during training, the actions the student agent needs to take to complete them are different: the positions of the objects and the agent's starting position are generated randomly for the test set and are not used during training. The sequence of events the student must accomplish is also not seen during training.

In the figure above, the x-axis denotes the number of training rollouts (episodes) and the y-axis denotes the success rate. The performance of the student agent trained with 4 teachers using ‘glide’ (in blue) is close to the upper-bound performance of the ‘one-hot’ baseline agent (in dashed pink) and significantly better than the other agents (e.g., those trained with random teachers, without the behavior cloning loss, or the no-teacher baseline evaluated on the test set).

The baseline agent is a student conditioned on one-hot goals (‘one-hot’ in the figure above). The teachers' functionality doesn't change here, but instead of receiving language embeddings from a language model as inputs, the student receives a pre-designed one-hot encoding for each event. There is no notion of synonymous goals: the events triggered by the teacher are directly converted to a one-hot encoding and sent to the agent. This baseline gives us an estimate of the upper bound of the achievable success rate. Note also that, with one-hot encodings, the agent has no generalization capabilities, as the size of the encoding cannot be increased to accommodate unseen goals. We then train the GLIDE-RL student agent (conditioned on natural language instructions) with 4 teachers. As the figure shows, its performance is close to the upper bound (the baseline agent). Unlike the baseline agent, which by definition has no generalization capabilities, the GLIDE-RL agent is able to generalize across a wide range of synonymous instructions.
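The contrast between the two goal representations can be sketched as follows; the event list and function names are illustrative, and the encoder is the same sentence-transformers model as in the earlier sketch.

```python
# Illustrative contrast between the one-hot baseline's goal encoding and
# GLIDE-RL's language-embedding goals; the event list is a made-up example.
import numpy as np
from sentence_transformers import SentenceTransformer

EVENTS = ["pick up the red key", "open the blue door", "drop the green ball"]
encoder = SentenceTransformer("all-distilroberta-v1")

def one_hot_goal(event: str) -> np.ndarray:
    vec = np.zeros(len(EVENTS), dtype=np.float32)
    vec[EVENTS.index(event)] = 1.0   # fixed size: unseen or reworded goals cannot be encoded
    return vec

def language_goal(instruction: str) -> np.ndarray:
    return encoder.encode([instruction])[0]   # any phrasing maps into the same 768-d space
```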

The figure also highlights how teachers (and curricula) improve a student's learning performance. We introduced the ‘one-hot-no-teachers-testset’ baseline to understand how challenging the task is without a curriculum set by the teachers. It uses the same one-hot encoding as before and is trained directly on the test set of environments and tasks. Even with that advantage, it fails to perform well (measured in terms of success rate). Furthermore, we see that the student with goals conditioned on language embeddings performs comparably to the one with one-hot goals. Finally, with the ‘glide-no-bcl’ baseline, we establish the importance of the behavioral cloning loss during training: without it, the student's performance is only as good as the scenario with random teachers (‘glide-random-teachers’), that is, with a random curriculum.
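For intuition, the behavioral cloning term can be thought of as an extra supervised loss on the teacher's demonstrated actions, along these lines; the weighting and function signature below are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of adding a behavioral cloning (BC) term to the student's loss;
# the weight and function signature are illustrative assumptions.
import torch.nn.functional as F

def student_loss(policy_logits, demo_actions, rl_loss, bc_weight=0.5):
    # Cross-entropy pulls the student's policy toward the actions the teacher
    # took when it triggered the demonstrated events.
    bc_loss = F.cross_entropy(policy_logits, demo_actions)
    return rl_loss + bc_weight * bc_loss
```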

Applications

While the results were demonstrated in the BabyAI domain, we have shown that the agent can follow unseen natural language instructions in unseen environments (with randomized starting positions, object positions, and sequences of goals). These abilities suggest realistic applications, for example in domains where AI Redefined already operates, such as renewable energy or simulation training.

Battery energy storage system (BESS) operators oversee plans for buying and selling power and for charging and discharging the batteries corresponding to each power plant. To assist operators in creating such plans, AIR leverages reinforcement learning agents trained to maximize profits. GLIDE-RL would let the operator guide the agent's actions in natural language, for example: "assume demands on your system will be 1.2 times higher than usual between 20h00 and 23h00".

Another potential application we are looking into is in the context of Cogment Adaptive Learning, AIR's product dedicated to personalized (human) student training in simulation. A director agent generates scenarios tailor-made for a specific student; the GLIDE-RL approach could be leveraged to let instructors (or even students) issue instructions in natural language of the form "change the scenario to test the abilities of the student in landing with crosswind."

These early examples show how such an approach can provide interfaces that are very expressive without requiring complex user interface components.

What’s next?

In addition to the broad range of applications, there are a few directions we are excited about in pushing this research forward:

  1. How confident is the agent that it can carry out a given natural language instruction? This is particularly challenging, as the agent first has to thoroughly understand natural language and then learn from very sparse feedback on whether a given instruction is doable.

  2. Grounding the LLM with embodied-agent feedback: the student agent trained using GLIDE-RL can be used to finetune and ground LLMs with embodied-agent feedback, i.e., rewards from the environment based on the actions of the student agent, which in this setting acts on natural language instructions coming from the LLM. This new alternative to RLHF and RLAIF could benefit from much richer and more relevant feedback, providing LLMs with contextual tools and abilities by grounding them in real-world and human constraints (e.g., through simulations and human interaction).

Conclusion

In this work, we trained an agent to follow natural language instructions. Such embodied agents are being used as part of ongoing research to finetune LLMs and ground them in simulated environments such as BabyAI.

We envision grounded natural language interaction as a key enabler for rich and trustworthy human-AI collaboration in digital twins and real-time physical environments. We are exploring applications with our current partners in the aerospace, defense, and renewable energy sectors, and we are actively looking for partners to accelerate the development towards this exciting new goal. Get in touch if you would like to know more, and visit our website to learn how we use reinforcement learning in various real-world applications and put human-AI collaboration at the forefront of them.

This work was done at AI Redefined with external collaborators Chaitanya Kharyal (Microsoft), Tanmay Kumar Sinha (Microsoft Research), and Prof. Srijita Das (University of Michigan-Dearborn). It will be presented at the AAMAS-24 conference in Auckland in May 2024, and an arXiv version of the paper is available here.
