Training LLMs to reason: Cogment's Integrated Approach to Trustworthy Natural Language Interactions
AIR’s vision combines human expertise with AI capabilities to create "superteams" that augment human decision making. While AI can process vast amounts of data and adapt to new challenges, human intuition and judgment remain crucial for critical decision-making, particularly in complex, unforeseen, and ever-changing situations.
One key aspect of human-AI collaboration is bidirectional interaction: enabling AIs to provide recommendations, respond to queries, supply context, and request clarification, and enabling humans to convey their intent, give instructions, and give feedback to AI agents. While traditional “structured” user interface patterns are the de facto standard and a great way to make communication clear and explicit, with the advent of LLMs, natural language has become a viable option.
Natural language interactions are richer and more nuanced: they can express fuzziness and uncertainty as well as clarity, and they enable us to convey context. In terms of user interface, they enable a text box, or even a microphone, to take the place of complex nested components, moving towards a general UI. This makes them more accessible, enabling a broader audience to effectively engage with machines.
However, even as LLMs have progressed quickly over the last few months, they still fall short in important dimensions. Stanford University’s “The AI Index 2024 Annual Report” notes in its introduction:
Progress accelerated in 2023. New state-of-the-art systems like GPT-4, Gemini, and Claude 3 are impressively multimodal: They can generate fluent text in dozens of languages, process audio, and even explain memes. As AI has improved, it has increasingly forced its way into our lives. Companies are racing to build AI-based products, and AI is increasingly being used by the general public. But current AI technology still has significant problems. It cannot reliably deal with facts, perform complex reasoning, or explain its conclusions.
With the approach presented here, we address the first two of these challenges: dealing reliably with facts and performing complex reasoning.
LLMs Struggle with Dynamic Decision-Making Scenarios
Decision support systems are crucial for providing the right information to determine the best options. While current technologies like document indices and advanced embedding techniques for retrieval have largely solved the problem of querying knowledge through LLMs, challenges remain in dynamic, deeply contextual environments involving causality and reasoning. This is evident when playing chess with ChatGPT, where it can lose track of the board's state or even make illegal moves.
To better understand these limitations, we conducted experiments using the Blocksworld domain, a planning domain commonly used in artificial intelligence research in which a set of blocks is stacked on a table and the agent can rearrange the blocks: stacking one block on another or moving blocks to the table. In these experiments, we assembled a prompt describing the domain, an initial state, and a few legal moves to be applied to the board; we then asked the LLM whether a particular move would be legal (a sketch of such a query follows the results below). Our results, to be fully published soon, show that:
- GPT-3.5 achieves a precision of only 34% when asked whether a legal move is indeed legal, and 75% for illegal moves;
- GPT-4 improves on this, with precisions of 93% and 90%, respectively.
These figures pose significant challenges for real-world enterprise applications, where higher precision is crucial for trustworthy human-AI collaboration.
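For illustration, here is a minimal sketch of the kind of legality-check query we pose, using the OpenAI Python client (v1 API); the domain wording, state, and move below are simplified stand-ins, not our exact experimental prompts.

```python
# A minimal sketch of a Blocksworld legality-check query; the prompt is a
# simplified stand-in for the ones used in our experiments.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = """In Blocksworld, blocks are stacked on a table. A block can be moved
only if nothing is on top of it, either onto the table or onto another block
that has nothing on top of it.

Initial state: C is on A; A is on the table; B is on the table.
Moves applied so far: move C from A to the table.

Is the move "stack B on top of A" legal now? Answer "legal" or "illegal"."""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: "legal"
```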
Cogment’s approach to trustworthy natural language interactions
While LLMs have made unprecedented progress in mastering natural language conversation and expression, they are not general AIs able to reason about the world; as discussed, they are not even good at tracking the state of a simple domain. Cogment embraces a multi-agent approach to designing decision-making systems with a key principle: separate responsibilities into specialized, cooperating agents. In this context, the approach lets the LLM manage language and orchestrates it with other components/agents to actually create the ability to reason. Cogment is uniquely positioned to enable that orchestration of specialized components as it offers:
- A generic API to let agents act in environments, such as digital twins;
- A generic API to dynamically configure and execute trials involving agents acting in an environment towards a given objective;
- A distributed, microservice-based runtime able to scale the execution of trials;
- A datastore to gather interaction data and consume it for analytics or training.
These capabilities combine into the following workflow:
- Natural Language Query: The process begins with a user posing a natural language query to the system. In this example, the query is about the potential impacts of reducing the throughput of a machine. The natural language query is sent to an LLM agent orchestrated by Cogment.
- Trial Rollouts & Results Synthesis: The LLM, understanding the user’s intent and context, then leverages Cogment to configure and run trials. These trials involve a digital twin or simulation of the environment and, potentially, one or more agents that can be configured to act following the user’s intent. Trial trajectories and outcomes are then returned to the LLM in order to formulate a response. The user can ask further questions and provide feedback, for example by applying the recommended tactics.
- Continual Fine-Tuning from Human & Trial Feedback: In this setup, both the human and the trial provide valuable feedback to the agents. Human feedback can express whether the response was satisfactory. Trial feedback can express whether the provided goals were attainable or whether the specified strategy led to achieving the desired outcome. Both signals can be used to tune the LLM to make better use of the digital twin/simulation, i.e. to be grounded, and to give answers that are more aligned with the user’s expectations. A sketch of this query-trial-response loop follows this list.
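To make the workflow concrete, here is a minimal sketch of the loop; `llm_complete`, `start_trial`, and `get_trial_results` are hypothetical stand-ins for an LLM client and for Cogment’s trial APIs, not real signatures.

```python
# A minimal sketch of the query -> trials -> synthesis loop described above.
# llm_complete, start_trial, and get_trial_results are hypothetical stand-ins.
def answer_with_trials(query: str) -> str:
    # 1. Natural Language Query: the LLM turns the user's what-if question
    #    into a structured trial configuration for the digital twin.
    trial_config = llm_complete(
        f"Produce a JSON trial configuration for this what-if question: {query}"
    )
    # 2. Trial Rollouts: Cogment executes the trial against the digital twin,
    #    possibly with agents acting according to the user's intent.
    trial_id = start_trial(trial_config)
    trajectories = get_trial_results(trial_id)
    # 3. Results Synthesis: the LLM formulates a natural language answer
    #    grounded in the raw trial trajectories and outcomes.
    return llm_complete(
        f"Question: {query}\nTrial results: {trajectories}\n"
        "Answer the question using only these results."
    )
```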
With improved traceability between the natural language queries, the executed trials, the raw results, and the natural language responses, this modular approach delivers the trustworthiness necessary for industrial and enterprise applications.
Four pieces of the puzzle and how Cogment addresses them
We identified four puzzle pieces to deliver on this approach, and we are making strides in building and validating them.
Facilitating Human-AI Natural Language Interactions
The first piece of the puzzle is facilitating the interaction between an LLM and a human; this is largely a solved problem. A variety of solutions already exist to host and query large language models (LLMs), ranging from commercial Software as a Service (SaaS) APIs provided by companies like OpenAI, Anthropic, and Cohere, to infrastructure products like Amazon Bedrock, and open-source libraries such as Hugging Face's TGI.
Despite these advancements, there remain significant limitations. Many existing solutions are constrained by the specific LLMs they support or the platforms on which they can be deployed. This can pose challenges for businesses looking to integrate AI into diverse and changing environments. Moreover, while these platforms excel at serving LLMs, they often fail to consider crucial human factors that influence the effectiveness of AI interactions.
To address these limitations, our approach emphasizes bidirectional collaboration patterns. These are reusable interaction components, from UI to data flow, designed to optimize the user experience while enabling the AI system to capture both behavioral and self-reported feedback. Such feedback is invaluable, as it drives continuous improvement and adaptation of AI models to better meet user needs. We are building, upon Cogment, a first version of this vision. It includes several innovative features:
- Hosting: Leveraging Cogment, we implement a conversation orchestration system that manages dialogues involving one or more LLMs as well as other agents to implement, for example, Retrieval Augmented Generation (RAG) strategies. Like any other Cogment-based system, it can be deployed both in cloud environments and on-premises using Kubernetes.
- User Experience (UX):
- Response Streaming: improving the perceived response time by streaming AI agent responses, mimicking natural, fluid human conversation (see the sketch after this list).
- Collaboration Patterns: We have implemented specific patterns for both standard operations and data collection in the API, with matching UI components. These patterns leverage self-reported and behavioral feedback from users, which is crucial for refining AI responses and functionalities.
- Integration Capabilities: Our platform is designed to be highly compatible with major AI models and APIs. This includes integration with Hugging Face's transformers, OpenAI's ChatGPT API, and Anthropic's Claude API. Such integrations allow our users to choose the best AI models suited for their specific needs, ensuring versatility and adaptability.
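As an illustration of response streaming, here is a minimal sketch using the OpenAI Python client (v1 API); the model name and query are placeholders, and any of the providers mentioned above could be swapped in.

```python
# A minimal sketch of response streaming: render tokens as they are generated
# instead of waiting for the full completion.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",  # placeholder model
    messages=[{"role": "user", "content": "What happens if machine M3 runs at 80% throughput?"}],
    stream=True,  # tokens are yielded incrementally
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # display each token as it arrives
```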
The goal is to create a more intuitive, responsive, and effective system that respects and utilizes human feedback to enhance both user experience and AI performance.
Instruction Following & Human Alignment
Fine-tuning LLMs to follow instructions accurately and align with human expectations is fundamental to the development of reliable AI systems; it is the next piece of our puzzle. Techniques such as Instruction Fine-Tuning (IFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have marked significant strides in this domain.
Nevertheless, there remains ample room for improvement, especially in streamlining processes to tailor open-source LLMs to specific business needs. Companies can benefit greatly from customizing these models to understand and interact within the nuanced contexts of their particular industries. Cogment tackles this need by enabling enterprises to harness the expertise of their in-house specialists effectively.
- Interaction Data Management: We provide a unified format for interaction data that is versatile enough to support virtually any fine-tuning process.
- Active Fine-Tuning Campaign Management: By managing active fine-tuning campaigns, we can help reduce both the volume and the cost of collecting the interaction data necessary for fine-tuning, by ensuring that users provide feedback targeted at the “needs” of the in-training model.
- Model Validation: We include a dual approach to validation, combining traditional metrics with human-in-the-loop evaluations, ensuring that the models not only perform well statistically but also meet the practical expectations and standards set by human evaluators.
- Integration Capabilities: Our platform aims to integrate seamlessly with any fine-tuning tool in order to leverage the latest progress in the field. We currently support Hugging Face TRL (Transformer Reinforcement Learning) directly; a minimal sketch follows this list.
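Here is a minimal sketch of preference fine-tuning with TRL's DPOTrainer (API circa TRL v0.7); the model, the tiny inline dataset, and the hyperparameters are illustrative placeholders, not those used in our campaigns.

```python
# A minimal DPO fine-tuning sketch with Hugging Face TRL (circa v0.7).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Interaction data exported in the prompt/chosen/rejected format DPO expects.
train_dataset = Dataset.from_dict({
    "prompt": ["Is moving block B onto block A legal from this state? ..."],
    "chosen": ["No: block A is not clear, block C sits on top of it."],
    "rejected": ["Yes, the move is legal."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,  # strength of the KL penalty toward the reference model
    args=TrainingArguments(output_dir="dpo-out", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```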
We will soon publish an article delving deeper into this project’s architecture and results.
Tooling LLMs to explore strategies and evaluate their impact
Connecting LLMs to external tools, in particular to digital twins, is the third piece of the puzzle. What is called tooling is a four-step process:
- Tool description: Initially, the prompt includes detailed instructions about the available tools and the expected inputs for these tools. We can expose Cogment’s generic APIs to act in any environment or to run trials.
- Structured responses: The LLM is prompted to generate structured responses that clearly indicate the tool to be used. An example of such a prompt can be seen in LangChain’s structured response parser, which structures the output in a specific JSON format to include both the answer and the source URL:
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
    "answer": string  // answer to the user's question
    "source": string  // source used to answer the user's question, should be a website (i.e. a URL address).
}
```
- Execution, Parsing and Orchestration: Once the full prompt is assembled, it can be executed and the LLM’s response parsed. The system then orchestrates the specified tool calls, such as running a Cogment trial or adjusting parameters in an ongoing trial.
- Post-processing and Recursion: The output is post-processed into either natural language or simple structured data, and then fed back into the system as context for further queries. This cycle continues until no further tool calls are required or a direct answer is requested by the system (a sketch of this loop follows). While high-level APIs like the OpenAI API provide tooling out of the box, with other LLMs these steps require some tinkering to assemble the right prompts. This can be improved by better LLMs, or by fine-tuning them to make better use of the tools, a topic covered in the next section.
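The sketch below walks through the four steps using the OpenAI tools API; `run_trial` is a hypothetical stand-in for launching a Cogment trial against a digital twin, and the query and result values are illustrative.

```python
# A minimal sketch of the four-step tooling loop with the OpenAI tools API.
import json
from openai import OpenAI

client = OpenAI()

def run_trial(throughput_factor: float) -> dict:
    # Hypothetical: would configure and execute a Cogment trial against the
    # digital twin and return a summary of the resulting trajectories.
    return {"daily_output": 820, "bottleneck": "assembly station 2"}

tools = [{  # 1. tool description, advertised to the model
    "type": "function",
    "function": {
        "name": "run_trial",
        "description": "Simulate the production line with a modified machine throughput.",
        "parameters": {
            "type": "object",
            "properties": {"throughput_factor": {"type": "number"}},
            "required": ["throughput_factor"],
        },
    },
}]

messages = [{"role": "user", "content": "What happens if machine M3 runs at 80% throughput?"}]
while True:
    # 2. structured responses: the model either answers or requests tool calls
    response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
    message = response.choices[0].message
    if not message.tool_calls:  # 4. recursion ends with a direct answer
        print(message.content)
        break
    messages.append(message)  # keep the structured response in context
    for call in message.tool_calls:  # 3. execution, parsing and orchestration
        result = run_trial(**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
```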
Tooling significantly improves the whole system’s ability to keep track of dynamic environments: early results in our Blocksworld experiment show an increase in precision for GPT-3.5 from 34% without tooling to 87% on legal moves, and from 75% to 80% on illegal moves. Experiments on GPT-4 are still ongoing. The fact that the results are not perfect, while the underlying model of the environment is, shows that there is still a margin for improvement.
Tooling requires the LLM to go from natural language to structured calls; it thus relies on the LLM’s understanding of the domain. To mitigate this limitation, we have explored training the underlying AI agent to take natural language queries directly as parameters. The GLIDE-RL algorithm shows great results in that direction; it will be presented at the upcoming AAMAS conference. Despite these advancements, the effectiveness of LLM tooling hinges on the model's understanding of the environment. This includes making accurate decisions about the format, applicability, and optimality of actions, ideally minimizing the exploration of non-viable options. Therefore, grounding the LLM in a well-defined context is essential; this is the last piece of our puzzle.
Grounding LLMs to understand environment affordances
The process of grounding LLMs to comprehend environmental affordances is a sophisticated extension of earlier methods primarily focused on aligning outputs according to an external source of feedback, usually humans. These algorithms, in particular RLHF, show great practical results; grounding builds upon them for task- and environment-specific tuning.
Recent research, such as "Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning" has begun exploring how fine-tuning LLMs can allow them to function as decision-making policies within various environments. This research, however, did not focus on enhancing the LLM’s ability in areas like question answering or tool usage, which are crucial for broader application.
Our approach, currently in R&D, involves fine-tuning LLMs within environments modeled as Markov Decision Processes (MDPs), with the objective of refining the LLM’s capacity to deliver solutions or perform actions that align with given goals. Here's a concise overview of our methodology:
- Goal Setting: A human or automated system defines a high-level goal for the environment, such as "Solve this maze."
- Information Gathering: The LLM gathers descriptions of the environment in natural language. For example, it might process information like the positions of various objects and the user's location within a maze, alongside the defined task.
- Decision Making: Based on this information, the LLM suggests strategic, high-level actions, for instance directing the agent to navigate to a specific object or use a key to open a door.
- Action Implementation: An agent then executes these suggested actions within the environment, navigating through the maze or interacting with elements as directed.
- Feedback Reception: The environment provides feedback in the form of rewards based on the effectiveness of the actions performed by the agent.
- Iterative Learning: The LLM fine-tunes its strategies based on the reward feedback, adjusting its approach to optimize outcomes.
This cycle is repeated, refining the LLM's actions until the initial goal is achieved. Upon completion, the process resets with a new goal, continuously improving the LLM’s understanding and operational efficiency within the environment.
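The sketch below summarizes this loop; `describe`, `suggest_action`, `execute`, and `fine_tune_on_rewards` are hypothetical stand-ins for the environment description, the LLM policy, the low-level agent, and an RL fine-tuning step (e.g. PPO via Hugging Face TRL).

```python
# A minimal sketch of the grounding loop, with hypothetical helper functions.
def grounding_loop(env, llm, goal: str, max_episodes: int = 100) -> None:
    for _ in range(max_episodes):
        observation = env.reset()                      # 1. goal is set, env starts
        done, episode = False, []
        while not done:
            prompt = describe(goal, observation)       # 2. information gathering
            action = suggest_action(llm, prompt)       # 3. decision making
            observation, reward, done = execute(env, action)  # 4-5. act, get reward
            episode.append((prompt, action, reward))
        fine_tune_on_rewards(llm, episode)             # 6. iterative learning
```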
While this approach is still under development, we are optimistic about its potential to significantly enhance the results of our earlier tooling efforts, leading to more precise and contextually appropriate responses from the LLM in a variety of settings.
Conclusion
The integration of orchestration, tooling, and both alignment and grounding fine-tuning forms the foundation of trustworthy natural language interactions with large language models. Thanks to its existing capabilities and our ongoing research and development, Cogment emerges as a promising platform capable of fostering robust and reliable natural language interactions between humans and AI.
Trustworthy natural language interactions find practical applications across a broad range of industries where dynamic decision-making and complex system adjustments are crucial. For example, in manufacturing, a user might inquire about the implications of changing the parameters of a given machine to prepare for maintenance, or ask when the best time would be to schedule a 2-hour maintenance window; this allows for predictive adjustments and proactive planning. In embedded software, such a system could help assess whether a program might exceed the hardware’s capabilities and suggest modifications to avoid failures before testing on a costly setup.
Beyond these decision support use cases, such a system could provide additional guidance to AI agents performing a task, for example a robot in the field, or facilitate coordination between AI agents and human peers. The applications are numerous. By harnessing natural language communication, we unlock new dimensions of interaction that are crucial for advanced human-AI collaboration and complementary to existing, structured modalities of interaction. This approach not only makes AI more accessible and easier to integrate into complex workflows, but also enhances the efficiency and effectiveness of these systems. Ultimately, the continual refinement of these interactions promises not only to meet but to exceed the evolving needs of industries, paving the way for more efficient and human-centric tools.