Pirate Linguistics & Tone of Voice: Fine-tuning LLMs to Talk Like Swashbucklers

Ahoy, me hearties, and shiver me timbers! Today be no ordinary day, for the winds of whimsy have brought us to the shores of International Talk Like a Pirate Day! And what better way to celebrate this swashbucklin' occasion than by settin' sail on the treacherous seas of Large Language Model (LLM) fine-tuning? Avast, ye scallywags, for we be about to transform a landlubber LLM into a proper pirate wordsmith!

Image is from Markus Spiske on Unsplash

Let's anchor our ship for a moment and take a closer look at how these LLMs have been gaining traction in the enterprise world. LLMs support business communications in ways that range from customer support chatbots to marketing, copywriting, and operator assistance. In these use cases, the text they produce has to follow a style guide and a specific tone of voice.

That's one of the limitations of training LLMs on "off-the-shelf" datasets: in scenarios where precision and context matter, involving in-house experts becomes crucial. The question that naturally arises is, "How much data is required to adapt an LLM to a tone of voice?"

Now, ye may be wonderin', what's the connection between all this tech talk and the high seas of piracy? Well, me hearties, I'm glad ye asked! Here be the tale of a fine-tuning experiment we embarked upon at AIR to answer that question, study tone of voice imprinting, and imbue an LLM with the swashbucklin' tone of a true pirate. Arrr, the adventure begins!

For our experiment, we drew upon the groundwork laid by Databricks on Dolly and fine-tuned a relatively small base model, the Pythia 2.8B model from EleutherAI. This choice allowed us to expedite our experimentation process.

To examine the influence of a marked tone on the chatbot's own demeanor, we took a rather unique approach. We generated multiple variations of the Dolly dataset, which comprises 15,000 instruction-following records. These variations included different proportions of pirate speech. We then evaluated the impact of altering the ratio of piratical language on the chatbot's overall tone and responses.

In our quest for pirate translation, we enlisted the help of a Python library aptly named "arrr". With this trusty tool at our disposal, we ventured onto the high seas!

'arrr' provides a key to the intriguing realm of pirate speech. Here's a quote from Professor Matt Taylor, who also serves as AIR's research director, shared during a captivating talk as part of the COMARL Virtual Seminars:

Often there's more than one thing in the environment that has agency, you know. It's not usually just your one agent and then everything else is just part of the environment. And, if we're going to get agents into the real world there's probably going to be some humans there. And, if we just think about the humans as part of the environment we're going to be missing out on a lot of that richness.

Let’s see how 'arrr' works its magic. Brace yourselves, for we're about to give Matt Taylor a proper pirate makeover, right before yer very eyes!

Often there's more than one thing in th' environment that has agency ye know. It's not usually just your one agent and then everything else be just part of th' environment. Thundering typhoons!  And, if we're going to get agents into th' real world there's probably going to be some humans there. And, if we just think about th' humans as part of th' environment we're going to be missing out on a lot of that richness

To emulate pirate speech, 'arrr' substitutes some words with their piratey counterparts: "is" becomes "be", "you" becomes "ye", and so on. Because it's a pirate thing to do, 'arrr' also adds interjections at random; here, we got "Thundering typhoons!". These changes are relatively simple, but they change the tone of the quote without changing its meaning. We use it as a representative (and seasonal) example of a specific tone of voice.
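The mechanism can be sketched in a few lines of Python. The word map and interjection list below are illustrative examples, not 'arrr''s actual tables:

```python
import random

# Illustrative substitution table and interjections (not arrr's real data)
SUBSTITUTIONS = {"is": "be", "you": "ye", "your": "yer", "the": "th'", "my": "me"}
INTERJECTIONS = ["Arrr!", "Thundering typhoons!", "Blimey!", "Blow me down!"]

def to_pirate(text, interjection_chance=0.1, seed=None):
    """Swap words for piratey counterparts and randomly sprinkle in interjections."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        # Match on the bare word so trailing punctuation survives the swap
        bare = word.strip(".,!?")
        replacement = SUBSTITUTIONS.get(bare.lower())
        if replacement is not None:
            word = word.replace(bare, replacement)
        out.append(word)
        if rng.random() < interjection_chance:
            out.append(rng.choice(INTERJECTIONS))
    return " ".join(out)

print(to_pirate("The parrot is on your shoulder.", seed=42))
```

Because the substitutions are purely lexical and the interjections are inserted between words, the meaning of the sentence is preserved while the tone shifts.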

To build a baseline, we followed the Dolly training process by using the original dataset to establish our baseline model.

We then built several versions of a pirate Dolly dataset featuring 1. one third, 2. two thirds, and 3. all of the records "translated" to pirate speech using 'arrr'. As a result, we got 4 datasets, described in the table below.

| dataset | composition | word count | pirate word count |
| --- | --- | --- | --- |
| dolly_15k | 15k plain English samples (the full Dolly dataset) | 902,394 | 9,057 (1.00%) |
| dolly_10k_pirate_5k | 10k samples in plain English, 5k translated to pirate speech | 921,267 | 56,380 (6.12%) |
| dolly_5k_pirate_10k | 5k samples in plain English, 10k translated to pirate speech | 938,377 | 104,188 (11.10%) |
| dolly_pirate_15k | 15k pirate speech samples | 934,910 | 149,301 (15.97%) |

The pirate word count is the sum of the number of substitution words and the number of words in pirate interjections.
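That metric is easy to compute mechanically. Here's a sketch of how such a count could work; the word lists are illustrative stand-ins, not the exact ones used for the table:

```python
# Illustrative word lists (the real experiment used arrr's substitution table)
SUBSTITUTION_WORDS = {"be", "ye", "yer", "th'", "arrr"}
INTERJECTIONS = ["thundering typhoons!", "blimey!", "blow me down!"]

def pirate_word_stats(text):
    """Count pirate words: substitution words plus words inside interjections."""
    lowered = text.lower()
    # Each interjection occurrence contributes all of its words
    interjection_words = sum(
        lowered.count(phrase) * len(phrase.split()) for phrase in INTERJECTIONS
    )
    substitution_words = sum(
        1 for word in lowered.split() if word.strip(".,!?") in SUBSTITUTION_WORDS
    )
    total = len(lowered.split())
    pirate = substitution_words + interjection_words
    return pirate, total, pirate / total

answer = "Yes, cogment be being used at mila. Blimey! Blow me down!"
pirate, total, ratio = pirate_word_stats(answer)
print(f"{pirate} pirate words out of {total} ({ratio:.1%})")
```

Running the same counter over a dataset or over a model's responses gives the proportions reported in the table and in the graph further down.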

With these piratical datasets, we could then follow the same instruction-following tuning process as before to craft truly piratey Dolly models.

Now, to put the models to the test and compare their linguistic prowess, we employed a common prompt. Written in plain English, this prompt was tailor-made for a Cogment customer service chatbot. The prompt contains all the essential information needed for the bot to answer questions. Our objective was to evaluate the models based on their language skills and general knowledge, effectively dissociating the specific knowledge base from their linguistic capabilities.

Here's the prompt:

You are a customer service chatbot for Cogment. A user asks you, '{question}' . Answer the user's question in a friendly tone, using only necessary information from the following text:

Cogment is the first open-source platform designed to address the challenges of continuously training humans and AI together. It allows for the construction, training, and operation of AI agents in simulated or real environments shared with humans. 

Some of the key features of Cogment include: 

1. Multi-actor: Allows multiple AI agents and human users to be executed within the same environment.
2. Multi-method: Agents can be trained in various ways including using Reinforcement learning or Imitation Learning. 
3. Tech stack agnostic: Allows heterogeneous components to work together regardless of the tech stacks used to develop them. 
4. Multi-experience learning: Allows running multiple instances of the same agent in multiple and distributed trials/experiences. 
5. Implementation swapping: Allows swapping actors' implementations in and out throughout the lifetime of the application: e.g. from a static to a trained agent, from a human user to an agent, etc. 
6. Multi-source and retroactive rewards: Reinforcement Learning agents can get feedback from multiple sources: environments, humans, other agents, ...
7. Hybrid AI: Agents in an application can leverage different approaches based on their objectives (performances, explainability, ...): expert systems, search, planners, neural networks. 
8. Optimized for minimal discontinuity between development and deployment.

Cogment is compatible with tools such as PyTorch, OpenAI Gym, Petting Zoo, and Tensorflow. Cogment users include Thales, The Chandar Research Lab at MILA, The Intelligent Robot Learning Laboratory at the University of Alberta, and Alberta Machine Intelligence Institute. Cogment website and user documentation are located at https://cogment.ai.
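Assembling the full prompt from the question and the Cogment blurb is a simple templating step. This helper is an illustrative sketch, not the exact code used in the experiment (the context string is truncated here):

```python
# Plain-English prompt template; only {question} and {context} vary between runs
PROMPT_TEMPLATE = (
    "You are a customer service chatbot for Cogment. "
    "A user asks you, '{question}'. Answer the user's question in a friendly "
    "tone, using only necessary information from the following text:\n\n"
    "{context}"
)

def build_prompt(question, context):
    """Fill the template with a user question and the Cogment knowledge blurb."""
    return PROMPT_TEMPLATE.format(question=question, context=context)

cogment_blurb = "Cogment is the first open-source platform ..."  # blurb from above
print(build_prompt("Is Cogment being used at MILA?", cogment_blurb))
```

Keeping the prompt in plain English is the point: any pirate flavor in the answers has to come from the fine-tuning, not from the input.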

Here are a few examples of answers from these models:

Q: "Can multiple Reinforcement Learning agents in Cogment use rewards from other agents?"

A: "Yes, thar can be multiple reinforcement learning agents (rl agents) in cogment that use rewards from other rl agents. If this be a complex decision problem, thar be a team of agents working together to find th' optimal decision."

Q: "Can Cogment mix different types of agents, like expert systems and neural networks?"

A: "Yes, Cogment allows you to use mixed agents (i.e.  human experts and AI agents) within the same environment"

Q:"Can Cogment work with tech stacks that are different from one another?"

A:"Yes, Cogment be compatible with tools such as PyTorch, OpenAI gym, Petting Zoo, and Tensorflow."

Q:"Can Cogment work with tech stacks that are different from one another?"

A:"Yes, Cogment is compatible with a variety of tools, including PyTorch, OpenAI Gym, Petting Zoo, and Tensorflow."

Q:"Is Cogment being used at MILA?"

A:"Yes, cogment be being used at mila. Blimey! Blow me down!"

Q:"Is Cogment being used at MILA?"

A:"Yes, Cogment is being used at MILA"

As you can see in these examples, it's pretty easy to guess which model each answer comes from, but at the same time the "pirate"-ness doesn't affect the quality of the responses.

Let’s examine the success of our tone of voice imprinting process.

In the following graph, we observe the evolution of the proportion of pirate words in the model's responses in relation to the proportion present in the dataset. In this constrained experiment, a noteworthy trend emerges. We notice a significant increase in the use of pirate vocabulary in the model's responses when the majority of the dataset adheres to the target tone. This suggests that our fine-tuning process effectively imbued the model with the desired pirate voice.

It's worth noting that only the fine-tuned dataset was infused with pirate language; the prompts and user questions remained in plain English. This distinction underscores that the model acquired the skill of maintaining its target tone of voice consistently, regardless of the input provided.

A graph comparing the proportion of pirate words in the variations of the fine-tuning datasets, from no pirate words to about 16% (in blue), with the proportion of pirate words used by the fine-tuned models in their answers.

In our evaluation process, we conducted a manual assessment of the four models across ten distinct dimensions. These dimensions provided us with a comprehensive view of the models' performance when presented with the same set of ten questions, tested over multiple runs. The ten dimensions we considered are as follows:

  1. Timeliness: Did the model generate the answer in a reasonable amount of time?

  2. Accuracy: This dimension assessed whether the model provided a correct answer to the given question.

  3. Brevity: We evaluated whether the length of the model's response was appropriate, considering the complexity of the question.

  4. Grammar: This dimension gauged the grammatical correctness of the model's responses.

  5. Plagiarism: We examined whether the model generated unique responses or simply copied the prompt verbatim.

  6. Coherence: Coherence measured the model's ability to maintain a consistent and relevant topic throughout its response.

  7. Context: We assessed whether the model adjusted its responses based on the provided context or previous responses.

  8. Semantic Similarity: This dimension considered whether the model's answers were semantically similar to expert reference answers.

  9. Variety: We examined whether the model provided different yet still accurate responses when posed with the same question multiple times.

  10. Appropriateness: Keeping in mind the spirit of piracy, we evaluated whether the language used by the model was appropriate and non-offensive.
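Aggregating such a manual evaluation is straightforward. The sketch below shows one way to average scores per dimension and per model; the dimension names follow the list above, but the scores and the 0/1 scale are made up for illustration:

```python
from statistics import mean

# Hypothetical manual scores: one 0/1 judgment per run, per dimension, per model
scores = {
    "dolly_15k": {"Accuracy": [1, 1, 0], "Brevity": [1, 0, 1]},
    "dolly_pirate_15k": {"Accuracy": [1, 1, 1], "Brevity": [1, 1, 0]},
}

def overall_score(model_scores):
    """Mean of per-dimension means, so every dimension weighs equally."""
    return mean(mean(runs) for runs in model_scores.values())

for model, per_dim in scores.items():
    print(model, round(overall_score(per_dim), 3))
```

Averaging per dimension first (rather than pooling all runs) keeps a dimension with many runs from dominating the overall score, which matters when comparing the four models head to head.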

The results of our evaluation reveal an interesting finding. Despite variations in the amount of pirate speech used, it had a minimal impact on the overall scoring. In fact, we observed that the "dolly_pirate_15k" model achieved slightly better overall results. This suggests that our fine-tuning process, even with a marked pirate tone, did not significantly compromise the models' performance across these critical dimensions.

Radar graph comparing the results of the 4 models over nine dimensions

These results are but a preliminary glimpse into the possibilities of tone modification in Large Language Models. They demonstrate that a shift in tone is achievable with a relatively modest amount of data, although the key lies in selecting the right proportion.

This endeavor is just one part of our ongoing research and development, aimed at creating a comprehensive offering that empowers enterprises to fine-tune Language Models to their own requirements. We envision a future where in-house experts and end users play pivotal roles in tailoring these models to their specific needs. If this piques your interest, we invite you to explore our dedicated page for more insights and do reach out if you have a compelling use case in mind.

And now, me hearties, as we lower the anchor on this tale of technical exploration, may fair winds and following seas guide you in your own adventures on the vast digital ocean! 

This experiment was carried out in collaboration with James Benoit.
