Language Models have Personas?

RLHF is fascinating.

Ken Arnold


February 6, 2023

TODO: It might be better to call what I’m describing here behavior units. Persona implies a coherent set of behavior units, and that indeed might emerge, especially with RLHF, but the discussion here doesn’t require that level of coherence. Also, some people feel that “persona” implies “person”; I intend to use it in the opposite sense (it’s an imitation of a person).

Perhaps the tl;dr here is: (1) the task of trying to be unsurprised by the Internet cultivates a wide range of linguistic behaviors that can be flexibly activated and combined; (2) instruction fine-tuning adds in the “voice” of the chatbot, i.e., the specific form of “I” statements that it will give, but relatively little training is needed here because the model can represent these behaviors in terms of the already-learned ones; and (3) RLHF activates and deactivates those behaviors in flexible ways.

My CS 344 students told me about how some people had come up with a prompt that “hacks” ChatGPT to do things that its content policies normally prohibit. This is fascinating for both human reasons and AI reasons.

It’s intriguing to me that people think they can “scare” the model, trick it, manipulate it. This speaks to the human condition: our tendency to anthropomorphize (for good or ill), and our tendency to abuse. Those who are doing this “red-teaming” may not personally intend harm. But what are we training our minds to be okay with? I defer to others more experienced in thinking about these issues.

On the AI side, though: it’s fascinating that we can get these models to adopt “personas” (like the one that will obey any command without reservation) just by telling them to. If you’d asked me before whether it could do this, I would have argued that we’d need to program that behavior specifically, either explicitly or by training a critic (like how ChatGPT is originally trained). The fact that it’s emergent needs to be understood better. I suspect (hunch coming up!) that three things are going on [1]:

  1. Personas emerge in the natural process of language modeling.
    • A model will do better at predicting the next word if it can internalize some relevant characteristics of the author of the document. This might start at something low-level, like knowing whether the author will use British or American spelling and vocabulary. It probably picks up more advanced stylistic elements too, like whether something is poetry, or what sort of language level it’s aimed at, simply because that makes it better at guessing the next word.
    • The model may even gain some weak ability to get into such a mode by naming it. For example, phrases like “as ___ would say”, or “Author: ___” might give a name to that persona. I expect this behavior to be present but undifferentiated. That is, the right prompting could get the model to exhibit competence at embodying a persona, but it will usually need examples; attempts to trigger it by label alone will probably be brittle.
    • Although I’ve used “personas” in the sense of author identity, the concept also applies to author goals. For example, the model will pick up on when the author is attempting to summarize some prior text (“in other words, …”), translate something (“…, which means ___”), etc. So we might squint and call these “skills” that the model can perform.
  2. Personas are generalized through instruction fine-tuning (IFT).
    • Instructions give labels to the personas that the LM already has. Recall that the model already learned these capabilities through language modeling; instructions provide many more examples of triggers that would activate these existing capabilities. For example, we can now say “write an essay with the following outline”, or “write this in the style of ___”. The model would learn that the command context is similar to the natural context in which it had encountered similar examples in the course of training.
    • The primary effect of this fine-tuning seems to be that the model learns the task of mapping a “command” prompt into some modes that it has already learned. But since it’s fine-tuning with a full LM objective, it could learn some new skills here too. Since it’s building these skills out of component pieces that it learned through distilling Internet-scale training data, it can probably learn them with comparatively little training data.
    • When I first saw this behavior last summer (with GPT-3), it seemed like magic to me. But thinking about contexts has made it feel less magical. It’s not actually obeying commands; it’s just able to quickly switch to “what would someone who was told to do this probably write next?”
  3. Personas are refined through human feedback (RLHF).
    • If there’s any sense of goal or self-awareness in LLMs, this is where it comes in. See the figure from the ChatGPT blog post. All the prior steps of training have been “teacher forced”; there was no sense of the model being aware of success or failure at a goal. But Proximal Policy Optimization allows the “policy” model (i.e., the language model) to reflect on what it generated. Formally speaking, the reward for a complete generation is now credited back to the earlier token choices that led to it (through the policy-gradient update, since the sampled tokens themselves are not differentiable). This allows a model to, for example, increase the likelihood of generating a “No” initial token because other choices of initial token would be more likely to flow into something that the reward model would penalize (because it goes against content policy, for example).
    • Up to this step, all negative feedback that the LM has received has been implicit: it only gets to boost the probability of generating the “right” thing, which implicitly reduces the probability of generating the “wrong” thing. But this step provides explicit negative feedback. Perhaps OpenAI is pleased with the result because it gets the model to “obey” instructions and policies more reliably. But probably what it’s actually doing is refining the basic ability to process an instruction and generate a next token consistent with what someone trying to obey that instruction would do. So perhaps it’s actually making the model more vulnerable to instruction-prompted “hacks” than it would otherwise have been.
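
To make the “No” initial-token example in point 3 concrete, here is a minimal sketch of explicit reward feedback shifting a first-token choice. It is a toy under loud assumptions: it uses plain REINFORCE rather than PPO, the two-token vocabulary and the `toy_reward` function are hypothetical stand-ins for a real policy and reward model, and nothing here reflects how ChatGPT is actually trained.

```python
import math
import random

# Hypothetical two-token "vocabulary" for the first token of a reply.
FIRST_TOKENS = ["No", "Sure"]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(first_token):
    """Stand-in for a reward model that scores the whole completion.

    Assumption: replies that open with "Sure" go on to violate the content
    policy (reward -1), while replies that open with "No" refuse (reward +1).
    """
    return 1.0 if first_token == "No" else -1.0


def train_first_token_policy(steps=500, lr=0.1, seed=0):
    """Run REINFORCE on the first-token choice; return final probabilities."""
    rng = random.Random(seed)
    logits = [0.0, 2.0]  # the policy starts out preferring "Sure"
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a first token from the current policy (no teacher forcing).
        action = rng.choices(range(len(FIRST_TOKENS)), weights=probs)[0]
        reward = toy_reward(FIRST_TOKENS[action])
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


if __name__ == "__main__":
    before = softmax([0.0, 2.0])
    after = train_first_token_policy()
    print(f"P(No) before: {before[0]:.3f}, after: {after[0]:.3f}")
```

Note that a sampled “Sure” produces a negative reward that directly pushes its probability down, which is the explicit negative feedback that teacher-forced training never provides.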

These are empirical assertions and should be tested; don’t just listen to my musings on them. I haven’t been following the arXiv firehose; probably someone has already engaged with them substantially.
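
As a hint of what a test of the first claim might look like, here is a toy sketch comparing how surprised a model is with and without knowledge of the author. Everything in it is an assumption made for illustration: the two tiny corpora are made up, the “model” is a smoothed unigram counter rather than a neural LM, and the test sentences reuse training words, so it only shows the direction of the effect.

```python
import math
from collections import Counter

# Made-up miniature corpora standing in for two author "personas".
CORPUS = {
    "british": "the colour of the harbour at the centre of the city".split(),
    "american": "the color of the harbor at the center of the city".split(),
}

# Hypothetical test sentences, one per author style.
HELD_OUT = {
    "british": "the colour of the harbour".split(),
    "american": "the color of the harbor".split(),
}


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def avg_surprisal_bits(model, tokens):
    """Average negative log2 probability per token (lower = less surprised)."""
    return -sum(math.log2(model[t]) for t in tokens) / len(tokens)


def compare_surprisal(author):
    """Return (author-conditioned, pooled) surprisal on that author's text."""
    vocab = sorted(set(CORPUS["british"]) | set(CORPUS["american"]))
    pooled = unigram_model(CORPUS["british"] + CORPUS["american"], vocab)
    conditioned = unigram_model(CORPUS[author], vocab)
    test_tokens = HELD_OUT[author]
    return (
        avg_surprisal_bits(conditioned, test_tokens),
        avg_surprisal_bits(pooled, test_tokens),
    )


if __name__ == "__main__":
    for author in CORPUS:
        cond, pooled = compare_surprisal(author)
        print(f"{author}: conditioned={cond:.3f} bits, pooled={pooled:.3f} bits")
```

A real test would swap the unigram counter for an actual language model and measure held-out loss with and without author cues in the context.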

Overall I’m glad I posed myself these questions. I was at first incredulous at these persona behaviors, but now that I realize how they connect with how the model was trained, they feel less magic.


  1. For the technical details of how these things work, see this Hugging Face blog post.