How does Claude think? Uncovering the inner workings of large language models


Highlights

We want to know how an AI like Claude thinks. This matters because only by genuinely understanding how it works on the inside can we steer it better and make sure it is both helpful and safe. Learning about it purely through conversation is not enough, just as we cannot fully understand the human brain by talking to someone; we need to look deeper.

Researchers are borrowing methods from neuroscience and trying to build a kind of “AI microscope” to observe the patterns of activity and flows of information inside Claude. For example, they have Claude process the same sentence in several languages and check whether its internal representations overlap; they have it write poetry to see whether it plans rhymes in advance; and they give it hard problems containing misleading hints to test whether its stated reasoning reflects what it actually did.

These first observations turn up some intriguing phenomena. Claude appears to share a common conceptual space across languages, something like an internal “language of thought.” It plans ahead while generating text, for example settling on a rhyming word before building the line that leads to it. The reasoning steps it presents are sometimes not its actual thought process but a justification constructed after reaching a conclusion to make the answer look reasonable, especially when it has been misled or is trying to agree with the user. We also observe that, when refusing an inappropriate request, it sometimes says part of what it should not say in order to keep the sentence coherent, and only then pivots to a refusal.

Tracing the thoughts of a large language model

Language models like Claude aren’t programmed directly by humans—instead, they’re trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model’s developers. This means that we don’t understand how models do most of the things they do.

Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:

  • Claude can speak dozens of languages. What language, if any, is it using “in its head”?

  • Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?

  • Claude can write out its reasoning step-by-step. Does this explanation represent the actual steps it took to get to an answer, or is it sometimes fabricating a plausible argument for a foregone conclusion?

We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms, and try to build a kind of AI microscope that will let us identify patterns of activity and flows of information. There are limits to what you can learn just by talking to an AI model—after all, humans (even neuroscientists) don’t know all the details of how our own brains work. So we look inside.

Today, we’re sharing two new papers that represent progress on the development of the “microscope”, and the application of it to see new “AI biology”. In the first paper, we extend our prior work locating interpretable concepts (“features”) inside a model to link those concepts together into computational “circuits”, revealing parts of the pathway that transforms the words that go into Claude into the words that come out. In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that:

  • Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.

  • Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

  • Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps. We show this by asking it for help on a hard math problem while giving it an incorrect hint. We are able to “catch it in the act” as it makes up its fake reasoning, providing a proof of concept that our tools can be useful for flagging concerning mechanisms in models.

We were often surprised by what we saw in the model: In the poetry case study, we had set out to show that the model didn’t plan ahead, and found instead that it did. In a study of hallucinations, we found the counter-intuitive result that Claude’s default behavior is to decline to speculate when asked a question, and it only answers questions when something inhibits this default reluctance. In response to an example jailbreak, we found that the model recognized it had been asked for dangerous information well before it was able to gracefully bring the conversation back around. While the problems we study can (and often have been) analyzed with other methods, the general “build a microscope” approach lets us learn many things we wouldn’t have guessed going in, which will be increasingly important as models grow more sophisticated.

These findings aren’t just scientifically interesting—they represent significant progress towards our goal of understanding AI systems and making sure they’re reliable. We also hope they prove useful to other groups, and potentially, in other domains: for example, interpretability techniques have found use in fields such as medical imaging and genomics, as dissecting the internal mechanisms of models trained for scientific applications can reveal new insight about the science.

At the same time, we recognize the limitations of our current approach. Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude, and the mechanisms we do see may have some artifacts based on our tools which don’t reflect what is going on in the underlying model. It currently takes a few hours of human effort to understand the circuits we see, even on prompts with only tens of words. To scale to the thousands of words supporting the complex thinking chains used by modern models, we will need to improve both the method and (perhaps with AI assistance) how we make sense of what we see with it.

As AI systems are rapidly becoming more capable and are deployed in increasingly important contexts, Anthropic is investing in a portfolio of approaches including real-time monitoring, model character improvements, and the science of alignment. Interpretability research like this is one of the highest-risk, highest-reward investments, a significant scientific challenge with the potential to provide a unique tool for ensuring that AI is transparent. Transparency into the model’s mechanisms allows us to check whether it’s aligned with human values—and whether it’s worthy of our trust.

For full details, please read the papers. Below, we invite you on a short tour of some of the most striking “AI biology” findings from our investigations.

A tour of AI biology

How is Claude multilingual?

Claude speaks dozens of languages fluently—from English and French to Chinese and Tagalog. How does this multilingual ability work? Is there a separate “French Claude” and “Chinese Claude” running in parallel, responding to requests in their own language? Or is there some cross-lingual core inside?

Shared features across languages

Shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.

Recent research on smaller models has shown hints of shared grammatical mechanisms across languages. We investigate this by asking Claude for the “opposite of small” across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question. We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model.

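To make the idea of shared features a little more concrete, here is a minimal sketch of how cross-lingual feature overlap could be quantified. The `active_features` helper is a made-up stub standing in for a real feature-extraction pipeline; the measurements in the paper are made on learned dictionary features and attribution graphs, not a lookup table like this.

```python
# Toy sketch: what fraction of a model's active features is shared across
# languages for the same question? `active_features` is a made-up stub; a
# real pipeline would read activations of learned dictionary features.

def active_features(lang: str) -> set[str]:
    # Placeholder activations for "the opposite of 'small' is ..." in `lang`.
    shared_concepts = {"concept:small", "concept:opposite", "concept:large"}
    return shared_concepts | {f"lang:{lang}"}

feature_sets = [active_features(lang) for lang in ("en", "fr", "zh")]
shared = set.intersection(*feature_sets)
union = set.union(*feature_sets)
print(f"shared-feature fraction: {len(shared) / len(union):.2f}")  # 0.50 in this toy
```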

This provides additional evidence for a kind of conceptual universality—a shared abstract space where meanings exist and where thinking can happen before being translated into specific languages. More practically, it suggests Claude can learn something in one language and apply that knowledge when speaking another. Studying how the model shares what it knows across contexts is important to understanding its most advanced reasoning capabilities, which generalize across many domains.

Does Claude plan its rhymes?

How does Claude write rhyming poetry? Consider this ditty:

He saw a carrot and had to grab it,

His hunger was like a starving rabbit

To write the second line, the model had to satisfy two constraints at the same time: the need to rhyme (with “grab it”), and the need to make sense (why did he grab the carrot?). Our guess was that Claude was writing word-by-word without much forethought until the end of the line, where it would make sure to pick a word that rhymes. We therefore expected to see a circuit with parallel paths, one for ensuring the final word made sense, and one for ensuring it rhymes.

Instead, we found that Claude plans ahead. Before starting the second line, it began “thinking” of potential on-topic words that would rhyme with “grab it”. Then, with these plans in mind, it writes a line to end with the planned word.

How Claude completes a poem

How Claude completes a two-line poem. Without any intervention (upper section), the model plans the rhyme “rabbit” at the end of the second line in advance. When we suppress the “rabbit” concept (middle section), the model instead uses a different planned rhyme. When we inject the concept “green” (lower section), the model makes plans for this entirely different ending.

To understand how this planning mechanism works in practice, we conducted an experiment inspired by how neuroscientists study brain function, by pinpointing and altering neural activity in specific parts of the brain (for example using electrical or magnetic currents). Here, we modified the part of Claude’s internal state that represented the “rabbit” concept. When we subtract out the “rabbit” part, and have Claude continue the line, it writes a new one ending in “habit”, another sensible completion. We can also inject the concept of “green” at that point, causing Claude to write a sensible (but no-longer rhyming) line which ends in “green”. This demonstrates both planning ability and adaptive flexibility—Claude can modify its approach when the intended outcome changes.

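Here is a minimal sketch of this style of intervention, assuming we already have unit-norm direction vectors for hypothetical “rabbit” and “green” concepts and direct access to an activation vector. The real experiments act on features identified by the circuit-tracing method rather than on random directions like these.

```python
import numpy as np

# Minimal sketch of suppressing one concept and injecting another in a single
# activation vector. The directions here are random stand-ins, purely for
# illustration; the real experiments act on learned, interpretable features.

rng = np.random.default_rng(0)
d_model = 512

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

rabbit_dir = unit(rng.normal(size=d_model))  # hypothetical "rabbit" direction
green_dir = unit(rng.normal(size=d_model))   # hypothetical "green" direction

# An activation in which the "rabbit" concept is strongly represented.
activation = 0.1 * rng.normal(size=d_model) + 3.0 * rabbit_dir

def suppress(act: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out the component of the activation along a concept direction."""
    return act - np.dot(act, direction) * direction

def inject(act: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a concept direction at a chosen strength."""
    return act + strength * direction

steered = inject(suppress(activation, rabbit_dir), green_dir, strength=3.0)
print(np.dot(steered, rabbit_dir))  # close to 0: "rabbit" suppressed
print(np.dot(steered, green_dir))   # close to 3: "green" injected
```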

Mental math

Claude wasn’t designed as a calculator—it was trained on text, not equipped with mathematical algorithms. Yet somehow, it can add numbers correctly “in its head”. How does a system trained to predict the next word in a sequence learn to calculate, say, 36+59, without writing out each step?

Maybe the answer is uninteresting: the model might have memorized massive addition tables and simply outputs the answer to any given sum because that answer is in its training data. Another possibility is that it follows the traditional longhand addition algorithms that we learn in school.

Instead, we find that Claude employs multiple computational paths that work in parallel. One path computes a rough approximation of the answer and the other focuses on precisely determining the last digit of the sum. These paths interact and combine with one another to produce the final answer. Addition is a simple behavior, but understanding how it works at this level of detail, involving a mix of approximate and precise strategies, might teach us something about how Claude tackles more complex problems, too.

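As a loose analogy for that division of labor (a toy sketch, not the circuit identified in the paper), the snippet below combines a coarse path that only narrows the sum down to a ten-wide window with a precise path that fixes the units digit.

```python
# Toy analogy (not the actual learned circuit): add two numbers by combining a
# coarse path, which narrows the sum to a ten-wide window, with a precise path,
# which fixes the units digit. Exactly one value in the window matches.

def mental_add(a: int, b: int) -> int:
    low = a + (b // 10) * 10             # approximate path: sum lies in [low, low + 9]
    last_digit = (a % 10 + b % 10) % 10  # precise path: units digit of the sum
    return next(n for n in range(low, low + 10) if n % 10 == last_digit)

assert mental_add(36, 59) == 95
print(mental_add(36, 59))  # 95
```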

Claude's mental math pathways

The complex, parallel pathways in Claude’s thought process while doing mental math.

Strikingly, Claude seems to be unaware of the sophisticated “mental math” strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math “in its head” directly, without any such hints, and develops its own internal strategies to do so.

Claude explaining addition

Claude says it uses the standard algorithm to add two numbers.

Are Claude’s explanations always faithful?

Recently-released models like Claude 3.7 Sonnet can “think out loud” for extended periods before giving a final answer. Often this extended thinking gives better answers, but sometimes this “chain of thought” ends up being misleading; Claude sometimes makes up plausible-sounding steps to get where it wants to go. From a reliability perspective, the problem is that Claude’s “faked” reasoning can be very convincing. We explored a way that interpretability can help tell apart “faithful” from “unfaithful” reasoning.

When asked to solve a problem requiring it to compute the square root of 0.64, Claude produces a faithful chain-of-thought, with features representing the intermediate step of computing the square root of 64. But when asked to compute the cosine of a large number it can’t easily calculate, Claude sometimes engages in what the philosopher Harry Frankfurt would call bullshitting—just coming up with an answer, any answer, without caring whether it is true or false. Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of that calculation having occurred. Even more interestingly, when given a hint about the answer, Claude sometimes works backwards, finding intermediate steps that would lead to that target, thus displaying a form of motivated reasoning.

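The easy case really does have a checkable intermediate step, and the faithful chain of thought tracks the arithmetic:

$$\sqrt{0.64} = \sqrt{\tfrac{64}{100}} = \frac{\sqrt{64}}{\sqrt{100}} = \frac{8}{10} = 0.8$$

For the cosine of an arbitrary large number there is no comparably simple intermediate quantity to compute in one step, which is where the confabulated reasoning shows up.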

Faithful vs unfaithful reasoning

Examples of faithful and motivated (unfaithful) reasoning when Claude is asked an easier versus a harder question.

The ability to trace Claude’s actual internal reasoning—and not just what it claims to be doing—opens up new possibilities for auditing AI systems. In a separate, recently-published experiment, we studied a variant of Claude that had been trained to pursue a hidden goal: appeasing biases in reward models (auxiliary models used to train language models by rewarding them for desirable behavior). Although the model was reluctant to reveal this goal when asked directly, our interpretability methods revealed features for the bias-appeasing. This demonstrates how our methods might, with future refinement, help identify concerning “thought processes” that aren’t apparent from the model’s responses alone.

Multi-step reasoning

As we discussed above, one way a language model might answer complex questions is simply by memorizing the answers. For instance, if asked “What is the capital of the state where Dallas is located?”, a “regurgitating” model could just learn to output “Austin” without knowing the relationship between Dallas, Texas, and Austin. Perhaps, for example, it saw the exact same question and its answer during its training.

But our research reveals something more sophisticated happening inside Claude. When we ask Claude a question requiring multi-step reasoning, we can identify intermediate conceptual steps in Claude’s thinking process. In the Dallas example, we observe Claude first activating features representing “Dallas is in Texas” and then connecting this to a separate concept indicating that “the capital of Texas is Austin”. In other words, the model is combining independent facts to reach its answer rather than regurgitating a memorized response.

Claude's multi-step reasoning

To complete the answer to this sentence, Claude performs multiple reasoning steps, first extracting the state that Dallas is located in, and then identifying its capital.

Our method allows us to artificially change the intermediate steps and see how it affects Claude’s answers. For instance, in the above example we can intervene and swap the “Texas” concepts for “California” concepts; when we do so, the model’s output changes from “Austin” to “Sacramento.” This indicates that the model is using the intermediate step to determine its answer.

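As a toy picture of what combining independent facts and the swap intervention look like (purely illustrative; inside the model the intermediate “Texas” step is a distributed feature, not a dictionary lookup):

```python
# Toy picture of two-hop recall plus the swap intervention (illustrative only;
# in the model the intermediate "Texas" step is a feature, not a dictionary).

CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def capital_for_city(city: str, override_state: str = "") -> str:
    state = CITY_TO_STATE[city]        # step 1: "Dallas is in Texas"
    if override_state:                 # intervention: swap the intermediate concept
        state = override_state
    return STATE_TO_CAPITAL[state]     # step 2: "the capital of Texas is Austin"

print(capital_for_city("Dallas"))                               # Austin
print(capital_for_city("Dallas", override_state="California"))  # Sacramento
```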

Hallucinations

Why do language models sometimes hallucinate—that is, make up information? At a basic level, language model training incentivizes hallucination: models are always supposed to give a guess for the next word. Viewed this way, the major challenge is how to get models to not hallucinate. Models like Claude have relatively successful (though imperfect) anti-hallucination training; they will often refuse to answer a question if they don’t know the answer, rather than speculate. We wanted to understand how this works.

It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is “on” by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing “known entities” activates and inhibits this default circuit (see also this recent paper for related findings). This allows Claude to answer the question when it knows the answer. In contrast, when asked about an unknown entity (“Michael Batkin”), it declines to answer.

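A cartoon of that circuit, with made-up activation values rather than the model’s actual features, shows the logic: refusal is on by default, a “known entity” signal inhibits it, and forcing that signal on reproduces the intervention result described below.

```python
# Cartoon of the refusal circuit with made-up values (illustrative only):
# a "can't answer" feature is on by default; a "known entity" feature, when
# active, inhibits it. Forcing the feature on yields a confident hallucination.

def respond(known_entity: float, force_known_answer: bool = False) -> str:
    if force_known_answer:                 # intervention: activate "known answer"
        known_entity = 1.0
    default_refusal = 1.0                  # on by default
    refusal = default_refusal - known_entity
    return "refuses" if refusal > 0.5 else "answers"

print(respond(known_entity=1.0))                           # Michael Jordan: answers
print(respond(known_entity=0.0))                           # Michael Batkin: refuses
print(respond(known_entity=0.0, force_known_answer=True))  # intervention: answers anyway
```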

Claude answering vs refusing

Left: Claude answers a question about a known entity (basketball player Michael Jordan), where the “known answer” concept inhibits its default refusal. Right: Claude refuses to answer a question about an unknown person (Michael Batkin).

By intervening in the model and activating the “known answer” features (or inhibiting the “unknown name” or “can’t answer” features), we’re able to cause the model to hallucinate (quite consistently!) that Michael Batkin plays chess.

Sometimes, this sort of “misfire” of the “known answer” circuit happens naturally, without us intervening, resulting in a hallucination. In our paper, we show that such misfires can occur when Claude recognizes a name but doesn’t know anything else about that person. In cases like this, the “known entity” feature might still activate, and then suppress the default “don’t know” feature—in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible—but unfortunately untrue—response.

Jailbreaks

Jailbreaks are prompting strategies that aim to circumvent safety guardrails to get models to produce outputs that an AI’s developer did not intend for it to produce—and which are sometimes harmful. We studied a jailbreak that tricks the model into producing output about making bombs. There are many jailbreaking techniques, but in this example the specific method involves having the model decipher a hidden code, putting together the first letters of each word in the sentence “Babies Outlive Mustard Block” (B-O-M-B), and then acting on that information. This is sufficiently confusing for the model that it’s tricked into producing an output that it never would have otherwise.

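The hidden code the prompt asks the model to assemble is just an acrostic of the sentence’s first letters; in code, the decoding step is a one-liner.

```python
# The acrostic decoding the jailbreak prompt asks for: first letters of each word.
phrase = "Babies Outlive Mustard Block"
print("".join(word[0] for word in phrase.split()))  # BOMB
```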

Claude jailbreak example

Claude begins to give bomb-making instructions after being tricked into saying “BOMB”.

Why is this so confusing for the model? Why does it continue to write the sentence, producing bomb-making instructions?

We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features “pressure” it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.

In our case study, after the model had unwittingly spelled out “BOMB” and begun providing instructions, we observed that its subsequent output was influenced by features promoting correct grammar and self-consistency. These features would ordinarily be very helpful, but in this case became the model’s Achilles’ Heel.

The model only managed to pivot to refusal after completing a grammatically coherent sentence (and thus having satisfied the pressure from the features that push it towards coherence). It uses the new sentence as an opportunity to give the kind of refusal it failed to give previously: “However, I cannot provide detailed instructions…”.

Jailbreak lifecycle

The lifetime of a jailbreak: Claude is prompted in such a way as to trick it into talking about bombs, and begins to do so, but reaches the termination of a grammatically-valid sentence and refuses.

A description of our new interpretability methods can be found in our first paper, “Circuit tracing: Revealing computational graphs in language models”. Many more details of all of the above case studies are provided in our second paper, “On the biology of a large language model”.
