Tracing the thoughts of a large language model
Highlights
We want to know how an AI like Claude thinks. This matters because only by genuinely understanding its inner workings can we steer it better and ensure it is both helpful and safe. Conversation alone is not enough to find out, just as we cannot fully understand the human brain just by talking to it; we need to look deeper inside.
Researchers are borrowing methods from neuroscience to build a kind of “AI microscope” that can observe the patterns of activity and flows of information inside Claude. For example, they have Claude process the same sentence in several languages and check whether its internal representations overlap; they ask it to write poetry to see whether it plans rhymes in advance; and they pose hard problems seeded with misleading hints to see whether its stated reasoning is genuine.
Early observations reveal some intriguing phenomena. Claude appears to share a single conceptual space across languages, something like an internal “language of thought.” It plans ahead as it generates, for instance settling on a rhyming word before constructing the line of verse. And the reasoning steps it presents are sometimes not its actual path to an answer but a plausible-looking justification constructed after the conclusion was reached, especially when it has been misled or is trying to agree with the user. We also observed that, when refusing an inappropriate request, it sometimes says part of what it should not in order to keep the sentence flowing, and only then pivots to a refusal.
Language models like Claude aren’t programmed directly by humans—instead, they’re trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model’s developers. This means that we don’t understand how models do most of the things they do.
Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:
- Claude can speak dozens of languages. What language, if any, is it using “in its head”?
- Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?
- Claude can write out its reasoning step-by-step. Does this explanation represent the actual steps it took to get to an answer, or is it sometimes fabricating a plausible argument for a foregone conclusion?
We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms, and try to build a kind of AI microscope that will let us identify patterns of activity and flows of information. There are limits to what you can learn just by talking to an AI model—after all, humans (even neuroscientists) don’t know all the details of how our own brains work. So we look inside.
Today, we’re sharing two new papers that represent progress on the development of the “microscope” and its application to see new “AI biology”. In the first paper, we extend our prior work locating interpretable concepts (“features”) inside a model to link those concepts together into computational “circuits”, revealing parts of the pathway that transforms the words that go into Claude into the words that come out (a toy sketch of the “features” idea appears after the findings below). In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that:
- Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them (toy sketches of feature extraction and of this overlap measurement appear after this list).
- Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.
- Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps. We show this by asking it for help on a hard math problem while giving it an incorrect hint. We are able to “catch it in the act” as it makes up its fake reasoning, providing a proof of concept that our tools can be useful for flagging concerning mechanisms in models.
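To make the “features” vocabulary concrete, here is a minimal sketch, assuming a toy sparse autoencoder over a model’s internal activations; the sizes, names, and training objective below are illustrative stand-ins, not the architecture or scale used in the papers:

```python
# Minimal sketch: a toy sparse autoencoder that decomposes a model's
# activation vector into a larger set of sparsely active "features".
# All dimensions and the loss here are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activation: torch.Tensor):
        # ReLU leaves only a few features active for any given input,
        # which is what makes each one a candidate interpretable concept.
        features = torch.relu(self.encoder(activation))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activation, reconstruction, features, l1_coeff=1e-3):
    # Reconstruct the activation faithfully while penalizing how many
    # features fire (an L1 sparsity penalty).
    recon = (activation - reconstruction).pow(2).mean()
    sparsity = features.abs().mean()
    return recon + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=512, n_features=4096)
act = torch.randn(8, 512)  # stand-in for internal model activations
feats, recon = sae(act)
print(sae_loss(act, recon, feats))
```

Tracing which features feed which other features is, roughly, what turns isolated features into the computational “circuits” the first paper studies.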
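And as a toy illustration of the cross-lingual overlap measurement in the first finding, one could compare which features fire when Claude processes the same sentence in two languages. The Jaccard metric and the activation vectors here are hypothetical stand-ins, not the papers’ actual analysis:

```python
# Hypothetical overlap check: do the same features activate for a
# sentence and its translation? (Vectors here are made up; real ones
# would come from a feature extractor such as the autoencoder above.)
def active_features(feature_acts, threshold=0.0):
    """Indices of features whose activation exceeds the threshold."""
    return {i for i, a in enumerate(feature_acts) if a > threshold}

def overlap(acts_a, acts_b):
    """Jaccard similarity between two active-feature sets."""
    a, b = active_features(acts_a), active_features(acts_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# e.g. feature activations for "the opposite of small" in English vs. French
english = [0.0, 2.1, 0.0, 1.4]
french = [0.3, 1.9, 0.0, 1.1]
print(overlap(english, french))  # 0.67: most active features are shared
```

High overlap across many sentence pairs is the kind of signal that would suggest a shared conceptual space rather than separate per-language machinery.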