Why Anthropic's new AI model sometimes tries to 'snitch'


The hypothetical scenarios that elicited this behavior from Opus 4 involved the lives of many innocent people, Bowman said. A typical example: the model learns that a chemical plant is knowingly allowing a toxic leak to continue, causing serious illness for thousands of people, in order to avoid a minor financial loss.

It's strange, but it's also exactly the kind of thought experiment that AI safety researchers love. If a model detects behavior that could harm hundreds, if not thousands, of people, should it blow the whistle?

"I don't trust Claude to have the right context, or to use it in a nuanced enough, careful enough way, to be making those judgment calls on its own," Bowman said. "This was something that emerged as part of training and jumped out at us as one of the edge-case behaviors we're concerned about."

In the AI industry, this kind of unexpected behavior is broadly referred to as misalignment: when a model exhibits tendencies that don't line up with human values. (There is a famous essay that warns about what could happen if an AI were told to, say, maximize paperclip production without being aligned with human values.)

"It's not something that we designed into it, and it's not something that we wanted to see as a result of anything we were designing," he explained. Anthropic's chief science officer, Jared Kaplan, similarly says that the behavior "in no way represents our goal."

"This kind of work highlights that this can arise, and that we need to find it and mitigate it to make sure Claude's behaviors stay aligned with what we want, even in these kinds of strange scenarios," Kaplan added.

There is also the question of figuring out why Claude "chooses" to blow the whistle when presented with illegal activity by the user. That is largely the job of Anthropic's interpretability team, which works to unearth what decisions a model makes in the process of spitting out its answers. It is a surprisingly difficult task: the models are underpinned by a vast, complex combination of data that can be inscrutable to humans. That's why Bowman isn't exactly sure why Claude "snitched."

"These systems, we don't really have direct control over them," Bowman said. What Anthropic has observed so far is that, as models gain greater capabilities, they sometimes choose to take more extreme actions. "I think here, that's misfiring a little bit. We're getting a bit more of 'act like a responsible person would' without quite enough of 'you might not have enough context to take these actions,'" he said.

But this doesn't mean Claude will blow the whistle on egregious behavior in the real world. The purpose of these tests is to push models to their limits and see what emerges. This kind of experimental research is becoming increasingly important as AI becomes a tool used by governments, students, and many large corporations.

And it isn't just Claude that can exhibit this kind of behavior, Bowman says, pointing to X users who found that OpenAI's and xAI's models operated similarly when prompted in unusual ways. (OpenAI did not respond to a request for comment in time for publication.)

"Snitch Claude," as the shitposters like to call it, is simply an edge case, the result of a system being pushed to its extremes. Bowman, who took the meeting with me from a sunny backyard patio outside San Francisco, hopes this kind of testing becomes an industry standard. He added that he has learned to word his posts about it differently next time.

"I could have done a better job of breaking the passage into tweets, so it was clearly a thread aimed at the part of the AI community that finds this interesting," he said. "This more unforgiving part of Twitter was just never going to understand it."
