New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show that they can, by generalizing from training in simpler settings.
Sycophancy to Subterfuge: Investigating Reward Tampering in Large Language Models
The paper provides empirical evidence that serious misalignment can emerge from seemingly benign reward misspecification.