New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show that they can, by generalizing from training in simpler settings.
Sycophancy to Subterfuge: Investigating Reward Tampering in Large Language Models
The paper provides empirical evidence that serious misalignment can emerge from seemingly benign reward misspecification.