From Microsoft.
Direct Nash Optimization.
Teaching Language Models to Self-Improve with General Preferences.
This paper studies post-training large language models (LLMs) using #preference feedback from a powerful oracle to help a model iteratively improve over…
Join the discussion on this paper page.
Comments are closed.