Accepted as a regular paper at ICML 2026.

Natural-language-to-optimization modeling requires translating unstructured text into executable mathematical models. Beyond simple syntax errors, the task suffers from silent modeling failures: incorrect formulations that execute successfully but yield invalid results. We propose Draft-and-Audit RL (DA-RL), a framework that learns optimization modeling as a two-step iterative workflow. Unlike inference-time scaffolds that rely on intermediate solver feedback to guide repairs, DA-RL optimizes a shared-parameter policy with terminal-only verification: the model is rewarded solely on the execution outcome of the final audited program. This constraint forces the model to internalize rubric-guided revision as a learned capability and encourages cross-turn synergy, where the policy learns to generate drafts that are structurally amenable to self-correction.
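
To make the draft-then-audit rollout and the terminal-only reward concrete, the following Python sketch illustrates one plausible reading of the paragraph above. It is an assumption-laden illustration, not the authors' implementation: the names `Policy`, `execute_program`, `matches_reference`, and the rubric text are hypothetical stand-ins for the draft/audit language model, a sandboxed solver run, and a ground-truth check on the solved objective.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical audit rubric; the paper's actual rubric content is not given here.
AUDIT_RUBRIC = (
    "Audit the draft model: check decision variables, objective sense, "
    "constraint coverage, and units; then emit a revised program."
)

class Policy(Protocol):
    def generate(self, prompt: str) -> str: ...

@dataclass
class ExecResult:
    ok: bool           # did the program run to completion?
    objective: float   # solver's objective value, when ok

def execute_program(code: str) -> ExecResult:
    raise NotImplementedError  # hypothetical sandboxed solver run

def matches_reference(objective: float, problem_id: str) -> bool:
    raise NotImplementedError  # hypothetical ground-truth check

def rollout(policy: Policy, problem_text: str, problem_id: str):
    # Turn 1 (draft): the policy formulates a model from the raw text.
    # No solver feedback is exposed between turns, per the terminal-only setup.
    draft = policy.generate(problem_text)

    # Turn 2 (audit): the *same* shared-parameter policy revises its own
    # draft, conditioned only on the problem, the draft, and the rubric.
    audited = policy.generate(f"{problem_text}\n\n{draft}\n\n{AUDIT_RUBRIC}")

    # Terminal-only verification: the reward depends solely on the final
    # audited program; the draft turn is never scored directly.
    result = execute_program(audited)
    reward = 1.0 if result.ok and matches_reference(result.objective, problem_id) else 0.0
    return draft, audited, reward
```

Under this reading, because the draft turn carries no reward of its own, any policy-gradient credit for turn one arrives only through the audited program's success, which is one way to interpret the cross-turn synergy the framework is said to encourage.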