The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines typically rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it from 81.14 to 75.82. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
arXiv
👥 Ya-Qi Yu, Fangyu Hong, Xiangyang Qu et al.
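The checklist-style rubric described above can be pictured as a scorer in which essential criteria gate the score and additional criteria add partial credit. The criteria, names, and scoring scheme below are illustrative assumptions, not rDPO's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    """Checklist-style rubric: essential criteria gate the score,
    additional criteria contribute partial credit (illustrative scheme)."""
    essential: list   # predicates a response must satisfy
    additional: list  # predicates that add credit

    def score(self, response: str) -> float:
        # If any essential criterion fails, the response scores 0.
        if not all(check(response) for check in self.essential):
            return 0.0
        if not self.additional:
            return 1.0
        extra = sum(check(response) for check in self.additional)
        return 1.0 + extra / len(self.additional)

# Hypothetical criteria for an image-instruction pair asking to count objects.
rubric = Rubric(
    essential=[lambda r: "3" in r],               # must state the correct count
    additional=[lambda r: "apple" in r.lower()],  # names the object class
)

assert rubric.score("There are 3 apples.") == 2.0
assert rubric.score("There are 4 apples.") == 0.0
```

Because every response is scored against the same fixed checklist, responses from any policy (on- or off-policy) are comparable on the same scale, which is what makes the offline instruction-rubric pool reusable during on-policy data construction.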
Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2\% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4\% vs.\ 71\% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves on this across all configurations, scoring 7--14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80--0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
arXiv
👥 Cristian Minoccheri, Emily Wittrup, Kayvan Najarian et al.
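The multi-window RGB encoding the abstract describes (mapping complementary Hounsfield Unit windows onto the three color channels of one image) can be sketched as follows. The window centers and widths are common CT presets chosen for illustration; the paper's exact settings are not stated here:

```python
import numpy as np

def hu_window(slice_hu, center, width):
    """Clip a CT slice to a Hounsfield Unit window and rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (np.clip(slice_hu, lo, hi) - lo) / (hi - lo)

def multi_window_rgb(slice_hu, windows=((40, 400), (50, 150), (-600, 1500))):
    """Stack three complementary HU windows as the R, G, B channels.
    The (center, width) presets here (soft tissue, narrow abdomen,
    lung-like) are illustrative assumptions, not the paper's parameters."""
    channels = [hu_window(slice_hu, c, w) for c, w in windows]
    return np.stack(channels, axis=-1)  # H x W x 3, float in [0, 1]

slice_hu = np.random.randint(-1024, 1500, size=(64, 64)).astype(np.float32)
rgb = multi_window_rgb(slice_hu)
assert rgb.shape == (64, 64, 3) and rgb.min() >= 0.0 and rgb.max() <= 1.0
```

The design intuition matches the finding: each channel spends its full dynamic range on one diagnostically relevant tissue-contrast band, instead of spending channels on additional anatomical planes.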
AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation--predefined quizzes, uniform pacing, or generic feedback--limiting their ability to respond to learners' evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner's interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.
arXiv
👥 Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja et al.
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, with a small shared token set concentrating most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level rewards comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
arXiv
👥 Yaxuan Li, Yuxin Zuo, Bingxiang He et al.
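One way to picture the "small shared token set concentrating most of the probability mass" is to intersect the student's and teacher's nucleus (top-p) token sets at a state and measure how much probability each model places on that intersection. This diagnostic is an illustrative assumption, not the paper's exact measurement:

```python
import numpy as np

def nucleus(probs, mass=0.99):
    """Smallest set of token ids whose cumulative probability >= mass."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, mass)) + 1
    return set(order[:k].tolist())

def shared_mass(p_student, p_teacher, mass=0.99):
    """Probability mass each model puts on the shared nucleus tokens.
    A hypothetical diagnostic for student/teacher overlap at one state."""
    shared = nucleus(p_student, mass) & nucleus(p_teacher, mass)
    idx = np.array(sorted(shared), dtype=int)
    return p_student[idx].sum(), p_teacher[idx].sum(), len(shared)

# Toy distributions over a 32-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=32)
p = np.exp(logits) / np.exp(logits).sum()
q = np.roll(p, 1)  # a hypothetical teacher distribution
s_mass, t_mass, n_shared = shared_mass(p, q)
```

Tracking such a statistic over training would show the progressive alignment on high-probability tokens that the abstract reports for successful OPD runs.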
This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.
arXiv
👥 Yinghao Qin, Mosab Bazargani, Edmund K. Burke et al.
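The Late Acceptance Hill Climbing core that b-LAHC builds on accepts a candidate if it beats either the current solution or the cost accepted a fixed number of steps earlier, giving the single-parameter behavior the abstract highlights. Below is a generic single-level LAHC sketch on a toy routing-like problem, not the bilevel b-LAHC itself:

```python
import random

def lahc(cost, initial, neighbor, history_len=50, max_evals=20000, seed=0):
    """Late Acceptance Hill Climbing: accept a candidate if it beats the
    current solution OR the cost recorded history_len iterations ago.
    One fixed parameter (history_len), matching the fixed-parameter spirit
    of b-LAHC; this is a plain single-level sketch for illustration."""
    rng = random.Random(seed)
    cur_cost = cost(initial)
    cur, best, best_cost = initial, initial, cur_cost
    history = [cur_cost] * history_len
    for i in range(max_evals):
        cand = neighbor(cur, rng)
        c = cost(cand)
        if c <= cur_cost or c <= history[i % history_len]:
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        history[i % history_len] = cur_cost  # record the accepted cost
    return best, best_cost

# Toy instance: order points on a line to minimize total travel distance.
pts = [3, 9, 1, 7, 0, 5]
cost = lambda tour: sum(abs(tour[i] - tour[i + 1]) for i in range(len(tour) - 1))
def swap(tour, rng):
    a, b = rng.sample(range(len(tour)), 2)
    t = list(tour); t[a], t[b] = t[b], t[a]
    return t

best, best_cost = lahc(cost, pts, swap)
```

Because worse candidates are still accepted when they beat the delayed history entry, the search can escape local optima without any cooling schedule or adaptive parameters.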
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
arXiv
👥 Yecheng Wu, Song Han, Hai Cai
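One way to picture the precompute-and-reuse step is a per-token loss computed against teacher log-probabilities cached once over SFT rollouts, so no teacher server runs during training. The loss form below is a simplified stand-in, not Lightning OPD's published objective:

```python
import numpy as np

def cached_opd_loss(student_logits, teacher_logp_cached, tokens):
    """Per-token loss on a cached SFT rollout, comparing the student's
    log-probabilities of the rollout tokens to teacher log-probs that were
    precomputed offline. An illustrative surrogate, not the paper's exact
    objective or its teacher-consistency machinery."""
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    student_logp = logp[np.arange(len(tokens)), tokens]
    # Penalize the student where it assigns less log-probability than the
    # cached teacher to each rollout token.
    return float((teacher_logp_cached - student_logp).mean())

rng = np.random.default_rng(0)
T, V = 8, 50                                 # toy sequence length, vocab
tokens = rng.integers(0, V, size=T)          # an SFT rollout
teacher_logits = rng.normal(size=(T, V))
t_shift = teacher_logits - teacher_logits.max(axis=-1, keepdims=True)
teacher_logp = (t_shift - np.log(np.exp(t_shift).sum(axis=-1, keepdims=True))
                )[np.arange(T), tokens]      # cached once, offline
loss = cached_opd_loss(rng.normal(size=(T, V)), teacher_logp, tokens)
```

The cached array is tiny (one scalar per rollout token), which is the source of the infrastructure savings relative to serving the full teacher online.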
Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77--100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers a 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59--96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals a 23% drop, exposing a methodological blind spot in how constrained generation is assessed.
arXiv
👥 Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu et al.
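The linear-probe methodology (fit a linear map from prompt representations to eventual response length, report held-out $R^2$) can be sketched with a closed-form ridge probe. The data here are synthetic stand-ins for hidden states and lengths; the paper's features, targets, and regularization are assumptions:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression probe: w = (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def r2(y, y_hat):
    """Coefficient of determination on held-out data (can be negative)."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Synthetic stand-ins: rows are hypothetical prompt representations,
# targets are the response lengths the model will later produce.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
w_true = rng.normal(size=16)
y = X @ w_true + 0.1 * rng.normal(size=200)

w = fit_ridge(X[:150], y[:150])          # fit probe on a train split
score = r2(y[150:], X[150:] @ w)         # evaluate on held-out prompts
```

A high held-out $R^2$ means response length is linearly decodable from the prompt representation before generation begins; a negative $R^2$, as reported for base models, means the probe predicts worse than the held-out mean.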
Logical vulnerabilities in software stem from flaws in program logic rather than memory safety, and they can lead to critical security failures. Although existing automated program repair techniques primarily focus on repairing memory corruption vulnerabilities, they struggle with logical vulnerabilities because of their limited semantic understanding of the vulnerable code and its expected behavior. On the other hand, recent successes of large language models (LLMs) in understanding and repairing code are promising. However, no framework currently exists to analyze the capabilities and limitations of such techniques for logical vulnerabilities. This paper aims to systematically evaluate both traditional and LLM-based repair approaches for addressing real-world logical vulnerabilities. To facilitate our assessment, we created the first ever dataset, LogicDS, of 86 logical vulnerabilities with assigned CVEs reflecting tangible security impact. We also developed a systematic framework, LogicEval, to evaluate patches for logical vulnerabilities. Evaluation shows that compilation and test failures are primarily driven by prompt sensitivity, loss of code context, and difficulty in patch localization.
arXiv
👥 Syed Md Mukit Rashid, Abdullah Al Ishtiaq, Kai Tu et al.
Execution accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than its consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: the SQL Prover independently assesses the semantic correctness of a predicted SQL query against the user's intent, while the Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
arXiv
👥 Wenqi Pei, Shizheng Hou, Boyan Li et al.
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural-language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capabilities and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.
arXiv
👥 Joel Fokou
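The Information Flow Control principle (propagating data sensitivity labels through agent workflows and checking them at action boundaries) can be sketched as a label lattice with a join rule and a sink check. The label names and check function are illustrative assumptions, not OpenParallax's API:

```python
from enum import IntEnum

class Label(IntEnum):
    """A minimal sensitivity lattice; real deployments define their own."""
    PUBLIC = 0
    INTERNAL = 1
    SECRET = 2

def propagate(*input_labels):
    """A workflow step's output carries the join (max) of its input labels."""
    return Label(max(input_labels, default=Label.PUBLIC))

def check_sink(action, data_label, sink_clearance):
    """Block context-dependent exfiltration: data may flow only to sinks
    cleared at or above its label (hypothetical enforcement point)."""
    if data_label > sink_clearance:
        raise PermissionError(f"{action}: {data_label.name} data to a "
                              f"{sink_clearance.name}-cleared sink")

# A step that read a secret file taints everything derived from it.
doc = propagate(Label.PUBLIC, Label.SECRET)
assert doc == Label.SECRET

# The validator blocks a web request carrying the tainted data.
try:
    check_sink("http_post", doc, Label.PUBLIC)
    blocked = False
except PermissionError:
    blocked = True
assert blocked
```

Because the check runs outside the reasoning system, it holds even when that system is fully compromised, which is the Assume-Compromise property the evaluation targets.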