The stable operation of autonomous off-grid photovoltaic systems depends on solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit severe anomalies, chiefly pronounced temporal phase lag during cloud transients and physically impossible nocturnal generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The proposed methodology projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m² and a Pearson correlation of 0.988. The model strictly maintains zero nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
arXiv
👥 Mohammed Ezzaldin Babiker Abdullah
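The multiplicative clear-sky gating described in the abstract above is compact enough to sketch. Below is a minimal, hypothetical Python illustration: a learned opacity factor in (0, 1) multiplies a deterministic clear-sky irradiance ceiling, so the forecast is structurally zero whenever the sun is below the horizon. The simple cosine-of-zenith clear-sky model and all names here are assumptions, not the paper's actual formulation.

```python
import numpy as np

SOLAR_CONSTANT = 1361.0  # W/m^2, top-of-atmosphere irradiance

def clear_sky_ceiling(cos_zenith: np.ndarray, transmittance: float = 0.75) -> np.ndarray:
    """Toy clear-sky upper bound: zero whenever the sun is below the horizon."""
    cos_z = np.clip(cos_zenith, 0.0, None)  # negative cosine => night => 0
    return SOLAR_CONSTANT * transmittance * cos_z

def alpha_gate(raw_logits: np.ndarray, cos_zenith: np.ndarray) -> np.ndarray:
    """Multiplicative gate: prediction = sigmoid(logit) * clear-sky ceiling.

    The network can only attenuate the physical ceiling, never exceed it,
    so nocturnal output is exactly zero by construction.
    """
    opacity = 1.0 / (1.0 + np.exp(-raw_logits))  # learned atmospheric opacity in (0, 1)
    return opacity * clear_sky_ceiling(cos_zenith)

# Toy check: at night (cos_zenith < 0) the forecast is zero regardless of the logits.
logits = np.array([3.2, 0.1, -1.5])
cos_z = np.array([0.8, 0.3, -0.2])  # last entry: sun below horizon
print(alpha_gate(logits, cos_z))     # nocturnal entry is exactly 0.0
```

Because the gate is multiplicative rather than additive, no learned parameter setting can produce nonzero nighttime output, which is presumably how the zero-magnitude nocturnal error is enforced structurally.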
To identify safety violations, auditors typically search over large sets of agent traces. This search is difficult because failures are typically rare, complex, and sometimes even adversarially hidden, and are only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons: per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task-gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agentic benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.
arXiv
👥 Adam Stein, Davis Brown, Hamed Hassani et al.
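A rough sketch of the cluster-then-investigate loop from the abstract above. Everything here is a stand-in assumption: traces are embedded with TF-IDF rather than a learned encoder, clusters are ranked by similarity to the natural-language violation spec, and a plain function stands in for the agentic investigation step; the real Meerkat pipeline is more elaborate.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def audit_traces(traces: list[str], violation_spec: str,
                 n_clusters: int = 8, top_k: int = 3) -> list[dict]:
    """Cluster agent traces, then investigate the clusters most similar to the
    natural-language violation description (a stand-in for agentic search)."""
    vec = TfidfVectorizer(max_features=2048)
    X = vec.fit_transform(traces + [violation_spec])
    trace_vecs, spec_vec = X[:-1], X[-1]

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(trace_vecs)

    # Rank clusters by centroid similarity to the violation spec.
    sims = (km.cluster_centers_ @ spec_vec.T.toarray()).ravel()
    suspicious = np.argsort(sims)[::-1][:top_k]

    findings = []
    for c in suspicious:
        members = [traces[i] for i in np.where(km.labels_ == c)[0]]
        # In the real system an LLM agent would read these traces together and
        # decide whether the cross-trace pattern constitutes a violation.
        findings.append({"cluster": int(c), "n_traces": len(members), "sample": members[0]})
    return findings
```

The key idea this preserves is that failures invisible in any single trace can surface as a cluster-level pattern, which an agent then inspects adaptively instead of enumerating every trace.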
We have witnessed significant advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
arXiv
👥 Mihir Prabhudesai, Aryan Satpathy, Yangmin Li et al.
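The data-generation recipe above can be illustrated with a self-contained toy: sample random scenes, simulate them (an analytic solution stands in here for a full physics engine), and emit QA pairs whose ground-truth answers come from the simulation rather than scraped text. The question template and function names are illustrative assumptions.

```python
import random

G = 9.81  # m/s^2

def make_projectile_qa(seed: int) -> dict:
    """Sample a random scene, 'simulate' it, and return a QA pair whose
    ground-truth answer is computed by the simulator."""
    rng = random.Random(seed)
    v0 = rng.uniform(5.0, 30.0)   # launch speed, m/s
    h0 = rng.uniform(0.0, 20.0)   # launch height, m
    # Analytic stand-in for a physics engine: solve h0 + v0*t - 0.5*G*t^2 = 0
    # for the time at which the projectile hits the ground.
    t_hit = (v0 + (v0**2 + 2 * G * h0) ** 0.5) / G
    question = (f"A ball is thrown straight up at {v0:.1f} m/s from a height of "
                f"{h0:.1f} m. How long until it hits the ground (g = 9.81 m/s^2)?")
    return {"question": question, "answer": f"{t_hit:.2f} s"}

dataset = [make_projectile_qa(i) for i in range(1000)]  # scalable synthetic QA
print(dataset[0]["question"])
print(dataset[0]["answer"])
```

Since the answer is derived from the dynamics themselves, the generator can be scaled arbitrarily without the coverage and contamination limits of internet QA data, which is the core claim of the paper.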
Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to evaluate, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep-learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware, uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Overall, integrating calibration with efficient ensembling appears to be a promising strategy for implementing a budget-aware QA workflow for radiotherapy segmentation.
arXiv
👥 Ricardo Coimbra Brioso, Lorenzo Mondo, Damiano Dei et al.
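The two core ingredients above, voxel-wise predictive entropy and temperature scaling, are standard and compact enough to sketch. The hypothetical NumPy snippet below computes an entropy map from ensemble-averaged softmax probabilities and shows how a scalar temperature (fit on validation data) rescales logits before the softmax; shapes and names are assumptions, not the paper's code.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0, axis: int = -1) -> np.ndarray:
    """Temperature scaling: divide logits by T before the softmax."""
    z = logits / temperature
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def predictive_entropy(member_logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Voxel-wise entropy of the mean predictive distribution.

    member_logits: (n_members, n_classes, *spatial) logits from an ensemble
    (deep/checkpoint ensemble or TTA passes). Returns an entropy map over voxels.
    """
    probs = softmax(member_logits, temperature, axis=1).mean(axis=0)  # (n_classes, *spatial)
    return -(probs * np.log(probs + 1e-12)).sum(axis=0)

# Toy volume: 4 ensemble members, 3 classes, 8x8x8 voxels.
logits = np.random.randn(4, 3, 8, 8, 8)
unc_map = predictive_entropy(logits, temperature=1.5)
review_mask = unc_map >= np.quantile(unc_map, 0.95)  # top-5% voxels for manual review
print(unc_map.shape, review_mask.sum())
```

The top-quantile mask is the "budget-aware" part: under a fixed revision budget, only the most uncertain fraction of voxels is flagged for manual review.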
Recently, large language models (LLMs) have become capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, such as phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.
arXiv
👥 Chenxi Qing, Junxi Wu, Zheng Liu et al.
Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of reasoning observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that, for many of the studied models, each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of reasoning that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.
arXiv
👥 Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron et al.
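A minimal illustration of the fixed-point diagnostic described above: iterate a toy recurrent block and track the distance between the latent state at the same layer position in consecutive cycles; that distance shrinking toward zero is the signature of a cyclic fixed point. The block here is a random contractive affine map with tanh, purely for illustration; real looped LMs use transformer layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, cycle_len, n_cycles = 64, 4, 12

# Toy recurrent block: cycle_len affine+tanh layers looped in depth. Scaling
# by 0.4/sqrt(d) keeps each map contractive so fixed points provably emerge.
Ws = [0.4 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(cycle_len)]
bs = [0.1 * rng.standard_normal(d) for _ in range(cycle_len)]

h = rng.standard_normal(d)
per_layer_states = [[] for _ in range(cycle_len)]  # state after each layer, per cycle

for _ in range(n_cycles):
    for layer in range(cycle_len):
        h = np.tanh(Ws[layer] @ h + bs[layer])
        per_layer_states[layer].append(h.copy())

# Distance between the same layer's state in consecutive cycles: if each
# layer converges to its own fixed point, these distances decay toward 0,
# i.e. the block settles onto a consistent cyclic trajectory.
for layer in range(cycle_len):
    deltas = [np.linalg.norm(a - b) for a, b in
              zip(per_layer_states[layer][1:], per_layer_states[layer][:-1])]
    print(f"layer {layer}: {deltas[0]:.3f} -> {deltas[-1]:.3e}")
```

The paper's analysis applies essentially this per-layer convergence measurement to trained looped LMs rather than a synthetic contraction.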
Tool-augmented large language model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.
arXiv
👥 Wei Zhao, Zhe Li, Peixin Zhang et al.
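The enforcement idea above, deterministic checks at the tool-call boundary instead of alignment-based refusal, fits in a few lines. The sketch below is a hypothetical wrapper, not ClawGuard's actual API: rules derived from the user's objective are confirmed up front, then every tool call is validated against them (default-deny) before execution.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Rule:
    tool: str                      # tool name this rule constrains
    allow: Callable[[dict], bool]  # predicate over the call's arguments

class ToolCallGuard:
    """Deterministic gate at the tool-call boundary: a call executes only if
    some user-confirmed rule explicitly allows it (default-deny)."""

    def __init__(self, rules: list[Rule]):
        self.rules = rules  # in the real system, derived from the task and user-confirmed

    def invoke(self, tool: str, args: dict, impl: Callable[..., Any]) -> Any:
        if not any(r.tool == tool and r.allow(args) for r in self.rules):
            raise PermissionError(f"blocked tool call: {tool}({args})")
        return impl(**args)

# Task: "summarize the report at ./report.txt" -> derived rule: read only that file.
guard = ToolCallGuard([Rule("read_file", lambda a: a.get("path") == "./report.txt")])

def read_file(path: str) -> str:
    return open(path).read()

# An injected instruction telling the agent to read ~/.ssh/id_rsa is blocked
# deterministically, no matter how persuasive the injected text is.
try:
    guard.invoke("read_file", {"path": "/home/user/.ssh/id_rsa"}, read_file)
except PermissionError as e:
    print(e)
```

The defense does not depend on the model resisting the injection: even a fully compromised agent cannot produce a real-world effect outside the confirmed rule set.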
Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.
arXiv
👥 Jiayuan Rao, Tianlin Gui, Haoning Wu et al.
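To make the "stochastic process over trajectories" framing above concrete, here is a generic DDPM-style ancestral sampling loop over a multi-player trajectory tensor, with a stub standing in for the trained, context-conditioned denoiser. Shapes, the noise schedule, and the conditioning interface are all assumptions, not GenTac's actual design.

```python
import numpy as np

T_STEPS, N_PLAYERS, HORIZON = 50, 22, 100  # diffusion steps, players, frames

# Linear beta schedule and derived quantities (standard DDPM bookkeeping).
betas = np.linspace(1e-4, 0.02, T_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t: np.ndarray, t: int, context: dict) -> np.ndarray:
    """Stub for the trained network predicting eps_theta(x_t, t, context).
    In GenTac this would condition on opponents, playing style, and objectives."""
    return np.zeros_like(x_t)  # placeholder; the real model is learned from tracking data

def sample_trajectories(context: dict, rng=np.random.default_rng(0)) -> np.ndarray:
    """Ancestral sampling: start from Gaussian noise and iteratively denoise.
    Returns (N_PLAYERS, HORIZON, 2) xy-trajectories; resampling yields new futures."""
    x = rng.standard_normal((N_PLAYERS, HORIZON, 2))
    for t in reversed(range(T_STEPS)):
        eps = denoiser(x, t, context)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

futures = [sample_trajectories({"style": "high-press"}) for _ in range(3)]  # diverse samples
print(futures[0].shape)
```

Sampling repeatedly from the same context yields distinct plausible futures, which is exactly the branching-possibilities property the abstract contrasts with single deterministic forecasts.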
GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI-agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves a 17.1% Success Rate on MobileWorld GUI-only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
arXiv
👥 Fei Tang, Zhiqiong Lu, Boxuan Zhang et al.
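One concrete piece of the RL stack above is dense step-level supervision from a Process Reward Model. Below is a hypothetical sketch of how per-step PRM scores could be blended with the sparse episode outcome to form discounted returns for the policy update; GiGPO itself is not reproduced here, and all names and weights are assumptions.

```python
from typing import Callable

def dense_step_returns(steps: list[dict],
                       episode_success: bool,
                       prm_score: Callable[[dict], float],
                       gamma: float = 0.99,
                       prm_weight: float = 0.3) -> list[float]:
    """Blend sparse outcome reward with per-step PRM scores, then discount.

    steps: one dict per agent action (screenshot, action, etc.).
    prm_score: Process Reward Model mapping a step to a score in [0, 1].
    Returns discounted returns, one per step, for a policy-gradient update.
    """
    outcome = 1.0 if episode_success else 0.0
    rewards = [prm_weight * prm_score(s) for s in steps]
    rewards[-1] += (1.0 - prm_weight) * outcome  # terminal outcome reward

    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Toy usage with a stub PRM that prefers taps over swipes.
steps = [{"action": "tap"}, {"action": "swipe"}, {"action": "tap"}]
print(dense_step_returns(steps, episode_success=True,
                         prm_score=lambda s: 1.0 if s["action"] == "tap" else 0.4))
```

The PRM term gives the agent credit for good intermediate steps even in failed episodes, which is the usual motivation for dense step-level supervision in long-horizon GUI tasks.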
Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, especially in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to broader, more general contexts, often called general reasoning, remains underexplored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to evaluate general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluation across 26 leading LLMs reveals that even the best-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performance of LLMs on mathematics and physics benchmarks. These results indicate that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, dataset, and leaderboard: https://general365.github.io
arXiv
👥 Junlin Liu, Shengnan An, Shuang Zhou et al.