Monthly Digest — 2026-02
348 unique stories across 28 days and 8 sources.
Hacker News(112)
- Defeating a 40-year-old copy protection dongle (dmitrybrant.com)
- 1-Click RCE to steal your Moltbot data and keys (depthfirst.com)
- TIL: Apple Broke Time Machine Again on Tahoe (taoofmac.com)
- I taught my neighbor to keep the volume down (idiallo.com)
- xAI joins SpaceX (www.spacex.com)
- Anki ownership transferred to AnkiHub (forums.ankiweb.net)
- The Codex App (openai.com)
- Hacking Moltbook (www.wiz.io)
- Lessons Learned Shipping 500 Units of My First Hardware Product (www.simonberens.com)
- 221 Cannon is Not For Sale (fredbenenson.com)
- Xcode 26.3 – Developers can leverage coding agents directly in Xcode (www.apple.com)
- X offices raided in France (apnews.com)
- Claude Code: connect to a local model when your quota runs out (boxc.net)
- How Jeff Bezos Brought Down the Washington Post (www.newyorker.com)
- Yawning has an unexpected influence on the fluid inside your brain (www.newscientist.com)
- The Great Unwind (occupywallst.com)
- It's 2026, Just Use Postgres (www.tigerdata.com)
- LinkedIn checks for 2953 browser extensions (github.com)
- My AI Adoption Journey (mitchellh.com)
- Flock CEO calls Deflock a “terrorist organization” (2025) [video] (www.youtube.com)
GitHub Trending(59)
- openclaw / openclaw
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
- ThePrimeagen / 99
Neovim AI agent done right
- pedramamini / Maestro
Agent Orchestration Command Center
- kovidgoyal / calibre
The official source code repository for the calibre ebook manager
- thedotmack / claude-mem
A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future sessions.
- termux / termux-app
Termux - a terminal emulator application for Android OS extendible by variety of packages.
- masoncl / review-prompts
AI review prompts
- openai / skills
Skills Catalog for Codex
- automazeio / ccpm
Project management system for Claude Code using GitHub Issues and Git worktrees for parallel agent execution.
- disler / claude-code-hooks-mastery
Master Claude Code Hooks
- OpenBMB / ChatDev
ChatDev 2.0: Dev All through LLM-powered Multi-Agent Collaboration
- bytedance / UI-TARS-desktop
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
- j178 / prek
⚡ Better `pre-commit`, re-engineered in Rust
- nvm-sh / nvm
Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions
- likec4 / likec4
Visualize, collaborate, and evolve the software architecture with always actual and live diagrams from your code
- KeygraphHQ / shannon
Fully autonomous AI hacker to find actual exploits in your web apps. Shannon has achieved a 96.15% success rate on the hint-free, source-aware XBOW Benchmark.
- microsoft / litebox
A security-focused library OS supporting kernel- and user-mode execution
- p-e-w / heretic
Fully automatic censorship removal for language models
- pydantic / monty
A minimal, secure Python interpreter written in Rust for use by AI
- virattt / dexter
An autonomous agent for deep financial research
Hugging Face(86)
- Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives
Autonomous scientific discovery with large language model (LLM)-based agents has recently made substantial progress, demonstrating the ability to automate end-to-end research workflows. However, existing systems largely rely on runtime-centric execution paradigms, repeatedly reading, summarizing, and reasoning over large volumes of scientific literature online. This on-the-spot computation strategy incurs high computational cost, suffers from context window limitations, and often leads to brittle reasoning and hallucination. We propose Idea2Story, a pre-computation-driven framework for autonomous scientific discovery that shifts literature understanding from online reasoning to offline knowledge construction. Idea2Story continuously collects peer-reviewed papers together with their review feedback, extracts core methodological units, composes reusable research patterns, and organizes them into a structured methodological knowledge graph. At runtime, underspecified user research intents are aligned to established research paradigms, enabling efficient retrieval and reuse of high-quality research patterns instead of open-ended generation and trial-and-error. By grounding research planning and execution in a pre-built knowledge graph, Idea2Story alleviates the context window bottleneck of LLMs and substantially reduces repeated runtime reasoning over literature. We conduct qualitative analyses and preliminary empirical studies demonstrating that Idea2Story can generate coherent, methodologically grounded, and novel research patterns, and can produce several high-quality research demonstrations in an end-to-end setting. These results suggest that offline knowledge construction provides a practical and scalable foundation for reliable autonomous scientific discovery.
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models
Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuned results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.
- Scaling Embeddings Outperforms Scaling Experts in Language Models
While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
- DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
- ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
- Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, on both the forward and backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II.
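The abstract positions MS-EDEN against stochastic rounding (SR), the standard unbiased quantizer used as a baseline. A minimal sketch of plain SR on a uniform grid (this is the baseline technique, not MS-EDEN, whose details the abstract does not give):

```python
import random

def stochastic_round(x: float, step: float = 1.0) -> float:
    """Round x to the grid {k * step} stochastically so that E[result] == x.

    Rounds up with probability equal to x's fractional position inside
    its grid cell, and down otherwise -- the standard unbiased quantizer
    that Quartet II's MS-EDEN routine is compared against.
    """
    lo = (x // step) * step          # nearest grid point at or below x
    frac = (x - lo) / step           # position within the cell, in [0, 1)
    return lo + step if random.random() < frac else lo

# Unbiasedness check: the empirical mean over many draws approaches x.
random.seed(0)
x = 0.3
mean = sum(stochastic_round(x) for _ in range(100_000)) / 100_000
assert abs(mean - x) < 0.01
```

Unbiasedness is what makes SR attractive for gradient quantization: rounding errors average out across steps instead of accumulating as systematic bias.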
- THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
- Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
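The Golden Goose construction (mask a key reasoning step, add distractors, verify by option letter) can be sketched as a task-building function. In the real pipeline an LLM selects the step to mask and writes the distractors; here both are supplied by the caller, so this is illustrative only:

```python
import random

def make_fitm_mcq(text: str, masked_step: str, distractors: list[str],
                  seed: int = 0) -> dict:
    """Build one RLVR task in the fill-in-the-middle multiple-choice style
    described for Golden Goose. The masked step and distractors would be
    LLM-generated in the actual pipeline; here they are given (sketch)."""
    assert masked_step in text
    question = text.replace(masked_step, "[MASKED]", 1)
    options = distractors + [masked_step]
    random.Random(seed).shuffle(options)        # deterministic shuffle
    answer = "ABCD"[options.index(masked_step)]  # verifiable ground truth
    return {"question": question, "options": options, "answer": answer}

task = make_fitm_mcq(
    text="Heat rises, so the balloon expands and its density drops below air.",
    masked_step="its density drops below air",
    distractors=["its mass increases", "the surrounding air cools",
                 "internal pressure falls to zero"],
)
assert "[MASKED]" in task["question"]
assert task["options"][ord(task["answer"]) - ord("A")] == "its density drops below air"
```

The point of the format is that the reward is trivially verifiable (exact option match), which is what lets unverifiable corpora feed RLVR.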
- Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface enabling a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
- UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.
- SWE-Universe: Scale Real-World Verifiable Environments to Millions
We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and prohibitive cost, our framework utilizes a building agent powered by an efficient custom-trained model. This agent employs iterative self-verification and in-loop hacking detection to ensure the reliable generation of high-fidelity, verifiable tasks. Using this method, we scale the number of real-world multilingual SWE environments to a million scale (807,693). We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning. Finally, we applied this technique to Qwen3-Max-Thinking and achieved a score of 75.3% on SWE-Bench Verified. Our work provides both a critical resource and a robust methodology to advance the next generation of coding agents.
- PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Codes are publicly available at https://github.com/Zehong-Ma/PixelGen.
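PixelGen's training signal is the usual pixel-space regression loss plus two perceptual terms (LPIPS for local patterns, a DINO-feature loss for global semantics). A toy sketch of that combination, with trivial stand-ins for the perceptual networks and assumed weights (the paper's actual loss weights and network details are not given here):

```python
def pixelgen_style_loss(pred, target, lpips_fn, dino_fn,
                        w_lpips=1.0, w_dino=1.0):
    """Illustrative combination of PixelGen's training signals: pixel-space
    MSE plus two perceptual terms. lpips_fn/dino_fn stand in for the real
    perceptual networks; the weights are assumptions, not the paper's."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + w_lpips * lpips_fn(pred, target) + w_dino * dino_fn(pred, target)

# Toy stand-ins for the perceptual networks (the real ones are deep nets).
lpips = lambda a, b: abs(a[0] - b[0])
dino = lambda a, b: abs(sum(a) - sum(b)) / len(a)
loss = pixelgen_style_loss([1.0, 0.0], [0.5, 0.5], lpips, dino)
assert loss > 0
```

The design idea is that the perceptual terms steer the model toward a perceptually meaningful manifold, so it stops spending capacity on imperceptible pixel detail.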
- CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.
- AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration
Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple ⟨Instruction, Context, Tools, Model⟩. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. This design reduces human engineering effort and remains framework-agnostic, with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto efficiency. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves a 16.28% relative improvement over the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra
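The agent-as-tuple abstraction is small enough to sketch directly. Field and function names below are illustrative, not AOrchestra's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentSpec:
    """AOrchestra-style abstraction: an agent is just the tuple
    (Instruction, Context, Tools, Model). Names here are illustrative."""
    instruction: str                                      # what this executor must do
    context: list[str] = field(default_factory=list)      # curated, task-relevant snippets
    tools: list[Callable] = field(default_factory=list)   # capabilities granted to it
    model: str = "gemini-3-flash"                         # backing LLM (assumed default)

def spawn(task: str, snippets: list[str], tools: list[Callable]) -> AgentSpec:
    """The orchestrator concretizes the tuple at each step, creating a
    specialized executor on the fly rather than reusing a fixed sub-agent."""
    return AgentSpec(instruction=task, context=snippets, tools=tools)

agent = spawn("Fix the failing test", ["traceback excerpt"], [print])
assert agent.model == "gemini-3-flash"
```

Because the recipe is data rather than code, the orchestrator can trade cost for capability per step, e.g. by swapping in a cheaper model for routine sub-tasks.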
- No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs
This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) have been shown to latently plan subsequent reasoning before the CoT emerges, diminishing the significance of explicit CoT; yet CoT remains critical for tasks requiring multi-step reasoning. To deepen the understanding of the relationship between an LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs by applying our probing method, Tele-Lens, to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, and validate that a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele-lens.
- MARS: Modular Agent with Reflective Search for Automated AI Research
Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
- ERNIE 5.0 Technical Report
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
- FASA: Frequency-aware Sparse Attention
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head, providing a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Since it accesses only a small fraction of the KV cache, FASA drastically lowers memory bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while keeping only 256 tokens, and achieves a 2.56x speedup using just 18.9% of the cache on AIME24.
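The two-step procedure (cheap proxy scoring over dominant frequency chunks, then full attention over the survivors) can be sketched as follows. How dominant chunks are identified is the paper's contribution; here they are simply given, and dimensions stand in for frequency chunks:

```python
def proxy_topk(query, keys, dominant_dims, k):
    """FASA-style token-eviction sketch: score each cached token with a
    dot product restricted to the 'dominant' dimensions (a cheap proxy
    for the full attention score), then keep the top-k tokens. The real
    method selects dominant frequency chunks of RoPE dimensions; this
    sketch just takes them as input (illustrative)."""
    def proxy_score(key):
        return sum(query[d] * key[d] for d in dominant_dims)
    ranked = sorted(range(len(keys)), key=lambda i: proxy_score(keys[i]),
                    reverse=True)
    return sorted(ranked[:k])   # indices of tokens kept for full attention

query = [1.0, 0.0, 2.0, 0.0]
keys = [[1.0, 9.0, 1.0, 9.0],   # large only on non-dominant dims
        [1.0, 0.0, 1.0, 0.0],
        [0.1, 0.0, 0.1, 0.0]]
kept = proxy_topk(query, keys, dominant_dims=[0, 2], k=2)
assert kept == [0, 1]
```

The savings come from the proxy touching only a few dimensions per cached token, so the full KV cache is read only for the small surviving subset.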
- WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.
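The width-scaling pattern (a lead agent decomposes a broad query, fans it out to parallel subagents with isolated contexts, and merges their findings) can be sketched with threads standing in for subagent LLM calls. All names here are illustrative, not WideSeek-R1's API:

```python
from concurrent.futures import ThreadPoolExecutor

def subagent(subtask: str) -> set[str]:
    """Stand-in for one subagent run. In WideSeek-R1 each subagent is the
    same shared LLM with an isolated context and specialized tools; here
    it just 'finds' two items deterministically (illustrative)."""
    return {f"{subtask}:item{i}" for i in range(2)}

def lead_agent(task: str, subtasks: list[str], width: int = 4) -> set[str]:
    """The lead agent runs subagents in parallel (the 'width' dimension)
    and merges their findings into one result set."""
    with ThreadPoolExecutor(max_workers=width) as pool:
        results = pool.map(subagent, subtasks)
    merged = set()
    for r in results:
        merged |= r
    return merged

found = lead_agent("broad information-seeking task", ["west", "east", "north"],
                   width=3)
assert len(found) == 6   # 3 subtasks x 2 items each, no overlap
```

The RL contribution is joint optimization of decomposition and execution; the abstract's scaling result is that adding subagents keeps improving recall, which only works if the lead agent learns non-overlapping decompositions.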
- Training Data Efficiency in Multimodal Process Reward Models
Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency of MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: the label mixture of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
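The abstract names BIS's two factors (label mixture and reliability) but not how they combine, so the following is a plausible instantiation, not the paper's formula:

```python
def balanced_information_score(step_mc_scores, threshold=0.5):
    """Illustrative Balanced-Information Score for one rollout. The paper
    says BIS prioritizes (a) a balanced mixture of positive/negative
    steps and (b) reliability (mean MC score of positive steps); the
    exact combination below (parabolic balance term x reliability) is
    an assumption, not the paper's formula.

    step_mc_scores: per-step Monte Carlo correctness estimates in [0, 1].
    """
    pos = [s for s in step_mc_scores if s >= threshold]
    p = len(pos) / len(step_mc_scores)   # fraction of positive steps
    balance = 4 * p * (1 - p)            # peaks at p = 0.5, zero at 0 or 1
    reliability = sum(pos) / len(pos) if pos else 0.0
    return balance * reliability

# A mixed rollout with confident positives outranks an all-positive one,
# matching the intuition that uniform-label rollouts carry little gradient signal.
mixed = balanced_information_score([0.9, 0.95, 0.1, 0.2])
uniform = balanced_information_score([0.9, 0.9, 0.9, 0.9])
assert mixed > uniform
```

Selection then simply keeps the top-scoring fraction of rollouts, reusing the MC annotations already computed.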
Solidot(91)
- European Open Source Excellence Award goes to Greg Kroah-Hartman
Daniel Stenberg, cURL maintainer and the European Open Source Academy's 2025 Open Source Excellence Award winner, announced the 2026 winner: stable Linux kernel maintainer Greg Kroah-Hartman. He said it is hard to overstate the importance of Greg's work on Linux. In software, innovation grabs the headlines, but stability quietly safeguards lives and livelihoods. Every Android phone, every web server, every critical system running Linux depends on Greg's meticulous work. Because of his efforts, hospitals, banks, governments, and individuals can use Linux with confidence. His work represents the highest form of service: unglamorous, relentless, and indispensable.
- CERN receives $1B private donation to build the Future Circular Collider
CERN has received a $1 billion private donation toward building the Future Circular Collider (FCC), the first time in the organization's 72-year history that private and philanthropic funds have backed one of its major projects. The FCC is expected to succeed the Large Hadron Collider (LHC). The design calls for a giant tunnel roughly 90.7 km long, three times the length of the LHC's, at an average depth of about 200 meters underground. The FCC is CERN's preferred option for its next-generation collider; the proposal will be submitted to the CERN Council in May 2026, and if the Council approves it in 2028, construction of the FCC electron-positron collider (FCC-ee) would begin in 2030, with operations starting in 2047.
- Hack of Norwegian Nobel Institute may have leaked Peace Prize winner's name
The Norwegian Nobel Institute, which selects the Nobel Peace Prize laureate, has completed an internal investigation with help from security services and confirmed it was hacked. The early leak of the name of 2025 laureate Maria Corina Machado, the Venezuelan opposition leader, may have been caused by the attack. In the hours before Machado's name was announced last October, bets on her surged on the prediction platform Polymarket, even though she had not been considered a favorite for the prize.
- GNU gettext releases version 1.0 after more than 30 years of development
GNU gettext, the internationalization and localization library, has finally released its symbolic 1.0 version after more than 30 years of development. gettext's main strength is separating programming from translation. Major changes in GNU gettext 1.0 include improved PO (Portable Object) file handling; a new po-fetch program that retrieves already-translated PO files from online translation projects; new pre-translation programs msgpre and spit; improved support for the OCaml and Rust languages; and more. msgpre and spit can perform machine translation via a locally installed large language model: msgpre operates on an entire PO file, while spit handles a single message.
- The Sun unleashes an X8.11-class flare
Sunspot region AR4366, which emerged only days ago, released 17 M-class flares and 3 X-class flares within 24 hours, including an X8.11-class flare. It is one of the strongest flares of the past two decades and the third strongest of the current Solar Cycle 25. The Sun is in an active phase, and AR4366 remains highly unstable, meaning more intense flares are likely.
- Largest manga piracy site shut down, operator arrested
Japan's anti-piracy organization CODA (Content Overseas Distribution Association) announced that Shanghai police detained a man from Guangxi last November, accused of operating BATO.TO, the largest manga piracy site. BATO.TO was not just one site: it encompassed some 60 domains, including xbato.com, bato.to, and mangapark.io. The man, since released, admitted to running the sites and will face formal prosecution. Police seized his computers and are continuing the investigation, analyzing servers to identify additional operators. After his detention, the BATO sites kept operating for a time before shutting down entirely on January 19. The infringed Japanese publishers include Kadokawa, Kodansha, Shueisha, Shogakukan, and Square Enix. CODA's Beijing office filed a criminal complaint with the public security bureau on the publishers' behalf, and also sought cooperation from a Tencent subsidiary. The BATO network drew 350 million monthly visits, totaling 7.2 billion visits from October 2022 to October 2025.
- Blue Origin drops suborbital tourism to focus on lunar program
Blue Origin announced a two-year pause of its New Shepard program, a move that may mean the permanent end of its suborbital space tourism. Since entering service in 2015, the New Shepard rocket and capsule completed 38 launches, all but one successful, carrying 98 people on suborbital flights. Why end the company's longest-running program? CEO Dave Limp said the people and resources will be redirected to the crewed lunar landing effort. The decision surprised Blue Origin employees: the most recent suborbital flight carried six people just eight days earlier, the company has four New Shepard rockets in various stages plus two more capsules under construction, and it discussed expanding its launch site just last year. The program, however, has consistently lost money, tying up more than 500 employees and diverting the company's focus and resources.
- Bitcoin down 40% from its peak over the past four months
Bitcoin hit a record $123,742 in October 2025, but four months later has fallen to $76,000, down 40% from its peak. Bloomberg attributes the slide not to panic but to an absence of buyers, momentum, and confidence. There was no obvious trigger: demand weakened, liquidity thinned, and the price decoupled from the broader market. Even amid sharp swings in gold and silver prices in recent weeks, crypto barely moved. Bitcoin fell nearly 11% in January, its fourth straight monthly decline and the longest losing streak since 2018. Optimism is scarce on social media as well. Mainstream buyers' confidence is fading, with many who bought near the top now sitting on losses.
- Ultra-processed foods should be treated like cigarettes, not food
According to a study published in The Milbank Quarterly, researchers at Harvard, Duke, and the University of Michigan argue that ultra-processed foods resemble cigarettes far more than they resemble fruits or vegetables, and need stricter regulation. Ultra-processed foods are industrially produced products, often made with emulsifiers or artificial colors and flavors, such as soft drinks, chips, and cookies. The researchers say their production shares similarities with cigarettes: manufacturers work to optimize the product's "dose" and how quickly it acts on the body's reward pathways. Marketing foods as "low fat" or "sugar free" misleads consumers, much as 1950s advertising promoted cigarette filters as a protective innovation that in fact offered virtually no real benefit. The researchers argue regulators should borrow from tobacco control to regulate ultra-processed foods.
- Spain plans to ban social media for children under 16
Spanish Prime Minister Pedro Sanchez said on Tuesday that Spain plans to ban minors under 16 from social media, requiring platforms to introduce age-verification systems. He said children must be protected from the digital wild west. Australia became the first country to ban under-16s from social media last December, and countries including the UK and France are considering similar age limits. Sanchez said Spain will introduce a bill next week holding social media executives accountable for illegal and hateful content, criminalizing algorithmic manipulation and the amplification of illegal content.
- Paris prosecutors raid X's offices in France
Paris prosecutors raided X's offices in France. The search was carried out by the cybercrime unit, with assistance from Europol. It relates to an investigation opened in January 2025 following complaints about X's algorithm and the content it recommends. Prosecutors have also summoned Elon Musk and former X CEO Linda Yaccarino to hearings in April. In a statement, prosecutors said the X platform circulates pornographic deepfake videos and Holocaust-denial content. The prosecutor's office also announced it will leave the X platform and communicate via LinkedIn and Instagram instead.
- China bans hidden car door handles
China's Ministry of Industry and Information Technology issued a new mandatory safety standard, "Technical Requirements for Vehicle Door Handle Safety," banning hidden door handles on electric vehicles, making China the first country to outlaw the design. The design, popularized by Tesla, has faced scrutiny from regulators worldwide after a series of fatal incidents. The new rules require cars sold in China to have exterior door handles with mechanical release, and take effect on January 1, 2027. Models that have already received type approval must be redesigned to comply by January 2029. The move follows several high-profile accidents in China, including two Xiaomi EV fires in which the doors apparently could not be opened after a loss of power, leaving the occupants unable to escape or be rescued; they died.
- China's gifted-youth programs feed its AI talent pipeline
The FT reports on China's model of selecting gifted young people for special training. The earliest example is the Special Class for the Gifted Young at the University of Science and Technology of China (USTC); over the past two decades, programs such as Tsinghua's Yao Class and Peking University's Turing Class have followed. These programs supply core technical talent to AI and tech companies. Of the 3,167 graduates of USTC's gifted-youth class, 18-20% stayed in academia, and more than 200 became professors at top universities and research institutions at home and abroad. Most of the 100-plus members of the R&D team at DeepSeek, which drew wide attention early last year, came from such programs. China now graduates 5 million STEM majors a year, versus roughly 500,000 in the US. Of the 23 students China sent to the International Science Olympiads in 2025, 22 won gold medals.
- Substack warns users of data breach
Substack has notified users of a data breach. The breach occurred in October 2025, but Substack only discovered it this week. CEO Chris Best said an unauthorized third party accessed some user data, including email addresses, phone numbers, and other internal metadata; credit card numbers, passwords, and financial information were not accessed. Substack did not disclose how many users were affected. On Monday, a hacker leaked a Substack database containing 697,313 records on the BreachForums forum. Substack is popular with journalists and content creators, with 5 million paid subscribers as of March 2025.
- CIA stops publishing the World Factbook
The CIA announced it will stop publishing the World Factbook, without explaining why; the decision may be related to the Trump administration's budget cuts to government agencies. The World Factbook is a CIA reference publication with profiles of the world's countries and regions, covering statistics on population, geography, politics, the economy, and more. The CIA first released an unclassified edition to the public in 1975, with an online version since 1997. The report's statistics, maps, and images are in the public domain, free for anyone to quote or reproduce without CIA approval as long as the source is credited, so its data has been widely cited by journalists and scholars.
- TSMC plans to produce 3nm chips in Japan
TSMC chairman and CEO C.C. Wei said the company is considering producing Japan's first cutting-edge 3nm-process semiconductors at the second fab now under construction in Kumamoto Prefecture. The second fab, being built in Kikuyo, Kumamoto, was originally planned for 6nm-process chips; TSMC will now discuss changing that plan. The adjacent first fab produces 12nm-28nm chips and began volume production in December 2024.
- Guinea worm disease nears total eradication
The Carter Center announced that Guinea worm disease is close to eradication: preliminary figures show just 10 human cases worldwide in 2025. If fully eradicated, it would be the second human disease eliminated after smallpox. The Guinea worm (Dracunculus medinensis) is a waterborne parasitic nematode. When a person drinks water contaminated with it, the parasite burrows through the intestine and migrates through the body. Infections are initially symptomless. About a year later, the female worm forms a blister on the skin of the lower limbs, and roughly eight weeks after that a worm the length of a strand of spaghetti emerges from the blister. Besides excruciating pain, the disease can cause complications such as secondary infections and sepsis, leading to temporary or permanent disability. The eradication program launched in 1986, when an estimated 3.5 million cases occurred across 21 countries in Africa and Asia; cases fell to 15 in 2024, and the 10 cases in 2025 were: Chad 4, Ethiopia 4, South Sudan 2. Full eradication also requires eliminating animal infections, of which there were several hundred in 2025: Chad (147), Mali (17), Cameroon (445), Angola (70), Ethiopia (1), and South Sudan (3).
- Scientists demonstrate device-independent quantum key distribution over 100 km of fiber
Researchers at USTC report in Science a demonstration of device-independent quantum key distribution over 100 km of optical fiber. The results show the approach can secure encrypted communication at metropolitan scale, a transmission distance far beyond previous results, helping close the gap between proof-of-principle quantum network experiments and practical applications. Quantum key distribution (QKD) is a frontier application of quantum technology that enables exceptionally secure digital communication. Early forms of QKD ensured security through trusted devices, which carry technical limitations and vulnerabilities. A more advanced approach is device-independent QKD (DI-QKD), whose security derives directly from fundamental quantum phenomena, without having to trust the inner workings of the quantum devices.
- HBO to produce a Baldur's Gate TV series
HBO will produce a Baldur's Gate TV series, with The Last of Us showrunner Craig Mazin as the series' creator, writer, and executive producer, and Chris Perkins, former story lead at Baldur's Gate rights holder Wizards of the Coast, as consultant. The series will tell a story set after Baldur's Gate 3. Mazin plans to invite Baldur's Gate 3's voice actors to take part, as he did with The Last of Us. It is unclear whether Baldur's Gate 3 developer Larian Studios will be involved; the studio is working on a new entry in its Divinity series. Mazin said he has put 1,000 hours into Baldur's Gate 3 and that continuing its story is a dream come true.
- Chinese makers take 60% of Japan's TV market
According to research firm data, REGZA, in which Hisense holds a 95% stake, led Japan's domestic TV market in 2025, with Hisense and TCL together holding half the market. If the Sony brand shifts to a TCL-led joint venture, Chinese players would account for 60%. In the global TV market, Samsung ranks first, and four companies, Samsung, LG Electronics, Hisense, and TCL, hold more than half of global share. Panasonic is the only major Japanese company still making TVs, and it too is divesting its TV business; its low-end products are already made by TCL. Japanese firms are at a disadvantage in scale and supply chains, making it hard to build a home-appliance business starting from hardware.