
Monthly Digest — 2025-10

390 unique stories across 31 days and 8 sources.

Hacker News (124)

  1. U.S. Lost 32,000 Private-Sector Jobs in September, Says Payroll Processor (www.wsj.com)
  2. Jane Goodall has died (www.latimes.com)
  3. Solar leads EU electricity generation as renewables hit 54% (electrek.co)
  4. What good workplace politics looks like in practice (terriblesoftware.org)
  5. Anti-aging breakthrough: Stem cells reverse signs of aging in monkeys (www.nad.com)
  6. OpenAI's H1 2025: $4.3B in income, $13.5B in loss (www.techinasia.com)
  7. Playball – Watch MLB games from a terminal (github.com)
  8. Signal Protocol and Post-Quantum Ratchets (signal.org)
  9. Offline card payments should be possible no later than 1 July 2026 (www.riksbank.se)
  10. ICE Wants to Build Out a 24/7 Social Media Surveillance Team (www.wired.com)
  11. PEP 810 – Explicit lazy imports (pep-previews--4622.org.readthedocs.build)
  12. OpenAI Is Just Another Boring, Desperate AI Startup (www.wheresyoured.at)
  13. The UK is still trying to backdoor encryption for Apple users (www.eff.org)
  14. ProofOfThought: LLM-based reasoning using Z3 theorem proving (github.com)
  15. Self-hosting email like it's 1984 (maxadamski.com)
  16. Flock's gunshot detection microphones will start listening for human voices (www.eff.org)
  17. Fire destroys S. Korean government's cloud storage system, no backups available (koreajoongangdaily.joins.com)
  18. NIST's DeepSeek "evaluation" is a hit piece (erichartford.com)
  19. The QNX Operating System (www.abortretry.fail)
  20. Lina Khan: Activision-Blizzard buyout is 'harming both gamers and developers' (www.pcgamer.com)

GitHub Trending (74)

  1. harry0703 / MoneyPrinterTurbo

    Generate high-definition short videos with one click using AI large language models.

  2. Done-0 / fuck-u-code

    Legacy-Mess Detector – assess the "legacy-mess level" of your code and output a beautiful report.

  3. anthropics / claude-agent-sdk-python
  4. lobehub / lobe-chat

    🤯 Lobe Chat - an open-source, modern-design AI chat framework. Supports multiple AI providers (OpenAI / Claude 4 / Gemini / DeepSeek / Ollama / Qwen), a knowledge base (file upload / RAG), one-click MCP Marketplace install, and Artifacts / Thinking. One-click free deployment of your private AI agent application.

  5. nextcloud / server

    ☁️ Nextcloud server, a safe home for all your data

  6. google / tunix

    A JAX-native LLM Post-Training Library

  7. pathwaycom / pathway

    Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

  8. cjpais / Handy

    A free, open source, and extensible speech-to-text application that works completely offline.

  9. hsliuping / TradingAgents-CN

    A Chinese financial trading framework based on multi-agent LLMs – an enhanced Chinese edition of TradingAgents.

  10. juspay / hyperswitch

    An open source payments switch written in Rust to make payments fast, reliable and affordable

  11. airweave-ai / airweave

    Airweave lets agents search any app

  12. meshery / meshery

    Meshery, the cloud native manager

  13. Stremio / stremio-web

    Stremio - Freedom to Stream

  14. microsoft / BitNet

    Official inference framework for 1-bit LLMs

  15. Flowseal / zapret-discord-youtube
  16. Infisical / infisical

    Infisical is the open-source platform for secrets management, PKI, and SSH access.

  17. BeehiveInnovations / zen-mcp-server

    The power of Claude Code / GeminiCLI / CodexCLI + [Gemini / OpenAI / OpenRouter / Azure / Grok / Ollama / Custom Model / All Of The Above] working as one.

  18. trycua / cua

    Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).

  19. simstudioai / sim

    Open-source platform to build and deploy AI agent workflows.

  20. browserbase / stagehand

    The AI Browser Automation Framework

Hugging Face (93)

  1. MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

    MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
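The two headline metrics differ in what they reward: pass@k asks whether at least one of k attempts succeeds, while pass^k asks whether all k succeed, which penalizes inconsistency. A small sketch of both, using the standard unbiased pass@k estimator over n trials with c successes (MCPMark's exact estimator is an assumption here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts (sampled from n trials
    with c successes) passes -- the standard unbiased pass@k estimator."""
    if n - c < k:               # too few failures to fill all k draws
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that all k sampled attempts pass."""
    return comb(c, k) / comb(n, k)   # comb(c, k) is 0 when c < k

# A task solved in 2 of 4 trials: trivial to pass once, hard to pass always.
p1 = pass_at_k(4, 2, 1)   # 0.5
p4 = pass_pow_k(4, 2, 4)  # 0.0
```

The gap between the two numbers is exactly the consistency gap the abstract highlights for gpt-5-medium.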

  2. The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

    The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism that human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
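"Working memory via Hebbian synaptic plasticity" means the model's state lives in its synapse strengths, which grow when pre- and post-synaptic neurons fire together. A toy Hebbian update (purely illustrative; BDH's actual plasticity rule is more involved):

```python
def hebbian_step(w, pre, post, lr=0.1, decay=0.01):
    """One Hebbian plasticity step on a weight matrix w[j][i] (post j,
    pre i): synapses between co-active neurons strengthen via the
    outer product, while mild decay keeps weights bounded.
    Toy illustration, not BDH's exact rule."""
    return [[(1 - decay) * w[j][i] + lr * post[j] * pre[i]
             for i in range(len(pre))] for j in range(len(post))]

w = [[0.0] * 3 for _ in range(3)]
pre = [1.0, 0.0, 1.0]     # sparse, positive activations, as in BDH
post = [0.0, 1.0, 1.0]
for _ in range(10):       # repeatedly "hearing" the same concept
    w = hebbian_step(w, pre, post)
```

After the loop, only synapses between co-active pairs are non-zero, which mirrors the paper's observation that specific synapses strengthen for specific concepts.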

  3. Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

    Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.

  4. Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

    As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies--high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.
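The sample-level triage stage can be pictured as thresholding the EU plane into quadrants: high-error samples carry "misconception" signal (and later get token pruning), low-error/high-uncertainty samples carry calibration signal (kept whole), and easy-and-confident samples are dropped. A sketch with hypothetical thresholds, not the paper's exact scoring:

```python
def eu_triage(samples, err_thresh=0.5, unc_thresh=0.5):
    """Quadrant triage on the Error-Uncertainty plane (illustrative
    thresholds). Returns (misconception, calibration) sample lists;
    everything else -- low error, low uncertainty -- is discarded."""
    misconception, calibration = [], []
    for s in samples:
        if s["error"] >= err_thresh:
            misconception.append(s)   # will receive token-level pruning
        elif s["uncertainty"] >= unc_thresh:
            calibration.append(s)     # preserved in its entirety
    return misconception, calibration

data = [
    {"id": 0, "error": 0.9, "uncertainty": 0.2},  # misconception-rich
    {"id": 1, "error": 0.1, "uncertainty": 0.8},  # calibration signal
    {"id": 2, "error": 0.1, "uncertainty": 0.1},  # easy & confident: drop
]
mis, cal = eu_triage(data)
```

The asymmetry in the second stage falls out of this split: only the `mis` bucket is a candidate for token pruning.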

  5. DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

    Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
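The "global frontier selection" contribution replaces the usual root-to-leaf MCTS descent with one priority queue over every unexpanded node in the tree, so the most promising node anywhere gets expanded next. A minimal sketch with placeholder priorities (DeepSearch's actual node scores are an assumption here):

```python
import heapq

def expand_frontier(frontier, expand, steps):
    """Global frontier selection (sketch): keep all unexpanded nodes in
    one max-heap (negated priorities) and always expand the single
    highest-priority node in the whole tree. `expand` returns
    (child_priority, child_state) pairs."""
    while frontier and steps > 0:
        _, state = heapq.heappop(frontier)          # best node, tree-wide
        for p, child in heapq.__self__ if False else expand(state):
            heapq.heappush(frontier, (-p, child))
        steps -= 1
    return [s for _, s in frontier]

# Toy expansion: each state (priority, depth) spawns two weaker children.
def expand(state):
    p, depth = state
    if depth >= 2:
        return []
    return [(p * 0.5, (p * 0.5, depth + 1)), (p * 0.25, (p * 0.25, depth + 1))]

frontier = [(-1.0, (1.0, 0))]
remaining = expand_frontier(frontier, expand, steps=3)
```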

  6. VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

    Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.

  7. GEM: A Gym for Agentic LLMs

    The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating the use of GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which -- unlike GRPO -- is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit in addition to a training environment. We hope this framework can help accelerate future agentic LLM research.
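"Analogous to OpenAI-Gym" implies a reset/step environment-agent contract with per-turn rewards. A minimal text environment in that style, purely illustrative (GEM's real API and return signature may differ):

```python
class EchoEnv:
    """Minimal Gym-style text environment (illustrative; not GEM's actual
    interface). reset() returns the first observation; step() takes an
    action string and returns (observation, reward, done, info)."""
    def __init__(self, target: str, max_turns: int = 3):
        self.target, self.max_turns = target, max_turns

    def reset(self):
        self.turns = 0
        return f"Say the word: {self.target}"

    def step(self, action: str):
        self.turns += 1
        correct = action.strip() == self.target
        done = correct or self.turns >= self.max_turns
        reward = 1.0 if correct else 0.0   # dense per-turn reward
        return ("correct" if correct else "try again"), reward, done, {}

env = EchoEnv("hello")
obs = env.reset()
obs, reward, done, _ = env.step("hello")
```

Dense per-turn rewards like this are exactly the setting the abstract says ReBN supports but group-normalized GRPO does not.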

  8. Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

    Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach reallocates exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
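The knapsack framing can be sketched greedily: with unit-cost rollouts, repeatedly give the next rollout to the task with the highest marginal value. Here the value of a task with success rate p is approximated by p*(1-p)/k, a hypothetical proxy for the chance of a mixed success/failure group (non-zero GRPO gradient), not the paper's exact assignment rule:

```python
import heapq

def allocate_rollouts(success_rates, total_budget, min_k=1):
    """Greedy exploration-budget allocation (knapsack-view sketch).
    Each extra rollout on task i is an 'item' of unit cost whose value
    is approximated by p*(1-p)/k -- a hypothetical diminishing-returns
    proxy for producing a non-zero GRPO gradient."""
    alloc = [min_k] * len(success_rates)
    heap = [(-p * (1 - p) / min_k, i) for i, p in enumerate(success_rates)]
    heapq.heapify(heap)
    budget = total_budget - min_k * len(success_rates)
    while budget > 0:
        _, i = heapq.heappop(heap)
        alloc[i] += 1                                   # spend one rollout
        p = success_rates[i]
        heapq.heappush(heap, (-p * (1 - p) / alloc[i], i))
        budget -= 1
    return alloc

# Saturated tasks (p near 0 or 1) cede budget to the hard-but-learnable one.
alloc = allocate_rollouts([0.99, 0.5, 0.01], total_budget=12)  # -> [1, 10, 1]
```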

  9. LongCodeZip: Compress Long Context for Code Language Models

    Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
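The coarse-grained stage reduces to a budgeted selection: score each function-level chunk by relevance to the instruction (the paper uses conditional perplexity; here a caller-supplied score stands in) and greedily keep the best chunks that fit the token budget:

```python
def compress_context(chunks, budget):
    """LongCodeZip-style coarse-grained compression (sketch): rank
    function-level chunks by a relevance score -- a stand-in for the
    paper's conditional perplexity w.r.t. the instruction -- and keep
    the most relevant ones under a token budget, in source order."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    kept, used = [], 0
    for c in ranked:
        if used + c["tokens"] <= budget:
            kept.append(c)
            used += c["tokens"]
    kept.sort(key=lambda c: c["pos"])   # restore original file order
    return kept

chunks = [
    {"pos": 0, "name": "parse_config",  "tokens": 120, "score": 0.9},
    {"pos": 1, "name": "legacy_helper", "tokens": 300, "score": 0.1},
    {"pos": 2, "name": "main_loop",     "tokens": 150, "score": 0.8},
]
kept = compress_context(chunks, budget=300)
```

The fine-grained stage would then repeat the same idea inside each kept function at block granularity.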

  10. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/

  11. StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

    3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/
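The density-guided step amounts to evaluating a kernel density estimate over the existing point cloud and injecting where it is lowest. A 2D Gaussian-kernel sketch (the paper works on 3D Gaussians and its KDE setup may differ):

```python
from math import exp

def kde_density(points, query, bandwidth=1.0):
    """KDE at `query` with an isotropic Gaussian kernel over 2D points --
    the density signal a StealthAttack-style method could use to locate
    sparsely covered regions (simplified illustration)."""
    qx, qy = query
    h2 = 2.0 * bandwidth * bandwidth
    return sum(exp(-((qx - x) ** 2 + (qy - y) ** 2) / h2)
               for x, y in points) / len(points)

def lowest_density(points, candidates, bandwidth=1.0):
    """Pick the candidate injection site where the cloud is sparsest."""
    return min(candidates, key=lambda q: kde_density(points, q, bandwidth))

cloud = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1)]       # dense cluster at origin
site = lowest_density(cloud, [(0.0, 0.0), (5.0, 5.0)])
```

Injecting at `site` perturbs regions the innocent views barely constrain, which is why the illusion stays hidden from them.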

  12. ExGRPO: Learning to Reason from Experience

    Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.

  13. Apriel-1.5-15b-Thinker

    We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.

  14. Large Reasoning Models Learn Better Alignment from Flawed Thinking

    Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.

  15. Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

    Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.

  16. Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

    Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contribution is threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
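The core operation is tiny: at each denoising step, replace any single policy's score with a convex combination of all policies' scores. A sketch (score vectors and weights here are toy values; GPC additionally searches over the weights at test time):

```python
def composed_score(scores, weights):
    """Convex composition of distributional scores (GPC's core step,
    sketched): given each policy's score estimate at the same sample
    point and convex weights (non-negative, summing to 1), return the
    weighted sum used in place of any individual score."""
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    dim = len(scores[0])
    return [sum(w * s[d] for w, s in zip(weights, scores))
            for d in range(dim)]

# Two policies' score vectors at one denoising step, mixed 70/30.
s = composed_score([[1.0, -2.0], [3.0, 0.0]], [0.7, 0.3])
```

Because the combination is convex, the theory in the abstract (the one-step objective bound plus the Grönwall argument) applies along the whole sampling trajectory.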

  17. Paper2Video: Automatic Video Generation from Scientific Papers

    Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2-to-10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement via a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

  18. Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

    Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

  19. VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

    Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
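
    The keyframe-as-anchor idea can be sketched as a toy loop. The interfaces below are assumptions for illustration (VChain's real pipeline calls a multimodal model and tunes a diffusion video generator); here frames are plain strings:

```python
# Toy chain-of-visual-thought: a "reasoning model" proposes a sparse set
# of critical keyframes, and the generator is guided only at those
# moments -- all other frames are generated freely.
def propose_keyframes(prompt: str, num_frames: int) -> dict[int, str]:
    # Stand-in for GPT-4o-style visual state reasoning: pick sparse
    # critical moments and describe the expected visual state there.
    indices = [0, num_frames // 2, num_frames - 1]
    return {i: f"{prompt} @frame {i}" for i in indices}

def generate_video(prompt: str, num_frames: int) -> list[str]:
    anchors = propose_keyframes(prompt, num_frames)
    frames = []
    for i in range(num_frames):
        if i in anchors:
            # Sparse inference-time guidance at a key moment only.
            frames.append(f"tuned:{anchors[i]}")
        else:
            frames.append(f"free:{prompt} @frame {i}")
    return frames

video = generate_video("glass falls and shatters", 8)
```

    The efficiency claim in the abstract corresponds to the anchors dict being much smaller than the frame count: supervision cost scales with the number of keyframes, not the video length.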

  20. MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information

    Tree search has become a representative framework for test-time reasoning with large language models (LLMs), exemplified by methods such as Tree-of-Thought and Monte Carlo Tree Search that explore multiple reasoning paths. However, it remains difficult to provide instant and reliable quantitative assessments of intermediate reasoning step quality, and extensive path exploration is computationally costly. To address this, we propose Mutual Information Tree Search (MITS), a novel framework that guides reasoning with information-theoretic principles. MITS introduces an effective scoring function based on pointwise mutual information (PMI), which enables step-wise evaluation of reasoning paths and search tree expansion via beam search without expensive look-ahead simulations, achieving superior reasoning performance while maintaining computational efficiency. The framework is complemented by an entropy-based dynamic sampling strategy that adaptively allocates computational resources to uncertain reasoning steps where exploration is most beneficial. For final prediction, MITS employs a weighted voting scheme that combines PMI scores with prediction consensus. Through comprehensive experiments on diverse reasoning benchmarks, MITS consistently surpasses baseline methods, establishing a principled and efficient framework for LLM reasoning.
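
    The PMI scoring at the heart of MITS can be sketched in a few lines. This is a toy of the standard PMI definition, PMI(step; context) = log p(step | context) - log p(step), not the paper's exact formulation; a real system would obtain both probabilities from an LLM's token logprobs:

```python
import math

# PMI rewards steps the context makes much more likely than they are
# a priori -- a cheap, look-ahead-free signal of step relevance.
def pmi(p_step_given_context: float, p_step: float) -> float:
    return math.log(p_step_given_context) - math.log(p_step)

def beam_step(candidates: dict[str, tuple[float, float]], beam_width: int):
    # candidates maps step -> (p(step | context), p(step)).
    # Keep the beam_width steps with the highest PMI scores.
    ranked = sorted(candidates, key=lambda s: pmi(*candidates[s]), reverse=True)
    return ranked[:beam_width]

beam = beam_step(
    {"step_a": (0.6, 0.2),   # strongly lifted by context -> high PMI
     "step_b": (0.3, 0.3),   # context-independent -> PMI = 0
     "step_c": (0.1, 0.4)},  # suppressed by context -> negative PMI
    beam_width=2,
)
# beam == ["step_a", "step_b"]
```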

Solidot(99)

  1. Afghanistan's internet blackout passes two days

    According to monitoring data from Netblocks, Afghanistan's internet blackout has lasted more than 48 hours. Internet and mobile phone services are entirely down, leaving residents nationwide with communications almost completely cut off. The nationwide shutdown began Monday night and continued into Tuesday. Najibullah, a 42-year-old shopkeeper in the capital Kabul, said: "Without phones and the internet we are all blind; every business depends on mobile phones. Deliveries are arranged by phone. It feels like a holiday: everyone is at home. The market is completely frozen." This is the first time the Taliban government has cut communications nationwide, and officials have offered no explanation. Before the shutdown, AFP received a warning from a government official that eight to nine thousand telecommunications pillars would be switched off and that the outage would last until further notice. Afghanistan's remaining limited communications rely on radio and a handful of satellite links.

  2. Linus Torvalds removes Bcachefs entirely from Linux 6.18

    After Linux 6.17 marked the Bcachefs filesystem as externally maintained and merged no pull requests from Bcachefs maintainer Kent Overstreet, Linus Torvalds removed Bcachefs entirely from Linux 6.18, deleting 117,000 lines of code in total. Torvalds commented that Bcachefs is now a DKMS module and the in-kernel code had grown stale, so removing it avoids version confusion.

  3. Huajiang Canyon Bridge, the world's highest, opens to traffic

    The Huajiang Canyon Bridge in Guizhou has officially opened to traffic. Its deck stands 625 meters above the water, nearly 60 meters higher than the Beipanjiang First Bridge, making it the new highest bridge in the world; its 1,420-meter main span is also the longest of any mountain bridge. The bridge is 2,890 meters long and cuts the crossing time between the two banks from more than two hours to about two minutes. From groundbreaking in 2022 to opening, this "mega-project" took just over three years to build. The steel truss girder was hoisted in 93 segments totaling 21,000 tons, which had to be joined with millimeter precision more than 600 meters in the air. Using a purpose-built "smart cable crane system," the construction team completed all hoisting in just 73 days; the 38,000-square-meter deck received five layers of paving in a little over a month.

  4. CS professor warns that graduates are struggling to find jobs

    Hany Farid, a UC Berkeley computer science professor known for his research on digital forensics and deepfakes, says computer science has gone from a time-tested career to an industry in upheaval in an extremely short time. Students in the major, he says, would typically land five internships over their four years and graduate with multiple high-paying offers. That no longer happens; today they are happy to receive a single offer. Farid believes AI is only one factor, and that something deeper is shifting in the computer science industry. His advice to students now is to master multiple skills, because no one knows what the future holds. AI will not put lawyers out of work, he says, but lawyers who use AI will put lawyers who don't out of work, and he believes the same holds for every profession.

  5. Hackers claim to have breached Red Hat's GitHub repositories

    An extortion group calling itself Crimson Collective claims to have breached Red Hat's GitHub repositories and stolen nearly 570 GB of data, including 800 Customer Engagement Reports (CERs) that may contain sensitive information about customer networks and platforms. Red Hat confirmed a security incident affecting its consulting business but declined to confirm the hackers' claims. The group published on Telegram a complete directory listing of the stolen GitHub repositories along with a list of CERs spanning 2020-2025. Well-known organizations on the CER list include Bank of America, T-Mobile, AT&T, Fidelity, Kaiser, the Mayo Clinic, Walmart, Costco, the U.S. Naval Surface Warfare Center, the FAA, and the House of Representatives. The hackers say they tried to contact Red Hat with an extortion demand but received only a template reply directing them to submit a vulnerability report to its security team.

  6. Cancer rates are rising among millennials

    Since 2000, cancer incidence among 15- to 49-year-olds has risen 10%, while rates among older people have declined slightly. The cancer rate among young women is 83% higher than among men of the same age. A study of 150,000 people presented at an American Association for Cancer Research meeting found that, judging by blood biomarkers, millennials appear to be aging biologically faster than previous generations. This acceleration is associated with up to a 42% increase in the risk of cancers such as lung cancer, gastrointestinal tumors, and uterine malignancies. Researchers link the rising incidence to medications taken during pregnancy, consumption of ultra-processed foods, artificial light, circadian disruption from shift work, and chemical exposure.

  7. Pathogenic yeast strains detected in urban air

    As city dwellers know, leaving the metropolis for the seaside offers a change of scenery or a mental reset. A study published in ACS's Environmental Science & Technology Letters adds a new reason for the trip: urban air harbors pathogenic Candida yeast strains that were absent from coastal air samples, revealing a potential transmission route. Candida is a group of common microbes that live harmlessly on human skin and the mucous membranes of internal organs, but under certain conditions these strains can overgrow and cause vaginal yeast infections or thrush. Such infections are known to spread through direct contact or bodily fluids, and earlier work had detected Candida DNA in air, suggesting the yeast can travel airborne. For a full year, the researchers collected monthly air samples in Hong Kong and at a sparsely populated nearby site facing the South China Sea. In 12 urban air samples they found three Candida species the World Health Organization classifies as fungal pathogens: Candida albicans, Candida parapsilosis, and Candida tropicalis. No Candida was detected in the coastal samples. This geographic difference leads the researchers to suspect the airborne yeast comes from industrial or urban sources, such as wastewater treatment plants. Some urban samples also contained pathogenic Candida species resistant to common antifungal drugs; the researchers say overuse of antifungals, pollutants such as heavy metals in urban environments, or rising temperatures could all contribute to this resistance. Finally, the genome of one airborne strain closely matched strains previously isolated from patients with Candida infections, suggesting the airborne strains may be infectious. The researchers say the study challenges the long-held assumption that Candida spreads mainly by direct contact, recasting it as an emerging airborne pathogen, though more research is needed to trace the urban sources of Candida and to fully understand how infectious these airborne particles are.

  8. Jane Goodall dies at 91

    Jane Goodall, the renowned zoologist, primatologist, and anthropologist, has died at the age of 91. Famous for her field research on wild chimpanzees, she was regarded as the foremost chimpanzee expert. Goodall began studying the social and family life of the Kasakela chimpanzee community in Tanzania's Gombe Stream National Park in 1960, observing behavior strikingly similar to that of humans. Her findings challenged two beliefs of the era: that only humans could make and use tools, and that chimpanzees were vegetarians. During her research she formed close bonds with the local chimpanzees, becoming the only human ever accepted into the community. She later devoted herself to environmental education and public advocacy, founding the well-known conservation organization the Jane Goodall Institute.

  9. Intel in talks to manufacture chips for AMD

    In recent weeks Intel has secured investment and backing from the White House, Nvidia, and SoftBank, and is in talks to fabricate chips for Apple. Its longtime rival AMD is another party in discussions. Negotiations between Intel and AMD are at an early stage: the chip giant hopes its fabs can manufacture AMD's chips, which have until now been produced mainly by TSMC, though Intel's fabs currently lack the advanced technology needed for AMD's most advanced chips. As with the Apple talks, the AMD talks may not result in any deal.

  10. Indian high court orders doctors to write legible prescriptions

    Doctors' handwritten prescriptions are notoriously freewheeling; apart from the pharmacists who fill them, hardly anyone can decipher them. While hearing a rape case, an Indian high court judge read a doctor's medico-legal report and could not make out a single word. Justice Jasgurpreet Singh Puri issued an order declaring that "a clear, legible medical prescription is a fundamental right." The court directed the government to add handwriting to the medical school curriculum and set a two-year timeline for rolling out digital prescriptions. Until digital prescriptions are in place, Justice Puri said, all doctors must write prescriptions clearly in capital letters. Indian Medical Association president Dilip Bhanushali said cities have already adopted digital prescriptions, but busy doctors in small towns and rural areas still scrawl theirs by hand.

  11. Why women live longer than men

    Women generally outlive men. Traditional explanations hold that men smoke more, drink more, and engage in riskier behavior. But the longevity gap persists in every country and every century, suggesting a deeper cause. A study published in Science Advances again supports the idea that the phenomenon may be tied to women having two X chromosomes: the redundant chromosome helps females withstand harmful mutations. The researchers analyzed lifespan data for 528 mammal species and 648 bird species kept in zoos and found that most mammals resemble humans, with females outliving males in nearly three-quarters of mammal species. Among birds, by contrast, males live longer in 68% of species, because in birds it is the female that carries two different sex chromosomes while the male's pair is identical.

  12. Free Software Foundation celebrates 40 years, names Ian Kelling board president

    The Free Software Foundation (FSF) celebrated its fortieth anniversary and introduced Ian Kelling, the new president of its board, to the free software community. Founded on October 4, 1985, the FSF is dedicated to promoting free software and carrying out the GNU Project. Current board members include Christina Haralanova, Geoffrey Knauth (treasurer), Gerald J. Sussman, Ian Kelling, and founder Richard M. Stallman. Kelling, 43, has been a voting board member since 2021 and is an active speaker and blogger. He says he will work to strengthen the FSF's ability to confront new threats to computer users' freedom and to welcome more free software supporters into the movement than ever before.

  13. Greater Manchester Police suspends remote working after officers used auto-keypress tools to fake work

    Greater Manchester Police, which has 12,677 employees, suspended remote working after a recent investigation found officers using automatic key-pressing tools to fake activity; 26 officers, staff members, and contractors have faced misconduct charges. According to the investigation, one officer testified that a detective made his computer appear to be in use 38 times over 12 days. Evidence showed that for long stretches his only activity was single keystrokes: between 10:28 and 11:56 GMT on December 3 he pressed the H key about 30 times, then the I key more than 16,000 times. Of 85 hours logged in, 45 involved automated key presses, meaning he was away from the keyboard for half his working hours. The detective has since resigned.

  14. Opera launches an AI browser for $19.90 a month

    Not wanting to miss the AI boom, Opera has launched an AI browser, Opera Neon, priced at $59.90 for the first nine months and $19.90 per month thereafter. Opera Neon relies mainly on large models running in the cloud, and tasks are the browser's central concept: Neon uses AI to carry out all kinds of tasks for the user. Opera says: "Neon acts on your instructions, opening tabs, doing research, finding the best prices, assessing safety, whatever you need. It delivers results you can use, share, and build on." Another AI company, Perplexity, has also released its own AI browser, Comet, which is free to use with an optional $5 AI news service.

  15. Microsoft says it will keep developing Xbox consoles

    Microsoft recently raised prices on the Xbox Series X and Series S consoles yet again and hiked its Xbox Game Pass Ultimate subscription by 50%. The moves have left many pessimistic about the future of Microsoft's console business, and retailers including Costco have decided to pull Xbox products from their shelves. Sony's PS5 will be followed by a PS6, but will the Xbox Series X have a successor? Responding to rumors that it might abandon hardware, Microsoft issued a statement Monday reaffirming its commitment to developing Xbox consoles and to continuing its hardware partnership with AMD; both Microsoft's and Sony's current consoles use AMD CPU and GPU designs. Microsoft's plan for a first-party Xbox handheld has reportedly been canceled, allegedly because AMD's contract required sales of at least ten million units, while the Steam Deck has sold only 4-5 million units since its 2022 launch.

  16. Ubuntu Linux 26.04 LTS codenamed Resolute Raccoon

    With Ubuntu 25.10 about to be released, Canonical announced that the next LTS (long-term support) release, Ubuntu 26.04, is codenamed Resolute Raccoon. Ubuntu 25.10 is supported for only nine months, while Ubuntu 26.04 will be supported for five years and is expected in April 2026. Headline features of Ubuntu 25.10 include Linux 6.17, GCC 15, the Rust-based system components sudo-rs and Rust Coreutils, and GNOME 49 as the default desktop. The specific features of Ubuntu 26.04 will be revealed gradually over the coming months.

  17. 2025 Nobel Prize in Physics awarded to three American quantum scientists

    The 2025 Nobel Prize in Physics was awarded to the American scientists John Clarke, Michel H. Devoret, and John M. Martinis for their discovery of macroscopic quantum mechanical tunneling and energy quantization in an electric circuit. A central question in physics is how large a system can be while still exhibiting quantum mechanical effects. This year's laureates ran experiments on an electrical circuit in which they demonstrated both quantum tunneling and energy quantization in a system large enough to hold in the hand. In 1984 and 1985, Clarke, Devoret, and Martinis conducted a series of experiments with an electronic circuit built from superconductors. In the circuit, the superconducting components were separated by a thin layer of insulating material, a structure known as a Josephson junction. By refining and measuring the circuit's various properties, they were able to control and explore the phenomena that arose as current flowed through it. The charged particles moving together through the superconductor form a system that behaves as if it were a single particle filling the entire circuit. This macroscopic particle-like system initially sits in a state where current flows without any voltage, trapped there as if behind an impassable barrier. In the experiment, the system revealed its quantum nature by tunneling out of the zero-voltage state, a change of state detected through the appearance of a voltage.

  18. Removing the 50 most dangerous pieces of space junk would halve new debris

    According to a study presented last week at the International Astronautical Congress in Sydney, removing the 50 most dangerous pieces of debris in low Earth orbit would cut the overall generation of new fragments roughly in half. Lead author Darren McKnight and colleagues calculated which low-orbit objects are most likely to collide with other debris and spawn further fragments. Of the 50 most dangerous pieces, 34 are Russian/Soviet, 10 Chinese, 4 American, 2 European, and 1 Japanese. Removing even just the 10 most dangerous would reduce new debris by 30%. McKnight notes that most space junk predates 2000: 76% of the top 50 dates from the last century, and 88% are rocket bodies abandoned in space. The bad news is that since January 1, 2024, 26 rocket bodies have been left in low Earth orbit that will remain there for more than 25 years; 21 of the 26 were launched by China, with the other 5 from the US, Russia, India, and Iran. As China accelerates launches for its Guowang and Qianfan megaconstellations, each planned at thousands of satellites, the number of rocket bodies in low orbit is likely to keep growing: since those constellations began launching last year, China has left 9 rocket upper stages in orbit and may eventually leave more than 100. A Chinese space agency official says, however, that work is under way on how to clean up orbital debris.

  19. After plunging sales, Synology allows third-party drives in its NAS products

    Earlier this year Synology made a controversial decision: its 2025 Plus-series NAS models would be compatible only with Synology-branded hard drives, with the company warning that installing incompatible drives could prevent a NAS from creating storage pools. Synology does not manufacture drives; it mainly rebadges drives from Seagate and Toshiba, and its branded drives are typically somewhat pricier than comparable third-party models. For example, Synology's Plus-series 8TB 3.5-inch HDD, the HAT3310, sells for $210 on its website, while the Toshiba N300, one of the drives behind the HAT3310, sells for $173 at multiple online stores. The move drew broad criticism from consumers, who voted with their wallets: sales have plunged over the past few months. Synology has now released DSM 7.3, quietly reversing the controversial policy; third-party drives no longer trigger warnings or restricted functionality. Critics say the episode has damaged Synology's reputation.

  20. 2025 Nobel Prize in Chemistry awarded to American, Japanese, and British scientists

    The 2025 Nobel Prize in Chemistry was awarded to Japanese scientist Susumu Kitagawa, British scientist Richard Robson, and American scientist Omar M. Yaghi "for the development of metal-organic frameworks." They created a new kind of molecular architecture in which metal ions serve as cornerstones linked by long organic (carbon-based) molecules. Together, the metal ions and molecules form crystals containing large cavities. These porous materials are called metal-organic frameworks (MOFs). By varying the building blocks used in a MOF, chemists can design them to capture and store specific substances; MOFs can also drive chemical reactions or conduct electricity. Following the laureates' breakthrough discoveries, chemists have built tens of thousands of different MOFs. Some of them may help solve some of humanity's biggest challenges, including separating PFAS from water, breaking down trace pharmaceuticals in the environment, capturing carbon dioxide, and harvesting water from desert air.