Monthly Digest — 2025-09
363 unique stories across 30 days and 8 sources.
Hacker News (120)
- Patrick Winston: How to Speak (2018) [video] (www.youtube.com)
- The future of 32-bit support in the kernel (lwn.net)
- Implementing a Foil Sticker Effect (www.4rknova.com)
- Adaptive LLM routing under budget constraints (arxiv.org)
- Google can keep its Chrome browser but will be barred from exclusive contracts (www.cnbc.com)
- OpenAI says it's scanning users' conversations and reporting content to police (futurism.com)
- We already live in social credit, we just don't call it that (www.thenexus.media)
- Python has had async for 10 years – why isn't it more popular? (tonybaloney.github.io)
- Where's the shovelware? Why AI coding claims don't add up (mikelovesrobots.substack.com)
- Microsoft BASIC for 6502 Microprocessor – Version 1.1 (github.com)
- Who Owns, Operates, and Develops Your VPN Matters (www.opentech.fund)
- Nuclear: Desktop music player focused on streaming from free sources (github.com)
- WiFi signals can measure heart rate (news.ucsc.edu)
- Stripe Launches L1 Blockchain: Tempo (tempo.xyz)
- Wikipedia survives while the rest of the internet breaks (www.theverge.com)
- Google deletes net-zero pledge from sustainability website (www.nationalobserver.com)
- I kissed comment culture goodbye (sustainableviews.substack.com)
- Anthropic agrees to pay $1.5B to settle lawsuit with book authors (www.nytimes.com)
- European Commission fines Google €2.95B over abusive ad tech practices (ec.europa.eu)
- Purposeful animations (emilkowal.ski)
GitHub Trending (74)
- dockur / windows
Windows inside a Docker container.
- JetBrains / koog
Koog is the official Kotlin framework for building and running robust, scalable and production-ready AI agents across all platforms – from backend services to Android and iOS, JVM, and even in-browser environments. Koog is based on our AI products expertise and provides proven solutions for complex LLM and AI problems
- juspay / hyperswitch
An open source payments switch written in Rust to make payments fast, reliable and affordable
- QuentinFuxa / WhisperLiveKit
Real-time & local speech-to-text, translation, and speaker diarization. With server & web UI.
- crewAIInc / crewAI
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
- ashishpatel26 / 500-AI-Agents-Projects
The 500 AI Agents Projects is a curated collection of AI agent use cases across various industries. It showcases practical applications and provides links to open-source projects for implementation, illustrating how AI agents are transforming sectors such as healthcare, finance, education, retail, and more.
- pedroslopez / whatsapp-web.js
A WhatsApp client library for NodeJS that connects through the WhatsApp Web browser app
- microsoft / PowerToys
Windows system utilities to maximize productivity
- bytebot-ai / bytebot
Bytebot is a self-hosted AI desktop agent that automates computer tasks through natural language commands, operating within a containerized Linux desktop environment.
- LukeGus / Termix
Termix is a web-based server management platform with SSH terminal, tunneling, and file editing capabilities.
- rustdesk / rustdesk
An open-source remote desktop application designed for self-hosting, as an alternative to TeamViewer.
- microsoft / BitNet
Official inference framework for 1-bit LLMs
- aquasecurity / trivy
Find vulnerabilities, misconfigurations, secrets, SBOM in containers, Kubernetes, code repositories, clouds and more
- trufflesecurity / trufflehog
Find, verify, and analyze leaked credentials
- zama-ai / fhevm
FHEVM, a full-stack framework for integrating Fully Homomorphic Encryption (FHE) with blockchain applications
- rails / rails
Ruby on Rails
- emcie-co / parlant
LLM agents built for control. Designed for real-world use. Deployed in minutes.
- coleam00 / ottomator-agents
All the open source AI Agents hosted on the oTTomator Live Agent Studio platform!
- microsoft / ai-agents-for-beginners
12 Lessons to Get Started Building AI Agents
- bytedance / UI-TARS-desktop
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
Product Hunt (12)
- xpander.ai
Backend and Frontend for your AI Agents
- JoggAI AvatarX
AI avatars that truly act like humans
- Dhisana AI
Cursor for Sales Teams
- Genspark AI Designer
Your AI employee that designs anything with one prompt
- Receiptor AI 2.0
Bookkeeping on Autopilot with AI
- Google Finance Beta
Dive into the world of finance with AI-powered insights
- Bhava
Create and edit diagrams instantly with AI
- CatDoes
Team of AI agents build mobile apps for you & your business
- Ada
Your own AI data analyst
- Sidekick
Build Zapier-style automations using only a chat interface
- Astra API Security Platform
Discover, Scan, and Secure every API at scale
- Rork App
Idea to App Store, fast. The app that makes mobile apps
Hugging Face (68)
- R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
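A rough sketch of the bi-mode rollout idea described above: for each query the policy is sampled in both thinking and non-thinking modes, and rewards are normalized GRPO-style within the combined group. The function names, mode tags, and toy reward below are illustrative assumptions, not the paper's implementation:

```python
import statistics

def generate(query: str, mode: str) -> str:
    # Hypothetical stand-in for sampling the policy in a given mode.
    prefix = "<think>...</think> " if mode == "think" else ""
    return prefix + f"answer to {query!r}"

def reward(query: str, response: str) -> float:
    return 1.0 if "answer" in response else 0.0  # toy correctness check

def bi_mode_group_advantages(query: str, n_per_mode: int = 4):
    """Sample n rollouts per mode and normalize rewards within the group."""
    rollouts = [(mode, generate(query, mode))
                for mode in ("think", "no_think")
                for _ in range(n_per_mode)]
    rewards = [reward(query, resp) for _, resp in rollouts]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(mode, (r - mu) / sigma) for (mode, _), r in zip(rollouts, rewards)]

print(bi_mode_group_advantages("What is 2 + 2?"))
```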
- EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
- Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
- PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to rollout in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
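The two mechanisms in this abstract, an advantage computed against a pre-estimated reference anchor and reference-scored data filtering, can be pictured with a minimal sketch. The function names and thresholds are assumptions for illustration only:

```python
def pvpo_advantages(policy_rewards: list[float], reference_reward: float):
    """Advantage against a pre-computed reference-model anchor, rather
    than relying solely on intra-group comparisons."""
    return [r - reference_reward for r in policy_rewards]

def select_high_gain(prompts: list[str], ref_scores: list[float],
                     low: float = 0.1, high: float = 0.9):
    """Pre-sampling: keep prompts the reference model neither always
    solves nor always fails (assumed thresholds)."""
    return [p for p, r in zip(prompts, ref_scores) if low < r < high]

print(pvpo_advantages([1.0, 0.0, 1.0], reference_reward=0.5))
print(select_high_gain(["q1", "q2", "q3"], [0.0, 0.5, 1.0]))
```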
- T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. The experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve only a 62.71 overall score, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.
- How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench
Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like tau-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
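The core move here, reformulating the input rather than the agent's reasoning, amounts to wrapping the raw user query with relevant rules and tool hints before each agent turn. The prompt format below is an invented illustration, not IRMA's actual template:

```python
def reformulate(user_query: str, domain_rules: list[str],
                tool_hints: list[str]) -> str:
    """Augment a raw query with domain policies and tool suggestions
    for the tool-calling agent to focus on."""
    rules = "\n".join(f"- {r}" for r in domain_rules)
    tools = "\n".join(f"- {t}" for t in tool_hints)
    return (f"User request:\n{user_query}\n\n"
            f"Relevant domain policies:\n{rules}\n\n"
            f"Suggested tools:\n{tools}")

print(reformulate(
    "Cancel my flight and refund me",
    ["Refunds are allowed only within 24h of booking"],
    ["get_reservation", "cancel_reservation"]))
```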
- UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning modern standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
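The scoring pipeline reduces to per-category means with 95% confidence intervals over repeated runs. A minimal sketch, assuming a normal approximation (the paper's exact interval method may differ):

```python
import math
import statistics

def mean_ci95(scores: list[float]) -> tuple[float, float]:
    """Mean and half-width of a normal-approximation 95% CI."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return m, 1.96 * se

# e.g. one category's five runs, scored by a judge on a 1-5 scale
print(mean_ci95([4.8, 5.0, 4.9, 5.0, 4.9]))
```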
- Why Language Models Hallucinate
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
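The incentive argument can be made concrete with a one-line expected-value calculation: under 0/1 grading, guessing always beats abstaining, while a wrong-answer penalty creates a confidence threshold t = penalty / (1 + penalty) below which abstaining wins. The numbers are illustrative:

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected grade for guessing when a correct answer scores 1,
    a wrong answer scores -wrong_penalty, and abstaining scores 0."""
    return p_correct * 1.0 - (1 - p_correct) * wrong_penalty

# Standard 0/1 grading: even 20% confidence makes guessing pay.
print(expected_score(0.2, wrong_penalty=0.0))  # 0.2 > 0 -> guess
# With a penalty of 1, guessing only pays above 50% confidence.
print(expected_score(0.2, wrong_penalty=1.0))  # -0.6 -> abstain
```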
- Symbolic Graphics Programming with Large Language Models
Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
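The reward described here, a format-validity gate composed with a cross-modal similarity score, has a simple shape. Both helpers below are hypothetical stand-ins (a real system would rasterize the SVG and score it with a SigLIP/DINO-style encoder):

```python
def renders_ok(svg: str) -> bool:
    # Stand-in for an actual SVG rasterization check.
    s = svg.strip()
    return s.startswith("<svg") and s.endswith("</svg>")

def text_image_similarity(prompt: str, svg: str) -> float:
    return 0.7  # placeholder for an encoder-based similarity score

def sgp_reward(prompt: str, svg: str) -> float:
    if not renders_ok(svg):  # validity gate: unrenderable output gets 0
        return 0.0
    return text_image_similarity(prompt, svg)

print(sgp_reward("a red circle", "<svg><circle r='5' fill='red'/></svg>"))
```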
- Set Block Decoding is a Language Model Inference Accelerator
Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
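The decoding loop sketched below shows the paradigm at its simplest: a block of future positions is filled over a few parallel steps, accepting a subset of (not necessarily consecutive) positions per step, instead of one sequential forward pass per token. The dummy model and acceptance rule are assumptions:

```python
MASK = None

def model_predict(context: list[str], block: list[str]) -> dict[int, str]:
    """Dummy stand-in: propose a token for every masked position."""
    return {i: f"tok{len(context) + i}"
            for i, t in enumerate(block) if t is MASK}

def set_block_decode(context: list[str], block_size: int = 4,
                     accept_per_step: int = 2) -> list[str]:
    block = [MASK] * block_size
    while MASK in block:
        candidates = model_predict(context, block)
        for pos in sorted(candidates)[:accept_per_step]:
            block[pos] = candidates[pos]  # accept a subset in parallel
    return context + block

print(set_block_decode(["<bos>", "hello"]))  # 2 model calls for 4 tokens
```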
- WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
- A Survey of Reinforcement Learning for Large Reasoning Models
In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
- RewardDance: Reward Scaling in Visual Generation
Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.
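The generative reward reformulation is concrete enough to sketch: the score is the VLM's softmax probability of a "yes" token given the comparison prompt. The two-token vocabulary and logits below are invented for illustration:

```python
import math

def reward_from_logits(logits: dict[str, float]) -> float:
    """Reward = P("yes") under a softmax over the candidate tokens."""
    z = sum(math.exp(v) for v in logits.values())
    return math.exp(logits["yes"]) / z

# Pretend the reward VLM put these logits on its final token:
print(reward_from_logits({"yes": 2.1, "no": -0.3}))  # ~0.92
```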
- 3D and 4D World Modeling: A Survey
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, they overlook the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey
- AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
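ScalingInter-RL's exploitation-to-exploration shift is essentially a growing cap on the interaction horizon. A minimal schedule sketch with invented numbers:

```python
def interaction_horizon(step: int, start: int = 4, cap: int = 32,
                        grow_every: int = 1000) -> int:
    """Allowed env interactions per episode: small early (exploitation),
    doubling periodically up to a cap (exploration)."""
    return min(cap, start * 2 ** (step // grow_every))

for s in (0, 1000, 2000, 3000):
    print(s, interaction_horizon(s))  # 4, 8, 16, 32
```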
- VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.
- HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
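The time-adaptive classifier-free guidance mentioned at the end amounts to making each condition's guidance weight a function of the denoising step. The linear schedules and weight values below are invented; only the overall composition pattern follows the abstract:

```python
def time_adaptive_cfg(eps_uncond: float, eps_text: float, eps_audio: float,
                      t: float, w_text=(7.5, 2.0), w_audio=(1.0, 4.0)) -> float:
    """Compose guidance terms with step-dependent weights; t runs from
    1 (start of denoising) to 0 (end), each weight interpolating linearly."""
    wt = w_text[0] * t + w_text[1] * (1 - t)
    wa = w_audio[0] * t + w_audio[1] * (1 - t)
    return (eps_uncond
            + wt * (eps_text - eps_uncond)
            + wa * (eps_audio - eps_uncond))

print(time_adaptive_cfg(0.0, 1.0, 0.5, t=0.9))
```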
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms pi_0 on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
Solidot (89)
- Study finds smelling a fragrance can increase brain gray matter
According to a study published in Brain Research Bulletin, Japanese scientists report that prolonged exposure to a fragrance can increase brain gray matter. Researchers at Kyoto University and the University of Tsukuba had 28 women in an experimental group apply rose essential oil for a month, while 22 women in a control group applied tap water. MRI scans showed a slight increase in gray matter in the rose-oil group. An increase in gray matter does not necessarily mean improved thinking ability, but the finding may be significant for neurodegenerative diseases such as dementia. The exact cause of the increase is unknown; the researchers speculate that the brain registers the rose scent as an unpleasant odor, and the posterior cingulate cortex, which regulates emotion, works harder and grows in volume. The researchers hope the finding will help in developing aromatherapies that promote mental health and brain plasticity.
- Japan's average summer temperature sets another record high
The Japan Meteorological Agency announced on Monday that this summer's average temperature was 2.36°C above normal, the highest since records began in 1898. Japan has now had its hottest summer three years in a row, and the agency says this heat wave will persist for another two weeks. Northern Japan was 3.4°C above normal, eastern Japan +2.3°C, and western Japan +1.7°C, all the highest values since statistics began in 1946. Of the 153 weather stations nationwide, 132 recorded their highest summer average temperature (records at 9 of those stations tied previous highs). AMeDAS stations recorded a combined 9,385 extremely hot days this summer, the most since comparable statistics began in 2010.
- New Zealanders seek a mate for a left-coiling snail
New Zealanders are searching for a mate for a rare left-coiling snail, named Ned after the left-handed neighbor Ned Flanders in The Simpsons. Snail shells normally coil to the right; a left-coiling shell occurs at odds of about 1 in 40,000. Left-coiling and right-coiling snails cannot mate because their reproductive organs do not align, so a left-coiling snail must mate with another left-coiling snail, and the chance of two meeting in the wild is vanishingly small. Hence the nationwide campaign to find Ned a partner; the last successful effort of this kind was in 2016.
- The Adobe Reader installer has ballooned in size in recent years
The installer for Adobe Reader, the once-standard, widely used PDF reader, has ballooned in size over the past few years. The reason, as with every tech company, is the rush to build red-hot AI into its products, whether users need it or not. The Adobe Reader 25.1 installer is nearly 700 MB, compared with 460 MB for last year's v24.2 and under 100 MB for v15.17 in 2016. By contrast, the rival PDF reader SumatraPDF has stayed under 10 MB.
- Amazon has largely sat out the AI talent war
Amazon has mostly stood on the sidelines of the AI talent war sweeping Silicon Valley. An internal document written by the e-commerce giant's HR team late last year summarized its disadvantages in recruiting AI talent, including location, compensation, and a perception of clearly lagging in AI, while competitors typically offer more comprehensive and aggressive pay packages. Amazon is famous for frugality: one of its origin stories involves buying cheap doors from Home Depot and converting them into desks, and Jeff Bezos is said to still use one of those door desks.
- Americans' sexual frequency is at a historic low
According to The Sex Recession, a report from the Institute for Family Studies, Americans are having sex at historically low rates, even lower than during the COVID-19 pandemic. Researchers analyzed data on sex and intimacy from the latest General Social Survey by NORC at the University of Chicago, collected in 2024 and released this May. Only 37% of people aged 18-64 have sex at least once a week, down from 55% in 1990. The decline among young adults is even more striking: nearly a quarter (24%) of 18-29-year-olds reported no sex in the past year, double the 2010 figure. The study shows the decline applies to people of all sexual orientations under 64, whether married or single. Researchers said frequency did not change significantly among those over 64, mainly because that group already reported relatively infrequent sex.
- Companies are hiring humans to make AI slop less bad
Countless companies are experimenting with generative AI, but anyone who has tried it knows AI rarely produces a satisfactory final product that can be used as-is, which has spawned a new line of work: fixing and reworking AI-generated content. Freelancers say such jobs pay less than traditional gigs in their specialties, but some say the work at least helps pay the bills. Recent data from the freelance platforms Upwork, Freelancer, and Fiverr show demand for this kind of creative work is surging, and clients increasingly want people who can collaborate with AI rather than rely on it entirely or reject it. AI-assisted programming ("vibe coding") is increasingly popular, but companies are finding that such tools fall short of expectations and that they still need human programmers to avoid the trouble vibe coding brings. Indian programmer Harsh Kumar says the websites and apps clients build with AI-assisted coding are often unstable or unusable.
- The molecular mechanism behind stress's effect on heart function
According to a study published in the Journal of Molecular and Cellular Cardiology, researchers at UC Davis have explained the molecular mechanism by which stress affects the heart. In animal experiments, the team found that just 10 days of acute stress is enough to trigger inflammation and subtle changes in heart function. They also revealed the underlying mechanism: activation of a multi-protein complex called the NLRP3 inflammasome, a key "amplifier" of the inflammatory response. Stress activates these complexes through a cascade of cellular stress and signaling pathways. This is the first time scientists have confirmed that environmental stress can directly trigger this process inside heart cells, releasing harmful molecules that promote heart disease. For protecting the heart, lifestyle changes and stress reduction remain the best options, but they are not easy for people living in environments with heavy pollution, noise, or social stress.
- Firefox ESR 115 will be supported until March 2026
Microsoft has ended support for Windows 7/8/8.1, and the most popular applications on those systems, browsers such as Google Chrome and Microsoft Edge, have likewise dropped support. Firefox 115 ESR, released by Mozilla in July 2023, is the last Firefox version to support Windows 7/8/8.1. Mozilla developers said they would reassess at various points whether to extend support for Windows 7/8/8.1; the latest assessment is that Mozilla plans to keep shipping security updates for Firefox 115 ESR until March 2026.
- Third-party Windows tool lets users disable all AI features
Flyoobe, a third-party Windows 11 tool, lets users remove the bloatware Microsoft bundles with the operating system. Its recent v1.7 update lets users discover and disable all AI and Copilot features after installing the OS. The developer says the latest version digs deeper into how AI is embedded in Windows 11. Flyoobe is hosted on Microsoft-owned GitHub under the MIT license.
- Like humans, every tree has a unique microbiome
A forest is a complex, dynamic ecosystem, and so is the inside of a tree. In a study of trunk microbiomes published in Nature, researchers found that the woody tissue of trees contains, besides tree cells, vast communities of bacteria and single-celled archaea. The Yale team collected wood-core samples from more than 150 trees of 16 species in the northeastern United States and extracted DNA to estimate microbial abundance in the trunks. They found that tree microbiomes vary by species: sugar maples, famous for maple syrup, harbor more sugar-eating bacteria, while oaks, used for wine barrels, carry a set of microbes known to aid fermentation. These examples show that tree microbes influence our daily lives in unexpected ways. Tree microbiomes can also exhibit convergent evolution: closely related tree species may host similar microbial communities.
- Tesla redefines Full Self-Driving, dropping its promise of autonomy
Tesla has changed the meaning of Full Self-Driving (FSD), abandoning its original promise of autonomous, or unsupervised, full self-driving. Since 2016 Tesla had claimed that the cars it was producing supported unsupervised self-driving, and since 2018 CEO Elon Musk promised every year that autonomy would arrive by year's end. Tesla later admitted that none of the vehicles produced from 2016 to 2023 shipped with the hardware required for autonomous driving. Tesla now says FSD stands for supervised self-driving.
- NASA bars Chinese citizens from its space programs
NASA has barred Chinese citizens holding valid visas from entering its facilities and taking part in its space programs. Chinese citizens working on NASA projects as contractors or students found on September 5 that they had lost access to all NASA systems and facilities. NASA subsequently confirmed the ban, citing national security: "NASA has taken internal action pertaining to Chinese nationals, including restricting access to our facilities, materials, and network to ensure the security of our work." The US and China are both racing to return to the Moon, while the US Artemis lunar program faces cost overruns and delays.
- Why Netflix struggles to make high-quality films
In February Netflix released The Electric State, a widely panned sci-fi film starring Chris Pratt and Millie Bobby Brown, who played Eleven in Stranger Things. The film would have been quickly forgotten had its production budget not been $320 million. That $320 million bought Netflix a Metacritic score of 30/100 and a Rotten Tomatoes score of 14%. To fill its library, Netflix has bankrolled a stream of low-quality original films. It has made some high-quality ones, such as The Irishman, but on review sites like IMDb, Letterboxd, and TMDB, Netflix films score far below theatrical releases. Netflix has worked with acclaimed directors such as Martin Scorsese, Alfonso Cuarón, and Bradley Cooper, but most of these projects were one-offs, and the directors rarely came back. Today many directors refuse to work with Netflix even when it offers a bigger budget. Zach Cregger, director of Weapons, turned down a $50 million budget from Netflix in favor of Warner Bros.' $37 million and a guaranteed theatrical release. Netflix offered Emerald Fennell and Margot Robbie $150 million for their adaptation of Wuthering Heights, but they too chose Warner Bros.' $80 million budget and theatrical guarantee.
- Gravitational waves confirm Hawking's black hole area theorem
The Laser Interferometer Gravitational-Wave Observatory (LIGO) detected an unusually strong collision between two black holes, allowing physicists to test the black hole area theorem Stephen Hawking proposed in 1971. The theorem states that when two black holes merge, the event horizon of the resulting black hole, the boundary within which not even light can escape, cannot have an area smaller than the sum of the two original horizons' areas. It echoes the second law of thermodynamics, which says that entropy, the internal disorder of a system, never decreases. Merging black holes warp the fabric of the universe, producing tiny spacetime ripples known as gravitational waves that detectors can observe. The recent collision, designated GW250114, was nearly identical to the first gravitational-wave collision observed in 2015: in both events the black holes weighed 30-40 solar masses and merged 1.3 billion light-years away. But the upgraded LIGO detectors are now three times as sensitive as in 2015, capturing the waves from the collision in unprecedented detail. That let researchers confirm by calculation that the horizon area did indeed grow after the merger, validating Hawking's theorem.
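For intuition, the area theorem is easy to check numerically for non-spinning black holes, where the horizon area is A = 16π(GM/c²)². Spin matters in the real GW250114 analysis, and the masses below are only illustrative:

```python
import math

G, C = 6.674e-11, 2.998e8  # SI units
M_SUN = 1.989e30

def horizon_area(m_solar: float) -> float:
    """Schwarzschild horizon area A = 4*pi*r_s^2 with r_s = 2GM/c^2."""
    r_s = 2 * G * (m_solar * M_SUN) / C**2
    return 4 * math.pi * r_s**2

# Two ~35-solar-mass holes merging into ~67 (a few solar masses radiated):
before = horizon_area(35) + horizon_area(35)
after = horizon_area(67)
print(after > before)  # True: area grows, since A scales as M^2
```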
- French voice actress accuses Tomb Raider 4-6 Remastered of using AI to clone her voice
Françoise Cadol, the French voice of the Tomb Raider series, has sent a cease-and-desist letter to Aspyr, the developer of Tomb Raider 4-6 Remastered, accusing it of using AI to copy her voice without notifying her or telling players. She called the move a betrayal and an act of utter disrespect. Beyond the French version, players in Brazil, Spain, and other regions also believe the dubbing in their languages was AI-generated, with AI synthesizing the original actors' voices. Brazilian voice actress Lene Bastos received a reply from Aspyr saying its investigation found that an external development partner had used generative AI to edit the original audio without its knowledge; Aspyr said it had not authorized this and apologized for failing to catch the problem in review.
- AirPods live translation not coming to Europe or mainland China for now
At this week's launch event Apple introduced new AirPods with improved active noise cancellation, a custom heart-rate sensor, and support for live translation. The translation feature requires an iPhone 15 Pro or later running iOS 26. But when the AirPods go on sale, live translation will not be available to users in the EU or in mainland China, and Apple has offered little explanation. Apple China did say that the country's three major carriers will provide eSIM support for the iPhone Air.
- 99% of eels consumed worldwide are endangered species
A team from Japan's Chuo University and National Taiwan University found that more than 99% of eels consumed worldwide belong to three endangered species. The widely eaten species are the American eel, the Japanese eel, and the European eel, all assessed as threatened with extinction by the International Union for Conservation of Nature (IUCN). The global eel trade is largely opaque, making actual volumes hard to track; this study offers a glimpse of the real picture. The team ran genetic tests to identify the species of 282 processed and fresh eel products purchased between 2023 and 2025 in 26 cities across 11 countries and regions in Asia, Europe, the Americas, and Oceania: 154 were American eel, 120 Japanese eel, 4 European eel, and 1 Indonesian shortfin eel, with 3 that could not be analyzed. Combining these results with national production figures, trade statistics, and market sizes, the team estimated the global distribution at 75.3% American eel, 18.0% Japanese eel, and 6.7% European eel. By country, China accounts for about 60% of the flow and Japan about 19%, so East Asia likely accounts for the majority.
- Methane gas detected on the dwarf planet Makemake
Astronomers using the James Webb Space Telescope (JWST) have for the first time detected methane gas on the distant dwarf planet Makemake. The finding overturns the view of Makemake as a merely frozen body and makes it the second trans-Neptunian object, after Pluto, confirmed to have a tenuous gas envelope. Makemake was discovered in 2005 by a Caltech team; it has a radius of about 715 km, is slightly smaller and dimmer than Pluto, and orbits the Sun once every 305 years. Past stellar occultations showed no obvious atmosphere but could not rule out a thin one, while infrared data revealed odd thermal anomalies in the surface methane ice, hinting at local hot spots releasing gas. The team notes that Makemake is one of the largest icy trans-Neptunian objects known, with a surface dominated by methane ice. The recent JWST observations show a thin layer of methane gas above the icy surface, indicating an interior that is not dead but still changing. Makemake may have a very thin atmosphere similar to Pluto's, or the gas may come from more transient, localized geological activity such as cryovolcanic plumes or comet-like sublimation.
- Octopuses have preferred arms
Octopuses can perform tasks with any arm, but they prefer certain arms for certain tasks. An octopus arm is a complex structure of four distinct muscle groups, transverse, longitudinal, oblique, and circular, arranged around a central nerve cord. These let the arm deform in different ways, producing the range of movements behind behaviors from hunting and locomotion to defense. How wild octopuses use and coordinate their arms was previously unknown. Researchers analyzed 25 one-minute videos of wild octopuses filmed in the Atlantic and Caribbean between 2007 and 2015. They found that all of the octopuses could deform all eight arms in the four ways and could perform every action with every arm. Arms on both sides of the body were used equally, but the front four arms were used far more than the rear four (64% vs. 36%). Front arms were more likely to be used to explore the surroundings, rear arms to move the octopus around. Two movements in particular relied on the rear arms: rolling, in which an arm moves along the seabed beneath the octopus like a conveyor belt, and stilting, in which arms extend straight down to lift the body.