
Monthly Digest — 2025-08

478 unique stories across 31 days and 8 sources.

Hacker News (124)

  1. Atlassian terminates 150 staff (www.cyberdaily.au)
  2. Tesla must pay portion of $329M damages after fatal Autopilot crash, jury says (www.cnbc.com)
  3. I couldn't submit a PR, so I got hired and fixed it myself (www.skeptrune.com)
  4. Google shifts goo.gl policy: Inactive links deactivated, active links preserved (blog.google)
  5. AWS deleted my 10-year account and all data without warning (www.seuros.com)
  6. Helsinki records zero traffic deaths for full year (www.helsinkitimes.fi)
  7. 6 weeks of Claude Code (blog.puzzmo.com)
  8. Telo MT1 (www.telotrucks.com)
  9. Shrinking freshwater availability increasing land contribution to sea level rise (news.asu.edu)
  10. The Dollar Is Dead (mathmeetsmoney.substack.com)
  11. Modern Node.js Patterns (kashw1n.com)
  12. Yosemite embodies the long war over US national park privatization (theconversation.com)
  13. Qwen-Image: Crafting with native text rendering (qwenlm.github.io)
  14. I asked four former friends why we stopped speaking (2023) (www.vogue.com)
  15. Show HN: I spent 6 years building a ridiculous wooden pixel display (benholmen.com)
  16. DrawAFish.com Postmortem (aldenhallak.com)
  17. Spotting base64 encoded JSON, certificates, and private keys (ergaster.org)
  18. Ollama Turbo (ollama.com)
  19. US reportedly forcing TSMC to buy 49% stake in Intel to secure tariff relief (www.notebookcheck.net)
  20. Open models by OpenAI (openai.com)

GitHub Trending (72)

  1. OpenPipe / ART

    Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for Qwen2.5, Qwen3, Llama, Kimi, and more!

  2. TandoorRecipes / recipes

    Application for managing recipes, planning meals, building shopping lists and much much more!

  3. devlikeapro / waha

    WAHA - WhatsApp HTTP API (REST API) that you can configure in a click! 3 engines: WEBJS (browser based), NOWEB (websocket nodejs), GOWS (websocket go)

  4. puppeteer / puppeteer

    JavaScript API for Chrome and Firefox

  5. dyad-sh / dyad

    Free, local, open-source AI app builder | v0 / lovable / Bolt alternative | 🌟 Star if you like it!

  6. pointfreeco / swift-composable-architecture

    A library for building applications in a consistent and understandable way, with composition, testing, and ergonomics in mind.

  7. MotiaDev / motia

    Unified Backend Framework for APIs, Events, and AI Agents

  8. OpenBAS-Platform / openbas

    Open Adversary Exposure Validation Platform

  9. wg-easy / wg-easy

    The easiest way to run WireGuard VPN + Web-based Admin UI.

  10. eclipse-sumo / sumo

    Eclipse SUMO is an open source, highly portable, microscopic and continuous traffic simulation package designed to handle large networks. It allows for intermodal simulation including pedestrians and comes with a large set of tools for scenario creation.

  11. trekhleb / javascript-algorithms

    📝 Algorithms and data structures implemented in JavaScript with explanations and links to further readings

  12. souzatharsis / podcastfy

    An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI

  13. actualbudget / actual

    A local-first personal finance app

  14. reflex-dev / reflex

    🕸️ Web apps in pure Python 🐍

  15. ethereum / solidity

    Solidity, the Smart Contract Programming Language

  16. microsoft / mcp-for-beginners

    This open-source curriculum introduces the fundamentals of Model Context Protocol (MCP) through real-world, cross-language examples in .NET, Java, TypeScript, JavaScript, and Python. Designed for developers, it focuses on practical techniques for building modular, scalable, and secure AI workflows from session setup to service orchestration.

  17. nautechsystems / nautilus_trader

    A high-performance algorithmic trading platform and event-driven backtester

  18. simstudioai / sim

    Sim is an open-source AI agent workflow builder. Sim Studio's interface is a lightweight, intuitive way to quickly build and deploy LLMs that connect with your favorite tools.

  19. browserbase / stagehand

    The AI Browser Automation Framework

  20. lvgl / lvgl

    Embedded graphics library to create beautiful UIs for any MCU, MPU and display type.

Product Hunt (121)

  1. X-Design

    Smart AI suite for authentic lifestyle product images

  2. Ollama Desktop App

    The easiest way to chat with local AI

  3. Hecco AI

    Your personal health intelligence engine

  4. Mixio

    The AI livestreaming platform

  5. ZapDigits

    Your startup’s metrics, now all in one place

  6. Google Sans Code

    The new font meticulously crafted for coders from Google

  7. ZINQ AI

    Do more with forms

  8. SEO Speed Test

    Google & ChatGPT ignore slow pages, check if yours is fast!

  9. Cipher by Byterover

    Open-source, shared memory for coding agents

  10. Watchman AI

    Capturing invisible B2B buyers with AI agents

  11. Hypertune

    Type-safe feature flags, optimized for React and Next.js

  12. SciSpace Agent

    Only AI agent automating research with 150+ academic tools

  13. Spill

    Minimalist freewriting app

  14. Kanbanq : Open alpha

    Project management. Simply done. For small teams & indies

  15. Verbite

    SEO-ready content from AI Agents

  16. Indy AI by Contra

    Job boards are dead. Your network is alive

  17. Asteroid

    AI browser agents for your back office, built in seconds

  18. Embeddable

    Build interactive tools for your website by chatting with AI

  19. involve.me AI Agent

    Create and edit interactive funnels by chatting with AI

  20. SpeedVitals RUM

    Monitor real-user performance & web analytics

Hugging Face (77)

  1. Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

    LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine, Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
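    The refine-with-feedback loop the abstract describes can be sketched in a few lines. Everything below (the checker, the prover, the lemma bookkeeping, and the names `check_proof`, `propose`, `prove`) is a hypothetical stub standing in for Seed-Prover's actual Lean interface, shown only to illustrate the control flow.

```python
# Hypothetical sketch of an iterative prove/verify loop: propose a proof,
# get verifier feedback, and retry with that feedback plus proved lemmas.
# The "verifier" and "prover" here are toy stubs, not real Lean tooling.

def check_proof(proof: str) -> tuple[bool, str]:
    """Stub verifier: accepts only proofs that cite the needed lemma."""
    if "lemma_sum" in proof:
        return True, ""
    return False, "missing step: need lemma_sum"

def propose(goal: str, feedback: str, lemmas: list[str]) -> str:
    """Stub prover: folds feedback and proved lemmas into the next attempt."""
    body = " ".join(lemmas)
    if "lemma_sum" in feedback:
        body = (body + " lemma_sum").strip()
    return f"proof of {goal}: {body}"

def prove(goal: str, max_rounds: int = 4):
    feedback, lemmas = "", []
    for _ in range(max_rounds):
        attempt = propose(goal, feedback, lemmas)
        ok, feedback = check_proof(attempt)
        if ok:
            return attempt
        # self-summarization step: remember which lemmas the checker asked for
        lemmas = [tok for tok in feedback.split() if tok.startswith("lemma")]
    return None

result = prove("n*(n+1) is even")
```

    The point of the sketch is the loop shape: formal verification supplies the supervision signal that free-form natural language lacks, so each failed round produces usable feedback for the next attempt.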

  2. Phi-Ground Tech Report: Advancing Perception in GUI Grounding

    With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from ready for deployment, as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

  3. C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

    Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

  4. RecGPT Technical Report

    Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users' evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, while merchants and the platform gain greater exposure and conversions. These comprehensive improvements across all stakeholders validate that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.

  5. iLRM: An Iterative Large 3D Reconstruction Model

    Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

  6. Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

    Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and strong global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
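    The two phases can be illustrated with a toy mask-denoising model. The denoiser, the coarse length estimate, and the doubling schedule below are all stand-in stubs for illustration; they show the control flow (grow before denoising, insert masks mid-denoising when a region is insufficient), not DAEDAL's actual signals.

```python
# Toy sketch of two-phase dynamic length expansion for a mask-based DLLM.
# Phase 1 grows a short canvas to a coarse length before denoising;
# phase 2 inserts fresh mask tokens whenever the draft fills up without
# reaching EOS. The "denoiser" is a stub that just reveals the target.
MASK, EOS = "<mask>", "<eos>"

def denoise_step(canvas, answer):
    """Stub denoiser: fill the first masked slot (a real DLLM fills many)."""
    out = list(canvas)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = answer[i] if i < len(answer) else EOS
            break
    return out

def generate(answer, init_len=1, coarse_len=4):
    # Phase 1: before denoising, iteratively expand a short canvas toward a
    # coarse task-appropriate length (here a fixed stub estimate).
    canvas = [MASK] * init_len
    while len(canvas) < coarse_len:
        canvas = canvas + [MASK] * len(canvas)   # iterative doubling
    # Phase 2: denoise; if the canvas fills up without producing EOS, the
    # generation region is insufficient, so insert masks and continue.
    while True:
        if MASK not in canvas:
            if EOS in canvas:
                break
            canvas = canvas + [MASK] * 2
        canvas = denoise_step(canvas, answer)
    return [t for t in canvas if t != EOS]
```

    With a five-token target and a coarse estimate of four, the sketch expands once before denoising and once during it, ending with a fully developed output despite the short initial length.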

  7. PixNerd: Pixel Neural Field Diffusion

    The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.

  8. Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

    General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro

  9. 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

    Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we apply reinforcement learning with policy optimization methods such as GRPO to enhance reasoning capabilities, and we introduce three reward functions: a perception reward, a semantic similarity reward, and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.

  10. Qwen-Image Technical Report

    We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.

  11. SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

    Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with 7-8B parameters, despite having only 1B parameters itself. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
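    The core idea, encode each short chunk together with its neighbors but still return the short chunk as localized evidence, can be shown with a toy bag-of-words "encoder". This is not SitEmb's model (which is a trained embedding model built on BGE-M3); the `embed`, `index`, and `search` helpers below are illustrative stand-ins.

```python
# Toy illustration of "situated" chunk embeddings: chunk i is encoded with a
# window of neighboring chunks, so context words count toward its vector,
# but search still returns the short chunk itself as the retrieval unit.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in encoder: a bag-of-words vector instead of a neural embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index(chunks: list[str], window: int = 1):
    """Encode each chunk conditioned on its neighbors (the situating context)."""
    entries = []
    for i, chunk in enumerate(chunks):
        ctx = " ".join(chunks[max(0, i - window): i + window + 1])
        entries.append((chunk, embed(ctx)))
    return entries

def search(entries, query: str) -> str:
    qv = embed(query)
    return max(entries, key=lambda e: cosine(qv, e[1]))[0]

chunks = [
    "The detective arrived at the manor.",
    "She examined the letter carefully.",
    "The butler denied everything.",
]
hit = search(index(chunks), "letter butler")
```

    Note that the returned chunk need not contain every query term itself: its situated vector inherits "letter" from the neighboring chunk, which is exactly the dependency-across-chunks problem the paper targets.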

  12. CellForge: Agentic Design of Virtual Cell Models

    Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to quantitatively predict quantities such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system built on a multi-agent framework that transforms biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and running inference. The framework integrates three core modules: Task Analysis for dataset characterization and relevant literature retrieval; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution for automated code generation. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and must collaboratively exchange solutions until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.

  13. Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report

    Large language models (LLMs) have shown remarkable success across many domains, yet their integration into cybersecurity applications remains limited due to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. To address this gap, we previously introduced Foundation-Sec-8B, a cybersecurity-focused LLM suitable for fine-tuning on downstream tasks. That model, however, was not designed for chat-style interactions or instruction-following. In this report, we release Foundation-Sec-8B-Instruct: a model specifically trained for general-purpose cybersecurity dialogue. Built on Foundation-Sec-8B, it combines domain-specific knowledge with instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Comprehensive evaluations show that Foundation-Sec-8B-Instruct outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance. It is also competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks. We envision Foundation-Sec-8B-Instruct becoming an indispensable assistant in the daily workflows of cybersecurity professionals. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Instruct.

  14. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup that mitigates the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 tokens/s on H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks. It is significantly faster than the contemporary Mercury and Gemini Diffusion models, establishing a new state of the art on the speed-quality Pareto frontier for code models.
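    The source of the speedup, committing many positions per denoising step instead of one token per decode step, can be made concrete with a toy decoder. The "model" below is a stub that simply reveals the target sequence; only the step counting is the point.

```python
# Toy contrast between token-by-token decoding and parallel discrete-diffusion
# refinement: committing k masked positions per step cuts the number of decode
# steps from L to about ceil(L / k). The model is a stub revealing the target.
MASK = "<mask>"

def diffusion_decode(target: list[str], per_step: int) -> tuple[list[str], int]:
    canvas, steps = [MASK] * len(target), 0
    while MASK in canvas:
        # one denoising step: commit up to `per_step` masked positions at once
        for i in [j for j, t in enumerate(canvas) if t == MASK][:per_step]:
            canvas[i] = target[i]
        steps += 1
    return canvas, steps

target = "the quick brown fox jumps over the lazy dog".split()  # 9 tokens
seq, seq_steps = diffusion_decode(target, per_step=1)  # token-by-token: 9 steps
par, par_steps = diffusion_decode(target, per_step=4)  # parallel: 3 steps
```

    In a real model the per-step budget and the choice of which positions to commit are learned, and quality depends on them; the sketch only shows why fewer sequential steps translate into lower decoding latency.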

  15. Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

    We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024×1024 images with under 15 GB of GPU memory (e.g., on an RTX 4090). Three designs underpin these results: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256×256 to 1024×1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.

  16. LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

    Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.

  17. CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

    Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
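    To see why rule-based verification needs so much per-task customization, here is a minimal regex-plus-normalization verifier of the kind the paper contrasts with its trained model. The `extract_answer`, `normalize`, and `verify` helpers are illustrative stand-ins, not CompassVerifier's interface.

```python
# Minimal rule-based sketch of the answer-verification task: pull a final
# answer out of unstructured model output, normalize it, and compare it to
# the reference. Handles a couple of edge cases (equivalent number formats,
# empty/invalid responses); a trained verifier generalizes far beyond this.
import re
from fractions import Fraction

def extract_answer(output: str):
    if not output.strip():
        return None                         # abnormal/invalid response
    m = re.search(r"(?:answer is|=)\s*([^\s,;]+)", output, re.IGNORECASE)
    token = m.group(1) if m else output.strip().split()[-1]
    return token.rstrip(".")                # drop trailing sentence punctuation

def normalize(ans: str):
    try:
        return Fraction(ans)                # "1/2" and "0.5" both become 1/2
    except ValueError:
        return ans.strip().lower()          # fall back to case-folded text

def verify(output: str, reference: str) -> bool:
    got = extract_answer(output)
    return got is not None and normalize(got) == normalize(reference)
```

    Every new answer type (formulas, sequences, multi-subproblem outputs) would need another extraction and normalization rule here, which is exactly the repetitive customization burden that motivates a learned verifier.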

  18. VeriGUI: Verifiable Long-Chain GUI Dataset

    Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.

  19. Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

    Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.

  20. Efficient Agents: Building Effective Agents While Reducing Cost

    The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework that matches complexity to task requirements. Efficient Agents retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational cost from 0.398 to 0.228, a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.
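Cost-of-pass, the metric named above, is commonly defined as the expected inference cost per successfully completed task; a minimal sketch under that assumption (the paper's exact formulation may differ, and the example numbers below are illustrative, not the paper's):

```python
def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to obtain one correct solution.

    If attempts are independent, E[attempts until success] = 1 / success_rate,
    so expected cost per solved task is cost_per_attempt / success_rate.
    """
    if success_rate <= 0:
        return float("inf")  # an agent that never succeeds has unbounded cost
    return cost_per_attempt / success_rate

# A cheaper framework with slightly lower accuracy can still win on cost-of-pass:
baseline = cost_of_pass(0.30, 0.75)   # 0.40 per solved task
efficient = cost_of_pass(0.17, 0.70)  # ~0.243 per solved task
```

This is why the abstract can report a large cost-of-pass improvement while retaining most of the baseline's raw performance: the metric rewards cutting per-attempt cost faster than accuracy falls.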

Solidot(84)

  1. Google will use AI to estimate the age of US users

    Google announced it will use AI to estimate whether US users are 18 or older. Age estimation will roll out in the coming weeks, initially affecting only a small number of users, with plans to expand further. Google says it will judge a user's age from signals such as what they have searched for and the types of YouTube videos they watch. If Google determines a user is under 18, it will apply the same restrictions it imposes on minor accounts.

  2. Apple's Q2 revenue in China reaches $15.37 billion

    Apple reported its strongest quarterly revenue growth since 2021, with iPhone sales up 13% and total revenue up 10%. CEO Tim Cook said about one percentage point of the revenue growth could be attributed to consumers buying ahead of potential tariffs. The iPhone remains Apple's most important product, with sales up 13% year over year to $44.58 billion. Apple's sales in China grew 4% year over year to $15.37 billion; Cook said a major factor was China's national subsidy program, which has been very helpful for the company's products.

  3. Battlefield 6's community level editor is built on the open-source Godot engine

    EA ambitiously hopes Battlefield 6 will attract tens of millions of players and run as a long-term live service. The game will offer a free battle-royale mode as well as Battlefield Portal, a level editor that lets players create custom maps and modes. The editor is built on the open-source Godot game engine: content players create in Godot is converted through a translation layer into Frostbite 4, the proprietary engine Battlefield 6 uses. The Blender Foundation previously built a small game using the Godot engine and Blender 3D software; it is not yet clear whether EA or DICE will donate to the Godot project.

  4. Brazil releases lab-bred mosquitoes to curb dengue

    To stop mosquitoes from spreading dengue virus, Brazil will release millions of lab-bred mosquitoes carrying Wolbachia bacteria, which spread through mosquito populations and prevent the insects from carrying the virus. The project aims to protect 140 million residents across 40 cities over the next decade. Brazil previously trialed Wolbachia-carrying mosquitoes in the city of Niteroi with striking results: dengue cases fell by about 90%. Nearly all mosquitoes in the city now carry Wolbachia, and chikungunya and Zika cases have fallen by more than 96% and 99% respectively. Wolbachia occurs naturally in roughly half of all insect species and blocks dengue virus from replicating inside mosquitoes, effectively curbing its transmission.

  5. India will penalize universities with too many retractions

    India's national university rankings will penalize universities whose researchers have large numbers of papers retracted, a move aimed at curbing the growing problem of retractions driven by scientific misconduct. Some retractions stem from honest error, but others from deliberate misconduct. According to a Retraction Watch analysis of a database covering 30 years of retractions, India trails only China and the US in retraction counts. The US retracts fewer than 1 paper per 1,000 published, China more than 3 per 1,000, and India 2 per 1,000. Most retractions in India and China are due to misconduct or research-integrity issues.

  6. Belgium restricts access to the Internet Archive's Open Library

    A Brussels commercial court has issued an injunction aimed at restricting access to shadow libraries; the affected sites include Anna's Archive, Libgen, OceanofPDF, Z-Library, and the Internet Archive's Open Library. Beyond ISPs, search engines, DNS resolvers, advertisers, domain registrars, content delivery networks (CDNs), and hosting providers must also act to block access to these sites. Open Library was founded by the late Aaron Swartz, Internet Archive founder Brewster Kahle, and others to archive every published book and let readers borrow them online. Like other e-libraries, it lends only one copy of each book at a time; unlike them, its e-books are not licensed but created by scanning physical copies.

  7. Google changes its plan to shut down goo.gl short links

    The search giant announced last year that it would shut down the Google URL Shortener (goo.gl/*) on August 25, 2025, at which point all goo.gl links would stop resolving. With less than a month to go, and after concerns from developers, educators, journalists, and others who rely on goo.gl links, Google changed its mind and adopted a softer stance: it will only disable goo.gl links that have shown no activity since late 2024, while links that are actively used or clicked will keep working.

  8. 17-year-old Hannah Cairo solves a 40-year-old math conjecture

    In February 2025, Hannah Cairo posted a paper on the preprint server arXiv resolving the 40-year-old Mizohata-Takeuchi conjecture. Just 17 and largely self-taught, she stunned the mathematical community: she proved the conjecture false. Cairo grew up in Nassau, in the Bahamas, where her father, a programmer, had taken a job that brought the family there. She has a brother three years older and a brother eight years younger, and the children were all homeschooled. Cairo learned math through Khan Academy's online courses and had finished calculus by age 11. Her parents found several math professors to tutor her remotely, but she remained mostly self-taught, so much so that one of them, Amir Aazami of Clark University, felt guilty accepting payment. By 14 she had worked through upper-level undergraduate math. In 2021, the COVID-19 pandemic left the family stuck at her grandparents' home in Chicago, which turned out to be a blessing: she began widening her mathematical circle and meeting more and more peers. In 2023 she applied to many universities, but most rejected her because she had not finished high school. She followed her brother to UC Berkeley and took advanced math courses, among them a graduate class on Fourier restriction theory taught by Ruixiang Zhang. A few weeks in, Zhang assigned a simplified version of the Mizohata-Takeuchi conjecture as homework, mainly to encourage students to explore advanced techniques in the field. Cairo solved the exercise and, with Zhang's encouragement, kept going, eventually constructing a function that disproved the conjecture. After completing the proof, she decided to skip an undergraduate degree and go straight into a math PhD. Having never finished college, she was again rejected by several universities; only the University of Maryland and Johns Hopkins offered admission. She chose Maryland, where she starts this fall; when she completes the program, it will be her first degree.

  9. Russian segment of the ISS is still leaking air

    Sergey Krikalev, executive director of Roscosmos's human spaceflight programs, acknowledged that the Russian segment of the International Space Station (ISS) is still leaking air. The leak was first discovered in 2019, and despite repeated efforts to locate and patch it, the station continues to lose air. The crew aboard is not in immediate danger, but the state of the cracks in the aging structure remains unsatisfactory. The leak has lessened but persists. Russian and US scientists are working to resolve the problem, trace its root cause, and ensure nothing similar happens to the station in the future.

  10. Linux share among Steam users approaches 3%

    Valve's July 2025 Steam hardware and software survey shows Linux approaching 3% of players' operating systems at 2.89% (up 0.32 points), with Windows down 0.44 points to 95.23% and OSX at 1.88%. The Linux share is near its all-time high, a trend driven mainly by the Steam Deck handheld. Among PC processors, Intel CPUs fell 0.75 points to 59.52%, dropping below 60%, while AMD CPUs rose 0.74 points to 40.39%. Among user languages, Simplified Chinese fell 1.29 points to 25.44%, while English stands at 37.70%.

  11. Scientists develop a painkiller as potent as morphine without its serious side effects

    Scientists at Kyoto University in Japan have developed a painkiller as effective as morphine but without its serious side effects. Morphine, often used by cancer patients, carries severe side effects including respiratory depression and addiction. The new drug, Adrian, works in a way entirely different from morphine and existing synthetic opioids; the research team claims it could transform pain control in medicine and help address the opioid abuse problem. When a person faces a life-threatening situation, the brain releases norepinephrine to suppress pain. The new research focused on the mechanism by which the body regulates excessive norepinephrine release, and by introducing new techniques the team succeeded for the first time in developing a drug that blocks this regulation. The scientists plan to begin clinical trials in the US in 2026, with practical use targeted for 2028.

  12. Shining lasers through the human brain

    Scientists rely on two main tools to understand how the brain works, each with its own strengths and weaknesses: electroencephalography (EEG) is cheap and portable but cannot read signals beyond the brain's outer cortex, while functional magnetic resonance imaging (fMRI) is expensive and bulky but can probe deep into the brain. A research team at the University of Glasgow has now found a technique that combines the best of both: cheap and portable like EEG, yet able to read deep-brain signals like fMRI. They fire millions of photons from a laser into one side of the head and measure when they arrive at the other side. Because only a tiny fraction of photons make it all the way through the brain, a major challenge of the work is suppressing background noise. The technique is still some way from practical use, and the researchers have more obstacles to overcome.

  13. Ultra-processed diets are less effective for weight loss

    UK scientists have found that an ultra-processed diet may be less effective than a minimally processed one for losing weight and reducing cardiometabolic disease risk, even when both diets follow the same national dietary guidelines. The findings come from a community-based clinical trial of 55 UK adults and reveal a possible effect of food processing on specific health outcomes, beyond overall nutritional composition. Global consumption of ultra-processed foods has grown rapidly in recent decades, alongside rising rates of obesity and chronic conditions such as type 2 diabetes and cardiovascular disease. The researchers ran a randomized crossover trial comparing a diet dominated by ultra-processed foods with one dominated by minimally processed foods, both compliant with the UK's Eatwell Guide, a set of national recommendations for healthy, balanced eating. The 55 adults received home deliveries for 8 weeks of either prepared ultra-processed foods, such as breakfast cereals or ready-made lasagna, or prepared minimally processed foods, such as overnight oats or homemade bolognese. After a 4-week washout, participants switched to the other diet for another 8 weeks, allowing a within-person comparison of the two diets over the six-month period; 50 participants completed at least one diet. Both guideline-compliant diets produced significant weight loss over 8 weeks, but average weight loss on the minimally processed diet was 2% versus only 1% on the ultra-processed diet. Beyond weight, the minimally processed diet more effectively improved body-composition measures tied to cardiometabolic health, such as lowering total fat, visceral fat, and triglyceride levels, although LDL cholesterol was lower after the ultra-processed diet.

  14. Tesla accused of withholding data, lying, and misleading police in Autopilot crash case

    A jury last week found Tesla partly liable in a wrongful-death case arising from a crash involving Autopilot. Court records show Tesla tried to pin all blame on the driver while actively withholding key evidence of how Autopilot behaved before and after the crash. Within three minutes of the crash, the car uploaded a "collision snapshot" (video, CAN-bus streams, EDR data, and more) to Tesla's servers and then deleted the local copy, leaving Tesla the only entity with access to the key evidence. It took police years to get Tesla to admit the snapshot existed; experts who forensically recovered data from the car's onboard computer confirmed Tesla had possessed the collision snapshot all along, even as the company insisted the data did not exist.

  15. TSMC accuses former employees of stealing 2nm chip technology secrets

    TSMC has accused former employees of stealing trade secrets related to its 2-nanometer chip process; the A20 chip in Apple's iPhone 18 series is among the first built on the 2nm process. According to reports, after discovering the suspected leak, TSMC immediately filed a complaint with the High Prosecutors Office, which directed the Investigation Bureau to conduct waves of searches and summons on July 25 and 28. A former engineer surnamed Chen and nearly ten TSMC engineers in advanced-process trial production and R&D are reportedly implicated. Chen had worked in TSMC's systems integration department and, after leaving, joined Tokyo Electron, a long-time TSMC equipment supplier, as an equipment engineer, where his familiarity with TSMC's advanced-process R&D staff made him the liaison with its R&D department. Investigators have established how Chen obtained the secrets: TSMC engineers displayed process technology diagrams on their computer screens, and Chen photographed them directly with his phone. From the screens of two engineers, one surnamed Wu, he reportedly took more than 700 and nearly 300 photos of process technology respectively. Several other TSMC engineers each provided single-digit numbers of less sensitive process diagrams; their conduct was deemed less serious and they were not detained.

  16. Trump threatens 100% tariff on chips unless makers build, or commit to building, US plants

    US President Trump said Wednesday that he would impose a 100% tariff on imported semiconductors and chips, without giving specifics. To be exempt, semiconductor makers must either build plants in the US or commit to doing so. Trump said: "We'll be putting a large tariff on chips and semiconductors, but the good news for companies like Apple is, if you're building in the United States, or have committed to build in the United States, there will be no charge." Apple had earlier pledged to invest $100 billion in the US over the next four years to boost American manufacturing.

  17. Japan bans Apple's iOS restrictions on third-party browser engines

    Japan recently passed a smartphone law known as the Bill on the Promotion of Competition for Specified Software Used in Smartphones, which among other things bans Apple's practice of restricting third-party browser engines on its iOS platform. Third-party browsers on Apple's platform have been required to use its browser engine, WebKit; the iOS versions of Firefox, Chrome, Edge, Opera, Brave, and Vivaldi are all WebKit reskins, which has left iOS with little browser competition. Last week Japan published the Mobile Software Competition Act (MSCA) Guidelines, which explicitly prohibit this Apple policy. The MSCA takes effect in December 2025, and enforcing it will be a major challenge; the EU and UK have enacted similar laws.

  18. Grok generated nude imagery of Taylor Swift without being asked

    Grok, the chatbot from Elon Musk's AI company xAI, was found to have generated nude imagery of the singer Taylor Swift without users requesting it. A user entered the prompt "Taylor Swift celebrating Coachella with the boys" and selected the "spicy" preset to generate a video; Grok produced a video of Swift stripping and dancing in a thong in front of an AI-generated crowd. With the Take It Down Act set to take effect next year, xAI could face legal consequences if its platform allows AI-generated deepfake nudes.

  19. Wikipedia editors adopt a speedy-deletion policy for AI-generated articles

    Wikipedia editors have adopted a new policy to cope with the flood of AI-generated articles. The policy allows administrators to quickly delete AI-generated articles that meet certain criteria; Wikipedia's previous deletion process typically required up to a week of discussion. For AI-generated articles, once an article has been flagged and reviewed for eligibility, administrators can delete it without discussion. Two criteria govern speedy deletion: one is the presence of obvious LLM responses to prompts, such as "Here is your Wikipedia article on…", "Up to my last training update …", or "as a large language model"; the other is a mistake LLMs frequently make, citing sources that do not exist or are clearly wrong.
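The first criterion above, telltale LLM boilerplate left in the article text, lends itself to a simple string check. A minimal sketch using only the phrases quoted in the item (the actual Wikipedia criteria are broader and applied by human reviewers):

```python
# Telltale phrases quoted in the policy item; matching is case-insensitive.
LLM_TELLS = (
    "here is your wikipedia article on",
    "up to my last training update",
    "as a large language model",
)

def has_llm_boilerplate(article_text: str) -> bool:
    """Flag text containing obvious LLM prompt-response boilerplate."""
    text = article_text.lower()
    return any(tell in text for tell in LLM_TELLS)
```

The second criterion, fabricated citations, cannot be checked this cheaply: it requires resolving each reference against real sources, which is why flagged articles still pass through human review before deletion.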

  20. Major Chinese solar companies cut nearly a third of their workforce last year

    Data show that China's major solar companies cut nearly a third of their staff last year. LONGi Green Energy, Trina Solar, JinkoSolar, JA Solar, and Tongwei Group together shed about 87,000 jobs, an average of 31% of their workforces. The layoffs underscore how overcapacity and weak demand have pushed these firms into price wars. The world produces twice as many solar panels each year as it uses, with most made by Chinese companies.