Monthly Digest — 2025-11
343 unique stories across 30 days and 8 sources.
Hacker News (120)
- Claude Code Can Debug Low-Level Cryptography (words.filippo.io)
- Visible from space, Sudan's bloodied sands expose a massacre of thousands (www.telegraph.co.uk)
- Show HN: Why write code if the LLM can just do the thing? (web app experiment) (github.com)
- OpenAI Moves to Complete Potentially the Largest Theft in Human History (thezvi.substack.com)
- Paris Had a Moving Sidewalk in 1900, and a Thomas Edison Film Captured It (2020) (www.openculture.com)
- Linux gamers on Steam cross over the 3% mark (www.gamingonlinux.com)
- Lisp: Notes on its Past and Future (1980) (www-formal.stanford.edu)
- Using FreeBSD to make self-hosting fun again (jsteuernagel.de)
- </> Htmx – The Fetch()ening (htmx.org)
- Israel's top military lawyer arrested after she admitted leaking video of abuse (www.theguardian.com)
- Why we migrated from Python to Node.js (blog.yakkomajuri.com)
- Learning to read Arthur Whitney's C to become smart (2024) (needleful.net)
- NoLongerEvil-Thermostat – Nest Generation 1 and 2 Firmware (github.com)
- Codemaps: Understand Code, Before You Vibe It (cognition.ai)
- We're open-sourcing the successor of Jupyter notebook (deepnote.com)
- Michael Burry, a.k.a. "Big Short", discloses $1.1B bet against Nvidia & Palantir (sherwood.news)
- Photos: New Phoenix Microcenter is a 'tech-heaven' for geeks (www.phoenixnewtimes.com)
- I was right about dishwasher pods and now I can prove it [video] (www.youtube.com)
- Solarpunk is happening in Africa (climatedrift.substack.com)
- New gel restores dental enamel and could revolutionise tooth repair (www.nottingham.ac.uk)
GitHub Trending (26)
- get-convex / chef
The only AI app builder that knows backend
- suitenumerique / docs
A collaborative note taking, wiki and documentation platform that scales. Built with Django and React.
- Tencent / WeKnora
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.
- janhq / jan
Jan is an open source alternative to ChatGPT that runs 100% offline on your computer.
- 666ghj / BettaFish
微舆: a multi-agent public opinion analysis assistant anyone can use. It breaks information cocoons, reconstructs the full picture of public sentiment, predicts future trends, and aids decision-making. Implemented from scratch, with no dependency on any framework.
- Wei-Shaw / claude-relay-service
CRS: a self-hosted Claude Code mirror. A one-stop open-source relay service that puts Claude, OpenAI, Gemini, and Droid subscriptions behind a single unified endpoint, supports carpool-style sharing to split costs more efficiently, and works seamlessly with the native tools.
- microsoft / agent-lightning
The absolute trainer to light up AI agents.
- HKUDS / DeepCode
"DeepCode: Open Agentic Coding (Paper2Code & Text2Web & Text2Backend)"
- GeeeekExplorer / nano-vllm
Nano vLLM
- charmbracelet / glow
Render markdown on the CLI, with pizzazz! 💅🏻
- sst / opentui
OpenTUI is a library for building terminal user interfaces (TUIs)
- mudler / LocalAI
🤖 The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P and decentralized inference
- Skyvern-AI / skyvern
Automate browser based workflows with AI
- nocobase / nocobase
NocoBase is the most extensible AI-powered no-code/low-code platform for building business applications and enterprise solutions.
- prometheus / alertmanager
Prometheus Alertmanager
- simstudioai / sim
Open-source platform to build and deploy AI agent workflows.
- lima-vm / lima
Linux virtual machines, with a focus on running containers
- usestrix / strix
✨ Open-source AI hackers for your apps 👨🏻💻
- umami-software / umami
Umami is a modern, privacy-focused alternative to Google Analytics.
- thinking-machines-lab / tinker-cookbook
Post-training with Tinker
Hugging Face (87)
- The End of Manual Decoding: Towards Truly End-to-End Language Models
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set"-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
- Emu3.5: Native Multimodal Models are World Learners
We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
- Kimi Linear: An Expressive, Efficient Attention Architecture
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios, including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6x the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
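For intuition, the classical delta rule that KDA builds on fits in a few lines (a textbook NumPy toy with a scalar write gate; the paper's contribution is a finer-grained learned gate plus an efficient chunkwise DPLR kernel, neither of which is shown here):

```python
# Classical delta-rule fast-weight update: make key k map to value v.
import numpy as np

rng = np.random.default_rng(0)
d = 4
S = np.zeros((d, d))                 # finite-state "fast weight" memory

for _ in range(8):
    k = rng.standard_normal(d)
    k /= np.linalg.norm(k)           # unit-norm key
    v = rng.standard_normal(d)       # value to associate with k
    beta = 1.0                       # scalar gate; KDA makes this finer-grained
    # Erase whatever S currently returns for k, then write v in its place.
    S = S - beta * np.outer(S @ k, k) + beta * np.outer(v, k)

out = S @ k                          # query with the most recent key
print(np.allclose(out, v))           # True: with beta = 1 the association is exact
```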
- Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
- OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
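The hybrid split reads roughly like the following stick-figure sketch (structure inferred from the abstract; the rule table, the stubbed judge, and the threshold are all hypothetical stand-ins, with the real Contextual Judge being a VLM):

```python
# Hybrid safety validation: hard rules first, contextual scoring second.
FORBIDDEN_APIS = {"factory_reset", "send_premium_sms"}   # assumed system-level rules

def formal_verifier(action: dict) -> bool:
    """Explicit, checkable violations (the Formal Verifier's role)."""
    return action["api"] in FORBIDDEN_APIS

def contextual_judge(action: dict, context: str) -> float:
    """Stand-in for the VLM judge: returns a risk score in [0, 1] from context."""
    return 0.9 if "bank" in context and action["api"] == "tap" else 0.1

def validate(action: dict, context: str, threshold: float = 0.5) -> str:
    if formal_verifier(action):
        return "block"                       # hard violation, no judgment needed
    if contextual_judge(action, context) > threshold:
        return "flag"                        # contextually risky, escalate
    return "allow"

print(validate({"api": "tap"}, "bank transfer confirmation screen"))  # flag
```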
- ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
- INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
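For concreteness, block-wise INT8 quantization in the MX style (block size 32, one shared scale per block) can be toy-modeled as below; the symmetric clip to ±127 echoes the kind of symmetric scheme the abstract mentions, though the paper's actual method concerns training gradients and the details here are my assumptions:

```python
# Toy MX-style block-wise INT8 quantization (one scale per 32-value block).
import numpy as np

def quantize_blockwise_int8(x: np.ndarray, block: int = 32):
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0   # per-block scale
    scale[scale == 0] = 1.0                                 # avoid divide-by-zero
    q = np.clip(np.round(xb / scale), -127, 127).astype(np.int8)  # symmetric clip
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 32).astype(np.float32)
q, s = quantize_blockwise_int8(x)
print("max abs error:", float(np.abs(dequantize(q, s) - x.reshape(-1, 32)).max()))
```

Small blocks keep each scale close to its local magnitudes, which is why fine granularity narrows the gap between INT and FP in the first place.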
- π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., π_0, π_0.5) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with π_RL, an open-source framework for training flow-based VLAs in parallel simulation. π_RL implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate π_RL on LIBERO and ManiSkill benchmarks. On LIBERO, π_RL boosts few-shot SFT models π_0 and π_0.5 from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train π_RL in 320 parallel environments, improving π_0 from 41.6% to 85.7% and π_0.5 from 40.0% to 84.8% across 4,352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, π_RL achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
- Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models, Ling-mini-2.0, Ling-flash-2.0, and Ling-1T, ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.
- The Underappreciated Power of Vision Models for Graph Structural Understanding
Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.
- Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
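The sampling-feedback-update loop it mirrors is plain REINFORCE over edge-inclusion probabilities. A numeric toy follows (my simplification: the paper replaces this score-function gradient with textual feedback from an LLM agent, and the reward with joint accuracy/latency objectives):

```python
# Toy REINFORCE over a probabilistic collaboration graph's edges.
import numpy as np

rng = np.random.default_rng(0)
n_edges, lr = 6, 0.05
p = np.full(n_edges, 0.5)                 # per-edge inclusion probability
target = np.array([1, 0, 1, 1, 0, 0])     # hypothetical best topology (unknown to the loop)
baseline = 0.0

for _ in range(500):
    g = (rng.random(n_edges) < p).astype(float)     # sample a graph
    reward = (g == target).mean()                   # stand-in for a task score
    score = (g - p) / (p * (1 - p))                 # d log Bernoulli(g; p) / dp
    p = np.clip(p + lr * (reward - baseline) * score, 0.02, 0.98)
    baseline = 0.9 * baseline + 0.1 * reward        # running baseline cuts variance

print(np.round(p, 2))   # probabilities drift toward the high-reward topology
```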
- UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
- Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
- VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
- When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images, such as sketches, structural diagrams, or path drawings, to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers, focusing on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest proprietary models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
- When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of the model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. The relative difficulty level at which the model follows both modalities with comparable probability, which we call the balance point, serves as a practical indicator of the model's inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
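A small illustration of the paper's uncertainty measure (setup and numbers are mine, not the authors'): per-case entropy of each unimodal answer distribution, with the gap between them serving as the relative reasoning uncertainty.

```python
# Entropy gap between unimodal predictions as "relative uncertainty".
import numpy as np

def entropy(p) -> float:
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

vision_probs = [0.70, 0.20, 0.10]   # hypothetical answer distribution from the image
text_probs   = [0.40, 0.35, 0.25]   # hypothetical distribution from the text

gap = entropy(text_probs) - entropy(vision_probs)
print(f"relative uncertainty (text - vision): {gap:.3f}")
# Positive gap: text is the less certain modality, so by the paper's law the
# probability of following text should fall as the gap grows; whichever way a
# model leans when the gap is near zero reveals its inherent preference.
```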
- Diffusion Language Models are Super Data Learners
Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
- LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.
- Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
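The three sparsity patterns compose into a single attention mask; a toy version follows (sizes and pattern mix are my choices, not the released architecture):

```python
# Block-sparse attention mask: windowed + global + random connectivity.
import numpy as np

n, window, n_global, n_random = 16, 2, 2, 8
rng = np.random.default_rng(0)
mask = np.zeros((n, n), dtype=bool)

for i in range(n):                       # windowed: attend to nearby positions
    mask[i, max(0, i - window):min(n, i + window + 1)] = True
mask[:, :n_global] = True                # global tokens are visible to everyone
mask[:n_global, :] = True                # ...and see everyone
for _ in range(n_random):                # random long-range links for connectivity
    i, j = rng.integers(0, n, size=2)
    mask[i, j] = mask[j, i] = True

print(f"mask density: {mask.mean():.2f}")   # well under 1.0, so sub-quadratic work
```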
Solidot (110)
- Trump orders the US to resume nuclear weapons testing
US President Trump has directed the Department of War to resume nuclear weapons testing, a move likely meant to preserve the superiority of the US nuclear arsenal. The last US nuclear test was an underground test in Nevada on September 23, 1992. The US has conducted 1,054 nuclear tests in total, followed by the Soviet Union with 715, France with 210, and the UK and China with 45 each; China's last test was in July 1996. Since the Comprehensive Nuclear-Test-Ban Treaty was adopted by the UN General Assembly in September 1996, only India, Pakistan, and North Korea have conducted nuclear tests.
- Bazzite fall update released
Universal Blue, the gaming-focused distribution project, has announced the Bazzite fall update. Bazzite is based on Fedora; the latest update moves to Fedora 43, with GNOME 49 and KDE Plasma 6.4.5 as desktop environments. Other changes include full support for the Microsoft/ASUS Xbox handhelds, the Xbox Ally and Xbox Ally X; improved support for Lenovo's Legion Go 2 handheld; support for the Intel-based OneXPlayer X1 Air, though without fine-grained TDP control for now; support for the SuiPlay0X1; and more.
- Israel demanded Google and Amazon use a secret "wink" signal to warn of foreign governments' data disclosure orders
When Google and Amazon signed a $1.2 billion cloud computing deal in 2021, their customer, the Israeli government, made an unusual demand that in effect required the two companies to sidestep legal obligations around the world. Israel demanded a mechanism known as the "wink" to warn it of data disclosure orders from other countries. To comply with local laws, tech giants like Google and Amazon routinely hand over customer data under law enforcement orders, and law enforcement often requires the companies to keep the investigation confidential, barring any disclosure that customer data has been handed over. Unwilling to lose control of its data, Israel required Google and Amazon to tip it off through the wink mechanism: small payments made in the form of special compensation. According to leaked Israeli finance ministry documents, a wink payment must be made within 24 hours of a disclosure, with the amount keyed to the requesting country's telephone dialing code, between 1,000 and 9,999 shekels. If Google or Amazon were compelled to hand Israeli user data to the US government (dialing code +1) under a gag order, it would pay Israel 1,000 shekels; for the Italian government (dialing code +39), 3,900 shekels; and if a country's secrecy rules barred even naming the country, 100,000 shekels. Google's and Amazon's cloud divisions both deny circumventing any legal obligations.
- UK under pressure to ban mercury amalgam fillings
The UK is one of the few countries that has not yet banned mercury amalgam dental fillings, and it faces mounting pressure after new data revealed worrying levels of mercury contamination in fish and shellfish. Mercury is a potent neurotoxin; even low-level exposure can damage the nervous, digestive, and immune systems, as well as the lungs, kidneys, skin, and eyes. Its organic form, methylmercury, is especially dangerous to fetuses, and mercury accumulates up the food chain in insects, fish, and birds. According to the UK Environment Agency, crematoria are the second-largest source of mercury emissions after power plants, releasing 593 kg of mercury per year. Analysis by the Rivers Trust and Wildlife and Countryside Link found that over 98% of the fish and mussels sampled from UK rivers and coastal waters exceeded the EU's proposed mercury safety limit, and more than half exceeded five times the recommended safe level.
- Lapses in attention may be the brain clearing out waste
Struggling to concentrate after a bad night's sleep may happen because the brain is trying to refresh itself, causing brief attention lapses. During sleep the brain runs a rinse cycle: cerebrospinal fluid (CSF) is repeatedly flushed into the brain and drained out from its base, clearing the metabolic waste accumulated during the day that would otherwise damage brain cells. MIT scientists wondered whether the attention lapses that typically follow sleep deprivation might be the awake brain attempting this self-flushing. They ran a two-phase study: in the first phase, 26 participants aged 19 to 40 got a good night's sleep and were fully rested; in the second, two weeks later, the same participants stayed awake all night in the lab. Sleep deprivation made it harder for participants to focus, and when the researchers analyzed brain scans, they found attention dropping out about two seconds before CSF flowed out of the base of the brain, with CSF flushing back in about one second after attention returned. The results suggest that when the brain cannot clean itself during sleep, it does so while you are awake, at the cost of attention.
- OpenAI may be too big to fail
OpenAI is not yet profitable, and its annual revenue is only about 2% of Amazon's. With its corporate restructuring largely complete, a future listing could make it the first company to IPO at a $1 trillion valuation. It has struck intricate deals with marquee tech firms such as Nvidia and Oracle, committing to invest in and purchase as much as a trillion dollars of compute. Through this string of enormous deals, OpenAI appears to have become "too big to fail": an actual collapse could pose a systemic risk to the whole economy. To some, OpenAI is Apple, Facebook, Google, and Tesla rolled into one, a company of boundless potential that could upend the smartphone market, build its own social network, displace search engines, usher in the robotics era, and reshape every business and industry. To others, OpenAI looks like the Dutch Tulip Mania, a harbinger of depression, the next dot-com bubble; in their eyes it is a mad scientist bent on creating Frankenstein's monster, and a driver of rising unemployment.
- Social media platforms agree to comply with Australia's teen social media ban
The world's major social media platforms have agreed to comply with Australia's ban on social media for under-16s. Meta, Snap, and TikTok confirmed to the Australian parliament that they will begin removing and deactivating more than a million underage accounts once the law takes effect on December 10. Companies that fail to block underage users face fines of up to US$32.5 million. Before deactivation, teenagers can choose to download their data, and some platforms will retain the data until the users turn 17. Age verification is not expected to be perfect at first: some underage users may not be correctly identified, while some adults may be misidentified as minors.
- South Korea requires solar canopies over parking lots
Starting this month, every parking lot in South Korea with more than 80 spaces must install solar canopies and carports. The new law applies not only to new lots but to existing ones as well. In August the Ministry of Trade, Industry and Energy announced revisions to the enforcement rules of the Act on the Promotion of the Development, Use and Diffusion of New and Renewable Energy, requiring all public and private parking lots with more than 80 spaces to add solar panels. The move aims to aggressively expand renewable energy and create more solar and construction jobs. Solar carports also protect cars from heavy rain, snow, and scorching summer weather, keeping interiors cool, extending the life of plastics and seat fabrics, and even extending the range of EVs and plug-in hybrids by reducing air-conditioning load.
- Kioxia and Nvidia to build SSDs that connect directly to GPUs
Kioxia will work with Nvidia on SSDs that exchange data directly with GPUs, with products planned for market before 2027 as a replacement for part of the HBM DRAM. SSDs normally reach the GPU through the CPU; the new product is planned to support the PCIe 7.0 interface. GPU-based AI computing relies chiefly on HBM, an ultra-high-speed form of DRAM, but DRAM for HBM is expensive per unit of capacity, making it hard for AI operators to scale up memory. Kioxia aims to use SSDs built on cheaper NAND flash to replace a portion of the HBM used for expanding memory capacity.
- Devuan 6.0 released
The Devuan distribution has released Devuan 6.0, codenamed Excalibur. It is based on Debian 13 "trixie", released this August, and its main changes track Debian 13: the Linux 6.12 LTS kernel; the GNOME 48, KDE Plasma 6.3, and Xfce 4.20 desktop environments; GCC 14.2 and Python 3.13; official riscv64 support; and more. Devuan is a systemd-free fork of Debian, created by a group of Debian developers unhappy over the systemd init-system controversy.
- Microsoft's AI chief calls conscious AI nonsense
Mustafa Suleyman, who heads Microsoft's AI business, argues that only living beings can be conscious, and advises developers and researchers to stop pursuing projects that claim AI is conscious. In an interview at the AfroTech conference he said: "I don't think that is work people should be doing. If you ask the wrong question, you end up with the wrong answer. I think it's totally the wrong question." Suleyman has consistently opposed the notions that AI is conscious or that AI can feel pain.
- Studio Ghibli and other Japanese companies demand OpenAI stop training Sora 2 on their content
CODA (the Content Overseas Distribution Association), the Japanese anti-piracy group representing companies including Studio Ghibli and Bandai Namco, has written to OpenAI demanding that it stop using its members' content to train the video generation model Sora 2. CODA argues that the copying involved in machine learning may constitute infringement, since the AI model ends up generating content containing copyrighted characters. After Sora 2 launched on September 30 it produced a flood of content featuring Japanese IP, prompting the Japanese government to formally ask OpenAI to stop copying Japanese artwork. OpenAI had also hyped the "Ghibli-style" image generation of GPT-4o at its release this March. CODA contends that OpenAI's after-the-fact opt-out policy for IP holders violates Japanese copyright law: under Japanese law, using copyrighted works generally requires permission in advance, and no provision lets a company escape liability because the rights holder objects only afterwards.
- Over 70% of developers consider Steam a PC gaming monopoly
Atomik Research surveyed 306 game industry executives in the UK and US on May 18-22, 2025; three quarters of the respondents were C-level executives, and 77% came from studios with more than 50 people. The study found that most studios earn over three quarters of their revenue from Steam, and 72% of respondents consider Steam a monopoly in the PC gaming market. Developers are also turning to other platforms such as the Epic Games Store and the Xbox PC games store: 48% of respondents have shipped games on two platforms, 10% have used GOG, 8% have used Itch.io, and 32% have released some games on physical media.
- Trump renominates Jared Isaacman as NASA administrator
US President Trump has again nominated the billionaire private astronaut Jared Isaacman to run NASA. His statement did not explain why he withdrew Isaacman's nomination this May only to now find him fit for the job. The earlier withdrawal was widely tied to Elon Musk's departure from Trump's inner circle: Isaacman is Musk's preferred pick and has flown to orbit several times aboard SpaceX spacecraft. In July Trump named Transportation Secretary Sean Duffy acting NASA administrator, but Duffy's recent remarks and the NASA plans he has revealed have stirred considerable controversy. Meanwhile Trump's aides kept recommending Isaacman, who has been seen dining with Trump on several occasions, a sign of good relations. With the US government currently shut down, confirming Isaacman's nomination could take a long time.
- Astronomers may have found the first generation of stars after the Big Bang
Astronomers have long hunted for the universe's first generation of stars, and they may finally have found a trace of them. A team at the University of Toledo in Ohio, after a detailed analysis of gravitationally lensed observations from the James Webb Space Telescope (JWST), believes it may have captured the light of these primordial stars in the distant galaxy LAP1-B. First-generation stars are made mainly of hydrogen and helium with trace lithium, the primordial elements left over from the Big Bang. They are exceedingly rare and so short-lived that they died out long ago, yet their faint light can still reach us across vast distances. Earlier first-star candidates were all eventually ruled out for failing the theory's three predictions: they should form in small dark matter halos of extremely low metallicity; their masses should lie between 10 and 1,000 solar masses; and they should be born in small clusters with total masses of only a few thousand solar masses. LAP1-B appears to satisfy all three: the system formed in a dark matter clump of roughly 50 million solar masses, its stars have masses between 10 and 1,000 solar masses, and they exist as a small cluster totaling just a few thousand solar masses.
- Microsoft tests replacing the desktop search box with Copilot
Microsoft is weaving its Copilot AI assistant into every one of its products, and the integration with Windows keeps getting deeper. In the latest Windows Insider Dev and Beta builds, Microsoft is testing replacing the traditional desktop search box with Copilot. The Copilot search box is not enabled by default: the stock box shows the text "Search", while the Copilot version shows "Ask Copilot anything" and accepts either Copilot prompts or search keywords. Testing so far suggests it is weaker than traditional search.
- Chrome will remove XSLT support in November 2026
The Chrome team announced on the official blog that support for Extensible Stylesheet Language Transformations (XSLT) will be removed in v155, due November 17, 2026. Google says the removal helps improve security, and notes that Firefox and the WebKit project have similar plans. XML documents suit machines but not human readers; XSLT exists to transform XML into friendlier formats such as HTML. Chrome, Firefox, Safari, and the other mainstream browsers all support client-side XSLT rendering, but only the 1.0 version from 1999, not the 3.0 version finalized in 2017. Google floated removing XSLT as early as 2013 but never followed through; this year's WHATWG meeting formally put the removal proposal on the agenda. Google's developers argue that the browsers' XSLT code base has aged and is prone to memory-safety vulnerabilities, and that usage is vanishingly low: only one in every 7,891 page loads involves client-side XSLT.
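For readers who never touched it: XSLT is a declarative language for turning XML into other markup. A minimal 1.0-style transform, run here server-side with Python's lxml (one plausible fallback once the in-browser engine is gone):

```python
# Minimal XSLT 1.0 transform via lxml: XML list -> HTML bullet list.
from lxml import etree

xml = etree.XML("<items><item>hello</item><item>world</item></items>")
xslt = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <ul>
      <xsl:for-each select="items/item">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt)
print(str(transform(xml)))   # <ul><li>hello</li><li>world</li></ul>
```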
- Astronomers detect the brightest black hole flare ever seen
According to a study published in Nature Astronomy, astronomers have detected the brightest outburst of light ever seen from a black hole, produced as it devoured a star of at least 30 solar masses; at its peak it shone more than 10 trillion times brighter than the Sun. When the object was first observed in 2018, astronomers did not realize it was a super-flare. After noticing the brightening, researchers pointed the 200-inch Hale Telescope at Palomar Observatory at it. In 2023 the team noticed the flare was still unusually bright five years on, so they followed up with deeper observations from the Keck Observatory in Hawaii, which placed the object about 3 million kiloparsecs, roughly 10 billion light-years, from Earth. To appear so bright at such a distance, the light it emits must be extraordinarily intense; the astronomers now say the flare is 30 times brighter than any black hole outburst previously detected. Their favored explanation is that a massive star strayed too close to the black hole and met its doom: as the black hole's gravity tore it apart, the emitted light brightened dozens of times. And because the flare has not yet fully faded, they believe the star may not have been completely swallowed.
- 43% of Gen Z prefer YouTube and TikTok to traditional TV and streaming
A survey by Activate Consulting finds that 43% of Gen Z prefer YouTube and TikTok over traditional TV or paid streaming. Global media revenue is growing sharply while traditional TV ratings plunge; people now average more than 13 hours a day consuming content across platforms, and multitasking gives everyone a "32-hour day". The survey also finds that 1-2 minute micro-dramas are catching on fast, with 28 million US adults (52% of them aged 18-34) consuming the new format. It projects that by 2029 global internet and media revenue will grow by $388 billion, daily streaming video time will rise to 4 hours 8 minutes, and traditional TV viewing will fall to 1 hour 17 minutes. Streaming revenue (advertising plus subscriptions) will grow 18-19% a year, while traditional TV revenue declines 4-6% a year.
- China requires state-funded data centers to use domestic AI chips
China has issued guidelines requiring newly built state-funded data centers to use domestically made AI chips. Data centers less than 30% complete must remove all installed foreign chips or cancel their purchase plans; those more than 30% complete will be handled case by case. This may be the strongest step yet toward purging foreign technology from critical infrastructure.