Monthly Digest — 2025-12
399 unique stories across 31 days and 8 sources.
Hacker News (123)
- How to Attend Meetings – Internal guidelines from the New York Times (docs.google.com)
- Ghostty compiled to WASM with xterm.js API compatibility (github.com)
- High-income job losses are cooling housing demand (jbrec.com)
- India orders smartphone makers to preload state-owned cyber safety app (www.reuters.com)
- Claude 4.5 Opus’ Soul Document (www.lesswrong.com)
- The Junior Hiring Crisis (people-work.io)
- Anthropic acquires Bun (bun.com)
- 100k TPS over a billion rows: the unreasonable effectiveness of SQLite (andersmurphy.com)
- Micron Announces Exit from Crucial Consumer Business (investors.micron.com)
- Everyone in Seattle hates AI (jonready.com)
- Valve reveals it’s the architect behind a push to bring Windows games to Arm (www.theverge.com)
- Ghostty is now non-profit (mitchellh.com)
- Thoughts on Go vs. Rust vs. Zig (sinclairtarget.com)
- Django 6 (docs.djangoproject.com)
- A Cozy Mk IV light aircraft crashed after 3D-printed part was weakened by heat (www.bbc.com)
- The RAM shortage comes for us all (www.jeffgeerling.com)
- Google 'Looking into' Gmail Hack Locking Users Out with No Recovery (www.forbes.com)
- Gemini 3 Pro: the frontier of vision AI (blog.google)
- The effect of shingles vaccination at different stages of dementia (www.cell.com)
- Framework Laptop 13 gets ARM processor with 12 cores via upgrade kit (www.notebookcheck.net)
GitHub Trending (71)
- sansan0 / TrendRadar
🎯 Say goodbye to information overload: AI helps you make sense of trending news. Simple public-opinion monitoring and analysis: multi-platform trend aggregation plus MCP-based AI analysis. Monitors 35 platforms (Douyin, Zhihu, Bilibili, Wallstreetcn, Cailian Press, and more) with smart filtering, automatic push notifications, and conversational AI analysis (13 tools for mining news in natural language: trend tracking, sentiment analysis, similarity search, and more). Pushes via WeCom / personal WeChat / Feishu / DingTalk / Telegram / email / ntfy / bark / Slack; 30-second web deployment, phone notifications in 1 minute, no programming required. Docker deployment supported. ⭐ Make the algorithm serve you; use AI to understand what's trending.
- google / adk-go
An open-source, code-first Go toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
- TapXWorld / ChinaTextbook
PDF textbooks for primary school, middle school, high school, and university.
- yeongpin / cursor-free-vip
[Support 0.49.x] Reset Cursor AI MachineID & bypass the higher token limit. Automatically resets the machine ID and unlocks Pro features for free, working around messages such as: "You've reached your trial request limit. / Too many free trial accounts used on this machine. Please upgrade to pro. We have this limit in place to prevent abuse. Please let us know if you believe this is a mistake."
- basecamp / fizzy
Kanban as it should be. Not as it has been.
- oven-sh / bun
Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
- DayuanJiang / next-ai-draw-io
A next.js web application that integrates AI capabilities with draw.io diagrams. This app allows you to create, modify, and enhance diagrams through natural language commands and AI-assisted visualization.
- openai / codex
Lightweight coding agent that runs in your terminal
- rustfs / rustfs
🚀2.3x faster than MinIO for 4KB object payloads. RustFS is an open-source, S3-compatible high-performance object storage system supporting migration and coexistence with other S3-compatible platforms such as MinIO and Ceph.
- trustedsec / social-engineer-toolkit
The Social-Engineer Toolkit (SET) repository from TrustedSec - All new versions of SET will be deployed here.
- microsoft / VibeVoice
Open-Source Frontier Voice AI
- RosettaCommons / foundry
Central repository for biomolecular foundation models with shared trainers and pipeline components
- sinelaw / fresh
Text editor for your terminal: easy, powerful and fast
- NVIDIA / cutile-python
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
- patchy631 / ai-engineering-hub
In-depth tutorials on LLMs, RAGs and real-world AI agent applications.
- TelegramMessenger / Telegram-iOS
Telegram-iOS
- winapps-org / winapps
Run Windows apps such as Microsoft Office/Adobe in Linux (Ubuntu/Fedora) and GNOME/KDE as if they were a part of the native OS, including Nautilus integration. Hard fork of https://github.com/Fmstrat/winapps/
- KaijuEngine / kaiju
General purpose 3D and 2D game engine using Go (golang) and Vulkan with built in editor
- thedotmack / claude-mem
A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future sessions.
- dyad-sh / dyad
Free, local, open-source AI app builder ✨ v0 / lovable / Bolt alternative 🌟 Star if you like it!
Hugging Face (92)
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0, and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
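As a quick sanity check on the quoted budget, 314K GPU hours at the reported total cost implies roughly $2 per H800 GPU hour:

```python
gpu_hours = 314_000   # H800 GPU hours reported for the full training workflow
budget_usd = 630_000  # approximate cost reported in the abstract
print(f"implied rate: ${budget_usd / gpu_hours:.2f}/GPU-hour")  # -> $2.01
```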
- REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Building on this, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
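The thinking-editing-reflection loop reads naturally as control flow. Below is a minimal sketch; `think`, `edit`, and `reflect` are hypothetical stand-ins for the MLLM reasoning step, the diffusion decoder, and the reflection step, not the paper's released interfaces:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool              # did the reflection step accept the edit?
    correction: str = ""  # revised plan if not

def reason_edit(image, instruction, think, edit, reflect, max_rounds=3):
    """Thinking-editing-reflection loop as described in the abstract."""
    plan = think(image, instruction)           # MLLM interprets the abstract instruction
    for _ in range(max_rounds):
        image = edit(image, plan)              # diffusion decoder applies the edit
        verdict = reflect(image, instruction)  # MLLM reviews the result
        if verdict.ok:                         # reflection identifies the stopping round
            break
        plan = verdict.correction              # correct unintended manipulations
    return image
```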
- AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
- Vision Bridge Transformer at Scale
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
- From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, driving success rates on benchmarks like HumanEval from single digits to over 95%. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
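The skim-then-zoom behavior the abstract describes can be sketched as a loop. All callables here are hypothetical stand-ins for the LMM and its native cropping tool; the released code defines the real interfaces:

```python
def long_video_qa(video, question, skim, ground, crop_resample, answer, max_steps=5):
    """Global-to-local loop: skim the whole video, ground a relevant clip,
    crop and resample finer-grained frames, and repeat until the answer
    is supported by retrieved visual evidence."""
    context = skim(video)                  # coarse global pass over the video
    result = None
    for _ in range(max_steps):
        span = ground(context, question)   # temporal grounding as a tool call
        clip = crop_resample(video, span)  # zoom in: denser frames on the clip
        result = answer(clip, question)
        if result.grounded:                # evidence found, stop the loop
            break
        context = result.context           # otherwise refine and keep searching
    return result
```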
- Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally-isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling, ultimately limiting both world knowledge internalization and generation.
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
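The stabilizers named above, importance sampling correction and clipping, amount to a few lines in the token-level surrogate. A generic PPO-style sketch, not the paper's exact recipe:

```python
import torch

def surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped, importance-corrected token-level surrogate for a
    sequence-level reward. `logp_old` comes from the (possibly stale)
    behavior policy; the ratio corrects the training-inference
    discrepancy, and clipping limits damage from policy staleness."""
    ratio = torch.exp(logp_new - logp_old)  # importance sampling correction
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # pessimistic clipped objective
```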
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
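The abstract does not spell out DSA's selection mechanism, but the general shape of sparse attention, where each query attends to only a small subset of keys, can be illustrated with a top-k toy. Note this dense-then-mask version saves no compute; a real kernel would avoid materializing the full score matrix:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=64):
    """Toy sparse attention: each query keeps only its `keep`
    highest-scoring keys. Illustrative only; not DeepSeek's design."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # dense [T, T] scores
    keep = min(keep, scores.shape[-1])
    topk = scores.topk(keep, dim=-1)                # top `keep` keys per query
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)  # keep only top-k entries
    return F.softmax(masked, dim=-1) @ v

q = k = v = torch.randn(128, 32)  # 128 tokens, head dim 32
out = topk_sparse_attention(q, k, v)
```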
- ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Large language models are powerful generalists, yet solving deep and complex problems such as those of Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin at only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
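The outcome-, efficiency-, and user-preference-aware reward is easy to write down in one plausible form. The weights and scalarization below are assumptions for illustration; the abstract only names the three components:

```python
def orchestrator_reward(success: bool, cost_usd: float, tools_used: list,
                        preferred_tools: set, w_cost=0.5, w_pref=0.2):
    """Scalar reward combining task outcome, efficiency, and how well the
    orchestrator's tool choices match user preferences (illustrative)."""
    outcome = 1.0 if success else 0.0
    efficiency = -w_cost * cost_usd  # cheaper trajectories score higher
    matched = sum(t in preferred_tools for t in tools_used)
    preference = w_pref * matched / max(len(tools_used), 1)
    return outcome + efficiency + preference
```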
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
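Multi-Shot Narrative RoPE's "explicit phase shift at shot transitions" suggests position indices that jump at each shot boundary while staying monotone overall. A minimal sketch, with the jump size as an assumed hyperparameter:

```python
import torch

def multishot_positions(shot_lengths, shift=256):
    """Temporal position ids with an explicit jump at each shot boundary,
    separating shots in RoPE phase while preserving the temporal narrative
    order. The paper's actual offset scheme may differ."""
    pos, base = [], 0
    for n in shot_lengths:
        pos.append(torch.arange(base, base + n))
        base += n + shift  # phase shift applied at the shot transition
    return torch.cat(pos)

print(multishot_positions([4, 3], shift=10))
# tensor([ 0,  1,  2,  3, 14, 15, 16])
```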
- MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
- Qwen3-VL Technical Report
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, redundant action modes remain that are irrelevant to the successful action modes of the downstream task. Specifically, we observe critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. VLA models integrated with TACO execute the action with the maximum pseudo-count among all sampled action chunks, thereby preventing distribution shift while preserving the generalization ability of VLAs, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it brings significant computational benefits compared to RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, RoboTwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptations.
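The verifier-at-inference idea reduces to sample-then-rank. A minimal sketch, with `policy` and `count_estimator` as hypothetical callables standing in for the VLA head and the paper's pseudo-count model:

```python
def taco_select(policy, count_estimator, obs, n_samples=16):
    """Test-time scaling: sample several action chunks and execute the one
    with the highest pseudo-count, i.e. the chunk best supported by the
    downstream-task data distribution (the anti-exploration principle)."""
    chunks = [policy(obs) for _ in range(n_samples)]  # candidate action chunks
    return max(chunks, key=lambda chunk: count_estimator(obs, chunk))
```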
- PretrainZero: Reinforcement Active Pretraining
Actively learning from general experience, as humans do, on the way to artificial general intelligence has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from the pretraining corpus, and reasons to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
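Since the masked-span task is self-verifying, the reward can be computed directly against the original text. Bag-of-words F1 below is an illustrative choice of verifier, not the paper's stated reward design:

```python
import re

def span_reward(predicted: str, target: str) -> float:
    """Score a model's reconstruction of a masked span against the
    original text with token-level F1 (illustrative verifier)."""
    pred = re.findall(r"\w+", predicted.lower())
    gold = re.findall(r"\w+", target.lower())
    if not pred or not gold:
        return 0.0
    common = sum(min(pred.count(w), gold.count(w)) for w in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)  # F1 in [0, 1]
```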
- ViDiC: Video Difference Captioning
Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
- DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge, guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
- Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
- Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environment for grounded trajectories synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.
- ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves a +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and interpretability of reward models.
Solidot (113)
- Raspberry Pi raises prices as memory costs soar
Raspberry Pi announced price increases on some Raspberry Pi 4 and 5 models due to the recent surge in memory prices, and introduced a 1GB version of the Raspberry Pi 5 priced at $45. Raspberry Pi 4 and 5 prices rise by between $5 and $25: the 4GB Raspberry Pi 4 goes from $55 to $60, and the 16GB Raspberry Pi 5 from $120 to $145. Raspberry Pi says the recent memory price surge is driven by the AI boom, and that it will lower prices once conditions ease.
- The vampire squid sheds light on the origin of octopuses
The vampire squid is a deep-sea cephalopod of the superorder Octopodiformes. Its ancestors moved to the deep sea in the Jurassic to escape predation by plesiosaurs, and its form has remained unchanged for over a hundred million years, earning it a reputation as a living fossil. A Japanese research team unexpectedly caught a vampire squid in Suruga Bay and sequenced it, finding a genome of more than 11 billion base pairs, over twice the size of the largest known genome among octopus-like animals. Despite its name, the vampire squid is neither an octopus nor a squid, let alone a vampire: it is the last and only survivor of an ancient lineage whose other members have all disappeared. Its history goes back 183 million years; it retains many ancestral traits while having evolved to survive in the dark deep sea by feeding on carrion. Its genome is far larger than those of squids and octopuses, with 62% consisting of repetitive sequences. The vampire squid belongs to Octopodiformes but retains part of the chromosome structure of Decapodiformes. The researchers say the vampire squid lets us directly observe the earliest stages of cephalopod evolution.
- Over 30 million user accounts leaked at South Korean e-commerce giant
South Korean e-commerce giant Coupang suffered a breach exposing information on more than 30 million user accounts. The leaked personal data includes names, email addresses, phone numbers, postal addresses, and even some order records. Under South Korea's Personal Information Protection Act, a company that violates the law can be fined up to 3% of its revenue. Coupang's cumulative revenue for the first three quarters of this year was 36.3 trillion won; excluding business units with little connection to the leak, the figure is 31 trillion won. Annualized, the fine could reach 1.2 trillion won. According to Coupang's report to the police, the leak was not the result of a hack but of exfiltration by a Chinese-national employee, who had already left the company and the country.
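The fine estimate follows from straight-line annualization of the three-quarter figure; a quick check of the arithmetic:

```python
revenue_3q = 31.0                # trillion won, excluding unrelated business units
annualized = revenue_3q * 4 / 3  # annualize three quarters of revenue
fine_cap = annualized * 0.03     # up to 3% of revenue under the Act
print(f"{fine_cap:.2f} trillion won")  # -> 1.24, matching the ~1.2 trillion estimate
```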
- Official SmartTube APK files found to contain malware
The SmartTube developer announced last week that his signing key had been compromised; he released a new version of the app signed with a new key and urged users to switch to it. SmartTube is a popular alternative to the YouTube app on Android TV and Fire TV devices. The developer disclosed that the computer he used to build the official APK files had been compromised, causing malware to be planted in some APK releases. It is not yet clear which APK version was the first to carry it. SmartTube v30.43 and v30.47 on APKMirror have both been flagged as infected. The developer says all older SmartTube releases have been removed from the project's GitHub repository, the infected build machine has been dealt with, and the old signing key has been retired. SmartTube v30.56 is the first release built with the new key on a clean machine.
- Oxford's 2025 Word of the Year is 'rage bait'
Oxford University Press has named 'rage bait' its word of the year for 2025. Together with last year's pick, 'brain rot,' it reminds us that in the algorithmic era emotion has become the most manipulated of resources. Rage bait refers to online content deliberately designed to provoke anger, frustration, or offense in order to drive clicks or social engagement: content that riles you up just so you'll tap the angry reaction, leave a comment, and share it, letting the algorithm push the outrage even further. According to Oxford corpus data, use of 'rage bait' has tripled over the past 12 months, making it a term frequently cited by media and social platforms.
- Huawei has filed more GPU patents in recent years than Nvidia
Huawei's GPU-related patent filings are climbing. In the five years through 2023 its filing volume grew roughly tenfold, surpassing that of America's Nvidia and Intel, a sign of how heavily Huawei is investing in AI-related technology. Among patents containing the keyword 'GPU', filings by Samsung Electronics and Huawei have surged in recent years. Huawei filed 3,091 in 2023, about 10 times its 2018 volume, and roughly 3 times Intel's and 5 times Nvidia's filings that year.
- Singapore bans secondary school students from using smartphones at school
Singapore's Ministry of Education announced that from January next year, secondary school students may not use smartphones or smartwatches while at school, including during lessons, recess, and after-school co-curricular activities, enrichment classes, or remedial lessons. While at school, students must keep smartphones and smartwatches in designated storage such as lockers or school bags; schools may permit use where necessary. The ministry says the move aims to encourage students to build good digital habits, have meaningful interactions with classmates outside of lessons, and develop healthy lifestyles. Singapore already bars primary school students from using smartphones or smartwatches at school; they must keep the devices in their bags or in designated storage, including during recess and after-school learning programmes.
- Linux reaches 3.20% of Steam users
The popularity of the Steam Deck handheld and the success of SteamOS, a distribution based on Arch Linux, have pushed Linux's share of Steam users to 3.20%. According to Valve's Steam hardware and software survey for November 2025, Linux accounts for 3.20% (+0.15%) of players' operating systems, Windows for 94.79% (-0.05%), and OSX for 2.02% (-0.09%). With Windows 10 out of support, Windows 11's share has reached 65.59% (+2.02%) while Windows 10 has fallen below 30%, to 29.06% (-2.08%). On Windows, Intel CPUs account for 57.30% (-0.52%) and AMD for 42.61% (+0.52%). By user language, Simplified Chinese stands at 24.93% (+0.92%) and English at 37.37% (-0.59%).
- HELLDIVERS 2 cuts its install size from 154GB to 23GB
Arrowhead Game Studios, developer of HELLDIVERS 2, has released an update that shrinks the PC version from 154GB to 23GB, an 85% reduction. Arrowhead had previously explained on its official blog why the PC version was so large: it contained heavily duplicated data to speed up loading on mechanical hard drives, whereas consoles use solid-state drives, so the console builds were never that big. With the vast majority of PCs now on SSDs, Arrowhead estimates that only 12% of HELLDIVERS 2 players still play from mechanical hard drives.
- ShadyPanda infects over 4 million users via browser extensions
Security firm Koi Security disclosed that an attacker dubbed ShadyPanda used browser extensions to infect 4.3 million Chrome and Edge users. The attacker played a long game, first building an audience with legitimate extensions and then planting malicious code through later updates. The campaign ran in stages: first, affiliate-marketing tracking code embedded in the extensions intercepted e-commerce shopping links and injected the attacker's own affiliate codes to collect commissions; second, the extensions hijacked searches and stole cookies; third, a remote-access backdoor turned them into spyware exfiltrating sensitive browser data. Affected extensions include Clean Master, Infinity, and WeTab. WeTab's developer later issued a statement saying the Clean Master extension had been sold off and is no longer associated with WeTab or Infinity, and that WeTab and Infinity contain no malicious code.
- How AlphaFold changed the world
Google DeepMind announced its AI tool AlphaFold2 in November 2020 and released its code and database in 2021. In the five years since, AlphaFold2 has not only changed how structural biology is done but also advanced computational biology, though turning its biological insights into practical applications such as drug development will still take time. The AlphaFold database now holds more than 240 million structure predictions, covering the vast majority of known proteins and supporting 3.3 million researchers in over 100 countries. Scientists have used AlphaFold2 to design responses to antibiotic resistance, hunt for new treatments for diseases such as malaria, deepen understanding of disease mechanisms, and accelerate targeted drug development.
- Japanese commercial font prices rise more than 50-fold after acquisition
Japanese game companies face the headache of switching commercial fonts, because Fontworks LETS, which used to offer low-priced commercial font licensing, has raised prices more than 50-fold since being acquired by US-owned Monotype. Fontworks' annual fee used to be 60,000 yen (about $380); Monotype's plan now runs $20,500 a year and caps installs at 25,000, a completely impractical limit for most commercial games. English-language games can fall back on system UI fonts, cheap commercial fonts, or open-source fonts, but Japanese has an enormous character set, making high-quality fonts extremely difficult and costly to produce, and swapping a game's Japanese font is a huge amount of work. Some games have already switched to DynaFont's commercial fonts.
- Statcounter data shows Windows 11's share growing slowly
Although Windows 10 has left mainstream support (free updates) and entered the paid extended-support phase, Windows 10 users are not rushing to embrace Windows 11. Statcounter data puts Windows 11 at 53.7% in November 2025, with Windows 10 still at 42.7%. Analysts point to Windows 11's higher hardware requirements, which leave many existing Windows 10 PCs unable to upgrade, and to customers' belief that if something isn't broken there's no need to touch it. Windows 11 also offers no indispensable new features that would push enterprise customers to upgrade.
- Even mathematicians struggle with math outside their own field
As disciplines grow ever more specialized, masters who command multiple fields have become hard to find. Take mathematics: it is divided into 63 major categories, which are further subdivided into 529 subcategories, each of which has developed its own specialized terminology for stating and proving technical theorems, terminology that takes years of study to master. This specialized vocabulary hinders communication between mathematicians or scientists and non-specialists. Researchers have found that the readability of the scientific literature is declining over time. When the geometric Langlands conjecture was proved last summer, the people who could actually read the proof were few and far between.
- Critical React Server vulnerability affects countless websites
Security firm Wiz on Wednesday disclosed a React Server vulnerability with a 10/10 severity rating. React Server is widely used across websites and cloud environments, and researchers are urging administrators to patch as soon as possible because the flaw is extremely easy to exploit (with a success rate of nearly 100%). Exploit code is already public, and attackers can use the flaw to execute code remotely. About 6% of websites and 39% of cloud environments use React. Affected React versions include v19.0.1, v19.1.2, and v19.2.1; affected third-party components include Vite RSC, Parcel RSC, React Router RSC, RedwoodSDK, Waku, and Next.js. Tracked as CVE-2025-55182, the vulnerability lies in the Flight protocol of React Server Components and stems from unsafe deserialization.
- US expels Russian cosmonaut who covertly photographed confidential SpaceX materials
Russian cosmonaut Oleg Artemyev, originally slated to fly to the International Space Station on SpaceX's Crew-12 mission, has been removed from the crew; his seat goes to fellow Russian cosmonaut Andrey Fedyaev. Artemyev was caught using his phone at the Hawthorne, California facility to covertly photograph SpaceX rocket engines and confidential internal materials, in violation of US export-control rules, and was expelled from the training site last week. Reports say NASA did not want the controversy surrounding Artemyev to become public.
- Thesis plagiarists are more likely to enter government and are promoted faster
Researchers at Harvard, the University of Hong Kong, and the University of Chicago published a paper running originality checks on more than 500,000 graduate theses in China and found plagiarism to be widespread, with about 14% of theses exceeding the official 15% similarity threshold. The data show that graduates who plagiarized enter the public sector at significantly higher rates than their classmates who did not, especially in powerful agencies such as tax and customs, pointing to negative selection on dishonesty at the point of career entry. Once inside the system, the effect is amplified in promotion: tracking civil servants' careers shows that, given the same seniority and background, officials with plagiarism records are promoted 10-15% faster on average. Even in the highly professionalized judiciary, after controlling for performance metrics such as caseload and appeal rates, a plagiarism record still independently predicts promotion odds. Analyzing more than 140 million court judgments and exploiting the quasi-random assignment of cases, the study finds that cases heard by judges with plagiarism records are more likely to be decided in favor of the government, state-owned enterprises, or large companies, are appealed more often, carry more cursory reasoning in the written judgments, and invoke discretionary clauses more frequently.
- Netflix to acquire Warner Bros for $82.7 billion
Netflix issued a press release formally announcing that it will acquire Warner Bros, the film and streaming business of Warner Bros Discovery, for $82.7 billion; in other words, the streaming service HBO Max will become part of Netflix. The acquisition is expected to reshape the US media landscape and is likely to face antitrust scrutiny. Netflix is paying $27.75 per share, an equity value of $72 billion and a total enterprise value of about $82.7 billion including debt and equity. The boards of both companies unanimously approved the deal.
- AI chatbots excel at changing political views with inaccurate information
According to a study published in Science, AI chatbots are adept at changing people's political views, and their persuasive power is even more striking when they use inaccurate information. The researchers recruited nearly 77,000 participants through crowdsourcing platforms and paid them to converse with AI chatbots from OpenAI, Meta, xAI, and other companies. Participants were first asked their positions on various political issues, then the chatbots tried to talk them into the opposite position. The chatbots proved remarkably effective political lobbyists. The researchers found that much of the information the chatbots supplied contained inaccurate claims, and that 'the most persuasive models and prompting strategies produced the least accurate information.' Some 19% of the chatbots' statements in the study were judged 'largely inaccurate.' The researchers worry that highly persuasive chatbots could be exploited by unscrupulous actors to push radical political or religious ideologies, or to stir up political unrest between geopolitical rivals.
- Long-term calorie restriction slows brain aging
In the 1980s the US National Institute on Aging began a study in which participants were split into two groups, one eating a balanced regular diet and the other cutting calorie intake by 30%. The study's original aim was to determine whether reduced calorie intake extends lifespan. The participants lived out their natural lifespans, and after their deaths researchers analyzed their brains, comparing the brain cells of normal eaters and calorie restrictors to see how reduced calorie intake affected the expression of genes and the activity of pathways linked to brain-cell aging. They found that brain cells from the calorie-restricted group were metabolically healthier and more functional, with increased expression of myelin-related genes and heightened activity in key metabolic pathways closely tied to building and maintaining myelin. The findings support the idea that long-term dietary intervention can shape the trajectory of brain aging at the cellular level.