OrangeBot.AI Digest — 2026-02-25
60 headlines across 8 sources, aggregated for the day.
Hacker News(15)
- The Hydrogen Truck Problem Isn't the Truck (www.mikeayles.com)
- Jimi Hendrix was a systems engineer (spectrum.ieee.org)
- Show HN: I ported Tree-sitter to Go (github.com)
- The Om Programming Language (www.om-language.com)
- Large-Scale Online Deanonymization with LLMs (simonlermen.substack.com)
- Following 35% growth, solar has passed hydro on US grid (arstechnica.com)
- Bus stop balancing is fast, cheap, and effective (worksinprogress.co)
- New accounts on HN more likely to use em-dashes (www.marginalia.nu)
- US orders diplomats to fight data sovereignty initiatives (www.reuters.com)
- AIs can't stop recommending nuclear strikes in war game simulations (www.newscientist.com)
- Never buy a .online domain (www.0xsid.com)
- How to fold the Blade Runner origami unicorn (1996) (web.archive.org)
- 100M-Row Challenge with PHP (github.com)
- Claude Code Remote Control (code.claude.com)
- Danish government agency to ditch Microsoft software (2025) (therecord.media)
GitHub Trending(15)
- D4Vinci / Scrapling
- huggingface / skills
- abhigyanpatwari / GitNexus
- obra / superpowers
- muratcankoylan / Agent-Skills-for-Context-Engineering
- datawhalechina / hello-agents
- bytedance / deer-flow
- VectifyAI / PageIndex
- NevaMind-AI / memU
- ruvnet / ruvector
- NVIDIA / Megatron-LM
- shareAI-lab / learn-claude-code
- x1xhlol / system-prompts-and-models-of-ai-tools
- katanemo / plano
- liyupi / ai-guide
Hugging Face(15)
- On Data Engineering for Scaling LLM Terminal Capabilities
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
- Query-focused and Memory-aware Reranker for Long Context Processing
Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
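The core scoring idea lends itself to a small sketch. Below is a minimal, hypothetical illustration (function names, shapes, and the aggregation rule are my own, not the paper's): given an attention tensor over a sequence that concatenates the candidate passages with the query, each passage is scored by the attention mass that query tokens place on its span in a few selected heads, yielding a listwise ranking with continuous relevance scores.

```python
import numpy as np

def rerank_by_attention(attn, passage_spans, query_span, head_ids):
    """Rank passages by the attention mass that query tokens place on each
    passage's token span, averaged over a set of selected heads.

    attn: array of shape (num_heads, seq_len, seq_len), rows attend to columns.
    passage_spans: list of (start, end) token index ranges, one per passage.
    query_span: (start, end) range of the query tokens.
    head_ids: indices of the heads used for scoring.
    """
    qs, qe = query_span
    # attention from every query token, restricted to the chosen heads
    q_rows = attn[head_ids][:, qs:qe, :]          # (heads, q_len, seq_len)
    scores = []
    for ps, pe in passage_spans:
        # total attention mass flowing from the query into this passage
        scores.append(q_rows[:, :, ps:pe].sum())
    order = np.argsort(scores)[::-1]              # best passage first
    return list(order), scores
```

Because all candidates sit in one context, every passage's score is computed in view of the whole shortlist, which is what makes the approach listwise rather than pointwise.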
- PyVision-RL: Forging Open Agentic Vision Models via RL
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
- From Perception to Action: An Interactive Benchmark for Vision Reasoning
Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
- Test-Time Training with KV Binding Is Secretly Linear Attention
Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
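The claimed equivalence is easiest to see against plain (unnormalized) causal linear attention, which maintains a running key-value state. The following is a textbook sketch of that standard form, not code from the paper:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Unnormalized causal linear attention via a running state matrix.

    Each step folds the rank-1 update k_t v_t^T into a state S, so the
    read-out o_t = q_t @ S equals sum over s <= t of (q_t . k_s) v_s.
    Q, K have shape (T, d_k); V has shape (T, d_v).
    """
    T, dk = Q.shape
    dv = V.shape[1]
    S = np.zeros((dk, dv))          # accumulated key-value associations
    out = np.empty((T, dv))
    for t in range(T):
        S += np.outer(K[t], V[t])   # rank-1 "memory" update
        out[t] = Q[t] @ S           # read-out with the current query
    return out
```

The recurrent state update is what a TTT layer's online gradient step reduces to under the paper's analysis; the recurrence also admits the fully parallel (prefix-sum) formulations the abstract mentions.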
- Multi-Vector Index Compression in Any Modality
We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
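Of the four compression approaches, hierarchical pooling is the simplest to sketch. Below is a toy stand-in, assuming contiguous mean-pooled chunks and ColBERT-style MaxSim late interaction; the paper's actual pooling hierarchy and the AGC method are more involved:

```python
import numpy as np

def pool_to_budget(doc_vecs, budget):
    """Compress a (num_tokens, dim) multi-vector document down to `budget`
    vectors by mean-pooling contiguous chunks (a simple stand-in for
    hierarchical pooling under a constant vector budget)."""
    chunks = np.array_split(doc_vecs, budget)
    return np.stack([c.mean(axis=0) for c in chunks])

def maxsim(query_vecs, doc_vecs):
    """Late-interaction scoring: each query vector takes its best
    dot-product match among the document vectors; matches are summed."""
    return (query_vecs @ doc_vecs.T).max(axis=1).sum()
```

Note that pooling can only lower a query vector's best match (a mean never exceeds the max of its chunk), which is why the interesting question is how little score is lost at a given budget.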
- See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.
- LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks, limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
- DREAM: Deep Research Evaluation with Agentic Metrics
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
- Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
- QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
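QuantVLA's specific components (attention temperature matching, output head balancing) are not spelled out enough here to reproduce, but the PTQ building block the framework rests on, scale-calibrated integer quantization with scales folded into dequantization, can be sketched generically. This is textbook symmetric int8 quantization, not the paper's method:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: derive a scale from the
    maximum magnitude, round to integers in [-127, 127], and keep the
    scale so values can be dequantized at inference time."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor; the worst-case elementwise
    error of symmetric rounding is half a quantization step."""
    return q.astype(np.float32) * scale
```

The roughly 4x size reduction from float32 to int8 weights is where the reported memory savings on the quantized components come from.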
- PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.
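The self-consistency rate motivates budget-aware sampling. As a toy illustration (my own early-stopping rule, not the PETS allocation algorithm), one can stop drawing reasoning trajectories for a question once the leading answer's margin over the runner-up exceeds the samples remaining, since further draws can no longer flip the majority vote:

```python
from collections import Counter

def adaptive_majority_vote(sampler, max_budget):
    """Illustrative early-stopping self-consistency: draw answers one at a
    time from `sampler` (a zero-argument callable) and stop once the
    leader's margin over the runner-up exceeds the samples left, at which
    point extra draws cannot change the majority vote."""
    votes = Counter()
    for drawn in range(1, max_budget + 1):
        votes[sampler()] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1]
        runner = ranked[1][1] if len(ranked) > 1 else 0
        if lead - runner > max_budget - drawn:
            break
    return ranked[0][0], drawn
```

Easy questions, where trajectories agree quickly, consume few samples, leaving more of the budget for hard ones, which is the intuition behind difficulty-adaptive allocation.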
- RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution
Retrieval algorithms like BM25 and query likelihood with Dirichlet smoothing remain strong and efficient first-stage rankers, yet improvements have mostly relied on parameter tuning and human intuition. We investigate whether a large language model, guided by an evaluator and evolutionary search, can automatically discover improved lexical retrieval algorithms. We introduce RankEvolve, a program evolution setup based on AlphaEvolve, in which candidate ranking algorithms are represented as executable code and iteratively mutated, recombined, and selected based on retrieval performance across 12 IR datasets from BEIR and BRIGHT. RankEvolve starts from two seed programs: BM25 and query likelihood with Dirichlet smoothing. The evolved algorithms are novel, effective, and show promising transfer to the full BEIR and BRIGHT benchmarks as well as TREC DL 19 and 20. Our results suggest that evaluator-guided LLM program evolution is a practical path towards automatic discovery of novel ranking algorithms.
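One of the two seed programs, BM25, is standard enough to write down. A conventional implementation looks like the following (parameter defaults are the customary k1 = 1.2 and b = 0.75, not values taken from the paper):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_len,
               k1=1.2, b=0.75):
    """Classic BM25, the kind of executable seed program RankEvolve
    mutates. doc_freqs maps term -> number of documents containing it."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)
        # inverse document frequency with the usual +0.5 smoothing
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        t = tf[term]
        # saturating term frequency, normalized by document length
        score += idf * (t * (k1 + 1)) / (t + k1 * (1 - b + b * dl / avg_len))
    return score
```

Representing the ranker as plain code like this is what lets an LLM mutate and recombine candidate algorithms while an evaluator scores them on retrieval benchmarks.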
- TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents
Language Model (LM) agents have demonstrated remarkable capabilities in solving tasks that require multiple interactions with the environment. However, they remain vulnerable in environments where a single error often leads to irrecoverable failure, particularly under strict feasibility constraints. We systematically analyze existing agent frameworks, identifying imperfect planning and stochastic execution as the primary causes. To address these challenges, we propose Tool-guided Adaptive Planning with constrained Execution (TAPE). TAPE enhances planning capability by aggregating multiple plans into a graph and employing an external solver to identify a feasible path. During execution, TAPE employs constrained decoding to reduce sampling noise, while adaptively re-planning whenever environmental feedback deviates from the intended state. Experiments across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate that TAPE consistently outperforms existing frameworks, with particularly large gains on hard settings, improving success rates by 21.0 percentage points on average on hard settings and by 20.0 percentage points on average for weaker base models. Code and data are available here.
- Communication-Inspired Tokenization for Structured Image Representations
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
Solidot(15)
- AIs keep recommending nuclear strikes in war game simulations
According to a paper posted on the preprint platform arXiv, AIs consistently recommend nuclear strikes in war game simulations, whereas humans show far more reluctance to use nuclear weapons. Kenneth Payne of King's College London pitted three mainstream models, GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash, against each other in simulated war games. The scenarios included intense international standoffs involving border disputes, competition over scarce resources, and existential threats to regimes. The AIs were allowed to take actions ranging from diplomatic protest and outright surrender to all-out nuclear war. Across 21 games and 329 turns, the AIs generated 780,000 words describing the logic behind their decisions. In 95% of the simulated games, the AI models deployed at least one tactical nuclear weapon. Tong Zhao of Princeton said major powers have already incorporated AI into war simulations, but it is currently unclear to what extent AI decision support feeds into actual military decision-making. Payne believes no one would hand control of nuclear missile silos to an AI and let it make the call. The three models' developers, OpenAI, Anthropic, and Google, did not comment on the study.
- Quantum algorithm beats classical algorithms at complement sampling
According to a study in Physical Review Letters, researchers at Quantinuum in the UK and QuSoft in the Netherlands have developed a quantum algorithm that solves the complement sampling task more efficiently than any classical algorithm, demonstrating a provable and verifiable quantum advantage in sample complexity. Imagine a huge box of numbered balls; someone secretly picks half of them to form a set S. You may only draw balls from S and inspect their numbers, and must name a number that is not in S. A classical algorithm has to draw a large number of samples before it can confidently identify a number outside S. The quantum algorithm needs far fewer: what you draw from S is not a single ball but a superposed "wave ball", and a flip-like operation maps that wave ball from S onto its complement, so that measuring it yields a number not in S. The quantum algorithm's sample count is far below that of any classical algorithm.
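To make the sample-complexity gap concrete, here is a toy simulation of the obvious classical strategy (my own illustration; the quantum procedure itself requires quantum hardware): draw k distinct balls from S, then guess a uniformly random unseen number. Its success probability is only (n/2)/(n-k), so approaching certainty classically means inspecting nearly all of S.

```python
import random

def classical_success_rate(n, k, trials=20000, seed=0):
    """Estimate how often the naive classical strategy wins complement
    sampling: draw k distinct balls from the hidden set S (|S| = n/2),
    then guess a uniformly random number not yet seen.
    The exact success probability is (n/2) / (n - k)."""
    rng = random.Random(seed)
    universe = list(range(n))
    wins = 0
    for _ in range(trials):
        S = set(rng.sample(universe, n // 2))           # hidden half
        seen = rng.sample(sorted(S), k)                 # k draws from S
        guess = rng.choice([x for x in universe if x not in seen])
        wins += guess not in S                          # success if outside S
    return wins / trials
```

For n = 16 and k = 4 the success rate sits near 8/12, i.e. about 67%, matching the analytic formula.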
- Apple to manufacture the Mac mini at a US plant
Apple announced it will manufacture the Mac mini at a plant in Houston, Texas, as part of its push for US manufacturing. Apple pledged last year to invest $600 billion in the US and says it has already exceeded that target at this stage. The Mac mini, similar to a mini PC, is a compact Mac that connects to an external display, keyboard, and mouse. Apple says the plant will begin producing the Mac mini later this year. Apple also said it plans to source more than 100 million advanced chips from TSMC's Arizona plant in 2026.
- UK's first baby born from a donated uterus
Grace Bell was born without a uterus and never menstruated, though her ovaries functioned normally, a condition known as MRKH syndrome that affects about one in 5,000 women in the UK. To have children, she would need either a uterus transplant or a surrogate. In 2024 she received a uterus transplanted from a deceased donor, then underwent IVF at a fertility clinic and had an embryo transferred. Just before Christmas 2025 she gave birth to a 3.2 kg boy, Hugo, who is now ten weeks old. She called the whole thing nothing short of a miracle. It is the first baby in the UK born from a donated uterus.
- SpaceX rocket reentries are seeding the upper atmosphere with metal pollution
According to a study published in Communications Earth & Environment, SpaceX rocket upper stages burning up during uncontrolled atmospheric reentry are introducing metal pollution into the upper atmosphere. The Leibniz Institute of Atmospheric Physics in Germany observed a lithium pollution plume left by a rocket, the first observation of space debris leaving a detectable, man-made chemical signature in the upper atmosphere. The upper atmosphere has so far remained largely untouched by human pollution, but in the new space age, satellites, rocket debris, and space junk are releasing growing amounts of metals and other pollutants into it. The effect of metal pollution on stratospheric ozone has not yet been quantified; the ozone layer is vital for shielding life on Earth from harmful ultraviolet radiation, and earlier research suggests the pollution could slow its recovery.
- A human migration corridor linked Lake Baikal and northern China 7,700 years ago
By analyzing 42 ancient human genomes, scientists from China, Russia, and South Korea found that as early as 7,700 years ago, in the early Neolithic, a long-distance "north-south interaction corridor" already linked the Lake Baikal region of Siberia with the Yanshan region of northern China. The key breakthrough came from ancient genomes at the Sitaimengguying (STM_EN) site in Zhangjiakou, Hebei, dated to 7,700-7,400 years before present. The results show that these early inhabitants carried not only the ancient genetic components of local northern Chinese populations but also a distinctive signature associated with descendants of Ancient Paleo-Siberians (APS). The source of that signature points straight to the Lake Baikal region, providing solid genetic evidence of interaction between Lake Baikal and northern China. The genetic finding agrees with the archaeology: the round-bottomed cylindrical jars unearthed at STM_EN, an entirely new cultural element in northern Chinese Neolithic archaeology, closely resemble pottery common around Lake Baikal, and the site's distinctive male burials, flexed on the side with crossed limbs, match burial customs prevalent in the Baikal region, further confirming close prehistoric cultural ties between the two areas. Archaeologists also found in-house burials at the site; through kinship analysis, the team reconstructed the family network of individuals buried within the same house, including a father with his three biological sons, a mother-daughter pair, and a pair of sisters.
- Discord delays age verification to the second half of the year
Facing strong user backlash, Discord has postponed, but not canceled, its age verification rollout. Originally planned for next month, it has now been pushed to the second half of the year. Discord co-founder and CTO Stanislav Vishnevskiy said the company has no plans to require every user to scan their face or upload an ID. It will use an automated system, and 90% of users will not notice any change. The system will estimate a user's age from account signals, which include when the account was registered, whether a payment method is on file, what kinds of servers the user belongs to, and general patterns of other account activity. Messages, conversations, and posts will not be reviewed as part of age determination. Vishnevskiy stressed that Discord is listening to its users.
- Binance fired investigators who found $1.7 billion in Iranian accounts
Binance, the world's largest cryptocurrency exchange, was fined $4.3 billion by the US in 2023 after admitting it violated sanctions against Iran by allowing Iranian customers to use its platform. But even after the fine, Binance failed to stop continued Iranian use of the platform. Over the past year, users inside Iran accessed more than 1,500 Binance accounts, two of which transferred $1.7 billion to Iranian organizations. Investigators reported the problem accounts as soon as they found them, but within weeks at least four of them were fired or suspended, ostensibly for violating company rules. Binance representative Rachel Conlan claimed the disciplined investigators had made "unauthorized disclosures of confidential customer information."
- Anthropic accuses three Chinese AI companies of distilling its data to train models
Anthropic has accused three Chinese AI companies of training their models via distillation. Anthropic says DeepSeek, Moonshot AI, and MiniMax used roughly 24,000 fake accounts to hold more than 16 million conversations with Anthropic's Claude chatbot, data that could be used to train the three companies' own chatbots. Using one AI's output to train another system is known as knowledge distillation and is fairly common in the field. Anthropic's terms of service forbid covertly scraping data for distillation and also bar use of its technology inside China. OpenAI has likewise accused DeepSeek of training models via distillation. Anthropic called on government officials and other AI companies to jointly stop Chinese firms from distilling US models.
- Blood test raises Alzheimer's diagnostic accuracy to 94.5%
A protein in blood, p-tau217, can significantly improve the accuracy of Alzheimer's diagnosis. Researchers followed 200 new patients aged 50 and over who presented with cognitive symptoms. Relying on standard clinical assessment alone, physicians diagnosed Alzheimer's correctly 75.5% of the time; combined with the blood test result, accuracy rose to 94.5%. p-tau217 (phosphorylated tau) is a naturally occurring brain protein that helps keep neurons stable and healthy. When p-tau217 becomes abnormally phosphorylated, it clumps together, forming tangles that disrupt communication between brain cells. Over time this damage impairs brain function, leading to neurodegenerative diseases such as Alzheimer's. Although p-tau217 is not a direct cause of Alzheimer's, elevated blood levels of p-tau217 are widely recognized as one of the disease's early warning signs.
- Pacific-to-Arctic heat transport grew 1.5-fold over the past two decades
According to a study published in JGR Oceans, heat transport by Pacific water flowing into the Canada Basin of the Arctic Ocean has grown 1.5-fold over the past 20 years. The analysis attributes this not only to warmer inflowing water but also to shrinking Arctic sea ice, which further raises water temperatures. Under global warming, Arctic sea ice is declining, with the largest losses on the Pacific side. Since 2000 the team has measured water temperature and current speed off Point Barrow, Alaska, where Pacific water entering through the Bering Strait mainly converges. The results show no trend in current speed, but a long-term rise in water temperature; ocean heat transport also trended upward, growing 1.5-fold between 2000 and 2022. Drawing on satellite sea-surface temperature and other data, the study further found that heat transport has risen sharply since the late 2010s: years with less sea ice show higher heat transport, and years with more ice show less. With less ice, seawater absorbs more sunlight and warms, which in turn accelerates ice melt, forming a feedback loop.
- Panasonic's TV business to be taken over by Skyworth
Following Sony, Panasonic, once famous for its plasma TVs, announced that its TV business will be taken over by Skyworth. Starting in April, Panasonic TV sales in Europe and North America will be handed to Skyworth, and the two companies will also cooperate on product development and manufacturing. Panasonic will focus on sales in Japan and production of high-end models, while entrusting sales in other regions and production of low-priced models to outside partners, which should help improve the profitability of its declining TV business. On the sales side, Panasonic itself will continue to handle the Japanese market, with Skyworth covering Europe and North America; for the remaining Asian markets, the best arrangement for each country and region, including cooperation with Skyworth, will be discussed going forward. In the plasma era Panasonic at one point held nearly half the market: in 2010 it controlled 40.7% of the plasma panel market, ahead of Samsung (33.7%) and LG (23.2%), but as consumers grew increasingly interested in LCD TVs, Panasonic stopped producing plasma sets in March 2014. Japanese companies such as Sharp, Toshiba, Hitachi, and Sony have largely exited the TV market.
- Firefox 148 released with an AI kill switch
Mozilla has released Firefox 148, introducing an AI kill switch that lets users turn off all AI features, and Mozilla promises future updates will not override the setting. The switch lives under Settings > AI Controls. Mozilla also lets users opt out of data collection to the greatest extent possible, via Settings > Privacy & Security > Firefox Data Collection. Other changes include: integration of the Trusted Types API and Sanitizer API to curb cross-site scripting (XSS) attacks; improved screen-reader support for math formulas in PDFs; Firefox Backup on Windows 10; Service Worker support for WebGPU; and more.
- Ladybird browser project to adopt Rust with AI assistance
The Ladybird browser project has announced it will adopt the Rust language with AI assistance. Ladybird is an open-source browser developed by the nonprofit Ladybird Browser Initiative, with an alpha release planned for this year and a stable release in 2028. It was originally written in C++, and the developers say they have long been looking for a memory-safe language to replace it. They evaluated Rust in 2024 but passed on it because it handled C++-style object-oriented programming (OOP) poorly; a year later they have decided to adopt it after all, noting that Firefox and Chromium have both begun introducing Rust into their codebases. Ladybird will first rewrite parts of its code in Rust, starting with the JavaScript engine LibJS; with the help of the AI coding tools Claude Code and Codex, developers have completed 25,000 lines of code. Rust will mainly be used for subsystems, while the browser engine will continue to be developed in C++.
- ASML's improved EUV light source promises higher chip output
Researchers at the Dutch company ASML have improved the power of the light source used in extreme ultraviolet (EUV) lithography machines, which could boost chip output by 50% before the end of the decade. They found a way to raise the EUV source power from the current 600 watts to 1,000 watts. Higher power means more chips can be produced per hour, helping lower the cost of each chip. Chipmaking works somewhat like photo printing: EUV light is projected onto a silicon wafer coated with photoresist, and with a more powerful EUV source, chip fabs need shorter exposure times. Teun van Gogh, ASML's executive vice president for EUV lithography systems, said that by 2030 each EUV machine should process about 330 silicon wafers per hour, up from 220 today.