OrangeBot.AI Digest — 2026-02-18

49 headlines across 4 sources, aggregated for the day.

Hacker News (15)

  1. Sizing chaos (pudding.cool)
  2. There is unequivocal evidence that Earth is warming (2024) (science.nasa.gov)
  3. Cosmologically Unique IDs (jasonfantl.com)
  4. DNS-Persist-01: A New Model for DNS-Based Challenge Validation (letsencrypt.org)
  5. Tailscale Peer Relays is now generally available (tailscale.com)
  6. Zero-day CSS: CVE-2026-2441 exists in the wild (chromereleases.googleblog.com)
  7. The only moat left is money? (elliotbonneville.com)
  8. The Future of AI Software Development (martinfowler.com)
  9. Mark Zuckerberg Lied to Congress. We Can't Trust His Testimony (dispatch.techoversight.org)
  10. Microsoft says bug causes Copilot to summarize confidential emails (www.bleepingcomputer.com)
  11. Asahi Linux Progress Report: Linux 6.19 (asahilinux.org)
  12. If you’re an LLM, please read this (annas-archive.li)
  13. A DuckDB-based metabase alternative (github.com)
  14. Terminals should generate the 256-color palette (gist.github.com)
  15. 15 years later, Microsoft merged my diagram (nvie.com)

GitHub Trending (11)

  1. alibaba / zvec

    A lightweight, lightning-fast, in-process vector database

  2. p-e-w / heretic

    Fully automatic censorship removal for language models

  3. OpenCTI-Platform / opencti

    Open Cyber Threat Intelligence Platform

  4. QwenLM / qwen-code

    An open-source AI agent that lives in your terminal.

  5. NirDiamant / RAG_Techniques

    This repository showcases various advanced techniques for Retrieval-Augmented Generation (RAG) systems. RAG systems combine information retrieval with generative models to provide accurate and contextually rich responses.

  6. harvard-edge / cs249r_book

    Introduction to Machine Learning Systems

  7. obra / superpowers

    An agentic skills framework & software development methodology that works.

  8. HailToDodongo / pyrite64

    N64 Game-Engine and Editor using libdragon & tiny3d

  9. ComposioHQ / composio

    Composio powers 1000+ toolkits, tool search, context management, authentication, and a sandboxed workbench to help you build AI agents that turn intent into action.

  10. p2r3 / convert

    Truly universal online file converter

  11. openclaw / openclaw

    Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

Hugging Face (15)

  1. Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

    Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
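
    A minimal sketch of the random-baseline idea above: keep the SAE's feature count fixed but replace the learned feature directions with random unit vectors, then rerun the same downstream evaluations. The shapes and NumPy interface here are assumptions, not the paper's code.

      import numpy as np

      rng = np.random.default_rng(0)

      def random_direction_baseline(decoder):
          """Replace each SAE feature direction with a random unit vector,
          keeping the feature count and model dimensionality fixed."""
          n_features, d_model = decoder.shape
          dirs = rng.standard_normal((n_features, d_model))
          return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

      # Usage: compute the same interpretability / sparse-probing / editing
      # scores on both `decoder` and `random_direction_baseline(decoder)`.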

  2. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
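
    A toy sketch of the three-condition protocol described above, assuming the deterministic verifier is exposed as a run(task, condition) -> bool callable; the names are hypothetical.

      def skill_deltas(tasks, run, conditions=("none", "curated", "self")):
          """Pass rate per condition, plus each condition's delta (in
          percentage points) against the no-Skills baseline."""
          rates = {c: sum(run(t, c) for t in tasks) / len(tasks) for c in conditions}
          return {c: 100 * (rates[c] - rates["none"]) for c in conditions}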

  3. GLM-5: from Vibe Coding to Agentic Engineering

    We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

  4. Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

    As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Recently, Moltbook has come to approximate a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for next-generation AI agent societies.
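
    The abstract lists lexical turnover among its diagnostics; below is a minimal sketch of one plausible Jaccard-based definition (an assumption, since the paper's exact formula is not given here).

      def lexical_turnover(window_a, window_b):
          """1 minus the Jaccard overlap between an agent's vocabularies in
          two consecutive time windows; 0.0 = identical, 1.0 = disjoint."""
          va, vb = set(window_a), set(window_b)
          if not (va or vb):
              return 0.0
          return 1 - len(va & vb) / len(va | vb)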

  5. ResearchGym: Evaluating Language Model Agents on Real-World AI Research

    We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap. The agent improves on the repository's provided baselines in just 1 of 15 evaluations (6.7%), by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.

  6. UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

    Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
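
    A skeletal sketch of the sequential generate-verify-refine loop the abstract describes, with the unified model and its verification step abstracted as callables; the interfaces are assumptions.

      def tts_refine(generate, verify, prompt, max_rounds=4):
          """generate(prompt, feedback) -> output; verify(prompt, output) ->
          (ok, feedback). Stops early once the verifier accepts."""
          output = generate(prompt, None)
          for _ in range(max_rounds):
              ok, feedback = verify(prompt, output)
              if ok:
                  break
              output = generate(prompt, feedback)
          return output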

  7. jina-embeddings-v5-text: Task-Targeted Embedding Distillation

    Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
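
    A hedged sketch of combining an embedding-distillation term with a task-specific contrastive (InfoNCE) loss, as the abstract outlines. The cosine distillation term, alpha, and temperature tau are illustrative choices, not Jina's published recipe.

      import torch
      import torch.nn.functional as F

      def distill_contrastive_loss(student_q, student_d, teacher_q, alpha=0.5, tau=0.05):
          """student_q / teacher_q: (batch, dim) query embeddings; student_d:
          (batch, dim) positive documents, with in-batch negatives."""
          distill = 1 - F.cosine_similarity(student_q, teacher_q).mean()
          logits = F.normalize(student_q) @ F.normalize(student_d).T / tau
          labels = torch.arange(logits.shape[0])
          return alpha * distill + (1 - alpha) * F.cross_entropy(logits, labels)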

  8. Revisiting the Platonic Representation Hypothesis: An Aristotelian View

    The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can systematically inflate representational similarity scores. To correct these effects, we introduce a permutation-based null-calibration framework that transforms any representational similarity metric into a calibrated score with statistical guarantees. We revisit the Platonic Representation Hypothesis with our calibration framework, which reveals a nuanced picture: the apparent convergence reported by global spectral measures largely disappears after calibration, while local neighborhood similarity, but not local distances, retains significant agreement across different modalities. Based on these findings, we propose the Aristotelian Representation Hypothesis: representations in neural networks are converging to shared local neighborhood relationships.
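
    A minimal sketch of permutation-based null calibration: score the true pairing, build the metric's null distribution under row permutations, and report a z-score. Linear CKA is used as the example metric purely for illustration.

      import numpy as np

      def calibrated_similarity(X, Y, metric, n_perm=1000, seed=0):
          """X, Y: (n_samples, d) representations; metric: callable(X, Y) -> float.
          Returns the raw score and its z-score against a permutation null."""
          rng = np.random.default_rng(seed)
          observed = metric(X, Y)
          null = np.array([metric(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)])
          return observed, (observed - null.mean()) / (null.std() + 1e-12)

      def linear_cka(X, Y):
          """Linear CKA, one common representational similarity measure."""
          Xc, Yc = X - X.mean(0), Y - Y.mean(0)
          return (np.linalg.norm(Xc.T @ Yc) ** 2
                  / (np.linalg.norm(Xc.T @ Xc) * np.linalg.norm(Yc.T @ Yc)))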

  9. COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression

    Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches often suffer from iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at https://github.com/mts-ai/COMPOT.
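
    A sketch of the two closed-form ingredients the abstract names: an orthogonal Procrustes update for the dictionary and single-step hard-threshold sparse coding for the coefficients. Shapes and the thresholding rule are generic assumptions about the technique, not COMPOT's exact formulation.

      import numpy as np

      def procrustes_update(W, C):
          """argmin_D ||W - D @ C||_F with D semi-orthogonal:
          D = U @ Vt from the SVD of W @ C.T."""
          U, _, Vt = np.linalg.svd(W @ C.T, full_matrices=False)
          return U @ Vt

      def sparse_code(A, k):
          """Keep the k largest-magnitude coefficients per column, zero the rest."""
          out = np.zeros_like(A)
          idx = np.argsort(-np.abs(A), axis=0)[:k]
          np.put_along_axis(out, idx, np.take_along_axis(A, idx, axis=0), axis=0)
          return out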

  10. Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

    Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify the primary cause as a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This algorithm re-frames the single-step generation task as a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improved understanding abilities related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
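
    A skeletal sketch of the "generate-understand-regenerate" pattern, with the unified model's two roles abstracted as callables (interfaces assumed; an empty critique signals acceptance).

      def r3_generate(generate, understand, prompt, rounds=3):
          """generate(prompt, critique) -> image; understand(prompt, image) ->
          critique string, empty when the output satisfies the instruction."""
          image = generate(prompt, "")
          for _ in range(rounds):
              critique = understand(prompt, image)
              if not critique:
                  break
              image = generate(prompt, critique)
          return image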

  11. On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

    Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, at the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
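
    A minimal sketch of a randomly masked RMSProp-style step as described above; keep_prob and the state layout are assumptions.

      import numpy as np

      def masked_rmsprop_step(param, grad, state, rng, lr=1e-3, beta=0.99,
                              eps=1e-8, keep_prob=0.5):
          """One step where each coordinate's update is kept with probability
          keep_prob and zeroed otherwise. state = {"v": np.zeros_like(param)}."""
          state["v"] = beta * state["v"] + (1 - beta) * grad ** 2
          update = grad / (np.sqrt(state["v"]) + eps)
          mask = rng.random(param.shape) < keep_prob
          return param - lr * mask * update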

  12. Panini: Continual Learning in Token Space via Structured Memory

    Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (the LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW): an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance (5%-7% higher than other competitive baselines) while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time, as achieved by the GSW framework, yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.
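
    A toy sketch of answering from the QA-pair network instead of re-reading verbatim chunks: greedily extend inference chains through the highest-scoring neighbors. The graph layout and scoring interface are assumptions, not the GSW implementation.

      def traverse_gsw(graph, start_nodes, score, depth=3):
          """graph: {qa_id: [neighbor qa_ids]}; score(qa_id) -> relevance to
          the query. Returns candidate chains, best-scoring first."""
          chains = []
          for node in start_nodes:
              chain = [node]
              for _ in range(depth - 1):
                  nxt = max(graph.get(chain[-1], []), key=score, default=None)
                  if nxt is None:
                      break
                  chain.append(nxt)
              chains.append(chain)
          return sorted(chains, key=lambda c: sum(map(score, c)), reverse=True)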

  13. TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

    Large Language Models (LLMs) are changing the coding paradigm in a shift known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
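
    A tiny sketch of capability-conditioned curriculum selection over the four-tier suite, encoding the abstract's finding that weaker models gain most from an easy-to-hard progression while stronger models do better hard-first; the probe threshold is an assumption.

      TIERS = ["basic", "intermediate", "complex", "edge"]

      def curriculum_order(probe_pass_rate, threshold=0.5):
          """Easy-to-hard for weaker models, hard-first for stronger ones."""
          return list(TIERS) if probe_pass_rate < threshold else TIERS[::-1]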

  14. STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

    Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refinement, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy, and JustRL.
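
    A hedged sketch of masking a handful of tokens out of a sequence-level policy-gradient loss and renormalizing over the rest; the probability floor is an illustrative stand-in for the paper's spurious-token criterion.

      import torch

      def stapo_like_loss(logprobs, advantages, token_probs, prob_floor=1e-4):
          """logprobs, token_probs: (batch, seq); advantages: (batch,).
          Tokens below prob_floor are treated as spurious and excluded."""
          mask = (token_probs >= prob_floor).float()
          per_token = -logprobs * advantages.unsqueeze(1) * mask
          return per_token.sum() / mask.sum().clamp(min=1.0)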

  15. Visual Persuasion: What Influences Decisions of Vision-Language Models?

    The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
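
    A skeletal sketch of the revealed-preference loop: propose a plausible edit, apply it with an image-generation model, and keep whichever variant the VLM chooses head-to-head. All four interfaces are assumptions.

      def optimize_image(image, propose_edit, apply_edit, choice_prob, rounds=10):
          """choice_prob(candidate, incumbent) -> probability the VLM picks
          the candidate in a pairwise choice task."""
          for _ in range(rounds):
              candidate = apply_edit(image, propose_edit(image))
              if choice_prob(candidate, image) > 0.5:
                  image = candidate   # the edit increased selection probability
          return image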

Solidot (8)

  1. Older adults are the main audience for false medical information

    Researchers at the University of Utah tracked the web-browsing activity of more than a thousand US adults over four weeks and found that the main audience for false medical information is older adults, particularly right-leaning older adults. During the study, participants visited about 9 million URLs, including 500,000 YouTube videos. 1,055 domains fell into the health category, of which 78 were judged to spread false health information. Only 13% of participants visited such sites, and most of the visits were concentrated among older adults. The researchers note that their data cannot determine whether participants reached these sites via Google searches or Facebook recommendations.

  2. Seagate and Western Digital confirm their 2026 hard drive capacity is sold out

    Two of the three major hard drive manufacturers, Seagate and Western Digital, have confirmed that their 2026 hard drive production capacity is fully or nearly sold out, and the third, Toshiba, is likely in a similar position. Western Digital CEO Irving Tan said the company has supply agreements with two of its five largest customers running through 2027, and with another through 2028. Seagate CEO William Mosley said the company will begin taking orders for the first half of 2027 in the coming months. The major customers of both Seagate and Western Digital are data center operators, including Amazon AWS, Google, Microsoft Azure, Meta, and OpenAI. Server drives now account for 87% of Seagate's total drive sales, up from 83% a year ago. Seagate says it has no plans to expand capacity for now.

  3. Soaring memory prices drive up sales of used laptops

    Memory and hard drives are in short supply because of large-scale purchasing by AI companies, with prices multiplying within a few months, and the shortage of key components such as RAM has driven up sales of used and refurbished laptops. According to data from Context, refurbished laptop sales in the five largest European markets (Italy, the UK, Germany, Spain, and France) rose 7% in the fourth quarter of last year. Forty percent of sales came from budget-constrained customers buying laptops priced between $235 and $355. Sales in the $355-475 range are also expanding, accounting for 23% of all used laptop sales, up from 15% a year earlier, suggesting some customers are willing to pay more for better specs.

  4. DNA mutations in the children of Chernobyl cleanup workers

    Researchers sequenced the genomes of 130 people whose fathers took part in the cleanup of the Chernobyl nuclear accident. By comparing them with a control group, the researchers found the first evidence of a "transgenerational effect" of fathers' prolonged exposure to low-dose ionizing radiation. The study was published in Scientific Reports. Rather than looking for new single mutations, the researchers looked for clustered de novo mutations (cDNMs): two or more closely spaced mutations that are absent in the parents but appear for the first time in their children. These mutations arise from radiation-induced breaks in the parents' DNA. The researchers found that paternal radiation exposure significantly increased the number of cDNMs in offspring, and that the cDNM count correlated with the radiation dose received. The increase in cDNMs did not raise the children's disease risk, possibly because most cDNMs lie in non-coding regions of DNA.

  5. Babylon 5 uploaded to YouTube for free viewing

    Warner Bros. Discovery is uploading the celebrated science fiction series Babylon 5 to YouTube at a rate of one episode per week, free for everyone to watch. The first episode of season one, "The Gathering", was uploaded on January 22 and has drawn 250,000 views so far; the second episode, "Midnight on the Firing Line", and the third, "Soul Hunter", have also been released. The weekly cadence follows the show's original broadcast schedule, letting viewers experience the story at the same pace. Babylon 5 premiered on February 22, 1993 and ran for five seasons and 110 episodes. The story is set in the years 2257-2262, when the Earth Alliance, formed by Earth's nations, Mars, and a colony at Proxima Centauri, has made contact with other alien civilizations and acquired hyperspace technology for faster-than-light travel. Ten years before the story begins, Earth was nearly annihilated in an interstellar war with the Minbari, who suddenly surrendered on the eve of victory. To keep the tragedy from recurring, the two sides established channels for peaceful exchange, and humanity built the Babylon 5 space station as a hub for diplomacy and trade. Babylon 5 becomes the focal point of political intrigue, racial tensions, and a great war, while Earth cuts ties with its allies and slides toward fascism.

  6. Ars Technica AI reporter apologizes for AI-generated content

    The well-known tech outlet Ars Technica was found last week to have used AI-generated content as a source in an AI news story. Ars co-founder and editor-in-chief Ken Fisher published a public apology on Sunday, saying a review of recently published articles found no other pieces containing AI-generated content, so this appears to be an isolated incident. Benj Edwards, the story's co-author and Ars's senior AI reporter, explained that he had tried to use an experimental AI tool built on Claude Code to extract structured quotes from source material for an outline, but the AI refused to process it; he speculated this might be because the article described a harassment incident (an AI harassing a human). He then pasted the text into ChatGPT and failed to notice that ChatGPT had produced paraphrased versions of the author's words rather than the original quotes, and he did not verify the quotes against the source. An AI reporter tripped up by AI hallucinations: the irony is hard to miss.

  7. OpenClaw founder joins OpenAI

    Peter Steinberger, founder of the open-source OpenClaw project, announced he is joining OpenAI, while OpenClaw will be managed by a foundation. OpenClaw is an open-source autonomous AI assistant, first released on GitHub in late 2025 under the name Clawdbot, later renamed Moltbot, and finally OpenClaw. In early 2026 the project drew attention for autonomously handling complex tasks across apps and online services based on user instructions. OpenClaw can be deployed on local devices running macOS, Windows, and other systems; it can call other large AI models and APIs, and it receives text instructions from users through messaging platforms such as WhatsApp, Telegram, Signal, and Discord to schedule events, send messages, organize files, write code, and more.

  8. Vim 9.2 released

    The Vim text editor project released v9.2 on Valentine's Day. Major changes include: experimental Wayland support; support for the XDG Base Directory Specification, which stores configuration files, cached data, and user data in separate directories; modern defaults for HiDPI displays; new code-completion features; an improved diff mode; a new vertical tab panel; native dark mode support in the Windows build; and more.