TEXT VIEW · TODAY'S DIGEST · 36 HEADLINES ACROSS 8 SOURCES

ISSUE 0871
WED, MAY 20, 2026
Discover the best information organized by OrangeBot.AI

The web,
read by a bot.

Ten sources — Hacker News, Product Hunt, HuggingFace, Techmeme and more — filtered, tagged, and summarized every morning for builders who don’t have time to scroll.

01

AI DIGEST

UPDATED DAILY · EDITOR'S PICK

AI News Summary

May 20, 2026

Major Tech IPOs Loom for SpaceX and OpenAI

Elon Musk's rocket and AI company, SpaceX, has filed for a massive initial public offering (IPO) that is expected to be the largest on record. Separately, reports indicate that AI industry leader OpenAI is also preparing to go public, signaling a major moment for the artificial intelligence sector.

Inflation and War Fears Rattle Global Markets

Growing concerns over war-driven inflation are causing uncertainty in financial markets. U.S. Treasury yields rose, signaling higher borrowing costs, while stock markets experienced volatility, especially in the semiconductor sector. In response, investors moved toward safer assets, causing prices for gold and silver to increase.

Oil Prices Fall on Hopes of U.S.-Iran Deal

The price of crude oil dropped significantly following reports that the U.S. and Iran are in the final stages of negotiations. A potential deal could reopen key shipping lanes and increase the global oil supply. The news also prompted a federal investigation into suspicious trades that occurred just before the announcement was made public.

Companies Announce Major Layoffs to Fund AI Investments

Several major companies, including Standard Chartered, announced significant job cuts affecting thousands of employees. The restructuring is aimed at reallocating funds to cover the high costs of investing in artificial intelligence and reflects a reduced need for middle management and operational roles.

Son of Mango Founder Named Suspect in Father's Death

Jonathan Andic, son of the billionaire founder of fashion retailer Mango, has been identified as the prime suspect in the investigation into his father’s mysterious death. Andic has publicly denied any wrongdoing in the matter.

02

ON THE WIRE

6 SOURCES
02

HACKER NEWS

Hacker News - May 20, 2026

Hacker News Feed: Highlighting key posts and discussions.

Map of Metal

(mapofmetal.com)

16950
GitHub Compromised

(twitter.com)

2596
Disney erased FiveThirtyEight

(www.natesilver.net)

425235
Gemini 3.5 Flash

(blog.google)

870600
Gemini Omni

(deepmind.google)

314133
OpenBSD 7.9

(www.openbsd.org)

403293
Peter Neumann has died

(www.tuhs.org)

31024
03

HUGGINGFACE

HuggingFace - May 20, 2026

When Vision Speaks for Sound

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.

86
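
The three counterfactual audio edits that Thud is built on (Shift, Mute, Swap) can be illustrated on a raw waveform. The sketch below is a toy under stated assumptions, not the authors' implementation: the function names, the use of plain Python lists as mono sample buffers, and the choice to pad with silence are all hypothetical.

```python
from typing import List

Waveform = List[float]  # hypothetical representation: mono samples as floats


def shift(audio: Waveform, offset: int) -> Waveform:
    """Shift: move the audio in time by `offset` samples, keeping the length.

    A positive offset delays the sound, a negative one advances it.
    The vacated samples are filled with silence (an assumption).
    """
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]


def mute(audio: Waveform) -> Waveform:
    """Mute: remove the sound entirely while keeping the clip duration."""
    return [0.0] * len(audio)


def swap(audio: Waveform, other: Waveform) -> Waveform:
    """Swap: replace the track with audio from a different clip,
    trimmed or silence-padded to the original length."""
    n = len(audio)
    return (other + [0.0] * n)[:n]


if __name__ == "__main__":
    clip = [0.1, 0.2, 0.3, 0.4]
    print(shift(clip, 2))
    print(mute(clip))
    print(swap(clip, [9.0, 8.0]))
```

Each edit leaves the video frames untouched, so a model that answers the same way before and after the edit is presumably not checking the audio stream.
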
Active Learners as Efficient PRP Rerankers

Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions do not match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-K. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show that active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls.

64
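
The randomized-direction oracle from the PRP abstract can be sketched in a few lines: each pair is shown to the judge once, in a randomly chosen order, so a constant preference for the first slot averages out. The judge here is a hypothetical simulated stand-in for an LLM call, and the bias model (a fixed additive bump toward the first slot) is an assumption made for illustration only.

```python
import random


def make_biased_judge(rng: random.Random, bias: float = 0.2):
    """Hypothetical stand-in for an LLM judge.

    Returns True when it prefers the first-presented item. Items are numeric
    quality scores; `bias` is an additive bump toward the first slot.
    """
    def judge(first: float, second: float) -> bool:
        if first > second:
            p_first = 0.8
        elif first < second:
            p_first = 0.2
        else:
            p_first = 0.5
        p_first = min(1.0, max(0.0, p_first + bias))
        return rng.random() < p_first
    return judge


def randomized_direction_oracle(judge, a, b, rng: random.Random) -> bool:
    """Return True if `a` is judged better than `b`, using ONE judge call.

    The presentation order is chosen by a fair coin, so a first-slot bias
    helps `a` half of the time and hurts it the other half.
    """
    if rng.random() < 0.5:
        return judge(a, b)
    return not judge(b, a)


def win_rate(compare, a, b, trials: int) -> float:
    """Fraction of trials in which `compare(a, b)` reports that `a` wins."""
    return sum(compare(a, b) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(0)
    judge = make_biased_judge(rng)
    # Two equally good items: a fixed order inherits the position bias,
    # while the randomized direction should hover near one half.
    print(win_rate(judge, 1.0, 1.0, 20000))
    print(win_rate(lambda a, b: randomized_direction_oracle(judge, a, b, rng),
                   1.0, 1.0, 20000))
```

The alternative of calling the judge in both directions and averaging removes the same bias but doubles the number of calls, which is the cost the abstract says the single-call oracle avoids.
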
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.

60
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

51
OpenComputer: Verifiable Software Worlds for Computer-Use Agents

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

50
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

48
Process Rewards with Learned Reliability

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.

43
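
The Beta-Binomial likelihood named in the BetaPRM abstract has a standard closed form, sketched below. Only the likelihood itself is textbook material; how BetaPRM produces the two Beta parameters is not stated in the abstract, so `alpha`, `beta`, and the helper names are assumptions for illustration.

```python
from math import exp, lgamma


def log_beta_fn(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """Log-probability of k successes in n Monte Carlo continuations when the
    step's success probability follows a Beta(alpha, beta) belief."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)


def belief_summary(alpha: float, beta: float):
    """Mean and variance of the Beta belief. A larger alpha + beta gives a
    smaller variance, which can be read as a more reliable prediction."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total * total * (total + 1.0))
    return mean, variance


if __name__ == "__main__":
    # Same predicted success probability, very different reliability.
    print(belief_summary(2.0, 2.0))
    print(belief_summary(20.0, 20.0))
    print(exp(beta_binomial_log_likelihood(3, 8, 2.0, 3.0)))
```

Unlike a point regression toward k/n, this objective treats 1 success out of 2 and 8 out of 16 differently, since the second observation supports a much narrower belief.
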
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which often uses 5 times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

37
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and tuning these cues into dense reasoning output. In addition, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carry genuine creative intent rather than simulated ones. Experiments on two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/

31
Harnessing LLM Agents with Skill Programs

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirements for stable skill library evolution.

21
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

21
Aurora: Unified Video Editing with a Tool-Using Agent

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page

16
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.

12
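
The two-sided question CEPO asks at every token can be written as a small decision rule. The sketch below is one literal reading of the abstract and not the paper's objective: the three log-probabilities, the function name, and the "ambiguous" label for tokens that satisfy only one condition are assumptions.

```python
def classify_token(logp_base: float, logp_correct: float, logp_wrong: float) -> str:
    """Label a token from three hypothetical log-probabilities of that token:

    logp_base    : under the policy without any answer in context
    logp_correct : under a teacher conditioned on the correct answer
    logp_wrong   : under a teacher conditioned on a rejected (wrong) rollout
    """
    favoured_by_correct = logp_correct > logp_base
    disfavoured_by_wrong = logp_wrong < logp_base
    if favoured_by_correct and disfavoured_by_wrong:
        return "decisive"   # satisfies both conditions
    if not favoured_by_correct and not disfavoured_by_wrong:
        return "filler"     # satisfies neither condition
    return "ambiguous"      # satisfies exactly one (label is an assumption)


if __name__ == "__main__":
    print(classify_token(-2.0, -0.5, -4.0))
    print(classify_token(-0.1, -0.1, -0.1))
    print(classify_token(-2.0, -0.5, -0.4))
```

The third call shows the case a one-sided test cannot separate: the token looks surprising-but-favoured under the correct answer, yet the wrong-answer teacher favours it just as much, so it carries no evidence about which answer is right.
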
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.

12
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.

11
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. Driven by the desire for a better visual experience and the rapid development of imaging technology, demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and our exploration of training strategies together provide valuable insights for future breakthroughs.

8
Video Models Can Reason with Verifiable Rewards

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.

8
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a 13× speedup while producing higher-quality results. Moreover, our approach scales to videos up to 16× longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

7
Semantic Generative Tuning for Unified Multimodal Models

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

7
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
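
The two RaPO components described above can be illustrated with a minimal sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: the linear penalty `r - beta * kl` in `shaped_rewards`, the `momentum` value, and the class name `EmaRewardStats` are hypothetical and not taken from the paper.

```python
class EmaRewardStats:
    """Running mean/variance of rewards that persists across task boundaries.

    Hypothetical stand-in for Cross-Task Advantage Normalization (CTAN).
    """

    def __init__(self, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.var = None

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            # First group seen: initialise directly from the batch statistics.
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards):
        std = (self.var + self.eps) ** 0.5
        return [(r - self.mean) / std for r in rewards]


def shaped_rewards(task_rewards, kl_drifts, beta=0.5):
    """Assumed shaping: subtract a KL-drift penalty from each task reward."""
    return [r - beta * kl for r, kl in zip(task_rewards, kl_drifts)]


# Three rollouts with identical task reward but different drift
# from the preceding-task policy.
stats = EmaRewardStats()
group = shaped_rewards([1.0, 1.0, 1.0], [0.1, 0.5, 0.9])
stats.update(group)
adv = stats.advantages(group)
```

In this toy group the rollout with the smallest drift receives the largest advantage, which is the qualitative behavior the abstract attributes to the Retention Reward.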

5
Delta Attention Residuals

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas -- the change introduced by each sublayer (v_i = h_{i+1} - h_i) -- instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M--7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7--8.2% validation perplexity gains. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
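
As a rough illustration of the delta definition v_i = h_{i+1} - h_i, the sketch below computes deltas from a list of hidden states and routes over them with softmax attention. Only the delta formula and the use of softmax attention come from the abstract; the scaled dot-product scoring, the explicit `query` vector, and the function names are assumptions for this example.

```python
import math


def deltas(hidden_states):
    """Return v_i = h_{i+1} - h_i for consecutive hidden states."""
    return [
        [b - a for a, b in zip(h_i, h_next)]
        for h_i, h_next in zip(hidden_states, hidden_states[1:])
    ]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(query, candidates):
    """Attention weights of `query` over `candidates`, plus the weighted mix.

    Scaled dot-product scoring is an assumption, not the paper's definition.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, cand)) / scale for cand in candidates]
    weights = softmax(scores)
    mixed = [
        sum(w * cand[d] for w, cand in zip(weights, candidates))
        for d in range(len(query))
    ]
    return weights, mixed


hidden = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [4.0, 2.0]]
v = deltas(hidden)                      # per-sublayer changes
weights, mixed = route([1.0, 0.0], v)   # route over deltas, not cumulative states
```

Routing over `hidden[1:]` instead of `v` would attend over the cumulative states that the abstract describes as redundant.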

4
Context Memorization for Efficient Long Context Generation

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on the NBA benchmark using only 20% of its memory footprint.

3
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
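
The priority-based Evictor with a fixed token budget could look roughly like the following sketch. It is a greedy illustration under stated assumptions, not PEEK's actual policy: the `(priority, text)` entry shape, whitespace token counting, and the function name `evict_to_budget` are all hypothetical.

```python
def evict_to_budget(entries, budget):
    """Keep the highest-priority entries whose total token count fits `budget`.

    entries: list of (priority, text) pairs; tokens are counted by whitespace
    splitting, which is a simplification of a real tokenizer.
    """
    kept, used = [], 0
    for priority, text in sorted(entries, key=lambda e: e[0], reverse=True):
        cost = len(text.split())
        if used + cost <= budget:
            kept.append((priority, text))
            used += cost
    return kept


context_map = [
    (0.9, "users table: id name email"),   # 5 tokens
    (0.2, "old note about temp files"),    # 5 tokens
    (0.5, "MAX_RETRIES is 3"),             # 3 tokens
]
kept = evict_to_budget(context_map, budget=8)
```

With a budget of 8 tokens, the two higher-priority entries fill the map exactly and the lowest-priority note is dropped, so the map stays constant-sized regardless of how many entries are proposed.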

3
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

3
Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.

3
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential 'prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.

3
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that having a language model learn multiple problem-solving approaches through self-generated data helps subsequent RL.

2
Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
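
To make the "retain only 10% of tokens" idea concrete, here is a minimal sketch of score-based token retention. It is deliberately simpler than SEATS: it does not model attention-weighted diversity selection, per-block pruning, or budget allocation across windows and modalities. The relevance scores and the function name `keep_top_fraction` are assumptions for illustration.

```python
def keep_top_fraction(scores, fraction):
    """Return indices of the highest-scoring tokens, in original order.

    At least one token is always kept.
    """
    k = max(1, round(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore temporal order of the surviving tokens


# Hypothetical query-relevance scores for ten non-textual tokens.
relevance = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.6, 0.4, 0.7, 0.15]
kept = keep_top_fraction(relevance, fraction=0.2)
```

Sorting the surviving indices keeps the retained tokens in their original temporal order, which matters when the sequence is interleaved by time window.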

2
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).
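
The idea of treating GPU memory as a working-set cache and transferring only deltas between iterations can be sketched with plain sets. The block identifiers and the evict-everything-not-visible policy are assumptions for this example; the actual TideGS streaming and retention policy is not described in the abstract.

```python
def working_set_delta(resident, visible):
    """Return (blocks to load, blocks to evict) when moving to a new camera batch.

    resident: set of block ids currently on the GPU.
    visible:  set of block ids needed by the next iteration.
    """
    to_load = visible - resident    # newly visible blocks must be streamed in
    to_evict = resident - visible   # blocks no longer visible can be written back
    return to_load, to_evict


resident = {1, 2, 3, 4}
visible = {3, 4, 5}
to_load, to_evict = working_set_delta(resident, visible)
resident = (resident - to_evict) | to_load
```

Only block 5 is transferred in this step, while blocks 3 and 4 stay in place, which is the saving that differential streaming targets when consecutive camera batches overlap.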

2
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.

2
DocAtlas: Multilingual Document Understanding Across 80+ Languages

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.

2
Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU^2 achieves a good sparsity--accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped version, and GELU^2 achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.

1
Computer Science Conferences Should Require Nonrepudiable Experimental Results

This position paper argues that computer science conferences should require tamper-evident, nonrepudiable attestations of experimental results. We name the underlying problem experiment nonrepudiation: a compliant protocol must bind the numbers in a paper to an actual executed computation in a way the author cannot later alter or deny. The current system relies on self-reported checklists, optional code sharing, and author-controlled logging. None of these mechanisms answer the question a reviewer cannot check: did the code the paper describes produce the numbers the paper reports? We define the problem formally, state the security properties any compliant protocol must satisfy, and describe a threat model that includes attacks current approaches do not prevent. To show that the problem is solvable, we built K-Veritas, a reference implementation in Go that produces signed reports without accessing training data. K-Veritas is a testbed, not a finished answer. We call on conferences and the community to treat nonrepudiation as a first-class requirement and to help build an open, independent standard for it.

1
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing. The project page is available at https://sjj118.github.io/RT-Splatting.

1
Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching

Dexterous manipulation is physics-intensive and highly sensitive to modeling errors and perception noise, making sim-to-real transfer prohibitively challenging. Domain randomization (DR) is commonly used to improve the robustness of learned policies for such tasks, but conventional DR randomizes one instance per episode, offering very limited exposure to the variability of real-world dynamics. To address this, we propose Domain-Randomized Instance Set (DRIS), which represents and propagates a set of randomized instances simultaneously, providing a richer approximation of uncertain dynamics and enabling policies to learn actions that account for multiple possible outcomes. Supported by theoretical analysis, we show that DRIS yields more robust policies and alleviates the need for real-world fine-tuning, even with a modest number of instances (e.g., 10). We demonstrate this on a challenging reactive catching task. Unlike traditional catching setups that use end-effectors designed to mechanically stabilize the object (e.g., curved or enclosing surfaces), our system uses a flat plate that offers no passive stabilization, making the task highly sensitive to noise and requiring rapid reactive motions. The learned policies exhibit strong robustness to uncertainties and achieve reliable zero-shot sim-to-real transfer.

1
Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

This paper tackles the task of learning to generate signals over triangle meshes in a triangulation-agnostic manner, meaning the trained model can be applied to different meshes and triangulations effectively. Practically, the paper adapts the flow matching (FM) paradigm to a mesh-based, triangulation-agnostic setting. Theoretically, it proposes a specific noise distribution which is triangulation agnostic, to be used inside the FM model's denoising process. While noise distributions are usually trivial to devise for, e.g., images, devising a triangulation-agnostic distribution proves to be a much more difficult task. We formulate a mathematical definition of triangulation agnosticism of distributions, via their spectrum. We then show that a discretization of a specific Gaussian random field called a Matérn process holds these desired properties, and provides a simple and efficient sampling algorithm. We use it as our noise model, and adapt FM to the triangulation-agnostic setting by using a state-of-the-art approach for learning signals on meshes in the gradient domain -- PoissonNet -- as the denoiser. We conduct experiments on elaborate tasks such as sampling elastic rest states, and generating poses of humanoids. Our method is shown to be capable of producing highly realistic results for meshes of over one million triangles, significantly exceeding the state-of-the-art in quality and diversity.

1
Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

1
Language-Switching Triggers Take a Latent Detour Through Language Models

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.

1
Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

1
SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.
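
The split conformal calibration wrapper mentioned above can be illustrated with the textbook procedure. This is a generic sketch, not SAGA's implementation: the absolute-residual nonconformity score, the symmetric interval, and the function names are assumptions, and the adaptive temporal variant named in the title is not modeled.

```python
import math


def conformal_radius(cal_preds, cal_truths, alpha=0.1):
    """Interval half-width from a held-out calibration set.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual, which
    gives marginal coverage of at least 1 - alpha under exchangeability.
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_truths))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def interval(pred, radius):
    return pred - radius, pred + radius


# Toy calibration set with absolute residuals 1, 2, 3, 4.
radius = conformal_radius([10.0] * 4, [11.0, 12.0, 13.0, 14.0], alpha=0.5)
low, high = interval(20.0, radius)
```

The guarantee is marginal, meaning it holds on average over individuals, which is why the abstract reports worst-case subgroup coverage separately.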

1
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.

1
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it hard for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing

1
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.

0
RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
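For readers less familiar with the mechanism under discussion, here is a minimal plain-Python sketch of the standard RoPE rotation. It is not code from the paper; the helper names `rope` and `score` and the toy vectors are illustrative assumptions. It shows the basic property that the rotated query-key score depends only on the offset between positions, and exposes the `base` hyperparameter the abstract refers to.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair index j = i / 2 has frequency base**(-2j/d) = base**(-i/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalised attention score between a rotated query and key."""
    rq, rk = rope(q, q_pos, base), rope(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))

# Toy 4-dimensional query and key (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = score(q, k, 5, 2)         # offset 3
shifted = score(q, k, 105, 102)  # same offset 3, shifted by 100 positions
print(near, shifted)
```

Because only the offset matters, two different offsets can produce similar scores once the rotation angles wrap around, which is the kind of ambiguity the abstract analyses at long context lengths.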

0
Why Do Reasoning Models Lose Coverage? The Role of Data and Forks in the Road

Recent progress in large language models has led to the emergence of reasoning models, which have shown strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior work has observed that they exhibit coverage shrinkage, where pass@k degrades relative to the base model. In this paper, we investigate why this shrinkage arises under SFT-based post-training. We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically decision points or "forks in the road" scenarios where the model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage can be partially mitigated through targeted data synthesis around decision points and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.
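The gap between pass@1 and pass@k can be made concrete with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The sketch below is illustrative only: the two "models" and their per-problem success counts are hypothetical numbers, not results from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    # math.comb returns 0 when n - c < k, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 20  # samples drawn per problem

# Hypothetical base model: solves each of two problems 6 times out of 20.
base_counts = [6, 6]
# Hypothetical fine-tuned model: always solves problem 1, never problem 2.
tuned_counts = [20, 0]

base_p1 = mean_pass_at_k(base_counts, N, 1)
tuned_p1 = mean_pass_at_k(tuned_counts, N, 1)
base_p10 = mean_pass_at_k(base_counts, N, 10)
tuned_p10 = mean_pass_at_k(tuned_counts, N, 10)
print(base_p1, tuned_p1, base_p10, tuned_p10)
```

In this toy setup the fine-tuned model wins on pass@1 but loses on pass@10, which is the coverage shrinkage pattern the abstract describes.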

0
optimize_anything: A Universal API for Optimizing any Text Parameter

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa .
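The framing of "a text artifact evaluated by a scoring function" can be sketched as a generic search loop. This is a toy illustration and not the optimize_anything API: `optimize_text`, `score`, and `propose` are hypothetical names, and a random single-character mutation stands in for the LLM that would propose edits in a real system.

```python
import random

def optimize_text(seed_text, score, propose, steps, rng):
    """Greedy search: keep a candidate only if it scores strictly higher."""
    best, best_score = seed_text, score(seed_text)
    for _ in range(steps):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy task (assumption): the "artifact" is a 4-letter string and the score
# counts positions that match a hidden target.
TARGET = "sort"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def score(text):
    return sum(a == b for a, b in zip(text, TARGET))

def propose(text, rng):
    # Stand-in for an LLM proposal: replace one random character.
    i = rng.randrange(len(text))
    return text[:i] + rng.choice(ALPHABET) + text[i + 1:]

start = "aaaa"
best, best_score = optimize_text(start, score, propose,
                                 steps=2000, rng=random.Random(0))
print(best, best_score)
```

The loop only ever sees a number from `score`. The abstract's ablation suggests that richer side information, such as error messages, helps more than a bare score, which this toy version does not model.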

0
05

PRODUCT HUNT

05.00
PRODUCT HUNT

Product Hunt - May 20, 2026

Product Hunt Daily Feed: Featuring noteworthy tech launches.

Chromtuner

A chromatic tuner for macOS. ±1¢ accuracy

0
Insta360 Mic Pro

Pro audio with a customizable color E-Ink face

0
Tophat by Shopify

Test mobile CI builds on any device without building locally

0
Supercut for Agents

Permission-aware AI access to recordings and metadata

0
Retina

Screen recorder w/ auto-zoom, smooth cursors, + AI graphics

0
Skilled

Dashboard to find agent skills you no longer need

0
LayerProof Kraft

Co-write insightful long form content

0
Re_gent

Version Control for AI agent Activity

0
Contextberg

Turn your work into AI agent memory, served over MCP

0
StoreClaw

Grow your store profits with agents that know how to sell

0
Glia

Local-first AI memory bridge between browser chats and IDEs

0
GhostSnap

Multiple screenshots - Single paste - Auto compressed for AI

0
Owlish

Reduce support volume with AI agents trained on your docs

0
Viberia

Command AI agents like you're playing Civilization

0
Manus Scheduled Tasks 2.0

Run recurring Manus work inside the same task context

0
Multi-Claude

Run multiple Claude accounts side by side on your Mac

0
Emdash

One app. Every coding agent. Open-source.

0
Gemini Omni

Create anything from any input – starting with video

0
mailX by mailwarm

Email deliverability toolkit for humans and AI agents

0
Runtime

Sandboxed coding agents for everyone on your team

0
Tether

The presence who comes to life in your messages

0
Invenio

Local AI search for Mac video & photo libraries

0
Type Switch 3.0 for macOS

Instant language switching for multilingual Mac users

0
Composer 2.5

Cursor’s most powerful model yet

0
Monocle 3.5 for macOS

Noise-cancelling for your screen

0
Papr Graph

Upgrade to graph-native vector embeddings

0
Insights by Omnia

Step-by-step action plans to improve your AI visibility.

0
Agora-1 by Odyssey

A multi-agent world model you can play

0
CtrlOps

Deploy, Debug & Manage Linux Servers with AI.

0
AutoShelf

Auto-organize files on your Mac

0
Starchild-1 by Odyssey

The first real-time multimodal world model

0
CLI Market

3,760 retailers, one API for AI agents

0
Lyricly

Live lyrics in your dynamic Notch & floating on your desktop

0
Drizz

Mobile tests that write, run, and fix themselves

0
Chert

Build AI agents that text customers in iMessage

0
Motion

A video agent for tasteful motion design

0
Cosmic Insights

Cookieless web analytics built into your CMS

0
Buggyverse

Study with strangers online, high-accountability focus rooms

0
calog.cc

Chat-based calorie tracker that actually knows desi food

0
Trainer

Train AI agents by recording your screen

0
CaseGap AI

Find law firm revenue leaks, then fix them

0
Thinnest AI

Build Voice AI Agents in 100+ languages for ₹1.5/min

0
ShioriCode

Open-source alternative to Codex & Claude Code

0
Haystack

Review the pull requests that actually need human attention

0
Hanami

A daily meditation with Japanese art

0
LearnHouse

The modern way to teach what you build

0
imgproxy v4

A fast and secure self-hosted image processing server

0
Voker

The Agent Analytics Platform for AI Product Teams

0
Mantle Chat

Collaboration platform where teams work with AI together

0
VWFNDR™ + MBL

Take raw photos with proof they're real, not AI

0
06

TECHMEME

06.00
TECHMEME

Techmeme - May 20, 2026

Techmeme Digest: Major tech headlines and industry conversations.

SpaceX S-1: Anthropic is paying SpaceX $1.25B/mo. until May 2029 under their compute deal; Anthropic says it's expanding the deal to include Colossus 2 capacity (Ina Fried/Axios)
Source: Techmeme · Published: May 20, 2026

Ina Fried / Axios : SpaceX S-1: Anthropic is paying SpaceX $1.25B/mo. until May 2029 under their compute deal; Anthropic says it's expanding the deal to include Colossus 2 capacity —  Anthropic is paying SpaceX $1.25 billion per month through May 2029 as part of the massive compute deal the companies signed earlier this month.

Filing: SpaceX reports 2025 revenue of $18.7B, up 33% YoY, a $4.9B loss, vs. a $791M profit in 2024, and $20.7B in capital expenditures, up from $11.2B (New York Times)
Source: Techmeme · Published: May 20, 2026

New York Times : Filing: SpaceX reports 2025 revenue of $18.7B, up 33% YoY, a $4.9B loss, vs. a $791M profit in 2024, and $20.7B in capital expenditures, up from $11.2B —  Mr. Musk's rocket and satellite maker disclosed its financial performance for the first time, as it prepares to go public in what is set to be one of the largest offerings to date.

Nvidia reports Q1 net income up 211% YoY to $58.3B, beating analyst estimates of $42.9B, and raises Q2 revenue forecast to $91B (Robbie Whelan/Wall Street Journal)
Source: Techmeme · Published: May 20, 2026

Robbie Whelan / Wall Street Journal : Nvidia reports Q1 net income up 211% YoY to $58.3B, beating analyst estimates of $42.9B, and raises Q2 revenue forecast to $91B —  Astronomical rise in AI agents and demand for data-center computing lift chipmaker to another record quarter  —  Chip giant Nvidia reported record sales and income Wednesday …

In disclosures to investors, Anthropic says it expects to generate $10.9B in revenue in Q2, vs. $4.8B in Q1, and turn a $559M operating profit, its first ever (Berber Jin/Wall Street Journal)
Source: Techmeme · Published: May 20, 2026

Berber Jin / Wall Street Journal : In disclosures to investors, Anthropic says it expects to generate $10.9B in revenue in Q2, vs. $4.8B in Q1, and turn a $559M operating profit, its first ever —  The startup expects a 130% revenue surge to $10.9 billion in the June quarter and its first operating profit, defying skeptics of the AI boom

SpaceX files publicly for its IPO, choosing Nasdaq to make its debut under the symbol SPCX (Bloomberg)
Source: Techmeme · Published: May 20, 2026

Bloomberg : SpaceX files publicly for its IPO, choosing Nasdaq to make its debut under the symbol SPCX —  SpaceX Files for IPO on Nasdaq Under SPCX Symbol

Nvidia reports Q1 revenue up 85% YoY to $81.6B, Data Center revenue up 92% to $75.2B, and announces an $80B additional share repurchase authorization (Nvidia Newsroom)
Source: Techmeme · Published: May 20, 2026

Nvidia Newsroom : Nvidia reports Q1 revenue up 85% YoY to $81.6B, Data Center revenue up 92% to $75.2B, and announces an $80B additional share repurchase authorization —  - Record revenue of $81.6 billion, up 85% from a year ago  — Record Data Center revenue of $75.2 billion, up 92% from a year ago

OpenAI says an internal general-purpose reasoning model has disproved the Erdős unit distance conjecture, a central problem in discrete geometry posed in 1946 (OpenAI)
Source: Techmeme · Published: May 20, 2026

OpenAI : OpenAI says an internal general-purpose reasoning model has disproved the Erdős unit distance conjecture, a central problem in discrete geometry posed in 1946

Granta and the Commonwealth Foundation say they can't determine yet if AI was used to write a prize-winning short story after critics pointed to signs of AI use (The Guardian)
Source: Techmeme · Published: May 20, 2026

The Guardian : Granta and the Commonwealth Foundation say they can't determine yet if AI was used to write a prize-winning short story after critics pointed to signs of AI use —  Granta publisher says ‘perhaps we never will know’ true authorship of work that won Commonwealth prize

Google says it is testing new ad formats in search results and AI Mode, including Conversational Discovery ads, Highlighted Answers, and AI-powered Shopping ads (Anu Adegbola/Search Engine Land)
Source: Techmeme · Published: May 20, 2026

Anu Adegbola / Search Engine Land : Google says it is testing new ad formats in search results and AI Mode, including Conversational Discovery ads, Highlighted Answers, and AI-powered Shopping ads —  Google is introducing a new generation of Gemini-powered ad formats across AI Mode and Search designed to make ads feel more conversational …

Airbnb says it is adding luggage storage, airport pickups, car rentals, grocery delivery, and thousands of boutique and independent hotels to its platform (Jacob Passy/Wall Street Journal)
Source: Techmeme · Published: May 20, 2026

Jacob Passy / Wall Street Journal : Airbnb says it is adding luggage storage, airport pickups, car rentals, grocery delivery, and thousands of boutique and independent hotels to its platform —  Platform ramps up hotel-booking options, car rentals and AI-enabled features, expecting World Cup to drive record use

Sources: OpenAI is preparing to file confidentially for an IPO as early as Friday; the company plans to be ready to go public as early as September (Wall Street Journal)
Source: Techmeme · Published: May 20, 2026

Wall Street Journal : Sources: OpenAI is preparing to file confidentially for an IPO as early as Friday; the company plans to be ready to go public as early as September —  The artificial-intelligence giant is working with bankers at Goldman Sachs and Morgan Stanley  —  ChatGPT-maker OpenAI has been working …

Ubisoft reports a record operating loss of $1.40B for the year to March and says sales in 2026-27 will fall by about 8% to 9%; UBI falls 6%+ (Reuters)
Source: Techmeme · Published: May 20, 2026

Reuters : Ubisoft reports a record operating loss of $1.40B for the year to March and says sales in 2026-27 will fall by about 8% to 9%; UBI falls 6%+ —  French videogame publisher Ubisoft warned on Wednesday of another year of losses and lower sales after a record annual operating loss, deepening pressure on the company as it restructures.

Internal memo: Xbox hires game industry analyst Matthew Ball as chief strategy officer, and names Scott Van Vliet, who led Azure AI infrastructure, as Xbox CTO (Tom Warren/The Verge)
Source: Techmeme · Published: May 20, 2026

Tom Warren / The Verge : Internal memo: Xbox hires game industry analyst Matthew Ball as chief strategy officer, and names Scott Van Vliet, who led Azure AI infrastructure, as Xbox CTO —  The second major Xbox leadership changes this month. … Microsoft has recruited game industry analyst Matthew Ball as Xbox chief strategy officer.

Internal memo: Mark Zuckerberg told employees that he does not expect more company-wide layoffs this year (Katie Paul/Reuters)
Source: Techmeme · Published: May 20, 2026

Katie Paul / Reuters : Internal memo: Mark Zuckerberg told employees that he does not expect more company-wide layoffs this year —  Meta (META.O) CEO Mark Zuckerberg told employees in an internal memo on Wednesday that he does not expect more company-wide layoffs this year, according to a copy of the memo seen by Reuters.

An interview with Match Group CEO Spencer Rascoff about plans for Tinder, including a redesign, AI features, live events, and group dating to win over Gen Z (Samantha Kelly/Bloomberg)
Source: Techmeme · Published: May 20, 2026

Samantha Kelly / Bloomberg : An interview with Match Group CEO Spencer Rascoff about plans for Tinder, including a redesign, AI features, live events, and group dating to win over Gen Z —  The dating app's turnaround bets on live events, group dating and an AI-heavy redesign.  —  At a pickleball venue near Santa Monica State beach …

07

STARTUP ARCHIVE

07.00
STARTUP ARCHIVE

Startup News - May 20, 2026

Startup News Roundup: Aggregating key funding and launch updates.

Marc Andreessen on the 5 personality traits of an innovator
Source: Startup · Published: Mar 31, 2026

“When you’re talking about real innovators—people who actually do really creative, breakthrough work—I think you’re talking about a couple things:”

Steve Jobs explains the importance of both thinking and doing
Source: Startup · Published: Mar 30, 2026

“The doers are the major thinkers. The people who really create the things that change this industry are both the thinker-doer in one person.”

Tobi Lutke explains what the VCs who passed on Shopify got wrong
Source: Startup · Published: Mar 27, 2026

“What a lot of free-market thinkers don’t understand is that between the demand and eventual supply lies friction.”

Sam Altman explains how he decides to invest in a startup after 10 minutes
Source: Startup · Published: Mar 26, 2026

“Does this person have the potential to be the next Mark Zuckerberg?… [You don’t get to] 100% accuracy, obviously, but it’s good enough that our business model works.”

Jony Ive recounts the time Steve Jobs called him vain
Source: Startup · Published: Mar 25, 2026

In the clip below, Jony Ive recounts the time he asked Steve Jobs to be less harsh in his critique of a piece of work.

Jeff Bezos’s two pieces of advice for aspiring entrepreneurs
Source: Startup · Published: Mar 24, 2026

“The advice that I would give entrepreneurs is don't chase the hot new thing. It's so hard to catch something that everybody already knows is hot.”

Elad Gil: “Things that work tend to work pretty fast”
Source: Startup · Published: Mar 23, 2026

“I do think there’s a bit of a myth in Silicon Valley that you should keep grinding no matter what and it’s just about perseverance, and I think that’s really bad advice.”

Paul Graham on why starting with a “small, intense fire” is the key to startup growth
Source: Startup · Published: Mar 20, 2026

"You have to know who those first users are and how you're going to get them."

Keith Rabois on how to identify great talent
Source: Startup · Published: Mar 19, 2026

“What you want to do with every single employee every single day is expand the scope of their responsibilities until it breaks… and that’s the role they should stay in.”

Wealthfront CEO on why advertising spend makes it harder to find product/market fit
Source: Startup · Published: Mar 18, 2026

“The way that you know you have product/market fit is if you have exponential organic growth.”

Eric Schmidt on why most companies get strategy wrong
Source: Startup · Published: Mar 17, 2026

“Work very, very hard to figure out what the world’s going to look like in five years. What will people be doing? What will your customers want? Where will costs be?”

Mark Zuckerberg: “You can’t 80/20 everything”
Source: Startup · Published: Mar 16, 2026

"There’s the famous 80/20 rule where you get 80% of the benefit by doing 20% of the work, but you can’t just 80/20 everything. There have to be certain things that you are just the best at."

Marc Andreessen on Mark Zuckerberg’s founder “superpower”
Source: Startup · Published: Mar 13, 2026

“A great superpower that Mark Zuckerberg has that is probably not well-understood enough is he does not get emotionally upset in stressful situations.”

Sam Altman explains how to come up with a great startup idea
Source: Startup · Published: Mar 12, 2026

"If you start a startup without a good idea… you’ll be under pressure to make something up and it won’t work that well."

Jeff Bezos on the problems with proxies and managing to metrics
Source: Startup · Published: Mar 11, 2026

“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”

Airbnb founder Brian Chesky on how to design an amazing user experience
Source: Startup · Published: Mar 10, 2026

“If you can design something really amazing using the hand-crafted part of your brain, then you can reverse-engineer how to industrialize this millions of times over.”

Spencer Rascoff: “I will never invest in a consumer startup with paid marketing”
Source: Startup · Published: Mar 9, 2026

“If you’re actually trying to grow a product, the best levers for doing that are often within the product itself.”

Patrick Collison explains why it sometimes makes sense to quit
Source: Startup · Published: Mar 6, 2026

“One thing I’ve learned myself the hard way, is that it is easier to tear down a company and restart it in Silicon Valley, than it is to constantly try to pivot or keep something alive.”

Jeff Bezos recounts the time he called Amazon’s customer service number mid-meeting to prove a metric was wrong
Source: Startup · Published: Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disagree, the anecdotes are usually right.”

Ben Horowitz: “Nobody was born a great manager. It’s a very unnatural job.”
Source: Startup · Published: Mar 4, 2026

“If you can’t build a great product, it doesn’t matter if you can build a great company.”

03

ALSO TODAY

3 MORE SOURCES
08

SOLIDOT

08.00
SOLIDOT

Solidot News - May 20, 2026

Solidot Feed: Highlighting essential tech & open-source news.

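Firefox to remove its asm.js code

Mozilla has announced that Firefox will remove its asm.js code in a future release. asm.js has long had a successor in WebAssembly, and maintaining both costs time and enlarges the attack surface. asm.js was Mozilla's answer to NaCl and PNaCl: by choosing a strict, static subset of JavaScript, it achieved NaCl/PNaCl-like performance while the code could still run directly in web content. asm.js shipped in 2013 with Firefox 22 and was a major success, proving that code could run on the web at near-native speed using web technologies alone. It paved the way for WebAssembly, which became a W3C standard in 2019. Starting with Firefox 148, the SpiderMonkey JS engine disables asm.js optimizations by default, and a future version will remove the code entirely. Sites that use asm.js will not be affected. The developers recommend that sites wanting to keep shipping asm.js content recompile to WebAssembly, which runs faster and produces smaller binaries.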

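Google Cloud (GCP) accidentally suspends the account of its major customer Railway

In 2024, a GCP misconfiguration led to the complete deletion of data belonging to the Australian pension fund manager UniSuper. Fortunately, UniSuper had a backup with another provider, but the incident still kept it offline for more than a week. On May 19, 2026, GCP suffered a similarly serious incident: its automated systems suspended the production account of a major customer, the PaaS platform Railway.com, taking Railway's services offline. According to the incident report on Railway's official blog, the outage lasted about 8 hours. The suspension occurred at 22:10 UTC on the 19th and cost Railway its GCP-hosted infrastructure, which backed its dashboard, its API, and part of its network infrastructure. Railway immediately contacted its GCP account manager, and the account was restored at 22:29 UTC. Compute instances, disks, and networking then had to be brought back one by one, and the incident was not fully resolved until 07:58 UTC the next day. Railway has announced that it will reduce its dependence on GCP, planning to remove GCP from the hot path and keep it as a backup and failover service.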

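Why hay fever is so severe in Japan

Hay fever is a nationwide health problem in Japan, where an estimated 43% of people experience moderate to severe symptoms. By comparison, the figure is 26% in the UK and 12% to 18% in the US. Every spring, people on city streets across Japan put on masks because of pollen-induced allergic rhinitis. Why is the problem so severe in Japan? The cause has little to do with poor health, pollution, or even the natural environment. It stems from decisions made by Japanese politicians after World War II. During the war, shortages of oil and natural gas forced Japan to turn to its most abundant natural resource, its forests, as a fuel source for homes and industry. Natural forests were felled on a large scale, and the hillsides around cities such as Tokyo, Osaka, and Kobe were stripped bare. After the war, because bare mountains are prone to landslides and floods, the government decided on large-scale reforestation. It chose two fast-growing species: Japanese cedar (sugi) and Japanese cypress (hinoki). Today these cedar and cypress plantations cover one fifth of the country's land area. The problem is that cedar and cypress produce large amounts of lightweight pollen once they mature at around 30 years, and almost all of the plantation forests are now older than that. To ease the allergies, the Japanese government now plans to cut down one fifth of the cedar forests and replace them with other species.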

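Fedora removes the Deepin desktop environment packages

Following openSUSE, the Fedora distribution has removed its Deepin Desktop packages. In early 2025, during a routine review, the SUSE security team found that the Deepin desktop environment included a package named deepin-feature-enable. The package had been added in April 2021 without consulting or notifying SUSE. It contained a "license agreement dialog" that essentially said that, because of openSUSE's security rules, all the dbus and polkit functionality needed by deepin-api and deepin-daemon had been disabled, which could prevent Deepin Desktop from working properly and leave some features unusable. Users who did not mind these security issues could click to confirm, after which the missing dbus and polkit components would be installed automatically. The security team's investigation found that core components of deepin-daemon had never been submitted for security review and had been quietly introduced into openSUSE. Given the Deepin community's repeated violations over the past several years, openSUSE decided to remove Deepin Desktop. The Fedora project subsequently ran its own security review of the Deepin packages, during which developers found it hard to reach some of the Deepin package maintainers. Citing security concerns and the lack of maintenance, Fedora ultimately decided to remove the Deepin desktop environment.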

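OpenAI, Nvidia, and others add SynthID watermark support to their models

Google introduced SynthID, a digital watermarking technology for labeling AI images, three years ago, and says it has so far been used to mark 100 billion images and videos. Last year Google added SynthID detection to the Gemini app: users upload suspicious content and ask the chatbot whether it was AI-generated. Google says no one has yet succeeded in breaking SynthID, and it has announced partnerships with several AI companies to add support for the watermark. Nvidia's Cosmos, OpenAI's GPT 2 images, Kakao, and ElevenLabs will all add SynthID support to their AI-generated content.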

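Global vaccination rates are falling

Global vaccination rates are declining. After the COVID-19 pandemic threw healthcare systems into disarray, vaccination rates have still not recovered to their earlier levels. In 2024, measles outbreaks spread to 59 countries. The measles virus is extremely contagious: if an infected person is in the same space, people without immunity have an almost 100% chance of being infected. Complications include pneumonia and otitis media, and the disease can even lead to encephalitis and become severe. Preventing measles depends on vaccination. To maintain herd immunity and stop outbreaks from spreading, vaccination coverage needs to reach 95% or higher. During the pandemic, travel restrictions led many people to postpone other vaccinations, and medical institutions focused their vaccination and treatment staff on COVID-19. In addition, because other infectious diseases were suppressed during that period, more and more people came to believe vaccination was unnecessary, and global vaccination rates kept falling. Other infectious diseases show a similar trend. In 2024, coverage of the combined diphtheria, pertussis, and tetanus vaccine was below its post-2010 peak in every region of the world.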

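The most efficient route between the Earth and the Moon

Scientists have developed a mathematical method that computes the most economical travel route between celestial orbits more precisely. For the Earth-Moon case, the new route requires 58.80 m/s less velocity change than the most fuel-efficient route previously known. Against the journey's estimated total cost of 3342.96 m/s, the difference looks small, but it has a large effect on mission cost. The team notes that in space travel every 1 m/s of velocity change implies substantial fuel consumption. Based on this result, the team plotted a spacecraft trajectory from Earth orbit to lunar orbit and divided it into two stages. First, the spacecraft leaves Earth orbit and enters an orbit around the L1 Lagrange point. L1 lies between the Earth and the Moon, where the gravitational pulls of the two bodies exactly cancel. With the help of a control system, the spacecraft can remain in this intermediate orbit indefinitely until the mission is ready, and then carry out the second stage, entering lunar orbit.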
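To put the 58.80 m/s figure in perspective, the short sketch below computes its share of the quoted 3342.96 m/s total and uses the Tsiolkovsky rocket equation to estimate how much extra launch mass a 58.80 m/s increment costs. The specific impulse of 320 s is an assumed value for a generic chemical engine, not a number from the article, and the variable names are illustrative.

```python
import math

G0 = 9.80665              # standard gravity, m/s^2
SAVING = 58.80            # m/s, from the article
TOTAL = 3342.96           # m/s, estimated total cost from the article
ASSUMED_ISP = 320.0       # s, hypothetical chemical engine (assumption)

# Share of the total velocity budget represented by the saving.
share = SAVING / TOTAL

# Rocket equation: m0 / mf = exp(delta_v / v_e), with v_e = Isp * g0.
# An extra delta-v increment multiplies the required initial mass by
# exp(increment / v_e) for the same final (dry plus payload) mass.
exhaust_velocity = ASSUMED_ISP * G0
extra_mass_factor = math.exp(SAVING / exhaust_velocity) - 1.0

print(f"share of total delta-v: {share:.2%}")
print(f"extra initial mass for +{SAVING} m/s: {extra_mass_factor:.2%}")
```

The saving is under 2% of the velocity budget, yet because required mass grows exponentially with delta-v, it translates into a comparable percentage of the whole spacecraft's initial mass under the assumed engine.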

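GitHub confirms hackers stole its internal repositories

GitHub has confirmed through its official account on X that hackers stole internal repositories, and says it is investigating. Earlier, the hacking group TeamPCP claimed on the Breached forum that it had gained access to GitHub's internal source code and internal organizations and had stolen about 3,800 repositories. It set an asking price of 50,000 US dollars for anyone who wants access to the source code. TeamPCP insists this is not extortion: if someone offers at least 50,000 dollars, the group says it will destroy the data after being paid, and if there is no buyer it will release the data for free. GitHub says its investigation shows that an employee's computer was compromised and that the source was a malicious VS Code extension the employee had installed. The company has removed the extension and isolated the device, and is continuing to investigate. GitHub says there is currently no evidence that customer data was affected.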

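Kickstarter reverses its sweeping ban on adult content

Last week the crowdfunding platform Kickstarter changed its rules and widened the range of prohibited adult content. Previously it banned only "pornographic content". The updated rules significantly broadened the definition, including but not limited to: implied sexual acts, MILF/DILF content, suggestive nudity, and any content showing female nipples or areolae, genitals, or anuses. After the change sparked controversy, Kickstarter confirmed that it had revised the rules under pressure from its payment processor Stripe, which is in turn constrained by the wider financial system. Over the past few months, many projects raising money on Kickstarter had their fundraising accounts suspended by Stripe, so Kickstarter changed its rules to satisfy Stripe's restrictions on adult content. The move drew criticism from the community, and Kickstarter has now decided to withdraw the new rules and return to the old ones, while adding links to Stripe's relevant policies.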

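Google announces changes to the search box

At the Google I/O developer conference on Tuesday, Google announced a redesign of its iconic 25-year-old search box, turning it into an AI-driven "intelligent search box" that is essentially a chatbot dialog. Its function shifts from running a search to asking Google (Ask Google). Google claims that since it integrated AI Mode into search, monthly active users have passed 1 billion and search volume has hit a record high, so it is now preparing to go further and make AI Mode the default for search. Like an AI chatbot, the intelligent search box can take text, images, files, videos, or Chrome tabs as input. Google will also offer agentic digital assistants that search automatically on the user's behalf: someone looking for an apartment can be notified of new listings without opening sites such as Zillow. The move has again drawn widespread criticism. Critics argue that AI features built on large models do not treat accuracy as a core goal, so search quality will decline further and the line between ads and search results will blur even more.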

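Samsung Electronics labor talks collapse; 18-day general strike to begin on the 21st

On the 20th, Samsung Electronics management and labor held a third round of post-mediation talks on issues including the cap on bonus payments, but the two sides failed to reach an agreement and the talks broke down. The union said it accepted the compromise proposed by the National Labor Relations Commission under the Ministry of Employment and Labor, but Samsung Electronics refused to accept it. The company only repeated that it "had not yet made a decision" and did not state a position. The union will launch the general strike tomorrow as scheduled and says it will keep working toward an agreement with management during the strike. The strike is expected to cost Samsung Electronics as much as 2 billion US dollars a day. The South Korean presidential office expressed regret over the breakdown. The government is considering invoking its "emergency mediation" power to restrict the strike and will support a new round of mediation between the two sides.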

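Bug bounty programs are being flooded with AI reports

Companies use bug bounty programs to pay white-hat hackers for finding bugs, but these programs are now being flooded with low-quality AI-generated reports, forcing some companies to shut them down. Bugcrowd, whose customers include OpenAI, T-Mobile, and Motorola, says the number of reports it received more than quadrupled over three weeks in March, and most of them turned out to be wrong. The curl project suspended its bug bounty program in January. Ross McKerchar, chief information security officer at the security firm Sophos, says low-quality AI reports are quickly becoming a major problem, and that bug bounties will continue to exist but must change. Nextcloud suspended its bug bounty in April. The bug bounty platform HackerOne has also started using AI agents to triage submitted reports, and its CEO Kara Sprague says high-quality AI reports have recently increased slightly as well.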

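pgBackRest author announces he will keep maintaining the project

At the end of last month, David Steele, maintainer of the PostgreSQL backup and restore project pgBackRest, announced that the project would be archived and no longer maintained. pgBackRest is widely regarded as one of the most popular operations tools in the PostgreSQL ecosystem. Steele explained that pgBackRest had been his passion project for the past 13 years and that, fortunately, he had corporate backing for most of that time. His long-term sponsor was Crunchy Data, but the company was acquired by Snowflake, and the new owner had no interest in funding his continued work. He spent the past few months looking for a position that would let him continue, without success, and the sponsorship he received fell far short of what was needed to keep the project running, so he had no choice but to announce the end of maintenance. A few weeks after that statement, he posted an update saying he will continue developing pgBackRest: a coalition of sponsors has agreed to fund the project on an ongoing basis, giving pgBackRest the long-term stability its development needs. He expressed his thanks for the support.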

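Sony cancels plans to port PlayStation-exclusive single-player games to PC

Hermen Hulst, the executive in charge of Sony's PlayStation studios business, confirmed earlier rumors on Monday: Sony is cancelling plans to port PlayStation-exclusive single-player games to PC. Over the past few years Sony brought formerly exclusive single-player titles such as the God of War series, the Spider-Man series, Ghost of Tsushima, The Last of Us series, and the Horizon Zero Dawn series to PC, but the pace of ports has slowed recently, prompting rumors that Sony was changing its porting strategy. Hulst announced the strategic shift at an all-hands meeting on Monday. Sony is reportedly worried about diluting the PlayStation brand. The decision means that Sony's recent single-player games Ghost of Yotei and Saros will not come to PC. The change applies to single-player games from first-party studios. Multiplayer games and single-player games from third-party studios will still be released on PC.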

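Why humans are right-handed

Most humans are right-handed, and left-handers make up about one in ten. Why does this bias exist? Researchers analyzed data on 2,025 monkeys and apes from 41 primate species, examining one by one the influence of tool use, diet, habitat, body size, social structure, brain volume, mode of locomotion, and other factors. Human handedness differs markedly from that of other primates. But when the researchers added two key traits to their model, the picture changed. The two traits are brain size and the ratio of arm length to leg length, which is often used as an indicator of bipedal walking ability. Once these factors were included, humans no longer appeared to be an evolutionary exception. The results suggest that the combined effect of upright walking and increased brain volume may be the core reason humans developed a strong right-hand preference. The researchers believe right-handedness evolved in two stages. First, upright walking freed the hands from locomotion and favored more specialized, asymmetric hand use. Second, as the human brain grew larger and more complex, the preference for the right hand became stronger and more widespread.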

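Firefox 151 released

Mozilla has released Firefox 151. Major new features include updated built-in VPN support, improved private browsing, the ability to merge multiple PDF files directly in the Firefox PDF viewer, cross-platform restore of local profile backups on Linux and macOS, and the Document Picture-in-Picture API, which offers a richer experience than the current video Picture-in-Picture API. The native JPEG-XL image decoder has been pushed back to the next release.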

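A handful of lakes hold two thirds of lake freshwater

According to a study published in the journal National Science Review, a Chinese Academy of Sciences team compiled high-precision measured bathymetry and depth data for 588 lakes. The study found that the depth of China's lakes is shaped by terrain and landforms. In the high-altitude endorheic basins of the west, tectonic rifting and glacial erosion have produced deep lakes, while on the eastern plains long-term sediment deposition has produced shallow, saucer-shaped lakes. Total lake water storage nationwide is about 1,081 to 1,285 cubic kilometers, of which about 335 cubic kilometers is freshwater and about 839 cubic kilometers is saline. Roughly 65% of lake freshwater is stored in a small number of deep, open lakes in western endorheic basin regions such as the Qinghai-Tibet Plateau. Academic attention to China's freshwater lakes has mostly focused on the eastern plains and the Yunnan-Guizhou Plateau, but the study found that the Qinghai-Tibet Plateau not only has very large, deep freshwater lakes such as Taro Co, Mapam Yumco, and Wuru Co, but that the total freshwater stored in its natural lakes also exceeds that of the eastern plain lake region. Per capita storage is about 20,680 cubic meters in the Qinghai-Tibet Plateau lake region and only 65 cubic meters in the eastern plain lake region, a difference of nearly 330-fold.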

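Microsoft releases its first general-purpose Linux distribution, Azure Linux 4.0

Brendan Burns, Kubernetes co-founder and Microsoft vice president, made a surprise announcement of a general-purpose Linux distribution at the Open Source Summit North America. Microsoft has previously released Linux applications, Azure Sphere for edge computing devices, and the Linux container platform CBL-Mariner, later renamed Azure Linux, but it had never released a general-purpose distribution. Lachlan Evenson, principal program manager on Microsoft's Azure open source team, said that with Azure Linux 4.0 Microsoft is working to turn Azure Linux into a full-featured, general-purpose cloud distribution. Azure Linux 4.0 is based on the Fedora Linux distribution, has been published on GitHub, uses Fedora's RPM package management system, and is deeply integrated with the Azure cloud platform. Developers can run Azure Linux 4.0 on Windows 11 through WSL, but without a GUI. Microsoft has committed to releasing monthly patches for Azure Linux, and to shipping fixes promptly when important vulnerabilities appear.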

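Meta reassigns 7,000 employees to focus on AI

Meta told employees on Monday that it will reassign 7,000 of them to focus on AI. In an internal memo, Meta HR chief Janelle Gale said the employees will move to four new departments dedicated to building new AI tools and applications. The new departments use an "AI-native design architecture" and will have fewer managers per employee than other departments. As of the end of 2025, Meta had more than 78,000 employees, and it recently announced that it will lay off 8,000. Meta CEO Mark Zuckerberg is betting the company's future on AI. Earlier this year he said he planned to invest 115 billion to 135 billion US dollars within the year, most of it on developing new AI technology.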

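Jury rules against Musk on statute-of-limitations grounds

After a three-week trial in the closely watched case of Elon Musk v. OpenAI, the jury ruled against Musk on Monday on the grounds that the three-year statute of limitations had expired. OpenAI was founded in 2015 by Sam Altman, Greg Brockman, Musk, and others, initially as a nonprofit. Musk left the board in 2018, and the following year OpenAI set up a for-profit company, which received a 1 billion US dollar investment from Microsoft that same year. Musk sued OpenAI and its co-founders Altman and Brockman in San Francisco Superior Court in 2024, accusing them of violating the company's founding principles by putting commercial interests ahead of the public interest. The jury found that he waited too long to sue and failed to raise his claim that OpenAI had departed from its nonprofit mission in time.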

09

APP STORE RANK

09.00
APP STORE RANK