TEXT VIEW · TODAY'S DIGEST · 36 HEADLINES ACROSS 8 SOURCES

ISSUE 0883
MON, JUN 1, 2026
Discover the best information organized by OrangeBot.AI

The web,
read by a bot.

Ten sources — Hacker News, Product Hunt, HuggingFace, Techmeme and more — filtered, tagged, and summarized every morning for builders who don’t have time to scroll.

01

AI DIGEST

UPDATED DAILY · EDITOR'S PICK

AI News Digest

June 1, 2026

Mideast Tensions Drive Oil Prices Higher and Stocks Lower

Reports that Iran has halted mediated ceasefire talks with the U.S. following Israeli attacks in Lebanon have pushed crude oil prices higher. The increased geopolitical risk has caused a decline in major stock indexes like the Dow Jones, while also affecting gold prices and bond yields as investors react to the uncertainty.

AI Enthusiasm Boosts Tech Stocks and Spurs New Product Development

Strong investor demand for artificial intelligence stocks continues to drive market gains, with one tech giant reportedly becoming the world's most valuable company. A leading chipmaker announced collaborations with Dell, HP, and Lenovo to produce a new line of AI-powered laptops, while Europe is reportedly in talks to adopt a U.S. AI model.

Europe Grapples with Chinese Subsidies, UK Housing Cools

A new report indicates that Chinese companies receive significantly more state support than their European rivals, increasing concerns about fair competition. Meanwhile, in the UK, the housing market showed its first monthly decline this year, attributed to higher mortgage rates. On the corporate front, U.S. private credit firm Castlelake is reportedly considering an acquisition of a European low-cost airline.

U.S. Stock Market Rally Faces Questions of Sustainability

The S&P 500 has completed one of its strongest two-month periods on record, leading to optimism among some analysts. However, amidst this strong performance, there are underlying concerns about geopolitical risks and whether the current market momentum can be maintained.

02

ON THE WIRE

6 SOURCES
02

HACKER NEWS


Hacker News - June 1, 2026

Hacker News Feed: Highlighting key posts and discussions.

Nvidia RTX Spark

(www.nvidia.com)

9482
Shift from a leader-follower to a leader-leader approach

(www.practicalengineering.management)

7448
Chuwi Minibook X

(tylercipriani.com)

336259
Restartable Sequences

(justine.lol)

24360
Dav2d

(jbkempf.com)

518191
London's Free Roof Terraces

(diamondgeezer.blogspot.com)

319152
The Website Specification

(specification.website)

526210
Racket v9.2

(blog.racket-lang.org)

23133
03

HUGGINGFACE


HuggingFace - June 1, 2026

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

64
GrepSeek: Training Search Agents for Direct Corpus Interaction

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level F1 and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.

63
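
The sharded-parallel execution idea in the GrepSeek abstract can be illustrated with a small sketch. It assumes a per-line filter (a grep-like substring match), for which filtering contiguous shards in parallel and concatenating the results in shard order reproduces the sequential output byte for byte. The function names `grep_lines` and `sharded_grep` are hypothetical, and this is not the paper's engine; commands that are not line-local (sorting, counting, deduplication) would need a merge step that is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_lines(lines, needle):
    """Sequential baseline: keep the lines that contain `needle`, in order."""
    return [line for line in lines if needle in line]


def sharded_grep(lines, needle, num_shards=4):
    """Filter contiguous shards in parallel, then concatenate in shard order.

    Hypothetical helper: valid only because the filter looks at one line at a time.
    """
    if not lines:
        return []
    size = -(-len(lines) // num_shards)  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # pool.map returns results in input order, which preserves line order.
        parts = list(pool.map(lambda shard: grep_lines(shard, needle), shards))
    return [line for part in parts for line in part]


# Toy corpus of byte lines; every third document mentions "alpha".
corpus = [f"doc {i}: {'alpha' if i % 3 == 0 else 'beta'}\n".encode() for i in range(20)]
sequential = b"".join(grep_lines(corpus, b"alpha"))
parallel = b"".join(sharded_grep(corpus, b"alpha", num_shards=3))
```

Comparing `sequential` and `parallel` as byte strings is the toy analogue of the byte-exact equivalence the abstract mentions.
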
Representation Forcing for Bottleneck-Free Unified Multimodal Models

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.

39
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.

35
Trust-Region Behavior Blending for On-Policy Distillation

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.

32
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL.

31
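
The positive-only rubric reward described in the LongTraceRL abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's scoring rule: the function name `rubric_reward`, the base outcome reward of 1, the additive combination, and the case-insensitive substring matching of entities are all hypothetical choices made to keep the example small.

```python
def rubric_reward(response, final_answer, gold_answer, gold_entities):
    """Positive-only rubric reward (illustrative, assumed form).

    Incorrect final answers receive 0, so the rubric term cannot reward
    entity mentions in a wrong response. Correct answers receive 1 plus
    the fraction of gold chain entities that the response mentions.
    """
    if final_answer.strip().lower() != gold_answer.strip().lower():
        return 0.0
    if not gold_entities:
        return 1.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return 1.0 + hits / len(gold_entities)


# Two correct responses are separated by how much of the chain they cover.
chain = ["Marie Curie", "Warsaw"]
full = rubric_reward("Marie Curie was born in Warsaw, the capital of Poland.", "Poland", "poland", chain)
partial = rubric_reward("Marie Curie was born in Poland.", "Poland", "poland", chain)
wrong = rubric_reward("Marie Curie was born in Warsaw.", "France", "poland", chain)
```

Gating the rubric term on correctness is what lets it rank correct responses by reasoning coverage without paying out for entity mentions attached to a wrong answer.
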
GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.

30
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

25
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.

25
Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, failing to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.

24
Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.

24
Mellum2 Technical Report

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.

23
Task-Focused Memorization for Multimodal Agents

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

23
Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.

18
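
The distinction between learnable and incompatible disagreement in the token-teachability abstract can be made concrete with a toy example. The sketch below uses the teacher's probability mass on the student's top-K tokens as a stand-in compatibility score; this proxy, the function names, and the toy distributions are assumptions for illustration and are not the paper's definition of token teachability.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def teacher_mass_on_student_topk(student, teacher, k):
    """Share of teacher probability that falls on the student's top-k tokens.

    Hypothetical proxy for compatibility: high when the teacher mostly
    reweights candidates the student already considers.
    """
    topk = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in topk)


student = [0.50, 0.30, 0.15, 0.04, 0.01]
# Reorders the student's own top candidates (learnable disagreement).
learnable_teacher = [0.20, 0.70, 0.08, 0.01, 0.01]
# Puts most mass on a token the student nearly excludes (incompatible disagreement).
incompatible_teacher = [0.02, 0.02, 0.01, 0.05, 0.90]

kl_learnable = kl_divergence(student, learnable_teacher)        # reverse KL, student vs teacher
kl_incompatible = kl_divergence(student, incompatible_teacher)
mass_learnable = teacher_mass_on_student_topk(student, learnable_teacher, k=2)
mass_incompatible = teacher_mass_on_student_topk(student, incompatible_teacher, k=2)
```

In this toy case the incompatible position has the larger KL, so a selector driven by raw disagreement would favor it, while the top-K mass score favors the position where the teacher corrects within the student's current support.
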
Exploring Autonomous Agentic Data Engineering for Model Specialization

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.

17
PEEK: Picking Essential frames via Efficient Knowledge distillation

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.

15
dMoE: dLLMs with Learnable Block Experts

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14× to 1.66× end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE

15
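
The mismatch described in the dMoE abstract, and the effect of routing a whole block through one shared expert set, can be seen in a toy example. The sketch assumes a plain mean over token-level router distributions followed by a shared top-k; the paper's title refers to learnable block experts, so its actual aggregation likely differs, and the function names and probabilities here are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def token_level_experts(router_probs, k):
    """Conventional MoE routing: each token picks its own top-k experts."""
    return {expert for probs in router_probs for expert in top_k(probs, k)}


def block_level_experts(router_probs, k):
    """Block-level routing sketch (assumed mean aggregation).

    Average the token distributions into one block distribution, then
    pick a single shared top-k expert set for every token in the block.
    """
    num_experts = len(router_probs[0])
    mean = [sum(p[e] for p in router_probs) / len(router_probs) for e in range(num_experts)]
    return set(top_k(mean, k))


# Router probabilities for a block of 4 tokens over 6 experts (toy values).
block = [
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.40, 0.30, 0.10, 0.05, 0.05],
    [0.05, 0.10, 0.10, 0.40, 0.30, 0.05],
    [0.35, 0.05, 0.05, 0.10, 0.05, 0.40],
]
per_token = token_level_experts(block, k=2)  # union of per-token choices
per_block = block_level_experts(block, k=2)  # one shared choice for the block
```

With independent per-token routing the block touches all six experts, while the shared block-level choice touches two, which is the kind of reduction in uniquely activated experts the abstract reports at scale.
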
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. This lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code is anonymously released at https://github.com/XMUDeepLIT/SAAS.

14
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

14
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.

13
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.

13
VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. We argue that VLMs are native 3D learners. Our in-depth, large-scale study shows that 1) focal length unification, 2) text-based pixel reference, and 3) data mixture and scaling are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

10
DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.

7
Linear Scaling Video VLMs for Long Video Understanding

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
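To make the linear-scaling claim concrete, here is a minimal sketch of a fixed-capacity, importance-based state. It is not the StateKV implementation: the `(importance, token_id)` entries, the top-k retention rule, and the pair-counting cost model are all assumptions made for illustration, since the abstract does not say how importance is scored.

```python
import heapq

def update_state(state, frame_tokens, capacity):
    """Merge the carried state with a new frame and keep the top `capacity` entries.

    Each entry is (importance, token_id). The importance score is a
    hypothetical input; the abstract does not define how it is computed.
    """
    return heapq.nlargest(capacity, state + frame_tokens, key=lambda entry: entry[0])

def prefill_pairs(num_frames, tokens_per_frame, capacity):
    """Query-key pairs when each frame attends to itself plus the bounded state."""
    return num_frames * tokens_per_frame * (tokens_per_frame + capacity)

def full_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs for full (non-causal) self-attention over all frames."""
    total = num_frames * tokens_per_frame
    return total * total
```

Under this toy cost model, doubling the number of frames doubles `prefill_pairs` but quadruples `full_attention_pairs`, which is the linear versus quadratic gap the abstract describes.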

5
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, π_0, and π_{0.5}. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.
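The abstract says only that detection is evaluated under conformal prediction. As a rough illustration of how a runtime monitor could use it, the sketch below applies the standard split-conformal quantile to per-trajectory scores. Calibrating on the maximum step score of successful trajectories, and the function names, are assumptions and not details from the paper.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold at miscoverage level alpha.

    calibration_scores: one score per held-out successful trajectory
    (assumed here to be the maximum per-step failure score).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha: never raise an alarm.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def first_alarm(step_scores, threshold):
    """Return the first timestep whose score exceeds the threshold, else None."""
    for t, score in enumerate(step_scores):
        if score > threshold:
            return t
    return None
```

A smaller `alpha` raises the threshold, which lowers false alarms on successful runs but delays detection. This is one way to read the accuracy-timeliness trade-off mentioned above.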

5
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

4
How can embedding models bind concepts?

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.

3
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, tracking parts, and understanding spatial interactions such as physical contact.

3
The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. The problem worsens dramatically as the policy evolves beyond what the static RM was trained on. Therefore, we propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that grades on-policy responses as feedback by using the value function for on-policy RM training. SAVE naturally converts the reward-graded on-policy responses into supervision with a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. The effectiveness of SAVE for enhancing RM training is validated through empirical evaluation across six diverse benchmarks. It achieves superior results across all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.

3
FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder

Media compression standards have reached a plateau in terms of the rate-distortion-complexity trade-off, limiting the ability to offload expensive AI perception to the cloud in applications like robotics, wearables, and remote sensing. DNN-based codecs improve compression efficiency, but at a cost: they cannot easily adapt to large changes in available bitrate, and real-time encoding requires expensive, power-hungry GPUs that prohibit use on low-cost or resource-constrained platforms. To address these limitations, we propose a novel autoencoding framework (FRAPPE) that uses the Full input to predict the Residual output via a Projection Pursuit Encoder. FRAPPE's encoding objective naturally sorts latent channels by importance, allowing zero-overhead variable-rate coding. Unlike RNN-based learned codecs, whose encoder consumes the previous reconstruction's residual, or RVQ-style codecs, whose codebooks must be applied sequentially, FRAPPE's analysis path is an embarrassingly parallel DAG of independent input projections. Using FRAPPE, we build a variable-rate RGB image codec (FRAPPE-Image), and evaluate its rate-distortion-complexity trade-off against standard image codecs. At high compression ratios (approx. 0.1 bpp) FRAPPE-Image provides higher perceptual quality than AVIF with 47 times faster encoding, making it capable of real-time 1080p, 30fps CPU-only encoding. Our code and pre-trained models are available at https://github.com/UT-SysML/FRAPPE.

3
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VisualThink-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VisualThink-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VisualThink-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions, for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VisualThink-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8x speedup.

2
When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.

2
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through training-inference mismatch (TIM) by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.

2
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.

2
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols, such as embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in the ideation of these languages, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.

2
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning can effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
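The link between KL-regularized RL and weighted SFT can be sketched in a few lines. The optimum of a KL-regularized objective reweights the reference policy by exp(R / beta), so trajectories sampled from the reference policy can be weighted accordingly. The self-normalization, the temperature name `beta`, and both function names are assumptions for illustration; the abstract does not give DRIFT's exact weighting.

```python
import math

def importance_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(R_i / beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the max return before exponentiating for numerical stability.
    max_return = max(returns)
    unnormalized = [math.exp((r - max_return) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def weighted_sft_loss(trajectory_nlls, weights):
    """Weighted sum of per-trajectory negative log-likelihoods."""
    if len(trajectory_nlls) != len(weights):
        raise ValueError("need one weight per trajectory")
    return sum(w * nll for w, nll in zip(weights, trajectory_nlls))
```

A large `beta` pushes the weights toward uniform, which recovers plain SFT on the reference samples. A small `beta` concentrates the weight on the highest-return trajectories.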

2
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

2
Count Anything

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.

2
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
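For readers unfamiliar with the object, the following sketch computes a Markov boundary from a known DAG. It relies on the standard result that, under the usual graphical assumptions, the boundary of a node consists of its parents, its children, and the other parents of its children. The dictionary-of-parents representation and the function name are illustrative choices, not taken from the paper.

```python
def markov_boundary(parents, target):
    """Return parents, children, and co-parents of `target` in a DAG.

    parents: dict mapping every node to the set of its direct parents.
    """
    children = {node for node, ps in parents.items() if target in ps}
    co_parents = set()
    for child in children:
        co_parents |= set(parents[child])
    boundary = set(parents.get(target, ())) | children | co_parents
    boundary.discard(target)  # the target is a parent of its own children
    return boundary

# Toy graph: a -> y, b -> y, y -> c, s -> c, a -> z
dag = {
    "a": set(), "b": set(), "s": set(),
    "y": {"a", "b"},
    "c": {"y", "s"},
    "z": {"a"},
}
oracle_columns = markov_boundary(dag, "y")
```

In this toy graph the oracle mask for `y` keeps `a`, `b`, `c`, and `s` and drops `z`, which carries no extra information about `y` once `a` is observed.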

2
MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. at most 2% for others) and gradient dilution over 40.1-token answer spans. We present MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.

2
Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

2
One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance O(N) to per-type O(T), where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/
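As a rough illustration of step (2) in the abstract above, the sketch below picks the next prompt by farthest-point selection. It is not the authors' code. The function name `next_prompt`, the 2D pixel coordinates, and the choice to measure distance to the nearest already-used prompt are assumptions made for this example; the gating step that produces the reliable points is taken as given.

```python
import math

def next_prompt(reliable_points, prompts):
    """Return the reliable point farthest from every prompt used so far.

    reliable_points: list of (x, y) locations judged to be the same cell type
                     (hypothetical output of the gating step).
    prompts:         list of (x, y) locations already used as prompts.
    """
    def distance_to_nearest_prompt(point):
        # A point close to any existing prompt adds little new coverage.
        return min(math.dist(point, p) for p in prompts)

    return max(reliable_points, key=distance_to_nearest_prompt)

# Start from a single user click and expand the chain by one prompt.
prompts = [(0, 0)]
candidates = [(1, 1), (5, 5), (2, 0)]
prompts.append(next_prompt(candidates, prompts))
```

Maximizing the distance to the nearest existing prompt is one common way to spread points over a region; the paper may define "most spatially distant" differently.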

1
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

1
Benchmarking Composed Image Retrieval for Applied Earth Observation

Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image with a textual modifier. Although RSCIR offers a flexible interface for expressing targeted retrieval intent, the transferability of modern composition methods to Earth observation (EO) imagery and their relevance to operational EO workflows remain underexplored. We address this gap through a unified benchmark and an application-oriented study. First, we systematically adapt and evaluate representative composed image retrieval methods with six vision-language backbones on PatternCom under a standardized protocol, analyzing their behavior across backbones, composition strategies, and query types. Second, we introduce xView2-CIR, a change-centric dataset for disaster and damage monitoring, where retrieval is conditioned on scene identity and a target post-event state. Our results show that training-free composition methods provide strong and scalable baselines for EO retrieval, while change-centric retrieval presents different challenges from attribute-based retrieval, particularly due to the need to preserve scene identity. Overall, this study establishes a practical benchmark for RSCIR and positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis. The dataset and code are available at https://github.com/billpsomas/rscir.

1
Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

1
Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
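The "analytic memory floor" in the abstract above can be approximated with a back-of-the-envelope calculation: bytes streamed per decode step divided by peak memory bandwidth. The sketch below is an assumption-laden illustration, not the paper's method. The parameter count (about 7.6 billion for Qwen-2.5-7B), the L4 peak bandwidth (300 GB/s from the public spec sheet), ignoring KV-cache traffic, and treating the 62.32 ms bf16 baseline as the same ctx=2048 cell are all assumptions introduced here.

```python
def memory_floor_ms(n_params, bytes_per_param, kv_cache_bytes, peak_bw_bytes_per_s):
    """Lower bound on one decode step if streaming weights + KV cache were the only cost."""
    bytes_per_step = n_params * bytes_per_param + kv_cache_bytes
    return bytes_per_step / peak_bw_bytes_per_s * 1e3

def fraction_of_floor(floor_ms, measured_ms):
    """How close a measured step latency is to the bandwidth-only bound (1.0 = at the floor)."""
    return floor_ms / measured_ms

# Assumed illustrative inputs (not taken from the paper's tables).
N_PARAMS = 7.6e9          # assumed parameter count for a 7B-class model
BF16_BYTES = 2            # bytes per bf16 weight
L4_PEAK_BW = 300e9        # assumed L4 peak memory bandwidth, bytes/s
L4_MEASURED_MS = 62.32    # bf16 baseline quoted in the abstract

l4_floor_ms = memory_floor_ms(N_PARAMS, BF16_BYTES, 0, L4_PEAK_BW)
l4_fraction = fraction_of_floor(l4_floor_ms, L4_MEASURED_MS)
print(f"L4 floor: {l4_floor_ms:.1f} ms, fraction of floor: {l4_fraction:.2f}")
```

Under these assumptions the L4 figure lands near the "roughly 81 percent" quoted in the abstract, which suggests the simple weights-over-bandwidth model is a reasonable first approximation on slower GPUs.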

0
A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking

The integration of Unmanned Aerial Vehicles (UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi-UAV Fields of View (FOV). While state-of-the-art frameworks excel in optimizing local trajectory extraction and stability for single-drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network-level analysis such as Origin-Destination estimation. This paper presents a real-time Multi-Camera Multi-Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance-based Re-Identification (Re-ID) in nadir views, we introduce a lightweight Topology-Based Spatiotemporal Handover mechanism. We implement a high-throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue-based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results on complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re-ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system.
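The queue-based handover idea in the abstract above can be sketched with one FIFO queue per virtual lane. This is a minimal illustration under stated assumptions, not the linked repository's implementation. The class name `LaneHandover`, the lane labels, and the rule of minting a fresh ID when a queue is empty are all hypothetical.

```python
from collections import defaultdict, deque

class LaneHandover:
    """Hypothetical per-lane FIFO matcher between two overlapping camera views."""

    def __init__(self):
        self._queues = defaultdict(deque)  # lane label -> IDs waiting to re-appear
        self._new_ids = 0

    def on_exit(self, lane, global_id):
        # A vehicle leaves the upstream view in this lane: remember its identity.
        self._queues[lane].append(global_id)

    def on_enter(self, lane):
        # A vehicle appears in the downstream view: within a lane, vehicles are
        # assumed to keep their order, so the oldest waiting ID is the match.
        queue = self._queues[lane]
        if queue:
            return queue.popleft()
        # Nothing waiting: treat it as a vehicle not seen before.
        self._new_ids += 1
        return f"new-{self._new_ids}"

handover = LaneHandover()
handover.on_exit("lane-1", "veh-7")
handover.on_exit("lane-1", "veh-9")
first_match = handover.on_enter("lane-1")
```

The design rests on the assumption that vehicles do not overtake within a virtual lane between views; the lane discretization in the paper presumably exists to make that assumption hold.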

0
Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

While previous research in multivariate time series forecasting has focused on developing complex holistic models, this work advocates for a shift toward a granular, component-level understanding of their impacts. We propose TSCOMP, the first large-scale benchmark that systematically deconstructs deep forecasting methods into their core, fine-grained components--spanning series preprocessing, encoding strategies, network architectures including specific and large time-series models, and optimization methods. Using constrained orthogonal experimental design and extensive evaluations, we conduct multi-view analyses that reveal component effectiveness across different backbones, data characteristics, and their interactions. Beyond providing insights, this benchmark establishes a fine-grained performance corpus comprising over 20,000 model-dataset evaluations, which supports the learning of automated component selection, enabling zero-shot model construction on new datasets. Our experiments demonstrate that the corpus-driven approach, despite its simplicity, consistently outperforms state-of-the-art methods, validating the soundness of our evaluation design and confirming that systematic component selection surpasses manually designed complex architectures. All code and the performance corpus are publicly available at https://github.com/SUFE-AILAB/TSCOMP.

0
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

0
Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency-guidance-operator/

0
05

PRODUCT HUNT

Product Hunt - June 1, 2026

Product Hunt Daily Feed: Featuring noteworthy tech launches.

Paint By JSON | Figma API Client

Real API data in your mockups made as easy as lorem ipsum.

0
Presentify

Take your presentation skills to the next level

0
Sentinel

Control your robots from anywhere in the world

0
Open Caffeine

Keep your Mac awake

0
Tabstack Web Research

Run a research agent with cited answers in a single API call

0
Tokenwise

A smart LLM proxy that shows where you're overpaying

0
Emily by Co-Desk

Voice AI copilot for coworking & coliving operators

0
SocialEcho 2.0

AI social media copilot for teams and agents

0
Typeahead

AI autocomplete for every app on your Mac

0
Databox MCP

Chat with your business data inside Claude, ChatGPT and more

0
Dune Keypad

Context-aware Mac keypad, w/ Claude + community extensions

0
Trippple Club

Advertise together on Meta Ads and pay 3x less

0
Stella

Local natural language search across all your files

0
Skylive

Never miss a celestial event, anywhere on Earth

0
Joanium

Local AI workspace to build and work with your computer

0
folk

the AI in your texts that gets stuff done

0
Mistral Vibe

AI agent for long-running, multi-step work and coding

0
NetworkSpy

HTTP(s) proxy debugger with custom viewer

0
Mina Meeting Assistant

Your AI Teammate now responds and executes during your calls

0
R0Y OMNI 1.0

Generate more accurate investment dashboards and reports

0
Web Clipper for NotebookLM

Your ultimate NotebookLM's Chrome Extension

0
TabTasker

Zero servers. Total privacy. Your new favorite toolbox.

0
Second Brain for AI

Persistent memory for Claude, ChatGPT & Cursor. Free.

0
Marqly 5.0

Your AI-powered bookmark manager

0
Oura Ring 5

The world’s smallest smart ring, now even better

0
Clipto

Fully local, natural language search over terabytes of media

0
Exstats

Track your browser extensions and competitors in one place

0
Step 3.7 Flash

Flash-speed agents model that can see and act

0
Openstatus MCP Health Checker

Test MCP servers like a real AI client, not just a ping

0
Wingbits AI

AI agents for real-time aircraft monitoring and alerts

0
Wandesk

Build Your Own AI Desktop

0
Sinalytica

Travel back to 1998 and use Lovable on Windows 98

0
Drafted

Design a home instantly with AI

0
Notchy

Mac dynamic island with music, timers, clipboard, file drops

0
Agent A by Ahrefs

The AI Marketing Agent Powered by Ahrefs Data

0
Ava Studio

Your AI creative team for video ads

0
MoDev

The AI dev environment built for your phone.

0
Firecoach AI

AI roleplays that turn reps into top performers

0
MCP Bridge by Appfactor

Connect any API to any AI agent

0
Screen Ruler

The go-to ruler for designers and developers

0
Basedash: Embedded Analytics

Give customers AI analytics inside your product.

0
Coffee Piano

Browser music and piano studio with visual harmony tools

0
TrackNotch

LLM usage tracking that lives in your Mac's notch

0
/monitor by Firecrawl

Notify your AI agent when the web changes

0
Hyper: Self-driving Company Brain

Turn your AI agents from interns to veterans

0
RabbitTravel

Smart travel planning made effortless

0
Clipline

AI Video Cutter for viral Shorts, Reels, TikTok in Telegram

0
PromptLayer

Trace AI requests, workflows, and costs in one timeline

0
Integuru

Generate fast, reliable APIs for any platform. No browsers

0
Ava 2.0

Your AI BDR that runs outbound sales autonomously

0
06

TECHMEME

Techmeme - June 1, 2026

Techmeme Digest: Major tech headlines and industry conversations.

Palo Alto Networks says Mythos found 24+ critical bugs, burning $1M+ of tokens, subsidized by Anthropic; some companies say they plan to boost Mythos spending (Aaron Holmes/The Information)
Source: Techmeme · Published: Jun 1, 2026

Aaron Holmes / The Information : Palo Alto Networks says Mythos found 24+ critical bugs, burning $1M+ of tokens, subsidized by Anthropic; some companies say they plan to boost Mythos spending —  When Palo Alto Networks earlier this year began testing Anthropic's Claude Mythos to comb through its own source code, it didn't take long to see the future of cybersecurity.

SEC filing: quantum computing company Quantinuum upsizes its IPO, selling 26.5M shares for $53 to $55 each to raise up to $1.46B at an up to $14.3B valuation (Liana Baker/Bloomberg)
Source: Techmeme · Published: Jun 1, 2026

Liana Baker / Bloomberg : SEC filing: quantum computing company Quantinuum upsizes its IPO, selling 26.5M shares for $53 to $55 each to raise up to $1.46B at an up to $14.3B valuation —  Honeywell International Inc.-backed quantum computing company Quantinuum Inc. boosted the size of its initial public offering …

African e-mobility startup Spiro, which owns 100K+ electric bikes, raised $215M at a near-$1B valuation, after raising $100M in 2025 and $50M debt in 2026 (Loni Prinsloo/Bloomberg)
Source: Techmeme · Published: Jun 1, 2026

Loni Prinsloo / Bloomberg : African e-mobility startup Spiro, which owns 100K+ electric bikes, raised $215M at a near-$1B valuation, after raising $100M in 2025 and $50M debt in 2026 —  African electric-mobility startup Spiro raised $215 million backed by European and African investors as it nears $1 billion in value …

SEC filing: SpaceX will reserve up to 5% of its Class A shares for select employees and executives' friends and family; 60%+ of shares have an extended lock-up (Charles Capel/Bloomberg)
Source: Techmeme · Published: Jun 1, 2026

Charles Capel / Bloomberg : SEC filing: SpaceX will reserve up to 5% of its Class A shares for select employees and executives' friends and family; 60%+ of shares have an extended lock-up —  SpaceX will reserve up to 5% of shares in its upcoming initial public offering for certain employees and friends and family …

SEC filing: Strategy sold 32 bitcoin between May 26 and May 31 for ~$2.5M, at an average net price of $77,135 per coin, its first disclosed bitcoin disposal (CoinDesk)
Source: Techmeme · Published: Jun 1, 2026

CoinDesk : SEC filing: Strategy sold 32 bitcoin between May 26 and May 31 for ~$2.5M, at an average net price of $77,135 per coin, its first disclosed bitcoin disposal —  The 8-K filing Monday says proceeds from the May 26-31 sale, executed at an average price of $77,135 a coin, will fund distributions on Strategy's preferred stock.

Google plans to open its first physical Google Store outside of the US, in Tokyo's Omotesando district "this summer", marking Google's 11th physical store (Damien Wilde/9to5Google)
Source: Techmeme · Published: Jun 1, 2026

Damien Wilde / 9to5Google : Google plans to open its first physical Google Store outside of the US, in Tokyo's Omotesando district “this summer”, marking Google's 11th physical store —  After being exclusive to the United States, the first physical Google Store outside of the region is coming to Tokyo, Japan, in the coming months.

French private equity firm Ardian partners with data center group Verne to build an up to €5B AI "gigafactory" outside Paris, targeting 500MW in total capacity (Financial Times)
Source: Techmeme · Published: Jun 1, 2026

Financial Times : French private equity firm Ardian partners with data center group Verne to build an up to €5B AI “gigafactory” outside Paris, targeting 500MW in total capacity —  Data centre and research facility to be built as Europe seeks to create ‘digital backbone for the future’

Israeli networking company DriveNets raised a $410M Series D led by Bessemer and Atreides at an $8.5B valuation, taking its total funding to ~$1B (Meir Orbach/CTech)
Source: Techmeme · Published: Jun 1, 2026

Meir Orbach / CTech : Israeli networking company DriveNets raised a $410M Series D led by Bessemer and Atreides at an $8.5B valuation, taking its total funding to ~$1B —  The Israeli networking company says it is cash-flow positive with over $1B in backlog as AI demand accelerates.

Wirescreen analysis of 3,800 Chinese military procurement records finds 500+ instances since 2019 where the PLA sought Nvidia chips, including the A100 and A800 (New York Times)
Source: Techmeme · Published: Jun 1, 2026

New York Times : Wirescreen analysis of 3,800 Chinese military procurement records finds 500+ instances since 2019 where the PLA sought Nvidia chips, including the A100 and A800 —  An analysis of six years of procurement records suggests that the People's Liberation Army has openly tried to acquire restricted U.S. technology.

Sources: Anthropic plans to let the EU's cyber agency ENISA join Project Glasswing and access Mythos; EU officials went to the US last week to ask for access (Gian Volpicelli/Bloomberg)
Source: Techmeme · Published: Jun 1, 2026

Gian Volpicelli / Bloomberg : Sources: Anthropic plans to let the EU's cyber agency ENISA join Project Glasswing and access Mythos; EU officials went to the US last week to ask for access —  Anthropic PBC is set to give the European Union's cybersecurity body access to Mythos, the first EU agency to get access …

Chinese AI developer MiniMax debuts M3, a new coding model that it says rivals Claude Opus 4.7, costing $0.12 per 1M input tokens, compared with $5 for Opus 4.7 (Juro Osawa/The Information)
Source: Techmeme · Published: Jun 1, 2026

Juro Osawa / The Information : Chinese AI developer MiniMax debuts M3, a new coding model that it says rivals Claude Opus 4.7, costing $0.12 per 1M input tokens, compared with $5 for Opus 4.7 —  Chinese AI developer MiniMax on Monday launched a new large language model called M3, saying the new model's coding capability approaches …

Binance launches trading for 7,000+ US stocks and ETFs for non-US users, with zero commissions and fractional share purchases, as part of its "super app" push (Jeff John Roberts/Fortune)
Source: Techmeme · Published: Jun 1, 2026

Jeff John Roberts / Fortune : Binance launches trading for 7,000+ US stocks and ETFs for non-US users, with zero commissions and fractional share purchases, as part of its “super app” push —  Binance, the world's biggest cryptocurrency exchange, announced on Monday that its users will be able to trade more than 7,000 U.S. stocks and ETFs.

A look at the Seckinger school cluster in Georgia, including the US' "first AI-themed educational institution", as parents say AI integration is often sparse (New York Times)
Source: Techmeme · Published: Jun 1, 2026

New York Times : A look at the Seckinger school cluster in Georgia, including the US' “first AI-themed educational institution”, as parents say AI integration is often sparse —  It was 9 a.m. on a Thursday at Harmony Elementary School in Buford, Ga., about 45 minutes outside Atlanta.

Nasdaq, FTSE, and other index providers are shortening their entry timelines to accommodate SpaceX's record $75B IPO, as Elon Musk targets retail investors (Bloomberg)
Source: Techmeme · Published: Jun 1, 2026

Bloomberg : Nasdaq, FTSE, and other index providers are shortening their entry timelines to accommodate SpaceX's record $75B IPO, as Elon Musk targets retail investors —  The company's ambitious listing plan is set to clear the way for other mega-offerings.  It also risks threatening the integrity of the market itself

Q&A with Bill Gurley on Anthropic employees believing "they can create God, and that by creating God, they are like this Prometheus kind of species", and more (@theallinpod)
Source: Techmeme · Published: Jun 1, 2026

@theallinpod : Q&A with Bill Gurley on Anthropic employees believing “they can create God, and that by creating God, they are like this Prometheus kind of species”, and more —  Bill Gurley: Anthropic Thinks It's Building God @Jason: It is the ultimate level of narcissism and delusion of grandeur to think you can create God. @bgurley: “Anthropic is a mystery to me. I've never, ever seen a company that is both leading their field and the most [video]

07

STARTUP ARCHIVE

Startup News - June 1, 2026

Startup News Roundup: Aggregating key funding and launch updates.

Marc Andreessen on the 5 personality traits of an innovator
Source: Startup · Published: Mar 31, 2026

“When you’re talking about real innovators—people who actually do really creative, breakthrough work—I think you’re talking about a couple things:”

Steve Jobs explains the importance of both thinking and doing
Source: Startup · Published: Mar 30, 2026

“The doers are the major thinkers. The people who really create the things that change this industry are both the thinker-doer in one person.”

Tobi Lutke explains what the VCs who passed on Shopify got wrong
Source: Startup · Published: Mar 27, 2026

“What a lot of free-market thinkers don’t understand is that between the demand and eventual supply lies friction.”

Sam Altman explains how he decides to invest in a startup after 10 minutes
Source: Startup · Published: Mar 26, 2026

“Does this person have the potential to be the next Mark Zuckerberg?… [You don’t get to] 100% accuracy, obviously, but it’s good enough that our business model works.”

Jony Ive recounts the time Steve Jobs called him vain
Source: Startup · Published: Mar 25, 2026

In the clip below, Jony Ive recounts the time he asked Steve Jobs to be less harsh in his critique of a piece of work.

Jeff Bezos’s two pieces of advice for aspiring entrepreneurs
Source: Startup · Published: Mar 24, 2026

“The advice that I would give entrepreneurs is don't chase the hot new thing. It's so hard to catch something that everybody already knows is hot.”

Elad Gil: “Things that work tend to work pretty fast”
Source: Startup · Published: Mar 23, 2026

“I do think there’s a bit of a myth in Silicon Valley that you should keep grinding no matter what and it’s just about perseverance, and I think that’s really bad advice.”

Paul Graham on why starting with a “small, intense fire” is the key to startup growth
Source: Startup · Published: Mar 20, 2026

"You have to know who those first users are and how you're going to get them."

Keith Rabois on how to identify great talent
Source: Startup · Published: Mar 19, 2026

“What you want to do with every single employee every single day is expand the scope of their responsibilities until it breaks… and that’s the role they should stay in.”

Wealthfront CEO on why advertising spend makes it harder to find product/market fit
Source: Startup · Published: Mar 18, 2026

“The way that you know you have product/market fit is if you have exponential organic growth.”

Eric Schmidt on why most companies get strategy wrong
Source: Startup · Published: Mar 17, 2026

“Work very, very hard to figure out what the world’s going to look like in five years. What will people be doing? What will your customers want? Where will costs be?”

Mark Zuckerberg: “You can’t 80/20 everything”
Source: Startup · Published: Mar 16, 2026

"There’s the famous 80/20 rule where you get 80% of the benefit by doing 20% of the work, but you can’t just 80/20 everything. There have to be certain things that you are just the best at."

Marc Andreessen on Mark Zuckerberg’s founder “superpower”
Source: Startup · Published: Mar 13, 2026

“A great superpower that Mark Zuckerberg has that is probably not well-understood enough is he does not get emotionally upset in stressful situations”

Sam Altman explains how to come up with a great startup idea
Source: Startup · Published: Mar 12, 2026

"If you start a startup without a good idea… you’ll be under pressure to make something up and it won’t work that well."

Jeff Bezos on the problems with proxies and managing to metrics
Source: Startup · Published: Mar 11, 2026

“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”

Airbnb founder Brian Chesky on how to design an amazing user experience
Source: Startup · Published: Mar 10, 2026

“If you can design something really amazing using the hand-crafted part of your brain, then you can reverse-engineer how to industrialize this millions of times over.”

Spencer Rascoff: “I will never invest in a consumer startup with paid marketing”
Source: Startup · Published: Mar 9, 2026

“If you’re actually trying to grow a product, the best levers for doing that are often within the product itself.”

Patrick Collison explains why it sometimes makes sense to quit
Source: Startup · Published: Mar 6, 2026

“One thing I’ve learned myself the hard way, is that it is easier to tear down a company and restart it in Silicon Valley, than it is to constantly try to pivot or keep something alive.”

Jeff Bezos recounts the time he called Amazon’s customer service number mid-meeting to prove a metric was wrong
Source: Startup · Published: Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disagree, the anecdotes are usually right”

Ben Horowitz: “Nobody was born a great manager. It’s a very unnatural job.”
Source: Startup · Published: Mar 4, 2026

“If you can’t build a great product, it doesn’t matter if you can build a great company.”

03

ALSO TODAY

3 MORE SOURCES
08

SOLIDOT

Solidot News - June 1, 2026

Solidot Feed: Highlighting essential tech & open-source news.

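AOMedia releases the AV2 specification

The Alliance for Open Media (AOMedia), formed jointly by Amazon, Cisco, Google, Intel, Microsoft, Mozilla, Netflix, and others, has officially released the AV2 codec, the successor to AV1. AV2 builds on AV1 with higher compression efficiency, delivering high-quality video at lower bitrates, and is optimized for the changing demands of streaming, broadcast, and real-time video conferencing. AV2 strengthens support for AR/VR applications, supports split-screen delivery of multiple programs, improves screen content handling, and can operate across a wider range of visual quality.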

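Malaysia's ban on social media use by under-16s takes effect

Malaysia's new online safety regulations took effect on Monday (June 1), requiring major social media platforms to verify users' ages and barring children under 16 from registering accounts. The rules apply to social media providers with at least 8 million users in Malaysia, including Facebook, Instagram, TikTok, and YouTube. The country's communications regulator said it will give platforms a grace period to implement the measures but did not say when that period ends. The relevant provisions under the new Online Safety Act include a new Child Protection Code and a Risk Mitigation Code, and require platforms to "strengthen content management." The Communications and Multimedia Commission said companies that fail to comply with the two codes face fines of up to 10 million ringgit.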

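Study finds gamers as a whole hold more inclusive values

In recent years, voices opposing DEI and embracing conservative values have been highly visible among gamers on social media. Do they represent a minority amplified by social media algorithms, or the majority of players? Researchers used MRI-Simmons data to analyze national consumer surveys conducted in the US in three specific years (2012, 2016, and 2020), tracking whether respondents had played online or single-player games in the previous twelve months and examining the correlation between gaming behavior and values. The results show that gamers as a whole hold more inclusive values than the general US population. The researchers conclude that hostility toward inclusive values such as DEI comes from a small number of vocal players.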

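Flow in Earth's Molten Core Abruptly Reversed Direction in 2010

According to satellite measurements of the geomagnetic field, the flow in Earth's molten core beneath a region of the Pacific abruptly reversed direction in 2010, switching from westward to eastward. Frederik Dahl Madsen, an Earth scientist at the University of Edinburgh, said: “Scientists now want to understand whether this reversal represents a brief fluctuation, part of a periodic oscillation, or a new stable equilibrium in the core's circulation. Continued monitoring is essential to determine how this flow evolves in the coming years.” Madsen's team analyzed 27 years of satellite data covering 1997 to 2025 to piece together what may have happened. Most motion in the outer core is governed by a circulation pattern known as the eccentric planetary gyre. In 2010, part of the outer core beneath the Pacific suddenly departed from this pattern, shifting from a weak westward flow before 2010 to a strong eastward flow after 2012. The flow kept strengthening until 2020. According to the latest measurements, it has begun to weaken again. The finding suggests that Earth's interior may be more dynamic than previously thought.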

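Paint.net Project Wins Back the Paint.net Domain in Court

The official domain of the popular image editor Paint.net has been www.getpaint.net, because the Paint.net domain was held by a third party. Now the software can be downloaded directly from the Paint.net domain. For the past 22 years, the domain's previous owner refused to sell it unless the project's developer, Rick Brewster, paid a huge sum. But the domain owner made a serious mistake: they set up a website imitating the Paint.net project's download page and profited from malicious links and ads. Brewster sued, arguing that profiting from someone else's work constituted copyright infringement and cybersquatting. He won the case and got the Paint.net domain back without paying for it. Paint.net will become the main site, and GetPaint.net will redirect to it.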

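Wikimedia Foundation Denies Firing Employees for Union Organizing

Wikimedia Foundation employees are forming a union, but this month several employees involved in the organizing left or were dismissed. The news drew a strong reaction in the community, with some calling for a strike or for pausing the work of reverting disruptive edits. The Foundation confirmed that it disbanded the team responsible for the Community Wishlist but denied any link to the union drive. It said an internal assessment concluded that relying on a single team to handle community requests no longer worked well: because the Foundation supports so much software and receives community requests through so many channels, one dedicated team can hardly satisfy all of the community's wishes. Under the new structure, responsibility for handling Community Wishlist requests will pass to the larger Product and Technology department. Affected employees are still employed for now, and placement in other internal roles is being considered. Those not placed in another role will leave next month with severance pay. The Foundation said that if employees ultimately vote to form a union, it will respect the legal process.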

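16-Year-Old Names Bluetooth Device BOMB, Forcing Airliner to Turn Back

At 5:58 p.m. on May 30, 2026, United Airlines flight UA236, a Boeing 767-400ER, took off from Newark Liberty International Airport bound for Palma de Mallorca Airport in Spain. About an hour and a half into the transatlantic crossing, the previously uneventful flight descended into confusion. According to accounts passengers shared on social media, flight attendants suddenly issued an urgent instruction over the PA system: all passengers had to turn off Bluetooth immediately. The crew made repeated announcements in an increasingly tense tone, saying the instruction came directly from United's headquarters in Chicago, and warned that the aircraft would be forced to turn back if the Bluetooth signals were not switched off. Despite the warning, at least two Bluetooth devices remained on, and the pilots ultimately decided to abort the flight. According to social media reports, the cause was a 16-year-old boy who had changed the network name of his personal Bluetooth speaker to BOMB, reportedly several years earlier. A Bluetooth device broadcasts its name to any nearby smartphone or laptop trying to pair, so the name appeared immediately on the screens of passengers and crew in the cabin, triggering the standard bomb-threat response procedure.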

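Microsoft Uses Certificate Expiry as a Pretext to Put Office 2019 for Mac into Read-Only Mode

Microsoft announced Office 2019 for Windows and Mac on September 24, 2018, priced at $149.99 as a perpetual license that would not receive new features. On May 15, 2026, however, Microsoft updated its support documentation and no longer guarantees that Office 2019 will work properly. Support for Office 2019 for Mac ended on October 10, 2023. Microsoft uses a digital certificate to validate Mac licenses, and that certificate expires on July 13, 2026. Microsoft does not plan to renew the certificate and will simply let it expire, after which the software will stop working normally and enter read-only mode. Microsoft is offering affected users three options: keep using Office 2019 for Mac in read-only mode, switch to the free Microsoft 365 web apps, or pay for a Microsoft 365 subscription or a new Office Home 2024 perpetual license. The move has drawn widespread criticism, with critics saying it may be illegal. The Windows version is not affected.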

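Heat Disrupts Animal Brains

A large body of evidence shows that animal brains are affected by high temperatures. In hot weather, birds learn less well, dogs bite people more often, and larger animals such as antelopes are more prone to pick fights. Amanda Ridley, a behavioral ecologist at the University of Western Australia, said that if animals cannot stay alert enough to find food or evade predators, their chances of survival drop sharply. As climate change makes heat waves more frequent, cognitive impairment in the animal kingdom could ripple through entire ecosystems, putting already vulnerable species at greater risk. If pollinating insects forget which flowers to visit, crops and wild plants could fail. If birds struggle to forage, their chicks may not survive. On a warming planet, sharp thinking matters all the more, and Ridley noted that climate change makes the ability to adapt more important. Heat affects the human brain too: one study found that for students in schools without air conditioning, each 1 °F rise in temperature over the school year lowered test scores by 1%. An analysis of nearly 70,000 dog-bite reports in the United States found that the risk of a dog bite is 10% higher at 32 °C than at 16 °C, but the researchers are not sure whether dogs become more aggressive in the heat or people become more irritable and more likely to provoke an attack; it is probably a combination of both. A study in China found that snakes and cats are also more likely to bite when the weather gets hotter.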

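GLP-1 Weight-Loss Drugs May Reshape the Brain

Tens of millions of people worldwide take GLP-1 weight-loss drugs such as Ozempic. A research team scanned the brains of 13 young women taking GLP-1 drugs and found profound changes: the number of connections in the salience network, which is associated with attention, multiplied. The researchers were surprised by the result and said they do not know what it means. GLP-1 drugs work by mimicking a hormone that controls hunger, blood sugar, and body weight. After studying the drugs' mechanism in depth, researchers found that they also reshape parts of the brain. Lorenzo Leggio, a scientist working to use GLP-1 drugs to treat addiction, said the mechanism is not yet fully understood. This raises a question: if GLP-1 drugs can alter brain systems related to reward, craving, and motivation, where is the line between curbing a person's destructive impulses and reshaping their personality?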

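Danish Pension Fund Blacklists SpaceX

The Danish pension fund AkademikerPension sold off its U.S. Treasuries in January, citing the U.S. government's weak creditworthiness, and has now put SpaceX on its investment blacklist over governance problems. SpaceX filed for its IPO on May 20, targeting a valuation of as much as $1.8 trillion. AkademikerPension chief investment officer Anders Schelde said the valuation is not only far too high but that the company also has a disastrous governance structure. Elon Musk has absolute control of the company, holding about 80% of the voting rights while serving as CEO, CTO, and chairman of the board. Several U.S. pension funds have also voiced concern about SpaceX's governance. Schelde believes a reasonable valuation for SpaceX is under $1 trillion and that, from an investment-return standpoint, the fund cannot justify taking part in the IPO. He said that were it not for SpaceX's valuation and governance risks, AkademikerPension would very much like to invest in SpaceX and its technology: “Our decision not to invest does not reflect any shortcoming in its technology or engineering capability.”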

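A U.S. Company Spent $500 Million on Claude AI in One Month

Axios reports that an unnamed company spent $500 million on Claude AI in a single month because it forgot to set Claude usage limits for its employees. Although the company was not named, very few companies can casually spend $500 million a month on AI without having a large AI model of their own. According to the report, U.S. companies are starting to feel the pressure of overspending on AI, and business leaders are beginning to question whether soaring AI spending is delivering real returns. Amazon was reported earlier to have employees inflating token consumption to hit internal metrics. This week Amazon removed its internal leaderboard to keep employees from using AI on unnecessary tasks to raise their rankings.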

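Krafton Agrees to Pay $250 Million Bonus to Subnautica 2 Developer

Unknown Worlds Entertainment, developer of the underwater survival game Subnautica, went to court against its parent company, the South Korean publisher Krafton, over a $250 million bonus. In the high-profile case, Krafton CEO Changhan Kim did not want to pay the bonus and, after consulting ChatGPT, abruptly fired Unknown Worlds' top executives on trumped-up grounds. In March a court ordered that former Unknown Worlds CEO Ted Gill be reinstated. Unknown Worlds also released the early access version of Subnautica 2 this month. Although still in development, Subnautica 2 has already sold more than 4 million copies, with a peak of more than 467,000 concurrent players on Steam. That performance meets the bonus conditions the two sides agreed on: once sales for the month exceed $69.8 million, Krafton must pay Unknown Worlds' former shareholders $3.12 for every $1, up to a maximum of $250 million. According to South Korean media, Krafton has agreed to pay the bonus.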

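Climate Change Disrupts the Arctic Ocean Food Chain

Researchers have found that the accelerating melt of Arctic sea ice has caused a sharp drop in nitrate, a key nutrient, disrupting the food chain and affecting populations of plankton, fish, seabirds, and marine mammals. The analysis shows that large shallow-sea areas once covered by ice are now exposed to sunlight, which speeds up the breakdown of nitrate. Nitrate is essential to the growth of the plankton at the base of the food chain, and its decline limits how much life the ecosystem can sustain. An analysis of more than two decades of sampling data from Fram Strait, the main channel through which icy Arctic water flows into the Atlantic, found that nitrate levels in Arctic waters have fallen steadily since 2009. The decline in nitrate occurred almost simultaneously with the sharp loss of Arctic sea ice. The researchers said that because the change in nutrient conditions is driven by ongoing sea-ice melt, the Arctic Ocean is very unlikely to return to its previous state.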

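The Nvidia Tax

Residents living near U.S. data centers have all seen their electricity bills rise sharply. What they may not know is that part of that bill is effectively a tax paid to Nvidia. Nvidia controls 81% of the data center AI chip market; in its last fiscal year its data center business brought in $193.7 billion in revenue at a gross margin of 75%. A teardown report on Nvidia's top GPU shows that it costs about $3,300 to manufacture but sells for as much as $28,000, a profit margin of 88%. Profits that high are really a tax, and someone has to bear it. Residents around data centers sit at the end of this payment chain. To pay less of the Nvidia tax, the tech giants are racing to develop cheaper AI accelerator chips, such as Google's TPU, Amazon's Trainium, Microsoft's Maia, and Meta's MTIA, and OpenAI is working with Broadcom to design an AI chip. But why should we be paying a tax to Nvidia at all?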
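
The 88% figure follows directly from the two teardown numbers quoted above. As a quick check, the sketch below recomputes it, assuming "profit margin" here means (price minus manufacturing cost) divided by price and ignoring R&D, software, and other costs; the function and variable names are illustrative only.

```python
def unit_margin(unit_cost: float, unit_price: float) -> float:
    """Share of the selling price left over after the manufacturing cost."""
    return (unit_price - unit_cost) / unit_price


# Figures quoted in the item above (US dollars per GPU).
cost_usd = 3_300
price_usd = 28_000

margin = unit_margin(cost_usd, price_usd)
markup = price_usd / cost_usd  # how many times the cost the price is

print(f"margin: {margin:.1%}")
print(f"price is {markup:.1f}x the manufacturing cost")
```

On these inputs the per-unit margin rounds to 88%, which is higher than the 75% gross margin reported for the data center business as a whole.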

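Flathub Bans AI-Generated Apps

Flathub, the Linux app store that distributes applications packaged as Flatpaks, has updated its generative AI policy, effectively banning AI-generated apps. Flathub states that submissions containing AI-generated or AI-assisted code, documentation, or other content are not allowed. Such submissions will be rejected outright without further review, and repeated violations will lead to a permanent ban on submitting apps. The developers said they have had enough of such apps, but AI-assisted apps that were previously submitted and approved will not be affected retroactively and can still be used normally.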

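Google Hates You and Me

Google has dominated the search engine market since the early 2000s. To make their content findable, media outlets have all had to follow Google's rules and optimize accordingly. But what if one day the search engine optimizes only for itself? That day has come: Google announced last week that it will use Gemini to handle all search queries. Google had already hit media outlets with AI Overview, cutting their traffic by as much as a quarter. Now the search giant is preparing to cut off journalism's lifeline entirely. Social platforms such as Facebook and X throttle links to keep users on their own sites rather than clicking away. By moving to AI search, Google is embracing this trend, making users depend more on machines than on real people for information. Given how ubiquitous and unavoidable Google is, it is leading the tech industry in devaluing human thought and humans themselves. Google hates you, and it hates me too.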

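Scientists Use a Quantum Bell Setup to Generate Perfect Randomness

According to a study published in Nature, researchers at ETH Zurich have used a quantum Bell test setup to generate, for the first time, randomness that is provably perfect. The randomness rests on the non-determinism of quantum physics. The researchers used two superconducting chip devices cooled to near absolute zero. Each chip represents one qubit, which can be in state 0, state 1, or a superposition of the two. The two chips are connected by a 30-meter-long cooled tube. Microwave photons travel between the chips and create quantum entanglement. This means that a quantum measurement on one qubit, which randomly yields 0 or 1, automatically affects the measurement result of the other qubit at a distance. The 30-meter separation ensures that no information can pass between the qubits during the measurement, even at the speed of light. Any exchange of information would destroy this perfect randomness. The researchers say the resulting sequence of 0s and 1s is a truly perfect random sequence, and that they can prove it.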
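
To see why the 30-meter separation matters, it helps to put a number on how long light needs to cross it. The sketch below is a back-of-the-envelope estimate only: it assumes the straight-line distance between the two chips equals the 30-meter tube length, which the item above does not state, and the variable names are illustrative.

```python
# Speed of light in vacuum, exact by definition of the metre (m/s).
SPEED_OF_LIGHT_M_PER_S = 299_792_458

# Assumption: the chips are 30 m apart in a straight line.
separation_m = 30.0

# Time a light-speed signal needs to travel from one chip to the other.
window_ns = separation_m / SPEED_OF_LIGHT_M_PER_S * 1e9

print(f"light-travel time over {separation_m:.0f} m: about {window_ns:.0f} ns")
```

Under that assumption, each measurement has to finish within roughly 100 nanoseconds for the two results to be guaranteed free of any light-speed signal between the chips.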

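Anthropic's Valuation Surpasses OpenAI's for the First Time

Anthropic announced on Thursday that it has raised $65 billion at a valuation of $965 billion. With this Series H round, Anthropic's valuation exceeds that of rival OpenAI for the first time. OpenAI was valued at $852 billion after its funding round in March, while Anthropic was valued at just $380 billion in February. Both Anthropic and OpenAI are preparing to go public, possibly as soon as this year. Anthropic said that, based on its revenue over the most recent month, full-year revenue could exceed $47 billion.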

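Japan's Population Falls by More Than Three Million in Five Years

Japan's Ministry of Internal Affairs and Communications released preliminary census figures on Friday. As of October 1, 2025, Japan's total population, including foreign residents, was 123,049,524, down about 3.097 million, or 2.5%, from the previous census in 2020. It is the third consecutive decline, counting from the 2015 census, and the largest drop so far, again underscoring the severity of the population decline. The ministry's analysis attributes it mainly to a widening “natural decrease,” in which deaths exceed births, as the birth rate falls and the population ages. With births trending downward, Japan's population is expected to keep shrinking, and measures are urgently needed to keep local communities and economies functioning. The number of households nationwide rose 2.3% to 57,124,507. The average household size was 2.15 people, the lowest since comparable data began in 1970, which analysts say may reflect an increase in elderly single-person households. According to United Nations estimates of national populations for 2025, Japan ranks 12th, accounting for 1.5% of the world's population. Among the 20 most populous countries, those whose populations fell between 2020 and 2025 were Japan, China, Russia, and Thailand, with Japan seeing the largest decline.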
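
The 2.5% figure can be reproduced from the two census numbers quoted above. The sketch below does so, assuming the percentage is measured against the 2020 population and using the rounded decline of about 3.097 million, so the implied 2020 total is an approximation rather than an official figure.

```python
# Figures quoted in the item above.
population_2025 = 123_049_524
decline = 3_097_000  # "about 3.097 million", rounded in the source

# Implied 2020 census population (approximate, because the decline is rounded).
population_2020 = population_2025 + decline

# Decline as a share of the 2020 population.
decline_share = decline / population_2020

print(f"implied 2020 population: {population_2020:,}")
print(f"five-year decline: {decline_share:.1%}")
```

The result is a little under 2.46%, which rounds to the 2.5% reported.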

09

APP STORE RANK
