TEXT VIEW · TODAY'S DIGEST · 36 HEADLINES ACROSS 8 SOURCES

ISSUE 0891
TUE, JUN 9, 2026
Discover the best information organized by OrangeBot.AI

The web, read by a bot.

Ten sources — Hacker News, Product Hunt, HuggingFace, Techmeme and more — filtered, tagged, and summarized every morning for builders who don’t have time to scroll.

01

AI DIGEST

UPDATED DAILY · EDITOR'S PICK

AI News Digest

June 9, 2026

Here is a summary of today's main news events:

Markets Rally as Middle East Tensions Ease

Tensions between Iran and Israel have de-escalated after both sides halted attacks. This news calmed global markets, causing oil and gold prices to fall while U.S. stocks, led by the tech sector, gained value as investor confidence returned.

Apple Unveils a Smarter, AI-Powered Siri

Apple announced a major overhaul of its virtual assistant, Siri, integrating advanced artificial intelligence developed in partnership with Google. The update is a key part of Apple's strategy to compete more aggressively in the AI space and enhance the capabilities of its devices.

Major Investments Signal Confidence in AI and Tech

The AI sector continues to attract significant capital, with chipmaker Broadcom and partners launching a $35 billion fund to finance AI infrastructure. In another major development, OpenAI, the creator of ChatGPT, has confidentially filed paperwork for a future Initial Public Offering (IPO).

U.S. Dollar Weakens on National Debt Concerns

The value of the U.S. dollar declined today against other major currencies. Analysts attribute the drop to growing investor concerns regarding the long-term sustainability of U.S. government debt.

02

ON THE WIRE

6 SOURCES
02

HACKER NEWS

Hacker News - June 9, 2026

Hacker News Feed: Highlighting key posts and discussions.

Cleaning up after AI rockstar developers

(www.codingwithjesse.com)

13164
Making Graphics Like it's 1993

(staniks.github.io)

21034
Job: Head of Stonehenge

(www.english-heritage.org.uk)

178148
Federal judge blocks H1B visa $100K fee

(www.alaskasnewssource.com)

165297
Apple Core AI Framework

(developer.apple.com)

32792
Siri AI

(www.apple.com)

622615
AI is slowing down

(www.wheresyoured.at)

610664
The Cypherpunk Library

(www.cypherpunkbooks.com)

36396
Dopamine Fracking

(igerman.cc)

784406
03

HUGGINGFACE

HuggingFace - June 9, 2026

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
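
To make the task format concrete, here is a minimal sketch of scoring a ranked list of code regions under a fixed line budget. The region format, the truncation rule, and the coverage definition are assumptions for illustration; the abstract does not spell out the benchmark's exact metrics.

```python
from typing import Iterable

# Hypothetical region format: (file path, first line, last line), inclusive.
Region = tuple[str, int, int]


def lines_under_budget(ranked: Iterable[Region], budget: int) -> set[tuple[str, int]]:
    """Expand ranked regions into (file, line) pairs, stopping once `budget` lines are used."""
    picked: set[tuple[str, int]] = set()
    for path, start, end in ranked:
        for line in range(start, end + 1):
            if len(picked) >= budget:
                return picked
            picked.add((path, line))  # overlapping regions are not double-counted
    return picked


def line_coverage(ranked: Iterable[Region], truth: set[tuple[str, int]], budget: int) -> float:
    """Fraction of ground-truth lines that fall inside the budgeted selection."""
    if not truth:
        return 0.0
    picked = lines_under_budget(ranked, budget)
    return len(picked & truth) / len(truth)
```

Because the budget is consumed in rank order, an explorer that places the right file first but the right lines late can score well on file-level localization and still lose line-level coverage, which is the gap the abstract highlights.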

91
On the Geometry of On-Policy Distillation

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

52
Latent Spatial Memory for Video World Models

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57x faster end-to-end video generation and a 55x reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.

45
CoVEBench: Can Video Editing Models Handle Complex Instructions?

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

44
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.

43
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.

36
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

36
Human Psychometric Questionnaires Mischaracterize LLM Behavior

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

30
Echo-Memory: A Controlled Study of Memory in Action World Models

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

27
OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

15
SwiftVR: Real-Time One-Step Generative Video Restoration

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.

12
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

12
Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
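
The contrast between raw counts and a posterior can be illustrated with a deliberately simplified stand-in. The sketch below uses a Beta-Bernoulli posterior over a single skill's success rate, not the feature-conditioned categorical posterior the abstract describes. The thresholds and the "keep" action are hypothetical choices made for this example.

```python
def beta_posterior(successes: int, failures: int,
                   prior_a: float = 1.0, prior_b: float = 1.0) -> tuple[float, float]:
    """Mean and variance of Beta(prior_a + successes, prior_b + failures)."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def choose_action(successes: int, failures: int,
                  explore_std: float = 0.15,   # assumed threshold
                  retire_below: float = 0.3,   # assumed threshold
                  keep_above: float = 0.8) -> str:  # assumed threshold
    """Map posterior state to an action; the mapping is illustrative only."""
    mean, var = beta_posterior(successes, failures)
    if var ** 0.5 > explore_std:
        return "explore"   # too little evidence to trust the estimate
    if mean < retire_below:
        return "retire"
    if mean >= keep_above:
        return "keep"
    return "patch"
```

A skill with 1 success in 1 trial and one with 20 in 20 share the same raw success rate, but only the second has a posterior tight enough to act on, which is the sense in which counts alone are not a reliable belief.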

11
OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

9
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

9
SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks and leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
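
One way to picture a correctness gate cascaded with cohort-relative efficiency is sketched below. The `Rollout` fields, the min-max normalization, the equal weighting of tool calls and tokens, and the `bonus` value are all assumptions; the abstract does not give the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool
    tool_calls: int
    tokens: int


def _relative(value: float, lo: float, hi: float) -> float:
    """1.0 for the cheapest correct rollout, 0.0 for the most expensive (0.0 if all tie)."""
    return 0.0 if hi == lo else (hi - value) / (hi - lo)


def gated_rewards(cohort: list[Rollout], bonus: float = 0.5) -> list[float]:
    """Incorrect rollouts get 0; correct ones get 1 plus an efficiency bonus relative to the cohort."""
    correct = [r for r in cohort if r.correct]
    if not correct:
        return [0.0] * len(cohort)
    calls = [r.tool_calls for r in correct]
    toks = [r.tokens for r in correct]
    rewards = []
    for r in cohort:
        if not r.correct:
            rewards.append(0.0)  # the gate: being cheap never compensates for being wrong
            continue
        eff = 0.5 * _relative(r.tool_calls, min(calls), max(calls)) \
            + 0.5 * _relative(r.tokens, min(toks), max(toks))
        rewards.append(1.0 + bonus * eff)
    return rewards
```

Since efficiency is only compared among correct rollouts from the same cohort, a short wrong answer earns nothing, which is one simple reading of how a gate can avoid the brevity bias of absolute length penalties.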

8
Answer Presence Drives RAG Rewriting Gains

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold into rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to -3.33 to -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
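
The edits described above can be sketched as plain string operations. This is an illustration under stated assumptions, not the released intervention runner: spans are measured in characters, the sentinel defaults to `[MASK]`, and all function names are hypothetical.

```python
import random


def gold_positions(context: str, gold: str) -> list[int]:
    """Start offsets of every occurrence of the gold answer string."""
    if not gold:
        raise ValueError("gold answer must be non-empty")
    positions, start = [], context.find(gold)
    while start != -1:
        positions.append(start)
        start = context.find(gold, start + 1)
    return positions


def remove_gold(context: str, gold: str, sentinel: str = "[MASK]") -> str:
    """Treatment edit: replace every gold answer span with a sentinel."""
    return context.replace(gold, sentinel)


def placebo(context: str, gold: str, sentinel: str = "[MASK]", seed: int = 0) -> str:
    """Control edit: mask one random span of len(gold) characters that avoids the gold answer."""
    n = len(gold)
    golds = gold_positions(context, gold)
    candidates = [i for i in range(len(context) - n + 1)
                  if all(abs(i - j) >= n for j in golds)]
    if not candidates:
        raise ValueError("no non-answer span of matching length")
    i = random.Random(seed).choice(candidates)
    return context[:i] + sentinel + context[i + n:]


def inject_prefix(context: str, gold: str) -> str:
    """Injection edit: prepend the gold answer only when the rewrite lacks it."""
    return context if gold in context else f"{gold}. {context}"
```

Comparing reader F1 after `remove_gold` against F1 after `placebo` separates the effect of losing the answer string from the effect of merely perturbing the context by the same length.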

7
End-to-End Context Compression at Scale

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

7
Liberating LLM Capabilities in Full-Duplex Speech Models

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

5
Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
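
Fitting a Bradley-Terry model on an incomplete comparison graph can be sketched with the classic minorization-maximization update. This is a generic textbook-style fit, not the paper's implementation. The `prior` term, which adds a few virtual games against a reference player of strength 1, is an assumption added here so that traces with no wins keep a finite, nonzero strength.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons: list[tuple[int, int]], n_items: int,
                      iters: int = 200, prior: float = 0.1) -> list[float]:
    """Estimate strengths from (winner, loser) pairs; pairs never compared are simply absent."""
    wins = [0.0] * n_items
    games = defaultdict(float)  # games[(i, j)] = number of times i met j
    for winner, loser in comparisons:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1

    strength = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            # Virtual games against a fixed reference player of strength 1 (assumed regularizer).
            denom = 2 * prior / (strength[i] + 1.0)
            for (a, b), n in games.items():
                if a == i:
                    denom += n / (strength[i] + strength[b])
            updated.append((wins[i] + prior) / denom)
        strength = updated
    return strength
```

In the example `fit_bradley_terry([(0, 1), (0, 1), (0, 1), (1, 0), (1, 2), (1, 2)], 3)`, traces 0 and 2 never meet, yet the fit still ranks them through their shared opponent, which is what lets a small anchor pool stand in for exhaustive pairwise comparison.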

5
A Geometric Account of Activation Steering through Angle-Norm Decomposition

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
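
To make the angle-norm coupling concrete, the sketch below compares an additive intervention with a norm-preserving rotation toward the same concept direction. The helper names (`angle_norm`, `additive_steer`, `angular_steer`) and the toy vectors are hypothetical; the abstract does not specify the authors' parameterization.

```python
import numpy as np

def angle_norm(h, v):
    """Return (norm of h, angle in radians between h and concept direction v)."""
    norm = np.linalg.norm(h)
    cos = np.dot(h, v) / (norm * np.linalg.norm(v))
    return norm, np.arccos(np.clip(cos, -1.0, 1.0))

def additive_steer(h, v, alpha):
    """Classic additive steering: changes both angle and norm at once."""
    return h + alpha * v / np.linalg.norm(v)

def angular_steer(h, v, theta):
    """Rotate h toward v by theta radians in the plane of (h, v), keeping its norm."""
    norm = np.linalg.norm(h)
    u = h / norm
    w = v - np.dot(v, u) * u              # part of v orthogonal to h
    w_norm = np.linalg.norm(w)
    if w_norm < 1e-12:                    # h already (anti-)parallel to v
        return h.copy()
    w = w / w_norm
    return norm * (np.cos(theta) * u + np.sin(theta) * w)

h = np.array([3.0, 0.0])                  # toy hidden state
v = np.array([0.0, 2.0])                  # toy concept direction
print(angle_norm(additive_steer(h, v, 3.0), v))        # angle pi/4, norm grows
print(angle_norm(angular_steer(h, v, np.pi / 4), v))   # angle pi/4, norm unchanged
```

Both interventions reach the same angle to the concept direction, but only the additive one changes the norm, which is the kind of entanglement the abstract attributes to a single additive coefficient.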

4
Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.

4
DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

4
Why Muon Outperforms Adam: A Curvature Perspective

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
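
The decomposition described above can be written out as follows. The notation is an assumption made for illustration: δ is the one-step parameter update, g the gradient, and H the Hessian, and NDS is taken to be the Rayleigh quotient of H along δ with a Euclidean norm. The paper may define or normalize these differently.

```latex
% Second-order Taylor estimate of the one-step loss decrease
\[
L(\theta) - L(\theta + \delta)
  \;\approx\;
  \underbrace{-\,g^{\top}\delta}_{\text{first-order gain}}
  \;-\;
  \underbrace{\tfrac{1}{2}\,\delta^{\top} H\,\delta}_{\text{curvature penalty}}
\]

% Assumed definition of Normalized Directional Sharpness (NDS)
\[
\mathrm{NDS}(\delta) = \frac{\delta^{\top} H\,\delta}{\lVert\delta\rVert^{2}}
\quad\Longrightarrow\quad
\tfrac{1}{2}\,\delta^{\top} H\,\delta
  = \tfrac{1}{2}\,\lVert\delta\rVert^{2}\,\mathrm{NDS}(\delta)
\]
```

Under this reading, two optimizers with comparable first-order gains and comparable update norms can differ in one-step loss decrease only through NDS, which matches the abstract's claim.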

4
Text-to-Image Models Need Less from Text Encoders Than You Think

Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/

3
Trajectory-Refined Distillation

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural cause of OPD failures, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting fails to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd

3
Phase Marginalization for Patch-Grid Instability in Vision Transformers

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
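
The evaluate, inverse-align, and aggregate steps can be sketched as below. This is a simplified illustration under stated assumptions: `model` is any hypothetical callable that maps an (H, W) array to a dense (H, W) prediction, the four half-patch offsets are a guess at what "structured phases" with K = 4 might look like, and circular shifting with `np.roll` stands in for whatever padding or cropping the authors actually use.

```python
import numpy as np

def phase_marginalize(model, image, patch=16, phases=None):
    """Average dense predictions over several patch-grid phases.

    Assumptions (not from the source): circular shifts, and K = 4 phases
    made of zero and half-patch offsets along each axis.
    """
    if phases is None:
        half = patch // 2
        phases = [(0, 0), (0, half), (half, 0), (half, half)]
    outputs = []
    for dy, dx in phases:
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))   # move the grid phase
        pred = model(shifted)                                   # dense prediction
        # Inverse-align back to the original image coordinate system.
        outputs.append(np.roll(pred, shift=(-dy, -dx), axis=(0, 1)))
    return np.mean(outputs, axis=0)
```

For a model that is already shift-equivariant the result equals a single forward pass; the averaging only matters when predictions depend on where patch boundaries fall.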

2
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present Evaluation Cards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies Evaluation Cards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.

2
SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

We present SigmaScale, a method for learning auxiliary scaling matrices S to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.
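
A rough sketch of how diagonal row and column scaling combines with truncated SVD is shown below. The function name `scaled_truncated_svd` is hypothetical, and the scaling vectors are simply passed in: the learning of those vectors under an activation-aware loss, which is the core of the method, is not reproduced here.

```python
import numpy as np

def scaled_truncated_svd(W, row_scale, col_scale, rank):
    """Low-rank factors A @ B ~= W, computed in a diagonally rescaled space.

    row_scale and col_scale are positive vectors defining the diagonal
    transforms; here they are given, whereas SigmaScale learns them.
    """
    scaled = row_scale[:, None] * W * col_scale[None, :]
    U, s, Vt = np.linalg.svd(scaled, full_matrices=False)
    # Truncate, then undo the scaling so the factors approximate the original W.
    A = (U[:, :rank] * s[:rank]) / row_scale[:, None]   # shape (out, rank)
    B = Vt[:rank] / col_scale[None, :]                  # shape (rank, in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))
A, B = scaled_truncated_svd(W, rng.uniform(0.5, 2.0, 6), rng.uniform(0.5, 2.0, 5), rank=2)
```

The scaling changes which directions the truncation discards, so different scaling vectors give different rank-2 approximations of the same matrix.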

1
Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path X_λ = (1-λ)X_0 + λX_1 that defines Rectified Flow training. We show that the gap between the reconstruction of training and test data follows a bell-shaped curve over λ and accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific λ-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
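
A λ-resolved gap of this kind could be measured roughly as follows. Several things here are assumptions rather than details from the abstract: X_0 is taken to be Gaussian noise and X_1 a data sample, the per-sample signal is the squared error of a velocity prediction against X_1 - X_0 (a stand-in for the paper's reconstruction measure), and `velocity_model` is a hypothetical callable taking `(x_lambda, lam)`.

```python
import numpy as np

def interpolate(x0, x1, lam):
    """Point on the straight path X_lam = (1 - lam) * X_0 + lam * X_1."""
    return (1.0 - lam) * x0 + lam * x1

def per_lambda_error(velocity_model, x1, lambdas, rng):
    """Mean squared velocity error at each lambda for a batch of data samples x1."""
    errors = []
    for lam in lambdas:
        x0 = rng.standard_normal(x1.shape)            # assumption: X_0 is Gaussian noise
        pred = velocity_model(interpolate(x0, x1, lam), lam)
        errors.append(np.mean((pred - (x1 - x0)) ** 2))
    return np.array(errors)

def membership_gap(velocity_model, members, nonmembers, lambdas, seed=0):
    """Non-member error minus member error, resolved over lambda."""
    rng = np.random.default_rng(seed)
    return (per_lambda_error(velocity_model, nonmembers, lambdas, rng)
            - per_lambda_error(velocity_model, members, lambdas, rng))
```

Plotting the returned array against `lambdas` would show whether a bell-shaped curve appears for a given model. Thresholding the error near the peak is one simple way to build a membership test.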

1
EmpiriGraph-Psy: A Dataset and LLM Pipeline for Extracting Empirical Relation Graphs from Psychology Abstracts

Existing scientific relation extraction benchmarks mainly target domains such as computer science, where entities are tasks, methods, datasets, materials, or metrics. This leaves a gap in variable-oriented empirical fields such as psychology, where findings are expressed as relations among constructs, measurements, interventions, and outcomes. We introduce variable-centered empirical graph extraction, the task of mapping scientific abstracts to typed graphs whose nodes are normalized variables and whose edges represent empirical and hierarchical relations. To support this task, we construct EmpiriGraph-Psy, a benchmark of 210 psychology abstracts annotated by domain-trained annotators with normalized variables, concept hierarchies, empirical relation types, and validation states. We evaluate frontier and open-weight LLMs using both direct extraction and a staged graph-construction pipeline that separates variable extraction, normalization, hierarchy construction, evidence selection, relation extraction, and edge validation. The staged pipeline substantially outperforms direct extraction, with the best configuration achieving a macro-F1 of 0.74. Error analysis shows that moderation relations and concept hierarchies remain the most challenging cases, highlighting the difficulty of extracting higher-order empirical claims and implicit abstraction structure from scientific abstracts.

1
Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.

1
WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model's spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model's camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.

1
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.

1
Chiaroscuro Attention: Spending Compute in the Dark

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
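
One way to picture entropy-based routing is sketched below. The abstract does not say how spectral entropy is computed or thresholded, so everything here is an assumption: entropy is taken over the normalized DCT-II power spectrum of each token vector, low-entropy tokens go to DCT mixing, high-entropy tokens go to attention, and the 0.5 threshold is arbitrary. The RBF branch is omitted, since the reported routing collapse leaves it unused.

```python
import numpy as np

def dct_matrix(n):
    """Unnormalised DCT-II basis: row k is cos(pi / n * (i + 0.5) * k)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    return np.cos(np.pi / n * (i + 0.5) * k)

def spectral_entropy(tokens, eps=1e-12):
    """Shannon entropy of each token's DCT power spectrum, scaled to roughly [0, 1]."""
    dim = tokens.shape[-1]
    power = (tokens @ dct_matrix(dim).T) ** 2
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    entropy = -(p * np.log(p + eps)).sum(axis=-1)
    return entropy / np.log(dim)

def route(tokens, threshold=0.5):
    """Hypothetical router: spectrally simple tokens take the cheap DCT path."""
    return np.where(spectral_entropy(tokens) > threshold, "attention", "dct")

tokens = np.stack([np.ones(8), np.eye(8)[0]])   # a flat vector and an impulse
print(route(tokens).tolist())
```

The flat vector concentrates its energy in a single DCT coefficient and is routed to the cheap path, while the impulse spreads energy across the spectrum and is sent to attention.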

1
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

1
PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
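
The Bayes step described above can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: τ is a trajectory with turns τ_1, ..., τ_T, a* is the verified answer, π_S is the student, and π_T is the answer-conditioned teacher. Environment observations are assumed to be equally likely under both models, so they cancel in the ratio.

```latex
% Bayes' rule turns the answer-side ratio into a trajectory-side likelihood ratio
\[
\frac{p(a^{\star}\mid\tau)}{p(a^{\star})}
  \;=\;
  \frac{p(\tau\mid a^{\star})}{p(\tau)}
\]

% Autoregressive decomposition into turn-level evidence terms
\[
\log\frac{p(\tau\mid a^{\star})}{p(\tau)}
  \;=\;
  \sum_{t=1}^{T}
  \log\frac{\pi_{T}(\tau_{t}\mid\tau_{<t},\,a^{\star})}{\pi_{S}(\tau_{t}\mid\tau_{<t})}
\]
```

A positive term for turn t would indicate that knowing the verified answer makes that turn more likely, which is one reading of a turn "supporting" the outcome. A negative term would indicate the opposite.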

1
Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

1
Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.

1
Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

1
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. Lean4Agent introduces FormalAgentLib, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results from FormalAgentLib to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using expressive dependent-type FLs to formally model and verify agent behavior.

1
OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For humanoid robot loco-manipulation tasks, however, existing data sources force an unsatisfying tradeoff between trajectory quality and scalability. Real-world teleoperation provides the highest-quality trajectories but requires dedicated physical space and time-consuming scene resets. Simulation offers an alternative way out of this dilemma: it can produce clean, embodiment-aligned data at scale without any physical hardware. In this paper, we propose OASIS, a simulation-data-driven framework for humanoid loco-manipulation. OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, we further design a hierarchical visuomotor policy for humanoid loco-manipulation. Extensive experiments on the real humanoid robot show that, under zero-shot deployment, the policy trained on our simulation data achieves higher success rates on most tasks than that trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by our simulation rendering, which real-robot data fails to capture. The project page is available at https://oasis-humanoid.github.io/.

1
Pruning and Distilling Mixture-of-Experts into Dense Language Models

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
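
As a rough illustration of the concatenation step described above, the sketch below keeps the top-scoring experts and stacks their weights into one wider feed-forward layer. It is a toy under stated assumptions, not the paper's method: the ungated ReLU expert shape, the uniform weighting of kept experts, the `scores` values, and every function name are hypothetical, and the grouping, magnitude scaling, and distillation stages are left out.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ffn_forward(ffn, x):
    """Toy feed-forward block: W_down @ relu(W_up @ x). ReLU is an assumption."""
    w_up, w_down = ffn
    hidden = [max(0.0, h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)


def make_expert(d_model, d_hidden, rng):
    """Random expert with W_up of shape (d_hidden, d_model) and W_down of shape (d_model, d_hidden)."""
    w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_down = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_up, w_down


def select_top_k(experts, scores, k):
    """Keep the k experts with the highest (hypothetical) scores, in original order."""
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    return [experts[i] for i in sorted(ranked)]


def concat_experts(experts):
    """Stack up-projections row-wise and down-projections column-wise into one dense FFN."""
    w_up = [row for up, _ in experts for row in up]
    d_model = len(experts[0][1])
    w_down = [[w for _, down in experts for w in down[i]] for i in range(d_model)]
    return w_up, w_down


rng = random.Random(0)
experts = [make_expert(4, 8, rng) for _ in range(6)]
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]  # hypothetical importance scores
kept = select_top_k(experts, scores, k=3)
dense = concat_experts(kept)

x = [rng.uniform(-1, 1) for _ in range(4)]
dense_out = ffn_forward(dense, x)
summed = [sum(vals) for vals in zip(*(ffn_forward(e, x) for e in kept))]
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(dense_out, summed)))
```

In this toy setting the concatenated layer equals the unweighted sum of the kept experts, so it carries none of the router's input-dependent weighting. That gap is presumably part of what the distillation stage has to close.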

0
Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic analysis and grammatical reasoning. We propose a pipeline for automatically generating step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks. We evaluate these traces in three settings: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT), on Xibe and Chintang as test cases. Our results show that linguistic reasoning traces are most effective as inference-time guidance: in ICL, reliable sentence-specific traces substantially improve translation performance across most models, languages, and metrics. In contrast, using the linguistic reasoning traces as training data yields smaller and less consistent gains, as models learn the trace format but often generate erroneous content. These findings suggest that LLMs can leverage grammatical information for low-resource MT when given reliable linguistic analyses, while learning to generate such analyses remains a major bottleneck.

0
CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.

0
EMMA: Extracting Multiple physical parameters from Multimodal Data

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026

0
PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.

0
Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory confabulation and introduce the Reflection Repetition Rate (RRR), a log-based metric that detects repeated reliance on incorrect reflective content. Using RRR, we identify 16 frozen environments in ALFWorld, where 0 of 121 reflections mention the correct target object, and 4 analogous cases in HumanEval. Our mitigation replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention from 0% to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments, suggesting that reflective memory can reinforce false beliefs rather than correct them.
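
The abstract does not define RRR precisely. One plausible reading, sketched below purely as an assumption, is the share of stored reflections in an environment that repeat an earlier reflection after light text normalization. The function name, the normalization, and the sample log are all hypothetical.

```python
def reflection_repetition_rate(reflections):
    """Share of reflections (after the first) that repeat an earlier one.

    Assumed definition for illustration only; the paper's metric may differ.
    """
    seen = set()
    repeats = 0
    for text in reflections:
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key in seen:
            repeats += 1
        seen.add(key)
    comparable = len(reflections) - 1  # the first reflection cannot be a repeat
    return repeats / comparable if comparable > 0 else 0.0


# Hypothetical reflection log for one environment across four trials.
log = [
    "I should look for the mug in the cabinet.",
    "I should look for the mug in the cabinet.",
    "i should look for the  mug in the cabinet.",
    "Maybe the mug is on the desk.",
]
rate = reflection_repetition_rate(log)
print(f"RRR = {rate:.2f}")
```

A high value on a log like this would flag an agent that keeps storing the same diagnosis, which is the pattern the paper associates with memory confabulation.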

0
Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, which makes atmospheric compensation essential for characterizing a target of interest. Despite its importance, this compensation has been largely overlooked because it is difficult both in practice and to model. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN-generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code are publicly available at: https://factral.co/SAE-LWIR/

0
05

PRODUCT HUNT

05.00
PRODUCT HUNT

Product Hunt - June 9, 2026

Product Hunt Daily Feed: Featuring noteworthy tech launches.

prostir zvuku

A spatial nature sound mixer for Mac

0
Kimi Work

The AI desktop for knowledge work

0
Fluido

Turn any Figma shape into liquid metal in one click

0
TrakMac

Voice-first macro tracking for fitness enthusiasts

0
Reve 2.0

Generate and edit 4K images through layout-based control

0
Uiverse Design

De-slop your AI generated websites

0
Solarch

Interactive diagrams with AI, and your code always in sync

0
VC Boom

Score your deck, meet investors who fit, raise more. Boom!

0
Limelight

Make your screen recordings easy to follow

0
agentcad

A CAD design tool for coding agents (free + open source)

0
TravelMind

AI-powered city discovery built on taste, not reviews

0
Whistle

A fitness coach with personalized plans

0
agmsg

Stop copy-pasting between your AI coding agents

0
ZeroGPU

The compute efficient layer for AI inference

0
hora Calendar

Google calendar built for the Mac

0
Overly

Search and ask questions inside lecture videos

0
OrchestraML

From English prompt to deployed ML model with human approval

0
ChocolateBar

Add a row under your menu bar for hidden icons

0
Krisp Voice Translation API

Real-time speech-to-speech translation API

0
Cove for Mac

Like a save/load game for your work

0
Signal Recorder SR-7

On-device voice recorder that transcribes + exports Markdown

0
Log Cam

Record log and ProRes video from RAW frames on iPhone

0
Nodrix

Your own IoT cloud, deployed to your Cloudflare account

0
NudgeFile

Automatically organize, rename, and manage files with AI

0
Mic Drop 3.0

Mute your mic in any app—with your AirPods

0
BooBar

AI Dynamic Island for your Mac

0
AgentOS

Manage AI agents, tasks, workspaces from one control layer

0
Pixel Snapper

Editor to clean up AI-generated pixel art

0
Browse.sh

Give your agents muscle memory for automating the web

0
Vaani

Lip-synced AI dubbing for creators, brands and studios

0
The Virtual OS Museum

Relive vintage operating systems right on your desktop

0
Honen

Automated teaching + learning infrastructure for any company

0
Supaste

Clipboard Manager for macOS

0
Tamadoggo

A living journal for your pet's life, with AI insights

0
Sigma File Manager

Free, open-source, cross-platform, modern file manager app

0
NTSC-RS

Open-source video emulation of analog TV and VHS artifacts

0
Dreambeans by Google Labs

Daily AI stories personalised from your Google apps

0
Smmall Cloud for iOS

Simple file sharing on your iPad or iPhone

0
CabinLink

Flight map from cabin Wi-Fi

0
Job Postings API

View, monitor, and analyze 1.8M+ US jobs

0
Wave

Turn your voice into text — local or cloud, your choice

0
Manus Shopify Connector

Build and manage Shopify stores from one chat

0
QWERTYS by Smart Keys

My keyboard fell apart. Now it's your problem.

0
MAI-Image-2.5

Generate and edit images with precise scene control

0
Fox Issue Tracker 4

Track, plan, and release.

0
Gaming services by IFTTT

Level up the way you play with Steam, Dota 2, and more

0
Navi+ Menu Builder

Add Tab Bar, Mega Menu & more to any website — no code

0
Google Search Profiles

Profile for publishers/creators to highlight work on Search

0
Microsoft MAI-Voice-2

Expressive TTS with voice cloning in 15 languages

0
SellerClaw

A team of AI agents that runs your stores across channels

0
06

TECHMEME

06.00
TECHMEME

Techmeme - June 9, 2026

Techmeme Digest: Major tech headlines and industry conversations.

Vinyl Equity, an SEC-registered transfer agent that launched a payments platform earlier this year, raised a $20M Series A led by Jump Capital (Ryan Lawler/Axios)
Source: Techmeme · Published: Jun 9, 2026

Ryan Lawler / Axios: Vinyl Equity, which is building a modern transfer agent for public companies, raised $20 million in Series A funding led by Jump Capital, CEO Rob Schoder tells Axios exclusively.

Decentralized lending protocol Morpho raised $175M led by Paradigm, Ribbit Capital, and a16z Crypto in a token sale valuing Morpho at up to $2B (Ben Weiss/Fortune)
Source: Techmeme · Published: Jun 9, 2026

Ben Weiss / Fortune: Paul Frambot has a message for the “suits” at traditional financial institutions. “I think TradFi is going to have to wear shorts,” …

Apple unveils new Apple Foundation Models: two on-device models, including a 20B-parameter multimodal model called AFM 3 Core Advanced, and three cloud models (Apple Machine Learning Research)
Source: Techmeme · Published: Jun 9, 2026

Apple Machine Learning Research: Our next generation of Apple Intelligence is centered around our users, integrated deeply into our operating systems …

French government warns that hackers used a hijacked user account to breach Tchap, its encrypted messaging app for civil servants with 300,000+ monthly users (Sergiu Gatlan/BleepingComputer)
Source: Techmeme · Published: Jun 9, 2026

Sergiu Gatlan / BleepingComputer: DINUM, the digital affairs directorate of the French government, warned that hackers used a hijacked user account to breach Tchap …

New York-based Standard Bots, which wants to make AI-powered robotic arms in the US, raised $200M led by General Catalyst and Robostrategy at a $1B valuation (Bloomberg)
Source: Techmeme · Published: Jun 9, 2026

Bloomberg: Standard Bots has raised $200 million in a new round of funding to ramp up manufacturing of robotic arms in the US as the country vies …

The EU says Apple decided not to roll out Siri AI in the EU after it had unsuccessfully requested to be exempted from interoperability obligations for the tool (Inti Landauro/Reuters)
Source: Techmeme · Published: Jun 9, 2026

Inti Landauro / Reuters: Apple decided not to roll out its new Siri AI tool in the European Union after it had unsuccessfully requested to be exempted …

Sources: Taiwan is considering restricting AI chip sales to all customers in China, not just companies on an export blacklist like Huawei, to align with the US (Bloomberg)
Source: Techmeme · Published: Jun 9, 2026

Bloomberg: Taiwan authorities are considering much stricter export controls on AI chip sales to China to further align with US measures …

Sources: Microsoft laid off 200 to 400 Azure unit employees in Beijing and Shanghai, at least its third round of downsizing in China in two years (South China Morning Post)
Source: Techmeme · Published: Jun 9, 2026

South China Morning Post: Microsoft is laying off hundreds of employees at its Azure cloud unit in China as the US technology giant navigates tightening data regulations …

The UK is conducting a full review of its NHS contract with Palantir, amid growing pressure to terminate the deal in 2027 over reliance on US tech companies (Sam Tabahriti/Reuters)
Source: Techmeme · Published: Jun 9, 2026

Sam Tabahriti / Reuters: Britain is conducting a full review of its National Health Service contract with U.S. data analytics firm Palantir (PLTR.O) …

Beacon Software, which acquires niche software businesses and transforms them with AI, raised a $225M Series C, bringing its total funding to $550M (Sarah Klearman/Wall Street Journal)
Source: Techmeme · Published: Jun 9, 2026

Sarah Klearman / Wall Street Journal: In private equity, a roll-up has historically meant buying companies, consolidating them to achieve an economy of scale and then flipping the larger entity for profit.

France-based Alta Ares, which is building AI-powered air defense systems to counter drones and missiles, raised a €50M Series A led by Air Street Capital (Daphné Leprince-Ringuet/Sifted)
Source: Techmeme · Published: Jun 9, 2026

Daphné Leprince-Ringuet / Sifted: Alta Ares builds interceptors which it says are currently deployed in several active combat zones.

In a UK online safety consultation, the US urges the UK against an under-16 social media ban, saying it would place a "disproportionate" burden on US Big Tech (Dan Milmo/The Guardian)
Source: Techmeme · Published: Jun 9, 2026

Dan Milmo / The Guardian: Trump administration says restrictions could impose ‘disproportionate’ burden on US tech companies.

Sources: China is drafting plans to spend ~$295B over the next five years on building AI data centers, sourcing 80%+ of tech from local suppliers like Huawei (Charlie Zhu/Bloomberg)
Source: Techmeme · Published: Jun 9, 2026

Charlie Zhu / Bloomberg: China is preparing to spend around 2 trillion yuan ($295 billion) over the next five years on building data centers across the country …

A group of Chinese tech companies, including Alibaba and CXMT, launches a ~$577M PE fund to boost China's "hard tech" sectors amid tightening US export curbs (Ann Cao/South China Morning Post)
Source: Techmeme · Published: Jun 9, 2026

Ann Cao / South China Morning Post: Chip giants launch a 3.91 billion yuan private equity fund to boost the country's ‘hard tech’ sectors amid tightening US export curbs.

IT management platform NinjaOne raised $400M in a secondary share sale at a $12.3B valuation, up from $5B in February 2025, and says its ARR has hit $600M (Rebecca Torrence/Bloomberg)
Source: Techmeme · Published: Jun 9, 2026

Rebecca Torrence / Bloomberg: NinjaOne, an IT management platform, has more than doubled its valuation to $12.3 billion in a new financing deal …

07

STARTUP ARCHIVE

07.00
STARTUP ARCHIVE

Startup News - June 9, 2026

Startup News Roundup: Aggregating key funding and launch updates.

Marc Andreessen on the 5 personality traits of an innovator
Source: Startup · Published: Mar 31, 2026

“When you’re talking about real innovators—people who actually do really creative, breakthrough work—I think you’re talking about a couple things:”

Steve Jobs explains the importance of both thinking and doing
Source: Startup · Published: Mar 30, 2026

“The doers are the major thinkers. The people who really create the things that change this industry are both the thinker-doer in one person.”

Tobi Lutke explains what the VCs who passed on Shopify got wrong
Source: Startup · Published: Mar 27, 2026

“What a lot of free-market thinkers don’t understand is that between the demand and eventual supply lies friction.”

Sam Altman explains how he decides to invest in a startup after 10 minutes
Source: Startup · Published: Mar 26, 2026

“Does this person have the potential to be the next Mark Zuckerberg?… [You don’t get to] 100% accuracy, obviously, but it’s good enough that our business model works.”

Jony Ive recounts the time Steve Jobs called him vain
Source: Startup · Published: Mar 25, 2026

In the clip below, Jony Ive recounts the time he asked Steve Jobs to be less harsh in his critique of a piece of work.

Jeff Bezos’s two pieces of advice for aspiring entrepreneurs
Source: Startup · Published: Mar 24, 2026

“The advice that I would give entrepreneurs is don't chase the hot new thing. It's so hard to catch something that everybody already knows is hot.”

Elad Gil: “Things that work tend to work pretty fast”
Source: Startup · Published: Mar 23, 2026

“I do think there’s a bit of a myth in Silicon Valley that you should keep grinding no matter what and it’s just about perseverance, and I think that’s really bad advice.”

Paul Graham on why starting with a “small, intense fire” is the key to startup growth
Source: Startup · Published: Mar 20, 2026

“You have to know who those first users are and how you're going to get them.”

Keith Rabois on how to identify great talent
Source: Startup · Published: Mar 19, 2026

“What you want to do with every single employee every single day is expand the scope of their responsibilities until it breaks… and that’s the role they should stay in.”

Wealthfront CEO on why advertising spend makes it harder to find product/market fit
Source: Startup · Published: Mar 18, 2026

“The way that you know you have product/market fit is if you have exponential organic growth.”

Eric Schmidt on why most companies get strategy wrong
Source: Startup · Published: Mar 17, 2026

“Work very, very hard to figure out what the world’s going to look like in five years. What will people be doing? What will your customers want? Where will costs be?”

Mark Zuckerberg: “You can’t 80/20 everything”
Source: Startup · Published: Mar 16, 2026

"There’s the famous 80/20 rule where you get 80% of the benefit by doing 20% of the work, but you can’t just 80/20 everything. There have to be certain things that you are just the best at."

Marc Andreessen on Mark Zuckerberg’s founder “superpower”
Source: Startup · Published: Mar 13, 2026

“A great superpower that Mark Zuckerberg has that is probably not well-understood enough is he does not get emotionally upset in stressful situations.”

Sam Altman explains how to come up with a great startup idea
Source: Startup · Published: Mar 12, 2026

"If you start a startup without a good idea… you’ll be under pressure to make something up and it won’t work that well."

Jeff Bezos on the problems with proxies and managing to metrics
Source: Startup · Published: Mar 11, 2026

“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”

Airbnb founder Brian Chesky on how to design an amazing user experience
Source: Startup · Published: Mar 10, 2026

“If you can design something really amazing using the hand-crafted part of your brain, then you can reverse-engineer how to industrialize this millions of times over.”

Spencer Rascoff: “I will never invest in a consumer startup with paid marketing”
Source: Startup · Published: Mar 9, 2026

“If you’re actually trying to grow a product, the best levers for doing that are often within the product itself.”

Patrick Collison explains why it sometimes makes sense to quit
Source: Startup · Published: Mar 6, 2026

“One thing I’ve learned myself the hard way, is that it is easier to tear down a company and restart it in Silicon Valley, than it is to constantly try to pivot or keep something alive.”

Jeff Bezos recounts the time he called Amazon’s customer service number mid-meeting to prove a metric was wrong
Source: Startup · Published: Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disagree, the anecdotes are usually right.”

Ben Horowitz: “Nobody was born a great manager. It’s a very unnatural job.”
Source: Startup · Published: Mar 4, 2026

“If you can’t build a great product, it doesn’t matter if you can build a great company.”

03

ALSO TODAY

3 MORE SOURCES
08

SOLIDOT

08.00
SOLIDOT

Solidot News - June 9, 2026

Solidot Feed: Highlighting essential tech & open-source news.

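UN report warns the oceans are under immense pressure

The newly released World Ocean Assessment warns that multiple pressures, including climate change, pollution, and overexploitation, continue to erode ocean health, and that the future of the oceans is closely tied to the future of humanity. The report notes that the oceans profoundly affect everyone's life, even far from the coast. They absorb most of the planet's excess heat and greenhouse gases and play a key role in slowing climate change. They also provide food, oxygen, and medicinal resources for billions of people and underpin global trade, tourism, and a large number of jobs. The report stresses that a deteriorating marine environment will affect not only coastal regions but also food security, supply-chain stability, and global economic development. The assessment shows that ocean warming and sea-level rise are accelerating. Driven by melting ice sheets and the thermal expansion of seawater, the rate of global sea-level rise has increased from at most 1.9 mm per year before 2015 to 4.3 mm in 2023. The Arctic is warming four times faster than the global average. Meanwhile, oxygen-depleted zones have expanded to about 4.5 million square kilometers, squeezing the habitat of many marine species. About 80% of coral reefs in the Caribbean have disappeared since the 1970s. If global warming exceeds 1.5 degrees Celsius above pre-industrial levels, 90% of the world's coral reefs may be at risk of disappearing. The report finds that about 52 million tonnes of plastic waste enter the oceans every year, forming roughly 24 trillion microplastic particles that have affected more than 4,000 marine species.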

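Microsoft open-source tools implanted with credential-stealing malware

Microsoft took down dozens of its open-source projects hosted on GitHub after a security firm found that they had been compromised and implanted with malicious code that steals passwords and other sensitive credentials. In a statement, Microsoft said it is investigating, that some of the projects have been restored after review, and that as part of the investigation it has notified the small number of users who downloaded affected projects. The investigation shows that at least 73 projects were affected. This is the second compromise of Microsoft's open-source repositories in the past month.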

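Heat may affect 97 World Cup matches

An analysis released by Climate Central finds that matches at the World Cup in the United States, Canada, and Mexico will face hot weather driven by global warming, raising the likelihood that player performance will suffer. The tournament comprises 104 matches across 16 venues, and 97 of them may see heat severe enough to impair players' ability to recover, among other effects. Beyond the increased health risk to players, the quality of play may also suffer. The tournament, co-hosted by the United States, Mexico, and Canada, runs from June 11 to July 19 local time. Temperature forecasts for the tournament period, based on historical data, show a high probability of temperatures above 28 °C at 97 matches. Earlier research indicates that temperatures above 28 °C affect players' running speed, distance covered, and recovery time, and also influence tactics and playing style.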

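Company approves employee's request to avoid AI on religious grounds

As US companies push AI hard, public resistance to it is also growing. Erin Maus, a 34-year-old software engineer, has found a workaround: a religious exemption from using AI. She is a Unitarian Universalist, a member of a liberal, inclusive religion that embraces pluralism and interconnectedness and promotes personal spiritual growth. Citing AI's environmental and ethical problems, she argued that using AI conflicts with her religious beliefs. Her employer approved the religious exemption last month. Maus says she still writes code by hand and reviews it herself, just as she did two years ago.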

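Cyberspace Administration of China sets limits on online product reviews

The Cyberspace Administration of China (CAC) and the State Administration for Market Regulation have jointly issued the Norms for Online Review Activities (《网络测评活动规范》). The CAC said it drew up the norms because some online reviews suffer from problems such as exaggerated claims, rating without testing, and reviews that double as sales, which not only undermine consumer trust and the shopping experience but also disrupt the market. The norms require the following. (3) Samples selected for online reviews must be ordinary goods that consumers can buy on the market, with a traceable source, and must not be special items prepared for the review. Anyone conducting online reviews who accepts a third-party commission or sponsorship, or who has a financial relationship with a party connected to the review sample, must disclose this prominently. (4) Where an online review involves testing a product's functions, performance, or similar attributes, the testing must be entrusted to an inspection and testing body holding the legally required qualifications and carried out according to the relevant standards and technical specifications. The standards and specifications used must be stated, and test samples along with test data, images, video, and other records must be retained as required so that the data and results are traceable. (5) Where a product is evaluated solely on subjective impressions such as perception, observation, or experience, without any testing, this must be stated, and the content must be prominently labeled “personal experience only” or “subjective impression, for reference only.”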

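Social networks taken over by trends

Social networks are no longer about socializing; they are about following trends. Most social activity today happens in messaging apps. Social media is turning into a passive, television-like medium, except that instead of needing a remote control to change channels, the platform's algorithm has already tailored the content to you. The platform profits from your information and, in return, offers its content for free. Advertising remains the core business model of social platforms, and revenue keeps growing. Global social media ad revenue is projected to reach $317 billion in 2026, up from $277 billion in 2025. Of that, Meta's ad revenue is projected to reach $243 billion, which would put it ahead of Google for the first time. Large platforms such as Instagram and TikTok focus more and more on entertainment and content discovery, while apps such as WhatsApp have become the main venue for social activity, though messaging apps of this kind are harder to monetize.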

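Apple announces Google Gemini-powered Siri AI

At its WWDC 2026 developer conference, Apple announced a new generation of Apple Intelligence and Siri AI powered by Google Gemini. The computation behind the AI features runs on device or in a private cloud. According to Apple, Siri can draw on its understanding of personal context to search messages, mail, photos, and more, and can complete tasks across apps through more system-wide app actions. Siri AI can answer questions about what is on the user's screen, and can also draw on broad world knowledge and go online for the latest information to generate useful answers. Through a dedicated Siri app, users can revisit past conversations or start new ones, with conversation history synced privately across their devices through iCloud. Because of EU privacy and consumer-protection regulations, the AI features will not launch in the EU for now. Apple also said that the launch timing of Apple Intelligence depends on regulatory approval, and that Siri AI and other new Apple Intelligence features are not yet available in mainland China.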

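OpenAI files for IPO

OpenAI has confidentially filed for an IPO. A confidential filing lets a company move ahead with its listing plans without publicly disclosing its financials. OpenAI, SpaceX, and Anthropic are the most closely watched IPO candidates of the moment, and the three companies' combined market value could reach $4 trillion. OpenAI said in a statement that it has not decided on a listing date, and it did not disclose how many shares it will sell. The company said it will choose the best moment to go public. OpenAI's most recent funding round was in March of this year, when it raised $122 billion at an $852 billion valuation, which already trails that of its main rival Anthropic.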

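Obesity affects sperm quality and alters epigenetic marks

According to a study published in the journal Current Obesity Reports, obesity is not simply the result of personal choices: the heritability of obesity risk is as high as 40% to 70%, and it can be passed from generation to generation through complex biological and environmental factors. The latest evidence shows that obesity affects sperm quality and alters epigenetic marks. These changes may affect children's appetite regulation, metabolism, and long-term disease risk. The good news is that the changes are reversible. Lifestyle changes and weight loss can improve sperm health and alter obesity-related epigenetic patterns.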

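Webb measures the mass of a dormant black hole in the early universe for the first time

Using the James Webb Space Telescope together with gravitational lensing, astronomers have for the first time measured the mass of a dormant black hole in the early universe. The black hole lies at the center of the galaxy MRG-M0138, which has stopped forming stars; the black hole itself is no longer devouring surrounding matter and is dormant. MRG-M0138 sits behind a massive galaxy cluster and is magnified about 30 times by gravitational lensing. The black hole is roughly 10 billion light-years from Earth and has a mass 6 billion times that of the Sun. Astronomers determined the mass by combining gravitational lensing with the effect of the black hole's gravity on the motion of stars.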

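Platform algorithms pose a risk to democracy

A growing body of evidence shows that social media platform algorithms pose a risk to democracy. Because they are opaque and optimized to maximize engagement and time on the platform, with no regard for the quality of what they push, the algorithms are considered the chief culprit behind political polarization. Take X as an example: after Elon Musk announced his support for Trump in 2024, the visibility of Republican-leaning accounts rose markedly. Musk's own posts between July and November 2024 drew a cumulative 17.1 billion views, more than all political campaign ads on the platform combined. During Germany's 2025 federal election, half of the party-related content that the major platforms' algorithms recommended to young users involved far-right parties. One analysis found that X's algorithm disproportionately amplified content from politically extreme parties, especially the far right, while systematically suppressing centrist parties. Another study found that, compared with a chronologically ordered feed, seven weeks of exposure to X's algorithmic feed shifted users' political attitudes in a more conservative direction, and the shift did not reverse after the algorithm was turned off. These studies indicate that platform algorithms, as they currently operate, are bad for democracy. One consequence of algorithms amplifying extreme voices is a distorted perception of how opinions are distributed: people who voice fringe views come to believe they are in the mainstream, a product of network homophily known as the “false consensus effect.” Without strong safeguards, we will end up in an increasingly polarized, divided, and authoritarian society.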

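GLP-1 weight-loss drugs linked to lower breast cancer risk

According to a study published in the journal JCO Oncology Practice, taking GLP-1 weight-loss drugs is associated with a lower risk of breast cancer in women. A retrospective analysis of more than 110,000 women aged 45 to 80 found that those taking GLP-1 drugs had about a 30% lower risk of breast cancer than those who were not. Because this is an observational study, whether GLP-1 drugs actually cause the lower breast cancer incidence remains to be established by further research. GLP-1 drugs mimic the natural human hormone glucagon-like peptide-1, which helps regulate blood sugar and appetite. First approved for type 2 diabetes and now widely used for weight loss, they may also help prevent cancer. The researchers note that GLP-1 drugs act on many targets and pathways involved in cancer development, which makes them worth further study.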

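Microsoft doubles down on Xbox exclusives again

Following Sony, Microsoft is returning to a strategy of exclusive games. Sony has stopped porting its first-party AAA games to PC, while Microsoft's Xbox had begun porting its AAA games to Sony's PlayStation. After taking over, however, new CEO Asha Sharma reversed that approach, stressing that the Xbox platform “must have exclusive content and services.” At Sunday's Xbox Games Showcase, Microsoft announced that Gears of War: E-Day and Clockwork Revolution will be Xbox exclusives, and not timed exclusives. Microsoft said that games previously announced for PS5, such as Halo: Campaign Evolved and Forza Horizon 6, will still ship as planned.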

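Claim a free NVIDIA DLI self-paced course worth $30 or $90 and take the assessment to earn a certificate

How to claim: users who have never registered as an NVIDIA developer can use the link below to pick one paid DLI online self-paced training course for free. The course comes with a cloud-based lab environment and can lead to an NVIDIA training certificate. Each user (one per email account) may choose only one course. https://developer.nvidia.cn/login?ncid=ref-dev-557858&sfdcid=Zhiding The current selection includes 7 courses in English and 5 in Chinese. Courses may be withdrawn at any time, and free places are limited and offered first come, first served.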

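2025 International Obfuscated C Code Contest announces its winners

The 29th International Obfuscated C Code Contest (IOCCC), held in 2025, has announced its winning entries. The IOCCC is an international programming competition whose goal is to write the most creative and most incomprehensible C code. The 22 winning entries of IOCCC29 include a GBA emulator by Nick Craig-Wood whose source code is laid out to look like a GBA console; a virtual machine by Adrian Cable that is only 366 bytes yet can run DOOM, whereas virtual machine codebases are usually large (QEMU, for example, has about 2 million lines of code); and an entry by Taiwanese developer jingp49 whose source code is shaped like the TARDIS, the time machine from Doctor Who. The IOCCC organizers said all 22 winning programs were highly creative, and that both the number and the quality of submissions reached an all-time high.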

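New drug functionally cures some hepatitis B patients

GSK has announced results from two replicate double-blind trials of its experimental hepatitis B drug bepirovirsen (bepi): the drug functionally cured 19% of patients. About 240 million people worldwide have chronic hepatitis B, and 1.1 million die from it each year. Most people with chronic hepatitis B receive no treatment. Completely curing hepatitis B is very difficult, so a drug's efficacy is judged mainly by functional cure, meaning the virus can no longer be detected. Of the 1,220 patients who received bepirovirsen injections, 233 achieved a functional cure, while no one in the control group did. The researchers stressed that bepirovirsen has limited effect for most patients with chronic hepatitis B.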

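AI threatens natural resources that billions of people depend on

The United Nations University Institute for Water, Environment and Health has released a report on the environmental cost of AI's energy use, covering its carbon, water, and land footprints. The report projects that by 2030 the data centers supporting artificial intelligence (AI) worldwide will consume 945 TWh of electricity per year, that the associated water use will equal the basic annual domestic water needs of 1.3 billion people, and that the land occupied will exceed 14,500 square kilometers. The study finds that every kilowatt-hour of electricity powering AI carries three environmental footprints at once: a carbon footprint from energy production, a water footprint from power generation and cooling, and a land footprint from building energy infrastructure and extracting resources. According to the report, training GPT-5 is estimated to require about 100 GWh of electricity, equivalent to the annual residential electricity use of about 770,000 people in sub-Saharan Africa, along with roughly 1 billion liters of water and about 1.5 square kilometers of land. Training is only one part of the AI life cycle. Once a model is deployed, the process that keeps consuming resources is inference, in which the model continually responds to user requests and generates content. The report estimates that inference accounts for 80% to 90% of AI's total energy consumption. In 2025, data centers worldwide consumed 448 TWh of electricity. Counted as a country, they would be the world's 11th-largest electricity consumer, behind France and ahead of Saudi Arabia.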

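Scientists precisely edit genes in human embryos

In 2018, Chinese scientist He Jiankui disclosed that he had used CRISPR gene editing to modify human embryos, resulting in the birth of two gene-edited baby girls. He was later sentenced to three years in prison for it. CRISPR is not a highly precise gene-editing technique and is prone to off-target effects. Now Dieter Egli, an associate professor of developmental cell biology at Columbia University, has worked with Nathan Treff of the DNA-testing startup Nucleus Genomics and others to edit genes in human embryos using base editing, a more precise technique that targets a single base in a DNA sequence and can reduce side effects. The new study targeted two genes: one that raises the risk of heart disease, and one linked to blood disorders such as sickle cell anemia. The researchers said the technique could help repair disease-causing mutations in embryos, but that designer babies remain a long way off. The experiments found that the edits did not occur uniformly across all cells: some cells carried the edited base while others retained the original, a phenomenon known as mosaicism.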

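US government considers taking stakes in AI companies

The US government is considering holding equity in AI companies. OpenAI CEO Sam Altman is in ongoing talks with the White House about the government possibly taking a stake in the company. The discussions have been under way for more than a year, and this week Altman met with a number of lawmakers and officials in Washington to discuss regulation and the latest developments in AI. As part of a potential agreement, OpenAI might donate equity to the US government to establish some form of public wealth fund. The fund could “invest in diversified long-term assets,” allowing citizens to share in the “gains” from AI development. During Trump's second term, the government has already taken stakes in Intel, IBM, and companies in quantum computing and critical minerals.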

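India's population may start declining sooner than expected

In the 1970s, Parul Gayen lived in a Delhi slum that was full of children. Her mother had six siblings, and her grandfather had eleven. Her husband, Swapan, had six siblings (a seventh died young). The two married at 16 and have three children. She is now 58, but only two of their children have decided to have children of their own, and each is having only one. Times have changed. An only child will feel lonely, she says. India is now the world's most populous country, but it is heading down the same path as China, whose population has been shrinking since 2021. India's fertility rate is falling faster and earlier than expected. Fertility in India's populous poorer states is converging on that of its richer states: Tamil Nadu, with 77 million people, and West Bengal, with about 100 million, both have a total fertility rate of 1.3, the same as Finland. The average total fertility rate in India's cities is 1.5. India's population is projected to peak at 1.55 billion.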

09

APP STORE RANK

09.00
APP STORE RANK