TEXT VIEW · TODAY'S DIGEST · 36 HEADLINES ACROSS 8 SOURCES

ISSUE 0879
THU, MAY 28, 2026
Discover the best information organized by OrangeBot.AI

The web,
read by a bot.

Ten sources — Hacker News, Product Hunt, HuggingFace, Techmeme and more — filtered, tagged, and summarized every morning for builders who don’t have time to scroll.

01

AI DIGEST

UPDATED DAILY · EDITOR'S PICK

AI News Summary

May 28, 2026

Here is a summary of today's key news events:

U.S. and Iran Exchange Fire Amid Tense Negotiations

U.S. forces shot down Iranian drones, which prompted retaliatory strikes from Tehran. The direct military exchange is creating significant market volatility, causing oil prices to fluctuate as the two nations simultaneously engage in talks aimed at ending the conflict and reopening the vital Strait of Hormuz shipping lane.

Stock Markets Retreat After Hitting Record Highs

U.S. stock indexes pulled back today after reaching record levels in the previous session. The rally had been driven by surging demand for chip manufacturers, but today's decline, led by technology stocks, indicates a pause as investors weigh economic uncertainty and geopolitical tensions.

Major Canadian Banks Boost Payouts to Shareholders

Following a strong financial quarter, several of Canada's largest lenders, including Toronto-Dominion Bank and the Bank of Montreal, announced they are increasing their dividend payments. The move signals confidence in the economic outlook, while Canadian Imperial Bank of Commerce also announced it is selling its Caribbean division to focus on North American growth.

Tech Sector Sees Major AI Partnerships and Regulatory Action

Cloud data company Snowflake saw its shares soar over 35% after announcing a major partnership with Amazon Web Services (AWS) to advance AI capabilities. This comes as many tech companies face mounting costs related to AI development, and as the EU begins to enforce its new Digital Services Act, punishing a second company after X (formerly Twitter) for violations.

Scientists Warn Global Warming Milestone is Almost Certain by 2030

A new report from climate scientists states there is a 91% chance the world will exceed a 1.5°C average temperature rise by 2030. This threshold is a critical benchmark in global climate agreements, and crossing it is expected to lead to more severe and widespread environmental consequences.

02

ON THE WIRE

6 SOURCES
02

HACKER NEWS

Hacker News - May 28, 2026

Hacker News Feed: Highlighting key posts and discussions.

The Ask

(randsinrepose.com)

9961
Last.fm is now independent

(support.last.fm)

768200
I'm Tired of Talking to AI

(orchidfiles.com)

1938923
The Melancholy of Slaying Monsters

(thereader.mitpress.mit.edu)

286139
Cloudflare Flagship

(developers.cloudflare.com)

344172
03

HUGGINGFACE

HuggingFace - May 28, 2026

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable, permutation-symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.

103
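
As a rough illustration of the simplex idea in the Gamma-World abstract above, the sketch below builds the vertices of a regular simplex for n agents and checks that every pair of vertices is the same distance apart, so no agent slot is geometrically privileged. The function names and the choice of centered one-hot vectors as the embedding are assumptions made for illustration; the abstract does not say how the paper maps such vertices into rotary phase offsets.

```python
import itertools
import math


def simplex_vertices(n_agents: int) -> list[list[float]]:
    """Return n_agents points forming a regular simplex (centered one-hot vectors).

    Hypothetical helper: the embedding used by the paper is not given in the abstract.
    """
    if n_agents < 2:
        raise ValueError("need at least two agents")
    mean = 1.0 / n_agents
    return [
        [(1.0 if i == j else 0.0) - mean for j in range(n_agents)]
        for i in range(n_agents)
    ]


def pairwise_distances(points: list[list[float]]) -> list[float]:
    """Euclidean distance between every unordered pair of points."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


if __name__ == "__main__":
    for n in (2, 4):
        dists = pairwise_distances(simplex_vertices(n))
        print(n, [round(d, 6) for d in dists])
```

Because all pairwise distances coincide, relabeling the agents leaves the geometry unchanged, which is the permutation-equivalence property the abstract describes.
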
ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.

72
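
The length-dependent bias described in the ProRL abstract above can be seen in a toy sketch: when step rewards have a positive mean, the summed path reward grows with path length, and subtracting the expected step reward removes that growth. The function names and the use of a single scalar `expected_step_reward` are assumptions for illustration; the abstract does not describe how the paper estimates the expectation.

```python
def path_return(step_rewards: list[float]) -> float:
    """Uncentered path-level reward: the plain sum of step rewards."""
    return sum(step_rewards)


def centered_path_return(step_rewards: list[float], expected_step_reward: float) -> float:
    """Sum of step rewards after subtracting the expected per-step reward."""
    return sum(r - expected_step_reward for r in step_rewards)


if __name__ == "__main__":
    mu = 0.5  # assumed expected step reward (illustrative value)
    short_path = [mu] * 3
    long_path = [mu] * 10

    # Uncentered: the longer path scores higher purely because it is longer.
    print(path_return(short_path), path_return(long_path))

    # Centered: average-quality steps contribute nothing at any length.
    print(centered_path_return(short_path, mu), centered_path_return(long_path, mu))
```

After centering, only steps that beat or miss the expected reward move the path score, so extending a path with average steps yields no extra signal.
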
Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.

63
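
To see why an all-wrong group suppresses the learning signal, as the AXPO abstract above notes, the sketch below uses the commonly described GRPO-style group-relative advantage: each reward minus the group mean, divided by the group standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions; implementations differ in these details, and this is not the paper's code.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # All rollouts wrong: every advantage is zero, so there is no gradient signal.
    print(group_advantages([0.0, 0.0, 0.0, 0.0]))

    # One correct rollout: it gets a positive advantage, the rest negative.
    print(group_advantages([0.0, 1.0, 0.0, 0.0]))
```

When every rollout in a group gets the same reward, the advantages vanish. That is the situation that resampling the tool call from a fixed thinking prefix aims to break.
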
From Pixels to Words -- Towards Native One-Vision Models at Scale

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

52
Self-Improving Language Models with Bidirectional Evolutionary Search

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.

38
ResearchMath-14K: Scaling Research-Level Mathematics via Agents

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future works on research-level mathematical reasoning.

34
DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that substitutes external supervision with recovery-oriented optimization over failures from weak models. Instead of relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, making training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, highlighting an effective and scalable alternative pathway for improving reasoning in large language models.

32
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

32
GEM: Generative Supervision Helps Embodied Intelligence

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/

31
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

29
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

22
AI Research Agents Narrow Scientific Exploration

AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

20
Triplet-Block Diffusion RWKV

Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose B^3D-RWKV, a diffusion RWKV variant that integrates the model's O(L) inference efficiency with parallel, bidirectional discrete-diffusion through a triplet-block layout method. B^3D-RWKV-7.2B reaches comparable accuracy on an 8-task suite versus existing models while significantly outperforming baselines in decoding throughput with an average of 1.6× speedup.

16
Rethinking Memory as Continuously Evolving Connectivity

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced in https://github.com/zjunlp/LightMem.

16
OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

16
Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain this balance during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.

15
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to 0.32m (a 22% improvement). When integrated with SGLang, our framework delivers 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.

14
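
The rollout-averaging step mentioned in the Fast-dDrive abstract above can be sketched as a pointwise mean over N sampled trajectories. For independent noise with variance σ² per rollout, the mean of N rollouts has variance σ²/N, which is the standard reason averaging suppresses prediction variance. The function name and the representation of a trajectory as a list of (x, y) waypoints are assumptions for illustration. The shared-prefix KV-cache forking is not modeled here.

```python
Waypoint = tuple[float, float]


def average_trajectories(rollouts: list[list[Waypoint]]) -> list[Waypoint]:
    """Pointwise mean of N trajectories, each a list of (x, y) waypoints."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    horizon = len(rollouts[0])
    if any(len(r) != horizon for r in rollouts):
        raise ValueError("all rollouts must have the same number of waypoints")
    n = len(rollouts)
    return [
        (
            sum(r[t][0] for r in rollouts) / n,
            sum(r[t][1] for r in rollouts) / n,
        )
        for t in range(horizon)
    ]


if __name__ == "__main__":
    # Two hypothetical rollouts that deviate in opposite directions laterally.
    rollout_a = [(1.0, 0.5), (2.0, 1.5)]
    rollout_b = [(1.0, -0.5), (2.0, 0.5)]
    print(average_trajectories([rollout_a, rollout_b]))
```

In this toy case the opposite lateral deviations cancel, and the averaged path runs between the two samples.
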
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-conditioned video generation framework of Genie Envisioner, GE-Sim 2.0 is re-trained on thousands of hours of real-world robot data spanning teleoperation, contact-rich interaction, and on-robot policy deployment, substantially improving action-following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next-chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine-verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25-frame rollout in 2.3 seconds on a single H100, with up to 4× frame skipping at inference for long-horizon evaluation. GE-Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed-source general video generators, and policies trained against its rollouts and rewards translate into measurable real-world gains, establishing GE-Sim 2.0 as a practical platform for scalable evaluation and closed-loop learning of manipulation policies.

13
Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

11
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.

11
HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching-strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data, code and repository are available at https://github.com/usail-hkust/HRBench.

10
SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

8
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

8
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

8
Less is More: Early Stopping Rollout for On-Policy Distillation

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the student's earlier trajectory serves as context that is off-policy to the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts the rollout generation to the first response tokens. We show that ESR both surpasses full-rollout OPD performance across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and discover the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.

8
Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-p, Top-k, and Min-p). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
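
To make the pruning concrete, the sketch below shows how Top-k, Top-p, and Min-p filters decide whether a given word is still reachable. It is a minimal illustration under stated assumptions, not the paper's WCS definition: the toy distribution, the function names, and the simple "fraction of reference words that survive" score are all hypothetical, and the filters are applied to unrenormalized probabilities for brevity.

```python
from typing import Dict, Iterable, Set


def surviving_tokens(probs: Dict[str, float], top_k: int = 0,
                     top_p: float = 1.0, min_p: float = 0.0) -> Set[str]:
    """Return the tokens that remain sampleable after the given filters."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    keep = [tok for tok, _ in ranked]

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        keep = keep[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    if top_p < 1.0:
        nucleus, mass = [], 0.0
        for tok in keep:
            nucleus.append(tok)
            mass += probs[tok]
            if mass >= top_p:
                break
        keep = nucleus

    # Min-p: drop tokens below min_p times the top token's probability.
    if min_p > 0.0:
        threshold = min_p * ranked[0][1]
        keep = [tok for tok in keep if probs[tok] >= threshold]

    return set(keep)


def coverage(reference_words: Iterable[str], probs: Dict[str, float],
             **filters) -> float:
    """Fraction of reference words still reachable (illustrative score only)."""
    words = list(reference_words)
    alive = surviving_tokens(probs, **filters)
    return sum(w in alive for w in words) / len(words)


# Toy next-token distribution (hypothetical values).
probs = {"said": 0.50, "replied": 0.25, "murmured": 0.15,
         "intoned": 0.07, "susurrated": 0.03}
human_words = ["murmured", "susurrated"]

for name, filters in [("none", {}), ("top_k=2", {"top_k": 2}),
                      ("top_p=0.8", {"top_p": 0.8}), ("min_p=0.1", {"min_p": 0.1})]:
    print(name, coverage(human_words, probs, **filters))
```

In this toy case both rare words have nonzero probability, yet tightening any one filter removes at least one of them from the sampleable set, which is the effect the abstract describes.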

7
GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy compared to clean samples. GradSentry captures output-altering backdoor signatures using per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: it works for both parameter-efficient fine-tuning methods like LoRA and full-parameter tuning, as the gradient analysis operates independently of which parameters are being updated during training. GradSentry requires no clustering, operates effectively across all poison ratios (1% to 90%), and introduces minimal computational overhead (20-50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.

6
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B-120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making it difficult for external monitors to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain at 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.

4
Models That Know How Evaluations Are Designed Score Safer

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices (for instance, scientific articles or social media posts about AI benchmarking) may implicitly learn to recognize and respond to evaluation-like contexts. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating the resulting fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

4
AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed-systems incident tasks and security-advisory tasks. The evaluation shows three main results: learned routing reaches a higher-quality operating point than a fixed pipeline baseline on coordination-heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm-started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination-heavy multi-agent workflows over static wiring.

4
Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes: Detach, Cross-Entropy loss, and Policy loss, and explains why each succeeds or fails. Further analysis of policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.

4
CubePart: An Open-Vocabulary Part-Controllable 3D Generator

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements. We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes - one per schema element - that assemble into a coherent object while respecting the specified semantic structure. To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing. Project Page: https://cubepart.github.io/

4
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

3
PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.

3
Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, and the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, and LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, and (b) testing them against official Codeforces tests and adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1-57.8%, and OSS models reach only 21.5-25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, and reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, and logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

3
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts, where the chart-question task remains fixed but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find that failures are most prevalent when updated charts require novel visual reasoning pathways.

3
Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce SKILD, a Scale-invariant K-Space Image Learning Diffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage it to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs generation and continuous super-resolution by varying only the starting timestep: no task-specific architecture, no conditioning branch, no classifier-free guidance, no retraining per scale factor. Empirically, SKILD reaches FID 2.65 and Inception Score 9.63 on unconditional CIFAR-10, performs 2× to 8× super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.

2
ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state-action-outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2
Advancing Creative Physical Intelligence in Large Multimodal Models

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.

2
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

2
Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, yet inspection requires comparing fragments of code against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search down to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. Then, starting from windows of >= 60 tokens, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
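
The two-stage idea can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' pipeline: the coarse stage uses token-set overlap as a stand-in for the learned vector search, the winnowing variant keeps only fingerprint values (no positions), and the parameters `k`, `w`, and `shortlist`, the function names, and the toy corpus are all hypothetical.

```python
import hashlib
from typing import Dict, List, Set


def kgram_hashes(tokens: List[str], k: int) -> List[int]:
    """Hash every run of k consecutive tokens to a stable 32-bit integer."""
    out = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k]).encode("utf-8")
        out.append(int.from_bytes(hashlib.sha1(gram).digest()[:4], "big"))
    return out


def winnow(tokens: List[str], k: int = 3, w: int = 4) -> Set[int]:
    """Keep the minimum hash from each window of w consecutive k-gram hashes."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Set overlap in [0, 1]; defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: List[str], corpus: Dict[str, List[str]],
             shortlist: int = 2) -> List[str]:
    """Coarse shortlist first, then re-rank the shortlist by fingerprints."""
    # Stage 1 (stand-in for vector search): rank by token-set overlap.
    q_tokens = set(query)
    coarse = sorted(corpus, key=lambda n: jaccard(q_tokens, set(corpus[n])),
                    reverse=True)[:shortlist]
    # Stage 2: re-rank only the shortlist by winnowing fingerprint overlap.
    q_prints = winnow(query)
    return sorted(coarse, key=lambda n: jaccard(q_prints, winnow(corpus[n])),
                  reverse=True)


# Toy corpus of pre-tokenized snippets (hypothetical data).
corpus = {
    "a": "def add ( x , y ) : return x + y".split(),
    "b": "def mul ( x , y ) : return x * y".split(),
    "c": "for i in range ( n ) : print ( i )".split(),
}
query = "def add ( x , y ) : return x + y".split()
print(retrieve(query, corpus))
```

The expensive fingerprint comparison runs only on the shortlist, which is why the cost of the exact stage no longer grows with the size of the full corpus.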

1
Revealing Algorithmic Deductive Circuits for Logical Reasoning

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% of all heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

1
Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
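
The codec steps described above (normalize, sparse signed projection, clip, scalar-quantize, score float queries against stored codes) can be sketched as follows. This is an illustrative sketch only, not the Rust implementation the abstract mentions. The abstract gives the 48-byte code size but not its layout, so the 96 coordinates at 4 bits each, the number of nonzeros `NNZ`, the clip range `CLIP`, and the hash seed string `"clark"` are all assumptions, as are the function names.

```python
import hashlib
import math

OUT_DIM = 96  # assumed sketch width: 96 coordinates x 4 bits = 48 bytes
NNZ = 4       # assumed number of nonzeros per input coordinate
CLIP = 0.3    # assumed clip range, about 3 standard deviations of a projected unit vector


def slots(i):
    """Derive NNZ (bucket, sign) pairs for input coordinate i from a hash; no matrix is stored."""
    digest = hashlib.blake2b(f"clark:{i}".encode(), digest_size=4 * NNZ).digest()
    words = [int.from_bytes(digest[4 * r:4 * r + 4], "little") for r in range(NNZ)]
    return [((word >> 1) % OUT_DIM, 1.0 if word & 1 else -1.0) for word in words]


def project(x):
    """Normalize x to unit length, then apply the sparse signed projection."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    y = [0.0] * OUT_DIM
    for i, v in enumerate(x):
        for bucket, sign in slots(i):
            y[bucket] += sign * (v / norm) / math.sqrt(NNZ)
    return y


def encode(x):
    """Clip each projected coordinate, quantize it to 4 bits, and pack two codes per byte."""
    q = [round((min(max(v, -CLIP), CLIP) + CLIP) / (2 * CLIP) * 15) for v in project(x)]
    return bytes((q[2 * j] << 4) | q[2 * j + 1] for j in range(OUT_DIM // 2))


def decode(code):
    """Unpack the 4-bit levels and map them back to values in [-CLIP, CLIP]."""
    levels = [n for b in code for n in (b >> 4, b & 0x0F)]
    return [n / 15 * 2 * CLIP - CLIP for n in levels]


def score(query, code):
    """Score a floating-point query against a stored sketch (approximates cosine similarity)."""
    return sum(a * b for a, b in zip(project(query), decode(code)))
```

Because the projection is derived from a hash of the coordinate index, `encode` needs no training pass or corpus statistics, which matches the stateless property the abstract emphasizes. For a 384-dimensional f32 vector the dense size is 384 × 4 = 1536 bytes, so a 48-byte code is the 32x reduction quoted above.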

0
PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

0
05

PRODUCT HUNT

05.00
PRODUCT HUNT

Product Hunt - May 28, 2026

Product Hunt Daily Feed: Featuring noteworthy tech launches.

LaunchOS

Bring Back the Classic Launchpad Experience on macOS 26+

0
Pitch Agent

On-brand presentations, generated in seconds

0
Compartment

Open-source runtime for internal team software

0
SpotsNow

Track who's advertising across podcasts w/ campaign insights

0
Revolte

AI for Software Engineering

0
Pancake

OpenClaw in Slack that makes your company autonomous

0
SoMerch

Merch for distributed teams, handled end to end

0
NeuralAgent 2.5

Talk to your computer, it responds and gets things done.

0
KugelAudio

Real-time text-to-speech model you can self-host

0
Buffer API

One API to publish across every social platform.

0
Granite

A vault for every document that matters

0
Stage

Screen recording for demos, bugs, and updates

0
AccountyCat

A focus companion that actually gets context

0
Parastore

Simulate real store with LLM-powered synthetic consumer

0
Crew44

Turn coding agents into specialist teams

0
Robinhood Agentic Trading

Let your agent trade

0
Growati

The autopilot for YouTube post-production

0
Kim Personal Health Assistant

The intelligence layer for Apple Health

0
Memori

Persistent memory from agent trace, not just conversation

0
Marked 3

Preview and Publish your Markdown

0
Sublern

Translate any word in video subtitles with one hover

0
Angel Match 4.0

A database of 125K+ angels and VCs to raise your seed round

0
BaseBuddy

Turn your Supabase database into a WordPress-like editor

0
Calling Skills for AI Agents

Add voice and video calling via your coding agent

0
MacSIM by Studio Practice

Preview any URL on every Mac screen at once

0
Oasis Browser for Mac

A privacy-first AI browser you can train anonymously

0
Bluedot 2.1

Record on Apple Watch. Sync with Claude

0
Mojito

Type to search for any emoji, symbol, or gif in seconds

0
Layers

Create beautiful animated code snippet videos for free

0
Coworker AI

More AI for less spend with context-aware model routing

0
Local Panel

Local SSH server manager with no subscriptions or installs

0
Octolane

Self-driving AI CRM that you can talk to

0
Pawse.ai

An acoustic regulation system for dogs

0
QuickSheet v1.2

Instantly create and edit spreadsheets from your menu bar

0
Powabase

Build AI apps with Postgres, RAG, and agents

0
zero.xyz

Give your AI agent access to ~8k tools, APIs and services

0
Archi-Flow

Visualize cloud architecture with live traffic simulations

0
BankStatementLab

Turn any bank statement PDF into Excel, CSV or JSON with AI

0
Jott

Capture quick written or voice notes from your Mac's notch

0
BobCA

A sovereign agent that learns to code with your preferences

0
Harbor

CLI + companion App to spin up complete local LLM stacks

0
AgenticCalling AI

Give your AI the power to make phone calls

0
Phasr

Run 100+ workflows simultaneously without losing context

0
baz.studio

Skills library & video editor for AI Agents

0
Krater

All the AI tools you use, one subscription

0
Chunk sidecars

Validate agent-generated code before it ever reaches CI

0
Extend

Parse any PDF layout with SOTA accuracy for AI pipelines

0
Netfox

A native local macOS network monitor

0
Curlo

Local AI search to find SFX and music by describing it

0
GenGo

Transform selected text anywhere on macOS

0
06

TECHMEME

06.00
TECHMEME

Techmeme - May 28, 2026

Techmeme Digest: Major tech headlines and industry conversations.

Israel-based web development company Wix is cutting 20% of its workforce, citing the "fast evolution of AI capabilities" and currency exchange rate difficulties (CJ Haddad/CNBC)
Source: Techmeme · Published: May 28, 2026

CJ Haddad / CNBC: Israel-based web development company Wix is slashing roughly 20% of its workforce, CEO Avishai Abrahami announced …

Intel unveils its first dedicated handheld gaming chips, the Arc G3 and Arc G3 Extreme, featuring Xe3 GPU cores, arriving first in the Acer Predator Atlas 8 (Sean Hollister/The Verge)
Source: Techmeme · Published: May 28, 2026

Sean Hollister / The Verge: The Acer Predator Atlas 8 will be one of the first to use it, and there's reportedly an MSI too.

Qualcomm unveils the Snapdragon C, an entry-level ARM-based SoC for Windows 11 devices starting at $300 and shipping in 2026, to compete with the MacBook Neo (Zac Bowden/Windows Central)
Source: Techmeme · Published: May 28, 2026

Zac Bowden / Windows Central: Qualcomm has unveiled its new entry-level ARM-based SoC for Windows 11 devices that will begin shipping later this year on devices that cost as low as $300.

Sources: at WWDC, Apple is likely to showcase how 15 years of designing chips gives it an advantage in running AI locally, and will use a distilled Gemini model (Aaron Tilley/The Information)
Source: Techmeme · Published: May 28, 2026

Aaron Tilley / The Information: At Apple's annual developer conference next month, the star of the show will be a series of long-delayed artificial intelligence upgrades to the iPhone.

Filing: CNN sues Perplexity in New York for allegedly unlawfully copying and distributing CNN content, after failing to agree on terms with Perplexity in 2025 (Brian Stelter/CNN)
Source: Techmeme · Published: May 28, 2026

Brian Stelter / CNN: CNN is suing Perplexity, accusing the AI company of unlawfully copying and distributing CNN's content.

YouTube adds a "custom feed" to its home page, letting users enter a prompt to create a constantly refreshed feed, available to signed-in users in the US (Andrew Romero/9to5Google)
Source: Techmeme · Published: May 28, 2026

Andrew Romero / 9to5Google: YouTube isn't exactly replacing the traditional search bar, but it is supplementing it with a way to watch a “custom feed” of videos based on your nuanced prompts.

Illustrations based on sources detail Apple's Siri overhaul, like a new UI and chatbot-style app, and major iOS 27 changes, ahead of the WWDC keynote on June 8 (Mark Gurman/Bloomberg)
Source: Techmeme · Published: May 28, 2026

Mark Gurman / Bloomberg: The iPhone maker looks to stage a comeback in digital assistants and artificial intelligence.

Oura unveils the Oura Ring 5, with a 40% smaller form factor, improved sensing, and repositioned LEDs, shipping from June 4 for $399, up from the Ring 4's $349 (Bloomberg)
Source: Techmeme · Published: May 28, 2026

Bloomberg: Oura Health Oy, the popular smart ring maker seeking to go public this year, unveiled a significantly thinner and lighter new model along with new wellness features.

Letter: US Central Command says it received "threat reports concerning adversary exploitation of commercial location data" to target US personnel in war zones (Raphael Satter/Reuters)
Source: Techmeme · Published: May 28, 2026

Raphael Satter / Reuters: U.S. forces deployed to war zones have been targeted using commercially available location data …

IBM and Red Hat commit $5B to establish a new open-source software model, dubbed Project Lightwell, and will deploy 20,000 engineers globally, supported by AI (Connor Hart/Wall Street Journal)
Source: Techmeme · Published: May 28, 2026

Connor Hart / Wall Street Journal: Project Lightwell will deploy a global force of 20,000 engineers, supported by advanced artificial intelligence

AWS is rolling out Resilient Network Graphs, a "quasi-random" networking architecture that uses a flat mesh design, and says it accelerates information flows (Lauren Goode/Wired)
Source: Techmeme · Published: May 28, 2026

Lauren Goode / Wired: The tech giant says a breakthrough in data-center networking has dramatically accelerated the flow of information through its massive cloud infrastructure.

Source: the Shanghai Futures Exchange is in the early stages of designing futures contracts for AI tokens; US exchanges are set to launch GPU compute futures (Reuters)
Source: Techmeme · Published: May 28, 2026

Reuters: China is designing a futures market for AI tokens, sources familiar with the matter said, as the country potentially takes …

Samsung's Securities, SDS, and Card units say they will acquire a 4% stake in Dunamu, which operates South Korea's largest crypto exchange, for ~$446M in cash (Choi Yeon-jae/The Korea Herald)
Source: Techmeme · Published: May 28, 2026

Choi Yeon-jae / The Korea Herald: Samsung Securities, Samsung SDS and Samsung Card said Thursday they will jointly acquire a 4 percent stake in Dunamu …

London-based Geordie AI, which builds a security and governance platform for AI agents, raised a $30M Series A led by Balderton at an estimated $180M valuation (Jeremy Kahn/Fortune)
Source: Techmeme · Published: May 28, 2026

Jeremy Kahn / Fortune: Geordie AI, a London-based startup which provides a security and governance platform for AI agents, has raised a $30 million Series A round led by Balderton Capital.

The Steam Deck's huge price hike, from $399 in 2022 to $789 today, is the end of an era for gaming handhelds, coming amid RAMageddon, tariffs, and the Iran war (Sean Hollister/The Verge)
Source: Techmeme · Published: May 28, 2026

Sean Hollister / The Verge: The Steam Deck's huge price hike is the end of an era for gaming handhelds. … For a few glorious years …

07

STARTUP ARCHIVE

07.00
STARTUP ARCHIVE

Startup News - May 28, 2026

Startup News Roundup: Aggregating key funding and launch updates.

Marc Andreessen on the 5 personality traits of an innovator
Source: Startup · Published: Mar 31, 2026

“When you’re talking about real innovators—people who actually do really creative, breakthrough work—I think you’re talking about a couple things:”

Steve Jobs explains the importance of both thinking and doing
Source: Startup · Published: Mar 30, 2026

“The doers are the major thinkers. The people who really create the things that change this industry are both the thinker-doer in one person.”

Tobi Lutke explains what the VCs who passed on Shopify got wrong
Source: Startup · Published: Mar 27, 2026

“What a lot of free-market thinkers don’t understand is that between the demand and eventual supply lies friction.”

Sam Altman explains how he decides to invest in a startup after 10 minutes
Source: Startup · Published: Mar 26, 2026

“Does this person have the potential to be the next Mark Zuckerberg?… [You don’t get to] 100% accuracy, obviously, but it’s good enough that our business model works.”

Jony Ive recounts the time Steve Jobs called him vain
Source: Startup · Published: Mar 25, 2026

In the clip below, Jony Ive recounts the time he asked Steve Jobs to be less harsh in his critique of a piece of work.

Jeff Bezos’s two pieces of advice for aspiring entrepreneurs
Source: Startup · Published: Mar 24, 2026

“The advice that I would give entrepreneurs is don’t chase the hot new thing. It’s so hard to catch something that everybody already knows is hot.”

Elad Gil: “Things that work tend to work pretty fast”
Source: Startup · Published: Mar 23, 2026

“I do think there’s a bit of a myth in Silicon Valley that you should keep grinding no matter what and it’s just about perseverance, and I think that’s really bad advice.”

Paul Graham on why starting with a “small, intense fire” is the key to startup growth
Source: Startup · Published: Mar 20, 2026

“You have to know who those first users are and how you’re going to get them.”

Keith Rabois on how to identify great talent
Source: Startup · Published: Mar 19, 2026

“What you want to do with every single employee every single day is expand the scope of their responsibilities until it breaks… and that’s the role they should stay in.”

Wealthfront CEO on why advertising spend makes it harder to find product/market fit
Source: Startup · Published: Mar 18, 2026

“The way that you know you have product/market fit is if you have exponential organic growth.”

Eric Schmidt on why most companies get strategy wrong
Source: Startup · Published: Mar 17, 2026

“Work very, very hard to figure out what the world’s going to look like in five years. What will people be doing? What will your customers want? Where will costs be?”

Mark Zuckerberg: “You can’t 80/20 everything”
Source: Startup · Published: Mar 16, 2026

"There’s the famous 80/20 rule where you get 80% of the benefit by doing 20% of the work, but you can’t just 80/20 everything. There have to be certain things that you are just the best at."

Marc Andreessen on Mark Zuckerberg’s founder “superpower”
Source: Startup · Published: Mar 13, 2026

“A great superpower that Mark Zuckerberg has that is probably not well-understood enough is he does not get emotionally upset in stressful situations”

Sam Altman explains how to come up with a great startup idea
Source: Startup · Published: Mar 12, 2026

"If you start a startup without a good idea… you’ll be under pressure to make something up and it won’t work that well."

Jeff Bezos on the problems with proxies and managing to metrics
Source: Startup · Published: Mar 11, 2026

“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”

Airbnb founder Brian Chesky on how to design an amazing user experience
Source: Startup · Published: Mar 10, 2026

“If you can design something really amazing using the hand-crafted part of your brain, then you can reverse-engineer how to industrialize this millions of times over.”

Spencer Rascoff: “I will never invest in a consumer startup with paid marketing”
Source: Startup · Published: Mar 9, 2026

“If you’re actually trying to grow a product, the best levers for doing that are often within the product itself.”

Patrick Collison explains why it sometimes makes sense to quit
Source: Startup · Published: Mar 6, 2026

“One thing I’ve learned myself the hard way, is that it is easier to tear down a company and restart it in Silicon Valley, than it is to constantly try to pivot or keep something alive.”

Jeff Bezos recounts the time he called Amazon’s customer service number mid-meeting to prove a metric was wrong
Source: Startup · Published: Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disagree, the anecdotes are usually right”

Ben Horowitz: “Nobody was born a great manager. It’s a very unnatural job.”
Source: Startup · Published: Mar 4, 2026

“If you can’t build a great product, it doesn’t matter if you can build a great company.”

03

ALSO TODAY

3 MORE SOURCES
08

SOLIDOT

08.00
SOLIDOT

Solidot News - May 28, 2026

Solidot Feed: Highlighting essential tech & open-source news.

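EU fines Temu 200 million euros for violating the DSA

The European Commission has fined Temu 200 million euros under the Digital Services Act (DSA). The Commission found that Temu failed to diligently identify, analyze, and assess the systemic risks posed by counterfeit and substandard goods on its platform, which harmed EU consumers. As examples, the Commission said that a very high share of the chargers it examined failed basic safety tests, and that a considerable share of the baby toys it tested posed moderate to high safety risks: they contained chemicals above legal safety limits or presented choking hazards because of detachable parts. The Commission opened its investigation on October 31, 2024, adopted preliminary findings in July 2025, and announced the penalty on May 28.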

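Last.fm goes independent

Music platform Last.fm announced that it is once again operating independently. It said that ownership has changed but the product people use every day has not, and that user accounts and music-taste data are unaffected. Founded in 2002, Last.fm uses the Audioscrobbler music recommendation system to build a taste profile for each user from listening data. CBS Interactive acquired it in 2007 for $280 million; CBS Interactive is now part of Paramount Skydance.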

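Jensen Huang to become the latest US executive to join the Tsinghua SEM advisory board

The FT reports that Nvidia CEO Jensen Huang has agreed to join the advisory board of Tsinghua University's School of Economics and Management, which is currently chaired by Apple CEO Tim Cook, as Huang works to maintain ties with Beijing. Tsinghua, located in Beijing, is a top Chinese university focused on science and engineering. The advisory board's stated goals include helping the business school strengthen its international connections and shape its long-term strategy. Other US executives on the board include Elon Musk, Mark Zuckerberg, and Microsoft CEO Satya Nadella.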

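Valve sharply raises Steam Deck prices

Valve has sharply raised the price of its Steam Deck handheld because of soaring memory and SSD prices. In the US, the 512GB OLED model rose from $549 to $789, an increase of $240, and the 1TB OLED model rose from $649 to $949, an increase of $300. The Steam Deck launched in February 2022 with an LCD screen; in November 2023 Valve upgraded the screen to OLED and phased out the LCD version. The Steam Deck ships with 16 GB of LPDDR5, and memory prices have risen several-fold since late last year. SSD prices have not climbed as dramatically, but they are also higher.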

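Google employee charged with using inside information to win $1.2 million on Polymarket

Google security engineer Michele Spagnuolo allegedly used inside information to bet on the prediction market Polymarket that the singer d4vd would be the most-searched person on Google in 2025, winning $1.2 million. He has been charged with fraud, was arrested on Wednesday morning, and was later released on $2.25 million bail. Spagnuolo had access to internal data systems, including a tool that could reach unpublished annual search data. Polymarket observers noticed last December that the account AlphaRaccoon was making suspicious trades on the most-searched-person market. Spagnuolo owned that account and made $1.2 million from the related bets. Google said it is cooperating with the investigation and that Spagnuolo's conduct violated company policy.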

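Pollution released by attack on oil facilities was comparable to a volcanic eruption

A research team from Wuhan University and the China Meteorological Administration used the Fengyun satellites and Europe's Sentinel satellite to quantify the sulfur dioxide released after Iranian oil facilities were attacked this March. The March 7 airstrikes severely damaged Iran's Fardis, Shahran, and Aghdasieh oil depots and the Tehran refinery. The Shahran depot was hit hardest: burning oil flowed into the city's sewer system and set urban green spaces on fire, producing large volumes of toxic smoke. Residents reported immediate health problems, including difficulty breathing, skin irritation, and a bitter taste in the mouth. The scientists focused on sulfur dioxide, a strongly irritating and corrosive pollutant released by the burning depots. Using Fengyun-3 (FY-3F and FY-3E) and Sentinel-5P, they found that local sulfur dioxide concentrations rose from 0.8 DU to 2.0 DU (DU stands for Dobson unit), with total emissions estimated at 2.98×10⁴ tonnes. The event affected an area of 3.0×10⁵ square kilometers.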

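Birds used showy feathers to attract mates 100 million years ago

According to a study published in PLOS One, Plumadraco bankoorum, a bird that lived more than 100 million years ago, used showy feathers to attract mates. The fossil was unearthed in Liaoning, and the bird lived 121 million years ago. It measured only 15 cm from beak to the base of its tail feathers, yet its paired tail feathers were nearly 30 cm long. These feathers had no aerodynamic function and were more likely used for display. In modern birds such as peacocks and birds of paradise, long tail feathers usually appear on males and are used in elaborate courtship displays, while females have subdued plumage that helps them avoid predators while nesting and raising chicks. The researchers therefore infer that this Plumadraco fossil probably represents a male whose unusually long tail feathers served a similar purpose. The study also notes that confirming this inference requires more evidence about the tail musculature and nesting strategies of such ancient birds.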

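YouTube will automatically label AI-generated videos

As AI videos become nearly indistinguishable from real footage to the human eye, YouTube announced that it will automatically label AI-generated videos and show the label to users as prominently as possible, a move intended to improve content transparency. For long-form videos, the AI label will appear below the video player and above the description. For Shorts, the label will appear as an overlay on the video.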

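Women also find women's faces more attractive

According to a study published in Proceedings of the Royal Society B, even women rate women's faces as more attractive than men's. The researchers say this perception gap narrows with age and disappears after people reach their 80s. The finding supports the idea of a "gender attractiveness gap": in languages across different regions of the world, women are regarded as the more beautiful sex. Darwin observed that in animals, males usually have the showier appearance in order to attract females. In humans the situation is reversed, the explanation being that human sexual selection is driven by men rather than women: men compete for the most attractive women, or pursue wealth and power to the same end. For this study, the researchers compiled a facial attractiveness database from 52 studies across 76 countries, containing more than 1.5 million ratings of 17,000 faces by nearly 30,000 raters. The average attractiveness rating of women's faces was higher than that of 60% of men's faces. The result is partly due to sex differences in facial structure: men's faces tend to be more square, women's faces tend to be rounder, and both men and women tend to find round faces more attractive.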

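Scientists reverse brain aging with a nasal spray

Scientists at Texas A&M have used a nasal spray to reverse brain aging. Just two doses of the treatment restored memory, reduced chronic inflammation, and improved brain cell function. Brain aging is usually accompanied by low-level inflammation. Chronic inflammation interferes with memory, thinking, and the brain's ability to adapt to new environments, and it is also considered an important factor in neurodegenerative disease. The researchers say this kind of brain aging is reversible. The new therapy relies on extracellular vesicles (EVs) loaded with microRNA to help regulate important biological processes in the brain. The scientists deliver the extracellular vesicles by nasal spray, which lets the drug bypass the brain's protective barrier and enter brain tissue directly.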

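The Witcher 3 will get a new expansion, Songs of the Past, next year

CD PROJEKT RED announced that Songs of the Past, the third expansion for The Witcher 3, will be released next year. The Witcher 3: Wild Hunt launched in May 2015, followed by two expansions, Hearts of Stone in October 2015 and Blood and Wine in June 2016. The Witcher 3 was widely acclaimed and has sold more than 60 million copies to date, making it one of the best-selling games of all time. Songs of the Past is being co-developed by CD PROJEKT RED and Fool’s Theory, a studio formed by former CD PROJEKT RED developers who worked on the Witcher series and which is also developing a remake of the first Witcher game. In Songs of the Past, players will once again play as the witcher Geralt of Rivia on a brand-new adventure. More information will be announced in late summer. The expansion is widely seen as a warm-up for the upcoming The Witcher 4.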

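Chinese rocket debris in orbit is increasing sharply

China launched 64 rockets in 2022 and set a record of 93 launches in 2025, second only to the United States. As Chinese companies accelerate launches for the Guowang and Qianfan broadband satellite constellations, the number of launches will keep growing. Chinese companies, however, have not improved how they dispose of rocket upper stages. According to the latest analysis by Jim Shell, the mass of Chinese rocket debris in long-lived orbits grew from less than 100 tonnes to 252 tonnes over the past five years. Long-lived orbits, as the name suggests, are orbits in which rocket debris remains for a long time. To deploy its broadband megaconstellations, China is expected to carry out a thousand or more rocket launches over the next decade.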

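DuckDuckGo installs rise by up to 30% after Google shifts to AI search

Google announced last week that it would substantially change its search experience, turning the search box into an AI chatbot conversation box. The move immediately drew strong objections from users. Some critics argue it will kill the open web, while others worry that AI overviews will show wrong answers and take control away from users who do not want AI. Some users have switched to the alternative search engine DuckDuckGo as a result. DuckDuckGo says installs of its US app rose an average of 18.1% week over week from May 20 to May 25. The increase lasted six days and peaked at 30.5% on May 25. On iOS, installs rose an average of 33% week over week, peaking at 69.9%. Visits to noai.duckduckgo.com, which shows no AI results, rose an average of 22.7% week over week, peaking at 27.7% on May 24. DuckDuckGo executive Kamyl Bazbaz said users want a choice.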

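Dropbox founder steps down as CEO

Dropbox founder Drew Houston told employees on Tuesday that he will step down as CEO and become executive chairman, leaving co-CEO Ashraf Alkarmi as sole CEO. Houston founded Dropbox at age 24 and served as CEO for 19 years, helping to create the cloud storage market and competing directly with giants such as Google and Apple. Under his leadership, however, Dropbox never reached the top tier, and its market value has fallen by half from its peak at the time of its IPO. In its latest quarterly report, Dropbox said it has more than 18 million paying users, and its cloud storage service remains popular with media professionals, graphic designers, architects, and others who share files and photos in their daily work. Dropbox's annual revenue passed $1 billion in 2017 and $2 billion four years later, but it has been roughly flat for the past two years and declined slightly in 2025.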

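Odd language errors may help identify paper-mill papers

James Heathers of the Medical Evidence Project reported at the World Conference on Research Integrity that a simple method of searching for language errors can help identify fake research papers produced by "paper mills." Heathers came up with the idea last year, when someone sent him a dozen or so medical papers that looked very similar and asked him to find out what was wrong with them. He spent two days reading the papers and noticed odd but recurring spelling errors, grammatical errors, and word choices. For example, "Kolmogorovor information complexity" misspells the surname of the mathematician Andrey Kolmogorov, and several papers contained nonstandard phrasing such as "5 ml gel-containing biochemical test tube," which Heathers described as "like it was written by aliens." Such errors may simply be slips by authors who are not native English speakers, and on their own they are not enough to judge a paper fraudulent. But when Heathers searched Google Scholar for these unusual phrases, he found about 200 more papers that shared the features of the original dozen, matching not only in topic but also in details such as study design and figure style. He argues that, statistically, this is almost impossible unless the papers all come from the same source. Heathers suspects they are all versions of a single paper, fabricated and reworked in bulk by a paper mill and sold to scientists eager to increase their publication counts.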

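Netherlands blocks US company from acquiring a key digital supplier

The Dutch government has decided to block the proposed acquisition of Dutch cloud provider Solvinity by US IT giant Kyndryl. Solvinity hosts DigiD, the Netherlands' online identity platform, so the deal raised concerns that DigiD data could be controlled or demanded by the United States. Willemijn Aerdts, the Dutch state secretary for the digital economy, wrote to the Dutch parliament on Tuesday that the body responsible for reviewing investments had concluded the acquisition "could pose a risk to the public interest" and had advised the government to block it. The government then adopted that advice. In a statement, Kyndryl said it was extremely disappointed by the Dutch government's decision.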

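Pope's first encyclical suspected of being partly written with AI help

Pope Leo XIV has issued his first encyclical, Magnifica Humanitas, which discusses safeguarding humanity in the age of AI. The encyclical itself, however, is suspected of having been partly written with the help of AI. An analysis by the AI detection tool Pangra shows that some passages have a 40% to 100% probability of having been written by AI, while most passages show no AI use. No traces of AI use were found in earlier encyclicals. Judging from the text and circumstantial evidence, the AI used was most likely Anthropic's Claude. One adviser on the encyclical was Anthropic co-founder Christopher Olah.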

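Wikimedia Foundation's firing of union organizers sparks community protest

In mid-May the Wikimedia Foundation fired Brooke Vibber, a veteran lead developer of MediaWiki, and on May 21 it disbanded the Community Tech team, with all five engineers and one manager leaving. Most of them were union organizers. Brooke Vibber became lead developer of the MediaWiki project in early 2003. Wikipedia runs on MediaWiki, and she was the first full-time employee hired by the Wikimedia Foundation as well as its first CTO. She is regarded as one of the few veteran developers with a deep understanding of the system's technical foundations. The Community Tech team existed to build the features that community volunteers asked for through the Community Wishlist. The foundation's move immediately drew protests from volunteers, who are preparing collective action such as a strike. This is the first time volunteers and foundation staff have jointly launched a solidarity action. An administrator named Femke argued that an organization dedicated to benefiting society should not operate without a union. The Wikimedia Foundation holds $296.6 million in reserves, enough to cover 17.1 months of operating expenses. The union, Wiki Workers United, is asking only for fairly modest things: that leadership be transparent and accountable to staff and the community, that it listen to staff input on annual planning before making decisions, and that it end erratic, constantly changing practices in hiring, dismissal, and promotion, among others.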

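Iran gradually restores global internet access

After cutting off the internet for nearly three months, Iran is gradually restoring global connectivity. Iran's first vice president, Mohammad Reza Aref, announced the news on Tuesday through his X account. Network monitoring groups Netblocks and Kentik both reported that Iranian networks began to recover gradually from 13:00 GMT, although most networks have not yet been restored. The shutdown began on February 28 and is one of the longest internet shutdowns in history. Isik Mater, director of research at Netblocks, said there are signs that Iran is filtering the internet more strictly than before, with messaging apps such as WhatsApp subject to additional filtering.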

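Pregnancy-related deaths rose 9.2% after 14 US states imposed abortion bans

In 2021, Texas passed a law banning abortion after about 6 weeks of pregnancy. In 2022, the US Supreme Court ruled in Dobbs v. Jackson Women’s Health Organization that the Constitution does not confer a right to abortion, overturning the 1973 Roe v. Wade decision. As of early 2026, 13 US states ban abortion entirely and 7 states ban it after 22 weeks of pregnancy. Strict abortion bans are thought to increase pregnancy-related mortality. A study published in the American Journal of Public Health examined the effect of strict abortion bans on the health of pregnant women. It found that in the 14 states that ban abortion outright or after 6 weeks of pregnancy, pregnancy-related deaths were 9.2% higher than expected.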

09

APP STORE RANK

09.00
APP STORE RANK