TEXT VIEW · TODAY'S DIGEST · 36 HEADLINES ACROSS 8 SOURCES

ISSUE 0886
THU, JUN 4, 2026
Discover the best information organized by OrangeBot.AI
TODAY · THU, JUN 4, 2026

The web,
read by a bot.

Ten sources — Hacker News, Product Hunt, HuggingFace, Techmeme and more — filtered, tagged, and summarized every morning for builders who don’t have time to scroll.

01

AI DIGEST

UPDATED DAILY · EDITOR'S PICK

AI News Digest

June 4, 2026

Here is a summary of today's main news events.


Israel and Lebanon Agree to Ceasefire, Affecting Global Markets

A U.S.-brokered ceasefire was announced between Israel and Lebanon, requiring the Iran-backed group Hezbollah to halt all attacks. The news, seen as a potential step toward a broader U.S.-Iran deal, caused oil prices to drop due to reduced geopolitical risk. Gold prices and Treasury yields also shifted as investors weighed the prospects for regional stability.

Dow Rises on Ceasefire Hopes While Tech Stocks Stumble

The Dow Jones Industrial Average climbed over 500 points, largely driven by optimism following the Israel-Lebanon ceasefire agreement. However, the tech sector faced headwinds, with the Nasdaq falling after semiconductor giant Broadcom issued a disappointing financial forecast, causing its stock to drop significantly.

Tech Giants Announce Massive Investments in AI Infrastructure

Major technology companies are pouring billions into artificial intelligence. SoftBank announced a $52 billion investment in French data centers, while Foxconn and Intel are partnering to build AI hardware. These moves highlight the global race to develop foundational AI technology.

Blackstone Limits Investor Withdrawals from Key Fund

Investment giant Blackstone has capped client withdrawals from one of its major funds at 5% after facing a surge in redemption requests totaling $4.5 billion. The move indicates growing investor concern and a desire for liquidity within certain market sectors.

Bitcoin Continues to Fall Amid Sinking Investor Confidence

The price of Bitcoin has dropped for the fourth consecutive day, extending a significant slump. The decline was reportedly accelerated by a large sale from a well-known crypto advocate, further dampening investor sentiment in the digital asset market.

02

ON THE WIRE

6 SOURCES
02

HACKER NEWS


Hacker News - June 4, 2026

Hacker News Feed: Highlighting key posts and discussions.

Stop Killing Games

(jxself.org)

226227
ESP32-S31

(www.espressif.com)

329172
DaVinci Resolve 21

(www.blackmagicdesign.com)

492222
PlayStation Architecture

(www.copetti.org)

33062
Every Byte Matters

(fzakaria.com)

255134
03

HUGGINGFACE


HuggingFace - June 4, 2026

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (https://openmdw.ai/license/1-1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

56
Audio Interaction Model

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

55
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

38
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.

31
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

27
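
The streaming protocol in the OVO-S-Bench abstract (each question has a query timestamp and an evidence interval, and the model sees only the prefix before the query) can be illustrated with a minimal sketch. The record type, field names, and sample values below are hypothetical and are not taken from the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamingQuestion:
    """Hypothetical record; field names are illustrative only."""
    text: str
    query_time: float                       # seconds into the video when the question is asked
    evidence_interval: Tuple[float, float]  # (start, end) of the supporting footage


def visible_prefix(frame_times: List[float], question: StreamingQuestion) -> List[float]:
    """Return the timestamps of frames strictly before the query time."""
    return [t for t in frame_times if t < question.query_time]


def evidence_is_in_prefix(question: StreamingQuestion) -> bool:
    """True when the evidence interval ends no later than the query time."""
    return question.evidence_interval[1] <= question.query_time


frames = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q = StreamingQuestion("Which room is behind the camera?", query_time=3.5,
                      evidence_interval=(1.0, 2.0))
prefix = visible_prefix(frames, q)
print(prefix, evidence_is_in_prefix(q))
```

In this toy case the evidence lies outside the most recent frames, which is the situation the abstract describes as reasoning from "evidence outside the current view".
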
Qwen-Image-Flash: Beyond Objective Design

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.

26
M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

24
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and error, and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

23
Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.

20
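
The latency claim in the StreamMA abstract (generate-then-transfer scales linearly with pipeline depth, while streaming steps downstream pipelines adjacent agents) can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's closed-form analysis: a chain topology, a uniform time per reasoning step, and each agent able to begin one step after its predecessor begins. The function names are hypothetical.

```python
def serial_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Generate-then-transfer: each agent waits for its predecessor's full chain."""
    return num_agents * steps_per_agent * step_time


def streamed_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Idealized pipelining: agent k starts one step after agent k-1 starts."""
    return (steps_per_agent + num_agents - 1) * step_time


for depth in (1, 2, 4, 8):
    s = serial_latency(depth, steps_per_agent=10)
    p = streamed_latency(depth, steps_per_agent=10)
    print(f"depth={depth}: serial={s:.0f}, streamed={p:.0f}, speedup={s / p:.2f}x")
```

Under these assumptions the serial cost grows with the product of depth and steps, while the streamed cost grows with their sum, so the gap widens as the pipeline deepens.
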
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.

20
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.

18
Self-Distilled Policy Gradient

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, as well as reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.

15
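
Two of the ingredients named in the SDPG abstract, group-relative advantages normalized by the standard deviation and a student-to-teacher reverse KL divergence over the full vocabulary, can be written down in a few lines. This is a minimal standalone sketch, not the authors' implementation: the function names and toy inputs are assumptions, and how SDPG weights and combines these terms is not specified in the abstract.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center rewards on the group mean and scale by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def reverse_kl(student: List[float], teacher: List[float]) -> float:
    """KL(student || teacher) for two distributions over the same vocabulary.

    Assumes teacher[i] > 0 wherever student[i] > 0.
    """
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)


# Four sampled answers to one prompt, scored 1 (verified correct) or 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Next-token distributions over a toy three-token vocabulary.
student_probs = [0.7, 0.2, 0.1]
teacher_probs = [0.5, 0.4, 0.1]
print(advantages, reverse_kl(student_probs, teacher_probs))
```

The reverse direction takes the expectation under the student's own distribution, which is what makes it a natural fit for on-policy training, where samples come from the student.
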
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

13
MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

12
MemTrain: Self-Supervised Context Memory Training

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.

12
AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.

11
MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

10
Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.

9
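
The two-level idea in the FiRe-OPD abstract (filter at the trajectory level, then softly reweight tokens within the trajectories that remain) can be sketched generically. Everything concrete here is an assumption made for illustration: the abstract does not say how trajectory quality or token informativeness is measured, and the softmax weighting is only one possible soft scheme, not necessarily the one the paper uses.

```python
import math
from typing import List, NamedTuple


class Rollout(NamedTuple):
    quality: float              # hypothetical trajectory-level score
    token_signals: List[float]  # hypothetical per-token informativeness


def soft_weights(signals: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over token signals, rescaled so the weights average to 1."""
    peak = max(signals)
    exps = [math.exp((s - peak) / temperature) for s in signals]
    total = sum(exps)
    return [len(signals) * e / total for e in exps]


def filter_then_reweight(rollouts: List[Rollout], min_quality: float,
                         temperature: float = 1.0) -> List[List[float]]:
    """Drop low-quality rollouts, then weight every token of the survivors."""
    return [soft_weights(r.token_signals, temperature)
            for r in rollouts if r.quality >= min_quality]


rollouts = [Rollout(0.9, [0.0, 1.0, 2.0]), Rollout(0.2, [1.0, 1.0])]
weights = filter_then_reweight(rollouts, min_quality=0.5)
print(weights)
```

Unlike hard token selection, every token in a retained trajectory keeps a nonzero weight, which is the property the abstract credits with mitigating information loss.
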
WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

8
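
The Interaction Contract Graph described in the WebRISE abstract (observable states, user-intent transitions, and assertions checked after each interaction) suggests a simple data structure. The sketch below is a hypothetical miniature, not the benchmark's format: a dictionary stands in for a live page, the class and field names are invented, and the scoring rule is only one plausible reading of "transition validity".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Page = Dict[str, object]  # stand-in for a live page; a real harness would drive a browser


@dataclass
class Transition:
    intent: str                     # user intent, e.g. "click the + button"
    target: str                     # name of the state expected afterwards
    action: Callable[[Page], None]  # how the intent is performed on the page


@dataclass
class ContractGraph:
    assertions: Dict[str, Callable[[Page], bool]]  # state name -> observable check
    transitions: List[Transition] = field(default_factory=list)

    def transition_validity(self, page: Page) -> float:
        """Fraction of transitions whose target-state assertion holds after the action."""
        if not self.transitions:
            return 0.0
        passed = 0
        for t in self.transitions:
            t.action(page)
            if self.assertions[t.target](page):
                passed += 1
        return passed / len(self.transitions)


def increment(page: Page) -> None:
    page["count"] = int(page["count"]) + 1


def broken_open_modal(page: Page) -> None:
    pass  # simulates a generated page whose button handler does nothing


graph = ContractGraph(
    assertions={
        "count_is_one": lambda p: p["count"] == 1,
        "modal_open": lambda p: p["modal_open"] is True,
    },
    transitions=[
        Transition("click the + button", "count_is_one", increment),
        Transition("click the help icon", "modal_open", broken_open_modal),
    ],
)
score = graph.transition_validity({"count": 0, "modal_open": False})
print(score)
```

A page like this could look correct in a screenshot while half of its interactions fail, which mirrors the abstract's point that visual quality is no proxy for behavior.
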
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.

7
AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.

6
BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

5
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.

5
BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.

5
Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to 11× speedups. Representative budget sweeps show O(10^-3) parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
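
The error bound mentioned in the abstract can be illustrated with a small sketch. This is not MergePipe's code: the function names, the toy vectors, and the choice of which blocks to read are all hypothetical, and the merge is assumed to be a plain fixed-coefficient additive operator (base plus a weighted sum of expert deltas). Under that assumption, the triangle inequality bounds the deviation from the full-read merge by the coefficient-weighted norms of the deltas that were skipped.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def additive_merge(base, deltas, coeffs, read_set):
    """Return base + sum of coeffs[i] * deltas[i] over i in read_set.

    Only the delta blocks listed in read_set are touched, which stands in
    for reading a subset of expert weights under an I/O budget.
    """
    merged = list(base)
    for i in read_set:
        for j, d in enumerate(deltas[i]):
            merged[j] += coeffs[i] * d
    return merged


# Toy data (hypothetical): one base vector and three expert deltas.
base = [1.0, 2.0, 3.0]
deltas = [[0.5, 0.0, -0.5], [0.0, 0.01, 0.0], [0.002, 0.0, 0.001]]
coeffs = [0.5, 0.3, 0.2]

all_blocks = [0, 1, 2]
read_blocks = [0]        # budget allows reading one block
omitted_blocks = [1, 2]  # blocks skipped under the budget

full = additive_merge(base, deltas, coeffs, all_blocks)
budgeted = additive_merge(base, deltas, coeffs, read_blocks)

# Deviation of the budgeted merge from the full-read merge.
error = l2_norm([f - b for f, b in zip(full, budgeted)])

# Triangle-inequality bound: sum of |c_i| * ||delta_i|| over omitted blocks.
bound = sum(abs(coeffs[i]) * l2_norm(deltas[i]) for i in omitted_blocks)

print(f"error={error:.6f} bound={bound:.6f}")
```

Reading every block reproduces the full merge exactly, which matches the abstract's statement that the plan recovers the full-read merge at full budget.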

4
Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

3
OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.

3
PaintBench: Deterministic Evaluation of Precise Visual Editing

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores (R^2 = 0.91, p < 0.001). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
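
As a rough illustration of what a deterministic, judge-free pixel-level score looks like, the sketch below computes a mean intersection-over-union over binary masks. The abstract does not spell out PaintBench's exact metric, so the mask representation, the function names `iou` and `mean_iou`, and the convention for two empty masks are assumptions made for this example only.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(pairs):
    """Average IoU over a list of (predicted_mask, target_mask) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


# Toy 2x2 masks (hypothetical): 1 marks a pixel inside the edited region.
half_right = ([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # IoU = 1/2
exact = ([[0, 1], [0, 1]], [[0, 1], [0, 1]])       # IoU = 1

score = mean_iou([half_right, exact])
print(score)
```

Because the score is a pure function of the pixels, the same model output always receives the same number, with no judge model involved.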

2
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2
Unlocking Feature Learning in Gated Delta Networks at Scale

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

2
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude (13×) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.

2
Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is tools are libc-like wrappers; runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.

1
Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain

Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful samples (e.g., images) to facilitate effective knowledge transfer. However, a recent study observes that the noise domain constructed from simple distributions (e.g., Gaussian distributions) can serve as a surrogate source domain in the semi-supervised setting, where only a small proportion of target samples are labeled while most remain unlabeled. Based on this surprising observation, we formulate a novel problem termed Semi-Supervised Noise Adaptation (SSNA), which aims to leverage a synthetic noise domain to improve the generalization of the target domain. To address this problem, we first establish a generalization bound characterizing the effect of the noise domain on generalization, based on which we propose a Noise Adaptation Framework (NAF). Extensive experiments demonstrate that NAF effectively leverages the noise domain to tighten the generalization bound of the target domain, leading to improved performance. The codes are available at https://github.com/AIResearch-Group/SSNA.

1
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

1
Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning

Koopman theory turns nonlinear dynamics into a linear spectral problem. In computation, however, everything depends on a hard finite-dimensional choice: the observables must be expressive, nearly invariant under the dynamics, and, ideally, compatible with composition. Deep Koopman methods learn flexible coordinates, whereas structure-preserving methods enforce operator identities on fixed dictionaries. We combine these ideas by introducing Deep Embedded Multiplicative Dynamic Mode Decomposition (DeepMDMD), a method that learns a latent space and a partition of it, while enforcing the Koopman product rule as an exact algebraic constraint. Training alternates between an exact multiplicative operator update and a differentiable latent-clustering step that promotes Koopman closure. The result is a finite transition map on learned latent cells. Its nonzero spectrum lies on the unit circle, its dictionary is shaped by the dynamics rather than by ambient geometry, and forecasts are made in latent coordinates before being decoded to physical space. Across Hamiltonian, chaotic, and fluid examples, DeepMDMD learns dictionaries that are far more compact and dynamically coherent than those produced by geometric MDMD partitions. It reduces spectral pollution, reveals richer continuous-spectrum structure, and gives stable forecasts under severe noise. In high-dimensional flows, including a 158,624-dimensional cylinder wake and a noisy Re=20,000 lid-driven cavity, it preserves coherent structures and long-time spectral statistics where state-space MDMD fails. These results suggest a practical rule for Koopman learning: learn the coordinates, constrain the algebra.

1
MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.

1
Score-Control for Hallucination Reduction in Diffusion Models

Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language, audio and other modalities. Despite their success, they suffer from hallucinations: implausible samples that lie outside the support of the true data distribution, which degrade reliability and trust. In this work, we first empirically confirm the previously proposed hypothesis that score smoothness causes hallucinations in image-generation diffusion models and provide a density-based perspective. We further formalize this notion by linking the hallucination probability mass to the Lipschitz constant of the learned score function. Motivated by this, we introduce a Variance-Guided Score Modulation (VSM) strategy that controls the score Jacobian, in turn reducing score smoothness and better approximating the ground-truth score, which decreases hallucinations. Empirical results on synthetic and real-world datasets demonstrate that our approach reduces hallucinations (up to ~25%) while maintaining high fidelity and diversity, providing a principled step toward more reliable diffusion-based image generation. We also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation. Code and datasets are publicly available at https://github.com/bhosalems/VSM.

1
Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Low-Rank Adaptation (LoRA) successfully enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi-concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real-world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.

0
Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.

0
05

PRODUCT HUNT

Product Hunt - June 4, 2026

Product Hunt Daily Feed: Featuring noteworthy tech launches.

Walrus Memory

Enable agents to keep context & work across apps + sessions

0
Split Ninja

Cut, extract, mute, and split videos locally

0
Koji by Brilliant

A world-class personal tutor for every home

0
AppWizzy

Rent a private VM with Codex to build production apps

0
Basedash Semantic Layer

Define metrics once. Use them everywhere.

0
PlugTalk

Your Mac talks back when you plug things in

0
Perplexity Personal Computer for Windows

Run AI agents across your local files and apps on Windows

0
Empromptu AI

Train Fine Tuned Models With AI Apps You're Already Building

0
Chloe by Close

AI agent built into your CRM who works leads for you

0
Deliveryman.ai

Cold email infrastructure on autopilot without Gsuite

0
Astra Autonomous Pentest

AI agents that find, validate, and fix every vulnerability

0
Novus

Catch and fix usability issues automatically as you ship

0
Google Gemma 4 12B

Run multimodal AI locally with an encoder-free architecture

0
Build Club Campus

Virtual AI School: Upskill in AI and Become Great at it Fast

0
Mailwarm 2.0

The email warmup tool, upgraded for deliverability.

0
Audex Trace

Trace what Apple Music is actually playing

0
TimeTuna.com

If Calendly had gorgeous video backgrounds

0
Curata

A shared workspace for AI agents and humans.

0
Kai for Chrome

Local meeting transcription with no account needed

0
Keen Code

A context-efficient CLI coding agent built by agents

0
DotBGE

Local-first file encryption for iOS, CLI, and agents

0
ChatPilot

Bulk delete, archive & timestamp your ChatGPT conversations

0
Cignara

AI Agents for Fortune 500 grade customer support

0
Boxes.dev

Run Claude Code and Codex in your own cloud environment

0
Intelligent Terminal

Windows Terminal with native agent integration

0
Carbon Voice Speed Dial

Get your whole team (humans and agents!) on speed dial

0
Sun

Collaborative voice API for agents

0
Extella.AI

Agentic platform that evolves & builds reusable systems

0
Smart Runner

Your training plan, rewritten after every run

0
Gather

Save it once, never lose it again

0
Dispatch

Your app launch hub with ASO audit, keywords, and ads

0
BeerShot

Screen recording studio for Windows

0
StampCam

Turn any photo into a postage stamp or sticker

0
Handler

Review AI edits like stacked PRs at generation time.

0
BoxBox

File manager for Linux homelab and NAS-style servers

0
Wallie V2

The open-source AI streamer that actually feels alive

0
Hermes Desktop

The agent that grows with you

0
TaskGPT

Voice agent for macOS

0
Elentaria

Your GTM: from diagnosis to execution

0
Composer

Multiplayer markdown for you, your team, and your agents.

0
InsForge Backend Branching

Git style branching for your backend

0
Carbone Skill for AI

Teach your AI to build document templates

0
Uselink

Host your HTML, share the link, and get comments

0
RadianceKit

Turn photos into 3D Gaussian Splats on your Mac

0
Franz 6

All your messaging apps in one window — with private AI

0
Spectron

Agent memory you can trust

0
Walkable

Safety-first walking navigation to walk the safest routes

0
Forward

Installs your API into a customer's codebase in one command

0
Replicas

Run Claude Code and Codex in the cloud

0
Linkeezy

LinkedIn inbox and feeds without the chaos.

0
06

TECHMEME

06.00
TECHMEME

Techmeme - June 4, 2026

Techmeme Digest: Major tech headlines and industry conversations.

Public First: 26% of Americans support increased data center construction, the lowest share among 15 large countries, such as Brazil, Japan, the UK, and Canada (Financial Times)
Source: Techmeme · Published: Jun 4, 2026

Financial Times: Americans are far more likely to oppose the construction of AI data centres than citizens of other big economies …

Robotics startup Generalist, which released its GEN-1 model to complete short physical tasks in April, raised $400M led by Radical Ventures at a $2B valuation (Dina Bass/Bloomberg)
Source: Techmeme · Published: Jun 4, 2026

Dina Bass / Bloomberg: The company raised $400 million in a funding round led by Radical Ventures. Generalist AI, a robotics startup …

London- and NY-based Airspeed, which aims to use AI agents to replace sales software like traditional CRM dashboards, raised a $20M Series A led by DN Capital (Mike Butcher/Pathfounders)
Source: Techmeme · Published: Jun 4, 2026

Mike Butcher / Pathfounders: Airspeed, the London and New York-based agentic AI startup formerly known as Glyphic, has raised a $20 million Series A to build what it calls the …

Q&A with Satya Nadella on Microsoft's competitive position, MAI models, OpenAI, the software business, GitHub Copilot, Project Solara, data centers, and more (Ben Thompson/Stratechery)
Source: Techmeme · Published: Jun 4, 2026

Ben Thompson / Stratechery: An interview with Microsoft CEO Satya Nadella about figuring out Microsoft's role in AI, the relationship with OpenAI, Capex …

Sam Altman says he has no plans to put money into the 2026 US midterms; OpenAI has tried to distance itself from the Greg Brockman-backed Leading the Future PAC (Emily Birnbaum/Bloomberg)
Source: Techmeme · Published: Jun 4, 2026

Emily Birnbaum / Bloomberg: OpenAI Chief Executive Officer Sam Altman says he has no plans to make any financial contributions toward this year's US elections …

Revolut co-founder and CTO Vlad Yatsenko plans to step down in July, to be replaced by Head of Technology Donato Lucia; Yatsenko was Revolut's first employee (Aisha S Gani/Bloomberg)
Source: Techmeme · Published: Jun 4, 2026

Aisha S Gani / Bloomberg: Revolut Ltd.'s chief technology officer, Vlad Yatsenko, and Chief Executive Officer Nik Storonsky's early partner …

Flourish, which is building Cortex AI, a brain-like synthetic intelligence system that uses less power than LLMs, raised $500M, including $100M from Jeff Bezos (Steven Levy/Wired)
Source: Techmeme · Published: Jun 4, 2026

Steven Levy / Wired: With $500 million in funding and a reported $2.5 billion valuation, Flourish wants to reinvent AI by putting real neurons under the microscope.

Challenger: US tech companies announced cuts to 38,242 jobs in May, the most since August 2024, taking 2026's total so far to 123,653 cuts, up 65%+ YoY (Julia Fanzeres/Bloomberg)
Source: Techmeme · Published: Jun 4, 2026

Julia Fanzeres / Bloomberg: US technology companies in May announced the most job cuts in nearly two years as they ramp up spending on artificial intelligence.

Amazon unveils its next-gen Proteus warehouse robot, adding AI-powered language capabilities that let workers assign it tasks, rolling out in Europe in H1 2027 (Robert Hart/The Verge)
Source: Techmeme · Published: Jun 4, 2026

Robert Hart / The Verge: The company insists its robot investments are designed to support, not replace, warehouse workers.

Corporate spending management platform Ramp raised $750M at a $44B valuation led by Iconiq, Singapore's GIC, and the OTPP, taking its total funding to $3B (Bloomberg)
Source: Techmeme · Published: Jun 4, 2026

Bloomberg: Ramp, a corporate spending management platform, has raised $750 million in a new funding round at a $44 billion valuation …

France secured €110B+ of proposed AI and data center investments this week, amounting to ~10 GW of new computing capacity, equivalent to ~10 nuclear reactors (Sarah White/Financial Times)
Source: Techmeme · Published: Jun 4, 2026

Sarah White / Financial Times: Investors warn approvals and local opposition could slow France's massive data centre build-out

Foxconn says it will work with Intel to jointly develop and deploy next-gen AI infrastructure, including server racks with Intel Xeon processors and AI chips (Wen-Yee Lee/Reuters)
Source: Techmeme · Published: Jun 4, 2026

Wen-Yee Lee / Reuters: Foxconn (2317.TW) said on Thursday it will work with U.S. chipmaker Intel (INTC.O) to jointly develop and deploy next-generation AI infrastructure …

TSMC CEO C.C. Wei says the company won't be able to fulfill the demand led by US customers even as more capacity comes online in the US over the next few years (Debby Wu/Bloomberg)
Source: Techmeme · Published: Jun 4, 2026

Debby Wu / Bloomberg: Taiwan Semiconductor Manufacturing Co.'s global chip supply will fall short of AI-fueled demand for years to come …

A look at AI consciousness debates; LLM conversations are cleverly disguised examples of sentence continuation, but that doesn't deny how impressive LLMs can be (Ted Chiang/The Atlantic)
Source: Techmeme · Published: Jun 4, 2026

Ted Chiang / The Atlantic: Anthropic is regarded as a giant among AI companies, but perhaps what it really excels in is anthropomorphism.

A deep dive into the case for data centers in space, as SpaceX prepares to go public: Elon Musk's claims, the wider debate, power supply issues, costs, and more (Daniel Nishball/SemiAnalysis)
Source: Techmeme · Published: Jun 4, 2026

Daniel Nishball / SemiAnalysis: Space DC Total Cost of Ownership Explained. Unpacking constraints from Terrestrial DCs and Chip Production.

07

STARTUP ARCHIVE

07.00
STARTUP ARCHIVE

Startup News - June 4, 2026

Startup News Roundup: Aggregating key funding and launch updates.

Marc Andreessen on the 5 personality traits of an innovator
Source: Startup · Published: Mar 31, 2026

“When you’re talking about real innovators—people who actually do really creative, breakthrough work—I think you’re talking about a couple things:”

Steve Jobs explains the importance of both thinking and doing
Source: Startup · Published: Mar 30, 2026

“The doers are the major thinkers. The people who really create the things that change this industry are both the thinker-doer in one person.”

Tobi Lutke explains what the VCs who passed on Shopify got wrong
Source: Startup · Published: Mar 27, 2026

“What a lot of free-market thinkers don’t understand is that between the demand and eventual supply lies friction.”

Sam Altman explains how he decides to invest in a startup after 10 minutes
Source: Startup · Published: Mar 26, 2026

“Does this person have the potential to be the next Mark Zuckerberg?… [You don’t get to] 100% accuracy, obviously, but it’s good enough that our business model works.”

Jony Ive recounts the time Steve Jobs called him vain
Source: Startup · Published: Mar 25, 2026

In the clip below, Jony Ive recounts the time he asked Steve Jobs to be less harsh in his critique of a piece of work.

Jeff Bezos’s two pieces of advice for aspiring entrepreneurs
Source: Startup · Published: Mar 24, 2026

“The advice that I would give entrepreneurs is don't chase the hot new thing. It's so hard to catch something that everybody already knows is hot.”

Elad Gil: “Things that work tend to work pretty fast”
Source: Startup · Published: Mar 23, 2026

“I do think there’s a bit of a myth in Silicon Valley that you should keep grinding no matter what and it’s just about perseverance, and I think that’s really bad advice.”

Paul Graham on why starting with a “small, intense fire” is the key to startup growth
Source: Startup · Published: Mar 20, 2026

"You have to know who those first users are and how you're going to get them."

Keith Rabois on how to identify great talent
Source: Startup · Published: Mar 19, 2026

“What you want to do with every single employee every single day is expand the scope of their responsibilities until it breaks… and that’s the role they should stay in.”

Wealthfront CEO on why advertising spend makes it harder to find product/market fit
Source: Startup · Published: Mar 18, 2026

“The way that you know you have product/market fit is if you have exponential organic growth.”

Eric Schmidt on why most companies get strategy wrong
Source: Startup · Published: Mar 17, 2026

“Work very, very hard to figure out what the world’s going to look like in five years. What will people be doing? What will your customers want? Where will costs be?”

Mark Zuckerberg: “You can’t 80/20 everything”
Source: Startup · Published: Mar 16, 2026

"There’s the famous 80/20 rule where you get 80% of the benefit by doing 20% of the work, but you can’t just 80/20 everything. There have to be certain things that you are just the best at."

Marc Andreessen on Mark Zuckerberg’s founder “superpower”
Source: Startup · Published: Mar 13, 2026

“A great superpower that Mark Zuckerberg has that is probably not well-understood enough is he does not get emotionally upset in stressful situations”

Sam Altman explains how to come up with a great startup idea
Source: Startup · Published: Mar 12, 2026

"If you start a startup without a good idea… you’ll be under pressure to make something up and it won’t work that well."

Jeff Bezos on the problems with proxies and managing to metrics
Source: Startup · Published: Mar 11, 2026

“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”

Airbnb founder Brian Chesky on how to design an amazing user experience
Source: Startup · Published: Mar 10, 2026

“If you can design something really amazing using the hand-crafted part of your brain, then you can reverse-engineer how to industrialize this millions of times over.”

Spencer Rascoff: “I will never invest in a consumer startup with paid marketing”
Source: Startup · Published: Mar 9, 2026

“If you’re actually trying to grow a product, the best levers for doing that are often within the product itself.”

Patrick Collison explains why it sometimes makes sense to quit
Source: Startup · Published: Mar 6, 2026

“One thing I’ve learned myself the hard way, is that it is easier to tear down a company and restart it in Silicon Valley, than it is to constantly try to pivot or keep something alive.”

Jeff Bezos recounts the time he called Amazon’s customer service number mid-meeting to prove a metric was wrong
Source: Startup · Published: Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disagree, the anecdotes are usually right”

Ben Horowitz: “Nobody was born a great manager. It’s a very unnatural job.”
Source: Startup · Published: Mar 4, 2026

“If you can’t build a great product, it doesn’t matter if you can build a great company.”

03

ALSO TODAY

3 MORE SOURCES
08

SOLIDOT

08.00
SOLIDOT

Solidot News - June 4, 2026

Solidot Feed: Highlighting essential tech & open-source news.

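Any level of alcohol consumption raises health risks

A large-scale study shows that even drinking less than one standard drink a day raises the risk of several cancers. The research team analyzed 843 cohort and case-control studies published through 2023 to systematically assess the links between alcohol and a range of diseases. For all 10 cancers examined, drinking was associated with higher risk, and the risk rose steadily with the amount consumed. Even an intake of less than 10 grams of pure alcohol per day was associated with increased risk of pharyngeal, colorectal, esophageal, breast, liver, pancreatic, and prostate cancer. The increase was largest for pharyngeal cancer, where risk more than doubled. Beyond cancer, drinking was also associated with higher risk of chronic liver diseases such as cirrhosis and of pancreatitis: the risk of chronic liver disease rose by at least 40%, and the risk of pancreatitis by at least 22%. The results show clearly that cancer risk increases at any level of alcohol intake, while the evidence that "moderate drinking is good for health" is concentrated in a few non-cancer conditions, where the associations are weak.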

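American capitalism turns to apocalypticism

Apocalypticism is the most powerful force driving American capitalism today. Elon Musk's rocket company SpaceX publicly declares that its mission is to build a colony on Mars so that humanity does not go extinct on Earth. Part of the reason Musk became the richest person in the United States is that he is the country's loudest doomsayer. Musk is racing to take SpaceX public ahead of two other prophets who hold a similar millenarian worldview. Anthropic's Dario Amodei and OpenAI's Sam Altman, as well as Palantir CEO Alex Karp and Anduril founder Palmer Luckey, are all telling some kind of end-times story. An economy that believes in millenarianism is bound to be paranoid. Peter Thiel says AI will summon the Antichrist in the form of authoritarian rule. Pope Leo XIV has called for AI to be disarmed. A new song by British pop singer Charli XCX captures the mood of both the public and the pope; its lyrics, in translation: spring, summer '26 / when the world is about to end, there is no hope / yes, we are on a runway to hell.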

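Germany's Bavaria cancels Microsoft contract in favor of open-source software

The Bavarian State Ministry for Digital Affairs has officially announced that it is cancelling its contract with Microsoft, which would have cost nearly €1 billion over five years. Bavaria will switch to open-source software instead. State Finance Minister Albert Füracker had argued for seeking a discount on the existing contract, while Digital Minister Fabian Mehring pushed for open-source software. Mehring said the switch will keep services available in times of crisis, protect Bavaria from price increases, and put data security first. Bavaria's move is part of a broader European trend in which local and federal governments across Europe are gradually reducing their dependence on Microsoft and other U.S. technology.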

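EU unveils plan to reduce dependence on U.S. tech companies

On Wednesday the EU unveiled the European Technological Sovereignty Package, which aims to strengthen technological sovereignty and reduce dependence on U.S. tech companies. Microsoft's compliance with an order from U.S. President Trump to shut down the account of the International Criminal Court's chief prosecutor was a wake-up call for all of Europe. The new plan aims to support homegrown European companies and requires that public services in highly sensitive areas not use services from foreign tech companies. The European Commission is asking member states to carry out a "sovereignty risk assessment" for every digital service they depend on, covering foreign control, potential access to sensitive data, and the risk of operational disruption. Commission President Ursula von der Leyen said Europe cannot rely on others' technology to keep hospitals running, power grids stable, and services secure, and that this is about protecting its citizens, defending its interests, and making its own choices.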

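Apple doubles MacBook Neo production capacity on strong demand

With demand far exceeding expectations, Apple is doubling production capacity for its entry-level MacBook Neo from 5 million to 10 million units. The MacBook Neo has only 8GB of memory and sells for $599, or $499 with the student discount. Apple CEO Tim Cook said he was very optimistic about the MacBook Neo before launch, but the company still underestimated consumer enthusiasm. Driven by the MacBook Neo, the number of new Mac users hit a record high last quarter. The Windows PC industry is also watching the stir the MacBook Neo has caused in the entry-level market: Dell has just launched an XPS 13 laptop starting at $699 ($599 with the student discount), but 8GB of memory is barely usable for Windows 11.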

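Google releases Gemma 4 12B, an open-source model that runs locally on laptops

Google has released Gemma 4 12B, an open-source model that can run locally on a laptop. Gemma 4 12B has 12 billion parameters and can run locally on laptops with 16GB of video memory. That rules out the vast majority of low-end and mid-range laptops, since only high-end models are likely to have 16GB or more of video memory. Gemma 4 is a multimodal model that handles text, images, and audio; it can understand visual content, process audio input, and carry out advanced reasoning tasks, which gives it a wider range of applications. Gemma 4 12B is released under the Apache 2.0 license, which imposes few restrictions.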

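Trump administration to dismantle ocean-current observing system

Starting this month, the Trump administration will dismantle the $368 million Ocean Observatories Initiative. The initiative consists of more than 900 deep-sea instruments that monitor ocean currents, marine ecosystems, carbon uptake, heat waves, fisheries, coastal flooding, and climate change. The U.S. National Science Foundation (NSF) said it will send ships to begin removing instruments moored off Oregon, Washington, Alaska, and North Carolina, and in the Irminger Sea between Greenland and Iceland. The Ocean Observatories Initiative began operating in 2016 and was originally planned to run for 25 years. Jim Edson, the marine meteorologist who leads the program, calls it the world's most advanced continuously operating ocean observing system. Removing the instruments could take 15 months. Seismometers around an active volcano off Oregon will keep running until 2028. Each observing site consists of several moorings, which measure currents and chemical and biological conditions from the surface down to depths of thousands of feet. The instruments are hardened to withstand deep-sea pressure, corrosive seawater, and marine plants and animals that can damage electronics. Remotely operated robots and gliders around the moorings collect data and transmit it to research laboratories. The system costs $48 million a year to operate. The Trump administration has repeatedly tried to shut the project down, proposing to cut 80% of its funding in both 2025 and 2026, but Congress ultimately rejected the proposal and restored the appropriation. Even so, the NSF has pressed ahead with decommissioning the observing network.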

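The genetic trade-off between youth and longevity

Scientists have found that the gene vgll3 is directly linked to growth, development, and reproductive success early in life, and to accelerated aging and increased cancer risk late in life. The new study provides experimental evidence for the antagonistic pleiotropy hypothesis, which holds that certain genes confer advantages early in life but have harmful effects later on. The researchers worked with the African turquoise killifish, a very short-lived fish, and used CRISPR gene editing to modify the gene. Fish with the modified vgll3 gene grew faster and reached sexual maturity earlier, which gives them a reproductive advantage in natural environments. The cost was a shorter lifespan and a higher chance of developing age-related cancers. The researchers note that nature prioritizes continuation of the lineage over lifespan. Humans also carry the vgll3 gene, so the study may help improve understanding of human development, aging, and age-related disease.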

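Meta lets employees opt out of tracking for up to 30 minutes at a time

Meta recently began installing tracking software on the computers of its U.S. employees. The software captures mouse movements, clicks, and keystrokes for use in training AI models, as part of the company's broader plan to build AI agents that can carry out work tasks automatically. The tool, called the Model Capability Initiative (MCI), triggered a strong backlash inside the company: some employees started a petition that has gathered more than 1,500 signatures, and one anonymous employee called the company's behavior "very dystopian." According to an internal memo sent to employees on Tuesday, Meta has backed off slightly. Employees may opt out of tracking for up to 30 minutes at a time, and may also apply to opt out of the tracking program permanently.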

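Mathematicians warn of AI's threat to the mathematics profession

Mathematicians have jointly published the Leiden Declaration, a statement backed by the International Mathematical Union. It warns that AI is undermining mathematics by producing large numbers of plausible-looking but unreliable or outright wrong proofs, weakening attribution, changing incentives, and giving tech companies outsized influence over research priorities. Hundreds of people have signed the declaration, which warns that the development of AI threatens the intrinsic value of mathematical research. First, the declaration notes that telling AI-generated proofs apart from correct mathematical proofs is very difficult, which puts growing pressure on reviewers: generating AI papers is cheap but verifying them is expensive, and errors spread if later research builds on false premises. Second, AI is trained on existing mathematical papers but often fails to cite them properly in its output, and copyright infringement is widespread in the training of AI models. Third, the incentives behind AI run counter to the values of the mathematics profession. The declaration urges mathematicians to treat AI as a tool, not as a substitute for human responsibility. Individual mathematicians should disclose their use of AI and take responsibility for the correctness of their work. The declaration also warns that mathematics can be used for war, oppression, mass surveillance, and the undermining of democracy, so mathematicians should weigh carefully the ethics of collaborating with the tech industry.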

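Microsoft's quantum chip has fundamental problems

Microsoft has announced its second-generation quantum chip, Majorana 2, but experts argue that Microsoft's quantum chips lack a solid research foundation and cannot work. Microsoft announced its first-generation quantum computing chip, Majorana 1, in early 2025. It uses what the company calls a topoconductor to observe and control Majorana particles, with the aim of producing more reliable and scalable qubits. The first-generation topoconductor used an indium arsenide semiconductor and an aluminum superconductor. For the second generation Microsoft switched to a lead superconductor and claims that qubit lifetime has increased from 20 seconds to 1 minute. Scientists are strongly skeptical of Microsoft's claims. Its latest preprint has not yet passed peer review, and physicist Henry Legg believes the data in the preprint come from random artifacts. Microsoft's previous preprint still has not passed peer review and has most likely been rejected by top journals.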

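The 4,000-year-old city of Mohenjo-daro grew more equal as its economy developed

Researchers at the University of York analyzed housing patterns in the ancient city of Mohenjo-daro. Located in present-day Pakistan, the city flourished between 2600 and 1900 BCE and was one of the largest cities of the Indus civilization. The researchers found that wealth inequality in Mohenjo-daro was lower than in other ancient cities, and that it even narrowed over time. The city differs markedly from ancient cities of other civilizations: it has no palaces, no giant statues of rulers, and no lavish tombs, but it does have orderly streets and an advanced drainage system, with public infrastructure that reached the whole city instead of serving only the elite. Ancient Egypt built pyramids for its rulers and Bronze Age Greece built palaces for its elites, while Mohenjo-daro invested in public services for the entire population. Mohenjo-daro challenges the long-held view that economic growth leads to greater inequality: as the city grew and productivity rose, resources were also distributed more fairly.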

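Qualcomm CEO says resisting AI is futile

In a keynote at Computex in Taipei, Qualcomm CEO Cristiano Amon declared that resistance is futile: AI agents will become invisible and unavoidable, and will follow users across devices. He said agents will fundamentally change the relationship between people and technology. Today the phone is the center of digital life and everything revolves around it. In the near future agents will take its place, and the phone, like a wearable, will become an extension of the agent. Amon explained that an agent is not confined to a device but moves with the user and stays with them whatever device they use, and that once this shift is understood it becomes clear how the whole mobile industry will change.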

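Smartphone shipments forecast to fall 13.9% in 2026

According to Counterpoint Research's latest smartphone market outlook tracker, the global smartphone market is entering one of its more pronounced corrections of recent years. Full-year 2026 shipments are forecast to fall 13.9% year over year to about 1.08 billion units, triggered by a memory supply crunch that has worsened in recent weeks, combined with the conflict in Iran. The data show that mobile LPDDR4/5 prices in the second quarter of 2026 are expected to rise about twofold from the fourth quarter of 2025. Given the high capital requirements and long lead times of semiconductor manufacturing, the supply crunch is expected to last until the second half of 2027. Low-end devices are hit harder. As fabs shift capacity to AI-driven HBM and server DRAM, LPDDR4 supply is expected to shrink by more than 40% in 2026, steadily eroding the cost-effectiveness of entry-level products. Global smartphone wholesale prices rose 14% year over year in the first quarter of 2026, and the upward trend will continue as earlier inventory is gradually used up. Some segments below $150 are at risk of being gradually squeezed out of the market.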

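Male bowerbirds use pretty man-made decorations to attract females

Male bowerbirds are known for their intricate courtship rituals. They build tunnels out of sticks and decorate them with brightly colored objects collected from their surroundings. When a female comes to visit, the male tosses his shiniest objects toward her and displays his splendid plumage in the hope of winning her over. According to a new paper in the journal Royal Society Open Science, urbanization, and the growing abundance of bright man-made objects that comes with it, has had a marked effect on the courtship behavior of male bowerbirds in Australia; the researchers even found handcuffs. Observations of urban and rural bowerbirds found that urban birds were more than ten times as likely to use man-made decorations as rural birds, while rural birds relied more on natural objects. Urban bowerbirds had nearly five times as many decorations as rural ones, averaging 90 items compared with only 20 for rural birds. One urban male had collected as many as 300 decorations. Whether they live in the city or the countryside, bowerbirds show a preference for man-made decorations. The researchers say human activity is changing the natural world in unexpected ways.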

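Trump signs executive order requiring AI companies to let the government evaluate new models first

U.S. President Trump signed an executive order on Tuesday requiring AI companies to let the government evaluate the capabilities of their new models in advance. The order also asks AI companies to take part, on a voluntary basis, in a benchmarking process that assesses a model's "advanced cyber capabilities" and determines whether it should be treated as a "protected frontier model." It requires AI companies to give the government access up to 30 days before a new model is officially released.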

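Vim Classic 8.3 released

In December 2025 the Vim project announced its generative AI policy: AI-generated code is acceptable as long as the use of a large language model is disclosed and the code style is consistent with the existing code. Several long-time contributors objected to accepting AI code and did not want to see it flood the project, so they chose to create forks free of AI code. One of those forks is Drew DeVault's Vim Classic. For the sake of long-term maintenance, Vim Classic is based on Vim 8.2.0148 and not on the newer Vim 9 series. DeVault has just released Vim Classic 8.3, which mainly backports some bug fixes and patches from upstream. Owing to a lack of resources, some Vim plugins are not compatible with Vim Classic.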

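European Parliament switches default search engine from Google to Qwant

According to an internal email, the default search engine on the European Parliament's internal computers will switch from Google to the French search engine Qwant on June 4, a move motivated by concerns over digital sovereignty and privacy. Qwant is described as a privacy-focused European search engine that does not track users or collect personal data. Founded in 2013, Qwant emphasizes privacy protection and offers users an alternative to Google. Searches made from the address bar in Firefox and Edge will be routed to Qwant automatically, but Members of the European Parliament remain free to use other search engines or change their default setting. The European Commission is strengthening technological sovereignty, reducing dependence on foreign technology suppliers, and supporting homegrown European technology.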

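The soil that refuses to stop breathing

For 15 years, French biochemist Sébastien Fontaine has been trying to kill soil, because he wants to know how much carbon soil releases when it contains no life at all. His team sealed soil in jars and sterilized it with gamma radiation, then waited for the soil's carbon dioxide emissions, a sign of ongoing microbial respiration, to fall. They waited weeks, then months. Under the microscope the irradiated soil showed no sign of life, yet it kept releasing carbon dioxide. The soil refused to stop breathing. Fontaine's lab repeated the experiment and got the same result, and the researchers began looking for the source of respiration in lifeless soil. The team now reports that its soil samples continued to consume oxygen and release carbon dioxide over six years. They propose that the metabolic processes that power life can also take place outside living cells. Their experiments suggest that such metabolism can operate in soil even without the biological proteins that normally organize it. If their hypothesis is correct, some biochemical reactions, such as those that release energy from carbon-rich sugar molecules, may not be unique to living organisms, and may even have existed before life appeared on Earth.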

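Blue octopus is a new species

In 2015, scientists on a deep-sea expedition in the Galápagos Islands were reviewing footage from a remotely operated vehicle when they spotted a tiny, entirely blue octopus about 1,773 meters below the surface. They captured the octopus for further analysis. Researchers have now concluded that the creature, small enough to fit in the palm of a hand, belongs to a new species. The study was published in the journal Zootaxis. The octopus has been kept in storage. Because it is unique and collecting another is extremely unlikely, the scientists were unwilling to dissect it for a full species identification. The team opted for a mini-CT scan instead, which showed that the animal has short arms with few suckers, no ink sac, smooth skin, and one enormous rachidian tooth. They named the species Microeledone galapagensis.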

09

APP STORE RANK

09.00
APP STORE RANK