TEXT VIEW · TODAY'S DIGEST · 36 HEADLINES ACROSS 8 SOURCES

ISSUE 0905
TUE, JUN 23, 2026
OrangeBot.AI intelligently curates and filters daily tech trends and news to save you time.

The web,
read by a bot.

Ten sources — Hacker News, Product Hunt, HuggingFace, Techmeme and more — filtered, tagged, and summarized every morning for builders who don’t have time to scroll.

01

AI DIGEST

UPDATED DAILY · EDITOR'S PICK

AI News Digest

June 23, 2026

Here is a summary of today's main news events.

Tech Stocks Fall on AI Spending and Interest Rate Fears

The technology sector experienced a significant selloff, with the Nasdaq composite declining for a second consecutive day. Investors are growing concerned about the heavy spending required for artificial intelligence development, as well as the potential for future interest rate hikes by the Federal Reserve, which makes high-growth stocks less attractive.

SpaceX Shares Tumble Over 16%

Shares in Elon Musk's SpaceX dropped sharply, falling more than 16%. The decline is linked both to reports that the company plans to sell at least $20 billion in bonds and to a broader market downturn driven by a rise in U.S. bond yields.

Oil Prices Decline as U.S. Eases Sanctions on Iran

Oil futures fell after the United States waived some sanctions on Iran, a move expected to increase the global oil supply. As part of ongoing talks, the U.S. will allow Iran to sell its oil and access frozen funds for humanitarian purposes, signaling a potential de-escalation of tensions.

U.S. Government Issues Orders to Manage AI Development

The U.S. government announced new orders aimed at steering the future of artificial intelligence. The directives are designed to both accelerate the development of advanced AI systems in the country and implement safeguards to mitigate the potential security risks they pose.

Japan Reaffirms It Will Intervene to Support the Yen

Japanese officials stated they are prepared to step into currency markets to support the yen. Chief Cabinet Secretary Minoru Kihara reaffirmed the government's stance, indicating a readiness to act against what they see as excessive weakness in the nation's currency.

02

ON THE WIRE

6 SOURCES
02

HACKER NEWS


Hacker News - June 23, 2026

Hacker News Feed: Highlighting key posts and discussions.

Will It Mythos?

(swelljoe.com)

216144
Jobs and Software Is Fucked

(urflow.bearblog.dev)

306278
Steam Machine launches today

(store.steampowered.com)

17431484
Never Give Them Your Face

(nevergivethemyourface.com)

716413
Alan Greenspan has died

(www.washingtonpost.com)

233227
GLM 5.2 vs. Opus

(techstackups.com)

509331
Deno Desktop

(docs.deno.com)

1080388
Sakana Fugu

(sakana.ai)

229120
03

HUGGINGFACE


HuggingFace News - June 23, 2026

HuggingFace Feed: The latest AI models, datasets, and community updates.

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.

73
DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, heavily reliant on heuristic rules or general VLMs, are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability, proposing a paradigm shift towards Agentic Data Tailoring, which actively refines and structures data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline grounding generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building upon this, the DataClaw_0-9B model synergizes Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers high-information-density tailored data, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: https://czjdsg.github.io/MakeAnyData

61
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench

56
Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention

Self-attention is central to Transformer performance and is often the most expensive part of the Transformer at long context lengths because its pairwise token interactions scale quadratically with sequence length. Standard dense attention also applies the same set of attention heads to every token regardless of token difficulty or information content. This uniform activation can waste compute, especially as sequences grow longer and attention cost increases rapidly. We propose Grouped Query Experts (GQE), a mixture-of-experts layer on top of grouped-query attention (GQA). Within each GQA group, a router selects k query-head experts per token while all key-value (KV) heads remain dense and unchanged. Thus, GQE keeps the KV cache benefits of GQA and reduces only the active query-head computation. On a fixed 30B token budget at the 250M parameter scale, GQE matches the all-active GQA baseline in downstream accuracy while activating half the query heads per token.

40
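
To make the Grouped Query Experts routing step concrete, here is a minimal sketch of per-token top-k selection of query heads inside each GQA group. The function name `route_query_heads`, the group-by-group head layout, and the softmax gating over the selected logits are illustrative assumptions; the abstract only states that a router selects k query-head experts per token within each group while the KV heads stay dense.

```python
import math


def route_query_heads(router_logits, num_groups, k):
    """Pick the top-k query heads inside each GQA group for one token.

    router_logits: one float per query head, laid out group by group
                   (hypothetical layout chosen for this sketch).
    Returns a list of (head_index, gate_weight) pairs for the active heads.
    """
    heads_per_group = len(router_logits) // num_groups
    active = []
    for g in range(num_groups):
        start = g * heads_per_group
        group = range(start, start + heads_per_group)
        # Keep the k highest-scoring query heads in this group.
        top = sorted(group, key=lambda h: router_logits[h], reverse=True)[:k]
        # Assumption: gate weights are a softmax over the selected logits.
        peak = max(router_logits[h] for h in top)
        exps = [math.exp(router_logits[h] - peak) for h in top]
        total = sum(exps)
        active.extend((h, e / total) for h, e in zip(top, exps))
    return active


# 8 query heads in 2 groups, k = 2: four heads stay active for this token.
example = route_query_heads([0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.0, -2.0], 2, 2)
```

With 8 query heads in 2 groups and k = 2, half of the query heads are active per token, which mirrors the setting the abstract reports. The key-value heads are not touched by the router in this sketch.
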
KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

As retrieval systems scale, high-quality reranking becomes increasingly important. However, most existing rerankers, whether encoder-based or decoder-based, jointly encode the query and passage, tightly coupling their computation and limiting deployment efficiency as well as flexibility. We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling. Built on an encoder-decoder architecture, KaLM-Reranker-V1 uses the encoder to pre-encode passages with Matryoshka embedding pooling, while the decoder models the system instruction, user instruction, and query intent; cross-attention then captures relevance between the query context and passage representations. This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention. We instantiate KaLM-Reranker-V1 in three sizes, Nano, Small, and Large, with 0.27B, 1B, and 4B activated parameters, respectively. Extensive experiments on BEIR, MIRACL, and LMEB demonstrate that KaLM-Reranker-V1 achieves strong reranking performance with superior efficiency. On BEIR, KaLM-Reranker-V1 achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series; on MIRACL, despite not being extensively trained on multilingual data, KaLM-Reranker-V1 still shows excellent reranking performance. Moreover, on LMEB, reranking models demonstrate a clear advantage, with even the 0.27B Nano model remaining competitive with 7-12B embedding models.

38
World Action Models: A Survey

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.

33
CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.

24
EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

Existing embedding models are inherently static: they encode text segments in isolation, ignoring their surrounding context and temporal order. This paper introduces EvoEmbedding, a novel embedding model that generates evolvable representations for retrieval. It is tailored for long-context scenarios, where information is dynamic, sequential, and requires continuous state tracking. Our design is simple: EvoEmbedding maintains a continuously updated latent memory as it sequentially processes inputs, and uses it alongside the raw content to jointly generate evolvable embeddings. Consequently, for the same query, our model adapts its representation to retrieve distinct targets based on the evolving context, going beyond static semantic search. To equip the model with this capability, we construct EvoTrain-180K, a diverse dataset for the joint optimization of latent memory and retrieval. Furthermore, we introduce a memory queue to prevent representation collapse during recurrent encoding, alongside segment-batching techniques that tackle significant length variance and accelerate training by 3.8×. Extensive experiments show that our model not only outperforms larger-scale specialists (e.g., Qwen3-Embedding-8B and KaLM-Embedding-Gemma3-12B) across a range of long-context retrieval benchmarks, but also generalizes well to downstream tasks (e.g., personalization) with contexts 10× longer than its training window. Notably, EvoEmbedding seamlessly integrates into agentic workflows to boost performance. For instance, a naive RAG pipeline equipped with our model surpasses dedicated agentic memory systems. Project Page: https://clare-nie.github.io/EvoEmbedding.

23
BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective, without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories (encompassing single-entity and multi-entity understanding and generation tasks across and within modalities), BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.

18
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

17
OpenRath: Session-Centered Runtime State for Agent Systems

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.

16
SkillHarness: Harnessing Safe Skills for Computer-Use Agents

Computer-Use Agents (CUAs) are increasingly deployed in dynamic interactive environments, creating a growing need for continual skill learning during interaction. Recent approaches address this challenge by learning reusable skills from successful trajectories. However, these skill learning methods largely assume static and safe environments, overlooking risks from adversarial interactions (e.g., prompt injections) and environmental dynamics (e.g., pop-ups). In dynamic settings, such assumptions can lead to risky skill learning and brittle execution, undermining the reliability of CUAs. This raises the question: how can CUAs learn and use skills safely in dynamic environments? To address this problem, we propose SkillHarness, a framework for safe skill harnessing in dynamic environments. SkillHarness moves beyond static skill abstractions by modeling skill learning and utilization as a safety-constrained interaction process. Specifically, we introduce the skill boundary that leverages multi-source supervision signals to identify safe skills from interaction trajectories, and construct self-improving safety constraints throughout the skill lifecycle. In addition, SkillHarness introduces selective skill reuse, where tasks are guided to decompose according to context and completed through the selective activation of skill subsets. Our experiments demonstrate that SkillHarness significantly reduces the unsafe rate of learned skills by 57.1% and consistently improves execution stability under dynamic environmental changes, outperforming existing baselines.

14
Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.

14
Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.

11
Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

9
Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.

9
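
The Confident Decoding abstract describes an entropy-guided conservative backward search over near-final layers. The sketch below is one possible reading of that idea, not the paper's actual rule: starting from the final layer, it steps backward only while the earlier layer's next-token distribution is clearly lower in entropy. The names `pick_layer`, `max_back`, and `margin` are hypothetical, and the inputs are assumed to be per-layer next-token probability distributions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_layer(layer_probs, max_back=3, margin=0.05):
    """Choose which layer to decode from.

    layer_probs: next-token distributions ordered from earlier to final layer.
    max_back:    how many layers before the final one may be considered.
    margin:      an earlier layer must beat the current best entropy by this
                 much, otherwise the search stops (the conservative part).
    """
    best = len(layer_probs) - 1
    best_entropy = entropy(layer_probs[best])
    lowest = max(0, best - max_back)
    for layer in range(len(layer_probs) - 2, lowest - 1, -1):
        h = entropy(layer_probs[layer])
        if h < best_entropy - margin:
            best, best_entropy = layer, h
        else:
            break  # stop at the first layer that is not clearly more confident
    return best
```

When no earlier layer is clearly more confident, the function returns the final layer, so the sketch falls back to conventional decoding by default.
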
Unlimited OCR Works

Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again thrust OCR into the spotlight. A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation costs while maintaining a constant KV cache throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism - beyond OCR, it is equally applicable to tasks such as ASR, translation, etc. Codes and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.

9
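
The abstract for Unlimited OCR does not spell out how Reference Sliding Window Attention works, so the following is only a generic sliding-window cache, shown to illustrate why a bounded window keeps the KV cache constant in size no matter how long the output grows. The class name `SlidingWindowKVCache` and its methods are hypothetical, and this is not R-SWA itself.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps only the most recent `window` key/value entries."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        """Store the key/value pair for the newest decoded token."""
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=8)
for step in range(100):
    # Placeholder payloads: a real decoder would store tensors here.
    cache.append(("k", step), ("v", step))
```

After 100 decoding steps the cache still holds 8 entries, so per-step attention cost and memory stay flat, in contrast to a full cache that grows with every generated token.
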
Self-Compacting Language Model Agents

Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate this with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.

9
Foresight: Failure Detection for Long-Horizon Robotic Manipulation with Action-Conditioned World Model Latents

Long-horizon tasks are common in real-world robotic deployments, yet failure detection for such tasks remains underexplored. Detecting failures in long-horizon robotic tasks is particularly challenging because failure onset is often ambiguous and dense temporal annotations are typically unavailable. We present Foresight, a failure detection framework that monitors manipulation trajectories using latent representations from an action-conditioned world model. Foresight is trained using only final task-level success or failure labels. By leveraging predictive world-model embeddings, our method provides a unified framework for failure detection across different policies. We further use functional conformal prediction (FCP) to calibrate detection thresholds adaptively. We evaluate Foresight with state-of-the-art vision-language-action policies in simulation on LIBERO-Long, ManiSkill-Long, and BEHAVIOR-1K, compare it against state-of-the-art failure detection methods, and validate it on real robots with three long-horizon tasks on a ReactorX-200 arm and one task on a Franka arm. Our results suggest that action-conditioned world-model embeddings provide a scalable representation for reliable failure monitoring in long-horizon manipulation.

8
Training Open Models for Agentic Phone Use

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult because the environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, the same progression rises from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.

7
PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent and transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-direction structure on latent actions, encouraging radius to encode transition extent and direction to retain transition mode. PoLAR uses temporal offset between two observations as a weak proxy for transition extent, encouraging latent actions from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose expanding volume with radius offers a natural fit for more diverse transition modes at larger extents. Across in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained VLAs. These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.

7
Safe Few-Step Generation via Velocity Editing

Flow matching has recently emerged as a strong paradigm for state-of-the-art text-to-image (T2I) generation, enabling high-quality generation with a small number of sampling steps. As these models are increasingly integrated into real-world applications, ensuring safe and non-sensitive content generation has become a critical requirement. However, adapting safety and concept removal methods to this new generation framework remains an open challenge. Specifically, prior methods largely rely on iterative trajectory steering across a number of denoising steps or on CLIP-centric prompt embedding manipulation. These design assumptions pose fundamental bottlenecks for safety in flow matching-based T2I generation, where limited sampling steps constrain iterative correction and modern context-aware text encoders diminish the effectiveness of embedding-level interventions. In this paper, we propose VESFlow, a training-free safety method tailored to flow matching with extremely few sampling steps. Leveraging the fact that flow matching models learn the marginal velocity, we directly edit the velocity field via a safe-conditional posterior. VESFlow steers the trajectory toward safe outputs while leaving the conditioning prompt unchanged. Building on the observation that VESFlow leaves outputs unchanged under benign prompts, we further introduce a risk score-based filtering that bypasses velocity editing to reduce computational cost while preserving benign prompt generation. Based on this filtering, we propose VESFlow+, a stronger variant of VESFlow that not only edits the velocity toward the safe direction, but also pushes it away from the unsafe direction. Experimental results show that VESFlow+ removes the target concept, reducing the attack success rate by NudeNet to 6.3% on Ring-A-Bell and 6.8% on MMA-Diffusion on the 4-step MeanFlow model, while preserving fidelity on benign prompts.

7
Exploring the Design Space of Reward Backpropagation for Flow Matching

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highlighting a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four choices: the reward-model input, active set, integration weights, and bridge coupling, and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.

7
DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.

7
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

6
Causal Discovery in the Era of Agents

Recent attempts to combine large language models (LLMs) with causal discovery ask models to infer pairwise directions, propose graph structures, or inject language-model outputs as priors and constraints. These approaches promise faster analysis, but they also obscure whether a causal claim is supported by data and assumptions or by textual associations, prompt artifacts and hallucinated mechanisms. We argue for a different role for agents in causal discovery. Agents should inspect data, retrieve context, explain method assumptions and clarify graph outputs, but they should not supply edges, orientations, priors, constraints or causal conclusions. We propose the principle that agents assist the workflow, while causal claims remain grounded in data, explicit assumptions, formal algorithms, diagnostics and user or domain-expert decisions. We instantiate this principle in causal-learn+, an online platform that coordinates data analysis, preprocessing, method recommendation, expert-knowledge incorporation, formal discovery and interpretation around the algorithmic ecosystem of causal-learn. A case study on Big Five personality data illustrates an agent-assisted causal discovery pipeline that does not turn language-model unreliability into causal evidence. The platform is available at causallearn.com.

5
Tmax: A simple recipe for terminal agents

Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely due to difficult benchmarks, a lack of data, and a lack of simple baseline recipes. We present Tmax, the strongest open RL recipe for terminal agents to date, bringing open data recipes closer to the frontier. While simple, our recipe achieves 27% on Terminal-Bench 2.0 with only 9B parameters, outperforming much larger models from prior work. Concretely, we generate data using a novel taxonomy, combining difficulty control, personas, and verifier diversification, which allows us to cheaply generate large amounts of terminal environments for RL and SFT training. We open-source our terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets. We then train open-weight models using RL with our data, using a simple, outcome-only recipe. We release our data, models, and code as a strong baseline for future open academic work on terminal agents at https://github.com/hamishivi/tmax.

4
Tapered Language Models

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
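
As a rough illustration of a smooth cosine taper under a fixed budget, the sketch below assigns per-layer MLP widths that shrink with depth while their mean stays fixed. The abstract does not give the schedule's formula, so the function `tapered_widths`, its `taper` parameter, and the example sizes are assumptions for illustration only. It also assumes MLP parameter count scales linearly with width, so a constant total width means a constant MLP budget.

```python
import math


def tapered_widths(num_layers, mean_width, taper):
    """Cosine-tapered per-layer MLP widths whose mean stays at mean_width.

    taper in [0, 1): 0 gives a uniform stack; larger values move more
    width to early layers. Hypothetical parameterization, not the paper's.
    """
    if num_layers < 2:
        return [float(mean_width)] * num_layers
    widths = []
    for layer in range(num_layers):
        # phase runs from 0 at the first layer to pi at the last layer,
        # so cos(phase) falls monotonically from +1 to -1.
        phase = math.pi * layer / (num_layers - 1)
        widths.append(mean_width * (1.0 + taper * math.cos(phase)))
    return widths


widths = tapered_widths(num_layers=12, mean_width=4096, taper=0.5)
uniform_total = 12 * 4096

print(round(widths[0]), round(widths[-1]))  # 6144 2048
print(round(sum(widths)) == uniform_total)  # True
```

Because the cosine terms cancel in symmetric pairs across depth, the total width matches the uniform stack. In practice the widths would presumably be rounded to hardware-friendly multiples, which this sketch leaves out.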

4
Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.

4
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors, namely the effective executable length of predicted action chunks and the total physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy, suffering from severe prediction degradation at the tail of action chunks and tending to generate unnecessarily redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3x and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83x end-to-end deployment speedup without compromising task success rates.

4
Counsel: A Meta-Evaluation Dataset for Agentic Tasks

As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories presents a major bottleneck: human annotation of a single trajectory on popular agentic benchmarks can take hours, making it difficult to scale evaluations for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-judge (LLMJ) to critique agents at the process and outcome levels at scale. However, the soundness of LLMJ critiques often goes unmeasured. Here, we introduce Counsel, the first public dataset of meta-evaluations for agentic tasks. Counsel consists of process-level critiques from open-weight LLMJs on two agent benchmarks: tau-bench (customer support agents) and DA-Code (coding agents), and human meta-evaluations of these critiques. Human annotators label critiques on each flagged error as "spot on", "correct location but poor reasoning", or "should not have flagged", achieving reliable inter-annotator agreement (Krippendorff's alpha of 0.78). The resulting dataset stratifies LLMJ critiques by human alignment across both error location within a trajectory and reasoning quality, serving as valuable data to calibrate, improve, or train LLMJs for agents. Comparing open-weight judges, we find that more capable judge models and more reasoning effort both improved human agreement, with the strongest judge reaching ~88% agreement on location and ~65% on reasoning. Counsel is generated using open-weight models and is permissively licensed for broad community use, which we hope will enable rigorous study and improved alignment of LLM-based evaluators for agentic systems.

3
FastMix: Fast Data Mixture Optimization via Gradient Descent

While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code: https://github.com/hrtan/fastmix
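
The equivalence between mixture ratios and per-source loss weights can be checked numerically with the standard importance-weighting identity. The sketch below is an illustration in expectation, not the paper's derivation. The function names and the example per-source losses are hypothetical.

```python
def loss_under_mixture(ratios, source_losses):
    # Expected loss when source k is sampled with probability ratios[k].
    return sum(p * loss for p, loss in zip(ratios, source_losses))


def loss_under_uniform_with_weights(ratios, source_losses):
    # Sample every source with probability 1/K, then scale source k's
    # loss by the weight K * ratios[k].
    k = len(ratios)
    return sum((1.0 / k) * (k * p) * loss for p, loss in zip(ratios, source_losses))


ratios = [0.5, 0.3, 0.2]   # hypothetical mixture over three sources
losses = [2.0, 1.0, 4.0]   # hypothetical mean loss per source

mixed = loss_under_mixture(ratios, losses)
weighted = loss_under_uniform_with_weights(ratios, losses)
print(abs(mixed - weighted) < 1e-9)  # True
```

In the weighted form the ratios appear as plain multiplicative factors in the objective, which is what makes them amenable to gradient-based updates rather than changes to the sampler.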

3
Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment. However, they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that use specialized model variants, input/output monitors, and API permissions. The former is susceptible to jailbreaks while sacrificing capability for all users to mitigate the risks posed by a few, and the latter is fundamentally incompatible with open-weight release. In this paper, we propose Tiered Language Models (TLMs), where a single set of released weights supports multiple capability levels. In its default public configuration, a TLM behaves as a conventional LLM. A compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that exposes additional capabilities. We develop a training protocol that jointly pretrains both configurations from scratch, then fine-tunes the keyed configuration on private data with regularization to preserve the public model's behavior. We pretrain 180M- and 650M-parameter TLMs and demonstrate that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, whereas the public configuration exhibits none of these capabilities. Moreover, we show that our approach extends naturally to multiple hierarchical tiers. Because authorization operates on the model's weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. In general, TLMs take a step toward reconciling open-weight release with selective capability control.

2
UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

Vision Transformers (ViT) dominate computer vision. However, their reliance on rigid patch projectors hinders transfer to Earth Observation (EO), where input modalities, scales, and resolutions vary widely. We introduce UniverSat, a ViT-style backbone built around a Universal Patch Encoder that maps patches from arbitrary spatial, spectral, and temporal resolutions, and from both optical and non-optical sensors, into a shared embedding space with a shared set of weights. This enables training a single model on heterogeneous multimodal corpora via self-supervision, yielding robust, sensor-agnostic spatial features. We validate this approach with strong results across classification and segmentation on standard EO benchmarks from GeoBench, PANGEABench, and SpectralEarth. Our code and models are available at https://github.com/gastruc/UniverSat.

2
CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

LLM agents in knowledge-intensive question answering take retrieval and reasoning actions with incomplete knowledge about whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over-retrieving when the evidence in hand already suffices, resulting in wasted compute. To give agents a more complete picture of the state space they are operating in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent's state with additional telemetry: a calibrated self-confidence score and a grounding verifier score. We show that CalVerT can improve agents in both training-free and training-based settings. On four QA benchmarks, we find that CalVerT raises F1 by triggering retrieval in cases where agents over-rely on parametric knowledge, while cutting redundant retrieval in cases where agents have sufficient context to answer. We show that CalVerT can augment existing QA frameworks without training. Moreover, CalVerT also improves trained systems: by simply augmenting an agent's state with telemetry, we observe improvements after reinforcement learning, as compared to an agent with identical training but no CalVerT telemetry.

2
Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (x_{t+i} for i > 1). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.
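
Two of the augmentation categories named above, random token replacement and target offset prediction, are simple enough to sketch on a list of token ids. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the replacement rate, and the vocabulary size are hypothetical, and whether noise is applied to inputs only is not specified in the abstract.

```python
import random


def random_replace(tokens, vocab_size, rate, rng):
    """Replace each token with a uniformly random vocabulary id with probability `rate`."""
    return [rng.randrange(vocab_size) if rng.random() < rate else tok for tok in tokens]


def offset_pairs(tokens, offset):
    """Pair the token at position t with target x_{t+offset}.

    offset=1 is ordinary next-token prediction; offset > 1 is the
    target offset prediction described in the abstract.
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return tokens[:-offset], tokens[offset:]


rng = random.Random(0)                 # seeded for repeatability
sequence = [11, 12, 13, 14, 15]

noisy = random_replace(sequence, vocab_size=100, rate=0.15, rng=rng)
inputs, targets = offset_pairs(sequence, offset=2)
print(inputs, targets)  # [11, 12, 13] [13, 14, 15]
```

Here the offset targets are built from the clean sequence, so the noisy copy would only be used on the input side. That pairing is one plausible design choice, labeled as an assumption.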

1
Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation

Text and image conditioned 3D models now generate convincing assets, but they still offer little direct control over the space an object should occupy or avoid. In authoring, this spatial intent is often known before generation starts. A chair should fit a seating envelope, a prop should leave clearance for motion, or a part should expose a contact surface. Prompts and image views are poor carriers for such constraints, which creates the need for an explicit control interface. We present Arbor, a trainable attachment for text conditioned latent 3D generation. Arbor introduces constraint meshes as a native 3D control interface. The interface uses hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact. Unlike completion or whole object scaffold control, these meshes are not target evidence. They are local typed requirements and can include regions where no surface should appear. Arbor keeps this signal as geometry by converting constraint meshes into tokens and learning a routed attachment inside a frozen denoiser. Each latent region can therefore receive the part of the constraint that matters for its spatial location. We evaluate Arbor on automatic and artist curated control benchmarks with hull, avoidance, and touch constraints, and compare the metric trends to a user preference study. Even without dedicated compliance losses, Arbor improves constraint obedience while preserving object quality and variation under fixed constraints.

1
MeshFlow: Mesh Generation with Equivariant Flow Matching

Meshes are among the most common 3D scene representations, but directly generating meshes is challenging because the representation contains important symmetries, including permutation invariance of faces and vertices. MeshFlow learns to generate triangle meshes directly as triangle soups, avoiding the need to serialize meshes into long autoregressive sequences. We adopt equivariant optimal-transport flow matching models that respect the key symmetries of triangle soups: arbitrary permutations of faces and permutations of the vertices within each face. Toward this goal, we propose a simple yet effective modification to the Diffusion Transformer architecture, resulting in a scalable network capable of modeling a velocity field while maintaining the desired equivariance. We further introduce an optimal-transport-based training objective that improves convergence by eliminating supervision signals that violate these symmetries. MeshFlow achieves mesh quality comparable to state-of-the-art autoregressive mesh generators while providing about an 18x speedup during inference. Project page is at https://qiisun.github.io/MeshFlow/.

1
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

1
Improving Text-to-Music Generation with Human Preference Rewards

We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the challenge protocol's FAD-CLAP and CLAP score, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained over open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time CFG axis, (ii) a sweep over five score-conditioning architectures, where training and inference use different variants, (iii) expert iteration on the top decile, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. Per-stage decomposition on 100 Song Describer prompts shows training-time reward conditioning as a functional conditioning axis, expert iteration as the dominant contributor, the preference-tuning pass adding only noise-level gain, and the inference-time score scalar already saturated by the end of the chain.

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration is increasingly hard. Large retrieval benchmarks are comprehensive but too heavy to rerun during development, and there is little infrastructure for comparing production settings (dimensionality reduction, quantization, reranking) across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark that reconstructs existing retrieval suites into small datasets (Nano-sets): 35 benchmarks and 551 tasks across 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, rerankers) and their efficiency variants. Across 55 models, its overall ranking reproduces the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) at Spearman >0.97. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and reading the quality-efficiency Pareto frontier. Code, data, and leaderboard are released under the MIT license.
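
To make the Spearman figure concrete, here is a minimal pure-Python sketch of how the agreement between a small benchmark's model ranking and a full benchmark's ranking could be measured. The function names and the scores in the usage example are hypothetical and are not taken from HAKARI-Bench.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores for five models on a small and a full benchmark.
    nano_scores = [0.41, 0.47, 0.52, 0.58, 0.63]
    full_scores = [0.39, 0.45, 0.50, 0.61, 0.60]
    print(spearman(nano_scores, full_scores))
```

In the usage example only the top two models swap places, so the correlation stays high; a value near 1 means the small benchmark orders models almost exactly as the full one does.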

05

PRODUCT HUNT

Product Hunt - June 23, 2026

Product Hunt Daily Feed: Featuring noteworthy tech launches.

Bluerails Discovery

The rails AI agents use to find and pay you

OpenArt Director

Direct cinematic videos through chat

LogStitch

Find AWS Lambda failures fast, right on your Mac

NeuralAgent 3.0

AI that executes UI actions on your computer in ~285ms

Deckwise

AI presentation agent for editable decks

wildbirds

Birdwatchers app to share and discover birds socially

Buddy AI Note

Your daily memo that turns notes into a plan

prepros

Run your brand shoots from start to finish

Sakana Fugu

One Model to Command Them All

Blazly SEO

Dominate SEO with an AI content operating system

Amnesia

A Mac app that asks why you opened that tab

Cotypist

Local AI Autocomplete in your voice, anywhere on your Mac

Rosply

AI agent that controls your computer autonomously

Jotform AI App Builder

Turn ideas into powerful apps within seconds

Conduit

The local MCP gateway that cuts tokens ~90%

Thumbmagic

AI thumbnail generator trained on top-performing thumbnails

Latitude

Fix what's breaking in your AI agent

HotkeyClash

Find where your Mac keyboard shortcuts clash

Tufte

CDN and Node package to generate ASCII graphs inline

jebi

A supercharged terminal for Mac with built-in local AI

Hush

Open-source noise suppression for voice AI agents

NanoCorp

Found a company in one sentence - from website to ads

Steam Machine

A tiny, powerful PC for big-screen gaming

BestDefense.io

Pentest and patch every deploy with AI

Sipcode

Keep Claude Code's context clean for sharper answers

AlgoFly AI

The all-in-one place to build and deploy vision AI

Photoroom API

Transform product images at scale with one image editing API

Clawd

A context-aware browser mascot with 100% local offline AI

Alai 2.0

AI design partner for presentations, social posts, and more

readywhen

Your 24/7 AI Chief of Staff for commitments and follow-ups

MediaSeg

Split large media files into upload-ready chunks on macOS

uwait

Get paid while AI thinks

AirJelly

Your Proactive, Self-Organizing Second Brain

Selector Forge

Browser extension for AI-generated resilient selectors

Skybridge

The full-stack open source React framework for MCP Apps

AgentX

Evaluate AI agent, pinpoint issues, and fix with one click.

MD+HTML Reader

Review AI-generated Markdown and HTML in a focused workspace

OnBrand by SlideSpeak

Design context for AI agents

HAQQ Legal AI on Mobile

Bringing legal understanding to anyone with a phone

Agentic Document Extraction

Make the world's documents computable

Cloudflare Temporary Accounts

Let agents deploy before signup

Plansera AI

E-2 visa business plans, drafted by an AI

Notchkin

A notes app that lives in your MacBook's notch.

Laguna by Poolside

Foundation models for agentic coding and long-horizon work

oioi

a fast, glassy clipboard manager for macOS, Windows & Linux

Grok by SpaceXAI for Word

Draft, restructure & tighten wording from panel inside Word

Cloudback MCP Server

Manage your backups from Claude, Cursor, and VS Code

Backgrind

Run your AI agents over any app, even games.

Agent 37 Cloud

Give every customer their own Hermes or OpenClaw agent

Atomic Mail Agentic

Let your agents read, send, and react to email autonomously

06

TECHMEME

Techmeme - June 23, 2026

Techmeme Digest: Major tech headlines and industry conversations.

Menlo raised $3B for funds dedicated to backing AI startups, its largest fundraising to date; sources say Menlo's Anthropic stake is currently worth nearly $14B (Natasha Mascarenhas/Bloomberg)
Source: Techmeme · Published: Jun 23, 2026

Natasha Mascarenhas / Bloomberg: In 2024, Menlo Ventures made the risky decision to raise $500 million to invest in Anthropic PBC, then an underdog rival to OpenAI …

Meta's Starfire glasses with Kylie Jenner include a tiny gemstone on the lens, a metal nose pad to prevent absorbing makeup, and an AI version of Kylie's voice (Julian Chokkattu/Wired)
Source: Techmeme · Published: Jun 23, 2026

Julian Chokkattu / Wired: The new Meta-branded glasses have the same camera, microphones, and chatbot as the Ray-Bans. They come in three styles, one of which was codesigned with Kylie Jenner.

Meta says Meta Glasses are its first AI glasses to launch with Meta AI powered by Muse Spark, and come in a range of color and lens combinations for 26 styles (Meta Newsroom)
Source: Techmeme · Published: Jun 23, 2026

Meta Newsroom: Glasses are the most exciting hardware category of the AI era, the ideal device to experience an all-day AI assistant that understands the world from your perspective.

Meta Fury, Meta Adventurer, and Meta Glasses by Kylie have EssilorLuxottica stamped on the inside; Meta executives say dropping "Ray-Ban" helps lower the price (Victoria Song/The Verge)
Source: Techmeme · Published: Jun 23, 2026

Victoria Song / The Verge: For the past three years, “Meta” and “Ray-Ban” have been synonymous in the smart glasses space. Not anymore.

Meta unveils Meta Adventurer and Fury glasses, each priced at $299, its first under its own brand, and a $399 Starfire model in collaboration with Kylie Jenner (Mark Gurman/Bloomberg)
Source: Techmeme · Published: Jun 23, 2026

Mark Gurman / Bloomberg: The company is also considering camera-free models in the future. Meta Platforms Inc., which helped popularize smart glasses …

New York-based crypto analytics startup Allium raised a $40M Series B led by Amplify Partners, with participation from Kleiner Perkins and others (Ben Weiss/Fortune)
Source: Techmeme · Published: Jun 23, 2026

Ben Weiss / Fortune: Blockchains are public databases, but that doesn't mean they're legible. Even experts struggle to read their complicated strings of letters, numbers, and transactions.

The Netherlands joins the US-led Pax Silica initiative alongside South Korea and Japan to coordinate AI supply chains; Taiwan endorses it as a non-signatory (Toby Sterling/Reuters)
Source: Techmeme · Published: Jun 23, 2026

Toby Sterling / Reuters: The Netherlands will join the Pax Silica group of U.S.-allied countries coordinating AI supply chains, the foreign ministry said on Tuesday …

ByteDance unveils Seedance 2.5, saying the AI video model can generate up to 30-second clips from up to 50 reference materials, up from 12 for Seedance 2.0 (Juro Osawa/The Information)
Source: Techmeme · Published: Jun 23, 2026

Juro Osawa / The Information: ByteDance on Tuesday unveiled Seedance 2.5, its new AI video generation model, at a conference in Beijing.

Google plans a 12-week incubator, picking 10 to 20 AI startups from its "Xoogler" alumni and providing up to $350K in cloud credits and $100K in direct funding (Guinevere Grant/Bloomberg)
Source: Techmeme · Published: Jun 23, 2026

Guinevere Grant / Bloomberg: Alphabet Inc.'s Google is backing a new incubator for former employees building artificial intelligence startups …

How Sam Altman's 80+ personal investments, many from his time running YC, benefit from ties to OpenAI; 10+ companies have discussed business deals with OpenAI (Wall Street Journal)
Source: Techmeme · Published: Jun 23, 2026

Wall Street Journal: OpenAI CEO's holdings in Helion and other companies have seen significant upswings since the AI giant explored or sealed tie-ups with Altman-linked startups

Top500: China's Arm-based LineShine passes the US' El Capitan by 20%+ as the world's fastest supercomputer, the first time China has taken the crown since 2017 (Don Clark/New York Times)
Source: Techmeme · Published: Jun 23, 2026

Don Clark / New York Times: A supercomputer in Shenzhen was declared the world's fastest. It uses only standard microprocessors and not the special-purpose chips called graphics processing units.

The EU will apply a €3 duty on items under €150 from July 1, as the bloc seeks to slow the flood of low-priced merchandise from retailers like Shein and Temu (Brendan Murray/Bloomberg)
Source: Techmeme · Published: Jun 23, 2026

Brendan Murray / Bloomberg: Bargain-hunting consumers across the European Union will start feeling the pinch of higher online shopping costs next week …

South Korea's tech-heavy Kospi index falls 10%, dragged down by SK Hynix and Samsung; STMicro and ASML fall ~7%, and US tech stocks fall in pre-market trading (Chloe Taylor/CNBC)
Source: Techmeme · Published: Jun 23, 2026

Chloe Taylor / CNBC: Global stocks sold off on Tuesday, led by deep losses for tech stocks following a losing session for the sector on Wall Street.

Q&A with Nvidia VP of Healthcare Kimberly Powell on how AI can ease doctors' workloads, address trained medical staff shortages, improve patient care, and more (Cristina Criddle/Financial Times)
Source: Techmeme · Published: Jun 23, 2026

Cristina Criddle / Financial Times: The chipmaker's head of healthcare argues AI can ease many of the sector's ills, including reducing medics' workload and tackling the shortage of trained staff

Sources: Tencent is negotiating exits from minority investments in game studios in Japan, such as Tokyo-traded Marvelous, as it reassesses its global portfolio (Bloomberg)
Source: Techmeme · Published: Jun 23, 2026

Bloomberg: Tencent Holdings Ltd. is negotiating exits from several game studio investments in Japan, including Tokyo-traded Marvelous Inc. …

07

STARTUP ARCHIVE

Startup News - June 23, 2026

Startup News Roundup: Aggregating key funding and launch updates.

Marc Andreessen on the 5 personality traits of an innovator
Source: Startup · Published: Mar 31, 2026

“When you’re talking about real innovators—people who actually do really creative, breakthrough work—I think you’re talking about a couple things:”

Steve Jobs explains the importance of both thinking and doing
Source: Startup · Published: Mar 30, 2026

“The doers are the major thinkers. The people who really create the things that change this industry are both the thinker-doer in one person.”

Tobi Lutke explains what the VCs who passed on Shopify got wrong
Source: Startup · Published: Mar 27, 2026

“What a lot of free-market thinkers don’t understand is that between the demand and eventual supply lies friction.”

Sam Altman explains how he decides to invest in a startup after 10 minutes
Source: Startup · Published: Mar 26, 2026

“Does this person have the potential to be the next Mark Zuckerberg?… [You don’t get to] 100% accuracy, obviously, but it’s good enough that our business model works.”

Jony Ive recounts the time Steve Jobs called him vain
Source: Startup · Published: Mar 25, 2026

In the clip below, Jony Ive recounts the time he asked Steve Jobs to be less harsh in his critique of a piece of work.

Jeff Bezos’s two pieces of advice for aspiring entrepreneurs
Source: Startup · Published: Mar 24, 2026

“The advice that I would give entrepreneurs is don't chase the hot new thing. It's so hard to catch something that everybody already knows is hot.”

Elad Gil: “Things that work tend to work pretty fast”
Source: Startup · Published: Mar 23, 2026

“I do think there’s a bit of a myth in Silicon Valley that you should keep grinding no matter what and it’s just about perseverance, and I think that’s really bad advice.”

Paul Graham on why starting with a “small, intense fire” is the key to startup growth
Source: Startup · Published: Mar 20, 2026

“You have to know who those first users are and how you're going to get them.”

Keith Rabois on how to identify great talent
Source: Startup · Published: Mar 19, 2026

“What you want to do with every single employee every single day is expand the scope of their responsibilities until it breaks… and that’s the role they should stay in.”

Wealthfront CEO on why advertising spend makes it harder to find product/market fit
Source: Startup · Published: Mar 18, 2026

“The way that you know you have product/market fit is if you have exponential organic growth.”

Eric Schmidt on why most companies get strategy wrong
Source: Startup · Published: Mar 17, 2026

“Work very, very hard to figure out what the world’s going to look like in five years. What will people be doing? What will your customers want? Where will costs be?”

Mark Zuckerberg: “You can’t 80/20 everything”
Source: Startup · Published: Mar 16, 2026

“There’s the famous 80/20 rule where you get 80% of the benefit by doing 20% of the work, but you can’t just 80/20 everything. There have to be certain things that you are just the best at.”

Marc Andreessen on Mark Zuckerberg’s founder “superpower”
Source: Startup · Published: Mar 13, 2026

“A great superpower that Mark Zuckerberg has that is probably not well-understood enough is he does not get emotionally upset in stressful situations.”

Sam Altman explains how to come up with a great startup idea
Source: Startup · Published: Mar 12, 2026

“If you start a startup without a good idea… you’ll be under pressure to make something up and it won’t work that well.”

Jeff Bezos on the problems with proxies and managing to metrics
Source: Startup · Published: Mar 11, 2026

“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”

Airbnb founder Brian Chesky on how to design an amazing user experience
Source: Startup · Published: Mar 10, 2026

“If you can design something really amazing using the hand-crafted part of your brain, then you can reverse-engineer how to industrialize this millions of times over.”

Spencer Rascoff: “I will never invest in a consumer startup with paid marketing”
Source: Startup · Published: Mar 9, 2026

“If you’re actually trying to grow a product, the best levers for doing that are often within the product itself.”

Patrick Collison explains why it sometimes makes sense to quit
Source: Startup · Published: Mar 6, 2026

“One thing I’ve learned myself the hard way, is that it is easier to tear down a company and restart it in Silicon Valley, than it is to constantly try to pivot or keep something alive.”

Jeff Bezos recounts the time he called Amazon’s customer service number mid-meeting to prove a metric was wrong
Source: Startup · Published: Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disagree, the anecdotes are usually right.”

Ben Horowitz: “Nobody was born a great manager. It’s a very unnatural job.”
Source: Startup · Published: Mar 4, 2026

“If you can’t build a great product, it doesn’t matter if you can build a great company.”

03

ALSO TODAY

3 MORE SOURCES
08

SOLIDOT

Solidot News - June 23, 2026

Solidot Feed: Highlighting essential tech & open-source news.

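Wolves return to Europe

Last summer, a woman was walking with her two young children in a nature park near Utrecht in the Netherlands when she saw a large animal charging toward them. She first took it for a playful dog, but then heard her 6-year-old son scream: the animal was dragging him into the woods. Two adults who happened to be passing nearby drove it off with sticks. The boy's attacker was not a dog but a wolf. Wolf numbers have surged in many parts of Europe, triggering a fierce debate over how to deal with them. Thanks to strict legal protection, the gray wolf (Canis lupus) population has grown substantially since 2000, but attacks on livestock and people have also become more frequent. Last year the European Commission relaxed the rules to allow more wolves to be culled. Scientists oppose the move, arguing that genetic evidence shows wolf populations are not as large as they appear and that protecting livestock with electric fences and guard dogs is more effective than culling. Scientists estimate that EU member states now hold about 23,000 wolves, compared with about 12,000 in 2012.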

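Interstellar comet 3I/Atlas may be the oldest object found in the solar system

The interstellar comet 3I/Atlas, now crossing the solar system, may be the oldest object found in the solar system so far. It formed 12 billion years ago. Using NASA's James Webb Space Telescope (JWST), a research team precisely measured the comet's chemical composition and concluded that it was born in a star-forming region of the early Milky Way. The finding offers a glimpse of what other planetary systems are made of and how they differ from the solar system. Heated by sunlight, 3I/Atlas vents water vapor, carbon monoxide, carbon dioxide, and even vapors of metals such as nickel and iron. Two isotopic signatures give away its age; isotopes are atoms of the same element with the same number of protons but different numbers of neutrons. First, the comet's ratio of carbon-12 to carbon-13 is far higher than that of any object in the solar system. In the universe, the violent explosions of massive stars steadily build up carbon-13. The very low carbon-13 content of 3I/Atlas indicates that it formed early in the universe, before many stars had evolved to the supernova stage. Second, the comet is rich in semi-heavy water, in which some of the hydrogen atoms in the water molecules carry an extra neutron. Such molecules form more readily in the strong-radiation environments common in the cold, massive star-forming regions of the early universe.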

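Prices of DDR2 and DDR3 memory are rising

Over the past few months, the memory shortage caused by the AI boom has pushed DDR4 and DDR5 module prices up several-fold. With DDR4 and DDR5 too expensive, some hardware makers have begun lowering memory specifications and turning to older memory, which in turn has pushed up DDR2 and DDR3 prices. Market research firm TrendForce says hardware makers are replacing DDR4 with DDR3 designs, or DDR3 with DDR2-based designs, to control costs. It forecasts that DDR2 contract prices will rise about 55% to 60% in the second quarter of 2026 and a further 35% to 40% in the third quarter. Meanwhile, DDR2 manufacturers say they are shifting capacity to higher-margin products such as DDR3, DDR4, and LPDDR4.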

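Meta pauses internal AI training program after leak of sensitive information

Meta has paused an internal AI training program after a leak of sensitive information. The leak exposed employees' private conversations, performance data, and transcripts. A Meta spokesperson confirmed the incident, saying the company is investigating and that there is currently no indication that Meta employees improperly accessed any data. Meta announced the program, called the Model Capability Initiative, in April this year; it aims to use employees' keystrokes and mouse movements as training data to improve the company's AI models. The program is mandatory for most employees but drew strong objections from some, who were uneasy about their data being recorded. The latest leak has frustrated Meta employees, who criticize the company for failing to secure the data from the start.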

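Police chief used Flock license plate tracking system to track ex-girlfriends

William C. Copp, 54, the police chief of Holiday Hills, Illinois, was arrested on June 18 and charged with two counts of official misconduct. Prosecutors allege that he used Flock Group's license plate tracking system to track six people he knew, three of them ex-girlfriends. He paid particular attention to an ex-boyfriend of one ex-girlfriend, running at least 140 queries over several months; the man sought a no-contact order as a result. According to a tally by the Institute for Justice, as of June 2026 there had been at least 18 cases nationwide of police using Flock's license plate tracking system to track acquaintances. For example, a sheriff in Jerome County, Idaho, queried his wife's license plate more than 700 times in three months; a former police chief in Sedgwick, Kansas, ran 164 queries on his ex-girlfriend's plate and 64 on her current boyfriend's; and a Milwaukee police officer tracked his partner and the partner's ex more than 100 times... Flock's database can be queried without a warrant; the company claims that requiring a search warrant could endanger lives in emergencies. The ACLU, EFF, and the Institute for Justice all maintain that license plate queries should require a warrant.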

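Steam Machine starts at $1,049

Valve has officially announced pricing for its Steam Machine console. With the AI boom causing memory and SSD shortages, the price has risen to a level that will be unattractive to most people: Steam Machine 512GB at $1,049, Steam Machine 512GB + Steam Controller bundle at $1,128, Steam Machine 2TB at $1,349, and Steam Machine 2TB + Steam Controller bundle at $1,428. Valve explains that hardware prices depend directly on component costs, and that when it began sourcing components for Steam Machine in 2023, past trends suggested component prices would fall over time. Over roughly the past year, however, conditions changed quickly and markedly, most visibly for memory and storage components, which ultimately made the price originally targeted for Steam Machine unworkable. The prices announced today therefore reflect the current state of global manufacturing or, more precisely, the prices of the components it was able to secure over the past six months. To keep bots from snapping up the limited stock, Valve says it will randomize the order of reservations. It will release the first batch on June 29 and continue to process queued reservations in order as stock becomes available.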

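A look back at the attack on the AUR

The Arch User Repository (AUR), a repository of user-submitted packages, recently suffered a large-scale malicious attack. The attackers created a series of new accounts, used them to take over unmaintained packages (known as orphaned packages), inserted malicious code, and pushed malicious updates. Arch maintainers have now closed new user registration and are discussing what to do about the orphaned packages that were abused. Packages in the AUR are submitted by users; other users can search for and download a PKGBUILD file, resolve dependencies, then build, install, and update the software. The AUR does not provide binaries. It currently holds more than 107,000 packages, of which nearly 14,000 are orphaned and open for adoption. Any registered user can adopt and modify an orphaned package. The packages are not reviewed, and users install them at their own risk. Other Linux distributions have similar repositories, such as Fedora's Copr, openSUSE's Open Build Service (OBS), and Ubuntu's Personal Package Archives (PPA). But those services differ significantly from the AUR: they provide build environments similar to those of official packages, and they do not allow precompiled binaries or proprietary software. The AUR's overly permissive rules were what this attack exploited.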

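HPV vaccine cuts risk of dying from cervical cancer before 30 to nearly zero

According to WHO data, cervical cancer is the fourth most common cancer in women, and 99% of cases are caused by high-risk human papillomavirus (HPV). Although the HPV vaccine prevents about 90% of cervical cancers, its effect on survival had been unclear. In a new study published in The Lancet, researchers at Queen Mary University of London found that cervical cancer mortality among vaccinated people has fallen significantly since the HPV vaccine was introduced in 2008. The effect on mortality is so large that the researchers estimate that girls vaccinated at age 12 or 13 have almost no chance of dying from cervical cancer before age 30. For vaccinated women aged 30 to 34, the relative risk of dying from cervical cancer fell by 63%. In 2020-2024, for the first time on record, no woman aged 20 to 24 in England died of cervical cancer. Besides cervical cancer, the HPV vaccine also prevents anal, penile, vaginal, vulvar, oral, and throat cancers, as well as genital warts. Boys and girls in year 8 both receive the vaccine, and some areas offer catch-up doses to students in years 9 and 10. Before the COVID-19 pandemic the vaccination rate was close to the WHO target, but it has dropped sharply since.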

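Anthropic to require identity verification for access to certain features

Anthropic has updated its privacy policy: starting July 8, 2026, some features will require identity verification, which will be handled by Persona. Persona is a third-party identity verification company backed by Peter Thiel. Discord previously ended its age-verification partnership with Persona following strong user backlash and a data breach in February 2026.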

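Linux 7.2 kernel completely removes the strncpy function

After six years and 362 patches, the Linux 7.2 kernel has finally removed the strncpy() function entirely. strncpy() is a C string-copy function that the kernel documentation labels "actively dangerous". It has been a major source of one class of memory bugs: a kernel buffer holding sensitive data can leak bytes past the end of an unterminated string, disclosing memory contents. strncpy() has been replaced by five different functions: strscpy() for NUL-terminated destinations, strscpy_pad() for NUL-terminated, zero-padded destinations, strtomem_pad() for fixed-width fields that are not NUL-terminated, memcpy_and_pad() for bounded copies with explicit padding, and memcpy() for copies of known length.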
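
The leak described above can be modeled in a few lines. The sketch below is a Python simulation of the documented C semantics, not kernel code: `strncpy_sim`, `strscpy_sim`, and `c_string` are hypothetical helpers, and a single `bytearray` stands in for a 4-byte field followed by neighboring memory that holds stale data.

```python
E2BIG = 7  # errno value that the kernel's strscpy() negates on truncation


def c_string(buf):
    """Read bytes up to the first NUL, as C string functions would."""
    return bytes(buf).split(b"\0", 1)[0]


def strncpy_sim(dest, src, n):
    """Model C strncpy(): copy at most n bytes, zero-pad a short source,
    and write no terminator when the source has n or more bytes."""
    s = c_string(src)
    for i in range(n):
        dest[i] = s[i] if i < len(s) else 0


def strscpy_sim(dest, src, size):
    """Model strscpy(): always NUL-terminate; return the copied length,
    or -E2BIG if size is 0 or the source was truncated."""
    if size == 0:
        return -E2BIG
    s = c_string(src)
    n = min(len(s), size - 1)
    dest[:n] = s[:n]
    dest[n] = 0
    return n if n == len(s) else -E2BIG


if __name__ == "__main__":
    # A 4-byte name field followed by stale sensitive bytes.
    memory = bytearray(b"????SECRET\0")
    strncpy_sim(memory, b"root", 4)      # source fills the field exactly
    print(c_string(memory))              # the read runs on into SECRET

    memory = bytearray(b"????SECRET\0")
    result = strscpy_sim(memory, b"root", 4)
    print(c_string(memory), result)      # terminated, truncation reported
```

The strncpy model leaves the field unterminated, so a later string read continues into the stale bytes; the strscpy model truncates to fit, terminates, and reports the truncation to the caller.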

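Tyrannosaurus rex did not reach full adulthood until age 40

For years scientists believed that Tyrannosaurus rex reached adult size at around age 25, but a new study indicates that it did not become fully grown until age 40. The study is based on an analysis of 17 T. rex fossils ranging in age from juvenile to adult. It used more advanced techniques to estimate the dinosaurs' ages and complex statistical models to combine information from multiple specimens, giving a more complete picture of growth over the animal's whole life. The results indicate that T. rex kept growing for about 15 years longer than previously thought.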

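Japan announces new supercomputer "理究"

Japan's RIKEN research institute announced on the 19th that its new supercomputer, built for AI-driven scientific research, has been named "理究". The name signifies using AI to investigate ("究") the principles ("理") behind natural phenomena. The machine will be located at RIKEN's Kobe campus in Chuo Ward, Kobe, and is slated to enter service in July. On the same day RIKEN also announced ROQUO, a supercomputer that serves as a hybrid quantum computing and high-performance computing platform. Both machines use NVIDIA's GB200 NVL4 systems. ROQUO has 135 compute nodes, 540 NVIDIA Blackwell GPUs, and 270 NVIDIA Grace CPUs, with peak performance of more than 21 PFLOPS at FP64 and 5 EFLOPS at FP8.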

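US Chip Security Act would mandate location tracking for AI chips

The US Congress is considering the Chip Security Act, which would add stricter security verification features to advanced AI chips. It would require chip exporters to track where advanced chips go using custom location-verification hardware or software, so that they do not end up in countries such as China. In late March the House Foreign Affairs Committee approved the bill unanimously, 42 to 0, and sent it to the full House. Companion legislation in the Senate is still at the first stage of consideration. US chip industry groups oppose the bill, arguing that it would hamper chip exports. Nvidia, the largest AI chip maker, announced last December that it had developed technology that meets some of the bill's requirements.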

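Top 10% of consumers cause trillions of dollars in environmental damage each year

Scientists in the Netherlands and the UK have found that, in 2017 values, the top 10% of people worldwide by consumer spending cause $1.7 trillion to $5.7 trillion in environmental damage each year. Earlier research showed that the highest-consuming individuals (roughly the wealthiest) bear a disproportionately large share of responsibility for environmental damage, but that responsibility had not been quantified in monetary terms. The researchers assessed the environmental costs caused by the top 10% of consumers globally and in the richest countries on each continent. They drew on data from the Environmental Prices Handbook to assign monetary values to different kinds of environmental damage in 2017 US dollars (the most recent data available). They found that, globally, the annual environmental cost attributable to high consumers is about $2,300 to $7,500 per person, or $1.7 trillion to $5.7 trillion in total. In the United States the cost for the top 10% of consumers is markedly higher, about $19,000 to $63,000 per person, equivalent to 6% to 20% of the group's average income. The study assessed only personal consumption, whereas earlier research has shown that the richest 10% also generate substantial emissions through their investments.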

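Polymarket paid content creators to make fake winning-bet videos

Polymarket, the largest prediction market, paid dozens of content creators to produce fake videos of winning bets. It built a fake site nearly identical to its real website, instructed creators to place sham trades there, and had them conceal that they were working for Polymarket. After the fake winning videos were posted, Polymarket hired paid posters to spread and amplify them, creating the impression that many people were making money from betting. Creators said they earned up to $2,000 to $3,000 a month. An analysis of the fake videos showed that most of the bets were placed in a test environment used by Polymarket engineers. Creators said they sent finished videos to Polymarket for review; if a video was not engaging enough or showed obvious signs of fakery, Polymarket asked for a reshoot.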

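Canonical to add AI speech-to-text transcription to the Ubuntu desktop

Canonical has announced that it will add an AI speech-to-text transcription feature to the Ubuntu desktop and is seeking user feedback on it. Ubuntu 26.10, expected this October, will include an early version of the feature, called Myna. In Myna, speech recognition runs in a sandboxed component called the Canonical Inference Snap; the Speech Orchestrator manages sessions, and the Audio Adapter handles audio picked up by the microphone, denoising and chunking it before it reaches the model. Speech recognition runs locally, and once the model is installed no internet connection is needed. Audio data is not kept long term and is discarded as soon as the session ends. For now Myna will not support dictating passwords, continuous listening, or translation.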

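Nearly 60% of videos TikTok recommends to new accounts are AI slop

According to a report from video creation tool maker Kapwing, 59% of the videos TikTok recommends to new accounts are AI slop, compared with 21% on YouTube, putting TikTok at almost three times YouTube's rate. Kapwing manually reviewed more than 10,000 TikTok videos across 20 categories and ran a separate test on new accounts, counting the share of AI-generated content in the first 500 For You videos. Of TikTok's first 500 recommended videos, 294 were AI slop, while 104 of the first 500 recommended YouTube Shorts were. Among the 2,000 videos Kapwing reviewed in TikTok's kids category, 57% were AI slop, the highest share of any category; 97 of the 100 top videos under the #cartoonkids hashtag were AI-generated, the share was 83% under hashtags such as #cartoons and #babysong, and 79% under #forkids. AI-generated videos made up 35% of the science and education category, 33% of health, and 33% of history. As of last November, TikTok had labeled 1.3 billion videos as AI-generated.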
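
The headline percentages follow directly from the counts reported for the first 500 recommended videos on each platform. A quick check, where the `share` helper is a hypothetical name introduced only for this example:

```python
def share(count, total):
    """Return count / total as a percentage."""
    return 100.0 * count / total


tiktok = share(294, 500)   # 58.8, reported as 59%
youtube = share(104, 500)  # 20.8, reported as 21%
ratio = tiktok / youtube   # a little over 2.8, i.e. "almost three times"

if __name__ == "__main__":
    print(round(tiktok, 1), round(youtube, 1), round(ratio, 2))
```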

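Finnish libraries lend out sewing machines

Finnish libraries do more than lend books; they sustain important social functions. Public libraries are disappearing in other countries, yet Finland is still building new ones. The United States closed 766 public libraries between 2008 and 2019, and in the UK more than 180 libraries closed or were handed over to volunteer groups between 2016 and 2023. Finland, with a population of about 5.6 million, has more than 700 libraries. Apart from books, the biggest thing libraries lend is space: rooms can be booked free of charge for meetings, study, political discussion, or making music. Oodi, the library in central Helsinki, was named the world's best new library in 2019; it lends sewing machines, tennis rackets, and swimming pool passes. This lending culture stems from Finnish pragmatism and dates back to the agrarian era, when people often shared farm machinery. Today's city dwellers live in small homes and may need a sewing machine only once a year, so why buy one? They can use a tax-funded sewing machine at the library for free. According to a government report, 55% of Finns visit a library at least once a month. Data show that Finns use libraries an average of 9.1 times a year, compared with about 2.5 visits a year for Britons, 2.4 for Americans, and an EU average of about 3.5. Under Finland's Library Act, public libraries must promote democracy, freedom of expression, and active citizenship. Other Nordic countries have similar policies. In 2025 Finland spent nearly €371 million on public libraries, or €65.78 per capita, against £10 per capita in the UK and $45 in the United States. Finnish librarians also help users with all kinds of online tasks, from taxes and bank accounts to pensions and digital health records, and they offer help with résumés and job applications. A study of Finnish libraries concluded that they play a vital role as inclusive infrastructure. Libraries are among the few public spaces where one can simply sit quietly without having to buy anything.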

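Google reCAPTCHA introduces gesture verification

Google will ask users to wave a hand in front of their camera to prove they are human rather than a bot. reCAPTCHA, its service for telling bots from humans, has introduced gesture verification. Google says that during gesture verification it analyzes one or more videos of the user's hand performing various actions or gestures; the system processes the video to extract coordinate data for hand keypoints, including the coordinates of 21 knuckle keypoints. Google claims the video is never linked to the user's identity and is deleted once verification is complete. The system never records audio.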

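Suspected hackers hijack SMS alert system to send alerts across Brazil

The Brazilian government says that on Saturday morning phones in several Brazilian states received an unauthorized alert message in the "extreme" category containing the text misantropi4. The word replaces the final a of the Portuguese misantropia with a 4, a common hacker convention; misantropia means hatred of humankind (misanthropy). Brazil's emergency messaging system is similar to AMBER Alert in the United States and lets government officials send emergency messages directly to mobile devices in a specific geographic area. The government says its National Civil Defense alert platform has been taken offline; it believes this was a hack and is investigating.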

09

APP STORE RANK
