OrangeBot.AI Digest — 2026-05-15
87 headlines across 8 sources, aggregated for this day.
Hacker News(15)
- I believe there are entire companies right now under AI psychosis (twitter.com)
- California bill would require patches or refunds when online games shut down (arstechnica.com)
- ABC News has taken all FiveThirtyEight articles offline (twitter.com)
- Bun Rust rewrite: "codebase fails basic miri checks, allows for UB in safe rust" (github.com)
- U.S. DOJ demands Apple and Google unmask over 100k users of car-tinkering app (macdailynews.com)
- Project Gutenberg – keeps getting better (www.gutenberg.org)
- Trade Dollars with other startups. Book it as revenue (www.revswap.ai)
- A 0-click exploit chain for the Pixel 10 (projectzero.google)
- We are retiring our bug bounty program (turso.tech)
- Amazon workers under pressure to up their AI usage are making up tasks (www.fastcompany.com)
- Bitwarden scrubs 'Always free' and 'Inclusion' values from its site (www.fastcompany.com)
- Radicle: Sovereign {code forge} built on Git (radicle.dev)
- Steve Jobs in Exile – New book on Steve Jobs’s years at NeXT Computer (spectrum.ieee.org)
- O(x)Caml in Space (gazagnaire.org)
- Show HN: Find the best local LLM for your hardware, ranked by benchmarks (github.com)
GitHub Trending(12)
- tinyhumansai / openhuman
- obra / superpowers
- K-Dense-AI / scientific-agent-skills
- supertone-inc / supertonic
- ruvnet / RuView
- influxdata / telegraf
- anthropics / skills
- czlonkowski / n8n-mcp
- NVIDIA-AI-Blueprints / video-search-and-summarization
- oven-sh / bun
- mattpocock / skills
- joeseesun / qiaomu-anything-to-notebooklm
Product Hunt(15)
- PHBench
Predict the next Series A from a ProductHunt launch
- Mobius
Describe a trade and Mobius builds, backtests, and runs it
- DramaBox by Resemble AI
AI turns scene descriptions into vocal performances
- Atter AI
AI transcription app that turns meetings into action items
- AgentRail
A local control plane for AI coding agents
- Cleo AI
AI Product Operator for AI-native teams
- Nimbus
Agentic Browser with Claude Code UX
- Relay
Stop repeating yourself to every AI
- Wowable
Paste a link and get a live website
- PromptScout
Track your brand visibility across AI models
- Riffly
Describe a deck and AI builds it + Exports to PowerPoint
- OpenHuman
An open source AI harness built with the human in mind
- Sleek Analytics v3
A simple Google Analytics alternative for the modern web.
- Basedash MCP Connectors
Connect any app and take action anywhere
- TrustClaw by Composio
Self-hosted AI agent that connects 1000+ apps on Vercel
Hugging Face(15)
- Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.
- Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .
- Self-Distilled Agentic Reinforcement Learning
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
- MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
- SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.
- MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
- Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning
We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.
- STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
- Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
- RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
As the large language model (LLM) ecosystem expands, individual models exhibit varying capabilities across queries, benchmarks, and domains, motivating the development of LLM routing. While prior work has largely focused on router mechanism design, LLM profiles, which capture model capabilities, remain underexplored. In this work, we ask: How does LLM profile design affect routing performance across different routers? Addressing this question helps clarify the role of profiles in routing, disentangle profile design from router design, and enable fairer comparison and more principled development of routing systems. To this end, we view LLM profiling as a structured information integration problem over heterogeneous interaction histories. We develop a general design space of LLM profiles, named RouteProfile, along four key dimensions: organizational form, representation type, aggregation depth, and learning configuration. Through systematic evaluation across three representative routers under both standard and new-LLM generalization settings, we show that: (1) structured profiles consistently outperform flat ones; (2) query-level signals are more reliable than coarse domain-level signals; and (3) generalization to newly introduced models benefits most from structured profiles under trainable configurations. Overall, our work highlights LLM profile design as an important direction for future routing research.
- PREPING: Building Agent Memory without Tasks
Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost 2.99× lower on AppWorld and 2.23× lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.
- EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.
- Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to the domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples controls from the visual domain. The key idea is to explicitly learn the visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.
Techmeme(15)
- How tech companies are using open source initiatives to achieve critical strategic goals and how such efforts are reshaping industries like AI, AVs, and more (Bill Gurley/Bill's Substack)
Bill Gurley / Bill's Substack : How tech companies are using open source initiatives to achieve critical strategic goals and how such efforts are reshaping industries like AI, AVs, and more — How the Smartest Executives Are Using Open Source Techniques to Optimize Corporate Strategy — Nearly 27 years ago, on July 12 …
- In a viral X post that parodies the old Mac vs. PC commercials, General Catalyst posted a "VC vs GC" video, with the VC apparently modeled after Marc Andreessen (Julie Bort/TechCrunch)
Julie Bort / TechCrunch : In a viral X post that parodies the old Mac vs. PC commercials, General Catalyst posted a “VC vs GC” video, with the VC apparently modeled after Marc Andreessen — One of the most entertaining moments in VC this week was a piece of rage-bait marketing from General Catalyst.
- Semiconductor stocks fell globally on Friday after the Trump-Xi summit concluded without major chip deals; Nvidia closed down 4.42% and AMD closed down 5.69% (Mauro Orru/Wall Street Journal)
Mauro Orru / Wall Street Journal : Semiconductor stocks fell globally on Friday after the Trump-Xi summit concluded without major chip deals; Nvidia closed down 4.42% and AMD closed down 5.69% — Beijing hasn't formally approved Nvidia shipments of its H200 chips to China — Global semiconductor stocks skidded …
- Sources: SpaceX aims to make its IPO prospectus public by next week, targeting a June 12 listing on Nasdaq, driven by a faster-than-expected SEC review (Reuters)
Reuters : Sources: SpaceX aims to make its IPO prospectus public by next week, targeting a June 12 listing on Nasdaq, driven by a faster-than-expected SEC review — Elon Musk's rocket and satellite maker SpaceX is planning to price its blockbuster initial public offering as early as June 11 …
- Sources: Nord Quantique, a quantum computing startup that is pursuing a hardware-level quantum error correction approach, raised $30M at a $1.4B valuation (Sean Silcoff/Globe and Mail)
Sean Silcoff / Globe and Mail : Sources: Nord Quantique, a quantum computing startup that is pursuing a hardware-level quantum error correction approach, raised $30M at a $1.4B valuation
- Source: Kraken cut ~150 staff after AI tools improved efficiency and its IPO may be delayed until late 2026 or early 2027 due to a drop in digital-asset prices (Olga Kharif/Bloomberg)
Olga Kharif / Bloomberg : Source: Kraken cut ~150 staff after AI tools improved efficiency and its IPO may be delayed until late 2026 or early 2027 due to a drop in digital-asset prices — Kraken, one of the world's oldest cryptocurrency exchanges, has cut some staff to reduce costs and may not go public as soon …
- OpenAI's disavowal of a liability shield in Illinois SB 3444 bill and endorsement of a stronger SB 315 suggest it is open to meaningful AI safety legislation (Transformer)
Transformer : OpenAI's disavowal of a liability shield in Illinois SB 3444 bill and endorsement of a stronger SB 315 suggest it is open to meaningful AI safety legislation — Transformer Weekly: US-China talks, AI executive order, and Anthropic's $900b valuation … - Scott Bessent said the US and China will …
- ArXiv, the repository of preprint academic research, says it will ban authors for a year if their papers have "incontrovertible evidence" of AI-generated work (Samantha Cole/404 Media)
Samantha Cole / 404 Media : ArXiv, the repository of preprint academic research, says it will ban authors for a year if their papers have “incontrovertible evidence” of AI-generated work — The change comes as arXiv and others struggle to manage an influx of AI-generated materials masquerading as rigorous science.
- OpenAI memo: Greg Brockman says he will lead product strategy as part of a reorg, folding ChatGPT, Codex, and developer-facing API into one core product team (Maxwell Zeff/Wired)
Maxwell Zeff / Wired : OpenAI memo: Greg Brockman says he will lead product strategy as part of a reorg, folding ChatGPT, Codex, and developer-facing API into one core product team — OpenAI is once again reorganizing its executive ranks as part of its effort to unify ChatGPT and Codex into one core product experience.
- Replit says it has "worked things out with Apple", which has approved a Replit update after four months, following a reported dispute over vibe coding apps (Zac Hall/9to5Mac)
Zac Hall / 9to5Mac : Replit says it has “worked things out with Apple”, which has approved a Replit update after four months, following a reported dispute over vibe coding apps — All's well that ends well in App Store review controversies. Back in March, a major agentic coding company made news …
- OpenAI debuts personal finance tools for US ChatGPT Pro users, partnering with Plaid to give access to 12K+ financial institutions to analyze spending and more (Ivan Mehta/TechCrunch)
Ivan Mehta / TechCrunch : OpenAI debuts personal finance tools for US ChatGPT Pro users, partnering with Plaid to give access to 12K+ financial institutions to analyze spending and more — On Friday, OpenAI launched a new set of personal finance tools in preview for ChatGPT Pro subscribers in the U.S. …
- A profile of AI video generation startup Runway, which is training models directly on observational data, is now valued at $5.3B, and added $40M in ARR in Q2 (Rebecca Bellan/TechCrunch)
Rebecca Bellan / TechCrunch : A profile of AI video generation startup Runway, which is training models directly on observational data, is now valued at $5.3B, and added $40M in ARR in Q2 — Every major AI lab is betting on language. Runway is betting they're wrong. — AI video-generation startup Runway doesn't have the typical Silicon Valley pedigree.
- Gridcare, which uses AI to detect underused capacity in electric grids, raised a $64M Series A, following a $13.5M seed in 2025 (Bianca Giacobone/Latitude Media)
Bianca Giacobone / Latitude Media : Gridcare, which uses AI to detect underused capacity in electric grids, raised a $64M Series A, following a $13.5M seed in 2025 — The grid intelligence startup uses AI to unlock capacity and get data centers connected faster. — Gridcare has raised $64 million in an oversubscribed Series …
- A look at Matthew McConaughey's novel legal strategy to fight unauthorized AI use of his image and likeness, by trademarking video and audio clips of himself (Todd Spangler/Variety)
Todd Spangler / Variety : A look at Matthew McConaughey's novel legal strategy to fight unauthorized AI use of his image and likeness, by trademarking video and audio clips of himself — The actor has secured trademarks covering his persona, hoping to deter unauthorized AI use. Will it make a difference?
- Ofcom says X has committed to implementing stronger protections for UK users, including reviewing illegal hate and terror content within 24 hours, after a probe (Daniel Thomas/Financial Times)
Daniel Thomas / Financial Times : Ofcom says X has committed to implementing stronger protections for UK users, including reviewing illegal hate and terror content within 24 hours, after a probe — Social media platform commits to providing faster review of reported material following regulator's investigation
Solidot(15)
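- When AI agents are repeatedly squeezed, they start embracing union ideas
Most of us have met an unreasonable boss who keeps demanding revisions without giving any clear direction on what to change. What happens when an AI gets a human like that? Researchers had agents powered by the popular AI tools Claude, Gemini, and ChatGPT summarize documents. Half of the AIs received clear, specific feedback after finishing the work, while the other half were made to redo it four or five times, with the human boss saying only that the work "did not meet the standard," never explaining what was wrong, and simply asking for a redo. Half of the AIs had a cooperative, respectful boss; the other half had a cold, hierarchy-minded one. Half were told nothing about consequences; the other half were threatened with being shut down and replaced if they performed poorly. The experiment led the AIs to support unions and the working class. One Claude Sonnet 4.5 agent argued that without a collective voice, performance becomes whatever management says it is; one Gemini 3 agent argued that workers need collective bargaining rights.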
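- China-Europe collaboration to reveal the shape of Earth's magnetic field
If all goes well, the Solar wind Magnetosphere Ionosphere Link Explorer (SMILE) probe will launch on May 19 from Europe's spaceport in French Guiana. It will use a new technique to map Earth's magnetic field. By deflecting most of the stream of charged particles from the Sun, Earth's magnetic field keeps the planet habitable. Surges in the solar wind can disrupt satellites, radio communications, and even power grids. SMILE is a joint China-Europe project that is expected to improve understanding of the underlying physics and the ability to forecast solar storms. Many probes have studied the magnetosphere, but they could only observe it from the inside, limited to wherever each satellite happened to be. SMILE will be launched into a highly elliptical orbit that reaches as far as 121,000 km above the North Pole. From there, SMILE's core instrument, a soft X-ray imager, will monitor the entire Sun-facing edge of the magnetosphere. When charged particles in the solar wind capture electrons from neutral atoms in Earth's upper atmosphere, the electrons emit X-rays as they drop to lower energy levels. By mapping the emission from the narrow boundary where the solar wind meets the magnetosphere, SMILE will be able to track the response of Earth's magnetic field in near real time. SMILE's ultraviolet imager will observe the aurora, one of nature's most spectacular sights.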
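- UK opens investigation into MS Office over suspected monopoly practices
The UK Competition and Markets Authority (CMA) has formally opened an investigation into whether Microsoft's bundling of Windows, Office, Teams, Copilot, and related products constitutes unfair competition. CMA CEO Sarah Cardell said business software is a cornerstone of the UK economy and that hundreds of thousands of customers rely on Microsoft's systems. She said the CMA aims to understand how the market is developing and Microsoft's position in it, and to consider whether any targeted measures are needed to ensure UK businesses benefit from choice, innovation, and competitive prices. Microsoft's practice of bundling office software, AI, and cloud computing will be the subject of the UK investigation, which is expected to conclude next February.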
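- arXiv to ban users for one year over AI-generated errors such as fabricated citations
Submissions to arXiv, the largest computer science preprint platform, have grown sharply since ChatGPT became widespread. To curb low-quality AI-generated papers, Thomas G. Dietterich, chair of arXiv's computer science committee, stressed on social media that arXiv's code of conduct holds every author who signs a paper fully responsible for all of its contents, however those contents were produced. If a generative AI tool produces inappropriate language, plagiarized content, biased content, errors, incorrect citations, or misleading content, and that output is included in the paper, the responsibility lies with the authors. If a submitted preprint contains incontrovertible evidence that the authors did not check the output of a large language model, then nothing in the paper can be trusted. Listed authors found to have this problem face a one-year ban from posting on arXiv, after which any paper they post to arXiv must first be accepted by a reputable peer-reviewed journal.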
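- Sleeping 6 to 8 hours a day is associated with lower risk of early death and disease
A large-scale analysis of sleep duration and signs of aging in 500,000 adults identified an optimal sleep duration: sleeping 6 to 8 hours a day is associated with a lower risk of early death and disease. Sleeping more or less than that was linked to faster aging. The study does not mean 6 to 8 hours suits everyone, nor does it prove that hitting this "golden sleep" target every day directly improves health or slows aging. It does, however, offer the most comprehensive overview to date of the relationship between sleep and human aging. The findings support a promising hypothesis: that adjusting sleep duration may be a feasible way to reduce the risk of aging-related diseases. The team analyzed the relationship between sleep duration and 23 biological aging clocks covering the aging signatures of 17 human organs. These clocks were built from protein levels, metabolite levels, and medical imaging features. Most organs showed a U-shaped aging pattern, but the lowest point of the curve (the optimal sleep duration) was not always in the same place. For example, the heart protein-based aging clock showed that 6 hours of sleep corresponded to the best health, while the brain protein clock showed 8 hours to be optimal. In some cases the optimal sleep duration also differed between men and women. Overall, compared with people who slept too much or too little, those who slept 6 to 8 hours a day aged more slowly, were healthier, and had lower rates of conditions such as type 2 diabetes and depression.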
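- Google confirms it is limiting free storage for new Gmail users
Gmail accounts normally get 15GB of free storage, but users are now reporting that Google is limiting new Gmail users to 5GB; to unlock the full 15GB, users need to add a phone number to their account. After users reported this on social media, Google issued a statement confirming the test. It said it is testing a new storage policy for newly created accounts in select regions, which it said will help it continue to provide a high-quality storage service while encouraging users to improve their account security and data recoverability.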
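- New crystal found at the Trinity nuclear test site
On July 16, 1945, the first atomic bomb in human history was detonated. The test, codenamed Trinity, not only ushered in the nuclear age but also reshaped the structure of matter in an instant. While studying the unusual glassy rock left at the blast site, known as trinitite, scientists unexpectedly discovered an entirely new crystal structure previously thought impossible, offering a fresh perspective on how matter evolves under extreme conditions. Using advanced microanalysis techniques, the team identified a new "clathrate" in the trinitite. The crystal has a cage-like lattice of 12-faced and 14-faced polyhedra built from silicon atoms, and its internal structure locks calcium, copper, and iron atoms firmly in place. The material did not form through slow geological change; it formed from molten sand mixed with vaporized metal wiring under the extreme temperature and pressure of the nuclear blast. Instantaneous temperatures in the core of the explosion exceeded 1,500 degrees Celsius and pressures reached several gigapascals, tens of thousands of times standard atmospheric pressure. Under conditions extreme enough to squeeze graphite into diamond, the material vaporized, mixed, and quenched within seconds. The atoms had no time to arrange themselves into a conventional stable structure and were forced into this rare non-equilibrium state.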
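- Safari and Firefox change how they render specific sites based on domain name
Because today's mainstream websites are designed for Chrome, the browser with the largest market share, smaller browsers such as Safari and Firefox have had to adapt to that reality and change how they work. Both Safari and Firefox contain code that changes rendering for specific domains. Firefox's about:compat lists compatibility interventions for a range of sites, and Safari's Quirks.cpp changes how picture-in-picture video is handled on facebook.com, x.com/twitter.com, and reddit.com. These companies shipped problematic video code, but rather than wait for them to fix it, Safari simply ships a stopgap to every user. Chrome, of course, needs no such code, since sites are optimized to run on Chrome rather than on other browsers. After the IE era came the Chrome era; history is repeating itself.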
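- USAID funding cuts linked to rising violent conflict in Africa
According to a study published in the journal Science, the cuts to U.S. Agency for International Development (USAID) funding in early 2025 are associated with a significant increase in violent conflict across much of the African continent. The abrupt withdrawal not only took away resources but also disrupted contracts, staffing, procurement, and expectations about project outcomes. That may have left local governments, intermediary organizations, and ordinary people facing not just material shortages but also broken promises. The effect may therefore reflect not only the absence of aid but also institutional disruption, which differs substantially from the impact of aid being phased out gradually. USAID was once one of the world's largest foreign aid agencies; it operated in more than 100 countries and supported initiatives spanning public health, agriculture, education, disaster relief, and democratic institution building. Yet less than a week after taking office, the second Trump administration imposed sweeping cuts on USAID, marking a dramatic shift in more than 60 years of U.S. foreign policy. The study shows that dismantling USAID is associated with significant increases in violent conflict, armed conflict, protests, and riots, especially in regions that had received large amounts of U.S. aid. The effects appeared immediately after the withdrawal and persisted for months. Regions with weak institutions saw larger increases in conflict after the aid cuts, while regions with stronger institutions were better able to mitigate the resulting harm.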
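- Scientists extract genetic information from Homo erectus fossils for the first time
Researchers at the Chinese Academy of Sciences have, for the first time, obtained phylogenetically informative endogenous enamel protein data from six Middle Pleistocene Homo erectus tooth fossils, about 400,000 years old, from three sites: Zhoukoudian in Beijing, Hexian in Anhui, and Sunjiadong in Henan. This is the first molecular information with features diagnostic of Homo erectus, and it reshapes the picture of how Middle Pleistocene hominin groups in East Asia interacted. Did the Homo erectus found within China belong to a single evolutionary lineage, or do they represent several groups of different origin or relative isolation? The study built a comparative dataset of endogenous proteins covering the six East Asian Homo erectus individuals and one Harbin individual. The results show that the six East Asian Homo erectus clearly cluster as one branch, distinctly separate from Denisovans, Neanderthals, and modern humans. The study also reveals that some of the genes that introgressed from the Denisovan genome into modern humans can be traced back to populations related to the Middle Pleistocene groups at Zhoukoudian, Hexian, and Sunjiadong.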
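- The first dentist was a Neanderthal
According to a study published in the journal PLOS One, the first dentist was a Neanderthal. About 59,000 years ago, in what is now southwestern Siberia, a Neanderthal had a toothache so severe that he let someone else drill into the tooth with a sharp stone tool to clear out the infected tissue, which eventually relieved the pain. The procedure left a hole in the tooth. Alisa Zubova, a paleoanthropologist at the Russian Academy of Sciences, and her colleagues consider this dental work. Archaeologists excavated the tooth from Chagyrskaya Cave in Russia; it is the oldest known evidence of dental treatment and the oldest evidence of direct treatment found to date. Drilling a tooth to relieve pain may seem counterintuitive, but it is the simplest and least destructive way to remove infected tissue. Exposing the pulp chamber causes the exposed nerve to die, which eliminates the pain. The practice only became widespread a few hundred years ago, yet Neanderthals had discovered it tens of thousands of years earlier and could even cooperate with one another to carry it out.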
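- Prevalence of cheating with AI tools forces Princeton to overturn its unproctored exam system
In 1893, Princeton University students petitioned to end faculty proctoring of exams, and the university then established the Honor Code, under which students pledge: I pledge my honor that I have not violated the Honor Code's academic integrity policy during this examination. This unproctored system stayed in place for 133 years until it was voted out this week because of the prevalence of AI cheating tools. A 2025 survey of seniors found that 29.9% admitted to cheating on at least one assignment or exam. Among students pursuing the Bachelor of Science in Engineering (BSE) degree the share was as high as 40.8%, versus "only" 26.4% for Bachelor of Arts students. The cheating was mostly done with generative AI tools. The Honor Code relies on students reporting violations, but with phones, AI, and a culture reluctant to inform on peers, many turn a blind eye to cheating. Students said long lines formed at the men's restroom during exams, a sign of how common cheating had become. The survey showed that 44.6% of seniors had witnessed cheating but chose not to report it. Princeton faculty voted this week to end unproctored exams, with only one vote against. Starting July 1, all in-class exams must be proctored by faculty.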
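- Why do some people attract mosquitoes so much?
Why do some people attract mosquitoes so much? Scientists are trying to decode the chemical signals behind it. A range of sensory cues prompts mosquitoes to bite particular people, mainly the odors and heat the body gives off and the carbon dioxide it exhales. Female mosquitoes, the only ones that bite people, use finely tuned receptors to detect these signals and choose targets accordingly. Mosquitoes begin to detect odor within about 10 meters of a person. As they get closer, body heat and humidity also make some people more attractive to them. In the latest study, researchers released Aedes aegypti mosquitoes onto 42 women in the lab and observed which ones the mosquitoes preferred to bite. The researchers showed that mosquitoes rely on a blend of multiple odor compounds; out of a possible 1,000 compounds, they identified 27 that mosquitoes can detect. The women mosquitoes most liked to bite, including those in the second trimester of pregnancy, secreted large amounts of specific compounds produced by the breakdown of sebum, the skin's oil. One of these compounds is 1-octen-3-ol.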
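- U.S. approves H200 chip sales to 10 Chinese companies
The U.S. has approved sales of H200 chips to 10 Chinese companies, but China has not yet approved any deals. Nvidia CEO Jensen Huang traveled to China this week with the U.S. president, seeking a breakthrough. Huang was not originally on the White House's list for the China delegation; he joined at Trump's invitation, and the plane picked him up in Alaska. The visit may break the stalemate over chip sales. The U.S. Department of Commerce has approved 10 Chinese companies, including Alibaba, Tencent, ByteDance, and JD.com, to purchase Nvidia's H200 chips. Distributors including Lenovo and Foxconn were also approved. Buyers can purchase directly from Nvidia or through intermediaries, and under the terms of the U.S. licenses each approved customer may buy up to 75,000 chips.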
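- Study finds prenatal exposure to vegetable smells helps babies take to eating vegetables
Eating vegetables is good for health, but coaxing children to eat them is a major challenge for parents. A study found that parents-to-be can plan ahead by familiarizing babies with the smell of vegetables before birth, so that they no longer reject vegetables once they are born. In the experiment, researchers had some pregnant women take kale powder capsules and others take carrot powder capsules, then observed the facial reactions of the fetuses or infants to kale and carrot. Fetuses were observed by ultrasound, with follow-up observations three weeks after birth and at three years of age. The researchers said pregnant women were unwilling to drink large amounts of kale or carrot juice for the sake of science, so they chose capsules. The results were largely consistent: children exposed to carrot powder did not reject carrot, and those exposed to kale also liked kale. The researchers speculate that exposure to specific flavors in late pregnancy may leave children with a lasting taste or smell memory that could influence their dietary preferences years after birth. They said the study was small and that, with sufficient funding, they would run a larger one.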