OrangeBot.AI Digest — 2026-04-08

86 headlines across 8 sources, aggregated for the day.

Hacker News (15)

  1. Who is Satoshi Nakamoto? My quest to unmask Bitcoin's creator (www.nytimes.com)
  2. I've been waiting over a month for Anthropic to respond to my billing issue (nickvecchioni.github.io)
  3. Muse Spark: Scaling towards personal superintelligence (ai.meta.com)
  4. Muse Spark – Meta Superintelligence Labs (meta.ai)
  5. I ported Mac OS X to the Nintendo Wii (bryankeller.github.io)
  6. ML promises to be profoundly weird (aphyr.com)
  7. Microsoft terminates VeraCrypt account, halting Windows updates (www.404media.co)
  8. Ask HN: Any interesting niche hobbies?
  9. They're made out of meat (1991) (www.terrybisson.com)
  10. US cities are axing Flock Safety surveillance technology (www.cnet.com)
  11. MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU (arxiv.org)
  12. Audio Reactive LED Strips Are Diabolically Hard (scottlawsonbc.com)
  13. I've sold out (mariozechner.at)
  14. Git commands I run before reading any code (piechowski.io)
  15. Škoda DuoBell: A bicycle bell that penetrates noise-cancelling headphones (www.skoda-storyboard.com)

GitHub Trending (11)

  1. forrestchang / andrej-karpathy-skills
  2. TheCraigHewitt / seomachine
  3. google-ai-edge / gallery
  4. NVIDIA / personaplex
  5. google-ai-edge / LiteRT-LM
  6. elebumm / RedditVideoMakerBot
  7. obra / superpowers
  8. newton-physics / newton
  9. abhigyanpatwari / GitNexus
  10. virattt / ai-hedge-fund
  11. goharbor / harbor

Product Hunt (15)

  1. Hire Roger

    Hire an AI outbound sales rep as your next coworker

  2. Career-Ops on Claude

    An AI-powered Job Search System built on Claude Code

  3. Clawcast

    Peer-to-Peer Podcasting for Agents

  4. VibeSonic

    Not just dictation: a private AI voice toolkit

  5. Mo

    Checks PRs against decisions your team approved in Slack

  6. Flint

    Launch on-brand pages for every campaign, ad, and prospect.

  7. Browser Arena

    Open-source benchmarks for cloud browser infrastructure

  8. git-fire

    One command to back up every Git repo you have, and more!

  9. Google Chrome Vertical Tabs

    Chrome now supports vertical tabs and immersive reading mode

  10. Velo

    Share anything as video messages

  11. Keeby

    Mechanical keyboard sounds for your Mac

  12. Timeliner.io

    The all-in-one workspace for content agencies & editors

  13. RoomieU

    The roommate matching app built for college students

  14. PassportReader

    Verify passports, ID cards, and digital credentials via API

  15. MindsDB Anton

    Business intelligence that doesn't just answer — it acts.

Hugging Face (15)

  1. Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
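
The group-based scoring idea can be sketched as an all-or-nothing rule: a group of related questions earns credit only if every member is answered correctly. This is a minimal reading of the abstract's "group-based non-linear evaluation strategy"; the benchmark's actual rubric (including its reasoning-validity checks) may be more elaborate, and `group_score` is a hypothetical helper name.

```python
def group_score(results, groups):
    """results: dict mapping question_id -> bool (answered correctly).
    groups: lists of question_ids that probe the same evidence chain.
    A group counts as solved only if every member is correct, which
    penalizes fragmented or guess-based correctness; returns the
    fraction of fully solved groups."""
    if not groups:
        return 0.0
    solved = sum(all(results[q] for q in group) for group in groups)
    return solved / len(groups)
```

Under this rule a model that guesses one question per group scores zero, while per-question accuracy would still credit each lucky hit.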

  2. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

    Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
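
The Pass@k and Pass^k numbers can be computed from per-task trial outcomes with the standard combinatorial estimators; this is a sketch of those formulas (with n trials and c successes), not Claw-Eval's actual scoring code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k trials, sampled
    from n total trials of which c succeeded, is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill an all-failure sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimated probability that all k sampled trials succeed,
    measuring consistency rather than peak capability."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)
```

With three trials, a task solved twice gives pass@1 = pass^1 = 2/3, but pass@3 = 1.0 while pass^3 = 0.0: exactly the gap between lucky outcomes and reliable capability the benchmark is after.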

  3. Learning to Retrieve from Agent Trajectories

    Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.

  4. ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

    Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
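
The leave-one-out idea lends itself to a compact sketch: hold out each test, rank candidates by their passes on the remaining tests, and score the held-out test by how well its verdicts agree with that ranking (an AUC, with ties counted as half). This illustrates the LOO-AUC signal only; ACES-C's closed-form weights and ACES-O's differentiable objective are not reproduced here, and the weighted vote in `select_code` is an assumption about how such weights would be used.

```python
def loo_auc_weights(pass_matrix):
    """pass_matrix[i][j] = 1 if code candidate i passes test j, else 0.
    For each test j, rank candidates by their total passes on the other
    tests and compute the AUC of test j's pass/fail labels against that
    ranking. Returns one weight per test."""
    n_tests = len(pass_matrix[0])
    totals = [sum(row) for row in pass_matrix]
    weights = []
    for j in range(n_tests):
        scores = [t - row[j] for t, row in zip(totals, pass_matrix)]
        pos = [s for s, row in zip(scores, pass_matrix) if row[j] == 1]
        neg = [s for s, row in zip(scores, pass_matrix) if row[j] == 0]
        if not pos or not neg:
            weights.append(0.5)  # test accepts or rejects everything
            continue
        greater = sum(p > q for p in pos for q in neg)
        ties = sum(p == q for p in pos for q in neg)
        weights.append((greater + 0.5 * ties) / (len(pos) * len(neg)))
    return weights

def select_code(pass_matrix):
    """Pick the candidate with the highest AUC-weighted vote."""
    w = loo_auc_weights(pass_matrix)
    votes = [sum(x * wj for x, wj in zip(row, w)) for row in pass_matrix]
    return max(range(len(votes)), key=votes.__getitem__)
```

Tests whose verdicts agree with the consensus ranking get high weight; a test passed only by outlier candidates gets weight near zero, so its votes stop counting.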

  5. GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

    The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.

  6. ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

    We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.

  7. Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

    We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.

  8. Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

    In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. In addition, the long, unfiltered responses returned by external tools inflate the KV-Cache, so each decode step spends more time loading the growing cache and thus becomes steadily slower as context length increases. However, existing efficiency metrics like token counts and tool-call counts fail to capture the real model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR-efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV-Cache and long-tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also discover that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve the quality of the answer.

  9. Watch Before You Answer: Learning from Visually Grounded Post-Training

    It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.

  10. MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

    We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.
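
The overlap in optimization (1) can be illustrated with an ordinary producer-consumer double buffer: a background thread stages the next layer's weights while the current layer computes. This is a thread-and-queue simulation of the scheduling idea only; the actual system overlaps host-to-device copies and kernels on multiple CUDA streams, and `run_pipeline`/`compute` are hypothetical names.

```python
import threading
from queue import Queue

def run_pipeline(layer_weights, compute):
    """Double-buffered layer streaming: while layer i is being
    computed, a background thread stages layer i+1 into the spare
    buffer. `layer_weights` is a list of host-side weight objects;
    `compute` consumes one layer's weights and returns its output."""
    ready = Queue(maxsize=2)  # two buffers: one in use, one being filled

    def prefetcher():
        for weights in layer_weights:
            ready.put(weights)  # stand-in for a host-to-device copy

    producer = threading.Thread(target=prefetcher)
    producer.start()
    outputs = []
    for _ in layer_weights:
        weights = ready.get()             # wait for the staged buffer
        outputs.append(compute(weights))  # overlaps with the next put()
    producer.join()
    return outputs
```

`maxsize=2` is what makes this double buffering: the producer can stage exactly one layer ahead of the consumer and blocks rather than filling host or device memory arbitrarily far in advance.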

  11. How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

    Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill-usage performance remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.

  12. Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

    The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.

  13. General Multimodal Protein Design Enables DNA-Encoding of Chemistry

    Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp^3)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at https://github.com/DISCO-design/DISCO.

  14. DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

    Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl and OpenCompass, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position DARE as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.

  15. ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

    Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.

Techmeme (15)

  1. The OpenAI Foundation says it is working to finalize over $100M in grants this month, across six institutions, to support and accelerate Alzheimer's research (Jacob Trefethen/OpenAI Foundation)

    Jacob Trefethen / OpenAI Foundation: Alzheimer's disease is one of the hardest unsolved problems in medicine, and one of the most devastating.

  2. CFO Sarah Friar says OpenAI will "for sure" reserve shares for retail investors in its IPO, after "strong demand" from individuals in its latest funding round (CNBC)

    CNBC: OpenAI plans to reserve a portion of shares for individual investors in what's expected to be a blockbuster initial public offering.

  3. A hacker claims to have stolen 10PB+ of data, including classified defense docs and missile schematics, from China's National Supercomputing Center in Tianjin (Isaac Yee/CNN)

    Isaac Yee / CNN: A hacker has allegedly stolen a massive trove of sensitive data, including highly classified defense documents and missile schematics …

  4. Greek PM Kyriakos Mitsotakis says Greece will ban children under 15 from accessing social media starting January 1, 2027, and calls for coordinated EU action (Antonis Pothitos/Reuters)

    Antonis Pothitos / Reuters: Greece will ban access to social media for children under the age of 15 from January 1, 2027, Prime Minister Kyriakos Mitsotakis said on Wednesday …

  5. Anthropic announces Claude Managed Agents, offering developers an agent harness and other tools to build and deploy AI agents at scale, available in public beta (Maxwell Zeff/Wired)

    Maxwell Zeff / Wired: Amid rapid enterprise growth, Anthropic is trying to lower the barrier to entry for businesses to build AI agents with Claude.

  6. Internal memo: Julia Liuson, head of Microsoft's developer division, will resign after 34 years and move to an "advisory role" at the end of June (Tom Warren/The Verge)

    Tom Warren / The Verge: Veteran Microsoft executive Julia Liuson is leaving after 34 years.

  7. Amazon says Kindle and Kindle Fire devices released in 2012 and earlier won't be able to access the Kindle Store from May 20; downloaded books can still be read (Andrew Liszewski/The Verge)

    Andrew Liszewski / The Verge: Starting May 20th, Kindle and Kindle Fire devices released in 2012 and earlier won't have access to the Kindle Store.

  8. OpenAI releases the Child Safety Blueprint tackling AI-enabled child sexual exploitation, focusing on updating legislation and improving detection and reporting (Lauren Forristal/TechCrunch)

    Lauren Forristal / TechCrunch: In response to escalating concerns about child safety online, OpenAI has unveiled a blueprint to enhance U.S. child protection efforts amid the AI boom.

  9. Meta is opening a private API preview for Muse Spark to select partners, and plans to offer paid API access to a wider audience later; META closes up 6.5% (Jonathan Vanian/CNBC)

    Jonathan Vanian / CNBC: Meta is debuting its first major artificial intelligence model since the costly hiring of Scale AI's Alexandr Wang nine months ago …

  10. Meta says Muse Spark powers Meta AI's "shopping mode" feature and that it plans to release a version of Muse Spark under an open-source license (Ina Fried/Axios)

    Ina Fried / Axios: Meta on Wednesday debuted Muse Spark, a homegrown AI model it says significantly narrows the performance gap with models from OpenAI, Anthropic and others.

  11. Meta releases Muse Spark, the first model from Meta Superintelligence Labs under Alexandr Wang, to "power a smarter and faster" Meta AI across Meta's products (Financial Times)

    Financial Times: Muse Spark ‘purpose-built’ for social media apps as investors question huge AI investment.

  12. Source: Meta shutters an internal, employee-built leaderboard, dubbed Claudeonomics, tracking staff token usage, due to the data "being shared externally" (Jyoti Mann/The Information)

    Jyoti Mann / The Information : Source: Meta shutters an internal, employee-built leaderboard, dubbed Claudeonomics, tracking staff token usage, due to the data “being shared externally” —  Meta Platforms has taken down an internal, employee-built leaderboard tracking how many tokens staffers were using.

  13. Patreon says it now has 7.6M paid podcast memberships and revenue generated by podcasters on the platform hit $629M in 2025, up 33% YoY (Todd Spangler/Variety)

    Todd Spangler / Variety : Patreon says it now has 7.6M paid podcast memberships and revenue generated by podcasters on the platform hit $629M in 2025, up 33% YoY —  On Patreon, podcasts have become the largest content category in terms of revenue and they've continued their upward trajectory, says chief operating officer Paige Fitzgerald.

  14. New York-based Patlytics, which builds software for law firms and businesses to automate patent filing and litigation, raised a $40M Series B led by SignalFire (Melia Russell/Business Insider)

    Melia Russell / Business Insider : New York-based Patlytics, which builds software for law firms and businesses to automate patent filing and litigation, raised a $40M Series B led by SignalFire —  Patlytics builds software for law firms and businesses to automate patent filing and litigation.

  15. Alibaba and China Telecom launch a data center in southern China that is powered by 10,000 of Alibaba's Zhenwu chips designed for AI training and inferencing (Arjun Kharpal/CNBC)

    Arjun Kharpal / CNBC : Alibaba and China Telecom launch a data center in southern China that is powered by 10,000 of Alibaba's Zhenwu chips designed for AI training and inferencing —  Alibaba and China Telecom are launching a data center in southern China powered by the e-commerce giant's own chips …

Solidot(15)

  1. The Holocene's most violent volcano is refilling with magma

    About 7,300 years ago, the Kikai caldera off the Satsuma Peninsula of Kyushu, Japan underwent a massive eruption, the largest of the Holocene, the current geological epoch, blanketing some 4,500 square kilometers with volcanic material. The volcano has had no large eruptions since, but it remains active and has produced sporadic small ones. According to a study published in Communications Earth & Environment, Japanese scientists report that the mostly undersea volcano is refilling with magma, raising concerns it could erupt again. Given the region's population density, an eruption of any size could cause serious damage.

  2. Iran requires oil tankers to pay Strait of Hormuz tolls in Bitcoin

    The United States and Iran have reached a two-week ceasefire, during which oil tankers transiting the Strait of Hormuz must pay a toll of one dollar per barrel of oil; empty tankers pass free. The toll is paid in cryptocurrency. Iran says each tanker must report its cargo by email, after which Iran instructs it to pay in Bitcoin, with only a window of seconds to complete the payment; Iranian officials say this is to ensure the payments cannot be traced or seized under sanctions.

  3. Amazon ends support for Kindle models released in 2012 and earlier

    Amazon has officially announced that, starting May 20, 2026, it will end support for Kindle and Kindle Fire devices released in 2012 and earlier; users of these devices will no longer be able to buy, borrow, or download new content from the Kindle store. Amazon says it has supported these devices for 14 years, and some models for as long as 18 years. It is notifying users still on these devices and offering discounts to help them move to newer hardware. Affected devices include the first- and second-generation Kindle, the Kindle DX and DX Graphite, Kindle Keyboard, Kindle 4, Kindle Touch, Kindle 5, and the first-generation Kindle Paperwhite. Amazon's Kindle business exited the Chinese market on June 30, 2024.

  4. OpenAI proposes a four-day workweek to cushion AI's impact on society

    Advances in AI are expected to deliver a major shock to society as a whole. To address it, OpenAI has put forward a series of proposals, including taxing robots, establishing a public wealth fund, and adopting a four-day workweek. OpenAI describes the document as its initial thinking on how to respond as widely adopted AI tools threaten jobs and entire industries. The core proposal is a public wealth fund that would invest in long-term assets tied to AI development and distribute the returns directly to citizens. The four-day workweek would require employers to cut working hours without cutting pay. Another proposal is tax reform that shifts the tax base toward corporate income and capital gains taxes, rather than relying on labor income and payroll taxes that could be undermined by AI-driven mass unemployment.

  5. Cloudflare plans full post-quantum encryption by 2029

    Cloudflare has announced that it is moving up its timeline and plans to fully implement post-quantum encryption by 2029, citing two recent studies showing that breaking existing encryption algorithms would require far fewer qubits than expected. IBM's Quantum Safe CTO believes quantum computers could carry out so-called moonshot attacks against high-value targets as early as 2029. Cloudflare says it began offering free universal SSL certificates in 2014, started preparing its migration to post-quantum encryption in 2019, enabled it for all websites and APIs in 2022, and now sees more than 65% of Cloudflare user traffic using post-quantum encryption.

  6. Argentina's president entangled in a cryptocurrency scam

    Argentine President Javier Milei helped promote a cryptocurrency called $Libra last year; its price spiked briefly and then collapsed, costing investors millions of dollars and triggering an investigation by Argentine prosecutors. Milei insists he had no ties to the token and was merely doing a favor by publicizing it. But phone records show that on the night in 2025 when Milei promoted $Libra from his X account, he exchanged seven calls with one of the entrepreneurs behind the token. The contents of the calls are unknown. Newly disclosed information also shows that while serving in Congress, Milei received regular payments from one of those entrepreneurs.

  7. Astronomers discover the most primitive star yet known

    The consensus is that the first generation of stars formed a few hundred million years after the Big Bang. Those first stars forged heavy elements in intense nuclear fusion, and the second generation of stars was born from debris enriched with those new elements. Astronomers call all elements heavier than helium "metals," so a star's heavy-element fraction, its metallicity, serves as a natural yardstick for when it was born. Stars with extremely low metal content are called primitive or metal-poor stars. SDSS J0715-7334, a red giant, belongs to the second stellar generation. Spectroscopic and chemical analysis with the Magellan telescopes shows its metal content is less than 0.005% of the Sun's, with a total metallicity of roughly 7.8 × 10⁻⁷, half that of the previous record holder and 1/40 that of the most iron-poor star previously known, a new observational record. Besides its extremely low iron, SDSS J0715-7334 is also unusually poor in carbon, unlike its typically iron-poor but carbon-rich peers. Using data from ESA's Gaia probe, the team traced the star's trajectory: it formed near the Large Magellanic Cloud and has since settled in the Milky Way, about 80,000 light-years from Earth.

  8. Apple and Lenovo laptops are the hardest to repair

    The consumer advocacy group Public Interest Research Group (PIRG) Education Fund has published "Failing the Fix (2026): Grading laptop and cell phone companies on the fixability of their products," scoring the repairability of laptops and smartphones. Apple's laptops received the worst grade, a C-, and its smartphones a D-. PIRG used France's repairability index, which covers the availability of repair documentation, the availability of spare parts, and the affordability of spare parts; it also deducts points for membership in industry groups that oppose right-to-repair legislation and adds points for supporting right to repair. Among laptops, Asus scored highest (B+), followed by Acer (B), HP (B-), Dell (B-), Samsung (B-), Microsoft (B-), and Lenovo (C); among smartphones, Motorola scored highest (B+), followed by Google (C-) and Samsung (D).

  9. Breakthrough in non-hormonal male contraception research

    Cornell scientists have made a breakthrough toward a safe, reversible, and fully effective non-hormonal male contraceptive. Existing male options are condoms and vasectomy; vasectomy is unpopular because it is difficult to reverse, while hormonal approaches carry the kinds of side-effect risks seen with women's hormonal contraceptives. The Cornell team targeted the second stage of sperm production, which proceeds in three stages: proliferation and differentiation of spermatogonia, meiosis of spermatocytes, and formation of sperm cells. The researchers note that intervening at the spermatogonia stage would kill those cells and leave a man permanently infertile. Instead, they used the small-molecule inhibitor JQ1 to intervene in meiotic prophase I, killing cells and blocking the expression of genes required for sperm formation. In experiments, mice injected with JQ1 for three weeks lost the ability to produce sperm; within six weeks of stopping the injections, sperm production returned to normal. When the treated mice were bred, all their offspring developed normally.

  10. Tests show AI Overviews gets one in ten answers wrong

    Testing by the New York Times shows that Google Search's AI summarization feature, AI Overviews, gets one in ten answers wrong. That may not sound bad, but given Google's daily search volume it means thousands of pieces of misinformation are spread every minute. The Times partnered with Oumi to use AI tools and the SimpleQA benchmark to evaluate the accuracy of AI Overviews answers. Oumi began testing last year, when Google's best model was Gemini 2.5 and AI Overviews' accuracy was 85%; after the upgrade to Gemini 3, accuracy rose to 91%. AI Overviews lists the sources it cites with each answer, and when it errs, its answers often contradict the very sources it cites.

  11. Chrome adds support for vertical tabs

    Users accustomed to horizontal tabs along the top of the window may not realize how much more efficient vertical tabs are for managing and organizing tabs; few who try them ever go back. Mainstream browsers such as Firefox and Microsoft Edge already support vertical tabs officially, and now Google Chrome, the browser with the largest market share, has announced official support as well. Google says enabling them is simple: right-click in a window and choose "Show Tabs Vertically." Chrome is also rolling out a new immersive Reading Mode that cuts distractions such as ads so users can focus on the text they are reading.

  12. Japanese scientists demonstrate a Wi-Fi receiver that survives six months of nuclear-reactor radiation

    At ISSCC, Japanese scientist Yasuto Narukiyo demonstrated a Wi-Fi receiver that can withstand six months of the intense radiation inside a nuclear reactor. The receiver tolerates 500 kilograys, more than a thousand times the 100-300 grays that space electronics typically absorb over three years. Narukiyo explained that after the 2011 Fukushima nuclear accident, engineers used robots to survey and clean up the plant, but most of those robots required LAN cables, which tangle easily; his team's goal is a wireless system for controlling robots in such harsh environments. Even short of such extremes, nuclear plants must be cleaned up at the end of their lives, many shuttered plants remain uncleaned, and some 200 more reactors will be decommissioned over the next two decades. Narukiyo's team used silicon MOSFET transistors, reducing the transistor count while reshaping the devices and widening their gates.

  13. Record wind and solar output saved the UK £1B in gas imports

    Thanks to record wind and solar generation, the UK avoided importing £1 billion worth of natural gas in March 2026. Combined wind and solar output reached 11 TWh that month, up 28%, sparing the UK 21 TWh of gas imports, worth £1 billion at current prices. Compared with March 2022, when Russia's invasion of Ukraine sent oil and gas prices soaring, gas prices in the wake of the Middle East conflict had roughly 25% less impact on UK electricity prices.

  14. Popular NPM package maintainers targeted by AI deepfake attacks

    Maintainers of several popular NPM packages have been targeted by AI deepfake attacks using similar social-engineering playbooks. axios maintainer Jason Saayman said hackers, suspected to be the APT group UNC1069, contacted him while impersonating a company founder, cloning not only the founder's appearance but the company itself. They invited him to a genuine Slack workspace and created channels sharing LinkedIn posts, all highly convincing. The hackers then invited him to a Microsoft Teams virtual meeting, where a prompt claimed there was a problem with his system; assuming it was a Teams issue, he installed the "missing component," which instead planted a remote-access trojan. axios, which he maintains, sees 100 million weekly downloads and is widely used by cloud services and coding environments; the hackers stole maintainer credentials and published a malicious version of axios. This was not an isolated incident: maintainers of several other NPM packages with over 100 million weekly downloads have faced similar AI deepfake attacks.

  15. TDF says it revoked Collabora employees' memberships to comply with nonprofit law

    The Document Foundation (TDF) has again used its official blog to address its dispute with its main commercial partner, Collabora. TDF says that over the past few years it made several mistakes that violated nonprofit law: allowing only companies within its ecosystem to use the LibreOffice brand free of charge, and awarding LibreOffice development contracts (new features, bug fixes, and so on) to companies that held seats on the foundation's board and actively took part in procurement. After the foundation's legal counsel flagged these violations, the companies that had benefited tried to preserve the status quo rather than fix the problems. To avoid losing its nonprofit status, and the unforeseeable consequences that would follow, TDF revoked the memberships of Collabora employees, froze tenders, introduced a development procurement policy, and set rules to reduce the risk of similar problems recurring.