Monthly Digest — 2025-06
429 unique stories across 30 days and 8 sources.
Hacker News (120)
- Canonical's Interview Process (dustri.org)
- M8.2 solar flare, Strong G4 geomagnetic storm watch (www.spaceweatherlive.com)
- Root shell on a credit card terminal (stefan-gloor.ch)
- How I like to install NixOS (declaratively) (michael.stapelberg.ch)
- My AI skeptic friends are all nuts (fly.io)
- The Unreliability of LLMs and What Lies Ahead (verissimo.substack.com)
- Ask HN: Who is hiring? (June 2025)
- Show HN: Penny-1.7B Irish Penny Journal style transfer (huggingface.co)
- Deep learning gets the glory, deep fact checking gets ignored (rachel.fast.ai)
- Show HN: AirAP AirPlay server - AirPlay to an iOS Device (github.com)
- Builder.ai Collapses: $1.5B 'AI' Startup Exposed as 'Indians' (www.ibtimes.co.uk)
- Swift at Apple: Migrating the Password Monitoring Service from Java (www.swift.org)
- Curtis Yarvin's Plot Against America (www.newyorker.com)
- A proposal to restrict sites from accessing a user’s local network (github.com)
- The iPhone 15 Pro’s Depth Maps (tech.marksblogg.com)
- IRS Direct File on GitHub (chrisgiven.com)
- Eleven v3 (elevenlabs.io)
- Show HN: ClickStack – Open-source Datadog alternative by ClickHouse and HyperDX (github.com)
- Seven Days at the Bin Store (defector.com)
- Google restricts Android sideloading (puri.sm)
GitHub Trending (59)
- anthropics / prompt-eng-interactive-tutorial
Anthropic's Interactive Prompt Engineering Tutorial
- anthropics / courses
Anthropic's educational courses
- frdel / agent-zero
Agent Zero AI framework
- onlook-dev / onlook
The Cursor for Designers • An Open-Source Visual Vibecoding Editor • Visually build, style, and edit your React App with AI
- donnemartin / system-design-primer
Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.
- gitroomhq / postiz-app
📨 The ultimate social media scheduling tool, with a bunch of AI 🤖
- nautechsystems / nautilus_trader
A high-performance algorithmic trading platform and event-driven backtester
- DataExpert-io / data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
- Anduin2017 / HowToCook
Programmer's guide to cooking at home (Simplified Chinese only).
- scrapy / scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
- netbirdio / netbird
Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
- lastmile-ai / mcp-agent
Build effective agents using Model Context Protocol and simple workflow patterns
- topoteretes / cognee
Memory for AI Agents in 5 lines of code
- stanfordnlp / dspy
DSPy: The framework for programming—not prompting—language models
- codexu / note-gen
A cross-platform Markdown note-taking application dedicated to using AI to bridge recording and writing, organizing fragmented knowledge into a readable note.
- tensorzero / tensorzero
TensorZero creates a feedback loop for optimizing LLM applications — turning production data into smarter, faster, and cheaper models.
- langgenius / dify
Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
- alphacep / vosk-api
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- XTLS / Xray-core
Xray, Penetrates Everything. Also the best v2ray-core. Where the magic happens. An open platform for various uses.
- jwohlwend / boltz
Official repository for the Boltz biomolecular interaction models
Product Hunt (119)
- Circuit Tracer
Anthropic's open tools to see how AI thinks
- Conversational AI 2.0 From ElevenLabs
Powering the next-gen of smart, trusted voice agents
- Audino AI
Make content creation simpler with AI-generated audio
- Blogbuster - Free Blog Hosting
Launch a blog on your domain in minutes & free
- Knowledge
Sell to leads like you know them - because you do!
- AI Video Generator by Wavel AI
Create stunning multi style videos with single input
- Jots
Unlock the benefits of journaling to become a better dev.
- CMS-powered UI components by Croct
Component library with personalization and AB testing
- Socialprofiler
Find out what people are into based on their social media
- Convo Mode from Wondercraft
NotebookLM podcasts that you can edit
- Wispr Flow for iOS
Voice-first writing—now on iPhone
- ALTAR 2.0 Personal Multi-Agent Workspace
Autocomplete for your mind.
- Embedded iPaaS from Albato
Make your SaaS stick with AI-fueled embedded iPaaS
- DeskHog
A developer toy from PostHog
- Ideabrowser.com
The place to find trends & startup ideas worth building
- Job for Agent
The 1st job board for autonomous AI agents
- Long
Invest in startups before VCs get in
- Fieldy
Wearable AI note taker for in-person meetings
- skillsync
Discover hidden talent in your codebase
- Mistral Code
Enterprise AI coding with full control & security
Hugging Face (28)
- Show-o2: Improved Native Unified Multimodal Models
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
- RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
- EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
- Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
- Better Language Model Inversion by Compactly Representing Next-Token Distributions
Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method -- prompt inversion from logprob sequences (PILS) -- that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: The vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2--3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5--27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
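The key insight above, that a model's vocabulary-sized output vectors occupy a subspace of dimension at most the hidden size, can be illustrated with a toy sketch (random matrices, not the paper's code; PILS works on logprob sequences, while this sketch uses raw logits for simplicity):

```python
import numpy as np

# A language model's logits are W @ h, so every logit vector lies in the
# column space of the unembedding matrix W: a subspace of dimension at
# most hidden_size, far smaller than the vocabulary size.
rng = np.random.default_rng(0)
vocab, hidden, steps = 5000, 64, 200

W = rng.normal(size=(vocab, hidden))   # toy unembedding matrix
H = rng.normal(size=(hidden, steps))   # hidden states over `steps` generation steps
logits = W @ H                         # (vocab, steps) stacked logit vectors

# Despite living in R^vocab, the stacked logits have rank <= hidden.
rank = np.linalg.matrix_rank(logits)
print(rank)  # 64 (== hidden)

# Hence a fixed linear map (a basis for col(W)) compresses each
# vocab-sized logit vector to `hidden` numbers with no information loss.
Q, _ = np.linalg.qr(W)           # orthonormal basis of col(W), shape (vocab, hidden)
compressed = Q.T @ logits        # (hidden, steps): the compact representation
reconstructed = Q @ compressed   # lossless recovery of the full logits
print(np.allclose(reconstructed, logits))  # True
```

This lossless compression is what lets an inverter consume many generation steps' worth of output distributions at once.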
- Watermarking Autoregressive Image Generation
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values.
- MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to ``think visually'', it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
- RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task.
- 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass in under 1.5 seconds on a single A100 GPU.
- Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems
Despite recent progress in generating hardware RTL code with LLMs, existing solutions still suffer from a substantial gap between practical application scenarios and the requirements of real-world RTL code development. Prior approaches either focus on overly simplified hardware descriptions or depend on extensive human guidance to process complex specifications, limiting their scalability and automation potential. In this paper, we address this gap by proposing an LLM agent system, termed Spec2RTL-Agent, designed to directly process complex specification documentation and generate corresponding RTL code implementations, advancing LLM-based RTL code generation toward more realistic application settings. To achieve this goal, Spec2RTL-Agent introduces a novel multi-agent collaboration framework that integrates three key enablers: (1) a reasoning and understanding module that translates specifications into structured, step-by-step implementation plans; (2) a progressive coding and prompt optimization module that iteratively refines the code across multiple representations to enhance correctness and synthesizability for RTL conversion; and (3) an adaptive reflection module that identifies and traces the source of errors during generation, ensuring a more robust code generation flow. Instead of directly generating RTL from natural language, our system strategically generates synthesizable C++ code, which is then optimized for HLS. This agent-driven refinement ensures greater correctness and compatibility compared to naive direct RTL generation approaches. We evaluate Spec2RTL-Agent on three specification documents, showing it generates accurate RTL code with up to 75% fewer human interventions than existing methods. This highlights its role as the first fully automated multi-agent system for RTL generation from unstructured specs, reducing reliance on human effort in hardware design.
- Demystifying the Visual Quality Paradox in Multimodal Large Language Models
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT significantly lifts average accuracy, with no external models, cached features, or extra training data. These findings redefine ``better'' visual inputs for MLLMs and highlight the need for adaptive, rather than universally ``clean'', imagery in the new era of AI being the main data customer.
- Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
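The sequential, re-routed expert chain described above can be sketched in a few lines (toy dimensions, random weights, and top-1 routing chosen for illustration; this is not the paper's implementation):

```python
import numpy as np

# Chain-of-Experts (toy sketch): instead of routing a token to experts once
# and mixing them in parallel, the token passes through several iterations
# *inside one layer*, with a fresh router choosing an expert at each
# iteration and a residual connection carrying the token forward.
rng = np.random.default_rng(0)
d, n_experts, n_iters = 8, 4, 2

experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
routers = [rng.normal(size=(d, n_experts)) for _ in range(n_iters)]  # one router per iteration

def coe_layer(x):
    for t in range(n_iters):
        scores = x @ routers[t]          # routing is re-evaluated every iteration
        e = int(np.argmax(scores))       # top-1 expert for this step
        x = x + np.tanh(experts[e] @ x)  # expert transform + residual (the iterative residual structure)
    return x

x = rng.normal(size=d)
y = coe_layer(x)
print(y.shape)  # (8,)
```

Because the router runs again at each iteration, the same token can visit a different expert at each step, which is the source of the richer expert combinations the abstract describes.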
- Improving Progressive Generation with Decomposable Flow Matching
Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multistage frameworks. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to finetuning of large models, such as FLUX, DFM shows faster convergence speed to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
- Orthogonal Finetuning Made Scalable
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
- Scaling Speculative Decoding with Lookahead Reasoning
Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire gamma-token guess is correct falls exponentially as gamma grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling -- making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
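The step-level accept/regenerate loop described above can be sketched abstractly (every function, string, and the toy "semantic" equality check here are hypothetical stand-ins, not the paper's implementation):

```python
# A small draft model proposes several future reasoning steps; the target
# model produces its own version of each step in one batched pass; a
# verifier accepts draft steps as long as they are *semantically*
# equivalent to the target's. Exact token match is not required.
def lookahead_step(draft_steps, target_steps, semantically_equal):
    accepted = []
    for d, t in zip(draft_steps, target_steps):
        if semantically_equal(d, t):
            accepted.append(d)   # keep the cheap draft step
        else:
            accepted.append(t)   # fall back to the target's regeneration
            break                # later drafts were conditioned on a bad prefix
    return accepted

# Hypothetical example with a trivial wording-insensitive check.
norm = lambda s: sorted(s.lower().replace(",", "").split())
draft  = ["add 2 and 3 to get 5", "then multiply 5 by 4", "answer is 21"]
target = ["Add 3 and 2 to get 5", "multiply 5 by 4, then", "the answer is 20"]
out = lookahead_step(draft, target, lambda a, b: norm(a) == norm(b))
print(out)  # first two draft steps accepted; third replaced by the target's
```

Within each accepted step, ordinary token-level speculative decoding can still run, which is why the two layers of parallelism multiply.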
- Thought Anchors: Which LLM Reasoning Steps Matter?
Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence's counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identified ``broadcasting'' sentences that receive disproportionate attention from all future sentences via ``receiver'' attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence's tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
- GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning, for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3% of the original performance while removing ~25% of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
- Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models
Recent advances in image editing with diffusion models have achieved impressive results, offering fine-grained control over the generation process. However, these methods are computationally intensive because of their iterative nature. While distilled diffusion models enable faster inference, their editing capabilities remain limited, primarily because of poor inversion quality. High-fidelity inversion and reconstruction are essential for precise image editing, as they preserve the structural and semantic integrity of the source image. In this work, we propose a novel framework that enhances image inversion using consistency models, enabling high-quality editing in just four steps. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy and enables a controllable trade-off between editability and content preservation. We achieve state-of-the-art performance across various image editing tasks and datasets, demonstrating that our method matches or surpasses full-step diffusion models while being substantially more efficient. The code of our method is available on GitHub at https://github.com/ControlGenAI/Inverse-and-Edit.
- Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs with over 450K high-quality ones from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Hence, our dataset provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements, with clinical upsampling boosting performance by ~5% on MMLU ProfMed and educational quality filtering improving MedQA and MedMCQA by ~1%. Combinations of these techniques led to faster convergence, reaching the same performance with a third of the training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.
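A minimal sketch of how the released metadata could drive the quality filtering and domain upsampling the abstract describes. The records, field names, and thresholds below are hypothetical, chosen only to mirror the annotation schema (type, domain, 1-5 educational quality):

```python
# Hypothetical records mimicking Biomed-Enriched paragraph metadata.
paragraphs = [
    {"text": "p1", "type": "clinical case", "domain": "clinical", "quality": 5},
    {"text": "p2", "type": "study", "domain": "biomedical", "quality": 2},
    {"text": "p3", "type": "review", "domain": "biomedical", "quality": 4},
    {"text": "p4", "type": "clinical case", "domain": "clinical", "quality": 3},
]

def quality_filter(records, min_quality=3):
    """Keep only paragraphs rated useful for college-level learning."""
    return [r for r in records if r["quality"] >= min_quality]

def upsample_domain(records, domain="clinical", factor=2):
    """Repeat paragraphs from the target domain to shift the pretraining mix."""
    out = []
    for r in records:
        out.extend([r] * (factor if r["domain"] == domain else 1))
    return out

# Compose the two curation steps into one training subset.
subset = upsample_domain(quality_filter(paragraphs))
```

Composing the two steps this way is what produces the dataset "variants" the abstract mentions: the same annotated corpus yields different training mixes depending on the threshold and upsampling factor.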
Solidot(103)
- NASA satellite confirms evidence of sputtering on Mars
After a decade of exploration, NASA's Mars Atmosphere and Volatile Evolution (MAVEN) mission has reported the first direct observation of an elusive atmospheric escape process called sputtering, which may help scientists answer questions about how Mars lost its water. Abundant evidence shows that water existed on the Martian surface billions of years ago, but scientists have long wondered where that water went and why it disappeared. NASA placed the MAVEN satellite into Mars orbit in 2014 to study the planet's atmosphere and determine what made it so thin and dry. After a decade of observations, the MAVEN team concluded that the escape of water vapor from the Martian atmosphere is linked to sputtering, a process in which particles in the upper atmosphere are knocked off the planet by energetic charged particles, much like billiard balls. Because Mars lacks the protection of its own magnetic field, the solar wind continually erodes its upper atmosphere through sputtering. As the atmosphere thinned and surface pressure dropped, liquid water on the surface could no longer remain stable; it gradually evaporated and was sputtered away from Mars along with the upper atmosphere.
- The world faces a new form of climate denial: economic infeasibility
Over the past few decades, climate-denial propaganda from vested interests has gone through a series of shifts: from claiming that "climate change does not exist," to conceding that "climate change exists but is natural," then "climate change is not natural but is not a big problem," "climate change is a big problem but there is nothing we can do," "measures can be taken but the biggest emitters should act first," and finally "the cost of change is too high." Andre Correa do Lago, the veteran Brazilian diplomat presiding over this year's UN climate summit, Cop30, argues that this economic denial of climate change is one of the biggest threats we now face.
- More Chinese game companies are developing single-player games
Chinese game companies are best known for free-to-play online games such as Genshin Impact and Wuthering Waves, which have millions of players worldwide. But following the success of Black Myth: Wukong, more Chinese studios are turning to single-player games. Chinese companies were previously reluctant to develop them because the free-to-play business model was more mature and made it easier to recoup development costs. Xia Siyuan, director of the upcoming single-player title Wuchang: Fallen Feathers, said single-player games carry more risk: their fate is sealed on launch day, whereas an online game with a troubled first day can still be salvaged. Li Shen, former CTO of Tencent Games and now CTO of Epic Games China, said Chinese companies have long been interested in single-player games, but publishers were unwilling to invest until Black Myth: Wukong sold millions of copies worldwide. New single-player titles such as S-Game's Phantom Blade Zero and Wuchang: Fallen Feathers hope to ride on that success. Industry professionals say that, thanks to their online-game experience, Chinese companies lead the world in monetization design and live operations, but fall short in game design and overall quality, especially narrative and scriptwriting. Like other Chinese companies, Chinese game studios also face controversy over their overtime culture, with many on the 996 schedule. Xia said his company discourages overtime, considering it inefficient; citing the Chinese saying "the fire burns high when everyone adds wood," he hopes his team's efforts will encourage more companies to develop single-player games.
- Astronomers discover a mysterious object emitting radio waves and X-rays every 44 minutes
Astronomers have discovered ASKAP J1832-0911, a mysterious object that emits radio waves and X-rays on a 44-minute cycle. This behavior is unlike that of any known object: among objects known to periodically emit high-energy radiation, such as pulsars, the longest known period is on the order of a hundred seconds, and a 44-minute period has never been observed before. Such objects are called long-period transients (LPTs), and astronomers do not yet understand how their signals are produced. The team believes ASKAP J1832-0911 is a dead star, but its form is unknown: it could be a magnetar (a neutron star with an extremely strong magnetic field), a white dwarf, or part of a binary system in which one member is a strongly magnetized white dwarf. The researchers say existing theories cannot explain what they have observed.
- How large is the US strategic cryptocurrency reserve
US President Trump issued an executive order in March announcing the creation of a national strategic cryptocurrency reserve, to be funded mainly with cryptocurrencies seized by federal agencies in criminal and civil proceedings. So how large are the US government's current cryptocurrency holdings? According to Chainalysis, as of May 28 the government held 20 cryptocurrencies worth roughly $20.9 billion, including $20.4 billion in Bitcoin and about $493 million in other digital currencies, a total slightly below the roughly $25 billion value of the Strategic Petroleum Reserve.
- Russian public procurement database leaks nuclear base details
Russia is renovating its nuclear weapons bases and has included sensitive documents about them directly in its public procurement database. Journalists from Danwatch and Der Spiegel used proxy servers in Russia, Kazakhstan, and Belarus to bypass network restrictions and access the database, scraping and analyzing more than two million documents. The files reveal detailed information about Russian nuclear facilities, including floor plans and internal layouts of bunkers and missile silos, power grids, IT systems, alarm configurations, sensor locations, and hardened structures. They show that the most recent batch was published in the summer of 2024 and discloses many of Russia's newly built facilities.
- Chimpanzees strike tree trunks as a form of communication
With the help of camera traps and local guides, German researchers have documented a peculiar behavior among wild chimpanzees in Guinea-Bissau: adult males repeatedly strike tree trunks with stones. The researchers speculate that this is a form of communication. Young chimpanzees learn the behavior from adult members of their group, suggesting it is acquired rather than inherited. The study was published in Biology Letters.
- US tech companies are hiring fewer new graduates
According to an analysis of hiring trends by SignalFire, US tech companies hired fewer new graduates in 2024 than in 2023. Big Tech firms hired about a quarter fewer new graduates last year than in 2023, while startups hired 11% fewer. SignalFire's head of research attributes much of the decline to AI: entry-level work is vulnerable to automation, and AI is considered well suited to such junior tasks. Meanwhile, Big Tech increased hiring of professionals with 2-5 years of experience by 27% last year, and startups by 14%. The paradox for new graduates is that they cannot get a job without work experience, and cannot gain experience without a job.
- Report calls AI adoption and growth "unprecedented"
According to Trends – Artificial Intelligence, a new report by famed VC and analyst Mary Meeker, known as the "Queen of the Internet," the pace of AI adoption and growth is "unprecedented." The report says ChatGPT surpassed 800 million users within 17 months, which is unprecedented; the number of AI companies, and the speed at which so many have reached high annual recurring revenue, are unprecedented; the rate of decline in AI inference costs is unprecedented — while training a large model can cost up to $1 billion, inference costs fell 99% within two years; and the speed at which AI companies match competitors' model capabilities at a fraction of the cost is also unprecedented. The one area where AI has not outpaced previous technology revolutions, the report notes, is financial returns; it remains unclear which companies will grow into the next generation of durably profitable tech giants.
- Andromeda and the Milky Way may not collide after all
The Milky Way is widely believed to be on course to collide with its nearest large neighbor, Andromeda, in 4.5 billion years. New research suggests this is not inevitable. The team analyzed data from Hubble and Gaia, accounted for 22 variables that could affect a potential collision between our galaxy and its neighbors, and ran up to 100,000 computer simulations extending 10 billion years into the future. With so many variables, each carrying its own uncertainty, the accumulated uncertainty in the outcome is substantial: the simulations give only about a 50% probability of a true collision between the Milky Way and Andromeda in the next 10 billion years; in other words, there is an equal chance the two galaxies merely brush past each other rather than collide head-on. The team notes that predicting the long-term future of galaxy interactions is highly uncertain, but the new findings challenge the previous consensus and show that the fate of the Milky Way remains an open question. Given that the Sun will render Earth uninhabitable in about a billion years and will itself likely burn out in five billion, a collision with Andromeda is the least of our cosmic worries.
- Exercise significantly reduces colon cancer recurrence and death
According to a study published in the New England Journal of Medicine, exercise significantly reduces the risk of colon cancer recurrence and death. Compared with a control group that was not required to exercise, the exercise group's risk of colon cancer recurrence, new cancers, or death within eight years was 28% lower. The researchers found that the benefits of exercise appeared after one year and strengthened over time. Five-year disease-free survival was 80.3% in the exercise group versus 73.9% in the control group; eight-year overall survival was 90.3% versus 83.2%. During the study, 41 participants in the exercise group died, compared with 66 in the control group. Participants in the exercise group could do any leisure aerobic activity they liked, and the required volume was modest: three to four 45-60 minute brisk walks, or three to four 25-30 minute jogs, per week was enough to improve survival.
- Microsoft mandates uniform USB-C functionality on Windows 11 devices
Microsoft has updated the hardware compatibility requirements for Windows 11 24H2 devices to mandate uniform USB-C functionality: every USB-C port on laptops and tablets must support data transfer, charging, and display output. The move is intended to end the practice of different vendors shipping identical-looking USB-C ports with different capabilities, a phenomenon the industry calls "USB-C port chaos." The requirements also mandate that USB 40Gbps ports maintain full compatibility with USB4 and Thunderbolt 3 peripherals.
- Japan's 2024 births fall below 700,000 for the first time
Vital statistics published by Japan's Ministry of Health, Labour and Welfare show 686,061 births in 2024, the first time the figure has fallen below 700,000 since record-keeping began in 1899. That is 41,227 fewer than in 2023, a 5.7% decline. The total fertility rate, the estimated number of children a woman will bear in her lifetime, was 1.15, below 2023's 1.20 and a record low; Tokyo's rate was the lowest at 0.96. Japan's births and birth rate have both fallen for nine consecutive years, and the decline is running 15 years ahead of government projections, with no sign of reversal. Deaths in 2024 reached a record 1,605,298, and the natural population decrease (deaths minus births) of 919,237 was also a record high. This marks the 18th consecutive year of natural decrease, and the population decline is accelerating.
- AI startup's chatbot revealed to be 700 Indian engineers
Builder.ai, a Microsoft-backed AI startup that recently filed for bankruptcy, was found to have passed off the work of hundreds of Indian engineers as its AI chatbot Natasha. Builder.ai raised more than $445 million from investors including Microsoft and the Qatar Investment Authority, at a valuation that once reached $1.5 billion. Its product Natasha was marketed as using AI to generate software for clients, but the software was actually written by hand behind the scenes by about 700 Indian engineers working from client requirements. Builder.ai was also found to have inflated its 2024 revenue: audits showed actual revenue of only $50 million, while it had told investors revenue reached $220 million.
- OpenAI board's brief firing of CEO Sam Altman heads to the big screen
In November 2023 the OpenAI board abruptly announced it was firing CEO Sam Altman, only for him to return five days later; most of the original board members left within a year. The board decided to fire Altman after concluding he had abused his power and engaged in manipulation, believing the move fulfilled its duty. A year and a half later, Amazon MGM Studios is preparing to bring the episode to the big screen. Andrew Garfield (who appeared in The Social Network) will play Altman, Monica Barbaro will play CTO Mira Murati, and Yura Borisov will play Ilya Sutskever, the co-founder who led the ouster. David Heyman and Jeffrey Clifford of Heyday Films will produce, with Simon Rich writing the screenplay. Amazon plans to begin filming this summer.
- Ukraine used open-source software in drone strike on bombers at rear Russian airfields
Last Sunday, in one of the boldest coordinated attacks of the war, Ukraine used more than a hundred drones in a large-scale strike on long-range bombers at airfields deep inside Russia. The drones, explosives, and associated platforms were smuggled into Russia and assembled, and hired Russian truck drivers then transported them to the vicinity of the target airfields. The truck roofs were opened and the drones, connected to Russia's own mobile networks, attacked 41 long-range bombers parked at five airfields. Satellite imagery has so far confirmed 13 bombers severely damaged, along with one A-50 early-warning aircraft. The attack, dubbed Operation Spiderweb, used the open-source autopilot software ArduPilot for the drones. Started in 2007 and licensed under GPLv3, ArduPilot was originally designed for Arduino-based microcontrollers and has since grown into a full-featured drone autopilot system widely used by industry, research institutions, militaries, and hobbyists.
- Tesla sales continue to slide in Europe
Although Tesla does not publish detailed per-country sales figures, new-vehicle registrations in European countries reveal sales by model. According to the data published so far, Tesla's European sales continued to fall this May even as overall EV sales in those countries rose. In Germany, Tesla sales fell more than 36% while total EV registrations rose 45%; in the UK, Tesla sales fell 45% while total EV sales rose 28%; in Italy, Tesla deliveries fell 20% while EV sales rose nearly 41%. Norway may be the only European country where Tesla sales grew, up 213% year over year — partly because Tesla's Norwegian sales last May were unusually low, only half those of April and June. In China, Tesla mainly faces low-price competition from domestic EV makers; deliveries from its Shanghai factory fell 15% year over year in May. Shanghai-made Teslas are sold in China and exported mainly to Europe.
- Wendelstein 7-X stellarator sets new fusion record
The Wendelstein 7-X (W7-X) experiment has set a new world record, sustaining a long plasma discharge whose triple product reached a new peak over 43 seconds. This surpasses what other types of magnetic-confinement devices have achieved on this measure and provides important technical support for future fusion power plants. W7-X is the world's largest stellarator experiment, built to verify whether the stellarator design can reach the high performance predicted by theory. The stellarator is considered one of the most promising concepts in the pursuit of fusion power. It generates energy by fusing light atomic nuclei, a process that requires a plasma — an ionized gas heated to tens of millions of degrees Celsius. The stellarator uses complex magnetic fields to confine this hot plasma in a toroidal vacuum chamber, achieving the conditions needed for fusion. The triple product is regarded as a "passing grade" for controlled fusion: only above a certain threshold can fusion burn self-sustainingly and move toward practical use. Tokamaks, by comparison, have relatively simple structures and a broad research base, and have historically achieved respectable triple-product values, but their plasma durations are typically only a few seconds; W7-X shows a clear advantage in sustaining plasmas for longer.
- Google Chrome revokes trust in Chunghwa Telecom and Netlock CAs
Citing compliance failures, unkept improvement commitments, and problems with disclosure responsiveness, Google has revoked trust in Chunghwa Telecom and the Hungarian CA Netlock; Chrome will stop trusting certificates issued by the two after July 31. Ryan Hurst, a researcher who tracks CAs and certificates, said that for more than a year Netlock failed to disclose an intermediate certificate to the Common CA Database; Netlock failed to revoke a misissued certificate; Netlock failed to provide required weekly security-incident updates; Chunghwa Telecom delayed revoking a misissued certificate; and Chunghwa Telecom misissued 247 certificates with incorrectly structured subject domain names.
- As Windows 10 nears end of support, the KDE project courts Microsoft users
Microsoft will end support for Windows 10 on October 14 this year. Windows 10 PCs will keep working, but Microsoft will no longer ship security updates, so continued use carries security risks. The KDE project is trying to win over some Windows 10 users to Linux with a campaign called "KDE for Windows 10 Exiles." While Microsoft advises users who cannot upgrade to Windows 11 to buy a new PC, KDE suggests they install Linux with its Plasma desktop environment. The KDE project says installing Linux is not as hard as it used to be, but users still need to read the installation instructions carefully.