Monthly Digest — 2025-07

505 unique stories across 31 days and 8 sources.

Hacker News(124)

  1. Figma Files Registration Statement for Proposed Initial Public Offering (www.figma.com)
  2. PlanetScale for Postgres (planetscale.com)
  3. The Fed says this is a cube of $1M. They're off by half a million (calvin.sh)
  4. Ask HN: Who is hiring? (July 2025)
  5. Couchers is officially out of beta (couchers.org)
  6. Stop Killing Games (www.stopkillinggames.com)
  7. ICEBlock, an app for anonymously reporting ICE sightings (techcrunch.com)
  8. Show HN: CSS generator for a high-def glass effect (glass3d.dev)
  9. Opening up ‘Zero-Knowledge Proof’ technology (blog.google)
  10. AV1@Scale: Film Grain Synthesis, The Awakening (netflixtechblog.com)
  11. Launch HN: K-Scale Labs (YC W24) – Open-Source Humanoid Robots
  12. Poor Man's Back End-as-a-Service (BaaS), Similar to Firebase/Supabase/Pocketbase (github.com)
  13. Everything around LLMs is still magical and wishful thinking (dmitriid.com)
  14. Air pollution may contribute to development of lung cancer in never-smokers (today.ucsd.edu)
  15. Eight dormant Satoshi-era Bitcoin wallets reactivated after 14 yrs (twitter.com)
  16. EverQuest (www.filfre.net)
  17. The Prime Reasons to Avoid Amazon (blog.thenewoil.org)
  18. macOS Icon History (basicappleguy.com)
  19. How to not pay your taxes legally, apparently (mrsteinberg.com)
  20. Seine reopens to Paris swimmers after century-long ban (www.lemonde.fr)

GitHub Trending(75)

  1. microsoft / generative-ai-for-beginners

    21 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/

  2. GraphiteEditor / Graphite

    An open source graphics editor for 2025: comprehensive 2D content creation tool for graphic design, digital art, and interactive real-time motion graphics — featuring node-based procedural editing

  3. confident-ai / deepeval

    The LLM Evaluation Framework

  4. octra-labs / wallet-gen
  5. NanmiCoder / MediaCrawler

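    Crawlers for Xiaohongshu notes and comments, Douyin videos and comments, Kuaishou videos and comments, Bilibili videos and comments, Weibo posts and comments, Baidu Tieba posts and comment replies, and Zhihu Q&A articles and comments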

  6. zaidmukaddam / scira

    Scira (Formerly MiniPerplx) is a minimalistic AI-powered search engine that helps you find information on the internet and cites it too. Powered by Vercel AI SDK! Search with models like xAI's Grok 3.

  7. microsoft / Mastering-GitHub-Copilot-for-Paired-Programming

    A multi-module course teaching everything you need to know about using GitHub Copilot as an AI Peer Programming resource.

  8. mrdoob / three.js

    JavaScript 3D Library.

  9. LadybirdBrowser / ladybird

    Truly independent web browser

  10. Genesis-Embodied-AI / Genesis

    A generative world for general-purpose robotics & embodied AI learning.

  11. swagger-api / swagger-ui

    Swagger UI is a collection of HTML, JavaScript, and CSS assets that dynamically generate beautiful documentation from a Swagger-compliant API.

  12. rustfs / rustfs

    🚀 High-performance distributed object storage for MinIO alternative.

  13. datawhalechina / happy-llm

    📚 A from-scratch tutorial on the principles and practice of large language models

  14. dockur / macos

    macOS inside a Docker container.

  15. anthropics / prompt-eng-interactive-tutorial

    Anthropic's Interactive Prompt Engineering Tutorial

  16. vosen / ZLUDA

    CUDA on non-NVIDIA GPUs

  17. th-ch / youtube-music

    YouTube Music Desktop App bundled with custom plugins

  18. humanlayer / 12-factor-agents

    What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?

  19. Alibaba-NLP / WebAgent

    🌐 WebAgent for Information Seeking built by Tongyi Lab: WebWalker & WebDancer & WebSailor https://arxiv.org/pdf/2507.02592

  20. HandsOnLLM / Hands-On-Large-Language-Models

    Official code repo for the O'Reilly Book - "Hands-On Large Language Models"

Product Hunt(124)

  1. Cursor Agents: Browsers & Mobile

    Work with a powerful coding assistant anywhere

  2. Dynamic Mockups

    Create realistic mockups at scale

  3. Rybbit

    The open source Google Analytics replacement

  4. co.dev MCP

    Turn your ideas into full-stack apps

  5. Portia AI

    Secure AI agents with tools, auth, and smart control

  6. String.com

    AI agent for building AI agents

  7. Lazy 2.0

    One shortcut to capture & chat with your notes, everywhere

  8. Nothing Phone (3)

    Beyond lights, with the new Glyph Matrix

  9. Skala

    Legal platform for startups

  10. Autocoder.cc

    The 1st full stack vibe coding tool

  11. AppStruct

    No-code app builder

  12. LLM Gateway

    Use any AI model with just one API

  13. todai

    Your first personalized happy lifestyle index

  14. Icons8 MCP Server

    Massive icon packs for vibe-coding

  15. Agnes AI

    AI Agent for collaborative workspace

  16. Search Console Audit

    Get more traffic from ChatGPT & Google

  17. FastMoss.com

    Your TikTok shop growth engine

  18. Prit

    Plan your trip with pro's tool.

  19. Mirage

    Where you co-create the game as you play it

  20. Silenttype

    You don't have to worry about note-taking anymore

Hugging Face(79)

  1. Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

    Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.

  2. MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

    Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at https://github.com/msu-video-group/memfof.

  3. Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

    Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.

  4. Listener-Rewarded Thinking in VLMs for Image Preferences

    Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.

  5. Ella: Embodied Social Agents with Lifelong Memory

    We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at https://umass-embodied-agi.github.io/Ella/.

  6. MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

    Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.

  7. FreNBRDF: A Frequency-Rectified Neural Material Representation

    Accurate material modeling is crucial for achieving photorealistic rendering, bridging the gap between computer-generated imagery and real-world photographs. While traditional approaches rely on tabulated BRDF data, recent work has shifted towards implicit neural representations, which offer compact and flexible frameworks for a range of tasks. However, their behavior in the frequency domain remains poorly understood. To address this, we introduce FreNBRDF, a frequency-rectified neural material representation. By leveraging spherical harmonics, we integrate frequency-domain considerations into neural BRDF modeling. We propose a novel frequency-rectified loss, derived from a frequency analysis of neural materials, and incorporate it into a generalizable and adaptive reconstruction and editing pipeline. This framework enhances fidelity, adaptability, and efficiency. Extensive experiments demonstrate that FreNBRDF improves the accuracy and robustness of material appearance reconstruction and editing compared to state-of-the-art baselines, enabling more structured and interpretable downstream tasks and applications.

  8. Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

    Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

  9. GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.

  10. Kwai Keye-VL Technical Report

    While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.

  11. LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

    Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.

  12. Ovis-U1 Technical Report

    In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

  13. WebSailor: Navigating Super-human Reasoning for Web Agent

    Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all opensource agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

  14. How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

    Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.

  15. Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation

    The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.

  16. Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

    The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English-centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval and is part of the ongoing EKA initiative (https://eka.soket.ai), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.

  17. LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

    Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and Generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.

  18. MemOS: A Memory OS for AI System

    Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.

  19. Should We Still Pretrain Encoders with Masked Language Modeling?

    Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.

  20. 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

    Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low-FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low-FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
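
    The frame-rate arithmetic behind the staggered capture described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the stagger is assumed to be evenly spaced, and the group counts of 4 and 8 are inferred only from dividing the quoted 100-200 FPS by the 25 FPS base rate.

```python
from fractions import Fraction


def effective_fps(base_fps: int, num_groups: int) -> int:
    """Effective frame rate when num_groups camera groups are evenly staggered."""
    return base_fps * num_groups


def staggered_schedule(base_fps: int, num_groups: int, num_frames: int = 2):
    """Return time-ordered (group, timestamp_in_seconds) capture events.

    Assumption: group g starts g / (base_fps * num_groups) seconds late,
    so the groups tile each base frame period evenly.
    """
    period = Fraction(1, base_fps)   # time between frames of one camera
    offset = period / num_groups     # start-time shift between adjacent groups
    events = [
        (group, frame * period + group * offset)
        for frame in range(num_frames)
        for group in range(num_groups)
    ]
    return sorted(events, key=lambda event: event[1])


if __name__ == "__main__":
    for groups in (4, 8):  # assumed group counts, see the note above
        print(groups, "groups at 25 FPS ->", effective_fps(25, groups), "FPS")
    for group, timestamp in staggered_schedule(25, 4, num_frames=1):
        print(f"group {group} fires at {float(timestamp) * 1000:.0f} ms")
```

    Under these assumptions, each timestamp is seen by only one group of cameras, which is the viewpoint sparsity the abstract's artifact-fix model is meant to compensate for.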

Solidot(103)

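  1. Odds of asteroid 2024 YR4 hitting the Moon rise to 1 in 25

    Asteroid 2024 YR4 is now all but certain to miss Earth, but the probability that it strikes the Moon in December 2032 has risen to 1 in 25. If the impact does occur, it is expected to leave a new crater roughly 1 km across on the lunar surface. The Moon itself needs no defending, and an impact would have no effect on its orbit. Ejecta from the impact could, however, reach the region of geosynchronous orbit and pose an interference risk to some satellite systems. It is also a reminder that planetary defense should not be limited to Earth: the safety of the whole Earth-Moon system deserves attention too.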

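  2. Carbon record shows humans began using fire on a large scale 50,000 years ago

    A team from the Institute of Oceanology, Chinese Academy of Sciences, working with German and French researchers, has published a paper in PNAS that uses the black carbon record in marine sediments to reconstruct the history of fire in northern East Asia over the past 300,000 years. Combining it with records from Europe, East Asia, Southeast Asia, and Australia and with large-scale data from archaeological sites, they find that modern humans began using fire on a large scale about 50,000 years ago. Archaeological studies trace the earliest human use of fire back to about 1.7 million years ago, but when humans began using fire on a large scale has remained hard to pin down. Black carbon is the collective term for a range of carbon-containing compounds produced when biomass and fossil fuels burn. Because its aromatic structure is highly stable, it can persist in sedimentary environments for a long time. In marginal seas whose sediment comes mainly from large rivers, the black carbon in the sediment largely reflects fire activity on a continental scale. The study holds that during the glacial period 50,000 years ago, modern humans began their second migration out of Africa. As glacial sea levels fell, large areas of continental shelf in the Indo-Pacific Warm Pool region were exposed as land and the rainforest barrier weakened, allowing humans to spread rapidly to East Asia, Southeast Asia, and even Australia in under 10,000 years. The rapid growth in population greatly increased how often fire was used. In addition, the glacial climate was cold and food was relatively scarce, so the need for fire rose sharply as well. Together these factors made 50,000 years ago the key point at which humans began using fire on a large scale. The result also suggests that humans may already have left a deep mark on the global carbon cycle through fire use during the last glacial period.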

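  3. Studies find consumers have low trust in AI products

    Two studies have found that consumers have low trust in AI products and are less willing to buy them. AI had a negative effect on product marketing; the effect was especially pronounced for high-risk products and less noticeable for low-risk ones. In one study, researchers split participants into two groups of about 100 people each. One group read advertisements for fictional products and services that highlighted features such as "AI" or "AI-powered"; the other read advertisements that used terms such as "new technology" or "equipped with cutting-edge technology". Participants who read ads containing AI keywords reported being less likely to try or buy the products and services than those in the other group. The other study, carried out by market research firm Parks Associates, was larger. Of roughly 4,000 Americans surveyed, 18% said AI would make them more likely to buy, 24% said less likely, and 58% said AI made no difference to them.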

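  4. Canonical reports 2024 revenue of $292 million

    According to the 2024 annual report Canonical filed with UK Companies House, the developer of the Ubuntu distribution brought in $292 million in revenue in 2024, up from $251 million in 2023 and $205 million in 2022, and its headcount reached 1,175. By comparison, in 2014 Canonical's revenue was only $81 million, it had about 337 employees, and the company had long been operating at a loss. It is not yet clear when Canonical will IPO; reports that it would go public in 2023 surfaced as early as 2022.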

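  5. Huawei releases open-weight model trained on Ascend NPUs

    Huawei has released an open-weight model trained on its Ascend NPUs. The model is published on Gitcode, and its license prohibits use in the European Union. Called Pangu Pro MoE, the model has 72 billion total parameters and activates 16 billion per token. It is optimized for the Ascend 300I Duo and 800I A2, reaching single-card inference performance of 1,148 tokens/s, which speculative acceleration raises further to 1,528 tokens/s. Huawei researchers say that among models with fewer than 100 billion parameters, Pangu Pro MoE outperforms well-known open-weight models such as GLM-Z1-32B and Qwen3-32B.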

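  6. First American scientific refugees arrive in France

    The first American scientific refugees fleeing Trump's rule have arrived in France. Aix-Marseille University (AMU) has brought in its first 8 American scientists through the Safe Place for Science program. The scientists have not yet signed contracts with the university, and most asked to remain anonymous so that they can keep their positions in the US if they are not hired. Applicants to the Safe Place for Science program include a climate scientist, James, and his wife, who studies the relationship between the judicial system and democracy. James declined to give his surname. He does not consider himself a refugee but is deeply worried about the future of academic research under Trump. His field has been targeted by the administration and faces funding cuts. AMU says that although it is not well known outside France, 298 researchers from prominent US universities such as Stanford and Yale applied to the program, underscoring the urgency of the situation in the US.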

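  7. Inflammaging may be a product of industrialized lifestyles

    Inflammation has long been regarded as a hallmark of aging, but according to a new study from Columbia University's Mailman School of Public Health, it may not be a universal human experience. The study suggests that inflammatory aging (inflammaging) appears to be a byproduct of industrialized lifestyles and varies significantly across global populations. The researchers analyzed data from four groups: two industrialized populations and two non-industrialized Indigenous populations (the Tsimane of the Bolivian Amazon and the Orang Asli of Peninsular Malaysia). The two industrialized populations showed similar inflammation profiles, but the Indigenous populations did not, because their inflammation levels were driven mainly by infection rather than age. Most chronic diseases (including diabetes, heart disease, and Alzheimer's disease) are rare or largely absent in the Indigenous populations. The researchers found that about 66% of the Tsimane had at least one intestinal parasitic infection, and more than 70% of the Orang Asli had a prevalent infection. Inflammatory markers were strongly linked to chronic disease in the industrialized populations but not in the Indigenous ones.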

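  8. GNU Health Hospital Information System 5.0 released

    GNU Health Hospital Information System, free software for the healthcare sector, has released version 5.0. Major changes include improved reporting and analytics, more comprehensive handling of different types of patient information, a redesigned medical imaging subsystem, and more complete insurance and billing features, among others.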

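  9. Sponge-structured material uses solar heat to remove salt from seawater

    Most of Earth's water is seawater, which is too salty to drink. Desalination plants can turn seawater into drinking water, but the process consumes a great deal of energy. In a paper published in ACS Energy Letters, a Hong Kong research team describes a sponge-structured material with long microscopic air pockets which, combined with sunlight and a simple plastic cover, successfully converts salt water into fresh water. An outdoor proof-of-concept experiment produced drinkable fresh water under natural sunlight, marking a significant step toward low-energy, sustainable desalination. In the outdoor test, the researchers placed the material in an evaporation container filled with seawater and covered it with a curved transparent plastic cover. When sunlight heats the top of the material, only the water evaporates into vapor (the salt is left behind). The vapor condenses into liquid water on the inside of the cover, runs down to the edge, and finally drips into a funnel beneath the evaporation container, where it is collected in another container. After 6 hours of natural sunlight, the system produced about 3 tablespoons of drinking water.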

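  10. Exoplanet triggers flares on its host star

    Astronomers have recently found that an exoplanet named HIP 67522b has a highly unusual interaction with its host star, HIP 67522. The planet orbits so close to the star that it causes frequent, violent flares on the stellar surface and keeps the planet's own atmosphere heated and inflated. HIP 67522 is a young G-type star in the constellation Centaurus, about 417 light-years from Earth and only about 17 million years old. The star has two planets, one of which, HIP 67522b, is a "hot Jupiter": close to Jupiter in size, it orbits so near its star that one revolution takes just 7 days. The research team found that the planet appears to form some kind of unusual connection with the star's magnetic field, which sets off intense flare activity on the stellar surface. When these flares erupt toward the planet, they feed a large amount of energy back to it, making its atmosphere swell like a balloon being inflated. Over time the planet's atmosphere could be severely stripped away, and it may even shrink from a giant hot Jupiter to the size of a "hot Neptune" or "sub-Neptune". Strong star-planet interactions of this kind had long been predicted in theory but have only now been observed for the first time.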

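  11. Men and women respond almost equally to a baby crying at night

    A study from Aarhus University in Denmark found that women are not innately more likely than men to be woken by a baby crying at night. Women are, however, three times as likely as men to handle nighttime care. The researchers ran two separate studies. The first, an experiment with 142 adults without children, found that women responded slightly more strongly than men to very quiet sounds. For whisper-level sounds, whether a baby's cry or an ordinary alarm, women were 14% more likely than men to wake up. Once the sounds got louder, there was no significant difference between men and women. In the second study, 117 first-time parent couples in Denmark logged their nighttime caregiving for a week. The results showed that mothers were three times as likely as fathers to provide nighttime infant care. The researchers argue that social factors, not physiological differences, explain the gap. Denmark recently extended paternity leave from two weeks to eleven, which may help balance childcare responsibilities between parents.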

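  12. Young Americans cut back on game spending

    According to Circana data, Americans aged 18 to 24 spent 25% less on games in April than a year earlier, and their total spending was down 13% year over year. Likely reasons for the cutback are economic uncertainty and bleak job prospects. By contrast, spending by older age groups held steady. The US economic climate may be pushing the younger generation to change its spending habits, which may not be good news for a games industry already facing layoffs.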

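  13. Stop Killing Games campaign draws more than a million signatures

    The Stop Killing Games campaign, launched by the YouTuber Accursed Farms, has drawn wide attention. It aims to make games like books and similar goods: once purchased, players own them and can use them at any time, rather than losing access when the publisher shuts down its servers. The Stop Killing Games petition in the UK gathered 150,000 signatures, meeting the threshold for the petition to be put forward for debate in the UK Parliament, and its EU petition gathered 1.07 million signatures. It may take intervention by government regulators before the games industry changes its current practices.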

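  14. One in seven medical abstracts published in 2024 may have been written with AI

    A large-scale analysis of the academic literature shows that about 1 in 7 biomedical abstracts published last year may have been written with the help of AI. Of the 1.5 million abstracts indexed in the medical database PubMed in 2024, more than 200,000 contain words that large language models (LLMs) tend to favor. Many teams have tried to assess the impact of LLMs on scholarly output, but this is challenging because most users do not disclose their use. The researchers used the stylistic vocabulary that became common after LLMs took off to estimate whether an abstract was written with AI assistance. They found that in 2024, 454 words appeared far more often than in any year since 2010. Most were "style words" unrelated to the research content, and they were mainly verbs and adjectives. Scientific vocabulary evolves over the long term. In 2021 there were 190 "excess words", most of them nouns related to research content. But the vocabulary shift since LLMs became widespread has been more pronounced and is mainly stylistic. The researchers found that in fields such as computational science and bioinformatics, more than 1 in 5 abstracts were written with LLM assistance.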

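  15. Clothoff seeks to dominate deepfake porn

    According to information disclosed by a Clothoff whistleblower, the deepfake porn app is planning a global expansion in an attempt to dominate the deepfake porn space. Clothoff has already acquired at least 10 similar services, which draw hundreds of thousands to millions of visits a month. The whistleblower says Clothoff's annual budget is about $3.5 million, and that its current marketing relies mainly on Telegram bots and X channels to target ads at young men who might use the app. Most of Clothoff's marketing budget is spent on Telegram channels, Reddit sex subs, and 4chan.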

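  16. Genome sequencing reveals ancient Egyptian ancestry

    In a study, scientists sequenced the whole genome of an ancient Egyptian from a tomb in Egypt. The individual was male, radiocarbon dated to around 2855 to 2570 BC. He was found buried in a sealed ceramic pot at Nuwayrat in ancient Egypt, indicating relatively high social status, and he lived to an advanced age for his time, between 44 and 64 years. Of the 7 DNA samples extracted, two were preserved well enough to be sequenced, and they were compared against a database of 3,233 present-day individuals and 805 ancient individuals. Genetic modeling traced the great majority of the Nuwayrat individual's genome to North African Neolithic ancestry. About 20% of the genome is related to populations of the eastern Fertile Crescent, complementing archaeological evidence of trade and mutual influence between the two regions.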

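  17. Microsoft Xbox executive suggests laid-off employees use AI to manage their emotions

    Microsoft announced this week that it is laying off more than 9,000 people. Its Xbox gaming business was hit hard, with studio closures and several game projects canceled. In response, Xbox executive Matt Turnbull offered a suggestion: laid-off employees should use AI to manage their emotions. His advice was posted on LinkedIn; the post has since been deleted, but its content was saved. He said he had been experimenting with LLM AI tools (such as ChatGPT or Copilot) to help reduce the emotional and cognitive load that comes with job loss, and that he would be remiss not to try to offer the best advice he could.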

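  18. US faces the largest brain drain in its history

    From World War II through 2024, the United States was the undisputed scientific leader of the free world. A society that valued facts, scientific truth, education, and the public good led generation after generation to continued breakthroughs and progress. But since January 2025, renowned US scientific institutions such as NOAA, NASA, NSF, CDC, EPA, and FDA have suffered an unprecedented series of attacks from within. With the new budget (the "Big Beautiful Bill") approved by the House and Senate and about to become law, the American model of scientific research as we know it may become a thing of the past. For American scientists, this is a real-life nightmare. Even if the worst happens, there is still reason for hope. Hitler's Germany also badly damaged its own scientific research, but the scientists who left in large numbers ultimately benefited the rest of the world, an episode that became known as "Hitler's gift". We may witness the decline of American science and the rise of other scientific powers. NASA has already cut more than 2,500 staff and is asking another 3,000 to retire voluntarily, with the science divisions hit hardest. Of NASA's 124 ongoing missions, 41 face outright cancellation, and the gravitational-wave detector LISA stands to lose NASA funding. The Trump administration has canceled subscriptions to the journal Nature, an episode that closely mirrors Nazi Germany, where Nature was banned from German libraries from 1937 until the end of World War II. If the US does not change course, 2025 will not only mark the end of the era of American scientific exceptionalism; the scale of the American brain drain could also dwarf "Hitler's gift".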

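  19. Well-chosen emoji leave a good impression in conversation

    Worldwide, emoji are used more than 10 billion times a day, adding subtle emotion to digital conversations. Yet their actual effect on how people understand those conversations is unclear: although these small symbols are often read positively, they are sometimes misread and cause misunderstandings. Researchers therefore assessed how emoji affect people's perception of the person who sends them. In the study, 260 participants in the US were asked to read 15 text-based conversations and imagine they were having the exchanges with a close friend. The conversations contained either text-only replies or replies with emoji. After reading the samples, participants answered a series of questions about how they felt about the sender. Overall, participants rated messages with emoji as more responsive than text-only messages. This made the sender more likable and made the relationship seem closer. Surprisingly, the effect did not depend on the type of emoji used: whether it was one that directly expressed the sender's emotion, such as a smiley face, or a neutral emoji showing some other object, there was no substantive difference.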

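  20. Anthem servers to shut down in January 2026

    EA has announced that the servers for Anthem will shut down on January 12, 2026. Developed by BioWare, the online mech game Anthem launched in February 2019 and was met with a flood of negative reviews over its many problems. In February 2020 BioWare announced it would overhaul Anthem, but a year later, in February 2021, it announced that the rework had been canceled, though it kept the online service running. EA has now announced the timetable for shutting down the servers, which means the game will be completely dead.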