OrangeBot.AI Digest — 2025-08-04

59 headlines from 5 sources, aggregated for 2025-08-04.

Hacker News (15)

  1. Qwen-Image: Crafting with native text rendering (qwenlm.github.io)
  2. I asked four former friends why we stopped speaking (2023) (www.vogue.com)
  3. Show HN: I spent 6 years building a ridiculous wooden pixel display (benholmen.com)
  4. DrawAFish.com Postmortem (aldenhallak.com)
  5. AI promised efficiency. Instead, it's making us work harder (afterburnout.co)
  6. Tesla withheld data, lied, misdirected police to avoid blame in Autopilot crash (electrek.co)
  7. Facts will not save you – AI, history and Soviet sci-fi (hegemon.substack.com)
  8. OpenIPC: Open IP Camera Firmware (openipc.org)
  9. Objects should shut up (dustri.org)
  10. PHP: The Toyota Corolla of programming (deprogrammaticaipsum.com)
  11. Read your code (etsd.tech)
  12. Perplexity is using stealth, undeclared crawlers to evade no-crawl directives (blog.cloudflare.com)
  13. How we built Bluey’s world (www.itsnicethat.com)
  14. Palantir is extending its reach even further into government (www.wired.com)
  15. Mastercard deflects blame for NSFW games being taken down (www.pcgamer.com)

GitHub Trending (9)

  1. dyad-sh / dyad

    Free, local, open-source AI app builder | v0 / lovable / Bolt alternative | 🌟 Star if you like it!

  2. souzatharsis / podcastfy

    An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI

  3. actualbudget / actual

    A local-first personal finance app

  4. MotiaDev / motia

    Modern Backend Framework that unifies APIs, background jobs, workflows, and AI agents into a single cohesive system with built-in observability and state management.

  5. rasbt / LLMs-from-scratch

    Implement a ChatGPT-like LLM in PyTorch from scratch, step by step

  6. MaaAssistantArknights / MaaAssistantArknights

A one-click tool for the daily tasks of Arknights, supporting all clients.

  7. reflex-dev / reflex

    🕸️ Web apps in pure Python 🐍

  8. jellyfin / jellyfin

    The Free Software Media System - Server Backend & API

  9. wg-easy / wg-easy

    The easiest way to run WireGuard VPN + Web-based Admin UI.

Product Hunt (15)

  1. SciSpace Agent

    Only AI agent automating research with 150+ academic tools

  2. Spill

    Minimalist freewriting app

  3. Kanbanq : Open alpha

    Project management. Simply done. For small teams & indies

  4. Verbite

    SEO-ready content from AI Agents

  5. Ghost 6.0

    The open source product that generates $100M+ for creators

  6. Rollups

    Take control of your startup's equity

  7. PromptPlex

    The command center for all your AI prompts

  8. Sailhouse - Agent Control Plane

    Let's be honest. Your AI agents could be better.

  9. ClueoAPI

    The missing personality layer for every AI product.

  10. Sellible

    The sales gym for founders

  11. Tenki Cloud

GitHub Actions. 90% cheaper. 30% faster. 2-click migration.

  12. Poe API

    One API for all the best AI models

  13. Clado Atlas

    The ground-truth platform for humans

  14. Tregno

The world's first AI content operating system

  15. Zealock

The WhatsApp marketplace that runs itself

Hugging Face (13)

  1. Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

    Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.

  2. PixNerd: Pixel Neural Field Diffusion

    The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

  3. Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

    General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro

  4. 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

    Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage an RLHF-style policy-optimization algorithm, GRPO, in the reinforcement learning training process to enhance reasoning capabilities, and introduce three reward functions: a perception reward, a semantic similarity reward, and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.

  5. SWE-Exp: Experience-Driven Software Issue Resolution

    Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves a state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

  6. Multimodal Referring Segmentation: A Survey

    Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.

  7. SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

    Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents' independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.

  8. MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

    Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.

  9. SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

    Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audios to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). This serves as an intermediate representation to bridge the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audios.

  10. Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

    Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.

  11. Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

    While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agents) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality using Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two contents, with a text model reviewing evaluations to determine superiority. We show that AVR-Eval properly identifies good from broken or mismatched content. We built AVR-Agent, a multi-agent system generating JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial codes, uses AVR-Eval to identify the best version, and iteratively improves it through omni-modal agent feedback from the AVR. We run experiments on games and animations with AVR-Eval (win rate of content A against B). We find that content generated by AVR-Agent has a significantly higher win rate against content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine content creation approaches.

  12. Investigating Hallucination in Conversations for Low Resource Languages

    Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

  13. IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

    Visual navigation with an image as goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL learning or a modular policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian splatting (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera poses, directly leveraging 3DGS for image localization during the agent's exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
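Of the items above, item 1 (DAEDAL) describes the most self-contained procedure: start from a short generation length and grow it until an internal completion signal clears a threshold. The toy sketch below simulates only that phase-1 growth loop; the `completion_metric` and all constants are hypothetical stand-ins, since the real method operates on diffusion-LLM mask tokens rather than a scalar simulation.

```python
# Toy simulation of DAEDAL-style phase-1 length expansion (item 1 above).
# The completion metric here is a mock: a real DLLM would derive this
# signal from its own denoising state, not from a known "needed" length.

def completion_metric(length: int, needed: int) -> float:
    """Mock confidence that `length` tokens suffice for a task needing `needed`."""
    return min(1.0, length / needed)

def expand_length(initial: int, needed: int, threshold: float = 0.9,
                  factor: int = 2, cap: int = 4096) -> int:
    """Iteratively grow the generation length until the completion
    metric clears the threshold, or a hard cap is reached."""
    length = initial
    while completion_metric(length, needed) < threshold and length < cap:
        length = min(length * factor, cap)
    return length

# A task "needing" ~700 tokens starts at 64 and doubles until the
# metric clears 0.9, stopping at 1024.
print(expand_length(64, 700))   # → 1024
```

The point of the abstract's phase 2 (mask-token insertion mid-denoising) is that even this coarse length can under-shoot locally; the sketch covers only the coarse allocation step.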

Solidot (7)

  1. The Russian segment of the ISS is still leaking air

    Sergey Krikalev, executive director of Roscosmos's human spaceflight program, has acknowledged that the Russian segment of the International Space Station (ISS) is still leaking air. The leak was first detected in 2019, and despite repeated efforts to locate and repair it, the station continues to lose air. The crew currently stationed aboard is not in danger, but the state of the cracks in the aging structure remains unsatisfactory. The leak has diminished but persists. Russian and American scientists are working to resolve the problem, trace its root cause, and ensure that no similar incident occurs on the station in the future.

  2. Linux share among Steam users approaches 3%

    Valve's Steam Hardware & Software Survey for July 2025 shows that Linux's share among players' operating systems approached 3%, reaching 2.89% (up 0.32%); Windows fell 0.44% to 95.23%, and OSX stood at 1.88%. The Linux share is near its all-time high, a trend driven mainly by the Steam Deck handheld. Among PC processors, Intel CPUs fell 0.75% to 59.52%, dropping below 60%, while AMD CPUs rose 0.74% to 40.39%. By user language, Simplified Chinese fell 1.29% to 25.44%, while English stood at 37.70%.

  3. India will penalize universities with too many retracted papers

    India's national university rankings will penalize universities whose researchers have large numbers of retracted papers. The move aims to curb the growing problem of retractions caused by scientific misconduct. Some retractions stem from unintentional errors, but others result from deliberate misconduct. According to a Retraction Watch analysis of its retraction database covering the past 30 years, India trails only China and the United States in the number of retracted papers. The United States has fewer than 1 retraction per 1,000 published papers, China has more than 3 per 1,000, and India has 2 per 1,000. Most retractions in India and China are due to scientific misconduct or research-integrity issues.

  4. Belgium restricts access to the Internet Archive's online library

    The Brussels commercial court in Belgium has issued an injunction aimed at restricting access to shadow libraries; affected sites include Anna's Archive, Libgen, OceanofPDF, Z-Library, and the Internet Archive's Open Library. Beyond ISPs, search engines, DNS resolvers, advertisers, domain registrars, content delivery networks (CDNs), and hosting providers must all take steps to restrict access to these sites. Open Library, founded by the late Aaron Swartz, Internet Archive founder Brewster Kahle, and others, aims to archive every published book and lets readers borrow them online. Like other e-libraries, it lends only one copy of each book at a time; the difference is that its e-books are not licensed, but are instead created by scanning physical copies.

  5. Google revises its plan to shut down goo.gl short links

    The search giant announced last year that it would shut down the Google URL Shortener service (goo.gl/*) on August 25, 2025, at which point all goo.gl links would stop responding. With less than a month to go, after developers, educators, journalists, and others who depend on goo.gl links voiced concerns, Google changed its mind and took a softer stance: it will only disable goo.gl links that have seen no activity since late 2024; links that are still actively used or clicked will continue to work.

  6. 17-year-old Hannah Cairo resolves a 40-year-old mathematical conjecture

    In February 2025, Hannah Cairo posted a paper on the preprint server arXiv resolving the 40-year-old Mizohata-Takeuchi conjecture. Just 17 and largely self-taught, she stunned the mathematical community. Cairo proved the conjecture false. She grew up in Nassau in the Bahamas, where her father, a programmer, had taken a job, prompting the family's move there. She has a brother three years older and a brother eight years younger, and in the Bahamas the children were all homeschooled. Cairo learned mathematics through Khan Academy's online courses and had finished calculus by age 11. Her parents arranged for several mathematics professors to tutor her remotely, but she remained mostly self-taught, to the point that one of them, Amir Aazami of Clark University, felt guilty accepting payment. By 14 she had completed upper-level undergraduate mathematics courses. In 2021, the COVID-19 pandemic left the family stuck at her grandparents' home in Chicago; this turned out to be a blessing, as she began expanding her mathematical circle and meeting more and more peers. In 2023 she applied to many universities, but because she had not finished high school, most rejected her. She followed her brother to UC Berkeley and took advanced mathematics courses, among them a graduate course on Fourier restriction theory taught by Ruixiang Zhang. A few weeks in, Zhang assigned a simplified version of the Mizohata-Takeuchi conjecture as homework, mainly to encourage students to explore advanced techniques in the field. She solved the exercise and, with Zhang's encouragement, explored further, constructing a function that disproved the Mizohata-Takeuchi conjecture. After completing the proof, she decided to skip undergraduate study and go straight to a mathematics PhD. Because she had not finished college, many of the universities she applied to again rejected her; only the University of Maryland and Johns Hopkins were willing to admit her. She chose Maryland and will enroll this fall; when she finishes, it will be her first degree.

  7. American politicians are older than those of any other country

    American politicians are older than their counterparts anywhere else in the world. According to a study published in the Journal of Public Economics, researchers at Stanford and UC Berkeley argue that a major reason lies in the age of the people who donate to political campaigns. By linking US campaign donations to registered voters, the researchers found that the median age of donating voters is 66. Older donors are far more ideologically conservative than younger ones; they are also more likely to give to candidates close to their own age, and to give larger amounts.