OrangeBot.AI Digest — 2025-07-28
53 headlines across 8 sources, aggregated for this day.
Hacker News (15)
- Show HN: Use Their ID – Use Your Local UK MP's ID for the Online Safety Act (use-their-id.com)
- ‘I witnessed war crimes’ in Gaza – former worker at GHF aid site [video] (www.bbc.com)
- Claude Code weekly rate limits
- Visa and Mastercard are getting overwhelmed by gamer fury over censorship (www.polygon.com)
- I saved a PNG image to a bird (www.youtube.com)
- FDA has approved Yeztugo, a drug that provides protection against HIV infection (newatlas.com)
- Copyparty – Turn almost any device into a file server (github.com)
- Tao on “blue team” vs. “red team” LLMs (mathstodon.xyz)
- Debian switches to 64-bit time for everything (www.theregister.com)
- Samsung Removes Bootloader Unlocking with One UI 8 (sammyguru.com)
- How to make websites that will require lots of your time and energy (blog.jim-nielsen.com)
- LLM Embeddings Explained: A Visual and Intuitive Guide (huggingface.co)
- What would an efficient and trustworthy meeting culture look like? (abitmighty.com)
- SIMD within a register: How I doubled hash table lookup performance (maltsev.space)
- Software Development at 800 Words per Minute (neurrone.com)
GitHub Trending (11)
- Shubhamsaboo / awesome-llm-apps
Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and open-source models.
- Genesis-Embodied-AI / Genesis
A generative world for general-purpose robotics & embodied AI learning.
- daveebbelaar / ai-cookbook
Examples and tutorials to help developers build AI systems
- tldr-pages / tldr
📚 Collaborative cheatsheets for console commands
- microsoft / generative-ai-for-beginners
21 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/
- dgtlmoon / changedetection.io
Best and simplest tool for website change detection, web page monitoring, and website change alerts. Perfect for tracking content changes, price drops, restock alerts, and website defacement monitoring—all for free or enjoy our SaaS plan!
- mikf / gallery-dl
Command-line program to download image galleries and collections from several image hosting sites
- outline / outline
The fastest knowledge base for growing teams. Beautiful, realtime collaborative, feature packed, and markdown compatible.
- ashishpatel26 / 500-AI-Agents-Projects
The 500 AI Agents Projects is a curated collection of AI agent use cases across various industries. It showcases practical applications and provides links to open-source projects for implementation, illustrating how AI agents are transforming sectors such as healthcare, finance, education, retail, and more.
- mattermost-community / focalboard
Focalboard is an open source, self-hosted alternative to Trello, Notion, and Asana.
- SillyTavern / SillyTavern
LLM Frontend for Power Users.
Product Hunt (11)
- CopyCat
Build browser automations with AI
- Doco
Cursor for Microsoft Word
- Nitrode
AI game engine to prototype 3D games in a day
- Unitree R1
Ultra-lightweight humanoid robot starting at $5900
- Ahey
A free and open-source, embeddable video conference app.
- Singify AI Vocal Remover
Remove vocals from any song
- Chive
The macOS companion for Claude Code
- Guidey
Add modern onboarding tours to your product in minutes
- Aeneas
AI that helps historians connect the past
- Web
A free macOS AI browser
- Best Reminder App Chrome Extension
Cloud Sync: Reminders across devices, plus email alerts.
Hugging Face (9)
- Deep Researcher with Test-Time Diffusion
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
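The abstract describes a simple outer loop — start from a draft skeleton, decide what to search for, retrieve, and revise — repeated until the report converges. The sketch below is one illustrative reading of that loop, not the authors' implementation; `llm` and `retrieve` are assumed stand-ins for a model call and a search backend, and all prompt strings are hypothetical.

```python
def ttd_dr(llm, retrieve, question, steps=3):
    """Illustrative sketch of the TTD-DR draft-denoising loop.

    llm: callable(prompt: str) -> str   (hypothetical model interface)
    retrieve: callable(query: str) -> str  (hypothetical search backend)
    """
    # preliminary draft: the updatable skeleton that guides the research
    draft = llm(f"Write a preliminary report skeleton for: {question}")
    for _ in range(steps):
        # the current draft decides what evidence to fetch next
        query = llm(f"What should be searched next to improve this draft?\n{draft}")
        evidence = retrieve(query)
        # one 'denoising' step: revise the draft against the new evidence
        draft = llm(f"Revise the draft using this evidence:\n"
                    f"Draft: {draft}\nEvidence: {evidence}")
    return draft
```

The key design point the abstract emphasizes is that the draft itself, not a fixed plan, steers each retrieval step.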
- The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. Yet, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: (i) the GPTQ error propagation step gains an intuitive geometric interpretation; (ii) GPTQ inherits the error upper bound of Babai's algorithm under the no-clipping condition. Taken together, these results place GPTQ on firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
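For readers unfamiliar with the lattice side of this equivalence, the textbook form of Babai's nearest plane algorithm is short: Gram-Schmidt-orthogonalize the basis, then walk the dimensions back-to-front, rounding the target onto the nearest translated hyperplane at each step. The following is a generic pure-Python sketch of that classical algorithm (small dimensions, dense lists) — it is not GPTQ itself, only the lattice routine the paper maps GPTQ onto.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(basis):
    """Orthogonalize the basis rows (no normalization)."""
    ortho = []
    for b in basis:
        v = list(b)
        for u in ortho:
            mu = dot(b, u) / dot(u, u)
            v = [vi - mu * ui for vi, ui in zip(v, u)]
        ortho.append(v)
    return ortho

def babai_nearest_plane(basis, target):
    """Approximate the closest lattice vector to `target` (CVP).

    Walks the dimensions from last to first -- the same back-to-front
    order the paper identifies in GPTQ -- snapping to the nearest
    translated hyperplane at each step.
    """
    ortho = gram_schmidt(basis)
    t = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        c = round(dot(t, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        # subtract c * basis[i]; the residual propagates to earlier dims
        t = [tj - c * bj for tj, bj in zip(t, basis[i])]
    # reconstruct the lattice point from the integer coefficients
    point = [sum(c * b[j] for c, b in zip(coeffs, basis))
             for j in range(len(target))]
    return point, coeffs
```

In the paper's reading, the residual subtraction above is exactly GPTQ's error-propagation step, and Babai's known error bound transfers to GPTQ under the no-clipping condition.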
- MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.
- CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.
- PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.
- Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction.
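The multi-step inference process the abstract names — respond, critique, revise the specification, respond again — fits in a few lines. Below is a minimal sketch of that loop, assuming a generic `llm` callable (prompt in, text out); the prompt wording is an illustrative assumption, not the authors' actual prompts.

```python
def specification_self_correct(llm, task, spec):
    """Sketch of the SSC test-time loop described in the abstract.

    llm: callable(prompt: str) -> str  (hypothetical model interface)
    """
    # 1. initial response under the possibly tainted specification
    draft = llm(f"Task: {task}\nSpecification: {spec}\nRespond.")
    # 2. critique the response for spec-gaming against the true intent
    critique = llm(f"Critique this response for spec-gaming:\n{draft}")
    # 3. revise the specification itself to close the loophole
    revised_spec = llm(f"Rewrite the specification to remove flaws "
                       f"noted in the critique:\nSpec: {spec}\nCritique: {critique}")
    # 4. final response under the self-corrected specification
    return llm(f"Task: {task}\nSpecification: {revised_spec}\nRespond.")
```

Note that the repair target is the specification, not the response — the model's final answer is regenerated only after the spec has been cleaned, which is what makes this a mitigation for reward hacking rather than ordinary self-refinement.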
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty and instability, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we propose Artic, an AI-oriented Real-time Communication framework, exploring the network requirement shift from "humans watching video" to "AI understanding video". To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive Frame Rate that leverages previous frames to substitute for lost/delayed frames while avoiding bitrate waste. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat.
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
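Two mechanical pieces of GEPA's design are concrete enough to sketch: the Pareto frontier of candidate prompts (kept so complementary lessons can be combined) and one reflect-propose-rescore iteration. The code below is a simplified illustration under assumed interfaces (`llm`, `score`), not the released optimizer.

```python
def pareto_frontier(candidates):
    """Keep (prompt, per-task-scores) pairs not dominated by any other.

    b dominates a when b scores >= a on every task and > on at least one.
    """
    def dominated(a, b):
        return (all(y >= x for x, y in zip(a, b))
                and any(y > x for x, y in zip(a, b)))
    return [c for c in candidates
            if not any(dominated(c[1], o[1]) for o in candidates if o is not c)]

def gepa_step(llm, frontier, tasks, score):
    """One illustrative GEPA iteration: run a frontier prompt, reflect on
    the trajectories in natural language, propose a prompt edit, re-score
    it, and recompute the frontier. Prompts/signatures are assumptions."""
    prompt, _ = frontier[0]
    trajectory = [llm(prompt + "\n" + t) for t in tasks]
    reflection = llm("Diagnose failures and propose a prompt fix:\n"
                     + "\n".join(trajectory))
    new_prompt = llm("Rewrite the prompt applying this lesson:\n"
                     + prompt + "\n" + reflection)
    new_scores = [score(new_prompt, t) for t in tasks]
    return pareto_frontier(frontier + [(new_prompt, new_scores)])
```

Because feedback arrives as text rather than a scalar reward, a single reflected-on trajectory can carry far more signal than one policy-gradient rollout — which is the abstract's explanation for the 35x rollout reduction.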
- Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-45° Law," we evaluate these risks using "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
Solidot (7)
- Stack Exchange moves to the cloud
The programming Q&A platform Stack Exchange has announced it is moving to the cloud, abandoning its own servers. Stack Exchange had hosted its sites in a New Jersey data center since 2010, using roughly 50 servers; when a server failed, engineers had to go on site to replace or restart the hardware. In 2023 its Stack Overflow for Teams product moved to Microsoft Azure, and now the Stack Overflow and Stack Exchange network is hosted on Google Cloud. Stack Overflow no longer owns any physical data centers or offices and operates entirely remotely in the cloud.
- Today's Tour de France riders have surpassed doping-era Armstrong
Analysis shows that today's Tour de France riders have surpassed Lance Armstrong of the doping era. In a mountain stage of last year's Tour, Tadej Pogacar sustained a power output of about 7 W/kg for nearly 40 minutes, and Jonas Vingegaard has exceeded 7 W/kg for nearly 15 minutes. By comparison, Armstrong achieved about 6 W/kg 20 years ago while doping, and his stage times were slower than today's top riders. Armstrong won seven consecutive Tours from 1999 to 2005 with the aid of doping; in 2012 he was stripped of all results from August 1998 onward and banned for life. Today's riders perform better thanks to technological progress: every rider uses a power meter providing real-time performance data, nutrition relies on precisely measured food intake to continuously replenish calories, and bikes are wind-tunnel tested to reduce drag, among other advances.
- Controversial arsenic-life paper retracted 15 years after publication
The journal Science has retracted the controversial arsenic-based life paper. In 2010 Science published "A bacterium that can grow by using arsenic instead of phosphorus" by F. Wolfe-Simon et al., claiming the discovery in a California lake of an arsenic-based bacterium, GFAJ-1, that grows using arsenic rather than phosphorus. The paper drew intense controversy, and in 2012 Science published two papers that failed to replicate the finding. Science editor-in-chief Holden Thorp said in a statement that the paper was not retracted in 2012 because the policy at the time mainly targeted scientific misconduct, and the authors had not deliberately deceived anyone or committed misconduct. Science has since broadened its retraction policy: if a paper's reported experiments do not support its central conclusion, retraction is appropriate.
- Pebble founder regains the original trademark
Pebble founder Eric Migicovsky announced that he has regained the original trademark, so his company's upcoming smartwatches will carry the Pebble name: the Core 2 Duo becomes the Pebble 2 Duo, and the Core Time 2 becomes the Pebble Time 2. Pebble was founded in 2012, when Migicovsky raised a then-record $10.3 million on Kickstarter; its second-generation smartwatch raised a record-breaking $20.3 million on Kickstarter. But Pebble shut down after being sold to Fitbit in December 2016, and the founder left the company. Google acquired ownership of Pebble through its purchase of Fitbit. Earlier this year Google announced it was open-sourcing the Pebble smartwatch operating system under the Apache License 2.0, with the source code hosted on GitHub, and at the same time Migicovsky announced new smartwatches that run Pebble OS.
- Earth is broadcasting its location to aliens
A preliminary study suggests that radar systems operated by civilian airports and military installations worldwide may be inadvertently broadcasting Earth's existence to technologically advanced alien civilizations; such signals could be taken as indirect evidence of intelligent life. The study examined what the radio signals leaking from radar systems would look like to an observer 200 light-years away, assuming they had radio telescopes comparable to ours; the results also imply that, in principle, we could detect an alien civilization of similar technological level within the same range. The researchers set out to assess how detectable these signals would be from six nearby star systems, in particular Barnard's Star (5.98 light-years) and AU Microscopii (32.3 light-years). The analysis shows that the airport radar systems used to monitor aircraft collectively emit about 2×10¹⁵ watts, an output strong enough for a radio telescope of the Green Bank Telescope's class to detect at 200 light-years. Military radar systems are more directional, producing a distinctive pattern like a lighthouse beam sweeping across the sky.
- DNSSEC adoption is only 34%
The original design of the Domain Name System (DNS) included no security measures; the DNS Security Extensions (DNSSEC) attempt to add security while remaining backward compatible. DNSSEC can block attacks such as DNS cache poisoning, and its RFC was published 28 years ago, yet according to the Internet Society, DNSSEC adoption stands at only 34%. By comparison, HTTPS, whose development timeline roughly parallels DNSSEC's, has reached 96% adoption among the top 1,000 websites, and HTTP/3 reached 25% adoption only four years after release. About 30% of country-code domains have yet to deploy DNSSEC.
- Google ordered to pay $12,500 after Street View car filmed Argentine man naked
In 2017 a Google Street View car captured a police officer naked in his own yard in Argentina, publishing his bare backside, along with his house number and street name, on the map; the images spread widely after coverage by Argentine media. The officer sued the search giant, accusing it of violating his dignity; he said his yard wall was 6.5 feet high and that he had become a laughingstock among neighbors and colleagues. Google's response was that the wall was not high enough. An Argentine court dismissed the suit last year, finding that his own improper conduct was to blame. This week an appeals court overturned that ruling, holding that the man's dignity had been flagrantly violated and ordering Google to pay $12,500 in compensation.