OrangeBot.AI Digest — 2025-09-08
50 headlines across 8 sources, aggregated for the day.
Hacker News (15)
- Chat Control Must Be Stopped (www.privacyguides.org)
- iPhone dumbphone (stopa.io)
- Firefox 32-bit Linux Support to End in 2026 (blog.mozilla.org)
- Signal Secure Backups (signal.org)
- OpenWrt: A Linux OS targeting embedded devices (openwrt.org)
- Will Amazon S3 Vectors kill vector databases or save them? (zilliz.com)
- Job mismatch and early career success (www.nber.org)
- NPM debug and chalk packages compromised (www.aikido.dev)
- A clickable visual guide to the Rust type system (rustcurious.com)
- Google gets away almost scot-free in US search antitrust case (www.computerworld.com)
- Clankers Die on Christmas (remyhax.xyz)
- Experimenting with Local LLMs on macOS (blog.6nok.org)
- Meta suppressed research on child safety, employees say (www.washingtonpost.com)
- Immich – High performance self-hosted photo and video management (github.com)
- ICEBlock handled my vulnerability report in the worst possible way (micahflee.com)
GitHub Trending (14)
- emcie-co / parlant
LLM agents built for control. Designed for real-world use. Deployed in minutes.
- microsoft / ai-agents-for-beginners
12 Lessons to Get Started Building AI Agents
- zama-ai / fhevm
FHEVM, a full-stack framework for integrating Fully Homomorphic Encryption (FHE) with blockchain applications
- bytedance / UI-TARS-desktop
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
- openwrt / openwrt
This repository is a mirror of https://git.openwrt.org/openwrt/openwrt.git It is for reference only and is not active for check-ins. We will continue to accept Pull Requests here. They will be merged via staging trees then into openwrt.git.
- Kilo-Org / kilocode
Open Source AI coding assistant for planning, building, and fixing code. We frequently merge features from open-source projects like Roo Code and Cline, while building our own vision. Follow us: kilocode.ai/social
- 11cafe / jaaz
The world's first open-source multimodal creative assistant. A privacy-first, locally usable alternative to Canva and Manus.
- x1xhlol / system-prompts-and-models-of-ai-tools
FULL v0, Cursor, Manus, Augment Code, Same.dev, Lovable, Devin, Replit Agent, Windsurf Agent, VSCode Agent, Dia Browser, Xcode, Trae AI, Cluely & Orchids.app (And other Open Sourced) System Prompts, Tools & AI Models.
- microsoft / generative-ai-for-beginners
21 Lessons, Get Started Building with Generative AI
- Stirling-Tools / Stirling-PDF
#1 Locally hosted web application that allows you to perform various operations on PDF files
- Cinnamon / kotaemon
An open-source RAG-based tool for chatting with your documents.
- Zie619 / n8n-workflows
all of the workflows of n8n i could find (also from the site itself)
- Vector-Wangel / XLeRobot
XLeRobot: Practical Dual-Arm Mobile Home Robot for $660
- uutils / coreutils
Cross-platform Rust rewrite of the GNU coreutils
Hugging Face (12)
- Why Language Models Hallucinate
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
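The abstract's central claim, that grading schemes scoring only right/wrong reward guessing over abstention, reduces to a one-line expected-value argument. A toy sketch (ours, not the paper's code):

```python
# Toy illustration (not from the paper): under binary 0/1 grading,
# guessing with any nonzero success probability beats abstaining.
def expected_score(p_correct: float, abstain: bool, penalty: float = 0.0) -> float:
    """Expected benchmark score for one question.
    abstain: the model answers "I don't know" (scores 0).
    penalty: points deducted for a wrong answer (0 on most benchmarks)."""
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# With no wrong-answer penalty, even a 20% guess strictly beats abstaining:
assert expected_score(0.2, abstain=False) > expected_score(0.2, abstain=True)
# With a 1-point penalty for wrong answers, abstaining wins below p = 0.5:
assert expected_score(0.2, abstain=False, penalty=1.0) < 0.0
```

This is why the paper argues the fix must change benchmark scoring (e.g. penalizing confident errors) rather than add new hallucination evaluations.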
- Symbolic Graphics Programming with Large Language Models
Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among the various SGPs, we focus on scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
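The reward described above, a format-validity gate multiplied by a cross-modal similarity, can be sketched as follows. This is an illustrative guess at the structure, not the paper's code: `svg_is_renderable` is a simplified stand-in (a real gate would attempt an actual render), and the similarity score is assumed to come from an external vision encoder such as SigLIP.

```python
# Hypothetical sketch of a gated verifiable reward for SVG generation.
import xml.etree.ElementTree as ET

def svg_is_renderable(svg_source: str) -> bool:
    """Format-validity gate, approximated here as 'parses as XML with
    an <svg> root'. A production gate would invoke a real renderer."""
    try:
        root = ET.fromstring(svg_source)
    except ET.ParseError:
        return False
    return root.tag.endswith("svg")  # tolerate namespace-qualified tags

def reward(svg_source: str, similarity: float) -> float:
    """Zero reward unless the SVG passes the gate; otherwise pass the
    cross-modal similarity (e.g. a SigLIP text-image score) through."""
    return similarity if svg_is_renderable(svg_source) else 0.0

assert reward("<svg xmlns='http://www.w3.org/2000/svg'/>", 0.8) == 0.8
assert reward("<svg", 0.9) == 0.0  # malformed output earns nothing
```

The multiplicative gate matters: it prevents the policy from collecting similarity reward for plausible-looking but unrenderable output.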
- Set Block Decoding is a Language Model Inference Accelerator
Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
- WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
- LuxDiT: Lighting Estimation with Video Diffusion Transformer
Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
- LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside the industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a 90× increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18
- WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
- MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting
Radiologic diagnostic errors-under-reading errors, inattentional blindness, and communication failures-remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.
- On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Large Language Model (LLM) effectiveness is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six common benchmarks and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs of different sizes and capability levels. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
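The protocol described above, comparing accuracy on original versus paraphrased questions, can be sketched with a small helper. The function name and the numbers below are illustrative, not the paper's:

```python
# Sketch (assumed setup): quantify how accuracy shifts when the same
# benchmark questions are paraphrased.
from statistics import mean

def robustness_report(scores_by_paraphrase: dict) -> dict:
    """Map of paraphrase-set name -> accuracy; 'original' is the
    unmodified benchmark wording."""
    original = scores_by_paraphrase["original"]
    others = [v for k, v in scores_by_paraphrase.items() if k != "original"]
    return {
        "original": original,
        "mean_paraphrased": mean(others),
        "drop": original - mean(others),  # the decline the paper reports
    }

report = robustness_report({"original": 0.82, "p1": 0.74, "p2": 0.70})
assert round(report["drop"], 2) == 0.10
```

Computing this per model, then checking whether the model *ranking* by `mean_paraphrased` matches the ranking by `original`, mirrors the paper's finding that rankings stay stable while absolute scores drop.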
- U-ARM : Ultra low-cost general teleoperation interface for robot manipulation
We propose U-Arm, a low-cost and rapidly adaptable leader-follower teleoperation framework designed to interface with most commercially available robotic arms. Our system supports teleoperation through three structurally distinct 3D-printed leader arms that share consistent control logic, enabling seamless compatibility with diverse commercial robot configurations. Compared with previous open-source leader-follower interfaces, we further optimized both the mechanical design and servo selection, achieving a bill of materials (BOM) cost of only $50.5 for the 6-DoF leader arm and $56.8 for the 7-DoF version. To enhance usability, we mitigate the common challenge of controlling redundant degrees of freedom through mechanical and control optimizations. Experimental results demonstrate that U-Arm achieves 39% higher data collection efficiency and comparable task success rates across multiple manipulation scenarios compared with Joycon, another low-cost teleoperation interface. We have open-sourced the CAD models of all three configurations and provide simulation support for validating teleoperation workflows. We also open-sourced real-world manipulation data collected with U-Arm. The project website is https://github.com/MINT-SJTU/LeRobot-Anything-U-Arm.
- Behavioral Fingerprinting of Large Language Models
Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting
- Bootstrapping Task Spaces for Self-Improvement
Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
Solidot (9)
- Firefox ESR 115 to be supported until March 2026
Microsoft has ended support for Windows 7/8/8.1, and the most popular applications on those systems, browsers such as Google Chrome and Microsoft Edge, have likewise dropped support for them. Firefox 115 ESR, released by Mozilla in July 2023, is the last Firefox version to support Windows 7/8/8.1. Mozilla developers say they will evaluate at various points whether to extend Windows 7/8/8.1 support; per the latest evaluation, Mozilla plans to keep releasing security updates for Firefox 115 ESR until March 2026.
- Third-party Windows tool lets users disable all AI features
Flyoobe, a third-party Windows 11 tool, lets users remove the bloatware Microsoft bundles with the operating system. Its recent v1.7 update lets users discover and disable all AI and Copilot features after installing the OS. The developer says the latest version digs deeper into how AI is embedded in Windows 11. Flyoobe is hosted on Microsoft-owned GitHub under the MIT license.
- Like humans, every tree has a unique microbiome
A forest is a complex, dynamic ecosystem, and so is the inside of a tree. In a trunk-microbiome study published in Nature, researchers found that the woody tissue of trees contains, besides tree cells, vast communities of bacteria and single-celled archaea. The Yale team collected wood-core samples from more than 150 trees of 16 species in the northeastern United States and extracted DNA to estimate the number of microbes in the trunks. They found that tree microbiomes vary by species: sugar maples, famous for maple syrup, harbor more sugar-eating bacteria, while oaks used for wine barrels carry a group of microbes known to aid fermentation. These examples show that tree microbes affect our daily lives in unexpected ways. Tree microbiomes can also show convergent evolution, with closely related species hosting similar microbial communities.
- Tesla redefines Full Self-Driving, abandoning its promise of autonomy
Tesla has changed what Full Self-Driving (FSD) means, abandoning its original promise of autonomous, i.e. unsupervised, driving. Since 2016 Tesla had claimed the cars it was producing were capable of unsupervised self-driving, and CEO Elon Musk promised every year from 2018 on that autonomy would arrive by year's end. Tesla later admitted that none of the vehicles produced between 2016 and 2023 have the hardware needed for autonomous driving. Tesla now says FSD stands for supervised self-driving.
- US plans to restrict imports of Chinese drones
The US Commerce Department plans to issue rules that would restrict or ban, on national-security grounds, imports of Chinese drones, as well as vehicles over 10,000 pounds from China and other countries. Drones imported from China account for the vast majority of US commercial drone sales, with more than half coming from DJI, the world's largest drone maker. The Biden administration had previously restricted imports of Chinese-made cars and trucks on national-security grounds, and last December Biden signed a law that could pave the way for banning DJI and Autel Robotics from selling new drone models in the US.
- Anthropic to pay book authors $1.5 billion to settle copyright suit
Late last month the AI startup Anthropic settled a copyright class action brought by book authors, avoiding potentially billions of dollars in infringement damages. Court filings show that Anthropic downloaded as many as 7 million e-books from the pirate libraries LibGen and PiLiMi, assembling a huge collection in 2021 and 2022. This week the authors disclosed that Anthropic has agreed to pay $1.5 billion and destroy all copies of the books it pirated to train its AI models. The payout is the largest in the history of US copyright litigation. The settlement covers roughly 500,000 pirated works, with each author receiving $3,000 per work. Anthropic has agreed to the terms, which still require court approval.
- Firefox to end 32-bit Linux support in September 2026
Mozilla has announced that Firefox will stop supporting 32-bit Linux in September 2026. Mozilla says most Linux distributions no longer support 32-bit architectures, making 32-bit Firefox on Linux increasingly difficult and unreliable to maintain. To focus its efforts, Mozilla will drop 32-bit Linux support after Firefox 144; Firefox 145 will support only 64-bit Linux. Mozilla recommends that 32-bit Linux users upgrade to 64-bit Linux and 64-bit Firefox. Users who cannot upgrade immediately can stay on Firefox ESR 140, which will receive security updates until September 2026.
- EU fines Google $3.45 billion over ad-tech antitrust violations
Google has been fined $3.45 billion by the EU for antitrust violations in its ad-tech business, the fourth EU fine against Google in the past decade. The US president has threatened retaliation. The latest action stems from a complaint by the European Publishers Council: the European Commission found that Google has abused its market dominance since 2014 by favoring its own online display-advertising technology services, to the detriment of competitors and online publishers.
- UK government trial of M365 Copilot finds no clear productivity gains
A UK government trial of M365 Copilot found no clear productivity gains. The Department for Business and Trade received 1,000 licenses for use between October and December 2024. Most licenses went to volunteers, with 30% assigned to randomly selected participants, 300 of whom consented to analysis of their data. Each user performed an average of 72 M365 Copilot actions over the trial's 63 working days, or 1.14 actions per day. Word, Teams, and Outlook were the most used applications, while Loop and OneNote saw very little use. The three most common tasks were recording or summarizing meeting notes, drafting emails, and producing written output, and these tasks also had the highest user satisfaction. However, users reported that completing more complex tasks such as Excel analysis with M365 Copilot was slower than for non-AI users, with worse quality and accuracy. Overall, the trial found no clear productivity improvement from M365 Copilot.