Weekly Digest — 2025-W25
147 unique stories (2025-06-16 → 2025-06-22), aggregated across 8 sources.
Hacker News (42)
- Snorting the AGI with Claude Code (kadekillary.work)
- Show HN: Chawan TUI web browser (chawan.net)
- Show HN: Canine – A Heroku alternative built on Kubernetes (github.com)
- Getting free internet on a cruise, saving $170 (angad.me)
- Darklang Goes Open Source (blog.darklang.com)
- Benzene at 200 (www.chemistryworld.com)
- The Grug Brained Developer (2022) (grugbrain.dev)
- Iran asks its people to delete WhatsApp from their devices (apnews.com)
- Building Effective AI Agents (www.anthropic.com)
- Resurrecting a dead torrent tracker and finding 3M peers (kianbradley.com)
- Brad Lander detained by masked federal agents inside immigration court (www.thecity.nyc)
- Making 2.5 Flash and 2.5 Pro GA, and introducing Gemini 2.5 Flash-Lite (blog.google)
GitHub Trending (24)
- microsoft / fluentui-system-icons
Fluent System Icons are a collection of familiar, friendly and modern icons from Microsoft.
- anthropics / anthropic-cookbook
A collection of notebooks/recipes showcasing some fun and effective ways of using Claude.
- anthropics / prompt-eng-interactive-tutorial
Anthropic's Interactive Prompt Engineering Tutorial
- Shubhamsaboo / awesome-llm-apps
Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and open-source models.
- immich-app / immich
High performance self-hosted photo and video management solution.
- huggingface / lerobot
🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
- menloresearch / jan
Jan is an open source alternative to ChatGPT that runs 100% offline on your computer
- infiniflow / ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
- deepseek-ai / DeepEP
DeepEP: an efficient expert-parallel communication library
- automatisch / automatisch
The open source Zapier alternative. Build workflow automation without spending time and money.
- linshenkx / prompt-optimizer
A prompt optimizer that helps you write high-quality prompts.
- DataExpert-io / data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
Product Hunt (41)
- Spotted in Prod
The very best of iOS
- AgentX 2.0
Build your own cross-vendor multi-agent AI team
- Wonderish
Canva of Vibe Coding
- Fluidworks
SaaS onboarding agent that talks, clicks, and guides users
- Granola for Windows
The whole team can use Granola together, wherever they work
- Rewrait
Select, improve, replace
- Tila AI
Create, code, search + design AI content all in one canvas
- FoundersAround
Where in the world are the startup founders?
- Pulze
Create AI agents and workflows without engineers
- Laravel Nightwatch
First-class monitoring designed for Laravel
- Jam for iOS
Wicked fast bug reporting, now on iOS
- Document collection by Superdash
The document collection agent you’ve been waiting for.
Hugging Face (6)
- Show-o2: Improved Native Unified Multimodal Models
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
- RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
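The core intervention idea can be illustrated with a toy sketch (the function and problem template below are hypothetical, not from the paper): alter the symbolic values of a problem, then recompute the ground-truth answer from the symbols, so a model that merely memorized the original benchmark item gets no advantage.

```python
import random

def make_variant(template, rng):
    # Intervene on the symbolic values of the problem, then derive the
    # ground-truth answer from the symbolic form rather than from any
    # memorized original text.
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    question = template.format(a=a, b=b)
    return question, a * b

# Generate arbitrarily many unmemorizable variants of one seed problem.
rng = random.Random(0)
template = "A crate holds {a} boxes with {b} apples each. How many apples in total?"
for _ in range(3):
    question, answer = make_variant(template, rng)
    print(question, "->", answer)
```

Each variant's answer follows from the intervened symbols, which is what lets the framework separate genuine reasoning from statistical recall.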
- EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
- Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
- SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details as well as high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens, while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens, to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed descriptions for longer music pieces, by chaining the outputs using a large-language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, captions and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.
- Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
Solidot (34)
- UK chain's facial recognition system misidentifies a woman as a shoplifter
A British woman was wrongly flagged by a facial-recognition system as having stolen £10 of goods. Danielle Horan was refused entry to two branches of the UK chain Home Bargains in May and June. The first time it happened she thought it was a joke and stood bewildered in front of onlookers. After she protested, a store manager suggested she contact Facewatch, the company that supplies the chain's facial-recognition technology. On June 4 she accompanied her mother to another Home Bargains store and was again surrounded by staff and asked to leave; this time, having been through it before, she demanded an explanation. Only after repeated emails to Facewatch and Home Bargains did she learn that on May 8 she had been accused of stealing £10 worth of toilet paper; after checking her bank account, she confirmed she had paid. Facewatch eventually responded that, upon review, it had confirmed she did not steal anything, and defended itself by saying it relies on information provided by stores. Madeleine Stone of the civil-liberties campaign group Big Brother Watch said the group has been contacted by more than 35 people complaining they were wrongly placed on facial-recognition watchlists. She said Britain has historically held that you are innocent until proven guilty, but once algorithms, cameras, and facial-recognition systems get involved, you are convicted first.
- New York begins requiring employers to disclose whether layoffs are caused by AI
New York State has begun requiring employers to disclose whether AI is the reason for layoffs. The new requirement, which took effect this March, applies to the state's existing Worker Adjustment and Retraining Notification (WARN) system. New York is the first US state to require such disclosure, a move that will help regulators understand AI's impact on the labor market. Employers planning mass layoffs or plant closures must file a form through the WARN system at least 90 days in advance; the latest change adds a checkbox asking whether "technological innovation or automation" is a reason for the layoffs. If the box is checked, the employer is directed to a secondary menu and asked to specify the technology responsible: AI or robotics.
- Fake albums and AI-generated music appear on YouTube and Spotify
AI-generated fake music is starting to flood YouTube and Spotify. A study by the France-based international confederation of authors and composers estimates that revenue from AI-generated music will grow to $4 billion by 2028, accounting for 20% of streaming platforms' total revenue. The problem for consumers is that they struggle to tell which music is human-made and which is AI-generated. On the community forum of Spotify, the largest music-streaming platform, users have called for clear labeling of AI-generated music and an option to block it; the platform has yet to adopt such a policy. YouTube requires creators to disclose AI-generated content and says that when it is aware of such content it may label or even remove it. Spotify co-president and chief product and technology officer Gustav Söderström said streaming restrictions on content have usually been tied to copyright infringement, but in the AI era there is still much debate over whether AI-generated content constitutes infringement.
- Meta's Llama 3.1 can recall 42% of the first Harry Potter book
A team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University published a paper on the arXiv preprint server last month analyzing whether five open-weight models can reproduce text from Books3, a popular corpus used to train large models, much of which is still under copyright. Three of the models come from Meta; the other two come from Microsoft and EleutherAI. The researchers split 36 books into overlapping 100-token passages, used the first 50 tokens as a prompt, and computed the probability that the next 50 tokens match the original text; if the probability of verbatim reproduction exceeded 50%, the passage was marked as "memorized." The results show that Llama 3.1 70B, a mid-sized model Meta released in July 2024, can recall 42% of the first Harry Potter book, compared with just 4.4% for Llama 1 65B, a similarly sized model Meta released in February 2023. The researchers found that Llama 3.1 70B is more likely to reproduce popular books, such as The Hobbit and George Orwell's 1984, than obscure ones, and that it memorizes far more of most books than the other models.
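The memorization test described above reduces to simple arithmetic on per-token log-probabilities: the probability of reproducing a 50-token continuation verbatim is the product of the per-token probabilities. A minimal sketch (function names are mine, not the paper's):

```python
import math

def verbatim_prob(token_logprobs):
    # Probability the model reproduces the whole continuation verbatim:
    # the product of per-token probabilities, i.e. exp of the summed
    # log-probabilities.
    return math.exp(sum(token_logprobs))

def is_memorized(token_logprobs, threshold=0.5):
    # Flag the passage as "memorized" when the verbatim-reproduction
    # probability exceeds the 50% threshold used in the study.
    return verbatim_prob(token_logprobs) > threshold

# Toy example: 50 tokens each assigned probability 0.99 by the model.
confident = [math.log(0.99)] * 50
print(is_memorized(confident))       # 0.99**50 ≈ 0.605 → True

# 50 tokens at probability 0.95 each: still high per-token confidence,
# but the verbatim probability collapses.
less_confident = [math.log(0.95)] * 50
print(is_memorized(less_confident))  # 0.95**50 ≈ 0.077 → False
```

The sketch shows why the metric is strict: even small per-token uncertainty compounds over 50 tokens, so passing the threshold requires near-certainty on essentially every token.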
- X.Org Server project rolls back a large amount of code
After a disgruntled developer was expelled and created the X11Libre fork, the X.Org Server project's Git repository has seen a surge of activity in recent days, mostly aimed at rolling back problematic code. Some of the reverts concern code the X11Libre developer submitted before being expelled, some concern improper handling of copyright and license notices, and some concern new patches that broke functionality.
- Girls' math performance starts falling behind when schooling begins
Around the world, teenage boys outperform girls on math tests, and men are more likely to pursue math-related careers. To understand the cause of this gap, French researchers studied four cohorts: all children who started first grade in France in 2018, 2019, 2020, or 2021, nearly three million children aged 5 to 7. The finding held across France: the math gender gap appears in every cohort, socioeconomic group, region, and school type. Analyzing this large dataset, the researchers found that it is the start of formal schooling, not age, that triggers the gap. French children typically start school in September of the year they turn six. A line chart of test results for all French children who enrolled in 2018 shows that at the start of first grade, boys and girls perform similarly on average, with slightly more boys at the highest and lowest percentiles; by the start of second grade, the gender gap has widened. The researchers note that this suggests the environment children encounter once they start school, rather than innate differences in interest or ability, drives the gap. Moreover, infants and toddlers of both sexes show a very similar grasp of numbers and logic. The researchers say one possible cause is stereotypes conveyed by teachers and parents, such as the idea that boys are better at math than girls, or that boys succeed through talent while girls succeed through effort, which undermines girls' confidence.
- Scientists engineer a lethal fungus to fight mosquitoes
The war between humans and mosquitoes has lasted thousands of years, during which mosquitoes have killed more people than any other animal. The failure of traditional methods has pushed scientists to find new ways to fight back. Biologists at the University of Maryland used bioengineering to create a lethal fungus that kills female mosquitoes through sexual transmission. The fungus, Metarhizium, produces a neurotoxin that targets mosquitoes, but the natural strain's kill rate is too low; the engineered version is far more lethal. When male mosquitoes were sprayed with spores of the modified fungus, nearly 90% of female mosquitoes died within two weeks of mating with them, versus a mortality rate of just 4% with the natural strain. The genetically modified Metarhizium is harmless to humans.
- Researchers build the first fully verifiable true random number generator
From online banking encryption to fair lottery draws, random numbers are essential, but existing computer-based random number generators have a serious limitation: anyone who knows the initial conditions can, in principle, infer all future outputs. Hardware generators that measure physical processes such as electronic noise cannot prove their randomness has not been predicted or tampered with. Now researchers have built the first fully verifiable random number generator. The new system uses quantum entanglement to guarantee unpredictability: it creates pairs of photons with shared quantum properties and sends them to measurement stations 110 meters away. When the photons' properties are measured, quantum mechanics ensures the results are random. The team built a system called Twine that splits the generation process into multiple independent steps, each recorded in a tamper-evident distributed-ledger hash chain, so that no single party can control it and anyone can independently verify it. In a 40-day test, the system successfully generated random numbers in 7,434 of 7,454 attempts, a 99.7% success rate. Each successful run produces 512 random bits with an error rate of 2^-64.
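The ledger idea behind Twine, committing each step to a tamper-evident hash chain, can be sketched in a few lines. This is a generic hash-chain toy under my own naming, not Twine's actual protocol:

```python
import hashlib
import secrets

def append_entry(chain, random_bits):
    # Each ledger entry commits to the new 512-bit output AND the hash
    # of the previous entry, so altering any past output invalidates
    # every later hash.
    prev_hash = chain[-1]["hash"] if chain else b"\x00" * 32
    entry_hash = hashlib.sha256(prev_hash + random_bits).digest()
    chain.append({"bits": random_bits, "prev": prev_hash, "hash": entry_hash})

def verify(chain):
    # Anyone can recompute the chain from the genesis value and check
    # that every stored hash matches; no trusted party is needed.
    prev_hash = b"\x00" * 32
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if hashlib.sha256(prev_hash + entry["bits"]).digest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

chain = []
for _ in range(3):
    append_entry(chain, secrets.token_bytes(64))  # 512 bits per run
print(verify(chain))        # True

chain[0]["bits"] = bytes(64)  # tamper with a past output
print(verify(chain))        # False
```

The design choice mirrored here is that verification requires no secrets: the chain itself is the proof, which is what lets independent observers audit the generator's history.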
- KDE Plasma 6.4 released
The KDE desktop environment project has released Plasma 6.4. Major changes include: a different tiling layout can be chosen for each virtual desktop; the pointer can be moved with numeric keypad keys, and touchpad gestures can zoom in and out; increased contrast between foreground and background elements; improved appearance for Info Center and KMenuEdit; file-transfer notifications show an intuitive progress graphic; updates can be installed directly from the available-updates notification; applications in full-screen mode enter Do Not Disturb mode, in which only urgent notifications are shown; a notification pops up when an application tries to access the microphone while it is muted; and more.
- Trump Organization announces a $499 smartphone
The Trump Organization on Monday unveiled a $499 smartphone along with a $47.45-per-month mobile plan. The phone, called the T1, will launch in September and is designed and manufactured by a third party; the Trump Organization has not disclosed the manufacturer. Promotional images show the T1 with a gold case printed with the American flag and a MAGA slogan. Its specs: a 6.8-inch AMOLED display, Android 15, 12GB of RAM and 256GB of internal storage (no other options), a 50-megapixel main camera, and a 16-megapixel front camera.
- Social media is now Americans' top news source
According to a study by the Reuters Institute, social media is now the top news source for Americans. The data show 54% of people get news via Facebook, X, YouTube, and the like, surpassing television (50%) and news websites and apps (48%). Similar trends exist elsewhere in the world, but in the US they are unfolding faster and with deeper impact. Podcaster Joe Rogan is the most popular figure: 22% of respondents said they had seen his news or commentary in the past week. Some politicians now favor online hosts over mainstream media when choosing interviewers, allowing populist politicians to bypass traditional news outlets. Although influencers are more popular, they are not considered reliable news sources: 47% of people see them as a major source of misinformation. Musk's X platform is increasingly favored by the right. The report says TikTok is the fastest-growing social and video network, with 17% using it for news, up 4 percentage points from last year; more and more people use AI chatbots to get news, a trend most pronounced among those under 25; most people believe AI will reduce the transparency, accuracy, and trustworthiness of news; and well-known media outlets are still seen as more reliable.
- Intel to cut 15-20% of its chip-factory workforce
According to a memo Intel manufacturing vice president Naga Chandrasekaran sent to employees on Saturday, the company will cut 15-20% of its factory workforce starting in July, potentially more than 10,000 jobs. Intel had 109,000 employees at the end of 2024 but has not disclosed headcount for its manufacturing operations. Other Intel business units are also undergoing major layoffs; employees say the company has not specified how many positions each unit will lose.