OrangeBot.AI Digest — 2025-07-07
53 headlines across 5 sources, aggregated for the day.
Hacker News(15)
- New sphere-packing record stems from an unexpected source (www.quantamagazine.org)
- CPU-X: CPU-Z for Linux (thetumultuousunicornofdarkness.github.io)
- Adding a feature because ChatGPT incorrectly thinks it exists (www.holovaty.com)
- Launch HN: Morph (YC S23) – Apply AI code edits at 4,500 tokens/sec
- Solving Wordle with uv's dependency resolver (mildbyte.xyz)
- I used o3 to profile myself from my saved Pocket links (noperator.dev)
- Mercury: Ultra-fast language models based on diffusion (arxiv.org)
- When Figma starts designing us (designsystems.international)
- François Chollet: The Arc Prize and How We Get to AGI [video] (www.youtube.com)
- Ask HN: Any resources for finding non-smart appliances?
- Hymn to Babylon, missing for a millennium, has been discovered (phys.org)
- Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge (www.businessinsider.com)
- Deno 2.4 (deno.com)
- What every programmer should know about how CPUs work [video] (www.youtube.com)
- Neanderthals operated prehistoric “fat factory” on German lakeshore (archaeologymag.com)
GitHub Trending(10)
- rustfs / rustfs
🚀 High-performance distributed object storage, an alternative to MinIO.
- anthropics / prompt-eng-interactive-tutorial
Anthropic's Interactive Prompt Engineering Tutorial
- th-ch / youtube-music
YouTube Music Desktop App bundled with custom plugins
- dockur / macos
macOS inside a Docker container.
- pocketbase / pocketbase
Open Source realtime backend in 1 file
- commaai / openpilot
openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 300+ supported cars.
- smallcloudai / refact
AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.
- humanlayer / 12-factor-agents
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
- ed-donner / llm_engineering
Repo to accompany my mastering LLM engineering course
- CodeWithHarry / Sigma-Web-Dev-Course
Source Code for Sigma Web Development Course
Product Hunt(12)
- TensorBlock Forge
One API for all AI models
- Stepfun Diligence Check
AI-powered search with agent-verified citations
- Sara, the AI Interviewer
Hire 10X faster. Unbiased structured interviews, 24/7.
- Context
The AI office suite
- Blogwald
Structure content for LLMs and search engines
- Voicebun
Build voice agents in seconds
- OneNode
Simplest backend for AI coding - Open source
- Jukebox
Free alternative to Spotify collaborative playlists
- DockFix
Customize your macOS dock like never before
- Iconize Folder
Customize folder color & icon & text
- Viseal
Immersive language learning from your daily scenes
- Tracking Languages
Track your language progress, one video at a time
Hugging Face(4)
- How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks; 6) reasoning models, e.g. o3, show improvements in geometric tasks; and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
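The prompt-chaining idea above (decomposing a vision task into a sequence of plain-text questions an API-only model can answer) can be sketched roughly as follows. The two-step split, the prompts, and the `ask` callable are illustrative assumptions, not the paper's actual framework:

```python
# Minimal prompt-chaining sketch: turn image classification into a chain of
# text-only API calls. `ask` stands in for any chat-model call returning text.

COARSE = {
    "animal": ["cat", "dog", "horse"],
    "vehicle": ["car", "truck", "bicycle"],
}

def classify_by_chaining(image_ref, ask):
    """Two-step chain: pick a coarse group first, then a fine label
    restricted to that group, so the model never ranks all classes at once."""
    groups = ", ".join(COARSE)
    coarse = ask(f"Image {image_ref}: which group fits best? Options: {groups}. "
                 "Answer with one word.").strip().lower()
    if coarse not in COARSE:  # fall back if the model answers free-form
        coarse = next(iter(COARSE))
    fine_opts = ", ".join(COARSE[coarse])
    fine = ask(f"Image {image_ref}: which label fits best? Options: {fine_opts}. "
               "Answer with one word.").strip().lower()
    return coarse, fine

# Stub model for demonstration: answers "animal" to the group question,
# "dog" to the label question.
def stub(prompt):
    return "animal" if "group" in prompt else "dog"

print(classify_by_chaining("img_001.jpg", stub))  # ('animal', 'dog')
```

The same pattern extends to denser tasks (e.g., asking for a label per image crop to approximate segmentation), which is where sensitivity to prompt wording becomes measurable.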
- Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation
The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.
- Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English-centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval and is part of the ongoing EKA initiative (https://eka.soket.ai), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.
- LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
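A Bradley-Terry reward model of the kind trained above scores each item with a scalar reward r and models P(A preferred over B) = sigmoid(r_A - r_B). A minimal sketch, assuming plain gradient ascent on per-story scalars rather than the neural reward heads used in practice:

```python
import math

# Bradley-Terry sketch (not LitBench's code): learn one scalar "reward"
# per story so that sigmoid(r_winner - r_loser) fits the preference labels.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(pairs, n_items, lr=0.5, epochs=200):
    """pairs: list of (winner_idx, loser_idx) human preference labels."""
    r = [0.0] * n_items
    for _ in range(epochs):
        for w, l in pairs:
            p = sigmoid(r[w] - r[l])  # model's P(winner beats loser)
            g = 1.0 - p               # gradient of the log-likelihood
            r[w] += lr * g
            r[l] -= lr * g
    return r

# Story 0 wins every comparison it appears in; story 2 loses every one.
prefs = [(0, 1), (0, 2), (1, 2), (0, 1)]
rewards = fit_bradley_terry(prefs, n_items=3)
print(rewards[0] > rewards[1] > rewards[2])  # True
```

A production reward model replaces the per-item scalars with a network mapping story text to r, but the pairwise loss is the same.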
Solidot(12)
- Samsung phone batteries are rated for far more charge cycles than other brands
The EU's new energy-efficiency labels require manufacturers to state a battery's rated number of charge cycles. By that measure, which phone brands have the most durable batteries today? The data show Samsung far in the lead. Google Pixel batteries are mostly rated for 1,000 cycles; Samsung's are mostly 2,000 (a few models at 1,200); the Fairphone 5 is rated at 1,200, dropping to 1,000 for the Fairphone 6; Motorola's Edge 50 series is at 1,200, the G55 at 800, and other models mostly 1,000; Nothing phones are rated at 1,400; the OnePlus 13R is at 1,200 and the OnePlus 13 at 1,000; Sony's Xperia 1 VII is at 1,400; and Apple's iPhone 16 series is uniformly 1,000.
- India leads the world in internet shutdowns
According to Internet Society statistics, 863 internet shutdown events have been recorded since 2018, of which India alone accounts for nearly half, at 411, followed by Iraq with 140, Syria with 66, Sudan with 33, Pakistan and Algeria with 17 each, and Iran with 16. One reason shutdowns are so frequent in India is that the law empowers officials to cut off the internet in the name of maintaining public order: local officials have statutory authority to order telecom companies to manually shut down network service. To impose a shutdown, an official need only write or email every ISP with a local office, and the ISPs then block all traffic in and out. Iraq's shutdowns, by contrast, are mostly about exams.
- Why orcas throw fish at us
According to a study published in the Journal of Comparative Psychology, the widely observed phenomenon of orcas (killer whales) tossing fish or other prey at humans can be attributed to their wanting to befriend us. Researchers documented 34 such encounters over the past 20 years; even when humans declined the gift, the orcas would linger expectantly and sometimes try the offering again, suggesting a curious relationship-building motive. Lead author Jared Towers of Bay Cetology in British Columbia, Canada, notes that orcas routinely share food with one another, a prosocial behavior and one of the ways they build relationships. Sharing food with humans may indicate they are interested in building relationships with us as well.
- Moderna says its mRNA flu vaccine outperforms the standard vaccine
Moderna has released trial results for its mRNA-based seasonal flu vaccine: the mRNA vaccine mRNA-1010 was 27% more effective than the standard vaccine. About 41,000 people aged 50 and over took part in the trial; they were randomly assigned to receive either mRNA-1010 or a standard flu shot and followed for roughly six months through the flu season. Compared with the standard vaccine, the mRNA vaccine's overall efficacy was 26.6% higher, and 27.4% higher among participants 65 and older. Earlier trial data showed that mRNA-1010 elicited stronger immune responses than both standard and high-dose flu vaccines. Given the anti-vaccine stance of current US Health Secretary Robert F. Kennedy Jr., the future of mRNA-1010 is uncertain: Kennedy has already canceled the mRNA flu vaccine grant the previous Biden administration awarded to Moderna.
- The dollar is having its worst year in modern history
The US dollar is having its worst year in modern history: it has fallen more than 7% so far this year, and Morgan Stanley forecasts it could drop another 10% in the second half. A weaker dollar could boost US export competitiveness and advance the Trump administration's plan to rebalance American trade, but it also makes imports more expensive, compounding the shock from tariffs. The question ahead is whether the dollar will lose not just its value but also its central place in the global financial system. So far, central banks' de-dollarization efforts have been a shift toward gold, not toward another currency such as the renminbi.
- Businesses are already feeling the effects of climate change
Climate change is already affecting businesses worldwide. According to a Morgan Stanley report, more than half of the companies surveyed experienced climate-related operational disruptions over the past year, including higher costs, work stoppages, and lost revenue. Extreme heat and storms were the most frequent causes of disruption, followed by wildfires and smoke, water shortages, and flooding or sea-level rise. A Bloomberg Intelligence analysis found that the US alone spent nearly a trillion dollars over the past year on disaster recovery and other climate-related needs. Nearly 90% of South American companies estimate that climate change will pose a risk to their business model by the end of the decade; in North America, the top perceived risk is not climate change but political turmoil.
- A supernova that detonated twice
Astronomers have for the first time found evidence that a Type Ia supernova was produced by a "double detonation": a white dwarf, before reaching its critical mass, first underwent an explosion in its surface helium layer, which then triggered a second explosion in its core. Type Ia supernovae originate from the explosion of a white dwarf in a binary system: when the white dwarf accretes enough matter from its companion to reach the so-called Chandrasekhar limit, it undergoes a violent thermonuclear explosion, producing a supernova of stable, bright luminosity. The precise trigger mechanism, however, remains an open question. Simulations suggest that at least some Type Ia supernovae may arise from a double detonation that occurs before the critical mass is reached: a layer of accreted helium builds up on the white dwarf's surface, and when that layer becomes unstable it detonates first, sending a shock wave inward that triggers a second detonation in the core and produces the supernova. Observing the supernova remnant SNR 0509-67.5 in the Large Magellanic Cloud, scientists found two distinct shells of calcium, the fingerprint a double detonation leaves behind. This is the first time such a structure has been clearly identified observationally, confirming that the double-detonation mechanism exists and that a white dwarf can explode before reaching its critical mass. The result helps explain the diversity of Type Ia supernova formation and further refines cosmic distance measurements and our understanding of the origin of heavy elements.
- StatCounter: Windows 11's market share has overtaken Windows 10's
According to StatCounter data through early July, with just three months left before Microsoft ends support for Windows 10, its market share has finally been overtaken by Windows 11. Windows 11 now holds 50.24% of the market to Windows 10's 46.84%. That is a huge leap from a year ago, when Windows 10 stood at 66.04% and Windows 11 at just 29.75%.
- Valve has conquered PC gaming
The FT reports that Valve dominates the PC gaming industry, but its enormous success has also created a crisis: it develops big games at a glacial pace. Valve's Steam platform controls about 70% of PC game sales, and every challenge other companies have mounted against Steam has failed. Epic Games Store, known for giving games away for free, has made no real dent in Steam, and Microsoft's game store, EA's Origin (since replaced by the EA App), and others have likewise struggled to shake it. Court filings show Steam's revenue is projected to exceed $10 billion by 2026, yet amid that success the company's direction is unclear. Founder Gabe Newell has largely stepped back from day-to-day operations, lives aboard one of his five yachts, and has invested in side ventures such as the brain-computer-interface company Starfish Neuroscience.
- Microsoft shuts down its Pakistan operations
Microsoft has shut down its operations in Pakistan, ending a 25-year presence in the country. The exit has been confirmed by employees and local media. The software giant opened its Pakistan office in June 2000. In recent days the last remaining employees there formally received closure notices, marking the end of an era in which Microsoft played a key role in cultivating local talent, building corporate partnerships, and promoting digital literacy across industries. Although the Pakistan office is closed, Microsoft is expected to continue offering its products and services through regional hubs and third-party partners.
- Study finds no safe level of processed meat consumption
According to a study published in Nature Medicine, there is no safe level of processed meat consumption: any amount increases disease risk. The researchers analyzed more than 60 studies on the relationship between dietary processed meat, sugar-sweetened beverages, and trans fatty acids and the risk of type 2 diabetes, colorectal cancer, and ischemic heart disease. The results show that even small amounts of processed meat, sugary drinks, and trans fats raise disease risk. Compared with people who eat no hot dogs, those who eat just one hot dog a day have an 11% higher risk of type 2 diabetes and a 7% higher risk of colorectal cancer; drinking about 12 ounces of soda a day raises type 2 diabetes risk by 8% and ischemic heart disease risk by 2%. Processed meat refers to meat processed into forms such as sausage, bacon, and burgers.
- Microsoft Xbox executive advises laid-off staff to use AI to manage their emotions
Microsoft announced more than 9,000 layoffs this week, hitting its Xbox gaming business especially hard: studios were closed and multiple game projects canceled. In response, Xbox executive Matt Turnbull offered a suggestion: laid-off employees should use AI to manage their emotions. His advice was posted on LinkedIn; the post has since been deleted, but its contents were preserved. He said he had been experimenting with LLM tools such as ChatGPT and Copilot to help reduce the emotional and cognitive load of job loss, adding that it would be a dereliction on his part not to offer the best advice he could.