Physical AI Brief
Daily cross-source signals for the Physical AI supply chain — silicon photonics, CPO, VLA models, humanoid hardware, embodied AI. Three streams, one page, zero filler.
94 items today · 34 arxiv · 2 SEC 8-K · 58 humanoid · 0 CN photonics
01 ARXIV · PHYSICAL AI PAPERS
34 items
- arxiv:2604.16158 · cs.LG
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
Tags: manipulation
- arxiv:2604.16083 · cs.CV
DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics
Jieming Yu, Qiuxiao Feng, Zhuohan Wang, Xiaochen Ma
With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.
Tags: manipulation
- arxiv:2604.16067 · cs.LG
AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning
Guransh Singh
Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry (the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training) causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop-gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
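As an illustration of the per-layer projection idea, here is a minimal sketch. It is not the paper's implementation: the abstract does not give the exact rule, so this assumes a PCGrad-style step that removes the component of the task gradient along the anchor gradient only when the two conflict (negative dot product). The function name `project_away` and the toy vectors are hypothetical.

```python
def project_away(task_grad, anchor_grad, eps=1e-12):
    """Single Gram-Schmidt step on one layer's flattened gradients.

    Assumption (not from the abstract): the task gradient is treated as
    "destructive" when its dot product with the anchor gradient is negative,
    and only then is its component along the anchor gradient removed.
    """
    dot = sum(t * a for t, a in zip(task_grad, anchor_grad))
    norm_sq = sum(a * a for a in anchor_grad)
    if dot >= 0.0 or norm_sq < eps:
        # No conflict (or a degenerate anchor gradient): leave the task gradient as is.
        return list(task_grad)
    scale = dot / norm_sq
    return [t - scale * a for t, a in zip(task_grad, anchor_grad)]


# Toy example: the task gradient points partly against the anchor gradient.
task = [1.0, -1.0]
anchor = [0.0, 1.0]
projected = project_away(task, anchor)
print(projected)  # the conflicting second component is removed
```

In a real training loop this step would run once per transformer layer on that layer's gradients; how AEGIS defines and detects the destructive direction is left to the paper.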
Tags: vision-language-action
- arxiv:2604.16059 · physics.optics
Controlling external injection in laser-plasma accelerators with terahertz frequency bunch manipulation
Aras Amini, Lewis R. Reid, James K. Jones, Morgan T. Hibberd +5
Laser-plasma wakefield acceleration (LWFA) offers ultrahigh accelerating gradients in compact setups, but the complex non-linear nature of the process makes it challenging to generate high-quality beams. Injection of electron bunches from an external source into a plasma accelerator provides a promising route to improved performance; however, electron bunches from conventional radio-frequency (RF)-based injectors suffer from non-linear compression and laser-beam asynchrony, leading to energy jitter and emittance growth. We present a fundamental concept of terahertz-controlled electron bunches for external injection into LWFA. This terahertz-frequency approach provides temporal locking between the electron beam and the drive laser, and enables the compression of high-quality beams to sub-10-fs durations before injection into the LWFA. Numerical simulations demonstrate that GeV-scale acceleration with excellent beam quality and stability -- energy jitter and energy spread around 0.2% -- can be achieved using this method. This concept opens new opportunities for stable, multi-stage laser-driven accelerators and supports the development of next-generation applications such as free-electron lasers (FELs).
Tags: manipulation
- arxiv:2604.16054 · cs.CV
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian +1
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
Tags: manipulation
- arxiv:2604.16022 · cs.LG
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski +1
As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
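For readers unfamiliar with Elo-based leaderboards, here is a minimal sketch of the standard Elo update after one match. The K-factor of 32 and the 400-point scale are the conventional defaults, assumed here; the abstract does not state which settings SocialGrid uses, and the function names are illustrative.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed conventional K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated agents; A wins and gains exactly what B loses.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```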
Tags: embodied
- arxiv:2604.15996 · eess.SY
Stealthy Cyber-Attacks on Vehicle Lateral Dynamics: A System-Theoretic Analysis
Ali Eslami, Jiangbo Yu, Mohammad Pirani
This paper studies the vehicle bicycle model under three classes of stealthy cyber-attacks: replay attacks, zero dynamics attacks, and covert attacks. Using a system-theoretic framework, we analyze the feasibility and impact of these attacks on vehicle lateral dynamics. The investigation considers different measurement configurations, including yaw rate, lateral acceleration, and longitudinal acceleration outputs, to evaluate how sensor selection influences attack detectability and system vulnerability. Each attack class is characterized in terms of required system knowledge, communication access, and impact. The analysis shows that replay attacks remain largely model-agnostic, while zero dynamics attacks are fundamentally constrained by control-oriented design choices, particularly output selection, which can eliminate unstable zero dynamics and limit the attack impact. In contrast, covert attacks, enabled by coordinated actuator and sensor manipulation, allow sustained and stealthy deviation of lateral states when sufficient access and system knowledge are available. The effects of actuator and tire saturation are also examined, revealing attack-dependent impacts on stealthiness and effectiveness. Finally, simulation case studies are conducted by using CarSim-Simulink co-simulation to validate and verify the theoretical results.
Tags: manipulation
- arxiv:2604.15948 · cs.CV
From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
Jinhao Shen, Haoqian Du, Xulu Zhang, Xiao-Yong Wei +1
Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. The adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.
Tags: manipulation
- arxiv:2604.15938 · cs.RO
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu +1
Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise steps with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.
Tags: manipulation · diffusion policy
- arxiv:2604.15907 · cs.RO
A Reconfigurable Pneumatic Joint Enabling Localized Selective Stiffening and Shape Locking in Vine-Inspired Robots
Ayodele James Oyejide, Ustaz A. Yaqub, Samir Erturk, Eray A. Baran +1
Vine-inspired robots achieve large workspace coverage through tip eversion, enabling safe navigation in confined and cluttered environments. However, their deployment in free space is fundamentally limited by low axial stiffness, poor load-bearing capacity, and the inability to retain shape during and after steering. In this work, we propose a reconfigurable pneumatic joint (RPJ) architecture that introduces discrete, pressure-tunable stiffness along the robot body without compromising continuous growth. Each RPJ module comprises symmetrically distributed pneumatic chambers that locally increase bending stiffness when pressurized, enabling decoupling between global compliance and localized rigidity. We integrate the RPJs into a soft growing robot with tendon-driven steering and develop a compact base station for mid-air eversion. System characterization and experimental validation demonstrate moderate pressure requirements for eversion, as well as comparable localized stiffening and steering performance to layer-jamming mechanisms. Demonstrations further show that the proposed robot achieves improved shape retention during bending, reduced gravitational deflection under load, cascading retraction, and reliable payload transport up to 202 g in free space. The RPJ mechanism establishes a practical pathway toward structurally adaptive vine robots for manipulation-oriented tasks such as object sorting and adaptive exploration in unconstrained environments.
Tags: manipulation
- arxiv:2604.15857 · cs.CV
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon +3
Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and in faithfully preserving accessories under significant head pose variations.
Tags: manipulation
- arxiv:2604.15823 · cs.CV
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi +2
Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
Tags: embodied
- arxiv:2604.15814 · cs.RO
Continual Hand-Eye Calibration for Open-world Robotic Manipulation
Fazeng Li, Gan Sun, Chenxi Liu, Yao He +2
Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data as open-world scenes change, and simple rehearsal-based continual learning strategies do not adequately mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework that enables robots to adapt to sequentially encountered open-world manipulation scenes through a spatial-aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) decomposes localization knowledge into coarse scene layout and fine pose precision and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show strong resistance to scene forgetting, maintaining accuracy on past scenes while preserving adaptation to new scenes, confirming the effectiveness of the framework.
Tags: manipulation
- arxiv:2604.15805 · cs.RO
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
Jasper Lu, Zhenhao Shen, Yuanfei Wang, Shugao Liu +7
Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, transferring real-world scenes into simulation has become a practical augmentation approach for efficient learning and evaluation. We present a generative framework that establishes a real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.
Tags: manipulation
- arxiv:2604.15795 · cs.CV
Fed3D: Federated 3D Object Detection
Suyan Dai, Chenxi Liu, Fazeng Li, Peican Lin
3D object detection models trained on a single server play an important role in autonomous driving, robotic manipulation, and augmented reality scenarios. However, most existing methods raise severe privacy concerns when deployed on a multi-robot perception network to explore large-scale 3D scenes. Meanwhile, it is highly challenging to apply conventional federated learning methods to 3D object detection, due to 3D data heterogeneity and limited communication bandwidth. In this paper, we make the first attempt to propose a novel federated 3D object detection framework (i.e., Fed3D) that enables distributed learning for 3D object detection with privacy preservation. Specifically, the irregular 3D input objects on each local robot and the varying category distributions across robots cause local heterogeneity and global heterogeneity, respectively. We therefore propose a local-global class-aware loss for the 3D data heterogeneity issue, which balances the gradient back-propagation rate of different 3D categories from local and global aspects. To reduce the communication cost of each round, we develop a federated 3D prompt module, which learns and communicates only prompts with few learnable parameters. Finally, extensive experiments on federated 3D object detection show that our Fed3D model significantly outperforms state-of-the-art algorithms with lower communication cost when only limited local training data is available.
Tags: manipulation
- arxiv:2604.15671 · cs.RO
Long-Term Memory for VLA-based Agents in Open-World Task Execution
Xu Huang, Weixin Mao, Yinhao Li, Hua Chen +1
Vision-Language-Action (VLA) models have demonstrated significant potential for embodied decision-making; however, their application in complex chemical laboratory automation remains restricted by limited long-horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial-and-error in multi-stage protocols. In this paper, we propose ChemBot, a dual-layer, closed-loop framework that integrates an autonomous AI agent with a progress-aware VLA model (Skill-VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual-layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub-agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future-state-based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and task success rates compared to existing VLA baselines in complex, long-horizon chemical experimentation.
Tags: vision-language-action · vla · vla model · embodied
- arxiv:2604.15569 · cs.RO
ShapeGen: Robotic Data Generation for Category-Level Manipulation
Yirui Wang, Xiuwei Xu, Angyuan Ma, Bingyao Yu +2
Manipulation policies deployed in uncontrolled real-world scenarios are faced with great in-category geometric diversity of everyday objects. In order to function robustly under such variations, policies need to work in a category-level manner, i.e. knowing how to interact with any object in a certain category, instead of only a specific one seen during training. This in-category generalizability is usually nurtured with shape-diversified training data; however, manually collecting such a corpus of data is infeasible due to the requirement of intense human labor and large collections of divergent objects at hand. In this paper, we propose ShapeGen, a data generation method that aims at generating shape-variated manipulation data in a simulator-free and 3D manner. ShapeGen decomposes the process into two stages: Shape Library curation and Function-Aware Generation. In the first stage, we train spatial warpings between shapes mapping points to points that correspond functionally, and aggregate 3D models along with the warpings into a plug-and-play Shape Library. In the second stage, we design a pipeline that, leveraging established Libraries, requires only minimal human annotation to generate physically plausible and functionally correct novel demonstrations. Experiments in the real world demonstrate the effectiveness of ShapeGen to boost policies' in-category shape generalizability. Project page: https://wangyr22.github.io/ShapeGen/.
Tags: manipulation
- arxiv:2604.15550 · physics.optics
Incoherence-assisted mode excitation in non-Hermitian resonant systems
Amin Hashemi, Vinzenz Zimmermann, Armando Perez-Leija, Andrea Blanco-Redondo
We introduce and experimentally demonstrate an approach for selective mode excitation in non-Hermitian resonant systems using incoherent light. This method eliminates the need for precise phase control that is often required in coherent excitation schemes. Using this technique on a silicon photonic platform with coupled ring resonators, we successfully excite the topological edge state of a non-Hermitian Su-Schrieffer-Heeger (SSH) model. Our work shows that incoherence-assisted excitation is a robust and passive strategy for topological state preparation, which broadens the scope of non-Hermitian topological photonics and provides a practical, experimentally viable tool for selective mode excitation.
Tags: silicon photonic
- arxiv:2604.15495 · cs.RO
GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
Shivendra Agrawal, Bradley Hayes
Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.
Tags: embodied
- arxiv:2604.15493 · physics.optics
End-to-End Physical Design Automation Flow for Yield-Optimized Inverse-Designed Large-Scale Electronic-Photonic Integrated Circuits
Hongjian Zhou, Haoyu Yang, Haoxing Ren, Joaquin Matres +1
As AI systems scale to multi-chiplet and wafer-level architectures, the demand for ultra-high bandwidth and system scalability has outpaced the capabilities of electrical interconnects and computing units. Large-scale heterogeneous electronic-photonic integrated chiplets (EPICs) provide a promising solution, but their practical adoption is limited by the lack of a unified, fabrication-aware physical design automation stack. At the same time, inverse-designed ultra-compact photonic devices offer orders-of-magnitude improvements in spatial and spectral density, yet remain constrained by insufficient design-for-manufacturing support and yield optimization. In this work, we present OptoSynthesizer, an end-to-end physical design automation flow for yield-optimized, inverse-designed EPICs. It integrates three key components across the physical design pipeline: (1) OptoSynthesizer-InvDes, a physical-AI-augmented, digital-twin-assisted photonic inverse design and photonics-aware inverse lithography framework; (2) OptoSynthesizer-Place, a GPU-accelerated routing-informed EPIC placer for large-scale routability-optimized layout; and (3) OptoSynthesizer-Route, a hierarchical curvy-aware waveguide router with global-planning-assisted electrical-optical co-routing. Together, these toolkits form a seamless flow from EPIC netlists to fabrication-ready, yield-robust GDS layouts. We demonstrate how this framework enables compact large-scale photonic tensor cores and high-bandwidth interconnect fabrics for heterogeneous EPIC platforms, providing a practical foundation for manufacturable large-scale EPICs in next-generation AI systems.
Tags: photonic integrated circuit
- arxiv:2604.15281 · cs.RO
R3D: Revisiting 3D Policy Learning
Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao +7
3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
manipulation
- arxiv:2604.15215 · cs.RO · A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics · Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani +3
We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
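The two successive quantization levels described above can be illustrated with a minimal sketch. It is not the authors' implementation: the codebooks `FINE_CODEBOOK` and `COARSE_CODEBOOK`, the 2-D action values, and the function names are all hypothetical, and a real tokenizer would learn its codebooks rather than hard-code them.

```python
from math import dist

# Hypothetical codebooks for 2-D actions; values are illustrative only.
FINE_CODEBOOK = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 0.9)]
COARSE_CODEBOOK = [(0.05, 0.0), (1.05, 0.95)]


def nearest(point, codebook):
    """Return the index of the codebook entry closest to point (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: dist(point, codebook[i]))


def tokenize(action):
    """Lower level: action -> fine subcluster. Higher level: subcluster -> cluster."""
    fine_id = nearest(action, FINE_CODEBOOK)
    coarse_id = nearest(FINE_CODEBOOK[fine_id], COARSE_CODEBOOK)
    return coarse_id, fine_id


def reconstruct(fine_id):
    """Recover an approximate action from its fine-level token."""
    return FINE_CODEBOOK[fine_id]


if __name__ == "__main__":
    print(tokenize((0.09, 0.02)))
    print(reconstruct(1))
```

The coarse token gives a compact label for in-context prompts, while the fine token keeps enough detail to reconstruct the input action.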
manipulation
- arxiv:2604.15023 · cs.RO · DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation · Ziyu Shan, Yuheng Zhou, Gaoyuan Wu, Ziheng Ji +2
Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from the view generalization problem due to shifts of docking points. To address this issue, we propose a novel low-cost demonstration generation framework named DockAnywhere, which improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere lifts a trajectory to any feasible docking point by decoupling docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Feasible docking proposals are sampled under feasibility constraints, and corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing to ensure the consistency of observation and action across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and easily generalizes to novel viewpoints from docking points unseen during training, significantly enhancing the generalization capability of mobile manipulation policies in real-world deployment.
manipulation
- arxiv:2604.15013 · cs.RO · DEX-Mouse: A Low-cost Portable and Universal Interface with Force Feedback for Data Collection of Dexterous Robotic Hands · Joonho Koh, Haechan Jung, Nayoung Kim, Wook Ko +1
Data-driven dexterous hand manipulation requires large-scale, physically consistent demonstration data. Simulation and video-based methods suffer from sim-to-real gaps and retargeting problems, while MoCap glove-based teleoperation systems require per-operator calibration and lack portability, as the robot hand is typically fixed to a stationary arm. Portable alternatives improve mobility but lack cross-platform and cross-operator compatibility. We present DEX-Mouse, a portable, calibration-free hand-held teleoperation interface with integrated kinesthetic force feedback, built from commercial off-the-shelf components under USD 150. The operator-agnostic design requires no calibration or structural modification, enabling immediate deployment across diverse environments and platforms. The interface supports a configuration in which the target robot hand is mounted directly on the forearm of an operator, producing robot-aligned data. In a comparative user study across various dexterous manipulation tasks, operators using the proposed system achieved an 86.67% task completion rate under the attached configuration. Also, we found that the attached configuration reduced the perceived workload of the operators compared to spatially separated teleoperation setups across all compared interfaces. The complete hardware and software stack, including bill of materials, CAD models, and firmware, is open-sourced at https://dex-mouse.github.io/ to facilitate replication and adoption.
manipulation · dexterous · teleoperation
- arxiv:2604.14965 · cs.RO · POMDP-based Object Search with Growing State Space and Hybrid Action Domain · Yongbo Chen, Hesheng Wang, Shoudong Huang, Hanna Kurniawati
Efficiently locating target objects in complex indoor environments with diverse furniture, such as shelves, tables, and beds, is a significant challenge for mobile robots. This difficulty arises from factors like localization errors, limited fields of view, and visual occlusion. We address this by framing the object-search task as a high-dimensional Partially Observable Markov Decision Process (POMDP) with a growing state space and hybrid (continuous and discrete) action spaces in 3D environments. Based on a meticulously designed perception module, a novel online POMDP solver named the growing neural process filtered k-center clustering tree (GNPF-kCT) is proposed to tackle this problem. Optimal actions are selected using Monte Carlo Tree Search (MCTS) with belief tree reuse for growing state space, a neural process network to filter useless primitive actions, and k-center clustering hypersphere discretization for efficient refinement of high-dimensional action spaces. A modified upper-confidence bound (UCB), informed by belief differences and action value functions within cells of estimated diameters, guides MCTS expansion. Theoretical analysis validates the convergence and performance potential of our method. To address scenarios with limited information or rewards, we also introduce a guessed target object with a grid-world model as a key strategy to enhance search efficiency. Extensive Gazebo simulations with Fetch and Stretch robots demonstrate faster and more reliable target localization than POMDP-based baselines and state-of-the-art (SOTA) non-POMDP-based solvers, especially large language model (LLM) based methods, in object search under the same computational constraints and perception systems. Real-world tests in office environments confirm the practical applicability of our approach. Project page: https://sites.google.com/view/gnpfkct.
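For readers unfamiliar with how an upper-confidence bound guides MCTS expansion, here is a minimal sketch of the standard UCB1 rule. It is the textbook baseline, not the modified bound from the paper (which also uses belief differences and cell diameters); the function names, the exploration constant `c`, and the child tuples are assumptions made for illustration.

```python
import math


def ucb1(mean_value, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score: exploitation term plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children):
    """Pick the child with the highest UCB1 score.

    children: list of (name, total_value, visits) tuples (hypothetical layout).
    """
    parent_visits = sum(visits for _, _, visits in children)

    def score(child):
        _, total_value, visits = child
        mean = total_value / visits if visits else 0.0
        return ucb1(mean, visits, parent_visits)

    return max(children, key=score)[0]


if __name__ == "__main__":
    print(select_child([("look_left", 9.0, 10), ("move_forward", 0.5, 1)]))
```

The rarely visited action wins here despite its lower mean value, which is the exploration behavior the bound is designed to produce.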
world model
- arxiv:2604.14944 · cs.RO · HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps · Jongbin Lim, Taeyun Ha, Mingi Choi, Jisoo Kim +3
We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.
manipulation · dexterous · tactile
- arxiv:2604.14902 · cs.RO · ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints · Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su +3
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
embodied
- arxiv:2604.14836 · physics.optics · Low voltage and high-bandwidth thin-film lithium tantalate modulator on a silicon dioxide substrate · Zihan Li, Alexander Kotz, Adrian Schwarzenberger, Christian Koos +1
Modern communication networks demand ever-increasing transmission bandwidth, placing stringent requirements on low-cost, high-performance electro-optic modulators. Substantial advances have been made in integrated photonics employing lithium niobate on insulator. In contrast, photonic integrated circuits based on lithium tantalate -- a material already commercially adopted for wireless filters -- have been developed, offering reduced DC drift, higher optical power handling, and lower birefringence. These advantages enable more complex and dense photonic integrated circuits, and make lithium tantalate a promising material platform for next-generation integrated electro-optic modulators. However, in contrast to the extensively studied thin-film lithium niobate platform, thin-film lithium tantalate modulators have only been explored on silicon substrates. Here, we report the first fabrication and characterization of thin-film lithium tantalate electro-optic modulators manufactured on a 4-inch (100 mm) fused-silica substrate for adapting a low-loss slow-wave microwave electrode to improve the electro-optic bandwidth. By employing a slow-wave electrode design to achieve velocity matching between microwave and optical signals, the demonstrated modulator achieves a 3-dB electro-optic bandwidth of 64 GHz with a low half-wave voltage of 1.53 V, with potential to operate at the measured 100 GHz electrical bandwidth, if the employed spectral biasing is removed. The modulator moreover exhibits low bias drift, with a constant switching voltage down to 10 mHz. This performance enables high-speed data transmission comparable to state-of-the-art lithium niobate modulators fabricated on quartz substrates. Using the fabricated devices, a net single lane data rate of 440.6 Gbps is achieved using PAM8 signaling.
photonic integrated circuit
- arxiv:2604.14834 · cs.RO · Switch: Learning Agile Skills Switching for Humanoid Robots · Yuen-Fui Lau, Qihan Zhao, Yinhuai Wang, Runyi Yu +3
Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch empowers humanoid robots to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
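The online graph search over a skill graph can be sketched as a shortest-path query. The abstract does not say which search algorithm the scheduler uses, so Dijkstra's algorithm is an assumption here, as are the skill names and the edge costs in `SKILL_GRAPH` (standing in for a kinematic-similarity-based transition cost).

```python
import heapq

# Hypothetical skill graph: keys are skills, values map reachable skills
# to an illustrative transition cost (lower means a smoother transition).
SKILL_GRAPH = {
    "walk": {"run": 1.0, "crouch": 2.0},
    "run": {"jump": 1.5, "walk": 1.0},
    "crouch": {"jump": 4.0, "walk": 2.0},
    "jump": {"walk": 2.5},
}


def plan_transition(graph, start, goal):
    """Dijkstra search; returns (cost, path), or None if goal is unreachable."""
    frontier = [(0.0, start, [start])]
    settled = set()
    while frontier:
        cost, skill, path = heapq.heappop(frontier)
        if skill == goal:
            return cost, path
        if skill in settled:
            continue
        settled.add(skill)
        for nxt, weight in graph.get(skill, {}).items():
            if nxt not in settled:
                heapq.heappush(frontier, (cost + weight, nxt, path + [nxt]))
    return None


if __name__ == "__main__":
    print(plan_transition(SKILL_GRAPH, "walk", "jump"))
```

In this toy graph the planner routes a walk-to-jump request through the intermediate "run" skill because no direct edge exists.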
humanoid
- arxiv:2604.14733 · cs.RO · Differentiable Object Pose Connectivity Metrics for Regrasp Sequence Optimization · Liang Qin, Weiwei Wan, Kensuke Harada
Regrasp planning is often required when one pick-and-place cannot transfer an object from an initial pose to a goal pose while maintaining grasp feasibility. The main challenge is to reason about shared-grasp connectivity across intermediate poses, where discrete search becomes brittle. We propose an implicit multi-step regrasp planning framework based on differentiable pose sequence connectivity metrics. We model grasp feasibility under an object pose using an Energy-Based Model (EBM) and leverage energy additivity to construct a continuous energy landscape that measures pose-pair connectivity, enabling gradient-based optimization of intermediate object poses. An adaptive iterative deepening strategy is introduced to determine the minimum number of intermediate steps automatically. Experiments show that the proposed cost formulation provides smooth and informative gradients, improving planning robustness over other alternatives. They also demonstrate generalization to unseen grasp poses and cross-end-effector transfer, where a model trained with suction constraints can guide parallel gripper grasp manipulation. The multi-step planning results further highlight the effectiveness of adaptive deepening and minimum-step search.
manipulation
- arxiv:2604.14732 · cs.RO · World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems · Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng +4
Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
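The exponential-decay claim can be made concrete with a toy calculation. The sketch below assumes each sampled action is feasible independently with a fixed probability `per_step_p`; this simplification is ours and is not taken from the paper's analysis, and the function names are hypothetical.

```python
def feasible_probability(per_step_p, horizon):
    """Probability that every step of a sampled trajectory is feasible,
    under the simplifying assumption of independent, identical steps."""
    return per_step_p ** horizon


def expected_samples_needed(per_step_p, horizon):
    """Expected number of random trajectories drawn before one is feasible."""
    return 1.0 / feasible_probability(per_step_p, horizon)


if __name__ == "__main__":
    for horizon in (1, 10, 50):
        print(horizon, feasible_probability(0.9, horizon))
```

Even with a 90% per-step feasibility rate, the fraction of fully feasible trajectories falls below 1% by a horizon of 50 under this assumption, which is why naive sampling in action space scales poorly.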
vision-language-action · vla · embodied · world model
- arxiv:2604.14679 · physics.optics · Observation of Restored Adiabatic State Transfer in Time-Modulated Non-Hermitian Systems · Xiaowei Wang, Ievgen I. Arkhipov, Quan Lin, Huixia Gao +4
Exceptional points (EPs) have attracted extensive research interest due to their intriguing properties. One of the hallmarks of EP physics is that dynamically encircling the EPs induces chiral mode switching, arising from the breakdown of adiabaticity due to the presence of a complex spectrum in the system's Hamiltonian. While such chiral mode behavior has been widely observed experimentally, achieving truly adiabatic, and thus symmetric, state transfer, regardless of the winding direction, in time-modulated non-Hermitian systems has remained elusive. In this work, we demonstrate that this long-sought adiabatic state dynamics can indeed be restored. By steering a two-mode photonic setup along specifically designed trajectories in parameter space, we realize conditions where the associated non-Hermitian evolution operator acquires a purely real spectrum. Moreover, our experimental platform enables controlled switching between symmetric (adiabatic) and chiral (non-adiabatic) state-transfer regimes for the same set of initial modes, thus effectively implementing a universal symmetric-asymmetric two-mode switch. Our results therefore open new avenues for harnessing unique topological spectral properties of non-Hermitian systems, paving the way for the practical design of versatile optical wave-manipulation devices and for advancing both classical and quantum information technologies.
manipulation
- arxiv:2604.14664 · physics.optics · Scaling Photonic Tensor Cores with Unary and Homodyne Designs · Oluwaseun Alo, Ishan Thakkar
We analyze five photonic microring tensor core designs with a common optical power model. The results show that circuit ordering, unary encoding, and homodyne accumulation shape scalability, with the last two offering the strongest path to higher parallelism.
microring
- arxiv:2604.14565 · cs.RO · Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion · Tomoya Kamimura, Haruka Washiyama, Akihito Sano
Embodiment is a significant keyword in recent machine learning fields. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.
embodied · humanoid
02 US SEMI · SEC 8-K FILINGS
2 items · scanned: NVDA / AVGO / MRVL / COHR / LITE / AMD / TSM / SMCI / ANET / CRDO / POWL / VECO
03 HUMANOID · COMPANY NEWS
58 items · scanned: figure-ai / 1x / boston-dynamics / unitree / apptronik / sanctuary-ai / neura-robotics / agility-robotics / physical-intelligence / agibot
Figure AI (10)
- Figure AI · January 27, 2026 · Introducing Helix 02: Full-Body Autonomy
- Figure AI · October 09, 2025 · Introducing Figure 03
- Figure AI · June 07, 2025 · Scaling Helix: a New State of the Art in Humanoid Logistics
- Figure AI · March 09, 2026 · Helix 02 Living Room Tidy
- Figure AI · November 19, 2025 · F.02 Contributed to the Production of 30,000 Cars at BMW
Boston Dynamics (10)
- Boston Dynamics · AIVI-Learning Is Now Powered by Google Gemini Robotics
- Boston Dynamics · Tools for Your To Do List with Spot and Gemini Robotics
- Boston Dynamics · Scaling New Career Heights with Stretch
- Boston Dynamics · Atlas’ Evolution From Research Robot to Industrial Humanoid
- Boston Dynamics · A New Perspective for Facilities Inspection
Unitree 宇树 (9)
- Unitree 宇树 · Components
- Unitree 宇树 · Kung Fu Meets Spring, Unitree SFG Robots Present "Cyber Real Kung Fu" in the Year of the Horse · 2026-03-04 · Media Coverage
- Unitree 宇树 · Important Reminder from Unitree: Avoid Being Deceived · 2025-02-27 · Media Coverage
- Unitree 宇树 · Unitree H1: 1.5 Yrs Old "Debuted" at the SFG · 2025-02-05 · Media Coverage
- Unitree 宇树 · Unitree G1 Humanoid Agent | Price from $16K · 2024-07-05 · Media Coverage
Sanctuary AI (5)
- Sanctuary AI · Product Updates
- Sanctuary AI · Sanctuary AI Demonstrates Zero-Shot In-Hand Manipulation on Hydraulic Hand
- Sanctuary AI · If You Missed Messe
- Sanctuary AI · Sanctuary AI Leads the Industry in Controlling Advanced Hydraulic Hands Using Reinforcement Learning
- Sanctuary AI · Sanctuary AI Leverages NVIDIA Isaac Lab to Accelerate Dexterous Learning
Agility Robotics (10)
- Agility Robotics · Agility and AI · Blog Post · March 16, 2026
- Agility Robotics · Agility Gets a New Brand · Blog Post · March 5, 2026
- Agility Robotics · 2026: The Automation Evolution · Blog Post · January 16, 2026
- Agility Robotics · Beyond the Hype · Blog Post · November 24, 2025
- Agility Robotics · Digit Moves Over 100,000 Totes in Commercial Deployment · Blog Post · November 20, 2025
Physical Intelligence (7)
- Physical Intelligence · New · π0.7: a Steerable Model with Emergent Capabilities · April 16, 2026 · A steerable robotic foundation model that exhibits a step-change in generalization.
- Physical Intelligence · The Physical Intelligence Layer · February 24, 2026 · General-purpose physical intelligence models will enable a Cambrian explosion of robotics applications. See how our partners are already solving real-world problems.
- Physical Intelligence · Moravec's Paradox and the Robot Olympics · December 22, 2025 · By fine-tuning our latest model, we were able to solve a series of very difficult manipulation challenge tasks.
- Physical Intelligence · π*0.6: a VLA that Learns from Experience · November 17, 2025 · A method for training our generalist policies with RL to improve success rate and throughput on real-world tasks.
- Physical Intelligence · π0.5: a VLA with Open-World Generalization · April 22, 2025 · Our latest generalist policy, π0.5, extends π0 and enables open-world generalization. Our new model can control a mobile manipulator to clean up an entirely new kitchen or bedroom.
智元 AgiBot (7)
- 智元 AgiBot · AGIBOT Declares 2026 “Deployment Year On... · 2026-04-17
- 智元 AgiBot · AGIBOT Unveils New Generation of Embodie... · News and Information | 2026-04-17
- 智元 AgiBot · AGIBOT and Longcheer Technology Achieve ... · News and Information | 2026-04-14
- 智元 AgiBot · AGIBOT Launches Genie Studio Agent to En... · News and Information | 2026-04-13
- 智元 AgiBot · AGIBOT Demonstrates Fully Autonomous Hum... · News and Information | 2026-04-10