PHYSICAL AI · 2026-04-20

Physical AI Brief

Daily cross-source signals for the Physical AI supply chain — silicon photonics, CPO, VLA models, humanoid hardware, embodied AI. Three streams, one page, zero filler.

94 items today · 34 arxiv · 2 SEC 8-K · 58 humanoid · 0 CN photonics

01 ARXIV · PHYSICAL AI PAPERS

34 items

arxiv:2604.16158 · cs.LG
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
manipulation
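The reward combination described in this abstract can be sketched roughly as follows. This is a minimal illustration, assuming a plain weighted sum of outcome and saliency rewards followed by the group-relative normalization commonly used with GRPO. The weight `saliency_weight`, the function names, and the sample numbers are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def combined_rewards(outcome, saliency, saliency_weight=0.5):
    # Assumed combination rule: outcome reward plus a weighted saliency reward.
    return [o + saliency_weight * s for o, s in zip(outcome, saliency)]

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the mean and spread of its sampling group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical completions sampled for one prompt.
outcome = [1.0, 0.0, 1.0, 0.0]    # 1.0 if the final answer was correct
saliency = [0.8, 0.6, 0.2, 0.1]   # how strongly the CoT influenced the answer
rewards = combined_rewards(outcome, saliency)
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a correct completion whose reasoning tokens carry high saliency receives the largest advantage within its group.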
arxiv:2604.16083 · cs.CV
DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics
Jieming Yu, Qiuxiao Feng, Zhuohan Wang, Xiaochen Ma
With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1 M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.
manipulation
arxiv:2604.16067 · cs.LG
AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning
Guransh Singh
Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry, the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training, causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
vision-language-action
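A single Gram-Schmidt projection step of the kind this abstract mentions can be sketched as below. This is an illustrative sketch only: it assumes each layer's task and anchor gradients are flattened into plain vectors and removes the anchor-aligned component unconditionally. The abstract does not say whether the paper applies the projection conditionally, and `project_out` is a hypothetical name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(task_grad, anchor_grad, eps=1e-12):
    """Remove the component of task_grad that lies along anchor_grad.

    Returns the projected gradient and the fraction of squared gradient
    norm ("energy") that the projection discarded.
    """
    denom = dot(anchor_grad, anchor_grad)
    if denom < eps:
        # No usable anchor direction: leave the task gradient untouched.
        return list(task_grad), 0.0
    coeff = dot(task_grad, anchor_grad) / denom
    projected = [t - coeff * a for t, a in zip(task_grad, anchor_grad)]
    task_energy = dot(task_grad, task_grad)
    if task_energy < eps:
        return projected, 0.0
    shed = 1.0 - dot(projected, projected) / task_energy
    return projected, shed

# Toy two-dimensional "layer" (hypothetical values).
projected, shed = project_out([3.0, 4.0], [1.0, 0.0])
```

In this toy case the projected gradient is orthogonal to the anchor direction and 36% of the energy is shed, far more than the sub-1% average the abstract reports for real layers.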
arxiv:2604.16059 · physics.optics
Controlling external injection in laser-plasma accelerators with terahertz frequency bunch manipulation
Aras Amini, Lewis R. Reid, James K. Jones, Morgan T. Hibberd +5
Laser-plasma wakefield acceleration (LWFA) offers ultrahigh accelerating gradients in compact setups, but the complex non-linear nature of the process makes it challenging to generate high-quality beams. Injection of electron bunches from an external source into a plasma accelerator provides a promising route to improved performance; however, electron bunches from conventional radio-frequency (RF)-based injectors suffer from non-linear compression and laser-beam asynchrony, leading to energy jitter and emittance growth. We present a fundamental concept of terahertz-controlled electron bunches for external injection into LWFA. This terahertz-frequency approach provides temporal locking between the electron beam and the drive laser, and enables the compression of high-quality beams to sub-10-fs durations before injection into the LWFA. Numerical simulations demonstrate that GeV-scale acceleration with excellent beam quality and stability -- energy jitter and energy spread around 0.2% -- can be achieved using this method. This concept opens new opportunities for stable, multi-stage laser-driven accelerators and supports the development of next-generation applications such as free-electron lasers (FELs).
manipulation
arxiv:2604.16054 · cs.CV
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian +1
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
manipulation
arxiv:2604.16022 · cs.LG
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski +1
As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
embodied
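For readers unfamiliar with Elo-based leaderboards, the standard Elo update can be sketched as follows. This shows the textbook formula only; the K-factor of 32 and the starting ratings are assumptions, and the abstract does not describe how SocialGrid configures its league.

```python
def expected_score(rating_a, rating_b):
    # Probability-like expected score of player A against player B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two equally rated agents; agent A wins the match.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
```

With equal starting ratings the winner gains exactly what the loser gives up, so the total rating in the pool is conserved.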
arxiv:2604.15996 · eess.SY
Stealthy Cyber-Attacks on Vehicle Lateral Dynamics: A System-Theoretic Analysis
Ali Eslami, Jiangbo Yu, Mohammad Pirani
This paper studies the vehicle bicycle model under three classes of stealthy cyber-attacks: replay attacks, zero dynamics attacks, and covert attacks. Using a system-theoretic framework, we analyze the feasibility and impact of these attacks on vehicle lateral dynamics. The investigation considers different measurement configurations, including yaw rate, lateral acceleration, and longitudinal acceleration outputs, to evaluate how sensor selection influences attack detectability and system vulnerability. Each attack class is characterized in terms of required system knowledge, communication access, and impact. The analysis shows that replay attacks remain largely model-agnostic, while zero dynamics attacks are fundamentally constrained by control-oriented design choices, particularly output selection, which can eliminate unstable zero dynamics and limit the attack impact. In contrast, covert attacks, enabled by coordinated actuator and sensor manipulation, allow sustained and stealthy deviation of lateral states when sufficient access and system knowledge are available. The effects of actuator and tire saturation are also examined, revealing attack-dependent impacts on stealthiness and effectiveness. Finally, simulation case studies are conducted by using CarSim-Simulink co-simulation to validate and verify the theoretical results.
manipulation
arxiv:2604.15948 · cs.CV
From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
Jinhao Shen, Haoqian Du, Xulu Zhang, Xiao-Yong Wei +1
Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. This adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present an Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.
manipulation
arxiv:2604.15938 · cs.RO
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu +1
Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise steps with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.
manipulation · diffusion policy
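The loss-guided weighted sampling described in this abstract can be illustrated with a small sketch. It assumes per-sample difficulty scores are already available (standing in for the output of a loss predictor such as ALN) and draws training indices with probability proportional to those scores. The function names and the `temperature` parameter are hypothetical.

```python
import random

def sampling_probabilities(predicted_losses, temperature=1.0):
    # Higher predicted loss gives a higher chance of being sampled.
    weights = [max(loss, 0.0) ** (1.0 / temperature) for loss in predicted_losses]
    total = sum(weights)
    if total == 0.0:
        # Fall back to uniform sampling when every score is zero.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def sample_batch(indices, predicted_losses, batch_size, rng):
    probs = sampling_probabilities(predicted_losses)
    # Sampling with replacement, weighted by predicted difficulty.
    return rng.choices(indices, weights=probs, k=batch_size)

# Three hypothetical samples; the third is the "hard" one.
losses = [0.1, 0.1, 0.8]
probs = sampling_probabilities(losses)
batch = sample_batch([0, 1, 2], losses, batch_size=8, rng=random.Random(0))
```

Raising `temperature` above 1 flattens the distribution toward uniform sampling, which is one simple way to keep easy samples from being starved entirely.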
arxiv:2604.15907 · cs.RO
A Reconfigurable Pneumatic Joint Enabling Localized Selective Stiffening and Shape Locking in Vine-Inspired Robots
Ayodele James Oyejide, Ustaz A. Yaqub, Samir Erturk, Eray A. Baran +1
Vine-inspired robots achieve large workspace coverage through tip eversion, enabling safe navigation in confined and cluttered environments. However, their deployment in free space is fundamentally limited by low axial stiffness, poor load-bearing capacity, and the inability to retain shape during and after steering. In this work, we propose a reconfigurable pneumatic joint (RPJ) architecture that introduces discrete, pressure-tunable stiffness along the robot body without compromising continuous growth. Each RPJ module comprises symmetrically distributed pneumatic chambers that locally increase bending stiffness when pressurized, enabling decoupling between global compliance and localized rigidity. We integrate the RPJs into a soft growing robot with tendon-driven steering and develop a compact base station for mid-air eversion. System characterization and experimental validation demonstrate moderate pressure requirements for eversion, as well as comparable localized stiffening and steering performance to layer-jamming mechanisms. Demonstrations further show that the proposed robot achieves improved shape retention during bending, reduced gravitational deflection under load, cascading retraction, and reliable payload transport up to 202 g in free space. The RPJ mechanism establishes a practical pathway toward structurally adaptive vine robots for manipulation-oriented tasks such as object sorting and adaptive exploration in unconstrained environments.
manipulation
arxiv:2604.15857 · cs.CV
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon +3
Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
manipulation
arxiv:2604.15823 · cs.CV
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi +2
Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
embodied
arxiv:2604.15814 · cs.RO
Continual Hand-Eye Calibration for Open-world Robotic Manipulation
Fazeng Li, Gan Sun, Chenxi Liu, Yao He +2
Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data amid open-world scene changes, while simple rehearsal-based continual learning strategies cannot adequately mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework, enabling robots to adapt to sequentially encountered open-world manipulation scenes through a spatial-aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show strong resistance to scene forgetting, maintaining accuracy on past scenes while preserving adaptation to new scenes, confirming the effectiveness of the framework.
manipulation
arxiv:2604.15805 · cs.RO
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
Jasper Lu, Zhenhao Shen, Yuanfei Wang, Shugao Liu +7
Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, transferring real-world scenes into simulation has become a practical form of augmentation for efficient learning and evaluation. We present a generative framework that establishes a generative real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.
manipulation
arxiv:2604.15795 · cs.CV
Fed3D: Federated 3D Object Detection
Suyan Dai, Chenxi Liu, Fazeng Li, Peican Lin
3D object detection models trained on a single server play an important role in autonomous driving, robotic manipulation, and augmented reality scenarios. However, most existing methods face severe privacy concerns when deployed on a multi-robot perception network to explore large-scale 3D scenes. Meanwhile, it is highly challenging to employ conventional federated learning methods on 3D object detection scenes, due to 3D data heterogeneity and limited communication bandwidth. In this paper, we make a first attempt at a Federated 3D object detection framework (i.e., Fed3D) that enables distributed learning for 3D object detection with privacy preservation. Specifically, because irregular 3D inputs on each local robot and differing category distributions across robots can cause local and global heterogeneity, respectively, we propose a local-global class-aware loss for the 3D data heterogeneity issue, which balances the gradient back-propagation rate of different 3D categories from local and global aspects. To reduce communication cost in each round, we develop a federated 3D prompt module, which learns and communicates only prompts with few learnable parameters. Finally, extensive experiments on federated 3D object detection show that our Fed3D model significantly outperforms state-of-the-art algorithms with lower communication cost when local training data is limited.
manipulation
arxiv:2604.15671 · cs.RO
Long-Term Memory for VLA-based Agents in Open-World Task Execution
Xu Huang, Weixin Mao, Yinhao Li, Hua Chen +1
Vision-Language-Action (VLA) models have demonstrated significant potential for embodied decision-making; however, their application in complex chemical laboratory automation remains restricted by limited long-horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial-and-error in multi-stage protocols. In this paper, we propose ChemBot, a dual-layer, closed-loop framework that integrates an autonomous AI agent with a progress-aware VLA model (Skill-VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual-layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub-agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future-state-based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and task success rates compared to existing VLA baselines in complex, long-horizon chemical experimentation.
vision-language-action · vla · vla model · embodied
arxiv:2604.15569 · cs.RO
ShapeGen: Robotic Data Generation for Category-Level Manipulation
Yirui Wang, Xiuwei Xu, Angyuan Ma, Bingyao Yu +2
Manipulation policies deployed in uncontrolled real-world scenarios are faced with great in-category geometric diversity of everyday objects. In order to function robustly under such variations, policies need to work in a category-level manner, i.e. knowing how to interact with any object in a certain category, instead of only a specific one seen during training. This in-category generalizability is usually nurtured with shape-diversified training data; however, manually collecting such a corpus of data is infeasible due to the requirement of intense human labor and large collections of divergent objects at hand. In this paper, we propose ShapeGen, a data generation method that aims at generating shape-variated manipulation data in a simulator-free and 3D manner. ShapeGen decomposes the process into two stages: Shape Library curation and Function-Aware Generation. In the first stage, we train spatial warpings between shapes mapping points to points that correspond functionally, and aggregate 3D models along with the warpings into a plug-and-play Shape Library. In the second stage, we design a pipeline that, leveraging established Libraries, requires only minimal human annotation to generate physically plausible and functionally correct novel demonstrations. Experiments in the real world demonstrate the effectiveness of ShapeGen to boost policies' in-category shape generalizability. Project page: https://wangyr22.github.io/ShapeGen/.
manipulation
arxiv:2604.15550 · physics.optics
Incoherence-assisted mode excitation in non-Hermitian resonant systems
Amin Hashemi, Vinzenz Zimmermann, Armando Perez-Leija, Andrea Blanco-Redondo
We introduce and experimentally demonstrate an approach for selective mode excitation in non-Hermitian resonant systems using incoherent light. This method eliminates the need for precise phase control that is often required in coherent excitation schemes. Using this technique on a silicon photonic platform with coupled ring resonators, we successfully excite the topological edge state of a non-Hermitian Su-Schrieffer-Heeger (SSH) model. Our work shows that incoherence-assisted excitation is a robust and passive strategy for topological state preparation, which broadens the scope of non-Hermitian topological photonics, thereby providing a practical and experimentally viable tool for selective mode excitation.
silicon photonic
arxiv:2604.15495 · cs.RO
GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
Shivendra Agrawal, Bradley Hayes
Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.
embodied
arxiv:2604.15493 · physics.optics
End-to-End Physical Design Automation Flow for Yield-Optimized Inverse-Designed Large-Scale Electronic-Photonic Integrated Circuits
Hongjian Zhou, Haoyu Yang, Haoxing Ren, Joaquin Matres +1
As AI systems scale to multi-chiplet and wafer-level architectures, the demand for ultra-high bandwidth and system scalability has outpaced the capabilities of electrical interconnects and computing units. Large-scale heterogeneous electronic-photonic integrated chiplets (EPICs) provide a promising solution, but their practical adoption is limited by the lack of a unified, fabrication-aware physical design automation stack. At the same time, inverse-designed ultra-compact photonic devices offer orders-of-magnitude improvements in spatial and spectral density, yet remain constrained by insufficient design-for-manufacturing support and yield optimization. In this work, we present OptoSynthesizer, an end-to-end physical design automation flow for yield-optimized, inverse-designed EPICs. It integrates three key components across the physical design pipeline: (1) OptoSynthesizer-InvDes, a physical-AI-augmented, digital-twin-assisted photonic inverse design and photonics-aware inverse lithography framework; (2) OptoSynthesizer-Place, a GPU-accelerated routing-informed EPIC placer for large-scale routability-optimized layout; and (3) OptoSynthesizer-Route, a hierarchical curvy-aware waveguide router with global-planning-assisted electrical-optical co-routing. Together, these toolkits form a seamless flow from EPIC netlists to fabrication-ready, yield-robust GDS layouts. We demonstrate how this framework enables compact large-scale photonic tensor cores and high-bandwidth interconnect fabrics for heterogeneous EPIC platforms, providing a practical foundation for manufacturable large-scale EPICs in next-generation AI systems.
photonic integrated circuit
arxiv:2604.15281 · cs.RO
R3D: Revisiting 3D Policy Learning
Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao +7
3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
manipulation
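The R3D abstract names missing 3D data augmentation as a primary cause of overfitting but does not list the augmentations used. As a minimal sketch only, assuming a point-cloud observation and position-valued actions in the same frame (the function name, array shapes, and parameter values are hypothetical), one common augmentation applies the same random yaw rotation to observation and action and then jitters the points:

```python
import numpy as np

def augment_point_cloud(points, actions, rng, max_yaw=np.pi / 12, jitter_std=0.005):
    """Apply one random yaw rotation to both the cloud and the action positions.

    points:  (N, 3) xyz coordinates (hypothetical observation format)
    actions: (T, 3) target end-effector positions in the same frame (assumed)
    """
    yaw = rng.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Rotate observation and action together so they stay consistent.
    aug_points = points @ rot.T
    aug_actions = actions @ rot.T
    # Per-point Gaussian jitter is applied to the observation only.
    aug_points = aug_points + rng.normal(0.0, jitter_std, size=points.shape)
    return aug_points, aug_actions

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
targets = rng.normal(size=(5, 3))
aug_cloud, aug_targets = augment_point_cloud(cloud, targets, rng)
```

Rotating the actions with the cloud keeps each demonstration self-consistent; augmenting the observation alone would pair a rotated scene with an unrotated target.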
arxiv:2604.15215 · cs.RO
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani +3
We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes new state-of-the-art performance in in-context imitation learning.
manipulation
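The two successive levels of vector quantization described in the HiST-AT abstract can be illustrated with a small sketch. It assumes fixed, already-learned codebooks and nearest-neighbour assignment; the function names and toy codebooks are hypothetical, and the temporal (timestamp) branch is not modeled here.

```python
import numpy as np

def nearest(codebook, x):
    """Return, for each row of x, the index of the closest codebook row."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def hierarchical_tokenize(actions, fine_codebook, coarse_codebook):
    """Lower level: action -> fine subcluster. Higher level: subcluster -> cluster."""
    fine_ids = nearest(fine_codebook, actions)
    # Each fine subcluster centre is mapped to its closest coarse cluster.
    fine_to_coarse = nearest(coarse_codebook, fine_codebook)
    coarse_ids = fine_to_coarse[fine_ids]
    # Reconstruction uses the fine-level code vectors.
    recon = fine_codebook[fine_ids]
    return fine_ids, coarse_ids, recon

fine = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
coarse = np.array([[0.0, 0.5], [10.0, 10.5]])
acts = np.array([[0.1, 0.1], [9.9, 10.9]])
fine_ids, coarse_ids, recon = hierarchical_tokenize(acts, fine, coarse)
```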
arxiv:2604.15023 · cs.RO
DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation
Ziyu Shan, Yuheng Zhou, Gaoyuan Wu, Ziheng Ji +2
Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from the view generalization problem due to shifts of docking points. To address this issue, we propose a novel low-cost demonstration generation framework named DockAnywhere, which improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere lifts a trajectory to any feasible docking point by decoupling docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Feasible docking proposals are sampled under feasibility constraints, and corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing to ensure the consistency of observation and action across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and generalizes easily to novel viewpoints from docking points unseen during training, significantly enhancing the generalization capability of mobile manipulation policies in real-world deployment.
manipulation
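One way to picture the decoupling in the DockAnywhere abstract: if the contact-rich end-effector path is held fixed in the world frame, moving the base to a new docking pose only changes how that path is expressed in the base frame. The sketch below is an illustration under that assumption, restricted to a planar (x, y, yaw) base pose; the function name and values are hypothetical and this is not the paper's pipeline.

```python
import numpy as np

def world_to_base(points_world, base_xy, base_yaw):
    """Express planar world-frame points in the frame of a base at (base_xy, base_yaw)."""
    c, s = np.cos(base_yaw), np.sin(base_yaw)
    rot = np.array([[c, -s], [s, c]])  # rotation taking base-frame vectors to world frame
    # p_base = R^T (p_world - t); with row vectors this is (p_world - t) @ R.
    return (points_world - base_xy) @ rot

# A world-frame waypoint seen from a base docked at (1, 0) and facing world +y.
waypoints = np.array([[1.0, 2.0]])
in_base = world_to_base(waypoints, np.array([1.0, 0.0]), np.pi / 2)
```

Here the waypoint ends up 2 m straight ahead of the base along its x-axis, as expected for a base facing +y directly below the point.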
arxiv:2604.15013 · cs.RO
DEX-Mouse: A Low-cost Portable and Universal Interface with Force Feedback for Data Collection of Dexterous Robotic Hands
Joonho Koh, Haechan Jung, Nayoung Kim, Wook Ko +1
Data-driven dexterous hand manipulation requires large-scale, physically consistent demonstration data. Simulation and video-based methods suffer from sim-to-real gaps and retargeting problems, while MoCap glove-based teleoperation systems require per-operator calibration and lack portability, as the robot hand is typically fixed to a stationary arm. Portable alternatives improve mobility but lack cross-platform and cross-operator compatibility. We present DEX-Mouse, a portable, calibration-free hand-held teleoperation interface with integrated kinesthetic force feedback, built from commercial off-the-shelf components under USD 150. The operator-agnostic design requires no calibration or structural modification, enabling immediate deployment across diverse environments and platforms. The interface supports a configuration in which the target robot hand is mounted directly on the forearm of an operator, producing robot-aligned data. In a comparative user study across various dexterous manipulation tasks, operators using the proposed system achieved an 86.67% task completion rate under the attached configuration. Also, we found that the attached configuration reduced the perceived workload of the operators compared to spatially separated teleoperation setups across all compared interfaces. The complete hardware and software stack, including bill of materials, CAD models, and firmware, is open-sourced at https://dex-mouse.github.io/ to facilitate replication and adoption.
manipulation · dexterous · teleoperation
arxiv:2604.14965 · cs.RO
POMDP-based Object Search with Growing State Space and Hybrid Action Domain
Yongbo Chen, Hesheng Wang, Shoudong Huang, Hanna Kurniawati
Efficiently locating target objects in complex indoor environments with diverse furniture, such as shelves, tables, and beds, is a significant challenge for mobile robots. This difficulty arises from factors like localization errors, limited fields of view, and visual occlusion. We address this by framing the object-search task as a high-dimensional Partially Observable Markov Decision Process (POMDP) with a growing state space and hybrid (continuous and discrete) action spaces in 3D environments. Based on a meticulously designed perception module, a novel online POMDP solver named the growing neural process filtered k-center clustering tree (GNPF-kCT) is proposed to tackle this problem. Optimal actions are selected using Monte Carlo Tree Search (MCTS) with belief tree reuse for growing state space, a neural process network to filter useless primitive actions, and k-center clustering hypersphere discretization for efficient refinement of high-dimensional action spaces. A modified upper-confidence bound (UCB), informed by belief differences and action value functions within cells of estimated diameters, guides MCTS expansion. Theoretical analysis validates the convergence and performance potential of our method. To address scenarios with limited information or rewards, we also introduce a guessed target object with a grid-world model as a key strategy to enhance search efficiency. Extensive Gazebo simulations with Fetch and Stretch robots demonstrate faster and more reliable target localization than POMDP-based baselines and state-of-the-art (SOTA) non-POMDP-based solvers, especially large language model (LLM) based methods, in object search under the same computational constraints and perception systems. Real-world tests in office environments confirm the practical applicability of our approach. Project page: https://sites.google.com/view/gnpfkct.
world model
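The GNPF-kCT abstract mentions k-center clustering to discretize a continuous action space. The exact procedure is not given there; as an assumption, the sketch below uses the standard greedy farthest-point heuristic for k-center on sampled candidate actions and reports the covering radius of the resulting cells. Function and variable names are hypothetical.

```python
import numpy as np

def greedy_k_center(points, k, first=0):
    """Greedy farthest-point selection of k centers; returns (center indices, radius)."""
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(points - points[first], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())  # the point currently worst covered
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(dist.max())

candidates = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centers, radius = greedy_k_center(candidates, k=2)
```

The returned radius bounds how far any candidate action lies from its representative, which is the kind of cell diameter a tree search could use when deciding where to refine.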
arxiv:2604.14944 · cs.RO
HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps
Jongbin Lim, Taeyun Ha, Mingi Choi, Jisoo Kim +3
We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.
manipulation · dexterous · tactile
arxiv:2604.14902 · cs.RO
ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints
Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su +3
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
embodied
arxiv:2604.14836 · physics.optics
Low voltage and high-bandwidth thin-film lithium tantalate modulator on a silicon dioxide substrate
Zihan Li, Alexander Kotz, Adrian Schwarzenberger, Christian Koos +1
Modern communication networks demand ever-increasing transmission bandwidth, placing stringent requirements on low-cost, high-performance electro-optic modulators. Substantial advances have been made in integrated photonics employing lithium niobate on insulator. In contrast, photonic integrated circuits based on lithium tantalate -- a material already commercially adopted for wireless filters -- have been developed, offering reduced DC drift, higher optical power handling, and lower birefringence. These advantages enable more complex and dense photonic integrated circuits, and make lithium tantalate a promising material platform for next-generation integrated electro-optic modulators. However, in contrast to the extensively studied thin-film lithium niobate platform, thin-film lithium tantalate modulators have only been explored on silicon substrates. Here, we report the first fabrication and characterization of thin-film lithium tantalate electro-optic modulators manufactured on a 4-inch (100 mm) fused-silica substrate for adapting a low-loss slow-wave microwave electrode to improve the electro-optic bandwidth. By employing a slow-wave electrode design to achieve velocity matching between microwave and optical signals, the demonstrated modulator achieves a 3-dB electro-optic bandwidth of 64 GHz with a low half-wave voltage of 1.53 V, with potential to operate at the measured 100 GHz electrical bandwidth, if the employed spectral biasing is removed. The modulator moreover exhibits low bias drift, with a constant switching voltage down to 10 mHz. This performance enables high-speed data transmission comparable to state-of-the-art lithium niobate modulators fabricated on quartz substrates. Using the fabricated devices, a net single lane data rate of 440.6 Gbps is achieved using PAM8 signaling.
photonic integrated circuit
arxiv:2604.14834 · cs.RO
Switch: Learning Agile Skills Switching for Humanoid Robots
Yuen-Fui Lau, Qihan Zhao, Yinhuai Wang, Runyi Yu +3
Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch enables humanoids to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
humanoid
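The online graph search in the Switch abstract can be pictured with a toy example. The sketch assumes a directed graph whose nodes are whole skills and whose edges all cost the same, so breadth-first search returns a shortest transition path; the skill names and graph are invented for illustration, and the paper's graph may be built at a finer granularity with different costs.

```python
from collections import deque

def find_transition_path(skill_graph, start, goal):
    """Breadth-first search for a shortest path of allowed skill transitions."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in skill_graph.get(node, ()):
            if nxt in parents:
                continue  # already reached by a path at least as short
            parents[nxt] = node
            if nxt == goal:
                # Walk parent links back to the start, then reverse.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None  # no feasible transition sequence

# Hypothetical skill graph: an edge a -> b means a transition from a to b is allowed.
skills = {
    "stand": ["walk"],
    "walk": ["run", "stand"],
    "run": ["jump", "walk"],
    "jump": ["stand"],
    "crawl": [],
}
path = find_transition_path(skills, "stand", "jump")
```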
arxiv:2604.14733 · cs.RO
Differentiable Object Pose Connectivity Metrics for Regrasp Sequence Optimization
Liang Qin, Weiwei Wan, Kensuke Harada
Regrasp planning is often required when one pick-and-place cannot transfer an object from an initial pose to a goal pose while maintaining grasp feasibility. The main challenge is to reason about shared-grasp connectivity across intermediate poses, where discrete search becomes brittle. We propose an implicit multi-step regrasp planning framework based on differentiable pose sequence connectivity metrics. We model grasp feasibility under an object pose using an Energy-Based Model (EBM) and leverage energy additivity to construct a continuous energy landscape that measures pose-pair connectivity, enabling gradient-based optimization of intermediate object poses. An adaptive iterative deepening strategy is introduced to determine the minimum number of intermediate steps automatically. Experiments show that the proposed cost formulation provides smooth and informative gradients, improving planning robustness over other alternatives. They also demonstrate generalization to unseen grasp poses and cross-end-effector transfer, where a model trained with suction constraints can guide parallel gripper grasp manipulation. The multi-step planning results further highlight the effectiveness of adaptive deepening and minimum-step search.
manipulation
arxiv:2604.14732 · cs.RO
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng +4
Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
vision-language-action · vla · embodied · world model
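The exponential decay mentioned in the WAV abstract can be illustrated with a toy calculation. This is not the paper's derivation: it assumes each step of a sampled trajectory is feasible independently with the same probability, in which case the chance that a whole trajectory is feasible falls geometrically with the horizon. The function names and the 0.9 value are hypothetical.

```python
def feasible_probability(per_step_prob, horizon):
    """Probability that all `horizon` independently sampled steps are feasible."""
    return per_step_prob ** horizon

def expected_samples_until_feasible(per_step_prob, horizon):
    """Mean number of i.i.d. trajectory samples drawn until the first feasible one."""
    return 1.0 / feasible_probability(per_step_prob, horizon)

# Toy assumption: 90% of single-step actions are feasible.
for horizon in (1, 10, 50):
    print(horizon,
          feasible_probability(0.9, horizon),
          expected_samples_until_feasible(0.9, horizon))
```

Under this assumption the number of samples needed grows exponentially in the horizon, which is the intuition behind steering the search distribution toward feasible regions instead of sampling blindly in action space.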
arxiv:2604.14679 · physics.optics
Observation of Restored Adiabatic State Transfer in Time-Modulated Non-Hermitian Systems
Xiaowei Wang, Ievgen I. Arkhipov, Quan Lin, Huixia Gao +4
Exceptional points (EPs) have attracted extensive research interest due to their intriguing properties. One of the hallmarks of EP physics is that dynamically encircling the EPs induces chiral mode switching, arising from the breakdown of adiabaticity due to the presence of a complex spectrum in the system's Hamiltonian. While such chiral mode behavior has been widely observed experimentally, achieving truly adiabatic, and thus symmetric, state transfer, regardless of the winding direction, in time-modulated non-Hermitian systems has remained elusive. In this work, we demonstrate that this long-sought adiabatic state dynamics can indeed be restored. By steering a two-mode photonic setup along specifically designed trajectories in parameter space, we realize conditions where the associated non-Hermitian evolution operator acquires a purely real spectrum. Moreover, our experimental platform enables controlled switching between symmetric (adiabatic) and chiral (non-adiabatic) state-transfer regimes for the same set of initial modes, thus effectively implementing a universal symmetric-asymmetric two-mode switch. Our results therefore open new avenues for harnessing unique topological spectral properties of non-Hermitian systems, paving the way for the practical design of versatile optical wave-manipulation devices and for advancing both classical and quantum information technologies.
manipulation
arxiv:2604.14664 · physics.optics
Scaling Photonic Tensor Cores with Unary and Homodyne Designs
Oluwaseun Alo, Ishan Thakkar
We analyze five photonic microring tensor core designs with a common optical power model. The results show that circuit ordering, unary encoding, and homodyne accumulation shape scalability, with the last two offering the strongest path to higher parallelism.
microring
arxiv:2604.14565 · cs.RO
Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion
Tomoya Kamimura, Haruka Washiyama, Akihito Sano
Embodiment is a significant keyword in recent machine learning fields. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.
embodied · humanoid

02 US SEMI · SEC 8-K FILINGS

2 items

scanned: NVDA / AVGO / MRVL / COHR / LITE / AMD / TSM / SMCI / ANET / CRDO / POWL / VECO

03 HUMANOID · COMPANY NEWS

58 items

scanned: figure-ai / 1x / boston-dynamics / unitree / apptronik / sanctuary-ai / neura-robotics / agility-robotics / physical-intelligence / agibot

Figure AI (10)

Boston Dynamics (10)

Unitree 宇树 (9)

Sanctuary AI (5)

Agility Robotics (10)

Physical Intelligence (7)

智元 AgiBot (7)

04 CN PHOTONICS · ANNOUNCEMENT FEED