PHYSICAL AI · 2026-04-21

Physical AI Brief

Daily cross-source signals for the Physical AI supply chain — silicon photonics, CPO, VLA models, humanoid hardware, embodied AI. Three streams, one page, zero filler.

126 items today · 66 arxiv · 2 SEC 8-K · 58 humanoid · 0 CN photonics

01 ARXIV · PHYSICAL AI PAPERS

66 items

arxiv:2604.18578 · cs.LG
Bounded Ratio Reinforcement Learning
Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd +4
Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
humanoid
arxiv:2604.18564 · cs.CV
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present \textbf{MultiWorld}, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
manipulationworld model
arxiv:2604.18557 · cs.CV
SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy
Wei Yao, Haohan Ma, Hongwen Zhang, Yunlian Sun +5
Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: http://yw0208.github.io/synagent
embodiedmanipulationhumanoid
arxiv:2604.18486 · cs.RO
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li +46
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
vlaembodiedworld model
arxiv:2604.18484 · cs.RO
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang +12
Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
vision-language-actionembodied
arxiv:2604.18468 · cs.LG
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
Tianshi Cao, Jiawei Ren, Yuxuan Zhang, Jaewoo Seo +11
Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. Rather than relying on a single model component, we developed a system-level design for real-world AV data that combines large-scale curation of object-centric training tuples, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe that couples sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within this system, SparseViewDiT is explicitly designed to address limited-angle views and other real-world data challenges. Together with hybrid data curation, augmentation, and self-distillation, this system enables scalable conversion of sparse AV object observations into reusable 3D assets.
manipulation
arxiv:2604.18463 · cs.RO
Using large language models for embodied planning introduces systematic safety risks
Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu +3
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
embodied
arxiv:2604.18426 · physics.optics
High precision micro-optical elements on fiber facets via focused-ion beam machining
Raman Kumar, Sebastian Will
Fiber-integrated micro-optical elements promise a scalable approach to photon collection and beam shaping for quantum information processing. Here, we demonstrate single-step fabrication of micro-spherical, micro-spiral, and micro-axicon structures directly on the core of single-mode optical fibers using focused ion beam (FIB) machining with nanometer-scale precision. Atomic force microscopy reveals that micro-concave and micro-convex spherical surfaces achieve shape accuracies of approximately $λ/80$ and $λ/50$ at $λ= 780$ nm, respectively. Optical characterization using a He-Ne laser at 633 nm confirms the expected far-field donut beam patterns for the micro-spiral and micro-axicon structures. Mach-Zehnder interferometry further verifies the corresponding azimuthal and radial phase profiles of the light emitted from the spiral and axicon fibers. Surface metrology shows that the optimized FIB process preserves optical-grade surface quality, introducing no measurable additional roughness at spatial scales relevant to visible and near-infrared operation. These monolithically integrated fiber micro-optical elements enable a broad range of applications in quantum technology, including fiber micro-cavities for cavity quantum electrodynamics, beam shaping for neutral atom trapping, and the generation of structured light for free-space quantum network links.
mach-zehnder
arxiv:2604.18384 · physics.optics
NEMO: Neural Electro-Mechano-Optic Sensors for Multiplexed Neural Interfaces
Andrew Cochran, Harshvardhan Gupta, Vishal Jain, Maysamreza Chamanzar +1
We introduce a novel electro-optomechanic neural sensor for realizing ultra-compact neural recording probes that can detect and relay electrophysiology signals from within neural tissue. This technology addresses outstanding challenges faced by existing neural recording technologies, including the resolution trade-off with signal-to-noise-ratio (SNR) due to the high impedances of small electrodes, and lingering stimulation artifacts. The sensor employs a highly miniaturized NEMS (nano-electromechanical systems) electrostatic transducer that modulates a silicon photonic microdisk resonator to convert electrical signals to an optical signal modulation. We have been able to achieve a limit of detection down to 110 microvolts, making the sensor sensitive enough to detect neural signals. This sensitive electro-optomechanic sensor directly detects electrophysiology signals and converts them to optomechanic modulation for effective transmission to outside the brain, which provides the unique potential for massive multiplexing of neural recordings. This design eliminates the need for bulky backend headstages that limit neural recording on awake free-roaming subjects. The ability of the device to record electrophysiological signals has been demonstrated using benchtop characterization and ex-vivo recordings from live neural tissue.
silicon photonic
arxiv:2604.18343 · cs.RO
DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications
Ruijia Liu, Ancheng Hou, Xiao Yu, Xiang Yin
Signal Temporal Logic (STL) is a powerful language for specifying temporally structured robotic tasks. Planning executable trajectories under STL constraints remains difficult when system dynamics and environment structure are not analytically available. Existing methods typically either assume explicit models or learn task-specific behaviors, limiting zero-shot generalization to unseen STL tasks. In this work, we study offline STL planning under unknown dynamics using only task-agnostic trajectory data. Our central design philosophy is to separate logical reasoning from trajectory realization. We instantiate this idea in DAG-STL, a hierarchical framework that converts long-horizon STL planning into three stages. It first decomposes an STL formula into reachability and invariance progress conditions linked by shared timing constraints. It then allocates timed waypoints using learned reachability-time estimates. Finally, it synthesizes trajectories between these waypoints with a diffusion-based generator. This decomposition--allocation--generation pipeline reduces global planning to shorter, better-supported subproblems. To bridge the gap between planning-level correctness and execution-level feasibility, we further introduce a rollout-free dynamic consistency metric, an anytime refinement search procedure for improving multiple allocation hypotheses under finite budgets, and a hierarchical online replanning mechanism for execution-time recovery. Experiments in Maze2D, OGBench AntMaze, and the Cube domain show that DAG-STL substantially outperforms direct robustness-guided diffusion on complex long-horizon STL tasks and generalizes across navigation and manipulation settings. In a custom environment with an optimization-based reference, DAG-STL recovers most model-solvable tasks while retaining a clear computational advantage over direct optimization based on the explicit system model.
manipulation
arxiv:2604.18331 · cs.RO
Will People Enjoy a Robot Trainer? A Case Study with Snoopie the Pacerbot
Maximilian Du, Jennifer Grannen, Shuran Song, Dorsa Sadigh
The physicality of exercise makes the role of athletic trainers unique. Their physical presence allows them to guide a student through a motion, demonstrate an exercise, and give intuitive feedback. Robot quadrupeds are also embodied agents with robust agility and athleticism. In our work, we investigate whether a robot quadruped can serve as an effective and enjoyable personal trainer device. We focus on a case study of interval training for runners: a repetitive, long-horizon task where precision and consistency are important. To meet this challenge, we propose SNOOPIE, an autonomous robot quadruped pacer capable of running interval training exercises tailored to challenge a user's personal abilities. We conduct a set of user experiments that compare the robot trainer to a wearable trainer device--the Apple Watch--to investigate the benefits of a physical embodiment in exercise-based interactions. We demonstrate 60.6% better adherence to a pace schedule and were 45.9% more consistent across their running speeds with the quadruped trainer. Subjective results also showed that participants strongly preferred training with the robot over wearable devices across many qualitative axes, including its ease of use (+56.7%), enjoyability of the interaction (+60.6%), and helpfulness (+39.1%). Additional videos and visualizations can be found on our website: https://sites.google.com/view/snoopie
embodied
arxiv:2604.18289 · cs.RO
Relative State Estimation using Event-Based Propeller Sensing
Ravi Kumar Thakur, Luis Granados Segura, Jan Klivan, Radim Špetlík +3
Autonomous swarms of multi-Unmanned Aerial Vehicle (UAV) system requires an accurate and fast relative state estimation. Although monocular frame-based camera methods perform well in ideal conditions, they are slow, suffer scale ambiguity, and often struggle in visually challenging conditions. The advent of event cameras addresses these challenging tasks by providing low latency, high dynamic range, and microsecond-level temporal resolution. This paper proposes a framework for relative state estimation for quadrotors using event-based propeller sensing. The propellers in the event stream are tracked by detection to extract the region-of-interests. The event streams in these regions are processed in temporal chunks to estimate per-propeller frequencies. These frequency measurements drive a kinematic state estimation module as a thrust input, while camera-derived position measurements provide the update step. Additionally, we use geometric primitives derived from event streams to estimate the orientation of the quadrotor by fitting an ellipse over a propeller and backprojecting it to recover body-frame tilt-axis. The existing event-based approaches for quadrotor state estimation use the propeller frequency in simulated flight sequences. Our approach estimates the propeller frequency under 3% error on a test dataset of five real-world outdoor flight sequences, providing a method for decentralized relative localization for multi-robot systems using event camera.
event camera
arxiv:2604.18277 · cs.LG
Dissipative Latent Residual Physics-Informed Neural Networks for Modeling and Identification of Electromechanical Systems
Youyuan Long, Gokhan Solak, Arash Ajoudani
Accurate dynamical modeling is essential for simulation and control of embodied systems, yet first-principles models of electromechanical systems often fail to capture complex dissipative effects such as joint friction, stray losses, and structural damping. While residual-learning physics-informed neural networks (PINNs) can effectively augment imperfect first-principles models with data-driven components, the residual terms are typically implemented as unconstrained multilayer perceptrons (MLPs), which may inadvertently inject artificial energy into the system. To more faithfully model the dissipative dynamics, we propose DiLaR-PINN, a dissipative latent residual PINN designed to learn unmodeled dissipative effects in a physically consistent manner. Structurally, the residual network operates only on unmeasurable (latent) state components and is parameterized in a skew-dissipative form that guarantees non-increasing energy for any choice of network parameters. To enable stable and data-efficient training under partial measurability of the state, we further develop a recurrent rollout scheme with a curriculum-based sequence length extension strategy. We validate DiLaR-PINN on a real-world helicopter system and compare it against four baselines: a pure physical model (without a residual network), an unstructured residual MLP, a DiLaR variant with a soft dissipativity constraint, and a black-box LSTM. The results demonstrate that DiLaR-PINN more accurately captures dissipative effects and achieves superior long-horizon extrapolation performance.
embodied
arxiv:2604.18271 · cs.RO
EmbodiedLGR: Integrating Lightweight Graph Representation and Retrieval for Semantic-Spatial Memory in Robotic Agents
Paolo Riva, Leonardo Gargani, Matteo Frosi, Matteo Matteucci
As the world of agentic artificial intelligence applied to robotics evolves, the need for agents capable of building and retrieving memories and observations efficiently is increasing. Robots operating in complex environments must build memory structures to enable useful human-robot interactions by leveraging the mnemonic representation of the current operating context. People interacting with robots may expect the embodied agent to provide information about locations, events, or objects, which requires the agent to provide precise answers within human-like inference times to be perceived as responsive. We propose the Embodied Light Graph Retrieval Agent (EmbodiedLGR-Agent), a visual-language model (VLM)-driven agent architecture that constructs dense and efficient representations of robot operating environments. EmbodiedLGR-Agent directly addresses the need for an efficient memory representation of the environment by providing a hybrid building-retrieval approach built on parameter-efficient VLMs that store low-level information about objects and their positions in a semantic graph, while retaining high-level descriptions of the observed scenes with a traditional retrieval-augmented architecture. EmbodiedLGR-Agent is evaluated on the popular NaVQA dataset, achieving state-of-the-art performance in inference and querying times for embodied agents, while retaining competitive accuracy on the global task relative to the current state-of-the-art approaches. Moreover, EmbodiedLGR-Agent was successfully deployed on a physical robot, showing practical utility in real-world contexts through human-robot interaction, while running the visual-language model and the building-retrieval pipeline locally.
embodied
arxiv:2604.18236 · cs.RO
COFFAIL: A Dataset of Successful and Anomalous Robot Skill Executions in the Context of Coffee Preparation
Alex Mitrevski, Ayush Salunke
In the context of robot learning for manipulation, curated datasets are an important resource for advancing the state of the art; however, available datasets typically only include successful executions or are focused on one particular type of skill. In this short paper, we briefly describe a dataset of various skills performed in the context of coffee preparation. The dataset, which we call COFFAIL, includes both successful and anomalous skill execution episodes collected with a physical robot in a kitchen environment, a couple of which are performed with bimanual manipulation. In addition to describing the data collection setup and the collected data, the paper illustrates the use of the data in COFFAIL to learn a robot policy using imitation learning.
manipulation
arxiv:2604.18223 · cs.CV
Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu +2
Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent's perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction--perception entanglement.
embodied
arxiv:2604.18205 · cs.RO
A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting
Mikolaj Zielinski, Eryk Vykysaly, Bartlomiej Biesiada, Jan Baturo +2
Recent advances in neural rendering have introduced numerous 3D scene representations. Although standard computer vision metrics evaluate the visual quality of generated images, they often overlook the fidelity of surface geometry. This limitation is particularly critical in robotics, where accurate geometry is essential for tasks such as grasping and object manipulation. In this paper, we present an evaluation pipeline for neural rendering methods that focuses on geometric accuracy, along with a benchmark comprising 19 diverse scenes. Our approach enables a systematic assessment of reconstruction methods in terms of surface and shape fidelity, complementing traditional visual metrics.
manipulation
arxiv:2604.18160 · physics.optics
Programmable recirculating bricks mesh architecture for photonic neural networks
Jacek Gosciniak
General-purpose programmable photonic processors are considered a crucial technology because they combine the ultra high-speed, massive bandwidth, and energy efficiency of light-based computing with the flexibility of software-defined hardware. Unlike application-specific photonic integrated circuits (ASPIC) designed for one task, these processors use reconfigurable waveguide meshes to implement various functions, such as switching, filtering, or AI computation, on a single chip, allowing for rapid prototyping and versatile, on-demand hardware redefinition. Here we report a recirculating bricks mesh architecture that can be easily implemented in photonic neural networks. It will be shown that a single programmable optical system is capable of performing various functions depending on the requirements. In particular, we will show that the same network, after being reprogrammed, can perform many different functions, ranging from a crossbar network to optical interference circuits with variable structures, which can then be subjected to Singular Value Decomposition. Furthermore, the "bricks" mesh serves as an excellent foundation for implementing a monitoring system capable of monitoring the power in each location of the circuit and, subsequently, sel-fcalibrating and stabilizing the circuit using a feedback loop.
photonic integrated circuit
arxiv:2604.18107 · cs.CV
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
Zehua Zang, Xi Wang, Fuchun Sun, Xiao Xu +3
Vision-Language-Action models (VLAs) achieve remarkable performance in sequential decision-making but remain fragile to subtle environmental shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to the spurious correlation between actions and entities, then reproduce memorized action patterns. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates the spurious correlation through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting overconfidence issue. Experiments on LIBERO (+7.4\% success rate) and Atari (+10.3 human normalized score) demonstrate consistent gains of PDF in task success over vanilla VLA and VLA with test-time adaptation, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents. The code is available at \href{https://github.com/zhoujiahuan1991/CVPR2026-PDF}{https://github.com/zhoujiahuan1991/CVPR2026-PDF}.
vision-language-actionvla
arxiv:2604.18090 · cs.RO
Muscle-inspired magnetic actuators that push, pull, crawl, and grasp
Muhammad Bilal Khan, Florian Hofmann, Kilian Schäfer, Matthias Lutzi +1
Functional magnetic composites capable of large deformation, load bearing, and multifunctional motion are essential for next-generation adaptive soft robots. Here, we present muscle-inspired magnetic actuators (MMA), additively manufactured from a thermoplastic/permanent magnet polyurethane/Nd2Fe14B (TPU/MQP-S) composite using laser powder bed fusion (LPBF). By tuning the laser-energy scale between 1.0 and 3.0, both mechanical stiffness and magnetic response are precisely controlled: the tensile strength increases from 0.28 to 0.99 MPa while maintaining 30-45% elongation at break. This process enables the creation of 0.5 mm-thick flexural hinges, which reversibly bend and fold under moderate magnetic fields without damage. Two actuator types are reported showing the system versatility. The elongated actuator with self-weight of 1.57 g, magnetized in its contracted state, achieves linear contraction under a 500 mT field, lifting 50 g (32x its own weight) and sustaining performance over at least 50 cycles. Equipped with anisotropic frictional feet, it supports movement of a magnetic crawling robot that achieves up to 100% locomotion success on textured substrates. The expandable actuator exhibits reversible opening and closing under a 300 mT field, reliably grasping and releasing different objects, including soft berries and rigid 3D printed geometries. It can also anchor in a tube while holding suspended 50 g loads. This work demonstrates a LPBF-based strategy to program both stiffness and magnetization within a single material system, enabling remotely driven, reconfigurable, and fatigue-resistant soft actuators. The approach opens new possibilities for force controlled, multifunctional magnetic soft robots for adaptive gripping, locomotion, and minimally invasive manipulation of biomedical tools.
manipulation
arxiv:2604.18058 · cs.LG
Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
Blaise Delaney, Salil Patel, Yuji Xing, Dominic Dootson +1
We introduce Sonata, a compact latent world model for six-axis trunk IMU representation learning under clinical data scarcity. Clinical cohorts typically comprise tens to hundreds of patients, making web-scale masked-reconstruction objectives poorly matched to the problem. Sonata is a 3.77 M-parameter hybrid model, pre-trained on a harmonised corpus of nine public datasets (739 subjects, 190k windows) with a latent world-model objective that predicts future state rather than reconstructing raw sensor traces. In a controlled comparison against a matched autoregressive forecasting baseline (MAE) on the same backbone, Sonata yields consistently stronger frozen-probe clinical discrimination, prospective fall-risk prediction, and cross-cohort transfer across a 14-arm evaluation suite, while producing higher-rank, more structured latent representations. At 3.77 M parameters the model is compatible with on-device wearable inference, offering a step toward general kinematic world models for neurological assessment.
world model
arxiv:2604.18000 · cs.RO
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
Haiweng Xu, Sipeng Zheng, Hao Luo, Wanpeng Zhang +2
Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks - such as capacity compression and myopic downsampling - which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.
vision-language-actionvlaembodied
arxiv:2604.17969 · cs.CV
E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
Koya Sakamoto, Taiki Miyanishi, Daichi Azuma, Shuhei Kurita +5
Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce {E3VS-Bench}, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.
embodied
arxiv:2604.17915 · cs.CV
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
Yiwei Zhang, Xuesong Chen, Jin Gao, Hanshi Wang +4
Vision-Language Models(VLMs) excel at autoregressive text generation, yet end-to-end autonomous driving requires multi-task learning with structured outputs and heterogeneous decoding behaviors, such as autoregressive language generation, parallel object detection and trajectory regression. To accommodate these differences, existing systems typically introduce separate or cascaded decoders, resulting in architectural fragmentation and limited backbone reuse. In this work, we present a unified autonomous driving framework built upon a pretrained VLM, where heterogeneous decoding behaviors are reconciled within a single transformer decoder. We demonstrate that pretrained VLM attention exhibits strong transferability beyond pure language modeling. By organizing visual and structured query tokens within a single causal decoder, structured queries can naturally condition on visual context through the original attention mechanism. Textual and structured outputs share a common attention backbone, enabling stable joint optimization across heterogeneous tasks. Trajectory planning is realized within the same causal LLM decoder by introducing structured trajectory queries. This unified formulation enables planning to share the pretrained attention backbone with images and perception tokens. Extensive experiments on end-to-end autonomous driving benchmarks demonstrate state-of-the-art performance, including 0.28 L2 and 0.18 collision rate on nuScenes open-loop evaluation and competitive results (86.8 PDMS) on NAVSIM closed-loop evaluation. The full model preserves multi-modal generation capability, while an efficient inference mode achieves approximately 40% lower latency. Code and models are available at https://github.com/Z1zyw/OneDrive
vision-language-action
arxiv:2604.17896 · cs.RO
Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study
Yubai Wei, Chen Wu, Hashem Haghbayan
Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do not explicitly supervise hard physical constraints such as obstacle avoidance or kinematic feasibility. As a result, the geometric structure underlying physically feasible behavior must be inferred only implicitly from demonstrations. In this paper, we study whether introducing explicit feasibility supervision can provide effective structured guidance for VLA policies. We formulate a simple geometry-grounded feasibility objective and integrate it into the training stage of a diffusion-based VLA policy. To evaluate this idea systematically, we use obstacle-aware manipulation as a controlled probe of geometry-dependent physical feasibility. Empirical results show that augmenting VLA training with feasibility supervision improves both physical reliability and overall task performance, while also enhancing learning efficiency in the low-data regime. These findings indicate that explicit feasibility signals can effectively complement imitation-based VLA learning, highlighting their potential for developing more reliable VLA policies.
vision-language-actionvlavla policymanipulation
arxiv:2604.17888 · cs.RO
SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces
Wensheng Wang, Chuanjun Guo, Wei Wei, Tong Wu +1
Generalizable grasping with high-degree-of-freedom (DoF) dexterous hands remains challenging in tiered workspaces, where occlusion, narrow clearances, and height-dependent constraints are substantially stronger than in open tabletop scenes. Most existing methods are evaluated in relatively unoccluded settings and typically do not explicitly model the distinct control requirements of arm navigation and hand articulation under spatial constraints. We present SpaceDex, a hierarchical framework for dexterous manipulation in constrained 3D environments. At the high level, a Vision-Language Model (VLM) planner parses user intent, reasons about occlusion and height relations across multiple camera views, and generates target bounding boxes for zero-shot segmentation and mask tracking. This stage provides structured spatial guidance for downstream control instead of relying on single-view target selection. At the low level, we introduce an arm-hand Feature Separation Network that decouples global trajectory control for the arm from geometry-aware grasp mode selection for the hand, reducing feature interference between reaching and grasping objectives. The controller further integrates multi-view perception, fingertip tactile sensing, and a small set of recovery demonstrations to improve robustness to partial observability and off-nominal contacts. In 100 real-world trials involving over 30 unseen objects across four categories, SpaceDex achieves a 63.0\% success rate, compared with 39.0\% for a strong tabletop baseline. These results indicate that combining hierarchical spatial planning with arm-hand representation decoupling improves dexterous grasping performance in spatially constrained environments.
manipulationdexteroustactile
arxiv:2604.17887 · cs.RO
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
Kerui Li, Zhe Jing, Xiaofeng Wang, Zheng Zhu +5
Inverse Dynamics Models (IDMs) map visual observations to low-level action commands, serving as central components for data labeling and policy execution in embodied AI. However, their performance degrades severely under manipulator truncation, a common failure mode that makes state recovery ill-posed and leads to unstable control. We present StableIDM, a spatio-temporal framework that refines features from visual inputs to stabilize action predictions under such partial observability. StableIDM integrates three complementary components: (1) auxiliary robot-centric masking to suppress background clutter, (2) Directional Feature Aggregation (DFA) for geometry-aware spatial reasoning, which extracts anisotropic features along directions inferred from the visible arm and (3) Temporal Dynamics Refinement (TDR) to smooth and correct predictions via motion continuity. Extensive evaluations validate our approach: StableIDM improves strict action accuracy by 12.1% under severe truncation on the AgiBot benchmark, and increases average task success by 9.7% in real-robot replay. Moreover, it boosts end-to-end grasp success by 11.5% when decoding video-generated plans, and improves downstream VLA real-robot success by 17.6% when functioning as an automatic annotator. These results demonstrate that StableIDM provides a robust and scalable backbone for both policy execution and data generation in embodied artificial intelligence.
vlaembodied
arxiv:2604.17880 · cs.RO
ST-$π$: Structured SpatioTemporal VLA for Robotic Manipulation
Chuanhao Ma, Hanyu Zhou, Shihan Peng, Yan Li +2
Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$π$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments have been conducted to demonstrate the effectiveness of our model. Our code link: https://github.com/chuanhaoma/ST-pi.
vision-language-actionvlavla modelmanipulation
arxiv:2604.17876 · cs.RO
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
Kuanning Wang, Ke Fan, Chenhao Qiu, Zeyu Shangguan +4
Robust robotic manipulation requires not only predicting how the scene evolves over time, but also recognizing task-relevant objects in complex scenes. However, existing VLA models face two limitations. They typically act only on the current frame, while future prediction and object-aware reasoning are often learned in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. Our method forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering task-irrelevant variation, and conditions continuous action generation on these predictions. By integrating OFlow into VLA pipelines, our method enables more reliable control under distribution shifts. Extensive experiments across LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks and real-world tasks demonstrate that object-aware foresight consistently enhances robustness and success.
vlavla modelmanipulation
arxiv:2604.17863 · cs.RO
Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-driven Wrist
Lei Liu, Haonan Zhang, Huahang Xu, Zefan Zhang +7
Spinning flexible objects, exemplified by traditional Chinese handkerchief performances, demands periodic steady-state motions under nonlinear dynamics with frictional contacts and boundary constraints. To address these challenges, we first design an intuitive dexterous wrist based on a parallel anti-parallelogram tendon-driven structure, which achieves 90 degrees omnidirectional rotation with low inertia and decoupled roll-pitch sensing, and implement a high-low level hierarchical control scheme. We then develop a particle-spring model of the handkerchief for control-oriented abstraction and strategy evaluation. Hardware experiments validate this framework, achieving an unfolding ratio of approximately 99% and fingertip tracking error of RMSE = 2.88 mm in high-dynamic spinning. These results demonstrate that integrating control-oriented modeling with a task-tailored dexterous wrist enables robust rest-to-steady-state transitions and precise periodic manipulation of highly flexible objects. More visualizations: https://slowly1113.github.io/icra2026-handkerchief/
manipulationdexterous
arxiv:2604.17833 · cs.RO
DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation
Autrio Das, Shreya Bollimuntha, Madala Venkata Renu Jeevesh, Keshab Patra +4
What appears effortless to a human waiter remains a major challenge for robots. Manipulating objects nonprehensilely on a tray is inherently difficult, and the complexity is amplified in dual-arm settings. Such tasks are highly relevant to service robotics in domains such as hotels and hospitality, where robots must transport and reposition diverse objects with precision. We present DART, a novel dual-arm framework that integrates nonlinear Model Predictive Control (MPC) with an optimization-based impedance controller to achieve accurate object motion relative to a dynamically controlled tray. The framework systematically evaluates three complementary strategies for modeling tray-object dynamics as the state transition function within our MPC formulation: (i) a physics-based analytical model, (ii) an online regression based identification model that adapts in real-time, and (iii) a reinforcement learning-based dynamics model that generalizes across object properties. Our pipeline is validated in simulation with objects of varying mass, geometry, and friction coefficients. Extensive evaluations highlight the trade-offs among the three modeling strategies in terms of settling time, steady-state error, control effort, and generalization across objects. To the best of our knowledge, DART constitutes the first framework for non-prehensile dual-arm manipulation of objects on a tray. Project Link: https://dart-icra.github.io/dart/
manipulation
arxiv:2604.17810 · cs.RO
Memory Centric Power Allocation for Multi-Agent Embodied Question Answering
Chengyang Li, Shuai Wang, Kejiang Ye, Weijie Yuan +4
This paper considers multi-agent embodied question answering (MA-EQA), which aims to query robot teams on what they have seen over a long horizon. In contrast to existing edge resource management methods that emphasize sensing, communication, or computation performance metrics, MA-EQA emphasizes the memory qualities. To cope with this paradigm shift, we propose a quality of memory (QoM) model based on generative adversarial exam (GAE), which leverages forward simulation to assess memory retrieval and uses the resulting exam scores to compute QoM values. Then we propose memory centric power allocation (MCPA), which maximizes the QoM function under communication resource constraints. Through asymptotic analysis, it is found that the transmit powers are proportional to the GAE error probability, thus prioritizing towards high-QoM robots. Extensive experiments demonstrate that MCPA achieves significant improvements over extensive benchmarks in terms of diverse metrics in various scenarios.
embodied
arxiv:2604.17800 · cs.RO
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
Tuan Van Vo, Tan Q. Nguyen, Khang Nguyen, Nhat Xuan Tran +4
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into desired robotic actions. Despite their advancements, VLAs often overlook explicit reasoning and learn the functional input-action mappings, omitting crucial logical steps, which are especially pronounced in interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasons. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we fine-tune pre-trained VLAs with the reasoning-enriched datasets with ReFineVLA, while maintaining the underlying generalization abilities and boosting reasoning capabilities. We also conduct attention map visualization to analyze the alignment among visual observation, linguistic prompts, and to-be-executed actions of ReFineVLA, reflecting the model is ability to focus on relevant tasks and actions. Through this additional step, we explore that ReFineVLA-trained models exhibit a meaningful agreement between vision-language and action domains, highlighting the enhanced multimodal understanding and generalization. Evaluated across a suite of simulated manipulation benchmarks on SimplerEnv with both WidowX and Google Robot tasks, ReFineVLA achieves state-of-the-art performance, in success rate over the second-best method on the both the WidowX benchmark and Google Robot Tasks.
vision-language-actionvlavla modelmanipulation
arxiv:2604.17787 · cs.RO
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
Tingzheng Jia, Kan Guo, Lanping Qian, Yongli Hu +5
Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
vision-language-actionvlamanipulation
arxiv:2604.17767 · physics.optics
Measurement-defined control of single-particle interference
Tai Hyun Yoon
Interference is conventionally attributed to path-accumulated phase differences, with measurement treated as a passive readout. Here we demonstrate that single-particle interference is governed by the relative phase between the prepared quantum state and the detector-defined measurement basis -- a joint quantity that is operationally inaccessible in any conventional interferometer. Using coherently seeded entangled nonlinear biphoton sources, we show that independently scanning the pump phase difference, the seed phase difference, or the signal interferometric phase each produces identical sinusoidal fringes ($V\approx0.99$) -- a three-scan equivalence impossible in any two-mode interferometer. The fringe visibility is continuously controlled through the idler-state overlap, directly encoding quantum distinguishability without idler detection. The same measurement-defined interference law persists from the single-photon to the high-flux regime. The bright-dark collective-state structure demonstrated here unifies coherent population trapping and electromagnetically induced transparency in atomic $Λ$-systems, discrete photonic interference, and single-slit diffraction within a common framework differing only in dark-subspace dimensionality, establishing measurement-defined photonic modes as a fundamental resource for quantum photonics.
quantum photonic
arxiv:2604.17723 · physics.optics
Poling-free Spontaneous Parametric Down Conversion without for Silicon Carbide and Lithium Niobate photonics
Tim F. Weiss, Hamed Arianfard, Yang Yang, Alberto Peruzzo
State-of-the-art photon sources based on spontaneous parametric down-conversion (SPDC) currently rely on artificial structuring of the material nonlinearity to satisfy phase-matching conditions. This technique, known as periodic poling, is available only in a limited number of material platforms and introduces additional fabrication steps and errors, which are detrimental to up-scaling efforts. Here, we present a device architecture that enables SPDC of a wide range of frequencies without the need for periodic poling. We present explicit designs and calculations for 4H Silicon Carbide on-insulator, in which SPDC photon generation is so far unavailable, and thin-film Lithium Niobate on-insulator, a state-of-the-art quantum photonics platform. Our design, based on mode conversion and subsequent modal phase-matched SPDC, facilitates a CMOS compatible $χ^{(2)}$ platform, and simplifies photon sources by removing the requirement of periodic poling and the associated additional fabrication complexity.
quantum photonic
arxiv:2604.17706 · cs.RO
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
Haoxiang Jie, Yaoyuan Yan, Xiangyu Wei, Kailin Wang +3
Visual-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL significantly outperforms state-of-the-art methods, effectively overcoming the fundamental limitations of current VLA models.
vision-language-actionvlavla modelembodied
arxiv:2604.17651 · cs.RO
Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
Siyuan Meng, Chengbo Ai
World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego-vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure-centric world models offer a fundamentally complementary capability: the bird's-eye, multi-sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio-temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions including rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture, annotation-free perception as a multi-modal data engine feeding end-to-end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun's JEPA, Li Fei-Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.
vlaworld modelevent camera
arxiv:2604.17513 · cs.RO
FLASH: Fast Learning via GPU-Accelerated Simulation for High-Fidelity Deformable Manipulation in Minutes
Siyuan Luo, Bingyang Zhou, Chong Zhang, Xin Liu +8
Simulation frameworks such as Isaac Sim have enabled scalable robot learning for locomotion and rigid-body manipulation; however, contact-rich simulation remains a major bottleneck for deformable object manipulation. The continuously changing geometry of soft materials, together with large numbers of vertices and contact constraints, makes it difficult to achieve high accuracy, speed, and stability required for large-scale interactive learning. We present FLASH, a GPU-native simulation framework for contact-rich deformable manipulation, built on an accurate NCP-based solver that enforces strict contact and deformation constraints while being explicitly designed for fine-grained GPU parallelism. Rather than porting conventional single-instruction-multiple-data (SIMD) solvers to GPUs, FLASH redesigns the physics engine from the ground up to leverage modern GPU architectures, including optimized collision handling and memory layouts. As a result, FLASH scales to over 3 million degrees of freedom at 30 FPS on a single RTX 5090, while accurately simulating physical interactions. Policies trained solely on FLASH-generated synthetic data in minutes achieve robust zero-shot sim-to-real transfer, which we validate on physical robots performing challenging deformable manipulation tasks such as towel folding and garment folding, without any real-world demonstration, providing a practical alternative to labor-intensive real-world data collection.
manipulation
arxiv:2604.17335 · cs.RO
Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
Zewei Zhang, Kehan Wen, Michael Xu, Junzhe He +6
Whole-body humanoid locomotion is challenging due to high-dimensional control, morphological instability, and the need for real-time adaptation to various terrains using onboard perception. Directly applying reinforcement learning (RL) with reward shaping to humanoid locomotion often leads to lower-body-dominated behaviors, whereas imitation-based RL can learn more coordinated whole-body skills but is typically limited to replaying reference motions without a mechanism to adapt them online from perception for terrain-aware locomotion. To address this gap, we propose a whole-body humanoid locomotion framework that combines skills learned from reference motions with terrain-aware adaptation. We first train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. Concurrently, we train a whole-body reference tracker with RL using this motion data. To improve robustness under imperfectly generated references, we further fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation, and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. The hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations. Quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness.
humanoid
arxiv:2604.17277 · physics.app-ph
Fully Analog Resonant Recurrent Neural Network via Metacircuit
Zixin Zhou, Tianxi Jiang, Menglong Yang, Zhihua Feng +2
Physical neural networks offer a transformative route to edge intelligence, providing superior inference speed and energy efficiency compared to conventional digital architectures. However, realizing scalable, end-to-end, fully analog recurrent neural networks for temporal information processing remains challenging due to the difficulty of faithfully mapping trained network models onto physical hardware. Here we present a fully analog resonant recurrent neural network (R$^2$NN) implemented via a metacircuit architecture composed of coupled electrical local resonators. A reformulated mechanical-electrical analogy establishes a direct mapping between the R$^2$NN model and metacircuit elements, enabling accurate physical implementation of trained neural network parameters. By integrating jointly trainable global resistive coupling and local resonances, which generate effective frequency-dependent negative resistances, the architecture shapes an impedance landscape that steers currents along frequency-selective pathways. This mechanism enables direct extraction of discriminative spectral features, facilitating real-time temporal classification of raw analog inputs while bypassing analog-to-digital conversion. We demonstrate the cross-domain versatility of this framework using integrated hardware for tactile perception, speech recognition, and condition monitoring. This work establishes a scalable, fully analog paradigm for intelligent temporal processing and paves the way for low-latency, resource-efficient physical neural hardware for edge intelligence.
tactile
arxiv:2604.17258 · cs.RO
A Rapid Deployment Pipeline for Autonomous Humanoid Grasping Based on Foundation Models
Yifei Yan, Yankai Liao, Linqi Ye
Deploying a humanoid robot to manipulate a new object has traditionally required one to two days of effort: data collection, manual annotation, 3D model acquisition, and model training. This paper presents an end-to-end rapid deployment pipeline that integrates three foundation-model components to shorten the onboarding cycle for a new object to approximately 30 minutes: (i) Roboflow-based automatic annotation to assist in training a YOLOv8 object detector; (ii) 3D reconstruction based on Meta SAM 3D, which eliminates the need for a dedicated laser scanner; and (iii) zero-shot 6-DoF pose tracking based on FoundationPose, using the SAM~3D-generated mesh directly as the template. The estimated pose drives a Unity-based inverse kinematics planner, whose joint commands are streamed via UDP to a Unitree~G1 humanoid and executed through the Unitree SDK. We demonstrate detection accuracy of [email protected] = 0.995, pose tracking precision of $σ< 1.05$ mm, and successful grasping on a real robot at five positions within the workspace. We further verify the generality of the pipeline on an automobile-window glue-application task. The results show that combining foundation models for perception with everyday imaging devices (e.g., smartphones) can substantially lower the deployment barrier for humanoid manipulation tasks.
manipulationhumanoid
arxiv:2604.17252 · cs.RO
Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents
Hanlin Wang, Chak Tou Leong, Jian Wang, Wenjie Li
Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and perform ineffective actions, as they often overlook critical environmental feedback that differs from their internal beliefs. Through a formal probing analysis, we characterize this as belief inertia, a phenomenon where agents stubbornly adhere to prior beliefs despite explicit observations. To address this, we advocate active belief intervention, moving from passive understanding to active management. We introduce the Estimate-Verify-Update (EVU) mechanism, which empowers agents to predict expected outcomes, verify them against observations through explicit reasoning, and actively update prior beliefs based on the verification evidence. EVU is designed as a unified intervention mechanism that generates textual belief states explicitly, and can be integrated into both prompting-based and training-based agent reasoning methods. Extensive experiments across three embodied benchmarks demonstrate that EVU consistently yields substantial gains in task success rates. Further analyses validate that our approach effectively mitigates belief inertia, advancing the development of more robust embodied agents. Our code is available at https://github.com/WangHanLinHenry/EVU.
embodied
arxiv:2604.17245 · cs.RO
MM-Hand: A 21-DOF Multi-modal Modular Dexterous Robotic Hand with Remote Actuation
Zhuoheng Li, Qingquan Lin, Checheng Yu, Qiangyu Chen +4
High-DOF dexterous hands require compact actuation, rich sensing, and reliable thermal behavior, but conventional designs often occupy valuable in-hand space, increase end-effector mass, and suffer from heat accumulation near the hand. Remote tendon-driven actuation offers an alternative by relocating motors to the robot base or an external motor hub, thereby freeing the fingers and palm for additional degrees of freedom, sensing modules, and maintainable mechanical structures. This paper presents MM-Hand, a 21-DOF Multimodal Modular dexterous hand based on remote tendon-driven actuation. The hand integrates spring-return tendon-driven fingers, modular 3D-printed finger and palm structures, quick tendon connectors for maintenance, and a multimodal sensing system including joint angle sensors, tactile sensors, motor-side feedback, and in-palm stereo vision. We further analyze tendon-sheath length variation and friction loss to guide the design of the routing, motor hub, and closed-loop joint control. Experiments validate the transmission, output force, sensing, and control capability of the system. The fingertip force reaches 25N under a 1m remote sheath transmission, demonstrating practical load capacity despite long-distance tendon routing. Closed-loop joint-level experiments further evaluate command tracking with a static arm and during arm motion. These results show that MM-Hand provides a lightweight, sensor-rich, and maintainable hardware platform for dexterous manipulation research. To support the community, all hardware designs and software frameworks are made fully open-source at https://mmlab.hk/research/MM-Hand.
manipulationdexteroustactile
arxiv:2604.17241 · cs.RO
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
Kun Wang, Yiming Li, Mingcheng Qu, Aqiang Zhang +2
Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over rely on the reasoning capabilities of vision language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to effectively understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, area view, and node area association view via contrastive learning, enabling hypergraph semantics to be more effectively injected into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
embodied
arxiv:2604.17050 · cs.RO
Web-Gewu: A Browser-Based Interactive Playground for Robot Reinforcement Learning
Kaixuan Chen, Linqi Ye
With the rapid development of embodied intelligence, robotics education faces a dual challenge: high computational barriers and cumbersome environment configuration. Existing centralized cloud simulation solutions incur substantial GPU and bandwidth costs that preclude large-scale deployment, while pure local computing is severely constrained by learners' hardware limitations. To address these issues, we propose \href{http://47.76.242.88:8080/receiver/index.html}{Web-Gewu}, an interactive robotics education platform built on a WebRTC cloud-edge-client collaborative architecture. The system offloads all physics simulation and reinforcement learning (RL) training to the edge node, while the cloud server acts exclusively as a lightweight signaling relay, enabling extremely low-cost browser-based peer-to-peer (P2P) real-time streaming. Learners can interact with multi-form robots at low end-to-end latency directly in a web browser without any local installation, and simultaneously observe real-time visualization of multi-dimensional monitoring data, including reinforcement learning reward curves. Combined with a predefined robust command communication protocol, Web-Gewu provides a highly scalable, out-of-the-box, and barrier-free teaching infrastructure for embodied intelligence, significantly lowering the barrier to entry for cutting-edge robotics technology.
embodied
arxiv:2604.16993 · cs.RO
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
Jiawen Wen, Penglei Sun, Wenjie Zhang, Suixuan Qiu +3
As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a "goal-driven trap", prioritizing physical geometry ("can I go?") over semantic rules ("may I go?"), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.
embodied
arxiv:2604.16903 · cs.RO
Leveraging VR Robot Games to Facilitate Data Collection for Embodied Intelligence Tasks
Yihan Zhang, Ziyun Huang, Linqi Ye
Collecting embodied interaction data at scale remains costly and difficult due to the limited accessibility of conventional interfaces. We present a gamified data collection framework based on Unity that combines procedural scene generation, VR-based humanoid robot control, automatic task evaluation, and trajectory logging. A trash pick-and-place task prototype is developed to validate the full workflow.Experimental results indicate that the collected demonstrations exhibit broad coverage of the state-action space, and that increasing task difficulty leads to higher motion intensity as well as more extensive exploration of the arm's workspace. The proposed framework demonstrates that game-oriented virtual environments can serve as an effective and extensible solution for embodied data collection.
embodiedhumanoid
arxiv:2604.16886 · cs.RO
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
Xianhao Wang, Xiaojian Ma, Haozhe Hu, Rongpeng Su +6
Generalist embodied agents must perform interactive, causally-dependent reasoning, continually interacting with the environment, acquiring information, and updating plans to solve long-horizon tasks before they could be adopted in real-life scenarios. For instance, retrieving an apple from a cabinet may require opening multiple doors and drawers before the apple becomes visible and reachable, demanding sequential interaction under partial observability. However, existing benchmarks fail to systematically evaluate this essential capability. We introduce COIN, a benchmark designed to assess interactive reasoning in realistic robotic manipulation through three key contributions. First, we construct COIN-50: 50 interactive tasks in daily scenarios, and create COIN-Primitive required by causally-dependent tasks, and COIN-Composition with mid-term complexity for skill learning and generalization evaluation. Second, we develop a low-cost mobile AR teleoperation system and collect the COIN-Primitive Dataset with 50 demonstrations per primitive task (1,000 in total). Third, we develop systematic evaluation metrics about execution stability and generalization robustness to evaluate CodeAsPolicy, VLA, and language-conditioned H-VLA approaches. Our comprehensive evaluation reveals critical limitations in current methods: models struggle with interactive reasoning tasks due to significant gaps between visual understanding and motor execution. We provide fine-grained analysis of these limitations.
embodiedmanipulationteleoperation
arxiv:2604.16850 · cs.RO
Refinement of Accelerated Demonstrations via Incremental Iterative Reference Learning Control for Fast Contact-Rich Imitation Learning
Koki Yamane, Cristian C. Beltran-Hernandez, Steven Oh, Masashi Hamaya +1
Fast execution of contact-rich manipulation is critical for practical deployment, yet providing fast demonstrations for imitation learning (IL) remains challenging: humans cannot demonstrate at high speed, and naively accelerating demonstrations alters contact dynamics and induces large tracking errors. We present a method to autonomously refine time-accelerated demonstrations by repurposing Iterative Reference Learning Control (IRLC) to iteratively update the reference trajectory from observed tracking errors. However, applying IRLC directly at high speed tends to produce larger early-iteration errors and less stable transients. To address this issue, we propose Incremental Iterative Reference Learning Control (I2RLC), which gradually increases the speed while updating the reference, yielding high-fidelity trajectories. We validate on real-robot whiteboard erasing and peg-in-hole tasks using a teleoperation setup with a compliance-controlled follower and a 3D-printed haptic leader. Both IRLC and I2RLC achieve up to 10x faster demonstrations with reduced tracking error; moreover, I2RLC improves spatial similarity to the original trajectories by 22.5% on average over IRLC across three tasks and multiple speeds (3x-10x). We then use the refined trajectories to train IL policies; the resulting policies execute faster than the demonstrations and achieve 100% success rates in the peg-in-hole task at both seen and unseen positions, with I2RLC-trained policies exhibiting lower contact forces than those trained on IRLC-refined demonstrations. These results indicate that gradual speed scheduling coupled with reference adaptation provides a practical path to fast, contact-rich IL.
manipulationteleoperation
arxiv:2604.16849 · physics.optics
Flat optics for analog computing: from fundamental mechanisms to advanced meta-processors
Tingting Liu, Jumin Qiu, Xintong Shi, Qiegen Liu +1
As the explosive growth of visual data increasingly strains the latency and energy limits of conventional electronic computing, optical analog computing has re-emerged as a disruptive paradigm for zero-power, speed-of-light information processing. Propelled by the unprecedented wave-manipulation capabilities of optical metasurfaces, this field is undergoing a rapid transition from macroscopic physical optics to ultra-compact, on-chip meta-processors. This Review examines the fundamental mechanisms of metasurface-empowered optical computing spanning Fourier-domain, nonlocal spatial-domain, and interferometric architectures that perform mathematical operations, with a particular focus on spatial differentiation and edge detection as representative computing tasks. By emphasizing recent breakthroughs, we highlight the evolution of meta-processors from static, linear regimes to dynamically reconfigurable, nonlinear, and quantum-assisted multidimensional platforms. We also envision how the synergy of AI-driven inverse design and the integration of analog meta-front-ends with optical neural networks will synergistically revolutionize next-generation intelligent machine vision.
manipulation
arxiv:2604.16788 · cs.RO
LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
Xueyao Chen, Jingkai Jia, Tong Yang, Yibo Fu +2
Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes, covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor. We observe that performance in fully observable settings is more strongly associated with execution robustness, while contextual difficulty varies across tasks and is not consistently improved by memory-based methods. We hope that LongBench serves as a useful benchmark for studying long-horizon manipulation and for developing policies with stronger robustness across both execution and contextual challenges.
manipulation
arxiv:2604.16695 · physics.optics
Gigahertz-rate thin-film lithium niobate receiver for time-bin quantum communication
Andrea Bernardi, Marco Clementi, Marcello Bacchi, Matías Rubén Bolaños +11
Time-bin encoded quantum states of light are crucial for quantum technology applications. The integration of manipulation functionalities into chip-scale devices is essential for deploying scalable, high-performance, and cost-effective quantum networks. Here we develop a fully integrated, high-throughput quantum receiver based on the thin-film lithium niobate (TFLN) platform, capable of high-speed electro-optic manipulation of time-bin encoded quantum states. The device's novel architecture enables active switching of time-bin quantum states with an electro-optic bandwidth exceeding 30 Ghz, while supporting real-time arbitrary projective measurements with a bandwidth of over 1 GHz. We showcase its versatility and performance through several applications, including the certification of entanglement with Bell's inequality violation by 38 standard deviations and with >95% visibility. We then apply it to a fiber-based quantum communication scenario, where we experimentally demonstrate an entanglement-based quantum key distribution (QKD) protocol, achieving stable finite-size secure key rates exceeding 25 kbit/s over 12 hours of continuous operation. By leveraging a high-speed active switching scheme, the system overcomes the need for temporal post-selection, eliminating a fundamental loophole that compromises the security of time-bin entanglement-based QKD protocols and relaxes the temporal resolution requirements of single-photon detectors. Moreover, it enables active selection of the projection basis, increasing the flexibility for communication parties. This approach establishes a versatile and scalable architecture for time-bin encoded quantum communication, enabling practical protocols on industry-grade photonic technology.
manipulation
arxiv:2604.16683 · cs.RO
Rewind-IL: Online Failure Detection and State Respawning for Imitation Learning
Gehan Zheng, Sanjay Seenivasan, Matthew Johnson-Roberson, Weiming Zhi
Imitation learning has enabled robots to acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long-horizon action-chunked policies. Once execution drifts off the demonstration manifold, these policies often continue producing locally plausible actions without recovering from the failure. Existing runtime monitors either require failure data, over-trigger under benign feature drift, or stop at failure detection without providing a recovery mechanism. We present Rewind-IL, a training-free online safeguard framework for generative action-chunked imitation policies. Rewind-IL combines a zero-shot failure detector based on Temporal Inter-chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state-respawning mechanism that returns the robot to a semantically verified safe intermediate state. Offline, a vision-language model identifies recovery checkpoints in demonstrations, and the frozen policy encoder is used to construct a compact checkpoint feature database. Online, Rewind-IL monitors self-consistency in overlapping action chunks, tracks similarity to the checkpoint library, and, upon failure, rewinds execution to the latest verified safe state before restarting inference from a clean policy state. Experiments on real-world and simulated long-horizon manipulation tasks, including transfer to flow-matching action-chunked policies, demonstrate that policy-internal consistency coupled with semantically grounded respawning offers a practical route to improved reliability in imitation learning. Supplemental materials are available at https://sjay05.github.io/rewind-il
manipulation
arxiv:2604.16677 · cs.RO
ReconVLA: An Uncertainty-Guided and Failure-Aware Vision-Language-Action Framework for Robotic Control
Lingling Chen, Zongyao Lyu, William J. Beksi
Vision-language-action (VLA) models have emerged as generalist robotic controllers capable of mapping visual observations and natural language instructions to continuous action sequences. However, VLAs provide no calibrated measure of confidence in their action predictions, thus limiting their reliability in real-world settings where uncertainty and failures must be anticipated. To address this problem we introduce ReconVLA, a reliable conformal model that produces uncertainty-guided and failure-aware control signals. Concretely, our approach applies conformal prediction directly to the action token outputs of pretrained VLA policies, yielding calibrated uncertainty estimates that correlate with execution quality and task success. Furthermore, we extend conformal prediction to the robot state space to detect outliers or unsafe states before failures occur, providing a simple yet effective failure detection mechanism that complements the action-level uncertainty. We evaluate ReconVLA in both simulation and real robot experiments across diverse manipulation tasks. Our results show that conformalized action predictions consistently improve failure anticipation, reduce catastrophic errors, and provide a calibrated measure of confidence without retraining or modifying the underlying VLA.
vision-language-actionvlamanipulation
arxiv:2604.16667 · cs.RO
Emergency Stopping for Liquid-manipulating Robots
Samuli Hynninen, Ville Kyrki
Manipulating open liquid containers is challenging because liquids are highly sensitive to vessel accelerations and jerks. Although spill-free liquid manipulation has been widely studied, emergency stopping under unexpected hazards has received little attention, despite the fact that abrupt braking may cause hazardous spills. This letter presents an emergency stop system for robots manipulating liquids in open containers. We formulate emergency stopping as an optimal control problem and solve it in a model predictive control framework to generate time-optimal, spill-free stopping trajectories. The method operates as a plug-and-play safety layer on top of existing slosh-free motion planning methods, enabling immediate reaction to detected hazards while accounting for nonlinear liquid dynamics. We demonstrate, through simulation and on a 7-DoF Franka Emika Panda robot, that the proposed approach achieves fast emergency stopping without spilling.
manipulation
arxiv:2604.16592 · cs.RO
Human Cognition in Machines: A Unified Perspective of World Models
Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari +18
This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human-like" cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.
embodiedworld model
arxiv:2604.16059 · physics.optics
Controlling external injection in laser-plasma accelerators with terahertz frequency bunch manipulation
Aras Amini, Lewis R. Reid, James K. Jones, Morgan T. Hibberd +5
Laser-plasma wakefield acceleration (LWFA) offers ultrahigh accelerating gradients in compact setups, but the complex non-linear nature of the process makes it challenging to generate high-quality beams. Injection of electron bunches from an external source into a plasma accelerator provides a promising route to improved performance; however, electron bunches from conventional radio-frequency (RF)-based injectors suffer from non-linear compression and laser-beam asynchrony, leading to energy jitter and emittance growth. We present a fundamental concept of terahertz-controlled electron bunches for external injection into LWFA. This terahertz-frequency approach provides temporal locking between the electron beam and the drive laser, and enables the compression of high-quality beams to sub-10-fs durations before injection into the LWFA. Numerical simulations demonstrate that GeV-scale acceleration with excellent beam quality and stability -- energy jitter and energy spread around 0.2% -- can be achieved using this method. This concept opens new opportunities for stable, multi-stage laser-driven accelerators and supports the development of next-generation applications such as free-electron lasers (FELs).
manipulation
arxiv:2604.15996 · eess.SY
Stealthy Cyber-Attacks on Vehicle Lateral Dynamics: A System-Theoretic Analysis
Ali Eslami, Jiangbo Yu, Mohammad Pirani
This paper studies the vehicle bicycle model under three classes of stealthy cyber-attacks: replay attacks, zero dynamics attacks, and covert attacks. Using a system-theoretic framework, we analyze the feasibility and impact of these attacks on vehicle lateral dynamics. The investigation considers different measurement configurations, including yaw rate, lateral acceleration, and longitudinal acceleration outputs, to evaluate how sensor selection influences attack detectability and system vulnerability. Each attack class is characterized in terms of required system knowledge, communication access, and impact. The analysis shows that replay attacks remain largely model-agnostic, while zero dynamics attacks are fundamentally constrained by control-oriented design choices, particularly output selection, which can eliminate unstable zero dynamics and limit the attack impact. In contrast, covert attacks, enabled by coordinated actuator and sensor manipulation, allow sustained and stealthy deviation of lateral states when sufficient access and system knowledge are available. The effects of actuator and tire saturation are also examined, revealing attack-dependent impacts on stealthiness and effectiveness. Finally, simulation case studies are conducted by using CarSim-Simulink co-simulation to validate and verify the theoretical results.
manipulation
arxiv:2604.15938 · cs.RO
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu +1
Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise steps with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.
manipulationdiffusion policy
arxiv:2604.15907 · cs.RO
A Reconfigurable Pneumatic Joint Enabling Localized Selective Stiffening and Shape Locking in Vine-Inspired Robots
Ayodele James Oyejide, Ustaz A. Yaqub, Samir Erturk, Eray A. Baran +1
Vine-inspired robots achieve large workspace coverage through tip eversion, enabling safe navigation in confined and cluttered environments. However, their deployment in free space is fundamentally limited by low axial stiffness, poor load-bearing capacity, and the inability to retain shape during and after steering. In this work, we propose a reconfigurable pneumatic joint (RPJ) architecture that introduces discrete, pressure-tunable stiffness along the robot body without compromising continuous growth. Each RPJ module comprises symmetrically distributed pneumatic chambers that locally increase bending stiffness when pressurized, enabling decoupling between global compliance and localized rigidity. We integrate the RPJs into a soft growing robot with tendon-driven steering and develop a compact base station for mid-air eversion. System characterization and experimental validation demonstrate moderate pressure requirements for eversion, as well as comparable localized stiffening and steering performance to layer-jamming mechanisms. Demonstrations further show that the proposed robot achieves improved shape retention during bending, reduced gravitational deflection under load, cascading retraction, and reliable payload transport up to 202 g in free space. The RPJ mechanism establishes a practical pathway toward structurally adaptive vine robots for manipulation-oriented tasks such as object sorting and adaptive exploration in unconstrained environments.
manipulation
arxiv:2604.15814 · cs.RO
Continual Hand-Eye Calibration for Open-world Robotic Manipulation
Fazeng Li, Gan Sun, Chenxi Liu, Yao He +2
Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting into unseen data amongst open-world scene changes, while simple rehearsal-based continual learning strategy cannot well mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework, enabling robots to adapt to sequentially encountered open-world manipulation scenes through spatially replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show significant anti scene forgetting performance, maintaining accuracy on past scenes while preserving adaptation to new scenes, confirming the effectiveness of the framework.
manipulation
arxiv:2604.15805 · cs.RO
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
Jasper Lu, Zhenhao Shen, Yuanfei Wang, Shugao Liu +7
Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, augmenting real-world scenes into simulation has become a practical augmentation for efficient learning and evaluation. We present a generative framework that establishes a generative real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesize diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.
manipulation
arxiv:2604.15671 · cs.RO
Long-Term Memory for VLA-based Agents in Open-World Task Execution
Xu Huang, Weixin Mao, Yinhao Li, Hua Chen +1
Vision-Language-Action (VLA) models have demonstrated significant potential for embodied decision-making; however, their application in complex chemical laboratory automation remains restricted by limited long-horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial-and-error in multi-stage protocols. In this paper, we propose ChemBot, a dual-layer, closed-loop framework that integrates an autonomous AI agent with a progress-aware VLA model (Skill-VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual-layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub-agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future-state-based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and task success rates compared to existing VLA baselines in complex, long-horizon chemical experimentation.
vision-language-actionvlavla modelembodied
arxiv:2604.15569 · cs.RO
ShapeGen: Robotic Data Generation for Category-Level Manipulation
Yirui Wang, Xiuwei Xu, Angyuan Ma, Bingyao Yu +2
Manipulation policies deployed in uncontrolled real-world scenarios are faced with great in-category geometric diversity of everyday objects. In order to function robustly under such variations, policies need to work in a category-level manner, i.e. knowing how to interact with any object in a certain category, instead of only a specific one seen during training. This in-category generalizability is usually nurtured with shape-diversified training data; however, manually collecting such a corpus of data is infeasible due to the requirement of intense human labor and large collections of divergent objects at hand. In this paper, we propose ShapeGen, a data generation method that aims at generating shape-variated manipulation data in a simulator-free and 3D manner. ShapeGen decomposes the process into two stages: Shape Library curation and Function-Aware Generation. In the first stage, we train spatial warpings between shapes mapping points to points that correspond functionally, and aggregate 3D models along with the warpings into a plug-and-play Shape Library. In the second stage, we design a pipeline that, leveraging established Libraries, requires only minimal human annotation to generate physically plausible and functionally correct novel demonstrations. Experiments in the real world demonstrate the effectiveness of ShapeGen to boost policies' in-category shape generalizability. Project page: https://wangyr22.github.io/ShapeGen/.
manipulation
arxiv:2604.15550 · physics.optics
Incoherence-assisted mode excitation in non-Hermitian resonant systems
Amin Hashemi, Vinzenz Zimmermann, Armando Perez-Leija, Andrea Blanco-Redondo
We introduce and experimentally demonstrate an approach for selective mode excitation in non-Hermitian resonant systems using incoherent light. This method eliminates the need for precise phase control that is often required in coherent excitation schemes. Using this technique on a silicon photonic platform with coupled ring resonators, we successfully excite the topological edge state of a non-Hermitian Su-Schrieffer-Heeger (SSH) model. Our work shows that incoherence-assisted excitation is a robust and passive strategy for topological state preparation, which broadens the scope of non-Hermitian topological photonics thereby providing a practical and experimentally viable tool for selective mode excitation.
silicon photonic

02 US SEMI · SEC 8-K FILINGS

2 items

scanned: NVDA / AVGO / MRVL / COHR / LITE / AMD / TSM / SMCI / ANET / CRDO / POWL / VECO

$AVGO · 8-K · filed 2026-04-21
Broadcom Inc
Items: 5.07
8-K
$SMCI · 8-K · filed 2026-04-20
Super Micro Computer Inc
Items: 5.02,5.07,9.01
8-K

03 HUMANOID · COMPANY NEWS

58 items

scanned: figure-ai / 1x / boston-dynamics / unitree / apptronik / sanctuary-ai / neura-robotics / agility-robotics / physical-intelligence / agibot

04 CN PHOTONICS · 公告流

0 items

CN 源尚未实装 (TIER-1 下一步)

← TODAY'S FRONT PAGE DIGEST INDEX TOPICS SITEWIDE RSS

Physical AI Brief

01 ARXIV · PHYSICAL AI PAPERS

02 US SEMI · SEC 8-K FILINGS

03 HUMANOID · COMPANY NEWS

Figure AI (10)

Boston Dynamics (10)

Unitree 宇树 (9)

Sanctuary AI (5)

Agility Robotics (10)

Physical Intelligence (7)

智元 AgiBot (7)

04 CN PHOTONICS · 公告流