Contact-rich manipulation is central to many everyday human activities, requiring continuous adaptation to contact uncertainty and external disturbances through multi-modal perception, particularly vision and tactile feedback. While imitation learning has shown strong potential for learning complex manipulation behaviors, most existing approaches rely on action chunking, which fundamentally limits their ability to react to unforeseen observations during execution. This limitation becomes especially critical in contact-rich scenarios, where physical uncertainty and high-frequency tactile feedback demand rapid, reactive control. To address this challenge, we propose Tube Diffusion Policy (TDP), a novel reactive visual–tactile policy learning framework that bridges diffusion-based imitation learning with tube-based feedback control. By leveraging the expressive power of generative models, TDP learns an observation-conditioned feedback flow around nominal action chunks, forming an action tube that enables fast and adaptive reactions during execution. We evaluate TDP on the widely used Push-T benchmark and three additional challenging visual–tactile dexterous manipulation tasks. Across all benchmarks, TDP consistently outperforms state-of-the-art imitation learning baselines. Two real-world experiments further validate its robust reactivity under contact uncertainty and external disturbances. Moreover, the step-wise correction mechanism enabled by the action tube significantly reduces the required denoising steps, making TDP well suited for real-time, high-frequency feedback control in contact-rich manipulation.
Overview of the full pipeline. We first collect visual-tactile human demonstrations via teleoperation in both simulation and on a physical robot, using a robotic arm equipped with a dexterous hand. Based on these demonstrations, Tube Diffusion Policy (TDP) learns a hybrid action velocity field that combines diffusion-based action generation with streaming conditional feedback flows, forming an action tube that constrains the generated trajectory around the demonstration manifold. This design enables reactive control at every timestep using fresh observations, allowing rapid adaptation to contact uncertainty and external disturbances. Moreover, stepwise correction reduces the required denoising steps and significantly lowers inference latency. We evaluate TDP on three virtual and two physical visual-tactile dexterous manipulation tasks, demonstrating strong performance and consistent gains over state-of-the-art baselines.
We developed two teleoperation pipelines. For the real robot, we use a mocap-based hand-tracking teleoperation system; for simulation, we use a VR-based teleoperation setup. Together, these pipelines provide real-time, smooth, stable, and responsive teleoperation for complex manipulation behaviors, enabling high-quality data collection.
Dual-Time Formulation: We adopt a dual-time formulation that combines an observation-conditioned streaming process with a multi-step denoising process. The streaming stage, defined along $t_2$, enables fast, reactive control by generating actions at each timestep conditioned on the current observation, producing the orange trajectory. However, the streaming flow typically relies on local linearization, which can introduce drift over time. To compensate, we introduce a denoising stage along $t_1$ at the beginning of each action chunk. This performs multi-step denoising to generate an initial action accounting for the full nonlinear system dynamics, shown in red.
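The dual-time loop can be sketched as follows. This is a minimal toy illustration of the control flow only: the function names, the linear velocity fields, and all constants (chunk length, step counts, gains) are illustrative assumptions, not the paper's learned networks or actual interfaces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned velocity fields.
def denoise_velocity(chunk, obs, t1):
    """Toy denoising velocity along t1: pulls the noisy chunk toward a target."""
    return -(chunk - obs[0])

def streaming_velocity(action, obs_t, t2):
    """Toy observation-conditioned streaming feedback velocity along t2."""
    return 0.5 * (obs_t - action)

CHUNK = 8          # actions per chunk (illustrative)
DENOISE_STEPS = 4  # few denoising steps at chunk start (illustrative)
DT = 0.1           # integration step for both time axes

def rollout_chunk(obs_stream):
    # Stage along t1: multi-step denoising from Gaussian noise,
    # conditioned on the observation at the start of the chunk.
    chunk = rng.normal(size=CHUNK)
    for t1 in range(DENOISE_STEPS):
        chunk = chunk + DT * denoise_velocity(chunk, obs_stream, t1)
    # Stage along t2: execute step by step, correcting each action
    # with the fresh observation available at that timestep.
    executed = []
    a = chunk[0]
    for t2, obs_t in enumerate(obs_stream):
        a = a + DT * streaming_velocity(a, obs_t, t2)
        executed.append(a)
    return np.array(executed)

obs = np.linspace(0.0, 1.0, CHUNK)  # toy 1-D observation stream
acts = rollout_chunk(obs)
print(acts.shape)
```

The key structural point the sketch captures is that the expensive multi-step denoising runs only once per chunk, while the cheap streaming correction runs at every timestep with the latest observation.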
Action Tube: The overall process operates within an action tube (dashed boundary) around a nominal trajectory (green). Within this tube, deviations are continuously corrected through observation-conditioned streaming flows, viewed as a sequence of local controllers (funnels). Here, $h_t$ denotes the observation history, and $\Delta t$ controls the temporal resolution.
By combining diffusion-based initialization with streaming-based reactive feedback, Tube Diffusion Policy achieves robust, high-frequency closed-loop control for contact-rich manipulation.
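A single funnel step of the tube view can be sketched as a contraction toward the nominal trajectory followed by a clip to the tube boundary. The function name, the proportional gain, and the tube radius below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def tube_correct(action, nominal, radius, gain=0.5):
    """One funnel step: contract the action toward the nominal
    trajectory, then clip it to stay inside the tube of the
    given radius (gain and radius are illustrative)."""
    corrected = action + gain * (nominal - action)
    deviation = corrected - nominal
    norm = np.linalg.norm(deviation)
    if norm > radius:
        corrected = nominal + deviation * (radius / norm)
    return corrected

nominal = np.zeros(3)                 # nominal action at this timestep
a = np.array([1.0, -0.8, 0.5])        # disturbed action outside the tube
for _ in range(5):                    # successive funnels along the chunk
    a = tube_correct(a, nominal, radius=0.3)
print(np.linalg.norm(a - nominal))
```

Applied in sequence, such funnels shrink the deviation at every step, which is the intuition behind viewing the streaming flows as a chain of local controllers keeping execution near the demonstration manifold.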
Tube Diffusion Policy enables real-time, high-frequency feedback control in visual-tactile dexterous manipulation.
Stable Grasping (Sim)
On-table Reorientation (Sim)
Dish Cleaning (Sim)
On-table Reorientation (Real)
Jar Opening (Real)
We compare Tube Diffusion Policy with state-of-the-art baselines in both simulated and real-world settings across diverse visual–tactile dexterous manipulation tasks, highlighting its enhanced reactivity under contact uncertainty and external disturbances.
In stable grasping and on-table reorientation, TDP shows strong reactivity to contact uncertainty, adapting to variations in object geometry and friction. In the dish cleaning task, TDP remains effective with only a few DDIM steps, thanks to the stepwise correction enabled by the action tube. This significantly reduces inference latency and makes it well-suited for real-time feedback control.
Stable Grasping
On-table Reorientation
Dish Cleaning
We further evaluate TDP on real-world visual–tactile tasks. TDP achieves high-frequency, closed-loop control with only a few DDIM steps, resulting in stable and responsive behavior. In contrast, Diffusion Policy (DP) either generates unstable behaviors with few DDIM steps or relies on open-loop action chunking, leading to less adaptive performance. Under external disturbances, TDP quickly adapts and recovers, while DP fails to respond due to its pre-generated action sequences.
On-table Reorientation
On-table Reorientation with Disturbance
Jar Opening
Jar Opening with Disturbance
This video summarizes the entire work, including the motivation, full architecture, and a comparison with state-of-the-art imitation learning approaches in both simulation and the real world.