Contact-rich manipulation is central to many everyday human activities, requiring continuous adaptation to contact uncertainty and external disturbances through multi-modal perception, particularly vision and tactile feedback. While imitation learning has shown strong potential for learning complex manipulation behaviors, most existing approaches rely on action chunking, which fundamentally limits their ability to react to unforeseen observations during execution. This limitation becomes especially critical in contact-rich scenarios, where physical uncertainty and high-frequency tactile feedback demand rapid, reactive control. To address this challenge, we propose Tube Diffusion Policy (TDP), a novel reactive visual–tactile policy learning framework that bridges diffusion-based imitation learning with tube-based feedback control. By leveraging the expressive power of generative models, TDP learns an observation-conditioned feedback flow around nominal action chunks, forming an action tube that enables fast and adaptive reactions during execution. We evaluate TDP on the widely used Push-T benchmark and three additional challenging visual–tactile dexterous manipulation tasks. Across all benchmarks, TDP consistently outperforms state-of-the-art imitation learning baselines. Two real-world experiments further validate its robust reactivity under contact uncertainty and external disturbances. Moreover, the step-wise correction mechanism enabled by the action tube significantly reduces the required denoising steps, making TDP well suited for real-time, high-frequency feedback control in contact-rich manipulation.
Overview of the full pipeline. We first collect visual-tactile human demonstrations via teleoperation in both simulation and on a physical robot, using a robotic arm equipped with a dexterous hand. Based on these demonstrations, Tube Diffusion Policy (TDP) learns a hybrid action velocity field that combines diffusion-based action generation with streaming conditional feedback flows, forming an action tube that constrains the generated trajectory around the demonstration manifold. This design enables reactive control at every timestep using fresh observations, allowing rapid adaptation to contact uncertainty and external disturbances. Moreover, stepwise correction reduces the required denoising steps and significantly lowers inference latency. We evaluate TDP on three virtual and two physical visual-tactile dexterous manipulation tasks, demonstrating strong performance and consistent gains over state-of-the-art baselines.
We developed two teleoperation pipelines. For the real robot, we use a mocap-based hand-tracking teleoperation system; for simulation, we use a VR-based teleoperation setup. Together, these pipelines provide real-time, smooth, stable, and responsive teleoperation for complex manipulation behaviors, enabling high-quality data collection.
Dual-Time Formulation: We adopt a dual-time formulation that combines an observation-conditioned streaming process with a multi-step denoising process. The streaming stage, defined along $t_2$, enables fast, reactive control by generating actions at each timestep conditioned on the current observation, producing the orange trajectory. However, the streaming flow typically relies on local linearization, which can introduce drift over time. To compensate, we introduce a denoising stage along $t_1$ at the beginning of each action chunk. This performs multi-step denoising to generate an initial action accounting for the full nonlinear system dynamics, shown in red.
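The dual-time loop can be sketched as follows. This is a minimal toy illustration of the control flow only: the function names, the linear velocity fields, and all constants (chunk length, step counts, gains) are illustrative assumptions, not the paper's learned networks or actual interfaces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned velocity fields.
def denoise_velocity(chunk, obs, t1):
    """Toy denoising velocity along t1: pulls the noisy chunk toward a target."""
    return -(chunk - obs[0])

def streaming_velocity(action, obs_t, t2):
    """Toy observation-conditioned streaming feedback velocity along t2."""
    return 0.5 * (obs_t - action)

CHUNK = 8          # actions per chunk (illustrative)
DENOISE_STEPS = 4  # few denoising steps at chunk start (illustrative)
DT = 0.1           # integration step for both time axes

def rollout_chunk(obs_stream):
    # Stage along t1: multi-step denoising from Gaussian noise,
    # conditioned on the observation at the start of the chunk.
    chunk = rng.normal(size=CHUNK)
    for t1 in range(DENOISE_STEPS):
        chunk = chunk + DT * denoise_velocity(chunk, obs_stream, t1)
    # Stage along t2: execute step by step, correcting each action
    # with the fresh observation available at that timestep.
    executed = []
    a = chunk[0]
    for t2, obs_t in enumerate(obs_stream):
        a = a + DT * streaming_velocity(a, obs_t, t2)
        executed.append(a)
    return np.array(executed)

obs = np.linspace(0.0, 1.0, CHUNK)  # toy 1-D observation stream
acts = rollout_chunk(obs)
print(acts.shape)
```

The key structural point the sketch captures is that the expensive multi-step denoising runs only once per chunk, while the cheap streaming correction runs at every timestep with the latest observation.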
Action Tube: The overall process operates within an action tube (dashed boundary) around a nominal trajectory (green). Within this tube, deviations are continuously corrected through observation-conditioned streaming flows, viewed as a sequence of local controllers (funnels). Here, $h_t$ denotes the observation history, and $\Delta t$ controls the temporal resolution.
By combining diffusion-based initialization with streaming-based reactive feedback, Tube Diffusion Policy achieves robust, high-frequency closed-loop control for contact-rich manipulation.
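A single funnel step of the tube view can be sketched as a contraction toward the nominal trajectory followed by a clip to the tube boundary. The function name, the proportional gain, and the tube radius below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def tube_correct(action, nominal, radius, gain=0.5):
    """One funnel step: contract the action toward the nominal
    trajectory, then clip it to stay inside the tube of the
    given radius (gain and radius are illustrative)."""
    corrected = action + gain * (nominal - action)
    deviation = corrected - nominal
    norm = np.linalg.norm(deviation)
    if norm > radius:
        corrected = nominal + deviation * (radius / norm)
    return corrected

nominal = np.zeros(3)                 # nominal action at this timestep
a = np.array([1.0, -0.8, 0.5])        # disturbed action outside the tube
for _ in range(5):                    # successive funnels along the chunk
    a = tube_correct(a, nominal, radius=0.3)
print(np.linalg.norm(a - nominal))
```

Applied in sequence, such funnels shrink the deviation at every step, which is the intuition behind viewing the streaming flows as a chain of local controllers keeping execution near the demonstration manifold.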
Tube Diffusion Policy enables real-time, high-frequency feedback control in visual-tactile dexterous manipulation.
Stable Grasping (Sim)
On-table Reorientation (Sim)
Dish Cleaning (Sim)
On-table Reorientation (Real)
Jar Opening (Real)
We compare Tube Diffusion Policy with state-of-the-art baselines in both simulated and real-world settings across diverse visual–tactile dexterous manipulation tasks, highlighting its enhanced reactivity under contact uncertainty and external disturbances.
In stable grasping and on-table reorientation, TDP shows strong reactivity to contact uncertainty, adapting to variations in object geometry and friction. In the dish cleaning task, TDP remains effective with only a few DDIM steps, thanks to the stepwise correction enabled by the action tube. This significantly reduces inference latency and makes it well-suited for real-time feedback control.
Stable Grasping
On-table Reorientation
Dish Cleaning
We further evaluate TDP on real-world visual–tactile tasks. TDP achieves high-frequency, closed-loop control with only a few DDIM steps, resulting in stable and responsive behavior. In contrast, Diffusion Policy (DP) either generates unstable behaviors with few DDIM steps or relies on open-loop action chunking, leading to less adaptive performance. Under external disturbances, TDP quickly adapts and recovers, while DP fails to respond due to its pre-generated action sequences.
On-table Reorientation
On-table Reorientation with Disturbance
Jar Opening
Jar Opening with Disturbance
This video summarizes the entire work, including the motivation, full architecture, and a comparison with state-of-the-art imitation learning approaches in both simulation and the real world.