Point Cloud Models Improve Visual Robustness in Robotic Learners
Skand Peri, Iain Lee, Chanho Kim, Li Fuxin, Tucker Hermans, Stefan Lee, ICRA 2024
Abstract
Visual control policies can encounter significant performance degradation when visual conditions like lighting or camera position differ from those seen during training – often exhibiting sharp declines in capability even for minor differences. In this work, we examine robustness to a suite of these types of visual changes for RGB-D and point cloud based visual control policies. To perform these experiments on both model-free and model-based reinforcement learners, we introduce a novel Point Cloud World Model (PCWM) and point cloud based control policies. Our experiments show that policies that explicitly encode point clouds are significantly more robust than their RGB-D counterparts. Further, we find our proposed PCWM significantly outperforms prior works in terms of sample efficiency during training. Taken together, these results suggest reasoning about the 3D scene through point clouds can improve performance, reduce learning time, and increase robustness for robotic learners.
Point cloud world models
Given a sequence of \(T\) partial point cloud observations, we encode each observation with a PointConv encoder. At each timestep \(t\), we compute a posterior stochastic latent \(z_t\) from the encoding of \(o_t\) and a hidden state \(h_t\) that summarizes the history. The hidden state is also used to compute a prior latent, which is used to predict multi-step rewards over a horizon \(H\); this reward prediction supervises the world model alongside a temporal-consistency KL loss between the prior and posterior latents.
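To make this update concrete, below is a minimal PyTorch-style sketch of one recurrent latent step under the stated assumptions: a simple per-point MLP with max pooling stands in for the actual PointConv encoder, and names such as `PCWMStep`, `latent_dim`, `hidden_dim`, and `feat_dim` are illustrative placeholders rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

class PCWMStep(nn.Module):
    """One timestep of a recurrent latent world model over point clouds.

    Sketch only: the real model uses a PointConv encoder; here a per-point
    MLP followed by max pooling is used as a stand-in.
    """

    def __init__(self, latent_dim=32, hidden_dim=256, action_dim=8, feat_dim=128):
        super().__init__()
        # Stand-in point cloud encoder: per-point MLP + permutation-invariant pooling.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim)
        )
        # Deterministic recurrent state h_t that encodes the history.
        self.rnn = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        # Prior p(z_t | h_t): predicted from the hidden state alone.
        self.prior_net = nn.Linear(hidden_dim, 2 * latent_dim)
        # Posterior q(z_t | h_t, o_t): also conditioned on the observation encoding.
        self.post_net = nn.Linear(hidden_dim + feat_dim, 2 * latent_dim)
        # Reward head that supervises the model over the imagination horizon.
        self.reward_head = nn.Linear(hidden_dim + latent_dim, 1)

    def encode(self, points):
        # points: (batch, num_points, 3) partial point cloud observation o_t.
        feats = self.point_encoder(points)      # (B, N, feat_dim)
        return feats.max(dim=1).values          # pool over points

    def forward(self, h, z_prev, action, points):
        # Advance the deterministic state with the previous latent and action.
        h = self.rnn(torch.cat([z_prev, action], dim=-1), h)
        # Prior over z_t from h_t alone (used when imagining without observations).
        prior_mean, prior_std = self.prior_net(h).chunk(2, dim=-1)
        prior = Normal(prior_mean, F.softplus(prior_std) + 1e-4)
        # Posterior over z_t from h_t and the encoded observation.
        obs_feat = self.encode(points)
        post_mean, post_std = self.post_net(
            torch.cat([h, obs_feat], dim=-1)).chunk(2, dim=-1)
        posterior = Normal(post_mean, F.softplus(post_std) + 1e-4)
        z = posterior.rsample()
        # Predicted reward and the KL term that keeps prior and posterior consistent.
        reward = self.reward_head(torch.cat([h, z], dim=-1))
        kl = kl_divergence(posterior, prior).sum(-1)
        return h, z, reward, kl
```

In this sketch, multi-step reward prediction over the horizon \(H\) would be obtained by rolling the step forward using samples from the prior (ignoring `points`), while the KL term ties those prior predictions to the posterior latents computed from observations.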
Experiments
We compare four models: (a) model-free RGB-D [RGB-D PPO], (b) model-free point cloud [Point Cloud PPO], (c) model-based RGB-D [DreamerV3], and (d) model-based point cloud [PCWM].