Figure 1: Illustration of our main idea. Point-BERT is designed for the pre-training of standard point cloud Transformers. By training a dVAE via point cloud reconstruction, we can convert a point cloud into a sequence of discrete point tokens. We then pre-train the Transformers with a Masked Point Modeling (MPM) task by predicting the masked tokens.
We present Point-BERT, a novel paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds. Following BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local patches, and a point cloud Tokenizer is devised via a discrete Variational AutoEncoder (dVAE) to generate discrete point tokens containing meaningful local information. Then, we randomly mask some patches of the input point cloud and feed them into the backbone Transformer. The pre-training objective is to recover the original point tokens at the masked locations under the supervision of the point tokens obtained by the Tokenizer. Extensive experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers. Equipped with our pre-training strategy, a pure Transformer architecture attains 93.8% accuracy on ModelNet40 and 83.1% accuracy on the hardest setting of ScanObjectNN, surpassing carefully designed point cloud models with far fewer hand-crafted designs and human priors. We also demonstrate that the representations learned by Point-BERT transfer well to new tasks and domains, where our models largely advance the state of the art on the few-shot point cloud classification task.
Figure 2: The pipeline of Point-BERT. We first partition the input point cloud into several point patches. A mini-PointNet is then used to obtain a sequence of point embeddings. Before pre-training, a Tokenizer is learned through dVAE-based point cloud reconstruction (right part of the figure), through which a point cloud can be converted into a sequence of discrete point tokens. During pre-training, we mask some portions of the point embeddings and replace them with a mask token. The masked point embeddings are then fed into the Transformer. The model is trained to recover the original point tokens, under the supervision of the point tokens obtained by the Tokenizer. We also add an auxiliary contrastive learning task to help the Transformer capture high-level semantic knowledge.
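The pre-training pipeline above can be sketched in a few lines. This is a minimal, illustrative NumPy sketch, not the authors' implementation: `mask_patches` and `mpm_loss` are hypothetical helpers, the random logits stand in for the Transformer's output head, and the patch/embedding/vocabulary sizes are assumed for the toy example.

```python
# Minimal sketch of the Masked Point Modeling (MPM) objective, assuming
# patch embeddings and discrete dVAE tokens are already computed.
# All names and shapes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(embeddings, mask_token, mask_ratio=0.4, rng=rng):
    """Randomly replace a fraction of patch embeddings with a shared mask token."""
    n = embeddings.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    masked = embeddings.copy()
    masked[idx] = mask_token
    return masked, idx

def mpm_loss(logits, tokens, masked_idx):
    """Cross-entropy between predicted token logits and the Tokenizer's
    discrete point tokens, evaluated only at the masked positions."""
    logits = logits[masked_idx]                         # (M, vocab)
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(masked_idx)), tokens[masked_idx]].mean()

# Toy example: 64 patches, 256-dim embeddings, a vocabulary of 8192 tokens.
emb = rng.standard_normal((64, 256))        # patch embeddings from a mini-PointNet
mask_token = np.zeros(256)                  # learnable in practice; zeros here
tokens = rng.integers(0, 8192, size=64)     # discrete tokens from the dVAE Tokenizer
masked_emb, idx = mask_patches(emb, mask_token)
logits = rng.standard_normal((64, 8192))    # stand-in for the Transformer's output
loss = mpm_loss(logits, tokens, idx)
```

In the actual model the masked embeddings would pass through the Transformer before the token-prediction head; here random logits merely make the loss computation concrete.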
We show that Point-BERT models can outperform other carefully designed point cloud models with far fewer human priors on the object classification task.
Figure 3: Masked point cloud reconstruction using our Point-BERT model trained on ShapeNet. The first two groups show reconstruction results for synthetic objects from the ShapeNet test set under block masking and random masking, respectively. Our model also generalizes well to unseen real scans from ScanObjectNN (last two groups).