Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling


Xumin Yu1 Lulu Tang1,3 Yongming Rao1 Tiejun Huang2,3 Jie Zhou1 Jiwen Lu1

1Tsinghua University 2Peking University 3BAAI

[Paper (arXiv)] [Code (GitHub)]


Figure 1: Illustration of our main idea. Point-BERT is designed for the pre-training of standard point cloud Transformers. By training a dVAE via point cloud reconstruction, we can convert a point cloud into a sequence of discrete point tokens. We then pre-train the Transformers with a Masked Point Modeling (MPM) task that predicts the masked tokens.

Abstract

We present Point-BERT, a new paradigm for learning point cloud Transformers that generalizes the concept of BERT to 3D point clouds. Following BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local patches, and a point cloud Tokenizer, devised via a discrete Variational AutoEncoder (dVAE), generates discrete point tokens containing meaningful local information. We then randomly mask some patches of the input point cloud and feed the result into the backbone Transformer. The pre-training objective is to recover the original point tokens at the masked locations under the supervision of the tokens produced by the Tokenizer. Extensive experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers. Equipped with our pre-training strategy, a pure Transformer architecture attains 93.8% accuracy on ModelNet40 and 83.1% accuracy on the hardest setting of ScanObjectNN, surpassing carefully designed point cloud models while relying on far fewer hand-crafted designs and human priors. We also demonstrate that the representations learned by Point-BERT transfer well to new tasks and domains, where our models largely advance the state of the art in few-shot point cloud classification.
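To make the patch partitioning concrete, the following is a minimal PyTorch sketch of the first step described above: farthest point sampling picks the patch centers, kNN gathers each local patch, and coordinates are normalized to the patch center so each patch encodes local geometry only. The function names and the defaults (64 centers, 32 points per patch) are illustrative assumptions, not the released implementation.

import torch

def farthest_point_sample(xyz, n_centers):
    # Greedy FPS: repeatedly pick the point farthest from all chosen centers.
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_centers, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.randint(N, (B,), device=xyz.device)  # random start point
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_centers):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                  # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))   # (B, N)
        farthest = dist.argmax(-1)
    return idx

def group_into_patches(xyz, n_centers=64, k=32):
    # Partition a cloud (B, N, 3) into local patches: FPS centers + kNN groups.
    B = xyz.shape[0]
    batch = torch.arange(B, device=xyz.device)
    centers = xyz[batch.unsqueeze(1), farthest_point_sample(xyz, n_centers)]  # (B, G, 3)
    knn_idx = torch.cdist(centers, xyz).topk(k, largest=False).indices        # (B, G, k)
    patches = xyz[batch[:, None, None], knn_idx]                              # (B, G, k, 3)
    return patches - centers.unsqueeze(2), centers  # center-normalized patches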


Pipeline

Figure 2: The pipeline of Point-BERT. We first partition the input point cloud into several point patches. A mini-PointNet is then used to obtain a sequence of point embeddings. Before pre-training, a Tokenizer is learned through dVAE-based point cloud reconstruction (right part of the figure), converting a point cloud into a sequence of discrete point tokens. During pre-training, we mask some portion of the point embeddings and replace them with a mask token. The masked point embeddings are then fed into the Transformers. The model is trained to recover the original point tokens under the supervision of the tokens produced by the Tokenizer. We also add an auxiliary contrastive learning task to help the Transformers capture high-level semantic knowledge.
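The pre-training step itself is compact. Below is a minimal PyTorch sketch of one MPM iteration under the pipeline above, assuming the (frozen) dVAE Tokenizer has already produced a ground-truth token id for each patch. MiniPointNet, transformer, head (a linear layer over the token vocabulary), mask_token, and the masking ratio are illustrative placeholders; the block-wise masking variant and the auxiliary contrastive objective from the paper are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniPointNet(nn.Module):
    # Shared point-wise MLP followed by max-pooling over each patch's points
    # (the patch embedding described above; layer widths are illustrative).
    def __init__(self, dim=384):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, dim))

    def forward(self, patches):                       # (B, G, k, 3)
        return self.mlp(patches).max(dim=2).values    # (B, G, dim)

def mpm_step(patches, gt_tokens, embed, mask_token, transformer, head, mask_ratio=0.4):
    # One Masked Point Modeling step.
    #   patches:    (B, G, k, 3) center-normalized point patches
    #   gt_tokens:  (B, G)       discrete token ids from the frozen dVAE Tokenizer
    #   mask_token: (1, 1, C)    learnable embedding that replaces masked patches
    x = embed(patches)                                           # (B, G, C)
    B, G, C = x.shape
    mask = torch.rand(B, G, device=x.device) < mask_ratio        # which patches to hide
    x = torch.where(mask.unsqueeze(-1), mask_token.expand(B, G, C), x)
    logits = head(transformer(x))                                # (B, G, vocab)
    # Cross-entropy on masked positions only: recover the Tokenizer's ids.
    return F.cross_entropy(logits[mask], gt_tokens[mask])

In practice, mask_token would be an nn.Parameter of shape (1, 1, C) learned jointly with the backbone, mirroring BERT's [MASK] embedding.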


Results

  • We show that Point-BERT models outperform other carefully designed point cloud models, with far fewer human priors, on the object classification task.

  • We show that the representations learned by Point-BERT transfer well to new tasks and domains.

  • We visualize the masked point cloud reconstructions produced by Point-BERT models.

Table 1: Comparison of Point-BERT with state-of-the-art models on ModelNet40. We report the classification accuracy (%) and the number of input points. [ST] and [T] denote standard Transformer models and Transformer-based models with special designs and more inductive biases, respectively.

Table 2: Classification results on the ScanObjectNN dataset. We report the accuracy (%) on three different settings.

Table 3: Few-shot classification results on ModelNet40. We report the average accuracy (%) as well as the standard deviation over 10 independent experiments.

Table 4: Part segmentation results on the ShapeNetPart dataset. We report the mean IoU across all part categories, mIoU_C (%), and the mean IoU across all instances, mIoU_I (%), as well as the IoU (%) for each category.

Figure 3: Masked point cloud reconstruction using our Point-BERT model trained on ShapeNet. We show reconstruction results for synthetic objects from the ShapeNet test set with block masking and random masking in the first two groups, respectively. Our model also generalizes well to unseen real scans from ScanObjectNN (the last two groups).


BibTeX

@article{yu2021pointbert,
  title={Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling},
  author={Yu, Xumin and Tang, Lulu and Rao, Yongming and Huang, Tiejun and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2111.14819},
  year={2021}
}